to_simple_rdd

elephas.utils.rdd_utils.to_simple_rdd(sc: pyspark.context.SparkContext, features: numpy.array, labels: numpy.array)

Convert numpy arrays of features and labels into an RDD of pairs.

:param sc: Spark context
:param features: numpy array with features
:param labels: numpy array with labels
:return: Spark RDD with feature-label pairs
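The pairing described above can be illustrated locally, without Spark. This is a minimal sketch that assumes one pair per sample, with features and labels matched by position; it is not taken from the library's source, and the array values are made up for illustration.

```python
import numpy as np

# Three samples with two features each, and one label per sample.
features = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])
labels = np.array([0, 1, 0])

# One (feature_row, label) tuple per sample. A collection like this is
# what SparkContext.parallelize could distribute as an RDD of pairs.
pairs = list(zip(features, labels))

for feature_row, label in pairs:
    print(feature_row, label)
```

With a live SparkContext `sc`, the equivalent call from the signature above would be `to_simple_rdd(sc, features, labels)`.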


to_labeled_point

elephas.utils.rdd_utils.to_labeled_point(sc: pyspark.context.SparkContext, features: numpy.array, labels: numpy.array, categorical: bool = False)

Convert numpy arrays of features and labels into a LabeledPoint RDD for MLlib and ML integration.

:param sc: Spark context
:param features: numpy array with features
:param labels: numpy array with labels
:param categorical: boolean, whether labels are already one-hot encoded or not
:return: LabeledPoint RDD with features and labels
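A LabeledPoint stores a single numeric label, so one-hot encoded labels presumably have to be collapsed to class indices before they can be stored. The sketch below shows one way to do that with `np.argmax`; it is an assumption about what `categorical=True` implies, not a copy of the library's implementation, and the sample values are made up.

```python
import numpy as np

# One-hot labels for three samples over three classes.
one_hot_labels = np.array([[1.0, 0.0, 0.0],
                           [0.0, 0.0, 1.0],
                           [0.0, 1.0, 0.0]])

# The position of the 1 in each row is the class index.
class_indices = np.argmax(one_hot_labels, axis=1)
print(class_indices)
```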


from_labeled_point

elephas.utils.rdd_utils.from_labeled_point(rdd: pyspark.rdd.RDD[pyspark.mllib.regression.LabeledPoint], categorical: bool = False, nb_classes: Optional[int] = None)

Convert a LabeledPoint RDD back to a pair of numpy arrays.

:param rdd: LabeledPoint RDD
:param categorical: boolean, whether labels should be one-hot encoded when returned
:param nb_classes: optional int, the number of class labels
:return: pair of numpy arrays, features and labels


encode_label

elephas.utils.rdd_utils.encode_label(label: numpy.array, nb_classes: int)

One-hot encoding of a single label.

:param label: class label (int, or float with no fractional part)
:param nb_classes: int, total number of classes
:return: one-hot encoded vector
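One-hot encoding of a single label can be sketched in plain numpy as follows. The helper name `one_hot` is hypothetical and stands in for the documented function only for illustration; the library's actual implementation may differ.

```python
import numpy as np


def one_hot(label, nb_classes):
    """Hypothetical stand-in: one-hot encode a single class label."""
    encoded = np.zeros(nb_classes)
    # Cast to int so a float label such as 2.0 can be used as an index.
    encoded[int(label)] = 1.0
    return encoded


vector = one_hot(2.0, 4)
print(vector)
```

The result has length `nb_classes`, with a single 1 at the index given by the label and zeros elsewhere.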


lp_to_simple_rdd

elephas.utils.rdd_utils.lp_to_simple_rdd(lp_rdd: pyspark.rdd.RDD[pyspark.mllib.regression.LabeledPoint], categorical: bool = False, nb_classes: int = None)

Convert a LabeledPoint RDD into an RDD of feature-label pairs.

:param lp_rdd: LabeledPoint RDD of features and labels
:param categorical: boolean, whether labels should be one-hot encoded when returned
:param nb_classes: int, total number of classes
:return: Spark RDD with feature-label pairs