to_simple_rdd
elephas.utils.rdd_utils.to_simple_rdd(sc: pyspark.context.SparkContext, features: numpy.array, labels: numpy.array)
Convert numpy arrays of features and labels into an RDD of pairs.
:param sc: Spark context
:param features: numpy array with features
:param labels: numpy array with labels
:return: Spark RDD with feature-label pairs
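As a rough illustration of the pairing described here, the sketch below builds the feature-label pairs in plain NumPy, without Spark. The helper name `make_pairs` is hypothetical and not part of elephas; it only shows the shape of the elements such an RDD would be expected to hold.

```python
import numpy as np

def make_pairs(features, labels):
    """Hypothetical helper: pair each feature row with its label."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    return [(x, y) for x, y in zip(features, labels)]

features = np.arange(6, dtype=float).reshape(3, 2)  # 3 samples, 2 features each
labels = np.array([0, 1, 0])

pairs = make_pairs(features, labels)
print(len(pairs))  # one pair per sample
print(pairs[1])    # (feature row, label) for the second sample
```

With a live `SparkContext`, a list like `pairs` is the kind of object that `sc.parallelize` accepts.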
to_labeled_point
elephas.utils.rdd_utils.to_labeled_point(sc: pyspark.context.SparkContext, features: numpy.array, labels: numpy.array, categorical: bool = False)
Convert numpy arrays of features and labels into a LabeledPoint RDD for MLlib and ML integration.
:param sc: Spark context
:param features: numpy array with features
:param labels: numpy array with labels
:param categorical: boolean, whether labels are already one-hot encoded
:return: LabeledPoint RDD with features and labels
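A `LabeledPoint` stores its label as a single float, so one-hot labels have to be reduced to a class index first. The sketch below shows one plausible reduction in plain NumPy. The helper `to_scalar_label` is hypothetical, and the use of `argmax` is an assumption about how `categorical=True` is handled, not a statement of the library's implementation.

```python
import numpy as np

def to_scalar_label(label, categorical=False):
    """Hypothetical helper: reduce a label to the single float a LabeledPoint holds."""
    if categorical:
        # Assumption: a one-hot vector is collapsed to the index of its largest entry.
        return float(np.argmax(label))
    return float(label)

one_hot = np.array([0.0, 0.0, 1.0])
scalar_from_one_hot = to_scalar_label(one_hot, categorical=True)
scalar_from_plain = to_scalar_label(2, categorical=False)
print(scalar_from_one_hot, scalar_from_plain)
```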
from_labeled_point
elephas.utils.rdd_utils.from_labeled_point(rdd: pyspark.rdd.RDD[pyspark.mllib.regression.LabeledPoint], categorical: bool = False, nb_classes: Optional[int] = None)
Convert a LabeledPoint RDD back to a pair of numpy arrays.
:param rdd: LabeledPoint RDD
:param categorical: boolean, whether labels should be one-hot encoded when returned
:param nb_classes: optional int, the number of class labels
:return: pair of numpy arrays, features and labels
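The effect of `categorical` and `nb_classes` on the returned labels can be sketched in plain NumPy. The helper `one_hot_labels` is hypothetical, and inferring the class count from the largest label when `nb_classes` is omitted is an assumption made for this sketch.

```python
import numpy as np

def one_hot_labels(labels, nb_classes=None):
    """Hypothetical helper: one-hot encode a 1-D array of integer class labels."""
    labels = np.asarray(labels, dtype="int32")
    if nb_classes is None:
        # Assumption: infer the class count from the largest label seen.
        nb_classes = int(labels.max()) + 1
    encoded = np.zeros((len(labels), nb_classes))
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

encoded = one_hot_labels([0, 2, 1])            # class count inferred as 3
padded = one_hot_labels([0, 1], nb_classes=4)  # class count given explicitly
print(encoded)
print(padded.shape)
```

Passing `nb_classes` explicitly avoids a too-narrow encoding when the data at hand happens not to contain the highest class.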
encode_label
elephas.utils.rdd_utils.encode_label(label: numpy.array, nb_classes: int)
One-hot encoding of a single label.
:param label: class label (int, or float with no fractional part)
:param nb_classes: int, total number of classes
:return: one-hot encoded vector
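A minimal NumPy sketch of one-hot encoding a single label follows. The name `encode_label_sketch` is hypothetical and chosen to avoid implying this is the library's own code.

```python
import numpy as np

def encode_label_sketch(label, nb_classes):
    """Hypothetical helper: one-hot encode one class label."""
    encoded = np.zeros(nb_classes)
    # int() accepts labels stored as floats such as 2.0.
    encoded[int(label)] = 1.0
    return encoded

vec = encode_label_sketch(2.0, 4)
print(vec)
```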
lp_to_simple_rdd
elephas.utils.rdd_utils.lp_to_simple_rdd(lp_rdd: pyspark.rdd.RDD[pyspark.mllib.regression.LabeledPoint], categorical: bool = False, nb_classes: int = None)
Convert a LabeledPoint RDD into an RDD of feature-label pairs.
:param lp_rdd: LabeledPoint RDD of features and labels
:param categorical: boolean, whether labels should be one-hot encoded when returned
:param nb_classes: int, total number of classes
:return: Spark RDD with feature-label pairs
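The per-element conversion can be sketched without Spark by using a small stand-in for `LabeledPoint`. Both `Point` and `to_pair` are hypothetical names introduced for illustration. Unlike the signature above, this sketch requires `nb_classes` whenever `categorical` is true.

```python
from collections import namedtuple

import numpy as np

# Hypothetical stand-in for pyspark.mllib.regression.LabeledPoint.
Point = namedtuple("Point", ["label", "features"])

def to_pair(point, categorical=False, nb_classes=None):
    """Hypothetical helper: turn one labeled point into a (features, label) pair."""
    features = np.asarray(point.features)
    if not categorical:
        return features, point.label
    if nb_classes is None:
        raise ValueError("nb_classes is required in this sketch when categorical=True")
    one_hot = np.zeros(nb_classes)
    one_hot[int(point.label)] = 1.0
    return features, one_hot

points = [Point(1.0, [0.5, 0.25]), Point(0.0, [1.0, 2.0])]
simple_pairs = [to_pair(p, categorical=True, nb_classes=2) for p in points]
print(simple_pairs[0])
```

On a real RDD, the same function would typically be applied with `lp_rdd.map(...)` rather than a list comprehension.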