Elephas: Distributed Deep Learning with Keras & Spark
Elephas is an extension of Keras, which allows you to run distributed deep learning models at scale with Spark. Elephas currently supports a number of applications, including:
- Data-parallel training of deep learning models
- Distributed inference and evaluation of deep learning models
- Distributed training of ensemble models (removed as of 3.0.0)
- Distributed hyper-parameter optimization (removed as of 3.0.0)
- Distributed training and inference with Hugging Face models (added as of 6.0.0)
Schematically, elephas works as follows.
Table of contents:
- Elephas: Distributed Deep Learning with Keras & Spark
- Introduction
- Getting started
- Basic Spark integration
- Distributed Inference and Evaluation
- Spark MLlib integration
- Spark ML integration
- Hadoop integration
- Distributed hyper-parameter optimization
- Distributed training of ensemble models
- Discussion
- Literature
Introduction
Elephas brings deep learning with Keras to Spark. Elephas intends to keep the simplicity and high usability of Keras, thereby allowing for fast prototyping of distributed models, which can be run on massive data sets. For an introductory example, see the following iPython notebook.
ἐλέφας is Greek for ivory and an accompanying project to κέρας, meaning horn. If this seems weird to mention, like a bad dream, you should confirm that it actually is at the Keras documentation. Elephas also means elephant, as in stuffed yellow elephant.
Elephas implements a class of data-parallel algorithms on top of Keras, using Spark's RDDs and data frames. Keras models are initialized on the driver, then serialized and shipped to workers, along with data and broadcast model parameters. Spark workers deserialize the model, train on their chunk of data and send their gradients back to the driver. The "master" model on the driver is updated by an optimizer, which takes gradients either synchronously or asynchronously.
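The update loop described above can be illustrated with a toy, single-process simulation. This sketch is not Elephas code: it assumes a one-parameter linear model, plain Python lists in place of RDD partitions, and a driver that averages worker gradients synchronously. The names worker_gradient and train_synchronously are hypothetical.

```python
# Toy simulation of synchronous data-parallel training (illustrative only,
# not Elephas internals). Model: y = w * x, loss: mean squared error.

def worker_gradient(w, chunk):
    """Gradient of the mean squared error on one worker's chunk of data."""
    return sum(2 * (w * x - y) * x for x, y in chunk) / len(chunk)


def train_synchronously(data, num_workers=2, learning_rate=0.05, rounds=200):
    # Split the data into one chunk per (simulated) worker.
    chunks = [data[i::num_workers] for i in range(num_workers)]
    w = 0.0  # "master" parameter held by the driver
    for _ in range(rounds):
        # Each worker receives the current parameter and returns a gradient.
        gradients = [worker_gradient(w, chunk) for chunk in chunks]
        # The driver waits for all workers, averages, and takes one step.
        w -= learning_rate * sum(gradients) / len(gradients)
    return w


# Data generated from y = 3 * x, so training should recover w close to 3.
data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0]]
w_fit = train_synchronously(data)
print(round(w_fit, 3))
```

In an asynchronous scheme the driver would instead apply each worker's update as soon as it arrives, without waiting for the others.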
Getting started
Install Elephas from PyPI with the command below. Spark will be installed through pyspark for you.
pip install elephas
That's it, you should now be able to run Elephas examples.
Basic Spark integration
After installing Elephas, you can train a model as follows. First, create a local pyspark context:
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName('Elephas_App').setMaster('local[8]')
sc = SparkContext(conf=conf)
Next, you define and compile a Keras model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation
from tensorflow.keras.optimizers import SGD
model = Sequential()
model.add(Dense(128, input_dim=784))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(10))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer=SGD())
and create an RDD from numpy arrays (or however you want to create an RDD)
from elephas.utils.rdd_utils import to_simple_rdd
rdd = to_simple_rdd(sc, x_train, y_train)
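Conceptually, such an RDD holds one (features, label) pair per training sample, spread over a number of partitions. The pure-Python sketch below mimics that layout without Spark. The helpers to_pairs and partition are hypothetical names, the data is made up, and the real to_simple_rdd and Spark's partitioning may differ in detail.

```python
def to_pairs(features, labels):
    """Pair each feature vector with its label, one pair per sample."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    return list(zip(features, labels))


def partition(pairs, num_partitions):
    """Split the pairs into contiguous, roughly equal-sized partitions."""
    size, remainder = divmod(len(pairs), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `remainder` partitions take one extra element.
        end = start + size + (1 if i < remainder else 0)
        partitions.append(pairs[start:end])
        start = end
    return partitions


# Made-up data: five samples with two features and one-hot labels.
features = [[0.0, 0.1], [0.2, 0.3], [0.4, 0.5], [0.6, 0.7], [0.8, 0.9]]
labels = [[1, 0], [0, 1], [1, 0], [0, 1], [1, 0]]
parts = partition(to_pairs(features, labels), num_partitions=2)
print([len(p) for p in parts])  # [3, 2]
```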
The basic model in Elephas is the SparkModel. You initialize a SparkModel by passing in a compiled Keras model, an update frequency and a parallelization mode. After that you can simply fit the model on your RDD. Elephas fit has the same options as a Keras model, so you can pass epochs, batch_size etc. as you're used to from tensorflow.keras.
from elephas.spark_model import SparkModel
spark_model = SparkModel(model, frequency='epoch', mode='asynchronous')
spark_model.fit(rdd, epochs=20, batch_size=32, verbose=0, validation_split=0.1)
Your script can now be run using spark-submit
spark-submit --driver-memory 1G ./your_script.py
Increasing the driver memory even further may be necessary, as the set of parameters in a network may be very large and collecting them on the driver eats up a lot of resources. See the examples folder for a few working examples.
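As a rough worked example of how parameters add up, the sketch below counts the weights of the three Dense layers in the model defined earlier. It assumes 32-bit floats and, as a simplification, that the driver holds one full parameter update per worker at once. The value of num_workers is hypothetical and only mirrors the 'local[8]' setting.

```python
# (inputs, units) for each Dense layer of the example model above
dense_layers = [(784, 128), (128, 128), (128, 10)]


def dense_param_count(inputs, units):
    """Weights plus one bias per unit."""
    return inputs * units + units


total_params = sum(dense_param_count(i, u) for i, u in dense_layers)
bytes_per_copy = total_params * 4  # assumption: 32-bit floats

num_workers = 8  # hypothetical, mirrors 'local[8]'
print(total_params)    # 118282
print(bytes_per_copy)  # 473128
print(num_workers * bytes_per_copy / 1e6, "MB if one update per worker is held at once")
```

This small network is far from exhausting 1G, but the count grows with the product of adjacent layer widths, so wide or deep models reach the limit much faster.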
Distributed Inference and Evaluation
The SparkModel can also be used for distributed inference (prediction) and evaluation. Similar to the fit method, the predict and evaluate methods conform to the Keras Model API.
from elephas.spark_model import SparkModel
# create/train the model, similar to the previous section (Basic Spark Integration)
model = ...
spark_model = SparkModel(model, ...)
spark_model.fit(...)
x_test, y_test = ... # load test data
predictions = spark_model.predict(x_test) # perform inference
evaluation = spark_model.evaluate(x_test, y_test) # perform evaluation/scoring
The paradigm is identical to the data parallelism in training, as the model is serialized and shipped to the workers and used to evaluate a chunk of the testing data. The predict method will take either a numpy array or an RDD.
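The chunk-wise evaluation described above can be sketched without Spark. In this illustration a plain loop stands in for the per-partition work done on the workers, and toy_model is a hypothetical stand-in for a deserialized Keras model. It is a sketch of the idea, not of Elephas internals.

```python
def predict_distributed(predict_fn, samples, num_partitions):
    """Apply predict_fn to each partition and concatenate results in order."""
    size = max(1, -(-len(samples) // num_partitions))  # ceiling division
    partitions = [samples[i:i + size] for i in range(0, len(samples), size)]
    results = []
    for part in partitions:
        # On a cluster, each partition would be scored on a different worker.
        results.extend(predict_fn(part))
    return results


def toy_model(batch):
    """Stand-in for a model: 'predicts' the sum of each sample's features."""
    return [sum(sample) for sample in batch]


x_test = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
distributed = predict_distributed(toy_model, x_test, num_partitions=2)
print(distributed)  # [3, 7, 11, 15, 19]
```

Because every sample is scored independently, the concatenated result matches what a single call on the full test set would return.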
Spark MLlib integration
Following up on the last example, to use Spark's MLlib library with Elephas, you create an RDD of LabeledPoints for supervised training as follows
from elephas.utils.rdd_utils import to_labeled_point
lp_rdd = to_labeled_point(sc, x_train, y_train, categorical=True)
Training a given LabeledPoint-RDD is very similar to what we've seen already
from elephas.spark_model import SparkMLlibModel
spark_model = SparkMLlibModel(model, frequency='batch', mode='hogwild')
spark_model.train(lp_rdd, epochs=20, batch_size=32, verbose=0, validation_split=0.1,
                  categorical=True, nb_classes=nb_classes)
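A LabeledPoint stores a single numeric label, while the Keras model above is trained on one-hot vectors. The sketch below shows the round trip that categorical=True and nb_classes plausibly correspond to. It assumes the conversion is an argmax in one direction and a one-hot encoding in the other; the helper names are hypothetical and the Elephas implementation may differ.

```python
def one_hot_to_index(one_hot):
    """Return the position of the largest entry (argmax) of a one-hot vector."""
    return max(range(len(one_hot)), key=lambda i: one_hot[i])


def index_to_one_hot(index, nb_classes):
    """Inverse direction: rebuild the one-hot vector needed for training."""
    return [1.0 if i == index else 0.0 for i in range(nb_classes)]


y_train = [[0, 0, 1], [1, 0, 0], [0, 1, 0]]
class_labels = [float(one_hot_to_index(y)) for y in y_train]
print(class_labels)  # [2.0, 0.0, 1.0]

restored = [index_to_one_hot(int(label), nb_classes=3) for label in class_labels]
print(restored == [[float(v) for v in y] for y in y_train])  # True
```

Under this assumption, the number of classes has to be passed at training time because it cannot be recovered reliably from a partition that happens to miss the highest class.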
Spark ML integration
To train a model with a SparkML estimator on a data frame, use the following syntax.
from elephas.ml.adapter import to_data_frame
from elephas.ml_model import ElephasEstimator
df = to_data_frame(sc, x_train, y_train, categorical=True)
test_df = to_data_frame(sc, x_test, y_test, categorical=True)
estimator = ElephasEstimator(model, epochs=epochs, batch_size=batch_size, frequency='batch', mode='asynchronous',
                             categorical=True, nb_classes=nb_classes)
fitted_model = estimator.fit(df)
Fitting an estimator results in a SparkML transformer, which we can use for predictions and other evaluations by calling the transform method on it.
prediction = fitted_model.transform(test_df)
pnl = prediction.select("label", "prediction")
pnl.show(100)
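import numpy as np
from pyspark.mllib.evaluation import MulticlassMetrics
# MulticlassMetrics expects (prediction, label) pairs, in that order
prediction_and_label = pnl.rdd.map(lambda row: (float(np.argmax(row.prediction)), float(row.label)))
metrics = MulticlassMetrics(prediction_and_label)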
print(metrics.weightedPrecision)
print(metrics.weightedRecall)
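To make the two printed numbers concrete, the following pure-Python sketch computes weighted precision and recall from (prediction, label) pairs: the per-class values are averaged, weighted by how often each class occurs among the true labels. The function name and the sample pairs are made up for illustration, and the function is a stand-in for Spark's metric, not a copy of its implementation.

```python
from collections import Counter


def weighted_precision_recall(pairs):
    """pairs: iterable of (prediction, label). Returns (precision, recall),
    each averaged over classes and weighted by the true-label frequency."""
    pairs = list(pairs)
    true_pos = Counter(pred for pred, lab in pairs if pred == lab)
    predicted = Counter(pred for pred, _ in pairs)
    actual = Counter(lab for _, lab in pairs)
    total = len(pairs)
    precision = recall = 0.0
    for label, count in actual.items():
        weight = count / total
        if predicted[label]:  # skip classes that were never predicted
            precision += weight * true_pos[label] / predicted[label]
        recall += weight * true_pos[label] / count
    return precision, recall


# Three samples of class 0 (one misclassified) and one sample of class 1.
pairs = [(0.0, 0.0), (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
precision, recall = weighted_precision_recall(pairs)
print(precision, recall)  # 0.875 0.75
```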
If the model uses a custom activation function, layer, or loss function, it needs to be supplied using the set_custom_objects method:
from tensorflow.keras.layers import Layer

def custom_activation(x):
    ...

class CustomLayer(Layer):
    ...

model = Sequential()
model.add(CustomLayer(...))
estimator = ElephasEstimator(model, epochs=epochs, batch_size=batch_size)
estimator.set_custom_objects({'custom_activation': custom_activation, 'CustomLayer': CustomLayer})
Hadoop integration
In addition to saving locally, models may be saved directly into a network-accessible Hadoop cluster.
spark_model.save('/absolute/file/path/model.h5', to_hadoop=True)
Models saved on a network-accessible Hadoop cluster may be loaded as follows.
from elephas.spark_model import load_spark_model
spark_model = load_spark_model('/absolute/file/path/model.h5', from_hadoop=True)
Distributed hyper-parameter optimization
UPDATE: As of 3.0.0, Hyper-parameter optimization features have been removed, since Hyperas is no longer active and was causing versioning compatibility issues. To use these features, install version 2.1 or below.
Hyper-parameter optimization with elephas is based on hyperas, a convenience wrapper for hyperopt and keras. Each Spark worker executes a number of trials, the results are collected, and the best model is returned. As the distributed mode in hyperopt (using MongoDB) is somewhat difficult to configure and error-prone at the time of writing, we chose to implement parallelization ourselves. Right now, the only available optimization algorithm is random search.
The first part of this example is more or less directly taken from the hyperas documentation. We define data and model as functions; hyper-parameter ranges are defined through double braces. See the hyperas documentation for more on how this works.
from hyperopt import STATUS_OK
from hyperas.distributions import choice, uniform

def data():
    from tensorflow.keras.datasets import mnist
    from tensorflow.keras.utils import to_categorical
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    x_train = x_train.reshape(60000, 784)
    x_test = x_test.reshape(10000, 784)
    x_train = x_train.astype('float32')
    x_test = x_test.astype('float32')
    x_train /= 255
    x_test /= 255
    nb_classes = 10
    y_train = to_categorical(y_train, nb_classes)
    y_test = to_categorical(y_test, nb_classes)
    return x_train, y_train, x_test, y_test

def model(x_train, y_train, x_test, y_test):
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense, Dropout, Activation
    from tensorflow.keras.optimizers import RMSprop
    model = Sequential()
    model.add(Dense(512, input_shape=(784,)))
    model.add(Activation('relu'))
    model.add(Dropout({{uniform(0, 1)}}))
    model.add(Dense({{choice([256, 512, 1024])}}))
    model.add(Activation('relu'))
    model.add(Dropout({{uniform(0, 1)}}))
    model.add(Dense(10))
    model.add(Activation('softmax'))
    rms = RMSprop()
    # Request accuracy as a metric so that evaluate() returns (loss, accuracy)
    model.compile(loss='categorical_crossentropy', optimizer=rms, metrics=['accuracy'])
    model.fit(x_train, y_train,
              batch_size={{choice([64, 128])}},
              epochs=1,
              verbose=2,
              validation_data=(x_test, y_test))
    score, acc = model.evaluate(x_test, y_test, verbose=0)
    print('Test accuracy:', acc)
    return {'loss': -acc, 'status': STATUS_OK, 'model': model.to_json()}
Once the basic setup is defined, running the minimization is done in just a few lines of code:
from elephas.hyperparam import HyperParamModel
from pyspark import SparkContext, SparkConf
# Create Spark context
conf = SparkConf().setAppName('Elephas_Hyperparameter_Optimization').setMaster('local[8]')
sc = SparkContext(conf=conf)
# Define hyper-parameter model and run optimization
hyperparam_model = HyperParamModel(sc)
hyperparam_model.minimize(model=model, data=data, max_evals=5)
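The scheme described in this section (every worker runs its own random trials, the results are collected, and the best one wins) can be sketched in plain Python. Nothing below uses hyperas or Elephas. The search space, the objective and all function names are hypothetical stand-ins, and the workers are simulated by a loop.

```python
import random


def run_trials(seed, num_trials, objective, sample_params):
    """One worker: evaluate num_trials randomly sampled configurations."""
    rng = random.Random(seed)  # separate seed per worker, so trials differ
    results = []
    for _ in range(num_trials):
        params = sample_params(rng)
        results.append((objective(params), params))
    return results


def sample_params(rng):
    # Hypothetical search space, loosely mirroring the ranges used above.
    return {"dropout": rng.uniform(0, 1), "units": rng.choice([256, 512, 1024])}


def objective(params):
    # Stand-in for "train the model and return its loss"; lower is better.
    return (params["dropout"] - 0.3) ** 2 + abs(params["units"] - 512) / 1024


# Simulate 4 workers with 5 trials each, then collect and pick the best.
all_results = []
for worker_id in range(4):
    all_results.extend(run_trials(worker_id, 5, objective, sample_params))
best_loss, best_params = min(all_results, key=lambda result: result[0])
print(best_loss, best_params)
```

Since random search trials are independent of each other, no coordination between workers is needed until the final collection step.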
Distributed training of ensemble models
UPDATE: As of 3.0.0, distributed training of ensemble models has been removed together with the hyper-parameter optimization features it builds on, since Hyperas is no longer active and was causing versioning compatibility issues. To use these features, install version 2.1 or below.
Building on the last section, it is possible to train ensemble models with elephas by running hyper-parameter optimization on large search spaces and defining a voting classifier on the top-n performing models. With data and model defined as above, this is as simple as running
result = hyperparam_model.best_ensemble(nb_ensemble_models=10, model=model, data=data, max_evals=5)
In this example an ensemble of 10 models is built, based on optimization of at most 5 runs on each of the Spark workers.
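The idea of a voting classifier over the top-n models can be illustrated as follows. This is a minimal sketch that assumes a simple majority vote over predicted classes; how the removed best_ensemble feature combined its models is not specified here. The "models" are hypothetical stand-in functions, not Keras models.

```python
from collections import Counter


def top_n(scored_models, n):
    """Keep the n models with the lowest loss. scored_models: (loss, model) pairs."""
    ranked = sorted(scored_models, key=lambda pair: pair[0])
    return [model for _, model in ranked[:n]]


def vote(models, sample):
    """Majority vote over the class predicted by each model."""
    votes = Counter(model(sample) for model in models)
    return votes.most_common(1)[0][0]


# Stand-in "models": functions mapping a number to a class label (0 or 1).
scored_models = [
    (0.10, lambda x: int(x > 0.5)),
    (0.20, lambda x: int(x > 0.4)),
    (0.30, lambda x: int(x > 0.7)),
    (0.90, lambda x: 0),  # worst model, dropped by top_n
]
ensemble = top_n(scored_models, n=3)
predictions = [vote(ensemble, x) for x in [0.3, 0.45, 0.6, 0.8]]
print(predictions)  # [0, 0, 1, 1]
```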
Hugging Face Models Training and Inference
As of 6.0.0, Elephas supports distributed training (and inference) with Hugging Face models (using the TensorFlow/Keras backend), currently for text classification and causal language modeling only, and only in the "synchronous" training mode. In future releases, we hope to expand this to other types of models and to the "asynchronous" and "hogwild" training modes. This can be accomplished using the SparkHFModel:
from elephas.spark_model import SparkHFModel
from elephas.utils.rdd_utils import to_simple_rdd
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
from tensorflow.keras.optimizers import SGD
import numpy as np
batch_size = ...
epochs = ...
num_workers = ...
newsgroups = fetch_20newsgroups(subset='train')
x = newsgroups.data
y = newsgroups.target
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)
x_train, x_test, y_train, y_test = train_test_split(x, y_encoded, test_size=0.2)
model_name = 'albert-base-v2'
# Note: the expectation is that text data is being supplied - tokenization is handled during training
# spark_context is assumed to be an existing SparkContext, such as sc from the sections above
rdd = to_simple_rdd(spark_context, x_train, y_train)
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, num_labels=len(np.unique(y_encoded)))
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer_kwargs = {'padding': True, 'truncation': True, ...}
model.compile(optimizer=SGD(), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
spark_model = SparkHFModel(model, num_workers=num_workers, mode="synchronous", tokenizer=tokenizer, tokenizer_kwargs=tokenizer_kwargs, loader=TFAutoModelForSequenceClassification)
spark_model.fit(rdd, epochs=epochs, batch_size=batch_size)
predictions = spark_model.predict(spark_context.parallelize(x_test))
More examples can be seen in the examples directory, namely "hf_causal_modeling.py" and "hf_text_classification.py".
The computational model is the same as for Keras models, except the model is serialized and deserialized differently due to differences in the HuggingFace API.
To use this capability, just install this package with the huggingface extra:
pip install elephas[huggingface]
Discussion
Premature parallelization may not be the root of all evil, but it is not always the best idea either. Keep in mind that more workers mean less data per worker, and parallelizing a model is no substitute for actual learning. So, if your data fits comfortably into memory and you're happy with the training speed of the model, consider just using Keras.
One exception to this rule may be that you're already working within the Spark ecosystem and want to leverage what's there. The above SparkML example shows how to use evaluation modules from Spark, and maybe you wish to further process the outcome of an Elephas model down the road. In this case, we recommend using Elephas as a simple wrapper by setting num_workers=1.
Note that right now Elephas restricts itself to data-parallel algorithms for two reasons. First, Spark makes it very easy to distribute data. Second, neither Spark nor Theano makes it particularly easy to split the actual model into parts, which makes model parallelism practically impossible to realize.
Having said all that, we hope you come to appreciate Elephas as a playground for data-parallel deep-learning algorithms that is easy to set up and use.
Literature
[1] J. Dean, G.S. Corrado, R. Monga, K. Chen, M. Devin, Q.V. Le, M.Z. Mao, M'A. Ranzato, A. Senior, P. Tucker, K. Yang, and A.Y. Ng. Large Scale Distributed Deep Networks.
[2] F. Niu, B. Recht, C. Re, and S.J. Wright. HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent.
[3] C. Noel and S. Osindero. Dogwild! - Distributed Hogwild for CPU & GPU.
Maintainers / Contributions
This great project was started by Max Pumperla, and is currently maintained by Daniel Cahall (https://github.com/danielenricocahall). If you have any questions, please feel free to open up an issue or send an email to danielenricocahall@gmail.com. If you want to contribute, feel free to submit a PR, or start a conversation about how we can go about implementing something.