Working with Machine Learning Models

omega|ml currently supports the following machine learning frameworks out of the box. More backends are planned, and any backend can be implemented using the backend API.

  • scikit-learn

  • Keras

  • Tensorflow (tf.keras, tf.estimator, tf.data, tf.SavedModel)

  • Apache Spark MLLib

Note that support for Keras, Tensorflow and Apache Spark is experimental at this time.

Storing models

Storing models and pipelines is as straightforward as storing Pandas DataFrames and Series. Create the model, then use om.models.put() to store it:

import pandas as pd
import omegaml as om
from sklearn.linear_model import LinearRegression

# train a linear regression model
df = pd.DataFrame(dict(x=range(10), y=range(20, 30)))
clf = LinearRegression()
clf.fit(df[['x']], df[['y']])
# store the trained model
om.models.put(clf, 'lrmodel')

Models can also be stored untrained:

df = pd.DataFrame(dict(x=range(10), y=range(20,30)))
clf = LinearRegression()
# store the untrained model
om.models.put(clf, 'lrmodel')

Using models to predict

Retrieving a model is equally straightforward:

clf = om.models.get('lrmodel')
clf
=>
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Once retrieved, the model can be used like any model kept in memory, e.g. to predict using new data:

clf = om.models.get('lrmodel')
df = pd.DataFrame(dict(x=range(70,80)))
clf.predict(df[['x']])
=>
array([[ 90.],
   [ 91.],
   [ 92.],
   [ 93.],
   [ 94.],
   [ 95.],
   [ 96.],
   [ 97.],
   [ 98.],
   [ 99.]])

Using the compute cluster

Prediction

omega|ml provides a state-of-the-art compute cluster, called the runtime. Using the runtime, you can delegate model tasks to the cluster:

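# data to predict on; x = 0..9 yields y = 20..29 with the model trained above
df = pd.DataFrame(dict(x=range(10)))
model = om.runtime.model('lrmodel')
result = model.predict(df[['x']])
result.get()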
=>
array([[ 20.],
   [ 21.],
   [ 22.],
   [ 23.],
   [ 24.],
   [ 25.],
   [ 26.],
   [ 27.],
   [ 28.],
   [ 29.]])

Note that the result is a deferred object that we resolve using get.
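The deferred-result pattern can be illustrated with Python's standard library alone. The following sketch is an analogy, not omega|ml's implementation: it assumes a local thread pool in place of the runtime cluster, and predict_remote is a hypothetical stand-in for a cluster-side prediction that mimics the example model (y = x + 20).

```python
from concurrent.futures import ThreadPoolExecutor


def predict_remote(xs):
    # hypothetical stand-in for a prediction executed on the cluster;
    # mimics the example model, which learned y = x + 20
    return [float(x + 20) for x in xs]


with ThreadPoolExecutor(max_workers=1) as pool:
    # submit() returns immediately with a deferred object (a Future)
    deferred = pool.submit(predict_remote, range(10))
    # result() blocks until the value is available, much like result.get()
    values = deferred.result()

print(values[:3])
```

The point of the pattern is that submitting the task and collecting its result are separate steps, so the caller can do other work in between.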

Instead of passing data, you may also pass the name of a DataFrame stored in omega|ml:

# create a dataframe and store it
df = pd.DataFrame(dict(x=range(70,80)))
om.datasets.put(df, 'testlrmodel')
# use it to predict
result = om.runtime.model('lrmodel').predict('testlrmodel')
result.get()

Model Fitting

To train a model using the runtime, use the fit method on the runtime’s model, as you would on a local model:

# create a dataframe and store it
df = pd.DataFrame(dict(x=range(10), y=range(20,30)))
om.datasets.put(df, 'testlrmodel')
# use it to fit the model
result = om.runtime.model('lrmodel').fit('testlrmodel[x]', 'testlrmodel[y]')
result.get()
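The dataset names above appear to carry a column selector in square brackets. As a rough sketch of the idea, such a specifier could be split into a dataset name and a list of columns as follows. The helper parse_dataset_spec is hypothetical and written for illustration only; how omega|ml actually resolves these names may differ.

```python
import re


def parse_dataset_spec(spec):
    """Split 'name[col1,col2]' into ('name', ['col1', 'col2']).

    Hypothetical helper for illustration; a spec without brackets
    selects all columns, signalled here by None.
    """
    match = re.fullmatch(r"([^\[\]]+)(?:\[([^\[\]]*)\])?", spec)
    if match is None:
        raise ValueError(f"invalid dataset spec: {spec!r}")
    name, cols = match.groups()
    if cols is None:
        return name, None
    return name, [c.strip() for c in cols.split(",") if c.strip()]


print(parse_dataset_spec("testlrmodel[x]"))
print(parse_dataset_spec("testlrmodel"))
```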

GridSearch

Currently supported for scikit-learn.

To use cross-validated grid search on a model, use the gridsearch method on the runtime’s model. This creates, fits and stores a GridSearchCV instance and automatically links it to the model. Use the GridSearchCV model to evaluate the performance of multiple parameter settings.

Note

Instead of using this default implementation of GridSearchCV, you may create your own GridSearchCV instance locally and then fit it using the runtime. In this case, be sure to link the model used for grid searching and the original model by changing the attributes on the model’s metadata.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification()
logreg = LogisticRegression()
om.models.put(logreg, 'logreg')
params = {
    'C': [0.1, 0.5, 1.0]
}
# run the grid search on the runtime
om.runtime.model('logreg').gridsearch(X, y, parameters=params)
meta = om.models.metadata('logreg')
# check the grid search was saved
assert 'gridsearch' in meta.attributes
assert len(meta.attributes['gridsearch']) == 1
assert 'gsModel' in meta.attributes['gridsearch'][0]
# check we can get back the grid search model
gs_model = om.models.get(meta.attributes['gridsearch'][0]['gsModel'])
assert isinstance(gs_model, GridSearchCV)
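Judging from the checks above, the link between a model and its grid search is kept in the model's metadata attributes as a list of entries, each naming the stored GridSearchCV model under the gsModel key. The sketch below mimics that structure with plain dictionaries. The helper link_gridsearch and the name 'logreg-gs' are hypothetical, and persisting the change to real omega|ml metadata is not shown.

```python
def link_gridsearch(attributes, gs_model_name):
    """Record a grid search model name in a metadata attributes dict.

    Hypothetical helper: mirrors the structure
    attributes['gridsearch'] = [{'gsModel': <name>}, ...]
    """
    entries = attributes.setdefault("gridsearch", [])
    entries.append({"gsModel": gs_model_name})
    return attributes


attributes = {}  # stands in for meta.attributes of the original model
link_gridsearch(attributes, "logreg-gs")
print(attributes)
```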

Other Model tasks

The runtime provides more than just model training and prediction. The runtime implements a common API to all supported backends that follows the scikit-learn estimator model. That is, the runtime supports the following methods on a model:

  • fit

  • partial_fit

  • transform

  • score

  • gridsearch

For details refer to the API reference.

Specific frameworks

Keras

The Keras backend implements the .fit() method with the following Keras-specific extensions:

  • validation_data= can refer to a tuple of (testX, testY) dataset names instead of actual data values, similar to X, Y. This will load the validation dataset before model.fit().

  • Metadata.attributes.history stores the history.history object of the History returned by Keras’s model.fit() method. This is a dictionary of all metrics with one entry per epoch.
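Since the stored history is a dictionary that maps each metric name to a list with one value per epoch, it can be inspected with plain Python. The sketch below uses made-up values for three epochs; the metric names and numbers are assumptions for illustration, not output from a real training run.

```python
# hypothetical history for three epochs, shaped like Keras's history.history
history = {
    "loss": [0.92, 0.61, 0.48],
    "accuracy": [0.55, 0.71, 0.80],
    "val_loss": [0.95, 0.70, 0.74],
}

# epoch (1-based) with the lowest validation loss
best_epoch = min(range(len(history["val_loss"])),
                 key=history["val_loss"].__getitem__) + 1

# last recorded value of every metric
final_metrics = {name: values[-1] for name, values in history.items()}

print(best_epoch, final_metrics)
```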

Tensorflow

Tensorflow provides several types of models:

  • Native tensorflow models

  • Tensorflow Keras models

  • Estimator models

  • SavedModel

omega|ml supports all model variants as trained SavedModels. Keras models and Estimator models can also be serialized to and trained by the cluster as Python instances. The runtime can execute arbitrary functions that generate a model, train and save it as a SavedModel for subsequent consumption e.g. via the model REST API.

Concepts

Keras models

Consider the following Tensorflow model (source Modelnet). This is a standard TF Keras model that uses MobileNetV2 for image detection and trains a new output layer.

(...)
mobile_net = tf.keras.applications.MobileNetV2(input_shape=(192, 192, 3), include_top=False)
mobile_net.trainable=False
model = tf.keras.Sequential([
  mobile_net,
  tf.keras.layers.GlobalAveragePooling2D(),
  tf.keras.layers.Dense(len(label_names))])
model.compile(optimizer='adam',
              loss=tf.keras.losses.sparse_categorical_crossentropy,
              metrics=["accuracy"])
model.summary()
model.fit(ds, epochs=1, steps_per_epoch=3)

Store the model to omega|ml as follows:

om.models.put(model, 'tfkeras-flower')

Load and use the model for prediction as follows. This runs the prediction on the local computer and does not use omega|ml’s runtime cluster.

model_ = om.models.get('tfkeras-flower')
img = plt.imread('/path/to/image')
result = model_.predict(np.array([img]))

Using the runtime cluster is equally straightforward:

img = plt.imread('/path/to/image')
result = om.runtime.model('tfkeras-flower').predict(np.array([img]))

The REST API similarly provides prediction:

resp = requests.put(predict_url, json={
    'columns': ['x'],
    'data': [{'x': img.flatten().tolist()}],
    'shape': [192, 192, 3],
})
data = resp.json()
prediction = data['result']
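The request body above sends the image as a flat list of values together with its original shape. The following is a minimal sketch of building such a payload without NumPy, using a tiny 2 x 2 x 3 nested list as a stand-in for a 192 x 192 x 3 image. The helpers flatten and build_payload are hypothetical; only the field names are taken from the request above.

```python
import json
from math import prod


def flatten(nested):
    """Recursively flatten nested lists into one flat list."""
    if not isinstance(nested, list):
        return [nested]
    flat = []
    for item in nested:
        flat.extend(flatten(item))
    return flat


def build_payload(image, shape):
    # hypothetical helper; keys follow the request shown above
    values = flatten(image)
    if len(values) != prod(shape):
        raise ValueError("shape does not match number of values")
    return {"columns": ["x"], "data": [{"x": values}], "shape": list(shape)}


# stand-in image: 2 x 2 pixels with 3 channels each
image = [[[0, 1, 2], [3, 4, 5]], [[6, 7, 8], [9, 10, 11]]]
payload = build_payload(image, (2, 2, 3))
body = json.dumps(payload)
print(len(payload["data"][0]["x"]))
```

Checking the value count against the shape before sending catches mismatches on the client rather than in the prediction call.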

tf.data.Dataset

Estimator models support tf.data.Dataset by means of virtual datasets. Virtual datasets are Python functions stored by om.datasets. On accessing a virtual dataset, the function is executed and the result is returned. Thus for Estimator models, a virtual dataset should be used to return a tf.data.Dataset.
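The idea of a virtual dataset, a stored function that is executed on access, can be sketched in a few lines. The class VirtualDatasets below is a toy stand-in written for illustration and is not omega|ml's om.datasets API; the function squares is likewise hypothetical.

```python
class VirtualDatasets:
    """Toy store: plain values are returned as-is, functions are called."""

    def __init__(self):
        self._store = {}

    def put(self, obj, name):
        self._store[name] = obj

    def get(self, name):
        obj = self._store[name]
        # a stored function acts as a virtual dataset: run it on access
        return obj() if callable(obj) else obj


def squares():
    # hypothetical dataset function; for an Estimator model this
    # would build and return a tf.data.Dataset instead
    return [x * x for x in range(5)]


datasets = VirtualDatasets()
datasets.put(squares, "squares")
print(datasets.get("squares"))
```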

om.datasets supports storing tf.train.Example records, from which a tf.data.Dataset can easily be constructed.