Developer API

omega|ml

class omegaml.Omega(defaults=None, mongo_url=None, celeryconf=None, bucket=None, **kwargs)

Client API to omegaml

Provides the following APIs:

  • datasets - access to datasets stored in the cluster

  • models - access to models stored in the cluster

  • runtimes - access to the cluster compute resources

  • jobs - access to jobs stored and executed in the cluster

  • scripts - access to lambda modules stored and executed in the cluster

omegaml.store

Native storage for OmegaML using mongodb as the storage layer

An OmegaStore instance is a MongoDB database. It has at least the metadata collection which lists all objects stored in it. A metadata document refers to the following types of objects (metadata.kind):

  • pandas.dfrows - a Pandas DataFrame stored as a collection of rows

  • sklearn.joblib - a scikit-learn estimator/pipeline dumped using joblib.dump()

  • python.data - an arbitrary python dict, tuple, list stored as a document

Note that storing Pandas and scikit-learn objects requires the availability of the respective packages. If either cannot be imported, the OmegaStore degrades to a python.data store only. It will still .list() and .get() any object, however it reverts to pure python objects. In this case it is up to the client to convert the data into an appropriate format for processing.
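The degradation described above follows the common guarded-import pattern, sketched here for illustration (this is not omegaml's actual code; the helper name is hypothetical):

```python
# Guarded import: if pandas is missing, fall back to python.data handling only.
try:
    import pandas as pd
except ImportError:
    pd = None

def can_store_dataframes():
    # only offer pandas.dfrows storage when pandas is importable
    return pd is not None
```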

Pandas and scikit-learn objects can only be stored if these packages are available; put() raises a TypeError if you pass such objects and these modules cannot be loaded.

All data are stored within the same mongodb, in per-object collections as follows:

  • .metadata

    all metadata; each object is one document. See omegaml.documents.Metadata for details

  • .<bucket>.files

    this is the GridFS instance used to store blobs (models, numpy, hdf). The actual file name will be <prefix>/<name>.<ext>, where ext is optionally generated by put() / get().

  • .<bucket>.<prefix>.<name>.data

    every other dataset is stored in a separate collection (dataframes, dicts, lists, tuples). Any forward slash in prefix is ignored (e.g. ‘data/’ becomes ‘data’)
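The naming scheme above can be illustrated with a small helper (hypothetical, not part of the omegaml API):

```python
def data_collection_name(bucket, prefix, name):
    # builds the documented collection name .<bucket>.<prefix>.<name>.data,
    # dropping any forward slash in prefix (e.g. 'data/' becomes 'data')
    prefix = prefix.replace('/', '')
    return '.{bucket}.{prefix}.{name}.data'.format(
        bucket=bucket, prefix=prefix, name=name)

data_collection_name('omegaml', 'data/', 'sales')
# '.omegaml.data.sales.data'
```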

DataFrames by default are stored in their own collection; every row becomes a document. To store dataframes as a binary file, use put(..., as_hdf=True). .get() will always return a dataframe.

Python dicts, lists, tuples are stored as a single document with a .data attribute holding the JSON-converted representation. .get() will always return the corresponding python object of .data.
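As an illustration of that layout (plain Python, not omegaml code), a stored dict amounts to a single document with a JSON payload under .data. Note that a JSON round trip turns tuples into lists:

```python
import json

# a dict as it would be stored: one document, JSON payload under 'data'
document = {"data": json.dumps({"points": (1, 2, 3), "label": "a"})}

# what .get() conceptually does: convert .data back to a python object
restored = json.loads(document["data"])
# restored == {"points": [1, 2, 3], "label": "a"} - the tuple became a list
```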

Models are joblib.dump()'ed and zipped prior to transferring into GridFS. .get() will always unzip and joblib.load() before returning the model. Note this requires that the process using .get() supports joblib as well as all python classes referred to. If joblib is not supported, .get() returns a file-like object.
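The zip-then-store round trip can be sketched with stdlib stand-ins; here pickle and gzip substitute for joblib and the actual compression used (an assumption, for illustration only):

```python
import gzip
import pickle

model = {"coef": [0.5, 1.5]}  # stand-in for a fitted estimator

# what put() conceptually does: serialize, compress, then hand the blob to GridFS
blob = gzip.compress(pickle.dumps(model))

# what get() conceptually does: fetch from GridFS, decompress, deserialize
restored = pickle.loads(gzip.decompress(blob))
```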

The .metadata entry specifies the format used to store each object as well as its location:

  • metadata.kind

    the type of object

  • metadata.name

    the name of the object, as given on put()

  • metadata.gridfile

    the gridfs object (if any, null otherwise)

  • metadata.collection

    the name of the collection

  • metadata.attributes

    arbitrary custom attributes set in put(attributes=obj). This is used e.g. by OmegaRuntime’s fit() method to record the data used in the model’s training.
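Put together, a metadata document looks roughly like this (the field values are hypothetical):

```python
# Hypothetical metadata document for a dataframe stored as pandas.dfrows
meta = {
    "kind": "pandas.dfrows",                   # the type of object
    "name": "sales",                           # the name given on put()
    "gridfile": None,                          # gridfs object, null for collections
    "collection": ".omegaml.data.sales.data",  # the backing collection
    "attributes": {"source": "erp-export"},    # custom put(attributes=...) data
}
```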

.put() and .get() use helper methods specific to the object's type and metadata.kind, respectively. In the future a plugin system will enable extension to other types.

class omegaml.store.base.OmegaStore(mongo_url=None, bucket=None, prefix=None, kind=None, defaults=None, dbalias=None)

The storage backend for models and data

omegaml.backends

class omegaml.backends.basedata.BaseDataBackend(model_store=None, data_store=None, tracking=None, **kwargs)

OmegaML BaseDataBackend to be subclassed by other arbitrary backends

This provides the abstract interface for any data backend to be implemented

class omegaml.backends.basemodel.BaseModelBackend(model_store=None, data_store=None, tracking=None, **kwargs)

OmegaML BaseModelBackend to be subclassed by other arbitrary backends

This provides the abstract interface for any model backend to be implemented Subclass to implement custom backends.

Essentially a model backend:

  • provides methods to serialize and deserialize a machine learning model for a given ML framework

  • offers fit() and predict() methods to be called by the runtime

  • offers additional methods such as score(), partial_fit(), transform()

Model backends are the middleware that connects the om.models API to specific frameworks. This class makes it simple to implement a model backend by offering a common syntax as well as a default implementation for get() and put().

Methods to implement:

# for model serialization (mandatory)

  • supports() - a classmethod; determine if the backend supports the given model instance

  • _package_model() - serialize a model instance into a temporary file

  • _extract_model() - deserialize the model from a file-like object

By default BaseModelBackend uses joblib.dump()/joblib.load() to store the model as serialized Python objects. If this is not sufficient or applicable to your type of models, override these methods.

Both methods provide readily set up temporary file names so that all you have to do is actually save the model to the given output file and restore the model from the given input file, respectively. All other logic has already been implemented (see get_model and put_model methods).

# for fitting and predicting (mandatory)

  • fit()

  • predict()

# other methods (optional)

  • fit_transform() - fit and return a transformed dataset

  • partial_fit() - fit incrementally

  • predict_proba() - predict probabilities

  • score() - score a fitted classifier against a test dataset
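A minimal structural sketch of such a backend, using pickle in place of a framework-specific serializer. The method signatures shown are assumptions for illustration; a real backend would subclass omegaml.backends.basemodel.BaseModelBackend rather than stand alone:

```python
import pickle

class MyModel:
    """stand-in for a framework-specific model class"""

class MyModelBackend:
    # illustrative only - a real implementation subclasses BaseModelBackend
    KIND = 'mymodel.pickle'  # hypothetical kind identifier

    @classmethod
    def supports(cls, obj, name, **kwargs):
        # mandatory: determine if this backend handles the given model instance
        return isinstance(obj, MyModel)

    def _package_model(self, model, key, tmpfn, **kwargs):
        # mandatory: serialize the model into the prepared temporary file
        with open(tmpfn, 'wb') as fout:
            pickle.dump(model, fout)
        return tmpfn

    def _extract_model(self, infile, key, tmpfn, **kwargs):
        # mandatory: deserialize the model from a file-like
        return pickle.load(infile)
```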

omegaml.mixins

class omegaml.mixins.store.ProjectedMixin

An OmegaStore mixin to process column specifications in dataset name

class omegaml.mixins.mdf.FilterOpsMixin

filter operators on MSeries

class omegaml.mixins.mdf.ApplyMixin(*args, **kwargs)

Implements the apply() mixin supporting arbitrary functions to build aggregation pipelines

Note that .apply() does not execute immediately. Instead it builds an aggregation pipeline that is executed on MDataFrame.value. Note that .apply() calls cannot be cascaded yet, i.e. a later .apply() will override a previous .apply().

See ApplyContext for usage examples.

class omegaml.mixins.mdf.ApplyArithmetics

Math operators for ApplyContext

  • __mul__ (*)

  • __add__ (+)

  • __sub__ (-)

  • __div__ (/)

  • __floordiv__ (//)

  • __mod__ (%)

  • __pow__ (pow)

  • __ceil__ (ceil)

  • __floor__ (floor)

  • __trunc__ (trunc)

  • __abs__ (abs)

  • sqrt (math.sqrt)

__mul__(other)

multiply

class omegaml.mixins.mdf.ApplyDateTime

Datetime operators for ApplyContext

class omegaml.mixins.mdf.ApplyString

String operators

class omegaml.mixins.mdf.ApplyAccumulators

omegaml.runtimes

class omegaml.runtimes.OmegaRuntime(omega, bucket=None, defaults=None, celeryconf=None)

omegaml compute cluster gateway

class omegaml.runtimes.OmegaModelProxy(modelname, runtime=None)

proxy to a remote model in a celery worker

The proxy provides the same methods as the model but will execute the methods using celery tasks and return celery AsyncResult objects

Usage:

om = Omega()
# train a model
# result is an AsyncResult, use .get() to retrieve its result
result = om.runtime.model('foo').fit('datax', 'datay')
result.get()

# predict
result = om.runtime.model('foo').predict('datax')
# result is an AsyncResult, use .get() to retrieve its result
print(result.get())

Notes

The actual methods of OmegaModelProxy are defined in its mixins

See also

  • ModelMixin

  • GridSearchMixin

class omegaml.runtimes.OmegaJobProxy(jobname, runtime=None)

proxy to a remote job in a celery worker

Usage:

om = Omega()
# result is an AsyncResult, use .get() to retrieve its result
result = om.runtime.job('foojob').run()
result.get()

# result is an AsyncResult, use .get() to retrieve its result
result = om.runtime.job('foojob').schedule()
result.get()

class omegaml.runtimes.OmegaRuntimeDask(omega, dask_url=None)

omegaml compute cluster gateway to a dask distributed cluster

Set the environment variable DASK_DEBUG=1 to run dask tasks locally

omegaml.documents

class omegaml.documents.Metadata(**kwargs)

Metadata stores information about objects in OmegaStore

omegaml.jobs

omegaml.notebook

class omegaml.notebook.omegacontentsmgr.OmegaStoreContentsManager(**kwargs: Any)

Jupyter notebook storage manager for omegaml

Adapted from notebook/services/contents/filemanager.py

This requires a properly configured omegaml instance. see http://jupyter-notebook.readthedocs.io/en/stable/extending/contents.html