Developer API¶
omega|ml¶
- class omegaml.Omega(defaults=None, mongo_url=None, celeryconf=None, bucket=None, **kwargs)¶
Client API to omegaml
Provides the following APIs:
datasets
- access to datasets stored in the cluster
models
- access to models stored in the cluster
runtimes
- access to the cluster compute resources
jobs
- access to jobs stored and executed in the cluster
scripts
- access to lambda modules stored and executed in the cluster
omegaml.store¶
Native storage for OmegaML using MongoDB as the storage layer
An OmegaStore instance is a MongoDB database. It has at least the metadata collection which lists all objects stored in it. A metadata document refers to the following types of objects (metadata.kind):
pandas.dfrows - a Pandas DataFrame stored as a collection of rows
sklearn.joblib - a scikit-learn estimator/pipeline dumped using joblib.dump()
python.data - an arbitrary python dict, tuple, list stored as a document
Note that storing Pandas and scikit-learn objects requires the respective packages to be available. If either cannot be imported, the OmegaStore degrades to a python.data store only. It will still .list() and .get() any object, but returns pure Python objects. In this case it is up to the client to convert the data into an appropriate format for processing.
Pandas and scikit-learn objects can only be stored if these packages are available. put() raises a TypeError if you pass such objects and these modules cannot be loaded.
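The degradation described above can be pictured with a small optional-import guard. This is only a sketch of the general pattern, not omegaml's code: `optional_import`, `REQUIRED_PACKAGE` and `check_storable` are hypothetical names introduced for illustration.

```python
import importlib


def optional_import(module_name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# Hypothetical mapping from metadata.kind to the package it depends on.
REQUIRED_PACKAGE = {
    "pandas.dfrows": "pandas",
    "sklearn.joblib": "sklearn",
    "python.data": None,  # plain Python objects need no extra package
}


def check_storable(kind, importer=optional_import):
    """Raise TypeError if the package required for `kind` is unavailable."""
    package = REQUIRED_PACKAGE[kind]
    if package is not None and importer(package) is None:
        raise TypeError(f"cannot store {kind}: package {package!r} is not available")
    return True
```

Passing a different `importer` makes it easy to simulate a missing package without uninstalling anything.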
All data are stored within the same MongoDB database, in per-object collections as follows:
- .metadata
All metadata. Each object is one document. See omegaml.documents.Metadata for details.
- .<bucket>.files
this is the GridFS instance used to store blobs (models, numpy, hdf). The actual file name will be <prefix>/<name>.<ext>, where ext is optionally generated by put() / get().
- .<bucket>.<prefix>.<name>.data
every other dataset is stored in a separate collection (dataframes, dicts, lists, tuples). Any forward slash in prefix is ignored (e.g. ‘data/’ becomes ‘data’)
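The naming scheme listed above can be sketched as two small helpers. Both function names are hypothetical, and several details are assumptions: the leading dot shown in the list is left out, empty parts are skipped, and a trailing slash on the prefix is not doubled in file names.

```python
def data_collection_name(bucket, prefix, name):
    """Build '<bucket>.<prefix>.<name>.data', ignoring slashes in prefix."""
    clean_prefix = prefix.replace("/", "")  # 'data/' becomes 'data'
    parts = [bucket, clean_prefix, name, "data"]
    # Assumption: empty parts are skipped so no double dots appear.
    return ".".join(part for part in parts if part)


def blob_file_name(prefix, name, ext=None):
    """Build '<prefix>/<name>.<ext>' for objects kept as GridFS files."""
    base = f"{prefix.rstrip('/')}/{name}" if prefix else name
    return f"{base}.{ext}" if ext else base


collection = data_collection_name("omegaml", "data/", "sales")
filename = blob_file_name("models/", "mymodel", "zip")
```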
By default, DataFrames are stored in their own collection, and every row becomes a document. To store a DataFrame as a binary file instead, use put(..., as_hdf=True). .get() will always return a DataFrame.
Python dicts, lists, tuples are stored as a single document with a .data attribute holding the JSON-converted representation. .get() will always return the corresponding python object of .data.
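A minimal sketch of the single-document idea, using only the standard `json` module. The helper names and the `name` field are assumptions made for illustration; the only part taken from the text is that the object lives under a `data` attribute in JSON-converted form.

```python
import json


def to_document(name, obj):
    """Wrap a dict, list or tuple in a single document with a 'data' field."""
    # Round-tripping through JSON mimics the "JSON-converted representation".
    return {"name": name, "data": json.loads(json.dumps(obj))}


def from_document(doc):
    """Return the Python object held in the document's 'data' field."""
    return doc["data"]


doc = to_document("settings", {"alpha": 0.1, "layers": (64, 32)})
restored = from_document(doc)
```

With plain JSON a tuple comes back as a list, so `restored["layers"]` is `[64, 32]` in this sketch. Whether omegaml restores the original tuple type is not stated here.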
Models are joblib.dump()'ed and zipped prior to transferring into GridFS. .get() will always unzip and joblib.load() before returning the model. Note this requires that the process using .get() supports joblib as well as all Python classes referred to. If joblib is not supported, .get() returns a file-like object.
The .metadata entry specifies the format used to store each object as well as its location:
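The dump-then-zip round trip can be illustrated with the standard library alone. In this sketch `pickle` stands in for joblib and an in-memory buffer stands in for GridFS; the function names and the `model.pkl` member name are hypothetical.

```python
import io
import pickle
import zipfile


def package_model(model, member_name="model.pkl"):
    """Serialize `model` and return an in-memory zip archive as bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # pickle is a stand-in for joblib.dump() in this sketch
        archive.writestr(member_name, pickle.dumps(model))
    return buffer.getvalue()


def extract_model(blob, member_name="model.pkl"):
    """Unzip the archive bytes and deserialize the model they contain."""
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        return pickle.loads(archive.read(member_name))


# A plain dict stands in for a fitted estimator.
blob = package_model({"coef": [0.5, 1.5], "intercept": 2.0})
model = extract_model(blob)
```

As with joblib, loading only works if the reading process can import every class the serialized object refers to.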
- metadata.kind
the type of object
- metadata.name
the name of the object, as given on put()
- metadata.gridfile
the gridfs object (if any, null otherwise)
- metadata.collection
the name of the collection
- metadata.attributes
arbitrary custom attributes set in put(attributes=obj). This is used e.g. by OmegaRuntime’s fit() method to record the data used in the model’s training.
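The fields listed above can be summarised as a small record type. `MetadataSketch` and `storage_location` are hypothetical and are not omegaml's Metadata class; the example values are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class MetadataSketch:
    """Hypothetical stand-in for a metadata document."""
    kind: str                          # type of object, e.g. 'pandas.dfrows'
    name: str                          # name given on put()
    gridfile: Optional[str] = None     # GridFS reference, None if unused
    collection: Optional[str] = None   # collection name, if any
    attributes: Dict[str, Any] = field(default_factory=dict)


def storage_location(meta):
    """Report whether the entry points to a GridFS file or a collection."""
    return "gridfs" if meta.gridfile is not None else "collection"


frame_meta = MetadataSketch(kind="pandas.dfrows", name="sales",
                            collection="omegaml.data.sales.data")
model_meta = MetadataSketch(kind="sklearn.joblib", name="mymodel",
                            gridfile="models/mymodel",
                            attributes={"trained_on": "sales"})
```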
.put() and .get() use helper methods selected by the object's type and by metadata.kind, respectively. In the future a plugin system will enable extension to other types.
- class omegaml.store.base.OmegaStore(mongo_url=None, bucket=None, prefix=None, kind=None, defaults=None, dbalias=None)¶
The storage backend for models and data
omegaml.backends¶
- class omegaml.backends.basedata.BaseDataBackend(model_store=None, data_store=None, tracking=None, **kwargs)¶
OmegaML BaseDataBackend to be subclassed by other arbitrary backends
This provides the abstract interface for any data backend to be implemented
- class omegaml.backends.basemodel.BaseModelBackend(model_store=None, data_store=None, tracking=None, **kwargs)¶
OmegaML BaseModelBackend to be subclassed by other arbitrary backends
This provides the abstract interface for any model backend to be implemented. Subclass to implement custom backends.
Essentially a model backend:
provides methods to serialize and deserialize a machine learning model for a given ML framework
offers fit() and predict() methods to be called by the runtime
offers additional methods such as score(), partial_fit(), transform()
Model backends are the middleware that connects the om.models API to specific frameworks. This class makes it simple to implement a model backend by offering a common syntax as well as a default implementation for get() and put().
- Methods to implement:
    # for model serialization (mandatory)
    @classmethod supports() - determine if backend supports given model instance
    _package_model() - serialize a model instance into a temporary file
    _extract_model() - deserialize the model from a file-like object
By default BaseModelBackend uses joblib.dump/load to store the model as serialized Python objects. If this is not sufficient or applicable to your type of model, override these methods.
Both methods provide readily set up temporary file names so that all you have to do is actually save the model to the given output file and restore the model from the given input file, respectively. All other logic has already been implemented (see get_model and put_model methods).
    # for fitting and predicting (mandatory)
    fit()
    predict()
    # other methods (optional)
    fit_transform() - fit and return a transformed dataset
    partial_fit() - fit incrementally
    predict_proba() - predict probabilities
    score() - score fitted classifier against a test dataset
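The division of labour between the base class and a custom backend might look roughly like the following. Everything here is a self-contained toy: `ToyBackendBase` only imitates the idea of shared put/get logic with overridable hooks, the hook signatures are assumptions rather than the real BaseModelBackend signatures, and fit() and predict() are left out to keep the focus on serialization.

```python
import io
import json
import os
import tempfile


class ToyBackendBase:
    """Hypothetical base: put/get logic is shared, hooks are overridden."""

    @classmethod
    def supports(cls, obj, **kwargs):
        raise NotImplementedError

    def _package_model(self, model, tmpfn):
        raise NotImplementedError

    def _extract_model(self, infile):
        raise NotImplementedError

    def put_model(self, model):
        """Call the packaging hook with a ready-made temporary file name."""
        with tempfile.TemporaryDirectory() as tmpdir:
            tmpfn = os.path.join(tmpdir, "model.bin")
            self._package_model(model, tmpfn)
            with open(tmpfn, "rb") as fin:
                return fin.read()

    def get_model(self, blob):
        """Call the extraction hook with a file-like object."""
        return self._extract_model(io.BytesIO(blob))


class MeanModel:
    """Trivial stand-in for a framework model: predicts the training mean."""

    def __init__(self, mean=0.0):
        self.mean = mean

    def fit(self, X, y):
        self.mean = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean for _ in X]


class MeanModelBackend(ToyBackendBase):
    """Custom backend that stores MeanModel instances as JSON."""

    @classmethod
    def supports(cls, obj, **kwargs):
        return isinstance(obj, MeanModel)

    def _package_model(self, model, tmpfn):
        with open(tmpfn, "w") as fout:
            json.dump({"mean": model.mean}, fout)

    def _extract_model(self, infile):
        return MeanModel(**json.load(infile))
```

The subclass only decides how a model is written to and read from a file; temporary file handling stays in the base class, which mirrors the split described above.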
omegaml.mixins¶
- class omegaml.mixins.store.ProjectedMixin¶
An OmegaStore mixin to process column specifications in the dataset name
- class omegaml.mixins.mdf.FilterOpsMixin¶
filter operators on MSeries
- class omegaml.mixins.mdf.ApplyMixin(*args, **kwargs)¶
Implements the apply() mixin supporting arbitrary functions to build aggregation pipelines
Note that .apply() does not execute immediately. Instead it builds an aggregation pipeline that is executed on MDataFrame.value. Note that .apply() calls cannot be cascaded yet, i.e. a later .apply() will override a previous .apply().
See ApplyContext for usage examples.
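The deferred behaviour of .apply() can be pictured with a toy class. `LazyFrame` is hypothetical and works on a plain list of rows rather than on an aggregation pipeline; it only demonstrates the two properties stated above: nothing runs until `.value` is read, and a later apply() replaces an earlier one.

```python
class LazyFrame:
    """Hypothetical stand-in that defers apply() until .value is read."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._pending = None  # at most one deferred function is kept

    def apply(self, fn):
        # Nothing is computed here; a later call replaces the earlier one.
        self._pending = fn
        return self

    @property
    def value(self):
        if self._pending is None:
            return list(self._rows)
        return [self._pending(row) for row in self._rows]


calls = []


def double(x):
    calls.append(x)  # record each actual evaluation
    return x * 2


frame = LazyFrame([1, 2, 3]).apply(double)
calls_before_value = len(calls)   # still zero, nothing has run yet
doubled = frame.value             # the deferred function runs here
overridden = LazyFrame([1, 2, 3]).apply(double).apply(lambda x: x + 1).value
```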
- class omegaml.mixins.mdf.ApplyArithmetics¶
Math operators for ApplyContext
__mul__ (*)
__add__ (+)
__sub__ (-)
__div__ (/)
__floordiv__ (//)
__mod__ (%)
__pow__ (pow)
__ceil__ (ceil)
__floor__ (floor)
__trunc__ (trunc)
__abs__ (abs)
sqrt (math.sqrt)
- __mul__(other)¶
multiply
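One way such operators can build up an expression instead of computing a value is sketched below. The `Column` class is hypothetical. The `$multiply`, `$add`, `$subtract`, `$divide`, `$mod`, `$pow`, `$abs` and `$sqrt` names are MongoDB aggregation operators, but whether ApplyArithmetics emits exactly these structures is an assumption.

```python
class Column:
    """Hypothetical expression node; operators return new nodes."""

    def __init__(self, expr):
        self.expr = expr

    def _binary(self, op, other):
        right = other.expr if isinstance(other, Column) else other
        return Column({op: [self.expr, right]})

    def __mul__(self, other):
        return self._binary("$multiply", other)

    def __add__(self, other):
        return self._binary("$add", other)

    def __sub__(self, other):
        return self._binary("$subtract", other)

    def __truediv__(self, other):  # Python 3 name for the / operator
        return self._binary("$divide", other)

    def __mod__(self, other):
        return self._binary("$mod", other)

    def __pow__(self, other):
        return self._binary("$pow", other)

    def __abs__(self):
        return Column({"$abs": self.expr})

    def sqrt(self):
        return Column({"$sqrt": self.expr})


# Builds a nested expression; no arithmetic is performed in Python.
total = Column("$price") * Column("$qty") + 5
```

Here `total.expr` is `{"$add": [{"$multiply": ["$price", "$qty"]}, 5]}`, a structure that could be placed into a projection stage.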
- class omegaml.mixins.mdf.ApplyDateTime¶
Datetime operators for ApplyContext
- class omegaml.mixins.mdf.ApplyString¶
String operators
- class omegaml.mixins.mdf.ApplyAccumulators¶
omegaml.runtimes¶
- class omegaml.runtimes.OmegaRuntime(omega, bucket=None, defaults=None, celeryconf=None)¶
omegaml compute cluster gateway
- class omegaml.runtimes.OmegaModelProxy(modelname, runtime=None)¶
proxy to a remote model in a celery worker
The proxy provides the same methods as the model but will execute the methods using celery tasks and return celery AsyncResult objects
Usage:
    from omegaml import Omega

    om = Omega()
    # train a model
    # result is AsyncResult, use .get() to return its result
    result = om.runtime.model('foo').fit('datax', 'datay')
    result.get()
    # predict
    result = om.runtime.model('foo').predict('datax')
    # result is AsyncResult, use .get() to return its result
    print(result.get())
Notes
The actual methods of ModelProxy are defined in its mixins
See also
ModelMixin
GridSearchMixin
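The proxy idea, where the caller uses the model's method names but receives a handle to a pending result, can be sketched without celery. Here `concurrent.futures` stands in for the celery worker, and a `Future` (read with `.result()`) stands in for an AsyncResult (read with `.get()`). `ModelProxySketch` and `SumModel` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


class SumModel:
    """Trivial stand-in model: predicts the sum of each row."""

    def predict(self, rows):
        return [sum(row) for row in rows]


class ModelProxySketch:
    """Forward any method call to the model via an executor."""

    def __init__(self, model, executor):
        self._model = model
        self._executor = executor

    def __getattr__(self, name):
        # Only reached for names not defined on the proxy itself.
        method = getattr(self._model, name)

        def submit(*args, **kwargs):
            # Return a Future immediately instead of the computed value.
            return self._executor.submit(method, *args, **kwargs)

        return submit


with ThreadPoolExecutor(max_workers=1) as executor:
    proxy = ModelProxySketch(SumModel(), executor)
    future = proxy.predict([[1, 2], [3, 4]])
    predictions = future.result()  # blocks until the task has finished
```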
- class omegaml.runtimes.OmegaJobProxy(jobname, runtime=None)¶
proxy to a remote job in a celery worker
Usage:
    om = Omega()
    # result is AsyncResult, use .get() to return its result
    result = om.runtime.job('foojob').run()
    result.get()
    # result is AsyncResult, use .get() to return its result
    result = om.runtime.job('foojob').schedule()
    result.get()
- class omegaml.runtimes.OmegaRuntimeDask(omega, dask_url=None)¶
omegaml compute cluster gateway to a dask distributed cluster
Set the environment variable DASK_DEBUG=1 to run dask tasks locally.
omegaml.documents¶
- class omegaml.documents.Metadata(**kwargs)¶
Metadata stores information about objects in OmegaStore
omegaml.jobs¶
omegajobs¶
- class omegaml.notebook.omegacontentsmgr.OmegaStoreContentsManager(**kwargs: Any)¶
Jupyter notebook storage manager for omegaml
Adapted from notebook/services/contents/filemanager.py
This requires a properly configured omegaml instance. See http://jupyter-notebook.readthedocs.io/en/stable/extending/contents.html