Storage backends¶
A storage backend adds support for additional data types to the
datasets, models, and jobs stores. All stores share the same backends
because they use the same implementation throughout. There are two types of
backends: data backends and model backends.
All storage backends are initialized to know their respective data and model
stores as self.data_store and self.model_store, respectively.
Accessing MongoDB¶
OmegaStore, accessible from a backend implementation as
self.data_store, provides several methods and properties to interact
with MongoDB:
- self.data_store.metadata(name) - return the metadata object for the given object
- self.data_store.collection(name) - return the MongoDB Collection instance for the given object
- self.data_store.fs (property) - return the MongoDB GridFS instance
- self.data_store.mongodb (property) - return the MongoDB Database instance
Warning
A custom backend must not use any other means to access MongoDB, as doing so may cause unexpected side effects.
Generating unique names¶
To generate a unique name for an object that is compatible with MongoDB
collection and GridFS naming rules, use the self.data_store.object_store_key
method.
Note
The .collection method already uses object_store_key()
to set the collection name for a given object.
Choosing MongoDB GridFS or collection¶
If the data type of an object is some form of iterable, a MongoDB collection may be well suited to store the data.
# some class FooDataBackend(BaseDataBackend):
def put(self, obj, name, **kwargs):
   # store obj in collection
   collection = self.data_store.collection(name)
   collection.insert_many([dict(item) for item in obj])
   # create meta data
   meta = self.data_store.make_metadata(name, 'foo.rows')
   meta.collection = collection
   return meta.save()
Note
The dict(item) call may or may not be necessary. In general, for
MongoDB to be able to store an object, the object must be BSON-serializable.
See the tutorial on MongoDB documents and the MongoDB type mapping.
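The conversion hinted at by dict(item) can be sketched as a small helper. This is an illustrative sketch only: the name to_document is hypothetical and not part of omega|ml, and it handles just mappings, named tuples, and dataclass instances. The values inside the resulting dict must still be BSON-serializable.

```python
from collections import namedtuple
from collections.abc import Mapping
from dataclasses import asdict, dataclass, is_dataclass


def to_document(item):
    """Convert a row-like item to a plain dict (hypothetical helper)."""
    if isinstance(item, Mapping):
        return dict(item)
    if is_dataclass(item) and not isinstance(item, type):
        # dataclass instance (not the class itself)
        return asdict(item)
    if hasattr(item, '_asdict'):
        # namedtuple instances expose _asdict()
        return dict(item._asdict())
    # fall back to dict(), which accepts an iterable of (key, value) pairs
    return dict(item)


Point = namedtuple('Point', ['x', 'y'])


@dataclass
class Reading:
    sensor: str
    value: float


rows = [{'a': 1}, Point(1, 2), Reading('t1', 20.5), [('k', 'v')]]
docs = [to_document(row) for row in rows]
```

In a put method, such a helper would take the place of the dict(item) call before passing the documents to collection.insert_many.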
If the data type is of binary form, a GridFS file may be the better choice.
from io import BytesIO
from mongoengine.fields import GridFSProxy
# some class FooDataBackend(BaseDataBackend):
def put(self, obj, name, **kwargs):
   # store obj in gridfs
   filename = self.data_store.object_store_key(name, 'foo')
   buf = BytesIO(obj)
   fileid = self.data_store.fs.put(buf, filename=filename)
   # create meta data
   meta = self.data_store.make_metadata(name, 'foo.file')
   meta.gridfile = GridFSProxy(grid_id=fileid)
   return meta.save()
Note
The above code snippets only show the put method. Implement the
get method to retrieve the object from the object’s collection or
GridFS file, as indicated by meta.kind. It is the responsibility of
the backend to apply whatever data conversions are necessary, i.e.
OmegaStore does not implement any automatic conversions.
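A get method that dispatches on meta.kind might look like the following sketch. To keep it runnable without a MongoDB instance, FakeCollection, FakeFS and FakeStore are hypothetical in-memory stand-ins that mimic only the calls used here (insert_many, find, fs.put, fs.get); they are not omega|ml classes. Choosing the storage by isinstance in put is likewise an assumption made for the example.

```python
from io import BytesIO
from types import SimpleNamespace


class FakeCollection:
    """Stand-in for a pymongo Collection (only the calls used here)."""

    def __init__(self):
        self.docs = []

    def insert_many(self, docs):
        self.docs.extend(docs)

    def find(self):
        return iter(self.docs)


class FakeFS:
    """Stand-in for GridFS: put() returns an id, get() a file-like object."""

    def __init__(self):
        self.files = {}

    def put(self, buf, filename=None):
        fileid = len(self.files) + 1
        self.files[fileid] = buf.read()
        return fileid

    def get(self, fileid):
        return BytesIO(self.files[fileid])


class FakeStore:
    """Stand-in for self.data_store; not the real OmegaStore."""

    def __init__(self):
        self.fs = FakeFS()
        self._collections = {}
        self._metadata = {}

    def collection(self, name):
        return self._collections.setdefault(name, FakeCollection())

    def metadata(self, name):
        return self._metadata.get(name)

    def make_metadata(self, name, kind):
        meta = SimpleNamespace(name=name, kind=kind, gridfile=None)
        self._metadata[name] = meta
        return meta


class FooDataBackend:
    # with omega|ml this class would derive from BaseDataBackend
    def __init__(self, data_store):
        self.data_store = data_store

    def put(self, obj, name, **kwargs):
        if isinstance(obj, bytes):
            # binary data goes to the (fake) GridFS
            fileid = self.data_store.fs.put(BytesIO(obj), filename=name)
            meta = self.data_store.make_metadata(name, 'foo.file')
            meta.gridfile = SimpleNamespace(grid_id=fileid)
        else:
            # iterables of row-like items go to the (fake) collection
            self.data_store.collection(name).insert_many([dict(i) for i in obj])
            meta = self.data_store.make_metadata(name, 'foo.rows')
        return meta  # a real backend returns meta.save()

    def get(self, name, **kwargs):
        # meta.kind tells us where put() stored the data
        meta = self.data_store.metadata(name)
        if meta.kind == 'foo.rows':
            return list(self.data_store.collection(name).find())
        if meta.kind == 'foo.file':
            return self.data_store.fs.get(meta.gridfile.grid_id).read()
        raise ValueError('unsupported kind: {}'.format(meta.kind))


backend = FooDataBackend(FakeStore())
backend.put([{'x': 1}, {'x': 2}], 'rows')
backend.put(b'\x00\x01', 'blob')
```

Calling backend.get('rows') returns the list of dicts and backend.get('blob') returns the original bytes. Any conversion back to a richer type would have to happen inside get.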
Storing data outside MongoDB¶
OmegaStore is oblivious to the storage location of the actual data of
an object, as long as there is a backend that handles storing (put) and
retrieval (get). In other words, OmegaStore in combination with a backend
implementation can deal with arbitrary data and storage methods.
For data stored in MongoDB, Metadata.collection and Metadata.gridfile provide
the necessary pointers. For data stored outside MongoDB, Metadata.uri holds
an arbitrary URI that a backend can set (on put) and use for retrieval
(on get).
from io import BytesIO
# some class FooDataBackend(BaseDataBackend):
def put(self, obj, name, **kwargs):
   # store obj in some external file system
   filename = self.data_store.object_store_key(name, 'foo')
   buf = BytesIO(obj)
   # get instance of external file system and create file URI
   # note the URI can be anything as long as your get method knows how
   # to dereference
   foofs = ...
   fileid = foofs.put(buf, filename=filename)
   uri = 'foofs://{}'.format(fileid)
   # create meta data
   meta = self.data_store.make_metadata(name, 'foo.file')
   meta.uri = uri
   return meta.save()
def get(self, name, **kwargs):
   # get metadata and URI
   meta = self.data_store.metadata(name)
   uri = meta.uri
   # get object back using some service that understands this uri
   service = ...
   obj = service.get(uri)
   return obj
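One way to dereference such a URI is to dispatch on its scheme. The sketch below assumes the 'foofs://<fileid>' format used above. FooFS, make_uri and resolve are hypothetical names introduced for illustration, and FooFS is an in-memory stand-in for whatever external service is actually used.

```python
from urllib.parse import urlparse


class FooFS:
    """Hypothetical external file system keyed by file id."""

    def __init__(self):
        self._files = {}

    def put(self, data, filename=None):
        fileid = 'f{}'.format(len(self._files) + 1)
        self._files[fileid] = data
        return fileid

    def get(self, fileid):
        return self._files[fileid]


def make_uri(fileid):
    """Build the URI that put() would store in meta.uri."""
    return 'foofs://{}'.format(fileid)


def resolve(uri, services):
    """Look up the service for the URI scheme and fetch the object."""
    parsed = urlparse(uri)
    try:
        service = services[parsed.scheme]
    except KeyError:
        raise ValueError('no service for scheme {!r}'.format(parsed.scheme))
    # for 'foofs://<id>' the id is parsed as the network location
    return service.get(parsed.netloc)


foofs = FooFS()
uri = make_uri(foofs.put(b'payload', filename='foo'))
obj = resolve(uri, {'foofs': foofs})
```

Keeping the scheme-to-service mapping in one place lets a single backend's get method support several external storage locations.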
Data backend¶
A data backend minimally provides the put and get methods:
# some class FooDataBackend(BaseDataBackend):
def put(self, obj, name, **kwargs):
    # code to store the object
    ...
    # create or update the metadata object
    meta = self.data_store.metadata(name)
    if meta is None:
       meta = self.data_store.make_metadata(name, kind)
    # always save the Metadata instance before returning
    return meta.save()
def get(self, name, **kwargs):
    # code to retrieve the object
    obj = ...
    return obj
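The create-or-update pattern for the metadata can be exercised with a small in-memory sketch. FakeStore, its objects dict, the saves counter and the KIND constant are hypothetical and exist only to make the example runnable; with omega|ml the store is the real self.data_store and save() persists the Metadata instance.

```python
from types import SimpleNamespace


class FakeStore:
    """Hypothetical in-memory stand-in for self.data_store."""

    def __init__(self):
        self.saved = {}      # name -> metadata, filled on save()
        self.objects = {}    # name -> stored object

    def metadata(self, name):
        return self.saved.get(name)

    def make_metadata(self, name, kind):
        meta = SimpleNamespace(name=name, kind=kind, saves=0)

        def save():
            # count saves so the example can show the object is reused
            meta.saves += 1
            self.saved[name] = meta
            return meta

        meta.save = save
        return meta


class FooDataBackend:
    # with omega|ml this class would derive from BaseDataBackend
    KIND = 'foo.obj'

    def __init__(self, data_store):
        self.data_store = data_store

    def put(self, obj, name, **kwargs):
        # code to store the object
        self.data_store.objects[name] = obj
        # create the metadata only if it does not exist yet
        meta = self.data_store.metadata(name)
        if meta is None:
            meta = self.data_store.make_metadata(name, self.KIND)
        # always save the metadata before returning
        return meta.save()

    def get(self, name, **kwargs):
        return self.data_store.objects[name]


backend = FooDataBackend(FakeStore())
first = backend.put({'a': 1}, 'foo')
second = backend.put({'a': 2}, 'foo')
```

Both calls return the same metadata object, and get('foo') returns the most recently stored value.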
Model backend¶
Model backends store and retrieve instances of models (in the scikit-learn
sense of model persistence). In addition, they act as the model proxy used
by OmegaRuntime to perform arbitrary actions on a saved model using
named data objects.
The actions that OmegaRuntime minimally expects to be available
on a saved model are as follows. Note that these methods accept the modelname,
Xname, and Yname parameters. modelname must reference an existing object in
the om.models store, and Xname and Yname must reference existing objects in
the om.datasets store.
Note
Technically, these methods are called from a worker in the compute cluster without the model or the data having been loaded beforehand. The worker uses om.models.get_backend() to retrieve the model’s backend, then calls the requested method. It is therefore the responsibility of the backend to retrieve the model and any data required.
- BaseModelBackend.fit(modelname, Xname, Yname=None, pure_python=True, **kwargs)
  Fit the model with data.
  - Parameters:
    - modelname – the name of the model object
    - Xname – the name of the X data set
    - Yname – the name of the Y data set
    - pure_python – if True return a python object. If False return a dataframe. Defaults to True to support any client.
    - kwargs – kwargs passed to the model’s fit method
  - Returns: the metadata object of the model
- BaseModelBackend.predict(modelname, Xname, rName=None, pure_python=True, **kwargs)
  Predict using data stored in Xname.
  - Parameters:
    - modelname – the name of the model object
    - Xname – the name of the X data set
    - rName – the name of the result data object or None
    - pure_python – if True return a python object. If False return a dataframe. Defaults to True to support any client.
    - kwargs – kwargs passed to the model’s predict method
  - Returns: the predicted outcome
- BaseModelBackend.transform(modelname, Xname, rName=None, **kwargs)
  Transform using data.
  - Parameters:
    - modelname – the name of the model object
    - Xname – the name of the X data set
    - rName – the name of the transform’s result data object or None
    - kwargs – kwargs passed to the model’s transform method
  - Returns: the transformed data
This is in addition to the put and get methods required by
any storage backend.
Ideally, and for user convenience, more methods should be supported;
see the reference on BaseModelBackend. Methods that are not supported
raise NotImplementedError.
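The flow described in this section (the worker obtains the backend, calls the requested method by name, and the backend loads the model and data itself) can be sketched as follows. Everything here is a hypothetical stand-in chosen so the example runs on its own: FakeStore replaces the om.models and om.datasets stores, MeanModel is a toy model, and run_on_worker imitates the worker's dispatch. With omega|ml the backend would derive from BaseModelBackend and fit would return the model's metadata object rather than the name returned by this fake store.

```python
class FakeStore:
    """Hypothetical in-memory store with get/put by name."""

    def __init__(self):
        self._objects = {}

    def put(self, obj, name):
        self._objects[name] = obj
        return name  # the real store returns a Metadata object

    def get(self, name):
        return self._objects[name]


class MeanModel:
    """Toy model: predicts the mean of the Y values seen in fit()."""

    def fit(self, X, Y):
        self.mean_ = sum(Y) / len(Y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class FooModelBackend:
    def __init__(self, model_store, data_store):
        self.model_store = model_store
        self.data_store = data_store

    def fit(self, modelname, Xname, Yname=None, pure_python=True, **kwargs):
        # the backend, not the caller, loads the model and the data
        model = self.model_store.get(modelname)
        X = self.data_store.get(Xname)
        Y = self.data_store.get(Yname) if Yname else None
        model.fit(X, Y, **kwargs)
        # store the fitted model back under the same name
        return self.model_store.put(model, modelname)

    def predict(self, modelname, Xname, rName=None, pure_python=True, **kwargs):
        model = self.model_store.get(modelname)
        X = self.data_store.get(Xname)
        result = model.predict(X, **kwargs)
        if rName is not None:
            # optionally keep the result as a named data object
            self.data_store.put(result, rName)
        return result

    def transform(self, modelname, Xname, rName=None, **kwargs):
        # this toy backend does not support transform
        raise NotImplementedError('transform is not supported')


def run_on_worker(backend, method, modelname, Xname, **kwargs):
    """Imitate the worker: call the requested method by name."""
    return getattr(backend, method)(modelname, Xname, **kwargs)


models, datasets = FakeStore(), FakeStore()
models.put(MeanModel(), 'mymodel')
datasets.put([[1], [2], [3]], 'X')
datasets.put([2.0, 4.0, 6.0], 'Y')
backend = FooModelBackend(models, datasets)
run_on_worker(backend, 'fit', 'mymodel', 'X', Yname='Y')
predicted = run_on_worker(backend, 'predict', 'mymodel', 'X', rName='result')
```

Because only names cross the boundary between client and worker, the model and data never have to be serialized by the caller.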