Storage backends

A storage backend can support additional data types for the datasets, models, and jobs stores. All stores share the same backends because they use the same implementation. There are two types of backends: data backends and model backends.

All storage backends are initialized with references to their data and model stores, available as self.data_store and self.model_store, respectively.

Accessing MongoDB

OmegaStore, accessible from a backend implementation as self.data_store, provides several methods and properties to interact with MongoDB.

  • self.data_store.metadata(name) - return the meta data object for the given object

  • self.data_store.collection(name) - return the mongodb Collection instance for the given object

  • self.data_store.fs (property) - return the mongodb GridFS instance

  • self.data_store.mongodb (property) - return the mongodb Database instance

Warning

A custom backend shall not use any other means to access MongoDB, as doing so may cause unexpected side effects.

Generating unique names

To generate a unique name for an object that is compatible with MongoDB collection and GridFS naming rules, use the self.data_store.object_store_key method.

Note

The .collection method already uses object_store_key() to set the collection name for a given object.

Choosing MongoDB GridFS or collection

If the data type of an object is some form of iterable, a MongoDB collection may be well suited to store the data.

# some class FooDataBackend(BaseDataBackend):
def put(self, obj, name, **kwargs):
   # store obj in collection
   collection = self.data_store.collection(name)
   collection.insert_many([dict(item) for item in obj])
   # create meta data
   meta = self.data_store.make_metadata(name, 'foo.rows')
   meta.collection = collection
   return meta.save()

Note

The dict(item) call may or may not be necessary. In general, MongoDB can store an object only if it is BSON serializable. See the tutorial on MongoDB documents and MongoDB type mapping.
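As a rough illustration of why the conversion depends on the item type, the sketch below normalizes a few common row shapes into plain dicts before they would be passed to insert_many. The helper name to_document and the Point type are hypothetical and not part of OmegaStore; whether the resulting values are BSON serializable still depends on the value types themselves.

```python
from collections import namedtuple
from collections.abc import Mapping

# Hypothetical row type, used only for this illustration
Point = namedtuple('Point', ['x', 'y'])


def to_document(item):
    """Convert one row into a plain dict (hypothetical helper)."""
    if isinstance(item, Mapping):
        # already dict-like: a shallow copy is enough
        return dict(item)
    if hasattr(item, '_asdict'):
        # namedtuples cannot be passed to dict() directly
        return dict(item._asdict())
    # fall back to dict(), which accepts an iterable of (key, value) pairs
    return dict(item)


rows = [{'x': 1, 'y': 2}, Point(3, 4), [('x', 5), ('y', 6)]]
documents = [to_document(row) for row in rows]
```

Here dict(item) alone would fail for the namedtuple row, which is one case where an explicit conversion is needed.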

If the data type is of binary form, a GridFS file may be the better choice.

from io import BytesIO
# assumption: GridFSProxy is the mongoengine class of that name
from mongoengine.fields import GridFSProxy

# some class FooDataBackend(BaseDataBackend):
def put(self, obj, name, **kwargs):
   # store obj in gridfs
   filename = self.data_store.object_store_key(name, 'foo')
   buf = BytesIO(obj)
   fileid = self.data_store.fs.put(buf, filename=filename)
   # create meta data
   meta = self.data_store.make_metadata(name, 'foo.file')
   meta.gridfile = GridFSProxy(grid_id=fileid)
   return meta.save()

Note

The above code snippets only show the put method. Implement the get method to retrieve the object from the object’s collection or GridFS file, as indicated by meta.kind. It is the responsibility of the backend to apply whatever data conversions are necessary; OmegaStore does not implement any automatic conversions.
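A matching get method might dispatch on meta.kind along the following lines. This is a minimal sketch, not the library's implementation: FakeDataStore is a hypothetical in-memory stand-in for OmegaStore (plain dicts replace the MongoDB collection and GridFS), and returning None for an unknown name is an assumption made for the example.

```python
from io import BytesIO
from types import SimpleNamespace


class FakeDataStore:
    """Hypothetical in-memory stand-in for OmegaStore (illustration only)."""

    def __init__(self):
        self._meta = {}   # name -> metadata-like object with a .kind
        self.rows = {}    # name -> list of dicts, stands in for a collection
        self.files = {}   # name -> bytes, stands in for a GridFS file

    def metadata(self, name):
        return self._meta.get(name)


class FooDataBackend:
    def __init__(self, data_store):
        self.data_store = data_store

    def get(self, name, **kwargs):
        meta = self.data_store.metadata(name)
        if meta is None:
            # assumption: unknown names yield None
            return None
        if meta.kind == 'foo.rows':
            # stored as documents: convert back to a list of dicts
            return [dict(doc) for doc in self.data_store.rows[name]]
        if meta.kind == 'foo.file':
            # stored as a binary file: read the raw bytes back
            return BytesIO(self.data_store.files[name]).read()
        raise ValueError('unsupported kind: {}'.format(meta.kind))


store = FakeDataStore()
store._meta['table'] = SimpleNamespace(kind='foo.rows')
store.rows['table'] = [{'x': 1}, {'x': 2}]
store._meta['blob'] = SimpleNamespace(kind='foo.file')
store.files['blob'] = b'raw bytes'
backend = FooDataBackend(store)
```

In a real backend, the two branches would read from self.data_store.collection(name) and self.data_store.fs instead of the dicts used here.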

Storing data outside MongoDB

OmegaStore is oblivious to the storage location of the actual data of an object, as long as there is a backend that handles storing (put) and retrieval (get). In other words, OmegaStore in combination with a backend implementation can deal with arbitrary data and storage methods.

For data stored in MongoDB, Metadata.collection and Metadata.gridfile provide the necessary pointers. For data stored outside MongoDB, Metadata.uri provides an arbitrary URI that a backend can set (on put) and use for retrieval (on get).

# some class FooDataBackend(BaseDataBackend):
def put(self, obj, name, **kwargs):
   # store obj in some external file system
   filename = self.data_store.object_store_key(name, 'foo')
   # get instance of external file system and create file URI
   # note the URI can be anything as long as your get method knows how
   # to dereference it
   foofs = ...
   fileid = foofs.put(obj, filename=filename)
   uri = 'foofs://{}'.format(fileid)
   # create meta data
   meta = self.data_store.make_metadata(name, 'foo.file')
   meta.uri = uri
   return meta.save()

def get(self, name, **kwargs):
   # get metadata and URI
   meta = self.data_store.metadata(name)
   uri = meta.uri
   # get object back using some service that understands this uri
   service = ...
   obj = service.get(uri)
   return obj

Data backend

A data backend minimally provides the put and get methods:

# some class FooDataBackend(BaseDataBackend):
def put(self, obj, name, **kwargs):
    # code to store the object
    ...
    # create or update the metadata object
    meta = self.data_store.metadata(name)
    if meta is None:
        # the kind identifies this backend's data type, e.g. 'foo.file'
        meta = self.data_store.make_metadata(name, 'foo.file')
    # always save the Metadata instance before returning
    return meta.save()

def get(self, name, **kwargs):
    # code to retrieve the object
    obj = ...
    return obj
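To make the put/get contract concrete, here is a minimal runnable sketch that uses in-memory stand-ins instead of MongoDB. FakeMetadata, FakeDataStore, DictDataBackend and the 'foo.dict' kind are all hypothetical names introduced for this illustration; only the call pattern (metadata, make_metadata, save) follows the text above.

```python
class FakeMetadata:
    """Hypothetical stand-in for the Metadata object."""

    def __init__(self, name, kind, registry):
        self.name = name
        self.kind = kind
        self.saved = 0
        self._registry = registry

    def save(self):
        # saving registers the metadata so later lookups can find it
        self._registry[self.name] = self
        self.saved += 1
        return self


class FakeDataStore:
    """Hypothetical in-memory stand-in for OmegaStore."""

    def __init__(self):
        self._meta = {}

    def metadata(self, name):
        return self._meta.get(name)

    def make_metadata(self, name, kind):
        return FakeMetadata(name, kind, self._meta)


class DictDataBackend:
    KIND = 'foo.dict'  # hypothetical kind for this example

    def __init__(self, data_store):
        self.data_store = data_store
        self._objects = {}  # stands in for the actual storage

    def put(self, obj, name, **kwargs):
        self._objects[name] = obj
        # create or update the metadata object
        meta = self.data_store.metadata(name)
        if meta is None:
            meta = self.data_store.make_metadata(name, self.KIND)
        # always save the metadata before returning
        return meta.save()

    def get(self, name, **kwargs):
        return self._objects[name]


backend = DictDataBackend(FakeDataStore())
first = backend.put({'a': 1}, 'cfg')
second = backend.put({'a': 2}, 'cfg')
```

The second put reuses the existing metadata object rather than creating a new one, which is the create-or-update behavior the skeleton above describes.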

Model backend

Model backends store and retrieve instances of models (in the scikit-learn sense of model persistence). In addition, they act as the model proxy used by OmegaRuntime to perform arbitrary actions on a saved model using named data objects.

The actions expected to be available minimally to OmegaRuntime on a saved model are as follows. Note that these methods accept the modelname, Xname, and Yname parameters: modelname must reference an existing object in the om.models store, and Xname and Yname must reference existing objects in the om.datasets store.

Note

Technically, these methods are called from a worker in the compute cluster without first loading the model or the data. The worker uses om.models.get_backend() to retrieve the model’s backend, then calls the requested method. Thus it is the responsibility of the backend to retrieve the model and any data required.
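The call flow described in this note can be sketched with in-memory stand-ins. Everything below is hypothetical and only illustrates the division of responsibility: FakeStore replaces om.models and om.datasets, get_backend taking the model name is an assumption, fit returns the model name where the real method returns the metadata object, and storing the result under rName is assumed behavior.

```python
class MeanModel:
    """Toy model that predicts the mean of the training targets."""

    def fit(self, X, Y):
        self.mean_ = sum(Y) / len(Y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class FakeStore:
    """Hypothetical stand-in for om.models and om.datasets."""

    def __init__(self):
        self.objects = {}
        self.backend = None

    def put(self, obj, name):
        self.objects[name] = obj

    def get(self, name):
        return self.objects[name]

    def get_backend(self, name):
        # assumption: the backend is looked up by object name
        return self.backend


class ToyModelBackend:
    def __init__(self, model_store, data_store):
        self.model_store = model_store
        self.data_store = data_store

    def fit(self, modelname, Xname, Yname=None, **kwargs):
        # the backend, not the worker, loads the model and the data
        model = self.model_store.get(modelname)
        X = self.data_store.get(Xname)
        Y = self.data_store.get(Yname) if Yname else None
        model.fit(X, Y, **kwargs)
        self.model_store.put(model, modelname)
        return modelname  # stand-in for the model's metadata object

    def predict(self, modelname, Xname, rName=None, **kwargs):
        model = self.model_store.get(modelname)
        X = self.data_store.get(Xname)
        result = model.predict(X, **kwargs)
        if rName:
            # assumption: results are stored under rName when given
            self.data_store.put(result, rName)
        return result


def worker_task(models, method, modelname, *args, **kwargs):
    """What a worker might do: resolve the backend, then delegate."""
    backend = models.get_backend(modelname)
    return getattr(backend, method)(modelname, *args, **kwargs)


models, datasets = FakeStore(), FakeStore()
models.backend = ToyModelBackend(models, datasets)
models.put(MeanModel(), 'mymodel')
datasets.put([[1], [2], [3]], 'X')
datasets.put([2.0, 4.0, 6.0], 'Y')
worker_task(models, 'fit', 'mymodel', 'X', 'Y')
predicted = worker_task(models, 'predict', 'mymodel', 'X', rName='pred')
```

Note that worker_task only ever passes names around; all loading of the model and the data happens inside the backend methods.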

BaseModelBackend.fit(modelname, Xname, Yname=None, pure_python=True, **kwargs)

fit the model with data

Parameters:
  • modelname – the name of the model object

  • Xname – the name of the X data set

  • Yname – the name of the Y data set

  • pure_python – if True return a python object. If False return a dataframe. Defaults to True to support any client.

  • kwargs – kwargs passed to the model’s fit method

Returns:

return the meta data object of the model

BaseModelBackend.predict(modelname, Xname, rName=None, pure_python=True, **kwargs)

predict using data stored in Xname

Parameters:
  • modelname – the name of the model object

  • Xname – the name of the X data set

  • rName – the name of the result data object or None

  • pure_python – if True return a python object. If False return a dataframe. Defaults to True to support any client.

  • kwargs – kwargs passed to the model’s predict method

Returns:

return the predicted outcome

BaseModelBackend.transform(modelname, Xname, rName=None, **kwargs)

transform using data

Parameters:
  • modelname – the name of the model object

  • Xname – the name of the X data set

  • rName – the name of the transform’s result data object or None

  • kwargs – kwargs passed to the model’s transform method

Returns:

return the transformed data

These methods are in addition to the put and get methods required by any storage backend.

Ideally and for user convenience, more methods should be supported; see the reference on BaseModelBackend. Methods that are not supported raise the NotImplementedError exception.
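The pattern of unsupported methods raising an exception can be sketched as follows, assuming the exception meant is Python's built-in NotImplementedError. SketchBaseModelBackend is a hypothetical stand-in for BaseModelBackend that only mirrors the three method signatures listed above, and call_or_none is a hypothetical client-side helper.

```python
class SketchBaseModelBackend:
    """Hypothetical stand-in for BaseModelBackend: nothing is supported."""

    def fit(self, modelname, Xname, Yname=None, **kwargs):
        raise NotImplementedError

    def predict(self, modelname, Xname, rName=None, **kwargs):
        raise NotImplementedError

    def transform(self, modelname, Xname, rName=None, **kwargs):
        raise NotImplementedError


class PredictOnlyBackend(SketchBaseModelBackend):
    """A backend that overrides only the method it supports."""

    def predict(self, modelname, Xname, rName=None, **kwargs):
        return 'predicted {} with {}'.format(Xname, modelname)


def call_or_none(backend, method, *args, **kwargs):
    """Call a backend method, returning None if it is not supported."""
    try:
        return getattr(backend, method)(*args, **kwargs)
    except NotImplementedError:
        return None


backend = PredictOnlyBackend()
supported = call_or_none(backend, 'predict', 'mymodel', 'X')
unsupported = call_or_none(backend, 'transform', 'mymodel', 'X')
```

A backend written this way only needs to override the methods its model type actually supports; callers see a clear error for everything else.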