Storage backends¶
A storage backend can support additional data types for the
datasets, models, and jobs stores. All stores share the same backends because
they use the same implementation throughout. There are two types of backends:
data backends and model backends.
All storage backends are initialized to know their respective data and model
stores as self.data_store and self.model_store, respectively.
Accessing MongoDB¶
OmegaStore, accessible from a backend implementation as self.data_store,
provides several methods and properties to interact with MongoDB:
- self.data_store.metadata(name) - return the metadata object for the given object
- self.data_store.collection(name) - return the MongoDB Collection instance for the given object
- self.data_store.fs (property) - return the MongoDB GridFS instance
- self.data_store.mongodb (property) - return the MongoDB Database instance
Warning
A custom backend must not use any other means to access MongoDB, as doing so may cause unexpected side effects.
Generating unique names¶
To generate a unique name for an object that is compatible with MongoDB
collection and GridFS naming rules, use the self.data_store.object_store_key
method.
Note
The .collection method already uses object_store_key() to set the collection
name for a given object.
Choosing MongoDB GridFS or collection¶
If the data type of an object is some form of iterable, a MongoDB collection may be well suited to store the data.
# some class FooDataBackend(BaseDataBackend):
def put(self, obj, name, **kwargs):
    # store obj in collection
    collection = self.data_store.collection(name)
    collection.insert_many([dict(item) for item in obj])
    # create meta data
    meta = self.data_store.make_metadata(name, 'foo.rows')
    meta.collection = collection
    return meta.save()
Note
The dict(item) call may or may not be necessary. In general, for
MongoDB to be able to store the object, it must be BSON serializable. See
the tutorial on MongoDB documents and MongoDB type mapping.
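As a rough illustration of when dict(item) is not enough, the following sketch converts a few common row types into plain dicts before they would be passed to insert_many. The to_document helper and the Point and Reading types are hypothetical names introduced here, not part of omegaml or PyMongo. A plain dict is only a precondition: its values must still be BSON serializable.

```python
from collections import namedtuple
from dataclasses import dataclass, asdict, is_dataclass


def to_document(item):
    """Convert one row into a plain dict (hypothetical helper)."""
    if isinstance(item, dict):
        # copy so that later changes to the row do not affect the document
        return dict(item)
    if is_dataclass(item) and not isinstance(item, type):
        # dataclass instances are converted field by field
        return asdict(item)
    if hasattr(item, '_asdict'):
        # namedtuples expose their fields through _asdict()
        return dict(item._asdict())
    # fall back to dict(), which accepts mappings and key/value pairs
    return dict(item)


Point = namedtuple('Point', ['x', 'y'])


@dataclass
class Reading:
    sensor: str
    value: float


rows = [{'x': 1}, Point(2, 3), Reading('a', 0.5), [('k', 'v')]]
documents = [to_document(row) for row in rows]
```

Note that calling dict() directly on the namedtuple row would fail, which is why the helper checks for _asdict first.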
If the data type is of binary form, a GridFS file may be the better choice.
from io import BytesIO
from mongoengine.fields import GridFSProxy

# some class FooDataBackend(BaseDataBackend):
def put(self, obj, name, **kwargs):
    # store obj in gridfs
    filename = self.data_store.object_store_key(name, 'foo')
    buf = BytesIO(obj)
    fileid = self.data_store.fs.put(buf, filename=filename)
    # create meta data
    meta = self.data_store.make_metadata(name, 'foo.file')
    meta.gridfile = GridFSProxy(grid_id=fileid)
    return meta.save()
Note
The above code snippets only show the put method. Implement the get method
to retrieve the object from the object’s collection or GridFS file, as
indicated by meta.kind. It is the responsibility of the backend to apply
whatever data conversions are necessary, i.e., OmegaStore does not implement
any automatic conversions.
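A matching get method could look roughly like the sketch below, which dispatches on meta.kind using the 'foo.rows' and 'foo.file' kinds from the put examples. To keep it runnable without a database, FakeCollection, FakeMetadata and FakeDataStore are hypothetical in-memory stand-ins for the MongoDB collection, the metadata object and self.data_store. A BytesIO object stands in for the GridFS file. Real collections also return an _id field in each document, which a backend may want to strip.

```python
from io import BytesIO


class FakeCollection:
    """Hypothetical stand-in for a MongoDB collection."""

    def __init__(self, docs):
        self._docs = list(docs)

    def find(self):
        return iter(self._docs)


class FakeMetadata:
    """Hypothetical stand-in for the metadata object."""

    def __init__(self, kind, gridfile=None):
        self.kind = kind
        self.gridfile = gridfile


class FakeDataStore:
    """Hypothetical stand-in for self.data_store."""

    def __init__(self):
        self.metas = {}
        self.collections = {}

    def metadata(self, name):
        return self.metas.get(name)

    def collection(self, name):
        return self.collections[name]


class FooDataBackend:
    def __init__(self, data_store):
        self.data_store = data_store

    def get(self, name, **kwargs):
        meta = self.data_store.metadata(name)
        if meta is None:
            raise KeyError(name)
        if meta.kind == 'foo.rows':
            # rows were stored as documents in the object's collection
            collection = self.data_store.collection(name)
            return list(collection.find())
        if meta.kind == 'foo.file':
            # binary data was stored as a file; read it back as bytes
            return meta.gridfile.read()
        raise ValueError('unsupported kind: {}'.format(meta.kind))


store = FakeDataStore()
store.metas['rows'] = FakeMetadata('foo.rows')
store.collections['rows'] = FakeCollection([{'x': 1}, {'x': 2}])
store.metas['blob'] = FakeMetadata('foo.file', gridfile=BytesIO(b'abc'))

backend = FooDataBackend(store)
rows = backend.get('rows')
blob = backend.get('blob')
```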
Storing data outside MongoDB¶
OmegaStore is oblivious to the storage location of the actual data of
an object, as long as there is a backend that handles storing (put) and
retrieval (get). In other words, OmegaStore in combination with a backend
implementation can deal with arbitrary data and storage methods.
For data stored in MongoDB, Metadata.collection and Metadata.gridfile provide
the necessary pointers. For data stored outside MongoDB, Metadata.uri provides
an arbitrary URI that a backend can set (on put) and use for retrieval
(on get).
from io import BytesIO

# some class FooDataBackend(BaseDataBackend):
def put(self, obj, name, **kwargs):
    # store obj in some external file system
    filename = self.data_store.object_store_key(name, 'foo')
    buf = BytesIO(obj)
    # get an instance of the external file system and create the file URI
    # note the URI can be anything as long as your get method knows how
    # to dereference it
    foofs = ...
    fileid = foofs.put(buf, filename=filename)
    uri = 'foofs://{}'.format(fileid)
    # create meta data
    meta = self.data_store.make_metadata(name, 'foo.file')
    meta.uri = uri
    return meta.save()

def get(self, name, **kwargs):
    # get metadata and URI
    meta = self.data_store.metadata(name)
    uri = meta.uri
    # get object back using some service that understands this uri
    service = ...
    obj = service.get(uri)
    return obj
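One possible way to dereference such a URI is to look up a service by the URI scheme, as in the sketch below. InMemoryFooFS and dereference are hypothetical names used only for illustration, the foofs:// scheme is the made-up one from the example above, and the file id format is an assumption. Only urllib.parse.urlparse comes from the Python standard library.

```python
from urllib.parse import urlparse


class InMemoryFooFS:
    """Hypothetical external file system that keeps files in a dict."""

    def __init__(self):
        self._files = {}

    def put(self, data, filename):
        # assumed id format; a real service defines its own identifiers
        fileid = 'file{}'.format(len(self._files) + 1)
        self._files[fileid] = (filename, bytes(data))
        return fileid

    def get(self, fileid):
        return self._files[fileid][1]


def dereference(uri, services):
    """Resolve a URI by handing its id to the service for its scheme."""
    parts = urlparse(uri)
    try:
        service = services[parts.scheme]
    except KeyError:
        raise ValueError('no service registered for {}'.format(uri))
    # for 'foofs://<fileid>' the file id ends up in the netloc component
    return service.get(parts.netloc)


foofs = InMemoryFooFS()
fileid = foofs.put(b'hello', filename='mydata.foo')
uri = 'foofs://{}'.format(fileid)
data = dereference(uri, {'foofs': foofs})
```

Keeping the scheme-to-service mapping in one place lets a single backend support several external storage locations without changing how the URI is stored in the metadata.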
Data backend¶
A data backend minimally provides the put and get methods:
# some class FooDataBackend(BaseDataBackend):
def put(self, obj, name, **kwargs):
    # code to store the object
    ...
    # create or update the metadata object
    meta = self.data_store.metadata(name)
    if meta is None:
        # kind identifies the backend's data type, e.g. 'foo.rows'
        meta = self.data_store.make_metadata(name, kind)
    # always save the Metadata instance before returning
    return meta.save()

def get(self, name, **kwargs):
    # code to retrieve the object
    obj = ...
    return obj
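The create-or-update pattern can be exercised end to end with in-memory stand-ins, as in the following sketch. Metadata and InMemoryStore here are hypothetical simplifications of the real metadata object and data store, and the KIND attribute and the saved counter are assumptions added for illustration.

```python
class Metadata:
    """Hypothetical stand-in for the store's metadata object."""

    def __init__(self, store, name, kind):
        self.store = store
        self.name = name
        self.kind = kind
        self.saved = 0  # counts save() calls, for illustration only

    def save(self):
        self.store._metadata[self.name] = self
        self.saved += 1
        return self


class InMemoryStore:
    """Hypothetical stand-in for self.data_store."""

    def __init__(self):
        self._metadata = {}
        self.objects = {}

    def metadata(self, name):
        # returns None for unknown names, like the pattern above expects
        return self._metadata.get(name)

    def make_metadata(self, name, kind):
        return Metadata(self, name, kind)


class FooDataBackend:
    KIND = 'foo.rows'  # assumed kind identifier for this sketch

    def __init__(self, data_store):
        self.data_store = data_store

    def put(self, obj, name, **kwargs):
        # store the object, then create or update its metadata
        self.data_store.objects[name] = list(obj)
        meta = self.data_store.metadata(name)
        if meta is None:
            meta = self.data_store.make_metadata(name, self.KIND)
        return meta.save()

    def get(self, name, **kwargs):
        return self.data_store.objects[name]


store = InMemoryStore()
backend = FooDataBackend(store)
first = backend.put([1, 2], 'numbers')
second = backend.put([3, 4], 'numbers')
result = backend.get('numbers')
```

Storing under the same name twice reuses the existing metadata object instead of creating a second one, which is the point of checking for None before calling make_metadata.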
Model backend¶
Model backends store and retrieve instances of models (in the scikit-learn
sense of model persistence). In addition, they act as the model proxy used
by OmegaRuntime to perform arbitrary actions on a saved model using
named data objects.
At a minimum, OmegaRuntime expects the following actions to be available
on a saved model. These methods accept the modelname, Xname, and Yname
parameters, which must reference existing objects in the om.models
(modelname) and om.datasets (Xname, Yname) stores.
Note
Technically, these methods are called from a worker in the compute cluster without prior loading of either the model or the data. The worker uses om.models.get_backend() to retrieve the model’s backend, then calls the requested method. Thus it is the responsibility of the backend to retrieve the model and any data required.
- BaseModelBackend.fit(modelname, Xname, Yname=None, pure_python=True, **kwargs)¶
fit the model with data
- Parameters:
modelname – the name of the model object
Xname – the name of the X data set
Yname – the name of the Y data set
pure_python – if True return a python object. If False return a dataframe. Defaults to True to support any client.
kwargs – kwargs passed to the model’s fit method
- Returns:
return the meta data object of the model
- BaseModelBackend.predict(modelname, Xname, rName=None, pure_python=True, **kwargs)¶
predict using data stored in Xname
- Parameters:
modelname – the name of the model object
Xname – the name of the X data set
rName – the name of the result data object or None
pure_python – if True return a python object. If False return a dataframe. Defaults to True to support any client.
kwargs – kwargs passed to the model’s predict method
- Returns:
return the predicted outcome
- BaseModelBackend.transform(modelname, Xname, rName=None, **kwargs)¶
transform using data
- Parameters:
modelname – the name of the model object
Xname – the name of the X data set
rName – the name of the transform’s result data object or None
kwargs – kwargs passed to the model’s transform method
- Returns:
return the transform data of the model
This is in addition to the put and get methods required by
any storage backend.
Ideally and for user convenience, more methods should be supported,
see the reference on BaseModelBackend. Methods that are not supported
will raise a NotImplementedError exception.
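The following sketch illustrates how a model backend might resolve names to objects before delegating to the model, since the worker passes only names. DictStore, MeanModel and FooModelBackend are hypothetical and kept in memory so the example runs on its own. The return values are simplified: the stand-in put returns the object name rather than a metadata object, and predict returns the result even when rName is given.

```python
class DictStore:
    """Hypothetical stand-in for the om.models and om.datasets stores."""

    def __init__(self):
        self._objects = {}

    def put(self, obj, name):
        self._objects[name] = obj
        return name  # a real store would return a metadata object

    def get(self, name):
        return self._objects[name]


class MeanModel:
    """Toy model that predicts the mean of the Y values seen in fit."""

    def fit(self, X, Y=None):
        self.mean_ = sum(Y) / len(Y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class FooModelBackend:
    def __init__(self, model_store, data_store):
        self.model_store = model_store
        self.data_store = data_store

    def fit(self, modelname, Xname, Yname=None, pure_python=True, **kwargs):
        # the backend loads the model and the data itself, by name
        model = self.model_store.get(modelname)
        X = self.data_store.get(Xname)
        Y = self.data_store.get(Yname) if Yname else None
        model.fit(X, Y, **kwargs)
        # store the fitted model again so later calls can use it
        return self.model_store.put(model, modelname)

    def predict(self, modelname, Xname, rName=None, pure_python=True, **kwargs):
        model = self.model_store.get(modelname)
        X = self.data_store.get(Xname)
        result = model.predict(X, **kwargs)
        if rName:
            # keep the result as a named data object as well
            self.data_store.put(result, rName)
        return result


models, datasets = DictStore(), DictStore()
models.put(MeanModel(), 'mymodel')
datasets.put([[1], [2], [3]], 'X')
datasets.put([2, 4, 6], 'Y')

backend = FooModelBackend(models, datasets)
backend.fit('mymodel', 'X', 'Y')
predicted = backend.predict('mymodel', 'X', rName='pred')
```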