Out of core DataFrames¶
omega|ml supports working with pandas DataFrames that are larger than available memory by using MongoDB as its datastore. This section documents the features of omega|ml's MDataFrame.
Note
omega|ml takes a different approach from frameworks like dask. While dask focuses on distributed processing of a compute graph and strives for data locality, omega|ml focuses on scalable data persistence, data sharing and collaboration in teams of data scientists, and on providing a distributed API to microservices that require analytical services but lack the compute power themselves. That said, dask and omega|ml can be combined.
Features¶
omega|ml supports efficient persistence of and distributed access to the following pandas objects:
DataFrames
Series
HDFStores
omega|ml provides a functional API to MongoDB's aggregation and map/reduce framework, and supports storing geocoded data with geospatial search semantics (e.g. near, within).
In addition, omega|ml supports storage of and distributed access to:
scikit-learn models
Apache Spark models
Python container objects (dict, list, tuple)
Concepts¶
OmegaStore - a store is the persistence layer that mediates between Python/pandas and MongoDB. There are three types of stores: data store, model store and jobs (code) store.
Bucket - a bucket is a namespace within a MongoDB database. All objects stored by omega|ml reside within a bucket.
Prefix - the prefix is the multi/level/path prefix to an object stored in a bucket. Think of this as a hierarchical file system within a bucket.
Metadata - omega|ml manages its objects in the bucket's metadata collection.
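To make the relationship between buckets, prefixes and metadata concrete, here is a toy in-memory model. It is only a sketch for illustration: the ToyStore class and its put, get, list and in_bucket methods are hypothetical names invented here, and nothing in it reflects how omega|ml or MongoDB actually store data.

```python
from collections import defaultdict


class ToyStore:
    """Toy in-memory stand-in for a store; NOT omega|ml's implementation."""

    def __init__(self, bucket="default"):
        self.bucket = bucket
        # bucket name -> {object name (including prefix) -> metadata record}
        self._metadata = defaultdict(dict)
        # (bucket, object name) -> the stored object itself
        self._objects = {}

    def put(self, obj, name):
        # Record what was stored and under which name, then keep the object.
        meta = {"name": name, "kind": type(obj).__name__}
        self._metadata[self.bucket][name] = meta
        self._objects[(self.bucket, name)] = obj
        return meta

    def get(self, name):
        return self._objects[(self.bucket, name)]

    def list(self, prefix=""):
        # Names behave like paths, so a prefix selects a "directory".
        names = self._metadata[self.bucket]
        return sorted(n for n in names if n.startswith(prefix))

    def in_bucket(self, bucket):
        # A view on another bucket that shares the same backing storage.
        other = ToyStore(bucket)
        other._metadata, other._objects = self._metadata, self._objects
        return other


store = ToyStore()
store.put([1, 2, 3], "sales/2023/raw")
store.put({"a": 1}, "sales/2023/summary")
store.put((4, 5), "hr/headcount")

team = store.in_bucket("team-a")
team.put([9], "sales/2023/raw")  # same name, different bucket: no clash

print(store.list("sales/"))        # names under the sales/ prefix, default bucket
print(team.get("sales/2023/raw"))  # the object stored in the team-a bucket
```

The point of the sketch is that an object is identified by its bucket together with its full name, so the same name can exist in two buckets, and that listing by prefix works like browsing a directory.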
Access an MDataFrame¶
To access an out-of-core MDataFrame you need to put some data into MongoDB:
In : import omegaml as om
import pandas as pd
df = pd.DataFrame({'x': range(10)})
om.datasets.put(df, 'foo')
Out: <Metadata: Metadata(kind=pandas.dfrows,name=foo, ...)
This stores the df DataFrame under the name foo. We can get it back just as quickly:
In : om.datasets.get('foo')
Out:
x
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
We get back a standard pd.DataFrame:
In : type(om.datasets.get('foo'))
Out: pandas.core.frame.DataFrame
This is omega|ml's default behavior: it returns the same data that it received.
Filtering data¶
In :