Out of core DataFrames¶
omegaml supports working with pandas DataFrames that are larger than available memory by leveraging mongodb as its datastore. This documents the available features supported by omegaml’s MDataFrame.
Note
omegaml takes a different approach to other frameworks like dask. While dask is focused on distributed processing of a compute-graph and strives for data-locality, omemgal’s focus is on scalable data-persistence, data-sharing and collaboration in teams of data scientists and to provide a distributed API to micro-services that require analytical services but do not themselves have the required compute power. This said, the power of dask and omegaml can be combined.
Features¶
omegaml supports efficient persistency of and distributed access to the following Pandas objects:
DataFrames
Series
HDFStores
omegaml provides a functional API to MongoDB’s aggregation and map/reduce framework, as well as storing geocoded data with geospatial search semantics (e.g. near, within).
In addition omegaml supports the storage and distributed access to
scikit-learn models
Apache Spark models
Python container objects (dict, list, tuples)
Concepts¶
OmegaStore - a store is the persistence layer that mediates between Python/Pandas and Mongodb. There are three types of stores: data store, model store and jobs (code) store.
Bucket - a bucket is a namespace within a Mongo database. All objects stored by omegaml reside within a bucket
Prefix - the prefix is the multi/level/path prefix to an object stored in a bucket. Think of this as a hierarchical file system within a bucket
Metadata - omegaml manages its objects in the bucket’s metadata collection
Access an MDataFrame¶
To access an out-of-core MDataFrame you need to put some data into MongoDB:
In : import omegaml as om
import pandas as pd
df = pd.DataFrame({'x': range(10)})
om.datasets.put(df, 'foo')
Out: <Metadata: Metadata(kind=pandas.dfrows,name=foo, ...)
This stores the df
DataFrame with name foo
. We can get it
back just as quickly:
In : om.datasets.get('foo')
Out:
x
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
You may wonder what we just got back? It’s a standard pandas DataFrame:
In : type(om.datasets.get('foo')
Out: pandas.core.frame.DataFrame
This omegaml’s default behavior: it returns the same data that it received.
Filtering data¶
In :