Out-of-core DataFrames

omega|ml supports working with pandas DataFrames that are larger than available memory by using MongoDB as its datastore. This section documents the features of omega|ml's MDataFrame.

Note

omega|ml takes a different approach than other frameworks such as dask. While dask focuses on distributed processing of a compute graph and strives for data locality, omega|ml focuses on scalable data persistence, on data sharing and collaboration in teams of data scientists, and on providing a distributed API to micro-services that require analytical services but do not themselves have the required compute power. That said, dask and omega|ml can be combined.

Features

omega|ml supports efficient persistence of and distributed access to the following pandas objects:

  • DataFrames

  • Series

  • HDFStores

omega|ml provides a functional API to MongoDB's aggregation and map/reduce framework. It can also store geocoded data with geospatial search semantics (e.g. near, within).

In addition, omega|ml supports storage of and distributed access to:

  • scikit-learn models

  • Apache Spark models

  • Python container objects (dict, list, tuples)

Concepts

  • OmegaStore - a store is the persistence layer that mediates between Python/pandas and MongoDB. There are three types of stores: the data store, the model store and the jobs (code) store.

  • Bucket - a bucket is a namespace within a MongoDB database. All objects stored by omega|ml reside within a bucket.

  • Prefix - the prefix is the multi/level/path prefix to an object stored in a bucket. Think of this as a hierarchical file system within a bucket.

  • Metadata - omega|ml manages its objects in the bucket's metadata collection.
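The relationship between buckets, prefixes and object names can be pictured with a small toy model. This is only an illustrative sketch in plain Python: the `buckets` dict and the helpers `split_object_path` and `list_objects` are hypothetical names introduced here and are not part of omega|ml, and the code says nothing about how omega|ml lays out its collections in MongoDB.

```python
from pathlib import PurePosixPath

# Hypothetical toy model: each bucket is an independent namespace that
# holds object names; a name may carry a multi/level/path prefix.
buckets = {
    'default': ['foo', 'sales/2020/q1', 'sales/2020/q2', 'sales/2021/q1'],
    'team-a': ['foo'],  # same name as in 'default', but a separate object
}


def split_object_path(path):
    """Split 'multi/level/path/name' into (prefix, name)."""
    p = PurePosixPath(path)
    parent = str(p.parent)
    # A name without any path component has an empty prefix
    prefix = '' if parent == '.' else parent + '/'
    return prefix, p.name


def list_objects(bucket, prefix=''):
    """List the object names in a bucket that start with the given prefix."""
    return sorted(name for name in buckets[bucket] if name.startswith(prefix))


print(split_object_path('sales/2020/q1'))
print(list_objects('default', 'sales/2020/'))
```

In this model, listing by prefix behaves like listing a directory in a file system, and the two `foo` entries do not collide because they live in different buckets.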

Access an MDataFrame

To access an out-of-core MDataFrame you need to put some data into MongoDB:

In : import omegaml as om
     import pandas as pd

     df = pd.DataFrame({'x': range(10)})
     om.datasets.put(df, 'foo')

Out: <Metadata: Metadata(kind=pandas.dfrows,name=foo, ...)

This stores the DataFrame df under the name foo. We can get it back just as easily:

In : om.datasets.get('foo')

Out:
            x
        0   0
        1   1
        2   2
        3   3
        4   4
        5   5
        6   6
        7   7
        8   8
        9   9

We get back a standard pd.DataFrame:

In : type(om.datasets.get('foo'))

Out: pandas.core.frame.DataFrame

This is omega|ml's default behavior: it returns the same data that it received.
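One way to check this round-trip property is to compare the stored and retrieved frames with pandas' own testing helper. The sketch below is an assumption-laden illustration: `InMemoryStore` is a hypothetical stand-in that only mimics the `put(obj, name)` and `get(name)` calls shown above, so the example runs without MongoDB, and `roundtrip_equal` is a helper name introduced here, not an omega|ml function.

```python
import pandas as pd


class InMemoryStore:
    """Hypothetical stand-in for a store with put/get; not part of omega|ml."""

    def __init__(self):
        self._objects = {}

    def put(self, obj, name):
        # Keep a copy so later changes to the caller's frame do not leak in
        self._objects[name] = obj.copy()
        return name

    def get(self, name):
        return self._objects[name].copy()


def roundtrip_equal(store, df, name):
    """Return True if get(name) yields a frame equal to the one that was put."""
    store.put(df, name)
    restored = store.get(name)
    try:
        # Compares values, dtypes, index and column labels
        pd.testing.assert_frame_equal(df, restored)
    except AssertionError:
        return False
    return True


df = pd.DataFrame({'x': range(10)})
print(roundtrip_equal(InMemoryStore(), df, 'foo'))
```

With a running omega|ml setup, passing `om.datasets` in place of the stand-in should work along the same lines, assuming the chosen name is not already in use in the store.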

Filtering data

In :