omegaml.mdataframe¶

class omegaml.mdataframe.MDataFrame(collection, columns=None, query=None, limit=None, skip=None, sort_order=None, force_columns=None, immediate_loc=False, auto_inspect=False, normalize=False, raw=False, parser=None, preparefn=None, from_loc_range=False, metadata=None, **kwargs)¶

A DataFrame for mongodb

Performs out-of-core, lazy computation on a MongoDB cluster. Behaves like a pandas DataFrame. Actual results are returned as pandas DataFrames.
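The idea of lazy evaluation can be sketched without MongoDB. The class below is a hypothetical stand-in (LazyFrame is not part of omegaml, and equality-only filtering is a simplification): it records filter, skip and limit options and only touches the data when .value is read.

```python
# Hypothetical illustration of lazy evaluation; NOT omegaml's implementation.
class LazyFrame:
    """Collects query options and touches the data only when .value is read."""

    def __init__(self, rows, criteria=None, limit=None, skip=0):
        self._rows = rows                # stands in for a MongoDB collection
        self._criteria = criteria or {}  # equality filters, all ANDed
        self._limit = limit
        self._skip = skip

    def query(self, **kwargs):
        # Return a new object that shares the same underlying rows.
        merged = {**self._criteria, **kwargs}
        return LazyFrame(self._rows, merged, self._limit, self._skip)

    def skip(self, topn):
        return LazyFrame(self._rows, self._criteria, self._limit, topn)

    def head(self, limit=10):
        return LazyFrame(self._rows, self._criteria, limit, self._skip)

    @property
    def value(self):
        # Only here is the data actually filtered and sliced.
        matched = [r for r in self._rows
                   if all(r.get(k) == v for k, v in self._criteria.items())]
        end = None if self._limit is None else self._skip + self._limit
        return matched[self._skip:end]


rows = [{"x": i, "even": i % 2 == 0} for i in range(10)]
lazy = LazyFrame(rows).query(even=True).skip(1).head(2)  # nothing computed yet
result = lazy.value                                      # computed here
```

Chaining builds up a description of the query. The rows are read once, at the end, which is the property that allows the real class to push the work to the database.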

__len__()¶: the projected number of rows when resolving

create_index(keys, **kwargs)¶: create an index the easy way

groupby(columns, sort=True)¶

Group by a given set of columns

Parameters:

columns – the list of columns
sort – if True sort by group key

Returns:

MGrouper

head(limit=10)¶

return up to limit number of rows

Parameters:: limit – the number of rows to return. Defaults to 10
Returns:: the MDataFrame

inspect(explain=False, cached=False, cursor=None, raw=False)¶

inspect this dataframe’s actual mongodb query

Parameters:: explain – if True explains access path

property loc¶

Access by index

Use as mdf.loc[index_value]

Returns:: MLocIndexer

merge(right, on=None, left_on=None, right_on=None, how='inner', target=None, suffixes=('_x', '_y'), sort=False, inspect=False, filter=None)¶

merge this dataframe with another dataframe. Only left outer joins are currently supported. The output is saved as a new collection named by target (a generated name is used if target is not specified).

Parameters:

right – the other MDataFrame
on – the list of key columns to merge by
left_on – the list of the key columns to merge on this dataframe
right_on – the list of the key columns to merge on the other dataframe
how – the method to merge. supported are left, inner, right. Defaults to inner
target – the name of the collection to store the merge results in. If not provided a temporary name will be created.
suffixes – the suffixes to apply to identical left and right columns
sort – if True the merge results will be sorted. If False the MongoDB natural order is implied.

Returns:

an MDataFrame on the target collection

query(*args, **kwargs)¶

return a new MDataFrame with a filter criteria

Any subsequent operation on the new dataframe will have the filter applied. To reset the filter call .reset() without arguments.

Note: Unlike pandas DataFrames, a filtered MDataFrame operates on the same collection as the original DataFrame

Parameters:

args – a Q object or logical combination of Q objects (optional)
kwargs – all AND filter criteria

Returns:

a new MDataFrame with the filter applied
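As a rough illustration of "all AND filter criteria", the sketch below combines keyword arguments into one MongoDB-style filter document. The helper name kwargs_to_filter is hypothetical and it handles equality only; it is not the translation MDataFrame performs internally.

```python
# Hypothetical helper; shows how keyword criteria could be ANDed together.
def kwargs_to_filter(**kwargs):
    """Combine keyword criteria into a single MongoDB-style AND filter."""
    clauses = [{column: value} for column, value in kwargs.items()]
    if not clauses:
        return {}            # no criteria: match everything
    if len(clauses) == 1:
        return clauses[0]    # a single criterion needs no $and wrapper
    return {"$and": clauses}


flt = kwargs_to_filter(city="Zurich", year=2020)
```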

query_inplace(*args, **kwargs)¶

filters this MDataFrame and returns it.

Any subsequent operation on the dataframe will have the filter applied. To reset the filter call .reset() without arguments.

Parameters:

args – a Q object or logical combination of Q objects (optional)
kwargs – all AND filter criteria

Returns:

self

skip(topn)¶

skip the topn number of rows

Parameters:: topn – the number of rows to skip.
Returns:: the MDataFrame

sort(columns)¶

sort by specified columns

Parameters:: columns – str of single column or a list of columns. Sort order is specified as the + (ascending) or - (descending) prefix to the column name. Default sort order is ascending.
Returns:: the MDataFrame
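The prefix convention described above can be sketched as a small parser. The function parse_sort_spec is a hypothetical name; the 1 and -1 directions follow the usual MongoDB ascending and descending values.

```python
# Hypothetical parser for the "+column" / "-column" sort convention.
def parse_sort_spec(columns):
    """Turn '+a' / '-b' style names into (column, direction) pairs."""
    if isinstance(columns, str):
        columns = [columns]          # accept a single column name
    pairs = []
    for name in columns:
        if name.startswith("-"):
            pairs.append((name[1:], -1))   # descending
        elif name.startswith("+"):
            pairs.append((name[1:], 1))    # explicit ascending
        else:
            pairs.append((name, 1))        # default is ascending
    return pairs


spec = parse_sort_spec(["-x", "y", "+z"])
```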

property value¶

resolve the query and return a Pandas DataFrame

Returns:: the result of the query as a pandas DataFrame

class omegaml.mdataframe.MSeries(*args, **kwargs)¶

Series implementation for MDataFrames

behaves like a DataFrame but limited to one column.

__len__()¶: the projected number of rows when resolving

count()¶: projected number of rows when resolving

create_index(keys, **kwargs)¶: create an index the easy way

groupby(columns, sort=True)¶

Group by a given set of columns

Parameters:

columns – the list of columns
sort – if True sort by group key

Returns:

MGrouper

head(limit=10)¶

return up to limit number of rows

Parameters:: limit – the number of rows to return. Defaults to 10
Returns:: the MDataFrame

inspect(explain=False, cached=False, cursor=None, raw=False)¶

inspect this dataframe’s actual mongodb query

Parameters:: explain – if True explains access path

iterchunks(chunksize=100)¶

return an iterator

Parameters:: chunksize (int) – number of rows in each chunk
Returns:: a dataframe of max. length chunksize
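The chunking behavior can be pictured with a plain generator over a list. This is a minimal sketch under the assumption that chunks are consecutive and that only the last one may be shorter; iter_chunks is a hypothetical name and the rows are ordinary Python objects, not a MongoDB cursor.

```python
# Hypothetical generator; yields consecutive slices of at most chunksize rows.
def iter_chunks(rows, chunksize=100):
    for start in range(0, len(rows), chunksize):
        yield rows[start:start + chunksize]


sizes = [len(chunk) for chunk in iter_chunks(list(range(250)), chunksize=100)]
```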

list_indexes()¶: list all indices in database

property loc¶

Access by index

Use as mdf.loc[index_value]

Returns:: MLocIndexer

merge(right, on=None, left_on=None, right_on=None, how='inner', target=None, suffixes=('_x', '_y'), sort=False, inspect=False, filter=None)¶

merge this dataframe with another dataframe. Only left outer joins are currently supported. The output is saved as a new collection named by target (a generated name is used if target is not specified).

Parameters:

right – the other MDataFrame
on – the list of key columns to merge by
left_on – the list of the key columns to merge on this dataframe
right_on – the list of the key columns to merge on the other dataframe
how – the method to merge. supported are left, inner, right. Defaults to inner
target – the name of the collection to store the merge results in. If not provided a temporary name will be created.
suffixes – the suffixes to apply to identical left and right columns
sort – if True the merge results will be sorted. If False the MongoDB natural order is implied.

Returns:

an MDataFrame on the target collection

query(*args, **kwargs)¶

return a new MDataFrame with a filter criteria

Any subsequent operation on the new dataframe will have the filter applied. To reset the filter call .reset() without arguments.

Note: Unlike pandas DataFrames, a filtered MDataFrame operates on the same collection as the original DataFrame

Parameters:

args – a Q object or logical combination of Q objects (optional)
kwargs – all AND filter criteria

Returns:

a new MDataFrame with the filter applied

query_inplace(*args, **kwargs)¶

filters this MDataFrame and returns it.

Any subsequent operation on the dataframe will have the filter applied. To reset the filter call .reset() without arguments.

Parameters:

args – a Q object or logical combination of Q objects (optional)
kwargs – all AND filter criteria

Returns:

self

property shape¶: return shape of dataframe

skip(topn)¶

skip the topn number of rows

Parameters:: topn – the number of rows to skip.
Returns:: the MDataFrame

sort(columns)¶

sort by specified columns

Parameters:: columns – str of single column or a list of columns. Sort order is specified as the + (ascending) or - (descending) prefix to the column name. Default sort order is ascending.
Returns:: the MDataFrame

tail(limit=10)¶

return up to limit number of rows from last inserted values

Parameters:: limit –
Returns:

unique()¶

return the unique set of values for the series

Returns:: MSeries

property value¶

return the value of the series

this is a Series unless unique() was called. If unique() was called, only distinct values are returned as an array, matching the behavior of a pandas Series

Returns:: pandas.Series

class omegaml.mdataframe.MGrouper(mdataframe, collection, columns, sort=True)¶

a Grouper for MDataFrames

agg(specs)¶: shortcut for .aggregate

aggregate(specs, **kwargs)¶

aggregate by given specs

See the following link for a list of supported operations. https://docs.mongodb.com/manual/reference/operator/aggregation/group/

Parameters:: specs – a dictionary of { column : function | list[functions] } pairs.
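To make the shape of specs concrete, the sketch below turns a { column : function | list[functions] } dictionary into a MongoDB $group stage. The helper group_stage and the column_function naming of output fields are assumptions for illustration; the stage MGrouper actually generates may differ.

```python
# Hypothetical translation of an aggregate spec into a $group stage.
def group_stage(group_columns, specs):
    """Build a $group stage from group columns and {column: function(s)}."""
    stage = {"_id": {col: f"${col}" for col in group_columns}}
    for column, funcs in specs.items():
        if isinstance(funcs, str):
            funcs = [funcs]                      # allow a single function name
        for func in funcs:
            # e.g. x with "sum" becomes {"x_sum": {"$sum": "$x"}}
            stage[f"{column}_{func}"] = {f"${func}": f"${column}"}
    return {"$group": stage}


stage = group_stage(["a"], {"x": ["sum", "avg"], "y": "max"})
```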

count()¶: return counts by group columns

class omegaml.mdataframe.MLocIndexer(mdataframe, positional=False)¶

implements the LocIndexer for MDataFrames

__getitem__(specs)¶

access by index

use as mdf.loc[specs] where specs is any of

a list or tuple of scalar index values, e.g. .loc[(1,2,3)]
a slice of values e.g. .loc[1:5]
a list of slices, e.g. .loc[1:5, 2:3]

Returns:: the sliced part of the MDataFrame
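One way to picture index access is as a translation from the spec into a filter on an index field. The sketch below assumes a single index stored in a field named _idx (a hypothetical name) and inclusive slice bounds, as in label-based pandas .loc; neither assumption is confirmed by the text above.

```python
# Hypothetical translation of a .loc spec into a MongoDB-style filter.
def loc_filter(spec, index_column="_idx"):
    if isinstance(spec, slice):
        cond = {}
        if spec.start is not None:
            cond["$gte"] = spec.start
        if spec.stop is not None:
            cond["$lte"] = spec.stop         # assumed inclusive, like pandas .loc
        return {index_column: cond}
    if isinstance(spec, (list, tuple)):
        return {index_column: {"$in": list(spec)}}   # several scalar values
    return {index_column: spec}                      # one scalar value


by_slice = loc_filter(slice(1, 5))
by_values = loc_filter((1, 2, 3))
```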

class omegaml.mdataframe.MPosIndexer(mdataframe)¶

implements the position-based indexer for MDataFrames

__getitem__(specs)¶

access by index

use as mdf.loc[specs] where specs is any of

a list or tuple of scalar index values, e.g. .loc[(1,2,3)]
a slice of values e.g. .loc[1:5]
a list of slices, e.g. .loc[1:5, 2:3]

Returns:: the sliced part of the MDataFrame

class omegaml.mixins.mdf.ApplyContext(caller, columns=None, index=None)¶

Enable apply functions

.apply(fn) will call fn(ctx) where ctx is an ApplyContext. The context supports methods to apply functions in a Pandas-style apply manner. ApplyContext is extensible by adding an extension class to defaults.OMEGA_MDF_APPLY_MIXINS.

Note that unlike a Pandas DataFrame, ApplyContext does not itself contain any data. Rather it is part of an expression tree, i.e. the aggregation pipeline. Thus any expressions applied are translated into operations on the expression tree. The expression tree is evaluated on MDataFrame.value, at which point neither the ApplyContext nor the function that created it is active.

Examples:

mdf.apply(lambda v: v * 5 ) => multiply every column in dataframe
mdf.apply(lambda v: v['foo'].dt.week) => get week of date for column foo
mdf.apply(lambda v: dict(a=v['foo'].dt.week,
                         b=v['bar'] * 5)) => run multiple pipelines and get results

The callable passed to apply can be any function. It can either return None,
the context passed in or a list of pipeline stages.

# apply any of the below functions
mdf.apply(customfn)

# same as lambda v: v.dt.week
def customfn(ctx):
    return ctx.dt.week

# simple pipeline
def customfn(ctx):
    ctx.project(x={'$multiply': ['$x', 5]})
    ctx.project(y={'$divide': ['$x', 2]})

# complex pipeline
def customfn(ctx):
    return [
        { '$match': ... },
        { '$project': ... },
    ]

class omegaml.mixins.mdf.ApplyArithmetics¶

Math operators for ApplyContext

__mul__ (*)
__add__ (+)
__sub__ (-)
__div__ (/)
__floordiv__ (//)
__mod__ (%)
__pow__ (pow)
__ceil__ (ceil)
__floor__ (floor)
__trunc__ (trunc)
__abs__ (abs)
sqrt (math.sqrt)

__add__(other)¶: add

__mul__(other)¶: multiply
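How operators can build an expression instead of computing a value may be sketched as follows. Expr is a hypothetical class, not omegaml's ApplyArithmetics: each operator returns a new node that wraps a MongoDB aggregation expression ($multiply, $add), and no data is touched.

```python
# Hypothetical expression node; operators record work instead of doing it.
class Expr:
    def __init__(self, node):
        self.node = node                 # a field reference or nested expression

    def _binary(self, op, other):
        rhs = other.node if isinstance(other, Expr) else other
        return Expr({op: [self.node, rhs]})

    def __mul__(self, other):
        return self._binary("$multiply", other)

    def __add__(self, other):
        return self._binary("$add", other)


x = Expr("$x")
expr = (x * 5) + 1      # builds a tree; nothing is evaluated
```

The resulting expr.node is a nested dictionary that could be placed inside a $project stage and evaluated later by the database.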

class omegaml.mixins.mdf.ApplyDateTime¶: Datetime operators for ApplyContext

class omegaml.mixins.mdf.ApplyString¶: String operators

class omegaml.mixins.mdf.ApplyAccumulators¶