omegaml.mdataframe¶
- class omegaml.mdataframe.MDataFrame(collection, columns=None, query=None, limit=None, skip=None, sort_order=None, force_columns=None, immediate_loc=False, auto_inspect=False, normalize=False, raw=False, parser=None, preparefn=None, from_loc_range=False, metadata=None, **kwargs)¶
A DataFrame for mongodb
Performs out-of-core, lazy computation on a mongodb cluster. Behaves like a pandas DataFrame. Actual results are returned as pandas DataFrames.
- __len__()¶
the projected number of rows when resolving
- create_index(keys, **kwargs)¶
create an index the easy way
- groupby(columns, sort=True)¶
Group by a given set of columns
- Parameters:
columns – the list of columns
sort – if True sort by group key
- Returns:
MGrouper
- head(limit=10)¶
return up to limit rows
- Parameters:
limit – the number of rows to return. Defaults to 10
- Returns:
the MDataFrame
- inspect(explain=False, cached=False, cursor=None, raw=False)¶
inspect this dataframe’s actual mongodb query
- Parameters:
explain – if True explains access path
- property loc¶
Access by index
Use as mdf.loc[index_value]
- Returns:
MLocIndexer
- merge(right, on=None, left_on=None, right_on=None, how='inner', target=None, suffixes=('_x', '_y'), sort=False, inspect=False, filter=None)¶
merge this dataframe with another dataframe. only left outer joins are currently supported. the output is saved as a new collection named by target (a generated name is used if target is not specified).
- Parameters:
right – the other MDataFrame
on – the list of key columns to merge by
left_on – the list of the key columns to merge on this dataframe
right_on – the list of the key columns to merge on the other dataframe
how – the method to merge. supported are left, inner, right. Defaults to inner
target – the name of the collection to store the merge results in. If not provided a temporary name will be created.
suffixes – the suffixes to apply to identical left and right columns
sort – if True the merge results will be sorted. If False the MongoDB natural order is implied.
- Returns:
the MDataFrame for the target collection
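As a rough illustration of what a left outer join with suffixes produces, the following sketch joins two lists of dicts in plain Python. It does not use omegaml or MongoDB: left_join is a hypothetical helper, and leaving unmatched right-hand columns absent (rather than filling them with a placeholder) is an assumption of this sketch, not documented merge behavior.

```python
def left_join(left, right, on, suffixes=('_x', '_y')):
    """Hypothetical left outer join on lists of dicts (not omegaml code)."""
    # index the right-hand rows by their key value
    index = {}
    for row in right:
        index.setdefault(row[on], []).append(row)

    # columns present on both sides (other than the key) get a suffix
    left_cols = {c for r in left for c in r}
    right_cols = {c for r in right for c in r}
    overlap = (left_cols & right_cols) - {on}

    def rename(row, suffix):
        return {(k + suffix if k in overlap else k): v for k, v in row.items()}

    result = []
    for lrow in left:
        matches = index.get(lrow[on])
        if not matches:
            # left outer join: keep the left row even without a match
            result.append(rename(lrow, suffixes[0]))
            continue
        for rrow in matches:
            merged = rename(lrow, suffixes[0])
            merged.update({k: v for k, v in rename(rrow, suffixes[1]).items()
                           if k != on})
            result.append(merged)
    return result


left_rows = [{'id': 1, 'v': 10}, {'id': 2, 'v': 20}]
right_rows = [{'id': 1, 'v': 100}]
joined = left_join(left_rows, right_rows, on='id')
```

Every left row survives the join, and the shared column v is disambiguated by the suffixes.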
- query(*args, **kwargs)¶
return a new MDataFrame with the given filter criteria
Any subsequent operation on the new dataframe will have the filter applied. To reset the filter call .reset() without arguments.
Note: Unlike pandas DataFrames, a filtered MDataFrame operates on the same collection as the original DataFrame
- Parameters:
args – a Q object or logical combination of Q objects (optional)
kwargs – all AND filter criteria
- Returns:
a new MDataFrame with the filter applied
- query_inplace(*args, **kwargs)¶
filters this MDataFrame and returns it.
Any subsequent operation on the dataframe will have the filter applied. To reset the filter call .reset() without arguments.
- Parameters:
args – a Q object or logical combination of Q objects (optional)
kwargs – all AND filter criteria
- Returns:
self
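The difference between query (returns a new object) and query_inplace (returns self) can be sketched with a small stand-in class. LazyFrame below is hypothetical and only mimics the documented return behavior on an in-memory list; it is not how MDataFrame is implemented, and its equality-only kwargs matching and the return value of reset are assumptions.

```python
class LazyFrame:
    """Hypothetical stand-in for a lazily filtered frame (not omegaml code)."""

    def __init__(self, collection, criteria=None):
        self.collection = collection        # shared between frames, never copied
        self.criteria = dict(criteria or {})

    def query(self, **kwargs):
        # new object with the combined AND criteria; self stays unchanged
        return LazyFrame(self.collection, {**self.criteria, **kwargs})

    def query_inplace(self, **kwargs):
        # same object, criteria are added to it
        self.criteria.update(kwargs)
        return self

    def reset(self):
        self.criteria = {}
        return self

    @property
    def value(self):
        # the filter is only evaluated here, when the result is resolved
        return [row for row in self.collection
                if all(row.get(k) == v for k, v in self.criteria.items())]


rows = [{'x': 1, 'y': 'a'}, {'x': 2, 'y': 'b'}, {'x': 1, 'y': 'c'}]
base = LazyFrame(rows)
filtered = base.query(x=1)             # base is still unfiltered
same = base.query_inplace(y='a')       # base itself is now filtered
```

Both frames keep pointing at the same underlying collection; only the stored criteria differ.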
- skip(topn)¶
skip the first topn rows
- Parameters:
topn – the number of rows to skip.
- Returns:
the MDataFrame
- sort(columns)¶
sort by specified columns
- Parameters:
columns – str of single column or a list of columns. Sort order is specified as the + (ascending) or - (descending) prefix to the column name. Default sort order is ascending.
- Returns:
the MDataFrame
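The +/- prefix convention can be illustrated by translating a column spec into (column, direction) pairs of the kind MongoDB sort calls accept. parse_sort_spec is a hypothetical helper written for this example, not part of omegaml; the direction values 1 and -1 match pymongo's ASCENDING and DESCENDING constants.

```python
ASCENDING, DESCENDING = 1, -1  # same values as pymongo.ASCENDING / DESCENDING


def parse_sort_spec(columns):
    """Hypothetical helper: turn '+col' / '-col' specs into sort pairs."""
    if isinstance(columns, str):
        columns = [columns]
    spec = []
    for col in columns:
        if col.startswith('-'):
            spec.append((col[1:], DESCENDING))
        elif col.startswith('+'):
            spec.append((col[1:], ASCENDING))
        else:
            # no prefix: ascending is the default
            spec.append((col, ASCENDING))
    return spec


sort_pairs = parse_sort_spec(['-price', '+name', 'id'])
```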
- property value¶
resolve the query and return a Pandas DataFrame
- Returns:
the result of the query as a pandas DataFrame
- class omegaml.mdataframe.MSeries(*args, **kwargs)¶
Series implementation for MDataFrames
behaves like a DataFrame but limited to one column.
- __len__()¶
the projected number of rows when resolving
- count()¶
projected number of rows when resolving
- create_index(keys, **kwargs)¶
create an index the easy way
- groupby(columns, sort=True)¶
Group by a given set of columns
- Parameters:
columns – the list of columns
sort – if True sort by group key
- Returns:
MGrouper
- head(limit=10)¶
return up to limit rows
- Parameters:
limit – the number of rows to return. Defaults to 10
- Returns:
the MDataFrame
- inspect(explain=False, cached=False, cursor=None, raw=False)¶
inspect this dataframe’s actual mongodb query
- Parameters:
explain – if True explains access path
- iterchunks(chunksize=100)¶
return an iterator
- Parameters:
chunksize (int) – number of rows in each chunk
- Returns:
a dataframe of max. length chunksize
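The chunking contract (each chunk holds at most chunksize rows, the last one may be shorter) can be sketched over any iterable. iter_chunks below is a hypothetical plain-Python analogue that yields lists rather than dataframes; it does not touch MongoDB.

```python
from itertools import islice


def iter_chunks(rows, chunksize=100):
    """Hypothetical analogue of chunked iteration: yield lists of rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunksize))
        if not chunk:
            return          # source exhausted
        yield chunk


chunks = list(iter_chunks(range(5), chunksize=2))
```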
- list_indexes()¶
list all indices in database
- property loc¶
Access by index
Use as mdf.loc[index_value]
- Returns:
MLocIndexer
- merge(right, on=None, left_on=None, right_on=None, how='inner', target=None, suffixes=('_x', '_y'), sort=False, inspect=False, filter=None)¶
merge this dataframe with another dataframe. only left outer joins are currently supported. the output is saved as a new collection named by target (a generated name is used if target is not specified).
- Parameters:
right – the other MDataFrame
on – the list of key columns to merge by
left_on – the list of the key columns to merge on this dataframe
right_on – the list of the key columns to merge on the other dataframe
how – the method to merge. supported are left, inner, right. Defaults to inner
target – the name of the collection to store the merge results in. If not provided a temporary name will be created.
suffixes – the suffixes to apply to identical left and right columns
sort – if True the merge results will be sorted. If False the MongoDB natural order is implied.
- Returns:
the MDataFrame for the target collection
- query(*args, **kwargs)¶
return a new MDataFrame with the given filter criteria
Any subsequent operation on the new dataframe will have the filter applied. To reset the filter call .reset() without arguments.
Note: Unlike pandas DataFrames, a filtered MDataFrame operates on the same collection as the original DataFrame
- Parameters:
args – a Q object or logical combination of Q objects (optional)
kwargs – all AND filter criteria
- Returns:
a new MDataFrame with the filter applied
- query_inplace(*args, **kwargs)¶
filters this MDataFrame and returns it.
Any subsequent operation on the dataframe will have the filter applied. To reset the filter call .reset() without arguments.
- Parameters:
args – a Q object or logical combination of Q objects (optional)
kwargs – all AND filter criteria
- Returns:
self
- property shape¶
return shape of dataframe
- skip(topn)¶
skip the first topn rows
- Parameters:
topn – the number of rows to skip.
- Returns:
the MDataFrame
- sort(columns)¶
sort by specified columns
- Parameters:
columns – str of single column or a list of columns. Sort order is specified as the + (ascending) or - (descending) prefix to the column name. Default sort order is ascending.
- Returns:
the MDataFrame
- tail(limit=10)¶
return up to limit rows from the last inserted values
- Parameters:
limit – the number of rows to return. Defaults to 10
- Returns:
- unique()¶
return the unique set of values for the series
- Returns:
MSeries
- property value¶
return the value of the series
this is a Series unless unique() was called. If unique() was called, only distinct values are returned as an array, matching the behavior of a pandas Series
- Returns:
pandas.Series
- class omegaml.mdataframe.MGrouper(mdataframe, collection, columns, sort=True)¶
a Grouper for MDataFrames
- agg(specs)¶
shortcut for .aggregate
- aggregate(specs, **kwargs)¶
aggregate by given specs
See the following link for a list of supported operations. https://docs.mongodb.com/manual/reference/operator/aggregation/group/
- Parameters:
specs – a dictionary of { column : function | list[functions] } pairs.
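To make the shape of specs concrete, the sketch below turns a { column : function | list[functions] } dictionary into a MongoDB $group stage. build_group_stage is a hypothetical helper, the column_function output naming is invented for this example, and function names are assumed to be strings that match MongoDB accumulator names such as sum, avg, min and max; the library's actual translation may differ.

```python
def build_group_stage(group_columns, specs):
    """Hypothetical helper: build a $group stage from aggregate specs."""
    # the group key is a document of the grouping columns
    stage = {'_id': {col: f'${col}' for col in group_columns}}
    for column, functions in specs.items():
        if isinstance(functions, str):
            functions = [functions]       # single function -> list of one
        for fn in functions:
            # e.g. sales + sum -> {'sales_sum': {'$sum': '$sales'}}
            stage[f'{column}_{fn}'] = {f'${fn}': f'${column}'}
    return {'$group': stage}


group_stage = build_group_stage(
    ['region'], {'sales': ['sum', 'avg'], 'qty': 'max'})
```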
- count()¶
return counts by group columns
- class omegaml.mdataframe.MLocIndexer(mdataframe, positional=False)¶
implements the LocIndexer for MDataFrames
- __getitem__(specs)¶
access by index
use as mdf.loc[specs] where specs is any of
a list or tuple of scalar index values, e.g. .loc[(1,2,3)]
a slice of values e.g. .loc[1:5]
a list of slices, e.g. .loc[1:5, 2:3]
- Returns:
the sliced part of the MDataFrame
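One way to picture the first two forms of specs is as a translation into a MongoDB filter document. The sketch below is hypothetical throughout: loc_filter and Loc are invented for this example, the index field name _idx is a placeholder, and treating the slice stop as inclusive follows pandas .loc semantics as an assumption. Lists of slices are deliberately left out because their semantics are not described here.

```python
def loc_filter(specs, index_field='_idx'):
    """Hypothetical helper: map loc-style specs to a MongoDB filter."""
    if isinstance(specs, slice):
        bounds = {}
        if specs.start is not None:
            bounds['$gte'] = specs.start
        if specs.stop is not None:
            bounds['$lte'] = specs.stop   # assumption: inclusive stop
        return {index_field: bounds} if bounds else {}
    if isinstance(specs, (list, tuple)):
        if any(isinstance(s, slice) for s in specs):
            raise NotImplementedError('lists of slices are not covered here')
        return {index_field: {'$in': list(specs)}}
    # a single scalar index value
    return {index_field: specs}


class Loc:
    """Tiny indexer showing how obj[...] hands specs to __getitem__."""

    def __getitem__(self, specs):
        return loc_filter(specs)


loc = Loc()
by_values = loc[(1, 2, 3)]
by_range = loc[1:5]
```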
- class omegaml.mdataframe.MPosIndexer(mdataframe)¶
implements the position-based indexer for MDataFrames
- __getitem__(specs)¶
access by index
use as mdf.loc[specs] where specs is any of
a list or tuple of scalar index values, e.g. .loc[(1,2,3)]
a slice of values e.g. .loc[1:5]
a list of slices, e.g. .loc[1:5, 2:3]
- Returns:
the sliced part of the MDataFrame
- class omegaml.mixins.mdf.ApplyContext(caller, columns=None, index=None)¶
Enable apply functions
.apply(fn) will call fn(ctx) where ctx is an ApplyContext. The context supports methods to apply functions in a Pandas-style apply manner. ApplyContext is extensible by adding an extension class to defaults.OMEGA_MDF_APPLY_MIXINS.
Note that unlike a Pandas DataFrame, ApplyContext does not itself contain any data. Rather it is part of an expression tree, i.e. the aggregation pipeline. Thus any expressions applied are translated into operations on the expression tree. The expression tree is evaluated on MDataFrame.value, at which point neither the ApplyContext nor the function that created it is active.
Examples:
    mdf.apply(lambda v: v * 5) => multiply every column in dataframe
    mdf.apply(lambda v: v['foo'].dt.week) => get week of date for column foo
    mdf.apply(lambda v: dict(a=v['foo'].dt.week, b=v['bar'] * 5)) => run multiple pipelines and get results

The callable passed to apply can be any function. It can either return None, the context passed in or a list of pipeline stages.

    # apply any of the below functions
    mdf.apply(customfn)

    # same as lambda v: v.dt.week
    def customfn(ctx):
        return ctx.dt.week

    # simple pipeline
    def customfn(ctx):
        ctx.project(x={'$multiply': ['$x', 5]})
        ctx.project(y={'$divide': ['$x', 2]})

    # complex pipeline
    def customfn(ctx):
        return [
            { '$match': ... },
            { '$project': ... },
        ]
- class omegaml.mixins.mdf.ApplyArithmetics¶
Math operators for ApplyContext
__mul__ (*)
__add__ (+)
__sub__ (-)
__div__ (/)
__floordiv__ (//)
__mod__ (%)
__pow__ (pow)
__ceil__ (ceil)
__floor__ (floor)
__trunc__ (trunc)
__abs__ (abs)
sqrt (math.sqrt)
- __add__(other)¶
add
- __mul__(other)¶
multiply
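The idea that operators build an expression tree instead of computing values can be sketched with a small class whose __mul__ and __add__ return MongoDB aggregation expressions. Column and _operand are hypothetical names for this illustration only; the real mixin may structure its pipeline differently. $multiply and $add are standard MongoDB aggregation operators.

```python
class Column:
    """Hypothetical deferred expression; holds no data, only a tree."""

    def __init__(self, expr):
        self.expr = expr

    def __mul__(self, other):
        return Column({'$multiply': [self.expr, _operand(other)]})

    def __add__(self, other):
        return Column({'$add': [self.expr, _operand(other)]})


def _operand(value):
    # unwrap nested expressions, pass plain numbers through
    return value.expr if isinstance(value, Column) else value


x = Column('$x')
# nothing is computed here; the result is a pipeline stage to run later
project_stage = {'$project': {'y': (x * 5 + 1).expr}}
```

The stage would only produce numbers once a database evaluates it, which mirrors the deferred evaluation described for ApplyContext.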
- class omegaml.mixins.mdf.ApplyDateTime¶
Datetime operators for ApplyContext
- class omegaml.mixins.mdf.ApplyString¶
String operators
- class omegaml.mixins.mdf.ApplyAccumulators¶