MDataFrame Operations ¶

MDataFrame masks are not series of True/False as they are in Pandas. Instead a MDataFrame filter mask translates into a query filter that is applied on accessing the .value property. Consider MDataFrame a syntactical convenience that makes it easy to transform code for a Pandas DataFrame to an MDataFrame.

Index-Row selection ¶

Specify the index of the rows to be accessed:

# numeric index
om.datasets.get('dfx', lazy=True).loc[2:5].value
=>
    x  y
 2  2  2
 3  3  3
 4  4  4
 5  5  5

# alphanumeric index
om.datasets.get('dfx', lazy=True).loc['abc'].value
=>
      x  y
 abc  2  2

Numeric row selection ¶

Specify the numeric row id. Note this requires that the dataset was created with a continuous row id (automatically created when using datasets.put)

# numeric index
om.datasets.get('dfx', lazy=True).iloc[2:5].value
=>
    x  y
 2  2  2
 3  3  3
 4  4  4
 5  5  5

Note

The .iloc accessor is also used by scikit-learn’s KFold and grid search features. Since MDataFrame’s are very efficiently serializable (only specifications are serialized, not actual data) this feature makes MDataFrames an attractive choice for gridsearch in a compute cluster. Actually MDataFrame instances can be used directly with gridsearch, whereas for example Dask’s DataFrame implementation cannot.

Filtering data ¶

Filtering works the same on an MDataFrame as with the eager get method, by specifying the filter as the keyword arguments:

om.datasets.get('foodf', x__gt=5, lazy=True).value
=>
    x
 6  6
 7  7
 8  8
 9  9

Geo proximity filtering ¶

MDataFrame supports filtering on geodesic proximity by specifying the __near operator and a pair of (lat, lon) coordinates. The result is the list of matching locations sorted by distance from the given coordinates.

om.datasets.getl('geosample',
location__near=dict(location=(7.4474468, 46.9479739))).value['place']
=>
2        Bern
3      Zurich
1      Geneva
0    New York
Name: place, dtype: object

Permanently setting a filter ¶

Note that the query method returns a new MDataFrame instance with the filter applied. To set a permanent filter for any subsequent operations on a specific MDataFrame instance, use the query_inplace method:

mdf = om.datasets.get('dfx', lazy=True)
id(mdf)
=> 140341971534792

# mdf2 is a new object
mdf2 = mdf.query(x__gt=2, x__lt=5)
id(mdf2)
=> 140341971587648

# note how mdf3 is the same object as mdf above
mdf3 = mdf.query_inplace(x__gt=2, x__lt=5))
id(mdf3)
=> 140341971523792

mdf = om.datasets.get('dfx', lazy=True).query_inplace(x__gt=2, x__lt=5)
mdf.value
=>
    x  y
 3  3  3
 4  4  4
 3  3  3
 4  4  4

Note

A new MDataFrame object returned by the query method does not create a new collection in MongoDB. That is, the new instance operates on the same data. The only difference is that one new instance has a permanent filter applied and any subsequent operations on it will work on the subset of the data returned by the filter.

Ordering operations ¶

Sorting ¶

Sorting works by specifying the sort columns. Use - and + before any column name to specify the sort order as descending or ascending, respectively (ascending is the default).

om.datasets.get('dfx', lazy=True).sort(['-x', '+y']).head(5).value
=>
      x    y
 999  999  999
 998  998  998
 997  997  997
 996  996  996
 995  995  995

Limiting and skipping rows ¶

The head(n) and skip(n) methods return and skip the top _n_ rows, respectively:

om.datasets.get('dfx', lazy=True).skip(5).head(3).value
=>
   x  y
5  5  5
6  6  6
7  7  7

Merging data ¶

Merging supports left, inner and right joins of two MDataFrame. The result is stored as a collection in MongoDB and all merge operations are executed by MongoDB. The result of the merge() method is a new MDataFrame on the result

import pandas as pd
# create two dataframes and store in omega-ml
dfl = pd.DataFrame({'x': range(3)})
dfr = pd.DataFrame({'x': range(3), 'y': range(3)})
om.datasets.put(dfl, 'dfxl', append=False)
om.datasets.put(dfr, 'dfxr', append=False)
# merge the dataframes
mdfl = om.datasets.get('dfxl', lazy=True)
mdfr = om.datasets.get('dfxr', lazy=True)
mdfl.merge(mdfr, on='x').value
=>
   x  y
0  0  0
1  1  1
2  2  2

Aggregation ¶

Much like a Pandas DataFrame, MDataFrame supports aggregation. All aggregation operations are executed by MongoDB.

Statistics ¶

The following statistics can be computed on pairs of numeric columns of a MDataFrame and on MSeries:

correlation - returns the pearson correlation matrix
covariance - returns the covariance matrix

mdf = om.datasets.getl('foo')
mdf['x', 'y].correlation().value
mdf['x', 'y].covariance().value

The following statisics can be computed on all numeric columns:

mean
min
max
std
quantile - by defaults calculates the .5 quantile, specify a list of percentiles

mdf = om.datasets.getl('foo')
mdf['x', 'y].mean()
mdf['x', 'y].min()
...

Grouping data ¶

mdf = om.datasets.getl('dfx')
mdf.groupby('x').x.mean().head(5)
=>
     x_mean
x
0     0.0
1     1.0
2     2.0
3     3.0
4     4.0

Multiple aggregations can be applied at once by the agg() method:

mdf = om.datasets.getl('dfx')
print(mdf.groupby('x').agg(dict(x='sum', y='mean')).head(5))

The following aggregations are currently supported:

sum - sum
mean or avg - mean
max - the max value in the group
min - the min value in the group
std - standard deviation in the sample
first - the first in the group
last - the last in the group

MDataFrame Operations ¶

Selection ¶

Column projection ¶

Masked-style selection ¶