Filtering Data

Query filtering

The .get method when operating on a Pandas DataFrame provides keyword-style filtering and an optional lazy evaluation mode. Filters are applied remotely inside the database and thus perform much faster than if filtered in the returned dataframe.

om.datasets.get('foodf', x__gt=5)
=>
    x
 6  6
 7  7
 8  8
 9  9

The filter syntax is <column>__<operator>=<value>, where the operator is one of the following:

  • eq compare equal (this is also the default, when using the short form, i.e. <column>=<value>
  • gt greater than
  • lt less than
  • ge greator or equal
  • le less or equal
  • between between two values, specify value as a 2-tuple
  • contains contains a value, specify value as a sequence
  • startswith starts with a string
  • endswith ends with a string
  • isnull is a null value, specify value as a boolean

In general get returns a Pandas DataFrame. See the Pandas documentation for ways to work with DataFrames.

However, unlike Pandas omega|ml provides methods to work with data that is larger than memory. This is covered in the next section.

Large, Out of Core-sized DataFrames

Using lazy evaluation we can get back a proxy DataFrame, an MDataFrame, which provides many of the features of a Pandas DataFrame including .loc indexing and slicing, column projection and aggregation. All of these operations, however, are executed by the database and thus support out-of-core sized DataFrames, that is DataFrames of arbitrary size.

# ask for a reference to the dfx dataset with lazy evaluation
om.datasets.get('dfx', lazy=True)
=>
<omegaml.mdataframe.MDataFrame at 0x7fa3e991ee48>

# same thing, getl is convenience method that automatically specifies lazy=True
om.datasets.getl('dfx')
=>
<omegaml.mdataframe.MDataFrame at 0x7fa3e991ee48>

MDataFrame in many ways behaves like a normal dataframe, however the evaluation of operations is _lazy_ and is executed by the database as opposed to in-memory. This allows us to process data that is larger than memory.

In order to evaluate MDataFrame and return an actual pandas.DataFrame just access the .value property:

om.datasets.get('dfx', lazy=True).value
=>
    x  y
 0  0  0
 1  1  1
 2  2  2
 3  3  3
 4  4  4

Column projection

Specify the list of columns to be accessed:

om.datasets.get('dfx', lazy=True)[['x', 'y']].head(5).value
=>
    x  y
 0  0  0
 1  1  1
 2  2  2
 3  3  3
 4  4  4

Masked-style selection

As with Pandas DataFrames, omega|ml MDataFrames can be subset using filter masks:

mdf = om.datasets.getl('dfx')
flt = (mdf['x'] > 2) & (mdf['x] < 4)
mdf[flt].value
=>
    x  y
 3  3  3

Note

MDataFrame masks are not series of True/False as they are in Pandas. Instead a MDataFrame filter mask translates into a query filter that is applied on accessing the .value property. Consider MDataFrame a syntactical convenience that makes it easy to transform code for a Pandas DataFrame to an MDataFrame.

Index-Row selection

Specify the index of the rows to be accessed:

# numeric index
om.datasets.get('dfx', lazy=True).loc[2:5].value
=>
    x  y
 2  2  2
 3  3  3
 4  4  4
 5  5  5

# alphanumeric index
om.datasets.get('dfx', lazy=True).loc['abc'].value
=>
      x  y
 abc  2  2

Numeric row selection

Specify the numeric row id. Note this requires that the dataset was created with a continuous row id (automatically created when using datasets.put)

# numeric index
om.datasets.get('dfx', lazy=True).iloc[2:5].value
=>
    x  y
 2  2  2
 3  3  3
 4  4  4
 5  5  5

Note

The .iloc accessor is also used by scikit-learn’s KFold and grid search features. Since MDataFrame’s are very efficiently serializable (only specifications are serialized, not actual data) this feature makes MDataFrames an attractive choice for gridsearch in a compute cluster. Actually MDataFrame instances can be used directly with gridsearch, whereas for example Dask’s DataFrame implementation cannot.

Filter data

Filtering works the same on an MDataFrame as with the eager get method, by specifying the filter as the keyword arguments:

om.datasets.get('foodf', x__gt=5, lazy=True).value
=>
    x
 6  6
 7  7
 8  8
 9  9

Permanently setting a filter

Note that the query method returns a new MDataFrame instance with the filter applied. To set a permanent filter for any subsequent operations on a specific MDataFrame instance, use the query_inplace method:

mdf = om.datasets.get('dfx', lazy=True)
id(mdf)
=> 140341971534792

# mdf2 is a new object
mdf2 = mdf.query(x__gt=2, x__lt=5)
id(mdf2)
=> 140341971587648

# note how mdf3 is the same object as mdf above
mdf3 = mdf.query_inplace(x__gt=2, x__lt=5))
id(mdf3)
=> 140341971523792

mdf = om.datasets.get('dfx', lazy=True).query_inplace(x__gt=2, x__lt=5)
mdf.value
=>
    x  y
 3  3  3
 4  4  4
 3  3  3
 4  4  4

Note

A new MDataFrame object returned by the query method does not create a new collection in MongoDB. That is, the new instance operates on the same data. The only difference is that one new instance has a permanent filter applied and any subsequent operations on it will work on the subset of the data returned by the filter.

Sorting

Sorting works by specifying the sort columns. Use - and + before any column name to specify the sort order as descending or ascending, respectively (ascending is the default).

om.datasets.get('dfx', lazy=True).sort(['-x', '+y']).head(5).value
=>
      x    y
 999  999  999
 998  998  998
 997  997  997
 996  996  996
 995  995  995

Limiting and skipping rows

The head(n) and skip(n) methods return and skip the top _n_ rows, respectively:

om.datasets.get('dfx', lazy=True).skip(5).head(3).value
=>
   x  y
5  5  5
6  6  6
7  7  7

Merging data

Merging supports left, inner and right joins of two MDataFrame. The result is stored as a collection in MongoDB and all merge operations are executed by MongoDB. The result of the merge() method is a new MDataFrame on the result

import pandas as pd
# create two dataframes and store in omega|ml
dfl = pd.DataFrame({'x': range(3)})
dfr = pd.DataFrame({'x': range(3), 'y': range(3)})
om.datasets.put(dfl, 'dfxl', append=False)
om.datasets.put(dfr, 'dfxr', append=False)
# merge the dataframes
mdfl = om.datasets.get('dfxl', lazy=True)
mdfr = om.datasets.get('dfxr', lazy=True)
mdfl.merge(mdfr, on='x').value
=>
   x  y
0  0  0
1  1  1
2  2  2

Statistics

The following statistics can be computed on pairs of numeric columns of a MDataFrame and on MSeries:

  • correlation - returns the pearson correlation matrix
  • covariance - returns the covariance matrix
mdf = om.datasets.getl('foo')
mdf['x', 'y].correlation().value
mdf['x', 'y].covariance().value

The following statisics can be computed on all numeric columns:

  • mean
  • min
  • max
  • std
  • quantile - by defaults calculates the .5 quantile, specify a list of percentiles
mdf = om.datasets.getl('foo')
mdf['x', 'y].mean()
mdf['x', 'y].min()
...

Aggregation

Much like a Pandas DataFrame, MDataFrame supports aggregation. All aggregation operations are executed by MongoDB.

mdf = om.datasets.getl('dfx')
mdf.groupby('x').x.mean().head(5)
=>
     x_mean
x
0     0.0
1     1.0
2     2.0
3     3.0
4     4.0

Multiple aggregations can be applied at once by the agg() method:

mdf = om.datasets.getl('dfx')
print(mdf.groupby('x').agg(dict(x='sum', y='mean')).head(5))

The following aggregations are currently supported:

  • sum - sum
  • mean or avg - mean
  • max - the max value in the group
  • min - the min value in the group
  • std - standard deviation in the sample
  • first - the first in the group
  • last - the last in the group

Geo proximity filtering

If you have licensed the geo location proximity extensions, MDataFrame supports filtering on geodesic proximity by specifying the __near operator and a pair of (lat, lon) coordinates. The result is the list of matching locations sorted by distance from the given coordinates.

om.datasets.getl('geosample',
location__near=dict(location=(7.4474468, 46.9479739))).value['place']
=>
2        Bern
3      Zurich
1      Geneva
0    New York
Name: place, dtype: object

Understanding the actual MongoDB query

Sometimes it is useful to know the actual MongoDB query that is executed, e.g. for debugging or performance tuning purpose. .inspect() returns the actual query that will be executed on accessing the .value: property.

om.datasets.get('dfx', lazy=True).query(x__gt=2, x__lt=5).inspect()
=>
{'explain': 'specify explain=True',
 'projection': ['x', 'y'],
 'query': {'$and': [{'x': {'$lt': 5}}, {'x': {'$gt': 2}}]}}

Explaining the access path

To understand the full access path and indicies used by MongoDB, use the explain=True keyword.

  om.datasets.get('dfx', lazy=True).query(x__gt=2, x__lt=5).inspect(explain=True)
  =>
  {'explain': {'executionStats': {'allPlansExecution': [],
  'executionStages': {'advanced': 4,
   'executionTimeMillisEstimate': 0,
   'inputStage': {'advanced': 4,
    'direction': 'forward',
    'docsExamined': 1100,
    'executionTimeMillisEstimate': 0,
    'filter': {'$and': [{'x': {'$lt': 5}}, {'x': {'$gt': 2}}]},
    'invalidates': 0,
    'isEOF': 1,
    'nReturned': 4,
    'needTime': 1097,
    'needYield': 0,
    'restoreState': 8,
    'saveState': 8,
    'stage': 'COLLSCAN',
    'works': 1102},
   'invalidates': 0,
   'isEOF': 1,
   'nReturned': 4,
   'needTime': 1097,
   'needYield': 0,
   'restoreState': 8,
   'saveState': 8,
   'stage': 'PROJECTION',
   'transformBy': {'_idx#0_0': 1, 'x': 1, 'y': 1},
   'works': 1102},
  'executionSuccess': True,
  'executionTimeMillis': 1,
  'nReturned': 4,
  'totalDocsExamined': 1100,
  'totalKeysExamined': 0},
 'ok': 1.0,
 'queryPlanner': {'indexFilterSet': False,
  'namespace': 'testing3.omegaml.data_.dfx.datastore',
  'parsedQuery': {'$and': [{'x': {'$lt': 5}}, {'x': {'$gt': 2}}]},
  'plannerVersion': 1,
  'rejectedPlans': [],
  'winningPlan': {'inputStage': {'direction': 'forward',
    'filter': {'$and': [{'x': {'$lt': 5}}, {'x': {'$gt': 2}}]},
    'stage': 'COLLSCAN'},
   'stage': 'PROJECTION',
   'transformBy': {'_idx#0_0': 1, 'x': 1, 'y': 1}}},
 'serverInfo': {'gitVersion': '22ec9e93b40c85fc7cae7d56e7d6a02fd811088c',
  'host': 'c24ade3fa980',
  'port': 27017,
  'version': '3.2.9'}},
'projection': ['x', 'y'],
'query': {'$and': [{'x': {'$lt': 5}}, {'x': {'$gt': 2}}]}}