Filtering Data¶
Query filtering¶
The .get method, when operating on a Pandas DataFrame, provides
keyword-style filtering and an optional lazy evaluation mode. Filters are
applied remotely inside the database and thus perform much faster than
filtering the returned dataframe in memory.
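For reference, the filter examples below operate on a small dataset named foodf; a minimal sketch of how such a dataset might have been stored (its exact contents are an assumption):
import pandas as pd
# assumed setup: a small single-column dataframe stored as 'foodf'
df = pd.DataFrame({'x': range(10)})
om.datasets.put(df, 'foodf', append=False)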
om.datasets.get('foodf', x__gt=5)
=>
x
6 6
7 7
8 8
9 9
The filter syntax is <column>__<operator>=<value>, where the operator
is one of the following (a few examples follow the list below):
eq - compare equal (this is also the default when using the short form, i.e. <column>=<value>)
gt - greater than
lt - less than
ge - greater or equal
le - less or equal
between - between two values, specify value as a 2-tuple
contains - contains a value, specify value as a sequence
startswith - starts with a string
endswith - ends with a string
isnull - is a null value, specify value as a boolean
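A few of these operators in action; this is a sketch that assumes the numeric column x as above and a hypothetical string column name:
# between: select rows where x lies between the two values of the 2-tuple
om.datasets.get('foodf', x__between=(2, 5))
# isnull: select rows where x is not a null value
om.datasets.get('foodf', x__isnull=False)
# startswith: select rows where the (hypothetical) name column starts with 'ba'
om.datasets.get('foodf', name__startswith='ba')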
In general, get returns a Pandas DataFrame. See the Pandas
documentation for ways to work with DataFrames.
However, unlike Pandas, omega|ml provides methods to work with data that is
larger than memory. This is covered in the next section.
Large, Out of Core-sized DataFrames¶
Using lazy evaluation we can get back a proxy DataFrame, an MDataFrame,
which provides many of the features of a Pandas DataFrame, including .loc
indexing and slicing, column projection and aggregation. All of these
operations, however, are executed by the database and thus support
out-of-core-sized DataFrames, that is, DataFrames of arbitrary size.
# ask for a reference to the dfx dataset with lazy evaluation
om.datasets.get('dfx', lazy=True)
=>
<omegaml.mdataframe.MDataFrame at 0x7fa3e991ee48>
# same thing, getl is a convenience method that automatically specifies lazy=True
om.datasets.getl('dfx')
=>
<omegaml.mdataframe.MDataFrame at 0x7fa3e991ee48>
MDataFrame in many ways behaves like a normal dataframe; however, the
evaluation of operations is lazy and is executed by the database as opposed
to in memory. This allows us to process data that is larger than memory.
In order to evaluate an MDataFrame and return an actual
pandas.DataFrame, just access the .value property:
om.datasets.get('dfx', lazy=True).value
=>
x y
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
Column projection¶
Specify the list of columns to be accessed:
om.datasets.get('dfx', lazy=True)[['x', 'y']].head(5).value
=>
x y
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
Masked-style selection¶
As with Pandas DataFrames, omega|ml MDataFrames can be subset using filter masks:
mdf = om.datasets.getl('dfx')
flt = (mdf['x'] > 2) & (mdf['x'] < 4)
mdf[flt].value
=>
x y
3 3 3
Note
MDataFrame masks are not series of True/False values as they are in Pandas. Instead, an
MDataFrame filter mask translates into a query filter that is applied on accessing
the .value property. Consider MDataFrame masks a syntactical convenience that makes
it easy to transform code written for a Pandas DataFrame to an MDataFrame.
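To see the query a mask translates into, the .inspect() method (described further below) can be used; a minimal sketch:
mdf = om.datasets.getl('dfx')
flt = (mdf['x'] > 2) & (mdf['x'] < 4)
# no data is fetched here; the mask only becomes a MongoDB query filter
mdf[flt].inspect()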
Index-Row selection¶
Specify the index of the rows to be accessed:
# numeric index
om.datasets.get('dfx', lazy=True).loc[2:5].value
=>
x y
2 2 2
3 3 3
4 4 4
5 5 5
# alphanumeric index
om.datasets.get('dfx', lazy=True).loc['abc'].value
=>
x y
abc 2 2
Numeric row selection¶
Specify the numeric row id. Note this requires that the dataset was created with a continuous row id
(automatically created when using datasets.put).
# numeric index
om.datasets.get('dfx', lazy=True).iloc[2:5].value
=>
x y
2 2 2
3 3 3
4 4 4
5 5 5
Note
The .iloc accessor is also used by scikit-learn’s KFold and grid search features. Since
MDataFrames are very efficiently serializable (only specifications are serialized, not actual data),
this feature makes MDataFrames an attractive choice for grid search in a compute cluster. In fact,
MDataFrame instances can be used directly with grid search, whereas, for example, Dask’s DataFrame
implementation cannot.
Filter data¶
Filtering works the same on an MDataFrame as with the eager get
method, by specifying the filter as keyword arguments:
om.datasets.get('foodf', x__gt=5, lazy=True).value
=>
x
6 6
7 7
8 8
9 9
Permanently setting a filter¶
Note that the query method returns a new MDataFrame instance
with the filter applied. To set a permanent filter for any subsequent
operations on a specific MDataFrame instance, use the
query_inplace method:
mdf = om.datasets.get('dfx', lazy=True)
id(mdf)
=> 140341971534792
# mdf2 is a new object
mdf2 = mdf.query(x__gt=2, x__lt=5)
id(mdf2)
=> 140341971587648
# note how mdf3 is the same object as mdf above
mdf3 = mdf.query_inplace(x__gt=2, x__lt=5)
id(mdf3)
=> 140341971534792
mdf = om.datasets.get('dfx', lazy=True).query_inplace(x__gt=2, x__lt=5)
mdf.value
=>
x y
3 3 3
4 4 4
Note
A new MDataFrame object returned by the query method
does not create a new collection in MongoDB. That is, the new instance
operates on the same data. The only difference is that the new instance
has a permanent filter applied and any subsequent operations on it will
work on the subset of the data returned by the filter.
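As a sketch to illustrate this, the original and the filtered instance both read from the same stored data; only the filter differs:
mdf = om.datasets.getl('dfx')
mdf2 = mdf.query(x__gt=2, x__lt=5)
# both instances operate on the same collection; mdf2 merely applies
# its permanent filter when the data is accessed
len(mdf.value), len(mdf2.value)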
Sorting¶
Sorting works by specifying the sort columns. Use - and +
before any column name to specify the sort order as descending or ascending,
respectively (ascending is the default).
om.datasets.get('dfx', lazy=True).sort(['-x', '+y']).head(5).value
=>
x y
999 999 999
998 998 998
997 997 997
996 996 996
995 995 995
Limiting and skipping rows¶
The head(n) and skip(n)
methods return and skip the top n
rows, respectively:
om.datasets.get('dfx', lazy=True).skip(5).head(3).value
=>
x y
5 5 5
6 6 6
7 7 7
Merging data¶
Merging supports left, inner and right joins of two MDataFrame objects.
The result is stored as a collection in MongoDB and all merge operations
are executed by MongoDB. The merge() method returns a new
MDataFrame on the merged result:
import pandas as pd
# create two dataframes and store in omega|ml
dfl = pd.DataFrame({'x': range(3)})
dfr = pd.DataFrame({'x': range(3), 'y': range(3)})
om.datasets.put(dfl, 'dfxl', append=False)
om.datasets.put(dfr, 'dfxr', append=False)
# merge the dataframes
mdfl = om.datasets.get('dfxl', lazy=True)
mdfr = om.datasets.get('dfxr', lazy=True)
mdfl.merge(mdfr, on='x').value
=>
x y
0 0 0
1 1 1
2 2 2
Statistics¶
The following statistics can be computed on pairs of numeric columns of an MDataFrame
and on MSeries:
correlation - returns the Pearson correlation matrix
covariance - returns the covariance matrix
mdf = om.datasets.getl('foo')
mdf['x', 'y'].correlation().value
mdf['x', 'y'].covariance().value
The following statistics can be computed on all numeric columns:
mean
min
max
std
quantile - by default calculates the .5 quantile; specify a list of percentiles to change this
mdf = om.datasets.getl('foo')
mdf['x', 'y'].mean()
mdf['x', 'y'].min()
...
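Based on the description above, specific quantiles might be requested as in the following sketch (the list-of-percentiles argument is assumed from the description, not a verified signature):
mdf = om.datasets.getl('foo')
# assumed per the description above: quantile() accepts a list of percentiles
mdf['x', 'y'].quantile([0.25, 0.5, 0.75])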
Aggregation¶
Much like a Pandas DataFrame, MDataFrame
supports aggregation. All
aggregation operations are executed by MongoDB.
mdf = om.datasets.getl('dfx')
mdf.groupby('x').x.mean().head(5)
=>
x_mean
x
0 0.0
1 1.0
2 2.0
3 3.0
4 4.0
Multiple aggregations can be applied at once using the agg() method:
mdf = om.datasets.getl('dfx')
print(mdf.groupby('x').agg(dict(x='sum', y='mean')).head(5))
The following aggregations are currently supported:
sum - sum
mean or avg - mean
max - the max value in the group
min - the min value in the group
std - standard deviation in the sample
first - the first in the group
last - the last in the group
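As a sketch, other aggregations from this list can be combined in the same way as the agg() example above:
mdf = om.datasets.getl('dfx')
# apply a different aggregation per column, following the documented agg() pattern
print(mdf.groupby('x').agg(dict(x='min', y='std')).head(5))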
Geo proximity filtering¶
If you have licensed the geo location proximity extensions,
MDataFrame supports filtering on geodesic proximity by specifying
the __near operator and a pair of (lon, lat) coordinates. The result
is the list of matching locations sorted by distance from the given coordinates.
om.datasets.getl('geosample',
location__near=dict(location=(7.4474468, 46.9479739))).value['place']
=>
2 Bern
3 Zurich
1 Geneva
0 New York
Name: place, dtype: object
Understanding the actual MongoDB query¶
Sometimes it is useful to know the actual MongoDB query that is executed,
e.g. for debugging or performance tuning purposes. The .inspect() method returns
the actual query that will be executed on accessing the .value
property:
om.datasets.get('dfx', lazy=True).query(x__gt=2, x__lt=5).inspect()
=>
{'explain': 'specify explain=True',
'projection': ['x', 'y'],
'query': {'$and': [{'x': {'$lt': 5}}, {'x': {'$gt': 2}}]}}
Explaining the access path¶
To understand the full access path and indices used by MongoDB, use the
explain=True keyword:
om.datasets.get('dfx', lazy=True).query(x__gt=2, x__lt=5).inspect(explain=True)
=>
{'explain': {'executionStats': {'allPlansExecution': [],
'executionStages': {'advanced': 4,
'executionTimeMillisEstimate': 0,
'inputStage': {'advanced': 4,
'direction': 'forward',
'docsExamined': 1100,
'executionTimeMillisEstimate': 0,
'filter': {'$and': [{'x': {'$lt': 5}}, {'x': {'$gt': 2}}]},
'invalidates': 0,
'isEOF': 1,
'nReturned': 4,
'needTime': 1097,
'needYield': 0,
'restoreState': 8,
'saveState': 8,
'stage': 'COLLSCAN',
'works': 1102},
'invalidates': 0,
'isEOF': 1,
'nReturned': 4,
'needTime': 1097,
'needYield': 0,
'restoreState': 8,
'saveState': 8,
'stage': 'PROJECTION',
'transformBy': {'_idx#0_0': 1, 'x': 1, 'y': 1},
'works': 1102},
'executionSuccess': True,
'executionTimeMillis': 1,
'nReturned': 4,
'totalDocsExamined': 1100,
'totalKeysExamined': 0},
'ok': 1.0,
'queryPlanner': {'indexFilterSet': False,
'namespace': 'testing3.omegaml.data_.dfx.datastore',
'parsedQuery': {'$and': [{'x': {'$lt': 5}}, {'x': {'$gt': 2}}]},
'plannerVersion': 1,
'rejectedPlans': [],
'winningPlan': {'inputStage': {'direction': 'forward',
'filter': {'$and': [{'x': {'$lt': 5}}, {'x': {'$gt': 2}}]},
'stage': 'COLLSCAN'},
'stage': 'PROJECTION',
'transformBy': {'_idx#0_0': 1, 'x': 1, 'y': 1}}},
'serverInfo': {'gitVersion': '22ec9e93b40c85fc7cae7d56e7d6a02fd811088c',
'host': 'c24ade3fa980',
'port': 27017,
'version': '3.2.9'}},
'projection': ['x', 'y'],
'query': {'$and': [{'x': {'$lt': 5}}, {'x': {'$gt': 2}}]}}