MDataFrame Operations
=====================

.. contents::

Selection
---------

Column projection
+++++++++++++++++

Specify the list of columns to be accessed:

.. code::

   om.datasets.get('dfx', lazy=True)[['x', 'y']].head(5).value
   =>
       x  y
    0  0  0
    1  1  1
    2  2  2
    3  3  3
    4  4  4

Masked-style selection
++++++++++++++++++++++

As with Pandas DataFrames, omega-ml MDataFrames can be subset using filter masks:

.. code::

   mdf = om.datasets.getl('dfx')
   flt = (mdf['x'] > 2) & (mdf['x] < 4)
   mdf[flt].value
   =>
       x  y
    3  3  3

.. note::

    MDataFrame masks are not series of True/False as they are in Pandas. Instead a
    MDataFrame filter mask translates into a query filter that is applied on accessing
    the :code:`.value` property. Consider MDataFrame a syntactical convenience that makes
    it easy to transform code for a Pandas DataFrame to an MDataFrame.


Index-Row selection
+++++++++++++++++++

Specify the index of the rows to be accessed:

.. code::

   # numeric index
   om.datasets.get('dfx', lazy=True).loc[2:5].value
   =>
       x  y
    2  2  2
    3  3  3
    4  4  4
    5  5  5

   # alphanumeric index
   om.datasets.get('dfx', lazy=True).loc['abc'].value
   =>
         x  y
    abc  2  2


Numeric row selection
+++++++++++++++++++++

Specify the numeric row id. Note this requires that the dataset was created with a continuous row id
(automatically created when using :code:`datasets.put`)

.. code::

   # numeric index
   om.datasets.get('dfx', lazy=True).iloc[2:5].value
   =>
       x  y
    2  2  2
    3  3  3
    4  4  4
    5  5  5

.. note::

    The :code:`.iloc` accessor is also used by scikit-learn's KFold and grid search features. Since
    MDataFrame's are very efficiently serializable (only specifications are serialized, not actual data)
    this feature makes MDataFrames an attractive choice for gridsearch in a compute cluster. Actually
    MDataFrame instances can be used directly with gridsearch, whereas for example Dask's DataFrame implementation
    cannot.


Filtering data
--------------

Filtering works the same on an MDataFrame as with the eager :code:`get`
method, by specifying the filter as the keyword arguments:

.. code::

   om.datasets.get('foodf', x__gt=5, lazy=True).value
   =>
       x
    6  6
    7  7
    8  8
    9  9

Geo proximity filtering
+++++++++++++++++++++++

:code:`MDataFrame` supports filtering on geodesic proximity by specifying
the :code:`__near` operator and a pair of (lat, lon) coordinates. The result
is the list of matching locations sorted by distance from the given coordinates.

.. code::

    om.datasets.getl('geosample',
    location__near=dict(location=(7.4474468, 46.9479739))).value['place']
    =>
    2        Bern
    3      Zurich
    1      Geneva
    0    New York
    Name: place, dtype: object


Permanently setting a filter
++++++++++++++++++++++++++++

Note that the :code:`query` method returns a new :code:`MDataFrame` instance
with the filter applied.  To set a permanent filter for any subsequent
operations on a specific :code:`MDataFrame` instance, use the
:code:`query_inplace` method:

.. code::

   mdf = om.datasets.get('dfx', lazy=True)
   id(mdf)
   => 140341971534792

   # mdf2 is a new object
   mdf2 = mdf.query(x__gt=2, x__lt=5)
   id(mdf2)
   => 140341971587648

   # note how mdf3 is the same object as mdf above
   mdf3 = mdf.query_inplace(x__gt=2, x__lt=5))
   id(mdf3)
   => 140341971523792

   mdf = om.datasets.get('dfx', lazy=True).query_inplace(x__gt=2, x__lt=5)
   mdf.value
   =>
       x  y
    3  3  3
    4  4  4
    3  3  3
    4  4  4

.. note::

   A new :code:`MDataFrame` object returned by the :code:`query` method
   does *not* create a new collection in MongoDB. That is, the new instance
   operates on the same data. The only difference is that one new instance
   has a permanent filter applied and any subsequent operations on it will
   work on the subset of the data returned by the filter.

Ordering operations
-------------------

Sorting
+++++++

Sorting works by specifying the sort columns. Use :code:`-` and :code:`+`
before any column name to specify the sort order as descending or ascending,
respectively (ascending is the default).

.. code::

   om.datasets.get('dfx', lazy=True).sort(['-x', '+y']).head(5).value
   =>
         x    y
    999  999  999
    998  998  998
    997  997  997
    996  996  996
    995  995  995


Limiting and skipping rows
++++++++++++++++++++++++++

The :code:`head(n)` and :code:`skip(n)` methods return and skip the top _n_
rows, respectively:

.. code::

   om.datasets.get('dfx', lazy=True).skip(5).head(3).value
   =>
      x  y
   5  5  5
   6  6  6
   7  7  7

Merging data
++++++++++++

Merging supports left, inner and right joins of two :code:`MDataFrame`.
The result is stored as a collection in MongoDB and all merge operations
are executed by MongoDB. The result of the :code:`merge()` method is a new
:code:`MDataFrame` on the result

.. code::

    import pandas as pd
    # create two dataframes and store in omega-ml
    dfl = pd.DataFrame({'x': range(3)})
    dfr = pd.DataFrame({'x': range(3), 'y': range(3)})
    om.datasets.put(dfl, 'dfxl', append=False)
    om.datasets.put(dfr, 'dfxr', append=False)
    # merge the dataframes
    mdfl = om.datasets.get('dfxl', lazy=True)
    mdfr = om.datasets.get('dfxr', lazy=True)
    mdfl.merge(mdfr, on='x').value
    =>
       x  y
    0  0  0
    1  1  1
    2  2  2


Aggregation
-----------

Much like a Pandas DataFrame, :code:`MDataFrame` supports aggregation. All
aggregation operations are executed by MongoDB.

Statistics
++++++++++

The following statistics can be computed on pairs of numeric columns of a :code:`MDataFrame` and on :code:`MSeries`:

* :code:`correlation` - returns the pearson correlation matrix
* :code:`covariance` - returns the covariance matrix

.. code::

    mdf = om.datasets.getl('foo')
    mdf['x', 'y].correlation().value
    mdf['x', 'y].covariance().value


The following statisics can be computed on all numeric columns:

* :code:`mean`
* :code:`min`
* :code:`max`
* :code:`std`
* :code:`quantile` - by defaults calculates the .5 quantile, specify a list of percentiles


.. code::

    mdf = om.datasets.getl('foo')
    mdf['x', 'y].mean()
    mdf['x', 'y].min()
    ...

Grouping data
+++++++++++++

.. code::

    mdf = om.datasets.getl('dfx')
    mdf.groupby('x').x.mean().head(5)
    =>
         x_mean
    x
    0     0.0
    1     1.0
    2     2.0
    3     3.0
    4     4.0

Multiple aggregations can be applied at once by the :code:`agg()` method:

.. code::

    mdf = om.datasets.getl('dfx')
    print(mdf.groupby('x').agg(dict(x='sum', y='mean')).head(5))

The following aggregations are currently supported:

* :code:`sum` - sum
* :code:`mean` or :code:`avg` - mean
* :code:`max` - the max value in the group
* :code:`min` - the min value in the group
* :code:`std` - standard deviation in the sample
* :code:`first` - the first in the group
* :code:`last` - the last in the group