Data Drift
==========

To illustrate data drift monitoring, consider the following example. First we
create two datasets, `d1` and `d2`, each a random sample of 10,000 points from
a normal distribution. We then create a `DataDriftMonitor` instance and take
snapshots of `d1` and `d2`. Finally, we compare the two snapshots and plot the
results.

.. code:: python

    import numpy as np

    # d1: sample centered at 0
    mu, sigma = 0, 0.1
    d1 = np.random.normal(mu, sigma, 10000)
    # d2: same spread, mean shifted to 1
    mu, sigma = 1, 0.1
    d2 = np.random.normal(mu, sigma, 10000)

.. code:: python

    import omegaml as om
    from omegaml.backends.monitoring import DataDriftMonitor

    with om.runtime.experiment('foo', recreate=True) as exp:
        mon = DataDriftMonitor(tracking=exp)
        # each call to snapshot() records the dataset's current distribution
        mon.snapshot(d1)
        mon.snapshot(d2)
        # compare the two most recent snapshots and plot column '0'
        stats = mon.compare()
        stats.plot('0')

.. image:: /images/mon_0hist_drift.png

We can see that the two datasets are similar, as expected, in that they both
show a normal distribution. However, the second dataset `d2` has a different
mean than the first dataset `d1` (1 instead of 0), while both were drawn with
the same standard deviation. This shift is reflected in the plot.

Understanding drift statistics
------------------------------

We can show the data underlying the plot by displaying the drift statistics
dataframe.

.. code:: python

    stats.df

.. image:: /images/mon_0hist_drift_df.png

The dataframe shows the following information:

* `column`: the column name in the dataset used for the snapshot
* `statistic`: the statistic used to compare the two snapshots
* `value`: the value of the test statistic
* `pvalue`: the p-value of the test
* `score`: the drift score, which is normalized to -1 .. +1 where 0 means no
  drift and 1 means maximum drift
* `seq_from`: the sequence number of the first snapshot
* `seq_to`: the sequence number of the second snapshot
* `dt_from`: the timestamp of the first snapshot
* `dt_to`: the timestamp of the second snapshot
* `baseline`: the baseline snapshot used for comparison
* `target`: the snapshot that is compared against the baseline

The following statistics are computed automatically:

* `ks`: The Kolmogorov-Smirnov statistic (for numeric data)
* `wasserstein`: The Wasserstein distance (for numeric data)
* `chisq`: The Chi-Square statistic (for categorical data)
* `mean`: The mean difference (the mean drift score for all columns)

The sequence number and timestamp are used to track the order of snapshots.
Each new call to the `mon.snapshot()` method adds a new snapshot, and thus
increases the sequence by one. The first snapshot is assigned sequence 0.

Comparing snapshots
-------------------

To compare snapshots, we can specify the sequence numbers to compare, either
as absolute sequence numbers `(0 .. n)`, where `n` is the number of snapshots
available (`len(mon)`), or as relative negative indices. By default the last
two snapshots are compared.

.. code:: python

    # compare the most recent snapshots
    stats = mon.compare()
    # this is equivalent to calling
    mon.compare(seq=[-2, -1])

If there are more than two snapshots, we can compare any two of them by
specifying their sequence numbers, e.g. to compare the first and last
snapshots:

.. code:: python

    # add a new snapshot
    mu, sigma = 1, 0.5
    d3 = np.random.normal(mu, sigma, 10000)
    mon.snapshot(d3)
    # compare the first and last snapshots
    stats = mon.compare(seq=[0, -1])
    stats.plot('0')

.. image:: /images/mon_2hist_drift.png

It is common to have a baseline snapshot that is used for comparison to
multiple snapshots taken over time. We can get this comparison by specifying
`seq='baseline'`. The baseline snapshot is the first snapshot taken, i.e. the
one with sequence number 0. We can also choose a different baseline by its
sequence number, using the `baseline=` kwarg.

.. code:: python

    # compare all snapshots to the baseline
    stats = mon.compare(seq='baseline')

An alternative to comparing all snapshots to the baseline is to compare all
snapshots to their immediate predecessor. This is useful to see how the data
drifts over time, assuming that each snapshot represents a new baseline. We
can do this by specifying `seq='series'`:

.. code:: python

    # compare all snapshots to their immediate predecessor
    stats = mon.compare(seq='series')

Plotting drift over time
------------------------

To plot the drift score over time, we can use the `plot()` method with
`kind='time'`:

.. code:: python

    stats.plot('0', kind='time')

.. image:: /images/mon_0time_drift.png

This plot shows the drift score between the first snapshot and all subsequent
snapshots. The drift score is normalized to -1 .. +1 where 0 means no drift
and 1 means maximum drift.

To analyze drift with respect to a specific statistic, specify `statistic=`:

.. code:: python

    stats.plot('0', statistic='ks', kind='time')

.. image:: /images/mon_0hist_drift_ks.png

To see the statistic's specific values, we can directly filter the drift
statistics dataframe:

.. code:: python

    stats.df['0', 'ks']

.. image:: /images/mon_0hist_drift_ks_df.png
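
To build intuition for what the `ks` and `wasserstein` values measure, we can
reproduce comparable numbers outside the monitor. The following is a minimal
sketch, assuming SciPy is installed. It is not necessarily how
`DataDriftMonitor` computes its statistics internally, it does not reproduce
the normalized `score`, and all variable names are illustrative.

.. code:: python

    import numpy as np
    from scipy.stats import ks_2samp, wasserstein_distance

    # fixed seed so the sketch is repeatable (an assumption, not in the example above)
    rng = np.random.default_rng(42)
    d1 = rng.normal(0, 0.1, 10000)       # reference sample, mean 0
    d2 = rng.normal(1, 0.1, 10000)       # shifted sample, mean 1
    d1_again = rng.normal(0, 0.1, 10000) # fresh sample from the same distribution as d1

    # two-sample Kolmogorov-Smirnov test: the largest gap between the empirical CDFs
    ks_drift = ks_2samp(d1, d2)
    ks_same = ks_2samp(d1, d1_again)

    # Wasserstein distance: how far probability mass has to be moved, in data units
    wd_drift = wasserstein_distance(d1, d2)
    wd_same = wasserstein_distance(d1, d1_again)

    print(f"drift:    ks={ks_drift.statistic:.3f} pvalue={ks_drift.pvalue:.3g} wasserstein={wd_drift:.3f}")
    print(f"no drift: ks={ks_same.statistic:.3f} pvalue={ks_same.pvalue:.3g} wasserstein={wd_same:.3f}")

Because `d1` and `d2` barely overlap, the KS statistic for the drifted pair
should come out close to its maximum of 1 and the Wasserstein distance close
to the difference in means, while both values stay near 0 for the two samples
drawn from the same distribution.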