Storing and retrieving data

om.datasets provides two simple APIs to store and retrieve data:

  • om.datasets.put(object, 'name')

  • om.datasets.get('name')

Native Python objects

Any Python native list or dict object can be stored and read back directly:

myvar = ['data']
om.datasets.put(myvar, 'foo')
data = om.datasets.get('foo')
=>
[['data']]

Note that the result is now a list of the stored objects. This is because each object is stored as a document in a MongoDB collection, and get returns a list of all the documents in that collection. By default, put appends new documents to an existing collection.

om.datasets.put(myvar, 'foo')
om.datasets.put(myvar, 'foo')
data = om.datasets.get('foo')
=>
[['data'], ['data'], ['data']]

To replace all documents in a collection, use the append=False keyword argument.

myvar = ['data']
om.datasets.put(myvar, 'foo', append=False)
data = om.datasets.get('foo')
=>
[['data']]
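
The append and replace behaviour described above can be modelled with a small in-memory stand-in. This is only a sketch to make the semantics concrete: InMemoryDatasets is a hypothetical class, not part of omega|ml, and it keeps documents in a plain dict rather than in MongoDB.

```python
from collections import defaultdict


class InMemoryDatasets:
    """Hypothetical stand-in mimicking the put/get semantics described above.

    Not the omega|ml implementation: documents live in a dict, not in MongoDB.
    """

    def __init__(self):
        self._collections = defaultdict(list)

    def put(self, obj, name, append=True):
        # append=False drops the existing documents before storing the new one
        if not append:
            self._collections[name] = []
        self._collections[name].append(obj)

    def get(self, name):
        # return a list of all documents stored under this name
        return list(self._collections[name])


datasets = InMemoryDatasets()
myvar = ['data']
datasets.put(myvar, 'foo')
datasets.put(myvar, 'foo')
appended = datasets.get('foo')   # one entry per put call

datasets.put(myvar, 'foo', append=False)
replaced = datasets.get('foo')   # only the most recent object remains
```

In this model, every put adds one document, and append=False is the only way to shrink the collection back to a single entry.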

Pandas DataFrames, Series

Pandas DataFrames are stored in much the same way. Note, however, that DataFrames provide additional support for querying, as shown in the next section.

import pandas as pd
df = pd.DataFrame({'x': range(10)})
om.datasets.put(df, 'foodf', append=False)
om.datasets.get('foodf')
=>
   x
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
9  9

External Sources

Any Python tool can be used to retrieve data from external sources and ingest it into omega|ml datasets. For example, you could use Pandas’ pd.read_csv function to read a remote CSV file and insert it into om.datasets:

# small datasets
# -- note pandas will read all of the dataset into memory, limiting the size of the dataset
df = pd.read_csv('http://example.com/data.csv')
om.datasets.put(df, 'example_data')

# larger-than-memory datasets
# -- this loads the dataset in chunks, limiting the amount of memory pandas uses
for chunk_df in pd.read_csv('http://example.com/data.csv', chunksize=1000):
    om.datasets.put(chunk_df, 'example_data')
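
To make the memory argument concrete, here is a minimal standard-library sketch of the same chunking idea. It uses neither pandas nor omega|ml: iter_chunks and the stored list are hypothetical stand-ins for pd.read_csv(..., chunksize=...) and om.datasets.put, and the CSV is a small in-memory sample rather than a remote file.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunksize):
    """Yield lists of at most `chunksize` rows, holding one chunk at a time."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunksize))
        if not chunk:
            return
        yield chunk


# in-memory sample standing in for a remote CSV file with a single column x
sample = io.StringIO("x\n" + "\n".join(str(i) for i in range(10)) + "\n")
reader = csv.DictReader(sample)

stored = []       # stand-in for the dataset; each chunk is appended in turn
chunk_sizes = []  # records how many rows were held in memory per step
for chunk in iter_chunks(reader, chunksize=4):
    chunk_sizes.append(len(chunk))
    stored.extend(chunk)
```

Only one chunk is held in memory at any time. Given the default append behaviour of put described earlier, each chunk adds to the same dataset; passing append=False inside such a loop would presumably leave only the last chunk.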

Alternatively, omega|ml provides a convenience function, om.datasets.read_csv, to ingest data from a wide range of sources (e.g. S3, HTTPS, SFTP, HDFS, Azure Blob, GCS).

# retrieve the data and store in the example_data dataset
om.datasets.read_csv('http://example.com/data.csv', 'example_data')

Similarly, om.datasets.to_csv supports writing directly to remote locations:

om.datasets.to_csv('example_data', 's3://my_bucket/example_data.csv')

Accessing DBMS via SQL

If the data resides in an SQL database, om.datasets can store the connection to the database:

# one time, e.g. one person in the team can set this up
mydb_cxs = 'mysql://user:pass@dbhost/db'
om.datasets.put(mydb_cxs, 'mysqldb')
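
When the user name or password contains characters such as @ or /, the connection string has to be percent-encoded to stay a valid URL. The following is a minimal sketch of one way to build it, using only the standard library. The helper build_connection_string and the environment variable names DB_USER and DB_PASSWORD are assumptions made for this illustration; they are not part of omega|ml.

```python
import os
from urllib.parse import quote_plus


def build_connection_string(user, password, host, database, scheme='mysql'):
    """Build a database URL, percent-encoding the user name and password."""
    return f'{scheme}://{quote_plus(user)}:{quote_plus(password)}@{host}/{database}'


# DB_USER and DB_PASSWORD are hypothetical environment variable names;
# the fallback values are placeholders for demonstration only
user = os.environ.get('DB_USER', 'user')
password = os.environ.get('DB_PASSWORD', 'p@ss/word')
mydb_cxs = build_connection_string(user, password, 'dbhost', 'db')
```

The resulting mydb_cxs can then be passed to om.datasets.put in the same way as the literal string above.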

Once the connection is stored like this, DataFrames can be stored and retrieved through it without knowing the connection string:

# store
df = pd.DataFrame({'x': range(100)})
om.datasets.put(df, 'mysqldb')

# retrieve
df = om.datasets.get('mysqldb')

Note that by default, the dataset name is used as the table name, prefixed by the bucket name (which defaults to omegaml). In the previous example, the actual table is omegaml_mysqldb.

To change the table name, specify the table= keyword when storing the connection. In the following example, the actual table is omegaml_mytable:

# store data in a given table
mydb_cxs = 'mysql://user:pass@dbhost/db'
om.datasets.put(mydb_cxs, 'mysql-table', table='mytable')

To use an existing table without the bucket name prefix, put a colon in front of the table name, as follows. This stores the data in the table mytable.

# store data in an existing table, without the bucket prefix
mydb_cxs = 'mysql://user:pass@dbhost/db'
om.datasets.put(mydb_cxs, 'mysql-table', table=':mytable')
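
The three naming rules above can be summarised in a few lines of Python. This is a sketch of the rules as this section describes them, not the omega|ml code: resolve_table_name is a hypothetical helper, and it ignores any further name sanitising the library may perform.

```python
def resolve_table_name(dataset_name, table=None, bucket='omegaml'):
    """Hypothetical helper modelling the table naming rules described above."""
    if table is None:
        # default: the dataset name, prefixed by the bucket name
        return f'{bucket}_{dataset_name}'
    if table.startswith(':'):
        # leading colon: use the table name as given, without the bucket prefix
        return table[1:]
    # explicit table= keyword: the table name, prefixed by the bucket name
    return f'{bucket}_{table}'


default_name = resolve_table_name('mysqldb')
custom_name = resolve_table_name('mysql-table', table='mytable')
existing_name = resolve_table_name('mysql-table', table=':mytable')
```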

To run a query whenever the dataset is retrieved, specify the sql= keyword:

# run the given query when the dataset is retrieved
mydb_cxs = 'mysql://user:pass@dbhost/db'
om.datasets.put(mydb_cxs, 'mysql-table', table=':mytable', sql="select * from mytable")

Further possibilities include specifying variables for the connection string (e.g. userid, password) or for the SQL statement. For details, see omegaml.backends.sqlalchemy.SQLAlchemyBackend.

Storing and retrieving files

Files can be stored and retrieved natively in several ways:

  1. Use a Python file-like object as input and output:

    # .put() will call file_in.read()
    with open('myfile.bin', 'rb') as file_in:
        om.datasets.put(file_in, 'myfile.bin')
    
    # .get() returns a file-like object
    data = om.datasets.get('myfile.bin').read()
    
  2. Directly use a local path:

    om.datasets.put('myfile.bin', 'testfile')
    om.datasets.get('testfile', local='myfile.bin')
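
After storing a binary file, it can be useful to confirm that it comes back unchanged. The following sketch compares checksums of two file-like objects. The helper sha256_of is an assumption introduced for this illustration, and the two io.BytesIO objects merely stand in for the local file and for the object returned by om.datasets.get.

```python
import hashlib
import io


def sha256_of(fileobj, blocksize=65536):
    """Return the SHA-256 hex digest of a binary file-like object."""
    digest = hashlib.sha256()
    # read in blocks so that large files are never fully loaded into memory
    for block in iter(lambda: fileobj.read(blocksize), b''):
        digest.update(block)
    return digest.hexdigest()


# stand-ins: in practice these would be open('myfile.bin', 'rb')
# and the file-like object returned by om.datasets.get('myfile.bin')
original = io.BytesIO(b'example payload')
retrieved = io.BytesIO(b'example payload')
round_trip_ok = sha256_of(original) == sha256_of(retrieved)
```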