Storing and retrieving data =========================== :code:`om.datasets.` provides two simple APIs to store and retrieve data: * :code:`om.datasets.put(object, 'name')` * :code:`om.datasets.get('name')` Native Python objects --------------------- Any Python native :code:`list` or :code:`dict` object can be stored and read back directly: .. code:: myvar = ['data'] om.datasets.put(myvar, 'foo') data = om.datasets.get('foo') => [['data']] Note the result is now a list of the objects stored. This is because any object is stored as a document in a monogodb collection. What you get back is a list of all the documents in the collection. By default :code:`put` will append an existing collection with new documents. .. code:: om.datasets.put(myvar, 'foo') om.datasets.put(myvar, 'foo') data = om.datasets.get('foo') => [['data'], ['data'], ['data']] To replace all documents in a collection use the :code:`append=False` kwarg. .. code:: myvar = ['data'] om.datasets.put(myvar, 'foo', append=False) data = om.datasets.get('foo') => [['data']] Pandas DataFrames, Series ------------------------- Pandas Dataframes are stored in much the same way. Note however that DataFrames provide additional support on querying, as shown in the next section .. code:: import pandas as pd df = pd.DataFrame({'x': range(10)}) om.datasets.put(df, 'foodf', append=False) om.datasets.get('foodf') => x 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 External Sources ---------------- Any Python tools can be used to retrieve data from external sources and ingest into omega|ml datasets. For example, you could use the Pandas' library :code:`pd.read_csv` to read a remote csv file and insert it into :code:`om.datasets`: .. code:: python # small datasets # -- note pandas will read all of the dataset into memory, limiting the size of the dataset df = pd.read_csv('http://example.com/data.csv') om.datasets.put(df, 'example_data') # larger then memory datasets # this will load the dataset in chunks, limitting the amount of memory pandas uses for chunk_df in pd.read_csv('http://example.com/data.csv', chunksize=1000): om.datasets.put(chunk_df, 'example_data') Alternatively, omega|ml provides a convenience function, `om.datasets.read_csv` to ingest data from a wide range sources (e.g. S3, HTTPS, SFTP, HDFS, Azure Blob, GCS, etc.). .. code:: pyton # retrieve the data and store in the example_data dataset om.datasets.read_csv('http://example.com/data.csv', 'example_data') Similarly, :code:`om.datasets.to_csv` supports writing directly to remote locations: .. code:: python om.datasets.to_csv('example_data', 's3://my_bucket/example_data.csv') Accessing DBMS via SQL ---------------------- If the data resides in an SQL database, `om.datasets` can store the connection to the database: .. code:: # one time, e.g. one person in the team can set this up mydb_cxs = f'mysql://user:pass@dbhost/db' om.datasets.put(mydb_cxs, 'mysqldb') Once the connection is stored like this, dataframes can be stored and retrieved using the connection without knowing the connection string: .. code:: # store df = pd.DataFrame({'x': range(100)}) om.datasets.put(df, 'mysqldb') # retrieve df = om.datasets.get('mysqldb') Note by default the dataset name is used as the table name, prefixed by the bucket name (defaults to :code:`omegaml`). In the previous example, the actual table is the :code:`omegaml_mysqldb` table. To change the table name, specify the :code:`table=` keyword when storing the connection. In the following example, the actual table is :code:`omegaml_mytable`, .. code:: # store data in a given table mydb_cxs = f'mysql://user:pass@dbhost/db' om.datasets.put(mydb_cxs, 'mysql-table', table='mytable') To specify an existing table, without the bucket name, prefix the table name with a colon, as follows. This will store the data in table :code:`mytable`. .. code:: # store data in a given table mydb_cxs = f'mysql://user:pass@dbhost/db' om.datasets.put(mydb_cxs, 'mysql-table', table=':mytable') To specify a query to be run on retrieving the dataset, specify the :code:`sql=` keyword: .. code:: # store data in a given table mydb_cxs = f'mysql://user:pass@dbhost/db' om.datasets.put(mydb_cxs, 'mysql-table', table=':mytable', sql="select * from mytable") Further possibilites include specifying variables for the connection string (e.g. userid, password) or the sql statement. Details see :py:class:`omegaml.backends.sqlalchemy.SQLAlchemyBackend` Storing and retrieving files ---------------------------- Files can be stored and retrieved natively in several ways : 1. Use a Python file-like object as input and output: .. code:: # .put() will call file_in.read() with open('myfile.bin', 'rb') as file_in: om.datasets.put(file_in, 'myfile.bin') # .get() returns a file-like object data = om.datasets.get('myfile.bin').read() 2. Directly use a local path: .. code:: om.datasets.put('myfile.bin', 'testfile') om.datasets.get('testfile', local='myfile.bin')