Architecture¶
Why omega-ml¶
A typical data science workflow consists of the following core steps:
acquire data & store for subsequent processes
clean data & publish for use
train & evaluate models
publish models & reports
execute predictions using previously trained models
In any production scenario, each step requires scalable storage for raw and cleaned data as well as for models, and APIs to execute the models. You will also need a compute cluster that is easy to access and provides all the required packages. Engineering such a system from scratch is hard and takes considerable time and skill. omega-ml provides all of this in an integrated, scalable fashion.
omega-ml provides
a machine learning repository, acting as the central storage for data and models
a client API for out-of-core data processing that follows Pandas semantics (see the sketch after this list)
an integrated compute cluster runtime to train and execute models, as well as to execute arbitrary scripts and automatically publish reports
a REST API to data, models, scripts and the runtime, as well as custom services with their own Swagger/OpenAPI endpoints
a dashboard to access the repository and monitor the runtime and the status of jobs
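For illustration, the following is a minimal sketch of the Pandas-style out-of-core API, assuming a default local omega-ml installation; the dataset name 'sales' is hypothetical:

    import pandas as pd
    import omegaml as om

    # store a DataFrame in the repository (persisted to MongoDB)
    df = pd.DataFrame({'region': ['EU', 'US', 'EU'],
                       'amount': [100, 200, 300]})
    om.datasets.put(df, 'sales', append=False)

    # getl returns a lazy MDataFrame; the filter is evaluated inside
    # the database, only the result is loaded into client memory
    mdf = om.datasets.getl('sales')
    eu_sales = mdf.query(region='EU').value
    print(eu_sales['amount'].sum())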
Extensibility¶
Because extensibility is at the core of the architecture, omega-ml can easily accommodate any third-party storage or machine learning backend, and new types of operations on data and models can be added.
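As an illustration, a custom data backend follows the pattern sketched below. The class and its storage strategy are simplified assumptions for this sketch; refer to the developer guide for the exact backend contract in your version:

    import omegaml as om
    from omegaml.backends.basedata import BaseDataBackend

    class SetBackend(BaseDataBackend):
        # a toy backend that handles native Python sets
        KIND = 'python.set'

        @classmethod
        def supports(cls, obj, name, **kwargs):
            # the store calls this to decide whether to route obj here
            return isinstance(obj, set)

        def put(self, obj, name, **kwargs):
            # for brevity, persist as a plain list using the store's
            # native object support; a full backend would record its
            # own Metadata kind
            return self.data_store.put(list(obj), name, **kwargs)

        def get(self, name, **kwargs):
            return set(self.data_store.get(name, **kwargs))

    # register the backend so datasets.put can route sets to it
    om.datasets.register_backend(SetBackend.KIND, SetBackend)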
How omega-ml works¶
data is stored via the datasets.put API. datasets.put supports native Python objects such as dicts and lists, Pandas DataFrames and Series, and numpy arrays, as well as externally stored files accessible through http or ftp or stored on cloud services like Amazon S3. Other datatypes can be added through a custom data backend.
machine learning models are stored via the models.put API. models.put supports many frameworks out of the box. Other machine learning frameworks can be added by wrapping them in a virtual object, or by writing a custom model backend.
jobs (custom Python scripts in the form of Jupyter notebooks) are stored via the jobs.put API.
the runtime cluster and any other authorized user can access the data, models and jobs through the datasets.get, models.get and jobs.get methods, respectively. Using this common API, any compute job, e.g. one that trains a model, can access the relevant data directly, without first transferring the data to the worker instance.
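The following is a minimal sketch of this flow, assuming a default local omega-ml installation with scikit-learn available; the dataset and model names are illustrative:

    import pandas as pd
    import omegaml as om
    from sklearn.linear_model import LinearRegression

    # store a dataset in the repository
    df = pd.DataFrame({'x': range(10)})
    df['y'] = df['x'] * 2
    om.datasets.put(df, 'sample', append=False)

    # store an untrained model
    om.models.put(LinearRegression(), 'regmodel')

    # train and predict on the runtime cluster; the worker reads
    # 'sample' directly from the store, no data transfer by the client
    om.runtime.model('regmodel').fit('sample[x]', 'sample[y]').get()
    pred = om.runtime.model('regmodel').predict('sample[x]').get()

Note the column-style references such as sample[x]: the worker resolves these against the stored dataset, so only the task message travels from the client to the cluster.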
omega-ml is composed of the following main components:
Core components¶
The core components provide the storage for data and models. Models can be trained locally and stored in the cluster for prediction via the REST API.
Omega - the main API and programming interface to omega-ml
OmegaStore - the storage for data and models
OmegaRuntime - the Celery runtime cluster to train and execute models and jobs
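In code, these components surface as follows; a minimal sketch assuming a default installation:

    import omegaml as om

    # the omegaml module exposes a ready-made Omega instance whose
    # attributes map directly to the components above
    print(type(om.datasets))  # OmegaStore for datasets
    print(type(om.models))    # OmegaStore for models
    print(type(om.runtime))   # OmegaRuntime, submitting Celery tasks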
Commercial Edition¶
The omega-ml Commercial Edition provides a fully integrated, commercial-scale data science platform as a service. It is the best match for a multi-user environment with security features and an extended set of functionality.
security features - end-to-end security covering all components (REST API, MongoDB, RabbitMQ etc.)
omegaweb - a secured REST API, web interface and dashboard
omegaops - cloud manager operations
omegajobs - JupyterHub with per-user Notebooks
apphub - application hub to provision and host data applications
Third-party dependencies¶
omega-ml depends on the following third-party products (all open source):
MongoDB - the highly scalable NoSQL database, ideal for data science workloads
RabbitMQ - the most-widely used open source message broker
Celery - the efficient, high-throughput distributed task queue for Python applications
MySQL - the world’s most popular open source database, backed by Oracle
Note that omega-ml's license does not cover the above products. However, omega-ml provides the Docker build instructions required to download, install and configure these applications for use with omega-ml.
omega-ml also uses a number of smaller third-party components from the Python ecosystem. Refer to the LICENSES file for details.