The boundaries between data management and advanced analytics are blurring fast. Databases and data warehouses are adding the ability to build, train, and validate machine learning models right where the data sits. When ML and data preparation live in separate systems, the round trip for investigative analytics is long and ponderous: the traditional approach has high latency at big-data scale because it depends on moving (and reformatting) data from the DBMS to the ML/analytics environment.
A growing number of databases are embedding ML functions and notebooks, often with built-in Apache Spark. These notebooks let SQL queries coexist with training code written in languages such as R and Python. Teradata Vantage, the venerable data warehouse, ships more than 200 pre-packaged ML functions and algorithms, spanning data preparation, statistics (linear regression, K-Means, Random Forest, Naïve Bayes, SVM, XGBoost, etc.), sentiment analysis, and graph and time-series analysis.
Oracle Autonomous Database with AutoML, Google Cloud BigQuery ML, Microsoft SQL Server and Cosmos DB with Spark integration, and Pivotal Greenplum with Apache MADlib (an open-source in-database machine learning library with GPU-accelerated deep learning) are only a few examples. Amazon Aurora's integration with Amazon SageMaker and Amazon Comprehend lets it run NLP on free-form text, e.g. tweets, performing sentiment analysis in a single-line SQL query.
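The pattern behind that single-line query, an ML function callable inline from SQL, can be sketched in miniature. The example below uses Python's sqlite3 with a toy lexicon-based `sentiment` UDF standing in for a real NLP service such as Amazon Comprehend; the function, table, and column names are illustrative, not Aurora's actual API.

```python
import sqlite3

# Toy word lists standing in for a real NLP service (illustrative only).
POSITIVE = {"great", "love", "fast"}
NEGATIVE = {"slow", "hate", "broken"}

def sentiment(text: str) -> str:
    """Classify text by counting lexicon hits (a stand-in for real NLP)."""
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "POSITIVE" if score > 0 else "NEGATIVE" if score < 0 else "NEUTRAL"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (body TEXT)")
conn.executemany("INSERT INTO tweets VALUES (?)",
                 [("I love this product, it is great",),
                  ("support is slow and the app is broken",)])

# Register the function so SQL can call it inline, mirroring a
# one-line in-database sentiment query.
conn.create_function("sentiment", 1, sentiment)
rows = conn.execute("SELECT body, sentiment(body) FROM tweets").fetchall()
for body, label in rows:
    print(label, "-", body)
```

The point of the pattern is that the text never leaves the database: the analytic function travels to the data rather than the other way around.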
Azure's Synapse Analytics platform creates models in Python in the Open Neural Network Exchange (ONNX) format. The model is stored in a table as a binary object and can be called from SQL. For example, the model can be trained on New York City taxi data using features such as passenger count and trip distance, then applied to the full taxi table to predict, say, the tip amount. The model retrains continually as the table is refreshed every few minutes with new trip data.
Splice Machine, a row-and-column (HTAP) scale-out database built on Apache Spark with an underlying transactional key-value store, takes it a step further. Because it is natively built on Spark, it integrates readily with MLflow, Jupyter, H2O, TensorFlow, Keras, and scikit-learn, and it provides complete ML lifecycle capabilities including some MLOps features. For example, using the MLflow UI, multiple versions of a model can be compared simply by changing the type of algorithm or the features. Each model is stored as a binary object with a unique run ID. When the time comes to deploy the model, with one line of code Splice Machine automatically creates the necessary table (with feature and prediction columns) and the trigger needed to run the model every time new data arrives. Splice Machine can also 'time travel' to a particular run ID and recover the state of the data used to train the model at that specific time, which aids model reproducibility.
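The deploy-time pattern described above, a prediction table plus a trigger that scores each arriving row, can be sketched with sqlite3. The `predict_tip` function here is a hypothetical linear model, not Splice Machine's generated code, and the schema is illustrative.

```python
import sqlite3

def predict_tip(passengers, distance):
    # Hypothetical model; in Splice Machine this would be the
    # deployed binary model identified by its run ID.
    return 0.2 * passengers + 0.4 * distance

conn = sqlite3.connect(":memory:")
conn.create_function("predict_tip", 2, predict_tip)
conn.execute("""CREATE TABLE taxi_trips (
    passenger_count INTEGER, trip_distance REAL, prediction REAL)""")

# The trigger scores every newly inserted row, so predictions stay
# current without any application code invoking the model explicitly.
conn.execute("""
CREATE TRIGGER score_new_trip AFTER INSERT ON taxi_trips
BEGIN
    UPDATE taxi_trips
    SET prediction = predict_tip(NEW.passenger_count, NEW.trip_distance)
    WHERE rowid = NEW.rowid;
END""")

conn.execute(
    "INSERT INTO taxi_trips (passenger_count, trip_distance) VALUES (2, 5.0)")
row = conn.execute("SELECT * FROM taxi_trips").fetchone()
print(row)  # prediction column filled in by the trigger
```

Pushing scoring into a trigger is what turns a stored model into an always-on service: inference happens as a side effect of the insert.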
One drawback cited frequently is overloading operational databases with resource-intensive ML training. Splice Machine handles this by giving data scientists on-demand dedicated CPUs and memory for their Jupyter notebook instances, along with dedicated Spark executors managed by Kubernetes.
The ability to execute ML functions in-database using SQL is not new, but it will become table stakes in the near future. We also expect some MLOps functionality to move into more databases over time. This will reduce integration cost and complexity and, by reducing the number of copies of data, help with data governance.