Last year I wrote about eight databases that support in-database machine learning. In-database machine learning matters because it brings the processing to the data, which is far more efficient for big data than forcing data scientists to extract subsets of the data and move them to wherever training and inference run.
These databases each work in a different way:
- Amazon Redshift ML uses SageMaker Autopilot to automatically create prediction models from the data you specify via a SQL statement; that data is extracted to an Amazon S3 bucket. The best prediction function found is registered in the Redshift cluster.
- BlazingSQL can run GPU-accelerated queries on data lakes in Amazon S3, pass the resulting DataFrames to RAPIDS cuDF for data manipulation, and finally perform machine learning with RAPIDS XGBoost and cuML, and deep learning with PyTorch and TensorFlow.
- BigQuery ML brings much of the power of Google Cloud Machine Learning into the BigQuery data warehouse with SQL syntax, without extracting the data from the data warehouse.
- IBM Db2 Warehouse offers a wide set of in-database SQL analytics, including some basic machine learning functionality, plus in-database support for R and Python.
- Kinetica provides a full in-database lifecycle solution for machine learning accelerated by GPUs, and can calculate features from streaming data.
- Microsoft SQL Server can train machine learning models and run inference with them in multiple programming languages.
- Oracle Cloud Infrastructure can host data science resources integrated with its data warehouse, object store, and functions, allowing for a full model development lifecycle.
- Vertica has a nice set of machine learning algorithms built in, and can import TensorFlow and PMML models. It can do prediction from imported models as well as its own models.
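What all of these have in common, bringing the model to the data instead of the data to the model, can be illustrated with a toy sketch. It uses Python's built-in `sqlite3` module purely as a stand-in; SQLite is not one of the databases listed above, and the `readings` table, the `predict` function, and the model weights are all hypothetical. The point is only that the scoring function is invoked from within the SQL query, so the client receives results rather than raw rows.

```python
import sqlite3

# Hypothetical pre-trained linear model; the weights are made up for illustration.
WEIGHTS = {"intercept": 1.0, "x1": 2.0, "x2": -0.5}


def predict(x1, x2):
    """Score one row with the toy linear model."""
    return WEIGHTS["intercept"] + WEIGHTS["x1"] * x1 + WEIGHTS["x2"] * x2


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, x1 REAL, x2 REAL)")
conn.executemany(
    "INSERT INTO readings (id, x1, x2) VALUES (?, ?, ?)",
    [(1, 1.0, 2.0), (2, 0.0, 4.0), (3, 3.0, 0.0)],
)

# Register the model as a SQL function so the query engine calls it per row.
conn.create_function("predict", 2, predict)

# Scoring and filtering happen inside the query; only matching rows are returned.
high_scores = conn.execute(
    "SELECT id, predict(x1, x2) AS score FROM readings "
    "WHERE predict(x1, x2) > 1.5 ORDER BY id"
).fetchall()
print(high_scores)
```

The production systems above go much further, for example by training models in place or distributing inference across nodes or GPUs, but the calling pattern, a prediction exposed as something you can use in a SQL statement, is similar in spirit.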
Now there’s another database that can run machine learning internally: Snowflake.