Databases have always been able to do simple, clerical work like finding the records that match given criteria: say, all users who are between 20 and 30 years old. Lately, database companies have been adding artificial intelligence routines to their products so that users can apply these more sophisticated algorithms to the data they already store there.
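As a point of reference, the clerical kind of query looks something like the sketch below. It uses Python's built-in sqlite3 module; the `users` table, its columns, and the sample rows are invented for illustration and don't come from any particular product.

```python
import sqlite3

# In-memory database with a hypothetical `users` table (name, age).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("Ana", 19), ("Ben", 20), ("Cy", 25), ("Dee", 30), ("Eli", 31)],
)


def users_in_age_range(conn, low, high):
    """Return the names of users whose age is between low and high, inclusive."""
    rows = conn.execute(
        "SELECT name FROM users WHERE age BETWEEN ? AND ? ORDER BY age",
        (low, high),
    )
    return [name for (name,) in rows]


matches = users_in_age_range(conn, 20, 30)
print(matches)
```

The database only has to compare each row against a fixed rule. Nothing is learned from the data, which is the gap the AI routines are meant to fill.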
The AI algorithms are also finding a home below the surface, where the AI routines help optimize internal tasks like re-indexing or query planning. These new features are often billed as adding automation because they relieve the user of housekeeping work. Developers are encouraged to let them do their work and forget about them.
There’s much more interest, though, in AI routines that are open to users. These machine learning algorithms can classify data and make smarter decisions that evolve and adapt over time. They can unlock new use cases and enhance the flexibility of existing algorithms.
In many cases, the integration is largely pragmatic and essentially cosmetic. The calculations are no different from those that would occur if the data were exported and shipped to a separate AI program. Inside the database, the AI routines remain separate and simply take advantage of direct internal access to the data. Sometimes this faster access can speed up the process dramatically. When the data set is large, merely moving it can take up a large portion of the total time.
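The difference between the two paths can be sketched with a simple average standing in for the AI routine. The example below assumes a hypothetical `readings` table in SQLite; one function pulls every row out to the client before computing, while the other lets the database do the arithmetic and return only the results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 20.0), ("b", 30.0)],
)


def mean_by_export(conn):
    """Pull every row out of the database, then aggregate in the client."""
    totals = {}
    for sensor, value in conn.execute("SELECT sensor, value FROM readings"):
        count, total = totals.get(sensor, (0, 0.0))
        totals[sensor] = (count + 1, total + value)
    return {s: total / count for s, (count, total) in totals.items()}


def mean_in_database(conn):
    """Let the database aggregate; only one row per sensor comes back."""
    rows = conn.execute(
        "SELECT sensor, AVG(value) FROM readings GROUP BY sensor"
    )
    return dict(rows)


print(mean_by_export(conn))
print(mean_in_database(conn))
```

Both functions return the same numbers. The only thing that changes is how many rows have to travel, which is why the savings grow with the size of the table.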
The integration can also limit analysis to the algorithms that are officially part of the database. If the users want to deploy a different algorithm, they must return to the old process of exporting the data in the right format and importing it into the AI routine.
The integration can take advantage of some of the newer distributed, in-memory databases that spread the load and the data storage over multiple machines. These can handle a large amount of data. If complex analysis is necessary, it may not be hard to increase the CPU capacity and RAM allocated to each machine.
Some AI-powered databases are also able to leverage GPU chips. Some AI algorithms use the highly parallel architecture of the GPUs to train machine learning models and run other algorithms. There are also some custom chips specially designed for AI that can dramatically accelerate the analysis.
One of the biggest advantages, though, may be the standard interface, which is often SQL, a language that’s already familiar to many programmers. Many software packages already interact easily with SQL databases. If someone wants more AI analysis, it’s no more complex than learning the new SQL instructions.
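To give a rough sense of what calling a model from SQL can look like, here is a minimal sketch using SQLite's ability to register an application function. The `churn_score` function is a hypothetical stand-in for a trained model, and the table and its columns are assumptions made for the example; commercial databases expose their own functions and syntax.

```python
import sqlite3


def churn_score(logins_last_month, support_tickets):
    """Hypothetical stand-in for a trained model: a fixed linear score clipped to 0..1."""
    raw = 0.8 - 0.05 * logins_last_month + 0.1 * support_tickets
    return max(0.0, min(1.0, raw))


conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL under the name churn_score (two arguments).
conn.create_function("churn_score", 2, churn_score)

conn.execute("CREATE TABLE customers (name TEXT, logins INTEGER, tickets INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Ana", 20, 0), ("Ben", 2, 3), ("Cy", 10, 0)],
)

# The "model" is now just another function inside an ordinary query.
at_risk = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM customers "
        "WHERE churn_score(logins, tickets) > 0.5 ORDER BY name"
    )
]
print(at_risk)
```

Any tool that can already send a SELECT statement to the database can use the score without knowing how it is computed.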
What are established companies doing?
Artificial intelligence is a very competitive field now. All of the major database companies are exploring integrating the algorithms with their tools. In many cases, the companies offer so many options that it’s impossible to summarize them here.
Oracle has integrated AI routines into its databases in a number of ways, and the company offers a broad set of options in almost every corner of its stack. At the lowest levels, some developers, for instance, are running machine learning algorithms in the Python interpreter that’s built into Oracle’s database. There are also more integrated options like Oracle Machine Learning for R, which uses R to analyze data stored in Oracle’s databases. Many of the services are incorporated at higher levels, for example as features for analysis in the data science tools or analytics.
IBM also has a number of AI tools that are integrated with its various databases, and the company sometimes calls Db2 “the AI database.” At the lowest level, the database includes functions in its version of SQL to tackle common parts of building AI models, like linear regression. These can be threaded together into customized stored procedures for training. Many IBM AI tools, such as Watson Studio, are designed to connect directly to the database to speed model construction.
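To see why linear regression fits naturally into SQL, consider that a least-squares line through a set of points needs only a handful of sums. The sketch below is not Db2's syntax; it runs the textbook formulas with plain aggregates in SQLite, over a hypothetical `points` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO points VALUES (?, ?)",
    [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)],
)

# Ordinary least squares for y = slope * x + intercept, using only sums:
#   slope     = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2)
#   intercept = (Sy - slope*Sx) / n
FIT_SQL = """
WITH s AS (
    SELECT COUNT(*) AS n, SUM(x) AS sx, SUM(y) AS sy,
           SUM(x * y) AS sxy, SUM(x * x) AS sxx
    FROM points
)
SELECT (n * sxy - sx * sy) / (n * sxx - sx * sx) AS slope,
       (sy - (n * sxy - sx * sy) / (n * sxx - sx * sx) * sx) / n AS intercept
FROM s
"""

slope, intercept = conn.execute(FIT_SQL).fetchone()
print(slope, intercept)
```

The sample points lie exactly on the line y = 2x + 1, so the query recovers a slope of 2 and an intercept of 1. The fit happens in a single pass over the table, with no rows leaving the database.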
Hadoop and its ecosystem of tools are commonly used to analyze big data sets. While they are often thought of as more data processing pipelines than databases, there’s often a database like HBase buried inside. Some people use the Hadoop Distributed File System to store data, sometimes in CSV format. A variety of AI tools are already integrated into the Hadoop pipeline using tools like Submarine, making it effectively a database with integrated AI.
All of the major cloud companies offer both databases and artificial intelligence products. The amount of integration between any particular database and any particular AI varies substantially, but it is often fairly easy to connect the two. Amazon’s Comprehend, a tool for analyzing natural language text, accepts data from S3 buckets and stores the answers in many locations, including some AWS databases. Amazon’s SageMaker can access data from S3 buckets or Redshift data warehouses, sometimes using SQL via Amazon Athena. While it is fair to ask whether these count as true integration, there’s no doubt that they simplify the pathway.
In Google Cloud, the AutoML tool for automated machine learning can grab data from BigQuery databases. Firebase ML offers a number of tools for tackling the common challenges for mobile developers, such as classifying images. It will also deploy any trained TensorFlow Lite model to work on your data.
Microsoft Azure also offers a collection of databases and AI tools. The Databricks tool, for instance, is built upon the Apache Spark pipeline and comes with connections to Azure’s Cosmos DB, its Data Lake storage, and other databases like Neo4j or Elasticsearch that may be running inside of Azure. Its Azure Data Factory is designed to find data throughout the cloud, both in databases and generic storage.
What are the upstarts doing?
A number of database startups are also highlighting their direct support of machine learning and other AI routines. SingleStore, for example, offers fast analytics for tracking incoming telemetry in real time. This data can also be scored according to various AI models as it is ingested.
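The idea of scoring data on the way in can be sketched in a few lines. This is not SingleStore's API; it is a generic illustration in SQLite, where `anomaly_score` is a hypothetical stand-in for a real model and the `telemetry` table is invented for the example.

```python
import sqlite3


def anomaly_score(reading):
    """Hypothetical stand-in for a model: distance from an assumed normal level of 50."""
    return abs(reading - 50.0) / 50.0


conn = sqlite3.connect(":memory:")
conn.create_function("anomaly_score", 1, anomaly_score)
conn.execute(
    "CREATE TABLE telemetry (id INTEGER PRIMARY KEY, reading REAL, score REAL)"
)


def ingest(conn, readings):
    """Insert each reading together with its score, computed in the same statement."""
    conn.executemany(
        "INSERT INTO telemetry (reading, score) VALUES (?, anomaly_score(?))",
        [(r, r) for r in readings],
    )


ingest(conn, [50.0, 75.0, 100.0])
scores = [row[0] for row in conn.execute("SELECT score FROM telemetry ORDER BY id")]
print(scores)
```

Because the score is stored next to the raw reading, later queries can filter or rank on it without running the model again.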
MindsDB adds machine learning routines to standard databases like MariaDB, PostgreSQL, or Microsoft SQL Server. It extends SQL to include features for learning from the data already in the database to make predictions and classify objects. These features are also easily accessible in more than a dozen business intelligence applications, such as Salesforce’s Tableau or Microsoft’s Power BI, that work closely with SQL databases.
Many of the companies effectively bury the database deep into the product and sell only the service itself. Riskified, for example, tracks financial transactions using artificial intelligence models and offers merchants protection through “chargeback guarantees.” The tool ingests transactions and maintains historical data, but there’s little discussion of the database layer.
In many cases, the companies that may bill themselves as pure AI companies are also database providers. After all, the data needs to sit someplace. H2O.ai, for example, is just one of the AI cloud providers that offer integrated data preparation and artificial intelligence analysis. The data storage, though, is more hidden, and many people think of software like H2O.ai’s first for its analytical power. Still, it can both store and analyze the data.
Is there anything integrated AI databases can’t do?
Adding AI routines directly to the feature set of a database can make life simpler for developers and database administrators. It may also make analysis a bit faster in some cases. But beyond the convenience and speed of working with one dataset, this doesn’t offer any large, continual advantage over exporting the data and importing it into a separate program.
The process can limit developers who may choose to only explore the algorithms that are directly implemented inside the database. If the algorithm isn’t part of the database, it’s not an option.
Of course, many problems can’t be solved with machine learning or artificial intelligence at all. Integrating the AI algorithms with the database does not change the power of the algorithms — it merely speeds them up.