Data is the most critical yet still undervalued asset of enterprises. How companies approach their data infrastructure will determine which of them live on and which bite the dust in the coming decades.
Our recent article “Feature Store as a Foundation for Machine Learning” covered recent changes in the big data tools landscape. The advent of Data Observability Platforms and in-house Global Data Catalogs as successors of traditional Data Catalogs is one of the most profound shifts.
In this article, we will discuss why traditional data catalogs are finding it harder to meet the critical needs of data-driven enterprises and their data operations, which crucial pieces of the modern data platform puzzle are missing, and what a next-gen data platform could look like.
The 2010s were marked by the heyday of data catalogs. Still, interviews with data-driven enterprises show that 90% of data professionals are dissatisfied with data discovery, reporting that it eats up to 50% of their work time. What’s the reason?
The explosion of big data reshaped the demands of data professionals and their expectations of tools. Volumes of data, the number of solutions, and the complexity of data pipelines are skyrocketing. Coupled with the expanding number of people dealing with data in organizations, this creates an entirely new set of infrastructural challenges.
Despite all the effort of the past decade, the infrastructure for exchanging metadata and building scalable open data platforms is still in its infancy. Siloed metadata formats and tools unable to communicate became a major bottleneck that blocks data democratization, and truly end-to-end, enterprise-wide data discovery and observability.
While companies are rushing to buy data catalogs for their functionality without paying attention to how they collect, store, and manage data, the lack of integrated solutions leads to complex data pipelines, forcing organizations to jump through many hoops to bring everything together instead of getting actionable insights.
This absence of a solid underlying layer that would unify and simplify metadata collection blocks many important developments and scalability. Also, isn’t it a paradox that to democratize data and make it more open and observable, we are turning to solutions that tend to lock up metadata?
Benefits of Data Catalogs
Data-first organizations have always been looking to improve understanding and management of their data — the crucial resource for driving business value.
For a long time before data catalogs, it was a challenge to know where data lives and to get proper context for evaluating it. People had to ask around every time, so data discovery and assessment took up too much time. Companies needed to bring multiple siloed data sources under a single UI, accessible company-wide, to democratize their data.
Data catalogs emerged to solve this challenge and became an essential step towards better data discovery. They became a central point for anyone in the company to find data, evaluate it, and understand who uses it and how.
Today, data catalogs offer important benefits for data-driven companies:
- Collect and centralize all the company’s metadata
- Show how relevant data is, who uses it, and how
- Provide information to evaluate data and decide whether it matters
- Offer data management tools like search, networking, and social features
- Improve time-to-discovery and time-to-value
Data catalogs made the lives of data-first organizations much easier. If you need to get your data organized, you can adopt an open-source data catalog like Amundsen or purchase a SaaS solution to make your data discoverable right away.
There are three main shifts that are happening now in the realm of data discovery solutions:
- From general-purpose data catalogs to purpose-built ones. Data catalogs are becoming more specialized and tailored to the needs of various data teams and user groups, such as data science and data engineering.
- From data catalogs to data observability platforms. Data quality and observability have become an integral part of the data discovery experience over the last few years. Data discovery alone doesn’t cut it anymore and lacks a lot of important context.
- From each tool building its own metadata scrapers to a shared metadata standard. Siloed metadata collection has become the major bottleneck, slowing down integration efforts and making data pipelines and platforms unnecessarily complex. Each tool writes its own set of parsers, while these efforts could be pooled for the greater good.
These transformations happening in parallel can help us get a glimpse of where the Big Data ecosystem is going and what gaps are present in existing solutions that prevent us from getting there.
What Are Data Catalogs Lacking?
Data catalogs have a hard time keeping up with this new reality and addressing the modern requirement for a better integrated, faster, and more transparent set of tools to help tackle data downtime.
Based on research into existing data discovery and observability tools, these seem to be the reasons data catalogs have a hard time keeping pace:
- Non-standardized metadata collection
- Incompatibility of data catalogs (the need to re-collect metadata)
- Limited, not truly company-wide end-to-end data lineage
- Absent or insufficient data quality and observability
- Undiscoverable ML world
The absence of a universal metadata standard blocks resolving the rest. This elephant in the room disrupts integration possibilities for the whole data ecosystem and keeps data discovery inefficient.
Let’s dive into each issue and then look at how it could be solved.
1. Platform Incompatibility
In the heat of the rapidly developing data and ML markets, enterprises often start acquiring new data tools to stay on the cutting edge. They grow ripe for a data discovery solution and deploy a data catalog like Amundsen, then decide it’s time for more observability and purchase a product like Monte Carlo, then add a feature store for their MLOps team.
This shiny object syndrome might be great for immediate firefighting or experimenting. However, it can quickly inflate your infrastructure and create a lot of (often hidden) overhead. When a company chooses an easy short-term solution over a strategic investment in a long-term approach, it creates technical debt: the cost of rework. You will spend substantial resources deploying each of these tools, yet won’t be able to use them effectively because they don’t integrate well, and you will keep rebuilding until you arrive at a sustainable solution.
Data catalogs, as well as observability tools and feature stores, gather data separately. Most of them use incompatible data formats and tend to vendor-lock metadata. When you want to switch to another catalog or add one to your existing infrastructure, you have to start gathering all the metadata from scratch.
Adding new data tools to a company’s existing data mesh is also a pain. There is no simple, straightforward way of doing so. You can’t plug one data catalog into another as a data source and use them from the same UI.
Large tech companies increasingly need federated solutions from the data mesh standpoint. The natural growth of mid-size and smaller organizations will inevitably make them face the same problem soon.
With the Big Data and ML industries’ tool landscapes growing by the day, data-driven enterprises that want a working data platform and true data democratization can’t afford to put the compatibility issue on the back burner.
2. Limited Data Lineage
Data lineage is a crucial pillar of data observability. It provides context to data pipeline events and helps locate issues by showing related downstream and upstream assets.
Data catalogs listing end-to-end data lineage among their features imply that all parts of a company’s pipelines exist within a single discovery tool. However, with the exploding complexity of the modern data landscape, this is increasingly not the case.
Truly end-to-end lineage can exist only in a system that embraces all data entities used in an organization across all its discovery solutions. Otherwise, it can’t map dependencies between all the critical assets to provide observability.
There are three main issues with data lineage offered by existing data catalogs that prevent company-wide observability and data democratization:
- No data federation = no company-wide lineage for all data assets
- Coverage of only a limited range of data entities, and no ML entities
- Metadata doesn’t get enriched with data quality information
The OpenLineage specification has made strides toward standardizing the data lineage discovery process. Nevertheless, it doesn’t cover entities outside of the data lake and warehouse world, such as dashboards, ML pipelines, ML models, and feature stores, and it doesn’t enrich the received metadata with information about data profiling and data quality tests.
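For a concrete sense of what standardized lineage metadata looks like, here is roughly the shape of an OpenLineage run event. The top-level field names follow the public OpenLineage schema, while the job and dataset names are invented for illustration:

```python
# A minimal OpenLineage-style run event (job and dataset names are invented).
import json
import uuid
from datetime import datetime, timezone

event = {
    "eventType": "COMPLETE",                       # START / COMPLETE / FAIL
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},           # unique id for this execution
    "job": {"namespace": "etl", "name": "daily_cart_aggregation"},
    "inputs": [{"namespace": "datalake", "name": "raw.shopping_carts"}],
    "outputs": [{"namespace": "warehouse", "name": "features.cart_stats"}],
    "producer": "https://example.com/our-etl/v1",  # what emitted this event
}

print(json.dumps(event, indent=2))
```

Because every tool emits the same structure, a catalog can stitch `inputs` and `outputs` across jobs into a lineage graph without tool-specific parsers.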
3. Invisible ML/AI World
With the advent of ML (machine learning), related entities (ML pipelines, ML experiments, feature stores) have become an integral part of the data landscape. Yet the ML/AI world stays absent and invisible in existing data discovery solutions. At the moment, none of the data catalogs and observability platforms treat ML entities as first-class citizens.
Undiscoverable machine learning entities prevent organizations from scaling ML to its full potential. Their data science and data engineering teams are left in the dark about what the other is doing.
In 2021, data-driven companies can’t afford to ignore the ML world in their data discovery ecosystems. It’s impossible to manage or scale what you can’t discover and evaluate, and disregarding ML in your data discovery workflow is a slippery slope.
4. Lack of Data Quality Tools
If you don’t have tools to monitor your data pipelines for both known and unpredictable issues, you are neither in control of data downtime nor able to offer great data discovery. Bad data drives opportunities away from you, even if your pipelines are otherwise perfect.
Currently, most data discovery platforms either don’t feature tools for data unit testing or struggle with proper implementation. According to our research, as of 2021, 55% of data professionals experience data quality issues more often than they deem acceptable.
Of the two types of data quality issues, data quality testing addresses only the first:
- Known, predictable issues that can be anticipated and tested for
- Unknown “black box” issues that are hard to predict and detect
Data unit testing with tools like Great Expectations and Deequ helps to catch predictable, well-known data issues. Meanwhile, the increasing complexity of data pipelines and teams made data downtime management the central problem in data operations. The focus rapidly shifts to the new frontier of data quality that is equipped to deal with black box data issues — data observability.
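To make the "known, predictable issues" category concrete, here is a hand-rolled sketch of data unit tests in plain Python, in the spirit of the checks that tools like Great Expectations or Deequ automate. The column names, sample rows, and thresholds are all invented:

```python
# Hand-rolled data unit tests: the kind of known-issue checks that data
# quality tools automate (column names and ranges are invented examples).
rows = [
    {"user_id": 1, "price": 19.99},
    {"user_id": 2, "price": 5.00},
]

def check_not_null(rows, column):
    """Known-issue test: a required column has no missing values."""
    return all(r.get(column) is not None for r in rows)

def check_in_range(rows, column, low, high):
    """Known-issue test: values fall inside an expected range."""
    return all(low <= r[column] <= high for r in rows)

assert check_not_null(rows, "user_id")
assert check_in_range(rows, "price", 0, 10_000)
```

Checks like these only fire on failure modes someone thought to write down, which is exactly why they leave the "black box" category untouched.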
5. Insufficient Data Observability
Observability has emerged to help DevOps teams manage application downtime. Now it is making a clever comeback in data operations to help DataOps teams combat data downtime.
While data quality tests account for known, predictable issues, observability is required to cover the unknowns. It can help you not only find issues that have already occurred but proactively prevent many of them.
Here is an example of how dangerous black-box issues can be.
Example of a Black-Box Data Issue: Unnoticed Time Shift
A typical black-box issue in pipelines that span several departments and/or involve ML is wrong data going unnoticed for weeks due to the lack of cross-department collaboration and the subtle nature of the shift in the data.
Say, a company uses an ML model to recommend clothes to customers of its online store. When users view the website, it offers a list of relevant products based on user profiles and data about shopping carts. Here is their data and ML pipeline:
The main steps of the pipeline:
- Data on shopping carts, clickstreams, and purchases land on a Data Lake
- Data Engineering team uses Airflow to ETL and publish data to a Feature Store
- Data Science team supports training pipeline in Kubeflow using data from a Feature Store
- Production ML model provides recommendations to the eCommerce website
Incident: Data Lateness
Say the Data Engineering team deploys a new version of the Airflow ETL pipeline to production. It is scheduled to run daily at 12:00 am, while the ML training pipeline runs daily at 1:00 am. The algorithm changes, and the Airflow pipeline’s execution time grows from 1 to 1.5 hours. Now the Airflow pipeline finishes after the ML pipeline has already started and is lagging behind it. The model ends up constantly running on obsolete data from the previous day. As a result, the recommendation engine stops taking into account the current state of customers’ carts and the purchases they made during the day.
Incidents like this are hard to detect because the teams responsible for different parts of the pipeline usually don’t know enough about each other’s work. The Data Engineering team doesn’t know what features the ML model uses, and the Data Science team doesn’t know about the data part. As a result, the model fails silently, and the business ends up with incorrect recommendations, lost trust, and lost revenue. It usually takes days to figure out what happened, where, and why.
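A minimal freshness guard at the start of the training pipeline would have surfaced this incident on day one. The sketch below is an illustration rather than any tool's actual API; the staleness threshold and timestamps are assumptions matching the scenario above:

```python
# Sketch of a freshness guard for the ML training pipeline: refuse to
# train when the feature data is older than an allowed staleness window.
from datetime import datetime, timedelta, timezone

MAX_STALENESS = timedelta(hours=2)  # assumed SLA for feature freshness

def is_fresh(last_updated, now=None):
    """True if the data was updated within the staleness window."""
    now = now or datetime.now(timezone.utc)
    return now - last_updated <= MAX_STALENESS

# Training starts at 1:00 am, but the delayed ETL last refreshed the
# features at 1:30 am the *previous* day:
train_start = datetime(2021, 6, 1, 1, 0, tzinfo=timezone.utc)
features_updated = datetime(2021, 5, 31, 1, 30, tzinfo=timezone.utc)

if not is_fresh(features_updated, now=train_start):
    print("ABORT: features are stale, skipping training run")
```

The check is trivial; the hard part, as the next sections argue, is that no single team owns both timestamps, which is exactly what shared metadata would fix.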
Next-Gen Metadata Platform = Discovery + Observability + Open Standard?
We think that to solve the issues we described, a next-gen data platform should check the following boxes:
- Build upon an open metadata standard
- Employ a federation strategy to allow building meta catalogs
- Include the ML ecosystem’s entities as first-class citizens
- Provide company-wide data discovery and observability
- Integrate with other open standards
The effort should start with standardizing metadata operations. Instead of instrumenting all jobs separately and having to deal with them breaking with each new software version, as happens now, it’s better to join efforts and build the missing unifying layer.
The standard would ensure the compatibility and consistency of metadata produced by various data sources, allowing quick and easy integration of any source or catalog. Most importantly, it would make the integration effort shared, easing the burden for all stakeholders and accelerating the entire big data industry.
Let’s envision such an Open Metadata standard.
1. Integrated Data Quality
The Open Metadata standard could take advantage of integration with any data quality tools, including other open standards. For instance, it could use OpenTelemetry as a data quality monitoring and observability tool.
OpenTelemetry would use metadata to gather metrics, monitor them for anomalies, and alert the owners of data assets about arising issues. Meanwhile, Open Metadata would use metadata to build end-to-end lineage and enrich alerts with information about which other entities were affected.
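As a sketch of the kind of monitoring such an integration could run, the snippet below flags a metric value that deviates sharply from its recent history. The metric, sample values, and threshold are invented, and real systems would use more robust statistics:

```python
# Sketch of metric anomaly detection for a data observability layer:
# flag a row-count metric that deviates sharply from its recent history.
from statistics import mean, stdev

def is_anomalous(history, latest, threshold=3.0):
    """Flag `latest` if it lies more than `threshold` standard
    deviations from the mean of the historical values."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold

daily_row_counts = [10_120, 9_980, 10_340, 10_050, 10_210]
assert not is_anomalous(daily_row_counts, 10_100)  # normal day
assert is_anomalous(daily_row_counts, 2_300)       # sudden drop: alert owners
```

In the envisioned setup, the lineage layer would then enrich such an alert with every downstream asset consuming the anomalous dataset.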
2. Data Observability
Here is how the Open Metadata + OpenTelemetry integration can solve the recommendation engine’s case we described earlier.
The OpenTelemetry system would pass an alert to the Open Metadata system, which would check the schedules of dependent jobs and enrich the alert with information about the data consumers affected by the issue. OpenTelemetry would then pass the enriched alert to the owners of the affected entities right after the first ETL execution, before a new model gets a chance to train on wrong data.
Alternatively, an administrator can set SLAs for entities in the Open Metadata system. It would provide end-to-end lineage and take execution metrics from the OpenTelemetry system. Owners of affected entities would receive SLA alerts containing tracing information from the pipeline execution. With it at hand, they can easily determine the what, where, and why of their missed SLAs.
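A rough sketch of such an SLA check, with the alert enriched by downstream owners taken from lineage metadata, might look like this. All asset names, schedules, and addresses are invented, and a real system would pull them from the catalog rather than a hard-coded dict:

```python
# Sketch: check a job's SLA and enrich the alert with downstream owners
# taken from lineage metadata (names, schedules, addresses are invented).
from datetime import datetime, timezone

downstream_owners = {  # lineage: who depends on each output asset
    "features.cart_stats": ["ds-team@example.com"],
}

def sla_alerts(job_finished, sla_deadline, output_asset):
    """Return one enriched alert per downstream owner if the SLA was missed."""
    if job_finished <= sla_deadline:
        return []
    late_by = job_finished - sla_deadline
    return [
        f"SLA missed for {output_asset} by {late_by}; notifying {owner}"
        for owner in downstream_owners.get(output_asset, [])
    ]

# The ETL must finish before the 1:00 am training start but ran until 1:30 am:
deadline = datetime(2021, 6, 1, 1, 0, tzinfo=timezone.utc)
finished = datetime(2021, 6, 1, 1, 30, tzinfo=timezone.utc)
for alert in sla_alerts(finished, deadline, "features.cart_stats"):
    print(alert)
```

The key point is the enrichment step: without lineage, the alert could only reach the job's own team, not the data scientists consuming its output.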
3. Real End-to-End Lineage
To facilitate end-to-end collaboration on data products, lineage needs to be company-wide. Only then will data owners get a full picture of how data flows between their assets and those of other owners, and how they are affected by changes elsewhere in the pipeline.
We think that end-to-end lineage should:
- Include all organization’s data assets
- Cover a wide range of entities, including ML
- Enrich metadata with data profiling and data quality information
- Show the full history of a data asset from ingestion to end product (report, model, etc.)
This kind of lineage will provide a full context of connections and flows between all assets and will allow you to know how your assets are affected when something changes or breaks in those of other owners.
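One way to picture this is lineage as a graph that can be walked downstream from a broken asset to find everything it affects. The sketch below mirrors the recommendation-engine pipeline from earlier, with invented asset names, and is not tied to any particular catalog:

```python
# Sketch of company-wide lineage as a graph: given a broken asset, walk
# the edges to find every downstream asset affected (names are invented).
from collections import deque

lineage = {  # asset -> direct downstream consumers
    "raw.shopping_carts": ["features.cart_stats"],
    "features.cart_stats": ["ml.recommender_training"],
    "ml.recommender_training": ["ml.recommender_model"],
    "ml.recommender_model": ["web.recommendations"],
}

def affected_downstream(asset):
    """BFS over lineage edges; returns all transitively affected assets."""
    seen, queue = set(), deque([asset])
    while queue:
        for child in lineage.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

print(sorted(affected_downstream("raw.shopping_carts")))
```

Note that the walk crosses team boundaries, from a data lake table owned by data engineering all the way to an ML model and a website feature, which is precisely what single-tool lineage cannot do today.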
4. Discoverable ML/AI World
The inclusion of ML infrastructure in the data discovery ecosystem is long overdue. The Open Metadata standard would help build connectors between companies’ data catalogs and ML solutions to start pushing metadata about ML training data into the upstream catalogs.
The Open Metadata standard could bring ML entities as first-class citizens to data discovery infrastructure, making them visible and discoverable through data catalogs.
Discoverable ML entities would drastically improve collaboration between the data science team and the other teams it needs to exchange data with to build ML models. The resulting gains in cross-department collaboration, data discovery efficiency, data quality, and time-to-market of data products are hard to overestimate.
5. Open Metadata Standard
Open Metadata is an open-source, industry-wide standard for data discovery. It provides a set of technologies to collect and export metadata from cloud-native applications and infrastructure so that it can be discovered. The standard defines a schema for metadata collection and integrates with data tools through endpoints to receive metadata from them.
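To illustrate the idea of a fixed, tool-agnostic schema, here is a sketch of what a standardized dataset-metadata payload might look like. The field names are illustrative only, not the actual Open Metadata schema, and the catalog endpoint is left out:

```python
# Sketch of a standardized metadata payload: a fixed schema that any tool
# could emit and any catalog could ingest (field names are illustrative,
# not the actual Open Metadata schema).
from dataclasses import dataclass, field, asdict

@dataclass
class DatasetMetadata:
    name: str
    namespace: str
    owner: str
    schema_fields: list = field(default_factory=list)
    tags: list = field(default_factory=list)

payload = asdict(DatasetMetadata(
    name="features.cart_stats",
    namespace="warehouse",
    owner="data-eng@example.com",
    schema_fields=[{"name": "user_id", "type": "BIGINT"}],
    tags=["ecommerce"],
))
# `payload` is now a plain dict, ready to serialize and push to a
# catalog's ingest endpoint.
print(payload["name"])
```

With a schema like this shared across tools, "plugging in" a new source becomes a matter of emitting one well-known payload instead of writing a bespoke scraper per catalog.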
The Open Metadata standard would allow building global federated data catalogs with either hierarchical or horizontal federation. Here is an example reference architecture of such a meta catalog.
The example global catalog employs hierarchical federation, push and pull strategies, and APIs for each data source to provide a meta-discovery experience. All the data catalogs and sources it is composed of are centralized under a single UI with fine-grained access permissions. The solution can easily be scaled by plugging in any number of data catalogs or sources through adapters or the API endpoints they expose to be discovered. It uses the push strategy to gather metadata from already discovered entities.
Such a meta catalog would spare users the effort of re-collecting metadata every time they need to use another data catalog. Unlike solutions that force you to maintain complex environments on AWS, it requires only PostgreSQL to be deployed.
A meta catalog built on Open Metadata standard can embrace all important opportunities we talked about in this post: federation, real end-to-end lineage, data quality assurance, company-wide observability, and bring discoverable ML assets to the picture.
The legacy of old data approaches and architectures still shapes modern data discovery and observability. While data pipelines are light years ahead of where they were a decade ago, data teams still have to deal with siloed solutions that make them spend most of their time on data discovery or firefighting data downtime instead of building data products. Inefficient metadata exchange between data tools is so mundane that it has become accepted as unavoidable.
But it doesn’t have to stay like this.
We shared our vision of how this problem can be solved with a joint, directed effort. Successful precedents like OpenTelemetry show that this approach can work and bring huge benefits to all participants.
An open standard for collecting metadata could become a sound solution to the lack of efficient discovery and observability and a solid foundation for the next-gen data platform.
How has your experience with the lack of cross-company data discovery and observability been? If your organization is thinking of building an in-house meta catalog, we’d like to hear about the challenges you are encountering on the journey.