
Towards "Living" Data Catalogs

Towards “Living” Data Catalogs
Part 1: The Problem with Current Data Portals

Posted by: Joel Natividad and Sami Baig
Category: CKAN, Data Catalog, Metadata, Solutions, Success

Since 2011, we’ve deployed more than 100 CKAN data portals. Most of them are traditional open data portals in the public sector, but several are internal data exchanges and enterprise metadata catalogs in the private sector.

Over the past 12 years, the most challenging aspect, without exception, has not been the initial deployment but rather the task of persuading data publishers to embrace and consistently maintain the data catalog. This entails ensuring that both the data and its associated metadata remain current, valuable, and responsive to the needs of their users.

To remain relevant and to serve as the canonical source of enterprise metadata, the Data Catalog must make creating, maintaining, inferring, and updating metadata as painless as possible for users – both machines and humans.

Unfortunately, there were several impediments to doing so:

  1. Manual Compilation of Metadata: Metadata had to be manually compiled and entered into lengthy web forms.
  2. Limited Data Dictionary: The automatically generated CKAN data dictionary was rudimentary, containing only labels and descriptions. Data types had to be manually specified, and additional metadata, such as summary statistics describing the data’s characteristics, was lacking.
  3. Data Update Challenges: Updating data often disrupted the publishing process due to inevitable data quality issues, and there was no integrated data update validation.
  4. Manual Data Validation: Data stewards were required to conduct manual data validation, quality checks, and privacy screenings before attempting to publish or update data.
  5. Complex Data Preparation: Even before publishing, data preparation involved a convoluted, largely manual process.
    Does the data still comply with its data dictionary definition? Did the schema change? Does it have Personally Identifiable Information (PII)? Are there duplicate rows? Is the data sorted? Are there typos in the data that break validation rules (e.g., entering “Unknown” in a column when an integer was expected)? Are there invalid column names that cause the PostgreSQL database to abort uploads? Is the data in Excel format that needs to be exported to CSV? Is the structure of the CSV valid – are there missing delimiters, extra or missing columns, is it UTF-8 encoded? Is the data inside the CSV valid? Do the values in each column conform to the expected data type, domain, range, and enumerated values? Etcetera, etcetera, etcetera… (A rough sketch of what these checks can look like follows this list.)
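
To make item 5 concrete, here is a minimal, hypothetical sketch of the kind of pre-publication checks a data steward might script by hand today. The expected schema, the file name, and the PII pattern are illustrative assumptions, not part of any pipeline we shipped:

```python
# pre_publish_checks.py – a hypothetical, minimal pre-publication checklist.
# The schema, file name, and PII pattern below are illustrative assumptions.
import re

import pandas as pd

EXPECTED_SCHEMA = {          # column name -> expected pandas dtype (assumed)
    "permit_id": "int64",
    "fee": "float64",
    "status": "object",
}
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")   # naive PII screen


def sanitize_column(name: str) -> str:
    """Make a column name safe for PostgreSQL (lowercase, snake_case)."""
    return re.sub(r"\W+", "_", name.strip().lower()).strip("_")


def check_csv(path: str) -> list[str]:
    """Return a list of human-readable problems found in the CSV."""
    problems = []
    try:
        df = pd.read_csv(path, encoding="utf-8")   # fails on non-UTF-8 / malformed CSVs
    except (UnicodeDecodeError, pd.errors.ParserError) as exc:
        return [f"structurally invalid CSV: {exc}"]

    # Schema drift: did columns disappear or appear since the data dictionary was written?
    missing = set(EXPECTED_SCHEMA) - set(df.columns)
    extra = set(df.columns) - set(EXPECTED_SCHEMA)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected columns: {sorted(extra)}")

    # Type conformance: does each column still cast to its declared type?
    for col, dtype in EXPECTED_SCHEMA.items():
        if col in df.columns:
            try:
                df[col].astype(dtype)
            except (ValueError, TypeError):
                problems.append(f"column {col!r} no longer conforms to {dtype}")

    # Duplicate rows, and column names PostgreSQL would choke on.
    if df.duplicated().any():
        problems.append(f"{int(df.duplicated().sum())} duplicate rows")
    for col in df.columns:
        if sanitize_column(col) != col:
            problems.append(f"column {col!r} should be renamed to {sanitize_column(col)!r}")

    # Naive PII screen over text columns.
    for col in df.select_dtypes(include="object").columns:
        if df[col].astype(str).str.contains(SSN_PATTERN).any():
            problems.append(f"possible PII (SSN-like values) in column {col!r}")

    return problems


if __name__ == "__main__":
    for issue in check_csv("permits.csv"):   # hypothetical file
        print("WARNING:", issue)
```

Every one of these checks is reasonable in isolation. The problem is that each publisher ends up rewriting some variation of this script, by hand, for every dataset.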

Even supposedly high-quality data feeds and harvesters employing automated data pipelines backed by enterprise-grade Extract, Transform and Load (ETL) tools were “brittle” and often required non-trivial modifications as data sources morphed over time to reflect business changes.

Ultimately, publishing data on CKAN was at the end of a long, often manual, error-prone, tedious data-wrangling process.

“Untended” Data Catalogs

This friction contributes to less dynamic Data Catalogs. Following an initial surge of activity to populate the Catalog, contributions and updates tend to dwindle over time. Users revert to creating ad-hoc, point-to-point, “spaghetti” data feeds, which appear simpler initially but ultimately only further exacerbate the enterprise’s Data Pipeline Debt.

And can we blame Catalog users? They all agree with the principle of a central metadata repository, and that it’s critical in maintaining Data as a Strategic Asset. That’s why they signed on to create a Data Catalog in the first place.

But sharing, publishing and utilizing metadata shouldn’t be this hard.

And the problem is not unique to CKAN. Competing solutions have the same problem – the typical publishing workflow just doesn’t facilitate the process.

This became more painfully apparent to us when we were involved in standing up an Enterprise Data Catalog pilot for a hedge fund in 2020. They had thousands of datasets culled from a variety of sources – from traditional databases, to S3 buckets, to satellite imagery, to data feed subscriptions, to name a few.

Their data landscape changed constantly as analysts created, updated, bought (and often rebought), and downloaded data, and there was no central source of truth about the metadata of this ever-growing, ever-changing corpus of Data Assets.

CKAN was a perfect fit for the job. Or so we thought…    

Data Catalogs remain stuck in the 2000s

During the pilot, we built an internal harvester/crawler using the usual suspects – Python, pandas/NumPy, csvkit, Scrapy, SQLAlchemy, etc. – all running on CKAN’s Harvester framework. It crawled the internal data sources at least once a day (and on demand), harvesting data/metadata, deriving additional metadata from disparate sources, and pumping it all into the catalog.
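
By way of illustration, a single pass of such a harvester might look roughly like the sketch below – walk a drop folder, derive basic metadata with pandas, and push it into CKAN through its action API. The CKAN URL, API key, folder, organization, and extra fields are all hypothetical; this is a simplified stand-in for the real harvester, not its actual code:

```python
# harvest_once.py – a simplified, hypothetical single pass of the CSV harvester.
# CKAN_URL, API_KEY, DROP_DIR, and the org/extra fields are illustrative assumptions.
import pathlib

import pandas as pd
import requests

CKAN_URL = "https://catalog.example.com"      # hypothetical CKAN instance
API_KEY = "REPLACE_ME"                        # publisher's CKAN API token
DROP_DIR = pathlib.Path("/data/dropbox")      # hypothetical crawl target


def derive_metadata(csv_path: pathlib.Path) -> dict:
    """Derive lightweight metadata from a CSV with pandas."""
    df = pd.read_csv(csv_path)
    return {
        "name": csv_path.stem.lower().replace(" ", "-"),
        "title": csv_path.stem,
        "notes": f"Auto-harvested from {csv_path.name}",
        "owner_org": "research-data",          # hypothetical organization
        "extras": [  # CKAN stores arbitrary key/value metadata as 'extras'
            {"key": "row_count", "value": str(len(df))},
            {"key": "column_count", "value": str(df.shape[1])},
            {"key": "columns", "value": ", ".join(df.columns)},
        ],
    }


def upsert_dataset(meta: dict) -> None:
    """Create the dataset via CKAN's action API, or update it if it already exists."""
    headers = {"Authorization": API_KEY}
    resp = requests.post(f"{CKAN_URL}/api/3/action/package_create",
                         json=meta, headers=headers)
    if resp.status_code == 409:                # CKAN reports a name clash as a conflict
        meta["id"] = meta["name"]
        resp = requests.post(f"{CKAN_URL}/api/3/action/package_update",
                             json=meta, headers=headers)
    resp.raise_for_status()


if __name__ == "__main__":
    for csv_file in DROP_DIR.glob("*.csv"):
        upsert_dataset(derive_metadata(csv_file))
```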

It worked!

Kinda worked, that is… it required constant upkeep and babysitting, as it was not tolerant of real-world data quality issues. But what really killed the pilot was how slow it was. It was excruciatingly slooowww!

For the catalog to be the canonical source of truth/metadata, and to be willingly adopted and co-maintained by its target users (who were not only users, but also the primary source of data), it had to make co-grooming the catalog worth their while.

That meant that uploading and updating their data on the catalog had to be:

  1. Super Easy
  2. Worthwhile – it had to give them something in return

Super Easy

As any data scientist/analyst worth their salt will tell you – most of the work in data science and model building is the unglamorous job of data wrangling (as some would jokingly call it – data janitor work).

Even after all these years and all the tooling developed, more than half the time of a “data scientist” is preparing, enriching and normalizing data for analysis.

If we are then going to ask them to complete a web form with hundreds of fields prompting for metadata – it’s not exactly a non-starter, but they will only begrudgingly do it a handful of times.

After all, they universally agree that they need a central metadata catalog that is much more than a glorified, digital library Card Catalog. 


A true Enterprise Data Catalog would have allowed users to answer questions like:

  • When was this data updated, and by what/whom?
  • Which programs/models use it, and when did they last use it?
  • How much does this data cost? Is it a subscription? Or manually maintained? What is the license?
  • Is this from a partner? What are the terms of the Data Sharing agreement?
  • What other datasets/models does it depend on?
  • What are the summary statistics of this dataset?
  • Are there any outliers that can be readily identified?
  • Is this dataset a product of another data/analysis pipeline? If so, what other datasets/models/pipelines/systems produced this data product?
  • Do we have metadata about these models/pipelines/systems?

Answering these questions is possible!  But only if you have high-quality data and metadata.
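
As a rough illustration only, the answers to questions like these could be stored as key/value “extras” on each CKAN dataset. The record below is hypothetical and the keys are illustrative, not a standard schema:

```python
# A hypothetical CKAN dataset record showing the kind of metadata needed to
# answer the questions above. The 'extras' keys are illustrative, not a standard.
dataset = {
    "name": "fx-spot-rates-daily",
    "title": "FX Spot Rates (Daily)",
    "license_id": "other-closed",
    "extras": [
        {"key": "last_updated_by", "value": "vendor-feed:acme-fx"},
        {"key": "consumed_by", "value": "risk-model-v7, pricing-notebook-14"},
        {"key": "acquisition_cost", "value": "subscription, 12000 USD/yr"},
        {"key": "data_sharing_agreement", "value": "DSA-2020-017"},
        {"key": "upstream_dependencies", "value": "raw-fx-ticks, holiday-calendar"},
        {"key": "summary_stats", "value": "rows=1048576, date_range=2015-01-02..2020-06-30"},
    ],
}
```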

And as Google and Meta have shown us – metadata is king! Their business models both depend on collecting it – our “data exhaust” as we click through their web properties tells them a lot about us, metadata which they in turn package, slice and dice to sell us to their primary customers – advertisers.


Apologies to my friends at Google/Meta for the quick “Advertising is the Original Sin of the Internet” detour, but if anything, it emphasizes that metadata collection shouldn’t be hard – it’s the sine qua non of why we’re building Data Catalogs in the first place!

Data/metadata onboarding needs to be Super Easy, bordering on “magical”, to get “customer delight.”

As the futurist Arthur C. Clarke put it: “Any sufficiently advanced technology is indistinguishable from magic.”

What if, by uploading the data first – even BEFORE filling out any metadata field beyond the title – the catalog could automagically:

  • Scan the whole dataset quickly, even very large datasets, and compile summary statistics about it
  • Infer the dataset’s data types
  • Compile a frequency table of the top N values for each column
  • Detect the sort order and whether there are duplicate rows
  • Screen the dataset for Personally Identifiable Information
  • Sanitize the dataset’s column names to ensure that they can be created in the Database
  • Create an extended data dictionary
  • Derive and compute additional metadata (e.g. the date range of a time-series dataset,
    the spatial extent of a dataset with lat/long fields, the valid values for an enumerated field, etc.)
  • Auto-classify and auto-tag the dataset

All within a matter of seconds, even for very large datasets, while the data publisher is still entering additional metadata into the web form.
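
To make the list above concrete, here is a toy sketch of a few of those inferences using pandas on a modest CSV. The file name and the TOP_N setting are assumptions, and this says nothing about doing the work in seconds on very large datasets – which is precisely the hard part:

```python
# infer_metadata.py – a toy approximation of a few inferences from the list above,
# using pandas on a modest CSV. The file name and TOP_N setting are assumptions.
import re
from pprint import pprint

import pandas as pd

TOP_N = 5   # how many of the most frequent values to keep per column


def infer_metadata(path: str) -> dict:
    df = pd.read_csv(path)   # pandas infers data types as it loads the file
    return {
        "inferred_types": {c: str(t) for c, t in df.dtypes.items()},
        "summary_stats": df.describe(include="all").to_dict(),
        "frequency_tables": {
            c: df[c].value_counts().head(TOP_N).to_dict() for c in df.columns
        },
        "duplicate_rows": int(df.duplicated().sum()),
        "sorted_by": [c for c in df.columns if df[c].is_monotonic_increasing],
        "sanitized_columns": {
            c: re.sub(r"\W+", "_", c.strip().lower()).strip("_") for c in df.columns
        },
        # Extended data dictionary: type, null count, and range per column.
        "data_dictionary": {
            c: {
                "type": str(df[c].dtype),
                "nulls": int(df[c].isna().sum()),
                "min": df[c].min() if pd.api.types.is_numeric_dtype(df[c]) else None,
                "max": df[c].max() if pd.api.types.is_numeric_dtype(df[c]) else None,
            }
            for c in df.columns
        },
    }


if __name__ == "__main__":
    pprint(infer_metadata("uploaded_dataset.csv"))   # hypothetical upload
```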

Our hypothesis is that this will not only make uploading datasets worthwhile – we believe users will even stage their own datasets just to see what metadata can be inferred from them!

In the next blog post, we’ll detail how we are reinventing the data upload workflow in our attempt to make it Super Easy and Worthwhile.

Co-Founder at datHere, Inc.

Open Datanaut, Open Source contributor, SemTechie, Urbanist, Civic Hacker, Futurist, Humanist, Dad, Hubby & Social Entrepreneur on a 3BL mission.

Co-Founder at datHere, Inc.

I oversee the design, development, and implementation of innovative data solutions for clients. My expertise in data management, data quality, and data integration has been integral to driving data-driven decision-making. I am passionate about creating a culture of data-driven innovation that enables organizations to stay ahead of the competition.