Since 2011, we’ve deployed more than 100 CKAN data portals. Most of them are traditional open data portals in the public sector, but several are internal data exchanges and enterprise metadata catalogs in the private sector.
Over the past 12 years, the most challenging aspect, without exception, has not been the initial deployment but persuading data publishers to embrace and consistently maintain the data catalog – ensuring that both the data and its associated metadata remain current, valuable, and responsive to the needs of their users.
To remain relevant and to serve as the canonical source of enterprise metadata, the Data Catalog must make creating, maintaining, inferring, and updating metadata as painless as possible for users – both machines and humans.
Unfortunately, there were several impediments to doing so:
Even supposedly high-quality data feeds and harvesters employing automated data pipelines backed by enterprise-grade Extract, Transform and Load (ETL) tools were “brittle”, often requiring non-trivial modifications as data sources morphed over time to reflect business changes.
Ultimately, publishing data on CKAN was at the end of a long, often manual, error-prone, tedious data-wrangling process.
This friction contributes to less dynamic Data Catalogs. Following an initial surge of activity to populate the Catalog, contributions and updates tend to dwindle over time. Users revert to creating ad-hoc, point-to-point, “spaghetti” data feeds, which appear simpler initially but ultimately only further exacerbate the enterprise’s Data Pipeline Debt.
And can we blame Catalog users? They all agree with the principle of a central metadata repository, and that it’s critical in maintaining Data as a Strategic Asset. That’s why they signed on to create a Data Catalog in the first place.
But sharing, publishing and utilizing metadata shouldn’t be this hard.
And the problem is not unique to CKAN. Competing solutions have the same problem – the typical publishing workflow just doesn’t facilitate the process.
This became more painfully apparent to us when we were involved in standing up an Enterprise Data Catalog pilot for a hedge fund in 2020. They had thousands of datasets culled from a variety of sources – from traditional databases, to S3 buckets, to satellite imagery, to data feed subscriptions, to mention a few.
Their data landscape was always changing as analysts constantly created, updated, bought (and often rebought), and downloaded data, and there was no central source of truth for the metadata of their ever-growing, ever-changing corpus of Data Assets.
CKAN was a perfect fit for the job. Or so we thought…
During the pilot, we built an internal harvester/crawler using the usual suspects – Python, pandas/NumPy, csvkit, Scrapy, SQLAlchemy, etc. – all running on CKAN’s Harvester framework. It crawled the internal data sources at least once a day and on demand, harvesting data/metadata, deriving additional metadata from disparate sources, and pumping it all into the catalog.
It worked!
Kinda worked, that is… it required constant upkeep and babysitting because it was not tolerant of real-world data quality issues. But what really killed the pilot was how slow it was. It was excruciatingly slooowww!
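For illustration, here is a stripped-down sketch of the kind of harvesting step described above – not our production code. It profiles a CSV source and registers it through CKAN’s standard Action API (`package_create`); the catalog URL, API token, and the `extract_metadata` helper are illustrative assumptions:

```python
import csv
import json
import urllib.request

CKAN_URL = "https://catalog.example.com"  # hypothetical catalog endpoint
API_KEY = "replace-with-api-token"        # publisher's CKAN API token


def extract_metadata(csv_path, name):
    """Profile a CSV source and build a CKAN package_create payload."""
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        row_count = sum(1 for _ in reader)
    return {
        "name": name,
        "title": name.replace("-", " ").title(),
        "notes": f"Harvested source with {row_count} rows, {len(header)} columns",
        "extras": [{"key": "columns", "value": json.dumps(header)}],
    }


def publish(payload):
    """Register the dataset through CKAN's Action API."""
    req = urllib.request.Request(
        f"{CKAN_URL}/api/3/action/package_create",
        data=json.dumps(payload).encode(),
        headers={"Authorization": API_KEY, "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Every step that can go wrong here – a renamed column, a malformed row, a timeout on the POST – is exactly the kind of real-world brittleness that demanded the babysitting.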
For the catalog to be the canonical source of truth for metadata, and to be willingly adopted and co-maintained by its target users (who were not only its users but also its primary source of data), the catalog had to make it worth their while to co-groom it.
That meant that uploading and updating their data on the catalog had to be:

1. Super Easy
2. Worthwhile
As any data scientist/analyst worth their salt will tell you – most of the work in data science and model building is the unglamorous job of data wrangling (as some would jokingly call it – data janitor work).
Even after all these years and all the tooling developed, more than half of a “data scientist’s” time is spent preparing, enriching, and normalizing data for analysis.
If we then ask them to complete a web form with hundreds of fields prompting for metadata – it’s not exactly a non-starter, but they will do it only begrudgingly, and only a handful of times.
After all, they universally agree that they need a central metadata catalog that is much more than a glorified, digital library Card Catalog.
A true Enterprise Data Catalog would have allowed users to answer questions like:
Answering these questions is possible! But only if you have high-quality data and metadata.
And as Google and Meta have shown us – metadata is king! Their business models both depend on collecting it – our “data exhaust” as we click through their web properties tells them a lot about us, metadata which they in turn package, slice and dice to sell us to their primary customers – advertisers.
Apologies to my friends at Google/Meta for the quick “Advertising is the Original Sin of the Internet” detour, but if anything, it emphasizes that metadata collection shouldn’t be hard – it’s the “sine qua non” of why we’re building Data Catalogs in the first place!
Data/metadata onboarding needs to be Super Easy, bordering on “magical”, to get “customer delight.”
As the futurist Arthur C. Clarke put it – “Any sufficiently advanced technology is indistinguishable from magic.”
What if, by uploading the data first – even BEFORE filling out any metadata field beyond the title – the catalog could automagically:
All within a matter of seconds, even for very large datasets, while the data publisher is still entering additional metadata into the web form.
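To make the “automagical” idea concrete, here is a rough sketch of the kind of metadata that can be inferred from the data alone, before the publisher fills in a single form field. The function names, the sampling approach, and the chosen statistics are illustrative assumptions, not the actual implementation – sampling is what keeps inference fast even on very large files:

```python
import csv
from collections import Counter


def _guess_type(value):
    """Crude per-value type guess: integer, float, or string."""
    try:
        int(value)
        return "integer"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "string"


def infer_column_profile(csv_path, sample_size=1000):
    """Scan a sample of rows and guess each column's type, null rate,
    and cardinality -- the raw material for auto-filled metadata."""
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        # Only read a bounded sample, so inference stays fast on huge files.
        rows = [row for _, row in zip(range(sample_size), reader)]

    profile = {}
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        non_null = [v for v in values if v not in ("", None)]
        type_votes = Counter(_guess_type(v) for v in non_null)
        profile[col] = {
            "inferred_type": type_votes.most_common(1)[0][0] if non_null else "empty",
            "null_rate": 1 - len(non_null) / len(values),
            "distinct_values": len(set(non_null)),
        }
    return profile
```

Even this crude profile – types, null rates, cardinality – is more than most publishers would ever type into a form, and it falls out of the upload for free.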
Our hypothesis is that this will not only make uploading datasets worthwhile – we believe users will even stage their own datasets just to see what metadata can be inferred from them!
In the next blog post, we’ll detail how we are reinventing the data upload workflow in our attempt to make it Super Easy and Worthwhile.