The Speed of Insight

Turning Raw Data into Insight with High-Speed Metadata Inferencing, Queries & Aggregations

Posted by: Joel Natividad and Sami Baig
Category: CKAN, Data Catalog, Data Wrangling, Metadata, qsv

Happy Open Data Week!

In celebration, we just released qsv 0.123.0 – our biggest release to date! This release is especially exciting for us as it enables the next stage in our mission of making your Data Useful, Usable and Used.

It does so by enabling High-Speed Queries & Aggregations. What?!? Let us explain…

Our Data Wrangler Devils – Data Quality, Data Interoperability & Data Ingestion

Since we started way back in 2011, we’ve helped spin up hundreds of CKAN instances in the US. Without fail, the three most significant problems that bedevil us post-installation are Data Quality, Data Interoperability, and Data Ingestion.

Data Quality and Data Interoperability (or, more specifically, the lack of both) go hand-in-hand. It’s a universal problem, as Data is created with a specific application in mind, with Data Interoperability a secondary consideration.

Even with the most pristine Data Models, Business Processes inevitably change, requiring changes to the Data. As these Applications go through their Lifecycle – layers and layers of Change dictated by the ever-changing Business, Technology, Legal, and Security landscape accrue and make their way into the Data. 

That’s why we think Excel is pervasive – it’s the Duct Tape of Data Quality and Interoperability – with Analysts exporting Data from disparate transaction systems, ultimately manually shoehorning Data into reports, notwithstanding the expensive enterprise BI/ETL tools they procured for the job.

And we don’t mean to be pejorative about Duct Tape – we’re proud Data Wranglers ourselves!  We actually MacGyvered qsv to democratize Data Wrangling! We also think CSV is the Duct Tape of Data Interchange – thus our laser focus on it as qsv’s central data format.

With Datapusher+ (DP+ for short), we tackled and largely solved the majority of our Data Quality, Interoperability, and Ingestion problems in CKAN – primarily by leveraging several of qsv’s commands:

  • with qsv validate, we can quickly detect if an input CSV is corrupt and, if a JSON Schema is available, check whether an incoming CSV conforms to the Schema definition
  • with qsv excel, DP+ can reliably export an input spreadsheet (Excel and Open Office formats supported) to a well-formed, RFC 4180-compliant, UTF-8-encoded CSV
  • with qsv stats and schema, DP+ performs guaranteed data type inferencing and reliably infers the schema of a CSV. Further, the comprehensive summary stats it computes allow DP+ to infer & derive extended metadata (especially relevant now that DCAT-US 3 became a Candidate Recommendation just today) and inform several data ingestion optimization heuristics that enable DP+ to populate the CKAN data store in an optimal, storage-efficient, query-performant manner.
  • with qsv frequency, DP+ compiles comprehensive frequency tables, which allow us to derive the domain of each field
  • with qsv sortcheck and dedup, DP+ can easily detect duplicate rows and optionally remove them
  • with qsv searchset, DP+ can quickly screen datasets for PII data and optionally remove or quarantine suspect records
  • with upcoming scripting support for both Luau and Python, Data Stewards will be able to create customizable data pipelines inside DP+
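
To make the kinds of metadata in the list above concrete, here is a minimal, standard-library-only Python sketch of type inference, frequency tables, and duplicate detection on a tiny CSV. It is only an illustration of the idea and is not how qsv or DP+ implement it: the function names `infer_type` and `profile_csv`, the type labels, and the sample data are all assumptions made up for this example, and unlike qsv it loads the whole file into memory.

```python
import csv
import io
from collections import Counter


def infer_type(values):
    """Return the narrowest type label that fits every non-empty value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "NULL"
    # Try the narrowest type first, then widen.
    for name, cast in (("Integer", int), ("Float", float)):
        try:
            for v in non_empty:
                cast(v)
            return name
        except ValueError:
            continue
    return "String"


def profile_csv(text):
    """Load a small CSV and return (per-column profile, duplicate row count)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = [tuple(row) for row in reader]
    # Transpose rows into columns; an empty file yields empty columns.
    columns = list(zip(*rows)) if rows else [() for _ in header]
    profile = {}
    for name, values in zip(header, columns):
        profile[name] = {
            "type": infer_type(values),
            # Frequency table: distinct values with counts, most common first.
            "frequency": Counter(values).most_common(),
        }
    # Rows that exactly repeat an earlier row count as duplicates.
    duplicates = len(rows) - len(set(rows))
    return profile, duplicates


# Hypothetical sample data, for illustration only.
sample = "city,population\nAlbany,100\nTroy,50\nAlbany,100\n"
profile, duplicates = profile_csv(sample)
print(profile["population"]["type"], profile["city"]["frequency"], duplicates)
```

The frequency table doubles as the field's domain: the distinct values in `frequency` are exactly the values the column takes in this file.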

Data Ingestion issues, for the most part, also go away. Network issues are the exception, and we’re working on those too with qsv sniff, leveraging range requests, using HTTP/2 and experimenting with HTTP/3. Another big CKAN Devil we aim to slay is Harvesting – but that’s for another blogpost 😀.

Once you have High Quality Data, PostgreSQL COPY works like a dream and is dramatically faster than loading data via CKAN’s Datastore API, as it’s purpose-built for high-speed, PostgreSQL-native, high-volume data ingestion.
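
As a rough sketch of what a COPY-based load can look like from Python, the snippet below builds a `COPY ... FROM STDIN` statement and streams a CSV file through psycopg2's `copy_expert`. This is an assumption-laden illustration, not DP+'s actual code: the helper names (`quote_ident`, `build_copy_sql`, `copy_csv`) are hypothetical, the caller is assumed to supply an open psycopg2 connection, and the target table is assumed to already exist with matching columns.

```python
def quote_ident(name):
    """Quote a PostgreSQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_copy_sql(table, columns):
    """Build a COPY ... FROM STDIN statement for a CSV file with a header row."""
    cols = ", ".join(quote_ident(c) for c in columns)
    return (
        f"COPY {quote_ident(table)} ({cols}) "
        "FROM STDIN WITH (FORMAT csv, HEADER true)"
    )


def copy_csv(conn, table, columns, csv_path):
    """Stream csv_path into an existing table over an open psycopg2 connection."""
    with open(csv_path, "r", encoding="utf-8", newline="") as f:
        with conn.cursor() as cur:
            # copy_expert sends the file contents to the server as COPY input.
            cur.copy_expert(build_copy_sql(table, columns), f)
    conn.commit()


# Hypothetical table and column names, for illustration only.
print(build_copy_sql("resource_1", ["city", "population"]))
```

COPY assumes the file is already clean: a single malformed row aborts the whole load, which is why the validation and type inference steps come first.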

Most importantly, all these Data Wrangling tasks we baked into DP+ only take a few seconds, even with the largest datasets.  For the Speed of Insight is Instant – or as near to instant as possible.

With DP+, we believe that we’ve made the Data Useful and Usable – at the same time, relieving Data Publishers of most of the onerous, manual, error-prone data wrangling they had to do before publishing Data.

But what of Useful, Usable Data – unless it’s Used?

In the early days of the Open Data phenomenon, several big predictions were made: opening data at scale would enable all kinds of businesses and innovations, the same way other government datasets like the Census, weather, water and GPS have become critical data sources underpinning several industries and affordances of modern life.

Though they were initially collected by the Census, NOAA, USGS and the Defense Department respectively for internal use, these datasets are now widely available and used across all sectors.

So much so that McKinsey even famously projected in an October 2013 report that Open Data can help create $3 trillion a year in value in seven areas of the global economy.  For reference, the US GDP for 2013 was $16.84 trillion and the global GDP was $75.59 trillion.

But if we are being honest, Open Data’s promise has been largely unrealized.

Part of the problem, we suspect, is that we need to do more than PUBLISH high-quality data and metadata and wait for the app developers to come.  With data volume/velocity doubling every few years, publishing yet more data on a data portal does not ensure the Data is Used to gain Insight and Drive Evidence-based Decisions. It’s but the first, though crucial, step towards realizing Open Data’s promise.

That’s why the next major advance for DP+ is to preemptively compute, query, aggregate, and visualize the high-quality data intelligently using all the high-quality metadata we inferred.

And it needs to be done at the near Instant Speed of Insight.

If you want to see what we’re doing to make sure the Useful, Usable Data is actually Used to gain Insight and Drive Evidence-based Decisions, register and join us for the March 2024 installment of CKAN Monthly Live.

See you there!

Co-Founder at datHere, Inc.

Open Datanaut, Open Source contributor, SemTechie, Urbanist, Civic Hacker, Futurist, Humanist, Dad, Hubby & Social Entrepreneur on a 3BL mission.

Co-Founder at datHere, Inc.

I oversee the design, development, and implementation of innovative data solutions for clients. My expertise in data management, data quality, and data integration has been integral to driving data-driven decision-making. I am passionate about creating a culture of data-driven innovation that enables organizations to stay ahead of the competition.