In celebration, we just released qsv 0.123.0 – our biggest release to date! This release is especially exciting for us as it enables the next stage in our mission of making your Data Useful, Usable and Used.
It does so by enabling High-Speed Queries & Aggregations. What?!? Let us explain…
Since we started way back in 2011, we’ve helped spin up hundreds of CKAN instances in the US. Without fail, the three most significant problems that bedevil us post-installation are Data Quality, Data Interoperability, and Data Ingestion.
Data Quality and Data Interoperability – or, more specifically, the lack of both – go hand in hand, and it’s a universal problem: Data is created with a specific application in mind, with Data Interoperability at best a secondary consideration.
Even with the most pristine Data Models, Business Processes inevitably change, requiring changes to the Data. As these Applications go through their Lifecycle – layers and layers of Change dictated by the ever-changing Business, Technology, Legal, and Security landscape accrue and make their way into the Data.
That’s why we think Excel is pervasive – it’s the Duct Tape of Data Quality and Interoperability – with Analysts exporting Data from disparate transaction systems, ultimately manually shoehorning Data into reports, notwithstanding the expensive enterprise BI/ETL tools they procured for the job.
And we don’t mean to be pejorative about Duct Tape – we’re proud Data Wranglers ourselves! We actually MacGyvered qsv to democratize Data Wrangling! We also think CSV is the Duct Tape of Data Interchange – thus our laser focus on it as qsv’s central data format.
With Datapusher+ (DP+ for short), we tackled and largely solved our Data Quality, Interoperability, and Ingestion problems in CKAN – primarily by leveraging several of qsv’s commands.
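As a rough sketch of the kind of wrangling involved – the commands below are real qsv commands, though the exact invocations and sequence DP+ uses are our illustration, not necessarily DP+’s actual pipeline:

```shell
# Create a tiny sample CSV to wrangle (illustrative data only)
printf 'id,name,amount\n1,alice,10\n2,bob,twenty\n' > sample.csv

# Hedged sketch: these qsv subcommands exist, but the flags/sequence
# DP+ actually runs may differ.
if command -v qsv >/dev/null 2>&1; then
  qsv sniff sample.csv      # detect delimiter, quoting & field types
  qsv stats sample.csv      # per-column summary stats & inferred types
  qsv frequency sample.csv  # per-column frequency tables
fi
```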
Data Ingestion issues, for the most part, also go away. (Network issues remain, but we’re working on those too – with qsv sniff leveraging range requests, using HTTP/2, and experimenting with HTTP/3 – as another big CKAN Devil we aim to slay is Harvesting. But that’s for another blog post 😀)
Once you have High Quality Data, PostgreSQL COPY works like a dream and is exponentially faster than loading data via CKAN’s Datastore API – as it’s purpose-built for high-speed, PostgreSQL-native, high-volume data ingestion.
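To make the comparison concrete, here is a minimal sketch of a bulk load with COPY – the table name and connection URL are hypothetical, and a reachable PostgreSQL instance is assumed:

```shell
# Write a small, already-validated CSV (hypothetical data)
printf 'id,name\n1,alice\n2,bob\n' > clean.csv

# \copy streams the whole file into the table in one round trip,
# instead of one Datastore API call per batch of rows.
# The table "mytable" and $DATABASE_URL are assumptions for this sketch.
if command -v psql >/dev/null 2>&1 && [ -n "${DATABASE_URL:-}" ]; then
  psql "$DATABASE_URL" -c "\copy mytable FROM 'clean.csv' WITH (FORMAT csv, HEADER true)"
fi
```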
Most importantly, all these Data Wrangling tasks we baked into DP+ take only a few seconds, even with the largest datasets. For the Speed of Insight is Instant – or as near to instant as possible.
With DP+, we believe that we’ve made the Data Useful and Usable – at the same time, relieving Data Publishers of most of the onerous, manual, error-prone data wrangling they had to do before publishing Data.
In the early days of the Open Data phenomenon, big predictions were made: it was claimed that opening data at scale would enable all kinds of businesses and innovations – the same way other government datasets like the Census, weather, water, and GPS have become critical data sources underpinning several industries and affordances of modern life.
Though they were initially collected by the Census, NOAA, USGS and the Defense Department respectively for internal use, these datasets are now widely available and used across all sectors.
So much so that McKinsey even famously projected in an October 2013 report that Open Data could help create $3 trillion a year in value across seven areas of the global economy. For reference, the 2013 US GDP was $16.84 trillion and global GDP was $75.59 trillion.
But if we are being honest, Open Data’s promise has been largely unrealized.
Part of the problem, we suspect, is that we need to do more than PUBLISH high-quality data and metadata and wait for the app developers to come. With data volume/velocity doubling every few years, publishing yet more data on a data portal does not ensure the Data is Used to gain Insight and Drive Evidence-based Decisions – it’s but the first, though crucial, step towards realizing Open Data’s promise.
That’s why the next major advance for DP+ is to preemptively compute, query, aggregate, and visualize the high-quality data intelligently using all the high-quality metadata we inferred.
And it needs to be done at the near Instant Speed of Insight.
If you want to see what we’re doing to make sure the Useful, Usable Data is actually Used to gain Insight and Drive Evidence-based Decisions, register and join us for the March 2024 installment of CKAN Monthly Live.
See you there!