05
Jan
Posted by: Joel Natividad
Category:
Data Wrangling, Metadata, qsv, Solutions
When we first started developing qsv as a side-project at the height of the pandemic, the itch we were scratching was the need to compile summary statistics quickly for a hedge fund’s data catalog pilot. Pandas was simply just too slow and was not cutting it, and we needed something that could refresh all the metadata for thousands of datasets on a nightly basis.
Ultimately, the pilot failed as the nightly metadata refresh job was not fast enough. Even though we switched to xsv (qsv’s progenitor) late in the pilot, there were still some algorithms in Python that were slowing down the nightly refresh (specifically, additional statistics, date inferencing and percentage sampling of the datasets).
I ended up creating qsv to implement those algorithms in Rust, but the pilot ended before we could deploy it.
Three and a half years later, qsv has grown from a side project to a major component of datHere’s offerings – It’s at the heart of Datapusher+, the Data Resource Upload First workflow, the many Data Pipelines that feed our Data Library. And now, we’re previewing qsv pro!
When we’re asked what’s your elevator pitch for qsv, our go to answer is:
qsv is a blazing-fast, multi-platform, command-line, data-wrangling toolkit designed to solve the majority of day-to-day data-wrangling use cases in seconds without having to resort to programming or specialized tools.
But that answer is unsatisfying and just doesn’t capture qsv’s utility and full potential. So here are some real-world, day-to-day use cases of what we’ve used qsv for:
- Data Analysis of Very Large Datasets: qsv can be used to quickly filter, sort, dedupe, partition, and summarize multi-gigabyte, larger than memory datasets in seconds. Analysts can perform operations like selecting specific columns, searching for patterns, or calculating aggregations without the need to load data into a database first or custom programming.
- Data Cleaning and Transformation: qsv is ideal for cleaning data, such as removing duplicates, trimming spaces, or converting data types. It can also be used to transform data formats (Excel, JSON, JSONL/NDJSON, Postgres, SQLite, parquet, Apache Avro, Apache Arrow, Data Package, etc.) which is useful for data integration tasks.
- Data Enrichment: qsv can take advantage of datHere’s ever-growing library of lookup tables to quickly enrich a dataset. datHere’s Data Catalog currently includes high-value, pre-normalized datasets from the US Census, FBI, the Bureau of Labor Statistics, and more.
- Automated Reporting: With its ability to process and aggregate data quickly, qsv can be integrated into automated reporting pipelines. It can extract key metrics and summaries, which can then be used for generating daily, weekly, or monthly reports.
- Merging and Comparing Data Sources: qsv can join multiple datasets from disparate sources based on common keys or compare datasets to identify differences without first loading it into a database. This is particularly useful in data reconciliation tasks in finance or retail industries.
- Data Sampling and Quality Checks: qsv can randomly sample data or apply filters to perform quality checks and quickly detect outliers. It can validate if a dataset conforms with a predefined JSON Schema. It can even infer validation rules from a large training dataset and create a JSON Schema to validate other datasets with the same schema. This is crucial for ensuring data integrity and reliability, especially in large datasets where manual checking is impractical.
- PII/PHI Screening: qsv can quickly screen datasets for Personally Identifiable Information (PII) and Protected Health Information (PHI) with a library of pre-built, customizable regular expression patterns for preliminary screening. Using its Luau DSL, analysts can create even more complex PII/PHI screening rules using its integrated library of Natural Language helper functions.
- Stream Processing: For real-time data processing, qsv can be used to handle streaming data, applying transformations and aggregations on the fly. This is valuable in scenarios like monitoring log files or real-time analytics.
- Machine Learning Data Preparation: Before training machine learning models, data scientists can use qsv to preprocess datasets – normalizing data, handling missing values, and creating training/testing splits.
- ETL (Extract, Transform, Load) Processes: qsv can be a part of ETL pipelines, handling the transformation step efficiently. It can extract data from various sources, transform it as required, and prepare it for loading into databases or data warehouses.
- Log Analysis: qsv can be used to analyze log files, extract specific information, and summarize log entries. This is valuable for IT operations, security analysis, and debugging purposes.
- Data Exploration and Prototyping: Analysts can use qsv for exploratory data analysis, quickly prototyping data queries, and understanding data distributions before building more complex data processing pipelines. With its Polars-based `sqlp` command, it can do complex SQL queries replete with complex joins, subqueries, aggregations, grouping, sorting, and more.
- Accelerated Harvesting: With its `sniff` command, qsv enables accelerated harvesting of data sources when compiling a Data Catalog as it can sniff not only the file format (mime type), but also the schema of a well-formed CSV file without having to download the entire file first.
These use cases demonstrate the versatility and efficiency of qsv as a data-wrangling tool, making it valuable across various domains and tasks in data management and analysis.
With its Luau-based Domain Specific Language (DSL), it can even be scripted to run very complex data pipelines. It also supports Python (which is much slower than Luau) if that is your scripting language of preference.
And with the upcoming qsv pro – qsv gains not only a User Interface but also unlocks additional capabilities like recipes, AI-assisted metadata inferencing, automatic data dictionary compilation, expanded data enrichment options (postal address normalization, street-level geocoding, etc.), access from Excel, data publishing to CKAN and more!
Sign up for a qsv pro preview here – https://dathere.com/qsv-feedback-form/. We’ll prioritize existing users who’ve shared what they use qsv for. 😊
Open Datanaut, Open Source contributor, SemTechie, Urbanist, Civic Hacker, Futurist, Humanist, Dad, Hubby & Social Entrepreneur on a 3BL mission.