Virtualization Technology News and Information
Datafold Launches Open Source data-diff to Compare Tables of Any Size Across Databases

Datafold announced Open Source data-diff, a new cross-database diffing package. The package extends Datafold's original Data Diff tool for comparing data sets, and it validates the consistency of data across databases using high-performance algorithms.

In the modern data stack, companies extract data from sources, load that data into a warehouse, and transform it so that it can be used for analysis, activation, or data science use cases. Datafold has focused on automated testing during the transformation step with Data Diff, ensuring that a change to a data model does not break a dashboard or feed a predictive algorithm the wrong data. With the launch of open source data-diff, Datafold can now help with the extract and load parts of the process: open source data-diff verifies that the data that has been loaded matches the source from which it was extracted. All parts of the data stack need testing for data engineers to create reliable data products, and Datafold now gives them coverage throughout the extract, load, transform (ELT) process.

"data-diff fulfills a need that wasn't previously being met," said Gleb Mezhanskiy, Datafold founder and CEO. "Every data-savvy business today replicates data between databases in some way, for example, to integrate all available data in a warehouse or data lake to leverage it for analytics and machine learning. Replicating data at scale is a complex and often error-prone process, and although multiple vendors and open source tools provide replication solutions, there was no tooling to validate the correctness of such replication. As a result, engineering teams resorted to manual one-off checks and tedious investigations of discrepancies, and data consumers couldn't fully trust the data replicated from other systems."

Mezhanskiy continued, "data-diff solves this problem elegantly by providing an easy way to validate consistency of data sets across databases at scale. It relies on state-of-the-art algorithms to achieve incredible speed: e.g., comparing one-billion-row data sets across different databases takes less than five minutes on a regular laptop. And, as an open source tool, it can be easily embedded into existing workflows and systems."

Answering an Important Need

Today's organizations are using data replication to consolidate information from multiple sources into data warehouses or data lakes for analytics. They're integrating operational systems with real-time data pipelines, consolidating data for search, and migrating data from legacy systems to modern databases.

Thanks to amazing tools like Fivetran, Airbyte and Stitch, it's easier than ever to sync data across multiple systems and applications. Most data synchronization scenarios call for 100% guaranteed data integrity, yet the practical reality is that in any interconnected system, records are sometimes lost due to dropped packets, general replication issues, or configuration errors. To ensure data integrity, it's necessary to perform validation checks using a data diff tool.

Datafold's approach constitutes a significant step forward for developers and data analysts who wish to compare multiple databases rapidly and efficiently, without building a makeshift diff tool themselves. Currently, data engineers use multiple comparison methods, ranging from simple row counts to comprehensive row-level analysis. The former is fast but not comprehensive, whereas the latter approach is slow but guarantees complete validation. Open source data-diff is fast and provides complete validation.
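To illustrate the trade-off between the two comparison methods described above, here is a minimal sketch that uses small in-memory dictionaries in place of database tables. The table contents and the helper names `row_counts_match` and `row_level_diff` are hypothetical and are not part of data-diff.

```python
# Two toy "tables" keyed by primary key. One value was corrupted during replication.
source = {1: "alice", 2: "bob", 3: "carol"}
target = {1: "alice", 2: "bobby", 3: "carol"}


def row_counts_match(a, b):
    """Fast check: compares only the number of rows."""
    return len(a) == len(b)


def row_level_diff(a, b):
    """Thorough check: compares every row and returns the keys that differ."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


counts_ok = row_counts_match(source, target)
changed = row_level_diff(source, target)

print(counts_ok)  # True: the row count check does not notice the changed value
print(changed)    # [2]: the row-level comparison pinpoints the changed row
```

The row count check passes even though the tables disagree, while the row-level comparison has to touch every row to find the discrepancy.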

Open Source data-diff for Building and Managing Data Quality

Available today, data-diff uses checksums to verify 100% consistency between two different data sources quickly and efficiently. This method allows a row-level comparison of 100 million records to complete in just a few seconds, without sacrificing the granularity of the resulting comparison.
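One common way to combine checksums with row-level granularity is to checksum whole key ranges and split only the ranges whose checksums disagree. The sketch below illustrates that general idea under stated assumptions: the tables are in-memory dictionaries with integer keys, and the names `segment_checksum`, `diff_tables`, and `threshold` are hypothetical. It is not a description of data-diff's actual implementation.

```python
import hashlib


def segment_checksum(rows, lo, hi):
    """Checksum all rows whose key falls in [lo, hi), in key order."""
    digest = hashlib.md5()
    for key in sorted(k for k in rows if lo <= k < hi):
        digest.update(f"{key}|{rows[key]}".encode())
    return digest.hexdigest()


def diff_tables(source, target, lo, hi, threshold=4):
    """Return the keys in [lo, hi) whose rows differ between the two tables."""
    # Matching checksums: treat the whole segment as identical and skip it.
    if segment_checksum(source, lo, hi) == segment_checksum(target, lo, hi):
        return []
    # Small segment: compare row by row to find the exact differences.
    if hi - lo <= threshold:
        return [k for k in range(lo, hi) if source.get(k) != target.get(k)]
    # Otherwise split the segment in half and recurse into each half.
    mid = (lo + hi) // 2
    return (diff_tables(source, target, lo, mid, threshold)
            + diff_tables(source, target, mid, hi, threshold))


# Toy data: one changed row and one missing row in the target.
source = {i: f"user-{i}" for i in range(1000)}
target = dict(source)
target[137] = "user-137-changed"
del target[842]

mismatched = diff_tables(source, target, 0, 1000)
print(mismatched)  # [137, 842]
```

In a real deployment each checksum would typically be computed inside the database, so that only checksums, rather than full rows, cross the network for segments that match.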

Datafold has released data-diff under the MIT license. Currently, the software includes connectors for Postgres, MySQL, Snowflake, BigQuery, Redshift, Presto and Oracle. Datafold plans to invite contributors to build connectors for additional data sources and for specific business applications.

To learn more about Datafold's open source data-diff, visit https://github.com/datafold/data-diff/.

Published Friday, June 24, 2022 8:23 AM by David Marshall