Dear Community,

I would like to share recent developments on applying Content Defined Chunking (CDC [1][2]) to Parquet files. CDC is a technique that divides data into variable-sized chunks based on the content of the data itself rather than at fixed-size boundaries. This makes it effective for deduplication in content-addressable storage systems such as the Hugging Face Hub [3] or restic [4]. There was an earlier discussion [5] on the Parquet mailing list about this feature; this is a follow-up on the progress made since then.
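To make the chunking idea concrete, below is a minimal, illustrative Python sketch of gear-based CDC in the spirit of [2]. The gear table and size parameters are arbitrary choices for illustration and are not the ones used in the actual implementation:

    import hashlib

    # Illustrative 256-entry gear table, derived deterministically from SHA-256.
    GEAR = [
        int.from_bytes(hashlib.sha256(bytes([i])).digest()[:8], "big")
        for i in range(256)
    ]

    def cdc_chunks(data: bytes, min_size=2048, avg_size=8192, max_size=65536):
        """Yield content-defined chunks of `data`.

        A boundary is declared when the low bits of a gear rolling hash are
        all zero, so boundaries depend only on nearby content and survive
        insertions or deletions elsewhere in the stream.
        """
        mask = avg_size - 1  # avg_size must be a power of two
        start, h = 0, 0
        for i, byte in enumerate(data):
            h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
            length = i - start + 1
            if (length >= min_size and (h & mask) == 0) or length >= max_size:
                yield data[start:i + 1]
                start, h = i + 1, 0
        if start < len(data):
            yield data[start:]

Because the boundaries are derived from the content, inserting or deleting a few bytes only changes the chunks around the edit, while the remaining chunks (and thus their hashes) stay identical.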
Generally speaking, CDC is better suited to deduplicating uncompressed row-major data. However, the Parquet format's features enable us to apply content-defined chunking effectively to Parquet files as well. Luckily, only the writer needs to be aware of the chunking; the reader can still read the file as a regular Parquet file, and no changes to the Parquet format are required.

One practical example is storing and serving multiple revisions of a Parquet file, including appends, insertions, deletions, and updates:

- Vanilla Parquet (Snappy): the total size of all revisions is 182.6 GiB, and the content-addressable storage requires 148.0 GiB. While the storage is able to identify some common chunks in the Parquet files, the deduplication ratio is fairly low.
- Parquet with CDC (Snappy): the total size is 178.3 GiB, and the storage requirement is reduced to 75.6 GiB. The Parquet files are written with content-defined chunking, hence the deduplication is greatly improved.
- Parquet with CDC (ZSTD): the total size is 109.6 GiB, and the storage requirement is reduced to 55.9 GiB, showing that the deduplication ratio is greatly improved for both Snappy- and ZSTD-compressed Parquet files.

I created a draft implementation [6] of this feature in Parquet C++ and PyArrow, along with an evaluation tool [7] to (1) better understand the actual changes in the Parquet files and (2) evaluate the deduplication efficiency on various Parquet datasets. You can find more details and results in the evaluation tool's repository [7].

I think this feature could be very useful for other projects as well, so I am eager to hear the community's feedback. Cross-posting to the Apache Arrow mailing list for better visibility, though please reply to the Apache Parquet mailing list.

Regards,
Krisztian

[1]: https://joshleeb.com/posts/content-defined-chunking.html
[2]: https://en.wikipedia.org/wiki/Rolling_hash#Gear_fingerprint_and_content-based_chunking_algorithm_FastCDC
[3]: https://xethub.com/blog/from-files-to-chunks-improving-hf-storage-efficiency
[4]: https://restic.net
[5]: https://lists.apache.org/list?d...@parquet.apache.org:2024-10:dedupe
[6]: https://github.com/apache/arrow/pull/45360
[7]: https://github.com/kszucs/de
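P.S. For anyone who wants to experiment with the draft branch [6], here is a minimal usage sketch in PyArrow. The `use_content_defined_chunking` writer option is the name used in the draft implementation and may still change:

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"id": list(range(1_000_000))})

    # Write with content-defined chunking enabled; readers need no changes
    # and see a regular Parquet file.
    pq.write_table(
        table,
        "data.parquet",
        compression="zstd",
        use_content_defined_chunking=True,
    )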