Dear Community,

I would like to share recent developments on applying Content Defined Chunking 
(CDC [1][2]) to Parquet files. CDC is a technique that divides data into 
variable-sized chunks based on the content of the data itself, rather than 
fixed-size boundaries. This makes it effective for deduplication in 
content-addressable storage systems such as the Hugging Face Hub [3] or 
restic [4]. There was an earlier discussion [5] on the Parquet mailing list 
about this feature; this is a follow-up on the progress made since then. 
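
To illustrate the idea, below is a minimal gear-rolling-hash chunker in
Python, in the spirit of FastCDC [2]. It is only a sketch for explanation:
the constants (gear table seed, mask, min/max sizes) are illustrative and
not the parameters used by the Hub or by the Parquet draft.

    import random

    # Illustrative parameters; real CDC implementations tune these carefully.
    random.seed(0)
    GEAR = [random.getrandbits(64) for _ in range(256)]  # per-byte random values
    MASK = (1 << 16) - 1        # boundary condition, ~64 KiB average chunk size
    MIN_SIZE, MAX_SIZE = 16 * 1024, 256 * 1024

    def cdc_chunks(data: bytes):
        """Yield chunks whose boundaries depend on the content, not on offsets."""
        start, h = 0, 0
        for i, byte in enumerate(data):
            h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
            size = i - start + 1
            # Cut at a content-defined boundary, or force a cut at MAX_SIZE.
            if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
                yield data[start:i + 1]
                start, h = i + 1, 0
        if start < len(data):
            yield data[start:]

Because boundaries are determined by the bytes themselves, an insertion or
deletion only disturbs the chunks around the edit; most other chunks keep
their content (and hash) and therefore deduplicate across revisions.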

Generally speaking, CDC is better suited to deduplicating uncompressed, 
row-major data. However, the Parquet format's structure makes it possible to 
apply content-defined chunking effectively to Parquet files as well. 
Conveniently, only the writer needs to be aware of the chunking: readers can 
still consume the file as a regular Parquet file, and no changes to the 
Parquet format are required.
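
As a rough sketch of what writer-side usage could look like with the draft
PyArrow bindings [6] (the option name below is an assumption and may differ
from the final API):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"id": list(range(1000)),
                      "value": [str(i) for i in range(1000)]})

    # Hypothetical writer option from the draft PR [6]; the exact name and
    # shape of the option may change before it is merged.
    pq.write_table(table, "data_cdc.parquet", use_content_defined_chunking=True)

    # No reader changes: the output is a regular Parquet file.
    assert pq.read_table("data_cdc.parquet").equals(table)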

One practical example is storing & serving multiple revisions of a Parquet 
file, including appends/insertions/deletions/updates (the deduplication 
ratios implied by the figures are worked out after the list):
- Vanilla Parquet (Snappy): The total size of all revisions is 182.6 GiB, 
  and the content-addressable storage requires 148.0 GiB. While the storage
  is able to identify some common chunks in the Parquet files, the 
  deduplication is fairly low.
- Parquet with CDC (Snappy): The total size is 178.3 GiB, and the storage 
  requirement is reduced to 75.6 GiB. The Parquet files are written with
  content-defined chunking, hence the deduplication is greatly improved.
- Parquet with CDC (ZSTD): The total size is 109.6 GiB, and the storage 
  requirement is reduced to 55.9 GiB, showing that deduplication improves 
  markedly for both Snappy- and ZSTD-compressed Parquet files.
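
For reference, here is how the deduplication ratios implied by the figures
above work out (total logical size divided by deduplicated storage size):

    # Sizes in GiB, taken from the three cases above.
    cases = {
        "Vanilla Parquet (Snappy)":  (182.6, 148.0),
        "Parquet with CDC (Snappy)": (178.3, 75.6),
        "Parquet with CDC (ZSTD)":   (109.6, 55.9),
    }
    for name, (total, stored) in cases.items():
        print(f"{name}: {total / stored:.2f}x deduplication")
    # -> roughly 1.23x, 2.36x and 1.96x respectively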

I have created a draft implementation [6] of this feature in Parquet C++ and 
PyArrow, along with an evaluation tool [7] to (1) better understand the actual
changes in the Parquet files and (2) evaluate the deduplication efficiency of 
various Parquet datasets. 
You can find more details and results in the evaluation tool's repository [7].

I think this feature could be very useful for other projects as well, so I am 
eager to hear the community's feedback.

Cross-posting to the Apache Arrow mailing list for better visibility, though 
please reply to the Apache Parquet mailing list.

Regards, Krisztian 

[1]: https://joshleeb.com/posts/content-defined-chunking.html
[2]: https://en.wikipedia.org/wiki/Rolling_hash#Gear_fingerprint_and_content-based_chunking_algorithm_FastCDC
[3]: https://xethub.com/blog/from-files-to-chunks-improving-hf-storage-efficiency
[4]: https://restic.net
[5]: https://lists.apache.org/list?d...@parquet.apache.org:2024-10:dedupe
[6]: https://github.com/apache/arrow/pull/45360
[7]: https://github.com/kszucs/de
