Dear Community,

I would like to share recent developments on applying Content Defined Chunking (CDC [1][2]) to Parquet files. CDC is a technique that divides data into variable-sized chunks based on the content of the data itself rather than at fixed-size boundaries. This makes it effective for deduplication in content-addressable storage systems such as the Hugging Face Hub [3] or restic [4]. There was an earlier discussion [5] on the Parquet mailing list about this feature; this is a follow-up on the progress made since then.
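To make the chunking idea concrete, below is a minimal, illustrative Python sketch of gear-based CDC in the spirit of [2]. The gear table and size parameters are arbitrary choices for illustration and are not the ones used in the actual implementation:

    import hashlib

    # Illustrative 256-entry gear table, derived deterministically from SHA-256.
    GEAR = [
        int.from_bytes(hashlib.sha256(bytes([i])).digest()[:8], "big")
        for i in range(256)
    ]

    def cdc_chunks(data: bytes, min_size=2048, avg_size=8192, max_size=65536):
        """Yield content-defined chunks of `data`.

        A boundary is declared when the low bits of a gear rolling hash are
        all zero, so boundaries depend only on nearby content and survive
        insertions or deletions elsewhere in the stream.
        """
        mask = avg_size - 1  # avg_size must be a power of two
        start, h = 0, 0
        for i, byte in enumerate(data):
            h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
            length = i - start + 1
            if (length >= min_size and (h & mask) == 0) or length >= max_size:
                yield data[start:i + 1]
                start, h = i + 1, 0
        if start < len(data):
            yield data[start:]

Because the boundaries are derived from the content, inserting or deleting a few bytes only changes the chunks around the edit, while the remaining chunks (and thus their hashes) stay identical.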
Generally speaking, CDC is better suited to deduplicating uncompressed row-major data. However, the Parquet format's features enable us to apply content-defined chunking effectively to Parquet files as well. Luckily, only the writer needs to be aware of the chunking; the reader can still read the file as a regular Parquet file, and no changes to the Parquet format are required.

One practical example is storing and serving multiple revisions of a Parquet file, including appends, insertions, deletions, and updates:

- Vanilla Parquet (Snappy): the total size of all revisions is 182.6 GiB, and the content-addressable storage requires 148.0 GiB. While the storage is able to identify some common chunks in the Parquet files, the deduplication ratio is fairly low.
- Parquet with CDC (Snappy): the total size is 178.3 GiB, and the storage requirement is reduced to 75.6 GiB. The Parquet files are written with content-defined chunking, hence the deduplication is greatly improved.
- Parquet with CDC (ZSTD): the total size is 109.6 GiB, and the storage requirement is reduced to 55.9 GiB, showing that the deduplication ratio is greatly improved for both Snappy- and ZSTD-compressed Parquet files.

I created a draft implementation [6] of this feature in Parquet C++ and PyArrow, along with an evaluation tool [7] to (1) better understand the actual changes in the Parquet files and (2) evaluate the deduplication efficiency on various Parquet datasets. You can find more details and results in the evaluation tool's repository [7].

I think this feature could be very useful for other projects as well, so I am eager to hear the community's feedback. Cross-posting to the Apache Arrow mailing list for better visibility, though please reply to the Apache Parquet mailing list.

Regards,
Krisztian

[1]: https://joshleeb.com/posts/content-defined-chunking.html
[2]: https://en.wikipedia.org/wiki/Rolling_hash#Gear_fingerprint_and_content-based_chunking_algorithm_FastCDC
[3]: https://xethub.com/blog/from-files-to-chunks-improving-hf-storage-efficiency
[4]: https://restic.net
[5]: https://lists.apache.org/list?d...@parquet.apache.org:2024-10:dedupe
[6]: https://github.com/apache/arrow/pull/45360
[7]: https://github.com/kszucs/de
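P.S. For anyone who wants to experiment with the draft branch [6], here is a minimal usage sketch in PyArrow. The `use_content_defined_chunking` writer option is the name used in the draft implementation and may still change:

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"id": list(range(1_000_000))})

    # Write with content-defined chunking enabled; readers need no changes
    # and see a regular Parquet file.
    pq.write_table(
        table,
        "data.parquet",
        compression="zstd",
        use_content_defined_chunking=True,
    )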