I think the idea is quite neat -- as I understand it, your PR basically implements a change to the Parquet writer that can efficiently detect duplication in the data and thus avoid storing it multiple times. Thank you for sharing it.
One comment I have is that I found that the name "Content Defined Chunking", while
technically accurate, obscures what this feature is. To my mind, CDC describes an
implementation detail. It might be better to describe the feature by its use case --
perhaps "auto deduplication" or "update optimized files" or something else like that.
I also had a hard time mapping the description in [7] to my data. I didn't look at the
code, but I didn't understand what an "edit" meant in that context (was the idea that
the program updates a value logically "in place" in the encoded values?). I think if
you made what is happening clearer from the data model perspective, it would be easier
for people to understand the potential benefits.

Andrew

[7]: https://github.com/kszucs/de

On Tue, Jan 28, 2025 at 1:05 PM Krisztián Szűcs <szucs.kriszt...@gmail.com> wrote:

> Dear Community,
>
> I would like to share recent developments on applying Content Defined
> Chunking (CDC [1][2]) to Parquet files. CDC is a technique that divides
> data into variable-sized chunks based on the content of the data itself,
> rather than fixed-size boundaries. This makes it effective for
> deduplication in content addressable storage systems, like Hugging Face
> Hub [3] or restic [4]. There was an earlier discussion [5] on the Parquet
> mailing list about this feature; this is a follow-up on the progress made
> since then.
>
> Generally speaking, CDC is more suitable for deduplicating uncompressed
> row-major data. However, Parquet Format's unique features enable us to
> apply content-defined chunking effectively on Parquet files as well.
> Luckily, only the writer needs to be aware of the chunking; the reader can
> still read the file as a regular Parquet file, and no Parquet Format
> changes are required.
>
> One practical example is storing & serving multiple revisions of a Parquet
> file, including appends/insertions/deletions/updates:
> - Vanilla Parquet (Snappy): The total size of all revisions is 182.6 GiB,
> and the content addressable storage requires 148.0 GiB. While the storage
> is able to identify some common chunks in the Parquet files, the
> deduplication is fairly low.
> - Parquet with CDC (Snappy): The total size is 178.3 GiB, and the storage
> requirement is reduced to 75.6 GiB. The Parquet files are written with
> content-defined chunking, hence the deduplication is greatly improved.
> - Parquet with CDC (ZSTD): The total size is 109.6 GiB, and the storage
> requirement is reduced to 55.9 GiB, showing that the deduplication ratio
> is greatly improved for both Snappy and ZSTD compressed Parquet files.
>
> I created a draft implementation [6] for this feature in Parquet C++ and
> PyArrow, and an evaluation tool [7] to (1) better understand the actual
> changes in the Parquet files and (2) evaluate the deduplication efficiency
> of various Parquet datasets. You can find more details and results in the
> evaluation tool's repository [7].
>
> I think this feature could be very useful for other projects as well, so I
> am eager to hear the community's feedback.
>
> Cross-posting to the Apache Arrow mailing list for better visibility,
> though please reply to the Apache Parquet mailing list.
>
> Regards, Krisztian
>
> [1]: https://joshleeb.com/posts/content-defined-chunking.html
> [2]: https://en.wikipedia.org/wiki/Rolling_hash#Gear_fingerprint_and_content-based_chunking_algorithm_FastCDC
> [3]: https://xethub.com/blog/from-files-to-chunks-improving-hf-storage-efficiency
> [4]: https://restic.net
> [5]: https://lists.apache.org/list?d...@parquet.apache.org:2024-10:dedupe
> [6]: https://github.com/apache/arrow/pull/45360
> [7]: https://github.com/kszucs/de
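
For readers who want a concrete picture of the chunking step described in the quoted
message, here is a minimal sketch of content-defined chunking with a gear-style
rolling hash in the spirit of FastCDC [2]. The gear table, mask, and size limits are
illustrative assumptions, not the parameters used by the draft implementation in [6].

```python
import random

# Illustrative parameters (assumptions, not the values used in [6]).
random.seed(42)
GEAR = [random.getrandbits(64) for _ in range(256)]  # one 64-bit value per byte value
MASK = (1 << 16) - 1                  # expect a boundary roughly every 64 KiB
MIN_SIZE, MAX_SIZE = 16_384, 262_144  # hard lower/upper bounds on chunk size


def chunk_boundaries(data: bytes):
    """Yield (start, end) offsets of content-defined chunks in `data`."""
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Rolling gear hash: shift and mix in the gear value of the current byte.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        size = i + 1 - start
        # Cut when the hash matches the mask (content-driven boundary),
        # or force a cut once the chunk reaches MAX_SIZE.
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            yield start, i + 1
            start = i + 1
            h = 0
    if start < len(data):
        yield start, len(data)
```

Because a boundary depends only on the bytes immediately before it, inserting or
deleting a record only shifts the chunks near the edit; chunks elsewhere keep
identical content, which is what lets a content-addressable store deduplicate them
across revisions.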
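
And a rough sketch of how the feature might be enabled from PyArrow once [6] lands.
The `use_content_defined_chunking` keyword is an assumed option name based on the
draft PR and may well change; treat this as a usage sketch, not the final API.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "id": list(range(1_000)),
    "value": [f"row-{i}" for i in range(1_000)],
})

# Writing successive revisions of the same dataset with content-defined chunking
# should keep page boundaries stable across unchanged row ranges, so a
# content-addressable store (e.g. Hugging Face Hub, restic) only has to keep the
# chunks that actually changed.
pq.write_table(
    table,
    "revision1.parquet",
    compression="zstd",
    use_content_defined_chunking=True,  # assumed option name from the draft PR [6]
)
```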