Re: Content Defined Chunking of Parquet Files

2025-02-02 Thread Micah Kornfield
> I think the idea is quite neat -- as I understand your PR basically implements a change to the parquet writer that can efficiently detect duplication in the data and thus avoid storing it multiple times. Thank you for sharing it

I might be misunderstanding (only looked at the code briefly)

Re: Content Defined Chunking of Parquet Files

2025-02-01 Thread Andrew Lamb
I think the idea is quite neat -- as I understand it, your PR basically implements a change to the Parquet writer that can efficiently detect duplication in the data and thus avoid storing it multiple times. Thank you for sharing it. One comment I have is that I found the name "Content Defined Chunking…

Content Defined Chunking of Parquet Files

2025-01-28 Thread Krisztián Szűcs
Dear Community, I would like to share recent developments on applying Content Defined Chunking (CDC [1][2]) to Parquet files. CDC is a technique that divides data into variable-sized chunks based on the content of the data itself, rather than at fixed-size boundaries. This makes it effective for deduplication…
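To illustrate the idea the announcement describes, here is a minimal sketch of content-defined chunking using a Gear-style rolling hash. This is not the PR's implementation: the gear table, mask width (targeting roughly 8 KiB average chunks), and the min/max size bounds are all illustrative assumptions. The key property is that chunk boundaries are chosen where the rolling hash of the content matches a pattern, so inserting or removing bytes only perturbs boundaries near the edit and the remaining chunks realign, which is what makes CDC useful for deduplication.

```python
import random

random.seed(42)
# Hypothetical 256-entry gear table of random 64-bit values
# (the real parameters used by any given CDC implementation will differ).
GEAR = [random.getrandbits(64) for _ in range(256)]

MASK = (1 << 13) - 1          # ~8 KiB average chunk size (assumption)
MIN_SIZE, MAX_SIZE = 2048, 65536  # illustrative bounds on chunk size

def cdc_chunks(data: bytes) -> list[bytes]:
    """Split data into variable-sized chunks at content-defined boundaries."""
    chunks = []
    start = 0   # start of the current chunk
    h = 0       # rolling Gear hash over bytes since `start`
    for i, byte in enumerate(data, start=1):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        size = i - start
        # Cut when the hash matches the mask (and the chunk is big enough),
        # or when the hard maximum size is reached.
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            chunks.append(data[start:i])
            start = i
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder, may be short
    return chunks
```

Because boundaries depend only on a sliding window of content, two files that share long runs of bytes tend to produce many identical chunks, which a writer can then store once.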