I might be misunderstanding (I only looked at the code briefly), but I think
the idea is quite neat -- as I understand it, your PR basically implements a
change to the Parquet writer that can efficiently detect duplication in the
data and thus avoid storing it multiple times. Thank you for sharing it.
One comment I have is that I found the name "Content Defined Chunking"
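
To make the dedup idea in the reply above concrete: this is not the PR's
actual code or API, just a minimal, hypothetical sketch of the general
mechanism -- once the data is split into chunks, a writer (or the storage
layer beneath it) can key each chunk by a digest and keep identical chunks
only once. The name store_chunks and the dict-based store are illustrative
assumptions:

    import hashlib

    def store_chunks(chunks, store):
        """Store each chunk once, keyed by its SHA-256 digest.

        chunks: iterable of bytes; store: dict mapping digest -> chunk bytes.
        Returns the list of digests referencing the deduplicated chunks.
        """
        ids = []
        for chunk in chunks:
            digest = hashlib.sha256(chunk).digest()
            store.setdefault(digest, chunk)  # a repeated chunk is not stored again
            ids.append(digest)
        return ids

    # Two "files" sharing most of their bytes: six chunk references,
    # but only four distinct chunks end up stored.
    store = {}
    ids_a = store_chunks([b"header", b"shared-block", b"footer-a"], store)
    ids_b = store_chunks([b"header", b"shared-block", b"footer-b"], store)
    assert len(store) == 4 and len(ids_a) == len(ids_b) == 3
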
Dear Community,
I would like to share recent developments on applying Content Defined Chunking
(CDC [1][2]) to Parquet files. CDC is a technique that divides data into
variable-sized chunks based on the content of the data itself, rather than
fixed-size boundaries. This makes it effective for deduplication.
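
For readers who have not seen CDC before, here is a minimal sketch of a
content-defined chunker built on a Gear-style rolling hash (as used in
FastCDC-like chunkers); the mask width and the min/max chunk sizes below are
illustrative parameters, not the values used in the Parquet work:

    import random

    # Per-byte random table for the Gear hash (any fixed 256 x 64-bit table works).
    random.seed(42)
    GEAR = [random.getrandbits(64) for _ in range(256)]

    MASK = (1 << 13) - 1          # cut when low 13 bits are zero: ~8 KiB average chunks
    MIN_SIZE, MAX_SIZE = 2048, 65536

    def cdc_chunks(data: bytes):
        """Yield variable-sized chunks whose boundaries depend on content."""
        start, h = 0, 0
        for i, b in enumerate(data):
            # Rolling Gear hash: old bytes shift out, giving an effective ~64-byte window.
            h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF
            size = i - start + 1
            if size < MIN_SIZE:
                continue
            # Cut when the hash matches the mask pattern, or the chunk grows too large.
            if (h & MASK) == 0 or size >= MAX_SIZE:
                yield data[start:i + 1]
                start, h = i + 1, 0
        if start < len(data):
            yield data[start:]

The point of cutting on a content-derived condition rather than at fixed
offsets is stability under edits: inserting or deleting a few bytes shifts
only the nearby boundaries, and the chunks after them stay byte-identical,
which is exactly what lets chunk-level hashing (as in the earlier sketch)
detect the duplication.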