Re: Content Defined Chunking of Parquet Files

2025-02-02 Thread Micah Kornfield
> I think the idea is quite neat -- as I understand your PR basically implements a change to the parquet writer that can efficiently detect duplication in the data and thus avoid storing it multiple times. Thank you for sharing it.

I might be misunderstanding (only looked at code briefly)

Re: Content Defined Chunking of Parquet Files

2025-02-01 Thread Andrew Lamb
I think the idea is quite neat -- as I understand it, your PR basically implements a change to the parquet writer that can efficiently detect duplication in the data and thus avoid storing it multiple times. Thank you for sharing it. One comment I have is that I found the name "Content Defined Chunking