Re: [C++][Parquet] Support different compression algorithms per row group

2024-03-25 Thread Antoine Pitrou
Regardless of whether they have different compression ratios, it doesn't explain why you would want a different compression *algorithm* altogether. The choice of a compression algorithm should basically be driven by two concerns: the acceptable space/time tradeoff (do you want to minimize d

Re: [C++][Parquet] Support different compression algorithms per row group

2024-03-25 Thread Gang Wu
Sometimes rows from different row groups may have different compression ratios when data distribution varies a lot among them. It seems to me that a harder problem is how you would figure out that pattern before the data is written and compressed. If that is not a problem in your case, it would be
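To make the point about varying compression ratios concrete, here is a minimal sketch that compresses two toy "row groups" with the same codec and compares the ratios. It does not use Parquet at all: the chunks are raw float64 buffers, zlib stands in for whatever codec a writer would use, and the names `make_chunk` and `ratio` are made up for this illustration.

```python
import random
import struct
import zlib


def make_chunk(kind: str, n: int = 10_000, seed: int = 0) -> bytes:
    """Build a toy 'row group': n float64 values packed as raw bytes."""
    rng = random.Random(seed)
    if kind == "repetitive":
        # Only 8 distinct values, cycling: highly redundant bytes.
        values = [float(i % 8) for i in range(n)]
    else:
        # Uniform random floats: mantissa bytes are close to incompressible.
        values = [rng.random() for _ in range(n)]
    return struct.pack(f"<{n}d", *values)


def ratio(raw: bytes, level: int = 6) -> float:
    """Uncompressed size divided by zlib-compressed size."""
    return len(raw) / len(zlib.compress(raw, level))


chunks = {kind: make_chunk(kind) for kind in ("repetitive", "noisy")}
ratios = {kind: ratio(raw) for kind, raw in chunks.items()}

for kind, r in ratios.items():
    print(f"{kind}: {r:.1f}x")
```

The same codec and level give very different ratios on the two chunks, but the sketch only finds that out by compressing the data, which is exactly the difficulty raised above: the pattern is hard to know before the row group is written.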

Re: [C++][Parquet] Support different compression algorithms per row group

2024-03-23 Thread Aldrin
Hi Andrei, I tried finding more details on block compression in parquet (or compression per data page) and I couldn't find anything to satisfy my curiosity about how it can be used and how it performs. I hate being the person to just say "test it first," so I want to also recommend figuring out

Re: [C++][Parquet] Support different compression algorithms per row group

2024-03-23 Thread Antoine Pitrou
Hello Andrei, On 23/03/2024 at 13:23, Andrei Lazăr wrote: > At this very moment, specifying different compression algorithms per column is supported and in my use case it is extremely helpful, as I have some columns (mostly containing floats), for which a compression algorithm like Snappy (or

Re: [C++][Parquet] Support different compression algorithms per row group

2024-03-23 Thread Andrei Lazăr
Hi Aldrin, thanks for taking the time to reply to my email! In my understanding, compression on Parquet files happens on the Data Page level for every column, meaning that even across a row group, there can be multiple units of data compression, and most certainly there are going to be different

Re: [C++][Parquet] Support different compression algorithms per row group

2024-03-22 Thread Aldrin
Hello! I don't do much with compression, so I could be wrong, but I assume a compression algorithm spans the whole column and areas of large variance generally benefit less from the compression, but the encoding still provides benefits across separate areas (e.g. separate row groups). My impress

Re: [C++][Parquet] Support different compression algorithms per row group

2024-03-22 Thread Andrei Lazăr
Hi Gang, Thanks a lot for getting back to me! My use case is relatively simple: I was experimenting with some data and wanted to benchmark different compression algorithms in an effort to speed up data retrieval in a simple Parquet-based database that I am playing around with.

Re: [C++][Parquet] Support different compression algorithms per row group

2024-03-20 Thread Gang Wu
Hi Andrei, What is your use case? IMHO, exposing this kind of configuration will force users to know how the writer will split row groups, which does not look simple to me. Best, Gang On Thu, Mar 21, 2024 at 2:25 AM Andrei Lazăr wrote: > Hi all, > > I would like proposing adding support for wr

[C++][Parquet] Support different compression algorithms per row group

2024-03-20 Thread Andrei Lazăr
Hi all, I would like to propose adding support for writing a Parquet file with different compression algorithms for every row group. In my understanding, the Parquet format allows this; however, it seems to me that there is no way to achieve this with the C++ implementation. Does anyone have any t