Whether or not they have different compression ratios, that doesn't
explain why you would want a different compression *algorithm* altogether.
The choice of a compression algorithm should basically be driven by two
concerns: the acceptable space/time tradeoff (do you want to minimize
d
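The space/time tradeoff mentioned above can be seen with a minimal, stand-alone micro-benchmark. This is only a sketch using Python's standard-library codecs (zlib, bz2, lzma) as stand-ins for Parquet's codecs, on a made-up payload; real column data will behave differently.

```python
# Illustrative only: compare codecs on one payload to see the
# space/time tradeoff that should drive the choice of algorithm.
import bz2
import lzma
import time
import zlib

# Repetitive sample payload; real column data compresses differently.
payload = b"some,repetitive,row,data;" * 40_000

for name, compress in [
    ("zlib", zlib.compress),
    ("bz2", bz2.compress),
    ("lzma", lzma.compress),
]:
    start = time.perf_counter()
    compressed = compress(payload)
    elapsed = time.perf_counter() - start
    ratio = len(payload) / len(compressed)
    print(f"{name}: ratio={ratio:.1f}x time={elapsed * 1000:.1f}ms")
```

Typically the heavier codecs buy a better ratio at the cost of CPU time, which is exactly the tradeoff at stake here.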
Sometimes rows from different row groups may have different compression
ratios when the data distribution varies a lot among them. It seems to me
that the harder problem is how you would figure out that pattern before the
data is written and compressed. If that is not a problem in your case, it would
be
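One way to attack the "figure out the pattern before writing" problem raised above is to trial-compress a small sample of each chunk with every candidate codec and pick the best. The sketch below is purely illustrative (the function name, candidate set, and sample size are all made up) and uses standard-library codecs rather than Parquet's.

```python
# Sketch: before committing a chunk to a codec, trial-compress a small
# prefix with each candidate and pick the one with the smallest output.
import bz2
import zlib

CANDIDATES = {"zlib": zlib.compress, "bz2": bz2.compress}

def pick_codec(chunk: bytes, sample_size: int = 4096) -> str:
    """Return the candidate codec whose sampled output is smallest."""
    sample = chunk[:sample_size]
    return min(CANDIDATES, key=lambda name: len(CANDIDATES[name](sample)))

repetitive = b"aaaabbbb" * 1000
print(pick_codec(repetitive))
```

Sampling a prefix is cheap but can mislead if the distribution shifts within the chunk, which is the hard part of the problem as stated in the thread.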
Hi Andrei,
I tried finding more details on block compression in Parquet (or compression
per data page), and I couldn't find anything to satisfy my curiosity about how
it can be used and how it performs.
I hate being the person to just say "test it first," so I want to also
recommend figuring out
Hello Andrei,
On 23/03/2024 at 13:23, Andrei Lazăr wrote:
At the moment, specifying different compression algorithms per column
is supported, and in my use case it is extremely helpful, as I have some
columns (mostly containing floats), for which a compression algorithm like
Snappy (or
Hi Aldrin, thanks for taking the time to reply to my email!
In my understanding, compression in Parquet files happens at the Data Page
level for every column, meaning that even within a row group there can be
multiple units of data compression, and most certainly there are going to
be different
Hello!
I don't do much with compression, so I could be wrong, but I assume a
compression algorithm spans the whole column, and areas of large variance
generally benefit less from compression, while the encoding still provides
benefits across separate areas (e.g. separate row groups).
My impress
Hi Gang,
Thanks a lot for getting back to me!
So the use case I have is relatively simple: I was playing around with
some data and wanted to benchmark different compression algorithms in an
effort to speed up data retrieval in a simple Parquet-based database that I
am experimenting with.
Hi Andrei,
What is your use case? IMHO, exposing this kind of configuration
will force users to know how the writer will split row groups, which
does not look simple to me.
Best,
Gang
On Thu, Mar 21, 2024 at 2:25 AM Andrei Lazăr
wrote:
> Hi all,
>
> I would like to propose adding support for writing a Parquet file with
> different compression algorithms for every row group.
Hi all,
I would like to propose adding support for writing a Parquet file with
different compression algorithms for every row group.
In my understanding, the Parquet format allows this; however, it seems to me
that there is no way to achieve this from the C++ implementation.
Does anyone have any t