Hello! I don't do much with compression, so I could be wrong, but my assumption is that a compression algorithm spans the whole column: areas of large variance generally benefit less from compression, while the encoding still provides benefits across separate areas (e.g. separate row groups). My impression is that compression will not be any better if it is restricted to only a subset of the data, and if it is scoped to a subset there are extra overheads beyond what you would normally have (the same raw value would have the same encoded value stored per row group). I suppose something like run-length encoding would not be any less efficient, but it also would not be any more efficient (with the caveat of a raw value repeating across row groups).

A different compression for different columns is not unreasonable, so I think I could be easily convinced that has benefits (though it would require per-column logic that could slow other things down).

These are just my thoughts, though. Can you share the design and results of your benchmark? Have you prototyped anything (or could you) to test it out?

Sent from Proton Mail for iOS

On Fri, Mar 22, 2024 at 14:36, Andrei Lazăr <lazarandrei...@gmail.com> wrote:

Hi Gang,
Thanks a lot for getting back to me!

So my use case is relatively simple: I was playing around with some data and wanted to benchmark different compression algorithms in an effort to speed up data retrieval in a simple Parquet-based database that I am playing around with. While doing so, I noticed a very large variance in the performance of the same compression algorithm over different row groups in my Parquet files. I was therefore thinking that the best compression configuration for my data would be to use a different algorithm for every column, for every row group in my files.

In a real-world situation, I can see this being used by a database, either when new entries are inserted into it, or even as a background 'optimizer' job that runs over existing data.

How do you feel about this?

Thank you,
Andrei

On Thu, 21 Mar 2024 at 02:11, Gang Wu <ust...@gmail.com> wrote:
> Hi Andrei,
>
> What is your use case? IMHO, exposing this kind of configuration
> will force users to know how the writer will split row groups, which
> does not look simple to me.
>
> Best,
> Gang
>
> On Thu, Mar 21, 2024 at 2:25 AM Andrei Lazăr <lazarandrei...@gmail.com> wrote:
>
>> Hi all,
>>
>> I would like to propose adding support for writing a Parquet file with
>> different compression algorithms for every row group.
>>
>> In my understanding, the Parquet format allows this; however, it seems to me
>> that there is no way to achieve this from the C++ implementation.
>>
>> Does anyone have any thoughts on this?
>>
>> Thank you,
>> Andrei
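For anyone following along: the effect Andrei describes (the same codec performing very differently on different slices of the data) is easy to reproduce with a tiny micro-benchmark. Below is a minimal, hypothetical sketch using only Python's stdlib codecs (zlib, bz2, lzma) as stand-ins for Parquet's codecs; the data, names, and thresholds are illustrative assumptions of mine, not anything from the thread.

```python
# Illustrative micro-benchmark: compression ratio depends heavily on the
# value distribution of each "row group". Stdlib codecs stand in for
# Parquet's SNAPPY/GZIP/ZSTD; all names and data here are hypothetical.
import bz2
import lzma
import random
import zlib

def ratios(raw: bytes) -> dict:
    """Compressed-size / raw-size for each codec (lower is better)."""
    return {
        "zlib": len(zlib.compress(raw)) / len(raw),
        "bz2": len(bz2.compress(raw)) / len(raw),
        "lzma": len(lzma.compress(raw)) / len(raw),
    }

random.seed(0)
# Two synthetic "row groups" with very different value distributions:
# one with ~1 bit of entropy per byte, one of incompressible random bytes.
low_variance = bytes(random.choice(b"ab") for _ in range(1 << 16))
high_variance = random.randbytes(1 << 16)

for name, group in [("low variance", low_variance),
                    ("high variance", high_variance)]:
    print(name, ratios(group))
```

Under this setup the low-variance group compresses to a small fraction of its raw size while the random group barely shrinks at all (it can even grow slightly from codec overhead), which is the kind of per-row-group variance that would motivate choosing a codec per row group rather than one codec for the whole file.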