Different row groups may indeed have very different compression ratios when
the data distribution varies a lot among them. It seems to me that a harder
problem is how you would figure out that pattern before the data is written
and compressed. If that is not a problem in your case, it would be much
easier to make each Parquet file contain only one row group and apply a
different compression algorithm on a per-file basis.
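For illustration, an untested sketch of that per-file approach with the
Arrow C++ API could look like the following, where ChooseCodec is a made-up
placeholder for whatever heuristic picks a codec for a given slice of data:

#include <string>
#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/writer.h>
#include <parquet/properties.h>

// Made-up placeholder: in practice you would probe a sample of the
// slice to decide which codec fits it best.
parquet::Compression::type ChooseCodec(const arrow::Table& slice) {
  return parquet::Compression::ZSTD;
}

arrow::Status WriteOneRowGroupPerFile(
    const std::shared_ptr<arrow::Table>& table, int64_t rows_per_file) {
  for (int64_t off = 0, i = 0; off < table->num_rows();
       off += rows_per_file, ++i) {
    std::shared_ptr<arrow::Table> slice = table->Slice(off, rows_per_file);
    std::shared_ptr<parquet::WriterProperties> props =
        parquet::WriterProperties::Builder()
            .compression(ChooseCodec(*slice))
            ->build();
    ARROW_ASSIGN_OR_RAISE(auto sink,
                          arrow::io::FileOutputStream::Open(
                              "part-" + std::to_string(i) + ".parquet"));
    // chunk_size == rows_per_file keeps each file at a single row group.
    ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
        *slice, arrow::default_memory_pool(), sink, rows_per_file, props));
  }
  return arrow::Status::OK();
}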
Best,
Gang

On Sun, Mar 24, 2024 at 2:04 AM Aldrin <octalene....@pm.me.invalid> wrote:

> Hi Andrei,
>
> I tried to find more details on block compression in Parquet (or
> compression per data page) and I couldn't find anything to satisfy my
> curiosity about how it can be used and how it performs.
>
> I hate being the person who just says "test it first," so I also want
> to recommend figuring out how you imagine the interface would be
> designed. Some formats like ORC seem to have two compression modes
> (optimize for speed or for space), while Parquet exposes more of the
> tuning knobs (according to [1]). And to Gang's point, there's a
> question of what can be exposed at the various abstraction levels
> (perhaps end users would never be interested in this, so it would be
> exposed only through an advanced or internal interface).
>
> Anyway, good luck scoping it out, and feel free to iterate with the
> mailing list as you try things out rather than only when finished;
> maybe someone can chime in with more information and thoughts in the
> meantime.
>
> [1]: https://arxiv.org/pdf/2304.05028.pdf
>
> On Sat, Mar 23, 2024 at 05:23, Andrei Lazăr <lazarandrei...@gmail.com>
> wrote:
>
> Hi Aldrin, thanks for taking the time to reply to my email!
>
> In my understanding, compression in Parquet files happens at the data
> page level for every column, meaning that even within a single row
> group there can be multiple units of data compression, and there are
> certainly different units of data compression across an entire Parquet
> file. Therefore, what I am hoping is that more granular
> compression-algorithm choices could lead to better overall compression,
> as the data in the same column can differ quite a lot across row
> groups.
>
> At this very moment, specifying different compression algorithms per
> column is already supported, and in my use case it is extremely
> helpful: I have some columns (mostly containing floats) for which a
> compression algorithm like Snappy (or even no compression at all)
> speeds up my queries significantly compared with keeping the data
> compressed with something like ZSTD or GZIP.
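> For reference, that existing per-column knob in the C++ writer
> properties looks roughly like this (untested, and the column names are
> made up):
>
> #include <parquet/properties.h>
>
> std::shared_ptr<parquet::WriterProperties> props =
>     parquet::WriterProperties::Builder()
>         .compression(parquet::Compression::ZSTD)  // file-wide default
>         ->compression("float_col", parquet::Compression::SNAPPY)
>         ->compression("id_col", parquet::Compression::UNCOMPRESSED)
>         ->build();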
> That being said, your suggestion of writing a benchmark and sharing the
> results here to support this approach is a great idea; I will try doing
> that!
>
> Once again, thank you for your time!
>
> Kind regards,
> Andrei
>
> On Fri, 22 Mar 2024 at 22:12, Aldrin <octalene....@pm.me.invalid> wrote:
>
> > Hello!
> >
> > I don't do much with compression, so I could be wrong, but I assume a
> > compression algorithm spans the whole column; areas of large variance
> > generally benefit less from the compression, while the encoding still
> > provides benefits across separate areas (e.g. separate row groups).
> >
> > My impression is that compression will not be any better if it is
> > restricted to only a subset of the data, and if it is scoped to only
> > a subset of the data, there are extra overheads beyond what you would
> > normally have (the same raw value would have the same encoded value
> > stored per row group). I suppose things like run-length encoding
> > won't be any less efficient, but they also wouldn't be any more
> > efficient (with the caveat of a raw value repeating across row
> > groups).
> >
> > A different compression for different columns isn't unreasonable, so
> > I think I could be easily convinced that it has benefits (though it
> > would require per-column logic that could slow other things down).
> >
> > These are just my thoughts, though. Can you share the design and
> > results of your benchmark? Have you prototyped (or could you
> > prototype) anything to test it out?
> >
> > On Fri, Mar 22, 2024 at 14:36, Andrei Lazăr <lazarandrei...@gmail.com>
> > wrote:
> >
> > Hi Gang,
> >
> > Thanks a lot for getting back to me!
> >
> > The use case I have is relatively simple: I was playing around with
> > some data and wanted to benchmark different compression algorithms in
> > an effort to speed up data retrieval in a simple Parquet-based
> > database that I am experimenting with. While doing so, I noticed a
> > very large variance in the performance of the same compression
> > algorithm over different row groups in my Parquet files. Therefore, I
> > was thinking that the best compression configuration for my data
> > would be to use a different algorithm for every column, for every row
> > group in my files. In a real-world situation, I can see this being
> > used by a database, either when new entries are inserted into it, or
> > even as a background 'optimizer' job that runs over existing data.
> >
> > How do you feel about this?
> >
> > Thank you,
> > Andrei
> >
> > On Thu, 21 Mar 2024 at 02:11, Gang Wu <ust...@gmail.com> wrote:
> >
> > > Hi Andrei,
> > >
> > > What is your use case? IMHO, exposing this kind of configuration
> > > would force users to know how the writer splits row groups, which
> > > does not look simple to me.
> > >
> > > Best,
> > > Gang
> > >
> > > On Thu, Mar 21, 2024 at 2:25 AM Andrei Lazăr
> > > <lazarandrei...@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I would like to propose adding support for writing a Parquet file
> > > > with a different compression algorithm for every row group.
> > > >
> > > > In my understanding, the Parquet format allows this; however, it
> > > > seems to me that there is no way to achieve this from the C++
> > > > implementation.
> > > >
> > > > Does anyone have any thoughts on this?
> > > >
> > > > Thank you,
> > > > Andrei
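P.S. To quantify the per-row-group variance described above without
decoding any data, you can compare compressed vs. uncompressed sizes
straight from the file metadata. An untested sketch with parquet-cpp:

#include <iostream>
#include <memory>
#include <string>
#include <parquet/file_reader.h>

void PrintCompressionRatios(const std::string& path) {
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile(path);
  std::shared_ptr<parquet::FileMetaData> md = reader->metadata();
  for (int rg = 0; rg < md->num_row_groups(); ++rg) {
    for (int col = 0; col < md->num_columns(); ++col) {
      auto chunk = md->RowGroup(rg)->ColumnChunk(col);
      // Ratio of raw bytes to on-disk bytes for this column chunk; a
      // large spread across row groups is the variance in question.
      double ratio =
          static_cast<double>(chunk->total_uncompressed_size()) /
          static_cast<double>(chunk->total_compressed_size());
      std::cout << "row group " << rg << ", column " << col
                << ": ratio = " << ratio << "\n";
    }
  }
}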