Different row groups may indeed have very different compression ratios when
the data distribution varies a lot among them. It seems to me that a harder
problem is how you would figure out that pattern before the data is written
and compressed. If that is not a problem in your case, it would be much
easier to make each Parquet file contain only one row group and apply a
different compression algorithm on a per-file basis.
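For illustration, an untested sketch of that per-file approach with the
Arrow C++ API could look like the following, where ChooseCodec is a made-up
placeholder for whatever heuristic picks a codec for a given slice of data:

#include <string>
#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/writer.h>
#include <parquet/properties.h>

// Made-up placeholder: in practice you would probe a sample of the
// slice to decide which codec fits it best.
parquet::Compression::type ChooseCodec(const arrow::Table& slice) {
  return parquet::Compression::ZSTD;
}

arrow::Status WriteOneRowGroupPerFile(
    const std::shared_ptr<arrow::Table>& table, int64_t rows_per_file) {
  for (int64_t off = 0, i = 0; off < table->num_rows();
       off += rows_per_file, ++i) {
    std::shared_ptr<arrow::Table> slice = table->Slice(off, rows_per_file);
    std::shared_ptr<parquet::WriterProperties> props =
        parquet::WriterProperties::Builder()
            .compression(ChooseCodec(*slice))
            ->build();
    ARROW_ASSIGN_OR_RAISE(auto sink,
                          arrow::io::FileOutputStream::Open(
                              "part-" + std::to_string(i) + ".parquet"));
    // chunk_size == rows_per_file keeps each file at a single row group.
    ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
        *slice, arrow::default_memory_pool(), sink, rows_per_file, props));
  }
  return arrow::Status::OK();
}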
Best,
Gang

On Sun, Mar 24, 2024 at 2:04 AM Aldrin <octalene....@pm.me.invalid> wrote:

> Hi Andrei,
>
> I tried to find more details on block compression in Parquet (or
> compression per data page) and I couldn't find anything to satisfy my
> curiosity about how it can be used and how it performs.
>
> I hate being the person who just says "test it first," so I also want
> to recommend figuring out how you imagine the interface would be
> designed. Some formats like ORC seem to have two compression modes
> (optimize for speed or for space), while Parquet exposes more of the
> tuning knobs (according to [1]). And to Gang's point, there's a
> question of what can be exposed at the various abstraction levels
> (perhaps end users would never be interested in this, so it would be
> exposed only through an advanced or internal interface).
>
> Anyway, good luck scoping it out, and feel free to iterate with the
> mailing list as you try things out rather than only when finished;
> maybe someone can chime in with more information and thoughts in the
> meantime.
>
> [1]: https://arxiv.org/pdf/2304.05028.pdf
>
> On Sat, Mar 23, 2024 at 05:23, Andrei Lazăr <lazarandrei...@gmail.com>
> wrote:
>
> Hi Aldrin, thanks for taking the time to reply to my email!
>
> In my understanding, compression in Parquet files happens at the data
> page level for every column, meaning that even within a single row
> group there can be multiple units of data compression, and there are
> certainly different units of data compression across an entire Parquet
> file. Therefore, what I am hoping is that more granular
> compression-algorithm choices could lead to better overall compression,
> as the data in the same column can differ quite a lot across row
> groups.
>
> At this very moment, specifying different compression algorithms per
> column is already supported, and in my use case it is extremely
> helpful: I have some columns (mostly containing floats) for which a
> compression algorithm like Snappy (or even no compression at all)
> speeds up my queries significantly compared with keeping the data
> compressed with something like ZSTD or GZIP.
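> For reference, that existing per-column knob in the C++ writer
> properties looks roughly like this (untested, and the column names are
> made up):
>
> #include <parquet/properties.h>
>
> std::shared_ptr<parquet::WriterProperties> props =
>     parquet::WriterProperties::Builder()
>         .compression(parquet::Compression::ZSTD)  // file-wide default
>         ->compression("float_col", parquet::Compression::SNAPPY)
>         ->compression("id_col", parquet::Compression::UNCOMPRESSED)
>         ->build();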
> That being said, your suggestion of writing a benchmark and sharing the
> results here to support this approach is a great idea; I will try doing
> that!
>
> Once again, thank you for your time!
>
> Kind regards,
> Andrei
>
> On Fri, 22 Mar 2024 at 22:12, Aldrin <octalene....@pm.me.invalid> wrote:
>
> > Hello!
> >
> > I don't do much with compression, so I could be wrong, but I assume a
> > compression algorithm spans the whole column; areas of large variance
> > generally benefit less from the compression, while the encoding still
> > provides benefits across separate areas (e.g. separate row groups).
> >
> > My impression is that compression will not be any better if it is
> > restricted to only a subset of the data, and if it is scoped to only
> > a subset of the data, there are extra overheads beyond what you would
> > normally have (the same raw value would have the same encoded value
> > stored per row group). I suppose things like run-length encoding
> > won't be any less efficient, but they also wouldn't be any more
> > efficient (with the caveat of a raw value repeating across row
> > groups).
> >
> > A different compression for different columns isn't unreasonable, so
> > I think I could be easily convinced that it has benefits (though it
> > would require per-column logic that could slow other things down).
> >
> > These are just my thoughts, though. Can you share the design and
> > results of your benchmark? Have you prototyped (or could you
> > prototype) anything to test it out?
> >
> > On Fri, Mar 22, 2024 at 14:36, Andrei Lazăr <lazarandrei...@gmail.com>
> > wrote:
> >
> > Hi Gang,
> >
> > Thanks a lot for getting back to me!
> >
> > The use case I have is relatively simple: I was playing around with
> > some data and wanted to benchmark different compression algorithms in
> > an effort to speed up data retrieval in a simple Parquet-based
> > database that I am experimenting with. While doing so, I noticed a
> > very large variance in the performance of the same compression
> > algorithm over different row groups in my Parquet files. Therefore, I
> > was thinking that the best compression configuration for my data
> > would be to use a different algorithm for every column, for every row
> > group in my files. In a real-world situation, I can see this being
> > used by a database, either when new entries are inserted into it, or
> > even as a background 'optimizer' job that runs over existing data.
> >
> > How do you feel about this?
> >
> > Thank you,
> > Andrei
> >
> > On Thu, 21 Mar 2024 at 02:11, Gang Wu <ust...@gmail.com> wrote:
> >
> > > Hi Andrei,
> > >
> > > What is your use case? IMHO, exposing this kind of configuration
> > > would force users to know how the writer splits row groups, which
> > > does not look simple to me.
> > >
> > > Best,
> > > Gang
> > >
> > > On Thu, Mar 21, 2024 at 2:25 AM Andrei Lazăr
> > > <lazarandrei...@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I would like to propose adding support for writing a Parquet file
> > > > with a different compression algorithm for every row group.
> > > >
> > > > In my understanding, the Parquet format allows this; however, it
> > > > seems to me that there is no way to achieve this from the C++
> > > > implementation.
> > > >
> > > > Does anyone have any thoughts on this?
> > > >
> > > > Thank you,
> > > > Andrei
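P.S. To quantify the per-row-group variance described above without
decoding any data, you can compare compressed vs. uncompressed sizes
straight from the file metadata. An untested sketch with parquet-cpp:

#include <iostream>
#include <memory>
#include <string>
#include <parquet/file_reader.h>

void PrintCompressionRatios(const std::string& path) {
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile(path);
  std::shared_ptr<parquet::FileMetaData> md = reader->metadata();
  for (int rg = 0; rg < md->num_row_groups(); ++rg) {
    for (int col = 0; col < md->num_columns(); ++col) {
      auto chunk = md->RowGroup(rg)->ColumnChunk(col);
      // Ratio of raw bytes to on-disk bytes for this column chunk; a
      // large spread across row groups is the variance in question.
      double ratio =
          static_cast<double>(chunk->total_uncompressed_size()) /
          static_cast<double>(chunk->total_compressed_size());
      std::cout << "row group " << rg << ", column " << col
                << ": ratio = " << ratio << "\n";
    }
  }
}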