Hello! I don't do much with compression, so I could be wrong, but my assumption is that a compression algorithm spans the whole column: areas of large variance generally benefit less from compression, while the encoding still provides benefits across separate areas (e.g. separate row groups). My impression is that compression will not be any better if it is restricted to only a subset of the data, and if it is scoped to a subset there are extra overheads beyond what you would normally have (the same raw value would have the same encoded value stored per row group). I suppose something like run-length encoding would not be any less efficient, but it also would not be any more efficient (with the caveat of a raw value repeating across row groups).

A different compression for different columns is not unreasonable, so I think I could be easily convinced that has benefits (though it would require per-column logic that could slow other things down).

These are just my thoughts, though. Can you share the design and results of your benchmark? Have you prototyped anything (or could you) to test it out?

Sent from Proton Mail for iOS

On Fri, Mar 22, 2024 at 14:36, Andrei Lazăr <lazarandrei...@gmail.com> wrote:

Hi Gang,
Thanks a lot for getting back to me!

So my use case is relatively simple: I was playing around with some data and wanted to benchmark different compression algorithms in an effort to speed up data retrieval in a simple Parquet-based database that I am playing around with. While doing so, I noticed a very large variance in the performance of the same compression algorithm over different row groups in my Parquet files. I was therefore thinking that the best compression configuration for my data would be to use a different algorithm for every column, for every row group in my files.

In a real-world situation, I can see this being used by a database, either when new entries are inserted into it, or even as a background 'optimizer' job that runs over existing data.

How do you feel about this?

Thank you,
Andrei

On Thu, 21 Mar 2024 at 02:11, Gang Wu <ust...@gmail.com> wrote:
> Hi Andrei,
>
> What is your use case? IMHO, exposing this kind of configuration
> will force users to know how the writer will split row groups, which
> does not look simple to me.
>
> Best,
> Gang
>
> On Thu, Mar 21, 2024 at 2:25 AM Andrei Lazăr <lazarandrei...@gmail.com> wrote:
>
>> Hi all,
>>
>> I would like to propose adding support for writing a Parquet file with
>> different compression algorithms for every row group.
>>
>> In my understanding, the Parquet format allows this; however, it seems to me
>> that there is no way to achieve this from the C++ implementation.
>>
>> Does anyone have any thoughts on this?
>>
>> Thank you,
>> Andrei
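For anyone following along: the effect Andrei describes (the same codec performing very differently on different slices of the data) is easy to reproduce with a tiny micro-benchmark. Below is a minimal, hypothetical sketch using only Python's stdlib codecs (zlib, bz2, lzma) as stand-ins for Parquet's codecs; the data, names, and thresholds are illustrative assumptions of mine, not anything from the thread.

```python
# Illustrative micro-benchmark: compression ratio depends heavily on the
# value distribution of each "row group". Stdlib codecs stand in for
# Parquet's SNAPPY/GZIP/ZSTD; all names and data here are hypothetical.
import bz2
import lzma
import random
import zlib

def ratios(raw: bytes) -> dict:
    """Compressed-size / raw-size for each codec (lower is better)."""
    return {
        "zlib": len(zlib.compress(raw)) / len(raw),
        "bz2": len(bz2.compress(raw)) / len(raw),
        "lzma": len(lzma.compress(raw)) / len(raw),
    }

random.seed(0)
# Two synthetic "row groups" with very different value distributions:
# one with ~1 bit of entropy per byte, one of incompressible random bytes.
low_variance = bytes(random.choice(b"ab") for _ in range(1 << 16))
high_variance = random.randbytes(1 << 16)

for name, group in [("low variance", low_variance),
                    ("high variance", high_variance)]:
    print(name, ratios(group))
```

Under this setup the low-variance group compresses to a small fraction of its raw size while the random group barely shrinks at all (it can even grow slightly from codec overhead), which is the kind of per-row-group variance that would motivate choosing a codec per row group rather than one codec for the whole file.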