Re: [DISCUSS] Revisiting LZ4 Compression for Arrow Buffers

2021-03-22 Thread Benjamin Wilhelm
> > Could you share the benchmark code/how the benchmark was run (does this > account for JIT warm-up time)? I just used the benchmark by the aircompressor project. They run the benchmark for a lot of algorithms on a lot of datasets so I commented out some to get faster results. You can find my ve

Re: [DISCUSS] Revisiting LZ4 Compression for Arrow Buffers

2021-03-22 Thread Micah Kornfield
> > I executed some of the benchmarks in the airlift/aircompressor project. I > found that aircompressor achieves on average only about 72% > throughput compared to the current version of the lz4-java JNI bindings > when compressing. When decompressing the gap is even bigger with around 56% > thro

Re: [DISCUSS] Revisiting LZ4 Compression for Arrow Buffers

2021-03-22 Thread Antoine Pitrou
On 22/03/2021 at 15:29, Benjamin Wilhelm wrote: Also, I would like to resume the discussion about the Frame format vs the Block format. There were 3 points for the Frame format by Antoine: - it allows streaming compression and decompression (meaning you can avoid loading a huge compressed b

Re: [DISCUSS] Revisiting LZ4 Compression for Arrow Buffers

2021-03-22 Thread Benjamin Wilhelm
I executed some of the benchmarks in the airlift/aircompressor project. I found that aircompressor achieves on average only about 72% throughput compared to the current version of the lz4-java JNI bindings when compressing. When decompressing the gap is even bigger with around 56% throughput. See

Re: [DISCUSS] Revisiting LZ4 Compression for Arrow Buffers

2021-03-18 Thread Micah Kornfield
> > I would start looking into the JNI approach. Contributing back > to lz4-java or adding this to Arrow. A first step might be to compare the performance of the JNI approach vs Airlift. The airlift library only uses Java and claims to be potentially faster. A JNI approach has the downside of r

Re: [DISCUSS] Revisiting LZ4 Compression for Arrow Buffers

2021-03-18 Thread Benjamin Wilhelm
> > > 1) contribute the missing support ourselves > I actually think we might need to proceed with this option. I agree. I am willing to help with this and explore and try different approaches. I would start looking into the JNI approach. Contributing back to lz4-java or adding this to Arrow. Be

Re: [DISCUSS] Revisiting LZ4 Compression for Arrow Buffers

2021-03-17 Thread Micah Kornfield
> > 1) contribute the missing support ourselves I actually think we might need to proceed with this option. Even more unfortunately, I think the best place at the moment for the contribution to live is within Arrow. Fortunately, I think a port of the existing Apache Commons library for off-hea

Re: [DISCUSS] Revisiting LZ4 Compression for Arrow Buffers

2021-03-17 Thread Antoine Pitrou
If you look at https://github.com/lz4/lz4-java/graphs/contributors?from=2019-12-28&to=2021-03-17&type=c, lz4-java seems to be receiving very little maintenance. So I think there are two possible avenues: 1) contribute the missing support ourselves 2) use another LZ4 library for Java Solut

Re: [DISCUSS] Revisiting LZ4 Compression for Arrow Buffers

2021-03-11 Thread Micah Kornfield
FYI, I opened up https://github.com/lz4/lz4-java/issues/176 to discuss support for dependent frames. On Thu, Mar 11, 2021 at 11:59 AM David Li wrote: > At least for Flight, I don't think we'd use that. Right now the way > compression is supported is the same way as with Feather, i.e. the body >

Re: [DISCUSS] Revisiting LZ4 Compression for Arrow Buffers

2021-03-11 Thread David Li
At least for Flight, I don't think we'd use that. Right now the way compression is supported is the same way as with Feather, i.e. the body buffers in each individual record batch sent on the wire are compressed, but not the stream as a whole. (And so far we haven't found a compelling benefit fo

Re: [DISCUSS] Revisiting LZ4 Compression for Arrow Buffers

2021-03-11 Thread Antoine Pitrou
On 11/03/2021 at 19:54, Micah Kornfield wrote: Indeed, I don't think it was discussed publicly. The LZ4 frame format has several things going for it: - it allows streaming compression and decompression (meaning you can avoid loading a huge compressed buffer at once) Is this something we m

Re: [DISCUSS] Revisiting LZ4 Compression for Arrow Buffers

2021-03-11 Thread Micah Kornfield
> > Indeed, I don't think it was discussed publicly. The LZ4 frame format > has several things going for it: > - it allows streaming compression and decompression (meaning you can > avoid loading a huge compressed buffer at once) Is this something we make use of or intend to make use of? > - it

Re: [DISCUSS] Revisiting LZ4 Compression for Arrow Buffers

2021-03-11 Thread Joris Peeters
"Is https://github.com/lz4/lz4-java the fast Java lz4 library in question? The incompleteness of this implementation is a known problem for other user communities, not only Arrow. It would be a great public service to improve it so that it fully implements the lz4 frame specification." Very much +

Re: [DISCUSS] Revisiting LZ4 Compression for Arrow Buffers

2021-03-11 Thread Steve Kim
I prefer the lz4 frame format for the reasons that Antoine stated. To be friendly to users, the Arrow IPC documentation could mention that lz4 compression may break Java interoperability. If block dependency is the only obstacle to Java interoperability, the Arrow IPC implementation could disable

Re: [DISCUSS] Revisiting LZ4 Compression for Arrow Buffers

2021-03-11 Thread Antoine Pitrou
On 11/03/2021 at 17:58, Micah Kornfield wrote: We've found in the process of implementing support for LZ4 decompression that the fast Java decoder library does not support all the features of the C++ library (dependent blocks can't be read, and by default that is what the C++ code emits).

Re: [DISCUSS] Revisiting LZ4 Compression for Arrow Buffers

2021-03-11 Thread Antoine Pitrou
What about the JNI bindings for lz4-c? On 11/03/2021 at 18:20, Micah Kornfield wrote: I looked a little closer and it looks like it only supports Block format (in the code I couldn't find any references to Frame). On Thu, Mar 11, 2021 at 9:16 AM Antoine Pitrou wrote: Have you tr

Re: [DISCUSS] Revisiting LZ4 Compression for Arrow Buffers

2021-03-11 Thread Micah Kornfield
I looked a little closer and it looks like it only supports Block format (in the code I couldn't find any references to Frame). On Thu, Mar 11, 2021 at 9:16 AM Antoine Pitrou wrote: > > Have you tried another Java LZ4 library (I think you mentioned Airlift > on a PR)? > > > On 11/03/2021

Fwd: [DISCUSS] Revisiting LZ4 Compression for Arrow Buffers

2021-03-11 Thread Antoine Pitrou
Have you tried another Java LZ4 library (I think you mentioned Airlift on a PR)? On 11/03/2021 at 17:58, Micah Kornfield wrote: We've found in the process of implementing support for LZ4 decompression that the fast Java decoder library does not support all the features of the C++ library

[DISCUSS] Revisiting LZ4 Compression for Arrow Buffers

2021-03-11 Thread Micah Kornfield
We've found in the process of implementing support for LZ4 decompression that the fast Java decoder library does not support all the features of the C++ library (dependent blocks can't be read, and by default that is what the C++ code emits). The only library we found (Apache Commons) that seem
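
For readers following the dependent-block issue raised at the start of this thread, here is a minimal sketch (not taken from any message in the thread) of how one might check whether a given LZ4 frame was written with independent blocks. It relies only on the LZ4 frame layout: a 4-byte little-endian magic number 0x184D2204, followed by a FLG byte whose bit 5 is the block-independence flag. The function name `blocks_are_independent` is hypothetical, and the sketch assumes it is handed the raw frame bytes with no prefix in front of them.

```python
import struct

# LZ4 frame magic number; stored little-endian on the wire as 04 22 4D 18.
LZ4_FRAME_MAGIC = 0x184D2204


def blocks_are_independent(frame: bytes) -> bool:
    """Return True if the frame header sets the block-independence flag.

    Hypothetical helper for illustration: it inspects only the magic
    number and the FLG byte and does not decompress anything.
    """
    if len(frame) < 5:
        raise ValueError("too short to contain an LZ4 frame header")
    (magic,) = struct.unpack_from("<I", frame, 0)
    if magic != LZ4_FRAME_MAGIC:
        raise ValueError("not an LZ4 frame (bad magic number)")
    flg = frame[4]
    # Bits 7-6 of FLG carry the format version, which must be 01.
    if flg >> 6 != 1:
        raise ValueError("unsupported LZ4 frame version")
    # Bit 5 of FLG: 1 means blocks are independent, 0 means linked.
    return bool(flg & 0x20)
```

A decoder that cannot read linked (dependent) blocks could use a check like this to reject such frames up front with a clear error instead of failing partway through decompression.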