> I would start looking into the JNI approach. Contributing back to
> lz4-java or adding this to Arrow.

A first step might be to compare the performance of the JNI approach
against Airlift. The Airlift library is pure Java and claims to be
potentially faster. A JNI approach has the downside of requiring
packaging for different systems. I'm not sure we do this today for the
other JNI-based libraries (I think we require users to build the native
component themselves), so for something as core to the specification as
this, ease of use is also a consideration. It looks like lz4-java might
just check the native shared libraries into the repo, which is not an
approach I'd like to take within Arrow.

On Thu, Mar 18, 2021 at 2:59 AM Benjamin Wilhelm <benjamin.wilh...@knime.com> wrote:

> > > 1) contribute the missing support ourselves
> >
> > I actually think we might need to proceed with this option.
>
> I agree. I am willing to help with this and explore and try different
> approaches. I would start looking into the JNI approach. Contributing
> back to lz4-java or adding this to Arrow.
>
> Best,
> Benjamin
>
> On Wed, Mar 17, 2021 at 5:51 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> > > 1) contribute the missing support ourselves
> >
> > I actually think we might need to proceed with this option. Even more
> > unfortunate is that I think the best place at the moment for the
> > contribution to live is within Arrow. Fortunately, I think a port of
> > the existing Apache Commons library for off-heap use should be
> > relatively easy. We can reach out to Apache Commons to see if they
> > would be interested in this contribution, but I would guess not,
> > since I don't think there is a lot of off-heap logic in the library
> > in general (but my knowledge is stale here).
> >
> > > 2) use another LZ4 library for Java
> >
> > We are using the only library I could find that seems to have full
> > support for LZ4 Frame data. Unfortunately it is purely on-heap, which
> > I believe is the source of the performance problems.
> >
> > On Wed, Mar 17, 2021 at 7:15 AM Antoine Pitrou <anto...@python.org> wrote:
> >
> > > If you look at
> > >
> > > https://github.com/lz4/lz4-java/graphs/contributors?from=2019-12-28&to=2021-03-17&type=c ,
> > >
> > > lz4-java seems to be receiving very little maintenance. So I think
> > > there are two possible avenues:
> > >
> > > 1) contribute the missing support ourselves
> > > 2) use another LZ4 library for Java
> > >
> > > Solution #2 seems more reasonable to me.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > > On 11/03/2021 at 21:05, Micah Kornfield wrote:
> > > > FYI, I opened up https://github.com/lz4/lz4-java/issues/176 to
> > > > discuss support for dependent frames.
> > > >
> > > > On Thu, Mar 11, 2021 at 11:59 AM David Li <lidav...@apache.org> wrote:
> > > >
> > > >> At least for Flight, I don't think we'd use that. Right now the
> > > >> way compression is supported is the same way as with Feather,
> > > >> i.e. the body buffers in each individual record batch sent on
> > > >> the wire are compressed, but not the stream as a whole. (And so
> > > >> far we haven't found a compelling benefit for compression in
> > > >> Flight in general.)
> > > >>
> > > >> Best,
> > > >> David
> > > >>
> > > >> On Thu, Mar 11, 2021, at 14:34, Antoine Pitrou wrote:
> > > >>>
> > > >>> On 11/03/2021 at 19:54, Micah Kornfield wrote:
> > > >>>>>
> > > >>>>> Indeed, I don't think it was discussed publicly. The LZ4
> > > >>>>> frame format has several things going for it:
> > > >>>>> - it allows streaming compression and decompression (meaning
> > > >>>>>   you can avoid loading a huge compressed buffer at once)
> > > >>>>
> > > >>>> Is this something we make use of or intend to make use of?
> > > >>>
> > > >>> Good question. Currently we don't. Perhaps David Li wants to
> > > >>> answer this, since he's been working a lot on Flight.
> > > >>>
> > > >>>>> - it embeds the decompressed size, allowing exact allocation
> > > >>>>>   of the decompressed buffer
> > > >>>>
> > > >>>> IIUC, we already do this in the IPC specification (the first 8
> > > >>>> bytes of the compressed buffer are used for this).
> > > >>>
> > > >>> Ah, you're right. It doesn't matter then.
> > > >>>
> > > >>>> - it has an optional checksum
> > > >>>>
> > > >>>> This seems like a good thing, so probably worth keeping
> > > >>>> (although it would be the only place where we do checksums
> > > >>>> today).
> > > >>>
> > > >>> (or of course we could add an optional higher-level checksum in
> > > >>> the IPC format)
> > > >>>
> > > >>> Regards
> > > >>>
> > > >>> Antoine.
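
On the point in the quoted thread that the IPC specification already stores the decompressed size in the first 8 bytes of each compressed buffer: a minimal Java sketch of reading that prefix and sizing an off-heap output buffer from it might look like the following. It assumes the prefix is a little-endian signed 64-bit integer, which is my reading of the Arrow IPC format. The class and method names (`CompressedBufferPrefix`, `uncompressedLength`, `payload`, `allocateOutput`) are hypothetical and not part of any Arrow API, and the actual LZ4 decompression step is deliberately left out.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, for illustration only; not an Arrow class.
final class CompressedBufferPrefix {
    /** Size in bytes of the uncompressed-length prefix. */
    static final int PREFIX_BYTES = 8;

    /**
     * Reads the uncompressed length from the first 8 bytes of a compressed
     * body buffer. Assumption: little-endian signed 64-bit integer.
     * The caller's position and byte order are left untouched.
     */
    static long uncompressedLength(ByteBuffer buffer) {
        if (buffer.remaining() < PREFIX_BYTES) {
            throw new IllegalArgumentException("buffer shorter than 8-byte prefix");
        }
        // duplicate() shares content but has independent position and order.
        return buffer.duplicate()
                .order(ByteOrder.LITTLE_ENDIAN)
                .getLong(buffer.position());
    }

    /** Returns a view of the compressed bytes that follow the prefix. */
    static ByteBuffer payload(ByteBuffer buffer) {
        ByteBuffer view = buffer.duplicate();
        view.position(buffer.position() + PREFIX_BYTES);
        return view.slice();
    }

    /** Allocates an off-heap buffer sized exactly for the decompressed data. */
    static ByteBuffer allocateOutput(ByteBuffer buffer) {
        long length = uncompressedLength(buffer);
        // A negative value is treated as invalid in this sketch.
        if (length < 0 || length > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("unsupported length: " + length);
        }
        return ByteBuffer.allocateDirect((int) length);
    }

    public static void main(String[] args) {
        // Build a fake buffer: 8-byte prefix saying 1024, then 4 payload bytes.
        ByteBuffer buf = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN);
        buf.putLong(1024L).putInt(0xCAFEBABE);
        buf.flip();
        System.out.println("uncompressed length: " + uncompressedLength(buf));
        System.out.println("payload bytes: " + payload(buf).remaining());
        System.out.println("output is direct: " + allocateOutput(buf).isDirect());
    }
}
```

A decompressor, whether JNI-based or pure Java, would then read from `payload(buf)` and write into the buffer returned by `allocateOutput(buf)`, which is why the size embedded in the LZ4 frame itself adds nothing for exact allocation.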