Re: [DISCUSS] Revisiting LZ4 Compression for Arrow Buffers

Benjamin Wilhelm Mon, 22 Mar 2021 07:29:55 -0700

I ran some of the benchmarks in the airlift/aircompressor project. I
found that aircompressor achieves on average only about 72% of the
throughput of the current version of the lz4-java JNI bindings when
compressing. When decompressing, the gap is even bigger, at around 56% of
the throughput. See the following Google Sheet for the benchmark results.
https://docs.google.com/spreadsheets/d/1mT1qmpvV25YcRmPz4IYxXyPSzUsMsdovN7gc_vmac5U/edit?usp=sharing


Additionally, aircompressor does not implement the LZ4 Frame format. We
would need to ask for this functionality or implement it ourselves and
contribute it.

Also, I would like to resume the discussion about the Frame format vs the
Block format. Antoine made three points in favor of the Frame format:

> - it allows streaming compression and decompression (meaning you can
> avoid loading a huge compressed buffer at once)
>
It seems like this is not used anywhere. Doesn't it make more sense to use
more record batches if one buffer in a record batch gets too big?


> - it embeds the decompressed size, allowing exact allocation of the
> decompressed buffer
>
Micah pointed out that this is already part of the IPC specification.
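
To make this point concrete, here is a minimal sketch of the length prefix
on a compressed IPC body buffer. It assumes, per my reading of the IPC
format documentation, that the first 8 bytes hold a 64-bit little-endian
signed integer and that -1 marks a buffer stored uncompressed. The helper
names are hypothetical, and zlib is only a stand-in codec so the sketch
runs without an LZ4 library.

```python
import struct
import zlib  # stand-in codec only; Arrow IPC uses LZ4 or ZSTD here

# Assumed layout: 64-bit signed little-endian uncompressed length, then payload.
PREFIX = struct.Struct("<q")


def encode_buffer(raw: bytes) -> bytes:
    """Prefix the compressed payload with the uncompressed length."""
    return PREFIX.pack(len(raw)) + zlib.compress(raw)


def uncompressed_length(buf: bytes) -> int:
    """Read the length from the first 8 bytes without touching the payload."""
    (length,) = PREFIX.unpack_from(buf, 0)
    return length


def decode_buffer(buf: bytes) -> bytes:
    """Decode a prefixed buffer; -1 is assumed to mean 'stored uncompressed'."""
    length = uncompressed_length(buf)
    payload = buf[PREFIX.size:]
    if length == -1:
        return payload
    # A block decoder could allocate exactly `length` bytes at this point.
    out = zlib.decompress(payload)
    if len(out) != length:
        raise ValueError("length prefix does not match decompressed size")
    return out


raw = b"arrow" * 100
encoded = encode_buffer(raw)
print(uncompressed_length(encoded), decode_buffer(encoded) == raw)
```

Since the size is known before decompression starts, the Block format
loses nothing here compared to the Frame format.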


> - it has an optional checksum
>
Wouldn't it make sense to have a higher-level checksum (as already
mentioned by Antoine) if we want to have checksums at all? Having a
checksum only for one specific compression codec does not make a lot of
sense to me.
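
As an illustration of what "higher level" could mean, the sketch below
computes one checksum over the uncompressed body buffers, so it does not
depend on which codec (if any) was used. This is purely hypothetical:
nothing like it exists in the IPC format today, the function name is
made up, and CRC32 is chosen only because it ships with Python.

```python
import zlib


def checksum_buffers(buffers) -> int:
    """Fold CRC32 over a sequence of uncompressed buffers (hypothetical helper)."""
    crc = 0
    for buf in buffers:
        # zlib.crc32 accepts a running value, so buffers are chained in order.
        crc = zlib.crc32(buf, crc)
    return crc


# The result is the same whether the buffers are later compressed or not.
print(checksum_buffers([b"validity", b"offsets", b"data"]))
```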

Given these points, I think that the Frame format is no more useful for
compressing Arrow buffers than the Block format. It only adds unnecessary
metadata overhead.
Are there any other points in favor of the Frame format that I missed? If
not, what would it mean to switch to the Block format (or to add the Block
format as an option)?
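
For a rough idea of the metadata overhead in question, here is a
back-of-the-envelope sketch. The field sizes are my reading of the LZ4
Frame format description (4-byte magic number, 3-byte minimum frame
descriptor, optional 8-byte content size, 4-byte size field per block,
4-byte end mark, optional 4-byte checksums); treat them as assumptions
and check them against the spec. The optional dictionary ID is ignored.

```python
MAGIC = 4             # frame magic number
DESCRIPTOR_MIN = 3    # FLG byte + BD byte + header checksum byte
CONTENT_SIZE = 8      # optional field in the frame descriptor
BLOCK_HEADER = 4      # size field in front of every block
BLOCK_CHECKSUM = 4    # optional, per block
END_MARK = 4          # terminates the block sequence
CONTENT_CHECKSUM = 4  # optional, once per frame


def frame_overhead(num_blocks, content_size=False,
                   block_checksum=False, content_checksum=False):
    """Bytes a frame adds on top of the compressed block data."""
    total = MAGIC + DESCRIPTOR_MIN + END_MARK
    total += num_blocks * BLOCK_HEADER
    if content_size:
        total += CONTENT_SIZE
    if block_checksum:
        total += num_blocks * BLOCK_CHECKSUM
    if content_checksum:
        total += CONTENT_CHECKSUM
    return total


print(frame_overhead(1))
print(frame_overhead(1, content_size=True, content_checksum=True))
```

Under these assumptions a single-block frame adds 15 bytes at minimum and
27 bytes with content size and content checksum enabled, per buffer, which
matters mostly for record batches with many small buffers.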

Best,
Benjamin


On Thu, Mar 18, 2021 at 4:55 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> >
> > I would start looking into the JNI approach. Contributing back
> > to lz4-java or adding this to Arrow.
>
> A first step might be to compare the performance of the JNI approach vs
> Airlift.  The airlift library only uses Java and claims to be potentially
> faster.  A JNI approach has the downside of requiring packaging for
> different systems.  I'm not sure we do this today with the other JNI based
> libraries (I think we require users build the native component themselves),
> so for something as core to the specification as this, ease of use is also
> a consideration.
>
> It looks like lz4-java might just check the native shared libraries into
> the repo, which is not an approach I'd like to take within Arrow.
>
>
> On Thu, Mar 18, 2021 at 2:59 AM Benjamin Wilhelm <
> benjamin.wilh...@knime.com>
> wrote:
>
> > >
> > > > 1) contribute the missing support ourselves
> > > I actually think we might need to proceed with this option.
> >
> >
> > I agree. I am willing to help with this and explore and try different
> > approaches. I would start looking into the JNI approach. Contributing back
> > to lz4-java or adding this to Arrow.
> >
> > Best,
> > Benjamin
> >
> >
> > On Wed, Mar 17, 2021 at 5:51 PM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> > > >
> > > > 1) contribute the missing support ourselves
> > >
> > >
> > > I actually think we might need to proceed with this option.  Even more
> > > unfortunately, I think the best place at the moment for the contribution
> > > to live is within Arrow.  Fortunately, I think a port of the existing
> > > Apache Commons library for off-heap use should be relatively easy.  We can
> > > reach out to Apache Commons to see if they would be interested in this
> > > contribution but I would guess not, since I don't think there is a lot of
> > > off-heap logic in the library in general (but my knowledge is stale here).
> > >
> > > 2) use another LZ4 library for Java
> > >
> > >
> > > We are using the only library I could find that seems to have full support
> > > for LZ4 Frame data.  Unfortunately it is purely on-heap which I believe is
> > > the source of the performance problems.
> > >
> > > On Wed, Mar 17, 2021 at 7:15 AM Antoine Pitrou <anto...@python.org> wrote:
> > >
> > > >
> > > > If you look at
> > > >
> > > > https://github.com/lz4/lz4-java/graphs/contributors?from=2019-12-28&to=2021-03-17&type=c
> > > > ,
> > > >
> > > > lz4-java seems to be receiving very little maintenance.  So I think
> > > > there are two possible avenues:
> > > >
> > > > 1) contribute the missing support ourselves
> > > > 2) use another LZ4 library for Java
> > > >
> > > > Solution #2 seems more reasonable to me.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > Le 11/03/2021 à 21:05, Micah Kornfield a écrit :
> > > > > FYI, I opened up https://github.com/lz4/lz4-java/issues/176 to discuss
> > > > > support for dependent frames.
> > > > >
> > > > > On Thu, Mar 11, 2021 at 11:59 AM David Li <lidav...@apache.org> wrote:
> > > > >
> > > > >> At least for Flight, I don't think we'd use that. Right now the way
> > > > >> compression is supported is the same way as with Feather, i.e. the body
> > > > >> buffers in each individual record batch sent on the wire are compressed,
> > > > >> but not the stream as a whole. (And so far we haven't found a compelling
> > > > >> benefit for compression in Flight in general.)
> > > > >>
> > > > >> Best,
> > > > >> David
> > > > >>
> > > > >> On Thu, Mar 11, 2021, at 14:34, Antoine Pitrou wrote:
> > > > >>>
> > > > >>> Le 11/03/2021 à 19:54, Micah Kornfield a écrit :
> > > > >>>>>
> > > > >>>>> Indeed, I don't think it was discussed publicly.  The LZ4 frame format
> > > > >>>>> has several things going for it:
> > > > >>>>> - it allows streaming compression and decompression (meaning you can
> > > > >>>>> avoid loading a huge compressed buffer at once)
> > > > >>>>
> > > > >>>> Is this something we make use of or intend to make use of?
> > > > >>>
> > > > >>> Good question.  Currently we don't.  Perhaps David Li wants to answer
> > > > >>> this, since he's been working a lot on Flight.
> > > > >>>
> > > > >>>>> - it embeds the decompressed size, allowing exact allocation of the
> > > > >>>>> decompressed buffer
> > > > >>>>
> > > > >>>> IIUC, We already do this in the IPC specification (the first 8 bytes of the
> > > > >>>> compressed buffer are used for this).
> > > > >>>
> > > > >>> Ah, you're right.  It doesn't matter then.
> > > > >>>
> > > > >>>> - it has an optional checksum
> > > > >>>>
> > > > >>>> This seems like a good thing, so probably worth keeping (although it would
> > > > >>>> be the only place where we do checksums today).
> > > > >>>
> > > > >>> (or of course we could add an optional higher-level checksum in the IPC
> > > > >>> format)
> > > > >>>
> > > > >>> Regards
> > > > >>>
> > > > >>> Antoine.
> > > > >>>
> > > > >>
> > > > >
> > > >
> > >
> >
>
