I'll run a grid of batch sizes (from 1024 to 64K or 128K) and let you know the read/write times and compression ratios. Shouldn't take too long
On Wed, Mar 25, 2020 at 10:37 PM Fan Liya <[email protected]> wrote: > > Thanks a lot for sharing the good results. > > As investigated by Wes, we have existing zstd library for Java (zstd-jni) > [1], and lz4 library for Java (lz4-java) [2]. > +1 for the 1024 batch size, as it represents an important scenario where the > batch fits into the L1 cache (IMO). > > Best, > Liya Fan > > [1] https://github.com/luben/zstd-jni > [2] https://github.com/lz4/lz4-java > > On Thu, Mar 26, 2020 at 2:38 AM Micah Kornfield <[email protected]> wrote: >> >> If it isn't hard could you run with batch sizes of 1024 or 2048 records? I >> think there was a question previously raised if there was benefit for >> smaller sizes buffers. >> >> Thanks, >> Micah >> >> >> On Wed, Mar 25, 2020 at 8:59 AM Wes McKinney <[email protected]> wrote: >> >> > On Tue, Mar 24, 2020 at 9:22 PM Micah Kornfield <[email protected]> >> > wrote: >> > > >> > > > >> > > > Compression ratios ranging from ~50% with LZ4 and ~75% with ZSTD on >> > > > the Taxi dataset to ~87% with LZ4 and ~90% with ZSTD on the Fannie Mae >> > > > dataset. So that's a huge space savings >> > > >> > > One more question on this. What was the average row-batch size used? I >> > > see in the proposal some buffers might not be compressed, did you this >> > > feature in the test? >> > >> > I used 64K row batch size. I haven't implemented the optional >> > non-compressed buffers (for cases where there is little space savings) >> > so everything is compressed. I can check different batch sizes if you >> > like >> > >> > >> > > On Mon, Mar 23, 2020 at 4:40 PM Wes McKinney <[email protected]> >> > wrote: >> > > >> > > > hi folks, >> > > > >> > > > Sorry it's taken me a little while to produce supporting benchmarks. >> > > > >> > > > * I implemented experimental trivial body buffer compression in >> > > > https://github.com/apache/arrow/pull/6638 >> > > > * I hooked up the Arrow IPC file format with compression as the new >> > > > Feather V2 format in >> > > > https://github.com/apache/arrow/pull/6694#issuecomment-602906476 >> > > > >> > > > I tested a couple of real-world datasets from a prior blog post >> > > > https://ursalabs.org/blog/2019-10-columnar-perf/ with ZSTD and LZ4 >> > > > codecs >> > > > >> > > > The complete results are here >> > > > https://github.com/apache/arrow/pull/6694#issuecomment-602906476 >> > > > >> > > > Summary: >> > > > >> > > > * Compression ratios ranging from ~50% with LZ4 and ~75% with ZSTD on >> > > > the Taxi dataset to ~87% with LZ4 and ~90% with ZSTD on the Fannie Mae >> > > > dataset. So that's a huge space savings >> > > > * Single-threaded decompression times exceeding 2-4GByte/s with LZ4 >> > > > and 1.2-3GByte/s with ZSTD >> > > > >> > > > I would have to do some more engineering to test throughput changes >> > > > with Flight, but given these results on slower networking (e.g. 1 >> > > > Gigabit) my guess is that the compression and decompression overhead >> > > > is little compared with the time savings due to high compression >> > > > ratios. If people would like to see these numbers to help make a >> > > > decision I can take a closer look >> > > > >> > > > As far as what Micah said about having a limited number of >> > > > compressors: I would be in favor of having just LZ4 and ZSTD. It seems >> > > > anecdotally that these outperform Snappy in most real world scenarios >> > > > and generally have > 1 GB/s decompression performance. Some Linux >> > > > distributions (Arch at least) have already started adopting ZSTD over >> > > > LZMA or GZIP [1] >> > > > >> > > > - Wes >> > > > >> > > > [1]: >> > > > >> > https://www.archlinux.org/news/now-using-zstandard-instead-of-xz-for-package-compression/ >> > > > >> > > > On Fri, Mar 6, 2020 at 8:42 AM Fan Liya <[email protected]> wrote: >> > > > > >> > > > > Hi Wes, >> > > > > >> > > > > Thanks a lot for the additional information. >> > > > > Looking forward to see the good results from your experiments. >> > > > > >> > > > > Best, >> > > > > Liya Fan >> > > > > >> > > > > On Thu, Mar 5, 2020 at 11:42 PM Wes McKinney <[email protected]> >> > > > wrote: >> > > > > >> > > > > > I see, thank you. >> > > > > > >> > > > > > For such a scenario, implementations would need to define a >> > > > > > "UserDefinedCodec" interface to enable codecs to be registered from >> > > > > > third party code, similar to what is done for extension types [1] >> > > > > > >> > > > > > I'll update this thread when I get my experimental C++ patch up to >> > see >> > > > > > what I'm thinking at least for the built-in codecs we have like >> > ZSTD. >> > > > > > >> > > > > > >> > > > > > >> > > > >> > https://github.com/apache/arrow/blob/apache-arrow-0.16.0/docs/source/format/Columnar.rst#extension-types >> > > > > > >> > > > > > On Thu, Mar 5, 2020 at 7:56 AM Fan Liya <[email protected]> >> > wrote: >> > > > > > > >> > > > > > > Hi Wes, >> > > > > > > >> > > > > > > Thanks a lot for your further clarification. >> > > > > > > >> > > > > > > Some of my prelimiary thoughts: >> > > > > > > >> > > > > > > 1. We assign a unique GUID to each pair of >> > compression/decompression >> > > > > > > strategies. The GUID is stored as part of the >> > > > Message.custom_metadata. >> > > > > > When >> > > > > > > receiving the GUID, the receiver knows which decompression >> > strategy >> > > > to >> > > > > > use. >> > > > > > > >> > > > > > > 2. We serialize the decompression strategy, and store it into the >> > > > > > > Message.custom_metadata. The receiver can decompress data after >> > > > > > > deserializing the strategy. >> > > > > > > >> > > > > > > Method 1 is generally used in static strategy scenarios while >> > method >> > > > 2 is >> > > > > > > generally used in dynamic strategy scenarios. >> > > > > > > >> > > > > > > Best, >> > > > > > > Liya Fan >> > > > > > > >> > > > > > > On Wed, Mar 4, 2020 at 11:39 PM Wes McKinney < >> > [email protected]> >> > > > > > wrote: >> > > > > > > >> > > > > > > > Okay, I guess my question is how the receiver is going to be >> > able >> > > > to >> > > > > > > > determine how to "rehydrate" the record batch buffers: >> > > > > > > > >> > > > > > > > What I've proposed amounts to the following: >> > > > > > > > >> > > > > > > > * UNCOMPRESSED: the current behavior >> > > > > > > > * ZSTD/LZ4/...: each buffer is compressed and written with an >> > int64 >> > > > > > > > length prefix >> > > > > > > > >> > > > > > > > (I'm close to putting up a PR implementing an experimental >> > version >> > > > of >> > > > > > > > this that uses Message.custom_metadata to transmit the codec, >> > so >> > > > this >> > > > > > > > will make the implementation details more concrete) >> > > > > > > > >> > > > > > > > So in the USER_DEFINED case, how will the library know how to >> > > > obtain >> > > > > > > > the uncompressed buffer? Is some additional metadata structure >> > > > > > > > required to provide instructions? >> > > > > > > > >> > > > > > > > On Wed, Mar 4, 2020 at 8:05 AM Fan Liya <[email protected]> >> > > > wrote: >> > > > > > > > > >> > > > > > > > > Hi Wes, >> > > > > > > > > >> > > > > > > > > I am thinking of adding an option named "USER_DEFINED" (or >> > > > something >> > > > > > > > > similar) to enum CompressionType in your proposal. >> > > > > > > > > IMO, this option should be used primarily in Flight. >> > > > > > > > > >> > > > > > > > > Best, >> > > > > > > > > Liya Fan >> > > > > > > > > >> > > > > > > > > On Wed, Mar 4, 2020 at 11:12 AM Wes McKinney < >> > > > [email protected]> >> > > > > > > > wrote: >> > > > > > > > > >> > > > > > > > > > On Tue, Mar 3, 2020, 8:11 PM Fan Liya < >> > [email protected]> >> > > > > > wrote: >> > > > > > > > > > >> > > > > > > > > > > Sure. I agree with you that we should not overdo this. >> > > > > > > > > > > I am wondering if we should provide an option to allow >> > users >> > > > to >> > > > > > > > plugin >> > > > > > > > > > > their customized compression strategies. >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > Can you provide a patch showing changes to Message.fbs (or >> > > > > > Schema.fbs) >> > > > > > > > that >> > > > > > > > > > make this idea more concrete? >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > > Best, >> > > > > > > > > > > Liya Fan >> > > > > > > > > > > >> > > > > > > > > > > On Tue, Mar 3, 2020 at 9:47 PM Wes McKinney < >> > > > [email protected] >> > > > > > > >> > > > > > > > wrote: >> > > > > > > > > > > >> > > > > > > > > > > > On Tue, Mar 3, 2020, 7:36 AM Fan Liya < >> > > > [email protected]> >> > > > > > > > wrote: >> > > > > > > > > > > > >> > > > > > > > > > > > > I am so glad to see this discussion, and I am >> > willing to >> > > > > > provide >> > > > > > > > help >> > > > > > > > > > > > from >> > > > > > > > > > > > > the Java side. >> > > > > > > > > > > > > >> > > > > > > > > > > > > In the proposal, I see the support for basic >> > compression >> > > > > > > > strategies >> > > > > > > > > > > > > (e.g.gzip, snappy). >> > > > > > > > > > > > > IMO, applying a single basic strategy is not likely >> > to >> > > > > > achieve >> > > > > > > > > > > > performance >> > > > > > > > > > > > > improvement for most scenarios. >> > > > > > > > > > > > > The optimal compression strategy is often obtained by >> > > > > > composing >> > > > > > > > basic >> > > > > > > > > > > > > strategies and tuning parameters. >> > > > > > > > > > > > > >> > > > > > > > > > > > > I hope we can support such highly customized >> > compression >> > > > > > > > strategies. >> > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > > I think very much beyond trivial one-shot buffer level >> > > > > > compression >> > > > > > > > is >> > > > > > > > > > > > probably out of the question for addition to the >> > current >> > > > > > > > "RecordBatch" >> > > > > > > > > > > > Flatbuffers type, because the additional metadata >> > would add >> > > > > > > > undesirable >> > > > > > > > > > > > bloat (which I would be against). If people have other >> > > > ideas it >> > > > > > > > would >> > > > > > > > > > be >> > > > > > > > > > > > great to see exactly what you are thinking as far as >> > > > changes >> > > > > > to the >> > > > > > > > > > > > protocol files. >> > > > > > > > > > > > >> > > > > > > > > > > > I'll try to assemble some examples to show the >> > before/after >> > > > > > > > results of >> > > > > > > > > > > > applying the simple strategy. >> > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > Best, >> > > > > > > > > > > > > Liya Fan >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > On Tue, Mar 3, 2020 at 8:15 PM Antoine Pitrou < >> > > > > > > > [email protected]> >> > > > > > > > > > > > wrote: >> > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > If we want to use a HTTP header, it would be more >> > of a >> > > > > > > > > > > Accept-Encoding >> > > > > > > > > > > > > > header, no? >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > In any case, we would have to put non-standard >> > values >> > > > there >> > > > > > > > (e.g. >> > > > > > > > > > > lz4), >> > > > > > > > > > > > > > so I'm not sure how desirable it is to repurpose >> > HTTP >> > > > > > headers >> > > > > > > > for >> > > > > > > > > > > that, >> > > > > > > > > > > > > > rather than add some dedicated field to the Flight >> > > > > > messages. >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > Regards >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > Antoine. >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > Le 03/03/2020 à 12:52, David Li a écrit : >> > > > > > > > > > > > > > > gRPC supports headers so for Flight, we could >> > send >> > > > > > > > essentially an >> > > > > > > > > > > > > Accept >> > > > > > > > > > > > > > > header and perhaps a Content-Type header. >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > David >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > On Mon, Mar 2, 2020, 23:15 Micah Kornfield < >> > > > > > > > > > [email protected]> >> > > > > > > > > > > > > > wrote: >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> Hi Wes, >> > > > > > > > > > > > > > >> A few thoughts on this. In general, I think it >> > is a >> > > > > > good >> > > > > > > > idea. >> > > > > > > > > > > But >> > > > > > > > > > > > > > before >> > > > > > > > > > > > > > >> proceeding, I think the following points are >> > worth >> > > > > > > > discussing: >> > > > > > > > > > > > > > >> 1. Does this actually improve >> > throughput/latency >> > > > for >> > > > > > > > Flight? (I >> > > > > > > > > > > > think >> > > > > > > > > > > > > > you >> > > > > > > > > > > > > > >> mentioned you would follow-up with benchmarks). >> > > > > > > > > > > > > > >> 2. I think we should limit the number of >> > supported >> > > > > > > > compression >> > > > > > > > > > > > > schemes >> > > > > > > > > > > > > > to >> > > > > > > > > > > > > > >> only 1 or 2. I think the criteria for selection >> > > > speed >> > > > > > and >> > > > > > > > > > native >> > > > > > > > > > > > > > >> implementations available across the widest >> > possible >> > > > > > > > languages. >> > > > > > > > > > > As >> > > > > > > > > > > > > far >> > > > > > > > > > > > > > as >> > > > > > > > > > > > > > >> i can tell zstd only have bindings in java via >> > JNI, >> > > > but >> > > > > > my >> > > > > > > > > > > > > > understanding is >> > > > > > > > > > > > > > >> it is probably the type of compression for our >> > > > > > use-cases. >> > > > > > > > So I >> > > > > > > > > > > > think >> > > > > > > > > > > > > > >> zstd + potentially 1 more. >> > > > > > > > > > > > > > >> 3. Commitment from someone on the Java side to >> > > > > > implement >> > > > > > > > this. >> > > > > > > > > > > > > > >> 4. This doesn't need to be coupled with this >> > change >> > > > > > per-se >> > > > > > > > but >> > > > > > > > > > > for >> > > > > > > > > > > > > > >> something like flight it would be good to have a >> > > > > > standard >> > > > > > > > > > > mechanism >> > > > > > > > > > > > > for >> > > > > > > > > > > > > > >> negotiating server/client capabilities (e.g. >> > client >> > > > > > doesn't >> > > > > > > > > > > support >> > > > > > > > > > > > > > >> compression or only supports a subset). >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> Thanks, >> > > > > > > > > > > > > > >> Micah >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> On Sun, Mar 1, 2020 at 1:24 PM Wes McKinney < >> > > > > > > > > > [email protected]> >> > > > > > > > > > > > > > wrote: >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >>> On Sun, Mar 1, 2020 at 3:14 PM Antoine Pitrou < >> > > > > > > > > > > [email protected]> >> > > > > > > > > > > > > > >> wrote: >> > > > > > > > > > > > > > >>>> >> > > > > > > > > > > > > > >>>> >> > > > > > > > > > > > > > >>>> Le 01/03/2020 à 22:01, Wes McKinney a écrit : >> > > > > > > > > > > > > > >>>>> In the context of a "next version of the >> > Feather >> > > > > > format" >> > > > > > > > > > > > ARROW-5510 >> > > > > > > > > > > > > > >>>>> (which is consumed only by Python and R at >> > the >> > > > > > moment), I >> > > > > > > > > > have >> > > > > > > > > > > > been >> > > > > > > > > > > > > > >>>>> looking at compressing buffers using fast >> > > > compressors >> > > > > > > > like >> > > > > > > > > > ZSTD >> > > > > > > > > > > > > when >> > > > > > > > > > > > > > >>>>> writing the RecordBatch bodies. This could be >> > > > handled >> > > > > > > > > > privately >> > > > > > > > > > > > as >> > > > > > > > > > > > > an >> > > > > > > > > > > > > > >>>>> implementation detail of the Feather file, >> > but >> > > > since >> > > > > > ZSTD >> > > > > > > > > > > > > compression >> > > > > > > > > > > > > > >>>>> could improve throughput in Flight, for >> > example, >> > > > I >> > > > > > > > thought I >> > > > > > > > > > > > would >> > > > > > > > > > > > > > >>>>> bring it up for discussion. >> > > > > > > > > > > > > > >>>>> >> > > > > > > > > > > > > > >>>>> I can see two simple compression strategies: >> > > > > > > > > > > > > > >>>>> >> > > > > > > > > > > > > > >>>>> * Compress the entire message body in >> > one-shot, >> > > > > > writing >> > > > > > > > the >> > > > > > > > > > > > result >> > > > > > > > > > > > > > >> out >> > > > > > > > > > > > > > >>>>> with an 8-byte int64 prefix indicating the >> > > > > > uncompressed >> > > > > > > > size >> > > > > > > > > > > > > > >>>>> * Compress each non-zero-length constituent >> > > > Buffer >> > > > > > prior >> > > > > > > > to >> > > > > > > > > > > > writing >> > > > > > > > > > > > > > >> to >> > > > > > > > > > > > > > >>>>> the body (and using the same >> > > > > > uncompressed-length-prefix >> > > > > > > > when >> > > > > > > > > > > > > writing >> > > > > > > > > > > > > > >>>>> the compressed buffer) >> > > > > > > > > > > > > > >>>>> >> > > > > > > > > > > > > > >>>>> The latter strategy is preferable for >> > scenarios >> > > > > > where we >> > > > > > > > may >> > > > > > > > > > > > > project >> > > > > > > > > > > > > > >>>>> out only a few fields from a larger record >> > batch >> > > > > > (such as >> > > > > > > > > > > reading >> > > > > > > > > > > > > > >> from >> > > > > > > > > > > > > > >>>>> a memory-mapped file). >> > > > > > > > > > > > > > >>>> >> > > > > > > > > > > > > > >>>> Agreed. It may also allow using different >> > > > compression >> > > > > > > > > > > strategies >> > > > > > > > > > > > > for >> > > > > > > > > > > > > > >>>> different kinds of buffers (for example a >> > > > bytestream >> > > > > > > > splitting >> > > > > > > > > > > > > > strategy >> > > > > > > > > > > > > > >>>> for floats and doubles, or a delta encoding >> > > > strategy >> > > > > > for >> > > > > > > > > > > > integers). >> > > > > > > > > > > > > > >>> >> > > > > > > > > > > > > > >>> If we wanted to allow for different >> > compression to >> > > > > > apply to >> > > > > > > > > > > > different >> > > > > > > > > > > > > > >>> buffers, I think we will need a new Message >> > type >> > > > > > because >> > > > > > > > this >> > > > > > > > > > > would >> > > > > > > > > > > > > > >>> inflate metadata sizes in a way that is not >> > likely >> > > > to >> > > > > > be >> > > > > > > > > > > acceptable >> > > > > > > > > > > > > > >>> for the current uncompressed use case. >> > > > > > > > > > > > > > >>> >> > > > > > > > > > > > > > >>> Here is my strawman proposal >> > > > > > > > > > > > > > >>> >> > > > > > > > > > > > > > >>> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > >> > > > > > >> > > > >> > https://github.com/apache/arrow/compare/master...wesm:compression-strawman >> > > > > > > > > > > > > > >>> >> > > > > > > > > > > > > > >>>>> Implementation could be accomplished by one >> > of >> > > > the >> > > > > > > > following >> > > > > > > > > > > > > methods: >> > > > > > > > > > > > > > >>>>> >> > > > > > > > > > > > > > >>>>> * Setting a field in Message.custom_metadata >> > > > > > > > > > > > > > >>>>> * Adding a new field to Message >> > > > > > > > > > > > > > >>>> >> > > > > > > > > > > > > > >>>> I think it has to be a new field in Message. >> > > > Making >> > > > > > it an >> > > > > > > > > > > > ignorable >> > > > > > > > > > > > > > >>>> metadata field means non-supporting receivers >> > will >> > > > > > decode >> > > > > > > > and >> > > > > > > > > > > > > > interpret >> > > > > > > > > > > > > > >>>> the data wrongly. >> > > > > > > > > > > > > > >>>> >> > > > > > > > > > > > > > >>>> Regards >> > > > > > > > > > > > > > >>>> >> > > > > > > > > > > > > > >>>> Antoine. >> > > > > > > > > > > > > > >>> >> > > > > > > > > > > > > > >> >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > >> > > > > > >> > > > >> >
