Hi Mark,

> There is definitely a tradeoff between processing speed and compression,
> however I feel there is a use case for 'small in memory footprint'
> independent of 'high speed processing'.
> Though I appreciate arrow team may not want to address that, given the
> focus on processing speed. ( can't be all things to everyone. )
> Personally, I think adding in the programming interfaces to handle
> compressed in-mem arrays would be a good thing, as well as the 'in flight'
> ones.

In the thread I linked, I was advocating for implementing the simplest
possible methodologies before exploring potentially more complex ones,
especially those that force a size vs. access-speed tradeoff.

I'm currently doing some work on Parquet->Arrow reading, but I'm hoping to
be able to do more on the encoding work once I can wrap that up. The first
step is building a proof-of-concept prototype.

If you think the encodings in the straw-man proposal [1] will not be at all
useful for your use case, that is useful feedback, but I suspect they would
still help to some degree even if they aren't optimal.

Thanks,
Micah

[1] https://github.com/apache/arrow/pull/4815

On Sun, Aug 30, 2020 at 10:21 AM <m...@markfarnan.com> wrote:

> All,
>
> Micah: appears my google-fu wasn't strong enough to find the previous
> thread, so thanks for pointing that out.
>
> There is definitely a tradeoff between processing speed and compression,
> however I feel there is a use case for 'small in memory footprint'
> independent of 'high speed processing'.
> Though I appreciate arrow team may not want to address that, given the
> focus on processing speed. ( can't be all things to everyone. )
>
> Personally, I think adding in the programming interfaces to handle
> compressed in-mem arrays would be a good thing, as well as the 'in flight'
> ones.
>
> For reference, my specific use case is handing large datasets [1] of
> varying types [2] to the browser for plotting, including scrolling over
> them, using WASM (currently in Go).
> Both network bandwidth to browsers and browser memory are always
> problematic, especially on mobile devices, hence the desire to compress,
> keep the data compressed on arrival, and minimize the number of in-mem
> copies needed.
>
> The access to the data is either:
> A: forward read from a certain point for a range, to draw (that point
> and range change with scroll and zoom), or
> B: random access for tooltips (value of 'n' columns at index 'y').
>
> Both can potentially be efficient enough based on the selection of
> block sizes or other internal boundaries / search method.
>
> Note: compressing potentially makes my 'other' problem even harder, which
> is the best method for appending inbound realtime sensor data into the
> in-memory model. Still thinking about that one.
>
> Regards,
>
> Mark.
>
> [1] 'Large' is obviously relative: in this case, a single plot may have
> 20-50 separate time series, each with between 20k and 10 million points.
>
> [2] The data is often Index: time, Value: float, OR Index: float
> (length measure), Value: float, but not always: Value could be one of
> int(8,16,32,64), float(32,64), string, vector(float32/64), etc. Hence
> why I like Arrow as the standard 'format' for this data, as all of these
> can be safely encoded within it.
>
> -----Original Message-----
> From: Micah Kornfield <emkornfi...@gmail.com>
> Sent: Sunday, August 30, 2020 6:20 PM
> To: Wes McKinney <wesmck...@gmail.com>
> Cc: dev <dev@arrow.apache.org>
> Subject: Re: Compression in Arrow - Question
>
> Agreed, I think it would be useful to make sure the "compute" interfaces
> have the right hooks to support alternate encodings.
>
> On Sunday, August 30, 2020, Wes McKinney <wesmck...@gmail.com> wrote:
>
> > That said, there is nothing preventing the development of programming
> > interfaces for compressed / encoded data right now. When it comes to
> > transporting such data, that's when we will have to decide on what to
> > support and what new metadata structures are required.
> >
> > For example, we could add RLE to C++ in prototype form and then
> > convert to non-RLE when writing to IPC messages.
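[Editor's note: to make the RLE idea mentioned above concrete, here is a
minimal sketch in Go of run-length encoding a column and expanding it back to
plain values, the way a prototype might materialize before writing a non-RLE
IPC message. The `run` type and both function names are hypothetical
illustrations, not part of any Arrow library.]

```go
package main

import "fmt"

// run is one (value, count) pair of a hypothetical run-length-encoded
// int64 column.
type run struct {
	value int64
	count int
}

// rleEncode collapses consecutive repeated values into runs.
func rleEncode(values []int64) []run {
	var runs []run
	for _, v := range values {
		if n := len(runs); n > 0 && runs[n-1].value == v {
			runs[n-1].count++
		} else {
			runs = append(runs, run{value: v, count: 1})
		}
	}
	return runs
}

// rleDecode expands runs back into a plain array, as a prototype might
// do just before emitting an ordinary (non-RLE) IPC message.
func rleDecode(runs []run) []int64 {
	var out []int64
	for _, r := range runs {
		for i := 0; i < r.count; i++ {
			out = append(out, r.value)
		}
	}
	return out
}

func main() {
	col := []int64{7, 7, 7, 7, 3, 3, 9}
	runs := rleEncode(col)
	fmt.Println(runs)            // [{7 4} {3 2} {9 1}]
	fmt.Println(rleDecode(runs)) // [7 7 7 7 3 3 9]
}
```

The round trip is lossless, so compute kernels could iterate the runs directly
while writers fall back to the expanded form.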
> >
> > On Sat, Aug 29, 2020 at 7:34 PM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> > >
> > > Hi Mark,
> > > See the most recent previous discussion about alternate encodings [1].
> > > This is something that in the long run should be added; I'd
> > > personally prefer to start with simpler encodings.
> > >
> > > I don't think we should add anything more with regard to
> > > compression/encoding until at least 3 languages support the current
> > > compression methods that are in the specification. C++ has it
> > > implemented, there is some work in Java, and I think we should have
> > > at least one more.
> > >
> > > -Micah
> > >
> > > [1]
> > > https://lists.apache.org/thread.html/r1d9d707c481c53c13534f7c72d75c7a90dc7b2b9966c6c0772d0e416%40%3Cdev.arrow.apache.org%3E
> > >
> > > On Sat, Aug 29, 2020 at 4:04 PM <m...@markfarnan.com> wrote:
> > > >
> > > > I was looking at compression in Arrow and had a couple of questions.
> > > >
> > > > If I've understood how compression currently works, it is only used
> > > > 'in flight' in either IPC or Arrow Flight, using block compression,
> > > > but the data is still decoded into RAM at the destination in full
> > > > array form. Is this correct?
> > > >
> > > > Given that Arrow is a columnar format, has any thought been given
> > > > to an option to have the data compressed both in memory and in
> > > > flight, using some of the columnar techniques?
> > > > As I deal primarily with timeseries numerical data, I was thinking
> > > > some of the algorithms from the Gorilla paper [1] for floats and
> > > > timestamps (delta-of-delta) or similar might be appropriate.
> > > >
> > > > The interface functions could still iterate over the data and
> > > > produce raw values so this is transparent to users of the data, but
> > > > the data blocks/arrays in-mem are actually compressed.
> > > >
> > > > With this method, blocks could come out of a database/source,
> > > > through the data service, across the wire (Flight), and land in the
> > > > consuming application's memory without ever being decompressed or
> > > > processed until final use.
> > > >
> > > > Crazy thought?
> > > >
> > > > Regards,
> > > >
> > > > Mark.
> > > >
> > > > [1]: https://www.vldb.org/pvldb/vol8/p1816-teller.pdf
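[Editor's note: for readers unfamiliar with the Gorilla timestamp scheme
referenced above, here is a minimal Go sketch of delta-of-delta encoding. It
shows only the integer transform; the variable-length bit packing that gives
Gorilla its compression is omitted. All names are illustrative, not from any
Arrow or Gorilla library, and the encoder assumes at least two timestamps.]

```go
package main

import "fmt"

// deltaOfDelta encodes a timestamp column (len >= 2) as the first value,
// the first delta, and the difference between successive deltas.
// Regularly sampled series produce long runs of zeros, which is what
// Gorilla's bit packing (omitted here) exploits.
func deltaOfDelta(ts []int64) (first, firstDelta int64, dods []int64) {
	first = ts[0]
	firstDelta = ts[1] - ts[0]
	prevDelta := firstDelta
	for i := 2; i < len(ts); i++ {
		delta := ts[i] - ts[i-1]
		dods = append(dods, delta-prevDelta)
		prevDelta = delta
	}
	return first, firstDelta, dods
}

// decode reverses the transform back into raw timestamps.
func decode(first, firstDelta int64, dods []int64) []int64 {
	ts := []int64{first, first + firstDelta}
	delta := firstDelta
	for _, d := range dods {
		delta += d
		ts = append(ts, ts[len(ts)-1]+delta)
	}
	return ts
}

func main() {
	// Samples every 10 units with one late reading; deltas are
	// 10, 10, 11, 10, so the delta-of-deltas are mostly zero.
	ts := []int64{1000, 1010, 1020, 1031, 1041}
	f, fd, dods := deltaOfDelta(ts)
	fmt.Println(dods)                // [0 1 -1]
	fmt.Println(decode(f, fd, dods)) // [1000 1010 1020 1031 1041]
}
```

Because the transform is reversible, an iterator over the encoded block could
hand raw values to consumers, matching the "transparent to users" idea in the
thread.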