Hi Mark,

> There is definitely a tradeoff between processing speed and compression,
> however I feel there is a use case for 'small in memory footprint'
> independent of 'high speed processing'.
> Though I appreciate arrow team may not want to address that, given the
> focus on processing speed. ( can't be all things to everyone. )
> Personally, I think adding in the programming interfaces to handle
> compressed in-mem arrays would be a good thing, as well as the 'in flight'
> ones.

In the thread I linked, I was advocating for implementing the simplest
possible methodologies before exploring potentially more complex ones,
especially those that force a size vs. access-speed tradeoff.

I'm currently doing some work on Parquet->Arrow reading, but I'm hoping to
be able to do more on the encoding work once I can wrap that up. The first
step is building a proof-of-concept prototype.

If you think the encodings in the straw-man proposal [1] will not be at all
useful for your use case, that is useful feedback, but I suspect they would
still help to some degree even if they aren't optimal.

Thanks,
Micah

[1] https://github.com/apache/arrow/pull/4815

On Sun, Aug 30, 2020 at 10:21 AM <m...@markfarnan.com> wrote:

> All,
>
> Micah: appears my google-fu wasn't strong enough to find the previous
> thread, so thanks for pointing that out.
>
> There is definitely a tradeoff between processing speed and compression,
> however I feel there is a use case for 'small in memory footprint'
> independent of 'high speed processing'.
> Though I appreciate arrow team may not want to address that, given the
> focus on processing speed. ( can't be all things to everyone. )
>
> Personally, I think adding in the programming interfaces to handle
> compressed in-mem arrays would be a good thing, as well as the 'in flight'
> ones.
>
> For reference, my specific use case is handing large datasets [1] of
> varying types [2] to the browser for plotting, including scrolling over
> them, using WASM (currently in Go).
> Both network bandwidth to browsers and browser memory are always
> problematic, especially on mobile devices, hence the desire to compress,
> keep the data compressed on arrival, and minimize the number of in-mem
> copies needed.
>
> The access to the data is either:
> A: forward read from a certain point for a range, to draw (that point
> and range change with scroll and zoom), or
> B: random access for tooltips (value of 'n' columns at index 'y').
>
> Both can potentially be efficient enough based on the selection of
> block sizes or other internal boundaries / search method.
>
> Note: compressing potentially makes my 'other' problem even harder, which
> is the best method for appending inbound realtime sensor data into the
> in-memory model. Still thinking about that one.
>
> Regards,
>
> Mark.
>
> [1] 'Large' is obviously relative: in this case, a single plot may have
> 20-50 separate time series, each with between 20k and 10 million points.
>
> [2] The data is often Index: time, Value: float, OR Index: float
> (length measure), Value: float, but not always: Value could be one of
> int(8,16,32,64), float(32,64), string, vector(float32/64), etc. Hence
> why I like Arrow as the standard 'format' for this data, as all of these
> can be safely encoded within it.
>
> -----Original Message-----
> From: Micah Kornfield <emkornfi...@gmail.com>
> Sent: Sunday, August 30, 2020 6:20 PM
> To: Wes McKinney <wesmck...@gmail.com>
> Cc: dev <dev@arrow.apache.org>
> Subject: Re: Compression in Arrow - Question
>
> Agreed, I think it would be useful to make sure the "compute" interfaces
> have the right hooks to support alternate encodings.
>
> On Sunday, August 30, 2020, Wes McKinney <wesmck...@gmail.com> wrote:
>
> > That said, there is nothing preventing the development of programming
> > interfaces for compressed / encoded data right now. When it comes to
> > transporting such data, that's when we will have to decide on what to
> > support and what new metadata structures are required.
> >
> > For example, we could add RLE to C++ in prototype form and then
> > convert to non-RLE when writing to IPC messages.
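[Editor's note: to make the RLE idea mentioned above concrete, here is a
minimal sketch in Go of run-length encoding a column and expanding it back to
plain values, the way a prototype might materialize before writing a non-RLE
IPC message. The `run` type and both function names are hypothetical
illustrations, not part of any Arrow library.]

```go
package main

import "fmt"

// run is one (value, count) pair of a hypothetical run-length-encoded
// int64 column.
type run struct {
	value int64
	count int
}

// rleEncode collapses consecutive repeated values into runs.
func rleEncode(values []int64) []run {
	var runs []run
	for _, v := range values {
		if n := len(runs); n > 0 && runs[n-1].value == v {
			runs[n-1].count++
		} else {
			runs = append(runs, run{value: v, count: 1})
		}
	}
	return runs
}

// rleDecode expands runs back into a plain array, as a prototype might
// do just before emitting an ordinary (non-RLE) IPC message.
func rleDecode(runs []run) []int64 {
	var out []int64
	for _, r := range runs {
		for i := 0; i < r.count; i++ {
			out = append(out, r.value)
		}
	}
	return out
}

func main() {
	col := []int64{7, 7, 7, 7, 3, 3, 9}
	runs := rleEncode(col)
	fmt.Println(runs)            // [{7 4} {3 2} {9 1}]
	fmt.Println(rleDecode(runs)) // [7 7 7 7 3 3 9]
}
```

The round trip is lossless, so compute kernels could iterate the runs directly
while writers fall back to the expanded form.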
> >
> > On Sat, Aug 29, 2020 at 7:34 PM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> > >
> > > Hi Mark,
> > > See the most recent previous discussion about alternate encodings [1].
> > > This is something that in the long run should be added; I'd
> > > personally prefer to start with simpler encodings.
> > >
> > > I don't think we should add anything more with regard to
> > > compression/encoding until at least 3 languages support the current
> > > compression methods that are in the specification. C++ has it
> > > implemented, there is some work in Java, and I think we should have
> > > at least one more.
> > >
> > > -Micah
> > >
> > > [1]
> > > https://lists.apache.org/thread.html/r1d9d707c481c53c13534f7c72d75c7a90dc7b2b9966c6c0772d0e416%40%3Cdev.arrow.apache.org%3E
> > >
> > > On Sat, Aug 29, 2020 at 4:04 PM <m...@markfarnan.com> wrote:
> > > >
> > > > I was looking at compression in Arrow and had a couple of questions.
> > > >
> > > > If I've understood how compression currently works, it is only used
> > > > 'in flight' in either IPC or Arrow Flight, using block compression,
> > > > but the data is still decoded into RAM at the destination in full
> > > > array form. Is this correct?
> > > >
> > > > Given that Arrow is a columnar format, has any thought been given
> > > > to an option to have the data compressed both in memory and in
> > > > flight, using some of the columnar techniques?
> > > > As I deal primarily with timeseries numerical data, I was thinking
> > > > some of the algorithms from the Gorilla paper [1] for floats and
> > > > timestamps (delta-of-delta) or similar might be appropriate.
> > > >
> > > > The interface functions could still iterate over the data and
> > > > produce raw values so this is transparent to users of the data, but
> > > > the data blocks/arrays in-mem are actually compressed.
> > > >
> > > > With this method, blocks could come out of a database/source,
> > > > through the data service, across the wire (Flight), and land in the
> > > > consuming application's memory without ever being decompressed or
> > > > processed until final use.
> > > >
> > > > Crazy thought?
> > > >
> > > > Regards,
> > > >
> > > > Mark.
> > > >
> > > > [1]: https://www.vldb.org/pvldb/vol8/p1816-teller.pdf
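[Editor's note: for readers unfamiliar with the Gorilla timestamp scheme
referenced above, here is a minimal Go sketch of delta-of-delta encoding. It
shows only the integer transform; the variable-length bit packing that gives
Gorilla its compression is omitted. All names are illustrative, not from any
Arrow or Gorilla library, and the encoder assumes at least two timestamps.]

```go
package main

import "fmt"

// deltaOfDelta encodes a timestamp column (len >= 2) as the first value,
// the first delta, and the difference between successive deltas.
// Regularly sampled series produce long runs of zeros, which is what
// Gorilla's bit packing (omitted here) exploits.
func deltaOfDelta(ts []int64) (first, firstDelta int64, dods []int64) {
	first = ts[0]
	firstDelta = ts[1] - ts[0]
	prevDelta := firstDelta
	for i := 2; i < len(ts); i++ {
		delta := ts[i] - ts[i-1]
		dods = append(dods, delta-prevDelta)
		prevDelta = delta
	}
	return first, firstDelta, dods
}

// decode reverses the transform back into raw timestamps.
func decode(first, firstDelta int64, dods []int64) []int64 {
	ts := []int64{first, first + firstDelta}
	delta := firstDelta
	for _, d := range dods {
		delta += d
		ts = append(ts, ts[len(ts)-1]+delta)
	}
	return ts
}

func main() {
	// Samples every 10 units with one late reading; deltas are
	// 10, 10, 11, 10, so the delta-of-deltas are mostly zero.
	ts := []int64{1000, 1010, 1020, 1031, 1041}
	f, fd, dods := deltaOfDelta(ts)
	fmt.Println(dods)                // [0 1 -1]
	fmt.Println(decode(f, fd, dods)) // [1000 1010 1020 1031 1041]
}
```

Because the transform is reversible, an iterator over the encoded block could
hand raw values to consumers, matching the "transparent to users" idea in the
thread.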