Re: [C++] Adding Run-Length Encoding to Arrow

Tobias Zagorni Tue, 31 May 2022 13:25:49 -0700

Hi, 

Am Dienstag, dem 31.05.2022 um 21:12 +0200 schrieb Antoine Pitrou:
> 
> Hi,
> 
> Le 31/05/2022 à 20:24, Tobias Zagorni a écrit :
> > Hi, I'm currently working on adding Run-Length encoding to arrow. I
> > created a function to dictionary-encode arrays here (currently only
> > for
> > fixed length types):
> > https://github.com/apache/arrow/compare/master...zagto:rle?expand=1
> > 
> > The general idea is that RLE data will be a nested data type, with
> > a
> > single child holding a regular ArrayData of the type of the values,
> > but
> > with the duplicate values removed. The parent contains a single
> > buffer
> > of uint64 representing the run lengths by holding the run length of
> > all
> > runs from the first to the current one
> 
> That's an interesting approach. How do nulls work? Is the null bitmap
> stored in the child only?


Excactly. Runs of nulls are just like other runs, but the respective
element of the child array is null. The parent is not does not have a
null buffer since that would mean "run of length null" to me.

> > What are the intended use cases for this:
> > - external engines want to provide run-length encoded data to work
> > on
> > using arrow?
> > - more efficient ACERO (and possibly somewhat simplified, since we
> > can
> > now use RLE arrays to replace Scalar Datums)
> 
> I don't think replacing Scalar compute paths with dedicated paths for
> RLE-encoded data would ever be a simplification. Also, when a kernel 
> hasn't been upgraded with a native path for RLE data, former Scalar 
> Datums would now be expanded to the full RLE-decoded version before 
> running the kernel...

> > Automatic kernel dispatch:
> > - Scalar kernels would likely just work on RLE data
> 
> Unary scalar kernels may indeed just work, but n-ary kernels (where n
> > 
> 1) would not, except in the happy case where run lengths are the same
> in 
> all arguments.

You're right 

> 
> > - How should it be implemented? The current place for logic like
> > that
> > seems to be the DispatchBest methods. These only allow to replace
> > the
> > input types, but for this RLE scheme we would need to make the
> > kernel
> > work on a child array of the input. This mechanism would likely
> > need to
> > be extended a lot.
> > - For kernels which don't work on RLE data, should we automatically
> > call Encode and Decode kernels?
> 
> So, as a data point, most kernels currently don't have a native path
> for 
> dictionary-encoded data; instead, they decode the data before working
> on it.
> 
> For the same reason (lack of manpower), chances are that few kernels 
> would ever get a native path for RLE-encoded data (or it would take a
> long time before the most common kernels are upgraded with such a
> native 
> path).
> 
> > - Should RLE really be a data type? Dictionary encoding set an
> > example
> > for this, but just like it, RLE is actually a different encoding
> > for
> > arrays of the same data type.
> 
> Well, Arrow C++ does not have a notion of encoding distinct from the 
> data type. Adding such a notion would risk breaking compatibility for
> all existing software that hasn't been upgraded to dispatch based on 
> encoding.
> 
> Regards
> 
> Antoine.

Thanks for all the input!

Best,
Tobias

Re: [C++] Adding Run-Length Encoding to Arrow

Reply via email to