Re: [C++] Adding Run-Length Encoding to Arrow

Antoine Pitrou Tue, 31 May 2022 12:12:55 -0700


Hi,

Le 31/05/2022 à 20:24, Tobias Zagorni a écrit :

Hi, I'm currently working on adding Run-Length encoding to arrow. I
created a function to dictionary-encode arrays here (currently only for
fixed length types):
https://github.com/apache/arrow/compare/master...zagto:rle?expand=1

The general idea is that RLE data will be a nested data type, with a
single child holding a regular ArrayData of the type of the values, but
with the duplicate values removed. The parent contains a single buffer
of uint64 representing the run lengths by holding the run length of all
runs from the first to the current one

That's an interesting approach. How do nulls work? Is the null bitmapstored in the child only?

What are the intended use cases for this:
- external engines want to provide run-length encoded data to work on
using arrow?
- more efficient ACERO (and possibly somewhat simplified, since we can
now use RLE arrays to replace Scalar Datums)

I don't think replacing Scalar compute paths with dedicated paths forRLE-encoded data would ever be a simplification. Also, when a kernelhasn't been upgraded with a native path for RLE data, former ScalarDatums would now be expanded to the full RLE-decoded version beforerunning the kernel...

Automatic kernel dispatch:
- Scalar kernels would likely just work on RLE data

Unary scalar kernels may indeed just work, but n-ary kernels (where n >1) would not, except in the happy case where run lengths are the same inall arguments.

- How should it be implemented? The current place for logic like that
seems to be the DispatchBest methods. These only allow to replace the
input types, but for this RLE scheme we would need to make the kernel
work on a child array of the input. This mechanism would likely need to
be extended a lot.
- For kernels which don't work on RLE data, should we automatically
call Encode and Decode kernels?

So, as a data point, most kernels currently don't have a native path fordictionary-encoded data; instead, they decode the data before working on it.

For the same reason (lack of manpower), chances are that few kernelswould ever get a native path for RLE-encoded data (or it would take along time before the most common kernels are upgraded with such a nativepath).

- Should RLE really be a data type? Dictionary encoding set an example
for this, but just like it, RLE is actually a different encoding for
arrays of the same data type.

Well, Arrow C++ does not have a notion of encoding distinct from thedata type. Adding such a notion would risk breaking compatibility forall existing software that hasn't been upgraded to dispatch based onencoding.


Regards

Antoine.

Re: [C++] Adding Run-Length Encoding to Arrow

Reply via email to