Hi,

Le 31/05/2022 à 20:24, Tobias Zagorni a écrit :
Hi, I'm currently working on adding Run-Length encoding to arrow. I
created a function to dictionary-encode arrays here (currently only for
fixed length types):
https://github.com/apache/arrow/compare/master...zagto:rle?expand=1

The general idea is that RLE data will be a nested data type, with a
single child holding a regular ArrayData of the type of the values, but
with the duplicate values removed. The parent contains a single buffer
of uint64 representing the run lengths by holding the run length of all
runs from the first to the current one

That's an interesting approach. How do nulls work? Is the null bitmap stored in the child only?

What are the intended use cases for this:
- external engines want to provide run-length encoded data to work on
using arrow?
- more efficient ACERO (and possibly somewhat simplified, since we can
now use RLE arrays to replace Scalar Datums)

I don't think replacing Scalar compute paths with dedicated paths for RLE-encoded data would ever be a simplification. Also, when a kernel hasn't been upgraded with a native path for RLE data, former Scalar Datums would now be expanded to the full RLE-decoded version before running the kernel...

Automatic kernel dispatch:
- Scalar kernels would likely just work on RLE data

Unary scalar kernels may indeed just work, but n-ary kernels (where n > 1) would not, except in the happy case where run lengths are the same in all arguments.

- How should it be implemented? The current place for logic like that
seems to be the DispatchBest methods. These only allow to replace the
input types, but for this RLE scheme we would need to make the kernel
work on a child array of the input. This mechanism would likely need to
be extended a lot.
- For kernels which don't work on RLE data, should we automatically
call Encode and Decode kernels?

So, as a data point, most kernels currently don't have a native path for dictionary-encoded data; instead, they decode the data before working on it.

For the same reason (lack of manpower), chances are that few kernels would ever get a native path for RLE-encoded data (or it would take a long time before the most common kernels are upgraded with such a native path).

- Should RLE really be a data type? Dictionary encoding set an example
for this, but just like it, RLE is actually a different encoding for
arrays of the same data type.

Well, Arrow C++ does not have a notion of encoding distinct from the data type. Adding such a notion would risk breaking compatibility for all existing software that hasn't been upgraded to dispatch based on encoding.

Regards

Antoine.

Reply via email to