Hi, Le 31/05/2022 à 20:24, Tobias Zagorni a écrit :
Hi, I'm currently working on adding Run-Length encoding to arrow. I created a function to dictionary-encode arrays here (currently only for fixed length types): https://github.com/apache/arrow/compare/master...zagto:rle?expand=1 The general idea is that RLE data will be a nested data type, with a single child holding a regular ArrayData of the type of the values, but with the duplicate values removed. The parent contains a single buffer of uint64 representing the run lengths by holding the run length of all runs from the first to the current one
That's an interesting approach. How do nulls work? Is the null bitmap stored in the child only?
What are the intended use cases for this: - external engines want to provide run-length encoded data to work on using arrow? - more efficient ACERO (and possibly somewhat simplified, since we can now use RLE arrays to replace Scalar Datums)
I don't think replacing Scalar compute paths with dedicated paths for RLE-encoded data would ever be a simplification. Also, when a kernel hasn't been upgraded with a native path for RLE data, former Scalar Datums would now be expanded to the full RLE-decoded version before running the kernel...
Automatic kernel dispatch: - Scalar kernels would likely just work on RLE data
Unary scalar kernels may indeed just work, but n-ary kernels (where n > 1) would not, except in the happy case where run lengths are the same in all arguments.
- How should it be implemented? The current place for logic like that seems to be the DispatchBest methods. These only allow to replace the input types, but for this RLE scheme we would need to make the kernel work on a child array of the input. This mechanism would likely need to be extended a lot. - For kernels which don't work on RLE data, should we automatically call Encode and Decode kernels?
So, as a data point, most kernels currently don't have a native path for dictionary-encoded data; instead, they decode the data before working on it.
For the same reason (lack of manpower), chances are that few kernels would ever get a native path for RLE-encoded data (or it would take a long time before the most common kernels are upgraded with such a native path).
- Should RLE really be a data type? Dictionary encoding set an example for this, but just like it, RLE is actually a different encoding for arrays of the same data type.
Well, Arrow C++ does not have a notion of encoding distinct from the data type. Adding such a notion would risk breaking compatibility for all existing software that hasn't been upgraded to dispatch based on encoding.
Regards Antoine.