Hello everyone,

I recently implemented support for run-length encoding (RLE) in Arrow C++. So far my implementation is split into different subtasks of ARROW-16771 (https://issues.apache.org/jira/browse/ARROW-16771).

I have (draft) PRs available for:

- general handling of RLE in Arrow C++: Type, Array, Builder subclasses, etc. (subtasks 1-9)
- encode and decode kernels (fixed size only): https://issues.apache.org/jira/browse/ARROW-16772
- filter kernel (fixed size only): https://issues.apache.org/jira/browse/ARROW-16774
- simple benchmark for the RLE kernels: https://issues.apache.org/jira/browse/ARROW-17026
- adding RLE to the Arrow Columnar format document: https://issues.apache.org/jira/browse/ARROW-16773

What is not yet implemented:

- converting RLE to formats like Parquet, JSON, and IPC
- caching of physical offsets when working with sliced arrays. Finding these offsets is an O(log(n)) binary search, which could be avoided in a lot of cases.

I'm interested in any feedback on the code, and I'm wondering what would be the best way to get this merged. A lot of the PRs depend on earlier ones, so I ordered the subtasks in the order in which they could be merged. The first would be:

1. Handling of array-only types using VisitTypeInline: https://issues.apache.org/jira/browse/ARROW-17258
2. Adding the RLE type / array class (only builds on #1): https://issues.apache.org/jira/browse/ARROW-17261
- also, since it has no dependencies: adding RLE to the Arrow Columnar format document: https://issues.apache.org/jira/browse/ARROW-16773

Best,
Tobias