Hello Everyone,

Recently, I have implemented support for run-length encoding in Arrow
C++. So far my implementation is split into different subtasks of
ARROW-16771 (https://issues.apache.org/jira/browse/ARROW-16771).

I have (draft) PRs available for:
- general handling of RLE in Arrow C++: Type, Array, and Builder
  subclasses, etc.
  (subtasks 1-9)
- encode, decode kernels (fixed size only):
  (https://issues.apache.org/jira/browse/ARROW-16772)
- filter kernel (fixed size only):
  (https://issues.apache.org/jira/browse/ARROW-16774)
- simple benchmark for the RLE kernels
  (https://issues.apache.org/jira/browse/ARROW-17026)
- adding RLE to Arrow Columnar format document
  (https://issues.apache.org/jira/browse/ARROW-16773)

What is not yet implemented:
- converting RLE to formats like Parquet, JSON, IPC.
- caching of physical offsets when working with sliced arrays. Finding
  these offsets is an O(log(n)) binary search, which could be avoided in
  many cases.
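
To illustrate the offset lookup mentioned in the last item, here is a
minimal sketch. It assumes the run ends are stored as accumulated
(exclusive) logical end positions, and it uses plain std::vector rather
than Arrow buffers. FindPhysicalIndex, ResolveSlice, and
SlicePhysicalRange are hypothetical names for illustration, not the API
in the PRs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Assumption: run_ends[k] is the exclusive logical end of run k, so run k
// covers the logical indices [run_ends[k-1], run_ends[k]).
// Hypothetical helper: maps a logical index to the index of its run.
inline int64_t FindPhysicalIndex(const std::vector<int32_t>& run_ends,
                                 int64_t logical_index) {
  assert(!run_ends.empty());
  assert(logical_index >= 0 && logical_index < run_ends.back());
  // O(log(n)) binary search for the first run end that is strictly
  // greater than the logical index.
  auto it = std::upper_bound(run_ends.begin(), run_ends.end(), logical_index);
  return static_cast<int64_t>(it - run_ends.begin());
}

// Hypothetical cache entry: the physical range backing a logical slice.
struct SlicePhysicalRange {
  int64_t physical_offset;  // run containing the first sliced element
  int64_t physical_length;  // number of runs the slice touches
};

// Resolves the physical range once, so that later accesses to the same
// slice can reuse the result instead of repeating the binary search.
inline SlicePhysicalRange ResolveSlice(const std::vector<int32_t>& run_ends,
                                       int64_t offset, int64_t length) {
  assert(length > 0);
  const int64_t first = FindPhysicalIndex(run_ends, offset);
  const int64_t last = FindPhysicalIndex(run_ends, offset + length - 1);
  return {first, last - first + 1};
}
```

With run ends {3, 5, 9}, a slice with logical offset 4 and length 3
resolves to physical offset 1 and physical length 2. Storing that result
next to the slice is the kind of caching that would avoid the repeated
search.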

I'm interested in any feedback on the code and I'm wondering what would
be the best way to get this merged.

A lot of the PRs depend on earlier ones, so I ordered the subtasks in
the order in which they could be merged. The first ones would be:
1. Handling of array-only types using VisitTypeInline:
   https://issues.apache.org/jira/browse/ARROW-17258
2. Adding RLE type / array class (only builds on #1):
   https://issues.apache.org/jira/browse/ARROW-17261
-  Also, since it has no dependencies: adding RLE to the Arrow Columnar
   format document
   https://issues.apache.org/jira/browse/ARROW-16773

Best,
Tobias
