A format where run lengths and values are interleaved would almost certainly be worse than having them separate. For example, unary scalar kernel evaluation is exactly the same as on raw arrays when they are not interleaved. Further, in the context of vectorization, a vectorized load into the array would be a mix of run lengths and values - not good as you have to de-interleave them somehow anyway. Am I missing something - did you have an optimization in mind?
Regarding a benchmark, I agree it could be useful. We need to decide what’s “reasonable”. There are pathological cases either way: you can be 2x slower than a raw column if you have a million runs of length 1 or you can be 1000000x faster if you have one run of length 1000000. Where I think RLE can shine most is if a filter column is RLE or if a group-by column is RLE. These are the two benchmarks that I would suggest writing. Sasha Krassovsky > On Jun 8, 2022, at 3:59 AM, Alessandro Molina <alessan...@voltrondata.com> > wrote: > > RLE would probably have some benefits that it makes sense to evaluate, I > would personally go in the direction of having a minimal benchmarking suite > for some of the cases where we expect to seem most benefit (IE: filtering) > so we can discuss with real numbers. > > Also, the currently proposed format divides run lengths and values, maybe a > format where run lengths and values are stored interleaved in the same > buffer might be able to allow more optimisations in the contest of > vectorised operations. Even though it might be harder to work with for > things that are not fixed width. > > On Tue, Jun 7, 2022 at 7:56 PM Tobias Zagorni <tob...@zagorni.eu.invalid> > wrote: > >> I created a Jira for adding RLE as ARROW-16771, and draft PRs: >> >> - https://github.com/apache/arrow/pull/13330 >> Encode/Decode functions for (currently fixed width types only) >> >> - https://github.com/apache/arrow/pull/13333 >> For updating docs >> >> Best, >> Tobias >> >> Am Dienstag, dem 31.05.2022 um 17:13 -0500 schrieb Wes McKinney: >>> I haven't had a chance to look at the branch in detail, but if you >>> can >>> provide a pointer to a specification or other details about the >>> proposed memory format for RLE (basically: what would be added to the >>> columnar documentation as well as the Flatbuffers schema files), it >>> would be helpful so it can be circulated to some other interested >>> parties working primarily outside of Arrow (e.g. DuckDB) who might >>> like to converge on a standard especially given that it would be >>> exported across the C data interface. Thanks! >> >>