Re: [C++] Adding Run-Length Encoding to Arrow

Sasha Krassovsky Thu, 09 Jun 2022 11:07:13 -0700

A format where run lengths and values are interleaved would almost certainly be 
worse than having them separate. For example, unary scalar kernel evaluation is 
exactly the same as on raw arrays when they are not interleaved. Further, in 
the context of vectorization, a vectorized load into the array would be a mix 
of run lengths and values - not good as you have to de-interleave them somehow 
anyway. Am I missing something - did you have an optimization in mind?


Regarding a benchmark, I agree it could be useful. We need to decide what’s 
“reasonable”. There are pathological cases either way: you can be 2x slower 
than a raw column if you have a million runs of length 1 or you can be 1000000x 
faster if you have one run of length 1000000. Where I think RLE can shine most 
is if a filter column is RLE or if a group-by column is RLE. These are the two 
benchmarks that I would suggest writing. 

Sasha Krassovsky

> On Jun 8, 2022, at 3:59 AM, Alessandro Molina <[email protected]> 
> wrote:
> 
> RLE would probably have some benefits that it makes sense to evaluate, I
> would personally go in the direction of having a minimal benchmarking suite
> for some of the cases where we expect to seem most benefit (IE: filtering)
> so we can discuss with real numbers.
> 
> Also, the currently proposed format divides run lengths and values, maybe a
> format where run lengths and values are stored interleaved in the same
> buffer might be able to allow more optimisations in the contest of
> vectorised operations. Even though it might be harder to work with for
> things that are not fixed width.
> 
> On Tue, Jun 7, 2022 at 7:56 PM Tobias Zagorni <[email protected]>
> wrote:
> 
>> I created a Jira for adding RLE as ARROW-16771, and draft PRs:
>> 
>> - https://github.com/apache/arrow/pull/13330
>>  Encode/Decode functions for (currently fixed width types only)
>> 
>> - https://github.com/apache/arrow/pull/13333
>>  For updating docs
>> 
>> Best,
>> Tobias
>> 
>> Am Dienstag, dem 31.05.2022 um 17:13 -0500 schrieb Wes McKinney:
>>> I haven't had a chance to look at the branch in detail, but if you
>>> can
>>> provide a pointer to a specification or other details about the
>>> proposed memory format for RLE (basically: what would be added to the
>>> columnar documentation as well as the Flatbuffers schema files), it
>>> would be helpful so it can be circulated to some other interested
>>> parties working primarily outside of Arrow (e.g. DuckDB) who might
>>> like to converge on a standard especially given that it would be
>>> exported across the C data interface. Thanks!
>> 
>>

Re: [C++] Adding Run-Length Encoding to Arrow

Reply via email to