Re: PRs for RLE support

2022-09-15 Thread Matt Topol
IMHO I think it's worth parameterizing for the 16/32-bit case. Despite it being nice to be able to just assume it's a 32bit signed int in terms of code simplicity, I think it would be a good benefit for memory usage of RLE arrays. That said I don't have anything to back that up as I don't regularl

Re: PRs for RLE support

2022-09-14 Thread Weston Pace
> The downside is that we run into similar issues with List/LargeList > where the total number of elements is limited by bit-width (which can also > cause space wastage, e.g. with run ends it might be reasonable to limit > bit-width to 16). True. I guess another question would be whether we want

Re: PRs for RLE support

2022-09-14 Thread Tobias Zagorni
Am Mittwoch, dem 14.09.2022 um 14:33 -0400 schrieb Matthew Topol: > Doesn't this explanation conflate the Logical Offset (the parent's > offset) and the Physical Offset (the offset of the run ends / values > children)? Or am I missing something? If you have an RLE array > of length 200, and you s

Re: PRs for RLE support

2022-09-14 Thread Matthew Topol
> On the other hand, if there were two child arrays then an implementation, when slicing, could choose to always keep the offset of the parent array at 0 and instead put the offsets in the child arrays. Now you have a parent array with offset 0, a run ends (int32) array with offset 74 and length

Re: PRs for RLE support

2022-09-14 Thread Dewey Dunnington
Thanks for the clarification! The (probably very common) slicing case makes a lot of sense. On Wed, Sep 14, 2022 at 3:19 PM Weston Pace wrote: > I will clarify the offset problem. It essentially boils down to "if > you don't have constant access to elements then an array length offset > does no

Re: PRs for RLE support

2022-09-14 Thread Weston Pace
I will clarify the offset problem. It essentially boils down to "if you don't have constant access to elements then an array length offset does not give you constant access to buffer offsets". We start with an RLE array of length 200. We slice it with (start=10, length=100) to get an RLE array o

Re: PRs for RLE support

2022-09-14 Thread Dewey Dunnington
> * Should we encode "run lengths" or "run ends"? In addition to the points mentioned above, this seems the most consistent with the variable-length binary/list layouts > encoding the run ends as a buffer (similar to list array for example) makes it difficult to calculate offsets I don't have a

Re: PRs for RLE support

2022-09-14 Thread Matthew Topol
Just wanted to chime in here that I also have several draft PRs for implementing the RLE arrays in Go as the second implementation (since we use two implementations as a requirement to vote on changes/additions to the format). They can be found here:

Re: PRs for RLE support

2022-09-14 Thread Micah Kornfield
> > * Should we encode "run lengths" or "run ends"? I think the project has leaned towards sublinear access, so run ends make sense. The downside is that we run into similar issues with List/LargeList where the total number of elements is limited by bit-width (which can also cause space wastage

Re: PRs for RLE support

2022-09-14 Thread Weston Pace
I'm going to bump this because it would be good to get feedback. In particular it would be nice to get feedback on the suggested format change[1]. We are currently moving forward on coming up with an IPC format proposal which we will share when ready. The two interesting points that jump out to

PRs for RLE support

2022-08-25 Thread Tobias Zagorni
Hello Everyone, Recently, I have implemented support for run-length encoding in Arrow C++. So far my implementation is split into different subtasks of ARROW-16771 (https://issues.apache.org/jira/browse/ARROW-16771). I have (draft) PRs available for: - general handling of RLE in arrow C++, Type,