RE: RunEndEncodedArray Null Counts

2023-01-22 Thread Tobias Zagorni
Hi Raphael, I think this is indeed a documentation mistake; it should say 0! For exactly the reasons you mentioned, I determined that it is best to leave the null count field always 0 for RLE arrays. This way it is consistent with union types, at least. RunLengthEncoded data should not contain
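A minimal sketch of how a reader could still derive a logical null count when the stored field is always 0. It assumes a run-end layout (cumulative `run_ends` plus one value per run, with `None` standing in for a null run). The function name `logical_null_count` is hypothetical and is not part of any Arrow API.

```python
def logical_null_count(run_ends, values, offset=0, length=None):
    """Count logical nulls in a (sliced) run-end encoded array.

    run_ends: strictly increasing cumulative end positions, one per run (assumed layout).
    values:   one value per run; None marks a null run (assumption for this sketch).
    """
    total = run_ends[-1] if run_ends else 0
    if length is None:
        length = total - offset
    start, stop = offset, offset + length

    nulls = 0
    run_start = 0
    for run_end, value in zip(run_ends, values):
        if value is None:
            # Overlap of the run [run_start, run_end) with the slice [start, stop).
            overlap = min(run_end, stop) - max(run_start, start)
            if overlap > 0:
                nulls += overlap
        run_start = run_end
    return nulls


run_ends = [3, 5, 8]
values = [5, None, 7]
print(logical_null_count(run_ends, values))        # whole array
print(logical_null_count(run_ends, values, 4, 3))  # logical positions 4..6
```

The count has to be recomputed from the runs for every slice, which is one reason a fixed stored value is simpler to keep consistent.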

Re: RLE array slicing

2022-09-15 Thread Tobias Zagorni
> {
>     length: 2
>     offset: 6
>     rle: {
>         length: 1 // actually physical length
>         offset: 2
>         buffer: [3, 5, 8]
>     }
>     values: {
>         length: 1
>         offset: 2
>         buffer: [5, 6, 7]
>     }
> }
> Does this make sense?
I think this is a valid way o
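The mapping from the logical slice to the physical offset and length in the quoted example can be sketched as a binary search. This assumes the `rle` buffer `[3, 5, 8]` holds cumulative run ends. The helper name `physical_range` is hypothetical and is not the Arrow C++ implementation.

```python
from bisect import bisect_left, bisect_right


def physical_range(run_ends, offset, length):
    """Return (physical_offset, physical_length) of the runs that cover
    logical positions [offset, offset + length)."""
    if length == 0:
        return 0, 0
    # First run whose end lies strictly beyond the logical offset.
    first = bisect_right(run_ends, offset)
    # Run containing the last logical position: the first run whose
    # end is >= offset + length.
    last = bisect_left(run_ends, offset + length)
    return first, last - first + 1


run_ends = [3, 5, 8]  # assumed cumulative run ends from the quoted example
values = [5, 6, 7]
phys_offset, phys_length = physical_range(run_ends, offset=6, length=2)
print(phys_offset, phys_length)
print(values[phys_offset:phys_offset + phys_length])
```

Under that assumption, the logical slice (offset 6, length 2) maps to physical offset 2 and physical length 1, matching the child offsets and lengths in the quoted structure.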

Re: PRs for RLE support

2022-09-14 Thread Tobias Zagorni
> I'm not sure I understand this, could you provide an example of the problem that the child array solves?

PRs for RLE support

2022-08-25 Thread Tobias Zagorni
Hello everyone, I recently implemented support for run-length encoding in Arrow C++. So far my implementation is split into different subtasks of ARROW-16771 (https://issues.apache.org/jira/browse/ARROW-16771). I have (draft) PRs available for: - general handling of RLE in Arrow C++, Type,

Re: [C++] Adding Run-Length Encoding to Arrow

2022-06-07 Thread Tobias Zagorni
I created a Jira for adding RLE as ARROW-16771, and draft PRs:
- https://github.com/apache/arrow/pull/13330 Encode/Decode functions (currently for fixed-width types only)
- https://github.com/apache/arrow/pull/1 For updating docs
Best, Tobias
On Tuesday, 31.05.2022 at 17:13 -0500

Re: [C++] Adding Run-Length Encoding to Arrow

2022-06-03 Thread Tobias Zagorni
On Friday, 03.06.2022 at 09:32 -0700, Micah Kornfield wrote:
> > Thinking about compatibility with existing software, RLE could possibly even be made an Extension Type that follows the layout of a struct of int32 and the encoded value type. I'm wondering whether this would be

Re: [C++] Adding Run-Length Encoding to Arrow

2022-06-03 Thread Tobias Zagorni
> Well, Arrow C++ does not have a notion of encoding distinct from the data type. Adding such a notion would risk breaking compatibility for all existing software that hasn't been upgraded to dispatch based on encoding.
Thinking about compatibility with existing software, RLE could possibl

Re: [C++] Adding Run-Length Encoding to Arrow

2022-06-01 Thread Tobias Zagorni
On Tuesday, 31.05.2022 at 12:41 -0700, Micah Kornfield wrote:
> > - Should we allow multiple runs of the same value following each other?
> > Otherwise we would either need a pass to correct this after a lot of operations, or make RLE-aware versions of their kernels.
> > Is there
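The "pass to correct this" mentioned in the quoted question could look roughly like the sketch below, which merges neighbouring runs that carry the same value. It assumes runs are stored as cumulative run ends plus one value per run. `merge_adjacent_runs` is a hypothetical helper, not a function from Arrow.

```python
def merge_adjacent_runs(run_ends, values):
    """Merge consecutive runs with equal values into a single run.

    run_ends: cumulative end positions, one per run (assumed layout).
    values:   one value per run.
    """
    merged_ends, merged_values = [], []
    for run_end, value in zip(run_ends, values):
        if merged_values and merged_values[-1] == value:
            # Same value as the previous run: extend that run instead.
            merged_ends[-1] = run_end
        else:
            merged_ends.append(run_end)
            merged_values.append(value)
    return merged_ends, merged_values


# Two runs of 5 followed by two runs of 6 collapse into one run each.
print(merge_adjacent_runs([2, 3, 5, 8], [5, 5, 6, 6]))
```

The pass is linear in the number of runs, so the trade-off raised in the thread is between paying this cost after operations and writing kernels that never produce adjacent duplicate runs.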

Re: [C++] Adding Run-Length Encoding to Arrow

2022-05-31 Thread Tobias Zagorni
Hi,
On Tuesday, 31.05.2022 at 21:12 +0200, Antoine Pitrou wrote:
> Hi,
> On 31/05/2022 at 20:24, Tobias Zagorni wrote:
> > Hi, I'm currently working on adding Run-Length encoding to arrow. I created a function to dictionary-encode arrays here (cur

[C++] Adding Run-Length Encoding to Arrow

2022-05-31 Thread Tobias Zagorni
Hi, I'm currently working on adding run-length encoding to Arrow. I created a function to dictionary-encode arrays here (currently only for fixed-length types): https://github.com/apache/arrow/compare/master...zagto:rle?expand=1 The general idea is that RLE data will be a nested data type, with a