To complete this thread, the documentation has been updated to clarify the intent[1].
Thank you all very much,
Andrew

[1] https://github.com/apache/arrow/pull/33831

On Mon, Jan 23, 2023 at 8:32 AM Raphael Taylor-Davies <r.taylordav...@googlemail.com.invalid> wrote:

> Hi Tobias,
>
> Thank you for clarifying this, makes sense to me
>
> Kind Regards,
>
> Raphael
>
> On 22/01/2023 16:15, Tobias Zagorni wrote:
> > Hi Raphael,
> >
> > I think this is indeed a documentation mistake, it should say 0!
> >
> > For exactly the reasons you mentioned, I determined that it is best
> > to leave the null count field always 0 for RLE arrays. This way it is
> > consistent with union types, at least.
> >
> > RunLengthEncoded data should not contain a null mask by itself. The
> > idea so far is that Null is just one of the possible values for a run.
> >
> > (if we were to allow the RLE array parent to have an additional null
> > mask, the null count field would represent that - there seems to be a
> > general assumption in Arrow code that a non-zero (or array length for
> > the NULL) null count means the presence of the standard null mask)
> >
> > Best,
> > Tobias
> >
> > On 2023/01/22 15:12:32 Raphael Taylor-Davies wrote:
> >> Hi,
> >>
> >> Apologies if I am rehashing something that has already been discussed or
> >> is documented elsewhere, but reading the documentation of the Run-Length
> >> encoding [1] I noticed that the parent null count can be non-zero [2].
> >> This is somewhat surprising to me for a couple of reasons:
> >>
> >> - This is inconsistent with how it is handled for other nested types
> >>   like dictionaries, structs, etc... where a null count is solely the
> >>   number of nulls in the mask of that Array
> >> - Codepaths that use null counts to infer validity mask properties such
> >>   as presence, bit counts, etc... will no longer work
> >> - This null count can only be recomputed in the context of the run-ends,
> >>   implying codepaths that slice ArrayData or otherwise manipulate
> >>   ArrayData directly must be run-length aware
> >>
> >> This leads to a couple of questions
> >>
> >> - Is this a documentation mistake or is the null count of RunEndEncoded
> >>   ArrayData determined by its children
> >> - Can a RunEndEncoded ArrayData contain a null mask itself,
> >>   independently of its runs, much like dictionary arrays can
> >>
> >> Any clarifications would be most welcome
> >>
> >> [1]: https://arrow.apache.org/docs/dev/format/Columnar.html#run-end-encoded-layout
> >> [2]: https://github.com/apache/arrow/pull/13333/files#r1083470362
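
For anyone reading this thread later, here is a rough sketch of the distinction discussed above: the parent's null count field stays 0, while the number of logically null elements has to be derived from the run ends and the values child. This is plain Python and does not use any Arrow library; `PARENT_NULL_COUNT` and `logical_null_count` are hypothetical names made up for illustration, and `None` stands in for a null entry in the values child.

```python
from bisect import bisect_right
from typing import Optional, Sequence

# Per the thread: the run-end encoded parent carries no null mask of its own,
# so its null count field is always 0. (Hypothetical constant, for illustration.)
PARENT_NULL_COUNT = 0


def logical_null_count(
    run_ends: Sequence[int],
    values: Sequence[Optional[object]],
    offset: int = 0,
    length: Optional[int] = None,
) -> int:
    """Count logically null elements in [offset, offset + length).

    run_ends[i] is the exclusive logical end index of run i, and values[i]
    is the value repeated by that run (None meaning null).
    """
    total = run_ends[-1] if run_ends else 0
    if length is None:
        length = total - offset
    end = offset + length
    if offset < 0 or length < 0 or end > total:
        raise ValueError("slice is out of bounds")

    nulls = 0
    # Index of the first run whose end lies strictly beyond `offset`.
    i = bisect_right(run_ends, offset)
    start = offset
    while start < end:
        # Clip the current run to the requested slice.
        run_end = min(run_ends[i], end)
        if values[i] is None:
            nulls += run_end - start
        start = run_end
        i += 1
    return nulls


# Logical contents: a a a NULL NULL b b b b
run_ends = [3, 5, 9]
values = ["a", None, "b"]
whole = logical_null_count(run_ends, values)
sliced = logical_null_count(run_ends, values, offset=4, length=3)
```

The slice case shows the third point in the original question: the count for a slice cannot be obtained without walking the run ends, which is one reason to keep the parent field fixed at 0 and leave this computation to run-end aware code.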