Hi Raphael,

I think this is indeed a documentation mistake; it should say 0!
For exactly the reasons you mentioned, I determined that it is best to
leave the null count field always 0 for RLE arrays. This way it is
consistent with union types, at least.

RunLengthEncoded data should not contain a null mask by itself. The idea
so far is that Null is just one of the possible values for a run. (If we
were to allow the RLE array parent to have an additional null mask, the
null count field would represent that. There seems to be a general
assumption in Arrow code that a non-zero null count (or the array length,
for the NULL type) means the presence of the standard null mask.)

Best,
Tobias

On 2023/01/22 15:12:32 Raphael Taylor-Davies wrote:
> Hi,
>
> Apologies if I am rehashing something that has already been discussed or
> is documented elsewhere, but reading the documentation of the Run-Length
> encoding [1] I noticed that the parent null count can be non-zero [2].
>
> This is somewhat surprising to me for a couple of reasons:
>
> - This is inconsistent with how it is handled for other nested types
> like dictionaries, structs, etc... where a null count is solely the
> number of nulls in the mask of that Array
> - Codepaths that use null counts to infer validity mask properties such
> as presence, bit counts, etc... will no longer work
> - This null count can only be recomputed in the context of the run-ends,
> implying codepaths that slice ArrayData or otherwise manipulate
> ArrayData directly must be run-length aware
>
> This leads to a couple of questions
>
> - Is this a documentation mistake or is the null count of RunEndEncoded
> ArrayData determined by its children
> - Can a RunEndEncoded ArrayData contain a null mask itself,
> independently of its runs, much like dictionary arrays can
>
> Any clarifications would be most welcome
>
> [1]:
> https://arrow.apache.org/docs/dev/format/Columnar.html#run-end-encoded-layout
> [2]: https://github.com/apache/arrow/pull/13333/files#r1083470362
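
To make the distinction concrete, here is a minimal sketch in plain Python.
It is not Arrow's API: the names logical_null_count, run_ends, values and
PARENT_NULL_COUNT are made up for illustration, and None stands in for a
null run value. It assumes the parent has no null mask, so its null count
is 0, while the number of logical nulls can only be derived by walking the
run ends.

```python
from bisect import bisect_right

# Hypothetical stand-in for a run-end encoded array: run i covers the
# logical positions [run_ends[i-1], run_ends[i]) and repeats values[i].
run_ends = [3, 5, 9]
values = ["a", None, "b"]  # None models a null run value

# Assumption from this thread: the parent carries no null mask,
# so its null count field is always 0.
PARENT_NULL_COUNT = 0


def logical_null_count(run_ends, values, offset=0, length=None):
    """Count logical nulls in [offset, offset + length) by walking the runs."""
    total = run_ends[-1] if run_ends else 0
    if length is None:
        length = total - offset
    end = offset + length

    count = 0
    start = offset
    # First run whose end lies strictly after the slice offset.
    i = bisect_right(run_ends, offset)
    while i < len(run_ends) and start < end:
        run_end = min(run_ends[i], end)
        if values[i] is None:
            count += run_end - start
        start = run_end
        i += 1
    return count
```

Slicing changes the logical count, for example logical_null_count(run_ends,
values, offset=4, length=3) only sees one position of the null run, whereas
the parent null count stays 0 and needs no recomputation.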