I agree, but what about the breaking change this would introduce?
On 2026/01/30 17:50:53 Vignesh Siva wrote:
> I firmly support Option 1 as the sole specification that maintains the
> Arrow format's structural integrity and performance goals.
>
> 1. The "single source of truth" argument
> If a field is declared as non-nullable, the architecture expects the child
> array to not allocate or manage a validity bitmap at all. If we choose
> Option 2 or 3, we basically force every "non-nullable" child of a nullable
> struct to bear the burden of a validity buffer "just in case" the parent is
> null. This bypasses the 'non-nullable' flag's primary memory and CPU
> optimizations.
>
> 2. The "Master Mask" Concept
> We should treat the Parent Struct's validity buffer as a master mask.
>
> If the struct is null at index, the data in all child arrays at index is
> logically undefined
> It shouldn't matter if the child has a "null" there or not, because a
> compliant reader must check the parent's bit first.
>
> 3. Why this is the clearest path forward:
> For Developers: It simplifies kernels. If a child is non-nullable, the
> kernel can use high-speed SIMD instructions to process the data without
> constantly branching to check a child null map that is redundant anyway.
> For Memory: It saves significant space. In deep nested structures, forcing
> every child to replicate the parent's null pattern (Option 2) would lead to
> massive, redundant memory bloat.
> For Consistency: It stops "schema lying." If a field is marked
> non-nullable, its own internal state should remain pure.
>
> Conclusion:
> Option 1 respects the hierarchy. The parent manages the "existence" of the
> row; the child manages the "value" of the data.
>
> Thanks, Vignesh.
>
>
> On Fri, 30 Jan 2026 at 20:57, Aldrin <[email protected]> wrote:
>
> > Just a personal thought, but I think option 3 is valid in a scenario where
> > the column has been filtered and then changed to non null. I believe this
> > enables some filtering cases to be zero-copy?
> >
> > I could be confusing how child arrays could be referenced though.
> >
> >
> > # ------------------------------
> > # Aldrin
> >
> > https://github.com/drin/
> > https://gitlab.com/octalene
> > https://keybase.io/octalene
> >
> > Sent from Proton Mail for iOS.
> >
> > -------- Original Message --------
> > On Friday, 01/30/26 at 06:19 Weston Pace <[email protected]> wrote:
> > I agree with Raphael that this is probably too late to change. There are
> > many tools out there that produce Arrow data now and they are not all going
> > to conform to definition 1. In fact, as Antoine points out, many tools do
> > not even guarantee validity at all (a batch created with pyarrow may have a
> > field marked non-nullable that has nulls).
> >
> > As a result, my personal stance has been to ignore the nullability flag on
> > all external data and independently determine whether an array has or does
> > not have nulls.
> >
> > > the problem I have is that this is an undefined behavior, the accepted
> > > behavior can be (I don't think this should be the behavior) that there
> > > should be no requirement on the child nulls, and it can have nulls anywhere
> > > they want even if the parent does not have null there.
> >
> > There is very little mention of the nullable flag in the spec at all. The
> > only thing I see is:
> >
> > > Whether the field is semantically nullable. While this has no bearing on
> > > the array’s physical layout,
> > > many systems distinguish nullable and non-nullable fields and we want to
> > > allow them to preserve
> > > this metadata to enable faithful schema round trips.
> >
> > Since the spec explicitly states "this has no bearing on the array's
> > physical layout" I think your accepted behavior could, in fact, be seen as
> > valid, if not wise.
> >
> > That being said, my view might be a little out there :). I am content if
> > we want to consolidate on a definition. I think definition 3 is the most
> > flexible and likely to be adopted.
> >
> > On Thu, Jan 29, 2026 at 11:55 AM Raz Luvaton <[email protected]> wrote:
> >
> > > > If something had been
> > > > standardised at the start that would be one thing, but retroactively
> > > > adding schema restrictions now is likely to break existing workflows,
> > > > and is therefore probably best avoided.
> > >
> > > the problem I have is that this is an undefined behavior, the accepted
> > > behavior can be (I don't think this should be the behavior) that there
> > > should be no requirement on the child nulls, and it can have nulls anywhere
> > > they want even if the parent does not have null there.
> > >
> > > On 2026/01/29 19:40:01 Raphael Taylor-Davies wrote:
> > > > For what it is worth arrow-rs takes the most permission interpretation 3
> > > > - we only reject unambiguously malformed StructArray. For further
> > > > context I believe the instigator of this email thread is [1].
> > > >
> > > > I think the main question with taking one of the more strict
> > > > interpretations is what value is assigned to "masked" values when
> > > > parsing from some other format, such as JSON or parquet, that doesn't
> > > > encode them. Some people think it should be NULL, others arbitrary. For
> > > > example, when arrow-rs changed the parquet reader from using NULL to
> > > > arbitrary it was actually reported as a bug [2].
> > > >
> > > > My 2 cents is that this is a bit like the question around whether
> > > > StructArray can have fields with the same name. If something had been
> > > > standardised at the start that would be one thing, but retroactively
> > > > adding schema restrictions now is likely to break existing workflows,
> > > > and is therefore probably best avoided.
> > > >
> > > > Kind Regards,
> > > >
> > > > Raphael
> > > >
> > > > [1]: https://github.com/apache/arrow-rs/issues/9302
> > > > [2]: https://github.com/apache/arrow-rs/issues/7119
> > > >
> > > > On 29/01/2026 19:10, Raz Luvaton wrote:
> > > > > Currently there is ambiguity on what the validity buffer for non nullable
> > > > > field of a nullable struct can be.
> > > > >
> > > > > Lets take for example the following type:
> > > > > ```
> > > > > nullable StructArray with non nullable field Int32
> > > > > ```
> > > > > The struct validity is: valid, null, null, valid.
> > > > >
> > > > > which of the following should be:
> > > > > 1. The child array (the int32 array) FORBIDDEN from having nulls at all
> > > > > (i.e. in our example the validity buffer for the child must be valid,
> > > > > valid, valid, valid) as the field is marked as non nullable?
> > > > > 2. The child array REQUIRED to have nulls at the same positions of the
> > > > > struct nulls, i.e. the validity buffer for the child MUST be valid, null,
> > > > > null, valid in our example?
> > > > > 3. The child array MAY have nulls but it is FORBIDDEN to have nulls where
> > > > > the struct does not have nulls, i.e. it can't have null, null, valid, valid
> > > > > but it can have valid, null, valid, valid in our example.
> > > > >
> > > > > I would argue that 1 is the correct and expected requirement, as the field
> > > > > is marked as non nullable.
> > > > >
> > > > > The chosen behavior will be applicable for other nested types as well
> > > > >
> > > > >
> > > > > Thanks, Raz Luvaton
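
For reference, the three options from the original question can be written as simple checks on the parent and child validity masks. This is only a rough sketch in plain Python, not Arrow code: `satisfies_option` is a hypothetical helper name, validity is modeled as lists of booleans (True = valid) rather than Arrow bitmaps, and option 2 is read here as "the child mask equals the parent mask", which is my assumption about what was meant.

```python
from typing import List


def satisfies_option(parent: List[bool], child: List[bool], option: int) -> bool:
    """Check a non-nullable child's validity against its nullable parent struct.

    Hypothetical helper for illustration only; True means "valid" at that index.
    """
    if len(parent) != len(child):
        raise ValueError("parent and child validity must have the same length")
    if option == 1:
        # Option 1: the child is forbidden from having any nulls.
        return all(child)
    if option == 2:
        # Option 2 (assumed reading): the child is null exactly where the parent is null.
        return child == parent
    if option == 3:
        # Option 3: the child may be null only where the parent is null.
        return all(c or not p for p, c in zip(parent, child))
    raise ValueError("option must be 1, 2 or 3")


# The example from the thread: struct validity is valid, null, null, valid.
parent = [True, False, False, True]

all_valid = [True, True, True, True]       # accepted by options 1 and 3
mirrors_parent = [True, False, False, True]  # accepted by options 2 and 3
partial = [True, False, True, True]        # accepted by option 3 only
stray_null = [False, False, True, True]    # child null where the parent is valid

for name, child in [("all_valid", all_valid), ("mirrors_parent", mirrors_parent),
                    ("partial", partial), ("stray_null", stray_null)]:
    accepted = [opt for opt in (1, 2, 3) if satisfies_option(parent, child, opt)]
    print(name, "accepted by options", accepted)
```

Written this way, any mask that passes option 1 or option 2 also passes option 3, which is the sense in which option 3 is the most permissive reading.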
