Re: [DISCUSS] add to the specification a clarification of validity buffer for non nullable field of nullable StructArray

Vignesh Siva Fri, 06 Feb 2026 04:33:13 -0800

Hi Raz,

I agree that "Semantic Equality" is what matters most to the end user. A
null parent should logically mask all child data.


Regarding the "breaking change" risk, I propose adopting the “Liberal
Reader, Conservative Writer” principle:

1. Liberal Reader (Option 3)
To maintain compatibility with existing tools (PyArrow, Go, etc.), we
should accept arrays where “non-nullable” children contain redundant nulls
at the parent’s null positions.

2. Conservative Writer (Option 1)
We should recommend Option 1 as the canonical / optimized form. Encouraging
producers to skip the child validity bitmap entirely preserves the memory
and SIMD benefits for modern engines like Comet and DataFusion.

By formalizing Option 1 as the preferred layout while allowing Option 3 for
compatibility, we can evolve the specification without breaking the
existing ecosystem.
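
To make the two roles concrete, here is a minimal sketch in plain Python. It is only an illustration, not an Arrow API: validity is modelled as a list of booleans (True = valid), `None` stands for "no child validity bitmap", and the names `reader_accepts` and `writer_normalize` are hypothetical.

```python
from typing import Optional, Sequence

Validity = Optional[Sequence[bool]]  # None models "no validity bitmap"


def reader_accepts(parent_valid: Sequence[bool], child_valid: Validity) -> bool:
    """Liberal reader (Option 3): a child null is tolerated only where
    the parent struct is also null."""
    if child_valid is None:
        # Option 1 layout: the child carries no bitmap, nothing to check.
        return True
    if len(parent_valid) != len(child_valid):
        raise ValueError("parent and child validity lengths differ")
    # Each slot must satisfy: child is valid OR parent is null.
    return all(c or not p for p, c in zip(parent_valid, child_valid))


def writer_normalize(parent_valid: Sequence[bool], child_valid: Validity) -> Validity:
    """Conservative writer (Option 1): emit no child bitmap, after checking
    that any child nulls were redundant with the parent's nulls."""
    if not reader_accepts(parent_valid, child_valid):
        raise ValueError("child has a null where the parent struct is valid")
    return None


# Raz's example: struct validity is valid, null, null, valid.
parent = [True, False, False, True]
print(reader_accepts(parent, [True, False, True, True]))   # allowed under Option 3
print(reader_accepts(parent, [False, False, True, True]))  # rejected: null under a valid parent
```

The reader never needs the child bitmap to decide whether a row exists; it only checks that the child does not contradict the parent.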

Thanks,
Vignesh

On Wed, 4 Feb 2026 at 15:22, Raz Luvaton <[email protected]> wrote:

> I agree, but what about the breaking change that this introduces?
>
> On 2026/01/30 17:50:53 Vignesh Siva wrote:
> > I firmly support Option 1 as the sole specification that maintains the
> > Arrow format's structural integrity and performance goals.
> >
> > 1. The "single source of truth" argument
> > If a field is declared as non-nullable, the architecture expects the child
> > array to not allocate or manage a validity bitmap at all. If we choose
> > Option 2 or 3, we basically force every "non-nullable" child of a nullable
> > struct to bear the burden of a validity buffer "just in case" the parent is
> > null. This bypasses the 'non-nullable' flag's primary memory and CPU
> > optimizations.
> >
> > 2. The "Master Mask" Concept
> > We should treat the Parent Struct's validity buffer as a master mask.
> >
> > If the struct is null at index i, the data in all child arrays at index i
> > is logically undefined. It shouldn't matter whether the child has a "null"
> > there or not, because a compliant reader must check the parent's bit first.
> >
> > 3. Why this is the clearest path forward:
> > For Developers: It simplifies kernels. If a child is non-nullable, the
> > kernel can use high-speed SIMD instructions to process the data without
> > constantly branching to check a child null map that is redundant anyway.
> > For Memory: It saves significant space. In deep nested structures, forcing
> > every child to replicate the parent's null pattern (Option 2) would lead to
> > massive, redundant memory bloat.
> > For Consistency: It stops "schema lying." If a field is marked
> > non-nullable, its own internal state should remain pure.
> >
> > Conclusion:
> > Option 1 respects the hierarchy. The parent manages the "existence" of the
> > row; the child manages the "value" of the data.
> >
> >   Thanks, Vignesh.
> >
> >
> > On Fri, 30 Jan 2026 at 20:57, Aldrin <[email protected]> wrote:
> >
> > > Just a personal thought, but I think option 3 is valid in a scenario
> > > where the column has been filtered and then changed to non null. I
> > > believe this enables some filtering cases to be zero-copy?
> > >
> > > I could be confusing how child arrays could be referenced though.
> > >
> > >
> > > # ------------------------------
> > > # Aldrin
> > >
> > > https://github.com/drin/
> > > https://gitlab.com/octalene
> > > https://keybase.io/octalene
> > >
> > > Sent from Proton Mail for iOS.
> > >
> > > -------- Original Message --------
> > > On Friday, 01/30/26 at 06:19 Weston Pace <[email protected]> wrote:
> > > I agree with Raphael that this is probably too late to change.  There are
> > > many tools out there that produce Arrow data now and they are not all
> > > going to conform to definition 1.  In fact, as Antoine points out, many
> > > tools do not even guarantee validity at all (a batch created with pyarrow
> > > may have a field marked non-nullable that has nulls).
> > >
> > > As a result, my personal stance has been to ignore the nullability flag
> > > on all external data and independently determine whether an array has or
> > > does not have nulls.
> > >
> > > > the problem I have is that this is an undefined behavior, the accepted
> > > > behavior can be (I don't think this should be the behavior) that there
> > > > should be no requirement on the child nulls, and it can have nulls
> > > > anywhere they want even if the parent does not have null there.
> > >
> > > There is very little mention of the nullable flag in the spec at all.
> > > The only thing I see is:
> > >
> > > > Whether the field is semantically nullable. While this has no bearing
> > > > on the array’s physical layout, many systems distinguish nullable and
> > > > non-nullable fields and we want to allow them to preserve this metadata
> > > > to enable faithful schema round trips.
> > >
> > > Since the spec explicitly states "this has no bearing on the array's
> > > physical layout" I think your accepted behavior could, in fact, be seen
> > > as valid, if not wise.
> > >
> > > That being said, my view might be a little out there :).  I am content
> > > if we want to consolidate on a definition.  I think definition 3 is the
> > > most flexible and likely to be adopted.
> > >
> > > On Thu, Jan 29, 2026 at 11:55 AM Raz Luvaton <[email protected]> wrote:
> > >
> > > > > > If something had been standardised at the start that would be one
> > > > > > thing, but retroactively adding schema restrictions now is likely
> > > > > > to break existing workflows, and is therefore probably best avoided.
> > > >
> > > > the problem I have is that this is an undefined behavior, the accepted
> > > > behavior can be (I don't think this should be the behavior) that there
> > > > should be no requirement on the child nulls, and it can have nulls
> > > > anywhere they want even if the parent does not have null there.
> > > >
> > > > On 2026/01/29 19:40:01 Raphael Taylor-Davies wrote:
> > > > > For what it is worth arrow-rs takes the most permissive
> > > > > interpretation, 3 - we only reject unambiguously malformed
> > > > > StructArray. For further context I believe the instigator of this
> > > > > email thread is [1].
> > > > >
> > > > > I think the main question with taking one of the more strict
> > > > > interpretations is what value is assigned to "masked" values when
> > > > > parsing from some other format, such as JSON or parquet, that doesn't
> > > > > encode them. Some people think it should be NULL, others arbitrary.
> > > > > For example, when arrow-rs changed the parquet reader from using NULL
> > > > > to arbitrary it was actually reported as a bug [2].
> > > > >
> > > > > My 2 cents is that this is a bit like the question around whether
> > > > > StructArray can have fields with the same name. If something had been
> > > > > standardised at the start that would be one thing, but retroactively
> > > > > adding schema restrictions now is likely to break existing workflows,
> > > > > and is therefore probably best avoided.
> > > > >
> > > > > Kind Regards,
> > > > >
> > > > > Raphael
> > > > >
> > > > > [1]: https://github.com/apache/arrow-rs/issues/9302
> > > > > [2]: https://github.com/apache/arrow-rs/issues/7119
> > > > >
> > > > > On 29/01/2026 19:10, Raz Luvaton wrote:
> > > > > > Currently there is ambiguity on what the validity buffer for non
> > > > > > nullable field of a nullable struct can be.
> > > > > >
> > > > > > Let's take for example the following type:
> > > > > > ```
> > > > > > nullable StructArray with non nullable field Int32
> > > > > > ```
> > > > > > The struct validity is: valid, null, null, valid.
> > > > > >
> > > > > > Which of the following should hold?
> > > > > > 1. The child array (the int32 array) is FORBIDDEN from having nulls
> > > > > > at all (i.e. in our example the validity buffer for the child must
> > > > > > be valid, valid, valid, valid) as the field is marked as non
> > > > > > nullable?
> > > > > > 2. The child array is REQUIRED to have nulls at the same positions
> > > > > > as the struct nulls, i.e. the validity buffer for the child MUST be
> > > > > > valid, null, null, valid in our example?
> > > > > > 3. The child array MAY have nulls but it is FORBIDDEN to have nulls
> > > > > > where the struct does not have nulls, i.e. it can't have null, null,
> > > > > > valid, valid but it can have valid, null, valid, valid in our
> > > > > > example.
> > > > > >
> > > > > > I would argue that 1 is the correct and expected requirement, as
> > > > > > the field is marked as non nullable.
> > > > > >
> > > > > > The chosen behavior will be applicable for other nested types as
> > > > > > well
> > > > > >
> > > > > >
> > > > > > Thanks, Raz Luvaton
> > > > > >
> > > > >
> > > >
> > >
> >
>
