schenksj opened a new issue, #4432:
URL: https://github.com/apache/datafusion-comet/issues/4432

   ### Describe the bug
   
   `GetStructField` (`native/spark-expr/src/struct_funcs/get_struct_field.rs`) 
extracts a struct field by returning the child column directly, without 
applying the parent struct's null mask:
   
   ```rust
   ColumnarValue::Array(array) => {
       let struct_array = array
           .as_any()
           .downcast_ref::<StructArray>()
           .expect("A struct is expected");
       Ok(ColumnarValue::Array(Arc::clone(struct_array.column(self.ordinal))))
   }
   ```
   
   In Arrow, a `StructArray`'s child arrays carry their own validity, 
independent of the parent struct's null buffer. At a row where the struct 
itself is null, the child buffer can still hold a non-null value. Returning the 
child verbatim therefore reads a field of a NULL struct as non-null, which 
violates Spark semantics (a field of a null struct is null). Concretely, 
`isnotnull(structCol.field)` returns `true` for a row whose `structCol` is null.
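
The leak can be illustrated with a small model that does not use the `arrow` crate. Everything below is hypothetical and for illustration only: `ToyStructColumn`, `get_field_naive`, and `get_field_masked` are invented names, and the struct is reduced to one `i32` field with validity stored as plain booleans rather than Arrow buffers.

```rust
/// Simplified stand-in for an Arrow struct column with a single i32 field
/// (hypothetical type, not part of arrow or Comet).
/// `parent_valid[i]` is the struct's own validity for row i; `child[i]` is
/// the child slot, which may hold a value even where the parent is null.
struct ToyStructColumn {
    parent_valid: Vec<bool>,
    child: Vec<Option<i32>>,
}

/// Mirrors the behavior described above: hand back the child unchanged,
/// ignoring the parent's validity.
fn get_field_naive(col: &ToyStructColumn) -> Vec<Option<i32>> {
    col.child.clone()
}

/// Spark semantics: a field of a null struct is null, so the child value
/// is kept only where the parent struct itself is valid.
fn get_field_masked(col: &ToyStructColumn) -> Vec<Option<i32>> {
    col.parent_valid
        .iter()
        .zip(col.child.iter())
        .map(|(&valid, &value)| if valid { value } else { None })
        .collect()
}
```

For a column whose second row has a null parent but a populated child slot, the naive version reports that row's field as non-null, while the masked version reports it as null.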
   
   This is a data-correctness bug for any query that accesses a field of a 
nullable struct read from a Parquet file in which a logically null struct 
column still has a populated child buffer.
   
   ### Steps to reproduce
   
   Read such a parquet file and filter on a struct field:
   
   ```sql
   SELECT * FROM t WHERE structCol.field IS NOT NULL
   ```
   
   Comet returns rows where `structCol` is null.
   
   It surfaces in Delta: `CheckpointProvider.readV2ActionsFromParquetCheckpoint` runs
   `... .where("checkpointMetadata.version is not null or sidecar.path is not null")`
   over a checkpoint where those structs are all null, expecting zero rows; the 
leak yields `scala.MatchError: (null, null)` (Delta's 
`DeltaIncrementalSetTransactionsSuite`).
   
   Simple structs written by `createDataFrame` happen to align child validity 
with the parent, so the bug only manifests when the child buffer is populated 
under a null parent (e.g. a coalesce-rewritten checkpoint).
   
   ### Expected behavior
   
   A field of a NULL struct is NULL.
   
   ### Additional context
   
   Found while working on the contrib Delta native scan (#4366). The fix — 
union the parent struct's null mask into the extracted child (null where the 
struct is null OR the child is null), plus a unit test — is included in PR 
#4366 (`native/spark-expr/src/struct_funcs/get_struct_field.rs`). It is 
independent of Delta and could be reviewed/cherry-picked as a standalone core 
fix.
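
As a rough sketch of what "union the null mask" means at the bitmap level (this is not the code from PR #4366): Arrow validity bitmaps set a bit where a value is valid, so "null where the struct is null OR the child is null" amounts to a bitwise AND of the two validity bitmaps. The helper names `combine_validity` and `is_valid` are hypothetical, and the sketch assumes both bitmaps cover the same number of rows with no offset.

```rust
/// Combine two validity bitmaps (bit set = value is valid).
/// A row stays valid only if it is valid in both the parent and the child,
/// which is a bitwise AND. `None` stands for "no null buffer", meaning
/// every row is valid. Assumes both slices have the same length.
fn combine_validity(parent: Option<&[u8]>, child: Option<&[u8]>) -> Option<Vec<u8>> {
    match (parent, child) {
        (None, None) => None,
        (Some(p), None) => Some(p.to_vec()),
        (None, Some(c)) => Some(c.to_vec()),
        (Some(p), Some(c)) => Some(p.iter().zip(c.iter()).map(|(a, b)| a & b).collect()),
    }
}

/// Read bit `i`, least-significant bit first within each byte
/// (the bit order Arrow uses for validity bitmaps).
fn is_valid(bitmap: &[u8], i: usize) -> bool {
    bitmap[i / 8] & (1 << (i % 8)) != 0
}
```

In the `arrow` crate, `NullBuffer::union` appears to provide this combination for real null buffers; whether the fix in #4366 uses it is not stated in this issue.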
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

