kosiew opened a new pull request, #20840:
URL: https://github.com/apache/datafusion/pull/20840

   
   ## Which issue does this PR close?
   
   * Part of #20835
   
   ## Rationale for this change
   
   DataFusion currently implements additive schema-evolution compatibility for 
`Struct` columns using `validate_struct_compatibility`. However, this logic 
only applies when the column itself is a `Struct`. When a `Struct` is wrapped 
inside container types such as `List`, `LargeList`, `FixedSizeList`, or `Map`, 
the planner falls back to Arrow's `can_cast_types`.
   
   This behavior treats the container as opaque and causes legitimate schema 
evolutions (for example, adding a nullable field to a struct) to be rejected if 
the struct is nested inside a container.
   
   This PR introduces a recursive datatype compatibility validator that 
recognizes when container types wrap a `Struct` and applies the same additive 
schema-evolution semantics to the nested structure. This ensures consistent 
behavior between top-level structs and structs nested inside supported 
container types.
   
   ## What changes are included in this PR?
   
   ### Recursive compatibility validation
   
   * Introduce `validate_data_type_compatibility` to recursively validate 
compatibility between source and target datatypes.
   * Support recursive validation for:
   
     * `Struct`
     * `List`
     * `LargeList`
     * `FixedSizeList`
     * `Map`
   * Add helper `requires_recursive_compatibility_validation` to determine when 
recursive validation should be applied.
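
As a rough illustration of the recursion described above, here is a minimal sketch over a toy type model. The `DataType` and `Field` definitions, `can_cast_leaf`, and `validate_compat` are hypothetical stand-ins, not the arrow-rs types or the PR's actual `validate_data_type_compatibility`. The rule that a missing target field is accepted only when it is nullable is an assumption based on the "additive" semantics this PR describes.

```rust
// Toy stand-ins for Arrow's DataType/Field (assumption: NOT the real arrow-rs types).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Int64,
    Utf8,
    Struct(Vec<Field>),
    List(Box<Field>),
    LargeList(Box<Field>),
    FixedSizeList(Box<Field>, i32),
    Map(Box<Field>), // the entries field; only its type is inspected here
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
}

fn field(name: &str, data_type: DataType, nullable: bool) -> Field {
    Field { name: name.to_string(), data_type, nullable }
}

// Hypothetical stand-in for the Arrow cast check used on non-nested types.
fn can_cast_leaf(source: &DataType, target: &DataType) -> bool {
    source == target || matches!((source, target), (DataType::Int32, DataType::Int64))
}

// Recursively checks that `source` can evolve into `target`.
fn validate_compat(source: &DataType, target: &DataType) -> Result<(), String> {
    use DataType::*;
    match (source, target) {
        // Structs: match target fields by name; a missing field must be nullable.
        (Struct(src), Struct(tgt)) => {
            for t in tgt {
                match src.iter().find(|s| s.name == t.name) {
                    Some(s) => validate_compat(&s.data_type, &t.data_type)?,
                    None if t.nullable => {}
                    None => {
                        return Err(format!("missing non-nullable field '{}'", t.name));
                    }
                }
            }
            Ok(())
        }
        // Containers of the same kind: recurse into the element/entries type.
        (List(s), List(t)) | (LargeList(s), LargeList(t)) | (Map(s), Map(t)) => {
            validate_compat(&s.data_type, &t.data_type)
        }
        // Fixed-size lists must also agree on their length.
        (FixedSizeList(s, n), FixedSizeList(t, m)) => {
            if n != m {
                return Err(format!("FixedSizeList size mismatch: {n} vs {m}"));
            }
            validate_compat(&s.data_type, &t.data_type)
        }
        // Everything else falls back to the leaf cast check.
        _ if can_cast_leaf(source, target) => Ok(()),
        _ => Err(format!("Cannot cast field: {source:?} -> {target:?}")),
    }
}
```

In this model, a `List<Struct{a: Int32}>` source validates against a `List<Struct{a: Int64, b: Utf8 (nullable)}>` target, while two `FixedSizeList` types with different sizes are rejected before their element types are compared.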
   
   ### Planner integration
   
   * Update `schema_rewriter.rs` to use the new recursive compatibility 
validation during physical expression adaptation.
   * Preserve existing Arrow casting behavior for datatypes that do not require 
recursive validation.
   
   ### Runtime casting support for containers
   
   * Extend nested column casting to support:
   
     * `List`
     * `LargeList`
     * `FixedSizeList`
     * `Map`
   * Introduce reusable helper `cast_container` to unify container casting 
logic.
   * Add specific casting helpers:
   
     * `cast_list_column`
     * `cast_fixed_size_list_column`
     * `cast_map_column`
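
The idea behind a shared container helper can be sketched as follows: a list-like column keeps its offsets and validity untouched, and only the child values are cast. `ListColumn` and `cast_list_sketch` are hypothetical names for a toy model. This does not reproduce the PR's `cast_container` signature, which is not shown in this description.

```rust
// Toy list column (assumption: a simplified model of Arrow's list layout).
// Row i holds values[offsets[i]..offsets[i + 1]]; validity[i] marks non-null rows.
#[derive(Debug, Clone, PartialEq)]
struct ListColumn<T> {
    offsets: Vec<usize>,
    values: Vec<T>,
    validity: Vec<bool>,
}

// Casts only the child values; the container's offsets and validity are reused as-is.
fn cast_list_sketch<S, T, E>(
    source: &ListColumn<S>,
    cast_child: impl Fn(&[S]) -> Result<Vec<T>, E>,
) -> Result<ListColumn<T>, E> {
    let values = cast_child(&source.values)?;
    Ok(ListColumn {
        offsets: source.offsets.clone(),
        values,
        validity: source.validity.clone(),
    })
}
```

Passing the child cast as a closure is one way to let list, large-list, fixed-size-list, and map variants share the same wrapper logic. Whether the PR structures it this way is an assumption.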
   
   ### Struct casting refactor
   
   * Simplify `cast_struct_column` implementation by iterating over target 
fields and mapping source fields by name.
   * Preserve struct null buffers while filling missing fields with null arrays 
when allowed.
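
A minimal sketch of the target-driven approach, using a toy column model rather than Arrow arrays: iterate over the target fields, look each one up in the source by name, fill a missing nullable field with nulls, and carry the struct-level validity over unchanged. `StructColumn`, `TargetField`, and `cast_struct_sketch` are hypothetical names, and the per-field type cast is omitted for brevity.

```rust
use std::collections::HashMap;

// Toy child column: one optional value per row (assumption: not an Arrow array).
type Column = Vec<Option<i64>>;

// Toy struct column: named child columns plus a struct-level validity mask.
struct StructColumn {
    fields: Vec<(String, Column)>,
    validity: Vec<bool>,
}

struct TargetField {
    name: String,
    nullable: bool,
}

fn cast_struct_sketch(
    source: &StructColumn,
    target: &[TargetField],
) -> Result<StructColumn, String> {
    // Index the source children by name so the target order drives the output.
    let by_name: HashMap<&str, &Column> = source
        .fields
        .iter()
        .map(|(name, column)| (name.as_str(), column))
        .collect();
    let len = source.validity.len();

    let mut fields = Vec::with_capacity(target.len());
    for t in target {
        let column = match by_name.get(t.name.as_str()).copied() {
            // Field exists in the source: reuse it (a real cast would convert types here).
            Some(existing) => existing.clone(),
            // Field was added by the target schema: fill with nulls if allowed.
            None if t.nullable => vec![None; len],
            None => {
                return Err(format!(
                    "Cannot cast field: missing non-nullable field '{}'",
                    t.name
                ));
            }
        };
        fields.push((t.name.clone(), column));
    }

    // The struct-level null buffer is preserved as-is.
    Ok(StructColumn { fields, validity: source.validity.clone() })
}
```

Source fields that the target does not mention are simply dropped in this sketch, because only target fields are visited.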
   
   ### Error message improvements
   
   * Standardize error messages from `"Cannot cast struct field"` to `"Cannot 
cast field"` for consistency across nested contexts.
   
   ### Tests
   
   Added comprehensive tests covering:
   
   * Recursive compatibility validation
   
     * `List<Struct>` with additive schema evolution
     * `Map<_, Struct>` nested struct compatibility
     * `FixedSizeList` size mismatch detection
   
   * Nested casting behavior
   
     * Casting `List<Struct>` where the target struct adds a nullable field
     * Ensuring new fields are filled with null values
     * Numeric type promotion inside nested structs
   
   * Planner integration
   
     * Expression rewrite behavior for `List<Struct>` compatibility
     * Failure scenarios for incompatible nested field casts
   
   * Parquet integration tests
   
     * End-to-end schema evolution validation when reading `List<Struct>` from 
Parquet
   
   ## Are these changes tested?
   
   Yes.
   
   The PR adds both unit tests and integration tests:
   
   * Unit tests in `nested_struct.rs` validating recursive compatibility logic 
and container casting behavior.
   * Planner-level tests in `schema_rewriter.rs` ensuring correct expression 
rewriting.
   * Parquet integration tests verifying that schema evolution works correctly 
when reading nested container types such as `List<Struct>`.
   
   These tests cover both compatible additive schema evolution and failure 
scenarios.
   
   ## Are there any user-facing changes?
   
   Yes, but they are improvements to existing functionality rather than 
breaking changes.
   
   DataFusion will now correctly support additive schema evolution for structs 
nested inside container types such as:
   
   * `List<Struct>`
   * `LargeList<Struct>`
   * `FixedSizeList<Struct>`
   * `Map<_, Struct>`
   
   Previously, these cases could fail validation even when the schema change 
was a valid additive evolution, such as adding a nullable field.
   
   No API changes are introduced.
   
   ## LLM-generated code disclosure
   
   This PR includes LLM-generated code and comments. All LLM-generated content 
has been manually reviewed and tested.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

