kosiew opened a new pull request, #20840:
URL: https://github.com/apache/datafusion/pull/20840
## Which issue does this PR close?
* Part of #20835
## Rationale for this change
DataFusion currently implements additive schema-evolution compatibility for
`Struct` columns using `validate_struct_compatibility`. However, this logic
only applies when the column itself is a `Struct`. When a `Struct` is wrapped
inside container types such as `List`, `LargeList`, `FixedSizeList`, or `Map`,
the planner falls back to Arrow's `can_cast_types`.
That fallback treats the container as opaque, so legitimate schema
evolutions (for example, adding a nullable field to a struct) are rejected
whenever the struct is nested inside a container.
This PR introduces a recursive datatype compatibility validator that
recognizes when container types wrap a `Struct` and applies the same additive
schema-evolution semantics to the nested structure. This ensures consistent
behavior between top-level structs and structs nested inside supported
container types.
## What changes are included in this PR?
### Recursive compatibility validation
* Introduce `validate_data_type_compatibility` to recursively validate
  compatibility between source and target datatypes.
* Support recursive validation for:
  * `Struct`
  * `List`
  * `LargeList`
  * `FixedSizeList`
  * `Map`
* Add helper `requires_recursive_compatibility_validation` to determine when
  recursive validation should be applied.
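
To make the recursion concrete, here is a minimal sketch of the idea using toy types. It is not the code in this PR: the `DataType` and `Field` definitions below are simplified stand-ins for the Arrow types, `can_cast_leaf` is a hypothetical substitute for Arrow's `can_cast_types`, and the nullability rule in `validate_field` is an assumption. Only the two function names are borrowed from the description above.

```rust
// Toy stand-ins for Arrow's `DataType` and `Field`; not the real arrow types.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Int64,
    Utf8,
    Struct(Vec<Field>),
    List(Box<Field>),
    LargeList(Box<Field>),
    FixedSizeList(Box<Field>, usize),
    Map(Box<Field>, Box<Field>), // (key field, value field)
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
}

fn field(name: &str, data_type: DataType, nullable: bool) -> Field {
    Field { name: name.to_string(), data_type, nullable }
}

// Hypothetical leaf-level check standing in for Arrow's `can_cast_types`.
fn can_cast_leaf(source: &DataType, target: &DataType) -> bool {
    use DataType::*;
    matches!((source, target), (Int32, Int32 | Int64) | (Int64, Int64) | (Utf8, Utf8))
}

// Assumed behavior: true for a struct, or a supported container wrapping one.
fn requires_recursive_compatibility_validation(dt: &DataType) -> bool {
    use DataType::*;
    match dt {
        Struct(_) => true,
        List(f) | LargeList(f) | FixedSizeList(f, _) => {
            requires_recursive_compatibility_validation(&f.data_type)
        }
        Map(k, v) => {
            requires_recursive_compatibility_validation(&k.data_type)
                || requires_recursive_compatibility_validation(&v.data_type)
        }
        _ => false,
    }
}

fn validate_field(source: &Field, target: &Field) -> Result<(), String> {
    // Assumption: a nullable source must not feed a non-nullable target.
    if source.nullable && !target.nullable {
        return Err(format!("Cannot cast field '{}': nullable to non-nullable", target.name));
    }
    validate_data_type_compatibility(&source.data_type, &target.data_type)
}

fn validate_data_type_compatibility(source: &DataType, target: &DataType) -> Result<(), String> {
    use DataType::*;
    match (source, target) {
        // Additive evolution: every target field must exist in the source,
        // unless it is nullable and can therefore be filled with nulls.
        (Struct(src), Struct(tgt)) => {
            for t in tgt {
                match src.iter().find(|s| s.name == t.name) {
                    Some(s) => validate_field(s, t)?,
                    None if t.nullable => {}
                    None => return Err(format!("Cannot cast field '{}': missing in source", t.name)),
                }
            }
            Ok(())
        }
        // Containers recurse into their child fields instead of being opaque.
        (List(s), List(t)) | (LargeList(s), LargeList(t)) => validate_field(s, t),
        (FixedSizeList(s, n), FixedSizeList(t, m)) => {
            if n != m {
                return Err(format!("FixedSizeList size mismatch: {} vs {}", n, m));
            }
            validate_field(s, t)
        }
        (Map(sk, sv), Map(tk, tv)) => {
            validate_field(sk, tk)?;
            validate_field(sv, tv)
        }
        _ if can_cast_leaf(source, target) => Ok(()),
        _ => Err(format!("Cannot cast {:?} to {:?}", source, target)),
    }
}

// Example pair: `List<Struct<id: Int32>>` evolving to
// `List<Struct<id: Int64, note: Utf8 (nullable)>>`.
fn example_types() -> (DataType, DataType) {
    let old_item = DataType::Struct(vec![field("id", DataType::Int32, false)]);
    let new_item = DataType::Struct(vec![
        field("id", DataType::Int64, false),
        field("note", DataType::Utf8, true),
    ]);
    (
        DataType::List(Box::new(field("item", old_item, true))),
        DataType::List(Box::new(field("item", new_item, true))),
    )
}
```

With the pair from `example_types()`, the validator accepts the evolution because the only added field is nullable. Declaring `note` as non-nullable, or changing a `FixedSizeList` length, makes it return an error instead.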
### Planner integration
* Update `schema_rewriter.rs` to use the new recursive compatibility
validation during physical expression adaptation.
* Preserve existing Arrow casting behavior for datatypes that do not require
recursive validation.
### Runtime casting support for containers
* Extend nested column casting to support:
  * `List`
  * `LargeList`
  * `FixedSizeList`
  * `Map`
* Introduce reusable helper `cast_container` to unify container casting
  logic.
* Add specific casting helpers:
  * `cast_list_column`
  * `cast_fixed_size_list_column`
  * `cast_map_column`
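
One way to picture a shared container-casting helper is a function that keeps the container's own layout (offsets and validity) untouched and delegates only the child values to a caller-supplied cast. The sketch below assumes that shape; the description does not show the real signature of `cast_container`, so `ListColumn`, `cast_container_sketch`, and `promote_i32_to_i64` are hypothetical names over a toy in-memory layout rather than Arrow arrays.

```rust
// Toy list column: `offsets[i]..offsets[i + 1]` is the slice of `values`
// belonging to row `i`; `validity[i]` is false when row `i` is null.
#[derive(Debug, Clone, PartialEq)]
struct ListColumn<T> {
    offsets: Vec<usize>,
    validity: Vec<bool>,
    values: Vec<T>,
}

// Hypothetical shared helper: cast only the child values and reuse the
// container's offsets and validity unchanged.
fn cast_container_sketch<S, T, E>(
    source: &ListColumn<S>,
    cast_child: impl Fn(&[S]) -> Result<Vec<T>, E>,
) -> Result<ListColumn<T>, E> {
    let values = cast_child(source.values.as_slice())?;
    Ok(ListColumn {
        offsets: source.offsets.clone(),
        validity: source.validity.clone(),
        values,
    })
}

// Example child cast: lossless numeric promotion from i32 to i64.
fn promote_i32_to_i64(values: &[i32]) -> Result<Vec<i64>, String> {
    Ok(values.iter().map(|v| i64::from(*v)).collect())
}
```

Under this assumed design, a list helper, a fixed-size-list helper, and a map helper would differ mainly in how they unpack and rebuild the container, while the child cast stays the same closure.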
### Struct casting refactor
* Simplify `cast_struct_column` implementation by iterating over target
  fields and mapping source fields by name.
* Preserve struct null buffers while filling missing fields with null arrays
  when allowed.
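
The two bullets above can be illustrated with a toy columnar struct. This is a sketch under stated assumptions, not the refactored `cast_struct_column`: `StructColumn`, `TargetField`, and `cast_struct_column_sketch` are hypothetical, child columns are plain `Vec<Option<i64>>`, and "when allowed" is read as "when the target field is nullable".

```rust
// Toy struct column: named child columns plus a struct-level validity mask.
#[derive(Debug, Clone, PartialEq)]
struct StructColumn {
    fields: Vec<(String, Vec<Option<i64>>)>,
    validity: Vec<bool>,
}

// Hypothetical description of one field in the target struct type.
struct TargetField {
    name: String,
    nullable: bool,
}

fn cast_struct_column_sketch(
    source: &StructColumn,
    target: &[TargetField],
) -> Result<StructColumn, String> {
    let len = source.validity.len();
    let mut fields = Vec::with_capacity(target.len());
    // Drive the loop from the target so the output follows the target's
    // field order; source fields are looked up by name.
    for t in target {
        let child = match source.fields.iter().find(|(name, _)| name == &t.name) {
            Some((_, values)) => values.clone(),
            // Missing in the source: fill with nulls if the target allows it.
            None if t.nullable => vec![None; len],
            None => {
                return Err(format!(
                    "Cannot cast field '{}': missing in source and not nullable",
                    t.name
                ))
            }
        };
        fields.push((t.name.clone(), child));
    }
    // The struct-level null mask is carried over unchanged.
    Ok(StructColumn { fields, validity: source.validity.clone() })
}
```

Because the loop is driven by the target, source fields that the target does not mention are dropped in this sketch, and the output column order always matches the target schema.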
### Error message improvements
* Standardize error messages from `"Cannot cast struct field"` to `"Cannot
cast field"` for consistency across nested contexts.
### Tests
Added comprehensive tests covering:
* Recursive compatibility validation
  * `List<Struct>` with additive schema evolution
  * `Map<_, Struct>` nested struct compatibility
  * `FixedSizeList` size mismatch detection
* Nested casting behavior
  * Casting `List<Struct>` where the target struct adds a nullable field
  * Ensuring new fields are filled with null values
  * Numeric type promotion inside nested structs
* Planner integration
  * Expression rewrite behavior for `List<Struct>` compatibility
  * Failure scenarios for incompatible nested field casts
* Parquet integration tests
  * End-to-end schema evolution validation when reading `List<Struct>` from
    Parquet
## Are these changes tested?
Yes.
The PR adds both unit tests and integration tests:
* Unit tests in `nested_struct.rs` validating recursive compatibility logic
and container casting behavior.
* Planner-level tests in `schema_rewriter.rs` ensuring correct expression
rewriting.
* Parquet integration tests verifying that schema evolution works correctly
when reading nested container types such as `List<Struct>`.
These tests cover both compatible additive schema evolution and failure
scenarios.
## Are there any user-facing changes?
Yes, but they are improvements to existing functionality rather than
breaking changes.
DataFusion will now correctly support additive schema evolution for structs
nested inside container types such as:
* `List<Struct>`
* `LargeList<Struct>`
* `FixedSizeList<Struct>`
* `Map<_, Struct>`
Previously these cases could fail validation even when the schema change was
valid.
No API changes are introduced.
## LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content
has been manually reviewed and tested.