kosiew opened a new pull request, #17281: URL: https://github.com/apache/datafusion/pull/17281
## Which issue does this PR close?

* Closes #16579

## Rationale for this change

Evolving data sources often have structural mismatches with the expected table schema, especially when nested `Struct` types are involved. This PR introduces robust handling for schema adaptation and column casting within Apache DataFusion to ensure compatibility and correctness when processing such evolving schemas.

## What changes are included in this PR?

* Introduces `cast_column` for recursively casting nested `StructArray` fields to match target schema.
* Adds compatibility checks to prevent casting nullable fields to non-nullable targets.
* Updates the `SchemaAdapter` and `SchemaMapping` logic to leverage `cast_column`.
* Adds thorough unit tests covering:
  * Casting structs with reordering, extra, and missing fields
  * Preserving parent nullability
  * Structs containing arrays and maps
  * Schema mapping and record batch transformation
* Fixes column casting behavior in `pruning_predicate` by using `cast_column` instead of generic Arrow cast.
* Updates documentation:
  * Adds new guide: `docs/source/library-user-guide/schema_adapter.md`
  * References this guide in main user and API docs

## Are these changes tested?

Yes, extensive tests are included that:

* Validate the `cast_column` logic across multiple complex nested schemas.
* Verify compatibility validation logic.
* Test end-to-end behavior of `SchemaAdapter::map_batch()` with various structural transformations.
* Ensure correct pruning predicate behavior when stats use structs with different field types.

## Are there any user-facing changes?

Yes:

* Users benefit from improved compatibility when reading nested structured data with evolving schemas.
* Documentation has been expanded to include a new section explaining how schema adaptation works and how to use `cast_column`.

There are no breaking changes to public APIs.
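To make the idea of recursive struct casting concrete, here is a minimal, stdlib-only sketch. It is not the `cast_column` implementation from this PR and does not use Arrow: the `DataType`, `Field`, and `Value` types and the `check_nullability` and `cast_value` functions are hypothetical stand-ins, and the model is row-oriented rather than columnar. It assumes target fields are matched to source fields by name, extra source fields are dropped, missing nullable target fields are filled with nulls, and a nullable source field may not feed a non-nullable target.

```rust
// Toy, stdlib-only model of recursive struct casting. These types are
// hypothetical stand-ins, NOT Arrow's or DataFusion's real API.

#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Int64,
    Struct(Vec<Field>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
}

#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Int32(i32),
    Int64(i64),
    Struct(Vec<(String, Value)>),
}

fn field(name: &str, data_type: DataType, nullable: bool) -> Field {
    Field { name: name.to_string(), data_type, nullable }
}

/// Reject a nullable source field feeding a non-nullable target field,
/// recursing into struct children that are matched by name.
fn check_nullability(source: &Field, target: &Field) -> Result<(), String> {
    if source.nullable && !target.nullable {
        return Err(format!("nullable field '{}' cannot become non-nullable", source.name));
    }
    if let (DataType::Struct(src), DataType::Struct(tgt)) = (&source.data_type, &target.data_type) {
        for t in tgt {
            if let Some(s) = src.iter().find(|s| s.name == t.name) {
                check_nullability(s, t)?;
            }
        }
    }
    Ok(())
}

/// Cast one value to the target type, rebuilding structs in target field order.
fn cast_value(value: &Value, target: &DataType) -> Result<Value, String> {
    match (value, target) {
        (Value::Null, _) => Ok(Value::Null),
        (Value::Int32(v), DataType::Int32) => Ok(Value::Int32(*v)),
        (Value::Int32(v), DataType::Int64) => Ok(Value::Int64(i64::from(*v))),
        (Value::Int64(v), DataType::Int64) => Ok(Value::Int64(*v)),
        (Value::Int64(v), DataType::Int32) => i32::try_from(*v)
            .map(Value::Int32)
            .map_err(|_| format!("value {} overflows Int32", v)),
        (Value::Struct(children), DataType::Struct(target_fields)) => {
            let mut out = Vec::with_capacity(target_fields.len());
            for t in target_fields {
                let cast = match children.iter().find(|(n, _)| n == &t.name) {
                    // Field present in the source: cast it recursively.
                    Some((_, child)) => cast_value(child, &t.data_type)?,
                    // Missing in the source: fill with null if allowed
                    // (an assumption of this sketch).
                    None if t.nullable => Value::Null,
                    None => return Err(format!("missing non-nullable field '{}'", t.name)),
                };
                out.push((t.name.clone(), cast));
            }
            // Source fields absent from the target are dropped implicitly.
            Ok(Value::Struct(out))
        }
        (v, t) => Err(format!("cannot cast {:?} to {:?}", v, t)),
    }
}

fn main() {
    // Source row has reordered fields plus one the target does not know about.
    let source = Value::Struct(vec![
        ("b".to_string(), Value::Int32(2)),
        ("extra".to_string(), Value::Int32(9)),
        ("a".to_string(), Value::Int32(1)),
    ]);
    let target = DataType::Struct(vec![
        field("a", DataType::Int64, true),
        field("b", DataType::Int32, true),
        field("c", DataType::Int64, true),
    ]);
    // `a` is widened to Int64, `b` is kept, `extra` is dropped, `c` becomes Null.
    println!("{:?}", cast_value(&source, &target).unwrap());
}
```

In a real columnar setting the same recursion would operate on whole child arrays rather than individual values, but the name-based matching and null filling follow the same shape.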
---

This change enhances DataFusion's resilience to schema drift and paves the way for more robust handling of semi-structured data. ✨