kosiew opened a new pull request, #17281:
URL: https://github.com/apache/datafusion/pull/17281

   ## Which issue does this PR close?
   
   * Closes #16579
   
   ## Rationale for this change
   
   Data sources that evolve over time often no longer match the expected 
table schema structurally, especially when nested `Struct` types are involved. 
This PR adds schema adaptation and column casting logic to Apache DataFusion 
so that such sources can still be read correctly against the table schema.
   
   ## What changes are included in this PR?
   
   * Introduces `cast_column` for recursively casting nested `StructArray` 
fields to match the target schema.
   * Adds compatibility checks to prevent casting nullable fields to 
non-nullable targets.
   * Updates the `SchemaAdapter` and `SchemaMapping` logic to leverage 
`cast_column`.
   * Adds thorough unit tests covering:
   
     * Casting structs with reordering, extra, and missing fields
     * Preserving parent nullability
     * Structs containing arrays and maps
     * Schema mapping and record batch transformation
   * Fixes column casting behavior in `pruning_predicate` by using 
`cast_column` instead of the generic Arrow cast.
   * Updates documentation:
   
     * Adds new guide: `docs/source/library-user-guide/schema_adapter.md`
     * References this guide in main user and API docs
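
The recursive struct casting described above can be illustrated with a toy model. This is a minimal sketch using only the Rust standard library, not the PR's `cast_column` implementation and not the Arrow API: `ToyType`, `ToyValue`, and `cast_value` are hypothetical names invented for illustration. It assumes that target fields are matched by name, that fields missing from the source become null, and that extra source fields are dropped, which is one reading of the "reordering, extra, and missing fields" test cases.

```rust
/// Toy stand-in for a data type (hypothetical, not Arrow's `DataType`).
#[derive(Debug, Clone, PartialEq)]
enum ToyType {
    Int32,
    Int64,
    /// Named child fields, in target order.
    Struct(Vec<(String, ToyType)>),
}

/// Toy stand-in for a single value (hypothetical, not an Arrow array).
#[derive(Debug, Clone, PartialEq)]
enum ToyValue {
    Null,
    Int32(i32),
    Int64(i64),
    Struct(Vec<(String, ToyValue)>),
}

/// Recursively cast `value` to `target`.
/// Assumptions of this sketch: struct fields are matched by name,
/// missing fields are filled with `Null`, extra fields are dropped.
fn cast_value(value: &ToyValue, target: &ToyType) -> Result<ToyValue, String> {
    match (value, target) {
        // Null stays null regardless of the target type.
        (ToyValue::Null, _) => Ok(ToyValue::Null),
        (ToyValue::Int32(v), ToyType::Int32) => Ok(ToyValue::Int32(*v)),
        // Widening an integer is lossless.
        (ToyValue::Int32(v), ToyType::Int64) => Ok(ToyValue::Int64(i64::from(*v))),
        (ToyValue::Int64(v), ToyType::Int64) => Ok(ToyValue::Int64(*v)),
        (ToyValue::Struct(children), ToyType::Struct(target_fields)) => {
            let mut out = Vec::with_capacity(target_fields.len());
            // Walk the target fields so the output follows the target order.
            for (field_name, field_type) in target_fields {
                let child = children.iter().find(|(name, _)| name == field_name);
                let cast = match child {
                    // Recurse so nested structs are adapted as well.
                    Some((_, v)) => cast_value(v, field_type)?,
                    // Field absent in the source: fill with null.
                    None => ToyValue::Null,
                };
                out.push((field_name.clone(), cast));
            }
            Ok(ToyValue::Struct(out))
        }
        (v, t) => Err(format!("cannot cast {:?} to {:?}", v, t)),
    }
}
```

In this sketch, casting a source struct with fields `b`, `extra`, `a` to a target with fields `a`, `b`, `c` yields a struct ordered `a`, `b`, `c`, with `extra` dropped and `c` set to null. The real implementation operates on whole Arrow arrays rather than single values.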
   
   ## Are these changes tested?
   
   Yes, extensive tests are included that:
   
   * Validate the `cast_column` logic across multiple complex nested schemas.
   * Verify compatibility validation logic.
   * Test end-to-end behavior of `SchemaAdapter::map_batch()` with various 
structural transformations.
   * Ensure correct pruning predicate behavior when stats use structs with 
different field types.
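
The compatibility rule that these tests exercise (a nullable source field must not be cast to a non-nullable target field) can be sketched as follows. This is an illustrative model using only the Rust standard library: `ToyField` and `validate_compatible` are hypothetical names, not the functions added by this PR, and the sketch deliberately ignores data types and fields that are missing from the source.

```rust
/// Toy field description (hypothetical, not Arrow's `Field`).
#[derive(Debug, Clone)]
struct ToyField {
    name: String,
    nullable: bool,
    /// Child fields; empty for non-struct fields.
    children: Vec<ToyField>,
}

/// Check that every target field found in the source can accept the
/// source field's nullability, recursing into nested struct fields.
/// Fields missing from the source are not checked in this sketch.
fn validate_compatible(source: &[ToyField], target: &[ToyField]) -> Result<(), String> {
    for t in target {
        if let Some(s) = source.iter().find(|s| s.name == t.name) {
            // A nullable source may contain nulls that a non-nullable
            // target cannot represent, so reject the cast up front.
            if s.nullable && !t.nullable {
                return Err(format!(
                    "cannot cast nullable field '{}' to a non-nullable target",
                    t.name
                ));
            }
            // Apply the same rule to nested fields.
            validate_compatible(&s.children, &t.children)?;
        }
    }
    Ok(())
}
```

Running the check before any data is touched means an incompatible schema fails with a descriptive error instead of producing an invalid array later.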
   
   ## Are there any user-facing changes?
   
   Yes:
   
   * Users benefit from improved compatibility when reading nested structured 
data with evolving schemas.
   * Documentation has been expanded to include a new section explaining how 
schema adaptation works and how to use `cast_column`.
   
   There are no breaking changes to public APIs.
   
   ---
   
   This change enhances DataFusion's resilience to schema drift and paves the 
way for more robust handling of semi-structured data. ✨
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

