kosiew opened a new pull request, #16371:
URL: https://github.com/apache/datafusion/pull/16371

   ## Which issue does this PR close?
   
   This is the last in a series of PRs re-implementing #15295 to close #14657 
by adding schema-evolution support for listing-based tables with nested 
structs in DataFusion.
   
   - Closes #14757
   
   ## Rationale for this change
   
   This change enables DataFusion's listing-based tables to support schema 
evolution when dealing with files that may have nested struct fields with 
varying structures over time. It ensures more robust data ingestion pipelines, 
especially in environments where schema drift is common (e.g., data lakes, 
log-based ingestion, etc.).
   
   Previously, nested structs with evolved schemas could lead to 
incompatibility errors or data loss. This PR introduces a flexible schema 
adaptation mechanism through the `SchemaAdapterFactory` trait, allowing custom 
logic to map differing schemas safely and correctly.
   
   ## What changes are included in this PR?
   
   - Added a `schema_adapter_factory` field to `ListingTableConfig` and 
`ListingTable`.
   - Added support for injecting custom `SchemaAdapterFactory` implementations.
   - Implemented a `NestedStructSchemaAdapterFactory` that can handle nested 
structs and evolve them by injecting nulls for missing nested fields.
   - Integrated the factory into the listing table execution path (scan, 
statistics, file listing).
   - Updated default behavior to use `DefaultSchemaAdapterFactory` if none is 
provided.
   - Added comprehensive tests covering:
     - Adapter selection
     - Mapping of nested structs
     - Error propagation for incompatible schemas
     - Column statistics transformation through the adapter
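
To illustrate the idea behind injecting nulls for missing nested fields, here is a minimal, std-only sketch. It is a toy model, not the code in this PR: `Field`, `Value`, and `adapt` are hypothetical names, and plain enums stand in for Arrow schemas and arrays. It assumes that fields missing from a file are filled with nulls, fields absent from the table schema are dropped, and a struct/leaf mismatch is reported as an error.

```rust
use std::collections::BTreeMap;

/// Toy schema node (hypothetical): a leaf column or a struct with named children.
#[derive(Debug, Clone, PartialEq)]
enum Field {
    Leaf,
    Struct(Vec<(String, Field)>),
}

/// Toy value (hypothetical); `Null` stands in for an Arrow null.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Int(i64),
    Text(String),
    Struct(BTreeMap<String, Value>),
}

/// Reshape `value` (as read from a file) to match `target` (the table schema).
/// Fields missing from the file become `Null`, fields absent from the target
/// are dropped, and a struct/leaf mismatch is returned as an error.
fn adapt(value: &Value, target: &Field) -> Result<Value, String> {
    match (value, target) {
        // A null stays null regardless of the target type.
        (Value::Null, _) => Ok(Value::Null),
        (Value::Struct(children), Field::Struct(target_fields)) => {
            let mut out = BTreeMap::new();
            for (name, field) in target_fields {
                let adapted = match children.get(name) {
                    // Recurse so that deeper levels are adapted as well.
                    Some(child) => adapt(child, field)?,
                    // The file predates this field: inject a null.
                    None => Value::Null,
                };
                out.insert(name.clone(), adapted);
            }
            Ok(Value::Struct(out))
        }
        (Value::Struct(_), Field::Leaf) => {
            Err("expected a leaf value, found a struct".to_string())
        }
        (_, Field::Struct(_)) => {
            Err("expected a struct, found a leaf value".to_string())
        }
        // Leaf values are passed through unchanged in this toy model.
        (leaf, Field::Leaf) => Ok(leaf.clone()),
    }
}
```

Calling `adapt` on a row whose `address` struct lacks a `zip` field, against a target schema that declares `zip`, yields a row in which `address.zip` is `Value::Null`. The real adapter operates on Arrow record batches and also handles type casts, which this sketch leaves out.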
   
   ## Are these changes tested?
   
   ✅ Yes, the PR includes extensive unit tests that verify:
   - Behavior of schema adapter factories under different schema conditions
   - Handling of missing nested fields
   - Adaptation logic for struct arrays
   - Mapping and transformation of column statistics
   - Error propagation when schema adaptation fails
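
As a rough picture of what mapping column statistics through an adapter involves, the std-only sketch below reorders per-file statistics into table-schema order. The names `ColumnStats` and `map_column_statistics` are hypothetical, and the sketch assumes that a column missing from the file gets "unknown" statistics and that mismatched input lengths are reported as an error; the PR's actual types and rules may differ.

```rust
/// Toy per-column statistics (hypothetical); `None` means "unknown".
#[derive(Debug, Clone, PartialEq)]
struct ColumnStats {
    null_count: Option<usize>,
}

/// Reorder per-file statistics into table-schema order.
/// Columns the file does not contain get unknown statistics.
fn map_column_statistics(
    table_columns: &[&str],
    file_columns: &[&str],
    file_stats: &[ColumnStats],
) -> Result<Vec<ColumnStats>, String> {
    // Each file column must come with exactly one statistics entry.
    if file_columns.len() != file_stats.len() {
        return Err(format!(
            "expected {} statistics entries, got {}",
            file_columns.len(),
            file_stats.len()
        ));
    }
    let mapped = table_columns
        .iter()
        .map(|name| match file_columns.iter().position(|c| c == name) {
            // The column exists in the file: reuse its statistics.
            Some(index) => file_stats[index].clone(),
            // The column is missing from the file: nothing is known.
            None => ColumnStats { null_count: None },
        })
        .collect();
    Ok(mapped)
}
```

Statistics for file columns that the table schema does not declare are simply not carried over in this model.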
   
   ## Are there any user-facing changes?
   
   ✅ Yes, this PR introduces the ability to:
   - Provide custom schema adaptation logic to `ListingTable` through 
`ListingTableConfig::with_schema_adapter_factory`
   - Read files whose nested struct schemas have changed over time
   
   There are no breaking changes to public APIs. The added functionality is 
optional and backward-compatible with existing behavior.
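
The opt-in behavior described here can be pictured with a small std-only model. It is a sketch under assumptions, not DataFusion's API: `AdapterFactory`, `DefaultFactory`, `NestedStructFactory`, `TableConfig`, and `effective_factory` are hypothetical stand-ins, and only the method name `with_schema_adapter_factory` is taken from this PR description.

```rust
use std::sync::Arc;

/// Toy stand-in (hypothetical) for a schema adapter factory trait.
trait AdapterFactory {
    fn name(&self) -> &'static str;
}

/// Stand-in for the default factory used when nothing is configured.
struct DefaultFactory;

impl AdapterFactory for DefaultFactory {
    fn name(&self) -> &'static str {
        "default"
    }
}

/// Stand-in for a custom factory that understands nested structs.
struct NestedStructFactory;

impl AdapterFactory for NestedStructFactory {
    fn name(&self) -> &'static str {
        "nested-struct"
    }
}

/// Toy table configuration (hypothetical) with an optional factory.
#[derive(Default)]
struct TableConfig {
    schema_adapter_factory: Option<Arc<dyn AdapterFactory>>,
}

impl TableConfig {
    /// Builder-style setter; existing callers never need to call it.
    fn with_schema_adapter_factory(mut self, factory: Arc<dyn AdapterFactory>) -> Self {
        self.schema_adapter_factory = Some(factory);
        self
    }

    /// Return the configured factory, or fall back to the default one.
    fn effective_factory(&self) -> Arc<dyn AdapterFactory> {
        match &self.schema_adapter_factory {
            Some(factory) => Arc::clone(factory),
            None => Arc::new(DefaultFactory),
        }
    }
}
```

Keeping the field an `Option` is what makes the change backward-compatible in this model: a configuration built without the setter resolves to the default factory, while one built with `with_schema_adapter_factory(Arc::new(NestedStructFactory))` resolves to the custom factory.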
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.