kosiew opened a new pull request, #16305:
URL: https://github.com/apache/datafusion/pull/16305

   ## Which issue does this PR close?
   
   - Closes #16270
   
   ## Rationale for this change
   
   The current behavior of `ListingTable` in DataFusion can produce 
inconsistent projected schemas depending on the order of input files, even when 
a schema is explicitly provided. This inconsistency is particularly problematic 
in use cases involving schema evolution or optional/nested fields.
   
   This PR introduces an explicit `SchemaSource` enum that records how a schema 
was derived: `None`, `Inferred`, or `Specified`. Tracking the source ensures 
that schema inference never overwrites an explicitly provided schema, so 
`ListingTable` behaves predictably regardless of file order.
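
To illustrate the idea, here is a minimal, self-contained sketch of source tracking. It is not the code from this PR: `Schema` and `Config` are hypothetical stand-ins (the real types are Arrow's schema and `ListingTableConfig`), and only the three variant names are taken from the description above.

```rust
/// Hypothetical stand-in for an Arrow schema: an ordered list of field names.
type Schema = Vec<String>;

/// How the schema held by a config was obtained (variant names from the PR text).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SchemaSource {
    /// No schema has been set yet.
    None,
    /// The schema was inferred from the input files.
    Inferred,
    /// The schema was supplied explicitly by the caller.
    Specified,
}

/// Hypothetical, simplified config; an assumption for illustration only.
#[derive(Debug, Clone)]
struct Config {
    schema: Option<Schema>,
    schema_source: SchemaSource,
}

impl Config {
    /// Starts with no schema and no recorded source.
    fn new() -> Self {
        Config { schema: None, schema_source: SchemaSource::None }
    }

    /// Stores an explicit schema and marks it as `Specified`.
    fn with_schema(mut self, schema: Schema) -> Self {
        self.schema = Some(schema);
        self.schema_source = SchemaSource::Specified;
        self
    }

    /// Infers a schema only when none is present. `first_file` models the
    /// schema read from the first input file; an existing schema is kept.
    fn infer_schema(mut self, first_file: &Schema) -> Self {
        if self.schema.is_none() {
            self.schema = Some(first_file.clone());
            self.schema_source = SchemaSource::Inferred;
        }
        self
    }
}
```

In this sketch, calling `infer_schema` after `with_schema` leaves both the schema and its `Specified` source untouched, which is the property the PR description relies on.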
   
   ## What changes are included in this PR?
   
   - Introduced `SchemaSource` enum to track the origin of a schema.
   - Updated `ListingTableConfig` and `ListingTable` to store and respect 
`schema_source`.
   - Modified schema inference logic to retain specified schemas and only infer 
when none is provided.
   - Added methods to query the schema source from `ListingTable` and 
`ListingTableConfig`.
   - Extended existing tests and added new ones to verify:
     - Schema consistency regardless of file order.
     - Schema source tracking behavior across all config transformations.
     - Correct behavior with multi-file inputs and optional fields.
   
   ## Are these changes tested?
   
   Yes. New unit tests verify that:
   - The schema source is correctly preserved through config operations.
   - `ListingTable` uses the explicitly provided schema instead of inferring 
from the first file.
   - Output schema remains consistent regardless of file order.
   - Inferred schema reflects the first file only when no schema is provided.
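
The order-independence property these tests check can be sketched in isolation. The model below is an assumption made for illustration, not the PR's test code: each file is reduced to its column names, and `resolve_columns` and `is_order_independent` are hypothetical helpers.

```rust
/// Hypothetical model: each file is represented only by its column names.
type FileColumns = Vec<&'static str>;

/// Returns the table's column names: the explicit schema when one is given,
/// otherwise the columns of the first file (the fallback described above).
fn resolve_columns(
    specified: Option<&[&'static str]>,
    files: &[FileColumns],
) -> Vec<&'static str> {
    match specified {
        Some(columns) => columns.to_vec(),
        // With no files at all, fall back to an empty column list.
        None => files.first().cloned().unwrap_or_default(),
    }
}

/// True when resolving against `files` and against the reversed file list
/// yields the same columns.
fn is_order_independent(
    specified: Option<&[&'static str]>,
    files: &[FileColumns],
) -> bool {
    let mut reversed = files.to_vec();
    reversed.reverse();
    resolve_columns(specified, files) == resolve_columns(specified, &reversed)
}
```

With two files whose columns differ, this model is order independent when an explicit schema is passed and order dependent when it is not, mirroring the last two bullets above.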
   
   ## Are there any user-facing changes?
   
   Yes, but they are non-breaking:
   - Users can now rely on `ListingTable` to respect explicitly provided 
schemas even when file contents vary.
   - Behavior is now deterministic across different file orderings.
   - Diagnostic capabilities are improved with access to the schema source via 
`ListingTable::schema_source()`.
   
   

