djouallah opened a new pull request, #1359: URL: https://github.com/apache/datafusion-python/pull/1359
## Summary

Exposes the `truncated_rows` parameter from DataFusion Rust to the Python bindings for the `register_csv()` and `read_csv()` methods. This parameter enables reading CSV files with inconsistent column counts by creating a union schema and filling missing columns with nulls.

## Background

The `truncated_rows` feature was added to DataFusion Rust in [apache/datafusion#17553](https://github.com/apache/datafusion/pull/17553) (merged October 8, 2025) and is available in DataFusion 51.0.0.

**Current workaround:** Users can already use `truncated_rows` via SQL with external tables:

```python
ctx.sql("""
    CREATE EXTERNAL TABLE mixed
    STORED AS CSV
    LOCATION 'file1.csv', 'file2.csv'
    OPTIONS ('truncated_rows' 'true')
""")
```

**Problem:** SQL `LOCATION` clause does **not support lists of file paths** as separate arguments. You must either:

**Solution:** `register_csv()` and `read_csv()` accept Python lists of paths, making it much more ergonomic:

```python
# Much cleaner API!
ctx.register_csv(
    "mixed",
    ["file1.csv", "file2.csv", "file3.csv"],
    truncated_rows=True
)
```

## Changes

- ✅ Add `truncated_rows: bool = False` parameter to `SessionContext.register_csv()`
- ✅ Add `truncated_rows: bool = False` parameter to `SessionContext.read_csv()`
- ✅ Update Rust PyO3 bindings in `src/context.rs`
- ✅ Update Python wrappers in `python/datafusion/context.py`
- ✅ Add tests verifying parameter acceptance
- ✅ Update docstrings with parameter documentation

## Example Usage

```python
from datafusion import SessionContext

ctx = SessionContext()

# Register multiple CSV files with different schemas
ctx.register_csv(
    "services",
    ["services_2024.csv", "services_2025.csv"],  # Different column counts
    truncated_rows=True  # Create union schema, fill missing columns with nulls
)

# Query across files with different schemas
result = ctx.sql("SELECT * FROM services").collect()
```

## Testing

Tests verify that the `truncated_rows` parameter is accepted by the Python bindings.
The actual behavior of the feature is tested in the upstream DataFusion repository. This follows the principle that Python bindings should expose all Rust API parameters, and behavior testing is the responsibility of the upstream DataFusion library.

## Backward Compatibility

✅ Non-breaking change. The parameter defaults to `False`, maintaining existing behavior.

## Related

- Upstream PR: [apache/datafusion#17553](https://github.com/apache/datafusion/pull/17553)
- DataFusion version: 51.0.0+
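
## Illustrative Sketch

The Summary describes `truncated_rows` as building a union schema and filling missing columns with nulls. The standard-library sketch below only illustrates that described outcome; it is not how DataFusion implements the feature and it does not call the DataFusion API. The helper `read_with_union_schema` and the sample CSV contents are hypothetical names and data chosen for this example.

```python
import csv
import io


def read_with_union_schema(sources):
    """Read several CSV texts and pad missing columns with None.

    `sources` is a list of CSV strings, each with its own header row.
    Hypothetical helper for illustration only; not part of DataFusion.
    """
    parsed = []
    columns = []  # union of header names, in first-seen order
    for text in sources:
        reader = csv.reader(io.StringIO(text))
        header = next(reader)
        for name in header:
            if name not in columns:
                columns.append(name)
        parsed.append((header, list(reader)))

    rows = []
    for header, records in parsed:
        for record in records:
            # zip stops at the shorter sequence, so a short (truncated) row
            # simply leaves its trailing columns unset.
            values = dict(zip(header, record))
            # Columns absent from this file or this row become None (null).
            rows.append({name: values.get(name) for name in columns})
    return columns, rows


# Two inputs with different column counts; the last row is also truncated.
services_2024 = "id,name\n1,alpha\n2,beta\n"
services_2025 = "id,name,region\n3,gamma,eu\n4,delta\n"

columns, rows = read_with_union_schema([services_2024, services_2025])
for row in rows:
    print(row)
```

In this sketch, rows from the two-column input and the truncated final row both end up with `region` set to `None`, which mirrors the null-filling behavior the description attributes to `truncated_rows=True`.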
