djouallah opened a new pull request, #1359:
URL: https://github.com/apache/datafusion-python/pull/1359

   ## Summary
   
   Exposes DataFusion's Rust `truncated_rows` option in the Python bindings 
for `SessionContext.register_csv()` and `SessionContext.read_csv()`. The option 
enables reading CSV files with inconsistent column counts by creating a union 
schema and filling missing columns with nulls.
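
To make the null-filling behavior concrete, here is a plain-Python illustration. It is not how DataFusion implements the feature: the helper name `pad_truncated_rows` is hypothetical, and the sketch assumes that a short row is missing only its trailing fields.

```python
import csv
import io


def pad_truncated_rows(csv_text, columns):
    """Parse header-less CSV text and pad short rows with None.

    Illustration only: assumes missing fields are always the trailing ones.
    """
    records = []
    for fields in csv.reader(io.StringIO(csv_text)):
        if not fields:
            continue  # skip blank lines
        if len(fields) > len(columns):
            raise ValueError(f"row has more than {len(columns)} fields: {fields}")
        # Fill the missing trailing columns with None (the "null" here).
        padded = fields + [None] * (len(columns) - len(fields))
        records.append(dict(zip(columns, padded)))
    return records


rows = pad_truncated_rows("1,alice,2024\n2,bob\n", ["id", "name", "year"])
```

The second row has only two fields, so its `year` value is `None`, while the first row is left unchanged.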
   
   ## Background
   
   The `truncated_rows` feature was added to DataFusion Rust in 
[apache/datafusion#17553](https://github.com/apache/datafusion/pull/17553) 
(merged October 8, 2025) and is available in DataFusion 51.0.0.
   
   **Current workaround:** Users can already use `truncated_rows` via SQL with 
external tables:
   
   ```python
   ctx.sql("""
       CREATE EXTERNAL TABLE mixed
       STORED AS CSV
       LOCATION 'file1.csv'
       OPTIONS ('truncated_rows' 'true')
   """)
   ```
   
   **Problem:** The SQL `LOCATION` clause takes a single path and does **not** 
accept a list of file paths as separate arguments, so several specific files 
cannot be named in one `CREATE EXTERNAL TABLE` statement.
   
   **Solution:** `register_csv()` and `read_csv()` accept Python lists of 
paths, making it much more ergonomic:
   
   ```python
   # Much cleaner API!
   ctx.register_csv(
       "mixed",
       ["file1.csv", "file2.csv", "file3.csv"],
       truncated_rows=True
   )
   ```
   
   ## Changes
   
   - ✅ Add `truncated_rows: bool = False` parameter to 
`SessionContext.register_csv()`
   - ✅ Add `truncated_rows: bool = False` parameter to 
`SessionContext.read_csv()`
   - ✅ Update Rust PyO3 bindings in `src/context.rs`
   - ✅ Update Python wrappers in `python/datafusion/context.py`
   - ✅ Add tests verifying parameter acceptance
   - ✅ Update docstrings with parameter documentation
   
   ## Example Usage
   
   ```python
   from datafusion import SessionContext
   
   ctx = SessionContext()
   
   # Register multiple CSV files with different schemas
   ctx.register_csv(
       "services",
       ["services_2024.csv", "services_2025.csv"],  # Different column counts
       # Create a union schema and fill missing columns with nulls
       truncated_rows=True,
   )
   
   # Query across files with different schemas
   result = ctx.sql("SELECT * FROM services").collect()
   ```
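
The same parameter applies to `read_csv()`. The sketch below shows one way to build the path list from a directory and pass it through. The helper name `read_csv_dir` is hypothetical, and `ctx` is assumed to be a `SessionContext` from a build that includes this PR.

```python
from pathlib import Path


def read_csv_dir(ctx, directory, pattern="*.csv"):
    """Read every matching CSV file in `directory` as one DataFrame.

    Assumes `ctx.read_csv()` accepts a list of paths and the
    `truncated_rows` keyword described in this PR.
    """
    paths = sorted(str(p) for p in Path(directory).glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern!r} in {directory}")
    return ctx.read_csv(paths, truncated_rows=True)


# Usage (requires datafusion with this change applied):
#   from datafusion import SessionContext
#   df = read_csv_dir(SessionContext(), "data/services")
```

Sorting the paths keeps the file order deterministic across runs, which makes the results easier to reproduce.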
   
   ## Testing
   
   Tests verify that the `truncated_rows` parameter is accepted by the Python 
bindings. The actual behavior of the feature is tested in the upstream 
DataFusion repository.
   
   This follows the principle that Python bindings should expose all Rust API 
parameters, and behavior testing is the responsibility of the upstream 
DataFusion library.
   
   ## Backward Compatibility
   
   ✅ Non-breaking change. The parameter defaults to `False`, maintaining 
existing behavior.
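
Code that must also run against older datafusion-python releases could check for the parameter before using it. The following is a minimal sketch; both helper names are hypothetical, and it assumes the Python wrapper exposes an inspectable signature.

```python
import inspect


def supports_truncated_rows(method):
    """Return True if `method` declares a `truncated_rows` parameter."""
    try:
        params = inspect.signature(method).parameters
    except (TypeError, ValueError):
        return False  # signature not inspectable; assume unsupported
    return "truncated_rows" in params


def register_csv_checked(ctx, name, paths, truncated_rows=False):
    """Call ctx.register_csv, passing truncated_rows only when requested."""
    if not truncated_rows:
        ctx.register_csv(name, paths)  # unchanged call on every version
        return
    if not supports_truncated_rows(ctx.register_csv):
        raise RuntimeError("this datafusion build does not accept truncated_rows")
    ctx.register_csv(name, paths, truncated_rows=True)
```

Raising an error, rather than silently dropping the flag, avoids loading files under a stricter schema than the caller asked for.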
   
   ## Related
   
   - Upstream PR: 
[apache/datafusion#17553](https://github.com/apache/datafusion/pull/17553)
   - DataFusion version: 51.0.0+
   

