kosiew opened a new pull request, #1141: URL: https://github.com/apache/datafusion-python/pull/1141
## Which issue does this PR close?

- Closes #1136.

## Rationale for this change

This PR improves Python bindings for the DataFusion integration by addressing several shortcomings in asynchronous execution:

- Python `KeyboardInterrupt` exceptions were not previously handled correctly during long-running query operations.
- Python signal checks were missing during `await`-based async executions, leading to poor interruptibility.
- Some futures lacked proper error mapping, causing opaque errors on failure.
- Minor GIL and reference lifetime issues existed due to improper value capturing for async contexts.

This makes the Python API more responsive, predictable, and aligned with user expectations in interactive environments.

## What changes are included in this PR?

- Added a new test (`test_collect_interrupted`) that simulates a long-running SQL query and interrupts it with `KeyboardInterrupt` via `ctypes`.
- Wrapped all `wait_for_future()` usages to:
  - Check for Python signals periodically using `Python::check_signals()`.
  - Map both future errors and Python runtime errors correctly.
- Reworked various parts of the async interface (e.g., `sql()`, `register_csv()`, `read_json()`) to:
  - Clone/own data where needed to avoid lifetime or borrowing issues in futures.
  - Defer schema and option construction into async scopes where necessary.
- Added helper functions `create_csv_read_options` and `create_ndjson_read_options` for cleaner config composition.
- Improved stream execution and error messages in `PyRecordBatchStream`.

## Are these changes tested?

Yes:

- A new integration test (`test_collect_interrupted`) ensures long-running queries can be gracefully interrupted via simulated `Ctrl-C`.
- Existing tests for various `read_*`, `register_*`, `sql`, and `collect` APIs implicitly validate the refactored async logic.
- Additional logic errors are covered by unit-level fallbacks and error mapping.

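The idea of simulating `Ctrl-C` via `ctypes` can be sketched in plain Python without DataFusion. This is not the code of `test_collect_interrupted`; it is a minimal illustration that assumes CPython, and the names `inject_keyboard_interrupt`, `long_running_call`, and `run_with_simulated_interrupt` are hypothetical. The stand-in workload returns to the interpreter every few milliseconds, which is roughly the property that periodic signal checks are meant to give long-running calls.

```python
import ctypes
import threading
import time


def inject_keyboard_interrupt(thread_id: int) -> int:
    """Ask CPython to raise KeyboardInterrupt in the thread with the given id.

    Returns the number of thread states modified (1 on success).
    """
    return ctypes.pythonapi.PyThreadState_SetAsyncExc(
        ctypes.c_ulong(thread_id), ctypes.py_object(KeyboardInterrupt)
    )


def long_running_call(max_seconds: float = 10.0) -> str:
    """Hypothetical stand-in for a blocking query.

    It sleeps in short slices so the interpreter regains control often,
    giving a pending asynchronous exception a chance to be raised.
    """
    deadline = time.monotonic() + max_seconds
    while time.monotonic() < deadline:
        time.sleep(0.01)
    return "completed"


def run_with_simulated_interrupt(delay: float = 0.2) -> str:
    """Run the workload and interrupt the calling thread after `delay` seconds."""
    target_id = threading.get_ident()
    timer = threading.Timer(delay, inject_keyboard_interrupt, args=(target_id,))
    timer.start()
    try:
        return long_running_call()
    except KeyboardInterrupt:
        return "interrupted"
    finally:
        # Make sure the helper thread is gone before returning.
        timer.cancel()
        timer.join()
```

Calling `run_with_simulated_interrupt()` should return well before the 10-second deadline of the stand-in workload. A call that never hands control back to the interpreter would not see the injected exception until it finished, which is the situation the periodic signal checks in this PR are intended to avoid.
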
Note that in Jupyter, you interrupt by pressing the stop button instead of Ctrl-C.

https://github.com/user-attachments/assets/870c3195-98fc-47d6-9236-102214059383

## Are there any user-facing changes?

Yes, and they are improvements:

- Interrupting a query (e.g., in a Jupyter notebook or terminal) now stops execution cleanly and raises `KeyboardInterrupt`.
- Error messages from failed futures are now consistently mapped to `PyDataFusionError`.
- Better diagnostics when schemas or paths fail to resolve during registration or reads.
- Slight behavioral change: long-running calls now check signals periodically instead of blocking indefinitely.

These changes are designed to improve UX and should not break any existing code.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org