H0TB0X420 opened a new pull request, #1265: URL: https://github.com/apache/datafusion-python/pull/1265
# Which issue does this PR close?

Closes #1227

# Rationale for this change

(From the original issue) PyArrow is a massive dependency (>100MB unpacked) and the only required dependency for datafusion-python. Many Python Arrow libraries implement the PyCapsule Interface, which lets users choose lightweight alternatives such as nanoarrow (~7MB) or arro3, or pass data directly from Polars, DuckDB, etc.

This PR implements the first phase of making PyArrow optional by updating input parameters to accept any Arrow-compatible library via the PyCapsule Interface.

# What changes are included in this PR?

- Add the Protocol type `ArrowSchemaExportable` for the Arrow PyCapsule Interface (see the sketch after this list)
- Update schema parameters in `register_csv`, `register_parquet`, `register_json`, `register_avro`, `register_listing_table`, and the read methods to accept `ArrowSchemaExportable`
- Move the pyarrow import into a `TYPE_CHECKING` block (used only for type hints, optional at runtime)

**Note:** This PR covers input parameters only. Return types (ToPyArrow conversions) still reference pyarrow and will be addressed in a follow-up PR.

# Are there any user-facing changes?

**Breaking changes:** None. All existing PyArrow usage continues to work.

**New functionality:** Users can now pass Arrow schemas from any library implementing `__arrow_c_schema__()` (nanoarrow, arro3, Polars, DuckDB, etc.) to datafusion methods.

**Type hints:** Schema parameters now show `ArrowSchemaExportable | None` instead of `pa.Schema | None`, but accept both.
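
For context, a minimal sketch of what an `ArrowSchemaExportable` Protocol can look like; the exact definition in this PR may differ (e.g. whether it is runtime-checkable), and the loose `object` return annotation for the PyCapsule is an assumption:

```python
from typing import Protocol


class ArrowSchemaExportable(Protocol):
    """Structural type for any object that can export an Arrow schema
    via the Arrow PyCapsule Interface (pyarrow.Schema, nanoarrow or
    arro3 schemas, etc.)."""

    def __arrow_c_schema__(self) -> object:
        """Return a PyCapsule wrapping an ArrowSchema."""
        ...
```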
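
And a hedged usage sketch of the new behavior, assuming nanoarrow's schema helper functions; the `users.csv` file and its column names are hypothetical:

```python
import nanoarrow as na
from datafusion import SessionContext

ctx = SessionContext()

# Build an Arrow schema without pyarrow; any object exposing
# __arrow_c_schema__() is now accepted where a schema is expected.
schema = na.struct({"id": na.int64(), "name": na.string()})

ctx.register_csv("users", "users.csv", schema=schema)
df = ctx.sql("SELECT id, name FROM users LIMIT 5")
```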
