Hello Arrow devs,

I have drafted a PR [3] defining an experimental protocol that would allow third-party libraries to imitate the PyArrow Dataset API. This protocol is intended to endorse an integration pattern that is starting to be used in the Python ecosystem, where some libraries provide their own scanners exposing this API, while query engines accept these as duck-typed objects.

To give some background: at the end of 2021, we collaborated with DuckDB to let it read datasets (an Arrow C++ concept) with column selection and filter pushdown [1]. This was accomplished by having DuckDB manipulate Python (or R) objects to get a RecordBatchReader, which is then exported over the C Stream Interface. Since then, DataFusion [2] and Polars have made similar implementations in their Python bindings, allowing them to consume PyArrow datasets. This has created an implicit protocol whereby arbitrary compute engines can push queries down into the PyArrow dataset scanner.

Now, libraries supporting table formats such as Delta Lake, Lance, and Iceberg want to support these engines while bringing their own scanner and metadata-handling implementations. One possible route is to let them imitate the PyArrow Dataset API.

Bringing these use cases together, I'd like to propose an experimental protocol consisting of the minimal subset of the PyArrow Dataset API necessary to facilitate this kind of integration. Any library could then provide a scanner implementation that arbitrary query engines can call into. The PR is at [3], and there is some background research in a Google Doc [4]. I've already gotten some good feedback on both and would welcome more.
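To make the shape of the protocol concrete, here is a rough sketch of both sides of the pattern. This is illustrative only, not the definitions from the PR: TinyDataset and TinyScanner are hypothetical names, and the only surface that matters is the schema / scanner(columns=..., filter=...) / to_reader() subset that engines already call on PyArrow datasets.

    # Sketch only -- TinyDataset/TinyScanner are hypothetical names; the
    # real protocol's definitions live in the PR [3]. The duck-typed
    # surface mirrored here is: .schema, .scanner(columns=..., filter=...),
    # and Scanner.to_reader().
    import pyarrow as pa
    import pyarrow.dataset as ds

    class TinyDataset:
        """A hypothetical third-party 'dataset' backed by an in-memory table."""

        def __init__(self, table: pa.Table):
            self._table = table

        @property
        def schema(self) -> pa.Schema:
            return self._table.schema

        def scanner(self, columns=None, filter=None, **kwargs):
            # Engines hand projected column names and a filter expression
            # here; a real implementation would push both down into its
            # own scan instead of materializing everything.
            return TinyScanner(self._table, columns, filter)

    class TinyScanner:
        def __init__(self, table, columns, filter):
            self._table = table
            self._columns = columns
            self._filter = filter

        def to_reader(self) -> pa.RecordBatchReader:
            table = self._table
            # Apply the filter before the projection, since the filter may
            # reference columns the engine did not ask to have returned.
            if self._filter is not None:
                table = table.filter(self._filter)  # accepts an Expression
            if self._columns is not None:
                table = table.select(self._columns)
            # The engine takes this reader and exports it over the
            # C Stream Interface.
            return pa.RecordBatchReader.from_batches(
                table.schema, table.to_batches()
            )

    # Consumer side: roughly what DuckDB / DataFusion / Polars do with any
    # object that quacks like a PyArrow dataset.
    source = TinyDataset(pa.table({"x": [1, 2, 3], "y": ["a", "b", "c"]}))
    scanner = source.scanner(columns=["x"], filter=ds.field("x") > 1)
    print(scanner.to_reader().read_all())  # -> table with x = [2, 3]

Note that the producer, not the engine, decides how to honor the pushdown, which is what lets a library like Delta Lake or Lance substitute its own metadata-aware scan underneath the same surface.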
One last point: I'd like this to be a first step rather than a comprehensive API. The PR focuses on making explicit a protocol that is already in use in the ecosystem but has so far lacked a concrete definition. Once it is established, we can use our experience with this protocol to design something more permanent that takes advantage of newer innovations in the Arrow ecosystem (such as PyCapsules for the C Data Interface, or Substrait for passing expressions / scan plans). I am tracking such future improvements in [5].

Best,

Will Jones

[1] https://duckdb.org/2021/12/03/duck-arrow.html
[2] https://github.com/apache/arrow-datafusion-python/pull/9
[3] https://github.com/apache/arrow/pull/35568
[4] https://docs.google.com/document/d/1r56nt5Un2E7yPrZO9YPknBN4EDtptpx-tqOZReHvq1U/edit
[5] https://docs.google.com/document/d/1-uVkSZeaBtOALVbqMOPeyV3s2UND7Wl-IGEZ-P-gMXQ/edit