Hello Arrow devs,

I have drafted a PR [3] defining an experimental protocol that would allow third-party libraries to imitate the PyArrow Dataset API. This protocol is intended to endorse an integration pattern that is starting to be used in the Python ecosystem, where some libraries provide their own scanners exposing this API, while query engines accept these as duck-typed objects.

To give some background: at the end of 2021, we collaborated with DuckDB to let it read datasets (an Arrow C++ concept) with column selection and filter pushdown [1]. This was accomplished by having DuckDB manipulate Python (or R) objects to get a RecordBatchReader, which is then exported over the C Stream Interface. Since then, DataFusion [2] and Polars have made similar implementations in their Python bindings, allowing them to consume PyArrow datasets. This has created an implicit protocol whereby arbitrary compute engines can push queries down into the PyArrow dataset scanner.

Now, libraries supporting table formats such as Delta Lake, Lance, and Iceberg want to support these engines while bringing their own scanner and metadata-handling implementations. One possible route is to let them imitate the PyArrow Dataset API.

Bringing these use cases together, I'd like to propose an experimental protocol consisting of the minimal subset of the PyArrow Dataset API necessary to facilitate this kind of integration. Any library could then provide a scanner implementation that arbitrary query engines can call into. The PR is at [3], and there is some background research in a Google Doc [4]. I've already gotten some good feedback on both and would welcome more.
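To make the shape of the protocol concrete, here is a rough sketch of both sides of the pattern. This is illustrative only, not the definitions from the PR: TinyDataset and TinyScanner are hypothetical names, and the only surface that matters is the schema / scanner(columns=..., filter=...) / to_reader() subset that engines already call on PyArrow datasets.

    # Sketch only -- TinyDataset/TinyScanner are hypothetical names; the
    # real protocol's definitions live in the PR [3]. The duck-typed
    # surface mirrored here is: .schema, .scanner(columns=..., filter=...),
    # and Scanner.to_reader().
    import pyarrow as pa
    import pyarrow.dataset as ds

    class TinyDataset:
        """A hypothetical third-party 'dataset' backed by an in-memory table."""

        def __init__(self, table: pa.Table):
            self._table = table

        @property
        def schema(self) -> pa.Schema:
            return self._table.schema

        def scanner(self, columns=None, filter=None, **kwargs):
            # Engines hand projected column names and a filter expression
            # here; a real implementation would push both down into its
            # own scan instead of materializing everything.
            return TinyScanner(self._table, columns, filter)

    class TinyScanner:
        def __init__(self, table, columns, filter):
            self._table = table
            self._columns = columns
            self._filter = filter

        def to_reader(self) -> pa.RecordBatchReader:
            table = self._table
            # Apply the filter before the projection, since the filter may
            # reference columns the engine did not ask to have returned.
            if self._filter is not None:
                table = table.filter(self._filter)  # accepts an Expression
            if self._columns is not None:
                table = table.select(self._columns)
            # The engine takes this reader and exports it over the
            # C Stream Interface.
            return pa.RecordBatchReader.from_batches(
                table.schema, table.to_batches()
            )

    # Consumer side: roughly what DuckDB / DataFusion / Polars do with any
    # object that quacks like a PyArrow dataset.
    source = TinyDataset(pa.table({"x": [1, 2, 3], "y": ["a", "b", "c"]}))
    scanner = source.scanner(columns=["x"], filter=ds.field("x") > 1)
    print(scanner.to_reader().read_all())  # -> table with x = [2, 3]

Note that the producer, not the engine, decides how to honor the pushdown, which is what lets a library like Delta Lake or Lance substitute its own metadata-aware scan underneath the same surface.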
One last point: I'd like this to be a first step rather than a comprehensive API. The PR focuses on making explicit a protocol that is already in use in the ecosystem but has so far lacked a concrete definition. Once it is established, we can use our experience with this protocol to design something more permanent that takes advantage of newer innovations in the Arrow ecosystem (such as PyCapsules for the C Data Interface, or Substrait for passing expressions / scan plans). I am tracking such future improvements in [5].

Best,

Will Jones

[1] https://duckdb.org/2021/12/03/duck-arrow.html
[2] https://github.com/apache/arrow-datafusion-python/pull/9
[3] https://github.com/apache/arrow/pull/35568
[4] https://docs.google.com/document/d/1r56nt5Un2E7yPrZO9YPknBN4EDtptpx-tqOZReHvq1U/edit
[5] https://docs.google.com/document/d/1-uVkSZeaBtOALVbqMOPeyV3s2UND7Wl-IGEZ-P-gMXQ/edit