> The trouble is that Dataset was not designed to serve as a > general-purpose unmaterialized dataframe. For example, the PyArrow > Dataset constructor [5] exposes options for specifying a list of > source files and a partitioning scheme, which are irrelevant for many > of the applications that Will anticipates. And some work is needed to > reconcile the methods of the PyArrow Dataset object [6] with the > methods of the Table object. Some methods like filter() are exposed by > both and behave lazily on Datasets and eagerly on Tables, as a user > might expect. But many other Table methods are not implemented for > Dataset though they potentially could be, and it is unclear where we > should draw the line between adding methods to Dataset vs. encouraging > new scanner implementations to expose options controlling what lazy > operations should be performed as they see fit.
In my mind there is a distinction between the "compute domain" (e.g. a pandas dataframe or something like ibis or SQL) and the "data domain" (e.g. pyarrow datasets). I think, in a perfect world, you could push any and all compute up and down the chain as far as possible. However, in practice, I think there is a healthy set of tools and libraries that say "simple column projection and filtering is good enough". I would argue that there is room for both APIs and while the temptation is always present to "shove as much compute as you can" I think pyarrow datasets seem to have found a balance between the two that users like. So I would argue that this protocol may never become a general-purpose unmaterialized dataframe and that isn't necessarily a bad thing. > they are splittable and serializable, so that fragments can be distributed > amongst processes / workers. Just to clarify, the proposal currently only requires the fragments to be serializable correct? On Fri, Jun 23, 2023 at 11:48 AM Will Jones <will.jones...@gmail.com> wrote: > Thanks Ian for your extensive feedback. > > I strongly agree with the comments made by David, > > Weston, and Dewey arguing that we should avoid any use of PyArrow > > expressions in this API. Expressions are an implementation detail of > > PyArrow, not a part of the Arrow standard. It would be much safer for > > the initial version of this protocol to not define *any* > > methods/arguments that take expressions. > > > > I would agree with this point, if we were starting from scratch. But one of > my goals is for this protocol to be descriptive of the existing dataset > integrations in the ecosystem, which all currently rely on PyArrow > expressions. For example, you'll notice in the PR that there are unit tests > to verify the current PyArrow Dataset classes conform to this protocol, > without changes. > > I think there's three routes we can go here: > > 1. We keep PyArrow expressions in the API initially, but once we have > Substrait-based alternatives we deprecate the PyArrow expression support. > This is what I intended with the current design, and I think it provides > the most obvious migration paths for existing producers and consumers. > 2. We keep the overall dataset API, but don't introduce the filter and > projection arguments until we have Substrait support. I'm not sure what the > migration path looks like for producers and consumers, but I think this > just implicitly becomes the same as (1), but with worse documentation. > 3. We write a protocol completely from scratch, that doesn't try to > describe the existing dataset API. Producers and consumers would then > migrate to use the new protocol and deprecate their existing dataset > integrations. We could introduce a dunder method in that API (sort of like > __arrow_array__) that would make the migration seamless from the end-user > perspective. > > *Which do you all think is the best path forward?* > > Another concern I have is that we have not fully explained why we want > > to use Dataset instead of RecordBatchReader [9] as the basis of this > > protocol. I would like to see an explanation of why RecordBatchReader > > is not sufficient for this. RecordBatchReader seems like another > > possible way to represent "unmaterialized dataframes" and there are > > some parallels between RecordBatch/RecordBatchReader and > > Fragment/Dataset. > > > > This is a good point. I can add a section describing the differences. The > main ones I can think of are that: (1) Datasets are "pruneable": one can > select a subset of columns and apply a filter on rows to avoid IO and (2) > they are splittable and serializable, so that fragments can be distributed > amongst processes / workers. > > Best, > > Will Jones > > On Fri, Jun 23, 2023 at 10:48 AM Ian Cook <ianmc...@apache.org> wrote: > > > Thanks Will for this proposal! > > > > For anyone familiar with PyArrow, this idea has a clear intuitive > > logic to it. It provides an expedient solution to the current lack of > > a practical means for interchanging "unmaterialized dataframes" > > between different Python libraries. > > > > To elaborate on that: If you look at how people use the Arrow Dataset > > API—which is implemented in the Arrow C++ library [1] and has bindings > > not just for Python [2] but also for Java [3] and R [4]—you'll see > > that Dataset is often used simply as a "virtual" variant of Table. It > > is used in cases when the data is larger than memory or when it is > > desirable to defer reading (materializing) the data into memory. > > > > So we can think of a Table as a materialized dataframe and a Dataset > > as an unmaterialized dataframe. That aspect of Dataset is I think what > > makes it most attractive as a protocol for enabling interoperability: > > it allows libraries to easily "speak Arrow" in cases where > > materializing the full data in memory upfront is impossible or > > undesirable. > > > > The trouble is that Dataset was not designed to serve as a > > general-purpose unmaterialized dataframe. For example, the PyArrow > > Dataset constructor [5] exposes options for specifying a list of > > source files and a partitioning scheme, which are irrelevant for many > > of the applications that Will anticipates. And some work is needed to > > reconcile the methods of the PyArrow Dataset object [6] with the > > methods of the Table object. Some methods like filter() are exposed by > > both and behave lazily on Datasets and eagerly on Tables, as a user > > might expect. But many other Table methods are not implemented for > > Dataset though they potentially could be, and it is unclear where we > > should draw the line between adding methods to Dataset vs. encouraging > > new scanner implementations to expose options controlling what lazy > > operations should be performed as they see fit. > > > > Will, I see that you've already addressed this issue to some extent in > > your proposal. For example, you mention that we should initially > > define this protocol to include only a minimal subset of the Dataset > > API. I agree, but I think there are some loose ends we should be > > careful to tie up. I strongly agree with the comments made by David, > > Weston, and Dewey arguing that we should avoid any use of PyArrow > > expressions in this API. Expressions are an implementation detail of > > PyArrow, not a part of the Arrow standard. It would be much safer for > > the initial version of this protocol to not define *any* > > methods/arguments that take expressions. This will allow us to take > > some more time to finish up the Substrait expression implementation > > work that is underway [7][8], then introduce Substrait-based > > expressions in a latter version of this protocol. This approach will > > better position this protocol to be implemented in other languages > > besides Python. > > > > Another concern I have is that we have not fully explained why we want > > to use Dataset instead of RecordBatchReader [9] as the basis of this > > protocol. I would like to see an explanation of why RecordBatchReader > > is not sufficient for this. RecordBatchReader seems like another > > possible way to represent "unmaterialized dataframes" and there are > > some parallels between RecordBatch/RecordBatchReader and > > Fragment/Dataset. We should help developers and users understand why > > Arrow needs both of these. > > > > Thanks Will for your thoughtful prose explanations about this proposed > > API. After we arrive at a decision about this, I think we should > > reproduce some of these explanations in docs, blog posts, cookbook > > recipes, etc. because there is some important nuance here that will be > > important for integrators of this API to understand. > > > > Ian > > > > [1] https://arrow.apache.org/docs/cpp/api/dataset.html > > [2] https://arrow.apache.org/docs/python/dataset.html > > [3] https://arrow.apache.org/docs/java/dataset.html > > [4] https://arrow.apache.org/docs/r/articles/dataset.html > > [5] > > > https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html#pyarrow.dataset.dataset > > [6] > > > https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html > > [7] https://github.com/apache/arrow/issues/33985 > > [8] https://github.com/apache/arrow/issues/34252 > > [9] > > > https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html > > > > On Wed, Jun 21, 2023 at 2:09 PM Will Jones <will.jones...@gmail.com> > > wrote: > > > > > > Hello Arrow devs, > > > > > > I have drafted a PR defining an experimental protocol which would allow > > > third-party libraries to imitate the PyArrow Dataset API [5]. This > > protocol > > > is intended to endorse an integration pattern that is starting to be > used > > > in the Python ecosystem, where some libraries are providing their own > > > scanners with this API, while query engines are accepting these as > > > duck-typed objects. > > > > > > To give some background: back at the end of 2021, we collaborated with > > > DuckDB to be able to read datasets (an Arrow C++ concept), supporting > > > column selection and filter pushdown. This was accomplished by having > > > DuckDB manipulating Python (or R) objects to get a RecordBatchReader > and > > > then exporting over the C Stream Interface. > > > > > > Since then, DataFusion [2] and Polars have both made similar > > > implementations for their Python bindings, allowing them to consume > > PyArrow > > > datasets. This has created an implicit protocol, whereby arbitrary > > compute > > > engines can push down queries into the PyArrow dataset scanner. > > > > > > Now, libraries supporting table formats including Delta Lake, Lance, > and > > > Iceberg are looking to be able to support these engines, while bringing > > > their own scanners and metadata handling implementations. One possible > > > route is allowing them to imitate the PyArrow datasets API. > > > > > > Bringing these use cases together, I'd like to propose an experimental > > > protocol, made out of the minimal subset of the PyArrow Dataset API > > > necessary to facilitate this kind of integration. This would allow any > > > library to produce a scanner implementation and that arbitrary query > > > engines could call into. I've drafted a PR [3] and there is some > > background > > > research available in a google doc [4]. > > > > > > I've already gotten some good feedback on both, and would welcome more. > > > > > > One last point: I'd like for this to be a first step rather than a > > > comprehensive API. This PR focuses on making explicit a protocol that > is > > > already in use in the ecosystem, but without much concrete definition. > > Once > > > this is established, we can use our experience from this protocol to > > design > > > something more permanent that takes advantage of newer innovations in > the > > > Arrow ecosystem (such as the PyCapsule for C Data Interface or > > > Substrait for passing expressions / scan plans). I am tracking such > > future > > > improvements in [5]. > > > > > > Best, > > > > > > Will Jones > > > > > > [1] https://duckdb.org/2021/12/03/duck-arrow.html > > > [2] https://github.com/apache/arrow-datafusion-python/pull/9 > > > [3] https://github.com/apache/arrow/pull/35568 > > > [4] > > > > > > https://docs.google.com/document/d/1r56nt5Un2E7yPrZO9YPknBN4EDtptpx-tqOZReHvq1U/edit?pli=1 > > > [5] > > > > > > https://docs.google.com/document/d/1-uVkSZeaBtOALVbqMOPeyV3s2UND7Wl-IGEZ-P-gMXQ/edit > > >