Thanks for pointing that out, Dane. I think Dask is an obvious candidate for consuming this protocol.
On Fri, Sep 1, 2023 at 10:13 AM Dane Pitkin <d...@voltrondata.com.invalid> wrote:

> The Python Substrait package [1] is on PyPI [2] and currently has Python wrappers for the Substrait protobuf objects. I think this will be a great opportunity to identify helper features that users of this protocol would like to see. I'll be keeping an eye out as this develops, but also feel free to file feature requests in the project!
>
> [1] https://github.com/substrait-io/substrait-python
> [2] https://pypi.org/project/substrait/
>
> On Thu, Aug 31, 2023 at 10:05 PM Will Jones <will.jones...@gmail.com> wrote:
>
> > Hello Arrow devs,
> >
> > We discussed this further in the Arrow community call on 2023-08-30 [1] and concluded we should create an entirely new protocol that uses Substrait expressions. I have created an issue [2] to track this and will start a PR soon.
> >
> > It does look like we might block this on creating a PyCapsule-based protocol for arrays, schemas, and streams. That is tracked here [3]. Hopefully that isn't too ambitious :)
> >
> > Best,
> >
> > Will Jones
> >
> > [1] https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/edit
> > [2] https://github.com/apache/arrow/issues/37504
> > [3] https://github.com/apache/arrow/issues/35531
> >
> > On Tue, Aug 29, 2023 at 2:59 PM Ian Cook <ianmc...@apache.org> wrote:
> >
> > > An update about this:
> > >
> > > Weston's PR https://github.com/apache/arrow/pull/34834/ merged last week. This makes it possible to convert PyArrow expressions to/from Substrait expressions.
> > >
> > > As Fokko previously noted, the PR does not change the PyArrow Dataset interface at all. It simply enables a Substrait expression to be converted to a PyArrow expression, which can then be used to filter/project a Dataset.
> > >
> > > There is a basic example demonstrating this here: https://gist.github.com/ianmcook/f70fc185d29ae97bdf85ffe0378c68e0
> > >
> > > We might now consider whether to build upon this to create a Dataset protocol that is independent of the PyArrow Expression implementation and that could interoperate across languages.
> > >
> > > Ian
> > >
> > > On Mon, Jul 3, 2023 at 5:48 PM Will Jones <will.jones...@gmail.com> wrote:
> > >
> > > > Hello,
> > > >
> > > > After thinking about it, I think I understand the approach David Li and Ian are suggesting with respect to expressions. There will be some arguments that only PyArrow's own datasets support, but that aren't in the generic protocol. Passing PyArrow expressions to the filters argument should be considered one of those. DuckDB and others are currently passing them down, so they aren't yet using the protocol properly. But once we add support in the protocol for passing filters via Substrait expressions, we'll move DuckDB and others over to be fully compliant with the protocol.
> > > >
> > > > It's a bit of an awkward temporary state for now, but so would having PyArrow expressions in the protocol just to be deprecated in a few months. One caveat is that we'll need to provide DuckDB and other consumers with a way to tell whether a dataset supports passing filters as Substrait expressions or as PyArrow ones, since I doubt they'll want to lose support for integrating with older PyArrow versions.
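To make the expression conversion concrete, here is a minimal sketch of the round-trip that Ian's gist demonstrates, using the pyarrow.substrait module added by Weston's PR (exact names could still shift before the 14.0.0 release):

    import pyarrow as pa
    import pyarrow.compute as pc
    import pyarrow.substrait as pas

    schema = pa.schema([("x", pa.int64()), ("y", pa.float64())])

    # Serialize a PyArrow expression, bound to a schema, into a
    # Substrait ExtendedExpression message (returned as a Buffer).
    buf = pas.serialize_expressions([pc.field("x") > 10], ["filter"], schema)

    # A consumer round-trips it back into a native PyArrow expression.
    bound = pas.deserialize_expressions(buf)
    filter_expr = bound.expressions["filter"]

With this in place, a Substrait-aware producer and consumer can exchange only the serialized form, so neither side needs the other's expression implementation.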
> > > > I've removed filters from the protocol for now, with the intention of bringing them back as soon as we can get Substrait support. I think we can do this in the 14.0.0 release.
> > > >
> > > > Best,
> > > >
> > > > Will Jones
> > > >
> > > > On Mon, Jul 3, 2023 at 7:45 AM Fokko Driesprong <fo...@apache.org> wrote:
> > > >
> > > > > Hey everyone,
> > > > >
> > > > > Chiming in here from the PyIceberg side. I would love to see the protocol as proposed in the PR. I did a small test <https://github.com/apache/arrow/pull/35568#pullrequestreview-1480259722>, and it seems quite straightforward to implement, and it brings a lot of potential. Unsurprisingly, I am leaning toward the first option:
> > > > >
> > > > > > 1. We keep PyArrow expressions in the API initially, but once we have Substrait-based alternatives we deprecate the PyArrow expression support. This is what I intended with the current design, and I think it provides the most obvious migration paths for existing producers and consumers.
> > > > >
> > > > > Let me give my vision on some of the concerns raised.
> > > > >
> > > > > > Will, I see that you've already addressed this issue to some extent in your proposal. For example, you mention that we should initially define this protocol to include only a minimal subset of the Dataset API. I agree, but I think there are some loose ends we should be careful to tie up. I strongly agree with the comments made by David, Weston, and Dewey arguing that we should avoid any use of PyArrow expressions in this API. Expressions are an implementation detail of PyArrow, not a part of the Arrow standard. It would be much safer for the initial version of this protocol to not define *any* methods/arguments that take expressions. This will allow us to take some more time to finish up the Substrait expression implementation work that is underway [7][8], then introduce Substrait-based expressions in a later version of this protocol. This approach will better position this protocol to be implemented in other languages besides Python.
> > > > >
> > > > > I'm confused here. Looking at GH-33985 <https://github.com/apache/arrow/pull/34834/files>, I don't see any new primitives being introduced for composing an expression. As I understand it, the PyArrow expression as it exists today will continue to exist. In the case of inter-process communication, it goes to Substrait and then gets deserialized into the native expression construct (in PyIceberg, a BoundPredicate). I would say that the protocol and Substrait are complementary.
> > > > >
> > > > > > Another concern I have is that we have not fully explained why we want to use Dataset instead of RecordBatchReader [9] as the basis of this protocol. I would like to see an explanation of why RecordBatchReader is not sufficient for this.
> > > > > > RecordBatchReader seems like another possible way to represent "unmaterialized dataframes" and there are some parallels between RecordBatch/RecordBatchReader and Fragment/Dataset. We should help developers and users understand why Arrow needs both of these.
> > > > >
> > > > > Just to clarify, I think there are different use cases. For example, Lance provides its own readers, but PyIceberg does not have any intent to provide its own Parquet readers. Iceberg will generate the list of files that need to be read and do the filtering/projection/deletes/etc. This would make the Dataset a better choice than the RecordBatchReader.
> > > > >
> > > > > > That wouldn't remove the feature from DuckDB, would it? It would just mean that we recognize that PyArrow expressions don't have well-defined semantics that we are committing to at this time. As long as we have `**kwargs` everywhere, we can in the future introduce a `substrait_filter_expression` or similar argument, while allowing current implementors to handle `filter` if possible. (As a compromise, we could reserve `filter` and existing arguments and note that PyArrow Expression semantics are subject to change without notice?)
> > > > >
> > > > > I think we can even re-use the existing filter argument. The signature would evolve from pc.Expression to Union[pc.Expression, pas.BoundExpressions]. In the case we get an expression, we'll convert it to Substrait.
> > > > >
> > > > > Concluding, I think we can do things in parallel, and I don't think they are conflicting. I'm happy to contribute on the PyArrow side to make this happen.
> > > > >
> > > > > Kind regards,
> > > > > Fokko
> > > > >
> > > > > On Wed, Jun 28, 2023 at 22:47, Will Jones <will.jones...@gmail.com> wrote:
> > > > >
> > > > > > > That wouldn't remove the feature from DuckDB, would it? It would just mean that we recognize that PyArrow expressions don't have well-defined semantics that we are committing to at this time.
> > > > > >
> > > > > > That's a fair point, David. I would be fine excluding it from the protocol initially and keeping the existing integrations in DuckDB, Polars, and DataFusion "secret" or "not officially supported" for the time being. At the very least, documenting the pattern to get an Arrow C stream will be a step forward.
> > > > > >
> > > > > > Best,
> > > > > >
> > > > > > Will Jones
> > > > > >
> > > > > > On Wed, Jun 28, 2023 at 12:35 PM Jonathan Keane <jke...@gmail.com> wrote:
> > > > > >
> > > > > > > > I would understand this objection more if DuckDB hadn't been relying on being able to pass PyArrow expressions for 18 months now [1]. Unless we just think this isn't widely used enough for us to care?
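Fokko's evolved filter signature could be handled entirely on the producer side, and David's `**kwargs` point fits the same sketch. The following is hypothetical, not part of the PR; the helper name and the class are made up for illustration:

    from typing import Any, Optional, Union

    import pyarrow.compute as pc
    import pyarrow.substrait as pas

    # The filter argument's type evolves without breaking older callers.
    FilterLike = Union[pc.Expression, pas.BoundExpressions, None]

    def _to_pyarrow_filter(filter: FilterLike) -> Optional[pc.Expression]:
        # PyArrow expressions (and None) pass through unchanged.
        if filter is None or isinstance(filter, pc.Expression):
            return filter
        # Otherwise, unwrap the expression carried in the Substrait form.
        return next(iter(filter.expressions.values()))

    class ExampleDataset:  # hypothetical producer
        def scanner(self, columns=None, filter: FilterLike = None,
                    **kwargs: Any):
            # **kwargs leaves room to introduce future protocol arguments
            # (e.g. a substrait_filter_expression) without breaking
            # existing implementors.
            expr = _to_pyarrow_filter(filter)
            ...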
> > > > > > > This isn't a pro or a con of specifically adopting the PyArrow expression semantics as is / with a warning about changing / not at all, but having some kind of standardization in this interface would be very nice. This even came up while collaborating with the DuckDB folks: using some of the expression bits here (and in the R equivalents) was a little bit odd, and having something like a proper API for that would have made it more natural (and likely that would have been used had it existed 18 months ago :))
> > > > > > >
> > > > > > > -Jon
> > > > > > >
> > > > > > > On Wed, Jun 28, 2023 at 1:17 PM David Li <lidav...@apache.org> wrote:
> > > > > > >
> > > > > > > > That wouldn't remove the feature from DuckDB, would it? It would just mean that we recognize that PyArrow expressions don't have well-defined semantics that we are committing to at this time. As long as we have `**kwargs` everywhere, we can in the future introduce a `substrait_filter_expression` or similar argument, while allowing current implementors to handle `filter` if possible. (As a compromise, we could reserve `filter` and existing arguments and note that PyArrow Expression semantics are subject to change without notice?)
> > > > > > > >
> > > > > > > > On Wed, Jun 28, 2023, at 13:38, Will Jones wrote:
> > > > > > > >
> > > > > > > > > Hi Ian,
> > > > > > > > >
> > > > > > > > > > I favor option 2 out of concern that option 1 could create a temptation for users of this protocol to depend on a feature that we intend to deprecate.
> > > > > > > > >
> > > > > > > > > I would understand this objection more if DuckDB hadn't been relying on being able to pass PyArrow expressions for 18 months now [1]. Unless we just think this isn't widely used enough for us to care?
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Will
> > > > > > > > >
> > > > > > > > > [1] https://duckdb.org/2021/12/03/duck-arrow.html
> > > > > > > > >
> > > > > > > > > On Tue, Jun 27, 2023 at 11:19 AM Ian Cook <ianmc...@apache.org> wrote:
> > > > > > > > >
> > > > > > > > > > > I think there are three routes we can go here:
> > > > > > > > > > >
> > > > > > > > > > > 1. We keep PyArrow expressions in the API initially, but once we have Substrait-based alternatives we deprecate the PyArrow expression support. This is what I intended with the current design, and I think it provides the most obvious migration paths for existing producers and consumers.
> > > > > > > > > > > 2. We keep the overall dataset API, but don't introduce the filter and projection arguments until we have Substrait support.
> > > > > > > > > > > I'm not sure what the migration path looks like for producers and consumers, but I think this just implicitly becomes the same as (1), but with worse documentation.
> > > > > > > > > > > 3. We write a protocol completely from scratch that doesn't try to describe the existing dataset API. Producers and consumers would then migrate to use the new protocol and deprecate their existing dataset integrations. We could introduce a dunder method in that API (sort of like __arrow_array__) that would make the migration seamless from the end-user perspective.
> > > > > > > > > > >
> > > > > > > > > > > *Which do you all think is the best path forward?*
> > > > > > > > > >
> > > > > > > > > > I favor option 2 out of concern that option 1 could create a temptation for users of this protocol to depend on a feature that we intend to deprecate. I think option 2 also creates a stronger motivation to complete the Substrait expression integration work, which is underway in https://github.com/apache/arrow/pull/34834.
> > > > > > > > > >
> > > > > > > > > > Ian
> > > > > > > > > >
> > > > > > > > > > On Fri, Jun 23, 2023 at 1:25 PM Weston Pace <weston.p...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > > The trouble is that Dataset was not designed to serve as a general-purpose unmaterialized dataframe. For example, the PyArrow Dataset constructor [5] exposes options for specifying a list of source files and a partitioning scheme, which are irrelevant for many of the applications that Will anticipates. And some work is needed to reconcile the methods of the PyArrow Dataset object [6] with the methods of the Table object. Some methods like filter() are exposed by both and behave lazily on Datasets and eagerly on Tables, as a user might expect. But many other Table methods are not implemented for Dataset though they potentially could be, and it is unclear where we should draw the line between adding methods to Dataset vs. encouraging new scanner implementations to expose options controlling what lazy operations should be performed as they see fit.
> > > > > > > > > > >
> > > > > > > > > > > In my mind there is a distinction between the "compute domain" (e.g. a pandas dataframe or something like Ibis or SQL) and the "data domain" (e.g. pyarrow datasets).
> > > > > > > > > > > I think, in a perfect world, you could push any and all compute up and down the chain as far as possible. However, in practice, I think there is a healthy set of tools and libraries that say "simple column projection and filtering is good enough". I would argue that there is room for both APIs, and while the temptation is always present to "shove as much compute as you can", I think pyarrow datasets have found a balance between the two that users like.
> > > > > > > > > > >
> > > > > > > > > > > So I would argue that this protocol may never become a general-purpose unmaterialized dataframe, and that isn't necessarily a bad thing.
> > > > > > > > > > >
> > > > > > > > > > > > they are splittable and serializable, so that fragments can be distributed amongst processes / workers.
> > > > > > > > > > >
> > > > > > > > > > > Just to clarify, the proposal currently only requires the fragments to be serializable, correct?
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Jun 23, 2023 at 11:48 AM Will Jones <will.jones...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Thanks, Ian, for your extensive feedback.
> > > > > > > > > > > >
> > > > > > > > > > > > > I strongly agree with the comments made by David, Weston, and Dewey arguing that we should avoid any use of PyArrow expressions in this API. Expressions are an implementation detail of PyArrow, not a part of the Arrow standard. It would be much safer for the initial version of this protocol to not define *any* methods/arguments that take expressions.
> > > > > > > > > > > >
> > > > > > > > > > > > I would agree with this point if we were starting from scratch. But one of my goals is for this protocol to be descriptive of the existing dataset integrations in the ecosystem, which all currently rely on PyArrow expressions. For example, you'll notice in the PR that there are unit tests to verify that the current PyArrow Dataset classes conform to this protocol, without changes.
> > > > > > > > > > > >
> > > > > > > > > > > > I think there are three routes we can go here:
> > > > > > > > > > > >
> > > > > > > > > > > > 1. We keep PyArrow expressions in the API initially, but once we have Substrait-based alternatives we deprecate the PyArrow expression support.
> > > > > > > > > > > > This is what I intended with the current design, and I think it provides the most obvious migration paths for existing producers and consumers.
> > > > > > > > > > > > 2. We keep the overall dataset API, but don't introduce the filter and projection arguments until we have Substrait support. I'm not sure what the migration path looks like for producers and consumers, but I think this just implicitly becomes the same as (1), but with worse documentation.
> > > > > > > > > > > > 3. We write a protocol completely from scratch that doesn't try to describe the existing dataset API. Producers and consumers would then migrate to use the new protocol and deprecate their existing dataset integrations. We could introduce a dunder method in that API (sort of like __arrow_array__) that would make the migration seamless from the end-user perspective.
> > > > > > > > > > > >
> > > > > > > > > > > > *Which do you all think is the best path forward?*
> > > > > > > > > > > >
> > > > > > > > > > > > > Another concern I have is that we have not fully explained why we want to use Dataset instead of RecordBatchReader [9] as the basis of this protocol. I would like to see an explanation of why RecordBatchReader is not sufficient for this. RecordBatchReader seems like another possible way to represent "unmaterialized dataframes" and there are some parallels between RecordBatch/RecordBatchReader and Fragment/Dataset.
> > > > > > > > > > > >
> > > > > > > > > > > > This is a good point. I can add a section describing the differences. The main ones I can think of are: (1) Datasets are "pruneable": one can select a subset of columns and apply a filter on rows to avoid IO, and (2) they are splittable and serializable, so that fragments can be distributed amongst processes / workers.
> > > > > > > > > > > >
> > > > > > > > > > > > Best,
> > > > > > > > > > > >
> > > > > > > > > > > > Will Jones
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Jun 23, 2023 at 10:48 AM Ian Cook <ianmc...@apache.org> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Thanks Will for this proposal!
> > > > > > > > > > > > >
> > > > > > > > > > > > > For anyone familiar with PyArrow, this idea has a clear intuitive logic to it.
> > > > > > > > > > > > > It provides an expedient solution to the current lack of a practical means for interchanging "unmaterialized dataframes" between different Python libraries.
> > > > > > > > > > > > >
> > > > > > > > > > > > > To elaborate on that: if you look at how people use the Arrow Dataset API (which is implemented in the Arrow C++ library [1] and has bindings not just for Python [2] but also for Java [3] and R [4]), you'll see that Dataset is often used simply as a "virtual" variant of Table. It is used in cases when the data is larger than memory or when it is desirable to defer reading (materializing) the data into memory.
> > > > > > > > > > > > >
> > > > > > > > > > > > > So we can think of a Table as a materialized dataframe and a Dataset as an unmaterialized dataframe. That aspect of Dataset is, I think, what makes it most attractive as a protocol for enabling interoperability: it allows libraries to easily "speak Arrow" in cases where materializing the full data in memory upfront is impossible or undesirable.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The trouble is that Dataset was not designed to serve as a general-purpose unmaterialized dataframe. For example, the PyArrow Dataset constructor [5] exposes options for specifying a list of source files and a partitioning scheme, which are irrelevant for many of the applications that Will anticipates. And some work is needed to reconcile the methods of the PyArrow Dataset object [6] with the methods of the Table object. Some methods like filter() are exposed by both and behave lazily on Datasets and eagerly on Tables, as a user might expect. But many other Table methods are not implemented for Dataset though they potentially could be, and it is unclear where we should draw the line between adding methods to Dataset vs. encouraging new scanner implementations to expose options controlling what lazy operations should be performed as they see fit.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Will, I see that you've already addressed this issue to some extent in
For example, you mention that we > should > > > > > > initially > > > > > > > > >> > > > define this protocol to include only a minimal > subset > > > of the > > > > > > > > Dataset > > > > > > > > >> > > > API. I agree, but I think there are some loose ends > we > > > > > should > > > > > > be > > > > > > > > >> > > > careful to tie up. I strongly agree with the > comments > > > made > > > > > by > > > > > > > > David, > > > > > > > > >> > > > Weston, and Dewey arguing that we should avoid any > use > > > of > > > > > > > PyArrow > > > > > > > > >> > > > expressions in this API. Expressions are an > > > implementation > > > > > > > detail > > > > > > > > of > > > > > > > > >> > > > PyArrow, not a part of the Arrow standard. It would > be > > > much > > > > > > > safer > > > > > > > > for > > > > > > > > >> > > > the initial version of this protocol to not define > > *any* > > > > > > > > >> > > > methods/arguments that take expressions. This will > > > allow us > > > > > to > > > > > > > > take > > > > > > > > >> > > > some more time to finish up the Substrait expression > > > > > > > > implementation > > > > > > > > >> > > > work that is underway [7][8], then introduce > > > Substrait-based > > > > > > > > >> > > > expressions in a latter version of this protocol. > This > > > > > > approach > > > > > > > > will > > > > > > > > >> > > > better position this protocol to be implemented in > > other > > > > > > > languages > > > > > > > > >> > > > besides Python. > > > > > > > > >> > > > > > > > > > > > >> > > > Another concern I have is that we have not fully > > > explained > > > > > why > > > > > > > we > > > > > > > > >> want > > > > > > > > >> > > > to use Dataset instead of RecordBatchReader [9] as > the > > > basis > > > > > > of > > > > > > > > this > > > > > > > > >> > > > protocol. I would like to see an explanation of why > > > > > > > > RecordBatchReader > > > > > > > > >> > > > is not sufficient for this. RecordBatchReader seems > > like > > > > > > another > > > > > > > > >> > > > possible way to represent "unmaterialized > dataframes" > > > and > > > > > > there > > > > > > > > are > > > > > > > > >> > > > some parallels between RecordBatch/RecordBatchReader > > and > > > > > > > > >> > > > Fragment/Dataset. We should help developers and > users > > > > > > understand > > > > > > > > why > > > > > > > > >> > > > Arrow needs both of these. > > > > > > > > >> > > > > > > > > > > > >> > > > Thanks Will for your thoughtful prose explanations > > about > > > > > this > > > > > > > > >> proposed > > > > > > > > >> > > > API. After we arrive at a decision about this, I > think > > > we > > > > > > should > > > > > > > > >> > > > reproduce some of these explanations in docs, blog > > > posts, > > > > > > > cookbook > > > > > > > > >> > > > recipes, etc. because there is some important nuance > > > here > > > > > that > > > > > > > > will > > > > > > > > >> be > > > > > > > > >> > > > important for integrators of this API to understand. 
> > > > > > > > > > > > > On Wed, Jun 21, 2023 at 2:09 PM Will Jones <will.jones...@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hello Arrow devs,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I have drafted a PR defining an experimental protocol which would allow third-party libraries to imitate the PyArrow Dataset API [5]. This protocol is intended to endorse an integration pattern that is starting to be used in the Python ecosystem, where some libraries are providing their own scanners with this API, while query engines are accepting these as duck-typed objects.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > To give some background: back at the end of 2021, we collaborated with DuckDB to be able to read datasets (an Arrow C++ concept), supporting column selection and filter pushdown [1]. This was accomplished by having DuckDB manipulate Python (or R) objects to get a RecordBatchReader and then export it over the C Stream Interface.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Since then, DataFusion [2] and Polars have both made similar implementations for their Python bindings, allowing them to consume PyArrow datasets. This has created an implicit protocol, whereby arbitrary compute engines can push down queries into the PyArrow dataset scanner.
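Concretely, the implicit protocol looks like this from the consumer's side. A sketch assuming DuckDB's Python replacement scans and a hypothetical local "data/" directory of Parquet files with columns x and y:

    import duckdb
    import pyarrow.dataset as ds

    dataset = ds.dataset("data/", format="parquet")

    # DuckDB's replacement scan resolves the name "dataset" to the Python
    # object above, pushes the projection (x) and filter (y > 0) down into
    # the dataset's scanner, and consumes the resulting record batches
    # over the Arrow C Stream Interface.
    con = duckdb.connect()
    table = con.execute("SELECT x FROM dataset WHERE y > 0").arrow()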
> > > > > > > > > > > > > > Now, libraries supporting table formats including Delta Lake, Lance, and Iceberg are looking to support these engines while bringing their own scanners and metadata-handling implementations. One possible route is allowing them to imitate the PyArrow Dataset API.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Bringing these use cases together, I'd like to propose an experimental protocol, made out of the minimal subset of the PyArrow Dataset API necessary to facilitate this kind of integration. This would allow any library to produce a scanner implementation that arbitrary query engines could call into. I've drafted a PR [3], and there is some background research available in a Google doc [4].
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I've already gotten some good feedback on both, and would welcome more.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > One last point: I'd like for this to be a first step rather than a comprehensive API. This PR focuses on making explicit a protocol that is already in use in the ecosystem, but without much concrete definition. Once this is established, we can use our experience from this protocol to design something more permanent that takes advantage of newer innovations in the Arrow ecosystem (such as PyCapsules for the C Data Interface, or Substrait for passing expressions / scan plans). I am tracking such future improvements in [5].
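To give a sense of how small that minimal subset could be, here is a duck-typed sketch of the shape under discussion, written with typing.Protocol. This is only an illustration; the exact classes, methods, and signatures are what the PR [3] is meant to pin down:

    from typing import Any, Iterator, Optional, Protocol

    import pyarrow as pa

    class Scanner(Protocol):
        # Hands the data off as an Arrow stream for the engine to consume.
        def to_reader(self) -> pa.RecordBatchReader: ...

    class Fragment(Protocol):
        # A serializable unit of work that can be shipped to a worker.
        def scanner(self, columns: Optional[list] = None,
                    filter: Any = None, **kwargs: Any) -> Scanner: ...

    class Dataset(Protocol):
        schema: pa.Schema

        def get_fragments(self, filter: Any = None) -> Iterator[Fragment]: ...
        def scanner(self, columns: Optional[list] = None,
                    filter: Any = None, **kwargs: Any) -> Scanner: ...

PyArrow's own pyarrow.dataset.Dataset already fits this shape, which matches the note above that the current PyArrow Dataset classes conform to the protocol without changes.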
> > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Will Jones
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > [1] https://duckdb.org/2021/12/03/duck-arrow.html
> > > > > > > > > > > > > > [2] https://github.com/apache/arrow-datafusion-python/pull/9
> > > > > > > > > > > > > > [3] https://github.com/apache/arrow/pull/35568
> > > > > > > > > > > > > > [4] https://docs.google.com/document/d/1r56nt5Un2E7yPrZO9YPknBN4EDtptpx-tqOZReHvq1U/edit?pli=1
> > > > > > > > > > > > > > [5] https://docs.google.com/document/d/1-uVkSZeaBtOALVbqMOPeyV3s2UND7Wl-IGEZ-P-gMXQ/edit