That wouldn't remove the feature from DuckDB, would it? It would just mean that we recognize that PyArrow expressions don't have well-defined semantics that we are committing to at this time. As long as we have `**kwargs` everywhere, we can in the future introduce a `substrait_filter_expression` or similar argument, while allowing current implementors to handle `filter` if possible. (As a compromise, we could reserve `filter` and existing arguments and note that PyArrow Expression semantics are subject to change without notice?)
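To make that escape hatch concrete, here is a minimal sketch assuming a hypothetical producer class; `TableBackedDataset` and its refuse-unknown-kwargs behavior are illustrative, not part of any existing API:

```python
# Sketch of forward compatibility via **kwargs. `TableBackedDataset` is a
# hypothetical producer: it handles `columns` and `filter` today, and any
# unknown future argument (e.g. a `substrait_filter_expression`) lands in
# **kwargs and is rejected loudly rather than silently ignored.
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.dataset as ds


class TableBackedDataset:
    def __init__(self, table: pa.Table):
        self._table = table

    @property
    def schema(self) -> pa.Schema:
        return self._table.schema

    def scanner(self, columns=None, filter=None, **kwargs):
        # `filter` takes a PyArrow Expression, whose semantics would be
        # "subject to change without notice" under the compromise above.
        if kwargs:
            raise NotImplementedError(
                f"unsupported scanner arguments: {sorted(kwargs)}"
            )
        return ds.InMemoryDataset(self._table).scanner(
            columns=columns, filter=filter
        )


# Usage: a consumer pushes down a projection and an expression filter.
source = TableBackedDataset(pa.table({"x": [1, 2, 3], "y": ["a", "b", "c"]}))
print(source.scanner(columns=["x"], filter=pc.field("x") > 1).to_table())
```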
On Wed, Jun 28, 2023, at 13:38, Will Jones wrote:
> Hi Ian,
>
>> I favor option 2 out of concern that option 1 could create a
>> temptation for users of this protocol to depend on a feature that we
>> intend to deprecate.
>
> I would understand this objection more if DuckDB hadn't been relying on
> being able to pass PyArrow expressions for 18 months now [1]. Or do we
> just think this isn't widely used enough for us to care?
>
> Best,
> Will
>
> [1] https://duckdb.org/2021/12/03/duck-arrow.html
>
> On Tue, Jun 27, 2023 at 11:19 AM Ian Cook <ianmc...@apache.org> wrote:
>>> I think there are three routes we can go here:
>>>
>>> 1. We keep PyArrow expressions in the API initially, but once we have
>>> Substrait-based alternatives we deprecate the PyArrow expression
>>> support. This is what I intended with the current design, and I think
>>> it provides the most obvious migration paths for existing producers
>>> and consumers.
>>> 2. We keep the overall dataset API, but don't introduce the filter and
>>> projection arguments until we have Substrait support. I'm not sure
>>> what the migration path looks like for producers and consumers, but I
>>> think this just implicitly becomes the same as (1), but with worse
>>> documentation.
>>> 3. We write a protocol completely from scratch that doesn't try to
>>> describe the existing dataset API. Producers and consumers would then
>>> migrate to the new protocol and deprecate their existing dataset
>>> integrations. We could introduce a dunder method in that API (sort of
>>> like __arrow_array__) that would make the migration seamless from the
>>> end-user perspective.
>>>
>>> *Which do you all think is the best path forward?*
>>
>> I favor option 2 out of concern that option 1 could create a
>> temptation for users of this protocol to depend on a feature that we
>> intend to deprecate. I think option 2 also creates a stronger
>> motivation to complete the Substrait expression integration work,
>> which is underway in https://github.com/apache/arrow/pull/34834.
>>
>> Ian
>>
>> On Fri, Jun 23, 2023 at 1:25 PM Weston Pace <weston.p...@gmail.com> wrote:
>>>> The trouble is that Dataset was not designed to serve as a
>>>> general-purpose unmaterialized dataframe. For example, the PyArrow
>>>> Dataset constructor [5] exposes options for specifying a list of
>>>> source files and a partitioning scheme, which are irrelevant for many
>>>> of the applications that Will anticipates. And some work is needed to
>>>> reconcile the methods of the PyArrow Dataset object [6] with the
>>>> methods of the Table object. Some methods like filter() are exposed
>>>> by both and behave lazily on Datasets and eagerly on Tables, as a
>>>> user might expect. But many other Table methods are not implemented
>>>> for Dataset though they potentially could be, and it is unclear where
>>>> we should draw the line between adding methods to Dataset vs.
>>>> encouraging new scanner implementations to expose options controlling
>>>> what lazy operations should be performed as they see fit.
>>>
>>> In my mind there is a distinction between the "compute domain" (e.g. a
>>> pandas dataframe or something like Ibis or SQL) and the "data domain"
>>> (e.g. PyArrow datasets). I think, in a perfect world, you could push
>>> any and all compute up and down the chain as far as possible. However,
>>> in practice, I think there is a healthy set of tools and libraries
>>> that say "simple column projection and filtering is good enough". I
>>> would argue that there is room for both APIs, and while the temptation
>>> is always present to "shove as much compute as you can", I think
>>> PyArrow datasets seem to have found a balance between the two that
>>> users like.
>>>
>>> So I would argue that this protocol may never become a general-purpose
>>> unmaterialized dataframe, and that isn't necessarily a bad thing.
>>>
>>>> they are splittable and serializable, so that fragments can be
>>>> distributed amongst processes / workers.
>>>
>>> Just to clarify: the proposal currently only requires the fragments to
>>> be serializable, correct?
>>>
>>> On Fri, Jun 23, 2023 at 11:48 AM Will Jones <will.jones...@gmail.com> wrote:
>>>> Thanks Ian for your extensive feedback.
>>>>
>>>>> I strongly agree with the comments made by David, Weston, and Dewey
>>>>> arguing that we should avoid any use of PyArrow expressions in this
>>>>> API. Expressions are an implementation detail of PyArrow, not a part
>>>>> of the Arrow standard. It would be much safer for the initial
>>>>> version of this protocol to not define *any* methods/arguments that
>>>>> take expressions.
>>>>
>>>> I would agree with this point if we were starting from scratch. But
>>>> one of my goals is for this protocol to be descriptive of the
>>>> existing dataset integrations in the ecosystem, which all currently
>>>> rely on PyArrow expressions. For example, you'll notice in the PR
>>>> that there are unit tests to verify that the current PyArrow Dataset
>>>> classes conform to this protocol without changes.
>>>>
>>>> I think there are three routes we can go here:
>>>>
>>>> 1. We keep PyArrow expressions in the API initially, but once we have
>>>> Substrait-based alternatives we deprecate the PyArrow expression
>>>> support. This is what I intended with the current design, and I think
>>>> it provides the most obvious migration paths for existing producers
>>>> and consumers.
>>>> 2. We keep the overall dataset API, but don't introduce the filter
>>>> and projection arguments until we have Substrait support. I'm not
>>>> sure what the migration path looks like for producers and consumers,
>>>> but I think this just implicitly becomes the same as (1), but with
>>>> worse documentation.
>>>> 3. We write a protocol completely from scratch that doesn't try to
>>>> describe the existing dataset API. Producers and consumers would then
>>>> migrate to the new protocol and deprecate their existing dataset
>>>> integrations. We could introduce a dunder method in that API (sort of
>>>> like __arrow_array__) that would make the migration seamless from the
>>>> end-user perspective.
>>>>
>>>> *Which do you all think is the best path forward?*
>>>>
>>>>> Another concern I have is that we have not fully explained why we
>>>>> want to use Dataset instead of RecordBatchReader [9] as the basis of
>>>>> this protocol. I would like to see an explanation of why
>>>>> RecordBatchReader is not sufficient for this. RecordBatchReader
>>>>> seems like another possible way to represent "unmaterialized
>>>>> dataframes" and there are some parallels between
>>>>> RecordBatch/RecordBatchReader and Fragment/Dataset.
>>>>
>>>> This is a good point. I can add a section describing the differences.
>>>> The main ones I can think of are that: (1) Datasets are "pruneable":
>>>> one can select a subset of columns and apply a filter on rows to
>>>> avoid IO, and (2) they are splittable and serializable, so that
>>>> fragments can be distributed amongst processes / workers.
>>>>
>>>> Best,
>>>>
>>>> Will Jones
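To make (1) and (2) concrete, here is a small sketch against today's PyArrow dataset API; the data path is hypothetical:

```python
# Sketch of the two properties Will describes, using PyArrow's own dataset
# implementation. "data/" is a hypothetical directory of Parquet files.
import pickle

import pyarrow.compute as pc
import pyarrow.dataset as ds

dataset = ds.dataset("data/", format="parquet")

# (1) Pruneable: the projection and filter are pushed into the scan, so
# unneeded columns and files/row groups are skipped rather than read.
table = dataset.scanner(columns=["a"], filter=pc.field("b") > 0).to_table()

# (2) Splittable and serializable: each fragment can be pickled and shipped
# to a separate process or worker, which scans only its own piece.
for fragment in dataset.get_fragments():
    payload = pickle.dumps(fragment)         # send to a worker...
    part = pickle.loads(payload).to_table()  # ...which scans it locally
```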
>>>>
>>>> On Fri, Jun 23, 2023 at 10:48 AM Ian Cook <ianmc...@apache.org> wrote:
>>>>> Thanks Will for this proposal!
>>>>>
>>>>> For anyone familiar with PyArrow, this idea has a clear intuitive
>>>>> logic to it. It provides an expedient solution to the current lack
>>>>> of a practical means of interchanging "unmaterialized dataframes"
>>>>> between different Python libraries.
>>>>>
>>>>> To elaborate on that: if you look at how people use the Arrow
>>>>> Dataset API (which is implemented in the Arrow C++ library [1] and
>>>>> has bindings not just for Python [2] but also for Java [3] and R
>>>>> [4]), you'll see that Dataset is often used simply as a "virtual"
>>>>> variant of Table. It is used in cases where the data is larger than
>>>>> memory or where it is desirable to defer reading (materializing) the
>>>>> data into memory.
>>>>>
>>>>> So we can think of a Table as a materialized dataframe and a Dataset
>>>>> as an unmaterialized dataframe. That aspect of Dataset is, I think,
>>>>> what makes it most attractive as a protocol for enabling
>>>>> interoperability: it allows libraries to easily "speak Arrow" in
>>>>> cases where materializing the full data in memory upfront is
>>>>> impossible or undesirable.
>>>>>
>>>>> The trouble is that Dataset was not designed to serve as a
>>>>> general-purpose unmaterialized dataframe. For example, the PyArrow
>>>>> Dataset constructor [5] exposes options for specifying a list of
>>>>> source files and a partitioning scheme, which are irrelevant for
>>>>> many of the applications that Will anticipates. And some work is
>>>>> needed to reconcile the methods of the PyArrow Dataset object [6]
>>>>> with the methods of the Table object. Some methods like filter() are
>>>>> exposed by both and behave lazily on Datasets and eagerly on Tables,
>>>>> as a user might expect. But many other Table methods are not
>>>>> implemented for Dataset though they potentially could be, and it is
>>>>> unclear where we should draw the line between adding methods to
>>>>> Dataset vs. encouraging new scanner implementations to expose
>>>>> options controlling what lazy operations should be performed as they
>>>>> see fit.
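The lazy/eager split Ian describes here is easy to demonstrate (a sketch; `Dataset.filter()` is only available in recent PyArrow releases):

```python
# Sketch: filter() computes eagerly on a Table, lazily on a Dataset.
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.dataset as ds

table = pa.table({"x": [1, 2, 3]})

eager = table.filter(pc.field("x") > 1)  # rows are materialized right here
lazy = ds.InMemoryDataset(table).filter(pc.field("x") > 1)  # only recorded

assert eager == lazy.to_table()  # the actual scan happens on this line
```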
>>>>>
>>>>> Will, I see that you've already addressed this issue to some extent
>>>>> in your proposal. For example, you mention that we should initially
>>>>> define this protocol to include only a minimal subset of the Dataset
>>>>> API. I agree, but I think there are some loose ends we should be
>>>>> careful to tie up. I strongly agree with the comments made by David,
>>>>> Weston, and Dewey arguing that we should avoid any use of PyArrow
>>>>> expressions in this API. Expressions are an implementation detail of
>>>>> PyArrow, not a part of the Arrow standard. It would be much safer
>>>>> for the initial version of this protocol to not define *any*
>>>>> methods/arguments that take expressions. This will allow us to take
>>>>> some more time to finish up the Substrait expression implementation
>>>>> work that is underway [7][8], then introduce Substrait-based
>>>>> expressions in a later version of this protocol. This approach will
>>>>> better position this protocol to be implemented in other languages
>>>>> besides Python.
>>>>>
>>>>> Another concern I have is that we have not fully explained why we
>>>>> want to use Dataset instead of RecordBatchReader [9] as the basis of
>>>>> this protocol. I would like to see an explanation of why
>>>>> RecordBatchReader is not sufficient for this. RecordBatchReader
>>>>> seems like another possible way to represent "unmaterialized
>>>>> dataframes", and there are some parallels between
>>>>> RecordBatch/RecordBatchReader and Fragment/Dataset. We should help
>>>>> developers and users understand why Arrow needs both of these.
>>>>>
>>>>> Thanks Will for your thoughtful prose explanations of this proposed
>>>>> API. After we arrive at a decision about this, I think we should
>>>>> reproduce some of these explanations in docs, blog posts, cookbook
>>>>> recipes, etc., because there is nuance here that will be important
>>>>> for integrators of this API to understand.
>>>>>
>>>>> Ian
>>>>>
>>>>> [1] https://arrow.apache.org/docs/cpp/api/dataset.html
>>>>> [2] https://arrow.apache.org/docs/python/dataset.html
>>>>> [3] https://arrow.apache.org/docs/java/dataset.html
>>>>> [4] https://arrow.apache.org/docs/r/articles/dataset.html
>>>>> [5] https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html#pyarrow.dataset.dataset
>>>>> [6] https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html
>>>>> [7] https://github.com/apache/arrow/issues/33985
>>>>> [8] https://github.com/apache/arrow/issues/34252
>>>>> [9] https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html
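One way to see the parallel (and the difference) Ian raises, as a sketch:

```python
# Sketch: a RecordBatchReader is a one-shot stream already "in flight",
# while a Dataset can still be pruned and split before any IO happens.
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"a": [1, 2, 3], "b": ["x", "y", "z"]})
dataset = ds.InMemoryDataset(table)

# A Dataset can hand out a fresh, pushdown-aware reader for every scan:
reader = dataset.scanner(columns=["a"]).to_reader()

# A RecordBatchReader, by contrast, can only be consumed once, front to
# back, and has no notion of projection, filtering, or fragments.
for batch in reader:
    print(batch.num_rows)
```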
>>>>>
>>>>> On Wed, Jun 21, 2023 at 2:09 PM Will Jones <will.jones...@gmail.com> wrote:
>>>>>> Hello Arrow devs,
>>>>>>
>>>>>> I have drafted a PR defining an experimental protocol that would
>>>>>> allow third-party libraries to imitate the PyArrow Dataset API [5].
>>>>>> This protocol is intended to endorse an integration pattern that is
>>>>>> starting to be used in the Python ecosystem, where some libraries
>>>>>> provide their own scanners with this API while query engines accept
>>>>>> these as duck-typed objects.
>>>>>>
>>>>>> To give some background: back at the end of 2021, we collaborated
>>>>>> with DuckDB to be able to read datasets (an Arrow C++ concept),
>>>>>> supporting column selection and filter pushdown [1]. This was
>>>>>> accomplished by having DuckDB manipulate Python (or R) objects to
>>>>>> get a RecordBatchReader and then exporting over the C Stream
>>>>>> Interface.
>>>>>>
>>>>>> Since then, DataFusion [2] and Polars have both made similar
>>>>>> implementations for their Python bindings, allowing them to consume
>>>>>> PyArrow datasets. This has created an implicit protocol, whereby
>>>>>> arbitrary compute engines can push down queries into the PyArrow
>>>>>> dataset scanner.
>>>>>>
>>>>>> Now, libraries supporting table formats including Delta Lake,
>>>>>> Lance, and Iceberg are looking to support these engines while
>>>>>> bringing their own scanners and metadata handling implementations.
>>>>>> One possible route is allowing them to imitate the PyArrow Dataset
>>>>>> API.
>>>>>>
>>>>>> Bringing these use cases together, I'd like to propose an
>>>>>> experimental protocol made up of the minimal subset of the PyArrow
>>>>>> Dataset API necessary to facilitate this kind of integration. This
>>>>>> would allow any library to produce a scanner implementation that
>>>>>> arbitrary query engines could call into. I've drafted a PR [3], and
>>>>>> there is some background research available in a Google doc [4].
>>>>>>
>>>>>> I've already gotten some good feedback on both, and would welcome
>>>>>> more.
>>>>>>
>>>>>> One last point: I'd like for this to be a first step rather than a
>>>>>> comprehensive API. This PR focuses on making explicit a protocol
>>>>>> that is already in use in the ecosystem, but without much concrete
>>>>>> definition. Once this is established, we can use our experience
>>>>>> from this protocol to design something more permanent that takes
>>>>>> advantage of newer innovations in the Arrow ecosystem (such as the
>>>>>> PyCapsule for the C Data Interface or Substrait for passing
>>>>>> expressions / scan plans). I am tracking such future improvements
>>>>>> in [5].
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Will Jones
>>>>>>
>>>>>> [1] https://duckdb.org/2021/12/03/duck-arrow.html
>>>>>> [2] https://github.com/apache/arrow-datafusion-python/pull/9
>>>>>> [3] https://github.com/apache/arrow/pull/35568
>>>>>> [4] https://docs.google.com/document/d/1r56nt5Un2E7yPrZO9YPknBN4EDtptpx-tqOZReHvq1U/edit?pli=1
>>>>>> [5] https://docs.google.com/document/d/1-uVkSZeaBtOALVbqMOPeyV3s2UND7Wl-IGEZ-P-gMXQ/edit
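For readers skimming the thread, one plausible shape of the minimal duck-typed surface under discussion is sketched below. The actual definition lives in the PR [3]; the method set here is illustrative, not authoritative, and `filter` is deliberately omitted because whether it should take PyArrow expressions or Substrait is exactly the open question above.

```python
# Illustrative sketch only; see the PR [3] for the real proposal. The idea:
# any object exposing this minimal subset of the PyArrow Dataset API can be
# consumed by engines such as DuckDB, DataFusion, or Polars via duck typing.
from typing import Any, Iterator, Optional, Protocol

import pyarrow as pa


class Scanner(Protocol):
    def to_reader(self) -> pa.RecordBatchReader:
        """Export the scan as a stream of Arrow record batches."""
        ...


class Fragment(Protocol):
    """A serializable piece of a dataset that can be scanned on its own."""

    def scanner(
        self, columns: Optional[list] = None, **kwargs: Any
    ) -> Scanner: ...


class Dataset(Protocol):
    @property
    def schema(self) -> pa.Schema: ...

    def get_fragments(self, **kwargs: Any) -> Iterator[Fragment]: ...

    def scanner(
        self, columns: Optional[list] = None, **kwargs: Any
    ) -> Scanner: ...
```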