Re: [Python][Discuss] PyArrow Dataset as a Python protocol

Weston Pace Fri, 23 Jun 2023 13:25:37 -0700

> The trouble is that Dataset was not designed to serve as a
> general-purpose unmaterialized dataframe. For example, the PyArrow
> Dataset constructor [5] exposes options for specifying a list of
> source files and a partitioning scheme, which are irrelevant for many
> of the applications that Will anticipates. And some work is needed to
> reconcile the methods of the PyArrow Dataset object [6] with the
> methods of the Table object. Some methods like filter() are exposed by
> both and behave lazily on Datasets and eagerly on Tables, as a user
> might expect. But many other Table methods are not implemented for
> Dataset though they potentially could be, and it is unclear where we
> should draw the line between adding methods to Dataset vs. encouraging
> new scanner implementations to expose options controlling what lazy
> operations should be performed as they see fit.


In my mind there is a distinction between the "compute domain" (e.g. a
pandas dataframe or something like ibis or SQL) and the "data domain" (e.g.
pyarrow datasets).  I think, in a perfect world, you could push any and all
compute up and down the chain as far as possible.  However, in practice, I
think there is a healthy set of tools and libraries that say "simple column
projection and filtering is good enough".  I would argue that there is room
for both APIs, and while the temptation is always present to "shove as much
compute as you can", I think pyarrow datasets have found a balance
between the two that users like.

So I would argue that this protocol may never become a general-purpose
unmaterialized dataframe and that isn't necessarily a bad thing.
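
To make the compute/data split concrete, here is a minimal sketch in plain
Python. Everything in it is hypothetical (the Scannable protocol, the scan()
method, InMemorySource, total_fare) and none of it is the PyArrow API or the
protocol in the PR. In particular, the filter is a plain Python callable
standing in for whatever expression type the protocol ends up using. The
point is only that the producer handles projection and filtering, and the
consumer does everything else.

```python
from typing import Any, Callable, Iterator, Optional, Protocol, Sequence

Row = dict[str, Any]


class Scannable(Protocol):
    """Hypothetical 'data domain' contract: projection and filtering only."""

    def scan(
        self,
        columns: Optional[Sequence[str]] = None,
        predicate: Optional[Callable[[Row], bool]] = None,
    ) -> Iterator[Row]: ...


class InMemorySource:
    """Toy producer; a real one would use the arguments to skip IO."""

    def __init__(self, rows: list[Row]) -> None:
        self._rows = rows

    def scan(self, columns=None, predicate=None):
        for row in self._rows:
            if predicate is not None and not predicate(row):
                continue  # row filtering is handled by the producer
            if columns is None:
                yield dict(row)
            else:
                # column projection is handled by the producer
                yield {name: row[name] for name in columns}


def total_fare(source: Scannable) -> float:
    """Toy 'compute domain' consumer: pushes down the basics, then aggregates."""
    rows = source.scan(columns=["fare"], predicate=lambda r: r["city"] == "NYC")
    return sum(r["fare"] for r in rows)


source = InMemorySource([
    {"city": "NYC", "fare": 10.0, "tip": 2.0},
    {"city": "SF", "fare": 20.0, "tip": 1.0},
    {"city": "NYC", "fare": 5.5, "tip": 0.0},
])
result = total_fare(source)
```

The aggregation lives entirely in the consumer, so the producer never needs
to know what a sum is. Any object with a compatible scan() method would work
in place of InMemorySource.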

> they are splittable and serializable, so that fragments can be distributed
> amongst processes / workers.

Just to clarify, the proposal currently only requires the fragments to be
serializable, correct?
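
For reference, this is roughly what I mean by "only the fragments are
serializable", as a sketch in plain Python. ToyFragment, plan_fragments, and
worker are hypothetical names, and JSON is just a convenient stand-in for
whatever serialization a real implementation would use. The planning step
stays on the coordinator; only the small fragment descriptions cross the
process boundary.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ToyFragment:
    """Hypothetical fragment: describes a slice of data rather than holding it."""

    path: str
    row_start: int
    row_stop: int

    def read(self) -> list[int]:
        # Stand-in for real IO: return the row ids this fragment covers.
        return list(range(self.row_start, self.row_stop))


def plan_fragments(path: str, total_rows: int, rows_per_fragment: int) -> list[ToyFragment]:
    """Coordinator side: split a (pretend) file into independent fragments."""
    return [
        ToyFragment(path, start, min(start + rows_per_fragment, total_rows))
        for start in range(0, total_rows, rows_per_fragment)
    ]


def worker(payload: str) -> int:
    """Worker side: rebuild the fragment from its serialized form, then scan it."""
    fragment = ToyFragment(**json.loads(payload))
    return len(fragment.read())


fragments = plan_fragments("data.parquet", total_rows=10, rows_per_fragment=4)
payloads = [json.dumps(asdict(f)) for f in fragments]  # what gets sent to workers
row_counts = [worker(p) for p in payloads]
```

Here the workers are ordinary function calls, but each payload is a plain
string, so it could just as well be sent to another process or machine.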

On Fri, Jun 23, 2023 at 11:48 AM Will Jones <[email protected]> wrote:

> Thanks Ian for your extensive feedback.
>
> > I strongly agree with the comments made by David,
> > Weston, and Dewey arguing that we should avoid any use of PyArrow
> > expressions in this API. Expressions are an implementation detail of
> > PyArrow, not a part of the Arrow standard. It would be much safer for
> > the initial version of this protocol to not define *any*
> > methods/arguments that take expressions.
> >
>
> I would agree with this point, if we were starting from scratch. But one of
> my goals is for this protocol to be descriptive of the existing dataset
> integrations in the ecosystem, which all currently rely on PyArrow
> expressions. For example, you'll notice in the PR that there are unit tests
> to verify the current PyArrow Dataset classes conform to this protocol,
> without changes.
>
> I think there are three routes we can go here:
>
> 1. We keep PyArrow expressions in the API initially, but once we have
> Substrait-based alternatives we deprecate the PyArrow expression support.
> This is what I intended with the current design, and I think it provides
> the most obvious migration paths for existing producers and consumers.
> 2. We keep the overall dataset API, but don't introduce the filter and
> projection arguments until we have Substrait support. I'm not sure what the
> migration path looks like for producers and consumers, but I think this
> just implicitly becomes the same as (1), but with worse documentation.
> 3. We write a protocol completely from scratch, that doesn't try to
> describe the existing dataset API. Producers and consumers would then
> migrate to use the new protocol and deprecate their existing dataset
> integrations. We could introduce a dunder method in that API (sort of like
> __arrow_array__) that would make the migration seamless from the end-user
> perspective.
>
> *Which do you all think is the best path forward?*
>
> > Another concern I have is that we have not fully explained why we want
> > to use Dataset instead of RecordBatchReader [9] as the basis of this
> > protocol. I would like to see an explanation of why RecordBatchReader
> > is not sufficient for this. RecordBatchReader seems like another
> > possible way to represent "unmaterialized dataframes" and there are
> > some parallels between RecordBatch/RecordBatchReader and
> > Fragment/Dataset.
> >
>
> This is a good point. I can add a section describing the differences. The
> main ones I can think of are that: (1) Datasets are "pruneable": one can
> select a subset of columns and apply a filter on rows to avoid IO and (2)
> they are splittable and serializable, so that fragments can be distributed
> amongst processes / workers.
>
> Best,
>
> Will Jones
>
> On Fri, Jun 23, 2023 at 10:48 AM Ian Cook <[email protected]> wrote:
>
> > Thanks Will for this proposal!
> >
> > For anyone familiar with PyArrow, this idea has a clear intuitive
> > logic to it. It provides an expedient solution to the current lack of
> > a practical means for interchanging "unmaterialized dataframes"
> > between different Python libraries.
> >
> > To elaborate on that: If you look at how people use the Arrow Dataset
> > API—which is implemented in the Arrow C++ library [1] and has bindings
> > not just for Python [2] but also for Java [3] and R [4]—you'll see
> > that Dataset is often used simply as a "virtual" variant of Table. It
> > is used in cases when the data is larger than memory or when it is
> > desirable to defer reading (materializing) the data into memory.
> >
> > So we can think of a Table as a materialized dataframe and a Dataset
> > as an unmaterialized dataframe. That aspect of Dataset is I think what
> > makes it most attractive as a protocol for enabling interoperability:
> > it allows libraries to easily "speak Arrow" in cases where
> > materializing the full data in memory upfront is impossible or
> > undesirable.
> >
> > The trouble is that Dataset was not designed to serve as a
> > general-purpose unmaterialized dataframe. For example, the PyArrow
> > Dataset constructor [5] exposes options for specifying a list of
> > source files and a partitioning scheme, which are irrelevant for many
> > of the applications that Will anticipates. And some work is needed to
> > reconcile the methods of the PyArrow Dataset object [6] with the
> > methods of the Table object. Some methods like filter() are exposed by
> > both and behave lazily on Datasets and eagerly on Tables, as a user
> > might expect. But many other Table methods are not implemented for
> > Dataset though they potentially could be, and it is unclear where we
> > should draw the line between adding methods to Dataset vs. encouraging
> > new scanner implementations to expose options controlling what lazy
> > operations should be performed as they see fit.
> >
> > Will, I see that you've already addressed this issue to some extent in
> > your proposal. For example, you mention that we should initially
> > define this protocol to include only a minimal subset of the Dataset
> > API. I agree, but I think there are some loose ends we should be
> > careful to tie up. I strongly agree with the comments made by David,
> > Weston, and Dewey arguing that we should avoid any use of PyArrow
> > expressions in this API. Expressions are an implementation detail of
> > PyArrow, not a part of the Arrow standard. It would be much safer for
> > the initial version of this protocol to not define *any*
> > methods/arguments that take expressions. This will allow us to take
> > some more time to finish up the Substrait expression implementation
> > work that is underway [7][8], then introduce Substrait-based
> > expressions in a later version of this protocol. This approach will
> > better position this protocol to be implemented in other languages
> > besides Python.
> >
> > Another concern I have is that we have not fully explained why we want
> > to use Dataset instead of RecordBatchReader [9] as the basis of this
> > protocol. I would like to see an explanation of why RecordBatchReader
> > is not sufficient for this. RecordBatchReader seems like another
> > possible way to represent "unmaterialized dataframes" and there are
> > some parallels between RecordBatch/RecordBatchReader and
> > Fragment/Dataset. We should help developers and users understand why
> > Arrow needs both of these.
> >
> > Thanks Will for your thoughtful prose explanations about this proposed
> > API. After we arrive at a decision about this, I think we should
> > reproduce some of these explanations in docs, blog posts, cookbook
> > recipes, etc. because there is some important nuance here that will be
> > important for integrators of this API to understand.
> >
> > Ian
> >
> > [1] https://arrow.apache.org/docs/cpp/api/dataset.html
> > [2] https://arrow.apache.org/docs/python/dataset.html
> > [3] https://arrow.apache.org/docs/java/dataset.html
> > [4] https://arrow.apache.org/docs/r/articles/dataset.html
> > [5] https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html#pyarrow.dataset.dataset
> > [6] https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html
> > [7] https://github.com/apache/arrow/issues/33985
> > [8] https://github.com/apache/arrow/issues/34252
> > [9] https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html
> >
> > On Wed, Jun 21, 2023 at 2:09 PM Will Jones <[email protected]> wrote:
> > >
> > > Hello Arrow devs,
> > >
> > > I have drafted a PR defining an experimental protocol which would allow
> > > third-party libraries to imitate the PyArrow Dataset API [5]. This protocol
> > > is intended to endorse an integration pattern that is starting to be used
> > > in the Python ecosystem, where some libraries are providing their own
> > > scanners with this API, while query engines are accepting these as
> > > duck-typed objects.
> > >
> > > To give some background: back at the end of 2021, we collaborated with
> > > DuckDB to be able to read datasets (an Arrow C++ concept), supporting
> > > column selection and filter pushdown. This was accomplished by having
> > > DuckDB manipulate Python (or R) objects to get a RecordBatchReader and
> > > then exporting over the C Stream Interface.
> > >
> > > Since then, DataFusion [2] and Polars have both made similar
> > > implementations for their Python bindings, allowing them to consume PyArrow
> > > datasets. This has created an implicit protocol, whereby arbitrary compute
> > > engines can push down queries into the PyArrow dataset scanner.
> > >
> > > Now, libraries supporting table formats including Delta Lake, Lance, and
> > > Iceberg are looking to be able to support these engines, while bringing
> > > their own scanners and metadata handling implementations. One possible
> > > route is allowing them to imitate the PyArrow datasets API.
> > >
> > > Bringing these use cases together, I'd like to propose an experimental
> > > protocol, made out of the minimal subset of the PyArrow Dataset API
> > > necessary to facilitate this kind of integration. This would allow any
> > > library to produce a scanner implementation that arbitrary query
> > > engines could call into. I've drafted a PR [3] and there is some background
> > > research available in a Google Doc [4].
> > >
> > > I've already gotten some good feedback on both, and would welcome more.
> > >
> > > One last point: I'd like for this to be a first step rather than a
> > > comprehensive API. This PR focuses on making explicit a protocol that is
> > > already in use in the ecosystem, but without much concrete definition. Once
> > > this is established, we can use our experience from this protocol to design
> > > something more permanent that takes advantage of newer innovations in the
> > > Arrow ecosystem (such as the PyCapsule for C Data Interface or
> > > Substrait for passing expressions / scan plans). I am tracking such future
> > > improvements in [5].
> > >
> > > Best,
> > >
> > > Will Jones
> > >
> > > [1] https://duckdb.org/2021/12/03/duck-arrow.html
> > > [2] https://github.com/apache/arrow-datafusion-python/pull/9
> > > [3] https://github.com/apache/arrow/pull/35568
> > > [4] https://docs.google.com/document/d/1r56nt5Un2E7yPrZO9YPknBN4EDtptpx-tqOZReHvq1U/edit?pli=1
> > > [5] https://docs.google.com/document/d/1-uVkSZeaBtOALVbqMOPeyV3s2UND7Wl-IGEZ-P-gMXQ/edit
> >
>
