Thanks Ian for your extensive feedback.

> I strongly agree with the comments made by David,
> Weston, and Dewey arguing that we should avoid any use of PyArrow
> expressions in this API. Expressions are an implementation detail of
> PyArrow, not a part of the Arrow standard. It would be much safer for
> the initial version of this protocol to not define *any*
> methods/arguments that take expressions.
>

I would agree with this point if we were starting from scratch. But one of
my goals is for this protocol to be descriptive of the existing dataset
integrations in the ecosystem, which all currently rely on PyArrow
expressions. For example, you'll notice in the PR that there are unit tests
verifying that the current PyArrow Dataset classes conform to this protocol
without any changes.
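
To make the duck-typing idea concrete, here is a rough sketch of the kind of
conformance check I have in mind (the member names schema and scanner are
just illustrative here, not necessarily the exact set the PR defines):

    from typing import Any, List, Optional, Protocol, runtime_checkable

    import pyarrow as pa
    import pyarrow.dataset as ds


    @runtime_checkable
    class DatasetLike(Protocol):
        schema: pa.Schema

        def scanner(self, columns: Optional[List[str]] = None,
                    filter: Optional[Any] = None, **kwargs: Any) -> Any:
            ...


    # An existing PyArrow dataset should satisfy the duck type as-is.
    dataset = ds.dataset(pa.table({"a": [1, 2, 3]}))
    assert isinstance(dataset, DatasetLike)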

I think there are three routes we can take here:

1. We keep PyArrow expressions in the API initially, but once we have
Substrait-based alternatives we deprecate the PyArrow expression support.
This is what I intended with the current design, and I think it provides
the most obvious migration paths for existing producers and consumers.
2. We keep the overall dataset API, but don't introduce the filter and
projection arguments until we have Substrait support. I'm not sure what the
migration path looks like for producers and consumers, but I think this
implicitly becomes the same as (1), only with worse documentation.
3. We write a protocol completely from scratch that doesn't try to
describe the existing dataset API. Producers and consumers would then
migrate to the new protocol and deprecate their existing dataset
integrations. We could introduce a dunder method in that API (sort of like
__arrow_array__) that would make the migration seamless from the end-user
perspective; see the sketch after this list.
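
For illustration, a hypothetical sketch of what that hook could look like
(the name __arrow_dataset__ and the helper below are assumptions of mine,
not part of any existing spec):

    import pyarrow.dataset as ds


    class MyTableFormat:
        """A third-party table format that brings its own scanner/metadata."""

        def __init__(self, path: str):
            self._path = path

        def __arrow_dataset__(self) -> ds.Dataset:
            # A real implementation would return its own duck-typed dataset;
            # here we simply delegate to PyArrow for illustration.
            return ds.dataset(self._path, format="parquet")


    def to_dataset(obj) -> ds.Dataset:
        # How a consumer (query engine) might resolve an arbitrary object.
        if hasattr(obj, "__arrow_dataset__"):
            return obj.__arrow_dataset__()
        raise TypeError("object does not expose an Arrow dataset")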

*Which do you all think is the best path forward?*

> Another concern I have is that we have not fully explained why we want
> to use Dataset instead of RecordBatchReader [9] as the basis of this
> protocol. I would like to see an explanation of why RecordBatchReader
> is not sufficient for this. RecordBatchReader seems like another
> possible way to represent "unmaterialized dataframes" and there are
> some parallels between RecordBatch/RecordBatchReader and
> Fragment/Dataset.
>

This is a good point. I can add a section describing the differences. The
main ones I can think of are: (1) Datasets are prunable: one can select a
subset of columns and apply a row filter to avoid unnecessary IO; and (2)
they are splittable and serializable, so that fragments can be distributed
amongst processes / workers.
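
As a quick illustration with today's PyArrow API (the file path and column
names below are made up for the example):

    import pyarrow.compute as pc
    import pyarrow.dataset as ds

    dataset = ds.dataset("data/", format="parquet")

    # (1) Prunable: column selection and row filtering are pushed into the
    # scan, so unnecessary IO can be skipped before anything is materialized.
    scanner = dataset.scanner(columns=["a", "b"], filter=pc.field("a") > 0)
    reader = scanner.to_reader()  # only at this point is a single stream fixed

    # (2) Splittable: fragments can be listed and scanned independently,
    # e.g. handed out to different workers.
    for fragment in dataset.get_fragments():
        for batch in fragment.to_batches(columns=["a"], filter=pc.field("a") > 0):
            pass  # process each batch on a worker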

Best,

Will Jones

On Fri, Jun 23, 2023 at 10:48 AM Ian Cook <ianmc...@apache.org> wrote:

> Thanks Will for this proposal!
>
> For anyone familiar with PyArrow, this idea has a clear intuitive
> logic to it. It provides an expedient solution to the current lack of
> a practical means for interchanging "unmaterialized dataframes"
> between different Python libraries.
>
> To elaborate on that: If you look at how people use the Arrow Dataset
> API—which is implemented in the Arrow C++ library [1] and has bindings
> not just for Python [2] but also for Java [3] and R [4]—you'll see
> that Dataset is often used simply as a "virtual" variant of Table. It
> is used in cases when the data is larger than memory or when it is
> desirable to defer reading (materializing) the data into memory.
>
> So we can think of a Table as a materialized dataframe and a Dataset
> as an unmaterialized dataframe. That aspect of Dataset is I think what
> makes it most attractive as a protocol for enabling interoperability:
> it allows libraries to easily "speak Arrow" in cases where
> materializing the full data in memory upfront is impossible or
> undesirable.
>
> The trouble is that Dataset was not designed to serve as a
> general-purpose unmaterialized dataframe. For example, the PyArrow
> Dataset constructor [5] exposes options for specifying a list of
> source files and a partitioning scheme, which are irrelevant for many
> of the applications that Will anticipates. And some work is needed to
> reconcile the methods of the PyArrow Dataset object [6] with the
> methods of the Table object. Some methods like filter() are exposed by
> both and behave lazily on Datasets and eagerly on Tables, as a user
> might expect. But many other Table methods are not implemented for
> Dataset though they potentially could be, and it is unclear where we
> should draw the line between adding methods to Dataset vs. encouraging
> new scanner implementations to expose options controlling what lazy
> operations should be performed as they see fit.
>
> Will, I see that you've already addressed this issue to some extent in
> your proposal. For example, you mention that we should initially
> define this protocol to include only a minimal subset of the Dataset
> API. I agree, but I think there are some loose ends we should be
> careful to tie up. I strongly agree with the comments made by David,
> Weston, and Dewey arguing that we should avoid any use of PyArrow
> expressions in this API. Expressions are an implementation detail of
> PyArrow, not a part of the Arrow standard. It would be much safer for
> the initial version of this protocol to not define *any*
> methods/arguments that take expressions. This will allow us to take
> some more time to finish up the Substrait expression implementation
> work that is underway [7][8], then introduce Substrait-based
> expressions in a later version of this protocol. This approach will
> better position this protocol to be implemented in other languages
> besides Python.
>
> Another concern I have is that we have not fully explained why we want
> to use Dataset instead of RecordBatchReader [9] as the basis of this
> protocol. I would like to see an explanation of why RecordBatchReader
> is not sufficient for this. RecordBatchReader seems like another
> possible way to represent "unmaterialized dataframes" and there are
> some parallels between RecordBatch/RecordBatchReader and
> Fragment/Dataset. We should help developers and users understand why
> Arrow needs both of these.
>
> Thanks Will for your thoughtful prose explanations about this proposed
> API. After we arrive at a decision about this, I think we should
> reproduce some of these explanations in docs, blog posts, cookbook
> recipes, etc. because there is some nuance here that will be important
> for integrators of this API to understand.
>
> Ian
>
> [1] https://arrow.apache.org/docs/cpp/api/dataset.html
> [2] https://arrow.apache.org/docs/python/dataset.html
> [3] https://arrow.apache.org/docs/java/dataset.html
> [4] https://arrow.apache.org/docs/r/articles/dataset.html
> [5]
> https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html#pyarrow.dataset.dataset
> [6]
> https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html
> [7] https://github.com/apache/arrow/issues/33985
> [8] https://github.com/apache/arrow/issues/34252
> [9]
> https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html
>
> On Wed, Jun 21, 2023 at 2:09 PM Will Jones <will.jones...@gmail.com>
> wrote:
> >
> > Hello Arrow devs,
> >
> > I have drafted a PR defining an experimental protocol which would allow
> > third-party libraries to imitate the PyArrow Dataset API [5]. This protocol
> > is intended to endorse an integration pattern that is starting to be used
> > in the Python ecosystem, where some libraries are providing their own
> > scanners with this API, while query engines are accepting these as
> > duck-typed objects.
> >
> > To give some background: back at the end of 2021, we collaborated with
> > DuckDB [1] to be able to read datasets (an Arrow C++ concept), supporting
> > column selection and filter pushdown. This was accomplished by having
> > DuckDB manipulate Python (or R) objects to get a RecordBatchReader and
> > then export it over the C Stream Interface.
> >
> > Since then, DataFusion [2] and Polars have both made similar
> > implementations for their Python bindings, allowing them to consume PyArrow
> > datasets. This has created an implicit protocol, whereby arbitrary compute
> > engines can push down queries into the PyArrow dataset scanner.
> >
> > Now, libraries supporting table formats including Delta Lake, Lance, and
> > Iceberg are looking to be able to support these engines, while bringing
> > their own scanners and metadata handling implementations. One possible
> > route is allowing them to imitate the PyArrow datasets API.
> >
> > Bringing these use cases together, I'd like to propose an experimental
> > protocol, made out of the minimal subset of the PyArrow Dataset API
> > necessary to facilitate this kind of integration. This would allow any
> > library to produce a scanner implementation that arbitrary query engines
> > could call into. I've drafted a PR [3] and there is some background
> > research available in a Google doc [4].
> >
> > I've already gotten some good feedback on both, and would welcome more.
> >
> > One last point: I'd like for this to be a first step rather than a
> > comprehensive API. This PR focuses on making explicit a protocol that is
> > already in use in the ecosystem, but without much concrete definition. Once
> > this is established, we can use our experience from this protocol to design
> > something more permanent that takes advantage of newer innovations in the
> > Arrow ecosystem (such as the PyCapsule for the C Data Interface or
> > Substrait for passing expressions / scan plans). I am tracking such future
> > improvements in [5].
> >
> > Best,
> >
> > Will Jones
> >
> > [1] https://duckdb.org/2021/12/03/duck-arrow.html
> > [2] https://github.com/apache/arrow-datafusion-python/pull/9
> > [3] https://github.com/apache/arrow/pull/35568
> > [4]
> > https://docs.google.com/document/d/1r56nt5Un2E7yPrZO9YPknBN4EDtptpx-tqOZReHvq1U/edit?pli=1
> > [5]
> > https://docs.google.com/document/d/1-uVkSZeaBtOALVbqMOPeyV3s2UND7Wl-IGEZ-P-gMXQ/edit
>
