Hello,

After thinking about it, I think I understand the approach David Li and Ian are suggesting with respect to expressions. There will be some arguments that only PyArrow's own datasets support but that aren't in the generic protocol, and passing PyArrow expressions to the filters argument should be considered one of those. DuckDB and others are currently passing them down, so they aren't yet using the protocol properly. But once we add support in the protocol for passing filters via Substrait expressions, we'll move DuckDB and others over to be fully compliant with the protocol.
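To sketch what that transition could look like for a consumer (every name below is hypothetical; the protocol does not define a capability flag or a Substrait filter argument today, so treat this only as an illustration):

    import pyarrow.compute as pc

    def make_scanner(dataset, columns, pa_filter, substrait_filter):
        # Hypothetical capability flag a producer could expose once
        # Substrait filters land; nothing in the protocol defines it yet.
        if getattr(dataset, "supports_substrait_filters", False):
            # Newer producers: pass the engine-neutral Substrait filter.
            return dataset.scanner(columns=columns, filter=substrait_filter)
        # Older producers: fall back to a PyArrow expression.
        return dataset.scanner(columns=columns, filter=pa_filter)

    # e.g. pa_filter = pc.field("year") >= 2020

Whether the check ends up being an attribute like this, a protocol version number, or a separate keyword argument is something we can settle once the Substrait work lands.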
It's a bit of an awkward temporary state for now, but so would be keeping PyArrow expressions in the protocol only to deprecate them a few months later. One caveat is that we'll need to provide DuckDB and other consumers with a way to tell (something like the capability check sketched above) whether a dataset supports passing filters as Substrait expressions or as PyArrow ones, since I doubt they'll want to lose support for integrating with older PyArrow versions.

I've removed filters from the protocol for now, with the intention of bringing them back as soon as we can get Substrait support. I think we can do this in the 14.0.0 release.

Best,

Will Jones

On Mon, Jul 3, 2023 at 7:45 AM Fokko Driesprong <fo...@apache.org> wrote:

> Hey everyone,
>
> Chiming in here from the PyIceberg side. I would love to see the protocol as proposed in the PR. I did a small test <https://github.com/apache/arrow/pull/35568#pullrequestreview-1480259722>, and it seems to be quite straightforward to implement, and it brings a lot of potential. Unsurprisingly, I'm leaning toward the first option:
>
> > 1. We keep PyArrow expressions in the API initially, but once we have Substrait-based alternatives we deprecate the PyArrow expression support. This is what I intended with the current design, and I think it provides the most obvious migration paths for existing producers and consumers.
>
> Let me give my vision on some of the concerns raised.
>
> > Will, I see that you've already addressed this issue to some extent in your proposal. For example, you mention that we should initially define this protocol to include only a minimal subset of the Dataset API. I agree, but I think there are some loose ends we should be careful to tie up. I strongly agree with the comments made by David, Weston, and Dewey arguing that we should avoid any use of PyArrow expressions in this API. Expressions are an implementation detail of PyArrow, not a part of the Arrow standard. It would be much safer for the initial version of this protocol to not define *any* methods/arguments that take expressions. This will allow us to take some more time to finish up the Substrait expression implementation work that is underway [7][8], then introduce Substrait-based expressions in a later version of this protocol. This approach will better position this protocol to be implemented in other languages besides Python.
>
> I'm confused here. Looking at GH-33985 <https://github.com/apache/arrow/pull/34834/files> I don't see any new primitives being introduced for composing an expression. As I understand it, the expression as it exists today in PyArrow will continue to exist. In the case of inter-process communication, it goes to Substrait, and then it gets deserialized into the native expression construct (in PyIceberg, a BoundPredicate). I would say that the protocol and Substrait are complementary.
>
> > Another concern I have is that we have not fully explained why we want to use Dataset instead of RecordBatchReader [9] as the basis of this protocol. I would like to see an explanation of why RecordBatchReader is not sufficient for this. RecordBatchReader seems like another possible way to represent "unmaterialized dataframes" and there are some parallels between RecordBatch/RecordBatchReader and Fragment/Dataset. We should help developers and users understand why Arrow needs both of these.
>
> Just to clarify, I think there are different use cases.
> For example, Lance provides its own readers, but PyIceberg does not have any intent to provide its own Parquet readers. Iceberg will generate the list of files that need to be read, and do the filtering/projection/deletes/etc. This would make the Dataset a better choice than the RecordBatchReader.
>
> > That wouldn't remove the feature from DuckDB, would it? It would just mean that we recognize that PyArrow expressions don't have well-defined semantics that we are committing to at this time. As long as we have `**kwargs` everywhere, we can in the future introduce a `substrait_filter_expression` or similar argument, while allowing current implementors to handle `filter` if possible. (As a compromise, we could reserve `filter` and existing arguments and note that PyArrow Expression semantics are subject to change without notice?)
>
> I think we can even re-use the existing filter argument. The signature would evolve from pc.Expression to Union[pc.Expression, pas.BoundExpressions]. In the case we get an expression, we'll convert it to Substrait.
>
> Concluding, I think we can do things in parallel, and I don't think they are conflicting. I'm happy to contribute to the PyArrow side to make this happen.
>
> Kind regards,
> Fokko
>
> On Wed, Jun 28, 2023 at 22:47, Will Jones <will.jones...@gmail.com> wrote:
>
> > > That wouldn't remove the feature from DuckDB, would it? It would just mean that we recognize that PyArrow expressions don't have well-defined semantics that we are committing to at this time.
> >
> > That's a fair point, David. I would be fine excluding it from the protocol initially, and keeping the existing integrations in DuckDB, Polars, and DataFusion "secret" or "not officially supported" for the time being. At the very least, documenting the pattern to get an Arrow C stream will be a step forward.
> >
> > Best,
> >
> > Will Jones
> >
> > On Wed, Jun 28, 2023 at 12:35 PM Jonathan Keane <jke...@gmail.com> wrote:
> >
> > > > I would understand this objection more if DuckDB hadn't been relying on being able to pass PyArrow expressions for 18 months now [1]. Unless, do we just think this isn't widely used enough that we don't care?
> > >
> > > This isn't a pro or a con of specifically adopting the PyArrow expression semantics as is / with a warning about changing / not at all, but having some kind of standardization in this interface would be very nice. This even came up while collaborating with the DuckDB folks: using some of the expression bits here (and in the R equivalents) was a little bit odd, and having something like a proper API for that would have made things more natural (and likely that would have been used had it existed 18 months ago :))
> > >
> > > -Jon
> > >
> > > On Wed, Jun 28, 2023 at 1:17 PM David Li <lidav...@apache.org> wrote:
> > >
> > > > That wouldn't remove the feature from DuckDB, would it? It would just mean that we recognize that PyArrow expressions don't have well-defined semantics that we are committing to at this time. As long as we have `**kwargs` everywhere, we can in the future introduce a `substrait_filter_expression` or similar argument, while allowing current implementors to handle `filter` if possible.
> > > > (As a compromise, we could reserve `filter` and existing arguments and note that PyArrow Expression semantics are subject to change without notice?)
> > > >
> > > > On Wed, Jun 28, 2023, at 13:38, Will Jones wrote:
> > > >
> > > > > Hi Ian,
> > > > >
> > > > > > I favor option 2 out of concern that option 1 could create a temptation for users of this protocol to depend on a feature that we intend to deprecate.
> > > > >
> > > > > I would understand this objection more if DuckDB hadn't been relying on being able to pass PyArrow expressions for 18 months now [1]. Unless, do we just think this isn't widely used enough that we don't care?
> > > > >
> > > > > Best,
> > > > > Will
> > > > >
> > > > > [1] https://duckdb.org/2021/12/03/duck-arrow.html
> > > > >
> > > > > On Tue, Jun 27, 2023 at 11:19 AM Ian Cook <ianmc...@apache.org> wrote:
> > > > >
> > > > > > > I think there are three routes we can go here:
> > > > > > >
> > > > > > > 1. We keep PyArrow expressions in the API initially, but once we have Substrait-based alternatives we deprecate the PyArrow expression support. This is what I intended with the current design, and I think it provides the most obvious migration paths for existing producers and consumers.
> > > > > > > 2. We keep the overall dataset API, but don't introduce the filter and projection arguments until we have Substrait support. I'm not sure what the migration path looks like for producers and consumers, but I think this just implicitly becomes the same as (1), but with worse documentation.
> > > > > > > 3. We write a protocol completely from scratch, that doesn't try to describe the existing dataset API. Producers and consumers would then migrate to use the new protocol and deprecate their existing dataset integrations. We could introduce a dunder method in that API (sort of like __arrow_array__) that would make the migration seamless from the end-user perspective.
> > > > > > >
> > > > > > > *Which do you all think is the best path forward?*
> > > > > >
> > > > > > I favor option 2 out of concern that option 1 could create a temptation for users of this protocol to depend on a feature that we intend to deprecate. I think option 2 also creates a stronger motivation to complete the Substrait expression integration work, which is underway in https://github.com/apache/arrow/pull/34834.
> > > > > >
> > > > > > Ian
> > > > > >
> > > > > > On Fri, Jun 23, 2023 at 1:25 PM Weston Pace <weston.p...@gmail.com> wrote:
> > > > > >
> > > > > > > > The trouble is that Dataset was not designed to serve as a general-purpose unmaterialized dataframe. For example, the PyArrow Dataset constructor [5] exposes options for specifying a list of source files and a partitioning scheme, which are irrelevant for many of the applications that Will anticipates. And some work is needed to reconcile the methods of the PyArrow Dataset object [6] with the methods of the Table object.
> > > > > > > > Some methods like filter() are exposed by both and behave lazily on Datasets and eagerly on Tables, as a user might expect. But many other Table methods are not implemented for Dataset though they potentially could be, and it is unclear where we should draw the line between adding methods to Dataset vs. encouraging new scanner implementations to expose options controlling what lazy operations should be performed as they see fit.
> > > > > > >
> > > > > > > In my mind there is a distinction between the "compute domain" (e.g. a pandas dataframe or something like ibis or SQL) and the "data domain" (e.g. pyarrow datasets). I think, in a perfect world, you could push any and all compute up and down the chain as far as possible. However, in practice, I think there is a healthy set of tools and libraries that say "simple column projection and filtering is good enough". I would argue that there is room for both APIs, and while the temptation is always present to "shove as much compute as you can", I think pyarrow datasets seem to have found a balance between the two that users like.
> > > > > > >
> > > > > > > So I would argue that this protocol may never become a general-purpose unmaterialized dataframe, and that isn't necessarily a bad thing.
> > > > > > >
> > > > > > > > they are splittable and serializable, so that fragments can be distributed amongst processes / workers.
> > > > > > >
> > > > > > > Just to clarify, the proposal currently only requires the fragments to be serializable, correct?
> > > > > > >
> > > > > > > On Fri, Jun 23, 2023 at 11:48 AM Will Jones <will.jones...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Thanks Ian for your extensive feedback.
> > > > > > > >
> > > > > > > > > I strongly agree with the comments made by David, Weston, and Dewey arguing that we should avoid any use of PyArrow expressions in this API. Expressions are an implementation detail of PyArrow, not a part of the Arrow standard. It would be much safer for the initial version of this protocol to not define *any* methods/arguments that take expressions.
> > > > > > > >
> > > > > > > > I would agree with this point if we were starting from scratch. But one of my goals is for this protocol to be descriptive of the existing dataset integrations in the ecosystem, which all currently rely on PyArrow expressions. For example, you'll notice in the PR that there are unit tests to verify that the current PyArrow Dataset classes conform to this protocol, without changes.
> > > > > > > >
> > > > > > > > I think there are three routes we can go here:
> > > > > > > >
> > > > > > > > 1. We keep PyArrow expressions in the API initially, but once we have Substrait-based alternatives we deprecate the PyArrow expression support.
> > > > > > > > This is what I intended with the current design, and I think it provides the most obvious migration paths for existing producers and consumers.
> > > > > > > > 2. We keep the overall dataset API, but don't introduce the filter and projection arguments until we have Substrait support. I'm not sure what the migration path looks like for producers and consumers, but I think this just implicitly becomes the same as (1), but with worse documentation.
> > > > > > > > 3. We write a protocol completely from scratch, that doesn't try to describe the existing dataset API. Producers and consumers would then migrate to use the new protocol and deprecate their existing dataset integrations. We could introduce a dunder method in that API (sort of like __arrow_array__) that would make the migration seamless from the end-user perspective.
> > > > > > > >
> > > > > > > > *Which do you all think is the best path forward?*
> > > > > > > >
> > > > > > > > > Another concern I have is that we have not fully explained why we want to use Dataset instead of RecordBatchReader [9] as the basis of this protocol. I would like to see an explanation of why RecordBatchReader is not sufficient for this. RecordBatchReader seems like another possible way to represent "unmaterialized dataframes" and there are some parallels between RecordBatch/RecordBatchReader and Fragment/Dataset.
> > > > > > > >
> > > > > > > > This is a good point. I can add a section describing the differences. The main ones I can think of are: (1) Datasets are "pruneable": one can select a subset of columns and apply a filter on rows to avoid IO, and (2) they are splittable and serializable, so that fragments can be distributed amongst processes / workers.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > >
> > > > > > > > Will Jones
> > > > > > > >
> > > > > > > > On Fri, Jun 23, 2023 at 10:48 AM Ian Cook <ianmc...@apache.org> wrote:
> > > > > > > >
> > > > > > > > > Thanks Will for this proposal!
> > > > > > > > >
> > > > > > > > > For anyone familiar with PyArrow, this idea has a clear intuitive logic to it. It provides an expedient solution to the current lack of a practical means for interchanging "unmaterialized dataframes" between different Python libraries.
> > > > > > > > >
> > > > > > > > > To elaborate on that: if you look at how people use the Arrow Dataset API (which is implemented in the Arrow C++ library [1] and has bindings not just for Python [2] but also for Java [3] and R [4]), you'll see that Dataset is often used simply as a "virtual" variant of Table. It is used in cases when the data is larger than memory or when it is desirable to defer reading (materializing) the data into memory.
> > > > > > > > > So we can think of a Table as a materialized dataframe and a Dataset as an unmaterialized dataframe. That aspect of Dataset is, I think, what makes it most attractive as a protocol for enabling interoperability: it allows libraries to easily "speak Arrow" in cases where materializing the full data in memory upfront is impossible or undesirable.
> > > > > > > > >
> > > > > > > > > The trouble is that Dataset was not designed to serve as a general-purpose unmaterialized dataframe. For example, the PyArrow Dataset constructor [5] exposes options for specifying a list of source files and a partitioning scheme, which are irrelevant for many of the applications that Will anticipates. And some work is needed to reconcile the methods of the PyArrow Dataset object [6] with the methods of the Table object. Some methods like filter() are exposed by both and behave lazily on Datasets and eagerly on Tables, as a user might expect. But many other Table methods are not implemented for Dataset though they potentially could be, and it is unclear where we should draw the line between adding methods to Dataset vs. encouraging new scanner implementations to expose options controlling what lazy operations should be performed as they see fit.
> > > > > > > > >
> > > > > > > > > Will, I see that you've already addressed this issue to some extent in your proposal. For example, you mention that we should initially define this protocol to include only a minimal subset of the Dataset API. I agree, but I think there are some loose ends we should be careful to tie up. I strongly agree with the comments made by David, Weston, and Dewey arguing that we should avoid any use of PyArrow expressions in this API. Expressions are an implementation detail of PyArrow, not a part of the Arrow standard. It would be much safer for the initial version of this protocol to not define *any* methods/arguments that take expressions. This will allow us to take some more time to finish up the Substrait expression implementation work that is underway [7][8], then introduce Substrait-based expressions in a later version of this protocol. This approach will better position this protocol to be implemented in other languages besides Python.
> > > > > > > > >
> > > > > > > > > Another concern I have is that we have not fully explained why we want to use Dataset instead of RecordBatchReader [9] as the basis of this protocol. I would like to see an explanation of why RecordBatchReader is not sufficient for this.
> > > > > > > > > RecordBatchReader seems like another possible way to represent "unmaterialized dataframes" and there are some parallels between RecordBatch/RecordBatchReader and Fragment/Dataset. We should help developers and users understand why Arrow needs both of these.
> > > > > > > > >
> > > > > > > > > Thanks Will for your thoughtful prose explanations about this proposed API. After we arrive at a decision about this, I think we should reproduce some of these explanations in docs, blog posts, cookbook recipes, etc., because there is some important nuance here that will be important for integrators of this API to understand.
> > > > > > > > >
> > > > > > > > > Ian
> > > > > > > > >
> > > > > > > > > [1] https://arrow.apache.org/docs/cpp/api/dataset.html
> > > > > > > > > [2] https://arrow.apache.org/docs/python/dataset.html
> > > > > > > > > [3] https://arrow.apache.org/docs/java/dataset.html
> > > > > > > > > [4] https://arrow.apache.org/docs/r/articles/dataset.html
> > > > > > > > > [5] https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html#pyarrow.dataset.dataset
> > > > > > > > > [6] https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html
> > > > > > > > > [7] https://github.com/apache/arrow/issues/33985
> > > > > > > > > [8] https://github.com/apache/arrow/issues/34252
> > > > > > > > > [9] https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html
> > > > > > > > >
> > > > > > > > > On Wed, Jun 21, 2023 at 2:09 PM Will Jones <will.jones...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hello Arrow devs,
> > > > > > > > > >
> > > > > > > > > > I have drafted a PR defining an experimental protocol which would allow third-party libraries to imitate the PyArrow Dataset API [5]. This protocol is intended to endorse an integration pattern that is starting to be used in the Python ecosystem, where some libraries are providing their own scanners with this API, while query engines are accepting these as duck-typed objects.
> > > > > > > > > >
> > > > > > > > > > To give some background: back at the end of 2021, we collaborated with DuckDB to be able to read datasets (an Arrow C++ concept), supporting column selection and filter pushdown [1]. This was accomplished by having DuckDB manipulate Python (or R) objects to get a RecordBatchReader and then exporting over the C Stream Interface.
> > > > > > > > > >
> > > > > > > > > > Since then, DataFusion [2] and Polars have both made similar implementations for their Python bindings, allowing them to consume PyArrow datasets.
> > > > > > > > > > This has created an implicit protocol, whereby arbitrary compute engines can push down queries into the PyArrow dataset scanner.
> > > > > > > > > >
> > > > > > > > > > Now, libraries supporting table formats including Delta Lake, Lance, and Iceberg are looking to be able to support these engines, while bringing their own scanners and metadata handling implementations. One possible route is allowing them to imitate the PyArrow datasets API.
> > > > > > > > > >
> > > > > > > > > > Bringing these use cases together, I'd like to propose an experimental protocol, made out of the minimal subset of the PyArrow Dataset API necessary to facilitate this kind of integration. This would allow any library to produce a scanner implementation that arbitrary query engines could call into. I've drafted a PR [3] and there is some background research available in a Google doc [4].
> > > > > > > > > >
> > > > > > > > > > I've already gotten some good feedback on both, and would welcome more.
> > > > > > > > > >
> > > > > > > > > > One last point: I'd like for this to be a first step rather than a comprehensive API. This PR focuses on making explicit a protocol that is already in use in the ecosystem, but without much concrete definition. Once this is established, we can use our experience from this protocol to design something more permanent that takes advantage of newer innovations in the Arrow ecosystem (such as the PyCapsule for the C Data Interface, or Substrait for passing expressions / scan plans). I am tracking such future improvements in [5].
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > >
> > > > > > > > > > Will Jones
> > > > > > > > > >
> > > > > > > > > > [1] https://duckdb.org/2021/12/03/duck-arrow.html
> > > > > > > > > > [2] https://github.com/apache/arrow-datafusion-python/pull/9
> > > > > > > > > > [3] https://github.com/apache/arrow/pull/35568
> > > > > > > > > > [4] https://docs.google.com/document/d/1r56nt5Un2E7yPrZO9YPknBN4EDtptpx-tqOZReHvq1U/edit?pli=1
> > > > > > > > > > [5] https://docs.google.com/document/d/1-uVkSZeaBtOALVbqMOPeyV3s2UND7Wl-IGEZ-P-gMXQ/edit