Thanks for pointing that out, Dane. I think Dask is an obvious candidate for consuming this protocol.
On Fri, Sep 1, 2023 at 10:13 AM Dane Pitkin <d...@voltrondata.com.invalid> wrote:

> The Python Substrait package [1] is on PyPI [2] and currently has Python wrappers for the Substrait protobuf objects. I think this will be a great opportunity to identify helper features that users of this protocol would like to see. I'll be keeping an eye out as this develops, but also feel free to file feature requests in the project!
>
> [1] https://github.com/substrait-io/substrait-python
> [2] https://pypi.org/project/substrait/
>
> On Thu, Aug 31, 2023 at 10:05 PM Will Jones <will.jones...@gmail.com> wrote:
>
> > Hello Arrow devs,
> >
> > We discussed this further in the Arrow community call on 2023-08-30 [1] and concluded we should create an entirely new protocol that uses Substrait expressions. I have created an issue [2] to track this and will start a PR soon.
> >
> > It does look like we might block this on creating a PyCapsule-based protocol for arrays, schemas, and streams. That is tracked here [3]. Hopefully that isn't too ambitious :)
> >
> > Best,
> >
> > Will Jones
> >
> > [1] https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/edit
> > [2] https://github.com/apache/arrow/issues/37504
> > [3] https://github.com/apache/arrow/issues/35531
> >
> > On Tue, Aug 29, 2023 at 2:59 PM Ian Cook <ianmc...@apache.org> wrote:
> >
> > > An update about this:
> > >
> > > Weston's PR https://github.com/apache/arrow/pull/34834/ merged last week. This makes it possible to convert PyArrow expressions to/from Substrait expressions.
> > >
> > > As Fokko previously noted, the PR does not change the PyArrow Dataset interface at all. It simply enables a Substrait expression to be converted to a PyArrow expression, which can then be used to filter/project a Dataset.
> > >
> > > There is a basic example demonstrating this here: https://gist.github.com/ianmcook/f70fc185d29ae97bdf85ffe0378c68e0
> > >
> > > We might now consider whether to build upon this to create a Dataset protocol that is independent of the PyArrow Expression implementation and that could interoperate across languages.
> > >
> > > Ian
> > >
> > > On Mon, Jul 3, 2023 at 5:48 PM Will Jones <will.jones...@gmail.com> wrote:
> > >
> > > > Hello,
> > > >
> > > > After thinking about it, I think I understand the approach David Li and Ian are suggesting with respect to expressions. There will be some arguments that only PyArrow's own datasets support, but that aren't in the generic protocol. Passing PyArrow expressions to the filters argument should be considered one of those. DuckDB and others are currently passing them down, so they aren't yet using the protocol properly. But once we add support in the protocol for passing filters via Substrait expressions, we'll move DuckDB and others over to be fully compliant with the protocol.
> > > >
> > > > It's a bit of an awkward temporary state for now, but so would having PyArrow expressions in the protocol just to be deprecated in a few months. One caveat is that we'll need to provide DuckDB and other consumers with a way to tell whether a dataset supports passing filters as Substrait expressions or as PyArrow ones, since I doubt they'll want to lose support for integrating with older PyArrow versions.
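To make the expression conversion concrete, here is a minimal sketch of the round-trip that Ian's gist demonstrates, using the pyarrow.substrait module added by Weston's PR (exact names could still shift before the 14.0.0 release):

    import pyarrow as pa
    import pyarrow.compute as pc
    import pyarrow.substrait as pas

    schema = pa.schema([("x", pa.int64()), ("y", pa.float64())])

    # Serialize a PyArrow expression, bound to a schema, into a
    # Substrait ExtendedExpression message (returned as a Buffer).
    buf = pas.serialize_expressions([pc.field("x") > 10], ["filter"], schema)

    # A consumer round-trips it back into a native PyArrow expression.
    bound = pas.deserialize_expressions(buf)
    filter_expr = bound.expressions["filter"]

With this in place, a Substrait-aware producer and consumer can exchange only the serialized form, so neither side needs the other's expression implementation.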
> > > > I've removed filters from the protocol for now, with the intention of bringing them back as soon as we can get Substrait support. I think we can do this in the 14.0.0 release.
> > > >
> > > > Best,
> > > >
> > > > Will Jones
> > > >
> > > > On Mon, Jul 3, 2023 at 7:45 AM Fokko Driesprong <fo...@apache.org> wrote:
> > > >
> > > > > Hey everyone,
> > > > >
> > > > > Chiming in here from the PyIceberg side. I would love to see the protocol as proposed in the PR. I did a small test <https://github.com/apache/arrow/pull/35568#pullrequestreview-1480259722>, and it seems quite straightforward to implement, and it brings a lot of potential. Unsurprisingly, I am leaning toward the first option:
> > > > >
> > > > > > 1. We keep PyArrow expressions in the API initially, but once we have Substrait-based alternatives we deprecate the PyArrow expression support. This is what I intended with the current design, and I think it provides the most obvious migration paths for existing producers and consumers.
> > > > >
> > > > > Let me give my vision on some of the concerns raised.
> > > > >
> > > > > > Will, I see that you've already addressed this issue to some extent in your proposal. For example, you mention that we should initially define this protocol to include only a minimal subset of the Dataset API. I agree, but I think there are some loose ends we should be careful to tie up. I strongly agree with the comments made by David, Weston, and Dewey arguing that we should avoid any use of PyArrow expressions in this API. Expressions are an implementation detail of PyArrow, not a part of the Arrow standard. It would be much safer for the initial version of this protocol to not define *any* methods/arguments that take expressions. This will allow us to take some more time to finish up the Substrait expression implementation work that is underway [7][8], then introduce Substrait-based expressions in a later version of this protocol. This approach will better position this protocol to be implemented in other languages besides Python.
> > > > >
> > > > > I'm confused here. Looking at GH-33985 <https://github.com/apache/arrow/pull/34834/files>, I don't see any new primitives being introduced for composing an expression. As I understand it, the PyArrow expression as it exists today will continue to exist. In the case of inter-process communication, it goes to Substrait and then gets deserialized into the native expression construct (in PyIceberg, a BoundPredicate). I would say that the protocol and Substrait are complementary.
> > > > >
> > > > > > Another concern I have is that we have not fully explained why we want to use Dataset instead of RecordBatchReader [9] as the basis of this protocol. I would like to see an explanation of why RecordBatchReader is not sufficient for this.
> > > > > > RecordBatchReader seems like another possible way to represent "unmaterialized dataframes" and there are some parallels between RecordBatch/RecordBatchReader and Fragment/Dataset. We should help developers and users understand why Arrow needs both of these.
> > > > >
> > > > > Just to clarify, I think there are different use cases. For example, Lance provides its own readers, but PyIceberg does not have any intent to provide its own Parquet readers. Iceberg will generate the list of files that need to be read and do the filtering/projection/deletes/etc. This would make the Dataset a better choice than the RecordBatchReader.
> > > > >
> > > > > > That wouldn't remove the feature from DuckDB, would it? It would just mean that we recognize that PyArrow expressions don't have well-defined semantics that we are committing to at this time. As long as we have `**kwargs` everywhere, we can in the future introduce a `substrait_filter_expression` or similar argument, while allowing current implementors to handle `filter` if possible. (As a compromise, we could reserve `filter` and existing arguments and note that PyArrow Expression semantics are subject to change without notice?)
> > > > >
> > > > > I think we can even re-use the existing filter argument. The signature would evolve from pc.Expression to Union[pc.Expression, pas.BoundExpressions]. In the case we get an expression, we'll convert it to Substrait.
> > > > >
> > > > > Concluding, I think we can do things in parallel, and I don't think they are conflicting. I'm happy to contribute on the PyArrow side to make this happen.
> > > > >
> > > > > Kind regards,
> > > > > Fokko
> > > > >
> > > > > On Wed, Jun 28, 2023 at 22:47, Will Jones <will.jones...@gmail.com> wrote:
> > > > >
> > > > > > > That wouldn't remove the feature from DuckDB, would it? It would just mean that we recognize that PyArrow expressions don't have well-defined semantics that we are committing to at this time.
> > > > > >
> > > > > > That's a fair point, David. I would be fine excluding it from the protocol initially and keeping the existing integrations in DuckDB, Polars, and DataFusion "secret" or "not officially supported" for the time being. At the very least, documenting the pattern to get an Arrow C stream will be a step forward.
> > > > > >
> > > > > > Best,
> > > > > >
> > > > > > Will Jones
> > > > > >
> > > > > > On Wed, Jun 28, 2023 at 12:35 PM Jonathan Keane <jke...@gmail.com> wrote:
> > > > > >
> > > > > > > > I would understand this objection more if DuckDB hadn't been relying on being able to pass PyArrow expressions for 18 months now [1]. Unless we just think this isn't widely used enough for us to care?
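Fokko's evolved filter signature could be handled entirely on the producer side, and David's `**kwargs` point fits the same sketch. The following is hypothetical, not part of the PR; the helper name and the class are made up for illustration:

    from typing import Any, Optional, Union

    import pyarrow.compute as pc
    import pyarrow.substrait as pas

    # The filter argument's type evolves without breaking older callers.
    FilterLike = Union[pc.Expression, pas.BoundExpressions, None]

    def _to_pyarrow_filter(filter: FilterLike) -> Optional[pc.Expression]:
        # PyArrow expressions (and None) pass through unchanged.
        if filter is None or isinstance(filter, pc.Expression):
            return filter
        # Otherwise, unwrap the expression carried in the Substrait form.
        return next(iter(filter.expressions.values()))

    class ExampleDataset:  # hypothetical producer
        def scanner(self, columns=None, filter: FilterLike = None,
                    **kwargs: Any):
            # **kwargs leaves room to introduce future protocol arguments
            # (e.g. a substrait_filter_expression) without breaking
            # existing implementors.
            expr = _to_pyarrow_filter(filter)
            ...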
> > > > > > > This isn't a pro or a con of specifically adopting the PyArrow expression semantics as is / with a warning about changing / not at all, but having some kind of standardization in this interface would be very nice. This even came up while collaborating with the DuckDB folks: using some of the expression bits here (and in the R equivalents) was a little bit odd, and having something like a proper API for that would have made it more natural (and likely that would have been used had it existed 18 months ago :))
> > > > > > >
> > > > > > > -Jon
> > > > > > >
> > > > > > > On Wed, Jun 28, 2023 at 1:17 PM David Li <lidav...@apache.org> wrote:
> > > > > > >
> > > > > > > > That wouldn't remove the feature from DuckDB, would it? It would just mean that we recognize that PyArrow expressions don't have well-defined semantics that we are committing to at this time. As long as we have `**kwargs` everywhere, we can in the future introduce a `substrait_filter_expression` or similar argument, while allowing current implementors to handle `filter` if possible. (As a compromise, we could reserve `filter` and existing arguments and note that PyArrow Expression semantics are subject to change without notice?)
> > > > > > > >
> > > > > > > > On Wed, Jun 28, 2023, at 13:38, Will Jones wrote:
> > > > > > > >
> > > > > > > > > Hi Ian,
> > > > > > > > >
> > > > > > > > > > I favor option 2 out of concern that option 1 could create a temptation for users of this protocol to depend on a feature that we intend to deprecate.
> > > > > > > > >
> > > > > > > > > I would understand this objection more if DuckDB hadn't been relying on being able to pass PyArrow expressions for 18 months now [1]. Unless we just think this isn't widely used enough for us to care?
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Will
> > > > > > > > >
> > > > > > > > > [1] https://duckdb.org/2021/12/03/duck-arrow.html
> > > > > > > > >
> > > > > > > > > On Tue, Jun 27, 2023 at 11:19 AM Ian Cook <ianmc...@apache.org> wrote:
> > > > > > > > >
> > > > > > > > > > > I think there are three routes we can go here:
> > > > > > > > > > >
> > > > > > > > > > > 1. We keep PyArrow expressions in the API initially, but once we have Substrait-based alternatives we deprecate the PyArrow expression support. This is what I intended with the current design, and I think it provides the most obvious migration paths for existing producers and consumers.
> > > > > > > > > > > 2. We keep the overall dataset API, but don't introduce the filter and projection arguments until we have Substrait support.
> > > > > > > > > > > I'm not sure what the migration path looks like for producers and consumers, but I think this just implicitly becomes the same as (1), but with worse documentation.
> > > > > > > > > > > 3. We write a protocol completely from scratch that doesn't try to describe the existing dataset API. Producers and consumers would then migrate to use the new protocol and deprecate their existing dataset integrations. We could introduce a dunder method in that API (sort of like __arrow_array__) that would make the migration seamless from the end-user perspective.
> > > > > > > > > > >
> > > > > > > > > > > *Which do you all think is the best path forward?*
> > > > > > > > > >
> > > > > > > > > > I favor option 2 out of concern that option 1 could create a temptation for users of this protocol to depend on a feature that we intend to deprecate. I think option 2 also creates a stronger motivation to complete the Substrait expression integration work, which is underway in https://github.com/apache/arrow/pull/34834.
> > > > > > > > > >
> > > > > > > > > > Ian
> > > > > > > > > >
> > > > > > > > > > On Fri, Jun 23, 2023 at 1:25 PM Weston Pace <weston.p...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > > The trouble is that Dataset was not designed to serve as a general-purpose unmaterialized dataframe. For example, the PyArrow Dataset constructor [5] exposes options for specifying a list of source files and a partitioning scheme, which are irrelevant for many of the applications that Will anticipates. And some work is needed to reconcile the methods of the PyArrow Dataset object [6] with the methods of the Table object. Some methods like filter() are exposed by both and behave lazily on Datasets and eagerly on Tables, as a user might expect. But many other Table methods are not implemented for Dataset though they potentially could be, and it is unclear where we should draw the line between adding methods to Dataset vs. encouraging new scanner implementations to expose options controlling what lazy operations should be performed as they see fit.
> > > > > > > > > > >
> > > > > > > > > > > In my mind there is a distinction between the "compute domain" (e.g. a pandas dataframe or something like Ibis or SQL) and the "data domain" (e.g. pyarrow datasets).
> > > > > > > > > > > I think, in a perfect world, you could push any and all compute up and down the chain as far as possible. However, in practice, I think there is a healthy set of tools and libraries that say "simple column projection and filtering is good enough". I would argue that there is room for both APIs, and while the temptation is always present to "shove as much compute as you can", I think pyarrow datasets have found a balance between the two that users like.
> > > > > > > > > > >
> > > > > > > > > > > So I would argue that this protocol may never become a general-purpose unmaterialized dataframe, and that isn't necessarily a bad thing.
> > > > > > > > > > >
> > > > > > > > > > > > they are splittable and serializable, so that fragments can be distributed amongst processes / workers.
> > > > > > > > > > >
> > > > > > > > > > > Just to clarify, the proposal currently only requires the fragments to be serializable, correct?
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Jun 23, 2023 at 11:48 AM Will Jones <will.jones...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Thanks, Ian, for your extensive feedback.
> > > > > > > > > > > >
> > > > > > > > > > > > > I strongly agree with the comments made by David, Weston, and Dewey arguing that we should avoid any use of PyArrow expressions in this API. Expressions are an implementation detail of PyArrow, not a part of the Arrow standard. It would be much safer for the initial version of this protocol to not define *any* methods/arguments that take expressions.
> > > > > > > > > > > >
> > > > > > > > > > > > I would agree with this point if we were starting from scratch. But one of my goals is for this protocol to be descriptive of the existing dataset integrations in the ecosystem, which all currently rely on PyArrow expressions. For example, you'll notice in the PR that there are unit tests to verify that the current PyArrow Dataset classes conform to this protocol, without changes.
> > > > > > > > > > > >
> > > > > > > > > > > > I think there are three routes we can go here:
> > > > > > > > > > > >
> > > > > > > > > > > > 1. We keep PyArrow expressions in the API initially, but once we have Substrait-based alternatives we deprecate the PyArrow expression support.
> > > > > > > > > > > > This is what I intended with the current design, and I think it provides the most obvious migration paths for existing producers and consumers.
> > > > > > > > > > > > 2. We keep the overall dataset API, but don't introduce the filter and projection arguments until we have Substrait support. I'm not sure what the migration path looks like for producers and consumers, but I think this just implicitly becomes the same as (1), but with worse documentation.
> > > > > > > > > > > > 3. We write a protocol completely from scratch that doesn't try to describe the existing dataset API. Producers and consumers would then migrate to use the new protocol and deprecate their existing dataset integrations. We could introduce a dunder method in that API (sort of like __arrow_array__) that would make the migration seamless from the end-user perspective.
> > > > > > > > > > > >
> > > > > > > > > > > > *Which do you all think is the best path forward?*
> > > > > > > > > > > >
> > > > > > > > > > > > > Another concern I have is that we have not fully explained why we want to use Dataset instead of RecordBatchReader [9] as the basis of this protocol. I would like to see an explanation of why RecordBatchReader is not sufficient for this. RecordBatchReader seems like another possible way to represent "unmaterialized dataframes" and there are some parallels between RecordBatch/RecordBatchReader and Fragment/Dataset.
> > > > > > > > > > > >
> > > > > > > > > > > > This is a good point. I can add a section describing the differences. The main ones I can think of are: (1) Datasets are "pruneable": one can select a subset of columns and apply a filter on rows to avoid IO, and (2) they are splittable and serializable, so that fragments can be distributed amongst processes / workers.
> > > > > > > > > > > >
> > > > > > > > > > > > Best,
> > > > > > > > > > > >
> > > > > > > > > > > > Will Jones
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Jun 23, 2023 at 10:48 AM Ian Cook <ianmc...@apache.org> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Thanks Will for this proposal!
> > > > > > > > > > > > >
> > > > > > > > > > > > > For anyone familiar with PyArrow, this idea has a clear intuitive logic to it.
> > > > > > > > > > > > > It provides an expedient solution to the current lack of a practical means for interchanging "unmaterialized dataframes" between different Python libraries.
> > > > > > > > > > > > >
> > > > > > > > > > > > > To elaborate on that: if you look at how people use the Arrow Dataset API (which is implemented in the Arrow C++ library [1] and has bindings not just for Python [2] but also for Java [3] and R [4]), you'll see that Dataset is often used simply as a "virtual" variant of Table. It is used in cases when the data is larger than memory or when it is desirable to defer reading (materializing) the data into memory.
> > > > > > > > > > > > >
> > > > > > > > > > > > > So we can think of a Table as a materialized dataframe and a Dataset as an unmaterialized dataframe. That aspect of Dataset is, I think, what makes it most attractive as a protocol for enabling interoperability: it allows libraries to easily "speak Arrow" in cases where materializing the full data in memory upfront is impossible or undesirable.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The trouble is that Dataset was not designed to serve as a general-purpose unmaterialized dataframe. For example, the PyArrow Dataset constructor [5] exposes options for specifying a list of source files and a partitioning scheme, which are irrelevant for many of the applications that Will anticipates. And some work is needed to reconcile the methods of the PyArrow Dataset object [6] with the methods of the Table object. Some methods like filter() are exposed by both and behave lazily on Datasets and eagerly on Tables, as a user might expect. But many other Table methods are not implemented for Dataset though they potentially could be, and it is unclear where we should draw the line between adding methods to Dataset vs. encouraging new scanner implementations to expose options controlling what lazy operations should be performed as they see fit.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Will, I see that you've already addressed this issue to some extent in
For example, you mention that we > should > > > > > > initially > > > > > > > > >> > > > define this protocol to include only a minimal > subset > > > of the > > > > > > > > Dataset > > > > > > > > >> > > > API. I agree, but I think there are some loose ends > we > > > > > should > > > > > > be > > > > > > > > >> > > > careful to tie up. I strongly agree with the > comments > > > made > > > > > by > > > > > > > > David, > > > > > > > > >> > > > Weston, and Dewey arguing that we should avoid any > use > > > of > > > > > > > PyArrow > > > > > > > > >> > > > expressions in this API. Expressions are an > > > implementation > > > > > > > detail > > > > > > > > of > > > > > > > > >> > > > PyArrow, not a part of the Arrow standard. It would > be > > > much > > > > > > > safer > > > > > > > > for > > > > > > > > >> > > > the initial version of this protocol to not define > > *any* > > > > > > > > >> > > > methods/arguments that take expressions. This will > > > allow us > > > > > to > > > > > > > > take > > > > > > > > >> > > > some more time to finish up the Substrait expression > > > > > > > > implementation > > > > > > > > >> > > > work that is underway [7][8], then introduce > > > Substrait-based > > > > > > > > >> > > > expressions in a latter version of this protocol. > This > > > > > > approach > > > > > > > > will > > > > > > > > >> > > > better position this protocol to be implemented in > > other > > > > > > > languages > > > > > > > > >> > > > besides Python. > > > > > > > > >> > > > > > > > > > > > >> > > > Another concern I have is that we have not fully > > > explained > > > > > why > > > > > > > we > > > > > > > > >> want > > > > > > > > >> > > > to use Dataset instead of RecordBatchReader [9] as > the > > > basis > > > > > > of > > > > > > > > this > > > > > > > > >> > > > protocol. I would like to see an explanation of why > > > > > > > > RecordBatchReader > > > > > > > > >> > > > is not sufficient for this. RecordBatchReader seems > > like > > > > > > another > > > > > > > > >> > > > possible way to represent "unmaterialized > dataframes" > > > and > > > > > > there > > > > > > > > are > > > > > > > > >> > > > some parallels between RecordBatch/RecordBatchReader > > and > > > > > > > > >> > > > Fragment/Dataset. We should help developers and > users > > > > > > understand > > > > > > > > why > > > > > > > > >> > > > Arrow needs both of these. > > > > > > > > >> > > > > > > > > > > > >> > > > Thanks Will for your thoughtful prose explanations > > about > > > > > this > > > > > > > > >> proposed > > > > > > > > >> > > > API. After we arrive at a decision about this, I > think > > > we > > > > > > should > > > > > > > > >> > > > reproduce some of these explanations in docs, blog > > > posts, > > > > > > > cookbook > > > > > > > > >> > > > recipes, etc. because there is some important nuance > > > here > > > > > that > > > > > > > > will > > > > > > > > >> be > > > > > > > > >> > > > important for integrators of this API to understand. 
> > > > > > > > > > > > > On Wed, Jun 21, 2023 at 2:09 PM Will Jones <will.jones...@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hello Arrow devs,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I have drafted a PR defining an experimental protocol which would allow third-party libraries to imitate the PyArrow Dataset API [5]. This protocol is intended to endorse an integration pattern that is starting to be used in the Python ecosystem, where some libraries are providing their own scanners with this API, while query engines are accepting these as duck-typed objects.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > To give some background: back at the end of 2021, we collaborated with DuckDB to be able to read datasets (an Arrow C++ concept), supporting column selection and filter pushdown [1]. This was accomplished by having DuckDB manipulate Python (or R) objects to get a RecordBatchReader and then export it over the C Stream Interface.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Since then, DataFusion [2] and Polars have both made similar implementations for their Python bindings, allowing them to consume PyArrow datasets. This has created an implicit protocol, whereby arbitrary compute engines can push down queries into the PyArrow dataset scanner.
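Concretely, the implicit protocol looks like this from the consumer's side. A sketch assuming DuckDB's Python replacement scans and a hypothetical local "data/" directory of Parquet files with columns x and y:

    import duckdb
    import pyarrow.dataset as ds

    dataset = ds.dataset("data/", format="parquet")

    # DuckDB's replacement scan resolves the name "dataset" to the Python
    # object above, pushes the projection (x) and filter (y > 0) down into
    # the dataset's scanner, and consumes the resulting record batches
    # over the Arrow C Stream Interface.
    con = duckdb.connect()
    table = con.execute("SELECT x FROM dataset WHERE y > 0").arrow()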
> > > > > > > > > > > > > > Now, libraries supporting table formats including Delta Lake, Lance, and Iceberg are looking to support these engines while bringing their own scanners and metadata-handling implementations. One possible route is allowing them to imitate the PyArrow Dataset API.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Bringing these use cases together, I'd like to propose an experimental protocol, made out of the minimal subset of the PyArrow Dataset API necessary to facilitate this kind of integration. This would allow any library to produce a scanner implementation that arbitrary query engines could call into. I've drafted a PR [3], and there is some background research available in a Google doc [4].
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I've already gotten some good feedback on both, and would welcome more.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > One last point: I'd like for this to be a first step rather than a comprehensive API. This PR focuses on making explicit a protocol that is already in use in the ecosystem, but without much concrete definition. Once this is established, we can use our experience from this protocol to design something more permanent that takes advantage of newer innovations in the Arrow ecosystem (such as PyCapsules for the C Data Interface, or Substrait for passing expressions / scan plans). I am tracking such future improvements in [5].
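To give a sense of how small that minimal subset could be, here is a duck-typed sketch of the shape under discussion, written with typing.Protocol. This is only an illustration; the exact classes, methods, and signatures are what the PR [3] is meant to pin down:

    from typing import Any, Iterator, Optional, Protocol

    import pyarrow as pa

    class Scanner(Protocol):
        # Hands the data off as an Arrow stream for the engine to consume.
        def to_reader(self) -> pa.RecordBatchReader: ...

    class Fragment(Protocol):
        # A serializable unit of work that can be shipped to a worker.
        def scanner(self, columns: Optional[list] = None,
                    filter: Any = None, **kwargs: Any) -> Scanner: ...

    class Dataset(Protocol):
        schema: pa.Schema

        def get_fragments(self, filter: Any = None) -> Iterator[Fragment]: ...
        def scanner(self, columns: Optional[list] = None,
                    filter: Any = None, **kwargs: Any) -> Scanner: ...

PyArrow's own pyarrow.dataset.Dataset already fits this shape, which matches the note above that the current PyArrow Dataset classes conform to the protocol without changes.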
> > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Will Jones
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > [1] https://duckdb.org/2021/12/03/duck-arrow.html
> > > > > > > > > > > > > > [2] https://github.com/apache/arrow-datafusion-python/pull/9
> > > > > > > > > > > > > > [3] https://github.com/apache/arrow/pull/35568
> > > > > > > > > > > > > > [4] https://docs.google.com/document/d/1r56nt5Un2E7yPrZO9YPknBN4EDtptpx-tqOZReHvq1U/edit?pli=1
> > > > > > > > > > > > > > [5] https://docs.google.com/document/d/1-uVkSZeaBtOALVbqMOPeyV3s2UND7Wl-IGEZ-P-gMXQ/edit