I am also in favor of the idea of an alternative layout. IIRC, a new
alternative layout still goes through a standardization process, though each
implementation chooses whether to support it now or later. I'd like to ask
whether we can give implementations and downstream projects the flexibility
to implement a new alternative layout through a pluggable interface before
the standardization process starts. This is similar to promoting a popular
extension type, already implemented by many users, to a canonical extension
type. I know this is more complicated, because an extension type simply
reuses an existing layout whereas an alternative layout usually means a
brand new one. For example, if two projects speak Arrow and want to share a
new layout, they can simply implement it as a pluggable alternative layout
before Arrow adopts it. This would unblock projects that need to evolve and
help keep Arrow from fragmenting.
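
To make the idea concrete, here is a rough sketch of what a pluggable
interface could look like. It is only an illustration under my own
assumptions: every name in it (AlternativeLayout, register_layout,
export_values, import_values, "example.run_length") is hypothetical and not
part of any Arrow implementation, and the toy run-length layout stands in for
whatever layout two projects agree on. Data is exported in the primary layout
unless the caller explicitly asks for an alternative, and every alternative
must convert to and from the primary layout.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class AlternativeLayout:
    """Hypothetical plug-in: a named layout plus converters to/from primary."""
    name: str
    to_primary: Callable[[Any], List[Any]]
    from_primary: Callable[[List[Any]], Any]


# Hypothetical process-wide registry shared by cooperating projects.
_REGISTRY: Dict[str, AlternativeLayout] = {}


def register_layout(layout: AlternativeLayout) -> None:
    if layout.name in _REGISTRY:
        raise ValueError(f"layout already registered: {layout.name}")
    _REGISTRY[layout.name] = layout


def export_values(values: List[Any], layout: Optional[str] = None) -> Tuple[str, Any]:
    """Return data in the primary layout unless an alternative is requested."""
    if layout is None:
        return ("primary", list(values))
    return (layout, _REGISTRY[layout].from_primary(list(values)))


def import_values(tag: str, payload: Any) -> List[Any]:
    """Convert incoming data back to the primary layout."""
    if tag == "primary":
        return list(payload)
    if tag not in _REGISTRY:
        raise KeyError(f"unsupported alternative layout: {tag}")
    return _REGISTRY[tag].to_primary(payload)


# Toy alternative layout: (value, run length) pairs.
def _rle_encode(values: List[Any]) -> List[Tuple[Any, int]]:
    runs: List[List[Any]] = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]


def _rle_decode(runs: List[Tuple[Any, int]]) -> List[Any]:
    return [v for v, n in runs for _ in range(n)]


register_layout(AlternativeLayout("example.run_length", _rle_decode, _rle_encode))
```

A consumer that has not registered "example.run_length" fails loudly in
import_values instead of misreading the payload, and a producer can always
fall back to export_values(values) to hand over the primary layout.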

Best,
Gang

On Tue, Jul 18, 2023 at 10:35 PM Antoine Pitrou <anto...@python.org> wrote:

>
> Hello,
>
> I'm trying to reason about the advantages and drawbacks of this
> proposal, but it seems to me that it lacks definition.
>
> I would welcome a draft PR showcasing the changes necessary in the IPC
> format definition, and in the C Data Interface specification (no need to
> actually implement them for now :-)).
>
>
> As it is, it seems that this proposal would allow us to switch from:
>
> """We'd like to add a more efficient physical data representation, so
> we'll introduce a new Arrow data type. Implementations may or may not
> support it, but we will progressively try to bring reference
> implementations to parity.""" (1)
>
> to:
>
> """We'd like to add a more efficient physical data representation, so
> we'll introduce a new alternative layout for an existing Arrow data
> type. Implementations may or may not support it, but we will
> progressively try to bring reference implementations to parity.""" (2)
>
> The expected advantage of (2) over (1) seems to be mainly a difference
> in how new format features are communicated. There are mainline
> features, and there are experimental / provisional features.
>
> Regards
>
> Antoine.
>
>
>
> On 13/07/2023 at 00:01, Neal Richardson wrote:
> > Hi all,
> > As was previously raised in [1] and surfaced again in [2], there is a
> > proposal for representing alternative layouts. The intent, as I understand
> > it, is to be able to support memory layouts that some (but perhaps not all)
> > applications of Arrow find valuable, so that these nearly Arrow systems can
> > be fully Arrow-native.
> >
> > I wanted to start a more focused discussion on it because I think it's
> > worth being considered on its own merits, but I also think this gets to the
> > core of what the Arrow project is and should be, and I don't want us to
> > lose sight of that.
> >
> > To restate the proposal from [1]:
> >
> >   * There are one or more primary layouts
> >     * Existing layouts are automatically considered primary layouts, even
> >       if they wouldn't have been primary layouts initially (e.g. large list)
> >   * A new layout, if it is semantically equivalent to another, is
> >     considered an alternative layout
> >   * An alternative layout still has the same requirements for adoption
> >     (two implementations and a vote)
> >     * An implementation should not feel pressured to rush and implement the
> >       new layout. It would be good if they contribute in the discussion and
> >       consider the layout and vote if they feel it would be an acceptable
> >       design.
> >   * We can define and vote and approve as many canonical alternative
> >     layouts as we want:
> >     * A canonical alternative layout should, at a minimum, have some
> >       reasonable justification, such as improved performance for algorithm X
> >   * Arrow implementations MUST support the primary layouts
> >   * An Arrow implementation MAY support a canonical alternative, however:
> >     * An Arrow implementation MUST first support the primary layout
> >     * An Arrow implementation MUST support conversion to/from the primary
> >       and canonical layout
> >     * An Arrow implementation's APIs MUST only provide data in the
> >       alternative layout if it is explicitly asked for (e.g. schema
> >       inference should prefer the primary layout).
> >   * We can still vote for new primary layouts (e.g. promoting a canonical
> >     alternative) but, in these votes we don't only consider the value (e.g.
> >     performance) of the layout but also the interoperability. In other
> >     words, a layout can only become a primary layout if there is
> >     significant evidence that most implementations plan to adopt it.
> >
> >
> > To summarize some of the arguments against the proposal from the previous
> > threads, there are concerns about increasing the complexity of the Arrow
> > specification and the cost/burden of updating all of the Arrow
> > specifications to support them.
> >
> > Where these discussions, both about several proposed new types and this
> > layout proposal, get to the core of Arrow is well expressed in the comments
> > on the previous thread by Raphael [3] and Pedro [4]. Raphael asks: "what
> > matters to people more, interoperability or best-in-class performance?" And
> > Pedro notes that the overhead of converting these not-yet-Arrow types to
> > the Arrow C ABI is high enough that they've considered abandoning Arrow as
> > their interchange format. So: on the one hand, we're kinda choosing which
> > quality we're optimizing for, but on the other, interoperability and
> > performance are dependent on each other.
> >
> > What I see that we're trying to do here is find a way to expand the Arrow
> > specification just enough so that Arrow becomes or remains the in-memory
> > standard everywhere, but not so much that it creates too much complexity or
> > burden to implement. Expand too much and you get a fragmented ecosystem
> > where everyone is writing subsets of the Arrow standard and so nothing is
> > fully compatible and the whole premise is undermined. But expand too little
> > and projects will abandon the standard and we've also failed.
> >
> > I don't have a tidy answer, but I wanted to acknowledge the bigger issues,
> > and see if this helps us reason about the various proposals on the table. I
> > wonder if the alternative layout proposal is the happy medium that adds
> > some complexity to the specification, but less than there would be if three
> > new types were added, and still meets the needs of projects like DuckDB,
> > Velox, and Gluten and gets them fully Arrow native.
> >
> > Neal
> >
> >
> > [1]: https://lists.apache.org/thread/pfy02d9m2zh08vn8opm5td6l91z6ssrk
> > [2]: https://lists.apache.org/thread/wosy53ysoy4s0yy6zbnch3dx2x4jplw6
> > [3]: https://lists.apache.org/thread/r35g5612kszx9scfpk5rqpmlym4yq832
> > [4]: https://lists.apache.org/thread/5k7kopc5r9morm0vk4z2f6w1vh87q38h
> >
>
