Re: [DISCUSS] Canonical alternative layout proposal

Benjamin Kietzman Thu, 13 Jul 2023 17:29:15 -0700

Canonical alternative layouts sound like a workable path forward. Perhaps
understandably, my immediate thought is how I could rephrase Utf8View as a
canonical alternative layout for Utf8. In light of that, I have a few
questions to clarify what constitutes support for a canonical alternative
layout. Specifically:
- do we extend Field to indicate whether an alternative layout is being
  used, and which one?
  - or do we add AltSchema to wrap a schema and indicate which of its
    fields have alternative layouts?
  - ...
- do we extend RecordBatch to support canonical alternative layouts?
  - or do we add AltRecordBatch for that purpose (which iiuc would
    complicate dictionary batches containing any column of an alternative
    layout)?
  - ...


To add context, one of the reasons we could not just use extension types
for Utf8View is that extension types are required to be backed by a known
layout, and no primary layout in the format has a variable number of
buffers. In order to accommodate Utf8View as an alternative layout, the
minimal change which I can think of right now is
- to add `string Field::alternative_layout` to identify alternative layouts
in a Schema
- to extend RecordBatch with support for variable buffer counts
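
As a rough illustration of those two changes (not the actual Arrow schema
definitions), the sketch below models a field carrying an optional
`alternative_layout` string and a record batch carrying per-column variable
buffer counts. The names `FieldSketch`, `RecordBatchSketch`,
`PRIMARY_BUFFER_COUNTS` and `split_buffers` are hypothetical, and the fixed
buffer counts per type are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Assumed buffer counts for primary layouts (validity + values, etc.).
PRIMARY_BUFFER_COUNTS = {"int32": 2, "utf8": 3}


@dataclass
class FieldSketch:
    name: str
    type_name: str
    # None means the primary layout; e.g. "view" names an alternative layout.
    alternative_layout: Optional[str] = None


@dataclass
class RecordBatchSketch:
    length: int
    # Flat list of buffers for all columns, in schema order.
    buffers: List[bytes] = field(default_factory=list)
    # One entry per column that uses a variable-buffer-count layout.
    variable_buffer_counts: List[int] = field(default_factory=list)


def split_buffers(schema: List[FieldSketch],
                  batch: RecordBatchSketch) -> Dict[str, List[bytes]]:
    """Assign the batch's flat buffer list to columns by name."""
    columns: Dict[str, List[bytes]] = {}
    pos = 0
    variable = iter(batch.variable_buffer_counts)
    for f in schema:
        if f.alternative_layout is None:
            count = PRIMARY_BUFFER_COUNTS[f.type_name]
        else:
            # The reader cannot know the count from the type alone.
            count = next(variable)
        columns[f.name] = batch.buffers[pos:pos + count]
        pos += count
    if pos != len(batch.buffers):
        raise ValueError("buffer count does not match the schema")
    return columns
```

The point of the sketch is only that a reader needs both pieces of
information: the schema says which columns use an alternative layout, and
the batch says how many buffers each of those columns actually has.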

This will put some burden on implementers to navigate the multiple
character buffers when reading serialized arrow batches. However it will
not require that any implementations' data structures support multiple
buffers since the explicit default for any implementation is to always
convert Utf8View to Utf8. If this sounds acceptable, I'll prepare a draft
PR which
- adds language for canonical alternative layouts to Columnar.rst
- adds Field::alternative_layout and RecordBatch::variable_buffer_counts
- adds the "view" alternative layout for the Utf8 Type as an initial example
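
To make the "always convert Utf8View to Utf8" default more concrete, here is
a minimal sketch of such a conversion. It assumes a 16-byte view encoding
that is not spelled out in this message: an int32 length, followed either by
up to 12 inline bytes or by a 4-byte prefix, an int32 buffer index and an
int32 offset. The helper names `make_view` and `view_to_utf8` are
hypothetical, and validity bitmaps are ignored.

```python
import struct
from typing import List, Sequence, Tuple

VIEW_SIZE = 16   # assumed size of one view, in bytes
INLINE_MAX = 12  # assumed maximum length stored inline in the view


def make_view(value: bytes, buffer_index: int, offset: int) -> bytes:
    """Encode one view; only used here to build example input."""
    if len(value) <= INLINE_MAX:
        return struct.pack("<i", len(value)) + value.ljust(INLINE_MAX, b"\0")
    return struct.pack("<i4sii", len(value), value[:4], buffer_index, offset)


def view_to_utf8(views: bytes,
                 data_buffers: Sequence[bytes]) -> Tuple[List[int], bytes]:
    """Convert views plus N character buffers into Utf8 offsets and data."""
    offsets = [0]
    data = bytearray()
    for start in range(0, len(views), VIEW_SIZE):
        (length,) = struct.unpack_from("<i", views, start)
        if length <= INLINE_MAX:
            # Short strings live inside the view itself.
            chunk = views[start + 4:start + 4 + length]
        else:
            # Long strings are referenced by (buffer index, offset).
            index, offset = struct.unpack_from("<ii", views, start + 8)
            chunk = data_buffers[index][offset:offset + length]
        data += chunk
        offsets.append(len(data))
    return offsets, bytes(data)
```

A reader that only supports the primary layout would run something like this
once per column and then hand the single offsets/data pair to its existing
Utf8 data structures.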

Ben Kietzman

On Thu, Jul 13, 2023, 18:32 Aldrin <octalene....@pm.me.invalid> wrote:

> Thanks Neal and Weston!
>
> I prepared a diagram to solidify my own understanding of the context,
> which can be found at [1].
>
> I think alternative layouts sound like a nice first approach to allowing
> new layouts that can be supported lazily (implemented when it is
> beneficial) by various implementations of the Arrow Columnar Format. But, I
> do think that it's just a (practical) formalization of saying what layouts
> are required and which ones are optional.
>
> From the making of the diagram, I also decided that the discussion isn't
> limited to performance, since there are several reasons new physical
> layouts may be proposed (or, at least, there are many aspects of
> performance). Even if it's not "canonical alternative layouts," I think it
> is important that there be some process for developers that use Arrow to
> propose extensions to the columnar format without having to prove out the
> benefits for libraries that use a different tech stack (e.g. rust vs C++ vs
> go).
>
>
> [1]:
> https://docs.google.com/presentation/d/1EiBgwtoYW6ADTxFc9iRs8KLPV0st0GZqmGy40Uz8jPk/edit?usp=sharing
>
>
>
>
> # ------------------------------
>
> # Aldrin
>
>
> https://github.com/drin/
>
> https://gitlab.com/octalene
>
> https://keybase.io/octalene
>
>
> ------- Original Message -------
> On Thursday, July 13th, 2023 at 10:49, Dane Pitkin
> <d...@voltrondata.com.INVALID> wrote:
>
>
> > I am in favor of this proposal. IMO the Arrow project is the right place
> > to standardize both the interoperability and operability of columnar data
> > layouts. Data engines are a core component of the Arrow ecosystem and the
> > project should be able to grow with these data engines as they converge
> > on new layouts. Since columnar data is ubiquitous in analytical workloads,
> > we are seeing a natural progression into optimizing those workloads. This
> > includes new lossless compression schemes for columnar data that allow
> > engines to operate directly on the compressed data (e.g. RLE). If we
> > can't reliably support the growing needs of the broader data engine
> > ecosystem in a timely manner, then I also fear Arrow might lose relevancy
> > over time.
> >
>
> > On Thu, Jul 13, 2023 at 11:59 AM Ian Cook ianmc...@apache.org wrote:
> >
>
> > > Thank you Weston for proposing this solution and Neal for describing
> > > its context and implications. I agree with the other replies here—this
> > > seems like an elegant solution to a growing need that could, if left
> > > unaddressed, increase the fragmentation of the ecosystem and reduce
> > > the centrality of the Arrow format.
> > >
>
> > > Greater diversity of layouts is happening. Whether it happens inside
> > > of Arrow or outside of Arrow is up to us. I think we all would like to
> > > see it happen inside of Arrow. This proposal allows for that, while
> > > striking a balance as Raphael describes.
> > >
>
> > > However I think there is still some ambiguity about exactly how an
> > > Arrow implementation that is consuming/producing data would negotiate
> > > with an Arrow implementation or other component that is
> > > producing/consuming data to determine whether an alternative layout is
> > > supported. This was discussed briefly in [5] but I am interested to
> > > see how this negotiation would be implemented in practice in the C
> > > data interface, IPC, Flight, etc.
> > >
>
> > > Ian
> > >
>
> > > [5] https://lists.apache.org/thread/7x2714wookjqgkoykxpq9jtpyrgx2bx2
> > >
>
> > > On Thu, Jul 13, 2023 at 11:00 AM Raphael Taylor-Davies
> > > r.taylordav...@googlemail.com.invalid wrote:
> > >
>
> > > > I like this proposal, I think it strikes a pragmatic balance between
> > > > preserving interoperability whilst still allowing new ideas to be
> > > > incorporated into the standard. Thank you for writing this up.
> > > >
>
> > > > On 13/07/2023 10:22, Matt Topol wrote:
> > > >
>
> > > > > > I don't have much to add but I do want to second Jacob's comments.
> > > > > > I agree that this is a good way to avoid the fragmentation while
> > > > > > keeping Arrow relevant, and likely something we need to do so that
> > > > > > we can ensure Arrow remains the way to do this data integration and
> > > > > > interoperability.
> > > > >
>
> > > > > On Wed, Jul 12, 2023 at 9:52 PM Jacob Wujciak-Jens
> > > > > ja...@voltrondata.com.invalid wrote:
> > > > >
>
> > > > > > Hello Everyone,
> > > > > >
>
> > > > > > > Thanks for this comprehensive but concise write-up, Neal! I think
> > > > > > > this proposal is a good way to avoid both the fragmentation of the
> > > > > > > Arrow ecosystem and its obsolescence. In my opinion, of these two
> > > > > > > problems, obsolescence is the bigger issue, since (as mentioned in
> > > > > > > the proposal) Arrow is already (close to) being relegated to the
> > > > > > > sidelines in ecosystem-defining projects.
> > > > > >
>
> > > > > > Jacob
> > > > > >
>
> > > > > > On Thu, Jul 13, 2023 at 12:03 AM Neal Richardson <
> > > > > > neal.p.richard...@gmail.com> wrote:
> > > > > >
>
> > > > > > > Hi all,
> > > > > > > > As was previously raised in [1] and surfaced again in [2], there
> > > > > > > > is a proposal for representing alternative layouts. The intent, as
> > > > > > > > I understand it, is to be able to support memory layouts that some
> > > > > > > > (but perhaps not all) applications of Arrow find valuable, so that
> > > > > > > > these nearly Arrow systems can be fully Arrow-native.
> > > > > > >
>
> > > > > > > > I wanted to start a more focused discussion on it because I think
> > > > > > > > it's worth being considered on its own merits, but I also think
> > > > > > > > this gets to the core of what the Arrow project is and should be,
> > > > > > > > and I don't want us to lose sight of that.
> > > > > > >
>
> > > > > > > > To restate the proposal from [1]:
> > > > > > >
>
> > > > > > > > * There are one or more primary layouts
> > > > > > > > * Existing layouts are automatically considered primary layouts,
> > > > > > > >   even if they wouldn't have been primary layouts initially (e.g.
> > > > > > > >   large list)
> > > > > > > > * A new layout, if it is semantically equivalent to another, is
> > > > > > > >   considered an alternative layout
> > > > > > > > * An alternative layout still has the same requirements for
> > > > > > > >   adoption (two implementations and a vote)
> > > > > > > > * An implementation should not feel pressured to rush and
> > > > > > > >   implement the new layout. It would be good if they contribute in
> > > > > > > >   the discussion and consider the layout and vote if they feel it
> > > > > > > >   would be an acceptable design.
> > > > > > > > * We can define and vote and approve as many canonical alternative
> > > > > > > >   layouts as we want:
> > > > > > > >   * A canonical alternative layout should, at a minimum, have some
> > > > > > > >     reasonable justification, such as improved performance for
> > > > > > > >     algorithm X
> > > > > > > > * Arrow implementations MUST support the primary layouts
> > > > > > > > * An Arrow implementation MAY support a canonical alternative,
> > > > > > > >   however:
> > > > > > > >   * An Arrow implementation MUST first support the primary layout
> > > > > > > >   * An Arrow implementation MUST support conversion to/from the
> > > > > > > >     primary and canonical layout
> > > > > > > >   * An Arrow implementation's APIs MUST only provide data in the
> > > > > > > >     alternative layout if it is explicitly asked for (e.g. schema
> > > > > > > >     inference should prefer the primary layout).
> > > > > > > > * We can still vote for new primary layouts (e.g. promoting a
> > > > > > > >   canonical alternative) but, in these votes we don't only consider
> > > > > > > >   the value (e.g. performance) of the layout but also the
> > > > > > > >   interoperability. In other words, a layout can only become a
> > > > > > > >   primary layout if there is significant evidence that most
> > > > > > > >   implementations plan to adopt it.
> > > > > > >
>
> > > > > > > > To summarize some of the arguments against the proposal from the
> > > > > > > > previous threads, there are concerns about increasing the
> > > > > > > > complexity of the Arrow specification and the cost/burden of
> > > > > > > > updating all of the Arrow implementations to support them.
> > > > > > >
>
> > > > > > > > Where these discussions, both about several proposed new types and
> > > > > > > > this layout proposal, get to the core of Arrow is well expressed in
> > > > > > > > the comments on the previous thread by Raphael [3] and Pedro [4].
> > > > > > > > Raphael asks: "what matters to people more, interoperability or
> > > > > > > > best-in-class performance?" And Pedro notes that the overhead of
> > > > > > > > converting these not-yet-Arrow types to the Arrow C ABI is high
> > > > > > > > enough that they've considered abandoning Arrow as their
> > > > > > > > interchange format. So: on the one hand, we're kinda choosing which
> > > > > > > > quality we're optimizing for, but on the other, interoperability
> > > > > > > > and performance are dependent on each other.
> > > > > > >
>
> > > > > > > > What I see that we're trying to do here is find a way to expand the
> > > > > > > > Arrow specification just enough so that Arrow becomes or remains
> > > > > > > > the in-memory standard everywhere, but not so much that it creates
> > > > > > > > too much complexity or burden to implement. Expand too much and you
> > > > > > > > get a fragmented ecosystem where everyone is writing subsets of the
> > > > > > > > Arrow standard and so nothing is fully compatible and the whole
> > > > > > > > premise is undermined. But expand too little and projects will
> > > > > > > > abandon the standard and we've also failed.
> > > > > > >
>
> > > > > > > > I don't have a tidy answer, but I wanted to acknowledge the bigger
> > > > > > > > issues, and see if this helps us reason about the various proposals
> > > > > > > > on the table. I wonder if the alternative layout proposal is the
> > > > > > > > happy medium that adds some complexity to the specification, but
> > > > > > > > less than there would be if three new types were added, and still
> > > > > > > > meets the needs of projects like DuckDB, Velox, and Gluten and gets
> > > > > > > > them fully Arrow native.
> > > > > > >
>
> > > > > > > Neal
