Re: [DISCUSS] Canonical alternative layout proposal

Raphael Taylor-Davies Thu, 13 Jul 2023 20:33:59 -0700

clarify what constitutes support for a canonical alternative
layout

I had envisaged, perhaps naively, that we would just add a new DataType containing a string layout name, perhaps DataType::Raw(String). This would have no restrictions on the number of buffers, children, etc... and would effectively just be an opaque ArrayData. As interpreting such an array would require the layout name, I think it warrants inclusion at a lower level than just Field metadata. This is in contrast to extension types, where this metadata is not strictly necessary to operate on the arrays.

I haven't given this a huge amount of thought though, and so it is entirely possible this has been discounted for some reason, or has some peculiar edge cases.
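To make the idea concrete, here is a minimal sketch, assuming heavily simplified stand-ins for DataType and ArrayData rather than the real arrow-rs definitions. The helper names `raw_layout` and `can_interpret`, and the layout name "utf8_view", are hypothetical and only illustrate that a consumer needs the layout name before it can touch the buffers.

```rust
/// Simplified stand-in for a data type enum; NOT the arrow-rs definition.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Utf8,
    Int32,
    /// Hypothetical variant: an opaque layout identified only by its name.
    Raw(String),
}

/// Simplified stand-in for ArrayData: no restriction on buffers or children.
#[derive(Debug, Clone, PartialEq)]
pub struct ArrayData {
    pub data_type: DataType,
    pub len: usize,
    pub buffers: Vec<Vec<u8>>,
    pub children: Vec<ArrayData>,
}

impl ArrayData {
    /// Returns the layout name a consumer must understand, if the array is raw.
    pub fn raw_layout(&self) -> Option<&str> {
        match &self.data_type {
            DataType::Raw(name) => Some(name.as_str()),
            _ => None,
        }
    }
}

/// A consumer can interpret primary layouts unconditionally, but a raw array
/// only if it recognises the layout name (hypothetical helper).
pub fn can_interpret(array: &ArrayData, known_layouts: &[&str]) -> bool {
    match array.raw_layout() {
        Some(name) => known_layouts.contains(&name),
        None => true,
    }
}
```

Because the name lives in the data type itself rather than in Field metadata, a consumer that drops or ignores metadata still cannot mistake the buffers for a primary layout.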

would negotiate
with an Arrow implementation or other component that is
producing/consuming data to determine whether an alternative layout is
supported

My interpretation of the proposal was that no such negotiation would take place, instead the primary layout would always be chosen except in the presence of an explicit contrary signal. This is not hugely dissimilar from how dictionaries or run-encoded arrays are currently handled, where they will only be returned if either present in the input or explicitly requested.


Perhaps we might clarify this with something along the lines of:

- An Arrow implementation's APIs MAY produce data in an alternative layout to match one or more of its inputs, or an embedded Arrow schema

This covers both the cases of writing data to files and sending data over FFI. It does, however, carry the implication that alternative layouts are viral. I therefore wonder whether we should generalize the wording to:

- Arrow-native APIs MUST only produce data in an alternative layout if it is already present in one of their inputs, or is explicitly requested

This is to help ensure that users only end up with alternative layouts if they explicitly opt-in to this behaviour. What I think we want to avoid is systems producing alternative layouts by default, and this then leading to user confusion when data produced by one system is not interoperable with those of another.
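The generalized wording can be read as a simple predicate over an API's inputs and any explicit request. The following is a rough sketch under that reading; the `Layout` enum and `is_permitted_output` are hypothetical names that exist in no Arrow implementation.

```rust
/// Hypothetical description of the layout a column is held in.
#[derive(Debug, Clone, PartialEq)]
pub enum Layout {
    Primary,
    /// A named canonical alternative layout, e.g. "view".
    Alternative(String),
}

/// Checks the proposed rule: an alternative layout may only be produced if it
/// is already present in one of the inputs, or was explicitly requested.
/// The primary layout is always permitted.
pub fn is_permitted_output(
    output: &Layout,
    inputs: &[Layout],
    requested: Option<&Layout>,
) -> bool {
    match output {
        Layout::Primary => true,
        alt => inputs.contains(alt) || requested == Some(alt),
    }
}
```

Note that the predicate only says when an alternative layout is allowed; an implementation remains free to return the primary layout in every case.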


On 13/07/2023 20:28, Benjamin Kietzman wrote:

Canonical alternative layouts sounds like a workable path forward. Perhaps
understandably, my immediate thought is how I could rephrase Utf8View as a
canonical alternative layout for Utf8. In light of that, I have a few
questions to clarify what constitutes support for a canonical alternative
layout. Specifically:
- do we extend Field to indicate if and which alternative layout is being
  used
   - or do we add AltSchema to wrap a schema and indicate which of its
     fields have alternate layouts
   - ...
- do we extend RecordBatch to support canonical alternative layouts
   - or do we add AltRecordBatch for that purpose (which iiuc would
     complicate dictionary batches containing any column of an alternate
     layout)
   - ...

To add context, one of the reasons we could not just use extension types
for Utf8View is that these are required to be backed by a known layout, and
no primary layout in the format has a variable number of buffers. In order
to accommodate Utf8View as an alternative layout, the minimal change which
I can think of right now is
- to add `string Field::alternative_layout` to identify alternative layouts
  in a Schema
- to extend RecordBatch with support for variable buffer counts

This will put some burden on implementers to navigate the multiple
character buffers when reading serialized arrow batches. However it will
not require that any implementations' data structures support multiple
buffers since the explicit default for any implementation is to always
convert Utf8View to Utf8. If this sounds acceptable, I'll prepare a draft
PR which
- adds language for canonical alternative layouts to Columnar.rst
- adds Field::alternative_layout and RecordBatch::variable_buffer_counts
- adds the "view" alternative layout for the Utf8 Type as an initial example
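As a rough illustration of what a reader of serialized batches would have to do, here is a minimal sketch. It assumes simplified stand-ins for Field and RecordBatch (not the Arrow IPC definitions), treats every column as a string column, and assumes the primary Utf8 layout always contributes three buffers (validity, offsets, data). The function `split_buffers` is hypothetical.

```rust
/// Simplified stand-in for a schema field; NOT the Arrow IPC definition.
pub struct Field {
    pub name: String,
    /// None means the primary layout; Some("view") names an alternative one.
    pub alternative_layout: Option<String>,
}

/// Simplified stand-in for a serialized record batch of string columns.
pub struct RecordBatch {
    /// Buffers of all columns, flattened in field order.
    pub buffers: Vec<Vec<u8>>,
    /// One entry per column that uses an alternative layout, in field order.
    pub variable_buffer_counts: Vec<usize>,
}

/// Assumed buffer count of the primary Utf8 layout: validity, offsets, data.
const PRIMARY_UTF8_BUFFERS: usize = 3;

/// Splits the flattened buffer list into one slice per field.
/// Returns None if the counts and the buffer list are inconsistent.
pub fn split_buffers<'a>(
    fields: &[Field],
    batch: &'a RecordBatch,
) -> Option<Vec<&'a [Vec<u8>]>> {
    let mut counts = batch.variable_buffer_counts.iter().copied();
    let mut start = 0usize;
    let mut out = Vec::with_capacity(fields.len());
    for field in fields {
        // Alternative-layout columns read their buffer count from the batch.
        let n = match field.alternative_layout {
            Some(_) => counts.next()?,
            None => PRIMARY_UTF8_BUFFERS,
        };
        let end = start.checked_add(n)?;
        out.push(batch.buffers.get(start..end)?);
        start = end;
    }
    // Reject trailing buffers or counts that no field accounts for.
    if start == batch.buffers.len() && counts.next().is_none() {
        Some(out)
    } else {
        None
    }
}
```

An implementation whose own data structures do not support multiple character buffers could still use such a split to locate them and convert the column to Utf8 on read.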

Ben Kietzman

On Thu, Jul 13, 2023, 18:32 Aldrin <[email protected]> wrote:

Thanks Neal and Weston!

I prepared a diagram to solidify my own understanding of the context,
which can be found at [1].

I think alternative layouts sounds like a nice first approach to allowing
new layouts that can be supported lazily (implemented when it is
beneficial) by various implementations of the Arrow Columnar Format. But, I
do think that it's just a (practical) formalization of saying what layouts
are required and which ones are optional.

From the making of the diagram, I also decided that the discussion isn't
limited to performance, since there are several reasons new physical
layouts may be proposed (or, at least, there are many aspects of
performance). Even if it's not "canonical alternative layouts," I think it
is important that there be some process for developers that use Arrow to
propose extensions to the columnar format without having to prove out the
benefits for libraries that use a different tech stack (e.g. rust vs C++ vs
go).


[1]: https://docs.google.com/presentation/d/1EiBgwtoYW6ADTxFc9iRs8KLPV0st0GZqmGy40Uz8jPk/edit?usp=sharing

Aldrin

https://github.com/drin/
https://gitlab.com/octalene
https://keybase.io/octalene


------- Original Message -------
On Thursday, July 13th, 2023 at 10:49, Dane Pitkin
<[email protected]>  wrote:

I am in favor of this proposal. IMO the Arrow project is the right place to
standardize both the interoperability and operability of columnar data
layouts. Data engines are a core component of the Arrow ecosystem and the
project should be able to grow with these data engines as they converge on
new layouts. Since columnar data is ubiquitous in analytical workloads, we
are seeing a natural progression into optimizing those workloads. This
includes new lossless compression schemes for columnar data that allow
engines to operate directly on the compressed data (e.g. RLE). If we can't
reliably support the growing needs of the broader data engine ecosystem in
a timely manner, then I also fear Arrow might lose relevancy over time.

On Thu, Jul 13, 2023 at 11:59 AM Ian [email protected]  wrote:

Thank you Weston for proposing this solution and Neal for describing
its context and implications. I agree with the other replies here—this
seems like an elegant solution to a growing need that could, if left
unaddressed, increase the fragmentation of the ecosystem and reduce
the centrality of the Arrow format.

Greater diversity of layouts is happening. Whether it happens inside
of Arrow or outside of Arrow is up to us. I think we all would like to
see it happen inside of Arrow. This proposal allows for that, while
striking a balance as Raphael describes.

However I think there is still some ambiguity about exactly how an
Arrow implementation that is consuming/producing data would negotiate
with an Arrow implementation or other component that is
producing/consuming data to determine whether an alternative layout is
supported. This was discussed briefly in [5] but I am interested to
see how this negotiation would be implemented in practice in the C
data interface, IPC, Flight, etc.

Ian

[5] https://lists.apache.org/thread/7x2714wookjqgkoykxpq9jtpyrgx2bx2

On Thu, Jul 13, 2023 at 11:00 AM Raphael Taylor-Davies
[email protected]  wrote:

I like this proposal, I think it strikes a pragmatic balance between
preserving interoperability whilst still allowing new ideas to be
incorporated into the standard. Thank you for writing this up.

On 13/07/2023 10:22, Matt Topol wrote:

I don't have much to add but I do want to second Jacob's comments. I agree
that this is a good way to avoid the fragmentation while keeping Arrow
relevant, and likely something we need to do so that we can ensure Arrow
remains the way to do this data integration and interoperability.

On Wed, Jul 12, 2023 at 9:52 PM Jacob Wujciak-Jens
[email protected]  wrote:

Hello Everyone,

Thanks for this comprehensive but concise write up Neal! I think this
proposal is a good way to avoid both fragmentation of the arrow ecosystem
as well as its obsolescence. In my opinion of these two problems the
obsolescence is the bigger issue as (as mentioned in the proposal) arrow is
already (close to) being relegated to the sidelines in eco-system defining
projects.

Jacob

On Thu, Jul 13, 2023 at 12:03 AM Neal Richardson <[email protected]> wrote:

Hi all,
As was previously raised in [1] and surfaced again in [2], there is a
proposal for representing alternative layouts. The intent, as I understand
it, is to be able to support memory layouts that some (but perhaps not all)
applications of Arrow find valuable, so that these nearly Arrow systems can
be fully Arrow-native.

I wanted to start a more focused discussion on it because I think it's
worth being considered on its own merits, but I also think this gets to the
core of what the Arrow project is and should be, and I don't want us to
lose sight of that.

To restate the proposal from [1]:

* There are one or more primary layouts
* Existing layouts are automatically considered primary layouts, even if
  they wouldn't have been primary layouts initially (e.g. large list)
* A new layout, if it is semantically equivalent to another, is considered
  an alternative layout
* An alternative layout still has the same requirements for adoption (two
  implementations and a vote)
* An implementation should not feel pressured to rush and implement the new
  layout. It would be good if they contribute in the discussion and
  consider the layout and vote if they feel it would be an acceptable
  design.
* We can define and vote and approve as many canonical alternative layouts
  as we want:
   * A canonical alternative layout should, at a minimum, have some
     reasonable justification, such as improved performance for algorithm X
* Arrow implementations MUST support the primary layouts
* An Arrow implementation MAY support a canonical alternative, however:
   * An Arrow implementation MUST first support the primary layout
   * An Arrow implementation MUST support conversion to/from the primary
     and canonical layout
   * An Arrow implementation's APIs MUST only provide data in the
     alternative layout if it is explicitly asked for (e.g. schema
     inference should prefer the primary layout).
* We can still vote for new primary layouts (e.g. promoting a canonical
  alternative) but, in these votes we don't only consider the value (e.g.
  performance) of the layout but also the interoperability. In other words,
  a layout can only become a primary layout if there is significant
  evidence that most implementations plan to adopt it.

To summarize some of the arguments against the proposal from the previous
threads, there are concerns about increasing the complexity of the Arrow
specification and the cost/burden of updating all of the Arrow
specifications to support them.

Where these discussions, both about several proposed new types and this
layout proposal, get to the core of Arrow is well expressed in the comments
on the previous thread by Raphael [3] and Pedro [4]. Raphael asks: "what
matters to people more, interoperability or best-in-class performance?" And
Pedro notes that the overhead of converting these not-yet-Arrow types to
the Arrow C ABI is high enough that they've considered abandoning Arrow as
their interchange format. So: on the one hand, we're kinda choosing which
quality we're optimizing for, but on the other, interoperability and
performance are dependent on each other.

What I see that we're trying to do here is find a way to expand the Arrow
specification just enough so that Arrow becomes or remains the in-memory
standard everywhere, but not so much that it creates too much complexity or
burden to implement. Expand too much and you get a fragmented ecosystem
where everyone is writing subsets of the Arrow standard and so nothing is
fully compatible and the whole premise is undermined. But expand too little
and projects will abandon the standard and we've also failed.

I don't have a tidy answer, but I wanted to acknowledge the bigger issues,
and see if this helps us reason about the various proposals on the table. I
wonder if the alternative layout proposal is the happy medium that adds
some complexity to the specification, but less than there would be if three
new types were added, and still meets the needs of projects like DuckDB,
Velox, and Gluten and gets them fully Arrow native.

Neal
