> clarify what constitutes support for a canonical alternative
> layout
I had envisaged, perhaps naively, that we would just add a new DataType
containing a string layout name, perhaps DataType::Raw(String). This
would have no restrictions on the number of buffers, children, etc...
and would effectively just be an opaque ArrayData. As interpreting such
an array would require the layout name, I think it warrants inclusion at
a lower level than just Field metadata. This is in contrast to extension
types, where this metadata is not strictly necessary to operate on the
arrays.
I haven't given this a huge amount of thought though, and so it is
entirely possible this has been discounted for some reason, or has some
peculiar edge cases.
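To make the idea concrete, here is a minimal sketch, written in Python for brevity rather than against any real Arrow implementation. Every name in it (`RawType`, `OpaqueArrayData`, `INTERPRETERS`, `interpret`, and the toy "one-buffer-per-value" layout) is hypothetical. It only illustrates that the layout name lives on the type itself, so the buffers cannot be interpreted without it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class RawType:
    """Hypothetical stand-in for a DataType::Raw(String) variant."""
    layout_name: str


@dataclass
class OpaqueArrayData:
    """Opaque container: any number of buffers and children is allowed."""
    data_type: RawType
    length: int
    buffers: List[bytes] = field(default_factory=list)
    children: List["OpaqueArrayData"] = field(default_factory=list)


# Hypothetical registry: decoders are looked up by layout name.
INTERPRETERS: Dict[str, Callable[[OpaqueArrayData], list]] = {}


def interpret(array: OpaqueArrayData) -> list:
    """Decode an opaque array, failing loudly on an unknown layout."""
    name = array.data_type.layout_name
    if name not in INTERPRETERS:
        raise NotImplementedError(f"unsupported layout: {name!r}")
    return INTERPRETERS[name](array)


# Toy layout, for illustration only: one buffer per string value.
INTERPRETERS["one-buffer-per-value"] = lambda a: [b.decode() for b in a.buffers]

arr = OpaqueArrayData(RawType("one-buffer-per-value"), 2, [b"foo", b"bar"])
values = interpret(arr)
```

A consumer that does not know the layout name gets an explicit error instead of misreading the buffers, which is the reason for carrying the name at the type level rather than in Field metadata.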
> would negotiate with an Arrow implementation or other component that
> is producing/consuming data to determine whether an alternative
> layout is supported
My interpretation of the proposal was that no such negotiation would
take place, instead the primary layout would always be chosen except in
the presence of an explicit contrary signal. This is not hugely
dissimilar from how dictionaries or run-encoded arrays are currently
handled, where they will only be returned if either present in the input
or explicitly requested.
Perhaps we might clarify this with something along the lines of:
- An Arrow implementation's APIs MAY produce data in an alternative
layout to match one or more of its inputs, or an embedded arrow schema
This covers both the cases of writing data to files and sending data
over FFI. It does, however, carry the implication that alternative
layouts are viral. I therefore wonder if we should generalize the wording to:
- Arrow-native APIs MUST only produce data in an alternative layout if
it is already present in one of its inputs, or explicitly requested
This is to help ensure that users only end up with alternative layouts
if they explicitly opt-in to this behaviour. What I think we want to
avoid is systems producing alternative layouts by default, and this then
leading to user confusion when data produced by one system is not
interoperable with those of another.
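The generalized wording can be read as a small decision rule. The sketch below is one possible reading, not part of the proposal: `choose_output_layout` and the layout names are hypothetical, and it chooses to propagate an alternative layout found in the inputs, although the wording would equally allow falling back to the primary layout in that case.

```python
from typing import Iterable, Optional

PRIMARY = "primary"


def choose_output_layout(
    input_layouts: Iterable[str],
    requested: Optional[str] = None,
) -> str:
    """Pick the layout an API produces under the suggested wording."""
    if requested is not None:
        return requested  # explicit opt-in by the caller
    for layout in input_layouts:
        if layout != PRIMARY:
            return layout  # alternative layout already present in an input
    return PRIMARY  # default: never introduce an alternative layout
```

With only primary inputs and no request the result is always the primary layout, so a user cannot end up with an alternative layout by accident.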
On 13/07/2023 20:28, Benjamin Kietzman wrote:
Canonical alternative layouts sounds like a workable path forward. Perhaps
understandably, my immediate thought is how I could rephrase Utf8View as a
canonical alternative layout for Utf8. In light of that, I have a few
questions to clarify what constitutes support for a canonical alternative
layout. Specifically:
- do we extend Field to indicate if and which alternative layout is being
used
- or do we add AltSchema to wrap a schema and indicate which of its
fields have alternate layouts
- ...
- do we extend RecordBatch to support canonical alternative layouts
- or do we add AltRecordBatch for that purpose (which iiuc would
complicate dictionary batches containing any column of an alternate layout)
- ...
To add context, one of the reasons we could not just use extension types
for Utf8View is that these are required to be backed by a known layout, and
no primary layout in the format has a variable number of buffers. In order
to accommodate Utf8View as an alternative layout, the minimal change which
I can think of right now is
- to add `string Field::alternative_layout` to identify alternative layouts
in a Schema
- to extend RecordBatch with support for variable buffer counts
This will put some burden on implementers to navigate the multiple
character buffers when reading serialized arrow batches. However it will
not require that any implementations' data structures support multiple
buffers since the explicit default for any implementation is to always
convert Utf8View to Utf8. If this sounds acceptable, I'll prepare a draft
PR which
- adds language for canonical alternative layouts to Columnar.rst
- adds Field::alternative_layout and RecordBatch::variable_buffer_counts
- adds the "view" alternative layout for the Utf8 Type as an initial example
Ben Kietzman
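As an illustration of the default conversion from Utf8View to Utf8 with a variable number of character buffers, here is a minimal Python sketch. It assumes the 16-byte view encoding as it was later written into the Arrow columnar format (a 4-byte length, then either up to 12 inline bytes or a 4-byte prefix, a buffer index and an offset); the message above does not spell this out. The helper names `make_view` and `views_to_utf8` are hypothetical, and validity bitmaps are ignored.

```python
import struct
from typing import List, Tuple

INLINE_MAX = 12  # assumed: strings up to 12 bytes live inside the view itself


def make_view(value: bytes, buffer_index: int, offset: int) -> bytes:
    """Encode one 16-byte little-endian view for `value`."""
    if len(value) <= INLINE_MAX:
        # length + inline data, zero-padded to 12 bytes by the "12s" format
        return struct.pack("<i12s", len(value), value)
    # length + 4-byte prefix + data buffer index + offset into that buffer
    return struct.pack("<i4sii", len(value), value[:4], buffer_index, offset)


def views_to_utf8(views: bytes, data_buffers: List[bytes]) -> Tuple[List[int], bytes]:
    """Convert views plus any number of data buffers to Utf8 offsets + data."""
    offsets, data = [0], bytearray()
    for i in range(0, len(views), 16):
        view = views[i:i + 16]
        (length,) = struct.unpack_from("<i", view, 0)
        if length <= INLINE_MAX:
            data += view[4:4 + length]  # value is stored inline
        else:
            buffer_index, offset = struct.unpack_from("<ii", view, 8)
            data += data_buffers[buffer_index][offset:offset + length]
        offsets.append(len(data))
    return offsets, bytes(data)


buffers = [b"a fairly long string", b"another long string value"]
views = (
    make_view(b"short", 0, 0)
    + make_view(buffers[0], 0, 0)
    + make_view(buffers[1], 1, 0)
)
offsets, data = views_to_utf8(views, buffers)
```

Only the reader has to walk the multiple character buffers. The result is an ordinary offsets-plus-data pair, so the in-memory data structures of an implementation that always converts do not need to change.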
On Thu, Jul 13, 2023, 18:32 Aldrin <octalene....@pm.me.invalid> wrote:
Thanks Neal and Weston!
I prepared a diagram to solidify my own understanding of the context,
which can be found at [1].
I think alternative layouts sounds like a nice first approach to allowing
new layouts that can be supported lazily (implemented when it is
beneficial) by various implementations of the Arrow Columnar Format. But, I
do think that it's just a (practical) formalization of saying what layouts
are required and which ones are optional.
From the making of the diagram, I also decided that the discussion isn't
limited to performance, since there are several reasons new physical
layouts may be proposed (or, at least, there are many aspects of
performance). Even if it's not "canonical alternative layouts," I think it
is important that there be some process for developers that use Arrow to
propose extensions to the columnar format without having to prove out the
benefits for libraries that use a different tech stack (e.g. rust vs C++ vs
go).
[1]:
https://docs.google.com/presentation/d/1EiBgwtoYW6ADTxFc9iRs8KLPV0st0GZqmGy40Uz8jPk/edit?usp=sharing
# ------------------------------
# Aldrin
https://github.com/drin/
https://gitlab.com/octalene
https://keybase.io/octalene
------- Original Message -------
On Thursday, July 13th, 2023 at 10:49, Dane Pitkin
<d...@voltrondata.com.INVALID> wrote:
I am in favor of this proposal. IMO the Arrow project is the right place
to standardize both the interoperability and operability of columnar data
layouts. Data engines are a core component of the Arrow ecosystem and the
project should be able to grow with these data engines as they converge
on new layouts. Since columnar data is ubiquitous in analytical workloads,
we are seeing a natural progression into optimizing those workloads. This
includes new lossless compression schemes for columnar data that allow
engines to operate directly on the compressed data (e.g. RLE). If we
can't reliably support the growing needs of the broader data engine
ecosystem in a timely manner, then I also fear Arrow might lose relevancy
over time.
On Thu, Jul 13, 2023 at 11:59 AM Ian Cook <ianmc...@apache.org> wrote:
Thank you Weston for proposing this solution and Neal for describing
its context and implications. I agree with the other replies here—this
seems like an elegant solution to a growing need that could, if left
unaddressed, increase the fragmentation of the ecosystem and reduce
the centrality of the Arrow format.
Greater diversity of layouts is happening. Whether it happens inside
of Arrow or outside of Arrow is up to us. I think we all would like to
see it happen inside of Arrow. This proposal allows for that, while
striking a balance as Raphael describes.
However I think there is still some ambiguity about exactly how an
Arrow implementation that is consuming/producing data would negotiate
with an Arrow implementation or other component that is
producing/consuming data to determine whether an alternative layout is
supported. This was discussed briefly in [5] but I am interested to
see how this negotiation would be implemented in practice in the C
data interface, IPC, Flight, etc.
Ian
[5] https://lists.apache.org/thread/7x2714wookjqgkoykxpq9jtpyrgx2bx2
On Thu, Jul 13, 2023 at 11:00 AM Raphael Taylor-Davies
r.taylordav...@googlemail.com.invalid wrote:
I like this proposal, I think it strikes a pragmatic balance between
preserving interoperability whilst still allowing new ideas to be
incorporated into the standard. Thank you for writing this up.
On 13/07/2023 10:22, Matt Topol wrote:
I don't have much to add but I do want to second Jacob's comments. I
agree that this is a good way to avoid the fragmentation while keeping
Arrow relevant, and likely something we need to do so that we can ensure
Arrow remains the way to do this data integration and interoperability.
On Wed, Jul 12, 2023 at 9:52 PM Jacob Wujciak-Jens
ja...@voltrondata.com.invalid wrote:
Hello Everyone,
Thanks for this comprehensive but concise write-up, Neal! I think this
proposal is a good way to avoid both fragmentation of the Arrow
ecosystem and its obsolescence. In my opinion, of these two problems
obsolescence is the bigger issue, since (as mentioned in the proposal)
Arrow is already (close to) being relegated to the sidelines in
ecosystem-defining projects.
Jacob
On Thu, Jul 13, 2023 at 12:03 AM Neal Richardson <neal.p.richard...@gmail.com> wrote:
Hi all,
As was previously raised in [1] and surfaced again in [2], there is a
proposal for representing alternative layouts. The intent, as I
understand it, is to be able to support memory layouts that some (but
perhaps not all) applications of Arrow find valuable, so that these
nearly-Arrow systems can be fully Arrow-native.
I wanted to start a more focused discussion on it because I think it's
worth being considered on its own merits, but I also think this gets to
the core of what the Arrow project is and should be, and I don't want us
to lose sight of that.
To restate the proposal from [1]:
* There are one or more primary layouts
  * Existing layouts are automatically considered primary layouts, even
    if they wouldn't have been primary layouts initially (e.g. large
    list)
* A new layout, if it is semantically equivalent to another, is
  considered an alternative layout
* An alternative layout still has the same requirements for adoption
  (two implementations and a vote)
  * An implementation should not feel pressured to rush and implement
    the new layout. It would be good if they contribute in the
    discussion and consider the layout and vote if they feel it would
    be an acceptable design.
* We can define and vote and approve as many canonical alternative
  layouts as we want:
  * A canonical alternative layout should, at a minimum, have some
    reasonable justification, such as improved performance for
    algorithm X
* Arrow implementations MUST support the primary layouts
* An Arrow implementation MAY support a canonical alternative, however:
  * An Arrow implementation MUST first support the primary layout
  * An Arrow implementation MUST support conversion to/from the primary
    and canonical layout
  * An Arrow implementation's APIs MUST only provide data in the
    alternative layout if it is explicitly asked for (e.g. schema
    inference should prefer the primary layout).
* We can still vote for new primary layouts (e.g. promoting a canonical
  alternative) but, in these votes we don't only consider the value
  (e.g. performance) of the layout but also the interoperability. In
  other words, a layout can only become a primary layout if there is
  significant evidence that most implementations plan to adopt it.
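The two MUST requirements on an implementation that supports an alternative (conversion to/from the primary layout, and primary output unless explicitly asked) can be sketched as follows. This is a toy Python illustration under stated assumptions: the "run-length" layout is only a stand-in for some alternative and says nothing about the status of any real Arrow layout, and `to_alternative`, `to_primary` and `read_column` are hypothetical names.

```python
from typing import List, Tuple

Runs = List[Tuple[object, int]]  # (value, run length) pairs


def to_alternative(values: list) -> Runs:
    """Primary -> alternative: collapse consecutive repeats into runs."""
    runs: Runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs


def to_primary(runs: Runs) -> list:
    """Alternative -> primary: expand every run back into plain values."""
    return [v for v, n in runs for _ in range(n)]


def read_column(stored: Runs, layout: str = "primary"):
    """Return primary data unless the caller explicitly asks otherwise."""
    if layout == "primary":
        return to_primary(stored)
    if layout == "run-length":
        return stored
    raise ValueError(f"unknown layout: {layout!r}")


values = ["a", "a", "b", "a"]
runs = to_alternative(values)
```

Here `read_column(runs)` returns the expanded values, and the stored runs are only handed out when `layout="run-length"` is passed explicitly.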
To summarize some of the arguments against the proposal from the
previous threads, there are concerns about increasing the complexity of
the Arrow specification and the cost/burden of updating all of the Arrow
implementations to support them.
Where these discussions, both about several proposed new types and this
layout proposal, get to the core of Arrow is well expressed in the
comments on the previous thread by Raphael [3] and Pedro [4]. Raphael
asks: "what matters to people more, interoperability or best-in-class
performance?" And Pedro notes that the overhead of converting these
not-yet-Arrow types to the Arrow C ABI is high enough that they've
considered abandoning Arrow as their interchange format. So: on the one
hand, we're kinda choosing which quality we're optimizing for, but on
the other, interoperability and performance are dependent on each other.
What I see us trying to do here is find a way to expand the Arrow
specification just enough so that Arrow becomes or remains the in-memory
standard everywhere, but not so much that it creates too much complexity
or burden to implement. Expand too much and you get a fragmented
ecosystem where everyone is writing subsets of the Arrow standard and so
nothing is fully compatible and the whole premise is undermined. But
expand too little and projects will abandon the standard and we've also
failed.
I don't have a tidy answer, but I wanted to acknowledge the bigger
issues, and see if this helps us reason about the various proposals on
the table. I wonder if the alternative layout proposal is the happy
medium that adds some complexity to the specification, but less than
there would be if three new types were added, and still meets the needs
of projects like DuckDB, Velox, and Gluten and gets them fully Arrow
native.
Neal