Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

Felipe Oliveira Carvalho Mon, 21 Aug 2023 16:50:02 -0700

I marked the C++ implementation PR ready for review today and will soon be
working on the Go implementation.


https://github.com/apache/arrow/pull/35345

Note that differently from Velox's ArrayVector, the Arrow implementation
(ListView) also features a 64-bit version (LargeListView) to be
symmetrical with the existing List and LargeList types.

--
Felipe

On Tue, Aug 22, 2023 at 12:09 AM Andrew Lamb <al...@influxdata.com> wrote:

> I am convinced that the benefits of the ArrayViews as discussed on this
> thread, despite the inconvenience of two similar formats (ListArray and
> ArrayView) and equivalent formats, are enough to add it to the Arrow spec.
>
> It is my opinion we should add ArrowView to the Arrow format, and really
> push forward the notion of composable data management. I think it is a
> compelling vision and one Arrow can very much help to create.
>
> Thank you for your well worded summary Pedro and for everyone else who has
> shared their opinion on this issue.
>
> Andrew
>
> On Thu, Aug 17, 2023 at 11:58 AM Pedro Eugenio Rocha Pedreira
> <pedro...@meta.com.invalid> wrote:
>
> > Hi all,
> >
> > Getting back to this thread as I realize there were a few unanswered
> > questions. Adding a bit more context on the rationale and usage of
> > ArrayViews in Velox, and the importance to standardize it:
> >
> > re: Why do we need it?
> >
> > We use ArrayViews for two main reasons. First, for efficient execution of
> > conditionals (IF/SWITCHs) that generate list types (array/maps). In a
> > vectorized engine, there are two ways to execute conditionals. One will
> > always first generate the condition bitmap that decides which branch each
> > row will take, otherwise you would pay the dispatch cost for each row.
> Once
> > the bitmap is generated, without ArrayView, one would write each branch
> to
> > its own buffer output (one containing the "then" rows, and one with the
> > "else" rows), then at the end conduct an additional step to merge these
> two
> > into a third output buffer, taking the condition bitmap into account.
> This
> > is analogous to a "scatter" operation.
> >
> > With ArrayViews one can write them to their final location with a single
> > pass over each branch, removing the need for the final step. This is
> always
> > faster as it removes the need for a last step. As raised before, the
> actual
> > perf improvement will depend on the data types, their average sizes, cost
> > of evaluating conditions, branches, buffer sizes and many other factors.
> > One can play with these factors and tweak a microbenchmark to make it
> more
> > of more or less evident (as convenient), but the fact is that it will
> > always be faster. We have found extensive empirical data in real
> workloads
> > to support these claims. We have found that it is common to find
> workloads
> > with large nested conditional statements rearranging maps for feature
> > engineering workloads. Often, they happen right after table scans, so the
> > additional copies and cache misses can really add up.
> >
> > The second reason is convenience for slicing array/maps. One can think of
> > array_slice(), array_trim(), and similar functions (and their map
> > counterparts). With ArrayViews as output they can be implemented
> zero-copy
> > by only adjusting the offsets and lengths, without ever touching (or even
> > having to decode) the base elements. This is also faster on every case.
> >
> > For these reasons, I would argue that the ArrayView format is the
> > state-of-the-art representation for any modern vectorized query engine
> that
> > really cares about performance (like Velox, DuckDB, and similar). If
> > there's no Arrow counterpart for it, they (we) will continue diverging
> from
> > Arrow, and we might end up missing the mark of standardization.
> >
> > re: Where do we need it?
> >
> > An interesting point raised is that this could be seen as an
> > implementation detail inside these engines, and that we might not need it
> > at the engine boundary. While this is a fair way of looking at it, we've
> > found two use cases where this presents a hurdle:
> >
> > First, in the Gluten project that enables Velox to be integrated within
> > Spark, for example, the communication between Velox and Spark java is
> done
> > via JNI. In this case, the very data coming from a table scan +
> projection,
> > which in many cases will be represented using an ArrayView in Velox,
> needs
> > to be efficiently shipped to the Java-land. Without ArrayView support,
> this
> > communication will not be done via Arrow as we would rather specialize
> than
> > incur in an extra copy. The same for other bridges between systems, like
> a
> > DuckDB<->Velox present in Velox, which for the same reason (in addition
> to
> > historical lack of StringView) is not built using Arrow today.
> >
> > The second, as mentioned by Will, is UDF support. For these use cases, it
> > would be very convenient to support UDF packages in different languages
> > written based on the Arrow format. However, that means again that data
> > coming from a Velox plan could be using the ArrayView format. Once more,
> > requiring an extra copy on that step will end up pushing developers to
> > build a more specialized API instead.
> >
> > In general, I recognize the need for stability and simplicity in the
> > format, but wonder if inflexibility given the evolving state-of-the-art
> > could lead to further divergence and obsolescence. Weston's proposal of
> > canonical alternative layouts seems like a fine way to avoid this
> pitfall,
> > IMHO.
> >
> > re: Which libraries need it?
> >
> > One last point to be made is that though it is useful to have ArrayView
> > support in the Arrow C++ library, most of the systems I described do not
> > use it and rather have their own implementation. The most important
> piece,
> > in our view, is the unified and agreed upon memory layout, and a way to
> > push this through the stable C ABI.
> >
> > re: why not dictionaries?
> >
> > The suggested dictionary over variable-sized arrays approach can be used
> > out-of-the-box and indeed gives you some flexibility, but there are two
> > shortcomings: (a) like Felipe pointed out, it incurs in an extra
> > indirection (cache miss) in order to fetch the array, (b) it still
> doesn't
> > provide the same flexibility to slice the array/map (you can reorder
> > arrays, but not slice them).
> >
> > I hope this adds a bit more context on the usage and relevance of this
> > format in modern vectorized engines. For further reading, our VLDB paper
> > has a section explaining in more detail the reasoning behind our arrow
> > divergence [0] in Section 4.2.1. As a proud supporter of composability in
> > data management [1], I really hope we can find ways to converge our
> > standards and build more composable data stacks.
> >
> > [0] - https://vldb.org/pvldb/vol15/p3372-pedreira.pdf
> > [1] - https://dl.acm.org/doi/10.14778/3603581.3603604
> >
> > Best,
> > --
> > Pedro Pedreira
> >
>

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

Reply via email to