Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

Andrew Lamb Mon, 21 Aug 2023 15:09:42 -0700

I am convinced that the benefits of the ArrayViews as discussed on this
thread, despite the inconvenience of two similar formats (ListArray and
ArrayView) and equivalent formats, are enough to add it to the Arrow spec.


It is my opinion we should add ArrowView to the Arrow format, and really
push forward the notion of composable data management. I think it is a
compelling vision and one Arrow can very much help to create.

Thank you for your well worded summary Pedro and for everyone else who has
shared their opinion on this issue.

Andrew

On Thu, Aug 17, 2023 at 11:58 AM Pedro Eugenio Rocha Pedreira
<[email protected]> wrote:

> Hi all,
>
> Getting back to this thread as I realize there were a few unanswered
> questions. Adding a bit more context on the rationale and usage of
> ArrayViews in Velox, and the importance to standardize it:
>
> re: Why do we need it?
>
> We use ArrayViews for two main reasons. First, for efficient execution of
> conditionals (IF/SWITCHs) that generate list types (array/maps). In a
> vectorized engine, there are two ways to execute conditionals. One will
> always first generate the condition bitmap that decides which branch each
> row will take, otherwise you would pay the dispatch cost for each row. Once
> the bitmap is generated, without ArrayView, one would write each branch to
> its own buffer output (one containing the "then" rows, and one with the
> "else" rows), then at the end conduct an additional step to merge these two
> into a third output buffer, taking the condition bitmap into account. This
> is analogous to a "scatter" operation.
>
> With ArrayViews one can write them to their final location with a single
> pass over each branch, removing the need for the final step. This is always
> faster as it removes the need for a last step. As raised before, the actual
> perf improvement will depend on the data types, their average sizes, cost
> of evaluating conditions, branches, buffer sizes and many other factors.
> One can play with these factors and tweak a microbenchmark to make it more
> of more or less evident (as convenient), but the fact is that it will
> always be faster. We have found extensive empirical data in real workloads
> to support these claims. We have found that it is common to find workloads
> with large nested conditional statements rearranging maps for feature
> engineering workloads. Often, they happen right after table scans, so the
> additional copies and cache misses can really add up.
>
> The second reason is convenience for slicing array/maps. One can think of
> array_slice(), array_trim(), and similar functions (and their map
> counterparts). With ArrayViews as output they can be implemented zero-copy
> by only adjusting the offsets and lengths, without ever touching (or even
> having to decode) the base elements. This is also faster on every case.
>
> For these reasons, I would argue that the ArrayView format is the
> state-of-the-art representation for any modern vectorized query engine that
> really cares about performance (like Velox, DuckDB, and similar). If
> there's no Arrow counterpart for it, they (we) will continue diverging from
> Arrow, and we might end up missing the mark of standardization.
>
> re: Where do we need it?
>
> An interesting point raised is that this could be seen as an
> implementation detail inside these engines, and that we might not need it
> at the engine boundary. While this is a fair way of looking at it, we've
> found two use cases where this presents a hurdle:
>
> First, in the Gluten project that enables Velox to be integrated within
> Spark, for example, the communication between Velox and Spark java is done
> via JNI. In this case, the very data coming from a table scan + projection,
> which in many cases will be represented using an ArrayView in Velox, needs
> to be efficiently shipped to the Java-land. Without ArrayView support, this
> communication will not be done via Arrow as we would rather specialize than
> incur in an extra copy. The same for other bridges between systems, like a
> DuckDB<->Velox present in Velox, which for the same reason (in addition to
> historical lack of StringView) is not built using Arrow today.
>
> The second, as mentioned by Will, is UDF support. For these use cases, it
> would be very convenient to support UDF packages in different languages
> written based on the Arrow format. However, that means again that data
> coming from a Velox plan could be using the ArrayView format. Once more,
> requiring an extra copy on that step will end up pushing developers to
> build a more specialized API instead.
>
> In general, I recognize the need for stability and simplicity in the
> format, but wonder if inflexibility given the evolving state-of-the-art
> could lead to further divergence and obsolescence. Weston's proposal of
> canonical alternative layouts seems like a fine way to avoid this pitfall,
> IMHO.
>
> re: Which libraries need it?
>
> One last point to be made is that though it is useful to have ArrayView
> support in the Arrow C++ library, most of the systems I described do not
> use it and rather have their own implementation. The most important piece,
> in our view, is the unified and agreed upon memory layout, and a way to
> push this through the stable C ABI.
>
> re: why not dictionaries?
>
> The suggested dictionary over variable-sized arrays approach can be used
> out-of-the-box and indeed gives you some flexibility, but there are two
> shortcomings: (a) like Felipe pointed out, it incurs in an extra
> indirection (cache miss) in order to fetch the array, (b) it still doesn't
> provide the same flexibility to slice the array/map (you can reorder
> arrays, but not slice them).
>
> I hope this adds a bit more context on the usage and relevance of this
> format in modern vectorized engines. For further reading, our VLDB paper
> has a section explaining in more detail the reasoning behind our arrow
> divergence [0] in Section 4.2.1. As a proud supporter of composability in
> data management [1], I really hope we can find ways to converge our
> standards and build more composable data stacks.
>
> [0] - https://vldb.org/pvldb/vol15/p3372-pedreira.pdf
> [1] - https://dl.acm.org/doi/10.14778/3603581.3603604
>
> Best,
> --
> Pedro Pedreira
>

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

Reply via email to