Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

Pedro Eugenio Rocha Pedreira Thu, 17 Aug 2023 08:57:13 -0700

Hi all,

Getting back to this thread as I realize there were a few unanswered questions. 
Adding a bit more context on the rationale and usage of ArrayViews in Velox, 
and the importance to standardize it:


re: Why do we need it?

We use ArrayViews for two main reasons. First, for efficient execution of 
conditionals (IF/SWITCHs) that generate list types (array/maps). In a 
vectorized engine, there are two ways to execute conditionals. One will always 
first generate the condition bitmap that decides which branch each row will 
take, otherwise you would pay the dispatch cost for each row. Once the bitmap 
is generated, without ArrayView, one would write each branch to its own buffer 
output (one containing the "then" rows, and one with the "else" rows), then at 
the end conduct an additional step to merge these two into a third output 
buffer, taking the condition bitmap into account. This is analogous to a 
"scatter" operation.

With ArrayViews one can write them to their final location with a single pass 
over each branch, removing the need for the final step. This is always faster 
as it removes the need for a last step. As raised before, the actual perf 
improvement will depend on the data types, their average sizes, cost of 
evaluating conditions, branches, buffer sizes and many other factors. One can 
play with these factors and tweak a microbenchmark to make it more of more or 
less evident (as convenient), but the fact is that it will always be faster. We 
have found extensive empirical data in real workloads to support these claims. 
We have found that it is common to find workloads with large nested conditional 
statements rearranging maps for feature engineering workloads. Often, they 
happen right after table scans, so the additional copies and cache misses can 
really add up.

The second reason is convenience for slicing array/maps. One can think of 
array_slice(), array_trim(), and similar functions (and their map 
counterparts). With ArrayViews as output they can be implemented zero-copy by 
only adjusting the offsets and lengths, without ever touching (or even having 
to decode) the base elements. This is also faster on every case.

For these reasons, I would argue that the ArrayView format is the 
state-of-the-art representation for any modern vectorized query engine that 
really cares about performance (like Velox, DuckDB, and similar). If there's no 
Arrow counterpart for it, they (we) will continue diverging from Arrow, and we 
might end up missing the mark of standardization.

re: Where do we need it?

An interesting point raised is that this could be seen as an implementation 
detail inside these engines, and that we might not need it at the engine 
boundary. While this is a fair way of looking at it, we've found two use cases 
where this presents a hurdle:

First, in the Gluten project that enables Velox to be integrated within Spark, 
for example, the communication between Velox and Spark java is done via JNI. In 
this case, the very data coming from a table scan + projection, which in many 
cases will be represented using an ArrayView in Velox, needs to be efficiently 
shipped to the Java-land. Without ArrayView support, this communication will 
not be done via Arrow as we would rather specialize than incur in an extra 
copy. The same for other bridges between systems, like a DuckDB<->Velox present 
in Velox, which for the same reason (in addition to historical lack of 
StringView) is not built using Arrow today.

The second, as mentioned by Will, is UDF support. For these use cases, it would 
be very convenient to support UDF packages in different languages written based 
on the Arrow format. However, that means again that data coming from a Velox 
plan could be using the ArrayView format. Once more, requiring an extra copy on 
that step will end up pushing developers to build a more specialized API 
instead.

In general, I recognize the need for stability and simplicity in the format, 
but wonder if inflexibility given the evolving state-of-the-art could lead to 
further divergence and obsolescence. Weston's proposal of canonical alternative 
layouts seems like a fine way to avoid this pitfall, IMHO.

re: Which libraries need it?

One last point to be made is that though it is useful to have ArrayView support 
in the Arrow C++ library, most of the systems I described do not use it and 
rather have their own implementation. The most important piece, in our view, is 
the unified and agreed upon memory layout, and a way to push this through the 
stable C ABI.

re: why not dictionaries?

The suggested dictionary over variable-sized arrays approach can be used 
out-of-the-box and indeed gives you some flexibility, but there are two 
shortcomings: (a) like Felipe pointed out, it incurs in an extra indirection 
(cache miss) in order to fetch the array, (b) it still doesn't provide the same 
flexibility to slice the array/map (you can reorder arrays, but not slice them).

I hope this adds a bit more context on the usage and relevance of this format 
in modern vectorized engines. For further reading, our VLDB paper has a section 
explaining in more detail the reasoning behind our arrow divergence [0] in 
Section 4.2.1. As a proud supporter of composability in data management [1], I 
really hope we can find ways to converge our standards and build more 
composable data stacks.

[0] - https://vldb.org/pvldb/vol15/p3372-pedreira.pdf
[1] - https://dl.acm.org/doi/10.14778/3603581.3603604

Best,
--
Pedro Pedreira

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

Reply via email to