Hi all, Getting back to this thread as I realize there were a few unanswered questions. Adding a bit more context on the rationale and usage of ArrayViews in Velox, and the importance to standardize it:
re: Why do we need it? We use ArrayViews for two main reasons. First, for efficient execution of conditionals (IF/SWITCHs) that generate list types (array/maps). In a vectorized engine, there are two ways to execute conditionals. One will always first generate the condition bitmap that decides which branch each row will take, otherwise you would pay the dispatch cost for each row. Once the bitmap is generated, without ArrayView, one would write each branch to its own buffer output (one containing the "then" rows, and one with the "else" rows), then at the end conduct an additional step to merge these two into a third output buffer, taking the condition bitmap into account. This is analogous to a "scatter" operation. With ArrayViews one can write them to their final location with a single pass over each branch, removing the need for the final step. This is always faster as it removes the need for a last step. As raised before, the actual perf improvement will depend on the data types, their average sizes, cost of evaluating conditions, branches, buffer sizes and many other factors. One can play with these factors and tweak a microbenchmark to make it more of more or less evident (as convenient), but the fact is that it will always be faster. We have found extensive empirical data in real workloads to support these claims. We have found that it is common to find workloads with large nested conditional statements rearranging maps for feature engineering workloads. Often, they happen right after table scans, so the additional copies and cache misses can really add up. The second reason is convenience for slicing array/maps. One can think of array_slice(), array_trim(), and similar functions (and their map counterparts). With ArrayViews as output they can be implemented zero-copy by only adjusting the offsets and lengths, without ever touching (or even having to decode) the base elements. This is also faster on every case. For these reasons, I would argue that the ArrayView format is the state-of-the-art representation for any modern vectorized query engine that really cares about performance (like Velox, DuckDB, and similar). If there's no Arrow counterpart for it, they (we) will continue diverging from Arrow, and we might end up missing the mark of standardization. re: Where do we need it? An interesting point raised is that this could be seen as an implementation detail inside these engines, and that we might not need it at the engine boundary. While this is a fair way of looking at it, we've found two use cases where this presents a hurdle: First, in the Gluten project that enables Velox to be integrated within Spark, for example, the communication between Velox and Spark java is done via JNI. In this case, the very data coming from a table scan + projection, which in many cases will be represented using an ArrayView in Velox, needs to be efficiently shipped to the Java-land. Without ArrayView support, this communication will not be done via Arrow as we would rather specialize than incur in an extra copy. The same for other bridges between systems, like a DuckDB<->Velox present in Velox, which for the same reason (in addition to historical lack of StringView) is not built using Arrow today. The second, as mentioned by Will, is UDF support. For these use cases, it would be very convenient to support UDF packages in different languages written based on the Arrow format. However, that means again that data coming from a Velox plan could be using the ArrayView format. Once more, requiring an extra copy on that step will end up pushing developers to build a more specialized API instead. In general, I recognize the need for stability and simplicity in the format, but wonder if inflexibility given the evolving state-of-the-art could lead to further divergence and obsolescence. Weston's proposal of canonical alternative layouts seems like a fine way to avoid this pitfall, IMHO. re: Which libraries need it? One last point to be made is that though it is useful to have ArrayView support in the Arrow C++ library, most of the systems I described do not use it and rather have their own implementation. The most important piece, in our view, is the unified and agreed upon memory layout, and a way to push this through the stable C ABI. re: why not dictionaries? The suggested dictionary over variable-sized arrays approach can be used out-of-the-box and indeed gives you some flexibility, but there are two shortcomings: (a) like Felipe pointed out, it incurs in an extra indirection (cache miss) in order to fetch the array, (b) it still doesn't provide the same flexibility to slice the array/map (you can reorder arrays, but not slice them). I hope this adds a bit more context on the usage and relevance of this format in modern vectorized engines. For further reading, our VLDB paper has a section explaining in more detail the reasoning behind our arrow divergence [0] in Section 4.2.1. As a proud supporter of composability in data management [1], I really hope we can find ways to converge our standards and build more composable data stacks. [0] - https://vldb.org/pvldb/vol15/p3372-pedreira.pdf [1] - https://dl.acm.org/doi/10.14778/3603581.3603604 Best, -- Pedro Pedreira