Columnar UDFs are still a work in progress. For now, all UDFs are row-based, and in fact, all processing is row-based. We are working on plumbing more columnar support into Spark (https://github.com/apache/spark/pull/24795), but it is going to be a little while before we are at the point where we could support doing what you want.
The APIs for ColumnVector were really designed for the purpose of providing an iterator that goes from columnar-formatted data to row-formatted data. When we get to the point that we can support columnar UDFs, some of the APIs in ColumnVector are likely to be expanded so you can do some of the things you are requesting.

I hope this helps,

Bobby

On Tue, Jun 25, 2019 at 4:24 PM Andrew Melo <andrew.m...@gmail.com> wrote:

> Hello,
>
> I've (nearly) implemented a DSV2 reader interface to read particle physics
> data stored in the ROOT (https://root.cern.ch/) file format. You can
> think of these ROOT files as roughly Parquet-like: column-wise and nested
> (i.e. a column can be of type "float[]", meaning each row in the column is
> a variable-length array of floats). The overwhelming majority of our
> columns are these variable-length arrays, since they represent physical
> quantities that vary widely with each particle collision.*
>
> Exposing these columns via the "SupportsScanColumnarBatch" interface has
> raised a question I have about the DSV2 API. I know the interface is
> currently Evolving, but I don't know if this is the appropriate place to
> ask about it (I presume JIRA is a good place as well, but I had trouble
> finding exactly where the best place to join is).
>
> There is no provision in the org.apache.spark.sql.vectorized.ColumnVector
> interface to return multiple rows of arrays (i.e. no "getArrays" analogue
> to "getArray"). A big use case we have is to pipe these data through UDFs,
> so it would be nice to be able to get the data from the file into a UDF
> batch without having to convert to an intermediate row-wise representation.
> Looking into ColumnarArray, however, it seems like instead of storing a
> single offset and length, it could be extended to arrays of "offsets" and
> "lengths".
> The public interface could remain the same by adding a second
> constructor which accepts arrays and keeping the existing constructor as a
> degenerate case of a length-1 array.
>
> * e.g. an "electron_momentum" column will have a different number of entries
> in each row, one for each electron that is produced in a collision.
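(For illustration, the ColumnarArray extension proposed in the quoted mail could be sketched roughly as below. This is a hypothetical standalone class, not Spark's actual API; the names `BatchedArrays`, `numRows`, and `getArray` are made up for the sketch. The idea is just one backing buffer plus per-row `offsets`/`lengths` arrays, with the existing single-(offset, length) constructor as the degenerate one-row case.)

```java
import java.util.Arrays;

// Hypothetical sketch of a ColumnarArray-like holder that covers a whole
// batch of variable-length rows, rather than a single (offset, length) pair.
public class BatchedArrays {
    private final float[] data;  // concatenated elements for all rows
    private final int[] offsets; // offsets[i]: start index of row i in data
    private final int[] lengths; // lengths[i]: element count of row i

    public BatchedArrays(float[] data, int[] offsets, int[] lengths) {
        this.data = data;
        this.offsets = offsets;
        this.lengths = lengths;
    }

    // Degenerate single-row case, mirroring the existing one-(offset, length)
    // constructor mentioned in the mail.
    public BatchedArrays(float[] data, int offset, int length) {
        this(data, new int[]{offset}, new int[]{length});
    }

    public int numRows() {
        return offsets.length;
    }

    // Per-row accessor: slice out the variable-length array for one row.
    public float[] getArray(int rowId) {
        return Arrays.copyOfRange(data, offsets[rowId],
                                  offsets[rowId] + lengths[rowId]);
    }

    public static void main(String[] args) {
        // Two collisions: the first produced 2 electrons, the second 3.
        float[] momenta = {1.0f, 2.0f, 10.0f, 20.0f, 30.0f};
        BatchedArrays col =
            new BatchedArrays(momenta, new int[]{0, 2}, new int[]{2, 3});
        System.out.println(col.numRows());                    // 2
        System.out.println(Arrays.toString(col.getArray(1))); // [10.0, 20.0, 30.0]
    }
}
```

A `getArrays`-style batch accessor could then hand a UDF the whole buffer plus the offset/length arrays directly, without materializing each row.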