Hey Paul, it looks like ScalarReader is simply a renamed version of FieldReader, which the Arrow Vector module already contains.
On Mon, Sep 3, 2018 at 4:07 PM Paul Rogers <par0...@yahoo.com.invalid> wrote:

> Filed a JIRA ticket: ARROW-3164.
>
> The original e-mail linked to a wiki that explains the Row Set abstraction
> in the Drill context. The ticket points to a new GitHub wiki that
> discusses the abstraction in the Arrow context, including examples. The
> wiki also explains the motivation: the challenges Drill faced when
> reading row-oriented data into vectors and reading that data out, and how
> those may apply in the Arrow context.
>
> Recent Arrow work has greatly improved the interoperability features of
> the project. Still, at some point, code must write data into vectors and
> read data out, and often that interface is row-oriented. If that code is
> in Java, the Row Set abstractions can help.
>
> A new top-level Java module is a great idea. There may be some dependency
> issues to resolve to leverage material in the "vector" module; we'll
> resolve those as we hit them.
>
> Here is a quick example on the read side. Jacques recently posted a code
> example to retrieve data from a vector of VARCHAR columns:
>
>     int recordIndexToRead = ...
>     ListVector lv = ...
>     ArrowBuf offsetVector = lv.getOffsetBuffer();
>     VarCharVector vc = lv.getDataVector();
>     int listStart = offsetVector.getInt(recordIndexToRead * 4);
>     int listEnd = offsetVector.getInt((recordIndexToRead + 1) * 4);
>     NullableVarCharHolder nvh = new NullableVarCharHolder();
>     for (int i = listStart; i < listEnd; i++) {
>         vc.get(i, nvh);
>         // do something with the data
>     }
>
> Here is how to iterate over a record batch, accessing a single VARCHAR
> column, using the Row Set abstractions.
> The e-mail mentioned a byte array, so let's use that here:
>
>     RowSet rowSet = ...            // create row set from record batch
>     RowSetReader reader = rowSet.reader();
>     ScalarReader vcReader = reader.scalar("colName"); // your VARCHAR column
>     while (reader.next()) {
>         byte[] data = vcReader.getBytes();
>         // do something with the data
>     }
>
> Data can also be retrieved as a Java String, if that is more convenient
> in this use case:
>
>     String data = vcReader.getString();
>
> In either case, if the value is a SQL NULL, the above methods (because
> they return Java objects) will return a Java null. (For primitive types,
> you can call the ScalarReader.isNull() method.)
>
> Thanks,
> - Paul
>
>
> On Thursday, August 30, 2018, 7:44:51 PM PDT, Jacques Nadeau <
> jacq...@apache.org> wrote:
>
> New Jira sounds good.
>
> Many times algorithms interact directly with vectors, but there are also
> many times when this is not the case. It would be great to see more
> detail about an example use. Maybe propose it as a new module so people
> can use it if they want, but don't have to consume it unless they need
> to?
>
> On Mon, Aug 27, 2018 at 6:28 PM Paul Rogers <par0...@yahoo.com.invalid>
> wrote:
>
> > Hi Jacques,
> >
> > Thanks much for the note. I wonder, when reading data into, or out of,
> > Arrow, are not the interfaces often row-wise? For example, it is
> > somewhat difficult to read a CSV file column-wise. Similarly, when
> > serving a BI tool (for tables or charts), data must be presented
> > row-wise. (JDBC, for example, is a row-wise interface.) The
> > abstractions help with these cases.
> >
> > Perhaps much of the emphasis in Arrow is on cross-tool compatibility,
> > in which data is passed column-wise as a set of vectors? The
> > abstractions wouldn't be needed in this data-transfer case.
> >
> > The batch size component is an essential part of row-wise loading.
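The reader pattern above can be modeled in plain Java. The sketch below is a minimal, self-contained illustration of the row-wise iteration and null-handling semantics Paul describes; the `RowSetReader` and `ScalarReader` classes here are hypothetical stand-ins, not the actual Drill or Arrow classes:

```java
import java.util.Arrays;
import java.util.List;

public class ReaderSketch {

    // Stand-in for ScalarReader: positional access to one column's values.
    // A null list entry models a SQL NULL.
    static class ScalarReader {
        private final List<String> values;
        private int pos = -1;

        ScalarReader(List<String> values) { this.values = values; }

        void setPosition(int pos) { this.pos = pos; }

        boolean isNull() { return values.get(pos) == null; }

        // Object getters return Java null for SQL NULL, per the thread above.
        String getString() { return values.get(pos); }

        byte[] getBytes() {
            String s = values.get(pos);
            return s == null ? null : s.getBytes();
        }
    }

    // Stand-in for RowSetReader: advances the column reader row by row,
    // so the caller never touches offsets directly.
    static class RowSetReader {
        private final ScalarReader column;
        private final int rowCount;
        private int row = -1;

        RowSetReader(ScalarReader column, int rowCount) {
            this.column = column;
            this.rowCount = rowCount;
        }

        ScalarReader scalar(String name) { return column; } // one-column demo

        boolean next() {
            if (++row >= rowCount) return false;
            column.setPosition(row);
            return true;
        }
    }

    public static void main(String[] args) {
        List<String> col = Arrays.asList("fred", null, "barney");
        RowSetReader reader = new RowSetReader(new ScalarReader(col), col.size());
        ScalarReader vcReader = reader.scalar("colName");
        while (reader.next()) {
            System.out.println(vcReader.isNull() ? "NULL" : vcReader.getString());
        }
    }
}
```

The point of the pattern is that `next()` moves every column reader in lockstep, so client code sees rows while the data stays columnar underneath.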
> > When reading data into vectors, even from Parquet, we found it
> > necessary to 1) control the overall amount of memory used by the
> > batch, and 2) read the same number of rows for every column. The
> > RowSet abstractions encapsulate this coordinated cross-column work.
> >
> > The memory limits in the RowSet abstraction are not estimates. (There
> > was a separate Drill project for that, which is why it might be
> > confusing.) Instead, the memory limits are based on knowing the
> > current write offset into each vector. In Drill, when a vector becomes
> > full, we automatically resize the vector by doubling its memory. The
> > RowSet abstraction tracks when doubling the vector would exceed the
> > "budget" set for that vector or batch. When the limit is reached, the
> > abstraction marks the batch complete. (The "overflow" row is saved for
> > later to avoid exceeding the limit, and to keep the details of
> > overflow hidden from the client.) The same logic can be applied, I
> > would assume, to whatever memory allocation technique is used in
> > Arrow, if Arrow has evolved beyond Drill's technique.
> >
> > A size estimate (when available) helps by allowing the client code to
> > pre-allocate vectors to their final size. Doing so avoids growing
> > vectors during data loads. In this case, the abstractions simply pack
> > data into those pre-allocated vectors until one of them becomes full.
> >
> > The idea of separating memory concerns from reading/writing is sound.
> > In fact, that's how the code is structured. The memory-unaware version
> > is heavily used in unit tests, where we know how much memory is used.
> > The memory-aware version is used in production to handle whatever
> > strange data sets present themselves.
> >
> > Of course, none of this was clear from my terse description. I'll go
> > ahead and create a JIRA ticket to provide additional context and to
> > gather detailed comments so we can figure out the best way to proceed.
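The doubling-versus-budget accounting described above can be made concrete with a small, self-contained sketch. The `BudgetedVector` class below is a hypothetical model of the bookkeeping, not actual Drill or Arrow code; real vectors allocate buffers, whereas this just tracks byte counts:

```java
public class BudgetSketch {

    // Models one vector's allocation: doubles when full, and marks the
    // batch complete when doubling would exceed the per-vector budget.
    static class BudgetedVector {
        long allocated;      // current allocation in bytes
        long used;           // bytes written so far
        final long budget;   // per-vector budget in bytes
        boolean batchFull;   // set when doubling would bust the budget

        BudgetedVector(long initialAlloc, long budget) {
            this.allocated = initialAlloc;
            this.budget = budget;
        }

        // Returns true if the value fits (growing the vector if needed);
        // false means the batch is full and this value becomes the
        // "overflow" row carried into the next batch.
        boolean write(long valueBytes) {
            while (used + valueBytes > allocated) {
                if (allocated * 2 > budget) { // doubling exceeds the budget
                    batchFull = true;
                    return false;
                }
                allocated *= 2;               // Drill-style doubling
            }
            used += valueBytes;
            return true;
        }
    }

    public static void main(String[] args) {
        // Start at 1 KB, budget 4 KB, write 100-byte values until full.
        BudgetedVector v = new BudgetedVector(1024, 4096);
        int rows = 0;
        while (v.write(100)) {
            rows++;
        }
        // 40 rows fit (4000 bytes); the 41st would force a double past 4 KB.
        System.out.println(rows + " rows fit; batch full = " + v.batchFull);
    }
}
```

Because the check happens before any allocation grows, the batch never exceeds its budget, which is the property Paul contrasts with estimate-based sizing.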
> > Thanks,
> > - Paul
> >
> > On Monday, August 27, 2018, 5:52:19 PM PDT, Jacques Nadeau <
> > jacq...@apache.org> wrote:
> >
> > This seems like it could be a useful addition. In general, our
> > experience with writing Arrow structures is that the optimal path is
> > columnar interaction rather than row-wise. That said, most people
> > start out by interacting with Arrow row-wise, and having an interface
> > like this could help people start writing Arrow datasets with less
> > effort and fewer mistakes.
> >
> > In terms of record batch sizing/estimation, I think that should
> > probably be uncoupled from writing/reading vectors.
> >
> > On Mon, Aug 27, 2018 at 7:00 AM Li Jin <ice.xell...@gmail.com> wrote:
> >
> > > Hi Paul,
> > >
> > > Thank you for the email. I think this is interesting.
> > >
> > > Arrow's Java API currently doesn't have the capability to
> > > automatically limit the memory size of record batches. In Spark we
> > > have similar needs to limit the size of record batches, and we have
> > > talked about implementing some kind of size estimator for record
> > > batches, but haven't started to work on it.
> > >
> > > I personally think it makes sense for Arrow to incorporate such
> > > capabilities.
> > >
> > > On Mon, Aug 27, 2018 at 1:33 AM Paul Rogers <par0...@yahoo.com.invalid>
> > > wrote:
> > >
> > > > Hi All,
> > > >
> > > > Over in the Apache Drill project, we developed some handy vector
> > > > reader/writer abstractions. I wonder if they might be of interest
> > > > to Apache Arrow. Key contributions of the "RowSet" abstractions:
> > > >
> > > > * Control the row batch size: the aggregate memory taken by a set
> > > >   of vectors (and all their sub-vectors for structured types).
> > > > * Control the maximum per-vector size.
> > > > * Simple, highly optimized read/write interface that handles
> > > >   vector offset accounting, even for deeply nested types.
> > > > * Minimize vector internal fragmentation (wasted space).
> > > >
> > > > More information is available in [1]. Arrow improved and
> > > > simplified Drill's original vector and metadata abstractions. As a
> > > > result, work would be required to port the RowSet code from
> > > > Drill's version of these classes to the Arrow versions.
> > > >
> > > > Does Arrow already have a similar solution? If not, would the
> > > > above be useful for Arrow?
> > > >
> > > > Thanks,
> > > > - Paul
> > > >
> > > > Apache Drill PMC member
> > > > Co-author of the upcoming O'Reilly book "Learning Apache Drill"
> > > >
> > > > [1]
> > > > https://github.com/paul-rogers/drill/wiki/RowSet-Abstractions-for-Arrow
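The write-side counterpart of the bullet list above, the "coordinated cross-column work", can be sketched as a row-wise writer that fills several columns in lockstep and stops at a row limit, so every column ends the batch with the same row count. The `RowSetWriter` class below is an illustrative model only, not the actual RowSet API:

```java
import java.util.ArrayList;
import java.util.List;

public class RowWriterSketch {

    // Hypothetical row-wise writer: one list per column, filled in lockstep.
    static class RowSetWriter {
        final List<List<Object>> columns = new ArrayList<>();
        final int maxRows;   // batch row limit
        int rowCount;

        RowSetWriter(int columnCount, int maxRows) {
            for (int i = 0; i < columnCount; i++) {
                columns.add(new ArrayList<>());
            }
            this.maxRows = maxRows;
        }

        // Writes one whole row across all columns, or rejects it when the
        // batch is at its limit; the rejected row is the "overflow" row
        // that starts the next batch.
        boolean writeRow(Object... values) {
            if (rowCount >= maxRows) return false;   // batch complete
            for (int i = 0; i < values.length; i++) {
                columns.get(i).add(values[i]);
            }
            rowCount++;
            return true;
        }
    }

    public static void main(String[] args) {
        // Two columns, batch limited to three rows; four input rows arrive.
        RowSetWriter writer = new RowSetWriter(2, 3);
        String[][] input = { {"a", "1"}, {"b", "2"}, {"c", "3"}, {"d", "4"} };
        for (String[] row : input) {
            if (!writer.writeRow(row[0], row[1])) {
                break;   // "d" becomes the overflow row for the next batch
            }
        }
        System.out.println(writer.rowCount + " rows written");
    }
}
```

Because the writer either accepts or rejects an entire row, no column can ever run ahead of another, which is the invariant the thread says Drill needed when loading row-oriented sources like CSV or Parquet readers.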