Hi Jacques,

Thanks much for the note.

I wonder: when reading data into, or out of, Arrow, aren't the interfaces often row-wise? For example, it is somewhat difficult to read a CSV file column-wise. Similarly, when serving a BI tool (for tables or charts), data must be presented row-wise. (JDBC, for example, is a row-wise interface.) The abstractions help with these cases.
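For illustration, here is a rough sketch of what row-wise loading looks like against the stock Arrow Java API, with no writer abstraction in between. (This is untested, illustrative code; the class and method names are mine.) The client must carry a single row index across every vector and set the value counts by hand:

    import java.nio.charset.StandardCharsets;
    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.IntVector;
    import org.apache.arrow.vector.VarCharVector;

    public class RowWiseLoad {

      // Load parsed CSV rows (id, name) into one batch, row by row.
      // A real loader would also handle nulls, nested types, and
      // batch-size limits -- the bookkeeping the RowSet writers hide.
      static void loadBatch(int[] ids, String[] names,
                            BufferAllocator allocator) {
        try (IntVector idVec = new IntVector("id", allocator);
             VarCharVector nameVec = new VarCharVector("name", allocator)) {
          idVec.allocateNew(ids.length);
          nameVec.allocateNew();
          for (int row = 0; row < ids.length; row++) {
            // One row index, applied to every vector in the batch.
            idVec.setSafe(row, ids[row]);
            nameVec.setSafe(row, names[row].getBytes(StandardCharsets.UTF_8));
          }
          idVec.setValueCount(ids.length);
          nameVec.setValueCount(ids.length);
          // ... hand the vectors downstream before the try block closes them ...
        }
      }

      public static void main(String[] args) {
        try (BufferAllocator allocator = new RootAllocator()) {
          loadBatch(new int[] {1, 2}, new String[] {"a", "b"}, allocator);
        }
      }
    }

A row-wise writer abstraction wraps exactly this loop, so the client never touches indexes or value counts.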
Perhaps much of the emphasis in Arrow is on cross-tool compatibility, in which data is passed column-wise as a set of vectors? The abstractions wouldn't be needed in that data-transfer case.

The batch-size component is an essential part of row-wise loading. When reading data into vectors, even from Parquet, we found it necessary to 1) control the overall amount of memory used by the batch, and 2) read the same number of rows for every column. The RowSet abstractions encapsulate this coordinated cross-column work.

The memory limits in the "RowSet" abstraction are not estimates. (There was a separate Drill project for size estimation, which is why it might be confusing.) Instead, the memory limits are based on knowing the current write offset into each vector. In Drill, when a vector becomes full, we automatically resize it by doubling its memory. The RowSet abstraction tracks when doubling a vector would exceed the "budget" set for that vector or for the batch. When the limit is reached, the abstraction marks the batch complete. (The "overflow" row is saved for later, both to stay within the limit and to keep the details of overflow hidden from the client.) The same logic can be applied, I would assume, to whatever memory-allocation technique Arrow uses, if Arrow has evolved beyond Drill's technique. (A rough sketch of this budget check appears at the end of this message, below the quoted thread.)

A size estimate (when available) helps by allowing the client code to pre-allocate vectors to their final size, which avoids growing vectors during the load. In that case, the abstractions simply pack data into the pre-allocated vectors until one of them becomes full.

The idea of separating memory management from reading/writing is sound. In fact, that's how the code is structured. The memory-unaware version is heavily used in unit tests, where we know how much memory each test uses. The memory-aware version is used in production to handle whatever strange data sets present themselves.

Of course, none of this was clear from my terse description. I'll go ahead and create a JIRA ticket to provide additional context and to gather detailed comments so we can figure out the best way to proceed.

Thanks,
- Paul

On Monday, August 27, 2018, 5:52:19 PM PDT, Jacques Nadeau <jacq...@apache.org> wrote:

This seems like it could be a useful addition. In general, our experience with writing Arrow structures is that the optimal path is columnar interaction rather than row-wise. That being said, most people start out by interacting with Arrow row-wise, and having an interface like this could be helpful in allowing people to start writing Arrow datasets with less effort and fewer mistakes.

In terms of record batch sizing/estimation, I think that should probably be decoupled from writing/reading vectors.

On Mon, Aug 27, 2018 at 7:00 AM Li Jin <ice.xell...@gmail.com> wrote:

> Hi Paul,
>
> Thank you for the email. I think this is interesting.
>
> Arrow (Java API) currently doesn't have the capability to automatically
> limit the memory size of record batches. In Spark we have a similar
> need to limit the size of record batches, and we have talked about
> implementing some kind of size estimator for them, but we haven't
> started work on it.
>
> I personally think it makes sense for Arrow to incorporate such
> capabilities.
>
> On Mon, Aug 27, 2018 at 1:33 AM Paul Rogers <par0...@yahoo.com.invalid>
> wrote:
>
> > Hi All,
> >
> > Over in the Apache Drill project, we developed some handy vector
> > reader/writer abstractions. I wonder if they might be of interest to
> > Apache Arrow.
> > Key contributions of the "RowSet" abstractions:
> >
> > * Control row batch size: the aggregate memory taken by a set of
> >   vectors (and all their sub-vectors for structured types).
> > * Control the maximum per-vector size.
> > * Simple, highly optimized read/write interface that handles vector
> >   offset accounting, even for deeply nested types.
> > * Minimize vector internal fragmentation (wasted space).
> >
> > More information is available in [1]. Arrow improved and simplified
> > Drill's original vector and metadata abstractions, so some work would
> > be required to port the RowSet code from Drill's versions of these
> > classes to the Arrow versions.
> >
> > Does Arrow already have a similar solution? If not, would the above
> > be useful for Arrow?
> >
> > Thanks,
> > - Paul
> >
> > Apache Drill PMC member
> > Co-author of the upcoming O'Reilly book "Learning Apache Drill"
> >
> > [1]
> > https://github.com/paul-rogers/drill/wiki/RowSet-Abstractions-for-Arrow
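P.S. To make the budget check concrete, here is a rough sketch phrased in Arrow's Java terms. This is purely my own illustration (the names are made up), and the doubling assumption mirrors Drill's allocator behavior, which may not match Arrow's:

    import java.util.List;
    import org.apache.arrow.vector.ValueVector;

    public class BatchBudget {

      // Would writing one more row push the batch past its byte budget?
      // Assumes Drill-style growth: when a vector's value capacity is
      // exhausted, the next setSafe() roughly doubles its buffers. The
      // estimate is crude for variable-width vectors, whose data buffer
      // grows with value sizes rather than row count.
      static boolean wouldExceedBudget(List<ValueVector> vectors,
                                       int nextRowIndex, long budgetBytes) {
        long projectedBytes = 0;
        for (ValueVector v : vectors) {
          long bytes = v.getBufferSize();
          if (nextRowIndex >= v.getValueCapacity()) {
            bytes *= 2; // this row would trigger a doubling realloc
          }
          projectedBytes += bytes;
        }
        return projectedBytes > budgetBytes;
      }
    }

A writer built on this check calls it before each row; when it returns true, the writer marks the batch complete and stashes the pending values as the "overflow" row that starts the next batch.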