Hi Jacques,

Thanks much for the note.

I wonder: when reading data into, or out of, Arrow, aren't the interfaces often row-wise? For example, it is somewhat difficult to read a CSV file column-wise. Similarly, when serving a BI tool (for tables or charts), data must be presented row-wise. (JDBC, for example, is a row-wise interface.) The abstractions help with these cases.
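For illustration, here is a rough sketch of what row-wise loading looks like against the stock Arrow Java API, with no writer abstraction in between. (This is untested, illustrative code; the class and method names are mine.) The client must carry a single row index across every vector and set the value counts by hand:

    import java.nio.charset.StandardCharsets;
    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.IntVector;
    import org.apache.arrow.vector.VarCharVector;

    public class RowWiseLoad {

      // Load parsed CSV rows (id, name) into one batch, row by row.
      // A real loader would also handle nulls, nested types, and
      // batch-size limits -- the bookkeeping the RowSet writers hide.
      static void loadBatch(int[] ids, String[] names,
                            BufferAllocator allocator) {
        try (IntVector idVec = new IntVector("id", allocator);
             VarCharVector nameVec = new VarCharVector("name", allocator)) {
          idVec.allocateNew(ids.length);
          nameVec.allocateNew();
          for (int row = 0; row < ids.length; row++) {
            // One row index, applied to every vector in the batch.
            idVec.setSafe(row, ids[row]);
            nameVec.setSafe(row, names[row].getBytes(StandardCharsets.UTF_8));
          }
          idVec.setValueCount(ids.length);
          nameVec.setValueCount(ids.length);
          // ... hand the vectors downstream before the try block closes them ...
        }
      }

      public static void main(String[] args) {
        try (BufferAllocator allocator = new RootAllocator()) {
          loadBatch(new int[] {1, 2}, new String[] {"a", "b"}, allocator);
        }
      }
    }

A row-wise writer abstraction wraps exactly this loop, so the client never touches indexes or value counts.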
Perhaps much of the emphasis in Arrow is on cross-tool compatibility, in which data is passed column-wise as a set of vectors? The abstractions wouldn't be needed in that data-transfer case.

The batch-size component is an essential part of row-wise loading. When reading data into vectors, even from Parquet, we found it necessary to 1) control the overall amount of memory used by the batch, and 2) read the same number of rows for every column. The RowSet abstractions encapsulate this coordinated cross-column work.

The memory limits in the "RowSet" abstraction are not estimates. (There was a separate Drill project for size estimation, which is why it might be confusing.) Instead, the memory limits are based on knowing the current write offset into each vector. In Drill, when a vector becomes full, we automatically resize it by doubling its memory. The RowSet abstraction tracks when doubling a vector would exceed the "budget" set for that vector or for the batch. When the limit is reached, the abstraction marks the batch complete. (The "overflow" row is saved for later, both to stay within the limit and to keep the details of overflow hidden from the client.) The same logic can be applied, I would assume, to whatever memory-allocation technique Arrow uses, if Arrow has evolved beyond Drill's technique. (A rough sketch of this budget check appears at the end of this message, below the quoted thread.)

A size estimate (when available) helps by allowing the client code to pre-allocate vectors to their final size, which avoids growing vectors during the load. In that case, the abstractions simply pack data into the pre-allocated vectors until one of them becomes full.

The idea of separating memory management from reading/writing is sound. In fact, that's how the code is structured. The memory-unaware version is heavily used in unit tests, where we know how much memory each test uses. The memory-aware version is used in production to handle whatever strange data sets present themselves.

Of course, none of this was clear from my terse description. I'll go ahead and create a JIRA ticket to provide additional context and to gather detailed comments so we can figure out the best way to proceed.

Thanks,
- Paul

On Monday, August 27, 2018, 5:52:19 PM PDT, Jacques Nadeau <jacq...@apache.org> wrote:

This seems like it could be a useful addition. In general, our experience with writing Arrow structures is that the optimal path is columnar interaction rather than row-wise. That being said, most people start out by interacting with Arrow row-wise, and having an interface like this could be helpful in allowing people to start writing Arrow datasets with less effort and fewer mistakes.

In terms of record batch sizing/estimation, I think that should probably be decoupled from writing/reading vectors.

On Mon, Aug 27, 2018 at 7:00 AM Li Jin <ice.xell...@gmail.com> wrote:

> Hi Paul,
>
> Thank you for the email. I think this is interesting.
>
> Arrow (Java API) currently doesn't have the capability to automatically
> limit the memory size of record batches. In Spark we have a similar
> need to limit the size of record batches, and we have talked about
> implementing some kind of size estimator for them, but we haven't
> started work on it.
>
> I personally think it makes sense for Arrow to incorporate such
> capabilities.
>
> On Mon, Aug 27, 2018 at 1:33 AM Paul Rogers <par0...@yahoo.com.invalid>
> wrote:
>
> > Hi All,
> >
> > Over in the Apache Drill project, we developed some handy vector
> > reader/writer abstractions. I wonder if they might be of interest to
> > Apache Arrow.
> > Key contributions of the "RowSet" abstractions:
> >
> > * Control row batch size: the aggregate memory taken by a set of
> >   vectors (and all their sub-vectors for structured types).
> > * Control the maximum per-vector size.
> > * Simple, highly optimized read/write interface that handles vector
> >   offset accounting, even for deeply nested types.
> > * Minimize vector internal fragmentation (wasted space).
> >
> > More information is available in [1]. Arrow improved and simplified
> > Drill's original vector and metadata abstractions, so some work would
> > be required to port the RowSet code from Drill's versions of these
> > classes to the Arrow versions.
> >
> > Does Arrow already have a similar solution? If not, would the above
> > be useful for Arrow?
> >
> > Thanks,
> > - Paul
> >
> > Apache Drill PMC member
> > Co-author of the upcoming O'Reilly book "Learning Apache Drill"
> >
> > [1]
> > https://github.com/paul-rogers/drill/wiki/RowSet-Abstractions-for-Arrow
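P.S. To make the budget check concrete, here is a rough sketch phrased in Arrow's Java terms. This is purely my own illustration (the names are made up), and the doubling assumption mirrors Drill's allocator behavior, which may not match Arrow's:

    import java.util.List;
    import org.apache.arrow.vector.ValueVector;

    public class BatchBudget {

      // Would writing one more row push the batch past its byte budget?
      // Assumes Drill-style growth: when a vector's value capacity is
      // exhausted, the next setSafe() roughly doubles its buffers. The
      // estimate is crude for variable-width vectors, whose data buffer
      // grows with value sizes rather than row count.
      static boolean wouldExceedBudget(List<ValueVector> vectors,
                                       int nextRowIndex, long budgetBytes) {
        long projectedBytes = 0;
        for (ValueVector v : vectors) {
          long bytes = v.getBufferSize();
          if (nextRowIndex >= v.getValueCapacity()) {
            bytes *= 2; // this row would trigger a doubling realloc
          }
          projectedBytes += bytes;
        }
        return projectedBytes > budgetBytes;
      }
    }

A writer built on this check calls it before each row; when it returns true, the writer marks the batch complete and stashes the pending values as the "overflow" row that starts the next batch.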