This seems like it could be a useful addition. In general, our experience
with writing Arrow structures is that the most efficient path is columnar
interaction rather than row-wise. That said, most people start out by
interacting with Arrow row-wise, and an interface like this could help
them start writing Arrow datasets with less effort and fewer mistakes.
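
To make the row-wise versus columnar point concrete, here is a minimal
sketch of a row-oriented facade that stores its values column by column.
It uses plain Python lists rather than Arrow vectors, and RowWriter and
write_row are hypothetical names for illustration only, not part of the
Arrow or Drill APIs.

```python
from typing import Any, Dict, List


class RowWriter:
    """Hypothetical row-wise facade that stores values column by column."""

    def __init__(self, column_names: List[str]) -> None:
        # One growable list per column stands in for an Arrow vector.
        self.columns: Dict[str, List[Any]] = {name: [] for name in column_names}
        self.row_count = 0

    def write_row(self, row: Dict[str, Any]) -> None:
        # Append exactly one value to every column so all columns stay the
        # same length; a field missing from the row is stored as None (null).
        for name, values in self.columns.items():
            values.append(row.get(name))
        self.row_count += 1


writer = RowWriter(["id", "name"])
writer.write_row({"id": 1, "name": "a"})
writer.write_row({"id": 2})  # "name" is absent, so it is written as null
```

The caller thinks in rows, while the storage stays columnar, which is
roughly the convenience an interface like RowSet would offer.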

In terms of record batch sizing and estimation, I think that should
probably be decoupled from writing and reading vectors.
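
As a rough illustration of what keeping estimation separate could look
like, the sketch below computes a size estimate from column contents
alone, without touching any writer. The names estimate_bytes and
batch_is_full and the 8-byte fixed width are assumptions made for this
example. It ignores validity bitmaps, offset buffers, and padding, so it
is not how Arrow accounts for memory.

```python
from typing import Any, List, Sequence


def estimate_bytes(column: Sequence[Any], fixed_width: int = 8) -> int:
    """Rough data size of one column; nulls are counted as zero bytes."""
    total = 0
    for value in column:
        if value is None:
            continue
        if isinstance(value, str):
            # Variable-width text: count the UTF-8 encoded length.
            total += len(value.encode("utf-8"))
        elif isinstance(value, (bytes, bytearray)):
            total += len(value)
        else:
            # Assume every other value occupies one fixed-width slot.
            total += fixed_width
    return total


def batch_is_full(columns: List[Sequence[Any]], limit_bytes: int) -> bool:
    """True once the summed column estimates reach the byte limit."""
    return sum(estimate_bytes(c) for c in columns) >= limit_bytes
```

A writer, or whatever drives it, could call batch_is_full after each row
and start a new batch when it returns True, without the writer itself
knowing anything about sizing policy.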



On Mon, Aug 27, 2018 at 7:00 AM Li Jin <ice.xell...@gmail.com> wrote:

> Hi Paul,
>
> Thank you for the email. I think this is interesting.
>
> Arrow (Java API) currently doesn't have the capability of automatically
> limiting the memory size of record batches. In Spark we have similar needs
> to limit the size of record batches and have talked about implementing some
> kind of size estimator for record batches but haven't started to work on
> it.
>
> I personally think it makes sense for Arrow to incorporate such
> capabilities.
>
>
>
> On Mon, Aug 27, 2018 at 1:33 AM Paul Rogers <par0...@yahoo.com.invalid>
> wrote:
>
> > Hi All,
> >
> > Over in the Apache Drill project, we developed some handy vector
> > reader/writer abstractions. I wonder if they might be of interest to
> > Apache Arrow. Key contributions of the "RowSet" abstractions:
> >
> > * Control row batch size: the aggregate memory taken by a set of vectors
> > (and all their sub-vectors for structured types.)
> > * Control the maximum per-vector size.
> > * Simple, highly optimized read/write interface that handles vector
> > offset accounting, even for deeply nested types.
> > * Minimize vector internal fragmentation (wasted space.)
> >
> > More information is available in [1]. Arrow improved and simplified
> > Drill's original vector and metadata abstractions. As a result, work
> > would be required to port the RowSet code from Drill's version of these
> > classes to the Arrow versions.
> >
> > Does Arrow already have a similar solution? If not, would the above be
> > useful for Arrow?
> >
> > Thanks,
> > - Paul
> >
> >
> > Apache Drill PMC member
> > Co-author of the upcoming O'Reilly book "Learning Apache Drill"
> > [1]
> > https://github.com/paul-rogers/drill/wiki/RowSet-Abstractions-for-Arrow
> >
> >
> >
>
