> > I think having a chunked array with multiple vector buffers would be
> > ideal, similar to C++. It might take a fair amount of work to add this
> > but would open up a lot more functionality.
There are potentially two different use-cases. ChunkedArray is logical/lazy
concatenation, whereas concat physically rebuilds the vectors into a single
vector.

On Fri, Nov 8, 2019 at 10:51 AM Bryan Cutler <cutl...@gmail.com> wrote:

> I think having a chunked array with multiple vector buffers would be
> ideal, similar to C++. It might take a fair amount of work to add this but
> would open up a lot more functionality. As for the API,
> VectorSchemaRoot.concat(Collection<VectorSchemaRoot>) seems good to me.
>
> On Thu, Nov 7, 2019 at 12:09 AM Fan Liya <liya.fa...@gmail.com> wrote:
>
>> Hi Micah,
>>
>> Thanks for bringing this up.
>>
>> > 1. An efficient solution already exists? It seems like TransferPair
>> > implementations could possibly be improved upon or have they already
>> > been optimized?
>>
>> Fundamentally, memory copy is unavoidable, IMO, because the source and
>> target memory regions are likely to be non-contiguous.
>> An alternative is to make ArrowBuf support a number of non-contiguous
>> memory regions. However, that would harm the performance of ArrowBuf, and
>> ArrowBuf is the core of the Arrow library.
>>
>> > 2. What the preferred API for doing this would be? Some options I can
>> > think of:
>> >
>> > * VectorSchemaRoot.concat(Collection<VectorSchemaRoot>)
>> > * VectorSchemaRoot.from(Collection<ArrowRecordBatch>)
>> > * VectorLoader.load(Collection<ArrowRecordBatch>)
>>
>> IMO, option 1 is required, as we have scenarios that need to concatenate
>> vectors/VectorSchemaRoots (e.g. restoring the complete dictionary from
>> delta dictionaries).
>> Options 2 and 3 are optional for us.
>>
>> Best,
>> Liya Fan
>>
>> On Thu, Nov 7, 2019 at 3:44 PM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>
>> > Hi,
>> > A colleague opened up https://issues.apache.org/jira/browse/ARROW-7048
>> > for having similar functionality to the Python APIs that allow for
>> > creating one larger data structure from a series of record batches.
>> > I just wanted to surface it here in case:
>> > 1. An efficient solution already exists? It seems like TransferPair
>> > implementations could possibly be improved upon or have they already
>> > been optimized?
>> > 2. What the preferred API for doing this would be? Some options I can
>> > think of:
>> >
>> > * VectorSchemaRoot.concat(Collection<VectorSchemaRoot>)
>> > * VectorSchemaRoot.from(Collection<ArrowRecordBatch>)
>> > * VectorLoader.load(Collection<ArrowRecordBatch>)
>> >
>> > Thanks,
>> > Micah
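[Editorial note appended for illustration.] The distinction drawn at the top of the thread — logical/lazy concatenation (C++/Python ChunkedArray) versus physically rebuilding into one contiguous vector (the proposed Java concat) — can be sketched in plain Java. This is not Arrow code: ChunkedIntArray and concat below are hypothetical names standing in for the two approaches, using int[] chunks in place of Arrow vectors. The logical view copies nothing and routes each read to the owning chunk; the physical version does the unavoidable memory copy Liya Fan describes, since the source regions are non-contiguous.

```java
import java.util.Arrays;
import java.util.List;

public class ConcatDemo {

    // Logical ("ChunkedArray"-style) concatenation: no copying; each read
    // is routed to the chunk that owns the requested index.
    static final class ChunkedIntArray {
        private final List<int[]> chunks;
        private final int[] chunkStarts; // running offsets of each chunk
        private final int totalLength;

        ChunkedIntArray(List<int[]> chunks) {
            this.chunks = chunks;
            this.chunkStarts = new int[chunks.size()];
            int offset = 0;
            for (int i = 0; i < chunks.size(); i++) {
                chunkStarts[i] = offset;
                offset += chunks.get(i).length;
            }
            this.totalLength = offset;
        }

        int length() { return totalLength; }

        int get(int index) {
            // Binary search for the chunk containing index.
            int lo = 0, hi = chunks.size() - 1;
            while (lo < hi) {
                int mid = (lo + hi + 1) >>> 1;
                if (chunkStarts[mid] <= index) lo = mid; else hi = mid - 1;
            }
            return chunks.get(lo)[index - chunkStarts[lo]];
        }
    }

    // Physical concatenation: rebuilds a single contiguous array. The
    // memory copy is unavoidable because the sources are non-contiguous.
    static int[] concat(List<int[]> chunks) {
        int total = chunks.stream().mapToInt(c -> c.length).sum();
        int[] out = new int[total];
        int offset = 0;
        for (int[] chunk : chunks) {
            System.arraycopy(chunk, 0, out, offset, chunk.length);
            offset += chunk.length;
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> chunks = List.of(new int[]{1, 2, 3}, new int[]{4, 5});

        ChunkedIntArray logical = new ChunkedIntArray(chunks);
        int[] physical = concat(chunks);

        System.out.println(logical.get(3));            // 4
        System.out.println(Arrays.toString(physical)); // [1, 2, 3, 4, 5]
    }
}
```

The trade-off mirrors the thread: the logical view is O(1) to build but pays a per-element indirection (and defeats code that assumes one contiguous buffer), while the physical rebuild pays the copy once and yields a single vector, as VectorSchemaRoot.concat(Collection<VectorSchemaRoot>) would.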