Hello Jacob,

while not optimal, you could try to use https://docs.python.org/3/library/io.html#io.BufferedReader together with a much larger buffer_size than the default. This might not be the best way possible, as we then have to cross the Python/C++ boundary more often, but it should improve on the current situation.
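Something like the following should work. This is only a minimal sketch, assuming the data was written in the Arrow IPC stream format; pyarrow.ipc.open_stream is the current spelling (older releases expose it as pyarrow.open_stream), and the file name and the 16 MiB buffer size are placeholders to tune for your setup.

    import io
    import pyarrow as pa

    # Open the file unbuffered, then wrap it in a BufferedReader with a large
    # buffer so pyarrow's many small reads are served from the Python-side
    # buffer instead of each one hitting the high-latency filesystem.
    raw = open("results.arrow", "rb", buffering=0)  # hypothetical path
    buffered = io.BufferedReader(raw, buffer_size=16 * 1024 * 1024)

    # pyarrow accepts any readable Python file object here.
    reader = pa.ipc.open_stream(buffered)
    table = reader.read_all()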
Uwe

On Tue, Aug 28, 2018, at 2:42 AM, Wes McKinney wrote:
> hi Jacob,
>
> We have https://issues.apache.org/jira/browse/ARROW-549 about
> concatenating arrays. Someone needs to write the code and tests, and
> then we can easily add an API to "consolidate" table columns.
>
> If you have small record batches, could you read the entire file into
> memory before parsing it with pyarrow.open_file/open_stream? That might
> improve IO performance by reducing seeks. We don't support any
> buffering in open_stream yet, so I'm going to open a JIRA about that:
>
> https://issues.apache.org/jira/browse/ARROW-3126
>
> Building a big development platform like this is a lot of work, but we
> are making progress!
>
> - Wes
>
> On Mon, Aug 27, 2018 at 8:22 PM, Jacob Quinn Shenker
> <jqshen...@g.harvard.edu> wrote:
> > Hi all,
> >
> > Question: If I have a set of small (10-1000 rows) RecordBatches on
> > disk or in memory, how can I (efficiently) concatenate/rechunk them
> > into larger RecordBatches (so that each column is output as a
> > contiguous array when written to a new Arrow buffer)?
> >
> > Context: With such small RecordBatches, I'm finding that reading Arrow
> > into a pandas table is very slow (~100x slower than local disk) from
> > my cluster's Lustre distributed file system (plenty of bandwidth, but
> > each IO op has very high latency); I'm assuming this has to do with
> > needing many seek() calls for each RecordBatch. I'm hoping it'll help
> > if I rechunk my data into larger RecordBatches before writing to disk.
> > (The input RecordBatches are small because they are the individual
> > results returned by millions of tasks on a dask cluster, as part of a
> > streaming analysis pipeline.)
> >
> > While I'm here I also wanted to thank everyone on this list for all
> > their work on Arrow! I'm a PhD student in biology at Harvard Medical
> > School. We take images of about 1 billion individual bacteria every
> > day with our microscopes, generating ~1PB/yr in raw data. We're
> > using this data to search for new kinds of antibiotic drugs. Using way
> > more data allows us to precisely measure how the bacteria's growth is
> > affected by the drug candidates, which lets us find new drugs that
> > previous screens have missed. That's why I'm really excited about
> > Arrow: it's making dealing with these data volumes a lot easier for us!
> >
> > ~ J
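For reference, a rough sketch combining both suggestions above: read the whole file into memory in one shot, then consolidate the many small RecordBatches into contiguous columns before writing them back out. This assumes a current pyarrow release where the pyarrow.ipc module and Table.combine_chunks are available (they postdate this thread; ARROW-549 tracks the underlying array concatenation), and the file names are hypothetical.

    import pyarrow as pa

    # Read the entire file into an in-memory buffer with a single sequential
    # read, so the per-batch reads below never touch the filesystem.
    with open("small_batches.arrow", "rb") as f:
        buf = pa.py_buffer(f.read())

    reader = pa.ipc.open_file(buf)
    table = reader.read_all()  # one chunk per original RecordBatch

    # Collapse the many small chunks of each column into one contiguous
    # array, then write the result back out as a single large batch.
    table = table.combine_chunks()

    with pa.OSFile("rechunked.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)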