Hi Jacob,

We have https://issues.apache.org/jira/browse/ARROW-549 about
concatenating arrays. Someone needs to write the code and tests, and
then we can easily add an API to "consolidate" table columns.
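
In the meantime, a pandas round trip is one way to force each column
into a single contiguous array. This is only a rough sketch (the
helper name is mine, not an official API), and it assumes all of the
batches share the same schema:

    import pyarrow as pa

    def rechunk_batches(batches):
        # Zero-copy: the Table just references the original small chunks.
        table = pa.Table.from_batches(batches)
        # Round-tripping through pandas materializes each column as a
        # single contiguous array.
        contiguous = pa.Table.from_pandas(table.to_pandas(),
                                          preserve_index=False)
        # With contiguous columns this yields a single RecordBatch.
        return contiguous.to_batches()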

If you have small record batches, could you read the entire file into
memory before parsing it with pyarrow.open_file/open_stream? That might
improve IO performance by reducing seeks. We don't support any
buffering in open_stream yet, so I'm going to open a JIRA about that:

https://issues.apache.org/jira/browse/ARROW-3126
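
For the buffering workaround, something along these lines should work
(the file name is just an example, and the exact entry point depends
on your pyarrow version: pyarrow.open_file / pyarrow.open_stream in
older releases, pyarrow.ipc.open_file / open_stream in newer ones):

    import pyarrow as pa

    # One big sequential read instead of many small seeks against
    # the distributed filesystem.
    with open("batches.arrow", "rb") as f:
        buf = pa.py_buffer(f.read())

    # Parse the IPC file format from the in-memory buffer.
    reader = pa.ipc.open_file(pa.BufferReader(buf))
    table = reader.read_all()  # pyarrow.Table of all record batches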

Building a big development platform like this is a lot of work, but we
are making progress!

- Wes

On Mon, Aug 27, 2018 at 8:22 PM, Jacob Quinn Shenker
<[email protected]> wrote:
> Hi all,
>
> Question: If I have a set of small (10-1000 rows) RecordBatches on
> disk or in memory, how can I (efficiently) concatenate/rechunk them
> into larger RecordBatches (so that each column is output as a
> contiguous array when written to a new Arrow buffer)?
>
> Context: With such small RecordBatches, I'm finding that reading Arrow
> into a pandas table is very slow (~100x slower than local disk) from
> my cluster's Lustre distributed file system (plenty of bandwidth but
> each IO op has very high latency); I'm assuming this has to do with
> needing many seek() calls for each RecordBatch. I'm hoping it'll help
> if I rechunk my data into larger RecordBatches before writing to disk.
> (The input RecordBatches are small because they are the individual
> results returned by millions of tasks on a dask cluster, as part of a
> streaming analysis pipeline.)
>
> While I'm here I also wanted to thank everyone on this list for all
> their work on Arrow! I'm a PhD student in biology at Harvard Medical
> School. We take images of about 1 billion individual bacteria every
> day with our microscopes, generating about ~1PB/yr in raw data. We're
> using this data to search for new kinds of antibiotic drugs. Using way
> more data allows us to precisely measure how the bacteria's growth is
> affected by the drug candidates, which allows us to find new drugs
> that previous screens have missed—and that's why I'm really excited
> about Arrow, it's making dealing with these data volumes a lot easier
> for us!
>
> ~ J
