Hi all,

Question: If I have a set of small (10-1000 rows) RecordBatches on
disk or in memory, how can I (efficiently) concatenate/rechunk them
into larger RecordBatches (so that each column is output as a
contiguous array when written to a new Arrow buffer)?
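
To make the question concrete, here's a rough sketch of what I have in
mind with pyarrow (small_batches is a stand-in name for my list of
RecordBatches, and I'm not at all sure this is the most efficient
approach):

    import pyarrow as pa

    # small_batches: a list of pyarrow.RecordBatch objects sharing a
    # schema (placeholder name for my per-task results)
    table = pa.Table.from_batches(small_batches)

    # Collapse each column's many small chunks into one contiguous
    # array
    combined = table.combine_chunks()

    # ...or re-split the combined columns into a handful of larger
    # RecordBatches with a target row count
    large_batches = combined.to_batches(max_chunksize=1_000_000)

My main worry is whether this ends up copying the data more times than
necessary.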

Context: With such small RecordBatches, I'm finding that reading Arrow
data into a pandas DataFrame from my cluster's Lustre distributed file
system is very slow (~100x slower than from local disk). Lustre has
plenty of bandwidth, but each I/O operation has very high latency, and
I'm assuming the slowdown comes from issuing many seek() calls per
RecordBatch. I'm hoping it will help if I rechunk the data into larger
RecordBatches before writing to disk. (The input RecordBatches are
small because they are the individual results returned by millions of
tasks on a Dask cluster, as part of a streaming analysis pipeline.)
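
Concretely, the rewrite step I'm picturing is something like the
sketch below (the file names and chunk size are made up, and I'm
assuming a recent pyarrow where the pa.ipc reader/writer helpers are
available):

    import pyarrow as pa

    # Read a file containing many small RecordBatches back into a
    # single Table
    reader = pa.ipc.open_file('results_small_batches.arrow')
    table = reader.read_all()

    # Make each column contiguous, then rewrite as a few large
    # RecordBatches
    table = table.combine_chunks()
    sink = pa.OSFile('results_rechunked.arrow', 'wb')
    writer = pa.ipc.new_file(sink, table.schema)
    writer.write_table(table, max_chunksize=1_000_000)
    writer.close()
    sink.close()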

While I'm here, I also wanted to thank everyone on this list for all
their work on Arrow! I'm a PhD student in biology at Harvard Medical
School. We image about 1 billion individual bacteria every day with
our microscopes, generating roughly 1 PB/yr of raw data, which we're
using to search for new kinds of antibiotic drugs. Working with this
much data lets us precisely measure how the bacteria's growth is
affected by each drug candidate, so we can find new drugs that
previous screens have missed. That's why I'm really excited about
Arrow: it's making these data volumes a lot easier for us to deal
with!

~ J
