Hi all,

Question: If I have a set of small (10-1000 rows) RecordBatches on disk or in memory, how can I (efficiently) concatenate/rechunk them into larger RecordBatches, so that each column is written out as one contiguous array in a new Arrow buffer?
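For concreteness, here is roughly what I'm imagining in pyarrow (file names are just placeholders; I don't know whether Table.from_batches + combine_chunks is the right or most efficient route, which is really what I'm asking):

import pyarrow as pa

# Read the small RecordBatches back from an Arrow IPC file
# (in the real pipeline they arrive as dask task results).
reader = pa.ipc.open_file("small_batches.arrow")
batches = [reader.get_batch(i) for i in range(reader.num_record_batches)]

# Stitch them into one Table, then consolidate each column into a
# single contiguous chunk.
table = pa.Table.from_batches(batches).combine_chunks()

# Write back out as a few large RecordBatches instead of many tiny ones.
with pa.ipc.new_file("rechunked.arrow", table.schema) as writer:
    for batch in table.to_batches(max_chunksize=1_000_000):
        writer.write_batch(batch)

Is there a cheaper way to do this that avoids materializing the whole Table in memory, or is this the intended approach?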
Context: With such small RecordBatches, I'm finding that reading Arrow into a pandas table is very slow (~100x slower than from local disk) on my cluster's Lustre distributed file system (plenty of bandwidth, but each I/O operation has very high latency). I'm assuming this is because each RecordBatch requires several seek() calls, so I'm hoping it will help to rechunk my data into larger RecordBatches before writing to disk. (The input RecordBatches are small because they are the individual results returned by millions of tasks on a dask cluster, as part of a streaming analysis pipeline.)

While I'm here, I also wanted to thank everyone on this list for all their work on Arrow! I'm a PhD student in biology at Harvard Medical School. We take images of about 1 billion individual bacteria every day with our microscopes, generating roughly 1 PB/yr of raw data, which we're using to search for new kinds of antibiotic drugs. Using far more data allows us to measure precisely how the bacteria's growth is affected by each drug candidate, which lets us find new drugs that previous screens have missed. That's why I'm really excited about Arrow: it's making these data volumes a lot easier for us to deal with!

~ J
