Hello Jacob,

while not optimal, you could try to use https://docs.python.org/3/library/io.html#io.BufferedReader together with a much larger buffer_size than the default. This might not be the best way possible, as we then have to cross the Python/C++ boundary more often, but it should improve on the current situation.
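Something like the following should work. This is only a minimal sketch, assuming the data was written in the Arrow IPC stream format; pyarrow.ipc.open_stream is the current spelling (older releases expose it as pyarrow.open_stream), and the file name and the 16 MiB buffer size are placeholders to tune for your setup.

    import io
    import pyarrow as pa

    # Open the file unbuffered, then wrap it in a BufferedReader with a large
    # buffer so pyarrow's many small reads are served from the Python-side
    # buffer instead of each one hitting the high-latency filesystem.
    raw = open("results.arrow", "rb", buffering=0)  # hypothetical path
    buffered = io.BufferedReader(raw, buffer_size=16 * 1024 * 1024)

    # pyarrow accepts any readable Python file object here.
    reader = pa.ipc.open_stream(buffered)
    table = reader.read_all()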
Uwe

On Tue, Aug 28, 2018, at 2:42 AM, Wes McKinney wrote:
> hi Jacob,
>
> We have https://issues.apache.org/jira/browse/ARROW-549 about
> concatenating arrays. Someone needs to write the code and tests, and
> then we can easily add an API to "consolidate" table columns.
>
> If you have small record batches, could you read the entire file into
> memory before parsing it with pyarrow.open_file/open_stream? That might
> improve IO performance by reducing seeks. We don't support any
> buffering in open_stream yet, so I'm going to open a JIRA about that:
>
> https://issues.apache.org/jira/browse/ARROW-3126
>
> Building a big development platform like this is a lot of work, but we
> are making progress!
>
> - Wes
>
> On Mon, Aug 27, 2018 at 8:22 PM, Jacob Quinn Shenker
> <jqshen...@g.harvard.edu> wrote:
> > Hi all,
> >
> > Question: If I have a set of small (10-1000 rows) RecordBatches on
> > disk or in memory, how can I (efficiently) concatenate/rechunk them
> > into larger RecordBatches (so that each column is output as a
> > contiguous array when written to a new Arrow buffer)?
> >
> > Context: With such small RecordBatches, I'm finding that reading Arrow
> > into a pandas table is very slow (~100x slower than local disk) from
> > my cluster's Lustre distributed file system (plenty of bandwidth, but
> > each IO op has very high latency); I'm assuming this has to do with
> > needing many seek() calls for each RecordBatch. I'm hoping it'll help
> > if I rechunk my data into larger RecordBatches before writing to disk.
> > (The input RecordBatches are small because they are the individual
> > results returned by millions of tasks on a dask cluster, as part of a
> > streaming analysis pipeline.)
> >
> > While I'm here I also wanted to thank everyone on this list for all
> > their work on Arrow! I'm a PhD student in biology at Harvard Medical
> > School. We take images of about 1 billion individual bacteria every
> > day with our microscopes, generating ~1PB/yr in raw data. We're
> > using this data to search for new kinds of antibiotic drugs. Using way
> > more data allows us to precisely measure how the bacteria's growth is
> > affected by the drug candidates, which lets us find new drugs that
> > previous screens have missed. That's why I'm really excited about
> > Arrow: it's making dealing with these data volumes a lot easier for us!
> >
> > ~ J
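For reference, a rough sketch combining both suggestions above: read the whole file into memory in one shot, then consolidate the many small RecordBatches into contiguous columns before writing them back out. This assumes a current pyarrow release where the pyarrow.ipc module and Table.combine_chunks are available (they postdate this thread; ARROW-549 tracks the underlying array concatenation), and the file names are hypothetical.

    import pyarrow as pa

    # Read the entire file into an in-memory buffer with a single sequential
    # read, so the per-batch reads below never touch the filesystem.
    with open("small_batches.arrow", "rb") as f:
        buf = pa.py_buffer(f.read())

    reader = pa.ipc.open_file(buf)
    table = reader.read_all()  # one chunk per original RecordBatch

    # Collapse the many small chunks of each column into one contiguous
    # array, then write the result back out as a single large batch.
    table = table.combine_chunks()

    with pa.OSFile("rechunked.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)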