RecordBatchFileWriter and ParquetWriter create different kinds of files (Arrow IPC vs Parquet, respectively). Otherwise there is generally not much difference in performance or memory use between writing RecordBatch objects and writing Tables.
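For concreteness, here is a minimal sketch of both approaches (the schema, chunk generator, and output file names below are made up for illustration). Either way each chunk is written as it is produced, so memory use stays bounded by the chunk size:

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical schema and chunk generator standing in for the decoded records
schema = pa.schema([('ts', pa.int64()), ('price', pa.float64())])

def decoded_chunks():
    for start in range(0, 10_000, 1_000):
        ts = pa.array(list(range(start, start + 1_000)), type=pa.int64())
        price = pa.array([float(i) for i in range(start, start + 1_000)], type=pa.float64())
        yield pa.RecordBatch.from_arrays([ts, price], names=['ts', 'price'])

# Arrow IPC file: write each RecordBatch as-is
ipc_writer = pa.RecordBatchFileWriter('updates.arrow', schema)
for batch in decoded_chunks():
    ipc_writer.write_batch(batch)
ipc_writer.close()

# Parquet file: wrap each batch in a single-batch Table and write it
pq_writer = pq.ParquetWriter('updates.parquet', schema, compression='brotli')
for batch in decoded_chunks():
    pq_writer.write_table(pa.Table.from_batches([batch]))
pq_writer.close()

Note that ParquetWriter keeps the file open across write_table() calls, which gives you the incremental "append" that a single pyarrow.parquet.write_table() call does not.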
On Sat, Apr 25, 2020 at 5:33 AM Hei Chan <structurech...@yahoo.com.invalid> wrote:
>
> I managed to iterate through small chunks of data, and for each chunk, I
> convert it into a pandas.DataFrame, convert that into a Table, and then
> write it to a parquet file.
> Is there any advantage to writing RecordBatch (through
> pyarrow.RecordBatchFileWriter.write_batch()) instead of
> ParquetWriter.write_table()?
> Thanks.
>
> On Saturday, April 25, 2020, 10:15:34 AM GMT+8, Hei Chan
> <structurech...@yahoo.com.invalid> wrote:
>
> Hi Wes,
> Thanks for your pointers.
> It seems like, to skip pandas as an intermediary, I can only construct a
> pyarrow.RecordBatch from pyarrow.Array or pyarrow.StructArray:
> https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html
> And StructArray.from_pandas()'s description states, "Convert pandas.Series
> to an Arrow Array".
> So you are suggesting to build a Python list of StructArray directly in
> batches, then call pyarrow.RecordBatch.from_arrays(), then call
> pyarrow.Table.from_batches() after I convert all records from my binary
> file into RecordBatches, and then pyarrow.parquet.write_table()? It seems
> holding all RecordBatches will not fit into my memory. But pyarrow.parquet
> doesn't seem to allow "appending" Tables.
> Is there an easy way to construct a StructArray without constructing a
> pandas.Series()?
>
> On Friday, April 24, 2020, 10:41:49 PM GMT+8, Wes McKinney
> <wesmck...@gmail.com> wrote:
>
> I recommend going directly via Arrow instead of routing through pandas (or
> at least only using pandas as an intermediary to convert smaller chunks to
> Arrow). Tables can be composed from smaller RecordBatch objects (see
> Table.from_batches), so you don't need to accumulate much non-Arrow data in
> memory. You can also zero-copy concat tables with concat_tables.
>
> On Fri, Apr 24, 2020, 9:31 AM Hei Chan <structurech...@yahoo.com.invalid>
> wrote:
>
> > Hi,
> > I am new to Arrow and Parquet.
> > My goal is to decode a 4GB binary file (packed C structs) and write all
> > records to a file that can be used by R dataframes and pandas dataframes,
> > so others can do some heavy analysis on the big dataset efficiently (in
> > terms of loading time and running statistical analysis).
> > I first tried to do something like this in Python:
> >
> > # for each record after I decode
> > updates.append(result)  # updates = deque()
> > # then after reading in all records
> > pd_updates = pd.DataFrame(updates)  # I think I got out of memory here;
> > # the OOM handler kicked in and killed my process
> >
> > pd_updates['my_cat_col'] = pd_updates['my_cat_col'].astype('category', copy=False)
> > table = pa.Table.from_pandas(pd_updates, preserve_index=False)
> > pq.write_table(table, 'my.parquet', compression='brotli')
> >
> > What's the recommended way to deal with big dataset conversion? And later
> > loading from R and Python (pandas)?
> > Thanks in advance!
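Regarding the StructArray question quoted above: pa.array() and StructArray.from_arrays() accept plain Python lists (or numpy arrays) directly, so no pandas.Series is needed. A small sketch, with made-up field names and values:

import pyarrow as pa

# Columns built straight from Python lists; no pandas involved
ts = pa.array([1587772800000, 1587772800001], type=pa.int64())
side = pa.array(['B', 'S'], type=pa.string())

# A struct column assembled from child arrays
quote = pa.StructArray.from_arrays(
    [pa.array([100.25, 100.26], type=pa.float64()),
     pa.array([10, 20], type=pa.int32())],
    names=['px', 'qty'])

batch = pa.RecordBatch.from_arrays([ts, side, quote], names=['ts', 'side', 'quote'])
print(batch.schema)

Each such RecordBatch can then be wrapped with Table.from_batches() and written incrementally, as in the ParquetWriter sketch earlier, so nothing beyond one chunk needs to be held in memory.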