Hi Wes,

Thanks for your pointers. It seems that to skip pandas as an intermediary, I can only construct a pyarrow.RecordBatch from pyarrow.Array or pyarrow.StructArray: https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html

And StructArray.from_pandas()'s description states, "Convert pandas.Series to an Arrow Array". So are you suggesting that I build Python lists of Arrow arrays (StructArray) directly for each batch, call pyarrow.RecordBatch.from_arrays(), then call pyarrow.Table.from_batches() after I have converted all records from my binary file into RecordBatches, and finally call pyarrow.parquet.write_table()?

It seems that holding all the RecordBatches will not fit into my memory, but pyarrow.parquet doesn't seem to allow "appending" Tables. Is there an easy way to construct a StructArray without constructing a pandas.Series()?
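To make sure I understand, here is a minimal, untested sketch of what I think you are suggesting: build the column arrays directly with pyarrow (no pandas), wrap each decoded chunk in a RecordBatch, and stream the batches into a single Parquet file with pyarrow.parquet.ParquetWriter so the whole dataset never sits in memory at once. The schema, field names, and the read_binary_file_in_chunks() decoder below are made up for illustration.

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical schema for my decoded records
schema = pa.schema([
    ("timestamp", pa.int64()),
    ("symbol", pa.string()),
    ("price", pa.float64()),
])

def chunk_to_batch(records):
    # records: a list of decoded tuples, e.g. (timestamp, symbol, price)
    timestamps = pa.array([r[0] for r in records], type=pa.int64())
    symbols = pa.array([r[1] for r in records], type=pa.string())
    prices = pa.array([r[2] for r in records], type=pa.float64())
    return pa.RecordBatch.from_arrays(
        [timestamps, symbols, prices],
        names=["timestamp", "symbol", "price"],
    )

with pq.ParquetWriter("my.parquet", schema, compression="brotli") as writer:
    # read_binary_file_in_chunks() is a placeholder for my own decoder
    for records in read_binary_file_in_chunks("my.bin"):
        batch = chunk_to_batch(records)
        writer.write_table(pa.Table.from_batches([batch]))

Is something like this what you had in mind, or is there a more direct "append" API that I am missing?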
On Friday, April 24, 2020, 10:41:49 PM GMT+8, Wes McKinney <wesmck...@gmail.com> wrote:

I recommend going directly via Arrow instead of routing through pandas (or at least only using pandas as an intermediary to convert smaller chunks to Arrow). Tables can be composed from smaller RecordBatch objects (see Table.from_batches), so you don't need to accumulate much non-Arrow data in memory. You can also zero-copy concat tables with concat_tables.

On Fri, Apr 24, 2020, 9:31 AM Hei Chan <structurech...@yahoo.com.invalid> wrote:
> Hi,
> I am new to Arrow and Parquet.
> My goal is to decode a 4GB binary file (packed C struct) and write all
> records to a file that can be used by R and pandas dataframes, so others
> can do some heavy analysis on the big dataset efficiently (in terms of
> loading time and running statistical analysis).
> I first tried to do something like this in Python:
>
> # for each record after I decode
> updates.append(result)  # updates = deque()
>
> # then after reading in all records
> pd_updates = pd.DataFrame(updates)
> # I think I got out of memory here; the OOM handler kicked in and killed
> # my process
>
> pd_book_updates['my_cat_col'].astype('category', copy=False)
> table = pa.Table.from_pandas(pd_updates, preserve_index=False)
> pq.write_table(table, 'my.parquet', compression='brotli')
>
> What's the recommended way to deal with big dataset conversion? And later
> loading from R and Python (pandas)?
> Thanks in advance!
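P.S. For my own notes, a tiny sketch of the Table composition you describe, assuming `batches` is a list of pyarrow.RecordBatch objects that all share the same schema (the split point is arbitrary):

import pyarrow as pa

table_a = pa.Table.from_batches(batches[:100])    # compose a Table from RecordBatches
table_b = pa.Table.from_batches(batches[100:])
combined = pa.concat_tables([table_a, table_b])   # zero-copy concatenation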