I recommend going directly via Arrow instead of routing through pandas (or at least only using pandas as an intermediary to convert smaller chunks to Arrow). Tables can be composed from smaller RecordBatch objects (see Table.from_batches), so you don't need to accumulate much non-Arrow data in memory. You can also concatenate tables without copying using concat_tables.
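For example, something along these lines (a rough sketch, assuming each decoded chunk is a dict of Python lists; decode_chunks, the column names, and the types are placeholders for whatever your struct layout actually is):

import pyarrow as pa
import pyarrow.parquet as pq

batches = []
for chunk in decode_chunks('my_4gb_file.bin'):  # hypothetical generator yielding dicts of column lists
    arrays = [
        pa.array(chunk['ts'], type=pa.int64()),
        pa.array(chunk['price'], type=pa.float64()),
        # dictionary encoding plays the role of pandas' 'category' dtype
        pa.array(chunk['my_cat_col']).dictionary_encode(),
    ]
    batches.append(pa.RecordBatch.from_arrays(arrays, ['ts', 'price', 'my_cat_col']))

table = pa.Table.from_batches(batches)  # combines batches without copying the underlying data
pq.write_table(table, 'my.parquet', compression='brotli')

If you do go through pandas for small chunks, you can call pa.Table.from_pandas on each chunk and then pa.concat_tables on the resulting list of tables at the end instead.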
On Fri, Apr 24, 2020, 9:31 AM Hei Chan <structurech...@yahoo.com.invalid> wrote:
> Hi,
> I am new to Arrow and Parquet.
> My goal is to decode a 4GB binary file (packed C struct) and write all records to a file that can be used by an R dataframe and a pandas dataframe, so others can do some heavy analysis on the big dataset efficiently (in terms of loading time and running statistical analysis).
> I first tried to do something like this in Python:
>
> # for each record after I decode
> updates.append(result)  # updates = deque()
> # then after reading in all records
> pd_updates = pd.DataFrame(updates)  # I think I got out of memory here; the OOM handler kicked in and killed my process
>
> pd_book_updates['my_cat_col'].astype('category', copy=False)
> table = pa.Table.from_pandas(pd_updates, preserve_index=False)
> pq.write_table(table, 'my.parquet', compression='brotli')
>
> What's the recommended way to deal with big dataset conversion? And later loading from R and Python (pandas)?
> Thanks in advance!