I recommend going directly via Arrow instead of routing through pandas (or at least only using pandas as an intermediary to convert smaller chunks to Arrow). Tables can be composed from smaller RecordBatch objects (see Table.from_batches), so you don't need to accumulate much non-Arrow data in memory. You can also concatenate tables without copying using concat_tables.
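For example, something along these lines (a rough sketch, assuming each decoded chunk is a dict of Python lists; decode_chunks, the column names, and the types are placeholders for whatever your struct layout actually is):

import pyarrow as pa
import pyarrow.parquet as pq

batches = []
for chunk in decode_chunks('my_4gb_file.bin'):  # hypothetical generator yielding dicts of column lists
    arrays = [
        pa.array(chunk['ts'], type=pa.int64()),
        pa.array(chunk['price'], type=pa.float64()),
        # dictionary encoding plays the role of pandas' 'category' dtype
        pa.array(chunk['my_cat_col']).dictionary_encode(),
    ]
    batches.append(pa.RecordBatch.from_arrays(arrays, ['ts', 'price', 'my_cat_col']))

table = pa.Table.from_batches(batches)  # combines batches without copying the underlying data
pq.write_table(table, 'my.parquet', compression='brotli')

If you do go through pandas for small chunks, you can call pa.Table.from_pandas on each chunk and then pa.concat_tables on the resulting list of tables at the end instead.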
On Fri, Apr 24, 2020, 9:31 AM Hei Chan <structurech...@yahoo.com.invalid> wrote:
> Hi,
> I am new to Arrow and Parquet.
> My goal is to decode a 4GB binary file (packed C struct) and write all records to a file that can be used by an R dataframe and a pandas dataframe, so others can do some heavy analysis on the big dataset efficiently (in terms of loading time and running statistical analysis).
> I first tried to do something like this in Python:
>
> # for each record after I decode
> updates.append(result)  # updates = deque()
> # then after reading in all records
> pd_updates = pd.DataFrame(updates)  # I think I got out of memory here; the OOM handler kicked in and killed my process
>
> pd_book_updates['my_cat_col'].astype('category', copy=False)
> table = pa.Table.from_pandas(pd_updates, preserve_index=False)
> pq.write_table(table, 'my.parquet', compression='brotli')
>
> What's the recommended way to deal with big dataset conversion? And later loading from R and Python (pandas)?
> Thanks in advance!