Hi Wes,
Thanks for your pointers.
It seems that, to skip pandas as an intermediary, I can only construct a 
pyarrow.RecordBatch from pyarrow.Array or pyarrow.StructArray: 
https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html
And the documentation for StructArray.from_pandas() states, "Convert 
pandas.Series to an Arrow Array".
So are you suggesting that I build a Python list of StructArrays directly, batch 
by batch, call pyarrow.RecordBatch.from_arrays() on each batch, then call 
pyarrow.Table.from_batches() once I have converted all records from my binary 
file into RecordBatches, and finally call pyarrow.parquet.write_table()?  It 
seems that holding all the RecordBatches at once will not fit into my memory.  
But pyarrow.parquet doesn't seem to allow me to "append" Tables.
Is there an easy way to construct a StructArray without constructing 
pandas.Series objects?
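For example, I was hoping something along these lines would work (just a guess 
based on the StructArray.from_arrays() docs; the column names are made up):

    import pyarrow as pa

    ids = pa.array([1, 2, 3], type=pa.int64())
    prices = pa.array([10.5, 11.0, 9.75], type=pa.float64())
    # build a StructArray directly from child arrays, no pandas involved
    structs = pa.StructArray.from_arrays([ids, prices], names=['id', 'price'])

If from_arrays() accepts plain pyarrow.Arrays like this, that would avoid pandas 
entirely.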

On Friday, April 24, 2020, 10:41:49 PM GMT+8, Wes McKinney <wesmck...@gmail.com> wrote:
I recommend going directly via Arrow instead of routing through pandas (or
at least only using pandas as an intermediary to convert smaller chunks to
Arrow). Tables can be composed from smaller RecordBatch objects (see
Table.from_batches) so you don't need to accumulate much non-Arrow data in
memory. You can also zero-copy concat tables with concat_tables.
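For example, something roughly like this (an untested sketch with a toy
single-column schema):

    import pyarrow as pa

    # each chunk of source data becomes a small RecordBatch
    batch1 = pa.RecordBatch.from_arrays([pa.array([1, 2])], ['id'])
    batch2 = pa.RecordBatch.from_arrays([pa.array([3, 4])], ['id'])

    # compose a Table from the batches without copying the underlying buffers
    table = pa.Table.from_batches([batch1, batch2])

    # concat_tables is likewise zero-copy for tables sharing a schema
    big = pa.concat_tables([table, pa.Table.from_batches([batch2])])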

On Fri, Apr 24, 2020, 9:31 AM Hei Chan <structurech...@yahoo.com.invalid>
wrote:

> Hi,
> I am new to Arrow and Parquet.
> My goal is to decode a 4GB binary file (packed c struct) and write all
> records to a file that can be used by R dataframe and Pandas dataframe and
> so others can do some heavy analysis on the big dataset efficiently (in
> terms of loading time and running statistical analysis).
> I first tried to do something like this in Python:
> # for each record after I decode
> updates.append(result)  # updates = deque()
>
> # then after reading in all records
> pd_updates = pd.DataFrame(updates)  # I think I got out of memory here:
> # the OOM handler kicked in and killed my process
>
> pd_updates['my_cat_col'] = pd_updates['my_cat_col'].astype('category', copy=False)
> table = pa.Table.from_pandas(pd_updates, preserve_index=False)
> pq.write_table(table, 'my.parquet', compression='brotli')
>
> What's the recommended way to deal with big dataset conversion? And later
> loading from R and Python (pandas)?
> Thanks in advance!
  
