There's a C++ facility to do this, but it's not exposed in Python yet.
I opened ARROW-7375 for it.

Regards

Antoine.


On 11/12/2019 at 19:36, Weston Pace wrote:
> I'm trying to combine multiple parquet files.  They were produced at
> different points in time and have different columns.  For example, one has
> columns A, B, C.  Two has columns B, C, D.  Three has columns C, D, E.  I
> want to concatenate all three into one table with columns A, B, C, D, E.
> 
> To do this I am adding the missing columns to each table.  For example, I
> am adding column D to table one and setting all values to null.  In order
> to do this I need to create a vector with length equal to one.num_rows and
> set all values to null.  The vector type is controlled by the type of D in
> the other tables.
> 
> I am currently doing this by creating one large python list ahead of time
> and using:
> 
>     pa.array(big_list_of_nones, type=column_type, size=desired_size).slice(0, desired_size)
> 
> However, this ends up being very slow.  The calls to pa.array take longer
> than reading the data in the first place.
> 
> I can build a large empty vector for every possible data type at the start
> of my application but that seems inefficient.
> 
> Is there a good way to initialize a vector with all null values that I am
> missing?
> 
