I'm trying to combine multiple parquet files.  They were produced at
different points in time and have different columns.  For example, file
one has columns A, B, C; file two has columns B, C, D; file three has
columns C, D, E.  I want to concatenate all three into one table with
columns A, B, C, D, E.

To do this I am adding the missing columns to each table.  For example, I
am adding column D to table one with every value set to null.  That means
I need to create a vector of length one.num_rows in which all values are
null, where the vector's type is taken from the type of D in the tables
that do have it.

I am currently doing this by creating one large Python list of Nones
ahead of time and using:

pa.array(big_list_of_nones, type=column_type, size=desired_size).slice(0, desired_size)

However, this ends up being very slow.  The calls to pa.array take longer
than reading the data in the first place.

I could build a large all-null vector for every possible data type at
application startup, but that seems inefficient.

Is there a good way, that I am missing, to initialize a vector with all
null values?
