I'm trying to combine multiple parquet files that were produced at different points in time and have different columns. For example, file one has columns A, B, and C; file two has B, C, and D; file three has C, D, and E. I want to concatenate all three into one table with columns A, B, C, D, and E.
To do this I am adding the missing columns to each table, filled with nulls. For example, I add column D to table one with every value set to null. That means creating an array of length `one.num_rows` whose values are all null, where the array type is taken from the type of D in the other tables.

I am currently doing this by building one large Python list of `None`s ahead of time and calling `pa.array(big_list_of_nones, type=column_type, size=desired_size).slice(0, desired_size)`. However, this ends up being very slow: the calls to `pa.array` take longer than reading the data in the first place. I could build one large empty array for every possible data type at application startup, but that seems wasteful. Is there a good way to initialize an all-null array that I am missing?