There's a C++ facility to do this, but it's not exposed in Python yet. I opened ARROW-7375 for it.

Regards

Antoine.


On 11/12/2019 at 19:36, Weston Pace wrote:
> I'm trying to combine multiple parquet files.  They were produced at
> different points in time and have different columns.  For example, one has
> columns A, B, C.  Two has columns B, C, D.  Three has columns C, D, E.  I
> want to concatenate all three into one table with columns A, B, C, D, E.
>
> To do this I am adding the missing columns to each table.  For example, I
> am adding column D to table one and setting all values to null.  In order
> to do this I need to create a vector with length equal to one.num_rows and
> set all values to null.  The vector type is controlled by the type of D in
> the other tables.
>
> I am currently doing this by creating one large python list ahead of time
> and using:
>
> pa.array(big_list_of_nones, type=column_type, size=desired_size).slice(0, desired_size)
>
> However, this ends up being very slow.  The calls to pa.array take longer
> than reading the data in the first place.
>
> I can build a large empty vector for every possible data type at the start
> of my application but that seems inefficient.
>
> Is there a good way to initialize a vector with all null values that I am
> missing?
>