Not sure if this is any better, but I have an open PR in Iceberg right now where we are doing something similar: https://github.com/apache/incubator-iceberg/pull/544/commits/28166fd3f0e3a24863048a2721f1ae69f243e2af#diff-51d6edf951c105e1e62a3f1e8b4640aaR319-R341
@staticmethod
def create_null_column(reference_column, name, dtype_tuple):
    dtype, init_val = dtype_tuple
    chunk = pa.chunked_array(
        [pa.array(np.full(len(c), init_val), type=dtype, mask=[True] * len(c))
         for c in reference_column.data.chunks],
        type=dtype)
    return pa.Column.from_array(name, chunk)

Note that this is using the <0.15 column API, which has been deprecated.
(I've put a rough sketch against the 0.15+ API at the bottom of this mail.)

On Wed, Dec 11, 2019 at 10:36 AM Weston Pace <weston.p...@gmail.com> wrote:

> I'm trying to combine multiple parquet files. They were produced at
> different points in time and have different columns. For example, one has
> columns A, B, C. Two has columns B, C, D. Three has columns C, D, E. I
> want to concatenate all three into one table with columns A, B, C, D, E.
>
> To do this I am adding the missing columns to each table. For example, I
> am adding column D to table one and setting all values to null. In order
> to do this I need to create a vector with length equal to one.num_rows and
> set all values to null. The vector type is controlled by the type of D in
> the other tables.
>
> I am currently doing this by creating one large python list ahead of time
> and using:
>
> pa.array(big_list_of_nones, type=column_type, size=desired_size).slice(0,
> desired_size)
>
> However, this ends up being very slow. The calls to pa.array take longer
> than reading the data in the first place.
>
> I can build a large empty vector for every possible data type at the start
> of my application but that seems inefficient.
>
> Is there a good way to initialize a vector with all null values that I am
> missing?
>
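Since the snippet above targets the pre-0.15 API, here is a rough, untested sketch of how I think the same idea would look on 0.15+, where pa.Column is gone and table columns are plain ChunkedArrays. The helper name is made up, and (as above) the fill value in dtype_tuple just has to be something pa.array can coerce to the target type, since the mask marks every slot as null anyway:

import numpy as np
import pyarrow as pa

def create_null_chunked_array(reference_column, dtype_tuple):
    # reference_column: an existing pa.ChunkedArray whose chunk lengths we mirror
    # dtype_tuple: (pyarrow type, dummy fill value compatible with that type)
    dtype, init_val = dtype_tuple
    chunks = [
        pa.array(np.full(len(c), init_val),   # dummy values, never exposed...
                 type=dtype,
                 mask=np.full(len(c), True))  # ...because every slot is marked null
        for c in reference_column.chunks
    ]
    return pa.chunked_array(chunks, type=dtype)

Using the tables from your question, I believe you could then do something like this to give table one an all-null column D (again untested, and the 0 fill value only makes sense for numeric types):

d_type = two.schema.field_by_name("D").type
null_d = create_null_chunked_array(one.column(0), (d_type, 0))
one = one.append_column("D", null_d)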