Not sure if this is any better, but I have an open PR right now in Iceberg,
where we are doing something similar:
https://github.com/apache/incubator-iceberg/pull/544/commits/28166fd3f0e3a24863048a2721f1ae69f243e2af#diff-51d6edf951c105e1e62a3f1e8b4640aaR319-R341

@staticmethod
def create_null_column(reference_column, name, dtype_tuple):
    dtype, init_val = dtype_tuple
    chunk = pa.chunked_array([pa.array(np.full(len(c), init_val),
                                       type=dtype, mask=[True] * len(c))
                              for c in reference_column.data.chunks],
                             type=dtype)

    return pa.Column.from_array(name, chunk)


Note that this is using the pre-0.15 Column API, which has been deprecated.
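
On 0.15+, where a table column is just a ChunkedArray, I think the
equivalent would be roughly the following. Untested sketch; pa.nulls is
the part to double check (I believe it landed in 0.15):

import pyarrow as pa

def create_null_chunked_array(reference_column, dtype):
    # Mirror the reference column's chunk layout with all-null chunks,
    # instead of building masked numpy arrays per chunk.
    chunks = [pa.nulls(len(c), type=dtype) for c in reference_column.chunks]
    return pa.chunked_array(chunks, type=dtype)

The result can then be added to the table with append_column (passing a
field or name, if I recall the new signature correctly).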

On Wed, Dec 11, 2019 at 10:36 AM Weston Pace <weston.p...@gmail.com> wrote:

> I'm trying to combine multiple parquet files.  They were produced at
> different points in time and have different columns.  For example, one has
> columns A, B, C.  Two has columns B, C, D.  Three has columns C, D, E.  I
> want to concatenate all three into one table with columns A, B, C, D, E.
>
> To do this I am adding the missing columns to each table.  For example, I
> am adding column D to table one and setting all values to null.  In order
> to do this I need to create a vector with length equal to one.num_rows and
> set all values to null.  The vector type is controlled by the type of D in
> the other tables.
>
> I am currently doing this by creating one large python list ahead of time
> and using:
>
> pa.array(big_list_of_nones, type=column_type, size=desired_size).slice(0,
> desired_size)
>
> However, this ends up being very slow.  The calls to pa.array take longer
> than reading the data in the first place.
>
> I can build a large empty vector for every possible data type at the start
> of my application but that seems inefficient.
>
> Is there a good way to initialize a vector with all null values that I am
> missing?
>
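
For what it's worth, on a recent enough pyarrow the whole pad-and-concatenate
flow can avoid the big Python list entirely by building the null columns with
pa.nulls. Rough, untested sketch; the table contents and the unified type map
are just placeholders, and pa.nulls / concat_tables behavior is worth
verifying against the release you are on:

import pyarrow as pa

# Hypothetical stand-ins for the three parquet tables; types chosen only
# for illustration.
one = pa.table({"A": [1, 2], "B": [3, 4], "C": [5, 6]})
two = pa.table({"B": [7], "C": [8], "D": [9]})
three = pa.table({"C": [10], "D": [11], "E": [12]})

# Unified column set (and types) for the combined table.
unified = {"A": pa.int64(), "B": pa.int64(), "C": pa.int64(),
           "D": pa.int64(), "E": pa.int64()}

def pad(table, unified):
    # Keep existing columns, and fill every missing one with an all-null
    # column of the right length and type (no Python list of Nones).
    cols = {}
    for name, typ in unified.items():
        if name in table.schema.names:
            cols[name] = table.column(name)
        else:
            cols[name] = pa.nulls(table.num_rows, type=typ)
    return pa.table(cols)

combined = pa.concat_tables([pad(t, unified) for t in (one, two, three)])

If the dtypes differ across files, you would take each missing column's type
from whichever table actually has it, the same way the Iceberg snippet above
does.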
