Thanks.  Ted, I tried numpy in much the same way as your approach and saw the
same performance.  For the time being I am using a dictionary that maps each
data type to a pre-allocated big empty array, which should work for me.
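
A minimal sketch of that workaround, assuming pyarrow is imported as pa (the
type list, row-count bound, and helper name below are illustrative, not code
from this thread):

import pyarrow as pa

# Pre-allocate one big all-null array per data type once at startup, so the
# expensive pa.array call over a list of Nones is paid once per type rather
# than once per missing column.
MAX_ROWS = 10_000_000  # assumed upper bound on any single table's row count
TYPES = [pa.int64(), pa.float64(), pa.string()]  # types seen across the files

NULL_POOL = {t: pa.array([None] * MAX_ROWS, type=t) for t in TYPES}

def pad_with_null_column(table, name, typ):
    # Slice the pre-built null array down to the table's length and append
    # it as the missing all-null column.
    nulls = NULL_POOL[typ].slice(0, table.num_rows)
    return table.append_column(pa.field(name, typ), nulls)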

On Wed, Dec 11, 2019 at 9:20 AM Antoine Pitrou <anto...@python.org> wrote:

>
> There's a C++ facility to do this, but it's not exposed in Python yet.
> I opened ARROW-7375 for it.
>
> Regards
>
> Antoine.
>
>
> Le 11/12/2019 à 19:36, Weston Pace a écrit :
> > I'm trying to combine multiple parquet files.  They were produced at
> > different points in time and have different columns.  For example, one has
> > columns A, B, C.  Two has columns B, C, D.  Three has columns C, D, E.  I
> > want to concatenate all three into one table with columns A, B, C, D, E.
> >
> > To do this I am adding the missing columns to each table.  For example, I
> > am adding column D to table one and setting all values to null.  In order
> > to do this I need to create a vector with length equal to one.num_rows and
> > set all values to null.  The vector type is controlled by the type of D in
> > the other tables.
> >
> > I am currently doing this by creating one large python list ahead of time
> > and using:
> >
> > pa.array(big_list_of_nones, type=column_type, size=desired_size).slice(0,
> > desired_size)
> >
> > However, this ends up being very slow.  The calls to pa.array take longer
> > than reading the data in the first place.
> >
> > I can build a large empty vector for every possible data type at the start
> > of my application but that seems inefficient.
> >
> > Is there a good way to initialize a vector with all null values that I am
> > missing?
> >
>
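
Follow-up: newer pyarrow releases expose a null-array constructor directly,
which may be the Python exposure tracked in ARROW-7375 (check the ticket to
confirm which version it landed in).  A minimal usage sketch:

import pyarrow as pa

# Build an all-null array of a given length and type without first
# materializing a Python list of Nones.
nulls = pa.nulls(1_000_000, type=pa.float64())
assert nulls.null_count == len(nulls)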
