This works very well and is much simpler.  Thank you for the workaround.
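
For reference, here is a rough sketch of how the workaround slots into the
original use case: filling a missing column with nulls before concatenating.
The tables and column names are made up for illustration, and the
append_column/select signatures match current pyarrow, so they may differ in
older releases.

import pyarrow as pa

def null_array(arrow_type, length):
    # Antoine's workaround: build a NullArray of the desired length and
    # cast it to the target type; every value in the result is null.
    nulls = pa.Array.from_buffers(pa.null(), length, [pa.py_buffer(b"")])
    return nulls.cast(arrow_type)

# Hypothetical tables with mismatched schemas (table_one is missing "D",
# table_two is missing "A").
table_one = pa.table({"A": [1, 2], "B": ["x", "y"], "C": [0.1, 0.2]})
table_two = pa.table({"B": ["z"], "C": [0.3], "D": [7]})

# Add each missing column, filled with nulls of the type used elsewhere.
d_type = table_two.schema.field("D").type
table_one = table_one.append_column(
    pa.field("D", d_type), null_array(d_type, table_one.num_rows))

a_type = table_one.schema.field("A").type
table_two = table_two.append_column(
    pa.field("A", a_type), null_array(a_type, table_two.num_rows))

# concat_tables needs matching schemas, so reorder the second table's
# columns to line up with the first before concatenating.
combined = pa.concat_tables(
    [table_one, table_two.select(table_one.column_names)])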

On Wed, Dec 11, 2019 at 10:29 AM Antoine Pitrou <anto...@python.org> wrote:

>
> As a workaround, you can use the following hack:
>
> >>> arr = pa.Array.from_buffers(pa.null(), 123, [pa.py_buffer(b"")])
> >>> arr
> <pyarrow.lib.NullArray object at 0x7f9a84d79be8>
> 123 nulls
> >>> arr.cast(pa.int32())
> <pyarrow.lib.Int32Array object at 0x7f9a84d79d68>
> [
>   null,
>   null,
>   null,
>   null,
>   null,
>   null,
>   null,
>   null,
>   null,
>   null,
>   ...
>   null,
>   null,
>   null,
>   null,
>   null,
>   null,
>   null,
>   null,
>   null,
>   null
> ]
>
>
> Regards
>
> Antoine.
>
>
> On 11/12/2019 at 21:08, Weston Pace wrote:
> > Thanks.  Ted, I tried using numpy similar to your approach and had the
> > same performance.  For the time being I am using a dictionary mapping
> > each data type to a pre-allocated, all-null array, which works for me in
> > the meantime.
> >
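
In case it is useful to anyone, a rough sketch of that interim approach
(the type list and the row-count bound below are hypothetical):

import pyarrow as pa

MAX_ROWS = 1_000_000  # hypothetical upper bound on any table's row count

# One big all-null array per type, built once at startup.
preallocated = {
    t: pa.array([None] * MAX_ROWS, type=t)
    for t in (pa.int32(), pa.int64(), pa.float64(), pa.string())
}

def nulls_of(arrow_type, length):
    # slice() is zero-copy, so the per-column cost is tiny; only the
    # up-front allocation of the big arrays is paid, and only once.
    return preallocated[arrow_type].slice(0, length)
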
> > On Wed, Dec 11, 2019 at 9:20 AM Antoine Pitrou <anto...@python.org> wrote:
> >
> >>
> >> There's a C++ facility to do this, but it's not exposed in Python yet.
> >> I opened ARROW-7375 for it.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >> On 11/12/2019 at 19:36, Weston Pace wrote:
> >>> I'm trying to combine multiple parquet files.  They were produced at
> >>> different points in time and have different columns.  For example, one
> >>> has columns A, B, C.  Two has columns B, C, D.  Three has columns C, D, E.
> >>> I want to concatenate all three into one table with columns A, B, C, D, E.
> >>>
> >>> To do this I am adding the missing columns to each table.  For example,
> >>> I am adding column D to table one and setting all values to null.  In
> >>> order to do this I need to create a vector with length equal to
> >>> one.num_rows and set all values to null.  The vector type is controlled
> >>> by the type of D in the other tables.
> >>>
> >>> I am currently doing this by creating one large Python list ahead of
> >>> time and using:
> >>>
> >>> pa.array(big_list_of_nones, type=column_type, size=desired_size).slice(0, desired_size)
> >>>
> >>> However, this ends up being very slow.  The calls to pa.array take
> >>> longer than reading the data in the first place.
> >>>
> >>> I can build a large empty vector for every possible data type at the
> >>> start of my application but that seems inefficient.
> >>>
> >>> Is there a good way to initialize a vector with all null values that I
> >>> am missing?
> >>>
> >>
> >
>
