As a workaround, you can use the following hack:

>>> arr = pa.Array.from_buffers(pa.null(), 123, [pa.py_buffer(b"")])
>>> arr
<pyarrow.lib.NullArray object at 0x7f9a84d79be8>
123 nulls
>>> arr.cast(pa.int32())
<pyarrow.lib.Int32Array object at 0x7f9a84d79d68>
[
  null,
  null,
  null,
  null,
  null,
  null,
  null,
  null,
  null,
  null,
  ...
  null,
  null,
  null,
  null,
  null,
  null,
  null,
  null,
  null,
  null
]
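
If you need this in several places, the hack can be wrapped in a small helper.
A minimal sketch, assuming the cast path shown above is acceptable for your
column types (the name make_null_array is mine, not a pyarrow API, and the
append_column usage assumes a reasonably recent pyarrow):

import pyarrow as pa

def make_null_array(length, dtype):
    # A NullArray carries no data buffers, so building one of the requested
    # length allocates essentially nothing; casting it then yields an
    # all-null array of the target type.
    null_arr = pa.Array.from_buffers(pa.null(), length, [pa.py_buffer(b"")])
    return null_arr.cast(dtype)

# For example, to pad a table "one" that is missing an int32 column "D":
# one = one.append_column("D", make_null_array(one.num_rows, pa.int32()))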


Regards

Antoine.


On 11/12/2019 at 21:08, Weston Pace wrote:
> Thanks.  Ted, I tried using numpy similar to your approach and had the same
> performance.  For the time being I am using a dictionary mapping each data
> type to a pre-allocated big empty array, which should work for me.
> 
> On Wed, Dec 11, 2019 at 9:20 AM Antoine Pitrou <anto...@python.org> wrote:
> 
>>
>> There's a C++ facility to do this, but it's not exposed in Python yet.
>> I opened ARROW-7375 for it.
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On 11/12/2019 at 19:36, Weston Pace wrote:
>>> I'm trying to combine multiple parquet files.  They were produced at
>>> different points in time and have different columns.  For example, file
>>> one has columns A, B, C.  File two has columns B, C, D.  File three has
>>> columns C, D, E.  I want to concatenate all three into one table with
>>> columns A, B, C, D, E.
>>>
>>> To do this I am adding the missing columns to each table.  For example,
>>> I am adding column D to table one and setting all values to null.  In
>>> order to do this I need to create a vector with length equal to
>>> one.num_rows and set all values to null.  The vector type is controlled
>>> by the type of D in the other tables.
>>>
>>> I am currently doing this by creating one large Python list ahead of
>>> time and using:
>>>
>>> pa.array(big_list_of_nones, type=column_type, size=desired_size).slice(0, desired_size)
>>>
>>> However, this ends up being very slow.  The calls to pa.array take
>>> longer than reading the data in the first place.
>>>
>>> I can build a large empty vector for every possible data type at the
>>> start of my application, but that seems inefficient.
>>>
>>> Is there a good way to initialize a vector with all null values that I
>>> am missing?
>>>
>>
> 
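
P.S. For completeness, a rough sketch of how the approach described in the
quoted thread (padding each table with all-null columns so the schemas match,
then concatenating) could be put together.  The helper name pad_table and the
way the unified target_schema is obtained are mine, not pyarrow APIs:

import pyarrow as pa

def pad_table(table, target_schema):
    # For every field in the target schema, reuse the existing column if the
    # table has it, otherwise substitute an all-null column of the right type.
    columns = []
    for field in target_schema:
        if field.name in table.schema.names:
            columns.append(table.column(field.name))
        else:
            nulls = pa.Array.from_buffers(pa.null(), table.num_rows,
                                          [pa.py_buffer(b"")])
            columns.append(nulls.cast(field.type))
    return pa.Table.from_arrays(columns, schema=target_schema)

# target_schema is the union of the schemas of tables one, two and three,
# built by the caller.  Once every table is padded the schemas are equal and
# pa.concat_tables can stitch them together:
# combined = pa.concat_tables([pad_table(t, target_schema)
#                              for t in (one, two, three)])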
