As a workaround, you can use the following hack:
>>> arr = pa.Array.from_buffers(pa.null(), 123, [pa.py_buffer(b"")])
>>> arr
<pyarrow.lib.NullArray object at 0x7f9a84d79be8>
123 nulls
>>> arr.cast(pa.int32())
<pyarrow.lib.Int32Array object at 0x7f9a84d79d68>
[
  null,
  null,
  null,
  null,
  null,
  null,
  null,
  null,
  null,
  null,
  ...
  null,
  null,
  null,
  null,
  null,
  null,
  null,
  null,
  null,
  null
]

Regards

Antoine.


On 11/12/2019 at 21:08, Weston Pace wrote:
> Thanks.  Ted, I tried using numpy similar to your approach and had the same
> performance.  For the time being I am using a dictionary of data-type to
> pre-allocated big empty array which should work for me in the meantime.
>
> On Wed, Dec 11, 2019 at 9:20 AM Antoine Pitrou <anto...@python.org> wrote:
>
>>
>> There's a C++ facility to do this, but it's not exposed in Python yet.
>> I opened ARROW-7375 for it.
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On 11/12/2019 at 19:36, Weston Pace wrote:
>>> I'm trying to combine multiple parquet files.  They were produced at
>>> different points in time and have different columns.  For example, one
>>> has columns A, B, C.  Two has columns B, C, D.  Three has columns C, D, E.
>>> I want to concatenate all three into one table with columns A, B, C, D, E.
>>>
>>> To do this I am adding the missing columns to each table.  For example, I
>>> am adding column D to table one and setting all values to null.  In order
>>> to do this I need to create a vector with length equal to one.num_rows
>>> and set all values to null.  The vector type is controlled by the type of
>>> D in the other tables.
>>>
>>> I am currently doing this by creating one large python list ahead of time
>>> and using:
>>>
>>>     pa.array(big_list_of_nones, type=column_type, size=desired_size).slice(0, desired_size)
>>>
>>> However, this ends up being very slow.  The calls to pa.array take longer
>>> than reading the data in the first place.
>>>
>>> I can build a large empty vector for every possible data type at the
>>> start of my application but that seems inefficient.
>>>
>>> Is there a good way to initialize a vector with all null values that I am
>>> missing?
>>>
>>
>