This works very well and is much simpler. Thank you for the workaround.

On Wed, Dec 11, 2019 at 10:29 AM Antoine Pitrou <anto...@python.org> wrote:
>
> As a workaround, you can use the following hack:
>
> >>> arr = pa.Array.from_buffers(pa.null(), 123, [pa.py_buffer(b"")])
> >>> arr
> <pyarrow.lib.NullArray object at 0x7f9a84d79be8>
> 123 nulls
> >>> arr.cast(pa.int32())
> <pyarrow.lib.Int32Array object at 0x7f9a84d79d68>
> [
>   null,
>   null,
>   null,
>   null,
>   null,
>   null,
>   null,
>   null,
>   null,
>   null,
>   ...
>   null,
>   null,
>   null,
>   null,
>   null,
>   null,
>   null,
>   null,
>   null,
>   null
> ]
>
> Regards
>
> Antoine.
>
>
> On 11/12/2019 21:08, Weston Pace wrote:
> > Thanks. Ted, I tried using numpy, similar to your approach, and had the
> > same performance. For the time being I am using a dictionary mapping
> > data type to a pre-allocated big empty array, which should work for me
> > in the meantime.
> >
> > On Wed, Dec 11, 2019 at 9:20 AM Antoine Pitrou <anto...@python.org> wrote:
> >
> >>
> >> There's a C++ facility to do this, but it's not exposed in Python yet.
> >> I opened ARROW-7375 for it.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >> On 11/12/2019 19:36, Weston Pace wrote:
> >>> I'm trying to combine multiple parquet files. They were produced at
> >>> different points in time and have different columns. For example, one
> >>> has columns A, B, C. Two has columns B, C, D. Three has columns
> >>> C, D, E. I want to concatenate all three into one table with columns
> >>> A, B, C, D, E.
> >>>
> >>> To do this I am adding the missing columns to each table. For
> >>> example, I am adding column D to table one and setting all values to
> >>> null. In order to do this I need to create a vector with length equal
> >>> to one.num_rows and set all values to null. The vector type is
> >>> controlled by the type of D in the other tables.
> >>>
> >>> I am currently doing this by creating one large Python list ahead of
> >>> time and using:
> >>>
> >>> pa.array(big_list_of_nones, type=column_type, size=desired_size).slice(0, desired_size)
> >>>
> >>> However, this ends up being very slow. The calls to pa.array take
> >>> longer than reading the data in the first place.
> >>>
> >>> I can build a large empty vector for every possible data type at the
> >>> start of my application, but that seems inefficient.
> >>>
> >>> Is there a good way to initialize a vector with all null values that I
> >>> am missing?
> >>
> >
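
Putting the thread together, below is a minimal sketch of how the workaround could be applied to the original padding-and-concatenation problem. The helper names (null_array, pad_to_schema), the toy tables, and the all-int64 target schema are illustrative assumptions, not taken from the thread.

    import pyarrow as pa


    def null_array(arrow_type, length):
        """All-null array of `arrow_type`, built without a big Python list of Nones.

        Uses the hack from the thread: make a typeless NullArray from an empty
        buffer, then cast it to the requested type.
        """
        nulls = pa.Array.from_buffers(pa.null(), length, [pa.py_buffer(b"")])
        return nulls.cast(arrow_type)


    def pad_to_schema(table, schema):
        """Return `table` with the columns of `schema`; missing columns become all-null."""
        columns = []
        for field in schema:
            if field.name in table.column_names:
                columns.append(table.column(field.name).cast(field.type))
            else:
                columns.append(null_array(field.type, table.num_rows))
        return pa.Table.from_arrays(columns, schema=schema)


    # Toy stand-ins for the three parquet files described in the question.
    one = pa.table({"A": [1, 2], "B": [3, 4], "C": [5, 6]})
    two = pa.table({"B": [7], "C": [8], "D": [9]})
    three = pa.table({"C": [10], "D": [11], "E": [12]})

    # Hypothetical unified schema covering columns A through E.
    target = pa.schema([("A", pa.int64()), ("B", pa.int64()), ("C", pa.int64()),
                        ("D", pa.int64()), ("E", pa.int64())])

    combined = pa.concat_tables([pad_to_schema(t, target) for t in (one, two, three)])
    print(combined)

Building each missing column by casting an empty-buffer NullArray avoids materializing a Python list of Nones, which is what made the original pa.array(...) approach slower than reading the data itself.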