Thanks. Ted, I tried a numpy-based approach similar to yours and saw the same performance. For the time being I am using a dictionary mapping each data type to a pre-allocated large all-null array, which should work for me.
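Roughly what that workaround looks like (a quick sketch; the cache size and helper name are just illustrative):

import pyarrow as pa

# One large all-null array per data type, built once and sliced as needed.
# CACHE_SIZE just needs to be at least as large as the biggest table.
CACHE_SIZE = 10_000_000
_null_cache = {}

def null_array(data_type, length):
    """Return an all-null array of the given type and length."""
    if data_type not in _null_cache:
        # This is the slow pa.array call, but it now happens once per type.
        _null_cache[data_type] = pa.array([None] * CACHE_SIZE, type=data_type)
    return _null_cache[data_type].slice(0, length)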
On Wed, Dec 11, 2019 at 9:20 AM Antoine Pitrou <anto...@python.org> wrote:
>
> There's a C++ facility to do this, but it's not exposed in Python yet.
> I opened ARROW-7375 for it.
>
> Regards
>
> Antoine.
>
>
> On 11/12/2019 at 19:36, Weston Pace wrote:
> > I'm trying to combine multiple parquet files. They were produced at
> > different points in time and have different columns. For example, one has
> > columns A, B, C. Two has columns B, C, D. Three has columns C, D, E. I
> > want to concatenate all three into one table with columns A, B, C, D, E.
> >
> > To do this I am adding the missing columns to each table. For example, I
> > am adding column D to table one and setting all values to null. In order
> > to do this I need to create a vector with length equal to one.num_rows and
> > set all values to null. The vector type is controlled by the type of D in
> > the other tables.
> >
> > I am currently doing this by creating one large python list ahead of time
> > and using:
> >
> > pa.array(big_list_of_nones, type=column_type, size=desired_size).slice(0, desired_size)
> >
> > However, this ends up being very slow. The calls to pa.array take longer
> > than reading the data in the first place.
> >
> > I can build a large empty vector for every possible data type at the start
> > of my application but that seems inefficient.
> >
> > Is there a good way to initialize a vector with all null values that I am
> > missing?
> >
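For anyone finding this thread in the archives, the end-to-end flow described in the quoted message is roughly the following (a sketch; the file names and the unify_schemas/pad_table helpers are just illustrative, and the all-null column construction is the slow step in question):

import pyarrow as pa
import pyarrow.parquet as pq

def unify_schemas(tables):
    # Collect every field seen across the input tables (first-seen type wins).
    fields = {}
    for table in tables:
        for field in table.schema:
            fields.setdefault(field.name, field)
    return pa.schema(list(fields.values()))

def pad_table(table, target_schema):
    # Rebuild the table in the unified column order, filling missing columns
    # with all-null arrays. This is where a pre-allocated null array (or the
    # C++ facility from ARROW-7375, once exposed) would be used instead.
    columns = []
    for field in target_schema:
        if table.schema.get_field_index(field.name) == -1:
            columns.append(pa.array([None] * table.num_rows, type=field.type))
        else:
            columns.append(table.column(field.name))
    return pa.table(columns, schema=target_schema)

tables = [pq.read_table(f) for f in ("one.parquet", "two.parquet", "three.parquet")]
schema = unify_schemas(tables)
combined = pa.concat_tables([pad_table(t, schema) for t in tables])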