Wes McKinney created ARROW-6051:
-----------------------------------

             Summary: [C++][Python] Parquet float column writing performance regression from 0.13.0 to 0.14.1
                 Key: ARROW-6051
                 URL: https://issues.apache.org/jira/browse/ARROW-6051
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
            Reporter: Wes McKinney

I'm not sure of the origin of the regression, but with pyarrow 0.13.0 from conda-forge I have

{code}
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np
import pandas as pd

arr = pa.array([np.nan] * 10000000)
t = pa.Table.from_arrays([arr], names=['f0'])

%timeit pq.write_table(t, '/home/wesm/tmp/nans.parquet')
28.7 ms ± 570 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
{code}

but in pyarrow 0.14.1 from conda-forge the same call is about 3x slower

{code}
%timeit pq.write_table(t, '/home/wesm/tmp/nans.parquet')
88.1 ms ± 1.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
{code}

I'm sorry to say, but this is what happens when benchmark data is not tracked and monitored.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)