Wes McKinney created ARROW-6051:
-----------------------------------

             Summary: [C++][Python] Parquet float column writing performance regression from 0.13.0 to 0.14.1
                 Key: ARROW-6051
                 URL: https://issues.apache.org/jira/browse/ARROW-6051
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
            Reporter: Wes McKinney


I'm not sure of the origin of the regression, but I have the following with

pyarrow 0.13.0 from conda-forge

{code}
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np
import pandas as pd
arr = pa.array([np.nan] * 10000000)
t = pa.Table.from_arrays([arr], names=['f0'])

%timeit pq.write_table(t, '/home/wesm/tmp/nans.parquet')
28.7 ms ± 570 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
{code}

but in pyarrow 0.14.1 from conda-forge, the same write takes roughly 3x as long

{code}
%timeit pq.write_table(t, '/home/wesm/tmp/nans.parquet')
88.1 ms ± 1.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
{code}

I'm sorry to say, but this is what happens when benchmark data is not tracked
and monitored.



