Hi,

The code to show the performance issue with DateTimeIndex is at:
https://gist.github.com/maartenb/256556bcd6d7c7636d400f3b464db18c
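For context, here is a stripped-down sketch of this kind of harness (simplified, not the actual gist code; the case-2 index construction below is only an illustrative stand-in for the “roundabout” build, which is in the gist):

import time

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

ROWS, COLS = 1 << 20, 16            # the gist's output suggests 1 << 23 rows; smaller here
DATA = np.random.rand(ROWS, COLS)   # float64 payload, identical for all three cases

def make_dataframe(case: int) -> pd.DataFrame:
    if case == 0:                   # case 0: plain int index
        index = pd.RangeIndex(ROWS)
    elif case == 1:                 # case 1: DatetimeIndex built directly
        index = pd.date_range("2019-01-01", periods=ROWS, freq="s")
    else:                           # case 2: same timestamps, built indirectly (stand-in only)
        index = pd.DatetimeIndex(
            pd.Series(pd.date_range("2019-01-01", periods=ROWS, freq="s")).astype(object)
        )
    return pd.DataFrame(DATA, index=index, columns=[str(i) for i in range(COLS)])

for case in (0, 1, 2):
    df = make_dataframe(case)
    path = f"case{case}.parquet"
    t0 = time.perf_counter()
    pq.write_table(pa.Table.from_pandas(df), path, use_dictionary=False, compression=None)
    t1 = time.perf_counter()
    pq.read_table(path, use_threads=True).to_pandas()
    t2 = time.perf_counter()
    print(f"{case}: write {t1 - t0:7.3f} s   read {t2 - t1:7.3f} s")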
It shows three cases: 0) an int index, 1) a datetime index, 2) a datetime index created in a slightly roundabout way.

I’m a little confused by the two datetime cases. Case 2) is much slower, but the df compares identical to case 1). (I originally used something like 2) to match our specific data. I don’t see why it behaves differently??)

The timings I find are:

1073741824 float64 8388608 16
0: make_dataframe      :  2390.830 msec,  428 MB/s
0: write_arrow_parquet :  2486.463 msec,  412 MB/s
0: read_arrow_parquet  :   813.946 msec, 1258 MB/s   <<<
1: make_dataframe      :  2579.815 msec,  397 MB/s
1: write_arrow_parquet :  2708.151 msec,  378 MB/s
1: read_arrow_parquet  :  1413.999 msec,  724 MB/s   <<<
2: make_dataframe      : 15126.520 msec,   68 MB/s
2: write_arrow_parquet :  9205.815 msec,  111 MB/s
2: read_arrow_parquet  :  5929.346 msec,  173 MB/s   <<<

Case 0, int index: this is all great.

Case 1, datetime index: we lose almost half the speed. Given that a datetime only needs to be rescaled coming from Pandas, IIRC, that seems like a lot?

Case 2, the other datetime index: no idea what is going on.

Any insights are much appreciated.

Cheers,
Maarten.


> On Sep 24, 2019, at 11:25 AM, Wes McKinney <wesmck...@gmail.com> wrote:
>
> hi
>
> On Tue, Sep 24, 2019 at 9:26 AM Maarten Ballintijn <maart...@xs4all.nl> wrote:
>>
>> Hi Wes,
>>
>> Thanks for your quick response.
>>
>> Yes, we’re using Python 3.7.4, from miniconda and conda-forge, and:
>>
>> numpy:   1.16.5
>> pandas:  0.25.1
>> pyarrow: 0.14.1
>>
>> It looks like 0.15 is close, so I can wait for that.
>>
>> Theoretically I see three components driving the performance:
>> 1) The cost of locating the column (directory overhead)
>> 2) The overhead of reading a single column (reading and processing meta-data, setting up for reading)
>> 3) Bulk reading and unmarshalling/decoding the data.
>>
>> Only 1) would be impacted by the number of columns, but if you’re reading everything, ideally this would not be a problem.
>
> The problem is more nuanced than that. Parquet's metadata is somewhat
> "heavy" at the column level. So when you're writing thousands of
> columns, the fixed overhead associated with reading a single column
> becomes problematic. There are several data structures associated with
> decoding a column that have a fixed setup and teardown cost. Even if
> there is 1 millisecond of fixed overhead related to reading a column (I
> don't know what the number is exactly) then reading 10,000 columns has
> 10 seconds of unavoidable overhead. It might be useful for us to
> quantify and communicate the expected overhead when metadata and
> decoding is taken into account. Simply put, having more than 1000
> columns is not advisable.
>
>> Based on an initial cursory look at the Parquet format, I guess the index and
>> the column meta-data might need to be read in full, so I can see how that
>> might slow down reading only a few columns out of a large set. But that was
>> not really the case here?
>>
>> What would you suggest for looking into the date index slow-down?
>
> Can you show a code example to make things easier for us to see what
> you're seeing?
>
>>
>> Cheers,
>> Maarten.
>>
>>
>>
>>> On Sep 23, 2019, at 7:07 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>>>
>>> hi Maarten,
>>>
>>> Are you using the master branch or 0.14.1? There are a number of
>>> performance regressions in 0.14.0/0.14.1 that are addressed in the
>>> master branch, to appear as 0.15.0 relatively soon.
>>> >>> As a file format, Parquet (and columnar formats in general) is not >>> known to perform well with more than 1000 columns. >>> >>> On the other items, we'd be happy to work with you to dig through the >>> performance issues you're seeing. >>> >>> Thanks >>> Wes >>> >>> On Mon, Sep 23, 2019 at 5:52 PM Maarten Ballintijn <maart...@xs4all.nl> >>> wrote: >>>> >>>> Greetings, >>>> >>>> We have Pandas DataFrames with typically about 6,000 rows using >>>> DateTimeIndex. >>>> They have about 20,000 columns with integer column labels, and data with a >>>> dtype of float32. >>>> >>>> We’d like to store these dataframes with parquet, using the ability to >>>> read a subset of columns and to store meta-data with the file. >>>> >>>> We’ve found the reading performance less than expected compared to the >>>> published benchmarks (e.g. Wes’ blog post). >>>> >>>> Using a modified version of his script we did reproduce his results (~ >>>> 1GB/s for high entropy, no dict on MacBook pro) >>>> >>>> But there seem to be three factors that contribute to the slowdown for our >>>> datasets: >>>> >>>> - DateTimeIndex is much slower then a Int index (we see about a factor 5). >>>> - The number of columns impact reading speed significantly (factor ~2 >>>> going from 16 to 16,000 columns) >>>> - The ‘use_pandas_metadata=True’ slows down reading significantly and >>>> appears unnecessary? (about 40%) >>>> >>>> Are there ways we could speedup the reading? Should we use a different >>>> layout? >>>> >>>> Thanks for your help and insights! >>>> >>>> Cheers, >>>> Maarten >>>> >>>> >>>> ps. the routines we used: >>>> >>>> def write_arrow_parquet(df: pd.DataFrame, fname: str) -> None: >>>> table = pa.Table.from_pandas(df) >>>> pq.write_table(table, fname, use_dictionary=False, compression=None) >>>> return >>>> >>>> def read_arrow_parquet(fname: str) -> pd.DataFrame: >>>> table = pq.read_table(fname, use_pandas_metadata=False, use_threads=True) >>>> df = table.to_pandas() >>>> return df