Jim Crist created ARROW-1982:
--------------------------------

             Summary: [Python] Return parquet statistics min/max as values instead of strings
                 Key: ARROW-1982
                 URL: https://issues.apache.org/jira/browse/ARROW-1982
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
            Reporter: Jim Crist

Currently the `min` and `max` column statistics are returned as formatted strings of the _physical type_. This makes them tricky to use from Python, because each string has to be parsed back into the proper _logical type_. Observe:

{code:python}
In [20]: import pandas as pd

In [21]: df = pd.DataFrame({'a': [1, 2, 3],
    ...:                    'b': ['a', 'b', 'c'],
    ...:                    'c': [pd.Timestamp('1991-01-01')]*3})
    ...:

In [22]: df.to_parquet('temp.parquet', engine='pyarrow')

In [23]: from pyarrow import parquet as pq

In [24]: f = pq.ParquetFile('temp.parquet')

In [25]: rg = f.metadata.row_group(0)

In [26]: rg.column(0).statistics.min  # string instead of integer
Out[26]: '1'

In [27]: rg.column(1).statistics.min  # weird space added after value due to formatter
Out[27]: 'a '

In [28]: rg.column(2).statistics.min  # formatted as physical type (int) instead of logical (datetime)
Out[28]: '662688000000'
{code}

Since the type information is known, it should be possible to convert these to arrow values instead of strings.
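
To illustrate the parsing burden this puts on callers, here is a minimal sketch of the workaround a user would need today. The helper `parse_stat` and the type labels `'int64'`, `'string'` and `'timestamp[ms]'` are hypothetical names chosen for this example, not part of pyarrow. The sketch also assumes the formatter appends exactly one trailing space and that the timestamp is stored as milliseconds since the Unix epoch, which matches the output above.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def parse_stat(raw, logical_type):
    """Hypothetical helper: turn a formatted statistics string into a Python value.

    `logical_type` is an illustrative label supplied by the caller,
    not something pyarrow hands back in this form.
    """
    # Assumption: the formatter appends a single trailing space, so drop at most one.
    text = raw[:-1] if raw.endswith(' ') else raw
    if logical_type == 'int64':
        return int(text)
    if logical_type == 'string':
        return text
    if logical_type == 'timestamp[ms]':
        # Assumption: physical value is milliseconds since the Unix epoch.
        return EPOCH + timedelta(milliseconds=int(text))
    raise ValueError('unsupported logical type: %r' % (logical_type,))


print(parse_stat('1', 'int64'))                      # 1
print(parse_stat('a ', 'string'))                    # a
print(parse_stat('662688000000', 'timestamp[ms]'))   # 1991-01-01 00:00:00
```

Every caller has to know the logical type and the formatter's quirks to do this, which is the argument for returning typed values directly.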