Jim Crist created ARROW-1982:
--------------------------------
Summary: [Python] Return parquet statistics min/max as values instead of strings
Key: ARROW-1982
URL: https://issues.apache.org/jira/browse/ARROW-1982
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Jim Crist
Currently, the `min` and `max` column statistics are returned as strings
formatted from the _physical type_. This makes them awkward to use from Python,
since each string has to be parsed back into the proper _logical type_. Observe:
{code:python}
In [20]: import pandas as pd
In [21]: df = pd.DataFrame({'a': [1, 2, 3],
...: 'b': ['a', 'b', 'c'],
...: 'c': [pd.Timestamp('1991-01-01')]*3})
...:
In [22]: df.to_parquet('temp.parquet', engine='pyarrow')
In [23]: from pyarrow import parquet as pq
In [24]: f = pq.ParquetFile('temp.parquet')
In [25]: rg = f.metadata.row_group(0)
In [26]: rg.column(0).statistics.min # string instead of integer
Out[26]: '1'
In [27]: rg.column(1).statistics.min # weird space added after value due to formatter
Out[27]: 'a '
In [28]: rg.column(2).statistics.min # formatted as physical type (int) instead of logical (datetime)
Out[28]: '662688000000'
{code}
Since the type information is known, it should be possible to convert these to
Arrow values instead of strings.
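For illustration, here is a rough sketch of the parsing a user currently has to do by hand for the three columns above. It is not pyarrow code: `parse_stat` and the logical type labels (`"int64"`, `"string"`, `"timestamp[ms]"`) are hypothetical names made up for this example. It assumes the formatter appends exactly one trailing space to string values, and that the timestamp statistic is milliseconds since the Unix epoch with no time zone.

```python
from datetime import datetime, timedelta

# Naive epoch, matching the timezone-naive pd.Timestamp used in the example.
EPOCH = datetime(1970, 1, 1)


def parse_stat(raw, logical_type):
    """Hypothetical helper: convert a formatted statistics string to a Python value.

    `logical_type` is an illustrative label supplied by the caller, not
    something read from the parquet metadata.
    """
    if logical_type == "int64":
        return int(raw)
    if logical_type == "string":
        # Assumption: the formatter adds a single trailing space.
        return raw[:-1] if raw.endswith(" ") else raw
    if logical_type == "timestamp[ms]":
        # Assumption: the physical value is milliseconds since the epoch.
        return EPOCH + timedelta(milliseconds=int(raw))
    raise ValueError(f"unsupported logical type: {logical_type!r}")


print(parse_stat("1", "int64"))
print(parse_stat("a ", "string"))
print(parse_stat("662688000000", "timestamp[ms]"))
```

Stripping the trailing space is lossy for strings that legitimately end in a space, which is one more reason to return typed values directly rather than asking callers to parse.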
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)