Jim Crist created ARROW-1982:
--------------------------------

             Summary: [Python] Return parquet statistics min/max as values instead of strings
                 Key: ARROW-1982
                 URL: https://issues.apache.org/jira/browse/ARROW-1982
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
            Reporter: Jim Crist


Currently the `min` and `max` column statistics are returned as formatted 
strings of the _physical type_. This makes them awkward to use from Python, as 
each string has to be parsed back into the proper _logical type_. Observe:


{code:python}
In [20]: import pandas as pd

In [21]: df = pd.DataFrame({'a': [1, 2, 3],
    ...:                    'b': ['a', 'b', 'c'],
    ...:                    'c': [pd.Timestamp('1991-01-01')]*3})
    ...:

In [22]: df.to_parquet('temp.parquet', engine='pyarrow')

In [23]: from pyarrow import parquet as pq

In [24]: f = pq.ParquetFile('temp.parquet')

In [25]: rg = f.metadata.row_group(0)

In [26]: rg.column(0).statistics.min  # string instead of integer
Out[26]: '1'

In [27]: rg.column(1).statistics.min  # weird space added after value due to formatter
Out[27]: 'a '

In [28]: rg.column(2).statistics.min  # formatted as physical type (int) instead of logical (datetime)
Out[28]: '662688000000'
{code}

Since the type information is known, it should be possible to convert these to 
Arrow values instead of returning strings.
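
To illustrate what users currently have to do by hand, here is a rough sketch of the parsing involved. `parse_stat` is a hypothetical helper, not part of pyarrow, and it assumes the caller passes the Parquet physical and logical type names as plain strings. It also assumes the formatter appends exactly one trailing space to string values, as in the output above, and that timestamps are naive values counted from the Unix epoch.

```python
from datetime import datetime, timedelta

# Naive epoch; assumes the stored timestamps carry no timezone.
EPOCH = datetime(1970, 1, 1)


def parse_stat(raw, physical_type, logical_type=None):
    """Hypothetical helper: turn a formatted statistics string into a value.

    Type names are passed in as strings by the caller (an assumption of
    this sketch); unknown combinations are returned unchanged.
    """
    if physical_type in ("INT32", "INT64"):
        value = int(raw)
        if logical_type == "TIMESTAMP_MILLIS":
            return EPOCH + timedelta(milliseconds=value)
        if logical_type == "TIMESTAMP_MICROS":
            return EPOCH + timedelta(microseconds=value)
        return value
    if physical_type == "BYTE_ARRAY" and logical_type == "UTF8":
        # Assumption: drop the single space the formatter appends.
        # This cannot distinguish it from a real trailing space.
        return raw[:-1] if raw.endswith(" ") else raw
    return raw


# The three values from the session above.
print(parse_stat("1", "INT64"))
print(parse_stat("a ", "BYTE_ARRAY", "UTF8"))
print(parse_stat("662688000000", "INT64", "TIMESTAMP_MILLIS"))
```

The string case shows why doing this on the caller's side is fragile: once the value has been formatted, a padding space and a genuine trailing space look the same.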



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
