[jira] [Created] (ARROW-2503) Trailing space character in RowGroup statistics of pyarrow.parquet.ParquetFile

Julius Neuffer (JIRA) Tue, 24 Apr 2018 02:18:38 -0700

Julius Neuffer created ARROW-2503:
-------------------------------------

             Summary: Trailing space character in RowGroup statistics of pyarrow.parquet.ParquetFile
                 Key: ARROW-2503
                 URL: https://issues.apache.org/jira/browse/ARROW-2503
             Project: Apache Arrow
          Issue Type: Bug
    Affects Versions: 0.9.0
            Reporter: Julius Neuffer



When reading a Parquet file containing a string column, the min and max 
values in the _RowGroup_ statistics for that column carry a trailing space 
character. The example below reproduces the behavior.
{code:Python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# create and write arrow table as parquet
df = pd.DataFrame({'string_column': ['some', 'string', 'values', 'here']})
table = pa.Table.from_pandas(df)
pq.write_table(table, 'example.parquet')

# read parquet file metadata and print string column statistics
pq_file = pq.ParquetFile(open('example.parquet', 'rb'))
print(pq_file.metadata.row_group(0).column(0).statistics.max) # yields b'values '
print(pq_file.metadata.row_group(0).column(0).statistics.min) # yields b'here '
{code}
I did not observe this problem for other data types, even though the 
statistics are always returned as strings.
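
To make the symptom easier to check, here is a minimal sketch that compares reported statistics against the true minimum and maximum of the column values. The helper name check_string_statistics is hypothetical and not part of pyarrow; the sketch assumes UTF-8 encoded values whose string ordering matches their byte ordering (true for the ASCII values used here), and the reported values are hard-coded from the output above rather than read from a file.

```python
def check_string_statistics(values, reported_min, reported_max):
    """Compare reported min/max statistics (bytes) with the true column min/max.

    Hypothetical helper, not a pyarrow API. Returns a dict mapping
    'min' and 'max' to 'ok', 'trailing whitespace', or 'mismatch'.
    """
    # Expected statistics, computed directly from the column values.
    expected = {
        'min': min(values).encode('utf-8'),
        'max': max(values).encode('utf-8'),
    }
    reported = {'min': reported_min, 'max': reported_max}

    result = {}
    for key in ('min', 'max'):
        if reported[key] == expected[key]:
            result[key] = 'ok'
        elif reported[key].rstrip() == expected[key]:
            # Differs from the expected value only by trailing whitespace.
            result[key] = 'trailing whitespace'
        else:
            result[key] = 'mismatch'
    return result


# Column values and reported statistics taken from the example above.
values = ['some', 'string', 'values', 'here']
report = check_string_statistics(values, b'here ', b'values ')
print(report)  # {'min': 'trailing whitespace', 'max': 'trailing whitespace'}
```

With real files, the two reported arguments would come from statistics.min and statistics.max as in the reproduction script.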

When reading the same file with _fastparquet_, there is no trailing space 
character, which suggests that the problem lies in the reading path of 
_pyarrow.parquet_. This may well be an issue in _parquet-cpp_, but since I 
encountered it as a _pyarrow_ user, I am reporting it here.

I'll try to investigate this further and report back here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
