Julius Neuffer created ARROW-2503:
-------------------------------------

             Summary: Trailing space character in RowGroup statistics of pyarrow.parquet.ParquetFile
                 Key: ARROW-2503
                 URL: https://issues.apache.org/jira/browse/ARROW-2503
             Project: Apache Arrow
          Issue Type: Bug
    Affects Versions: 0.9.0
            Reporter: Julius Neuffer

When reading a parquet file containing a string column, the _RowGroup_ statistics contain a trailing space character for the string column. The example below shows the behavior.

{code:Python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# create and write arrow table as parquet
df = pd.DataFrame({'string_column': ['some', 'string', 'values', 'here']})
table = pa.Table.from_pandas(df)
pq.write_table(table, 'example.parquet')

# read parquet file metadata and print string column statistics
pq_file = pq.ParquetFile(open('example.parquet', 'rb'))
print(pq_file.metadata.row_group(0).column(0).statistics.max)  # yields b'values '
print(pq_file.metadata.row_group(0).column(0).statistics.min)  # yields b'here '
{code}

For other data types I did not observe this problem, even though the statistics are always strings. When reading the same file with _fastparquet_, there is no trailing space character, which implies that this problem occurs in the reading path of _pyarrow.parquet_.

I am aware that this might well be an issue with _parquet-cpp_, but as I face this bug as a _pyarrow_ user, I report it here. I'll try to investigate this further and report back here.
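
To pin down exactly which bytes are appended, the reported statistics can be compared against the minimum and maximum computed directly from the values that were written. The following is only a minimal sketch using the standard library: `extra_suffix` is a hypothetical helper (not part of pyarrow), and the two "reported" byte strings are copied from the output quoted in this report rather than read from a file. In a live session they would come from `statistics.min` and `statistics.max`.

```python
def extra_suffix(reported: bytes, expected: bytes) -> bytes:
    """Return the bytes that `reported` carries beyond `expected`.

    Hypothetical helper for illustration. Returns b'' when both values are
    equal and raises ValueError when `reported` does not start with `expected`.
    """
    if not reported.startswith(expected):
        raise ValueError(f"{reported!r} does not start with {expected!r}")
    return reported[len(expected):]


# Values written to the string column in the example from this report.
values = ['some', 'string', 'values', 'here']
expected_min = min(values).encode('utf-8')
expected_max = max(values).encode('utf-8')

# Statistics as printed in this report (assumption: copied by hand,
# not read back from 'example.parquet').
reported_min = b'here '
reported_max = b'values '

print(extra_suffix(reported_min, expected_min))
print(extra_suffix(reported_max, expected_max))
```

A non-empty result for both values shows that the statistic matches the true minimum or maximum apart from the appended suffix. Stripping whitespace from the statistics is not a safe general workaround, because a column may legitimately contain values that end in a space.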