[ https://issues.apache.org/jira/browse/ARROW-11057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17258168#comment-17258168 ]
Joris Van den Bossche commented on ARROW-11057:
-----------------------------------------------

You can indeed also see in pyarrow that the only difference between {{a}} and {{b}} is the field-level metadata in the schema:

{code}
In [20]: a.equals(b)
Out[20]: True

In [21]: a.equals(b, check_metadata=True)
Out[21]: False

In [22]: a.schema
Out[22]:
x: int64
y: int64
z: int64

In [23]: b.schema
Out[23]:
x: int64
  -- field metadata --
  PARQUET:field_id: '1'
y: int64
  -- field metadata --
  PARQUET:field_id: '2'
z: int64
  -- field metadata --
  PARQUET:field_id: '3'
{code}

> [Python] Data inconsistency with read and write
> -----------------------------------------------
>
>                 Key: ARROW-11057
>                 URL: https://issues.apache.org/jira/browse/ARROW-11057
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0
>            Reporter: David Quijano
>            Priority: Major
>
> I have been reading and writing some tables to parquet and I found some
> inconsistencies.
> {code:java}
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> # create a table with some data
> a = pa.Table.from_pydict({'x': [1]*100, 'y': [2]*100, 'z': [3]*100})
> # write it to file
> pq.write_table(a, 'test.parquet')
> # read the same file
> b = pq.read_table('test.parquet')
> # a == b is True, that's good
> # write table b to file
> pq.write_table(b, 'test2.parquet')
> # test.parquet is different from test2.parquet{code}
> Basically it is:
>  * Create table in memory
>  * Write it to file
>  * Read it again
>  * Write it to a different file
> The files are not the same. The second one contains extra information.
> The differences are consistent across different compressions (I tried snappy
> and zstd).
> Also, reading the second file and writing it again produces the same
> file.
> Is this a bug or expected behavior?
>
>
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)