[ https://issues.apache.org/jira/browse/ARROW-11069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche updated ARROW-11069:
------------------------------------------
    Description: 
When writing a dict column using pyarrow, incorrect data is written.

{code:python}
import pandas as pd
orig = pd.read_parquet("original.parquet")
orig.to_parquet("first_write.parquet")
first_write = pd.read_parquet("first_write.parquet")
print(orig.equals(first_write))
{code}

The incorrect results start appearing after index 1024. first_write.parquet was created by reading original.parquet and then writing it again. I don't see any obvious pattern in the shuffled rows.

!image-2020-12-30-01-20-45-183.png!
Original records

!image-2020-12-30-01-19-20-491.png!
Written records

  was:
When writing a dict column using pyarrow, incorrect data is written.

{code:python}
import pandas as pd
orig = pd.read_parquet("original.parquet")
df.to_parquet("first_write.parquet")
first_write = pd.read_parquet("first_write.parquet")
print(orig.equals(first_write))
{code}

The incorrect results start appearing after index 1024. first_write.parquet was created by reading original.parquet and then writing it again. I don't see any obvious pattern in the shuffled rows.

!image-2020-12-30-01-20-45-183.png!
Original records

!image-2020-12-30-01-19-20-491.png!
Written records


> Parquet writer incorrect data being written when data type is dictionary
> ------------------------------------------------------------------------
>
>                 Key: ARROW-11069
>                 URL: https://issues.apache.org/jira/browse/ARROW-11069
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0
>         Environment: pandas v1.0.4
>            Reporter: Palash Goel
>            Priority: Major
>         Attachments: first_write.parquet, image-2020-12-30-01-19-20-491.png,
>                      image-2020-12-30-01-19-42-739.png, image-2020-12-30-01-20-45-183.png,
>                      original.parquet
>
>
> When writing a dict column using pyarrow, incorrect data is written.
>
> {code:python}
> import pandas as pd
> orig = pd.read_parquet("original.parquet")
> orig.to_parquet("first_write.parquet")
> first_write = pd.read_parquet("first_write.parquet")
> print(orig.equals(first_write))
> {code}
>
> The incorrect results start appearing after index 1024. first_write.parquet
> was created by reading original.parquet and then writing it again. I don't
> see any obvious pattern in the shuffled rows.
> !image-2020-12-30-01-20-45-183.png!
> Original records
> !image-2020-12-30-01-19-20-491.png!
> Written records
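
The report says the mismatches begin after index 1024 and look like shuffled rows. As a rough sketch for checking both observations, the helpers below locate the first differing row and test whether the rewritten data is only a reordering of the original. The names `first_mismatch` and `is_permutation` are hypothetical and not part of pandas or pyarrow, and the sample data is synthetic rather than taken from the attached files.

```python
from collections import Counter
from itertools import zip_longest

# Sentinel marking a row that exists in only one of the two inputs.
_MISSING = object()


def first_mismatch(original, rewritten):
    """Return the index of the first differing row, or None if all rows match.

    Rows are compared with ``!=``, so rows containing NaN will be
    reported as different even when both sides hold NaN.
    """
    pairs = zip_longest(original, rewritten, fillvalue=_MISSING)
    for i, (a, b) in enumerate(pairs):
        if a is _MISSING or b is _MISSING or a != b:
            return i
    return None


def is_permutation(original, rewritten):
    """Return True if both inputs hold the same rows in any order.

    Rows must be hashable, for example tuples of scalars.
    """
    return Counter(original) == Counter(rewritten)


# Synthetic stand-in for the two files: rows match up to index 1023,
# then the remaining rows appear in a different order.
original = [(i, f"v{i % 3}") for i in range(2000)]
rewritten = original[:1024] + original[1024:][::-1]

print(first_mismatch(original, rewritten))   # 1024
print(is_permutation(original, rewritten))   # True
```

With the frames from the report, the rows could presumably be obtained via `list(orig.itertuples(index=False, name=None))` and the same call on `first_write`. If `is_permutation` returns True, the values survived the round trip and only their order changed.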