Re: [Python] Dictionary Arrays with duplicate values jumbling on round-trip to parquet

2020-10-09 Thread Joris Van den Bossche
This has been reported as https://issues.apache.org/jira/browse/ARROW-10237 and has since been fixed.
Joris
On Thu, 8 Oct 2020 at 18:20, Wes McKinney wrote:
> I haven't looked closely but it looks like a bug, can someone open a
> JIRA issue and copy the reproducible example?
>

Re: [Python] Dictionary Arrays with duplicate values jumbling on round-trip to parquet

2020-10-08 Thread Wes McKinney
I haven't looked closely but it looks like a bug, can someone open a JIRA issue and copy the reproducible example?
On Thu, Oct 8, 2020 at 10:57 AM Jadczak, Matt wrote:
>
> I am unsure if this behaviour is intended (and duplicate values should be
> forbidden), but it seems to me that the reason t

Re: [Python] Dictionary Arrays with duplicate values jumbling on round-trip to parquet

2020-10-08 Thread Jadczak, Matt
I am unsure if this behaviour is intended (and duplicate values should be forbidden), but it seems to me that the reason this is happening is that when re-encoding an Arrow dictionary as a Parquet one, the function at https://github.com/apache/arrow/blob/4bbb74713c6883e8523eeeb5ac80a1e1f8521674/
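To make the suspected mechanism concrete, here is a minimal pure-Python sketch. It is not Arrow's code, and since the message above is cut off it rests on an assumption: that the re-encoding step deduplicates the dictionary values while the original indices are written unchanged. The helper names (`decode`, `reencode_without_remap`, `reencode_with_remap`) are made up for illustration.

```python
def decode(indices, dictionary):
    """Expand a dictionary-encoded column back into plain values."""
    return [dictionary[i] for i in indices]


def _dedupe(dictionary):
    """Return (unique_values, old_to_new), where old_to_new[i] is the
    position of dictionary[i] inside unique_values."""
    position = {}
    unique = []
    old_to_new = []
    for value in dictionary:
        if value not in position:
            position[value] = len(unique)
            unique.append(value)
        old_to_new.append(position[value])
    return unique, old_to_new


def reencode_without_remap(indices, dictionary):
    # Assumed buggy path: the dictionary shrinks, the indices stay as they were.
    unique, _ = _dedupe(dictionary)
    return list(indices), unique


def reencode_with_remap(indices, dictionary):
    # Safe path: every old index is translated to its slot in the new dictionary.
    unique, old_to_new = _dedupe(dictionary)
    return [old_to_new[i] for i in indices], unique


indices = [0, 1, 2, 1]
dictionary = ["a", "a", "b", "c"]  # "a" appears twice

print(decode(indices, dictionary))                             # ['a', 'a', 'b', 'a']
print(decode(*reencode_without_remap(indices, dictionary)))    # ['a', 'b', 'c', 'b']
print(decode(*reencode_with_remap(indices, dictionary)))       # ['a', 'a', 'b', 'a']
```

If that assumption holds, any dictionary with a repeated value shifts the later entries down, so stale indices point at the wrong value (or past the end of the dictionary), which would look like "jumbled" data after the round trip.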

[Python] Dictionary Arrays with duplicate values jumbling on round-trip to parquet

2020-10-08 Thread Al Taylor
Hi, I've found the following odd behaviour when round-tripping data via parquet using pyarrow, when the data contains dictionary arrays with duplicate values.
```python
import pyarrow as pa
import pyarrow.parquet as pq

my_table = pa.Table.from_batches(
    [
        pa.Recor
```