I am unsure whether this behaviour is intended (in which case duplicate values 
should be forbidden), but I think it happens because, when re-encoding an Arrow 
dictionary as a Parquet one, the function at 
https://github.com/apache/arrow/blob/4bbb74713c6883e8523eeeb5ac80a1e1f8521674/cpp/src/parquet/encoding.cc#L773
 is called to create a Parquet DictEncoder out of the Arrow dictionary data. 
This internally uses a map from value to index, which is built by repeatedly 
calling GetOrInsert on a memo table. When it is called with duplicate values, 
as in Al's example, a duplicate does not cause a new dictionary index to be 
allocated; GetOrInsert returns the existing index, and that return value is 
ignored. However, the caller assumes that the resulting Parquet dictionary 
uses exactly the same indices as the Arrow one, and copies the index data over 
directly. In Al's example, this results in an invalid dictionary index being 
written. That the index somehow wraps around when read back, rather than 
causing a crash, is potentially a second bug.
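
To make the mechanism concrete, here is a minimal pure-Python sketch of what I 
think is going on. It does not call into Arrow or Parquet at all: 
`build_memo_dictionary` and `decode_wrapping` are hypothetical helpers that 
only imitate the behaviour described above, and the modulo wrap-around in the 
reader is purely an assumption chosen because it reproduces Al's output.

```python
def build_memo_dictionary(values):
    """Imitate a GetOrInsert-style memo table: one index per distinct value."""
    memo = {}
    for value in values:
        # A duplicate returns the existing index; no new slot is allocated.
        memo.setdefault(value, len(memo))
    # dicts keep insertion order, so the keys form the deduplicated dictionary.
    return list(memo)


def decode_wrapping(dictionary, indices):
    """Decode indices, wrapping out-of-range ones (ASSUMPTION, not verified)."""
    return [dictionary[i % len(dictionary)] for i in indices]


# Al's example: the Arrow dictionary contains 'd' twice.
arrow_dictionary = ['a', 'd', 'c', 'd', 'e']
arrow_indices = [0, 1, 2, 3, 4]

# The rebuilt dictionary has only four entries.
parquet_dictionary = build_memo_dictionary(arrow_dictionary)

# Suspected bug: the Arrow indices are copied unchanged, so index 3 now
# points at 'e' and index 4 is out of range.
buggy = decode_wrapping(parquet_dictionary, arrow_indices)

# One possible fix: remap every index through the deduplicated dictionary.
position = {value: i for i, value in enumerate(parquet_dictionary)}
fixed_indices = [position[arrow_dictionary[i]] for i in arrow_indices]
fixed = [parquet_dictionary[i] for i in fixed_indices]

print(parquet_dictionary)
print(buggy)
print(fixed)
```

With the copied indices this reproduces the mismatch from Al's report, while 
the remapped indices decode back to the original values.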

On 2020/10/08 15:04:22, Al Taylor <a...@googlemail.com.INVALID> wrote:
> Hi,
>
> I've found the following odd behaviour when round-tripping data via parquet 
> using pyarrow, when the data contains dictionary arrays with duplicate 
> values.
>
> ```python
>     import pyarrow as pa
>     import pyarrow.parquet as pq
>
>     my_table = pa.Table.from_batches(
>         [
>             pa.RecordBatch.from_arrays(
>                 [
>                     pa.array([0, 1, 2, 3, 4]),
>                     pa.DictionaryArray.from_arrays(
>                         pa.array([0, 1, 2, 3, 4]),
>                         pa.array(['a', 'd', 'c', 'd', 'e'])
>                     )
>                 ],
>                 names=['foo', 'bar']
>             )
>         ]
>     )
>     my_table.validate(full=True)
>
>     pq.write_table(my_table, "foo.parquet")
>
>     read_table = pq.ParquetFile("foo.parquet").read()
>     read_table.validate(full=True)
>
>     print(my_table.column(1).to_pylist())
>     print(read_table.column(1).to_pylist())
>
>     assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist()
> ```
>
> Both tables pass full validation, yet the last three lines print:
> ```
> ['a', 'd', 'c', 'd', 'e']
> ['a', 'd', 'c', 'e', 'a']
> Traceback (most recent call last):
>   File "/home/ataylor/projects/dsg-python-dtcc-equity-kinetics/dsg/example.py", line 29, in <module>
>     assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist()
> AssertionError
>
> ```
>
> Which clearly doesn't look right!
>
> My question is whether I'm fundamentally breaking some assumption that 
> dictionary values are unique or if there's a bug in the parquet-arrow 
> conversion?
>
> Thanks,
>
> Al
>
