This has been reported as https://issues.apache.org/jira/browse/ARROW-10237
and has since been fixed.

Joris

On Thu, 8 Oct 2020 at 18:20, Wes McKinney <wesmck...@gmail.com> wrote:

> I haven't looked closely, but it looks like a bug. Can someone open a
> JIRA issue and copy the reproducible example?
>
> On Thu, Oct 8, 2020 at 10:57 AM Jadczak, Matt
> <matt.jadc...@gsacapital.com> wrote:
> >
> > I am unsure if this behaviour is intended (and duplicate values should
> > be forbidden), but it seems to me that the reason this is happening is
> > that when re-encoding an Arrow dictionary as a Parquet one, the function at
> > https://github.com/apache/arrow/blob/4bbb74713c6883e8523eeeb5ac80a1e1f8521674/cpp/src/parquet/encoding.cc#L773
> > is called to create a Parquet DictEncoder out of the Arrow dictionary data.
> > This internally uses a map from value to index, and this map is constructed
> > by continually calling GetOrInsert on a memo table. When called with
> > duplicate values as in Al's example, the duplicates do not cause a new
> > dictionary index to be allocated, but instead return the existing one
> > (which is just ignored). However, the caller assumes that the resulting
> > Parquet dictionary uses the exact same indices as the Arrow one, and
> > proceeds to just copy the index data directly. In Al's example, this
> > results in an invalid dictionary index being written (that it is somehow
> > wrapped around when reading again, rather than crashing, is potentially a
> > second bug).
> >
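The mechanism described above can be sketched in plain Python. This is only an illustration under stated assumptions: `MemoTable` and `get_or_insert` are hypothetical stand-ins for the C++ memo table, lists stand in for Arrow arrays, and the modulo wrap-around on read is a guess that happens to match the output in Al's report, not a confirmed description of the reader.

```python
class MemoTable:
    """Hypothetical stand-in for the value -> index memo table."""

    def __init__(self):
        self._index_of = {}
        self.values = []

    def get_or_insert(self, value):
        # Allocate a new index only for values not seen before;
        # duplicates get the index of the first occurrence.
        if value not in self._index_of:
            self._index_of[value] = len(self.values)
            self.values.append(value)
        return self._index_of[value]


# Data from Al's example: 'd' appears twice in the dictionary.
arrow_dictionary = ['a', 'd', 'c', 'd', 'e']
arrow_indices = [0, 1, 2, 3, 4]

memo = MemoTable()
for value in arrow_dictionary:
    memo.get_or_insert(value)  # returned index is ignored, as described

parquet_dictionary = memo.values      # one entry shorter than the Arrow one
copied_indices = list(arrow_indices)  # indices copied without remapping

# Indices that no longer point inside the shorter dictionary.
invalid = [i for i in copied_indices if i >= len(parquet_dictionary)]

# ASSUMPTION: out-of-range indices wrap around (modulo) when read back.
decoded = [parquet_dictionary[i % len(parquet_dictionary)]
           for i in copied_indices]

print(parquet_dictionary)
print(invalid)
print(decoded)
```

Under these assumptions the deduplicated dictionary has four entries, so index 3 now resolves to 'e' instead of 'd' and index 4 falls outside the dictionary.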
> > On 2020/10/08 15:04:22, Al Taylor <a...@googlemail.com.INVALID> wrote:
> > > Hi,
> > >
> > > I've found the following odd behaviour when round-tripping data via
> > > parquet using pyarrow, when the data contains dictionary arrays with
> > > duplicate values.
> > >
> > > ```python
> > > import pyarrow as pa
> > > import pyarrow.parquet as pq
> > >
> > > # Column 'bar' is a dictionary array whose dictionary holds 'd' twice.
> > > my_table = pa.Table.from_batches(
> > >     [
> > >         pa.RecordBatch.from_arrays(
> > >             [
> > >                 pa.array([0, 1, 2, 3, 4]),
> > >                 pa.DictionaryArray.from_arrays(
> > >                     pa.array([0, 1, 2, 3, 4]),
> > >                     pa.array(['a', 'd', 'c', 'd', 'e'])
> > >                 )
> > >             ],
> > >             names=['foo', 'bar']
> > >         )
> > >     ]
> > > )
> > > my_table.validate(full=True)
> > >
> > > # Round-trip the table through a Parquet file.
> > > pq.write_table(my_table, "foo.parquet")
> > >
> > > read_table = pq.ParquetFile("foo.parquet").read()
> > > read_table.validate(full=True)
> > >
> > > # Compare the decoded 'bar' column before and after the round trip.
> > > print(my_table.column(1).to_pylist())
> > > print(read_table.column(1).to_pylist())
> > >
> > > assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist()
> > > ```
> > >
> > > Both tables pass full validation, yet the last three lines print:
> > > ```
> > > ['a', 'd', 'c', 'd', 'e']
> > > ['a', 'd', 'c', 'e', 'a']
> > > Traceback (most recent call last):
> > >   File "/home/ataylor/projects/dsg-python-dtcc-equity-kinetics/dsg/example.py", line 29, in <module>
> > >     assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist()
> > > AssertionError
> > > ```
> > >
> > > Which clearly doesn't look right!
> > >
> > > My question is whether I'm fundamentally breaking some assumption that
> > > dictionary values are unique, or whether there's a bug in the
> > > parquet-arrow conversion?
> > >
> > > Thanks,
> > >
> > > Al
> > >
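For anyone hitting this on an affected version, one possible way to sidestep the problem would be to make the dictionary values unique and remap the indices before writing. The sketch below shows only the remapping logic on plain Python lists; `deduplicate_dictionary` is a hypothetical helper, not a pyarrow function, and this is not a description of how ARROW-10237 was actually fixed.

```python
def deduplicate_dictionary(indices, dictionary):
    """Return (new_indices, new_dictionary) with unique dictionary values.

    Hypothetical helper: plain lists stand in for Arrow arrays, and
    None in `indices` is treated as a null entry and passed through.
    """
    position = {}        # value -> index in the deduplicated dictionary
    new_dictionary = []
    remap = []           # old dictionary index -> new dictionary index
    for value in dictionary:
        if value not in position:
            position[value] = len(new_dictionary)
            new_dictionary.append(value)
        remap.append(position[value])
    new_indices = [None if i is None else remap[i] for i in indices]
    return new_indices, new_dictionary


# Data from the example above.
indices = [0, 1, 2, 3, 4]
dictionary = ['a', 'd', 'c', 'd', 'e']

new_indices, new_dictionary = deduplicate_dictionary(indices, dictionary)
decoded = [new_dictionary[i] for i in new_indices]

print(new_indices)
print(new_dictionary)
print(decoded)
```

The decoded values are unchanged by the remapping, while every index now points inside the shorter, duplicate-free dictionary.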
