I haven't looked closely, but it looks like a bug. Can someone open a JIRA issue and copy the reproducible example into it?
On Thu, Oct 8, 2020 at 10:57 AM Jadczak, Matt <matt.jadc...@gsacapital.com> wrote:
>
> I am unsure if this behaviour is intended (and duplicate values should be
> forbidden), but it seems to me that the reason this is happening is that when
> re-encoding an Arrow dictionary as a Parquet one, the function at
> https://github.com/apache/arrow/blob/4bbb74713c6883e8523eeeb5ac80a1e1f8521674/cpp/src/parquet/encoding.cc#L773
> is called to create a Parquet DictEncoder out of the Arrow dictionary data.
> This internally uses a map from value to index, and this map is constructed
> by continually calling GetOrInsert on a memo table. When called with
> duplicate values as in Al's example, the duplicates do not cause a new
> dictionary index to be allocated, but instead return the existing one (which
> is just ignored). However, the caller assumes that the resulting Parquet
> dictionary uses the exact same indices as the Arrow one, and proceeds to just
> copy the index data directly. In Al's example, this results in an invalid
> dictionary index being written (that it is somehow wrapped around when
> reading again, rather than crashing, is potentially a second bug).
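To make the mechanism Matt describes concrete, here is a minimal pure-Python sketch of it. It is a model only: it does not call pyarrow or the Parquet C++ code, and the helper `build_memo_dictionary` and all variable names are hypothetical stand-ins for the memo table and GetOrInsert behaviour described above. The modulo in the "naive" path is there purely to mimic the wrap-around seen in Al's output; it is an assumption about the symptom, not a claim about how the reader actually handles the bad index.

```python
# Pure-Python model of the failure mode; no pyarrow or Parquet code is involved.

def build_memo_dictionary(values):
    """Mimic repeated get-or-insert calls: each distinct value gets one index."""
    memo = {}      # value -> index in the deduplicated dictionary
    unique = []    # deduplicated dictionary, in first-seen order
    for value in values:
        if value not in memo:
            memo[value] = len(unique)
            unique.append(value)
    return unique, memo


# The Arrow-side dictionary and indices from Al's example.
arrow_dictionary = ["a", "d", "c", "d", "e"]
arrow_indices = [0, 1, 2, 3, 4]

parquet_dictionary, memo = build_memo_dictionary(arrow_dictionary)

# Copying the indices unchanged leaves any index >= len(parquet_dictionary)
# pointing outside the deduplicated dictionary.
out_of_range = [i for i in arrow_indices if i >= len(parquet_dictionary)]

# Assumption: the modulo only imitates the wrap-around observed on read.
naive_decoded = [
    parquet_dictionary[i % len(parquet_dictionary)] for i in arrow_indices
]

# One possible fix in this model: translate each old index through the memo.
old_to_new = [memo[value] for value in arrow_dictionary]
remapped_indices = [old_to_new[i] for i in arrow_indices]
remapped_decoded = [parquet_dictionary[i] for i in remapped_indices]

print(parquet_dictionary)
print(out_of_range)
print(naive_decoded)
print(remapped_decoded)
```

In this model the unchanged indices reproduce the corrupted column from Al's report, while remapping through the memo recovers the original values. Whether the real fix should remap indices or reject duplicate dictionary values is a separate question for the JIRA issue.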
>
> On 2020/10/08 15:04:22, Al Taylor <a...@googlemail.com.INVALID> wrote:
> > Hi,
> >
> > I've found the following odd behaviour when round-tripping data via parquet
> > using pyarrow, when the data contains dictionary arrays with duplicate
> > values.
> >
> > ```python
> > import pyarrow as pa
> > import pyarrow.parquet as pq
> >
> > my_table = pa.Table.from_batches(
> >     [
> >         pa.RecordBatch.from_arrays(
> >             [
> >                 pa.array([0, 1, 2, 3, 4]),
> >                 pa.DictionaryArray.from_arrays(
> >                     pa.array([0, 1, 2, 3, 4]),
> >                     pa.array(['a', 'd', 'c', 'd', 'e'])
> >                 )
> >             ],
> >             names=['foo', 'bar']
> >         )
> >     ]
> > )
> > my_table.validate(full=True)
> >
> > pq.write_table(my_table, "foo.parquet")
> >
> > read_table = pq.ParquetFile("foo.parquet").read()
> > read_table.validate(full=True)
> >
> > print(my_table.column(1).to_pylist())
> > print(read_table.column(1).to_pylist())
> >
> > assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist()
> > ```
> >
> > Both tables pass full validation, yet the last three lines print:
> > ```
> > ['a', 'd', 'c', 'd', 'e']
> > ['a', 'd', 'c', 'e', 'a']
> > Traceback (most recent call last):
> >   File "/home/ataylor/projects/dsg-python-dtcc-equity-kinetics/dsg/example.py", line 29, in <module>
> >     assert my_table.column(1).to_pylist() == read_table.column(1).to_pylist()
> > AssertionError
> >
> > ```
> >
> > Which clearly doesn't look right!
> >
> > My question is whether I'm fundamentally breaking some assumption that
> > dictionary values are unique or if there's a bug in the parquet-arrow
> > conversion?
> >
> > Thanks,
> >
> > Al