[ https://issues.apache.org/jira/browse/ARROW-4501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rok Mihevc updated ARROW-4501:
------------------------------
    External issue URL: https://github.com/apache/arrow/issues/21053

> [C++] Unique returns non-unique strings
> ---------------------------------------
>
>                 Key: ARROW-4501
>                 URL: https://issues.apache.org/jira/browse/ARROW-4501
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Ilya Tokar
>            Assignee: Ilya Tokar
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.12.1, 0.13.0
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Calling Unique on, e.g., {"some long string data", "some long string data",
> "other data"} returns a dictionary in which "some long string data" appears
> twice. This is caused by an off-by-one error in DoubleCrcHash, which makes it
> read 1 byte past the end of any string whose length is greater than 16 and
> not divisible by 4. In such cases, p[0] is never hashed and one extra byte is
> always read.
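
The description above does not spell out why a one-byte over-read yields duplicate dictionary entries. One plausible mechanism, sketched below under stated assumptions: if the hashed window is shifted by one byte, the hash of a string depends on whatever byte follows it in memory, so two equal strings with different neighbours can hash differently and be treated as distinct. This is not Arrow's DoubleCrcHash. FNV-1a is used as a stand-in, and `BuggyHash`, `FixedHash`, and `RunDemo` are hypothetical names. The sketch assumes the three strings sit back-to-back in one buffer, as the values of an Arrow string array do, so the extra byte stays inside the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// FNV-1a over bytes [begin, end) of data; a stand-in for the real hash.
inline uint64_t Fnv1a(const std::string& data, size_t begin, size_t end) {
  assert(begin <= end && end <= data.size());
  uint64_t h = 0xcbf29ce484222325ULL;  // FNV-1a 64-bit offset basis
  for (size_t i = begin; i < end; ++i) {
    h ^= static_cast<unsigned char>(data[i]);
    h *= 0x100000001b3ULL;  // FNV-1a 64-bit prime
  }
  return h;
}

// Hypothetical buggy hash: for lengths > 16 that are not divisible by 4,
// the window is shifted by one byte, so p[0] is skipped and p[len] is read.
inline uint64_t BuggyHash(const std::string& data, size_t offset, size_t len) {
  if (len > 16 && len % 4 != 0) {
    return Fnv1a(data, offset + 1, offset + len + 1);
  }
  return Fnv1a(data, offset, offset + len);
}

// Hypothetical fixed hash: always covers exactly [offset, offset + len).
inline uint64_t FixedHash(const std::string& data, size_t offset, size_t len) {
  return Fnv1a(data, offset, offset + len);
}

struct Demo {
  bool buggy_hashes_equal;
  bool fixed_hashes_equal;
};

// Hashes the two equal strings from the report with both variants.
inline Demo RunDemo() {
  const std::string s = "some long string data";  // 21 bytes: > 16, 21 % 4 != 0
  const std::string data = s + s + "other data";  // contiguous value buffer
  const size_t len = s.size();
  return {BuggyHash(data, 0, len) == BuggyHash(data, len, len),
          FixedHash(data, 0, len) == FixedHash(data, len, len)};
}
```

In this sketch the byte following the first copy is 's' and the byte following the second copy is 'o', so the shifted window gives the two equal strings different hashes, while the fixed window gives them the same hash.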