Ilya Tokar created ARROW-4501:
---------------------------------

             Summary: [C++] Unique returns non-unique strings
                 Key: ARROW-4501
                 URL: https://issues.apache.org/jira/browse/ARROW-4501
             Project: Apache Arrow
          Issue Type: Bug
            Reporter: Ilya Tokar

Calling Unique on, e.g., {"some long string data", "some long string data", "other data"} returns a dictionary in which "some long string data" appears twice.

This is caused by an off-by-one error in DoubleCrcHash, which makes it read one byte past the end of strings whose length is greater than 16 and not divisible by 4. In such cases, we never hash p[0], and we always read one extra byte.
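To illustrate why a shifted tail loop yields duplicate dictionary entries, here is a minimal sketch. It is not the actual DoubleCrcHash code: `ToyHash`, `Mix`, `WithNeighbor` and `SameStringHashesEqual` are hypothetical names, the mixing step is a simple FNV-style multiply standing in for CRC, and the "length greater than 16" condition is not modeled. Each copy of the string is stored in a buffer with one extra trailing byte, so the out-of-range read stays inside the buffer and the demo itself has defined behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Toy mixing step (FNV-1a style). Assumption: stands in for the CRC step.
inline uint32_t Mix(uint32_t h, uint32_t v) { return (h ^ v) * 16777619u; }

// Hashes 4-byte words first, then the 0-3 leftover bytes.
// With shifted_tail == true the tail loop is off by one: it skips p[0]
// and reads p[len], one byte past the end of the string.
uint32_t ToyHash(const unsigned char* p, size_t len, bool shifted_tail) {
  uint32_t h = 2166136261u;
  while (len >= 4) {
    uint32_t word;
    std::memcpy(&word, p, 4);
    h = Mix(h, word);
    p += 4;
    len -= 4;
  }
  const size_t first = shifted_tail ? 1 : 0;
  for (size_t i = first; i < len + first; ++i) h = Mix(h, p[i]);
  return h;
}

// Copies `s` into a buffer followed by one extra byte `next`, mimicking
// whatever happens to sit after the string in memory.
std::vector<unsigned char> WithNeighbor(const std::string& s,
                                        unsigned char next) {
  std::vector<unsigned char> buf(s.begin(), s.end());
  buf.push_back(next);
  return buf;
}

// Hashes the same 21-byte string twice, each copy followed by a different
// neighboring byte, and reports whether the two hashes agree.
bool SameStringHashesEqual(bool shifted_tail) {
  const std::string s = "some long string data";  // 21 bytes, 21 % 4 == 1
  const std::vector<unsigned char> a = WithNeighbor(s, 'A');
  const std::vector<unsigned char> b = WithNeighbor(s, 'B');
  return ToyHash(a.data(), s.size(), shifted_tail) ==
         ToyHash(b.data(), s.size(), shifted_tail);
}
```

In this sketch, `SameStringHashesEqual(true)` returns false, because the hash depends on the byte that follows each copy, while `SameStringHashesEqual(false)` returns true. A hash table that sees two different hashes for equal strings will store both, which matches the duplicate entry described above.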