[ https://issues.apache.org/jira/browse/SPARK-45900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17938631#comment-17938631 ]
Dmitry Kravchuk commented on SPARK-45900: ----------------------------------------- We also need this feature) > Expand hash functionalities from to include XXH3 > ------------------------------------------------ > > Key: SPARK-45900 > URL: https://issues.apache.org/jira/browse/SPARK-45900 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.5.0 > Reporter: Nathan Holland > Priority: Major > > I often work in projects that require deterministic randomness, especially > when creating surrogate keys. For small volumes of data xxhash64 works well > however this functionality doesn't scale well - with a 64-bit hash code, the > chance of collision is one in a million when you hash just six million items > increasing sharply due to the birthday paradox. > Currently there are a few ways to handle this > - hash: 32-bit output (>50% chance of collision at least one for tables > larger than 77000 rows) > - xxhash64: 64-bit output (>50% chance of collision at least one for tables > larger than 5 billion rows) > - shaXXX/md5: single binary column input, string output, quite > computationally expensive. > I'd suggest adding the newest algorithm in the xxhash64 family, XXH3. The > XXH3 family is a modern 64-bit and 128-bit hash function family that provides > improved strength and performance across the board. > I'd imagine this would be a new function named xxhash3 and take 64 bit, and > 128bit bit lengths. For usability I believe the bit length should default to > 128bits to provide the best experience to reduce accidental collisions and > leave users to set the bit length to 64 as an override if they need to for > additional performance or interop reasons. (given the benchmarks, this would > likely be quite rare) > References: > * [Documentation|https://xxhash.com/] > * xxHash64 Ticket (https://issues.apache.org/jira/browse/SPARK-27099) > * [Existing xxHash64 > logic|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/XXH64.java] -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org