Thanks for the guidance. That was my initial inclination, but I decided that consistency with the existing ‘hash’ was better. However, like you, I also prefer the specific form.
I’ve opened https://issues.apache.org/jira/browse/SPARK-27099 and submitted the patch (using ‘xxhash64’) at https://github.com/apache/spark/pull/24019. - Huon From: Reynold Xin <r...@databricks.com> Date: Thursday, 7 March 2019 at 6:33 pm To: "Wilson, Huon (Data61, Eveleigh ATP)" <huon.wil...@data61.csiro.au> Cc: "dev@spark.apache.org" <dev@spark.apache.org> Subject: Re: [SQL] hash: 64-bits and seeding Rather than calling it hash64, it'd be better to just call it xxhash64. The reason being ten years from now, we probably would look back and laugh at a specific hash implementation. It'd be better to just name the expression what it is. On Wed, Mar 06, 2019 at 7:59 PM, <huon.wil...@data61.csiro.au<mailto:huon.wil...@data61.csiro.au>> wrote: Hi, I’m working on something that requires deterministic randomness, i.e. a row gets the same “random” value no matter the order of the DataFrame. A seeded hash seems to be the perfect way to do this, but the existing hashes have various limitations: - hash: 32-bit output (only 4 billion possibilities will result in a lot of collisions for many tables: the birthday paradox implies >50% chance of at least one for tables larger than 77000 rows, and likely ~1.6 billion collisions in a table of size 4 billion) - sha1/sha2/md5: single binary column input, string output It seems there’s already support for a 64-bit hash function that can work with an arbitrary number of arbitrary-typed columns (XxHash64), and exposing this for DataFrames seems like it’s essentially one line in sql/functions.scala to match `hash` (plus docs, tests, function registry etc.): def hash64(cols: Column*): Column = withExpr { new XxHash64(cols.map(_.expr)) } For my use case, this can then be used to get a 64-bit “random” column like val seed = rng.nextLong() hash64(lit(seed), col1, col2) I’ve created a (hopefully) complete patch by mimicking ‘hash’ at https://github.com/apache/spark/compare/master...huonw:hash64; should I open a JIRA and submit it as a pull request? Additionally, both hash and the new hash64 already have support for being seeded, but this isn’t exposed directly and instead requires something like the `lit` above. Would it make sense to add overloads like the following? def hash(seed: Int, cols: Columns*) = … def hash64(seed: Long, cols: Columns*) = … Though, it does seem a bit unfortunate to be forced to pass the seed first. (I sent this email to u...@spark.apache.org<mailto:u...@spark.apache.org> a few days ago, but didn't get any discussion about the Spark aspects of this, so I'm resending it here; I apologise in advance if I'm breaking protocol!) - Huon Wilson --------------------------------------------------------------------- To unsubscribe e-mail: dev-unsubscr...@spark.apache.org<mailto:dev-unsubscr...@spark.apache.org>