Re: [SQL] hash: 64-bits and seeding

Huon.Wilson Thu, 07 Mar 2019 19:16:07 -0800

Thanks for the guidance. That was my initial inclination, but I decided that 
consistency with the existing ‘hash’ was better. However, like you, I also 
prefer the specific form.

I’ve opened https://issues.apache.org/jira/browse/SPARK-27099 and submitted the 
patch (using ‘xxhash64’) at https://github.com/apache/spark/pull/24019.

- Huon

From: Reynold Xin <r...@databricks.com>
Date: Thursday, 7 March 2019 at 6:33 pm
To: "Wilson, Huon (Data61, Eveleigh ATP)" <huon.wil...@data61.csiro.au>
Cc: "dev@spark.apache.org" <dev@spark.apache.org>
Subject: Re: [SQL] hash: 64-bits and seeding

Rather than calling it hash64, it'd be better to just call it xxhash64. The 
reason is that ten years from now, we would probably look back and laugh at a 
specific hash implementation. It'd be better to just name the expression what 
it is.

On Wed, Mar 06, 2019 at 7:59 PM, <huon.wil...@data61.csiro.au> wrote:

Hi,

I’m working on something that requires deterministic randomness, i.e. a row 
gets the same “random” value no matter the order of the DataFrame. A seeded 
hash seems to be the perfect way to do this, but the existing hashes have 
various limitations:

- hash: 32-bit output (only ~4 billion possible values, which results in a lot 
of collisions for many tables: the birthday paradox implies a >50% chance of at 
least one collision for tables larger than about 77,000 rows, and likely ~1.6 
billion collisions in a table of 4 billion rows)
- sha1/sha2/md5: single binary column input, string output
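
The collision figures above can be checked with a short calculation. The following is a minimal sketch using the usual birthday-bound approximations; the object and method names are made up for illustration, and "4 billion rows" is taken here to mean 2^32 rows.

```scala
object BirthdayBound {
  // Number of distinct outputs of a hash with the given bit width.
  def outputs(bits: Int): Double = math.pow(2.0, bits)

  // Approximate row count at which the chance of at least one collision
  // reaches 50%: n ~= sqrt(2 * ln(2) * N), where N is the number of outputs.
  def rowsForEvenOdds(bits: Int): Double =
    math.sqrt(2.0 * math.log(2.0) * outputs(bits))

  // Expected number of rows whose hash repeats an earlier row's hash:
  // rows - N * (1 - exp(-rows / N)). expm1 keeps precision when rows << N.
  def expectedCollisions(rows: Double, bits: Int): Double = {
    val n = outputs(bits)
    rows + n * math.expm1(-rows / n)
  }

  def main(args: Array[String]): Unit = {
    val rows = outputs(32) // assumption: a table with 2^32 (about 4.3 billion) rows
    println(f"32-bit, 50%% threshold: ${rowsForEvenOdds(32)}%.0f rows")
    println(f"32-bit, expected collisions: ${expectedCollisions(rows, 32)}%.3e")
    println(f"64-bit, 50%% threshold: ${rowsForEvenOdds(64)}%.3e rows")
    println(f"64-bit, expected collisions: ${expectedCollisions(rows, 64)}%.3f")
  }
}
```

Under these approximations the 32-bit threshold comes out near 77,000 rows, whereas a 64-bit hash only reaches even odds at roughly 5 billion rows and is expected to produce fewer than one collision across the same 2^32-row table.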

It seems there’s already support for a 64-bit hash function that can work with 
an arbitrary number of arbitrary-typed columns (XxHash64), and exposing this 
for DataFrames seems like it’s essentially one line in sql/functions.scala to 
match `hash` (plus docs, tests, function registry etc.):

def hash64(cols: Column*): Column = withExpr { new XxHash64(cols.map(_.expr)) }

For my use case, this can then be used to get a 64-bit “random” column like

val seed = rng.nextLong()
hash64(lit(seed), col1, col2)
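
To illustrate the idea outside Spark, here is a minimal plain-Scala sketch of "deterministic randomness": each row's value depends only on the seed and the row's contents, not on its position. The `Row` case class and `pseudoRandom` helper are hypothetical, and the standard library's 32-bit MurmurHash3 is used purely as a stand-in for XxHash64.

```scala
import scala.util.hashing.MurmurHash3

object DeterministicRandom {
  // Hypothetical row type standing in for two DataFrame columns.
  final case class Row(col1: String, col2: Int)

  // Seeded hash of the row's contents. MurmurHash3 (32-bit) is only a
  // stand-in for Spark's XxHash64; the principle is the same.
  def pseudoRandom(seed: Int, row: Row): Int =
    MurmurHash3.orderedHash(Seq[Any](row.col1, row.col2), seed)

  def main(args: Array[String]): Unit = {
    val seed = 12345
    val rows = Seq(Row("a", 1), Row("b", 2), Row("c", 3))

    // Compute the value per row for two different orderings of the data.
    val inOrder  = rows.map(r => r -> pseudoRandom(seed, r)).toMap
    val reversed = rows.reverse.map(r => r -> pseudoRandom(seed, r)).toMap

    // Each row maps to the same value regardless of where it sits.
    assert(inOrder == reversed)
    inOrder.foreach { case (row, value) => println(s"$row -> $value") }
  }
}
```

Changing the seed gives a different assignment of values, while re-running with the same seed reproduces it exactly.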

I’ve created a (hopefully) complete patch by mimicking ‘hash’ at 
https://github.com/apache/spark/compare/master...huonw:hash64; should I open a 
JIRA and submit it as a pull request?

Additionally, both hash and the new hash64 already have support for being 
seeded, but this isn’t exposed directly and instead requires something like the 
`lit` above. Would it make sense to add overloads like the following?

def hash(seed: Int, cols: Column*) = …
def hash64(seed: Long, cols: Column*) = …

Though, it does seem a bit unfortunate to be forced to pass the seed first.
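
For illustration only, seed-first helpers of this shape can be sketched in user code on top of the public API. This assumes Spark is on the classpath and that its `functions` object exposes `xxhash64` (the name used for the submitted patch); the `SeededHashes` object and both helper names are hypothetical and not part of Spark.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{hash, lit, xxhash64}

object SeededHashes {
  // Hypothetical helpers, not part of Spark. They only prepend the seed as a
  // literal column, i.e. the `lit` workaround, so the seed is hashed as an
  // ordinary input value rather than set as the expression's own seed.
  def seededHash(seed: Int, cols: Column*): Column =
    hash(lit(seed) +: cols: _*)

  def seededXxHash64(seed: Long, cols: Column*): Column =
    xxhash64(lit(seed) +: cols: _*)
}
```

A call would then look like `SeededHashes.seededXxHash64(seed, col1, col2)`, with the seed still forced into the first position because of the varargs parameter.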

(I sent this email to u...@spark.apache.org a few days ago, but didn't get any 
discussion about the Spark aspects of this, so I'm resending it here; I 
apologise in advance if I'm breaking protocol!)

- Huon Wilson

