Hi Murali,

As you can see from the source code of Hive's hash UDF:

https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFHash.java
https://github.com/apache/hive/blob/trunk/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorUtils.java

it basically relies on Java's hashCode() method.

In Java, hashCode() is mainly designed for hash tables such as HashMap: it
is very cheap to compute, but it makes little effort to avoid collisions.

For instance, in Java, Long.valueOf(100L).hashCode() == 100.

Moreover, the hashCode() method (and therefore Hive's hash) returns an int.

So if you hash more than 2^32 distinct bigints into a 32-bit int, you'll end
up with... a lot of collisions (the pigeonhole principle guarantees them),
and with Long's hashCode they actually show up much earlier than that.
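
To see it concretely, here is a small stand-alone snippet (plain Java,
nothing Hive-specific) showing both the identity-like behaviour on small
values and an actual collision between two distinct bigint values:

    public class LongHashDemo {
        public static void main(String[] args) {
            // Long's hashCode() is (int) (value ^ (value >>> 32)).
            System.out.println(Long.valueOf(100L).hashCode());        // 100
            // Two distinct longs whose hashes collide:
            System.out.println(Long.valueOf(0L).hashCode());          // 0
            System.out.println(Long.valueOf(4294967297L).hashCode()); // 0, since 4294967297 = 2^32 + 1
        }
    }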

I suggest writing your own UDF that directly computes the hash in hex
format, and perhaps using another hash function, e.g. MD5.
It is pretty easy to find MD5 UDF implementations to start from,
e.g.: https://gist.github.com/dataminelab/1050002

No guarantee of quality, however: I do not use that one myself, so I don't
know whether it works correctly.
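
For illustration, here is a minimal sketch of what such a UDF could look
like. The class name is mine, it is untested, and it assumes commons-codec's
DigestUtils is available on the classpath (Hive ships it):

    import org.apache.commons.codec.digest.DigestUtils;
    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;

    // Sketch only: hashes a bigint to its 32-character MD5 hex digest.
    public final class Md5HexUDF extends UDF {
        public Text evaluate(LongWritable acctNum) {
            if (acctNum == null) {
                return null;
            }
            // MD5 is 128 bits, so accidental collisions are no longer a
            // practical concern at bigint cardinalities.
            return new Text(DigestUtils.md5Hex(String.valueOf(acctNum.get())));
        }
    }

After an ADD JAR, you would register it with something like
CREATE TEMPORARY FUNCTION md5hex AS 'Md5HexUDF'; and call md5hex(acct_num)
in your load query.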

2015-01-21 22:38 GMT+01:00 murali parimi <muralikrishna.par...@icloud.com>:

> Hello team,
>
> We are extracting data from Netezza and loading it into Hive tables. While
> loading the data, we are using the hash function to mask a few PII columns
> for security reasons.
>
> One such column is acct_num, stored as bigint in Netezza, which we are
> storing in a string column after converting the hash of that acct_num to
> hex format.
>
> Now the issue is that we found the same value is generated for distinct
> acct_num values in most of the records. Are there any known issues with the
> algorithm that the hash function uses in Hive?
>
> Thanks,
> Murali
>
> Sent from my iPhone
