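You are overflowing Spark's `IntegerType`, which is a signed 32-bit integer with a maximum value of 2147483647 (2^31 - 1). Eight hex characters can encode values up to 4294967295 (2^32 - 1), so any hash suffix above 2147483647, such as 2520346415, wraps around to a negative number. Change the return type of `sha2Int2` to `LongType()` and it works as expected.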
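To make the arithmetic concrete, here is a minimal pure-Python sketch of the wraparound (no Spark needed). The helper names `last8_as_int` and `wrap_to_int32` are made up for illustration, and the sketch assumes the overflow behaves like a plain two's-complement truncation to 32 bits, which is consistent with the numbers in your output:

```python
from hashlib import sha1

INT32_MAX = 2**31 - 1  # 2147483647, the largest value a signed 32-bit int can hold


def last8_as_int(value: str) -> int:
    """Last 8 hex chars of the SHA-1 digest, parsed as base 16 (range 0 .. 2**32 - 1)."""
    return int(sha1(value.encode("UTF-8")).hexdigest()[32:], 16)


def wrap_to_int32(n: int) -> int:
    """Reinterpret the low 32 bits of n as a signed 32-bit integer (assumed overflow model)."""
    n &= 0xFFFFFFFF
    return n - 2**32 if n > INT32_MAX else n


# Values taken from the output in the quoted message.
print(wrap_to_int32(478797741))   # fits in 32 bits, prints 478797741
print(wrap_to_int32(2520346415))  # exceeds INT32_MAX, prints -1774620881
```

Since 2520346415 - 2^32 = -1774620881, this reproduces the second row of column `2` exactly. Declaring the UDF as `udf(f2, LongType())` gives the result a 64-bit type, which holds every possible 8-hex-digit value.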
On Mon, Mar 23, 2020 at 6:15 AM ayan guha <guha.a...@gmail.com> wrote:

> Hi
>
> I am trying to implement simple hashing/checksum logic. The key logic is -
>
> 1. Generate sha1 hash
> 2. Extract last 8 chars
> 3. Convert 8 chars to Int (using base 16)
>
> Here is the cut down version of the code:
>
> ---------------------------------------------------------------------------------------
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> from hashlib import sha1 as local_sha1
>
> df = spark.sql("select '4104003141' value_to_hash union all select '4102859263'")
> f1 = lambda x: str(int(local_sha1(x.encode('UTF-8')).hexdigest()[32:],16))
> f2 = lambda x: int(local_sha1(x.encode('UTF-8')).hexdigest()[32:],16)
> sha2Int1 = udf( f1 , StringType())
> sha2Int2 = udf( f2 , IntegerType())
> print(f('4102859263'))
> dfr = df.select(df.value_to_hash, sha2Int1(df.value_to_hash).alias('1'), sha2Int2(df.value_to_hash).alias('2'))
> dfr.show(truncate=False)
> ---------------------------------------------------------------------------------------
>
> I was expecting both columns should provide exact same values, however
> thats not the case "always"
>
> 2520346415
> +-------------+----------+-----------+
> |value_to_hash|1         |2          |
> +-------------+----------+-----------+
> |4104003141   |478797741 |478797741  |
> |4102859263   |2520346415|-1774620881|
> +-------------+----------+-----------+
>
> The function working fine, as shown in the print statement. However values
> are not matching and vary widely.
>
> Any pointer?
>
> --
> Best Regards,
> Ayan Guha