Re: Issue with UDF Int Conversion - Str to Int

Jacob Lynn Mon, 23 Mar 2020 02:17:08 -0700

You are overflowing the integer type: `IntegerType()` is a 32-bit signed
integer, so its maximum value is 2147483647 (2^31 - 1). Change the return
type of `sha2Int2` to `LongType()` and it works as expected.
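
To make the wraparound concrete, here is a minimal plain-Python sketch (no
Spark needed). The helper name `wrap_to_int32` is my own, and I am assuming
the out-of-range value is reinterpreted as a signed two's-complement 32-bit
integer; I have not checked how Spark does the conversion internally, but
this arithmetic reproduces the numbers in your output.

```python
INT32_MAX = 2**31 - 1  # 2147483647, the largest value a 32-bit signed int can hold


def wrap_to_int32(n):
    """Reinterpret the low 32 bits of n as a signed two's-complement integer.

    Hypothetical helper for illustration only; not part of PySpark.
    """
    n &= 0xFFFFFFFF  # keep only the low 32 bits
    return n - 2**32 if n > INT32_MAX else n


print(wrap_to_int32(478797741))   # fits in 32 bits, so it is unchanged: 478797741
print(wrap_to_int32(2520346415))  # exceeds INT32_MAX, wraps to -1774620881
```

Eight hex characters can encode values up to 2^32 - 1 = 4294967295, which
always fits in the 64-bit `LongType()`.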


On Mon, Mar 23, 2020 at 6:15 AM ayan guha <guha.a...@gmail.com> wrote:

> Hi
>
> I am trying to implement simple hashing/checksum logic. The key logic is -
>
> 1. Generate sha1 hash
> 2. Extract last 8 chars
> 3. Convert 8 chars to Int (using base 16)
>
> Here is the cut down version of the code:
>
>
> ---------------------------------------------------------------------------------------
>
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> from hashlib import sha1 as local_sha1
>
> df = spark.sql("select '4104003141' value_to_hash union all select '4102859263'")
>
> f1 = lambda x: str(int(local_sha1(x.encode('UTF-8')).hexdigest()[32:], 16))
> f2 = lambda x: int(local_sha1(x.encode('UTF-8')).hexdigest()[32:], 16)
>
> sha2Int1 = udf(f1, StringType())
> sha2Int2 = udf(f2, IntegerType())
>
> print(f2('4102859263'))
>
> dfr = df.select(df.value_to_hash,
>                 sha2Int1(df.value_to_hash).alias('1'),
>                 sha2Int2(df.value_to_hash).alias('2'))
> dfr.show(truncate=False)
>
> ---------------------------------------------------------------------------------------------
>
> I was expecting both columns to give exactly the same values; however,
> that is not always the case:
>
> 2520346415
> +-------------+----------+-----------+
> |value_to_hash|1         |2          |
> +-------------+----------+-----------+
> |4104003141   |478797741 |478797741  |
> |4102859263   |2520346415|-1774620881|
> +-------------+----------+-----------+
>
> The function is working fine, as shown by the print statement. However,
> the values in the two columns do not match and differ widely.
>
> Any pointers?
>
> --
> Best Regards,
> Ayan Guha
>
