Awesome. I did not know about the conv function, so thanks for that.
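
For the archive, here is a minimal plain-Python sketch (no Spark needed) of what I understand both approaches to compute. The helper names last8_as_int and to_int32 are my own, not part of any library. The wraparound step is only my assumption about why the IntegerType column went negative: 2520346415 does not fit in a signed 32-bit integer.

```python
from hashlib import sha1


def last8_as_int(value: str) -> int:
    """SHA-1 hex digest -> last 8 hex chars -> unsigned integer (base 16)."""
    hex_digest = sha1(value.encode("utf-8")).hexdigest()  # 40 hex characters
    # [32:] is the same slice as substring(..., 33, 8) in Spark (1-based).
    return int(hex_digest[32:], 16)


def to_int32(n: int) -> int:
    """Reinterpret the low 32 bits of n as a signed 32-bit integer.

    Assumption: this mimics what happened to the value in the
    IntegerType UDF column; it is not taken from the Spark source.
    """
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n


for value in ("4104003141", "4102859263"):
    h = last8_as_int(value)
    # Columns: input, value as a long, value squeezed into 32 bits
    print(value, h, to_int32(h))
```

to_int32(2520346415) gives -1774620881, which matches column 2 in my quoted output below. So declaring the UDF return type as LongType(), or using conv(...).cast("long") as Enrico suggests, should avoid the mismatch.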

On Tue, 24 Mar 2020 at 1:23 am, Enrico Minack <m...@enrico.minack.dev>
wrote:

> Ayan,
>
> no need for UDFs; the SQL API provides everything you need (sha1, substring,
> conv):
> https://spark.apache.org/docs/2.4.5/api/python/pyspark.sql.html
>
> >>> df.select(conv(substring(sha1(col("value_to_hash")), 33, 8), 16,
> 10).cast("long").alias("sha2long")).show()
> +----------+
> |  sha2long|
> +----------+
> | 478797741|
> |2520346415|
> +----------+
>
> This creates a lean query plan:
>
> >>> df.select(conv(substring(sha1(col("value_to_hash")), 33, 8), 16,
> 10).cast("long").alias("sha2long")).explain()
> == Physical Plan ==
> Union
> :- *(1) Project [478797741 AS sha2long#74L]
> :  +- Scan OneRowRelation[]
> +- *(2) Project [2520346415 AS sha2long#76L]
>    +- Scan OneRowRelation[]
>
>
> Enrico
>
>
> On 23.03.20 at 06:13, ayan guha wrote:
>
> Hi
>
> I am trying to implement simple hashing/checksum logic. The key steps are:
>
> 1. Generate sha1 hash
> 2. Extract last 8 chars
> 3. Convert 8 chars to Int (using base 16)
>
> Here is the cut down version of the code:
>
>
> ---------------------------------------------------------------------------------------
>
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> from hashlib import sha1 as local_sha1
>
> df = spark.sql("select '4104003141' value_to_hash union all select '4102859263'")
>
> # Last 8 hex chars of the SHA-1 digest, parsed as a base-16 number
> f1 = lambda x: str(int(local_sha1(x.encode('UTF-8')).hexdigest()[32:], 16))
> f2 = lambda x: int(local_sha1(x.encode('UTF-8')).hexdigest()[32:], 16)
>
> sha2Int1 = udf(f1, StringType())
> sha2Int2 = udf(f2, IntegerType())
>
> print(f2('4102859263'))
>
> dfr = df.select(df.value_to_hash,
>                 sha2Int1(df.value_to_hash).alias('1'),
>                 sha2Int2(df.value_to_hash).alias('2'))
> dfr.show(truncate=False)
>
> ---------------------------------------------------------------------------------------------
>
> I expected both columns to contain exactly the same values, but that is
> not always the case:
>
> 2520346415
> +-------------+----------+-----------+
> |value_to_hash|1         |2          |
> +-------------+----------+-----------+
> |4104003141   |478797741 |478797741  |
> |4102859263   |2520346415|-1774620881|
> +-------------+----------+-----------+
>
> The function works fine, as the print statement shows. However, the values
> in the two columns do not match and differ widely.
>
> Any pointers?
>
> --
> Best Regards,
> Ayan Guha
>
>
--
Best Regards,
Ayan Guha
