Awesome... I did not know about the conv function, so thanks for that.

On Tue, 24 Mar 2020 at 1:23 am, Enrico Minack <m...@enrico.minack.dev> wrote:
> Ayan,
>
> no need for UDFs, the SQL API provides all you need (sha1, substring, conv):
> https://spark.apache.org/docs/2.4.5/api/python/pyspark.sql.html
>
> >>> df.select(conv(substring(sha1(col("value_to_hash")), 33, 8), 16, 10).cast("long").alias("sha2long")).show()
> +----------+
> |  sha2long|
> +----------+
> | 478797741|
> |2520346415|
> +----------+
>
> This creates a lean query plan:
>
> >>> df.select(conv(substring(sha1(col("value_to_hash")), 33, 8), 16, 10).cast("long").alias("sha2long")).explain()
> == Physical Plan ==
> Union
> :- *(1) Project [478797741 AS sha2long#74L]
> :  +- Scan OneRowRelation[]
> +- *(2) Project [2520346415 AS sha2long#76L]
>    +- Scan OneRowRelation[]
>
> Enrico
>
>
> On 23.03.20 at 06:13, ayan guha wrote:
>
> > Hi
> >
> > I am trying to implement simple hashing/checksum logic. The key logic is:
> >
> > 1. Generate sha1 hash
> > 2. Extract last 8 chars
> > 3. Convert 8 chars to Int (using base 16)
> >
> > Here is the cut-down version of the code:
> >
> > ---------------------------------------------------------------------------------------
> > from pyspark.sql.functions import *
> > from pyspark.sql.types import *
> > from hashlib import sha1 as local_sha1
> >
> > df = spark.sql("select '4104003141' value_to_hash union all select '4102859263'")
> > f1 = lambda x: str(int(local_sha1(x.encode('UTF-8')).hexdigest()[32:],16))
> > f2 = lambda x: int(local_sha1(x.encode('UTF-8')).hexdigest()[32:],16)
> > sha2Int1 = udf( f1 , StringType())
> > sha2Int2 = udf( f2 , IntegerType())
> > print(f2('4102859263'))
> > dfr = df.select(df.value_to_hash, sha2Int1(df.value_to_hash).alias('1'), sha2Int2(df.value_to_hash).alias('2'))
> > dfr.show(truncate=False)
> > ---------------------------------------------------------------------------------------
> >
> > I was expecting both columns to provide exactly the same values; however,
> > that is not "always" the case:
> >
> > 2520346415
> > +-------------+----------+-----------+
> > |value_to_hash|1         |2          |
> > +-------------+----------+-----------+
> > |4104003141   |478797741 |478797741  |
> > |4102859263   |2520346415|-1774620881|
> > +-------------+----------+-----------+
> >
> > The function is working fine, as shown by the print statement. However,
> > the values in the two columns do not match and vary widely.
> >
> > Any pointers?
> >
> > --
> > Best Regards,
> > Ayan Guha

--
Best Regards,
Ayan Guha
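
A note on why column 2 differed in the original output: eight hex characters can encode values up to 2^32 - 1, while IntegerType is a 32-bit signed type that tops out at 2147483647. The mismatching value, -1774620881, is exactly 2520346415 - 2^32, which is what a wrap-around to a signed 32-bit integer would produce. The sketch below is plain Python with no Spark involved; the helper names `sha1_tail_to_int` and `wrap_to_int32` are made up for illustration, and the wrap-around is my assumption about what happens on the UDF path, not something confirmed in this thread.

```python
from hashlib import sha1

INT32_MAX = 2**31 - 1  # largest value a 32-bit signed integer can hold


def sha1_tail_to_int(value: str) -> int:
    """Read the last 8 hex characters of the SHA-1 digest as a base-16 integer."""
    return int(sha1(value.encode("UTF-8")).hexdigest()[32:], 16)


def wrap_to_int32(n: int) -> int:
    """Reinterpret the low 32 bits of n as a signed two's-complement integer."""
    n &= 0xFFFFFFFF
    return n - 2**32 if n > INT32_MAX else n


# The two unsigned values are copied from the output quoted in this thread.
for unsigned in (478797741, 2520346415):
    print(unsigned, unsigned > INT32_MAX, wrap_to_int32(unsigned))
# prints:
# 478797741 False 478797741
# 2520346415 True -1774620881
```

If that assumption holds, declaring the UDF with LongType() instead of IntegerType() should make both columns agree, which is also consistent with the cast("long") in Enrico's version.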