Hi Ayan,

You don't have to bother with conversion at all. All functions that work on number columns will still work, as long as all values in the column are numbers:

scala> df2.printSchema
root
 |-- id: string (nullable = false)
 |-- id2: string (nullable = false)

scala> df2.show
+---+---+
| id|id2|
+---+---+
|  0|  0|
|  1|  1|
|  2|  2|
|  3|  3|
|  4|  4|
|  5|  5|
|  6|  6|
|  7|  7|
|  8|  8|
|  9|  9|
+---+---+

scala> df2.select($"id" + $"id2").show
+----------+
|(id + id2)|
+----------+
|       0.0|
|       2.0|
|       4.0|
|       6.0|
|       8.0|
|      10.0|
|      12.0|
|      14.0|
|      16.0|
|      18.0|
+----------+

scala> df2.select(sum("id")).show
+-------+
|sum(id)|
+-------+
|   45.0|
+-------+
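For readers following along in Python, here is a rough PySpark equivalent of the spark-shell session above. It is a minimal sketch: the way df2 is built here (spark.range plus a cast to string) is an assumption for illustration, not taken from the thread, and the final cast("long") shows how to get an integral result instead of the implicit string-to-double conversion that produces the 45.0 above.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as sum_

spark = SparkSession.builder.getOrCreate()

# Two string columns holding numeric values, mirroring the df2 above
# (illustrative construction, not from the original mail).
df2 = spark.range(10).select(col("id").cast("string").alias("id"),
                             col("id").cast("string").alias("id2"))

# Arithmetic still works: Spark implicitly casts the strings to double,
# which is why the results above come back as 0.0, 2.0, ... and 45.0.
df2.select((col("id") + col("id2")).alias("added")).show()

# An explicit cast keeps the aggregate integral (45 rather than 45.0).
df2.select(sum_(col("id").cast("long")).alias("total")).show()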
On Tue, Mar 24, 2020 at 12:11 AM ayan guha <guha.a...@gmail.com> wrote:

> Awesome.... Did not know about the conv function, so thanks for that.
>
> On Tue, 24 Mar 2020 at 1:23 am, Enrico Minack <m...@enrico.minack.dev> wrote:
>
>> Ayan,
>>
>> no need for UDFs, the SQL API provides all you need (sha1, substring, conv):
>> https://spark.apache.org/docs/2.4.5/api/python/pyspark.sql.html
>>
>> >>> df.select(conv(substring(sha1(col("value_to_hash")), 33, 8), 16, 10).cast("long").alias("sha2long")).show()
>> +----------+
>> |  sha2long|
>> +----------+
>> | 478797741|
>> |2520346415|
>> +----------+
>>
>> This creates a lean query plan:
>>
>> >>> df.select(conv(substring(sha1(col("value_to_hash")), 33, 8), 16, 10).cast("long").alias("sha2long")).explain()
>> == Physical Plan ==
>> Union
>> :- *(1) Project [478797741 AS sha2long#74L]
>> :  +- Scan OneRowRelation[]
>> +- *(2) Project [2520346415 AS sha2long#76L]
>>    +- Scan OneRowRelation[]
>>
>> Enrico
>>
>> Am 23.03.20 um 06:13 schrieb ayan guha:
>>
>> Hi
>>
>> I am trying to implement simple hashing/checksum logic. The key logic is:
>>
>> 1. Generate a sha1 hash
>> 2. Extract the last 8 chars
>> 3. Convert the 8 chars to an int (using base 16)
>>
>> Here is the cut-down version of the code:
>>
>> ---------------------------------------------------------------------------------------
>> from pyspark.sql.functions import *
>> from pyspark.sql.types import *
>> from hashlib import sha1 as local_sha1
>>
>> df = spark.sql("select '4104003141' value_to_hash union all select '4102859263'")
>>
>> f1 = lambda x: str(int(local_sha1(x.encode('UTF-8')).hexdigest()[32:], 16))
>> f2 = lambda x: int(local_sha1(x.encode('UTF-8')).hexdigest()[32:], 16)
>>
>> sha2Int1 = udf(f1, StringType())
>> sha2Int2 = udf(f2, IntegerType())
>>
>> print(f1('4102859263'))
>>
>> dfr = df.select(df.value_to_hash,
>>                 sha2Int1(df.value_to_hash).alias('1'),
>>                 sha2Int2(df.value_to_hash).alias('2'))
>> dfr.show(truncate=False)
>> ---------------------------------------------------------------------------------------
>>
>> I was expecting both columns to provide exactly the same values, however that is not the case *"always"*:
>>
>> 2520346415
>> +-------------+----------+-----------+
>> |value_to_hash|1         |2          |
>> +-------------+----------+-----------+
>> |4104003141   |478797741 |478797741  |
>> |4102859263   |2520346415|-1774620881|
>> +-------------+----------+-----------+
>>
>> The function itself works fine, as shown by the print statement. However the column values are not matching and vary widely.
>>
>> Any pointer?
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
> --
> Best Regards,
> Ayan Guha
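For completeness: the mismatch in the original mail is a 32-bit overflow, not a hashing problem. 2520346415 is larger than the maximum value an IntegerType can hold (2147483647), so it wraps around: 2520346415 - 2^32 = -1774620881, exactly the value shown in column 2. Declaring the UDF's return type as LongType avoids the wrap-around. Below is a minimal sketch reusing the thread's hashing logic; the names f2_long and sha2Long are illustrative, and a SparkSession is created explicitly so the snippet stands alone.

from hashlib import sha1 as local_sha1

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()

# Same hashing logic as in the original mail: sha1, last 8 hex chars, base-16 int.
f2_long = lambda x: int(local_sha1(x.encode('UTF-8')).hexdigest()[32:], 16)

# LongType covers the full unsigned 32-bit range, so values such as
# 2520346415 no longer wrap around to -1774620881 as with IntegerType.
sha2Long = udf(f2_long, LongType())

df = spark.sql("select '4104003141' value_to_hash union all select '4102859263'")
df.select(df.value_to_hash, sha2Long(df.value_to_hash).alias('hashed')).show(truncate=False)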