Hi Ayan,

You don't have to bother with conversion at all. Functions that work on
numeric columns still work on string columns as long as all the values in the
column are numbers:
scala> df2.printSchema
root
 |-- id: string (nullable = false)
 |-- id2: string (nullable = false)


scala> df2.show
+---+---+
| id|id2|
+---+---+
|  0|  0|
|  1|  1|
|  2|  2|
|  3|  3|
|  4|  4|
|  5|  5|
|  6|  6|
|  7|  7|
|  8|  8|
|  9|  9|
+---+---+


scala> df2.select($"id" + $"id2").show
+----------+
|(id + id2)|
+----------+
|       0.0|
|       2.0|
|       4.0|
|       6.0|
|       8.0|
|      10.0|
|      12.0|
|      14.0|
|      16.0|
|      18.0|
+----------+


scala> df2.select(sum("id")).show
+-------+
|sum(id)|
+-------+
|   45.0|
+-------+
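
For completeness, here is a minimal sketch of how a DataFrame like df2 above
could be built (the spark.range-based definition is my assumption, it is not
shown in the thread), and how to cast explicitly if you prefer integral results
over the implicit string-to-double cast behind the 0.0 ... 18.0 and 45.0
outputs above:

import org.apache.spark.sql.functions.sum
import spark.implicits._

// Assumed construction of df2: two string columns holding the numbers 0..9.
val df2 = spark.range(10)
  .select($"id".cast("string").as("id"), $"id".cast("string").as("id2"))

// Arithmetic and aggregates on the string columns work because Spark
// implicitly casts them to double, hence the 0.0, 2.0, ... and 45.0 above.
df2.select($"id" + $"id2").show()

// Cast explicitly if you want long semantics instead of double:
df2.select(($"id".cast("long") + $"id2".cast("long")).as("id_plus_id2")).show()
df2.select(sum($"id".cast("long"))).show()

For the hash values discussed further down in the thread, note that an
8-hex-digit hash like 2520346415 does not fit into an Int, which is why the
conv(...).cast("long") in Enrico's example below avoids the negative overflow
value.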

On Tue, Mar 24, 2020 at 12:11 AM ayan guha <guha.a...@gmail.com> wrote:

> Awesome... Did not know about the conv function, so thanks for that.
>
> On Tue, 24 Mar 2020 at 1:23 am, Enrico Minack <m...@enrico.minack.dev>
> wrote:
>
>> Ayan,
>>
>> no need for UDFs, the SQL API provides all you need (sha1, substring,
>> conv):
>> https://spark.apache.org/docs/2.4.5/api/python/pyspark.sql.html
>>
>> >>> df.select(conv(substring(sha1(col("value_to_hash")), 33, 8), 16,
>> 10).cast("long").alias("sha2long")).show()
>> +----------+
>> |  sha2long|
>> +----------+
>> | 478797741|
>> |2520346415|
>> +----------+
>>
>> This creates a lean query plan:
>>
>> >>> df.select(conv(substring(sha1(col("value_to_hash")), 33, 8), 16,
>> 10).cast("long").alias("sha2long")).explain()
>> == Physical Plan ==
>> Union
>> :- *(1) Project [478797741 AS sha2long#74L]
>> :  +- Scan OneRowRelation[]
>> +- *(2) Project [2520346415 AS sha2long#76L]
>>    +- Scan OneRowRelation[]
>>
>>
>> Enrico
>>
>>
>> Am 23.03.20 um 06:13 schrieb ayan guha:
>>
>> Hi
>>
>> I am trying to implement simple hashing/checksum logic. The key logic is
>> -
>>
>> 1. Generate sha1 hash
>> 2. Extract last 8 chars
>> 3. Convert 8 chars to Int (using base 16)
>>
>> Here is the cut down version of the code:
>>
>>
>> ---------------------------------------------------------------------------------------
>>
>> from pyspark.sql.functions import *
>> from pyspark.sql.types import *
>> from hashlib import sha1 as local_sha1
>>
>> df = spark.sql("select '4104003141' value_to_hash union all select '4102859263'")
>> f1 = lambda x: str(int(local_sha1(x.encode('UTF-8')).hexdigest()[32:], 16))
>> f2 = lambda x: int(local_sha1(x.encode('UTF-8')).hexdigest()[32:], 16)
>> sha2Int1 = udf(f1, StringType())
>> sha2Int2 = udf(f2, IntegerType())
>> print(f1('4102859263'))
>> dfr = df.select(df.value_to_hash,
>>                 sha2Int1(df.value_to_hash).alias('1'),
>>                 sha2Int2(df.value_to_hash).alias('2'))
>> dfr.show(truncate=False)
>>
>> ---------------------------------------------------------------------------------------------
>>
>> I was expecting both columns to provide exactly the same values, however
>> that's not "always" the case:
>>
>> 2520346415
>>
>> +-------------+----------+-----------+
>> |value_to_hash|1         |2          |
>> +-------------+----------+-----------+
>> |4104003141   |478797741 |478797741  |
>> |4102859263   |2520346415|-1774620881|
>> +-------------+----------+-----------+
>>
>> The function is working fine, as shown by the print statement. However, the
>> values do not match and vary widely.
>>
>> Any pointer?
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>>
> --
> Best Regards,
> Ayan Guha
>
