Hi Ayan,

You don't have to bother with conversion at all. Functions that work on
numeric columns still work on string columns as long as all the values in the
column are numbers:
scala> df2.printSchema
root
 |-- id: string (nullable = false)
 |-- id2: string (nullable = false)


scala> df2.show
+---+---+
| id|id2|
+---+---+
|  0|  0|
|  1|  1|
|  2|  2|
|  3|  3|
|  4|  4|
|  5|  5|
|  6|  6|
|  7|  7|
|  8|  8|
|  9|  9|
+---+---+


scala> df2.select($"id" + $"id2").show
+----------+
|(id + id2)|
+----------+
|       0.0|
|       2.0|
|       4.0|
|       6.0|
|       8.0|
|      10.0|
|      12.0|
|      14.0|
|      16.0|
|      18.0|
+----------+


scala> df2.select(sum("id")).show
+-------+
|sum(id)|
+-------+
|   45.0|
+-------+
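
For completeness, here is a minimal sketch of how a DataFrame like df2 above
could be built (the spark.range-based definition is my assumption, it is not
shown in the thread), and how to cast explicitly if you prefer integral results
over the implicit string-to-double cast behind the 0.0 ... 18.0 and 45.0
outputs above:

import org.apache.spark.sql.functions.sum
import spark.implicits._

// Assumed construction of df2: two string columns holding the numbers 0..9.
val df2 = spark.range(10)
  .select($"id".cast("string").as("id"), $"id".cast("string").as("id2"))

// Arithmetic and aggregates on the string columns work because Spark
// implicitly casts them to double, hence the 0.0, 2.0, ... and 45.0 above.
df2.select($"id" + $"id2").show()

// Cast explicitly if you want long semantics instead of double:
df2.select(($"id".cast("long") + $"id2".cast("long")).as("id_plus_id2")).show()
df2.select(sum($"id".cast("long"))).show()

For the hash values discussed further down in the thread, note that an
8-hex-digit hash like 2520346415 does not fit into an Int, which is why the
conv(...).cast("long") in Enrico's example below avoids the negative overflow
value.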

On Tue, Mar 24, 2020 at 12:11 AM ayan guha <guha.a...@gmail.com> wrote:

> Awesome... Did not know about the conv function, so thanks for that.
>
> On Tue, 24 Mar 2020 at 1:23 am, Enrico Minack <m...@enrico.minack.dev>
> wrote:
>
>> Ayan,
>>
>> no need for UDFs, the SQL API provides all you need (sha1, substring,
>> conv):
>> https://spark.apache.org/docs/2.4.5/api/python/pyspark.sql.html
>>
>> >>> df.select(conv(substring(sha1(col("value_to_hash")), 33, 8), 16,
>> 10).cast("long").alias("sha2long")).show()
>> +----------+
>> |  sha2long|
>> +----------+
>> | 478797741|
>> |2520346415|
>> +----------+
>>
>> This creates a lean query plan:
>>
>> >>> df.select(conv(substring(sha1(col("value_to_hash")), 33, 8), 16,
>> 10).cast("long").alias("sha2long")).explain()
>> == Physical Plan ==
>> Union
>> :- *(1) Project [478797741 AS sha2long#74L]
>> :  +- Scan OneRowRelation[]
>> +- *(2) Project [2520346415 AS sha2long#76L]
>>    +- Scan OneRowRelation[]
>>
>>
>> Enrico
>>
>>
>> Am 23.03.20 um 06:13 schrieb ayan guha:
>>
>> Hi
>>
>> I am trying to implement simple hashing/checksum logic. The key logic is
>> -
>>
>> 1. Generate sha1 hash
>> 2. Extract last 8 chars
>> 3. Convert 8 chars to Int (using base 16)
>>
>> Here is the cut down version of the code:
>>
>>
>> ---------------------------------------------------------------------------------------
>>
>> from pyspark.sql.functions import *
>> from pyspark.sql.types import *
>> from hashlib import sha1 as local_sha1
>>
>> df = spark.sql("select '4104003141' value_to_hash union all select '4102859263'")
>> f1 = lambda x: str(int(local_sha1(x.encode('UTF-8')).hexdigest()[32:], 16))
>> f2 = lambda x: int(local_sha1(x.encode('UTF-8')).hexdigest()[32:], 16)
>> sha2Int1 = udf(f1, StringType())
>> sha2Int2 = udf(f2, IntegerType())
>> print(f1('4102859263'))
>> dfr = df.select(df.value_to_hash,
>>                 sha2Int1(df.value_to_hash).alias('1'),
>>                 sha2Int2(df.value_to_hash).alias('2'))
>> dfr.show(truncate=False)
>>
>> ---------------------------------------------------------------------------------------------
>>
>> I was expecting both columns to provide exactly the same values, however
>> that's not "always" the case:
>>
>> 2520346415
>>
>> +-------------+----------+-----------+
>> |value_to_hash|1         |2          |
>> +-------------+----------+-----------+
>> |4104003141   |478797741 |478797741  |
>> |4102859263   |2520346415|-1774620881|
>> +-------------+----------+-----------+
>>
>> The function is working fine, as shown by the print statement. However, the
>> values do not match and vary widely.
>>
>> Any pointer?
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>>
> --
> Best Regards,
> Ayan Guha
>
