Thank you both!

Here's the code that's working now. It's a bit hard to read because so many
function calls are nested. Any ideas on how I can improve the readability?

from pyspark.sql.functions import (trim, when, from_unixtime,
                                   unix_timestamp, minute, hour)

duration_test = flight2.select("stop_duration1")
duration_test.show()


duration_test.withColumn(
    'duration_h',
    when(duration_test.stop_duration1.isNull(), 999)
    .otherwise(hour(unix_timestamp(duration_test.stop_duration1, "HH'h'mm'm'").cast("timestamp")))
).show(20)


+--------------+
|stop_duration1|
+--------------+
|         0h50m|
|         3h15m|
|         8h35m|
|         1h30m|
|        12h15m|
|        11h50m|
|          2h5m|
|        10h25m|
|         8h20m|
|          null|
|         2h50m|
|         2h30m|
|         7h45m|
|         1h10m|
|         2h15m|
|          2h0m|
|        10h25m|
|         1h40m|
|         1h55m|
|         1h40m|
+--------------+
only showing top 20 rows

+--------------+----------+
|stop_duration1|duration_h|
+--------------+----------+
|         0h50m|         0|
|         3h15m|         3|
|         8h35m|         8|
|         1h30m|         1|
|        12h15m|        12|
|        11h50m|        11|
|          2h5m|         2|
|        10h25m|        10|
|         8h20m|         8|
|          null|       999|
|         2h50m|         2|
|         2h30m|         2|
|         7h45m|         7|
|         1h10m|         1|
|         2h15m|         2|
|          2h0m|         2|
|        10h25m|        10|
|         1h40m|         1|
|         1h55m|         1|
|         1h40m|         1|
+--------------+----------+
only showing top 20 rows
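
One possible direction for readability, offered only as a rough sketch: pull the parsing into a small, null-safe plain-Python function with a descriptive name. The names parse_duration_minutes and duration_hours are hypothetical and do not appear anywhere in this thread. Treating malformed strings the same way as nulls is also an assumption, not something the working code above does.

```python
import re

# Hypothetical helper (not from this thread): matches strings such as "3h15m".
_DURATION = re.compile(r"(\d+)h(\d+)m")


def parse_duration_minutes(text):
    """Return total minutes for a string like '3h15m', or None if it cannot be parsed."""
    if text is None:
        return None
    match = _DURATION.fullmatch(text.strip())
    if match is None:
        # Assumption: malformed input is treated like a missing value.
        return None
    hours, minutes = (int(group) for group in match.groups())
    return hours * 60 + minutes


def duration_hours(text, default=999):
    """Return whole hours, or `default` when the input is null or unparseable.

    The default of 999 mirrors the sentinel used in the code above.
    """
    total = parse_duration_minutes(text)
    return default if total is None else total // 60


# Quick check on values taken from the sample output above.
for sample in ["0h50m", "12h15m", "2h5m", None]:
    print(sample, duration_hours(sample), parse_duration_minutes(sample))
```

A function like this could presumably be wrapped with pyspark.sql.functions.udf and an IntegerType return type, although built-in column functions such as the ones used above generally run faster than Python UDFs.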





On Tue, Apr 25, 2017 at 11:29 AM, Pushkar.Gujar <pushkarvgu...@gmail.com>
wrote:

> Someone had similar issue today at stackoverflow.
>
> http://stackoverflow.com/questions/43595201/python-how-to-convert-pyspark-column-to-date-type-if-there-are-null-values/43595728#43595728
>
>
> Thank you,
> *Pushkar Gujar*
>
>
> On Mon, Apr 24, 2017 at 8:22 PM, Zeming Yu <zemin...@gmail.com> wrote:
>
>> hi all,
>>
>> I tried to write a UDF that handles null values:
>>
>> def getMinutes(hString, minString):
>>     if (hString != None) & (minString != None):
>>         return int(hString) * 60 + int(minString[:-1])
>>     else:
>>         return None
>>
>> flight2 = (flight2.withColumn("duration_minutes",
>> udfGetMinutes("duration_h", "duration_m")))
>>
>>
>> but I got this error:
>>
>>   File "<ipython-input-67-5eb2daa1c1f2>", line 6, in getMinutes
>> TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
>>
>>
>> Does anyone know how to do this?
>>
>>
>> Thanks,
>>
>> Zeming
>>
>>
>
