Hi,

I have a raw source DataFrame with 2 columns, as below:

timestamp
2019-11-29 9:30:45

message_log
<123>NOV 29 10:20:35 ips01 sfids: connection: tcp,bytes:104,user:unknown,url:unknown,host:127.0.0.1

How do we break each key:value in the above message out into separate columns using a UDF in PySpark?

What is the right approach for flattening this type of log data: regex or plain Python logic?

Could you please help me with the logic to flatten the log data?
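
And here is the pure-regex alternative I have been weighing, starting from the same raw df as in the sketch above and using only the built-in regexp_extract (no UDF). The patterns are only guesses at my own sample format and would need hardening for real syslog:

from pyspark.sql import functions as F

# One pattern for the fixed-position header: <prio>msg_ts msg_ids tag:
# Group 4 captures the program tag ("sfids" in the sample) -- an assumption
# about what the sfids column should hold.
header = r"^<(\d+)>(\w+ \d+ \d+:\d+:\d+) (\S+) (\S+):"

parsed = (df
          .withColumn("prio",    F.regexp_extract("message_log", header, 1))
          .withColumn("msg_ts",  F.regexp_extract("message_log", header, 2))
          .withColumn("msg_ids", F.regexp_extract("message_log", header, 3))
          .withColumn("sfids",   F.regexp_extract("message_log", header, 4)))

# Pull each named key:value pair out of the comma-separated tail.
for key in ["connection", "bytes", "user", "url", "host"]:
    parsed = parsed.withColumn(
        key, F.regexp_extract("message_log", key + r":\s*([^,]+)", 1))

parsed.select("timestamp", "prio", "msg_ts", "msg_ids", "connection",
              "bytes", "user", "url", "host").show(truncate=False)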

The final output DataFrame should have each of the columns below:

timestamp  : 2019-11-29 9:30:45
prio       : 123
msg_ts     : NOV 29 10:20:35
msg_ids    : ips01
sfids      :
connection : tcp
bytes      : 104
user       : unknown
url        : unknown
host       : 127.0.0.1


Thanks
Anbu



