Hi, I am looking at Description column of a bank statement (CSV download) that has the following format
scala> account_table.printSchema root |-- TransactionDate: date (nullable = true) |-- TransactionType: string (nullable = true) |-- Description: string (nullable = true) |-- Value: double (nullable = true) |-- Balance: double (nullable = true) |-- AccountName: string (nullable = true) |-- AccountNumber: string (nullable = true) The column description for BACS payments contains the name of the individual who paid into the third party account. I need to mask the name but cannot simply use a literal as below for all contents of descriptions column! f1.withColumn("Description", lit("*** Masked ***")).select('Description.as("Who paid") So I try the following combination f1.select(trim(substring(substring_index('Description, ",", 1),2,50)).as("name in clear"), reverse(regexp_replace(regexp_replace(regexp_replace(substring(regexp_replace('Description, "^['A-Z]", "XX"),2,6),"[A-F]","X")," ","X"),"[,]","R")).as("Masked")).show +------------------+------+ | in clear|Masked| +------------------+------+ | FATAH SABAH|HXTXXX| | C HIGGINSON|GIHXXX| | SOLTA A|XTLOSX| +------------------+------+ This seems to work as it not only masks the name but also makes it consistent for all names (in other words, the same username gets the same mask). Are there any better alternatives? Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.