You need to understand how a join works to make sense of this. Logically, a join does a cartesian product of the two tables and then keeps only the rows that satisfy the join condition, which here is the contains UDF. So, let's say you have
Input

    Allen Armstrong nishanth hemanth Allen
    shivu Armstrong nishanth
    shree shivu DeWALT

Replacement of words (each word on the left has to be replaced, in the input sentence, by the words on the right)

    Allen     => Apex Tool Group
    Armstrong => Apex Tool Group
    DeWALT    => StanleyBlack

Logically speaking, the join will first do a cartesian product, which will give you this:

Input x Replacement

    Allen Armstrong nishanth hemanth Allen, Allen, Apex Tool Group
    Allen Armstrong nishanth hemanth Allen, Armstrong, Apex Tool Group
    Allen Armstrong nishanth hemanth Allen, DeWALT, StanleyBlack
    shivu Armstrong nishanth, Allen, Apex Tool Group
    shivu Armstrong nishanth, Armstrong, Apex Tool Group
    shivu Armstrong nishanth, DeWALT, StanleyBlack
    shree shivu DeWALT, Allen, Apex Tool Group
    shree shivu DeWALT, Armstrong, Apex Tool Group
    shree shivu DeWALT, DeWALT, StanleyBlack

Then it will filter and keep only the records that satisfy contains:

Join output

    Allen Armstrong nishanth hemanth Allen, Allen, Apex Tool Group
    Allen Armstrong nishanth hemanth Allen, Armstrong, Apex Tool Group
    shivu Armstrong nishanth, Armstrong, Apex Tool Group
    shree shivu DeWALT, DeWALT, StanleyBlack

So, as you can see, you have 4 output rows instead of 3, because the first sentence matches two replacement terms. Now when it performs the replaceWithTerm operation on each of those rows, you get the output that you are getting.
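To make this concrete, here is a small plain-Python sketch of the same logic. It is not Spark code. The functions contains and replace_with_term are hypothetical stand-ins for the UDFs in this thread, whose real implementations are not shown here, and I am assuming they match and replace whole words.

```python
from itertools import product

# The three input sentences and the replacement table from the example.
inputs = [
    "Allen Armstrong nishanth hemanth Allen",
    "shivu Armstrong nishanth",
    "shree shivu DeWALT",
]
replacements = [
    ("Allen", "Apex Tool Group"),
    ("Armstrong", "Apex Tool Group"),
    ("DeWALT", "StanleyBlack"),
]


def contains(sentence, term):
    # Hypothetical stand-in for the contains UDF (assumed: whole-word match).
    return term in sentence.split()


def replace_with_term(sentence, term, replacement):
    # Hypothetical stand-in for replaceWithTerm (assumed: whole-word replace).
    return " ".join(replacement if word == term else word
                    for word in sentence.split())


# Step 1: cartesian product of the two tables (3 x 3 = 9 rows).
cartesian = [(sentence, term, replacement)
             for sentence, (term, replacement) in product(inputs, replacements)]

# Step 2: keep only the rows that satisfy the join condition.
joined = [row for row in cartesian if contains(row[0], row[1])]

# Step 3: the replacement runs once per joined row, so each output row
# has only ONE of the terms replaced.
replaced = [replace_with_term(*row) for row in joined]

for line in replaced:
    print(line)
```

The first sentence shows up twice in the result, once with only Allen replaced and once with only Armstrong replaced, which is the behaviour you are seeing.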