To make sense of this, you need to understand how a join works. Logically, a
join computes the cartesian product of the 2 tables and then keeps only the
rows that satisfy the join condition, here the contains UDF. So, let's say you
have

Input

Allen Armstrong nishanth hemanth Allen
shivu Armstrong nishanth
shree shivu DeWALT

Replacement of words
Each word on the LHS has to be replaced with the words on the RHS wherever it
appears in an input sentence
Allen        => Apex Tool Group
Armstrong => Apex Tool Group
DeWALT    => StanleyBlack

Logically speaking it will first do a cartesian product, which will give you
this

Input x Replacement
Allen Armstrong nishanth hemanth Allen, Allen, Apex Tool Group
Allen Armstrong nishanth hemanth Allen, Armstrong, Apex Tool Group
Allen Armstrong nishanth hemanth Allen, DeWALT, StanleyBlack
shivu Armstrong nishanth, Allen, Apex Tool Group
shivu Armstrong nishanth, Armstrong, Apex Tool Group
shivu Armstrong nishanth, DeWALT, StanleyBlack
shree shivu DeWALT, Allen, Apex Tool Group
shree shivu DeWALT, Armstrong, Apex Tool Group
shree shivu DeWALT, DeWALT, StanleyBlack

Then it will filter and keep only the records that satisfy the contains UDF

Join output
Allen Armstrong nishanth hemanth Allen, Allen, Apex Tool Group
Allen Armstrong nishanth hemanth Allen, Armstrong, Apex Tool Group
shivu Armstrong nishanth, Armstrong, Apex Tool Group
shree shivu DeWALT, DeWALT, StanleyBlack
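
The two logical steps above can be sketched in plain Python, without Spark. This
is only a model of the join semantics, not how Spark executes it physically. The
contains function here is a stand-in that assumes a whole-word match; the actual
UDF in your code may behave differently.

```python
from itertools import product

# The two tables from the example above.
inputs = [
    "Allen Armstrong nishanth hemanth Allen",
    "shivu Armstrong nishanth",
    "shree shivu DeWALT",
]
replacements = [
    ("Allen", "Apex Tool Group"),
    ("Armstrong", "Apex Tool Group"),
    ("DeWALT", "StanleyBlack"),
]

def contains(sentence, term):
    # Assumed stand-in for the contains UDF: true when the term
    # appears as a whole word in the sentence.
    return term in sentence.split()

# Step 1: cartesian product of the two tables (3 x 3 = 9 rows).
crossed = [(s, term, repl) for s, (term, repl) in product(inputs, replacements)]

# Step 2: keep only the rows where the join condition holds.
joined = [row for row in crossed if contains(row[0], row[1])]

for row in joined:
    print(row)
```

The first sentence survives the filter twice, once for Allen and once for
Armstrong, which is why joined has 4 rows for 3 input sentences.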

So, as you can see, you have 4 output rows instead of 3, because the first
sentence matched both Allen and Armstrong. Now, when it performs the
replaceWithTerm operation on each of those rows separately, you get the output
that you are getting
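
To see what that looks like, here is a small plain-Python sketch that applies a
replacement to each joined row independently. The replace_with_term function is
a hypothetical stand-in for your UDF and assumes a simple string replace; your
actual implementation may differ.

```python
# The 4 rows that come out of the join: (sentence, term, replacement).
joined = [
    ("Allen Armstrong nishanth hemanth Allen", "Allen", "Apex Tool Group"),
    ("Allen Armstrong nishanth hemanth Allen", "Armstrong", "Apex Tool Group"),
    ("shivu Armstrong nishanth", "Armstrong", "Apex Tool Group"),
    ("shree shivu DeWALT", "DeWALT", "StanleyBlack"),
]

def replace_with_term(sentence, term, replacement):
    # Hypothetical stand-in for the UDF: replaces every occurrence
    # of one term and knows nothing about the other terms.
    return sentence.replace(term, replacement)

# Each joined row is processed on its own, so each row only
# gets the single replacement it was paired with.
output = [replace_with_term(s, term, repl) for s, term, repl in joined]

for line in output:
    print(line)
```

The first sentence comes out twice, each time with only one of its two terms
replaced, because each joined row carries exactly one replacement pair.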




