Normally a family of joins (left outer, right outer, inner) is performed on
two dataframes using a column comparison, i.e. left("acol") ===
right("acol"). The === operator on the "left" dataframe's column does
something internally and produces a Column that I assume is used by the
join.

What I want is to create my own comparison operation. I have a case where I
want to use some fuzzy matching between rows, and if they fall within some
threshold the join is allowed to happen.

So it would look something like:

left.join(right, my_fuzzy_udf (left("cola"),right("cola")))

where my_fuzzy_udf is my own UDF. My main concern is the column that it
would have to output: what would its value be, i.e. what would the function
need to return so that the UDF subsystem can turn it into a Column to be
evaluated by the join?


Thanks in advance for any advice
