Hello,
I'm trying to link records from two large data sources. Both datasets have
almost same number of rows.
Goal is to match records based on multiple columns.
val matchId =
> SFAccountDF.as("SF").join(ELAccountDF.as("EL")).where($"SF.Email" ===
> $"EL.EmailAddress" || $"SF.Phone" === "EL.Phone")
Joining with a OR(||) will result in a CartesianProduct. I'm trying to
avoid that.
One way to do this is to join on each column and UNION the results.
> val phoneMatch = SFAccountDF.as("SF").filter("Phone !=
> ''").join(ELAccountDF.as("EL").filter("BusinessPhone !=
> ''")).where($"SF.Phone" === $"EL.BusinessPhone")
> val emailMatch = SFAccountDF.as("SF").filter("Email !=
> ''").join(ELAccountDF.as("EL").filter("EmailAddress !=
> ''")).where($"SF.Email" === $"EL.EmailAddress")
>
> val matchId =
> phoneMatch.unionAll(emailMatch.unionAll(faxMatch.unionAll(mobileMatch)))
> matchId.cache().registerTempTable("matchId")
Is there a more elegant way to do this?
On a related note, has anyone worked on record linkage using Bloom Filters,
Levenshtein distance, etc in Spark?
Srikanth