Is there R software available for doing approximate matching of personal
names?

I have data about the same people produced by different organizations and
the only matching key I have is the name. I know that commercial solutions
exist, and I know I could code this from scratch, but I'd prefer to build on
some existing free solution if it exists.

Unfortunately, the names are not standardized, and there is also a certain
level of error:

       Danny Williams (nickname)
       Dan Williams (nickname)
       Daniel Williams (formal name)
       Dan William (spelling error)
       D. Williams (initials)
       Daniel "Danny" Williams (formal + nickname)
       Dan P. Williams (includes middle initial)
       Williams, Daniel (different convention)
       William Daniel (wrong order or missing comma + misspelling)
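
Some of these cases (quoted nicknames, "Last, First" order, stray periods)
look like they could be reduced by plain normalization before any fuzzy
matching. A minimal sketch using only base R follows; normalize_name() is a
hypothetical helper invented here for illustration, not a function from any
package, and it does nothing for nicknames or misspellings:

```r
# normalize_name() is a hypothetical helper, not part of any package.
normalize_name <- function(x) {
  x <- tolower(trimws(x))
  # Drop a quoted nickname such as "Danny".
  x <- gsub('"[^"]*"', " ", x)
  # Turn "last, first" into "first last".
  x <- sub("^([^,]+),\\s*(.+)$", "\\2 \\1", x)
  # Remove periods, then collapse repeated whitespace.
  x <- gsub(".", "", x, fixed = TRUE)
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

normalized <- normalize_name(c('Daniel "Danny" Williams',
                               "Williams, Daniel",
                               "D. Williams"))
print(normalized)
# "daniel williams" "daniel williams" "d williams"
```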

Is there any R software available to find likely matches, ideally with some
estimate of accuracy of match?  Levenshtein distance as implemented in agrep
is a useful solution for some of these cases; I was wondering if there is
something that covers more cases.
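
For reference, this is roughly what the edit-distance approach looks like
in base R. It is only a sketch: the reference name, the similarity formula
and the 0.2 tolerance are arbitrary choices made for illustration, and the
similarity score is a crude ranking aid, not a calibrated estimate of match
accuracy.

```r
# Names taken from the examples above; "Daniel Williams" is the reference.
reference <- "Daniel Williams"
candidates <- c("Danny Williams", "Dan Williams", "Dan William",
                "D. Williams", "Williams, Daniel", "William Daniel")

# utils::adist() computes the generalized Levenshtein distance.
d <- drop(adist(reference, candidates, ignore.case = TRUE))

# Crude similarity in [0, 1]: 1 minus distance over the longer string length.
similarity <- 1 - d / pmax(nchar(reference), nchar(candidates))

result <- data.frame(candidate = candidates,
                     distance = d,
                     similarity = round(similarity, 2))
print(result[order(result$distance), ])

# agrep() gives a match/no-match answer for a given tolerance, not a score.
close_enough <- agrep(reference, candidates, max.distance = 0.2, value = TRUE)
print(close_enough)
```

Reordered names such as "Williams, Daniel" come out with a large distance
here, which is one of the cases plain edit distance does not cover.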

For this particular application, I am not concerned with issues such as
variant latinizations/transliterations (e.g. Tsung-Dao Lee ~ T.D. Lee ~ Li
Zhengdao; Ghaddafi ~ Qaddhaffi), but of course if someone handles that as
well....

Thanks,

            -s

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
