Hello dear R-helpers, I'm working with R-2.15.2 on Windows 7 OS. I'm stucked with a merge of two data frames by characters. In each data frame I got two different list of names, that is my main-key to be merged.
To figure out what I'm saying, I build up a modified "?merge" example, with errors by purpose: # Data for authors: authors <- data.frame( surname = I(c("Tukey", "Venable", "Terney", "Ripley", "McNeil")), nationality = c("US", "Australia", "US", "UK", "Australia"), deceased = c("yes", rep("no", 4))) "Venables" is without the final 's', and "Tierney, without "i". # Data for books: books <- data.frame( surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "Rippley", "McNeil", "R Core")), title = c("Exploratory Data Analysis", "Modern Applied Statistics ...", "LISP-STAT", "Spatial Statistics", "Stochastic Simulation", "Interactive Data Analysis", "An Introduction to R"), other.author = c(NA, "Ripley", NA, NA, NA, NA, "Venables & Smith")) With "surname" column instead of "name" (differs from original example for more easy going merge). And the second "Ripley" with double "p". So, if I ask for: merge(authors, books, all=TRUE) I got: But we know that "Rippley" corresponds to "Ripley", "Terney" to "Tierney" and "Venable" to "Venables". I was wondering if there was any way to work around this problem. My orginal data have around 27,000 name entries, and if I take "all=FALSE", this database drops out to around 17,000, most because mispelling (or truncated expressions). If I take "all=TRUE", I got many of this <NA> cases like the example above. Has anyone experienced this? Any idea how I can get out? I'm thinking to take the longest match possible to each entry. For example, in "Venable"/"Venables" there is a 87.5% match. As I have name and surname, and also auxiliary keys to this match, I think this could work. Thank you in advance. ----- Victor Delgado cedeplar.ufmg.br P.H.D. student www.fjp.mg.gov.br reseacher -- View this message in context: http://r.789695.n4.nabble.com/Merge-data-frame-with-mispelling-characters-tp4648255.html Sent from the R help mailing list archive at Nabble.com. ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.