On Nov 2, 2012, at 11:20 AM, VictorDelgado wrote: > Hello dear R-helpers, > > I'm working with R-2.15.2 on Windows 7 OS. I'm stucked with a merge of two > data frames by characters. > In each data frame I got two different list of names, that is my main-key to > be merged. > > To figure out what I'm saying, I build up a modified "?merge" example, with > errors by purpose: > > # Data for authors: > > authors <- data.frame( > surname = I(c("Tukey", "Venable", "Terney", "Ripley", "McNeil")), > nationality = c("US", "Australia", "US", "UK", "Australia"), > deceased = c("yes", rep("no", 4))) > > "Venables" is without the final 's', and "Tierney, without "i". > > # Data for books: > > books <- data.frame( > surname = I(c("Tukey", "Venables", "Tierney", > "Ripley", "Rippley", "McNeil", "R Core")), > title = c("Exploratory Data Analysis", > "Modern Applied Statistics ...", > "LISP-STAT", > "Spatial Statistics", "Stochastic Simulation", > "Interactive Data Analysis", > "An Introduction to R"), > other.author = c(NA, "Ripley", NA, NA, NA, NA, > "Venables & Smith"))
In your example the authors list has better spelling. The 'agrep' functions by default will return matches that are 90% ( or more precisely Levenshtein distance of less than or equalt to 0.1) : books$altname <- NA altidx <- unlist( sapply(books$surname, agrep, authors$surname) ) books$altname[seq(altidx)] <- authors$surname[altidx] books #----------------------- surname title other.author altname 1 Tukey Exploratory Data Analysis <NA> Tukey 2 Venables Modern Applied Statistics ... Ripley Venable 3 Tierney LISP-STAT <NA> Terney 4 Ripley Spatial Statistics <NA> Ripley 5 Rippley Stochastic Simulation <NA> Ripley 6 McNeil Interactive Data Analysis <NA> McNeil 7 R Core An Introduction to R Venables & Smith <NA> If you then match 'books' to 'authors' with a merge on authors$surname and books$altname, you should get closer to your goals -- David. > > With "surname" column instead of "name" (differs from original example for > more easy going merge). And the second "Ripley" with double "p". > > So, if I ask for: > > merge(authors, books, all=TRUE) > > I got: > > > But we know that "Rippley" corresponds to "Ripley", "Terney" to "Tierney" > and "Venable" to "Venables". I was wondering if there was any way to work > around this problem. My orginal data have around 27,000 name entries, and if > I take "all=FALSE", this database drops out to around 17,000, most because > mispelling (or truncated expressions). If I take "all=TRUE", I got many of > this <NA> cases like the example above. > > Has anyone experienced this? Any idea how I can get out? I'm thinking to > take the longest match possible to each entry. For example, in > "Venable"/"Venables" there is a 87.5% match. As I have name and surname, and > also auxiliary keys to this match, I think this could work. > > Thank you in advance. > > > > ----- > Victor Delgado > cedeplar.ufmg.br P.H.D. student > www.fjp.mg.gov.br reseacher > -- > View this message in context: > http://r.789695.n4.nabble.com/Merge-data-frame-with-mispelling-characters-tp4648255.html > Sent from the R help mailing list archive at Nabble.com. > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. David Winsemius, MD Alameda, CA, USA ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.