David Winsemius wrote > On Nov 2, 2012, at 11:20 AM, VictorDelgado wrote: > >> Hello dear R-helpers, >> >> I'm working with R-2.15.2 on Windows 7 OS. I'm stucked with a merge of >> two >> data frames by characters. >> In each data frame I got two different list of names, that is my main-key >> to >> be merged. >> >> To figure out what I'm saying, I build up a modified "?merge" example, >> with >> errors by purpose: >> >> # Data for authors: >> >> authors <- data.frame( >> surname = I(c("Tukey", "Venable", "Terney", "Ripley", "McNeil")), >> nationality = c("US", "Australia", "US", "UK", "Australia"), >> deceased = c("yes", rep("no", 4))) >> >> "Venables" is without the final 's', and "Tierney, without "i". >> >> # Data for books: >> >> books <- data.frame( >> surname = I(c("Tukey", "Venables", "Tierney", >> "Ripley", "Rippley", "McNeil", "R Core")), >> title = c("Exploratory Data Analysis", >> "Modern Applied Statistics ...", >> "LISP-STAT", >> "Spatial Statistics", "Stochastic Simulation", >> "Interactive Data Analysis", >> "An Introduction to R"), >> other.author = c(NA, "Ripley", NA, NA, NA, NA, >> "Venables & Smith")) > > In your example the authors list has better spelling. The 'agrep' > functions by default will return matches that are 90% ( or more precisely > Levenshtein distance of less than or equalt to 0.1) : > > > books$altname <- NA > altidx <- unlist( sapply(books$surname, agrep, authors$surname) ) > books$altname[seq(altidx)] <- authors$surname[altidx] > books > #----------------------- > surname title other.author altname > 1 Tukey Exploratory Data Analysis > <NA> > Tukey > 2 Venables Modern Applied Statistics ... Ripley Venable > 3 Tierney LISP-STAT > <NA> > Terney > 4 Ripley Spatial Statistics > <NA> > Ripley > 5 Rippley Stochastic Simulation > <NA> > Ripley > 6 McNeil Interactive Data Analysis > <NA> > McNeil > 7 R Core An Introduction to R Venables & Smith > <NA> > If you then match 'books' to 'authors' with a merge on authors$surname and > books$altname, you should get closer to your goals > > -- > David. >> >> With "surname" column instead of "name" (differs from original example >> for >> more easy going merge). And the second "Ripley" with double "p". >> >> So, if I ask for: >> >> merge(authors, books, all=TRUE) >> >> I got: >> >> >> But we know that "Rippley" corresponds to "Ripley", "Terney" to "Tierney" >> and "Venable" to "Venables". I was wondering if there was any way to work >> around this problem. My orginal data have around 27,000 name entries, and >> if >> I take "all=FALSE", this database drops out to around 17,000, most >> because >> mispelling (or truncated expressions). If I take "all=TRUE", I got many >> of >> this > <NA> > cases like the example above. >> >> Has anyone experienced this? Any idea how I can get out? I'm thinking to >> take the longest match possible to each entry. For example, in >> "Venable"/"Venables" there is a 87.5% match. As I have name and surname, >> and >> also auxiliary keys to this match, I think this could work. >> >> Thank you in advance. >> >> >> >> ----- >> Victor Delgado >> cedeplar.ufmg.br P.H.D. student >> www.fjp.mg.gov.br reseacher >> -- >> View this message in context: >> http://r.789695.n4.nabble.com/Merge-data-frame-with-mispelling-characters-tp4648255.html >> Sent from the R help mailing list archive at Nabble.com. >> >> ______________________________________________ >>
> R-help@ > mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. > > David Winsemius, MD > Alameda, CA, USA > > ______________________________________________ > R-help@ > mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. It's amazing to get such useful answers so fast. I did not know the RecordLinkage package, it looks very sophisticated and useful for this kind of demand. I just made some tests and I think it could be very useful. I'm working with portuguese spelling names, so I will also test agrep and see what function returns better results, giving less data loss. Thank you a lot, Jim Holtman and also David Winsemius. ----- Victor Delgado cedeplar.ufmg.br P.H.D. student UFOP assistant professor -- View this message in context: http://r.789695.n4.nabble.com/Merge-data-frame-with-mispelling-characters-tp4648255p4648266.html Sent from the R help mailing list archive at Nabble.com. ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.