I think you have to assume there could be other coding errors as well, such as misspellings or abbreviating Street as St. You will probably need to use sub() to correct A2 and possibly A1 before trying to merge. To figure out where the problems are, you might try something like this. In the example, a and b stand in for paste(A1, A2) and paste(B1, B2); the last command lists the entries of a that do not match anything in b.
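For the sub() clean-up mentioned above, a minimal sketch might look like the following. The sample values, the list of abbreviations, and the names norm_type, key_a and key_b are made up for illustration and are not taken from the real files. The rule for the family name (last word of A2, first word of B2) assumes single-word family names, which may not hold in your data.

```r
# Hypothetical sample values; the real F1/F2 columns may look different.
A1 <- c("Avenue", "Ave.", "St", "street", "Rd")
A2 <- c("John Kennedy", "JF Kennedy", "J. Kennedy")
B2 <- "Kennedy John"

# Bring the street type to one spelling (the variants here are assumed).
norm_type <- function(x) {
  x <- tolower(trimws(x))
  x <- sub("\\.$", "", x)          # drop a trailing period: "ave." -> "ave"
  x <- sub("^st$", "street", x)    # expand abbreviations, whole value only
  x <- sub("^rd$", "road", x)
  x <- sub("^ave$", "avenue", x)
  x
}
norm_type(A1)

# Crude family-name key: last word of A2, first word of B2.
key_a <- tolower(sub("^.*[[:space:]]", "", trimws(A2)))
key_b <- tolower(sub("[[:space:]].*$", "", trimws(B2)))
key_a %in% key_b
```

Normalizing both files this way first should shrink the list of non-matches that the session below prints.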
> set.seed(42)
> a <- paste(sample(c(letters, LETTERS[1:5]), 150, replace=TRUE),
+     sample(c("St", "Rd", "Ave"), 150, replace=TRUE))
> b <- paste(sample(letters, 1000, replace=TRUE),
+     sample(c("St", "Rd", "Ave"), 1000, replace=TRUE))
> (ua <- sort(unique(a)))
 [1] "a Ave" "a Rd"  "A Rd"  "a St"  "A St"  "b Rd"  "B Rd"  "b St"  "c Ave"
[10] "C Ave" "c Rd"  "C Rd"  "C St"  "D Ave" "D Rd"  "d St"  "D St"  "e Ave"
[19] "E Ave" "e Rd"  "E Rd"  "e St"  "E St"  "f Ave" "f Rd"  "g Ave" "g Rd"
[28] "g St"  "h Ave" "h Rd"  "h St"  "i Ave" "i Rd"  "i St"  "j Rd"  "j St"
[37] "k Ave" "k Rd"  "k St"  "l St"  "m Ave" "m Rd"  "m St"  "n Ave" "n St"
[46] "o Ave" "o Rd"  "o St"  "p St"  "q Ave" "q Rd"  "q St"  "r Ave" "r Rd"
[55] "r St"  "s Ave" "s Rd"  "s St"  "t Ave" "t Rd"  "t St"  "u Ave" "u Rd"
[64] "u St"  "v Ave" "v Rd"  "v St"  "w Ave" "w Rd"  "w St"  "x Rd"  "x St"
[73] "y Ave" "y Rd"  "z Ave" "z Rd"  "z St"
> (ub <- sort(unique(b)))
 [1] "a Ave" "a Rd"  "a St"  "b Ave" "b Rd"  "b St"  "c Ave" "c Rd"  "c St"
[10] "d Ave" "d Rd"  "d St"  "e Ave" "e Rd"  "e St"  "f Ave" "f Rd"  "f St"
[19] "g Ave" "g Rd"  "g St"  "h Ave" "h Rd"  "h St"  "i Ave" "i Rd"  "i St"
[28] "j Ave" "j Rd"  "j St"  "k Ave" "k Rd"  "k St"  "l Ave" "l Rd"  "l St"
[37] "m Ave" "m Rd"  "m St"  "n Ave" "n Rd"  "n St"  "o Ave" "o Rd"  "o St"
[46] "p Ave" "p Rd"  "p St"  "q Ave" "q Rd"  "q St"  "r Ave" "r Rd"  "r St"
[55] "s Ave" "s Rd"  "s St"  "t Ave" "t Rd"  "t St"  "u Ave" "u Rd"  "u St"
[64] "v Ave" "v Rd"  "v St"  "w Ave" "w Rd"  "w St"  "x Ave" "x Rd"  "x St"
[73] "y Ave" "y Rd"  "y St"  "z Ave" "z Rd"  "z St"
> ua[!(ua %in% ub)]
 [1] "A Rd"  "A St"  "B Rd"  "C Ave" "C Rd"  "C St"  "D Ave" "D Rd"  "D St"
[10] "E Ave" "E Rd"  "E St"

-------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77840-4352

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of A M Lavezzi
Sent: Friday, June 21, 2013 4:56 AM
To: r-help
Subject: [R] matching similar character strings

Hello
everybody,

I have this problem: I need to match an address database F1 with the information contained in a toponymic database F2.

F1 has three columns and 800 rows. The columns are:

A1. Street/Road/Avenue
A2. Name
A3. Number

Consider for instance Avenue J. Kennedy, 3011. In F1 this is:

A1. Avenue
A2. J. Kennedy
A3. 3011

F2 instead has 20000 rows and five columns:

B1. Street/Road/Avenue
B2. Name
B3. Starting Street Number
B4. Ending Street Number
B5. Census section

So my problem is assigning the B5 census section to every observation of F1 for which A1=B1, A2=B2, and A3 lies between B3 and B4.

The difficulty is that while the information in A2 is irregularly recorded, B2 has a fixed format: Family name (space) Given name. So while B2 contains:

Kennedy John

A2 could contain:

John Kennedy
JF Kennedy
J. Kennedy

and so on.

Thanks,
Mario

--
Andrea Mario Lavezzi
Dipartimento di Scienze Giuridiche, della Società e dello Sport
Sezione Diritto e Società
Università di Palermo
Piazza Bologni 8
90134 Palermo, Italy
tel. ++39 091 23892208
fax ++39 091 6111268
skype: lavezzimario
email: mario.lavezzi (at) unipa.it
web: http://www.unipa.it/~mario.lavezzi

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.