with a fresh restart

test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
> test
[1] "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
> sub(".*(\\d{5}).*", "\\1", test)
[1] "</th>"
> sub(".*([0-9]{5}).*", "\\1", test)
[1] "88958"
> test2<-"aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
> sub(".*(\\d{5}).*", "\\1", test2)
[1] "WWWWW"
> sub(".*(\\d{5}).*", "\\1", test2)
[1] "WWWWW"
> sub(".*([0-9]{5}).*", "\\1", test2)
[1] "12345"

Steve.

On Wed, May 5, 2010 at 3:20 PM, David Winsemius <dwinsem...@comcast.net> wrote:

> On May 5, 2010, at 5:35 PM, Gabor Grothendieck wrote:
>
>> Here are two ways to extract 5 digits.
>>
>> In the first one \\1 refers to the portion matched between the
>> parentheses in the regular expression.
>>
>> In the second one strapply is like apply where the object to be worked
>> on is the first argument (array for apply, string for strapply) the
>> second modifies it (which dimension for apply, regular expression for
>> strapply) and the last is a function which acts on each value
>> (typically each row or column for apply and each match for strapply).
>> In this case we use c as our function to just return all the results.
>> They are returned in a list with one component per string but here
>> test is just a single string so we get a list one long and we ask for
>> the contents of the first component using [[1]].
>>
>> # 1 - sub
>> sub(".*(\\d{5}).*", "\\1", test)
>
> > test
> [1] "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>
> I get different results than I expected given that "\\d" should be
> synonymous with "[0-9]":
>
> > sub(".*([0-9]{5}).*", "\\1", test)
> [1] "88958"
>
> > sub(".*(\\d{5}).*", "\\1", test)
> [1] "</th>"
>
> --
> David.
>
>> # 2 - strapply - see http://gsubfn.googlecode.com
>> library(gsubfn)
>> strapply(test, "\\d{5}", c)[[1]]
>>
>> On Wed, May 5, 2010 at 5:13 PM, steven mosher <mosherste...@gmail.com> wrote:
>>
>>> Given a text like
>>>
>>> I want to be able to extract a matched regular expression from a piece of
>>> text.
>>>
>>> this apparently works, but is pretty ugly
>>> # some html
>>> test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>>> # a pattern to extract 5 digits
>>> pattern<-"[0-9]{5}"
>>> # regexpr returns a start point[1] and an attribute "match.length"
>>> attr(,"match.length)
>>> # get the substring from the start point to the stop point.. where stop =
>>> start +length-1
>>> answer<-substr(test,regexpr(pattern,test)[1],regexpr(pattern,test)[1]+attr(regexpr(pattern,test),"match.length")-1)
>>> answer
>>> [1] "88958"
>>>
>>> I tried using sub(pattern, replacement, x ) with a regexp that captured
>>> the group. I'd found an example of this in the mails
>>> but it didnt seem to work..
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> David Winsemius, MD
> West Hartford, CT
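
As an aside on the substr()/regexpr() approach quoted above: it calls regexpr() three times on the same input, and the same answer can probably be had by calling it once and reusing the match object. A minimal sketch, assuming base R only. The names m, start and len are mine, not from the thread, and regmatches() is an assumption about your installation: it was added to base R in a later release than the one discussed here, so it may be missing from older versions.

```r
# HTML fragment and pattern taken from the original question
test <- "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
pattern <- "[0-9]{5}"

# Call regexpr() once; it returns the start position of the first match
# and carries the match length as the "match.length" attribute.
m <- regexpr(pattern, test)
start <- as.integer(m)
len <- attr(m, "match.length")

# Same arithmetic as in the question: stop = start + length - 1
answer <- substr(test, start, start + len - 1)
answer   # "88958"

# Shortcut, assuming regmatches() is available in your R version:
# it extracts the matched text directly from the regexpr() result.
answer2 <- regmatches(test, m)
answer2  # "88958"
```

This sidesteps the greedy ".*" in the sub() version altogether, so the "\\d" versus "[0-9]" discrepancy shown above does not come into play; it does not explain that discrepancy.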