On Thu, Jun 7, 2012 at 1:40 PM, emorway <emor...@usgs.gov> wrote:
> Thanks for your suggestions. Bert, in your response you raised my awareness
> to "regular expressions". Are regular expressions the same across various
> languages? Consider the following line of text:
>
> txt_line<-" PERCENT DISCREPANCY = 0.01 PERCENT DISCREPANCY = -0.05"
>
> It seems python uses the following line of code to extract the two values in
> "txt_line" and store them in a variable called "v":
>
> v = re.findall("[+-]? *(?:\d+(?:\.\d*)|\.\d+)(?:[eE][+-]?\d+)?", line)
> #v[0] 0.01
> #v[1] -0.05
>
> I tried something similar in R (but it didn't work) by using the same
> regular expression, but got an error:
>
> edm<-grep("[+-]? *(?:\d+(?:\.\d*)|\.\d+)(?:[eE][+-]?\d+)?",txt_line)
> #Error: '\d' is an unrecognized escape in character string starting "[+-]? *(?:\d"
>
> I'm not even sure which function in R most efficiently extracts the values
> from "txt_line". Basically, I want to peel out the values and think I can
> use the decimal point to construct the regular expression, but don't know
> where to go from here?
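
On the error itself: inside an R string literal the backslash is an escape character, so "\d" has to be written "\\d" for the regex engine to receive "\d". Note also that grep returns the indices of the elements that match, not the matched text. The following is only a minimal base-R sketch, assuming the Python pattern is kept unchanged apart from the doubled backslashes and is run with perl = TRUE; the names num_pat, m and v are chosen here for illustration.

```r
txt_line <- " PERCENT DISCREPANCY = 0.01 PERCENT DISCREPANCY = -0.05"

# Same pattern as in the Python call, with every backslash doubled so that
# R's string parser passes a single backslash on to the regex engine.
num_pat <- "[+-]? *(?:\\d+(?:\\.\\d*)|\\.\\d+)(?:[eE][+-]?\\d+)?"

# gregexpr locates every match; regmatches pulls out the matched substrings.
# perl = TRUE selects Perl-compatible regular expressions.
m <- gregexpr(num_pat, txt_line, perl = TRUE)

# A match may carry leading spaces (the pattern allows them);
# as.numeric ignores leading and trailing whitespace.
v <- as.numeric(regmatches(txt_line, m)[[1]])
print(v)
```

This needs no add-on packages, at the cost of being a little more verbose than a single function call.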

Try this. strapply applies the function (3rd argument) to each match of the
regular expression (2nd argument) and outputs the result of the function. The
regular expression we have used matches a minus sign or digit followed by one
or more non-space characters. That seems good enough for this simple example
but, of course, it can be changed.

> library(gsubfn)
> p <- "[-0-9]\\S+"
> txt_line <- " PERCENT DISCREPANCY = 0.01 PERCENT DISCREPANCY = -0.05"
>
> strapply(txt_line, p, as.numeric)[[1]]
[1]  0.01 -0.05

or using strapplyc (which is similar but uses c as the function and is
optimized for speed):

> as.numeric(strapplyc(txt_line, p)[[1]])
[1]  0.01 -0.05

If we are only parsing a few lines then the speed does not matter, but if
there are large amounts to parse then be sure to have the tcltk package
installed to get the best speed from the gsubfn functions (on Windows and
most, but not all, Linux systems tcltk is installed by default; on a few you
have to install it yourself). If you don't have tcltk, the gsubfn package
will fall back to R code, which is slower. Also, as noted, strapplyc is
faster than strapply. There are arguments and options that can override the
defaults.

The gsubfn home page is at http://gsubfn.googlecode.com

Regular expressions are largely the same but not 100% identical across
languages. There are some links to regular expression info in different
languages at the bottom of the home page just listed. R can use R or perl
regular expressions, and the gsubfn functions can, in addition, use tcl
regular expressions.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com