I think 1 GB is small enough that this can be done easily and efficiently in R. The key is: regular expressions are your friend.
I shall assume that the text file has been read into R as a single character string named "mystring". The code below could easily be modified to work on a vector of strings if the file is read in line by line. Alternatively, paste(yourvec, collapse = "") could be used to collapse such a vector into a single string, which would be necessary if, for instance, the key info could be broken up over several lines/elements of the vector.

Here is a small reproducible example of one way to do this, in which the keyword string to search for is "qdxpRxt" and the next 3 characters that follow it are what you wish to extract.

### Create the example
test <- lapply(sample(1:20, 30, rep = TRUE),
               function(n) paste(sample(c(letters, LETTERS, rep(" ", 12)),
                                        n, rep = TRUE),
                                 collapse = ""))
mystring <- paste(lapply(test,
                         function(x) paste(x, "qdxpRxt",
                                           floor(runif(1, 0, 1000)),
                                           sep = "")),
                  collapse = "")

## Extract the strings of interest
extracted <- strsplit(gsub(".*?qdxpRxt(.{3})", "\\1%&", mystring), "%&")

Note that this is not quite right yet: the extracted strings might include characters, not just numbers; the strings have to be converted from character to numeric; and I assumed that "%&" would not occur in the extracted strings and so could be used as a split string. I leave the fixups to you.

Note also that I make no claims about the efficiency of this approach vis-à-vis external tools like gawk -- of which there are many -- nor about whether any of several R string-handling packages might also do the job easily. I only wanted to point out that it appears to be straightforward in vanilla R, since R already has a regular expression engine built in.
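For instance (just a sketch, and it assumes the characters of interest are digits), the fixups could run along these lines:

## Possible fixups -- a sketch only; assumes the wanted characters are digits
## and, as above, that "%&" never occurs in the data
pieces <- unlist(extracted)
pieces <- gsub("[^0-9]", "", pieces)           # strip anything that is not a digit
vals   <- as.numeric(pieces[nzchar(pieces)])   # convert the non-empty pieces

How strict to be about stray characters depends, of course, on what the real file looks like.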
Cheers,
Bert

On Wed, Jun 6, 2012 at 10:34 AM, Rainer Schuermann <rainer.schuerm...@gmx.net> wrote:
> R may not be the best tool for this.
> Did you look at gawk? It is also available for Windows:
> http://gnuwin32.sourceforge.net/packages/gawk.htm
>
> Once gawk has written a new file that only contains the lines / data you
> want, you could use R for the next steps.
> You can also run gawk from within R with the system() command.
>
> Rgds,
> Rainer
>
>
> On Wednesday 06 June 2012 09:54:15 emorway wrote:
>> useRs-
>>
>> I'm attempting to scan a more than 1 GB text file and read and store the
>> values that follow a specific key phrase that is repeated multiple times
>> throughout the file. A snippet of the text file I'm trying to read is
>> attached. The text file is a dumping ground for various aspects of the
>> performance of the model that generates it. Thus, the location of the
>> information I want to extract from the file is not in a fixed position
>> (i.e. it does not always appear in a predictable location, like line 1000,
>> or 2000, etc.). Rather, the desired values always follow a specific phrase:
>>
>> " PERCENT DISCREPANCY ="
>>
>> One approach I took was the following:
>>
>> library(R.utils)
>>
>> txt_con <- file(description = "D:/MCR_BeoPEST - Copy/MCR.out", open = "r")
>> # The above will need to be altered if one desires to test the code on the
>> # attached txt file, which will run much quicker
>> system.time(num_lines <- countLines("D:/MCR_BeoPEST - Copy/MCR.out"))
>> # elapsed time on the full 1 GB file was about 55 seconds on a 3.6 GHz Xeon
>> num_lines
>> # 14405247
>>
>> system.time(
>>   for (i in 1:num_lines) {
>>     txt_line <- readLines(txt_con, n = 1)
>>     if (length(grep(" PERCENT DISCREPANCY =", txt_line))) {
>>       pd <- c(pd, as.numeric(substr(txt_line, 70, 78)))
>>     }
>>   }
>> )
>> # Took about 5 minutes
>>
>> The inefficiencies in this approach arise from reading the file twice
>> (first to get num_lines, then to step through each line looking for the
>> desired text).
>>
>> Is there a way to speed this process up through the use of ?scan? I wasn't
>> able to get anything working, but what I had in mind was to scan through
>> the more than 1 GB file and, when the key phrase (e.g. " PERCENT
>> DISCREPANCY = ") is encountered, read and store the next 13 characters
>> (which will include some white space) as a numeric value, then resume the
>> scan until the key phrase is encountered again, repeating until the
>> end-of-file marker is reached. Is such an approach even possible, or is
>> line-by-line the best bet?
>>
>> http://r.789695.n4.nabble.com/file/n4632558/MCR.out MCR.out
>>
>> --
>> View this message in context:
>> http://r.789695.n4.nabble.com/extracting-values-from-txt-file-that-follow-user-supplied-quote-tp4632558.html
>> Sent from the R help mailing list archive at Nabble.com.

--
Bert Gunter
Genentech Nonclinical Biostatistics

Internal Contact Info:
Phone: 467-7374
Website:
http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
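For reference, here is a sketch of how the extraction in the original question could be done in a single pass, without countLines() and without the explicit loop. It reuses the file path, key phrase, and character positions from the original code; whether it is actually faster on a 1 GB file would have to be checked by timing it on the real thing.

## Single-pass sketch (assumes the whole file fits in memory and that each
## value follows the "=" on the same line, as in the original code)
fn    <- "D:/MCR_BeoPEST - Copy/MCR.out"       # path from the original post
lines <- readLines(fn)                         # one pass over the file
hits  <- grep(" PERCENT DISCREPANCY =", lines, value = TRUE, fixed = TRUE)
pd    <- as.numeric(substr(hits, 70, 78))      # same fixed columns as the original loop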