Yes, thanks ... that works.

-----Original Message-----
From: Bert Gunter [mailto:gunter.ber...@gene.com]
Sent: 14 October 2010 21:26
To: Mike Marchywka
Cc: santosh.srini...@gmail.com; r-help@r-project.org
Subject: Re: [R] Drop matching lines from readLines
If I understand correctly, the poster knows what regex error pattern to look for, in which case (modulo memory capacity -- but 200 MB should not be a problem, I think) isn't simply

  cleanData <- dirtyData[!grepl("errorPatternregex", dirtyData)]

sufficient?

Cheers,
Bert

On Thu, Oct 14, 2010 at 4:05 AM, Mike Marchywka <marchy...@hotmail.com> wrote:
>
> ----------------------------------------
>> From: santosh.srini...@gmail.com
>> To: r-help@r-project.org
>> Date: Thu, 14 Oct 2010 11:27:57 +0530
>> Subject: [R] Drop matching lines from readLines
>>
>> Dear R-group,
>>
>> I have some noise in my text file (coding issues!). I imported a 200 MB
>> text file using readLines and used grep to find the lines with the error.
>>
>> What is the easiest way to drop those lines? I plan to write the
>> "cleaned" data set back to my base file.
>
> Generally for text processing I've been using utilities external to R,
> although there may be R alternatives that work better for you. You
> mention grep; I've suggested sed as a general way to fix formatting
> problems, and there is also a utility called "uniq" on Linux or Cygwin.
> I have gotten into the habit of using these for a variety of data
> manipulation tasks, and only feed clean data into R.
>
> $ echo -e a bc\\na bc
> a bc
> a bc
>
> $ echo -e a bc\\na bc | uniq
> a bc
>
> $ uniq --help
> Usage: uniq [OPTION]... [INPUT [OUTPUT]]
> Filter adjacent matching lines from INPUT (or standard input),
> writing to OUTPUT (or standard output).
>
> With no options, matching lines are merged to the first occurrence.
>
> Mandatory arguments to long options are mandatory for short options too.
>   -c, --count           prefix lines by the number of occurrences
>   -d, --repeated        only print duplicate lines
>   -D, --all-repeated[=delimit-method]  print all duplicate lines
>                         delimit-method={none(default),prepend,separate}
>                         Delimiting is done with blank lines
>   -f, --skip-fields=N   avoid comparing the first N fields
>   -i, --ignore-case     ignore differences in case when comparing
>   -s, --skip-chars=N    avoid comparing the first N characters
>   -u, --unique          only print unique lines
>   -z, --zero-terminated end lines with 0 byte, not newline
>   -w, --check-chars=N   compare no more than N characters in lines
>       --help            display this help and exit
>       --version         output version information and exit
>
> A field is a run of blanks (usually spaces and/or TABs), then non-blank
> characters. Fields are skipped before chars.
>
> Note: 'uniq' does not detect repeated lines unless they are adjacent.
> You may want to sort the input first, or use `sort -u' without `uniq'.
> Also, comparisons honor the rules specified by `LC_COLLATE'.
>
>> Thanks.

--
Bert Gunter
Genentech Nonclinical Biostatistics

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
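[For completeness, a minimal sketch of the full read / filter / write-back round trip the original poster described, using only base R. The file name and the error-pattern regex below are placeholders, not values from the thread:]

  ## Sketch of the read-filter-write round trip (base R only).
  ## "mydata.txt" and "errorPatternRegex" are placeholders -- substitute
  ## your own file name and the regex that matches the bad lines.
  dirtyData <- readLines("mydata.txt")
  cleanData <- dirtyData[!grepl("errorPatternRegex", dirtyData)]
  ## This overwrites the base file, so keep a backup of the original first.
  writeLines(cleanData, "mydata.txt")

[writeLines() terminates each element with a newline, mirroring what readLines() stripped on input, so the cleaned file keeps the same line structure minus the matched lines.]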