Thanks, Enrico! I did this, and if I did it right, there are no nonstandard characters. So now I suspect a size limit internal to the filehash package, and I am trying to chase that down. Your help is much appreciated.
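For anyone following along, a chunked scan of the kind suggested in the quoted message below might look roughly like this. It is only a sketch and not the exact code run in this thread: `scan_odd_chars` is a hypothetical helper name, the 10000-line chunk size is simply carried over from the suggestion, and "non-standard" is assumed here to mean either a byte outside 7-bit ASCII (tested the way tools::showNonASCII does it, via iconv) or a control character other than tab.

```r
## Hypothetical helper (name and return format are assumptions, not from the
## thread): scan a text file chunk by chunk and report the line numbers that
## contain non-ASCII bytes or control characters other than tab.
scan_odd_chars <- function(path, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  found <- list()
  offset <- 0L                        # number of lines read in earlier chunks
  repeat {
    lines <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(lines) == 0L) break    # end of file

    ## Same test as tools::showNonASCII: conversion to ASCII fails (NA) or
    ## changes the string when a byte above 0x7F is present.
    asc <- iconv(lines, "latin1", "ASCII")
    non_ascii <- is.na(asc) | asc != lines

    ## Control bytes except tab; LF and CR never appear here because
    ## readLines() treats them as line terminators.
    control <- grepl("[\\x01-\\x08\\x0B\\x0C\\x0E-\\x1F\\x7F]", lines,
                     perl = TRUE, useBytes = TRUE)

    bad <- which(non_ascii | control)
    if (length(bad) > 0L) {
      found[[length(found) + 1L]] <- data.frame(
        line      = offset + bad,
        non_ascii = non_ascii[bad],
        control   = control[bad]
      )
    }
    offset <- offset + length(lines)
  }
  if (length(found) == 0L) {
    return(data.frame(line = integer(0), non_ascii = logical(0),
                      control = logical(0)))
  }
  do.call(rbind, found)
}
```

Calling `scan_odd_chars("myfile.csv")` returns a data frame with one row per suspicious line, and an empty data frame if nothing is flagged. Only one chunk is held in memory at a time, so the file size should not matter beyond run time. The warnings and caveats on the ?iconv help page still apply, since its behaviour on unconvertible input can differ between platforms.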
On Mon, Dec 9, 2013 at 11:11 PM, Enrico Schumann <e...@enricoschumann.net> wrote:

> On Mon, 09 Dec 2013, andrewH <ahoer...@rprogress.org> writes:
>
> > I have a humongous csv file containing census data, far too big to read
> > into RAM. I have been trying to extract individual columns from this file
> > using the colbycol package. This works for certain subsets of the columns,
> > but not for others. I have not yet been able to precisely identify the
> > problem columns, as there are 731 columns and running colbycol on the file
> > on my old slow machine takes about 6 hours.
> >
> > However, my suspicion is that there are some funky characters, either
> > control characters or characters with some non-standard encoding,
> > somewhere in this 14 gig file. Moreover, I am concerned that these
> > characters may cause me trouble down the road even if I use a different
> > approach to getting columns out of the file.
> >
> > Is there an R utility that will search through my file without trying to
> > read it all into memory at one time and find non-standard characters or
> > misplaced (non-end-of-line) control characters? Or some R code to the same
> > end? Even if the real problem ultimately proves to be different, it would
> > be helpful to eliminate this possibility. And this is also something I
> > would routinely run on files from external sources if I had it.
> >
> > I am working in a Windows XP environment, in case that makes a difference.
> >
> > Any help anyone could offer would be greatly appreciated.
> >
> > Sincerely, andrewH
>
> You could process your file in chunks:
>
>   f <- file("myfile.csv", open = "r")
>   lines <- readLines(f, n = 10000)
>   ## do something with lines
>   lines <- readLines(f, n = 10000)
>   ## do something with lines
>   ## ....
>
> To find 'non-standard characters' you will need to define what
> 'non-standard characters' are. But perhaps ?tools:::showNonASCII, which
> uses ?iconv, can help you.
> (Please note the warnings and caveats on the functions' help pages.)
>
> --
> Enrico Schumann
> Lucerne, Switzerland
> http://enricoschumann.net

--
J. Andrew Hoerner
Director, Sustainable Economics Program
Redefining Progress
(510) 507-4820