Thanks, Enrico! I did this, and if I did it right, there are no nonstandard
characters. So now I am suspecting a size limit internal to the filehash
package, and trying to chase that down. Your help is much appreciated.
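
For anyone following along, here is a minimal sketch of the kind of chunked
check discussed below. It is not necessarily what was run here. The function
name find_odd_lines, the chunk size, and the working definition of
"nonstandard" (any byte other than tab and printable ASCII) are all
assumptions made for illustration. Embedded NUL bytes are not detected by
this sketch.

```r
## Hypothetical helper (not from the thread): scan a text file in chunks and
## return the numbers of lines that contain a byte other than tab or
## printable ASCII (0x20 to 0x7e). Only one chunk is held in memory at a time.
find_odd_lines <- function(path, chunk_size = 10000L) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  hits <- numeric(0)   # line numbers of offending lines
  offset <- 0          # number of lines read so far (double, to avoid overflow)
  repeat {
    lines <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(lines) == 0L) break   # end of file
    ## useBytes = TRUE matches byte by byte, so invalid encodings do not
    ## raise errors; they simply show up as bytes outside the allowed range.
    bad <- grepl("[^\t\x20-\x7e]", lines, useBytes = TRUE, perl = TRUE)
    hits <- c(hits, offset + which(bad))
    offset <- offset + length(lines)
  }
  hits
}
```

Calling find_odd_lines("myfile.csv") returns a vector of line numbers, which
is empty if nothing outside that range was found. The flagged lines could
then be inspected more closely, for example with tools::showNonASCII.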


On Mon, Dec 9, 2013 at 11:11 PM, Enrico Schumann <e...@enricoschumann.net> wrote:

> On Mon, 09 Dec 2013, andrewH <ahoer...@rprogress.org> writes:
>
> > I have a humongous csv file containing census data, far too big to read
> > into RAM. I have been trying to extract individual columns from this file
> > using the colbycol package. This works for certain subsets of the columns,
> > but not for others. I have not yet been able to precisely identify the
> > problem columns, as there are 731 columns and running colbycol on the file
> > on my old slow machine takes about 6 hours.
> >
> > However, my suspicion is that there are some funky characters, either
> > control characters or characters with some non-standard encoding,
> > somewhere in this 14 gig file. Moreover, I am concerned that these
> > characters may cause me trouble down the road even if I use a different
> > approach to getting columns out of the file.
> >
> > Is there an R utility that will search through my file, without trying to
> > read it all into memory at one time, and find non-standard characters or
> > misplaced (non-end-of-line) control characters? Or some R code to the same
> > end? Even if the real problem ultimately proves to be different, it would
> > be helpful to eliminate this possibility. And this is also something I
> > would routinely run on files from external sources if I had it.
> >
> > I am working in a Windows XP environment, in case that makes a difference.
> >
> > Any help anyone could offer would be greatly appreciated.
> >
> > Sincerely, andrewH
>
> You could process your file in chunks:
>
>   f <- file("myfile.csv", open = "r")
>   repeat {
>     lines <- readLines(f, n = 10000)
>     if (length(lines) == 0L) break  ## end of file reached
>     ## do something with lines
>   }
>   close(f)
>
> To find 'non-standard characters' you will need to define what
> 'non-standard characters' are.  But perhaps ?tools::showNonASCII, which
> uses ?iconv, can help you.  (Please note the warnings and caveats on the
> functions' help pages.)
>
>
> --
> Enrico Schumann
> Lucerne, Switzerland
> http://enricoschumann.net
>



-- 
J. Andrew Hoerner
Director, Sustainable Economics Program
Redefining Progress
(510) 507-4820


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
