andrewH wrote:

However, my suspicion is that there are some funky characters, either
control characters or characters with some non-standard encoding, somewhere
in this 14 gig file. Moreover, I am concerned that these characters may
cause me trouble down the road even if I use a different approach to getting
columns out of the file.

This is not an R solution, but here's a Windows utility I wrote that produces a table of frequency counts for every byte value, x00 to xFF, in a file.

http://www.efg2.com/Lab/OtherProjects/CharCount.ZIP

Normally, you'll want to scrutinize anything below x20 or above x7E, since the printable ASCII characters are in the range x20 to x7E. You can see how many tab (x09) characters are in the file, and whether the line endings are from Linux (x0A alone) or Windows (x0D followed by x0A).
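If you would rather stay inside R, a similar byte table can be sketched with readBin. This is only a minimal sketch, not what CharCount does internally; the function name count_bytes and the chunk size are my own assumptions. It reads the file in chunks, so a 14 gig file never has to fit in memory.

```r
# Count how often each byte value (x00 to xFF) occurs in a file.
# count_bytes and chunk_size are illustrative names, not part of CharCount.
count_bytes <- function(path, chunk_size = 2^24) {
  counts <- numeric(256)            # doubles, so huge files cannot overflow
  con <- file(path, open = "rb")    # binary mode: no line-ending translation
  on.exit(close(con))
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0) break   # end of file
    # byte value v is tallied in bin v + 1
    counts <- counts + tabulate(as.integer(chunk) + 1L, nbins = 256)
  }
  names(counts) <- sprintf("x%02X", 0:255)
  counts
}

# Usage (the file name is hypothetical):
#   counts  <- count_bytes("bigfile.txt")
#   suspect <- counts[c(1:32, 128:256)]     # below x20 or above x7E
#   suspect[suspect > 0]
#   counts[c("x09", "x0A", "x0D")]          # tabs, LF, CR
```

If the x0A and x0D counts match, the file most likely has Windows line endings; x0A with no x0D suggests Linux.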


The ZIP includes the Delphi source code as well as a Windows executable. No installation is needed: just run the EXE after unzipping. I made a change several months ago to allow drag-and-drop, so you can drop the file on the application to have its characters counted.

Once you find problem characters in the file, you can read the file as character data and use sub/gsub or other tools to remove or alter them.
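As a rough sketch of that gsub step, the following drops every byte that is not a tab or a printable ASCII character and processes the file a block of lines at a time. The function names strip_nonprintable and clean_file, and the block size, are my own assumptions; whether deleting the characters (rather than replacing them) is right depends on what the byte counts show for your file.

```r
# Remove every byte that is not a tab or printable ASCII (x20 to x7E).
# useBytes = TRUE makes gsub work byte by byte, so bytes that are invalid
# in the current locale do not raise an error.
strip_nonprintable <- function(x) {
  gsub("[^\\t\\x20-\\x7E]", "", x, perl = TRUE, useBytes = TRUE)
}

# Clean a large file block by block (names and n_lines are illustrative).
clean_file <- function(infile, outfile, n_lines = 100000L) {
  inp <- file(infile, open = "r")
  out <- file(outfile, open = "w")
  on.exit({ close(inp); close(out) })
  repeat {
    # skipNul = TRUE drops embedded x00 bytes, which R strings cannot hold
    lines <- readLines(inp, n = n_lines, warn = FALSE, skipNul = TRUE)
    if (length(lines) == 0) break
    writeLines(strip_nonprintable(lines), out, useBytes = TRUE)
  }
  invisible(outfile)
}

# Usage (file names are hypothetical):
#   clean_file("bigfile.txt", "bigfile_clean.txt")
```

Note that the cleaned file is written with the line endings of the platform R is running on, and any stray x0D characters are removed along with the other control characters.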

efg
Earl F Glynn
UMKC School of Medicine
Center for Health Insights

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
