On 08/11/2012 07:11, Lee Hachadoorian wrote:
I have a large (105MB) data file, tab-delimited with a header. There are
some odd characters at the beginning of the file that are preventing it
from being read by R.

 > dfTemp = read.delim(filename)
Error in make.names(col.names, unique = TRUE) :
invalid multibyte string at '<ff><fe>m'

When I view the file with head, I see:

��muni_code parcel_id…

The file is too large to edit in a graphical text editor (gedit). I
tried just dropping the header row with

sed '1 d' <old.txt >new.txt

but then

 > dfTemp = read.delim(filename)
Error in read.table(file = file, header = header, sep = sep, quote =
quote, :
empty beginning of file

I tried some other shenanigans with sed (with which I am not really
experienced) but did not get a usable file. Does anyone have any ideas
for how to (a) directly read this into R, skipping the offending line or
characters, or (b) preprocess it so that I can read it into R?

That is a BOM mark in UCS-2 encoding.  Was this file created on Windows?

If so, try using iconv to convert it to UTF-8, or in R use

read.delim(filename, fileEncoding = "UCS-2LE")
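
For the iconv route, a minimal sketch follows. It builds a tiny stand-in file (the names sample_utf16.txt and sample_utf8.txt are hypothetical) that starts with the same <ff><fe> bytes, then converts it. It assumes an iconv that consumes the byte-order mark when the source encoding is given as UTF-16, which glibc's iconv does. With UTF-16LE or UCS-2LE as the source name, the mark may instead be carried through into the output.

```shell
# Create a small little-endian UTF-16 sample: BOM (ff fe), then "muni" and a newline.
printf '\377\376m\000u\000n\000i\000\n\000' > sample_utf16.txt

# Convert to UTF-8. Naming the source "UTF-16" lets iconv read the BOM
# to pick the byte order and (assumption: glibc behaviour) drop it.
iconv -f UTF-16 -t UTF-8 sample_utf16.txt > sample_utf8.txt

# Inspect the first bytes of the result; they should be plain ASCII now.
head -c 4 sample_utf8.txt | od -c
```

On the real data the same iconv call would take old.txt as input and write new.txt, which read.delim should then be able to read without a fileEncoding argument.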



Best,
--Lee

R version 2.14.1 (2011-12-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Linux Mint 13

Yes, but what locale? See the 'at a minimum' information asked for in the posting guide.



--
Brian D. Ripley,                  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.