For the record, in R-devel you can do

f <- read.table(url("http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt", encoding = "UTF-8-BOM"),
                quote="", sep="|", stringsAsFactors=FALSE)
f[1,]
   V1 V2 V3   V4   V5
1 aar    aa Afar afar
charToRaw(f[1,1])
[1] 61 61 72

Whether this works with "UTF-8" depends on the implementation of iconv: strangely Microsoft remove BOMs in UTF-16 but not in UTF-8 (although almost the only people to put them there in UTF-8 are Microsoft's applications).
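
For anyone who wants to try the "UTF-8-BOM" encoding without hitting the network, here is a minimal sketch against a local temporary file. The file contents and the variable names (tmp, bom, with_bom_removed) are made up for illustration, and it assumes an R version in which connections accept "UTF-8-BOM" (R-devel at the time of writing).

```r
# Write a small pipe-separated file that starts with a UTF-8 BOM (EF BB BF).
# The contents are invented sample rows, not the real ISO 639-2 table.
tmp <- tempfile(fileext = ".txt")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("aar||aa|Afar|afar\nabk||ab|Abkhazian|abkhaze\n")), tmp)

# fileEncoding is passed on to the underlying file() connection;
# "UTF-8-BOM" asks it to drop a leading BOM if one is present.
with_bom_removed <- read.table(tmp, fileEncoding = "UTF-8-BOM",
                               quote = "", sep = "|", stringsAsFactors = FALSE)

# The first field should now hold only the bytes of "aar".
charToRaw(with_bom_removed[1, 1])
unlink(tmp)
```

If the BOM had survived, charToRaw() would show ef bb bf in front of 61 61 72.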



On 13/09/2012 21:43, peter dalgaard wrote:
Pragmatically, one can zap the BOM from the output with

language.ISO.table[1,1] <- substring(language.ISO.table[1,1],2)

and be gone with it.
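
A slightly more defensive variant of the same idea, as a sketch only: strip_bom is a hypothetical helper (not part of R) that looks at the raw bytes and removes the BOM only when it is actually there, so it is harmless on input that never had one.

```r
# Hypothetical helper: drop a leading UTF-8 BOM (bytes EF BB BF) from a
# single string, and leave the string alone if no BOM is present.
strip_bom <- function(x) {
  bytes <- charToRaw(x)                      # charToRaw() uses the first element only
  bom <- as.raw(c(0xEF, 0xBB, 0xBF))
  if (length(bytes) >= 3 && identical(bytes[1:3], bom)) {
    bytes <- bytes[-(1:3)]
  }
  out <- rawToChar(bytes)
  Encoding(out) <- Encoding(x)               # keep whatever encoding mark x had
  out
}

# Build a test string byte by byte so the example does not depend on the locale.
with_bom <- rawToChar(as.raw(c(0xEF, 0xBB, 0xBF, 0x61, 0x61, 0x72)))
strip_bom(with_bom)
strip_bom("aar")
```

Both calls should give "aar"; the second shows that a clean string passes through unchanged.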

It would be nicer to zap the BOM before read.table, though. It does work for me 
with the below (notice that the BOM is a single character if you don't use 
useBytes=).

get.language.ISO.table
function () {
  socket <- url("http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt",
                open="r",encoding="utf-8");
  readChar(socket, nchar=1)
  data <- read.table(socket, as.is = TRUE, sep = "|", header = FALSE,
                     col.names = c("a3bibliographic","a3terminologic",
                       "a2","english","french"), quote="");
  close(socket);
  data
}
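
The point about the BOM being a single character can be checked directly. This is just an illustrative sketch: the BOM is the code point U+FEFF, which UTF-8 encodes as three bytes, so counting characters and counting bytes give different answers.

```r
bom <- "\ufeff"                                # U+FEFF, the byte-order mark

n_chars <- nchar(bom, type = "chars")          # length in characters
n_bytes <- nchar(bom, type = "bytes")          # length in UTF-8 bytes
bom_raw <- charToRaw(bom)                      # the encoded bytes themselves

n_chars
n_bytes
bom_raw
```

That is why reading one character from a connection opened with encoding="utf-8" skips the BOM, while with useBytes=TRUE three bytes have to be read.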


On Sep 13, 2012, at 22:26 , William Dunlap wrote:

It would be helpful if you showed your commands and printed
outputs, copied directly from your R session, from the beginning
to the end.  I put the call to sessionInfo() in my message because
it is probably relevant.  It is nice to completely include the original
email when responding to it so others can see the whole story in
one place.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com


-----Original Message-----
From: Sam Steingold [mailto:sam.steing...@gmail.com] On Behalf Of Sam Steingold
Sent: Thursday, September 13, 2012 1:18 PM
To: William Dunlap
Cc: peter dalgaard; r-help@r-project.org
Subject: Re: [R] cannot read iso639 table

* William Dunlap <jqha...@gvopb.pbz> [2012-09-13 19:50:21 +0000]:

On Windows with R-2.15.1 in a 1252 locale, I had to read (and toss) out
the initial 3 bytes (the byte-order mark?) to make things work:

socket <- url("http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt", open="r", encoding="utf-8")
readChar(socket, nchars=3, useBytes=TRUE)
  [1] ""

confirmed - first 3 bytes are "\357\273\277"
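
One way to make the same check on a local copy is to read the first three bytes in binary mode. A minimal sketch, using a made-up temporary file in place of the downloaded table (tmp, first3, has_bom and octal are illustrative names):

```r
# Stand-in for a downloaded copy: a file that begins with a UTF-8 BOM.
tmp <- tempfile(fileext = ".txt")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("aar||aa|Afar|afar\n")), tmp)

# Read the first three bytes as raw and compare them with the BOM.
first3 <- readBin(tmp, what = "raw", n = 3)
has_bom <- identical(first3, bom)

# The same bytes in octal, for comparison with the "\357\273\277" notation.
octal <- format(as.octmode(as.integer(first3)))

has_bom
octal
unlink(tmp)
```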

d <- read.table(socket, quote="", sep="|", stringsAsFactors=FALSE)
dim(d)
  [1] 485   5
head(d)
     V1 V2 V3             V4      V5
  1 aar    aa           Afar    afar
  2 abk    ab      Abkhazian abkhaze
  3 ace             Achinese    aceh
  4 ach                Acoli   acoli
  5 ada              Adangme adangme
  6 ady       Adyghe; Adygei  adyghé

alas, this is all I get:

Warning message:
In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  invalid input found on input connection 'http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt'

  a3bibliographic a3terminologic a2        english  french
1             aar             NA aa           Afar    afar
2             abk             NA ab      Abkhazian abkhaze
3             ace             NA          Achinese    aceh
4             ach             NA             Acoli   acoli
5             ada             NA           Adangme adangme
6             ady             NA    Adyghe; Adygei   adygh

note that the first non-ASCII character terminates the input.

so, I still cannot read the data from the URL.

I can read the file though - with quote="" (thanks Peter!) -
except that the first record is "\357\273\277aar".


--
Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X 11.0.11103000
http://www.childpsy.net/ http://thereligionofpeace.com
http://mideasttruth.com http://iris.org.il http://jihadwatch.org
The only thing worse than X Windows: (X Windows) - X



--
Brian D. Ripley,                  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

