Hello R users,

I am having problems reading a CSV file that contains names with the character ÿ. In case it doesn't print correctly, it's Unicode character U+00FF, LATIN SMALL LETTER Y WITH DIAERESIS. My computer runs Windows 7 and R 3.2.4.
Initially, I configured my computer to run options(encoding="UTF-8") in my .Rprofile, since I prefer this encoding for portability; a good, modern standard, I thought. Rather than sending a large file, here is how to reproduce my problem:

options(encoding="UTF-8")
f <- file("test.txt", "wb")
writeBin(as.integer(c(65, 13, 10, 66, 255, 67, 13, 10, 68, 13, 10)), f, size=1)
close(f)
read.table("test.txt", encoding="latin1")
f <- file("test.txt", "rt")
readLines(f, encoding="latin1")
close(f)

I write a file with three lines, in binary to avoid any translation:

A
B\xffC
D

Upon reading, I get only:

> read.table("test.txt", encoding="latin1")
  V1
1  A
2  B
Warning messages:
1: In read.table("test.txt", encoding = "latin1") :
  invalid input found on input connection 'test.txt'
2: In read.table("test.txt", encoding = "latin1") :
  incomplete final line found by readTableHeader on 'test.txt'
> readLines(f, encoding="latin1")
[1] "A" "B"
Warning messages:
1: In readLines(f, encoding = "latin1") :
  invalid input found on input connection 'test.txt'
2: In readLines(f, encoding = "latin1") :
  incomplete final line found on 'test.txt'

Hence the file is truncated. However, \xff is a valid latin1 character, as one can check for instance at https://en.wikipedia.org/wiki/ISO/IEC_8859-1

I tried with a UTF-8 version of this file:

f <- file("test.txt", "wb")
writeBin(as.integer(c(65, 13, 10, 66, 195, 191, 67, 13, 10, 68, 13, 10)), f, size=1)
close(f)
read.table("test.txt", encoding="UTF-8")
f <- file("test.txt", "rt")
readLines(f, encoding="UTF-8")
close(f)

Since the character ÿ is encoded as the two bytes 195, 191 in UTF-8, I would expect to get my complete file. But I don't.
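(As an aside, while experimenting I found one reading that does not seem to truncate: declaring the encoding on the connection itself rather than passing encoding= to readLines(). My understanding, which may well be wrong, is that the connection then re-encodes the bytes to the native encoding on input, instead of merely marking them afterwards. A minimal sketch with the same latin1 test file:

f <- file("test.txt", "wb")
writeBin(as.integer(c(65, 13, 10, 66, 255, 67, 13, 10, 68, 13, 10)), f, size=1)
close(f)

## encoding= on file() translates latin1 to the native encoding while
## reading; readLines() then receives already-converted text
con <- file("test.txt", "rt", encoding="latin1")
x <- readLines(con)
close(con)
x   # all three lines should survive here

I would be curious whether this distinction between file(..., encoding=) and readLines(..., encoding=) is indeed the intended behaviour.)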
Instead, I get:

> read.table("test.txt", encoding="UTF-8")
  V1
1  A
2  B
3  C
4  D
Warning message:
In read.table("test.txt", encoding = "UTF-8") :
  incomplete final line found by readTableHeader on 'test.txt'
> readLines(f, encoding="UTF-8")
[1] "A" "B"
Warning message:
In readLines(f, encoding = "UTF-8") :
  incomplete final line found on 'test.txt'

I tried all the preceding, but with options(encoding="latin1") at the beginning. For the first attempt, with byte 255, I get:

> read.table("test.txt", encoding="latin1")
  V1
1  A
2  B
3  C
4  D
Warning message:
In read.table("test.txt", encoding = "latin1") :
  incomplete final line found by readTableHeader on 'test.txt'
>
> f <- file("test.txt", "rt")
> readLines(f, encoding="latin1")

For the other attempt, with bytes 195, 191:

> read.table("test.txt", encoding="UTF-8")
   V1
1   A
2 BÿC
3   D
>
> f <- file("test.txt", "rt")
> readLines(f, encoding="UTF-8")
[1] "A"   "BÿC" "D"
> close(f)

Thus the second one does indeed seem to work. Just a check:

> a <- read.table("test.txt", encoding="UTF-8")
> Encoding(a$V1)
[1] "unknown" "UTF-8"   "unknown"

At last, I figured out that with the default encoding in R, both attempts work, with or without even giving the encoding as a parameter of read.table or readLines. However, I don't understand what happens:

f <- file("test.txt", "wb")
writeBin(as.integer(c(65, 13, 10, 66, 255, 67, 13, 10, 68, 13, 10)), f, size=1)
close(f)
a <- read.table("test.txt", encoding="latin1")$V1
Encoding(a)
iconv(a[2], toRaw=T)
a
a <- read.table("test.txt")$V1
Encoding(a)
iconv(a[2], toRaw=T)
a

This yields:

> a <- read.table("test.txt", encoding="latin1")$V1
> Encoding(a)
[1] "unknown" "latin1"  "unknown"
> iconv(a[2], toRaw=T)
[[1]]
[1] 42 ff 43
> a
[1] "A"   "BÿC" "D"
>
> a <- read.table("test.txt")$V1
> Encoding(a)
[1] "unknown" "unknown" "unknown"
> iconv(a[2], toRaw=T)
[[1]]
[1] 42 ff 43
> a
[1] "A"   "BÿC" "D"

The second line is correctly encoded; the encoding is just not "marked" in one case.
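To convince myself of what "marking" does, here is a small check I find useful. My understanding (a sketch, not gospel) is that Encoding<- only attaches a label to the existing bytes, whereas enc2utf8() actually translates them:

x <- "B\xffC"              # the raw latin1 bytes, unmarked
charToRaw(x)               # 42 ff 43
Encoding(x) <- "latin1"    # attach a label; the bytes are untouched
charToRaw(x)               # still 42 ff 43, but the string now prints as "BÿC"
y <- enc2utf8(x)           # a real conversion this time
charToRaw(y)               # 42 c3 bf 43

So marking changes only how the bytes are interpreted when printing or converting, not the bytes themselves.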
With the UTF-8 bytes:

f <- file("test.txt", "wb")
writeBin(as.integer(c(65, 13, 10, 66, 195, 191, 67, 13, 10, 68, 13, 10)), f, size=1)
close(f)
a <- read.table("test.txt", encoding="UTF-8")$V1
Encoding(a)
iconv(a[2], toRaw=T)
a
a <- read.table("test.txt")$V1
Encoding(a)
iconv(a[2], toRaw=T)
a

This yields:

> a <- read.table("test.txt", encoding="UTF-8")$V1
> Encoding(a)
[1] "unknown" "UTF-8"   "unknown"
> iconv(a[2], toRaw=T)
[[1]]
[1] 42 c3 bf 43
> a
[1] "A"   "BÿC" "D"
> a <- read.table("test.txt")$V1
> Encoding(a)
[1] "unknown" "unknown" "unknown"
> iconv(a[2], toRaw=T)
[[1]]
[1] 42 c3 bf 43
> a
[1] "A"   "BÿC" "D"

Both are correctly read (the raw bytes are right), but the second one doesn't print correctly because the encoding is not "marked".

My thoughts:

With options(encoding="native.enc"), the characters read are not translated: they are read as raw bytes, which can then be given an encoding mark to print correctly (otherwise they print as native, which on Windows mostly means latin1).

With options(encoding="latin1") and reading the UTF-8 file, I guess it's much like the preceding: the characters are read as raw bytes and marked as UTF-8, which works.

With options(encoding="latin1") and reading the latin1 file (the one with the 0xFF byte), I don't understand what happens. The file gets truncated, almost as if 0xFF were an EOF character. This is perplexing, though it reminds me that in C, if the result of getc() is stored in a char rather than an int, the byte 0xFF can (wrongly) compare equal to EOF.

And with options(encoding="UTF-8"), I am not sure what happens.

Questions:

* What's wrong with options(encoding="latin1")?
* Is it unsafe to use an options(encoding) other than the default native.enc on Windows?
* Is it safe to assume that with native.enc, R reads raw bytes and, only when requested, marks an encoding afterwards? (That is, I get "unknown" by default, which is printed as latin1 on Windows, and if I enforce another encoding, it will be used whatever the bytes really are.)
* What really happens with another options(encoding), especially UTF-8?
* If I save a character variable to an Rdata file, is the file usable on another OS, or on the same OS with another default encoding (by changing options())? Does it depend on whether the character string has an "unknown" encoding or an explicit one?
* Is there a way (preferably an options() setting) to tell R to read text files as UTF-8 by default? Would it work with any of read.table(), readLines(), or even source()? I thought options(encoding="UTF-8") would do, but it fails on the examples above.

Best regards,

Jean-Claude Arbaut

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
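P.S. While writing this up I noticed that read.table() also has a fileEncoding argument, distinct from encoding=. If I read ?read.table correctly, fileEncoding= re-encodes the input through a connection, while encoding= only marks the resulting strings. A sketch of a per-call approach, plus a small helper I might use instead of a global options(encoding=) (the helper name is my own, not a built-in):

## Per-call declaration: recreate the latin1 test file, then let
## fileEncoding= re-encode it on input
f <- file("test.txt", "wb")
writeBin(as.integer(c(65, 13, 10, 66, 255, 67, 13, 10, 68, 13, 10)), f, size=1)
close(f)
a <- read.table("test.txt", fileEncoding="latin1")

## A small helper (my own, not built-in) to read any text file as UTF-8,
## translating through the connection rather than marking afterwards
read_lines_utf8 <- function(path) {
  con <- file(path, "rt", encoding="UTF-8")
  on.exit(close(con))
  readLines(con)
}

Whether these hit the same 0xFF problem as above in some configurations is exactly what I am unsure about, hence my questions.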