On 23.02.2016 14:06, Mikko Korpela wrote:
> On 23.02.2016 11:37, Martin Maechler wrote:
>>>>>>> nospam@altfeld-im de <nos...@altfeld-im.de>
>>>>>>>     on Mon, 22 Feb 2016 18:45:59 +0100 writes:
>>
>>     > Dear R developers
>>     > I think I have found a bug that can be reproduced with two lines of code
>>     > and I am very thankful to get your first assessment or feed-back on my
>>     > report.
>>
>>     > If this is the wrong mailing list or I did something wrong
>>     > (e.g. semi "anonymous" email address to protect my privacy and defend
>>     > unwanted spam) please let me know since I am new here.
>>
>>     > Thank you very much :-)
>>
>>     > J. Altfeld
>>
>> Dear J.,
>> (yes, a bit less anonymity would be very welcomed here!),
>>
>> You are right, this is a bug, at least in the documentation, but
>> probably "all real", indeed,
>>
>> but read on.
>>
>>     > On Tue, 2016-02-16 at 18:25 +0100, nos...@altfeld-im.de wrote:
>>     >>
>>     >> If I execute the code from the "?write.table" examples section
>>     >>
>>     >> x <- data.frame(a = I("a \" quote"), b = pi)
>>     >> # (omitted code)
>>     >> write.csv(x, file = "foo.csv", fileEncoding = "UTF-16LE")
>>     >>
>>     >> the resulting CSV file has a size of 6 bytes which is too short
>>     >> (truncated):
>>     >>
>>     >> """,3
>>
>> reproducibly, yes.
>> If you look at what write.csv does
>> and then simplify, you can get a similar wrong result by
>>
>>     write.table(x, file = "foo.tab", fileEncoding = "UTF-16LE")
>>
>> which results in a file with one line
>>
>>     """ 3
>>
>> and if you debug write.table() you see that its building blocks
>> here are
>>
>>     file <- file(........, encoding = fileEncoding)
>>
>>     a writeLines(*, file=file) for the column headers,
>>
>> and then "deeper down" C code which I did not investigate.
>
> I took a look at connections.c. There is a call to strlen() that gets
> confused by null characters.
> I think the obvious fix is to avoid the call to strlen() as the size is
> already known:
>
> Index: src/main/connections.c
> ===================================================================
> --- src/main/connections.c      (revision 70213)
> +++ src/main/connections.c      (working copy)
> @@ -369,7 +369,7 @@
>                 /* is this safe? */
>                 warning(_("invalid char string in output conversion"));
>             *ob = '\0';
> -           con->write(outbuf, 1, strlen(outbuf), con);
> +           con->write(outbuf, 1, ob - outbuf, con);
>         } while(again && inb > 0);  /* it seems some iconv signal -1 on
>                                        zero-length input */
>      } else
>
>>
>> But just looking a bit at such a file() object with writeLines()
>> seems slightly revealing, as e.g., 'eol' does not seem to
>> "work" for this encoding:
>>
>>     > fn <- tempfile("ffoo"); ff <- file(fn, open="w", encoding = "UTF-16LE")
>>     > writeLines(LETTERS[3:1], ff); writeLines("|", ff); writeLines(">a", ff)
>>     > close(ff)
>>     > file.show(fn)
>>     CBA|>
>>     > file.size(fn)
>>     [1] 5
>>     >
>
> With the patch applied:
>
>     > readLines(fn, encoding="UTF-16LE", skipNul=TRUE)
>     [1] "C" "B" "A" "|" ">a"
>     > file.size(fn)
>     [1] 22

I just realized that I was misusing the encoding argument of
readLines(). The code above works by accident, but the following would
be more appropriate:

    > ff <- file(fn, open="r", encoding="UTF-16LE")
    > readLines(ff)
    [1] "C" "B" "A" "|" ">a"
    > close(ff)

Testing on Linux, with the patch applied. (As noted by Duncan Murdoch,
the patch is incomplete on Windows.)

- Mikko

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel