On 23/02/2016 7:06 AM, Mikko Korpela wrote:
On 23.02.2016 11:37, Martin Maechler wrote:
nospam@altfeld-im de <nos...@altfeld-im.de>
     on Mon, 22 Feb 2016 18:45:59 +0100 writes:

     > Dear R developers
     > I think I have found a bug that can be reproduced with two lines of code,
     > and I would be very grateful for your first assessment or feedback on my
     > report.

     > If this is the wrong mailing list or I did something wrong
     > (e.g. using a semi-"anonymous" email address to protect my privacy and
     > fend off unwanted spam), please let me know since I am new here.

     > Thank you very much :-)

     > J. Altfeld

Dear J.,
(yes, a bit less anonymity would be very welcome here!),

You are right, this is a bug, at least in the documentation, but
probably a real bug in the code as well,

but read on.

     > On Tue, 2016-02-16 at 18:25 +0100, nos...@altfeld-im.de wrote:
     >>
     >>
     >> If I execute the code from the "?write.table" examples section
     >>
     >> x <- data.frame(a = I("a \" quote"), b = pi)
     >> # (omitted code)
     >> write.csv(x, file = "foo.csv", fileEncoding = "UTF-16LE")
     >>
     >> the resulting CSV file has a size of 6 bytes which is too short
     >> (truncated):
     >>
     >> """,3

reproducibly, yes.
If you look at what write.csv does
and then simplify, you can get a similar wrong result by

   write.table(x, file = "foo.tab", fileEncoding = "UTF-16LE")

which results in a file with one line

""" 3

and if you debug write.table() you see that its building blocks
here are

         file <- file(........, encoding = fileEncoding)

followed by a writeLines(*, file=file) call for the column headers,

and then "deeper down" C code which I did not investigate.

I took a look at connections.c. There is a call to strlen() that stops
at the first null byte, and UTF-16 output contains one for every ASCII
character. I think the obvious fix is to avoid the call to strlen(),
as the size is already known:

Index: src/main/connections.c
===================================================================
--- src/main/connections.c      (revision 70213)
+++ src/main/connections.c      (working copy)
@@ -369,7 +369,7 @@
                /* is this safe? */
                warning(_("invalid char string in output conversion"));
            *ob = '\0';
-           con->write(outbuf, 1, strlen(outbuf), con);
+           con->write(outbuf, 1, ob - outbuf, con);
        } while(again && inb > 0);  /* it seems some iconv signal -1 on
                                       zero-length input */
      } else
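
To make the difference concrete, here is a minimal standalone sketch (it
is not code from connections.c). It assumes the conversion has just
produced "C\n" in UTF-16LE. The buffer contents are written out by hand,
and the helper names old_count() and new_count() are made up for
illustration only.

```c
#include <stddef.h>
#include <string.h>

/* "C\n" encoded as UTF-16LE, followed by the '\0' that the code stores
   at *ob. In R this would come out of the conversion step; here the
   bytes are simply written out by hand. */
static const char outbuf[] = { 'C', '\0', '\n', '\0', '\0' };

/* 'ob' points one past the last converted byte. */
static const char *const ob = outbuf + 4;

/* Byte count as computed before the patch: stops at the first '\0'. */
size_t old_count(void)
{
    return strlen(outbuf);
}

/* Byte count as computed by the patch: uses the known end pointer. */
size_t new_count(void)
{
    return (size_t)(ob - outbuf);
}
```

Here old_count() gives 1 and new_count() gives 4, so with strlen() only
the 'C' byte would be handed to con->write().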



But just looking a bit at such a file() object with writeLines()
seems slightly revealing, as e.g., 'eol' does not seem to
"work" for this encoding:

     > fn <- tempfile("ffoo"); ff <- file(fn, open="w", encoding = "UTF-16LE")
     > writeLines(LETTERS[3:1], ff); writeLines("|", ff); writeLines(">a", ff)
     > close(ff)
     > file.show(fn)
     CBA|>
     > file.size(fn)
     [1] 5
     >

With the patch applied:

     > readLines(fn, encoding="UTF-16LE", skipNul=TRUE)
     [1] "C"  "B"  "A"  "|"  ">a"
     > file.size(fn)
     [1] 22
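
The two file sizes (5 before, 22 after) are consistent with the strlen()
explanation. The following sketch redoes the arithmetic outside of R. It
assumes that each line and its trailing "\n" go through one
convert-and-write step, and it only handles ASCII input; the function
names ascii_to_utf16le() and bytes_written() are hypothetical and do not
exist in R.

```c
#include <stddef.h>
#include <string.h>

/* Encode an ASCII string as UTF-16LE and store a '\0' after it.
   Returns the number of encoded bytes (terminator not counted).
   'out' needs room for 2 * strlen(ascii) + 1 bytes. */
size_t ascii_to_utf16le(const char *ascii, char *out)
{
    size_t n = 0;
    for (; *ascii != '\0'; ascii++) {
        out[n++] = *ascii;  /* low byte: the ASCII code */
        out[n++] = '\0';    /* high byte: zero for ASCII */
    }
    out[n] = '\0';
    return n;
}

/* Total bytes reaching the file when every line is written with a
   trailing "\n": counted with strlen() (patched == 0) or with the
   known length (patched != 0). Returns 0 if a line is too long for
   the fixed buffer used in this sketch. */
size_t bytes_written(const char *const lines[], size_t nlines, int patched)
{
    size_t total = 0;
    for (size_t i = 0; i < nlines; i++) {
        char outbuf[256];
        size_t len = strlen(lines[i]);
        if (2 * (len + 1) + 1 > sizeof outbuf)
            return 0;
        size_t n = ascii_to_utf16le(lines[i], outbuf);
        n += ascii_to_utf16le("\n", outbuf + n);
        total += patched ? n : strlen(outbuf);
    }
    return total;
}
```

For the five lines "C", "B", "A", "|", ">a" this gives 5 bytes with
strlen() (one byte per line, i.e. "CBA|>") and 22 bytes with the known
length (11 characters including newlines, 2 bytes each).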

That may be okay on Unix, but it's not enough on Windows. There the \n
that writeLines adds at the end of each line isn't translated to
UTF-16LE properly, so things get messed up. (I think the \n is
translated, but the \r that Windows wants is not, so you get a mix of
8-bit and 16-bit characters.)
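
One way to picture such a mix is the sketch below. It is an assumption
about the mechanism, not something checked against R on Windows: it
supposes that the \r is inserted at the byte level, in front of every
0x0A byte, after the text has already been converted to UTF-16LE. The
function text_mode_translate() is made up for this illustration.

```c
#include <stddef.h>

/* Mimic a byte-level text-mode translation: every 0x0A byte gets a
   single 0x0D byte put in front of it. 'out' must have room for
   2 * n bytes. Returns the number of bytes produced. */
size_t text_mode_translate(const unsigned char *in, size_t n,
                           unsigned char *out)
{
    size_t m = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] == 0x0A)
            out[m++] = 0x0D;  /* one byte, not the 16-bit unit 0D 00 */
        out[m++] = in[i];
    }
    return m;
}
```

Feeding it "C\n" in UTF-16LE (43 00 0a 00) yields 43 00 0d 0a 00, which
is 5 bytes. A correct UTF-16LE "C\r\n" would be 43 00 0d 00 0a 00, so
under this assumption every following 16-bit unit ends up misaligned.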

Duncan Murdoch

- Mikko Korpela

     >> The problem seems to be the iconv function:
     >>
     >> iconv("foo", to="UTF-16")
     >>
     >> produces
     >>
     >> Error in iconv("foo", to = "UTF-16"):
     >> embedded nul in string: '\xff\xfef\0o\0o\0'

but this works

     > iconv("foo", to="UTF-16", toRaw=TRUE)
     [[1]]
     [1] ff fe 66 00 6f 00 6f 00

(indeed showing the embedded '\0's)
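
R character strings cannot contain embedded nuls, which is why the
character result fails while the raw result is fine. As a rough
illustration of that kind of check (R's actual implementation is not
reproduced here, and has_embedded_nul() is a made-up name), one can ask
whether a buffer of known length contains a zero byte:

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if the n-byte buffer contains a zero byte, i.e. it cannot
   be kept as a nul-terminated string without losing data; else 0. */
int has_embedded_nul(const unsigned char *buf, size_t n)
{
    return memchr(buf, 0, n) != NULL;
}
```

For the 8 bytes shown above (ff fe 66 00 6f 00 6f 00) this returns 1,
whereas for the 3 bytes of plain "foo" (66 6f 6f) it returns 0.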

     >> In 2010 a (partial) patch for this problem was submitted:
     >> http://tolstoy.newcastle.edu.au/R/e10/devel/10/06/0648.html

That patch only addressed iconv() not accepting a 'raw' (instead of
character) argument x.

... and it is more than 5.5 years old, written for an iconv() version
that was less featureful than today's.
Current iconv(x) already allows x to be a list of raw entries.


     >> Are there chances to fix this problem since it prevents writing Windows
     >> UTF-16LE text files?

     >>
     >> PS: This problem can be reproduced on Windows and Linux.

indeed.... also on "R devel of today".

I agree it should be fixed... but as I said not by the patch you
mentioned.

Tested patches to fix this are welcome, indeed.

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

