Re: [Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)

Mikko Korpela Thu, 25 Feb 2016 01:34:51 -0800

On 23.02.2016 14:06, Mikko Korpela wrote:
> On 23.02.2016 11:37, Martin Maechler wrote:
>>>>>>> nospam@altfeld-im de <nos...@altfeld-im.de>
>>>>>>>     on Mon, 22 Feb 2016 18:45:59 +0100 writes:
>>
>>     > Dear R developers
>>     > I think I have found a bug that can be reproduced with two lines of code
>>     > and I would be very grateful for your first assessment or feedback on my
>>     > report.
>>
>>     > If this is the wrong mailing list or I did something wrong
>>     > (e.g. using a semi-"anonymous" email address to protect my privacy and
>>     > fend off unwanted spam) please let me know since I am new here.
>>
>>     > Thank you very much :-)
>>
>>     > J. Altfeld
>>
>> Dear J.,
>> (yes, a bit less anonymity would be very welcome here!),
>>
>> You are right, this is a bug, at least in the documentation, but
>> probably "all real", indeed,
>>
>> but read on.
>>
>>     > On Tue, 2016-02-16 at 18:25 +0100, nos...@altfeld-im.de wrote:
>>     >> 
>>     >> 
>>     >> If I execute the code from the "?write.table" examples section
>>     >> 
>>     >> x <- data.frame(a = I("a \" quote"), b = pi)
>>     >> # (omitted code)
>>     >> write.csv(x, file = "foo.csv", fileEncoding = "UTF-16LE")
>>     >> 
>>     >> the resulting CSV file has a size of 6 bytes which is too short
>>     >> (truncated):
>>     >> 
>>     >> """,3
>>
>> reproducibly, yes.
>> If you look at what write.csv does
>> and then simplify, you can get a similar wrong result by
>>
>>   write.table(x, file = "foo.tab", fileEncoding = "UTF-16LE")
>>
>> which results in a file with one line
>>
>> """ 3
>>
>> and if you debug write.table() you see that its building blocks
>> here are
>>       file <- file(........, encoding = fileEncoding)
>>
>> a     writeLines(*, file=file)  for the column headers,
>>
>> and then "deeper down" C code which I did not investigate.
> 
> I took a look at connections.c. There is a call to strlen() that gets
> confused by null characters. I think the obvious fix is to avoid the
> call to strlen() as the size is already known:
> 
> Index: src/main/connections.c
> ===================================================================
> --- src/main/connections.c    (revision 70213)
> +++ src/main/connections.c    (working copy)
> @@ -369,7 +369,7 @@
>               /* is this safe? */
>               warning(_("invalid char string in output conversion"));
>           *ob = '\0';
> -         con->write(outbuf, 1, strlen(outbuf), con);
> +         con->write(outbuf, 1, ob - outbuf, con);
>       } while(again && inb > 0);  /* it seems some iconv signal -1 on
>                                      zero-length input */
>      } else
> 
> 
>>
>> But just looking a bit at such a file() object with writeLines()
>> seems slightly revealing, as e.g., 'eol' does not seem to
>> "work" for this encoding:
>>
>>     > fn <- tempfile("ffoo"); ff <- file(fn, open="w", encoding = "UTF-16LE")
>>     > writeLines(LETTERS[3:1], ff); writeLines("|", ff); writeLines(">a", ff)
>>     > close(ff)
>>     > file.show(fn)
>>     CBA|>
>>     > file.size(fn)
>>     [1] 5
>>     > 
> 
> With the patch applied:
> 
>     > readLines(fn, encoding="UTF-16LE", skipNul=TRUE)
>     [1] "C"  "B"  "A"  "|"  ">a"
>     > file.size(fn)
>     [1] 22
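
I just realized that I was misusing the encoding argument of
readLines(): it only marks the encoding of the strings that are read and
does not re-encode the input. The code above works by accident (with
skipNul=TRUE the zero bytes of this ASCII-only UTF-16LE file are simply
dropped), but the following, which declares the encoding on the
connection, would be more appropriate: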


    > ff <- file(fn, open="r", encoding="UTF-16LE")
    > readLines(ff)
    [1] "C"  "B"  "A"  "|"  ">a"
    > close(ff)

Testing on Linux, with the patch applied. (As noted by Duncan Murdoch,
the patch is incomplete on Windows.)

- Mikko

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
