Re: [R-SIG-Mac] Bug in reading UTF-16LE file?

Tomas Kalibera Tue, 01 Oct 2024 13:50:47 -0700

On 10/1/24 15:31, Jeff Newmiller wrote:

This is a problem in macOS libiconv. When converting from "UTF-16" with a BOM, 
it correctly learns the byte-order from the BOM, but later forgets it in some cases.  
This is not a problem in R, but it could be worked around in R.

So, buggy system code on one system...

As Simon wrote, to avoid running into these problems (in released versions of R), one 
should use "UTF-16LE", so explicitly specify the byte-order in the encoding 
name.

... leads to institutionalized non-compliance.

This is useful also because it is not clear what should be the default when no 
BOM is present and different systems have different defaults.

This is nonsense, for reasons previously provided. You are calling a bug a 
feature. The BOM is supposed to prevent you from having to know this detail, 
and what you do when no BOM is present should have no bearing on this case.

I will try to explain this differently. The handling of BOMs in existing 
iconv implementations is unreliable (one issue is documented in the R 
documentation, another is the one we have run into now). Because it is 
unreliable, people who want to be defensive and avoid problems are 
advised to use *LE (or *BE) specifications. The default byte-order when 
no BOM is present is not reliable, either (defaults differ between 
systems and the standard is open to interpretation - e.g. my Linux and 
Windows builds of R default to little-endian, while my macOS build 
defaults to big-endian). It is thus not advisable to depend on the 
default order, either, and a defensive solution is again to use *LE or 
*BE specifications. So, in principle, simply always use *LE or *BE.

This advice is not a feature, it is a work-around that works for two 
problems: that the byte order for specifications like "UTF-16" is 
unknown (bug in the standard) and that specifying the byte-order by a 
BOM is unreliable (bugs in implementations of iconv).

If Apple is intransigent (which would not be out of character) you could avoid 
institutionalized non-compliance at the user level by recognizing the buggy 
system and replacing the generic specification with this inappropriate LE or BE 
specification as directed by the BOM in the Mac-specific R code.

Yes, indeed, the work-around for the libiconv bug can be implemented in 
future versions of R and an experimental version is already in R-devel 
(still subject to change), so that at user level, specifying say 
"UTF-16" on an input with BOM will correctly use the byte-order of the BOM.
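
To make the idea concrete, here is a minimal sketch of such a work-around: 
look at the first two bytes and, when the generic name "UTF-16" was given, 
substitute an explicit *LE or *BE name before handing the input to iconv. 
The function name and signature are hypothetical and this is not the code 
in R-devel; it only illustrates the approach.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not the R-devel implementation: pick an explicit
   encoding name from the first two bytes of the input. */
const char *utf16_name_from_bom(const char *declared,
                                const unsigned char *buf, size_t n)
{
    /* Only the generic name is ambiguous; explicit names pass through. */
    if (strcmp(declared, "UTF-16") != 0 || n < 2)
        return declared;
    if (buf[0] == 0xFF && buf[1] == 0xFE)
        return "UTF-16LE";   /* U+FEFF stored little-endian */
    if (buf[0] == 0xFE && buf[1] == 0xFF)
        return "UTF-16BE";   /* U+FEFF stored big-endian */
    return declared;         /* no BOM: the default stays system-dependent */
}
```

A caller that switches to the explicit name would presumably also need to 
skip those two bytes, as R already does for inputs declared "UTF-16LE" or 
"UCS-2LE".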


I don't find anything inappropriate about the *LE/*BE specifications.

Best
Tomas



On October 1, 2024 4:34:41 AM MST, Tomas Kalibera <tomas.kalib...@gmail.com> 
wrote:

On 9/9/24 12:53, Tomas Kalibera wrote:

On 9/9/24 10:53, peter dalgaard wrote:

I am confused, and maybe I should just butt out of this, but:

(a) BOM are designed to, um, mark the byte order...

(b) in connections.c we have

              if(checkBOM && con->inavail >= 2 &&
                 ((int)con->iconvbuff[0] & 0xff) == 255 &&
                 ((int)con->iconvbuff[1] & 0xff) == 254) {
                  con->inavail -= (short) 2;
                  memmove(con->iconvbuff, con->iconvbuff+2, con->inavail);
              }
   which checks for the two first bytes being FF, FE. However, a big-endian BOM 
would be FE, FF and I see no check for that.

I think this is correct, it is executed only for encodings declared 
little-endian (UTF-16LE, UCS2-LE) - so, iconv will still know what is the 
byte-order from the name of the encoding, it will just not see the same 
information in the BOM.
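
For readers without the R sources at hand, the following is a standalone 
sketch of what that snippet does, assuming it is reached only for encodings 
declared little-endian. The function name and the plain buffer arguments are 
illustrative stand-ins for the fields of R's connection structure.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative stand-in for the connection buffer handling; not R's API.
   Drops a leading little-endian BOM (FF FE) and returns the new count of
   available bytes. */
size_t drop_le_bom(unsigned char *buf, size_t inavail)
{
    if (inavail >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        inavail -= 2;
        memmove(buf, buf + 2, inavail);  /* shift the payload over the BOM */
    }
    return inavail;
}
```

A big-endian BOM (FE FF) is left in place by design in this sketch, since 
the byte order is already fixed by the declared encoding name.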

Duncan's file starts

readBin('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
 what="raw", n=10)

   [1] ff fe 74 00 69 00 6d 00 65 00

so the BOM does indeed indicate little-endian, but apparently we proceed to 
discard it and read the file with system (big-)endianness, which strikes me as 
just plain wrong...
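
To see why the byte order matters for these particular bytes, here is a 
small sketch that reads 16-bit code units from a buffer in either order. 
The helper name is made up for illustration and has nothing to do with how 
iconv is implemented.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: return the i-th 16-bit code unit of buf, reading
   each byte pair as little-endian or big-endian. */
uint16_t utf16_unit(const unsigned char *buf, size_t i, int little_endian)
{
    unsigned first = buf[2 * i], second = buf[2 * i + 1];
    return little_endian ? (uint16_t)(first | (second << 8))
                         : (uint16_t)((first << 8) | second);
}
```

Applied to the ten bytes shown above, the little-endian reading gives 
U+FEFF (the BOM) followed by U+0074, U+0069, U+006D, U+0065, i.e. "time". 
The big-endian reading gives U+FFFE followed by U+7400, U+6900, U+6D00, 
U+6500, which is not the intended text.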

I've tested we are not discarding it by the code above and that iconv gets to 
see the BOM bytes.

I see no Mac-specific code for this, only win_iconv.c, so presumably we have 
potential issues on everything non-Windows?

I can reproduce the problem and will have a closer look; it may still be that 
there is a bug in R. We have some work-arounds for recent iconv issues on macOS 
in sysutils.c.

This is a problem in macOS libiconv. When converting from "UTF-16" with a BOM, 
it correctly learns the byte-order from the BOM, but later forgets it in some cases.  
This is not a problem in R, but it could be worked around in R.

As Simon wrote, to avoid running into these problems (in released versions of R), one 
should use "UTF-16LE", so explicitly specify the byte-order in the encoding 
name. This is useful also because it is not clear what should be the default when no BOM 
is present and different systems have different defaults.

Best
Tomas

Tomas

-pd

On 9 Sep 2024, at 01:11 , Simon Urbanek <simon.urba...@r-project.org> wrote:

  From the help page:

      The encodings ‘"UCS-2LE"’ and ‘"UTF-16LE"’ are treated specially,
      as they are appropriate values for Windows ‘Unicode’ text files.
      If the first two bytes are the Byte Order Mark ‘0xFEFF’ then these
      are removed as some implementations of ‘iconv’ do not accept BOMs.

so "UTF-16LE" is the documented way to reliably read such files.

Cheers,
Simon

On 8 Sep 2024, at 21:23, Duncan Murdoch <murdoch.dun...@gmail.com> wrote:

To R-SIG-Mac, with a copy to Jeff Newmiller:

On R-help there's a thread about reading a remote file that is coded in UTF-16LE with a 
byte-order mark.  Jeff Newmiller pointed out 
(https://stat.ethz.ch/pipermail/r-help/2024-September/479933.html) that it would be 
better to declare the encoding as "UTF-16", because the BOM will indicate 
little endian.

I tried this on my Mac running R 4.4.1, and it didn't work. I get the same 
incorrect result from all of these commands:

# Automatically recognizing a URL and using fileEncoding:
read.delim(
'https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
     fileEncoding = "UTF-16"
)

# Using explicit url() with encoding:
read.delim(
url('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
        encoding = "UTF-16")
)

# Specifying the endianness incorrectly:
read.delim(
url('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
        encoding = "UTF-16BE")
)

The only way I get the correct result is if I specify "UTF-16LE" explicitly, whereas Jeff 
got correct results on several different systems using "UTF-16".

Is this a MacOS bug or an R for MacOS bug?

Duncan Murdoch

_______________________________________________
R-SIG-Mac mailing list
R-SIG-Mac@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-mac
