Re: [R-SIG-Mac] Bug in reading UTF-16LE file?

Jeff Newmiller via R-SIG-Mac Wed, 02 Oct 2024 06:24:44 -0700

Prior to saying:

> "Any U+FEFF would be interpreted as a ZWNBSP."


it says:

> ... if a text data stream is marked as UTF-16BE, UTF-16LE, UTF-32BE or 
> UTF-32LE, a BOM is neither necessary nor permitted.

which ("neither ... permitted") says not to combine an endianness indicator in 
your encoding spec with a BOM. It makes sense to be able to override an 
incorrect BOM, but not to do so all the time, because then the BOM is rendered 
toothless. Programmer mis-specification is the problem that the BOM exists to 
solve.
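
For anyone stuck on an affected system, the idea of letting the BOM choose the 
byte order can be sketched at user level in R. This is only a rough 
illustration, not what R or R-devel actually does: utf16_encoding_from_bom() 
and its fallback argument are hypothetical names, and the sketch assumes a 
local file that is known to be UTF-16 (a UTF-32LE file also starts with FF FE 
and would be misclassified).

```r
# Hypothetical helper (not part of R): derive an explicit encoding name
# from the first two bytes of a file that is assumed to be UTF-16.
utf16_encoding_from_bom <- function(path, fallback = "UTF-16") {
  first <- readBin(path, what = "raw", n = 2L)
  # FF FE is the little-endian BOM, FE FF the big-endian one.
  if (identical(first, as.raw(c(0xff, 0xfe)))) return("UTF-16LE")
  if (identical(first, as.raw(c(0xfe, 0xff)))) return("UTF-16BE")
  # No UTF-16 BOM found: leave the decision to the caller's fallback.
  fallback
}

# Example with the first bytes of the file discussed in this thread.
tmp <- tempfile()
writeBin(as.raw(c(0xff, 0xfe, 0x74, 0x00, 0x69, 0x00, 0x6d, 0x00, 0x65, 0x00)), tmp)
enc <- utf16_encoding_from_bom(tmp)  # the BOM ff fe selects "UTF-16LE"
unlink(tmp)
```

The result could then be passed on, e.g. as fileEncoding = enc in read.delim(). 
The explicit LE/BE name is derived from the BOM rather than guessed by the 
user, so the BOM still does its job.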


On October 2, 2024 2:04:45 AM MST, Matt Denwood <m...@sund.ku.dk> wrote:
>Hi Jeff / all
>
>On 02/10/2024, 08.54, Jeff Newmiller wrote:
>> The Unicode FAQ does. If you specify endian-ness and a BOM is present and 
>> these specifications agree then it would seem no harm no foul. The problem 
>> is that if they conflict, then there is no clearly correct behavior: if the 
>> BOM is valid then the user spec must be incorrectly specified and favoring 
>> the user specification forces incorrect decoding. If the BOM is erroneous, 
>> then you would want the user to be able to override the incorrect BOM... but 
>> these two cases amount to defeating the BOM's purpose... it might as well not 
>> be there. So the compliant handling of data with a BOM is for the user to 
>> make a standard practice of not specifying endianness _unless they must 
>> override an invalid BOM_ (which ought to be highly unusual)... save the 
>> sledgehammer for unusual cases, and let the BOM be the "only" specification 
>> if it is present. This lets the BOM serve its intended purpose of reducing 
>> how often users have to guess.
>
>Actually, the Unicode FAQ (https://unicode.org/faq/utf_bom.html, under "Q: Why 
>wouldn’t I always use a protocol that requires a BOM?") says:  "In particular, 
>if a text data stream is marked as UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE, a 
>BOM is neither necessary nor permitted. Any U+FEFF would be interpreted as a 
>ZWNBSP."
>
>So, my interpretation of the Unicode recommendation is that specifying *LE/*BE 
>takes precedence - and if both are provided, then the BOM should be 
>interpreted as a zero-width non-breaking space i.e. ignored.  Therefore, it 
>would seem sensible for defensive programmers to specify *LE/*BE manually, 
>safe in the knowledge that any BOM (correct or otherwise) becomes irrelevant - 
>which is what I believe Tomas and Simon are suggesting.  Although it is 
>possible I misunderstood something...
>
>Best wishes,
>
>Matt
>
>
>
>On October 1, 2024 1:50:25 PM MST, Tomas Kalibera <tomas.kalib...@gmail.com> wrote:
>>On 10/1/24 15:31, Jeff Newmiller wrote:
>>>> This is a problem in macOS libiconv. When converting from "UTF-16" with a 
>>>> BOM, it correctly learns the byte-order from the BOM, but later forgets it 
>>>> in some cases. This is not a problem in R, but could be worked-around in R.
>>> So, buggy system code on one system...
>>> 
>>>> As Simon wrote, to avoid running into these problems (in released versions 
>>>> of R), one should use "UTF-16LE", so explicitly specify the byte-order in 
>>>> the encoding name.
>>> ... leads to institutionalized non-compliance.
>>> 
>>>> This is useful also because it is not clear what should be the default 
>>>> when no BOM is present and different systems have different defaults.
>>> This is nonsense, for reasons previously provided. You are calling a bug a 
>>> feature. The BOM is supposed to prevent you from having to know this 
>>> detail, and what you do when no BOM is present should have no bearing on 
>>> this case.
>>
>>I will try to explain this differently. The handling of BOMs in existing 
>>iconv implementations is unreliable (one issue is documented in the R 
>>documentation, another is the one we have run into now). Because it is 
>>unreliable, people who want to be defensive and avoid problems are advised to 
>>use *LE (or *BE) specifications. The default byte order when no BOM is 
>>present is not reliable, either (defaults differ between systems and the 
>>standard is open to interpretation - e.g. my Linux and Windows builds of R 
>>default to little-endian, while my macOS build defaults to big-endian). It is 
>>thus not advisable to depend on the default order, either, and a defensive 
>>solution is again to use *LE or *BE specifications. So, in principle, simply 
>>always use *LE or *BE.
>>
>>This advice is not a feature, it is a work-around that works for two 
>>problems: that the byte order for specifications like "UTF-16" is unknown 
>>(bug in the standard) and that specifying the byte-order by a BOM is 
>>unreliable (bugs in implementations of iconv).
>>
>>> If Apple is intransigent (which would not be out of character) you could 
>>> avoid institutionalized non-compliance at the user level by recognizing the 
>>> buggy system and replacing the generic specification with this 
>>> inappropriate LE or BE specification as directed by the BOM in the 
>>> Mac-specific R code.
>>
>>Yes, indeed, the work-around for the libiconv bug can be implemented in 
>>future versions of R and an experimental version is already in R-devel (still 
>>subject to change), so that at user level, specifying say "UTF-16" on an 
>>input with BOM will correctly use the byte-order of the BOM.
>>
>>I don't find anything inappropriate about the *LE/*BE specifications.
>>
>>Best
>>Tomas
>>
>>> 
>>> 
>>> On October 1, 2024 4:34:41 AM MST, Tomas Kalibera <tomas.kalib...@gmail.com> wrote:
>>>> On 9/9/24 12:53, Tomas Kalibera wrote:
>>>>> On 9/9/24 10:53, peter dalgaard wrote:
>>>>>> I am confused, and maybe I should just butt out of this, but:
>>>>>> 
>>>>>> (a) BOM are designed to, um, mark the byte order...
>>>>>> 
>>>>>> (b) in connections.c we have
>>>>>> 
>>>>>> if(checkBOM && con->inavail >= 2 &&
>>>>>>    ((int)con->iconvbuff[0] & 0xff) == 255 &&
>>>>>>    ((int)con->iconvbuff[1] & 0xff) == 254) {
>>>>>>     con->inavail -= (short) 2;
>>>>>>     memmove(con->iconvbuff, con->iconvbuff+2, con->inavail);
>>>>>> }
>>>>>> which checks for the two first bytes being FF, FE. However, a big-endian 
>>>>>> BOM would be FE, FF and I see no check for that.
>>>>> I think this is correct, it is executed only for encodings declared 
>>>>> little-endian (UTF-16LE, UCS2-LE) - so, iconv will still know what is the 
>>>>> byte-order from the name of the encoding, it will just not see the same 
>>>>> information in the BOM.
>>>>>> Duncan's file starts
>>>>>> 
>>>>>>> readBin('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>>>>>>>         what="raw", n=10)
>>>>>> [1] ff fe 74 00 69 00 6d 00 65 00
>>>>>> 
>>>>>> so the BOM does indeed indicate little-endian, but apparently we proceed 
>>>>>> to discard it and read the file with system (big-)endianness, which 
>>>>>> strikes me as just plain wrong...
>>>>> I've tested we are not discarding it by the code above and that iconv 
>>>>> gets to see the BOM bytes.
>>>>>> I see no Mac-specific code for this, only win_iconv.c, so presumably we 
>>>>>> have potential issues on everything non-Windows?
>>>>> I can reproduce the problem and will have a closer look; there may still 
>>>>> be a bug in R. We have some work-arounds for recent iconv issues on 
>>>>> macOS in sysutils.c.
>>>> This is a problem in macOS libiconv. When converting from "UTF-16" with a 
>>>> BOM, it correctly learns the byte-order from the BOM, but later forgets it 
>>>> in some cases. This is not a problem in R, but could be worked-around in R.
>>>> 
>>>> As Simon wrote, to avoid running into these problems (in released versions 
>>>> of R), one should use "UTF-16LE", so explicitly specify the byte-order in 
>>>> the encoding name. This is useful also because it is not clear what should 
>>>> be the default when no BOM is present and different systems have different 
>>>> defaults.
>>>> 
>>>> Best
>>>> Tomas
>>>> 
>>>>> Tomas
>>>>> 
>>>>>> -pd
>>>>>> 
>>>>>>> On 9 Sep 2024, at 01:11, Simon Urbanek <simon.urba...@r-project.org> wrote:
>>>>>>> 
>>>>>>> From the help page:
>>>>>>> 
>>>>>>> The encodings ‘"UCS-2LE"’ and ‘"UTF-16LE"’ are treated specially,
>>>>>>> as they are appropriate values for Windows ‘Unicode’ text files.
>>>>>>> If the first two bytes are the Byte Order Mark ‘0xFEFF’ then these
>>>>>>> are removed as some implementations of ‘iconv’ do not accept BOMs.
>>>>>>> 
>>>>>>> so "UTF-16LE" is the documented way to reliably read such files.
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> Simon
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On 8 Sep 2024, at 21:23, Duncan Murdoch <murdoch.dun...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> To R-SIG-Mac, with a copy to Jeff Newmiller:
>>>>>>>> 
>>>>>>>> On R-help there's a thread about reading a remote file that is coded 
>>>>>>>> in UTF-16LE with a byte-order mark. Jeff Newmiller pointed out 
>>>>>>>> (https://stat.ethz.ch/pipermail/r-help/2024-September/479933.html) 
>>>>>>>> that it would be better to declare the encoding as "UTF-16", because 
>>>>>>>> the BOM will indicate little endian.
>>>>>>>> 
>>>>>>>> I tried this on my Mac running R 4.4.1, and it didn't work. I get the 
>>>>>>>> same incorrect result from all of these commands:
>>>>>>>> 
>>>>>>>> # Automatically recognizing a URL and using fileEncoding:
>>>>>>>> read.delim(
>>>>>>>>   'https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>>>>>>>>   fileEncoding = "UTF-16"
>>>>>>>> )
>>>>>>>> 
>>>>>>>> # Using explicit url() with encoding:
>>>>>>>> read.delim(
>>>>>>>>   url('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>>>>>>>>       encoding = "UTF-16")
>>>>>>>> )
>>>>>>>> 
>>>>>>>> # Specifying the endianness incorrectly:
>>>>>>>> read.delim(
>>>>>>>>   url('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>>>>>>>>       encoding = "UTF-16BE")
>>>>>>>> )
>>>>>>>> 
>>>>>>>> The only way I get the correct result is if I specify "UTF-16LE" 
>>>>>>>> explicitly, whereas Jeff got correct results on several different 
>>>>>>>> systems using "UTF-16".
>>>>>>>> 
>>>>>>>> Is this a MacOS bug or an R for MacOS bug?
>>>>>>>> 
>>>>>>>> Duncan Murdoch
>>>>>>>> 
>>>>>>>> _______________________________________________
>>>>>>>> R-SIG-Mac mailing list
>>>>>>>> R-SIG-Mac@r-project.org
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-mac
>>>>>>>> 
>
>

-- 
Sent from my phone. Please excuse my brevity.

