I am confused, and maybe I should just butt out of this, but:

(a) BOMs are designed to, um, mark the byte order...
(b) in connections.c we have

    if(checkBOM && con->inavail >= 2 &&
       ((int)con->iconvbuff[0] & 0xff) == 255 &&
       ((int)con->iconvbuff[1] & 0xff) == 254) {
        con->inavail -= (short) 2;
        memmove(con->iconvbuff, con->iconvbuff+2, con->inavail);
    }

which checks for the two first bytes being FF, FE. However, a big-endian BOM would be FE, FF and I see no check for that.

Duncan's file starts

> readBin('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>         what="raw", n=10)
 [1] ff fe 74 00 69 00 6d 00 65 00

so the BOM does indeed indicate little-endian, but apparently we proceed to discard it and read the file with system (big-)endianness, which strikes me as just plain wrong...

I see no Mac-specific code for this, only win_iconv.c, so presumably we have potential issues on everything non-Windows?

-pd

> On 9 Sep 2024, at 01:11, Simon Urbanek <simon.urba...@r-project.org> wrote:
>
> From the help page:
>
>      The encodings ‘"UCS-2LE"’ and ‘"UTF-16LE"’ are treated specially,
>      as they are appropriate values for Windows ‘Unicode’ text files.
>      If the first two bytes are the Byte Order Mark ‘0xFEFF’ then these
>      are removed as some implementations of ‘iconv’ do not accept BOMs.
>
> so "UTF-16LE" is the documented way to reliably read such files.
>
> Cheers,
> Simon
>
>
>> On 8 Sep 2024, at 21:23, Duncan Murdoch <murdoch.dun...@gmail.com> wrote:
>>
>> To R-SIG-Mac, with a copy to Jeff Newmiller:
>>
>> On R-help there's a thread about reading a remote file that is coded in
>> UTF-16LE with a byte-order mark. Jeff Newmiller pointed out
>> (https://stat.ethz.ch/pipermail/r-help/2024-September/479933.html) that it
>> would be better to declare the encoding as "UTF-16", because the BOM will
>> indicate little endian.
>>
>> I tried this on my Mac running R 4.4.1, and it didn't work.
>> I get the same incorrect result from all of these commands:
>>
>>  # Automatically recognizing a URL and using fileEncoding:
>>  read.delim(
>>    'https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>>    fileEncoding = "UTF-16"
>>  )
>>
>>  # Using explicit url() with encoding:
>>  read.delim(
>>    url('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>>        encoding = "UTF-16")
>>  )
>>
>>  # Specifying the endianness incorrectly:
>>  read.delim(
>>    url('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/ch15/employee.txt',
>>        encoding = "UTF-16BE")
>>  )
>>
>> The only way I get the correct result is if I specify "UTF-16LE" explicitly,
>> whereas Jeff got correct results on several different systems using "UTF-16".
>>
>> Is this a MacOS bug or an R for MacOS bug?
>>
>> Duncan Murdoch
>>
>> _______________________________________________
>> R-SIG-Mac mailing list
>> R-SIG-Mac@r-project.org
>> https://stat.ethz.ch/mailman/listinfo/r-sig-mac

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd....@cbs.dk  Priv: pda...@gmail.com
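
For illustration only, here is a minimal sketch in C of a BOM check that recognizes both byte orders, in contrast to the FF, FE-only test quoted from connections.c above. The names bom_kind, detect_utf16_bom and strip_utf16_bom are hypothetical and are not part of R's sources; the sketch assumes a plain byte buffer rather than R's connection structure, and it deliberately ignores UTF-32 BOMs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names for illustration; not R's internal API. */
typedef enum { BOM_NONE = 0, BOM_UTF16_LE, BOM_UTF16_BE } bom_kind;

/* Report which UTF-16 BOM, if any, starts the buffer.
 * FF FE marks little-endian, FE FF marks big-endian. */
bom_kind detect_utf16_bom(const unsigned char *buf, size_t avail)
{
    assert(buf != NULL);
    if (avail >= 2) {
        if (buf[0] == 0xFF && buf[1] == 0xFE) return BOM_UTF16_LE;
        if (buf[0] == 0xFE && buf[1] == 0xFF) return BOM_UTF16_BE;
    }
    return BOM_NONE;
}

/* Remove a leading BOM in place and return the byte order it
 * announced, so the caller can keep that information instead of
 * discarding it. *avail is reduced by 2 when a BOM is removed. */
bom_kind strip_utf16_bom(unsigned char *buf, size_t *avail)
{
    bom_kind kind = detect_utf16_bom(buf, *avail);
    if (kind != BOM_NONE) {
        *avail -= 2;
        memmove(buf, buf + 2, *avail);
    }
    return kind;
}
```

With something along these lines, a caller that was asked for plain "UTF-16" could choose "UTF-16LE" or "UTF-16BE" for the conversion based on the returned value, rather than dropping the BOM and leaving the byte order to whatever the iconv implementation defaults to. Whether that is the right fix for R is a separate question.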