On Sat, Jun 24, 2017 at 2:51 AM, Colin Watson <cjwat...@debian.org> wrote:
> On Fri, Jun 23, 2017 at 11:49:20PM -0700, Russ Allbery wrote:
>> I'm still a bit dubious about this, since I don't believe editors and
>> generators normally add it, but given how we generate the text versions of
>> the documents, it's relatively easy to add a leading BOM and seems
>> harmless. I'll take a look.
>
> I share the discomfort in your previous message with using the UTF-8
> BOM. I'd have thought that a better approach here would be to fix this
> at the HTTP layer:
> https://www.debian.org/doc/packaging-manuals/upgrading-checklist.txt
> (and other text files here) should return "Content-Type: text/plain;
> charset=UTF-8", not just "Content-Type: text/plain".
>
> --
> Colin Watson [cjwat...@debian.org]

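To make the difference between those two headers concrete, here is a minimal Python sketch of what a client can learn from each one. The helper name charset_of is made up for illustration, and the standard-library email parser only stands in for whatever a browser does internally.

```python
from email.message import Message


def charset_of(content_type):
    """Return the charset parameter of a Content-Type value, or None.

    Hypothetical helper for illustration; browsers use their own parsers.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


# The bare header names no encoding, so the client has to guess.
bare = charset_of("text/plain")

# The header Colin suggests names the encoding explicitly.
declared = charset_of("text/plain; charset=UTF-8")
```

With the bare header the client gets no encoding information at all and has to fall back on a default or on sniffing the content.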
If a ".txt" file is delivered with an HTTP header that includes the UTF-8 "charset" parameter, that would hopefully fix it. I spent a bit of time experimenting with the Firefox version installed in Stretch to see whether there was a setting that would display the file correctly as well.

Russ, you are correct that the Unicode standard counseled against using the UTF-8 version of the BOM in earlier days. That was for standalone text files, not necessarily ones served as pages on the web. Now, however, HTML5 browsers are required to recognize this sequence, so that guidance has loosened up; see

https://www.w3.org/International/questions/qa-byte-order-mark

"In HTML5 browsers are required to recognize the UTF-8 BOM and use it to detect the encoding of the page, and recent versions of major browsers handle the BOM as expected when used for UTF-8 encoded pages."

Although the next paragraph ends with:

"However, bear in mind that it is always a good idea to declare the encoding of your page using the meta element, in addition to the BOM, so that the encoding is apparent to people looking at the source text."

That implies using the <meta charset="utf-8"> tag, which is intended for HTML documents, not plain text files.

With the Firefox version installed in Stretch, the "Text Mode" button was not in the toolbar by default. When I added it to the toolbar and went to select "UTF-8" to try to find a way of viewing the text file, it offered "Unicode" and "Western", but there was no choice for UTF-8. Choosing "Unicode" for that file ("upgrading-checklist.txt") did not change the appearance; I would have expected "Unicode" to imply UTF-8 in a web browser. Adding the three-byte BOM sequence at the start of the file did change it. At that point, I posted this bug report.

Alternatively, if convenient, you could convert the non-breaking space characters to plain spaces in that text file with a script.
That will avoid the problem until you need a non-ASCII character other than the non-breaking space in the file. You could convert all of those non-breaking space characters to ordinary spaces in one fell swoop with GNU sed:

    sed -i 's/\o302\o240/ /g' upgrading-checklist.txt

If that file is served as an HTML page anywhere, the UTF-8 non-breaking space could be converted to the HTML entity "&nbsp;" to avoid non-ASCII content.

Whatever route is taken (modifying the text file, making changes in the Debian web server, or something else), it would be nice if eventually that text file rendered correctly in the Firefox browser that is on the Stretch desktop. (I'm using GNOME, by the way.)

Of course, this whole thing is really a minor issue.

Paul Hardy
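
P.S. In case it helps whoever picks this up, here is a minimal Python sketch of the bytes involved. It is only an illustration under a few assumptions: the sample string is made up rather than taken from upgrading-checklist.txt, the helper names add_bom and nbsp_to_space are hypothetical, and I am assuming that a "Western" fallback behaves like Windows-1252.

```python
import codecs

# U+00A0 NO-BREAK SPACE encoded as UTF-8: bytes C2 A0 (octal 302 240).
NBSP_UTF8 = b"\xc2\xa0"


def add_bom(data: bytes) -> bytes:
    """Prepend the three-byte UTF-8 BOM (EF BB BF) unless already present."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data


def nbsp_to_space(data: bytes) -> bytes:
    """Replace each UTF-8 encoded non-breaking space with an ASCII space."""
    return data.replace(NBSP_UTF8, b" ")


# Hypothetical sample text, not taken from upgrading-checklist.txt.
sample = "Section\u00a04.9".encode("utf-8")

# Assumption: a "Western" fallback decodes the bytes as Windows-1252,
# which turns the two-byte sequence into a stray "Â" plus a no-break space.
misread = sample.decode("cp1252")

# Option 1: mark the file as UTF-8 with a leading BOM.
with_bom = add_bom(sample)

# Option 2: make the file pure ASCII, the same effect as the sed command.
ascii_only = nbsp_to_space(sample)
```

The misread string is where a stray "Â" before each non-breaking space would come from under that assumption. Either option sidesteps the guess: the BOM tells the browser the file is UTF-8, and the ASCII-only version decodes identically under both encodings.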