On Sat, Jun 24, 2017 at 7:12 PM, Paul Wise <p...@debian.org> wrote:
> On Sun, Jun 25, 2017 at 8:54 AM, Simon McVittie wrote:
>
>> For what it's worth, I agree that declaring the correct charset in HTTP
>> metadata is a better solution than prepending U+FEFF ZERO WIDTH NO-BREAK
>> SPACE (aka the "byte-order mark") in the file content.

Yes, the BOM was only intended for UTF-16, which could actually have
two different byte orders.  Because there is no such thing as "byte
order" with UTF-8, the world wide web has rebranded the three-byte
UTF-8 encoding of U+FEFF (the bytes EF BB BF) as the "UTF-8
signature".  The original intention of the Unicode Consortium was that
the sequence would never be used in a UTF-8 document.

In Firefox, if you press Ctrl+Shift+Q you will get an "Inspector".
Loading

    https://www.debian.org/doc/packaging-manuals/upgrading-checklist.txt

with the Network tab selected in the Inspector shows multiple tabs.
The "Console" tab gives this message, highlighted in pink [for
dramatic effect]:

    "The character encoding of the plain text document was not
    declared.  The document will render with garbled text in some
    browser configurations if the document contains characters from
    outside the US-ASCII range.  The character encoding of the file
    needs to be declared in the transfer protocol or file needs to
    use a byte order mark as an encoding signature."

So the browser is encouraging the use of this three-byte UTF-8 version
of U+FEFF, even though it was never supposed to be used in a document.
We live in an imperfect world.
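
To make the byte sequences concrete, here is a minimal Python sketch.
It is only an illustration, not part of any Debian tooling, and the
helper names has_utf8_signature and strip_utf8_signature are my own
invention.  It shows that U+FEFF becomes three fixed bytes in UTF-8,
while in UTF-16 the same code point comes out in two different byte
orders.

```python
import codecs

# U+FEFF encoded as UTF-8 always yields the same three bytes,
# which is why it can only act as a "signature", not a byte-order mark.
signature = "\ufeff".encode("utf-8")
assert signature == codecs.BOM_UTF8 == b"\xef\xbb\xbf"

# In UTF-16 the same code point really does reveal the byte order.
utf16_be = "\ufeff".encode("utf-16-be")  # big-endian: FE FF
utf16_le = "\ufeff".encode("utf-16-le")  # little-endian: FF FE


def has_utf8_signature(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 signature (hypothetical helper)."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_signature(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 signature (hypothetical helper)."""
    if has_utf8_signature(data):
        return data[len(codecs.BOM_UTF8):]
    return data
```

Python's standard "utf-8-sig" codec does the same stripping when
decoding, so data.decode("utf-8-sig") accepts files with or without
the signature.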

Going to the Network tab, reloading the page, and clicking on "Raw
Headers" shows the following information (I just made the request
again):

Request Headers:

    Host: www.debian.org
    User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
    Accept-Language: en-US,en;q=0.5
    Accept-Encoding: gzip, deflate, br
    Connection: keep-alive
    Upgrade-Insecure-Requests: 1
    If-Modified-Since: Sat, 24 Jun 2017 20:17:13 GMT
    If-None-Match: "e965-552ba67456626-gzip"
    Cache-Control: max-age=0

Response Headers:

    Accept-Ranges: bytes
    Cache-Control: max-age=86400
    Connection: Keep-Alive
    Content-Encoding: gzip
    Content-Length: 18592
    Content-Type: text/plain
    Date: Sun, 25 Jun 2017 02:10:24 GMT
    Etag: "e965-552ba67456626-gzip"
    Expires: Mon, 26 Jun 2017 02:10:24 GMT
    Keep-Alive: timeout=5, max=100
    Last-Modified: Sat, 24 Jun 2017 20:17:13 GMT
    Server: Apache
    Strict-Transport-Security: max-age=15552000
    Vary: Accept-Encoding
    X-Clacks-Overhead: GNU Terry Pratchett
    X-Content-Type-Options: nosniff
    X-Frame-Options: sameorigin
    X-XSS-Protection: 1
    referrer-policy: no-referrer

So the Content-Type is "text/plain" with no charset parameter, which
is what triggers the "garbled text" warning, to quote the Firefox
Console window in the Inspector.  As an aside, the Content-Encoding is
"gzip", which is a good thing.

On Sat, Jun 24, 2017 at 7:12 PM, Paul Wise <p...@debian.org> wrote:
> Forcing every text file to UTF-8 isn't the correct solution either,
> since it breaks text files that are not encoded in UTF-8 (such as old
> dedication texts) and does not work on Debian mirrors that are not
> controlled by us.

If using the UTF-8 signature in a document is too aesthetically
distasteful (and I don't disagree), and if setting the HTTP header to
denote a UTF-8 charset is not a universal solution because it will
only have effect on Debian's servers, would a tool that converted such
text files to an HTML document be desirable?
Such a hypothetical tool would insert a meta tag in the HTML head
saying <meta charset="UTF-8">.  If that is an acceptable solution, I
could put together an awk script for Debian (if it would get used)
that would employ awk's BEGIN and END sections to wrap a UTF-8
document in HTML tags, enclosing the text itself in <PRE>...</PRE>
tags.  That would mean that Debian UTF-8 documents intended to be
served on the web would have to be run through such a utility and
converted into HTML pages for display.

Three possibilities seem to exist, and I am fine with any one being
chosen:

1) Use the UTF-8 signature in UTF-8 text files
2) Set the HTTP Content-Type header to declare charset=UTF-8
3) Convert UTF-8 text files to HTML documents for web display

Paul Hardy
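
P.S. To show roughly what the third option would involve, here is a
minimal sketch of such a converter.  It is written in Python rather
than the awk I mentioned, purely for illustration; the function names
wrap_as_html and convert_stream and the default title are
hypothetical, and a real tool would need to be agreed on first.  It
assumes the input really is UTF-8, escapes the characters that are
special in HTML, and drops a leading UTF-8 signature if one is
present.

```python
import html
from typing import BinaryIO


def wrap_as_html(text: str, title: str = "Untitled") -> str:
    """Wrap plain text in a small HTML page that declares UTF-8 (hypothetical helper)."""
    head = (
        "<!DOCTYPE html>\n"
        "<html>\n"
        "<head>\n"
        '<meta charset="UTF-8">\n'
        f"<title>{html.escape(title)}</title>\n"
        "</head>\n"
        "<body>\n"
        "<pre>\n"
    )
    tail = "</pre>\n</body>\n</html>\n"
    # Escape &, < and > so the text is shown literally inside <pre>.
    return head + html.escape(text, quote=False) + tail


def convert_stream(src: BinaryIO, dst: BinaryIO) -> None:
    """Read UTF-8 bytes from src and write the wrapped HTML page to dst."""
    # "utf-8-sig" accepts input with or without a leading UTF-8 signature.
    text = src.read().decode("utf-8-sig")
    dst.write(wrap_as_html(text).encode("utf-8"))
```

Used as a filter, this would amount to calling
convert_stream(sys.stdin.buffer, sys.stdout.buffer).  A file that is
not valid UTF-8 makes the decode step raise an error instead of
producing a mislabelled page, which seems the safer behaviour given
the non-UTF-8 files mentioned above.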