On Mon, Dec 2, 2019 at 2:42 PM Henri Sivonen <hsivo...@mozilla.com> wrote:
> 1. On _unlabeled_ text/html and text/plain pages, autodetect _legacy_
> encoding, excluding UTF-8, for non-file: URLs and autodetect the
> encoding, including UTF-8, for file: URLs.
>
> Elevator pitch: Chrome already did this unilaterally. The motivation
> is to avoid a situation where a user switches to a Chromium-based
> browser as a result of browsing the legacy Web or local files.
Feature #1 is now on autoland.

> # Preference

For file: URLs, I ended up not putting the new detector behind a pref, because the file: detection code is messy enough even without alternative code paths, and I'm pretty confident that the new detector is an improvement to our file: URL handling behavior.

For non-file: URLs, the new detector is controlled overall by intl.charset.detector.ng.enabled, which defaults to true, i.e. detector enabled. When the detector is enabled, various old intl.charset.* prefs are ignored in various ways.

The detector is, however, disabled by default for three TLDs: .jp, .in, and .lk. This can be overridden via the prefs intl.charset.detector.ng.jp.enabled, intl.charset.detector.ng.in.enabled, and intl.charset.detector.ng.lk.enabled, all three of which default to false. (These prefs cannot enable the detector if intl.charset.detector.ng.enabled is false.) In the case of .jp, the pre-existing Japanese-specific detector is used instead. This avoids regressing how soon we start reloading if we detect EUC-JP. (A sketch of how these prefs combine is below.)

The detector detects encodings that are actually part of the Web Platform. However, this can cause problems when a site expects the page to be decoded as windows-1252 _as a matter of undeclared fallback_ and expects the user to have an _intentionally mis-encoded_ font that assigns non-Latin glyphs to the windows-1252 code points. (Note that if the site says <meta charset=x-user-defined>, that continues to be undisturbed: https://searchfox.org/mozilla-central/rev/62a130ba0ac80f75175e4b65536290b52391f116/parser/html/nsHtml5StreamParser.cpp#1512 )

Chrome has detection for three windows-1252-misusing Devanagari font encodings and nine Tamil ones. (Nine looks like a lot, but a Python tool in this space is documented to handle 25 Tamil legacy encodings!) There is no indication that the Chrome developers found it necessary to add these detections themselves: actively-maintained newspaper sites that, according to old Bugzilla items, previously used these font hacks have migrated to Unicode. Rather, it looks like Chrome inherited the detections from Google search engine code.

Still, this leaves the possibility that there are sites that presently work (if the user has the appropriate fonts installed) in Chrome thanks to this detection and in Firefox thanks to Firefox mapping the .in TLD to windows-1252 and mapping .com to windows-1252 in the English localizations as well as in the localizations for the Brahmic-script languages of India. Not enabling the new detector on .in, at least for now, avoids disrupting sites that intentionally misuse windows-1252 without declaring it, if such sites are still in use (at the expense of out-of-locale usage of .in as a generic TLD; data disclosed by Google as part of Chrome's detector suggests e.g. Japanese use of .in).

To the extent the phenomenon of relying on intentionally mis-encoded fonts still exists, but on .com rather than .in, the new detector will likely disrupt it (likely by guessing some Cyrillic encoding). However, I think it doesn't make sense to let that possibility derail this whole project/feature. Although I believe this phenomenon to be mostly a Tamil-in-Tamil-Nadu thing rather than a general Tamil-language thing, I disabled the detector on .lk just in case, to have more time to research the issue. If reports of legacy Tamil sites breaking show up, please needinfo me on Bugzilla.

I didn't disable the detector for .am, because Chrome doesn't appear to have detections for Armenian intentional misuse of windows-1252.
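To make the pref interplay described above concrete, here is a minimal sketch. It is not the actual Gecko code: the pref names are the real ones, but GetBoolPref() and DetectorEnabledFor() are illustrative stand-ins for Gecko's preference service and the detector's gating logic.

#include <map>
#include <string>

// Hypothetical stand-in for Gecko's preference service; the defaults
// passed in below mirror the ones described in this message.
static bool GetBoolPref(const std::string& aName, bool aDefault) {
  static const std::map<std::string, bool> userSetPrefs;  // nothing overridden
  auto it = userSetPrefs.find(aName);
  return it != userSetPrefs.end() ? it->second : aDefault;
}

// Illustrative gating logic for non-file: URLs.
static bool DetectorEnabledFor(const std::string& aTLD) {
  // Master switch, default true. If this is false, the per-TLD prefs
  // below cannot turn the detector back on.
  if (!GetBoolPref("intl.charset.detector.ng.enabled", true)) {
    return false;
  }
  // Per-TLD opt-ins, all default false, so .jp, .in, and .lk keep their
  // previous behavior (e.g. the Japanese-specific detector on .jp).
  if (aTLD == "jp") {
    return GetBoolPref("intl.charset.detector.ng.jp.enabled", false);
  }
  if (aTLD == "in") {
    return GetBoolPref("intl.charset.detector.ng.in.enabled", false);
  }
  if (aTLD == "lk") {
    return GetBoolPref("intl.charset.detector.ng.lk.enabled", false);
  }
  // Everywhere else, the new detector is on by default.
  return true;
}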
If intl.charset.detector.ng.enabled is false, Japanese detection behaves as before, except that encoding inheritance from a same-origin parent frame now takes precedence over the detector. (This was a spec compliance bug that had previously gone unnoticed, because we hadn't run the full test suite with a detector enabled. It turns out that tests both semi-intentionally and accidentally depend on same-origin inheritance taking precedence, as the spec says.)

In the interest of binary size, I removed the old Cyrillic detector at the same time as landing the new one. If the new detector is disabled but the old Cyrillic detector is enabled, the new detector runs, in the situations where the old Cyrillic detector would have run, in a mode that approximates the old Cyrillic detector. (This approximation can, however, result in some non-Cyrillic outcomes that were impossible with the old Cyrillic detector.)

> # web-platform-tests

I added tests as tentative WPTs.

-- 
Henri Sivonen
hsivo...@mozilla.com
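P.S. A minimal sketch of the precedence fix mentioned above, i.e. same-origin parent frame inheritance being consulted before any detector runs. This is not the actual Gecko code; the types and field names are illustrative only.

#include <optional>
#include <string>

// Illustrative inputs for deciding the encoding of an unlabeled page.
struct DocumentLoad {
  std::optional<std::string> declaredEncoding;          // from HTTP or <meta>, if any
  std::optional<std::string> sameOriginParentEncoding;  // inherited from a same-origin parent frame
  std::string detectorGuess;                            // what the content-based detector would pick
};

static std::string ChooseEncoding(const DocumentLoad& aLoad) {
  // A declared encoding always wins; detection only applies to unlabeled pages.
  if (aLoad.declaredEncoding) {
    return *aLoad.declaredEncoding;
  }
  // Per the spec (and the fix described above), inheritance from a
  // same-origin parent frame takes precedence over autodetection.
  if (aLoad.sameOriginParentEncoding) {
    return *aLoad.sameOriginParentEncoding;
  }
  // Only then does the detector's guess apply.
  return aLoad.detectorGuess;
}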