I think Gecko needs some work and attention around character encodings. This is a collection of items that I think we should get done and items that I think we should investigate.
# Why?

In general, it's better for the Web that browsers behave consistently, and in the character encoding department we have room for improvement. Also, there are security-related improvements that could be made, both to defend against XSS when sites themselves don't have enough clue to defend themselves and to reduce C++ attack surface in Gecko.

# Why now?

After a long period of underspecification, there is now speccing activity to make this area better. In addition to improvements to the HTML and CSS specifications, there is now the Encoding Standard. Since implementation feedback is generally valuable for spec development, I think it's a good idea to look at this area now that the specs are being developed. Especially, I'd like to see movement in this area while there is still an opportunity to give feedback on the Encoding Standard.

# Stuff I think we should do

## Get rid of implicit trips through the old alias table

Even though in mozilla-central we now use EncodingUtils for Encoding Standard-compliant label handling instead of calling nsCharsetAlias, we still tend to instantiate decoders using nsICharsetConverterManager::GetUnicodeDecoder, which implicitly goes through the old alias code. We should stop using that method in Gecko.

Part of https://bugzilla.mozilla.org/show_bug.cgi?id=863728

## Add a non-XPCOM way to get a decoder by encoding name

Having to deal with nsICharsetConverterManager results in useless boilerplate in C++ whenever we instantiate decoders. I think we should have a non-XPCOM way to instantiate a decoder by encoding name. This method should not do any label resolution but should instead only be called after EncodingUtils::FindEncodingForLabel has already been used.

https://bugzilla.mozilla.org/show_bug.cgi?id=919935
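To make the proposal concrete, here is a rough sketch of the shape such an API could take. This is hypothetical: DecoderForEncoding does not exist in the tree today, the include paths are approximate, and the FindEncodingForLabel usage assumes its current bool/out-parameter signature.

```cpp
// Hypothetical sketch; DecoderForEncoding is the kind of non-XPCOM factory
// proposed here, not an existing method, and include paths are approximate.
#include "mozilla/dom/EncodingUtils.h"
#include "nsCOMPtr.h"
#include "nsIUnicodeDecoder.h"
#include "nsString.h"

using mozilla::dom::EncodingUtils;

void ExampleDecodeSetup(const nsACString& aLabel)
{
  // Step 1: resolve the label (e.g. "latin1") to an Encoding Standard
  // encoding name (e.g. "windows-1252"). No decoder is touched here.
  nsAutoCString encoding;
  if (!EncodingUtils::FindEncodingForLabel(aLabel, encoding)) {
    return;  // unknown label; the caller applies its own spec'd fallback
  }
  // Step 2 (proposed): instantiate the decoder directly from the resolved
  // name, without nsICharsetConverterManager and without the old alias table.
  nsCOMPtr<nsIUnicodeDecoder> decoder =
      EncodingUtils::DecoderForEncoding(encoding);
  // ... feed bytes to |decoder| ...
}
```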
## Implement the replacement encoding

Some legacy encodings have the characteristic that bytes that don't represent script become script if the bytes are interpreted according to a more normal encoding (UTF-8, windows-1252 or similar). Therefore, simply not having support for such encodings is dangerous: content labeled with an unsupported encoding ends up decoded according to some other encoding, under which those bytes can become script. Instead of merely dropping support for such dangerous encodings, we should implement the replacement encoding, which is guaranteed to decode any stream of bytes to non-script, and map the labels of the dangerous encodings that we no longer support to the replacement encoding.

https://bugzilla.mozilla.org/show_bug.cgi?id=863728
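For reference, the decoder side of this is trivial. Here is a minimal non-streaming sketch of the replacement decoder's semantics (a standalone function for illustration, not wired into Gecko's converter interfaces), assuming the default "replacement" error mode used for HTML:

```cpp
#include <cstddef>
#include <string>

// Sketch of the Encoding Standard's replacement decoder: any non-empty byte
// stream decodes to exactly one U+FFFD REPLACEMENT CHARACTER, so the output
// can never be interpreted as markup or script.
std::u16string DecodeReplacement(const unsigned char* bytes, size_t length)
{
  if (length == 0) {
    return std::u16string();  // empty input produces empty output
  }
  // Emit a single U+FFFD regardless of what the bytes are or how many there are.
  return std::u16string(1, u'\xFFFD');
}
```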
## Remove the Korean, Simplified Chinese and Traditional Chinese detectors

Each of the Korean, Simplified Chinese and Traditional Chinese locales has a single legacy fallback encoding: euc-kr, gbk and big5, respectively. Therefore, none of these locales needs a detector and, indeed, Firefox for these locales ships with the detector off. Since detectors are undesirable due to their interference with HTML loading, their non-standard status, their non-obvious behavior and the bad incentives they create for Web authors if turned on by default, I think we should remove these detectors, which are unnecessary and not even currently turned on by default for these locales. WebKit and Blink don't have autodetection except for Japanese encodings.

https://bugzilla.mozilla.org/show_bug.cgi?id=844118
https://bugzilla.mozilla.org/show_bug.cgi?id=844120

## Make the File API not use the Universal chardet

Our File API implementation uses the Universal chardet when converting a local file to JS strings, in blatant violation of the specification and differently from how e.g. Chrome behaves. We should comply with the specification and assume UTF-8 instead of making stuff up like this.

https://bugzilla.mozilla.org/show_bug.cgi?id=848842

## Remove the Universal chardet

The Universal chardet is not really universal: it is rather arbitrary in what it tries to detect. For example, it tries to detect Hebrew, Hungarian and Thai, but it doesn't try to detect Arabic, Czech or Vietnamese (and the Hungarian detection apparently doesn't actually work right). As far as I can tell, what's detected depends on the interests of the people who worked on the detector in the Netscape days. I see no indication that the Universal detector could reasonably be expected to grow universal encoding and language coverage in a reasonable timeframe. The code hasn't seen much active development since the Netscape days.

I think it's reasonable to assume that even if the Universal detector gained coverage for more languages and encodings, reliably choosing between various single-byte encodings would be a gamble. For example, if we enabled a detector in builds that we ship to Western Europe, Africa, South Asia and the Americas, I expect it would be virtually certain that the result would be worse than just having the fallback encoding always be windows-1252, because we'd introduce e.g. windows-1250 misguesses into locales where windows-1252 is consistently the best guess.

Basing the detection on the payload of the HTTP response is bad for incremental parsing, and stopping mid-parse and reloading the page with a different encoding is bad for the user experience. Due to this (and the above point) I think it makes sense not to try to improve the detector but to get rid of it.

Since the Universal detector is not enabled by default in any configuration that we ship, it is not necessary to keep it around. But if it's not enabled by default, what's the harm, then? The harm is that the name "Universal" oversells the detector and makes it attractive in a misleading way. For example, would we ever have used the detector in the File API implementation if it had been truthfully advertised as not really being universal? Also, when the name "Universal" is shown in the menu, enabling the feature might look like a no-brainer to a user who doesn't realize what the associated downsides are. Once enabled, it might not even be obvious to the user why some sites break, which might easily make Firefox look bad. Also, if enabled by a Web author, the author may think his/her pages work when they don't work with the default settings. Again, WebKit and Blink don't have autodetection except for Japanese encodings.

https://bugzilla.mozilla.org/show_bug.cgi?id=849113
https://bugzilla.mozilla.org/show_bug.cgi?id=844115

## Remove support for single-byte encodings that are not in the Encoding Standard

We should see which single-byte encodings are no longer reachable because no label maps to them according to the Encoding Standard. To gain some savings in binary size, we should remove the decoders and encoders for these encodings. We should also make sure that none of the detectors can detect encodings that are not in the Encoding Standard. If Thunderbird still wants to have these, I think they should take them into comm-central.

## Remove support for multi-byte encodings that are not in the Encoding Standard

Decoders for multi-byte encodings not only contribute to the binary size but are also potential attack surface in the form of buffer overflows or pointers otherwise pointing to wrong things. Even though there's not supposed to be a way to instantiate decoders for encodings that are not in the Encoding Standard, to make sure that we have gotten rid of the attack surface, we should remove decoders and encoders for multi-byte encodings that are not in the Encoding Standard from mozilla-central. If Thunderbird still wants to have these, I think they should take them into comm-central.

## Unify big5 and big5-hkscs

When we moved to Encoding Standard-compliant label handling, we backpedaled and made an exception for big5-hkscs, because our decoders weren't Encoding Standard-compliant yet. We should make our big5 decoder Encoding Standard-compliant and then get rid of the separate big5-hkscs implementation.

https://bugzilla.mozilla.org/show_bug.cgi?id=912470

## Add a way to signal the end of the stream to encoders and decoders

Currently, the converter interfaces we use don't have a way to signal the end of the stream. This means that when a stream ends with a partial code unit sequence, we don't properly emit the REPLACEMENT CHARACTER.

https://bugzilla.mozilla.org/show_bug.cgi?id=562590
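To illustrate why the end-of-stream signal matters, here is a deliberately simplified standalone sketch, not Gecko's actual nsIUnicodeDecoder interface: a streaming UTF-8 decoder that buffers an incomplete trailing sequence can only turn that leftover state into U+FFFD if the caller explicitly tells it the stream has ended.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Simplified sketch (no 4-byte sequences, no overlong/surrogate checks).
// The point is Finish(): without an explicit end-of-stream signal, a
// truncated trailing sequence is silently dropped instead of becoming U+FFFD.
class StreamingUtf8DecoderSketch {
 public:
  // Decode as many complete sequences as possible; buffer the rest.
  void Feed(const uint8_t* data, size_t length, std::u16string& out) {
    for (size_t i = 0; i < length; ++i) {
      const uint8_t b = data[i];
      if (pending_ > 0 && (b & 0xC0) == 0x80) {  // expected continuation byte
        codepoint_ = (codepoint_ << 6) | (b & 0x3F);
        if (--pending_ == 0) {
          out.push_back(static_cast<char16_t>(codepoint_));
        }
        continue;
      }
      if (pending_ > 0) {
        out.push_back(u'\xFFFD');  // sequence cut short by a non-continuation byte
        pending_ = 0;
      }
      if (b < 0x80) {
        out.push_back(static_cast<char16_t>(b));  // ASCII
      } else if ((b & 0xE0) == 0xC0) {
        codepoint_ = b & 0x1F; pending_ = 1;      // 2-byte sequence
      } else if ((b & 0xF0) == 0xE0) {
        codepoint_ = b & 0x0F; pending_ = 2;      // 3-byte sequence
      } else {
        out.push_back(u'\xFFFD');                 // unsupported or invalid lead byte
      }
    }
  }

  // The missing piece in our current converter interfaces: an explicit
  // end-of-stream signal, so a partial trailing sequence becomes U+FFFD.
  void Finish(std::u16string& out) {
    if (pending_ > 0) {
      out.push_back(u'\xFFFD');
      pending_ = 0;
    }
  }

 private:
  uint32_t codepoint_ = 0;
  int pending_ = 0;
};

// Usage: feed the first two bytes of U+20AC (E2 82 AC) and then signal end of
// stream; Finish() turns the dangling partial sequence into U+FFFD.
//   StreamingUtf8DecoderSketch d;
//   std::u16string out;
//   const uint8_t bytes[] = { 0xE2, 0x82 };
//   d.Feed(bytes, sizeof(bytes), out);
//   d.Finish(out);  // out now contains a single U+FFFD
```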
## Review the remaining encoders and decoders for Encoding Standard compliance

We should review the remaining encoders and decoders for Encoding Standard compliance. In particular, we should review the multi-byte decoders.

# Stuff I think should be investigated

## Get telemetry for how often email messages in Thunderbird don't come with an encoding declaration

Email messages are typically generated by email programs that know what they are generating, so it's reasonable to expect email messages to have a higher rate of encoding labeling than Web content. If the rate of labeling is indeed very high, it doesn't make sense to support encoding detectors in Thunderbird. We should gather telemetry to find out.

## Remove the combined Chinese detector

We have an off-by-default detector that decides between Traditional Chinese and Simplified Chinese. While this detector might have a valid use case when users read content both from Taiwan and mainland China or both from Hong Kong and mainland China, we might be able to address that use case with less magic and without adverse side effects on HTML parsing by looking at the .tw, .hk and .cn top-level domains of the content.

## Remove the Russian and Ukrainian detectors

WebKit and Blink get away with Japanese detection only. Since detection has downsides, we shouldn't do it unless we need to. We should investigate the feasibility of removing the Russian and Ukrainian detectors and hopefully remove them.

https://bugzilla.mozilla.org/show_bug.cgi?id=845791

## Remove support for HZ and map it to the replacement encoding

Of the encodings that we do support, I think HZ is by far the scariest one, because its escape delimiters are printable ASCII characters and because it's fairly easy to construct attack byte sequences that round-trip from HZ to Unicode and back. I think we should investigate whether we could get away with removing support for HZ and mapping its labels to the replacement encoding without breaking the Web too much.

## Make our ISO-2022-JP decoder more IE-compatible by supporting Shift_JIS sequences

I haven't personally verified this, but I've been told that in IE, the ISO-2022-JP decoder supports Shift_JIS sequences. This means that when Shift_JIS content is mislabeled as ISO-2022-JP, it works in IE right away, but in Firefox the user has to manually override the encoding from the Character Encoding menu. In order to minimize the cases where the user has to reach for the Character Encoding menu, we should standardize and adopt IE's behavior.

## Investigate the unification of gbk and gb18030

Both Gecko and the Encoding Standard treat gbk and gb18030 as distinct encodings. Hixie seems to think that they could be unified. We should investigate whether that's indeed feasible.

## Consider moving to WebKit-like detection for Japanese encodings

WebKit has autodetection only for Japanese, and it's pretty simple. We should investigate doing what WebKit does for Japanese.

http://trac.webkit.org/browser/trunk/Source/WebCore/loader/TextResourceDecoder.cpp#L157

## Add UTF-8 detection for file: URLs

When files are saved locally, they lose their HTTP headers. This is increasingly a problem with locally-saved UTF-8-encoded pages, since loading those pages from file: URLs results in the use of the non-UTF-8 fallback encoding. The problems that misfiring autodetection causes don't apply to UTF-8 when the whole stream is available, since it's practically certain that a page is intended to be UTF-8-encoded if the whole stream validates as UTF-8. Also, when reading from the local file system, the whole file is available and can be inspected ahead of parsing without causing the problems that autodetection causes when it interacts with incremental loading. To make the Character Encoding menu unnecessary when accessing local UTF-8-encoded files, we should read all the bytes of the file from the disk, check whether they are valid UTF-8 and only then parse (as UTF-8 if the stream was valid UTF-8).

-- 
Henri Sivonen
hsivo...@hsivonen.fi
http://hsivonen.fi/