I think Gecko needs some work and attention around character
encodings. This is a collection of items that I think we should get
done and items that I think we should investigate.

# Why?

In general, it's better for the Web that browsers behave consistently.
In the character encoding department we have room for improvement.

Also, there are security-related improvements that could be made both
to defend against XSS when sites themselves don't have enough clue to
defend themselves and to reduce C++ attack surface in Gecko.

# Why now?

After a long period of underspecification, there is now speccing
activity to make this area better. In addition to improvements to the
HTML and CSS specifications, there is now the Encoding Standard. Since
implementation feedback is generally valuable for spec development, I
think it's a good idea to look at this area now that the specs are
being developed. In particular, I'd like to see movement in this area
while there is still an opportunity to give feedback on the Encoding
Standard.

# Stuff I think we should do

## Get rid of implicit trips through the old alias table

Even though in mozilla-central we now use EncodingUtils for Encoding
Standard-compliant label handling instead of calling nsCharsetAlias,
we still tend to instantiate decoders using
nsICharsetConverterManager::GetUnicodeDecoder, which implicitly goes
through the old alias code. We should stop using that method in Gecko.

Part of https://bugzilla.mozilla.org/show_bug.cgi?id=863728
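
As an interim pattern until the non-XPCOM API discussed below exists,
call sites could look roughly like the sketch below. The headers and
signatures are approximated from memory, and I'm assuming
GetUnicodeDecoderRaw is the variant that skips the alias table, so
treat this as illustrative rather than exact:

```cpp
// Sketch only; headers and signatures approximated, not verbatim
// from mozilla-central.
#include "mozilla/dom/EncodingUtils.h"
#include "nsCOMPtr.h"
#include "nsICharsetConverterManager.h"
#include "nsIUnicodeDecoder.h"
#include "nsServiceManagerUtils.h"
#include "nsString.h"

using mozilla::dom::EncodingUtils;

already_AddRefed<nsIUnicodeDecoder>
NewDecoderForLabel(const nsACString& aLabel)
{
  // Resolve the label per the Encoding Standard; reject bogus labels.
  nsAutoCString encoding;
  if (!EncodingUtils::FindEncodingForLabel(aLabel, encoding)) {
    return nullptr;
  }
  // Instantiate by the canonical name only. Calling GetUnicodeDecoder()
  // here would run the already-resolved name through the old alias
  // table again; the Raw variant (assumed here) skips that step.
  nsCOMPtr<nsICharsetConverterManager> ccm =
    do_GetService(NS_CHARSETCONVERTERMANAGER_CONTRACTID);
  if (!ccm) {
    return nullptr;
  }
  nsCOMPtr<nsIUnicodeDecoder> decoder;
  ccm->GetUnicodeDecoderRaw(encoding.get(), getter_AddRefs(decoder));
  return decoder.forget();
}
```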

## Add a non-XPCOM way to get a decoder by encoding name

Having to deal with nsICharsetConverterManager results in useless
boilerplate in C++ whenever we instantiate decoders. I think we should
have a non-XPCOM way to instantiate a decoder by encoding name. This
method should not do any label resolution but instead should only be
called after EncodingUtils::FindEncodingForLabel has already been
used.

https://bugzilla.mozilla.org/show_bug.cgi?id=919935
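
One possible shape for such an API; the DecoderForEncoding name and
signature here are hypothetical, not necessarily what will land in the
bug:

```cpp
// Hypothetical addition to EncodingUtils; method name and signature
// are illustrative only.
class EncodingUtils
{
public:
  // Existing (approximate signature): maps a label to its canonical
  // encoding name per the Encoding Standard; returns false for
  // unknown labels.
  static bool FindEncodingForLabel(const nsACString& aLabel,
                                   nsACString& aOutEncoding);

  // Proposed: returns a decoder for an already-resolved canonical
  // encoding name. No label resolution, no alias table, no XPCOM
  // service lookup.
  static already_AddRefed<nsIUnicodeDecoder>
  DecoderForEncoding(const nsACString& aEncoding);
};

// Intended call-site pattern:
//   nsAutoCString encoding;
//   if (EncodingUtils::FindEncodingForLabel(label, encoding)) {
//     nsCOMPtr<nsIUnicodeDecoder> decoder =
//       EncodingUtils::DecoderForEncoding(encoding);
//   }
```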

## Implement the replacement encoding

Some legacy encodings have the characteristic that bytes that don't
represent script in those encodings become script if the bytes are
interpreted according to a more common encoding (UTF-8, windows-1252
or similar). Therefore, simply not supporting such encodings is
dangerous: content labeled with them just gets decoded according to a
fallback encoding instead. Rather than merely dropping support for
such dangerous encodings, we should implement the replacement
encoding, which is guaranteed to decode any stream of bytes to
non-script, and map the labels of the dangerous encodings that we no
longer support to the replacement encoding.

https://bugzilla.mozilla.org/show_bug.cgi?id=863728
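
The decoder side is trivial: per the Encoding Standard, the
replacement decoder emits a single U+FFFD for any non-empty stream and
swallows everything else. A minimal sketch of those semantics (not the
actual nsIUnicodeDecoder plumbing):

```cpp
// Minimal sketch of the replacement decoder's semantics; the real
// implementation would sit behind nsIUnicodeDecoder.
#include <cstddef>
#include "nsString.h"

class ReplacementDecoder
{
public:
  // Appends the decoder's output for aBytes to aOut.
  void Decode(const char* aBytes, size_t aLength, nsAString& aOut)
  {
    // Any non-empty input produces a single U+FFFD, once per stream.
    if (aLength > 0 && !mErrorEmitted) {
      mErrorEmitted = true;
      aOut.Append(char16_t(0xFFFD));
    }
    // All further bytes are swallowed; nothing ever decodes to markup
    // or script, which is the whole point of the replacement encoding.
  }

private:
  bool mErrorEmitted = false;
};
```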

## Remove the Korean, Simplified Chinese and Traditional Chinese detectors

Each of the Korean, Simplified Chinese and Traditional Chinese locales
has a single legacy fallback encoding: euc-kr, gbk and big5,
respectively. Therefore, none of these locales needs a detector and,
indeed, Firefox for these locales ships with the detector off.

Since detectors are undesirable due to their interference with
incremental HTML loading, their non-standard status, their non-obvious
behavior and the bad incentives they create for Web authors if turned
on by default, I think we should remove these detectors, which are
unnecessary and not even currently turned on by default for these
locales.

WebKit and Blink don't have autodetection except for Japanese encodings.

https://bugzilla.mozilla.org/show_bug.cgi?id=844118
https://bugzilla.mozilla.org/show_bug.cgi?id=844120

## Make the File API not use the Universal chardet

Our File API implementation uses the Universal chardet when converting
a local file to JS strings, in blatant violation of the specification
and in a way that differs from how e.g. Chrome behaves. We should
comply with the specification and assume UTF-8 instead of making stuff
up like this.

https://bugzilla.mozilla.org/show_bug.cgi?id=848842

## Remove the Universal chardet

The Universal chardet is not really universal: it is rather arbitrary
in what it tries to detect. For example, it tries to detect Hebrew,
Hungarian and Thai, but it doesn't try to detect Arabic, Czech or
Vietnamese (and the Hungarian detection apparently doesn't actually
work right). As far as I can tell, what's detected depends on the
interests of the people who worked on the detector in the Netscape
days.

I see no indication that the Universal detector could realistically be
expected to grow universal encoding and language coverage in a
reasonable timeframe. The code hasn't seen much active development
since the Netscape days.

I think it's reasonable to assume that even if the Universal detector
gained coverage for more languages and encodings, reliably choosing
between various single-byte encodings would be a gamble. For example,
if we enabled a detector in builds that we ship to Western Europe,
Africa, South Asia and the Americas, I expect it would be virtually
certain that the result would be worse than just having the fallback
encoding always be windows-1252, because we'd introduce e.g.
windows-1250 misguesses to locales where windows-1252 is consistently
the best guess.

Basing the detection on the payload of the HTTP response is bad for
incremental parsing, and stopping mid-parse to reload the page with a
different encoding is bad for the user experience. Because of this
(and the point above), I think it makes more sense to get rid of the
detector than to try to improve it.

Since the Universal detector is not enabled by default in any
configuration that we ship, it is not necessary to keep it around. But
if it's not enabled by default, what's the harm then?

The harm is that the name "Universal" oversells the detector and makes
it attractive in a misleading way. For example, would we ever have
used the detector in the File API implementation if it had been
truthfully advertised as not really being universal? Also, when the
name "Universal" is shown in the menu, enabling the feature might look
like a no-brainer to a user who doesn't realize what the associated
downsides are. Once it is enabled, it might not even be obvious to the
user why some sites break, which could easily make Firefox look bad.
Also, if a Web author enables it, the author may think his/her pages
work when they don't work with the default settings.

Again, WebKit and Blink don't have autodetection except for Japanese encodings.

https://bugzilla.mozilla.org/show_bug.cgi?id=849113
https://bugzilla.mozilla.org/show_bug.cgi?id=844115

## Remove support for single-byte encodings that are not in the Encoding Standard

We should identify the single-byte encodings that are no longer
reachable because no label maps to them according to the Encoding
Standard. To gain some savings in binary size, we should remove the
decoders and encoders for these encodings. We should also make sure
that none of the detectors can detect encodings that are not in the
Encoding Standard.

If Thunderbird still wants to have these, I think they should take
them into comm-central.

## Remove support for multi-byte encodings that are not in the Encoding Standard

Decoders for multi-byte encodings not only contribute to the binary
size but are also potential attack surface in the form of buffer
overflows or pointers otherwise pointing to wrong things. Even though
there's not supposed to be a way to instantiate decoders for encodings
that are not in the Encoding Standard, to make sure that we have
gotten rid of the attack surface, we should remove decoders and
encoders for multi-byte encodings that are not in the Encoding
Standard from mozilla-central.

If Thunderbird still wants to have these, I think they should take
them into comm-central.

## Unify big5 and big5-hkscs

When we moved to Encoding Standard-compliant label handling, we
backpedaled and made an exception for big5-hkscs, because our decoders
weren't Encoding Standard-compliant yet. We should make our big5
decoder Encoding Standard-compliant and then get rid of the separate
big5-hkscs implementation.

https://bugzilla.mozilla.org/show_bug.cgi?id=912470

## Add a way to signal the end of the stream to encoders and decoders

Currently, the converter interfaces we use don't have a way to signal
the end of the stream. This means that when a stream ends with a
partial code unit sequence, we don't properly emit the REPLACEMENT
CHARACTER.

https://bugzilla.mozilla.org/show_bug.cgi?id=562590
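
For illustration, here is one hypothetical shape for the change; the
flag name is made up for this sketch, and the real fix would
presumably extend nsIUnicodeDecoder rather than add a new interface:

```cpp
// Hypothetical interface shape; parameter name is illustrative only.
#include <stdint.h>
#include "nscore.h"

class StreamingDecoder
{
public:
  // aIsLastChunk tells the decoder that no more bytes will follow, so
  // a dangling partial code unit sequence must be flushed as U+FFFD
  // instead of being silently dropped.
  virtual nsresult Convert(const char* aSrc,
                           int32_t aSrcLength,
                           char16_t* aDest,
                           int32_t* aDestLength,
                           bool aIsLastChunk) = 0;
};

// Example: 0xC3 0xA9 is the UTF-8 encoding of U+00E9, but a stream
// that ends right after 0xC3 is truncated mid-sequence. With
// aIsLastChunk = true, the decoder emits U+FFFD for the orphaned lead
// byte rather than losing it.
```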

## Review the remaining encoders and decoders for Encoding Standard compliance

We should review the remaining encoders and decoders for Encoding
Standard compliance. In particular, we should review the multi-byte
decoders.

# Stuff I think should be investigated

## Get telemetry for how often email messages in Thunderbird don't come with an encoding declaration

Email messages are typically generated by email programs that know
what they are generating, so it's reasonable to expect email messages
to have a higher rate of encoding labeling than Web content. If the
rate of labeling is indeed very high, it doesn't make sense to support
encoding detectors in Thunderbird. We should gather telemetry to find
out.
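
If we go this way, the probe could be a single boolean histogram
accumulated at the point where a message's charset is determined. A
sketch, with a made-up histogram ID:

```cpp
// Sketch only; the histogram ID is hypothetical and would need to be
// declared in Histograms.json before this compiles.
#include "mozilla/Telemetry.h"

void RecordCharsetDeclarationTelemetry(bool aMessageDeclaredCharset)
{
  // true  = the message carried an explicit charset declaration
  // false = we had to fall back to a default or a detector
  mozilla::Telemetry::Accumulate(
      mozilla::Telemetry::MESSAGE_HAS_CHARSET_DECLARATION,  // hypothetical
      aMessageDeclaredCharset);
}
```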

## Remove the combined Chinese detector

We have an off-by-default detector that chooses between Traditional
Chinese and Simplified Chinese. While this detector might have a valid
use case when users read content both from Taiwan and mainland China
or both from Hong Kong and mainland China, we might be able to address
that use case with less magic and without adverse side effects on HTML
parsing by looking at the .tw, .hk and .cn top-level domains of the
content.
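
A sketch of what the TLD-based choice could look like; this is a
deliberate simplification and glosses over where in the fallback
computation such a check would hook in:

```cpp
// Sketch: choose the legacy fallback encoding from the top-level
// domain instead of running a content-based Traditional-vs-Simplified
// detector.
#include "nsString.h"

void
FallbackEncodingForTLD(const nsACString& aTLD,
                       const nsACString& aLocaleDefault,
                       nsACString& aOutEncoding)
{
  if (aTLD.EqualsLiteral("tw") || aTLD.EqualsLiteral("hk")) {
    aOutEncoding.AssignLiteral("Big5");   // Traditional Chinese regions
  } else if (aTLD.EqualsLiteral("cn")) {
    aOutEncoding.AssignLiteral("GBK");    // Simplified Chinese
  } else {
    aOutEncoding.Assign(aLocaleDefault);  // leave other TLDs alone
  }
}
```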

## Remove the Russian and Ukrainian detectors

WebKit and Blink get away with Japanese detection only. Since
detection has downsides, we shouldn't do it unless we need to. We
should investigate the feasibility of removing the Russian and
Ukrainian detectors and hopefully remove them.

https://bugzilla.mozilla.org/show_bug.cgi?id=845791

## Remove support for HZ and map it to the replacement encoding

Of the encodings that we do support, I think HZ is by far the scariest
one, because its escape delimiters are printable ASCII characters and
because it's fairly easy to construct attack byte sequences that
round-trip from HZ to Unicode and back.

I think we should investigate if we could get away with removing
support for HZ and mapping its labels to the replacement encoding
without breaking the Web too much.

## Make our ISO-2022-JP decoder more IE-compatible by supporting Shift_JIS sequences

I haven't personally verified this, but I've been told that in IE, the
ISO-2022-JP decoder supports Shift_JIS sequences. This means that when
Shift_JIS content is mislabeled as ISO-2022-JP, it works in IE right
away but in Firefox the user has to manually override the encoding
from the Character Encoding menu. In order to minimize the cases where
the user has to reach for the Character Encoding menu, we should
standardize and adopt IE's behavior.

## Investigate the unification of gbk and gb18030

Both Gecko and the Encoding Standard treat gbk and gb18030 as distinct
encodings. Hixie seems to think that these could be unified. We should
investigate if that's indeed feasible.

## Consider moving to WebKit-like detection for Japanese encodings

WebKit has autodetection only for Japanese and it's pretty simple. We
should investigate doing what WebKit does for Japanese.

http://trac.webkit.org/browser/trunk/Source/WebCore/loader/TextResourceDecoder.cpp#L157
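
For a taste of how simple it is, the ISO-2022-JP half of the check
boils down to looking for the escape sequences. The sketch below is a
rough simplification of that part only, not a faithful port of
WebKit's detector (which also distinguishes Shift_JIS from EUC-JP):

```cpp
// Grossly simplified sketch of the ISO-2022-JP part of Japanese
// detection: ISO-2022-JP is the only candidate that uses ESC-based
// shift sequences, so seeing "ESC $" or "ESC (" is a strong signal.
#include <cstddef>

bool LooksLikeIso2022Jp(const unsigned char* aBytes, size_t aLength)
{
  for (size_t i = 0; i + 1 < aLength; ++i) {
    if (aBytes[i] == 0x1B &&  // ESC
        (aBytes[i + 1] == '$' || aBytes[i + 1] == '(')) {
      return true;
    }
  }
  return false;
}
```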

## Add UTF-8 detection for file: URLs

When files are saved locally, they lose their HTTP headers. This is
increasingly a problem with locally saved UTF-8-encoded pages, since
loading those pages from file: URLs results in the use of the
non-UTF-8 fallback encoding. However, the problems that misfiring
autodetection causes don't apply to UTF-8 when the whole stream is
available, since it is all but certain that the page is intended to be
UTF-8-encoded if the whole stream validates as UTF-8. Also, when
reading from the local file system, the whole file is available and
can be inspected ahead of parsing without causing the problems that
autodetection causes when it interacts with incremental loading.

To make the Character Encoding menu unnecessary when accessing local
UTF-8-encoded files, we should read all the bytes of the file from the
disk, check if they are valid UTF-8 and only then parse (as UTF-8 if
the stream was valid UTF-8).
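
The validation step needs no heuristics; Gecko presumably already has
(or could easily grow) a helper along these lines, so the following is
just a self-contained sketch of the check:

```cpp
// Sketch: returns true if the buffer is well-formed UTF-8. If it is,
// a locally saved page is almost certainly meant to be UTF-8, so the
// parser can be fed the bytes as UTF-8; otherwise fall back to the
// locale's legacy default as today.
#include <cstddef>

bool IsValidUtf8(const unsigned char* aBytes, size_t aLength)
{
  size_t i = 0;
  while (i < aLength) {
    unsigned char b = aBytes[i];
    size_t trail;
    unsigned int min;
    if (b <= 0x7F) { i++; continue; }                       // ASCII
    else if (b >= 0xC2 && b <= 0xDF) { trail = 1; min = 0x80; }
    else if (b >= 0xE0 && b <= 0xEF) { trail = 2; min = 0x800; }
    else if (b >= 0xF0 && b <= 0xF4) { trail = 3; min = 0x10000; }
    else { return false; }            // lone trail byte or invalid lead
    if (i + trail >= aLength) { return false; }   // truncated sequence
    unsigned int cp = b & (0x3F >> trail);
    for (size_t j = 1; j <= trail; ++j) {
      unsigned char t = aBytes[i + j];
      if (t < 0x80 || t > 0xBF) { return false; }  // bad trail byte
      cp = (cp << 6) | (t & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
      return false;  // overlong, out of range, or surrogate
    }
    i += trail + 1;
  }
  return true;
}
```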


-- 
Henri Sivonen
hsivo...@hsivonen.fi
http://hsivonen.fi/