Re: Detection of unlabeled UTF-8

Henri Sivonen Mon, 09 Sep 2013 00:31:43 -0700

On Fri, Sep 6, 2013 at 6:17 PM, Adam Roach <[email protected]> wrote:

> Sure. It's a much trickier problem (and, in any case, the UI is
> necessarily more intrusive than what I'm suggesting). There's no good way
> to explain the nuanced implications of security decisions in a way that is
> both accessible to a lay user and concise enough to hold the average user's
> attention.
>

Yes, the decisions that the user is asked to make in the case of HTTPS
deployment errors are more difficult than the decision whether to reload
the page as UTF-8.

(Just for completeness, I should mention that what you're proposing could
be security-sensitive without some further tweaks. For starters, if a page
has been labeled as UTF-16 or anything that maps to the replacement
encoding according to the Encoding Standard, we should not let the user
reload the page as UTF-8. When I say "labeled as UTF-16", I mean labels
that are supposed to take effect as UTF-16 per WHATWG HTML. I don't mean
the sort of bogus UTF-16 labels that actually are treated as UTF-8 labels
by WHATWG HTML.)
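
A minimal sketch of that restriction, in Python rather than Gecko code. The
function name and the from_meta flag are hypothetical, the label sets are
taken from the Encoding Standard's label table, and the sketch assumes that
the "bogus" case is a UTF-16 label found in a <meta> element:

```python
import string

# Labels that the Encoding Standard maps to the replacement encoding.
REPLACEMENT_LABELS = frozenset({
    "csiso2022kr", "hz-gb-2312", "iso-2022-cn", "iso-2022-cn-ext",
    "iso-2022-kr", "replacement",
})

# Labels that the Encoding Standard maps to UTF-16LE or UTF-16BE.
UTF_16_LABELS = frozenset({
    "csunicode", "iso-10646-ucs-2", "ucs-2", "unicode", "unicodefeff",
    "utf-16", "utf-16le", "unicodefffe", "utf-16be",
})

# Labels are matched ASCII case-insensitively, so lowercase only A-Z.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def utf8_override_forbidden(label, from_meta=False):
    """Return True if a user-initiated reload as UTF-8 should not be offered.

    Hypothetical helper: `label` is the declared encoding label (or None),
    and `from_meta` says whether it came from a <meta> element (assumption:
    that is the case where a UTF-16 label is treated as a UTF-8 label).
    """
    if label is None:
        return False
    # Strip ASCII whitespace (tab, LF, FF, CR, space) and lowercase A-Z.
    normalized = label.strip("\t\n\x0c\r ").translate(_ASCII_LOWER)
    if normalized in REPLACEMENT_LABELS:
        return True
    if normalized in UTF_16_LABELS:
        # Only a label that actually takes effect as UTF-16 blocks the reload.
        return not from_meta
    return False
```

With this sketch, utf8_override_forbidden("ISO-2022-KR") and
utf8_override_forbidden("utf-16") are true, while a UTF-16 label passed with
from_meta=True or a label such as "windows-1252" is not blocked.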

> To the first point: the increase in complexity is fairly minimal for a
> substantial gain in usability.
>

How substantial the gain in usability would be is not known without exact
telemetry, but see below.

As for complexity, as the person who has been working with the relevant
code the most in the last couple of years, I think we should try to get
rid of the code for implementing encoding overrides by the user instead of
coming up with new ways to trigger that code. Thanks to e.g. the mistake of
introducing UTF-16 as an interchange encoding to the Web, that code has
needed security fixes.

> Absent hard statistics, I suspect we will disagree about how "fringe" this
> particular exception is. Suffice it to say that I have personally
> encountered it as a problem as recently as last week. If you think we need
> to move beyond anecdotes and personal experience, let's go ahead and add
> telemetry to find out how often this arises in the field.
>

We don't have telemetry for the question "How often are pages that are not
labeled as UTF-8, UTF-16 or anything that maps to the replacement
encoding according to the Encoding Standard and that contain non-ASCII
bytes in fact valid UTF-8?" How rare would the mislabeled UTF-8 case need
to be for you to consider the UI that you're proposing not worth it?
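
To make the byte-level part of that question concrete, here is a minimal
Python sketch (not a Gecko telemetry probe; the function name is
hypothetical). It assumes the label check has already been done elsewhere and
only answers whether a byte stream contains non-ASCII bytes and is valid
UTF-8 as a whole. It works chunk by chunk so that the whole page does not need
to be buffered:

```python
import codecs


def is_non_ascii_valid_utf8(chunks):
    """Return True if the concatenated chunks contain at least one
    non-ASCII byte and decode as strictly valid UTF-8.

    `chunks` is any iterable of bytes objects, e.g. network buffers.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    saw_non_ascii = False
    try:
        for chunk in chunks:
            if not saw_non_ascii and not chunk.isascii():
                saw_non_ascii = True
            # Raises UnicodeDecodeError on an invalid sequence; an incomplete
            # trailing sequence is buffered until the next chunk.
            decoder.decode(chunk)
        # Flush: a sequence still incomplete at end of stream is an error.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return saw_non_ascii
```

Pure-ASCII pages return False here on purpose, because they decode the same
way under UTF-8 and under the ASCII-compatible legacy encodings and therefore
say nothing about mislabeling.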

However, we do have telemetry for the percentage of Firefox sessions in
which the current character encoding override UI has been used at least
once. See https://bugzilla.mozilla.org/show_bug.cgi?id=906032 for the
results broken down by desktop versus Android and then by locale. One
could speculate about the answer to the UTF-8 question relative to this
telemetry data both ways: Since the general character encoding override usage
includes cases where the encoding being switched to is not UTF-8, one
could expect the UTF-8 case to be even more fringe than what these
telemetry results show. On the other hand, these telemetry results show
only cases where the user is aware of the existence of the character
encoding override UI and bothers to use it, so one could argue that the
UTF-8 case could actually be more common.

I would accept a (performance-conscious) patch for gathering telemetry for
the UTF-8 question in the HTML parser. However, I'm not volunteering to
write one myself immediately, because I have bugs on my todo list that have
been caused by previous attempts of Gecko developers to be well-intentioned
about DWIM and UI around character encodings. Gotta fix those first.

> Your second point is an argument against automatic correction. Don't get me
> wrong: I think automatic correction leads to innocent publisher mistakes
> that make things worse over the long term. I absolutely agree that doing so
> trades short-term gain for long-term damage. But I'm not arguing for
> automatic correction.
>

Even non-automatic correction means authors can take the attitude that
getting the encoding wrong is no big deal since the fix is a click away for
the user. But how will that UI work in non-browser apps that load Web
content on B2G, etc.?

On Fri, Sep 6, 2013 at 6:45 PM, Robert Kaiser <[email protected]> wrote:
> Hmm, do we have to treat the whole document as a consistent charset?

The practical answer is yes.

> Could
> we instead, if we don't know the charset, look at every rendered-as-text
> node/attribute in the DOM tree and run some kind of charset detection on
> it?
>
> May be a dumb idea but might avoid the problem on the parsing level.

And then we'd have at least 34 problems (if my quick count of legacy
encodings was correct). On a more serious note, though, it's a bad idea to
try to develop complex solutions to problems that are actually relatively
rare on the Web these days and it's even worse to go deeper into DWIM when
experience shows that DWIM in this area is a big part of the reason we
have this mess.

On Fri, Sep 6, 2013 at 7:36 PM, Neil Harris <[email protected]> wrote:
> http://w3techs.com/technologies/overview/character_encoding/all

I don't trust the methodology of that site. Previously, their methodology
has been shown to be bogus. See the discussion starting at
http://krijnhoetmer.nl/irc-logs/whatwg/20121114#l-406 . Hopefully, they
have changed their methodology by now (
https://twitter.com/W3Techs/status/268835927500148737 ), but I still don't
trust that they get it right.

Since it isn't even the only site in this area with bogus methodology (see
e.g. https://twitter.com/builtwith/status/268951926085910528 ), I don't
trust purported data in this area unless it has been compiled by someone
who I know to know enough about the topic (not just about encodings but
also the mechanisms that browsers use to pick an encoding) to have a
chance of getting the methodology right.

-- 
Henri Sivonen
[email protected]
http://hsivonen.iki.fi/
_______________________________________________
dev-platform mailing list
[email protected]
https://lists.mozilla.org/listinfo/dev-platform