On Fri, Sep 6, 2013 at 6:17 PM, Adam Roach <[email protected]> wrote:

> Sure. It's a much trickier problem (and, in any case, the UI is
> necessarily more intrusive than what I'm suggesting). There's no good way
> to explain the nuanced implications of security decisions in a way that is
> both accessible to a lay user and concise enough to hold the average user's
> attention.
>
Yes, the decisions that the user is asked to make in the case of HTTPS deployment errors are more difficult than the decision whether to reload the page as UTF-8.

(Just for completeness, I should mention that what you're proposing could be security-sensitive without some further tweaks. For starters, if a page has been labeled as UTF-16 or anything that maps to the replacement encoding according to the Encoding Standard, we should not let the user reload the page as UTF-8. When I say "labeled as UTF-16", I mean labels that are supposed to take effect as UTF-16 per WHATWG HTML. I don't mean the sort of bogus UTF-16 labels that actually are treated as UTF-8 labels by WHATWG HTML.)

> To the first point: the increase in complexity is fairly minimal for a
> substantial gain in usability.
>

How substantial the gain in usability would be is not known without exact telemetry, but see below.

As for complexity, as the person who has been working with the relevant code the most in the last couple of years, I think we should try to get rid of the code for implementing encoding overrides by the user instead of coming up with new ways to trigger that code. Thanks to e.g. the mistake of introducing UTF-16 as an interchange encoding to the Web, that code has needed security fixes.

> Absent hard statistics, I suspect we will disagree about how "fringe" this
> particular exception is. Suffice it to say that I have personally
> encountered it as a problem as recently as last week. If you think we need
> to move beyond anecdotes and personal experience, let's go ahead and add
> telemetry to find out how often this arises in the field.
>

We don't have telemetry for the question "How often are pages that are not labeled as UTF-8, UTF-16 or anything that maps to the replacement encoding according to the Encoding Standard and that contain non-ASCII bytes in fact valid UTF-8?"
How rare would the mislabeled UTF-8 case need to be for you to consider the UI that you're proposing not worth it?

However, we do have telemetry for the percentage of Firefox sessions in which the current character encoding override UI has been used at least once. See https://bugzilla.mozilla.org/show_bug.cgi?id=906032 for the results broken down by desktop versus Android and then by locale.

One could speculate about the answer to the UTF-8 question relative to this telemetry data in both directions: Since the general character encoding override usage includes cases where the encoding being switched to is not UTF-8, one could expect the UTF-8 case to be even more fringe than what these telemetry results show. On the other hand, these telemetry results show only cases where the user is aware of the existence of the character encoding override UI and bothers to use it, so one could argue that the UTF-8 case could actually be more common.

I would accept a (performance-conscious) patch for gathering telemetry for the UTF-8 question in the HTML parser. However, I'm not volunteering to write one myself immediately, because I have bugs on my todo list that have been caused by previous attempts of Gecko developers to be well-intentioned about DWIM and UI around character encodings. Gotta fix those first.

> Your second point is an argument against automatic correction. Don't get me
> wrong: I think automatic correction leads to innocent publisher mistakes
> that make things worse over the long term. I absolutely agree that doing so
> trades short-term gain for long-term damage. But I'm not arguing for
> automatic correction.
>

Even non-automatic correction means authors can take the attitude that getting the encoding wrong is no big deal since the fix is a click away for the user. But how will that UI work in non-browser apps that load Web content on B2G, etc.?
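
To make the UTF-8 telemetry question above concrete, here is a rough Python sketch of the per-page predicate such a probe would have to evaluate. It is only an illustration of the question, not Gecko code and not a proposed patch: the function name is hypothetical, and the label sets are small illustrative subsets rather than the Encoding Standard's full label table. It also ignores the distinction between UTF-16 labels that take effect as UTF-16 and the bogus ones that WHATWG HTML treats as UTF-8.

```python
from typing import Optional

# Illustrative, deliberately incomplete label sets (assumptions for this
# sketch). A real probe would use the Encoding Standard's label table.
UTF8_LABELS = {"utf-8", "utf8"}
UTF16_LABELS = {"utf-16", "utf-16le", "utf-16be"}
REPLACEMENT_LABELS = {"replacement", "iso-2022-kr", "iso-2022-cn"}


def is_mislabeled_utf8_candidate(body: bytes, label: Optional[str]) -> bool:
    """Return True if a page counts toward the "mislabeled UTF-8" statistic.

    That is: it is not labeled as UTF-8, UTF-16 or a replacement-encoding
    label, it contains non-ASCII bytes, and its bytes are valid UTF-8.
    """
    normalized = (label or "").strip().lower()
    if (
        normalized in UTF8_LABELS
        or normalized in UTF16_LABELS
        or normalized in REPLACEMENT_LABELS
    ):
        return False
    if body.isascii():
        # Pure ASCII decodes identically in UTF-8 and in the ASCII-compatible
        # legacy encodings, so it says nothing about mislabeling.
        return False
    try:
        body.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

For example, the UTF-8 bytes of "käyttö" served with a windows-1252 label would count, while the same text actually encoded as windows-1252 would not, because its bytes are not valid UTF-8. A real probe would need to do this check incrementally inside the parser rather than on a fully buffered body, which is where the performance concern comes from.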
On Fri, Sep 6, 2013 at 6:45 PM, Robert Kaiser <[email protected]> wrote:

> Hmm, do we have to treat the whole document as a consistent charset?

The practical answer is yes.

> Could
> we instead, if we don't know the charset, look at every rendered-as-text
> node/attribute in the DOM tree and run some kind of charset detection on it?
>
> May be a dumb idea but might avoid the problem on the parsing level.

And then we'd have at least 34 problems (if my quick count of legacy encodings was correct).

On a more serious note, though, it's a bad idea to try to develop complex solutions to problems that are actually relatively rare on the Web these days, and it's even worse to go deeper into DWIM when experience shows that DWIM in this area is a big part of the reason we have this mess.

On Fri, Sep 6, 2013 at 7:36 PM, Neil Harris <[email protected]> wrote:

> http://w3techs.com/technologies/overview/character_encoding/all

I don't trust the methodology of that site. Previously, their methodology has been shown to be bogus. See the discussion starting at http://krijnhoetmer.nl/irc-logs/whatwg/20121114#l-406 . Hopefully, they have changed their methodology by now ( https://twitter.com/W3Techs/status/268835927500148737 ), but I still don't trust that they get it right.

Since it isn't even the only site in this area with bogus methodology (see e.g. https://twitter.com/builtwith/status/268951926085910528 ), I don't trust purported data in this area unless it has been compiled by someone who I know to know enough about the topic (not just about encodings but also the mechanisms that browsers use to pick an encoding) to have a chance of getting the methodology right.

-- 
Henri Sivonen
[email protected]
http://hsivonen.iki.fi/

