On Thu, Apr 13, 2017 at 02:39:36AM +0000, Viktor Dukhovni wrote:
>
> Well, IIRC they sensibly converged on a case-folded normal form
> that ensures that https://Духовный.org maps to the same underlying
> wire-form domain as https://духовный.org, i.e. both result in
> queries for xn--b1adqpd3ao5c.org. AFAIK, those would generally be
> different domains under IDNA2008.

They would be different domains because the first of them is DISALLOWED. But everyone knew, when making IDNA2008, that removing the case mapping from the protocol meant that clients needed to do it before starting. That's what https://tools.ietf.org/html/rfc5895 was all about (a document that could have moved faster if some participants had collaborated more enthusiastically instead of, well, going away and making their own protocol).

One of the problems we had with IDNA2003 was that the protocol did the caseFold operation. The difficulty there was that there was no way to pay attention to locale or other information that might tell you the right thing to do, because caseFold is not nearly as simple as ASCII always pretended it was. IDNA2008's answer was to kick this problem out of the protocol and into user agents, which were supposed to do this in a sensible way. If UTS#46 had restricted itself to that kind of job, it could well have been an enormous contribution to the practical use of IDNs. Unfortunately, it didn't do just that.

> While is true that UTS#46 maps <U+1F4A9>.org to xn--ls8h.org, (see

In the way I'm using the term, that's not mapping, that's re-encoding. The idea of "mapping" is to substitute in some more or less predictable way some set of Unicode code points for some other set of Unicode code points. Then you can run the resulting final string through the IDNA2008 algorithm.

The problem with treating U+1F4A9 as an acceptable character for an identifier is not in itself -- maybe it is fine on its own -- but that it is part of a class of characters that do not have normalizations and are not letters or digits. The IETF took seriously the advice from UTC that we should use the stable categories that UTC had invented, and derive our properties from those. We did so, and no emojis are in the categories that we used for the derivation.
It is therefore more than a little frustrating to see the same UTC now recommending that such characters be used in identifiers on the network. Such use is particularly bad with emojis because they have no normalization either, and they interact in some ways with ZWNJ. They provide a completely new playground for attackers to use in phishing and so on, and we already have _enough_ trouble with that without inventing new ways to cause ourselves grief.

Anyway, I doubt very much that DNSOP is the list where this ought to be discussed (idna-update is still an open list, as is the IAB's i18n-discuss list even though the program is closed, and precis is still an active WG last I checked). But any sentence about internationalization that involves the concept "just do _x_" is, I think, already too naïve.

Best regards,

A

--
Andrew Sullivan
a...@anvilwalrusden.com

_______________________________________________
DNSOP mailing list
DNSOP@ietf.org
https://www.ietf.org/mailman/listinfo/dnsop
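
For readers who want to see the distinction between "mapping" and "re-encoding" drawn in the message above, here is a minimal Python sketch. It is only an illustration under stated assumptions: `map_label` and `to_a_label` are hypothetical helper names, the mapping step is a bare lowercase-plus-NFC stand-in rather than the full RFC 5895 or UTS#46 procedure, and `to_a_label` performs no IDNA2008 validity checks at all. It relies only on Python's standard `punycode` codec and the `unicodedata` module.

```python
import unicodedata


def map_label(label: str) -> str:
    """Hypothetical pre-processing ("mapping") step done by a user agent.

    Assumption: plain lowercasing followed by NFC, which is only a rough
    stand-in for the locale-aware choices a real client might make.
    """
    return unicodedata.normalize("NFC", label.lower())


def to_a_label(label: str) -> str:
    """Re-encode one label as Punycode with the "xn--" prefix.

    This is pure re-encoding: it does not check whether the code points
    are allowed in an identifier.
    """
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


upper, lower = "Духовный", "духовный"

# Without a mapping step, the two spellings encode to different labels.
print(to_a_label(upper) == to_a_label(lower))             # False

# With the mapping step applied first, they converge on one label.
print(to_a_label(map_label(upper)) == to_a_label(lower))  # True
print(to_a_label(lower))                                  # xn--b1adqpd3ao5c

# U+1F4A9 is a symbol (general category "So"), not a letter or digit,
# and the mapping step leaves it unchanged; it can still be re-encoded.
emoji = "\U0001F4A9"
print(unicodedata.category(emoji))                        # So
print(map_label(emoji) == emoji)                          # True
print(to_a_label(emoji))                                  # xn--ls8h
```

The point of the sketch is that producing an `xn--` label is mechanical and says nothing about whether the input belonged in an identifier; the code-point substitution that precedes it is where the policy questions sit.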