On Thu, Apr 13, 2017 at 02:39:36AM +0000, Viktor Dukhovni wrote:
> 
> Well, IIRC they sensibly converged on a case-folded normal form
> that ensures that https://Духовный.org maps to the same underlying
> wire-form domain as https://духовный.org, i.e. both result in
> queries for xn--b1adqpd3ao5c.org.  AFAIK, those would generally be
> different domains under IDNA2008.

They would be different domains because the first of them contains an
uppercase letter, and uppercase is DISALLOWED under IDNA2008.  But
everyone knew, when making IDNA2008, that removing the case mapping
from the protocol meant that clients needed to do it
before starting.  That's what https://tools.ietf.org/html/rfc5895 was
all about (a document that could have moved faster if some
participants had collaborated more enthusiastically instead of, well,
going away and making their own protocol).

One of the problems we had with IDNA2003 was that the protocol did the
caseFold operation.  The difficulty there was that there was no way to
pay attention to locale or other information that might tell you the
right thing to do, because caseFold is not nearly as simple as ASCII
always pretended it was.  IDNA2008's answer was to kick this problem
out of the protocol and into user agents, which were supposed to do
this in a sensible way.  If UTS#46 had restricted itself to that kind
of job, it could well have been an enormous contribution to the
practical use of IDNs.  Unfortunately, it didn't do just that.
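
As a rough illustration of why a single protocol-level case operation
cannot be right for everyone, here is a minimal Python sketch.  The
helper lower_for_locale and its "tr"/"az" handling are hypothetical,
invented for this example; they are not taken from IDNA2003, IDNA2008,
RFC 5895 or UTS#46.

```python
def lower_for_locale(label: str, locale: str = "") -> str:
    """Hypothetical helper: lowercase a label, with Turkic handling of I."""
    if locale in ("tr", "az"):
        # Turkic convention: U+0130 (dotted capital I) lowercases to "i",
        # and plain "I" lowercases to U+0131 (dotless small i).
        label = label.replace("\u0130", "i").replace("I", "\u0131")
    return label.lower()


# Case folding is not a one-to-one ASCII-style operation:
print("Straße".lower())      # lowercasing keeps the sharp s
print("Straße".casefold())   # case folding expands it to "ss"

# The same input gives different results depending on the locale:
print(lower_for_locale("ISTANBUL"))
print(lower_for_locale("ISTANBUL", "tr"))
```

Only the user agent is in a position to know which of the last two
results the user meant, which is the argument for doing this step
outside the protocol.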

> While is true that UTS#46 maps <U+1F4A9>.org to xn--ls8h.org, (see

In the way I'm using the term, that's not mapping, that's re-encoding.
The idea of "mapping" is to substitute in some more or less
predictable way some set of Unicode code points for some other set of
Unicode code points.  Then you can run the resulting final string
through the IDNA2008 algorithm.  
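
To make the distinction concrete, here is a small Python sketch that
keeps the two steps apart.  The function names naive_map and to_wire
are made up for this example, and the mapping shown (lowercase, then
NFC) is only a simplified stand-in for the kind of local mapping
RFC 5895 describes.  The re-encoding step uses Python's built-in
"punycode" codec and performs no IDNA2008 validity checks at all.

```python
import unicodedata


def naive_map(label: str) -> str:
    """Mapping: Unicode code points in, Unicode code points out."""
    # Simplified assumption: lowercase, then normalize to NFC.
    return unicodedata.normalize("NFC", label.lower())


def to_wire(label: str) -> str:
    """Re-encoding: turn a Unicode label into its ACE (xn--) form."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


mapped = naive_map("Духовный")
print(mapped)                  # духовный
print(to_wire(mapped))         # xn--b1adqpd3ao5c

# No mapping happens here; the code point is only re-encoded.
print(to_wire("\U0001F4A9"))   # xn--ls8h
```

Note that to_wire will happily encode anything, which is why producing
an xn-- label says nothing about whether the label is valid.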

The problem with treating U+1F4A9 as an acceptable character for an
identifier is not the character itself -- maybe it is fine on its own -- but that
it is part of a class of characters that do not have normalizations
and are not letters or digits.  The IETF took seriously the advice
from UTC that we should use the stable categories that UTC had
invented, and derive our properties from those.  We did so, and no
emojis are in the categories that we used for the derivation.  It is
therefore more than a little frustrating to see the same UTC now
recommending that such characters be used in identifiers on the
network.  Such use is particularly bad with emojis because they have
no normalization either, and they interact in some ways with ZWNJ.
They provide a completely new playground for attackers to use in
phishing and so on, and we already have _enough_ trouble with that
without inventing new ways to cause ourselves grief.
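
The category point can be checked with Python's unicodedata module.
This is a partial sketch only: LETTER_DIGITS below is the
General_Category set that RFC 5892, section 2.1 uses for its
"LetterDigits" rule, and in_letter_digits is a name invented for this
example.  The full IDNA2008 derivation has further rules (stability
under case folding and normalization, exceptions, contextual rules)
that are not modelled here, so a True result does not mean PVALID.

```python
import unicodedata

# General_Category values behind "LetterDigits" in RFC 5892, section 2.1.
LETTER_DIGITS = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}


def in_letter_digits(ch: str) -> bool:
    """Return True if ch falls in one of the LetterDigits categories."""
    return unicodedata.category(ch) in LETTER_DIGITS


for ch in ("д", "7", "\U0001F4A9"):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), in_letter_digits(ch))

# Normalization leaves the emoji untouched, so it cannot be used to
# collapse look-alike sequences into one canonical form.
print(unicodedata.normalize("NFKC", "\U0001F4A9") == "\U0001F4A9")
```

U+1F4A9 has General_Category So (Symbol, other), so it falls outside
the set the derivation started from.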

Anyway, I doubt very much that DNSOP is the list where this ought to
be discussed (idna-update is still an open list, as is the IAB's
i18n-discuss list even though the program is closed, and precis is
still an active WG last I checked).  But any sentence about
internationalization that involves the concept "just do _x_" is, I
think, already too naïve.

Best regards,

A

-- 
Andrew Sullivan
a...@anvilwalrusden.com
