Re: [Uta] UTS-46 / WHATWG

John C Klensin Sat, 28 Jan 2023 14:26:50 -0800

Rob,

I am just about burned out on this discussion (or family of
discussions) so let me see if I can review the history and
explain the problem but with the understanding that I will
probably not respond further until and unless a relevant
document goes into IETF Last Call.  I'm adding Patrik to this
because we act as sanity checks on each other but he is very
busy.  Unless I say something truly stupid or have omitted
something important, I don't expect him to respond and you
probably shouldn't either.  Also copying Vint, who chaired the
WG that created IDNA2008, in case he has anything to add.

This was mostly written before your three notes today.  I got
interrupted for several hours before proofreading and sending
it.  While one comment from those notes is reflected a bit in my
recommendations below, I'm not going to take the additional time
to rewrite the message to respond directly to those notes.

** History **

If we go back to around 2006-2007, the main thing that motivated
the work that became IDNA2008 was that we had become convinced,
after some nasty experiences, that IDNA2003 was not working well
and, in particular, that it permitted many code points and
constructions in domain names that could, either accidentally or
via a bit of malice, cause serious user confusion including
strings that would not be presented consistently, etc. _and_
that it was excluding some important code points (by mapping
them away) that should not be excluded.   There were some ideas
in the initial design that were pulled out during the process of
developing IDNA2008 that a few of us still regret losing and
that would have prevented a range of additional problems.   I
don't know whether you consider "names cannot be used reliably"
as an interoperability issue.  It certainly can be the basis of
considerable problems for the user and significant security
risks.  The second big problem IDNA2008 was supposed to solve
was to move from a table-based system with code points
considered for acceptability one at a time (and a likely default
of "if Unicode thinks it is ok, so do we") to a rule-based one
that would automatically (and correctly) classify the vast
majority of Unicode code points including newly-added ones.  The
latter has not worked out as well as we hoped... but still much
better than IDNA2003 would have.

We had also, empirically and sadly, discovered that some
registrars were not as knowledgeable and/or careful as they had
been expected to be when IDNA2003 was approved.   The assumption
had been that they would be on the lookout for proposed names
that would be likely to cause problems and push back on them.
Instead, for whatever reason, even when such names were
identified, they seem to have been seen by some as revenue
opportunities.  Partially as a consequence, IDNA2008 contained
not only more explicit provisions for checking of known problem
cases by registries but also provisions for checking a subset of
those cases (the ones that were easy and fast to check) by
lookup applications.

It is probably also worth mentioning that, in addition to some
compromises with (or changes in response to pressure from) the
Unicode folks, IDNA2008 also reflects discussions with, and
agreement from, the ICANN registry constituency.  That agreement
included one of the issues to which some of the Unicode people
objected most strongly, including the changes in definitions of
a very small number of characters between IDNA2003 and IDNA2008.
Indeed, one of those changes was advocated by some of those most
directly affected.

So, IDNA2008 was approved and published.   The Unicode
consortium fairly promptly put out UTS#46, announced as a
transition strategy document.  From my perspective (I don't know
if the authors would agree), it addressed the following main
issues with more of a "do this instead" rather than a "how to
get from here to there" approach, a distinction that appears to
me to have become more clear over the years.

(i) Anything that would have been valid under IDNA2003 (whether
registered before 2010 or not) remains valid and can be
registered.

(ii) Special character interpretations given by IDNA2003 but
removed by IDNA2008, notably including the mapping of Eszett
(Sharp S, U+00DF) to "ss" and the treatment of Dotless I
(U+0131), remained as given in IDNA2003.  See Section 1.3.2 of
version 15.0.0 of UTS#46.

(iii) IDNA2003 treats several characters as "ignorable", i.e.,
if they appear in a name being looked up into the DNS, they are
treated as not being part of the string at all.  UTS#46
preserves that behavior.

(iv) While IDNA2003 included case-folding (IIRC, trying to
parallel that in the base DNS specs, which some believe may have
been a problem even then for ASCII) and the use of NFKC (which
suppresses some characters that registries believed to be
important), IDNA2008 does not.  More generally, IDNA2003 tends
to specify mapping from one code point to another while IDNA2008
tries to avoid mappings in favor of either making the characters
that were previously mapped out valid or by prohibiting them
altogether.  Like the two issues above, UTS#46 appears to have
been intended to preserve the validity and interpretation of
anything that would have been valid under IDNA2003 instead. 

(v) The IDNA2003 documents did not clearly specify a plan for
moving forward from Unicode 3.2 to later versions of that spec.
Recent versions of UTS#46, in what is Section 7 and Table 4 in
the current version, effectively provide a Unicode Consortium
version of what might reasonably be thought of as IDNA2003bis.

(vi) One of the "deviations" is that, while it isn't as clear
about it as I would like, UTS#46 allows emoji in domain names
(and UAX#31 and UTS#51 allow them in identifiers more generally).
Because they are considered symbols (even by Unicode), IDNA2008
does not.  Some DNS TLD registries have chosen to allow them at
the second level, so they exist in the wild although it is not
clear that those domains are being used for anything that would
not have bigger problems if used with TLS.  FWIW, other than
some "if you allow that, where do you draw the line" questions, the
big, serious, problems with Emoji arise when they are used in
sequence, with or without combining sequences (IMO, UTS#51 does
not cover a number of cases that have appeared in the wild even
if mostly in illustrative attacks).  The assumption used in some
design thinking (including, I believe, often by WHATWG) is that
people will just copy and paste things rather than trying to
read them aloud, transcribe them from someone else's reading, or
just type them in.  Unfortunately, and for a variety of reasons,
those things happen... and not even copy-and-paste is 100%
reliable when it involves multiple applications.

(vii) Finally, UTS#46 removed all requirements for lookup-time
checking.  In essence, if a name appears in the DNS, it is
assumed to be valid and should be as trusted as any other name
that appears in the DNS.

In addition, because of the mapping issues, even if one
"follows" UTS#46, it is important to distinguish between what
can actually be stored in the DNS (or, equivalently, mapped back
to native characters from Punycode stored in the DNS) and what
strings are "valid" for various purposes.
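
The mapping behavior described in items (ii) through (iv) can be
seen concretely in the minimal Python sketch below.  It assumes
that Python's built-in "idna" codec, which follows IDNA2003 (RFC
3490 plus Nameprep), is an acceptable stand-in for IDNA2003
behavior; it does not implement IDNA2008 or UTS#46.  The helper
name and the sample domain names are invented for illustration.

```python
import stringprep
import unicodedata

# Assumption: Python's built-in "idna" codec follows IDNA2003 (RFC 3490 plus
# Nameprep). It is used here only as a stand-in for IDNA2003 behavior.


def idna2003_lookup_form(name: str) -> str:
    """Return the ASCII form an IDNA2003-style lookup would use (hypothetical helper)."""
    return name.encode("idna").decode("ascii")


# (ii) Eszett (U+00DF) is mapped to "ss", so the two spellings collide.
eszett_name = idna2003_lookup_form("stra\u00dfe.example")
plain_name = idna2003_lookup_form("strasse.example")

# (iii) Soft hyphen (U+00AD) is in the Nameprep "mapped to nothing" table,
# so it disappears from the name that is looked up.
soft_hyphen_ignored = stringprep.in_table_b1("\u00ad")
ignorable_name = idna2003_lookup_form("exam\u00adple.example")

# (iv) NFKC folds compatibility characters, such as the "fi" ligature
# (U+FB01) and fullwidth "A" (U+FF21), into their plain equivalents.
nfkc_ligature = unicodedata.normalize("NFKC", "\ufb01")
nfkc_fullwidth = unicodedata.normalize("NFKC", "\uff21")

print(eszett_name, plain_name)      # both are "strasse.example"
print(ignorable_name)               # "example.example"
print(nfkc_ligature, nfkc_fullwidth)  # "fi" and "A"
```

In each case the character that was typed cannot be recovered
from what is looked up, which is the sense in which such
characters are "mapped away".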

** Today ** 

Where we stand today is that we basically have two competing
specifications.  While the vast majority of characters in
Unicode, and hence the vast majority of possible labels, are
unaffected, there are several areas in which they are
incompatible in how characters are treated: it is just not
accurate to claim or imply that either is a strict subset or
superset of the other. Whether the last paragraph of Section
1.3.1 of UTS#46 (particularly "...defines a mapping consistent
with the normative requirements of the IDNA2008...") is correct
or not depends on some hair-splitting interpretations of the
IDNA2008 specs and their text.   IDNA2008 is used (and, in
principle, required) by ICANN and, as Peter says, by several
other applications. UTS#46 is used by the web browsers, or at
least those who subscribe to WHATWG, and probably several other
applications.  In the latter case, and probably in the former,
many applications have picked up and used libraries without
fully understanding what those libraries are doing. Neither spec
can be reasonably claimed to be --globally and across the
Internet-- "what works".

** Recommendation ** 

If UTA is concerned, as the charter indicates, about a range of
applications and not just HTTP, it is probably unreasonable to
pick between the two specs.  It is definitely unreasonable to
pick UTS#46 alone.    If you wanted to be safe, you'd need to
pick the intersection of what the two specs allow and then
further remove any code points that are allowed by both but
given different interpretations.  I have no idea how many people
that would make unhappy in practice, but it wouldn't be zero.
If you go for "permits the most characters" (even if that is
true-- I have not done enough analysis lately but because of the
NFKC mapping, it was not a few years back) and pick UTS#46, I
believe you are going to need to decide, code point by code
point, what to do about those characters that are handled
differently by the two specs and then how to handle/define
canonical forms of strings of Emoji characters and/or Emoji
intermixed with, e.g., more traditional letters and numerals.
The latter would be a big job and is not obviously part of the
UTA charter even if picking an IDN standard is (that isn't
obvious either).
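
As a rough illustration of the "intersection, minus code points
with different interpretations" idea above, here is a small
Python sketch.  The two tables are hypothetical toy data built
only from differences mentioned in this message; they are not
taken from either specification, and the function names are
invented.  A real analysis would have to derive the tables from
the IDNA2008 rules and the UTS#46 mapping table.

```python
# Hypothetical toy tables, NOT real spec data. Each maps an input character
# to the string it would be looked up as; a missing key means "not allowed".
TOY_IDNA2008 = {
    "a": "a",
    "\u00df": "\u00df",  # Eszett kept as a distinct character
}
TOY_UTS46 = {
    "a": "a",
    "\u00df": "ss",  # Eszett mapped to "ss", as in IDNA2003
    "\U0001F600": "\U0001F600",  # an emoji, allowed
}


def safe_code_points(spec_a, spec_b):
    """Characters allowed by both tables and interpreted identically."""
    allowed_by_both = spec_a.keys() & spec_b.keys()
    return {ch for ch in allowed_by_both if spec_a[ch] == spec_b[ch]}


def needs_decision(spec_a, spec_b):
    """Characters allowed by both tables but interpreted differently."""
    allowed_by_both = spec_a.keys() & spec_b.keys()
    return {ch for ch in allowed_by_both if spec_a[ch] != spec_b[ch]}


safe = safe_code_points(TOY_IDNA2008, TOY_UTS46)
disputed = needs_decision(TOY_IDNA2008, TOY_UTS46)
print(sorted(safe))      # ['a']
print(sorted(disputed))  # ['ß']
```

The second set is the one that would need the code point by code
point decisions; the emoji entry falls outside both sets because
only one of the toy tables allows it.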

You'd better have a deep understanding of what is allowed in
certificates too.

Alternate suggestion, which is more or less what I tried to
suggest some months ago: tell the truth and then, insofar as
possible, stay out of this.  Specifically, indicate that
IDNA2008 is the IETF Standard but that UTS#46 is widely used
and, in places, incompatible.  Use as much of the above as you
think necessary to point out where the incompatibilities lie,
possibly pointing to relevant specific sections of UTS#46 to
help identify some of the more problematic ones, but don't lose
sight of the fact that, by mapping huge numbers of
"compatibility equivalent" characters into others (hence
preventing their distinct use in the DNS) and by allowing Emoji
sequences that are hard to take seriously as identifiers, UTS#46
can pose its own set of problems.  Then suggest that, if you are
trying to enable use of TLS across a wide range of applications,
both the TLS implementations and those applications should be
aware of those issues, which are otherwise out of scope for the
WG.

That is clearly kicking the can down the road, but, until the
now-closed I18n Directorate comes up with carefully developed
recommendations about how to use IDNs when either standard might
be assumed (or about how to get there) or the ART Area makes
another plan for such recommendations, I don't see choices that
are both realistic and clearly better.

See you at IETF Last Call.

    john

--On Friday, January 27, 2023 17:26 -0800 Rob Sayre
<say...@gmail.com> wrote:

> On Fri, Jan 27, 2023 at 5:16 PM Peter Saint-Andre
> <stpe...@stpeter.im> wrote:
> 
>> > That is what works.
>> 
>> Well, IDNA2008 works for many applications and UTS-46 works
>> for many other applications. I'm not as certain as you are
>> that one of these technologies works and the other does not.
>> Can you produce evidence that, by implication, IDNA2008 does
>> not work? What problems does it not solve?
>> 
> 
> That's the dispute, right? UTS-46 allows more names than
> IDNA2008, so it will be more interoperable, and it is popular.
> 
> If you look at this table, that seems correct:
> 
> https://www.unicode.org/reports/tr46/#Table_IDNA_Comparisons
> 
> I am not a fan of works of fiction in standards, and I think
> UTS-46 is closer to the truth here.
> 
> thanks,
> Rob

_______________________________________________
Uta mailing list
Uta@ietf.org
https://www.ietf.org/mailman/listinfo/uta
