Rob, I am just about burned out on this discussion (or family of discussions) so let me see if I can review the history and explain the problem but with the understanding that I will probably not respond further until and unless a relevant document goes into IETF Last Call. I'm adding Patrik to this because we act as sanity checks on each other but he is very busy. Unless I say something truly stupid or have omitted something important, I don't expect him to respond and you probably shouldn't either. Also copying Vint, who chaired the WG that created IDNA2008, in case he has anything to add.
This was mostly written before your three notes today. I got interrupted for several hours before proofreading and sending it. While one comment from those notes is reflected a bit in my recommendations below, I'm not going to take the additional time to rewrite the message to respond directly to those notes.

** History **

If we go back to around 2006-2007, the main thing that motivated the work that became IDNA2008 was that we had become convinced, after some nasty experiences, that IDNA2003 was not working well and, in particular, that it permitted many code points and constructions in domain names that could, either accidentally or via a bit of malice, cause serious user confusion including strings that would not be presented consistently, etc. _and_ that it was excluding some important code points (by mapping them away) that should not be excluded. There were some ideas in the initial design that were pulled out during the process of developing IDNA2008 that a few of us still regret losing and that would have prevented a range of additional problems.

I don't know whether you consider "names cannot be used reliably" as an interoperability issue. It certainly can be the basis of considerable problems for the user and significant security risks.

The second big problem IDNA2008 was supposed to solve was to move from a table-based system with code points considered for acceptability one at a time (and a likely default of "if Unicode thinks it is ok, so do we") to a rule-based one that would automatically (and correctly) classify the vast majority of Unicode code points including newly-added ones. The latter has not worked out as well as we hoped... but still much better than IDNA2003 would have.

We had also, empirically and sadly, discovered that some registrars were not as knowledgeable and/or careful as they had been expected to be when IDNA2003 was approved.
The assumption had been that they would be on the lookout for proposed names that would be likely to cause problems and push back on them. Instead, for whatever reason, even when such names were identified, they seem to have been seen by some as revenue opportunities. Partially as a consequence, IDNA2008 contained not only more explicit provisions for checking of known problem cases by registries but also provisions for checking a subset of those cases (the ones that were easy and fast to check) by lookup applications.

It is probably also worth mentioning that, in addition to some compromises with (or changes in response to pressure from) the Unicode folks, IDNA2008 also reflects discussions with, and agreement from, the ICANN registry constituency. That agreement included one of the issues to which some of the Unicode people objected most strongly, including the changes in definitions of a very small number of characters between IDNA2003 and IDNA2008. Indeed, one of those changes was advocated by some of those most directly affected.

So, IDNA2008 was approved and published. The Unicode consortium fairly promptly put out UTS#46, announced as a transition strategy document. From my perspective (I don't know if the authors would agree), it addressed the following main issues with more of a "do this instead" rather than a "how to get from here to there" approach, a distinction that appears to me to have become more clear over the years.

(i) Anything that would have been valid under IDNA2003 (whether registered before 2010 or not) remains valid and can be registered.
(ii) Special character interpretations given by IDNA2003 but removed by IDNA2008, notably including the mapping of Eszett (Sharp S, U+00DF) to "ss" and the treatment of Dotless I (U+0131), remained as given in IDNA2003. See Section 1.3.2 of version 15.0.0 of UTS#46.
(iii) IDNA2003 treats several characters as "ignorable", i.e., if they appear in a name being looked up in the DNS, they are treated as not being part of the string at all. UTS#46 preserves that behavior.
(iv) While IDNA2003 included case-folding (IIRC, trying to parallel that in the base DNS specs, which some believe may have been a problem even then for ASCII) and the use of NFKC (which suppresses some characters that registries believed to be important), IDNA2008 does not. More generally, IDNA2003 tends to specify mapping from one code point to another while IDNA2008 tries to avoid mappings in favor of either making the characters that were previously mapped out valid or prohibiting them altogether. As with the two issues above, UTS#46 appears to have been intended to preserve the validity and interpretation of anything that would have been valid under IDNA2003 instead.
(v) The IDNA2003 documents did not clearly specify a plan for moving forward from Unicode 3.2 to later versions of that spec. Recent versions of UTS#46, in what is Section 7 and Table 4 in the current version, effectively provide a Unicode Consortium version of what might reasonably be thought of as IDNA2003bis.
(vi) One of the "deviations" is that, while it isn't as clear about it as I would like, UTS#46 allows emoji in domain names (and UAX#31 and UTS#51 allow them in identifiers more generally). Because they are considered symbols (even by Unicode), IDNA2008 does not. Some DNS TLD registries have chosen to allow them at the second level, so they exist in the wild although it is not clear that those domains are being used for anything that would not have bigger problems if used with TLS. FWIW, other than some "if you allow that, where do you draw the line" questions, the big, serious, problems with Emoji arise when they are used in sequence, with or without combining sequences (IMO, UTS#51 does not cover a number of cases that have appeared in the wild even if mostly in illustrative attacks).
The assumption used in some design thinking (including, I believe, often by WHATWG) is that people will just copy and paste things rather than trying to read them aloud, transcribe them from someone else's reading, or just type them in. Unfortunately, and for a variety of reasons, those things happen... and not even copy-and-paste is 100% reliable when it involves multiple applications.
(vii) Finally, UTS#46 removed all requirements for lookup-time checking. In essence, if a name appears in the DNS, it is assumed to be valid and should be as trusted as any other name that appears in the DNS.

In addition, because of the mapping issues, even if one "follows" UTS#46, it is important to distinguish between what can actually be stored in the DNS (or, equivalently, mapped back to native characters from Punycode stored in the DNS) and what strings are "valid" for various purposes.

** Today **

Where we stand today is that we basically have two competing specifications. While the vast majority of characters in Unicode, and hence the vast majority of possible labels, are unaffected, there are several areas in which they are incompatible in how characters are treated: it is just not accurate to claim or imply that either is a strict subset or superset of the other. Whether the last paragraph of Section 1.3.1 of UTS#46 (particularly "...defines a mapping consistent with the normative requirements of the IDNA2008...") is correct or not depends on some hair-splitting interpretations of the IDNA2008 specs and their text.

IDNA2008 is used (and, in principle, required) by ICANN and, as Peter says, by several other applications. UTS#46 is used by the web browsers, or at least those that subscribe to WHATWG, and probably several other applications. In the latter case, and probably in the former, many applications have picked up and used libraries without fully understanding what those libraries are doing.
Neither spec can be reasonably claimed to be --globally and across the Internet-- "what works".

** Recommendation **

If UTA is concerned, as the charter indicates, about a range of applications and not just HTTP, it is probably unreasonable to pick between the two specs. It is definitely unreasonable to pick UTS#46 alone. If you wanted to be safe, you'd need to pick the intersection of what the two specs allow and then further remove any code points that are allowed by both but given different interpretations. I have no idea how many people that would make unhappy in practice, but it wouldn't be zero.

If you go for "permits the most characters" (even if that is true-- I have not done enough analysis lately but because of the NFKC mapping, it was not a few years back) and pick UTS#46, I believe you are going to need to decide, code point by code point, what to do about those characters that are handled differently by the two specs and then how to handle/define canonical forms of strings of Emoji characters and/or Emoji intermixed with, e.g., more traditional letters and numerals. The latter would be a big job and is not obviously part of the UTA charter even if picking an IDN standard is (that isn't obvious either). You'd better have a deep understanding of what is allowed in certificates too.

Alternate suggestion, which is more or less what I tried to suggest some months ago: tell the truth and then, insofar as possible, stay out of this. Specifically, indicate that IDNA2008 is the IETF Standard but that UTS#46 is widely used and, in places, incompatible.
Use as much of the above as you think necessary to point out where the incompatibilities lie, possibly pointing to relevant specific sections of UTS#46 to help identify some of the more problematic ones, but don't lose sight of the fact that, by mapping huge numbers of "compatibility equivalent" characters into others (hence preventing their distinct use in the DNS) and by allowing Emoji sequences that are hard to take seriously as identifiers, UTS#46 can pose its own set of problems. Then suggest that, if you are trying to enable use of TLS across a wide range of applications, both the TLS implementations and those applications should be aware of those issues, which are otherwise out of scope for the WG.

That is clearly kicking the can down the road, but, until the now-closed I18n Directorate comes up with carefully developed recommendations about how to use IDNs when either standard might be assumed (or about how to get there) or the ART Area makes another plan for such recommendations, I don't see choices that are both realistic and clearly better.

See you at IETF Last Call.
   john

--On Friday, January 27, 2023 17:26 -0800 Rob Sayre <say...@gmail.com> wrote:

> On Fri, Jan 27, 2023 at 5:16 PM Peter Saint-Andre
> <stpe...@stpeter.im> wrote:
>
>> > That is what works.
>>
>> Well, IDNA2008 works for many applications and UTS-46 works
>> for many other applications. I'm not as certain as you are
>> that one of these technologies works and the other does not.
>> Can you produce evidence that, by implication, IDNA2008 does
>> not work? What problems does it not solve?
>>
>
> That's the dispute, right? UTS-46 allows more names than
> IDNA2008, so it will be more interoperable, and it is popular.
>
> If you look at this table, that seems correct:
>
> https://www.unicode.org/reports/tr46/#Table_IDNA_Comparisons
>
> I am not a fan of works of fiction in standards, and I think
> UTS-46 is closer to the truth here.
>
> thanks,
> Rob