Mark wrote:

> The Unicode Standard distinguishes between Unicode Strings (16-bit) and
> UTF-16. In the former, which is often the form used in programming
> languages, a singleton value of 0xD800..0xDFFF is allowed, and is treated
> as if it were a reserved code point.

Ahah! "Unicode Strings (16-bit)" vs "UTF-16". That was the subtlety I was missing, because I'm most used to working with logical code points, or at worst with well-formed UTF-8.

That's why I was surprised when your TestRegex() sample turned up no troubles in the surrogate range. I even modified your code to add an INSPAN enum plus this:

    Failures.INSPAN.checkMatch(i, "a[" + hexPattern + "-" + hexPattern + "]b", target);

because I couldn't see how a pattern such as a[\uD800-\uD800]b could possibly be matched, since it specifies a span of code points in the UTF-16 surrogate range. Yet it can.

My confusion derived from the C1 conformance requirement from TUS 6.0.0:

    C1 A process shall not interpret a high-surrogate code point or a
       low-surrogate code point as an abstract character.

    C2 A process shall not interpret a noncharacter code point as an
       abstract character.

So I was thinking that allowing a surrogate to match /./ was violating that. But that wouldn't make sense, considering that /\p{Cs}/ should be a usable property. Reading further, though, clears this up somewhat:

    D14 Noncharacter: A code point that is permanently reserved for
        internal use and that should never be interchanged.
        Noncharacters consist of the values U+nFFFE and U+nFFFF (where
        n is from 0 to 10₁₆) and the values U+FDD0..U+FDEF.

    D15 Reserved code point: Any code point of the Unicode Standard that
        is reserved for future assignment. Also known as an unassigned
        code point.

        · Surrogate code points and noncharacters are considered
          assigned code points, but not assigned characters.

Also, there's D77's

        · In the Unicode Standard, specific values of some code units
          cannot be used to represent an encoded character in isolation.
          This restriction applies to isolated surrogate code units in
          UTF-16 and to the bytes 80–FF in UTF-8. [...]
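
For anyone who wants to see the effect without the full test harness, here is a minimal stand-alone sketch, assuming a stock java.util.regex. The class and method names are invented for illustration and are not taken from TestRegex(). It checks whether an unpaired U+D800 in a 16-bit string is matched by a one-element range, by /./, and by /\p{Cs}/, and contrasts that with a well-formed surrogate pair.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the TestRegex() code discussed above.
final class LoneSurrogateDemo {
    // A 16-bit string that is not well-formed UTF-16:
    // 'a', an unpaired high surrogate (U+D800), 'b'.
    static final String LONE = "a\uD800b";

    // Well-formed UTF-16: 'a', U+1F600 encoded as a surrogate pair, 'b'.
    static final String PAIRED = "a\uD83D\uDE00b";

    // The doubled backslash hands the escape to the regex engine
    // rather than to the Java compiler.
    static boolean loneMatchesRange() {
        return Pattern.matches("a[\\uD800-\\uD800]b", LONE);
    }

    static boolean loneMatchesDot() {
        return Pattern.matches("a.b", LONE);
    }

    static boolean loneMatchesCs() {
        return Pattern.matches("a\\p{Cs}b", LONE);
    }

    // A leading surrogate followed by a trailing one should count
    // as a single code point, so one dot should be enough...
    static boolean pairMatchesOneDot() {
        return Pattern.matches("a.b", PAIRED);
    }

    // ...and two dots should be one too many.
    static boolean pairMatchesTwoDots() {
        return Pattern.matches("a..b", PAIRED);
    }

    public static void main(String[] args) {
        System.out.println("lone, range : " + loneMatchesRange());
        System.out.println("lone, dot   : " + loneMatchesDot());
        System.out.println("lone, Cs    : " + loneMatchesCs());
        System.out.println("pair, a.b   : " + pairMatchesOneDot());
        System.out.println("pair, a..b  : " + pairMatchesTwoDots());
    }
}
```

If the engine treats an isolated surrogate as an ordinary (if unassigned-looking) code point and a pair as one code point, I would expect the first four lines to report true and the last to report false.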

People often think of these as "illegal" code points, or as "not a character", but I now see, upon a close reading of The Unicode Standard, that these reserved code points can occur in data. After all, if you have to be able to build up a buffer one UTF-16 code unit at a time, unpaired surrogates have to be able to exist, at least temporarily.

As far as I understand it, reserved characters should not occur in data used for interchange, but may occur within an application. What the various encoders do with these is not always clear or consistent, although I suspect this is more a library matter than an issue with the Standard itself.

I therefore withdraw my doubts regarding java.util.regex meeting tr18-13's RL1.7 requirement:

    RL1.7 Supplementary Code Points

    To meet this requirement, an implementation shall handle the full
    range of Unicode code points, including values from U+FFFF to
    U+10FFFF. In particular, where UTF-16 is used, a sequence consisting
    of a leading surrogate followed by a trailing surrogate shall be
    handled as a single code point in matching.

I do see that way back in tr18-6 of 2002-04-21, the language was clearer:

    While surrogate pairs could be used to identify code points above
    FFFF₁₆, that mechanism is clumsy. It is much more useful to provide
    specific syntax for specifying Unicode code points [...]

It's a pity some of that earlier language wasn't retained, either for RL1.7 or, more likely, for RL1.1. It might have made the intent of RL1.1 more obvious to all readers.

--tom