On Monday, 24 January 2011 at 14:39:59 +0900, Masayoshi Okutsu <masayoshi.oku...@oracle.com> wrote:
>>> Are you talking about unpaired surrogates or something else?

>> Yes, I am talking about unpaired surrogates.

> I believe each code unit of UTF-16 gets converted to its code point. So,
> an unpaired surrogate gets converted to a surrogate code point. So, it's
> still processed based on code points?

Apparently so.

I misunderstood what constituted proper handling of unpaired surrogates
within the regex engine. That's because I made incorrect inferences when
reading this from section 3.2, Conformance Requirements:

    http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf

    C1  A process shall not interpret a high-surrogate code point or a
        low-surrogate code point as an abstract character.

I thought C1 meant to forbid matching an unpaired surrogate with, say,
the "." metacharacter, because the "." metacharacter means[*] an abstract
character, by which I understood it to mean a single code point.

I had not realized that, as reserved code points, unpaired surrogates
could still be matched. I had thought of them as non-characters, not as
abstract characters.

That said, I'm still trying to reconcile C1 with all this. I think.

--tom

[*] Well, in this interpretation. There are other interpretations in
    which dot would match something else. For example, under tr18's
    2.2.1 Grapheme Cluster Mode, dot "behaves like \X; that is, matches
    a full extended grapheme cluster going forward." From:

    http://www.unicode.org/reports/tr18/#Default_Grapheme_Clusters

    In Perl5, you have to use \X to get \X. :)

    However, in Perl6's grapheme mode, dot matches a language-independent
    grapheme. That's the third level up, just short of matching
    language-dependent notions of "characters".
    http://perlcabal.org/syn/S05.html#Modifiers

    New modifiers specify Unicode level:

        m:bytes  / .**2 /   # match two bytes
        m:codes  / .**2 /   # match two codepoints
        m:graphs / .**2 /   # match two language-independent graphemes
        m:chars  / .**2 /   # match two characters at current max level

    There are corresponding pragmas to default to these levels. Note
    that the :chars modifier is always redundant because dot always
    matches characters at the highest level allowed in scope. This
    highest level may be identical to one of the other three levels, or
    it may be more specific than :graphs when a particular language's
    character rules are in use. Note that you may not specify
    language-dependent character processing without specifying which
    language you're depending on. [Conjecture: the :chars modifier could
    take an argument specifying which language's rules to use for this
    match.]
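
For anyone who wants to check the code-point behaviour discussed above
on the Java side, here is a minimal sketch. It is not taken from the
thread: the class name DotLevels and the helper countMatches are made up
for illustration, and the \X case assumes Java 9 or later, where
java.util.regex.Pattern accepts \X for an extended grapheme cluster. It
counts how many times "." and "\X" match against a valid surrogate pair,
an unpaired high surrogate followed by an ordinary letter, and a base
letter plus a combining mark.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotLevels {
    // Hypothetical helper: count successive matches of regex in s.
    static int countMatches(String regex, String s) {
        Matcher m = Pattern.compile(regex).matcher(s);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String pair = "\uD835\uDD04";   // U+1D504 as a valid surrogate pair
        String loneHigh = "\uD835x";    // unpaired high surrogate, then 'x'
        String combining = "e\u0301";   // 'e' + COMBINING ACUTE ACCENT

        // Two UTF-16 code units, but dot consumes them as one code point.
        System.out.println(pair.length());                  // 2
        System.out.println(countMatches(".", pair));        // 1

        // The unpaired surrogate is matched as a code point of its own.
        System.out.println(countMatches(".", loneHigh));    // 2

        // Dot works per code point; \X (Java 9+) works per grapheme cluster.
        System.out.println(countMatches(".", combining));   // 2
        System.out.println(countMatches("\\X", combining)); // 1
    }
}
```

Loosely speaking, "." here plays the role of Perl6's :codes level and
"\X" that of :graphs; Java has no switch that changes what dot itself
means in this respect.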