Sherman wrote:

> The difference is at
>
>   test("\uD800\uDF3C", "^[\\x{D800}\\x{DF3C}]+$");
>   test("\uD800\uDF3C", "^[\\x{DF3C}\\x{D800}]+$");
>
> You can have unpaired surrogate in Java String, but
> if you have a paired one you can't say I want them to
> be two separated "unpaired" surrogates.

I was being wicked. :)

I knew that would happen: I was taking (unfair?) advantage of Java's
regexes' smarts compared to its strings' lameness.

As I know you know, it's all because although Java *regexes* correctly
deal in true logical Unicode code points qua characters (by virtue of
copying all the code points for the pattern into an int array as the
first thing it does, effectively making regexes UTF-32ish in
character), Java's native *strings* are forever stuck with all the
unfortunate restrictions inherent to serialized UTF-16.

It has always struck me as a terribly unfortunate consequence of the
UCS-2 => UTF-16 hack that Java should make the even more unfortunate
programmer think constantly of annoying serialization issues instead
of logical code points.

This decision has led to many unfortunate paradoxes, including these
two:

 * A Java "char"/"Character" data type cannot hold an arbitrary
   Unicode character. More simply put, a Java "Character" cannot
   hold a Unicode "character" -- because Java does not use Unicode
   as its native character set: it uses UTF-16.

 * Given strings A and B, and a LENGTH function returning the number
   of code points in its string argument, neither of these
   fundamental logical guarantees can be made:

       LENGTH(A + B) == LENGTH(A) + LENGTH(B)
       LENGTH(A + B) == LENGTH(B + A)

I don't know which of those two paradoxes bothers me more; both make
my head spin and eyes water. They are... *unfortunate*.

I dearly, desperately wish Java strings were logical sequences of code
points instead of UTF-16 of all awful things! If only that had been
nipped in the bud. If only if only if only.

I also know that that longing shall remain forever unrequited. That
doesn't stop me from wishing it were otherwise. Unfortunately.
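
For anyone who wants to watch those two LENGTH guarantees fail, here
is a minimal sketch in Java. It assumes LENGTH can be modelled by
String.codePointCount, which counts an unpaired surrogate as one code
point. The class name CodePointParadox and the helper length() are
made up for illustration; they are not from Sherman's test harness.

```java
// Illustrative sketch only: the class and method names are hypothetical.
class CodePointParadox {

    // Models LENGTH from the text: the number of code points in s.
    // codePointCount treats an unpaired surrogate as one code point.
    static int length(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String a = "\uD800"; // a lone high (lead) surrogate
        String b = "\uDF3C"; // a lone low (trail) surrogate

        // Each string alone holds one (unpaired) code point.
        System.out.println("LENGTH(A)     = " + length(a));
        System.out.println("LENGTH(B)     = " + length(b));

        // High followed by low fuses into the single code point U+1033C,
        // so LENGTH(A + B) != LENGTH(A) + LENGTH(B).
        System.out.println("LENGTH(A + B) = " + length(a + b));

        // Low followed by high does not form a pair,
        // so LENGTH(A + B) != LENGTH(B + A).
        System.out.println("LENGTH(B + A) = " + length(b + a));

        // The first paradox: one char is not enough for U+1033C.
        System.out.println("chars needed for U+1033C = "
                + Character.charCount(0x1033C));
    }
}
```

With those inputs the concatenation A + B has fewer code points than A
and B counted separately, while B + A keeps both surrogates unpaired.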
:(

Thank you, Sherman, for all your hard work!

--tom