Sherman, referring to Java's ASCII-only senses of \w and \s, and of \p{alpha} and \p{space}, wrote:
> (does Perl 5 work in this way as well?)

No, not for a very, very long time. For most of Perl's life, charclass escapes like \w have been Unicode-aware. However, it did take us some time to separate out the POSIX names from the Unicode properties. As I've mentioned, we eventually solved this by prepending "POSIX" to the name of the property. So \p{POSIX_Alpha} gets you exactly what POSIX says -- which in Perl is locale-aware; I don't believe this is true of Java, though -- whereas \p{Alpha} gets you \p{Alphabetic}, as RL1.2a requires under both recommendations.

As for \w and such, Perl defines \w, \d, \s, and \b -- and their uppercase complements -- to work exactly as the definitions in tr18's Annex C say they should under RL1.2a's Standard Recommendation. For \d, there is some flexibility, in that the POSIX Compatible version allows it to match only [0-9] instead of all of \p{Decimal_Number}. To meet the requirements of RL1.2a, one must state whether one is using the Standard Recommendation or the POSIX Compatible version.

Only for \d does Java use either of the allowable senses. The others all choose their own definitions, which are out of compliance with RL1.2a. (And Java does not support Annex C's \X at all. I know that that one is on your own personal wish-list, Sherman.)

All of this is what first motivated me to write a drop-in replacement that preprocesses Pattern strings to allow them to work properly (read: per RL1.2a and others) on Unicode strings. And it was because of that code that you first became known to me, and vice versa. So I think I should discuss it a bit. That's what parts 2 and 3 will be about.

> This is by design and I don't agree "this is a mess" conclusion.

Sherman, you're right: the mere fact that things like \w and \s, or \p{alpha} and \p{space}, do not meet the requirements of RL1.2a does not justify concluding that "this is a mess." That would be grossly overstating matters. Alone, it is simply non-conformant, not a mess.
What I meant by a mess was the mismatch between \w and \b. It is this mismatch that makes possible nonsense results like the one I wrote about here:

    One fundamental bug is that Java has misunderstood the connection between \b and \w regexes, so that now a string like "élève" is not matched by the pattern "\b\w+\b" at any point in the string.

It turns out that because of this, Java is out of compliance with any of the permissible senses of \b and \w given in tr18. I will demonstrate that in part 3 of this letter, as well as provide code demonstrating a remedy.

> While there are developers over there might like these properties to
> evolve to be the Unicode properties, I am pretty much sure there might
> be the same amount of developers there would prefer these properties
> be kept as the "original" POSIX properties.

My experience suggests that you are indeed correct that there are many developers who want one thing and also many who want the other. We faced this very thing in Perl, and you wouldn't believe how many messages and threads the issue spawned. There are passionate views on both sides of this issue.

The flaw in providing only the ASCII-only definitions as primitives is that one cannot derive the full Unicode definitions from them, whereas if one had the full Unicode definitions available as primitives, one could trivially derive the ASCII-only definitions. It's a matter of the choice of primitives. Choosing ASCII-only as the bare primitive locks one into the 7-bit past in what is even now very much a 21-bit world, and shall be even more so in future.

You are sacrificing Unicode by choosing ASCII as the primitive. But if you chose Unicode as the primitive, you would *not* be sacrificing ASCII.

That makes it an unequal tradeoff between the two sets of developers. Favoring ASCII blocks Unicode, but favoring Unicode does not block ASCII. That isn't really fair. I believe the only just thing is to provide both. That's the only way to make everyone happy.
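
To make the point about primitives concrete, here is a minimal sketch in Java. It assumes \p{L} (general category Letter) as a stand-in for a Unicode-aware primitive; the class name PrimitiveChoice and its helper methods are made up for illustration and are not part of my rewrite code. Intersecting the Unicode class with \p{ASCII} yields the ASCII-only sense, whereas nothing comparable can be done in the other direction starting from [A-Za-z].

```java
import java.util.regex.Pattern;

public class PrimitiveChoice {
    // Unicode-aware primitive: any letter, in any script.
    static final Pattern UNICODE_LETTER = Pattern.compile("\\p{L}");

    // ASCII-only sense, derived from the Unicode primitive by
    // intersecting it with the ASCII range. Equivalent to [A-Za-z].
    static final Pattern ASCII_LETTER = Pattern.compile("[\\p{L}&&\\p{ASCII}]");

    // True if s is exactly one letter in the Unicode sense.
    static boolean isUnicodeLetter(String s) {
        return UNICODE_LETTER.matcher(s).matches();
    }

    // True if s is exactly one letter in the derived ASCII-only sense.
    static boolean isAsciiLetter(String s) {
        return ASCII_LETTER.matcher(s).matches();
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // LATIN SMALL LETTER E WITH ACUTE
        System.out.println(isUnicodeLetter(eAcute)); // true
        System.out.println(isAsciiLetter(eAcute));   // false
        System.out.println(isAsciiLetter("e"));      // true
    }
}
```

The same intersection trick would work for any other Unicode-aware class, which is why starting from the Unicode definitions costs the ASCII camp nothing.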
That's what we finally arrived at in Perl, at least, and it has largely worked out. There are still grumblers about "reasonable" defaults, but no one is locked out, as they currently are in Java.

However, I understand that there are two separate issues here:

 * One is that you have used Unicode property names to mean something other than what the spec says they should mean.

 * The other is that the charclass aliases are either ASCII-only (\w \s \d) or broken (\b \B).

I have several different ideas about how to fix these in a backwards-compatible fashion, ideas I can discuss later in a separate letter.

Meanwhile, parts 2 and 3 of the current letter will discuss what my rewrite code does and how it satisfies almost all of the unmet requirements for Level 1 compliance, plus several others.

--tom