Converting tr18 "strong recommendations" into RFEs

2011-01-25 Thread Tom Christiansen
Sherman, Since you're looking through my messages for potential RFEs, I thought I would point out a pair of low-hanging fruit for you. tr18 contains two distinct strong recommendations, both of which should be quite easy to convert into RFEs. As recommendations, even strong ones, they are of course

regex rewriting code (part 1 of 3)

2011-01-25 Thread Tom Christiansen
Sherman, referring to Java's ASCII-only senses of \w and \s, and of \p{alpha} and \p{space}, wrote: > (does Perl 5 work in this way as well?) No, not for a very, very long time. For most of Perl's life, charclass escapes like \w have always been Unicode aware. However, it did take us some time

regex rewriting code (part 2 of 3)

2011-01-25 Thread Tom Christiansen
When I set about resolving the Unicode troubles in Java regular expressions through rewriting them into something Java understood, I found it convenient to divide that functionality into two different rewriting functions, one to handle string escapes like \u and the other to handle charclas

regex rewriting code (part 3 of 3)

2011-01-25 Thread Tom Christiansen
Now I will discuss the more interesting of my two functions, the one that handles charclass escapes such as those given in RL1.2a. The particular Level 1 place where this code is relevant is RL1.2a's Annex C Compatibility Properties, RL1.4 Simple Word Boundaries, and RL1.6 Line Boundaries. I do

Now what?

2011-01-25 Thread Tom Christiansen
That concludes my discoveries, analysis, and remediations related to j.u.r.Pattern's conformance with tr18's Level 1 requirements. I would be interested in guidance toward how I can best help you now that all that is done. Would you all like some time to absorb and digest this set of writings fr

Re: regex rewriting code (part 1 of 3)

2011-01-25 Thread Xueming Shen
Tom, The fact that these POSIX/ASCII only version properties/constructs have been there for years ("compatibility") and it appears that "most" developers are happy (habit, performance...) with them, I don't think we can and want to switch to the Unicode version, simply for conformance. Java ta

Re: regex rewriting code (part 1 of 3)

2011-01-25 Thread Tom Christiansen
> The fact that these POSIX/ASCII only version properties/constructs > have been there for years ("compatibility") and it appears that "most" > developers are happy (habit, performance...) with them, I don't think > we can and want to switch to the Unicode version, simply for > conformance. I agr

Re: Now what?

2011-01-25 Thread Xueming Shen
Tom, Yes, I would need some time to digest all the technical details, though I believe I've had a good understanding of most issues you raised. Sure, I will keep you updated for the related RFEs I will submit based on your research. The CR# so far I have are 7014645: Support Perl style Uni

Re: Now what?

2011-01-25 Thread Martijn Verburg
Hi all, Just like to say that this is why I got involved in the combination of Java and open source - it really does lift the spirits to see this sort of discourse, even if a majority of the technical details fly over my head (all of you are plain scary ;p)! I'm speaking at a number of conference

Re: RL1.1 Hex Notation

2011-01-25 Thread Mark Davis ☕
The goal of the clause is to have a mechanism for using hex values for character literals. That is, you should be able to take a code point from 0 to 10FFFF, get a hex value for that, embed it in some syntax, and concatenate it into a pattern, and have it work as a literal. For example: String pa
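
The mechanism described in this message (code point to hex, hex into an escape, escape concatenated into a pattern) can be sketched roughly as follows. This is only an illustration, not the example from the original message, which is cut off in this listing. It assumes Java 7 or later, where java.util.regex.Pattern accepts the \x{h...h} notation; the class name HexLiteral and the helpers hexEscape and matchesItself are hypothetical names introduced here.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the JDK or of the original thread.
final class HexLiteral {

    // Builds a pattern fragment that denotes exactly one code point.
    // Assumes the \x{h...h} escape (Java 7+).
    static String hexEscape(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a code point: " + codePoint);
        }
        return "\\x{" + Integer.toHexString(codePoint) + "}";
    }

    // Concatenates the escape into a character class and checks that the
    // resulting pattern matches the code point as a literal.
    static boolean matchesItself(int codePoint) {
        String text = new String(Character.toChars(codePoint));
        Pattern p = Pattern.compile("[" + hexEscape(codePoint) + "]");
        return p.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesItself(0x41));     // a BMP code point
        System.out.println(matchesItself(0x12345));  // a supplementary code point
    }
}
```

The point of the single-escape form is that the caller never has to split a supplementary code point into surrogates before building the pattern.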

Re: RL1.1 Hex Notation (part 2 of 3)

2011-01-25 Thread Mark Davis ☕
The Unicode Standard distinguishes between Unicode Strings (16-bit) and UTF-16. In the former, which is often the form used in programming languages, a singleton value of 0xD800..0xDFFF is allowed, and is treated as if it were a reserved code point. So you do get some funny cases, because 1. 0
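
The distinction drawn here can be observed directly in Java, whose String is a sequence of 16-bit code units rather than guaranteed well-formed UTF-16. The following is a minimal sketch, not code from the thread; the class name SurrogateDemo and the helper isWellFormedUtf16 are hypothetical.

```java
// Hypothetical demo class; not part of the JDK or of the original thread.
final class SurrogateDemo {

    // Returns true only if every surrogate in s is part of a
    // high-then-low pair, i.e. the string is well-formed UTF-16.
    static boolean isWellFormedUtf16(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false; // unpaired high surrogate
                }
                i++; // skip the low half of the pair
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate with no preceding high surrogate
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String lone = String.valueOf((char) 0xD800);           // a singleton surrogate
        String pair = new String(Character.toChars(0x10000));  // a proper surrogate pair

        // The lone surrogate is a legal String and reads back as code point D800.
        System.out.println(Integer.toHexString(lone.codePointAt(0)));
        System.out.println(isWellFormedUtf16(lone));

        // The pair is two code units but one code point.
        System.out.println(pair.length());
        System.out.println(Integer.toHexString(pair.codePointAt(0)));
        System.out.println(isWellFormedUtf16(pair));
    }
}
```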

UTS18 clarifications

2011-01-25 Thread Mark Davis ☕
There's been a long and interesting discussion on this list. We are coming up to a quarterly Unicode Technical Committee meeting (starting Feb 7), so there is the opportunity to make requests / proposals about UTS18. In particular, if there are areas of the spec that are unclear or features that p

Re: RL1.1 Hex Notation

2011-01-25 Thread Xueming Shen
Hi Mark, I guess you are asking for something like? char[] cc = Character.toChars(0x12345); Matcher m = Pattern.compile("[" + "\\u" + HEX(cc[0]) + "\\u" + HEX(cc[1]) + "