Sherman wrote:

> At the end, Java RegEx is NOT a Unicode RegEx, while it
> supports Unicode RegEx at certain level, sometime via different
> syntax, I don't feel this is a big problem for most Java
> developers and should not be a stopper for most program.

I do not understand what you mean when you say that Java regexes aren't Unicode regexes. Are you referring to the various syntactic features of UTS#18, Unicode Regular Expressions? If so, it's my understanding that many of those are examples only, especially when it comes to how something actually looks. I fully agree with you that Java indeed offers some of the functionality described there in other ways than given by those particular examples, and that quite often this doesn't make enough practical difference to be a show-stopper. I discuss this further later on in this message.

Another possible interpretation of

> Java RegEx is NOT a Unicode RegEx, while it
> supports Unicode RegEx at certain level,

is that the standard Java regex class does not provide the baseline Level 1 Unicode support spelled out in UTS#18. If that is what you mean, then I'm afraid you are again correct. However, I would very much like to see this fixed. That's because Level 1 support is the absolute minimum level required for useful Unicode support. To quote from UTS#18:

    Level 1 is the minimally useful level of support for Unicode.
    All regex implementations dealing with Unicode should be at
    least at Level 1.

I believe it is *extremely important* that Java provide useful Unicode support. In my text-mining group at the university here, we process megabytes and sometimes gigabytes of UTF-8 text with Java. And we use regexes. For us it is a *very* big problem that Java does not provide even the minimally required Level 1 support, because there is only so much you can do to work around this; that's why they call Level 1 "minimally useful".

Because Java's native character set is and always has been Unicode, I feel it is reasonable to hope that Java should provide the minimally useful level of support for Unicode. The exponential(*) growth in the proportion of Unicode text data over the last decade means that Java is suddenly not well-suited to handle this data. This is a real shame.

(*) I use here the term "exponential growth" purely in its mathematically strict sense, not in the more commonly heard popular sense of merely growing faster than expected.

There is no question that Java is the premier platform of choice for millions of people doing real work. Because of the shocking growth rate of Unicode, it has "suddenly" come time that the whole Java infrastructure fully support basic Unicode, just as much as it does ASCII. That's what the future is, and the future is now.

It is not enough to say "Oh well, use another language then," because that is not a viable option for programming shops that are fully committed to Java as a programming platform. That's why it needs to be there.

In later messages I'll discuss the particulars of precisely where Java already does manage Level 1 Unicode support, where it is missing out, and what needs to be done to bring it into not merely compliance but also usefulness--which I actually hold to be of greater importance.

Sherman, please understand that I am not asking you to do all this work!! That would be not just impolite but also impractical and possibly even impossible. I do recognize that it is too much for just one person. I do not ask that. I just want to detail where the holes are.

I very much hope you will take no offence at this; I assure you that absolutely none is intended!

--tom