RE: single non-BMP character counted as two characters

2008-08-06 Thread Michael Glavassevich
ael Glavassevich [mailto:[EMAIL PROTECTED] > Sent: Tuesday, August 05, 2008 6:00 PM > To: j-users@xerces.apache.org > Subject: Re: single non-BMP character counted as two characters > > > > Hi Taki, > > It's a long standing bug/limitation. Xerces uses String.length(

RE: single non-BMP character counted as two characters

2008-08-05 Thread Taki Kamiya
ch is probably partly because the use of length is not very common in practice. Thanks! -taki From: Michael Glavassevich [mailto:[EMAIL PROTECTED] Sent: Tuesday, August 05, 2008 6:00 PM To: j-users@xerces.apache.org Subject: Re: single non-BMP character counted a

Re: single non-BMP character counted as two characters

2008-08-05 Thread Nathan Beyer
That's essentially what's happening in the Harmony code base. The code essentially delegates to Character.codePointCount(CharSequence,int,int), which loops over the chars looking for high surrogates. This could certainly be optimized though. -Nathan On Tue, Aug 5, 2008 at 8:44 PM, Michael Glavass
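
A minimal sketch of the kind of loop described above, counting code points by scanning for surrogate pairs. This is not the Harmony source; the class and method names are hypothetical, and it assumes Java 5 or later for Character.isHighSurrogate and Character.isLowSurrogate.

```java
public class SurrogateCount {
    // Hypothetical helper, not taken from Harmony or Xerces.
    // Starts from the char count and subtracts one for every well-formed
    // surrogate pair, so the whole string is scanned exactly once.
    static int countCodePoints(CharSequence seq) {
        int n = seq.length();
        int count = n;
        for (int i = 0; i < n - 1; i++) {
            if (Character.isHighSurrogate(seq.charAt(i))
                    && Character.isLowSurrogate(seq.charAt(i + 1))) {
                count--; // the pair encodes a single code point
                i++;     // skip the low surrogate
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // 'a', U+2000B (encoded as a surrogate pair), 'b'
        String s = "a\uD840\uDC0Bb";
        System.out.println(countCodePoints(s));
        System.out.println(Character.codePointCount(s, 0, s.length()));
    }
}
```

For input with no surrogates the loop reduces to one range check per char. An unpaired surrogate is counted as one code point here, which matches the documented behavior of Character.codePointCount.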

Re: single non-BMP character counted as two characters

2008-08-05 Thread Michael Glavassevich
Hi Nathan, Is the implementation of that method any better than iterating over the string and counting the number of code points? I think the last time I noticed this bug in the code I resisted fixing it because of the negative performance impact on the majority of input which only contains charac

Re: single non-BMP character counted as two characters

2008-08-05 Thread Nathan Beyer
This might be an additional impetus to move the code base for future development to Java 5 libraries, so things like String.codePointCount can be used. -Nathan On Tue, Aug 5, 2008 at 7:59 PM, Michael Glavassevich <[EMAIL PROTECTED]> wrote: > Hi Taki, > > It's a long standing bug/limitation. Xerce

Re: single non-BMP character counted as two characters

2008-08-05 Thread Michael Glavassevich
Hi Taki, It's a long standing bug/limitation. Xerces uses String.length() (which returns the length of the string in chars rather than Unicode code points) for checking the length facet. Thanks. Michael Glavassevich XML Parser Development IBM Toronto Lab E-mail: [EMAIL PROTECTED] E-mail: [EMAIL
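
To illustrate the difference between the two measures, here is a small sketch using the character from the original report (U+2000B). The class and helper names are hypothetical and this is not Xerces code; it assumes Java 5 or later, where String.codePointCount is available.

```java
public class LengthFacetDemo {
    // Hypothetical helper: length measured in Unicode code points
    // rather than in UTF-16 chars.
    static int lengthInCodePoints(String value) {
        return value.codePointCount(0, value.length());
    }

    public static void main(String[] args) {
        // U+2000B lies outside the BMP, so a Java String stores it
        // as the surrogate pair D840 DC0B.
        String value = "\uD840\uDC0B";
        System.out.println(value.length());            // prints 2 (UTF-16 chars)
        System.out.println(lengthInCodePoints(value)); // prints 1 (code points)
    }
}
```

A length facet of 1 is satisfied under the code point count but not under String.length(), which is consistent with the cvc-length-valid error reported in this thread.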

single non-BMP character counted as two characters

2008-08-05 Thread Taki Kamiya
Hi, The following schema, which is supposedly valid, results in this error: cvc-length-valid: Value '𠀋' with length = '2' is not facet-valid with respect to length '1' for type '#AnonType_act'. The default value "𠀋" for attribute "a" is a single non-BMP character. It is as though a surroga