Re: Java encoder errors

2011-09-20 Thread Xueming Shen
On 09/19/2011 03:26 PM, Tom Christiansen wrote: Mark Davis ☕ wrote on Mon, 19 Sep 2011 14:41:49 PDT: I agree with the first part, disallowing the irregular code sequences. Finding that Java allowed surrogates to sneak through in their UTF-8 streams like that was quite odd. It's said "b

Re: Java encoder errors

2011-09-19 Thread Mark Davis ☕
They are really "super private use" characters, available for definition within a given implementation or domain. For example, in CLDR collation tables: The code point U+ is tailored to have a weight higher than all other characters. This allows reliable specification of a range, such as “Sch

Re: Java encoder errors

2011-09-19 Thread Tom Christiansen
Mark Davis ☕ wrote on Mon, 19 Sep 2011 14:41:49 PDT: > I agree with the first part, disallowing the irregular code sequences. Finding that Java allowed surrogates to sneak through in their UTF-8 streams like that was quite odd. > As to the noncharacters, it would be a horrible mistake to di

Re: Java encoder errors

2011-09-19 Thread Mark Davis ☕
I agree with the first part, disallowing the irregular code sequences. As to the noncharacters, it would be a horrible mistake to disallow them. Tom, a Java code converter is far too low a level for C9; if the converter can't handle them, it screws up all perfectly legitimate *internal* interchang

Re: Java encoder errors

2011-09-19 Thread Xueming Shen
Tom, very good timing :-) I'm back on my encoding-related bugs, just fixing some corner cases in the new UTF-8 implementation we putback for JDK7. The surrogates part is a known issue. The Unicode Standard can simply change its "terms" [1] and announce "the irregular code unit sequence is no lo

Java encoder errors

2011-09-19 Thread Tom Christiansen
Does anybody know anything about the Java UTF-8 encoder? It seems to be broken in a couple (actually, three) of ways. * First, it allows for intermixed CESU-8 and UTF-8 even though you specify UTF-8, when it should be throwing an exception on the CESU-8. It also allows unpaired surrog
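
For anyone who wants to reproduce the first complaint, here is a minimal sketch of a strictness check. It assumes a JDK with java.nio.charset.StandardCharsets (Java 7 or later); the class name StrictUtf8Check and the helpers decodesStrictly and encodesStrictly are hypothetical names made up for this illustration. The sample code point U+10400 is an arbitrary supplementary character, chosen only because it needs a surrogate pair in UTF-16. Results depend on the runtime: the JDKs discussed in this thread may accept input that a current JDK reports as malformed.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8Check {

    // Hypothetical helper: true only if the bytes decode as UTF-8 with
    // malformed input reported as an error instead of silently replaced.
    static boolean decodesStrictly(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Hypothetical helper: true only if the string encodes to UTF-8 with
    // errors (such as an unpaired surrogate) reported instead of replaced.
    static boolean encodesStrictly(String s) {
        try {
            StandardCharsets.UTF_8.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // U+10400 in well-formed UTF-8: one four-byte sequence.
        byte[] utf8 = {(byte) 0xF0, (byte) 0x90, (byte) 0x90, (byte) 0x80};

        // The same code point in CESU-8: the surrogates U+D801 and U+DC00,
        // each encoded as its own three-byte sequence.
        byte[] cesu8 = {(byte) 0xED, (byte) 0xA0, (byte) 0x81,
                        (byte) 0xED, (byte) 0xB0, (byte) 0x80};

        // A string holding a single unpaired high surrogate.
        String loneSurrogate = "\uD800";

        System.out.println("UTF-8 bytes decode strictly:   " + decodesStrictly(utf8));
        System.out.println("CESU-8 bytes decode strictly:  " + decodesStrictly(cesu8));
        System.out.println("Lone surrogate encodes strictly: " + encodesStrictly(loneSurrogate));
    }
}
```

The check uses CodingErrorAction.REPORT on purpose: convenience calls such as new String(bytes, charset) and String.getBytes(charset) substitute a replacement for bad input rather than throwing, so they cannot show whether the converter considers the input malformed.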