I agree with the first part, disallowing the irregular code sequences. As to the noncharacters, it would be a horrible mistake to disallow them.
Tom, a Java code converter is far too low a level for C9; if the converter can't handle noncharacters, it breaks perfectly legitimate *internal* interchange. C9 really applies at a much higher level, e.g. don't put them into interchanged plain text such as a web page. I agree that it needs more clarification.

Mark

*Il meglio è l’inimico del bene*
[https://plus.google.com/114199149796022210033]

On Mon, Sep 19, 2011 at 14:35, Xueming Shen <xueming.s...@oracle.com> wrote:

> Tom,
>
> Very good timing :-) I'm back on my encoding-related bugs, just fixing some corner cases in the new UTF-8 implementation we put back for JDK7.
>
> The surrogates part is a known issue. The Unicode Standard can simply change its "terms" [1] and announce "the irregular code unit sequence is no longer needed", go use CESU-8 if you have to deal with it. It's not that easy for a platform/implementation that takes compatibility very seriously, such as Java, to simply break compatibility to follow. The current implementation still accepts the surrogates but never generates them. I'm not that firm on this; if everybody agrees that, after so many years, compatibility for irregular UTF-8 byte sequences is no longer a concern, we definitely can follow the "conformance request", especially as it appears that from 4.0 on the Unicode Standard clearly declares sequences mapped to surrogates "ill-formed". We just need more voices on this issue.
>
> I'm not sure, however, regarding the "forbidden noncharacters". ch03/D92 appears to be fine with (it does not explicitly forbid) conversion between different Unicode encoding forms for these noncharacters. Personally I don't see any benefit in not allowing it. I think we have lots of Unicode experts :-) on this mailing list; what's the "official word" on this issue? But again, I doubt the Java UTF-8 can then simply drop these code points from the UTF16<->UTF8 conversion.
>
> -Sherman
>
> [1] http://www.unicode.org/reports/tr28/tr28-3.html#3_1_conformance
>
>
> On 09/19/2011 11:45 AM, Tom Christiansen wrote:
>
>> Does anybody know anything about the Java UTF-8 encoder? It seems to be broken in a couple (actually, three) of ways.
>>
>>  * First, it allows for intermixed CESU-8 and UTF-8 even though you specify UTF-8, when it should be throwing an exception on the CESU-8. It also allows unpaired surrogates, which are also forbidden by the standard.
>>
>>  * Second, it allows in the 66 noncharacter code points that the Unicode Standard says "shall not" be used.
>>
>> The charset encoders and decoders tend to be a bit finicky on whether they throw proper exceptions or not, so I'll show you exactly what I'm using:
>>
>>     import java.io.*;
>>     import java.nio.charset.Charset;
>>
>>     public class utf8test {
>>         public static void main(String[] argv) throws IOException {
>>             // A bare CharsetDecoder reports malformed input (it throws)
>>             // rather than substituting U+FFFD.
>>             BufferedReader stdin = new BufferedReader(
>>                 new InputStreamReader(System.in, Charset.forName("UTF-8").newDecoder()));
>>             PrintWriter stdout = new PrintWriter(
>>                 new OutputStreamWriter(System.out, Charset.forName("UTF-8").newEncoder()), true);
>>             String line;
>>             while ((line = stdin.readLine()) != null) {
>>                 stdout.println(line);
>>                 // line.length() counts UTF-16 code units, not code points,
>>                 // so advance by the number of units each code point occupies.
>>                 for (int i = 0; i < line.length(); ) {
>>                     int cp = line.codePointAt(i);
>>                     if (cp < 32 || cp > 126) {
>>                         stdout.printf("\\x{%05X}", cp);
>>                     } else {
>>                         stdout.printf("%c", cp);
>>                     }
>>                     i += Character.charCount(cp);
>>                 }
>>                 stdout.printf("\n");
>>             }
>>         }
>>     }
>>
>> I can get that code to raise an exception by feeding it purported UTF-8 that is:
>>
>>   1. Invalid because it has the wrong bit pattern (eg, 0xE9 by itself).
>>   2. Invalid because it has a non-shortest encoding error (eg, \xC0\x80 instead of \x00).
>>
>> However, I cannot get it to raise an exception by feeding it purported UTF-8 that is:
>>
>>   3. Invalid because it has surrogates in it, unpaired or paired.
>>
>>      3a. unpaired example: \xED\xB0\x80, which would be the UTF-8 encoding of surrogate U+DC00. Surrogates are not allowed.
>>
>>      3b. paired example: \xED\xA0\x80\xED\xB0\x80, which would be the CESU-8 encoding of code point U+10000; the correct UTF-8 is \xF0\x90\x80\x80. In fact, if you feed Java both \xED\xA0\x80\xED\xB0\x80 and \xF0\x90\x80\x80, Java quietly treats that as two U+10000 code points. It should not be doing that; it should be raising an exception.
>>
>>   4. Invalid because it has one of the 66 forbidden noncharacters in it.
>>
>> The 66 noncharacter code points are the 32 code points from U+FDD0 through U+FDEF, and the 34 code points U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF. Here is what p. 24 of the published Unicode Standard says about noncharacter code points:
>>
>>   • Noncharacter code points are reserved for internal use, such as for sentinel values. They should never be interchanged. They do, however, have well-formed representations in Unicode encoding forms and survive conversions between encoding forms. This allows sentinel values to be preserved internally across Unicode encoding forms, even though they are not designed to be used in open interchange.
>>
>> And here is more about this matter from the Unicode Standard's chapter on Conformance, section 3.2, p. 59:
>>
>>   C2 A process shall not interpret a noncharacter code point as an abstract character.
>>
>>     • The noncharacter code points may be used internally, such as for sentinel values or delimiters, but should not be exchanged publicly.
>>
>> By that description, it certainly looks to me as though Java is non-conformant because of what it does for 3a, 3b, and 4.
>>
>> Does anyone know anything about this?
>>
>> thanks,
>>
>> --tom
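
For anyone who wants to try the cases in this thread, here is a minimal Java sketch of the checks being discussed. It is not the JDK's implementation: the class `Utf8Checks` and its method names are hypothetical, made up for illustration. It classifies the 66 noncharacters by the ranges Tom lists, detects unpaired surrogates in a decoded string, and decodes bytes with a `CharsetDecoder` set to report errors rather than replace them.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of the JDK.
final class Utf8Checks {

    /** True for U+FDD0..U+FDEF and for the last two code points of each of the 17 planes. */
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF)
            || (Character.isValidCodePoint(cp) && (cp & 0xFFFE) == 0xFFFE);
    }

    /** True if the string holds a surrogate code unit that is not part of a valid pair. */
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); ) {
            // codePointAt returns the lone surrogate itself when no valid pair starts at i.
            int cp = s.codePointAt(i);
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    /** Decodes UTF-8 strictly: malformed input throws instead of becoming U+FFFD. */
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .decode(ByteBuffer.wrap(bytes))
            .toString();
    }

    /** Runs one byte sequence through the strict decoder and describes the outcome. */
    static String describe(byte[] bytes) {
        try {
            String s = decodeStrict(bytes);
            StringBuilder sb = new StringBuilder("accepted:");
            for (int i = 0; i < s.length(); ) {
                int cp = s.codePointAt(i);
                sb.append(String.format(" U+%04X", cp));
                if (isNoncharacter(cp)) sb.append(" (noncharacter)");
                i += Character.charCount(cp);
            }
            if (hasUnpairedSurrogate(s)) sb.append(" [unpaired surrogate]");
            return sb.toString();
        } catch (CharacterCodingException e) {
            return "rejected: " + e;
        }
    }

    public static void main(String[] args) {
        int count = 0;
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (isNoncharacter(cp)) count++;
        }
        System.out.println("noncharacters: " + count);

        // Cases 2, 3a, 3b and 4 from the thread, plus the correct UTF-8 for U+10000.
        System.out.println(describe(new byte[] {(byte) 0xC0, (byte) 0x80}));
        System.out.println(describe(new byte[] {(byte) 0xED, (byte) 0xB0, (byte) 0x80}));
        System.out.println(describe(new byte[] {
            (byte) 0xED, (byte) 0xA0, (byte) 0x80, (byte) 0xED, (byte) 0xB0, (byte) 0x80}));
        System.out.println(describe(new byte[] {(byte) 0xEF, (byte) 0xBF, (byte) 0xBF}));
        System.out.println(describe(new byte[] {
            (byte) 0xF0, (byte) 0x90, (byte) 0x80, (byte) 0x80}));
    }
}
```

The noncharacter count comes out to 66, matching the 32 + 34 breakdown above. Whether the two surrogate sequences are accepted or rejected depends on the decoder in the JDK you run this on, which is the point under discussion, so the sketch reports the outcome instead of assuming one. Keeping `isNoncharacter` separate from decoding reflects Mark's position: the converter lets noncharacters through, and a higher-level process decides whether they belong in interchanged text.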