Tom,

Very good timing :-) I'm back on my encoding-related bugs, fixing some
corner cases in the new UTF-8 implementation we putback for JDK7.

The surrogates part is a known issue. The Unicode Standard can simply change
its "terms" [1] and announce that "the irregular code unit sequence is no
longer needed" (go use CESU-8 if you have to deal with it). It's not that
easy for a platform/implementation that takes compatibility very seriously,
such as Java, to simply break compatibility in order to follow. The current
implementation still accepts the surrogates but never generates them. I'm
not that firm on this: if everybody agrees that, after so many years,
compatibility for irregular UTF-8 byte sequences is no longer a concern, we
can definitely follow the "conformance request", especially since the 4.0
Unicode Standard clearly declares sequences that map to surrogates to be
"ill-formed". We just need more voices on this issue.

I'm not sure, however, about the "forbidden noncharacters". Ch. 3/D92
appears to allow (it does not explicitly forbid) conversion of these
noncharacters between the different Unicode encoding forms. Personally I
don't see any benefit in disallowing it. I think we have lots of Unicode
experts :-) on this mailing list; what are the "official words" on this
issue? But again, I doubt Java's UTF-8 can simply drop these code points
from the UTF-16<->UTF-8 conversion.
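For the record, here is a quick sketch (the class and method names are mine)
showing that a pair of strict coders does carry noncharacters through the
UTF-16 <-> UTF-8 round trip unchanged:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

public class NoncharRoundTrip {
    // Encode a string to UTF-8 and decode it back, with both coders set
    // to REPORT so any malformed or unmappable input raises an exception.
    static String roundTrip(String s) throws CharacterCodingException {
        CharsetEncoder enc = Charset.forName("UTF-8").newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharsetDecoder dec = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer bytes = enc.encode(CharBuffer.wrap(s));
        return dec.decode(bytes).toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        // U+FDD0 and U+FFFE are noncharacters, yet neither coder objects.
        String s = "\uFDD0\uFFFE";
        System.out.println(roundTrip(s).equals(s)); // prints "true"
    }
}
```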

-Sherman

[1]http://www.unicode.org/reports/tr28/tr28-3.html#3_1_conformance

On 09/19/2011 11:45 AM, Tom Christiansen wrote:
Does anybody know anything about the Java UTF-8 encoder?  It seems to be broken
in a couple (actually, three) of ways.

   * First, it allows intermixed CESU-8 and UTF-8 even though you
     specify UTF-8, when it should be throwing an exception on the
     CESU-8. It also allows unpaired surrogates, which are likewise
     forbidden by the standard.

   * Second, it allows in the 66 noncharacter code points that the Unicode
     Standard says "shall not" be used.

The charset encoders and decoders tend to be a bit finicky on whether they
throw proper exceptions or not, so I'll show you exactly what I'm using:

     import java.io.*;
     import java.nio.charset.Charset;

     public class utf8test {
         public static void main(String[] argv) throws IOException {
             BufferedReader stdin = new BufferedReader(new InputStreamReader(
                     System.in, Charset.forName("UTF-8").newDecoder()));
             PrintWriter stdout = new PrintWriter(new OutputStreamWriter(
                     System.out, Charset.forName("UTF-8").newEncoder()), true);
             String line;
             while ((line = stdin.readLine()) != null) {
                 stdout.println(line);
                 for (int i = 0; i < line.length(); i++) {  // XXX: not the real code point length!
                     int cp = line.codePointAt(i);
                     if (cp < 32 || cp > 126) {
                         stdout.printf("\\x{%05X}", cp);
                     } else {
                         stdout.printf("%c", cp);
                     }
                     if (cp > Character.MAX_VALUE) {
                         i++; // correct for code unit != code point
                     }
                 }
                 stdout.printf("\n");
             }
         }
     }

I can get that code to raise an exception by feeding it purported UTF-8 that is:

    1. Invalid because it has the wrong bit pattern (e.g., 0xE9 by itself).
    2. Invalid because it has a non-shortest encoding error (e.g., \xC0\x80
       instead of \x00).
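Those two rejections can be reproduced without the stdin plumbing by calling
a strict decoder directly (a sketch; the class and helper names are mine):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class MalformedCheck {
    // True if a strict (REPORT) UTF-8 decoder refuses the byte sequence.
    static boolean rejects(byte[] bytes) {
        CharsetDecoder dec = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            dec.decode(ByteBuffer.wrap(bytes));
            return false;
        } catch (CharacterCodingException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // 1. Lone lead byte 0xE9 with no continuation bytes.
        System.out.println(rejects(new byte[] { (byte) 0xE9 }));
        // 2. Overlong (non-shortest-form) encoding C0 80 of U+0000.
        System.out.println(rejects(new byte[] { (byte) 0xC0, (byte) 0x80 }));
    }
}
```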

However, I cannot get it to raise an exception by feeding it purported
UTF-8 that is:

    3. Invalid because it has surrogates in it, unpaired or paired.

       3a. Unpaired example: \xED\xB0\x80, which would be the UTF-8
           encoding of the surrogate U+DC00. Surrogates are not allowed.

       3b. Paired example: \xED\xA0\x80\xED\xB0\x80, which would be the
           CESU-8 encoding of code point U+10000; the correct UTF-8 is
           \xF0\x90\x80\x80. In fact, if you feed Java both
           \xED\xA0\x80\xED\xB0\x80 and \xF0\x90\x80\x80, Java quietly
           treats that as two U+10000 code points. It should not be
           doing that; it should be raising an exception.

    4. Invalid because it has one of the 66 forbidden noncharacters in it.
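As a point of comparison, a strict decoder on a JDK with the reworked UTF-8
implementation rejects both byte sequences from 3a and 3b; whether a given
JDK does so is exactly the question under discussion. A sketch (the class
and helper names are mine):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class SurrogateCheck {
    // True if a strict (REPORT) UTF-8 decoder refuses the byte sequence.
    static boolean rejects(byte[] bytes) {
        CharsetDecoder dec = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            dec.decode(ByteBuffer.wrap(bytes));
            return false;
        } catch (CharacterCodingException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // 3a: would-be UTF-8 encoding of the unpaired surrogate U+DC00.
        byte[] unpaired = { (byte) 0xED, (byte) 0xB0, (byte) 0x80 };
        // 3b: CESU-8-style surrogate pair for code point U+10000.
        byte[] paired = { (byte) 0xED, (byte) 0xA0, (byte) 0x80,
                          (byte) 0xED, (byte) 0xB0, (byte) 0x80 };
        System.out.println(rejects(unpaired));
        System.out.println(rejects(paired));
    }
}
```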

The 66 noncharacter code points are the 32 code points between U+FDD0 and
U+FDEF, and the 34 code points U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ...
U+10FFFE, U+10FFFF.  Here's something from the published Unicode Standard's
p.24 about noncharacter code points:

     • Noncharacter code points are reserved for internal use, such as for
       sentinel values. They should never be interchanged. They do, however,
       have well-formed representations in Unicode encoding forms and survive
       conversions between encoding forms. This allows sentinel values to be
       preserved internally across Unicode encoding forms, even though they are
       not designed to be used in open interchange.
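The count works out as described: 32 contiguous code points at U+FDD0, plus
the last two code points of each of the 17 planes. A quick enumeration (the
class name is mine):

```java
import java.util.ArrayList;
import java.util.List;

public class NoncharCount {
    // Enumerate all Unicode noncharacter code points.
    static List<Integer> noncharacters() {
        List<Integer> list = new ArrayList<>();
        // 32 code points U+FDD0..U+FDEF.
        for (int cp = 0xFDD0; cp <= 0xFDEF; cp++) {
            list.add(cp);
        }
        // The last two code points of each of the 17 planes: U+FFFE,
        // U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF.
        for (int plane = 0; plane <= 0x10; plane++) {
            list.add(plane * 0x10000 + 0xFFFE);
            list.add(plane * 0x10000 + 0xFFFF);
        }
        return list;
    }

    public static void main(String[] args) {
        System.out.println(noncharacters().size()); // prints "66"
    }
}
```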

And here is more about this matter from the Unicode Standard's chapter on
Conformance, section 3.2, p. 59:

     C2 A process shall not interpret a noncharacter code point as an
        abstract character.

         • The noncharacter code points may be used internally, such as for
           sentinel values or delimiters, but should not be exchanged publicly.

By that description, it certainly looks to me like Java is non-conformant
because of what it does in 3a, 3b, and 4.

Does anyone know anything about this?

thanks,

--tom
