Tom,
Very good timing :-) I'm back on my encoding-related bugs, fixing some corner
cases in the new UTF-8 implementation we put back into JDK7.
The surrogates part is a known issue. The Unicode Standard can simply change
its "terms" [1] and announce that "the irregular code unit sequence is no
longer needed"; go use CESU-8 if you have to deal with it. It's not that easy
for a platform/implementation that takes compatibility very seriously, such as
Java, to simply break compatibility and follow suit. The current implementation
still accepts the surrogate sequences but never generates them. I'm not that
firm on this: if everybody agrees that, after so many years, compatibility for
the irregular UTF-8 byte sequences is no longer a concern, we can definitely
follow the "conformance request", especially since the Unicode Standard, from
version 4.0 on, clearly declares sequences that map to surrogates to be
"ill-formed". We just need more voices on this issue.
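For example, a quick standalone sketch shows the accept-but-never-generate
asymmetry (the class name is just for illustration, and what the decode does
will depend on which JDK build you run it against):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.*;

// Sketch: decode a CESU-8-style surrogate byte sequence and, separately,
// encode the same code point, to compare what the UTF-8 charset accepts
// with what it generates. Output is illustrative; it varies by JDK version.
public class SurrogateAsymmetry {
    public static void main(String[] args) throws Exception {
        // \xED\xA0\x80\xED\xB0\x80 is the CESU-8 (surrogate-pair) form of U+10000
        byte[] cesu8 = { (byte)0xED, (byte)0xA0, (byte)0x80,
                         (byte)0xED, (byte)0xB0, (byte)0x80 };
        CharsetDecoder dec = Charset.forName("UTF-8").newDecoder();  // REPORTs by default
        try {
            CharBuffer cb = dec.decode(ByteBuffer.wrap(cesu8));
            System.out.printf("decoder accepted it as U+%05X%n", cb.toString().codePointAt(0));
        } catch (CharacterCodingException e) {
            System.out.println("decoder rejected it: " + e);
        }

        // Encoding U+10000 produces the regular four-byte form, never the surrogate pair.
        CharsetEncoder enc = Charset.forName("UTF-8").newEncoder();
        ByteBuffer bb = enc.encode(CharBuffer.wrap(new String(Character.toChars(0x10000))));
        while (bb.hasRemaining()) {
            System.out.printf("%02X ", bb.get() & 0xFF);
        }
        System.out.println();
    }
}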
I'm not sure, however, about the "forbidden noncharacters". D92 in chapter 3
appears to be fine with (it does not explicitly forbid) converting these
noncharacters between the different Unicode encoding forms, and personally I
don't see any benefit in disallowing it. I think we have lots of Unicode
experts :-) on this mailing list; what is the "official word" on this issue?
But again, I doubt Java's UTF-8 charset could simply drop these code points
from the UTF-16<->UTF-8 conversion.
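For what it's worth, here is a small round-trip sketch (again just
illustrative; the class name and the particular code points are arbitrary
picks from the 66) showing what the conversion does with them today:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.*;

// Sketch: round-trip a few noncharacter code points through the UTF-8
// charset and report whether the conversion passes them through unchanged.
public class NoncharRoundTrip {
    public static void main(String[] args) throws Exception {
        int[] nonchars = { 0xFFFE, 0xFDD0, 0x10FFFF };
        Charset utf8 = Charset.forName("UTF-8");
        for (int cp : nonchars) {
            String s = new String(Character.toChars(cp));
            ByteBuffer bb = utf8.newEncoder().encode(CharBuffer.wrap(s));
            String back = utf8.newDecoder().decode(bb).toString();
            System.out.printf("U+%06X round-trips: %b%n",
                              cp, back.codePointAt(0) == cp);
        }
    }
}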
-Sherman
[1] http://www.unicode.org/reports/tr28/tr28-3.html#3_1_conformance
On 09/19/2011 11:45 AM, Tom Christiansen wrote:
Does anybody know anything about the Java UTF-8 encoder? It seems to be broken
in a couple (actually, three) of ways.
* First, it allows for intermixed CESU-8 and UTF-8 even though you
specify UTF-8, when it should be throwing an exception on the CESU-8.
It also allows unpaired surrogates, which is also forbidden by the
standard.
* Second, it allows in the 66 noncharacter code points that the Unicode
Standard says "shall not" be used.
The charset encoders and decoders tend to be a bit finicky on whether they
throw proper exceptions or not, so I'll show you exactly what I'm using:
import java.io.*;
import java.nio.charset.Charset;

public class utf8test {
    public static void main(String argv[]) throws IOException {
        BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in,
                Charset.forName("UTF-8").newDecoder()));
        PrintWriter stdout = new PrintWriter(new OutputStreamWriter(System.out,
                Charset.forName("UTF-8").newEncoder()), true);
        String line;
        while ((line = stdin.readLine()) != null) {
            stdout.println(line);
            for (int i = 0; i < line.length(); i++) {  // XXX: not the real code point length!
                int cp = line.codePointAt(i);
                if (cp < 32 || cp > 126) {
                    stdout.printf("\\x{%05X}", cp);
                } else {
                    stdout.printf("%c", cp);
                }
                if (cp > Character.MAX_VALUE) {
                    i++;  // correct for code unit != code point
                }
            }
            stdout.printf("\n");
        }
    }
}
I can get that code to raise an exception by feeding it purported UTF-8 that is:
1. Invalid because it has the wrong bit pattern (eg, 0xE9 by itself).
2. Invalid because it has a non-shortest encoding error (eg, \xC0\x80
instead of \x00).
However, I cannot get it to raise an exception by feeding it purported UTF-8
that is:
3. Invalid because it has surrogates in it, unpaired or paired.
3a. unpaired example: \xED\xB0\x80, which would be the UTF-8 encoding of
surrogate U+DC00. Surrogates are not allowed.
3b. paired example: \xED\xA0\x80\xED\xB0\x80, which would be CESU-8
encoding of code point U+10000; the correct UTF-8 is
\xF0\x90\x80\x80.
In fact, if you feed Java both \xED\xA0\x80\xED\xB0\x80 and
\xF0\x90\x80\x80, Java quietly treats that as two U+10000 code
points.
It should not be doing that; it should be raising an exception.
4. Invalid because it has one of the 66 forbidden noncharacters in it.
The 66 noncharacter code points are the 32 code points between U+FDD0 and
U+FDEF, and the 34 code points U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ...
U+10FFFE, U+10FFFF. Here's something from the published Unicode Standard's
p.24 about noncharacter code points:
• Noncharacter code points are reserved for internal use, such as for
sentinel values. They should never be interchanged. They do, however,
have well-formed representations in Unicode encoding forms and survive
conversions between encoding forms. This allows sentinel values to be
preserved internally across Unicode encoding forms, even though they are
not designed to be used in open interchange.
And here is more about this matter from the Unicode Standard's chapter on
Conformance, section 3.2, p. 59:
C2 A process shall not interpret a noncharacter code point as an
abstract character.
• The noncharacter code points may be used internally, such as for
sentinel values or delimiters, but should not be exchanged publicly.
By that description, it certainly looks to me as though Java is non-conformant
because of what it does in cases 3a, 3b, and 4.
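For reference, here is a smaller, self-contained sketch that feeds those same
byte sequences straight to a strict CharsetDecoder and reports which ones it
rejects (the class name is just a placeholder; adjust the inputs as needed):

import java.nio.ByteBuffer;
import java.nio.charset.*;

// Sketch: hand each suspect byte sequence to a fresh UTF-8 decoder
// (newDecoder() reports malformed/unmappable input by default) and
// print whether decoding throws. The inputs mirror cases 1-4 above.
public class Utf8Strictness {
    public static void main(String[] args) {
        byte[][] cases = {
            { (byte)0xE9 },                                    // 1.  bare lead byte
            { (byte)0xC0, (byte)0x80 },                        // 2.  non-shortest form of U+0000
            { (byte)0xED, (byte)0xB0, (byte)0x80 },            // 3a. unpaired surrogate U+DC00
            { (byte)0xED, (byte)0xA0, (byte)0x80,
              (byte)0xED, (byte)0xB0, (byte)0x80 },            // 3b. CESU-8 pair for U+10000
            { (byte)0xEF, (byte)0xBF, (byte)0xBE },            // 4.  noncharacter U+FFFE
        };
        for (byte[] input : cases) {
            CharsetDecoder dec = Charset.forName("UTF-8").newDecoder();
            try {
                String out = dec.decode(ByteBuffer.wrap(input)).toString();
                System.out.printf("accepted (%d char(s))%n", out.length());
            } catch (CharacterCodingException e) {
                System.out.println("rejected: " + e);
            }
        }
    }
}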
Does anyone know anything about this?
thanks,
--tom