Tom,
Very good timing :-) I'm back on my encoding-related bugs, fixing some corner
cases in the new UTF-8 implementation we put back into JDK7.
The surrogates part is a known issue. The Unicode Standard can simply change
its "terms" [1] and announce that "the irregular code unit sequence is no
longer needed"; go use CESU-8 if you have to deal with it. It's not that easy
for a platform/implementation that takes compatibility very seriously, such as
Java, to simply break compatibility and follow suit. The current implementation
still accepts the surrogate sequences but never generates them. I'm not that
firm on this: if everybody agrees that, after so many years, compatibility for
the irregular UTF-8 byte sequences is no longer a concern, we can definitely
follow the "conformance request", especially since the Unicode Standard, from
version 4.0 on, clearly declares sequences that map to surrogates to be
"ill-formed". We just need more voices on this issue.
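For example, a quick standalone sketch shows the accept-but-never-generate
asymmetry (the class name is just for illustration, and what the decode does
will depend on which JDK build you run it against):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.*;

// Sketch: decode a CESU-8-style surrogate byte sequence and, separately,
// encode the same code point, to compare what the UTF-8 charset accepts
// with what it generates. Output is illustrative; it varies by JDK version.
public class SurrogateAsymmetry {
    public static void main(String[] args) throws Exception {
        // \xED\xA0\x80\xED\xB0\x80 is the CESU-8 (surrogate-pair) form of U+10000
        byte[] cesu8 = { (byte)0xED, (byte)0xA0, (byte)0x80,
                         (byte)0xED, (byte)0xB0, (byte)0x80 };
        CharsetDecoder dec = Charset.forName("UTF-8").newDecoder();  // REPORTs by default
        try {
            CharBuffer cb = dec.decode(ByteBuffer.wrap(cesu8));
            System.out.printf("decoder accepted it as U+%05X%n", cb.toString().codePointAt(0));
        } catch (CharacterCodingException e) {
            System.out.println("decoder rejected it: " + e);
        }

        // Encoding U+10000 produces the regular four-byte form, never the surrogate pair.
        CharsetEncoder enc = Charset.forName("UTF-8").newEncoder();
        ByteBuffer bb = enc.encode(CharBuffer.wrap(new String(Character.toChars(0x10000))));
        while (bb.hasRemaining()) {
            System.out.printf("%02X ", bb.get() & 0xFF);
        }
        System.out.println();
    }
}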
I'm not sure, however, about the "forbidden noncharacters". D92 in chapter 3
appears to be fine with (it does not explicitly forbid) converting these
noncharacters between the different Unicode encoding forms, and personally I
don't see any benefit in disallowing it. I think we have lots of Unicode
experts :-) on this mailing list; what is the "official word" on this issue?
But again, I doubt Java's UTF-8 charset could simply drop these code points
from the UTF-16<->UTF-8 conversion.
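For what it's worth, here is a small round-trip sketch (again just
illustrative; the class name and the particular code points are arbitrary
picks from the 66) showing what the conversion does with them today:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.*;

// Sketch: round-trip a few noncharacter code points through the UTF-8
// charset and report whether the conversion passes them through unchanged.
public class NoncharRoundTrip {
    public static void main(String[] args) throws Exception {
        int[] nonchars = { 0xFFFE, 0xFDD0, 0x10FFFF };
        Charset utf8 = Charset.forName("UTF-8");
        for (int cp : nonchars) {
            String s = new String(Character.toChars(cp));
            ByteBuffer bb = utf8.newEncoder().encode(CharBuffer.wrap(s));
            String back = utf8.newDecoder().decode(bb).toString();
            System.out.printf("U+%06X round-trips: %b%n",
                              cp, back.codePointAt(0) == cp);
        }
    }
}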
-Sherman
[1] http://www.unicode.org/reports/tr28/tr28-3.html#3_1_conformance
On 09/19/2011 11:45 AM, Tom Christiansen wrote:
Does anybody know anything about the Java UTF-8 encoder? It seems to be broken
in a couple (actually, three) of ways.
* First, it allows for intermixed CESU-8 and UTF-8 even though you
specify UTF-8, when it should be throwing an exception on the CESU-8.
It also allows unpaired surrogates, which is also forbidden by the
standard.
* Second, it allows in the 66 noncharacter code points that the Unicode
Standard says "shall not" be used.
The charset encoders and decoders tend to be a bit finicky on whether they
throw proper exceptions or not, so I'll show you exactly what I'm using:
import java.io.*;
import java.nio.charset.Charset;

public class utf8test {
    public static void main(String argv[]) throws IOException {
        BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in,
                Charset.forName("UTF-8").newDecoder()));
        PrintWriter stdout = new PrintWriter(new OutputStreamWriter(System.out,
                Charset.forName("UTF-8").newEncoder()), true);
        String line;
        while ((line = stdin.readLine()) != null) {
            stdout.println(line);
            for (int i = 0; i < line.length(); i++) {  // XXX: not the real code point length!
                int cp = line.codePointAt(i);
                if (cp < 32 || cp > 126) {
                    stdout.printf("\\x{%05X}", cp);
                } else {
                    stdout.printf("%c", cp);
                }
                if (cp > Character.MAX_VALUE) {
                    i++;  // correct for code unit != code point
                }
            }
            stdout.printf("\n");
        }
    }
}
I can get that code to raise an exception by feeding it purported UTF-8 that is:
1. Invalid because it has the wrong bit pattern (eg, 0xE9 by itself).
2. Invalid because it has a non-shortest encoding error (eg, \xC0\x80
instead of \x00).
However, I cannot get it to raise an exception by feeding it purported UTF-8
that is:
3. Invalid because it has surrogates in it, unpaired or paired.
3a. unpaired example: \xED\xB0\x80, which would be the UTF-8 encoding of
surrogate U+DC00. Surrogates are not allowed.
3b. paired example: \xED\xA0\x80\xED\xB0\x80, which would be CESU-8
encoding of code point U+10000; the correct UTF-8 is
\xF0\x90\x80\x80.
In fact, if you feed Java both \xED\xA0\x80\xED\xB0\x80 and
\xF0\x90\x80\x80, Java quietly treats that as two U+10000 code
points.
It should not be doing that; it should be raising an exception.
4. Invalid because it has one of the 66 forbidden noncharacters in it.
The 66 noncharacter code points are the 32 code points between U+FDD0 and
U+FDEF, and the 34 code points U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ...
U+10FFFE, U+10FFFF. Here's something from the published Unicode Standard's
p.24 about noncharacter code points:
• Noncharacter code points are reserved for internal use, such as for
sentinel values. They should never be interchanged. They do, however,
have well-formed representations in Unicode encoding forms and survive
conversions between encoding forms. This allows sentinel values to be
preserved internally across Unicode encoding forms, even though they are
not designed to be used in open interchange.
And here is more about this matter from the Unicode Standard's chapter on
Conformance, section 3.2, p. 59:
C2 A process shall not interpret a noncharacter code point as an
abstract character.
• The noncharacter code points may be used internally, such as for
sentinel values or delimiters, but should not be exchanged publicly.
By that description, it certainly looks to me as though Java is non-conformant
because of what it does in cases 3a, 3b, and 4.
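For reference, here is a smaller, self-contained sketch that feeds those same
byte sequences straight to a strict CharsetDecoder and reports which ones it
rejects (the class name is just a placeholder; adjust the inputs as needed):

import java.nio.ByteBuffer;
import java.nio.charset.*;

// Sketch: hand each suspect byte sequence to a fresh UTF-8 decoder
// (newDecoder() reports malformed/unmappable input by default) and
// print whether decoding throws. The inputs mirror cases 1-4 above.
public class Utf8Strictness {
    public static void main(String[] args) {
        byte[][] cases = {
            { (byte)0xE9 },                                    // 1.  bare lead byte
            { (byte)0xC0, (byte)0x80 },                        // 2.  non-shortest form of U+0000
            { (byte)0xED, (byte)0xB0, (byte)0x80 },            // 3a. unpaired surrogate U+DC00
            { (byte)0xED, (byte)0xA0, (byte)0x80,
              (byte)0xED, (byte)0xB0, (byte)0x80 },            // 3b. CESU-8 pair for U+10000
            { (byte)0xEF, (byte)0xBF, (byte)0xBE },            // 4.  noncharacter U+FFFE
        };
        for (byte[] input : cases) {
            CharsetDecoder dec = Charset.forName("UTF-8").newDecoder();
            try {
                String out = dec.decode(ByteBuffer.wrap(input)).toString();
                System.out.printf("accepted (%d char(s))%n", out.length());
            } catch (CharacterCodingException e) {
                System.out.println("rejected: " + e);
            }
        }
    }
}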
Does anyone know anything about this?
thanks,
--tom