Re: Java encoder errors

Xueming Shen Tue, 20 Sep 2011 10:01:27 -0700

On 09/19/2011 03:26 PM, Tom Christiansen wrote:

Mark Davis ☕<m...@macchiato.com>  wrote
    on Mon, 19 Sep 2011 14:41:49 PDT:

I agree with the first part, disallowing the irregular code sequences.

Finding that Java allowed surrogates to sneak through in their UTF-8
streams like that was quite odd.


As the saying goes, "be conservative in what you send, liberal in what you accept" :-)

Surrogates in UTF-8 were still labeled "irregular" rather than "ill-formed" not
long ago [1], with C12/D36 explicitly saying:

C12: Processes may transform irregular code unit sequences into the equivalent well-formed
        code unit sequences.

D36: As a consequence of C12, these irregular UTF-8 sequences shall not be generated
        by a conformant process.

So it does not seem that odd for an implementation to continue to be "liberal" about these
surrogates :-)

As acknowledged in TR#26, there is data out there that does use surrogate pairs in
"UTF-8" form. It would be a little inconvenient, if not odd, to have to use two UTF-8
converters to get the "unicode code" in and out, especially since I would assume most
developers might not even know about CESU-8. The only thing most people would notice is
that their applications suddenly do not work on their data after upgrading from
JDK N to JDK N+1.
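To make the "two converters" point concrete, here is a minimal sketch, not anything taken from the JDK sources. It encodes U+10400 both ways and asks a strict UTF-8 decoder (malformed input set to REPORT) whether it accepts each form. The class name, the constants and the toy decodeThreeByteUnits helper are hypothetical, made up for this illustration; the helper only understands three-byte units and is not a real CESU-8 converter. Whether the strict decoder rejects the six-byte form depends on the JDK release in use; I would expect a decoder that follows the corrigendum to reject it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class SurrogateUtf8Demo {
    // U+10400 as well-formed UTF-8: a single four-byte sequence.
    static final byte[] WELL_FORMED = {
        (byte) 0xF0, (byte) 0x90, (byte) 0x90, (byte) 0x80
    };

    // The same character in CESU-8 style: the UTF-16 surrogates D801 and DC00,
    // each encoded as its own three-byte sequence.
    static final byte[] SURROGATE_FORM = {
        (byte) 0xED, (byte) 0xA0, (byte) 0x81,
        (byte) 0xED, (byte) 0xB0, (byte) 0x80
    };

    // Returns true if the JDK's UTF-8 decoder accepts the bytes when it is
    // told to report malformed input instead of replacing it.
    static boolean strictUtf8Accepts(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Hypothetical "liberal" helper: treats the input as a run of three-byte
    // units and turns each one into a UTF-16 code unit, surrogate or not.
    // It performs no validation and ignores any trailing partial unit.
    static String decodeThreeByteUnits(byte[] b) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 2 < b.length; i += 3) {
            int unit = ((b[i] & 0x0F) << 12)
                     | ((b[i + 1] & 0x3F) << 6)
                     | (b[i + 2] & 0x3F);
            sb.append((char) unit);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("strict accepts 4-byte form:    "
            + strictUtf8Accepts(WELL_FORMED));
        System.out.println("strict accepts surrogate form: "
            + strictUtf8Accepts(SURROGATE_FORM));
        String s = decodeThreeByteUnits(SURROGATE_FORM);
        System.out.println("liberal decode gives U+"
            + Integer.toHexString(s.codePointAt(0)).toUpperCase());
    }
}
```

An application holding data in the second form needs something like the liberal path to read it, which is the inconvenience described above.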

That said, a standard is a standard; if possible, it's nice to follow it.

-Sherman

[1] http://unicode.org/versions/corrigendum1.html
