On 09/19/2011 03:26 PM, Tom Christiansen wrote:
Mark Davis ☕ wrote
on Mon, 19 Sep 2011 14:41:49 PDT:
I agree with the first part, disallowing the irregular code sequences.
Finding that Java allowed surrogates to sneak through in their UTF-8
streams like that was quite odd.
It's said "b
They are really "super private use" characters, available for definition
within a given implementation or domain.
For example, in CLDR collation tables:
The code point U+ is tailored to have a weight higher than all other
characters. This allows reliable specification of a range, such as “Sch
I agree with the first part, disallowing the irregular code sequences.
As to the noncharacters, it would be a horrible mistake to disallow them.
Tom, a Java code converter is far too low a level for C9; if the converter
can't handle them, it screws up all perfectly legitimate
*internal* interchang
Tom,
Very good timing :-) I'm back on my encoding-related bugs, just fixing
some corner cases in the new UTF-8 implementation we put back for JDK7.
The surrogates part is a known issue. The Unicode Standard can simply change
its "terms" [1] and
announce "the irregular code unit sequence is no lo
Does anybody know anything about the Java UTF-8 encoder? It seems to be broken
in a couple of ways (three, actually).
* First, it allows for intermixed CESU-8 and UTF-8 even though you
specify UTF-8, when it should be throwing an exception on the CESU-8.
It also allows unpaired surrog
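
For anyone who wants to check the CESU-8 and unpaired-surrogate behaviour on their own JDK, here is a minimal sketch, not a definitive test. It assumes a strict decoder obtained from StandardCharsets.UTF_8 with CodingErrorAction.REPORT. The class name StrictUtf8Check and the method isWellFormedUtf8 are made up for illustration, and the result may differ between JDK versions, which is the point of the complaint above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named here only for illustration.
public class StrictUtf8Check {

    // Returns true only if the decoder accepts every byte sequence.
    // REPORT makes the decoder throw instead of substituting U+FFFD.
    public static boolean isWellFormedUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // U+1F600 as a regular four-byte UTF-8 sequence.
        byte[] utf8 = {(byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x80};
        // The same code point CESU-8 style: surrogates D83D and DE00,
        // each encoded separately as three bytes.
        byte[] cesu8 = {(byte) 0xED, (byte) 0xA0, (byte) 0xBD,
                        (byte) 0xED, (byte) 0xB8, (byte) 0x80};
        // A lone high surrogate, U+D800, encoded as three bytes.
        byte[] loneSurrogate = {(byte) 0xED, (byte) 0xA0, (byte) 0x80};

        System.out.println("UTF-8 accepted:          " + isWellFormedUtf8(utf8));
        System.out.println("CESU-8 accepted:         " + isWellFormedUtf8(cesu8));
        System.out.println("lone surrogate accepted: " + isWellFormedUtf8(loneSurrogate));
    }
}
```

A decoder that follows the standard's definition of well-formed UTF-8 should accept only the first input. If the second or third is accepted, the decoder shows the leniency described in the list above.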