Re: [CVS ci] hash compare

Jeff Clites Thu, 13 Nov 2003 16:57:14 -0800

On Nov 13, 2003, at 2:21 PM, Nicholas Clark wrote:

On Wed, Nov 12, 2003 at 02:07:52PM -0800, Mark A. Biggar wrote:

And even when the sequence of Unicode code-points is the same, some
encodings have multiple byte sequences for the same code-point.  For
example, UTF-8 has two ways to encode a code-point that is larger than
0xFFFF (Unicode has code-points up to 0x10FFFF), as either two 16 bit
surrogate code points encoded as two 3 byte UTF-8 code sequences or as
a single value encoded as a single 4 or 5 byte UTF-8 code sequence.


Is it legal to encode surrogate pairs as UTF8? Or does that count as
malformed UTF8?

No, it's not legal. As of Unicode 3.2, it's not permissible to encode a non-BMP (that is, code point > 0xFFFF) character in UTF-8 via two 3-byte UTF-8 sequences. There is another encoding which does this, called CESU-8, but I don't think it's really ever used.
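
To make the difference concrete, here is a rough Python sketch comparing the two byte forms for one non-BMP character (U+10400). The helper names cesu8_encode and strict_utf8_accepts are made up for illustration and are not from any library; the encoder only handles well-formed input with no lone surrogates, and the check relies on Python 3's built-in "utf-8" codec rejecting surrogate code points.

```python
def cesu8_encode(text: str) -> bytes:
    """Hypothetical helper: encode text CESU-8 style, i.e. each non-BMP
    code point becomes a surrogate pair, and each surrogate is written
    as its own 3-byte sequence."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            v = cp - 0x10000
            high = 0xD800 | (v >> 10)    # high (lead) surrogate
            low = 0xDC00 | (v & 0x3FF)   # low (trail) surrogate
            for unit in (high, low):
                out.append(0xE0 | (unit >> 12))
                out.append(0x80 | ((unit >> 6) & 0x3F))
                out.append(0x80 | (unit & 0x3F))
        else:
            # BMP characters are encoded the same way in both schemes.
            out += ch.encode("utf-8")
    return bytes(out)


def strict_utf8_accepts(data: bytes) -> bool:
    """Hypothetical helper: True if Python's strict UTF-8 decoder accepts data."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


ch = "\U00010400"            # a non-BMP character
utf8 = ch.encode("utf-8")    # single 4-byte sequence: f0 90 90 80
cesu8 = cesu8_encode(ch)     # two 3-byte sequences: ed a0 81 ed b0 80

print(utf8.hex(" "), strict_utf8_accepts(utf8))
print(cesu8.hex(" "), strict_utf8_accepts(cesu8))
```

The 4-byte form decodes fine, while the 6-byte surrogate form is rejected as malformed UTF-8, which matches the rule above.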

Jeff
