On Wed, Nov 12, 2003 at 02:07:52PM -0800, Mark A. Biggar wrote:
And even when the sequence of Unicode code-points is the same, some encodings have multiple byte sequences for the same code-point. For example, UTF-8 has two ways to encode a code-point that is larger than 0xFFFF (Unicode has code-points up to 0x10FFFF): as two 16-bit surrogate code points, each encoded as a 3-byte UTF-8 code sequence, or as a single value encoded as a single 4 or 5 byte UTF-8 code sequence.
Is it legal to encode surrogate pairs as UTF-8? Or does that count as malformed UTF-8?
No, it's not legal. As of Unicode 3.2, it's not permissible to encode a non-BMP character (that is, a code point > 0xFFFF) in UTF-8 as two 3-byte sequences, one per surrogate; such input is ill-formed. There is another encoding which does exactly this, called CESU-8, but I don't think it's really ever used.
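To make the difference concrete, here is a rough Python sketch (Python is only my choice for illustration; nothing in this thread depends on it). The helper names cesu8_style_bytes and is_well_formed_utf8 are made up for this example, and the first one only imitates the CESU-8 idea for a single character rather than being a full CESU-8 codec. It relies on Python's built-in "utf-8" codec rejecting encoded surrogates and on the "surrogatepass" error handler to produce them deliberately.

```python
def cesu8_style_bytes(ch: str) -> bytes:
    """Encode each UTF-16 code unit of `ch` as its own UTF-8-style sequence.

    Hypothetical helper for illustration: this mimics what CESU-8 does,
    it is not a complete CESU-8 implementation.
    """
    units = ch.encode("utf-16-be")
    out = b""
    for i in range(0, len(units), 2):
        unit = int.from_bytes(units[i:i + 2], "big")
        # "surrogatepass" lets the codec emit a 3-byte form for a lone surrogate.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return out


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts `data`."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


ch = "\U00010000"                   # first code point above 0xFFFF
proper = ch.encode("utf-8")         # single 4-byte sequence: F0 90 80 80
via_pair = cesu8_style_bytes(ch)    # two 3-byte sequences: ED A0 80 ED B0 80

print(proper.hex(" "), is_well_formed_utf8(proper))
print(via_pair.hex(" "), is_well_formed_utf8(via_pair))
```

The second line of output should report the 6-byte form as not well-formed, which is the behaviour a strict UTF-8 decoder is expected to have.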
JEff