Why does one char in UTF8 (3 bytes) become 6 bytes when converted to UTF16?

Kee Nethery Tue, 29 Mar 2011 18:30:37 -0700

I have the "don't" sign symbol (COMBINING ENCLOSING CIRCLE BACKSLASH, U+20E0) 
in a text file that I read into LiveCode. For grins, it is the character 
between "Petro" and "Max" seen below.


Petro⃠Max

When I scan the bytes, in UTF8, this is encoded as: 226 131 160 (decimal), 
also known as E2 83 A0 (hex). This is the correct UTF8 encoding for this 
character.
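
The byte values above can be checked outside LiveCode. The following is a minimal Python sketch (Python is used only for illustration and is not part of the LiveCode workflow); it decodes the three bytes and shows the UTF-16 bytes one would expect for the same character, assuming little-endian byte order:

```python
# The three UTF-8 bytes reported in the post (decimal 226 131 160).
utf8_bytes = bytes([0xE2, 0x83, 0xA0])

# Decode them as UTF-8; three bytes collapse into one code point.
char = utf8_bytes.decode("utf-8")
code_point = ord(char)

# Re-encode as UTF-16 little-endian (byte order is an assumption here).
utf16_le = char.encode("utf-16-le")

print(hex(code_point))   # 0x20e0
print(list(utf16_le))    # [224, 32]
```

So a correct conversion of this one character should give two bytes, 224 32 (or 32 224 in big-endian order).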

When I convert this to UTF16 using

uniencode(theUtf8Text) or uniencode(theUtf8Text,"UTF16") the byte values are: 
26 32 201 0 32 32
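
One possible reading of those six bytes, sketched in Python rather than LiveCode: they are consistent with each of the three UTF-8 bytes having been treated as a separate single-byte character before conversion. The sketch assumes the platform's native encoding was Mac Roman, which the post does not state, so treat this as a hypothesis and not a diagnosis:

```python
# The six bytes observed after the conversion, read as UTF-16 little-endian.
observed = bytes([26, 32, 201, 0, 32, 32])
as_text = observed.decode("utf-16-le")
observed_code_points = [hex(ord(c)) for c in as_text]
print(observed_code_points)   # ['0x201a', '0xc9', '0x2020']

# Assumption: the input bytes were interpreted as Mac Roman, one byte per
# character, instead of as a single three-byte UTF-8 sequence.
misread = bytes([0xE2, 0x83, 0xA0]).decode("mac_roman")
print(misread == as_text)     # True
```

If that reading is right, the conversion call was never told that its input was UTF-8.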

A Unicode character in UTF16 should be stored as either two bytes or four 
bytes, never six. According to the Unicode spec, the characters that require 
four bytes are pretty uncommon, and I'm willing to ignore the error they will 
create if the data stream ever contains them. But the thing I'm trying to do 
is count characters on a line, and my single character looks like three when 
converted to UTF16.
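
To make the two-versus-four-byte point and the counting goal concrete, here is a small Python sketch (an illustration only, not LiveCode). The sample line is the one from this post; the musical clef character is an arbitrary example of a code point outside the Basic Multilingual Plane, and skipping mark characters is only a rough approximation of what a reader perceives as one character:

```python
import unicodedata

line = "Petro\u20e0Max"   # the sample line, with U+20E0 after "Petro"

# Each code point in this line fits in one 16-bit unit (two bytes).
code_point_count = len(line)
utf16_byte_count = len(line.encode("utf-16-le"))
print(code_point_count, utf16_byte_count)   # 9 18

# A code point above U+FFFF needs a surrogate pair: four bytes.
clef = "\U0001D11E"
clef_byte_count = len(clef.encode("utf-16-le"))
print(clef_byte_count)                       # 4

# Rough count of visible characters: skip marks (categories Mn, Mc, Me),
# since U+20E0 attaches to the preceding letter.
visible_count = sum(
    1 for c in line if not unicodedata.category(c).startswith("M")
)
print(visible_count)                         # 8
```

Note that even with a correct conversion, the combining mark is its own code point, so a count of 16-bit units gives 9 for this line and not 8.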

Any suggestions on how to get a UTF8 character to correctly convert to UTF16?

Kee Nethery
_______________________________________________
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode
