On 06/07/2012 09:21 AM, Paolo Bonzini wrote:
>>> ... which is to wrap MB_CUR_MAX and pretend that it is 3.
>>
>> Actually, MB_CUR_MAX of UTF-8 is 6, thanks to surrogate pairs.
>
> No, it is 6 mostly thanks to the original 32-bit definition of
> ISO-10646.  UTF-8 codes that decode to 0xD800 -> 0xDFFF are invalid.
> Some programs produce this encoding, but iconv will not support it on
> glibc and technically it's not UTF-8.
>
> However I did count wrong, MB_CUR_MAX for UTF-8 must be at least 4 to
> encode the 21 bits of Unicode (3 in a first byte of the form 11110bbb, 6
> each in the next: 3+6*3 = 21).

You are correct that on glibc, where sizeof(wchar_t)==4, an MB_CUR_MAX
of 4 is valid.  But on Cygwin, where sizeof(wchar_t)==2, MB_CUR_MAX is
intentionally 6, because Cygwin deliberately supports surrogate pairs as
the only way to represent high-plane Unicode characters.  Such support
is NOT compliant with POSIX, but it is a useful enough extension that it
was deemed better than any other alternative.  And yes, that means that
on Cygwin, any character > 0xffff is a multi-wchar_t encoding that you
have to deal with, which makes all of the wide character functions even
harder to reason about.

-- 
Eric Blake   ebl...@redhat.com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
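
To make the arithmetic in this thread concrete, here is a minimal C sketch, not taken from glibc, Cygwin, or any other library. The function names utf8_length() and utf16_encode() are hypothetical, chosen only for illustration. The first shows why 4 bytes are enough for any valid Unicode code point in UTF-8 and why 0xD800 through 0xDFFF are rejected. The second shows how a code point above 0xffff splits into two 16-bit units, which is the situation a 2-byte wchar_t has to cope with.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of bytes UTF-8 needs for code point cp, or 0 if cp is not
 * encodable (a surrogate in 0xD800..0xDFFF, or above 0x10FFFF). */
static size_t utf8_length(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* surrogates are not valid in UTF-8 */
    if (cp < 0x80)
        return 1;               /* 0bbbbbbb: 7 bits */
    if (cp < 0x800)
        return 2;               /* 110bbbbb + 1 trailing byte: 5+6 = 11 bits */
    if (cp < 0x10000)
        return 3;               /* 1110bbbb + 2 trailing bytes: 4+6*2 = 16 bits */
    if (cp <= 0x10FFFF)
        return 4;               /* 11110bbb + 3 trailing bytes: 3+6*3 = 21 bits */
    return 0;
}

/* Encode cp as UTF-16 into out[].  Returns the number of 16-bit units
 * written (1 or 2), or 0 if cp is a surrogate or above 0x10FFFF. */
static size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;  /* fits in a single unit */
        return 1;
    }
    cp -= 0x10000;              /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

For example, U+20AC takes 3 bytes in UTF-8 and one 16-bit unit, while U+1F600 takes 4 bytes in UTF-8 and the two units 0xD83D 0xDE00. The sketch only covers the encoding arithmetic; it says nothing about how any particular C library arrives at its MB_CUR_MAX value.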