Angus Leeming <[EMAIL PROTECTED]> writes:

| Bennett Helm wrote:
| > Now that unicode is in, I notice that whatever I type on Mac comes
| > out in Chinese characters, no matter what the display font setting
| > in preferences. Any idea what's going wrong?
|
| This is probably another manifestation of the problems that Abdel's
| been having on Windows. There it seems that he has to explicitly tell
| Qt to use the Little Endian flavour of the UCS-4 encoding. In your
| case, I think that the Mac is, by default, big endian.
|
| You might try out the exploratory big endian patch that Abdel posted
| at http://marc.theaimsgroup.com/?l=lyx-devel&m=115616257321312

We should of course not really need to worry about the endianness of
the system we run on, and I think it is my conversion routines that
are at fault. In particular the manual part of the conversion
routines: bytes_to_ucs4 and bytes_to_ucs2.

So if somebody can tell me a better/faster/correct way to do this,
one that does not have any endianness issues:

std::vector<boost::uint32_t>
bytes_to_ucs4(std::vector<char> const & bytes)
{
    //lyxerr << "Outbuf =" << std::hex;

    std::vector<boost::uint32_t> ucs4;
    for (size_t i = 0; i < bytes.size(); i += 4) {
        unsigned char const b1 = bytes[i    ];
        unsigned char const b2 = bytes[i + 1];
        unsigned char const b3 = bytes[i + 2];
        unsigned char const b4 = bytes[i + 3];

        boost::uint32_t c;
        char * cc = reinterpret_cast<char *>(&c);
        cc[3] = b1;
        cc[2] = b2;
        cc[1] = b3;
        cc[0] = b4;

        ucs4.push_back(c);
    }
    return ucs4;
}

and:

std::vector<unsigned short>
bytes_to_ucs2(std::vector<char> const & bytes)
{
    //lyxerr << "Outbuf =" << std::hex;

    std::vector<unsigned short> ucs2;
    for (size_t i = 0; i < bytes.size(); i += 2) {
        unsigned char const b1 = bytes[i    ];
        unsigned char const b2 = bytes[i + 1];

        unsigned short c;
        char * cc = reinterpret_cast<char *>(&c);
        cc[0] = b1;
        cc[1] = b2;

        ucs2.push_back(c);
    }
    return ucs2;
}

I am thinking of clever use of unions etc., for example something
like this (type-punning ahead... beware):

std::vector<boost::uint32_t>
bytes_to_ucs4(std::vector<char> const & bytes)
{
    std::vector<boost::uint32_t> ucs4;
    for (size_t i = 0; i < bytes.size(); i += 4) {
        union {
            char cc[4];
            boost::uint32_t c;
        } t;

        t.cc[0] = bytes[i    ];
        t.cc[1] = bytes[i + 1];
        t.cc[2] = bytes[i + 2];
        t.cc[3] = bytes[i + 3];

        ucs4.push_back(t.c);
    }
    return ucs4;
}

std::vector<unsigned short>
bytes_to_ucs2(std::vector<char> const & bytes)
{
    std::vector<unsigned short> ucs2;
    for (size_t i = 0; i < bytes.size(); i += 2) {
        union {
            char cc[2];
            unsigned short c;
        } t;

        t.cc[0] = bytes[i    ];
        t.cc[1] = bytes[i + 1];

        ucs2.push_back(t.c);
    }
    return ucs2;
}

Or does that just have the same endianness problems? (Or not work at
all...)

Also the ucsX_toucsY would need a similar treatment.

I'll test this out later tonight or tomorrow. Probably quite a few
things in unicode.C could be simpler if this union trick works as I
hope it will.

-- Lgb
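
For comparison, here is a minimal sketch of a conversion that does not depend on host byte order. A union, like the reinterpret_cast version, places the bytes in memory order, so the resulting value still differs between little- and big-endian hosts; building the value with shifts avoids that. The sketch rests on assumptions that are not from the original post: the input is taken to be big-endian (most significant byte first), the names bytes_be_to_ucs4 and bytes_be_to_ucs2 are hypothetical, and std::uint32_t / std::uint16_t from <cstdint> stand in for boost::uint32_t and unsigned short so the example is self-contained.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from unicode.C). ASSUMPTION: the input bytes are
// big-endian. Values are assembled arithmetically, so the result does not
// depend on the byte order of the host.
std::vector<std::uint32_t>
bytes_be_to_ucs4(std::vector<char> const & bytes)
{
    std::vector<std::uint32_t> ucs4;
    ucs4.reserve(bytes.size() / 4);
    // Stop before a trailing partial group so we never read past the end.
    for (std::size_t i = 0; i + 3 < bytes.size(); i += 4) {
        // Convert via unsigned char first to avoid sign extension.
        std::uint32_t const b1 = static_cast<unsigned char>(bytes[i    ]);
        std::uint32_t const b2 = static_cast<unsigned char>(bytes[i + 1]);
        std::uint32_t const b3 = static_cast<unsigned char>(bytes[i + 2]);
        std::uint32_t const b4 = static_cast<unsigned char>(bytes[i + 3]);
        ucs4.push_back((b1 << 24) | (b2 << 16) | (b3 << 8) | b4);
    }
    return ucs4;
}

// Same idea for 16-bit code units, again assuming big-endian input.
std::vector<std::uint16_t>
bytes_be_to_ucs2(std::vector<char> const & bytes)
{
    std::vector<std::uint16_t> ucs2;
    ucs2.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        unsigned const b1 = static_cast<unsigned char>(bytes[i    ]);
        unsigned const b2 = static_cast<unsigned char>(bytes[i + 1]);
        ucs2.push_back(static_cast<std::uint16_t>((b1 << 8) | b2));
    }
    return ucs2;
}
```

If the byte stream were little-endian instead, the shift amounts would simply be reversed (b1 unshifted, b4 shifted by 24). Either way the byte order of the input has to be known or chosen explicitly; it cannot be inferred from the host.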