Re: Unicode on Mac

Angus Leeming Tue, 22 Aug 2006 11:27:58 -0700

Angus Leeming wrote:

Lars Gullik Bjønnes wrote:
Angus Leeming <[EMAIL PROTECTED]> writes:
| Bennett Helm wrote:
| > Now that unicode is in, I notice that whatever I type on Mac comes
| > out in Chinese characters, no matter what the display font setting
| > in preferences. Any idea what's going wrong?
|
| This is probably another manifestation of the problems that Abdel's
| been having on Windows. There it seems that he has to explicitly tell
| Qt to use the Little Endian flavour of the UCS-4 encoding. In your
| case, I think that the Mac is, by default, big endian.
|
| You might try out the exploratory big endian patch that Abdel posted
| at http://marc.theaimsgroup.com/?l=lyx-devel&m=115616257321312

We should of course not really need to worry about the endianness of
the system we run on, and I think it is my conversion routines that are
at fault. In particular the manual part of the conversion routines:
bytes_to_ucs4 and bytes_to_ucs2.

So if somebody can tell me a better/faster/correct way to do this:
std::vector<boost::uint32_t> bytes_to_ucs4(std::vector<char> const &bytes)
{
        std::vector<boost::uint32_t> ucs4;
        for (size_t i = 0; i < bytes.size(); i += 4) {
                union {
                        char cc[4];
                        boost::uint32_t c;
                } t;
                t.cc[0] = bytes[i    ];
                t.cc[1] = bytes[i + 1];
                t.cc[2] = bytes[i + 2];
                t.cc[3] = bytes[i + 3];

                ucs4.push_back(t.c);
        }
        return ucs4;
}
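
One way to take the host out of the picture is to spell out the byte order with shifts instead of a union. The following is only a sketch: it assumes the buffer holds big-endian UCS-4, the name bytes_to_ucs4_be is hypothetical, and std::uint32_t from <cstdint> stands in for boost::uint32_t.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not part of the existing code): decode a buffer that
// is ASSUMED to hold big-endian UCS-4. The byte order is written out with
// shifts, so the result does not depend on the endianness of the host.
std::vector<std::uint32_t> bytes_to_ucs4_be(std::vector<char> const & bytes)
{
        if (bytes.size() % 4 != 0)
                throw std::invalid_argument("UCS-4 buffer length is not a multiple of 4");

        std::vector<std::uint32_t> ucs4;
        ucs4.reserve(bytes.size() / 4);
        for (std::size_t i = 0; i < bytes.size(); i += 4) {
                // Go through unsigned char so that bytes >= 0x80 are not
                // sign-extended when widened to 32 bits.
                std::uint32_t const b0 = static_cast<unsigned char>(bytes[i]);
                std::uint32_t const b1 = static_cast<unsigned char>(bytes[i + 1]);
                std::uint32_t const b2 = static_cast<unsigned char>(bytes[i + 2]);
                std::uint32_t const b3 = static_cast<unsigned char>(bytes[i + 3]);
                ucs4.push_back((b0 << 24) | (b1 << 16) | (b2 << 8) | b3);
        }
        return ucs4;
}
```

If the buffer turned out to be little-endian, the shift amounts would simply be reversed.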

Question: does the char* buffer that is filled by iconv when converting
from UTF-8 to UCS-4 not take into account endian issues? I see that
http://www.gnu.org/software/libiconv/ says:

This library provides an iconv() implementation, for use on systems
which don't have one, or whose implementation cannot convert from/to
Unicode.
It provides support for the encodings:
...
Full Unicode, in terms of uint16_t or uint32_t (with machine dependent
endianness and alignment)
UCS-2-INTERNAL, UCS-4-INTERNAL

If you use UCS-4-INTERNAL, your char* buffer can just be memcpy-ed, no?

boost::uint32_t dest;
memcpy(&dest, &bytes[0], 4);

I note, however, that you've no check in your routines that the bytes
vector is a multiple of 4 (or 2, as appropriate) chars in length.

So, if you can leverage these, will iconv not do the job for you? No need
to use iconv to fill a char* buffer and then convert to uint32 if iconv
will fill the uint32 itself. You could also reserve the size of the ucs4
vector to avoid any unnecessary copies.
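
Roughly what I have in mind, as a sketch only. It assumes the buffer was produced in the machine's own byte order (as UCS-4-INTERNAL is described above), the name bytes_to_ucs4_native is made up for illustration, and std::uint32_t stands in for boost::uint32_t. The vector is sized up front rather than reserved, since memcpy fills it in one go.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical helper: copy a buffer that is ASSUMED to already be in the
// host's native byte order straight into 32-bit code units.
std::vector<std::uint32_t> bytes_to_ucs4_native(std::vector<char> const & bytes)
{
        // The check that the original routines lack.
        if (bytes.size() % sizeof(std::uint32_t) != 0)
                throw std::invalid_argument("UCS-4 buffer length is not a multiple of 4");

        // Allocate the destination once; no push_back reallocations.
        std::vector<std::uint32_t> ucs4(bytes.size() / sizeof(std::uint32_t));
        if (!bytes.empty())
                std::memcpy(ucs4.data(), bytes.data(), bytes.size());
        return ucs4;
}
```

The same shape works for UCS-2 with std::uint16_t and a multiple-of-2 check.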
Any help?
Angus

In fact, when I look at your code, why don't you make iconv_convert a
template and return a vector of the correct type directly. All that's
needed is a change to the last couple of lines of iconv_convert where
you fill outvec:


        int const bytes = 1000 - outbytesleft;
        std::size_t length_outvec = bytes * sizeof(char) / sizeof(T);
        ASSERT(length_outvec * sizeof(T) / sizeof(char) == bytes);

        std::vector<T> outvec(length_outvec);
        std::memcpy(&outvec[0], out, bytes);
        return outvec;

I think that the only assumption is that "ucs-4-internal" is a valid
flag to pass to iconv itself and that iconv will then do the right thing
when filling the char* buffer.
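
Pulled out on its own, the tail of such a template might look like the sketch below. The function name buffer_to_vector and its two parameters are hypothetical stand-ins for the out buffer and byte count inside iconv_convert, and an exception replaces ASSERT so the snippet stands alone. It makes no iconv call and still relies on the buffer being in native byte order.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for the last few lines of a templated
// iconv_convert: turn the raw output buffer into a vector of T.
template <typename T>
std::vector<T> buffer_to_vector(char const * out, std::size_t bytes)
{
        static_assert(std::is_trivially_copyable<T>::value,
                      "T must be safe to fill with memcpy");
        // Same intent as the ASSERT above: the byte count must divide evenly.
        if (bytes % sizeof(T) != 0)
                throw std::invalid_argument("byte count is not a multiple of sizeof(T)");

        std::vector<T> outvec(bytes / sizeof(T));
        if (bytes != 0)
                std::memcpy(outvec.data(), out, bytes);
        return outvec;
}
```

A caller would then pick the element type to match the encoding it asked for, e.g. buffer_to_vector<std::uint32_t>(out, bytes) for UCS-4 and buffer_to_vector<std::uint16_t>(out, bytes) for UCS-2.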


What do I miss?
Angus
