Angus Leeming wrote:
Lars Gullik Bjønnes wrote:
Angus Leeming <[EMAIL PROTECTED]> writes:

| Bennett Helm wrote:
| > Now that unicode is in, I notice that whatever I type on Mac comes
| > out in Chinese characters, no matter what the display font setting
| > in preferences. Any idea what's going wrong?
|
| This is probably another manifestation of the problems that Abdel's
| been having on Windows. There it seems that he has to explicitly tell
| Qt to use the Little Endian flavour of the UCS-4 encoding. In your
| case, I think that the Mac is, by default, big endian.
|
| You might try out the exploratory big endian patch that Abdel posted
| at http://marc.theaimsgroup.com/?l=lyx-devel&m=115616257321312

We should of course not really need to worry about the endianness of
the system we run on, and I think it is my conversion routines that are
at fault. In particular the manual part of the conversion routines:
bytes_to_ucs4 and bytes_to_ucs2.

So if somebody can tell me a better/faster/correct way to do this:

std::vector<boost::uint32_t> bytes_to_ucs4(std::vector<char> const & bytes)
{
        std::vector<boost::uint32_t> ucs4;
        for (size_t i = 0; i < bytes.size(); i += 4) {
                union {
                        char cc[4];
                        boost::uint32_t c;
                } t;
                t.cc[0] = bytes[i    ];
                t.cc[1] = bytes[i + 1];
                t.cc[2] = bytes[i + 2];
                t.cc[3] = bytes[i + 3];

                ucs4.push_back(t.c);
        }
        return ucs4;
}
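
One possible answer, as a minimal sketch only: if iconv is asked for a fixed byte order (assumed here to be big endian, e.g. the "UCS-4BE" encoding name), each code point can be assembled with shifts instead of a union, and the result no longer depends on the host's endianness. The name ucs4be_to_host is hypothetical, and std::uint32_t stands in for boost::uint32_t so that the snippet is self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not LyX code: decode a buffer holding big-endian
// UCS-4 code units into integers in host byte order.
std::vector<std::uint32_t> ucs4be_to_host(std::vector<char> const & bytes)
{
        // Every UCS-4 code unit occupies exactly four bytes.
        assert(bytes.size() % 4 == 0);

        std::vector<std::uint32_t> ucs4;
        // One allocation up front instead of repeated growth in push_back.
        ucs4.reserve(bytes.size() / 4);

        for (std::size_t i = 0; i < bytes.size(); i += 4) {
                // Go through unsigned char so that bytes >= 0x80 are not
                // sign-extended on platforms where char is signed.
                std::uint32_t const b0 = static_cast<unsigned char>(bytes[i]);
                std::uint32_t const b1 = static_cast<unsigned char>(bytes[i + 1]);
                std::uint32_t const b2 = static_cast<unsigned char>(bytes[i + 2]);
                std::uint32_t const b3 = static_cast<unsigned char>(bytes[i + 3]);

                // The most significant byte comes first in big-endian order.
                ucs4.push_back((b0 << 24) | (b1 << 16) | (b2 << 8) | b3);
        }
        return ucs4;
}
```

The shifts operate on values, not on memory layout, so the same code gives the same result on a little-endian PC and a big-endian Mac. The cost is that the byte order requested from iconv has to match the one assumed in the loop.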

Question: does the char* buffer that is filled by iconv when converting from UTF-8 to UCS-4 not take into account endian issues? I see that http://www.gnu.org/software/libiconv/ says:

This library provides an iconv() implementation, for use on systems which don't have one, or whose implementation cannot convert from/to Unicode.

It provides support for the encodings:
...
Full Unicode, in terms of uint16_t or uint32_t (with machine dependent endianness and alignment)
UCS-2-INTERNAL, UCS-4-INTERNAL

If you use UCS-4-INTERNAL, your char* buffer can just be memcpy-ed, no?

boost::uint32_t dest;
memcpy(&dest, &bytes[0], 4);

I note, however, that you've no check in your routines that the bytes vector is a multiple of 4 (or 2, as appropriate) chars in length.

So, if you can use these encodings, will iconv not do the job for you? There is no need to have iconv fill a char* buffer and then convert to uint32 if iconv will fill the uint32 itself. You could also reserve the size of the ucs4 vector to avoid unnecessary reallocation and copying.

Any help?
Angus

In fact, when I look at your code, why don't you make iconv_convert a template and return a vector of the correct type directly? All that's needed is a change to the last couple of lines of iconv_convert where you fill outvec:

        int const bytes = 1000 - outbytesleft;
        std::size_t length_outvec = bytes * sizeof(char) / sizeof(T);
        ASSERT(length_outvec * sizeof(T) / sizeof(char) == bytes);

        std::vector<T> outvec(length_outvec);
        std::memcpy(&outvec[0], out, bytes);
        return outvec;

I think that the only assumption is that "ucs-4-internal" is a valid flag to pass to iconv itself and that iconv will then do the right thing when filling the char* buffer.
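
To make the idea concrete, here is a rough sketch of just the final copy step as a standalone template. It assumes the char buffer already holds code units in the host's native byte order and alignment-free layout, as a "-INTERNAL" encoding is described to produce. The name buffer_to_vector and its signature are hypothetical, the call to iconv itself is left out, and std::uint16_t / std::uint32_t stand in for the boost types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <type_traits>
#include <vector>

// Hypothetical helper: reinterpret 'bytes' chars starting at 'out' as a
// vector of T. Assumes the buffer is in host byte order already.
template <typename T>
std::vector<T> buffer_to_vector(char const * out, std::size_t bytes)
{
        static_assert(std::is_trivially_copyable<T>::value,
                      "memcpy is only valid for trivially copyable types");
        assert(out != nullptr || bytes == 0);

        // Reject buffers that do not hold a whole number of code units.
        if (bytes % sizeof(T) != 0)
                throw std::invalid_argument("buffer size is not a multiple of sizeof(T)");

        // Size the vector once; no reallocation happens afterwards.
        std::vector<T> outvec(bytes / sizeof(T));
        if (bytes != 0)
                std::memcpy(&outvec[0], out, bytes);
        return outvec;
}
```

A caller would then write something like buffer_to_vector<std::uint32_t>(out, bytes) for UCS-4 and buffer_to_vector<std::uint16_t>(out, bytes) for UCS-2. Copying with memcpy rather than casting the char pointer avoids any alignment problems with the source buffer.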

What do I miss?
Angus
