Angus Leeming <[EMAIL PROTECTED]> writes:

| Bennett Helm wrote:
| > Now that unicode is in, I notice that whatever I type on Mac comes
| > out in Chinese characters, no matter what the display font setting
| > in preferences. Any idea what's going wrong?
| 
| This is probably another manifestation of the problems that Abdel's
| been having on Windows. There it seems that he has to explicitly tell
| Qt to use the Little Endian flavour of the UCS-4 encoding. In your
| case, I think that the Mac is, by default, big endian.
| 
| You might try out the exploratory big endian patch that Abdel posted
| at http://marc.theaimsgroup.com/?l=lyx-devel&m=115616257321312

We should of course not really need to worry about the endianness of
the system we run on, and I think it is my conversion routines that are
at fault. In particular the manual part of the conversion routines:
bytes_to_ucs4 and bytes_to_ucs2.

So if somebody can tell me a better/faster/correct way to do this:

std::vector<boost::uint32_t> bytes_to_ucs4(std::vector<char> const & bytes)
{
        //lyxerr << "Outbuf =" << std::hex;

        std::vector<boost::uint32_t> ucs4;
        for (size_t i = 0; i < bytes.size(); i += 4) {
                unsigned char const b1 = bytes[i    ];
                unsigned char const b2 = bytes[i + 1];
                unsigned char const b3 = bytes[i + 2];
                unsigned char const b4 = bytes[i + 3];

                boost::uint32_t c;
                char * cc = reinterpret_cast<char *>(&c);
                cc[3] = b1;
                cc[2] = b2;
                cc[1] = b3;
                cc[0] = b4;

                ucs4.push_back(c);
        }
        return ucs4;
}


and:

std::vector<unsigned short> bytes_to_ucs2(std::vector<char> const & bytes)
{
        //lyxerr << "Outbuf =" << std::hex;

        std::vector<unsigned short> ucs2;
        for (size_t i = 0; i < bytes.size(); i += 2) {
                unsigned char const b1 = bytes[i    ];
                unsigned char const b2 = bytes[i + 1];

                unsigned short c;
                char * cc = reinterpret_cast<char *>(&c);
                cc[0] = b1;
                cc[1] = b2;

                ucs2.push_back(c);
        }
        return ucs2;
}

That is, a way that does not have any endianness issues.


I am thinking of some clever use of unions etc., e.g. something like:
(type-punning ahead... beware)


std::vector<boost::uint32_t> bytes_to_ucs4(std::vector<char> const & bytes)
{
        std::vector<boost::uint32_t> ucs4;
        for (size_t i = 0; i < bytes.size(); i += 4) {
                union {
                        char cc[4];
                        boost::uint32_t c;
                } t;
                t.cc[0] = bytes[i    ];
                t.cc[1] = bytes[i + 1];
                t.cc[2] = bytes[i + 2];
                t.cc[3] = bytes[i + 3];

                ucs4.push_back(t.c);
        }
        return ucs4;
}

std::vector<unsigned short> bytes_to_ucs2(std::vector<char> const & bytes)
{
        std::vector<unsigned short> ucs2;
        for (size_t i = 0; i < bytes.size(); i += 2) {
                union {
                        char cc[2];
                        unsigned short c;
                } t;
                t.cc[0] = bytes[i    ];
                t.cc[1] = bytes[i + 1];
                ucs2.push_back(t.c);
        }
        return ucs2;
}


Or does that just have the same endianness problems?
(Or not work at all...)
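
For comparison, here is a minimal sketch of one way to avoid the
dependence on host byte order: build each value with shifts instead of
writing bytes into the integer's storage. Shifts work on values, not on
memory layout, so the result is the same on any host. The union variant
copies bytes in memory order, so its result still follows the host's
endianness. Assumptions: the input byte sequence is big-endian (most
significant byte first), the function names with the _be suffix are
hypothetical, and std::uint32_t from <cstdint> stands in for
boost::uint32_t only to keep the example self-contained.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical name. Assumes `bytes` holds big-endian UCS-4
// (most significant byte first).
std::vector<std::uint32_t> bytes_to_ucs4_be(std::vector<char> const & bytes)
{
        std::vector<std::uint32_t> ucs4;
        ucs4.reserve(bytes.size() / 4);
        // A trailing partial group is skipped, so we never read past the end.
        for (std::size_t i = 0; i + 3 < bytes.size(); i += 4) {
                // Go through unsigned char first to avoid sign extension.
                std::uint32_t const b1 = static_cast<unsigned char>(bytes[i    ]);
                std::uint32_t const b2 = static_cast<unsigned char>(bytes[i + 1]);
                std::uint32_t const b3 = static_cast<unsigned char>(bytes[i + 2]);
                std::uint32_t const b4 = static_cast<unsigned char>(bytes[i + 3]);

                ucs4.push_back((b1 << 24) | (b2 << 16) | (b3 << 8) | b4);
        }
        return ucs4;
}

// Hypothetical name. Assumes `bytes` holds big-endian UCS-2.
std::vector<std::uint16_t> bytes_to_ucs2_be(std::vector<char> const & bytes)
{
        std::vector<std::uint16_t> ucs2;
        ucs2.reserve(bytes.size() / 2);
        for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
                unsigned int const b1 = static_cast<unsigned char>(bytes[i    ]);
                unsigned int const b2 = static_cast<unsigned char>(bytes[i + 1]);

                ucs2.push_back(static_cast<std::uint16_t>((b1 << 8) | b2));
        }
        return ucs2;
}
```

If the input turns out to be little-endian instead, the same idea
applies with the shift amounts reversed.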


Also the ucsX_toucsY routines would need similar treatment.

I'll test this out later tonight or tomorrow.

Probably quite a few things in unicode.C could be simpler if this
union trick works as I hope it will.

-- 
        Lgb
