Georg Baum wrote:
Abdelrazak Younes wrote:

(besides it is unused).
Yes I know. But I planned to use it.

That would be conceptually wrong. If you convert a given UCS4 character into an eightbit encoding you never know whether the result will be only
one character, not even in fixed width encodings. For example the single
byte fixed width encoding iso_8859-7 has two modifier letters: REVERSED
COMMA and APOSTROPHE. Therefore a single UCS4 character can result in two
iso_8859-7 characters.

If that is true, then we have a problem in Encoding::init(), because we only test the first 256 characters for fixed width encodings.

For that reason a ucs4_to_eightbit function that returns a single char
should not exist. But if you use a presized vector there should not be any
performance penalty, since of course the common case is that you get
exactly one character.
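
A minimal sketch of the presized-vector idea described above, not the actual unicode.C code. The name ucs4_to_bytes and its signature are hypothetical; only the POSIX iconv calls are real. The output buffer starts at one byte for the common case and grows only if iconv reports that it needs more room. Stateful encodings would additionally need a final flush call, which is omitted here.

```cpp
#include <iconv.h>
#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: converts one UCS4 code point to the target encoding
// and returns however many bytes iconv produces for it.
std::vector<char> ucs4_to_bytes(char32_t c, char const * encoding)
{
	iconv_t cd = iconv_open(encoding, "UTF-32LE");
	if (cd == (iconv_t)(-1))
		throw std::runtime_error("iconv_open failed");

	// Serialize as little endian so the host byte order does not matter.
	char in[4] = {
		static_cast<char>(c & 0xff),
		static_cast<char>((c >> 8) & 0xff),
		static_cast<char>((c >> 16) & 0xff),
		static_cast<char>((c >> 24) & 0xff)
	};
	char * inptr = in;
	size_t inleft = sizeof(in);

	// Presized for the common case of exactly one output byte.
	std::vector<char> out(1);
	size_t used = 0;
	for (;;) {
		char * outptr = out.data() + used;
		size_t outleft = out.size() - used;
		size_t const res = iconv(cd, &inptr, &inleft, &outptr, &outleft);
		used = out.size() - outleft;
		if (res != (size_t)(-1))
			break;
		if (errno != E2BIG) {
			// Not encodable in the target encoding, or invalid input.
			iconv_close(cd);
			throw std::runtime_error("iconv conversion failed");
		}
		// Output buffer too small: grow it and continue.
		out.resize(out.size() * 2);
	}
	iconv_close(cd);
	out.resize(used);
	return out;
}
```

Since the caller gets a vector, there is no built-in limit of one or four bytes per code point; the uncommon cases just cost a resize.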

Then we have a problem, see above.


- The name of ucs4_to_multibytes is misleading: This function does
exactly the same as ucs4_to_eightbit, only optimized for one UCS4 char
Why misleading? This function is specifically designed for multibyte
encodings.

I guess you mean variable width encodings (and not two-byte encodings such
as utf16).

Yes, that's what I meant.

The term "eightbit" as used in unicode.C/h means both fixed and
variable width 8 bit singlebyte encodings, so this term is the right one
here. And ucs4_to_multibytes works as well for fixed width encodings.

Yes I know. It's just my tendency to optimize everything.


If you don't like the term "eightbit" feel free to change it to something
better, but please be consistent: There is no conceptual difference between
ucs4_to_eightbit and ucs4_to_multibytes, so they should have the same name.

In the context of what you said above, I agree. I just wanted to speed up and simplify the encodings that require only one byte in all cases, but this is apparently impossible.

- ucs4_to_multibytes silently fails for exotic conversions that result in
more than 4 bytes.
I didn't know that such an encoding exists. I haven't found anything about
that in Wikipedia... That's why we need you Georg :-)

I believe that I once read about an encoding that needs more than 4 bytes
for one code point, but am not 100% sure. Since it does not cost anything
to support such a beast it should be supported IMHO.

OK.

So, I will think a bit more about this and try to find a correct solution for 1.5.0. Right now, the simplest solution I can think of is to generate the correspondence tables between ucs4 and the different encodings using iconv and distribute them. Or generate them on first use in Encoding::init().
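
A rough sketch of what generating such a table with iconv could look like for an 8-bit encoding. The function name build_table and the std::map layout are assumptions made for illustration, not existing LyX code. It decodes every byte value to UCS4 and records the reverse mapping, so it only covers code points that have a one-byte form; anything that needs several bytes would have to be handled separately.

```cpp
#include <iconv.h>
#include <cstddef>
#include <map>
#include <stdexcept>

// Hypothetical sketch: maps each UCS4 code point to the single byte that
// decodes to it in the given 8-bit encoding.
std::map<char32_t, unsigned char> build_table(char const * encoding)
{
	iconv_t cd = iconv_open("UTF-32LE", encoding);
	if (cd == (iconv_t)(-1))
		throw std::runtime_error("iconv_open failed");

	std::map<char32_t, unsigned char> table;
	for (int b = 0; b < 256; ++b) {
		char in = static_cast<char>(b);
		char out[4];
		char * inptr = &in;
		char * outptr = out;
		size_t inleft = 1;
		size_t outleft = sizeof(out);

		// Reset the conversion state before each byte.
		iconv(cd, nullptr, nullptr, nullptr, nullptr);
		if (iconv(cd, &inptr, &inleft, &outptr, &outleft) == (size_t)(-1))
			continue; // unassigned byte, incomplete sequence, or too much output
		if (outleft != 0)
			continue; // did not produce exactly one UCS4 code point

		// Reassemble the little endian UTF-32 code unit.
		char32_t const c =
			static_cast<char32_t>(static_cast<unsigned char>(out[0]))
			| static_cast<char32_t>(static_cast<unsigned char>(out[1])) << 8
			| static_cast<char32_t>(static_cast<unsigned char>(out[2])) << 16
			| static_cast<char32_t>(static_cast<unsigned char>(out[3])) << 24;
		table[c] = static_cast<unsigned char>(b);
	}
	iconv_close(cd);
	return table;
}
```

Whether the result is written to a file and distributed or built on first use, the lookup afterwards is a plain map access with no iconv call.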

Abdel.
