Re: [Patch] optimize ucs4 to local conversion

Abdelrazak Younes Wed, 16 May 2007 05:36:38 -0700

Georg Baum wrote:

Abdelrazak Younes wrote:

(besides it is unused).

Yes I know. But I planned to use it.

That would be conceptually wrong. If you convert a given UCS4 character
into an eightbit encoding you never know whether the result will be only
one character, not even in fixed width encodings. For example the single
byte fixed width encoding iso_8859-7 has two modifier letters: REVERSED
COMMA and APOSTROPHE. Therefore a single UCS4 character can result in two
iso_8859-7 characters.

If that is true, then we have a problem in Encoding::init() because we
only test the first 256 characters for fixed width encodings.

For that reason a ucs4_to_eightbit function that returns a single char
should not exist. But if you use a presized vector there should not be any
performance penalty, since of course the common case is that you get
exactly one character.
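
To make the presized-vector idea concrete, here is a minimal sketch, assuming a POSIX iconv. The name ucs4_char_to_bytes is hypothetical and is not the function from unicode.C; the code point is serialized as big-endian bytes so that "UCS-4BE" can be used regardless of host byte order. The buffer starts at one byte for the common case and grows only when iconv reports E2BIG, so results longer than 4 bytes are handled as well. Stateful encodings would additionally need a final flush call, which is left out here.

```cpp
#include <iconv.h>
#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not the LyX implementation): convert one UCS4
// code point to the given encoding and return however many bytes result.
std::vector<char> ucs4_char_to_bytes(char32_t c, const char * encoding)
{
	iconv_t cd = iconv_open(encoding, "UCS-4BE");
	if (cd == (iconv_t)(-1))
		throw std::runtime_error("iconv_open failed");

	// Big-endian serialization of the code point.
	char in[4] = {
		static_cast<char>((c >> 24) & 0xFF),
		static_cast<char>((c >> 16) & 0xFF),
		static_cast<char>((c >> 8) & 0xFF),
		static_cast<char>(c & 0xFF)
	};
	char * inp = in;
	std::size_t inleft = sizeof(in);

	std::vector<char> out(1); // presized for the common one-byte result
	std::size_t used = 0;
	for (;;) {
		char * outp = out.data() + used;
		std::size_t outleft = out.size() - used;
		std::size_t const res = iconv(cd, &inp, &inleft, &outp, &outleft);
		used = out.size() - outleft;
		if (res != static_cast<std::size_t>(-1))
			break;
		if (errno != E2BIG) {
			iconv_close(cd);
			throw std::runtime_error("character not convertible");
		}
		out.resize(out.size() * 2); // rare case: result needs more room
	}
	iconv_close(cd);
	out.resize(used);
	return out;
}
```

A caller that expects a single byte can check size() == 1 on the result and fall back to the general path otherwise.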


Then we have a problem, see above.

- The name of ucs4_to_multibytes is misleading: This function does
exactly the same as ucs4_to_eightbit, only optimized for one UCS4 char

Why misleading? This function is specifically designed for multibytes
encoding.


I guess you mean variable width encodings (and not two-byte encodings such
as utf16).


Yes, that's what I meant.

The term "eightbit" as used in unicode.C/h means both fixed and
variable width 8 bit singlebyte encodings, so this term is the right one
here. And ucs4_to_multibytes works as well for fixed width encodings.


Yes I know. It's just my tendency to optimize everything.

If you don't like the term "eightbit" feel free to change it to something
better, but please be consistent: There is no conceptual difference between
ucs4_to_eightbit and ucs4_to_multibytes, so they should have the same name.

In the context of what you said above I agree. I just wanted to speed up
and simplify the encodings that require only one byte in all cases, but
this is apparently impossible.

- ucs4_to_multibytes silently fails for exotic conversions that result in
more than 4 bytes.

I didn't know that such an encoding exists. I haven't found anything about
that in Wikipedia... That's why we need you Georg :-)


I believe that I once read about an encoding that needs more than 4 bytes
for one code point, but am not 100% sure. Since it does not cost anything
to support such a beast it should be supported IMHO.

OK.

So, I will think a bit more about this and try to find a correct
solution for 1.5.0. Right now, the simplest solution I can think of is
to generate the correspondence tables between ucs4 and the different
encodings using iconv and distribute them. Or generate them on first use
in Encoding::init().
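
A rough sketch of the second option (building the table at run time), assuming a POSIX iconv. The function build_table and its signature are made up for illustration and are not part of Encoding::init(). Each entry stores the full byte sequence, so a code point that maps to more than one byte is represented correctly. Code points the encoding cannot represent exactly are simply left out of the table.

```cpp
#include <iconv.h>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper: map every code point in [first, last] that the
// target encoding can represent to its encoded byte sequence.
std::map<char32_t, std::string>
build_table(const char * encoding, char32_t first, char32_t last)
{
	std::map<char32_t, std::string> table;
	iconv_t cd = iconv_open(encoding, "UCS-4BE");
	if (cd == (iconv_t)(-1))
		return table; // unknown encoding: empty table

	for (char32_t c = first; c <= last; ++c) {
		char in[4] = {
			static_cast<char>((c >> 24) & 0xFF),
			static_cast<char>((c >> 16) & 0xFF),
			static_cast<char>((c >> 8) & 0xFF),
			static_cast<char>(c & 0xFF)
		};
		char out[16]; // generous: far more than one character should need
		char * inp = in;
		std::size_t inleft = sizeof(in);
		char * outp = out;
		std::size_t outleft = sizeof(out);

		std::size_t const res = iconv(cd, &inp, &inleft, &outp, &outleft);
		if (res == 0 && inleft == 0) {
			// Exact conversion: remember all resulting bytes.
			table[c] = std::string(out, sizeof(out) - outleft);
		} else {
			// Failure or substitution: skip and reset the converter.
			iconv(cd, nullptr, nullptr, nullptr, nullptr);
		}
		if (c == last)
			break; // avoids wrap-around when last is the maximum value
	}
	iconv_close(cd);
	return table;
}
```

For instance, build_table("UTF-8", 0x20, 0x7E) should yield one single-byte entry per printable ASCII character. Whether such a table is generated once and distributed or built on first use is the design question raised above; the sketch only covers the generation step.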


Abdel.
