Abdelrazak Younes wrote:

> I secretly hoped that you would comment ;-)

Well I saw your request and actually got a bit of time so I thought I'll
quickly write some comments. But don't expect testing or so, that would
take too long.

>> - char ucs4_to_eightbit(char_type ucs4, string const & encoding) does not
>> make sense at all. You cannot guarantee that the result is only one char
> 
> I can add an assertion for that. If we use this method with multibytes
> encoding then it will fail.

This is not enough, see below.

>> (besides it is unused).
> 
> Yes I know. But I planned to use it.

That would be conceptually wrong. If you convert a given UCS4 character 
into an eightbit encoding you never know whether the result will be only
one character, not even in fixed width encodings. For example the single
byte fixed width encoding iso_8859-7 has two modifier letters: REVERSED
COMMA and APOSTROPHE. Therefore a single UCS4 character can result in two
iso_8859-7 characters.

For that reason a ucs4_to_eightbit function that returns a single char
should not exist. But if you use a presized vector there should not be any
performance penalty, since of course the common case is that you get
exactly one character.

>> - The name of ucs4_to_multibytes is misleading: This function does
>> exactly the same as ucs4_to_eightbit, only optimized for one UCS4 char
> 
> Why misleading? This function is specifically designed for multibytes
> encoding.

I guess you mean variable width encodings (and not two-byte encodings such
as utf16). The term "eightbit" as used in unicode.C/h means both fixed and
variable width 8 bit singlebyte encodings, so this term is the right one
here. And ucs4_to_multibytes works as well for fixed width encodings.

If you don't like the term "eightbit" feel free to change it to something
better, but please be consistent: There is no conceptual difference between
ucs4_to_eightbit and ucs4_to_multibytes, so they should have the same name.

>> - Now there are two maps with ucs4 -> 8bit iconv processors, one is
>> enough and more efficient.
> 
> I know, I was about to remove the other template based ones once I
> finished the transition.

OK.

>> - ucs4_to_multibytes silently fails for exotic conversions that result in
>> more than 4 bytes.
> 
> I didn't know that such encoding exists. I haven't find anything about
> that in Wikipedia... That's why we need you Georg :-)

I believe that I once read about an encoding that needs more than 4 bytes
for one code point, but am not 100% sure. Since it does not cost anything
to support such a beast it should be supported IMHO.

>> - there is no reason not to use the optimized map lookup in
>> eightbit_to_ucs4, too.
> 
> I don't like the template based method because that too is unoptimized.

If it goes away then my changes do not make sense, but if it does not go
away then they don't slow down your optimized version at all.


Georg

Reply via email to