At 01:09 PM 3/27/2001 -0800, Hong Zhang wrote:
> > The only problem with that is it means we'll be potentially altering the
> > data as it comes in, which leads back to the problem of input and output
> > files not matching for simple filter programs. (Plus it means we spend CPU
> > cycles altering data that we might not actually need to)
> >
>
>I don't think it will be a serious problem in practice. Most encodings have
>only one representation for every character. So if we convert ISO-8859-1 to
>Unicode (whatever form) and convert it back, we will have exactly the same
>binary data. In most cases, we are safe.

For this stuff, I agree. We'll likely not even convert it to Unicode 
anyway, which pushes the problem off even further.
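
To make the round-trip claim concrete, here is a minimal sketch in Python
(chosen only for illustration; nothing here is Perl's actual I/O layer). It
assumes the input is ISO-8859-1, where every byte value maps to exactly one
code point:

```python
# Every byte value 0x00-0xFF maps to exactly one code point in ISO-8859-1,
# so decoding to Unicode and encoding back reproduces the input byte for byte.
original = bytes(range(256))

decoded = original.decode("iso-8859-1")       # bytes -> Unicode string
round_tripped = decoded.encode("iso-8859-1")  # Unicode string -> bytes

print(round_tripped == original)
```

The comparison holds because the mapping is one-to-one in both directions;
the conversion still costs CPU and memory even though it changes nothing.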

>However, if the input is in one of the Unicode encoding forms, say UTF-8 NFC,
>we may have trouble. The output file may be different from the input file.
>Any Unicode-compliant application should be able to handle the different
>representation, though, so it should not be a big problem either.

On the one hand I agree, especially since most folks are probably reading 
and writing Unicode data via some library or other that makes sure things 
are reasonably correct. (Or so I'd hope) On the other hand I'm not really 
fond of changing data for no reason--for the CPU and memory costs alone, if 
nothing else.
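
A rough sketch of how a "simple filter" that normalizes on input can emit
different bytes than it read. This is Python for illustration only;
`filter_normalizing` is a hypothetical helper, and the choice of NFC as the
target form is an assumption, not something decided in this thread:

```python
import unicodedata

def filter_normalizing(data: bytes, form: str = "NFC") -> bytes:
    """Hypothetical pass-through filter that normalizes its input on read."""
    text = data.decode("utf-8")
    return unicodedata.normalize(form, text).encode("utf-8")

# The same user-visible character in two valid UTF-8 representations.
decomposed = "e\u0301".encode("utf-8")  # 'e' + COMBINING ACUTE ACCENT
composed = "\u00e9".encode("utf-8")     # LATIN SMALL LETTER E WITH ACUTE

# Input already in the target form passes through unchanged.
unchanged = filter_normalizing(composed) == composed

# Decomposed input comes out composed, so the output no longer matches.
altered = filter_normalizing(decomposed) != decomposed

print(unchanged, altered)
```

Both inputs are legitimate Unicode text, but only one of them survives the
filter byte for byte, and the normalization pass is paid for either way.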

>For people who do need to keep the binary representation, they can use
>some special code or library to do so. Here is some pseudo code,

Well, sorta. If we can avoid that more often than not I'd be happy. 
Housekeeping code tends to get in the way of the bits that do the real work.

>For myself, I prefer to normalize everything upon input conversion.

Which is a fine and reasonable thing to want, but I don't think it should 
be the default, mainly for speed reasons.

>So we can have stable ord(), chr(), length(), substr(). Otherwise,
>I don't think those functions will be useful.

It would probably be crass of me to point out that you were arguing not 
that long ago that some of these functions were generally useless in the 
face of Unicode data, wouldn't it? :-)
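
For reference, a small sketch of why these functions are only stable on
normalized data. It uses Python's code-point-based `len()`, `ord()` and
slicing as stand-ins for length(), ord() and substr(); how Perl ends up
defining them is a separate question:

```python
import unicodedata

word = "caf\u00e9"  # "cafe" with an acute accent on the final letter
nfc = unicodedata.normalize("NFC", word)  # accent kept as one code point
nfd = unicodedata.normalize("NFD", word)  # accent split into 'e' + U+0301

# Same text to a reader, different answers from the basic string functions.
lengths = (len(nfc), len(nfd))
fourth = (ord(nfc[3]), ord(nfd[3]))
prefixes_match = nfc[:4] == nfd[:4]

print(lengths, [hex(cp) for cp in fourth], prefixes_match)
```

Without a fixed normalization form, the length, the fourth "character" and
the four-element prefix all depend on which representation arrived.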

                                        Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk
