> The only problem with that is it means we'll be potentially altering the
> data as it comes in, which leads back to the problem of input and output
> files not matching for simple filter programs. (Plus it means we spend CPU
> cycles altering data that we might not actually need to)
>
I don't think it will be a serious problem in practice. Most encodings have
only one representation for each character, so if we convert iso-8859-1 to
Unicode (in whatever form) and convert it back, we get exactly the same
binary data. In most cases we are safe.
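
A quick sketch with the core Encode module illustrates that lossless round trip
(just an illustration of the point, not part of any proposed interface):

use Encode qw(decode encode);

my $bytes  = "caf\xE9";                      # "café" in iso-8859-1 bytes
my $string = decode('iso-8859-1', $bytes);   # bytes -> Perl's internal Unicode string
my $again  = encode('iso-8859-1', $string);  # and back to bytes
print $bytes eq $again ? "round trip ok\n" : "mismatch\n";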
However, if the input is already in a Unicode encoding form, say UTF-8 in NFC,
we may have trouble: the output file may differ from the input file.
But since any Unicode-compliant application should be able to handle the
different representation, it should not be a big problem either.
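
Here is a sketch of that problem case, using the core Encode and
Unicode::Normalize modules: NFC input that we normalize to NFD on the way in
will not re-encode to the original bytes.

use Encode qw(decode encode);
use Unicode::Normalize qw(NFD);

my $bytes = encode('utf-8', "\x{E9}");       # precomposed "é" (NFC) as UTF-8
my $norm  = NFD(decode('utf-8', $bytes));    # decomposed to "e" + COMBINING ACUTE ACCENT
my $out   = encode('utf-8', $norm);
print $bytes eq $out ? "identical\n" : "normalization changed the bytes\n";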
People who really do need to preserve the exact binary representation can use
some special code or a library to do so. Here is some pseudo code:
open my $FH, '<:raw', './name.dat'   or die $!;   # read-only binary
open my $F2, '>:raw', './output.dat' or die $!;   # write-only binary
while (<$FH>) {
    my $line = convert($_, "GB2312", "NFD");  # convert Chinese text to Unicode NFD
    if ($line =~ m/resume/3) {                # proposed flag: compare at Unicode level 3
        print {$F2} $_;                       # write the original bytes untouched
    }
}
The Unicode level 3 mentioned above is just my recommendation.
I believe Unicode defines several levels of comparison, including:
a) strict binary; b) ignore canonical equivalence; c) ignore case;
d) ignore character equivalence; e) ignore width/fonts; f) collation.
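
No such /3 flag exists in Perl today, but the core Encode and
Unicode::Normalize modules plus fc() (available from Perl 5.16) can roughly
approximate that kind of loose matching; this is only a sketch of the idea,
not the real comparison levels:

use Encode qw(decode);
use Unicode::Normalize qw(NFD);
use feature 'fc';                            # Unicode case folding, Perl 5.16+

open my $FH, '<:raw', './name.dat'   or die $!;
open my $F2, '>:raw', './output.dat' or die $!;
while (<$FH>) {
    my $line = NFD(decode('gb2312', $_));    # canonical decomposition after decoding
    if (fc($line) =~ /resume/) {             # case-folded, canonically normalized match
        print {$F2} $_;                      # output keeps the original input bytes
    }
}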
For myself, I prefer to normalize everything upon input conversion,
so that ord(), chr(), length(), and substr() give stable results. Otherwise
I don't think those functions will be very useful.
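
As a small illustration of why, length() already disagrees between the two
canonical forms of the same text (sketch using the core Unicode::Normalize
module):

use Unicode::Normalize qw(NFC NFD);

my $nfc = NFC("e\x{301}");                   # "é" as one precomposed character
my $nfd = NFD("\x{E9}");                     # "é" as "e" plus a combining accent
printf "NFC length: %d, NFD length: %d\n", length($nfc), length($nfd);   # 1 vs 2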
People who don't want to pay the cost of conversion can
process the data in binary mode, with some library support.
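
A minimal sketch of that binary-mode path, reusing the file names from the
pseudo code above: open both files with the :raw layer and match on the raw
bytes, paying no decoding or normalization cost at all.

open my $in,  '<:raw', './name.dat'   or die $!;
open my $out, '>:raw', './output.dat' or die $!;
while (<$in>) {
    print {$out} $_ if /resume/;             # plain byte-level match; no decoding, no normalization
}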
Hong