> The only problem with that is it means we'll be potentially altering the
> data as it comes in, which leads back to the problem of input and output
> files not matching for simple filter programs. (Plus it means we spend CPU
> cycles altering data that we might not actually need to)
>
I don't think it will be a serious problem in practice. Most encodings have
only one representation for each character, so if we convert iso-8859-1 to
Unicode (in whatever form) and convert it back, we get exactly the same
binary data. In most cases we are safe.
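
A quick sketch with the core Encode module illustrates that lossless round trip
(just an illustration of the point, not part of any proposed interface):

use Encode qw(decode encode);

my $bytes  = "caf\xE9";                      # "café" in iso-8859-1 bytes
my $string = decode('iso-8859-1', $bytes);   # bytes -> Perl's internal Unicode string
my $again  = encode('iso-8859-1', $string);  # and back to bytes
print $bytes eq $again ? "round trip ok\n" : "mismatch\n";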
However, if the input is already in a Unicode encoding form, say UTF-8 in NFC,
we may have trouble: the output file may differ from the input file.
But since any Unicode-compliant application should be able to handle the
different representation, it should not be a big problem either.
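
Here is a sketch of that problem case, using the core Encode and
Unicode::Normalize modules: NFC input that we normalize to NFD on the way in
will not re-encode to the original bytes.

use Encode qw(decode encode);
use Unicode::Normalize qw(NFD);

my $bytes = encode('utf-8', "\x{E9}");       # precomposed "é" (NFC) as UTF-8
my $norm  = NFD(decode('utf-8', $bytes));    # decomposed to "e" + COMBINING ACUTE ACCENT
my $out   = encode('utf-8', $norm);
print $bytes eq $out ? "identical\n" : "normalization changed the bytes\n";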
People who really do need to preserve the exact binary representation can use
some special code or a library to do so. Here is some pseudo code:
open my $FH, '<:raw', './name.dat'   or die $!;   # read-only binary
open my $F2, '>:raw', './output.dat' or die $!;   # write-only binary
while (<$FH>) {
    my $line = convert($_, "GB2312", "NFD");  # convert Chinese text to Unicode NFD
    if ($line =~ m/resume/3) {                # proposed flag: compare at Unicode level 3
        print {$F2} $_;                       # write the original bytes untouched
    }
}
The Unicode level 3 mentioned above is just my recommendation.
I believe Unicode defines several levels of comparison, including:
a) strict binary; b) ignore canonical equivalence; c) ignore case;
d) ignore character equivalence; e) ignore width/fonts; f) collation.
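
No such /3 flag exists in Perl today, but the core Encode and
Unicode::Normalize modules plus fc() (available from Perl 5.16) can roughly
approximate that kind of loose matching; this is only a sketch of the idea,
not the real comparison levels:

use Encode qw(decode);
use Unicode::Normalize qw(NFD);
use feature 'fc';                            # Unicode case folding, Perl 5.16+

open my $FH, '<:raw', './name.dat'   or die $!;
open my $F2, '>:raw', './output.dat' or die $!;
while (<$FH>) {
    my $line = NFD(decode('gb2312', $_));    # canonical decomposition after decoding
    if (fc($line) =~ /resume/) {             # case-folded, canonically normalized match
        print {$F2} $_;                      # output keeps the original input bytes
    }
}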
For myself, I prefer to normalize everything upon input conversion,
so that ord(), chr(), length(), and substr() give stable results. Otherwise
I don't think those functions will be very useful.
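
As a small illustration of why, length() already disagrees between the two
canonical forms of the same text (sketch using the core Unicode::Normalize
module):

use Unicode::Normalize qw(NFC NFD);

my $nfc = NFC("e\x{301}");                   # "é" as one precomposed character
my $nfd = NFD("\x{E9}");                     # "é" as "e" plus a combining accent
printf "NFC length: %d, NFD length: %d\n", length($nfc), length($nfd);   # 1 vs 2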
People who don't want to pay the cost of conversion can
process the data in binary mode, with some library support.
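
A minimal sketch of that binary-mode path, reusing the file names from the
pseudo code above: open both files with the :raw layer and match on the raw
bytes, paying no decoding or normalization cost at all.

open my $in,  '<:raw', './name.dat'   or die $!;
open my $out, '>:raw', './output.dat' or die $!;
while (<$in>) {
    print {$out} $_ if /resume/;             # plain byte-level match; no decoding, no normalization
}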
Hong