Hi,

From: Tomohiro KUBOTA <[EMAIL PROTECTED]>
Subject: Re: automatically-generated ISO-8859-1 characters in mulbibyte webpages
Date: Tue, 07 Jan 2003 21:29:24 +0900 (JST)

> Anyway, though I don't know such a module, your way can be very easily
> implemented. I think the easiest one is like following:
>
>   $name =~ s/([\x80-\xff])/"&#".ord($1).";"/eg;

I wrote a new filter which

 - assumes the input string is UTF-8 if it can be interpreted as such,
 - assumes it is ISO-8859-1 if not.

Since the UTF-8 encoding is relatively strict, a string intended as
ISO-8859-1 is unlikely to be mistaken for UTF-8. I confirmed that
people.names contains no 8-bit octet sequence that can be interpreted
as UTF-8. (An isolated 8-bit byte cannot be valid UTF-8; in UTF-8,
8-bit bytes always appear in runs of two or more.)

With this filter, my concern is completely solved. Also, you don't
need to think about future maintenance work when a new maintainer
uses 8-bit characters in his/her name.

#!/usr/bin/perl
use strict;
use warnings;

# Convert a byte string to ASCII with SGML numeric character references.
# The input is treated as UTF-8 if every 8-bit byte belongs to a UTF-8
# multibyte sequence, and as ISO-8859-1 otherwise.
sub from_utf8_or_iso88591_to_sgml ($) {
    my $str = $_[0];
    my $strsave = $str;
    if ($str !~ /[\x80-\xff]/) {
        # return ASCII string for less machine-time consumption.
        return $str;
    }
    # Replace 4-byte, then 3-byte, then 2-byte UTF-8 sequences.
    $str =~ s/([\xf0-\xf7])([\x80-\xbf])([\x80-\xbf])([\x80-\xbf])/
        "&#" . ((ord($1)&0x7)* 0x40000 + (ord($2)&0x3f)* 0x1000 +
                (ord($3)&0x3f)* 0x40 + (ord($4)&0x3f)) . ";"/eg;
    $str =~ s/([\xe0-\xef])([\x80-\xbf])([\x80-\xbf])/
        "&#" . ((ord($1)&0xf)* 0x1000 + (ord($2)&0x3f)* 0x40 +
                (ord($3)&0x3f)) . ";"/eg;
    $str =~ s/([\xc0-\xdf])([\x80-\xbf])/
        "&#" . ((ord($1)&0x1f)* 0x40 + (ord($2)&0x3f)) . ";"/eg;
    if ($str !~ /[\x80-\xff]/) {
        # $str is UTF-8 compliant, assume UTF-8.
        return $str;
    } else {
        # $str is not UTF-8 compliant, assume ISO-8859-1.
        $strsave =~ s/([\x80-\xff])/"&#".ord($1).";"/eg;
        return $strsave;
    }
}

while (<>) {
    chomp($_);
    # chomp removed the line terminator, so add it back on output.
    print from_utf8_or_iso88591_to_sgml($_), "\n";
}
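
The regexps above also accept overlong forms: for example, the bytes
\xc0\xaf would come out as &#47;. If that matters, the same "UTF-8 if
it parses, ISO-8859-1 otherwise" idea can be sketched with the Encode
module's strict decoder. This is only a minimal sketch, not a
replacement for the filter above; it assumes a Perl that ships Encode,
and the name bytes_to_sgml_strict is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of the original script).
sub bytes_to_sgml_strict {
    my ($bytes) = @_;
    # Pure ASCII needs no conversion.
    return $bytes if $bytes !~ /[\x80-\xff]/;

    # Strict decode: FB_CROAK makes decode() die on malformed input,
    # LEAVE_SRC keeps $bytes unmodified for the fallback below.
    my $chars = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };

    # Not valid UTF-8: assume ISO-8859-1, where each byte value is
    # already the character's code point.
    $chars = $bytes unless defined $chars;

    # Replace every non-ASCII character with a numeric reference.
    $chars =~ s/([^\x00-\x7f])/"&#" . ord($1) . ";"/eg;
    return $chars;
}

print bytes_to_sgml_strict("caf\xc3\xa9"), "\n";   # UTF-8 input: caf&#233;
print bytes_to_sgml_strict("caf\xe9"), "\n";       # ISO-8859-1 input: caf&#233;
```

With the strict decoder, an overlong sequence such as \xc0\xaf should
fail to decode and therefore be handled as two ISO-8859-1 characters.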