Re: character encoding & regex

Mumia W. Sat, 16 Jun 2007 21:20:26 -0700

On 06/16/2007 05:01 PM, Tom Allison wrote:

Mumia W. wrote:
On 06/16/2007 02:29 PM, Tom Allison wrote:
I'm trying to do some regular expression on strings in email.[...]
And with unicode and locales and bytes it all gets extremely ugly.
I found something that SpamAssassin uses to convert all this "goo" into a repeatable set of characters (which is all I'm really after) by running something that looks like this:
What do you mean by a "repeatable set of characters"? Unicode characters are repeatable.
The fundamental problem is that this:

$string =~ /(\w\w\w+)/
returns nothing because unicode/utf8/Big5 characters are not considered 'words'.
[...]

Many UTF8 characters are words, and many are not. Consider this program (written in utf-8):


#!/usr/bin/perl
use strict;
use warnings;
use encoding 'utf8', 'STDOUT', 'utf8';

my $string2 = '☺ 膄 膅 膆 ☺
á é í ó ú ¶ | ✗ ∷ е み む も
ä ë ï ö ü µ  ± × ṁ · ';

my @wchars = $string2 =~ /(\w)/g;
print "@wchars\n";

exit;
__END__

My output for this program is this:

膄 膅 膆 á é í ó ú е み む も ä ë ï ö ü µ ṁ

Notice that some characters made it and some didn't. In order to do this right, I had to enable a utf8 locale in my Debian O/S [ :-) ]. Then I set LANG=en_US.UTF-8 before writing the program in vim.

Furthermore, I had to tell Perl that the program was written in utf8 using the 'encoding' module.

Basically, the '\w' in a regular expression is sensitive to the current locale, and if utf8 is enabled in the locale, '\w' will (probably) know which unicode characters are word characters and which are not.
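
For comparison, here is a minimal sketch of the same '\w' test on text that arrives as raw UTF-8 bytes, as it would from a mail message, rather than as literals in the source file. It assumes the core Encode module; the helper name word_chars and the sample bytes are made up for illustration. Decoding first means '\w' sees characters rather than bytes:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: takes raw UTF-8 bytes and returns the characters
# that \w accepts once the bytes have been decoded to Perl characters.
sub word_chars {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);
    return $text =~ /(\w)/g;
}

# "\xC3\xA9" is the UTF-8 encoding of e-acute (U+00E9).
# "\xE2\x98\xBA" is the smiley U+263A, a symbol and not a word character.
my @w = word_chars("\xC3\xA9 \xE2\x98\xBA z");

# Encode the characters again on the way out.
binmode(STDOUT, ':encoding(UTF-8)');
print "@w\n";
```

Here the smiley is dropped and the accented letter is kept, which is the same split as in the output above.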

BTW, I don't know Chinese or Korean. I just know how to play with vim digraphs enough to enter random foreign characters--sort of like a monkey banging on a computer keyboard :-)

And I don't really care to get exactly the right character.
I could just as easily use the character ascii values, but the regex for that is not something I'm familiar with.
I got this far:
my $string = chr(0x263a);
my @A = unpack "C*", $string;

# @A = ( 226, 152, 186 )

At least this is consistent.
But there are a lot of characters that I want to break on and I don't know that I can do this. The best I can come up with is:
my $string = chr(0x263a);
$string = $string .' '. $string;
print $string,"\n";
foreach my $str (split / / ,$string) {
    my @A = unpack "C*", $str;
    print "FOO: @A\n";
}
exit;
Using the above I can get a consistent array of characters but I don't know if this will work for any character encoding. I guess this is part of my question/quandary.
One thing I'm not sure about is if the MIME::Parser is even decoding things sanely. I suspect it isn't because I get '?' a lot.
I installed urxvt from my Debian installation [ :) ] and I get...

:-)

Wide character in print at unicode_capture.pl line 5.
âº
Wide character in print at unicode_capture.pl line 9.
âº âº
FOO: 226 152 186
FOO: 226 152 186

However it doesn't print the boxes, which is good.
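
Those three numbers are the UTF-8 encoding of U+263A (hex E2 98 BA). A minimal sketch, assuming the core Encode module, that makes the encoding step explicit instead of depending on how Perl happens to store the string internally (utf8_bytes is a hypothetical helper name):

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode a character string to UTF-8 and
# return the value of each resulting byte as a list of integers.
sub utf8_bytes {
    my ($chars) = @_;
    return unpack 'C*', encode('UTF-8', $chars);
}

my @bytes = utf8_bytes(chr(0x263a));
print "FOO: @bytes\n";    # prints "FOO: 226 152 186"
```

Passing a different encoding name to encode() gives the byte sequence for that encoding, so the numbers are only consistent as long as the encoding is.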

Put "use encoding 'iso-8859-1', STDOUT => 'utf8';" at the top of your file. Also read up on the encoding module (perldoc encoding).

This will probably work a lot better if you've configured your system to support a utf8 locale:


http://www.debian.org/doc/manuals/reference/ch-tune.en.html#s-activate-locales
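
If you would rather not use the 'encoding' module, another way to get rid of the "Wide character in print" warnings is to put an encoding layer on the output filehandle with binmode. This is only a sketch: printed_bytes is a hypothetical helper that uses an in-memory filehandle in place of STDOUT so that the bytes written can be inspected.

```perl
use strict;
use warnings;

# Hypothetical helper: print a character string through a UTF-8 output
# layer and return the bytes that were written. An in-memory filehandle
# stands in for STDOUT so the result can be checked.
sub printed_bytes {
    my ($chars) = @_;
    my $buffer = '';
    open my $fh, '>:encoding(UTF-8)', \$buffer
        or die "cannot open in-memory handle: $!";
    print {$fh} $chars;
    close $fh;
    return $buffer;
}

# In a real script the same layer goes on STDOUT:
binmode(STDOUT, ':encoding(UTF-8)');
print chr(0x263a), "\n";    # no "Wide character" warning
```

The layer only controls what is written out; the terminal still has to be set up to display utf8 for the character to show correctly.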

BTW, you're using a great O/S ;-)


