On 06/16/2007 05:01 PM, Tom Allison wrote:
Mumia W. wrote:
On 06/16/2007 02:29 PM, Tom Allison wrote:
I'm trying to run some regular expressions against strings in email.
[...]
And with unicode and locales and bytes it all gets extremely ugly.
I found something that SpamAssassin uses to convert all this "goo"
into a repeatable set of characters (which is all I'm really after)
by running something that looks like this:
What do you mean by a "repeatable set of characters"? Unicode
characters are repeatable.
The fundamental problem is that this:
$string =~ /(\w\w\w+)/
returns nothing because unicode/utf8/Big5 characters are not considered
'words'.
[...]
Many UTF-8 characters are word characters, and many are not. Consider
this program (written in utf-8):
#!/usr/bin/perl
use strict;
use warnings;
use encoding 'utf8', 'STDOUT', 'utf8';
my $string2 = '☺ 膄 膅 膆 ☺
á é í ó ú ¶ | ✗ ∷ е み む も
ä ë ï ö ü µ ± × ṁ · ';
my @wchars = $string2 =~ /(\w)/g;
print "@wchars\n";
exit;
__END__
My output for this program is this:
膄 膅 膆 á é í ó ú е み む も ä ë ï ö ü µ ṁ
Notice that some characters made it and some didn't. In order to do this
right, I had to enable a utf8 locale in my Debian O/S [ :-) ]. Then I
set LANG=en_US.UTF-8 before writing the program in vim.
Furthermore, I had to tell Perl that the program was written in utf8
using the 'encoding' module.
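Basically, the '\w' in a regular expression follows Unicode rules once
Perl knows the string holds characters rather than raw bytes (which is
what the 'encoding' module arranges here). It only consults the current
locale under 'use locale'. The utf8 locale matters for the terminal and
the editor, so that the characters go in and come out correctly.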
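Here is a minimal sketch of the same idea without the 'encoding' pragma,
which later Perl releases deprecated: decode the raw bytes into
characters first, then apply \w. The sample bytes and the variable names
are made up for illustration; decode comes from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: raw UTF-8 octets as they might arrive in a mail
# body ("cafe" with an acute e, a smiley U+263A, three CJK ideographs).
my $octets = "caf\xC3\xA9 \xE2\x98\xBA \xE8\x86\x84\xE8\x86\x85\xE8\x86\x86";

# On the undecoded bytes, \w normally matches only the ASCII part.
my @from_bytes = $octets =~ /(\w\w\w+)/g;

# After decoding, Perl sees characters and \w uses Unicode rules.
my $text = decode('UTF-8', $octets);
my @from_chars = $text =~ /(\w\w\w+)/g;

# Encode on the way out so print does not warn about wide characters.
binmode(STDOUT, ':encoding(UTF-8)');
print "bytes: @from_bytes\n";
print "chars: @from_chars\n";
```

The smiley is not a word character, so it separates the two matches
instead of joining them.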
BTW, I don't know Chinese or Korean. I just know how to play with vim
digraphs enough to enter random foreign characters--sort of like a
monkey banging on a computer keyboard :-)
And I don't really care to get exactly the right character.
I could just as easily use the characters' ASCII values, but the regex
for that is not something I'm familiar with.
I got this far:
my $string = chr(0x263a);
my @A = unpack "C*", $string;
# @A = ( 226, 152, 186 )
At least this is consistent.
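A sketch of getting those byte values explicitly: encode the character
string to UTF-8 first, then unpack the result. This assumes UTF-8 is the
byte form you want to see, and utf8_octets is a hypothetical helper name.
As far as I know, newer Perls changed how unpack treats character
strings, so going through Encode is probably the more portable route.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the UTF-8 octets of a character string
# as a list of integers.
sub utf8_octets {
    my ($chars) = @_;
    return unpack 'C*', encode('UTF-8', $chars);
}

my @A = utf8_octets(chr(0x263a));
print "@A\n";    # 226 152 186
```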
But there are a lot of characters that I want to break on and I don't
know that I can do this. The best I can come up with is:
my $string = chr(0x263a);
$string = $string .' '. $string;
print $string,"\n";
foreach my $str (split / /, $string) {
    my @A = unpack "C*", $str;
    print "FOO: @A\n";
}
exit;
Using the above I can get a consistent array of characters but I don't
know if this will work for any character encoding. I guess this is part
of my question/quandary.
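One way to make this independent of the incoming encoding, sketched
under the assumption that the charset name is known (for mail it would
typically come from the Content-Type header): decode to characters,
split on non-word characters, then encode each token to UTF-8 octets.
The helper name tokens_as_octets and the sample inputs are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn raw bytes in a known charset into a list of
# tokens, each rendered as space-separated UTF-8 octet values.
sub tokens_as_octets {
    my ($charset, $bytes) = @_;
    my $text = decode($charset, $bytes);

    # Break on any run of non-word characters; drop empty fields.
    my @tokens = grep { length } split /\W+/, $text;

    my @out;
    for my $token (@tokens) {
        push @out, join ' ', unpack 'C*', encode('UTF-8', $token);
    }
    return @out;
}

# The same text in two encodings yields the same octet tokens.
my @latin1 = tokens_as_octets('iso-8859-1', "caf\xE9 ol\xE9");
my @utf8   = tokens_as_octets('UTF-8',      "caf\xC3\xA9 ol\xC3\xA9");
print "FOO: $_\n" for @latin1;
```

The point of normalizing to one output encoding is that a token compares
equal no matter which charset the message used.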
One thing I'm not sure about is if the MIME::Parser is even decoding
things sanely. I suspect it isn't because I get '?' a lot.
I installed urxvt from my Debian installation [ :) ] and I get...
:-)
Wide character in print at unicode_capture.pl line 5.
âº
Wide character in print at unicode_capture.pl line 9.
âº âº
FOO: 226 152 186
FOO: 226 152 186
However it doesn't print the boxes, which is good.
Put "use encoding 'iso-8859-1', STDOUT => 'utf8';" at the top of your
file. Also read up on the encoding module (perldoc encoding).
This will probably work a lot better if you've configured your system to
support a utf8 locale:
http://www.debian.org/doc/manuals/reference/ch-tune.en.html#s-activate-locales
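The "Wide character in print" warning usually means a character above
255 went to a filehandle that has no encoding layer. Here is a sketch of
a fix that does not depend on the 'encoding' pragma: attach a UTF-8
layer to the handle. The in-memory handle is only there to show which
bytes come out, and the variable names are hypothetical.

```perl
use strict;
use warnings;

my $smiley = chr(0x263a);

# Write through a UTF-8 layer into an in-memory buffer to inspect the
# bytes that the layer produces.
my $captured = '';
open(my $out, '>:encoding(UTF-8)', \$captured)
    or die "cannot open in-memory handle: $!";
print {$out} "$smiley\n";
close($out);

# The same layer on the real STDOUT avoids the warning.
binmode(STDOUT, ':encoding(UTF-8)');
print "$smiley\n";
print join(' ', unpack 'C*', $captured), "\n";
```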
BTW, you're using a great O/S ;-)