Mumia W. wrote:
On 06/16/2007 02:29 PM, Tom Allison wrote:
I'm trying to run some regular expressions on strings in email. They
could be encoded to something. But I can't tell because I don't have
a utf8 unicode xterm window that will show me anything. At best I get
?????a?? and other trash like that. I think this is typical for
ASCII text renderings of multi-byte characters.
Not to be deterred by the lack of anything this fancy in xterm I
thought I would plug along.
I made a character thus:
my $string = chr(0x263a); # reported to be a smiley face...
Under 'use bytes' this prints as a ':'.
Without 'use bytes' it prints something resembling an 'a', a little box,
and a little circle.
And with unicode and locales and bytes it all gets extremely ugly.
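A minimal sketch of the character/byte distinction behind those observations, assuming a Perl with the core Encode module. The variable names are illustrative only. It also shows that keeping just the low eight bits of 0x263a gives 0x3a, which is ':' and so is consistent with the 'use bytes' result described above.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $string = chr(0x263a);                  # one character, U+263A
my $octets = encode('UTF-8', $string);     # its UTF-8 encoding as a byte string

my $char_count = length $string;           # counts characters
my $byte_count = length $octets;           # counts bytes

# Keeping only the low eight bits of 0x263a leaves 0x3a, which is ':'.
my $low_byte = chr(0x263a & 0xFF);

printf "%d character, %d bytes, low byte '%s'\n",
    $char_count, $byte_count, $low_byte;
```

The same string is one character or three bytes depending on which view is taken, which is why mixing bytes, locales and Unicode tends to give confusing output.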
I found something that SpamAssassin uses to convert all this "goo"
into a repeatable set of characters (which is all I'm really after) by
running something that looks like this:
What do you mean by a "repeatable set of characters"? Unicode characters
are repeatable.
The fundamental problem is that this:
$string =~ /(\w\w\w+)/
returns nothing, apparently because the unicode/utf8/Big5 characters are not
being treated as 'word' characters.
And I don't really care to get exactly the right character.
I could just as easily use the characters' ASCII values, but I'm not familiar
with the regex for that.
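For what it's worth, here is a small sketch suggesting that \w can match non-ASCII letters once the raw bytes have been decoded into characters. It assumes the input is UTF-8; the sample data and the long_words helper are made up for illustration. Note that U+263A itself is a symbol rather than a letter, so it would not be expected to match \w in any case.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: every run of three or more word characters.
sub long_words {
    my ($text) = @_;
    return $text =~ /(\w\w\w+)/g;
}

# Illustrative UTF-8 bytes: "caf\x{e9} na\x{ef}ve" followed by U+263A.
my $raw = "caf\xC3\xA9 na\xC3\xAFve \xE2\x98\xBA";

# Decode the bytes into characters before matching.
my $text = decode('UTF-8', $raw);

my @words = long_words($text);

binmode(STDOUT, ':encoding(UTF-8)');
print join('|', @words), "\n";
```

Here the two accented words are captured whole and the smiley is skipped.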
I got this far:
my $string = chr(0x263a);
my @A = unpack "C*", $string;
# @A = ( 226, 152, 186 )
At least this is consistent.
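One caveat: what unpack "C*" returns for a wide character may depend on the Perl version, since it can expose the internal representation. A sketch that encodes to UTF-8 explicitly first, so the byte values do not rely on that, might look like this (utf8_octets is a hypothetical helper name, and the core Encode module is assumed):

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the UTF-8 octets of a character string, as integers.
sub utf8_octets {
    my ($chars) = @_;
    return unpack 'C*', encode('UTF-8', $chars);
}

my @A = utf8_octets(chr(0x263a));
print "@A\n";    # 226 152 186
```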
But there are a lot of characters that I want to break on and I don't know that
I can do this. The best I can come up with is:
my $string = chr(0x263a);
$string = $string .' '. $string;
print $string,"\n";
foreach my $str (split / /, $string) {
    my @A = unpack "C*", $str;
    print "FOO: @A\n";
}
exit;
Using the above I can get a consistent array of characters, but I don't know
whether this will work for any character encoding. I guess this is part of my
question/quandary.
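On the "any character encoding" question, one possible approach is to decode each string using its declared charset and then re-encode everything as UTF-8, so the same character always ends up as the same bytes. The sketch below assumes the charset name is known (for mail it would presumably come from the MIME headers); normalized_octets is a hypothetical helper, and the sample bytes are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode $raw from $charset, then return the
# UTF-8 octets of the resulting characters as integers.
sub normalized_octets {
    my ($raw, $charset) = @_;
    my $chars = decode($charset, $raw);
    return unpack 'C*', encode('UTF-8', $chars);
}

# The same character (U+00E9) arriving in two different encodings.
my @from_latin1 = normalized_octets("\xE9",     'ISO-8859-1');
my @from_utf8   = normalized_octets("\xC3\xA9", 'UTF-8');

print "latin1: @from_latin1\n";    # 195 169
print "utf8:   @from_utf8\n";      # 195 169
```

Both inputs come out as the same byte sequence, which may be all that is needed for a repeatable representation.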
One thing I'm not sure about is if the MIME::Parser is even decoding things
sanely. I suspect it isn't because I get '?' a lot.
I installed urxvt from my Debian installation [ :) ] and I get...
Wide character in print at unicode_capture.pl line 5.
âº
Wide character in print at unicode_capture.pl line 9.
âº âº
FOO: 226 152 186
FOO: 226 152 186
However it doesn't print the boxes, which is good.
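The "Wide character in print" warnings usually mean a character above 255 was printed to a filehandle with no encoding layer. A minimal sketch of one way to avoid them, assuming the terminal expects UTF-8, is to set an encoding layer on the handle. The in-memory handle is only there to show which bytes get written.

```perl
use strict;
use warnings;

my $string = chr(0x263a);

# Tell Perl to encode characters as UTF-8 on the way out.
binmode(STDOUT, ':encoding(UTF-8)');
print $string, "\n";

# Same layer on an in-memory handle, to inspect the bytes produced.
my $buffer = '';
open my $fh, '>:encoding(UTF-8)', \$buffer or die "open: $!";
print {$fh} $string;
close $fh;

my @written = unpack 'C*', $buffer;
print "FOO: @written\n";    # FOO: 226 152 186
```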