Mumia W. wrote:

On 06/16/2007 02:29 PM, Tom Allison wrote:
I'm trying to run some regular expressions on strings in email. They could be in any encoding, but I can't tell because I don't have a UTF-8 (Unicode) xterm window that will show me anything. At best I get ?????a?? and other trash like that. I think this is typical for ASCII renderings of multi-byte characters.

Not to be deterred by the lack of anything this fancy in xterm, I thought I would plug along.

I made a character thus:
my $string = chr(0x263a);  # reported to be a smiley face...

Under 'use bytes' this prints as a ':'.
Without bytes it prints as something resembling an 'a', a little box, and a little circle.
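
The ':' is probably not arbitrary: 0x3a is the low byte of 0x263a, and under 'use bytes' chr appears to keep only the low eight bits of its argument. A small sketch of that arithmetic (it deliberately avoids the bytes pragma itself; the variable name is only illustrative):

```perl
use strict;
use warnings;

# Keep only the low eight bits of the code point 0x263a.
my $low_byte = 0x263a & 0xFF;    # 0x3a

# chr(0x3a) is the ASCII colon.
printf "0x%02x => '%s'\n", $low_byte, chr($low_byte);
```

If that reading is right, the ':' is simply a truncated code point and carries no useful information about the original character.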


And with unicode and locales and bytes it all gets extremely ugly.


I found something that SpamAssassin uses to convert all this "goo" into a repeatable set of characters (which is all I'm really after) by running something that looks like this:


What do you mean by a "repeatable set of characters"? Unicode characters are repeatable.

The fundamental problem is that this:

$string =~ /(\w\w\w+)/
returns nothing because Unicode/UTF-8/Big5 characters are not considered 'word' characters.
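
One likely cause is that the string still holds raw encoded bytes rather than decoded characters, in which case \w only sees individual bytes. A minimal sketch, assuming the input is UTF-8 (for Big5 or another charset the name passed to decode would change); the helper name words_in and the sample text are hypothetical:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode raw bytes into Perl characters, then
# return every run of three or more word characters.
sub words_in {
    my ($bytes, $charset) = @_;
    my $text = decode($charset, $bytes);    # bytes -> characters
    return $text =~ /(\w\w\w+)/g;
}

# Two words with accented letters, given as UTF-8 bytes.
my $bytes = "na\xc3\xafve caf\xc3\xa9";
my @words = words_in($bytes, 'UTF-8');

binmode(STDOUT, ':encoding(UTF-8)');
print "$_\n" for @words;
```

Once the text is decoded, \w matches letters outside ASCII as well, so the same pattern can be kept unchanged.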

And I don't really care to get exactly the right character.
I could just as easily use the characters' numeric values, but the regex for that is not something I'm familiar with.

I got this far:
my $string = chr(0x263a);
my @A = unpack "C*", $string;

# @A = ( 226, 152, 186 )

At least this is consistent.
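
What unpack "C*" returns for a string containing characters above 255 has changed between Perl versions, so the result above may not be portable. A sketch that asks for the bytes explicitly instead; choosing UTF-8 as the target encoding is an assumption, and utf8_bytes is a hypothetical name:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode the characters as UTF-8 first, then
# unpack the resulting byte string into a list of byte values.
sub utf8_bytes {
    my ($string) = @_;
    return unpack "C*", encode('UTF-8', $string);
}

my @A = utf8_bytes(chr(0x263a));
print "@A\n";    # 226 152 186
```

Encoding first makes the byte list depend only on the characters and the chosen encoding, not on how Perl happens to store the string internally.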
But there are a lot of characters that I want to break on and I don't know that I can do this. The best I can come up with is:

my $string = chr(0x263a);
$string = $string .' '. $string;
print $string,"\n";
foreach my $str (split / / ,$string) {
    my @A = unpack "C*", $str;
    print "FOO: @A\n";
}
exit;

Using the above I can get a consistent array of characters, but I don't know if this will work for any character encoding. I guess this is part of my question/quandary.
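
One way to make the result independent of the source encoding is to decode first and then work with code points rather than bytes. A rough sketch, assuming the charset of each message part is known (for example from its MIME headers) and is a name Encode recognises; token_codepoints is a hypothetical helper:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode bytes in the given charset, split the
# text on whitespace, and describe each token by its code points.
sub token_codepoints {
    my ($bytes, $charset) = @_;
    my $text = decode($charset, $bytes);
    my @out;
    for my $token (split ' ', $text) {
        my @cps = map { sprintf 'U+%04X', ord($_) } split //, $token;
        push @out, "@cps";
    }
    return @out;
}

# Two smiley faces separated by a space, as UTF-8 bytes.
my $bytes = "\xe2\x98\xba \xe2\x98\xba";
print "FOO: $_\n" for token_codepoints($bytes, 'UTF-8');
```

The same character then yields the same code point whatever encoding it arrived in, which byte values alone cannot guarantee.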

One thing I'm not sure about is whether MIME::Parser is even decoding things sanely. I suspect it isn't, because I get '?' a lot.

I installed urxvt from my Debian installation [ :) ] and I get...

Wide character in print at unicode_capture.pl line 5.
☺
Wide character in print at unicode_capture.pl line 9.
☺ ☺
FOO: 226 152 186
FOO: 226 152 186

However it doesn't print the boxes, which is good.
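
The "Wide character in print" warning appears when a string containing characters above 255 is printed to a filehandle that has no encoding layer. A minimal sketch of one common way to avoid it, assuming the terminal expects UTF-8:

```perl
use strict;
use warnings;

# Assumption: the terminal (urxvt here) expects UTF-8 output.
# The encoding layer tells Perl how to turn characters into bytes.
binmode(STDOUT, ':encoding(UTF-8)');

my $string = chr(0x263a);
print $string, "\n";    # no "Wide character" warning with the layer set
```

The bytes sent to the terminal are the same as before; the layer only makes the conversion explicit.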

