Re: character encoding & regex

Tom Allison Sat, 16 Jun 2007 15:01:18 -0700

Mumia W. wrote:

On 06/16/2007 02:29 PM, Tom Allison wrote:
I'm trying to do some regular expression on strings in email. They could be encoded to something. But I can't tell because I don't have a utf8 unicode xterm window that will show me anything. At best I get ?????a?? and other trash like that. I think this is typical for ascii text renderings of two-bit characters.
Not to be deterred by the lack of anything this fancy in xterm I thought I would plug along.
I made a character thus:
my $string = chr(0x263a);  # reported to be a smiley face...

under 'use bytes' this prints as a ':'
without bytes this prints to something resembling a, a little box, a little circle.
And with unicode and locales and bytes it all gets extremely ugly.
I found something that SpamAssassin uses to convert all this "goo" into a repeatable set of characters (which is all I'm really after) by running something that looks like this:
What do you mean by a "repeatable set of characters"? Unicode characters are repeatable.


The fundamental problem is that this:

$string =~ /(\w\w\w+)/
returns nothing because unicode/utf8/Big5 characters are not considered 'words'.
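
For what it's worth, here is a small sketch of how \w behaves once the data is a decoded character string. It assumes the input bytes are UTF-8 and uses the core Encode module; the sample strings and the helper name first_word are made up for illustration. U+263A is a symbol rather than a letter, so \w skips it even on a decoded string, while accented letters do match.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return the first run of 3+ word characters, or undef.
sub first_word {
    my ($text) = @_;
    return $text =~ /(\w\w\w+)/ ? $1 : undef;
}

# Raw UTF-8 bytes for "cafe" with an accented e, a space, and a smiley
# (assumed input encoding: UTF-8).
my $bytes = "caf\xC3\xA9 \xE2\x98\xBA";
my $chars = decode('UTF-8', $bytes);

# On the decoded string the accented e is a single word character,
# so the whole four-character word is captured.
my $word = first_word($chars);
printf "decoded match length: %d\n", length $word;

# The smiley is a symbol, not a word character, so \w never matches it.
my $smiley = first_word(chr(0x263a) x 3);
print "smiley matches: ", (defined $smiley ? "yes" : "no"), "\n";
```

So the lack of a match on the smiley alone would not by itself show that decoding failed; a test string containing letters is a better probe.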

And I don't really care to get exactly the right character.

I could just as easily use the character ascii values, but the regex for that is not something I'm familiar with.
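
On matching by character value: Perl regexes accept \x{...} escapes and ranges inside a character class, so a run of non-ASCII characters can be captured without knowing what they are. A rough sketch, assuming the string has already been decoded to characters; the name non_ascii_runs and the sample text are invented here.

```perl
use strict;
use warnings;

# Hypothetical helper: return every run of two or more non-ASCII characters.
sub non_ascii_runs {
    my ($text) = @_;
    return $text =~ /([^\x{00}-\x{7F}]{2,})/g;
}

# Two smileys, an ASCII word, then three CJK characters.
my $text = (chr(0x263a) x 2)
         . " plain "
         . join('', map { chr($_) } 0x4e2d, 0x6587, 0x5b57);

for my $run (non_ascii_runs($text)) {
    # Print code points rather than the characters themselves,
    # so the output is plain ASCII on any terminal.
    print join(' ', map { sprintf 'U+%04X', ord($_) } split //, $run), "\n";
}
```

The {2,} quantifier is only an example threshold; the original pattern asked for three or more.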


I got this far:
my $string = chr(0x263a);
my @A = unpack "C*", $string;

# @A = ( 226, 152, 186 )

At least this is consistent.
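
One caveat: the byte values above come from Perl's internal representation of the string, and as far as I know unpack "C*" on a character string does not behave the same way on every Perl version. A sketch that makes the encoding step explicit with the core Encode module, so the list is always the UTF-8 encoding of the character; utf8_octets is a made-up name.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the UTF-8 octets of a character string, as numbers.
sub utf8_octets {
    my ($chars) = @_;
    my $bytes = encode('UTF-8', $chars);   # explicit character -> byte step
    return unpack 'C*', $bytes;            # $bytes is a plain byte string
}

my @octets = utf8_octets(chr(0x263a));
print "@octets\n";   # 226 152 186, the same values as above
```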

But there are a lot of characters that I want to break on and I don't know that I can do this. The best I can come up with is:


my $string = chr(0x263a);
$string = $string .' '. $string;
print $string,"\n";
foreach my $str (split / / ,$string) {
    my @A = unpack "C*", $str;
    print "FOO: @A\n";
}
exit;

Using the above I can get a consistent array of characters but I don't know if this will work for any character encoding. I guess this is part of my question/quandary.
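
Whether the array stays consistent across encodings depends on decoding first, since the same character has different bytes in different encodings. A sketch, assuming the charset name is known (for mail it would normally come from the Content-Type header) and that Encode recognizes it. It uses ISO-8859-1 and UTF-8 because both are built into Encode; the same idea should carry over to Big5 if the name is supported. normalize_octets is an invented name.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode bytes from a named charset, then re-encode
# as UTF-8 so every source encoding yields the same octets per character.
sub normalize_octets {
    my ($charset, $bytes) = @_;
    my $chars = decode($charset, $bytes);
    return join ' ', unpack 'C*', encode('UTF-8', $chars);
}

# The same character (U+00E9, e with acute) as it arrives in two encodings.
my $as_latin1 = "\xE9";        # one byte in ISO-8859-1
my $as_utf8   = "\xC3\xA9";    # two bytes in UTF-8

print "latin1 -> ", normalize_octets('ISO-8859-1', $as_latin1), "\n";
print "utf-8  -> ", normalize_octets('UTF-8',      $as_utf8),   "\n";
```

Both lines print the same octets, which is the "repeatable" property wanted here. Without the decode step the two inputs would produce different arrays for the same character.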

One thing I'm not sure about is if the MIME::Parser is even decoding things sanely. I suspect it isn't because I get '?' a lot.


I installed urxvt from my Debian installation [ :) ] and I get...

Wide character in print at unicode_capture.pl line 5.
☺
Wide character in print at unicode_capture.pl line 9.
☺ ☺
FOO: 226 152 186
FOO: 226 152 186

However it doesn't print the boxes, which is good.
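
The "Wide character in print" warnings mean the script is printing characters above 255 to a handle that has no encoding layer. A sketch of the usual remedy, assuming the terminal expects UTF-8 (as urxvt does in a UTF-8 locale); a different layer name would be needed otherwise.

```perl
use strict;
use warnings;

# Assumption: the terminal expects UTF-8. This tells Perl to encode
# everything written to STDOUT accordingly.
binmode STDOUT, ':encoding(UTF-8)';

my $string = chr(0x263a);
$string = $string . ' ' . $string;

# With the layer in place this print no longer triggers the warning.
print $string, "\n";
```

The same layer can be given when opening a file, for example open(my $fh, '>:encoding(UTF-8)', $path), if the output goes somewhere other than the terminal.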

