Re: character encoding & regex

Tom Allison Sun, 17 Jun 2007 04:25:03 -0700


On Jun 17, 2007, at 6:14 AM, Dr.Ruud wrote:


> Tom Allison wrote:
>
>> I'm trying to run some regular expressions on strings in email. They
>> could be encoded in some charset.  But I can't tell because I don't have
>> a utf8 unicode xterm window that will show me anything.
>
> There are simpler ways to find out; see charnames and perlunitut.
> http://search.cpan.org/perldoc?charnames
> http://search.cpan.org/perldoc?perlunitut
>
> I would first convert to a common base, like UTF-8, before trying to
> match strings. Are you talking about raw mail messages? Consider
> SpamAssassin and custom rules.

I don't require actual character comparison; comparing against \x{263a}
is sufficient. And it's rather difficult to determine in raw email what
the correct charset is for each string. I find that email sometimes
carries multiple encodings in one message, making it more difficult to
pick apart.
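
To illustrate comparing by code point rather than by glyph, here is a
minimal sketch using the core Encode module: decode the raw bytes with
whatever charset the message declares, then match against \x{263a}. The
helper names to_chars and has_smiley, and the ISO-8859-1 fallback for a
missing or unknown charset, are assumptions made for this example.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Hypothetical helper: turn raw bytes into Perl characters using the
# charset named in the message. Falling back to ISO-8859-1 when the
# name is missing or unknown is an assumption, not a rule.
sub to_chars {
    my ($bytes, $charset) = @_;
    $charset = 'ISO-8859-1'
        unless defined $charset && find_encoding($charset);
    return decode($charset, $bytes);
}

# Hypothetical helper: true if the decoded text contains U+263A,
# whatever byte sequence was used to encode it on the wire.
sub has_smiley {
    my ($bytes, $charset) = @_;
    return to_chars($bytes, $charset) =~ /\x{263a}/ ? 1 : 0;
}
```

With this, has_smiley("\xE2\x98\xBA", 'UTF-8') and
has_smiley("\x26\x3A", 'UTF-16BE') both match, because the regex sees
the decoded character and not the bytes. No UTF-8 terminal is needed to
check the result.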

The point that I'm coming from is post MIME::Parser, which does a good
job of parsing out messages, but I'm not sure how to manage the
decoding in every case. It's hard to find good examples sometimes.
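
One possible shape for the decoding step, sketched under assumptions:
walk the parsed entity, and for each text/* leaf decode the body with
the charset from that part's own Content-Type header, so a message with
several encodings is handled part by part. The function names
text_parts and decoded_text_parts are hypothetical, as is the
ISO-8859-1 fallback; MIME::Parser comes from the MIME-tools
distribution on CPAN.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);
use MIME::Parser;

# Hypothetical helper: recursively collect the decoded text of every
# text/* leaf part of a MIME::Entity.
sub text_parts {
    my ($entity) = @_;

    # Multipart containers: descend into the children.
    my @parts = $entity->parts;
    return map { text_parts($_) } @parts if @parts;

    return () unless $entity->effective_type =~ m{^text/}i;
    my $body = $entity->bodyhandle or return ();

    # Charset declared on this part only. The Latin-1 fallback for a
    # missing or unknown charset is an assumption.
    my $charset = $entity->head->mime_attr('content-type.charset');
    $charset = 'ISO-8859-1'
        unless defined $charset && find_encoding($charset);

    # as_string yields bytes with the transfer encoding already removed;
    # decode() turns those bytes into Perl characters.
    return decode($charset, $body->as_string);
}

# Hypothetical wrapper: parse a raw message held in a string and return
# one character string per text part.
sub decoded_text_parts {
    my ($raw) = @_;
    my $parser = MIME::Parser->new;
    $parser->output_to_core(1);    # keep bodies in memory, no temp files
    return text_parts($parser->parse_data($raw));
}
```

Each returned string can then be matched with ordinary regexes such as
/\x{263a}/, independent of how the individual part was encoded.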

As for SpamAssassin, I'm trying to stay away from that because it's
very large and, from a development perspective, badly documented in
the code. Basically, SpamAssassin is capable at what it does, but I
don't exactly want to do that. Similar, yes, but not exactly.



