Re: character encoding & regex

Dr.Ruud Sun, 17 Jun 2007 07:54:31 -0700

Tom Allison wrote:

> I don't require actual character comparison, comparison of \{263a} is
> sufficient.


A Perl string contains characters (not octets).
The code point U+263A is represented by the character "\x{263a}".
Whether that takes 1, 2, 3, or even more octets inside the string
shouldn't matter. Read perlunitut.

If you first convert your data to proper UTF-8, the next steps are
far easier.
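
A minimal sketch of that point, assuming the input arrives as UTF-8
octets (the sample octets and variable names are made up for
illustration): once decoded, the string holds one character, and the
regex only needs "\x{263a}".

```perl
use strict;
use warnings;
use Encode qw(decode);

# Three raw octets: the UTF-8 encoding of U+263A (WHITE SMILING FACE).
my $octets = "\xE2\x98\xBA";

# Decode the octets into a Perl character string.
my $chars = decode('UTF-8', $octets);

my $octet_count = length $octets;   # length of the undecoded octet string
my $char_count  = length $chars;    # length of the decoded character string

# The regex sees a single character, whatever its internal representation.
my $matched = $chars =~ /\x{263a}/ ? 1 : 0;

printf "octets=%d chars=%d matched=%d\n",
    $octet_count, $char_count, $matched;
```

Matching against the undecoded $octets would fail, because it holds
three separate octets rather than the one character.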


> And it's rather difficult to determine in raw email what the correct
> charset is to use for each string.  I find that email sometimes
> passes multiple encodings in one message making it more difficult to
> pick apart.

There are plenty of tools available to do that for you. I have never
looked for one, but I wouldn't be surprised if someone has already done
exactly that: convert an e-mail message (including, of course, all encoded
header lines and all MIME parts) to a UTF-8 version.
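
As a rough sketch of the per-string decoding involved, using only the
core Encode module: the sample header values, the %part hash and the
variable names below are hypothetical, and the body octets are assumed
to have had their base64 or quoted-printable transfer encoding removed
already. A real message would first need a MIME parser to split it into
parts.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample data: two encoded header values and one body part,
# each declaring its own charset.
my $raw_subject = '=?UTF-8?B?4pi6?=';          # U+263A, base64 over UTF-8
my $raw_name    = '=?ISO-8859-1?Q?caf=E9?=';   # "caf" + e-acute, Latin-1
my %part        = (charset => 'ISO-8859-1', octets => "na\xEFve");

# RFC 2047 encoded-words carry their own charset; 'MIME-Header' uses it.
my $subject = decode('MIME-Header', $raw_subject);
my $name    = decode('MIME-Header', $raw_name);

# A body part is decoded with the charset named in its Content-Type.
my $body = decode($part{charset}, $part{octets});

# All three are character strings now, so one regex works on each.
my @hits = grep { /\x{263a}/ } ($subject, $name, $body);

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for $subject, $name, $body;
```

Because every string is decoded with its own declared charset, mixed
encodings within one message stop being a problem once the parts have
been separated.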


> As for SpamAssassin.  I'm trying to stay away from that because it's
> very large

http://wiki.apache.org/spamassassin/OutOfMemoryProblems
http://wiki.apache.org/spamassassin/SURBL

> and from a development perspective -- badly documented in
> the code.  Basically, SpamAssassin is capable for what it does, but I
> don't exactly want to do that.  Similar, yet, but not exactly.

Did you look into SA "custom rules"? I find them quite easy to use.
http://mywebpages.comcast.net/mkettler/sa/SA-rules-howto.txt
http://www.askdavetaylor.com/how_do_i_add_custom_spamassassin_rules_for_content_filtering.html
http://wiki.apache.org/spamassassin/CustomRulesets
http://www.rulesemporium.com/rules.htm

-- 
Affijn, Ruud

"Gewoon is een tijger."

