Canonicalizing text parts to UTF-8 before applying body rules

David F. Skoll Tue, 29 May 2012 12:58:52 -0700

Hi,

This idea grew out of a thread I started, in which someone pointed me
to https://issues.apache.org/SpamAssassin/show_bug.cgi?id=3062


Ignoring the locale under which SA runs and also ignoring the character
encoding of the message can make body matching rules behave differently
on different systems and just plain incorrectly for some messages.

I'm thinking of making something (a plugin, maybe?) that canonicalizes
text/* parts to UTF-8 and lets you write rules using Unicode regexes.
Something like:

body_utf8  __DRUGS_MUSCLE1 /.. proper Unicode regex/...
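
A minimal sketch of the canonicalization step, written in Python rather
than SA's Perl purely for illustration. The function name
canonical_text_parts is hypothetical, and the fallback choices (us-ascii
when no charset is declared, latin-1 for an unknown charset label,
replacement characters for bad bytes) are assumptions, not anything SA
does today:

```python
from email import message_from_bytes


def canonical_text_parts(raw_message):
    """Return each text/* part of a raw message as a decoded string.

    Hypothetical helper: the fallbacks below are assumptions.
    """
    msg = message_from_bytes(raw_message)
    parts = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        # decode=True undoes base64 / quoted-printable and yields bytes
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        # Assumed default when the part declares no charset
        charset = part.get_content_charset() or "us-ascii"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset label: latin-1 maps every byte to a character
            text = payload.decode("latin-1")
        parts.append(text)
    return parts
```

Each returned string is character data; calling .encode("utf-8") on it
gives the canonical UTF-8 bytes that a body_utf8 rule would see.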

According to the perlunicode man page:

   Regular Expressions
       The regular expression compiler produces polymorphic opcodes.  That
       is, the pattern adapts to the data and automatically switches to
       the Unicode character scheme when presented with data that is
       internally encoded in UTF-8 -- or instead uses a traditional byte
       scheme when presented with byte data.

so assuming we present it with proper UTF-8 data (decoded into Perl character
strings rather than left as raw octets), the regexes should Just Work.
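
To illustrate why the byte-versus-character distinction matters, here is
a small Python analogue. Python's re module does not switch schemes
automatically the way perlunicode describes; the pattern type is chosen
explicitly, which makes the difference easy to see. The sample word and
the variable names are made up for the example:

```python
import re

raw = "caf\u00e9".encode("utf-8")   # bytes on the wire: b'caf\xc3\xa9'
text = raw.decode("utf-8")          # canonical form: four characters

byte_rule = re.compile(rb"caf.")    # "." matches exactly one byte
char_rule = re.compile(r"caf.")     # "." matches exactly one character

# The accented letter is two bytes in UTF-8, so the byte pattern cannot
# cover the whole word, while the character pattern can.
byte_hit = byte_rule.fullmatch(raw) is not None
char_hit = char_rule.fullmatch(text) is not None
```

The same rule text gives different answers depending on whether it is
applied to undecoded bytes or to decoded characters, which is the
inconsistency canonicalizing first is meant to remove.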

I'm not sure how easy this will be, but I think it's worthwhile.
In the long run, I think all body rules should be body_utf8 and another
rule type should provide access to the body in its original encoding if that
is needed.

Comments?  Suggestions?

Regards,

David.
