Re: Canonicalizing text parts to UTF-8 before applying body rules

Kevin A. McGrail Tue, 29 May 2012 13:19:02 -0700

On 5/29/2012 3:58 PM, David F. Skoll wrote:

This idea is growing out of a thread I started in which someone pointed me
to https://issues.apache.org/SpamAssassin/show_bug.cgi?id=3062


Ignoring the locale under which SA runs and also ignoring the character
encoding of the message can make body matching rules behave differently
on different systems and just plain incorrectly for some messages.
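
The encoding problem can be shown with a minimal sketch. It is written in Python purely for illustration and is not SpamAssassin code; the sample word and the pattern are made up for this example. The same text encoded two ways yields different bytes, so a byte-oriented pattern written with one encoding in mind misses the other:

```python
import re

# The same word, serialized in two different character encodings.
word = "Médicaments"
latin1_bytes = word.encode("iso-8859-1")   # é is the single byte 0xE9
utf8_bytes = word.encode("utf-8")          # é is the two bytes 0xC3 0xA9

# A hypothetical byte-level rule written by someone assuming Latin-1 input.
byte_rule = re.compile(rb"m\xe9dicaments", re.IGNORECASE)

matches_latin1 = bool(byte_rule.search(latin1_bytes))
matches_utf8 = bool(byte_rule.search(utf8_bytes))

print("Latin-1 message matches:", matches_latin1)
print("UTF-8 message matches:  ", matches_utf8)
```

The rule fires on the Latin-1 bytes and stays silent on the UTF-8 bytes, even though a human reader sees the same word in both messages.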

I'm thinking of making something (a plugin, maybe?) that canonicalizes
text/* parts to UTF-8 and lets you write rules using Unicode regexes.
Something like:

body_utf8  __DRUGS_MUSCLE1 /.. proper Unicode regex/...

According to the perlunicode man page:

    Regular Expressions
        The regular expression compiler produces polymorphic opcodes.  That
        is, the pattern adapts to the data and automatically switches to
        the Unicode character scheme when presented with data that is
        internally encoded in UTF-8 -- or instead uses a traditional byte
        scheme when presented with byte data.

so assuming we present it with proper UTF-8 data, the regexes should Just Work.
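
A rough sketch of the canonicalization step follows, again in Python for illustration only and not as a proposed SpamAssassin implementation. The helper name `canonical_text_parts` and the sample rule are hypothetical. Each text/* part is decoded from its declared charset into a Unicode string, which plays the role that a decoded character string (rather than raw UTF-8 octets) would play in Perl, and a single Unicode pattern is then applied regardless of the original encoding:

```python
import re
from email import message_from_bytes
from email.mime.text import MIMEText


def canonical_text_parts(raw: bytes) -> list[str]:
    """Return every text/* part of a raw message, decoded to Unicode."""
    msg = message_from_bytes(raw)
    texts = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)  # undo base64 / quoted-printable
        if payload is None:
            continue
        # Assumption: fall back to us-ascii when no charset is declared.
        charset = part.get_content_charset() or "us-ascii"
        try:
            texts.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Unknown charset label: Latin-1 can decode any byte sequence.
            texts.append(payload.decode("latin-1"))
    return texts


# One hypothetical Unicode rule, independent of the wire encoding.
rule = re.compile(r"m[eé]dicaments", re.IGNORECASE)

results = {}
for charset in ("iso-8859-1", "utf-8"):
    raw = MIMEText("Médicaments pas chers", "plain", charset).as_bytes()
    results[charset] = any(rule.search(text) for text in canonical_text_parts(raw))

print(results)
```

In this sketch the rule matches both messages, because matching happens on decoded characters instead of on encoding-specific bytes. A real implementation would also need a policy for missing or mislabeled charsets, which is simplified here to a fixed fallback.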

I'm not sure how easy this will be, but I think it's worthwhile.
In the long run, I think all body rules should be body_utf8 and another
rule type should provide access to the body in its original encoding if that
is needed.

Comments?  Suggestions?

Your idea seems elegant to me.  I'd help support it in SA.

Regards,
KAM
