Hi,

This idea grew out of a thread I started, in which someone pointed me to
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=3062

SA currently ignores both the locale it runs under and the character
encoding of the message. As a result, body matching rules can behave
differently on different systems, and just plain incorrectly for some
messages.

I'm thinking of making something (a plugin, maybe?) that canonicalizes
text/* parts to UTF-8 and lets you write rules using Unicode regexes.
Something like:

    body_utf8 __DRUGS_MUSCLE1 /.. proper Unicode regex/...

According to the perlunicode man page:

    Regular Expressions
        The regular expression compiler produces polymorphic opcodes.
        That is, the pattern adapts to the data and automatically
        switches to the Unicode character scheme when presented with
        data that is internally encoded in UTF-8 -- or instead uses a
        traditional byte scheme when presented with byte data.

So, assuming we present it with proper UTF-8 data, the regexes should
Just Work.

I'm not sure how easy this will be, but I think it's worthwhile. In the
long run, I think all body rules should be body_utf8, and another rule
type should provide access to the body in its original encoding if that
is needed.

Comments? Suggestions?

Regards,

David.
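
P.S. To make the problem concrete, here is a rough sketch of what goes wrong with byte-level matching and what canonicalizing first would buy us. It is in Python rather than Perl purely for illustration, and it is not SA code: `canonicalize`, `BYTE_RULE`, and `UNICODE_RULE` are made-up names, the sample text is invented, and the Latin-1 fallback for bad charset labels is just one possible policy.

```python
import re

# Hypothetical rule written once, against decoded (Unicode) text.
UNICODE_RULE = re.compile("st[e\u00e9]ro[i\u00ef]des?", re.IGNORECASE)

# The "same" rule written against raw bytes. It silently assumes the
# message is ISO-8859-1, where e-acute is 0xE9 and i-diaeresis is 0xEF.
BYTE_RULE = re.compile(rb"st[e\xe9]ro[i\xef]des?", re.IGNORECASE)


def canonicalize(raw, charset):
    """Decode a text/* part using its declared charset.

    Assumption for this sketch: if the charset label is missing, unknown,
    or wrong, fall back to Latin-1, which maps every byte to a character
    and therefore never fails.
    """
    try:
        return raw.decode(charset or "latin-1")
    except (LookupError, UnicodeDecodeError):
        return raw.decode("latin-1")


# The same visible text, as two different senders might encode it.
text = "St\u00e9ro\u00efdes pas chers"
as_latin1 = text.encode("iso-8859-1")
as_utf8 = text.encode("utf-8")

# Byte-level matching depends on which encoding the sender picked.
byte_hits = [bool(BYTE_RULE.search(b)) for b in (as_latin1, as_utf8)]

# Matching after canonicalization gives the same answer for both.
unicode_hits = [
    bool(UNICODE_RULE.search(canonicalize(as_latin1, "iso-8859-1"))),
    bool(UNICODE_RULE.search(canonicalize(as_utf8, "utf-8"))),
]

print(byte_hits)     # [True, False]
print(unicode_hits)  # [True, True]
```

The byte rule misses the UTF-8 copy because e-acute arrives there as the two bytes 0xC3 0xA9, which the single-byte character class cannot match. A body_utf8 rule type would let the rule author stop caring about that.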