I wrote a script that takes a list of words with UTF-8 characters, and generates rules matching them:
http://chaosreigns.com/code/dl/sawordrule.pl

For example:

$ echo "análisis" | perl ./sawordrule.pl SPANISH_
body SPANISH_ANALISIS /\ban[\x{C1}\x{E1}]lisis\b/i # análisis

(The two characters per UTF-8 character are the upper and lower case
characters, because /i apparently doesn't apply to these.)

For a bigger example:

cat spanish.txt | tr -d ',;.:"()-' | tr ' ' '\n' | sort -f | uniq -i | ./sawordrule.pl SPANISH_ > spanish.cf

A couple of untested results:

http://www.chaosreigns.com/sa/spanish.cf
http://www.chaosreigns.com/sa/polish.cf

To be clear, these files will likely flag ALL Polish or Spanish emails as
spam. By default, rules have a score of 1, so without a corresponding
"score" line, each of these has a score of 1.

The output is going to include some garbage rules you're going to need to
manually delete. It's also probably going to include occasional rules
which will match English words. I'm sure I missed a couple of these in
the .cf files I provided.

To use the .cf files, add something like this to your local.cf:

include /etc/spamassassin/spanish.cf
include /etc/spamassassin/polish.cf

On 09/26, John Hardin wrote:
> On Fri, 26 Sep 2014, dar...@chaosreigns.com wrote:
>
> >I created some rules to match Polish text:
> >http://www.chaosreigns.com/sa/polish.txt
> >
> >The rules with only ascii characters work, the ones with utf8 characters
> >don't. According to hexedit, they're identical in my maildir and in my
> >/etc/spamassassin/local.cf.
>
> Put the hex strings for the accented characters into the RE.
>
> I've had the best reliability from placing each byte in its own
> character class: [\xd0][\x80]

Thanks.
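
For anyone who wants to experiment without the script, here is a rough Python sketch of the kind of rule generation described above. It is not the contents of sawordrule.pl. The function names (rule_name, codepoint_pattern, byte_class_pattern, make_rule) are made up for this example, and the way accents are stripped for the rule name is an assumption based only on the SPANISH_ANALISIS output shown earlier. It also shows the byte-per-character-class form John suggested, under the assumption that the message body is matched as raw UTF-8 bytes.

```python
import re
import unicodedata


def case_forms(ch):
    """Return the single-character upper/lower forms of ch, sorted by code point."""
    forms = {ch}
    for candidate in (ch.upper(), ch.lower()):
        # Skip case mappings that expand to several characters (e.g. "ß" -> "SS").
        if len(candidate) == 1:
            forms.add(candidate)
    return sorted(forms)


def rule_name(prefix, word):
    """Build an ASCII rule name, e.g. SPANISH_ + "análisis" -> SPANISH_ANALISIS.

    Assumption: accents are dropped by decomposing the word (NFKD) and
    discarding whatever is not ASCII. Letters with no decomposition
    (such as Polish "ł") are simply lost from the name.
    """
    ascii_word = (
        unicodedata.normalize("NFKD", word).encode("ascii", "ignore").decode("ascii")
    )
    return prefix + re.sub(r"[^A-Za-z0-9]", "_", ascii_word).upper()


def codepoint_pattern(word):
    """Replace each non-ASCII character with a class of \\x{..} code points."""
    parts = []
    for ch in word:
        if ord(ch) < 128:
            parts.append(re.escape(ch))
        else:
            escapes = "".join("\\x{%X}" % ord(c) for c in case_forms(ch))
            parts.append("[" + escapes + "]")
    return "".join(parts)


def byte_class_pattern(word):
    """Replace each non-ASCII character with its UTF-8 bytes, one class per byte."""
    parts = []
    for ch in word:
        if ord(ch) < 128:
            parts.append(re.escape(ch))
        else:
            alternatives = [
                "".join("[\\x%02x]" % b for b in form.encode("utf-8"))
                for form in case_forms(ch)
            ]
            parts.append("(?:" + "|".join(alternatives) + ")")
    return "".join(parts)


def make_rule(prefix, word, pattern_func=codepoint_pattern):
    """Format one SpamAssassin body rule line for a word."""
    return "body %s /\\b%s\\b/i # %s" % (
        rule_name(prefix, word),
        pattern_func(word),
        word,
    )


if __name__ == "__main__":
    print(make_rule("SPANISH_", "análisis"))
    print(make_rule("SPANISH_", "análisis", byte_class_pattern))
```

The first line printed has the same shape as the sawordrule.pl output above. The second uses [\xc3][\x81] and [\xc3][\xa1] in place of the code point class. Which of the two forms actually matches will depend on how your SpamAssassin setup decodes message bodies, so test both against real mail before relying on either.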