UTF-8 rule generator script Re: UTF-8 rules, what am I missing?

darxus Fri, 26 Sep 2014 12:39:07 -0700

I wrote a script that takes a list of words with UTF-8 characters, and
generates rules matching them:

http://chaosreigns.com/code/dl/sawordrule.pl

For example:

$ echo "análisis" | perl ./sawordrule.pl SPANISH_
body SPANISH_ANALISIS /\ban[\x{C1}\x{E1}]lisis\b/i # análisis

(Each accented character becomes a two-character class holding its upper
and lower case forms, because /i apparently doesn't apply to these.)
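
For anyone who wants the idea without reading the Perl, here is a rough Python sketch of the same transformation. It is not sawordrule.pl itself: the function names (rule_name, rule_pattern, make_rule) are made up for illustration, and the way the accent is stripped for the rule name is an assumption based only on the example above. Characters that have no ASCII decomposition are simply dropped from the name.

```python
import re
import unicodedata


def rule_name(prefix, word):
    # Assumption: the rule name is the accent-stripped, upper-cased word.
    # Decompose accented letters, then keep only ASCII letters and digits.
    decomposed = unicodedata.normalize("NFKD", word)
    plain = "".join(c for c in decomposed if c.isascii() and c.isalnum())
    return prefix + plain.upper()


def rule_pattern(word):
    parts = []
    for ch in word:
        if ch.isascii():
            # Plain ASCII is left to the /i flag; escape any metacharacters.
            parts.append(re.escape(ch))
            continue
        # Collect the single-code-point upper and lower case forms.
        forms = {ch} | {v for v in (ch.upper(), ch.lower()) if len(v) == 1}
        codes = "".join("\\x{%X}" % ord(v) for v in sorted(forms))
        parts.append("[" + codes + "]")
    return "".join(parts)


def make_rule(prefix, word):
    # Same layout as the example: body NAME /\b...\b/i # original word
    return "body %s /\\b%s\\b/i # %s" % (
        rule_name(prefix, word), rule_pattern(word), word)
```

With this sketch, make_rule("SPANISH_", "análisis") should give the same rule line as the example above.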

For a bigger example:

$ cat spanish.txt | tr -d ',;.:"()-' | tr ' ' '\n' | sort -f | uniq -i |
    ./sawordrule.pl SPANISH_ > spanish.cf
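
The preprocessing in that pipeline (strip punctuation, split into words, drop case-insensitive duplicates) can also be sketched in Python. The function name unique_words is hypothetical, and unlike tr ' ' '\n' this version splits on any whitespace, so treat it as an approximation rather than an exact equivalent.

```python
def unique_words(text):
    # Delete the same punctuation characters that tr -d removes.
    table = str.maketrans("", "", ',;.:"()-')
    seen = {}
    for word in text.translate(table).split():
        # Keep the first spelling seen for each case-folded word.
        seen.setdefault(word.casefold(), word)
    # Sort ignoring case, roughly like sort -f.
    return sorted(seen.values(), key=str.casefold)
```

Each returned word could then be fed to the rule generator, one per line.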

A couple of untested results:
http://www.chaosreigns.com/sa/spanish.cf
http://www.chaosreigns.com/sa/polish.cf

To be clear, these files will likely flag ALL Polish or Spanish emails as
spam.

By default, rules have a score of 1, so without a corresponding "score"
line, each of these rules has a score of 1.
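
If you want something other than the default, each rule needs its own "score NAME value" line. A small Python sketch that prints one such line per generated body rule is below; the function name score_lines and the value 0.1 are placeholders for illustration, not a recommendation.

```python
def score_lines(cf_text, score=0.1):
    # Emit one "score NAME value" line for each "body NAME /.../" rule.
    lines = []
    for line in cf_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "body":
            lines.append("score %s %s" % (fields[1], score))
    return lines
```

The resulting lines can be appended to the same .cf file as the rules.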

The output is going to include some garbage rules you're going to need to
manually delete.  It's also probably going to include occasional rules
which will match English words.  I'm sure I missed a couple of these in the
.cf files I provided.

To use the .cf files, add something like this to your local.cf:

include /etc/spamassassin/spanish.cf
include /etc/spamassassin/polish.cf

On 09/26, John Hardin wrote:
> On Fri, 26 Sep 2014, dar...@chaosreigns.com wrote:
> 
> >I created some rules to match Polish text:
> >http://www.chaosreigns.com/sa/polish.txt
> >
> >The rules with only ascii characters work, the ones with utf8 characters
> >don't.  According to hexedit, they're identical in my maildir and in my
> >/etc/spamassassin/local.cf.
> 
> Put the hex strings for the accented characters into the RE.
> 
> I've had the best reliability from placing each byte in its own
> character class:  [\xd0][\x80]

Thanks.
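
For reference, John's one-class-per-byte form can be produced mechanically from the UTF-8 encoding of a character. This is a Python sketch under my own assumptions: byte_classes and either_case are hypothetical helper names, and the alternation used to cover both cases is one possible way to do it, not something John suggested.

```python
def byte_classes(text):
    # One single-byte character class per UTF-8 byte, e.g. [\xd0][\x80].
    return "".join("[\\x%02x]" % b for b in text.encode("utf-8"))


def either_case(ch):
    # Alternate between the distinct single-character case forms,
    # each written as per-byte classes.
    forms = sorted({ch} | {v for v in (ch.upper(), ch.lower()) if len(v) == 1})
    return "(?:" + "|".join(byte_classes(v) for v in forms) + ")"
```

Here byte_classes("\u0400") gives the [\xd0][\x80] from John's example.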
