Re: UTF-8 Spam rules

2013-09-27 Thread Kevin A. McGrail
On 9/25/2013 11:15 PM, Karsten Bräckelmann wrote: On Fri, 2013-09-20 at 14:20 -0400, Kevin A. McGrail wrote: Anyone have some examples of rules designed to catch words by content in UTF-8 encoded messages? I'm doing some work on improving this. Right now, I'm just having problems with really p

Re: UTF-8 Spam rules

2013-09-27 Thread Kevin A. McGrail
On 9/20/2013 2:30 PM, David F. Skoll wrote: You won't like my answer, but... You really *have* to normalize everything to Unicode (possibly using UTF-8 as the canonical on-disk format) before trying to apply rules or extract Bayes tokens. Then you can do nice things like blocking CJK spams with

Re: UTF-8 Spam rules

2013-09-25 Thread Karsten Bräckelmann
On Fri, 2013-09-20 at 14:20 -0400, Kevin A. McGrail wrote: > > > Anyone have some examples of rules designed to catch words by content in > > > UTF-8 encoded messages? I'm doing some work on improving this. > Right now, I'm just having problems with really putting a nail in the > coffin of spams

Re: UTF-8 Spam rules

2013-09-20 Thread Kevin A. McGrail
On 9/19/2013 3:09 PM, Jay Sekora wrote: On 09/16/2013 10:12 AM, Kevin A. McGrail wrote: Anyone have some examples of rules designed to catch words by content in UTF-8 encoded messages? I'm doing some work on improving this. Are you trying to match UTF-8 encoded messages as a stream of bytes,

Re: UTF-8 Spam rules

2013-09-20 Thread David F. Skoll
On Fri, 20 Sep 2013 14:20:58 -0400 "Kevin A. McGrail" wrote: > As of yet, I'm not using normalize_charset and researching what hits > things the best. You won't like my answer, but... You really *have* to normalize everything to Unicode (possibly using UTF-8 as the canonical on-disk format) be

Re: UTF-8 Spam rules

2013-09-19 Thread Jay Sekora
On 09/16/2013 10:12 AM, Kevin A. McGrail wrote: Anyone have some examples of rules designed to catch words by content in UTF-8 encoded messages? I'm doing some work on improving this. Are you trying to match UTF-8 encoded messages as a stream of bytes, or are you using normalize_charset? (An
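
The distinction raised in this thread, matching the raw UTF-8 byte stream versus matching text that has first been normalized to Unicode, can be illustrated with a small sketch. This is Python rather than SpamAssassin rule syntax, and it is not how normalize_charset is implemented; the helper name `normalize`, the sample word, and the choice of NFKC folding are all assumptions made for illustration only.

```python
import re
import unicodedata


def normalize(raw: bytes, charset: str = "utf-8") -> str:
    """Hypothetical helper: decode bytes to Unicode, then fold
    compatibility forms (e.g. fullwidth letters) with NFKC."""
    text = raw.decode(charset, errors="replace")
    return unicodedata.normalize("NFKC", text)


# Sample body: the word "Viagra" written with fullwidth Latin letters,
# encoded as UTF-8 the way it would arrive on the wire.
raw = "\uff36\uff49\uff41\uff47\uff52\uff41".encode("utf-8")

# Rule applied to the raw byte stream: the ASCII pattern never appears,
# because each fullwidth letter is a three-byte UTF-8 sequence.
byte_rule = re.compile(rb"viagra", re.IGNORECASE)
byte_hit = byte_rule.search(raw) is not None

# Same rule applied after decoding and normalizing to Unicode text.
text_rule = re.compile(r"viagra", re.IGNORECASE)
text_hit = text_rule.search(normalize(raw)) is not None

print("byte-level match:", byte_hit)
print("normalized match:", text_hit)
```

In this sketch the byte-level rule misses the obfuscated word while the rule run on normalized text catches it, which is the kind of difference the question about bytes versus normalize_charset is getting at.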