Re: Canonicalizing text parts to UTF-8 before applying body rules

2012-05-31 Thread David F. Skoll
On Thu, 31 May 2012 09:05:00 +0200 "Andrzej A. Filip" wrote: > a) Unicode itself may require canonicalization too. Perl's Encode module should take care of that. > b) some spammers do not declare encoding properly so some encoding > guessing would be handy Possibly, but probably not. Guessin

Re: Canonicalizing text parts to UTF-8 before applying body rules

2012-05-31 Thread Andrzej A. Filip
On 05/29/2012 09:58 PM, David F. Skoll wrote: > This idea is growing out of a thread I started in which someone pointed me > to https://issues.apache.org/SpamAssassin/show_bug.cgi?id=3062 > > Ignoring the locale under which SA runs and also ignoring the character > encoding of the message can make

Re: Canonicalizing text parts to UTF-8 before applying body rules

2012-05-30 Thread David F. Skoll
On Wed, 30 May 2012 08:26:44 -0700 jdow wrote: > I'm idly wondering what effect this would have on the time to scan a > single email. Actually converting from the original encoding to UTF-8 is very fast. Internally, Perl uses pretty fast C code to convert between character encodings. As for Uni

Re: Canonicalizing text parts to UTF-8 before applying body rules

2012-05-30 Thread Kevin A. McGrail
I'm idly wondering what effect this would have on the time to scan a single email. I'd suspect the time required would increase significantly if the user has a "bloody ridiculous (but effective) lot of rules", such as I use. I had the same thought but figured that we will have to improve th

Re: Canonicalizing text parts to UTF-8 before applying body rules

2012-05-30 Thread jdow
On 2012/05/29 13:18, Kevin A. McGrail wrote: On 5/29/2012 3:58 PM, David F. Skoll wrote: This idea is growing out of a thread I started in which someone pointed me to https://issues.apache.org/SpamAssassin/show_bug.cgi?id=3062 Ignoring the locale under which SA runs and also ignoring the charac

Re: Canonicalizing text parts to UTF-8 before applying body rules

2012-05-30 Thread David F. Skoll
On Wed, 30 May 2012 14:43:54 +0100 RW wrote: > UTF-8 won't work, it will need to be UTF-32 to be compatible with > sa-compile. From the re2c man page: Ah. Too bad. :( (I don't use sa-compile, so this is not a killer problem for me, but I can see how it could be for some people.) On Wed, 30 Ma

Re: Canonicalizing text parts to UTF-8 before applying body rules

2012-05-30 Thread Henrik K
On Wed, May 30, 2012 at 02:43:54PM +0100, RW wrote: > On Tue, 29 May 2012 15:58:21 -0400 > David F. Skoll wrote: > > > > I'm thinking of making something (a plugin, maybe?) that canonicalizes > > text/* parts to UTF-8 and lets you write rules using Unicode regexes. > > Something like: > > > Acc

Re: Canonicalizing text parts to UTF-8 before applying body rules

2012-05-30 Thread RW
On Tue, 29 May 2012 15:58:21 -0400 David F. Skoll wrote: > I'm thinking of making something (a plugin, maybe?) that canonicalizes > text/* parts to UTF-8 and lets you write rules using Unicode regexes. > Something like: > According to the perlunicode man page: > >Regular Expressions >

Re: Canonicalizing text parts to UTF-8 before applying body rules

2012-05-29 Thread Kevin A. McGrail
On 5/29/2012 3:58 PM, David F. Skoll wrote: This idea is growing out of a thread I started in which someone pointed me to https://issues.apache.org/SpamAssassin/show_bug.cgi?id=3062 Ignoring the locale under which SA runs and also ignoring the character encoding of the message can make body matc

Canonicalizing text parts to UTF-8 before applying body rules

2012-05-29 Thread David F. Skoll
Hi, This idea is growing out of a thread I started in which someone pointed me to https://issues.apache.org/SpamAssassin/show_bug.cgi?id=3062 Ignoring the locale under which SA runs and also ignoring the character encoding of the message can make body matching rules behave differently on differen
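As a rough illustration of the idea discussed in this thread (decode each text part from its declared charset, canonicalize the Unicode, then apply one Unicode-aware rule), here is a minimal sketch. It is written in Python rather than Perl, so it is not SpamAssassin code and does not use Perl's Encode module or any SA plugin API. The function names, the fallback behaviour and the sample rule are all hypothetical, chosen only for the example.

```python
import re
import unicodedata
from typing import Optional


def canonicalize_text_part(raw: bytes, declared_charset: Optional[str]) -> str:
    """Decode a text part with its declared charset and return NFC-normalized text.

    Hypothetical helper: no charset guessing is attempted. If the label is
    missing, unknown, or wrong, fall back to a lossy UTF-8 decode.
    """
    charset = declared_charset or "us-ascii"
    try:
        text = raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name or bytes that do not fit the declared charset.
        text = raw.decode("utf-8", errors="replace")
    # Canonicalize so precomposed and combining forms compare equal.
    return unicodedata.normalize("NFC", text)


# Hypothetical "body rule": a single Unicode regex applied to canonical text.
SAMPLE_RULE = re.compile("\\bcaf\u00e9\\b", re.IGNORECASE)


def rule_hits(raw: bytes, declared_charset: Optional[str]) -> bool:
    """Return True if the sample rule matches the canonicalized part."""
    return SAMPLE_RULE.search(canonicalize_text_part(raw, declared_charset)) is not None


if __name__ == "__main__":
    latin1_part = b"Caf\xe9 menu"                        # ISO-8859-1 bytes
    utf8_combining = "Cafe\u0301 menu".encode("utf-8")   # e + combining acute
    print(rule_hits(latin1_part, "iso-8859-1"))
    print(rule_hits(utf8_combining, "utf-8"))
```

The point of the sketch is that one rule can match the same word whether it arrived as ISO-8859-1 or as UTF-8 with combining characters, which is the locale and encoding independence the proposal is after. It says nothing about the sa-compile compatibility or scan-time concerns raised in the replies.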