Re: Google Summer of Code 2007 ...

Raul Dias Wed, 21 Feb 2007 08:14:47 -0800

On Wed, 2007-02-21 at 15:29 +0100, Justin Mason wrote:
> actually I think this is already implemented in 3.2.0 -- see
> http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4636 for details.


Nice.  This patch solves the message part problem.

With this, rules can be written in Unicode too.
A final change would be to let rules be written in other charsets.

Rule files are read separately.  An easy implementation would be to add a
file_charset option.  This option would declare the charset used by the
rule file, such as iso-8859-15, and the file would be converted internally
to Unicode too if and only if (IMO) the normalize_charset option is set to 1.
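
To make the idea concrete, here is a rough sketch in Python (not
SpamAssassin code, which is Perl).  The function name load_rule_text and
the sample rule are made up for illustration; file_charset is the option
proposed above, and normalize_charset is assumed to simply gate the
conversion.

```python
def load_rule_text(raw, file_charset="utf-8", normalize_charset=True):
    """Hypothetical loader: decode rule-file bytes to Unicode text.

    When normalize_charset is off, the bytes are returned untouched,
    mirroring the "if and only if" condition proposed above.
    """
    if not normalize_charset:
        return raw
    return raw.decode(file_charset)


# The same (made-up) rule saved by two rule writers in different charsets.
rule = "body PT_PROMO /promoção/i"
latin = rule.encode("iso-8859-15")
utf8 = rule.encode("utf-8")

# On disk the two files differ byte for byte...
assert latin != utf8

# ...but once each is decoded with its declared charset they are identical.
assert load_rule_text(latin, "iso-8859-15") == load_rule_text(utf8, "utf-8")
```

The point is only that the declared charset is all that is needed to bring
every rule file to the same internal form before matching.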

-Raul Dias

> --j.
> 
> Raul Dias writes:
> > On Fri, 2007-02-16 at 15:35 +0000, Justin Mason wrote:
> > > Theo Van Dinter writes:
> > > > I'm assuming that there will be a Google Summer of Code 2007 going on, 
> > > > and
> > > > that the ASF will be involved again.  So it's a good time to start 
> > > > thinking
> > > > about things we'd like to put up as possible projects.
> > > > 
> > > > We still have a number of items from last year that we could use again.
> > > > Anything else that we'd like people to code up?
> > 
> > Another thing that might be worth adding to GSC2007.
> > 
> > Internal Encoding/Charset used by SA.
> > 
> > I haven't found anything like that, but that doesn't mean SA does not
> > do this already.  In that case, sorry :)
> > 
> > Mail messages can have multiple encodings like ISO-8859-*, utf-8,
> > utf-16, windows-*, and so on.
> > 
> > Also, Perl (unless "use utf8" is set) will default to the system
> > encoding, like LC_CTYPE.
> > 
> > Rule writers need a way to tell SA which encoding their rules are in.
> > 
> > This is not a real issue for English rules, but it is for other
> > languages, like Portuguese, French, Russian, Chinese, Japanese and so on.
> > 
> > The real problem is that a string in one encoding with special
> > characters is not the same in another encoding.
> > 
> > So, what is needed is:
> > 1 - a way to tell SA the encoding/charset used in some rules
> > 2 - SA converts the rules to a universal encoding internally
> >     (e.g. utf-8/16).
> > 3 - Temporarily reconvert to the message encoding/charset to match properly.
> > 
> > I really don't know if SA does something like this internally, but I
> > think it does not.
> > Doing this will require a considerable amount of work (so, GSC2007).
> > 
> > Without this kind of support, I think it will be easier in the future
> > for spammers to play with charsets to avoid specific rules.
> > 
> > -Raul Dias
