Raul Dias writes:
> On Wed, 2007-02-21 at 15:29 +0100, Justin Mason wrote:
> > actually I think this is already implemented in 3.2.0 -- see
> > http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4636 for details.
>
> Nice. This patch solves the message part problem.
>
> With this, rules can be written in Unicode too.
> A final change would be to let rules be written in other charsets.
>
> Rule files are read separately. An easy implementation would be to add a
> file_charset option. This option would declare the charset used by the
> rule file, like iso-8859-15, and the rules would be converted internally
> to Unicode too if and only if (IMO) the normalize_charset option is set
> to 1.

I think I prefer the current model, where rules are UTF-8, I'm afraid ;)

--j.

> -Raul Dias
>
> > --j.
> >
> > Raul Dias writes:
> > > On Fri, 2007-02-16 at 15:35 +0000, Justin Mason wrote:
> > > > Theo Van Dinter writes:
> > > > > I'm assuming that there will be a Google Summer of Code 2007
> > > > > going on, and that the ASF will be involved again. So it's a
> > > > > good time to start thinking about things we'd like to put up
> > > > > as possible projects.
> > > > >
> > > > > We still have a number of items from last year that we could
> > > > > use again.
> > > > > Anything else that we'd like people to code up?
> > >
> > > Another thing that might be worth adding to GSC2007.
> > >
> > > Internal Encoding/Charset used by SA.
> > >
> > > I haven't found anything like that, but that doesn't mean SA does
> > > not do this already. In that case, sorry :)
> > >
> > > Mail messages can have multiple encodings like ISO-8859-*, utf-8,
> > > utf-16, windows-*, and so on.
> > >
> > > Also, Perl (unless "use utf8" is set) will default to the system
> > > encoding, like LC_CTYPE.
> > >
> > > Rule writers need a way to tell SA which encoding their rules are in.
> > >
> > > This is not a real issue for English rules, but it is for other
> > > languages, like Portuguese, French, Russian, Chinese, Japanese and
> > > so on.
> > >
> > > The real problem is that a string with special characters in one
> > > encoding is not the same in another encoding.
> > >
> > > So, what is needed is:
> > > 1 - a way to tell SA the encoding/charset used in some rules
> > > 2 - SA converts the rules to a universal encoding internally
> > >     (e.g. utf-8/16).
> > > 3 - Temporarily reconvert to the message encoding/charset to match
> > >     properly.
> > >
> > > I really don't know if SA does something like this internally, but I
> > > think it does not.
> > > Doing this will require a considerable amount of work (so, GSC2007).
> > >
> > > Without this kind of support, I think it will be easier in the
> > > future for spammers to play with charsets to avoid specific rules.
> > >
> > > -Raul Dias
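
To make the encoding problem in the quoted proposal concrete, here is a
minimal sketch in Python (not SpamAssassin's Perl code, and not how bug
4636 is implemented). The function names are hypothetical, and
file_charset is only the option name proposed above, not an existing
setting. Note that the sketch decodes both the rule and the message to
Unicode before matching, rather than re-encoding the rule into the
message charset as step 3 suggests.

```python
import unicodedata


def to_unicode(raw: bytes, charset: str) -> str:
    """Decode bytes with the declared charset and normalize to NFC."""
    # errors="replace" keeps going on malformed input instead of raising.
    text = raw.decode(charset, errors="replace")
    return unicodedata.normalize("NFC", text)


def rule_matches(rule_raw: bytes, file_charset: str,
                 body_raw: bytes, body_charset: str) -> bool:
    """Hypothetical substring rule: compare in Unicode, not in raw bytes."""
    pattern = to_unicode(rule_raw, file_charset).lower()
    body = to_unicode(body_raw, body_charset).lower()
    return pattern in body


# The same Portuguese word produces different bytes in each charset.
word = "promoção"
rule_latin9 = word.encode("iso-8859-15")   # rule file saved as Latin-9
body_utf8 = ("Grande " + word + " hoje").encode("utf-8")  # UTF-8 message

# A raw byte comparison misses the word; the Unicode comparison finds it.
raw_hit = rule_latin9 in body_utf8
unicode_hit = rule_matches(rule_latin9, "iso-8859-15", body_utf8, "utf-8")
```

Here raw_hit is False and unicode_hit is True, which is the mismatch a
charset-aware rule loader would avoid. Keeping all rule files in UTF-8,
as preferred in the reply above, makes the file_charset argument
unnecessary because the rule side always decodes the same way.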