At 05:27 AM 12/20/03 +0800, Dan Jacobson wrote:

> Maybe spamassassin needs some parts that can be downloaded more often
> than the whole software, like Windows virus checkers' pattern files,
> but for the 'latest major spam patterns'.
>
> Of course this might already be what you are doing, I haven't kept track.

Unfortunately, that's not easily done.



Let me re-quote my 12/02/2003 post answering this same basic question...


For context, the person posting suggested having "sub-releases" with rule updates in between the main releases.

"Quick fix" type rules to patch things are really best dealt with by places like the custom rule emporium at www.exit0.us.

A good, well-tuned ruleset just takes WAYYY too long to be done in the virus scanner model.

-------------------

*sigh*

Horribly well intentioned, but vastly ill conceived. And also oft discussed in the past on this list.

I understand the desire to view SA as being like a virus scanner with signature updates every few days. Unfortunately, that's not possible in practical terms for the full-blown official ruleset. In fact, the reality of SA is that the existing version numbers and development work in the *exact opposite* way from what you suggest, because rule releases take an extraordinarily long time.

See, in reality, SA's ruleset works absolutely NOTHING like a virus scanner, and it's usually the ruleset of SA that holds back new releases because it takes so long to develop.

In a virus scanner, each signature operates independently of the other signatures. It's simply a "this is a virus" assertion. This allows you to add new signatures and update old ones without having to change every signature in the database. This feature facilitates "quick turn" updates to the signature database. By comparison, updates to the code take much longer, and are more-or-less independent of the signatures as well.

SpamAssassin, on the other hand, has a "score tally" system where large numbers of inter-related rules fire off and total up a score to determine if a message is spam or not. In this system each rule affects the proper score of *every* other rule in the ruleset. The whole system operates by trying to put as much spam and nonspam as possible each on the right side of the 5.0 mark. Adding rules, changing their hits, etc. tends to shift the "center line" and requires you to rebalance the entire system.
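To make the "center line" problem concrete, here is a minimal sketch of a score tally in Python. The rule names, scores, and the sample message are all invented for illustration; they are not real SpamAssassin rules, and this is not how SA itself is implemented. It only shows that a new rule whose score looks reasonable in isolation can push a previously clean message over the 5.0 mark.

```python
# Toy illustration only: rule names and scores are hypothetical.
SPAM_THRESHOLD = 5.0

def tally(hits, scores):
    """Sum the scores of every rule that fired on a message."""
    return sum(scores[rule] for rule in hits)

def is_spam(hits, scores, threshold=SPAM_THRESHOLD):
    """A message is flagged once its tally reaches the threshold."""
    return tally(hits, scores) >= threshold

scores = {"RULE_A": 2.5, "RULE_B": 2.0, "RULE_C": 1.0}

# A legitimate newsletter that happens to trip two rules: 2.0 + 1.0 = 3.0.
newsletter_hits = ["RULE_B", "RULE_C"]

# Add one new rule without rebalancing the existing scores.
new_scores = dict(scores, RULE_NEW=2.5)
newsletter_hits_after = newsletter_hits + ["RULE_NEW"]

print(is_spam(newsletter_hits, scores))            # tally 3.0, below 5.0
print(is_spam(newsletter_hits_after, new_scores))  # tally 5.5, now flagged
```

Fixing that false positive means lowering some other score, which in turn changes the tally of every other message those rules hit. That is the rebalancing problem.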

This "rebalancing" is done by the mass-check tests and GA run of SpamAssassin. This process takes about 4 weeks to complete for a new ruleset, and heavily bogs down the computers of all the corpus submitters for days on end, particularly for the runs with network checks. The last time this was done, 1,690,967 messages were processed (across the 4 sets combined), the hits of each and every message (not just aggregate hit counts per rule) were fed into a GA, and the 4 scoresets were evolved.

And that's 4 weeks just for the mass-check and GA process, and that's done *after* they've decided a set of rules are good based on some smaller-scale dry runs, testing, tweaking, and studying spam.
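As a rough picture of what "feeding per-message hits into a GA" involves, here is a deliberately tiny Python sketch. It is not SA's actual GA: it uses a simple mutate-and-keep loop instead of a real genetic algorithm, and the corpus, rule names, and parameters are all made up. The point it illustrates is that every candidate set of scores has to be re-evaluated against every message in the corpus.

```python
import random

THRESHOLD = 5.0

# Hypothetical per-message data: (rules that fired, is it really spam?).
CORPUS = [
    ({"RULE_A", "RULE_B"}, True),
    ({"RULE_A", "RULE_C"}, True),
    ({"RULE_B", "RULE_C"}, False),
    ({"RULE_C"}, False),
    ({"RULE_A"}, True),
    ({"RULE_B"}, False),
]

def errors(scores, corpus=CORPUS, threshold=THRESHOLD):
    """Count messages that land on the wrong side of the threshold."""
    wrong = 0
    for hits, really_spam in corpus:
        total = sum(scores[r] for r in hits)
        if (total >= threshold) != really_spam:
            wrong += 1
    return wrong

def evolve(scores, generations=2000, seed=0):
    """Mutate one score at a time; keep the mutant only if it is no worse."""
    rng = random.Random(seed)
    best = dict(scores)
    best_err = errors(best)
    rules = sorted(best)
    for _ in range(generations):
        candidate = dict(best)
        rule = rng.choice(rules)
        candidate[rule] = round(candidate[rule] + rng.uniform(-1.0, 1.0), 2)
        err = errors(candidate)  # a full pass over the corpus per candidate
        if err <= best_err:
            best, best_err = candidate, err
    return best, best_err

start = {"RULE_A": 1.0, "RULE_B": 1.0, "RULE_C": 1.0}
tuned, tuned_err = evolve(start)
print(errors(start), "->", tuned_err)
```

Even in this toy, each candidate costs one full pass over the corpus. Scale the corpus to well over a million messages and four scoresets, and the time involved stops being surprising.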

So, we could use a different numbering system for SA, but the reality is that the bulk of the "time to release" on SA rules is driven by how long it takes to make a ruleset. It's not encumbered by some "slow lumbering code development" that causes rule updates to wait for months before being released to the world.

Instead, new code comes out a whole lot faster than new rules, which is why the least significant digit of the SA version is usually representative of the *code* release and not the rules release. Generally 5-6 updates to the code can be made before a new ruleset is ready.

2.50, 2.51, and 2.52 all used the same basic rules and scores... later a re-run of the GA only was done after some rules that were abused by spammers were removed. This used the same mass-check data, and was merely a "quick re-GA" done on the old data after a few rules were dropped. This happened for 2.53 or 2.54 (I forget which), and 2.55 also came along before the new 2.60 ruleset was all ready to go.

That "quick re-GA" also can't be done if rules are added or significantly changed. It can only be done when rules are removed entirely. If rules are added, the mass-check portion needs to be run too, otherwise you'll have no data on that rule and it will get a 0 score.
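Here is a small Python sketch of why removal works on old data and addition doesn't. Everything in it is hypothetical (the rule names, the two recorded results, the scores), and it is only a simplified model of the idea, not SA's tooling. Dropping a rule can be simulated by stripping it from the recorded hits. A brand-new rule never appears in those hits, so no score you try for it changes the outcome, and an optimizer has nothing to go on.

```python
THRESHOLD = 5.0

# Hypothetical mass-check results recorded before RULE_NEW existed:
# (rules that fired, is the message really spam?).
OLD_RESULTS = [
    ({"RULE_A", "RULE_B"}, True),
    ({"RULE_B"}, False),
]

def errors(scores, results):
    """Count messages on the wrong side of the threshold."""
    return sum(
        (sum(scores.get(r, 0.0) for r in hits) >= THRESHOLD) != really_spam
        for hits, really_spam in results
    )

def drop_rule(results, rule):
    """Simulate removing a rule by deleting it from the recorded hits."""
    return [(hits - {rule}, really_spam) for hits, really_spam in results]

base = {"RULE_A": 3.0, "RULE_B": 2.5}

# Adding: every candidate score for RULE_NEW yields the same error count,
# because the old data contains no hits for it.
error_counts = {
    s: errors(dict(base, RULE_NEW=s), OLD_RESULTS)
    for s in (0.0, 1.0, 5.0, 50.0)
}
print(error_counts)

# Removing: the old data is still usable, and it shows the remaining
# scores now need re-tuning.
print(errors(base, drop_rule(OLD_RESULTS, "RULE_B")))
```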

And sure, there are some subtle changes to the rules between major releases, but these are only to fix ghastly false positive cases caused by minor typos (whoops, that should have been s\?x not s?x)... They aren't adding rules or doing major rule rewrites.
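The s\?x versus s?x typo is easy to demonstrate. SA rules are Perl regular expressions; the sketch below uses Python's re module instead, which treats these two particular patterns the same way. The sample strings are made up.

```python
import re

unescaped = re.compile(r"s?x")   # "?" makes the "s" optional: matches any "x"
escaped = re.compile(r"s\?x")    # "\?" is a literal question mark: matches "s?x"

samples = ["tax form attached", "s?x", "hello world"]
for text in samples:
    print(repr(text), bool(unescaped.search(text)), bool(escaped.search(text)))
```

The unescaped pattern fires on any message containing the letter "x", which is exactly the kind of ghastly false positive that justifies a fix between rule releases.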




