On Thu, 2002-02-21 at 13:42, Arpi wrote:
> when will it be implemented, or better: when will you accept such a patch for
> the ruleset? (i cannot modify the perl code, as i don't know the perl language
> nor the spamassassin core well enough, but i could help make this optimization
> to the ruleset)

You can attach rulefile patches to bug #47 in Bugzilla.  I'll stick them
in CVS once the coding is done.  It's not going to go into 2.1, but
it'll be early in 2.2.

http://bugzilla.spamassassin.org/show_bug.cgi?id=47

> anyway, i have a request:
> could you add a new rule type, for plain text matches?
> searching for a text string is always simpler and faster than for regexps,
> and many of your regexps are such strings (/some words/i) and there will be
> many more when we start adding multiple-rule things.

If you have a smart regexp library, then when it compiles a simple regex
it should in fact just do whatever "simpler and faster" comparison is
needed to look for a text substring.  In fact, because it has done some
pre-compilation, it can do some fancy Boyer-Moore type searching, and
run even faster than your typical strstr() implementation.
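
To make the pre-compilation point concrete, here is a rough sketch in C of
the Boyer-Moore-Horspool variant (a simplification of full Boyer-Moore,
swapped in for brevity).  The names literal_matcher, literal_compile and
literal_search are made up for illustration; this is not code from any
regexp library, and it makes no claim about what a particular engine does
internally.  It is also case-sensitive, so it does not cover the /i case.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Precomputed search state for one literal pattern (hypothetical type). */
typedef struct {
    const unsigned char *pat;
    size_t len;
    size_t shift[UCHAR_MAX + 1];   /* bad-character shift table */
} literal_matcher;

/* "Compile" step: build the shift table once per pattern. */
static void literal_compile(literal_matcher *m, const char *pat)
{
    size_t i;

    m->pat = (const unsigned char *)pat;
    m->len = strlen(pat);

    /* A byte that is not in the pattern lets us skip a whole pattern length. */
    for (i = 0; i <= UCHAR_MAX; i++)
        m->shift[i] = m->len;

    /* For bytes in the pattern (except the last position), the shift is
     * the distance from their last occurrence to the end of the pattern. */
    for (i = 0; i + 1 < m->len; i++)
        m->shift[m->pat[i]] = m->len - 1 - i;
}

/* Search step: returns a pointer to the first match, or NULL if none. */
static const char *literal_search(const literal_matcher *m, const char *text)
{
    const unsigned char *t = (const unsigned char *)text;
    size_t n = strlen(text);
    size_t pos = 0;

    if (m->len == 0)
        return text;

    while (pos + m->len <= n) {
        if (memcmp(t + pos, m->pat, m->len) == 0)
            return text + pos;
        /* Skip ahead based on the byte under the end of the window. */
        pos += m->shift[t[pos + m->len - 1]];
    }
    return NULL;
}
```

The table is built once per pattern and reused for every message, which is
where the win over a plain strstr()-style scan would come from: on a
mismatch the window can jump by up to the full pattern length instead of
one byte.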

> and i will implement the spam phrase check this way:
> go through the whole text, split it into words, calculate a hash for each word
> and look it up in a hash-table accelerated word table.
> the word table contains a word->id mapping, each word has a unique serial
> number id, word_id.
> then:
>    ++word_match[word_id];
>    ++phrase_match[previous_word_id][word_id];
> 
> so, when executing rule matching, and we have plain text string match
> (instead of regexp) we could simply check the word_match array.
> (at least in my C version, as my ruleset -> rules.c precompiler could
> replace/extend these with word_id numbers)
> 
> with the right balance, it could reduce rule matching to a single-pass
> word/phrase counting and then matching only regexps having their requested
> word counted. it could really speed up the whole process a lot.

There are many rules, though, that don't really have any "words"
that you can pre-match against.  So you're still going to have to do a
medium-sized number of regex matches.  You might not gain all that much
over just doing the multi-match system, and the coding and complexity of
the program will go up substantially.
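
For reference, the single-pass counting scheme quoted above might look
roughly like this in C.  It is only a sketch under assumptions: the
word_table type, the fixed sizes, the FNV-1a hash, the helper names, and
the choice to lowercase words and split on non-alphanumerics are all
invented here, not taken from SpamAssassin or from your precompiler.  The
table must be zero-initialised (for example by declaring it static), and
words are expected to be added in lowercase.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_WORDS    64    /* vocabulary capacity (assumption) */
#define TABLE_SIZE   128   /* power of two, larger than MAX_WORDS */
#define MAX_WORD_LEN 32    /* including the terminating NUL */

typedef struct {
    char words[MAX_WORDS][MAX_WORD_LEN];
    int slots[TABLE_SIZE];                  /* word_id + 1, 0 means empty */
    int count;
    unsigned word_match[MAX_WORDS];
    unsigned phrase_match[MAX_WORDS][MAX_WORDS];
} word_table;

/* 32-bit FNV-1a hash of a NUL-terminated word. */
static unsigned hash_word(const char *w)
{
    unsigned h = 2166136261u;
    while (*w) {
        h ^= (unsigned char)*w++;
        h *= 16777619u;
    }
    return h;
}

/* Returns the word_id for w, or -1 if w is not in the table. */
static int table_lookup(const word_table *t, const char *w)
{
    unsigned i = hash_word(w) & (TABLE_SIZE - 1);
    while (t->slots[i] != 0) {
        int id = t->slots[i] - 1;
        if (strcmp(t->words[id], w) == 0)
            return id;
        i = (i + 1) & (TABLE_SIZE - 1);     /* linear probing */
    }
    return -1;
}

/* Adds a word that some rule refers to; returns its word_id or -1. */
static int table_add(word_table *t, const char *w)
{
    unsigned i;
    int id = table_lookup(t, w);

    if (id >= 0)
        return id;
    if (t->count >= MAX_WORDS || strlen(w) >= MAX_WORD_LEN)
        return -1;
    id = t->count++;
    strcpy(t->words[id], w);
    i = hash_word(w) & (TABLE_SIZE - 1);
    while (t->slots[i] != 0)
        i = (i + 1) & (TABLE_SIZE - 1);
    t->slots[i] = id + 1;
    return id;
}

/* One pass over the message: count known words and adjacent word pairs. */
static void count_words(word_table *t, const char *text)
{
    char buf[MAX_WORD_LEN];
    int prev = -1;

    while (*text) {
        size_t n = 0;
        int too_long = 0;
        int id;

        while (*text && !isalnum((unsigned char)*text))
            text++;                          /* skip separators */
        if (!*text)
            break;
        while (isalnum((unsigned char)*text)) {
            if (n < MAX_WORD_LEN - 1)
                buf[n++] = (char)tolower((unsigned char)*text);
            else
                too_long = 1;                /* cannot be a table word */
            text++;
        }
        buf[n] = '\0';

        id = too_long ? -1 : table_lookup(t, buf);
        if (id >= 0) {
            ++t->word_match[id];
            if (prev >= 0)
                ++t->phrase_match[prev][id];
        }
        prev = id;                           /* unknown word breaks a phrase */
    }
}
```

A rule whose pattern is a plain word would then only need to test
word_match[word_id] != 0, and a two-word phrase would test
phrase_match[a][b] != 0.  Rules with no usable literal word still need
their regex run on every message, which is the cost I'm describing above.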

C

