Justin Mason wrote:
Matthias Keller writes:
Nix wrote:
On 31 May 2007, Graham Murray said:

Nix <[EMAIL PROTECTED]> writes:

(And, let's be blunt, the pure this-word-is-spammy recognition part of
FuzzyOCR is much less smart than the Bayesian system already present
in SA: FuzzyOCR should really use the Bayesian system to determine the
spamminess of words, I suppose...)
Or even just act as a MIME part 'decoding' system (like Base64) and feed
all words it finds in images into Bayes, along with all other text in
the mail, rather than generating a score itself.
Perhaps so, but if so those words should have a score-multiplier of some
sort applied, because the fact that those words originated in images is
itself an obfuscation technique that should be noted in the score.
This has been discussed here again and again and again.

First of all, these 10 words found in an image cannot stand against the Bayes poisoning found in all these messages, so it would literally be useless for Bayes filtering.

By the way, this is a common misconception about how our Bayes system works;
what *should* happen is that the "poison" text winds up with "weak"
Bayesian probability scores between 0.2 and 0.8, since it uses words that
also appear in ham (hence why it appears as poison).  However, the OCR'd
text would wind up with "strong" scores around 0.99 or greater.

The chi-square probability combining algorithm we use takes care of this,
by discounting the "weak" clues and taking more account of the "strong"
clues.  (This is what makes it a more effective combining algorithm for
Bayes than the traditional Graham style.)
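
To make the point about "weak" and "strong" clues concrete, here is a minimal sketch of Fisher/Robinson-style chi-square combining. It is an illustration under stated assumptions, not SpamAssassin's actual code: the function names (chi2q, combine) and the token probabilities are hypothetical, and a real filter would also select which tokens to consider before combining.

```python
import math

def chi2q(x2, df):
    """Upper-tail probability P(chi-square >= x2) for an even df."""
    assert df % 2 == 0
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(probs):
    """Combine per-token spam probabilities into one score in [0, 1]."""
    n = len(probs)
    if n == 0:
        return 0.5  # no evidence either way
    # Work with sums of logs so long token lists do not underflow.
    ln_spam = sum(math.log(1.0 - p) for p in probs)
    ln_ham = sum(math.log(p) for p in probs)
    s = 1.0 - chi2q(-2.0 * ln_spam, 2 * n)  # evidence for spam
    h = 1.0 - chi2q(-2.0 * ln_ham, 2 * n)   # evidence for ham
    return (s - h + 1.0) / 2.0

# Hypothetical inputs: 10 strong OCR'd tokens, 30 weak "poison" tokens.
strong = [0.99] * 10
weak = [0.4, 0.5, 0.6] * 10

print(combine(strong))         # strong clues alone
print(combine(strong + weak))  # strong clues plus poison
print(combine(weak))           # poison alone
```

With these made-up numbers the combined score stays very close to 1 even after the weak tokens are added, while the weak tokens on their own come out at about 0.5. The same 0.5 also results when there are no learned tokens at all, or when strong spam and strong ham evidence cancel each other.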
It would be nice if that worked, but it doesn't for me. I don't know how the algorithm works, but I have observed its results: I trained on dozens of spams with nearly identical spam text (only the spam content, not the poisoning), and an identical mail WITH random text got a Bayes score of 0.500. So really, it just doesn't work for me...

Matt
