Re: Text contained in HTML comments causing BAYES_00 to classify as non-spam

Matus UHLAR - fantomas Thu, 05 Aug 2010 01:06:42 -0700

> >It's unlikely that that could push the BAYES RESULT down to BAYES_00
> >unless there is uncorrected mistraining.


On 04.08.10 06:07, Happy Chap wrote:
> Possibly, but I suspect mistraining isn't a problem because, apart from this
> specific type of spam, SpamAssassin is doing (and has done for some time) a
> very good job of identifying mail correctly.

However, if you feed the spam in question to SA, it gets classified as ham,
which means SA is not doing a very good job on this kind of spam.
That is apparently caused by mistraining and can be solved by proper training
(apparently many ham messages contain the same tokens).
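
To illustrate why shared ham tokens matter, here is a rough sketch of
chi-squared probability combining of the kind Bayes filters use. It is
not SA's actual code: the token probabilities (0.95, 0.05) and the token
counts are made-up values, and SA's real token selection and per-token
probability calculation are more involved.

```python
import math

def chi2q(x2, v):
    """Chi-square survival function for an even number of degrees of freedom v."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(probs):
    """Combine per-token spam probabilities into one score in [0, 1]."""
    n = len(probs)
    # Evidence for spam and for ham, tested separately and then balanced.
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (s - h + 1.0) / 2.0

# Hypothetical message: 5 strongly spammy tokens in the visible text.
spammy_only = [0.95] * 5
# Same message plus 40 ham-looking tokens, e.g. from text in HTML comments.
with_ham_filler = spammy_only + [0.05] * 40

print(combine(spammy_only))      # close to 1
print(combine(with_ham_filler))  # close to 0
```

Under these assumptions, the filler tokens swamp the spammy ones and drag
the combined score down to nearly 0, which is the range where BAYES_00
fires. Training on such messages as spam shifts the probabilities of the
filler tokens away from the ham side.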

> >I don't think the 3.2.x rules get updated much. Perhaps this is leading
> >to false autotraining in BAYES.

> Incidentally, I'm not sure the autotraining is much of a problem as it only
> seems to be very obvious (high scoring) spam (and ham) that triggers
> autotraining, according to the headers at least. Certainly none of this
> particular type of spam is getting autotrained according to the headers.

Luckily, you can retrain on all misclassified spam and ham, and you are
doing that, aren't you?
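
For reference, retraining is done with sa-learn and its --spam and --ham
options. A minimal sketch of wrapping that in a script follows; the helper
names and the folder paths are hypothetical, and it assumes sa-learn is on
the PATH and is run as the user who owns the Bayes database.

```python
import subprocess

def sa_learn_command(kind, path):
    """Build the sa-learn command line for a file or directory of messages."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return ["sa-learn", "--" + kind, path]

def retrain(kind, path):
    """Run sa-learn on the given path; raises if sa-learn exits non-zero."""
    return subprocess.run(sa_learn_command(kind, path), check=True)

if __name__ == "__main__":
    # Hypothetical folders holding the messages SA got wrong.
    retrain("spam", "/home/user/Maildir/.missed-spam/cur")
    retrain("ham", "/home/user/Maildir/.false-positives/cur")
```

Feeding the missed spam with --spam corrects the token counts, so the same
tokens stop counting purely as ham evidence.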

> Finally, do you know if SpamAssassin has rules that *should* catch this type
> of spam (i.e. no legitimate email would include big blocks of random
> paragraphs inside HTML comments)? I would have thought that in itself would
> perhaps have been picked up by a rule to identify it as spam.

The bayes_use_hapaxes option (on by default), which makes Bayes use tokens
it has seen only once, could help here.
-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Save the whales. Collect the whole set.
