At 8:57 PM +0200 06/30/2013, Benny Pedersen wrote:
well it might confuse bayes yes, but it can't confuse you to run sa-learn --spam on it?

I've been running "sa-learn --spam" on these messages for a month straight. Some get picked up, others don't. I'm still getting a lot of BAYES_50 on these, and I'm almost positive it's because of these enormous gibberish comments. 95% of the message content is this gibberish, and because it's random, it doesn't get picked up by Bayes very well. The actual spammy content is only 5% of the message (maybe less) and therefore doesn't "weigh" much in the Bayes analysis.

In other words, learning these messages has a far smaller effect than one might think, and I'm pretty certain that one reason the spammers include kilobytes of gibberish text is precisely that it reduces the efficacy of learning these messages, per the description above.
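To make the dilution argument concrete, here is a toy Python sketch. It is not SpamAssassin's Bayes implementation; it only measures how many of an incoming message's tokens were already present in a previously learned message, assuming a fixed spammy payload of 10 words surrounded by 190 freshly generated random words. The payload text, the 95/5 split, and the helper names are all made up for illustration.

```python
import random
import string


def random_words(n, rng):
    """Return n pseudo-random lowercase 'words' (a stand-in for gibberish filler)."""
    return [
        "".join(rng.choice(string.ascii_lowercase) for _ in range(rng.randint(5, 10)))
        for _ in range(n)
    ]


def token_overlap(learned, incoming):
    """Fraction of the incoming message's distinct tokens that the learned message also had."""
    learned_set = set(learned)
    incoming_set = set(incoming)
    return len(incoming_set & learned_set) / len(incoming_set)


rng = random.Random(0)

# Hypothetical spammy payload: the only part that repeats between messages.
payload = "cheap pills order now limited offer click here today free".split()

# Two messages with the same payload but different random filler (95% of tokens).
learned_msg = payload + random_words(190, rng)
incoming_msg = payload + random_words(190, rng)

overlap = token_overlap(learned_msg, incoming_msg)
print(f"share of incoming tokens seen during learning: {overlap:.3f}")
```

Under these assumptions only the payload tokens carry over from one message to the next, so the learned message shares just a small fraction of its tokens with the next variant.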

it could maybe add language checking on how many words are spelled incorrectly, compared to big msg sizes

How's it going to figure out what's spelled incorrectly, especially for people who might have messages not in English? Has someone written a "spellcheck" plugin for SA to do this? Seems like a recipe for FPs, unfortunately.

At 11:01 PM +0200 06/30/2013, Benny Pedersen wrote:
it does not matter what poison is in spam mails as long as one learns it as spam

Per above, I don't think this is correct. If 95% of the poison is random and changes every time, the "important" part of the poison doesn't weigh much in the tokenization. I run these messages through sa-learn every time, and it catches a few nearly-identical messages because of it, but the next day, or the next week, others that LOOK like they should have been caught will slip by.

I don't know if there is an algorithm update to Bayes that could help catch this, but adding an HTML_COMMENT_GIBBERISH rule with a fairly high score will at least help to offset the lack of Bayes hits. One doesn't need to run it through lint or tidy or what-not... I think a regexp similar to what John Hardin made for STYLE_GIBBERISH should work for this, appropriately modified for comments rather than style tags.
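As a rough sketch of the kind of check I have in mind, here it is in Python rather than as an actual SA rule. This is not John Hardin's STYLE_GIBBERISH regexp, which I am not reproducing here; the 1024-character threshold, the 90% "letters and whitespace" ratio, and the function name are all assumptions picked for illustration and would need tuning against real mail.

```python
import re

# Non-greedy match of an HTML comment body; DOTALL lets a comment span lines.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)

# Hypothetical thresholds, not taken from any existing rule.
MIN_COMMENT_CHARS = 1024
MIN_WORDISH_RATIO = 0.9


def has_gibberish_comment(html, min_chars=MIN_COMMENT_CHARS, min_ratio=MIN_WORDISH_RATIO):
    """Return True if any HTML comment is long and consists mostly of letters and whitespace."""
    for match in COMMENT_RE.finditer(html):
        body = match.group(1)
        if not body or len(body) < min_chars:
            continue  # short comments are normal in legitimate HTML
        wordish = sum(ch.isalpha() or ch.isspace() for ch in body)
        if wordish / len(body) > min_ratio:
            return True
    return False


# Example inputs (made up): a long filler comment versus an ordinary short one.
spam_html = "<html><body><!-- " + "qzx vbn " * 200 + "--><p>Buy now</p></body></html>"
ham_html = "<html><!-- header --><p>Hello</p></html>"

print(has_gibberish_comment(spam_html))
print(has_gibberish_comment(ham_html))
```

The ratio test is there so that long but legitimate comments, such as commented-out markup or scripts full of punctuation, are less likely to trigger. In SA itself this would presumably be written as a regexp rule against the raw message body, with the score left to the admin.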

Thanks.

                                                --- Amir
