Re: LONGWORDS not hitting?

Amir 'CG' Caspi Sun, 30 Jun 2013 14:10:35 -0700

At 8:57 PM +0200 06/30/2013, Benny Pedersen wrote:

well it might confuse bayes yes, but it can't confuse you to run sa-learn --spam on it ?

I've been running "sa-learn --spam" on these messages for a month straight. Some get picked up, others don't. I'm still getting a lot of BAYES_50 on these, and I'm almost positive it's because of these enormous gibberish comments. 95% of the message content is this gibberish, and because it's random, it doesn't get picked up by Bayes very well. The actual spammy content is only 5% of the message (maybe less) and therefore doesn't "weigh" much in the Bayes analysis.

In other words, learning these messages has far smaller effect than one might think it would, and I'm pretty certain one of the reasons the spammers are including kilobytes of gibberish text is exactly because it reduces the efficacy of learning these messages, per the description above.
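To illustrate the dilution argument above, here is a rough Python sketch. It is not how SpamAssassin's Bayes actually tokenizes or scores mail; it only measures token overlap between a learned message and a later one. The word list, the 5/95 split, and the function names are all assumptions made up for the illustration.

```python
import random
import string


def random_word(rng, length=8):
    """Return one random lowercase 'gibberish' word."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def make_message(rng, spam_words, n_gibberish):
    """Build a token list: the fixed spammy words plus fresh random filler."""
    return list(spam_words) + [random_word(rng) for _ in range(n_gibberish)]


def known_fraction(tokens, learned):
    """Fraction of a message's tokens that were seen during learning."""
    return sum(1 for t in tokens if t in learned) / len(tokens)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical spam payload: 5 real words padded with 95 random ones.
spam_words = ["cheap", "pills", "order", "now", "discount"]

# "Learn" one message, then look at a later message from the same campaign.
learned = set(make_message(rng, spam_words, 95))
new_message = make_message(rng, spam_words, 95)

fraction = known_fraction(new_message, learned)
print(f"{fraction:.0%} of the new message's tokens were seen during learning")
```

With fresh random filler in every message, only the small fixed payload overlaps with what was learned, which is the effect described above.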

it could maybe add language checking on how many words are spelled incorrectly, compared to big msg sizes

How's it going to figure out what's spelled incorrectly, especially for people who might have messages not in English? Has someone written a "spellcheck" plugin for SA to do this? Seems like a recipe for FPs, unfortunately.


At 11:01 PM +0200 06/30/2013, Benny Pedersen wrote:

it does not matter what poison is in spam mails as long as one learns it as spam

Per above, I don't think this is correct. If 95% of the poison is random and changes every time, the "important" part of the poison doesn't weigh much in the tokenization. I run these messages through sa-learn every time, and it catches a few nearly-identical messages because of it, but the next day, or the next week, others that LOOK like they should have been caught will slip by.

I don't know if there is an algorithm update to Bayes that could help catch this, but adding an HTML_COMMENT_GIBBERISH rule with a fairly high score will at least help to offset the lack of Bayes hits. One doesn't need to run it through lint or tidy or what-not... I think a regexp similar to what John Hardin made for STYLE_GIBBERISH should work for this, appropriately modified for comments rather than style tags.
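As a rough idea of what such a check might look for, here is a minimal Python sketch. It is not John Hardin's STYLE_GIBBERISH regexp and not SpamAssassin rule syntax; the function name and both thresholds are hypothetical. It only flags HTML comments that are very large and wordy, without judging whether the words are actually gibberish.

```python
import re

# Hypothetical thresholds; real values would need tuning against a corpus.
MIN_COMMENT_CHARS = 1000   # flag only comments at least this long
MIN_COMMENT_WORDS = 50     # ...that also contain at least this many words

# Non-greedy match of an HTML comment body, allowing it to span lines.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)
WORD_RE = re.compile(r"[A-Za-z]{2,}")


def has_large_text_comment(html,
                           min_chars=MIN_COMMENT_CHARS,
                           min_words=MIN_COMMENT_WORDS):
    """Return True if any HTML comment is both long and full of words."""
    for match in COMMENT_RE.finditer(html):
        body = match.group(1)
        if len(body) >= min_chars and len(WORD_RE.findall(body)) >= min_words:
            return True
    return False


# Usage: a kilobyte-sized comment of filler words versus a short, normal one.
padded = "<p>Buy now</p><!--" + "qwe rty " * 200 + "-->"
normal = "<p>Hello</p><!-- short note -->"
print(has_large_text_comment(padded), has_large_text_comment(normal))
```

Legitimate mail can carry large comments too, so a size-only test like this would likely need a conservative threshold or further conditions to avoid FPs.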


Thanks.

                                                --- Amir
