on Tue, Dec 23, 2003 at 01:25:30PM +0000, Dale Amon ([EMAIL PROTECTED]) wrote:
> I've been noticing loads of mails like this lately:
>
> Date: Sun, 21 Dec 2003 16:25:34 +0500
> From: "Joseph Jenkins" <[EMAIL PROTECTED]>
> Subject: Re: MIT, rest in peace!
> To: [EMAIL PROTECTED]
> X-Mailer: mPOP Web-Mail 2.19
>
> emery atrocious larval drippy elate incontrollable raster anglicanism
> checkerberry feed sit ajar saturable decathlon
> already climate inhibition pagoda narcissus expository toni
>
> I can only assume someone out there is trying to attack bayesian
> systems by loading them up with all sorts of normal words so that good
> mail gets false positives, thus breaking the systems.

This sort of attack on Bayesian filters is likely to be weakly
effective at best. See Paul Graham's commentary on this:

    So Far, So Good
    August, 2003
    http://www.paulgraham.com/sofar.html

Spammers can attempt to bypass Bayesian filters by using fewer bad
tokens, or more good tokens (as Dale notes). That's it.

Seeding content with more neutral tokens tends to make the body more,
well, neutral. Unless specifically non-spammy tokens are used, there's
little net effect. Unseen words get a near-neutral weighting in
Graham's work (0.4 in "A Plan for Spam", which leans slightly innocent
rather than spammy).

Note too that, at least for Graham's Bayesian algorithm, the
computation of spamminess is based on the 15 most "interesting" tokens,
that is, those whose probabilities lie farthest from neutral. So adding
a bunch of neutral chaff to a message doesn't mask the fact that it
contains a large number of spammish keywords.

_Random_ padding won't be effective. _Targeted_ padding will be, though
to be highly effective spammers would have to target the non-spam
keyword list of each individual recipient. Guessing wrong simply gets
the padding words learned as spammy in that recipient's keyword list.

    A Plan for Spam
    August, 2002
    http://www.paulgraham.com/spam.html

I've seen a few chaffed messages slip past my filters in recent weeks.
I dump these to a 'spam-learn' folder which is crawled by sa-learn
every 30 minutes (cronjob). After a few days of this, the chaffed
messages stopped appearing in my "greylist" box (previously unknown
senders).

I also maintain a whitelist; being on it is the only way a given sender
can end up in my inbox. Mailing lists collect some spam, but not much.

Peace.

-- 
Karsten M. Self <[EMAIL PROTECTED]>        http://kmself.home.netcom.com/
 What Part of "Gestalt" don't you understand?
    At the sound of the toner, boycott Lexmark: trade restraint via DMCA.
    http://news.com.com/2100-1023-979791.html
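
The point about the 15 most "interesting" tokens can be illustrated
with a short Python sketch of the combining step described in "A Plan
for Spam". This is only an illustration under stated assumptions, not
anyone's production filter. The token table and its probabilities are
made up for the example, the function and variable names are
hypothetical, and counting each distinct token once is a
simplification. The 0.4 value for unseen tokens and the 0.9 spam
threshold are the figures Graham gives in that essay.

```python
from math import prod  # Python 3.8+

UNKNOWN = 0.4    # probability Graham assigns to never-seen tokens
TOP_N = 15       # only the most "interesting" tokens are combined
THRESHOLD = 0.9  # Graham's cutoff for calling a message spam

# Hypothetical per-recipient token table: P(spam | token).
# These numbers are invented for illustration only.
TABLE = {
    "viagra": 0.99, "unsubscribe": 0.97, "click": 0.95,
    "free": 0.93, "offer": 0.92,
    "debian": 0.01, "kernel": 0.01, "mutt": 0.01, "procmail": 0.01,
    "patch": 0.02, "sysadmin": 0.02, "cronjob": 0.02,
}

def spam_score(tokens, table=TABLE):
    """Combine the TOP_N token probabilities farthest from 0.5."""
    # Simplification: each distinct token counts once.
    probs = [table.get(t, UNKNOWN) for t in set(tokens)]
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:TOP_N]
    spam = prod(top)
    ham = prod(1.0 - p for p in top)
    return spam / (spam + ham)

spam_words = ["viagra", "free", "offer", "click", "unsubscribe"]

# Random chaff: the words from the message Dale quoted, all unseen here.
chaff = ("emery atrocious larval drippy elate incontrollable raster "
         "anglicanism checkerberry feed sit ajar saturable decathlon "
         "already climate inhibition pagoda narcissus expository toni").split()

# Targeted padding: words this particular recipient's table rates as ham.
good_words = ["debian", "kernel", "mutt", "procmail",
              "patch", "sysadmin", "cronjob"]

plain = spam_score(spam_words)
chaffed = spam_score(spam_words + chaff)
targeted = spam_score(spam_words + good_words)

for name, score in [("plain", plain), ("chaffed", chaffed),
                    ("targeted", targeted)]:
    verdict = "spam" if score > THRESHOLD else "not spam"
    print(f"{name:9s} {score:.6f} {verdict}")
```

With this made-up table, the plain and chaffed messages both score
above the threshold, because the unseen chaff words sit close to
neutral and barely move the result. Only the padding drawn from the
recipient's own good-token list pulls the score down, which is why the
spammer would need to know each recipient's list.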