on Mon, Dec 29, 2003 at 11:03:09AM +0100, Kjetil Kjernsmo ([EMAIL PROTECTED]) wrote:
> On Monday 29 December 2003 00:12, Karsten M. Self wrote:
> > _Random_ padding won't be effective. _Targeted_ padding will be,
> > though spammers would have to target the non-spam keyword list of
> > individual recipients to be highly effective (guessing wrong simply
> > adds to the spamminess of an individual's keyword list).
>
> Indeed. But it underlines the importance that every individual needs
> to train the filter with his own ham.
Yes. And this is an important distinction between server-side and
user-side filtering. I've been interested in, though I haven't
researched, systems which run the spam filter server-side but allow
user-specific training and correction. The latter preferably through an
ability to send misclassified messages to a "user-spam" or "user-ham"
address, with suitable protections to keep those addresses from being
abused by persons other than the user in question.

> My previous university has not trained their filters well, and this
> seems like an effective attack against their filter. For me, all these
> messages have been tagged with BAYES_99.

However: what proportion of total spam are these? Likely little. And
you can use your own (local) spam filters to cut these out. Spam
filtering is a game of percentages, not a realm of absolutes.

> However, it seems like SA has no other rules that match these spams,
> so they seldom get above my reject-at-smtp threshold. Is it possible
> to make a rule to match this practice?

I've been leaning toward originating-IP reputation systems for this
myself. That is: if you plot the originating IPs of remote sending mail
hosts, you're likely to find that a small number of them account for a
large portion of your legitimate mail.

Traditional antispam approaches have involved identifying spamming hosts
and blocking them. This worked moderately well when a relatively small
number of open relays (or netblocks containing same) originated the bulk
of spam. It does poorly when spammers have hundreds of thousands, or
millions, of trojaned broadband spam proxies, whose active ranks vary
widely on a daily or weekly basis.

From current stats on my own system, looking only at originating domain,
I have:

    6854 messages
     610 domains

...so, on average, 11 and change messages per domain. The top _two_
domains (my own, and debian.org) account for almost half the traffic,
most of which is trusted.

The idea, then, is to expedite processing of mail from known, and
(largely) trusted, domains. Known spammy domains (RBL lookup or local
experience) are rejected or greatly throttled.

Previously unknown domains can be treated several ways. One option is to
simply give a non-permanent SMTP reject on the initial connection, and
let the remote host resend the message according to its own schedule.
You'll likely be able to make an immediate classification: any host which
fails to acknowledge the reject and either attempts continued delivery or
immediately retries can be classified as spam and blocked outright. Such
hosts are essentially poorly behaved, aren't respecting conventional
rules of behavior, and deserve to be shown the door. Hosts which _do_
respect the reject are, at the very least, well behaved (if not
necessarily non-spammy).

Spamming hosts will, if this protocol becomes widespread, find their
local queues backing up badly (if they're well behaved), or find that
they're IP filtered, teergrubed, or blacklisted very rapidly.

Again: it's a percentages game, and there are multiple levels at which
you want to apply classifications. Bayesian and other content-based
filtering works well once you've got content. But if you've got other
reasons to believe the remote host is spammy, why give it quarter in the
first place? Rough sketches of a few of these ideas are appended below.
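On the per-user correction addresses: something along these lines could
sit behind them on the server. This is a rough, untested sketch assuming
plus-addressed aliases and SpamAssassin's sa-learn as the trainer; the
addresses, paths, and the ownership check are illustrative, not a
description of any existing setup.

    #!/usr/bin/env python
    # Sketch of per-user correction addresses ("user-spam"/"user-ham").
    # Assumes misclassified mail is forwarded to a plus-addressed alias
    # (e.g. kjetil+user-spam@...), that the MTA pipes the message to this
    # script with recipient and authenticated sender on the command line,
    # and that SpamAssassin's sa-learn does the training.  All names and
    # paths are illustrative.

    import subprocess
    import sys

    def train(recipient, auth_sender, raw_message):
        # Only accept corrections from the mailbox owner, so outsiders
        # can't poison someone else's Bayes database.
        user, _, rest = recipient.partition("+")
        tag = rest.split("@")[0]
        if auth_sender.split("@")[0] != user:
            raise PermissionError("correction not sent by mailbox owner")

        if tag == "user-spam":
            mode = "--spam"
        elif tag == "user-ham":
            mode = "--ham"
        else:
            raise ValueError("not a correction address")

        # Train this user's own Bayes DB, not a shared one.
        subprocess.run(
            ["sa-learn", mode,
             "--dbpath", "/var/spamassassin/%s/bayes" % user],
            input=raw_message, check=True)

    if __name__ == "__main__":
        train(sys.argv[1], sys.argv[2], sys.stdin.buffer.read())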
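The per-domain tally above is easy to reproduce against a local mailbox.
A rough sketch (the mbox path is made up, and this is not the script that
produced the numbers quoted earlier):

    #!/usr/bin/env python
    # Rough sketch: tally messages per originating domain in a local mbox
    # and see how much of the traffic the top couple of domains carry.
    # The mbox path is illustrative.

    import mailbox
    import os
    from collections import Counter
    from email.utils import parseaddr

    MBOX = os.path.expanduser("~/Mail/inbox")      # hypothetical location

    counts = Counter()
    for msg in mailbox.mbox(MBOX):
        addr = parseaddr(msg.get("From", ""))[1]   # bare address
        domain = addr.split("@")[-1].lower() if "@" in addr else "unknown"
        counts[domain] += 1

    total = sum(counts.values())
    print("%d messages, %d domains, %.1f messages/domain on average"
          % (total, len(counts), float(total) / max(len(counts), 1)))

    top = counts.most_common(2)
    share = 100.0 * sum(n for _, n in top) / max(total, 1)
    print("top 2 domains (%s) carry %.0f%% of the traffic"
          % (", ".join(d for d, _ in top), share))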
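And the unknown-domain handling is essentially what's being called
greylisting. In outline (thresholds, lists, and the in-memory store are
all made up; a real implementation would persist its state and hook into
the MTA's policy interface):

    # Minimal sketch of the temporary-reject idea (essentially greylisting).
    # Thresholds, lists, and the in-memory store are illustrative; a real
    # implementation would persist state and plug into the MTA's policy
    # hooks.

    import time

    RETRY_MIN = 300                 # a polite MTA waits at least this long
    GOOD_DOMAINS = {"debian.org"}   # locally trusted senders (example)
    BAD_IPS = set()                 # locally blacklisted sources

    first_seen = {}                 # (ip, sender, recipient) -> first attempt

    def verdict(ip, sender_domain, sender, recipient):
        """Return an SMTP-style response for one delivery attempt."""
        if ip in BAD_IPS:
            return "554 no thanks"          # known spam source: reject
        if sender_domain in GOOD_DOMAINS:
            return "250 ok"                 # known and trusted: expedite
        key = (ip, sender, recipient)
        now = time.time()
        first = first_seen.setdefault(key, now)
        if now - first < RETRY_MIN:
            # First attempt, or an impatient immediate retry.  Hosts that
            # hammer on regardless are candidates for BAD_IPS.
            return "450 try again later"    # non-permanent reject
        return "250 ok"                     # came back politely: let it in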
Peace.

-- 
Karsten M. Self <kmself@ix.netcom.com>
http://kmself.home.netcom.com/
What Part of "Gestalt" don't you understand?
Windows Refund Day II: fight for your right to refund
http://www.windowsrefund.net/