On 10/14/2014 11:54 PM, Axb wrote:
> On 10/14/2014 05:07 PM, RW wrote:
>> On Tue, 14 Oct 2014 13:58:27 +0200
>> Axb wrote:
>>
>>> On 10/14/2014 01:51 PM, RW wrote:
>>>> On Tue, 14 Oct 2014 10:44:51 +0200
>>>> Axb wrote:
>>>>
>>>>>
>>>>> have you verified that some of these are not included?
>>>>>
>>>>> X-Originating-IP will not be included as it can be used to help
>>>>> detect ham or spam
>>>>
>>>> It's really no different to other headers you are ignoring.
>>>
>>> for example, if you get a flood of 419s from the same source, you may
>>> want it to be tokenized...
>>
>>
>> As I do with, for example:
>>
>> X-AntiAbuse: Originator/Caller UID/GID - [514 32007] / [47 12]
>>
>> in this spam Bayes found
>>
>> 0.999-4--HX-AntiAbuse:32007
>>
>> These numbers seem to be very good indicators for me.
>>
>>
>> Most of the headers in the file have never appeared in my ham, so
>> they'll be pure spam indicators if they are ever faked. In general
>> it's difficult for a spammer to gain an overall advantage against
>> an average per user database using faked headers.
>>
>> Whatever the merits of this on system-wide Bayes (if any beyond
>> reducing token count), I think it would have a negative effect on
>> per user Bayes.
>>
>
> oooooooooooook..
> now here's a surprise (it's all in the code :)
>
> the Bayes.pm plugin already includes:
>
>
> # Which headers should we scan for tokens? Don't use all of them, as it's easy
> # to pick up spurious clues from some. What we now do is use all of them
> # *less* these well-known headers; that way we can pick up spammers' tracking
> # headers (which are obviously not well-known in advance!).
>
> # Received is handled specially
> $IGNORED_HDRS = qr{(?: (?:X-)?Sender # misc noise
>   |Delivered-To |Delivery-Date
>   |(?:X-)?Envelope-To
>   |X-MIME-Auto[Cc]onverted |X-Converted-To-Plain-Text
>
>   |Subject # not worth a tiny gain vs. to db size increase
>
>   # Date: can provide invalid cues if your spam corpus is
>   # older/newer than ham
>   |Date
>
>   # List headers: ignore. a spamfiltering mailing list will
>   # become a nonspam sign.
>   |X-List|(?:X-)?Mailing-List
>   |(?:X-)?List-(?:Archive|Help|Id|Owner|Post|Subscribe
>     |Unsubscribe|Host|Id|Manager|Admin|Comment
>     |Name|Url)
>   |X-Unsub(?:scribe)?
>   |X-Mailman-Version |X-Been[Tt]here |X-Loop
>   |Mail-Followup-To
>   |X-eGroups-(?:Return|From)
>   |X-MDMailing-List
>   |X-XEmacs-List
>
>   # gatewayed through mailing list (thanks to Allen Smith)
>   |(?:X-)?Resent-(?:From|To|Date)
>   |(?:X-)?Original-(?:From|To|Date)
>
>   # Spamfilter/virus-scanner headers: too easy to chain from
>   # these
>   |X-MailScanner(?:-SpamCheck)?
>   |X-Spam(?:-(?:Status|Level|Flag|Report|Hits|Score|Checker-Version))?
>   |X-Antispam |X-RBL-Warning |X-Mailscanner
>   |X-MDaemon-Deliver-To |X-Virus-Scanned
>   |X-Mass-Check-Id
>   |X-Pyzor |X-DCC-\S{2,25}-Metrics
>   |X-Filtered-B[Yy] |X-Scanned-By |X-Scanner
>   |X-AP-Spam-(?:Score|Status) |X-RIPE-Spam-Status
>   |X-SpamCop-[^:]+
>   |X-SMTPD |(?:X-)?Spam-Apparently-To
>   |SPAM |X-Perlmx-Spam
>   |X-Bogosity
>
>   # some noisy Outlook headers that add no good clues:
>   |Content-Class |Thread-(?:Index|Topic)
>   |X-Original[Aa]rrival[Tt]ime
>
>   # Annotations from IMAP, POP, and MH:
>   |(?:X-)?Status |X-Flags |X-Keywords |Replied |Forwarded
>   |Lines |Content-Length
>   |X-UIDL? |X-IMAPbase
>
>   # Annotations from Bugzilla
>   |X-Bugzilla-[^:]+
>
>   # Annotations from VM: (thanks to Allen Smith)
>   |X-VM-(?:Bookmark|(?:POP|IMAP)-Retrieved|Labels|Last-Modified
>     |Summary-Format|VHeader|v\d-Data|Message-Order)
>
>   # Annotations from Gnus:
>   | X-Gnus-Mail-Source
>   | Xref
>
> )}x;
>
> # Note only the presence of these headers, in order to reduce the
> # hapaxen they generate.
> $MARK_PRESENCE_ONLY_HDRS = qr{(?: X-Face
>   |X-(?:Gnu-?PG|PGP|GPG)(?:-Key)?-Fingerprint
>   |D(?:KIM|omainKey)-Signature
> )}ix;
>
> funny...
>

Doing this in code has some drawbacks, just like the TLD listing: it's not visible to most people (as this thread nicely illustrates), and you actually want it to be configurable.

This one actually is configurable, so there are now two tunables for this problem: the code (mostly static, hidden from view, and unreachable for 99% of users) and the config file.

I propose to simplify and move the in-code exclusions to a config file too: one tunable (and one location to look at) is better than two. Besides, a config file is far easier to read for the not-so-regex-capable admin :)

Regards,
Tom
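
For illustration, a few entries from the hard-coded list might look like this when written as config lines. This is only a sketch: it assumes the config-side tunable meant above is SpamAssassin's existing bayes_ignore_header option, and the file name in the comment is hypothetical.

```
# local.cf fragment (hypothetical file): headers Bayes should not tokenize
bayes_ignore_header Delivery-Date
bayes_ignore_header X-Mailman-Version
bayes_ignore_header X-Virus-Scanned
bayes_ignore_header X-Bogosity
```

Patterned entries from the regex, such as X-Bugzilla-[^:]+, have no obvious one-line equivalent here; the sketch does not assume the option accepts patterns, so those would need either one line per concrete header name or a pattern-capable setting.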