Re: 23_bayes_ignore_header.cf

Tom Hendrikx Wed, 15 Oct 2014 00:20:08 -0700

On 10/14/2014 11:54 PM, Axb wrote:
> On 10/14/2014 05:07 PM, RW wrote:
>> On Tue, 14 Oct 2014 13:58:27 +0200
>> Axb wrote:
>>
>>> On 10/14/2014 01:51 PM, RW wrote:
>>>> On Tue, 14 Oct 2014 10:44:51 +0200
>>>> Axb wrote:
>>>>
>>>>>
>>>>> have you verified that some of these are not included?
>>>>>
>>>>> X-Originating-IP will not be included as it can be used to help
>>>>> detect ham or spam
>>>>
>>>> It's really no different to other headers you are ignoring.
>>>
>>> for example, if you get a flood of 419s from the same source, you may
>>> want it to be tokenized...
>>
>>
>> As I do with, for example:
>>
>>    X-AntiAbuse: Originator/Caller UID/GID - [514 32007] / [47 12]
>>
>> in this spam Bayes found
>>
>>    0.999-4--HX-AntiAbuse:32007
>>
>> These numbers seem to be very good indicators for me.
>>
>>
>> Most of the headers in the file have never appeared in my ham, so
>> they'll be pure spam indicators if they are ever faked. In general
>> it's difficult for a spammer to gain an overall advantage against
>> an average per user database using faked headers.
>>
>> Whatever the merits of this on system-wide Bayes (if any beyond
>> reducing token count), I think it would have a negative effect on
>> per user Bayes.
>>
> 
> oooooooooooook..
> now here's a surprise (it's all in the code :)
> 
> the Bayes.pm plugin already includes:
> 
> 
> # Which headers should we scan for tokens?  Don't use all of them, as it's easy
> # to pick up spurious clues from some.  What we now do is use all of them
> # *less* these well-known headers; that way we can pick up spammers' tracking
> # headers (which are obviously not well-known in advance!).
> 
> # Received is handled specially
> $IGNORED_HDRS = qr{(?: (?:X-)?Sender    # misc noise
>   |Delivered-To |Delivery-Date
>   |(?:X-)?Envelope-To
>   |X-MIME-Auto[Cc]onverted |X-Converted-To-Plain-Text
> 
>   |Subject      # not worth a tiny gain vs. to db size increase
> 
>   # Date: can provide invalid cues if your spam corpus is
>   # older/newer than ham
>   |Date
> 
>   # List headers: ignore. a spamfiltering mailing list will
>   # become a nonspam sign.
>   |X-List|(?:X-)?Mailing-List
>   |(?:X-)?List-(?:Archive|Help|Id|Owner|Post|Subscribe
>     |Unsubscribe|Host|Id|Manager|Admin|Comment
>     |Name|Url)
>   |X-Unsub(?:scribe)?
>   |X-Mailman-Version |X-Been[Tt]here |X-Loop
>   |Mail-Followup-To
>   |X-eGroups-(?:Return|From)
>   |X-MDMailing-List
>   |X-XEmacs-List
> 
>   # gatewayed through mailing list (thanks to Allen Smith)
>   |(?:X-)?Resent-(?:From|To|Date)
>   |(?:X-)?Original-(?:From|To|Date)
> 
>   # Spamfilter/virus-scanner headers: too easy to chain from
>   # these
>   |X-MailScanner(?:-SpamCheck)?
>   |X-Spam(?:-(?:Status|Level|Flag|Report|Hits|Score|Checker-Version))?
>   |X-Antispam |X-RBL-Warning |X-Mailscanner
>   |X-MDaemon-Deliver-To |X-Virus-Scanned
>   |X-Mass-Check-Id
>   |X-Pyzor |X-DCC-\S{2,25}-Metrics
>   |X-Filtered-B[Yy] |X-Scanned-By |X-Scanner
>   |X-AP-Spam-(?:Score|Status) |X-RIPE-Spam-Status
>   |X-SpamCop-[^:]+
>   |X-SMTPD |(?:X-)?Spam-Apparently-To
>   |SPAM |X-Perlmx-Spam
>   |X-Bogosity
> 
>   # some noisy Outlook headers that add no good clues:
>   |Content-Class |Thread-(?:Index|Topic)
>   |X-Original[Aa]rrival[Tt]ime
> 
>   # Annotations from IMAP, POP, and MH:
>   |(?:X-)?Status |X-Flags |X-Keywords |Replied |Forwarded
>   |Lines |Content-Length
>   |X-UIDL? |X-IMAPbase
> 
>   # Annotations from Bugzilla
>   |X-Bugzilla-[^:]+
> 
>   # Annotations from VM: (thanks to Allen Smith)
>   |X-VM-(?:Bookmark|(?:POP|IMAP)-Retrieved|Labels|Last-Modified
>     |Summary-Format|VHeader|v\d-Data|Message-Order)
> 
>   # Annotations from Gnus:
>   | X-Gnus-Mail-Source
>   | Xref
> 
> )}x;
> 
> # Note only the presence of these headers, in order to reduce the
> # hapaxen they generate.
> $MARK_PRESENCE_ONLY_HDRS = qr{(?: X-Face
>   |X-(?:Gnu-?PG|PGP|GPG)(?:-Key)?-Fingerprint
>   |D(?:KIM|omainKey)-Signature
> )}ix;
> 
> funny...
>


Doing this in code has some drawbacks, just like the tld listing: it's
not visible to most people (as this thread nicely illustrates), and
you actually want it to be configurable. The header list, unlike the tld
list, actually is configurable, so now there are two tuneables for this
problem: the code (mostly static, hidden from view and unreachable for
99% of the users), and the config file.

I propose to simplify by moving the in-code exclusion list to a config
file too: one tuneable (and one location to look at) is better than two.
Besides, the config file is far easier to read for the not so
regex-capable admin :)
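
A rough sketch of what the config-only version could look like, assuming
the existing bayes_ignore_header option (which, as far as I know, takes
one literal header name per line) and using a few names taken from the
regex above purely for illustration:

   bayes_ignore_header X-Bogosity
   bayes_ignore_header X-Virus-Scanned
   bayes_ignore_header X-Mailman-Version
   bayes_ignore_header Delivery-Date

Pattern entries such as X-SpamCop-[^:]+ or X-Bugzilla-[^:]+ have no
literal equivalent, so they would either have to be enumerated or need
pattern support in the config option; that part is an open assumption,
not something the current option is known to offer.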

Regards,
        Tom
