On Thu, 22 Sep 2016, Thomas Barth wrote:

And what about filter poisoning? In the last 10 hours my company address received 43 mails classified as spam (even a virus mail was detected today). And there was one mail classified as spam due to my rules (bad country, message-id).

X-Spam-Status: Yes, score=7.474 tag=2 tag2=6.31 kill=6.31
       tests=[MESSAGEID_LOCAL=3, RDNS_NONE=1.274, RELAYCOUNTRY_BAD=3.2]
       autolearn=no autolearn_force=no

The content of the mail is:

------------------------------------------------
From: "Lupe Monroe" <monroe.4...@static.vnpt.vn>
To: "my boss address"
Subject: Payment approved
MIME-Version: 1.0
Content-Type: multipart/related;
       boundary="boundary_af9c8db46e1111b73fca8b315aafef01"
Message-Id: <20160922063255.e11d3e5...@static.vnpt.vn.local>
Date: Thu, 22 Sep 2016 06:32:55 +0700

--boundary_af9c8db46e1111b73fca8b315aafef01
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit

Dear so,

Your payment has been approved. Your account will be debited within two days.

You can email us for any query regarding your account.

Thank you.

Lupe Monroe
Support

--boundary_af9c8db46e1111b73fca8b315aafef01
Content-Type: application/x-zip-compressed; name="e6dfa16bdb.zip.virus-scan-me.virus-scan-me"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="e6dfa16bdb.zip.virus-scan-me.virus-scan-me"
------------------------------------------------

There is no spam content, am I right? Just normal words and content that a normal person could use. I don't need spam learning for mails that are already classified as spam with a high score. Low-scoring spam like this one is what is interesting for spam learning. But if I use these mails for spam learning, isn't there a risk of false positives some day, because Bayes has learned that normal mails are also spam?

You are missing the point that Bayes uses more than just body words from a message. It also looks at headers and meta-data. So those particular body words could become "neutral" (neither spam nor ham indicators) but the other components of that message (such as that '.vn.local' message ID) would be learned as spam signs.

This is why you MUST also train your Bayes with HAM messages (feed them to sa-learn with the --ham flag) so Bayes knows how to recognise 'hammy' or 'neutral' tokens and can avoid false positives.
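
To make the "neutral token" idea concrete, here is a minimal toy sketch in Python. It is not SpamAssassin's actual Bayes code: the tokenizer, the `ToyBayes` class, the `H*Message-Id:` token prefix and the example Message-Ids are all made up for illustration, and the probability formula is a simplified ratio. It only shows that a body word seen equally in spam and ham ends up at 0.5, while a header-derived token seen only in spam stays spammy.

```python
from collections import Counter


def tokenize(headers, body):
    """Toy tokenizer (hypothetical): one token per header host part, plus body words."""
    tokens = set()
    for name, value in headers.items():
        # Keep only the part after '@' so the odd '.local' host becomes a token.
        host = value.strip("<>").rpartition("@")[2].lower()
        tokens.add(f"H*{name}:{host}")
    for word in body.split():
        word = word.strip(".,!?").lower()
        if word:
            tokens.add(word)
    return tokens


class ToyBayes:
    """Simplified per-token spam probability; not the real SpamAssassin math."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()
        self.nspam = 0
        self.nham = 0

    def learn(self, tokens, is_spam):
        if is_spam:
            self.spam.update(tokens)
            self.nspam += 1
        else:
            self.ham.update(tokens)
            self.nham += 1

    def token_prob(self, token):
        # Share of spam messages vs. share of ham messages containing the token.
        s = self.spam[token] / self.nspam if self.nspam else 0.0
        h = self.ham[token] / self.nham if self.nham else 0.0
        if s + h == 0:
            return 0.5  # never seen: neutral
        return s / (s + h)


# Made-up example messages with the same harmless body wording.
spam_msg = tokenize({"Message-Id": "<1@bulk.example.local>"},
                    "Your payment has been approved.")
ham_msg = tokenize({"Message-Id": "<2@mail.example.com>"},
                   "Your payment has been approved, thank you.")

bayes = ToyBayes()
bayes.learn(spam_msg, is_spam=True)
spam_only = bayes.token_prob("payment")   # 1.0: only spam has been learned

bayes.learn(ham_msg, is_spam=False)
balanced = bayes.token_prob("payment")    # 0.5: the word is now neutral
msgid = bayes.token_prob("H*Message-Id:bulk.example.local")  # 1.0: still spammy

print(spam_only, balanced, msgid)
```

With spam-only training the word "payment" looks like a pure spam sign, which is exactly the false-positive risk you describe. Once a ham message with the same wording is learned, the word drops to neutral and only the header token keeps its spam weight.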


--
Dave Funk                                  University of Iowa
<dbfunk (at) engineering.uiowa.edu>        College of Engineering
319/335-5751   FAX: 319/384-0549           1256 Seamans Center
Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{
