On Thu, 22 Sep 2016, Thomas Barth wrote:
And what about filter poisoning? In the last 10 hours my company address got
43 mails classified as spam (even a virus mail detected today). And there was
one mail classified as spam due to my rules (bad country, message-id).
X-Spam-Status: Yes, score=7.474 tag=2 tag2=6.31 kill=6.31
tests=[MESSAGEID_LOCAL=3, RDNS_NONE=1.274, RELAYCOUNTRY_BAD=3.2]
autolearn=no autolearn_force=no
The content of the mail is:
------------------------------------------------
From: "Lupe Monroe" <monroe.4...@static.vnpt.vn>
To: "my boss address"
Subject: Payment approved
MIME-Version: 1.0
Content-Type: multipart/related;
boundary="boundary_af9c8db46e1111b73fca8b315aafef01"
Message-Id: <20160922063255.e11d3e5...@static.vnpt.vn.local>
Date: Thu, 22 Sep 2016 06:32:55 +0700
--boundary_af9c8db46e1111b73fca8b315aafef01
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit
Dear so,
Your payment has been approved. Your account will be debited within two days.
You can email us for any query regarding your account.
Thank you.
Lupe Monroe
Support
--boundary_af9c8db46e1111b73fca8b315aafef01
Content-Type: application/x-zip-compressed;
name="e6dfa16bdb.zip.virus-scan-me.virus-scan-me"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
filename="e6dfa16bdb.zip.virus-scan-me.virus-scan-me"
------------------------------------------------
There is no spam content, am I right? Just normal words and content that a
normal person might use. I don't need spam learning for all the mails already
classified as spam with a high score. Spam with a low score, like this one,
is the interesting case for spam learning. But when I use these mails for
spam learning, is there a risk of false positives some day, because the
filter has learned that normal mails are also spam?
You are missing the point that Bayes uses more than just body words from a
message. It also looks at headers and meta-data. So those particular body
words could become "neutral" (neither spam nor ham indicators) but the
other components of that message (such as that '.vn.local' message ID)
would be learned as spam signs.
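
To make the point about headers concrete, here is a minimal Python sketch of a
tokenizer that pulls tokens from both the body and a couple of headers. It is
only an illustration: the `tokenize` function, the `hdr:` prefix and the sample
message are assumptions made up for this example, not SpamAssassin's actual
tokenizer or token format.

```python
import re
from email import message_from_string

# Hypothetical sample, loosely modelled on the message quoted above.
raw = (
    "From: sender@example.vn\n"
    "Message-Id: <abc123@host.example.vn.local>\n"
    "Subject: Payment approved\n"
    "\n"
    "Your payment has been approved.\n"
)

def tokenize(raw_message):
    """Return a set of tokens from body words and selected headers."""
    msg = message_from_string(raw_message)
    tokens = set()

    # Body words: lowercase alphabetic runs of three or more letters.
    body = msg.get_payload()
    if isinstance(body, str):  # skip multipart payloads in this sketch
        tokens.update(re.findall(r"[a-z]{3,}", body.lower()))

    # Header pieces get a prefix (illustrative only) so that a header
    # token never collides with the same word appearing in the body.
    for name in ("From", "Message-Id"):
        value = msg.get(name, "")
        for part in re.findall(r"[a-z0-9]+", value.lower()):
            tokens.add(f"hdr:{name.lower()}:{part}")

    return tokens

tokens = tokenize(raw)
print(sorted(tokens))
```

Even with a completely bland body, the token set contains entries such as
`hdr:message-id:local`, which a classifier can learn independently of the
body words.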
This is why you MUST also train your Bayes with HAM messages (and train
them with the --ham flag) so Bayes knows how to recognise 'hammy' or
'neutral' tokens to prevent false-positives.
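
A toy sketch of why the ham training matters, in Python. This is a simplified
per-token estimate with add-one smoothing, not the formula SpamAssassin uses;
the `ToyBayes` class and the token strings are hypothetical and exist only for
this example.

```python
from collections import Counter

class ToyBayes:
    """Simplified per-token spam estimate (illustration only)."""

    def __init__(self):
        self.spam = Counter()   # token -> number of spam messages containing it
        self.ham = Counter()    # token -> number of ham messages containing it
        self.n_spam = 0
        self.n_ham = 0

    def learn(self, tokens, is_spam):
        counts = self.spam if is_spam else self.ham
        counts.update(set(tokens))  # count each token once per message
        if is_spam:
            self.n_spam += 1
        else:
            self.n_ham += 1

    def spamminess(self, token):
        # Smoothed fraction of spam / ham messages containing the token.
        s = (self.spam[token] + 1) / (self.n_spam + 2)
        h = (self.ham[token] + 1) / (self.n_ham + 2)
        return s / (s + h)

# Trained on spam only: ordinary words start to lean spammy.
spam_only = ToyBayes()
spam_only.learn({"payment", "approved", "hdr:message-id:local"}, is_spam=True)

# Trained on spam and ham: shared words are pulled back to neutral.
b = ToyBayes()
b.learn({"payment", "approved", "hdr:message-id:local"}, is_spam=True)
b.learn({"payment", "approved", "invoice"}, is_spam=False)

print(spam_only.spamminess("payment"))      # above 0.5
print(b.spamminess("payment"))              # 0.5, neutral
print(b.spamminess("hdr:message-id:local")) # above 0.5, still a spam sign
```

With ham in the training set, "payment" ends up neutral while the
message-ID token stays a spam indicator, which is the behaviour described
above.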
--
Dave Funk University of Iowa
<dbfunk (at) engineering.uiowa.edu> College of Engineering
319/335-5751 FAX: 319/384-0549 1256 Seamans Center
Sys_admin/Postmaster/cell_admin Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{