On 8/31/2014 7:35 AM, Reindl Harald wrote:
On 31.08.2014 at 16:08, Ted Mittelstaedt wrote:
On 8/31/2014 2:21 AM, Reindl Harald wrote:
On 31.08.2014 at 02:15, Ted Mittelstaedt wrote:
Yes, it does work great when you have the Bayes filter turned on and you
take the time to feed it. And that means you have to feed the learner
both ham and spam and set up reliable sources for those.
Unfortunately if Bayes is not turned on, it does not catch more than
around 60-70% of spam. As a SpamAssassin user & server admin, I would
really like to see that improve.
60-70% without training is great
keep in mind that the first 90% of incoming is eaten by RBLs
and the 60% is only of the remaining 10% :-)
i think it's impossible to improve that much "out-of-the-box" because
that would make it too sensitive, while the bayes also has the ham side
of your communication for its decisions
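To make the arithmetic in that point explicit, here is a small sketch (the function name is made up for illustration, and the 90% and 60% figures are the rough numbers quoted above, not measurements): if each filtering stage only sees what the previous one let through, the overall catch rate is one minus the product of the per-stage pass rates.

```python
def combined_catch_rate(*stage_rates):
    """Overall fraction of spam caught by a chain of filters.

    Each rate is the fraction of the spam reaching that stage which the
    stage catches; later stages only see what earlier ones passed.
    """
    passed = 1.0
    for rate in stage_rates:
        passed *= (1.0 - rate)
    return 1.0 - passed


# Rough figures from the thread: RBLs eat 90%, untrained SA catches
# 60% of the remaining 10%.
overall = combined_catch_rate(0.90, 0.60)
print(f"{overall:.0%}")  # prints 96%
```

So under those assumed figures, "60% without training" still means roughly 4 in 100 spam attempts reach the user.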
Google does it. It's not impossible.
Google has a lot more data and power to feed a global bayes
and even then: they fail as you say yourself in the next paragraph
i don't care for the 5 spam messages
i care for the eaten important one
eaten? You're the one who is deleting the stuff because it's being tagged
as spam, that's YOUR decision not SA's. SA is just saying "we think
this is spam" you are deciding to eat the message.
i am coming from a commercial device trying to block 100% and there
it ends in zero-hour-blocklists with domains even if they are only
linked on the youtube page of the blocked facebook notification
so i am glad that i have to do some training by myself instead of
fearing false positives which do much more harm
My experience is that the commercial providers like Gmail are now
so aggressive that false positives are VERY common on their systems,
this leads to people nowadays quite commonly saying "check your
spam folder" on their websites and such that send feedback messages.
which defeats the intention of a spamfilter and the whole idea
of a junk-folder is broken - i need a contentfilter running
reliably before-queue so i don't see the real crap, plus some [SPAM]
tagged messages which are hand-moved to ham/spam to train bayes
Out of the box the default decision point of 5 is too high anyway.
I think the emphasis on avoiding false positives in the stock
(non-Bayes) distribution is far too high. I suspect that over
the years many good rule submissions have been ignored because
incidence of false positives with them was too high for the
SA maintainers.
if you have users to support there is nothing worse than
a false positive - 10 slipped junk mails are not as bad
as having a user complaining that he doesn't get legit mail
and is tired of trying to explain to his customers how they could
make it through the filter
If you tag the message as SPAM, either in the header or in the
subject line and pass it to the user, the user gets the message.
The user determines the level that the message "slips as junk"
not SA - they determine it through the spam score.
They can change their spam score (or you can change it for them)
so that SA is less aggressive and catches less spam.
They can change whatever rule they have that moves spam into a
junk mail folder to simply leave it in the inbox (or you can)
The only way a user would complain that they didn't get a message
is if they had configured their setup so that any incoming message
that SA thinks is spam, gets deleted.
They could instead configure their setup so that messages SA thinks
are definitely spam (high spam score) go into junk, messages that
SA thinks might be spam (moderate spam score) are merely flagged in
the subject line as "POSSIBLE SPAM" then put into the inbox where
they see them.
Or they could just have all mail delivered to their inbox and
tag it spam in the subject line.
You merely have SA put the spam score in the header then use Procmail
to munge up the subject line or delete the message or whatever.
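A minimal sketch of that tiered policy, written in Python rather than as a Procmail recipe. The 5.0 threshold is the stock decision point mentioned above; the 10.0 "definitely spam" threshold, the folder names, and the function names are assumptions for illustration only. It reads the score from the `score=` field of an `X-Spam-Status` style header value.

```python
import re

# 5.0 is the default decision point discussed in the thread; 10.0 is an
# assumed cut-off for "definitely spam", not a SpamAssassin default.
TAG_SCORE = 5.0
JUNK_SCORE = 10.0

SCORE_RE = re.compile(r"score=(-?\d+(?:\.\d+)?)")


def parse_score(status_header):
    """Return the numeric score from an X-Spam-Status value, or None."""
    match = SCORE_RE.search(status_header)
    return float(match.group(1)) if match else None


def route(subject, status_header):
    """Return (folder, subject) for a message; nothing is ever deleted."""
    score = parse_score(status_header)
    if score is None or score < TAG_SCORE:
        return "INBOX", subject                  # treated as ham
    if score >= JUNK_SCORE:
        return "Junk", subject                   # high score: junk folder
    return "INBOX", "POSSIBLE SPAM: " + subject  # moderate: tag, still deliver


print(route("Invoice", "Yes, score=6.3 required=5.0"))
```

The point of the design is that every branch still delivers the message somewhere the user can find it; only the user's own rules would turn a tag into a deletion.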
For a newbie to SA it is disheartening to install SA and not
get 90% with a 2% false positive, out of the box, but rather get
50% with a 0% false positive. And I think the mistake the
maintainers are making is over-reliance on Bayes.
no - as i showed in another thread that day the opposite is true
the bayes could and should have more impact
I did not see that other thread (and I'm not really interested in
looking it up). If you're going to disagree, at least explain the
reasoning in the same thread and don't make people dig it out.
but that can't be default values because no software can know
how good the bayes data (ham and spam) really are, and if it
is trained by a noob firing any newsletter into "spam" it does
damage - mine is trustworthy because i know what i am doing in
that context
the most important thing in training a bayes is to know which
messages you should strongly avoid feeding in
of course.
At the least the SA maintainers should maintain a separate
"highly aggressive" rule distro that was optional that would
give us a much higher success rate with a corresponding
slight increase in false positives.
here i agree - maybe with a meta-rule or such which has
its own score in "local.cf" - but i still think you
need to know what you are doing because such a meta value
also makes compromises and in my case i trust my bayes
nearly unconditionally but would not give other default
rules the same power
Their design approach has been to rely on Bayes to be trained to go from 50%
capture out of box with 0% FP to 80-90% capture with 0% FP.
easily said
spammers are not dumb and follow SA updates too
how long do you think such a default would survive in the wild?
Uh, spammers don't even like the 50% capture out of the box
and constantly work to defeat the rules. If even 1 of their messages
is blocked that is too many.
But, the design approach could easily be relying on Bayes to go
from 90% capture with 5% FP out of the box, to 90% capture with
0% FP with Bayes, and the emphasis being on training Bayes on ham,
not spam.
5% false positives out of the box is just unacceptable
To you, maybe. Not to Google or Hotmail, and a lot of people use
those services. No, they are not 5% FP but they ARE accepting some
FP - and I'm quite sure the actual amount is a trade secret.
Granted, a lot of their base is free clients so they can tell them
to go pound sand if those clients complain about FPs. But many are
businesses and I think their reasoning is sound. They are selling into
the real world and the real world has a lot more people complaining
a lot more about spam, than about FP's.
the contentfilter should anyway be only the last defense
with 90% of your spam eaten by postscreen and DNSBL scores
combined with postfix PTR-regex rejects of dialup networks
with the PTR check alone you get rid of around 80% of
botnet junk without anything else
Those are the easiest things to defeat, and today I see most
spam coming in from hosts with valid PTRs and valid domain
names. And the DNSBLs are getting less effective probably
because spammers are using large cable networks that hand
out IP numbers via DHCP and the spammers use these to rapidly
cycle through many IP numbers with their fake servers.
Note I am pulling the percentages out of my ass, but I
think you get the idea.
i get the idea and a few years ago i thought the same way
but after seeing how much support time is wasted on angry customers
not getting important mail (including myself) and how little
time it takes for each user to just delete his 10 daily
spams, never facing the other thousands already blocked,
my attitude in that context changed dramatically
I do not agree that 10 daily spams is acceptable. The only
valid number of spams a user should ever get is 0. If you
say that 10 out of 10,000 are OK then the spammers just think
"wheeee! That means all I have to do is send that guy
1000 spams and I'll get 1 of them through to him. And if I can get
1 through that lets me steal his credit card data from his
PC then it's a great day for me!" And the spammers can
definitely send 1000 spams.
Gmail is also aiming at 0 daily spams not 10, and people rave about how
good their spam filter is, and those same people are NOT
complaining about losing important mail in Gmail's junk
folder (even though they do), so my attitude is 180 degrees
opposite yours.
that's also why postscreen with a lot of RBLs combined
with differently weighted DNSWLs, so that a single
RBL can't do damage by mistake, like blocking large providers
like GMX/Web.de (United Internet) not so long ago
i am a new SA user who built up a complete mailfilter system
over the last few weeks but with some years of experience from
other systems
what i see here at least over the weekend is the result below
and it says clearly "rely on a contentfilter only as last defense
for several reasons"
SA is very expensive (connection time, resources), postscreen is
free and doesn't eat a single smtpd process most of the time
[root@localhost:~]$ cat maillog | grep "CONNECT from" | wc -l
1940
[root@localhost:~]$ cat maillog | grep "NOQUEUE" | grep postscreen | wc -l
1584
[root@localhost:~]$ cat maillog | grep "relay=" | wc -l
286
[root@localhost:~]$ cat maillog | grep "SpamAssassin" | wc -l
58
[root@localhost:~]$ cat maillog | grep "cannot find your reverse hostname" | wc -l
12
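The same tallies can be collected in one pass over the log and turned into shares of all connections. This is only a sketch: the substrings are the ones from the grep commands above, while the category names and helper functions are made up for illustration.

```python
from collections import Counter

# Substrings copied from the grep commands above; a line counts for a
# category only if it contains every substring listed for it.
PATTERNS = {
    "connects": ("CONNECT from",),
    "postscreen_rejects": ("NOQUEUE", "postscreen"),
    "relayed": ("relay=",),
    "spamassassin": ("SpamAssassin",),
    "no_reverse_hostname": ("cannot find your reverse hostname",),
}


def count_log_lines(lines):
    """Count matching lines per category in a single pass."""
    counts = Counter({name: 0 for name in PATTERNS})
    for line in lines:
        for name, needles in PATTERNS.items():
            if all(needle in line for needle in needles):
                counts[name] += 1
    return counts


def share(part, total):
    """Percentage of total, or 0.0 when there were no connections."""
    return 100.0 * part / total if total else 0.0


# Using the counts posted above rather than re-reading a log file:
print(f"{share(1584, 1940):.1f}% of connections ended at postscreen")
```

With the numbers above, postscreen turned away 1584 of 1940 connections (about 82%), while only 58 log lines mention SpamAssassin at all. To run it against a real log, pass an open file object, which iterates line by line, to `count_log_lines`.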
I don't use postscreen or Postfix, but I do greylist and that does
a similar thing: it gets rid of spambot mail. Even though spambot mail
is only a small share of spam nowadays.
In my world the cost of hardware that has CPU power and memory power
that far and away exceeds the disk I/O is a little bit higher than dirt.
Have you profiled your servers? Mine spend most of their CPU power
loafing along, the disk I/O channel can be almost saturated and the
CPU's cores are still idling along. But of course, these are servers
that are only a few years old.
If I was serving 100K mail clients I might feel differently.
Ted
On 8/30/2014 2:41 PM, Reindl Harald wrote:
after two days running SA for the first two test-domains with a
well trained bayes for the global milter-user: impressive!
the few crap making it through postscreen RBL scoring is detected
0.000 0 3 0 non-token data: bayes db version
0.000 0 1389 0 non-token data: nspam
0.000 0 1350 0 non-token data: nham
0.000 0 257152 0 non-token data: ntokens
Aug 30 23:34:19 localhost spamd[4882]: spamd: identified spam (8.9/4.5) for sa-milt:189 in 0.6 seconds, 2454 bytes.
Aug 30 23:34:19 localhost spamd[4882]: spamd: result: Y 8 - BAYES_80,CUST_DNSBL_15,CUST_DNSWL_2,FREEMAIL_ENVFROM_END_DIGIT,FREEMAIL_FROM,FREEMAIL_REPLYTO,FREEMAIL_REPLYTO_END_DIGIT,HTML_MESSAGE,MALFORMED_FREEMAIL,MISSING_HEADERS,RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL,REPLYTO_WITHOUT_TO_CC,RP_MATCHES_RCVD,SPF_PASS scantime=0.6,size=2454,user=sa-milt,uid=189,required_score=4.5,rhost=localhost,raddr=127.0.0.1,rport=51671,mid=<[email protected]>,bayes=0.842503,autolearn=disabled
Aug 30 23:34:19 localhost postfix/cleanup[6195]: 3hlrXp5S3dz1w: milter-reject: END-OF-MESSAGE from snt004-omc1s37.hotmail.com[65.55.90.48]: 5.7.1 Blocked by SpamAssassin; from=<[email protected]> to=<***>