On 31.08.2014 at 16:08, Ted Mittelstaedt wrote:
> On 8/31/2014 2:21 AM, Reindl Harald wrote:
>>
>> On 31.08.2014 at 02:15, Ted Mittelstaedt wrote:
>>> Yes, it does work great when you have the bayes filter turned on and you
>>> take the time to feed it.  And that means you have to feed the
>>> learner both ham and spam and set up reliable sources for those.
>>>
>>> Unfortunately if Bayes is not turned on, it does not catch more than
>>> around 60-70% of spam.  As a Spamassassin user & server admin, I would
>>> really like to see that improve.
>>
>> 60-70% without training is great
>>
>> keep in mind that the first 90% of incoming mail is eaten by RBLs
>> and the 60% applies only to the remaining 10% :-)
>>
>> i think it's impossible to improve that much "out-of-the-box" because
>> that would make it too sensitive, while bayes also has the ham side of
>> your communication to base its decisions on
>>
> 
> Google does it.  It's not impossible.

Google has a lot more data and power to feed a global bayes
and even then they fail, as you say yourself in the next paragraph

i don't care about the 5 spam messages
i care about the one important mail that got eaten

>> i am coming from a commercial device trying to block 100% and there
>> it ends in zero-hour-blocklists with domains even if they are only
>> linked on the youtube page of the blocked facebook notification
>>
>> so i am glad that i have to do some training myself instead of fearing
>> false positives, which do much more harm
> 
> My experience is that the commercial providers like Gmail are now
> so aggressive that false positives are VERY common on their systems,
> this leads to people nowadays quite commonly saying "check your
> spam folder" on their websites and such that send feedback messages.

which defeats the purpose of a spamfilter, and the whole idea
of a junk folder is broken - i need a contentfilter running
reliably before-queue so i don't see the real crap, plus some [SPAM]
tagged messages which are moved by hand to ham/spam to train bayes

> Out of the box the default decision point of 5 is too high anyway.
> 
> I think the emphasis on avoiding false positives in the stock
> (non-Bayes) distribution is far too high. I suspect that over
> the years many good rule submissions have been ignored because
> incidence of false positives with them was too high for the
> SA maintainers.

if you have users to support there is nothing worse than
a false positive - 10 slipped junk mails are not as bad
as a user complaining that he doesn't get legit mail
and is tired of explaining to his customers how they could
make it through the filter

> For a newbie to SA it is disheartening to install SA and not
> get 90% with a 2% false positive, out of the box, but rather get
> 50% with a 0% false positive.  And I think the mistake the
> maintainers are making is over-reliance on bayes.

no - as i showed in another thread that day the opposite is true
the bayes could and should have more impact

but that can't be the default because no software can know
how good the bayes data (ham and spam) really is, and if it
is trained by a noob who fires every newsletter into "spam" it does
damage - mine is trustworthy because i know what i am doing in
that context

the most important thing in training a bayes is to know which
messages you should strictly avoid feeding in

> At the least the SA maintainers should maintain a separate
> "highly aggressive" rule distro that was optional that would
> give us a much higher success rate with a corresponding
> slight increase in false positives.

here i agree - maybe with a meta rule or similar which has
its own score in "local.cf" - but i still think you
need to know what you are doing because such a meta value
also makes compromises; in my case i trust my bayes
nearly unconditionally but would not give other default
rules the same power

> Their design approach has been to rely on Bayes to be trained to go from
> 50% capture out of box with 0% FP to 80-90% capture with 0% FP.

easily said

spammers are not dumb and follow SA updates too
how long do you think such a default would survive in the wild?

> But, the design approach could easily be relying on Bayes to go
> from 90% capture with 5% FP out of the box, to 90% capture with
> 0% FP with Bayes, and the emphasis being on training Bayes on ham,
> not spam.

5% false positives out of the box is just unacceptable

the contentfilter should only be the last line of defense anyway,
with your 90% of spam eaten by postscreen and DNSBL scores
combined with a postfix PTR regex rejecting dialup networks

with the PTR check alone you get rid of around 80% of
botnet junk without anything else
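
to make the layering argument concrete, here is a small sketch of the arithmetic. it assumes each layer only sees what the earlier layers let through, and it uses the rough figures quoted in this thread (about 90% stopped before the content filter, about 60% of the rest caught by an untrained SA). the function name is made up for illustration.

```python
def combined_catch_rate(layer_rates):
    """Share of spam stopped overall when each layer only sees
    what the previous layers let through."""
    passed = 1.0
    for rate in layer_rates:
        passed *= (1.0 - rate)  # fraction still getting through
    return 1.0 - passed


# rough figures from this thread, not measurements:
# ~90% stopped by RBL/postscreen, ~60% of the remainder by untrained SA
overall = combined_catch_rate([0.90, 0.60])
print(f"{overall:.0%}")  # 96%
```

so a content filter that looks weak on its own (60%) still leaves only about 4% of the total spam volume when it sits behind the cheap checks.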

> Note I am pulling the percentages out of my ass, but I 
> think you get the idea.

i get the idea and a few years ago i thought the same way

but seeing how much support time is wasted on angry customers not
getting important mail (including myself), and how little
time it takes each user to just delete his 10 daily
spam mails, never facing the other thousands already blocked,
my attitude in that context changed dramatically

that's also why postscreen runs with a lot of RBLs combined
with differently weighted DNSWLs, so that a single
RBL can't do damage by mistake, like blocking large providers
such as GMX/Web.de (United Internet) not so long ago
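
the idea behind weighted lists can be sketched like this: every list that hits adds its weight (whitelists add a negative one) and the client is only rejected once the sum reaches a threshold. this is a simplified model and not postscreen's code; the list names, weights and threshold are hypothetical and purely illustrative.

```python
def dnsbl_score(hits, weights):
    """Sum the weights of all lists that returned a hit for a client."""
    return sum(weights[name] for name in hits if name in weights)


def should_reject(hits, weights, threshold):
    """Reject only when the combined score reaches the threshold."""
    return dnsbl_score(hits, weights) >= threshold


# hypothetical list names and weights, purely illustrative
WEIGHTS = {
    "rbl-a.example": 5,
    "rbl-b.example": 4,
    "rbl-c.example": 4,
    "dnswl.example": -4,  # whitelist entry lowers the score
}
THRESHOLD = 8

print(should_reject(["rbl-a.example"], WEIGHTS, THRESHOLD))
print(should_reject(["rbl-a.example", "rbl-b.example"], WEIGHTS, THRESHOLD))
print(should_reject(["rbl-a.example", "rbl-b.example", "dnswl.example"],
                    WEIGHTS, THRESHOLD))
```

with these numbers a single listing never reaches the threshold, two listings do, and a whitelist hit pulls a doubly listed client back under it.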

i am a new SA user and built up a complete mailfilter system
in the last few weeks, but with some years of experience from
other systems

what i see here, at least over the weekend, is the result below
and it clearly says "rely on a contentfilter only as last defense
for several reasons"

SA is very expensive (connection time, resources), postscreen is
free and doesn't eat a single smtpd process most of the time

[root@localhost:~]$ cat maillog | grep "CONNECT from" | wc -l
1940

[root@localhost:~]$ cat maillog | grep "NOQUEUE" | grep postscreen | wc -l
1584

[root@localhost:~]$ cat maillog | grep "relay=" | wc -l
286

[root@localhost:~]$ cat maillog | grep "SpamAssassin" | wc -l
58

[root@localhost:~]$ cat maillog | grep "cannot find your reverse hostname" | wc -l
12
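
the same counts can be taken in one pass and put in relation to the number of connects. a minimal sketch, assuming plain substring matching like the grep calls above; the function names and the marker labels are made up for illustration, and the figures are line counts from the log, not exact message counts.

```python
def count_markers(lines, markers):
    """Count, per label, the lines containing all of its substrings
    (the same idea as chaining grep and piping into wc -l)."""
    counts = {label: 0 for label in markers}
    for line in lines:
        for label, needles in markers.items():
            if all(needle in line for needle in needles):
                counts[label] += 1
    return counts


def share(part, total):
    """Percentage of total, 0.0 when total is zero."""
    return 100.0 * part / total if total else 0.0


MARKERS = {
    "connects": ["CONNECT from"],
    "postscreen_rejects": ["NOQUEUE", "postscreen"],
    "relayed": ["relay="],
    "spamassassin": ["SpamAssassin"],
    "no_reverse_hostname": ["cannot find your reverse hostname"],
}

# counts as posted above
posted = {
    "connects": 1940,
    "postscreen_rejects": 1584,
    "relayed": 286,
    "spamassassin": 58,
    "no_reverse_hostname": 12,
}
for label, n in posted.items():
    print(f"{label}: {n} ({share(n, posted['connects']):.1f}% of connects)")
```

to run it against a real log, something like `count_markers(open("maillog"), MARKERS)` should give the same numbers as the grep pipelines.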

>>> On 8/30/2014 2:41 PM, Reindl Harald wrote:
>>>> after two days running SA for the first two test-domains with a
>>>> well trained bayes for the global milter-user: impressive!
>>>>
>>>> the little crap making it through postscreen RBL scoring is detected
>>>>
>>>> 0.000          0          3          0  non-token data: bayes db version
>>>> 0.000          0       1389          0  non-token data: nspam
>>>> 0.000          0       1350          0  non-token data: nham
>>>> 0.000          0     257152          0  non-token data: ntokens
>>>>
>>>> Aug 30 23:34:19 localhost spamd[4882]: spamd: identified spam (8.9/4.5) for sa-milt:189 in 0.6 seconds, 2454 bytes.
>>>> Aug 30 23:34:19 localhost spamd[4882]: spamd: result: Y 8 - BAYES_80,CUST_DNSBL_15,CUST_DNSWL_2,FREEMAIL_ENVFROM_END_DIGIT,FREEMAIL_FROM,FREEMAIL_REPLYTO,FREEMAIL_REPLYTO_END_DIGIT,HTML_MESSAGE,MALFORMED_FREEMAIL,MISSING_HEADERS,RCVD_IN_MSPIKE_H3,RCVD_IN_MSPIKE_WL,REPLYTO_WITHOUT_TO_CC,RP_MATCHES_RCVD,SPF_PASS scantime=0.6,size=2454,user=sa-milt,uid=189,required_score=4.5,rhost=localhost,raddr=127.0.0.1,rport=51671,mid=<snt152-w505982b05a6fbba5c49ad2b1...@phx.gbl>,bayes=0.842503,autolearn=disabled
>>>>
>>>>
>>>> Aug 30 23:34:19 localhost postfix/cleanup[6195]: 3hlrXp5S3dz1w: milter-reject: END-OF-MESSAGE from snt004-omc1s37.hotmail.com[65.55.90.48]: 5.7.1 Blocked by SpamAssassin; from=<jenniferje...@hotmail.com> to=<***>
