Re: New plugin: reaper

Matt Simerson Tue, 05 Jun 2012 00:39:25 -0700

On Jun 4, 2012, at 2:56 PM, Stevan Bajić wrote:

> On 04.06.2012 21:09, Matt Simerson wrote:
>> On Jun 4, 2012, at 6:40 AM, Stevan Bajić wrote:
>>> 
>>> Care to explain little more to me what this is all about?
> 
>> 49237 (data_post) dspam: training naughty as spam
>> 
>> Overnight, dspam's spam detection accuracy improved from about 1% to 60%.  
>> In another day or two, I expect training will no longer be necessary.  But 
>> again, I'm learning dspam as I go.
> now I understand what you are trying to do.
> 
>> It might make a lot of sense to add a header with the MAIL FROM information, 
>> before feeding it to dspam.  Is it worth the effort?
> No. IMHO it is not worth the effort.
> 
>>  Is there a standard header name DSPAM looks for?
> You mean for the MAIL FROM information?


Yes.

>> Any advice you offer is appreciated.
> Blindly feeding any 'naughty' mail into DSPAM or SA can result in 
> over-training. I often see people thinking that the more they train, the 
> better it is for the software. But this is not the case in the real world. 
> It is better to train less, and only when it is needed. Spam is usually 
> easier to capture than Ham.

Isn't that what tum (train until mature) is about?  Doesn't that mean dspam 
automatically ignores new training once it has a "mature" corpus?

> If you really want to automatically train 'naughty' mail then I would do it 
> the following way:
> 
> 1) I would create a global, merged group in DSPAM (I don't know if you are 
> aware of DSPAMs group capabilities/concept?)
> 2) Any 'naughty' mail training would go automatically to that group
> 3) You need to ensure that the tokens for that global, merged group do not 
> get too biased towards Spam. So you MUST feed Ham messages too!

How do I determine bias? Programmatically?  Should the corpuses be numerically 
balanced?  So, let's say that I'm installing qpsmtpd on one of my client's 
servers. I have a large corpus of ham available to feed it, and so I do.  Then 
I could feed spam to it until the bias is roughly even, and then pause until 
the bias tilts towards ham again?

> Personally I would even go one step ahead and mimic TONE training (train on 
> error or near error) with an asymmetric thickness threshold for spam/ham.

Okay, I found the paper on OSBF-Lua and TONE, but I have no idea what 
"asymmetric thickness threshold for spam/ham" means, nor how I might implement 
it with my DSPAM plugin. 

> If you don't trust the untroubled.org corpora then go and make two 
> directories on your file system:
> 
> ~/ham
> ~/spam
> 
> In ham you add NN (where NN should be > 1000, if possible) messages that you 
> have manually verified to be innocent. In spam you add the 
> same amount of manually verified spam messages. Then you go and use 
> dspam_train to train a certain DSPAM user (let's call that user 'testmatt'). 
> After you are finished with training you go and download a bunch of months 
> from untroubled.org and extract them (let's say you use 2012-06.7z). That 
> should give you a directory 2012/06. Now go on and check how many of those 
> messages are (according to your token data) not SPAM. Usually most 
> messages will be properly identified as SPAM. If you have the time then check 
> each message that DSPAM is claiming not to be SPAM again and if it is indeed 
> Spam then use dspam --source=corpus --class=spam to learn the message as 
> spam. Don't forget to check messages that your DSPAM is claiming to be Ham as 
> well. If they are Spam then I would learn them as Spam with dspam 
> --source=inoculation --class=spam.
> 
> When you are finished with that month then take the next one and look how 
> much difference you have there. I am very confident that you will have a 
> very, very, very low FP/FN rate with the untroubled.org data. However... for 
> proper training you need to have ham data too. Just spam data will not be 
> sufficient. You have to think about DSPAM or SA or any other statistical 
> anti-spam solution as being like a kid. It has no concept of what is good and 
> what is bad. So you need to teach it what is good and what is bad. After that 
> kid knows what is good and what is bad you are not going to teach it 
> the same thing again. Right? You are only going to correct it when it makes 
> errors. So using that forced learning that you intend to do with 'naughty' is 
> going to do more damage than benefit. It would really be better to only train 
> when it is needed (aka: when the kid makes a mistake, you explain 
> that it has made a mistake and the kid learns the new situation). Forcing it 
> to learn every time (even if it has given you the right answer) is going to 
> damage the kid. Another form of forced learning is when you let the kid train 
> something without first asking for the result. So if you blindly train any 
> 'naughty' mail to DSPAM/SA without first asking where DSPAM/SA would have 
> classified the 'naughty' mail, then you do more damage than benefit. Of course 
> that kind of damage is not huge, but over time this can build up and 
> completely destroy your result.
> 
> btw: I stopped doing that unsupervised training.

I love the analogy. :-)  You need to copy/paste most of that last paragraph 
onto the DSPAM web page. 

While I could possibly go through that much effort on my own mail server, I'm 
heavily biased towards automated means of getting DSPAM trained.  I have only 
one mail server. But I build, repair, and at least partially administer 
hundreds of them. Which means, any 'solution' needs to be mostly automated. It 
also means I'm most interested in solutions that will work reliably well for 
everybody. The needs of the many outweigh those of the few. I can fine tune the 
few manually.

I have been identifying ham by inspecting mailboxes. If a message is read and 
exists in a folder that !~  /drafts|sent|trash|spam|junk|delete/, then it's ham 
the user purposefully saved in that folder. 

I identify spam by looking only at read messages in folders like =~ 
/spam|junk/.  If the user read it and left it in the Spam folder, it's spam. 
Most users don't bother marking spam as 'read', so if it has been done, it was 
deliberate and I can be very confident the messages are unwanted, and almost 
certainly spam.
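
A minimal sketch of the two folder heuristics described above, written in 
Python purely for illustration. The function name and the case-insensitive 
matching are assumptions; only the two folder patterns and the "must be read" 
rule come from the message itself.

```python
import re

# Folder-name patterns copied from the text above. Case-insensitive
# matching is an assumption made for this sketch.
NOT_HAM_FOLDERS = re.compile(r"drafts|sent|trash|spam|junk|delete", re.IGNORECASE)
SPAM_FOLDERS = re.compile(r"spam|junk", re.IGNORECASE)


def classify_saved_message(folder, is_read):
    """Return 'spam', 'innocent', or None when the message proves nothing."""
    if not is_read:
        return None  # unread mail says nothing about the user's intent
    if SPAM_FOLDERS.search(folder):
        return "spam"  # read and left in a spam/junk folder
    if NOT_HAM_FOLDERS.search(folder):
        return None  # drafts, sent, trash, ...: neither ham nor spam
    return "innocent"  # read and deliberately kept in a normal folder
```

For example, a read message in "INBOX.Receipts" would count as ham, while an 
unread message in "Junk" would be skipped.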

I am very confident in the results because I hand tested my training script 
several times, with many thousands of messages. It's been working well for 
training both SA and DSPAM.  The training makes a positive difference.

But it sounds like I'm only doing it mostly right. 

When I'm handling incoming mail via my dspam plugin, I should first process the 
message through DSPAM and get its opinion? Like this:

        dspam --user $user --mode=tum --process     # does --mode matter here?

Then:

if  $dspam_opinion is correct 
         return  "all done"

if  $dspam_opinion is unverifiable
         return  "all done"

# we have a way to know the message was ham/spam, and dspam was incorrect. Train.
if  $dspam_opinion eq spam
        dspam --mode=toe --source=error --class=innocent           # train as ham

if $dspam_opinion eq innocent
        if $dspam_group
                dspam --mode=toe --source=inoculation --class=spam # train as group spam
        else
                dspam --mode=toe --source=error --class=spam       # train as spam


And my script that trains dspam by inspecting the contents of mailboxes should 
train in the same way?  So essentially I'm processing all emails but only 
training DSPAM when it makes an error? 
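
The train-on-error decision in the pseudocode above could be sketched roughly 
as follows. This is Python for illustration only: the function name and the 
`truth` argument are hypothetical, and the dspam flags are simply the ones 
quoted in this thread. Whether they are the right flags is exactly the open 
question here.

```python
def retrain_args(user, dspam_opinion, truth, use_group=False):
    """Return a dspam argument list for a correction, or None.

    dspam_opinion and truth are 'spam' or 'innocent'. truth is None when
    there is no independent way to verify the message.
    """
    if truth is None or dspam_opinion == truth:
        return None  # correct or unverifiable: all done, no training
    if truth == "innocent":
        source = "error"  # dspam said spam, but it was ham
    else:
        # dspam said innocent, but it was spam
        source = "inoculation" if use_group else "error"
    return [
        "dspam", "--user", user, "--mode=toe",
        f"--source={source}", f"--class={truth}",
    ]
```

The returned list would be handed to something like `subprocess.run` with the 
message on standard input. It only builds the command, so it can be checked 
without dspam installed.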

My current learning script, which feeds each mailbox's ham and spam through 
DSPAM to bootstrap it, trains like this:

        dspam --client --user $email --source=corpus --class=innocent
        dspam --client --user $email --source=corpus --class=spam

After it gets above 2,500 hams for a particular user, it throttles back to 
processing only 1 in 10 messages, and after 5,000 it only processes 1 in 100.  
I added that because enormous ham folders were taking forever to process. 
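
As a rough illustration of that throttle (Python, hypothetical names), the 
2,500 and 5,000 thresholds and the 1-in-10 and 1-in-100 rates come from the 
paragraph above. Picking every Nth message by index, rather than sampling at 
random, is an assumption.

```python
def should_process(hams_trained, message_index):
    """Decide whether to feed this ham message to the trainer."""
    if hams_trained > 5000:
        step = 100  # 1 in 100 once the user has more than 5,000 hams
    elif hams_trained > 2500:
        step = 10   # 1 in 10 once the user has more than 2,500 hams
    else:
        step = 1    # below the first threshold, process everything
    return message_index % step == 0
```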

The learning script checks the messages for a DSPAM header. If it finds a 
header and the message was misclassified during delivery, then it retrains as 
shown above (minus the group support).

I'm somewhat concerned by using groups. I know of users that actually subscribe 
to penny stock newsletters.  They're easy prey for pump-n-dump stock scammers. 
But they'd be miffed to lose their precious emails because someone a bit wiser, 
and on the same mail server, righteously classified them as spam. 

> If you want my advice how to stop a lot of spammers then I would do the 
> following:

Here's the rest of the log entries from that email connection I posted earlier. 

11965 Connection from 94-62-192-156.cl.ipv4ilink.net [94.62.192.156]
49237 Accepted connection 1/15 from 50.72.202.227 / S0106001560c96a0b.wp.shawcable.net
49237 Connection from S0106001560c96a0b.wp.shawcable.net [50.72.202.227]
49237 (connect) ident::geoip: CA, Canada
49237 (connect) ident::p0f: Windows XP
49237 (connect) relay: skip: no match
49237 (connect) karma: pass, no record
49237 (connect) dnsbl: fail, NAUGHTY
49237 (connect) earlytalker: skip, naughty
49237 220 mail.theartfarm.com ESMTP qpsmtpd 0.84 ready; send us your mail, but not your spam.
49237 dispatching HELO S0106001560c96a0b.wp.shawcable.net
49237 (helo) helo: skip, naughty

> - Make the SMTP banner span more than just one line. Aka:
>  220-This is my first line of my SMTPD banner
>  220 localhost.localdomain ESMTP qpsmtpd .....
> - Have a delay (configurable) between printing the first line of the banner 
> and the second line of the banner
> - Every idiot sending before the '220<space>FQDN ESMTP....' line gets 
> rejected (early talker) and gets banned from connecting for the next NN 
> seconds. Every reconnecting attempt gets punished with +30 Seconds.

The delay you suggest is the earlytalker plugin.  If a connection fails the 
test, its karma is decremented and the karma plugin denies it access for a 
configurable period.  I have yet to see an earlytalker false positive.
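
The early-talker test and the escalating ban suggested in the quoted list can 
be sketched like this. It is a Python illustration with hypothetical names, 
not the actual earlytalker or karma plugin logic. The +30 seconds per 
reconnect is from the quoted suggestion, and the base ban is the configurable 
"NN".

```python
def is_early_talker(first_client_data_at, greeting_done_at):
    """True if the client sent data before the final '220 ' line went out.

    Both arguments are timestamps in seconds. first_client_data_at is
    None if the client stayed silent during the banner.
    """
    return first_client_data_at is not None and first_client_data_at < greeting_done_at


def ban_seconds(base_ban, reconnect_attempts, step=30):
    """Ban length: the base ban plus `step` seconds per reconnect attempt."""
    return base_ban + step * reconnect_attempts
```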

> - I would run DNSWL checks against the connecting IP
> - If DNSWL had no positive result then I would run DNSBL checks against the 
> IP. I would use weighted results and block the IP if it reaches a certain 
> score/weight.
> - If you want to implement sender whitelisting then I would do those DNSBL 
> checks after the MAIL FROM stage and first check if the sender is in your 
> white list (IMHO this is very dangerous but people often need that stuff, 
> even if it is easily forged).

What I'm doing now is having dnsbl run to completion during the connection 
phase, and mark the connection as naughty if the IP is blacklisted. But, 
naughty doesn't disconnect until MAIL, which gives the user a chance to HELO, 
TLS, and AUTH.  If they auth successfully, then I clear the naughty flag. 
Voila, I don't need a whitelist. :-)
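
A minimal sketch of that deferred-rejection flow, in Python with hypothetical 
names (this is not the real plugin code): the flag is set at connect, cleared 
by a successful AUTH, and only acted on at MAIL.

```python
from dataclasses import dataclass


@dataclass
class Connection:
    naughty: bool = False
    authenticated: bool = False


def on_connect(conn, ip_blacklisted):
    # dnsbl runs to completion here but only marks the connection
    conn.naughty = ip_blacklisted


def on_auth_success(conn):
    # an authenticated user is trusted, whatever the IP's reputation
    conn.authenticated = True
    conn.naughty = False


def on_mail_from(conn):
    # first point at which a still-naughty connection is refused
    return "deny" if conn.naughty else "continue"
```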

> - I would maybe do as well RHSBL (IMHO DNSBL is more than enough).

This is my experience.  I run the RHSBL before the dnsbl, so it has the first 
shot at tagging a message as naughty. And it catches 0.9% as many as dnsbl.  
But I only have one RHSBL list (dsn.rfc-ignorant.org) configured.

> - If it is important to you then doing something like GeoIP lookups could be 
> interesting for certain users (either to block or whitelist based on 
> continent, region, country). I usually use that data to compute the distance 
> between me and the sender. The bigger the distance is the more likely it is 
> spam (search for SNARE if you need a research paper on that topic).
> - I would run as well something like p0f and by default add punishing points 
> to each connection coming from a desktop OS (aka: Windows XP, Windows 7, 
> Windows CE, etc....)

You'll see that I have plugins for both GeoIP and p0f  active. :-)   And p0f v2 
and p0f v3 results disagree. There are limits to p0f's OS detection.
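
For the sender-distance idea in the quoted text, one common way to turn two 
GeoIP coordinates into a distance is the haversine great-circle formula. The 
sketch below is Python and assumes the GeoIP lookup has already produced 
latitude and longitude in degrees. It is not necessarily how SNARE or any 
plugin computes it.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # clamp guards against rounding pushing a fractionally above 1
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))
```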

SNARE:  
http://smartech.gatech.edu/bitstream/handle/1853/25135/GT-CSE-08-02.pdf?sequence=1

Their findings echo what I see in my logs. Windows PCs sending email to my 
server from outside my country have a 95% chance of being spam. The problem I 
see with SNARE is the way-too-high 7% false positive rate. It's useful 
information, but not something that can be used on its own to reject the 
connection. But you can use the data to make the sender jump through extra 
hoops.

> - I would as well look if the connecting IP is from a dynamic/dialup range 
> and add some punishing points to that connection (I think there are DNSBLs 
> available to identify dialups or end-user DSLx lines).

That DNSBL is pbl.spamhaus.org, included in zen.spamhaus.org.

> If all points from above reach a certain score then I would disconnect the 
> client.

Adding up all these scores and using them to disconnect feels rather like 
reinventing SpamAssassin. Except way faster.
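
The "add up the points and disconnect above a threshold" idea might look like 
the Python sketch below. Every weight, signal name, and the threshold are 
made-up illustrative values, not taken from any plugin. Skipping the score for 
whitelisted IPs follows the DNSWL-first ordering suggested earlier.

```python
# All weights and the threshold are invented for illustration.
WEIGHTS = {"dnsbl": 5, "dynamic_ip": 2, "desktop_os": 2, "far_away": 1}
THRESHOLD = 6


def connection_score(signals):
    """Sum the weights of the signals observed for this connection."""
    return sum(WEIGHTS.get(signal, 0) for signal in signals)


def should_disconnect(signals, whitelisted=False):
    """Disconnect when the score reaches the threshold, unless whitelisted."""
    if whitelisted:
        return False  # a DNSWL hit short-circuits the blacklist scoring
    return connection_score(signals) >= THRESHOLD
```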

> This should IMHO already block most Spam messages even reaching your queue.

And it does.

> If you use DSPAM then you should not forget to clean unused tokens from time 
> to time. One 'all-in-one' script that can help you do that is this one here 
> -> 
> http://dspam.git.sourceforge.net/git/gitweb.cgi?p=dspam/dspam;a=tree;f=contrib/dspam_maintenance;hb=HEAD
>  <-

Duly noted in the plugin.

> ohhh and btw: I would disconnect those bastards as fast as I can. Forget 
> about trying to be smart and doing fancy computations and such. It's mostly 
> useless. Get them off your line and save your resources for connections that 
> have value.

My karma plugin does exactly that. :-)  With most plugins, I'm not willing to 
disconnect immediately due to false positives. But if you have bad karma (you 
sent me junk but nothing good), then I feel perfectly justified in being rude. 

57992 Accepted connection 0/15 from 69.61.27.199 / Unknown
57992 Connection from Unknown [69.61.27.199]
57992 (connect) ident::geoip: US, United States
57992 (connect) ident::p0f_2: Windows (2000 SP4, XP SP1+)
57992 (connect) ident::p0f_3: Linux 3.x
57992 (connect) relay: skip: no match
57992 (connect) karma: fail, (2 naughty, 0 nice, 3 connects)
57992 550 You were naughty. You are penalized for 0.39 more days.
57992 click, disconnecting
57992 (post-connection) connection_time: 0.102 s.
54597 cleaning up after 57992

But I also provide others with the option of disconnecting later. To each their 
own.

> btw2: Every IP passing the above test scenario should next time not be forced 
> to go through the whole evaluation again. I would cache the result for a bunch 
> of hours and refresh the cache time each time the IP reconnects. I would 
> delete the cache entry after NN hours/minutes without reconnection from the 
> IP.

I haven't implemented that yet, but I plan to add a similar feature to karma. 
If you're a good sender, bypass most tests. The smite feature of karma is 
complete; it's time to hand out lollipops to good senders.
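
The caching idea in the quoted paragraph (remember a passing IP, refresh the 
entry on each reconnect, drop it after a quiet period) could be sketched as 
below. This is a Python illustration with hypothetical names, not a planned 
karma implementation, and the injectable clock exists only to make it easy to 
test.

```python
import time


class PassCache:
    """Remember IPs that passed all checks, with a sliding expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._expires = {}  # ip -> expiry timestamp

    def remember(self, ip):
        self._expires[ip] = self.clock() + self.ttl

    def is_known_good(self, ip):
        expires = self._expires.get(ip)
        if expires is None:
            return False
        now = self.clock()
        if now >= expires:
            del self._expires[ip]  # no reconnect within the TTL: forget it
            return False
        self._expires[ip] = now + self.ttl  # refresh on every reconnect
        return True
```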

Matt
