Re: message/rfc822 to mbox script for use with sa-learn workflow

Jesse Norell Tue, 15 Aug 2017 15:37:50 -0700

On Tue, 2017-08-15 at 07:55 -0700, Scott wrote:
> I need a way to go from Outlook to train SA if I'm to train at all.
> For most of my users the inbound mail is handed off to a 3rd-party
> Exchange server that I don't have access to.  So a solution like
> setting up a public IMAP folder on the Exchange server is probably not
> possible.  And I presume that process messes with the messages anyway.
> I can't cc the users' mail on my server for later review; there would
> be too many.
> 
> If I'm forwarded spam as an attachment for learning, I would require
> ham from the same method.
> 
> My plan wasn't to make this a daily routine, only to help a few users
> who say they are getting too much spam slipping through all the other
> checks untagged, by training bayes to assist on those problem users.
> These are old email accounts that can't be changed and are on the
> golden spam lists.
> 
> The reason to "reassemble" the extracted attachments was just to make
> it easier for me to access the messages and review them.  Too tedious
> at the console.  I don't know how to use formail to do it, and won't it
> add some more headers to the mess too?
> 
> FWIW, I did try sa-learn on a sample of extracted attachments in their
> raw form.  It was happy with them:
> [root@tn3 msg-1502747659-31280-0]# sa-learn --spam *
> Learned tokens from 97 message(s) (97 message(s) examined)
> 
> But picking through them to vet them would be too tedious at the
> console.  They get random-number-type filenames as part of the
> extraction.
> 
> My constraints are:
> - messages are sent to 3rd party exchange server
> - exchange server access does not exist at this time
> - users use Outlook client at least v2003
> - I use site wide bayes
> - I don't trust the users to feed bayes. 
> - I can't cc their Email on my server for later feeding.
> - I want to use this process for corpus building, not daily
>   maintenance.
> 
> My plan was:
> - receive spam and ham (separately) "as attachments" from Outlook
> - extract attachments
> - review attachments
> - feed attachments to sa-learn
> 
> Open to a better method.
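
The extract-and-reassemble steps in the plan quoted above can be
sketched with Python's standard email and mailbox modules.  This is a
minimal sketch, not a tested production script: it assumes Outlook's
"forward as attachment" arrives as message/rfc822 MIME parts, and the
function names (attached_messages, append_to_mbox,
forwarded_file_to_mbox) are made up for illustration.

```python
import email
import mailbox


def attached_messages(msg):
    """Yield each message/rfc822 attachment found in a parsed email.

    Does not descend into an attached message, so a forward nested
    inside a forward stays part of its parent.
    """
    if msg.get_content_type() == "message/rfc822":
        payload = msg.get_payload()
        # The parser stores the embedded message as a one-item list.
        if isinstance(payload, list) and payload:
            yield payload[0]
        return
    if msg.is_multipart():
        for part in msg.get_payload():
            yield from attached_messages(part)


def append_to_mbox(messages, mbox_path):
    """Append messages to an mbox file (created if missing); return the count."""
    box = mailbox.mbox(mbox_path)
    count = 0
    box.lock()
    try:
        for inner in messages:
            # Serialize with the default (compat32) policy; mailbox
            # adds the "From " separator line itself.
            box.add(inner.as_bytes())
            count += 1
    finally:
        box.close()  # flushes and unlocks
    return count


def forwarded_file_to_mbox(eml_path, mbox_path):
    """Read one forwarded email from disk and collect its attachments."""
    with open(eml_path, "rb") as fh:
        outer = email.message_from_binary_file(fh)
    return append_to_mbox(attached_messages(outer), mbox_path)
```

Keeping spam and ham in separate mbox files lets each one be reviewed in
any mail client that reads mbox (mutt -f spam.mbox, for example) and
then trained with sa-learn --spam --mbox spam.mbox or
sa-learn --ham --mbox ham.mbox.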



An idea for an alternate collection method:  run an IMAP server on your
sa-learn training box, set up a second email account in Outlook for the
users who are training, and have them just drag the ham/spam to training
folders.  I don't know if it's "better," but I'd prefer it myself to
(re)training users to forward as attachment, then piecing things back
together.
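
As a rough illustration of the folder-based approach (not the
train-spam-scanner script, and with no moderation step), the sketch
below builds sa-learn commands for two training folders.  It assumes a
Maildir++ layout where each folder is a ".Name" subdirectory with a
cur/ directory; the folder names TrainSpam and TrainHam and the function
names are hypothetical.

```python
import subprocess
from pathlib import Path

# Hypothetical folder names mapped to the sa-learn flag for each.
TRAINING_FOLDERS = {"TrainSpam": "--spam", "TrainHam": "--ham"}


def sa_learn_commands(maildir_root):
    """Return one sa-learn command per non-empty training folder."""
    root = Path(maildir_root)
    commands = []
    for folder, flag in TRAINING_FOLDERS.items():
        cur = root / f".{folder}" / "cur"
        # Skip folders that are missing or hold no messages.
        if cur.is_dir() and any(cur.iterdir()):
            commands.append(["sa-learn", flag, str(cur)])
    return commands


def train(maildir_root, dry_run=True):
    """Print the commands by default; run them only when dry_run is False."""
    for cmd in sa_learn_commands(maildir_root):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

The dry-run default leaves room to vet what users dropped in the
folders before anything reaches the site-wide bayes database.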

If that's an option you'll pursue and you can use Dovecot as your IMAP
server, check out https://github.com/jnorell/train-spam-scanner as a
training script.  It's designed for exactly the goals you have in mind,
i.e. users supplying training messages which can be moderated and built
into a corpus.

-- 
Jesse Norell
Kentec Communications, Inc.
970-522-8107  -  www.kci.net
