Bart Schaefer <barton.schae...@gmail.com> writes:

> On Sat, Sep 14, 2013 at 1:07 PM, Harry Putnam <rea...@newsguy.com> wrote:
>>
>> 1) Does it matter that I have autolearn turned off in spamassassin
>> conf file 'local.cf' while doing my sandbox work
>
> No, it doesn't. In fact it's probably better that way because SA
> won't waste time updating the bayes database with the mis-classified
> stuff that will have to be backed out later.
>
>> 2) I've derived the mbox files of pure ham and pure spam by running
>> mixed mail so SA has already seen this mail.
>
> That definitely doesn't make any difference *IF* you disabled
> auto-learning in the previous step. It shouldn't make any difference
> even if autolearning was on, because sa-learn will discard the tokens
> from the first pass on each message before re-learning, but it'll be
> somewhat faster if that's not necessary.

Thanks for the confirmations.

Since my last post, I've sort of started over by clearing out
~/.spamassassin, where the db is kept, and reduced procmailrc to
delivering into a spam and a ham mbox.

I ran about 700 fresh mixed messages thru, then went into the ham
findings and peeled out the 60-70 percent that was spam into a pure
spam mbox. I ran enough more mixed mail to gather an equivalent mbox
of pure ham.

I ran those two under sa-learn --spam and then --ham. About 450 msgs
each.

I was a little disappointed to find that after that SA is still
misidentifying spam as ham: when I ran unseen mixed mail thru after
the learning sessions, 50% or more of the spam was mis-classified.

Is it just not enough learning yet, or should I see more improvement
than I have? If the latter, then I'm probably doing something wrong.

So, can you review the summary that follows and tell me if you think I
should be seeing better results?

1) rm -rf ~/.spamassassin

2) run a few mails thru procmail/SA with:

     cat 5mixedMboxMsgs| formail -e -s procmail -m ${sandbox}/trc

   This recreates ~/.spamassassin

   The rc file (trc above) has this:

------- 8<--------- 8<---=--- --------- --------
#shell-script-*--
PATH=/bin:/usr/bin:/usr/local/bin:/sbin:/usr/sbin
SHELL=/bin/sh
MAILDIR=/home/reader/projects/reader/proc/spool
LOGFILE=/home/reader/projects/reader/proc/log/log
ORGMAIL=/home/reader/projects/reader/proc/spool/$LOGNAME
DEFAULT=$ORGMAIL
VERBOSE=YES
LOG=" `echo -e START "
TRAP='formail -XMessage-Id: && date +"%b %d %T%nSTOP"'
PSCRIPTS="/home/reader/projects/perl"
SCRIPTS="/home/reader/scripts/"
MAILARC="/home/reader/proc/spool"

:0fw
| /usr/bin/spamc

:0:
* ^X-Spam-Status: Yes
spam_.in

:0
ham.in
------- --------- ---=--- --------- --------

3) run 700 mixed messages thru the sandbox command shown above

4) Using mutt, I went thru the resulting `ham' mbox and picked out all
   the spam

5) put the remaining ham into an all-ham file, then ran enough more
   mixed mail to capture a few hundred more all-ham messages.

6) Ran

     sa-learn --mbox --spam purespam
     [..]
     sa-learn --mbox --ham pureham

   (Approximately 450 msgs of each)

   I could see the tokens file inside ~/.spamassassin had grown quite a
   bit following those runs.

7) run 700 fresh mixed messages thru the sandbox.

I see SA's ability to tell the difference has improved very little ..
maybe 5-10% (roughly)

Is this result about par for the course? Do I need to run more mail,
pull out spam/ham and run more sa-learn sessions? And if that is the
case, can anyone take a good guess at how much is enough?
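
To put firmer numbers on the rough percentages above, the before and
after runs could be compared with something like the sketch below. It
is only an illustration: the function names count_msgs and percent are
made up here, and it assumes each message in an mbox starts with a
"From " separator line, so a body line that begins with an unescaped
"From " would be miscounted.

```shell
#!/bin/sh
# Rough sketch: count messages in mbox files and turn the counts into a
# percentage. count_msgs and percent are hypothetical helper names.

# Count messages in an mbox by counting "From " separator lines.
# grep -c prints 0 but exits non-zero when nothing matches, so the
# exit status is ignored.
count_msgs() {
    grep -c '^From ' "$1" 2>/dev/null || true
}

# Integer percentage of $1 out of $2; prints 0 when the total is 0.
percent() {
    part=$1
    total=$2
    if [ "$total" -eq 0 ]; then
        echo 0
    else
        echo $(( part * 100 / total ))
    fi
}
```

For example, if the spam hand-picked out of ham.in were saved to a file
called missed_spam (a made-up name), then missed=$(count_msgs missed_spam)
and caught=$(count_msgs spam_.in) give the two counts, and
percent "$missed" $(( missed + caught )) is the share of spam that SA let
through on that run.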