Bart Schaefer <barton.schae...@gmail.com> writes:

> On Sat, Sep 14, 2013 at 1:07 PM, Harry Putnam <rea...@newsguy.com> wrote:
>>
>> 1) Does it matter that I have autolearn turned off in the spamassassin
>> conf file 'local.cf' while doing my sandbox work?
>
> No, it doesn't.  In fact it's probably better that way because SA
> won't waste time updating the bayes database with the mis-classified
> stuff that will have to be backed out later.
>
>> 2) I've derived the mbox files of pure ham and pure spam by running
>> mixed mail through SA, so SA has already seen this mail.
>
> That definitely doesn't make any difference *IF* you disabled
> auto-learning in the previous step.  It shouldn't make any difference
> even if autolearning was on, because sa-learn will discard the tokens
> from the first pass on each message before re-learning, but it'll be
> somewhat faster if that's not necessary.

Thanks for the confirmations.

Since my last post, I've more or less started over by clearing out
~/.spamassassin, where the db is kept.  I also reduced the procmailrc
to deliver to just a spam and a ham mbox.

I ran about 700 fresh mixed messages thru, then went into the messages
classified as ham and peeled out the 60-70 percent that were actually
spam into a pure spam mbox.

Ran enough more mixed mail to gather an equivalent mbox of pure ham. 

I ran those two under sa-learn, with --spam and then --ham, about 450
msgs each.

After the learning sessions I ran unseen mixed mail thru, and was a
little disappointed to find that SA is still misidentifying spam as
ham: 50% or worse is misclassified.

Is it just not enough learning yet or should I see more improvement
than I have?  If the latter then I'm probably doing something wrong.

So, can you review the summary that follows and tell me if you think I
should be seeing better results?

1) rm -rf ~/.spamassassin
2) run a few mails thru procmail/SA with:
  cat 5mixedMboxMsgs| formail -e -s procmail -m ${sandbox}/trc
  This recreates ~/.spamassassin

  the rc file (trc above) has this:

-------      8<---------     8<---=---       ---------      -------- 
#shell-script-*--
PATH=/bin:/usr/bin:/usr/local/bin:/sbin:/usr/sbin
SHELL=/bin/sh
MAILDIR=/home/reader/projects/reader/proc/spool
LOGFILE=/home/reader/projects/reader/proc/log/log
ORGMAIL=/home/reader/projects/reader/proc/spool/$LOGNAME
DEFAULT=$ORGMAIL
VERBOSE=YES 
LOG=" `echo -e  START`
"
TRAP='formail -XMessage-Id: && date +"%b %d %T%nSTOP"'

PSCRIPTS="/home/reader/projects/perl"
SCRIPTS="/home/reader/scripts/"
MAILARC="/home/reader/proc/spool"


:0fw
| /usr/bin/spamc

:0:
* ^X-Spam-Status: Yes   
spam_.in

:0:
ham.in
-------        ---------       ---=---       ---------      -------- 

3) run 700 mixed messages thru the sandbox command shown above

4) Using mutt, I went thru the resulting `ham' mbox and picked out all
   the spam

5) put the remaining ham into an all-ham file, then ran enough more mixed
mail to capture a few hundred more all-ham messages.

6) Ran  sa-learn --mbox --spam purespam
   [..] sa-learn --mbox --ham  pureham   
  (Approximately 450 msgs of each)

  I could see the tokens file inside ~/.spamassassin had grown quite a bit
  following those runs.

7) run 700 fresh mixed messages thru the sandbox.

   I see SA's ability to tell the difference has improved very little
   .. maybe 5-10% (roughly)
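
For anyone who wants a firmer number than my rough percentages, here is
a minimal sketch of how the share of hand-picked spam in a `ham' mbox
could be counted.  It was not part of the runs above.  The helper names
(count_msgs, share_pct) and the file name `pickedspam' are made up for
illustration, and counting "From " separator lines is only an
approximation of the message count in an mbox.

```shell
#!/bin/sh
# Hypothetical helpers, not part of procmail or SpamAssassin.

# Count messages in an mbox by counting its "From " separator lines.
count_msgs() {
    grep -c '^From ' "$1"
}

# Print what percentage of the messages in mbox $2 is accounted for
# by the messages in mbox $1 (integer division, rounded down).
share_pct() {
    part=$(count_msgs "$1")
    whole=$(count_msgs "$2")
    if [ "${whole:-0}" -eq 0 ]; then
        echo 0
        return
    fi
    echo $((100 * ${part:-0} / whole))
}
```

With the hand-picked spam saved to `pickedspam' before it is removed
from ham.in, `share_pct pickedspam ham.in' would give the percentage of
delivered "ham" that was really spam.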

Is this result about par for the course?  Do I need to run more mail,
pull out spam/ham and run more sa-learn sessions?  And if that is the
case, can anyone take a good guess at how much is enough?
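
In case it helps whoever answers: one way to confirm that the step 6
runs registered in the db under ~/.spamassassin is to look at the
message counters that `sa-learn --dump magic' prints.  The sketch below
is not something I ran above.  The helper name learned_counts is made
up, and it assumes the usual dump layout where the counter value is the
third column and the label (nspam, nham) is the last field.

```shell
#!/bin/sh
# Hypothetical helper: reads `sa-learn --dump magic` output on stdin
# and prints the number of learned spam and ham messages.
# Assumes the value is in column 3 and the label is the last field.
learned_counts() {
    awk '$NF == "nspam" { spam = $3 }
         $NF == "nham"  { ham  = $3 }
         END { printf "nspam=%d nham=%d\n", spam, ham }'
}

# Typical use (needs SpamAssassin installed):
#   sa-learn --dump magic | learned_counts
```

If the counters come out far from the roughly 450 messages of each kind
fed to sa-learn, that would point at the learning step rather than at
the amount of mail.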


