On Wed, Dec 17, 2003 at 11:32:48AM -0500, Pedro Sam wrote:
> On December 17, 2003 11:20 am, stan wrote:
> > On Wed, Dec 17, 2003 at 11:00:04AM -0500, Pedro Sam wrote:
> > > On December 17, 2003 10:16 am, stan wrote:
> > > > BTW, I've got a macro that runs sa-learn, and another that runs
> > > > spamassassin -r.  If I run the 2nd one first and then the first one, I
> > > > get a message about 0 messages learned from; whereas if I reverse the
> > > > order, I get 1 message learned.  So it looks to me like I can't
> > > > reproduce your error here.
> > >
> > > ... sorry, I wasn't clear before ...
> > >
> > > Reporting and learning both work with "spamassassin -r".  BUT!!
> > > Remember that SA markup must be stripped before reporting or learning.
> > > Now,
> > >
> > > 1. "sa-learn" command automatically strip SA markup before learning,
> > > WORKS! 2. "spamassassin -r" command claims to strip SA markup before
> > > reporting, WORKS! (ie it reports the spam without SA markup)
> > > 3.  "spamassassin -r" command claims to strip SA markup before learning,
> > > DOES NOT WORK!!  (ie it learns the spam WITH SA markup)
> > >
> > > Why did I suspect that 3 did not work?  Because I found many tokens in
> > > the bayes database that could only have come from SA markup.  Tokens like
> > > "BAYES_99" were considered VERY spammy.
> > >
> > > I'm begging you: can someone please either confirm this problem so we
> > > can report it, or tell me that it's my problem only ...
> >
> > OK, if the problem exists, I should have it. But I'm a newbie here. Tell me
> > how to check my tokens, and I'll report back.
> 
> try this:
> 
> sa-learn --dump all | sort -n > SOME_FILE
> 
> You should get something like the following:
> 
> ...
> 0.978          2          0 1067239234  UD:mygrantnow.org
> 0.985          3          0 1066771497  N:junkN.jpg
> 0.958          1          0 1067155182  N:NsN-NkwN-N-jNiN
> 0.958          1          0 1067040199  H*r:8LN3VP9W.vip.fi
> 0.985          3          0 1071089788  comp-01_05.gif
> 0.958          1          0 1067324476  HTo:U*sarajonsson
> 0.958          1          0 1066969001  H*M:7719
> 0.958          1          0 1067081011  H*m:h9PBOoog018734
> ...
> 
> The first column is the "spamminess", the second is the number of occurrences
> as spam, the third is the number of occurrences as ham, the fourth is the time
> (in Unix seconds), and the fifth is the token itself...
> 
> So if you find tokens that could only have come from SA markup (stuff like
> BAYES_99), then it probably means the mechanism used to invoke bayes learning
> did not strip the SA markup...

I did the above, and then grepped for BAYES (the commands I used are sketched
after the results).  Here is the result:


0.985          3          0 1071579889  BAYES_80
0.992          6          0 1071677182  BAYES_60
0.995          9          0 1071621059  BAYES_70
0.997         15          0 1071966119  BAYES_50
0.997         16          0 1071724596  BAYES_90
1.000        187          0 1134561784  BAYES_99
1.000        236          0 1134561784  N:BAYES_NN
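
For reference, what I ran was roughly the following (the grep pattern is just
my own choice, not anything official):

sa-learn --dump all | sort -n > SOME_FILE
grep BAYES SOME_FILE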

So, it looks like I'm seeing the same behavior as you are.

Now the question: is this a problem?
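
(If someone wants a more direct test than mine, I suppose you could feed
"spamassassin -r" one spam message that already carries SA markup and then
re-dump the database, something like:

spamassassin -r < some-marked-up-spam
sa-learn --dump all | grep BAYES_99

If the BAYES_99 count goes up by one, the markup really was learned.  The file
name is just a placeholder; this is only my guess at a test.)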

-- 
"They that would give up essential liberty for temporary safety deserve
neither liberty nor safety."
                                                -- Benjamin Franklin

