Re: Learning both spam and ham, edge case

2014-08-22 Thread Karsten Bräckelmann
On Fri, 2014-08-22 at 17:44 -0700, Ian Zimmerman wrote: > I know that if you misclassify a mail as spam with > > sa-learn --spam /path/to/ham > > you can later run > > sa-learn --ham /path/to/ham > > to correct the mistake, and SA will do the right thing (i.e. forget the > wrong classification

Learning both spam and ham, edge case

2014-08-22 Thread Ian Zimmerman
I know that if you misclassify a mail as spam with sa-learn --spam /path/to/ham you can later run sa-learn --ham /path/to/ham to correct the mistake, and SA will do the right thing (i.e. forget the wrong classification). And conversely, with ham <-> spam. My question is, what happens if you

Re: Provide sa-learn with a CSV file of spam and ham?

2012-11-26 Thread John Hardin
On Mon, 26 Nov 2012, John Hardin wrote: On Mon, 26 Nov 2012, Ed Flecko wrote: Hi folks, I'm running SpamAssassin version 3.3.2 (running on Perl version 5.14.2) on FreeBSD 9.0. I've exported a bunch of spam and ham messages from my Barracuda 400. What format did the Barracuda

Re: Provide sa-learn with a CSV file of spam and ham?

2012-11-26 Thread darxus
in version 3.3.2 (running on Perl version > 5.14.2) on FreeBSD 9.0. > > I've exported a bunch of spam and ham messages from my Barracuda 400. > > I have an Excel .csv file of about 2500 spam messages and 2500 ham > messages, and I'm wondering if I can supply those as a

Re: Provide sa-learn with a CSV file of spam and ham?

2012-11-26 Thread John Hardin
On Mon, 26 Nov 2012, Ed Flecko wrote: Hi folks, I'm running SpamAssassin version 3.3.2 (running on Perl version 5.14.2) on FreeBSD 9.0. I've exported a bunch of spam and ham messages from my Barracuda 400. What format did the Barracuda export the messages in? It might be possible t

Provide sa-learn with a CSV file of spam and ham?

2012-11-26 Thread Ed Flecko
Hi folks, I'm running SpamAssassin version 3.3.2 (running on Perl version 5.14.2) on FreeBSD 9.0. I've exported a bunch of spam and ham messages from my Barracuda 400. I have an Excel .csv file of about 2500 spam messages and 2500 ham messages, and I'm wondering if I can su

Re: [Q] Bayes dB: ratio of spam and ham heavily in favour of ham

2011-08-22 Thread Benny Pedersen
On Mon, 22 Aug 2011 15:46:14 +0200, J4K wrote: # sa-learn --dump magic 0.000 0 3 0 non-token data: bayes db version 0.000 0640 0 non-token data: nspam 0.000 0 7001 0 non-token data: nham 0.000 0 36689
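The nspam and nham counts in a dump like the one quoted above can be extracted and compared with a short script. This is only a sketch: it assumes each metadata line ends in `non-token data: <name>` and carries its value in the third whitespace-separated column, as the quoted output suggests. The sample text and the function names are made up for illustration.

```python
import re

# Assumed line shape: <prob> <col2> <value> <col4> non-token data: <name>
LINE_RE = re.compile(r"\s*\S+\s+\S+\s+(\d+)\s+\S+\s+non-token data:\s*(.+?)\s*$")


def parse_dump_magic(text):
    """Return {name: value} for the 'non-token data' lines of a dump."""
    counts = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


def spam_fraction(counts):
    """Fraction of learned messages that were spam (0.0 if nothing learned)."""
    total = counts["nspam"] + counts["nham"]
    return counts["nspam"] / total if total else 0.0


# Illustrative input only, not taken from a real database.
SAMPLE = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0        100          0  non-token data: nspam
0.000          0        300          0  non-token data: nham
"""

if __name__ == "__main__":
    print(spam_fraction(parse_dump_magic(SAMPLE)))
```

In practice the text would come from the output of `sa-learn --dump magic` rather than from a literal string.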

[Q] Bayes dB: ratio of spam and ham heavily in favour of ham

2011-08-22 Thread J4K
Afternoon gentlemen, Seems the Bayes dB has become lop-sided in favour of ham. SA is doing its job as there is little spam coming through recently. I had hoped we could keep it one third spam and two thirds ham. Does the slant shown below (nspam versus nham) cause any problems w

Re: Spam and Ham corpora

2011-01-24 Thread J4
On 01/24/2011 04:42 PM, J4 wrote: > Dear all, > > I am sure this question has come up before on this list, yet after > spending a little while trawling Google, I did not find any sites :( > So I ask here! > > Are there any recent (<6 months) ham or spam corpora out there that > I can down

Spam and Ham corpora

2011-01-24 Thread J4
Dear all, I am sure this question has come up before on this list, yet after spending a little while trawling Google, I did not find any sites :( So I ask here! Are there any recent (<6 months) ham or spam corpora out there that I can download and feed into sa-learn? I would like to give

Re: Bayes spam and ham out of proportion

2010-04-30 Thread RW
On Fri, 30 Apr 2010 11:53:49 +0200 "Giampaolo Tomassoni" wrote: > > Correct, but if those counts came from autolearning 90% of spam and > > 30% of ham, then rescaling may be the correct thing to do. > > > > It may also be pragmatic, if a high spam/ham ratio is leading to > > FPs, to keep the le

RE: Bayes spam and ham out of proportion

2010-04-30 Thread Giampaolo Tomassoni
> On Thu, 29 Apr 2010 18:32:04 +0200 > "Giampaolo Tomassoni" wrote: > > > > what you need to do is write a script that divides the metadata > > > num_spam value and all the token Nspam counts by 3. The updated > > > database can then be loaded back in with --restore. > > > > I don't know if this is

Re: Bayes spam and ham out of proportion

2010-04-29 Thread Matt Kettler
On 4/29/2010 8:25 AM, Frank Bures wrote: > I've been running spamassassin for years. I am using auto-learn with very > conservative thresholds. However, after several years of usage my spam > database is about three times larger than my ham database and I am starting > to see false positives. > >

Re: Bayes spam and ham out of proportion

2010-04-29 Thread RW
On Thu, 29 Apr 2010 18:32:04 +0200 "Giampaolo Tomassoni" wrote: > > what you need to do is write a script that divides the metadata > > num_spam value and all the token Nspam counts by 3. The updated > > database can then be loaded back in with --restore. > > I don't know if this is going to be eff
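The rescaling script suggested in this thread (divide the num_spam metadata value and every token's spam count by 3, then reload with --restore) could look roughly like the sketch below. The line layout is an assumption, not something the thread spells out: tab-separated `v <value> <name>` metadata lines and `t <nspam> <nham> <atime> <token>` token lines, as produced by `sa-learn --backup`. Check it against a real backup file before relying on it. The function name is hypothetical.

```python
def rescale_spam_counts(lines, divisor=3):
    """Divide num_spam and each token's spam count by `divisor`.

    ASSUMED tab-separated line formats (verify against a real backup):
      v <value> <name>                    metadata, e.g. num_spam
      t <nspam> <nham> <atime> <token>    one line per token
    Any other line is passed through unchanged.
    """
    out = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        is_num_spam = fields[0] == "v" and len(fields) >= 3 and fields[2] == "num_spam"
        is_token = fields[0] == "t" and len(fields) >= 5
        if is_num_spam or is_token:
            # Integer division: spam counts of 1 or 2 become 0.
            fields[1] = str(int(fields[1]) // divisor)
        out.append("\t".join(fields))
    return out


if __name__ == "__main__":
    # Made-up example lines in the assumed format.
    demo = ["v\t900\tnum_spam", "t\t9\t1\t1272500000\tabc"]
    print(rescale_spam_counts(demo))
```

Integer division is a design choice here; it leaves some tokens with a spam count of zero, and whether such tokens should then be dropped is left open, as it is in the thread.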

RE: Bayes spam and ham out of proportion

2010-04-29 Thread Giampaolo Tomassoni
> Hi, > > > I would instead, in order of effectiveness: > > > >        a) expire old tokens; > >        b) eliminate tokens with very few ham/spam occurrences. > >        c) eliminate tokens with very close nham to nspam values; > > Can you explain how to do this, or point to documentation that w

Re: Bayes spam and ham out of proportion

2010-04-29 Thread Alex
Hi, > I would instead, in order of effectiveness: > >        a) expire old tokens; >        b) eliminate tokens with very few ham/spam occurrences. >        c) eliminate tokens with very close nham to nspam values; Can you explain how to do this, or point to documentation that would explain? My

RE: Bayes spam and ham out of proportion

2010-04-29 Thread Giampaolo Tomassoni
> On Thu, 29 Apr 2010 08:25:29 -0400 > Frank Bures wrote: > what you need to do is write a script that divides the metadata num_spam > value and all the token Nspam counts by 3. The updated database can > then be loaded back in with --restore. I don't know if this is going to be effective. After all

Re: Bayes spam and ham out of proportion

2010-04-29 Thread RW
On Thu, 29 Apr 2010 08:25:29 -0400 Frank Bures wrote: > I've been running spamassassin for years. I am using auto-learn with > very conservative thresholds. However, after several years of usage > my spam database is about three times larger than my ham database and > I am starting to see false

Re: Bayes spam and ham out of proportion

2010-04-29 Thread Jason Bertoch
On 2010/04/29 8:25 AM, Frank Bures wrote: I've been running spamassassin for years. I am using auto-learn with very conservative thresholds. However, after several years of usage my spam database is about three times larger than my ham database and I am starting to see false positives. Is there

[Copfilter] Copy of quarantined email - *** SPAM *** [8.9/8.0] Bayes spam and ham out of proportion

2010-04-29 Thread babedh-d...@biggdog.biz
I've been running spamassassin for years. I am using auto-learn with very conservative thresholds. However, after several years of usage my spam database is about three times larger than my ham database and I am starting to see false positives. Is there a way to "shrink" the spam database? T

Re: Bayes spam and ham out of proportion

2010-04-29 Thread Matus UHLAR - fantomas
e false positives. > > Is there a way to "shrink" the spam database? there's no spam and ham database - there's just a database with tokens, where each token has its own value that indicates if it's a spammy or hammy token. You can run expire on the database, but I recomme

Bayes spam and ham out of proportion

2010-04-29 Thread Frank Bures
I've been running spamassassin for years. I am using auto-learn with very conservative thresholds. However, after several years of usage my spam database is about three times larger than my ham database and I am starting to see false positives. Is there a way to "shrink" the spam database? T

Re: SPAM and HAM

2006-08-01 Thread David Cary Hart
On Tue, 1 Aug 2006 19:54:43 +0530, sokka <[EMAIL PROTECTED]> opined: > Dear Group Member, > > Can anyone explain me the clear definition of SPAM and HAM > > regards Please see: http://tqmcube.com/spamdef.php and; http://tqmcube.com/contrib.php (which was contributed by

RE: SPAM and HAM

2006-08-01 Thread Rob McEwen
> Dear Group Member, > Can anyone explain me the clear definition of SPAM and HAM   Most everyone agrees that spam is unsolicited e-mail sent from entities with whom you do not have a previously established business or personal relationship ...OR... where you've opted to

Re: SPAM and HAM

2006-08-01 Thread Sipos Gabor
Use google, or wikipedia on spam: http://en.wikipedia.org/wiki/Spam_%28electronic%29 Ham is everything that is NOT spam. hope that clears things... :) Gabor Sipos > Dear Group Member, Can anyone explain me the clear definition of SPAM and HAM regards

Re: SPAM and HAM

2006-08-01 Thread Theo Van Dinter
On Tue, Aug 01, 2006 at 07:54:43PM +0530, sokka wrote: > Can anyone explain me the clear definition of SPAM and HAM The short version is: spam: unsolicited bulk email (aka: bad mail) ham: anything that's not spam (aka: good mail) It really comes down to consent as opposed to what

RE: SPAM and HAM

2006-08-01 Thread Sietse van Zanen
From: sokka [mailto:[EMAIL PROTECTED] Sent: Tue 01-Aug-06 16:24 To: SpamAssassin Users List Subject: SPAM and HAM Dear Group Member, Can anyone explain me the clear definition of SPAM and HAM regards

Re: SPAM and HAM

2006-08-01 Thread Gary G. Taylor
On Tuesday 01 August 2006 07:24, sokka wrote: > Dear Group Member, > > Can anyone explain me the clear definition of SPAM and HAM Spam is spam. Ham ain't. What's the problem? -- Gary G. Taylor * Pomona, CA * 34.07°N 117.75°W [EMAIL PROTECTED] * http://www.donavan.org

Re: SPAM and HAM

2006-08-01 Thread Loren Wilton
Dear Group Member, Can anyone explain me the clear definition of SPAM and HAM Yes. I'm sure quite a few people can. Loren

SPAM and HAM

2006-08-01 Thread sokka
Dear Group Member, Can anyone explain me the clear definition of SPAM and HAM regards

Re: Forward to learn spam and ham?

2005-10-04 Thread mouss
Jorgen Lundman wrote: I would assume someone has already solved this, but it seems hard to search for. I would like to set up SA site wide, so that all the users can use it. However, users are not very technical, so it would be nice if they could have an easy method to train their own DB

Re: Forward to learn spam and ham?

2005-10-03 Thread Jorgen Lundman
" <[EMAIL PROTECTED]> Cc: "ML-spamassassin-talk" ; "William Stearns" <[EMAIL PROTECTED]> Sent: Monday, October 03, 2005 10:10 PM Subject: Re: Forward to learn spam and ham? Good evening, Jorgen, On Tue, 4 Oct 2005, Jorgen Lundman wrote: I would assume someone ha

Re: Forward to learn spam and ham?

2005-10-03 Thread Matthew Lenz
> Sent: Monday, October 03, 2005 10:10 PM Subject: Re: Forward to learn spam and ham? Good evening, Jorgen, On Tue, 4 Oct 2005, Jorgen Lundman wrote: I would assume someone has already solved this, but it seems hard to search for. I would like to set up SA site wide, so that all the users c

Re: Forward to learn spam and ham?

2005-10-03 Thread William Stearns
Good evening, Jorgen, On Tue, 4 Oct 2005, Jorgen Lundman wrote: I would assume someone has already solved this, but it seems hard to search for. I would like to set up SA site wide, so that all the users can use it. However, users are not very technical, so it would be nice if they could have

Forward to learn spam and ham?

2005-10-03 Thread Jorgen Lundman
I would assume someone has already solved this, but it seems hard to search for. I would like to set up SA site wide, so that all the users can use it. However, users are not very technical, so it would be nice if they could have an easy method to train their own DBs. I envisioned that a "[EM

Re: update on floating dividing score between spam and ham messages

2005-07-18 Thread Joe Flowers
spam if a mail scores <= ham threshold, it's ham; >= spam threshold, it's spam; and > ham threshold and < spam threshold, it's "unsure". this is similar to the SpamBayes UI. - --j.

Re: update on floating dividing score between spam and ham messages

2005-07-18 Thread Justin Mason
<|---|--> | | ham | .unsure.. | spam if a mail scores <= ham threshold, it's ham; >= spam threshold, it's spam; and > ham threshold and < spam threshold, it's "unsure".
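The two-threshold rule described above can be written as a small function. A minimal sketch: the function name and the threshold values in the demo are hypothetical and do not come from SpamAssassin or from this thread.

```python
def classify(score, ham_threshold, spam_threshold):
    """Three-way decision: ham, spam, or unsure for scores strictly in between."""
    if score <= ham_threshold:
        return "ham"
    if score >= spam_threshold:
        return "spam"
    return "unsure"


if __name__ == "__main__":
    # Hypothetical thresholds, chosen only to exercise all three outcomes.
    for score in (-1.0, 3.0, 7.5):
        print(score, classify(score, ham_threshold=2.0, spam_threshold=5.0))
```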

Re: update on floating dividing score between spam and ham messages

2005-07-12 Thread Kai Schaetzl
> It sounds like you have put in a lot of time to become an expert in the > traditional wisdom of SA and to tune it accordingly. Not more than others here. Not really too much time. And, I assume you > spend a lot of time keeping it tuned and dealing with SA upgrades. Not at all. I have

Re: update on floating dividing score between spam and ham messages

2005-07-12 Thread Joe Flowers
Kai Schaetzl wrote: Joe Flowers wrote on Mon, 11 Jul 2005 12:09:29 -0400: That's bad, really bad detection ... No. It's good, really good detection. You should improve that instead of trying to find a barrier which gives you the best FP:FN ratio. I'm not trying to find the best F

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Kelson
Joe Flowers wrote: BTW, if anyone knows a command line program that can easily run through a bunch of mbox files and tell how many messages are in them, I will report back how many ham and how many spam messages that I have fed to bayes. It's far from perfect, but it may offer some interesting info

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Joe Flowers
> BTW, if anyone knows a command line program that can easily run through a bunch of mbox files and tell how many messages are in them, I will report > back how many ham and how many spam messages that I have fed to bayes. Well, I thought this might give some good stats on the FP:FN ratio, but I for

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Kai Schaetzl
Kai Schaetzl wrote on Mon, 11 Jul 2005 22:31:29 +0200: > With the default of 5 we get almost none, not even one per day. That was about FPs. Wrong. We don't get *any* FPs. We do not get even one *FN* per day. Kai -- Kai Schätzl, Berlin, Germany Get your web at Conactive Internet Services: htt

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Kris Deugau
jdow wrote: > A few weeks ago I'd have said "Easy, Ducky!" Then I ran into Dovecot > that uses an indexed almost "mbox" file. There is no way to do it > other than "good guess". However, for a traditional UNIX mbox file > you should be able to nail it perfectly simply looking for the "From" > featu

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Kai Schaetzl
Loren Wilton wrote on Mon, 11 Jul 2005 11:30:07 -0700: > Which of course means that by picking the ratio value you can pick pretty > much any fp/fn ratio you want. Only if the distribution was equal. Kai -- Kai Schätzl, Berlin, Germany Get your web at Conactive Internet Services: http://www.c

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Kai Schaetzl
> which may indeed float because tomorrow's messages score different than yesterday's. It does not float at all in the long run. And it exists *only* in the long run. It may throw off next day's detection quite heavily, since there's no guarantee spam and ham look the sa

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread jdow
A few weeks ago I'd have said "Easy, Ducky!" Then I ran into Dovecot that uses an indexed almost "mbox" file. There is no way to do it other than "good guess". However, for a traditional UNIX mbox file you should be able to nail it perfectly simply looking for the "From" feature. The dirt stupid "m
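Counting messages by looking for the "From" separator, as suggested here for a traditional UNIX mbox file, might be sketched as follows. It assumes every message starts with a line beginning `From ` (with a trailing space) and that body lines starting the same way have been escaped as `>From `. The function name is hypothetical.

```python
import sys


def count_mbox_messages(path):
    """Count lines starting with b'From ', the traditional mbox separator."""
    count = 0
    with open(path, "rb") as handle:  # bytes mode: mail is not always valid UTF-8
        for line in handle:
            if line.startswith(b"From "):
                count += 1
    return count


if __name__ == "__main__":
    # Usage: python count_mbox.py ham.mbox spam.mbox
    for mbox_path in sys.argv[1:]:
        print(mbox_path, count_mbox_messages(mbox_path))
```

For mailboxes that do not escape body lines this over-counts; Python's standard `mailbox.mbox` class is an alternative worth comparing against.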

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Joe Flowers
jdow wrote: > The greater the separation the > better the results for a decision point between them. > But anything you can do that widens the > typical score distribution between ham and spam is a good thing. Amen

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Loren Wilton
> There's another thing worth noting -- the SpamAssassin score distribution > for hams and spams isn't even. I don't necessarily see that those particular curve shapes necessarily in any way invalidate this method, although they do bias the method somewhat. The two curves are essentially smooth cu

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Joe Flowers
Matt: I know you know a lot more about this than I do, but for what it's worth, your impressions/intuitions are very close to mine. Originally back in April, I started off using the "average of the means", but that let through way too much spam. So, what I have now is it set to 30% above th

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Loren Wilton
> > score of -2.1532284. I have the dividing line "set" at 30% of the > > distance between the average ham score and average spam score (30% above > > the average ham score). So, the dividing line is currently floating > > around 0.55416414. > > > The only problem I see with this approach is that i
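The floating dividing line quoted above sits a fixed fraction (30%) of the way from the average ham score to the average spam score. A minimal sketch of that arithmetic, with hypothetical function names and made-up score lists:

```python
from statistics import mean


def dividing_score(ham_mean, spam_mean, fraction=0.30):
    """Point `fraction` of the way from the ham mean to the spam mean."""
    return ham_mean + fraction * (spam_mean - ham_mean)


def floating_threshold(ham_scores, spam_scores, fraction=0.30):
    """Recompute the dividing line from the scores seen so far."""
    return dividing_score(mean(ham_scores), mean(spam_scores), fraction)


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    print(floating_threshold([-3.0, -1.0], [6.0, 10.0]))
```

Setting `fraction` to 0.5 gives the "average of the means" mentioned elsewhere in the thread; a smaller fraction moves the line toward the ham mean, which catches more spam at the cost of more false positives.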

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread jdow
From: "Matt Kettler" <[EMAIL PROTECTED]> > Joe Flowers wrote: > > I don't know if this will help anyone or not, but I wanted to report > > back just in case. > > > > In early April, I completely unhinged the dividing line between what SA > > score is used to mark a message as spam or ham (5.00 = d

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 the real-world figures can be seen for various thresholds in the rules/STATISTICS*.txt files... - --j. Matt Kettler writes: > Joe Flowers wrote: > > Matt Kettler wrote: > > > >> The only problem I see with this approach is that it treats false > >>

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Matt Kettler
Joe Flowers wrote: > Matt Kettler wrote: > >> The only problem I see with this approach is that it treats false >> positives and >> false negatives as being equally bad. >> >> > > We do get many more false negatives than false positives, even though we > don't get false positives very often - t

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Joe Flowers
Thanks Jason! That's good, new info for me. That'll help me *at the very least* visualize what I am trying to do a little better. I've been very curious to know what the rough shapes of those graphs look like. Joe Justin Mason wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 There'

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Justin Mason
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 There's another thing worth noting -- the SpamAssassin score distribution for hams and spams isn't even. If you draw a graph of hams and spams, plotting the number of mails in each category as the vertical axis and the score they get as the horizonta

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Joe Flowers
Matt Kettler wrote: The only problem I see with this approach is that it treats false positives and false negatives as being equally bad. We do get many more false negatives than false positives, even though we don't get false positives very often - they are rare. We certainly don't get 1

Re: update on floating dividing score between spam and ham messages

2005-07-11 Thread Matt Kettler
Joe Flowers wrote: > I don't know if this will help anyone or not, but I wanted to report > back just in case. > > In early April, I completely unhinged the dividing line between what SA > score is used to mark a message as spam or ham (5.00 = default). This > allows the system and this dividing l

Re: update on floating dividing score between spam and ham messages

2005-07-10 Thread Joe Flowers
ctron could jump across into the next energy band without ever being in the middle no-man's land - giving rise to a way to measure SA's imperfectness? These percentages are stabs in the dark about what the distributions of spam and ham really look like. What is intriguing to me is: why c

Re: update on floating dividing score between spam and ham messages

2005-07-10 Thread Loren Wilton
This is quite interesting, and seems reasonably obvious that with the right sort of mail (at least, maybe with any mail) this should work better, since it self tunes to your conditions. It does of course assume a reasonable fp/fn rate to start, but SA is generally pretty good about that. How have

update on floating dividing score between spam and ham messages

2005-07-10 Thread Joe Flowers
I don't know if this will help anyone or not, but I wanted to report back just in case. In early April, I completely unhinged the dividing line between what SA score is used to mark a message as spam or ham (5.00 = default). This allows the system and this dividing line to drift "freely" to an