On Tuesday 07 February 2006 15:27, Clay Davis wrote:
>Does anyone have any good techniques for capturing a sample of ham
> that can be used as the ham corpus. I'm in a corporate environment
> and am not keen on the idea of intercepting non-spam messages. I
> will if I have to, but was hoping someone had a better idea.
>

I wouldn't have too guilty a conscience on that subject, because
generally you won't be reading much other than intercepted spam.
There may be an FP (false positive) in there occasionally, but you'll
soon learn to catch those and feed them to the ham learner, and hence
move them to the correct mailbox folder.

In other words, to make an omelette, you normally have to break a few
eggs. What you accidentally read in an FP should be treated with the
usual amount of salt and otherwise forgotten.

>Regards,
>Clay
>
>>>> On 2/7/2006 at 3:16 pm, in message <[EMAIL PROTECTED]>,
>>>> Matt Kettler <[EMAIL PROTECTED]> wrote:
>> [EMAIL PROTECTED] wrote:
>>> Can you just feed spamassassin spam or do you need to give it ham
>>> also?
>>>
>>> I read the docs and it didn't say you had to feed it ham.
>>>
>>> I then read another doc and it said you should feed it equal
>>> amounts of spam and ham.
>>
>> Yes, you really should feed it both. You also should strive for a
>> 1:1 ratio of spam and nonspam, but don't kill yourself to get there.
>>
>> SA's use of chi-squared combining makes it very tolerant of wild
>> imbalances in training. However, the closer you are to a 1:1 ratio
>> the better SA will be able to distinguish tokens that are present in
>> both kinds of mail and ignore them. So this is a worthwhile goal to
>> strive for as long as it doesn't become a burden.
>>
>> My current training ratio is about 7:1 spam:nonspam, but in the past
>> it's been as bad as 20:1. Both of those are very far off from equal
>> amounts, but the imbalance has never caused me any problems.
>>
>> From my sa-learn --dump magic output as of today:
>> 0.000 0 995764 0 non-token data: nspam
>> 0.000 0 145377 0 non-token data: nham
>>
>> That works out to a ratio of 6.85:1

--
Cheers, Gene
People having trouble with vz bouncing email to me should add the word
'online' between the 'verizon', and the dot which bypasses vz's
stupid bounce rules. I do use spamassassin too. :-)
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2006 by Maurice Eugene Heskett, all rights reserved.
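
The 6.85:1 figure quoted above comes from dividing the `nspam` counter by the `nham` counter in the `sa-learn --dump magic` output. A minimal sketch of that arithmetic follows, assuming the dump text has already been captured into a string. The helper names `parse_magic_counts` and `spam_to_ham_ratio` are hypothetical and not part of SpamAssassin; the regular expression only assumes the column layout shown in the quoted lines.

```python
import re

# Matches lines such as: "0.000 0 995764 0 non-token data: nspam"
# The third whitespace-separated column holds the message count.
MAGIC_LINE = re.compile(
    r"^\s*\S+\s+\S+\s+(\d+)\s+\S+\s+non-token data: (nspam|nham)\s*$"
)

def parse_magic_counts(dump_text):
    """Return a dict with the nspam and nham counts found in the dump text."""
    counts = {}
    for line in dump_text.splitlines():
        match = MAGIC_LINE.match(line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts

def spam_to_ham_ratio(counts):
    """Spam messages learned per ham message learned."""
    return counts["nspam"] / counts["nham"]

# The two lines quoted in the message above.
sample = """\
0.000 0 995764 0 non-token data: nspam
0.000 0 145377 0 non-token data: nham
"""

counts = parse_magic_counts(sample)
print(f"{spam_to_ham_ratio(counts):.2f}:1")
```

In practice the input would come from running `sa-learn --dump magic` yourself and passing its output to `parse_magic_counts`; a ratio far from 1:1 is, per the quoted advice, something to improve when convenient rather than a problem in itself.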