> -----Original Message-----
> From: [EMAIL PROTECTED]
>
> that's exactly what Bayes does, effectively ;) Given the amount of
> training you put in, it makes reliable decisions based solely on
> that data, taking into account the *amount* of training that has
> been performed etc.

Since my knowledge of SA Bayes is limited, I appreciate any and all insight. I was trying to point out that the criterion for the minimum amount of training should be based not on the number of messages but on a length of time. So rather than a minimum sample of 200 messages each for ham and spam, we would look at a minimum time span of 7 days or longer. I believe that some interval of 7 days would be best, as the ham pattern may change not only within a 24-hour period but may also differ between Mon-Fri and Sat-Sun.

> The way to do this would be to divide the day up into hourly
> or bi-hourly chunks, create Bayes tokens from that (from the
> Received headers' date, probably), add those to the set of existing
> bayes tokens, and let it all "come out in the wash".

I would think you would want to create spam-time-date tokens and ham-time-date tokens from every message. I hesitate to use anything from the header, as it could be forged, but I don't really have an alternative unless the receiving host's own timestamp is used. I also believe that Fuzzy is correct in that a separate db would be beneficial. Do you?

I think I might be missing a larger point. I believe this is complex, as I see two signals. How will Bayes actually learn? How is it able to describe the spam white noise so that it can actually associate a specific message with ham or spam?

With the current system, word tokens are acquired and associated with ham or spam. In this test we would take timedate tokens and associate them with ham or spam. Unlike word tokens, which describe a message, a timedate token by itself describes nothing about the message. Theoretically, the word tokens for ham are different from the word tokens for spam. With timedate tokens, there can be overlap: the timedate tokens for ham can be the same as the timedate tokens for spam. We cannot differentiate until we actually look at the patterns of data points and are able to distinguish ham timedate patterns from spam timedate patterns.
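
To make the overlap concern concrete, here is a rough Python sketch of the idea as I understand it. It is not SpamAssassin's code or API: the names `timedate_tokens` and `TokenCounts`, the 2-hour bucket size, and the simple smoothed per-token estimate are all my own assumptions for illustration. It turns a timestamp into time-of-day and weekday/weekend tokens, counts them separately for ham and spam, and shows that a slot seen equally in both ends up neutral.

```python
from collections import Counter
from datetime import datetime


def timedate_tokens(ts: datetime, bucket_hours: int = 2) -> list[str]:
    """Turn a timestamp into coarse timedate tokens (hypothetical naming)."""
    start = (ts.hour // bucket_hours) * bucket_hours  # start hour of the bucket
    day_type = "weekend" if ts.weekday() >= 5 else "weekday"  # Sat/Sun vs Mon-Fri
    return [
        f"hour:{start:02d}",
        f"day:{day_type}",
        f"slot:{day_type}:{start:02d}",
    ]


class TokenCounts:
    """Per-token ham/spam counts; a toy stand-in for a separate timedate db."""

    def __init__(self) -> None:
        self.spam = Counter()
        self.ham = Counter()
        self.nspam = 0  # number of spam messages learned
        self.nham = 0   # number of ham messages learned

    def learn(self, tokens: list[str], is_spam: bool) -> None:
        if is_spam:
            self.spam.update(set(tokens))
            self.nspam += 1
        else:
            self.ham.update(set(tokens))
            self.nham += 1

    def spam_probability(self, token: str) -> float:
        # Add-one smoothed estimate; an assumption, not SpamAssassin's formula.
        s = (self.spam[token] + 1) / (self.nspam + 2)
        h = (self.ham[token] + 1) / (self.nham + 2)
        return s / (s + h)


if __name__ == "__main__":
    db = TokenCounts()
    # One ham and one spam arriving in the same weekday afternoon slot.
    db.learn(timedate_tokens(datetime(2003, 7, 23, 14, 30)), is_spam=False)
    db.learn(timedate_tokens(datetime(2003, 7, 23, 15, 10)), is_spam=True)
    for tok in timedate_tokens(datetime(2003, 7, 23, 14, 0)):
        print(tok, round(db.spam_probability(tok), 3))
```

In this toy setup a slot that receives ham and spam at the same rate scores 0.5 and carries no information, so a timedate token only helps where the two arrival patterns really differ.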

A change in the point threshold alone will change the patterns for timedate tokens. A change in the point threshold will have more effect on the rate of word token acquisition than on which word tokens are associated with ham or spam.

Am I thinking clearly on this or am I missing a piece of information?

--Larry

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk