Hi,
On Tue, 19 Nov 2002 14:26:51 GMT, Matt Sergeant wrote:
> Ross Vandegrift said the following on 19/11/02 14:17:
> > If the Bayesian analysis actually takes into account the joint and
> > conditional densities of word frequency, and it has a reasonable way to
> > assign an expectation to them (ie, if the corpus is seeded with real
> > non-spam and real spam), the fact that a spam has been seeded with real
> > words should show up in the joint and conditional frequency analysis.
> > This would allow the filter to assign a spam score, though perhaps with
> > a smaller confidence interval.
>
> See now I did two years of a Maths degree, and you've already gone way
> over my head :-)
>
> What does "joint and conditional frequency analysis" mean?

First, start with Larry Gonick's fantastic "The Cartoon Guide To
Statistics":

http://www.powells.com/cgi-bin/biblio?inkey=7-0062731025-0

Being neither a mathematician nor a statistician, my guess is that joint
frequency analysis is how you derive P(A and B), and conditional
frequency analysis is how you generate P(A given B) (generally written
P(A|B)), where A and B are two events (in this case, the occurrence of
words A and B, respectively.) Bayes' Theorem boils down to
P(A|B) = P(A and B)/P(B), where P(A and B) = P(B|A)P(A), which is
intuitive if you draw the big Venn diagram.

Regardless, if spammers start including random chunks of legitimate
mailing list traffic, the Bible, Rod McKuen poetry - whatever - it
shouldn't matter, since phrases like "This is a one-time mailing" and
"HOT TEENS REFINANCE YOUR TONER CARTRIDGES" still show up only in the
spam corpus. If word combinations show up in both the spam and non-spam
corpora, they should end up with a low weight.

I don't see random 'hash-busting' text having much effect on the
so-called Bayesian filters; the worst that will happen is that inbound
Rod McKuen poetry will be misclassified as spam, provided the spammers
start mailing it to you before your friends do[1].
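
For what it's worth, here is a rough Python sketch of the joint and
conditional idea on a toy corpus. Everything in it is an assumption made
for illustration: the example messages, the helper names (p_word,
p_joint, p_conditional, spamminess), equal priors for spam and non-spam,
and treating each message as a plain set of words. It is not how
SpamAssassin or any particular filter does it.

```python
# Toy corpora, invented for illustration only.
spam_msgs = [
    "this is a one-time mailing",
    "refinance your toner cartridges",
    "hot teens refinance now",
    "the quick brown fox one-time mailing",  # spam padded with filler text
]
ham_msgs = [
    "the patch is attached",
    "the quick brown fox",
    "see the list archive",
    "is the filter trained",
]

def doc_sets(messages):
    """Reduce each message to the set of words it contains."""
    return [set(m.split()) for m in messages]

spam_docs = doc_sets(spam_msgs)
ham_docs = doc_sets(ham_msgs)

def p_word(word, docs):
    """P(word): fraction of messages in docs that contain the word."""
    return sum(word in d for d in docs) / len(docs)

def p_joint(a, b, docs):
    """P(A and B): fraction of messages containing both words."""
    return sum(a in d and b in d for d in docs) / len(docs)

def p_conditional(a, b, docs):
    """P(A|B) = P(A and B) / P(B); None if B never occurs in docs."""
    pb = p_word(b, docs)
    return p_joint(a, b, docs) / pb if pb else None

def spamminess(word):
    """P(spam | word) via Bayes' theorem, assuming P(spam) = P(ham) = 0.5."""
    ps = p_word(word, spam_docs)
    ph = p_word(word, ham_docs)
    if ps + ph == 0:
        return 0.5  # word never seen: no evidence either way
    return ps / (ps + ph)

if __name__ == "__main__":
    print("P(one-time and mailing | spam):", p_joint("one-time", "mailing", spam_docs))
    print("P(one-time | mailing, spam):", p_conditional("one-time", "mailing", spam_docs))
    for w in ("refinance", "mailing", "fox", "the"):
        print(w, spamminess(w))
```

In this toy setup the filler words that appear in both corpora ("fox")
land at a neutral score, while words seen only in spam ("refinance",
"mailing") stay strongly spammy, which is the point made above about
padding having little effect.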
Just periodically analyze the corpora to keep up with current trends in
spam content, and you should be ok.

-- Bob Apthorpe <[EMAIL PROTECTED]>

[1] In most cases, this is not a bad thing...

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk