Hi,

On Tue, 19 Nov 2002 14:26:51 GMT, Matt Sergeant wrote:
> Ross Vandegrift said the following on 19/11/02 14:17:
> > If the Bayesian analysis actually takes into account the joint and
> > conditional densities of word frequency, and it has a reasonable way to
> > assign an expectation to them (i.e., if the corpus is seeded with real
> > non-spam and real spam), the fact that a spam has been seeded with real
> > words should show up in the joint and conditional frequency analysis.
> > This would allow the filter to assign a spam score, though perhaps with
> > a smaller confidence interval.
> 
> See now I did two years of a Maths degree, and you've already gone way 
> over my head :-)
> 
> What does "joint and conditional frequency analysis" mean?

First, start with Larry Gonick's fantastic "The Cartoon Guide To Statistics":
http://www.powells.com/cgi-bin/biblio?inkey=7-0062731025-0

Being neither a mathematician nor a statistician, I'd guess that joint
frequency analysis is how you derive P(A and B), and conditional frequency
analysis is how you derive P(A given B) (generally written P(A|B)),
where A and B are two events (in this case, the occurrence of words A and
B, respectively). The definition of conditional probability is
P(A|B) = P(A and B)/P(B), which is intuitive if you draw the big Venn
diagram. Bayes' Theorem follows directly from it:
P(A|B) = P(B|A)P(A)/P(B).
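
To make the two quantities concrete, here is a rough Python sketch that
counts them from a toy corpus. The four messages and the helper p() are
made up for illustration; a real filter would count over far more mail.
It also checks that P(A|B) computed from the joint frequency matches the
P(B|A)P(A)/P(B) form.

```python
from fractions import Fraction

# Toy corpus (invented for illustration): each message is a set of words.
messages = [
    {"free", "toner", "refinance"},
    {"free", "toner"},
    {"free", "meeting"},
    {"meeting", "agenda"},
]

def p(*words):
    """Fraction of messages that contain every one of the given words."""
    hits = sum(1 for m in messages if all(w in m for w in words))
    return Fraction(hits, len(messages))

# A = message contains "toner", B = message contains "free".
p_joint = p("toner", "free")        # joint frequency: P(A and B)
p_cond = p_joint / p("free")        # conditional: P(A|B) = P(A and B) / P(B)

# The same number via the reverse conditional P(B|A).
p_rev = p_joint / p("toner")        # P(B|A)
p_bayes = p_rev * p("toner") / p("free")

print(p_joint, p_cond, p_bayes)     # 1/2 2/3 2/3
```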

Regardless, if spammers start including random chunks of legitimate
mailing list traffic, the Bible, Rod McKuen poetry - whatever - it
shouldn't matter since phrases like "This is a one-time mailing" and
"HOT TEENS REFINANCE YOUR TONER CARTRIDGES" still show up only in the
spam corpus. If word combinations show up in both the spam and non-spam
corpora, they should end up with a low weight. I don't see random
'hash-busting' text having much effect on the so-called Bayesian filters;
the worst that will happen is that inbound Rod McKuen poetry will be
misclassified as spam, provided the spammers start mailing it to you
before your friends do[1].
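
A minimal sketch of that argument, assuming a simple per-word scheme:
each word gets a rough P(spam | word) from how often it appears in each
corpus, and the per-word values are combined as if words were
independent. The counts, the clamping bounds and the function names are
all invented here; this is not how any particular filter is implemented.
Words that are equally common in both corpora come out at 0.5 and cancel
out of the combined score.

```python
from math import prod

# Invented counts: number of messages in each corpus containing the word.
N_SPAM, N_HAM = 100, 100
spam_counts = {"toner": 40, "refinance": 30, "the": 90, "love": 20}
ham_counts = {"toner": 1, "refinance": 2, "the": 90, "love": 20}

def word_prob(word):
    """Rough P(spam | word), clamped away from 0 and 1; None if unseen."""
    s = spam_counts.get(word, 0) / N_SPAM
    h = ham_counts.get(word, 0) / N_HAM
    if s + h == 0:
        return None  # assumption: words seen in neither corpus are ignored
    return min(0.99, max(0.01, s / (s + h)))

def spam_score(words):
    """Combine per-word probabilities, treating words as independent."""
    probs = [q for q in map(word_prob, words) if q is not None]
    spam = prod(probs)
    ham = prod(1 - q for q in probs)
    return spam / (spam + ham)

plain = spam_score(["toner", "refinance"])
# Same message padded with words that are equally common in both corpora.
padded = spam_score(["toner", "refinance", "the", "love", "the"])

print(abs(plain - padded) < 1e-12)  # True: the padding carries no weight
```

Padding only helps the spammer if the filler words are markedly more
common in your non-spam corpus than in your spam corpus, and that
advantage shrinks once the padded spam is itself added to the spam
corpus.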

Just periodically re-analyze the corpora to keep up with current trends in
spam content, and you should be OK.

-- 
Bob Apthorpe <[EMAIL PROTECTED]>
[1] In most cases, this is not a bad thing...




_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
