RE: [SAtalk] Mail arrival time may be a criteria

Larry Gilson Thu, 14 Aug 2003 12:17:28 -0700

> -----Original Message-----
> From: [EMAIL PROTECTED]

> that's exactly what Bayes does, effectively ;)  Given the amount of
> training you put in, it makes reliable decisions based solely on
> that data, taking into account the *amount* of training that has 
> been performed etc.


Since my knowledge of SA Bayes is limited, I appreciate any and all insight.
I was trying to suggest that the criterion for the minimum amount of
training be based not on the number of messages but on length of time.  So
rather than a minimum sample of 200 messages for ham and spam, we would look
at a minimum time span of 7 days or longer.  I believe that an interval of 7
days would be best, as the ham pattern may change not only within a 24-hour
period but may also differ between Mon-Fri and Sat-Sun.
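
A minimal sketch of the time-span criterion described above, in Python.
The function name and the 7-day default are illustrative assumptions, not
anything SpamAssassin provides; it only shows the check "has training
covered at least a week?" in place of "have we seen 200 messages?".

```python
from datetime import datetime, timedelta

# Assumption: the 7-day minimum proposed in this message.
MIN_SPAN = timedelta(days=7)

def enough_training(timestamps, min_span=MIN_SPAN):
    """Return True once the learned messages cover at least min_span.

    timestamps: arrival times (datetime objects) of the messages
    learned so far for one class (ham or spam).
    """
    if not timestamps:
        return False
    return max(timestamps) - min(timestamps) >= min_span

learned = [datetime(2003, 8, 7, 9, 0), datetime(2003, 8, 14, 12, 17)]
print(enough_training(learned))
```

The same check would be run separately for ham and for spam, so that both
classes have seen at least one full week, weekend included.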


> The way to do this would be to divide the day up into hourly 
> or bi-hourly chunks, create Bayes tokens from that (from the
> Received headers' date, probably), add those to the set of existing
> bayes tokens, and let it all "come out in the wash".

I would think you would want to create spam-time-date tokens and
ham-time-date tokens from every message.  I hesitate to use anything from
the header as it could be forged but I don't really have an alternative
unless the receiving host's own timedate stamp is used.  I also believe that
Fuzzy is correct in that a separate db would be beneficial.  Do you agree?
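
To make the token idea concrete, here is a rough Python sketch of turning
one date string into time-of-day and day-of-week tokens.  The token names
and the `time_tokens` helper are hypothetical, chosen only for
illustration; whether the date comes from a Received header or from the
receiving host's clock is left open, as discussed above.

```python
from email.utils import parsedate_to_datetime

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def time_tokens(date_string, chunk_hours=2):
    """Build hypothetical timedate tokens from an RFC 2822 date string.

    The day is split into chunks of chunk_hours (2 = bi-hourly), and
    each token is tagged with the hour at which its chunk starts.
    """
    dt = parsedate_to_datetime(date_string)
    chunk_start = (dt.hour // chunk_hours) * chunk_hours
    day = DAYS[dt.weekday()]
    kind = "weekend" if dt.weekday() >= 5 else "weekday"
    return [
        f"time:hour:{chunk_start:02d}",
        f"time:day:{day}",
        f"time:{kind}:{chunk_start:02d}",
    ]

print(time_tokens("Thu, 14 Aug 2003 12:17:28 -0700"))
# ['time:hour:12', 'time:day:Thu', 'time:weekday:12']
```

The same function would produce tokens for every learned message; whether
a token counts toward ham or spam depends only on how the message itself
was classified, not on the token.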


I think I might be missing a larger point.  I believe this is complex as I
see two signals.  How will Bayes actually learn?  How is it able to describe
the spam white-noise so that it can actually associate ham or spam for a
specific message?  With the current system, word tokens are acquired and
associated with ham or spam.  In this test we will take timedate tokens to
associate with ham or spam.  Unlike word tokens that describe a message,
timedate itself has no description.  Theoretically, the word tokens for ham
are different from the word tokens for spam.  With timedate tokens, there can
be overlap.  The timedate tokens for ham can be the same as timedate tokens
for spam.  We cannot differentiate until we actually look at the patterns
of data points and are able to distinguish ham timedate patterns from spam
timedate patterns.  A change in point threshold alone will change the
patterns for timedate tokens.  A change in point threshold will have more
effect on the rate of word token acquisition than on which word tokens are
associated with either ham or spam.  Am I thinking clearly on this or am I
missing a piece of information?
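
One way to look at the overlap question is with a simplified per-token
estimate, sketched below in Python.  This is a textbook-style ratio of
relative frequencies, assumed here for illustration only; it is not
SpamAssassin's actual Bayes formula, and the counts are made up.

```python
def spam_probability(spam_hits, ham_hits, n_spam, n_ham):
    """Simplified estimate of P(spam | token) from training counts.

    spam_hits / ham_hits: learned spam / ham messages containing the token.
    n_spam / n_ham: total learned spam / ham messages.
    Returns 0.5 (no information) when nothing useful has been seen.
    """
    if n_spam == 0 or n_ham == 0 or spam_hits + ham_hits == 0:
        return 0.5
    spam_rate = spam_hits / n_spam
    ham_rate = ham_hits / n_ham
    return spam_rate / (spam_rate + ham_rate)

# A time slot seen equally often in ham and spam carries no signal.
print(spam_probability(50, 50, 1000, 1000))

# A time slot seen mostly in spam leans toward spam.
print(round(spam_probability(90, 10, 1000, 1000), 3))
```

Under this kind of estimate, a timedate token shared by ham and spam is
not a problem in itself: it simply lands near 0.5 and contributes little,
while a time slot that is lopsided in the training data moves away from
0.5.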

--Larry



