On Thu, 20 Dec 2018 21:12:33 -0700 Grant Taylor wrote:

> On 12/20/18 8:34 PM, Grant Taylor wrote:
> > I'm going back through and analyzing how I'm extracting data and
> > trying to satisfactorily explain some oddities.
>
> Out of 244,921 messages there are 16,528 unique addresses, this is
> how the messages break down for
>
> Here's how the dots in the user parts of 16,528 unique addresses out
> of 244,921 messages break down:
>
> 13,277             (no dots 80.3%)
>  2,936 .           ( 1 dot  17.7%)
>    281 ..          ( 2 dots  1.7%)
>     29 ...         ( 3 dots  0.2%)
>      3 ....        ( 4 dots  0.0%)
>      1 .....       ( 5 dots  0.0%)
>      1 ........... (11 dots  0.0%)
>
> So, in light of this information, I would be willing to concede 3 or
> more dots is possibly and indicator of spam.

I think you are being a bit premature there. Without separate figures for spam and ham, you can't say whether any of these is a good spam indicator, even in isolation.

> My previous log methodology

Isn't a sound method for scoring. For one thing, it assumes that more dots are more spammy; it could be that the S/O peaks at 4. For another, scoring should be about the balance of extra TPs and FPs that the rule creates. Sometimes the more spammy-looking rule hits higher-scoring spam and so warrants a lower score.
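
To make the "separate figures" point concrete, here is a minimal Python sketch of the per-dot-count breakdown I have in mind. It assumes you already have the addresses split into a spam list and a ham list, and it takes S/O to mean the spam share of hits after normalising for the size of each corpus; that definition, and the function names dot_count, s_o_ratio and s_o_by_dots, are my own assumptions for illustration, not taken from any existing tool.

```python
from collections import Counter


def dot_count(address):
    """Count the dots in the user part (left of the last '@') of an address."""
    local_part = address.rsplit("@", 1)[0]
    return local_part.count(".")


def s_o_ratio(spam_hits, ham_hits, spam_total, ham_total):
    """Spam share of hits, normalised by corpus size (assumed S/O definition).

    Returns None when the pattern hits nothing in either corpus.
    """
    if spam_total == 0 or ham_total == 0:
        raise ValueError("need at least one spam and one ham address")
    spam_rate = spam_hits / spam_total
    ham_rate = ham_hits / ham_total
    if spam_rate + ham_rate == 0:
        return None
    return spam_rate / (spam_rate + ham_rate)


def s_o_by_dots(spam_addresses, ham_addresses):
    """Map each observed dot count to its S/O across the two corpora."""
    spam_counts = Counter(dot_count(a) for a in spam_addresses)
    ham_counts = Counter(dot_count(a) for a in ham_addresses)
    return {
        dots: s_o_ratio(
            spam_counts[dots],
            ham_counts[dots],
            len(spam_addresses),
            len(ham_addresses),
        )
        for dots in sorted(set(spam_counts) | set(ham_counts))
    }
```

Run over real spam and ham corpora, a table like this would show whether S/O really climbs with the number of dots or peaks somewhere in the middle. It still says nothing about scoring on its own, since it ignores which other rules already hit the same messages.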