What you are trying to do is identify a source of messages by its entropy, on the supposition that the entropy of a ham source is distinguishable from that of a spam source.
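To make "entropy of a source" concrete, here is a minimal sketch that estimates the Shannon entropy of the token distribution of a set of messages. It is only one possible reading of the idea: the function name `token_entropy` and the whitespace tokenization are my own assumptions, not anything taken from the scheme under discussion.

```python
import math
from collections import Counter


def token_entropy(messages):
    """Estimate Shannon entropy, in bits per token, of a message source.

    Assumption: tokens are lowercased, whitespace-separated words, and the
    source is modelled by the empirical token frequencies alone.
    """
    counts = Counter(tok for msg in messages for tok in msg.lower().split())
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no tokens seen, so nothing to measure
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Comparing `token_entropy(ham_corpus)` with `token_entropy(spam_corpus)` would show whether the two sources are distinguishable by this measure at all; whether they are in practice is an open question.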
2016-08-22 13:48 GMT-03:00 Antony Stone <antony.st...@spamassassin.open.source.it>:

> On Monday 22 August 2016 at 18:00:35, Marc Perkel wrote:
>
> > On 08/22/16 07:37, Antony Stone wrote:
> > >
> > > So what makes "cheapest Viagra online" a token, such that "cheapest" and
> > > "online" are not tokens?
> >
> > They would all be tokens. Just pointing out one that would match spam
> > and not match ham. "cheapest" and "online" would likely be in both sets
> > and would be ignored.
>
> Hm, that doesn't tie up with your earlier reply:
>
> On Monday 22 August 2016 at 16:34:00, Marc Perkel wrote:
>
> > On 08/22/16 07:28, Dianne Skoll wrote:
> > > On Mon, 22 Aug 2016 07:16:41 -0700
> > >
> > > As far as I understand your algorithm, if an email contains at least one
> > > token in the "ham" set and zero tokens in the "spam" set, you classify it
> > > as ham. And conversely, if it contains at least one spam token but zero
> > > ham tokens, you classify it as spam.
> >
> > YES! YES! YES!
>
> Er, really? See below.
>
> > Although I look at some thousand "fingerprints" to get a more
> > significant result.
> >
> > > The other two possibilities (no tokens in either or some tokens in both)
> > > are undecidable.
> >
> > Exactly!
>
> So, it's not that "if an email contains at least one token in the 'ham' set
> and zero tokens in the 'spam' set, you classify it as ham".
>
> You in fact ignore any tokens in the email which are in both the 'ham' and
> 'spam' sets, and then - what - work out which set contains more of the
> left-over tokens?
>
>
> Antony.
>
> --
> Pavlov is in the pub enjoying a pint.
> The barman rings for last orders, and Pavlov jumps up exclaiming "Damn! I
> forgot to feed the dog!"
>
> Please reply to the list;
> please *don't* CC me.
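For reference, the rule as summarised in the quoted exchange can be sketched as follows. This is my reading of the thread, not Marc's actual code: tokens seen in both corpora are dropped, and a message is classified only when its remaining tokens all point the same way. The names `tokenize`, `build_exclusive_sets` and `classify`, and the choice of word n-grams up to length 3 as tokens, are assumptions made for illustration.

```python
def tokenize(text, max_n=3):
    """Return the set of word n-grams (1..max_n words) in a message.

    Assumption: a "token" is any run of up to max_n consecutive lowercased
    words, so "cheapest viagra online" yields the single words, both
    two-word phrases and the full three-word phrase.
    """
    words = text.lower().split()
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    }


def build_exclusive_sets(ham_messages, spam_messages):
    """Build the ham-only and spam-only token sets.

    Tokens that occur in both corpora are discarded from both sets.
    """
    ham_tokens = set().union(*(tokenize(m) for m in ham_messages))
    spam_tokens = set().union(*(tokenize(m) for m in spam_messages))
    return ham_tokens - spam_tokens, spam_tokens - ham_tokens


def classify(message, ham_only, spam_only):
    """Classify a message as "ham", "spam" or "undecided".

    Ham needs at least one ham-only token and no spam-only token, and
    vice versa. No hits at all, or hits on both sides, is undecided.
    """
    tokens = tokenize(message)
    ham_hits = tokens & ham_only
    spam_hits = tokens & spam_only
    if ham_hits and not spam_hits:
        return "ham"
    if spam_hits and not ham_hits:
        return "spam"
    return "undecided"
```

With a spam corpus containing "cheapest viagra online" and a ham corpus that also uses "cheapest" and "online", those two words fall out of both sets, while "viagra" and the longer phrases stay spam-only. The sketch follows the strict rule confirmed in the thread; it does not count which side has more left-over tokens, which is the question Antony raises.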