I'm finding this discussion interesting, because I've been trying to
wrap my head around the theoretical basis of this system. Along the way,
I've noticed that several of the questions asked so far are already
answered in the document Marc initially pointed to
(http://wiki.junkemailfilter.com/index.php/The_Evolution_Spam_Filter).
Given Marc's situation, it seems reasonable to read that document
before asking too many questions.
As a way to (maybe) save Marc some time, test my own knowledge and
perhaps help move the conversation forward, I'm going to summarize
the questions I've seen so far and, as much as possible, the answers
to those questions (and Marc, correct me if I'm getting anything wrong
here):
- How do you classify an email that has tokens from both the ham and
spam set?
Tokens found in both sets are neutral and drop out. Of the two sets
that remain ("only found in ham" and "only found in spam"), whichever
is larger (or "better") determines the final classification.
- What length are the tokens?
Marc's examples use tokens of several lengths, capturing every run of
1 to 4 consecutive "words", but I suspect the exact maximum token
length might be adjustable.
- What happens when spammers use "hammy" text to avoid detection?
I don't see this directly addressed, but I would guess there are
several things that mitigate this. Multi-word tokens
prevent the truly random word salad attempts at poisoning, and
probably help with "cuttings" from other texts because the transition
from one cutting to the next probably doesn't appear in ham, leaving
the "spam-only" aspects of the mail to push it towards a spam
classification. The unlearning and expiration of fingerprints would
mean that such cuttings would have to appear repeatedly over time in
legitimate mail to tip an email toward a ham classification.
- Will bad spellers (or typists) be seen as spammier?
Again, I don't see this addressed specifically, but I don't think so,
unless they are such tremendously bad spellers that nearly every word
is misspelled. To take the "let's get some lunch" example, even if I
accidentally mis-type "some" as "som", I still have other tokens to
compare against, and the tokens "som", "get som", "som lunch", "let's
get som", etc. would have to have appeared in spam (and only spam) to
pull the classification toward spam. So I'd say the occasional typo
or misspelling would come up neutral.
- What happens to messages that have a lot of neutral tokens?
Now I'm really speculating, but unless every token is neutral, there's
still something to decide on, though it does seem that detection
becomes less reliable as the number of non-neutral tokens approaches
zero. A similar question that I thought of is what happens to
messages where the final sets "only found in spam" and "only found
in ham" are nearly (or exactly) the same size. If you're using this
filter as part of SA scoring, the answer would seem to be that you
have an appropriately small score for "undetermined" (like bogofilter
does), but if it's acting as a separate filter, I don't know.
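A minimal Python sketch of the answers above may make them easier to
check. This is not Marc's implementation: the function names, the
lowercasing, the whitespace tokenization, the "undetermined" result for
ties, and the tiny training sets are all assumptions made for
illustration. It only encodes the reading given here: tokens are runs of
1 to 4 words, tokens seen on both sides are neutral, and the larger of
the two "only found in" sets wins.

```python
def tokens(text, max_words=4):
    """Return every run of 1..max_words consecutive words as a token.

    Lowercasing and splitting on whitespace are assumptions; the real
    filter may tokenize differently.
    """
    words = text.lower().split()
    out = set()
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            out.add(" ".join(words[i:i + n]))
    return out


def classify(text, ham_seen, spam_seen, max_words=4):
    """Compare the 'only found in ham' and 'only found in spam' sets."""
    toks = tokens(text, max_words)
    ham_only = {t for t in toks if t in ham_seen and t not in spam_seen}
    spam_only = {t for t in toks if t in spam_seen and t not in ham_seen}
    if len(ham_only) > len(spam_only):
        return "ham"
    if len(spam_only) > len(ham_only):
        return "spam"
    # Hypothetical handling of ties and all-neutral mail.
    return "undetermined"


# Toy training data, invented for this example.
ham_seen = tokens("let's get some lunch") | tokens("see you at lunch")
spam_seen = tokens("buy cheap pills now")

# One typo ("som") only makes the tokens that contain it neutral;
# "let's", "get", "lunch" and "let's get" still count toward ham.
print(classify("let's get som lunch", ham_seen, spam_seen))

# Mail whose tokens match neither set has nothing to decide on.
print(classify("bonjour tout le monde", ham_seen, spam_seen))
```

In this sketch a message made entirely of unseen tokens comes back as
"undetermined", which is the case the last question above is asking
about; what the real filter does there is still an open question.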
On Wed, 17 Aug 2016, Antony Stone wrote:
On Wednesday 17 August 2016 at 05:06:50, Marc Perkel wrote:
What I'm doing is looking for fingerprints in email that intersect HAM
and not in SPAM - which would be a HAM result.
If it matches SPAM and does NOT match HAM - then it's SPAM.
The magic is in the NOT matching on the other side.
So if I say to you, "Let's get some lunch" that's ham because spammers
never say that, but normal people do. So the way to test what "spammers
never say" is to store what they do say and see if it's NOT in the list.
(Thus the infinite set)
What length are the tokens you store in the list? Single words (so the above
lunch example would contain 4 tokens)? Entire phrases (so the above would be
just 1 token)? Also how do you deal with spam which contains random cuttings
from legitimate texts (generally along with a graphic attachment and/or a URL
to get across the "real" message)?
Similarly, there's only so many ways to misspell viagra, and good email
wouldn't have it spelled wrong.
Does this mean that people with bad spelling will more likely get classified as
spam, because they do not match the 'ham' group very well?
Also, what happens to mail that contains lots of tokens which match neither set
(for example, perfectly legitimate email which happens to be in a language the
system hasn't been trained with)?
Antony.
--
Public key #7BBC68D9 at | Shane Williams
http://pgp.mit.edu/ | System Admin - UT CompSci
=----------------------------------+-------------------------------
All syllogisms contain three lines | sha...@shanew.net
Therefore this is not a syllogism | www.ischool.utexas.edu/~shanew