Re: I have some bad news

Shawn Bakhtiar Wed, 17 Aug 2016 09:17:17 -0700

On Aug 17, 2016, at 3:43 AM, Matus UHLAR - fantomas <uh...@fantomas.sk> wrote:

On 16.08.16 20:06, Marc Perkel wrote:
What I'm doing is looking for fingerprints in email that intersect HAM and not
in SPAM - which would be a HAM result.
If it matches SPAM and does NOT match HAM - then it's SPAM.

The magic is in the NOT matching on the other side.

so, if mail matches both hammy and spammy tokens (or token sets), you don't
classify at all?

I guess what is confusing me (and I imagine others, as alluded to by Matus) is
the fact that you are describing a special condition of Bayes' probability
theorem. You are testing two variables, match SPAM and match HAM (not matching
is simply the negation of matching), which gives you four conditions:

1) SPAM && HAM
2) SPAM && ~HAM
3) ~SPAM && HAM
4) ~SPAM && ~HAM

Here is a great diagram to show the four probable conditions:
https://en.wikipedia.org/wiki/Bayes%27_theorem#/media/File:Bayes%27_Theorem_2D.svg
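
To make the four conditions concrete, here is a rough Python sketch. The token
sets and the condition() helper are made up for illustration; they are my
assumptions, not anything from Marc's implementation.

```python
# Hypothetical token sets, invented for illustration only.
SPAM_TOKENS = {"click", "viagra"}
HAM_TOKENS = {"lunch", "meeting"}


def condition(tokens, spam_tokens=SPAM_TOKENS, ham_tokens=HAM_TOKENS):
    """Return which of the four numbered conditions a message falls into."""
    hits_spam = any(t in spam_tokens for t in tokens)
    hits_ham = any(t in ham_tokens for t in tokens)
    if hits_spam and hits_ham:
        return 1  # SPAM && HAM
    if hits_spam:
        return 2  # SPAM && ~HAM
    if hits_ham:
        return 3  # ~SPAM && HAM
    return 4      # ~SPAM && ~HAM


print(condition(["lunch", "click"]))  # matches both sides: condition 1
```

A message that mixes tokens from both sets lands in condition 1, which is the
case the rule as described does not seem to cover.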

So (if I am correct) Matus is asking: what if condition 1 is true? How are you
classifying an email then? That is the state most emails are in, which is why
Naive Bayes spam filtering, which generates a probability based on Bayes'
probability theorem, is the conventional methodology to date. A rose by any
other name....

Condition 4 is obvious: it's nothing you have ever seen, so classifying it as
anything other than HAM would be a huge mistake (IMHO). It is fully covered by
the aforementioned theorem, as the probability of SPAM would (should) be 0.
Same with condition 3: it never hits SPAM, so whether it matches HAM or not
you're going to treat it as HAM anyway, same as condition 4.

That leaves condition 2, which (if I'm not mistaken) is "... it matches SPAM
and does NOT match HAM - then it's SPAM." That brings us back to Matus's
question: what if the email contains a single HAM token? Two HAM tokens? This
is exactly what Bayes' probability theorem is designed for. All you are doing
is defining a special condition in which the HAM probability is ZERO.
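
A minimal sketch of what I mean by "special condition", assuming a plain naive
Bayes combination with independent tokens. The likelihood numbers and the
spam_posterior() helper are invented for illustration and are not taken from
any real filter.

```python
from math import prod

# Hypothetical per-token likelihoods, invented for illustration only.
P_TOKEN_GIVEN_SPAM = {"click": 0.8, "link": 0.6}
P_TOKEN_GIVEN_HAM = {"click": 0.05, "link": 0.1}


def spam_posterior(tokens, p_spam_tok, p_ham_tok, prior_spam=0.5):
    """P(SPAM | tokens) via Bayes' theorem, treating tokens as independent."""
    spam_score = prior_spam * prod(p_spam_tok[t] for t in tokens)
    ham_score = (1.0 - prior_spam) * prod(p_ham_tok[t] for t in tokens)
    total = spam_score + ham_score
    # If neither class can explain the tokens, fall back to the prior.
    return spam_score / total if total > 0 else prior_spam


tokens = ["click", "link"]

# General case: the tokens are rare in HAM, but not impossible.
general = spam_posterior(tokens, P_TOKEN_GIVEN_SPAM, P_TOKEN_GIVEN_HAM)

# Special condition: the tokens have never been seen in HAM at all.
never_in_ham = {"click": 0.0, "link": 0.0}
special = spam_posterior(tokens, P_TOKEN_GIVEN_SPAM, never_in_ham)

print(general, special)
```

With the HAM likelihoods forced to zero the posterior collapses to exactly 1,
which is "matches SPAM and does NOT match HAM"; with small non-zero values the
same formula gives a graded answer instead.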

I think that's where I need to understand a bit more about what HAM means in
this solution. Does getting a hit on HAM somehow negate it being SPAM
completely? In other words, if the email contains some set of tokens that are
SPAM, yet only one HAM token, does that single HAM token make it not SPAM? If
so, you have a long way to go in convincing me that this is a good solution.

So if I say to you, "Let's get some lunch" that's ham because spammers never
say that, but normal people do. So the way to test what "spammers never say" is
to store what they do say and see if it's NOT in the list. (Thus the infinite
set)

Actually I get SPAM with that very set of tokens in it. If somehow the HAM
rating of it overrides the SPAM, I don't believe it would have a desirable
effect.

I get plenty of:

"
Hay Shawn,

Hope you have time to do some lunch, click on this link and check out my new
pictures!

Wannabe Phisher
"

Based on your example there are plenty of HAM and SPAM tokens in there. "Click
on this link" has a high probability of SPAM-e-ness; would it get HAMed based
on "hope you have time to do some lunch"? Or am I missing something?

Similarly, there's only so many ways to misspell viagra, and good email
wouldn't have it spelled wrong.

Does that make sense?

Again, what you are saying makes sense in that it is a special condition of the
probability theory. What does not make sense is why you would not simply use
the probability theory, which already encompasses that condition.

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Linux - It's now safe to turn on your computer.
Linux - Teraz mozete pocitac bez obav zapnut.
