Re: [Declude.JunkMail] OT: How to define "spam" and "ham"

Matt Tue, 21 Dec 2004 10:18:53 -0800

Pete,

I'm still exploring this topic, or at least trying to...hoping for some others to share their own definitions or practices (nudge, nudge, wink, wink) so the sample would be slightly more scientific.

I am certainly not looking to convince anyone to change their own definitions. Instead, my goal is to further awareness of the differences that may or may not exist, and hopefully to apply this programmatically, and maybe in policy, to the way that Sniffer works, the way I work with Sniffer, or both. I might also find that I need to change my own implementation of the definition that I use because, as Markus stated, "life is short enough to not spend it on handling all this stuff manually." Fixing FP's on ads is a thankless job most of the time.

I do understand the balance that works for Sniffer in handling such matters, but I don't want to be the guy who reports FP's for the things that another user reports as spam. One of us would be wasting our time and pissing off the other. The other day, for instance, someone manually reported the HarryandDavid first-party ad as spam, and then I manually reported it as a false positive. Who is right? Because of this, and regardless of the present system for handling such things, I do think that Sniffer should have a definition for this type of E-mail and a generalized set of rules to follow (soft edges, of course). Today, for instance, you decided to bring backscatter into your definition of spam/unwanted E-mail: a fully conscious choice, and one that needed to be made with purpose and qualification. I believe first-party advertising should be handled similarly, both in qualifying manual reports of false positives and false negatives, and in qualifying tertiary links that land in spamtraps and assign guilt to an innocent source (maybe the association is guilt enough). Although you allow customizations for individual clients to handle such differences, feeling our way through this one case at a time is not the best use of anyone's time unless it is part of a process of finding a larger consensus.

I am not of course so bold as to suggest that my preference would be the best choice for anyone but myself, and hence the query to the list for feedback. I also think that the discussion could be fruitful in many other regards...if people would be willing to share their opinions.

Matt

Pete McNeil wrote:

On Tuesday, December 21, 2004, 4:49:33 AM, Markus wrote:

MG> First of all spam is anything
MG> coming from nonexistent, or forged senders
MG> having "hidden" content

MG> But what you're asking for is the difference between our
MG> human brain and stupid computers (Pete, your comment please ;-)

Well... I'm having fun lurking and I don't want to spoil that. I'm
anxious to learn what folks are thinking about all of this (without my
nudging).

The current implementation of Sniffer is a kind of broad spectrum
hybrid learning system. We use statistical models to try and keep the
core rulebase targeting what our users _seem_ to want filtered then we
customize individual rulebases to match specific preferences. The
learning model isn't perfect, but it has shown that by and large there
is a strong agreement for most folks about what should be filtered -
even if that definition cannot be clearly and consistently stated.
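
To make the layering above concrete, here is a minimal sketch of a shared core rulebase with per-customer overrides. It is not Sniffer's actual implementation: the rule patterns, the customer name, and the names CORE_RULES, CUSTOMER_OVERRIDES and should_filter are all hypothetical, and simple substring matching stands in for whatever matching a real rulebase uses.

```python
from typing import Dict

# Hypothetical core rules: substring pattern -> True means "filter it".
CORE_RULES: Dict[str, bool] = {
    "organ enhancement": True,
    "holiday gift catalog": True,
}

# Hypothetical per-customer overrides layered on top of the core rules.
# Here one customer has asked that first-party catalog ads be let through.
CUSTOMER_OVERRIDES: Dict[str, Dict[str, bool]] = {
    "matt": {"holiday gift catalog": False},
}


def should_filter(customer: str, message: str) -> bool:
    """Return True if any rule in effect for this customer matches the message."""
    effective = dict(CORE_RULES)                              # start from the core
    effective.update(CUSTOMER_OVERRIDES.get(customer, {}))    # apply preferences
    text = message.lower()
    return any(enabled and pattern in text
               for pattern, enabled in effective.items())
```

With these made-up rules, the same catalog ad is filtered for a customer with no overrides and delivered for the one who opted out, while both still share the rest of the core.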

(Note I did not say "what is spam" because that is getting to be more
precise and more contentious these days.)

What I find (and it really stands out when working with Matt) is that
the definition indicated by the standing rules in our core rulebase
is a mixed bag of features and that the definition is highly fluid
around the edges.

For example, in large part Matt's rules would indicate traffic from
chtah is "not spam" but even he admits it's not acceptable to make
that definition hard (not ok to white-list chtah).

One more liberal definition of ham holds that if the recipient has a
first party relationship with the sender then any content from that
sender should not be filtered... Clearly from the volume of direct
advertising that is submitted to us as spam (even as recurring spam
problems) this definition does not hold for most of our users.

This "edge definition problem" was predicted and so far our model is
doing a reasonably good job of dealing with it - though improvements
are clearly needed and are on their way (albeit slowly).

In the meantime, end-user specific Bayesian classification can often
solve the edge problem -- thus reinforcing that the fluidity at the
edge is largely due to differences in the filtering preferences of the
end users and the variability thereof.
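
As a rough illustration of the per-user idea above, the following sketch keeps a separate naive Bayes word model for each end user, so the same message can score as ham for one person and spam for another. It is only a toy under stated assumptions: the UserBayes class, whitespace tokenizing, add-one smoothing, and the training messages are all made up for the example and say nothing about how any particular product does it.

```python
import math
from collections import Counter, defaultdict


class UserBayes:
    """Toy per-user naive Bayes model with add-one (Laplace) smoothing."""

    def __init__(self) -> None:
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.msg_counts = Counter()

    def train(self, label: str, text: str) -> None:
        """Record one message this user labeled as 'spam' or 'ham'."""
        self.msg_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def spam_log_odds(self, text: str) -> float:
        """Positive means 'looks like spam to this user', negative means ham."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        v = len(vocab) or 1
        total_spam = sum(self.word_counts["spam"].values())
        total_ham = sum(self.word_counts["ham"].values())
        # Smoothed prior from how many messages of each kind the user reported.
        score = math.log((self.msg_counts["spam"] + 1) /
                         (self.msg_counts["ham"] + 1))
        for word in text.lower().split():
            p_spam = (self.word_counts["spam"][word] + 1) / (total_spam + v)
            p_ham = (self.word_counts["ham"][word] + 1) / (total_ham + v)
            score += math.log(p_spam / p_ham)
        return score


# One independent model per end user (user names are hypothetical).
models = defaultdict(UserBayes)
models["alice"].train("ham", "holiday gift sale")
models["alice"].train("spam", "cheap pills now")
models["bob"].train("spam", "holiday gift sale")
models["bob"].train("ham", "meeting notes attached")
```

Here models["alice"].spam_log_odds("holiday gift sale") comes out negative and the same call for "bob" comes out positive, which is the edge fluidity in miniature: the message is identical and only the recipient's preference differs.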

Add to that the problem of data collection and the problem becomes not
only difficult to solve, but difficult to measure --- Imagine piloting
a supersonic fighter jet through a narrow winding canyon with your
eyes shut and you've just about got the picture.

As for the stupidity of machines... I personally believe that strong
intelligence can be built artificially (and in fact I do that for fun
and profit)... The big challenge with using AI for spam is the same as
for many AI systems where people's expectations are concerned: The AI
cannot and does not have a human frame of reference and so even if it
did match or exceed the innate intelligence of a human counterpart, it
would not be in a position to predict or model human behaviors
precisely.

Said another way (partly tongue in cheek) - since computers don't have
sex, they don't grok porn and (ahem) organ enhancement spam.

Without a social frame of reference they are reduced to guessing at
otherwise meaningless patterns. You or I could do no better in that
world.

So, what we do with the design of Sniffer is to build a highly
integrated hybrid with both human and machine components. Each gives
the other strong leverage where it's needed. The machines remember
better than we do, find and learn patterns well, and manage large
datasets without too much effort. The humans understand the social
contexts, predict and decode the strategies that are used by spammers,
and interpret the needs and desires of our customers.

I think I might be rambling...

Were these the kinds of comments you were looking for?

_M

---
[This E-mail was scanned for viruses by Declude Virus (http://www.declude.com)]

---
This E-mail came from the Declude.JunkMail mailing list.  To
unsubscribe, just send an E-mail to [EMAIL PROTECTED], and
type "unsubscribe Declude.JunkMail".  The archives can be found
at http://www.mail-archive.com.


--
=====================================================
MailPure custom filters for Declude JunkMail Pro.
http://www.mailpure.com/software/
=====================================================

