Pete,
I'm still exploring this topic, or at least trying to...hoping for some others to share their own definitions or practices (nudge, nudge, wink, wink) so the sample would be slightly more scientific.
I am certainly not at all looking to convince anyone to change their own definitions. Instead my goal is to try to further the awareness of the differences that may or may not exist and hopefully apply this programmatically and maybe in policy to the way that either Sniffer works, or I work with Sniffer...or both. I might also find that I need to change my own implementation of the definition that I use because as Markus stated, "life is short enough to not spend it on handling all this stuff manually." Fixing FPs on ads is a thankless job most of the time.
I do understand the balance that works for Sniffer in handling such matters, but I don't want to be the guy that reports FPs for the things that another user reports as spam. One of us would be wasting our time and pissing off the other. The other day for instance, someone manually reported the HarryandDavid first-party ad, and then I manually reported it as a false positive. Who is right? Because of this, and regardless of the present system for handling such things, I do think that Sniffer should have a definition for this type of E-mail and a generalized set of rules to follow (soft edges of course). Today for instance you decided to bring backscatter into your definition of spam/unwanted E-mail, a fully conscious choice, and one that needed to be done with purpose and qualification. I believe that first-party advertising should be handled similarly, both in qualifying manual reports of false positives and false negatives, and in qualifying some tertiary links that can land in spamtraps and assign guilt to an innocent source (maybe the association is guilt enough). Although you allow for customizations among your individual clients to handle such differences, feeling our way through this is not the best use of anyone's time unless it is part of a process of finding a larger consensus.
I am not of course so bold as to suggest that my preference would be the best choice for anyone but myself, and hence the query to the list for feedback. I also think that the discussion could be fruitful in many other regards...if people would be willing to share their opinions.
Matt
Pete McNeil wrote:
On Tuesday, December 21, 2004, 4:49:33 AM, Markus wrote:
MG> First of all spam is anything
MG> coming from nonexistent, or forged senders
MG> having "hidden" content
MG> But what you're asking for is the difference between our
MG> human brain and stupid computers (Pete, your comment please ;-)
Well... I'm having fun lurking and I don't want to spoil that. I'm anxious to learn what folks are thinking about all of this (without my nudging).
The current implementation of Sniffer is a kind of broad spectrum hybrid learning system. We use statistical models to try and keep the core rulebase targeting what our users _seem_ to want filtered, then we customize individual rulebases to match specific preferences. The learning model isn't perfect, but it has shown that by and large there is a strong agreement for most folks about what should be filtered - even if that definition cannot be clearly and consistently stated.
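To make the layering concrete, here is a minimal sketch of a shared core rulebase with per-customer adjustments. It is only an illustration under stated assumptions, not how Sniffer is actually built: the `Rulebase` class, the `disabled`/`extra` fields, the substring matching, and the sample rules are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Rulebase:
    """Hypothetical rulebase: shared core rules plus per-customer adjustments."""
    core: set[str]                                   # patterns most users seem to want filtered
    disabled: set[str] = field(default_factory=set)  # core rules this customer switched off
    extra: set[str] = field(default_factory=set)     # rules only this customer wants

    def active_rules(self) -> set[str]:
        # Customer view = core rules minus opt-outs, plus customer-only rules.
        return (self.core - self.disabled) | self.extra

    def matches(self, message: str) -> list[str]:
        # Naive substring matching stands in for real pattern rules.
        text = message.lower()
        return sorted(rule for rule in self.active_rules() if rule in text)


# Made-up core rules, purely for illustration.
CORE = {"organ enhancement", "limited time offer"}

default_customer = Rulebase(core=CORE)
# A customer who does not want first-party promotions filtered.
tolerant_customer = Rulebase(core=CORE, disabled={"limited time offer"})

msg = "Limited time offer on holiday gift baskets"
print(default_customer.matches(msg))
print(tolerant_customer.matches(msg))
```

The same message is caught for one customer and passed for the other, which is the "fluid edge" in miniature: the core stays shared while the disagreement lives in the per-customer layer.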
(Note I did not say "what is spam" because that is getting to be more precise and more contentious these days.)
What I find (and it really stands out when working with Matt) is that the definition indicated by the standing rules in our core rulebase is a mixed bag of features and that the definition is highly fluid around the edges.
For example, in large part Matt's rules would indicate traffic from chtah is "not spam" but even he admits it's not acceptable to make that definition hard (not ok to white-list chtah).
One more liberal definition of ham holds that if the recipient has a first party relationship with the sender then any content from that sender should not be filtered... Clearly from the volume of direct advertising that is submitted to us as spam (even as recurring spam problems) this definition does not hold for most of our users.
This "edge definition problem" was predicted and so far our model is doing a reasonably good job of dealing with it - though improvements are clearly needed and are on their way (albeit slowly).
In the meantime, end-user specific Bayesian classification can often solve the edge problem -- thus reinforcing that the fluidity at the edge is largely due to differences in the filtering preferences of the end users and the variability thereof.
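A toy example of that per-user effect, assuming a plain naive Bayes model with Laplace smoothing kept separately for each recipient. The `UserBayes` class, the training phrases, and the two users are invented for illustration and are not taken from any particular product.

```python
import math
from collections import Counter


class UserBayes:
    """Tiny per-user naive Bayes filter (illustrative only)."""

    def __init__(self) -> None:
        self.words = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, label: str, text: str) -> None:
        self.words[label].update(text.lower().split())
        self.docs[label] += 1

    def spam_probability(self, text: str) -> float:
        vocab = len(set(self.words["spam"]) | set(self.words["ham"]))
        total_docs = sum(self.docs.values())
        log_score = {}
        for label in ("spam", "ham"):
            # Laplace smoothing on both the class prior and the word counts.
            score = math.log((self.docs[label] + 1) / (total_docs + 2))
            n_words = sum(self.words[label].values())
            for word in text.lower().split():
                count = self.words[label][word]
                score += math.log((count + 1) / (n_words + vocab + 1))
            log_score[label] = score
        # Turn the two log scores into P(spam | text).
        return 1 / (1 + math.exp(log_score["ham"] - log_score["spam"]))


# Two hypothetical users who disagree about the same first-party ad.
alice = UserBayes()
alice.train("ham", "holiday gift basket sale")
alice.train("spam", "cheap pills online")

bob = UserBayes()
bob.train("spam", "holiday gift basket sale")
bob.train("ham", "meeting notes attached")

ad = "gift basket sale"
print(f"alice: {alice.spam_probability(ad):.2f}")
print(f"bob:   {bob.spam_probability(ad):.2f}")
```

Because each user trains their own model, the same advertisement scores below 0.5 for the user who kept it and above 0.5 for the user who reported it, without either preference touching a shared rulebase.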
Add to that the problem of data collection and the problem becomes not only difficult to solve, but difficult to measure. Imagine piloting a supersonic fighter jet through a narrow winding canyon with your eyes shut and you've just about got the picture.
As for the stupidity of machines... I personally believe that strong intelligence can be built artificially (and in fact I do that for fun and profit)... The big challenge with using AI for spam is the same as for many AI systems where people's expectations are concerned: The AI cannot and does not have a human frame of reference and so even if it did match or exceed the innate intelligence of a human counterpart, it would not be in a position to predict or model human behaviors precisely.
Said another way (partly tongue in cheek) - since computers don't have sex, they don't grok porn and (ahem) organ enhancement spam.
Without a social frame of reference they are reduced to guessing at otherwise meaningless patterns. You or I could do no better in that world.
So, what we do with the design of Sniffer is to build a highly integrated hybrid with both human and machine components. Each gives the other strong leverage where it's needed. The machines remember better than we do, find and learn patterns well, and manage large datasets without too much effort. The humans understand the social contexts, predict and decode the strategies that are used by spammers, and interpret the needs and desires of our customers.
I think I might be rambling...
Were these the kinds of comments you were looking for?
_M
--- [This E-mail was scanned for viruses by Declude Virus (http://www.declude.com)]
--- This E-mail came from the Declude.JunkMail mailing list. To unsubscribe, just send an E-mail to [EMAIL PROTECTED], and type "unsubscribe Declude.JunkMail". The archives can be found at http://www.mail-archive.com.
-- ===================================================== MailPure custom filters for Declude JunkMail Pro. http://www.mailpure.com/software/ =====================================================
