------------ Forwarded Message ------------
Date: Tuesday, September 09, 2003 9:57 AM +0200
From: Thorsten Sick <[EMAIL PROTECTED]>
To: Kenneth Porter <[EMAIL PROTECTED]>
Subject: Re: [SAtalk] Fitz, an add-on to Spamassassin
Hi
On Tue, 2003-09-09 at 03:41, Kenneth Porter wrote:
> - The results of the AI alone are as good as Spamassassin's results.
> Combined it is therefore better.
What would make the combined result better?
My experience with practical use of SpamAssassin together with Fitz shows the following:
- SpamAssassin catches the 90% or more of spam that is NOT optimized to get past SpamAssassin.
- The rest is caught by Fitz. The rest is optimized spam and things the user doesn't want.
One interesting experiment used mail from an account for a role-playing-game weekend. Ham was subscriptions and questions. Spam was the usual, PLUS a role-playing-game newsletter with a lot of announcements. Normally the newsletter would count as non-spam, and it looks like ham: it talks about RPGs and even mentions the convention and where to subscribe. But after Fitz learned two instances of it, it was classified as spam.
What does Fitz do differently from SA?
The big new thing is a special tokenization. Many naive Bayes solutions dissect the spam word by word, even the header. Fitz dissects every field of the header in its own special way. It doesn't learn -0700 but: Time-zone = -0700
That way it gets more information out of a mail. The Date header alone gives us:
Mon,        => When does the user normally get mail? Job accounts get less ham on weekends
08 Sep 2003 => Not really relevant
18:41:33    => A lot of spam is written between midnight and about 5 o'clock
-0700       => Time zone; interesting for firms that only have local partners
And this is only the date header.
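To make the idea concrete, here is a minimal sketch of field-aware tokenization for the Date header. It is not Fitz's actual code: the function name tokenize_date_header, the token spellings, and the choice to keep only the hour of the time are assumptions made for illustration. The day-month-year part is skipped because it is described above as not really relevant.

```python
from email.utils import parsedate_to_datetime

# Fixed English names so the tokens do not depend on the system locale.
WEEKDAYS = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")

def tokenize_date_header(value):
    """Turn a Date header value into field-tagged tokens (hypothetical format)."""
    dt = parsedate_to_datetime(value)
    # %z yields e.g. "-0700"; it is empty when the header carried no usable offset.
    offset = dt.strftime("%z") or "unknown"
    return [
        "Date.weekday=" + WEEKDAYS[dt.weekday()],
        "Date.hour=%02d" % dt.hour,          # assumption: bucket the time by hour
        "Date.time-zone=" + offset,
    ]

tokens = tokenize_date_header("Mon, 08 Sep 2003 18:41:33 -0700")
# tokens == ["Date.weekday=Mon", "Date.hour=18", "Date.time-zone=-0700"]
```

Because each token carries the name of the field it came from, a classifier can learn that "-0700" as a time zone is different evidence from "-0700" appearing somewhere in the body.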
I also tried not to use Paul Graham's naive Bayes but to stay as close to the AI-textbook-standard naive Bayes as possible. I had to alter it a bit for my special tokenization, but not much.
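For readers who want to see what a textbook naive Bayes over such tokens could look like, here is a small sketch. It is a generic multinomial naive Bayes with add-one smoothing, not the altered variant used in Fitz; the class name NaiveBayes, its methods, and the token strings are all hypothetical.

```python
import math
from collections import Counter

class NaiveBayes:
    """Textbook multinomial naive Bayes over arbitrary string tokens."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.token_counts = {label: Counter() for label in self.LABELS}
        self.message_counts = Counter()

    def learn(self, tokens, label):
        self.token_counts[label].update(tokens)
        self.message_counts[label] += 1

    def log_score(self, tokens, label):
        vocab = set(self.token_counts["spam"]) | set(self.token_counts["ham"])
        total = sum(self.token_counts[label].values())
        messages = sum(self.message_counts.values())
        # Smoothed class prior, so a class with no training mail is still defined.
        score = math.log((self.message_counts[label] + 1) / (messages + 2))
        for token in tokens:
            # Add-one smoothing; the extra +1 leaves room for never-seen tokens.
            score += math.log(
                (self.token_counts[label][token] + 1) / (total + len(vocab) + 1)
            )
        return score

    def classify(self, tokens):
        return max(self.LABELS, key=lambda label: self.log_score(tokens, label))
```

Scores are summed in log space so that long token lists do not underflow. Field-tagged tokens such as "Time-zone = -0700" need no special handling here; to the classifier they are simply distinct vocabulary entries.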
Thorsten Sick
--
Thorsten Sick [EMAIL PROTECTED]
www.hort-des-wissens.de
Winter is coming
-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GCS d-- s++:- a-- C++ UL+++ P+++ L+++ E W++ N o K w--- O-- M- V- PS+ PE- Y+
PGP++ t 5+++ X+ R+ !tv b++++ DI- D G e+ h-- r++ y?
------END GEEK CODE BLOCK------
---------- End Forwarded Message ----------