On 14 Oct 2003 00:42:38 -0700, Daniel Quinlan <[EMAIL PROTECTED]> posted
to gmane.mail.spam.spamassassin.general:

> "Fred I-IS.COM" <[EMAIL PROTECTED]> writes:
>
>> I created a list which might be helpful: using a dictionary, I
>> searched for letter pairs which do not exist in English words. I
>> created the following meta rule to search for these non-existent
>> pairs; it might do just what you are looking for.
>
> Your meta rule seems to work pretty well.
>
> Some issues that might need to be worked out:
>
> - getting it to work in an internationalized fashion; we could just
>   write a rule that is only used when the message specifies that it
>   is English, when "ok_languages en" is set, or something like that,
>   but that is non-optimal
> - false positives are still a bit high:
>   - PGP signatures
>   - some "legitimate" URLs (the Network Solutions unsubscribe URL in
>     renewal notices)
>
> Another thing that might work well is instead using an eval test that
> counts non-existent pairs. There are also the triplets and N-gram
> files used by the language testing in TextCat.pm -- we could test
> N-gram frequency, and if the message is well off the language model
> for its advertised language, score a hit.
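For reference, the counting variant Dan describes -- an eval test that
counts non-existent pairs instead of firing a regex once -- would amount
to something like the rough Python sketch below. The pair set and the
function name are illustrative placeholders, not Fred's full
dictionary-derived list, and the real thing would of course be a Perl
eval test inside SpamAssassin rather than standalone Python:

    import re

    # A tiny hand-picked subset of letter pairs that, as far as I can
    # tell, never occur inside English dictionary words.  A real rule
    # would use the full list generated from a dictionary, as Fred did.
    IMPOSSIBLE_PAIRS = {"jq", "qx", "qz", "vq", "xj", "zx"}

    def count_impossible_pairs(body):
        """Count 'impossible' English letter pairs in the message body.

        Returning a count rather than a yes/no match lets the score
        scale with how gibberish-like the body is; the false positives
        Dan mentions (PGP signatures, odd URLs) would still need to be
        excluded or weighted down separately.
        """
        hits = 0
        for word in re.findall(r"[a-z]+", body.lower()):
            for i in range(len(word) - 1):
                if word[i:i + 2] in IMPOSSIBLE_PAIRS:
                    hits += 1
        return hits

    print(count_impossible_pairs("Buy vqxzj pillz now!"))         # 2
    print(count_impossible_pairs("Perfectly ordinary English."))  # 0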
I'd suggest going with the n-gram approach, absolutely. The framework for
doing it in a language-independent fashion is already there, so why not
use it? What's more, I think this could be useful on a wider basis, too --
a lot of the body regex rules are, quite frankly, a bit fuzzy (either too
specific or too general) and could perhaps be replaced with some sort of
n-gram language models. (Think, e.g., one language model for English ham
and another for English spam. Quick informal testing suggests this is not
an infeasible approach.)

What troubles me is the patent status of this whole n-gram thing. US
Patent No. 5,418,951 (issued in 1995, continuing an earlier application
from the early 1990s) seems to cover pretty much all of it. I don't know
whether language identification in particular (as done by TextCat and
several other similar applications) was missing from the earlier versions
of the patent, but topic filtering is definitely something it was always
meant to cover. A Google search on "damashek patent language
identification" brings up a number of interesting technical documents,
but nothing on the status of the patent ...

A note on using digraphs (or digrams, if we're talking n-grams): they are
a nice, compact way to get pretty good results, but the approach leaks a
bit, because looking only at pairs gives you too narrow a scope. As a
trivial case in point, pairs of identical consonants are perfectly okay
in English, but triplets of identical consonants are not. (And, as the
false-positive cases Dan enumerates show, you can't declare a message
"not English" on the strength of a single hit.) On the other hand, the
n-gram search space for n=2 is nicely bounded (26^2 = 676 possible
pairs), whereas larger values of n give you a large, sparse search space.
But there are methods for coping with that.

/* era */
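P.S. To make the n-gram frequency idea concrete: the test Dan describes
amounts to building a ranked character n-gram profile of the message and
measuring how far it sits from a precomputed profile for the advertised
language, e.g. with the rank-order ("out-of-place") distance that
TextCat-style classifiers use. The sketch below is a simplified,
self-contained illustration in Python; the function names and the toy
training text are mine, not TextCat.pm's actual code or data. A real test
would build the language profile from the N-gram files TextCat.pm already
uses and score a hit when the distance for the advertised language is
well above some tuned threshold:

    from collections import Counter

    def ngram_profile(text, n_values=(1, 2, 3), top=300):
        """Ranked character n-gram profile, most frequent n-grams first."""
        counts = Counter()
        for word in text.lower().split():
            padded = "_" + word + "_"          # mark word boundaries
            for n in n_values:
                for i in range(len(padded) - n + 1):
                    counts[padded[i:i + n]] += 1
        return [gram for gram, _ in counts.most_common(top)]

    def out_of_place(doc_profile, lang_profile):
        """Average rank displacement of the document's n-grams relative
        to the language profile; n-grams missing from the profile get
        the maximum penalty.  Smaller means a better fit."""
        worst = len(lang_profile)
        rank = {gram: i for i, gram in enumerate(lang_profile)}
        total = sum(abs(i - rank.get(gram, worst))
                    for i, gram in enumerate(doc_profile))
        return total / max(len(doc_profile), 1)

    # Toy "language model" built from a couple of lines of plain English;
    # a real profile would come from a large corpus (or from the data
    # files TextCat already uses).
    english = ngram_profile("the quick brown fox jumps over the lazy dog "
                            "this is a perfectly ordinary english sentence")

    gibberish = "vqxzj kqwzv xjqpt buy now cheap"
    normal = "the quick brown dog is perfectly ordinary"

    print(out_of_place(ngram_profile(gibberish), english))  # larger: poor fit
    print(out_of_place(ngram_profile(normal), english))     # smaller: good fit

Normalizing by the number of n-grams in the message keeps messages of
different lengths comparable; TextCat proper sidesteps that by comparing
the same message against every language model and picking the closest.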