On 14 Oct 2003 00:42:38 -0700, Daniel Quinlan <[EMAIL PROTECTED]> posted
to gmane.mail.spam.spamassassin.general:

> "Fred I-IS.COM" <[EMAIL PROTECTED]> writes:
>
>> I created a list which might be helpful: using a dictionary, I
>> searched for letter pairs which do not exist in English words. I
>> created the following meta rule to search for these non-existent
>> pairs; it might do just what you are looking for.
>
> Your meta rule seems to work pretty well.
>
> Some issues that might need to be worked out:
>
> - getting it to work in an internationalized fashion; we could just
>   write a rule that is only used when the message specifies that it
>   is English, when "ok_languages en" is set, or something like that,
>   but that is non-optimal
> - false positives are still a bit high:
>   - PGP signatures
>   - some "legitimate" URLs (the Network Solutions unsubscribe URL in
>     renewal notices)
>
> Another thing that might work well is instead using an eval test that
> counts non-existent pairs. There are also the triplets and N-gram
> files used by the language testing in TextCat.pm -- we could test
> N-gram frequency, and if the message is well off the language model
> for its advertised language, score a hit.
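For reference, the counting variant Dan describes -- an eval test that
counts non-existent pairs instead of firing a regex once -- would amount
to something like the rough Python sketch below. The pair set and the
function name are illustrative placeholders, not Fred's full
dictionary-derived list, and the real thing would of course be a Perl
eval test inside SpamAssassin rather than standalone Python:

    import re

    # A tiny hand-picked subset of letter pairs that, as far as I can
    # tell, never occur inside English dictionary words.  A real rule
    # would use the full list generated from a dictionary, as Fred did.
    IMPOSSIBLE_PAIRS = {"jq", "qx", "qz", "vq", "xj", "zx"}

    def count_impossible_pairs(body):
        """Count 'impossible' English letter pairs in the message body.

        Returning a count rather than a yes/no match lets the score
        scale with how gibberish-like the body is; the false positives
        Dan mentions (PGP signatures, odd URLs) would still need to be
        excluded or weighted down separately.
        """
        hits = 0
        for word in re.findall(r"[a-z]+", body.lower()):
            for i in range(len(word) - 1):
                if word[i:i + 2] in IMPOSSIBLE_PAIRS:
                    hits += 1
        return hits

    print(count_impossible_pairs("Buy vqxzj pillz now!"))         # 2
    print(count_impossible_pairs("Perfectly ordinary English."))  # 0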
I'd suggest going with the n-gram approach, absolutely. The framework for
doing it in a language-independent fashion is already there, so why not
use it? What's more, I think this could be useful on a wider basis, too --
a lot of the body regex rules are, quite frankly, a bit fuzzy (either too
specific or too general) and could perhaps be replaced with some sort of
n-gram language models. (Think, e.g., one language model for English ham
and another for English spam. Quick informal testing suggests this is not
an infeasible approach.)

What troubles me is the patent status of this whole n-gram thing. US
Patent No. 5,418,951 (issued in 1995, continuing an earlier application
from the early 1990s) seems to cover pretty much all of it. I don't know
whether language identification in particular (as done by TextCat and
several other similar applications) was missing from the earlier versions
of the patent, but topic filtering is definitely something it was always
meant to cover. A Google search on "damashek patent language
identification" brings up a number of interesting technical documents,
but nothing on the status of the patent ...

A note on using digraphs (or digrams, if we're talking n-grams): they are
a nice, compact way to get pretty good results, but the approach leaks a
bit, because looking only at pairs gives you too narrow a scope. As a
trivial case in point, pairs of identical consonants are perfectly okay
in English, but triplets of identical consonants are not. (And, as the
false-positive cases Dan enumerates show, you can't declare a message
"not English" on the strength of a single hit.) On the other hand, the
n-gram search space for n=2 is nicely bounded (26^2 = 676 possible
pairs), whereas larger values of n give you a large, sparse search space.
But there are methods for coping with that.

/* era */
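P.S. To make the n-gram frequency idea concrete: the test Dan describes
amounts to building a ranked character n-gram profile of the message and
measuring how far it sits from a precomputed profile for the advertised
language, e.g. with the rank-order ("out-of-place") distance that
TextCat-style classifiers use. The sketch below is a simplified,
self-contained illustration in Python; the function names and the toy
training text are mine, not TextCat.pm's actual code or data. A real test
would build the language profile from the N-gram files TextCat.pm already
uses and score a hit when the distance for the advertised language is
well above some tuned threshold:

    from collections import Counter

    def ngram_profile(text, n_values=(1, 2, 3), top=300):
        """Ranked character n-gram profile, most frequent n-grams first."""
        counts = Counter()
        for word in text.lower().split():
            padded = "_" + word + "_"          # mark word boundaries
            for n in n_values:
                for i in range(len(padded) - n + 1):
                    counts[padded[i:i + n]] += 1
        return [gram for gram, _ in counts.most_common(top)]

    def out_of_place(doc_profile, lang_profile):
        """Average rank displacement of the document's n-grams relative
        to the language profile; n-grams missing from the profile get
        the maximum penalty.  Smaller means a better fit."""
        worst = len(lang_profile)
        rank = {gram: i for i, gram in enumerate(lang_profile)}
        total = sum(abs(i - rank.get(gram, worst))
                    for i, gram in enumerate(doc_profile))
        return total / max(len(doc_profile), 1)

    # Toy "language model" built from a couple of lines of plain English;
    # a real profile would come from a large corpus (or from the data
    # files TextCat already uses).
    english = ngram_profile("the quick brown fox jumps over the lazy dog "
                            "this is a perfectly ordinary english sentence")

    gibberish = "vqxzj kqwzv xjqpt buy now cheap"
    normal = "the quick brown dog is perfectly ordinary"

    print(out_of_place(ngram_profile(gibberish), english))  # larger: poor fit
    print(out_of_place(ngram_profile(normal), english))     # smaller: good fit

Normalizing by the number of n-grams in the message keeps messages of
different lengths comparable; TextCat proper sidesteps that by comparing
the same message against every language model and picking the closest.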