On Fri, 2 Jan 2004 16:24:12 +0100, [EMAIL PROTECTED] wrote:

> Essentially AI interpretation of the meaning (or intent as they
> put it) of language in order to identify spam.

I haven't looked at that product at all and haven't read the whitepaper, so I'm just commenting on the above statement:

Getting a computer to interpret written language and figure out the meaning and message it contains is *very* difficult. Most languages are too ambiguous and inconsistent for computers, and they are also full of hidden meaning that you only understand if you know the aphorisms, the proverbs, the taken-for-granted-in-this-context, etc. This is a whole science in itself.

If these people have actually managed to make an application that can find the meaning of a text fast and accurately enough to use in a spam filter, they ought to contact some computational linguistics researchers and give them some hints. :-) And then we might see more automagic translation apps that actually work. :-)

I know of no well-functioning commercial or open source project in this area, but considering that cheap computers keep getting faster and that a number of universities are researching and refining algorithms for doing this, I'm sure we'll see some eventually. Once it is feasible to have a computer do this fast enough, we'll probably see antispam systems (as well as anti-whatever, pro-whatever, search engines, etc.) using it.

The recent batch of randomly-copy-from-a-dictionary spams has given me an idea for something simpler but still connected to this. I am not sure if this idea is worth following up at all, but here goes.

It should be a lot easier to find texts that are probably meaningless. For example, English sentences should contain a mix of different word classes and word forms. When a paragraph contains only the base forms of words, almost all the words are nouns, and none of the words are of the "the, on, it, why, for, I, you" kind, that paragraph is either written in *extremely* poor English or it doesn't mean much and is probably just random words from a dictionary.
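
A minimal sketch of the function-word part of this idea, in Python. The word list, the function names and both thresholds are made up for illustration and are not taken from any existing filter. It does not look at word classes or inflected forms at all, so it only covers the "none of the small words are present" test.

```python
import re

# Hypothetical, hand-picked list of common English function words.
# A real filter would need a larger list and some tuning.
FUNCTION_WORDS = frozenset("""
    the a an and or but if of to in on at by for with from as
    i you he she it we they me my your his her its our their
    is are was were be been am do does did have has had
    this that these those not no why what who how when where
""".split())


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())


def function_word_ratio(text):
    """Return the fraction of words that are function words (0.0 if empty)."""
    words = tokenize(text)
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in FUNCTION_WORDS)
    return hits / len(words)


def looks_like_dictionary_salad(text, min_words=20, threshold=0.05):
    """Flag longer texts that contain almost no function words.

    min_words and threshold are assumed values; short texts are never
    flagged because the ratio is too noisy for them.
    """
    if len(tokenize(text)) < min_words:
        return False
    return function_word_ratio(text) < threshold
```

A paragraph of random nouns has a ratio near zero, while even clumsy English usually has a ratio well above the threshold, which is the gap the idea relies on.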

For this you don't need to check the relations between words, which makes it a lot easier and also less prone to false positives. A lot of non-native English writers get words in the wrong order, and you don't want to classify their mails as spam just because of that. But almost all non-native English writers do manage to get in verbs as well as nouns, and also manage to use different forms of words (even if not always the correct ones) as well as words like "for, the, on, in, I, you, it". This makes bad English quite different from the randomly-copy-from-a-dictionary spams.

Of course, once spammers realize that the randomly-copy-from-a-dictionary method starts failing them, we will see a lot more copy-whole-paragraphs-from-books-or-online-texts instead. And that means we're back to the complex stuff again.

Regards
/Jonas

--
Jonas Eckerman, [EMAIL PROTECTED]
http://www.fsdb.org/

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk