On Fri, 2 Jan 2004 16:24:12 +0100, [EMAIL PROTECTED] wrote:

>  Essentially AI interpretation of the meaning (or intent as they
>  put it) of language in order to identify spam.

I haven't looked at that product at all and haven't read the whitepaper, so I'm just 
commenting on the above statement:

Getting a computer to interpret written language to figure out the meaning and 
message it contains is *very* difficult. Most languages are too ambiguous and 
inconsistent for computers, and they are also full of hidden meaning you only 
understand if you know the aphorisms, the proverbs, the 
taken-for-granted-in-this-context, etc. This is simply a whole science in itself.

If these people have actually managed to make an application that can find the meaning 
of a text fast and accurately enough to use in a spam-filter, they ought to contact 
some computer/linguistic researchers and give them some hints. :-) And then we might 
see more automagic translation apps that actually work. :-)

I know of no well-functioning commercial or open source project in this area, but 
considering that cheap computers keep getting faster and that a number of 
universities are researching and refining algorithms for doing this I'm sure we'll see 
some eventually.

Once it is feasible to have a computer do this fast enough, we'll probably see antispam 
(as well as anti-whatever, pro-whatever, search engines etc) systems using it.

The recent batch of randomly-copy-from-a-dictionary spams has given me an idea for 
something simpler but still connected to this though:

I am not sure if this idea is worth following up at all, but here goes.

It should be a lot easier to find possibly meaningless texts. For example, English 
sentences should contain a mix of different word classes and forms. When a paragraph 
contains only the base forms of words, almost all of the words are nouns, and none 
of them are function words like "the, on, it, why, for, I, you", that paragraph is 
either written in *extremely* poor English or it doesn't mean much and is probably 
just random words from a dictionary.
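
A rough sketch of that check in Python might look like the following. The word list, 
the threshold and the minimum length are all made up for illustration and not tuned on 
any real mail, and the function names are hypothetical. It only looks at the share of 
"the, on, it" kind of words and ignores word forms entirely.

```python
import re

# Hypothetical, hand-picked list of common English function words.
# A real filter would need a longer and better tested list.
FUNCTION_WORDS = frozenset(
    "the a an on in of to for and or but it is are was i you we they "
    "why this that with at by from not".split()
)


def function_word_ratio(text):
    """Return the share of words in text that are function words."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in FUNCTION_WORDS)
    return hits / len(words)


def looks_like_dictionary_salad(text, threshold=0.1, min_words=8):
    """Flag text that is long enough to judge but has almost no function words.

    threshold and min_words are arbitrary guesses, not measured values.
    """
    words = re.findall(r"[a-z']+", text.lower())
    if len(words) < min_words:
        return False  # too short to say anything
    return function_word_ratio(text) < threshold
```

Since no relations between words are checked, this is cheap to run, but it would 
presumably have to be one weak signal among many rather than a verdict on its own.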

For this you don't need to check the relations between words, which makes it a lot easier 
and also less prone to FPs. A lot of non-native English writers get words in the wrong 
order, and you don't want to classify their mails as spam just because of that, but 
almost all non-native English writers do manage to include verbs as well as 
nouns and also manage to use different forms of words (even if not always the 
correct ones) as well as words like "for, the, on, in, I, you, it". This makes bad 
English quite different from the randomly-copy-from-a-dictionary spams.
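
To illustrate the "different forms of words" part, here is a very crude Python sketch 
that counts words that look inflected. The suffix list and the minimum stem length are 
my own assumptions, it will miscount plenty of words (base forms ending in "s", 
irregular verbs), and it does not care whether the form is the correct one, which is 
the point: "buyed" counts just as well as "bought" would not.

```python
import re

# Hypothetical suffixes treated as a hint that a word is not in its base form.
INFLECTION_SUFFIXES = ("ing", "ed", "es", "s")


def inflected_ratio(text, min_stem=3):
    """Return the share of words that end in one of the suffixes above.

    min_stem is an arbitrary guard so that short words like "is" or "bed"
    are not counted as inflected.
    """
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0

    def is_inflected(word):
        return any(
            word.endswith(suffix) and len(word) - len(suffix) >= min_stem
            for suffix in INFLECTION_SUFFIXES
        )

    return sum(1 for w in words if is_inflected(w)) / len(words)
```

A mail in broken English such as "He go to shops yesterday and buyed two books for 
reading" still gets a clearly non-zero score here, while a list of dictionary 
headwords would normally score at or near zero.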

Of course, once spammers realize that the randomly-copy-from-a-dictionary method 
starts failing them, we will see a lot more 
copy-whole-paragraphs-from-books-or-online-texts instead. And that means we're back to 
the complex stuff again.

Regards
/Jonas

--
Jonas Eckerman, [EMAIL PROTECTED]
http://www.fsdb.org/



_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
