Craig R Hughes writes:

> thanks, great work.  It's getting late now, and I have a big
> breakfast meeting early tomorrow, so I'll take a look at this
> sometime after noon.  Is it kosher to roll this with the
> language-detection stuff and all into the SA distribution then?
> Sounds like you've got the upstream author's OK to do that.  Don't
> want to accidentally step on his/her toes though.  If it's OK then
> after I patch my local tree and do a test or two, I'll check it into
> CVS.  I'm sure by the time we get around to rolling 2.30 it'll be
> stable enough.

Yes, it's okay.  I'll forward you the permission from the author.
 
> One thing that would be useful here is probably to get a couple of
> foreign-language messages for test purposes, along with creation of
> t/language_ok.t -- I'll do those too -- the latter is pretty easy,
> the former ought to be straightforward by just copy/pasting some
> text from random foreign-language websites.

There were a bunch of test files distributed with TextCat.

> As far as accuracy, I understand that if the thing thinks it can't
> tell what the real language is, it'll try to be overly broad rather
> than overly narrow, but do you have any stats on how often it rules
> out the actual language of a message?  I suppose that'll be factored
> into score for the rule.

It seems quite rare.

I tested 3678 non-spam messages which should be all English.  All
languages are treated equally, so it shouldn't matter that I'm using
English.  Of the 3678 messages, 3263 were matched to one or more
languages and 415 couldn't be matched to a language.  Only 1 message of
the 3263 was misidentified as non-English (it had a one-word body).
Incidentally, it found 5 non-English spam messages that had made it
through spamassassin and my hand-filtering.
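
The rates implied by those counts can be worked out with a short sketch.
The counts are the ones quoted above; the variable names are illustrative
only and do not come from TextCat or SpamAssassin.

```python
# Counts reported for the all-English, non-spam test corpus.
total = 3678          # messages tested
matched = 3263        # matched to one or more languages
unmatched = 415       # could not be matched to any language
misidentified = 1     # matched, but not to English

# The two groups should account for every message.
assert matched + unmatched == total

match_rate = matched / total
unmatched_rate = unmatched / total
# False-positive rate among messages that received a language guess.
false_positive_rate = misidentified / matched

print(f"matched:         {match_rate:.1%}")
print(f"unmatched:       {unmatched_rate:.1%}")
print(f"false positives: {false_positive_rate:.2%}")
```

That works out to roughly 88.7% matched, 11.3% unmatched, and about
0.03% of the matched messages wrongly identified as non-English.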

One place it does have a bit of trouble is the "evenly mixed" language
messages that sometimes turn up in spam.  I tested a different set of
421 spam messages (about half were not English).  Here are some of the
errors I found:

 - It flagged a spam message that was one third English, one third
   Spanish, and one third Catalan (really!) as Catalan only.
 - A quad-lingual spam (English, Spanish, French, and Italian) was
   flagged as one of Catalan, French and Spanish.

There may have been some more like that, but frankly, I got tired of
looking at spams after a while.  It seems like anyone expecting more
than one language will add the other languages to their ok_languages
list and multi-lingual messages shouldn't be a problem.  Also, I'm only
really concerned about false-positives and they don't seem to be a
problem.
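
A minimal sketch of why a broad guess plus a longer ok_languages list
keeps multi-lingual mail from being flagged.  This is not the actual
SpamAssassin code: the function name, the two-letter language codes, and
the choice to leave unmatched messages unflagged are all assumptions
made for illustration.

```python
def is_unwanted_language(detected, ok_languages):
    """Return True only when a guess exists and none of it is acceptable.

    detected     -- languages the detector could not rule out (may be empty)
    ok_languages -- languages the recipient expects to receive
    Hypothetical helper; not part of SpamAssassin or TextCat.
    """
    if not detected:
        # Assumption: no match means "can't tell", so do not flag.
        return False
    # One overlapping language is enough to accept the message.
    return not (set(detected) & set(ok_languages))


# The quad-lingual spam above was guessed as Catalan, French, or Spanish.
guess = ["ca", "fr", "es"]

english_only = is_unwanted_language(guess, ["en"])
english_and_spanish = is_unwanted_language(guess, ["en", "es"])
no_guess = is_unwanted_language([], ["en"])
```

Under these assumptions the message is flagged for an English-only
recipient, passes once Spanish is added to the list, and a message with
no language guess is never flagged.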

Having the GA score this would be nice.  My last rule was almost a
"fiver" after the GA got done with it.  :-)

Dan

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
