On 15 Oct 2003 08:50:17 +0300, I posted to the spamassassin-talk
mailing list:
> On 14 Oct 2003 00:42:38 -0700, Daniel Quinlan <[EMAIL PROTECTED]>
> posted to gmane.mail.spam.spamassassin.general:
>> Another thing that might work well is instead using an eval test that
>> counts non-existent pairs. There are also the triplets and N-gram files
>> used by the language testing in TextCat.pm -- we could test N-gram
>> frequency and if the advertized language is well off the language model
>> for that language, then score a hit.
>
> I'd suggest going with this solution, absolutely. The framework for
> doing it in a language-independent fashion is already there, why not
> use it?
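For anybody who hasn't looked at how this kind of N-gram test works, here
is a rough sketch in Python of the general rank-order ("out-of-place")
comparison that TextCat-style language models are based on. This is not
lifted from TextCat.pm: the function names, the underscore padding, the
maximum N-gram length of 5 and the cutoff of 400 N-grams are all my
assumptions for illustration.

```python
from collections import Counter

def ngram_profile(text, max_n=5, top=400):
    """Return the most frequent character N-grams (N = 1..max_n), ranked.

    The padding character, max_n and top are illustrative assumptions.
    """
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Most frequent first; ties are broken alphabetically so the
    # ranking is deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:top]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences between a document and a language model.

    N-grams that do not occur in the model at all get the maximum
    penalty, which is the length of the model.
    """
    rank = {g: i for i, g in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(i - rank[g]) if g in rank else penalty
               for i, g in enumerate(doc_profile))
```

A message whose distance from the model of its advertised language is
well above what ordinary text in that language scores would be a
candidate for a hit. Where to put that threshold is something you would
have to find out by experiment.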
I had a look at this and tried it out. The results from limited
training are not exactly stellar but nevertheless IMHO promising.

One thing which is probably problematic for anybody who wants to look
further into this is the lack of documentation of how the lm/ stuff
works. Here's a brief rundown of what I did:

 1. Obtain TextCat from the location in the file lm/LICENSE
 2. (For completeness, figure out the mapping between the original
    naming scheme and the one used in SpamAssassin. This is outlined
    in lm/README but it would be beneficial to get an actual mapping.
    I attach one I recreated by induction :-)
 3. Use the TextCat training mode to create new language models.
 4. (Optionally, convert the .lm files into .ln ones. Haven't tried
    that so can't offer advice.)
 5. (Optionally, remove some of lm/*.l[mn] -- I think a number of
    them are probably superfluous in practice.)
 6. Dump your new models into lm/ and run lm/build.pl
 7. (Maybe hack on SA itself to interpret the results from language
    identification differently from what it does now.)

/* era */

Here's the mapping from textcat/LM/*.lm to spamassassin/lm/*.lm

Missing:
  textcat/LM/drents.lm (zero-byte file anyway)

New in SA/lm:
  lm/inactive/haw.lm (hmm, maybe I have an older TextCat?)
  lm/ja.iso-2022-jp.ln
  lm/tr.iso-8859-9.ln
LM/afrikaans.lm             lm/af.lm
LM/albanian.lm              lm/sq.lm
LM/amharic-utf.lm           lm/am.utf-8.lm
LM/arabic-iso8859_6.lm      lm/ar.iso-8859-6.lm
LM/arabic-windows1256.lm    lm/ar.windows-1256.lm
LM/armenian.lm              lm/hy.lm
LM/basque.lm                lm/eu.lm
LM/belarus-windows1251.lm   lm/be.windows-1251.lm
LM/bosnian.lm               lm/bs.lm
LM/breton.lm                lm/inactive/br.lm
LM/bulgarian-iso8859_5.lm   lm/bg.iso-8859-5.lm
LM/catalan.lm               lm/ca.lm
LM/chinese-big5.lm          lm/zh.big5.lm
LM/chinese-gb2312.lm        lm/zh.gb2312.lm
LM/croatian-ascii.lm        lm/hr.us-ascii.lm
LM/czech-iso8859_2.lm       lm/cs.iso-8859-2.lm
LM/danish.lm                lm/da.lm
LM/dutch.lm                 lm/nl.lm
LM/english.lm               lm/en.lm
LM/esperanto.lm             lm/eo.lm
LM/estonian.lm              lm/et.lm
LM/finnish.lm               lm/fi.lm
LM/french.lm                lm/fr.lm
LM/frisian.lm               lm/fy.lm
LM/georgian.lm              lm/ka.lm
LM/german.lm                lm/de.lm
LM/greek-iso8859-7.lm       lm/el.iso-8859-7.lm
LM/hebrew-iso8859_8.lm      lm/he.iso-8859-8.lm
LM/hindi.lm                 lm/hi.lm
LM/hungarian.lm             lm/hu.lm
LM/icelandic.lm             lm/is.lm
LM/indonesian.lm            lm/id.lm
LM/irish.lm                 lm/ga.lm
LM/italian.lm               lm/it.lm
LM/japanese-euc_jp.lm       lm/ja.euc-jp.lm
LM/japanese-shift_jis.lm    lm/ja.shift-jis.lm
LM/korean.lm                lm/ko.lm
LM/latin.lm                 lm/la.lm
LM/latvian.lm               lm/lv.lm
LM/lithuanian.lm            lm/lt.lm
LM/malay.lm                 lm/ms.lm
LM/manx.lm                  lm/inactive/gv.lm
LM/marathi.lm               lm/mr.lm
LM/middle-frisian.lm        lm/inactive/middle-frisian.lm
LM/mingo.lm                 lm/inactive/mingo.lm
LM/nepali.lm                lm/ne.lm
LM/norwegian.lm             lm/no.lm
LM/persian.lm               lm/fa.lm
LM/polish.lm                lm/pl.lm
LM/portuguese.lm            lm/pt.lm
LM/quechua.lm               lm/qu.lm
LM/romanian.lm              lm/ro.lm
LM/rumantsch.lm             lm/rm.lm
LM/russian-iso8859_5.lm     lm/ru.iso-8859-5.lm
LM/russian-koi8_r.lm        lm/ru.koi8-r.lm
LM/russian-windows1251.lm   lm/ru.windows-1251.lm
LM/sanskrit.lm              lm/sa.lm
LM/scots.lm                 lm/sco.lm
LM/scots_gaelic.lm          lm/gd.lm
LM/serbian-ascii.lm         lm/sr.us-ascii.lm
LM/slovak-ascii.lm          lm/sk.us-ascii.lm
LM/slovak-windows1250.lm    lm/sk.windows-1250.lm
LM/slovenian-ascii.lm       lm/sl.us-ascii.lm
LM/slovenian-iso8859_2.lm   lm/sl.iso-8859-2.lm
LM/spanish.lm               lm/es.lm
LM/swahili.lm               lm/sw.lm
LM/swedish.lm               lm/sv.lm
LM/tagalog.lm               lm/tl.lm
LM/tamil.lm                 lm/ta.lm
LM/thai.lm                  lm/th.lm
LM/turkish.lm               lm/tr.unknown.lm
LM/ukrainian-koi8_r.lm      lm/uk.koi8-r.lm
LM/vietnamese.lm            lm/vi.lm
LM/welsh.lm                 lm/cy.lm
LM/yiddish-utf.lm           lm/yi.utf-8.lm
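
If you retrain some of the TextCat models and want to drop them into
lm/ under the SpamAssassin names, the mapping above can be used
mechanically. Here is a minimal sketch in Python; parse_mapping and
install_models are names I made up, and the directory layout (a TextCat
directory containing LM/, a SpamAssassin source directory containing
lm/) is an assumption. It defaults to a dry run which only reports what
it would copy.

```python
import shutil
from pathlib import Path

def parse_mapping(text):
    """Parse 'LM/name.lm lm/code.lm' pairs, one pair per line."""
    mapping = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and anything that is not a two-column pair.
        if len(fields) == 2 and fields[0].startswith("LM/"):
            mapping[fields[0]] = fields[1]
    return mapping

def install_models(mapping, textcat_dir, sa_dir, dry_run=True):
    """Copy each TextCat model that exists to its SpamAssassin name.

    Returns the list of (source, destination) paths. Nothing is
    written unless dry_run is False.
    """
    planned = []
    for src, dst in sorted(mapping.items()):
        src_path = Path(textcat_dir) / src
        dst_path = Path(sa_dir) / dst
        if not src_path.is_file():
            continue  # e.g. a model you did not retrain
        planned.append((src_path, dst_path))
        if not dry_run:
            dst_path.parent.mkdir(parents=True, exist_ok=True)
            shutil.copyfile(src_path, dst_path)
    return planned
```

You would still need to run lm/build.pl afterwards, as in step 6 of the
rundown.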

-- 
The email address era at iki dot fi is heavily spam filtered. If you
want to reach me, see the contact information link on my home page at
<http://www.iki.fi/era/> instead.

Just for kicks, imagine what it's like to get 500 pieces of spam for
each wanted message.