[SAtalk] Re: Consonant and Vowel Pairs or Sequences

era Tue, 28 Oct 2003 08:49:10 -0800

On 15 Oct 2003 08:50:17 +0300, I posted to the spamassassin-talk
mailing list:
 > On 14 Oct 2003 00:42:38 -0700, Daniel Quinlan <[EMAIL PROTECTED]>
 > posted to gmane.mail.spam.spamassassin.general:
 >> Another thing that might work well is instead using an eval test that
 >> counts non-existent pairs.  There are also the triplets and N-gram files
 >> used by the language testing in TextCat.pm -- we could test N-gram
 >> frequency and if the advertized language is well off the language model
 >> for that language, then score a hit.
 > I'd suggest going with this solution, absolutely. The framework for
 > doing it in a language-independent fashion is already there, why not
 > use it?


I had a look at this and tried it out. The results from limited
training are not exactly stellar but nevertheless IMHO promising.

One thing which is probably problematic for anybody who wants to look
further into this is the lack of documentation of how the lm/ stuff
works.

Here's a brief rundown of what I did:

 1. Obtain TextCat from the location in the file lm/LICENSE

 2. (For completeness, figure out the mapping between the original
    naming scheme and the one used in SpamAssassin. This is outlined
    in lm/README but it would be beneficial to get an actual mapping.
    I attach one I recreated by induction :-)

 3. Use the TextCat training mode to create new language models.

 4. (Optionally, convert the .lm files into .ln ones. Haven't tried
    that so can't offer advice.)

 5. (Optionally, remove some of lm/*.l[mn] -- I think a number of them
    are probably superfluous in practice.)

 6. Dump your new models into lm/ and run lm/build.pl

 7. (Maybe hack on SA itself to interpret the results from language
    identification differently from what it does now.)
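
For anyone unfamiliar with what the training in step 3 produces: TextCat
follows the Cavnar and Trenkle approach, where a language model is a ranked
list of the most frequent character N-grams, and a text is scored by how far
its own N-gram ranks are "out of place" relative to that list. The Python
sketch below only illustrates the idea. It is not TextCat's or SpamAssassin's
code, the names ngram_profile and out_of_place are made up for illustration,
and the tokenization, the maximum N-gram length of 5 and the cut-off of 400
N-grams are assumptions.

```python
import re
from collections import Counter


def ngram_profile(text, max_n=5, top=400):
    """Return the `top` most frequent character N-grams, most frequent first.

    Assumptions for this sketch: words are runs of letters, each word is
    padded with "_" on both sides, and ties are broken alphabetically.
    """
    counts = Counter()
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        padded = "_" + word + "_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; N-grams missing from the model get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else penalty
        for i, gram in enumerate(doc_profile)
    )


if __name__ == "__main__":
    model = ngram_profile("the cat sat on the mat")
    print(out_of_place(ngram_profile("the mat sat on the cat"), model))
    print(out_of_place(ngram_profile("zzzz qqqq"), model))
```

A large distance from the model of the advertised language is the kind of
signal the quoted suggestion would score a hit on. Where to put the threshold
is something that would have to come out of testing.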

/* era */

Here's the mapping from textcat/LM/*.lm to spamassassin/lm/*.lm

Missing: textcat/LM/drents.lm (zero-byte file anyway)

New in SA/lm:

lm/inactive/haw.lm (hmm, maybe I have an older TextCat?)
lm/ja.iso-2022-jp.ln
lm/tr.iso-8859-9.ln

LM/afrikaans.lm             lm/af.lm
LM/albanian.lm              lm/sq.lm
LM/amharic-utf.lm           lm/am.utf-8.lm
LM/arabic-iso8859_6.lm      lm/ar.iso-8859-6.lm
LM/arabic-windows1256.lm    lm/ar.windows-1256.lm
LM/armenian.lm              lm/hy.lm
LM/basque.lm                lm/eu.lm
LM/belarus-windows1251.lm   lm/be.windows-1251.lm
LM/bosnian.lm               lm/bs.lm
LM/breton.lm                lm/inactive/br.lm
LM/bulgarian-iso8859_5.lm   lm/bg.iso-8859-5.lm
LM/catalan.lm               lm/ca.lm
LM/chinese-big5.lm          lm/zh.big5.lm
LM/chinese-gb2312.lm        lm/zh.gb2312.lm
LM/croatian-ascii.lm        lm/hr.us-ascii.lm
LM/czech-iso8859_2.lm       lm/cs.iso-8859-2.lm
LM/danish.lm                lm/da.lm
LM/dutch.lm                 lm/nl.lm
LM/english.lm               lm/en.lm
LM/esperanto.lm             lm/eo.lm
LM/estonian.lm              lm/et.lm
LM/finnish.lm               lm/fi.lm
LM/french.lm                lm/fr.lm
LM/frisian.lm               lm/fy.lm
LM/georgian.lm              lm/ka.lm
LM/german.lm                lm/de.lm
LM/greek-iso8859-7.lm       lm/el.iso-8859-7.lm
LM/hebrew-iso8859_8.lm      lm/he.iso-8859-8.lm
LM/hindi.lm                 lm/hi.lm
LM/hungarian.lm             lm/hu.lm
LM/icelandic.lm             lm/is.lm
LM/indonesian.lm            lm/id.lm
LM/irish.lm                 lm/ga.lm
LM/italian.lm               lm/it.lm
LM/japanese-euc_jp.lm       lm/ja.euc-jp.lm
LM/japanese-shift_jis.lm    lm/ja.shift-jis.lm
LM/korean.lm                lm/ko.lm
LM/latin.lm                 lm/la.lm
LM/latvian.lm               lm/lv.lm
LM/lithuanian.lm            lm/lt.lm
LM/malay.lm                 lm/ms.lm
LM/manx.lm                  lm/inactive/gv.lm
LM/marathi.lm               lm/mr.lm
LM/middle-frisian.lm        lm/inactive/middle-frisian.lm
LM/mingo.lm                 lm/inactive/mingo.lm
LM/nepali.lm                lm/ne.lm
LM/norwegian.lm             lm/no.lm
LM/persian.lm               lm/fa.lm
LM/polish.lm                lm/pl.lm
LM/portuguese.lm            lm/pt.lm
LM/quechua.lm               lm/qu.lm
LM/romanian.lm              lm/ro.lm
LM/rumantsch.lm             lm/rm.lm
LM/russian-iso8859_5.lm     lm/ru.iso-8859-5.lm
LM/russian-koi8_r.lm        lm/ru.koi8-r.lm
LM/russian-windows1251.lm   lm/ru.windows-1251.lm
LM/sanskrit.lm              lm/sa.lm
LM/scots.lm                 lm/sco.lm
LM/scots_gaelic.lm          lm/gd.lm
LM/serbian-ascii.lm         lm/sr.us-ascii.lm
LM/slovak-ascii.lm          lm/sk.us-ascii.lm
LM/slovak-windows1250.lm    lm/sk.windows-1250.lm
LM/slovenian-ascii.lm       lm/sl.us-ascii.lm
LM/slovenian-iso8859_2.lm   lm/sl.iso-8859-2.lm
LM/spanish.lm               lm/es.lm
LM/swahili.lm               lm/sw.lm
LM/swedish.lm               lm/sv.lm
LM/tagalog.lm               lm/tl.lm
LM/tamil.lm                 lm/ta.lm
LM/thai.lm                  lm/th.lm
LM/turkish.lm               lm/tr.unknown.lm
LM/ukrainian-koi8_r.lm      lm/uk.koi8-r.lm
LM/vietnamese.lm            lm/vi.lm
LM/welsh.lm                 lm/cy.lm
LM/yiddish-utf.lm           lm/yi.utf-8.lm
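
If you want to use the list above programmatically (say, to rename freshly
trained models into the SpamAssassin scheme), something like the following
Python sketch might do. It assumes each mapping line has exactly two
whitespace-separated fields starting with LM/ and lm/, and it reads the
SpamAssassin names as "language code, optional charset, extension", which is
only what the list suggests, not something documented. The function names
parse_mapping and language_tag and the SAMPLE text are made up for
illustration.

```python
import posixpath

# A few lines copied from the list above, plus one non-mapping line.
SAMPLE = """\
Missing: textcat/LM/drents.lm (zero-byte file anyway)
LM/afrikaans.lm lm/af.lm
LM/arabic-iso8859_6.lm  lm/ar.iso-8859-6.lm
LM/breton.lm    lm/inactive/br.lm
"""


def parse_mapping(text):
    """Map TextCat file names to SpamAssassin ones, skipping other lines."""
    mapping = {}
    for line in text.splitlines():
        fields = line.split()
        if (len(fields) == 2
                and fields[0].startswith("LM/")
                and fields[1].startswith("lm/")):
            mapping[fields[0]] = fields[1]
    return mapping


def language_tag(sa_path):
    """Split e.g. 'lm/ar.iso-8859-6.lm' into ('ar', 'iso-8859-6').

    The charset is None when the name carries none, e.g. 'lm/af.lm'.
    """
    stem, _ext = posixpath.splitext(posixpath.basename(sa_path))
    lang, _, charset = stem.partition(".")
    return lang, charset or None


if __name__ == "__main__":
    for textcat_name, sa_name in parse_mapping(SAMPLE).items():
        print(textcat_name, "->", sa_name, language_tag(sa_name))
```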

-- 
The email address era     the contact information   Just for kicks, imagine
at iki dot fi is heavily  link on my home page at   what it's like to get
spam filtered.  If you    <http://www.iki.fi/era/>  500 pieces of spam for
want to reach me, see     instead.                  each wanted message.
