Hey all,

I promise this will be the last I write on this for a while, but people suggested I look at bi-gram, tri-gram, and up to 5-gram distributions of the words found in GNOME software messages. That is, given the input message:
"This is a long sentence so there." We'd get the tri-grams : This is a is a long a long sentence long sentence so sentence so there - usually people mess about with this sort of thing when trying to build terminology lists, but I suspect my sample set is a bit small to be interesting. Regardless, I've got results at http://blogs.sun.com/roller/page/timf?entry=more_word_bagging cheers, tim -- Tim Foster - Tools Engineer, Software Globalisation, Sun Microsystems, Inc. Project Lead, Open Language Tools https://open-language-tools.dev.java.net/ http://blogs.sun.com/timf http://www.netsoc.ucd.ie/~timf _______________________________________________ gnome-i18n mailing list gnome-i18n@gnome.org http://mail.gnome.org/mailman/listinfo/gnome-i18n