Hey all,

I promise this will be the last I write on this for a while, but
people suggested I look at bi-gram, tri-gram and up to 5-gram
distributions of words found in GNOME software messages. That is, given
the input message:

"This is a long sentence so there."

we'd get the tri-grams:

This is a
is a long
a long sentence
long sentence so
sentence so there

People usually mess about with this sort of thing when trying to build
terminology lists, but I suspect my sample set is a bit too small for
the results to be interesting.
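
For anyone who wants to try the same thing, the tri-gram example above
can be sketched roughly as follows in Python. This is only an
illustration, not the script behind the linked results: the function
names (ngrams, ngram_distribution) are made up for this sketch, the
tokenisation is a naive whitespace split with punctuation stripped, and
it assumes n-grams should not run across message boundaries.

```python
from collections import Counter


def ngrams(message, n):
    """Return the word n-grams of one message as a list of tuples."""
    # Naive tokenisation (an assumption of this sketch): split on
    # whitespace, then strip punctuation from the ends of each word.
    words = [w.strip('.,;:!?"\'()') for w in message.split()]
    words = [w for w in words if w]
    # Slide a window of n words over the message; a message shorter
    # than n words yields no n-grams at all.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_distribution(messages, max_n=5):
    """Count n-grams for n = 2..max_n over a collection of messages."""
    counts = {n: Counter() for n in range(2, max_n + 1)}
    for message in messages:
        # Each message is handled on its own, so no n-gram spans two
        # different messages.
        for n in counts:
            counts[n].update(ngrams(message, n))
    return counts


# Print the tri-grams of the example sentence, one per line.
for gram in ngrams("This is a long sentence so there.", 3):
    print(" ".join(gram))
```

Calling ngram_distribution(messages)[3].most_common(20) on a list of
message strings would then give the twenty most frequent tri-grams,
which is about the level at which terminology candidates start to show
up.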

Regardless, I've got results at:
http://blogs.sun.com/roller/page/timf?entry=more_word_bagging


        cheers,
                        tim
-- 
Tim Foster - Tools Engineer, Software Globalisation, Sun Microsystems, Inc.
Project Lead, Open Language Tools https://open-language-tools.dev.java.net/
http://blogs.sun.com/timf         http://www.netsoc.ucd.ie/~timf

_______________________________________________
gnome-i18n mailing list
gnome-i18n@gnome.org
http://mail.gnome.org/mailman/listinfo/gnome-i18n
