[HACKERS] N-grams

Anthony Gentile Wed, 12 Jan 2011 23:29:03 -0800

Hello,

     Today I was reading a blog post by a coworker
http://www.depesz.com/index.php/2010/12/11/waiting-for-9-1-knngist/ and
started experimenting with the trigram contrib package (pg_trgm) for
Postgres, using word dictionaries for English and German. I wanted to see
how well particular queries would perform if SIGLENINT in trgm.h were set
to the average word length, in characters, of a particular dictionary:


http://packages.ubuntu.com/dapper/wamerican
compling=# SELECT AVG(LENGTH(CAST(word AS bytea), 'UTF8')) FROM english_words;
        avg
--------------------
 8.4498980409662267

vs

http://packages.ubuntu.com/dapper/wngerman
compling=# SELECT AVG(LENGTH(CAST(word AS bytea), 'UTF8')) FROM words; -- German
         avg
---------------------
 11.9518056504365566

(Unsurprisingly, German words are on average longer than English ones.)

Effectively, I want the trigram package to behave like a general n-gram
package where N is set explicitly at build time. I am, however, not very
proficient in C, and I doubt this is the only change needed to convert the
trigram contrib module to n-grams: after changing SIGLENINT to 12 in trgm.h,
show_trgm() still returns trigrams. I was hoping someone familiar with the
code could outline what needs to change for the trigram implementation to
behave as an n-gram implementation. Thanks for your time; I appreciate any
advice anyone can give me.
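
To make concrete what I mean by "behave as an n-gram", here is a minimal
standalone sketch in C. It is not taken from the contrib code: the function
name make_ngrams() and its buffer layout are hypothetical, and the padding
rule (n-1 leading spaces, one trailing space) is only my assumption about
how the documented trigram padding (two spaces in front, one behind) would
generalize. It also works on bytes, so multibyte UTF-8 words would need
extra handling.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, NOT part of pg_trgm.
 *
 * Splits `word` into overlapping n-grams after padding it with n-1 spaces
 * in front and 1 space behind (assumed generalization of trigram padding).
 * Each n-gram is written to `out` as a NUL-terminated string occupying
 * n+1 bytes, so `out` must hold at least max_out * (n + 1) bytes.
 * Returns the number of n-grams written (0 if allocation fails).
 * Operates on bytes, not characters.
 */
static int
make_ngrams(const char *word, int n, char *out, int max_out)
{
    size_t  len;
    size_t  padded_len;
    size_t  i;
    char   *padded;
    int     count = 0;

    assert(word != NULL && out != NULL && n >= 1);

    len = strlen(word);
    padded_len = (size_t) (n - 1) + len + 1;

    padded = malloc(padded_len + 1);
    if (padded == NULL)
        return 0;

    /* Build: (n-1 spaces) + word + (1 space) */
    memset(padded, ' ', (size_t) (n - 1));
    memcpy(padded + (n - 1), word, len);
    padded[padded_len - 1] = ' ';
    padded[padded_len] = '\0';

    /* Slide a window of n bytes across the padded word */
    for (i = 0; i + (size_t) n <= padded_len && count < max_out; i++)
    {
        char *slot = out + (size_t) count * (size_t) (n + 1);

        memcpy(slot, padded + i, (size_t) n);
        slot[n] = '\0';
        count++;
    }

    free(padded);
    return count;
}
```

With n = 3 and the word "cat" this yields the four grams "  c", " ca",
"cat" and "at ", in order of position (unsorted and not de-duplicated).
With n = 4 it yields "   c", "  ca", " cat" and "cat ".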

Anthony Gentile
