Consider Spaces in pg_trgm for Better Similarity

Igal @ Lucee.org Sun, 28 Jan 2018 21:57:26 -0800

Is there a way to consider white space in tri-grams? That would allow for better matches of phrases.

For example, "one two three" and "three two one" currently generate the same tri-grams ({"  o","  t"," on"," th"," tw","ee ","hre","ne ","one","ree","thr","two","wo "}), so the distance from "one two four" is the same for both of them. The query:


SELECT   phrase
        ,input
        ,similarity(t1.phrase, t2.input)
        ,word_similarity(t1.phrase, t2.input)
FROM      (values('one two three'),('three two one')) t1(phrase)
        ,(values('one two four')) t2(input);

Returns:

phrase        |input        |similarity  |word_similarity |
--------------|-------------|------------|----------------|
one two three |one two four |0.444444448 |0.615384638     |
three two one |one two four |0.444444448 |0.615384638     |

But surely "one two four" is more similar to "one two three" than to "three two one".
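
To show why the two scores come out identical, here is a rough Python sketch of how I understand pg_trgm to build its trigram set: lowercase the text, split it into words, pad each word with two spaces in front and one behind, and collect the distinct three-character windows. The helper names trigrams() and similarity() are my own, and the sketch only handles plain ASCII input, so treat it as an approximation and not as the extension's actual code.

```python
import re


def trigrams(text):
    """Approximate the trigram set pg_trgm builds for ASCII text."""
    grams = set()
    # Assumption: words are runs of letters and digits, compared in lowercase.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two spaces before, one after each word
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


same = trigrams("one two three") == trigrams("three two one")
score_1 = similarity("one two three", "one two four")
score_2 = similarity("three two one", "one two four")
print(same, round(score_1, 6), round(score_2, 6))
# prints: True 0.444444 0.444444
```

Each word is padded on its own and the result is a set, so nothing records which word came before which. Both phrases share 8 of the 18 distinct trigrams with "one two four", which gives 8/18 = 0.444 for both.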


Any thoughts?

Igal Sapir
Lucee Core Developer
Lucee.org <http://lucee.org/>
