Re: [BUGS] pg_trgm word_similarity inconsistencies or bug

François CHAHUNEAU Thu, 07 Dec 2017 09:40:19 -0800

Hello Alexander,
This is fine with us. Yes, separate thresholds seem preferable.
Best Regards

Obtenez Outlook pour iOS<https://aka.ms/o0ukef>
________________________________
From: Alexander Korotkov <a.korot...@postgrespro.ru>
Sent: Thursday, December 7, 2017 4:38:59 PM
To: Jan Przemysław Wójcik; Cristiano Coelho
Cc: pgsql-b...@postgresql.org; François CHAHUNEAU; Artur Zakirov; pgsql-hackers
Subject: Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

On Tue, Nov 7, 2017 at 7:24 PM, Alexander Korotkov 
<a.korot...@postgrespro.ru<mailto:a.korot...@postgrespro.ru>> wrote:
On Tue, Nov 7, 2017 at 3:51 PM, Jan Przemysław Wójcik 
<jan.przemyslaw.woj...@gmail.com<mailto:jan.przemyslaw.woj...@gmail.com>> wrote:
my statement about the function usefulness was probably too categorical,
though I had in mind the current name of the function.

I'm afraid that creating a function that implements quite different
algorithms depending on a global parameter seems very hacky and would lead
to misunderstandings. I do understand the need of backward compatibility,
but I'd opt for the lesser evil. Perhaps a good idea would be to change the
name to 'substring_similarity()' and introduce the new function
'word_similarity()' later, for example in the next major version release.

Good point.  I've no complaints about that.  I'm going to propose corresponding 
patch to the next commitfest.

I've written a draft patch for fixing this inconsistency.  Please, find it in 
attachment.  This patch doesn't contain proper documentation and comments yet.

I've called existing behavior subset_similarity().  I didn't use name 
substring_similarity(), because it doesn't really looking for substring with 
appropriate padding, but rather searching for continuous subset of trigrams.  
For index search over subset similarity, %>>, <<%, <->>>, <<<-> operators are 
provided.  I've added extra arrow sign to denote these operators look deeper 
into string.

Simultaneously, word_similarity() now forces extent bounds to be word bounds.  
Now word_similarity() behaves similar to my_word_similarity() proposed on 
stackoverlow.

# with data(t) as (
values
('message'),
('message s'),
('message sag'),
('message sag sag'),
('message sag sage')
)
select t, subset_similarity('sage', t), word_similarity('sage', t)
from data;
        t         | subset_similarity | word_similarity
------------------+-------------------+-----------------
 message          |               0.6 |             0.3
 message s        |               0.8 |        0.363636
 message sag      |                 1 |             0.5
 message sag sag  |                 1 |             0.5
 message sag sage |                 1 |               1
(5 rows)

The difference here is only in 'messsage s' row, because word_similarity() 
allows matching one word to two or more while my_word_similarity() doesn't 
allow that.  In this case word_similarity() returns similarity between 'sage' 
and 'message s'.

# select similarity('sage', 'message s');
 similarity
------------
   0.363636
(1 row)

I think behavior of word_similarity() appears better here, because typo can 
break word into two.

I also wonder if word_similarity() and subset_similarity() should share same 
threshold value for indexed search.  subset_similarity() typically returns 
higher values than word_similarity().  Thus, it's probably makes sense to split 
their threshold values.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com<http://www.postgrespro.com/>
The Russian Postgres Company

Re: [BUGS] pg_trgm word_similarity inconsistencies or bug

Reply via email to