Hello Alexander,

This is fine with us. Yes, separate thresholds seem preferable.

Best Regards
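P.S. For concreteness, a rough sketch of what separate thresholds could look like on our side. The pg_trgm.subset_similarity_threshold name and the <<% semantics below are only assumed from the description of the draft patch, not taken from the patch itself:

    -- existing GUC used by the '<%' / '%>' word-similarity operators
    SET pg_trgm.word_similarity_threshold = 0.5;
    -- hypothetical GUC for the proposed subset-similarity operators
    SET pg_trgm.subset_similarity_threshold = 0.8;

    -- '<%' compares word_similarity('sage', t) against the first threshold;
    -- '<<%' is assumed to compare subset_similarity('sage', t) against the second.
    WITH data(t) AS (
        VALUES ('message'), ('message s'), ('message sag'),
               ('message sag sag'), ('message sag sage')
    )
    SELECT t,
           word_similarity('sage', t)   AS word_sml,
           subset_similarity('sage', t) AS subset_sml
      FROM data
     WHERE 'sage' <% t
        OR 'sage' <<% t;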
Get Outlook for iOS <https://aka.ms/o0ukef>

________________________________
From: Alexander Korotkov <a.korot...@postgrespro.ru>
Sent: Thursday, December 7, 2017 4:38:59 PM
To: Jan Przemysław Wójcik; Cristiano Coelho
Cc: pgsql-b...@postgresql.org; François CHAHUNEAU; Artur Zakirov; pgsql-hackers
Subject: Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

On Tue, Nov 7, 2017 at 7:24 PM, Alexander Korotkov <a.korot...@postgrespro.ru> wrote:
> On Tue, Nov 7, 2017 at 3:51 PM, Jan Przemysław Wójcik
> <jan.przemyslaw.woj...@gmail.com> wrote:
>> My statement about the function's usefulness was probably too
>> categorical, though I had in mind the current name of the function.
>> I'm afraid that creating a function that implements quite different
>> algorithms depending on a global parameter seems very hacky and would
>> lead to misunderstandings. I do understand the need for backward
>> compatibility, but I'd opt for the lesser evil. Perhaps a good idea
>> would be to change the name to 'substring_similarity()' and introduce
>> the new function 'word_similarity()' later, for example in the next
>> major version release.
>
> Good point. I've no complaints about that. I'm going to propose a
> corresponding patch for the next commitfest.

I've written a draft patch to fix this inconsistency; please find it attached. The patch doesn't contain proper documentation or comments yet.

I've called the existing behavior subset_similarity(). I didn't use the name substring_similarity(), because it doesn't really look for a substring with appropriate padding, but rather searches for a continuous subset of trigrams. For index search over subset similarity, the %>>, <<%, <->>> and <<<-> operators are provided. I've added an extra arrow sign to denote that these operators look deeper into the string.

Simultaneously, word_similarity() now forces extent bounds to be word bounds. Now word_similarity() behaves similarly to the my_word_similarity() proposed on Stack Overflow.

# with data(t) as (
      values ('message'), ('message s'), ('message sag'),
             ('message sag sag'), ('message sag sage')
  )
  select t, subset_similarity('sage', t), word_similarity('sage', t)
  from data;
        t         | subset_similarity | word_similarity
------------------+-------------------+-----------------
 message          |               0.6 |             0.3
 message s        |               0.8 |        0.363636
 message sag      |                 1 |             0.5
 message sag sag  |                 1 |             0.5
 message sag sage |                 1 |               1
(5 rows)

The only difference here is in the 'message s' row, because word_similarity() allows matching one word against two or more, while my_word_similarity() doesn't allow that. In this case word_similarity() returns the similarity between 'sage' and 'message s'.

# select similarity('sage', 'message s');
 similarity
------------
   0.363636
(1 row)

I think the behavior of word_similarity() is better here, because a typo can break a word into two.

I also wonder whether word_similarity() and subset_similarity() should share the same threshold value for indexed search. subset_similarity() typically returns higher values than word_similarity(), so it probably makes sense to split their threshold values.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
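As a minimal sketch of how the proposed operators might be exercised against a trigram index, assuming they get the same GiST/GIN opclass support as the existing <% / %> word-similarity operators (the actual opclass support is up to the draft patch):

    CREATE EXTENSION IF NOT EXISTS pg_trgm;

    CREATE TABLE data (t text);
    INSERT INTO data (t) VALUES
        ('message'), ('message s'), ('message sag'),
        ('message sag sag'), ('message sag sage');

    CREATE INDEX data_t_trgm_idx ON data USING gist (t gist_trgm_ops);

    -- assumed: 'sage' <<% t filters by subset_similarity('sage', t) against
    -- its (possibly separate) threshold, by analogy with '<%' for word_similarity()
    SELECT t, subset_similarity('sage', t) AS sml
      FROM data
     WHERE 'sage' <<% t
     ORDER BY sml DESC, t;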