Hi!

> * levenshtein_internal() and levenshtein_less_equal_internal() are very
>   similar. Can you merge the code? We can always use less_equal_internal()
>   if the overhead is ignorable. Did you compare them?

With a large value of max_d the overhead is significant. Here is an example
on the american-english dictionary from OpenOffice:
test=# select sum(levenshtein('qweqweqweqweqwe',word)) from words;
   sum
---------
 1386456
(1 row)

Time: 195,083 ms

test=# select sum(levenshtein_less_equal('qweqweqweqweqwe',word,100)) from words;
   sum
---------
 1386456
(1 row)

Time: 317,821 ms

> * There are many "if (!multibyte_encoding)" in levenshtein_internal().
>   How about splitting the function into two funcs for single-byte chars
>   and multi-byte chars? (ex. levenshtein_internal_mb()) Or, we can always
>   use the multi-byte version if the overhead is small.

The multi-byte version was about 4 times slower. But I have rewritten my
CHAR_CMP macro as an inline function, and now it is only about 1.5 times
slower.

In a database with multi-byte encoding:

test=# select * from words where levenshtein('qweqweqwe',word)<=5;
  id   |   word
-------+----------
 69053 | peewee
 69781 | pewee
 81279 | sequence
 88421 | sweetie
(4 rows)

Time: 136,742 ms

In a database with single-byte encoding:

test2=# select * from words where levenshtein('qweqweqwe',word)<=5;
  id   |   word
-------+----------
 69053 | peewee
 69781 | pewee
 81279 | sequence
 88421 | sweetie
(4 rows)

Time: 88,471 ms

Anyway, I think that overhead is not ignorable. That's why I have split
levenshtein_internal into levenshtein_internal and levenshtein_internal_mb,
and levenshtein_less_equal_internal into levenshtein_less_equal_internal
and levenshtein_less_equal_internal_mb.

> * I prefer a struct rather than an array. "4 * m" and "3 * m" look like
>   magic numbers for me. Could you name the entries with definition of a
>   struct?
>     /*
>      * For multibyte encoding we'll also store array of lengths of
>      * characters and array with character offsets in first string
>      * in order to avoid great number of pg_mblen calls.
>      */
>     prev = (int *) palloc(4 * m * sizeof(int));

In this line of code, memory is allocated for four arrays: prev, curr,
offsets, and char_lens. So I have joined offsets and char_lens into a
struct.
But I can't join prev and curr because of this trick:

    temp = curr;
    curr = prev;
    prev = temp;

> * There are some compiler warnings. Avoid them if possible.
>     fuzzystrmatch.c: In function ‘levenshtein_less_equal_internal’:
>     fuzzystrmatch.c:428: warning: ‘char_lens’ may be used uninitialized
>     in this function
>     fuzzystrmatch.c:428: warning: ‘offsets’ may be used uninitialized in
>     this function
>     fuzzystrmatch.c:430: warning: ‘curr_right’ may be used uninitialized
>     in this function
>     fuzzystrmatch.c: In function ‘levenshtein_internal’:
>     fuzzystrmatch.c:222: warning: ‘char_lens’ may be used uninitialized
>     in this function

Fixed.

> * Coding style: Use "if (m == 0)" instead of "if (!m)" when the type
>   of 'm' is an integer.

Fixed.

> * Need to fix the caution in docs.
>   http://developer.postgresql.org/pgdocs/postgres/fuzzystrmatch.html
>   | Caution: At present, fuzzystrmatch does not work well with
>   | multi-byte encodings (such as UTF-8).
>   but now levenshtein supports multi-byte encoding! We should
>   mention which function supports mbchars not to confuse users.

I've updated this caution. I've also added documentation for the
levenshtein_less_equal function.

> * (Not an issue for the patch, but...)
>   Could you rewrite PG_GETARG_TEXT_P, VARDATA, and VARSIZE to
>   PG_GETARG_TEXT_PP, VARDATA_ANY, and VARSIZE_ANY_EXHDR?
>   Unpacking versions make the core a bit faster.

Fixed.

With best regards,
Alexander Korotkov.
Attachment: fuzzystrmatch-0.4.diff.gz (GNU Zip compressed data)
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers