Re: [HACKERS] multibyte charater set in levenshtein function

2010-09-01 Thread Robert Haas
2010/8/28 Alexander Korotkov:
> Now test for levenshtein_less_equal performance.
Nice results. I'll try to find time to look at this.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)

Re: [HACKERS] multibyte charater set in levenshtein function

2010-08-28 Thread Alexander Korotkov
SELECT SUM(levenshtein(a, 'foo')) from words;
SELECT SUM(levenshtein(a, 'Urbański')) FROM words;
SELECT SUM(levenshtein(a, 'ańs')) FROM words;
SELECT SUM(levenshtein(a, 'foo')) from words2;
SELECT SUM(levenshtein(a, 'дом')) FROM words2;
SELECT SUM(levenshtein(a, 'компьютер')) FROM words2;
Before t

Re: [HACKERS] multibyte charater set in levenshtein function

2010-08-28 Thread Alexander Korotkov
Here is the patch which adds levenshtein_less_equal function. I'm going to add it to current commitfest. With best regards, Alexander Korotkov. On Tue, Aug 3, 2010 at 3:23 AM, Robert Haas wrote: > On Mon, Aug 2, 2010 at 5:07 PM, Alexander Korotkov > wrote: > > Now I think patch is as goo

Re: [HACKERS] multibyte charater set in levenshtein function

2010-08-28 Thread Robert Haas
On Aug 28, 2010, at 8:34 AM, Alexander Korotkov wrote: > Here is the patch which adds levenshtein_less_equal function. I'm going to > add it to current commitfest. Cool. Please submit some performance results comparing levenshtein in HEAD vs. levenshtein with this patch vs. levenshtein_less_equ

Re: [HACKERS] multibyte charater set in levenshtein function

2010-08-04 Thread Alexander Korotkov
Now I think patch is as good as can be. :) I'm going to prepare less-or-equal function in same manner as this patch. With best regards, Alexander Korotkov.

Re: [HACKERS] multibyte charater set in levenshtein function

2010-08-04 Thread Alexander Korotkov
On Mon, Aug 2, 2010 at 5:20 AM, Robert Haas wrote: > I reviewed this code in a fair amount of detail today and ended up > rewriting it. In general terms, it's best to avoid changing things > that are not relevant to the central purpose of the patch. This patch > randomly adds a whole bunch of w

Re: [HACKERS] multibyte charater set in levenshtein function

2010-08-02 Thread Robert Haas
On Mon, Aug 2, 2010 at 5:07 PM, Alexander Korotkov wrote: > Now I think patch is as good as can be. :) OK, committed. > I'm going to prepare less-or-equal function in same manner as this patch. Sounds good. Since we're now more than half-way through this CommitFest and this patch has undergone

Re: [HACKERS] multibyte charater set in levenshtein function

2010-08-02 Thread Robert Haas
2010/8/2 Alexander Korotkov: > The dump of the table with russian dictionary is in attachment. > > I use following tests: > SELECT SUM(levenshtein(a, 'foo')) from words; > SELECT SUM(levenshtein(a, 'Urbański')) FROM words; > SELECT SUM(levenshtein(a, 'ańs')) FROM words; > SELECT SUM(levenshtein(a,

Re: [HACKERS] multibyte charater set in levenshtein function

2010-08-02 Thread Robert Haas
2010/8/2 Alexander Korotkov: > On Mon, Aug 2, 2010 at 5:20 AM, Robert Haas wrote: >> I reviewed this code in a fair amount of detail today and ended up >> rewriting it. In general terms, it's best to avoid changing things >> that are not relevant to the central purpose of the patch. This patch

Re: [HACKERS] multibyte charater set in levenshtein function

2010-08-01 Thread Robert Haas
On Fri, Jul 30, 2010 at 1:14 PM, Alexander Korotkov wrote: > Ok, here is the patch for multi-byte characters. > I changed arguments of levenshtein_internal function from text * to const > char * and int. I think that it makes levenshtein_internal more reusable. > For example, this function can be

Re: [HACKERS] multibyte charater set in levenshtein function

2010-07-29 Thread Robert Haas
On Wed, Jul 21, 2010 at 5:59 PM, Robert Haas wrote: > On Wed, Jul 21, 2010 at 2:47 PM, Alexander Korotkov > wrote: >> On Wed, Jul 21, 2010 at 10:25 PM, Robert Haas wrote: >>> >>> *scratches head*  Aren't you just moving the same call to a different >>> place? >> >> So, where you can find this di

Re: [HACKERS] multibyte charater set in levenshtein function

2010-07-29 Thread Alexander Korotkov
I forgot attribution in levenshtein.c file. With best regards, Alexander Korotkov. fuzzystrmatch-0.5.1.tar.gz Description: GNU Zip compressed data

Re: [HACKERS] multibyte charater set in levenshtein function

2010-07-27 Thread Alexander Korotkov
Here is a new version of my patch, with the following changes:
1) I've merged the single-byte and multibyte versions of levenshtein_internal and levenshtein_less_equal_internal using macros and includes.
2) I found that levenshtein takes reasonable time even for long strings. There is an example with st

Re: [HACKERS] multibyte charater set in levenshtein function

2010-07-23 Thread Alvaro Herrera
Excerpts from Alexander Korotkov's message of Thu Jul 22 03:21:57 -0400 2010: > On Thu, Jul 22, 2010 at 1:59 AM, Robert Haas wrote: > > > Ah, I see. That's pretty compelling, I guess. Although it still > > seems like a lot of code... > > > I think there is a way to merge single-byte and multi-b

Re: [HACKERS] multibyte charater set in levenshtein function

2010-07-22 Thread Alexander Korotkov
On Thu, Jul 22, 2010 at 1:59 AM, Robert Haas wrote: > Ah, I see. That's pretty compelling, I guess. Although it still > seems like a lot of code... > I think there is a way to merge single-byte and multi-byte versions of functions without loss in performance using macros and includes (like in '

Re: [HACKERS] multibyte charater set in levenshtein function

2010-07-22 Thread Alexander Korotkov
Such a version with macros and includes can look like this:
#ifdef MULTIBYTE
#define NEXT_X (x+= char_lens[i-1])
#define NEXT_Y (y+= y_char_len)
#define CMP (char_cmp(x, char_lens[i-1], y, y_char_len))
#else
#define NEXT_X (x++)
#define NEXT_Y (y++)
#define CMP (*x == *y)
#endif
static int levensht
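The snippet above is cut off, so the following is only a rough single-file sketch of the same idea: one function body, instantiated once with single-byte stepping and once with multibyte stepping. It is not the code from the patch. Instead of including a file twice, it uses a function-defining macro, and it applies the trick to a much simpler routine (counting mismatching positions) rather than to Levenshtein itself. The names utf8_len, ONE_BYTE, DEFINE_MISMATCH_COUNT, mismatch_sb and mismatch_mb are all hypothetical, and valid UTF-8 input is assumed.

```c
#include <string.h>

/* Byte length of the UTF-8 character starting at s (assumes valid UTF-8). */
static int utf8_len(const char *s)
{
    unsigned char c = (unsigned char) *s;

    if (c < 0x80) return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Single-byte "character length": always one byte. */
#define ONE_BYTE(p) 1

/*
 * One body, two instantiations. NAME is the generated function name and
 * CHAR_LEN yields the byte length of the character at a pointer. The
 * generated function walks both strings in lockstep and counts positions
 * whose characters differ; it stops at the end of the shorter string.
 */
#define DEFINE_MISMATCH_COUNT(NAME, CHAR_LEN)                  \
int NAME(const char *x, const char *y)                         \
{                                                              \
    int diff = 0;                                              \
    while (*x != '\0' && *y != '\0')                           \
    {                                                          \
        int xl = CHAR_LEN(x);                                  \
        int yl = CHAR_LEN(y);                                  \
        if (xl != yl || memcmp(x, y, (size_t) xl) != 0)        \
            diff++;                                            \
        x += xl;                                               \
        y += yl;                                               \
    }                                                          \
    return diff;                                               \
}

DEFINE_MISMATCH_COUNT(mismatch_sb, ONE_BYTE)   /* byte-by-byte variant */
DEFINE_MISMATCH_COUNT(mismatch_mb, utf8_len)   /* character-by-character variant */
```

Because the stepping rule is substituted at compile time, the single-byte variant carries no per-character length lookup, which is the property the macro approach in this thread is after.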

Re: [HACKERS] multibyte charater set in levenshtein function

2010-07-22 Thread Robert Haas
On Thu, Jul 22, 2010 at 3:21 AM, Alexander Korotkov wrote: > On Thu, Jul 22, 2010 at 1:59 AM, Robert Haas wrote: >> >> Ah, I see.  That's pretty compelling, I guess.  Although it still >> seems like a lot of code... > > I think there is a way to merge single-byte and multi-byte versions of > func

Re: [HACKERS] multibyte charater set in levenshtein function

2010-07-21 Thread Robert Haas
On Wed, Jul 21, 2010 at 2:47 PM, Alexander Korotkov wrote: > On Wed, Jul 21, 2010 at 10:25 PM, Robert Haas wrote: >> >> *scratches head*  Aren't you just moving the same call to a different >> place? > > So, where you can find this different place? :) In this patch > null-terminated strings are n

Re: [HACKERS] multibyte charater set in levenshtein function

2010-07-21 Thread Alvaro Herrera
Excerpts from Robert Haas's message of Wed Jul 21 14:25:47 -0400 2010: > On Wed, Jul 21, 2010 at 7:40 AM, Alexander Korotkov > wrote: > > On Wed, Jul 21, 2010 at 5:54 AM, Robert Haas wrote: > > Same benefit can be achieved by replacing char * with > > char * and length. > > I changed !m to m == 0

Re: [HACKERS] multibyte charater set in levenshtein function

2010-07-21 Thread Alexander Korotkov
On Wed, Jul 21, 2010 at 10:25 PM, Robert Haas wrote: > *scratches head* Aren't you just moving the same call to a different > place? > So, where you can find this different place? :) In this patch null-terminated strings are not used at all. > Yeah, we usually try to avoid changing that sort o

Re: [HACKERS] multibyte charater set in levenshtein function

2010-07-21 Thread Robert Haas
On Wed, Jul 21, 2010 at 7:40 AM, Alexander Korotkov wrote: > On Wed, Jul 21, 2010 at 5:54 AM, Robert Haas wrote: >> This patch still needs some work.  It includes a bunch of stylistic >> changes that aren't relevant to the purpose of the patch.  There's no >> reason that I can see to change the e

Re: [HACKERS] multibyte charater set in levenshtein function

2010-07-21 Thread Alexander Korotkov
On Wed, Jul 21, 2010 at 5:54 AM, Robert Haas wrote: > This patch still needs some work. It includes a bunch of stylistic > changes that aren't relevant to the purpose of the patch. There's no > reason that I can see to change the existing levenshtein_internal > function to take text arguments i

Re: [HACKERS] multibyte charater set in levenshtein function

2010-07-20 Thread Robert Haas
On Tue, Jul 20, 2010 at 3:37 AM, Itagaki Takahiro wrote: > 2010/7/13 Alexander Korotkov: >> Anyway I think that overhead is not ignorable. That's why I have split >> levenshtein_internal into levenshtein_internal and levenshtein_internal_mb, >> and levenshtein_less_equal_internal into levenshte

Re: [HACKERS] multibyte charater set in levenshtein function

2010-07-20 Thread Itagaki Takahiro
2010/7/13 Alexander Korotkov: > Anyway I think that overhead is not ignorable. That's why I have split > levenshtein_internal into levenshtein_internal and levenshtein_internal_mb, > and levenshtein_less_equal_internal into levenshtein_less_equal_internal and > levenshtein_less_equal_internal_mb

Re: [HACKERS] multibyte charater set in levenshtein function

2010-07-13 Thread Alexander Korotkov
Hi!
> * levenshtein_internal() and levenshtein_less_equal_internal() are very
> similar. Can you merge the code? We can always use less_equal_internal()
> if the overhead is ignorable. Did you compare them?
With a big value of max_d the overhead is significant. Here is an example on american-english dict

Re: [HACKERS] multibyte charater set in levenshtein function

2010-07-11 Thread Itagaki Takahiro
Hi, I'm reviewing "Multibyte charater set in levenshtein function" patch. https://commitfest.postgresql.org/action/patch_view?id=304 The main logic seems to be good, but I have some comments about the coding style and refactoring. * levenshtein_internal() and levenshtein_less_equal_internal() are

Re: [HACKERS] multibyte charater set in levenshtein function

2010-06-07 Thread Alexander Korotkov
Hello Hackers! I have extended my patch by introducing a levenshtein_less_equal function. This function has an additional argument max_d and stops calculating when the distance exceeds max_d. With low values of max_d the function works much faster than the original one. The example of original levenshtein functi
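To illustrate why a bound such as max_d allows the calculation to stop early, here is a minimal byte-wise sketch. It is not the algorithm from the patch, and the function name lev_less_equal_sketch is hypothetical. It relies on two facts: the distance is never smaller than the difference in lengths, and the smallest value in a row of the dynamic-programming matrix never decreases from one row to the next. Once every cell in a row exceeds max_d, the final answer must too. The sketch assumes max_d >= 0, omits allocation error handling, and returns max_d + 1 as a stand-in for "some value greater than max_d".

```c
#include <stdlib.h>
#include <string.h>

int lev_less_equal_sketch(const char *a, const char *b, int max_d)
{
    int m = (int) strlen(a);
    int n = (int) strlen(b);
    int *prev, *curr;
    int i, j, result;

    /* The distance is at least the length difference. */
    if (abs(m - n) > max_d)
        return max_d + 1;

    prev = malloc((size_t) (n + 1) * sizeof(int));
    curr = malloc((size_t) (n + 1) * sizeof(int));

    for (j = 0; j <= n; j++)
        prev[j] = j;

    for (i = 1; i <= m; i++)
    {
        int row_min = i;
        int *tmp;

        curr[0] = i;
        for (j = 1; j <= n; j++)
        {
            /* substitution (or match), deletion, insertion */
            int best = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);

            if (prev[j] + 1 < best) best = prev[j] + 1;
            if (curr[j - 1] + 1 < best) best = curr[j - 1] + 1;
            curr[j] = best;
            if (best < row_min) row_min = best;
        }

        /* Every cell in this row is already above the bound: give up. */
        if (row_min > max_d)
        {
            free(prev);
            free(curr);
            return max_d + 1;
        }

        tmp = prev; prev = curr; curr = tmp;
    }

    result = prev[n];
    free(prev);
    free(curr);
    return result > max_d ? max_d + 1 : result;
}
```

With a small max_d most dictionary words are rejected after the length check or the first few rows, which is consistent with the speedup described in the message above.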

Re: [HACKERS] multibyte charater set in levenshtein function

2010-05-13 Thread Alexander Korotkov
On Wed, May 12, 2010 at 11:04 PM, Alvaro Herrera wrote: > On a quick look, I didn't like the way you separated the > "pg_database_encoding_max_length() > 1" cases. There seem to be too > much common code. Can that be refactored a bit better? > I did a little refactoring in order to avoid some si

Re: [HACKERS] multibyte charater set in levenshtein function

2010-05-13 Thread Alexander Korotkov
On Thu, May 13, 2010 at 6:03 AM, Alvaro Herrera wrote: > Well, since it's only used in one place, why are you defining a macro at > all? > In order to structure the code better. My question was about something else: is the memcmp function a good choice for comparing very short sequences of bytes (from 1 to 4 bytes)
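For reference, the two options being weighed in this question might look roughly like the sketch below. Both helper names are hypothetical and neither is taken from the patch. Which one is faster for 1 to 4 byte sequences depends on the compiler and platform, so this only shows that the two are interchangeable in behaviour; it makes no performance claim.

```c
#include <stdbool.h>
#include <string.h>

/* Compare two multibyte characters with a hand-written byte loop. */
static bool char_same_loop(const char *a, int alen, const char *b, int blen)
{
    if (alen != blen)
        return false;
    while (alen-- > 0)
    {
        if (*a++ != *b++)
            return false;
    }
    return true;
}

/* Same comparison, delegating the byte check to memcmp. */
static bool char_same_memcmp(const char *a, int alen, const char *b, int blen)
{
    return alen == blen && memcmp(a, b, (size_t) alen) == 0;
}
```

Checking the lengths first means characters of different byte widths are rejected without reading any bytes at all.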

Re: [HACKERS] multibyte charater set in levenshtein function

2010-05-12 Thread Alvaro Herrera
Alexander Korotkov wrote: > On Wed, May 12, 2010 at 11:04 PM, Alvaro Herrera > wrote: > > > On a quick look, I didn't like the way you separated the > > "pg_database_encoding_max_length() > 1" cases. There seem to be too > > much common code. Can that be refactored a bit better? > > > I did

Re: [HACKERS] multibyte charater set in levenshtein function

2010-05-12 Thread Alvaro Herrera
Excerpts from Alexander Korotkov's message of Mon May 10 11:35:02 -0400 2010: > Hackers, > > The current version of levenshtein function in fuzzystrmatch contrib module > doesn't work properly with multibyte character sets. > My patch makes this function work properly with multibyte character sets

[HACKERS] multibyte charater set in levenshtein function

2010-05-12 Thread Alexander Korotkov
Hackers, The current version of levenshtein function in fuzzystrmatch contrib module doesn't work properly with multibyte character sets.
test=# select levenshtein('фыва','аыва');
 levenshtein
-------------
           2
(1 row)
My patch makes this function work properly with multibyte character s
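The result of 2 in the example above is what a byte-wise comparison gives in UTF-8: 'ф' and 'а' are each two bytes long and differ in both bytes, although the strings differ in only one character. The sketch below reproduces both views outside PostgreSQL. It is an illustration only, not the fuzzystrmatch code: all function names are hypothetical, the input is assumed to be valid UTF-8, and allocation error handling is omitted.

```c
#include <stdlib.h>
#include <string.h>

/* Byte length of the UTF-8 character starting at s (assumes valid UTF-8). */
static int utf8_len(const char *s)
{
    unsigned char c = (unsigned char) *s;

    if (c < 0x80) return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Fill off[] with unit start offsets; returns the number of units. */
static int split_units(const char *s, int *off, int by_char)
{
    int len = (int) strlen(s);
    int n = 0, pos = 0;

    while (pos < len)
    {
        off[n++] = pos;
        pos += by_char ? utf8_len(s + pos) : 1;
    }
    off[n] = len;               /* end offset of the last unit */
    return n;
}

/* Levenshtein distance where unit k of a string spans off[k]..off[k+1]-1. */
static int lev_units(const char *a, const int *ao, int m,
                     const char *b, const int *bo, int n)
{
    int *prev = malloc((size_t) (n + 1) * sizeof(int));
    int *curr = malloc((size_t) (n + 1) * sizeof(int));
    int i, j, result;

    for (j = 0; j <= n; j++)
        prev[j] = j;
    for (i = 1; i <= m; i++)
    {
        int alen = ao[i] - ao[i - 1];
        int *tmp;

        curr[0] = i;
        for (j = 1; j <= n; j++)
        {
            int blen = bo[j] - bo[j - 1];
            int same = alen == blen &&
                memcmp(a + ao[i - 1], b + bo[j - 1], (size_t) alen) == 0;
            int best = prev[j - 1] + (same ? 0 : 1);

            if (prev[j] + 1 < best) best = prev[j] + 1;
            if (curr[j - 1] + 1 < best) best = curr[j - 1] + 1;
            curr[j] = best;
        }
        tmp = prev; prev = curr; curr = tmp;
    }
    result = prev[n];
    free(prev);
    free(curr);
    return result;
}

static int lev_generic(const char *a, const char *b, int by_char)
{
    int *ao = malloc((strlen(a) + 1) * sizeof(int));
    int *bo = malloc((strlen(b) + 1) * sizeof(int));
    int m = split_units(a, ao, by_char);
    int n = split_units(b, bo, by_char);
    int d = lev_units(a, ao, m, b, bo, n);

    free(ao);
    free(bo);
    return d;
}

/* Byte-wise distance: what a multibyte-unaware implementation computes. */
int lev_bytes(const char *a, const char *b) { return lev_generic(a, b, 0); }

/* Character-wise distance: each UTF-8 character counts as one unit. */
int lev_chars(const char *a, const char *b) { return lev_generic(a, b, 1); }
```

For the UTF-8 strings 'фыва' and 'аыва', lev_bytes returns 2 and lev_chars returns 1; for pure ASCII input the two agree.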