Re: [HACKERS] Proposal: q-gram GIN and GiST indexes

2011-04-05 Thread Tom Lane
Alexander Korotkov writes: > On Tue, Apr 5, 2011 at 5:05 PM, Robert Haas wrote: >> I am probably being stupid here, but doesn't the number of links to >> rows grow proportionately to the number of n-grams? > Number of links to rows grow proportionally to total number of extracted > q-grams, but

Re: [HACKERS] Proposal: q-gram GIN and GiST indexes

2011-04-05 Thread Alexander Korotkov
On Tue, Apr 5, 2011 at 5:05 PM, Robert Haas wrote: > I am probably being stupid here, but doesn't the number of links to > rows grow proportionately to the number of n-grams? Number of links to rows grow proportionally to total number of extracted q-grams, but not proportionally to number of uni

Re: [HACKERS] Proposal: q-gram GIN and GiST indexes

2011-04-05 Thread Robert Haas
On Tue, Apr 5, 2011 at 8:41 AM, Alexander Korotkov wrote: > For example, here is distribution of q-grams count in 120 Mb of dblp paper > titles (pretty large dataset). > q   count > 2    7218 > 3  115107 > 4  589428 > 5 1648453 > 6 3336685 > Number of 5-grams if about 15x larger than number of 3-g

Re: [HACKERS] Proposal: q-gram GIN and GiST indexes

2011-04-05 Thread Alexander Korotkov
For example, here is distribution of q-grams count in 120 Mb of dblp paper titles (pretty large dataset). q count 27218 3 115107 4 589428 5 1648453 6 3336685 Number of 5-grams if about 15x larger than number of 3-grams. But most part of index space will be occupied by links to the rows(abou

Re: [HACKERS] Proposal: q-gram GIN and GiST indexes

2011-04-05 Thread Alexander Korotkov
On Tue, Apr 5, 2011 at 3:56 PM, Robert Haas wrote: > So with q=5, the index will be approximately 10x larger than with q=3. > Maybe that's OK, I'm not sure. But it is a big difference. Not whole index will be approximately 10x larger, but only entries pages number (which contains btree on gin

Re: [HACKERS] Proposal: q-gram GIN and GiST indexes

2011-04-05 Thread Robert Haas
On Tue, Apr 5, 2011 at 4:52 AM, Alexander Korotkov wrote: > On Mon, Apr 4, 2011 at 9:01 PM, Robert Haas wrote: >> >> On Mon, Apr 4, 2011 at 12:38 PM, Alexander Korotkov >> wrote: >> > relatively small when q <= 5. Accordingly, I think we should expect >> > indexes >> > to be usable with at least

Re: [HACKERS] Proposal: q-gram GIN and GiST indexes

2011-04-05 Thread Alexander Korotkov
On Mon, Apr 4, 2011 at 9:01 PM, Robert Haas wrote: > On Mon, Apr 4, 2011 at 12:38 PM, Alexander Korotkov > wrote: > > relatively small when q <= 5. Accordingly, I think we should expect > indexes > > to be usable with at least with q = 5. > > I defer to your opinion on this, since you know more

Re: [HACKERS] Proposal: q-gram GIN and GiST indexes

2011-04-04 Thread Robert Haas
On Mon, Apr 4, 2011 at 12:38 PM, Alexander Korotkov wrote: > relatively small when q <= 5. Accordingly, I think we should expect indexes > to be usable with at least with q = 5. I defer to your opinion on this, since you know more about it than I do. But I think it would still be worthwhile to w

Re: [HACKERS] Proposal: q-gram GIN and GiST indexes

2011-04-04 Thread Alexander Korotkov
On Mon, Apr 4, 2011 at 6:56 PM, Robert Haas wrote: > On Mon, Apr 4, 2011 at 7:35 AM, Alexander Korotkov > wrote: > > I would like to propose a q-gram module which would have following > > differences in comparison with pg_trgm: > > 1) Focus on acceleration of edit distance (e.g. levenshtein dist

Re: [HACKERS] Proposal: q-gram GIN and GiST indexes

2011-04-04 Thread Robert Haas
On Mon, Apr 4, 2011 at 7:35 AM, Alexander Korotkov wrote: > I would like to propose a q-gram module which would have following > differences in comparison with pg_trgm: > 1) Focus on acceleration of edit distance (e.g. levenshtein distance) > queries and LIKE/ILIKE queries > 2) Support of various

Re: [HACKERS] Proposal: q-gram GIN and GiST indexes

2011-04-04 Thread Alexander Korotkov
Here is text of my GSoC proposal. Given details probably makes essence of my proposal clear. Any comments are welcome. *Name of project* Q-gram indexing module * * *Synopsis* Currently PostgreSQL has support for trigram-based string collection indexing in pg_trgm module. Indexes in pg_trgm was o

Re: [HACKERS] Proposal: q-gram GIN and GiST indexes

2011-03-28 Thread Alexander Korotkov
On Mon, Mar 28, 2011 at 7:27 PM, Tom Lane wrote: > Robert Haas writes: > > I'm afraid I don't know this code well enough to give you any > > meaningful feedback, but I hope someone will. > > Really Oleg and Teodor are the only people well-qualified to comment on > such stuff. (It sounds reasona

Re: [HACKERS] Proposal: q-gram GIN and GiST indexes

2011-03-28 Thread Tom Lane
Robert Haas writes: > On Fri, Mar 25, 2011 at 1:32 PM, Alexander Korotkov > wrote: >> I would like to ask you about currency of the work above. I propose to >> develop functionality of GIN and GiST q-gram indexes with following >> features: > I'm afraid I don't know this code well enough to give

Re: [HACKERS] Proposal: q-gram GIN and GiST indexes

2011-03-28 Thread Robert Haas
On Fri, Mar 25, 2011 at 1:32 PM, Alexander Korotkov wrote: > I would like to ask you about currency of the work above. I propose to > develop functionality of GIN and GiST q-gram indexes with following > features: > 1) Handle edit distance (e.g. levenshtein distance) and LIKE/ILIKE > queries(using

Re: [HACKERS] Proposal: q-gram GIN and GiST indexes

2011-03-25 Thread Alexander Korotkov
On Fri, Mar 25, 2011 at 8:32 PM, Alexander Korotkov wrote: > I would like to ask you about currency of the work above. This seems to be a mess of words. Sorry for my bad english. Actually, I meant that I need a appraisal of my proposal. With best regards, Alexander Korotkov. -- Sent via pg

[HACKERS] Proposal: q-gram GIN and GiST indexes

2011-03-25 Thread Alexander Korotkov
Hackers, I would like to ask you about currency of the work above. I propose to develop functionality of GIN and GiST q-gram indexes with following features: 1) Handle edit distance (e.g. levenshtein distance) and LIKE/ILIKE queries(using GIN partial match if no full q-grams can be extracted from