Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very

2006-01-26 Thread Ron
At 11:13 PM 1/26/2006, Craig A. James wrote: Ron, I'll write to you privately, because these discussions can get messy "in public". I'm responding to this missive publicly in an attempt to help the discussion along. It is not my usual practice to respond to private messages publicly, but t

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very

2006-01-26 Thread Ron
You seem to have missed my point. I just gave a very clear description of how to "decide which bitmaps go in each of the two buckets" by reformulating the question into "decide which bitmaps go in each of =four= buckets". The intent is to have two indexes, one optimized for one common class

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very

2006-01-26 Thread Craig A. James
Ron <[EMAIL PROTECTED]> writes: We have two problems here. The first is that the page splitting code for these indexes currently has O(N^2) performance. The second is that whatever solution we do use for this functionality, we still need good performance during searches that use the index. No

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very

2006-01-26 Thread Ron
At 08:00 PM 1/26/2006, Alvaro Herrera wrote: Ron wrote: > At 01:27 PM 1/21/2006, Tom Lane wrote: > >Ron <[EMAIL PROTECTED]> writes: > >> At 07:23 PM 1/20/2006, Tom Lane wrote: > >>> Well, we're trying to split an index page that's gotten full into > >>> two index pages, preferably with approximat

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very

2006-01-26 Thread Alvaro Herrera
Ron wrote: > At 01:27 PM 1/21/2006, Tom Lane wrote: > >Ron <[EMAIL PROTECTED]> writes: > >> At 07:23 PM 1/20/2006, Tom Lane wrote: > >>> Well, we're trying to split an index page that's gotten full into > >>> two index pages, preferably with approximately equal numbers of items in > >>> each new pa

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very

2006-01-26 Thread Ron
At 01:27 PM 1/21/2006, Tom Lane wrote: Ron <[EMAIL PROTECTED]> writes: > At 07:23 PM 1/20/2006, Tom Lane wrote: >> Well, we're trying to split an index page that's gotten full into >> two index pages, preferably with approximately equal numbers of items in >> each new page (this isn't a hard requ

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very

2006-01-21 Thread Tom Lane
David Lang <[EMAIL PROTECTED]> writes: > On Sat, 21 Jan 2006, Tom Lane wrote: >> Ron <[EMAIL PROTECTED]> writes: >>> Maybe we are over thinking this. What happens if we do the obvious >>> and just make a new page and move the "last" n/2 items on the full >>> page to the new page? >> >> Search per

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very

2006-01-21 Thread David Lang
On Sat, 21 Jan 2006, Tom Lane wrote: Ron <[EMAIL PROTECTED]> writes: At 07:23 PM 1/20/2006, Tom Lane wrote: Well, we're trying to split an index page that's gotten full into two index pages, preferably with approximately equal numbers of items in each new page (this isn't a hard requirement th

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-21 Thread Martijn van Oosterhout
On Sat, Jan 21, 2006 at 06:22:52PM +0300, Oleg Bartunov wrote: > >I see how it works, what I don't quite get is whether the "inverted > >index" you refer to is what we're working with here, or just what's in > >tsearchd? > > just tsearchd. We plan to implement inverted index into PostgreSQL core >

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very

2006-01-21 Thread Tom Lane
Ron <[EMAIL PROTECTED]> writes: > At 07:23 PM 1/20/2006, Tom Lane wrote: >> Well, we're trying to split an index page that's gotten full into >> two index pages, preferably with approximately equal numbers of items in >> each new page (this isn't a hard requirement though). > Maybe we are over th

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-21 Thread Oleg Bartunov
gevel is available from http://www.sai.msu.su/~megera/postgres/gist/ Oleg On Sat, 21 Jan 2006, Martijn van Oosterhout wrote: On Sat, Jan 21, 2006 at 04:29:13PM +0300, Oleg Bartunov wrote: Martijn, you're right! We want not only to split page to very different parts, but not to increas

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-21 Thread Oleg Bartunov
On Sat, 21 Jan 2006, Martijn van Oosterhout wrote: On Sat, Jan 21, 2006 at 04:29:13PM +0300, Oleg Bartunov wrote: Martijn, you're right! We want not only to split page to very different parts, but not to increase the number of sets bits in resulted signatures, which are union (OR'ed) of all sig

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very

2006-01-21 Thread Ron
Perhaps a different approach to this problem is called for: _Managing Gigabytes: Compressing and Indexing Documents and Images_ 2ed Witten, Moffat, Bell ISBN 1-55860-570-3 This is a VERY good book on the subject. I'd also suggest looking at the publicly available work on indexing and searchin

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very

2006-01-21 Thread Oleg Bartunov
On Sat, 21 Jan 2006, Ron wrote: Perhaps a different approach to this problem is called for: _Managing Gigabytes: Compressing and Indexing Documents and Images_ 2ed Witten, Moffat, Bell ISBN 1-55860-570-3 This is a VERY good book on the subject. I'd also suggest looking at the publicly availa

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-21 Thread Martijn van Oosterhout
On Sat, Jan 21, 2006 at 04:29:13PM +0300, Oleg Bartunov wrote: > Martijn, you're right! We want not only to split page to very > different parts, but not to increase the number of sets bits in > resulted signatures, which are union (OR'ed) of all signatures > in part. We need not only fast index c

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very

2006-01-21 Thread Oleg Bartunov
On Sat, 21 Jan 2006, Ron wrote: At 07:23 PM 1/20/2006, Tom Lane wrote: "Steinar H. Gunderson" <[EMAIL PROTECTED]> writes: > On Fri, Jan 20, 2006 at 06:52:37PM -0500, Tom Lane wrote: >> It's also worth considering that the entire approach is a heuristic, >> really --- getting the furthest-apart

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-21 Thread Oleg Bartunov
On Sat, 21 Jan 2006, Martijn van Oosterhout wrote: However, IMHO, this algorithm is optimising the wrong thing. It shouldn't be trying to split into sets that are far apart, it should be trying to split into sets that minimize the number of set bits (ie distance from zero), since that's what's w

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-21 Thread Martijn van Oosterhout
On Fri, Jan 20, 2006 at 05:50:36PM -0500, Tom Lane wrote: > Yeah, but fetching from a small constant table is pretty quick too; > I doubt it's worth getting involved in machine-specific assembly code > for this. I'm much more interested in the idea of improving the > furthest-distance algorithm in

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very

2006-01-21 Thread Ron
At 07:23 PM 1/20/2006, Tom Lane wrote: "Steinar H. Gunderson" <[EMAIL PROTECTED]> writes: > On Fri, Jan 20, 2006 at 06:52:37PM -0500, Tom Lane wrote: >> It's also worth considering that the entire approach is a heuristic, >> really --- getting the furthest-apart pair of seeds doesn't guarantee >>

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Steinar H. Gunderson
On Fri, Jan 20, 2006 at 05:46:34PM -0500, Ron wrote: > For an even more extreme speedup, don't most modern CPUs have an asm > instruction that counts the bits (un)set (AKA "population counting") > in various size entities (4b, 8b, 16b, 32b, 64b, and 128b for 64b > CPUs with SWAR instructions)?

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Craig A. James
Tom Lane wrote: Well, we're trying to split an index page that's gotten full into two index pages, preferably with approximately equal numbers of items in each new page (this isn't a hard requirement though). ... If that's correct, what you really want is to divide the values so that the unions

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Steinar H. Gunderson
On Fri, Jan 20, 2006 at 07:23:10PM -0500, Tom Lane wrote: > I'm not very clear on what tsearch2 is doing with these bitmaps, but it > looks like an upper page's downlink has the union (bitwise OR) of the > one-bits in the values on the lower page, and you have to visit the lower > page if this unio

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Tom Lane
"Steinar H. Gunderson" <[EMAIL PROTECTED]> writes: > On Fri, Jan 20, 2006 at 06:52:37PM -0500, Tom Lane wrote: >> It's also worth considering that the entire approach is a heuristic, >> really --- getting the furthest-apart pair of seeds doesn't guarantee >> an optimal split as far as I can see. M

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Steinar H. Gunderson
On Fri, Jan 20, 2006 at 06:52:37PM -0500, Tom Lane wrote: > It's also worth considering that the entire approach is a heuristic, > really --- getting the furthest-apart pair of seeds doesn't guarantee > an optimal split as far as I can see. Maybe there's some totally > different way to do it. For

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Tom Lane
"Steinar H. Gunderson" <[EMAIL PROTECTED]> writes: > For the record: Could we do with a less-than-optimal split here? Yeah, I was wondering the same. The code is basically choosing two "seed" values to drive the index-page split. Intuitively it seems that "pretty far apart" would be nearly as go

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Steinar H. Gunderson
On Fri, Jan 20, 2006 at 04:50:17PM -0500, Tom Lane wrote: > I wonder if there is a way to improve on that. http://www.cs.uwaterloo.ca/~tmchan/slide_isaac.ps: The diameter problem has been studied extensively in the traditional model. Although O(n log n) algorithms have been given for d = 2 an

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Steinar H. Gunderson
On Fri, Jan 20, 2006 at 05:50:36PM -0500, Tom Lane wrote: > Yeah, but fetching from a small constant table is pretty quick too; > I doubt it's worth getting involved in machine-specific assembly code > for this. I'm much more interested in the idea of improving the > furthest-distance algorithm in

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Steinar H. Gunderson
On Fri, Jan 20, 2006 at 05:29:46PM -0500, Ron wrote: > If the N-dimensional space is Euclidean (any is the same > distance apart in dimension x), then finding the farthest pair can be > done in at least O(n). That sounds a bit optimistic. http://portal.acm.org/ft_gateway.cfm?id=167217&type=

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Martijn van Oosterhout
On Fri, Jan 20, 2006 at 05:46:34PM -0500, Ron wrote: > At 04:37 PM 1/20/2006, Martijn van Oosterhout wrote: > >Given that all it's doing is counting bits, a simple fix would be to > >loop over bytes, use XOR and count ones. For extreme speedup create a > >lookup table with 256 entries to give you t

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Tom Lane
Ron <[EMAIL PROTECTED]> writes: > For an even more extreme speedup, don't most modern CPUs have an asm > instruction that counts the bits (un)set (AKA "population counting") > in various size entities (4b, 8b, 16b, 32b, 64b, and 128b for 64b > CPUs with SWAR instructions)? Yeah, but fetching fr

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Ron
At 04:37 PM 1/20/2006, Martijn van Oosterhout wrote: On Fri, Jan 20, 2006 at 04:19:15PM -0500, Tom Lane wrote: > % cumulative self self total > time seconds secondscalls Ks/call Ks/call name > 98.96 1495.93 1495.93 33035195 0.00 0.00 hemdistsign

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Martijn van Oosterhout
On Fri, Jan 20, 2006 at 04:50:17PM -0500, Tom Lane wrote: > (hemdistcache calls hemdistsign --- I think gprof is doing something > funny with tail-calls here, and showing hemdistsign as directly called > from gtsvector_picksplit when control really arrives through hemdistcache.) It may be the comp

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Ron
At 05:16 PM 1/20/2006, Steinar H. Gunderson wrote: On Fri, Jan 20, 2006 at 04:50:17PM -0500, Tom Lane wrote: > I wonder if there is a way to improve on that. Ooh, the farthest pair problem (in an N-dimensional vector space, though). I'm pretty sure problems like this has been studied quite exten

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Steinar H. Gunderson
On Fri, Jan 20, 2006 at 04:50:17PM -0500, Tom Lane wrote: > I wonder if there is a way to improve on that. Ooh, the farthest pair problem (in an N-dimensional vector space, though). I'm pretty sure problems like this has been studied quite extensively in the literature, although perhaps not with t

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Tom Lane
Martijn van Oosterhout writes: > Given that all it's doing is counting bits, a simple fix would be to > loop over bytes, use XOR and count ones. For extreme speedup create a > lookup table with 256 entries to give you the answer straight away... Yeah, I just finished doing that and got about a 20

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Steinar H. Gunderson
On Fri, Jan 20, 2006 at 10:37:54PM +0100, Martijn van Oosterhout wrote: > Given that all it's doing is counting bits, a simple fix would be to > loop over bytes, use XOR and count ones. For extreme speedup create a > lookup table with 256 entries to give you the answer straight away... For extra o

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Martijn van Oosterhout
On Fri, Jan 20, 2006 at 04:19:15PM -0500, Tom Lane wrote: > % cumulative self self total > time seconds secondscalls Ks/call Ks/call name > 98.96 1495.93 1495.93 33035195 0.00 0.00 hemdistsign > So we gotta fix hemdistsign ... lol!

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Tom Lane
Well, I feel like a fool, because I failed to notice that the total runtime shown in that profile wasn't anywhere close to the actual wall clock time. gprof is indeed simply not counting the time spent in dynamically-linked code. With tsearch2 statically linked into the backend, a more believable

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Martijn van Oosterhout
On Fri, Jan 20, 2006 at 03:21:45PM -0500, Tom Lane wrote: > If the totals given by gprof are correct, then it's down in the noise. > I don't think I trust that too much ... but I don't see anything in the > gprof manual about how to include a dynamically loaded library in the > profile. (I did com

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Tom Lane
Martijn van Oosterhout writes: > Something I'm missing is the calls to tsearch functions. I'm not 100% > familiar with gprof, but is it possible those costs have been added > somewhere else because it's in a shared library? Perhaps the costs went > into FunctionCall1/3? I think that the tsearch f

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Martijn van Oosterhout
On Fri, Jan 20, 2006 at 02:14:29PM -0500, Tom Lane wrote: > [ thread moved to pgsql-performance ] > > I've obtained a gprof profile on Stephan's sample case (many thanks for > providing the data, Stephan). The command is Something I'm missing is the calls to tsearch functions. I'm not 100% fam

Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Tom Lane
[ thread moved to pgsql-performance ] I've obtained a gprof profile on Stephan's sample case (many thanks for providing the data, Stephan). The command is CREATE INDEX foo ON publications_test USING gist (fti_title); where fti_title is a tsvector column. There are 236984 rows in the tabl