At 11:13 PM 1/26/2006, Craig A. James wrote:
Ron,
I'll write to you privately, because these discussions can get messy
"in public".
I'm responding to this missive publicly in an attempt to help the
discussion along. It is not my usual practice to respond to private
messages publicly, but t
You seem to have missed my point. I just gave a very clear
description of how to "decide which bitmaps go in each of the two
buckets" by reformulating the question into "decide which bitmaps go
in each of =four= buckets".
The intent is to have two indexes, one optimized for one common class
Ron <[EMAIL PROTECTED]> writes:
We have two problems here.
The first is that the page splitting code for these indexes currently
has O(N^2) performance.
The second is that whatever solution we do use for this functionality,
we still need good performance during searches that use the index.
No
At 08:00 PM 1/26/2006, Alvaro Herrera wrote:
Ron wrote:
> At 01:27 PM 1/21/2006, Tom Lane wrote:
> >Ron <[EMAIL PROTECTED]> writes:
> >> At 07:23 PM 1/20/2006, Tom Lane wrote:
> >>> Well, we're trying to split an index page that's gotten full into
> >>> two index pages, preferably with approximat
Ron wrote:
> At 01:27 PM 1/21/2006, Tom Lane wrote:
> >Ron <[EMAIL PROTECTED]> writes:
> >> At 07:23 PM 1/20/2006, Tom Lane wrote:
> >>> Well, we're trying to split an index page that's gotten full into
> >>> two index pages, preferably with approximately equal numbers of items in
> >>> each new pa
At 01:27 PM 1/21/2006, Tom Lane wrote:
Ron <[EMAIL PROTECTED]> writes:
> At 07:23 PM 1/20/2006, Tom Lane wrote:
>> Well, we're trying to split an index page that's gotten full into
>> two index pages, preferably with approximately equal numbers of items in
>> each new page (this isn't a hard requ
David Lang <[EMAIL PROTECTED]> writes:
> On Sat, 21 Jan 2006, Tom Lane wrote:
>> Ron <[EMAIL PROTECTED]> writes:
>>> Maybe we are over thinking this. What happens if we do the obvious
>>> and just make a new page and move the "last" n/2 items on the full
>>> page to the new page?
>>
>> Search per
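For illustration, a minimal sketch of the "obvious" split Ron suggests: keep the first n/2 entries on the old page and move the rest to the new one. The Item type, SIGLEN, and the function signature are assumptions made for the example; the real GiST picksplit interface looks different.

#include <stdint.h>

#define SIGLEN 252                      /* bytes per signature; illustrative only */

typedef struct { uint8_t sig[SIGLEN]; } Item;   /* stand-in for an index entry */

/* The "obvious" split: first n/2 entries stay on the old page,
 * the rest move to the new one.  O(n), no distance computations at all. */
static void
naive_split(const Item *items, int nitems,
            Item *left,  int *nleft,
            Item *right, int *nright)
{
    int half = nitems / 2;

    *nleft = 0;
    *nright = 0;
    for (int i = 0; i < nitems; i++)
    {
        if (i < half)
            left[(*nleft)++] = items[i];
        else
            right[(*nright)++] = items[i];
    }
}

The open question raised elsewhere in the thread is what such an arbitrary split does to search performance, since it pays no attention to the resulting union signatures.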
On Sat, 21 Jan 2006, Tom Lane wrote:
Ron <[EMAIL PROTECTED]> writes:
At 07:23 PM 1/20/2006, Tom Lane wrote:
Well, we're trying to split an index page that's gotten full into
two index pages, preferably with approximately equal numbers of items in
each new page (this isn't a hard requirement th
On Sat, Jan 21, 2006 at 06:22:52PM +0300, Oleg Bartunov wrote:
> >I see how it works, what I don't quite get is whether the "inverted
> >index" you refer to is what we're working with here, or just what's in
> >tsearchd?
>
> just tsearchd. We plan to implement an inverted index in the PostgreSQL core
>
Ron <[EMAIL PROTECTED]> writes:
> At 07:23 PM 1/20/2006, Tom Lane wrote:
>> Well, we're trying to split an index page that's gotten full into
>> two index pages, preferably with approximately equal numbers of items in
>> each new page (this isn't a hard requirement though).
> Maybe we are over th
gevel is available from
http://www.sai.msu.su/~megera/postgres/gist/
Oleg
On Sat, 21 Jan 2006, Martijn van Oosterhout wrote:
On Sat, Jan 21, 2006 at 04:29:13PM +0300, Oleg Bartunov wrote:
Martijn, you're right! We want not only to split the page into very
different parts, but also not to increas
On Sat, 21 Jan 2006, Martijn van Oosterhout wrote:
On Sat, Jan 21, 2006 at 04:29:13PM +0300, Oleg Bartunov wrote:
Martijn, you're right! We want not only to split the page into very
different parts, but also not to increase the number of set bits in the
resulting signatures, which are the union (OR'ed) of all sig
Perhaps a different approach to this problem is called for:
_Managing Gigabytes: Compressing and Indexing Documents and Images_ 2ed
Witten, Moffat, Bell
ISBN 1-55860-570-3
This is a VERY good book on the subject.
I'd also suggest looking at the publicly available work on indexing
and searchin
On Sat, 21 Jan 2006, Ron wrote:
Perhaps a different approach to this problem is called for:
_Managing Gigabytes: Compressing and Indexing Documents and Images_ 2ed
Witten, Moffat, Bell
ISBN 1-55860-570-3
This is a VERY good book on the subject.
I'd also suggest looking at the publicly availa
On Sat, Jan 21, 2006 at 04:29:13PM +0300, Oleg Bartunov wrote:
> Martijn, you're right! We want not only to split the page into very
> different parts, but also not to increase the number of set bits in the
> resulting signatures, which are the union (OR'ed) of all signatures
> in the part. We need not only fast index c
On Sat, 21 Jan 2006, Ron wrote:
At 07:23 PM 1/20/2006, Tom Lane wrote:
"Steinar H. Gunderson" <[EMAIL PROTECTED]> writes:
> On Fri, Jan 20, 2006 at 06:52:37PM -0500, Tom Lane wrote:
>> It's also worth considering that the entire approach is a heuristic,
>> really --- getting the furthest-apart
On Sat, 21 Jan 2006, Martijn van Oosterhout wrote:
However, IMHO, this algorithm is optimising the wrong thing. It
shouldn't be trying to split into sets that are far apart; it should be
trying to split into sets that minimize the number of set bits (i.e.
distance from zero), since that's what's w
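A minimal sketch of the kind of split Martijn is arguing for, assuming signatures are plain byte arrays: assign each signature to whichever side's union would gain the fewest new bits, so both unions stay as sparse as possible. SIGLEN and all names here are illustrative; this is not the gtsvector_picksplit code.

#include <stdint.h>
#include <string.h>

#define SIGLEN 252                 /* bytes per signature; illustrative only */

/* How many bits would adding `item` set in union `u` that are not set already? */
static int
new_bits(const uint8_t *u, const uint8_t *item)
{
    int n = 0;
    for (int i = 0; i < SIGLEN; i++)
        n += __builtin_popcount((unsigned) (item[i] & ~u[i]));
    return n;
}

/* Greedy split that tries to keep both union signatures sparse:
 * each item goes to the side where it sets fewer new bits. */
static void
greedy_sparse_split(uint8_t items[][SIGLEN], int nitems,
                    uint8_t left_union[SIGLEN],
                    uint8_t right_union[SIGLEN],
                    int side[])         /* out: 0 = left, 1 = right */
{
    memset(left_union, 0, SIGLEN);
    memset(right_union, 0, SIGLEN);

    for (int i = 0; i < nitems; i++)
    {
        uint8_t *dest =
            new_bits(left_union, items[i]) <= new_bits(right_union, items[i])
                ? left_union : right_union;

        side[i] = (dest == right_union);
        for (int b = 0; b < SIGLEN; b++)
            dest[b] |= items[i][b];
    }
}

A real picksplit would also have to watch the left/right counts so the pages stay roughly balanced (the thread notes that is not a hard requirement), and this order-dependent greedy pass is only one way to chase the "minimize set bits" objective.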
On Fri, Jan 20, 2006 at 05:50:36PM -0500, Tom Lane wrote:
> Yeah, but fetching from a small constant table is pretty quick too;
> I doubt it's worth getting involved in machine-specific assembly code
> for this. I'm much more interested in the idea of improving the
> furthest-distance algorithm in
At 07:23 PM 1/20/2006, Tom Lane wrote:
"Steinar H. Gunderson" <[EMAIL PROTECTED]> writes:
> On Fri, Jan 20, 2006 at 06:52:37PM -0500, Tom Lane wrote:
>> It's also worth considering that the entire approach is a heuristic,
>> really --- getting the furthest-apart pair of seeds doesn't guarantee
>>
On Fri, Jan 20, 2006 at 05:46:34PM -0500, Ron wrote:
> For an even more extreme speedup, don't most modern CPUs have an asm
> instruction that counts the bits (un)set (AKA "population counting")
> in various size entities (4b, 8b, 16b, 32b, 64b, and 128b for 64b
> CPUs with SWAR instructions)?
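To make Ron's suggestion concrete: many compilers expose a population-count builtin that maps to a hardware instruction where the target has one and falls back to a software routine otherwise, so the idea can be tried without machine-specific assembly. A sketch, with SIGLEN and the function name as assumptions rather than the tsearch2 source:

#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define SIGLEN 252                      /* bytes per signature; illustrative only */

/* Hamming distance between two signatures using GCC/Clang builtins.
 * __builtin_popcountll counts the one-bits in a 64-bit word; the memcpy
 * avoids alignment assumptions about the signature buffers. */
static int
hemdist_popcount(const uint8_t *a, const uint8_t *b)
{
    int dist = 0;
    size_t i = 0;

    for (; i + 8 <= SIGLEN; i += 8)
    {
        uint64_t x, y;

        memcpy(&x, a + i, 8);
        memcpy(&y, b + i, 8);
        dist += __builtin_popcountll(x ^ y);
    }
    for (; i < SIGLEN; i++)             /* any leftover bytes */
        dist += __builtin_popcount((unsigned) (a[i] ^ b[i]));
    return dist;
}

Tom's reply, quoted elsewhere in the thread, doubts that this per-call micro-optimisation matters as much as improving the furthest-distance algorithm itself.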
Tom Lane wrote:
Well, we're trying to split an index page that's gotten full into two
index pages, preferably with approximately equal numbers of items in
each new page (this isn't a hard requirement though). ... If that's
correct, what you really want is to divide the values so that the unions
On Fri, Jan 20, 2006 at 07:23:10PM -0500, Tom Lane wrote:
> I'm not very clear on what tsearch2 is doing with these bitmaps, but it
> looks like an upper page's downlink has the union (bitwise OR) of the
> one-bits in the values on the lower page, and you have to visit the lower
> page if this unio
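A minimal sketch of the containment test Tom seems to be describing, assuming the downlink signature is the OR of everything on the lower page (names and layout are assumptions, not the tsearch2 source): for a simple AND-style query signature, the lower page only needs to be visited when every query bit appears in that union.

#include <stdbool.h>
#include <stdint.h>

#define SIGLEN 252                      /* bytes per signature; illustrative only */

/* The downlink is the OR of all signatures on the lower page.  If any bit of
 * the query signature is missing from it, nothing below can match and the
 * whole subtree is skipped; otherwise the page must be visited. */
static bool
must_visit_lower_page(const uint8_t *downlink_union, const uint8_t *query_sig)
{
    for (int i = 0; i < SIGLEN; i++)
        if ((query_sig[i] & downlink_union[i]) != query_sig[i])
            return false;               /* a required bit is missing */
    return true;                        /* all query bits present (possibly falsely) */
}

This is why the thread cares about keeping union signatures sparse: the more bits a split leaves set in each downlink, the more often this test says "visit" for pages that hold no matches.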
"Steinar H. Gunderson" <[EMAIL PROTECTED]> writes:
> On Fri, Jan 20, 2006 at 06:52:37PM -0500, Tom Lane wrote:
>> It's also worth considering that the entire approach is a heuristic,
>> really --- getting the furthest-apart pair of seeds doesn't guarantee
>> an optimal split as far as I can see. M
On Fri, Jan 20, 2006 at 06:52:37PM -0500, Tom Lane wrote:
> It's also worth considering that the entire approach is a heuristic,
> really --- getting the furthest-apart pair of seeds doesn't guarantee
> an optimal split as far as I can see. Maybe there's some totally
> different way to do it.
For
"Steinar H. Gunderson" <[EMAIL PROTECTED]> writes:
> For the record: Could we do with a less-than-optimal split here?
Yeah, I was wondering the same. The code is basically choosing two
"seed" values to drive the index-page split. Intuitively it seems that
"pretty far apart" would be nearly as go
On Fri, Jan 20, 2006 at 04:50:17PM -0500, Tom Lane wrote:
> I wonder if there is a way to improve on that.
http://www.cs.uwaterloo.ca/~tmchan/slide_isaac.ps:
The diameter problem has been studied extensively in the traditional model.
Although O(n log n) algorithms have been given for d = 2 an
On Fri, Jan 20, 2006 at 05:50:36PM -0500, Tom Lane wrote:
> Yeah, but fetching from a small constant table is pretty quick too;
> I doubt it's worth getting involved in machine-specific assembly code
> for this. I'm much more interested in the idea of improving the
> furthest-distance algorithm in
On Fri, Jan 20, 2006 at 05:29:46PM -0500, Ron wrote:
> If the N-dimensional space is Euclidean (any is the same
> distance apart in dimension x), then finding the farthest pair can be
> done in at least O(n).
That sounds a bit optimistic.
http://portal.acm.org/ft_gateway.cfm?id=167217&type=
On Fri, Jan 20, 2006 at 05:46:34PM -0500, Ron wrote:
> At 04:37 PM 1/20/2006, Martijn van Oosterhout wrote:
> >Given that all it's doing is counting bits, a simple fix would be to
> >loop over bytes, use XOR and count ones. For extreme speedup create a
> >lookup table with 256 entries to give you t
Ron <[EMAIL PROTECTED]> writes:
> For an even more extreme speedup, don't most modern CPUs have an asm
> instruction that counts the bits (un)set (AKA "population counting")
> in various size entities (4b, 8b, 16b, 32b, 64b, and 128b for 64b
> CPUs with SWAR instructions)?
Yeah, but fetching fr
At 04:37 PM 1/20/2006, Martijn van Oosterhout wrote:
On Fri, Jan 20, 2006 at 04:19:15PM -0500, Tom Lane wrote:
>   %   cumulative   self              self     total
>  time   seconds   seconds      calls  Ks/call  Ks/call  name
>  98.96   1495.93   1495.93   33035195     0.00     0.00  hemdistsign
On Fri, Jan 20, 2006 at 04:50:17PM -0500, Tom Lane wrote:
> (hemdistcache calls hemdistsign --- I think gprof is doing something
> funny with tail-calls here, and showing hemdistsign as directly called
> from gtsvector_picksplit when control really arrives through hemdistcache.)
It may be the comp
At 05:16 PM 1/20/2006, Steinar H. Gunderson wrote:
On Fri, Jan 20, 2006 at 04:50:17PM -0500, Tom Lane wrote:
> I wonder if there is a way to improve on that.
Ooh, the farthest pair problem (in an N-dimensional vector space, though).
I'm pretty sure problems like this have been studied quite exten
On Fri, Jan 20, 2006 at 04:50:17PM -0500, Tom Lane wrote:
> I wonder if there is a way to improve on that.
Ooh, the farthest pair problem (in an N-dimensional vector space, though).
I'm pretty sure problems like this have been studied quite extensively in the
literature, although perhaps not with t
Martijn van Oosterhout writes:
> Given that all it's doing is counting bits, a simple fix would be to
> loop over bytes, use XOR and count ones. For extreme speedup create a
> lookup table with 256 entries to give you the answer straight away...
Yeah, I just finished doing that and got about a 20
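A sketch of the byte-wise XOR plus 256-entry lookup-table version Martijn describes and Tom says he tried (the table initializer and names here are illustrative, not the patch that was actually committed):

#include <stdint.h>

#define SIGLEN 252                      /* bytes per signature; illustrative only */

/* number_of_ones[b] = how many bits are set in the byte value b. */
static uint8_t number_of_ones[256];

static void
init_popcount_table(void)
{
    for (int b = 0; b < 256; b++)
    {
        int n = 0;

        for (int v = b; v; v >>= 1)
            n += v & 1;
        number_of_ones[b] = (uint8_t) n;
    }
}

/* Hamming distance: XOR the signatures a byte at a time and sum the
 * per-byte popcounts from the table; no per-bit loop in the hot path. */
static int
hemdist_table(const uint8_t *a, const uint8_t *b)
{
    int dist = 0;

    for (int i = 0; i < SIGLEN; i++)
        dist += number_of_ones[a[i] ^ b[i]];
    return dist;
}

Even with the per-call cost reduced this way, the picksplit step still performs O(N^2) distance computations per page split, which is where the rest of the thread focuses.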
On Fri, Jan 20, 2006 at 10:37:54PM +0100, Martijn van Oosterhout wrote:
> Given that all it's doing is counting bits, a simple fix would be to
> loop over bytes, use XOR and count ones. For extreme speedup create a
> lookup table with 256 entries to give you the answer straight away...
For extra o
On Fri, Jan 20, 2006 at 04:19:15PM -0500, Tom Lane wrote:
>   %   cumulative   self              self     total
>  time   seconds   seconds      calls  Ks/call  Ks/call  name
>  98.96   1495.93   1495.93   33035195     0.00     0.00  hemdistsign
> So we gotta fix hemdistsign ...
lol!
Well, I feel like a fool, because I failed to notice that the total
runtime shown in that profile wasn't anywhere close to the actual wall
clock time. gprof is indeed simply not counting the time spent in
dynamically-linked code. With tsearch2 statically linked into the
backend, a more believable
On Fri, Jan 20, 2006 at 03:21:45PM -0500, Tom Lane wrote:
> If the totals given by gprof are correct, then it's down in the noise.
> I don't think I trust that too much ... but I don't see anything in the
> gprof manual about how to include a dynamically loaded library in the
> profile. (I did com
Martijn van Oosterhout writes:
> Something I'm missing is the calls to tsearch functions. I'm not 100%
> familiar with gprof, but is it possible those costs have been added
> somewhere else because it's in a shared library? Perhaps the costs went
> into FunctionCall1/3?
I think that the tsearch f
On Fri, Jan 20, 2006 at 02:14:29PM -0500, Tom Lane wrote:
> [ thread moved to pgsql-performance ]
>
> I've obtained a gprof profile on Stephan's sample case (many thanks for
> providing the data, Stephan). The command is
Something I'm missing is the calls to tsearch functions. I'm not 100%
fam
[ thread moved to pgsql-performance ]
I've obtained a gprof profile on Stephan's sample case (many thanks for
providing the data, Stephan). The command is
CREATE INDEX foo ON publications_test USING gist (fti_title);
where fti_title is a tsvector column. There are 236984 rows in the
tabl