Re: [HACKERS] WIP: Fast GiST index build

Heikki Linnakangas Mon, 05 Sep 2011 09:34:13 -0700

On 05.09.2011 14:10, Heikki Linnakangas wrote:

On 01.09.2011 12:23, Alexander Korotkov wrote:

On Thu, Sep 1, 2011 at 12:59 PM, Heikki Linnakangas<
heikki.linnakan...@enterprisedb.com> wrote:

So I changed the test script to generate the table as:

CREATE TABLE points AS SELECT random() as x, random() as y FROM
generate_series(1, $NROWS);

The unordered results are in:

testname | nrows | duration | accesses
-----------------------------+**-----------+-----------------+**----------

points unordered buffered | 250000000 | 05:56:58.575789 | 2241050
points unordered auto | 250000000 | 05:34:12.187479 | 2246420
points unordered unbuffered | 250000000 | 04:38:48.663952 | 2244228

Although the buffered build doesn't lose as badly as it did with more
overlap, it still doesn't look good :-(. Any ideas?


But it's still a lot of overlap. It's about 220 accesses per small area
request. It's about 10 - 20 times greater than should be without
overlaps.
If we roughly assume that 10 times more overlap makes 1/10 of tree to be
used for actual inserts, then that part of tree can easily fit to the
cache.
You can try my splitting algorithm on your test setup (it this case I
advice
to start from smaller number of rows, 100 M for example).
I'm requesting real-life datasets which makes troubles in real life from
Oleg. Probably those datasets is even larger or new linear split produce
less overlaps on them.


I made a small tweak to the patch, and got much better results (this is
with my original method of generating the data):

testname | nrows | duration | accesses
-----------------------------+-----------+-----------------+----------
points unordered buffered | 250000000 | 03:34:23.488275 | 3945486
points unordered auto | 250000000 | 02:55:10.248722 | 3767548
points unordered unbuffered | 250000000 | 04:02:04.168138 | 4564986


The full results of this test are in:

          testname           |   nrows   |    duration     | accesses
-----------------------------+-----------+-----------------+----------
 points unordered buffered   | 250000000 | 03:34:23.488275 |  3945486
 points unordered auto       | 250000000 | 02:55:10.248722 |  3767548
 points unordered unbuffered | 250000000 | 04:02:04.168138 |  4564986
 points ordered buffered     | 250000000 | 02:00:10.467914 |  5572906
 points ordered auto         | 250000000 | 02:16:01.859039 |  5435673
 points ordered unbuffered   | 250000000 | 03:23:18.061742 |  1875826
(6 rows)

Interestingly, in this test case the buffered build was significantlyfaster even in the case of ordered input - but the quality of theproduced index was much worse. I suspect it's because of thelast-in-first-out nature of the buffering, tuples that pushed intobuffers first are flushed to lower levels last. Tweaking the datastructures to make the buffer flushing a FIFO process might help withthat, but I'm afraid that might make our cache hit ratio worse whenreading pages from the temporary file.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Fast GiST index build

Reply via email to