On Mon, Sep 7, 2015 at 9:20 PM, Peter Geoghegan <p...@heroku.com> wrote:
>>> This matters because a major cost during CREATE INDEX CONCURRENTLY is
>>> a TID-based datum sort (this is probably most of the cost over and
>>> above a conventional CREATE INDEX).
>>
>> Might be better to hack a special case right there (ie, embed TIDs into
>> int8s and sort the int8s) rather than try to change the type's SQL
>> declaration.
>
> That sounds at least as effective as what I originally sketched.
> I hope someone picks this up soon.

I suggested to someone else that he take a look at this as a project,
but I guess he was busy too. I decided to just do it myself. Patch is
attached.

The patch has the bulkdelete callback that gathers TIDs from the index
during CREATE INDEX CONCURRENTLY encode them as int8 values ahead of
the sort, so that the sort itself operates on int8 values. The values
are decoded back into TIDs later, as needed. This turns out to be a
relatively simple patch.

Testing shows that it removes *most* of the overhead of CREATE INDEX
CONCURRENTLY over CREATE INDEX. In absolute terms, a test case
involving a CREATE INDEX CONCURRENTLY on a single integer column takes
about 30% - 40% less time with the patch applied. The TID sort itself
is about 3 times faster, and that's without the benefit of the patch
making the TID sort an internal sort where it would otherwise have
been an external sort.

Sorting item pointers as TIDs naively currently wastes a lot of memory
(not just memory bandwidth) -- a pass-by-value datum sort only ever
allocates memory for SortTuples, avoiding allocating any memory for a
"tuple proper". Clearly, just having the sort be pass-by-value is
*much* faster. As the comments above process_ordered_aggregate_single()
put it:

 * The one-input case is handled separately from the multi-input case
 * for performance reasons: for single by-value inputs, such as the
 * common case of count(distinct id), the tuplesort_getdatum code path
 * is around 300% faster.  (The speedup for by-reference types is less
 * but still noticeable.)

Another way of stating the improvement: my CREATE INDEX CONCURRENTLY
test case had about a 2.1x overhead over an equivalent CREATE INDEX on
master, but with the patch applied the CREATE INDEX CONCURRENTLY
overhead is only about 1.3x. The extra logical I/O for CONCURRENTLY's
second scan, and for the physical-order index scan (the "bulk delete"
pass that gathers the TIDs to sort), is a surprisingly small cost.

I'll add the patch to the open commitfest. Think of all these patches
of mine as giving reviewers a choice of what to review. This patch
does seem like a slam dunk, even if I do say so myself, so at least
it's relatively easy to review. The external sort work remains my most
interesting pending work, though.

-- 
Peter Geoghegan
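For illustration only (this is not part of the patch): a minimal
standalone sketch of the encoding idea, for anyone who wants to
convince themselves of the ordering property without building the
backend. Plain uint32_t/uint16_t stand in for BlockNumber and
OffsetNumber, and the tid_encode/tid_decode names are invented for
this sketch; the patch below does the equivalent with
itemptr_encode()/itemptr_decode() and feeds the resulting int64 values
to a datum tuplesort. The point is just that the packed int64 values
sort in the same order as their (block, offset) pairs, and that the
original pair can be recovered exactly.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Pack (block, offset) into a single int64; block is the high-order part. */
static inline int64_t
tid_encode(uint32_t block, uint16_t offset)
{
    return ((int64_t) block << 16) | offset;
}

/* Recover (block, offset) from the packed representation. */
static inline void
tid_decode(int64_t encoded, uint32_t *block, uint16_t *offset)
{
    *block = (uint32_t) (encoded >> 16);
    *offset = (uint16_t) (encoded & 0xFFFF);
}

int
main(void)
{
    uint32_t    block;
    uint16_t    offset;

    /* (10, 200) precedes (11, 1) in TID order; the int64 encoding agrees. */
    assert(tid_encode(10, 200) < tid_encode(11, 1));

    /* Round trip: decoding returns the original block and offset. */
    tid_decode(tid_encode(4294967295u, 65535), &block, &offset);
    assert(block == 4294967295u && offset == 65535);

    printf("encoding sketch: ok\n");
    return 0;
}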
From 3c7551435394090a4c2d3e7588d018ea9dca2482 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghega...@gmail.com>
Date: Sat, 14 Nov 2015 16:57:20 -0800
Subject: [PATCH] Speed up CREATE INDEX CONCURRENTLY's TID sort

Add a simple TID encoding scheme. This is used for encoding and
decoding TIDs to and from int8/int64 values, for the benefit of an
expensive tuplesort operation performed by CREATE INDEX CONCURRENTLY.
The sort now uses the default int8 opclass to sort values in the
ad-hoc encoding, and so benefits from the availability of int8
SortSupport.

Despite the fact that item pointers take up only 6 bytes on disk, the
TID type is always pass-by-reference due to considerations about how
Datums are represented and accessed. On the other hand, int8 is
usually, in practice, a pass-by-value type.

Testing shows this approach makes CREATE INDEX CONCURRENTLY
significantly faster on platforms where int8 is pass-by-value,
especially when the big saving in memory within tuplesort.c prevents
the sort from being performed externally. The approach will help to
some degree on all platforms, though.
---
 src/backend/catalog/index.c | 56 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 52 insertions(+), 4 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e59b163..7429f3c 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -109,6 +109,8 @@ static void index_update_stats(Relation rel,
 static void IndexCheckExclusion(Relation heapRelation,
 					Relation indexRelation,
 					IndexInfo *indexInfo);
+static inline int64 itemptr_encode(ItemPointer itemptr);
+static inline void itemptr_decode(ItemPointer itemptr, int64 encoded);
 static bool validate_index_callback(ItemPointer itemptr, void *opaque);
 static void validate_index_heapscan(Relation heapRelation,
 						Relation indexRelation,
@@ -2832,7 +2834,13 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
 
-	state.tuplesort = tuplesort_begin_datum(TIDOID, TIDLessOperator,
+	/*
+	 * Encode TIDs as int8 values for the sort, rather than directly sorting
+	 * item pointers.  This can be significantly faster, primarily because TID
+	 * is a pass-by-reference type on all platforms, whereas int8 is
+	 * pass-by-value on most platforms.
+	 */
+	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
 											maintenance_work_mem,
 											false);
@@ -2872,14 +2880,45 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 }
 
 /*
+ * itemptr_encode - Encode ItemPointer as int64/int8
+ *
+ * This representation must produce values encoded as int64 that sort in the
+ * same order as their corresponding original TID values would.
+ */
+static inline int64
+itemptr_encode(ItemPointer itemptr)
+{
+	BlockNumber block = ItemPointerGetBlockNumber(itemptr);
+	OffsetNumber offset = ItemPointerGetOffsetNumber(itemptr);
+	int64		encoded;
+
+	encoded = ((int64) block << 16) | offset;
+
+	return encoded;
+}
+
+/*
+ * itemptr_decode - Decode int64/int8 representation back to ItemPointer
+ */
+static inline void
+itemptr_decode(ItemPointer itemptr, int64 encoded)
+{
+	BlockNumber block = (BlockNumber) (encoded >> 16);
+	OffsetNumber offset = (OffsetNumber) (encoded & 0xFFFF);
+
+	ItemPointerSet(itemptr, block, offset);
+}
+
+/*
  * validate_index_callback - bulkdelete callback to collect the index TIDs
  */
 static bool
 validate_index_callback(ItemPointer itemptr, void *opaque)
 {
 	v_i_state  *state = (v_i_state *) opaque;
+	int64		encoded = itemptr_encode(itemptr);
 
-	tuplesort_putdatum(state->tuplesort, PointerGetDatum(itemptr), false);
+	tuplesort_putdatum(state->tuplesort, Int64GetDatum(encoded), false);
 	state->itups += 1;
 	return false;			/* never actually delete anything */
 }
@@ -2911,6 +2950,7 @@ validate_index_heapscan(Relation heapRelation,
 
 	/* state variables for the merge */
 	ItemPointer indexcursor = NULL;
+	ItemPointerData decoded;
 	bool		tuplesort_empty = false;
 
 	/*
@@ -3020,13 +3060,21 @@ validate_index_heapscan(Relation heapRelation,
 			 */
 			if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
 				in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
-			pfree(indexcursor);
 		}
 
 		tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
 											  &ts_val, &ts_isnull);
 		Assert(tuplesort_empty || !ts_isnull);
-		indexcursor = (ItemPointer) DatumGetPointer(ts_val);
+		if (!tuplesort_empty)
+		{
+			itemptr_decode(&decoded, DatumGetInt64(ts_val));
+			indexcursor = &decoded;
+
+			/* If int8 is pass-by-ref, free (encoded) TID Datum memory */
+#ifndef USE_FLOAT8_BYVAL
+			pfree(DatumGetPointer(ts_val));
+#endif
+		}
 	}
 
 	/*
-- 
1.9.1