Hello.

WIP patch for an optimisation of OFFSET + IndexScan using the visibility map. The patch is based on an idea of Maxim Boguk [1], with some inspiration from Douglas Doole [2].

---------

Everyone knows that using OFFSET (especially a big one) is not good practice. But in reality it is widely used, mostly for paging (because it is simple).
A typical situation is some table (for example, tickets) with indexes used for paging/sorting:

SET work_mem = '512MB';
SET random_page_cost = 1.0;

CREATE TABLE ticket AS
    SELECT id,
           TRUNC(RANDOM() * 100 + 1) AS project_id,
           NOW() + (RANDOM() * (NOW() + '365 days' - NOW())) AS created_date,
           repeat((TRUNC(RANDOM() * 100 + 1))::text, 1000) AS payload
    FROM GENERATE_SERIES(1, 1000000) AS g(id);

CREATE INDEX simple_index ON ticket USING btree(project_id, created_date);

VACUUM FULL;
VACUUM ANALYZE ticket;

And a typical query doing OFFSET over the tickets of some project, with paging, some filtering (covered by the index) and sorting:

SELECT *
FROM ticket
WHERE project_id = ?
  AND created_date > '2017-06-20'
ORDER BY created_date
OFFSET 500 LIMIT 100;

At the moment the IndexScan node has to do 600 heap fetches to execute this query, even though the first 500 of them are simply dropped by nodeLimit.

The idea of the patch is to push the offset information down in ExecSetTupleBound (as is already done for bounded sorts) to the IndexScan in the case of a simple scan (no projection, reordering or qual). In that situation we can use a kind of index-only scan (even better, because we don't need the index data either) to avoid fetching tuples that nodeLimit will just throw away.

The patch is also available on Github: https://github.com/michail-nikolaev/postgres/commit/a368c3483250e4c02046d418a27091678cb963f4?diff=split
And some tests are here: https://gist.github.com/michail-nikolaev/b7cbe1d6f463788407ebcaec8917d1e0

So, at the moment everything seems to work (check-world passes too), and I get the following results for the test ticket table:

 offset | master  | patch
--------+---------+--------
    100 | ~1.3ms  | ~0.7ms
   1000 | ~5.6ms  | ~1.1ms
  10000 | ~46.7ms | ~3.6ms

To continue development I have the following questions:

0) Maybe I missed something huge...
1) Is it required to support non-MVCC (dirty) snapshots? They are not supported for IndexOnlyScan; I am not sure about IndexScan.
2) Should I try to pass information about this optimisation to the planner/optimizer? It is not too easy with the current design, but it seems possible.
3) If so, should I add something to EXPLAIN?
4) If so, should I add some counters to EXPLAIN ANALYZE (for example, the number of heap fetches avoided)?
5) Should I add a description of the optimisation to https://www.postgresql.org/docs/10/static/queries-limit.html ?
6) Maybe you have some ideas for additional tests I should add.

Thanks a lot.

[1] https://www.postgresql.org/message-id/CAK-MWwQpZobHfuTtHj9%2B9G%2B5%3Dck%2BaX-ANWHtBK_0_D_qHYxWuw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CADE5jYLuugnEEUsyW6Q_4mZFYTxHxaVCQmGAsF0yiY8ZDggi-w%40mail.gmail.com
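P.S. An easy way to check whether the optimisation kicks in (just a suggestion for reviewers, not part of the patch; project_id = 42 is an arbitrary stand-in for the "?" above) is to compare buffer usage with and without the patch:

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM ticket
WHERE project_id = 42
  AND created_date > '2017-06-20'
ORDER BY created_date
OFFSET 500 LIMIT 100;

With the patch applied, the "Buffers: shared hit" count reported for the Index Scan node should drop noticeably, since skipped tuples on all-visible pages no longer cause heap fetches.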
*** a/src/backend/executor/execParallel.c
--- b/src/backend/executor/execParallel.c
***************
*** 1303,1310 **** ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
  	pwcxt.seg = seg;
  	ExecParallelInitializeWorker(queryDesc->planstate, &pwcxt);
  
! 	/* Pass down any tuple bound */
! 	ExecSetTupleBound(fpes->tuples_needed, queryDesc->planstate);
  
  	/*
  	 * Run the plan.  If we specified a tuple bound, be careful not to demand
--- 1303,1310 ----
  	pwcxt.seg = seg;
  	ExecParallelInitializeWorker(queryDesc->planstate, &pwcxt);
  
! 	/* Pass down any tuple bound.  The offset cannot be optimized here because of parallel execution. */
! 	ExecSetTupleBound(fpes->tuples_needed, 0, queryDesc->planstate);
  
  	/*
  	 * Run the plan.  If we specified a tuple bound, be careful not to demand
*** a/src/backend/executor/execProcnode.c
--- b/src/backend/executor/execProcnode.c
***************
*** 785,793 **** ExecShutdownNode(PlanState *node)
   *
   *		Set a tuple bound for a planstate node.  This lets child plan nodes
   *		optimize based on the knowledge that the maximum number of tuples that
!  *		their parent will demand is limited.  The tuple bound for a node may
!  *		only be changed between scans (i.e., after node initialization or just
!  *		before an ExecReScan call).
   *
   *		Any negative tuples_needed value means "no limit", which should be the
   *		default assumption when this is not called at all for a particular node.
--- 785,794 ----
   *
   *		Set a tuple bound for a planstate node.  This lets child plan nodes
   *		optimize based on the knowledge that the maximum number of tuples that
!  *		their parent will demand is limited.  Likewise, tuples_skipped lets
!  *		child nodes optimize the retrieval of tuples that the parent
!  *		(nodeLimit) will immediately discard.  The tuple bound for a node may
!  *		only be changed between scans (i.e., after node initialization or just
!  *		before an ExecReScan call).
   *
   *		Any negative tuples_needed value means "no limit", which should be the
   *		default assumption when this is not called at all for a particular node.
***************
*** 797,803 **** ExecShutdownNode(PlanState *node)
   *		only unchanging conditions are tested here.
   */
  void
! ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
  {
  	/*
  	 * Since this function recurses, in principle we should check stack depth
--- 798,804 ----
   *		only unchanging conditions are tested here.
   */
  void
! ExecSetTupleBound(int64 tuples_needed, int64 tuples_skipped, PlanState *child_node)
  {
  	/*
  	 * Since this function recurses, in principle we should check stack depth
***************
*** 839,845 **** ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
  		int			i;
  
  		for (i = 0; i < maState->ms_nplans; i++)
! 			ExecSetTupleBound(tuples_needed, maState->mergeplans[i]);
  	}
  	else if (IsA(child_node, ResultState))
  	{
--- 840,846 ----
  		int			i;
  
  		for (i = 0; i < maState->ms_nplans; i++)
! 			ExecSetTupleBound(tuples_needed, 0, maState->mergeplans[i]);
  	}
  	else if (IsA(child_node, ResultState))
  	{
***************
*** 853,859 **** ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
  		 * rows will be demanded from the Result child anyway.
  		 */
  		if (outerPlanState(child_node))
! 			ExecSetTupleBound(tuples_needed, outerPlanState(child_node));
  	}
  	else if (IsA(child_node, SubqueryScanState))
  	{
--- 854,860 ----
  		 * rows will be demanded from the Result child anyway.
  		 */
  		if (outerPlanState(child_node))
! 			ExecSetTupleBound(tuples_needed, 0, outerPlanState(child_node));
  	}
  	else if (IsA(child_node, SubqueryScanState))
  	{
***************
*** 864,870 **** ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
  		SubqueryScanState *subqueryState = (SubqueryScanState *) child_node;
  
  		if (subqueryState->ss.ps.qual == NULL)
! 			ExecSetTupleBound(tuples_needed, subqueryState->subplan);
  	}
  	else if (IsA(child_node, GatherState))
  	{
--- 865,871 ----
  		SubqueryScanState *subqueryState = (SubqueryScanState *) child_node;
  
  		if (subqueryState->ss.ps.qual == NULL)
! 			ExecSetTupleBound(tuples_needed, tuples_skipped, subqueryState->subplan);
  	}
  	else if (IsA(child_node, GatherState))
  	{
***************
*** 881,887 **** ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
  		gstate->tuples_needed = tuples_needed;
  
  		/* Also pass down the bound to our own copy of the child plan */
! 		ExecSetTupleBound(tuples_needed, outerPlanState(child_node));
  	}
  	else if (IsA(child_node, GatherMergeState))
  	{
--- 882,888 ----
  		gstate->tuples_needed = tuples_needed;
  
  		/* Also pass down the bound to our own copy of the child plan */
! 		ExecSetTupleBound(tuples_needed, 0, outerPlanState(child_node));
  	}
  	else if (IsA(child_node, GatherMergeState))
  	{
***************
*** 890,896 **** ExecSetTupleBound(int64 tuples_needed, PlanState *child_node)
  
  		gstate->tuples_needed = tuples_needed;
  
! 		ExecSetTupleBound(tuples_needed, outerPlanState(child_node));
  	}
  
  	/*
--- 891,905 ----
  
  		gstate->tuples_needed = tuples_needed;
  
! 		ExecSetTupleBound(tuples_needed, 0, outerPlanState(child_node));
! 	}
! 	else if (IsA(child_node, IndexScanState))
! 	{
! 		IndexScanState *isState = (IndexScanState *) child_node;
! 
! 		/* A simple IndexScan can use an index-only style fetch to skip offset tuples. */
! 		if (!isState->ss.ps.qual && !isState->ss.ps.ps_ProjInfo && isState->iss_NumOrderByKeys == 0)
! 			isState->iss_tuples_skipped = isState->iss_tuples_skipped_remaining = tuples_skipped;
  	}
  
  	/*
*** a/src/backend/executor/nodeIndexscan.c
--- b/src/backend/executor/nodeIndexscan.c
***************
*** 31,41 ****
--- 31,43 ----
  
  #include "access/nbtree.h"
  #include "access/relscan.h"
+ #include "access/visibilitymap.h"
  #include "catalog/pg_am.h"
  #include "executor/execdebug.h"
  #include "executor/nodeIndexscan.h"
  #include "lib/pairingheap.h"
  #include "miscadmin.h"
  #include "nodes/nodeFuncs.h"
  #include "optimizer/clauses.h"
+ #include "storage/predicate.h"
  #include "utils/array.h"
***************
*** 118,123 **** IndexNext(IndexScanState *node)
--- 120,131 ----
  
  		node->iss_ScanDesc = scandesc;
  
+ 		if (node->iss_tuples_skipped != 0)
+ 		{
+ 			/* Set up for index-only style fetches while tuples are being skipped. */
+ 			node->iss_VMBuffer = InvalidBuffer;
+ 		}
+ 
  		/*
  		 * If no run-time keys to calculate or they are ready, go ahead and
  		 * pass the scankeys to the index AM.
***************
*** 128,167 **** IndexNext(IndexScanState *node)
  										 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
  	}
  
! 	/*
! 	 * ok, now that we have what we need, fetch the next tuple.
! 	 */
! 	while ((tuple = index_getnext(scandesc, direction)) != NULL)
  	{
! 		CHECK_FOR_INTERRUPTS();
! 
! 		/*
! 		 * Store the scanned tuple in the scan tuple slot of the scan state.
! 		 * Note: we pass 'false' because tuples returned by amgetnext are
! 		 * pointers onto disk pages and must not be pfree()'d.
! 		 */
! 		ExecStoreTuple(tuple,	/* tuple to store */
! 					   slot,	/* slot to store in */
! 					   scandesc->xs_cbuf,	/* buffer containing tuple */
! 					   false);	/* don't pfree */
! 
  		/*
! 		 * If the index was lossy, we have to recheck the index quals using
! 		 * the fetched tuple.
  		 */
! 		if (scandesc->xs_recheck)
  		{
! 			econtext->ecxt_scantuple = slot;
! 			ResetExprContext(econtext);
! 			if (!ExecQual(node->indexqualorig, econtext))
  			{
! 				/* Fails recheck, so drop it and loop back for another */
! 				InstrCountFiltered2(node, 1);
! 				continue;
  			}
- 		}
! 
! 		return slot;
  	}
  
  	/*
--- 136,287 ----
  										 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
  	}
  
! 	/*
! 	 * Use the visibility map while tuples are being skipped by the parent
! 	 * nodeLimit.  The code below is based on IndexOnlyNext; refer to
! 	 * nodeIndexonlyscan.c for comments about memory ordering and concurrency.
  	 */
! 	if (node->iss_tuples_skipped_remaining != 0)	/* parent limit is still discarding our output */
  	{
! 		ItemPointer tid = NULL;
! 
! 		/*
! 		 * Fetch the next TID.  xs_want_itup is left false because we don't
! 		 * need any index data.
! 		 */
! 		while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
! 		{
! 			CHECK_FOR_INTERRUPTS();
! 
! 			tuple = NULL;
! 
! 			/*
! 			 * We don't support rechecking ORDER BY distances; we would be in
! 			 * IndexNextWithReorder if so.
! 			 */
! 			Assert(!(scandesc->numberOfOrderBys > 0 && scandesc->xs_recheckorderby));
! 
! 			/*
! 			 * Only MVCC snapshots are supported here.  TODO: is that OK?
! 			 */
! 			if (scandesc->xs_continue_hot)
! 				elog(ERROR, "non-MVCC snapshots are not supported in index scans for index-only offset");
! 
! 			/*
! 			 * Refer to IndexOnlyNext for information about VM_ALL_VISIBLE
! 			 * usage during an index scan.  If a recheck is required, skip the
! 			 * visibility map check, because we need the heap tuple anyway.
! 			 */
! 			if (scandesc->xs_recheck ||
! 				!VM_ALL_VISIBLE(scandesc->heapRelation,
! 								ItemPointerGetBlockNumber(tid),
! 								&node->iss_VMBuffer))
! 			{
! 				tuple = index_fetch_heap(scandesc);
! 				if (tuple == NULL)
! 					continue;	/* no visible tuple, try next index entry */
+ 
+ 				/*
+ 				 * Note: at this point we are holding a pin on the heap page, as
+ 				 * recorded in scandesc->xs_cbuf.  We could release that pin now,
+ 				 * but it's not clear whether it's a win to do so.  The next index
+ 				 * entry might require a visit to the same heap page.
+ 				 */
+ 			}
+ 
+ 			/*
+ 			 * If the index was lossy, we have to recheck the index quals.
+ 			 */
+ 			if (scandesc->xs_recheck)
+ 			{
+ 				/*
+ 				 * Store the scanned tuple in the scan tuple slot of the scan
+ 				 * state.  Note: we pass 'false' because tuples returned by
+ 				 * amgetnext are pointers onto disk pages and must not be
+ 				 * pfree()'d.
+ 				 */
+ 				ExecStoreTuple(tuple,	/* tuple to store */
+ 							   slot,	/* slot to store in */
+ 							   scandesc->xs_cbuf,	/* buffer containing tuple */
+ 							   false);	/* don't pfree */
+ 
+ 				econtext->ecxt_scantuple = slot;
+ 				ResetExprContext(econtext);
+ 				if (!ExecQual(node->indexqualorig, econtext))
+ 				{
+ 					/* Fails recheck, so drop it and loop back for another */
+ 					InstrCountFiltered2(node, 1);
+ 					continue;
+ 				}
+ 			}
+ 			else
+ 			{
+ 				/*
+ 				 * We know the index holds a tuple passing all checks, and the
+ 				 * parent nodeLimit will skip it anyway, so just prepare a fake
+ 				 * non-empty slot.
+ 				 */
+ 				ExecClearTuple(slot);
+ 				slot->tts_isempty = false;
+ 				slot->tts_nvalid = 0;
+ 			}
+ 
+ 			/*
+ 			 * Predicate locks for index-only scans must be acquired at the page
+ 			 * level when the heap is not accessed, since tuple-level predicate
+ 			 * locks need the tuple's xmin value.  If we had to visit the tuple
+ 			 * anyway, then we already have the tuple-level lock and can skip the
+ 			 * page lock.
+ 			 */
+ 			if (tuple == NULL)
+ 				PredicateLockPage(scandesc->heapRelation,
+ 								  ItemPointerGetBlockNumber(tid),
+ 								  estate->es_snapshot);
+ 
+ 			/*
+ 			 * Decrement the counter of remaining tuples to skip.  Once the
+ 			 * last one is skipped, release the VM buffer: any further tuples
+ 			 * will be returned through the regular index scan path below.
+ 			 */
+ 			if (--node->iss_tuples_skipped_remaining == 0 &&
+ 				node->iss_VMBuffer != InvalidBuffer)
+ 			{
+ 				ReleaseBuffer(node->iss_VMBuffer);
+ 				node->iss_VMBuffer = InvalidBuffer;
+ 			}
+ 
+ 			return slot;
+ 		}
+ 	}
+ 	else
+ 	{
  		/*
! 		 * ok, now that we have what we need, fetch the next tuple.
  		 */
! 		while ((tuple = index_getnext(scandesc, direction)) != NULL)
  		{
! 			CHECK_FOR_INTERRUPTS();
! 
! 			/*
! 			 * Store the scanned tuple in the scan tuple slot of the scan state.
! 			 * Note: we pass 'false' because tuples returned by amgetnext are
! 			 * pointers onto disk pages and must not be pfree()'d.
! 			 */
! 			ExecStoreTuple(tuple,	/* tuple to store */
! 						   slot,	/* slot to store in */
! 						   scandesc->xs_cbuf,	/* buffer containing tuple */
! 						   false);	/* don't pfree */
! 
! 			/*
! 			 * If the index was lossy, we have to recheck the index quals using
! 			 * the fetched tuple.
! 			 */
! 			if (scandesc->xs_recheck)
  			{
! 				econtext->ecxt_scantuple = slot;
! 				ResetExprContext(econtext);
! 				if (!ExecQual(node->indexqualorig, econtext))
! 				{
! 					/* Fails recheck, so drop it and loop back for another */
! 					InstrCountFiltered2(node, 1);
! 					continue;
! 				}
  			}
! 
! 			return slot;
! 		}
  	}
  
  	/*
***************
*** 609,614 **** ExecReScanIndexScan(IndexScanState *node)
--- 729,735 ----
  					   node->iss_ScanKeys, node->iss_NumScanKeys,
  					   node->iss_OrderByKeys, node->iss_NumOrderByKeys);
  	node->iss_ReachedEnd = false;
+ 	node->iss_tuples_skipped_remaining = node->iss_tuples_skipped;	/* skip the offset tuples again */
  
  	ExecScanReScan(&node->ss);
  }
***************
*** 818,823 **** ExecEndIndexScan(IndexScanState *node)
--- 939,951 ----
  	indexScanDesc = node->iss_ScanDesc;
  	relation = node->ss.ss_currentRelation;
  
+ 	/* Release VM buffer pin, if any. */
+ 	if (node->iss_VMBuffer != InvalidBuffer)
+ 	{
+ 		ReleaseBuffer(node->iss_VMBuffer);
+ 		node->iss_VMBuffer = InvalidBuffer;
+ 	}
+ 
  	/*
  	 * Free the exprcontext(s) ... now dead code, see ExecFreeExprContext
  	 */
*** a/src/backend/executor/nodeLimit.c
--- b/src/backend/executor/nodeLimit.c
***************
*** 298,309 **** recompute_limits(LimitState *node)
  	node->lstate = LIMIT_RESCAN;
  
  	/*
! 	 * Notify child node about limit.  Note: think not to "optimize" by
! 	 * skipping ExecSetTupleBound if compute_tuples_needed returns < 0.  We
! 	 * must update the child node anyway, in case this is a rescan and the
  	 * previous time we got a different result.
  	 */
! 	ExecSetTupleBound(compute_tuples_needed(node), outerPlanState(node));
  }
  
  /*
--- 298,309 ----
  	node->lstate = LIMIT_RESCAN;
  
  	/*
! 	 * Notify child node about limit and offset.  Note: think not to "optimize"
! 	 * by skipping ExecSetTupleBound if compute_tuples_needed() returns < 0 and
! 	 * the offset is 0.  We must update the child node anyway, in case this is a rescan and the
  	 * previous time we got a different result.
  	 */
! 	ExecSetTupleBound(compute_tuples_needed(node), node->offset, outerPlanState(node));
  }
  
  /*
*** a/src/include/executor/executor.h
--- b/src/include/executor/executor.h
***************
*** 226,232 ****
  extern void ExecSetExecProcNode(PlanState *node, ExecProcNodeMtd function);
  extern Node *MultiExecProcNode(PlanState *node);
  extern void ExecEndNode(PlanState *node);
  extern bool ExecShutdownNode(PlanState *node);
! extern void ExecSetTupleBound(int64 tuples_needed, PlanState *child_node);
  
  /* ----------------------------------------------------------------
--- 226,232 ----
  extern void ExecSetExecProcNode(PlanState *node, ExecProcNodeMtd function);
  extern Node *MultiExecProcNode(PlanState *node);
  extern void ExecEndNode(PlanState *node);
  extern bool ExecShutdownNode(PlanState *node);
! extern void ExecSetTupleBound(int64 tuples_needed, int64 tuples_skipped, PlanState *child_node);
  
  /* ----------------------------------------------------------------
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
***************
*** 1219,1224 **** typedef struct IndexScanState
--- 1219,1227 ----
  	bool	   *iss_OrderByTypByVals;
  	int16	   *iss_OrderByTypLens;
  	Size		iss_PscanLen;
+ 	int64		iss_tuples_skipped;	/* tuple offset, see ExecSetTupleBound */
+ 	int64		iss_tuples_skipped_remaining;	/* remaining tuples to skip */
+ 	Buffer		iss_VMBuffer;	/* VM buffer used while iss_tuples_skipped > 0 */
  } IndexScanState;
  
  /* ----------------
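P.P.S. To make the applicability condition concrete: the new branch in ExecSetTupleBound only fires when the IndexScan node has no qual, no projection and no ORDER BY reordering. For the test table that means, roughly (an illustration based on that condition, not on actual planner output):

-- Should qualify: SELECT * (no projection), and the WHERE clause is
-- handled entirely by simple_index, so no node-level qual remains.
SELECT * FROM ticket
WHERE project_id = 42 AND created_date > '2017-06-20'
ORDER BY created_date OFFSET 500 LIMIT 100;

-- Should not qualify: the payload condition cannot be handled by
-- simple_index and becomes a qual on the IndexScan node, so every
-- tuple has to be fetched from the heap anyway.
SELECT * FROM ticket
WHERE project_id = 42 AND payload LIKE '42%'
ORDER BY created_date OFFSET 500 LIMIT 100;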