25.08.2019 0:59, Floris Van Nee wrote:
Though, I got interested in the comment inconsistency you have found.
I added debug message into this code branch of the patch and was able to
see it in regression.diffs after 'make check':
Speaking of your patch, it seems that the buffer was unpinned and pinned
again between two reads,
and the condition of holding it continuously has not been met.
May I ask what makes you conclude that the condition of holding the pin
continuously has not been met?
Your reply encouraged me to dig a little bit more into this today. First, I
wanted to check if indeed the pin was continuously held by the backend or not.
I added some debug info to ReleaseBuffer for this: it turned out that the pin
on the buffer was definitely never released by the backend between the calls to
_bt_first and _bt_next. So the buffer got compacted while the backend held a
pin on it.
After some more searching I found the following code: _bt_vacuum_one_page in
nbtinsert.c
This function compacts one single page without taking a super-exclusive lock.
It is used during inserts to make room on a page. I verified that if I comment
out the calls to this function, the compacting never happens while I have a pin
on the buffer.
So I guess that answers my own question: cleaning up garbage during inserts is
one of the cases where compacting may happen even while other backends hold a
pin to the buffer. Perhaps this should also be more clearly phrased in the
comments in eg. _bt_killitems? Because currently those comments make it look
like this case never occurs.
You're right, the pin was not released between page reads.
I also added debug to UnpinBuffer, but now I see that I had interpreted
it wrongly.
As far as I understand, the issue with your patch is that it breaks the
*scan stops "between" pages* assumption
and thus it unsafely interacts with _bt_vacuum_one_page() cleanup.
See README:
>Page read locks are held only for as long as a scan is examining a page.
To minimize lock/unlock traffic, an index scan always searches a leaf page
to identify all the matching items at once, copying their heap tuple IDs
into backend-local storage. The heap tuple IDs are then processed while
not holding any page lock within the index. We do continue to hold a pin
on the leaf page in some circumstances, to protect against concurrent
deletions (see below). In this state the scan is effectively stopped
"between" pages, either before or after the page it has pinned. This is
safe in the presence of concurrent insertions and even page splits, because
items are never moved across pre-existing page boundaries --- so the scan
cannot miss any items it should have seen, nor accidentally return the same
item twice.
and
>Once an index tuple has been marked LP_DEAD it can actually be removed
from the index immediately; since index scans only stop "between" pages,
no scan can lose its place from such a deletion.
It seems that it contradicts the very idea of your patch, so probably we
should look for other ways to optimize this use-case.
Maybe this restriction can be relaxed for write only tables, that never
have to reread the page because of visibility, or something like that.
Also we probably can add to IndexScanDescData info about expected number
of tuples, to allow index work more optimal
and avoid the overhead for other loads.=
While reading the code to answer your question, I noticed that
BTScanPosIsPinned macro name is misleading.
It calls BufferIsValid(), not BufferIsPinned() as one could expect.
And BufferIsValid in bufmgr.h comment explicitly states that it
shouldn't be confused with BufferIsPinned.
The same goes for BTScanPosUnpinIfPinned().
I agree the name is misleading. It clearly does something else than how it's
named. However, I don't believe this introduces problems in these particular
pieces of code, as long as the macro's are always used. BTScanPosIsPinned
actually checks whether it's valid and not necessarily whether it's pinned, as
you mentioned. However, any time the buffer gets unpinned using the macro
BTScanPosUnpin, the buffer gets set to Invalid by the macro as well. Therefore,
any consecutive call to BTScanPosIsPinned should indeed return false. It'd
definitely be nice if this gets clarified in comments though.
That's true. It took me quite some time to understand that existing code
is correct.
There is a comment for the structure's field that claims that
BufferIsValid is the same that BufferIsPinned in ScanPos context.
Attached patch contains some comments' updates. Any suggestions on how
to improve them are welcome.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 9b172c1..316a42d 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1751,6 +1751,8 @@ _bt_killitems(IndexScanDesc scan)
Buffer buf;
/* Attempt to re-read the buffer, getting pin and lock. */
+ /* TODO Is it possible that currPage is not valid anymore? */
+ Assert(BTScanPosIsValid(so->currPos));
buf = _bt_getbuf(scan->indexRelation, so->currPos.currPage, BT_READ);
/* It might not exist anymore; in which case we can't hint it. */
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 52eafe6..9abca7d 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -512,14 +512,6 @@ typedef struct BTInsertStateData
typedef BTInsertStateData *BTInsertState;
/*
- * BTScanOpaqueData is the btree-private state needed for an indexscan.
- * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
- * details of the preprocessing), information about the current location
- * of the scan, and information about the marked location, if any. (We use
- * BTScanPosData to represent the data needed for each of current and marked
- * locations.) In addition we can remember some known-killed index entries
- * that must be marked before we can move off the current page.
- *
* Index scans work a page at a time: we pin and read-lock the page, identify
* all the matching items on the page and save them in BTScanPosData, then
* release the read-lock while returning the items to the caller for
@@ -583,6 +575,14 @@ typedef struct BTScanPosData
typedef BTScanPosData *BTScanPos;
+/*
+ * It's enough to check BufferIsValid inside this macro,
+ * since in context of BTScanPosData buf is always pinned if valid,
+ * see comment to struct's field above.
+ *
+ * NOTE To keep this assumption correct all users of scanpos must unpin buffer
+ * using macros below instead of directly calling ReleaseBuffer.
+ */
#define BTScanPosIsPinned(scanpos) \
( \
AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
@@ -600,12 +600,24 @@ typedef BTScanPosData *BTScanPos;
BTScanPosUnpin(scanpos); \
} while (0)
+/*
+ * Check if scanpos is initialized.
+ * TODO It is not clear to me
+ * why to check scanpos validity based on currPage value.
+ * I wonder, if we need currPage at all? Is there any codepath that
+ * assumes that currPage is not the same as BufferGetBlockNumber(buf)?
+ */
#define BTScanPosIsValid(scanpos) \
( \
AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
!BufferIsValid((scanpos).buf)), \
BlockNumberIsValid((scanpos).currPage) \
)
+
+/*
+ * Reset BTScanPos fields.
+ * If scanpos was valid, caller must call BTScanPosUnpinIfPinned in advance.
+ */
#define BTScanPosInvalidate(scanpos) \
do { \
(scanpos).currPage = InvalidBlockNumber; \
@@ -625,6 +637,15 @@ typedef struct BTArrayKeyInfo
Datum *elem_values; /* array of num_elems Datums */
} BTArrayKeyInfo;
+/*
+ * BTScanOpaqueData is the btree-private state needed for an indexscan.
+ * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
+ * details of the preprocessing), information about the current location
+ * of the scan, and information about the marked location, if any. (We use
+ * BTScanPosData to represent the data needed for each of current and marked
+ * locations.) In addition we can remember some known-killed index entries
+ * that must be marked before we can move off the current page.
+ */
typedef struct BTScanOpaqueData
{
/* these fields are set by _bt_preprocess_keys(): */