Re: Add Information during standby recovery conflicts

Fujii Masao Mon, 14 Dec 2020 22:40:29 -0800



On 2020/12/15 12:04, Kyotaro Horiguchi wrote:

At Tue, 15 Dec 2020 02:00:21 +0900, Fujii Masao <masao.fu...@oss.nttdata.com> 
wrote in

Thanks for the review! I'm thinking to wait half a day before
commiting
the patch just in the case someone may object the patch.


Sorry for coming late.  I have looked only the latest thread so I
should be missing many things so please ignore if it was settled in
the past discussion.


It emits messages like the follows;

[40509:startup] LOG:  recovery still waiting after 1021.431 ms: recovery 
conflict on lock
[40509:startup] DETAIL:  Conflicting processes: 41171, 41194.
[40509:startup] CONTEXT:  WAL redo at 0/3013158 for Standby/LOCK: xid 510 db 
13609 rel 16384

IFAIK DETAIL usually shows ordinary sentences so the first word is
capitalized and ends with a period. But it is not a sentence so
following period looks odd.  a searcheing the tree for errdetails
showed some anomalies.

src/backend/parser/parse_param.c                                         
errdetail("%s versus %s",
src/backend/jit/llvm/llvmjit_error.cpp                   errdetail("while in 
LLVM")));
src/backend/replication/logical/tablesync.c                              errdetail("The 
error was: %s", res->err)));
src/backend/tcop/postgres.c                             errdetail("prepare: %s", 
pstmt->plansource->query_string);
src/backend/tcop/postgres.c             errdetail("abort reason: recovery 
conflict");

and one similar occurance:

src/backend/utils/adt/dbsize.c                                   errdetail("Invalid size unit: 
\"%s\".", strptr),

I'm not sure, but it seems to me at least the period is unnecessary
here.


Since Error Message Style Guide in the docs says "Detail and hint messages:
Use complete sentences, and end each with a period.", I think that a period
is necessary here. No?



+                       bool            maybe_log_conflict =
+                       (standbyWaitStart != 0 && !logged_recovery_conflict);

odd indentation.


This is the result of pgindent run. I'm not sure why pgindent indents
that way, but for now I'd like to follow pgindent.



+               /* Also, set the timer if necessary */
+               if (logging_timer)
+               {
+                       timeouts[cnt].id = STANDBY_LOCK_TIMEOUT;
+                       timeouts[cnt].type = TMPARAM_AFTER;
+                       timeouts[cnt].delay_ms = DeadlockTimeout;
+                       cnt++;
+               }

This doesn't consider spurious wakeup. I'm not sure it actually
happenes but we usually consider that.  That is since before this
patch, but ProcWaitForSignal()'s comment says that:

  * As this uses the generic process latch the caller has to be robust against
  * unrelated wakeups: Always check that the desired state has occurred, and
  * wait again if not.


If we don't care of spurious wakeups, we should add such a comment.


If ProcWaitForSignal() wakes up because of the reason (e.g., SIGHUP)
other than deadlock_timeout, ProcSleep() will call
ResolveRecoveryConflictWithLock() and we sleep on ProcWaitForSignal()
again since the recovery conflict has not been resolved yet. So we can
say that we consider "spurious wakeups"?

However when I read the related code again, I found another issue in
the patch. In ResolveRecoveryConflictWithLock(), if SIGHUP causes us to
wake up out of ProcWaitForSignal() before deadlock_timeout is reached,
we will use deadlock_timeout again when sleeping on ProcWaitForSignal().
Instead, probably we should use the "deadlock_timeout - elapsed time"
so that we can emit a log message as soon as deadlock_timeout passes
since starting waiting on recovery conflict. Otherwise it may take at most
"deadlock_timeout * 2" to log "still waiting ..." message.

To fix this issue, "deadlock_timeout - elapsed time" needs to be used as
the timeout when enabling the timer at least in
ResolveRecoveryConflictWithLock() and ResolveRecoveryConflictWithBufferPin().
Also similar change needs to be applied to
ResolveRecoveryConflictWithVirtualXIDs().

BTW, without applying the patch, *originally*
ResolveRecoveryConflictWithBufferPin() seems to have this issue.
It enables deadlock_timeout timer so as to request for hot-standbfy
backends to check themselves for deadlocks. But if we wake up out of
ProcWaitForSignal() before deadlock_timeout is reached, the subsequent
call to ResolveRecoveryConflictWithBufferPin() also uses deadlock_timeout
again instead of "deadlock_timeout - elapsed time". So the request for
deadlock check can be delayed. Furthermore,
if ResolveRecoveryConflictWithBufferPin() always wakes up out of
ProcWaitForSignal() before deadlock_timeout is reached, the request
for deadlock check may also never happen infinitely.

Maybe we should fix the original issue at first separately from the patch.

+                               bool            maybe_log_conflict;
+                               bool            maybe_update_title;

Although it should be a matter of taste and I understand that the
"maybe" means that "that logging and changing of ps display may not
happen in this iteration" , that variables seem expressing
respectively "we should write log if the timeout for recovery conflict
expires" and "we should update title if 500ms elapsed". So they seem
*to me* better be just "log_conflict" and "update_title".

I feel the same with  "maybe_log_conflict" in ProcSleep().


I have no strong opinion about those names. So if other people also
think so, I'm ok to rename them.



+        for recovery conflicts.  This is useful in determining if recovery
+        conflicts prevent the recovery from applying WAL.

(I'm not confident on this) Isn't the sentence better be in past or
present continuous tense?


Could you tell me why you think that's better?

BTW, attached is the POC patch that implements the idea discussed
upthread;
if log_recovery_conflict_waits is enabled, the startup process reports
the log also after the recovery conflict was resolved and the startup
process
finished waiting for it. This patch needs to be applied after
v11-0002-Log-the-standby-recovery-conflict-waits.patch is applied.


Ah. I was just writing a comment about that. I haven't looked it
closer but it looks good to me.  By the way doesn't it contains a
simple fix of a comment for the base patch?


Yes, so the typo included in the base patch should be fixed when pushing it.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: Add Information during standby recovery conflicts

Reply via email to