On Sun, Mar 15, 2020 at 11:27:52PM -0500, Justin Pryzby wrote: > On Tue, Jan 28, 2020 at 12:36:41PM +0100, Julien Rouhaud wrote: > > So, AFAICT the LockHashPartitionLockByProc is required when > > iterating/modifying lockGroupMembers or lockGroupLink, but just > > getting the leader pid should be safe. > > This still seems unsafe: > > git show -U11 -w --patience b025f32e0b src/backend/utils/adt/pgstatfuncs.c > + /* > + * If a PGPROC entry was retrieved, display wait > events and lock > + * group leader information if any. To avoid extra > overhead, no > + * extra lock is being held, so there is no guarantee > of > + * consistency across multiple rows. > + */ > ... > + PGPROC *leader; > ... > + leader = proc->lockGroupLeader; > + if (leader) > + { > # does something guarantee that leader doesn't change ? > + values[29] = > Int32GetDatum(leader->pid); > + nulls[29] = false; > } > > Michael seems to have raised the issue: > > On Thu, Jan 16, 2020 at 04:49:12PM +0900, Michael Paquier wrote: > > And actually, the way you are looking at the leader's PID is visibly > > incorrect and inconsistent because the patch takes no shared LWLock on > > the leader using LockHashPartitionLockByProc() followed by > > LWLockAcquire(), no? That's incorrect because it could be perfectly > > possible to crash with this code between the moment you check if > > lockGroupLeader is NULL and the moment you look at > > lockGroupLeader->pid if a process is being stopped in-between and > > removes itself from a lock group in ProcKill(). > > But I don't see how it was addressed ? > > I read this: > > src/backend/storage/lmgr/lock.c: * completely valid. We cannot > safely dereference another backend's > src/backend/storage/lmgr/lock.c- * lockGroupLeader field without > holding all lock partition locks, and > src/backend/storage/lmgr/lock.c- * it's not worth that.) > > I think if you do: > |LWLockAcquire(&proc->backendLock, LW_SHARED); > ..then it's at least *safe* to access leader->pid, but it may be inconsistent > unless you also call LockHashPartitionLockByProc. > > I wasn't able to produce a crash, so maybe I missed something.
I think I see. Julien's v3 patch did this: https://www.postgresql.org/message-id/attachment/106429/pgsa_leader_pid-v3.diff + if (proc->lockGroupLeader) + values[29] = Int32GetDatum(proc->lockGroupLeader->pid); ..which is racy because a proc with a leader might die and be replaced by another proc without a leader between 1 and 2. But the code since v4 does: + leader = proc->lockGroupLeader; + if (leader) + values[29] = Int32GetDatum(leader->pid); ..which is safe because PROCs are allocated in shared memory, so leader is for sure a non-NULL pointer to a PROC. leader->pid may be read inconsistently, which is what the comment says: "no extra lock is being held". -- Justin