We have more information about this issue. One backend process is still present in the beentry array with an odd st_changecount value, but that process is long gone/terminated. This means that when the process was killed/terminated, its entry was not cleaned out of the beentry list. There appears to be a shutdown hook that cleans up the beentry when a process is killed/terminated; somehow it was never invoked?
These are some of the fields of the corrupted beentry entry that is still hanging around:

    st_changecount = 1407,
    st_procpid = 0,
    st_backendType = B_BACKEND,
    st_proc_start_timestamp = 610236763633421,
    st_xact_start_timestamp = 0,
    st_clienthostname = 0x9000023d480 "",
    st_ssl = 1 '\001',
    st_sslstatus = 0x90000c60f80,
    st_state = STATE_IDLEINTRANSACTION_ABORTED

On Thu, May 9, 2019 at 1:00 PM Tom Lane <t...@sss.pgh.pa.us> wrote:
> Jeremy Schneider <schnj...@amazon.com> writes:
> > Seems to me that at a minimum, this loop shouldn't go on forever. Even
> > having an arbitrary, crazy high, hard-coded number of attempts before
> > failure (like a million) would be better than spinning on the CPU
> > forever - which is what we are seeing.
>
> I don't think it's the readers' fault.  The problem is that the
> writer is violating the protocol.  If we put an upper limit on
> the number of spin cycles on the reader side, we'll just be creating
> a new failure mode when a writer gets swapped out at the wrong moment.
>
> IMO we need to (a) get the failure-prone code out of the critical
> section, and then (b) fix the pgstat_increment_changecount macros
> so that the critical sections around these shmem changes really are
> critical sections (ie bump CritSectionCount).  That way, if somebody
> makes the same mistake again, at least there'll be a pretty obvious
> failure rather than a lot of stuck readers.
>
> regards, tom lane

--
Thanks
Neeraj Kumar,
+1 (206) 427-7267