We got more information about this issue. There is one backend process
still present in the beentry array whose st_changecount has an odd value,
but the process itself is long gone/terminated. That means when the
process was killed/terminated, its entry was never cleaned out of the
beentry list. There seems to be a shutdown hook that is supposed to reset
the beentry when a process exits or is killed; somehow it was not invoked?
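
For reference, here is a simplified sketch of the write-side protocol around
these shared-memory updates (the macro names below are stand-ins for the
pgstat_increment_changecount before/after macros, and memory barriers are
omitted; this is not the actual source). The invariant is that st_changecount
is even whenever no update is in flight, so an entry stuck at an odd value
means some writer started an update and never finished it:

/* Illustrative sketch only -- not the actual PostgreSQL source. */
#define BEGIN_STATUS_UPDATE(beentry) \
    ((beentry)->st_changecount++)   /* counter becomes odd: update in progress */

#define END_STATUS_UPDATE(beentry) \
    ((beentry)->st_changecount++)   /* counter becomes even again: update done */

/*
 * The backend's exit callback is expected to reset its entry inside the
 * same bracketing.  If the process dies (or errors out) between the two
 * increments and nothing ever repairs the counter, it stays odd forever.
 */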

These are some of the fields of the corrupted beentry entry that is still
hanging around:

st_changecount = 1407,
st_procpid = 0,
st_backendType = B_BACKEND,
st_proc_start_timestamp = 610236763633421,
st_xact_start_timestamp = 0,
st_clienthostname = 0x9000023d480 "",
st_ssl = 1 '\001',
st_sslstatus = 0x90000c60f80,
st_state = STATE_IDLEINTRANSACTION_ABORTED,
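
The reader side (roughly what pgstat_read_current_status does) keeps
retrying until it sees a stable, even counter before and after copying the
entry, so an entry abandoned with an odd value like 1407 makes every reader
spin forever. A minimal sketch, with barriers and CHECK_FOR_INTERRUPTS
omitted and the struct reduced to the fields that matter here:

#include <string.h>

typedef struct BackendStatusSketch
{
    volatile int st_changecount;   /* odd while a writer is mid-update */
    int          st_procpid;
    /* ... remaining st_* fields ... */
} BackendStatusSketch;

static void
copy_entry_consistently(volatile BackendStatusSketch *beentry,
                        BackendStatusSketch *localentry)
{
    for (;;)
    {
        int before = beentry->st_changecount;

        memcpy(localentry, (const void *) beentry, sizeof(*localentry));

        int after = beentry->st_changecount;

        /*
         * Accept the copy only if the counter did not move and is even.
         * An entry left with an odd counter (1407 in the dump above)
         * never satisfies this, so the loop never exits.
         */
        if (before == after && (before & 1) == 0)
            break;
    }
}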

On Thu, May 9, 2019 at 1:00 PM Tom Lane <t...@sss.pgh.pa.us> wrote:

> Jeremy Schneider <schnj...@amazon.com> writes:
> > Seems to me that at a minimum, this loop shouldn't go on forever. Even
> > having an arbitrary, crazy high, hard-coded number of attempts before
> > failure (like a million) would be better than spinning on the CPU
> > forever - which is what we are seeing.
>
> I don't think it's the readers' fault.  The problem is that the
> writer is violating the protocol.  If we put an upper limit on
> the number of spin cycles on the reader side, we'll just be creating
> a new failure mode when a writer gets swapped out at the wrong moment.
>
> IMO we need to (a) get the failure-prone code out of the critical
> section, and then (b) fix the pgstat_increment_changecount macros
> so that the critical sections around these shmem changes really are
> critical sections (ie bump CritSectionCount).  That way, if somebody
> makes the same mistake again, at least there'll be a pretty obvious
> failure rather than a lot of stuck readers.
>
>                         regards, tom lane
>


-- 
-------------------------------------
Thanks
Neeraj Kumar,
+1  (206) 427-7267
