On 04/04/2017 06:52 PM, Robert Haas wrote:
On Mon, Apr 3, 2017 at 6:08 AM, Kuntal Ghosh <kuntalghosh.2...@gmail.com> wrote:
On Fri, Mar 31, 2017 at 6:50 PM, Robert Haas <robertmh...@gmail.com> wrote:
On Thu, Mar 30, 2017 at 4:35 PM, Kuntal Ghosh
<kuntalghosh.2...@gmail.com> wrote:
2. the server restarts automatically, initialize
BackgroundWorkerData->parallel_register_count and
BackgroundWorkerData->parallel_terminate_count in the shared memory.
After that, it calls ForgetBackgroundWorker and it increments
parallel_terminate_count.

Hmm.  So this seems like the root of the problem.  Presumably those
things need to be reset AFTER forgetting any background workers from
before the crash.

IMHO, the fix would be not to increase the terminated parallel worker
count whenever ForgetBackgroundWorker is called due to a bgworker
crash. I've attached a patch for the same. PFA.

While I'm not opposed to that approach, I don't think this is a good
way to implement it.  If you want to pass an explicit flag to
ForgetBackgroundWorker telling it whether or not it should performing
the increment, fine.  But with what you've got here, you're
essentially relying on "spooky action at a distance".  It would be
easy for future code changes to break this, not realizing that
somebody's got a hard dependency on 0 having a specific meaning.


I'm probably missing something, but I don't quite understand how these values actually survive the crash. I mean, what I observed is OOM followed by a restart, so shouldn't BackgroundWorkerShmemInit() simply reset the values back to 0? Or do we call ForgetBackgroundWorker() after the crash for some reason?


In any case, the comment right before BackgroundWorkerArray says this:

 * These counters can of course overflow, but it's not important here
 * since the subtraction will still give the right number.

which means that this assert

+       Assert(BackgroundWorkerData->parallel_register_count >=
+               BackgroundWorkerData->parallel_terminate_count);

is outright broken, just like any other attempts to rely on simple comparisons of these two counters, no?

BTW, if this isn't on the open items list, it should be.  It's
presumably the fault of b460f5d6693103076dc554aa7cbb96e1e53074f9.


Unrelated, but perhaps we should also add this to open items:

https://www.postgresql.org/message-id/flat/57bbc52c-5d40-bb80-2992-7e1027d0b67c%402ndquadrant.com

(although that's more a 9.6 material).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to