On Fri, Mar 31, 2017 at 12:32 AM, Thomas Munro
<thomas.mu...@enterprisedb.com> wrote:
> On Fri, Mar 31, 2017 at 7:38 AM, Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>> Hi,
>>
>> While doing some benchmarking, I've ran into a fairly strange issue with OOM
>> breaking LaunchParallelWorkers() after the restart. What I see happening is
>> this:
>>
>> 1) a query is executed, and at the end of LaunchParallelWorkers we get
>>
>> nworkers=8 nworkers_launched=8
>>
>> 2) the query does a Hash Aggregate, but ends up eating much more memory due
>> to n_distinct underestimate (see [1] from 2015 for details), and gets killed
>> by OOM
>>
>> 3) the server restarts, the query is executed again, but this time we get in
>> LaunchParallelWorkers
>>
>> nworkers=8 nworkers_launched=0
>>
>> There's nothing else running on the server, and there definitely should be
>> free parallel workers.
>>
>> 4) The query gets killed again, and on the next execution we get
>>
>> nworkers=8 nworkers_launched=8
>>
>> again, although not always. I wonder whether the exact impact depends on OOM
>> killing the leader or worker, for example.
>
> I don't know what's going on but I think I have seen this once or
> twice myself while hacking on test code that crashed. I wonder if the
> DSM_CREATE_NULL_IF_MAXSEGMENTS case could be being triggered because
> the DSM control is somehow confused?
>

I think I've run into the same problem while working on parallelizing
plans containing InitPlans. You can reproduce that scenario with the
following steps:

1. Put an Assert(0) in ParallelQueryMain(), start the server, and execute
   any parallel query. In LaunchParallelWorkers, you can see

       nworkers = n nworkers_launched = n (n > 0)

   But all the workers will crash because of the assert statement.

2. The server restarts automatically and initializes
   BackgroundWorkerData->parallel_register_count and
   BackgroundWorkerData->parallel_terminate_count in shared memory.
   After that, it calls ForgetBackgroundWorker, which increments
   parallel_terminate_count. In LaunchParallelWorkers, we have the
   following condition:

       if ((BackgroundWorkerData->parallel_register_count -
            BackgroundWorkerData->parallel_terminate_count) >=
           max_parallel_workers)
           DO NOT launch any parallel worker.

   Hence, nworkers = n nworkers_launched = 0.

I thought the parallel workers were crashing because of my own stupid
mistake, so I assumed this was supposed to happen. Sorry for that.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
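
For illustration only, here is a minimal standalone sketch of the arithmetic in step 2. It is not the PostgreSQL source. It assumes that both counters are unsigned 32-bit integers, which the message above does not state, and the names WorkerCounters, limit_reached and counters_after_crash_restart are hypothetical. Under that assumption, incrementing parallel_terminate_count after both counters were reset to zero makes the subtraction wrap around to a very large value, so the ">= max_parallel_workers" test succeeds and no worker is launched.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two counters kept in shared memory.
 * Assumption: both are unsigned 32-bit values. */
typedef struct
{
    uint32_t parallel_register_count;
    uint32_t parallel_terminate_count;
} WorkerCounters;

/* Mirrors the condition quoted in step 2: true means "do not launch". */
bool
limit_reached(const WorkerCounters *c, uint32_t max_parallel_workers)
{
    return (c->parallel_register_count - c->parallel_terminate_count)
        >= max_parallel_workers;
}

/* Simulates step 2: both counters are reset to zero on restart, then the
 * terminate counter is bumped once per stale worker slot being forgotten. */
WorkerCounters
counters_after_crash_restart(uint32_t stale_slots)
{
    WorkerCounters c = {0, 0};

    for (uint32_t i = 0; i < stale_slots; i++)
        c.parallel_terminate_count++;

    return c;
}
```

With 8 stale slots and max_parallel_workers = 8, the sketch computes 0 - 8 in unsigned arithmetic, which wraps to 4294967288, so limit_reached() returns true even though no parallel worker is actually running.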