On Wed, Jul 24, 2019 at 10:11 AM Tom Lane <t...@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.mu...@gmail.com> writes:
> > I suspect that the only thing implicating parallelism in this failure
> > is that parallel leaders happen to print out that message if the
> > postmaster dies while they are waiting for workers; most other places
> > (probably every other backend in your cluster) just quietly exit.
> > That tells us something about what's happening, but on its own doesn't
> > tell us that parallelism plays an important role in the failure mode.
>
> I agree that there's little evidence implicating parallelism directly.
> The reason I'm suspicious about a possible OOM kill is that parallel
> queries would appear to the OOM killer to be eating more resources
> than the same workload non-parallel, so that we might be at more
> hazard of getting OOM'd just because of that.
>
> A different theory is that there's some hard-to-hit bug in the
> postmaster's processing of parallel workers that doesn't apply to
> regular backends. I've looked for one in a desultory way but not
> really focused on it.
>
> In any case, the evidence from the buildfarm is pretty clear that
> there is *some* connection. We've seen a lot of recent failures
> involving "postmaster exited during a parallel transaction", while
> the number of postmaster failures not involving that is epsilon.

I don't have access to the build farm history in a searchable format
(I'll go and ask for that). Do you have an example to hand? Is this
failure always happening on Linux?

--
Thomas Munro
https://enterprisedb.com