Hey Thomas,

After a few months we started having this issue again, so we reverted the work_mem parameter to 600MB instead of 2GB. But the issue is still there: a query hit a segmentation fault, the DB went into recovery mode, and our app was read-only for a few minutes.
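For reference, the revert itself was roughly this (just a sketch, assuming the value is managed with ALTER SYSTEM rather than edited directly in postgresql.conf):

    -- sketch of the work_mem revert; work_mem takes effect on reload,
    -- no restart needed for this parameter
    ALTER SYSTEM SET work_mem = '600MB';
    SELECT pg_reload_conf();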
I understand we can increase max_connections so we can have many more segments. My question is: is there a way to see how many segments we have actually reached? Currently we have max_connections = 220, so by your formula of 64 + 2 * 220 we have about 500 shared segments (see the quick query at the bottom of this mail). We would like to increase max_connections to 300 or 400, but it would be great to know beforehand whether that will solve the issue, since changing it requires a restart of the service.

I know you were also talking about redesigning this part of PostgreSQL. Do you know if anything has changed in any of the newer versions after 11.5?

Thanks a lot,
Nicola

On Thu, Sep 12, 2019 at 01:01 Thomas Munro <thomas.mu...@gmail.com> wrote:

> On Wed, Sep 11, 2019 at 11:20 PM Nicola Contu <nicola.co...@gmail.com>
> wrote:
> > If the error persist I will try to revert the work_mem.
> > Thanks a lot
>
> Hi Nicola,
>
> It's hard to say exactly what the cause of the problem is in your case
> and how to avoid it, without knowing what your query plans look like.
> PostgreSQL allows 64 + 2 * max_connections segments to exist at a time,
> and it needs a number of them that depends on work_mem (in the case of
> Parallel Hash Join and Parallel Bitmap Index Scan), and also depends
> on the number of Gather nodes that appear in the plan, which in some
> unusual cases can result from partitioning.
>
> I've seen people reaching this error by running a lot of parallel
> queries concurrently. If that's the cause, then you can definitely
> get some relief by turning work_mem down, or by turning
> max_connections up (even though you don't want to allow more
> connections -- because it influences the formula for deciding on the
> DSM segment limit). We should probably adjust some of the internal
> constants to give us more slots, to avoid that problem, as discussed
> here:
>
> https://www.postgresql.org/message-id/flat/CA%2BhUKGL6H2BpGbiF7Lj6QiTjTGyTLW_vLR%3DSn2tEBeTcYXiMKw%40mail.gmail.com
>
> I've also seen people reaching this error by somehow coming up with
> plans that had a very large number of Gather nodes in them,
> corresponding to partitions; that's probably a bad plan (it'd
> presumably be better to terminate parallelism higher up in the plan,
> but these plans do seem to exist in the wild; I don't recall exactly
> why). I think we need a bit of a redesign so that if there are
> multiple Gather nodes, they share the same main DSM segment, instead
> of blowing through this limit.
>
> --
> Thomas Munro
> https://enterprisedb.com
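P.S. In case it helps, a minimal sketch of how we're computing the limit from the current settings, assuming the 64 + 2 * max_connections formula quoted above (dsm_slot_limit is just a label here, not a real setting):

    -- show the relevant settings and the implied DSM slot limit
    SELECT current_setting('max_connections')::int          AS max_connections,
           current_setting('work_mem')                      AS work_mem,
           64 + 2 * current_setting('max_connections')::int AS dsm_slot_limit;

    -- raising the limit would then be something like this
    -- (max_connections requires a restart to take effect):
    ALTER SYSTEM SET max_connections = 400;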