Re: [sqlsmith] Unpinning error in parallel worker
On Wed, Mar 29, 2017, at 10:50, Robert Haas wrote:
> On Wed, Mar 29, 2017 at 1:31 AM, Thomas Munro wrote:
> > I considered whether the error message could be improved but it
> > matches the message for an existing similar case (where you try to
> > attach to an unknown handle).
>
> Ugh, OK. I committed this, but I think this whole file needs a visit
> from the message style police.

I ran into an issue that matches the symptoms of this bug twice in the past week, on a production cluster running 10.3 (Ubuntu 10.3-1.pgdg16.04+1). Hundreds of queries were stuck with a wait_event of DynamicSharedMemoryControlLock, and pg_terminate_backend did not terminate them. In the log:

> FATAL: cannot unpin a segment that is not pinned

I don't have a backtrace yet, but I will provide one if/when the issue happens again.

Jonathan
Re: [sqlsmith] Unpinning error in parallel worker
On Tue, Apr 17, 2018, at 18:38, Thomas Munro wrote:
> I don't have any theories about how that could be going wrong right
> now, but I'm looking into it.

Thank you!

> > I don't have a backtrace yet, but I will provide them if/when the issue
> > happens again.
>
> Thanks, that would be much appreciated, as would any clues about what
> workload you're running. Do you know what the query plan looks like
> for the queries that crashed?

Yep, I think I know approximately what it looked like; I've attached a lightly redacted plan. All of the hung queries were running some variant of this plan as far as I can tell.

Thanks again!

Jonathan

                                   QUERY PLAN
------------------------------------------------------------------------------
 Sort  (cost=1896.13..1896.13 rows=2 width=4336) (actual time=29.396..29.447 rows=410 loops=1)
   Sort Key: assets."position"
   Sort Method: quicksort  Memory: 1671kB
   CTE assets
     ->  HashAggregate  (cost=1861.88..1863.61 rows=173 width=864) (actual time=23.328..23.883 rows=1242 loops=1)
           Group Key: assets_7.id, assets_7.name, assets_7.description, assets_7.asset_key, assets_7.type, assets_7.created_at, assets_7.updated_at, assets_7.deleted_at, assets_7.redacted_id, assets_7.section_id, assets_7."position", assets_7.creator_id, assets_7.modifier_id, assets_7.data, assets_7.thumbnail_override, assets_7.availability_start, assets_7.availability_end, assets_7.availability
           ->  Append  (cost=853.88..1854.10 rows=173 width=864) (actual time=10.871..22.002 rows=1242 loops=1)
                 ->  CTE Scan on assets assets_7  (cost=853.88..861.60 rows=2 width=864) (actual time=10.870..12.339 rows=1242 loops=1)
                       Filter: (redacted_id = 32456)
                       CTE assets
                         ->  HashAggregate  (cost=850.45..853.88 rows=343 width=864) (actual time=10.867..11.392 rows=1242 loops=1)
                               Group Key: assets_1.id, assets_1.name, assets_1.description, assets_1.asset_key, assets_1.type, assets_1.created_at, assets_1.updated_at, assets_1.deleted_at, assets_1.redacted_id, assets_1.section_id, assets_1."position", assets_1.creator_id, assets_1.modifier_id, assets_1.data, assets_1.thumbnail_override, assets_1.availability_start, assets_1.availability_end, assets_1.availability
                               ->  Append  (cost=1.41..835.02 rows=343 width=864) (actual time=0.380..9.456 rows=1261 loops=1)
                                     ->  Nested Loop  (cost=1.41..174.78 rows=1 width=249) (actual time=0.046..0.046 rows=0 loops=1)
                                           ->  Nested Loop  (cost=0.98..5.71 rows=1 width=4) (actual time=0.046..0.046 rows=0 loops=1)
                                                 ->  Nested Loop  (cost=0.70..5.14 rows=1 width=8) (actual time=0.046..0.046 rows=0 loops=1)
                                                       ->  Index Scan using index_user_id_on_permissible_type on user_permissions  (cost=0.42..2.64 rows=1 width=4) (actual time=0.045..0.045 rows=0 loops=1)
                                                             Index Cond: ((user_id = 98789) AND ((permissible_type)::text = 'Organization'::text))
                                                             Filter: ((permission_level)::text = 'guest'::text)
                                                       ->  Index Only Scan using organizations_pkey on organizations  (cost=0.28..2.50 rows=1 width=4) (never executed)
                                                             Index Cond: (id = user_permissions.permissible_id)
                                                             Heap Fetches: 0
                                                 ->  Index Scan using index_redacteds_on_organization_id on redacteds redacteds_1  (cost=0.28..0.43 rows=1 width=8) (never executed)
                                                       Index Cond: (organization_id = organizations.id)
Re: [sqlsmith] Unpinning error in parallel worker
On Tue, Apr 17, 2018, at 19:31, Thomas Munro wrote:
> On Wed, Apr 18, 2018 at 11:01 AM, Jonathan Rudenberg wrote:
> > On Tue, Apr 17, 2018, at 18:38, Thomas Munro wrote:
> >> Thanks, that would be much appreciated, as would any clues about what
> >> workload you're running. Do you know what the query plan looks like
> >> for the queries that crashed?
> >
> > Yep, I think I know approximately what it looked like, I've attached a
> > lightly redacted plan. All of the hung queries were running some variant of
> > this plan as far as I can tell.
>
> Hmm, that isn't a parallel query. I was expecting to see "Gather" and
> "Parallel" in there.

Oops, I'm really sorry about that. I only have the first part of the hung queries, and there are a few variants. Here's one that's parallel.

                                   QUERY PLAN
------------------------------------------------------------------------------
 Sort  (cost=13229.50..13229.51 rows=2 width=4336) (actual time=23.486..23.515 rows=316 loops=1)
   Sort Key: assets."position"
   Sort Method: quicksort  Memory: 1293kB
   CTE assets
     ->  HashAggregate  (cost=13060.31..13075.88 rows=1557 width=864) (actual time=19.064..19.601 rows=1242 loops=1)
           Group Key: assets_7.id, assets_7.name, assets_7.description, assets_7.asset_key, assets_7.type, assets_7.created_at, assets_7.updated_at, assets_7.deleted_at, assets_7.redacted_id, assets_7.section_id, assets_7."position", assets_7.creator_id, assets_7.modifier_id, assets_7.data, assets_7.thumbnail_override, assets_7.availability_start, assets_7.availability_end, assets_7.availability
           ->  Append  (cost=6383.09..12990.25 rows=1557 width=864) (actual time=6.781..17.848 rows=1242 loops=1)
                 ->  CTE Scan on assets assets_7  (cost=6383.09..6451.83 rows=15 width=864) (actual time=6.780..8.180 rows=1242 loops=1)
                       Filter: (redacted_id = 32456)
                       CTE assets
                         ->  HashAggregate  (cost=6352.54..6383.09 rows=3055 width=864) (actual time=6.775..7.270 rows=1242 loops=1)
                               Group Key: assets_1.id, assets_1.name, assets_1.description, assets_1.asset_key, assets_1.type, assets_1.created_at, assets_1.updated_at, assets_1.deleted_at, assets_1.redacted_id, assets_1.section_id, assets_1."position", assets_1.creator_id, assets_1.modifier_id, assets_1.data, assets_1.thumbnail_override, assets_1.availability_start, assets_1.availability_end, assets_1.availability
                               ->  Append  (cost=1.41..6215.07 rows=3055 width=864) (actual time=0.137..5.495 rows=1261 loops=1)
                                     ->  Nested Loop  (cost=1.41..1514.55 rows=1 width=248) (actual time=0.026..0.026 rows=0 loops=1)
                                           ->  Nested Loop  (cost=0.98..5.71 rows=1 width=4) (actual time=0.026..0.026 rows=0 loops=1)
                                                 ->  Nested Loop  (cost=0.70..5.14 rows=1 width=8) (actual time=0.026..0.026 rows=0 loops=1)
                                                       ->  Index Scan using index_user_id_on_permissible_type on user_permissions  (cost=0.42..2.64 rows=1 width=4) (actual time=0.025..0.025 rows=0 loops=1)
                                                             Index Cond: ((user_id = 98789) AND ((permissible_type)::text = 'Organization'::text))
                                                             Filter: ((permission_level)::text = 'guest'::text)
                                                       ->  Index Only Scan using organizations_pkey on organizations  (cost=0.28..2.50 rows=1 width=4) (never executed)
                                                             Index Cond: (id = user_permissions.permissible_id)
Re: [sqlsmith] Unpinning error in parallel worker
On Fri, Apr 20, 2018, at 00:42, Thomas Munro wrote:
> On Wed, Apr 18, 2018 at 11:43 AM, Jonathan Rudenberg wrote:
> > On Tue, Apr 17, 2018, at 19:31, Thomas Munro wrote:
> >> On Wed, Apr 18, 2018 at 11:01 AM, Jonathan Rudenberg wrote:
> >> > Yep, I think I know approximately what it looked like, I've attached a
> >> > lightly redacted plan. All of the hung queries were running some variant
> >> > of this plan as far as I can tell.
> >>
> >> Hmm, that isn't a parallel query. I was expecting to see "Gather" and
> >> "Parallel" in there.
> >
> > Oops, I'm really sorry about that. I only have the first part of the hung
> > queries, and there are a few variants. Here's one that's parallel.
>
> I spent some time trying to reproduce this failure without any luck,
> using query plans similar to your Gather plan fragment, and using some
> test harness code for the allocator stuff in isolation.
>
> I had an idea that (1) freeing a large object that releases and unpins
> a segment in one backend and then (2) freeing it again in another
> backend (illegally) might produce this effect with sufficiently bad
> luck. I'm still trying to reproduce that without any success, but I
> get other kinds of failures which I think you'd be seeing too if that
> hunch were right. Still looking...

Thanks for investigating! We have monitoring in place and a plan to grab stack traces if this happens again. Is there anything else that would be useful for us to try to get, in addition to a stack trace from the process handling the stuck query?
Re: [sqlsmith] Unpinning error in parallel worker
On Fri, Apr 20, 2018, at 00:42, Thomas Munro wrote:
> On Wed, Apr 18, 2018 at 11:43 AM, Jonathan Rudenberg wrote:
> > On Tue, Apr 17, 2018, at 19:31, Thomas Munro wrote:
> >> On Wed, Apr 18, 2018 at 11:01 AM, Jonathan Rudenberg wrote:
> >> > Yep, I think I know approximately what it looked like, I've attached a
> >> > lightly redacted plan. All of the hung queries were running some variant
> >> > of this plan as far as I can tell.
> >>
> >> Hmm, that isn't a parallel query. I was expecting to see "Gather" and
> >> "Parallel" in there.
> >
> > Oops, I'm really sorry about that. I only have the first part of the hung
> > queries, and there are a few variants. Here's one that's parallel.
>
> I spent some time trying to reproduce this failure without any luck,
> using query plans similar to your Gather plan fragment, and using some
> test harness code for the allocator stuff in isolation.
>
> I had an idea that (1) freeing a large object that releases and unpins
> a segment in one backend and then (2) freeing it again in another
> backend (illegally) might produce this effect with sufficiently bad
> luck. I'm still trying to reproduce that without any success, but I
> get other kinds of failures which I think you'd be seeing too if that
> hunch were right. Still looking...

This issue happened again in production. Here are the stack traces from three of the backends we grabbed before nuking the >400 hanging backends.
process: /proc/24194/mem
thread: 0x7f46263648c0, lwp: 24194, type: 1
#0  0x7f4625fdc827 in do_futex_wait+55() at /lib/x86_64-linux-gnu/libpthread.so.0 at futex-internal.h:205
#1  0x7f4625fdc8d4 in __new_sem_wait_slow+84() at /lib/x86_64-linux-gnu/libpthread.so.0 at sem_waitcommon.c:181
#2  0x7f4625fdc97a in __new_sem_wait+58() at /lib/x86_64-linux-gnu/libpthread.so.0 at sem_wait.c:29
#3  0x55fccb8c8832 in PGSemaphoreLock+34() at /usr/lib/postgresql/10/bin/postgres at pg_sema.c:310
#4  0x55fccb93b21c in LWLockAcquire+188() at /usr/lib/postgresql/10/bin/postgres at lwlock.c:1233
#5  0x55fccb925fa7 in dsm_create+151() at /usr/lib/postgresql/10/bin/postgres at dsm.c:493
#6  0x55fccb6f2a6f in InitializeParallelDSM+511() at /usr/lib/postgresql/10/bin/postgres at parallel.c:266
#7  0x55fccb815f27 in ExecInitParallelPlan+807() at /usr/lib/postgresql/10/bin/postgres at execParallel.c:470
#8  0x55fccb825a50 in ExecGather+1168() at /usr/lib/postgresql/10/bin/postgres at nodeGather.c:158
#9  0x55fccb81dbd3 in ExecAppend+83() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#10 0x55fccb81df2c in fetch_input_tuple+172() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#11 0x55fccb820293 in ExecAgg+1235() at /usr/lib/postgresql/10/bin/postgres at nodeAgg.c:2539
#12 0x55fccb835235 in CteScanNext+229() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#13 0x55fccb817f0d in ExecScan+557() at /usr/lib/postgresql/10/bin/postgres at execScan.c:97
#14 0x55fccb832155 in ExecNestLoop+165() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#15 0x55fccb832155 in ExecNestLoop+165() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#16 0x55fccb832155 in ExecNestLoop+165() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#17 0x55fccb81df2c in fetch_input_tuple+172() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#18 0x55fccb8201eb in ExecAgg+1067() at /usr/lib/postgresql/10/bin/postgres at nodeAgg.c:2347
#19 0x55fccb811dfb in standard_ExecutorRun+379() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#20 0x55fccb94d19b in PortalRunSelect+507() at /usr/lib/postgresql/10/bin/postgres at pquery.c:932
#21 0x55fccb94e798 in PortalRun+824() at /usr/lib/postgresql/10/bin/postgres at pquery.c:773
#22 0x55fccb94b329 in PostgresMain+3721() at /usr/lib/postgresql/10/bin/postgres at postgres.c:1984
#23 0x55fccb6840d9 in ServerLoop+3412() at /usr/lib/postgresql/10/bin/postgres at postmaster.c:4405
#24 0x55fccb8da78b in PostmasterMain+4235() at /usr/lib/postgresql/10/bin/postgres at postmaster.c:1363
#25 0x55fccb6854d5 in main+2165() at /usr/lib/postgresql/10/bin/postgres at main.c:228
#26 0x7f4623c07830 in __libc_start_main+240() at /lib/x86_64-linux-gnu/libc.so.6 at libc-start.c:291
#27 0x55fccb685559 in _start!+41() at /usr/lib/postgresql/10/bin/postgres

process: /proc/24231/mem
thread: 0x7f46263648c0, lwp: 24231, type: 1
#0  0x7f4625fdc827 in do_futex_wait+55() at /lib/x86_64-linux-gnu/libpthread.so.0 at futex-internal.h:205
#1  0x7f4625fdc8d4 in __new_sem_wait_slow+84() at /lib/x86_64-linux-gnu/libpthread.so.0 at sem_waitcommon.c:181
#2  0x7f4625fdc97a in __new_sem_wait+58() at /lib/x86_64-linux-gnu/libpthread.
Re: [sqlsmith] Unpinning error in parallel worker
On Tue, Apr 24, 2018, at 16:06, Thomas Munro wrote:
> On Wed, Apr 25, 2018 at 2:21 AM, Jonathan Rudenberg wrote:
> > This issue happened again in production, here are the stack traces from
> > three we grabbed before nuking the >400 hanging backends.
> >
> > [...]
> > #4 0x55fccb93b21c in LWLockAcquire+188() at
> > /usr/lib/postgresql/10/bin/postgres at lwlock.c:1233
> > #5 0x55fccb925fa7 in dsm_create+151() at
> > /usr/lib/postgresql/10/bin/postgres at dsm.c:493
> > #6 0x55fccb6f2a6f in InitializeParallelDSM+511() at
> > /usr/lib/postgresql/10/bin/postgres at parallel.c:266
> > [...]
>
> Thank you. These stacks are all blocked trying to acquire
> DynamicSharedMemoryControlLock. My theory is that they can't because
> one backend -- the one that emitted the error "FATAL: cannot unpin a
> segment that is not pinned" -- is deadlocked against itself. After
> emitting that error, you can see from Andreas's "seabisquit" stack that
> shmem_exit() runs dsm_backend_shutdown(), which runs dsm_detach(),
> which tries to acquire DynamicSharedMemoryControlLock again, even
> though we already hold it at that point.
>
> I'll write a patch to fix that unpleasant symptom. While holding
> DynamicSharedMemoryControlLock we shouldn't raise any errors without
> releasing it first, because the error handling path will try to
> acquire it again. That's a horrible failure mode, as you have
> discovered.
>
> But that isn't the root problem: we shouldn't be raising that error,
> and I'd love to see the stack of the one process that did that and
> then self-deadlocked. I will have another go at trying to reproduce
> it here today.

Thanks for the update! We have turned off parallel queries (using max_parallel_workers_per_gather = 0) for now, as the production impact of this bug is unfortunately quite problematic. What will this failure look like with the patch you've proposed?

Thanks again,

Jonathan