On Sat, Jun 12, 2021 at 1:13 PM Michael Paquier <mich...@paquier.xyz> wrote:
>
> wrasse has just failed with what looks like a timing error with a
> replication slot drop:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2021-06-12%2006%3A16%3A30
>
> Here is the error:
> error running SQL: 'psql:<stdin>:1: ERROR: could not drop replication
> slot "tap_sub" on publisher: ERROR: replication slot "tap_sub" is
> active for PID 1641'
>
> It seems to me that this just lacks a poll_query_until() doing some
> slot monitoring?
>

I think it is showing a race condition in the code. In DropSubscription, we first stop the worker that is receiving the WAL, and then, over a separate connection to the publisher, we try to drop the slot, which leads to this error. The reason is that the walsender is still active, because we only wait for the WAL receiver (or apply worker) to stop. Normally, as soon as the apply worker is stopped, the walsender detects it and exits, but in this case it took some time to exit, and in the meantime we tried to drop the slot, which was still in use by the walsender.

If we want to fix this, we might want to wait till the slot becomes inactive on the publisher before trying to drop it, but I am not sure that is a good idea. In the worst case, if the user retries the operation (DROP SUBSCRIPTION), it will succeed.

-- 
With Regards,
Amit Kapila.
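
To make the "wait until the slot is no longer in use, then drop it" idea concrete, here is a rough sketch in Python. It is not PostgreSQL code and not a proposed patch. The function names, the timeout and interval defaults, and the two callables are all hypothetical; in practice `slot_is_active` might run a query against the `pg_replication_slots` view on the publisher, and `drop_slot` might issue the actual drop.

```python
import time
from typing import Callable

# A caller could implement slot_is_active() by querying the publisher, e.g.
#   SELECT active FROM pg_replication_slots WHERE slot_name = 'tap_sub';
# "active" is true while a walsender still holds the slot.


def wait_until_slot_inactive(
    slot_is_active: Callable[[], bool],
    timeout: float = 5.0,      # assumed default, purely illustrative
    interval: float = 0.1,     # assumed default, purely illustrative
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until slot_is_active() returns False or the timeout expires.

    Returns True if the slot was seen inactive, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if not slot_is_active():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def drop_slot_when_idle(
    slot_is_active: Callable[[], bool],
    drop_slot: Callable[[], None],
    **wait_kwargs,
) -> bool:
    """Call drop_slot() only after the slot is seen inactive.

    Returns True if the drop was attempted, False if we gave up waiting.
    """
    if not wait_until_slot_inactive(slot_is_active, **wait_kwargs):
        return False
    drop_slot()
    return True
```

The timeout is there because waiting forever on the publisher is probably not acceptable; on timeout the caller would fall back to reporting the error, and the user can retry as described above.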