On Wed, Jan 27, 2021 at 4:58 AM Peter Smith <smithpb2...@gmail.com> wrote: > > Hi Hackers. > > As discovered elsewhere [ak0125] there is a potential race condition > in the pg_replication_origin_drop API > > The current code of pg_replication_origin_drop looks like: > ==== > roident = replorigin_by_name(name, false); > Assert(OidIsValid(roident)); > > replorigin_drop(roident, true); > ==== > > Users cannot deliberately drop a non-existent origin > (replorigin_by_name passes missing_ok = false) but there is still a > small window where concurrent processes may be able to call > replorigin_drop for the same valid roident. > > Locking within replorigin_drop guards against concurrent drops so the > 1st execution will succeed, but then the 2nd execution would give > internal cache error: elog(ERROR, "cache lookup failed for replication > origin with oid %u", roident); > > Some ideas to fix this include: > 1. Do nothing except write a comment about this in the code. The > internal ERROR is not ideal for a user API there is no great harm > done. > 2. Change the behavior of replorigin_drop to be like > replorigin_drop_IF_EXISTS, so the 2nd execution of this race would > silently do nothing when it finds the roident is already gone. > 3. Same as 2, but make the NOP behavior more explicit by introducing a > new "missing_ok" parameter for replorigin_drop. >
How about if we call replorigin_by_name() inside replorigin_drop after acquiring the lock? Wouldn't that close this race condition? We are doing something similar for pg_replication_origin_advance(). -- With Regards, Amit Kapila.