Hi
On Tuesday, March 16, 2021 4:15 PM vignesh C <vignes...@gmail.com> wrote:
> On Tue, Mar 16, 2021 at 12:29 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
> >
> > On Tue, Mar 16, 2021 at 9:00 AM Amit Kapila <amit.kapil...@gmail.com> wrote:
> > >
> > > On Mon, Mar 15, 2021 at 6:00 PM Thomas Munro <thomas.mu...@gmail.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > This seems to be a new low frequency failure, I didn't see it mentioned already:
> > > >
> > >
> > > Thanks for reporting, I'll look into it.
> > >
> >
> > By looking at the logs [1] in the buildfarm, I think I know what is
> > going on here. After Create Subscription, the tablesync worker is
> > launched and tries to create the slot for doing the initial copy but
> > before it could finish creating the slot, we issued the Drop
> > Subscription which first stops the tablesync worker and then tried to
> > drop its slot. Now, it is quite possible that by the time Drop
> > Subscription tries to drop the tablesync slot, it is not yet created.
> > We treat this condition okay and just Logs the message. I don't think
> > this is an issue because anyway generally such a slot created on the
> > server will be dropped before we persist it but the test was checking
> > the existence of slots on server before it gets dropped. I think we
> > can avoid such a situation by preventing cancel/die interrupts while
> > creating tablesync slot.
> >
> > This is a timing issue, so I have reproduced it via debugger and
> > tested that the attached patch fixes it.
> >
>
> Thanks for the patch.
> I was able to reproduce the issue using debugger by making it wait at
> CreateReplicationSlot. After applying the patch the issue gets solved.

I really appreciate everyone's help.

To double-check, I also tested the patch with a debugger.
I also put a while loop at the top of CreateReplicationSlot to control the walsender.
Without the patch, DROP SUBSCRIPTION proceeds even while the table sync worker
is stuck behind the CreateReplicationSlot loop, and the table sync worker is
killed by the DROP SUBSCRIPTION command.
On the other hand, with the patch applied, I observed that DROP SUBSCRIPTION
hangs and waits until I release the walsender process from CreateReplicationSlot.
After that, the command drops two slots, as shown below.

NOTICE:  dropped replication slot "pg_16391_sync_16385_6940222843739406079" on publisher
NOTICE:  dropped replication slot "mysub1" on publisher
DROP SUBSCRIPTION

To me, this is the correct behavior, because at the point where I placed the
while loop and stopped the walsender, DROP SUBSCRIPTION affects both slots.

Any comments?

Best Regards,
	Takamichi Osumi
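
For reference, the behavior discussed above (a stop request that arrives during slot creation is deferred until creation has finished) can be illustrated with a small standalone simulation. This is only a sketch under stated assumptions: it does not use PostgreSQL code, and every name in it (holdoff_count, stop_pending, create_slot and so on) is hypothetical and chosen for illustration. It is not the attached patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated worker state. All names are hypothetical, not PostgreSQL's. */
static int  holdoff_count = 0;          /* > 0: stop requests are deferred */
static bool stop_pending = false;       /* someone asked the worker to stop */
static bool worker_stopped = false;     /* worker has acted on the request */
static bool slot_creation_finished = false;

static void reset_state(void)
{
    holdoff_count = 0;
    stop_pending = false;
    worker_stopped = false;
    slot_creation_finished = false;
}

/* Models the stop request sent to the worker by DROP SUBSCRIPTION. */
static void request_stop(void)
{
    stop_pending = true;
}

/* Act on a pending stop request unless stop requests are being held. */
static void check_for_stop(void)
{
    if (stop_pending && holdoff_count == 0)
        worker_stopped = true;
}

static void hold_stops(void)
{
    holdoff_count++;
}

static void resume_stops(void)
{
    assert(holdoff_count > 0);
    holdoff_count--;
}

/*
 * Slot creation as two steps, with a stop request arriving in between.
 * With protect = true, the request is only honored after creation ends.
 */
static void create_slot(bool protect)
{
    if (protect)
        hold_stops();

    /* Step 1: creation has started; the stop request arrives now. */
    request_stop();
    check_for_stop();

    /* Step 2: creation completes only if the worker is still running. */
    if (!worker_stopped)
        slot_creation_finished = true;

    if (protect)
    {
        resume_stops();
        check_for_stop();   /* the deferred request is honored here */
    }
}
```

Calling create_slot(false) models the unpatched case, where the worker stops before slot creation has finished. Calling create_slot(true) models the deferred case, where the worker still stops but only after the slot exists, so a later drop has something to remove.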