On Sun, Mar 20, 2022 at 4:53 PM Tomas Vondra
<tomas.von...@enterprisedb.com> wrote:
>
> On 3/20/22 07:23, Amit Kapila wrote:
> > On Sun, Mar 20, 2022 at 8:41 AM Amit Kapila <amit.kapil...@gmail.com> wrote:
> >>
> >> On Fri, Mar 18, 2022 at 10:42 PM Tomas Vondra
> >> <tomas.von...@enterprisedb.com> wrote:
> >>
> >>> So the question is why those two sync workers never complete - I guess
> >>> there's some sort of lock wait (deadlock?) or infinite loop.
> >>>
> >>
> >> It would be a bit tricky to reproduce this even if the above theory is
> >> correct but I'll try it today or tomorrow.
> >>
> >
> > I am able to reproduce it with the help of a debugger. First, I added
> > a LOG message and some while (true) loops to debug the sync and apply
> > workers. Test setup:
> >
> > Node-1:
> > create table t1(c1 int);
> > create table t2(c1 int);
> > insert into t1 values(1);
> > create publication pub1 for table t1;
> > create publication pub2;
> >
> > Node-2:
> > change max_sync_workers_per_subscription to 1 in postgresql.conf
> > create table t1(c1 int);
> > create table t2(c1 int);
> > create subscription sub1 connection 'dbname = postgres' publication pub1;
> >
> > Up to this point, just let the debuggers in both workers continue.
> >
> > Node-1:
> > alter publication pub1 add table t2;
> > insert into t1 values(2);
> >
> > Here, we have to debug the apply worker such that when it tries to
> > apply the insert, we stop the debugger in function
> > apply_handle_insert() after doing begin_replication_step().
> >
> > Node-2:
> > alter subscription sub1 set publication pub1, pub2;
> >
> > Now, continue the debugger of the apply worker; it should first start
> > the sync worker and then exit because of the parameter change. All of
> > these debugging steps are just to ensure that it first starts the sync
> > worker and then exits. After this point, the table sync worker never
> > finishes, and the log fills with the message "reached
> > max_sync_workers_per_subscription limit" (a message newly added by me
> > in the attached debug patch).
> >
> > It is not completely clear to me how exactly '013_partition.pl' leads
> > to this situation, but there is a possibility based on the LOGs it
> > shows.
> >
>
> Thanks, I'll take a look later. From the description it seems this is an
> issue that existed before any of the patches, right? It might be more
> likely to hit due to some test changes, but the root cause is older.
>
Yes, your understanding is correct. If so, we probably just need some
changes in the new test to make it behave as per the current code.
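
For context, the extra message mentioned above comes from the sync
worker launch-limit check in process_syncing_tables_for_apply() in
src/backend/replication/logical/tablesync.c. Roughly, the debug patch
does something like the following (a simplified sketch of the idea, not
the actual patch; the wait-loop variable name is just illustrative and
the launch path is abbreviated):

    /*
     * Hypothetical debugger hook placed at worker start: the worker
     * spins here until gdb attaches and clears the flag
     * (in gdb: set var wait_for_debugger = 0).
     */
    volatile bool wait_for_debugger = true;

    while (wait_for_debugger)
        pg_usleep(100000L);     /* sleep 100ms between checks */

    ...

    /*
     * In process_syncing_tables_for_apply(), simplified: launch a sync
     * worker for the not-ready table only if a worker slot is free.
     * The else branch is the debug-only LOG added by the patch; it is
     * not in core.
     */
    int     nsyncworkers =
        logicalrep_sync_worker_count(MyLogicalRepWorker->subid);

    if (nsyncworkers < max_sync_workers_per_subscription)
        logicalrep_worker_launch(MyLogicalRepWorker->dbid,
                                 MySubscription->oid,
                                 MySubscription->name,
                                 MyLogicalRepWorker->userid,
                                 rstate->relid);
    else
        elog(LOG, "reached max_sync_workers_per_subscription limit");

With max_sync_workers_per_subscription = 1 as in the setup above, the
message fires every time the apply worker wants to start another sync
worker while the existing, never-finishing sync worker still holds the
only slot, which matches the log flood described above.

--
With Regards,
Amit Kapila.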