Hi Shveta, Amit,

> > > > ... We should try to find out if there is a performance benefit
> > > > with the use of synchronous_standby_names in the normal
> > > > configurations like the one you used in the above tests to prove
> > > > the value of this patch.
I don't expect there to be a performance benefit; if anything, I would expect it to perform slightly worse because of the contention on SyncRepLock. The main value of the patch for me is that it makes it easy for administrators to set the parameter, and it avoids having to re-toggle the configuration when they want very up-to-date logical clients and one of the replicas previously specified in 'synchronized_standby_slots' becomes unavailable in a synchronous configuration setup.

> > > I didn't fully understand the parameters mentioned above, specifically
> > > what 'latency stddev' and 'latency average' represent

If I understand correctly, 'latency average' is the average latency of each transaction from commit, while 'latency stddev' is the standard deviation across those transactions.

> > Yes, I also expect the patch should perform better in such a scenario
> > but it is better to test it. Also, irrespective of that, we should
> > investigate why the reported case is slower for
> > synchronous_standby_names and see if we can improve it.

We could test it, but I'm not sure how interesting it is, since depending on how far the chosen slot in 'synchronized_standby_slots' lags behind, we can easily show that this patch performs better. For instance, in Shveta's suggestion of

> > > We can perform this test with both of the below settings and say make
> > > D and E slow in sending responses:
> > > 1) synchronous_standby_names = 'ANY 3 (A,B,C,D,E)'
> > > 2) standby_slot_names = A_slot, B_slot, C_slot, D_slot, E_slot.

if the server associated with E_slot is down or undergoing some sort of maintenance, then all logical consumers would start lagging until the server is back up. I could also mimic a network lag of 20 seconds, and it's guaranteed that this patch would perform better.

I re-ran the benchmarks with a longer run time of 3 hours, and also tested a new shared cache that lets walsenders check the value before obtaining SyncRepLock.
I also saw that I was being throttled on storage in my previous benchmarks, so I moved to a new setup. I benchmarked a new test case with an additional shared cache between all the walsenders to reduce potential contention on SyncRepLock, and have attached said patch.

Database: writer on its own disk, 5 RRs together on the other disk
Client: 10 logical clients, with pgbench also running from here:
'pgbench -c 32 -j 4 -T 10800 -U "ec2-user" -d postgres -r -P 1'

# Test failover_slots with synchronized_standby_slots = 'rr_1, rr_2, rr_3, rr_4, rr_5'
latency average = 10.683 ms
latency stddev = 11.851 ms
initial connection time = 145.876 ms
tps = 2994.595673 (without initial connection time)

# Test failover_slots waiting on sync_rep, no new shared cache
latency average = 10.684 ms
latency stddev = 12.247 ms
initial connection time = 142.561 ms
tps = 2994.160136 (without initial connection time)

# Test failover_slots with additional shared cache
latency average = 10.674 ms
latency stddev = 11.917 ms
initial connection time = 142.486 ms
tps = 2997.315874 (without initial connection time)

The tps improvement between no cache and shared cache seems marginal, but we do see a slight improvement in stddev, which makes sense from a contention perspective. I think the cache would show much more improvement if we had, say, 1000 logical slots all trying to obtain SyncRepLock to update their values. I've attached the patch, but I don't feel particularly strongly about the new shared LSN values.

Thanks,

--
John Hsu - Amazon Web Services
0003-Wait-on-synchronous-replication-by-default-for-logic.patch
Description: Binary data