Hi,

Currently, streaming of in-progress transactions for subscriptions is disabled by default: every transaction is fully decoded on the publisher before being sent to the subscriber. This can lead to increased latency and reduced performance, particularly under heavy load. Instead, we could enable the parallel streaming option for subscriptions by default, so that incoming changes are applied directly by one of the available parallel apply workers. This significantly improves the performance of commit operations.
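For illustration, the option in question is the streaming parameter of CREATE SUBSCRIPTION. A minimal sketch (the connection string, publication, subscription, and table names are placeholders):

```sql
-- On the publisher: publish changes of a table.
CREATE PUBLICATION pub FOR TABLE test_tab;

-- On the subscriber: today streaming = parallel must be given explicitly;
-- with the proposed default, omitting the option would behave the same way.
CREATE SUBSCRIPTION sub
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION pub
    WITH (streaming = parallel);
```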
I conducted a series of tests using logical replication, comparing SQL execution times with the streaming option set to parallel and to off. The tests varied the logical_decoding_work_mem setting and covered the following scenarios: a) insert, b) delete, c) update, d) rollback of 5% of records, e) rollback of 10% of records, f) rollback of 20% of records, g) rollback of 50% of records.

I have written TAP tests for the same; the attached files can be copied to src/test/subscription/t, and the logical_decoding_work_mem configuration and the streaming option in the CREATE SUBSCRIPTION command should be changed accordingly before running the tests. Each test was executed 5 times and the average was taken. Execution times are in seconds.

Insert 5kk records:
Logical decoding mem | Parallel |     off | % Improvement
---------------------|----------|---------|--------------
64 KB                |   37.304 |  69.465 |        46.298
256 KB               |   36.327 |  70.671 |        48.597
64 MB                |   41.173 |  69.228 |        40.526

Delete 5kk records:
Logical decoding mem | Parallel |     off | % Improvement
---------------------|----------|---------|--------------
64 KB                |   42.322 |  69.404 |        39.021
256 KB               |   43.250 |  66.973 |        35.422
64 MB                |   44.183 |  67.873 |        34.903

Update 5kk records:
Logical decoding mem | Parallel |     off | % Improvement
---------------------|----------|---------|--------------
64 KB                |   93.953 | 127.691 |        26.422
256 KB               |   94.166 | 128.541 |        26.743
64 MB                |   93.367 | 134.275 |        30.465

Rollback 05% records:
Logical decoding mem | Parallel |     off | % Improvement
---------------------|----------|---------|--------------
64 KB                |   36.968 |  67.161 |        44.957
256 KB               |   38.059 |  68.021 |        44.047
64 MB                |   39.431 |  66.878 |        41.041

Rollback 10% records:
Logical decoding mem | Parallel |     off | % Improvement
---------------------|----------|---------|--------------
64 KB                |   35.966 |  63.968 |        43.775
256 KB               |   36.597 |  64.836 |        43.554
64 MB                |   39.069 |  64.357 |        39.292

Rollback 20% records:
Logical decoding mem | Parallel |     off | % Improvement
---------------------|----------|---------|--------------
64 KB                |   37.616 |  58.903 |        36.139
256 KB               |   37.330 |  58.606 |        36.303
64 MB                |   38.720 |  60.236 |        35.720

Rollback 50% records:
Logical decoding mem | Parallel |     off | % Improvement
---------------------|----------|---------|--------------
64 KB                |   38.999 |  44.776 |        12.902
256 KB               |   36.567 |  44.530 |        17.882
64 MB                |   38.592 |  45.346 |        14.893

The machine configuration that was used is also attached.

The tests demonstrate a significant performance improvement with the parallel streaming option: insert shows a 40-48% improvement, delete 34-39%, and update 26-30%. For rollbacks the improvement is between 12% and 44%, decreasing slightly as the amount of data rolled back grows. If there is a significant amount of data to roll back, the performance of parallel streaming may be comparable to, or in some instances slightly lower than, streaming off. However, this is acceptable, since commit operations are generally far more frequent than rollback operations.

One key point to consider is that the lock on transaction objects is held for a longer duration with parallel streaming. This is because the parallel apply worker starts the transaction as soon as streaming begins and holds the lock until the transaction is fully completed. As a result, for long-running transactions this extended lock can hinder concurrent access that requires the lock.

Since there is a significant percentage improvement, we should make the default subscription streaming option parallel. The attached patch has the change for the same. Thoughts?

Note that all of these tests were conducted with both the publisher and subscriber on the same host.
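For anyone re-running the tests, the two knobs varied above can be adjusted roughly as follows (a sketch; the subscription name "sub" is a placeholder):

```sql
-- On the publisher: a transaction's decoded changes are streamed (or spilled
-- to disk) once they exceed this memory limit; the runs above used
-- 64kB, 256kB and 64MB.
ALTER SYSTEM SET logical_decoding_work_mem = '64kB';
SELECT pg_reload_conf();

-- On the subscriber: switch between the two configurations being compared.
ALTER SUBSCRIPTION sub SET (streaming = parallel);  -- or: streaming = off
```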
I will perform additional tests with one of the logical replication nodes on a different host and share the results later.

Regards,
Vignesh
CPU INFO
processor       : 119
vendor_id       : GenuineIntel
cpu family      : 6
model           : 62
model name      : Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz
stepping        : 7
microcode       : 0x715
cpu MHz         : 1505.957
cache size      : 38400 KB
physical id     : 3
siblings        : 30
core id         : 14
cpu cores       : 15
apicid          : 125
initial apicid  : 125
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm epb intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts md_clear spec_ctrl intel_stibp flush_l1d
bogomips        : 5629.54
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:
v1-0001-Make-default-value-for-susbcription-streaming-opt.patch
Description: Binary data
MemTotal:        792237404 kB
MemFree:         724051992 kB
MemAvailable:    762505368 kB
Buffers:              2108 kB
Cached:           43885588 kB
SwapCached:              0 kB
Active:           22276460 kB
Inactive:         21761812 kB
Active(anon):      1199380 kB
Inactive(anon):    4228212 kB
Active(file):     21077080 kB
Inactive(file):   17533600 kB
Unevictable:             0 kB
Mlocked:                 0 kB
SwapTotal:         4194300 kB
SwapFree:          4194300 kB
Dirty:                   0 kB
Writeback:               0 kB
AnonPages:          150796 kB
Mapped:            5283248 kB
Shmem:             5277044 kB
Slab:              1472084 kB
SReclaimable:      1165144 kB
SUnreclaim:         306940 kB
KernelStack:         17504 kB
PageTables:          21540 kB
NFS_Unstable:            0 kB
Bounce:                  0 kB
WritebackTmp:            0 kB
CommitLimit:     400313000 kB
Committed_AS:     86287072 kB
VmallocTotal:  34359738367 kB
VmallocUsed:       1942092 kB
VmallocChunk:  33753397244 kB
Percpu:              36352 kB
HardwareCorrupted:       0 kB
AnonHugePages:       81920 kB
CmaTotal:                0 kB
CmaFree:                 0 kB
HugePages_Total:         0
HugePages_Free:          0
HugePages_Rsvd:          0
HugePages_Surp:          0
Hugepagesize:         2048 kB
DirectMap4k:        785756 kB
DirectMap2M:       7432192 kB
DirectMap1G:     799014912 kB
<<attachment: test_tap_test_scripts.zip>>