Hi,

By default, streaming of in-progress transactions for subscriptions is
currently disabled: all transactions are fully decoded on the publisher
before being sent to the subscriber. This approach can lead to increased
latency and reduced performance, particularly under heavy load. We could
instead enable the parallel streaming option for subscriptions by
default, so that incoming changes are applied directly by one of the
available parallel apply workers. This significantly improves the
performance of commit operations.
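For reference, this is the option the proposal would enable by default;
today it must be requested explicitly when creating the subscription
(the subscription, connection and publication names below are just
placeholders):

-- Current syntax to opt in to parallel apply of streamed transactions:
CREATE SUBSCRIPTION mysub
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION mypub
    WITH (streaming = parallel);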

I conducted a series of tests using logical replication, comparing SQL
execution times with streaming set to parallel and to off. The tests
varied the logical_decoding_work_mem setting and covered the following
scenarios: a) insert, b) delete, c) update, d) rollback of 5% of the
records, e) rollback of 10%, f) rollback of 20%, g) rollback of 50%. I
have written TAP tests for these; the attached files can be copied to
src/test/subscription/t, and the logical_decoding_work_mem
configuration and the streaming option in the CREATE SUBSCRIPTION
command should be adjusted accordingly before running them. Each test
was executed 5 times and the results averaged.
The execution times below are in seconds.
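To illustrate the knobs each run varies (a sketch only; the actual runs
use the attached TAP scripts, and the names here are placeholders):

-- Publisher: vary the decoding memory threshold per run.
ALTER SYSTEM SET logical_decoding_work_mem = '64kB';  -- also 256kB and 64MB
SELECT pg_reload_conf();

-- Subscriber: compare streaming = parallel against streaming = off.
CREATE SUBSCRIPTION test_sub
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION test_pub
    WITH (streaming = parallel);  -- or (streaming = off)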

Insert 5 million records
logical_decoding_work_mem | Parallel (s) | Off (s) | % improvement
--------------------------|--------------|---------|--------------
 64 kB                    |       37.304 |  69.465 |        46.298
256 kB                    |       36.327 |  70.671 |        48.597
 64 MB                    |       41.173 |  69.228 |        40.526

Delete 5 million records
logical_decoding_work_mem | Parallel (s) | Off (s) | % improvement
--------------------------|--------------|---------|--------------
 64 kB                    |       42.322 |  69.404 |        39.021
256 kB                    |       43.250 |  66.973 |        35.422
 64 MB                    |       44.183 |  67.873 |        34.903

Update 5 million records
logical_decoding_work_mem | Parallel (s) | Off (s) | % improvement
--------------------------|--------------|---------|--------------
 64 kB                    |       93.953 | 127.691 |        26.422
256 kB                    |       94.166 | 128.541 |        26.743
 64 MB                    |       93.367 | 134.275 |        30.465

Rollback 5% of records
logical_decoding_work_mem | Parallel (s) | Off (s) | % improvement
--------------------------|--------------|---------|--------------
 64 kB                    |       36.968 |  67.161 |        44.957
256 kB                    |       38.059 |  68.021 |        44.047
 64 MB                    |       39.431 |  66.878 |        41.041

Rollback 10% of records
logical_decoding_work_mem | Parallel (s) | Off (s) | % improvement
--------------------------|--------------|---------|--------------
 64 kB                    |       35.966 |  63.968 |        43.775
256 kB                    |       36.597 |  64.836 |        43.554
 64 MB                    |       39.069 |  64.357 |        39.292

Rollback 20% of records
logical_decoding_work_mem | Parallel (s) | Off (s) | % improvement
--------------------------|--------------|---------|--------------
 64 kB                    |       37.616 |  58.903 |        36.139
256 kB                    |       37.330 |  58.606 |        36.303
 64 MB                    |       38.720 |  60.236 |        35.720

Rollback 50% of records
logical_decoding_work_mem | Parallel (s) | Off (s) | % improvement
--------------------------|--------------|---------|--------------
 64 kB                    |       38.999 |  44.776 |        12.902
256 kB                    |       36.567 |  44.530 |        17.882
 64 MB                    |       38.592 |  45.346 |        14.893

The machine configuration that was used is also attached.

The tests demonstrate a significant performance improvement when using
the parallel streaming option: insert shows a 40-48% improvement,
delete 34-39%, and update 26-30%. For rollback the improvement is
between 12% and 44%, decreasing as a larger share of the data is
rolled back. If a very large amount of data has to be rolled back, the
performance of parallel streaming may be comparable to, or in some
instances slightly lower than, streaming off. However, this seems
acceptable, since commit operations are generally far more frequent
than rollbacks.

One key point to consider is that locks on the objects a transaction
modifies are held for a longer duration with parallel streaming. This
is because the parallel apply worker begins the transaction as soon as
streaming starts and holds its locks until the transaction fully
completes. As a result, for long-running transactions, these extended
locks can block concurrent access that needs a conflicting lock.
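For example (table name is a placeholder), a conflicting DDL statement
issued on the subscriber while a large transaction is being streamed in
would behave roughly like this:

-- With streaming = off, this only waits during the apply-at-commit
-- window; with streaming = parallel, it waits from the first streamed
-- chunk until the remote transaction commits or aborts, because the
-- parallel apply worker holds its locks for that whole duration.
TRUNCATE TABLE test_tab;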

Since there is a significant performance improvement, we should make
parallel the default value for the subscription streaming option. The
attached patch implements this change.
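Anyone who prefers the current behavior could still opt out explicitly,
e.g. (subscription name is a placeholder):

ALTER SUBSCRIPTION mysub SET (streaming = off);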
Thoughts?

All of these tests were conducted with both the publisher and
subscriber on the same host. I will perform additional tests with one
of the logical replication nodes on a different host and share the
results later.

Regards,
Vignesh

CPU INFO

processor       : 119
vendor_id       : GenuineIntel
cpu family      : 6
model           : 62
model name      : Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz
stepping        : 7
microcode       : 0x715
cpu MHz         : 1505.957
cache size      : 38400 KB
physical id     : 3
siblings        : 30
core id         : 14
cpu cores       : 15
apicid          : 125
initial apicid  : 125
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb 
rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology 
nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est 
tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt 
tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm epb intel_ppin ssbd ibrs 
ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt 
dtherm ida arat pln pts md_clear spec_ctrl intel_stibp flush_l1d
bogomips        : 5629.54
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:


MEMORY INFO

MemTotal:       792237404 kB
MemFree:        724051992 kB
MemAvailable:   762505368 kB
Buffers:            2108 kB
Cached:         43885588 kB
SwapCached:            0 kB
Active:         22276460 kB
Inactive:       21761812 kB
Active(anon):    1199380 kB
Inactive(anon):  4228212 kB
Active(file):   21077080 kB
Inactive(file): 17533600 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       4194300 kB
SwapFree:        4194300 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:        150796 kB
Mapped:          5283248 kB
Shmem:           5277044 kB
Slab:            1472084 kB
SReclaimable:    1165144 kB
SUnreclaim:       306940 kB
KernelStack:       17504 kB
PageTables:        21540 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    400313000 kB
Committed_AS:   86287072 kB
VmallocTotal:   34359738367 kB
VmallocUsed:     1942092 kB
VmallocChunk:   33753397244 kB
Percpu:            36352 kB
HardwareCorrupted:     0 kB
AnonHugePages:     81920 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      785756 kB
DirectMap2M:     7432192 kB
DirectMap1G:    799014912 kB

