> On Mon, 13 Aug 2018 at 17:36, Alexander Korotkov <a.korot...@postgrespro.ru> wrote:
>
> 2) lwlock-fair-2.patch
> New flag LW_FLAG_FAIR is introduced. This flag is set when the first
> shared locker in the row releases the lock. When LW_FLAG_FAIR is set
> and there is already somebody in the queue, then a shared locker goes to
> the queue. Basically it means that the first shared locker "holds the
> door" for other shared lockers to go without the queue.
>
> I ran pgbench (read-write and read-only benchmarks) on an Amazon
> c5d.18xlarge virtual machine, which has 72 vCPUs (approximately the same
> power as 36 physical cores). The results are attached
> (lwlock-fair-ro.png and lwlock-fair-rw.png).
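
Just to check that I understand the proposed policy correctly, here is how I
read it, expressed as a toy single-threaded model. This is of course not the
actual patch (which operates on the lwlock state word and wait queue); all
names and the structure below are made up purely for illustration:

/*
 * Toy, single-threaded model of the LW_FLAG_FAIR policy described above.
 * NOT the actual patch: the real implementation works on lwlock.c's atomic
 * state word and wait queue; here the state is a plain struct and the
 * "queue" is just a counter, purely to illustrate the decision logic.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct
{
    bool exclusive_held;
    int  shared_holders;
    int  queue_len;     /* waiters parked on the wait queue */
    bool fair;          /* models LW_FLAG_FAIR */
} ToyLock;

/* Can a new shared locker skip the queue, or must it park? */
static bool
shared_must_queue(const ToyLock *lock)
{
    if (lock->exclusive_held)
        return true;
    /* the fairness rule: once FAIR is set and somebody waits, go queue up */
    if (lock->fair && lock->queue_len > 0)
        return true;
    return false;
}

static void
shared_acquire(ToyLock *lock, const char *who)
{
    if (shared_must_queue(lock))
    {
        lock->queue_len++;
        printf("%s: shared locker goes to the queue\n", who);
    }
    else
    {
        lock->shared_holders++;
        printf("%s: shared lock granted without queueing\n", who);
    }
}

static void
shared_release(ToyLock *lock, const char *who)
{
    /*
     * The first shared locker in the row to release sets FAIR, closing the
     * door for later shared lockers while somebody is queued.  When the flag
     * gets cleared again is outside this toy model.
     */
    if (!lock->fair)
    {
        lock->fair = true;
        printf("%s: first shared release, setting FAIR\n", who);
    }
    lock->shared_holders--;
    /* the real code would also wake up the queue when the lock becomes free */
}

int
main(void)
{
    ToyLock lock = {0};

    shared_acquire(&lock, "backend A");   /* granted */
    shared_acquire(&lock, "backend B");   /* granted, piggybacks on A */
    lock.queue_len++;                     /* an exclusive waiter parks */
    printf("backend W: exclusive locker goes to the queue\n");

    shared_release(&lock, "backend A");   /* sets FAIR */
    shared_acquire(&lock, "backend C");   /* now queues behind W */
    return 0;
}

If that reading is right, a straggler shared locker arriving after the first
shared release has to park behind the queued exclusive waiter instead of
jumping ahead of it.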
I've tested the second patch a bit using my bpf scripts to measure the lock
contention. These scripts are still under development, so there may be some
rough edges and of course they make things slower, but so far the
event-by-event tracing correlates quite well with the output of perf script.

For the highly contended case (I simulated it using random_zipfian) I even got
some visible improvement in the time distribution, but in an interesting way:
there is almost no difference in the distribution of time spent waiting on
exclusive/shared locks, but the same metric for holding shared locks somehow
has a bigger portion of short time frames:

# without the patch
Shared lock holding time

     hold time (us) : count     distribution
         0 -> 1     : 17897059 |**************************              |
         2 -> 3     : 27306589 |****************************************|
         4 -> 7     : 6386375  |*********                               |
         8 -> 15    : 5103653  |*******                                 |
        16 -> 31    : 3846960  |*****                                   |
        32 -> 63    : 118039   |                                        |
        64 -> 127   : 15588    |                                        |
       128 -> 255   : 2791     |                                        |
       256 -> 511   : 1037     |                                        |
       512 -> 1023  : 137      |                                        |
      1024 -> 2047  : 3        |                                        |

# with the patch
Shared lock holding time

     hold time (us) : count     distribution
         0 -> 1     : 20909871 |********************************        |
         2 -> 3     : 25453610 |****************************************|
         4 -> 7     : 6012183  |*********                               |
         8 -> 15    : 5364837  |********                                |
        16 -> 31    : 3606992  |*****                                   |
        32 -> 63    : 112562   |                                        |
        64 -> 127   : 13483    |                                        |
       128 -> 255   : 2593     |                                        |
       256 -> 511   : 1029     |                                        |
       512 -> 1023  : 138      |                                        |
      1024 -> 2047  : 7        |                                        |

So it looks like shared locks, queued as implemented in this patch, are
released faster than without this queue (probably it reduces contention in a
less expected way). I've also tested it on c5d.18xlarge, although with
somewhat different options (larger pgbench scale and shared_buffers, number
of clients fixed at 72), and I'll try to make a few more rounds with
different options.

For the case of uniform distribution (just a normal read-write workload) in
the same environment I don't yet see any significant difference in the time
distribution between the patched version and master, which is a bit
surprising to me. Can you point out some analysis of why this kind of
"fairness" introduces a significant performance regression?
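
P.S. To put the "bigger portion of short time frames" into numbers, here is
the same data as per-bucket shares (a throwaway snippet of mine; the counts
are copied verbatim from the histograms above):

/*
 * Quick arithmetic on the two histograms above: turn the raw bucket counts
 * into per-bucket shares so the shift towards shorter hold times is easier
 * to see.  The numbers are copied from the bcc-style output above.
 */
#include <stdio.h>

int
main(void)
{
    static const char *bucket[] = {
        "0 -> 1", "2 -> 3", "4 -> 7", "8 -> 15", "16 -> 31", "32 -> 63",
        "64 -> 127", "128 -> 255", "256 -> 511", "512 -> 1023", "1024 -> 2047"
    };
    static const long long without[] = {
        17897059, 27306589, 6386375, 5103653, 3846960,
        118039, 15588, 2791, 1037, 137, 3
    };
    static const long long with_patch[] = {
        20909871, 25453610, 6012183, 5364837, 3606992,
        112562, 13483, 2593, 1029, 138, 7
    };
    const int nbuckets = sizeof(without) / sizeof(without[0]);
    long long total_without = 0, total_with = 0;

    for (int i = 0; i < nbuckets; i++)
    {
        total_without += without[i];
        total_with += with_patch[i];
    }

    printf("%-12s %10s %10s\n", "hold (us)", "master %", "patched %");
    for (int i = 0; i < nbuckets; i++)
        printf("%-12s %10.2f %10.2f\n", bucket[i],
               100.0 * without[i] / total_without,
               100.0 * with_patch[i] / total_with);
    return 0;
}

The 0 -> 1 us bucket grows from roughly 29% to 34% of all samples, mostly at
the expense of the 2 -> 3 us bucket, while the tail stays about the same.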