Hi Everyone,

On 2017-05-19 17:25:38 +0530, Rafia Sabih wrote:
> While analysing the performance of TPC-H queries for the newly developed
> parallel-operators, viz, parallel index, bitmap heap scan, etc. we noticed
> that the time taken by gather node is significant. On investigation, as per
> the current method it copies each tuple to the shared queue and notifies
> the receiver. Since, this copying is done in shared queue, a lot of locking
> and latching overhead is there.
>
> So, in this POC patch I tried to copy all the tuples in a local queue thus
> avoiding all the locks and latches. Once, the local queue is filled as per
> it's capacity, tuples are transferred to the shared queue. Once, all the
> tuples are transferred the receiver is sent the notification about the same.
>
> With this patch I could see significant improvement in performance for
> simple queries,
I've spent some time looking into this, and I'm not quite convinced this
is the right approach.  Let me start by describing where I see the
current performance problems around gather coming from.

The observations here are made using
  select * from t where i < 30000000 offset 29999999 - 1;
with Rafia's data.  That avoids slowing down the leader by printing out
too many rows.  Sometimes I'll also use
  SELECT * FROM lineitem WHERE l_suppkey > '5012' OFFSET 1000000000 LIMIT 1;
on tpch data to show the effects on wider tables.

The precise query doesn't really matter; the observations here are more
general, I hope.

1) nodeGather.c re-projects every row from workers.  As far as I can
   tell that's now always exactly the same targetlist as the one coming
   from the worker.  Projection capability was added in 8538a6307049590
   (without checking whether it's needed, afaict), but I think it has in
   turn largely been obsoleted by
   992b5ba30dcafdc222341505b072a6b009b248a7.  My measurements show that
   removing the projection yields quite massive speedups in queries like
   yours, which is not too surprising.

   I suspect this just needs a tlist_matches_tupdesc check + an if
   around ExecProject() (a rough sketch of the latter follows after the
   profiles below).  And a test: right now the tests pass even if the
   projection is commented out unconditionally, unless
   force_parallel_mode is used.

before commenting out nodeGather projection:

rafia time: 8283.583
rafia profile:
+   30.62%  postgres  postgres  [.] shm_mq_receive
+   18.49%  postgres  postgres  [.] s_lock
+   10.08%  postgres  postgres  [.] SetLatch
-    7.02%  postgres  postgres  [.] slot_deform_tuple
   - slot_deform_tuple
      - 88.01% slot_getsomeattrs
           ExecInterpExpr
           ExecGather
           ExecLimit

lineitem time: 8448.468
lineitem profile:
+   24.63%  postgres  postgres  [.] slot_deform_tuple
+   24.43%  postgres  postgres  [.] shm_mq_receive
+   17.36%  postgres  postgres  [.] ExecInterpExpr
+    7.41%  postgres  postgres  [.] s_lock
+    5.73%  postgres  postgres  [.] SetLatch

after:

rafia time: 6660.224
rafia profile:
+   36.77%  postgres  postgres  [.] shm_mq_receive
+   19.33%  postgres  postgres  [.] s_lock
+   13.14%  postgres  postgres  [.] SetLatch
+    9.22%  postgres  postgres  [.] AllocSetReset
+    4.27%  postgres  postgres  [.] ExecGather
+    2.79%  postgres  postgres  [.] AllocSetAlloc

lineitem time: 4507.416
lineitem profile:
+   34.81%  postgres  postgres  [.] shm_mq_receive
+   15.45%  postgres  postgres  [.] s_lock
+   13.38%  postgres  postgres  [.] SetLatch
+    9.87%  postgres  postgres  [.] AllocSetReset
+    5.82%  postgres  postgres  [.] ExecGather

As is quite clearly visible, avoiding the projection yields some major
speedups.  The analysis below has the projection removed.
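To make the "if around ExecProject()" idea in 1) a bit more concrete,
here's a rough, untested sketch of what the tuple-returning path in
ExecGather() could look like.  The need_projection field is hypothetical,
and tlist_matches_tupdesc() is currently static in execScan.c, so it'd
need to be exported (or duplicated) for the one-time check at init time:

    /*
     * Sketch only: skip the per-row projection when the worker tuples
     * already have the shape the Gather node is supposed to produce.
     * Assumes ExecInitGather() did a one-time tlist_matches_tupdesc()-style
     * check and cached the result in a (hypothetical) need_projection field.
     */
    ExprContext *econtext = node->ps.ps_ExprContext;
    TupleTableSlot *slot;

    slot = gather_getnext(node);
    if (TupIsNull(slot))
        return NULL;

    if (!node->need_projection)
        return slot;            /* worker tuple already matches the tlist */

    /* otherwise project, as the current code always does */
    econtext->ecxt_outertuple = slot;
    return ExecProject(node->ps.ps_ProjInfo);

The interesting part is of course the init-time check; if it mirrors what
execScan.c does for scan nodes, the projection should get skipped in the
common "pass worker tuples straight through" case these queries hit.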
2) The spinlocks on both the sending and the receiving side are quite
hot:

rafia query leader:
+   36.16%  postgres  postgres  [.] shm_mq_receive
+   19.49%  postgres  postgres  [.] s_lock
+   13.24%  postgres  postgres  [.] SetLatch

The presence of s_lock shows that we're clearly often contending on
spinlocks, given that it's the slow path for SpinLockAcquire().

In shm_mq_receive the instruction profile shows:

       │     SpinLockAcquire(&mq->mq_mutex);
       │1  5ab:   mov    $0xa9b580,%ecx
       │          mov    $0x4a4,%edx
       │          mov    $0xa9b538,%esi
       │          mov    %r15,%rdi
       │        → callq  s_lock
       │        ↑ jmpq   2a1
       │     tas():
       │1  5c7:   mov    $0x1,%eax
 32.83 │          lock   xchg %al,(%r15)
       │     shm_mq_inc_bytes_read():
       │     SpinLockAcquire(&mq->mq_mutex);

and

  0.01 │          pop    %r15
  0.04 │        ← retq
       │          nop
       │     tas():
       │1  338:   mov    $0x1,%eax
 17.59 │          lock   xchg %al,(%r15)
       │     shm_mq_get_bytes_written():
       │     SpinLockAcquire(&mq->mq_mutex);
  0.05 │          test   %al,%al
  0.01 │        ↓ jne    448
       │     v = mq->mq_bytes_written;

rafia query worker:
+   33.00%  postgres  postgres  [.] shm_mq_send_bytes
+    9.90%  postgres  postgres  [.] s_lock
+    7.74%  postgres  postgres  [.] shm_mq_send
+    5.40%  postgres  postgres  [.] ExecInterpExpr
+    5.34%  postgres  postgres  [.] SetLatch

Again, we have strong indicators of a lot of spinlock contention.  The
instruction profiles show the same:

shm_mq_send_bytes:
       │     shm_mq_inc_bytes_written(mq, MAXALIGN(sendnow));
       │          and    $0xfffffffffffffff8,%r15
       │     tas():
  0.08 │          mov    %ebp,%eax
 31.07 │          lock   xchg %al,(%r14)
       │     shm_mq_inc_bytes_written():
       │      * Increment the number of bytes written.
       │      */

and

       │3   98:   cmp    %r13,%rbx
  0.70 │        ↓ jae    430
       │     tas():
  0.12 │1   a1:   mov    %ebp,%eax
 28.53 │          lock   xchg %al,(%r14)
       │     shm_mq_get_bytes_read():
       │     SpinLockAcquire(&mq->mq_mutex);
       │          test   %al,%al
       │        ↓ jne    298
       │     v = mq->mq_bytes_read;

shm_mq_send:
       │     tas():
 50.73 │          lock   xchg %al,0x0(%r13)
       │     shm_mq_notify_receiver():
       │     shm_mq_notify_receiver(volatile shm_mq *mq)
       │     {
       │         PGPROC     *receiver;
       │         bool        detached;

My interpretation here is that it's not just the locking itself causing
the slowdown, but to a significant degree the implied bus lock.  To test
that theory, here are the timings for

1) spinlocks present
   time: 6593.045
2) spinlock acquisition replaced by *full* memory barriers, which on
   x86 is a lock; addl $0,0(%%rsp)
   time: 5159.306
3) spinlocks replaced by read/write barriers as appropriate
   time: 4687.610

By my understanding of shm_mq.c's logic, something like 3) ought to be
doable in a correct manner.  In normal circumstances there should only
be one end modifying each of the protected variables.  Doing that
instead of using full-blown atomics with locked instructions is very
likely to yield considerably better performance.

The top-level profile after 3) is:

leader:
+   25.89%  postgres  postgres  [.] shm_mq_receive
+   24.78%  postgres  postgres  [.] SetLatch
+   14.63%  postgres  postgres  [.] AllocSetReset
+    7.31%  postgres  postgres  [.] ExecGather

worker:
+   14.02%  postgres  postgres  [.] ExecInterpExpr
+   11.63%  postgres  postgres  [.] shm_mq_send_bytes
+   11.25%  postgres  postgres  [.] heap_getnext
+    6.78%  postgres  postgres  [.] SetLatch
+    6.38%  postgres  postgres  [.] slot_deform_tuple

Still a lot of cycles in the queue code, but proportionally less.

4) Modulo computations in shm_mq are expensive:

       │     shm_mq_send_bytes():
       │     Size        offset = mq->mq_bytes_written % (uint64) ringsize;
  0.12 │1   70:   xor    %edx,%edx
       │     Size        sendnow = Min(available, ringsize - offset);
       │          mov    %r12,%rsi
       │     Size        offset = mq->mq_bytes_written % (uint64) ringsize;
 43.75 │          div    %r12
       │     memcpy(&mq->mq_ring[mq->mq_ring_offset + offset],
  7.23 │          movzbl 0x31(%r15),%eax

       │     shm_mq_receive_bytes():
       │     used = written - mq->mq_bytes_read;
  1.17 │          sub    %rax,%rcx
       │     offset = mq->mq_bytes_read % (uint64) ringsize;
 18.49 │          div    %rbp
       │          mov    %rdx,%rdi

That we end up with a full-blown div makes sense: the compiler can't
know anything about ringsize, and therefore can't optimize the modulo
into a mask.  I think we should force the size of the ring buffer to be
a power of two, and store a mask instead of a size for the buffer.
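As a rough illustration of 4), assuming shm_mq_create() were changed to
round the ring size up to a power of two and to store size - 1 in a new
mq_ring_mask field (neither of which exists today), the offset
computations could become a cheap and instead of a 64-bit div:

    /*
     * Sketch only, not existing shm_mq.c code: mq_ring_mask is a made-up
     * field holding (ring size - 1), with the ring size forced to a power
     * of two.  For the unsigned values involved, x % size == x & (size - 1),
     * so the hot div instructions above turn into a single and.
     */
    Size        ringsize = mq->mq_ring_mask + 1;
    Size        offset = mq->mq_bytes_written & mq->mq_ring_mask;  /* was: % (uint64) ringsize */
    Size        sendnow = Min(available, ringsize - offset);

    /*
     * shm_mq_receive_bytes() would compute its offset the same way:
     *     offset = mq->mq_bytes_read & mq->mq_ring_mask;
     */

The trade-off is that the requested queue size has to be rounded to a
power of two one way or the other, which seems acceptable for a tuple
queue.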
5) There are a *lot* of latch interactions.  The biggest issue actually
is the memory barrier implied by SetLatch(); waiting for the latch
barely shows up.

from 4) above:

leader:
+   25.89%  postgres  postgres  [.] shm_mq_receive
+   24.78%  postgres  postgres  [.] SetLatch
+   14.63%  postgres  postgres  [.] AllocSetReset
+    7.31%  postgres  postgres  [.] ExecGather

       │    0000000000781b10 <SetLatch>:
       │    SetLatch():
       │        /*
       │         * The memory barrier has to be placed here to ensure that any flag
       │         * variables possibly changed by this process have been flushed to main
       │         * memory, before we check/set is_set.
       │         */
       │        pg_memory_barrier();
 77.43 │      lock   addl $0x0,(%rsp)
       │
       │        /* Quick exit if already set */
       │        if (latch->is_set)
  0.12 │      mov    (%rdi),%eax

Commenting out the memory barrier - which is NOT CORRECT - improves
timing:
before: 4675.626
after: 4125.587

The correct fix obviously is not to break latch correctness.  I think
the big problem here is that we perform a SetLatch() for every read from
the queue.  I think we should:

a) introduce a batch variant for receiving, like

   extern shm_mq_result shm_mq_receivev(shm_mq_handle *mqh,
                                        shm_mq_iovec *iov, int *iovcnt,
                                        bool nowait)

   which then only does the SetLatch() at the end, and maybe also when
   it wrapped around.

b) use shm_mq_sendv in tqueue.c, by batching up insertions into the
   queue whenever it's not empty when a tuple is ready.

I've not benchmarked this, but I'm pretty certain that the benefit isn't
just going to be the reduced cost of the SetLatch() calls, but also
increased performance due to fewer context switches.

6) I've observed, using strace, debug output with timings, and top with
a short interval, that quite often only one backend has sufficient work,
while the other backends are largely idle.

I think the problem here is that the "anti round robin" provisions from
bc7fcab5e36b9597857, while much better than the previous state, have
swung a bit too far in the other direction.  Especially if we were to
introduce batching as I suggest in 5), but even without it, this leads
to back and forth on half-empty queues between the
gatherstate->nextreader backend and the leader.

I'm not 100% certain what the right fix here is.

One fairly drastic solution would be to move away from a
single-producer-single-consumer, per-worker queue, to a global queue.
That'd avoid fairness issues between the individual workers, at the
price of potential added contention.  One disadvantage is that such a
combined queue approach is not easily applicable to gather merge.

One less drastic approach would be to try to drain the queue fully in
one batch, and then move on to the next queue.  That'd avoid triggering
a small wakeup for each individual tuple, as one currently would get
without the 'stickyness'.

It might also be a good idea to use a more directed form of wakeup,
e.g. using a per-worker latch + a wait event set, to avoid iterating
over all workers.

Unfortunately the patch's "local worker queue" concept seems, to me,
like it's not quite addressing the structural issues, instead opting to
address them by adding another layer of queuing.  I suspect that if we
went for the above solutions, there'd be only a very small benefit in
implementing such per-worker local queues.

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers