Hi Everyone,

On 2017-05-19 17:25:38 +0530, Rafia Sabih wrote:
> While analysing the performance of TPC-H queries for the newly developed
> parallel-operators, viz, parallel index, bitmap heap scan, etc. we noticed
> that the time taken by gather node is significant. On investigation, as per
> the current method it copies each tuple to the shared queue and notifies
> the receiver. Since, this copying is done in shared queue, a lot of locking
> and latching overhead is there.
>
> So, in this POC patch I tried to copy all the tuples in a local queue thus
> avoiding all the locks and latches. Once, the local queue is filled as per
> it's capacity, tuples are transferred to the shared queue. Once, all the
> tuples are transferred the receiver is sent the notification about the same.
>
> With this patch I could see significant improvement in performance for
> simple queries,

I've spent some time looking into this, and I'm not quite convinced this
is the right approach.  Let me start by describing where I think the
current performance problems around gather stem from.

The observations here are made using
select * from t where i < 30000000 offset 29999999 - 1;
with Rafia's data. That avoids slowdowns on the leader due to too many
rows printed out. Sometimes I'll also use
SELECT * FROM lineitem WHERE l_suppkey > '5012' OFFSET 1000000000 LIMIT 1;
on tpch data to show the effects on wider tables.

The precise query doesn't really matter; the observations here are, I
hope, more general.

1) nodeGather.c re-projects every row coming from the workers. As far as
   I can tell that's now always exactly the same targetlist as the one
   produced by the worker. Projection capability was added in
   8538a6307049590 (without checking whether it's needed, afaict), but I
   think it was in turn largely obsoleted by
   992b5ba30dcafdc222341505b072a6b009b248a7.  My measurements show that
   removing the projection yields quite massive speedups in queries like
   yours, which is not too surprising.

   I suspect this just needs a tlist_matches_tupdesc check + an if
   around ExecProject(); a rough sketch follows below. We also need a
   test: right now the tests pass, unless force_parallel_mode is used,
   even with the projection commented out unconditionally.
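
   For concreteness, a rough sketch of that shape, loosely modeled on
   execScan.c's tlist_matches_tupdesc() (which is currently static there,
   so it would need to be exposed or duplicated); the exact plumbing is an
   assumption, not a worked-out patch:

/* ExecInitGather(): only build a projection if the worker's tuples don't
 * already match this node's target list. */
if (tlist_matches_tupdesc(&gatherstate->ps,
                          gatherstate->ps.plan->targetlist,
                          OUTER_VAR,
                          tupDesc))
    gatherstate->ps.ps_ProjInfo = NULL;
else
    ExecAssignProjectionInfo(&gatherstate->ps, tupDesc);

/* ExecGather(): skip the per-row projection when it's absent. */
if (gatherstate->ps.ps_ProjInfo == NULL)
    return slot;
econtext->ecxt_outertuple = slot;
return ExecProject(gatherstate->ps.ps_ProjInfo);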

   before commenting out nodeGather projection:

   rafia time: 8283.583
   rafia profile:
+   30.62%  postgres  postgres             [.] shm_mq_receive
+   18.49%  postgres  postgres             [.] s_lock
+   10.08%  postgres  postgres             [.] SetLatch
-    7.02%  postgres  postgres             [.] slot_deform_tuple
   - slot_deform_tuple
      - 88.01% slot_getsomeattrs
           ExecInterpExpr
           ExecGather
           ExecLimit

   lineitem time: 8448.468
   lineitem profile:
+   24.63%  postgres  postgres             [.] slot_deform_tuple
+   24.43%  postgres  postgres             [.] shm_mq_receive
+   17.36%  postgres  postgres             [.] ExecInterpExpr
+    7.41%  postgres  postgres             [.] s_lock
+    5.73%  postgres  postgres             [.] SetLatch

after:
   rafia time: 6660.224
   rafia profile:
+   36.77%  postgres  postgres              [.] shm_mq_receive
+   19.33%  postgres  postgres              [.] s_lock
+   13.14%  postgres  postgres              [.] SetLatch
+    9.22%  postgres  postgres              [.] AllocSetReset
+    4.27%  postgres  postgres              [.] ExecGather
+    2.79%  postgres  postgres              [.] AllocSetAlloc

   lineitem time: 4507.416
   lineitem profile:
+   34.81%  postgres  postgres            [.] shm_mq_receive
+   15.45%  postgres  postgres            [.] s_lock
+   13.38%  postgres  postgres            [.] SetLatch
+    9.87%  postgres  postgres            [.] AllocSetReset
+    5.82%  postgres  postgres            [.] ExecGather

   As is quite clearly visible, avoiding the projection yields some major
   speedups.

   The analysis that follows has the projection removed.

2) The spinlocks on both the sending and the receiving side are quite hot:

   rafia query leader:
+   36.16%  postgres  postgres            [.] shm_mq_receive
+   19.49%  postgres  postgres            [.] s_lock
+   13.24%  postgres  postgres            [.] SetLatch

   The presence of s_lock shows us that we're clearly often contending
   on spinlocks, given that it's the slow path for SpinLockAcquire(). In
   shm_mq_receive the instruction profile shows:

       │               SpinLockAcquire(&mq->mq_mutex);
       │1 5ab:   mov    $0xa9b580,%ecx
       │         mov    $0x4a4,%edx
       │         mov    $0xa9b538,%esi
       │         mov    %r15,%rdi
       │       → callq  s_lock
       │       ↑ jmpq   2a1
       │       tas():
       │1 5c7:   mov    $0x1,%eax
 32.83 │         lock   xchg %al,(%r15)
       │       shm_mq_inc_bytes_read():
       │               SpinLockAcquire(&mq->mq_mutex);
and
  0.01 │         pop    %r15
  0.04 │       ← retq
       │         nop
       │       tas():
       │1 338:   mov    $0x1,%eax
 17.59 │         lock   xchg %al,(%r15)
       │       shm_mq_get_bytes_written():
       │               SpinLockAcquire(&mq->mq_mutex);
  0.05 │         test   %al,%al
  0.01 │       ↓ jne    448
       │               v = mq->mq_bytes_written;

    rafia query worker:
+   33.00%  postgres  postgres            [.] shm_mq_send_bytes
+    9.90%  postgres  postgres            [.] s_lock
+    7.74%  postgres  postgres            [.] shm_mq_send
+    5.40%  postgres  postgres            [.] ExecInterpExpr
+    5.34%  postgres  postgres            [.] SetLatch

   Again, we have strong indications of a lot of spinlock
   contention. The instruction profiles show the same:

   shm_mq_send_bytes
       │                               shm_mq_inc_bytes_written(mq, MAXALIGN(sendnow));
       │         and    $0xfffffffffffffff8,%r15
       │       tas():
  0.08 │         mov    %ebp,%eax
 31.07 │         lock   xchg %al,(%r14)
       │       shm_mq_inc_bytes_written():
       │        * Increment the number of bytes written.
       │        */

    and

       │3  98:   cmp    %r13,%rbx
  0.70 │       ↓ jae    430
       │       tas():
  0.12 │1  a1:   mov    %ebp,%eax
 28.53 │         lock   xchg %al,(%r14)
       │       shm_mq_get_bytes_read():
       │               SpinLockAcquire(&mq->mq_mutex);
       │         test   %al,%al
       │       ↓ jne    298
       │               v = mq->mq_bytes_read;

    shm_mq_send:
       │       tas():
 50.73 │         lock   xchg %al,0x0(%r13)
       │       shm_mq_notify_receiver():
       │       shm_mq_notify_receiver(volatile shm_mq *mq)
       │       {
       │               PGPROC     *receiver;
       │               bool            detached;


   My interpretation here is that it's not just the locking itself that
   causes the slowdown, but to a significant degree the effect of the
   implied bus lock.

   To test that theory, here are the timings for
   1) spinlocks present
      time: 6593.045
   2) spinlock acquisition replaced by *full* memory barriers, which on x86 is a lock; addl $0,0(%%rsp)
      time: 5159.306
   3) spinlocks replaced by read/write barriers as appropriate.
      time: 4687.610

   By my understanding of shm_mq.c's logic, something like 3) ought to
   be doable in a correct manner: in normal circumstances there should
   only be one end modifying each of the protected variables. Doing that
   instead of using full-blown atomics with locked instructions is very
   likely to yield considerably better performance; a rough sketch of the
   idea is below.
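
   A minimal sketch for the written-bytes counter, assuming (as shm_mq.c's
   design implies) that only the sender ever advances mq_bytes_written and
   only the receiver reads it; this is just an illustration of 3), not the
   patch that was actually measured, and detach/error handling is omitted:

#include "port/atomics.h"

/* Sender side: make new ring contents visible before advancing the counter. */
static void
shm_mq_inc_bytes_written(volatile shm_mq *mq, Size n)
{
    pg_write_barrier();
    mq->mq_bytes_written += n;
}

/* Receiver side: read the counter, then fence before touching ring data. */
static uint64
shm_mq_get_bytes_written(volatile shm_mq *mq, bool *detached)
{
    uint64      v = mq->mq_bytes_written;

    pg_read_barrier();          /* pairs with the sender's write barrier */
    *detached = mq->mq_detached;
    return v;
}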

   The top-level profile after 3 is:

   leader:
+   25.89%  postgres  postgres          [.] shm_mq_receive
+   24.78%  postgres  postgres          [.] SetLatch
+   14.63%  postgres  postgres          [.] AllocSetReset
+    7.31%  postgres  postgres          [.] ExecGather

   worker:
+   14.02%  postgres  postgres            [.] ExecInterpExpr
+   11.63%  postgres  postgres            [.] shm_mq_send_bytes
+   11.25%  postgres  postgres            [.] heap_getnext
+    6.78%  postgres  postgres            [.] SetLatch
+    6.38%  postgres  postgres            [.] slot_deform_tuple

   still a lot of cycles in the queue code, but proportionally less.

4) Modulo computations in shm_mq are expensive:

       │       shm_mq_send_bytes():
       │                               Size            offset = mq->mq_bytes_written % (uint64) ringsize;
  0.12 │1  70:   xor    %edx,%edx
       │                               Size            sendnow = Min(available, ringsize - offset);
       │         mov    %r12,%rsi
       │                               Size            offset = mq->mq_bytes_written % (uint64) ringsize;
 43.75 │         div    %r12
       │                               memcpy(&mq->mq_ring[mq->mq_ring_offset + offset],
  7.23 │         movzbl 0x31(%r15),%eax


       │       shm_mq_receive_bytes():
       │                       used = written - mq->mq_bytes_read;
  1.17 │         sub    %rax,%rcx
       │                       offset = mq->mq_bytes_read % (uint64) ringsize;
 18.49 │         div    %rbp
       │         mov    %rdx,%rdi


   That we end up with a full-blown div makes sense - the compiler can't
   know anything about ringsize, therefore it can't optimize it into a
   mask. I think we should force the size of the ringbuffer to be a power
   of two, and use a mask instead of a size for the buffer; see the
   sketch below.
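
   A small standalone illustration of the mask idea (not code from
   shm_mq.c; the helper names are made up):

#include <stdint.h>

/* ring_mask == ring_size - 1, with ring_size forced to a power of two */
static inline uint64_t
ring_offset(uint64_t bytes_consumed, uint64_t ring_mask)
{
    /* equivalent to bytes_consumed % ring_size, but avoids the 64-bit div */
    return bytes_consumed & ring_mask;
}

/* round a requested queue size up to the next power of two */
static inline uint64_t
ring_size_round_up(uint64_t requested)
{
    uint64_t    size = 1;

    while (size < requested)
        size <<= 1;
    return size;
}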

5) There are a *lot* of latch interactions. The biggest issue actually is
   the memory barrier implied by SetLatch(); waiting for the latch
   barely shows up.

   repeating the profile from the read/write-barrier experiment in 2) above:

   leader:
+   25.89%  postgres  postgres          [.] shm_mq_receive
+   24.78%  postgres  postgres          [.] SetLatch
+   14.63%  postgres  postgres          [.] AllocSetReset
+    7.31%  postgres  postgres          [.] ExecGather


       │      0000000000781b10 <SetLatch>:
       │      SetLatch():
       │              /*
       │               * The memory barrier has to be placed here to ensure that any flag
       │               * variables possibly changed by this process have been flushed to main
       │               * memory, before we check/set is_set.
       │               */
       │              pg_memory_barrier();
 77.43 │        lock   addl   $0x0,(%rsp)
       │
       │              /* Quick exit if already set */
       │              if (latch->is_set)
  0.12 │        mov    (%rdi),%eax


   Commenting out the memory barrier - which is NOT CORRECT - improves
   timing:
   before: 4675.626
   after: 4125.587

   The correct fix obviously is not to break latch correctness. I think
   the big problem here is that we perform a SetLatch() for every read
   from the queue.

   I think we should
   a) introduce a batch variant for receiving like:

extern shm_mq_result shm_mq_receivev(shm_mq_handle *mqh,
                                     shm_mq_iovec *iov, int *iovcnt,
                                     bool nowait)

      which then only does the SetLatch() at the end - and maybe also
      when the read position wraps around.

   b) Use shm_mq_sendv in tqueue.c, batching up insertions into the
      queue whenever it's not empty at the time a tuple is ready.

   I've not benchmarked this, but I'm pretty certain that the benefit
   isn't just going to be the reduced cost of SetLatch() calls, but also
   increased performance due to fewer context switches.
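
   To illustrate how the leader side might consume such an API, here is a
   hypothetical caller-side sketch - shm_mq_receivev() does not exist
   today (only shm_mq_sendv() does), and process_tuple_message() is a
   stand-in for tqueue.c's handling:

shm_mq_iovec    iov[16];
int             iovcnt = lengthof(iov);
int             i;
shm_mq_result   res;

/* Hypothetical batch receive: returns up to iovcnt queued messages and
 * sets the sender's latch once, instead of once per message. */
res = shm_mq_receivev(mqh, iov, &iovcnt, true /* nowait */);
if (res == SHM_MQ_SUCCESS)
{
    for (i = 0; i < iovcnt; i++)
        process_tuple_message(iov[i].data, iov[i].len);
}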

6) I've observed, using strace, debug outputs with timings, and top with
   a short interval, that quite often only one backend has sufficient
   work, while other backends are largely idle.

   I think the problem here is that the "anti round robin" provisions from
   bc7fcab5e36b9597857, while much better than the previous state, have
   swung a bit too far in the other direction. Especially if we were
   to introduce batching as I suggest in 5), but even without it, this
   leads to back-and-forth on half-empty queues between the
   gatherstate->nextreader backend and the leader.

   I'm not 100% certain what the right fix here is.

   One fairly drastic solution would be to move away from a
   single-producer-single-consumer, per-worker queue to a global
   queue. That'd avoid fairness issues between the individual workers,
   at the price of potentially added contention. One disadvantage is that
   such a combined-queue approach is not easily applicable to gather
   merge.

   One less drastic approach would be to try to drain the queue
   fully in one batch, and then move to the next queue. That'd avoid
   triggering a small wakeup for each individual tuple, as one
   currently would get without the 'stickiness'; a sketch of that idea
   follows below.
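
   A sketch of that "drain fully, then rotate" idea, shaped loosely after
   nodeGather.c's reader handling; the function is invented for
   illustration, and the real code would still need the wait/latch logic
   for when all queues are empty:

/* Sketch only: keep pulling from the current worker's queue until it is
 * empty, then rotate to the next reader, instead of rotating per tuple. */
static HeapTuple
gather_readnext_sticky(GatherState *gatherstate)
{
    int         checked = 0;

    while (checked < gatherstate->nreaders)
    {
        TupleQueueReader *reader = gatherstate->reader[gatherstate->nextreader];
        bool        readerdone = false;
        HeapTuple   tup;

        tup = TupleQueueReaderNext(reader, true /* nowait */, &readerdone);
        if (tup != NULL)
            return tup;         /* stay on this queue while it yields data */

        /* Queue currently empty (or worker done): rotate to the next one. */
        gatherstate->nextreader =
            (gatherstate->nextreader + 1) % gatherstate->nreaders;
        checked++;
    }

    return NULL;    /* every queue was empty; caller would wait on the latch */
}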

   It might also be a good idea to use a more directed form of wakeups,
   e.g. using a per-worker latch + a wait event set, to avoid iterating
   over all workers.


Unfortunately the patch's "local worker queue" concept seems, to me,
like it's not quite addressing the structural issues, instead opting to
address them by adding another layer of queuing. I suspect that if we
went for the above solutions, there'd be only a very small benefit in
implementing such per-worker local queues.

Greetings,

Andres Freund

