On Mon, Apr 05, 2021 at 09:37:40AM +0200, Juergen Gross wrote:
> On 02.04.21 18:11, Paul E. McKenney wrote:
> > On Fri, Apr 02, 2021 at 05:46:52PM +0200, Juergen Gross wrote:
> > > On 30.03.21 19:33, Paul E. McKenney wrote:
> > > > On Wed, Mar 24, 2021 at 11:18:03AM +0100, Jürgen Groß wrote:
> > > > > On 02.03.21 07:28, Juergen Gross wrote:
> > > > > > In order to help identify problems with IPI handling and remote function execution, add some more data to the IPI debugging code.
> > > > > >
> > > > > > There have been multiple reports of cpus looping for long times (many seconds) in smp_call_function_many(), waiting for another cpu to execute a function like tlb flushing. Most of these reports have been for cases where the kernel was running as a guest on top of KVM or Xen (there are rumours of that happening under VMware, too, and even on bare metal).
> > > > > >
> > > > > > Finding the root cause hasn't been successful yet, even after more than 2 years of chasing this bug by different developers.
> > > > > >
> > > > > > Commit 35feb60474bf4f7 ("kernel/smp: Provide CSD lock timeout diagnostics") tried to address this by adding some debug code and by issuing another IPI when a hang was detected. This helped mitigate the problem (the repeated IPI unlocks the hang), but the root cause is still unknown.
> > > > > >
> > > > > > Currently available data suggests that either an IPI wasn't sent when it should have been, or that the IPI didn't result in the target cpu executing the queued function (due to the IPI not reaching the cpu, the IPI handler not being called, or the handler not seeing the queued request).
> > > > > >
> > > > > > Try to add more diagnostic data by introducing a global atomic counter which is incremented when performing critical operations (before and after queueing a new request, when sending an IPI, and when dequeueing a request). The counter value is stored in percpu variables which can be printed out when a hang is detected.
> > > > > >
> > > > > > The data of the last event (consisting of sequence counter, source cpu, target cpu, and event type) is stored in a global variable. When a new event is to be traced, the data of the last event is stored in the event-related percpu location and the global data is updated with the new event's data. This allows tracking two events in one data location: one by the value of the event data (the event before the current one), and one by the location itself (the current event).
> > > > > >
> > > > > > A typical printout with a detected hang will look like this:
> > > > > >
> > > > > > csd: Detected non-responsive CSD lock (#1) on CPU#1, waiting 5000000003 ns for CPU#06 scf_handler_1+0x0/0x50(0xffffa2a881bb1410).
> > > > > > 	csd: CSD lock (#1) handling prior scf_handler_1+0x0/0x50(0xffffa2a8813823c0) request.
> > > > > > 	csd: cnt(00008cc): ffff->0000 dequeue (src cpu 0 == empty)
> > > > > > 	csd: cnt(00008cd): ffff->0006 idle
> > > > > > 	csd: cnt(0003668): 0001->0006 queue
> > > > > > 	csd: cnt(0003669): 0001->0006 ipi
> > > > > > 	csd: cnt(0003e0f): 0007->000a queue
> > > > > > 	csd: cnt(0003e10): 0001->ffff ping
> > > > > > 	csd: cnt(0003e71): 0003->0000 ping
> > > > > > 	csd: cnt(0003e72): ffff->0006 gotipi
> > > > > > 	csd: cnt(0003e73): ffff->0006 handle
> > > > > > 	csd: cnt(0003e74): ffff->0006 dequeue (src cpu 0 == empty)
> > > > > > 	csd: cnt(0003e7f): 0004->0006 ping
> > > > > > 	csd: cnt(0003e80): 0001->ffff pinged
> > > > > > 	csd: cnt(0003eb2): 0005->0001 noipi
> > > > > > 	csd: cnt(0003eb3): 0001->0006 queue
> > > > > > 	csd: cnt(0003eb4): 0001->0006 noipi
> > > > > > 	csd: cnt now: 0003f00
> > > > > >
> > > > > > This example (an artificial one, produced with a previous version of this patch without the "hdlend" event) shows that cpu#6 started to handle an IPI (cnt 3e72-3e74), but didn't start to handle another IPI (sent by cpu#4, cnt 3e7f). The next request from cpu#1 for cpu#6 was queued (3eb3), but no IPI was needed (cnt 3eb4, as the event from cpu#4 was already in the queue).
> > > > > >
> > > > > > The idea is to print only relevant entries. Those are all events which are associated with the hang (so sender-side events for the source cpu of the hanging request, and receiver-side events for the target cpu), and the related events just before those (for adding data needed to identify a possible race). Printing all available data would be possible, but this would produce large amounts of output on larger configurations.
> > > > > >
> > > > > > Signed-off-by: Juergen Gross <jgr...@suse.com>
> > > > > > Tested-by: Paul E. McKenney <paul...@kernel.org>
> > > > >
> > > > > Just an update regarding the current status of debugging the underlying issue:
> > > > >
> > > > > On a customer's machine with a backport of this patch applied, we've seen another case of the hang. In the logs we've found:
> > > > >
> > > > > smp: csd: Detected non-responsive CSD lock (#1) on CPU#18, waiting 5000000046 ns for CPU#06 do_flush_tlb_all+0x0/0x30( (null)).
> > > > > smp: csd: CSD lock (#1) unresponsive.
> > > > > smp: csd: cnt(0000000): 0000->0000 queue
> > > > > smp: csd: cnt(0000001): ffff->0006 idle
> > > > > smp: csd: cnt(0025dba): 0012->0006 queue
> > > > > smp: csd: cnt(0025dbb): 0012->0006 noipi
> > > > > smp: csd: cnt(01d1333): 001a->0006 pinged
> > > > > smp: csd: cnt(01d1334): ffff->0006 gotipi
> > > > > smp: csd: cnt(01d1335): ffff->0006 handle
> > > > > smp: csd: cnt(01d1336): ffff->0006 dequeue (src cpu 0 == empty)
> > > > > smp: csd: cnt(01d1337): ffff->0006 hdlend (src cpu 0 == early)
> > > > > smp: csd: cnt(01d16cb): 0012->0005 ipi
> > > > > smp: csd: cnt(01d16cc): 0012->0006 queue
> > > > > smp: csd: cnt(01d16cd): 0012->0006 ipi
> > > > > smp: csd: cnt(01d16f3): 0012->001a ipi
> > > > > smp: csd: cnt(01d16f4): 0012->ffff ping
> > > > > smp: csd: cnt(01d1750): ffff->0018 hdlend (src cpu 0 == early)
> > > > > smp: csd: cnt(01d1751): 0012->ffff pinged
> > > > > smp: csd: cnt now: 01d1769
> > > > >
> > > > > So we see that cpu#18 (0012 hex) is waiting for cpu#06 (first line of the data).
> > > > >
> > > > > The next 4 lines of csd actions are not really interesting, as they are rather old.
> > > > >
> > > > > Then we see that cpu 0006 did handle a request rather recently (cnt 01d1333 - 01d1337): cpu 001a pinged it via an IPI, and it got the IPI, entered the handler, dequeued a request, and handled it.
> > > > >
> > > > > Nearly all of the rest shows the critical request: cpu 0012 looped over (probably) all other cpus, queued the requests, and marked them to be IPI-ed (including cpu 0006, cnt 01d16cd). Then the cpus marked to receive an IPI were pinged (cnt 01d16f4 and cnt 01d1751).
> > > > >
> > > > > The entry cnt 01d1750 is not of interest here.
> > > > >
> > > > > This data confirms that on the sending side everything seems to be okay at the level above the actual IPI sending. On the receiver side no IPI seems to have been seen, but there is no visible reason for a race either.
> > > > >
> > > > > It seems as if we need more debugging in the deeper layers: is the IPI really sent out, and is something being received on the destination cpu? I'll have another try with even more debugging code, probably in private on the customer machine first.
> > > >
> > > > Apologies for the late reply; I was out last week.
> > > >
> > > > Excellent news, and thank you!
> > > >
> > > > For my part, I have put together a rough prototype script that allows me to run scftorture on larger groups of systems and started running it, though I am hoping that 1,000 is far more than will be required.
> > > >
> > > > Your diagnosis of a lost IPI matches what we have been able to glean from the occasional occurrences in the wild on our systems, for whatever that might be worth. The hope is to get something that reproduces more quickly, which would allow deeper debugging at this end as well.
> > >
> > > Sometimes one is lucky.
> > >
> > > I've found a reproducer while hunting another bug. The test on that machine will trigger the csd_lock timeout about once a day.
> >
> > Nice!!! You are way ahead of me!
> >
> > > I've used my new debug kernel and found that the IPI is really sent out (more precisely: the hypervisor has been requested to do so, and it didn't report an error). On the target cpu no interrupt was received after that, so the IPI has not been swallowed on the target cpu by the Linux kernel.
> > >
> > > Will now try to instrument the hypervisor to get more data.
> >
> > I am increasing the number and types of systems and the test duration. I just started running three different systems with IPI workloads in both guests and within the host over the weekend.
>
> Maybe you can try my kind of workload:
>
> I have a guest with 16 vcpus and 8 GB of memory running 8 instances of
>
>   sysbench --test=fileio --file-test-mode=rndrw --rand-seed=0 --max-time=300 --max-requests=0 run
>
> on disjoint nfs mounts. Those have been created with:
>
>   mount -t nfs -o rw,proto=tcp,nolock,nfsvers=3,rsize=65536,wsize=65536,nosharetransport server:/share[1-8] /mount[1-8]
>
> with the server running on the host system of the guest and the shares located in a ramdisk.
>
> The host has 72 cpus and 48 GB of RAM.
>
> A csd lock timeout happens about once per day on the host.
Thank you! I will give it a try. It has been quite some time since I have done anything with NFS, so it should be entertaining. ;-)

							Thanx, Paul
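
[Editor's note: for readers unfamiliar with the tracing scheme described in the quoted commit message, below is a minimal userspace sketch of the idea: a global atomic sequence counter, a global "last event" record, and per-cpu, per-event slots that capture the previous global record whenever a new event is logged, so each slot encodes two events. All identifiers (evt_record, EVT_QUEUE, the bit layout, ...) are invented for illustration and do not match the actual kernel patch.]

/*
 * Illustrative sketch only; hypothetical names, not the kernel's csd
 * debugging code.  Compile with: cc -std=c11 sketch.c
 */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

enum csd_evt { EVT_QUEUE, EVT_IPI, EVT_PING, EVT_GOTIPI, EVT_HANDLE, EVT_DEQUEUE, EVT_NR };

#define NR_CPUS   8
#define CPU_NONE  0xffffU		/* "ffff" in the printouts: no specific cpu */

/*
 * Pack (sequence counter, src cpu, dst cpu, event type) into one 64-bit
 * value so the whole record can be read and written atomically.
 */
static inline uint64_t evt_pack(uint64_t seq, unsigned int src, unsigned int dst, unsigned int type)
{
	return (seq << 40) | ((uint64_t)(src & 0xffff) << 24) |
	       ((uint64_t)(dst & 0xffff) << 8) | (type & 0xff);
}

static _Atomic uint64_t evt_global;			/* most recent event, system wide */
static _Atomic uint64_t evt_counter;			/* global sequence counter */
static uint64_t evt_percpu[NR_CPUS][EVT_NR];		/* previous event, per cpu and type */

/*
 * Record an event: save the old global record in the slot for this
 * (cpu, type), then publish the new record globally.  The slot thereby
 * tracks two events: the stored value (the event just before this one)
 * and, by its location, the current event itself.
 */
static void evt_record(unsigned int cpu, unsigned int dst, enum csd_evt type)
{
	uint64_t seq = atomic_fetch_add(&evt_counter, 1);
	uint64_t rec = evt_pack(seq, cpu, dst, type);

	evt_percpu[cpu][type] = atomic_exchange(&evt_global, rec);
}

int main(void)
{
	/* cpu 1 queues a request for cpu 6 and sends an IPI; cpu 6 receives it. */
	evt_record(1, 6, EVT_QUEUE);
	evt_record(1, 6, EVT_IPI);
	evt_record(6, CPU_NONE, EVT_GOTIPI);

	/* Dump the slot a hang report would inspect (cpu 6, gotipi). */
	uint64_t prev = evt_percpu[6][EVT_GOTIPI];

	printf("csd: cnt(%07llx): %04llx->%04llx (event before cpu#6 gotipi)\n",
	       (unsigned long long)(prev >> 40),
	       (unsigned long long)((prev >> 24) & 0xffff),
	       (unsigned long long)((prev >> 8) & 0xffff));
	return 0;
}

When a hang is detected, dumping such slots for the source and target cpus of the stuck request is what produces the "csd: cnt(...)" lines shown in the quoted reports above.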