Hello,

A customer noticed a strange issue on his setup, a bonding interface
composed of two igb NICs. After several debug sessions we are fairly sure
that the reported symptom is caused by a busy loop on
igb_get_hw_semaphore(). The problem was reported on a 3.0.25 kernel, but
the patch below was written against 3.8.13.

The complete scenario is described below. There is a good chance that
this issue is only present (or at least more likely to be triggered) on
PREEMPT_RT-enabled kernels, but I would like to confirm whether this
solution is valid or whether there is a better way to mitigate the problem.

Thanks, 
Luis

----

igb: minimize busy loop on igb_get_hw_semaphore

Bugzilla: 976912

In drivers/net/ethernet/intel/igb/e1000_82575.c, in the function
igb_release_swfw_sync_82575(), there is this line:

        while (igb_get_hw_semaphore(hw) != 0);

That is basically a busy loop waiting on a HW semaphore.
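
To make the shape of that loop concrete, here is a small user-space sketch
(not driver code). The fake_* names and the polls_until_free counter are made
up for illustration; the only part taken from the driver is the structure: a
bounded inner poll that can time out, wrapped in an outer loop that retries
without limit.

```c
#include <assert.h>

/* Hypothetical stand-in for the device; not the real SWSM register. */
struct fake_hw {
    int polls_until_free;   /* simulated: semaphore frees up after this many polls */
    unsigned long polls;    /* total number of polls performed so far */
};

/*
 * Bounded inner poll, loosely modelled on igb_get_hw_semaphore():
 * returns 0 once the semaphore is free, -1 after `timeout` polls.
 */
static int fake_get_hw_semaphore(struct fake_hw *hw, int timeout)
{
    assert(timeout > 0);

    for (int i = 0; i < timeout; i++) {
        hw->polls++;
        if (hw->polls_until_free > 0)
            hw->polls_until_free--;
        if (hw->polls_until_free == 0)
            return 0;
        /* the driver waits roughly 50us here before polling again */
    }
    return -1;
}

/*
 * Outer retry loop with the same shape as
 *     while (igb_get_hw_semaphore(hw) != 0);
 * Returns how many inner attempts timed out before one succeeded.
 */
static unsigned long fake_release_sync(struct fake_hw *hw, int timeout)
{
    unsigned long failed = 0;

    while (fake_get_hw_semaphore(hw, timeout) != 0)
        failed++;
    return failed;
}
```

The inner function gives up after `timeout` polls, but the caller simply
calls it again, so the combination never stops polling until the semaphore
is released by whoever holds it.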

A customer has a setup where two igb NICs are part of a bonding interface.
This customer also runs a monitoring script that calls ifconfig often. It was
observed that in this scenario ifconfig, which holds bond->lock while
collecting statistics, can enter this busy loop waiting for another thread
to clear that HW semaphore.

Meanwhile, the irq/xxx-ethY-Tx threads, running at FIFO:85, try to acquire
the bond lock held by ifconfig. As happens on RT, a Priority Inheritance
operation is started and ifconfig is boosted to FIFO:85 so that it can
finish its work sooner and release bond->lock, which those threads are
waiting for.

As ifconfig is running in a busy loop, waiting for the HW semaphore, this
thread now spins at a very high priority, preventing other threads on that
CPU from making progress.
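
The boost described above can be pictured with a toy model in plain C. This
is not how the RT kernel implements priority inheritance (the real rtmutex
code handles waiter lists, lock chains and more); struct task, struct pi_lock
and the pi_* helpers are invented for this illustration. They capture only
the single step relevant here: the lock owner inherits the priority of a
higher-priority waiter until it releases the lock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task model: higher number means higher priority. */
struct task {
    const char *name;
    int base_prio;  /* priority the task was started with */
    int eff_prio;   /* priority the scheduler currently honours */
};

/* Hypothetical lock that remembers its owner. */
struct pi_lock {
    struct task *owner;
};

static void pi_acquire(struct pi_lock *lock, struct task *t)
{
    assert(lock->owner == NULL);
    lock->owner = t;
}

/* A waiter blocks on the lock: boost the owner if the waiter outranks it. */
static void pi_block_on(struct pi_lock *lock, const struct task *waiter)
{
    assert(lock->owner != NULL);
    if (waiter->eff_prio > lock->owner->eff_prio)
        lock->owner->eff_prio = waiter->eff_prio;
}

/* Releasing the lock drops the owner back to its base priority. */
static void pi_release(struct pi_lock *lock)
{
    assert(lock->owner != NULL);
    lock->owner->eff_prio = lock->owner->base_prio;
    lock->owner = NULL;
}
```

In this model, once a FIFO:85 thread blocks on a lock owned by ifconfig,
ifconfig itself runs at 85. If ifconfig is spinning at that point, the spin
now outranks almost everything else on that CPU.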

In that scenario, it seems that the thread holding the HW semaphore is also
waiting for a lock held by another task. The whole scenario leads to RCU
stall warnings and, as a side effect, to a growing number of stuck threads.
As this progresses, the livelock reaches threads on other CPUs and the system
becomes more and more unresponsive.

This little patch aims to prevent a busy loop running at high priority (the
code called by ifconfig in this example) from starving the other threads on
the same CPU. It may not solve the issue, but it should at least lead us
closer to the real problem, which is masked by the RCU stalls created by the
busy loop.

This is mostly a debug patch for a testing kernel.
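
For readers less familiar with the two delay primitives, the difference the
patch relies on can be sketched in user space. udelay() spins on the CPU,
while usleep_range() sleeps and lets the scheduler run something else (which
also means it may only be used where the caller is allowed to sleep). The
code below is only an analogy: busy_delay_us(), sleep_delay_us() and
poll_with_delay() are hypothetical names, and clock_gettime()/nanosleep()
stand in for the kernel helpers.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef void (*delay_fn)(long usec);

/* udelay-like: spin on the clock, keeping the CPU busy the whole time. */
static void busy_delay_us(long usec)
{
    struct timespec start, now;

    clock_gettime(CLOCK_MONOTONIC, &start);
    do {
        clock_gettime(CLOCK_MONOTONIC, &now);
    } while ((now.tv_sec - start.tv_sec) * 1000000L +
             (now.tv_nsec - start.tv_nsec) / 1000L < usec);
}

/* usleep_range-like: block, handing the CPU back to the scheduler. */
static void sleep_delay_us(long usec)
{
    struct timespec ts = {
        .tv_sec = usec / 1000000L,
        .tv_nsec = (usec % 1000000L) * 1000L,
    };

    nanosleep(&ts, NULL);
}

/*
 * Poll until *remaining reaches zero, at most `timeout` times, waiting
 * 50us between polls with the supplied delay function.
 * Returns 0 on success, -1 on timeout.
 */
static int poll_with_delay(int *remaining, int timeout, delay_fn delay)
{
    assert(timeout > 0);

    for (int i = 0; i < timeout; i++) {
        if (*remaining == 0)
            return 0;
        (*remaining)--;
        delay(50);
    }
    return -1;
}
```

Both delay functions give the same polling result; what changes is who gets
the CPU between polls. With the sleeping variant, other threads on that CPU
can run even when the polling thread has been boosted to a high priority.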

Signed-off-by: Luis Claudio R. Goncalves <[email protected]>

diff --git a/drivers/net/ethernet/intel/igb/e1000_mac.c b/drivers/net/ethernet/intel/igb/e1000_mac.c
index a5c7200..ec0be87 100644
--- a/drivers/net/ethernet/intel/igb/e1000_mac.c
+++ b/drivers/net/ethernet/intel/igb/e1000_mac.c
@@ -1225,7 +1225,7 @@ s32 igb_get_hw_semaphore(struct e1000_hw *hw)
                if (!(swsm & E1000_SWSM_SMBI))
                        break;
 
-               udelay(50);
+               usleep_range(50,51);
                i++;
        }
 
@@ -1244,7 +1244,7 @@ s32 igb_get_hw_semaphore(struct e1000_hw *hw)
                if (rd32(E1000_SWSM) & E1000_SWSM_SWESMBI)
                        break;
 
-               udelay(50);
+               usleep_range(50,51);
        }
 
        if (i == timeout) {
-- 
[ Luis Claudio R. Goncalves                    Bass - Gospel - RT ]
[ Fingerprint: 4FDD B8C4 3C59 34BD 8BE9  2696 7203 D980 A448 C8F8 ]


_______________________________________________
E1000-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit
http://communities.intel.com/community/wired
