The latest version of the Intel e1000 driver use an implementation which is incompatible with netpoll when compiled as NAPI enabled.

The netpoll_poll routines uses the parent netdev structure as the device that gets polled:

poll_napi:

if (test_bit(__LINK_STATE_RX_SCHED, &np->dev->state) &&
           npinfo->poll_owner != smp_processor_id() &&
           spin_trylock(&npinfo->poll_lock)) {
               npinfo->rx_flags |= NETPOLL_RX_DROP;
               atomic_inc(&trapped);

               np->dev->poll(np->dev, &budget);


However, the e1000_intr routine schedules the napi poll based on a pseudo device created by the driver.
e1000_intr:
static irqreturn_t
e1000_intr(int irq, void *data, struct pt_regs *regs)
{
       struct net_device *netdev = data;
       struct e1000_adapter *adapter = netdev_priv(netdev);
...
#else /* if !CONFIG_E1000_MQ */
       if (likely(netif_rx_schedule_prep(&adapter->polling_netdev[0])))
               __netif_rx_schedule(&adapter->polling_netdev[0]);
       else
               e1000_irq_enable(adapter);
#endif /* CONFIG_E1000_MQ */

printks inserted confirm that data != &adapter->polling_netdev[0] and the the parent netdev for the e1000 device is not on the poll list and does not have the LINK_STATE_RX_SCHED bit set.

Hence the necessary LINK_STATE_RX_SCHED flag is not set in the parent device's state field, and the e1000_clean routine is never called from netpoll:napi_poll.

I haven't looked closely at the MQ code, but I doubt the situation improves, as the whole idea with the implementation seems to be to allow multiple pseudo devices, one for each queue.

This problem does not exist in the normal napi polling sequence, as the poll list is walked, and whatever devices are on the list are polled, irrespective of whether they were the actual parent devices or driver-created pseudo devices.

The bug causes netdump to not work at all over e1000-napi, and probably leads to stalls in the case of kgdboe or netconsole as well. Unfortunately, it is not a simple coding error, but rather the case of two designs being incompatible.

I would like comments on potential solutions, none of which seem particularly appealing

1) Have netpoll:napi_poll walk the poll list like the normal dev.c code does. This seems simplest. I am concerned that it lets too much activity go on for kgdboe and netdump, since all devices get serviced that are on the poll list at the time of the system freeze for debug/dump.

2) Fix the e1000 (and other drivers that may do similar things) to be aware that it is scheduling napi-callback based on a net poll, falling back into single queue operation off of the parent device only - simple in concept, but possibly a good bit of change to the driver(s). 2.5) provide polled operation directly to the netif_rx path - probably straightforward, but adds code to the driver and requires a means to untangle any pending napi polls, which might be "interesting" in multi-queue, smp, rt_pre-emptible situations.

3) Figure out a way for np device setup to become aware of the actual device to be polled - perhaps an additional driver api - this may also require some work of #2 above, but offers an advantage of keeping the decisions out of the fast path. This requires work to several different areas, including the netpoll setup api, the driver api, and every driver, although the driver work should be relatively trivial except for multiqueue devices.

4) Some brilliant 3 line change that makes everything work which I've stupidly overlooked. (my preferred solution)

I'm willing to do the work after receiving some comments from the list. Unfortunately, I have to fix this fairly soon, so I look forward to comments in the next few days.

Thanks,
Mark Huth
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to