Herbert, as you may have noticed we found some missing locking in sk_stream_rfree(). It could explain the "!sk_forward_alloc" BUG() we thought was caused by e1000's TSO implementation and the Intel folks have provided enough datapoints to prove that it is indeed not an e1000 specific problem.
sk_stream_rfree() modifies sk->sk_forward_alloc without holding any locks from the __kfree_skb() callback. Now, _most_ of the time it is OK because we're being invoked with the socket lock held as ACKs come back to liberate the transmit queue or during socket shutdown which also holds the socket locked. (I did audit this, but I may have missed something, please double check this assertion) However, if skb->users is incremented or similar, that entity could end up being the context in which the skb is freed up and that will not necessarily have the socket locked correctly when the skb->destructor callback invokes sk_stream_rfree(). Consider tcpdump or similar. But we're stuck if we try to cure this: 1) We can't take bh_lock_sock(). Well, we could if we disabled BH explicitly first but even then if the socket is locked by the user we can't tell if that's the current cpu and there is no easy way to handle deferring this work. We really can't consider using a workqueue or something like that to handle this, that's way too heavy handed. 2) We can't turn sk_forward_alloc easily into an atomic_t. The masking operation in __sk_stream_mem_reclaim() does not translate readily into an atomic_t op: sk->sk_forward_alloc &= SK_STREAM_MEM_QUANTUM - 1; That line has always driven me nuts, but I know that it is there to handle partial page allocations existing when that function is called. I guess for #2 we could change those two lines into: int n = atomic_read(&sk->sk_forward_alloc) / SK_STREAM_MEM_QUANTUM; atomic_sub(n, sk->sk_prot->memory_allocated); sk->sk_forward_alloc -= n * SK_STREAM_MEM_QUANTUM; in order to make it "atomic_t op" translatable. But even if we did this with sk->sk_forward_alloc as an atomic_t it would have the same kinds of races we're trying to cure here, namely that we test the value and then act upon it while an unlocked asynchronous context can modify the value in another thread of execution in between the test and the action. Any ideas? Come to think of it, __sk_stream_mem_reclaim() looks really racey even as-is. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html