Hello,

On Wed, 20 May 2015, Ying Xue wrote:
> Yes, this way is absolutely safe for us! But, for example, if we check
> refcnt==0 in write_lock protection, the checking is a bit late. Instead,
> if the checking is advanced to ip_finish_output2(), we can _early_ find
> the fact that we cannot use the neigh entry looked up by
> __ipv4_neigh_lookup_noref(). Of course, doing the check with
> atomic_read() is unsafe _really_, but once atomic_read() returned
> value tells us neigh refcnt is zero, the result is absolutely true. This is

The problem is that this atomic_read() is just extra code that works
only in rare cases. Modern CPUs have 128-byte cache lines, and 'dead',
'refcnt' and 'next' can be in the same cache line. If you take into
account the write-back policy used by CPU caches and the cache locking
effects, then in most cases when such a reader finds the entry in the
list, refcnt will be > 0 and dead will be 0. The reason: the write-back
policy makes the whole cache line visible to other CPUs at once. So,
without explicit barriers, other CPUs can see the data after a delay,
and the order of stores within the same cache line is not observed. As
a result, a pure reader can not see 0 in refcnt, even if more
atomic_read() calls are put at this place; atomic_read() contains only
hints to the compiler. So, we must ensure that once a writer decides to
remove an entry, other writers do not change it.

More observations for the writers:

- neigh_lookup is initially a reader, but it can increase refcnt at any
time, even while another CPU is under write_lock. This makes all refcnt
checks in neigh_forced_gc, neigh_flush_dev and neigh_periodic_work racy,
because neigh_cleanup_and_release is called after write_unlock. If a
writer decides to remove the entry, other write operations may still
try to modify it because they rely on their own reference. What can
happen is that refcnt is 1 when it is decided to remove the entry, but
refcnt can then be increased by concurrent writers such as neigh_lookup
callers. As a result, neigh_lookup followed by neigh_update may work
with a dead=1 entry. Maybe the fix is to add a check for the dead flag
in neigh_update, after both write_lock_bh calls, to avoid modifications
of a removed entry (see the untested sketch at the end of this
message). This check again comes late, i.e. __neigh_lookup calls
neigh_lookup (which does not notice the dead flag early), so we miss
the chance to call neigh_create, and then neigh_update will fail on
dead==1.

- neigh_timer_handler looks safe; maybe it does not need a check for
the dead flag, because if neigh_flush_dev calls neigh_del_timer and
del_timer in neigh_del_timer fails with 0 (due to another CPU stuck at
write_lock in neigh_timer_handler), the 'if (!(state & NUD_IN_TIMER))'
check in neigh_timer_handler should succeed, because neigh_flush_dev
will change nud_state on success of the
'if (atomic_read(&n->refcnt) != 1)' check.

- __neigh_event_send: needs a check for the dead flag, as already
discussed.

Regards

--
Julian Anastasov <j...@ssi.bg>
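
P.S. To illustrate the neigh_update idea above, here is an untested
sketch, not a tested patch. The context lines are from my reading of
net/core/neighbour.c; only the neigh->dead check is new, and its exact
placement and the error code to return are open questions:

int neigh_update(struct neighbour *neigh, const u8 *lladdr, u8 new,
		 u32 flags)
{
	u8 old;
	int err;
	...
	write_lock_bh(&neigh->lock);

	old = neigh->nud_state;
	err = -EPERM;

	if (!(flags & NEIGH_UPDATE_F_ADMIN) &&
	    (old & (NUD_NOARP | NUD_PERMANENT)))
		goto out;
	/* New: entry was unlinked by neigh_flush_dev, neigh_forced_gc
	 * or neigh_periodic_work while we were waiting for the lock,
	 * do not modify it
	 */
	if (neigh->dead)
		goto out;
	...
out:
	write_unlock_bh(&neigh->lock);
	...
	return err;
}

The same kind of check would go into __neigh_event_send after its
write_lock_bh(&neigh->lock), so that a packet can not re-arm the timer
and state machine of an entry that is already unlinked.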