Agree with Dave on atomic ops being bad on the reader side. What about the following scheme:
Since the bucket is just a u64 value, on the reader side we grab the bucket before (b0) and after (b1) the search operation. If the search finds an entry, we simply do two checks:
- that b0 is equal to b1
- that the lock bit is not set in either of them

If a check fails, we simply retry.

On the writer side we have add, remove and replace operations. The first two alter refcnt, which is part of the bucket. To deal with the replace case we introduce another bit (a change bit) which is flipped every time data is changed in the bucket.

Here are the possible scenarios:
- reader grabs b0 before lock and b1 after unlock
  - add, del: refcnt and the change bit will be different between b0 and b1, causing a retry
  - replace: the change bit will be different between b0 and b1, causing a retry
- reader grabs b0 after lock and/or b1 before unlock
  - the lock bit will be set, causing a retry

Of course, for this to work properly we need to ensure proper memory ordering (i.e. avoid a bucket change becoming visible to a remote thread before the kvp change).

I crafted a WIP patch to present my idea: https://gerrit.fd.io/r/c/vpp/+/34326

In this patch I get rid of all store barriers and replace them with the more lightweight:

  __atomic_store_n (ptr, val, __ATOMIC_RELEASE);

On platforms with strong memory ordering (like x86_64) this results in plain stores (but the compiler knows it must not reorder them). On platforms with weak memory ordering (like aarch64) this results in a special store instruction, which is still cheaper than a full memory barrier.
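To make the reader side concrete, the check would look roughly like this (a sketch only, not the code from the patch; the wrapper name, the as_u64 raw view of the bucket word and the change bit are assumptions for illustration):

  static inline int
  BV (search_with_seqcheck) (BVT (clib_bihash) * h, u64 hash,
                             BVT (clib_bihash_kv) * key_result)
  {
    BVT (clib_bihash_bucket) * b = BV (clib_bihash_get_bucket) (h, hash);
    BVT (clib_bihash_bucket) b0, b1;
    int rv;

    do
      {
        /* b0: the whole bucket as one u64, before the search */
        b0.as_u64 = __atomic_load_n (&b->as_u64, __ATOMIC_ACQUIRE);

        rv = BV (clib_bihash_search_inline_with_hash) (h, hash, key_result);

        /* b1: the whole bucket as one u64, after the search */
        b1.as_u64 = __atomic_load_n (&b->as_u64, __ATOMIC_ACQUIRE);

        /*
         * The result is consistent only if both snapshots are identical
         * and the lock bit is clear in both: add/del move refcnt, replace
         * flips the change bit, so any concurrent writer either makes
         * b0 != b1 or is caught with the lock bit set.
         */
      }
    while (b0.as_u64 != b1.as_u64 || b0.lock || b1.lock);

    return rv;
  }

In other words, a per-bucket sequence lock: writers still serialize on the lock bit, while readers pay only two extra loads per lookup instead of any atomic read-modify-write.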
Thoughts? Comments?

Thanks,

— Damjan

> On 02.11.2021., at 12:14, Dave Barach <v...@barachs.net> wrote:
> 
> Dear Nick,
> 
> As the code comment suggests, we tiptoe right up to the line to extract performance. Have you tried e.g. ISOLCPUS, thread priority, or some other expedients to make the required assumptions true?
> 
> It's easy enough to change the code in various ways so this use-case cannot backfire. High on the list: always make a working copy of the bucket, vs. update in place. Won't help write performance, but it's likely to make the pain go away.
> 
> Bucket-level reader-locks would involve adding Avogadro's number of atomic ops to the predominant case. I'm pretty sure that's a non-starter.
> 
> FWIW... Dave
> 
> 
> From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Nick Zavaritsky
> Sent: Monday, November 1, 2021 12:12 PM
> To: vpp-dev@lists.fd.io
> Subject: [vpp-dev] Bihash is considered thread-safe but probably shouldn't
> 
> Hello bihash experts!
> 
> There's an old thread claiming that bihash lookup can produce a value=-1 under intense add/delete concurrent activity: https://lists.fd.io/g/vpp-dev/message/15606
> 
> We had a seemingly related crash recently when a lookup in snat_main.flow_hash yielded a value=-1 which was subsequently used as a destination thread index to offload to. This crash prompted me to study bihash more closely.
> 
> The rest of the message is structured as follows:
> 1. Presenting reasons why I believe that bihash is not thread-safe.
> 2. Proposing a fix.
> 
> 1 Bihash is probably not thread-safe
> 
> The number of buckets in a hash table never changes. Every bucket has a lock bit. Updates happen via clib_bihash_add_del_inline_with_hash. The function grabs the bucket lock early on and performs the update while holding the lock. Obviously this is safe, so let's focus on readers.
> 
> Lookups happen via clib_bihash_search_inline_with_hash / clib_bihash_search_inline_2_with_hash. The function locates the bucket and waits until the lock bit is cleared.
> 
>   b = BV (clib_bihash_get_bucket) (h, hash);
> 
>   if (PREDICT_FALSE (BV (clib_bihash_bucket_is_empty) (b)))
>     return -1;
> 
>   if (PREDICT_FALSE (b->lock))
>     {
>       volatile BVT (clib_bihash_bucket) * bv = b;
>       while (bv->lock)
>         CLIB_PAUSE ();
>     }
> 
> From this point on the function examines the data structure without ever bothering to check the lock again. Nothing prevents an updater from changing the data the reader is currently looking at, or even deallocating it right away. The only way it could work is if we could make assumptions about the relative performance of lookup and update operations. Checking the lock early in lookup ensures that there's no update currently in progress. If lookup is faster than update, then no future updater will manage to progress to the point where the memory is written BEFORE the lookup was complete. Indeed, we have the following comment in clib_bihash_add_del_inline_with_hash:
> 
>   /*
>    * Because reader threads are looking at live data,
>    * we have to be extra careful. Readers do NOT hold the
>    * bucket lock. We need to be SLOWER than a search, past the
>    * point where readers CHECK the bucket lock.
>    */
> 
> Unfortunately, the assumption doesn't hold. Any thread could get preempted at an arbitrary time. Even if we rule out preemption, there are microarchitectural quirks (e.g. caches, branch misprediction) that could slow down the lookup to the point that the memory read and the update overlap.
> 
> The core of lookup is the following loop. Please note that checking a key and fetching the value is not atomic, hence if we are preempted in between the result could be bogus.
> 
>   for (i = 0; i < limit; i++)
>     {
>       if (BV (clib_bihash_key_compare) (v->kvp[i].key, key_result->key))
>         {
>           *key_result = v->kvp[i];
>           return 0;
>         }
>     }
> 
> Different ways the key-value pair could get updated:
> 
> (1) Add using a previously empty slot:
> 
>   clib_memcpy_fast (&(v->kvp[i].value), &add_v->value, sizeof (add_v->value));
>   CLIB_MEMORY_STORE_BARRIER ();  /* Make sure the value has settled */
>   clib_memcpy_fast (&(v->kvp[i]), &add_v->key, sizeof (add_v->key));
> 
> The key update is not atomic, so a reader could observe a key updated half-way.
> 
> (2) Add that recycles a stale slot:
> 
>   clib_memcpy_fast (&(v->kvp[i]), add_v, sizeof (*add_v));
> 
> Yet again not atomic. A reader could witness (old_k, new_v) or (new_k, old_v) or even an arbitrary interleaving of chunks from the old and new keys.
> 
> (3) Deleting an entry:
> 
>   clib_memset_u8 (&(v->kvp[i]), 0xff, sizeof (*(add_v)));
> 
> Not atomic.
> 
> 
> 2 A fix
> 
> It's worth noting that bihash never crashes. It does yield bogus results occasionally, though. While -1 is easy to check for, the analysis shows that other bogus results are possible. In particular:
> 
> 1. A value updated half-way, possible with bihash_8_16.
> 2. Observing a key that never existed due to a partial key update. The probability is low since the hash should map it to the same bucket.
> 3. An old key matched with a new value. The probability is low since the lookup would have to get preempted at the particular spot to make it happen.
> 
> Even though these anomalies are unlikely, they are still possible and exploitable.
> 
> Should we consider a fix?
> 
> The proposal is to introduce read locks for buckets. An implementation favouring readers could be as follows:
> 
> Extend clib_bihash with "u64 rlocks[MAX_THREADS]". Based on the thread index, each reader publishes the bucket number it is currently examining in the respective array item.
> Padding is introduced to avoid false sharing.
> 
> The writer lock sequence would be: 1) lock bucket; 2) wait until the bucket number is not in rlocks.
> 
> Reader lock sequence: 1) publish bucket number in rlocks; 2) if bucket not locked then done; 3) otherwise clear bucket number from rlocks, wait for bucket lock to be released and restart.
> 
> Thoughts?
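For concreteness, here is the reader/writer sequence described above in rough C (a sketch only; the structure layout, the helper names and the exact barriers are assumptions for illustration, not code from any existing patch):

  /* one slot per reader thread, padded to a cache line to avoid false
     sharing; the array would be sized MAX_THREADS as in the proposal */
  typedef struct
  {
    u64 bucket_plus_one;  /* bucket number + 1, 0 means "none" */
    u8 pad[CLIB_CACHE_LINE_BYTES - sizeof (u64)];
  } bihash_rlock_t;

  /* reader: publish the bucket number, then make sure no writer holds it */
  static inline void
  bihash_rlock_acquire (bihash_rlock_t * rlocks, u32 thread_index,
                        BVT (clib_bihash_bucket) * b, u64 bucket_index)
  {
    volatile BVT (clib_bihash_bucket) * bv = b;

    while (1)
      {
        /* 1) publish the bucket number */
        __atomic_store_n (&rlocks[thread_index].bucket_plus_one,
                          bucket_index + 1, __ATOMIC_RELAXED);
        /* the publish must be globally visible before we check the lock */
        __atomic_thread_fence (__ATOMIC_SEQ_CST);

        /* 2) bucket not write-locked: done */
        if (!bv->lock)
          return;

        /* 3) writer active: unpublish, wait for it, restart */
        __atomic_store_n (&rlocks[thread_index].bucket_plus_one, 0,
                          __ATOMIC_RELEASE);
        while (bv->lock)
          CLIB_PAUSE ();
      }
  }

  static inline void
  bihash_rlock_release (bihash_rlock_t * rlocks, u32 thread_index)
  {
    __atomic_store_n (&rlocks[thread_index].bucket_plus_one, 0,
                      __ATOMIC_RELEASE);
  }

  /* writer, called after the bucket lock bit is already set: wait until
     no reader has this bucket published */
  static inline void
  bihash_wlock_wait_readers (bihash_rlock_t * rlocks, u32 n_threads,
                             u64 bucket_index)
  {
    u32 i;

    __atomic_thread_fence (__ATOMIC_SEQ_CST);
    for (i = 0; i < n_threads; i++)
      while (__atomic_load_n (&rlocks[i].bucket_plus_one, __ATOMIC_ACQUIRE)
             == bucket_index + 1)
        CLIB_PAUSE ();
  }

Note that even in this reader-favouring variant every lookup still pays a store plus a fence on the reader fast path, which is exactly the per-lookup cost Dave flags above as a non-starter.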