Dear Nick,
As the code comment suggests, we tiptoe right up to the line to extract performance. Have you tried e.g. isolcpus, thread priority, or some other expedients to make the required assumptions true?

It's easy enough to change the code in various ways so this use-case cannot backfire. High on the list: always make a working copy of the bucket, vs. updating it in place. That won't help write performance, but it's likely to make the pain go away.

Bucket-level reader-locks would involve adding Avogadro's number of atomic ops to the predominant case. I'm pretty sure that's a non-starter.

FWIW... Dave

From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Nick Zavaritsky
Sent: Monday, November 1, 2021 12:12 PM
To: vpp-dev@lists.fd.io
Subject: [vpp-dev] Bihash is considered thread-safe but probably shouldn't

Hello bihash experts!

There's an old thread claiming that bihash lookup can produce a value of -1 under intense concurrent add/delete activity: https://lists.fd.io/g/vpp-dev/message/15606

We had a seemingly related crash recently, when a lookup in snat_main.flow_hash yielded a value of -1 which was subsequently used as a destination thread index to offload to. This crash prompted me to study bihash more closely.

The rest of the message is structured as follows:

1. Presenting reasons why I believe that bihash is not thread-safe.
2. Proposing a fix.

1 Bihash is probably not thread-safe

The number of buckets in a hash table never changes. Every bucket has a lock bit.

Updates happen via clib_bihash_add_del_inline_with_hash. The function grabs the bucket lock early on and performs the update while holding the lock. Obviously this is safe; let's focus on readers.

Lookups happen via clib_bihash_search_inline_with_hash / clib_bihash_search_inline_2_with_hash. The function locates the bucket and waits until the lock bit is cleared:

  b = BV (clib_bihash_get_bucket) (h, hash);

  if (PREDICT_FALSE (BV (clib_bihash_bucket_is_empty) (b)))
    return -1;

  if (PREDICT_FALSE (b->lock))
    {
      volatile BVT (clib_bihash_bucket) * bv = b;

      while (bv->lock)
        CLIB_PAUSE ();
    }

From this point on, the function examines the data structure without ever bothering to check the lock again. Nothing prevents an updater from changing the data the reader is currently looking at, or even deallocating it right away.

The only way this could work is if we could make assumptions about the relative performance of lookup and update operations. Checking the lock early in lookup ensures that there's no update currently in progress. If lookup is faster than update, then no future updater will manage to progress to the point where the memory is written BEFORE the lookup is complete. Indeed, we have the following comment in clib_bihash_add_del_inline_with_hash:

  /*
   * Because reader threads are looking at live data,
   * we have to be extra careful. Readers do NOT hold the
   * bucket lock. We need to be SLOWER than a search, past the
   * point where readers CHECK the bucket lock.
   */

Unfortunately, the assumption doesn't hold. Any thread could get preempted at an arbitrary time. Even if we rule out preemption, there are microarchitectural quirks (e.g. caches, branch misprediction) that could slow a lookup down to the point where the memory read and the update overlap.
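[To make this failure mode tangible, here is a standalone toy reproducer, NOT bihash code: all names in it are made up for the demo, and the kvp is shrunk to two u64s. A writer updates value-then-key under a lock; a reader checks the lock exactly once before doing two separate loads, mimicking the lookup pattern above. Every individual load is atomic, yet the reader still observes key/value pairs from different generations. Build with: cc -O2 -pthread torn_read.c -o torn_read

  #include <inttypes.h>
  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdint.h>
  #include <stdio.h>

  typedef struct
  {
    _Atomic uint64_t key;
    _Atomic uint64_t value;	/* invariant: key == value after each complete update */
  } slot_t;

  static slot_t slot;
  static atomic_int lock_bit;	/* plays the role of the bucket lock */
  static atomic_int done;

  static void *
  writer (void *arg)
  {
    (void) arg;
    for (uint64_t gen = 1; gen <= 100000000ULL; gen++)
      {
        while (atomic_exchange (&lock_bit, 1))	/* grab the bucket lock */
          ;
        /* value first, then key, as in the add path quoted below */
        atomic_store_explicit (&slot.value, gen, memory_order_relaxed);
        atomic_store_explicit (&slot.key, gen, memory_order_relaxed);
        atomic_store (&lock_bit, 0);
      }
    atomic_store (&done, 1);
    return 0;
  }

  static void *
  reader (void *arg)
  {
    (void) arg;
    while (!atomic_load (&done))
      {
        while (atomic_load (&lock_bit))	/* check the lock ONCE, like lookup */
          ;
        uint64_t k = atomic_load_explicit (&slot.key, memory_order_relaxed);
        uint64_t v = atomic_load_explicit (&slot.value, memory_order_relaxed);
        if (k != v)			/* torn pair, e.g. old key with new value */
          printf ("anomaly: key=%" PRIu64 " value=%" PRIu64 "\n", k, v);
      }
    return 0;
  }

  int
  main (void)
  {
    pthread_t w, r;
    pthread_create (&w, 0, writer, 0);
    pthread_create (&r, 0, reader, 0);
    pthread_join (w, 0);
    pthread_join (r, 0);
    return 0;
  }

On a multi-core machine this typically starts printing anomalies within seconds.]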
The core of lookup is the following loop. Please note that checking a key and fetching the value is not atomic, hence if we are preempted in between the result could be bogus:

  for (i = 0; i < limit; i++)
    {
      if (BV (clib_bihash_key_compare) (v->kvp[i].key, key_result->key))
        {
          *key_result = v->kvp[i];
          return 0;
        }
    }

There are different ways the key-value pair could get updated:

(1) An add using a previously empty slot:

  clib_memcpy_fast (&(v->kvp[i].value), &add_v->value, sizeof (add_v->value));
  CLIB_MEMORY_STORE_BARRIER ();	/* Make sure the value has settled */
  clib_memcpy_fast (&(v->kvp[i]), &add_v->key, sizeof (add_v->key));

The key update is not atomic; a reader could observe a key updated half-way.

(2) An add that recycles a stale slot:

  clib_memcpy_fast (&(v->kvp[i]), add_v, sizeof (*add_v));

Yet again, not atomic. A reader could witness (old_k, new_v) or (new_k, old_v), or even an arbitrary interleaving of chunks from the old and new keys.

(3) Deleting an entry:

  clib_memset_u8 (&(v->kvp[i]), 0xff, sizeof (*(add_v)));

Not atomic.

2 A fix

It's worth noting that bihash never crashes. It does yield bogus results occasionally, though. While -1 is easy to check for, the analysis above shows that other bogus results are possible. In particular:

1. A value updated half-way, possible with bihash_8_16.
2. A key that never existed, due to a partial key update. The probability is low, since the hash should map it to the same bucket.
3. An old key matched with a new value. The probability is low, since the lookup would have to get preempted at exactly the right spot for this to happen.

Even though these anomalies are unlikely, they are still possible and exploitable. Should we consider a fix?

The proposal is to introduce read locks for buckets. An implementation favouring readers could be as follows:

Extend clib_bihash with "u64 rlocks[MAX_THREADS]". Based on its thread index, each reader publishes the bucket number it is currently examining in the respective array item. Padding is introduced to avoid false sharing.

The writer lock sequence would be: 1) lock the bucket; 2) wait until the bucket number is absent from rlocks.

The reader lock sequence would be: 1) publish the bucket number in rlocks; 2) if the bucket is not locked, then done; 3) otherwise, clear the bucket number from rlocks, wait for the bucket lock to be released, and restart. (A sketch of this scheme follows below.)

Thoughts?
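[To make the proposal concrete, a rough sketch in C11 atomics; illustrative only, not a patch: MAX_THREADS, NO_BUCKET, bihash_rlock_t and the function names are made up, vpp would use its own primitives, and the memory ordering shown is the conservative seq_cst default.

  #include <stdatomic.h>
  #include <stdint.h>

  #define MAX_THREADS 64		/* made-up bound for the sketch */
  #define NO_BUCKET   UINT64_MAX	/* "this reader is in no bucket" */

  typedef struct
  {
    /* one entry per cache line, so readers do not false-share */
    _Alignas (64) _Atomic uint64_t bucket;
  } bihash_rlock_t;

  /* Reader lock: publish the bucket number, then re-check the bucket
   * lock; if a writer holds it, retract and retry. */
  static inline void
  bihash_rlock (bihash_rlock_t *rlocks, uint32_t thread_index,
                uint64_t bucket, _Atomic uint32_t *bucket_lock)
  {
    for (;;)
      {
        atomic_store (&rlocks[thread_index].bucket, bucket);
        /* seq_cst store/load: if we see the lock clear, the writer's
         * later scan of rlocks is guaranteed to see our bucket number */
        if (!atomic_load (bucket_lock))
          return;			/* no writer: safe to examine bucket */
        atomic_store (&rlocks[thread_index].bucket, NO_BUCKET);
        while (atomic_load (bucket_lock))
          ;				/* CLIB_PAUSE () in vpp */
      }
  }

  static inline void
  bihash_runlock (bihash_rlock_t *rlocks, uint32_t thread_index)
  {
    atomic_store (&rlocks[thread_index].bucket, NO_BUCKET);
  }

  /* Writer side, called after taking the bucket lock exactly as today:
   * wait until no reader still has this bucket published. */
  static inline void
  bihash_wait_readers (bihash_rlock_t *rlocks, uint64_t bucket)
  {
    for (uint32_t t = 0; t < MAX_THREADS; t++)
      while (atomic_load (&rlocks[t].bucket) == bucket)
        ;
  }

The publish-then-check order on the reader side, paired with the lock-then-scan order on the writer side, is what closes the race: any reader the writer's scan misses must have published after the bucket lock became visible, so it will retract and retry. The price is at least one extra atomic store and one atomic load on every lookup, even when no writer is anywhere near the bucket.]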