Yes, I am aware of that. It is extremely unlikely, and the only way I can see this fixed is introducing an epoch at the bucket level, but we don't have enough space there….
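For context, the bucket word is already almost fully allocated, which is why a multi-bit epoch does not fit. Below is an approximate, illustrative sketch of the layout (modelled on the declaration in bihash_template.h; exact field widths may differ from the current tree):

    #include <stdint.h>

    typedef uint64_t u64;

    /* Approximate layout of a bihash bucket word; field widths are
       illustrative and may not match bihash_template.h exactly. */
    typedef struct
    {
      union
      {
        struct
        {
          u64 offset : 37;        /* offset of the key/value page(s) */
          u64 lock : 1;           /* writer lock bit */
          u64 linear_search : 1;
          u64 log2_pages : 8;
          u64 refcnt : 16;
        };
        u64 as_u64;               /* the whole bucket as a single word */
      };
    } bihash_bucket_sketch_t;

With roughly 63 of the 64 bits in use, there is room for a single change bit, but not for an epoch counter wide enough to survive back-to-back updates.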
—
Damjan

> On 03.11.2021., at 19:16, Florin Coras <fcoras.li...@gmail.com> wrote:
>
> Hi Damjan,
>
> Definitely like the scheme, but the change bit might not be enough, unless I’m misunderstanding. For instance, two consecutive updates to a bucket before the reader grabs b1 will hide the change.
>
> Florin
>
>> On Nov 3, 2021, at 9:36 AM, Damjan Marion via lists.fd.io <dmarion=me....@lists.fd.io> wrote:
>>
>> Agree with Dave on atomic ops being bad on the reader side.
>>
>> What about the following scheme:
>>
>> As the bucket is just a u64 value, on the reader side we grab the bucket before (b0) and after (b1) the search operation.
>>
>> If the search finds an entry, we simply do 2 checks:
>> - that b0 is equal to b1
>> - that the lock bit is not set in either of them
>> If a check fails, we simply retry.
>>
>> On the writer side, we have add, remove and replace operations. The first 2 alter the refcnt, which is part of the bucket. To deal with the replace case we introduce another bit (the change bit), which is flipped every time data is changed in the bucket.
>>
>> Here are the possible scenarios:
>>
>> - reader grabs b0 before lock and b1 after unlock
>>   - add, del - refcnt and change bit will differ between b0 and b1, causing a retry
>>   - replace - change bit will differ between b0 and b1, causing a retry
>>
>> - reader grabs b0 after lock and/or b1 before unlock
>>   - lock bit will be set, causing a retry
>>
>> Of course, for this to work properly we need to ensure proper memory ordering (i.e. avoid the bucket change becoming visible to a remote thread before the kvp change).
>>
>> I crafted a WIP patch to present my idea:
>>
>> https://gerrit.fd.io/r/c/vpp/+/34326
>>
>> In this patch I got rid of all store barriers and replaced them with the more lightweight:
>>
>> __atomic_store_n (ptr, val, __ATOMIC_RELEASE);
>>
>> On platforms with strong memory ordering (like x86_64) this will result in just normal stores (but the compiler will know that it should not reorder them). On platforms with weak memory ordering (like aarch64) this will result in a special store instruction, but that one is still cheaper than a full memory barrier.
>>
>> Thoughts? Comments?
>>
>> Thanks,
>>
>> —
>> Damjan
>>
>>> On 02.11.2021., at 12:14, Dave Barach <v...@barachs.net> wrote:
>>>
>>> Dear Nick,
>>>
>>> As the code comment suggests, we tiptoe right up to the line to extract performance. Have you tried e.g. ISOLCPUS, thread priority, or some other expedients to make the required assumptions true?
>>>
>>> It’s easy enough to change the code in various ways so this use-case cannot backfire. High on the list: always make a working copy of the bucket, vs. update in place. Won’t help write performance, but it’s likely to make the pain go away.
>>>
>>> Bucket-level reader-locks would involve adding Avogadro’s number of atomic ops to the predominant case. I’m pretty sure that’s a non-starter.
>>>
>>> FWIW... Dave
>>>
>>> From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Nick Zavaritsky
>>> Sent: Monday, November 1, 2021 12:12 PM
>>> To: vpp-dev@lists.fd.io
>>> Subject: [vpp-dev] Bihash is considered thread-safe but probably shouldn't
>>>
>>> Hello bihash experts!
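For what it's worth, here is a minimal, illustrative sketch of the reader-side double-read check described in the scheme above. It is not the code from the gerrit patch; the lock-bit mask and the search helper are invented for the example:

    #include <stdint.h>

    typedef uint64_t u64;

    #define BUCKET_LOCK_BIT (1ULL << 37)   /* hypothetical lock bit position */

    /* Stand-in for the real key/value search inside the bucket's pages. */
    extern int search_bucket (u64 hash, void *key_result);

    static int
    lookup_with_double_read (volatile u64 *bucket_word, u64 hash,
                             void *key_result)
    {
      u64 b0, b1;
      int rv;

      do
        {
          /* Snapshot the bucket word before the search... */
          b0 = __atomic_load_n (bucket_word, __ATOMIC_ACQUIRE);
          rv = search_bucket (hash, key_result);
          /* ...and again after it. */
          b1 = __atomic_load_n (bucket_word, __ATOMIC_ACQUIRE);
          /* Retry if the bucket changed under us (refcnt or change bit
             moved) or if a writer held the lock at either snapshot. */
        }
      while (b0 != b1 || (b0 & BUCKET_LOCK_BIT) || (b1 & BUCKET_LOCK_BIT));

      return rv;
    }

Note that, as Florin points out, a single change bit can alias if two updates land between the two snapshots; only a wider counter (the epoch idea above) would close that window completely.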
>>>
>>> There's an old thread claiming that bihash lookup can produce a value=-1 under intense add/delete concurrent activity:
>>> https://lists.fd.io/g/vpp-dev/message/15606
>>>
>>> We had a seemingly related crash recently when a lookup in snat_main.flow_hash yielded a value=-1, which was subsequently used as a destination thread index to offload to. This crash prompted me to study bihash more closely.
>>>
>>> The rest of the message is structured as follows:
>>> 1. Presenting reasons why I believe that bihash is not thread-safe.
>>> 2. Proposing a fix.
>>>
>>> 1 Bihash is probably not thread-safe
>>>
>>> The number of buckets in a hash table never changes. Every bucket has a lock bit. Updates happen via clib_bihash_add_del_inline_with_hash. The function grabs the bucket lock early on and performs the update while holding the lock. Obviously this is safe; let's focus on readers.
>>>
>>> Lookups happen via clib_bihash_search_inline_with_hash / clib_bihash_search_inline_2_with_hash. The function locates the bucket and waits until the lock bit is cleared.
>>>
>>>   b = BV (clib_bihash_get_bucket) (h, hash);
>>>
>>>   if (PREDICT_FALSE (BV (clib_bihash_bucket_is_empty) (b)))
>>>     return -1;
>>>
>>>   if (PREDICT_FALSE (b->lock))
>>>     {
>>>       volatile BVT (clib_bihash_bucket) * bv = b;
>>>       while (bv->lock)
>>>         CLIB_PAUSE ();
>>>     }
>>>
>>> From this point on, the function examines the data structure without ever bothering to check the lock again. Nothing prevents an updater from changing the data the reader is currently looking at, or even deallocating it right away. The only way this could work is if we could make assumptions about the relative performance of lookup and update operations. Checking the lock early in lookup ensures that there's no update currently in progress. If lookup is faster than update, then no future updater will manage to progress to the point where the memory is written BEFORE the lookup is complete. Indeed, we have the following comment in clib_bihash_add_del_inline_with_hash:
>>>
>>>   /*
>>>    * Because reader threads are looking at live data,
>>>    * we have to be extra careful. Readers do NOT hold the
>>>    * bucket lock. We need to be SLOWER than a search, past the
>>>    * point where readers CHECK the bucket lock.
>>>    */
>>>
>>> Unfortunately, the assumption doesn't hold. Any thread can get preempted at an arbitrary time. Even if we rule out preemption, there are microarchitectural quirks (e.g. caches, branch misprediction) that could slow down a lookup to the point that the memory read and the update overlap.
>>>
>>> The core of the lookup is the following loop. Please note that checking a key and fetching the value is not atomic, hence if we are preempted in between the result could be bogus.
>>>
>>>   for (i = 0; i < limit; i++)
>>>     {
>>>       if (BV (clib_bihash_key_compare) (v->kvp[i].key, key_result->key))
>>>         {
>>>           *key_result = v->kvp[i];
>>>           return 0;
>>>         }
>>>     }
>>>
>>> Different ways the key-value pair could get updated:
>>>
>>> (1) Add using a previously empty slot:
>>>
>>>   clib_memcpy_fast (&(v->kvp[i].value),
>>>                     &add_v->value, sizeof (add_v->value));
>>>   CLIB_MEMORY_STORE_BARRIER (); /* Make sure the value has settled */
>>>   clib_memcpy_fast (&(v->kvp[i]), &add_v->key,
>>>                     sizeof (add_v->key));
>>>
>>> The key update is not atomic; a reader could observe a key updated half-way.
>>>
>>> (2) Add that recycles a stale slot:
>>>
>>>   clib_memcpy_fast (&(v->kvp[i]), add_v, sizeof (*add_v));
>>>
>>> Yet again not atomic. A reader could witness (old_k, new_v) or (new_k, old_v), or even an arbitrary interleaving of chunks from the old and new keys.
>>>
>>> (3) Deleting an entry:
>>>
>>>   clib_memset_u8 (&(v->kvp[i]), 0xff, sizeof (*(add_v)));
>>>
>>> Not atomic.
>>>
>>> 2 A fix
>>>
>>> It's worth noting that bihash never crashes. It does yield bogus results occasionally, though. While -1 is easy to check for, the analysis shows that other bogus results are possible. In particular:
>>>
>>> 1. A value updated half-way, possible with bihash_8_16.
>>> 2. Observing a key that never existed, due to a partial key update. The probability is low since the hash should map it to the same bucket.
>>> 3. An old key matched with a new value. The probability is low since the lookup would have to get preempted at the particular spot to make it happen.
>>>
>>> Even though these anomalies are unlikely, they are still possible and exploitable.
>>>
>>> Should we consider a fix?
>>>
>>> The proposal is to introduce read locks for buckets. An implementation favouring readers could be as follows:
>>>
>>> Extend clib_bihash with "u64 rlocks[MAX_THREADS]". Based on the thread index, each reader publishes the bucket number it is currently examining in the respective array item. Padding is introduced to avoid false sharing.
>>>
>>> The writer lock sequence would be: 1) lock the bucket; 2) wait until the bucket number is not in rlocks.
>>>
>>> The reader lock sequence: 1) publish the bucket number in rlocks; 2) if the bucket is not locked, then done; 3) otherwise clear the bucket number from rlocks, wait for the bucket lock to be released, and restart.
>>>
>>> Thoughts?
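For illustration, here is a rough sketch of the per-thread rlocks idea proposed above. All names, sizes and the padding are invented for the example; this is not working VPP code:

    #include <stdint.h>

    typedef uint8_t u8;
    typedef uint32_t u32;
    typedef uint64_t u64;

    #define MAX_THREADS 64
    #define CACHE_LINE_BYTES 64
    #define NO_BUCKET (~0ULL)

    /* One slot per reader thread, padded to avoid false sharing. */
    typedef struct
    {
      volatile u64 bucket;
      u8 pad[CACHE_LINE_BYTES - sizeof (u64)];
    } reader_slot_t;

    typedef struct
    {
      reader_slot_t rlocks[MAX_THREADS];
    } reader_locks_t;

    /* Reader: publish the bucket number, then check the writer lock.
       SEQ_CST so the publish is globally visible before the lock check. */
    static inline void
    reader_lock (reader_locks_t *rl, u32 thread_index, u64 bucket_index,
                 volatile u64 *bucket_word, u64 lock_bit)
    {
      for (;;)
        {
          __atomic_store_n (&rl->rlocks[thread_index].bucket, bucket_index,
                            __ATOMIC_SEQ_CST);
          if (!(__atomic_load_n (bucket_word, __ATOMIC_ACQUIRE) & lock_bit))
            return;             /* not write-locked: safe to read the bucket */
          /* A writer holds the bucket: unpublish, wait, retry. */
          __atomic_store_n (&rl->rlocks[thread_index].bucket, NO_BUCKET,
                            __ATOMIC_RELEASE);
          while (__atomic_load_n (bucket_word, __ATOMIC_ACQUIRE) & lock_bit)
            ;
        }
    }

    static inline void
    reader_unlock (reader_locks_t *rl, u32 thread_index)
    {
      __atomic_store_n (&rl->rlocks[thread_index].bucket, NO_BUCKET,
                        __ATOMIC_RELEASE);
    }

    /* Writer (called with the bucket lock already held): wait until no
       reader has this bucket published before mutating it in place. */
    static inline void
    writer_wait_for_readers (reader_locks_t *rl, u64 bucket_index)
    {
      for (u32 i = 0; i < MAX_THREADS; i++)
        while (__atomic_load_n (&rl->rlocks[i].bucket, __ATOMIC_ACQUIRE) ==
               bucket_index)
          ;
    }

As Dave notes above, even uncontended publishes add ordered atomic operations to every lookup, which is the main performance objection to this approach.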