Let me resurrect this discussion, as my patch is still in Gerrit. Nick, any chance you can submit your proposal as discussed below?
Thanks,

—
Damjan

> On 04.11.2021., at 19:35, Damjan Marion via lists.fd.io
> <dmarion=me....@lists.fd.io> wrote:
>
> Dear Nick,
>
> It would be great if you could support your proposal with running code, so
> we can understand exactly what it means.
>
> —
> Damjan
>
>> On 04.11.2021., at 19:24, Nick Zavaritsky <nick.zavarit...@emnify.com> wrote:
>>
>> Hi, thanks for an insightful discussion!
>>
>> I do understand that high performance is one of the most important goals
>> of vpp, therefore certain solutions might not fly. From my POV, the
>> version counter would be an improvement. It definitely decreases the
>> probability of triggering the bug.
>>
>> Concerning isolcpus, currently this is presented as an optimisation, not
>> a prerequisite. Without isolcpus, a thread could get preempted for
>> arbitrarily long, meaning that no matter how many bits we allocate for
>> the version field, occasionally they won't be enough.
>>
>> I'd love to have something that's robust no matter how the threads are
>> scheduled. Would it be possible to use the vpp benchmarking lab to
>> evaluate the performance impact of the proposed solutions?
>>
>> Finally, I'd like to rehash the reader lock proposal. The idea was that
>> we don't introduce any atomic operations in the reader path. A reader
>> *publishes* the bucket number it is about to examine in an
>> int rlock[MAX_THREADS] array. Every thread uses a distinct cell in rlock
>> (determined by the thread id), therefore it can be a regular write
>> followed by a barrier. False sharing is eliminated with padding.
>>
>> The writer locks a bucket as currently implemented (CAS) and then waits
>> until the bucket number disappears from rlock[].
>>
>> The reader publishes the bucket number and then checks whether the bucket
>> is locked (regular write, barrier, regular read). Good to go if not
>> locked; otherwise remove the bucket number from rlock, wait for the lock
>> to get released, and restart.
>>
>> The proposal doesn't introduce any new atomic operations. There still
>> might be a slowdown due to cache line ping-pong in the rlock array. In
>> the worst case it costs us one extra cache miss for the reader. That
>> could be coalesced with the bucket prefetch, making it essentially free
>> (few if any bihash users prefetch buckets).
>>
>> Best,
>>
>> Nick
>>
>>
>>> On 3. Nov 2021, at 21:29, Florin Coras via lists.fd.io
>>> <fcoras.lists=gmail....@lists.fd.io> wrote:
>>>
>>> Agreed it's unlikely, so maybe just use the 2 bits left for the epoch
>>> counter as a middle ground? The new approach should be better either
>>> way :-)
>>>
>>> Florin
>>>
>>>> On Nov 3, 2021, at 11:55 AM, Damjan Marion <dmar...@me.com> wrote:
>>>>
>>>> What about the following: we shift the offset by 6, as all buckets are
>>>> aligned to 64 anyway, and that gives us 6 more bits, so we can have an
>>>> 8-bit epoch counter… ?
>>>>
>>>> —
>>>> Damjan
>>>>
>>>>> On 03.11.2021., at 19:45, Damjan Marion <dmar...@me.com> wrote:
>>>>>
>>>>> Yes, I am aware of that. It is extremely unlikely, and the only way I
>>>>> can see this fixed is introducing an epoch at the bucket level, but we
>>>>> don't have enough space there…
>>>>>
>>>>> —
>>>>> Damjan
>>>>>
>>>>>> On 03.11.2021., at 19:16, Florin Coras <fcoras.li...@gmail.com> wrote:
>>>>>>
>>>>>> Hi Damjan,
>>>>>>
>>>>>> Definitely like the scheme, but the change bit might not be enough,
>>>>>> unless I'm misunderstanding. For instance, two consecutive updates to
>>>>>> a bucket before the reader grabs b1 will hide the change.
>>>>>>
>>>>>> Florin
>>>>>>
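To make the epoch-counter idea above concrete: with the offset stored shifted
by 6, the bucket word could look roughly like this. This is an illustrative
sketch only; the field widths are not the actual layout from the patch.

#include <vppinfra/types.h>

/* Illustrative layout only -- not the actual clib_bihash bucket.
 * Buckets are 64-byte aligned, so the offset can be stored >> 6,
 * freeing 6 bits; together with the 2 spare bits that gives an
 * 8-bit epoch. */
typedef union
{
  struct
  {
    u64 offset : 36;	  /* bucket data address >> 6 */
    u64 lock : 1;	  /* writer lock bit */
    u64 linear_search : 1;
    u64 epoch : 8;	  /* bumped on every add/del/replace */
    u64 log2_pages : 8;
    u64 refcnt : 10;
  };
  u64 as_u64;
} demo_bihash_bucket_t;

/* A reader snapshotting the bucket word before (b0) and after (b1) the
 * search now only misses a concurrent change if a multiple of 256
 * updates hits the bucket inside that window, vs. any 2 updates with a
 * single change bit. */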
>>>>>>> On Nov 3, 2021, at 9:36 AM, Damjan Marion via lists.fd.io
>>>>>>> <dmarion=me....@lists.fd.io> wrote:
>>>>>>>
>>>>>>> Agree with Dave on atomic ops being bad on the reader side.
>>>>>>>
>>>>>>> What about the following scheme:
>>>>>>>
>>>>>>> As the bucket is just a u64 value, on the reader side we grab the
>>>>>>> bucket before (b0) and after (b1) the search operation.
>>>>>>>
>>>>>>> If the search finds an entry, we simply do 2 checks:
>>>>>>> - that b0 is equal to b1
>>>>>>> - that the lock bit is not set in either of them
>>>>>>> If a check fails, we simply retry.
>>>>>>>
>>>>>>> On the writer side, we have add, remove and replace operations.
>>>>>>> The first 2 alter refcnt, which is part of the bucket.
>>>>>>> To deal with the replace case we introduce another bit (change bit),
>>>>>>> which is flipped every time data is changed in the bucket.
>>>>>>>
>>>>>>> Here are the possible scenarios:
>>>>>>>
>>>>>>> - reader grabs b0 before lock and b1 after unlock
>>>>>>>   - add, del - refcnt and change bit will differ between b0 and b1,
>>>>>>>     causing a retry
>>>>>>>   - replace - change bit will differ between b0 and b1, causing a
>>>>>>>     retry
>>>>>>> - reader grabs b0 after lock and/or b1 before unlock
>>>>>>>   - lock bit will be set, causing a retry
>>>>>>>
>>>>>>> Of course, for this to work properly we need to ensure proper memory
>>>>>>> ordering (i.e. avoid the bucket change becoming visible to a remote
>>>>>>> thread before the kvp change).
>>>>>>>
>>>>>>> I crafted a WIP patch to present the idea:
>>>>>>>
>>>>>>> https://gerrit.fd.io/r/c/vpp/+/34326
>>>>>>>
>>>>>>> In this patch I got rid of all store barriers and replaced them with
>>>>>>> the more lightweight:
>>>>>>>
>>>>>>> __atomic_store_n (ptr, val, __ATOMIC_RELEASE);
>>>>>>>
>>>>>>> On platforms with strong memory ordering (like x86_64) this results
>>>>>>> in just normal stores (but the compiler knows it must not reorder
>>>>>>> them). On platforms with weak memory ordering (like aarch64) this
>>>>>>> results in a special store instruction, which is still cheaper than
>>>>>>> a full memory barrier.
>>>>>>>
>>>>>>> Thoughts? Comments?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> —
>>>>>>> Damjan
>>>>>>>
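Restated as a standalone sketch, the reader side of this scheme would look
roughly like the following; the helper names and the lock bit position are
illustrative, and the real code is in the gerrit patch above.

#include <vppinfra/types.h>

#define DEMO_LOCK_BIT (1ULL << 36)	/* bit position illustrative */

/* Snapshot the bucket word before (b0) and after (b1) scanning the
 * kvps; retry on any sign of a concurrent writer. do_search () stands
 * in for the existing kvp scan. */
static inline int
demo_search_stable (u64 *bucket, int (*do_search) (void *), void *arg)
{
  u64 b0, b1;
  int rv;
  do
    {
      /* acquire: the kvp reads below cannot be hoisted above this load */
      b0 = __atomic_load_n (bucket, __ATOMIC_ACQUIRE);
      rv = do_search (arg);
      /* acquire fence: kvp reads must complete before the 2nd snapshot */
      __atomic_thread_fence (__ATOMIC_ACQUIRE);
      b1 = __atomic_load_n (bucket, __ATOMIC_RELAXED);
    }
  while (b0 != b1 || (b0 & DEMO_LOCK_BIT));
  return rv;
}

Because the retry condition already requires b0 == b1, checking the lock bit
in b0 alone covers both snapshots; the writer pairs with this by publishing
the kvp stores before the bucket word, using
__atomic_store_n (bucket, new_val, __ATOMIC_RELEASE).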
>>>>>>>> On 02.11.2021., at 12:14, Dave Barach <v...@barachs.net> wrote:
>>>>>>>>
>>>>>>>> Dear Nick,
>>>>>>>>
>>>>>>>> As the code comment suggests, we tiptoe right up to the line to
>>>>>>>> extract performance. Have you tried e.g. isolcpus, thread priority,
>>>>>>>> or some other expedients to make the required assumptions true?
>>>>>>>>
>>>>>>>> It's easy enough to change the code in various ways so this use-case
>>>>>>>> cannot backfire. High on the list: always make a working copy of the
>>>>>>>> bucket, vs. updating in place. That won't help write performance,
>>>>>>>> but it's likely to make the pain go away.
>>>>>>>>
>>>>>>>> Bucket-level reader locks would involve adding Avogadro's number of
>>>>>>>> atomic ops to the predominant case. I'm pretty sure that's a
>>>>>>>> non-starter.
>>>>>>>>
>>>>>>>> FWIW... Dave
>>>>>>>>
>>>>>>>> From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Nick Zavaritsky
>>>>>>>> Sent: Monday, November 1, 2021 12:12 PM
>>>>>>>> To: vpp-dev@lists.fd.io
>>>>>>>> Subject: [vpp-dev] Bihash is considered thread-safe but probably shouldn't
>>>>>>>>
>>>>>>>> Hello bihash experts!
>>>>>>>>
>>>>>>>> There's an old thread claiming that bihash lookup can produce a
>>>>>>>> value=-1 under intense concurrent add/delete activity:
>>>>>>>> https://lists.fd.io/g/vpp-dev/message/15606
>>>>>>>>
>>>>>>>> We had a seemingly related crash recently when a lookup in
>>>>>>>> snat_main.flow_hash yielded a value=-1 which was subsequently used
>>>>>>>> as a destination thread index to offload to. This crash prompted me
>>>>>>>> to study bihash more closely.
>>>>>>>>
>>>>>>>> The rest of the message is structured as follows:
>>>>>>>> 1. Presenting reasons why I believe that bihash is not thread-safe.
>>>>>>>> 2. Proposing a fix.
>>>>>>>>
>>>>>>>> 1 Bihash is probably not thread-safe
>>>>>>>>
>>>>>>>> The number of buckets in a hash table never changes. Every bucket
>>>>>>>> has a lock bit. Updates happen via
>>>>>>>> clib_bihash_add_del_inline_with_hash. The function grabs the bucket
>>>>>>>> lock early on and performs the update while holding the lock.
>>>>>>>> Obviously this is safe; let's focus on readers.
>>>>>>>>
>>>>>>>> Lookups happen via clib_bihash_search_inline_with_hash /
>>>>>>>> clib_bihash_search_inline_2_with_hash. The function locates the
>>>>>>>> bucket and waits until the lock bit is cleared:
>>>>>>>>
>>>>>>>> b = BV (clib_bihash_get_bucket) (h, hash);
>>>>>>>>
>>>>>>>> if (PREDICT_FALSE (BV (clib_bihash_bucket_is_empty) (b)))
>>>>>>>>   return -1;
>>>>>>>>
>>>>>>>> if (PREDICT_FALSE (b->lock))
>>>>>>>>   {
>>>>>>>>     volatile BVT (clib_bihash_bucket) * bv = b;
>>>>>>>>     while (bv->lock)
>>>>>>>>       CLIB_PAUSE ();
>>>>>>>>   }
>>>>>>>>
>>>>>>>> From this point on, the function examines the data structure without
>>>>>>>> ever checking the lock again. Nothing prevents an updater from
>>>>>>>> changing the data the reader is currently looking at, or even
>>>>>>>> deallocating it right away. The only way this could work is if we
>>>>>>>> could make assumptions about the relative performance of lookup and
>>>>>>>> update operations: checking the lock early in lookup ensures that
>>>>>>>> there's no update currently in progress, and if lookup is faster
>>>>>>>> than update, no future updater will manage to progress to the point
>>>>>>>> where the memory is written BEFORE the lookup is complete. Indeed,
>>>>>>>> we have the following comment in
>>>>>>>> clib_bihash_add_del_inline_with_hash:
>>>>>>>>
>>>>>>>> /*
>>>>>>>>  * Because reader threads are looking at live data,
>>>>>>>>  * we have to be extra careful. Readers do NOT hold the
>>>>>>>>  * bucket lock. We need to be SLOWER than a search, past the
>>>>>>>>  * point where readers CHECK the bucket lock.
>>>>>>>>  */
>>>>>>>>
>>>>>>>> Unfortunately, the assumption doesn't hold. Any thread can get
>>>>>>>> preempted at an arbitrary time. Even if we rule out preemption,
>>>>>>>> there are microarchitectural quirks (e.g. caches, branch
>>>>>>>> misprediction) that could slow down lookup to the point that the
>>>>>>>> memory read and the update overlap.
>>>>>>>>
>>>>>>>> The core of lookup is the following loop. Note that checking a key
>>>>>>>> and fetching the value is not atomic, hence if we are preempted in
>>>>>>>> between, the result could be bogus.
>>>>>>>> for (i = 0; i < limit; i++)
>>>>>>>>   {
>>>>>>>>     if (BV (clib_bihash_key_compare) (v->kvp[i].key, key_result->key))
>>>>>>>>       {
>>>>>>>>         *key_result = v->kvp[i];
>>>>>>>>         return 0;
>>>>>>>>       }
>>>>>>>>   }
>>>>>>>>
>>>>>>>> Different ways the key-value pair can get updated:
>>>>>>>>
>>>>>>>> (1) Add using a previously empty slot:
>>>>>>>>
>>>>>>>> clib_memcpy_fast (&(v->kvp[i].value),
>>>>>>>>                   &add_v->value, sizeof (add_v->value));
>>>>>>>> CLIB_MEMORY_STORE_BARRIER (); /* Make sure the value has settled */
>>>>>>>> clib_memcpy_fast (&(v->kvp[i]), &add_v->key,
>>>>>>>>                   sizeof (add_v->key));
>>>>>>>>
>>>>>>>> The key update is not atomic; a reader could observe a key updated
>>>>>>>> half-way.
>>>>>>>>
>>>>>>>> (2) Add that recycles a stale slot:
>>>>>>>>
>>>>>>>> clib_memcpy_fast (&(v->kvp[i]), add_v, sizeof (*add_v));
>>>>>>>>
>>>>>>>> Yet again not atomic. A reader could witness (old_k, new_v) or
>>>>>>>> (new_k, old_v), or even an arbitrary interleaving of chunks from
>>>>>>>> the old and new keys.
>>>>>>>>
>>>>>>>> (3) Deleting an entry:
>>>>>>>>
>>>>>>>> clib_memset_u8 (&(v->kvp[i]), 0xff, sizeof (*(add_v)));
>>>>>>>>
>>>>>>>> Not atomic either.
>>>>>>>>
>>>>>>>> 2 A fix
>>>>>>>>
>>>>>>>> It's worth noting that bihash never crashes. It does yield bogus
>>>>>>>> results occasionally, though. While -1 is easy to check for, the
>>>>>>>> analysis shows that other bogus results are possible. In particular:
>>>>>>>>
>>>>>>>> 1. A value updated half-way, possible with bihash_8_16.
>>>>>>>> 2. Observing a key that never existed, due to a partial key update.
>>>>>>>>    The probability is low, since the hash should map it to the same
>>>>>>>>    bucket.
>>>>>>>> 3. An old key matched with a new value. The probability is low,
>>>>>>>>    since the lookup has to get preempted at exactly the right spot
>>>>>>>>    to make it happen.
>>>>>>>>
>>>>>>>> Even though these anomalies are unlikely, they are still possible
>>>>>>>> and exploitable.
>>>>>>>>
>>>>>>>> Should we consider a fix?
>>>>>>>>
>>>>>>>> The proposal is to introduce read locks for buckets. An
>>>>>>>> implementation favouring readers could be as follows:
>>>>>>>>
>>>>>>>> Extend clib_bihash with "u64 rlocks[MAX_THREADS]". Based on the
>>>>>>>> thread index, each reader publishes the bucket number it is
>>>>>>>> currently examining in the respective array item. Padding is
>>>>>>>> introduced to avoid false sharing.
>>>>>>>>
>>>>>>>> The writer lock sequence would be: 1) lock the bucket; 2) wait
>>>>>>>> until the bucket number is not in rlocks.
>>>>>>>>
>>>>>>>> Reader lock sequence: 1) publish the bucket number in rlocks;
>>>>>>>> 2) if the bucket is not locked, then done; 3) otherwise clear the
>>>>>>>> bucket number from rlocks, wait for the bucket lock to be released,
>>>>>>>> and restart.
>>>>>>>>
>>>>>>>> Thoughts?
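For completeness, a rough sketch of the reader-lock proposal quoted above;
the MAX_THREADS bound, the padding and all names are illustrative, and
CLIB_PAUSE is vppinfra's spin-wait hint.

#include <vppinfra/types.h>
#include <vppinfra/lock.h>	/* CLIB_PAUSE */

#define DEMO_MAX_THREADS 64
#define DEMO_NO_BUCKET	 (~0ULL)
#define DEMO_LOCK_BIT	 (1ULL << 36)	/* bit position illustrative */

/* One cache line per thread so readers never false-share. */
typedef struct
{
  volatile u64 bucket;
  u8 pad[64 - sizeof (u64)];
} demo_rlock_t;

static demo_rlock_t demo_rlocks[DEMO_MAX_THREADS];

/* Reader: publish the bucket number, then check that no writer holds
 * the bucket -- plain stores and loads plus one store-load fence, no
 * atomic read-modify-write. */
static inline void
demo_reader_lock (u32 thread_index, u64 bucket_index,
		  volatile u64 *bucket_word)
{
  while (1)
    {
      demo_rlocks[thread_index].bucket = bucket_index;
      /* store-load barrier: publish before inspecting the lock bit */
      __atomic_thread_fence (__ATOMIC_SEQ_CST);
      if (!(*bucket_word & DEMO_LOCK_BIT))
	return;
      /* writer active: unpublish, wait for unlock, retry */
      demo_rlocks[thread_index].bucket = DEMO_NO_BUCKET;
      while (*bucket_word & DEMO_LOCK_BIT)
	CLIB_PAUSE ();
    }
}

static inline void
demo_reader_unlock (u32 thread_index)
{
  demo_rlocks[thread_index].bucket = DEMO_NO_BUCKET;
}

/* Writer: take the bucket lock with CAS as today, then wait until no
 * reader still has this bucket number published. */
static inline void
demo_writer_wait_readers (u64 bucket_index)
{
  u32 t;
  for (t = 0; t < DEMO_MAX_THREADS; t++)
    while (demo_rlocks[t].bucket == bucket_index)
      CLIB_PAUSE ();
}

On the reader fast path this adds one plain store, one fence and one extra
cache line touch in the rlocks array, which matches the one-extra-cache-miss
estimate in the quoted proposal.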