Agreed it’s unlikely, so maybe just use the 2 bits left for the epoch counter as a middle ground? The new approach should be better either way :-)

Florin
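For illustration, this is roughly what spending the 2 spare bucket bits on an epoch counter could look like; the field names and widths below are assumptions for the sketch, not the actual clib_bihash_bucket layout:

  #include <stdint.h>

  /* Sketch only: a 64-bit bucket word with a 2-bit epoch squeezed into the
   * spare bits.  Writers bump 'epoch' on every change; a reader that sees
   * b0.epoch != b1.epoch (or the lock bit set) retries its search. */
  typedef union
  {
    struct
    {
      uint64_t offset : 36;	  /* where the bucket's kv pages live (assumed width) */
      uint64_t linear_search : 1;
      uint64_t log2_pages : 8;
      uint64_t refcnt : 16;
      uint64_t lock : 1;
      uint64_t epoch : 2;	  /* the "2 bits left" used as an epoch counter */
    };
    uint64_t as_u64;
  } bucket_with_epoch_t;

A 2-bit epoch only misses the case where the number of updates between b0 and b1 is an exact multiple of 4, which is the "unlikely" case being accepted as the middle ground.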
> On Nov 3, 2021, at 11:55 AM, Damjan Marion <dmar...@me.com> wrote:
>
> What about the following: we shift offset by 6, as all buckets are aligned to
> 64 anyway, and that gives us 6 more bits, so we can have an 8-bit epoch
> counter…. ?
>
> —
> Damjan
>
>> On 03.11.2021., at 19:45, Damjan Marion <dmar...@me.com> wrote:
>>
>> yes, i am aware of that, it is extremely unlikely and the only way i can see
>> this fixed is introducing an epoch on the bucket level, but we don't have
>> enough space there….
>>
>> —
>> Damjan
>>
>>> On 03.11.2021., at 19:16, Florin Coras <fcoras.li...@gmail.com> wrote:
>>>
>>> Hi Damjan,
>>>
>>> Definitely like the scheme, but the change bit might not be enough, unless
>>> I’m misunderstanding. For instance, two consecutive updates to a bucket
>>> before the reader grabs b1 will hide the change.
>>>
>>> Florin
>>>
>>>> On Nov 3, 2021, at 9:36 AM, Damjan Marion via lists.fd.io
>>>> <dmarion=me....@lists.fd.io> wrote:
>>>>
>>>> Agree with Dave on atomic ops being bad on the reader side.
>>>>
>>>> What about the following scheme:
>>>>
>>>> As the bucket is just a u64 value, on the reader side we grab the bucket
>>>> before (b0) and after (b1) the search operation.
>>>>
>>>> If the search finds an entry, we simply do 2 checks:
>>>> - that b0 is equal to b1
>>>> - that the lock bit is not set in either of them
>>>> If a check fails, we simply retry.
>>>>
>>>> On the writer side, we have add, remove and replace operations.
>>>> The first 2 alter refcnt, which is part of the bucket.
>>>> To deal with the replace case we introduce another bit (change bit) which
>>>> is flipped every time data is changed in the bucket.
>>>>
>>>> Here are the possible scenarios:
>>>>
>>>> - reader grabs b0 before lock and b1 after unlock
>>>>   - add, del: refcnt and change bit will be different between b0 and b1,
>>>>     causing retry
>>>>   - replace: change bit will be different between b0 and b1, causing retry
>>>>
>>>> - reader grabs b0 after lock and/or b1 before unlock
>>>>   - lock bit will be set, causing retry
>>>>
>>>> Of course, for this to work properly we need to ensure proper memory
>>>> ordering (i.e. avoid the bucket change becoming visible to a remote
>>>> thread before the kvp change).
>>>>
>>>> I crafted a WIP patch to present my idea:
>>>>
>>>> https://gerrit.fd.io/r/c/vpp/+/34326
>>>>
>>>> In this patch I got rid of all store barriers and replaced them with the
>>>> more lightweight:
>>>>
>>>>   __atomic_store_n (ptr, val, __ATOMIC_RELEASE);
>>>>
>>>> On platforms with strong memory ordering (like x86_64) this will result
>>>> in just normal stores (but the compiler will know that it should not
>>>> reorder them).
>>>> On platforms with weak memory ordering (like aarch64) this will result in
>>>> a special store instruction, but that one is still cheaper than a full
>>>> memory barrier.
>>>>
>>>> Thoughts? Comments?
>>>>
>>>> Thanks,
>>>>
>>>> —
>>>> Damjan
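A minimal sketch of the reader-side b0/b1 check described above, assuming the bucket is read as a single u64; the lock-bit position and the search callback are placeholders, not the code in the gerrit patch:

  #include <stdint.h>

  #define BIHASH_LOCK_BIT (1ULL << 1)	/* assumed position of the bucket lock bit */

  /* Wrap an existing (unsynchronized) kvp scan with the b0/b1 consistency
   * check: retry whenever a writer was active or the bucket word changed. */
  static inline int
  search_with_bucket_check (volatile uint64_t *bucket_as_u64,
			    int (*search) (void *ctx), void *ctx)
  {
    uint64_t b0, b1;
    int rv;

    do
      {
	/* acquire: kvp reads inside search() cannot be hoisted above this load */
	b0 = __atomic_load_n (bucket_as_u64, __ATOMIC_ACQUIRE);
	rv = search (ctx);
	/* order the kvp reads before the second bucket read */
	__atomic_thread_fence (__ATOMIC_ACQUIRE);
	b1 = __atomic_load_n (bucket_as_u64, __ATOMIC_RELAXED);
	/* retry if a writer held the lock, or refcnt / change bit moved */
      }
    while (b0 != b1 || (b0 & BIHASH_LOCK_BIT) || (b1 & BIHASH_LOCK_BIT));

    return rv;
  }

On x86_64 the two __atomic_load_n() calls compile to plain loads, so the extra reader-side cost is essentially the compare-and-retry branch.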
>>>>
>>>>> On 02.11.2021., at 12:14, Dave Barach <v...@barachs.net> wrote:
>>>>>
>>>>> Dear Nick,
>>>>>
>>>>> As the code comment suggests, we tiptoe right up to the line to extract
>>>>> performance. Have you tried e.g. ISOLCPUS, thread priority, or some other
>>>>> expedients to make the required assumptions true?
>>>>>
>>>>> It’s easy enough to change the code in various ways so this use-case
>>>>> cannot backfire. High on the list: always make a working copy of the
>>>>> bucket, vs. update in place. Won’t help write performance, but it’s
>>>>> likely to make the pain go away.
>>>>>
>>>>> Bucket-level reader-locks would involve adding Avogadro’s number of
>>>>> atomic ops to the predominant case. I’m pretty sure that’s a non-starter.
>>>>>
>>>>> FWIW... Dave
>>>>>
>>>>> From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Nick Zavaritsky
>>>>> Sent: Monday, November 1, 2021 12:12 PM
>>>>> To: vpp-dev@lists.fd.io
>>>>> Subject: [vpp-dev] Bihash is considered thread-safe but probably shouldn't
>>>>>
>>>>> Hello bihash experts!
>>>>>
>>>>> There's an old thread claiming that bihash lookup can produce a value=-1
>>>>> under intense add/delete concurrent activity:
>>>>> https://lists.fd.io/g/vpp-dev/message/15606
>>>>>
>>>>> We had a seemingly related crash recently when a lookup in
>>>>> snat_main.flow_hash yielded a value=-1 which was subsequently used as a
>>>>> destination thread index to offload to. This crash prompted me to study
>>>>> bihash more closely.
>>>>>
>>>>> The rest of the message is structured as follows:
>>>>> 1. Presenting reasons why I believe that bihash is not thread-safe.
>>>>> 2. Proposing a fix.
>>>>>
>>>>> 1 Bihash is probably not thread-safe
>>>>>
>>>>> The number of buckets in a hash table never changes. Every bucket has a
>>>>> lock bit. Updates happen via clib_bihash_add_del_inline_with_hash. The
>>>>> function grabs the bucket lock early on and performs the update while
>>>>> holding the lock. Obviously this is safe; let's focus on readers.
>>>>>
>>>>> Lookups happen via clib_bihash_search_inline_with_hash /
>>>>> clib_bihash_search_inline_2_with_hash. The function locates the bucket
>>>>> and waits until the lock bit is cleared.
>>>>>
>>>>>   b = BV (clib_bihash_get_bucket) (h, hash);
>>>>>
>>>>>   if (PREDICT_FALSE (BV (clib_bihash_bucket_is_empty) (b)))
>>>>>     return -1;
>>>>>
>>>>>   if (PREDICT_FALSE (b->lock))
>>>>>     {
>>>>>       volatile BVT (clib_bihash_bucket) * bv = b;
>>>>>       while (bv->lock)
>>>>>         CLIB_PAUSE ();
>>>>>     }
>>>>>
>>>>> From this point on the function examines the data structure without ever
>>>>> bothering to check the lock again. Nothing prevents an updater from
>>>>> changing the data the reader is currently looking at, or even
>>>>> deallocating it right away. The only way it could work is if we could
>>>>> make assumptions about the relative performance of lookup and update
>>>>> operations. Checking the lock early in lookup ensures that there's no
>>>>> update currently in progress. If lookup is faster than update, then no
>>>>> future updater will manage to progress to the point where the memory is
>>>>> written BEFORE the lookup is complete. Indeed, we have the following
>>>>> comment in clib_bihash_add_del_inline_with_hash:
>>>>>
>>>>>   /*
>>>>>    * Because reader threads are looking at live data,
>>>>>    * we have to be extra careful. Readers do NOT hold the
>>>>>    * bucket lock. We need to be SLOWER than a search, past the
>>>>>    * point where readers CHECK the bucket lock.
>>>>>    */
>>>>>
>>>>> Unfortunately, the assumption doesn't hold. Any thread could get
>>>>> preempted at an arbitrary time. Even if we rule out preemption, there are
>>>>> microarchitectural quirks (e.g. caches, branch misprediction) that could
>>>>> slow down lookup to the point that the memory read and the update overlap.
>>>>>
>>>>> The core of lookup is the following loop. Please note that checking a key
>>>>> and fetching the value is not atomic, hence if we are preempted
>>>>> in-between the result could be bogus.
>>>>>
>>>>>   for (i = 0; i < limit; i++)
>>>>>     {
>>>>>       if (BV (clib_bihash_key_compare) (v->kvp[i].key, key_result->key))
>>>>>         {
>>>>>           *key_result = v->kvp[i];
>>>>>           return 0;
>>>>>         }
>>>>>     }
>>>>>
>>>>> Different ways the key-value pair could get updated:
>>>>>
>>>>> (1) Add using a previously empty slot:
>>>>>
>>>>>   clib_memcpy_fast (&(v->kvp[i].value),
>>>>>                     &add_v->value, sizeof (add_v->value));
>>>>>   CLIB_MEMORY_STORE_BARRIER (); /* Make sure the value has settled */
>>>>>   clib_memcpy_fast (&(v->kvp[i]), &add_v->key,
>>>>>                     sizeof (add_v->key));
>>>>>
>>>>> The key update is not atomic; a reader could observe a key updated
>>>>> half-way.
>>>>>
>>>>> (2) Add that recycles a stale slot:
>>>>>
>>>>>   clib_memcpy_fast (&(v->kvp[i]), add_v, sizeof (*add_v));
>>>>>
>>>>> Yet again not atomic. A reader could witness (old_k, new_v) or (new_k,
>>>>> old_v) or even an arbitrary interleaving of chunks from the old and new
>>>>> keys.
>>>>>
>>>>> (3) Deleting an entry:
>>>>>
>>>>>   clib_memset_u8 (&(v->kvp[i]), 0xff, sizeof (*(add_v)));
>>>>>
>>>>> Not atomic.
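A standalone sketch (plain pthreads, not VPP code) of the hazard described above: the writer updates a 16-byte key half by half, which is one way an unsynchronized 16-byte copy can land in memory, while the reader fetches the two halves separately and counts observations of a key that neither thread ever wrote:

  #include <pthread.h>
  #include <stdint.h>
  #include <stdio.h>

  typedef struct { uint64_t a, b; } demo_key_t;

  static demo_key_t slot;	/* shared "kvp" slot */

  static void *
  writer (void *arg)
  {
    volatile uint64_t *q = (volatile uint64_t *) &slot;
    (void) arg;
    for (;;)
      {
	q[0] = 0;      q[1] = 0;	/* store key {0,0} in two halves   */
	q[0] = ~0ULL;  q[1] = ~0ULL;	/* then key {~0,~0}, also unlocked */
      }
    return 0;
  }

  int
  main (void)
  {
    pthread_t t;
    volatile uint64_t *p = (volatile uint64_t *) &slot;
    uint64_t torn = 0;

    pthread_create (&t, 0, writer, 0);
    for (uint64_t i = 0; i < 100000000ULL; i++)
      {
	uint64_t a = p[0];	/* reader fetches the key half by half, */
	uint64_t b = p[1];	/* just like an unsynchronized search    */
	if (a != b)		/* neither {0,0} nor {~0,~0}: torn read  */
	  torn++;
      }
    printf ("torn reads observed: %llu\n", (unsigned long long) torn);
    return 0;
  }

Build with cc -O2 -pthread; any nonzero count is a key that was never written by either store.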
>>>>>
>>>>> 2 A fix
>>>>>
>>>>> It's worth noting that bihash never crashes. It does yield bogus results
>>>>> occasionally, though. While -1 is easy to check for, the analysis shows
>>>>> that other bogus results are possible. In particular:
>>>>>
>>>>> 1. Value updated half-way, possible with bihash_8_16.
>>>>> 2. Observing a key that never existed, due to a partial key update. The
>>>>>    probability is low since the hash should map it to the same bucket.
>>>>> 3. Old key matched with a new value. The probability is low since the
>>>>>    lookup would have to get preempted at the particular spot to make it
>>>>>    happen.
>>>>>
>>>>> Even though these anomalies are unlikely, they are still possible and
>>>>> exploitable.
>>>>>
>>>>> Should we consider a fix?
>>>>>
>>>>> The proposal is to introduce read locks for buckets. An implementation
>>>>> favouring readers could be as follows:
>>>>>
>>>>> Extend clib_bihash with "u64 rlocks[MAX_THREADS]". Based on the thread
>>>>> index, each reader publishes the bucket number it is currently examining
>>>>> in the respective array item. Padding is introduced to avoid false
>>>>> sharing.
>>>>>
>>>>> The writer lock sequence would be: 1) lock bucket; 2) wait until the
>>>>> bucket number is not in rlocks.
>>>>>
>>>>> Reader lock sequence: 1) publish bucket number in rlocks; 2) if bucket
>>>>> not locked then done; 3) otherwise clear bucket number from rlocks, wait
>>>>> for the bucket lock to be released and restart.
>>>>>
>>>>> Thoughts?
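A minimal sketch of what that rlocks scheme could look like; all names (bihash_rlocks_t, RLOCK_NONE, MAX_THREADS, the function names) are assumptions for illustration, not existing bihash API:

  #include <stdint.h>

  #define MAX_THREADS 64
  #define RLOCK_NONE  (~0ULL)	/* "no bucket published" marker */

  typedef struct
  {
    /* one cache line per reader to avoid false sharing */
    struct { volatile uint64_t bucket; uint8_t pad[56]; } rlocks[MAX_THREADS];
  } bihash_rlocks_t;

  /* Reader: publish the bucket number, then re-check the writer's lock bit.
   * Both operations are seq_cst so the store-load pair is not reordered
   * (Dekker-style flag handshake). */
  static inline void
  reader_lock (bihash_rlocks_t *r, uint32_t thread, uint64_t bucket,
	       volatile uint64_t *bucket_word, uint64_t lock_bit)
  {
    for (;;)
      {
	__atomic_store_n (&r->rlocks[thread].bucket, bucket, __ATOMIC_SEQ_CST);
	if (!(__atomic_load_n (bucket_word, __ATOMIC_SEQ_CST) & lock_bit))
	  return;		/* no writer active, safe to search */
	/* writer active: unpublish, wait for it to finish, retry */
	__atomic_store_n (&r->rlocks[thread].bucket, RLOCK_NONE, __ATOMIC_SEQ_CST);
	while (__atomic_load_n (bucket_word, __ATOMIC_SEQ_CST) & lock_bit)
	  ;
      }
  }

  static inline void
  reader_unlock (bihash_rlocks_t *r, uint32_t thread)
  {
    __atomic_store_n (&r->rlocks[thread].bucket, RLOCK_NONE, __ATOMIC_RELEASE);
  }

  /* Writer: take the bucket lock first (existing code), then wait until no
   * reader has this bucket published before touching the kvp data. */
  static inline void
  writer_wait_for_readers (bihash_rlocks_t *r, uint64_t bucket)
  {
    for (uint32_t t = 0; t < MAX_THREADS; t++)
      while (__atomic_load_n (&r->rlocks[t].bucket, __ATOMIC_SEQ_CST) == bucket)
	;
  }

The seq_cst publish store in reader_lock is the one extra ordered operation this adds to every lookup, which is exactly the reader-side cost Dave flags above.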