Dear Nick,

It would be great if you could support your proposal with running code, so we
can understand exactly what it means.

— 
Damjan

> On 04.11.2021., at 19:24, Nick Zavaritsky <nick.zavarit...@emnify.com> wrote:
> 
>  Hi, thanks for the insightful discussion!
> 
> I do understand that high performance is one of the most important goals of 
> vpp, so certain solutions might not fly. From my POV, the version counter 
> would be an improvement: it definitely decreases the probability of 
> triggering the bug.
> 
> Concerning isolcpus: currently this is presented as an optimisation, not a 
> prerequisite. Without isolcpus, a thread could get preempted for an 
> arbitrarily long time, meaning that no matter how many bits we allocate for 
> the version field, occasionally they won't be enough.
> 
> I’d love to have something that’s robust no matter how the threads are 
> scheduled. Would it be possible to use vpp benchmarking lab to evaluate the 
> performance impact of the proposed solutions?
> 
> Finally, I'd like to rehash the reader lock proposal. The idea was that we 
> don't introduce any atomic operations in the reader path. A reader 
> *publishes* the bucket number it is about to examine in an int 
> rlock[MAX_THREADS] array. Every thread uses a distinct cell in rlock 
> (determined by the thread id), so publishing can be a regular write followed 
> by a barrier. False sharing is eliminated with padding.
> 
> Writer locks a bucket as currently implemented (CAS) and then waits until the 
> bucket number disappears from rlock[].
> 
> The reader publishes the bucket number and then checks whether the bucket is 
> locked (regular write, barrier, regular read). It is good to go if the 
> bucket is not locked; otherwise it removes the bucket number from rlock, 
> waits for the lock to get released, and restarts.
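> 
> A minimal sketch of the reader path (rlock_cell_t, reader_lock and the 
> padding size are illustrative, not from the current bihash code):
> 
>  /* one padded cell per thread, to avoid false sharing */
>  typedef struct { volatile u32 bucket; u8 pad[60]; } rlock_cell_t;
> 
>  static inline void
>  reader_lock (rlock_cell_t * cell, BVT (clib_bihash_bucket) * b, u32 bi)
>  {
>    while (1)
>      {
>        cell->bucket = bi;       /* publish: a regular store... */
>        /* ...followed by a store-load barrier, so the publish is visible
>           before we read the lock bit (a release store would not order a
>           store against a later load) */
>        __atomic_thread_fence (__ATOMIC_SEQ_CST);
>        if (!b->lock)            /* regular read: good to go */
>          return;
>        cell->bucket = ~0;       /* back off: unpublish, wait, restart */
>        while (b->lock)
>          CLIB_PAUSE ();
>      }
>  }
> 
> Unlocking is just "cell->bucket = ~0;" after the search.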
> 
> The proposal doesn't introduce any new atomic operations. There still might 
> be a slowdown due to cache line ping-pong in the rlock array. In the worst 
> case it costs us one extra cache miss for the reader, and that could be 
> coalesced with the bucket prefetch, making it essentially free (few if any 
> bihash users prefetch buckets).
> 
> Best,
> 
> Nick
> 
> 
>>> On 3. Nov 2021, at 21:29, Florin Coras via lists.fd.io 
>>> <fcoras.lists=gmail....@lists.fd.io> wrote:
>>> 
>>> 
>>> Agreed, it's unlikely, so maybe just use the 2 bits left for the epoch 
>>> counter as a middle ground? The new approach should be better either way 
>>> :-)
>>> 
>>> Florin
>>> 
>>> 
>>> On Nov 3, 2021, at 11:55 AM, Damjan Marion <dmar...@me.com> wrote:
>>> 
>>> What about the following: we shift the offset by 6, since all buckets are 
>>> aligned to 64 anyway, and that gives us 6 more bits, so we can have an 
>>> 8-bit epoch counter?
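>>> 
>>> Something like this, as a sketch (the exact field widths here are 
>>> illustrative; the point is only that 64-byte alignment makes the low 6 
>>> offset bits redundant):
>>> 
>>> typedef union
>>> {
>>>   struct
>>>   {
>>>     u64 offset : 31;        /* byte offset stored shifted right by 6 */
>>>     u64 epoch : 8;          /* reclaimed bits, bumped on every change */
>>>     u64 lock : 1;
>>>     u64 linear_search : 1;
>>>     u64 log2_pages : 8;
>>>     u64 refcnt : 15;
>>>   };
>>>   u64 as_u64;
>>> } bucket_with_epoch_t;
>>> 
>>> /* reconstructing the byte offset */
>>> #define BUCKET_OFFSET(b) (((u64) (b)->offset) << 6)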
>>> 
>>> — 
>>> Damjan
>>> 
>>>> On 03.11.2021., at 19:45, Damjan Marion <dmar...@me.com> wrote:
>>>> 
>>>> 
>>>> 
>>>> Yes, I am aware of that. It is extremely unlikely, and the only way I 
>>>> can see this fixed is introducing an epoch at the bucket level, but we 
>>>> don't have enough space there…. 
>>>> 
>>>> — 
>>>> Damjan
>>>> 
>>>>> On 03.11.2021., at 19:16, Florin Coras <fcoras.li...@gmail.com> wrote:
>>>>> 
>>>>> Hi Damjan, 
>>>>> 
>>>>> Definitely like the scheme, but the change bit might not be enough, 
>>>>> unless I'm misunderstanding. For instance, two consecutive updates to a 
>>>>> bucket before the reader grabs b1 will hide the change (the bit flips 
>>>>> twice and ends up back at its original value). 
>>>>> 
>>>>> Florin
>>>>> 
>>>>>> On Nov 3, 2021, at 9:36 AM, Damjan Marion via lists.fd.io 
>>>>>> <dmarion=me....@lists.fd.io> wrote:
>>>>>> 
>>>>>> 
>>>>>> Agree with Dave on atomic ops being bad on the reader side.
>>>>>> 
>>>>>> What about the following scheme:
>>>>>> 
>>>>>> As the bucket is just a u64 value, on the reader side we grab the 
>>>>>> bucket before (b0) and after (b1) the search operation.
>>>>>> 
>>>>>> If the search finds an entry, we simply do 2 checks:
>>>>>> - that b0 is equal to b1
>>>>>> - that the lock bit is not set in either of them
>>>>>> If either check fails, we simply retry.
>>>>>> 
>>>>>> On the writer side, we have add, remove and replace operations.
>>>>>> The first 2 alter refcnt, which is part of the bucket.
>>>>>> To deal with the replace case we introduce another bit (a change bit) 
>>>>>> which is flipped every time data is changed in the bucket.
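>>>>>> 
>>>>>> For the replace path that looks roughly like this (a sketch; the field 
>>>>>> name "change" is illustrative):
>>>>>> 
>>>>>> /* kvp already rewritten while holding the lock; flip the change bit
>>>>>>    and clear the lock in a single release store, so the kvp writes
>>>>>>    cannot become visible after the new bucket value */
>>>>>> tmp.as_u64 = b->as_u64;
>>>>>> tmp.change ^= 1;
>>>>>> tmp.lock = 0;
>>>>>> __atomic_store_n (&b->as_u64, tmp.as_u64, __ATOMIC_RELEASE);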
>>>>>> 
>>>>>> Here are possible scenarios:
>>>>>> 
>>>>>> - reader grabs b0 before lock and b1 after unlock
>>>>>>    - add, del - refcnt and change bit will be different between b0 and 
>>>>>> b1 causing retry
>>>>>>    - replace - change bit will be different between b0 and b1 causing 
>>>>>> retry
>>>>>> 
>>>>>> - reader grabs b0 after lock and/or b1 before unlock
>>>>>>    - lock bit will be set causing retry  
>>>>>> 
>>>>>> Of course, for this to work properly we need to ensure proper memory 
>>>>>> ordering (i.e. avoid the bucket change becoming visible to a remote 
>>>>>> thread before the kvp change).
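>>>>>> 
>>>>>> In pseudo-C, the reader side would be roughly (a sketch, not the actual 
>>>>>> patch; search_kvp is a hypothetical stand-in for the existing 
>>>>>> key-compare loop):
>>>>>> 
>>>>>> BVT (clib_bihash_bucket) b0, b1;
>>>>>> int rv;
>>>>>> 
>>>>>> do
>>>>>>   {
>>>>>>     /* snapshot the bucket word before the search; the acquire load
>>>>>>        pairs with the writer's release store */
>>>>>>     b0.as_u64 = __atomic_load_n (&b->as_u64, __ATOMIC_ACQUIRE);
>>>>>>     rv = search_kvp (v, key_result);
>>>>>>     /* keep the search reads ordered before the second snapshot */
>>>>>>     __atomic_thread_fence (__ATOMIC_ACQUIRE);
>>>>>>     b1.as_u64 = __atomic_load_n (&b->as_u64, __ATOMIC_RELAXED);
>>>>>>   }
>>>>>> /* any intervening write makes b0 != b1 (add/del alter refcnt, replace
>>>>>>    flips the change bit); the lock checks catch a writer that was
>>>>>>    already in progress */
>>>>>> while (b0.as_u64 != b1.as_u64 || b0.lock || b1.lock);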
>>>>>> 
>>>>>> I crafted WIP patch to present my idea:
>>>>>> 
>>>>>> https://gerrit.fd.io/r/c/vpp/+/34326
>>>>>> 
>>>>>> In this patch I got rid of all store barriers and replaced them with 
>>>>>> the more lightweight:
>>>>>> 
>>>>>> __atomic_store_n (ptr, val, __ATOMIC_RELEASE);
>>>>>> 
>>>>>> On platforms with strong memory ordering (like x86_64) this will 
>>>>>> result in just normal stores (but the compiler will know that it 
>>>>>> should not reorder them).
>>>>>> On platforms with weak memory ordering (like aarch64) this will result 
>>>>>> in a special store instruction, but that one is still cheaper than a 
>>>>>> full memory barrier.
>>>>>> 
>>>>>> Thoughts? Comments?
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> — 
>>>>>> Damjan
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On 02.11.2021., at 12:14, Dave Barach <v...@barachs.net> wrote:
>>>>>>> 
>>>>>>> Dear Nick,
>>>>>>> 
>>>>>>> As the code comment suggests, we tiptoe right up to the line to extract 
>>>>>>> performance. Have you tried e.g. ISOLCPUS, thread priority, or some 
>>>>>>> other expedients to make the required assumptions true?
>>>>>>> 
>>>>>>> It's easy enough to change the code in various ways so this use-case 
>>>>>>> cannot backfire. High on the list: always make a working copy of the 
>>>>>>> bucket, vs. updating in place. It won't help write performance, but 
>>>>>>> it's likely to make the pain go away.
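>>>>>>> 
>>>>>>> The copy-on-update idea, roughly (a sketch; the helpers are 
>>>>>>> hypothetical, and safe reclamation of the old pages is the hard part 
>>>>>>> left out here):
>>>>>>> 
>>>>>>> /* writer: never modify live kvp pages in place */
>>>>>>> new_pages = alloc_pages (h, b->log2_pages);   /* hypothetical */
>>>>>>> clib_memcpy_fast (new_pages, old_pages, page_bytes);
>>>>>>> apply_add_del (new_pages, add_v, is_add);     /* mutate the copy */
>>>>>>> /* publish the copy with one release store of the bucket word:
>>>>>>>    readers see either the old pages or the new ones, never a mix */
>>>>>>> tmp = *b;
>>>>>>> tmp.offset = offset_of (h, new_pages);        /* hypothetical */
>>>>>>> __atomic_store_n (&b->as_u64, tmp.as_u64, __ATOMIC_RELEASE);
>>>>>>> free_when_no_readers (h, old_pages);          /* hypothetical */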
>>>>>>> 
>>>>>>> Bucket-level reader-locks would involve adding Avogadro’s number of 
>>>>>>> atomic ops to the predominant case. I’m pretty sure that’s a 
>>>>>>> non-starter.
>>>>>>> 
>>>>>>> FWIW... Dave
>>>>>>> 
>>>>>>> 
>>>>>>> From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Nick 
>>>>>>> Zavaritsky
>>>>>>> Sent: Monday, November 1, 2021 12:12 PM
>>>>>>> To: vpp-dev@lists.fd.io
>>>>>>> Subject: [vpp-dev] Bihash is considered thread-safe but probably 
>>>>>>> shouldn't
>>>>>>> 
>>>>>>> Hello bihash experts!
>>>>>>> 
>>>>>>> There's an old thread claiming that bihash lookup can produce a 
>>>>>>> value=-1 under intense concurrent add/delete activity: 
>>>>>>> https://lists.fd.io/g/vpp-dev/message/15606
>>>>>>> 
>>>>>>> We had a seemingly related crash recently when a lookup in 
>>>>>>> snat_main.flow_hash yielded a value=-1 which was subsequently used as a 
>>>>>>> destination thread index to offload to. This crash prompted me to study 
>>>>>>> bihash more closely.
>>>>>>> 
>>>>>>> The rest of the message is structured as follows:
>>>>>>>  1. Presenting reasons why I believe that bihash is not thread-safe.
>>>>>>>  2. Proposing a fix.
>>>>>>> 
>>>>>>> 1 Bihash is probably not thread-safe
>>>>>>> 
>>>>>>> The number of buckets in a hash table never changes. Every bucket has a 
>>>>>>> lock bit. Updates happen via clib_bihash_add_del_inline_with_hash. The 
>>>>>>> function grabs the bucket lock early on and performs the update while 
>>>>>>> holding the lock. Obviously this is safe; let's focus on readers.
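>>>>>>> 
>>>>>>> For reference, the writer side spins with a CAS on the bucket's lock 
>>>>>>> bit, roughly like this (a sketch of what clib_bihash_lock_bucket does, 
>>>>>>> not the exact vppinfra code):
>>>>>>> 
>>>>>>>  static inline void
>>>>>>>  bucket_lock (BVT (clib_bihash_bucket) * b)
>>>>>>>  {
>>>>>>>    BVT (clib_bihash_bucket) cur, locked;
>>>>>>>    do
>>>>>>>      {
>>>>>>>        cur.as_u64 = b->as_u64;
>>>>>>>        cur.lock = 0;          /* only succeed if currently unlocked */
>>>>>>>        locked.as_u64 = cur.as_u64;
>>>>>>>        locked.lock = 1;
>>>>>>>      }
>>>>>>>    while (!__atomic_compare_exchange_n (&b->as_u64, &cur.as_u64,
>>>>>>>                                         locked.as_u64, 0 /* strong */,
>>>>>>>                                         __ATOMIC_ACQUIRE,
>>>>>>>                                         __ATOMIC_RELAXED));
>>>>>>>  }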
>>>>>>> 
>>>>>>> Lookups happen via clib_bihash_search_inline_with_hash / 
>>>>>>> clib_bihash_search_inline_2_with_hash. The function locates the bucket 
>>>>>>> and waits until the lock bit is cleared.
>>>>>>> 
>>>>>>>  b = BV (clib_bihash_get_bucket) (h, hash);
>>>>>>> 
>>>>>>>  if (PREDICT_FALSE (BV (clib_bihash_bucket_is_empty) (b)))
>>>>>>>    return -1;
>>>>>>> 
>>>>>>>  if (PREDICT_FALSE (b->lock))
>>>>>>>    {
>>>>>>>      volatile BVT (clib_bihash_bucket) * bv = b;
>>>>>>>      while (bv->lock)
>>>>>>>        CLIB_PAUSE ();
>>>>>>>    }
>>>>>>> 
>>>>>>> From this point on, the function examines the data structure without 
>>>>>>> ever bothering to check the lock again. Nothing prevents an updater 
>>>>>>> from changing the data the reader is currently looking at, or even 
>>>>>>> deallocating it right away. The only way this could work is if we 
>>>>>>> could make assumptions about the relative performance of lookup and 
>>>>>>> update operations. Checking the lock early in lookup ensures that 
>>>>>>> there's no update currently in progress. If lookup is faster than 
>>>>>>> update, then no future updater will manage to progress to the point 
>>>>>>> where the memory is written BEFORE the lookup is complete. Indeed, we 
>>>>>>> have the following comment in clib_bihash_add_del_inline_with_hash:
>>>>>>> 
>>>>>>>      /*
>>>>>>>       * Because reader threads are looking at live data,
>>>>>>>       * we have to be extra careful. Readers do NOT hold the
>>>>>>>       * bucket lock. We need to be SLOWER than a search, past the
>>>>>>>       * point where readers CHECK the bucket lock.
>>>>>>>       */
>>>>>>> 
>>>>>>> Unfortunately, the assumption doesn't hold. Any thread could get 
>>>>>>> preempted at an arbitrary time. Even if we rule out preemption, there 
>>>>>>> are microarchitectural quirks (e.g. caches, branch misprediction) that 
>>>>>>> could slow down a lookup to the point that the memory read and the 
>>>>>>> update overlap. 
>>>>>>> 
>>>>>>> The core of lookup is the following loop. Please note that checking a 
>>>>>>> key and fetching the value are not atomic; hence, if we are preempted 
>>>>>>> in between, the result could be bogus.
>>>>>>> 
>>>>>>>  for (i = 0; i < limit; i++)
>>>>>>>    {
>>>>>>>      if (BV (clib_bihash_key_compare) (v->kvp[i].key, key_result->key))
>>>>>>>        {
>>>>>>>          *key_result = v->kvp[i];
>>>>>>>          return 0;
>>>>>>>        }
>>>>>>>    }
>>>>>>> 
>>>>>>> Different ways the key-value pair could get updated:
>>>>>>> 
>>>>>>> (1) Add using a previously empty slot:
>>>>>>> 
>>>>>>>              clib_memcpy_fast (&(v->kvp[i].value),
>>>>>>>                                &add_v->value, sizeof (add_v->value));
>>>>>>>              CLIB_MEMORY_STORE_BARRIER ();     /* Make sure the value 
>>>>>>> has settled */
>>>>>>>              clib_memcpy_fast (&(v->kvp[i]), &add_v->key,
>>>>>>>                                sizeof (add_v->key));
>>>>>>> 
>>>>>>> The key update is not atomic; a reader could observe a key updated 
>>>>>>> half-way.
>>>>>>> 
>>>>>>> (2) Add that recycles a stale slot:
>>>>>>> 
>>>>>>>                  clib_memcpy_fast (&(v->kvp[i]), add_v, sizeof 
>>>>>>> (*add_v));
>>>>>>> 
>>>>>>> Yet again, not atomic. A reader could witness (old_k, new_v) or 
>>>>>>> (new_k, old_v), or even an arbitrary interleaving of chunks from the 
>>>>>>> old and new keys.
>>>>>>> 
>>>>>>> (3) Deleting an entry:
>>>>>>> 
>>>>>>> clib_memset_u8 (&(v->kvp[i]), 0xff, sizeof (*(add_v)));
>>>>>>> 
>>>>>>> Not atomic.
>>>>>>> 
>>>>>>> 
>>>>>>> 2 A fix
>>>>>>> 
>>>>>>> It's worth noting that bihash never crashes; it does yield bogus 
>>>>>>> results occasionally, though. While -1 is easy to check for, the 
>>>>>>> analysis shows that other bogus results are possible. In particular:
>>>>>>> 
>>>>>>>  1. A value updated half-way, possible with bihash_8_16.
>>>>>>>  2. Observing a key that never existed, due to a partial key update. 
>>>>>>> The probability is low, since such a key would also have to hash to 
>>>>>>> the same bucket.
>>>>>>>  3. An old key matched with a new value. The probability is low, since 
>>>>>>> the lookup would have to get preempted at the particular spot to make 
>>>>>>> it happen.
>>>>>>> 
>>>>>>> Even though these anomalies are unlikely, they are still possible and 
>>>>>>> exploitable.
>>>>>>> 
>>>>>>> Should we consider a fix?
>>>>>>> 
>>>>>>> The proposal is to introduce read locks for buckets. An implementation 
>>>>>>> favouring readers could be as follows:
>>>>>>> 
>>>>>>> Extend clib_bihash with "u64 rlocks[MAX_THREADS]". Based on its thread 
>>>>>>> index, each reader publishes the bucket number it is currently 
>>>>>>> examining in the respective array item. Padding is introduced to avoid 
>>>>>>> false sharing.
>>>>>>> 
>>>>>>> The writer lock sequence would be: 1) lock the bucket; 2) wait until 
>>>>>>> the bucket number is absent from rlocks.
>>>>>>> 
>>>>>>> The reader lock sequence: 1) publish the bucket number in rlocks; 2) if 
>>>>>>> the bucket is not locked, done; 3) otherwise clear the bucket number 
>>>>>>> from rlocks, wait for the bucket lock to be released, and restart. See 
>>>>>>> the sketch below.
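>>>>>>> 
>>>>>>> A sketch of the writer side (assuming the rlocks array above; the 
>>>>>>> reader side mirrors the sketch further up the thread):
>>>>>>> 
>>>>>>> /* writer: take the bucket lock as today, then drain readers */
>>>>>>> bucket_lock (b);          /* existing CAS lock on the lock bit */
>>>>>>> for (t = 0; t < MAX_THREADS; t++)
>>>>>>>   /* new readers will see the lock bit and back off; we only have to
>>>>>>>      wait out those that published before we took the lock */
>>>>>>>   while (__atomic_load_n (&h->rlocks[t], __ATOMIC_ACQUIRE) == bi)
>>>>>>>     CLIB_PAUSE ();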
>>>>>>> 
>>>>>>> Thoughts?
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>> 
>> 
>> 
> 