Agreed it’s unlikely, so maybe just use the 2 bits left for the epoch counter 
as a middle ground? The new approach should be better either way :-)
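
A minimal sketch of what that middle ground could look like (field widths here 
are my assumption, not the actual bucket layout):

typedef union
{
  struct
  {
    u64 offset : 37;	  /* illustrative widths */
    u64 lock : 1;
    u64 linear_search : 1;
    u64 log2_pages : 8;
    u64 refcnt : 15;
    u64 epoch : 2;	  /* bumped on every change; wraps after 4 updates, so
			     only a multiple-of-4 burst of changes between b0
			     and b1 can still go unnoticed */
  };
  u64 as_u64;
} bucket_t;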

Florin


> On Nov 3, 2021, at 11:55 AM, Damjan Marion <dmar...@me.com> wrote:
> 
> What about the following: we shift the offset by 6, as all buckets are 
> aligned to 64 anyway, and that gives us 6 more bits, so we can have an 8-bit 
> epoch counter?
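> 
> A rough sketch of the shifted-offset idea (field names and widths are mine, 
> not the patch): since bucket data is 64-byte aligned, we store offset >> 6 
> and reclaim 6 bits for a wider epoch:
> 
> typedef union
> {
>   struct
>   {
>     u64 offset6 : 31;	  /* byte offset >> 6 */
>     u64 lock : 1;
>     u64 linear_search : 1;
>     u64 log2_pages : 8;
>     u64 refcnt : 15;
>     u64 epoch : 8;	  /* wraps only after 256 back-to-back updates */
>   };
>   u64 as_u64;
> } bucket_t;
> 
> /* reconstruct the real byte offset */
> u64 off = (u64) b.offset6 << 6;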
> 
> — 
> Damjan
> 
>> On 03.11.2021., at 19:45, Damjan Marion <dmar...@me.com> wrote:
>> 
>> 
>> 
>> Yes, I am aware of that. It is extremely unlikely, and the only way I can 
>> see this fixed is introducing an epoch at the bucket level, but we don’t 
>> have enough space there…. 
>> 
>> — 
>> Damjan
>> 
>>> On 03.11.2021., at 19:16, Florin Coras <fcoras.li...@gmail.com> wrote:
>>> 
>>> Hi Damjan, 
>>> 
>>> Definitely like the scheme, but the change bit might not be enough, unless 
>>> I’m misunderstanding. For instance, two consecutive updates to a bucket 
>>> before the reader grabs b1 will hide the change. 
>>> 
>>> Florin
>>> 
>>>> On Nov 3, 2021, at 9:36 AM, Damjan Marion via lists.fd.io 
>>>> <dmarion=me....@lists.fd.io> wrote:
>>>> 
>>>> 
>>>> Agree with Dave on atomic ops being bad on the reader side.
>>>> 
>>>> What about the following scheme:
>>>> 
>>>> As the bucket is just a u64 value, on the reader side we grab the bucket 
>>>> before (b0) and after (b1) the search operation.
>>>> 
>>>> If the search finds an entry, we simply do 2 checks:
>>>> - that b0 is equal to b1
>>>> - that the lock bit is not set in either of them
>>>> If either check fails, we simply retry.
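>>>> 
>>>> A minimal sketch of that reader loop (the lock-bit mask and search helper 
>>>> are hypothetical names, not the actual patch):
>>>> 
>>>> u64 b0, b1;
>>>> int rv;
>>>> 
>>>> do
>>>>   {
>>>>     /* snapshot the bucket word before and after the search */
>>>>     b0 = __atomic_load_n (&b->as_u64, __ATOMIC_ACQUIRE);
>>>>     rv = search_kvp (h, b, key, result);
>>>>     b1 = __atomic_load_n (&b->as_u64, __ATOMIC_ACQUIRE);
>>>>   }
>>>> while (b0 != b1 || (b0 & BUCKET_LOCK_BIT)); /* changed or locked: retry */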
>>>> 
>>>> On the writer side, we have add, remove and replace operations.
>>>> The first 2 alter the refcnt, which is part of the bucket.
>>>> To deal with the replace case we introduce another bit (a change bit) 
>>>> which is flipped every time data is changed in the bucket.
>>>> 
>>>> Here are possible scenarios:
>>>> 
>>>> - reader grabs b0 before lock and b1 after unlock
>>>>    - add, del - refcnt and change bit will be different between b0 and b1 
>>>> causing retry
>>>>    - replace - change bit will be different between b0 and b1 causing retry
>>>> 
>>>> - reader grabs b0 after lock and/or b1 before unlock
>>>>    - lock bit will be set causing retry  
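>>>> 
>>>> A sketch of the writer side for the replace case (the bit masks are 
>>>> assumed names, not the actual patch):
>>>> 
>>>> /* spin until we own the bucket lock */
>>>> while (__atomic_fetch_or (&b->as_u64, BUCKET_LOCK_BIT, __ATOMIC_ACQUIRE)
>>>>        & BUCKET_LOCK_BIT)
>>>>   CLIB_PAUSE ();
>>>> 
>>>> clib_memcpy_fast (&v->kvp[i], add_v, sizeof (*add_v)); /* mutate kvp */
>>>> 
>>>> /* publish: flip the change bit, clear the lock; RELEASE orders the
>>>>    kvp stores above before this bucket store */
>>>> u64 nb = (b->as_u64 ^ BUCKET_CHANGE_BIT) & ~BUCKET_LOCK_BIT;
>>>> __atomic_store_n (&b->as_u64, nb, __ATOMIC_RELEASE);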
>>>> 
>>>> Of course, for this to work properly we need to ensure proper memory 
>>>> ordering (i.e. avoid the bucket change becoming visible to a remote 
>>>> thread before the kvp change).
>>>> 
>>>> I crafted WIP patch to present my idea:
>>>> 
>>>> https://gerrit.fd.io/r/c/vpp/+/34326
>>>> 
>>>> In this patch I got rid of all store barriers and replaced them with the 
>>>> more lightweight:
>>>> 
>>>> __atomic_store_n (ptr, val, __ATOMIC_RELEASE);
>>>> 
>>>> On platforms with strong memory ordering (like x86_64) this will result 
>>>> in just normal stores (but the compiler will know that it should not 
>>>> reorder them).
>>>> On platforms with weak memory ordering (like aarch64) this will result in 
>>>> a special store instruction, but that one is still cheaper than a full 
>>>> memory barrier.
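>>>> 
>>>> For completeness, a tiny sketch of the pairing this relies on: the kvp 
>>>> store is sequenced before the release store, so any reader whose acquire 
>>>> load observes the new bucket word also observes the kvp store.
>>>> 
>>>> /* writer */
>>>> v->kvp[i].value = new_value;                          /* plain store */
>>>> __atomic_store_n (&b->as_u64, nb, __ATOMIC_RELEASE);  /* publish */
>>>> 
>>>> /* reader */
>>>> u64 seen = __atomic_load_n (&b->as_u64, __ATOMIC_ACQUIRE);
>>>> /* if seen == nb, the value store above is guaranteed visible */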
>>>> 
>>>> Thoughts? Comments?
>>>> 
>>>> Thanks,
>>>> 
>>>> — 
>>>> Damjan
>>>> 
>>>> 
>>>> 
>>>>> On 02.11.2021., at 12:14, Dave Barach <v...@barachs.net> wrote:
>>>>> 
>>>>> Dear Nick,
>>>>> 
>>>>> As the code comment suggests, we tiptoe right up to the line to extract 
>>>>> performance. Have you tried e.g. ISOLCPUS, thread priority, or some other 
>>>>> expedients to make the required assumptions true?
>>>>> 
>>>>> It’s easy enough to change the code in various ways so this use-case 
>>>>> cannot backfire. High on the list: always make a working copy of the 
>>>>> bucket, vs. update in place. Won’t help write performance, but it’s 
>>>>> likely to make the pain go away.
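>>>>> 
>>>>> One possible reading of the working-copy idea, as a sketch (the allocator 
>>>>> and offset helpers are hypothetical; deferred reclamation is elided):
>>>>> 
>>>>> /* mutate a private copy, then atomically repoint the bucket so
>>>>>    readers only ever see a fully-formed page */
>>>>> new_v = value_page_alloc (h, b->log2_pages);
>>>>> clib_memcpy_fast (new_v, old_v, page_bytes);
>>>>> new_v->kvp[i] = *add_v;
>>>>> 
>>>>> new_b = *b;
>>>>> new_b.offset = value_page_offset (h, new_v);
>>>>> __atomic_store_n (&b->as_u64, new_b.as_u64, __ATOMIC_RELEASE);
>>>>> /* old_v may only be freed after in-flight readers have drained */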
>>>>> 
>>>>> Bucket-level reader-locks would involve adding Avogadro’s number of 
>>>>> atomic ops to the predominant case. I’m pretty sure that’s a non-starter.
>>>>> 
>>>>> FWIW... Dave
>>>>> 
>>>>> 
>>>>> From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Nick 
>>>>> Zavaritsky
>>>>> Sent: Monday, November 1, 2021 12:12 PM
>>>>> To: vpp-dev@lists.fd.io
>>>>> Subject: [vpp-dev] Bihash is considered thread-safe but probably shouldn't
>>>>> 
>>>>> Hello bihash experts!
>>>>> 
>>>>> There's an old thread claiming that bihash lookup can produce a value=-1 
>>>>> under intense add/delete concurrent activity: 
>>>>> https://lists.fd.io/g/vpp-dev/message/15606
>>>>> 
>>>>> We had a seemingly related crash recently when a lookup in 
>>>>> snat_main.flow_hash yielded a value=-1 which was subsequently used as a 
>>>>> destination thread index to offload to. This crash prompted me to study 
>>>>> bihash more closely.
>>>>> 
>>>>> The rest of the message is structured as follows:
>>>>>  1. Presenting reasons why I believe that bihash is not thread-safe.
>>>>>  2. Proposing a fix.
>>>>> 
>>>>> 1 Bihash is probably not thread-safe
>>>>> 
>>>>> The number of buckets in a hash table never changes. Every bucket has a 
>>>>> lock bit. Updates happen via clib_bihash_add_del_inline_with_hash. The 
>>>>> function grabs the bucket lock early on and performs the update while 
>>>>> holding the lock. Obviously this is safe; let's focus on readers.
>>>>> 
>>>>> Lookups happen via clib_bihash_search_inline_with_hash / 
>>>>> clib_bihash_search_inline_2_with_hash. The function locates the bucket 
>>>>> and waits until the lock bit is cleared.
>>>>> 
>>>>>  b = BV (clib_bihash_get_bucket) (h, hash);
>>>>> 
>>>>>  if (PREDICT_FALSE (BV (clib_bihash_bucket_is_empty) (b)))
>>>>>    return -1;
>>>>> 
>>>>>  if (PREDICT_FALSE (b->lock))
>>>>>    {
>>>>>      volatile BVT (clib_bihash_bucket) * bv = b;
>>>>>      while (bv->lock)
>>>>>        CLIB_PAUSE ();
>>>>>    }
>>>>> 
>>>>> From this point on the function examines the data structure without ever 
>>>>> bothering to check the lock again. Nothing prevents an updater from 
>>>>> changing the data the reader is currently looking at, or even 
>>>>> deallocating it right away. The only way it could work is if we could 
>>>>> make assumptions about relative performance of lookup and update 
>>>>> operations. Checking the lock early in lookup ensures that there's no 
>>>>> update currently in progress. If lookup is faster than update, then no 
>>>>> future updater will manage to progress to the point where the memory is 
>>>>> written BEFORE the lookup was complete. Indeed, we have the following 
>>>>> comment in clib_bihash_add_del_inline_with_hash:
>>>>> 
>>>>>      /*
>>>>>       * Because reader threads are looking at live data,
>>>>>       * we have to be extra careful. Readers do NOT hold the
>>>>>       * bucket lock. We need to be SLOWER than a search, past the
>>>>>       * point where readers CHECK the bucket lock.
>>>>>       */
>>>>> 
>>>>> Unfortunately, the assumption doesn't hold. Any thread could get 
>>>>> preempted at an arbitrary time. Even if we rule out preemption, there are 
>>>>> microarchitectural quirks (e.g. caches, branch misprediction) that could 
>>>>> slow down a lookup to the point that the memory read and the update 
>>>>> overlap. 
>>>>> 
>>>>> The core of lookup is the following loop. Please note that checking a key 
>>>>> and fetching the value is not atomic, hence if we are preempted in 
>>>>> between, the result could be bogus.
>>>>> 
>>>>>  for (i = 0; i < limit; i++)
>>>>>    {
>>>>>      if (BV (clib_bihash_key_compare) (v->kvp[i].key, key_result->key))
>>>>>        {
>>>>>          *key_result = v->kvp[i];
>>>>>          return 0;
>>>>>        }
>>>>>    }
>>>>> 
>>>>> Different ways the key-value pair could get updated:
>>>>> 
>>>>> (1) Add using a previously empty slot:
>>>>> 
>>>>>              clib_memcpy_fast (&(v->kvp[i].value),
>>>>>                                &add_v->value, sizeof (add_v->value));
>>>>>              CLIB_MEMORY_STORE_BARRIER ();     /* Make sure the value has 
>>>>> settled */
>>>>>              clib_memcpy_fast (&(v->kvp[i]), &add_v->key,
>>>>>                                sizeof (add_v->key));
>>>>> 
>>>>> The key update is not atomic; a reader could observe a key updated 
>>>>> half-way.
>>>>> 
>>>>> (2) Add that recycles a stale slot:
>>>>> 
>>>>>                  clib_memcpy_fast (&(v->kvp[i]), add_v, sizeof (*add_v));
>>>>> 
>>>>> Yet again not atomic. A reader could witness (old_k, new_v) or (new_k, 
>>>>> old_v) or even an arbitrary interleaving of chunks from old and new keys.
>>>>> 
>>>>> (3) Deleting an entry:
>>>>> 
>>>>> clib_memset_u8 (&(v->kvp[i]), 0xff, sizeof (*(add_v)));
>>>>> 
>>>>> Not atomic.
>>>>> 
>>>>> 
>>>>> 2 A fix
>>>>> 
>>>>> It's worth noting that bihash never crashes. It does yield bogus results 
>>>>> occasionally, though. While -1 is easy to check for, the analysis shows 
>>>>> that other bogus results are possible. In particular:
>>>>> 
>>>>>  1. A value updated half-way, possible with bihash_8_16.
>>>>>  2. Observing a key that never existed due to a partial key update. The 
>>>>> probability is low since the hash should map it to the same bucket.
>>>>>  3. An old key matched with a new value. The probability is low since the 
>>>>> lookup would have to be preempted at a particular spot to make it happen.
>>>>> 
>>>>> Even though these anomalies are unlikely, they are still possible and 
>>>>> exploitable.
>>>>> 
>>>>> Should we consider a fix?
>>>>> 
>>>>> The proposal is to introduce read locks for buckets. An implementation 
>>>>> favouring readers could be as follows:
>>>>> 
>>>>> Extend clib_bihash with "u64 rlocks[MAX_THREADS]". Based on the thread 
>>>>> index, each reader publishes the bucket number it is currently examining 
>>>>> in the respective array item. Padding is introduced to avoid false 
>>>>> sharing.
>>>>> 
>>>>> The writer lock sequence would be: 1) lock the bucket; 2) wait until the 
>>>>> bucket number is not in rlocks.
>>>>> 
>>>>> The reader lock sequence: 1) publish the bucket number in rlocks; 2) if 
>>>>> the bucket is not locked, done; 3) otherwise clear the bucket number from 
>>>>> rlocks, wait for the bucket lock to be released, and restart.
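>>>>> 
>>>>> A sketch of both sequences (names are illustrative; CLIB_PAUSE and the 
>>>>> cache-line alignment macro exist in vppinfra, the rest is assumed). The 
>>>>> reader needs a full barrier between publishing and checking the lock, 
>>>>> hence SEQ_CST:
>>>>> 
>>>>> typedef struct
>>>>> {
>>>>>   CLIB_CACHE_LINE_ALIGN_MARK (cacheline0); /* avoid false sharing */
>>>>>   volatile u64 bucket;	/* bucket # being read, ~0 if none */
>>>>> } rlock_t;
>>>>> 
>>>>> /* reader: publish first, then verify the bucket is unlocked */
>>>>> while (1)
>>>>>   {
>>>>>     __atomic_store_n (&rl->bucket, bucket_no, __ATOMIC_SEQ_CST);
>>>>>     if (!bucket_is_locked (b))	/* hypothetical lock-bit test */
>>>>>       break;			/* published; safe to search */
>>>>>     __atomic_store_n (&rl->bucket, ~0ULL, __ATOMIC_SEQ_CST);
>>>>>     while (bucket_is_locked (b))
>>>>>       CLIB_PAUSE ();
>>>>>   }
>>>>> 
>>>>> /* writer, bucket lock already held: wait out readers of this bucket */
>>>>> for (int t = 0; t < n_threads; t++)
>>>>>   while (__atomic_load_n (&rlocks[t].bucket, __ATOMIC_SEQ_CST) == bucket_no)
>>>>>     CLIB_PAUSE ();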
>>>>> 
>>>>> Thoughts?
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>> 
