Hi, thanks for an insightful discussion!

I do understand that high performance is one of the most important goals of 
vpp, therefore certain solutions might not fly. From my POV, the version 
counter would be an improvement. It definitely decreases the probability of 
triggering the bug.

Concerning isolcpus, currently this is presented as an optimisation, not a 
prerequisite. Without isolcpus, a thread could get preempted for an arbitrarily 
long time, meaning that no matter how many bits we allocate for the version 
field, occasionally they won't be enough.

I’d love to have something that’s robust no matter how the threads are 
scheduled. Would it be possible to use the vpp benchmarking lab to evaluate the 
performance impact of the proposed solutions?

Finally, I'd like to rehash the reader lock proposal. The idea was that we 
don’t introduce any atomic operations in the reader path. A reader *publishes* 
the bucket number it is about to examine in an int rlock[MAX_THREADS] array. 
Every thread uses a distinct cell in rlock (determined by the thread id), so 
the publish can be a regular write followed by a barrier. False sharing is 
eliminated with padding.

The writer locks a bucket as currently implemented (CAS) and then waits until 
the bucket number disappears from rlock[].

The reader publishes the bucket number and then checks whether the bucket is 
locked (regular write, barrier, regular read). If the bucket is not locked, the 
reader is good to go; otherwise it removes the bucket number from rlock, waits 
for the lock to be released, and restarts.
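
To make it concrete, here's a rough sketch of both sides. All the names 
(rlock_slot_t, reader_enter, writer_drain_readers, MAX_THREADS) are made up, 
nothing here exists in bihash today; it leans on vppinfra macros like 
CLIB_MEMORY_BARRIER and CLIB_PAUSE:

 typedef struct
 {
   volatile u64 bucket_plus_1; /* 0 means "not reading any bucket" */
   u8 pad[CLIB_CACHE_LINE_BYTES - sizeof (u64)]; /* defeat false sharing */
 } rlock_slot_t;

 rlock_slot_t rlock[MAX_THREADS];

 /* Reader: regular write + barrier + regular read, no atomic RMW ops. */
 static inline void
 reader_enter (u32 thread_index, u64 bucket_index,
               BVT (clib_bihash_bucket) * b)
 {
 again:
   rlock[thread_index].bucket_plus_1 = bucket_index + 1; /* publish */
   /* full barrier: the publish must be visible before we test the lock */
   CLIB_MEMORY_BARRIER ();
   if (PREDICT_FALSE (b->lock))
     {
       rlock[thread_index].bucket_plus_1 = 0; /* step aside for the writer */
       while (b->lock)
         CLIB_PAUSE ();
       goto again;
     }
   /* ... search the bucket, then store 0 to the slot on the way out ... */
 }

 /* Writer: lock the bucket via CAS as today, then drain the readers. */
 static inline void
 writer_drain_readers (u64 bucket_index)
 {
   u32 t;
   for (t = 0; t < MAX_THREADS; t++)
     while (rlock[t].bucket_plus_1 == bucket_index + 1)
       CLIB_PAUSE ();
 }

The CAS that takes the bucket lock already acts as a full barrier, so for any 
racing reader either the reader sees the lock bit and steps aside, or the 
writer's drain loop sees the published bucket number.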

The proposal doesn’t introduce any new atomic operations. There still might be 
a slowdown due to cache line ping-pong in the rlock array. In the worst case, 
it costs us 1 extra cache miss for the reader. This could be coalesced with the 
bucket prefetch, making it essentially free (few if any bihash users prefetch 
buckets).

Best,

Nick


On 3. Nov 2021, at 21:29, Florin Coras via lists.fd.io 
<fcoras.lists=gmail....@lists.fd.io> wrote:


Agreed it’s unlikely, so maybe just use the 2 bits left for the epoch counter 
as a middle ground? The new approach should be better either way :-)

Florin


On Nov 3, 2021, at 11:55 AM, Damjan Marion <dmar...@me.com> wrote:

What about the following: we shift the offset by 6, as all buckets are aligned 
to 64 anyway, and that gives us 6 more bits, so we can have an 8-bit epoch 
counter?
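
For illustration, the packing could look something like this (field widths are 
illustrative only, the real bucket layout differs in details):

 typedef struct
 {
   union
   {
     struct
     {
       u64 offset : 31;      /* byte offset >> 6; buckets 64-byte aligned */
       u64 lock : 1;
       u64 linear_search : 1;
       u64 log2_pages : 8;
       u64 refcnt : 15;
       u64 epoch : 8;        /* bumped on every change to the bucket */
     };
     u64 as_u64;
   };
 } bucket_with_epoch_t;

 /* store: b.offset = byte_offset >> 6;
    load:  byte_offset = (uword) b.offset << 6; */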

—
Damjan

On 03.11.2021., at 19:45, Damjan Marion <dmar...@me.com> wrote:



Yes, I am aware of that. It is extremely unlikely, and the only way I can see 
this fixed is introducing an epoch counter at the bucket level, but we don't 
have enough space there….

—
Damjan

On 03.11.2021., at 19:16, Florin Coras <fcoras.li...@gmail.com> wrote:

Hi Damjan,

Definitely like the scheme but the change bit might not be enough, unless I’m 
misunderstanding. For instance, two consecutive updates to a bucket before the 
reader grabs b1 will hide the change.

Florin

On Nov 3, 2021, at 9:36 AM, Damjan Marion via lists.fd.io 
<dmarion=me....@lists.fd.io> wrote:


Agree with Dave on atomic ops being bad on the reader side.

What about the following scheme:

As the bucket is just a u64 value, on the reader side we grab the bucket 
before (b0) and after (b1) the search operation.

If the search finds an entry, we simply do 2 checks:
- that b0 is equal to b1
- that the lock bit is not set in either of them
If a check fails, we simply retry.

On the writer side, we have add, remove and replace operations.
The first 2 alter refcnt, which is part of the bucket.
To deal with the replace case we introduce another bit (a change bit) which is 
flipped every time data is changed in the bucket.

Here are the possible scenarios:

- reader grabs b0 before lock and b1 after unlock
   - add, del - refcnt and change bit will differ between b0 and b1, causing a 
retry
   - replace - change bit will differ between b0 and b1, causing a retry

- reader grabs b0 after lock and/or b1 before unlock
   - lock bit will be set, causing a retry

Of course, for this to work properly we need to ensure proper memory ordering 
(i.e. avoid the bucket change becoming visible to a remote thread before the 
kvp change).
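
In pseudo-code, the reader side becomes something like this (just a sketch; 
do_search () is a stand-in for the existing kvp search loop):

 BVT (clib_bihash_bucket) b0, b1;
 int rv;

 do
   {
     b0.as_u64 = __atomic_load_n (&b->as_u64, __ATOMIC_ACQUIRE);
     rv = do_search (h, b, key_result); /* existing kvp search */
     b1.as_u64 = __atomic_load_n (&b->as_u64, __ATOMIC_ACQUIRE);
   }
 /* retry if the bucket was locked at either end or changed in between */
 while (b0.lock || b1.lock || b0.as_u64 != b1.as_u64);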

I crafted a WIP patch to present my idea:

https://gerrit.fd.io/r/c/vpp/+/34326

In this patch I got rid of all store barriers and replaced them with the more 
lightweight:

__atomic_store_n (ptr, val, __ATOMIC_RELEASE);

On platforms with strong memory ordering (like x86_64) this will result in 
just normal stores (but the compiler will know that it should not reorder 
them). On platforms with weak memory ordering (like aarch64) this will result 
in a special store instruction, but that one is still cheaper than a full 
memory barrier.
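
For example, where the writer currently publishes the bucket with a full store 
barrier, the patch does roughly this (a sketch, not the literal diff; b_new is 
the writer's local copy of the bucket):

 /* before: plain kvp stores, then a full store barrier, then the publish */
 CLIB_MEMORY_STORE_BARRIER ();
 b->as_u64 = b_new.as_u64;

 /* after: one release store; all prior kvp stores are ordered before it */
 __atomic_store_n (&b->as_u64, b_new.as_u64, __ATOMIC_RELEASE);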

Thoughts? Comments?

Thanks,

—
Damjan



On 02.11.2021., at 12:14, Dave Barach <v...@barachs.net> wrote:

Dear Nick,

As the code comment suggests, we tiptoe right up to the line to extract 
performance. Have you tried e.g. ISOLCPUS, thread priority, or some other 
expedients to make the required assumptions true?

It’s easy enough to change the code in various ways so this use-case cannot 
backfire. High on the list: always make a working copy of the bucket, vs. 
updating in place. It won’t help write performance, but it’s likely to make 
the pain go away.
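
Roughly like this (helper names are hypothetical; the bucket-split path 
already works along these lines):

 /* update a copy, then swap the bucket to point at it: readers see either
    the old kvp array or the new one, never a half-updated mix */
 BVT (clib_bihash_value) * new_v;
 BVT (clib_bihash_value) * old_v = BV (clib_bihash_get_value) (h, b->offset);
 BVT (clib_bihash_bucket) b_new;

 new_v = BV (value_alloc) (h, b->log2_pages);	  /* hypothetical allocator */
 clib_memcpy_fast (new_v, old_v, value_bytes (h, b)); /* hypothetical size */
 /* ... apply the add/del/replace to new_v ... */
 b_new = *b;
 b_new.offset = BV (clib_bihash_get_offset) (h, new_v);
 CLIB_MEMORY_STORE_BARRIER (); /* new_v contents settle before the swap */
 b->as_u64 = b_new.as_u64;
 /* old_v may only be freed once no reader can still be walking it */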

Bucket-level reader-locks would involve adding Avogadro’s number of atomic ops 
to the predominant case. I’m pretty sure that’s a non-starter.

FWIW... Dave


From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Nick Zavaritsky
Sent: Monday, November 1, 2021 12:12 PM
To: vpp-dev@lists.fd.io
Subject: [vpp-dev] Bihash is considered thread-safe but probably shouldn't

Hello bihash experts!

There's an old thread claiming that bihash lookup can produce a value=-1 under 
intense add/delete concurrent activity: 
https://lists.fd.io/g/vpp-dev/message/15606

We had a seemingly related crash recently when a lookup in snat_main.flow_hash 
yielded a value=-1 which was subsequently used as a destination thread index to 
offload to. This crash prompted me to study bihash more closely.

The rest of the message is structured as follows:
 1. Presenting reasons why I believe that bihash is not thread-safe.
 2. Proposing a fix.

1 Bihash is probably not thread-safe

The number of buckets in a hash table never changes. Every bucket has a lock 
bit. Updates happen via clib_bihash_add_del_inline_with_hash. The function 
grabs the bucket lock early on and performs the update while holding the lock. 
Obviously this is safe; let's focus on readers.

Lookups happen via clib_bihash_search_inline_with_hash / 
clib_bihash_search_inline_2_with_hash. The function locates the bucket and 
waits until the lock bit is cleared.

 b = BV (clib_bihash_get_bucket) (h, hash);

 if (PREDICT_FALSE (BV (clib_bihash_bucket_is_empty) (b)))
   return -1;

 if (PREDICT_FALSE (b->lock))
   {
     volatile BVT (clib_bihash_bucket) * bv = b;
     while (bv->lock)
       CLIB_PAUSE ();
   }

From this point on the function examines the data structure without ever 
bothering to check the lock again. Nothing prevents an updater from changing 
the data the reader is currently looking at, or even deallocating it right 
away. The only way this could work is if we could make assumptions about the 
relative performance of lookup and update operations. Checking the lock early 
in lookup ensures that there's no update currently in progress. If lookup is 
faster than update, then no future updater will manage to progress to the point 
where the memory is written BEFORE the lookup is complete. Indeed, we have the 
following comment in clib_bihash_add_del_inline_with_hash:

     /*
      * Because reader threads are looking at live data,
      * we have to be extra careful. Readers do NOT hold the
      * bucket lock. We need to be SLOWER than a search, past the
      * point where readers CHECK the bucket lock.
      */

Unfortunately, the assumption doesn't hold. Any thread could get preempted at 
any time. Even if we rule out preemption, there are microarchitectural quirks 
(e.g. caches, branch misprediction) that could slow down the lookup to the 
point that the memory read and the update overlap.

The core of lookup is the following loop. Please note that checking a key and 
fetching the value is not atomic, hence if we are preempted in between the 
result could be bogus.

 for (i = 0; i < limit; i++)
   {
     if (BV (clib_bihash_key_compare) (v->kvp[i].key, key_result->key))
       {
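          /* race window: an updater may rewrite kvp[i] between the key
             compare above and the copy below */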
         *key_result = v->kvp[i];
         return 0;
       }
   }

Different ways the key-value pair could get updated:

(1) Add using a previously empty slot:

             clib_memcpy_fast (&(v->kvp[i].value),
                               &add_v->value, sizeof (add_v->value));
              CLIB_MEMORY_STORE_BARRIER (); /* Make sure the value has settled */
             clib_memcpy_fast (&(v->kvp[i]), &add_v->key,
                               sizeof (add_v->key));

The key update is not atomic; a reader could observe a key updated half-way.

(2) Add that recycles a stale slot:

                 clib_memcpy_fast (&(v->kvp[i]), add_v, sizeof (*add_v));

Yet again not atomic. A reader could witness (old_k, new_v) or (new_k, old_v) 
or even an arbitrary interleaving of chunks from old and new keys.

(3) Deleting an entry:

clib_memset_u8 (&(v->kvp[i]), 0xff, sizeof (*(add_v)));

Not atomic.


2 A fix

It's worth noting that bihash never crashes. It does yield bogus results 
occasionally, though. While -1 is easy to check for, the analysis shows that 
other bogus results are possible. In particular:

 1. A value updated half-way, possible with bihash_8_16.
 2. Observing a key that never existed, due to a partial key update. The 
probability is low since the hash should map such a key to the same bucket.
 3. An old key matched with a new value. The probability is low since the 
lookup must get preempted at the particular spot to make it happen.

Even though these anomalies are unlikely, they are still possible and 
exploitable.

Should we consider a fix?

The proposal is to introduce read locks for buckets. An implementation 
favouring readers could be as follows:

Extend clib_bihash with "u64 rlocks[MAX_THREADS]". Based on the thread index, 
each reader publishes the bucket number it is currently examining in the 
respective array item. Padding is introduced to avoid false sharing.

The writer lock sequence would be: 1) lock the bucket; 2) wait until the 
bucket number is not in rlocks.

Reader lock sequence: 1) publish the bucket number in rlocks; 2) if the bucket 
is not locked then done; 3) otherwise clear the bucket number from rlocks, 
wait for the bucket lock to be released, and restart.

Thoughts?








