On 12.02.25 11:02, Pavel Tikhomirov wrote:


On 2/12/25 14:38, Alexander Atanasov wrote:
On 12.02.25 7:54, Pavel Tikhomirov wrote:


On 2/11/25 22:25, Alexander Atanasov wrote:
With the locking reduced it opened windows for races between
running updates, pending and new updates. Current logic to
deal with them is not correct.

Examples of exact race situation in code would be appreciated, as now it's impossible to unserstand which exact race we are trying to fix.

E.g. preferably in this format:

Thread 1:
A1
B1
                 Thread2:
                 A2
                 B2
C1
                 C2



Too many to describe them that way - we have three actors.
Locking and logic is wrong.

Sorry, but it is not obvious what was wrong with it in the first place. To review and verify the fix, I need to have full information about what the problem was (it's unclear from JIRA too, unless I reinvestigate all the crashes there).

Should i go over the code and explain it?

Can you please explain general lock logic a bit and maybe with couple of references to code. Because for me "races between running updates, pending and new updates" brings more questions when answers: What is "running" updates? What is "pending" and "new" updates? What is "updates"?


Ok let me try - an md page can be in three states / which comments are updated in .h file/
 - MD_DIRTY - page awaits writeback - needs to be updated on disk
- MD_WRITEBACK - page update is being processed - write out started and is in progress

there is only one piwb - in md_page struct.
Let's walk over ploop_locate_new_cluster_and_attach_pio as it is the base case, others are derivatives.


First under lock spin_lock_irqsave(&ploop->bat_lock, flags);
The there is :
if (piwb && (piwb->type != type || test_bit(MD_WRITEBACK, &md->status)))

It uses piwb to check for dirty - if we have piwb we delay.
But the problem is in the case MD_WRITEBACK , i.e. it is not started
page is waiting for it to start.

Next check the type ALLOC vs DISCARD - we have only one piwb.
If it is the same it does not delay. And the currently waiting piwb
can be executed before it is completely updated.
For example in cow case a pio is added to cow_llist -> the update can go and complete before it is added, so it won't be executed.

RACE - bat_write_complete clears  MD_WRITEBACK and sets piwb to NULL
early before the update is completed. Delayes pios can cause new updates
while the initial one is not completed.


Next it marks page dirty - but along that it is adding it to the writeback list - RACE - thread can see it before update is ready.
Next lock is unlocked.

To summarize - the race is between update preparation and thread
executing the updates - in several forms.

Request thread                            Execution Thread
Start to prepare update
take lock
mark for update and put into the list
release lock
                                          Executes unfinished update
clears flags and NULLs piwb -> crash somewhere along the path

do more work to finish preparation






Current flags are:
MD_DIRTY - means md page is dirty and is waiting for writeback
MD_WRITEBACK - write back is in progress
But we drop the lock after we make page dirty so it races with
writeback.

To fix this introduce a new flag:
MD_UPDATING - which means we are updating the pending writeback
or we are creating a new one. Remove the check if piwb is present
on the page and clear piwb early on submit, check was using a type
alloc vs discard but in case of concurrent alloc and discard the piwb
is overwritten by the second operation. Using a flag solves this
and makes it more clear in what state page is.

Nice side effect of closing that race is that we have a minimalistic
delay for writeback updates. This happens when we try to submit a
pending writeback and we see the MD_UPDATING flags, update is not
submitted but it is retried on next cycle. So if we have a lot of
update to the same page in a very short periot they are accumulated
as one.

On error only clear MD_DIRTY if it is set by us before dropping
the lock and when breaking the prepared bat update only clear
md->piwb  if it is the same as the piwb allocated in the function.

Add some clarification error messages around WARN_ONs.

While at it fix:
- mapping bat_bage two times in same function, do it only once
- missed unmapping on error path
- update proper bat levels in delayed update case
- locking around prepare_bat_update from ploop_prepare_reloc_index_wb

https://virtuozzo.atlassian.net/browse/VSTOR-98807
Signed-off-by: Alexander Atanasov <alexander.atana...@virtuozzo.com>
---
  drivers/md/dm-ploop-cmd.c |   4 +-
  drivers/md/dm-ploop-map.c | 153 +++++++++++++++++++++++++++-----------
  drivers/md/dm-ploop.h     |   8 +-
  3 files changed, 117 insertions(+), 48 deletions(-)

diff --git a/drivers/md/dm-ploop-cmd.c b/drivers/md/dm-ploop-cmd.c
index e032af52b64e..daec9827b42c 100644
--- a/drivers/md/dm-ploop-cmd.c
+++ b/drivers/md/dm-ploop-cmd.c
@@ -315,10 +315,8 @@ static int ploop_grow_relocate_cluster(struct ploop *ploop,
          goto out;
      }
-    spin_lock_irq(&ploop->bat_lock);
      ret = ploop_prepare_reloc_index_wb(ploop, &md, clu, &new_dst,
                         ploop_top_delta(ploop)->file);
-    spin_unlock_irq(&ploop->bat_lock);
      if (ret < 0) {
          PL_ERR("reloc: can't prepare it: %d", ret);
          goto out;
@@ -329,7 +327,7 @@ static int ploop_grow_relocate_cluster(struct ploop *ploop,
      ret = ploop_write_cluster_sync(ploop, pio, new_dst);
      if (ret) {
          PL_ERR("reloc: failed write: %d", ret);
-        ploop_break_bat_update(ploop, md);
+        ploop_break_bat_update(ploop, md, piwb);

You should probably understand that this code:
```
         piwb = md->piwb;

         /* Write clu to new destination */
         ret = ploop_write_cluster_sync(ploop, pio, new_dst);
         if (ret) {
                 PL_ERR("reloc: failed write: %d", ret);
                 ploop_break_bat_update(ploop, md, piwb);
                 goto out;
         }

```
can be optimized by the compiler to:
```
         /* Write clu to new destination */
         ret = ploop_write_cluster_sync(ploop, pio, new_dst);
         if (ret) {
                 PL_ERR("reloc: failed write: %d", ret);
                 ploop_break_bat_update(ploop, md, md->piwb);
                 goto out;
         }
```


And in this case there is no gain from having this extra md->piwb argument as we can already see changed md->piwb at this point. Even if it somehow helps, I believe the fix is still racy.

Yes it can be optimised but this case is a sideway case for the commands and it is a question of style what to use.

We can do two versions of ploop_break_bat_update, one for three arguments and one for two (which explicitly does not care about piwb change). Or just add comment on sideway case that piwb change is impossible.

Ok, i can add comment, i don't see a point to make another version just for that.


If you really want to read md->piwb to piwb variable BEFORE ploop_write_cluster_sync, you should use READ_ONCE, or other appropriately used memory barrier.


READ_ONCE is not a memory barrier - it is a COMPILER BARRIER.
https://docs.kernel.org/core-api/wrappers/memory-barriers.html
see COMPILER BARRIER section for indeep explanation. There you will see that infact READ_ONCE forces variable to be re-read , and not read once as in one time.

Unrelated to the patch, as you've proved that patch is ok. But we've already had this discussion https://lists.openvz.org/pipermail/devel/2025-January/081790.html , and I've provided you with citation from documentation from "COMPILER BARRIER". READ_ONCE disables different compiler optimisations, including that it does not allow local variable to be optimized by instead directly using the global variable which can be changed in concurrent thread, which looks from the user perspective as if local variable is read twice from global variable. So it actually makes reads "one time" reads.

No, it actually marks the variable volatile /i missed that/ and forces
access as a whole. Which results in variable to be read every time instead of cached one to be used, so it is not a one time read.


The example from the docs:


 (*) The compiler is within its rights to merge successive loads from
     the same variable.  Such merging can cause the compiler to "optimize"
     the following code:

        while (tmp = a)
                do_something_with(tmp);

     into the following code, which, although in some sense legitimate
     for single-threaded code, is almost certainly not what the developer
     intended:

        if (tmp = a)
                for (;;)
                        do_something_with(tmp);

     Use READ_ONCE() to prevent the compiler from doing this to you:

        while (tmp = READ_ONCE(a))
                do_something_with(tmp);



READ_ONCE(a) -> forces read every time - cast to volatile
and the side efect is it can not split access in two

The last paragaph is what you missed:

"All that aside, it is never necessary to use READ_ONCE() and
WRITE_ONCE() on a variable that has been marked volatile.  For example,
because 'jiffies' is marked volatile, it is never necessary to
say READ_ONCE(jiffies).  The reason for this is that READ_ONCE() and
WRITE_ONCE() are implemented as volatile casts, which has no effect when
its argument is already marked volatile.

Please note that these compiler barriers have no direct effect on the CPU, which may then reorder things however it wishes."


Yes we did but let's get on the same page.

--
Regards,
Alexander Atanasov

_______________________________________________
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel

Reply via email to