Re: [PR] sched/hrtimer: Part 2: refine hrtimer state machine and introduce scheduler support with hrtimer [nuttx]

via GitHub Wed, 31 Dec 2025 00:07:48 -0800


wangchdo commented on PR #17573:
URL: https://github.com/apache/nuttx/pull/17573#issuecomment-3701692113


   > > > @anchao @wangchdo I have serious doubts about your professionalism.
   > > > > @wangchdo's HRTimer implementation has **fundamental functional 
correctness issues. It violates the ownership invariant of hrtimer**: that only 
one reader/writer can use the hrtimer object at a time. This violation makes it 
prone to issues in SMP scenarios, as I have consistently pointed out over and 
over again. Simple modifications like reference counting can not fix this 
implementation at all.
   > > > 
   > > > 
   > > > Why are you allowing such a functional incorrect code to be merged 
into the master branch. Isn’t this simply a case of factional bias clouding 
judgment? Please check this, thanks. [#17675 
(comment)](https://github.com/apache/nuttx/pull/17675#issuecomment-3698712790)
   > > 
   > > 
   > > @Fix-Point This PR addresses the SMP corner cases you mentioned. Could 
you please take a look? I propose this solution for the following reasons:
   > > 
   > > 1. It allows separate optimization paths for SMP and non-SMP, leaving 
room for extreme performance and strong determinism in both cases. (per 
@xiaoxiang781216 's comments I will submit a separate PR to put the SMP 
specific logics in CONFIG_SMP )
   > > 2. It provides clean, easy-to-use and  minimal user-facing APIs — only 
three core APIs, with no duplicated parameters:
   > >    `hrtimer_init()`
   > >    `hrtimer_start()`
   > >    `hrtimer_cancel() / hrtimer_cancel_sync()`
   > > 3. It provides basement for queue abstraction if needed.
   > > 
   > > @xiaoxiang781216 I resolved you comments, could you please take a look 
again @acassis @anchao please double check
   > 
   > To behonest, you've only modified the code to prevent the test program 
from getting stuck there. You haven't solved the actual problem: **canceled 
periodic timer cannot overwrite the new timer**.
   > 
   > I've emphasized this many times. Why make futile attempts again and again? 
Do you know how much of our time you've wasted?
   > 
   > > More similar concurrency issues could be cited here. As I have 
emphasized again and again, **the fundamental problem is the violation of the 
ownership invariant of hrtimer: only one owner can modify the hrtimer object at 
a time**.
   > > **Designing functionally correct concurrent algorithms is not easy at 
all. Relying solely on engineering experience is insufficient; theoretical 
methods are necessary to avoid errors, such as adapting resource invariants and 
using structured diagrams to clarify every possible concurrent state 
transition**. @wangchdo's design failed to consider how to handle concurrency 
correctly, making it nearly impossible to improve upon his code base.
   > 
   > > I appreciate your use of **reference counting** in the latest version to 
solve memory reclamation issues. However, the primary problem is that 
**reference counting loses version information**. A version‑aware method, such 
as **Epoch‑based memory reclamation** (a typical kernel implementation is 
Linux’s QSBR‑RCU), is needed to address this.
   > > As I mentioned in a previous scenario:
   > > 
   > > * **t0**: Thread 0 on Core 0 starts a periodic timer with callback `CA`, 
initial delay `1000 ns`, period `UINT32_MAX`.
   > > * **t1**: The timer fires on Core 0, and callback `CA` is executing.
   > > * **t2**: Thread 0 is scheduled onto Core 1, it cancels the same timer, 
and restarts the timer with callback `CB`, initial delay `0 ns`, period `1 000 
000 ns`.
   > > * **t3**: The timer fires on Core 1, and callback `CB` starts executing.
   > > * **t4**: Callback `CA` on Core 0 finishes, finds the `hrtimer` in a 
“running” state (but not of this version), and attempts to restart the timer by 
setting `hrtimer->expired += UINT32_MAX`.
   > > 
   > > Here, two errors occur:
   > > 
   > > * The old periodic timer that should have been canceled continues, 
overwriting the newly started timer.
   > > * After the old timer restarts, its next trigger time is `t2 + 
UINT32_MAX`, which is incorrect—it should have been `t0 + 1000 ns + UINT32_MAX`.
   > > 
   > > **In my HRTimer implementation, this situation is impossible:**
   > > 
   > > * **t0**: Thread 0 on Core 0 starts a periodic timer with callback `CA`, 
initial delay `1000 ns`, period `UINT32_MAX`.
   > > * **t1**: The timer fires on Core 0, and callback `CA` is executing.
   > > * **t2**: Thread 0 is scheduled onto Core 1, it cancels the same timer, 
**taking ownership** of the `hrtimer` data structure from Core 0, and restarts 
the timer with callback `CB`, initial delay `0 ns`, period `1 000 000 ns`.
   > > * **t3**: The timer fires on Core 1, and callback `CB` starts executing.
   > > * **t4**: Callback `CA` on Core 0 finishes, checks the hazard pointer 
and realizes it has lost ownership of the `hrtimer`, so it simply exits without 
further action.
   > > 
   > > The key difference is that **reference counting loses version 
information, while hazard pointers do not**. Canceling a timer via hazard 
pointers ensures that, by time **t2**, all cores using the `hrtimer` have lost 
ownership of the hrtimer, thereby **guaranteeing that an old timer callback can 
never overwrite a new timer**.
   > > **Designing correct concurrent algorithms requires systematic 
consideration**. As I stated in #17570, after carefully reviewing your 
implementation, I could not really find a good solution that preserves version 
information without introducing significant performance or memory-overhead 
overhead.
   > > That is why I suggested we not spend additional time on the previous 
design and instead directly adopt this implementation. I do not disrespect your 
work or doubt your abilities—I acknowledge you as a diligent and excellent 
engineer. I simply do not wish to see anyone stumble twice on the same issue.
   
   I suggest reviewing your API design. The issues you mentioned stem from 
differences between your API design and mine, as I already pointed out in my 
comments after reviewing your implementation. In my view, your current API 
design is less intuitive and more error-prone, which makes such issues likely 
to occur.
   
   Also, I am not wasting anyone’s time—I am simply proposing solutions that I 
believe are better. 
   
   By the way, I provides many comments in your implementation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] sched/hrtimer: Part 2: refine hrtimer state machine and introduce scheduler support with hrtimer [nuttx]

Reply via email to