xiaoxiang781216 commented on PR #15705: URL: https://github.com/apache/nuttx/pull/15705#issuecomment-2623557988
> > Yes, but you first need to change the execution of ALL IRQs from the interrupt context to the thread context. Before that happens, the sched lock must be taken at the same time we hold the spinlock.
> >
> > Let's emphasize the key points:
> >
> > 1. A busy-wait lock (e.g. spinlock) must hold the sched lock too
> > 2. A sleep lock (e.g. mutex) doesn't/shouldn't hold the sched lock
> >
> > So, you can't do the change in separate steps without breaking the code:
> >
> > 1. Remove the sched lock first
> > 2. Change the spinlock to a mutex later
> >
> > Let's consider a VERY COMMON case in SMP:
> >
> > 1. thread0 on CPU0 holds a spinlock without holding the sched lock
> > 2. thread1 on CPU1 wakes up thread2
> > 3. The OS decides to suspend thread0 and resume thread2 on CPU0
> > 4. thread1 on CPU1 wants to take the spinlock held by thread0
> >
> > But it is very hard for thread1 to know when it can finish the busy-wait loop, since thread0 is already in the suspended state. So it is VERY VERY dangerous to schedule a thread which holds a spinlock.
>
> Let's first unify our understanding:
>
> 1. There was no deadlock problem before in NuttX's API semantics, and all code worked very well, regardless of SMP or AMP.
>
> Right?

No, it isn't true. The critical section is a big lock, which is a well-known SMP issue and should be replaced by fine-grained locks step by step: https://en.wikipedia.org/wiki/Giant_lock. More and more IoT devices have SMP capability, so it's important to improve NuttX's SMP performance.

> 2. The deadlock problem was caused by Xiaomi's changes (from you @xiaoxiang781216 and @hujun260).

All known issues were identified and fixed promptly (ALL within 24 hours). Do you have any new issue?
> 3. You first replaced **enter_critical_section()** with **spin_lock_irqsave()**.

Let's emphasize that a critical section is not equal to a spinlock: a thread leaves the critical section when it is suspended by the scheduler, but a spinlock doesn't have this capability. That's why many places replace enter_critical_section with spin_lock_irqsave + sched_lock.

> 4. After finding the deadlock, you guys replaced **spin_lock_irqsave()** with **spin_lock_irqsave()** + **sched_lock()**.

Where did you find that we patched in sched_lock separately? We changed to spin_lock_irqsave + sched_lock at the same time.

> 5. In order to simplify the code, you moved **sched_lock()** inside **spin_lock_irqsave()**.

No, https://github.com/apache/nuttx/pull/14578 was created three months ago to fix https://github.com/apache/nuttx/issues/9531 and the problem I mentioned before. But since it took a long time to review this important change, we had to use spin_lock_irqsave + sched_lock before it got merged.

> Now you guys find that such a change will break all ESP devices and degrade the performance of the entire system.

That's why spin_lock and raw_spin_lock exist:

1. In the normal case, a driver should use spin_lock, which disables scheduling at the same time.
2. In special cases, you can call raw_spin_lock to bypass locking the scheduler.

But the default version should be the safe one.

> 6. Look at what you are doing now:
> Prepare to replace kernel code from **spin_lock_irqsave()** to **raw_spin_lock_irqsave()**:
> [replace spin_lock_irqrestore with raw_spin_lock_irqrestore #15695](https://github.com/apache/nuttx/pull/15695)

Yes, since this place was carefully identified as one where locking the scheduler isn't really needed.

> In order to solve the problem of the startup crash, replace all **spin_lock_irqsave()** with **raw_spin_lock_irqsave()**.

This is the wrong usage of a spinlock: holding the lock in one thread and unlocking it in another.
> [arch: use raw_spin_[un]lock to replace spin_[un]lock, fix regression b69111d16a2a330fa272af8175c832e08881844b of https://github.com/apache/nuttx/pull/14578 #15691](https://github.com/apache/nuttx/pull/15691)
> [esp32: replace spin_lock_irqsave with raw_spin_lock_irqsave. espressif/esp-hal-3rdparty#8](https://github.com/espressif/esp-hal-3rdparty/pull/8)
>
> I can't imagine how many board-level and chip codes will need similar adjustments in the future. How can other developers continue to trust NuttX when the kernel API is so unstable?

@hujun260 could we use a new approach to avoid touching the esp32 code?

> Most kernel code does not need to hold **sched_lock()**, which is unfair to code that does not cause scheduling. New API semantics should be expressed with new APIs instead of modifying the behavior of the original API.

Yes, that's why we provide raw_spin_lock and spin_lock_preempt, but the default one (spin_lock) needs to hold sched_lock for safety.

> > > 3. You can look at the spin_lock implementation in other RTOSes; no one disables preemption (**zephyr**):
> > > https://github.com/zephyrproject-rtos/zephyr/blob/c66a0bb2a9269dc483079d5cc24ba77b8c332a9a/include/zephyr/spinlock.h#L182-L214
> >
> > Zephyr's k_spin_lock disables irqs too, but NuttX's spin_lock doesn't disable irqs.
>
> Let's remove **spin_lock_irqsave()** and move the **irqsave** into **spin_lock()**, ok?

Please look at the discussion: https://github.com/apache/nuttx/issues/9531

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@nuttx.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org