xiaoxiang781216 commented on PR #15705:
URL: https://github.com/apache/nuttx/pull/15705#issuecomment-2623557988

   > > Yes, but you first need to change the execution of ALL IRQs from the interrupt context to the thread context. Before that happens, the sched lock must be taken at the same time we hold the spinlock.
   > 
   > > Let's emphasize the key point:
   > > 
   > > 1. The busy-loop lock (e.g. spinlock) must hold the sched lock too
   > > 2. The sleep lock (e.g. mutex) doesn't/shouldn't hold the sched lock
   > > 
   > > So, you can't make the change in separate steps without breaking the code:
   > > 
   > > 1. Remove the sched lock first
   > > 2. Change the spinlock to a mutex later
   > > 
   > > Let's consider a VERY COMMON case in SMP:
   > > 
   > > 1. thread0 on CPU0 holds a spinlock without holding the sched lock
   > > 2. thread1 on CPU1 wakes up thread2
   > > 3. The OS decides to suspend thread0 and resume thread2 on CPU0
   > > 4. thread1 on CPU1 wants to take the spinlock held by thread0
   > > 
   > > but it is very hard for thread1 to know when it can finish the busy-wait loop, since thread0 is already suspended. So it's VERY VERY dangerous to schedule a thread which holds a spinlock.
   > 
   > Let’s first unify our understanding:
   > 
   > 1. There was no deadlock problem before in NuttX's API semantics, and all code worked very well, regardless of SMP or AMP
   > 
   > Right?
   
   No, that isn't true. The critical section is a big lock, which is a well-known SMP issue, and it should be replaced by fine-grained locks step by step:
   https://en.wikipedia.org/wiki/Giant_lock
   More and more IoT devices have SMP capability, so it's important to improve NuttX's SMP performance.
   
   > 
   > 2. The deadlock problem was caused by xiaomi's changes (from you @xiaoxiang781216 and @hujun260)
   
   All known issues were identified and fixed promptly (ALL within 24 hours). Do you have any new issue?
   
   > 3. You first replaced **enter_critical_section()** with 
**spin_lock_irqsave()**,
   
   Let's emphasize that a critical section isn't equivalent to a spinlock: a thread leaves the critical section if it is suspended by the scheduler, but a spinlock doesn't have this capability. That's why many places replace enter_critical_section with spin_lock_irqsave + sched_lock.
   
   > 4. After finding the deadlock, you guys replace **spin_lock_irqsave()** 
with **spin_lock_irqsave()**+**sched_lock()**
   
   Where did you find that we patched in sched_lock separately? We changed to spin_lock_irqsave + sched_lock at the same time.
   
   > 5. In order to simplify the code, you moved **sched_lock()** inside **spin_lock_irqsave()**.
   
   No, https://github.com/apache/nuttx/pull/14578 was created three months ago to fix https://github.com/apache/nuttx/issues/9531 and the problem I mentioned before.
   But since it takes a long time to review such an important change, we had to use spin_lock_irqsave + sched_lock before it got merged.
   
   > Now you guys find that such a change will break all ESP devices and the 
performance of the entire system will be degraded
   
   That's why both spin_lock and raw_spin_lock exist:
   
   1. In the normal case, a driver should use spin_lock, which disables scheduling at the same time
   2. In special cases, you can call raw_spin_lock to bypass locking the scheduler
   
   But the default version should be the safe one.
   
   > 6. Look at what you are doing now:
   >    Prepare to replace **spin_lock_irqsave()** with **raw_spin_lock_irqsave()** throughout the kernel code
   >    [replace spin_lock_irqrestore with raw_spin_lock_irqrestore 
#15695](https://github.com/apache/nuttx/pull/15695)
   
   Yes, since this place was carefully identified as one where locking the scheduler isn't really needed.
   
   >    In order to solve the problem of startup crash, replace all 
**spin_lock_irqsave()** with **raw_spin_lock_irqsave()**
   
   This is a wrong usage of a spinlock: the lock is taken in one thread and released in another thread.
   
   >    [arch: use raw_spin_[un]lock to replace spin_[un]lock, fix regression 
b69111d16a2a330fa272af8175c832e08881844b of 
https://github.com/apache/nuttx/pull/14578 
#15691](https://github.com/apache/nuttx/pull/15691)
   >    [esp32: replace spin_lock_irqsave with raw_spin_lock_irqsave. 
espressif/esp-hal-3rdparty#8](https://github.com/espressif/esp-hal-3rdparty/pull/8)
   > 
   > I can't imagine how many board-level and chip codes will need similar 
adjustments in the future. How can other developers continue to trust NuttX 
when the kernel API is so unstable?
   > 
   
   @hujun260 could we use a new approach to avoid touching esp32 code? 
   
   > Most kernel code does not need to hold **sched_lock()**, which is unfair 
to some code that does not cause scheduling. New API semantics should be 
expressed with new APIs instead of modifying the behavior of the original API.
   > 
   
   Yes, that's why we provide raw_spin_lock or spin_lock_preempt, but the default one (spin_lock) needs to hold sched_lock for safety.
   
   > > > 3. You can look at the spin_lock implementation in other RTOS, no one 
will disable preemption(**zephyr**)
   > > >    
https://github.com/zephyrproject-rtos/zephyr/blob/c66a0bb2a9269dc483079d5cc24ba77b8c332a9a/include/zephyr/spinlock.h#L182-L214
   > > 
   > > 
   > > zephyr's k_spin_lock disables IRQs too, but nuttx's spin_lock doesn't disable IRQs.
   > 
   > let's remove **spin_lock_irqsave()** and move the **irqsave** into **spin_lock()**, OK?
   
   Please look at the discussion: https://github.com/apache/nuttx/issues/9531
   

