This is an automated email from the ASF dual-hosted git repository. jerpelea pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/incubator-nuttx.git
commit 409c65ce0bc27df685bd595f5100ad03692d674a Author: Masayuki Ishikawa <[email protected]> AuthorDate: Wed Nov 25 11:18:24 2020 +0900 arch, sched: Fix global IRQ control logics for SMP Summary: - This commit fixes global IRQ control logic - In previous implementation, g_cpu_irqset for a remote CPU was set in sched_add_readytorun(), sched_remove_readytorun() and up_schedule_sigaction() - In this implementation, they are removed. - Instead, in the pause handler, call enter_critical_setion() which will call up_cpu_paused() then acquire g_cpu_irqlock - So if a new task with irqcount > 1 restarts on the remote CPU, the CPU will only hold a critical section. Thus, the issue such as 'POSSIBLE FOR TWO CPUs TO HOLD A CRITICAL SECTION' could be resolved. - Fix nxsched_resume_scheduler() so that it does not call spin_clrbit() if a CPU does not hold a g_cpu_irqset - Fix nxtask_exit() so that it acquires g_cpu_irqlock - Update TODO Impact: - All SMP implementations Testing: - Tested with smp, ostest with the following configurations - Tested with spresense:wifi_smp (NCPUS=2,4) - Tested with sabre-6quad:smp (QEMU, dev board) - Tested with maix-bit:smp (QEMU) - Tested with esp32-core:smp (QEMU) - Tested with lc823450-xgevk:rndis Signed-off-by: Masayuki Ishikawa <[email protected]> --- TODO | 68 +------------------------- arch/arm/src/armv7-a/arm_cpupause.c | 17 ++++++- arch/arm/src/armv7-a/arm_schedulesigaction.c | 14 ++---- arch/arm/src/armv7-m/arm_schedulesigaction.c | 14 ++---- arch/arm/src/cxd56xx/cxd56_cpupause.c | 32 ++++++++++-- arch/arm/src/lc823450/lc823450_cpupause.c | 17 ++++++- arch/risc-v/src/k210/k210_cpupause.c | 17 ++++++- arch/risc-v/src/k210/k210_schedulesigaction.c | 14 ++---- arch/xtensa/src/common/xtensa_cpupause.c | 17 ++++++- arch/xtensa/src/common/xtensa_schedsigaction.c | 14 ++---- sched/sched/sched_addreadytorun.c | 43 ++-------------- sched/sched/sched_removereadytorun.c | 43 ++-------------- sched/sched/sched_resumescheduler.c | 7 ++- sched/task/task_exit.c | 7 +++ 14 files changed, 131 insertions(+), 193 deletions(-) diff --git a/TODO b/TODO index acc6e5b..00bb931 100644 --- a/TODO +++ b/TODO @@ -1,4 +1,4 @@ -NuttX TODO List (Last updated November 20, 2020) +NuttX TODO List (Last updated November 26, 2020) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ This file summarizes known NuttX bugs, limitations, inconsistencies with @@ -10,7 +10,7 @@ issues related to each board port. nuttx/: (16) Task/Scheduler (sched/) - (3) SMP + (2) SMP (1) Memory Management (mm/) (0) Power Management (drivers/pm) (5) Signals (sched/signal, arch/) @@ -485,70 +485,6 @@ o SMP an bugs caused by this. But I believe that failures are possible. - Title: POSSIBLE FOR TWO CPUs TO HOLD A CRITICAL SECTION? - Description: The SMP design includes logic that will support multiple - CPUs holding a critical section. Is this necessary? How - can that occur? I think it can occur in the following - situation: - - The log below was reported is NuttX running on two cores - Cortex-A7 architecture in SMP mode. You can notice see that - when nxsched_add_readytorun() was called, the g_cpu_irqset is 3. - - nxsched_add_readytorun: irqset cpu 1, me 0 btcbname init, irqset 1 irqcount 2. - nxsched_add_readytorun: nxsched_add_readytorun line 338 g_cpu_irqset = 3. - - This can happen, but only under a very certain condition. - g_cpu_irqset only exists to support this certain condition: - - a. A task running on CPU 0 takes the critical section. So - g_cpu_irqset == 0x1. - - b. A task exits on CPU 1 and a waiting, ready-to-run task - is re-started on CPU 1. This new task also holds the - critical section. So when the task is re-restarted on - CPU 1, we than have g_cpu_irqset == 0x3 - - So we are in a very perverse state! There are two tasks - running on two different CPUs and both hold the critical - section. I believe that is a dangerous situation and there - could be undiscovered bugs that could happen in that case. - However, as of this moment, I have not heard of any specific - problems caused by this weird behavior. - - A possible solution would be to add a new task state that - would exist only for SMP. - - - Add a new SMP-only task list and state. Say, - g_csection_wait[]. It should be prioritized. - - When a task acquires the critical section, all tasks in - g_readytorun[] that need the critical section would be - moved to g_csection_wait[]. - - When any task is unblocked for any reason and moved to the - g_readytorun[] list, if that unblocked task needs the - critical section, it would also be moved to the - g_csection_wait[] list. No task that needs the critical - section can be in the ready-to-run list if the critical - section is not available. - - When the task releases the critical section, all tasks in - the g_csection_wait[] needs to be moved back to - g_readytorun[]. - - This may result in a context switch. The tasks should be - moved back to g_readytorun[] highest priority first. If a - context switch occurs and the critical section to re-taken - by the re-started task, the lower priority tasks in - g_csection_wait[] must stay in that list. - - That is really not as much work as it sounds. It is - something that could be done in 2-3 days of work if you know - what you are doing. Getting the proper test setup and - verifying the change would be the more difficult task. - -Status: Open -Priority: Unknown. Might be high, but first we would need to confirm - that this situation can occur and that is actually causes - a failure. - o Memory Management (mm/) ^^^^^^^^^^^^^^^^^^^^^^^ diff --git a/arch/arm/src/armv7-a/arm_cpupause.c b/arch/arm/src/armv7-a/arm_cpupause.c index 4725418..8b2c56f 100644 --- a/arch/arm/src/armv7-a/arm_cpupause.c +++ b/arch/arm/src/armv7-a/arm_cpupause.c @@ -212,9 +212,22 @@ int arm_pause_handler(int irq, FAR void *context, FAR void *arg) * been processed then g_cpu_paused[cpu] will not be locked. */ - if (spin_islocked(&g_cpu_paused[cpu])) + if (up_cpu_pausereq(cpu)) { - return up_cpu_paused(cpu); + /* NOTE: The following enter_critical_section() will call + * up_cpu_paused() to process a pause request to break a deadlock + * because the caller held a critical section. Once up_cpu_paused() + * finished, the caller will proceed and release the g_cpu_irqlock. + * Then this CPU will acquire g_cpu_irqlock in the function. + */ + + irqstate_t flags = enter_critical_section(); + + /* NOTE: the pause request should not exist here */ + + DEBUGVERIFY(!up_cpu_pausereq(cpu)); + + leave_critical_section(flags); } return OK; diff --git a/arch/arm/src/armv7-a/arm_schedulesigaction.c b/arch/arm/src/armv7-a/arm_schedulesigaction.c index 424ba0d..8b44b32 100644 --- a/arch/arm/src/armv7-a/arm_schedulesigaction.c +++ b/arch/arm/src/armv7-a/arm_schedulesigaction.c @@ -314,17 +314,13 @@ void up_schedule_sigaction(struct tcb_s *tcb, sig_deliver_t sigdeliver) DEBUGASSERT(tcb->irqcount < INT16_MAX); tcb->irqcount++; - /* In an SMP configuration, the interrupt disable logic also - * involves spinlocks that are configured per the TCB irqcount - * field. This is logically equivalent to - * enter_critical_section(). The matching call to - * leave_critical_section() will be performed in - * arm_sigdeliver(). + /* NOTE: If the task runs on another CPU(cpu), adjusting + * global IRQ controls will be done in the pause handler + * on the CPU(cpu) by taking a critical section. + * If the task is scheduled on this CPU(me), do nothing + * because this CPU already took a critical section */ - spin_setbit(&g_cpu_irqset, cpu, &g_cpu_irqsetlock, - &g_cpu_irqlock); - /* RESUME the other CPU if it was PAUSED */ if (cpu != me) diff --git a/arch/arm/src/armv7-m/arm_schedulesigaction.c b/arch/arm/src/armv7-m/arm_schedulesigaction.c index 96aaf3f..7376939 100644 --- a/arch/arm/src/armv7-m/arm_schedulesigaction.c +++ b/arch/arm/src/armv7-m/arm_schedulesigaction.c @@ -362,17 +362,13 @@ void up_schedule_sigaction(struct tcb_s *tcb, sig_deliver_t sigdeliver) DEBUGASSERT(tcb->irqcount < INT16_MAX); tcb->irqcount++; - /* In an SMP configuration, the interrupt disable logic also - * involves spinlocks that are configured per the TCB irqcount - * field. This is logically equivalent to - * enter_critical_section(). The matching call to - * leave_critical_section() will be performed in - * arm_sigdeliver(). + /* NOTE: If the task runs on another CPU(cpu), adjusting + * global IRQ controls will be done in the pause handler + * on the CPU(cpu) by taking a critical section. + * If the task is scheduled on this CPU(me), do nothing + * because this CPU already took a critical section */ - spin_setbit(&g_cpu_irqset, cpu, &g_cpu_irqsetlock, - &g_cpu_irqlock); - /* RESUME the other CPU if it was PAUSED */ if (cpu != me) diff --git a/arch/arm/src/cxd56xx/cxd56_cpupause.c b/arch/arm/src/cxd56xx/cxd56_cpupause.c index c8bd410..925450a 100644 --- a/arch/arm/src/cxd56xx/cxd56_cpupause.c +++ b/arch/arm/src/cxd56xx/cxd56_cpupause.c @@ -283,6 +283,7 @@ int up_cpu_paused(int cpu) int arm_pause_handler(int irq, void *c, FAR void *arg) { int cpu = up_cpu_index(); + int ret = OK; DPRINTF("cpu%d will be paused \n", cpu); @@ -295,12 +296,37 @@ int arm_pause_handler(int irq, void *c, FAR void *arg) * interrupt by calling up_cpu_paused. */ - if (spin_islocked(&g_cpu_paused[cpu])) + if (up_cpu_pausereq(cpu)) { - return up_cpu_paused(cpu); + /* NOTE: The following enter_critical_section() would call + * up_cpu_paused() to process a pause request to break a deadlock + * because the caller held a critical section. Once up_cpu_paused() + * finished, the caller will proceed and release the g_cpu_irqlock. + * Then this CPU will acquire g_cpu_irqlock in the function. + */ + + irqstate_t flags = enter_critical_section(); + + /* NOTE: Normally, we do not call up_cpu_paused() here because + * the above enter_critical_setion() would call up_cpu_paused() + * inside because the caller holds a crtical section. + * Howerver, cxd56's remote IRQ control logic also uses this handler + * and a caller might not take a critical section to avoid a deadlock + * during up_enable_irq() and up_disable_irq(). This is allowed + * because IRQ control logic does not interact wtih the scheduler. + * This means that if the request was not handled above, we need + * to call up_cpu_puased() here again. + */ + + if (up_cpu_pausereq(cpu)) + { + ret = up_cpu_paused(cpu); + } + + leave_critical_section(flags); } - return OK; + return ret; } /**************************************************************************** diff --git a/arch/arm/src/lc823450/lc823450_cpupause.c b/arch/arm/src/lc823450/lc823450_cpupause.c index 2714fc3..af8165c 100644 --- a/arch/arm/src/lc823450/lc823450_cpupause.c +++ b/arch/arm/src/lc823450/lc823450_cpupause.c @@ -209,9 +209,22 @@ int lc823450_pause_handler(int irq, void *c, FAR void *arg) * interrupt by calling up_cpu_paused. */ - if (spin_islocked(&g_cpu_paused[cpu])) + if (up_cpu_pausereq(cpu)) { - return up_cpu_paused(cpu); + /* NOTE: The following enter_critical_section() will call + * up_cpu_paused() to process a pause request to break a deadlock + * because the caller held a critical section. Once up_cpu_paused() + * finished, the caller will proceed and release the g_cpu_irqlock. + * Then this CPU will acquire g_cpu_irqlock in the function. + */ + + irqstate_t flags = enter_critical_section(); + + /* NOTE: the pause request should not exist here */ + + DEBUGVERIFY(!up_cpu_pausereq(cpu)); + + leave_critical_section(flags); } return OK; diff --git a/arch/risc-v/src/k210/k210_cpupause.c b/arch/risc-v/src/k210/k210_cpupause.c index 88af783..957d115 100644 --- a/arch/risc-v/src/k210/k210_cpupause.c +++ b/arch/risc-v/src/k210/k210_cpupause.c @@ -215,9 +215,22 @@ int riscv_pause_handler(int irq, void *c, FAR void *arg) * interrupt by calling up_cpu_paused. */ - if (spin_islocked(&g_cpu_paused[cpu])) + if (up_cpu_pausereq(cpu)) { - return up_cpu_paused(cpu); + /* NOTE: The following enter_critical_section() will call + * up_cpu_paused() to process a pause request to break a deadlock + * because the caller held a critical section. Once up_cpu_paused() + * finished, the caller will proceed and release the g_cpu_irqlock. + * Then this CPU will acquire g_cpu_irqlock in the function. + */ + + irqstate_t flags = enter_critical_section(); + + /* NOTE: the pause request should not exist here */ + + DEBUGVERIFY(!up_cpu_pausereq(cpu)); + + leave_critical_section(flags); } return OK; diff --git a/arch/risc-v/src/k210/k210_schedulesigaction.c b/arch/risc-v/src/k210/k210_schedulesigaction.c index 26a6c77..05aabbd 100644 --- a/arch/risc-v/src/k210/k210_schedulesigaction.c +++ b/arch/risc-v/src/k210/k210_schedulesigaction.c @@ -353,17 +353,13 @@ void up_schedule_sigaction(struct tcb_s *tcb, sig_deliver_t sigdeliver) DEBUGASSERT(tcb->irqcount < INT16_MAX); tcb->irqcount++; - /* In an SMP configuration, the interrupt disable logic also - * involves spinlocks that are configured per the TCB irqcount - * field. This is logically equivalent to - * enter_critical_section(). The matching call to - * leave_critical_section() will be performed in - * up_sigdeliver(). + /* NOTE: If the task runs on another CPU(cpu), adjusting + * global IRQ controls will be done in the pause handler + * on the CPU(cpu) by taking a critical section. + * If the task is scheduled on this CPU(me), do nothing + * because this CPU already took a critical section */ - spin_setbit(&g_cpu_irqset, cpu, &g_cpu_irqsetlock, - &g_cpu_irqlock); - /* RESUME the other CPU if it was PAUSED */ if (cpu != me) diff --git a/arch/xtensa/src/common/xtensa_cpupause.c b/arch/xtensa/src/common/xtensa_cpupause.c index 5b59200..7c58176 100644 --- a/arch/xtensa/src/common/xtensa_cpupause.c +++ b/arch/xtensa/src/common/xtensa_cpupause.c @@ -190,9 +190,22 @@ void xtensa_pause_handler(void) * interrupt by calling up_cpu_paused. */ - if (spin_islocked(&g_cpu_paused[cpu])) + if (up_cpu_pausereq(cpu)) { - up_cpu_paused(cpu); + /* NOTE: The following enter_critical_section() will call + * up_cpu_paused() to process a pause request to break a deadlock + * because the caller held a critical section. Once up_cpu_paused() + * finished, the caller will proceed and release the g_cpu_irqlock. + * Then this CPU will acquire g_cpu_irqlock in the function. + */ + + irqstate_t flags = enter_critical_section(); + + /* NOTE: the pause request should not exist here */ + + DEBUGVERIFY(!up_cpu_pausereq(cpu)); + + leave_critical_section(flags); } } diff --git a/arch/xtensa/src/common/xtensa_schedsigaction.c b/arch/xtensa/src/common/xtensa_schedsigaction.c index 9888557..62f4bfd 100644 --- a/arch/xtensa/src/common/xtensa_schedsigaction.c +++ b/arch/xtensa/src/common/xtensa_schedsigaction.c @@ -339,17 +339,13 @@ void up_schedule_sigaction(struct tcb_s *tcb, sig_deliver_t sigdeliver) DEBUGASSERT(tcb->irqcount < INT16_MAX); tcb->irqcount++; - /* In an SMP configuration, the interrupt disable logic also - * involves spinlocks that are configured per the TCB irqcount - * field. This is logically equivalent to - * enter_critical_section(). - * The matching call to leave_critical_section() will be - * performed in up_sigdeliver(). + /* NOTE: If the task runs on another CPU(cpu), adjusting + * global IRQ controls will be done in the pause handler + * on the CPU(cpu) by taking a critical section. + * If the task is scheduled on this CPU(me), do nothing + * because this CPU already took a critical section */ - spin_setbit(&g_cpu_irqset, cpu, &g_cpu_irqsetlock, - &g_cpu_irqlock); - /* RESUME the other CPU if it was PAUSED */ if (cpu != me) diff --git a/sched/sched/sched_addreadytorun.c b/sched/sched/sched_addreadytorun.c index 0d610b5..15077d9 100644 --- a/sched/sched/sched_addreadytorun.c +++ b/sched/sched/sched_addreadytorun.c @@ -318,47 +318,12 @@ bool nxsched_add_readytorun(FAR struct tcb_s *btcb) &g_cpu_schedlock); } - /* Adjust global IRQ controls. If irqcount is greater than zero, - * then this task/this CPU holds the IRQ lock + /* NOTE: If the task runs on another CPU(cpu), adjusting global IRQ + * controls will be done in the pause handler on the new CPU(cpu). + * If the task is scheduled on this CPU(me), do nothing because + * this CPU already has a critical section */ - if (btcb->irqcount > 0) - { - /* Yes... make sure that scheduling logic on other CPUs knows - * that we hold the IRQ lock. - */ - - spin_setbit(&g_cpu_irqset, cpu, &g_cpu_irqsetlock, - &g_cpu_irqlock); - } - - /* No.. This CPU will be relinquishing the lock. But this works - * differently if we are performing a context switch from an - * interrupt handler and the interrupt handler has established - * a critical section. We can detect this case when - * g_cpu_nestcount[me] > 0. - */ - - else if (g_cpu_nestcount[me] <= 0) - { - /* Do nothing here - * NOTE: spin_clrbit() will be done in sched_resumescheduler() - */ - } - - /* Sanity check. g_cpu_netcount should be greater than zero - * only while we are within the critical section and within - * an interrupt handler. If we are not in an interrupt handler - * then there is a problem; perhaps some logic previously - * called enter_critical_section() with no matching call to - * leave_critical_section(), leaving the non-zero count. - */ - - else - { - DEBUGASSERT(up_interrupt_context()); - } - /* If the following task is not locked to this CPU, then it must * be moved to the g_readytorun list. Since it cannot be at the * head of the list, we can do this without invoking any heavy diff --git a/sched/sched/sched_removereadytorun.c b/sched/sched/sched_removereadytorun.c index 963a259..87b6cbd 100644 --- a/sched/sched/sched_removereadytorun.c +++ b/sched/sched/sched_removereadytorun.c @@ -260,47 +260,12 @@ bool nxsched_remove_readytorun(FAR struct tcb_s *rtcb) &g_cpu_schedlock); } - /* Adjust global IRQ controls. If irqcount is greater than zero, - * then this task/this CPU holds the IRQ lock + /* NOTE: If the task runs on another CPU(cpu), adjusting global IRQ + * controls will be done in the pause handler on the new CPU(cpu). + * If the task is scheduled on this CPU(me), do nothing because + * this CPU already has a critical section */ - if (nxttcb->irqcount > 0) - { - /* Yes... make sure that scheduling logic on other CPUs knows - * that we hold the IRQ lock. - */ - - spin_setbit(&g_cpu_irqset, cpu, &g_cpu_irqsetlock, - &g_cpu_irqlock); - } - - /* No.. This CPU will be relinquishing the lock. But this works - * differently if we are performing a context switch from an - * interrupt handler and the interrupt handler has established - * a critical section. We can detect this case when - * g_cpu_nestcount[me] > 0. - */ - - else if (g_cpu_nestcount[me] <= 0) - { - /* Do nothing here - * NOTE: spin_clrbit() will be done in sched_resumescheduler() - */ - } - - /* Sanity check. g_cpu_netcount should be greater than zero - * only while we are within the critical section and within - * an interrupt handler. If we are not in an interrupt handler - * then there is a problem; perhaps some logic previously - * called enter_critical_section() with no matching call to - * leave_critical_section(), leaving the non-zero count. - */ - - else - { - DEBUGASSERT(up_interrupt_context()); - } - nxttcb->task_state = TSTATE_TASK_RUNNING; /* All done, restart the other CPU (if it was paused). */ diff --git a/sched/sched/sched_resumescheduler.c b/sched/sched/sched_resumescheduler.c index fc9fe26..88c354b 100644 --- a/sched/sched/sched_resumescheduler.c +++ b/sched/sched/sched_resumescheduler.c @@ -136,8 +136,11 @@ void nxsched_resume_scheduler(FAR struct tcb_s *tcb) { /* Release our hold on the IRQ lock. */ - spin_clrbit(&g_cpu_irqset, me, &g_cpu_irqsetlock, - &g_cpu_irqlock); + if ((g_cpu_irqset & (1 << me)) != 0) + { + spin_clrbit(&g_cpu_irqset, me, &g_cpu_irqsetlock, + &g_cpu_irqlock); + } } #endif /* CONFIG_SMP */ } diff --git a/sched/task/task_exit.c b/sched/task/task_exit.c index d570e45..64daf88 100644 --- a/sched/task/task_exit.c +++ b/sched/task/task_exit.c @@ -164,6 +164,13 @@ int nxtask_exit(void) if (rtcb->irqcount == 0) { + /* NOTE: Need to aquire g_cpu_irqlock here again before + * calling spin_setbit() becauses it was released in + * the above nxsched_resume_scheduler() + */ + + DEBUGVERIFY(irq_waitlock(this_cpu())); + spin_setbit(&g_cpu_irqset, this_cpu(), &g_cpu_irqsetlock, &g_cpu_irqlock); }
