jlaitine commented on PR #16673: URL: https://github.com/apache/nuttx/pull/16673#issuecomment-3164027206
> > @fdcavalcanti : I made some code review around the related code paths. There seems to be several existing issues related to this, and I honestly don't understand how it has ever worked. > > Observations from the sched/sched: > > ``` > > 1. nxsched_set_affinity tries to schedule the task out and back in by calling nxsched_set_priority. But this works _only_ if there happens to be another ready-to-run pending task with equal (or higher) priority in queue. And this is wrong. If the task is running on a wrong CPU, it should be scheduled out regardless of it's priority and then try to schedule it back (on another CPU). I have _not_ touched this behaviour... But for sure I can try to fix this. But this is really a different issue. > > ``` > > > > > > > > > > > > > > > > > > > > > > > > These are the observations from both the esp32_spiflash.c and the esp32s3_spiflash.c: > > ``` > > 2. there is a race condition between the debugassert and nxsched_set_priority and esp32_spiflash.c. Since the DEBUGASSERT in esp32_spiflash.c is not inside a critical section, it is possible that the the "spiflash_op" thread hits the DEBUGASSERT (on one CPU) right after the the "nxsched_set_affinity" has set the new affinity mask, but before it has scheduled the other task out (on the other CPU). > > > > 3. in any case, the "spi_flash_op_block_task" may start on other CPU than what will be set later on in set_affinity. This means that the "int cpu = this_cpu();" just reads out the CPU on which the task is started, and later even if the task would have switched to the correct cpu (after the affinity setting), it is comparing just to the CPU on which it started on. > > > > 4. the DEBUGASSERT itself is crazy. It is only checking if the affinity mask of the tcb changes, and if the CPU in the affinity mask matches the CPU on which the task started on. It is not checking on which CPU the task is running on inside the loop, which I believe has been the intention! (it is not checking this_cpu() again inside the loop). > > ``` > > > > > > > > > > > > > > > > > > > > > > > > So, in order to fix this; > > ``` > > * In the esp32s3 you'd want to make sure that the "spi_flash_op_block_task" does not even start on a wrong CPU. You should be able to fix this by keeping a both "kthread_create" and "nxsched_set_affinity" inside a single critical section in "spiflash_init_spi_flash_op_block_task". This should fix the obvious bug(s) in the spiflash driver. > > > > * Perhaps think again what the DEBUGASSERT in question really wants to check? > > ``` > > > > > > > > > > > > > > > > > > > > > > > > In addition: > > ``` > > * The nxsched_set_affinity should be fixed as well; although this is an old bug, and doesn't cause this issue (it has been running on luck?) > > > > * Are you sure that the CONFIG_ESP32_SPIFLASH_OP_TASK_STACKSIZE is large enough with debugasserts enabled? The stack dumps and especially the register values in the last dump look really suspicious..... > > ``` > > > > > > > > > > > > > > > > > > > > > > > > I am sorry I have caused extra testing effort for you, @fdcavalcanti; I really appreciate your testing! I definitely don't want to break things. It seems that I just managed to surface some hidden bugs with this PR... Like the signal delivery race condition which was really fatal and on many platforms, and we were really lucky that it started crashing on xtensa in qemu on that. > > Thank you for the detailed answer @jlaitine. Let us review this internally during the following week. I would ask to not merge this PR for now. I agree, there is no hurry to merge this PR. I have two crashing bugs which this PR solves, but of course we need to be certain that we don't break things. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@nuttx.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org