Hello Michael, On Mon, Jan 14, 2019 at 12:11:44PM -0600, Michael Bringmann wrote: > On 1/9/19 12:08 AM, Gautham R Shenoy wrote: > > > I did some testing during the holidays. Here are the observations: > > > > 1) With just your patch (without any additional debug patch), if I run > > DLPAR on /off operations on a system that has SMT=off, I am able to > > see a crash involving RTAS stack corruption within an hour's time. > > > > 2) With the debug patch (appended below) which has additional debug to > > capture the callers of stop-self, start-cpu, set-power-levels, the > > system is able to perform DLPAR on/off operations on a system with > > SMT=off for three days. And then, it crashed with the dead CPU showing > > a "Bad kernel stack pointer". From this log, I can clearly > > see that there were no concurrent calls to stop-self, start-cpu, > > set-power-levels. The only concurrent RTAS calls were the dying CPU > > calling "stop-self", and the CPU running the DLPAR operation calling > > "query-cpu-stopped-state". The crash signature is appended below as > > well. > > > > 3) Modifying your patch to remove the udelay and increase the loop > > count from 25 to 1000 doesn't improve the situation. We are still able > > to see the crash. > > > > 4) With my patch, even without any additional debug, I was able to > > observe the system run the tests successfully for over a week (I > > started the tests before the Christmas weekend, and forgot to turn it > > off!) > > So does this mean that the problem is fixed with your patch?
No. On the contrary I think my patch unable to exploit the possible race window. From a technical point of view, Thiago's patch does the right things - It waits for the target CPU to come out of CEDE and set its state to CPU_STATE_OFFLINE. - Only then it then makes the "query-cpu-stopped-state" rtas call in a loop, while giving sufficient delay between successive queries. This reduces the unnecessary rtas-calls. In my patch, I don't do any of this, but simply keep calling "query-cpu-stopped-state" call in a loop for 4000 iterations. That said, if I incorporate the wait for the target CPU to set its state to CPU_STATE_OFFLINE in my patch and then make the "query-cpu-stopped-state" rtas call, then, even I am able to get the crash with the "RTAS CALL BUFFER CORRUPTION" message. I think that incorporating the wait for the target CPU set its state to CPU_STATE_OFFLINE increases the probability that "query-cpu-stopped-state" and "stop-self" rtas calls get called more or less at the same time. However since concurrent invocations of these rtas-calls is allowed by the PAPR, it should not result in a "RTAS CALL BUFFER CORRUPTION". Am I missing something here ? > > > > > It appears that there is a narrow race window involving rtas-stop-self > > and rtas-query-cpu-stopped-state calls that can be observed with your > > patch. Adding any printk's seems to reduce the probability of hitting > > this race window. It might be worth the while to check with RTAS > > folks, if they suspect something here. > > What would the RTAS folks be looking at here? The 'narrow race window' > is with respect to a patch that it sound like we should not be > using. IMHO, if the race-window exists, it would be better to confirm and fix it, given that we have a patch that can exploit it consistently. > > Thanks. > Michael > > -- > Michael W. Bringmann > Linux Technology Center > IBM Corporation > Tie-Line 363-5196 > External: (512) 286-5196 > Cell: (512) 466-0650 > m...@linux.vnet.ibm.com -- Thanks and Regards gautham.