Re: [rfc] powernv/kdump: Fix cases where the kdump kernel can get HMI's

Balbir Singh Tue, 05 Dec 2017 22:14:47 -0800

On Wed, Dec 6, 2017 at 4:07 PM, Haren Myneni <ha...@linux.vnet.ibm.com> wrote:
> On 12/05/2017 08:29 PM, Balbir Singh wrote:
>> On Mon, Dec 4, 2017 at 2:10 PM, Nicholas Piggin <npig...@gmail.com> wrote:
>>> On Mon, 4 Dec 2017 11:37:01 +1100
>>> Balbir Singh <bsinghar...@gmail.com> wrote:
>>>
>>>> On Sun, Dec 3, 2017 at 1:36 PM, Nicholas Piggin <npig...@gmail.com> wrote:
>>>>> Seems like a reasonable approach. Why do we only do this for
>>>>> powernv? It seems like a good idea in general to pull all
>>>>> offlined CPUs out and into the same state for all platforms
>>>>> and for all shutdown/restart/crash paths.
>>>>>
>>>>
>>>> The reason is largely wake-up related: do we expect offline CPUs to wake
>>>> up in the kdump kernel? The infrastructure allows us to selectively
>>>> decide which platforms need this support. I did not want to break the world
>>>> by enabling it across platforms (pseries for example) without good reason.
>>>
>>> What happens if a pseries offlined CPU gets an exception for some reason
>>> though? It seems like it would return into pseries_mach_cpu_die of the
>>> old kernel which will go wrong.
>>>
>>> Maybe the platform has stronger guarantees that it won't wake up there,
>>> like requiring a specific hcall or something?
>>>
>>> I was just thinking trying to move all platforms in general to the same
>>> scheme would be preferable, unless there is a good reason not to. Just
>>> for sharing code and behaviour.
>>>
>>
>> I am all for it. Can I propose we start with powernv, since I've tested that,
>> and as I test I can start enabling other platforms with follow-on patches?
>>
>>>>
>>>>> Also I wonder if there is anything we should do on the other
>>>>> side of the equation for the kdump kernel to pull CPUs into a
>>>>> known state rather than rely on the crash kernel to do it for
>>>>> us. We might have a better ability to do that with system
>>>>> reset IPIs now.
>>>>>
>>>>
>>>> Yes, but do we need to do that or quickly dump the vmcore to a file
>>>> and exit?
>>>
>>> Well if the previous kernel did not shut them down properly, we need
>>> to do that. Don't we? My point is the previous kernel crashed somehow,
>>> we should be trying to fix everything up rather than hoping it crashed
>>> "nicely" for us.
>>>
>>> Yes, we should disturb things as little as possible, but we've booted
>>> an entire new kernel in its own reserved memory, so I'm not sure if
>>> it's such a concern to try fixing up wayward CPUs.
>>
>> I think it might be a little late to fix them up, since their stack traces
>> won't show up in the crash. We can of course revisit this if required.
>> Consider, for example, a crash I saw where the kernel crashed while holding
>> a spinlock; other CPUs were stuck spinning on that lock and did not report
>> back on either side of the crash. I think we'd want our dump to show that.
>> In my case I'm waking up offline CPUs to prevent them from waking up and
>> doing processing that would otherwise break the system. I'm open to doing
>> the same thing on the other side, but I think the logic is more complex on
>> the new kernel side.
>
> We do not need to collect stack traces for offline CPUs at the time of the
> crash anyway. Even if these CPUs have to be brought online, that has to
> happen after collecting the current CPU states and just before the kdump boot.
>
> CPUs stuck with IRQs disabled will respond to an NMI anyway. Before Nick's
> NMI patches, the states of these CPUs were not collected with an IPI.
>
> Why do we need to bring offline CPUs online in the kdump boot? I thought we
> always boot the kdump kernel with a single CPU.


The reason is described in the patch changelog.

Balbir Singh
