Hi, > On 20 Mar 2024, at 18:40, Julien Grall <jul...@xen.org> wrote: > > Hi John, > > On 20/03/2024 16:24, John Ernberg wrote: >> Hi Bertrand, >> On 3/13/24 11:07, Bertrand Marquis wrote: >>> Hi, >>> >>>> On 8 Mar 2024, at 15:04, Julien Grall <jul...@xen.org> wrote: >>>> >>>> Hi John, >>>> >>>> Thank you for the reply. >>>> >>>> On 08/03/2024 13:40, John Ernberg wrote: >>>>> On 3/7/24 00:07, Julien Grall wrote: >>>>>> > Ping on the watchdog discussion bits. >>>>>> >>>>>> Sorry for the late reply. >>>>>> >>>>>> On 06/03/2024 13:13, John Ernberg wrote: >>>>>>> On 2/9/24 14:14, John Ernberg wrote: >>>>>>>> >>>>>>>>> * IMX_SIP_TIMER_*: This seems to be related to the watchdog. >>>>>>>>> Shouldn't dom0 rely on the watchdog provided by Xen instead? So those >>>>>>>>> call will be used by Xen. >>>>>>>> >>>>>>>> That is indeed a watchdog SIP, and also for setting the SoC internal >>>>>>>> RTC >>>>>>>> if it is being used. >>>>>>>> >>>>>>>> I looked around if there was previous discussion and only really >>>>>>>> found [3]. >>>>>>>> Is the xen/xen/include/watchdog.h header meant to be for this kind of >>>>>>>> watchdog support or is that more for the VM watchdog? Looking at the >>>>>>>> x86 >>>>>>>> ACPI NMI watchdog it seems like the former, but I have never worked >>>>>>>> with >>>>>>>> x86 nor ACPI. >>>>>> >>>>>> include/watchdog.h contains helper to configure the watchdog for Xen. We >>>>>> also have per-VM watchdog and this is configured by the hypercall >>>>>> SCHEDOP_watchdog. >>>>>> >>>>>>>> >>>>>>>> Currently forwarding it to Dom0 has not caused any watchdog resets with >>>>>>>> our watchdog timeout settings, our specific Dom0 setup and VM count. >>>>>> >>>>>> IIUC, the SMC API for the watchdog would be similar to the ACPI NMI >>>>>> watchdog. So I think it would make more sense if this is not exposed to >>>>>> dom0 (even if Xen is doing nothing with it). >>>>>> >>>>>> Can you try to hide the SMCs and check if dom0 still behave properly? >>>>>> >>>>>> Cheers, >>>>>> >>>>> This SMC manages a hardware watchdog, if it's not pinged within a >>>>> specific interval the entire board resets. >>>> >>>> Do you know what's the default interval? Is it large enough so Xen + dom0 >>>> can boot (at least up to when the watchdog driver is initialized)? >>>> >>>>> If I block the SMCs the watchdog driver in Dom0 will fail to ping the >>>>> watchdog, triggering a board reset because the system looks to have >>>>> become unresponsive. The reason this patch set started is because we >>>>> couldn't ping the watchdog when running with Xen. >>>>> In our specific system the bootloader enables the watchdog as early as >>>>> possible so that we can get watchdog protection for as much of the boot >>>>> as we possibly can. >>>>> So, if we are to block the SMC from Dom0, then Xen needs to take over >>>>> the pinging. It could be implemented similarly to the NMI watchdog, >>>>> except that the system will reset if the ping is missed rather than >>>>> backtrace. >>>>> It would also mean that Xen will get a whole watchdog driver-category >>>>> due to the watchdog being vendor and sometimes even SoC specific when it >>>>> comes to Arm. >>>>> My understanding of the domain watchdog code is that today the domain >>>>> needs to call SCHEDOP_watchdog at least once to start the watchdog >>>>> timer. Since watchdog protection through the whole boot process is >>>>> desirable we'd need some core changes, such as an option to start the >>>>> domain watchdog on init. > >>>>> It's quite a big change to make >>>> >>>> For clarification, above you seem to mention two changes: >>>> >>>> 1) Allow Xen to use the HW watchdog >>>> 2) Allow the domain to use the watchdog early >>>> >>>> I am assuming by big change, you are referring to 2? >>>> >>>> , while I am not against doing it if it >>>>> makes sense, I now wonder if Xen should manage hardware watchdogs. >>>>> Looking at the domain watchdog code it looks like if a domain does not >>>>> get enough execution time, the watchdog will not be pinged enough and >>>>> the guest will be reset. So either watchdog approach requires Dom0 to >>>>> get execution time. Dom0 also needs to service all the PV backends it's >>>>> responsible for. I'm not sure it's valuable to add another layer of >>>>> watchdog for this scenario as the end result (checking that the entire >>>>> system works) is achieved without it as well. >>>>> So, before I try to find the time to make a proposal for moving the >>>>> hardware watchdog bit to Xen, do we really want it? >>>> >>>> Thanks for the details. Given that the watchdog is enabled by the >>>> bootloader, I think we want Xen to drive the watchdog for two reasons: >>>> 1) In true dom0less environment, dom0 would not exist >>>> 2) You are relying on Xen + Dom0 to boot (or at least enough to get the >>>> watchdog working) within the watchdog interval. >>> >>> Definitely we need to consider the case where there is no Dom0. >>> >>> I think there are in fact 3 different use cases here: >>> - watchdog fully driven in a domain (dom0 or another): would work if it is >>> expected >>> that Xen + Domain boot time is under the watchdog initial refresh rate. >>> I think this >>> could make sense on some applications where your system depends on a >>> specific >>> domain to be properly booted to work. This would require an initial >>> refresh time >>> configurable in the boot loader probably. >> This is our use-case. ^ >> Our dom0 is monitoring and managing the other domains in our system. >> Without dom0 working the system isn't really working as a whole. >> @Julien: Would you be ok with the patch set continuing in the direction >> of the >> original proposal, letting another party (or me at a later time) implement >> the fully driven by Xen option? > I am concerned about this particular point from Bertrand: > > "would work if it is expected that Xen + Domain boot time is under the > watchdog initial refresh rate." > > How will a user be able to figure out how to initially configure the > watchdog? Is this something you can easily configure in the bootloader at > runtime?
Definitely here it would be better to have the watchdog turned off by default and document how to enable it in the firmware. Even if a long timeout is configured by default, a user could run into trouble if using a linux without watchdog or not running linux or using dom0less without a linux having access to it. I agree with Julien here that the concern could be that users would come to us instead of NXP if there is system is doing a reset without reasons after some seconds or minutes. > > > Overall, I am not for this approach at least in the current status. I would > be more inclined if there are some documentations explaining how this is > supposed to be configured on NXP, so others can use the code. > > Anyway, this is why we have multiple Arm maintainers for this kind of > situation. If they other agrees with you, then they can ack the patch and > this can be merged. I agree with Stefano that it would be good to have those board supported. One thing i could suggest until there is a watchdog driver inside Xen is to have a clear Warning at Xen boot on those boards in the console so that we could at least identify the reason easily if a reset occurs for someone. Cheers Bertrand > > Cheers, > > -- > Julien Grall