On Wed, Mar 18, 2015 at 12:08:15PM -0600, Alex Williamson wrote:
> On Wed, 2015-03-18 at 18:45 +0100, Michael S. Tsirkin wrote:
> > On Wed, Mar 18, 2015 at 11:11:28AM -0600, Alex Williamson wrote:
> > > On Wed, 2015-03-18 at 17:44 +0100, Michael S. Tsirkin wrote:
> > > > On Wed, Mar 18, 2015 at 09:45:29AM -0600, Alex Williamson wrote:
> > > > > On Wed, 2015-03-18 at 16:02 +0100, Michael S. Tsirkin wrote:
> > > > > > On Wed, Mar 18, 2015 at 08:50:54AM -0600, Alex Williamson wrote:
> > > > > > > On Wed, 2015-03-18 at 15:36 +0100, Michael S. Tsirkin wrote:
> > > > > > > > On Wed, Mar 18, 2015 at 08:15:01AM -0600, Alex Williamson wrote:
> > > > > > > > > On Wed, 2015-03-18 at 15:05 +0100, Michael S. Tsirkin wrote:
> > > > > > > > > > On Wed, Mar 18, 2015 at 08:02:26AM -0600, Alex Williamson wrote:
> > > > > > > > > > > On Wed, 2015-03-18 at 14:23 +0100, Michael S. Tsirkin wrote:
> > > > > > > > > > > > typo in subject: vfio, not vifo.
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Mar 12, 2015 at 06:23:59PM +0800, Chen Fan wrote:
> > > > > > > > > > > > > for piix4 chipset, we don't need to expose aer, so introduce
> > > > > > > > > > > > > PC_I440FX_COMPAT for all piix4 machines to disable aercap,
> > > > > > > > > > > > > and add HW_COMPAT_2_2 to disable aercap for all lower than 2.3.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Signed-off-by: Chen Fan <chen.fan.f...@cn.fujitsu.com>
> > > > > > > > > > > >
> > > > > > > > > > > > Well vfio is never migrated ATM.
> > > > > > > > > > > > So why is compat code needed at all?
> > > > > > > > > > >
> > > > > > > > > > > It's not for migration, it's to maintain current behavior on existing
> > > > > > > > > > > platforms.
> > > > > > > > > > > If someone gets an uncorrected AER error on q35 machine type
> > > > > > > > > > > today, the VM stops. With this change, AER would be exposed to the
> > > > > > > > > > > guest and the guest could handle it. The compat change therefore
> > > > > > > > > > > maintains the stop VM behavior on existing q35 machine types.
> > > > > > > > > >
> > > > > > > > > > If stop VM behaviour is useful, expose it to users.
> > > > > > > > > > If not, then don't.
> > > > > > > > > > I don't see why does it have to be tied to machine types.
> > > > > > > > >
> > > > > > > > > Because q35-2.2 machine type will currently do a stop VM on uncorrected
> > > > > > > > > AER error. If we don't tie that to a machine option then q35-2.2 would
> > > > > > > > > suddenly start exposing the error to the guest. That's a fairly
> > > > > > > > > significant change in behavior for a static machine type.
> > > > > > > >
> > > > > > > > I don't think you can classify it as a behaviour change. VM stop is not
> > > > > > > > guest visible behaviour.
> > > > > > >
> > > > > > > In one case, an uncorrected AER occurs and the VM is stopped by QEMU.
> > > > > > > In the other case, the guest is notified and may attempt corrective
> > > > > > > action... or maybe the guest doesn't understand AER and the user is
> > > > > > > depending on the previous behavior. That is absolutely a behavior
> > > > > > > change.
> > > > > > >
> > > > > > > > Are you worrying about guests misbehaving when they see these errors?
> > > > > > > > Then you want this as user-controlled, supported option.
> > > > > > >
> > > > > > > Whether the option is user visible is tangential to whether the behavior
> > > > > > > of existing machine types should be maintained.
> > > > > > > Existing machine types can impose a different default than current
> > > > > > > machine types.
> > > > > > >
> > > > > > > > In other words: we only tie things to machine types when we
> > > > > > > > have to. This code gets almost no testing, and is a lot of
> > > > > > > > work to test. This one sounds like "just in case" is not a good
> > > > > > > > motivation.
> > > > > > >
> > > > > > > It seems like an obvious use case for using machine types to maintain
> > > > > > > compatibility with previous behavior, which is exactly why we have
> > > > > > > machine types. If we're not going to use it, why do we have it?
> > > > > >
> > > > > > We have machine types because of the following issues:
> > > > > > - some silent changes confuse guests. For example guest installed with
> > > > > >   one machine type might not boot if you try to use it after
> > > > > >   changing something, or - in case of windows - throw up warnings.
> > > > > > - some changes break migration
> > > > > >
> > > > > > Looks like none of these cases.
> > > > > > If AER is unsafe, turn it off by default for everyone.
> > > > >
> > > > > This is silly, we have the tools, let's use them.
> > > >
> > > > It's a very expensive tool, maintenance-wise. We often don't
> > > > have the choice but I'm not going to use this tool by choice
> > > > unless we know why we are doing this.
> > > >
> > > > > If a user is running
> > > > > a VM that gets a VM stop on AER error one day and they upgrade QEMU and
> > > > > restart it, they should get the same behavior, whether a migration is
> > > > > involved or not.
> > > >
> > > > You keep saying this, but why should it? Answer that question, the rest
> > > > will follow.
> > >
> > > My answer is that it's a user visible change in behavior that they may
> > > rely on and would not expect to change within an existing machine type.
> > > Silently modifying the default may expose them to error conditions that
> > > were previously handled by QEMU and adds AER handling requirements to
> > > the guest.
> > >
> > > Ignoring the question about whether an AER can be reliably recovered in
> > > the guest for a moment, a typical PC hardware platform will signal the
> > > running OS with AER errors, so it seems that *if* we can achieve parity
> > > in a virtual machine with that signaling, and more importantly the
> > > recovery process, then the default going forward should be to mimic bare
> > > metal behavior, thus changing the default from what we do today.
> > >
> > > As it is now, I think we have too many outstanding questions about how
> > > recovery occurs to change the default.
> > >
> > > So let me return the question, if we were to resolve those questions and
> > > change the default handling from what we do today, creating a user
> > > visible behavior change and imposing new requirements for AER handling
> > > in the guest, why is that not worthy of machine type stability? I'd
> > > certainly expect it from a distro specific machine type.
> >
> > This really depends on what guests do with this.
> > We try to avoid guest visible changes during migration
> > (even that's not 100%).
> > There aren't such requirements for devices where migration
> > is disabled, e.g. we did many guest visible changes in q35.
> >
> > Latest machine types are better tested. Deviating from latest machine
> > just for a theoretical "this is guest visible" isn't going to give
> > you better stability, it will give you worse stability.
>
> You keep calling this theoretical. As it exists today, an uncorrected
> AER error results in a VM stop with no guest intervention or support
> required. If we were to change the default, the guest would instead be
> notified of the AER and given the opportunity to recover. That's not
> just guest visible, that's a complete change in error handling paradigm.

OK, so maybe it's a feature that users should have control over.
But tying it to machine types makes no sense.

> > If you are worried about guest bugs, they are not going
> > away just because we release new QEMU. So if guests are
> > buggy with this feature, this needs a solution for all
> > machine types.
>
> Guest bugs are not my issue.
>
> > Anyway, if we keep the default, that's even easier.
>
> Agreed, and that may be where we end if we can't come up with a
> reasonable expectation that the guest can recover from an AER. However,
> I think machine compatibility should be fair game if we do want to flip
> that switch.
>
> > > > > Maybe the default should be disabled, this patch
> > > > > series hasn't yet even convinced me that there's a worthwhile general
> > > > > case where the guest can recover, but using the existing machine
> > > > > compatibility infrastructure should be at our disposal if we do think
> > > > > the default going forward should be different than the behavior today.
> > > >
> > > > I'm sorry, I don't think it's the right tool for the job.