On Thu, May 9, 2019 at 12:00 PM Andrew Cooper <andrew.coop...@citrix.com> wrote:
> On 09/05/2019 18:46, Tamas K Lengyel wrote:
> > On Thu, May 9, 2019 at 10:43 AM Andrew Cooper <andrew.coop...@citrix.com> wrote:
> >> On 09/05/2019 17:19, Mathieu Tarral wrote:
> >>> On Tuesday, May 7, 2019 2:01 PM, Mathieu Tarral <mathieu.tar...@protonmail.com> wrote:
> >>>
> >>>>> Given how many EPT flushing bugs I've already found in this area, I 
> >>>>> wouldn't be surprised if there are further ones lurking.  If it is an 
> >>>>> EPT flushing bug, this delta should make it go away, but it will come 
> >>>>> with a hefty perf hit.
> >>>>>
> >>>>> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> >>>>> index 283eb7b..019333d 100644
> >>>>> --- a/xen/arch/x86/hvm/vmx/vmx.c
> >>>>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> >>>>> @@ -4285,9 +4285,7 @@ bool vmx_vmenter_helper(const struct cpu_user_regs *regs)
> >>>>>              }
> >>>>>          }
> >>>>>
> >>>>> -        if ( inv )
> >>>>> -            __invept(inv == 1 ? INVEPT_SINGLE_CONTEXT : INVEPT_ALL_CONTEXT,
> >>>>> -                     inv == 1 ? single->eptp          : 0);
> >>>>> +        __invept(INVEPT_ALL_CONTEXT, 0);
> >>>>>      }
> >>>>>
> >>>>>   out:
> >>>> I can give this a try and see if it resolves the problem!
> >>> Just tested, on Xen 4.12.0, and the bug is still here.
> >>> Windows 7 is having BSODs with 4 VCPUs.
> >>> I didn't notice a hefty performance impact though.
> >>>
> >>> Do we have other caches to invalidate?
> >>> Is there something else that I should test?
> >>>
> >>> I don't feel comfortable digging into Xen's code, especially for
> >>> something as complicated as page table and memory management,
> >>> made more complex still by altp2m.
> >>> What I can do, however, is test your ideas and patches and report
> >>> whatever information I can gather on this issue.
> >>>
> >>> Note: I tested with the latest commits on Drakvuf/master, especially:
> >>> "Add a VM pause for shadow copy refresh operation"
> >>> https://github.com/tklengyel/drakvuf/pull/626
> >>>
> >>> @tamas, did you make this patch to fix the kind of race condition
> >>> that I'm reporting?
> >>> Or was it totally unrelated?
> >> With the above change in place and BSODs still happening, I'm fairly
> >> convinced that it is not a TLB flushing issue.
> >>
> >> Therefore, the conclusion to draw is that it is a logical bug somewhere.
> > I agree.
> >
> >> First of all - ensure you are using up-to-date microcode.  The number of
> >> errata which have been discovered by people associated with the Xen
> >> community is large.
> >>
> >> The microcode is available from
> >> https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files/ and
> >> https://andrewcoop-xen.readthedocs.io/en/latest/admin-guide/microcode-loading.html
> >> is some documentation I prepared earlier.
> >>
> >> Beyond that, I think it would help to know exactly how libvmi is
> >> manipulating the guest.
> > I already suggested to Mathieu to try to reproduce the issue using the
> > xen-access test tool that's in the Xen tree to cut out all that
> > complexity.
> xen-access is ok, but I've never encountered a situation where I haven't
> had to modify it first to get it usable.

Right, it would likely have to be modified.

> I have some plans to replace it with something far more usable, as part
> of tying together some XTF-based VMI testing, but none of that is
> remotely ready yet.

Yes, that would be fantastic to have.

> > Without being able to limit the scope of the bug and
> > reproducibly trigger it, I see little chance of finding the
> > root cause. Unfortunately I don't have the time to do that myself.
> I can probably help out with some suggestions, but I agree that we are
> going to have to cut out some of the complexity here to figure out
> exactly what is going on.
> Alternatively, if there are some sufficiently detailed instructions for
> how to put together a repro of the problem using libvmi/etc, I might be
> able to start debugging from that, but I definitely don't have time to
> do that in the next week.

The instructions are on https://drakvuf.com. AFAICT Mathieu is running
into the issue simply by running it on an up-to-date Windows 10 guest,
but not in any way that I would call reproducible. Running it "for a
minute or two" is really not a reproducible bug description.


Xen-devel mailing list
