Re: [Xen-devel] HVM domains crash after upgrade from XEN 4.5.1 to 4.5.2

Andrew Cooper Sat, 14 Nov 2015 12:34:33 -0800

On 14/11/2015 00:16, Atom2 wrote:
> On 13.11.15 at 11:09, Andrew Cooper wrote:
>> On 13/11/15 07:25, Jan Beulich wrote:
>>>>>> On 13.11.15 at 00:00, <ariel.at...@web2web.at> wrote:
>>>> On 12.11.15 at 17:43, Andrew Cooper wrote:
>>>>> On 12/11/15 14:29, Atom2 wrote:
>>>>>> Hi Andrew,
>>>>>> thanks for your reply. Answers are inline further down.
>>>>>>
>>>>>> On 12.11.15 at 14:01, Andrew Cooper wrote:
>>>>>>> On 12/11/15 12:52, Jan Beulich wrote:
>>>>>>>>>>> On 12.11.15 at 02:08, <ariel.at...@web2web.at> wrote:
>>>>>>>>> After the upgrade HVM domUs appear to no longer work - regardless
>>>>>>>>> of the
>>>>>>>>> dom0 kernel (tested with both 3.18.9 and 4.1.7 as the dom0 kernel); PV
>>>>>>>>> domUs, however, work just fine as before on both dom0 kernels.
>>>>>>>>>
>>>>>>>>> xl dmesg shows the following information after the first crashed HVM
>>>>>>>>> domU which is started as part of the machine booting up:
>>>>>>>>> [...]
>>>>>>>>> (XEN) Failed vm entry (exit reason 0x80000021) caused by invalid guest
>>>>>>>>> state (0).
>>>>>>>>> (XEN) ************* VMCS Area **************
>>>>>>>>> (XEN) *** Guest State ***
>>>>>>>>> (XEN) CR0: actual=0x0000000000000039, shadow=0x0000000000000011,
>>>>>>>>> gh_mask=ffffffffffffffff
>>>>>>>>> (XEN) CR4: actual=0x0000000000002050, shadow=0x0000000000000000,
>>>>>>>>> gh_mask=ffffffffffffffff
>>>>>>>>> (XEN) CR3: actual=0x0000000000800000, target_count=0
>>>>>>>>> (XEN)      target0=0000000000000000, target1=0000000000000000
>>>>>>>>> (XEN)      target2=0000000000000000, target3=0000000000000000
>>>>>>>>> (XEN) RSP = 0x0000000000006fdc (0x0000000000006fdc)  RIP =
>>>>>>>>> 0x0000000100000000 (0x0000000100000000)
>>>>>>>> Other than RIP looking odd for a guest still in non-paged protected
>>>>>>>> mode I can't seem to spot anything wrong with guest state.
>>>>>>> odd? That will be the source of the failure.
>>>>>>>
>>>>>>> Out of long mode, the upper 32 bits of %rip should all be zero, and it
>>>>>>> should not be possible to set any of them.
>>>>>>>
>>>>>>> I suspect that the guest has exited for emulation, and there has been a
>>>>>>> bad update to %rip.  The alternative (which I hope is not the case) is
>>>>>>> that there is a hardware erratum which allows the guest to accidentally
>>>>>>> get itself into this condition.
>>>>>>>
>>>>>>> Are you able to rerun with a debug build of the hypervisor?
>>>>>> [snip]
>>>>>> Another question is whether prior to enabling the debug USE flag it
>>>>>> might make sense to re-compile with gcc-4.8.5 (please see my previous
>>>>>> list reply) to rule out any compiler related issues. Jan, Andrew -
>>>>>> what are your thoughts?
>>>>> First of all, check whether the compiler makes a difference on 4.5.2
>>>> Hi Andrew,
>>>> I changed the compiler and there was no change for the better:
>>>> Unfortunately the HVM domU is still crashing with a similar error 
>>>> message as soon as it is being started.
>>>>> If both compiles result in a guest crashing in that manner, test a debug
>>>>> Xen to see if any assertions/errors are encountered just before the
>>>>> guest crashes.
>>>>>
>>>> As the compiler did not make any difference, I enabled the debug USE 
>>>> flag, re-compiled (using gcc-4.9.3), and rebooted using a serial console 
>>>> to capture output. Unfortunately I did not get very far and things
>>>> became even stranger: This time the system did not even finish the boot
>>>> process, but rather hard-stopped pretty early with a message reading 
>>>> "Panic on CPU 3: DOUBLE FAULT -- system shutdown". The captured logfile 
>>>> is attached as "serial log.txt".
>>>>
>>>> As this happened immediately after the CPU microcode update, I thought 
>>>> there might be a connection and disabled the microcode update. After the 
>>>> next reboot it seemed as if the boot process got a bit further as 
>>>> evidenced by a few more lines in the log file (those between lines 136 
>>>> and 197 in the second log file named "serial log no ucode.txt"), but in 
>>>> the end it finished off with an identical error message (only the CPU #
>>>> was different this time, but that number seems to change between boots 
>>>> anyways).
>>>>
>>>> I hope that makes some sense to you.
>>> Not really, other than now even more suspecting bad hardware or
>>> something fundamentally wrong with your build. Did you retry with
>>> a freshly built 4.5.1? Could you alternatively try with a known good
>>> build of 4.5.2 (e.g. from osstest)?
> Andrew,
> many thanks again for your help.
>> Agreed.  Double faults indicate that the exception handling entry points
>> are not set up in an appropriate state.  Something is definitely wrong
>> with either the compiled binary or the hardware.
> The hardware (it's a SandyBridge XEON processor with ECC RAM and
> Enterprise SATA disks) has worked for almost two years together with
> XEN and other than this issue there's also currently nothing strange
> (i.e. if I boot with a standard linux kernel, the system boots and
> works without any issues and is very stable and there are also no
> strange messages in /var/log/messages).
>> Several questions and lines of investigation:
>>
>> Is this straight Xen 4.5.1 and 2, or do Gentoo have their own patches on
>> top?
> By the looks of it there are only security patches, but no gentoo
> specific patches.
>> On repeated attempts, are the details of the double fault identical
>> (other than the cpu), or does it move around (i.e. always do_IRQ+0x15)
> It always seems to be do_IRQ+0x15 (I have made a number of boots), and
> more specifically it was always do_IRQ+0x15/0x64c. Timings varied, the
> CPU differed between boots and also the rax, rbx, rcx, rdx, rsi, rdi,
> rbp, rsp, and r8 values were different as were those for r15 and cr2 and
> the values next to valid stack range. I don't know whether that's of
> relevance, but I thought I'd mention it after a quick analysis of two
> serial logs.
>> Can you boot with console_timestamps=boot on the command line in the
>> future?  This will put Linux-style timestamps on log messages.
> Done - see the two attached serial log files mentioned above.
>> Can you also compile in the attached patch? I haven't quite got it
>> suitable for inclusion upstream yet, but it will also dump the
>> instruction stream under the fault.
> I have added the patch, but it seems not to trigger - at least the
> text that is in the patch does not show in the serial console output;
> it is however in the (uncompressed) xen.gz file as a quick grep for
> texts "Xen code around " and " [fault on access]" confirmed (grep
> said: binary file matches).
>> Finally, can you disassemble the xen-syms which results from the debug
>> build and paste the start of do_IRQ.  (i.e. `gdb xen-syms` and "disass
>> do_IRQ")
> Please see the third file named "do_IRQ".
>
> Furthermore, I have managed to get the system to boot up again (an HVM
> domU, however, still crashes) by turning off the debug USE flag and
> just adding symbols to the XEN binary. It seems that the debug USE flag
> is probably not what I was expecting. On
> https://wiki.gentoo.org/wiki/Project:Quality_Assurance/Backtraces it
> describes the purpose of the debug USE flag as follows (emphasis by me):
>
> ==== start excerpt from web page ====
> Some ebuilds provide a *debug* USE flag. While some mistakenly use it
> to provide debug information and play with compiler flags when it is
> enabled, that is not its purpose.
>
> _If you're trying to debug a reproduceable crash, you want to leave
> this USE flag alone, as it'll be building a different source than what
> you had before._ It is more efficient to get first a backtrace without
> changing the code, by simply emitting symbol information, and just
> afterward enable debug features to track the issue further down.
>
> Debug features that are enabled by the USE flag include assertions,
> debug logs on screen, debug files, leak detection and extra-safe
> operations (such as scrubbing memory before use). Some of them might
> be taxing, especially for complex software or software where
> performance is an important issue.
>
> For these reasons, please exercise caution when enabling the *debug*
> USE flag, and only consider it a last-chance card.
> ==== end excerpt from web page ====
>
> Now _without_ the debug USE flag, but with debug information in the
> binary (I used splitdebug), all is back to where the problem started
> off (i.e. the system boots without issues until such time it starts a
> HVM domU which then crashes; PV domUs are working). I have attached
> the latest "xl dmesg" output with the timing information included.
>
> I hope any of this makes sense to you.
>
> Again many thanks and best regards
>


Right - it would appear that the USE flag is definitely not what you
wanted, and causes bad compilation for Xen.  The do_IRQ disassembly you
sent is the result of disassembling a whole block of zeroes.  Sorry
for leading you on a wild goose chase - the double faults will be the
product of bad compilation, rather than anything to do with your specific
problem.

However, the final log you sent (dmesg) is using a debug Xen, which is
what I was attempting to get you to do originally.

We still observe that the VM ends up in 32bit non-paged mode but with an
RIP with bit 32 set, which is an invalid state to be in.  However, there
was nothing particularly interesting in the extra log information.
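
To spell out why that state is invalid, here is a rough sketch (plain
Python, not anything taken from Xen) that decodes the CR0 and RIP values
from the VMCS dump quoted above.  The helper names are made up for
illustration; only the two register values come from the log.

```python
# Values copied from the "xl dmesg" VMCS dump quoted earlier in this mail.
CR0 = 0x0000000000000039
RIP = 0x0000000100000000

CR0_PE = 1 << 0    # protected mode enable
CR0_PG = 1 << 31   # paging enable


def describe_mode(cr0):
    """Classify the guest mode from CR0 alone (hypothetical helper)."""
    if not cr0 & CR0_PE:
        return "real mode"
    if not cr0 & CR0_PG:
        return "protected mode, paging disabled"
    return "paging enabled (EFER needed to tell 32-bit from long mode)"


def rip_upper_bits(rip):
    """Return bits 63:32 of RIP; these must be zero outside long mode."""
    return rip >> 32


print(describe_mode(CR0))
print(hex(rip_upper_bits(RIP)))
```

Long mode requires paging, so a guest with CR0.PG clear cannot be in
long mode, and any non-zero value from rip_upper_bits() is then enough
for the VM entry to be rejected.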

Please can you rerun with "hvm_debug=0xc3f", which will cause far more
logging to occur to the console while the HVM guest is running.  That
might show some hints.

Also, the fact that this occurs just after starting SeaBIOS is
interesting.  As you have switched versions of Xen, you have also
switched hvmloader, which contains the SeaBIOS binary embedded in it. 
Would you be able to compile both 4.5.1 and 4.5.2 and switch the
hvmloader binaries in use?  It would be very interesting to see whether
the failure is caused by the hvmloader binary or the hypervisor.  (With
`xl`, you can use firmware_override="/full/path/to/firmware" to override
the default hvmloader).
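
The reasoning behind the swap can be written down as a small sketch.
This is only an illustration of how to read the four combinations; the
function name is hypothetical and the results in the example are
placeholders, not measurements from your system.

```python
def blame(crashed):
    """Decide which component the crash follows.

    crashed maps (xen_version, hvmloader_version) -> True if the HVM
    guest crashed with that combination.
    """
    follows_xen = all(c == (xen == "4.5.2")
                      for (xen, _), c in crashed.items())
    follows_loader = all(c == (ldr == "4.5.2")
                         for (_, ldr), c in crashed.items())
    if follows_xen and not follows_loader:
        return "hypervisor"
    if follows_loader and not follows_xen:
        return "hvmloader"
    return "inconclusive"


# Placeholder outcome: the crash follows the hvmloader version.
example = {
    ("4.5.1", "4.5.1"): False,
    ("4.5.1", "4.5.2"): True,
    ("4.5.2", "4.5.1"): False,
    ("4.5.2", "4.5.2"): True,
}
print(blame(example))
```

If the crash shows up in some other pattern (for example only when both
components are 4.5.2), the sketch reports "inconclusive", which would
suggest an interaction between the two rather than a single culprit.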

Thanks,

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
