On Tue, Mar 10, 2015 at 9:16 PM, Dr. David Alan Gilbert
<dgilb...@redhat.com> wrote:
> * Andrey Korolyov (and...@xdel.ru) wrote:
>> On Tue, Mar 10, 2015 at 7:57 PM, Dr. David Alan Gilbert
>> <dgilb...@redhat.com> wrote:
>> > * Andrey Korolyov (and...@xdel.ru) wrote:
>> >> On Sat, Mar 7, 2015 at 3:00 AM, Andrey Korolyov <and...@xdel.ru> wrote:
>> >> > On Fri, Mar 6, 2015 at 7:57 PM, Bandan Das <b...@redhat.com> wrote:
>> >> >> Andrey Korolyov <and...@xdel.ru> writes:
>> >> >>
>> >> >>> On Fri, Mar 6, 2015 at 1:14 AM, Andrey Korolyov <and...@xdel.ru>
>> >> >>> wrote:
>> >> >>>> Hello,
>> >> >>>>
>> >> >>>> recently I've got a couple of shiny new Intel 2620v2s as a future
>> >> >>>> replacement for the E5-2620v1, but I am experiencing relatively many
>> >> >>>> events with emulation errors; all traces look similar to the one
>> >> >>>> below. I am running qemu-2.1 on x86 on top of the 3.10 branch for
>> >> >>>> testing purposes but can switch to some other version if necessary.
>> >> >>>> Most of the crashes happened during a reboot cycle or at the end of
>> >> >>>> an ACPI-based shutdown action, if this helps. I have zero clue what
>> >> >>>> can introduce such a mess inside the same processor family using
>> >> >>>> identical software, as the 2620v1 has never shown a similar problem.
>> >> >>>> Please let me know if there are some side measures that could make
>> >> >>>> the entire story clearer.
>> >> >>>>
>> >> >>>> Thanks!
>> >> >>>>
>> >> >>>> KVM internal error.
>> >> >>>> Suberror: 2
>> >> >>>> extra data[0]: 800000d1
>> >> >>>> extra data[1]: 80000b0d
>> >> >>>> EAX=00000003 EBX=00000000 ECX=00000000 EDX=00000000
>> >> >>>> ESI=00000000 EDI=00000000 EBP=00000000 ESP=00006cd4
>> >> >>>> EIP=0000d3f9 EFL=00010202 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
>> >> >>>> ES =0000 00000000 0000ffff 00009300
>> >> >>>> CS =f000 000f0000 0000ffff 00009b00
>> >> >>>> SS =0000 00000000 0000ffff 00009300
>> >> >>>> DS =0000 00000000 0000ffff 00009300
>> >> >>>> FS =0000 00000000 0000ffff 00009300
>> >> >>>> GS =0000 00000000 0000ffff 00009300
>> >> >>>> LDT=0000 00000000 0000ffff 00008200
>> >> >>>> TR =0000 00000000 0000ffff 00008b00
>> >> >>>> GDT= 000f6e98 00000037
>> >> >>>> IDT= 00000000 000003ff
>> >> >>>> CR0=00000010 CR2=00000000 CR3=00000000 CR4=00000000
>> >> >>>> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000
>> >> >>>> DR3=0000000000000000
>> >> >>>> DR6=00000000ffff0ff0 DR7=0000000000000400
>> >> >>>> EFER=0000000000000000
>> >> >>>> Code=48 18 67 8c 00 8c d1 8e d9 66 5a 66 58 66 5d 66 c3 cd 02 cb <cd>
>> >> >>>> 10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd 19 cb cd 1c cb fa fc 66
>> >> >>>> b8 00 e0 00 00 8e
>> >> >>>
>> >> >>> It turns out that those errors are introduced by APICv, which gets
>> >> >>> enabled due to the different feature set. If anyone is interested in
>> >> >>> reproducing/fixing this exactly on 3.10, it takes about one hundred
>> >> >>> migrations/power state changes for the issue to appear; the guest OS
>> >> >>> can be Linux or Win.
>> >> >>
>> >> >> Are you able to reproduce this on a more recent upstream kernel as
>> >> >> well?
>> >> >>
>> >> >> Bandan
>> >> >
>> >> > I'll go through a test cycle with 3.18 and a 2603v2 around tomorrow and
>> >> > follow up with any reproducible results.
>> >>
>> >> Heh.. the issue is not triggered on the 2603v2 at all, at least I am not
>> >> able to hit it. The only difference from the 2620v2, apart from the
>> >> lower frequency, is the Intel Dynamic Acceleration feature.
>> >> I'd appreciate any testing with higher CPU models with the same or a
>> >> richer feature set. The testing itself can be done on either a generic
>> >> 3.10 or a RH7 kernel, as both of them are experiencing this issue. I
>> >> conducted all tests with C-states disabled, so I advise doing the same
>> >> as a first reproduction step.
>> >>
>> >> Thanks!
>> >>
>> >> model name : Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz
>> >> stepping : 4
>> >> microcode : 0x416
>> >> cpu MHz : 2100.039
>> >> cache size : 15360 KB
>> >> siblings : 12
>> >> apicid : 43
>> >> initial apicid : 43
>> >> fpu : yes
>> >> fpu_exception : yes
>> >> cpuid level : 13
>> >> wp : yes
>> >> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
>> >> mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
>> >> syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts
>> >> rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq
>> >> dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca
>> >> sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c
>> >> rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi
>> >> flexpriority ept vpid fsgsbase smep erms
>> >
>> > I'm seeing something similar; it's very intermittent and generally
>> > happening right at boot of the guest; I'm running this on qemu
>> > head+my postcopy world (but it's happening right at boot before postcopy
>> > gets a chance), and I'm using a 3.19ish kernel. Xeon E5-2407 in my case
>> > but hey maybe I'm seeing a different bug.
>> >
>> > Dave
>>
>> Yep, looks like we are hitting the same bug - two thirds of my failure
>> events occurred during the boot/reboot cycle and approx. one third
>> happened in the middle of runtime. Which CPU, v0 or v2, are you using
>> (in other words, is APICv enabled)?
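For anyone who wants to try the reproduction recipe described above (roughly a hundred reboot/migration cycles), a trivial driver script is enough. The sketch below is purely illustrative and is NOT the harness used in this thread; the domain name `testvm`, the peer host `otherhost`, and the choice of virsh are all placeholder assumptions.

```python
# Illustrative reproduction driver, not the harness used in this thread.
# "testvm" and "otherhost" are placeholders; adjust to your setup.
import subprocess

def cycle_cmds(domain, peer):
    """Commands for one reboot plus a live-migration round trip."""
    return [
        ["virsh", "reboot", domain],
        ["virsh", "migrate", "--live", domain, "qemu+ssh://%s/system" % peer],
        ["virsh", "migrate", "--live", domain, "qemu+ssh://localhost/system"],
    ]

def reproduce(domain="testvm", peer="otherhost", rounds=100, dry_run=True):
    # ~100 rounds matches the reproduction estimate given above.
    for _ in range(rounds):
        for cmd in cycle_cmds(domain, peer):
            if dry_run:
                print(" ".join(cmd))
            else:
                subprocess.check_call(cmd)

reproduce(rounds=1)  # dry run: print one round of commands
```

As advised above, disable C-states first; whether APICv is actually in use can be checked on kernels that expose the `kvm_intel` module's `enable_apicv` parameter under /sys/module/kvm_intel/parameters/.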
>
> processor : 7
> vendor_id : GenuineIntel
> cpu family : 6
> model : 45
> model name : Intel(R) Xeon(R) CPU E5-2407 0 @ 2.20GHz
> stepping : 7
> microcode : 0x70d
> cpu MHz : 2200.000
> cache size : 10240 KB
> physical id : 1
> siblings : 4
> core id : 3
> cpu cores : 4
> apicid : 38
> initial apicid : 38
> fpu : yes
> fpu_exception : yes
> cpuid level : 13
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
> pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology
> nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx
> est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt
> tsc_deadline_timer aes xsave avx lahf_lm arat pln pts dtherm tpr_shadow vnmi
> flexpriority ept vpid xsaveopt
> bugs :
> bogomips : 4409.23
> clflush size : 64
> cache_alignment : 64
> address sizes : 46 bits physical, 48 bits virtual
> power management:
>
> It's really random as well; I had two within half an hour yesterday, and
> then it survived overnight with no change.
>
> KVM internal error.
> Suberror: 1
> emulation failure
> EAX=00000000 EBX=00000000 ECX=00000000 EDX=000fd2bc
> ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
> EIP=000fd2c5 EFL=00010007 [-----PC] CPL=0 II=0 A20=1 SMM=0 HLT=0
> ES =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
> CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
> SS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
> DS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
> FS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
> GS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
> LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
> TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
> GDT= 000f6a80 00000037
> IDT= 000f6abe 00000000
> CR0=60000011 CR2=00000000 CR3=00000000 CR4=00000000
> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000
> DR3=0000000000000000
> DR6=00000000ffff0ff0 DR7=0000000000000400
> EFER=0000000000000000
> Code=66 ba bc d2 0f 00 e9 a2 fe f3 90 f0 0f ba 2d 04 ff fb bf 00 <72> f3 8b
> 25 00 ff fb bf e8 44 66 ff ff c7 05 04 ff fb bf 00 00 00 00 f4 eb fd fa fc
> 66 b8
> KVM internal error. Suberror: 1
> emulation failure
>
> and
>
> 11:37:49 INFO | [qemu output] KVM internal error.
> Suberror: 1
> 11:37:49 INFO | [qemu output] emulation failure
> 11:37:49 INFO | [qemu output] EAX=00000000 EBX=00000000 ECX=00000000 EDX=000fd2bc
> 11:37:49 INFO | [qemu output] ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
> 11:37:49 INFO | [qemu output] EIP=000fd2bc EFL=00010007 [-----PC] CPL=0 II=0 A20=1 SMM=0 HLT=0
> 11:37:49 INFO | [qemu output] ES =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
> 11:37:49 INFO | [qemu output] CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
> 11:37:49 INFO | [qemu output] SS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
> 11:37:49 INFO | [qemu output] DS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
> 11:37:49 INFO | [qemu output] FS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
> 11:37:49 INFO | [qemu output] GS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
> 11:37:49 INFO | [qemu output] LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
> 11:37:49 INFO | [qemu output] TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
> 11:37:49 INFO | [qemu output] GDT= 000f6a80 00000037
> 11:37:49 INFO | [qemu output] IDT= 000f6abe 00000000
> 11:37:49 INFO | [qemu output] CR0=60000011 CR2=00000000 CR3=00000000 CR4=00000000
> 11:37:49 INFO | [qemu output] DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
> 11:37:49 INFO | [qemu output] DR6=00000000ffff0ff0 DR7=0000000000000400
> 11:37:49 INFO | [qemu output] EFER=0000000000000000
> 11:37:49 INFO | [qemu output] Code=0a 00 e8 a0 64 ff ff 0f aa 66 ba bc d2 0f 00 e9 a2 fe f3 90 <f0> 0f ba 2d 04 ff fb 3f 00 72 f3 8b 25 00 ff fb 3f e8 44 66 ff ff c7 05 04 ff fb 3f 00 00
>
> note the code in that second one is in the middle of the bios,
> but the code has a few bytes different from what an objdump gets,
> so I'm not quite sure if something is stamping on the bios or
> if that's separate.
>
> Dave
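On comparing those Code= bytes against an objdump of the BIOS: QEMU's dump appears to bracket the byte at the reported EIP with <..>, and pulling the bytes out by hand is error-prone. Here is a small convenience sketch; `parse_code_line` is my own hypothetical helper, not anything QEMU ships.

```python
# Hypothetical helper for extracting the raw bytes from a QEMU "Code="
# dump so they can be diffed against an objdump of the BIOS image.
# QEMU brackets one byte with <..>; the index of that byte is returned.
import re

def parse_code_line(line):
    """Return (bytes, index of the <>-marked byte) from a Code= line."""
    toks = re.findall(r"<?[0-9a-f]{2}>?", line.split("Code=", 1)[1])
    marked = next(i for i, t in enumerate(toks) if t.startswith("<"))
    return bytes(int(t.strip("<>"), 16) for t in toks), marked

line = ("Code=0a 00 e8 a0 64 ff ff 0f aa 66 ba bc d2 0f 00 e9 a2 fe f3 90 "
        "<f0> 0f ba 2d 04 ff fb 3f 00 72 f3 8b 25 00 ff fb 3f e8 44 66 ff "
        "ff c7 05 04 ff fb 3f 00 00")
data, marked = parse_code_line(line)
print(hex(data[marked]))  # prints "0xf0"
```

The resulting byte string can then be written to a file and disassembled with e.g. `ndisasm -b 32` or `objdump -D -b binary -m i386` for the side-by-side comparison.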
Thanks. AFAIU, suberror 1 and suberror 2 are completely different in nature, so this is a different bug. What is interesting is that you've got the same reproduction pattern as in my case; it may point to a single userspace issue triggering two independent KVM bugs...
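For reference, if I read the kernel's kvm.h right, suberror 1 is KVM_INTERNAL_ERROR_EMULATION and suberror 2 is KVM_INTERNAL_ERROR_SIMUL_EX. Assuming the two "extra data" words in the suberror-2 dump above are the raw IDT-vectoring and VM-exit interruption-information fields (the 32-bit layout from the Intel SDM; this is my assumption, the dump itself does not say), they can be decoded with a sketch like this:

```python
# A sketch, assuming the two "extra data" words in the suberror-2 dump
# are raw VMX interruption-information fields in the Intel SDM layout.
# Field names are my reading of the SDM, not anything KVM prints.

EVENT_TYPES = {
    0: "external interrupt",
    2: "NMI",
    3: "hardware exception",
    4: "software interrupt",
    5: "privileged software exception",
    6: "software exception",
}

def decode_intr_info(word):
    """Split a 32-bit VMX interruption-information word into fields."""
    return {
        "vector": word & 0xff,                       # bits 7:0
        "type": EVENT_TYPES.get((word >> 8) & 0x7, "reserved"),
        "error_code_valid": bool(word & (1 << 11)),  # bit 11
        "valid": bool(word & (1 << 31)),             # bit 31
    }

print(decode_intr_info(0x800000d1))  # extra data[0]: event being delivered
print(decode_intr_info(0x80000b0d))  # extra data[1]: exception raised meanwhile
```

Under that reading, the guest was being delivered external interrupt vector 0xd1 when a #GP (vector 13, error code valid) was raised during delivery, which would at least be consistent with a "simultaneous exceptions" error.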