On Tue, Dec 12, 2023 at 10:56:48AM +0000, Andrew Cooper wrote:
> On 12/12/2023 9:43 am, James Dingwall wrote:
> > Hi,
> >
> > We were experiencing a crash during PV domU boot on several different models
> > of hardware but all with Intel CPUs.  The Xen version was based on 
> > stable-4.15
> > at 4a4daf6bddbe8a741329df5cc8768f7dec664aed (XSA-444) with some local
> > patches.  Since updating the branch to 
> > b918c4cdc7ab2c1c9e9a9b54fa9d9c595913e028
> > (XSA-446) we have not observed the same crash.
> 
> That range covers:
> 
> 1f5f515da0f6 - iommu/amd-vi: use correct level for quarantine domain
> page tables
> b918c4cdc7ab - x86/spec-ctrl: Remove conditional IRQs-on-ness for INT
> $0x80/0x82 paths
> 
> so yeah - not much in the way of change.
> 
> > The occurrence was on 1-2% of boots and we couldn't determine a particular
> > sequence of events that would trigger it.  The kernel is based on Ubuntu's
> > 5.15.0-91 tag but we also observed the same with -85.  Due to the low
> > frequency it is possible that we simply haven't observed it again since
> > updating our Xen build.
> >
> > If I have followed the early startup this is happening shortly after 
> > detection
> > of possible CPU vulnerabilities and patching in alternative instructions.  
> > As
> > the RIP was native_irq_return_iret and XSA-446 related to interupt 
> > management
> > I wondered if it was possible that despite "Xen is not believed to be 
> > vulnerable
> > in default configurations on CPUs from other hardware vendors." there could
> > be some conditions in which an Intel CPU is affected?
> 
> In short, XSA-446 isn't plausibly related.  It's completely internal to
> Xen, with no alteration on guest state.
> 
> It is an error that Linux has ended up in native_irq_return_iret.  Linux
> cannot return to itself with an IRET instruction, and must use
> HYPERCALL_iret instead.
> 
> In recent versions of Linux, this is fixed up as about the earliest
> action a PV kernel takes, but on older versions of Linux, any
> interrupt/exception early enough on boot was fatal in this way.
> 
> 
> This part of the backtrace is odd:
> 
> [    0.398962]  ? native_iret+0x7/0x7
> [    0.398967]  ? insn_decode+0x79/0x100
> [    0.398975]  ? insn_decode+0xcf/0x100
> [    0.398980]  optimize_nops+0x68/0x150
> 
> as it's not clear how we've ended up in a case wanting to return back to
> the kernel to begin with.  However, it's most likely a pagefault, as
> optimize_nops() is making changes in arbitrary locations.
> 
> It is possible that a change in visible features has altered the
> behaviour enough not to crash, but if everything is still the same as
> far as you can tell, then it's likely just chance that you haven't seen
> it again.
> 
> This is definitely a Linux bug, so I suspect something bad has been
> backported into Ubuntu.
> 
> ~Andrew

Thanks for the response.  I had a look at the more recent kernels and managed
to backport "x86/entry,xen: Early rewrite of 
restore_regs_and_return_to_kernel()"
without too much trouble.  It may still be a coincidence that we haven't
encountered the problem but it seems to have gone away for now. 

Regards,
James

Reply via email to