On Fri, Jun 19, 2020 at 06:41:21PM +0200, Martin Lucina wrote:
> On 2020-06-19 13:21, Roger Pau Monné wrote:
> > On Fri, Jun 19, 2020 at 12:28:50PM +0200, Martin Lucina wrote:
> > > On 2020-06-18 13:46, Roger Pau Monné wrote:
> > > > On Thu, Jun 18, 2020 at 12:13:30PM +0200, Martin Lucina wrote:
> > > > > At this point I don't really have a clear idea of how to progress,
> > > > > comparing my implementation side-by-side with the original PV
> > > > > Mini-OS-based
> > > > > implementation doesn't show up any differences I can see.
> > > > >
> > > > > AFAICT the OCaml code I've also not changed in any material way, and
> > > > > that
> > > > > has been running in production on PV for years, so I'd be inclined
> > > > > to think
> > > > > the problem is in my reimplementation of the C parts, but where...?
> > > >
> > > > A good start would be to print the ISR and IRR lapic registers when
> > > > blocked, to assert there are no pending vectors there.
> > > >
> > > > Can you apply the following patch to your Xen, rebuild and check the
> > > > output of the 'l' debug key?
> > > >
> > > > Also add the output of the 'v' key.
> > >
> > > Had to fight the Xen Debian packages a bit as I wanted to patch the
> > > exact
> > > same Xen (there are some failures when building on a system that has
> > > Xen
> > > installed due to following symlinks when fixing shebangs).
> > >
> > > Here you go, when stuck during netfront setup, after allocating its
> > > event
> > > channel, presumably waiting on Xenstore:
> > >
> > > 'e':
> > >
> > > (XEN) Event channel information for domain 3:
> > > (XEN) Polling vCPUs: {}
> > > (XEN) port [p/m/s]
> > > (XEN) 1 [1/0/1]: s=3 n=0 x=0 d=0 p=33
> > > (XEN) 2 [1/1/1]: s=3 n=0 x=0 d=0 p=34
> > > (XEN) 3 [1/0/1]: s=5 n=0 x=0 v=0
> > > (XEN) 4 [0/1/1]: s=2 n=0 x=0 d=0
> > >
> > > 'l':
> > >
> > > (XEN) d3v0 IRR:
> > > ffff8301732dc200b
> > > (XEN) d3v0 ISR:
> > > ffff8301732dc100b
> >
> > Which version of Xen is this? AFAICT it doesn't have the support to
> > print a bitmap.
>
> That in Debian 10 (stable):
>
> ii xen-hypervisor-4.11-amd64 4.11.3+24-g14b62ab3e5-1~deb10u1.2
> amd64 Xen Hypervisor on AMD64
>
> xen_major : 4
> xen_minor : 11
> xen_extra : .4-pre
> xen_version : 4.11.4-pre
>
> >
> > Do you think you could also pick commit
> > 8cd9500958d818e3deabdd0d4164ea6fe1623d7c [0] and rebuild? (and print
> > the info again).
>
> Done, here you go:
>
> (XEN) Event channel information for domain 3:
> (XEN) Polling vCPUs: {}
> (XEN) port [p/m/s]
> (XEN) 1 [1/0/1]: s=3 n=0 x=0 d=0 p=33
> (XEN) 2 [1/1/1]: s=3 n=0 x=0 d=0 p=34
> (XEN) 3 [1/0/1]: s=5 n=0 x=0 v=0
> (XEN) 4 [0/1/1]: s=3 n=0 x=0 d=0 p=35
>
>
> (XEN) d3v0 IRR:
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
> (XEN) d3v0 ISR:
> 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000

So there's nothing pending on the lapic. Can you assert that you will
always execute evtchn_demux_pending after you have received an event
channel interrupt (i.e. after executing solo5__xen_evtchn_vector_handler)?

I think this would be simpler if you moved evtchn_demux_pending into
solo5__xen_evtchn_vector_handler, as there would be less asynchronous
processing, and thus likely fewer races.
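
To illustrate what I mean, here is a rough, self-contained sketch of
demuxing directly from the vector handler. It is not the Solo5 code: the
bitmap layout, NR_PORTS, sim_raise(), demux_pending() and vector_handler()
are all made up for the example, and a real implementation would have to
use atomic operations on the shared-info bitmaps.

```c
#include <assert.h>
#include <stdint.h>

#define NR_PORTS 64 /* hypothetical: a single 64-bit word of ports */

/* Stand-ins for the pending/mask bitmaps (hypothetical layout; the real
 * ones live in the shared info page and need atomic accesses). */
static uint64_t evtchn_pending;
static uint64_t evtchn_mask;
static unsigned handled[NR_PORTS];

/* Simulate Xen marking a port as pending. */
static void sim_raise(unsigned port)
{
    assert(port < NR_PORTS);
    evtchn_pending |= (uint64_t)1 << port;
}

/* Handle every pending, unmasked port; return how many were handled.
 * Masked ports are left pending so they can be picked up once unmasked. */
static unsigned demux_pending(void)
{
    unsigned n = 0;
    uint64_t ready = evtchn_pending & ~evtchn_mask;

    for (unsigned port = 0; port < NR_PORTS; port++) {
        uint64_t bit = (uint64_t)1 << port;

        if (ready & bit) {
            evtchn_pending &= ~bit; /* clear before handling */
            handled[port]++;
            n++;
        }
    }
    return n;
}

/* The vector handler demuxes inline instead of deferring the work to a
 * later point, so no event can be left waiting for a separate step. */
static unsigned vector_handler(void)
{
    return demux_pending();
}
```

With this shape, every interrupt that is delivered results in the pending
bits being scanned before the handler returns, so there is no window in
which the guest can block with an unprocessed event.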