On Thu, Dec 3, 2020 at 1:41 AM Roman Bolshakov <r.bolsha...@yadro.com> wrote: > > On Mon, Nov 30, 2020 at 04:00:11PM -0800, Peter Collingbourne wrote: > > On Mon, Nov 30, 2020 at 3:18 PM Alexander Graf <ag...@csgraf.de> wrote: > > > > > > > > > On 01.12.20 00:01, Peter Collingbourne wrote: > > > > On Mon, Nov 30, 2020 at 1:40 PM Alexander Graf <ag...@csgraf.de> wrote: > > > >> Hi Peter, > > > >> > > > >> On 30.11.20 22:08, Peter Collingbourne wrote: > > > >>> On Mon, Nov 30, 2020 at 12:56 PM Frank Yang <l...@google.com> wrote: > > > >>>> > > > >>>> On Mon, Nov 30, 2020 at 12:34 PM Alexander Graf <ag...@csgraf.de> > > > >>>> wrote: > > > >>>>> Hi Frank, > > > >>>>> > > > >>>>> Thanks for the update :). Your previous email nudged me into the > > > >>>>> right direction. I previously had implemented WFI through the > > > >>>>> internal timer framework which performed way worse. > > > >>>> Cool, glad it's helping. Also, Peter found out that the main thing > > > >>>> keeping us from just using cntpct_el0 on the host directly and > > > >>>> compare with cval is that if we sleep, cval is going to be much < > > > >>>> cntpct_el0 by the sleep time. If we can get either the architecture > > > >>>> or macos to read out the sleep time then we might be able to not > > > >>>> have to use a poll interval either! > > > >>>>> Along the way, I stumbled over a few issues though. For starters, > > > >>>>> the signal mask for SIG_IPI was not set correctly, so while > > > >>>>> pselect() would exit, the signal would never get delivered to the > > > >>>>> thread! For a fix, check out > > > >>>>> > > > >>>>> > > > >>>>> https://patchew.org/QEMU/20201130030723.78326-1-ag...@csgraf.de/20201130030723.78326-4-ag...@csgraf.de/ > > > >>>>> > > > >>>> Thanks, we'll take a look :) > > > >>>> > > > >>>>> Please also have a look at my latest stab at WFI emulation. It > > > >>>>> doesn't handle WFE (that's only relevant in overcommitted > > > >>>>> scenarios). But it does handle WFI and even does something similar > > > >>>>> to hlt polling, albeit not with an adaptive threshold. > > > >>> Sorry I'm not subscribed to qemu-devel (I'll subscribe in a bit) so > > > >>> I'll reply to your patch here. You have: > > > >>> > > > >>> + /* Set cpu->hvf->sleeping so that we get a > > > >>> SIG_IPI signal. */ > > > >>> + cpu->hvf->sleeping = true; > > > >>> + smp_mb(); > > > >>> + > > > >>> + /* Bail out if we received an IRQ meanwhile */ > > > >>> + if (cpu->thread_kicked || > > > >>> (cpu->interrupt_request & > > > >>> + (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) { > > > >>> + cpu->hvf->sleeping = false; > > > >>> + break; > > > >>> + } > > > >>> + > > > >>> + /* nanosleep returns on signal, so we wake up on > > > >>> kick. */ > > > >>> + nanosleep(ts, NULL); > > > >>> > > > >>> and then send the signal conditional on whether sleeping is true, but > > > >>> I think this is racy. If the signal is sent after sleeping is set to > > > >>> true but before entering nanosleep then I think it will be ignored and > > > >>> we will miss the wakeup. That's why in my implementation I block IPI > > > >>> on the CPU thread at startup and then use pselect to atomically > > > >>> unblock and begin sleeping. The signal is sent unconditionally so > > > >>> there's no need to worry about races between actually sleeping and the > > > >>> "we think we're sleeping" state. It may lead to an extra wakeup but > > > >>> that's better than missing it entirely. > > > >> > > > >> Thanks a bunch for the comment! So the trick I was using here is to > > > > >> > >> modify the timespec from the kick function before sending the IPI > > > >> signal. That way, we know that either we are inside the sleep (where > > > >> the > > > >> signal wakes it up) or we are outside the sleep (where timespec={} will > > > >> make it return immediately). > > > >> > > > >> The only race I can think of is if nanosleep does calculations based on > > > >> the timespec and we happen to send the signal right there and then. > > > > Yes that's the race I was thinking of. Admittedly it's a small window > > > > but it's theoretically possible and part of the reason why pselect was > > > > created. > > > > > > > >> The problem with blocking IPIs is basically what Frank was describing > > > >> earlier: How do you unset the IPI signal pending status? If the signal > > > >> is never delivered, how can pselect differentiate "signal from last > > > >> time > > > >> is still pending" from "new signal because I got an IPI"? > > > > In this case we would take the additional wakeup which should be > > > > harmless since we will take the WFx exit again and put us in the > > > > correct state. But that's a lot better than busy looping. > > > > > > > > > I'm not sure I follow. I'm thinking of the following scenario: > > > > > > - trap into WFI handler > > > - go to sleep with blocked SIG_IPI > > > - SIG_IPI arrives, pselect() exits > > > - signal is still pending because it's blocked > > > - enter guest > > > - trap into WFI handler > > > - run pselect(), but it immediate exits because SIG_IPI is still > > > pending > > > > > > This was the loop I was seeing when running with SIG_IPI blocked. That's > > > part of the reason why I switched to a different model. > > > > What I observe is that when returning from a pending signal pselect > > consumes the signal (which is also consistent with my understanding of > > what pselect does). That means that it doesn't matter if we take a > > second WFx exit because once we reach the pselect in the second WFx > > exit the signal will have been consumed by the pselect in the first > > exit and we will just wait for the next one. > > > > Aha! Thanks for the explanation. So, the first WFI in the series of > guest WFIs will likely wake up immediately? After a period without WFIs > there must be a pending SIG_IPI... > > It shouldn't be a critical issue though because (as defined in D1.16.2) > "the architecture permits a PE to leave the low-power state for any > reason, it is permissible for a PE to treat WFI as a NOP, but this is > not recommended for lowest power operation." > > BTW. I think a bit from the thread should go into the description of > patch 8, because it's not trivial and it would really be helpful to keep > in repo history. At least something like this (taken from an earlier > reply in the thread): > > In this implementation IPI is blocked on the CPU thread at startup and > pselect() is used to atomically unblock the signal and begin sleeping. > The signal is sent unconditionally so there's no need to worry about > races between actually sleeping and the "we think we're sleeping" > state. It may lead to an extra wakeup but that's better than missing > it entirely.
Okay, I'll add something like that to the next version of the patch I send out. Peter > > > Thanks, > Roman > > > I don't know why things may have been going wrong in your > > implementation but it may be related to the issue with > > mach_absolute_time() which I posted about separately and was also > > causing busy loops for us in some cases. Once that issue was fixed in > > our implementation we started seeing sleep until VTIMER due work > > properly. > > > > > > > > > > > > I reckon that you could improve things a little by unblocking the > > > > signal and then reblocking it before unlocking iothread (e.g. with a > > > > pselect with zero time interval), which would flush any pending > > > > signals. Since any such signal would correspond to a signal from last > > > > time (because we still have the iothread lock) we know that any future > > > > signals should correspond to new IPIs. > > > > > > > > > Yeah, I think you actually *have* to do exactly that, because otherwise > > > pselect() will always return after 0ns because the signal is still > > > pending. > > > > > > And yes, I agree that that starts to sound a bit less racy now. But it > > > means we can probably also just do > > > > > > - WFI handler > > > - block SIG_IPI > > > - set hvf->sleeping = true > > > - check for pending interrupts > > > - pselect() > > > - unblock SIG_IPI > > > > > > which means we run with SIG_IPI unmasked by default. I don't think the > > > number of signal mask changes is any different with that compared to > > > running with SIG_IPI always masked, right? > > > > P.S. Just found that Alex already raised my concern. Pending signals > have to be consumed or there should be no pending signals to start > sleeping on the very first WFI. > > > And unlock/lock iothread around the pselect? I suppose that could work > > but as I mentioned it would just be an optimization. > > > > Maybe I can try to make my approach work on top of your series, or if > > you already have a patch I can try to debug it. Let me know. > > > > Peter