On Mon, Feb 23, 2015 at 6:14 PM, Maciej W. Rozycki <ma...@linux-mips.org> wrote:
> On Mon, 23 Feb 2015, Andy Lutomirski wrote:
>
>> >> After a context switch, the instructions from the old task are no
>> >> longer in the pipeline.
>> >
>> > I'd say it's implementation-specific. As I mentioned the i486 aborted
>> > any transcendental x87 instruction in progress upon taking an exception or
>> > interrupt. That was a model like you refer to, but as I also mentioned it
>> > had its shortcomings.
>>
>> IRET is serializing, according to the docs (I think) and according
>> to the Intel engineers I asked (I'm absolutely certain about this
>> part). So FPU ops are entirely done at the end of a normal context
>> switch.
>
> No question about the serialising property of IRET, it has been like this
> since the original Pentium implementation. Do you have an architecture
> specification reference to back up your claim though as far as the FPU is
> concerned? I'm asking because I am genuinely curious.
>
> The x87 case is so special, there isn't anything there really that is
> externally observable or should be affected by IRET or any other
> synchronisation barriers apart from WAIT (or a waiting x87 instruction)
> that has been there for this purpose since forever. And it would defeat
> some documented benefits of running the FP pipeline in parallel.

It's plausible that this is special, but I doubt it. Especially since
this optimization would be nuts post-SSE2.

>
> And certainly such synchronisation didn't happen in the old days.
>
>> We also always save the FPU context on every context switch away from
>> a task that used the FPU, even in lazy mode. This is because we might
>> switch the task back in on a different CPU, and we don't want to use
>> an IPI to move the FPU context.
>
> That's an interesting case too, although not necessarily related. If you
> say that we always save the FP context eagerly for the purpose of process
> migration, then sure, that invalidates any benefit we'd have from letting
> the x87 proceed.
>
> However I can see different ways to address this case avoiding the need
> of eager FP context saving or an IPI:
>
> 1. We could bind any currently suspended process with an unsaved FP
> context to the CPU it last executed on.

This would be insane.

>
> 2. We could mark such a process for migration next time and let it execute
> on the CPU that holds its FP context once more, and then save the FP
> context eagerly on the way out.

This would be worse than insane. Now, in order to wake such a process
on a different CPU, we'd have to force a *context switch* on the
source CPU. Now we're replacing a few hundred cycles at worst for a
transcendental function with at least 10k cycles (at a guess) and
possibly many orders of magnitude more if locks are held, plus
priority issues, plus total craziness.

>
> In some cases a lazily retained FP context would be preempted by another
> process before the process in question would resume anyway. In this case
> any temporary binding to a CPU could be given up.
>
>> Given that we're only talking about old CPUs here, I sincerely doubt
>> that there's any relevant case in which an fxsave can usefully wait
>> for a long-running transcendental op to finish while we continue doing
>> useful work.
>> *Especially* since there will almost certainly be
>> several more mfences or atomic ops before the end of the context
>> switch, even if we're lucky enough to complete the context switching
>> using sysret.
>
> I am not sure what you mean by FXSAVE usefully waiting for an op, please
> elaborate. At the point you've reached FXSAVE and an earlier x87
> instruction hasn't completed, you've already lost. The pipeline will be
> stalled until the x87 instruction has completed and it can be hundreds of
> cycles. My point therefore has been about avoiding to execute FXSAVE for
> the old task until absolutely necessary, that with the lazy FP context
> switching would be at the next x87 (or SSE) instruction reached by the new
> task.
>
> Likewise I don't see why MFENCE or an atomic operation should affect the
> execution of say FSINCOS. Whether the results of FSINCOS arrive before
> or after MFENCE, etc. are not externally observable.

FSINCOS; FXSAVE; MFENCE had better serialize all the way, no matter
what weird architectural crud is going on.

>
> And I'm not sure if this all affects old CPUs only -- I don't know how
> much x87 software is out there, but after all these years I'd expect quite
> some. Sure, lots of this can be recompiled to use SSE instead, but not
> all, and even where it is feasible, that's an extra burden for people,
> beyond say a routine hardware or Linux distribution or for that matter
> lone kernel upgrade. Therefore I think we need to be careful not to
> pessimise things for a subset of people too much and ideally at all.
>
> And to be clear, I am not against removing lazy FP context switching per
> se. I am just emphasizing to be careful with that and be absolutely sure
> that it does not cause excessive harm.

We're talking about the unusual case in which we context switch within
~100 cycles of a legacy transcendental operation, and, even so,
there's *still* no regression, since we don't optimize this case
today.
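
To make the migration argument concrete, here is a toy model in C. It
is not kernel code, and every name in it is made up for illustration.
It only tracks where a task's FPU state lives and reports whether
waking the task on a given CPU would first require interrupting some
other CPU to get the state saved. The assumption baked in is the one
from the discussion above: state saved at switch-out can be picked up
from memory by any CPU, while state left in registers cannot.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model of FPU state ownership, not kernel code. */
struct task {
    bool used_fpu;     /* task touched x87/SSE while it was running */
    int  fpu_live_on;  /* CPU whose registers hold unsaved state, or -1 */
};

/* Eager save: state goes to memory at switch-out (FXSAVE in real life). */
void switch_out_eager(struct task *t, int cpu)
{
    (void)cpu;                 /* the CPU no longer matters once saved */
    if (t->used_fpu) {
        t->fpu_live_on = -1;   /* nothing left behind in registers */
        t->used_fpu = false;
    }
}

/* Retained context: leave the state in this CPU's registers. */
void switch_out_retain(struct task *t, int cpu)
{
    if (t->used_fpu) {
        t->fpu_live_on = cpu;  /* only this CPU can save it later */
        t->used_fpu = false;
    }
}

/*
 * Wake the task on dst_cpu. Returns 1 if the CPU holding the unsaved
 * state had to be interrupted (IPI or forced context switch) so the
 * state could be saved, 0 if the wakeup needed no remote work.
 */
int wake_on(struct task *t, int dst_cpu)
{
    assert(dst_cpu >= 0);
    if (t->fpu_live_on != -1 && t->fpu_live_on != dst_cpu) {
        t->fpu_live_on = -1;   /* source CPU saved the state for us */
        return 1;
    }
    return 0;
}
```

In this model a task switched out with switch_out_eager() can be woken
anywhere with wake_on() returning 0, while one switched out with
switch_out_retain() returns 1 the first time it is woken on any other
CPU. That 1 is the remote intervention whose cost is being compared
against the stall in FXSAVE.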

--Andy