On Tue, Apr 28, 2015 at 10:16:33AM -0700, Linus Torvalds wrote:
> I suspect it might be related to things like getting performance
> counters and instruction debug traps etc right. There are quite
> possibly also simply constraints where the front end has to generate
> *something* just to keep the back end happy.
>
> The front end can generally not just totally remove things without any
> tracking, since the front end doesn't know if things are speculative
> etc. So you can't do instruction debug traps in the front end afaik.
> Or rather, I'm sure you *could*, but in general I suspect the best way
> to handle nops without making them *too* special is to bunch up
> several to make them look like one big instruction, and then associate
> that bunch with some minimal tracking uop that uses minimal resources
> in the back end without losing sight of the original nop entirely, so
> that you can still do checks at retirement time.
Yeah, I was thinking about a simplified uop for tracking - makes most
sense ...

> So I think the "you can do ~5 nops per cycle" is not unreasonable.
> Even in the uop cache, the nops have to take some space, and have to
> do things like update eip, so I don't think they'll ever be entirely
> free, the best you can do is minimize their impact.

... exactly! So something needs to increment rIP so you either need to
special-handle that and remember by how many bytes to increment and
exactly *when* at retire time or simply use a barebones, simplified uop
which does that for you for free and flows down the pipe.

Yeah, that makes a lot of sense!

> Yeah. That looks somewhat reasonable. I think the 16h architecture
> technically decodes just two instructions per cycle,

Yeah, fetch 32B and look at two 16B for max 2 insns per cycle. I.e.,
two-way.

> but I wouldn't be surprised if there's some simple nop special casing
> going on so that it can decode three nops in one go when things line
> up right.

Right.

> So you might get 0.33 cycles for the best case, but then 0.5 cycles
> when it crosses a 16-byte boundary or something. So you might have
> some pattern where it decodes 32 bytes worth of nops as 12/8/12 bytes
> (3/2/3 instructions), which would come out to 0.38 cycles. Add some
> random overhead for the loop, and I could see the 0.39 cycles.
>
> That was wild handwaving with no data to back it up, but I'm trying
> to explain to myself why you could get some odd number like that. It
> seems _possible_ at least.

Yep, that makes sense. Now if only we had some numbers to back this up
with...

I'll play with this more.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--
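For reference, the arithmetic behind the quoted 0.33 / 0.5 / 0.38 figures can be reproduced with a small sketch. It assumes, as the handwaving in the quote does, that every decode group takes exactly one cycle and that the nops are 4 bytes each (12 bytes / 3 instructions). The function name `cycles_per_nop` is made up for illustration, and this models nothing about real hardware beyond that assumption.

```python
from fractions import Fraction


def cycles_per_nop(groups):
    """Average cycles per nop for a repeating decode pattern.

    `groups` lists how many nops are decoded in each cycle; the
    assumption is that every group costs exactly one cycle.
    """
    return Fraction(len(groups), sum(groups))


# Best case from the quote: three nops decoded in one go.
best = cycles_per_nop([3])

# Plain two-way decode: two nops per cycle.
two_way = cycles_per_nop([2])

# 32 bytes of 4-byte nops split as 12/8/12 bytes, i.e. 3/2/3 nops.
mixed = cycles_per_nop([3, 2, 3])

for label, value in (("3", best), ("2", two_way), ("3/2/3", mixed)):
    print(f"{label}: {float(value):.3f} cycles per nop")
```

The 3/2/3 pattern gives 3 cycles for 8 nops, i.e. 0.375, which rounds to the 0.38 in the quote. The remaining gap to the measured 0.39 would have to come from loop overhead, which this sketch does not model.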