On Wed, 9 Dec 2020, Paul Koning wrote:

> > This all sounds great.  Do you happen to know if it is cycle-accurate
> > with respect to individual hardware microarchitectures simulated?  That
> > would be required for performance evaluation of compiler-generated code.
>
> No, it isn't.  I believe it just charges one time unit per instruction,
> with the possible exception of CIS instructions.
Fair enough, from experience most CPU emulators are instruction-accurate
only.  Of all the generally available emulators I came across (and looked
into closely enough; maybe I missed something) only the ones for the Z80
were cycle-accurate, and I believe the MAME project has had cycle-accurate
emulation as well, both down to the system level and both out of
necessity, as the software they were written for was often unforgiving of
any discrepancy from the original hardware.

Commercially, MIPS Technologies used to have the cycle-accurate MIPSsim,
actually used for hardware verification and taking into account all the
implementation details, such as the TLB and the caches, of the individual
CPU cores supported.  And you could choose the topology of these resources
according to what actual silicon could have.  Some LV hardware has had it
too, for evaluation purposes:

YAMON> scpu
Current settings :
  I-Cache bytes per way = 0x1000
  I-Cache associativity = 4
  D-Cache bytes per way = 0x1000
  D-Cache associativity = 4
  MMU = tlb
YAMON> scpu -a
Available settings :
  I-Cache bytes per way : 0x1000, 0x0
  I-Cache associativity : 4, 3, 2, 1
  D-Cache bytes per way : 0x1000, 0x0
  D-Cache associativity : 4, 3, 2, 1
  MMU types : tlb, fixed
YAMON> scpu -i 0x1000 2
YAMON> scpu -d 0x1000 2
YAMON> scpu fixed
YAMON> scpu
Current settings :
  I-Cache bytes per way = 0x1000
  I-Cache associativity = 2
  D-Cache bytes per way = 0x1000
  D-Cache associativity = 2
  MMU = fixed
YAMON>

But then even the cycle-accurate MIPSsim would not take every parameter of
a system into account, such as the latency of peripheral components.  Not
sure about DRAM either, though being predictable I guess that might have
been simulated.

> I don't know of any cycle accurate PDP-11 emulators.  It's not even
> clear if it is possible to build one, given the asynchronous operation
> of the UNIBUS.  It certainly would be extremely difficult since even the
> documented timing is amazingly complex, never mind the possibility that
> the reality is different from what is documented.

For the purpose of compiler performance evaluation however I don't think
we need to go down as far as the external bus, so however the UNIBUS
performs should not really matter.  Even with modern systems all the
pipeline descriptions and operation timings we have recorded within GCC
reflect perfect operating conditions such as hot caches, no TLB misses and
no branch mispredictions, to say nothing of the disruption to all that
caused by hardware interrupts and context switches.

So I guess with cycle-accurate PDP-11 emulation it would be sufficient if
the relative CPU instruction execution timings were correctly reflected,
such as the latency of say MOV vs DIV, as I am fairly sure they are not
even close to being equivalent.  But that does come at a cost; the
cycle-accurate MIPSsim was much slower than its instruction-accurate
counterpart, which also existed.

> The pdp11 back end uses a very rough approximation of the documented
> 11/70 timing, but GCC doesn't make it easy (or maybe not even possible)
> to use the full timing details.  It's not something I'd expect to refine
> a whole lot further.

Understood.
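For reference, this is roughly the form such operation timings take in a
GCC machine description's DFA pipeline model; the sketch below uses
made-up automaton, unit and attribute names rather than anything taken
from the actual pdp11.md, and merely illustrates how the relative MOV vs
DIV latencies could be encoded:

;; Sketch only; all names here are hypothetical.
(define_automaton "pdp11")
(define_cpu_unit "pdp11_cpu" "pdp11")

;; Hypothetical "type" attribute used to classify instructions.
(define_attr "type" "mov,div,other" (const_string "other"))

;; Charge MOV a couple of cycles and DIV an order of magnitude more, so
;; that the scheduler at least sees the relative difference between them.
(define_insn_reservation "pdp11_mov" 2
  (eq_attr "type" "mov")
  "pdp11_cpu*2")

(define_insn_reservation "pdp11_div" 20
  (eq_attr "type" "div")
  "pdp11_cpu*20")

Since the latency and the unit reservation are fixed integers in each
define_insn_reservation, operand- and addressing-mode-dependent timings
such as those documented for the 11/70 can only be approximated by
multiplying such entries, which I suppose is part of the difficulty
mentioned.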
> More interesting would be to tweak the optimizing machinery to improve
> parts that either have bitrotted or never actually worked.  The code
> generation for auto-increment etc. isn't particularly effective and I
> think that's a known limitation.  Ditto indirect addressing, since few
> other machines have that.  (VAX does, of course; it might benefit too.)
>
> And with LRA things are more limited still, again this seems to be known
> and is caused by the focus on modern machine architectures.

Correctness absolutely has to take precedence over performance, but that
does not mean the latter has to be completely ignored either.  And the
availability of tools can only help with that.  We may not have the
resources that commercially significant ports have available, but that
does not mean we should decide upfront to abandon any kind of performance
QA.  I think we can still act professionally and try to do our best to
make the quality of the code produced as good as possible within the
resources available to us.

  FWIW,

  Maciej