On Wed, May 17, 2023 at 11:27:43AM +0200, John Paul Adrian Glaubitz wrote:
> Hi Michael!
>
> On Tue, 2023-05-16 at 20:25 +1200, Michael Cree wrote:
> > On Tue, May 16, 2023 at 09:38:56AM +0200, John Paul Adrian Glaubitz wrote:
> > > After a long discussion on IRC and the mailing list, we have agreed to
> > > raise the baseline for the alpha architecture to EV56 to improve the
> > > generated code and fix a number of issues. The change is already being
> > > implemented in the glibc packages which switches to EV56 [1] since
> > > hwcaps are no longer available with glibc 2.37 [2].
> > >
> > > Could you raise the baseline for gcc on alpha to EV56?
> > >
> > > I assume, it should be "--with-cpu=ev56" or "--with-arch=ev56".
> >
> > Yes, please!
> >
> > I suggest the following in debian/rules2:
> >
> > ifneq (,$(findstring alpha,$(DEB_TARGET_ARCH)))
> >   CONFARGS += --with-cpu=ev56 --with-tune=ev6
> > endif
> >
> > (the --with-tune only affects instruction scheduling and better tunes
> > code for ev6 and more recent machines, but allows execution down to
> > ev56.) I have tested this in the past with a rebuild of most packages
> > that are in the base essential chroot in the past and it works well.
>
> Doesn't that come with a speed penalty for EV56 machines? I'm asking
> because EV56 is currently the baseline for QEMU when emulating Alpha.

I was under the impression that qemu was ev6/ev67, the machine type
being clipper, which emulates an ES40. Am I mistaken?

With regard to instruction scheduling, EV56 is in-order, executing two
instructions [1] per cycle. EV6 and EV67 are out-of-order [2],
executing four instructions per cycle. Hence, for ev6/ev67 it can be
advantageous to bring forward instructions that are data (operand)
ready and to delay by four cpu-instructions those that depend on the
result of a previous instruction, instead of placing them immediately
after that instruction, to guarantee they don't waste an instruction
slot in the same cpu cycle. [3]

The deleterious impact on ev56 of doing this will be very small to
utterly negligible. It is not worth worrying about.

Regards,
Michael.

[1] Here I am talking about most integer register/register operate
instructions. Memory, integer multiply and floating-point instructions
have longer latencies.

[2] Note that out-of-order does not mean the cpu can bring forward
data-ready instructions that have not yet been seen in the instruction
pipeline. That is why we ask the compiler to place them earlier.

[3] Even more advantageous on ev6/ev67 is to loop unroll and evaluate
two iterations of the loop in parallel, i.e., intertwine the two
computational pathways. When I did tests some time ago with gcc (4.6
and earlier versions) the compiler did not do this well, whereas my
manually optimised machine code was getting better than three
instructions executed per cpu cycle on certain code.
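
For illustration, here is a minimal C sketch of the kind of unrolling
described in [3]. The summation example and the function names
sum_serial and sum_interleaved are hypothetical, chosen only to show
the idea. They are not taken from glibc, gcc or any Debian package.
The sketch shows only the source-level shape of two intertwined
pathways. How a particular gcc version schedules it, and what
throughput it reaches on ev6/ev67, is not something it demonstrates.

```c
#include <stddef.h>
#include <stdint.h>

/* Straightforward loop: each add needs the result of the previous add,
   so the additions form a single serial dependency chain. */
int64_t sum_serial(const int32_t *a, size_t n)
{
    int64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by two with separate accumulators: the adds into s0 and s1
   do not depend on each other, so the two chains can be intertwined. */
int64_t sum_interleaved(const int32_t *a, size_t n)
{
    int64_t s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];      /* pathway 0: even-indexed elements */
        s1 += a[i + 1];  /* pathway 1: odd-indexed elements */
    }
    if (i < n)           /* odd element count: pick up the last element */
        s0 += a[i];
    return s0 + s1;
}
```

Both functions return the same value for integer input. With
floating-point data the two-accumulator form changes the order of the
additions and so may round differently.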