Hi!

On Sat, Jul 05, 2025 at 07:33:32PM +0100, David Laight wrote:
> On Thu, 26 Jun 2025 17:01:48 -0500
> Segher Boessenkool <seg...@kernel.crashing.org> wrote:
> > On Thu, Jun 26, 2025 at 07:56:10AM +0200, Christophe Leroy wrote:
> ...
> > I have no idea why you think power9 has it while older CPUS do not.  In
> > the GCC source code we have this comment:
> >   /* For ISA 2.06, don't add ISEL, since in general it isn't a win, but
> >      altivec is a win so enable it.  */
> > and in fact we do not enable it for ISA 2.06 (p8) either, probably for

2.07 I meant of course.  Sigh.

> > a similar reason.
>
> Odd, I'd have thought that replacing a conditional branch with a
> conditional move would pretty much always be a win.
> Unless, of course, you only consider benchmark loops where the
> branch predictor in 100% accurate.

The isel machine instruction is super expensive on p8: it is marked as
first in an instruction group, and has latency 5 for the GPR sources,
and 8 for the CR field source.

On p7 it wasn't great either: it was actually converted to a branch
sequence internally!

On p8 there are bc+8 optimisations done by the core as well: conditional
branches that skip one insn are faster than equivalent isel insns!

Since p9 it is a lot better :-)

> OTOH isn't altivec 'simd' instructions?

AltiVec is the old Motorola marketing name for what is called the
"Vector Facility" in the architecture, and which at IBM is still called
VMX, the name it was developed under ("Vector Multimedia Extension").

Since p7 (ISA 2.06, 2010) there also is the Vector-Scalar Extension
Facility, VSX, which adds another 32 vector registers.  The traditional
floating point registers are physically the same registers as those
(but they use only the first half of each vector reg).  Many new VSX
instructions can do simple floating point stuff on all 64 VSX registers,
either just on the first lane ("scalar") or on all lanes ("vector").

This does largely mean that all floating point is now stored in IEEE DP
format internally.  On older cores some internal format close to 70 bits
wide was usually used, which in olden times actually made it possible to
make the cores faster.  Only when a value was stored to memory was it
actually converted to IEEE format (but of course it was always rounded
correctly, etc.)

> They pretty much only help for loops with lots of iterations.
> I don't know about ppc, but I've seen gcc make a real 'pigs breakfast'
> of loop vectorisation on x86.

For PowerPC (or Power, the more modern name) of course we also have our
fair share of problems with vectorisation.  It does help that we were
the first architecture used by GCC that had a serious Vector thing: the
C syntax extension for Vector literals is taken from the old extensions
in the AltiVec PIM, but using curly brackets {} instead of round
brackets (), for example.

> For the linux kernel (which as Linus keeps reminding people) tends
> to run 'cold cache', you probably want conditional moves in order
> to avoid mis-predicted branches and non-linear execution, but
> don't want loop vectorisation because the setup and end cases
> cost too much compared to the gain for each iteration.

You are best off using what GCC gives you, usually.  It is very well
tuned, both the generic and the machine-specific code :-)

The kernel of course disables all Vector and FP stuff: essentially it
disables use of any of the associated registers, and that's pretty much
the end of it ;-)

(The reason for that is that using those registers in the kernel would
make task switches more expensive: long ago all task switches, but
nowadays still user<->kernel switches).


Segher
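
A footnote on the Vector literal syntax mentioned above.  This is only a
minimal sketch, assuming GCC (or Clang) and its generic vector extension
rather than the AltiVec-specific "vector" keyword, so it should build on
targets other than Power as well.  The names v4si, add4, sum_lanes and
example are made up for illustration.

```c
#include <assert.h>

/* Generic GCC vector extension: four 32-bit ints packed in 16 bytes.
   The typedef name v4si is illustrative, not anything the kernel or
   the AltiVec PIM defines. */
typedef int v4si __attribute__ ((vector_size (16)));

/* Element-wise add.  The compiler uses vector insns where the target
   has them enabled, and falls back to scalar code otherwise. */
v4si add4 (v4si a, v4si b)
{
  return a + b;
}

/* Sum the four lanes; GCC allows subscripting on vector types. */
int sum_lanes (v4si v)
{
  return v[0] + v[1] + v[2] + v[3];
}

/* Hypothetical helper that ties the pieces together. */
int example (void)
{
  /* Curly-bracket Vector literals, as in a normal C initialiser.
     The old AltiVec PIM spelling used round brackets instead. */
  v4si a = { 1, 2, 3, 4 };
  v4si b = { 10, 20, 30, 40 };

  assert (sizeof (v4si) == 16);
  return sum_lanes (add4 (a, b));
}
```

Calling example () adds the two literals lane by lane and sums the
result.  None of this applies to kernel code, where the Vector
registers are off limits as described above.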