Hi,

In a previous post I pointed to some strange code generation by gcc for the riscv-64 target. To summarize: suppose a 64-bit operation c = a OP b;. Instead of loading 64 bits from memory in a single instruction, gcc loads 8 separate bytes into 8 separate registers, for each operand. Then it shifts and ORs the 8 bytes into a single 64-bit number. Then it executes the 64-bit operation. And lastly, it splits the 64-bit result back into 8 bytes in 8 different registers, and stores those 8 bytes one after the other.
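To make the comparison concrete, here is a minimal C sketch of the two load sequences. This is my reconstruction of the pattern described above, not actual gcc output; the function names are mine, and it assumes a little-endian target (riscv64 is little-endian):

    #include <stdint.h>
    #include <string.h>

    /* Mirrors the byte-wise sequence gcc emits: 8 one-byte loads
       into 8 registers, then shifts and ORs to reassemble one
       64-bit value. */
    static uint64_t load_byte_by_byte(const unsigned char *p)
    {
        uint64_t b0 = p[0], b1 = p[1], b2 = p[2], b3 = p[3];
        uint64_t b4 = p[4], b5 = p[5], b6 = p[6], b7 = p[7];
        return  b0        | (b1 << 8)  | (b2 << 16) | (b3 << 24)
             | (b4 << 32) | (b5 << 40) | (b6 << 48) | (b7 << 56);
    }

    /* The obvious alternative: one 64-bit load (a single ld
       instruction on riscv64; memcpy avoids alignment issues). */
    static uint64_t load_direct(const unsigned char *p)
    {
        uint64_t v;
        memcpy(&v, p, sizeof v);
        return v;
    }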
When I saw this I was impressed that such utterly bloated code ran faster than a hastily written assembly program I did in 10 minutes. Obviously I hadn't taken any pipeline turbulence into account, and my program was slower. When I did take pipeline turbulence into account, I managed to write a program that runs several times faster than the bloated code.

For the example above, instead of

1) Load 64 bits into a register for each operand (2 loads)
2) Do the operation
3) Store the result

we have 2 loads, 1 operation, and 1 store: 4 instructions, compared to 46 operations for the « gcc way » (16 loads of a byte each, 14 x 2 shift/OR operations to assemble the operands, 8 shifts to split the result, and 8 stores of a byte each).

I think this is a BUG, but I'm still not convinced that it is one, and I do not have a clue WHY gcc does this. Is there anyone here working on the riscv backend? This happens only with -O3, by the way.

Sample code (QELT is assumed here to be a 64-bit unsigned word):

#define ACCUM_LENGTH 9
#define WORDSIZE 64

typedef unsigned long long QELT;   /* assumed: unsigned, so the shifts
                                      below are well defined */

typedef struct {
    int sign, exponent;
    QELT mantissa[ACCUM_LENGTH];
} QfloatAccum, *QfloatAccump;

/* Shift the whole multi-word mantissa left by one bit.
   mantissa[0] is the most significant word; the carry out of
   each word is ORed into the next more significant one. */
void shup1(QfloatAccump x)
{
    QELT newbits, bits;
    int i;

    /* top index is ACCUM_LENGTH-1 for an ACCUM_LENGTH-element array */
    bits = x->mantissa[ACCUM_LENGTH-1] >> (WORDSIZE-1);
    x->mantissa[ACCUM_LENGTH-1] <<= 1;
    for (i = ACCUM_LENGTH-2; i > 0; i--) {
        newbits = x->mantissa[i] >> (WORDSIZE-1);
        x->mantissa[i] <<= 1;
        x->mantissa[i] |= bits;   /* carry in from the word below */
        bits = newbits;
    }
    x->mantissa[0] <<= 1;
    x->mantissa[0] |= bits;
}

Please point me to the right person. Thanks