Hi,

Looking at the code generated by the RISC-V backend, consider this C source code:

    void shup1(QfloatAccump x)
    {
        QELT newbits, bits;
        int i;

        /* Save the top bit of the lowest word, then shift it. */
        bits = x->mantissa[9] >> 63;
        x->mantissa[9] <<= 1;
        for (i = 8; i > 0; i--) {
            newbits = x->mantissa[i] >> 63;   /* bit carried out of this word */
            x->mantissa[i] <<= 1;
            x->mantissa[i] |= bits;           /* bit carried in from the previous word */
            bits = newbits;
        }
        x->mantissa[0] <<= 1;
        x->mantissa[0] |= bits;
    }

This code shifts a $64 \times 10 = 640$-bit number left by 1 position. The algorithm is simple: for each 64-bit word, save the highest bit, do the shift, and OR the bit saved from the previous word into the least significant position.

When compiling with gcc, the generated code looks extremely weird. Instead of loading a 64-bit number into a register, doing the operation, and then storing the result into memory, gcc does the following:

1) Loads the 64-bit number byte by byte into 8 different registers. Each 64-bit register contains only one byte.
2) ORs the 8 registers together into a 64-bit number.
3) Does the 64-bit operation.
4) Splits the result into 8 different registers.
5) Stores the 8 bytes one by one.

Obviously, I thought that this was a serious bug in gcc. I was going to write that bug report, but I had the reflex of first rewriting the function in reasonable assembly, like this:

1) Load the ten 64-bit words into 10 different registers.
2) Do the operations.
3) Store 64 bits at a time.

The results are /catastrophic/. Instead of increasing performance, there is a slowdown of several times compared to the code gcc generates.

Now, my question is: where did you get this information from? I can't believe that you arrived at that weird way of doing things by "trial and error". There must be some document that pointed you to the right solution. Can you share that information with the public?

Thanks in advance.

Jacob

sipeed@lpi4a:~/lcc/qlibriscv$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/riscv64-linux-gnu/13/lto-wrapper
Target: riscv64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 13.2.0-4revyos1' --with-bugurl=file:///usr/share/doc/gcc-13/README.Bugs --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,m2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-13 --program-prefix=riscv64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/libexec --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libitm --disable-libquadmath --disable-libquadmath-support --enable-plugin --enable-default-pie --with-system-zlib --enable-libphobos-checking=release --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --disable-multilib --with-arch=rv64gc --with-abi=lp64d --enable-checking=release --build=riscv64-linux-gnu --host=riscv64-linux-gnu --target=riscv64-linux-gnu --with-build-config=bootstrap-lto-lean --enable-link-serialization=16
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 13.2.0 (Debian 13.2.0-4revyos1)
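
P.S. To make the five steps listed above concrete, here is a rough C equivalent of what the generated code appears to do for each 64-bit word. This is only a sketch written from reading the assembly: the helper names (load64_bytewise, store64_bytewise, shl1_bytewise) are hypothetical and not anything gcc emits, and the byte order assumes a little-endian target, as on this board.

```c
#include <stdint.h>

/* Steps 1 and 2: load the word one byte at a time and OR the bytes
   together. Assumes little-endian byte order (hypothetical helper). */
static uint64_t load64_bytewise(const unsigned char *p)
{
    uint64_t v = 0;
    for (int k = 0; k < 8; k++)
        v |= (uint64_t)p[k] << (8 * k);
    return v;
}

/* Steps 4 and 5: split the word into 8 bytes and store each one
   separately (hypothetical helper). */
static void store64_bytewise(unsigned char *p, uint64_t v)
{
    for (int k = 0; k < 8; k++)
        p[k] = (unsigned char)(v >> (8 * k));
}

/* Step 3 wrapped in the byte-wise load and store: shift one word left
   by one, OR in the bit carried from the previous word, and return the
   bit carried out of this word. */
static uint64_t shl1_bytewise(unsigned char *p, uint64_t carry_in)
{
    uint64_t v = load64_bytewise(p);
    uint64_t carry_out = v >> 63;
    store64_bytewise(p, (v << 1) | carry_in);
    return carry_out;
}
```

Calling shl1_bytewise on each of the ten words in turn, from mantissa[9] down to mantissa[0], and passing each returned carry into the next call, gives the same result as shup1.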