Hi
I have been looking at the code generated by the RISC-V backend.

Consider this C source code:

void shup1(QfloatAccump x)
{
        QELT newbits,bits;
        int i;
        bits = x->mantissa[9] >> 63;
        x->mantissa[9] <<= 1;
        for( i=8; i>0; i-- ) {
                newbits = x->mantissa[i] >> 63;
                x->mantissa[i] <<= 1;
                x->mantissa[i] |= bits;
                bits = newbits;
        }    
        x->mantissa[0] <<= 1;
        x->mantissa[0] |= bits;
}

This code shifts a 640-bit number (10 words of 64 bits each) left by 1
position. The algorithm is simple: for each word, starting at the least
significant one (mantissa[9]), save the highest bit, do the shift, and OR the
bit saved from the previous word into the least significant position.
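
For readers who want to try this, here is a self-contained sketch of the same
shift. The post does not show the definitions of QELT and QfloatAccump, so the
typedefs below are assumptions (a 64-bit unsigned word and a struct holding ten
of them), and shup1_check is a hypothetical helper added only to illustrate the
carry from one word to the next.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed definitions: the original post does not show them. */
typedef uint64_t QELT;
typedef struct {
    QELT mantissa[10];   /* mantissa[0] is the most significant word */
} QfloatAccum, *QfloatAccump;

/* Shift the 640-bit mantissa left by one bit, one 64-bit word at a time. */
void shup1(QfloatAccump x)
{
    QELT carry = 0;                               /* bit coming from the lower word */
    for (int i = 9; i >= 0; i--) {
        QELT next = x->mantissa[i] >> 63;         /* save the highest bit */
        x->mantissa[i] = (x->mantissa[i] << 1) | carry;
        carry = next;
    }
}

/* Hypothetical self-check: the top bit of word 9 must land in bit 0 of word 8. */
void shup1_check(void)
{
    QfloatAccum a = { .mantissa = { 0 } };
    a.mantissa[9] = 0x8000000000000001ULL;
    shup1(&a);
    assert(a.mantissa[9] == 2);
    assert(a.mantissa[8] == 1);
}
```

The single loop folds the special handling of mantissa[9] and mantissa[0] into
the general case by starting with a carry of zero; the behavior should be the
same as the function above.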

When compiling with gcc, the generated code looks extremely weird. Instead of
loading a 64-bit number into some register, doing the operation, then storing
the result into memory, gcc does the following:

        1) Load the 64-bit number byte by byte into 8 different registers, so
that each 64-bit register contains only one byte.
        2) OR the 8 registers together into a 64-bit number.
        3) Do the 64-bit operation.
        4) Split the result into 8 different registers.
        5) Store the 8 bytes one by one.
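
To make the five steps concrete, here they are written out in C. This is only
an illustration of the sequence described above, not gcc's actual output; the
function names are hypothetical, and little-endian byte order is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Steps 1-2: load eight bytes one at a time and OR them into one 64-bit
   value (little-endian byte order assumed). */
static uint64_t load64_bytewise(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v |= (uint64_t)p[i] << (8 * i);
    return v;
}

/* Steps 4-5: split the 64-bit value into bytes and store them one by one. */
static void store64_bytewise(unsigned char *p, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        p[i] = (unsigned char)(v >> (8 * i));
}

/* Step 3 sits in between: shift one word left by one bit and return the
   bit that was shifted out. */
uint64_t shl1_bytewise(unsigned char *p)
{
    uint64_t v = load64_bytewise(p);
    uint64_t carry = v >> 63;
    store64_bytewise(p, v << 1);
    return carry;
}

/* Hypothetical self-check on a single word. */
void bytewise_check(void)
{
    unsigned char buf[8] = { 0x01, 0, 0, 0, 0, 0, 0, 0x80 };
    assert(shl1_bytewise(buf) == 1);
    assert(load64_bytewise(buf) == 2);
}
```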

My first thought was that this is a serious bug in gcc. I was about to write
a bug report, but first I rewrote the function in straightforward assembly,
like this:

        1) Load the ten 64-bit words into 10 different registers.
        2) Do the operations.
        3) Store 64 bits at a time.

The results are /catastrophic/. Instead of increasing performance, my version
is several times slower than the code generated by gcc.

Now, my question is:
Where did you get this information from? I can’t believe that you arrived at
that weird way of doing things by « trial and error ». There must be some
document that pointed you to the right solution. Can you share that
information with the public?

Thanks in advance.

Jacob


sipeed@lpi4a:~/lcc/qlibriscv$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/riscv64-linux-gnu/13/lto-wrapper
Target: riscv64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 13.2.0-4revyos1' 
--with-bugurl=file:///usr/share/doc/gcc-13/README.Bugs 
--enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,m2 --prefix=/usr 
--with-gcc-major-version-only --program-suffix=-13 
--program-prefix=riscv64-linux-gnu- --enable-shared --enable-linker-build-id 
--libexecdir=/usr/libexec --without-included-gettext --enable-threads=posix 
--libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug 
--enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new 
--enable-gnu-unique-object --disable-libitm --disable-libquadmath 
--disable-libquadmath-support --enable-plugin --enable-default-pie 
--with-system-zlib --enable-libphobos-checking=release 
--with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch 
--disable-werror --disable-multilib --with-arch=rv64gc --with-abi=lp64d 
--enable-checking=release --build=riscv64-linux-gnu --host=riscv64-linux-gnu 
--target=riscv64-linux-gnu --with-build-config=bootstrap-lto-lean 
--enable-link-serialization=16
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 13.2.0 (Debian 13.2.0-4revyos1) 
