Some people over at OpenBLAS were asking me whether I knew of a
whitepaper on gcc asm.  I didn't, apart from the gcc manual, so I
wrote a note explaining some tricks.  This patch is that note cleaned
up.  Tested by an x86_64-linux build.  OK to apply?

BTW, anyone wandering over to look at OpenBLAS might notice that this
example doesn't match the file exactly.  Yes, writing this doco made
me realize I need to submit a patch there.
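
For anyone without a PowerPC box, the operand tricks described in the
patch can be tried out with a small x86_64 sketch along these lines.
It is not taken from OpenBLAS: sum_longs, cur and tmp are made-up
names, and the array cast in the "m" input is a choice made for this
sketch so that the operand covers all n elements rather than just
p[0].  Other targets fall back to a plain C loop.

```c
#include <stddef.h>

/* Hypothetical helper, not from OpenBLAS: add up the N longs at P.  */
static long
sum_longs (const long *p, size_t n)
{
  long total = 0;

  if (n == 0)
    return 0;

#if defined (__GNUC__) && defined (__x86_64__)
  const long *cur;  /* Walking pointer; starts out as P via the tied "2".  */
  long tmp;         /* Scratch register chosen by the compiler.  */

  __asm__
    (
     "1:\n\t"
     "mov (%[cur]), %[tmp]\n\t"   /* tmp = *cur */
     "add %[tmp], %[total]\n\t"   /* total += tmp */
     "add $8, %[cur]\n\t"         /* cur++ (8 bytes per long) */
     "dec %[n]\n\t"
     "jnz 1b"
     :
       [total] "+&r" (total),   // 0, read and written inside the loop
       [n] "+&r" (n),           // 1, read and written inside the loop
       [cur] "=r" (cur),        // 2, output tied to an input below
       [tmp] "=&r" (tmp)        // 3, early-clobber temporary
     :
       "2" (p),                         // initialises cur
       "m" (*(const long (*)[n]) p)     // the asm reads p[0] .. p[n-1]
     :
       "cc"
     );
#else
  /* Portable fallback so the sketch still builds on other targets.  */
  for (size_t i = 0; i < n; i++)
    total += p[i];
#endif

  return total;
}
```

Calling sum_longs on the four-element array {1, 2, 3, 4} returns 10.
Because of the "m" input, gcc knows the array contents must be in
memory before the asm runs, even though there is no "memory" clobber.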

        * doc/extend.texi (Extended Asm): Add OpenBLAS example.

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 594b32a..05c6892 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,7 @@
+2017-03-31  Alan Modra  <amo...@gmail.com>
+
+       * doc/extend.texi (Extended Asm): Add OpenBLAS example.
+
 2017-03-31  Matthew Fortune  <matthew.fort...@imgtec.com>
 
        * config/mips/mips-msa.md (msa_vec_extract_<msafmt_f>): Update
diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
index fadbc96..991a2f6 100644
--- a/gcc/doc/extend.texi
+++ b/gcc/doc/extend.texi
@@ -8516,6 +8516,84 @@ asm ("cmoveq %1, %2, %[result]"
    : "r" (test), "r" (new), "[result]" (old));
 @end example
 
+Here is a larger PowerPC example taken from OpenBLAS.  Its more than
+150 lines of assembly have been removed, apart from comments added to
+check GCC's register assignments, because the assembly itself isn't
+that important.  You do need to know that all of the function
+parameters are inputs except for the @code{y} array, which is modified
+by the function, and that early assembly sets up four pointers into
+the @code{ap} array: @code{a0=ap}, @code{a1=ap+lda},
+@code{a2=ap+2*lda}, and @code{a3=ap+3*lda}.
+
+Illustrated here is a technique you can use to have GCC allocate
+temporary registers for an asm, which gives the compiler more freedom
+than if the programmer reserved fixed registers via clobbers.  This is
+done by declaring a variable and making it an early-clobber asm output,
+as with @code{a2} and @code{a3}, or making it an output tied to an
+input, as with @code{a0} and @code{a1}.  The VSX registers used by the
+asm could have used the same technique were it not for GCC's limit on
+the number of asm operands.  Given the description above, it
+shouldn't be surprising that @code{a0} is tied to @code{ap}; and
+@code{lda} is only used early, so its register is available for reuse
+as @code{a1}.  Tying an input to an output is the way to set up an
+initialised temporary register that is modified by an asm.  The
+example also shows an initialised register unchanged by the asm;
+@code{"b" (16)} sets up @code{%11} to 16.
+
+Also shown is a somewhat better method than using a @code{"memory"}
+clobber to tell GCC that an asm accesses or modifies memory.  Here we
+use @code{"+m" (*y)} in the list of outputs to tell GCC that the
+@code{y} array is both read and written by the asm.  @code{"m" (*x)}
+and @code{"m" (*ap)} in the inputs tell GCC that these arrays are
+read.  At a minimum, aliasing rules will allow GCC to know what memory
+@emph{doesn't} need to be flushed, and if the function were inlined
+then GCC may be able to do even better.  Notice that @code{x},
+@code{y}, and @code{ap} all appear twice in the asm operands, once
+to specify memory accessed, and once to specify a base register used
+by the asm.  You won't normally be wasting a register by doing this as
+GCC can use the same register for both purposes.  However, it would be
+foolish to use both @code{%0} and @code{%2} for @code{y} in your asm
+and expect them to be the same.
+
+@example
+static void
+dgemv_kernel_4x4 (long n, const double *ap, long lda,
+                  const double *x, double *y, double alpha)
+@{
+  double *a0;
+  double *a1;
+  double *a2;
+  double *a3;
+
+  __asm__
+    (
+     ...
+     "#n=%1 ap=%8=%12 lda=%13 x=%7=%10 y=%0=%2 alpha=%9 o16=%11\n"
+     "#a0=%3 a1=%4 a2=%5 a3=%6"
+     :
+       "+m" (*y),
+       "+&r" (n),      // 1
+       "+b" (y),       // 2
+       "=b" (a0),      // 3
+       "=b" (a1),      // 4
+       "=&b" (a2),     // 5
+       "=&b" (a3)      // 6
+     :
+       "m" (*x),
+       "m" (*ap),
+       "d" (alpha),    // 9
+       "r" (x),                // 10
+       "b" (16),       // 11
+       "3" (ap),       // 12
+       "4" (lda)       // 13
+     :
+       "cr0",
+       "vs32","vs33","vs34","vs35","vs36","vs37",
+       "vs40","vs41","vs42","vs43","vs44","vs45","vs46","vs47"
+     );
+@}
+@end example
+
 @anchor{Clobbers}
 @subsubsection Clobbers
 @cindex @code{asm} clobbers


-- 
Alan Modra
Australia Development Lab, IBM
