Trevor Saunders <tbsau...@tbsaunde.org> writes:
> On Thu, Aug 06, 2015 at 08:36:36PM +0100, Richard Sandiford wrote:
>> An integrated assembler or tighter asm output would be nice, but when
>> I last checked LLVM was usually faster than GCC even when compiling to asm,
>> even though LLVM does use indirection (in the form of virtual functions)
>> for its output routines.  I don't think indirect function calls themselves
>> are the problem -- as long as we get the abstraction right :-)
>
> yeah, last time I looked (tbf a while ago) the C++ front end took up by
> far the largest part of the time.  So it may not be terribly important,
> but it would still be nice to figure out what a good design looks like.

I tried getting final to output the code a large number of times.
Obviously just sticking "for (i = 0; i < n; ++i)" around something
isn't the best way of measuring performance (for all the usual reasons)
but it was interesting even so.  A lot of the time is taken in calls to
strlen and in assemble_name itself (called by ASM_OUTPUT_LABEL).

Each time we call assemble_name we do:

  real_name = targetm.strip_name_encoding (name);

  id = maybe_get_identifier (real_name);
  if (id)
    {
      tree id_orig = id;

      mark_referenced (id);
      ultimate_transparent_alias_target (&id);
      if (id != id_orig)
        name = IDENTIFIER_POINTER (id);
      gcc_assert (! TREE_CHAIN (id));
    }

Doing an identifier lookup every time we output a reference to a label
is pretty expensive.  Especially when many of the labels we're dealing
with are internal ones (basic block labels, debug labels, etc.) for which
the lookup is bound to fail.

So if compile-time for asm output is a concern, that seems like a good
place to start.  We should try harder to keep track of the identifier
behind a name (when there is one) and avoid this overhead for
internal labels.

Converting ASM_OUTPUT_LABEL to an indirect function call was in the
noise even with my for-loop hack.  The execution time of the hook is
dominated by assemble_name itself.  I hope patches like yours aren't
held up simply because they have the equivalent of a virtual function.

Also, although we seem to be paranoid about virtual functions and
indirect calls, it's worth remembering that on most targets every call
to fputs(_unlocked), fwrite(_unlocked) and strlen is a PLT call.
Our current code calls fputs several times for one line of assembly,
including for short strings like register names.  This is doubly
inefficient because:

(a) we could reduce the number of PLT calls by doing the buffering
    ourselves and

(b) the names of those registers are known at compile time (or at
    least at start-up time) and are short, but we call strlen() on
    them each time we write them out.

E.g.
for the attached microbenchmark I get:

  Time taken, normalised to VERSION==1

  VERSION==1:  1.000
  VERSION==2:  1.377
  VERSION==3:  3.202 (1.638 with -minline-all-stringops)
  VERSION==4:  4.242 (2.921 with -minline-all-stringops)
  VERSION==5:  4.526
  VERSION==6:  4.543
  VERSION==7: 10.884

where the results for 5 vs. 6 are in the noise.  The 5->4 gain is by
doing the buffering ourselves.  The 4->3 gain is for keeping track of
the string length rather than recomputing it each time.

This suggests that if we're serious about trying to speed up the asm
output, it would be worth adding an equivalent of LLVM's StringRef that
pairs a const char * string with its length.

Thanks,
Richard

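
To make the StringRef suggestion above concrete, here is a minimal sketch of what a length-carrying string reference might look like.  It is not LLVM's StringRef and not existing GCC code: the names str_ref, append and demo are hypothetical, and std::string stands in for the real output buffer purely to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical type: a pointer paired with its length, so that the
// length is measured once instead of on every write.
struct str_ref
{
  const char *data;
  std::size_t len;

  // String literals: the length comes from the array bound at compile
  // time.  Assumes the array holds exactly one NUL-terminated string.
  template <std::size_t N>
  constexpr str_ref (const char (&x)[N]) : data (x), len (N - 1) {}

  // Explicit pointer/length pair, e.g. a register name whose length
  // was computed once at start-up.
  constexpr str_ref (const char *x, std::size_t n) : data (x), len (n) {}
};

// Append to the output without calling strlen.
inline void
append (std::string &out, str_ref s)
{
  out.append (s.data, s.len);
}

// Small usage demo: measure a run-time name once, then reuse it.
inline std::string
demo ()
{
  std::string out;
  const char *reg = "%rax";               // length not known statically
  str_ref reg_ref (reg, strlen (reg));    // one strlen call ...
  for (int i = 0; i < 3; ++i)
    append (out, reg_ref);                // ... reused on every write
  append (out, "\n");                     // literal: length deduced from N
  assert (out.size () == 13);
  return out;
}
```

Calling demo () from main yields the register name three times followed by a newline, with strlen called only once.  The array constructor is the same trick the microbenchmark uses for VERSION <= 3.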
/* Needed for the fputs_unlocked/fwrite_unlocked declarations.  */
#define _GNU_SOURCE 1
#include <stdio.h>
#include <string.h>
#include <iostream>

/* Simple self-buffering writer; flushes to stdout when full and on
   destruction.  */
struct S
{
  S () : end (buffer) {}
  ~S () { fwrite_unlocked (buffer, end - buffer, 1, stdout); }

  /* VERSION 3 forces an out-of-line call to the (pointer, length)
     writer; the other versions let it be inlined.  */
#if VERSION == 3
  void __attribute__((noinline))
#else
  void
#endif
  write (const char *x, size_t len)
  {
    if (__builtin_expect (buffer + sizeof (buffer) - end < len, 0))
      {
        fwrite_unlocked (buffer, end - buffer, 1, stdout);
        end = buffer;
      }
    memcpy (end, x, len);
    end += len;
  }

  /* VERSIONs 1-3 take the length from the array bound; VERSION 4 and
     above recompute it with strlen on every call.  */
#if VERSION == 1 || VERSION == 3
  template <size_t N>
  void
  write (const char (&x)[N])
  {
    write (x, N - 1);
  }
#elif VERSION == 2
  template <size_t N>
  void __attribute__((noinline))
  write (const char (&x)[N])
  {
    write (x, N - 1);
  }
#else
  void __attribute__((noinline))
  write (const char *x)
  {
    write (x, strlen (x));
  }
#endif

  char buffer[4096];
  char *end;
};

int
main ()
{
  S s;
  for (int i = 0; i < 100000000; ++i)
    {
#if VERSION <= 4
      s.write ("Hello!");
#elif VERSION == 5
      fputs_unlocked ("Hello!", stdout);
#elif VERSION == 6
      fwrite_unlocked ("Hello!", 6, 1, stdout);
#elif VERSION == 7
      std::cout << "Hello!";
#else
#error Please define VERSION
#endif
    }
  return 0;
}
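
Separately from the benchmark, the earlier point about assemble_name (keeping track of the identifier behind a name and skipping the lookup for internal labels) could be sketched along the following lines.  Everything here is a hypothetical stand-in rather than GCC code: identifier, identifier_table, label_ref and the helper functions are invented for illustration, and the lookup counter exists only to show how many table lookups happen.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins; none of these are GCC types or functions.
struct identifier
{
  std::string name;
  bool referenced;
};

struct identifier_table
{
  std::unordered_map<std::string, identifier> ids;
  std::size_t lookups = 0;      // counts hash-table lookups

  identifier *
  lookup (const std::string &name)
  {
    ++lookups;
    auto it = ids.find (name);
    return it == ids.end () ? nullptr : &it->second;
  }
};

// A label reference that remembers its identifier, resolved once when
// the reference is created rather than each time it is output.
struct label_ref
{
  std::string name;
  identifier *id;               // null for internal labels
};

inline label_ref
make_user_label (identifier_table &t, const std::string &name)
{
  return label_ref { name, t.lookup (name) };
}

// Internal labels (basic block labels etc.) never have an identifier,
// so no lookup is attempted at all.
inline label_ref
make_internal_label (const std::string &name)
{
  return label_ref { name, nullptr };
}

inline void
output_label (std::string &out, const label_ref &l)
{
  if (l.id)
    l.id->referenced = true;    // stand-in for marking the symbol used
  out += l.name;
  out += ":\n";
}

// Output two labels many times and report how many lookups were needed.
inline std::size_t
demo_lookups ()
{
  identifier_table t;
  t.ids.emplace ("main", identifier { "main", false });
  label_ref user = make_user_label (t, "main");        // one lookup
  label_ref internal = make_internal_label (".L2");    // no lookup
  std::string out;
  for (int i = 0; i < 1000; ++i)
    {
      output_label (out, user);
      output_label (out, internal);
    }
  assert (user.id && user.id->referenced);
  return t.lookups;
}
```

In this sketch the 2000 label outputs cost a single table lookup, whereas a lookup per output would cost 2000, half of them guaranteed to fail.  Whether the real code can carry such a reference through to final is a design question this sketch does not answer.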