On August 7, 2015 3:50:33 PM GMT+02:00, Trevor Saunders <tbsau...@tbsaunde.org> wrote:
>On Fri, Aug 07, 2015 at 10:45:57AM +0100, Richard Sandiford wrote:
>> Trevor Saunders <tbsau...@tbsaunde.org> writes:
>> > On Thu, Aug 06, 2015 at 08:36:36PM +0100, Richard Sandiford wrote:
>> >> An integrated assembler or tighter asm output would be nice, but when
>> >> I last checked LLVM was usually faster than GCC even when compiling to asm,
>> >> even though LLVM does use indirection (in the form of virtual functions)
>> >> for its output routines. I don't think indirect function calls themselves
>> >> are the problem -- as long as we get the abstraction right :-)
>> >
>> > yeah, last time I looked (tbf a while ago) the C++ front end took up by
>> > far the largest part of the time. So it may not be terribly important,
>> > but it would still be nice to figure out what a good design looks like.
>>
>> I tried getting final to output the code a large number of times.
>> Obviously just sticking "for (i = 0; i < n; ++i)" around something
>> isn't the best way of measuring performance (for all the usual reasons)
>> but it was interesting even so. A lot of the time is taken in calls to
>> strlen and in assemble_name itself (called by ASM_OUTPUT_LABEL).
>
>yeah, this data looks great. I find it interesting that you say we
>spend so much time outputting labels as opposed to instructions.
>
>> Each time we call assemble_name we do:
>>
>>   real_name = targetm.strip_name_encoding (name);
>>
>>   id = maybe_get_identifier (real_name);
>>   if (id)
>>     {
>>       tree id_orig = id;
>>
>>       mark_referenced (id);
>>       ultimate_transparent_alias_target (&id);
>>       if (id != id_orig)
>>         name = IDENTIFIER_POINTER (id);
>>       gcc_assert (! TREE_CHAIN (id));
>>     }
>>
>> Doing an identifier lookup every time we output a reference to a label
>> is pretty expensive. Especially when many of the labels we're dealing
>> with are internal ones (basic block labels, debug labels, etc.)
>> for which the lookup is bound to fail.
>
>well, there's ASM_OUTPUT_INTERNAL_LABEL, and I think something similar
>for debug labels. I guess we don't always use those where we could. Or
>maybe the problem is we have places where we need to look at data to
>find out. Maybe it would make sense to have the generally used
>output_label routine take a tree / rtx, and check if it's an internal or
>debug label and dispatch appropriately.
>
>> So if compile-time for asm output is a concern, that seems like a good
>> place to start. We should try harder to keep track of the identifier
>> behind a name (when there is one) and avoid this overhead for
>> internal labels.
>>
>> Converting ASM_OUTPUT_LABEL to an indirect function call was in the
>> noise even with my for-loop hack. The execution time of the hook is
>> dominated by assemble_name itself. I hope patches like yours aren't
>> held up simply because they have the equivalent of a virtual function.
>
>Well, I think it makes sense to reroll this series, but I think I'll
>keep working on trying to replace these macros with something else.
>
>> Also, although we seem to be paranoid about virtual functions and
>> indirect calls, it's worth remembering that on most targets every
>> call to fputs(_unlocked), fwrite(_unlocked) and strlen is a PLT call.
>> Our current code calls fputs several times for one line of assembly,
>> including for short strings like register names. This is doubly
>> inefficient because:
>>
>> (a) we could reduce the number of PLT calls by doing the buffering
>>     ourselves and
>
>yeah, I mentioned that earlier, but it's great to have data showing it's a
>win! I think it's also probably important to enabling the other
>optimizations below.
>
>> (b) the names of those registers are known at compile time (or at least
>>     at start-up time) and are short, but we call strlen() on them
>>     each time we write them out.
>
>yeah, that seems like something that should be fixed, but I'm not sure
>off hand where to look for the code doing this.
>
>> E.g. for the attached microbenchmark I get:
>>
>>   Time taken, normalised to VERSION==1
>>
>>   VERSION==1:  1.000
>>   VERSION==2:  1.377
>>   VERSION==3:  3.202 (1.638 with -minline-all-stringops)
>>   VERSION==4:  4.242 (2.921 with -minline-all-stringops)
>>   VERSION==5:  4.526
>>   VERSION==6:  4.543
>>   VERSION==7: 10.884
>>
>> where the results for 5 vs. 6 are in the noise.
>>
>> The 5->4 gain is by doing the buffering ourselves. The 4->3 gain is for
>> keeping track of the string length rather than recomputing it each time.
>>
>> This suggests that if we're serious about trying to speed up the asm output,
>> it would be worth adding an equivalent of LLVM's StringRef that pairs a
>> const char * string with its length.
>
>I've thought a tiny bit about working on that, so it's nice to have data.

Tree identifiers have an embedded length. So it's all about avoiding this target hook mangling the labels.

Richard.

>Trev
>
>>
>> Thanks,
>> Richard
>>
>
>> #define _GNU_SOURCE 1
>>
>> #include <stdio.h>
>> #include <string.h>
>> #include <iostream>
>>
>> struct S
>> {
>>   S () : end (buffer) {}
>>
>>   ~S ()
>>   {
>>     fwrite_unlocked (buffer, end - buffer, 1, stdout);
>>   }
>>
>> #if VERSION == 3
>>   void __attribute__((noinline))
>> #else
>>   void
>> #endif
>>   write (const char *x, size_t len)
>>   {
>>     if (__builtin_expect (buffer + sizeof (buffer) - end < len, 0))
>>       {
>>         fwrite_unlocked (buffer, end - buffer, 1, stdout);
>>         end = buffer;
>>       }
>>     memcpy (end, x, len);
>>     end += len;
>>   }
>>
>> #if VERSION == 1 || VERSION == 3
>>   template <size_t N>
>>   void
>>   write (const char (&x)[N])
>>   {
>>     write (x, N - 1);
>>   }
>> #elif VERSION == 2
>>   template <size_t N>
>>   void __attribute__((noinline))
>>   write (const char (&x)[N])
>>   {
>>     write (x, N - 1);
>>   }
>> #else
>>   void __attribute__((noinline))
>>   write (const char *x)
>>   {
>>     write (x, strlen (x));
>>   }
>> #endif
>>   char buffer[4096];
>>   char *end;
>> };
>>
>> int
>> main ()
>> {
>>   S s;
>>   for (int i = 0; i < 100000000; ++i)
>>     {
>> #if VERSION <= 4
>>       s.write ("Hello!");
>> #elif VERSION == 5
>>       fputs_unlocked ("Hello!", stdout);
>> #elif VERSION == 6
>>       fwrite_unlocked ("Hello!", 6, 1, stdout);
>> #elif VERSION == 7
>>       std::cout << "Hello!";
>> #else
>> #error Please define VERSION
>> #endif
>>     }
>>   return 0;
>> }
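
For what it's worth, the two ideas discussed above (pairing a string with its length, and doing the buffering ourselves) could be combined roughly as in the sketch below. This is only an illustration: the names str_ref and buffered_writer are hypothetical and are neither GCC nor LLVM APIs, the sink is a std::string purely so the sketch is easy to exercise (real output code would presumably write to a FILE *), and the handling of strings larger than the buffer is an addition of this sketch rather than something taken from the microbenchmark.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical type: a (pointer, length) pair in the spirit of LLVM's
// StringRef.  It does not own the characters it points to.
struct str_ref
{
  const char *data;
  std::size_t len;

  // String literals: N counts the trailing NUL, so the length is N - 1
  // and is known at compile time (no strlen call).
  template <std::size_t N>
  constexpr str_ref (const char (&s)[N]) : data (s), len (N - 1) {}

  // Explicit pointer and length, e.g. taken from an object that already
  // stores the length alongside the characters.
  constexpr str_ref (const char *s, std::size_t n) : data (s), len (n) {}
};

// Hypothetical writer that batches small writes into one buffer and only
// touches the sink when the buffer fills up or the writer is destroyed.
class buffered_writer
{
public:
  explicit buffered_writer (std::string &sink) : sink_ (sink), end_ (buffer_) {}
  ~buffered_writer () { flush (); }

  void write (str_ref s)
  {
    // Strings that can never fit go straight to the sink.
    if (s.len > sizeof (buffer_))
      {
        flush ();
        sink_.append (s.data, s.len);
        return;
      }
    // Not enough room left: empty the buffer first.
    std::size_t room = static_cast<std::size_t> (buffer_ + sizeof (buffer_) - end_);
    if (room < s.len)
      flush ();
    std::memcpy (end_, s.data, s.len);
    end_ += s.len;
  }

  void flush ()
  {
    assert (end_ >= buffer_);
    sink_.append (buffer_, static_cast<std::size_t> (end_ - buffer_));
    end_ = buffer_;
  }

private:
  std::string &sink_;
  char buffer_[64];   // deliberately tiny so flushing is easy to trigger
  char *end_;
};
```

A caller would write w.write ("Hello!") for literals and w.write (str_ref (ptr, len)) when the length is already stored somewhere; in neither case is strlen called on the output path.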