On August 7, 2015 3:50:33 PM GMT+02:00, Trevor Saunders <tbsau...@tbsaunde.org> 
wrote:
>On Fri, Aug 07, 2015 at 10:45:57AM +0100, Richard Sandiford wrote:
>> Trevor Saunders <tbsau...@tbsaunde.org> writes:
>> > On Thu, Aug 06, 2015 at 08:36:36PM +0100, Richard Sandiford wrote:
>> >> An integrated assembler or tighter asm output would be nice, but when
>> >> I last checked LLVM was usually faster than GCC even when compiling to asm,
>> >> even though LLVM does use indirection (in the form of virtual functions)
>> >> for its output routines.  I don't think indirect function calls themselves
>> >> are the problem -- as long as we get the abstraction right :-)
>> >
>> > yeah, last time I looked (tbf a while ago) the C++ front end took up by
>> > far the largest part of the time.  So it may not be terribly important,
>> > but it would still be nice to figure out what a good design looks like.
>> 
>> I tried getting final to output the code a large number of times.
>> Obviously just sticking "for (i = 0; i < n; ++i)" around something
>> isn't the best way of measuring performance (for all the usual reasons)
>> but it was interesting even so.  A lot of the time is taken in calls to
>> strlen and in assemble_name itself (called by ASM_OUTPUT_LABEL).
>
>yeah, this data looks great.  I find it interesting that you say we
>spend so much time outputting labels as opposed to instructions.
>
>> Each time we call assemble_name we do:
>> 
>>   real_name = targetm.strip_name_encoding (name);
>> 
>>   id = maybe_get_identifier (real_name);
>>   if (id)
>>     {
>>       tree id_orig = id;
>> 
>>       mark_referenced (id);
>>       ultimate_transparent_alias_target (&id);
>>       if (id != id_orig)
>>         name = IDENTIFIER_POINTER (id);
>>       gcc_assert (! TREE_CHAIN (id));
>>     }
>> 
>> Doing an identifier lookup every time we output a reference to a label
>> is pretty expensive.  Especially when many of the labels we're dealing
>> with are internal ones (basic block labels, debug labels, etc.) for which
>> the lookup is bound to fail.
>
>well, there's ASM_OUTPUT_INTERNAL_LABEL, and I think something similar
>for debug labels.  I guess we don't always use those where we could.  Or
>maybe the problem is we have places where we need to look at data to
>find out.  Maybe it would make sense to have the generally used
>output_label routine take a tree / rtx, check whether it's an internal or
>debug label, and dispatch appropriately.
>
>> So if compile-time for asm output is a concern, that seems like a good
>> place to start.  We should try harder to keep track of the identifier
>> behind a name (when there is one) and avoid this overhead for
>> internal labels.
>> 
>> Converting ASM_OUTPUT_LABEL to an indirect function call was in the
>> noise even with my for-loop hack.  The execution time of the hook is
>> dominated by assemble_name itself.  I hope patches like yours aren't
>> held up simply because they have the equivalent of a virtual function.
>
>Well, I think it makes sense to reroll this series, but I think I'll
>keep working on trying to replace these macros with something else.
>
>> Also, although we seem to be paranoid about virtual functions and
>> indirect calls, it's worth remembering that on most targets every
>> call to fputs(_unlocked), fwrite(_unlocked) and strlen is a PLT call.
>> Our current code calls fputs several times for one line of assembly,
>> including for short strings like register names.  This is doubly
>> inefficient because:
>> 
>> (a) we could reduce the number of PLT calls by doing the buffering
>>     ourselves and
>
>yeah, I mentioned that earlier, but it's great to have data showing it's a
>win!  I think it's also probably important to enabling the other
>optimizations below.
>
>> (b) the names of those registers are known at compile time (or at least
>>     at start-up time) and are short, but we call strlen() on them
>>     each time we write them out.
>
>yeah, that seems like something that should be fixed, but I'm not sure
>off hand where to look for the code doing this.
>
>> E.g. for the attached microbenchmark I get:
>> 
>>   Time taken, normalised to VERSION==1
>> 
>>   VERSION==1:  1.000
>>   VERSION==2:  1.377
>>   VERSION==3:  3.202 (1.638 with -minline-all-stringops)
>>   VERSION==4:  4.242 (2.921 with -minline-all-stringops)
>>   VERSION==5:  4.526
>>   VERSION==6:  4.543
>>   VERSION==7: 10.884
>> 
>> where the results for 5 vs. 6 are in the noise.
>> 
>> The 5->4 gain is by doing the buffering ourselves.  The 4->3 gain is for
>> keeping track of the string length rather than recomputing it each time.
>> 
>> This suggests that if we're serious about trying to speed up the asm output,
>> it would be worth adding an equivalent of LLVM's StringRef that pairs a
>> const char * string with its length.
>
>I've thought a tiny bit about working on that, so it's nice to have data.

Tree identifiers have an embedded length.
So it's all about avoiding this target hook mangling the labels.
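
For illustration only, here is a minimal sketch of the "string plus length"
idea from the quoted text.  The names str_ref and append_ref are hypothetical
and are not existing GCC types or functions; the point is just that a literal
gets its length at compile time and a string with a known length (such as an
identifier) never needs strlen.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical type for illustration; not part of GCC.
struct str_ref
{
  const char *data;
  size_t len;

  // String literals: N counts the trailing NUL, so the length is N - 1.
  // (Assumes the argument really is a literal, not a partly filled array.)
  template <size_t N>
  constexpr str_ref (const char (&s)[N]) : data (s), len (N - 1) {}

  // Strings whose length is already known to the caller.
  constexpr str_ref (const char *s, size_t l) : data (s), len (l) {}
};

// Append to an output buffer without ever calling strlen.
inline void
append_ref (std::string &out, str_ref s)
{
  assert (s.data != nullptr || s.len == 0);
  out.append (s.data, s.len);
}
```

A caller would write append_ref (buf, "%rax") for a literal, or
append_ref (buf, str_ref (ptr, len)) when the length is stored alongside
the pointer.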

Richard.

>Trev
>
>> 
>> Thanks,
>> Richard
>> 
>
>> #define _GNU_SOURCE 1
>> 
>> #include <stdio.h>
>> #include <string.h>
>> #include <iostream>
>> 
>> struct S
>> {
>>   S () : end (buffer) {}
>> 
>>   ~S ()
>>   {
>>     fwrite_unlocked (buffer, end - buffer, 1, stdout);
>>   }
>> 
>> #if VERSION == 3
>>   void __attribute__((noinline))
>> #else
>>   void
>> #endif
>>   write (const char *x, size_t len)
>>   {
>>     if (__builtin_expect (buffer + sizeof (buffer) - end < len, 0))
>>       {
>>         fwrite_unlocked (buffer, end - buffer, 1, stdout);
>>         end = buffer;
>>       }
>>     memcpy (end, x, len);
>>     end += len;
>>   }
>> 
>> #if VERSION == 1 || VERSION == 3
>>   template <size_t N>
>>   void
>>   write (const char (&x)[N])
>>   {
>>     write (x, N - 1);
>>   }
>> #elif VERSION == 2
>>   template <size_t N>
>>   void __attribute__((noinline))
>>   write (const char (&x)[N])
>>   {
>>     write (x, N - 1);
>>   }
>> #else
>>   void __attribute__((noinline))
>>   write (const char *x)
>>   {
>>     write (x, strlen (x));
>>   }
>> #endif
>>   char buffer[4096];
>>   char *end;
>> };
>> 
>> int
>> main ()
>> {
>>   S s;
>>   for (int i = 0; i < 100000000; ++i)
>>     {
>> #if VERSION <= 4
>>       s.write ("Hello!");
>> #elif VERSION == 5
>>       fputs_unlocked ("Hello!", stdout);
>> #elif VERSION == 6
>>       fwrite_unlocked ("Hello!", 6, 1, stdout);
>> #elif VERSION == 7
>>       std::cout << "Hello!";
>> #else
>> #error Please define VERSION
>> #endif
>>     }
>>   return 0;
>> }
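
As a rough sketch of the assemble_name point discussed above: if the caller
already knows that a label is internal, the identifier lookup can be skipped
entirely.  Everything below is hypothetical (label_kind, label_writer and its
members merely stand in for GCC's real data structures); it only illustrates
dispatching on the label kind before doing a hash lookup.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// Hypothetical stand-ins for illustration; not GCC APIs.
enum class label_kind { user, internal };

struct label_writer
{
  std::unordered_set<std::string> identifiers; // stands in for the identifier table
  std::string out;                             // stands in for the asm output file
  unsigned lookups = 0;                        // number of table lookups performed

  void
  output_label (const std::string &name, label_kind kind)
  {
    assert (!name.empty ());
    if (kind == label_kind::user)
      {
        // Only user-visible names can have an identifier behind them,
        // so only they pay for the lookup.
        ++lookups;
        bool known = identifiers.count (name) != 0;
        (void) known; // a real implementation would mark it referenced here
      }
    out += name;
    out += ":\n";
  }
};
```

With this shape, basic block and debug labels go straight to the output
buffer, and the lookup cost is paid only for names that might be identifiers.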

