This report was prompted by a message on the LKML suggesting a hand-crafted
memset: http://lkml.org/lkml/2007/8/17/309 . I wondered whether the code
generated for __builtin_memset was good enough to be used instead of
hand-crafted code. I tested with (Debian) GCC 3.4.6, 4.1.3, 4.2.1, and a
snapshot of GCC 4.3. The results are all similar, so I only show those for
GCC 4.2 on x86-64. Compilation was done with -O3.

First, the __builtin_memset code:

  void fill1(char *s, int a)
  {
    __builtin_memset(s, a, 15);
  }

GCC generates:

   0:   40 0f b6 c6             movzbl %sil,%eax
   4:   48 ba 01 01 01 01 01    mov    $0x101010101010101,%rdx
   b:   01 01 01 
   e:   40 0f b6 ce             movzbl %sil,%ecx
  12:   48 0f af c2             imul   %rdx,%rax
  16:   40 88 77 0e             mov    %sil,0xe(%rdi)
  1a:   48 89 07                mov    %rax,(%rdi)
  1d:   40 0f b6 c6             movzbl %sil,%eax
  21:   69 c0 01 01 01 01       imul   $0x1010101,%eax,%eax
  27:   89 47 08                mov    %eax,0x8(%rdi)
  2a:   89 c8                   mov    %ecx,%eax
  2c:   c1 e0 08                shl    $0x8,%eax
  2f:   01 c8                   add    %ecx,%eax
  31:   66 89 47 0c             mov    %ax,0xc(%rdi)
  35:   c3                      retq   

Notice that GCC first computes %sil * 0x0101010101010101 into %rax. It then
computes %sil * 0x01010101 into %eax, which already held that value as the low
half of the previous product. Finally it computes %sil * 0x0101 (with a shift
and an add) into %ax, which again already held it. The second and third
computations are redundant.
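The redundancy can be checked exhaustively. The following is only a sketch, and
the function name count_truncation_mismatches is made up for illustration: for
every byte value it compares the truncated 64-bit product with the narrower
values that the generated code recomputes.

```c
#include <stdint.h>

/* Hypothetical helper, not part of GCC or of the original report.
   Returns how many times a truncation of the 64-bit product differs
   from the separately computed narrower value. */
int count_truncation_mismatches(void)
{
    int mismatches = 0;
    for (unsigned b = 0; b < 256; b++) {
        uint64_t wide = (uint64_t)b * 0x0101010101010101ull; /* first imul */
        uint32_t mid  = (uint32_t)b * 0x01010101u;           /* second imul */
        uint16_t low  = (uint16_t)((b << 8) + b);            /* shl + add */

        if ((uint32_t)wide != mid)
            mismatches++;
        if ((uint16_t)wide != low)
            mismatches++;
        if ((uint8_t)wide != (uint8_t)b)
            mismatches++;
    }
    return mismatches;
}
```

Since b fits in 8 bits, no carries cross byte boundaries in any of the
products, so the function should find no mismatch.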

Second, some code where multiplication results are reused:

  void fill2(char *s, int a)
  {
    unsigned long long int v = (unsigned char)a * 0x0101010101010101ull;
    *(unsigned long long int *)s = v;   /* bytes 0..7 */
    *(unsigned *)(s + 8) = v;           /* bytes 8..11 */
    *(unsigned short *)(s + 12) = v;    /* bytes 12..13 */
    *(s + 14) = v;                      /* byte 14, the 15th and last */
  }

GCC generates:

   0:   40 0f b6 f6             movzbl %sil,%esi
   4:   48 b8 01 01 01 01 01    mov    $0x101010101010101,%rax
   b:   01 01 01
   e:   48 0f af f0             imul   %rax,%rsi
  12:   48 89 37                mov    %rsi,(%rdi)
  15:   89 77 08                mov    %esi,0x8(%rdi)
  18:   66 89 77 0c             mov    %si,0xc(%rdi)
  1c:   40 88 77 0e             mov    %sil,0xe(%rdi)
  20:   c3                      retq

The function is 21 bytes smaller (33 bytes instead of 54, about -40%), it
needs two fewer registers (%rcx and %rdx), and it will not be slower.
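For comparison, here is a sketch of the same idea that avoids pointer casts
by going through memcpy, which compilers typically turn into plain stores at
this size. The name fill15_reuse is made up for illustration. The final byte
goes to offset 14 so that exactly the bytes 0..14 are written, as
memset(s, a, 15) does.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical name; intended to behave like memset(s, a, 15). */
void fill15_reuse(char *s, int a)
{
    /* One multiplication replicates the byte into all 8 lanes. */
    uint64_t v   = (unsigned char)a * 0x0101010101010101ull;
    uint32_t v32 = (uint32_t)v;   /* reuse the low 4 bytes */
    uint16_t v16 = (uint16_t)v;   /* reuse the low 2 bytes */

    memcpy(s, &v, 8);             /* bytes 0..7 */
    memcpy(s + 8, &v32, 4);       /* bytes 8..11 */
    memcpy(s + 12, &v16, 2);      /* bytes 12..13 */
    ((unsigned char *)s)[14] = (unsigned char)v;   /* byte 14 */
}
```

Because every byte of v is identical, the result does not depend on
endianness. Whether this compiles to the same instructions as the cast-based
version depends on the compiler and options.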

The same issue arises on x86_32. There, the hand-written code (using 32-bit
integers this time) is 14 bytes smaller for memset(,,15).
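The 32-bit hand-written code is not shown in this report. A plausible shape,
given only as an assumption, splits the 15 bytes into 4 + 4 + 4 + 2 + 1 and
reuses a single 32-bit product (the name fill15_reuse32 is made up):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 32-bit variant; intended to behave like memset(s, a, 15). */
void fill15_reuse32(char *s, int a)
{
    /* One 32-bit multiplication replicates the byte into 4 lanes. */
    uint32_t v   = (unsigned char)a * 0x01010101u;
    uint16_t v16 = (uint16_t)v;   /* reuse the low 2 bytes */

    memcpy(s, &v, 4);             /* bytes 0..3 */
    memcpy(s + 4, &v, 4);         /* bytes 4..7 */
    memcpy(s + 8, &v, 4);         /* bytes 8..11 */
    memcpy(s + 12, &v16, 2);      /* bytes 12..13 */
    ((unsigned char *)s)[14] = (unsigned char)v;   /* byte 14 */
}
```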


-- 
           Summary: Redundant multiplications for memset
           Product: gcc
           Version: 4.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: guillaume dot melquiond at ens-lyon dot fr
GCC target triplet: x86_64-linux-gnu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33103
