Re: [RFD] Using the 'memory constraint' trick to avoid memory clobber doesn't work

David Wohlferd Thu, 13 Nov 2014 04:03:52 -0800

Sorry for the (very) delayed response. I'm still looking for feedbackhere so I can fix the docs.

To refresh: The topic of conversation was the (extremely) wrongexplanation that has been in the docs since forever about how to usememory constraints with inline asm to avoid the performance hit of afull memory clobber. Trying to understand how this really works has ledto some surprising results.

Me:

>> While I really like the idea of using memory constraints to avoidall out>> memory clobbers, 16 bytes is a pretty small maximum memory block,and x86>> only supports a max of 8. Unless there's some way to use largersizes (say

>> SSIZE_MAX), this feature hardly seems worth documenting.

Richard:
> I wonder how you figured out that a 12 byte clobber performs a full
> memory clobber?

Here's the code (compiled with gcc version 4.9.0 x86_64-win32-seh-rev2,using -m64 -O2 -fdump-final-insns):


--------------------
#include <stdio.h>

#define MYSIZE 3

inline void
__stosb(unsigned char *Dest, unsigned char Data, size_t Count)
{
   struct _reallybigstruct { char x[MYSIZE]; }
      *p = (struct _reallybigstruct *)Dest;

   __asm__ __volatile__ ("rep stos{b|b}"
      : "+D" (Dest), "+c" (Count), "=m" (*p)
      : [Data] "a" (Data)
      //: "memory"
   );
}

int main()
{
   unsigned char buff[100];
   buff[5] = 'A';

   __stosb(buff, 'B', sizeof(buff));
   printf("%c\n", buff[5]);
}
--------------------

In summary:

   1) Create a 100 byte buffer, and set buff[5] to 'A'.

2) Call __stosb, which uses inline asm to overwrite all of buff with'B'.3) Use a memory constraint in __stosb to flush buff. The size ofthe memory constraint is controlled by a #define.

With this, I have a simple way to test various sizes of memoryconstraints to see if the buffer gets flushed. If it *is* flushing thebuffer, printing buff[5] after __stosb will print 'B'. If it is *not*flushing, it will print 'A'.


Results:

- Since buff[5] is the 6th byte in the buffer, using memoryconstraint sizes of 1, 2 & 4 (not surprisingly) all print 'A', showingthat no flush was done.- Sizes of 8 and 16 print 'B', showing that the flush was done. Thisis also the expected result, since I am now flushing enough of buff toinclude buff[5].- The surprise comes from using a size of 3 or 5. These also print'B'. WTF? If 4 doesn't flush, why does 3?

I believe the answer comes from reading the RTL. The difference betweensizes of 3 and 16 comes here:


  (set (mem/c:TI (plus:DI (reg/f:DI 7 sp)

(const_int 32 [0x20])) [ MEM[(struct _reallybigstruct *)&buff]+0S16 A128])

     (asm_operands/v:TI ("rep stos{b|b}") ("=m") 2 [

   (set (mem/c:BLK (plus:DI (reg/f:DI 7 sp)

(const_int 32 [0x20])) [ MEM[(struct _reallybigstruct*)&buff]+0 S3 A128])

         (asm_operands/v:BLK ("rep stos{b|b}") ("=m") 2 [

While I don't actually speak RTL, TI clearly refers to TIMode.Apparently when using a size that exactly matches a machine mode, asmmemory references (on i386) can flush the exact number of bytes. Butfor other sizes, gcc seems to falls back to BLK mode, which doesn't.

I don't know the exact meaning of BLK on a "set" or "asm_operands." Doesit cause a full clobber? Or just a complete clobber of buff? Attemptingto answer that question leads us to the second bit of code:


--------------------
#include <stdio.h>

#define MYSIZE 8

inline void
__stosb(unsigned char *Dest, unsigned char Data, size_t Count)
{
   struct _reallybigstruct { char x[MYSIZE]; }
      *p = (struct _reallybigstruct *)Dest;

   __asm__ __volatile__ ("rep stos{b|b}"
      : "+D" (Dest), "+c" (Count), "=m" (*p)
      : [Data] "a" (Data)
      //: "memory"
   );
}
int main()
{
   unsigned char buff[100], buff2[100];
   buff[5] = 'A';
   buff2[5] = 'M';
   asm("#" : : "r" (buff2));

   __stosb(buff, 'B', sizeof(buff));
   printf("%c %c\n", buff[5], buff2[5]);
}
--------------------

Here I've added a buff2, and I set buff2[5] to 'M' (aka ascii 77), whichI also print. I still perform the memory constraint against buff, thenI check to see if it is affecting buff2.

I start by compiling this with a size of 8 and look at the -S output.If this is NOT doing a full clobber, gcc should be able to just printbuff2[5] by moving 77 into the appropriate register before callingprintf. And indeed, that's what we see.


/APP
 # 17 "mem2.cpp" 1
        rep stosb
 # 0 "" 2
/NO_APP
        movzbl  37(%rsp), %edx
        movl    $77, %r8d
        leaq    .LC0(%rip), %rcx
        call    printf

If using a size of 3 *is* causing a full memory clobber, we would expectto see the value getting read from memory before calling printf. Andindeed, that's also what we see.


/APP
 # 17 "mem2.cpp" 1
        rep stosb
 # 0 "" 2
/NO_APP
        movzbl  37(%rsp), %edx
        leaq    .LC0(%rip), %rcx
        movzbl  149(%rsp), %r8d

I don't know the internals of gcc well enough to understand exactly whythis is happening. But from a user's point of view, it sure looks likea memory clobber.

As I said before, triggering a full memory clobber for anything over 16bytes (and most sizes under 16 bytes) makes this feature all butuseless. So if that's really what's happening, we need to decide whatto do next:


1) Can this be "fixed?"
2) Do we want to doc the current behavior?
3) Or do we just remove this section?

I think it could be a nice performance win for inline asm if it could bemade to work right, but I have no idea what might be involved in that.Failing that, I guess if it doesn't work and isn't going to work, I'drecommend removing the text for this feature.

Since all 3 suggestions require a doc change, I'll just say that I'mprepared to start work on the doc patch as soon as someone lets me knowwhat the plan is.


Richard?  Hans-Peter?  Your thoughts?

Thanks,
dw

Re: [RFD] Using the 'memory constraint' trick to avoid memory clobber doesn't work

Reply via email to