Sorry for the (very) delayed response. I'm still looking for feedback
here so I can fix the docs.
To refresh: The topic of conversation was the (extremely) wrong
explanation that has been in the docs since forever about how to use
memory constraints with inline asm to avoid the performance hit of a
full memory clobber. Trying to understand how this really works has led
to some surprising results.
Me:
>> While I really like the idea of using memory constraints to avoid all out
>> memory clobbers, 16 bytes is a pretty small maximum memory block, and x86
>> only supports a max of 8. Unless there's some way to use larger sizes (say
>> SSIZE_MAX), this feature hardly seems worth documenting.
Richard:
> I wonder how you figured out that a 12 byte clobber performs a full
> memory clobber?
Here's the code (compiled with gcc version 4.9.0 x86_64-win32-seh-rev2,
using -m64 -O2 -fdump-final-insns):
--------------------
#include <stdio.h>
#define MYSIZE 3
inline void
__stosb(unsigned char *Dest, unsigned char Data, size_t Count)
{
   struct _reallybigstruct { char x[MYSIZE]; }
      *p = (struct _reallybigstruct *)Dest;
   __asm__ __volatile__ ("rep stos{b|b}"
      : "+D" (Dest), "+c" (Count), "=m" (*p)
      : [Data] "a" (Data)
      //: "memory"
   );
}
int main()
{
   unsigned char buff[100];
   buff[5] = 'A';
   __stosb(buff, 'B', sizeof(buff));
   printf("%c\n", buff[5]);
}
--------------------
In summary:
1) Create a 100 byte buffer, and set buff[5] to 'A'.
2) Call __stosb, which uses inline asm to overwrite all of buff with
'B'.
3) Use a memory constraint in __stosb to flush buff. The size of
the memory constraint is controlled by a #define.
With this, I have a simple way to test various sizes of memory
constraints to see if the buffer gets flushed. If it *is* flushing the
buffer, printing buff[5] after __stosb will print 'B'. If it is *not*
flushing, it will print 'A'.
Results:
- Since buff[5] is the 6th byte in the buffer, using memory
constraint sizes of 1, 2 & 4 (not surprisingly) all print 'A', showing
that no flush was done.
- Sizes of 8 and 16 print 'B', showing that the flush was done. This
is also the expected result, since I am now flushing enough of buff to
include buff[5].
- The surprise comes from using a size of 3 or 5. These also print
'B'. WTF? If 4 doesn't flush, why does 3?
I believe the answer comes from reading the RTL. The difference between
sizes of 3 and 16 comes here:
Size 16:
(set (mem/c:TI (plus:DI (reg/f:DI 7 sp)
            (const_int 32 [0x20])) [ MEM[(struct _reallybigstruct *)&buff]+0 S16 A128])
    (asm_operands/v:TI ("rep stos{b|b}") ("=m") 2 [
Size 3:
(set (mem/c:BLK (plus:DI (reg/f:DI 7 sp)
            (const_int 32 [0x20])) [ MEM[(struct _reallybigstruct *)&buff]+0 S3 A128])
    (asm_operands/v:BLK ("rep stos{b|b}") ("=m") 2 [
While I don't actually speak RTL, TI clearly refers to TImode.
Apparently when using a size that exactly matches a machine mode, asm
memory references (on i386) can flush the exact number of bytes. But
for other sizes, gcc seems to fall back to BLK mode, which doesn't.
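To make that guess concrete, here is a minimal sketch of the size-to-mode mapping the results seem to suggest. It only models what was observed; it is not gcc's actual logic. The helper names (observed_mode_for_size, flush_is_exact) are made up for illustration, and only the TI (16) and BLK (3) cases were actually confirmed in the RTL dumps above. The QI/HI/SI/DI entries are an assumption based on the usual integer mode sizes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: guess which machine mode gcc assigns to a memory
   operand of the given size in bytes.  Only 16 -> TI and 3 -> BLK were
   seen in the RTL dumps; the other entries are assumed from the standard
   integer mode sizes (QI=1, HI=2, SI=4, DI=8). */
static const char *observed_mode_for_size(size_t size)
{
    switch (size) {
    case 1:  return "QI";
    case 2:  return "HI";
    case 4:  return "SI";
    case 8:  return "DI";
    case 16: return "TI";
    default: return "BLK";  /* no integer mode of exactly this size */
    }
}

/* Hypothetical helper: returns 1 when the tests suggest that only the
   named bytes are flushed, 0 when the operand ends up in BLK mode and
   more than the named bytes appear to be flushed. */
static int flush_is_exact(size_t size)
{
    return strcmp(observed_mode_for_size(size), "BLK") != 0;
}
```

Under this model, sizes 1, 2, 4, 8 and 16 get an exact flush, while 3 and 5 land in BLK, which matches the 'A'/'B' results listed earlier.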
I don't know the exact meaning of BLK on a "set" or "asm_operands." Does
it cause a full clobber? Or just a complete clobber of buff? Attempting
to answer that question leads us to the second bit of code:
--------------------
#include <stdio.h>
#define MYSIZE 8
inline void
__stosb(unsigned char *Dest, unsigned char Data, size_t Count)
{
   struct _reallybigstruct { char x[MYSIZE]; }
      *p = (struct _reallybigstruct *)Dest;
   __asm__ __volatile__ ("rep stos{b|b}"
      : "+D" (Dest), "+c" (Count), "=m" (*p)
      : [Data] "a" (Data)
      //: "memory"
   );
}
int main()
{
   unsigned char buff[100], buff2[100];
   buff[5] = 'A';
   buff2[5] = 'M';
   asm("#" : : "r" (buff2));
   __stosb(buff, 'B', sizeof(buff));
   printf("%c %c\n", buff[5], buff2[5]);
}
--------------------
Here I've added a buff2, and I set buff2[5] to 'M' (aka ASCII 77), which
I also print. I still apply the memory constraint to buff only, then
I check to see whether it is affecting buff2.
I start by compiling this with a size of 8 and looking at the -S output.
If this is NOT doing a full clobber, gcc should be able to just print
buff2[5] by moving 77 into the appropriate register before calling
printf. And indeed, that's what we see.
/APP
# 17 "mem2.cpp" 1
rep stosb
# 0 "" 2
/NO_APP
movzbl 37(%rsp), %edx
movl $77, %r8d
leaq .LC0(%rip), %rcx
call printf
If using a size of 3 *is* causing a full memory clobber, we would expect
to see the value getting read from memory before calling printf. And
indeed, that's also what we see.
/APP
# 17 "mem2.cpp" 1
rep stosb
# 0 "" 2
/NO_APP
movzbl 37(%rsp), %edx
leaq .LC0(%rip), %rcx
movzbl 149(%rsp), %r8d
I don't know the internals of gcc well enough to understand exactly why
this is happening. But from a user's point of view, it sure looks like
a memory clobber.
As I said before, triggering a full memory clobber for anything over 16
bytes (and most sizes under 16 bytes) makes this feature all but
useless. So if that's really what's happening, we need to decide what
to do next:
1) Can this be "fixed?"
2) Do we want to doc the current behavior?
3) Or do we just remove this section?
I think it could be a nice performance win for inline asm if it could be
made to work right, but I have no idea what might be involved in that.
Failing that, I guess if it doesn't work and isn't going to work, I'd
recommend removing the text for this feature.
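For reference, the conservative form that always works is the one commented out in the test code above: drop the "=m" operand and use the "memory" clobber. A minimal sketch follows, assuming GCC-style inline asm on x86 and a clear direction flag (as the usual ABIs guarantee on function entry). The name stosb_full_clobber is made up here, and the memset branch is only there so the snippet also builds on other targets.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical name.  Fills count bytes at dest with data.  The "memory"
   clobber tells the compiler that any memory may have changed, so it
   must reload values such as buff[5] after the asm. */
static inline void
stosb_full_clobber(unsigned char *dest, unsigned char data, size_t count)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __asm__ __volatile__ ("rep stosb"
        : "+D" (dest), "+c" (count)   /* both registers are modified */
        : "a" (data)
        : "memory");                  /* full memory clobber */
#else
    /* Assumption: plain memset as a stand-in on non-x86 targets. */
    memset(dest, data, count);
#endif
}
```

Calling this on the 100 byte buff from the first test should make the printf read buff[5] back from memory. The cost is the full clobber that the memory-constraint feature was supposed to avoid.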
Since all 3 suggestions require a doc change, I'll just say that I'm
prepared to start work on the doc patch as soon as someone lets me know
what the plan is.
Richard? Hans-Peter? Your thoughts?
Thanks,
dw