Andrew Haley wrote:
Till Straumann wrote:
gcc-4.3.2 seems to produce bad code when
accessing an array of small 'volatile'
objects -- it may try to access multiple
such objects in a 'parallel' fashion.
E.g., instead of reading two consecutive
'volatile short's sequentially it reads
a single 32-bit longword. This may crash,
e.g., when accessing a memory-mapped device
which allows only 16-bit accesses.
If I compile this code fragment
void volarrcpy(short *d, volatile short *s, int n)
{
    int i;
    for (i=0; i<n; i++)
        d[i] = s[i];
}
with '-O3' (the critical option seems to be '-ftree-vectorize'),
then gcc-4.3.2 produces quite complicated code,
but the essential section is (powerpc):
.L7:
lhz 0,0(11)
addi 11,11,2
lwzx 0,4,9
stwx 0,3,9
addi 9,9,4
bdnz .L7
or i386
.L7:
movw (%ecx), %ax
movl (%esi,%edx,4), %eax
movl %eax, (%ebx,%edx,4)
incl %edx
addl $2, %ecx
cmpl %edx, -20(%ebp)
ja .L7
Translated back into C, this reads
uint32_t *dst_l = (uint32_t*)d;
uint32_t *src_l = (uint32_t*)s;
for (i=0; i<n/2; i++) {
    d[i] = s[i];
    dst_l[i] = src_l[i];
}
This code seems neither optimal nor correct.
Besides reading half of the locations twice,
which violates the semantics of volatile
objects, accessing such objects in a 'vectorized'
way (in this case: instead of reading
two adjacent short addresses, gcc emits
a single 32-bit read) seems illegal to me.
Similar behavior seems to be present in 4.3.3.
Does anybody have some insight? Should I file
a bug report?
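To make the "half of the locations are read twice" point concrete, here is a small sketch that only models the access pattern of the loop above; it does not touch real hardware. The counters and the names read16, read32, model_vectorized_loop and model_scalar_loop are hypothetical, introduced just for this illustration, and the 8-location window is an arbitrary assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a device window of NLOC 16-bit locations.
   reads[k] counts how often location k was touched. */
enum { NLOC = 8 };

static unsigned reads[NLOC];   /* per-location read counters */
static unsigned wide_reads;    /* number of 32-bit accesses issued */

static void reset_counters(void)
{
    size_t k;
    for (k = 0; k < NLOC; k++)
        reads[k] = 0;
    wide_reads = 0;
}

/* One 16-bit access to location k. */
static void read16(size_t k)
{
    assert(k < NLOC);
    reads[k]++;
}

/* One 32-bit access, covering locations 2*j and 2*j + 1. */
static void read32(size_t j)
{
    assert(2 * j + 1 < NLOC);
    reads[2 * j]++;
    reads[2 * j + 1]++;
    wide_reads++;
}

/* Access pattern of the loop as translated back from the assembly:
   d[i] = s[i]; dst_l[i] = src_l[i]; for i < n/2. */
static void model_vectorized_loop(size_t n)
{
    size_t i;
    for (i = 0; i < n / 2; i++) {
        read16(i);
        read32(i);
    }
}

/* Access pattern the C source asks for: one 16-bit read per element. */
static void model_scalar_loop(size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        read16(i);
}
```

Calling model_vectorized_loop(8) after reset_counters() leaves the first four counters at 2 and the last four at 1, with four 32-bit accesses recorded; model_scalar_loop(8) leaves every counter at 1 and issues no wide access at all.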
I can't reproduce this with "GCC: (GNU) 4.3.3 20081110 (prerelease)"
.L8:
movzwl (%ecx), %eax
addl $1, %ebx
addl $2, %ecx
movw %ax, (%edx)
addl $2, %edx
cmpl %ebx, 16(%ebp)
jg .L8
I think you should upgrade.
Andrew.
OK, try this then:
void
c(char *d, volatile char *s)
{
    int i;
    for ( i=0; i<32; i++ )
        d[i]=s[i];
}
(gcc --version: gcc (Ubuntu 4.3.3-5ubuntu4) 4.3.3)
gcc -m32 -c -S -O3
produces an unrolled sequence:
movzbl (%ecx), %eax
leal 20(%ebx), %edx
movl (%ecx), %eax
movl %eax, (%edi)
movzbl 1(%ecx), %eax
movl 4(%ecx), %eax
movl %eax, 4(%edi)
movzbl 2(%ecx), %eax
movl 4(%ebx), %eax
movl %eax, 4(%esi)
movzbl 3(%ecx), %eax
movl 8(%ebx), %eax
movl %eax, 8(%esi)
... < snip >...
The 64-bit version even uses SSE registers to
load the volatile data:
(gcc -c -S -O3)
.L7:
movzbl (%rsi), %eax
movdqu (%rsi), %xmm0
movdqa %xmm0, (%rdi)
movzbl 1(%rsi), %eax
movdqu (%rdx), %xmm0
movdqa %xmm0, 16(%rdi)
Not sure an upgrade helps ;-)
-- Till
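Since '-ftree-vectorize' looks like the critical option, building the affected file with '-fno-tree-vectorize' is the obvious thing to try. A source-level alternative might be to route each access through a helper that is never inlined, so the loop body contains a call rather than a bare load. The sketch below assumes this keeps the vectorizer away from the loop; it has not been checked against 4.3.2 or 4.3.3. The names read_volatile16 and volarrcpy_scalar are hypothetical, and noinline is a GCC extension.

```c
#include <stddef.h>

/* Hypothetical helper: performs exactly one 16-bit volatile load.
   noinline (GCC extension) keeps the load inside this function,
   so the caller's loop only sees a function call. */
static __attribute__((noinline)) short read_volatile16(volatile short *p)
{
    return *p;
}

/* Same interface as volarrcpy() above, but every element is
   fetched through the helper, one 16-bit read per element. */
void volarrcpy_scalar(short *d, volatile short *s, int n)
{
    int i;
    for (i = 0; i < n; i++)
        d[i] = read_volatile16(&s[i]);
}
```

The price is a function call per element, which is usually acceptable for device access but makes this a poor choice for bulk copies from ordinary memory.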