Re: -fno-inline-functions vs glibc's initfini

2012-02-01 Thread Richard Guenther
On Wed, Feb 1, 2012 at 3:30 AM, Alexandre Oliva  wrote:
> On Jan 31, 2012, Richard Guenther  wrote:
>
>> What's probably confusing you is the "Don't pay attention to the
>> @code{inline} keyword" sentence.
>
> What really set me down the wrong path were the comments in
> gcc/common.opt, which gave me the idea it had something to do with C99
> inline.
>
> ; Nonzero means that functions declared `inline' will be treated
> ; as `static'.  Prevents generation of zillions of copies of unused
> ; static inline functions; instead, `inlines' are written out
> ; only when actually used.  Used in conjunction with -g.  Also
> ; does the right thing with #pragma interface.
> finline
> Common Report Var(flag_no_inline,0) Init(0)
> Pay attention to the \"inline\" keyword

Ick - WTF is that ... I'll fix it ;)

Richard.

>> I suppose we should clarify the documentation and I will prepare a patch.
>
> Thanks.  Would you please take care of adjusting the comments in
> common.opt accordingly?  TIA,
>
>> The implementation is exactly right
>
> Phew! :-)
>
> --
> Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
> You must be the change you wish to see in the world. -- Gandhi
> Be Free! -- http://FSFLA.org/   FSF Latin America board member
> Free Software Evangelist      Red Hat Brazil Compiler Engineer


RE: ICE building compiler

2012-02-01 Thread Henderson, Stuart
>Whilst investigating an ICE with the Blackfin compiler, I bumped into a
>bit of code which seems questionable:
>
>in reload1.c:reload() we call select_reload_regs() and then check if
>failure was set.  However, looking at find_reload_regs() (called via
>select_reload_regs()), the only time we set failure is immediately after
>a call to spill_failure() which doesn't return.  Is spill_failure the
>correct function to be using here?  It seems like the intention of the
>code was to note the spill_failure and then try to fix it
>by jumping to the "failed:" label.

any feedback on this?
Thanks,
Stu


Memory corruption due to word sharing

2012-02-01 Thread Jan Kara
  Hello,

  we've spotted the following mismatch between what kernel folks expect
from a compiler and what GCC really does, resulting in memory corruption on
some architectures. Consider the following structure:
struct x {
long a;
unsigned int b1;
unsigned int b2:1;
};

We have two processes P1 and P2 where P1 updates field b1 and P2 updates
bitfield b2. The code GCC generates for b2 = 1 e.g. on ia64 is:
   0:   09 00 21 40 00 21   [MMI]   adds r32=8,r32
   6:   00 00 00 02 00 e0   nop.m 0x0
   c:   11 00 00 90 mov r15=1;;
  10:   0b 70 00 40 18 10   [MMI]   ld8 r14=[r32];;
  16:   00 00 00 02 00 c0   nop.m 0x0
  1c:   f1 70 c0 47 dep r14=r15,r14,32,1;;
  20:   11 00 38 40 98 11   [MIB]   st8 [r32]=r14
  26:   00 00 00 02 00 80   nop.i 0x0
  2c:   08 00 84 00 br.ret.sptk.many b0;;

Note that gcc used a 64-bit read-modify-write cycle to update b2. Thus if P1
races with P2, the update of b1 can get lost. BTW: I've just checked on x86_64
and there GCC uses an 8-bit bitop to modify the bitfield.
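
For illustration (a sketch of the interleaving, derived from the code
above), the update can be lost like this:

   P1 (writes b1)              P2 (writes b2, as compiled above)
                               ld8   - reads the 64 bits holding b1 and b2
   st4 b1 = new value
                               dep   - sets b2 in the stale 64-bit copy
                               st8   - writes the 64 bits back, restoring
                                       the old b1 and losing P1's update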

We actually spotted this race in practice in btrfs, on the structure
fs/btrfs/ctree.h:struct btrfs_block_rsv, where spinlock content got
corrupted due to an update of the following bitfield, and there seem to be
other places in the kernel where this could happen.

I've raised the issue with our GCC guys and they said to me that: "C does
not provide such guarantee, nor can you reliably lock different
structure fields with different locks if they share naturally aligned
word-size memory regions.  The C++11 memory model would guarantee this,
but that's not implemented nor do you build the kernel with a C++11
compiler."

So it seems what C/GCC promises does not quite match what the kernel
expects. I'm not really an expert in this area so I wanted to report it
here so that more knowledgeable people can decide how to solve the issue...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: Memory corruption due to word sharing

2012-02-01 Thread Markus Trippelsdorf
On 2012.02.01 at 16:19 +0100, Jan Kara wrote:
>   we've spotted the following mismatch between what kernel folks expect
> from a compiler and what GCC really does, resulting in memory corruption on
> some architectures. Consider the following structure:
> struct x {
> long a;
> unsigned int b1;
> unsigned int b2:1;
> };
> 
> We have two processes P1 and P2 where P1 updates field b1 and P2 updates
> bitfield b2. The code GCC generates for b2 = 1 e.g. on ia64 is:
>0:   09 00 21 40 00 21   [MMI]   adds r32=8,r32
>6:   00 00 00 02 00 e0   nop.m 0x0
>c:   11 00 00 90 mov r15=1;;
>   10:   0b 70 00 40 18 10   [MMI]   ld8 r14=[r32];;
>   16:   00 00 00 02 00 c0   nop.m 0x0
>   1c:   f1 70 c0 47 dep r14=r15,r14,32,1;;
>   20:   11 00 38 40 98 11   [MIB]   st8 [r32]=r14
>   26:   00 00 00 02 00 80   nop.i 0x0
>   2c:   08 00 84 00 br.ret.sptk.many b0;;
> 
> Note that gcc used a 64-bit read-modify-write cycle to update b2. Thus if P1
> races with P2, the update of b1 can get lost. BTW: I've just checked on x86_64
> and there GCC uses an 8-bit bitop to modify the bitfield.
> 
> We actually spotted this race in practice in btrfs, on the structure
> fs/btrfs/ctree.h:struct btrfs_block_rsv, where spinlock content got
> corrupted due to an update of the following bitfield, and there seem to be
> other places in the kernel where this could happen.
> 
> I've raised the issue with our GCC guys and they said to me that: "C does
> not provide such guarantee, nor can you reliably lock different
> structure fields with different locks if they share naturally aligned
> word-size memory regions.  The C++11 memory model would guarantee this,
> but that's not implemented nor do you build the kernel with a C++11
> compiler."
> 
> So it seems what C/GCC promises does not quite match what the kernel
> expects. I'm not really an expert in this area so I wanted to report it
> here so that more knowledgeable people can decide how to solve the issue...

FYI, the gcc bug can be found here:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52080

-- 
Markus


Re: Memory corruption due to word sharing

2012-02-01 Thread Colin Walters
On Wed, 2012-02-01 at 16:19 +0100, Jan Kara wrote:

> I've raised the issue with our GCC guys and they said to me that: "C does
> not provide such guarantee, nor can you reliably lock different
> structure fields with different locks if they share naturally aligned
> word-size memory regions.  The C++11 memory model would guarantee this,
> but that's not implemented nor do you build the kernel with a C++11
> compiler."

That's interesting.  So it seems like there are two solutions:

1) Use the same lock for a given bitfield
2) Split up the bitfield into different words
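
For option 2, a minimal sketch (reusing Jan's field names; the pad member
is an assumption of this sketch, and it relies on the compiler's wide
access staying within the naturally aligned 64-bit word holding the
bitfield):

struct x {
long a;
unsigned int b1;
unsigned int pad;   /* keeps b2 out of b1's aligned 64-bit word */
unsigned int b2:1;
};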




Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 7:19 AM, Jan Kara  wrote:
>
>  we've spotted the following mismatch between what kernel folks expect
> from a compiler and what GCC really does, resulting in memory corruption on
> some architectures.

This is sad.

We've had something like this before due to architectural reasons
(alpha's inability to do byte loads and stores left us unable to have
items smaller than a word safely sharing that word).

But at least then there was a *reason* for it, not "the compiler is
being difficult for no good reason".

Actually, the *sad* part is not that the compiler does something
unexpected - that's not new - but the usual "..and the gcc crowd
doesn't care, because they can point to paperwork saying it's not
defined". Even if that same paper is actually in the process of
getting updated exactly because it causes problems like ours.

That mentality is not new, of course.

> So it seems what C/GCC promises does not quite match what the kernel
> expects.

The paper C standard can *never* promise what a kernel expects. There
are tons of unspecified things that a compiler could do, including
moving memory around behind our back as long as it moves it back.
Because it's all "invisible" in the virtual C machine in the absence
of volatiles. The fact that the kernel has things like SMP coherency
requirements is simply not covered by the standard. There are tons of
other things not covered by the standard too that are just "that's
what we need".

So C/gcc has never "promised" anything in that sense, and we've always
had to make assumptions about what is reasonable code generation. Most
of the time, our assumptions are correct, simply because it would be
*stupid* for a C compiler to do anything but what we assume it does.

But sometimes compilers do stupid things. Using 8-byte accesses to a
4-byte entity is *stupid*, when it's not even faster, and when the
base type has been specified to be 4 bytes!

>I'm not really an expert in this area so I wanted to report it
> here so that more knowledgeable people can decide how to solve the issue...

If the gcc people aren't willing to agree that this is actually a flaw
in the standard (one that is being addressed, no less) and try to fix
it, we just have to extend our assumptions to something like "a
compiler would be stupid to ever access anything bigger than the
aligned register-size area". It's still just an assumption, and
compiler people could be crazy, but we could just extend the current
alpha rules to cover not just "int", but "long" too.

Sure, the compiler people could use "load/store multiple" or something
like that, but it would be obviously crazy code, so if it happens past
a register size, at least you could argue that it's a performance
issue and maybe the gcc people would care.

HOWEVER. The problem with the alpha rules (which, btw, were huge, and
led to the CPU designers literally changing the CPU instruction set
because they admitted they made a huge mistake) was never so much the
occasional memory corruption, as the fact that the places where it
could happen were basically almost impossible to find.

So we probably have tons of random places in the kernel that violate
even the alpha rules - because the corruption is so rare, and the
architecture was so rare as to make the corruption even harder to
find.

I assume this code generation idiocy only happens with bitfields? The
problem is, we really don't want to make all bitfields take up 64 bits
just because we might have a lock or something else next to them. But
we could probably re-order things (and then *occasionally* wasting
memory) if we could just find the ones that are problematic.

It's finding the problematic ones that is the issue. Which is why the
compiler should just generate code that matches what we told it to do,
not try to be "clever" in ways that don't even help performance! The
compiler simply doesn't know enough about the requirements. Never has,
never will, and this is not about "standards".

   Linus


Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 8:37 AM, Colin Walters  wrote:
>
> 1) Use the same lock for a given bitfield

That's not the problem. All the *bitfield* fields are accessed
within the same word already.

> 2) Split up the bitfield into different words

Again, it's not the bitfield that is the problem.

The problem is that the compiler - when it writes to the word that
contains the bitfield - will also corrupt the word *NEXT* to the
bitfield (or before - it probably depends on alignment).

So we have two separate 32-bit words - one of which just happens to
contain a bitfield. We write to these *separate* words using different
locking rules (btw, they don't even need to be protected by a lock: we
may have other rules that protect the individual word contents).

But the compiler turns the access to the bitfield (in a 32-bit aligned
word) into a 64-bit access that accesses the word *next* to it.

That word next to it might *be* the lock, for example.

So we could literally have this kind of situation:

   struct {
  atomic_t counter;
  unsigned int val:4, other:4, data:24;
   };

and if we write code like this:

spin_lock(&somelock);
s->data++;
spin_unlock(&somelock);

and on another CPU we might do

   atomic_inc(&counter);

and the access to the bitfield will *corrupt* the atomic counter, even
though both of them are perfectly fine!

Quite frankly, if the bug is simply because gcc doesn't actually know
or care about the underlying size of the bitmask, it is possible that
we can find a case where gcc clearly is buggy even according to the
original C rules.

Honza - since you have access to the compiler in question, try
compiling this trivial test-program:


   struct example {
  volatile int a;
  int b:1;
   };

   ..
 s->b = 1;
   ..

and if that bitfield access actually does a 64-bit access that also
touches 's->a', then dammit, that's a clear violation of even the
*old* C standard, and the gcc people cannot just wave away their bugs
by saying "we've got standards, pttthththt".

And I suspect it really is a generic bug that can be shown even with
the above trivial example.

   Linus


Re: Memory corruption due to word sharing

2012-02-01 Thread Torvald Riegel
On Wed, 2012-02-01 at 16:19 +0100, Jan Kara wrote:
> I've raised the issue with our GCC guys and they said to me that: "C does
> not provide such guarantee, nor can you reliably lock different
> structure fields with different locks if they share naturally aligned
> word-size memory regions.  The C++11 memory model would guarantee this,
> but that's not implemented nor do you build the kernel with a C++11
> compiler."

Indeed, it's memory models that specify which kind of behavior is
allowed, and you need them for both the hardware and the programming
language.  C++11 and C11 have memory models, in contrast to previous
versions of these standards.
GCC 4.7 implements this memory model (C++11's and C11's models are very
similar), even though there might be some rough edges in this
implementation (bit fields, for example...).
http://gcc.gnu.org/wiki/Atomic/GCCMM

> So it seems what C/GCC promises does not quite match what the kernel
> expects. I'm not really an expert in this area so I wanted to report it
> here so that more knowledgeable people can decide how to solve the issue...

There needs to be agreement about the memory model.  The only time I
spoke to a kernel person about memory models, I got the reply that the
kernel would use its own model.

What do the kernel folks think about the C11 memory model?  If you can
spot any issues in there, the GCC community would certainly like to
know.


Torvald



Re: Memory corruption due to word sharing

2012-02-01 Thread Jiri Kosina
On Wed, 1 Feb 2012, Linus Torvalds wrote:

> But the compiler turns the access to the bitfield (in a 32-bit aligned
> word) into a 64-bit access that accesses the word *next* to it.
> 
> That word next to it might *be* the lock, for example.
> 
> So we could literally have this kind of situation:
> 
>struct {
>   atomic_t counter;
>   unsigned int val:4, other:4, data:24;
>};
> 
> and if we write code like this:
> 
> spin_lock(&somelock);
> s->data++;
> spin_unlock(&somelock);
> 
> and on another CPU we might do
> 
>atomic_inc(&counter);
> 
> and the access to the bitfield will *corrupt* the atomic counter, even
> though both of them are perfectly fine!
> 
> Quite frankly, if the bug is simply because gcc doesn't actually know
> or care about the underlying size of the bitmask, it is possible that
> we can find a case where gcc clearly is buggy even according to the
> original C rules.
> 
> Honza - since you have access to the compiler in question, try
> compiling this trivial test-program:
> 
> 
>struct example {
>   volatile int a;
>   int b:1;
>};
> 
>..
>  s->b = 1;
>..
> 
> and if that bitfield access actually does a 64-bit access that also
> touches 's->a', then dammit, that's a clear violation of even the
> *old* C standard, and the gcc people cannot just wave away their bugs
> by saying "we've got standards, pttthththt".
> 
> And I suspect it really is a generic bug that can be shown even with
> the above trivial example.

I have actually tried exactly this earlier today (because while looking at
this, I had an idea that putting volatile in place could be a workaround,
causing gcc to generate saner code), but it doesn't work either:

# cat x.c
struct x {
long a;
volatile unsigned int lock;
unsigned int full:1;
};

void
wrong(struct x *ptr)
{
ptr->full = 1;
}

int main()
{
wrong(0);
}
# gcc -O2 x.c
# gdb -q ./a.out 
Reading symbols from /root/a.out...done.
(gdb) disassemble wrong
Dump of assembler code for function wrong:
   0x45c0 <+0>:     [MMI]       adds r32=8,r32
   0x45c1 <+1>:                 nop.m 0x0
   0x45c2 <+2>:                 mov r15=1;;
   0x45d0 <+16>:    [MMI]       ld8 r14=[r32];;
   0x45d1 <+17>:                nop.m 0x0
   0x45d2 <+18>:                dep r14=r15,r14,32,1;;
   0x45e0 <+32>:    [MIB]       st8 [r32]=r14
   0x45e1 <+33>:                nop.i 0x0
   0x45e2 <+34>:                br.ret.sptk.many b0;;

In my opinion, this is a clear bug in gcc (while the original problem,
without explicit volatile, is not a C spec violation per se, it's just
very inconvenient :) ).

-- 
Jiri Kosina
SUSE Labs


Re: -fno-inline-functions vs glibc's initfini

2012-02-01 Thread Roland McGrath
> Want me to prepare a s/-fno-inline-functions/-fno-inline/ patch?

My reading was that actually we would want both of those switches (the
first to avoid inlining as an implicit optimization, the second to avoid
inlining of functions declared inline).  But whatever the exact detail,
yes, please send a patch that does the right thing.

> Maybe rather than using the generic version, we should have a Makefile
> rule that generates the machine-dependent .s files for developers'
> perusal in creating the machine-specific asm sources.

That certainly wouldn't hurt.  I guess we could start with that to make it
easy to produce the assembly code, and then let each arch add one piecemeal
before taking away the C option.  Once we get to every arch having
assembly, it probably makes sense to stop doing it as a single file with
magic boundary markers to be sliced, and just have each arch provide
crt[in].S files directly.


Thanks,
Roland


Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 9:08 AM, Torvald Riegel  wrote:
>
> What do the kernel folks think about the C11 memory model?  If you can
> spot any issues in there, the GCC community would certainly like to
> know.

I don't think this is about memory models except very tangentially.

Gcc currently accesses fields in a way that affects *independent*
fields, without checking their memory models at all.

Even original C already has one memory model: "volatile" data is seen
outside the virtual C machine. And Jiri reports that even that
*original* memory model is being violated. We're talking about the one
from about 40 years ago.

So no amount of memory model will ever matter, if gcc can't even get
the 40-year old one right.

Also, the C11 memory model really isn't going to make any difference
to this. Again, because this happens even if we have explicit locking
for the different fields. So they aren't "atomic" or anything like
that, because we do our own (and sufficient) locking. But once the
compiler starts to randomly touch data that just happens to be
adjacent to other data, any memory model about those individual pieces
is going to be entirely immaterial anyway.

Finally: I do not believe for a second that we can actually use the
C11 memory model in the kernel and say "that's sufficient for our
needs". We will continue to have to do things that are "outside the
specs" for the very simple reason that the kernel isn't just some
random application that happens to use pthreads. We do end up doing
much more aggressive threading, with models that C11 simply doesn't
cover.

_Atomic simply isn't going to make any difference to us. Sure, we
might use it in a couple of places (where we have sometimes tried to
use volatile casts, for example), but it would be totally irrelevant
in this case exactly because *neither* field is really atomic - they
are both protected, it's just that they are *separately* protected, so
accessing one does *not* mean that you can access the other, even if
they are next to each other.

  Linus


Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 9:11 AM, Jiri Kosina  wrote:
> On Wed, 1 Feb 2012, Linus Torvalds wrote:
>>
>> And I suspect it really is a generic bug that can be shown even with
>> the above trivial example.
>
> I have actually tried exactly this earlier today (because while looking at
> this, I had an idea that putting volatile in place could be a workaround,
> causing gcc to generate saner code), but it doesn't work either:
>
> # cat x.c
> struct x {
>    long a;
>    volatile unsigned int lock;
>    unsigned int full:1;
> };
>
> void
> wrong(struct x *ptr)
> {
>        ptr->full = 1;
> }
>
> int main()
> {
>        wrong(0);
> }
> # gcc -O2 x.c
> # gdb -q ./a.out
> Reading symbols from /root/a.out...done.
> (gdb) disassemble wrong
> Dump of assembler code for function wrong:
>   0x45c0 <+0>:     [MMI]       adds r32=8,r32
>   0x45c1 <+1>:                 nop.m 0x0
>   0x45c2 <+2>:                 mov r15=1;;
>   0x45d0 <+16>:    [MMI]       ld8 r14=[r32];;
>   0x45d1 <+17>:                nop.m 0x0
>   0x45d2 <+18>:                dep r14=r15,r14,32,1;;
>   0x45e0 <+32>:    [MIB]       st8 [r32]=r14
>   0x45e1 <+33>:                nop.i 0x0
>   0x45e2 <+34>:                br.ret.sptk.many b0;;
>
> In my opinion, this is a clear bug in gcc (while the original problem,
> without explicit volatile, is not a C spec violation per se, it's just
> very inconvenient :) ).

Yup, gcc is clearly just buggy here. I do not believe there is any
question what-so-ever about the above test-case showing a bug.

And the thing is, if they fix this bug, they'll fix our problem too,
unless they are going to write explicit code to *try* to screw us over
while fixing that 'volatile' bug.

Because the right thing to do with bitfields is really to take the
base type into account. If the bitfield was in an "int", you use an
"int" access for it, not a 64-bit access. That's the simple fix for
the volatile problem, and it just happens to fix our issue too.
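
In other words, something like this hand-written equivalent (a sketch
only, not gcc output; the offset assumes the LP64 layout of Jiri's
struct x, with 'lock' at offset 8 and the int unit holding 'full' at
offset 12):

struct x {
long a;
volatile unsigned int lock;
unsigned int full:1;
};

void
right(struct x *ptr)
{
unsigned int *word = (unsigned int *)((char *)ptr + 12);
*word |= 1;     /* 32-bit load/or/store; 'lock' is never touched */
}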

Trying to explicitly *look* for volatiles, and only doing the 32-bit
access when you see them is actually extra code, and extra effort, and
doesn't even *help* anything. It's not like the 64-bit access is
somehow "better".

I can see some vindictive programmer doing that, while thinking "I'll
show these people who pointed out this bug in my code, mhwhahahahaa!
I'll fix their test-case while still leaving the real problem
unaddressed", but I don't think compiler people are quite *that* evil.
 Yes, they are evil people who are trying to trip us up, but still..

   Linus


Re: Memory corruption due to word sharing

2012-02-01 Thread Michael Matz
Hi,

On Wed, 1 Feb 2012, Jiri Kosina wrote:

> # cat x.c
> struct x {
> long a;
> volatile unsigned int lock;
> unsigned int full:1;
> };
> 
> void
> wrong(struct x *ptr)
> {
> ptr->full = 1;
> }
> 
> In my opinion, this is a clear bug

Even that depends (sadly) on who you ask.  Half-volatile objects (i.e.
structs where some members are volatile and others aren't) are terribly
underspecified.  You can make the bitfield volatile, and for ia64 that
would at least result in ld8.acq/st8.rel pairs.

And Linus: don't be so hastily dismissive.  People (even the compiler 
ones) do agree that using an 8 byte access for a 4 byte entity is 
problematic.  Even if allowed by the standard it's a quality of 
implementation problem.

One problem is that it's not a new problem, GCC emitted similar code since 
about forever, and still they turned up only now (well, probably because 
ia64 is dead, but sparc64 should have similar problems).  The bitfield 
handling code is _terribly_ complex and fixing it is quite involved.  So 
don't expect any quick fixes.

The other problem is specification.  While you think that the code you 
wrote is totally obvious it might not actually be so.  For instance, what 
about this struct:

{long l:32; int i1:16; short s; int i2:1; char c:7; short s2:8; short s3;}

What are the obviously correct accesses for various writes into this 
struct?

One rule may be to never write to a bitfield with accesses larger than 
their underlying declared type.  Well, but only if the bitfield fits into 
that type (int:96 is quite okay to have, and what accesses should be 
allowed there?).  That might be one obvious rule.  But it's not how 
traditional bitfield layout works.  It works based on underlying objects,
and for that e.g. the three fields i2, c and s2 are all in the same one, of
int type.

The rule above would at least work for code that most people do write, 
i.e. sequence of same-typed bitfields mixed with normal members.

And then, there's the last problem: are you sure that if GCC would use 4 
byte accesses for the case at hand, that the hardware wouldn't screw you 
again?  What about using atomic instructions for one half of the 
cache-line (the spinlock) but non-atomic instructions on the other half 
(the bitfield)?  Are you sure the latter won't interfere with the former?


Ciao,
Michael.


Re: Memory corruption due to word sharing

2012-02-01 Thread Dennis Clarke

> I have actually tried exactly this earlier today (because while looking at
> this, I had an idea that putting volatile in place could be a workaround,
> causing gcc to generate saner code), but it doesn't work either:
>
> # cat x.c
> struct x {
> long a;
> volatile unsigned int lock;
> unsigned int full:1;
> };
>
> void
> wrong(struct x *ptr)
> {
> ptr->full = 1;
> }
>
> int main()
> {
> wrong(0);
> }

> In my opinion, this is a clear bug in gcc (while the original problem,
> without explicit volatile, is not a C spec violation per se, it's just
> very inconvenient :) ).

As a data point, the exact same code on a Solaris 8 Pentium III box:

$ gcc -S -o x.s x.c
$ cat x.s
.file   "x.c"
.text
.globl wrong
.type   wrong, @function
wrong:
        pushl   %ebp
        movl    %esp, %ebp
        movl    8(%ebp), %eax
        movzbl  8(%eax), %edx
        orl     $1, %edx
        movb    %dl, 8(%eax)
        popl    %ebp
        ret
        .size   wrong, .-wrong
.globl main
        .type   main, @function
main:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $4, %esp
        movl    $0, (%esp)
        call    wrong
        leave
        ret
        .size   main, .-main
        .ident  "GCC: (Blastwave.org Inc. Thu Dec 16 18:05:01 GMT 2010) 4.5.2"
$ gcc -o x x.c
$ file x
x:  ELF 32-bit LSB executable 80386 Version 1, dynamically linked,
not stripped
$ ldd x
libc.so.1 => /usr/lib/libc.so.1
libdl.so.1 =>    /usr/lib/libdl.so.1
$ ./x
Segmentation Fault(coredump)

$ ls -l core
-rw-------   1 dclarke  csw        71384 Feb  1 17:26 core

The 71384 byte core is complete, thus:

$  elfdump -p core | tail -6

Program Header[12]:
p_vaddr:  0xdfbf3000  p_flags:[ PF_X  PF_W  PF_R ]
p_paddr:  0   p_type: [ PT_LOAD ]
p_filesz: 0x1000  p_memsz:0x1000
p_offset: 0x106d8 p_align:0

$ /opt/studio/SOS11/SUNWspro/bin/dbx -c "print 0x1000 + 0x106d8; quit"
dbx: warning: unknown language, 'c' assumed
0x1000+0x106d8 = 71384

what caused the seg fault ?

$ /opt/studio/SOS11/SUNWspro/bin/dbx x core
Reading x
core file header read successfully
Reading ld.so.1
Reading libc.so.1
Reading libdl.so.1
program terminated by signal SEGV (no mapping at the fault address)
0x08050672: wrong+0x0006:   movzbl   0x0008(%eax),%edx
(dbx) where
=>[1] wrong(0x0, 0x8047b70, 0x805057d, 0x1, 0x8047b7c, 0x8047b84), at 0x8050672
  [2] main(0x1, 0x8047b7c, 0x8047b84), at 0x8050690

However Sun Studio 5.8 does no better:

$ /opt/studio/SOS11/SUNWspro/bin/cc -Xc -o x_Sun_Studio_5.8 x.c
$ ./x_Sun_Studio_5.8
Segmentation Fault(coredump)
$ ls -l core
-rw-------   1 dclarke  csw        71384 Feb  1 17:48 core

$ /opt/studio/SOS11/SUNWspro/bin/dbx x_Sun_Studio_5.8 core
dbx: warning: core object name "x_Sun_Studio_5." matches
object name "x_Sun_Studio_5.8" within the limit of 14. assuming they match
core file header read successfully
Reading ld.so.1
Reading libc.so.1
Reading libdl.so.1
program terminated by signal SEGV (no mapping at the fault address)
0x080506ae: wrong+0x000e:   movl 0x0008(%ecx),%eax
(dbx) where
=>[1] wrong(0x0), at 0x80506ae
  [2] main(0x1, 0x8047b4c, 0x8047b54), at 0x80506ca
(dbx) quit
$ /opt/studio/SOS11/SUNWspro/bin/cc -V
cc: Sun C 5.8 Patch 121016-08 2009/04/20
usage: cc [ options] files.  Use 'cc -flags' for details
$


dc


-- 
--
http://pgp.mit.edu:11371/pks/lookup?op=vindex&search=0x1D936C72FA35B44B
+-------------------------+-----------------------------------+
| Dennis Clarke           | Solaris and Linux and Open Source |
| dcla...@blastwave.org   | Respect for open standards.       |
+-------------------------+-----------------------------------+



Documentation error for cbranch4

2012-02-01 Thread Paulo J. Matos

Hi,

The docs state:
`cbranchmode4'
Conditional branch instruction combined with a compare instruction. 
Operand 0 is a comparison operator. Operand 1 and operand 2 are the first 
and second operands of the comparison, respectively. Operand 3 is a 
label_ref that refers to the label to jump to.


However from my experience and after looking to other backends, operand 3 
is actually a code label and not a label_ref.

Cheers,
-- 
PMatos



Re: Memory corruption due to word sharing

2012-02-01 Thread David Miller
From: Michael Matz 
Date: Wed, 1 Feb 2012 18:41:05 +0100 (CET)

> One problem is that it's not a new problem, GCC emitted similar code since 
> about forever, and still they turned up only now (well, probably because 
> ia64 is dead, but sparc64 should have similar problems). 

Indeed, on sparc64 it does do the silly 64-bit access too:

wrong:
ldx [%o0+8], %g2
sethi   %hi(2147483648), %g1
or  %g2, %g1, %g1
jmp %o7+8
 stx    %g1, [%o0+8]

Personally I've avoided C bitfields like the plague in any code I've
written.


Re: Memory corruption due to word sharing

2012-02-01 Thread Jeff Law

On 02/01/2012 11:09 AM, David Miller wrote:
> From: Michael Matz
> Date: Wed, 1 Feb 2012 18:41:05 +0100 (CET)
>
>> One problem is that it's not a new problem, GCC emitted similar code since
>> about forever, and still they turned up only now (well, probably because
>> ia64 is dead, but sparc64 should have similar problems).
>
> Indeed, on sparc64 it does do the silly 64-bit access too:
>
> wrong:
>         ldx     [%o0+8], %g2
>         sethi   %hi(2147483648), %g1
>         or      %g2, %g1, %g1
>         jmp     %o7+8
>          stx    %g1, [%o0+8]
>
> Personally I've avoided C bitfields like the plague in any code I've
> written.

Torvald Riegel & I were told that was kernel policy when we brought up
the upcoming bitfield semantic changes with some of the linux kernel
folks last year.


Regardless of the kernel team's policy WRT bitfields, I believe fixing 
the semantics to avoid creation of data races in bitfields is going to 
be important.  I'm hoping Aldy can return to that task soon and he and 
Richi can come to some agreement on the implementation with gcc-4.8 as a 
target.


jeff


Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 9:41 AM, Michael Matz  wrote:
>
> One problem is that it's not a new problem, GCC emitted similar code since
> about forever, and still they turned up only now (well, probably because
> ia64 is dead, but sparc64 should have similar problems).  The bitfield
> handling code is _terribly_ complex and fixing it is quite involved.  So
> don't expect any quick fixes.

I agree that bitfields are nasty, I've had to handle them myself in
sparse. And we have actually had problems with bitfields before, to
the point where we've replaced use of bitfields with masking and
setting bits by hand.
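
The by-hand version looks something like this (a sketch of the idiom, not
any particular kernel site; the names are made up):

#define X_FULL 0x01u

struct x {
long a;
volatile unsigned int lock;
unsigned int flags;     /* explicit bits instead of a bitfield */
};

static void set_full(struct x *p)
{
p->flags |= X_FULL;     /* one 32-bit RMW, confined to 'flags' */
}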

But I also think that gcc is simply *buggy*, and has made them much
nastier than they should be. What gcc *should* have done is to turn
bitfield accesses into shift-and-masking of the underlying field as
early as possible, and then do all optimizations at that level.

In fact, there is another gcc bug outstanding (48696) where I complain
about absolutely horrible code generation, and that one was actually
the exact same issue except in reverse: gcc wouldn't take the
underlying size of the bitfield into account, and use the wrong
(smaller) size for the access, causing absolutely horrendous code
generation that mixes byte and word accesses, and causes slowdowns by
orders of magnitude.

And it really is the same issue: gcc has forgotten what the base type
is, and tries to "compute" some type using the actual bits. Which is
simply *wrong*. Always has been.

It's stupid, it generates bad code (both from performance *and* from a
correctness angle), and it has actually resulted in gcc having *extra*
complexity because it keeps the bitfield data around much too late.

> The other problem is specification.  While you think that the code you
> wrote is totally obvious it might not actually be so.  For instance, what
> about this struct:
>
> {long l:32; int i1:16; short s; int i2:1; char c:7; short s2:8; short s3;}
>
> What are the obviously correct accesses for various writes into this
> struct?

I really don't think that example matters. You should instead turn the
question around, and look at the real code examples, make *those*
generate good and obviously correct code, and then *after* you've done
that, you start looking at the mixed case where people do odd things.

Quite frankly, the layout that makes *sense* for something like the
above is not to pack them. You'd just do

  long l;
  int i1;
  short s;
  int i2;
  char c;
  short s2;
  short s3;

If somebody actually wants packing, they'd do the *sane* thing, which is

  unsigned int l:32, i1:16, s:16, i2:1, c:7, s2:8, s3:16;

and now it's obvious what the packing rules and the access rules are.

That said, I just tried, and 'sparse' actually gets your crazy example
right (where "right" is defined as what I *think* you want): optimally
packed. And it turns the accesses into the ones that are based on the
base type.  And I'm pretty sure the sparse bitfield rules are way
simpler than gcc, exactly because it tries to convert bitfields fairly
early into mask-and-set. So if I do

  struct {long l:32; int i1:16; short s; int i2:1; char c:7; short
s2:8; short s3;} dummy;

  int main(int argc, char **argv)
  {
dummy.l = 1;
dummy.i1 = 2;
dummy.s = 3;
dummy.i2 = 4;
dummy.c = 5;
dummy.s2 = 6;
dummy.s3 = 7;
  }

and then do a test-linearize (with "-m64" to show the difference
between "long" and "int") I get

  t.c:2:48: error: dubious one-bit signed bitfield
  main:
  .L0x7f928e378010:

load.64 %r1 <- 0[dummy]
and.64  %r2 <- %r1, $0xffffffff00000000
or.64   %r3 <- %r2, $1
store.64 %r3 -> 0[dummy]
load.32 %r4 <- 4[dummy]
and.32  %r5 <- %r4, $0xffff0000
or.32   %r6 <- %r5, $2
store.32 %r6 -> 4[dummy]
store.16 $3 -> 6[dummy]
load.32 %r7 <- 8[dummy]
and.32  %r8 <- %r7, $-2
store.32 %r8 -> 8[dummy]
load.8  %r10 <- 8[dummy]
and.8   %r12 <- %r10, $-255
or.8    %r13 <- %r12, $10
store.8 %r13 -> 8[dummy]
load.16 %r14 <- 8[dummy]
and.16  %r16 <- %r14, $0x00ff
or.16   %r17 <- %r16, $0x600
store.16 %r17 -> 8[dummy]
store.16 $7 -> 10[dummy]
ret.32  $7

which I think is what you'd expect. Now, sparse doesn't do the smart
"combine shifts-and-masks to memory", but that's an optimization phase
that should be done long after bitfields *anyway*, so that's kind of
immaterial.

And yes, look at how you can see how sparse mixes different access
sizes for different fields. But that is damn well what the programmer
asked for.

If programmers write stupid code, the compiler might as well generate odd code.

But if a programmer writes *good* code, and the compiler messes it up
because it *thinks* the code is stupid, then the compiler is just bad.

> One rule may be to never write to a bitfield with accesses larger than
> their underlying declared type.

Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 10:09 AM, David Miller  wrote:
>
> Personally I've avoided C bitfields like the plague in any code I've
> written.

I do agree with that. The kernel largely tries to avoid bitfields,
usually because we have some really strict rules about different
bitfields, but also because initialization of bitfields tends to
result in gcc generating an incredible mess of code (while
"explicit bits" allows us to just set all the fields in one go, and
know what the result is).

So core kernel data structures tend to be things like "unsigned long",
together with our various bitop functions that have explicit atomicity
(or non-atomicity) guarantees on a bit-per-bit basis.
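
For example (a sketch of that idiom; set_bit() and __set_bit() are the
kernel's atomic and non-atomic single-bit operations):

#include <linux/bitops.h>

static unsigned long flags;     /* explicit flag word, one bit per flag */

static void mark_dirty(void)
{
	set_bit(0, &flags);     /* atomic read-modify-write of this bit only */
}

static void mark_new(void)
{
	__set_bit(1, &flags);   /* non-atomic variant; caller must serialize */
}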

Sometimes bitfields are really convenient, though, and allow for much
more natural syntax.

I'm not surprised this issue came up in a filesystem, for example.

  Linus


Re: Memory corruption due to word sharing

2012-02-01 Thread Peter Bergner
On Wed, 2012-02-01 at 13:09 -0500, David Miller wrote:
> From: Michael Matz 
> Date: Wed, 1 Feb 2012 18:41:05 +0100 (CET)
> 
> > One problem is that it's not a new problem, GCC emitted similar code since 
> > about forever, and still they turned up only now (well, probably because 
> > ia64 is dead, but sparc64 should have similar problems). 
> 
> Indeed, on sparc64 it does do the silly 64-bit access too:
> 
> wrong:
> ldx [%o0+8], %g2
> sethi   %hi(2147483648), %g1
> or  %g2, %g1, %g1
> jmp %o7+8
>  stx    %g1, [%o0+8]

Ditto for powerpc64-linux:

ld 9,8(3)
li 10,1
rldimi 9,10,31,32
std 9,8(3)
blr


Peter





Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 10:45 AM, Jeff Law  wrote:
>
> Torvald Riegel & I were told that was kernel policy when we brought up the
> upcoming bitfield semantic changes with some of the linux kernel folks last
> year.

Btw, one reason this is true is that the bitfield ordering/packing is
so unspecified that using bitfields sometimes is just way more pain
than they are worth. Some people have tried to use bitfields for various IO
data structures (think laying out bits in memory to match some
particular layout of some disk controller result structure). It's
always a total disaster, because you have another whole level of
architecture-specific bit ordering (in *addition* to all the normal
byte order issues).

That's not a compiler issue, that's just the nature of the beast.
It's just another reason why the kernel often ends up then doing bit
masking by hand.

But *all* that said, we do have a metric buttload of bitfields in the
kernel. It's not some really unusual feature. It does get used a lot,
despite all the reasons why some particular code might not use
bitfields.

We have a lot of code, there's still a lot of situations left where
bitfields are just really convenient.

Linus


Re: Memory corruption due to word sharing

2012-02-01 Thread Torvald Riegel
On Wed, 2012-02-01 at 08:41 -0800, Linus Torvalds wrote:
> If the gcc people aren't willing to agree that this is actually a flaw
> in the standard (one that is being addressed, no less)

It has been addressed in the standards.

> and try to fix
> it,

Again, this is being worked on, see http://gcc.gnu.org/wiki/Atomic/GCCMM

> we just have to extend our assumptions to something like "a
> compiler would be stupid to ever access anything bigger than the
> aligned register-size area". It's still just an assumption, and
> compiler people could be crazy, but we could just extend the current
> alpha rules to cover not just "int", but "long" too.

So, let's ignore everyone's craziness (the kernel is not the only GCC
client...) and think about how we can improve the situation.

We need a proper memory model.  No vague assumptions with lots of
hand-waving.  If you think that this is simple stuff and can be
sufficiently described by "don't do anything stupid", then please have a
look at the issues that the Java memory model faced, and all the
iterations of the C++11/C11 model and discussions about it.

The only candidate that I see is the C++11/C11 model.  Any other
suggestions?

Now, if we assume this model, what does the kernel community think about
it?  It might be good to split this discussion into the following two
points, to avoid syntax flame wars:
1) The model itself (ie, the memory orders, ordering guarantees, (lack
of) progress guarantees, etc.).
2) The syntax/APIs to specify atomics.

If something else or more is needed, the compiler needs to have a formal
specification for that.  This will help users too, because it avoids all
the ambiguities.

> Sure, the compiler people could use "load/store multiple" or something
> like that, but it would be obviously crazy code, so if it happens past
> a register size, at least you could argue that it's a performance
> issue and maybe the gcc people would care.
> 
> HOWEVER. The problem with the alpha rules (which, btw, were huge, and
> led to the CPU designers literally changing the CPU instruction set
> because they admitted they made a huge mistake) was never so much the
> occasional memory corruption, as the fact that the places where it
> could happen were basically almost impossible to find.
> 
> So we probably have tons of random places in the kernel that violate
> even the alpha rules - because the corruption is so rare, and the
> architecture was so rare as to making the corruption even harder to
> find.

I assume you still would want a weak memory model, or not?  (That is,
rely on data-race-freedom, only atomics do not contribute to data races,
and you need to mark data used for synchronization (or which is just
accessed concurrently) as atomics).

If so, you need to tell the compiler which variables etc. are atomics.
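
For illustration, marking data in C11 looks like this (a minimal sketch;
the names are hypothetical):

#include <stdatomic.h>

_Atomic int ready;      /* marked: may be accessed concurrently */

void set_ready(void)
{
    atomic_store_explicit(&ready, 1, memory_order_release);
}

int check_ready(void)
{
    return atomic_load_explicit(&ready, memory_order_acquire);
}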

> I assume this code generation idiocy only happens with bitfields? The
> problem is, we really don't want to make all bitfields take up 64 bits
> just because we might have a lock or something else next to them. But
> we could probably re-order things (and then *occasionally* wasting
> memory) if we could just find the ones that are problematic.

If the compiler is aware what is a lock (or atomics, or similar), then a
correct implementation should leave the lock alone.  But in a weak
memory model, it needs to know what is a lock.
(Same goes for volatile obviously, like in the other example posted in
the thread where the bitfields interfered with the volatile.)

> 
> It's finding the problematic ones that is the issue. Which is why the
> compiler should just generate code that matches what we told it to do,

So, would it be okay to tell the compiler which part of the state is
accessed concurrently (ie, locks, atomics, ...)?


Torvald



Re: Memory corruption due to word sharing

2012-02-01 Thread Jakub Jelinek
On Wed, Feb 01, 2012 at 06:42:54PM +0100, Torvald Riegel wrote:
> We need a proper memory model.  No vague assumptions with lots of
> hand-waving.  If you think that this is simple stuff and can be
> sufficiently described by "don't do anything stupid", then please have a
> look at the issues that the Java memory model faced, and all the
> iterations of the C++11/C11 model and discussions about it.
> 
> The only candidate that I see is the C++11/C11 model.  Any other
> suggestions?

Well, the C++11/C11 model doesn't allow using the underlying type
for accesses, consider e.g.

struct S { long s1; unsigned int s2 : 5; unsigned int s3 : 19;
   unsigned char s4; unsigned int s5; };
struct T { long t1 : 16; unsigned int t2; };

on e.g. x86_64-linux, S is 16 bytes long, field s4 is packed together into
the same 32 bits as s2 and s3.  While the memory model allows s2 to be
changed when storing s3 (i.e. use a RMW cycle on it), it doesn't allow s4
to be changed, as it isn't a bitfield (you'd need s4 : 8 for that).
T is 8 bytes long; again, t2 is packed into the same 64 bits as t1,
and the memory model doesn't allow t2 to be modified.
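
For reference, the layout being described (reconstructed from the text
above):

/* struct S, 16 bytes on x86_64:
     bytes  0-7 : s1
     bytes  8-11: s2 (bits 0-4), s3 (bits 5-23), s4 (bits 24-31)
     bytes 12-15: s5
   struct T, 8 bytes on x86_64:
     bytes  0-1 : t1 (16-bit bit-field)
     bytes  4-7 : t2 (in the same aligned 64-bit word as t1)  */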

Not sure what the kernel would expect in such a case.

Jakub


RE: Memory corruption due to word sharing

2012-02-01 Thread Boehm, Hans
I'm clearly with Torvald here.

The original set of problems here is very clearly addressed by the C++11/C11 
memory model.  It clearly implies, among other things:

- Updating a bit-field may interfere (create a data race) with other updates
to that same contiguous sequence of bit-fields, but not with other adjacent
fields.  Updating non-bit-fields may not interfere with accesses to any other
field.  A zero length bit-field breaks a sequence of bit-fields (see the
sketch after this list).  The initial example here is a clear compiler bug by
C11 semantics.
- Bit-fields may be updated, and indeed must be updated, in some cases, with 
multiple smaller stores.  Doing anything else would violate the preceding rule.
- Spurious writes to a variable may never be introduced.
- "Volatile" has nothing to do with thread interactions.  Atomic variables were 
introduced to allow unprotected accesses to shared objects.  (It doesn't make 
sense to use volatile for this for a number of good, mostly historically 
motivated reasons.  See 
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2016.html if you 
really care.)
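
Here is the sketch referred to in the first rule above (field names are
hypothetical):

   struct s {
       unsigned a : 4;
       unsigned b : 4;   /* a and b form one contiguous sequence */
       unsigned   : 0;   /* zero length bit-field: ends the sequence */
       unsigned c : 4;   /* updating c may not interfere with a or b */
   };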

(Technically, this applies to C11 threads, because that's all it can talk 
about.  But an implementation would have to be insane to restrict it to that.  
For implementations of atomic operations, I would expect most compilers to 
assume memory mappings and consistency as normally seen by user processes.)

C11 is a published standard.  Last I checked, gcc did not follow many of the 
above rules.  It looks like major changes were recently merged into the gcc 
trunk, and I haven't had a chance to test those, so it may well be fixed.  But 
more importantly, so far I haven't read anything here to dissuade me that they 
are the right target.

Hans

> -Original Message-
> From: linux-ia64-ow...@vger.kernel.org [mailto:linux-ia64-
> ow...@vger.kernel.org] On Behalf Of Torvald Riegel
> Sent: Wednesday, February 01, 2012 9:43 AM
> To: Linus Torvalds
> Cc: Jan Kara; LKML; linux-i...@vger.kernel.org; dste...@suse.cz;
> ptesa...@suse.cz; rguent...@suse.de; gcc@gcc.gnu.org
> Subject: Re: Memory corruption due to word sharing
> 
> On Wed, 2012-02-01 at 08:41 -0800, Linus Torvalds wrote:
> > If the gcc people aren't willing to agree that this is actually a
> flaw
> > in the standard (one that is being addressed, no less)
> 
> It has been addressed in the standards.
> 
> > and try to fix
> > it,
> 
> Again, this is being worked on, see
> http://gcc.gnu.org/wiki/Atomic/GCCMM
> 
> > we just have to extend our assumptions to something like "a
> > compiler would be stupid to ever access anything bigger than the
> > aligned register-size area". It's still just an assumption, and
> > compiler people could be crazy, but we could just extend the current
> > alpha rules to cover not just "int", but "long" too.
> 
> So, let's ignore everyone's craziness (the kernel is not the only GCC
> client...) and think about how we can improve the situation.
> 
> We need a proper memory model.  No vague assumptions with lots of
> hand-waving.  If you think that this is simple stuff and can be
> sufficiently described by "don't do anything stupid", then please have a
> look at the issues that the Java memory model faced, and all the
> iterations of the C++11/C11 model and discussions about it.
> 
> The only candidate that I see is the C++11/C11 model.  Any other
> suggestions?
> 
> Now, if we assume this model, what does the kernel community think
> about
> it?  It might be good to split this discussion into the following two
> points, to avoid syntax flame wars:
> 1) The model itself (ie, the memory orders, ordering guarantees, (lack
> of) progress guarantees, etc.).
> 2) The syntax/APIs to specify atomics.
> 
> If something else or more is needed, the compiler needs to have a
> formal
> specification for that.  This will help users too, because it avoids
> all
> the ambiguities.
> 
> > Sure, the compiler people could use "load/store multiple" or
> something
> > like that, but it would be obviously crazy code, so if it happens
> past
> > a register size, at least you could argue that it's a performance
> > issue and maybe the gcc people would care.
> >
> > HOWEVER. The problem with the alpha rules (which, btw, were huge, and
> > led to the CPU designers literally changing the CPU instruction set
> > because they admitted they made a huge mistake) was never so much the
> > occasional memory corruption, as the fact that the places where it
> > could happen were basically almost impossible to find.
> >
> > So we probably have tons of random places in the kernel that violate
> > even the alpha rules - because the corruption is so rare, and the
> > architecture was so rare as to make the corruption even harder to
> > find.
> 
> I assume you still would want a weak memory model, or not?  (That is,
> rely on data-race-freedom, only atomics do not contribute to data
> races,
> and you need to mark data used for synchronization (or which is just
> accessed concurrently) as atomics).

Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 9:42 AM, Torvald Riegel  wrote:
>
> We need a proper memory model.

Not really.

The fact is, the kernel will happily take the memory model of the
underlying hardware. Trying to impose some compiler description of the
memory model is actually horribly bad, because it automatically also
involves compiler synchronization points - which will try to be
portable and easy to understand, and quite frankly, that will
automatically result in what is technically known as a "shitload of
crap".

Now, a strict memory model is fine for portability, and to make it
simple for programmers. I actually largely approve of the Java memory
model approach, even if it has been horribly problematic and has some
serious problems. But it's not something that is acceptable for the
kernel - we absolutely do *not* want the compiler to introduce some
memory model, because we are very much working together with whatever
the hardware memory model is, and we do our own serialization
primitives.

> No vague assumptions with lots of hand-waving.

So here's basically what the kernel needs:

 - if we don't touch a field, the compiler doesn't touch it.

   This is the rule that gcc now violates with bitfields.

   This is a gcc bug. End of story. The "volatile" example proves it -
anybody who argues otherwise is simply wrong, and is just trying to
make excuses.

 - if we need to touch something only once, or care about something
that is touched only conditionally, and we actually need to make sure
that it is touched *only* under that condition, we already go to quite
extreme lengths to tell the compiler so.

   We usually use an access with a "volatile" cast on it or tricks
like that. Or we do the whole "magic asm that says it modified memory
to let the compiler know not to try anything clever"

 - The kernel IS NOT GOING TO MARK DATA STRUCTURES.

Marking data structures is seriously stupid and wrong. It's not the
*data* that has access rules, it's the *code* that has rules of
access.

The traditional C "volatile" is misdesigned and wrong. We don't
generally mark data volatile, we really mark *code* volatile - which
is why our "volatiles" are in the casts, not on the data structures.
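
For reference, this is exactly the shape of the kernel's ACCESS_ONCE()
macro (this definition is from include/linux/compiler.h of that era):

/* the volatile lives in the cast at the access site, not on the object */
#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

/* used as, e.g.:  val = ACCESS_ONCE(shared->field); */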

Stuff that is "volatile" in one context is not volatile in another. If
you hold an RCU write lock, it may well be entirely stable, and marking
it volatile is *wrong*, and generating code as if it was volatile is
pure and utter shit.

On the other hand, if you are touching *the*very*same* field while you
are only read-locked for RCU, it may well be one of those "this has to
be read by accessing it exactly once".

And we do all this correctly in the kernel.  Modulo bugs, of course,
but the fundamental rule really is: "atomicity or volatility is about
CODE, not DATA".

And no, C11 does *not* do it correctly. The whole "_Atomic" crap is
exactly the same mistake as "volatile" was. It's wrong. Marking data
_Atomic is a sure sign that whoever designed it didn't understand
things.

> The only candidate that I see is the C++11/C11 model.  Any other
> suggestions?

The C11 model addresses the wrong thing, and addresses it badly.

You might as well ignore it as far as the kernel is concerned. I'm
sure it helps some *other* problem, but it's not the kernel problem.

The rules really are very simple, and the kernel already does
everything to make it easy for the compiler.

When we do something that the compiler cannot re-order around, we do
an asm() and mark it as modifying memory so that the compiler won't
screw things up. In addition, we will do whatever that the CPU
requires for memory ordering, and quite frankly, the compiler will
never have sufficient locking primitives to satisfy us, and there is
no real point in even trying. If you try to do locking in the
compiler, you *will* do it wrong.

If you add random flags on data structures ("_Atomic" or "volatile" or
whatever), you *will* do it wrong. It's simply a fundamentally broken
model.

So the *only* memory model we want from the compiler really is: "don't
write to fields that the source code didn't write to".

It's really is that simple.

> So, would it be okay to tell the compiler which part of the state is
> accessed concurrently (ie, locks, atomics, ...)?

Seriously, no.

See above: it's not the "state" that is accessed concurrently. It's
the code. If you ever try to mark state, you've already lost. The same
"state" can be atomic or not depending on context. It's not about the
state or the data structures, and it never will be.

Linus


Re: Memory corruption due to word sharing

2012-02-01 Thread Jeff Law

On 02/01/2012 12:44 PM, Boehm, Hans wrote:
> C11 is a published standard.  Last I checked, gcc did not follow many of the
> above rules.  It looks like major changes were recently merged into the gcc
> trunk, and I haven't had a chance to test those, so it may well be fixed.  But
> more importantly, so far I haven't read anything here to dissuade me that they
> are the right target.


The bitfield changes didn't make it into gcc-4.7; we're still working
through implementation details.  GCC's bitfield support is a horrid
nasty mess which has made implementation of the standard problematic.
It's on Aldy's plate for gcc-4.8.


jeff


Re: Memory corruption due to word sharing

2012-02-01 Thread Alan Cox
> So here's basically what the kernel needs:
> 
>  - if we don't touch a field, the compiler doesn't touch it.
> 
>This is the rule that gcc now violates with bitfields.
> 
>This is a gcc bug. End of story. The "volatile" example proves it -
> anybody who argues otherwise is simply wrong, and is just trying to
> make excuses.

C historically didn't make this guarantee because a lot of processors
couldn't, since they didn't have things like byte accessors (in fact I
suspect early ARM cannot make it, for example).

Not meeting it for types where you can do so is a bit rude, however, and
really ought to be an option (speed v sanity).

> See above: it's not the "state" that is accessed concurrently. It's
> the code. If you ever try to mark state, you've already lost. The same
> "state" can be atomic or not depending on context. It's not about the
> state or the data structures, and it never will be.

There are optimisation cases - where you can prove access properties are
safe (eg local variables some times) but they should be exactly that -
optimisations.

Alan


Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 11:40 AM, Jakub Jelinek  wrote:
>
> Well, the C++11/C11 model doesn't allow using the underlying type
> for accesses, consider e.g.
>
> struct S { long s1; unsigned int s2 : 5; unsigned int s3 : 19;
>    unsigned char s4; unsigned int s5; };
> struct T { long t1 : 16; unsigned int t2; };
>
> on e.g. x86_64-linux, S is 16 bytes long, field s4 is packed together into
> the same 32 bits as s2 and s3.  While the memory model allows s2 to be
> changed when storing s3 (i.e. use a RMW cycle on it), it doesn't allow s4
> to be changed, as it isn't a bitfield (you'd need s4 : 8 for that).
> T is 8 bytes long; again, t2 is packed into the same 64 bits as t1,
> and the memory model doesn't allow t2 to be modified.
>
> Not sure what the kernel would expect in such a case.

So the kernel really doesn't care what you do to things *within* the bitfield.

Sure, we do have some expectations of packing, but basically we never
expect bitfields to be 'atomic'. You really don't need to worry about
that.

From a correctness standpoint, I really don't think the kernel cares
if gcc does bitfield reads and writes one bit at a time. Seriously.

The bug is that the bitfield write wrote *outside* the bitfield.
That's *WRONG*. It's completely wrong, even if you write back the same
value you read. It's *always* wrong. It's simply not acceptable.

If gcc does that kind of crap, then gcc needs to align bitfields
sufficiently that the accesses that are outside the bitfields never
touch any other fields.

So what I would suggest:

 - use the *smallest* aligned access you can use to access the
bitfield. This will be 'correct', but it may perform badly.

   This is apparently what x86 does. So on x86, if you have a one-bit
bitfield within a bitfield that is really 32 bits wide, it looks like
gcc will generally generate a byte load/store to set that bit. I don't
know exactly what the rules are, but this is fine from a *correctness*
standpoint.

 - However, while using the *smallest* possible access may generate
correct code, it often generates really *crappy* code. Which is
exactly the bug that I reported in

   http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48696

So quite frankly, my suggestion is to just try to use the base type.

But *whatever* you do, using a type *larger* than the whole bitfield
itself is a bug.
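
To make the "smallest aligned access" rule above concrete, here is a
sketch (hypothetical type and function; assumes a typical little-endian
ABI that allocates the first bitfield at the low bits of its 'int'
unit) of the kind of store a correct compiler could emit:

    struct s {
        int flag:1;
        int other;
    };

    void set_flag(struct s *p)
    {
        unsigned char *b = (unsigned char *)p;  /* first byte of flag's int unit */
        *b |= 1;                                /* byte-wide RMW: stays inside the
                                                   bitfield's allocation unit, so it
                                                   can never clobber 'other' */
    }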

And dammit, the people who continue to bring up C11 as if this was a
new issue, and only needs to be worried about if you support C11 (ie
gcc-4.7 and up) are just in denial.

The 'volatile'  example makes it entirely clear that the bug has
nothing what-so-ever to do with C11.

Linus


Re: Memory corruption due to word sharing

2012-02-01 Thread Jakub Jelinek
On Wed, Feb 01, 2012 at 12:01:46PM -0800, Linus Torvalds wrote:
> On Wed, Feb 1, 2012 at 11:40 AM, Jakub Jelinek  wrote:
> > struct S { long s1; unsigned int s2 : 5; unsigned int s3 : 19; unsigned 
> > char s4; unsigned int s5; };
> > struct T { long t1 : 16; unsigned int t2; };
> >
> > on e.g. x86_64-linux, S is 16 bytes long, field s4 is packed together into
> > the same 32 bits as s2 and s3.  While the memory model allows s2 to be
> > changed when storing s3 (i.e. use a RMW cycle on it), it doesn't allow s4
> > to be changed, as it isn't a bitfield (you'd need s4 : 8 for that).
> > T is 8 bytes long; again, t2 is packed into the same 64 bits as t1,
> > and the memory model doesn't allow t2 to be modified.
> >
> > Not sure what the kernel would expect in such a case.
> 
> So the kernel really doesn't care what you do to things *within* the bitfield.

But what is *within* the bitfield?  Do you consider s4 or t2 fields
(non-bitfield fields that just the ABI wants to pack together with
the bitfield) above as *within* or is already modifying of those a problem?

> The 'volatile'  example makes it entirely clear that the bug has
> nothing what-so-ever to do with C11.

The reason we mention C11/C++11 memory model is because we plan to support
that, likely for GCC 4.8, when requested by some options (either by default
when choosing those standards?, or when explicitly requested).  This will
even disable some optimizations that are fine for single-threaded apps,
but aren't fine in those memory models, aren't fine in OpenMP or for many
other threaded programs.  E.g. loop store motion if it isn't guaranteed the
variable is always stored in the loop:
int x;

void
foo (int j)
{
  int i;
  for (i = 0; i < 10; i++)
if (i > j)
  x = i;
}
can't be performed, because otherwise it introduces a tmp = x; ... x = tmp;
into code that wouldn't otherwise touch the variable, so if some other
thread modifies x, it might have an unexpected value.
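
Spelled out, the disallowed store motion would effectively turn that
into (a sketch of the transformed code, not actual gcc output):

int x;

void
foo (int j)
{
  int i;
  int tmp = x;			/* speculative load the source never asked for */
  for (i = 0; i < 10; i++)
    if (i > j)
      tmp = i;
  x = tmp;			/* unconditional store-back: a concurrent store
				   to x by another thread can be silently lost */
}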

Jakub


Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 12:01 PM, Linus Torvalds
 wrote:
>
>  - However, while using the *smallest* possible access may generate
> correct code, it often generates really *crappy* code. Which is
> exactly the bug that I reported in
>
>   http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48696

Btw, as I also pointed out in that (old) bugzilla entry, gcc can do
pretty much whatever it damn well pleases with loads from a bitfield.

You can narrow them, you can make them wider, you can paint them pink.

Widening the loads can still be a performance problem, and technically
it's a really bad idea for any nearby "volatile"s too, but in practice
things will *work*.

Writes are much more critical. If you overwrite a field that is no
longer in the bitfield, you can no longer depend on "nobody cares
about bitfield accesses", because by definition you are clearly now
stepping on non-bitfield things.

You do realize that even in user space, and even before C11, people
have things like "sig_atomic_t" etc. So even if you don't like the
notion of "volatile", here's *another* example of this being a real gcc
bug:

   struct mydata {
  sig_atomic_t seen_signal;
  unsigned flags:1;
   };

and now do the same test-program, realizing that "sig_atomic_t" is
normally the exact same thing as "int".

Now, think about what happens if you have a signal handler that comes
in and does

   mydata.seen_signal = 1;

and happens to interrupt the code that changes "mydata.flags" in just
the right spot.

That's right: the bitfield update will possibly *clear* that
"signal-safe" flag again, and gcc just created buggy asm code from
correct C code.
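
For reference, a minimal self-contained version of that scenario (the
handler and the loop are hypothetical, but the failure mode is exactly
the one described):

#include <signal.h>

struct mydata {
    sig_atomic_t seen_signal;
    unsigned flags:1;
};

static struct mydata mydata;

static void handler(int sig)
{
    mydata.seen_signal = 1;     /* a store to sig_atomic_t: the classic
                                   async-signal-safe flag */
}

int main(void)
{
    signal(SIGINT, handler);
    for (;;) {
        mydata.flags = 1;       /* if this compiles to a wide RMW covering
                                   seen_signal, a handler firing between the
                                   load and the store has its write undone */
        if (mydata.seen_signal)
            break;
    }
    return 0;
}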

Guys, if you don't admit that this is a bug, I don't know what to say.

IT IS A GCC BUG.

Don't try to make it into something bigger and related to C++11/C11.
Don't try to talk about "memory models". Just admit that it is a bug
to do a 64-bit write to a 32-bit bitfield, and fix it!

Linus


Re: Memory corruption due to word sharing

2012-02-01 Thread Torvald Riegel
On Wed, 2012-02-01 at 11:47 -0800, Linus Torvalds wrote:
> On Wed, Feb 1, 2012 at 9:42 AM, Torvald Riegel  wrote:
> >
> > We need a proper memory model.
> 
> Not really.
> 
> The fact is, the kernel will happily take the memory model of the
> underlying hardware.

You do rely on the compiler to do common transformations I suppose:
hoist loads out of loops, CSE, etc.  How do you expect the compiler to
know whether they are allowed for a particular piece of code or not?

An example, C++11 pseudo code:
atomic<bool> shared_flags[123];
x = 0;
for(...)
  if (shared_flags[i].load(memory_order_acquire))
x += shared_data;

Now, can we hoist the load of shared_data to out of the loop?  The
acquire tells the compiler that this is not allowed.
Now assume x86, and its memory model.  The acquire in there isn't
visible because you don't need a barrier for that.  So, what's the
compiler gonna do?  Never hoist anything out of loops because an acquire
might be intended?
So, you could add a compiler barrier in there (volatile asm...).  But
those work in both directions, and the acquire only works in one
direction, allowing the first store to x to be moved later.

> Trying to impose some compiler description of the
> memory model is actually horribly bad, because it automatically also
> involves compiler synchronization points - which will try to be
> portable and easy to understand, and quite frankly, that will
> automatically result in what is technically known as a "shitload of
> crap".

What is a "compiler synchronization point"?
Have you actually read the C++11/C11 memory model?

> Now, a strict memory model is fine for portability, and to make it
> simple for programmers. I actually largely approve of the Java memory
> model approach, even if it has been horribly problematic and has some
> serious problems. But it's not something that is acceptable for the
> kernel - we absolutely do *not* want the compiler to introduce some
> memory model, because we are very much working together with whatever
> the hardware memory model is, and we do our own serialization
> primitives.

I think you underestimate the issues here.  The memory model that the
kernel is programmed against is what tells the compiler which
transformations are allowed, and which aren't.  You seem to claim that
this is just the hardware's memory model that counts, but it's clear
that you have an implicit language memory model in mind, because
otherwise the compiler could never hoist any loads, for example (to be
safe, conservatively).

> > No vague assumptions with lots of hand-waving.
> 
> So here's basically what the kernel needs:
> 
>  - if we don't touch a field, the compiler doesn't touch it.

"we don't touch a field": what does that mean precisely?  Never touch it
in the same thread?  Same function?  Same basic block?  Between two
sequence points?  When crossing synchronizing code?  For example, in the
above code, can we touch it earlier/later?

I understand where this rule is heading (granularity races...), but
really, this isn't precise, and this all affects what optimizations are
allowed.  In this context, "show me the code" means having something
more formal, unfortunately, to prevent ambiguities.

For this reason, I suggested to have a look at the C11 model, and then
complain about limitations in this model.

> 
>This is the rule that gcc now violates with bitfields.

For the volatile case, that's a bug, yes.  For the non-volatile, I'd
need to check.

>This is a gcc bug. End of story. The "volatile" example proves it -
> anybody who argues otherwise is simply wrong, and is just trying to
> make excuses.

I'm trying to explain here why language memory models are a good thing,
and not the enemy.


>  - if we need to touch something only once, or care about something
> that is touched only conditionally, and we actually need to make sure
> that it is touched *only* under that condition, we already go to quite
> extreme lengths to tell the compiler so.
> 
>We usually use an access with a "volatile" cast on it or tricks
> like that. Or we do the whole "magic asm that says it modified memory
> to let the compiler know not to try anything clever"
> 
>  - The kernel IS NOT GOING TO MARK DATA STRUCTURES.
> 
> Marking data structures is seriously stupid and wrong. It's not the
> *data* that has access rules, it's the *code* that has rules of
> access.

And types are usually a good way to enforce that code which accesses a piece
of state behaves similarly.  Which is what atomic<T> does, and what
I meant.  From the overall C++/C perspective, it makes sense to require
types for atomics.  For the kernel, I don't care.

> Stuff that is "volatile" in one context is not volatile in another. If
> you hold a RCU write lock, it may well be entirely stable, and marking
> it volatile is *wrong*, and generating code as if it was volatile is
> pure and utter shit.

Agreed.  There is no full and portable way to do the weaker accesses
with atomics, but memory_order_relaxed is very close.

Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 12:16 PM, Jakub Jelinek  wrote:
>>
>> So the kernel really doesn't care what you do to things *within* the 
>> bitfield.
>
> But what is *within* the bitfield?  Do you consider s4 or t2 fields
> (non-bitfield fields that just the ABI wants to pack together with
> the bitfield) above as *within* or is already modifying of those a problem?

I seriously think you should just do the obvious thing, and LOOK AT
THE BASE TYPE. That solves everything.

If the definition of the data is (assume "long" to be 64 bits)

   long mybits:32;
   int nonbitfield;

and you end up doing a 64-bit access when you read/write the "mybits"
bitfield, and it overlaps "nonbitfield", then I would also just say
"hey, whatever, the user asked for it". He really did. The
"allocation" for the bitfield comes from the base type, and so you
basically have a "long" to play with.

Let's take an even more extreme example:

   int bits1:8;
   char c;
   int bits2:16;

where the "char c" is actually *embedded* in the "bitfield
allocation". But again, it's completely and totally unambiguous what
accesses you would at least start out with: sure, you *could* do a
8-bit access (and you probably should, if the ISA allows it), but
dangit, the base type is "int", so I don't think it's wrong to do a
32-bit load, mask, and a 32-bit write.

But if the user wrote

char bits1:8;
char c;
short bits2:16;

then I really think it would be a *bug* if modifying "bits1" would
possible write to 'c'. Of course, this is again a possible ISA
problem: if the architecture doesn't have byte writes, you can't do
anything about it, but that is obviously true regardless of bitfields,
so that's really a totally irrelevant argument for this case. That
non-atomicity doesn't come from bitfields, it comes from the fact that
the BASE TYPE is again non-atomic.

Notice how it always ends up being about the base type. Simple,
straightforward, and unambiguous.

And similarly, to go to the kernel example, if you have

int mybits:2;
int nonbitfield;

then I think it's an obvious *BUG* to access the nonbitfield things
when you access "mybits". (It clearly is not at all "within" the
bitfield allocation, and "int" isn't non-atomic on any sane
architecture, so you don't even have the "byte accesses aren't atomic"
issue)

So I really do think that the sane approach is to look at the base
type of the bitfield, like I suggested from the very beginning.

If the base type is an "int", you do an int access. It really solves
so many problems, and it is even entirely sane semantics that you can
*explain* to people. It makes sense, it avoids the clear bugs gcc has
now, and it also solves the performance bug I reported a long time
ago.

Seriously - is there any real argument *against* just using the base
type as a hint for access size?

I realize that it may not be simple, and may not fit the current mess
of gcc bitfields, but I really do think it's the RightThing(tm) to do.
It has sensible semantics, and avoids problems.

In contrast, the *current* gcc semantics are clearly not sensible, and
have both performance and correctness issues.

 Linus


Re: Memory corruption due to word sharing

2012-02-01 Thread Torvald Riegel
On Wed, 2012-02-01 at 09:29 -0800, Linus Torvalds wrote:
> On Wed, Feb 1, 2012 at 9:08 AM, Torvald Riegel  wrote:
> >
> > What do the kernel folks think about the C11 memory model?  If you can
> > spot any issues in there, the GCC community would certainly like to
> > know.
> 
> I don't think this is about memory models except very tangentially.
> 
> Gcc currently accesses fields in a way that affects *independent*
> fields, without checking their memory models at all.
> 
> Even original C already has one memory model: "volatile" data is seen
> outside the virtual C machine. And Jiri reports that even that
> *original* memory model is being violated. We're taling about the one
> from about 40 years ago.

For volatile, I agree.

However, the original btrfs example was *without* a volatile, and that's
why I raised the memory model point.  This triggered an error in a
concurrent execution, so that's memory model land, at least in C
language standard.

The example was a granularity-of-access violation, I agree.
Nonetheless, C11 has rules for that; they have been written down, and it
would be good to know whether these rules are sufficient for you.

> We do end up doing
> much more aggressive threading, with models that C11 simply doesn't
> cover.

Any specific examples for that would be interesting.




Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 12:41 PM, Torvald Riegel  wrote:
>
> You do rely on the compiler to do common transformations I suppose:
> hoist loads out of loops, CSE, etc.  How do you expect the compiler to
> know whether they are allowed for a particular piece of code or not?

We have barriers.

Compiler barriers are cheap. They look like this:

include/linux/compiler-gcc.h:
   #define barrier() __asm__ __volatile__("": : :"memory")

and if we don't allow hoisting, we'll often use something like that.

It's not the only thing we do. We have cases where it's not that you
can't hoist things outside of loops, it's that you have to read things
exactly *once*, and then use that particular value (ie the compiler
can't be allowed to reload the value because it may change). So any
*particular* value we read is valid, but we can't allow the compiler
to do anything but a single access.

So we have this beauty:

include/linux/compiler.h:
#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

which does that for us (quite often, it's not used as-is: a more
common case is using the "rcu_access_pointer()" inline function that
not only does that ACCESS_ONCE() but also has the capability to enable
run-time dynamic verification that you are in a proper RCU read locked
section etc)
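
A sketch of the typical use, assuming kernel context ('stop' and the
function name are hypothetical; cpu_relax() is the usual spin-wait
hint):

/* Without ACCESS_ONCE() the compiler may legally keep 'stop' in a
   register and spin on the stale value forever. */
extern int stop;

void wait_for_stop(void)
{
    while (!ACCESS_ONCE(stop))
        cpu_relax();
}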

We also have tons of (very subtle) code that actually creates locks
and memory ordering etc. They usually just use the big "barrier()"
approach to telling the compiler not to combine or move memory
accesses around it.

And I realize that compiler people tend to think that loop hoisting
etc is absolutely critical for performance, and some big hammer like
"barrier()" makes a compiler person wince. You think it results in
horrible code generation problems.

It really doesn't. Loops are fairly unusual in the kernel to begin
with, and the compiler barriers are a total non-issue. We have much
more problems with the actual CPU barriers that can be *very*
expensive on some architectures, and we work a lot at avoiding those
and avoiding cacheline ping-pong issues etc.

>> > No vague assumptions with lots of hand-waving.
>>
>> So here's basically what the kernel needs:
>>
>>  - if we don't touch a field, the compiler doesn't touch it.
>
> "we don't touch a field": what does that mean precisely?  Never touch it
> in the same thread?  Same function?  Same basic block?  Between two
> sequence points?  When crossing synchronizing code?  For example, in the
> above code, can we touch it earlier/later?

Why are you trying to make it more complex than it is?

If the source code doesn't write to it, the assembly the compiler
generates doesn't write to it.

Don't try to make it anything more complicated. This has *nothing* to
do with threads or functions or anything else.

If you do massive inlining, and you don't see any barriers or
conditionals or other reasons not to write to it, just write to it.

Don't try to appear smart and make this into something it isn't.

Look at the damn five-line example of the bug. FIX THE BUG. Don't try
to make it anything bigger than a stupid compiler bug. Don't try to
make this into a problem it simply isn't.

 Linus


RE: Memory corruption due to word sharing

2012-02-01 Thread Boehm, Hans
> From: Linus Torvalds
> >
> > We need a proper memory model.
> 
> Not really.
> 
> The fact is, the kernel will happily take the memory model of the
> underlying hardware. Trying to impose some compiler description of the
> memory model is actually horribly bad, because it automatically also
> involves compiler synchronization points - which will try to be
> portable and easy to understand, and quite frankly, that will
> automatically result in what is technically known as a "shitload of
> crap".
The C11 memory model potentially adds overhead in only two cases:

1. When current code involves touching a field that wouldn't otherwise be 
touched.  There are odd cases in which this measurably slows down code, but I 
think all agree that we need it.  In addition to bitfields, it can affect 
speculatively promoting a value to a register in a loop, which at least older 
versions of gcc also do.

2. When you use atomic operations for racing accesses.  And in that case, if 
you really want it, you get a lot of control over memory ordering.  Atomic 
operations that specify "memory_order_relaxed" should only have three kinds of 
impact:
- They ensure the access is indivisible.
- They tell the compiler that the location may be concurrently modified 
and it should refrain from optimizations that will break if it is.
- They enforce cache coherence, i.e. single-variable ordering.  This is 
implicit on most architectures, but requires ld.acq on Itanium.  IIRC, Paul 
McKenney argued convincingly that this was essential for preserving programmer 
sanity.
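
In C11 terms, such an access would look like this (a minimal sketch
using the C11 <stdatomic.h> interface, which gcc did not yet ship at
the time of this thread; 'hits' is a hypothetical counter):

#include <stdatomic.h>

atomic_int hits;

void count_hit(void)
{
    /* Indivisible; the compiler may not tear, duplicate, or invent
       accesses to 'hits'; single-variable (cache-coherent) ordering is
       preserved; no ordering against other variables is implied. */
    atomic_fetch_add_explicit(&hits, 1, memory_order_relaxed);
}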

> 
> Now, a strict memory model is fine for portability, and to make it
> simple for programmers. I actually largely approve of the Java memory
> model approach, even if it has been horribly problematic and has some
> serious problems. But it's not something that is acceptable for the
> kernel - we absolutely do *not* want the compiler to introduce some
> memory model, because we are very much working together with whatever
> the hardware memory model is, and we do our own serialization
> primitives.
I think you are somewhat misinterpreting the C11 memory model.  Aside from the 
minor cache coherence issues on one or two architectures, you can still do 
that, if you like.  But I suspect you can also do better in places.

> 
> > No vague assumptions with lots of hand-waving.
> 
> So here's basically what the kernel needs:
> 
>  - if we don't touch a field, the compiler doesn't touch it.
> 
>This is the rule that gcc now violates with bitfields.
And in other more subtle cases.  At least until very recently.

> 
>This is a gcc bug. End of story. The "volatile" example proves it -
> anybody who argues otherwise is simply wrong, and is just trying to
> make excuses.
Agreed.

> 
>  - if we need to touch something only once, or care about something
> that is touched only conditionally, and we actually need to make sure
> that it is touched *only* under that condition, we already go to quite
> extreme lengths to tell the compiler so.
> 
>We usually use an access with a "volatile" cast on it or tricks
> like that. Or we do the whole "magic asm that says it modified memory
> to let the compiler know not to try anything clever"
> 
>  - The kernel IS NOT GOING TO MARK DATA STRUCTURES.
> 
> Marking data structures is seriously stupid and wrong. It's not the
> *data* that has access rules, it's the *code* that has rules of
> access.
> 
> The traditional C "volatile" is misdesigned and wrong. We don't
> generally mark data volatile, we really mark *code* volatile - which
> is why our "volatiles" are in the casts, not on the data structures.
> 
> Stuff that is "volatile" in one context is not volatile in another. If
> you hold a RCU write lock, it may well be entirely stable, and marking
> it volatile is *wrong*, and generating code as if it was volatile is
> pure and utter shit.
> 
> On the other hand, if you are touching *the*very*same* field while you
> are only read-locked for RCU, it may well be one of those "this has to
> be read by accessing it exactly once".
> 
> And we do all this correctly in the kernel.  Modulo bugs, of course,
> but the fundamental rule really is: "atomicity or volatility is about
> CODE, not DATA".
The preferred C11 model is clearly to mark data that potentially participates 
in unprotected concurrent accesses, and then to annotate accesses that require 
less care, e.g. because we know they can't occur concurrently with other 
accesses.  Maybe this is the wrong approach for the kernel, but to me that 
seems like a safer approach.  We did consider the alternative model, and tried 
to ensure that casts to atomic could also work on as many platforms as 
possible, though we can't guarantee that.  (There is currently no specification 
for atomic accesses that says "not concurrently accessed".  We've thought about 
it.  I would argue for adding it if there were important cases in which 
memory_order_relaxed is demonstrably not sufficient.)

Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 12:53 PM, Torvald Riegel  wrote:
>
> For volatile, I agree.
>
> However, the original btrfs example was *without* a volatile, and that's
> why I raised the memory model point.  This triggered an error in a
> concurrent execution, so that's memory model land, at least in C
> language standard.

Sure. The thing is, if you fix the volatile problem, you'll almost
certainly fix our problem too.

The whole "compilers actually do reasonable things" approach really
does work in reality. It in fact works a lot better than reading some
spec and trying to figure out if something is "valid" or not, and
having fifteen different compiler writers and users disagree about
what the meaning of the word "is" is in some part of it.

I'm not kidding. With specs, there really *are* people who spend years
discussing what the meaning of the word "access" is or similar.
Combine that with a big spec that is 500+ pages in size and then try
to apply that all to a project that is 15 million lines of code and
sometimes *knowingly* has to do things that it simply knows are
outside the spec, and the discussion about these kinds of details is
just mental masturbation.


>> We do end up doing
>> much more aggressive threading, with models that C11 simply doesn't
>> cover.
>
> Any specific examples for that would be interesting.

Oh, one of my favorite (NOT!) pieces of code in the kernel is the
implementation of the

   smp_read_barrier_depends()

macro, which on every single architecture except for one (alpha) is a no-op.

We have basically 30 or so empty definitions for it, and I think we
have something like five uses of it. One of them, I think, is
performance critical, and the reason for that macro existing.

What does it do? The semantics is that it's a read barrier between two
different reads that we want to happen in order wrt two writes on the
writing side (the writing side also has to have a "smp_wmb()" to order
those writes). But the reason it isn't a simple read barrier is that
the reads are actually causally *dependent*, ie we have code like

   first_read = read_pointer;
   smp_read_barrier_depends();
   second_read = *first_read;

and it turns out that on pretty much all architectures (except for
alpha), the *data*dependency* will already guarantee that the CPU
reads the thing in order. And because a read barrier can actually be
quite expensive, we don't want to have a read barrier for this case.

But alpha? Its memory consistency is so broken that even the data
dependency doesn't actually guarantee cache access order. It's
strange, yes. No, it's not that alpha does some magic value prediction
and can do the second read without having even done the first read
first to get the address. What's actually going on is that the cache
itself is unordered, and without the read barrier, you may get a stale
version from the cache even if the writes were forced (by the write
barrier in the writer) to happen in the right order.
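
For completeness, the writer side that pairs with that reader, sketched
with hypothetical names:

   /* Initialize the object first, then publish the pointer; the
      smp_wmb() orders the two stores as seen by other CPUs. */
   newobj->data = 42;
   smp_wmb();
   read_pointer = newobj;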

You really want to try to describe issues like this in your memory
consistency model? No you don't. Nobody will ever really care, except
for crazy kernel people. And quite frankly, not even kernel people
care: we have a fairly big kernel developer community, and the people
who actually talk about memory ordering issues can be counted on one
hand. There's the "RCU guy" who writes the RCU helper functions, and
hides the proper serializing code into those helpers, so that normal
mortal kernel people don't have to care, and don't even have to *know*
how ignorant they are about the things.

And that's also why the compiler shouldn't have to care. It's a really
small esoteric detail, and it can be hidden in a header file and a set
of library routines. Teaching the compiler about crazy memory ordering
would just not be worth it. 99.99% of all programmers will never need
to understand any of it, they'll use the locking primitives and follow
the rules, and the code that makes it all work is basically invisible
to them.

Linus


Re: Memory corruption due to word sharing

2012-02-01 Thread Torvald Riegel
On Wed, 2012-02-01 at 12:59 -0800, Linus Torvalds wrote:
> On Wed, Feb 1, 2012 at 12:41 PM, Torvald Riegel  wrote:
> >
> > You do rely on the compiler to do common transformations I suppose:
> > hoist loads out of loops, CSE, etc.  How do you expect the compiler to
> > know whether they are allowed for a particular piece of code or not?
> 
> We have barriers.
> 
> Compiler barriers are cheap. They look like this:
[snip]

I know, and I've programmed my chunk of concurrent code with those and
architecture-specific code too (or libatomic-ops for a bit more
ready-to-use portability).

> It's not the only thing we do. We have cases where it's not that you
> can't hoist things outside of loops, it's that you have to read things
> exactly *once*, and then use that particular value (ie the compiler
> can't be allowed to reload the value because it may change). So any
> *particular* value we read is valid, but we can't allow the compiler
> to do anything but a single access.

That's what an access to an atomics-typed var would give you as well.

> And I realize that compiler people tend to think that loop hoisting
> etc is absolutely critical for performance, and some big hammer like
> "barrier()" makes a compiler person wince. You think it results in
> horrible code generation problems.
> 
> It really doesn't. Loops are fairly unusual in the kernel to begin
> with, and the compiler barriers are a total non-issue.

This is interesting, because GCC is currently not optimizing across
atomics, but we're considering working on this later.
Do you have real data about this, or is this based on analysis of
samples of compiled code?

> We have much
> more problems with the actual CPU barriers that can be *very*
> expensive on some architectures, and we work a lot at avoiding those
> and avoiding cacheline ping-pong issues etc.

Well, obviously...

> 
> >> > No vague assumptions with lots of hand-waving.
> >>
> >> So here's basically what the kernel needs:
> >>
> >>  - if we don't touch a field, the compiler doesn't touch it.
> >
> > "we don't touch a field": what does that mean precisely?  Never touch it
> > in the same thread?  Same function?  Same basic block?  Between two
> > sequence points?  When crossing synchronizing code?  For example, in the
> > above code, can we touch it earlier/later?
> 
> Why are you trying to make it more complex than it is?
> 
> If the source code doesn't write to it, the assembly the compiler
> generates doesn't write to it.

If it were that easy, then answering my questions would have been easy
too, right?  Which piece of source code?  The whole kernel?

> 
> Don't try to make it anything more complicated. This has *nothing* to
> do with threads or functions or anything else.
> 
> If you do massive inlining, and you don't see any barriers or
> conditionals or other reasons not to write to it, just write to it.

And a precise definition of these reasons is what we need to agree on.

> Don't try to appear smart and make this into something it isn't.

Even if I had done it, it wouldn't be as rude as trying to imply
that other people aren't smart.



RE: Memory corruption due to word sharing

2012-02-01 Thread Boehm, Hans
> From: Linus Torvalds
> Don't try to make it anything more complicated. This has *nothing* to
> do with threads or functions or anything else.
> 
> If you do massive inlining, and you don't see any barriers or
> conditionals or other reasons not to write to it, just write to it.
> 
> Don't try to appear smart and make this into something it isn't.
> 
> Look at the damn five-line example of the bug. FIX THE BUG. Don't try
> to make it anything bigger than a stupid compiler bug. Don't try to
> make this into a problem it simply isn't.
> 
My impression is that all of us not on the hook to fix this are in violent 
agreement on this example.

Here are some more interesting ones that illustrate the issues (all 
declarations are non-local, unless stated otherwise):

struct { char a; int b:9; int c:7; char d; } x;

Is x.b = 1 allowed to overwrite x.a?  C11 says no, essentially requiring two 
byte stores.  Gcc currently does so.  I'm not sure I understand Linus' position 
here.


int count;
/* p and q are local */

for (q = p; q != 0; q = q->next) if (q->data > 0) ++count;

Can count be promoted to a register, and thus written even if there are no 
positive elements?  C11 says no; gcc at least used to do this.


for (q = p; q != 0; q = q->next) { ++count; if (rare_cond) f(); }

Same question, with count saved and restored around the call to f() (which might 
include a fence).  C11 says no.  I think Linus is also arguing for no.


for (i = 0; i < 1000; ++i) { if (i % 2) a[i] = i; }

Can I vectorize the loop, writing back the original even values, and thus 
writing all entries of the array?  C11 and Linus both say no.
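
The rejected rewrite, sketched scalar-style to show the invented
stores ('a' and 'i' as in the example above):

for (i = 0; i < 1000; i += 2) {
    a[i] = a[i];        /* store of the unchanged even entry -- the
                           source never writes it, so another thread's
                           update to a[i] could be overwritten */
    a[i + 1] = i + 1;   /* the store the source actually performs */
}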


My impression is that we are generally in agreement.

Hans


Re: Memory corruption due to word sharing

2012-02-01 Thread Torvald Riegel
On Wed, 2012-02-01 at 13:20 -0800, Linus Torvalds wrote:
> On Wed, Feb 1, 2012 at 12:53 PM, Torvald Riegel  wrote:
> >
> > For volatile, I agree.
> >
> > However, the original btrfs example was *without* a volatile, and that's
> > why I raised the memory model point.  This triggered an error in a
> > concurrent execution, so that's memory model land, at least in C
> > language standard.
> 
> Sure. The thing is, if you fix the volatile problem, you'll almost
> certainly fix our problem too.
> 
> The whole "compilers actually do reasonable things" approach really
> does work in reality. It in fact works a lot better than reading some
> spec and trying to figure out if something is "valid" or not, and
> having fifteen different compiler writers and users disagree about
> what the meaning of the word "is" is in some part of it.

That's why researchers have formalized the model in order to verify
whether it's sane.  And they found bugs / ambiguities in it:
http://www.cl.cam.ac.uk/~pes20/cpp/

The standards text might still be a typical standards text for various
reasons. But it enables having a formal version of it too, and
discussions about this formal version.  Personally, I find this more
formal model easier to work with, exactly because it is more precise
than prose.

> 
> >> We do end up doing
> >> much more aggressive threading, with models that C11 simply doesn't
> >> cover.
> >
> > Any specific examples for that would be interesting.
> 
> Oh, one of my favorite (NOT!) pieces of code in the kernel is the
> implementation of the
> 
>smp_read_barrier_depends()
> 
> macro, which on every single architecture except for one (alpha) is a no-op.
> 
> We have basically 30 or so empty definitions for it, and I think we
> have something like five uses of it. One of them, I think, is
> performance critical, and the reason for that macro existing.
> 
> What does it do? The semantics is that it's a read barrier between two
> different reads that we want to happen in order wrt two writes on the
> writing side (the writing side also has to have a "smp_wmb()" to order
> those writes). But the reason it isn't a simple read barrier is that
> the reads are actually causally *dependent*, ie we have code like
> 
>first_read = read_pointer;
>smp_read_barrier_depends();
>second_read = *first_read;
> 
> and it turns out that on pretty much all architectures (except for
> alpha), the *data*dependency* will already guarantee that the CPU
> reads the thing in order. And because a read barrier can actually be
> quite expensive, we don't want to have a read barrier for this case.

I don't have time to look at this in detail right now, but it looks
roughly close to C++11's memory_order_consume to me, which is somewhat
like an acquire, but just for subsequent data-dependent loads.  Added
for performance reasons on some architecture AFAIR.

> You really want to try to describe issues like this in your memory
> consistency model? No you don't. Nobody will ever really care, except
> for crazy kernel people. And quite frankly, not even kernel people
> care: we have a fairly big kernel developer community, and the people
> who actually talk about memory ordering issues can be counted on one
> hand. There's the "RCU guy" who writes the RCU helper functions, and
> hides the proper serializing code into those helpers, so that normal
> mortal kernel people don't have to care, and don't even have to *know*
> how ignorant they are about the things.

I'm not a kernel person, and I do care about it.  Userspace is
synchronizing too, and not just with pthread mutexes.

> And that's also why the compiler shouldn't have to care. It's a really
> small esoteric detail, and it can be hidden in a header file and a set
> of library routines. Teaching the compiler about crazy memory ordering
> would just not be worth it. 99.99% of all programmers will never need
> to understand any of it, they'll use the locking primitives and follow
> the rules, and the code that makes it all work is basically invisible
> to them.

I disagree (though it would be nice if it were that esoteric), but
that's off-topic...



Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 1:24 PM, Torvald Riegel  wrote:
>> It's not the only thing we do. We have cases where it's not that you
>> can't hoist things outside of loops, it's that you have to read things
>> exactly *once*, and then use that particular value (ie the compiler
>> can't be allowed to reload the value because it may change). So any
>> *particular* value we read is valid, but we can't allow the compiler
>> to do anything but a single access.
>
> That's what an access to an atomics-typed var would give you as well.

Right. The concept of "atomic" access is required.

The problem I have with treating this as a C11 issue is that people
aren't using C11 compilers, and won't for a long while.

So quite frankly, it won't be reasonable for the kernel people to say
"use a C11 version" for years to come.

And the thing is, "atomic" doesn't actually seem to buy us anything in
the kernel afaik. We're perfectly happy to use "volatile". We don't
want the data structure annotated, we want the code annotated, so the
semantic differences between 'volatile' and 'atomic' seem to be pretty
irrelevant.

Sure, there is probably some subtle difference, but on the whole, it
all boils down to "we can't depend on new rules for several years, and
even when we can, it doesn't look like they'd buy us a lot, because we
have to play so many games around them *anyway*".

And don't get me wrong. I don't think that means that C11 is *bad*.
It's just that the kernel is very different from most other projects.
We have to have those crazy architecture-specific header files and
random synchronization macros etc anyway.

C11 is not - I think - even meant to be geared towards the Linux
kernel kind of crazy use. We really do some odd things, adding
compiler features for them is mostly a mistake. asm() takes care of a
lot of the oddities for us, it's not like all of them are about memory
accesses or concurrency either.

I do think that the threading issues in C11 are going to help us
kernel people, because the more people think about issues with
concurrency, they really *will* be hitting some of the issues we've
been having. So it clearly forces clarification of just what the word
"access" implies, for example, which is certainly not going to be bad
for the kernel.

>> It really doesn't. Loops are fairly unusual in the kernel to begin
>> with, and the compiler barriers are a total non-issue.
>
> This is interesting, because GCC is currently not optimizing across
> atomics, but we're considering working on this later.
> Do you have real data about this, or is this based on analysis of
> samples of compiled code?

So the kernel literally doesn't tend to *have* loops.

Sure, there are exceptions: there are the obvious loops implied in
"memcpy()" and friends, but almost all kernel activity tends to be
about individual events. A really hot loop under normal loads would be
the loop that calculates a hash value over a pathname component. Or
the loop that loops over those components.

Loop count? Think about it. Most pathnames have one or two components.
And while "8.3" is clearly too short for a filename, it's still true
that a lot of filenames (especially directories, which is what you end
up traversing many times) are in a kind of awkward 5-8 character
range.

So the kernel loads tend to have loop counts in the single digits -
*maybe* double digits. We use hash-tables a lot, so looping to find
something is usually just an entry or two, and the cost is almost
never the loop, it's the cache miss from the pointer chasing.

Most "looping" behavior in the kernel is actually user space calling
the kernel in a loop. So the kernel may see "loops" in that sense, but
it's not loopy kernel code itself.

Even really hot kernel-only code is often not about loops. You might
get a million network packets a second, and with interrupt mitigation
you would not necessarily treat them each as individual events with
its own individual interrupt (which does happen too!), but you'd still
not see it as a loop over a million packets. You'd see them come in as
a few packets at a time, and looping until you've handled them.

And the loops we have tend to not be very amenable to nice compiler
optimizations either.

The kernel almost never works with arrays, for example. The string in
a pathname component is an array, yes, but *most* kernel data
structures loop over our doubly linked lists or over a hash-chain.

And it tends to be really hard to do much loop optimization over list
traversal like that.

>> If the source code doesn't write to it, the assembly the compiler
>> generates doesn't write to it.
>
> If it would be so easy, then answering my questions would have been easy
> too, right?  Which piece of source code?  The whole kernel?

So make the rule be: "in one single function, with all you can see
there", and I'll be happy. Do as much inlining as you want (although
the kernel does have "noinline" too).

With all the normal sequence point and memory invalidation rules.

RE: Memory corruption due to word sharing

2012-02-01 Thread Boehm, Hans
> From: Torvald Riegel
> > Oh, one of my favorite (NOT!) pieces of code in the kernel is the
> > implementation of the
> >
> >smp_read_barrier_depends()
> >
> > macro, which on every single architecture except for one (alpha) is a
> no-op.
> >
> > We have basically 30 or so empty definitions for it, and I think we
> > have something like five uses of it. One of them, I think, is
> > performance crticial, and the reason for that macro existing.
> >
> > What does it do? The semantics is that it's a read barrier between
> two
> > different reads that we want to happen in order wrt two writes on the
> > writing side (the writing side also has to have a "smp_wmb()" to
> order
> > those writes). But the reason it isn't a simple read barrier is that
> > the reads are actually causally *dependent*, ie we have code like
> >
> >first_read = read_pointer;
> >smp_read_barrier_depends();
> >second_read = *first_read;
> >
> > and it turns out that on pretty much all architectures (except for
> > alpha), the *data*dependency* will already guarantee that the CPU
> > reads the thing in order. And because a read barrier can actually be
> > quite expensive, we don't want to have a read barrier for this case.
> 
> I don't have time to look at this in detail right now, but it looks
> roughly close to C++11's memory_order_consume to me, which is somewhat
> like an acquire, but just for subsequent data-dependent loads.  Added
> for performance reasons on some architecture AFAIR.
> 
It's intended to address the same problem, though somewhat differently.  (I 
suspect there was overlap in the people involved?) One reason that C11 took a 
slightly different path is that compilers can, and sometimes do, remove 
dependencies, making smp_read_barrier_depends brittle unless it also imposes 
compiler constraints.

Hans



Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 1:25 PM, Boehm, Hans  wrote:
>
> Here are some more interesting ones that illustrate the issues (all 
> declarations are non-local, unless stated otherwise):
>
> struct { char a; int b:9; int c:7; char d; } x;
>
> Is x.b = 1 allowed to overwrite x.a?  C11 says no, essentially requiring two 
> byte stores.  Gcc currently does so.  I'm not sure I understand Linus' 
> position here.

So I like the fact that the C11 stance seems very strict. But honestly
I also think it sounds like C11 is actually more strict than I would
necessarily be.

I really do think that bitfields imply "int", both historically and
technically. So I would not see the problem with treating the
bitfields as part of an 'int' and thus overwriting a (and d) when
writing to b. That's how bitfields work! They are fields of an int.

It would be good if it perhaps caused a *warning*, and gave a good way
to avoid it. For example, while I think using any other base type than
'int' is technically an extension of the C bitfield rules (but
whatever, I don't have any specs in front of me), I think a warning
together with alowing the user to rewrite it as

  struct { char a; char d; short b:9; short c:7; } x;

would make it clear that now a write to 'b' cannot validly overwrite
'a' or 'd'.

But forcing the compiler to do two (and sometimes three!) byte
accesses sounds excessive.

The reason I think the

   int flag:1;
   int othervariable;

overwriting of "othervariable" is so *obviously* a bug is exactly that
bitfields are historically about 'int', and no 'long' was there
anywhere, so using a 64-bit access is clearly not sane in any way,
shape or form.

I dunno.

 Linus


Re: Memory corruption due to word sharing

2012-02-01 Thread Paul E. McKenney
On Wed, Feb 01, 2012 at 12:59:24PM -0800, Linus Torvalds wrote:
> On Wed, Feb 1, 2012 at 12:41 PM, Torvald Riegel  wrote:
> >
> > You do rely on the compiler to do common transformations I suppose:
> > hoist loads out of loops, CSE, etc.  How do you expect the compiler to
> > know whether they are allowed for a particular piece of code or not?
> 
> We have barriers.
> 
> Compiler barriers are cheap. They look like this:
> 
> include/linux/compiler-gcc.h:
>#define barrier() __asm__ __volatile__("": : :"memory")
> 
> and if we don't allow hoisting, we'll often use something like that.
> 
> It's not the only thing we do. We have cases where it's not that you
> can't hoist things outside of loops, it's that you have to read things
> exactly *once*, and then use that particular value (ie the compiler
> can't be allowed to reload the value because it may change). So any
> *particular* value we read is valid, but we can't allow the compiler
> to do anything but a single access.
> 
> So we have this beauty:
> 
> include/linux/compiler.h:
> #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

My (perhaps forlorn and naive) hope is that C++11 memory_order_relaxed
will eventually allow ACCESS_ONCE() to be upgraded so that (for example)
access-once increments can generate a single increment-memory instruction
on x86.

> which does that for us (quite often, it's not used as-is: a more
> common case is using the "rcu_access_pointer()" inline function that
> not only does that ACCESS_ONCE() but also has the capability to enable
> run-time dynamic verification that you are in a proper RCU read locked
> section etc)

And I also hope that rcu_access_pointer(), rcu_dereference(), and
friends can eventually be written in terms of C++11 memory_order_consume
so that the some of the less-sane optimizations can actually be applied
without breaking the kernel.

Of course, this cannot happen immediately.  First, gcc must fully support
the C++11 features of interest, the inevitable bugs must be found and
fixed, and the optimizer will no doubt need to be reworked a bit before
the kernel can make use of these features.

New architectures might eventually define things like atomic_inc()
in terms of C++11 atomics, but let's start with the straightforward stuff
as and if it makes sense.

Thanx, Paul

> We also have tons of (very subtle) code that actually creates locks
> and memory ordering etc. They usually just use the big "barrier()"
> approach to telling the compiler not to combine or move memory
> accesses around it.
> 
> And I realize that compiler people tend to think that loop hoisting
> etc is absolutely critical for performance, and some big hammer like
> "barrier()" makes a compiler person wince. You think it results in
> horrible code generation problems.
> 
> It really doesn't. Loops are fairly unusual in the kernel to begin
> with, and the compiler barriers are a total non-issue. We have much
> more problems with the actual CPU barriers that can be *very*
> expensive on some architectures, and we work a lot at avoiding those
> and avoiding cacheline ping-pong issues etc.
> 
> >> > No vague assumptions with lots of hand-waving.
> >>
> >> So here's basically what the kernel needs:
> >>
> >>  - if we don't touch a field, the compiler doesn't touch it.
> >
> > "we don't touch a field": what does that mean precisely?  Never touch it
> > in the same thread?  Same function?  Same basic block?  Between two
> > sequence points?  When crossing synchronizing code?  For example, in the
> > above code, can we touch it earlier/later?
> 
> Why are you trying to make it more complex than it is?
> 
> If the source code doesn't write to it, the assembly the compiler
> generates doesn't write to it.
> 
> Don't try to make it anything more complicated. This has *nothing* to
> do with threads or functions or anything else.
> 
> If you do massive inlining, and you don't see any barriers or
> conditionals or other reasons not to write to it, just write to it.
> 
> Don't try to appear smart and make this into something it isn't.
> 
> Look at the damn five-line example of the bug. FIX THE BUG. Don't try
> to make it anything bigger than a stupid compiler bug. Don't try to
> make this into a problem it simply isn't.
> 
>  Linus
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 



Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 2:45 PM, Paul E. McKenney
 wrote:
>
> My (perhaps forlorn and naive) hope is that C++11 memory_order_relaxed
> will eventually allow ACCESS_ONCE() to be upgraded so that (for example)
> access-once increments can generate a single increment-memory instruction
> on x86.

I don't think that is a semantic issue.

gcc could do it *today* with volatile accesses. It doesn't, because
volatiles are scary and basically disables a lot of optimizations. Why
would memory ordering be substantially different just because it has a
different name?

> New architectures might eventually define things like atomic_inc()
> in terms of C++11 atomics, but let's start with the straightforward stuff
> as and if it makes sense.

SMP-atomic or percpu atomic? Or both?

We need both variants in the kernel. If the compiler generates one of
them for us, that doesn't really help much.

  Linus


Re: Access to source code from an analyser

2012-02-01 Thread Diego Novillo

On 12-01-19 05:06, Alberto Lozano Alelu wrote:


I need to have source text


For that, you will need to read the file or get the line from libcpp's 
buffers.  Accessing libcpp's buffer will not be trivial, since they are 
in libcpp's private structures.  You're better off opening the file and 
reading from that line/col.



Diego.