Re: -fno-inline-functions vs glibc's initfini
On Wed, Feb 1, 2012 at 3:30 AM, Alexandre Oliva wrote:
> On Jan 31, 2012, Richard Guenther wrote:
>
>> What's probably confusing you is the "Don't pay attention to the
>> @code{inline} keyword" sentence.
>
> What really set me down the wrong path were the comments in
> gcc/common.opt, which gave me the idea it had something to do with C99
> inline.
>
> ; Nonzero means that functions declared `inline' will be treated
> ; as `static'.  Prevents generation of zillions of copies of unused
> ; static inline functions; instead, `inlines' are written out
> ; only when actually used.  Used in conjunction with -g.  Also
> ; does the right thing with #pragma interface.
> finline
> Common Report Var(flag_no_inline,0) Init(0)
> Pay attention to the "inline" keyword

Ick - WTF is that ... I'll fix it ;)

Richard.

>> I suppose we should clarify the documentation and I will prepare a patch.
>
> Thanks.  Would you please take care of adjusting the comments in
> common.opt accordingly?  TIA,
>
>> The implementation is exactly right
>
> Phew! :-)
>
> --
> Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
> You must be the change you wish to see in the world. -- Gandhi
> Be Free! -- http://FSFLA.org/   FSF Latin America board member
> Free Software Evangelist      Red Hat Brazil Compiler Engineer
RE: ICE building compiler
> Whilst investigating an ICE with the Blackfin compiler, I bumped into a
> bit of code which seems questionable:
>
> In reload1.c:reload() we call select_reload_regs() and then check if
> failure was set.  However, looking at find_reload_regs() (called via
> select_reload_regs()), the only time we set failure is immediately after
> a call to spill_failure(), which doesn't return.  Is spill_failure the
> correct function to be using here?  It seems like the intention of the
> code was to note the spill failure and then try to fix it by jumping to
> the "failed:" label.

Any feedback on this?

Thanks,
Stu
Memory corruption due to word sharing
Hello,

we've spotted the following mismatch between what kernel folks expect
from a compiler and what GCC really does, resulting in memory corruption
on some architectures. Consider the following structure:

    struct x {
        long a;
        unsigned int b1;
        unsigned int b2:1;
    };

We have two processes P1 and P2 where P1 updates field b1 and P2 updates
bitfield b2. The code GCC generates for b2 = 1, e.g. on ia64, is:

     0:  09 00 21 40 00 21  [MMI]  adds r32=8,r32
     6:  00 00 00 02 00 e0         nop.m 0x0
     c:  11 00 00 90               mov r15=1;;
    10:  0b 70 00 40 18 10  [MMI]  ld8 r14=[r32];;
    16:  00 00 00 02 00 c0         nop.m 0x0
    1c:  f1 70 c0 47               dep r14=r15,r14,32,1;;
    20:  11 00 38 40 98 11  [MIB]  st8 [r32]=r14
    26:  00 00 00 02 00 80         nop.i 0x0
    2c:  08 00 84 00               br.ret.sptk.many b0;;

Note that GCC used a 64-bit read-modify-write cycle to update b2. Thus if
P1 races with P2, the update of b1 can get lost. BTW: I've just checked on
x86_64 and there GCC uses an 8-bit bitop to modify the bitfield.

We actually spotted this race in practice in btrfs, on structure
fs/btrfs/ctree.h:struct btrfs_block_rsv, where spinlock content got
corrupted due to an update of the following bitfield, and there seem to
be other places in the kernel where this could happen.

I've raised the issue with our GCC guys and they said to me that: "C does
not provide such guarantee, nor can you reliably lock different
structure fields with different locks if they share naturally aligned
word-size memory regions. The C++11 memory model would guarantee this,
but that's not implemented nor do you build the kernel with a C++11
compiler."

So it seems what C/GCC promises does not quite match what the kernel
expects. I'm not really an expert in this area so I wanted to report it
here so that more knowledgeable people can decide how to solve the
issue...

                                Honza
--
Jan Kara
SUSE Labs, CR
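[For readers who want to see the shape of the race, a minimal sketch (not
from the original report) assuming a pthreads test harness, built with
gcc -pthread; on an affected target such as ia64, the store to b2 can
compile to a 64-bit read-modify-write that silently rewrites b1. The
in-loop check is itself racy and will miss many lost updates; it is only
illustrative, not a reliable detector:

    #include <pthread.h>
    #include <stdio.h>

    struct x {
        long a;
        unsigned int b1;
        unsigned int b2:1;
    } shared;

    static void *update_b1(void *arg)
    {
        /* P1: repeatedly store to b1 and check the store survived */
        for (unsigned int i = 0; i < 100000000u; i++) {
            shared.b1 = i;
            if (shared.b1 != i)
                printf("lost update: b1 == %u, expected %u\n",
                       shared.b1, i);
        }
        return NULL;
    }

    static void *update_b2(void *arg)
    {
        /* P2: this store may be compiled as a wide RMW covering b1 too */
        for (long i = 0; i < 100000000; i++)
            shared.b2 = i & 1;
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, update_b1, NULL);
        pthread_create(&t2, NULL, update_b2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }
]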
Re: Memory corruption due to word sharing
On 2012.02.01 at 16:19 +0100, Jan Kara wrote:
>   we've spotted the following mismatch between what kernel folks expect
> from a compiler and what GCC really does, resulting in memory corruption on
> some architectures. Consider the following structure:
>
>     struct x {
>         long a;
>         unsigned int b1;
>         unsigned int b2:1;
>     };
>
> We have two processes P1 and P2 where P1 updates field b1 and P2 updates
> bitfield b2. The code GCC generates for b2 = 1 e.g. on ia64 is:
>
>      0:  09 00 21 40 00 21  [MMI]  adds r32=8,r32
>      6:  00 00 00 02 00 e0         nop.m 0x0
>      c:  11 00 00 90               mov r15=1;;
>     10:  0b 70 00 40 18 10  [MMI]  ld8 r14=[r32];;
>     16:  00 00 00 02 00 c0         nop.m 0x0
>     1c:  f1 70 c0 47               dep r14=r15,r14,32,1;;
>     20:  11 00 38 40 98 11  [MIB]  st8 [r32]=r14
>     26:  00 00 00 02 00 80         nop.i 0x0
>     2c:  08 00 84 00               br.ret.sptk.many b0;;
>
> Note that gcc used 64-bit read-modify-write cycle to update b2. Thus if P1
> races with P2, update of b1 can get lost. BTW: I've just checked on x86_64
> and there GCC uses 8-bit bitop to modify the bitfield.
>
> We actually spotted this race in practice in btrfs on structure
> fs/btrfs/ctree.h:struct btrfs_block_rsv where spinlock content got
> corrupted due to update of following bitfield and there seem to be other
> places in kernel where this could happen.
>
> I've raised the issue with our GCC guys and they said to me that: "C does
> not provide such guarantee, nor can you reliably lock different
> structure fields with different locks if they share naturally aligned
> word-size memory regions. The C++11 memory model would guarantee this,
> but that's not implemented nor do you build the kernel with a C++11
> compiler."
>
> So it seems what C/GCC promises does not quite match with what kernel
> expects. I'm not really an expert in this area so I wanted to report it
> here so that more knowledgeable people can decide how to solve the issue...

FYI, the gcc bug can be found here:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52080

--
Markus
Re: Memory corruption due to word sharing
On Wed, 2012-02-01 at 16:19 +0100, Jan Kara wrote:
>   I've raised the issue with our GCC guys and they said to me that: "C does
> not provide such guarantee, nor can you reliably lock different
> structure fields with different locks if they share naturally aligned
> word-size memory regions. The C++11 memory model would guarantee this,
> but that's not implemented nor do you build the kernel with a C++11
> compiler."

That's interesting. So it seems like there are two solutions:

1) Use the same lock for a given bitfield
2) Split up the bitfield into different words
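[A sketch of what option 2 might look like; this is illustrative only,
and whether it actually helps depends on which word the compiler's
read-modify-write ends up covering. A plain padding word keeps b1 out of
the naturally aligned 64-bit region that holds the bitfield's storage
unit:

    struct x_split {
        long a;
        unsigned int b1;    /* updated by P1 */
        unsigned int pad;   /* keeps b1 out of the 64-bit word
                               containing b2's storage unit */
        unsigned int b2:1;  /* updated by P2 */
    };

Alternatively, b2 could simply be made a full unsigned int, avoiding
bitfield store semantics altogether at the cost of four bytes.]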
Re: Memory corruption due to word sharing
On Wed, Feb 1, 2012 at 7:19 AM, Jan Kara wrote:
>
>   we've spotted the following mismatch between what kernel folks expect
> from a compiler and what GCC really does, resulting in memory corruption on
> some architectures.

This is sad.

We've had something like this before due to architectural reasons
(alpha inability to do byte loads and stores leading us to not be able
to have items smaller than a byte sharing a word).

But at least then there was a *reason* for it, not "the compiler is
being difficult for no good reason".

Actually, the *sad* part is not that the compiler does something
unexpected - that's not new - but the usual "..and the gcc crowd
doesn't care, because they can point to paperwork saying it's not
defined". Even if that same paper is actually in the process of getting
updated exactly because it causes problems like ours.

That mentality is not new, of course.

>   So it seems what C/GCC promises does not quite match with what kernel
> expects.

The paper C standard can *never* promise what a kernel expects. There
are tons of unspecified things that a compiler could do, including
moving memory around behind our back as long as it moves it back,
because it's all "invisible" in the virtual C machine in the absence of
volatiles. The fact that the kernel has things like SMP coherency
requirements is simply not covered by the standard. There are tons of
other things not covered by the standard too that are just "that's what
we need".

So C/gcc has never "promised" anything in that sense, and we've always
had to make assumptions about what is reasonable code generation. Most
of the time, our assumptions are correct, simply because it would be
*stupid* for a C compiler to do anything but what we assume it does.

But sometimes compilers do stupid things. Using 8-byte accesses to a
4-byte entity is *stupid*, when it's not even faster, and when the base
type has been specified to be 4 bytes!

>  I'm not really an expert in this area so I wanted to report it
> here so that more knowledgeable people can decide how to solve the issue...

If the gcc people aren't willing to agree that this is actually a flaw
in the standard (one that is being addressed, no less) and try to fix
it, we just have to extend our assumptions to something like "a
compiler would be stupid to ever access anything bigger than the
aligned register-size area". It's still just an assumption, and
compiler people could be crazy, but we could just extend the current
alpha rules to cover not just "int", but "long" too.

Sure, the compiler people could use "load/store multiple" or something
like that, but it would be obviously crazy code, so if it happens past
a register size, at least you could argue that it's a performance issue
and maybe the gcc people would care.

HOWEVER. The problem with the alpha rules (which, btw, were huge, and
led to the CPU designers literally changing the CPU instruction set
because they admitted they made a huge mistake) was never so much the
occasional memory corruption, as the fact that the places where it
could happen were basically almost impossible to find.

So we probably have tons of random places in the kernel that violate
even the alpha rules - because the corruption is so rare, and the
architecture was so rare as to make the corruption even harder to find.

I assume this code generation idiocy only happens with bitfields? The
problem is, we really don't want to make all bitfields take up 64 bits
just because we might have a lock or something else next to them. But
we could probably re-order things (and then *occasionally* waste
memory) if we could just find the ones that are problematic.

It's finding the problematic ones that is the issue. Which is why the
compiler should just generate code that matches what we told it to do,
not try to be "clever" in ways that don't even help performance! The
compiler simply doesn't know enough about the requirements. Never has,
never will, and this is not about "standards".

                    Linus
Re: Memory corruption due to word sharing
On Wed, Feb 1, 2012 at 8:37 AM, Colin Walters wrote:
>
> 1) Use the same lock for a given bitfield

That's not the problem. All the *bitfield* fields are all accessed
under the same word already.

> 2) Split up the bitfield into different words

Again, it's not the bitfield that is the problem.

The problem is that the compiler - when it writes to the word that
contains the bitfield - will also corrupt the word *NEXT* to the
bitfield (or before - it probably depends on alignment).

So we have two separate 32-bit words - one of which just happens to
contain a bitfield. We write to these *separate* words using different
locking rules (btw, they don't even need to be protected by a lock: we
may have other rules that protect the individual word contents).

But the compiler turns the access to the bitfield (in a 32-bit aligned
word) into a 64-bit access that accesses the word *next* to it.

That word next to it might *be* the lock, for example.

So we could literally have this kind of situation:

    struct {
        atomic_t counter;
        unsigned int val:4, other:4, data:24;
    };

and if we write code like this:

    spin_lock(&somelock);
    s->data++;
    spin_unlock(&somelock);

and on another CPU we might do

    atomic_inc(&counter);

and the access to the bitfield will *corrupt* the atomic counter, even
though both of them are perfectly fine!

Quite frankly, if the bug is simply because gcc doesn't actually know
or care about the underlying size of the bitmask, it is possible that
we can find a case where gcc clearly is buggy even according to the
original C rules.

Honza - since you have access to the compiler in question, try
compiling this trivial test-program:

    struct example {
        volatile int a;
        int b:1;
    };

    ..
        s->b = 1;
    ..

and if that bitfield access actually does a 64-bit access that also
touches 's->a', then dammit, that's a clear violation of even the *old*
C standard, and the gcc people cannot just wave away their bugs by
saying "we've got standards, pttthththt".

And I suspect it really is a generic bug that can be shown even with
the above trivial example.

                    Linus
Re: Memory corruption due to word sharing
On Wed, 2012-02-01 at 16:19 +0100, Jan Kara wrote:
>   I've raised the issue with our GCC guys and they said to me that: "C does
> not provide such guarantee, nor can you reliably lock different
> structure fields with different locks if they share naturally aligned
> word-size memory regions. The C++11 memory model would guarantee this,
> but that's not implemented nor do you build the kernel with a C++11
> compiler."

Indeed, it's memory models that specify which kind of behavior is
allowed, and you need them for both the hardware and the programming
language. C++11 and C11 have memory models, in contrast to previous
versions of these standards. GCC 4.7 implements this memory model
(C++11's and C11's models are very similar), even though there might be
some rough edges in this implementation (bit-fields, for example...).

http://gcc.gnu.org/wiki/Atomic/GCCMM

>   So it seems what C/GCC promises does not quite match with what kernel
> expects. I'm not really an expert in this area so I wanted to report it
> here so that more knowledgeable people can decide how to solve the issue...

There needs to be agreement about the memory model. The only time I
spoke to a kernel person about memory models, I got the reply that the
kernel would use its own model.

What do the kernel folks think about the C11 memory model? If you can
spot any issues in there, the GCC community would certainly like to
know.

Torvald
Re: Memory corruption due to word sharing
On Wed, 1 Feb 2012, Linus Torvalds wrote:

> But the compiler turns the access to the bitfield (in a 32-bit aligned
> word) into a 64-bit access that accesses the word *next* to it.
>
> That word next to it might *be* the lock, for example.
>
> So we could literally have this kind of situation:
>
>     struct {
>         atomic_t counter;
>         unsigned int val:4, other:4, data:24;
>     };
>
> and if we write code like this:
>
>     spin_lock(&somelock);
>     s->data++;
>     spin_unlock(&somelock);
>
> and on another CPU we might do
>
>     atomic_inc(&counter);
>
> and the access to the bitfield will *corrupt* the atomic counter, even
> though both of them are perfectly fine!
>
> Quite frankly, if the bug is simply because gcc doesn't actually know
> or care about the underlying size of the bitmask, it is possible that
> we can find a case where gcc clearly is buggy even according to the
> original C rules.
>
> Honza - since you have access to the compiler in question, try
> compiling this trivial test-program:
>
>     struct example {
>         volatile int a;
>         int b:1;
>     };
>
>     ..
>         s->b = 1;
>     ..
>
> and if that bitfield access actually does a 64-bit access that also
> touches 's->a', then dammit, that's a clear violation of even the *old*
> C standard, and the gcc people cannot just wave away their bugs by
> saying "we've got standards, pttthththt".
>
> And I suspect it really is a generic bug that can be shown even with
> the above trivial example.

I have actually tried exactly this earlier today (because while looking
at this, I had an idea that putting volatile in place could be a
workaround, causing gcc to generate saner code), but it doesn't work
either:

# cat x.c
struct x {
    long a;
    volatile unsigned int lock;
    unsigned int full:1;
};

void
wrong(struct x *ptr)
{
    ptr->full = 1;
}

int main()
{
    wrong(0);
}

# gcc -O2 x.c
# gdb -q ./a.out
Reading symbols from /root/a.out...done.
(gdb) disassemble wrong
Dump of assembler code for function wrong:
   0x45c0 <+0>:   [MMI]  adds r32=8,r32
   0x45c1 <+1>:          nop.m 0x0
   0x45c2 <+2>:          mov r15=1;;
   0x45d0 <+16>:  [MMI]  ld8 r14=[r32];;
   0x45d1 <+17>:         nop.m 0x0
   0x45d2 <+18>:         dep r14=r15,r14,32,1;;
   0x45e0 <+32>:  [MIB]  st8 [r32]=r14
   0x45e1 <+33>:         nop.i 0x0
   0x45e2 <+34>:         br.ret.sptk.many b0;;

In my opinion, this is a clear bug in gcc (while the original problem,
without explicit volatile, is not a C spec violation per se, it's just
very inconvenient :) ).

--
Jiri Kosina
SUSE Labs
Re: -fno-inline-functions vs glibc's initfini
> Want me to prepare a s/-fno-inline-functions/-fno-inline/ patch?

My reading was that actually we would want both of those switches (the
first to avoid inlining as an implicit optimization, the second to
avoid inlining of functions declared inline). But whatever the exact
detail, yes, please send a patch that does the right thing.

> Maybe rather than using the generic version, we should have a Makefile
> rule that generates the machine-dependent .s files for developers'
> perusal in creating the machine-specific asm sources.

That certainly wouldn't hurt. I guess we could start with that to make
it easy to produce the assembly code, and then let each arch add one
piecemeal before taking away the C option. Once we get to every arch
having assembly, it probably makes sense to stop doing it as a single
file with magic boundary markers to be sliced, and just have each arch
provide crt[in].S files directly.

Thanks,
Roland
Re: Memory corruption due to word sharing
On Wed, Feb 1, 2012 at 9:08 AM, Torvald Riegel wrote:
>
> What do the kernel folks think about the C11 memory model? If you can
> spot any issues in there, the GCC community would certainly like to
> know.

I don't think this is about memory models except very tangentially.

Gcc currently accesses fields in a way that affects *independent*
fields, without checking their memory models at all.

Even original C already has one memory model: "volatile" data is seen
outside the virtual C machine. And Jiri reports that even that
*original* memory model is being violated. We're talking about the one
from about 40 years ago.

So no amount of memory model will ever matter, if gcc can't even get
the 40-year-old one right.

Also, the C11 memory model really isn't going to make any difference to
this. Again, because this happens even if we have explicit locking for
the different fields. So they aren't "atomic" or anything like that,
because we do our own (and sufficient) locking. But once the compiler
starts to randomly touch data that just happens to be adjacent to other
data, any memory model about those individual pieces is going to be
entirely immaterial anyway.

Finally: I do not believe for a second that we can actually use the C11
memory model in the kernel and say "that's sufficient for our needs".
We will continue to have to do things that are "outside the specs" for
the very simple reason that the kernel isn't just some random
application that happens to use pthreads. We do end up doing much more
aggressive threading, with models that C11 simply doesn't cover.

_Atomic simply isn't going to make any difference to us. Sure, we might
use it in a couple of places (where we have sometimes tried to use
volatile casts, for example), but it would be totally irrelevant in
this case exactly because *neither* field is really atomic - they are
both protected, it's just that they are *separately* protected, so
accessing one does *not* mean that you can access the other, even if
they are next to each other.

                    Linus
Re: Memory corruption due to word sharing
On Wed, Feb 1, 2012 at 9:11 AM, Jiri Kosina wrote:
> On Wed, 1 Feb 2012, Linus Torvalds wrote:
>>
>> And I suspect it really is a generic bug that can be shown even with
>> the above trivial example.
>
> I have actually tried exactly this earlier today (because while looking
> at this, I had an idea that putting volatile in place could be a
> workaround, causing gcc to generate saner code), but it doesn't work
> either:
>
> # cat x.c
> struct x {
>     long a;
>     volatile unsigned int lock;
>     unsigned int full:1;
> };
>
> void
> wrong(struct x *ptr)
> {
>     ptr->full = 1;
> }
>
> int main()
> {
>     wrong(0);
> }
>
> # gcc -O2 x.c
> # gdb -q ./a.out
> Reading symbols from /root/a.out...done.
> (gdb) disassemble wrong
> Dump of assembler code for function wrong:
>    0x45c0 <+0>:   [MMI]  adds r32=8,r32
>    0x45c1 <+1>:          nop.m 0x0
>    0x45c2 <+2>:          mov r15=1;;
>    0x45d0 <+16>:  [MMI]  ld8 r14=[r32];;
>    0x45d1 <+17>:         nop.m 0x0
>    0x45d2 <+18>:         dep r14=r15,r14,32,1;;
>    0x45e0 <+32>:  [MIB]  st8 [r32]=r14
>    0x45e1 <+33>:         nop.i 0x0
>    0x45e2 <+34>:         br.ret.sptk.many b0;;
>
> In my opinion, this is a clear bug in gcc (while the original problem,
> without explicit volatile, is not a C spec violation per se, it's just
> very inconvenient :) ).

Yup, gcc is clearly just buggy here. I do not believe there is any
question what-so-ever about the above test-case showing a bug.

And the thing is, if they fix this bug, they'll fix our problem too,
unless they are going to write explicit code to *try* to screw us over
while fixing that 'volatile' bug.

Because the right thing to do with bitfields is really to take the base
type into account. If the bitfield was in an "int", you use an "int"
access for it, not a 64-bit access. That's the simple fix for the
volatile problem, and it just happens to fix our issue too.

Trying to explicitly *look* for volatiles, and only doing the 32-bit
access when you see them, is actually extra code, and extra effort, and
doesn't even *help* anything. It's not like the 64-bit access is
somehow "better".

I can see some vindictive programmer doing that, while thinking "I'll
show these people who pointed out this bug in my code, mhwhahahahaa!
I'll fix their test-case while still leaving the real problem
unaddressed", but I don't think compiler people are quite *that* evil.
Yes, they are evil people who are trying to trip us up, but still..

                    Linus
Re: Memory corruption due to word sharing
Hi,

On Wed, 1 Feb 2012, Jiri Kosina wrote:

> # cat x.c
> struct x {
>     long a;
>     volatile unsigned int lock;
>     unsigned int full:1;
> };
>
> void
> wrong(struct x *ptr)
> {
>     ptr->full = 1;
> }
>
> In my opinion, this is a clear bug

Even that depends (sadly) on who you ask. Half-volatile objects (i.e.
structs where some members are volatile and others aren't) are terribly
underspecified. You can make the bitfield volatile, and for ia64 that
would at least result in ld8.acq/st8.rel pairs.

And Linus: don't be so hastily dismissive. People (even the compiler
ones) do agree that using an 8-byte access for a 4-byte entity is
problematic. Even if allowed by the standard it's a quality-of-
implementation problem.

One problem is that it's not a new problem: GCC has emitted similar
code since about forever, and still these cases turned up only now
(well, probably because ia64 is dead, but sparc64 should have similar
problems). The bitfield handling code is _terribly_ complex and fixing
it is quite involved. So don't expect any quick fixes.

The other problem is specification. While you think that the code you
wrote is totally obvious, it might not actually be so. For instance,
what about this struct:

{long l:32; int i1:16; short s; int i2:1; char c:7; short s2:8; short s3;}

What are the obviously correct accesses for various writes into this
struct?

One rule may be to never write to a bitfield with accesses larger than
their underlying declared type. Well, but only if the bitfield fits
into that type (int:96 is quite okay to have, and what accesses should
be allowed there?). That might be one obvious rule. But it's not how
traditional bitfield layout works. It works based on underlying
objects, and for that e.g. the three fields i2, c, s2 are all in the
same one, of int type.

The rule above would at least work for code that most people do write,
i.e. a sequence of same-typed bitfields mixed with normal members.

And then, there's the last problem: are you sure that if GCC would use
4-byte accesses for the case at hand, the hardware wouldn't screw you
again? What about using atomic instructions for one half of the cache
line (the spinlock) but non-atomic instructions on the other half (the
bitfield)? Are you sure the latter won't interfere with the former?

Ciao,
Michael.
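[For the curious, a small probe program (not part of Matz's mail) shows
what a given compiler actually does with his example; note that
offsetof() only works on the non-bitfield members, which is itself part
of the problem. Everything printed is implementation-defined:

    #include <stdio.h>
    #include <stddef.h>

    struct s {
        long l:32;
        int i1:16;
        short s;
        int i2:1;
        char c:7;
        short s2:8;
        short s3;
    };

    int main(void)
    {
        printf("sizeof(struct s) = %zu\n", sizeof(struct s));
        /* only the plain members can be located portably: */
        printf("offsetof(s)  = %zu\n", offsetof(struct s, s));
        printf("offsetof(s3) = %zu\n", offsetof(struct s, s3));
        return 0;
    }
]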
Re: Memory corruption due to word sharing
> I have actually tried exactly this earlier today (because while looking
> at this, I had an idea that putting volatile in place could be a
> workaround, causing gcc to generate saner code), but it doesn't work
> either:
>
> # cat x.c
> struct x {
>     long a;
>     volatile unsigned int lock;
>     unsigned int full:1;
> };
>
> void
> wrong(struct x *ptr)
> {
>     ptr->full = 1;
> }
>
> int main()
> {
>     wrong(0);
> }
>
> In my opinion, this is a clear bug in gcc (while the original problem,
> without explicit volatile, is not a C spec violation per se, it's just
> very inconvenient :) ).

As a data point, the exact same code on a Solaris 8 Pentium III box:

$ gcc -S -o x.s x.c
$ cat x.s
        .file   "x.c"
        .text
.globl wrong
        .type   wrong, @function
wrong:
        pushl   %ebp
        movl    %esp, %ebp
        movl    8(%ebp), %eax
        movzbl  8(%eax), %edx
        orl     $1, %edx
        movb    %dl, 8(%eax)
        popl    %ebp
        ret
        .size   wrong, .-wrong
.globl main
        .type   main, @function
main:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $4, %esp
        movl    $0, (%esp)
        call    wrong
        leave
        ret
        .size   main, .-main
        .ident  "GCC: (Blastwave.org Inc. Thu Dec 16 18:05:01 GMT 2010) 4.5.2"

$ gcc -o x x.c
$ file x
x: ELF 32-bit LSB executable 80386 Version 1, dynamically linked, not stripped
$ ldd x
        libc.so.1 =>     /usr/lib/libc.so.1
        libdl.so.1 =>    /usr/lib/libdl.so.1
$ ./x
Segmentation Fault(coredump)
$ ls -l core
-rw-------   1 dclarke  csw        71384 Feb  1 17:26 core

71384 bytes; core is complete, thus:

$ elfdump -p core | tail -6
Program Header[12]:
    p_vaddr:    0xdfbf3000   p_flags:  [ PF_X PF_W PF_R ]
    p_paddr:    0            p_type:   [ PT_LOAD ]
    p_filesz:   0x1000       p_memsz:  0x1000
    p_offset:   0x106d8      p_align:  0

$ /opt/studio/SOS11/SUNWspro/bin/dbx -c "print 0x1000 + 0x106d8; quit"
dbx: warning: unknown language, 'c' assumed
0x1000+0x106d8 = 71384

What caused the seg fault?

$ /opt/studio/SOS11/SUNWspro/bin/dbx x core
Reading x
core file header read successfully
Reading ld.so.1
Reading libc.so.1
Reading libdl.so.1
program terminated by signal SEGV (no mapping at the fault address)
0x08050672: wrong+0x0006:  movzbl  0x0008(%eax),%edx
(dbx) where
=>[1] wrong(0x0, 0x8047b70, 0x805057d, 0x1, 0x8047b7c, 0x8047b84), at 0x8050672
  [2] main(0x1, 0x8047b7c, 0x8047b84), at 0x8050690

However Sun Studio 5.8 does no better:

$ /opt/studio/SOS11/SUNWspro/bin/cc -Xc -o x_Sun_Studio_5.8 x.c
$ ./x_Sun_Studio_5.8
Segmentation Fault(coredump)
$ ls -l core
-rw-------   1 dclarke  csw        71384 Feb  1 17:48 core
$ /opt/studio/SOS11/SUNWspro/bin/dbx x_Sun_Studio_5.8 core
dbx: warning: core object name "x_Sun_Studio_5." matches
object name "x_Sun_Studio_5.8" within the limit of 14. assuming they match
core file header read successfully
Reading ld.so.1
Reading libc.so.1
Reading libdl.so.1
program terminated by signal SEGV (no mapping at the fault address)
0x080506ae: wrong+0x000e:  movl  0x0008(%ecx),%eax
(dbx) where
=>[1] wrong(0x0), at 0x80506ae
  [2] main(0x1, 0x8047b4c, 0x8047b54), at 0x80506ca
(dbx) quit
$ /opt/studio/SOS11/SUNWspro/bin/cc -V
cc: Sun C 5.8 Patch 121016-08 2009/04/20
usage: cc [ options] files.  Use 'cc -flags' for details

dc

--
http://pgp.mit.edu:11371/pks/lookup?op=vindex&search=0x1D936C72FA35B44B
+-------------------------+-----------------------------------+
| Dennis Clarke           | Solaris and Linux and Open Source |
| dcla...@blastwave.org   | Respect for open standards.       |
+-------------------------+-----------------------------------+
Documentation error for cbranch4
Hi,

The docs state:

`cbranchmode4'
    Conditional branch instruction combined with a compare instruction.
    Operand 0 is a comparison operator. Operand 1 and operand 2 are the
    first and second operands of the comparison, respectively. Operand 3
    is a label_ref that refers to the label to jump to.

However, from my experience and after looking at other backends, operand
3 is actually a code label and not a label_ref.

Cheers,
--
PMatos
Re: Memory corruption due to word sharing
From: Michael Matz
Date: Wed, 1 Feb 2012 18:41:05 +0100 (CET)

> One problem is that it's not a new problem, GCC emitted similar code since
> about forever, and still they turned up only now (well, probably because
> ia64 is dead, but sparc64 should have similar problems).

Indeed, on sparc64 it does do the silly 64-bit access too:

wrong:
        ldx     [%o0+8], %g2
        sethi   %hi(2147483648), %g1
        or      %g2, %g1, %g1
        jmp     %o7+8
         stx    %g1, [%o0+8]

Personally I've avoided C bitfields like the plague in any code I've
written.
Re: Memory corruption due to word sharing
On 02/01/2012 11:09 AM, David Miller wrote:
> From: Michael Matz
> Date: Wed, 1 Feb 2012 18:41:05 +0100 (CET)
>
>> One problem is that it's not a new problem, GCC emitted similar code since
>> about forever, and still they turned up only now (well, probably because
>> ia64 is dead, but sparc64 should have similar problems).
>
> Indeed, on sparc64 it does do the silly 64-bit access too:
>
> wrong:
>         ldx     [%o0+8], %g2
>         sethi   %hi(2147483648), %g1
>         or      %g2, %g1, %g1
>         jmp     %o7+8
>          stx    %g1, [%o0+8]
>
> Personally I've avoided C bitfields like the plague in any code I've
> written.

Torvald Riegel & I were told that was kernel policy when we brought up
the upcoming bitfield semantic changes with some of the linux kernel
folks last year.

Regardless of the kernel team's policy WRT bitfields, I believe fixing
the semantics to avoid creation of data races in bitfields is going to
be important. I'm hoping Aldy can return to that task soon and he and
Richi can come to some agreement on the implementation with gcc-4.8 as
a target.

jeff
Re: Memory corruption due to word sharing
On Wed, Feb 1, 2012 at 9:41 AM, Michael Matz wrote:
>
> One problem is that it's not a new problem, GCC emitted similar code since
> about forever, and still they turned up only now (well, probably because
> ia64 is dead, but sparc64 should have similar problems). The bitfield
> handling code is _terribly_ complex and fixing it is quite involved. So
> don't expect any quick fixes.

I agree that bitfields are nasty; I've had to handle them myself in
sparse. And we have actually had problems with bitfields before, to the
point where we've replaced use of bitfields with masking and setting
bits by hand.

But I also think that gcc is simply *buggy*, and has made them much
nastier than they should be. What gcc *should* have done is to turn
bitfield accesses into shift-and-masking of the underlying field as
early as possible, and then do all optimizations at that level.

In fact, there is another gcc bug outstanding (48696) where I complain
about absolutely horrible code generation, and that one was actually
the exact same issue except in reverse: gcc wouldn't take the
underlying size of the bitfield into account, and used the wrong
(smaller) size for the access, causing absolutely horrendous code
generation that mixes byte and word accesses, and causes slowdowns by
orders of magnitude.

And it really is the same issue: gcc has forgotten what the base type
is, and tries to "compute" some type using the actual bits. Which is
simply *wrong*. Always has been.

It's stupid, it generates bad code (both from a performance *and* from
a correctness angle), and it has actually resulted in gcc having
*extra* complexity because it keeps the bitfield data around much too
late.

> The other problem is specification. While you think that the code you
> wrote is totally obvious, it might not actually be so. For instance,
> what about this struct:
>
> {long l:32; int i1:16; short s; int i2:1; char c:7; short s2:8; short s3;}
>
> What are the obviously correct accesses for various writes into this
> struct?

I really don't think that example matters. You should instead turn the
question around, and look at the real code examples, make *those*
generate good and obviously correct code, and then *after* you've done
that, you start looking at the mixed case where people do odd things.

Quite frankly, the layout that makes *sense* for something like the
above is not to pack them. You'd just do

If somebody actually wants packing, they'd do the *sane* thing, which is

    unsigned int l:32, i1:16; s:16,i2:1, c:7, s2:8, s3:16;

and now it's obvious what the packing rules and the access rules are.

That said, I just tried, and 'sparse' actually gets your crazy example
right (where "right" is defined as what I *think* you want): optimally
packed. And it turns the accesses into the ones that are based on the
base type. And I'm pretty sure the sparse bitfield rules are way
simpler than gcc's, exactly because it tries to convert bitfields
fairly early into mask-and-set.

So if I do

    struct {long l:32; int i1:16; short s; int i2:1; char c:7; short s2:8; short s3;} dummy;

    int main(int argc, char **argv)
    {
        dummy.l = 1;
        dummy.i1 = 2;
        dummy.s = 3;
        dummy.i2 = 4;
        dummy.c = 5;
        dummy.s2 = 6;
        dummy.s3 = 7;
    }

and then do a test-linearize (with "-m64" to show the difference
between "long" and "int") I get

    t.c:2:48: error: dubious one-bit signed bitfield
    main:
    .L0x7f928e378010:
        load.64     %r1 <- 0[dummy]
        and.64      %r2 <- %r1, $0x
        or.64       %r3 <- %r2, $1
        store.64    %r3 -> 0[dummy]
        load.32     %r4 <- 4[dummy]
        and.32      %r5 <- %r4, $0x
        or.32       %r6 <- %r5, $2
        store.32    %r6 -> 4[dummy]
        store.16    $3 -> 6[dummy]
        load.32     %r7 <- 8[dummy]
        and.32      %r8 <- %r7, $-2
        store.32    %r8 -> 8[dummy]
        load.8      %r10 <- 8[dummy]
        and.8       %r12 <- %r10, $-255
        or.8        %r13 <- %r12, $10
        store.8     %r13 -> 8[dummy]
        load.16     %r14 <- 8[dummy]
        and.16      %r16 <- %r14, $0x00ff
        or.16       %r17 <- %r16, $0x600
        store.16    %r17 -> 8[dummy]
        store.16    $7 -> 10[dummy]
        ret.32      $7

which I think is what you'd expect. Now, sparse doesn't do the smart
"combine shifts-and-masks to memory", but that's an optimization phase
that should be done long after bitfields *anyway*, so that's kind of
immaterial.

And yes, look at how you can see how sparse mixes different access
sizes for different fields. But that is damn well what the programmer
asked for.

If programmers write stupid code, the compiler might as well generate
odd code. But if a programmer writes *good* code, and the compiler
messes it up because it *thinks* the code is stupid, then the compiler
is just bad.

> One rule may be to never write to a bitfield with accesses larger than
> their underlying
Re: Memory corruption due to word sharing
On Wed, Feb 1, 2012 at 10:09 AM, David Miller wrote:
>
> Personally I've avoided C bitfields like the plague in any code I've
> written.

I do agree with that.

The kernel largely tries to avoid bitfields, usually because we have
some really strict rules about different bitfields, but also because
initialization of bitfields tends to result in gcc generating an
incredible mess of code (while "explicit bits" allows us to just set
all the fields in one go, and know what the result is).

So core kernel data structures tend to be things like "unsigned long",
together with our various bitop functions that have explicit atomicity
(or non-atomicity) guarantees on a bit-per-bit basis.

Sometimes bitfields are really convenient, though, and allow for much
more natural syntax. I'm not surprised this issue came up in a
filesystem, for example.

                    Linus
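[As an illustration of the "explicit bits" style (a sketch, not actual
kernel code): a plain word plus hand-written masks makes both the
access width and the initialization explicit:

    #include <stdio.h>

    #define FLAG_DIRTY   (1UL << 0)
    #define FLAG_LOCKED  (1UL << 1)
    #define FLAG_ERROR   (1UL << 2)

    struct thing {
        unsigned long flags;
    };

    int main(void)
    {
        /* set all the fields in one go - a single word-sized store */
        struct thing t = { .flags = FLAG_DIRTY | FLAG_LOCKED };

        t.flags &= ~FLAG_LOCKED;    /* clear one flag: word-sized RMW */
        if (t.flags & FLAG_DIRTY)   /* test one flag */
            printf("dirty\n");
        return 0;
    }
]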
Re: Memory corruption due to word sharing
On Wed, 2012-02-01 at 13:09 -0500, David Miller wrote:
> From: Michael Matz
> Date: Wed, 1 Feb 2012 18:41:05 +0100 (CET)
>
>> One problem is that it's not a new problem, GCC emitted similar code since
>> about forever, and still they turned up only now (well, probably because
>> ia64 is dead, but sparc64 should have similar problems).
>
> Indeed, on sparc64 it does do the silly 64-bit access too:
>
> wrong:
>         ldx     [%o0+8], %g2
>         sethi   %hi(2147483648), %g1
>         or      %g2, %g1, %g1
>         jmp     %o7+8
>          stx    %g1, [%o0+8]

Ditto for powerpc64-linux:

        ld 9,8(3)
        li 10,1
        rldimi 9,10,31,32
        std 9,8(3)
        blr

Peter
Re: Memory corruption due to word sharing
On Wed, Feb 1, 2012 at 10:45 AM, Jeff Law wrote:
>
> Torvald Riegel & I were told that was kernel policy when we brought up
> the upcoming bitfield semantic changes with some of the linux kernel
> folks last year.

Btw, one reason this is true is that the bitfield ordering/packing is
so unspecified that using bitfields sometimes is just way more pain
than it's worth.

Some people have tried to use bitfields for various IO data structures
(think laying out bits in memory to match some particular layout of
some disk controller result structure). It's always a total disaster,
because you have another whole level of architecture-specific bit
ordering (in *addition* to all the normal byte order issues).

That's not a compiler issue, that's just the nature of the beast. It's
just another reason why the kernel often ends up then doing bit masking
by hand.

But *all* that said, we do have a metric buttload of bitfields in the
kernel. It's not some really unusual feature. It does get used a lot,
despite all the reasons why some particular code might not use
bitfields. We have a lot of code, there's still a lot of situations
left where bitfields are just really convenient.

                    Linus
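[A sketch of the hand-masking alternative for hardware layouts; the
field positions in this hypothetical 32-bit status word are made up,
but the point is that the shifts and masks pin down the bit positions
themselves, independent of the compiler's architecture-specific
bitfield allocation order:

    #include <stdint.h>

    /* hypothetical controller status word */
    #define STATUS_READY(x)    (((x) >> 31) & 0x1u)
    #define STATUS_ERRCODE(x)  (((x) >> 24) & 0x7fu)
    #define STATUS_COUNT(x)    ((x) & 0xffffffu)

    uint32_t decode_status(uint32_t raw, unsigned *err, unsigned *count)
    {
        *err   = STATUS_ERRCODE(raw);
        *count = STATUS_COUNT(raw);
        return STATUS_READY(raw);   /* nonzero if the device is ready */
    }
]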
Re: Memory corruption due to word sharing
On Wed, 2012-02-01 at 08:41 -0800, Linus Torvalds wrote:
> If the gcc people aren't willing to agree that this is actually a flaw
> in the standard (one that is being addressed, no less)

It has been addressed in the standards.

> and try to fix it,

Again, this is being worked on, see http://gcc.gnu.org/wiki/Atomic/GCCMM

> we just have to extend our assumptions to something like "a
> compiler would be stupid to ever access anything bigger than the
> aligned register-size area". It's still just an assumption, and
> compiler people could be crazy, but we could just extend the current
> alpha rules to cover not just "int", but "long" too.

So, let's ignore everyone's craziness (the kernel is not the only GCC
client...) and think about how we can improve the situation.

We need a proper memory model. No vague assumptions with lots of
hand-waving. If you think that this is simple stuff and can be
sufficiently described by "don't do anything stupid", then please have
a look at the issues that the Java memory model faced, and all the
iterations of the C++11/C11 model and discussions about it.

The only candidate that I see is the C++11/C11 model. Any other
suggestions?

Now, if we assume this model, what does the kernel community think
about it? It might be good to split this discussion into the following
two points, to avoid syntax flame wars:
1) The model itself (ie, the memory orders, ordering guarantees, (lack
   of) progress guarantees, etc.).
2) The syntax/APIs to specify atomics.

If something else or more is needed, the compiler needs to have a
formal specification for that. This will help users too, because it
avoids all the ambiguities.

> Sure, the compiler people could use "load/store multiple" or something
> like that, but it would be obviously crazy code, so if it happens past
> a register size, at least you could argue that it's a performance
> issue and maybe the gcc people would care.
>
> HOWEVER. The problem with the alpha rules (which, btw, were huge, and
> led to the CPU designers literally changing the CPU instruction set
> because they admitted they made a huge mistake) was never so much the
> occasional memory corruption, as the fact that the places where it
> could happen were basically almost impossible to find.
>
> So we probably have tons of random places in the kernel that violate
> even the alpha rules - because the corruption is so rare, and the
> architecture was so rare as to making the corruption even harder to
> find.

I assume you still would want a weak memory model, or not? (That is,
rely on data-race-freedom, only atomics do not contribute to data
races, and you need to mark data used for synchronization (or which is
just accessed concurrently) as atomics.) If so, you need to tell the
compiler which variables etc. are atomics.

> I assume this code generation idiocy only happens with bitfields? The
> problem is, we really don't want to make all bitfields take up 64 bits
> just because we might have a lock or something else next to them. But
> we could probably re-order things (and then *occasionally* wasting
> memory) if we could just find the ones that are problematic.

If the compiler is aware of what is a lock (or atomics, or similar),
then a correct implementation should leave the lock alone. But in a
weak memory model, it needs to know what is a lock. (Same goes for
volatile obviously, like in the other example posted in the thread
where the bitfields interfered with the volatile.)

> It's finding the problematic ones that is the issue. Which is why the
> compiler should just generate code that matches what we told it to do,

So, would it be okay to tell the compiler which part of the state is
accessed concurrently (ie, locks, atomics, ...)?

Torvald
Re: Memory corruption due to word sharing
On Wed, Feb 01, 2012 at 06:42:54PM +0100, Torvald Riegel wrote:
> We need a proper memory model. No vague assumptions with lots of
> hand-waving. If you think that this is simple stuff and can be
> sufficiently described by "don't do anything stupid", then please have
> a look at the issues that the Java memory model faced, and all the
> iterations of the C++11/C11 model and discussions about it.
>
> The only candidate that I see is the C++11/C11 model. Any other
> suggestions?

Well, the C++11/C11 model doesn't allow use of the underlying type for
accesses, consider e.g.

struct S {
    long s1;
    unsigned int s2 : 5;
    unsigned int s3 : 19;
    unsigned char s4;
    unsigned int s5;
};

struct T {
    long s1 : 16;
    unsigned int s2;
};

on e.g. x86_64-linux: S is 16 bytes long, and field s4 is packed
together into the same 32 bits as s2 and s3. While the memory model
allows s2 to be changed when storing s3 (i.e. use an RMW cycle on it),
it doesn't allow s4 to be changed, as it isn't a bitfield (you'd need
s4 : 8 for that). T is 8 bytes long; again, s2 is packed into the same
64 bits as s1, and the memory model doesn't allow s2 to be modified.

Not sure what the kernel would expect in such a case.

        Jakub
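[As an aside, C11's layout rules offer one knob here: a zero-width
bitfield ends the current sequence of bitfields, so a variant of S can
force s4 into its own storage unit and, under the C11 model, its own
memory location. Whether a pre-C11 GCC emits different accesses for
this variant is a separate question; the sketch is illustrative:

    struct S2 {
        long s1;
        unsigned int s2 : 5;
        unsigned int s3 : 19;
        unsigned int : 0;      /* close the bitfield sequence here */
        unsigned char s4;      /* now starts at the next int boundary */
        unsigned int s5;
    };
]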
RE: Memory corruption due to word sharing
I'm clearly with Torvald here.

The original set of problems here is very clearly addressed by the
C++11/C11 memory model. It clearly implies, among other things:

- Updating a bit-field may interfere with (create a data race with)
  other updates to that same contiguous sequence of bit-fields, but not
  with other adjacent fields. Updating non-bit-fields may not interfere
  with accesses to any other field. A zero-length bit-field breaks a
  sequence of bit-fields. The initial example here is a clear compiler
  bug by C11 semantics.
- Bit-fields may be updated, and indeed must be updated, in some cases,
  with multiple smaller stores. Doing anything else would violate the
  preceding rule.
- Spurious writes to a variable may never be introduced.
- "Volatile" has nothing to do with thread interactions. Atomic
  variables were introduced to allow unprotected accesses to shared
  objects. (It doesn't make sense to use volatile for this, for a
  number of good, mostly historically motivated reasons. See
  http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2016.html
  if you really care.)

(Technically, this applies to C11 threads, because that's all it can
talk about. But an implementation would have to be insane to restrict
it to that. For implementations of atomic operations, I would expect
most compilers to assume memory mappings and consistency as normally
seen by user processes.)

C11 is a published standard. Last I checked, gcc did not follow many of
the above rules. It looks like major changes were recently merged into
the gcc trunk, and I haven't had a chance to test those, so it may well
be fixed. But more importantly, so far I haven't read anything here to
dissuade me that they are the right target.

Hans

> From: Torvald Riegel
> Sent: Wednesday, February 01, 2012 9:43 AM
> Subject: Re: Memory corruption due to word sharing
>
> On Wed, 2012-02-01 at 08:41 -0800, Linus Torvalds wrote:
> > If the gcc people aren't willing to agree that this is actually a flaw
> > in the standard (one that is being addressed, no less)
>
> It has been addressed in the standards.
>
> > and try to fix it,
>
> Again, this is being worked on, see http://gcc.gnu.org/wiki/Atomic/GCCMM
>
> > we just have to extend our assumptions to something like "a
> > compiler would be stupid to ever access anything bigger than the
> > aligned register-size area". It's still just an assumption, and
> > compiler people could be crazy, but we could just extend the current
> > alpha rules to cover not just "int", but "long" too.
>
> So, let's ignore everyone's craziness (the kernel is not the only GCC
> client...) and think about how we can improve the situation.
>
> We need a proper memory model. No vague assumptions with lots of
> hand-waving. If you think that this is simple stuff and can be
> sufficiently described by "don't do anything stupid", then please have
> a look at the issues that the Java memory model faced, and all the
> iterations of the C++11/C11 model and discussions about it.
>
> The only candidate that I see is the C++11/C11 model. Any other
> suggestions?
>
> Now, if we assume this model, what does the kernel community think
> about it? It might be good to split this discussion into the following
> two points, to avoid syntax flame wars:
> 1) The model itself (ie, the memory orders, ordering guarantees, (lack
>    of) progress guarantees, etc.).
> 2) The syntax/APIs to specify atomics.
>
> If something else or more is needed, the compiler needs to have a
> formal specification for that. This will help users too, because it
> avoids all the ambiguities.
>
> > Sure, the compiler people could use "load/store multiple" or something
> > like that, but it would be obviously crazy code, so if it happens past
> > a register size, at least you could argue that it's a performance
> > issue and maybe the gcc people would care.
> >
> > HOWEVER. The problem with the alpha rules (which, btw, were huge, and
> > led to the CPU designers literally changing the CPU instruction set
> > because they admitted they made a huge mistake) was never so much the
> > occasional memory corruption, as the fact that the places where it
> > could happen were basically almost impossible to find.
> >
> > So we probably have tons of random places in the kernel that violate
> > even the alpha rules - because the corruption is so rare, and the
> > architecture was so rare as to making the corruption even harder to
> > find.
>
> I assume you still would want a weak memory model, or not? (That is,
> rely on data-race-freedom, only atomics do not contribute to data races,
> and you need to mark data used for synchronization (or which is just
> accessed concurrentl
Re: Memory corruption due to word sharing
On Wed, Feb 1, 2012 at 9:42 AM, Torvald Riegel wrote:
>
> We need a proper memory model.

Not really.

The fact is, the kernel will happily take the memory model of the
underlying hardware. Trying to impose some compiler description of the
memory model is actually horribly bad, because it automatically also
involves compiler synchronization points - which will try to be
portable and easy to understand, and quite frankly, that will
automatically result in what is technically known as a "shitload of
crap".

Now, a strict memory model is fine for portability, and to make it
simple for programmers. I actually largely approve of the Java memory
model approach, even if it has been horribly problematic and has some
serious problems. But it's not something that is acceptable for the
kernel - we absolutely do *not* want the compiler to introduce some
memory model, because we are very much working together with whatever
the hardware memory model is, and we do our own serialization
primitives.

> No vague assumptions with lots of hand-waving.

So here's basically what the kernel needs:

- if we don't touch a field, the compiler doesn't touch it.

  This is the rule that gcc now violates with bitfields.

  This is a gcc bug. End of story. The "volatile" example proves it -
  anybody who argues otherwise is simply wrong, and is just trying to
  make excuses.

- if we need to touch something only once, or care about something that
  is touched only conditionally, and we actually need to make sure that
  it is touched *only* under that condition, we already go to quite
  extreme lengths to tell the compiler so.

  We usually use an access with a "volatile" cast on it, or tricks like
  that. Or we do the whole "magic asm that says it modified memory to
  let the compiler know not to try anything clever".

- The kernel IS NOT GOING TO MARK DATA STRUCTURES.

  Marking data structures is seriously stupid and wrong. It's not the
  *data* that has access rules, it's the *code* that has rules of
  access.

  The traditional C "volatile" is misdesigned and wrong. We don't
  generally mark data volatile, we really mark *code* volatile - which
  is why our "volatiles" are in the casts, not on the data structures.

  Stuff that is "volatile" in one context is not volatile in another.
  If you hold an RCU write lock, it may well be entirely stable, and
  marking it volatile is *wrong*, and generating code as if it was
  volatile is pure and utter shit.

  On the other hand, if you are touching *the*very*same* field while
  you are only read-locked for RCU, it may well be one of those "this
  has to be read by accessing it exactly once".

And we do all this correctly in the kernel. Modulo bugs, of course, but
the fundamental rule really is: "atomicity or volatility is about CODE,
not DATA".

And no, C11 does *not* do it correctly. The whole "_Atomic" crap is
exactly the same mistake as "volatile" was. It's wrong. Marking data
_Atomic is a sure sign that whoever designed it didn't understand
things.

> The only candidate that I see is the C++11/C11 model. Any other
> suggestions?

The C11 model addresses the wrong thing, and addresses it badly. You
might as well ignore it as far as the kernel is concerned. I'm sure it
helps some *other* problem, but it's not the kernel problem.

The rules really are very simple, and the kernel already does
everything to make it easy for the compiler. When we do something that
the compiler cannot re-order around, we do an asm() and mark it as
modifying memory so that the compiler won't screw things up. In
addition, we will do whatever the CPU requires for memory ordering, and
quite frankly, the compiler will never have sufficient locking
primitives to satisfy us, and there is no real point in even trying. If
you try to do locking in the compiler, you *will* do it wrong.

If you add random flags on data structures ("_Atomic" or "volatile" or
whatever), you *will* do it wrong. It's simply a fundamentally broken
model.

So the *only* memory model we want from the compiler really is: "don't
write to fields that the source code didn't write to".

It really is that simple.

> So, would it be okay to tell the compiler which part of the state is
> accessed concurrently (ie, locks, atomics, ...)?

Seriously, no.

See above: it's not the "state" that is accessed concurrently. It's the
code. If you ever try to mark state, you've already lost. The same
"state" can be atomic or not depending on context. It's not about the
state or the data structures, and it never will be.

                    Linus
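[For reference, the "volatile in the code, not on the data" idiom
described here corresponds to the kernel's ACCESS_ONCE() and barrier()
macros of this era (from the kernel's compiler headers); shown here
standalone with a hypothetical usage. Note that typeof is a GCC
extension, which is in keeping with the thread:

    #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
    #define barrier() __asm__ __volatile__("" : : : "memory")

    /* hypothetical usage: force one real load of *flag per iteration,
     * without marking the flag's declaration volatile */
    static int wait_for_flag(int *flag)
    {
        while (!ACCESS_ONCE(*flag))
            barrier();   /* compiler may not cache *flag across this */
        return *flag;
    }
]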
Re: Memory corruption due to word sharing
On 02/01/2012 12:44 PM, Boehm, Hans wrote:
> C11 is a published standard. Last I checked, gcc did not follow many of
> the above rules. It looks like major changes were recently merged into
> the gcc trunk, and I haven't had a chance to test those, so it may well
> be fixed. But more importantly, so far I haven't read anything here to
> dissuade me that they are the right target.

The bitfield changes didn't make it into gcc-4.7; we're still working
through implementation details. GCC's bitfield support is a horrid
nasty mess which has made implementation of the standard problematical.
It's on Aldy's plate for gcc-4.8.

jeff
Re: Memory corruption due to word sharing
> So here's basically what the kernel needs:
>
> - if we don't touch a field, the compiler doesn't touch it.
>
>   This is the rule that gcc now violates with bitfields.
>
>   This is a gcc bug. End of story. The "volatile" example proves it -
>   anybody who argues otherwise is simply wrong, and is just trying to
>   make excuses.

C historically didn't make this guarantee because a lot of processors
couldn't honor it, since they didn't have things like byte accessors
(in fact I suspect early ARM cannot make it, for example). Not meeting
it for types where you can do so is a bit rude, however, and really
ought to be an option (speed v sanity).

> See above: it's not the "state" that is accessed concurrently. It's the
> code. If you ever try to mark state, you've already lost. The same
> "state" can be atomic or not depending on context. It's not about the
> state or the data structures, and it never will be.

There are optimisation cases - where you can prove access properties
are safe (eg local variables sometimes) - but they should be exactly
that: optimisations.

Alan
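[To make the byte-accessor point concrete, a sketch of how a one-byte
store must be emulated on a machine with only word-sized loads and
stores; the emulation is a non-atomic read-modify-write of all four
bytes, which is exactly the alpha problem discussed earlier:

    #include <stdint.h>

    void store_byte(uint32_t *word, unsigned idx, uint8_t val)
    {
        uint32_t w = *word;                    /* load the whole word   */
        w &= ~(UINT32_C(0xff) << (8 * idx));   /* clear the target byte */
        w |= (uint32_t)val << (8 * idx);       /* insert the new byte   */
        *word = w;                             /* store: clobbers racing
                                                  writes to the other
                                                  bytes of the word     */
    }
]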
Re: Memory corruption due to word sharing
On Wed, Feb 1, 2012 at 11:40 AM, Jakub Jelinek wrote:
>
> Well, the C++11/C11 model doesn't allow use of the underlying type
> for accesses, consider e.g.
>
> struct S { long s1; unsigned int s2 : 5; unsigned int s3 : 19; unsigned char s4; unsigned int s5; };
> struct T { long s1 : 16; unsigned int s2; };
>
> on e.g. x86_64-linux, S is 16 bytes long, field s4 is packed together
> into the same 32 bits as s2 and s3. While the memory model allows s2 to
> be changed when storing s3 (i.e. use an RMW cycle on it), it doesn't
> allow s4 to be changed, as it isn't a bitfield (you'd need s4 : 8 for
> that). T is 8 bytes long, again, s2 is packed into the same 64 bits as
> s1, and the memory model doesn't allow s2 to be modified.
>
> Not sure what the kernel would expect in such a case.

So the kernel really doesn't care what you do to things *within* the
bitfield. Sure, we do have some expectations of packing, but basically
we never expect bitfields to be 'atomic'. You really don't need to
worry about that.

From a correctness standpoint, I really don't think the kernel cares if
gcc does bitfield reads and writes one bit at a time. Seriously.

The bug is that the bitfield write wrote *outside* the bitfield.

That's *WRONG*. It's completely wrong, even if you write back the same
value you read. It's *always* wrong. It's simply not acceptable.

If gcc does that kind of crap, then gcc needs to align bitfields
sufficiently that the accesses that are outside the bitfields never
touch any other fields.

So what I would suggest:

- use the *smallest* aligned access you can use to access the bitfield.
  This will be 'correct', but it may perform badly.

  This is apparently what x86 does. So on x86, if you have a one-bit
  bitfield within a bitfield that is really 32 bits wide, it looks like
  gcc will generally generate a byte load/store to set that bit. I
  don't know exactly what the rules are, but this is fine from a
  *correctness* point of view.

- However, while using the *smallest* possible access may generate
  correct code, it often generates really *crappy* code. Which is
  exactly the bug that I reported in

  http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48696

So quite frankly, my suggestion is to just try to use the base type.

But *whatever* you do, using a type *larger* than the whole bitfield
itself is a bug.

And dammit, the people who continue to bring up C11 as if this was a
new issue, and only needs to be worried about if you support C11 (ie
gcc-4.7 and up), are just in denial.

The 'volatile' example makes it entirely clear that the bug has nothing
what-so-ever to do with C11.

                    Linus
Re: Memory corruption due to word sharing
On Wed, Feb 01, 2012 at 12:01:46PM -0800, Linus Torvalds wrote:
> On Wed, Feb 1, 2012 at 11:40 AM, Jakub Jelinek wrote:
> > struct S { long s1; unsigned int s2 : 5; unsigned int s3 : 19; unsigned char s4; unsigned int s5; };
> > struct T { long t1 : 16; unsigned int t2; };
> >
> > on e.g. x86_64-linux, S is 16 bytes long, field s4 is packed together
> > into the same 32 bits as s2 and s3. While the memory model allows s2 to
> > be changed when storing s3 (i.e. use an RMW cycle on it), it doesn't
> > allow s4 to be changed, as it isn't a bitfield (you'd need s4 : 8 for
> > that). T is 8 bytes long, again, s2 is packed into the same 64 bits as
> > s1, and the memory model doesn't allow s2 to be modified.
> >
> > Not sure what the kernel would expect in such a case.
>
> So the kernel really doesn't care what you do to things *within* the
> bitfield.

But what is *within* the bitfield? Do you consider the s4 or t2 fields
above (non-bitfield fields that just the ABI wants to pack together
with the bitfield) as *within*, or is modifying those already a
problem?

> The 'volatile' example makes it entirely clear that the bug has nothing
> what-so-ever to do with C11.

The reason we mention the C11/C++11 memory model is that we plan to
support it, likely for GCC 4.8, when requested by some options (either
by default when choosing those standards?, or when explicitly
requested). This will even disable some optimizations that are fine for
single-threaded apps, but aren't fine in those memory models, aren't
fine in OpenMP or for many other threaded programs. E.g. loop store
motion, if it isn't guaranteed the variable is always stored in the
loop:

int x;

void foo (int j)
{
    int i;
    for (i = 0; i < 10; i++)
        if (i > j)
            x = i;
}

can't be performed, because otherwise it introduces a
    tmp = x; ... x = tmp;
into code that wouldn't otherwise touch the variable, so if some other
thread modifies x, it might have an unexpected value.

        Jakub
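[For clarity, the (disallowed) result of that loop store motion would
look roughly like this; the store to x becomes unconditional, which
introduces a write the source never made on the j >= 9 path:

    int x;

    void foo_transformed(int j)
    {
        int i, tmp = x;      /* speculative load of x */
        for (i = 0; i < 10; i++)
            if (i > j)
                tmp = i;
        x = tmp;             /* unconditional store: may overwrite a
                                concurrent update from another thread */
    }
]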
Re: Memory corruption due to word sharing
On Wed, Feb 1, 2012 at 12:01 PM, Linus Torvalds wrote: > > - However, while using the *smallest* possible access may generate > correct code, it often generates really *crappy* code. Which is > exactly the bug that I reported in > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48696 Btw, as I also pointed out in that (old) bugzilla entry, gcc can do pretty much whatever it damn well pleases with loads from a bitfield. You can narrow them, you can make them wider, you can paint them pink. Widening the loads can still be a performance problem, and technically it's a really bad idea for any nearby "volatile"s too, but in practice things will *work*. Writes are much more critical. If you overwrite a field that is not in the bitfield, you can no longer depend on "nobody cares about bitfield accesses", because by definition you are clearly now stepping on non-bitfield things. You do realize that even in user space, and even before C11, people have things like "sig_atomic_t" etc. So even if you don't like the notion of "volatile", here's *another* example of this being a real gcc bug: struct mydata { sig_atomic_t seen_signal; unsigned flags:1; }; and now do the same test-program, realizing that "sig_atomic_t" is normally the exact same thing as "int". Now, think about what happens if you have a signal handler that comes in and does mydata.seen_signal = 1; and happens to interrupt the code that changes "mydata.flags" in just the right spot. That's right: the bitfield update will possibly *clear* that "signal-safe" flag again, and gcc just created buggy asm code from correct C code. Guys, if you don't admit that this is a bug, I don't know what to say. IT IS A GCC BUG. Don't try to make it into something bigger and related to C++11/C11. Don't try to talk about "memory models". Just admit that it is a bug to do a 64-bit write to a 32-bit bitfield, and fix it! Linus
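Written out as explicit C, the signal race looks like this. The RMW body is a hand-written sketch of the bad codegen, assuming flags shares a 64-bit unit with seen_signal on an LP64 little-endian target; the bit position is illustrative.

    #include <signal.h>
    #include <string.h>

    struct mydata {
        sig_atomic_t seen_signal;   /* normally a plain int */
        unsigned flags : 1;
    };

    struct mydata mydata;

    void handler(int sig)
    {
        (void)sig;
        mydata.seen_signal = 1;     /* async-signal-safe by design */
    }

    /* "mydata.flags = 1" compiled as a 64-bit read-modify-write: */
    void set_flag_buggy(void)
    {
        unsigned long w;
        memcpy(&w, &mydata, 8);     /* reads seen_signal AND flags   */
        /* ... if the signal arrives here, seen_signal becomes 1 ... */
        w |= 1UL << 32;             /* set the flags bit             */
        memcpy(&mydata, &w, 8);     /* clobbers seen_signal back to 0 */
    }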
Re: Memory corruption due to word sharing
On Wed, 2012-02-01 at 11:47 -0800, Linus Torvalds wrote: > On Wed, Feb 1, 2012 at 9:42 AM, Torvald Riegel wrote: > > > > We need a proper memory model. > > Not really. > > The fact is, the kernel will happily take the memory model of the > underlying hardware. You do rely on the compiler to do common transformations I suppose: hoist loads out of loops, CSE, etc. How do you expect the compiler to know whether they are allowed for a particular piece of code or not? An example, C++11 pseudo code: atomic<int> shared_flags[123]; x = 0; for(...) if (shared_flags[i].load(memory_order_acquire)) x += shared_data; Now, can we hoist the load of shared_data out of the loop? The acquire tells the compiler that this is not allowed. Now assume x86, and its memory model. The acquire in there isn't visible because you don't need a barrier for that. So, what's the compiler gonna do? Never hoist any loads out of loops because an acquire might be intended? So, you could add a compiler barrier in there (volatile asm...). But those work in both directions, and the acquire only works in one direction, allowing the first store to x to be moved later. > Trying to impose some compiler description of the > memory model is actually horribly bad, because it automatically also > involves compiler synchronization points - which will try to be > portable and easy to understand, and quite frankly, that will > automatically result in what is technically known as a "shitload of > crap". What is a "compiler synchronization point"? Have you actually read the C++11/C11 memory model? > Now, a strict memory model is fine for portability, and to make it > simple for programmers. I actually largely approve of the Java memory > model approach, even if it has been horribly problematic and has some > serious problems. But it's not something that is acceptable for the > kernel - we absolutely do *not* want the compiler to introduce some > memory model, because we are very much working together with whatever > the hardware memory model is, and we do our own serialization > primitives. I think you underestimate the issues here. The memory model that the kernel is programmed against is what tells the compiler which transformations are allowed, and which aren't. You seem to claim that it is just the hardware's memory model that counts, but it's clear that you have an implicit language memory model in mind, because otherwise the compiler could never hoist any loads, for example (to be safe, conservatively). > > No vague assumptions with lots of hand-waving. > > So here's basically what the kernel needs: > > - if we don't touch a field, the compiler doesn't touch it. "we don't touch a field": what does that mean precisely? Never touch it in the same thread? Same function? Same basic block? Between two sequence points? When crossing synchronizing code? For example, in the above code, can we touch it earlier/later? I understand where this rule is heading (granularity races...), but really, this isn't precise, and this all affects what optimizations are allowed. In this context, "show me the code" means having something more formal, unfortunately, to prevent ambiguities. For this reason, I suggested having a look at the C11 model, and then complaining about limitations in this model. > > This is the rule that gcc now violates with bitfields. For the volatile case, that's a bug, yes. For the non-volatile, I'd need to check. > This is a gcc bug. End of story. 
The "volatile" example proves it - > anybody who argues otherwise is simply wrong, and is just trying to > make excuses. I'm trying to explain here why language memory models are a good thing, and not the enemy. > - if we need to touch something only once, or care about something > that is touched only conditionally, and we actually need to make sure > that it is touched *only* under that condition, we already go to quite > extreme lengths to tell the compiler so. > >We usually use an access with a "volatile" cast on it or tricks > like that. Or we do the whole "magic asm that says it modified memory > to let the compiler know not to try anything clever" > > - The kernel IS NOT GOING TO MARK DATA STRUCTURES. > > Marking data structures is seriously stupid and wrong. It's not the > *data* that has access rules, it's the *code* that has rules of > access. And types are usually a good way to enforce code that accesses a piece of state to behave similarly. Which is what atomic does, and what I meant. From the overall C++/C perspective, it makes sense to require types for atomics. For the kernel, I don't care. > Stuff that is "volatile" in one context is not volatile in another. If > you hold a RCU write lock, it may well be entirely stable, and marking > it volatile is *wrong*, and generating code as if it was volatile is > pure and utter shit. Agreed. There is no full and portable way to do the weaker accesses with atomicsc, but memory_order_relaxed is ve
Re: Memory corruption due to word sharing
On Wed, Feb 1, 2012 at 12:16 PM, Jakub Jelinek wrote: >> >> So the kernel really doesn't care what you do to things *within* the >> bitfield. > > But what is *within* the bitfield? Do you consider s4 or t2 fields > (non-bitfield fields that just the ABI wants to pack together with > the bitfield) above as *within* or is already modifying of those a problem? I seriously think you should just do the obvious thing, and LOOK AT THE BASE TYPE. That solves everything. If the definition of the data is (assume "long" to be 64 bits) long mybits:32; int nonbitfield; and you end up doing a 64-bit access when you read/write the "mybits" bitfield, and it overlaps "nonbitfield", then I would also just say "hey, whatever, the user asked for it". He really did. The "allocation" for the bitfield comes from the base type, and so you basically have a "long" to play with. Let's take an even more extreme example: int bits1:8; char c; int bits2:16; where the "char c" is actually *embedded* in the "bitfield allocation". But again, it's completely and totally unambiguous what accesses you would at least start out with: sure, you *could* do an 8-bit access (and you probably should, if the ISA allows it), but dangit, the base type is "int", so I don't think it's wrong to do a 32-bit load, mask, and a 32-bit write. But if the user wrote char bits1:8; char c; short bits2:16; then I really think it would be a *bug* if modifying "bits1" could possibly write to 'c'. Of course, this is again a possible ISA problem: if the architecture doesn't have byte writes, you can't do anything about it, but that is obviously true regardless of bitfields, so that's really a totally irrelevant argument for this case. That non-atomicity doesn't come from bitfields, it comes from the fact that the BASE TYPE is again non-atomic. Notice how it always ends up being about the base type. Simple, straightforward, and unambiguous. And similarly, to go to the kernel example, if you have int mybits:2; int nonbitfield; then I think it's an obvious *BUG* to access the nonbitfield things when you access "mybits". It clearly is not at all "within" the bitfield allocation, and "int" isn't non-atomic on any sane architecture, so you don't even have the "byte accesses aren't atomic" issue. So I really do think that the sane approach is to look at the base type of the bitfield, like I suggested from the very beginning. If the base type is an "int", you do an int access. It really solves so many problems, and it is even entirely sane semantics that you can *explain* to people. It makes sense, it avoids the clear bugs gcc has now, and it also solves the performance bug I reported a long time ago. Seriously - is there any real argument *against* just using the base type as a hint for access size? I realize that it may not be simple, and may not fit the current mess of gcc bitfields, but I really do think it's the RightThing(tm) to do. It has sensible semantics, and avoids problems. In contrast, the *current* gcc semantics are clearly not sensible, and have both performance and correctness issues. Linus
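Linus's base-type rule, restated as complete declarations. The structs are his three examples; the comments describe the semantics he is proposing, not what any released gcc actually does, and the non-int base types are, as he notes, extensions of the C bitfield rules.

    /* Base type "long": the user granted a 64-bit allocation unit, so a
     * 64-bit RMW that overlaps nonbitfield is "what you asked for". */
    struct a { long mybits : 32; int nonbitfield; };

    /* Base type "int": c is embedded in the int-sized allocation, so a
     * 32-bit load/mask/store that touches c is defensible (though an
     * 8-bit access would be better where the ISA allows it). */
    struct b { int bits1 : 8; char c; int bits2 : 16; };

    /* Base types "char"/"short": writing bits1 must never touch c2. */
    struct c { char bits1 : 8; char c2; short bits2 : 16; };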
Re: Memory corruption due to word sharing
On Wed, 2012-02-01 at 09:29 -0800, Linus Torvalds wrote: > On Wed, Feb 1, 2012 at 9:08 AM, Torvald Riegel wrote: > > > > What do the kernel folks think about the C11 memory model? If you can > > spot any issues in there, the GCC community would certainly like to > > know. > > I don't think this is about memory models except very tangentially. > > Gcc currently accesses fields in a way that affects *independent* > fields, without checking their memory models at all. > > Even original C already has one memory model: "volatile" data is seen > outside the virtual C machine. And Jiri reports that even that > *original* memory model is being violated. We're talking about the one > from about 40 years ago. For volatile, I agree. However, the original btrfs example was *without* a volatile, and that's why I raised the memory model point. This triggered an error in a concurrent execution, so that's memory model land, at least in the C language standard. The example was a granularity-of-access violation, I agree. Nonetheless, C11 has rules for that; they have been written down, and it would be good to know whether these rules are sufficient for you. > We do end up doing > much more aggressive threading, with models that C11 simply doesn't > cover. Any specific examples for that would be interesting.
Re: Memory corruption due to word sharing
On Wed, Feb 1, 2012 at 12:41 PM, Torvald Riegel wrote: > > You do rely on the compiler to do common transformations I suppose: > hoist loads out of loops, CSE, etc. How do you expect the compiler to > know whether they are allowed for a particular piece of code or not? We have barriers. Compiler barriers are cheap. They look like this: include/linux/compiler-gcc.h: #define barrier() __asm__ __volatile__("": : :"memory") and if we don't allow hoisting, we'll often use something like that. It's not the only thing we do. We have cases where it's not that you can't hoist things outside of loops, it's that you have to read things exactly *once*, and then use that particular value (ie the compiler can't be allowed to reload the value because it may change). So any *particular* value we read is valid, but we can't allow the compiler to do anything but a single access. So we have this beauty: include/linux/compiler.h: #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x)) which does that for us (quite often, it's not used as-is: a more common case is using the "rcu_access_pointer()" inline function that not only does that ACCESS_ONCE() but also has the capability to enable run-time dynamic verification that you are in a proper RCU read locked section etc) We also have tons of (very subtle) code that actually creates locks and memory ordering etc. They usually just use the big "barrier()" approach to telling the compiler not to combine or move memory accesses around it. And I realize that compiler people tend to think that loop hoisting etc is absolutely critical for performance, and some big hammer like "barrier()" makes a compiler person wince. You think it results in horrible code generation problems. It really doesn't. Loops are fairly unusual in the kernel to begin with, and the compiler barriers are a total non-issue. We have much more problems with the actual CPU barriers that can be *very* expensive on some architectures, and we work a lot at avoiding those and avoiding cacheline ping-pong issues etc. >> > No vague assumptions with lots of hand-waving. >> >> So here's basically what the kernel needs: >> >> - if we don't touch a field, the compiler doesn't touch it. > > "we don't touch a field": what does that mean precisely? Never touch it > in the same thread? Same function? Same basic block? Between two > sequence points? When crossing synchronizing code? For example, in the > above code, can we touch it earlier/later? Why are you trying to make it more complex than it is? If the source code doesn't write to it, the assembly the compiler generates doesn't write to it. Don't try to make it anything more complicated. This has *nothing* to do with threads or functions or anything else. If you do massive inlining, and you don't see any barriers or conditionals or other reasons not to write to it, just write to it. Don't try to appear smart and make this into something it isn't. Look at the damn five-line example of the bug. FIX THE BUG. Don't try to make it anything bigger than a stupid compiler bug. Don't try to make this into a problem it simply isn't. Linus
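For readers outside the kernel, typical use of the two primitives Linus quotes looks like this. The macros are the kernel's own definitions from the message above; the list-walking code and names around them are invented for illustration, and typeof/asm are GCC extensions.

    #define barrier() __asm__ __volatile__("": : :"memory")
    #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

    struct item { struct item *next; int data; };
    struct item *head;   /* updated concurrently by other CPUs */
    int flag;

    int first_data(void)
    {
        /* Read the pointer exactly once; the compiler may not reload
         * it, so every later use sees the same snapshot. */
        struct item *p = ACCESS_ONCE(head);
        return p ? p->data : -1;
    }

    void wait_for_flag(void)
    {
        /* The barrier forces the compiler to re-read flag on every
         * iteration instead of hoisting the load out of the loop. */
        while (!flag)
            barrier();
    }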
RE: Memory corruption due to word sharing
> From: Linus Torvalds > > > > We need a proper memory model. > > Not really. > > The fact is, the kernel will happily take the memory model of the > underlying hardware. Trying to impose some compiler description of the > memory model is actually horribly bad, because it automatically also > involves compiler synchronization points - which will try to be > portable and easy to understand, and quite frankly, that will > automatically result in what is technically known as a "shitload of > crap". The C11 memory model potentially adds overhead in only two cases: 1. When current code involves touching a field that wouldn't otherwise be touched. There are odd cases in which this measurably slows down code, but I think all agree that we need it. In addition to bitfields, it can affect speculatively promoting a value to a register in a loop, which at least older versions of gcc also do. 2. When you use atomic operations for racing accesses. And in that case, if you really want it, you get a lot of control over memory ordering. Atomic operations that specify "memory_order_relaxed" should only have three kinds of impact: - They ensure the access is indivisible. - They tell the compiler that the location may be concurrently modified and it should refrain from optimizations that will break if it is. - They enforce cache coherence, i.e. single-variable ordering. This is implicit on most architectures, but requires ld.acq on Itanium. IIRC, Paul McKenney argued convincingly that this was essential for preserving programmer sanity. > > Now, a strict memory model is fine for portability, and to make it > simple for programmers. I actually largely approve of the Java memory > model approach, even if it has been horribly problematic and has some > serious problems. But it's not something that is acceptable for the > kernel - we absolutely do *not* want the compiler to introduce some > memory model, because we are very much working together with whatever > the hardware memory model is, and we do our own serialization > primitives. I think you are somewhat misinterpreting the C11 memory model. Aside from the minor cache coherence issues on one or two architectures, you can still do that, if you like. But I suspect you can also do better in places. > > > No vague assumptions with lots of hand-waving. > > So here's basically what the kernel needs: > > - if we don't touch a field, the compiler doesn't touch it. > >This is the rule that gcc now violates with bitfields. And in other more subtle cases. At least until very recently. > >This is a gcc bug. End of story. The "volatile" example proves it - > anybody who argues otherwise is simply wrong, and is just trying to > make excuses. Agreed. > > - if we need to touch something only once, or care about something > that is touched only conditionally, and we actually need to make sure > that it is touched *only* under that condition, we already go to quite > extreme lengths to tell the compiler so. > >We usually use an access with a "volatile" cast on it or tricks > like that. Or we do the whole "magic asm that says it modified memory > to let the compiler know not to try anything clever" > > - The kernel IS NOT GOING TO MARK DATA STRUCTURES. > > Marking data structures is seriously stupid and wrong. It's not the > *data* that has access rules, it's the *code* that has rules of > access. > > The traditional C "volatile" is misdesigned and wrong. 
We don't > generally mark data volatile, we really mark *code* volatile - which > is why our "volatiles" are in the casts, not on the data structures. > > Stuff that is "volatile" in one context is not volatile in another. If > you hold a RCU write lock, it may well be entirely stable, and marking > it volatile is *wrong*, and generating code as if it was volatile is > pure and utter shit. > > On the other hand, if you are touching *the*very*same* field while you > are only read-locked for RCU, it may well be one of those "this has to > be read by accessing it exactly once". > > And we do all this correctly in the kernel. Modulo bugs, of course, > but the fundamental rule really is: "atomicity or volatility is about > CODE, not DATA". The preferred C11 model is clearly to mark data that potentially participates in unprotected concurrent accesses, and then to annotate accesses that require less care, e.g. because we know they can't occur concurrently with other accesses. Maybe this is the wrong approach for the kernel, but to me that seems like a safer approach. We did consider the alternative model, and tried to ensure that casts to atomic could also work on as many platforms as possible, though we can't guarantee that. (There is currently no specification for atomic accesses that says "not concurrently accessed". We've thought about it. I would argue for adding it if there were important cases in which memory_order_relaxed is demonstrably not
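Hans's three properties of memory_order_relaxed, shown with C11 <stdatomic.h>. This is a hedged sketch mapping his prose onto code; the variable and function names are invented.

    #include <stdatomic.h>

    atomic_int progress;

    /* Writer: the store is indivisible, and the compiler knows the
     * location may be read concurrently, so it must not tear the
     * store or invent extra ones. */
    void advance(int v)
    {
        atomic_store_explicit(&progress, v, memory_order_relaxed);
    }

    /* Reader: relaxed loads impose no ordering against *other*
     * variables, but single-variable cache coherence means repeated
     * loads of progress can never appear to go backwards in time.
     * That coherence is free on most architectures and costs an
     * ld.acq on Itanium, as Hans notes. */
    int poll_progress(void)
    {
        return atomic_load_explicit(&progress, memory_order_relaxed);
    }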
Re: Memory corruption due to word sharing
On Wed, Feb 1, 2012 at 12:53 PM, Torvald Riegel wrote: > > For volatile, I agree. > > However, the original btrfs example was *without* a volatile, and that's > why I raised the memory model point. This triggered an error in a > concurrent execution, so that's memory model land, at least in C > language standard. Sure. The thing is, if you fix the volatile problem, you'll almost certainly fix our problem too. The whole "compilers actually do reasonable things" approach really does work in reality. It in fact works a lot better than reading some spec and trying to figure out if something is "valid" or not, and having fifteen different compiler writers and users disagree about what the meaning of the word "is" is in some part of it. I'm not kidding. With specs, there really *are* people who spend years discussing what the meaning of the word "access" is or similar. Combine that with a big spec that is 500+ pages in size and then try to apply that all to a project that is 15 million lines of code and sometimes *knowingly* has to do things that it simply knows are outside the spec, and the discussion about these kinds of details is just mental masturbation. >> We do end up doing >> much more aggressive threading, with models that C11 simply doesn't >> cover. > > Any specific examples for that would be interesting. Oh, one of my favorite (NOT!) pieces of code in the kernel is the implementation of the smp_read_barrier_depends() macro, which on every single architecture except for one (alpha) is a no-op. We have basically 30 or so empty definitions for it, and I think we have something like five uses of it. One of them, I think, is performance critical, and the reason for that macro existing. What does it do? The semantics are that it's a read barrier between two different reads that we want to happen in order wrt two writes on the writing side (the writing side also has to have a "smp_wmb()" to order those writes). But the reason it isn't a simple read barrier is that the reads are actually causally *dependent*, ie we have code like first_read = read_pointer; smp_read_barrier_depends(); second_read = *first_read; and it turns out that on pretty much all architectures (except for alpha), the *data*dependency* will already guarantee that the CPU reads the thing in order. And because a read barrier can actually be quite expensive, we don't want to have a read barrier for this case. But alpha? Its memory consistency is so broken that even the data dependency doesn't actually guarantee cache access order. It's strange, yes. No, it's not that alpha does some magic value prediction and can do the second read without having even done the first read first to get the address. What's actually going on is that the cache itself is unordered, and without the read barrier, you may get a stale version from the cache even if the writes were forced (by the write barrier in the writer) to happen in the right order. You really want to try to describe issues like this in your memory consistency model? No you don't. Nobody will ever really care, except for crazy kernel people. And quite frankly, not even kernel people care: we have a fairly big kernel developer community, and the people who actually talk about memory ordering issues can be counted on one hand. There's the "RCU guy" who writes the RCU helper functions, and hides the proper serializing code into those helpers, so that normal mortal kernel people don't have to care, and don't even have to *know* how ignorant they are about the things. 
And that's also why the compiler shouldn't have to care. It's a really small esoteric detail, and it can be hidden in a header file and a set of library routines. Teaching the compiler about crazy memory ordering would just not be worth it. 99.99% of all programmers will never need to understand any of it, they'll use the locking primitives and follow the rules, and the code that makes it all work is basically invisible to them. Linus
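The publish/subscribe pattern Linus sketches, written out as kernel-style code. smp_wmb() and smp_read_barrier_depends() are the kernel macros under discussion (not defined here); the struct and function names are invented for illustration.

    struct obj { int data; };
    struct obj *shared_ptr;

    /* Writer: initialize the object, then publish the pointer.  The
     * write barrier orders the two stores as seen by other CPUs. */
    void publish(struct obj *newp)
    {
        newp->data = 42;
        smp_wmb();
        shared_ptr = newp;
    }

    /* Reader: the second read depends on the value of the first, and
     * that data dependency already orders them on every architecture
     * except alpha, where this macro expands to a real barrier. */
    int consume(void)
    {
        struct obj *p = shared_ptr;
        smp_read_barrier_depends();
        return p->data;
    }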
Re: Memory corruption due to word sharing
On Wed, 2012-02-01 at 12:59 -0800, Linus Torvalds wrote: > On Wed, Feb 1, 2012 at 12:41 PM, Torvald Riegel wrote: > > > > You do rely on the compiler to do common transformations I suppose: > > hoist loads out of loops, CSE, etc. How do you expect the compiler to > > know whether they are allowed for a particular piece of code or not? > > We have barriers. > > Compiler barriers are cheap. They look like this: [snip] I know, and I've programmed my chunk of concurrent code with those and architecture-specific code too (or libatomic-ops for a bit more ready-to-use portability). > It's not the only thing we do. We have cases where it's not that you > can't hoist things outside of loops, it's that you have to read things > exactly *once*, and then use that particular value (ie the compiler > can't be allowed to reload the value because it may change). So any > *particular* value we read is valid, but we can't allow the compiler > to do anything but a single access. That's what an access to an atomics-typed var would give you as well. > And I realize that compiler people tend to think that loop hoisting > etc is absolutely critical for performance, and some big hammer like > "barrier()" makes a compiler person wince. You think it results in > horrible code generation problems. > > It really doesn't. Loops are fairly unusual in the kernel to begin > with, and the compiler barriers are a total non-issue. This is interesting, because GCC is currently not optimizing across atomics, but we're considering working on this later. Do you have real data about this, or is this based on analysis of samples of compiled code? > We have much > more problems with the actual CPU barriers that can be *very* > expensive on some architectures, and we work a lot at avoiding those > and avoiding cacheline ping-pong issues etc. Well, obviously... > > >> > No vague assumptions with lots of hand-waving. > >> > >> So here's basically what the kernel needs: > >> > >> - if we don't touch a field, the compiler doesn't touch it. > > > > "we don't touch a field": what does that mean precisely? Never touch it > > in the same thread? Same function? Same basic block? Between two > > sequence points? When crossing synchronizing code? For example, in the > > above code, can we touch it earlier/later? > > Why are you trying to make it more complex than it is? > > If the source code doesn't write to it, the assembly the compiler > generates doesn't write to it. If it would be so easy, then answering my questions would have been easy too, right? Which piece of source code? The whole kernel? > > Don't try to make it anything more complicated. This has *nothing* to > do with threads or functions or anything else. > > If you do massive inlining, and you don't see any barriers or > conditionals or other reasons not to write to it, just write to it. And a precise definition of these reasons is what we need to agree on. > Don't try to appear smart and make this into something it isn't. Even if I had done it, it wouldn't be as rude as trying to imply that other people aren't smart.
RE: Memory corruption due to word sharing
> From: Linus Torvalds > Don't try to make it anything more complicated. This has *nothing* to > do with threads or functions or anything else. > > If you do massive inlining, and you don't see any barriers or > conditionals or other reasons not to write to it, just write to it. > > Don't try to appear smart and make this into something it isn't. > > Look at the damn five-line example of the bug. FIX THE BUG. Don't try > to make it anything bigger than a stupid compiler bug. Don't try to > make this into a problem it simply isn't. My impression is that all of us not on the hook to fix this are in violent agreement on this example. Here are some more interesting ones that illustrate the issues (all declarations are non-local, unless stated otherwise): struct { char a; int b:9; int c:7; char d; } x; Is x.b = 1 allowed to overwrite x.a? C11 says no, essentially requiring two byte stores. Gcc currently does so. I'm not sure I understand Linus' position here. int count; /* p and q are local */ for (q = p; q != 0; q = q->next) if (q->data > 0) ++count; Can count be promoted to a register, and thus written even if there are no positive elements? C11 says no. gcc at least used to do this. for (q = p; q != 0; q = q->next) { ++count; if (rare_cond) f(); } Same question, with count saved and restored around the call to f() (which might include a fence). C11 says no. I think Linus is also arguing for no. for (i = 0; i < 1000; ++i) { if (i%2) a[i] = i; } Can I vectorize the loop by writing back the original even values, and thus writing all entries of the array? C11 and Linus both say no. My impression is that we are generally in agreement. Hans
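Hans's last example, the disallowed vectorization, expressed as the scalar C it would amount to. This is a hand-written sketch of the transformation, not actual vectorizer output.

    int a[1000];

    /* To use full-width vector stores, even-indexed entries are
     * "written back" with the values just read from them.  Those
     * write-backs race with any concurrent writer of the even
     * entries, which is why C11 forbids the transformation. */
    void fill_odd_vectorized(void)
    {
        int i;
        for (i = 0; i < 1000; i += 2) {
            a[i] = a[i];          /* write-back of an untouched entry */
            a[i + 1] = i + 1;     /* the store the source asked for   */
        }
    }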
Re: Memory corruption due to word sharing
On Wed, 2012-02-01 at 13:20 -0800, Linus Torvalds wrote: > On Wed, Feb 1, 2012 at 12:53 PM, Torvald Riegel wrote: > > > > For volatile, I agree. > > > > However, the original btrfs example was *without* a volatile, and that's > > why I raised the memory model point. This triggered an error in a > > concurrent execution, so that's memory model land, at least in C > > language standard. > > Sure. The thing is, if you fix the volatile problem, you'll almost > certainly fix our problem too. > > The whole "compilers actually do reasonable things" approach really > does work in reality. It in fact works a lot better than reading some > spec and trying to figure out if something is "valid" or not, and > having fifteen different compiler writers and users disagree about > what the meaning of the word "is" is in some part of it. That's why researchers have formalized the model in order to verify whether it's sane. And they found bugs / ambiguities in it: http://www.cl.cam.ac.uk/~pes20/cpp/ The standards text might still be a typical standards text for various reasons. But it enables having a formal version of it too, and discussions about this formal version. Personally, I find this more formal model easier to work with, exactly because it is more precise than prose. > > >> We do end up doing > >> much more aggressive threading, with models that C11 simply doesn't > >> cover. > > > > Any specific examples for that would be interesting. > > Oh, one of my favorite (NOT!) pieces of code in the kernel is the > implementation of the > > smp_read_barrier_depends() > > macro, which on every single architecture except for one (alpha) is a no-op. > > We have basically 30 or so empty definitions for it, and I think we > have something like five uses of it. One of them, I think, is > performance critical, and the reason for that macro existing. > > What does it do? The semantics are that it's a read barrier between two > different reads that we want to happen in order wrt two writes on the > writing side (the writing side also has to have a "smp_wmb()" to order > those writes). But the reason it isn't a simple read barrier is that > the reads are actually causally *dependent*, ie we have code like > > first_read = read_pointer; > smp_read_barrier_depends(); > second_read = *first_read; > > and it turns out that on pretty much all architectures (except for > alpha), the *data*dependency* will already guarantee that the CPU > reads the thing in order. And because a read barrier can actually be > quite expensive, we don't want to have a read barrier for this case. I don't have time to look at this in detail right now, but it looks roughly close to C++11's memory_order_consume to me, which is somewhat like an acquire, but just for subsequent data-dependent loads. Added for performance reasons on some architecture AFAIR. > You really want to try to describe issues like this in your memory > consistency model? No you don't. Nobody will ever really care, except > for crazy kernel people. And quite frankly, not even kernel people > care: we have a fairly big kernel developer community, and the people > who actually talk about memory ordering issues can be counted on one > hand. There's the "RCU guy" who writes the RCU helper functions, and > hides the proper serializing code into those helpers, so that normal > mortal kernel people don't have to care, and don't even have to *know* > how ignorant they are about the things. I'm not a kernel person, and I do care about it. 
Userspace is synchronizing too, and not just with pthread mutexes. > And that's also why the compiler shouldn't have to care. It's a really > small esoteric detail, and it can be hidden in a header file and a set > of library routines. Teaching the compiler about crazy memory ordering > would just not be worth it. 99.99% of all programmers will never need > to understand any of it, they'll use the locking primitives and follow > the rules, and the code that makes it all work is basically invisible > to them. I disagree (though it would be nice if it were that esoteric), but that's off-topic...
Re: Memory corruption due to word sharing
On Wed, Feb 1, 2012 at 1:24 PM, Torvald Riegel wrote: >> It's not the only thing we do. We have cases where it's not that you >> can't hoist things outside of loops, it's that you have to read things >> exactly *once*, and then use that particular value (ie the compiler >> can't be allowed to reload the value because it may change). So any >> *particular* value we read is valid, but we can't allow the compiler >> to do anything but a single access. > > That's what an access to an atomics-typed var would give you as well. Right. The concept of "atomic" access is required. The problem I have with treating this as a C11 issue is that people aren't using C11 compilers, and won't for a long while. So quite frankly, it won't be reasonable for the kernel people to say "use a C11 version" for years to come. And the thing is, "atomic" doesn't actually seem to buy us anything in the kernel afaik. We're perfectly happy to use "volatile". We don't want the data structure annotated, we want the code annotated, so the semantic differences between 'volatile' and 'atomic' seem to be pretty irrelevant. Sure, there is probably some subtle difference, but on the whole, it all boils down to "we can't depend on new rules for several years, and even when we can, it doesn't look like they'd buy us a lot, because we have to play so many games around them *anyway*". And don't get me wrong. I don't think that means that C11 is *bad*. It's just that the kernel is very different from most other projects. We have to have those crazy architecture-specific header files and random synchronization macros etc anyway. C11 is not - I think - even meant to be geared towards the Linux kernel kind of crazy use. We really do some odd things; adding compiler features for them is mostly a mistake. asm() takes care of a lot of the oddities for us, and it's not like all of them are about memory accesses or concurrency either. I do think that the threading issues in C11 are going to help us kernel people, because as more people think about issues with concurrency, they really *will* be hitting some of the issues we've been having. So it clearly forces clarification of just what the word "access" implies, for example, which is certainly not going to be bad for the kernel. >> It really doesn't. Loops are fairly unusual in the kernel to begin >> with, and the compiler barriers are a total non-issue. > > This is interesting, because GCC is currently not optimizing across > atomics but we were considering working on this later. > Do you have real data about this, or is this based on analysis of > samples of compiled code? So the kernel literally doesn't tend to *have* loops. Sure, there are exceptions: there are the obvious loops implied in "memcpy()" and friends, but almost all kernel activity tends to be about individual events. A really hot loop under normal loads would be the loop that calculates a hash value over a pathname component. Or the loop that loops over those components. Loop count? Think about it. Most pathnames have one or two components. And while "8.3" is clearly too short for a filename, it's still true that a lot of filenames (especially directories, which is what you end up traversing many times) are in a kind of awkward 5-8 character range. So the kernel loads tend to have loop counts in the single digits - *maybe* double digits. We use hash-tables a lot, so looping to find something is usually just an entry or two, and the cost is almost never the loop, it's the cache miss from the pointer chasing. 
Most "looping" behavior in the kernel is actually user space calling the kernel in a loop. So the kernel may see "loops" in that sense, but it's not loopy kernel code itself. Even really hot kernel-only code is often not about loops. You might get a million network packets a second, and with interrupt mitigation you would not necessarily treat them each as individual events with its own individual interrupt (which does happen too!), but you'd still not see it as a loop over a million packets. You'd see them come in as a few packets at a time, and looping until you've handled them. And the loops we have tend to not be very amenable to nice compiler optimizations either. The kernel almost never works with arrays, for example. The string in a pathname component is an array, yes, but *most* kernel data structures loop over our doubly linked lists or over a hash-chain. And it tends to be really hard to do much loop optimization over list traversal like that. >> If the source code doesn't write to it, the assembly the compiler >> generates doesn't write to it. > > If it would be so easy, then answering my questions would have been easy > too, right? Which piece of source code? The whole kernel? So make the rule be: "in one single function, with all you can see there", and I'll be happy. Do as much inlining as you want (although the kernel does have "noinline" too). With all the normal sequence point and memory invalidation
RE: Memory corruption due to word sharing
> From: Torvald Riegel > > Oh, one of my favorite (NOT!) pieces of code in the kernel is the > > implementation of the > > > > smp_read_barrier_depends() > > > > macro, which on every single architecture except for one (alpha) is a > > no-op. > > > > We have basically 30 or so empty definitions for it, and I think we > > have something like five uses of it. One of them, I think, is > > performance critical, and the reason for that macro existing. > > > > What does it do? The semantics are that it's a read barrier between two > > different reads that we want to happen in order wrt two writes on the > > writing side (the writing side also has to have a "smp_wmb()" to order > > those writes). But the reason it isn't a simple read barrier is that > > the reads are actually causally *dependent*, ie we have code like > > > > first_read = read_pointer; > > smp_read_barrier_depends(); > > second_read = *first_read; > > > > and it turns out that on pretty much all architectures (except for > > alpha), the *data*dependency* will already guarantee that the CPU > > reads the thing in order. And because a read barrier can actually be > > quite expensive, we don't want to have a read barrier for this case. > > I don't have time to look at this in detail right now, but it looks > roughly close to C++11's memory_order_consume to me, which is somewhat > like an acquire, but just for subsequent data-dependent loads. Added > for performance reasons on some architecture AFAIR. It's intended to address the same problem, though somewhat differently. (I suspect there was overlap in the people involved?) One reason that C11 took a slightly different path is that compilers can, and sometimes do, remove dependencies, making smp_read_barrier_depends brittle unless it also imposes compiler constraints. Hans
Re: Memory corruption due to word sharing
On Wed, Feb 1, 2012 at 1:25 PM, Boehm, Hans wrote: > > Here are some more interesting ones that illustrate the issues (all > declarations are non-local, unless stated otherwise): > > struct { char a; int b:9; int c:7; char d; } x; > > Is x.b = 1 allowed to overwrite x.a? C11 says no, essentially requiring two > byte stores. Gcc currently does so. I'm not sure I understand Linus' > position here. So I like the fact that the C11 stance seems very strict. But honestly I also think it sounds like C11 is actually more strict than I would necessarily be. I really do think that bitfields imply "int", both historically and technically. So I would not see the problem with treating the bitfields as part of an 'int' and thus overwriting a (and d) when writing to b. That's how bitfields work! They are fields of an int. It would be good if it perhaps caused a *warning*, and gave a good way to avoid it. For example, while I think using any other base type than 'int' is technically an extension of the C bitfield rules (but whatever, I don't have any specs in front of me), I think a warning together with allowing the user to rewrite it as struct { char a; char d; short b:9; short c:7; } x; would make it clear that now a write to 'b' cannot validly overwrite 'a' or 'd'. But forcing the compiler to do two (and sometimes three!) byte accesses sounds excessive. The reason I think the int flag:1; int othervariable; overwriting of "othervariable" is so *obviously* a bug is exactly that bitfields are historically about 'int', and no 'long' was there anywhere, so using a 64-bit access is clearly not sane in any way, shape or form. I dunno. Linus
Re: Memory corruption due to word sharing
On Wed, Feb 01, 2012 at 12:59:24PM -0800, Linus Torvalds wrote: > On Wed, Feb 1, 2012 at 12:41 PM, Torvald Riegel wrote: > > > > You do rely on the compiler to do common transformations I suppose: > > hoist loads out of loops, CSE, etc. How do you expect the compiler to > > know whether they are allowed for a particular piece of code or not? > > We have barriers. > > Compiler barriers are cheap. They look like this: > > include/linux/compiler-gcc.h: > #define barrier() __asm__ __volatile__("": : :"memory") > > and if we don't allow hoisting, we'll often use something like that. > > It's not the only thing we do. We have cases where it's not that you > can't hoist things outside of loops, it's that you have to read things > exactly *once*, and then use that particular value (ie the compiler > can't be allowed to reload the value because it may change). So any > *particular* value we read is valid, but we can't allow the compiler > to do anything but a single access. > > So we have this beauty: > > include/linux/compiler.h: > #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x)) My (perhaps forlorn and naive) hope is that C++11 memory_order_relaxed will eventually allow ACCESS_ONCE() to be upgraded so that (for example) access-once increments can generate a single increment-memory instruction on x86. > which does that for us (quite often, it's not used as-is: a more > common case is using the "rcu_access_pointer()" inline function that > not only does that ACCESS_ONCE() but also has the capability to enable > run-time dynamic verification that you are in a proper RCU read locked > section etc) And I also hope that rcu_access_pointer(), rcu_dereference(), and friends can eventually be written in terms of C++11 memory_order_consume so that some of the less-sane optimizations can actually be applied without breaking the kernel. Of course, this cannot happen immediately. First, gcc must fully support the C++11 features of interest, the inevitable bugs must be found and fixed, and the optimizer will no doubt need to be reworked a bit before the kernel can make use of these features. New architectures might eventually define things like atomic_inc() in terms of C++11 atomics, but let's start with the straightforward stuff as and if it makes sense. Thanx, Paul > We also have tons of (very subtle) code that actually creates locks > and memory ordering etc. They usually just use the big "barrier()" > approach to telling the compiler not to combine or move memory > accesses around it. > > And I realize that compiler people tend to think that loop hoisting > etc is absolutely critical for performance, and some big hammer like > "barrier()" makes a compiler person wince. You think it results in > horrible code generation problems. > > It really doesn't. Loops are fairly unusual in the kernel to begin > with, and the compiler barriers are a total non-issue. We have much > more problems with the actual CPU barriers that can be *very* > expensive on some architectures, and we work a lot at avoiding those > and avoiding cacheline ping-pong issues etc. > > >> > No vague assumptions with lots of hand-waving. > >> > >> So here's basically what the kernel needs: > >> > >> - if we don't touch a field, the compiler doesn't touch it. > > > > "we don't touch a field": what does that mean precisely? Never touch it > > in the same thread? Same function? Same basic block? Between two > > sequence points? When crossing synchronizing code? For example, in the > > above code, can we touch it earlier/later? 
> > Why are you trying to make it more complex than it is? > > If the source code doesn't write to it, the assembly the compiler > generates doesn't write to it. > > Don't try to make it anything more complicated. This has *nothing* to > do with threads or functions or anything else. > > If you do massive inlining, and you don't see any barriers or > conditionals or other reasons not to write to it, just write to it. > > Don't try to appear smart and make this into something it isn't. > > Look at the damn five-line example of the bug. FIX THE BUG. Don't try > to make it anything bigger than a stupid compiler bug. Don't try to > make this into a problem it simply isn't. > > Linus
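The upgrade Paul is hoping for, sketched with C11 <stdatomic.h>. This is hypothetical, not current ACCESS_ONCE() code: it shows why relaxed accesses could compile better than volatile ones.

    #include <stdatomic.h>

    atomic_long counter;

    /* An ACCESS_ONCE()-style increment as two relaxed accesses.  Each
     * access must be individually indivisible, but the pair need not
     * be atomic as a unit, so a compiler is free to fuse them into a
     * single non-LOCKed increment-memory instruction on x86.  A
     * volatile load/store pair forbids exactly that fusion. */
    void bump(void)
    {
        long tmp = atomic_load_explicit(&counter, memory_order_relaxed);
        atomic_store_explicit(&counter, tmp + 1, memory_order_relaxed);
    }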
Re: Memory corruption due to word sharing
On Wed, Feb 1, 2012 at 2:45 PM, Paul E. McKenney wrote: > > My (perhaps forlorn and naive) hope is that C++11 memory_order_relaxed > will eventually allow ACCESS_ONCE() to be upgraded so that (for example) > access-once increments can generate a single increment-memory instruction > on x86. I don't think that is a semantic issue. gcc could do it *today* with volatile accesses. It doesn't, because volatiles are scary and basically disable a lot of optimizations. Why would memory ordering be substantially different just because it has a different name? > New architectures might eventually define things like atomic_inc() > in terms of C++11 atomics, but let's start with the straightforward stuff > as and if it makes sense. SMP-atomic or percpu atomic? Or both? We need both variants in the kernel. If the compiler generates one of them for us, that doesn't really help much. Linus
Re: Access to source code from an analyser
On 12-01-19 05:06 , Alberto Lozano Alelu wrote: > I need to have source text. For that, you will need to read the file or get the line from libcpp's buffers. Accessing libcpp's buffers will not be trivial, since they are private structures. You're better off opening the file and reading from that line/col. Diego.
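A minimal sketch of that suggestion in C. The helper is hypothetical (not a GCC plugin API), error handling is trimmed, and a fixed-size static buffer is assumed for brevity.

    #include <stdio.h>

    /* Return line `lineno` (1-based) of `path`, or NULL on failure. */
    static char *get_source_line(const char *path, int lineno)
    {
        static char buf[4096];
        FILE *f = fopen(path, "r");
        if (!f)
            return NULL;
        /* Read and discard lines until the requested one lands in buf. */
        while (lineno-- > 0 && fgets(buf, sizeof buf, f))
            ;
        fclose(f);
        return lineno < 0 ? buf : NULL;   /* -1 means we read far enough */
    }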