Re: [RFC][AArch64] function prologue analyzer in linux kernel
On Fri, Jan 08, 2016 at 02:36:32PM +0900, AKASHI Takahiro wrote: > On 01/07/2016 11:56 PM, Richard Earnshaw (lists) wrote: > >On 07/01/16 14:22, Will Deacon wrote: > >>On Thu, Dec 24, 2015 at 04:57:54PM +0900, AKASHI Takahiro wrote: > >>>So I'd like to introduce a function prologue analyzer to determine > >>>a size allocated by a function's prologue and deduce it from "Depth". > >>>My implementation of this analyzer has been submitted to > >>>linux-arm-kernel mailing list[1]. > >>>I borrowed some ideas from gdb's analyzer[2], especially a loop of > >>>instruction decoding as well as stop of decoding at exiting a basic block, > >>>but implemented my own simplified one because gdb version seems to do > >>>a bit more than what we expect here. > >>>Anyhow, since it is somewhat heuristic (and may not be maintainable for > >>>a long term), could you review it from a broader viewpoint of toolchain, > >>>please? > >>> > >>My main issue with this is that we cannot rely on the frame layout > >>generated by the compiler and there's little point in asking for > >>commitment here. Therefore, the heuristics will need updating as and > >>when we identify new frames that we can't handle. That's pretty fragile > >>and puts us on the back foot when faced with newer compilers. This might > >>be sustainable if we don't expect to encounter much variation, but even > >>that would require some sort of "buy-in" from the various toolchain > >>communities. > >> > >>GCC already has an option (-fstack-usage) to determine the stack usage > >>on a per-function basis and produce a report at build time. Why can't > >>we use that to provide the information we need, rather than attempt to > >>compute it at runtime based on your analyser? > >> > >>If -fstack-usage is not sufficient, understanding why might allow us to > >>propose a better option. > > > >Can you not use the dwarf frame unwind data? 
That's always sufficient > >to recover the CFA (canonical frame address - the value in SP when > >executing the first instruction in a function). It seems to me it's > >unlikely you're going to need something that's an exceedingly high > >performance operation. > > Thank you for your comment. > Yeah, but we need some utility routines to handle unwind data(.debug_frame). > In fact, some guy has already attempted to merge (part of) libunwind into > the kernel[1], but it was rejected by the kernel community (including Linus > if I correctly remember). It seems that they thought the code was still buggy. The ARC guys seem to have sneaked something in for their architecture: http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/arc/kernel/unwind.c so it might not be impossible if we don't require all the bells and whistles of libunwind. > That is one of reasons that I wanted to implement my own analyzer. I still don't understand why you can't use fstack-usage. Can you please tell me why that doesn't work? Am I missing something? Will
Re: [RFC][AArch64] function prologue analyzer in linux kernel
On Tue, Jan 12, 2016 at 03:11:29PM +0900, AKASHI Takahiro wrote: > Will, > > On 01/09/2016 12:53 AM, Will Deacon wrote: > >On Fri, Jan 08, 2016 at 02:36:32PM +0900, AKASHI Takahiro wrote: > >>On 01/07/2016 11:56 PM, Richard Earnshaw (lists) wrote: > >>>On 07/01/16 14:22, Will Deacon wrote: > >>>>On Thu, Dec 24, 2015 at 04:57:54PM +0900, AKASHI Takahiro wrote: > >>>>>So I'd like to introduce a function prologue analyzer to determine > >>>>>a size allocated by a function's prologue and deduce it from "Depth". > >>>>>My implementation of this analyzer has been submitted to > >>>>>linux-arm-kernel mailing list[1]. > >>>>>I borrowed some ideas from gdb's analyzer[2], especially a loop of > >>>>>instruction decoding as well as stop of decoding at exiting a basic > >>>>>block, > >>>>>but implemented my own simplified one because gdb version seems to do > >>>>>a bit more than what we expect here. > >>>>>Anyhow, since it is somewhat heuristic (and may not be maintainable for > >>>>>a long term), could you review it from a broader viewpoint of toolchain, > >>>>>please? > >>>>> > >>>>My main issue with this is that we cannot rely on the frame layout > >>>>generated by the compiler and there's little point in asking for > >>>>commitment here. Therefore, the heuristics will need updating as and > >>>>when we identify new frames that we can't handle. That's pretty fragile > >>>>and puts us on the back foot when faced with newer compilers. This might > >>>>be sustainable if we don't expect to encounter much variation, but even > >>>>that would require some sort of "buy-in" from the various toolchain > >>>>communities. > >>>> > >>>>GCC already has an option (-fstack-usage) to determine the stack usage > >>>>on a per-function basis and produce a report at build time. Why can't > >>>>we use that to provide the information we need, rather than attempt to > >>>>compute it at runtime based on your analyser? 
> >>>>
> >>>>If -fstack-usage is not sufficient, understanding why might allow us to
> >>>>propose a better option.
> >>>
> >>>Can you not use the dwarf frame unwind data? That's always sufficient
> >>>to recover the CFA (canonical frame address - the value in SP when
> >>>executing the first instruction in a function). It seems to me it's
> >>>unlikely you're going to need something that's an exceedingly high
> >>>performance operation.
> >>
> >>Thank you for your comment.
> >>Yeah, but we need some utility routines to handle the unwind data
> >>(.debug_frame). In fact, somebody has already attempted to merge (part
> >>of) libunwind into the kernel[1], but it was rejected by the kernel
> >>community (including Linus, if I remember correctly). It seems that
> >>they thought the code was still buggy.
> >
> >The ARC guys seem to have sneaked something in for their architecture:
> >
> >  http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/arc/kernel/unwind.c
> >
> >so it might not be impossible if we don't require all the bells and
> >whistles of libunwind.
>
> Thanks. I didn't notice this code.
>
> >>That is one of the reasons that I wanted to implement my own analyzer.
> >
> >I still don't understand why you can't use fstack-usage. Can you please
> >tell me why that doesn't work? Am I missing something?
>
> I don't know how gcc calculates the usage here, but I guess it would be
> more robust than my analyzer.
>
> The issues that come to mind are
> - -fstack-usage generates a separate output file, *.su, and so we have to
>   manage them to be incorporated in the kernel binary.

That doesn't sound too bad to me. How much data are we talking about here?

>   This implies that (common) kernel makefiles might have to be changed
>   a bit.
> - worse still, what about the kernel module case? We will have no way to
>   let the kernel know the stack usage without adding an extra step at
>   loading.
We can easily add a new __init section to modules, which is a table representing the module functions and their stack sizes (like we do for other things like alternatives). We'd just then need to slurp this information at load time and throw it into an rbtree or something. Will
Re: [RFC][AArch64] function prologue analyzer in linux kernel
On Wed, Jan 13, 2016 at 05:13:29PM +0900, AKASHI Takahiro wrote:
> On 01/13/2016 03:04 AM, Will Deacon wrote:
> >On Tue, Jan 12, 2016 at 03:11:29PM +0900, AKASHI Takahiro wrote:
> >>On 01/09/2016 12:53 AM, Will Deacon wrote:
> >>>I still don't understand why you can't use fstack-usage. Can you please
> >>>tell me why that doesn't work? Am I missing something?
> >>
> >>I don't know how gcc calculates the usage here, but I guess it would be
> >>more robust than my analyzer.
> >>
> >>The issues that come to mind are
> >>- -fstack-usage generates a separate output file, *.su, and so we have
> >>  to manage them to be incorporated in the kernel binary.
> >
> >That doesn't sound too bad to me. How much data are we talking about here?
> >
> >>  This implies that (common) kernel makefiles might have to be changed
> >>  a bit.
> >>- worse still, what about the kernel module case? We will have no way to
> >>  let the kernel know the stack usage without adding an extra step at
> >>  loading.
> >
> >We can easily add a new __init section to modules, which is a table
> >representing the module functions and their stack sizes (like we do
> >for other things like alternatives). We'd just then need to slurp this
> >information at load time and throw it into an rbtree or something.
>
> I found another issue.
> Let's think about the 'dynamic storage' case:
>
> $ cat stack.c
> extern long fooX(long a);
> extern long fooY(long b[]);
>
> long foo1(long a) {
>     if (a > 1) {
>         long b[a];          <== Here
>
>         return a + fooY(b);
>     } else {
>         return a + fooX(a);
>     }
> }
>
> Then, -fstack-usage returns 48 for foo1():
>
> $ aarch64-linux-gnu-gcc -fno-omit-frame-pointer -fstack-usage main.c stack.c \
>       -pg -O2 -fasynchronous-unwind-tables
> $ cat stack.su
> stack.c:4:6:foo1 48 dynamic
>
> This indicates that foo1() may use 48 bytes or more, depending on a
> condition.
>
> But in my case (ftrace-based stack tracer), I always expect 32, whether
> we're backtracing from fooY() or from fooX(), because my stack tracer
> estimates:
>
>     (stack pointer) = (callee's frame pointer) + (callee's stack usage)
>
> (in my previous e-mail, '-(minus)' was wrong.)
>
> where (callee's stack usage) is, as I described in my previous e-mail,
> the size of memory initially allocated on the stack by a function's
> prologue, and should not include the size of any dynamically allocated
> area.

According to whom? What's the use in reporting only the prologue size?

Will
Re: Compilers and RCU readers: Once more unto the breach!
Hi Paul,

On Wed, May 20, 2015 at 03:41:48AM +0100, Paul E. McKenney wrote:
> On Tue, May 19, 2015 at 07:10:12PM -0700, Linus Torvalds wrote:
> > On Tue, May 19, 2015 at 6:57 PM, Linus Torvalds wrote:
> > So I think you're better off just saying that operations designed to
> > drop significant bits break the dependency chain, and give things like
> > "& 1" and "(char *)ptr-(uintptr_t)ptr" as examples of such.
> >
> > Making that just an extension of your existing "& 0" language would
> > seem to be natural.
>
> Works for me! I added the following bullet to the list of things
> that break dependencies:
>
>     If a pointer is part of a dependency chain, and if the values
>     added to or subtracted from that pointer cancel the pointer
>     value so as to allow the compiler to precisely determine the
>     resulting value, then the resulting value will not be part of
>     any dependency chain. For example, if p is part of a dependency
>     chain, then ((char *)p-(uintptr_t)p)+65536 will not be.
>
> Seem reasonable?

Whilst I understand what you're saying (the ARM architecture makes these
sorts of distinctions when calling out dependency-based ordering), it
feels like we're dangerously close to defining the difference between a
true and a false dependency. If we want to do this in the context of the
C language specification, you run into issues because you need to
evaluate the program in order to determine data values, and hence the
nature of the dependency.

You tackle this above by saying "to allow the compiler to precisely
determine the resulting value", but I can't see how that can be cleanly
fitted into something like the C language specification. Even if it can,
then we'd need to reword the "?:" treatment that you currently have:

  "If a pointer is part of a dependency chain, and that pointer appears
   in the entry of a ?: expression selected by the condition, then the
   chain extends to the result."

which I think requires the state of the condition to be known statically
if we only want to extend the chain from the selected expression. In the
general case, wouldn't a compiler have to assume that the chain is
extended from both?

Additionally, what about the following code?

  char *x = y ? z : z;

Does that extend a dependency chain from z to x? If so, I can imagine a
CPU breaking that in practice.

> > Humans will understand, and compiler writers won't care. They will
> > either depend on hardware semantics anyway (and argue that your
> > language is tight enough that they don't need to do anything special)
> > or they will turn the consume into an acquire (on platforms that have
> > too weak hardware).
>
> Agreed. Plus the Core Working Group will hammer out the exact wording,
> should this approach meet their approval.

For the avoidance of doubt, I'm completely behind any attempts to tackle
this problem, but I anticipate an uphill struggle getting this text into
the C standard. Is your intention to change the carries-a-dependency
relation to encompass this change?

Cheers,

Will
Re: Compilers and RCU readers: Once more unto the breach!
On Wed, May 20, 2015 at 01:15:22PM +0100, Paul E. McKenney wrote:
> On Wed, May 20, 2015 at 12:47:45PM +0100, Will Deacon wrote:
> > On Wed, May 20, 2015 at 03:41:48AM +0100, Paul E. McKenney wrote:
> > >     If a pointer is part of a dependency chain, and if the values
> > >     added to or subtracted from that pointer cancel the pointer
> > >     value so as to allow the compiler to precisely determine the
> > >     resulting value, then the resulting value will not be part of
> > >     any dependency chain. For example, if p is part of a dependency
> > >     chain, then ((char *)p-(uintptr_t)p)+65536 will not be.
> > >
> > > Seem reasonable?
> >
> > Whilst I understand what you're saying (the ARM architecture makes these
> > sorts of distinctions when calling out dependency-based ordering), it
> > feels like we're dangerously close to defining the difference between a
> > true and a false dependency. If we want to do this in the context of the
> > C language specification, you run into issues because you need to
> > evaluate the program in order to determine data values, and hence the
> > nature of the dependency.
>
> Indeed, something like this does -not- carry a dependency from the
> memory_order_consume load to q:
>
>     char *p, q;
>
>     p = atomic_load_explicit(&gp, memory_order_consume);
>     q = gq + (intptr_t)p - (intptr_t)p;
>
> If this was compiled with -O0, ARM and Power might well carry a
> dependency, but given any optimization, the assembly language would have
> no hint of any such dependency. So I am not seeing any particular danger.

The above is a welcome relaxation over C11, since ARM doesn't even give
you ordering based off false data dependencies. My concern is more to do
with how this can be specified precisely without prohibiting honest
compiler and hardware optimisations.

Out of interest, how do you tackle examples (4) and (5) of (assuming the
reads are promoted to consume loads)?:

  http://www.cl.cam.ac.uk/~pes20/cpp/notes42.html

my understanding is that you permit both outcomes (I appreciate you're
not directly tackling out-of-thin-air, but treatment of dependencies is
heavily related).

> > You tackle this above by saying "to allow the compiler to precisely
> > determine the resulting value", but I can't see how that can be cleanly
> > fitted into something like the C language specification.
>
> I am sure that there will be significant rework from where this document
> is to language appropriate for the standard. Which is why I am glad
> that Jens is taking an interest in this, as he is particularly good at
> producing standards language.

Ok. I'm curious to see how that comes along.

> > Even if it can,
> > then we'd need to reword the "?:" treatment that you currently have:
> >
> >   "If a pointer is part of a dependency chain, and that pointer appears
> >    in the entry of a ?: expression selected by the condition, then the
> >    chain extends to the result."
> >
> > which I think requires the state of the condition to be known statically
> > if we only want to extend the chain from the selected expression. In the
> > general case, wouldn't a compiler have to assume that the chain is
> > extended from both?
>
> In practice, yes, if the compiler cannot determine which expression is
> selected, it must arrange for the dependency to be carried from either,
> depending on the run-time value of the condition. But you would have
> to work pretty hard to create code that did not carry the dependencies
> as required, no?

I'm not sure... you'd require the compiler to perform static analysis of
loops to determine the state of the machine when they exit (if they
exit!) in order to show whether or not a dependency is carried to
subsequent operations. If it can't prove otherwise, it would have to
assume that a dependency *is* carried, and it's not clear to me how it
would use this information to restrict any subsequent
dependency-removing optimisations. I guess that's one for the GCC folks.

> > Additionally, what about the following code?
> >
> >   char *x = y ? z : z;
> >
> > Does that extend a dependency chain from z to x? If so, I can imagine a
> > CPU breaking that in practice.
>
> I am not seeing this. I would expect the compiler to optimize to
> something like this:
>
>     char *x = z;
>
> How does this avoid carrying the dependency? Or are you saying that
> ARM loses the dependency via a store to memory and a later reload?
> That would be a bi
Re: Compilers and RCU readers: Once more unto the breach!
On Wed, May 20, 2015 at 07:16:06PM +0100, Paul E. McKenney wrote: > On Wed, May 20, 2015 at 04:46:17PM +0100, Will Deacon wrote: > > On Wed, May 20, 2015 at 01:15:22PM +0100, Paul E. McKenney wrote: > > > Indeed, something like this does -not- carry a dependency from the > > > memory_order_consume load to q: > > > > > > char *p, q; > > > > > > p = atomic_load_explicit(&gp, memory_order_consume); > > > q = gq + (intptr_t)p - (intptr_t)p; > > > > > > If this was compiled with -O0, ARM and Power might well carry a > > > dependency, but given any optimization, the assembly language would have > > > no hint of any such dependency. So I am not seeing any particular danger. > > > > The above is a welcome relaxation over C11, since ARM doesn't even give > > you ordering based off false data dependencies. My concern is more to do > > with how this can be specified precisely without prohibing honest compiler > > and hardware optimisations. > > That last is the challenge. I believe that I am pretty close, but I am > sure that additional adjustment will be required. Especially given that > we also need the memory model to be amenable to formal analysis. Well, there's still the whole thin-air problem which unfortunately doesn't go away with your proposal... (I was hoping that differentiating between true and false dependencies would solve that, but your set of rules isn't broad enough and I don't blame you at all for that!). > > Out of interest, how do you tackle examples (4) and (5) of (assuming the > > reads are promoted to consume loads)?: > > > > http://www.cl.cam.ac.uk/~pes20/cpp/notes42.html > > > > my understanding is that you permit both outcomes (I appreciate you're > > not directly tackling out-of-thin-air, but treatment of dependencies > > is heavily related). Thanks for taking the time to walk these two examples through. > Let's see... 
> #4 is as follows, given promotion to memory_order_consume and
> (I am guessing) memory_order_relaxed:
>
>     r1 = atomic_load_explicit(&x, memory_order_consume);
>     if (r1 == 42)
>         atomic_store_explicit(&y, r1, memory_order_relaxed);
>     --
>     r2 = atomic_load_explicit(&y, memory_order_consume);
>     if (r2 == 42)
>         atomic_store_explicit(&x, 42, memory_order_relaxed);
>     else
>         atomic_store_explicit(&x, 42, memory_order_relaxed);
>
> The second thread does not have a proper control dependency, even with
> the memory_order_consume load, because both branches assign the same
> value to "x". This means that the compiler is within its rights to
> optimize this into the following:
>
>     r1 = atomic_load_explicit(&x, memory_order_consume);
>     if (r1 == 42)
>         atomic_store_explicit(&y, r1, memory_order_relaxed);
>     --
>     r2 = atomic_load_explicit(&y, memory_order_consume);
>     atomic_store_explicit(&x, 42, memory_order_relaxed);
>
> There is no dependency between the second thread's pair of statements,
> so both the compiler and the CPU are within their rights to optimize
> further as follows:
>
>     r1 = atomic_load_explicit(&x, memory_order_consume);
>     if (r1 == 42)
>         atomic_store_explicit(&y, r1, memory_order_relaxed);
>     --
>     atomic_store_explicit(&x, 42, memory_order_relaxed);
>     r2 = atomic_load_explicit(&y, memory_order_consume);
>
> If the compiler makes this final optimization, even mythical SC hardware
> is within its rights to end up with (r1 == 42 && r2 == 42). Which is
> fine, as far as I am concerned. Or at least something that can be
> lived with.

Agreed.

> On to #5:
>
>     r1 = atomic_load_explicit(&x, memory_order_consume);
>     if (r1 == 42)
>         atomic_store_explicit(&y, r1, memory_order_relaxed);
>
>     r2 = atomic_load_explicit(&y, memory_order_consume);
>     if (r2 == 42)
>         atomic_store_explicit(&x, 42, memory_order_relaxed);
>
> The first thread's accesses are dependency ordered. The second thread's
> ordering is in a corner case that memory-barriers.txt does not cover.
> You are supposed to start control dependencies with READ_ONCE_CTRL(), not
> a memory_order_consume load (AKA rcu_dereference and friends). However,
> Alpha would have a full barrier as part of the memory_order_consume load,
> and the rest of the processors would (one way or another) respect the
> control dependency. And the compiler would have some fun trying to
> break it.
Re: Compilers and RCU readers: Once more unto the breach!
Hi Paul,

On Thu, May 21, 2015 at 09:02:12PM +0100, Paul E. McKenney wrote:
> On Thu, May 21, 2015 at 08:24:22PM +0100, Will Deacon wrote:
> > On Wed, May 20, 2015 at 07:16:06PM +0100, Paul E. McKenney wrote:
> > > On to #5:
> > >
> > >     r1 = atomic_load_explicit(&x, memory_order_consume);
> > >     if (r1 == 42)
> > >         atomic_store_explicit(&y, r1, memory_order_relaxed);
> > >
> > >     r2 = atomic_load_explicit(&y, memory_order_consume);
> > >     if (r2 == 42)
> > >         atomic_store_explicit(&x, 42, memory_order_relaxed);
> > >
> > > The first thread's accesses are dependency ordered. The second thread's
> > > ordering is in a corner case that memory-barriers.txt does not cover.
> > > You are supposed to start control dependencies with READ_ONCE_CTRL(),
> > > not a memory_order_consume load (AKA rcu_dereference and friends).
> > > However, Alpha would have a full barrier as part of the
> > > memory_order_consume load, and the rest of the processors would (one
> > > way or another) respect the control dependency. And the compiler
> > > would have some fun trying to break it.
> >
> > But this is interesting because the first thread is ordered whilst the
> > second is not, so doesn't that effectively forbid the compiler from
> > constant-folding values if it can't prove that there is no dependency
> > chain?
>
> You lost me on this one. Are you suggesting that the compiler
> speculate the second thread's atomic store? That would be very
> bad regardless of dependency chains.
>
> So what constant-folding optimization are you thinking of here?
> If the above example is not amenable to such an optimization, could
> you please give me an example where constant folding would apply
> in a way that is sensitive to dependency chains?

Unless I'm missing something, I can't see what would prevent a compiler
from looking at the code in thread 1 and transforming it into the code
in thread 2 (i.e. constant-folding r1 to 42, given that the taken branch
must mean that r1 == 42). However, such an optimisation breaks the
dependency chain, which means that a compiler needs to walk backwards
to see if there is a dependency chain extending to r1.

> > > So the current Linux memory model would allow (r1 == 42 && r2 == 42),
> > > but I don't know of any hardware/compiler combination that would
> > > allow it. And no, I am -not- going to update memory-barriers.txt for
> > > this litmus test, its theoretical interest notwithstanding! ;-)

Of course, I'm not asking for that at all! I'm just trying to see how
your proposal holds up with the example.

Will
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, Feb 06, 2014 at 06:55:01PM +0000, Ramana Radhakrishnan wrote:
> On 02/06/14 18:25, David Howells wrote:
> >
> > Is it worth considering a move towards using C11 atomics and barriers
> > and compiler intrinsics inside the kernel? The compiler _ought_ to be
> > able to do these.
>
> It sounds interesting to me, if we can make it work properly and
> reliably. + gcc@gcc.gnu.org for others in the GCC community to chip in.

Given my (albeit limited) experience playing with the C11 spec and GCC,
I really think this is a bad idea for the kernel. It seems that nobody
really agrees on exactly how the C11 atomics map to real architectural
instructions on anything but the trivial architectures. For example,
should the following code fire the assert?

  extern atomic<int> foo, bar, baz;

  void thread1(void)
  {
      foo.store(42, memory_order_relaxed);
      bar.fetch_add(1, memory_order_seq_cst);
      baz.store(42, memory_order_relaxed);
  }

  void thread2(void)
  {
      while (baz.load(memory_order_seq_cst) != 42) {
          /* do nothing */
      }

      assert(foo.load(memory_order_seq_cst) == 42);
  }

To answer that question, you need to go and look at the definitions of
synchronises-with, happens-before, dependency_ordered_before and a whole
pile of vaguely written waffle to realise that you don't know. Certainly,
the code that arm64 GCC currently spits out would allow the assertion to
fire on some microarchitectures.

There are also so many ways to blow your head off it's untrue. For
example, cmpxchg takes a separate memory model parameter for failure and
success, but then there are restrictions on the sets you can use for
each. It's not hard to find well-known memory-ordering experts shouting
"Just use memory_order_seq_cst for everything, it's too hard otherwise".
Then there's the fun of load-consume vs load-acquire (arm64 GCC
completely ignores consume atm and optimises all of the data
dependencies away) as well as the definition of "data races", which
seems to be used as an excuse to miscompile a program at the earliest
opportunity.

Trying to introduce system concepts (writes to devices, interrupts,
non-coherent agents) into this mess is going to be an uphill battle
IMHO. I'd just rather stick to the semantics we have and the asm
volatile barriers.

That's not to say I don't think there's room for improvement in what we
have in the kernel. Certainly, I'd welcome allowing more relaxed
operations on architectures that support them, but it needs to be
something that at least the different architecture maintainers can
understand how to implement efficiently behind an uncomplicated
interface. I don't think that interface is C11.

Just my thoughts on the matter...

Will
Re: [RFC][PATCH 0/5] arch: atomic rework
Hello Torvald,

It looks like Paul clarified most of the points I was trying to make
(thanks Paul!), so I won't go back over them here.

On Thu, Feb 06, 2014 at 09:09:25PM +0000, Torvald Riegel wrote:
> Are you familiar with the formalization of the C11/C++11 model by Batty
> et al.?
>
>   http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf
>   http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
>
> They also have a nice tool that can run condensed examples and show you
> all allowed (and forbidden) executions (it runs in the browser, so is
> slow for larger examples), including nice annotated graphs for those:
>
>   http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/

Thanks for the link, that's incredibly helpful. I've used ppcmem and
armmem in the past, but I didn't realise they have a version for C++11
too. Actually, the armmem backend doesn't implement our atomic
instructions or the acquire/release accessors, so it's not been as
useful as it could be. I should probably try to learn OCaml...

> IMHO, one thing worth considering is that for C/C++, the C11/C++11
> model is the only memory model that has widespread support. So, even
> though it's a fairly weak memory model (unless you go for the "only
> seq-cst" beginners advice) and thus comes with a higher complexity,
> this model is what likely most people will be familiar with over time.
> Deviating from the "standard" model can have valid reasons, but it also
> has a cost in that new contributors are more likely to be familiar with
> the "standard" model.

Indeed, I wasn't trying to write off the C11 memory model as something
we can never use in the kernel. I just don't think the current situation
is anywhere close to usable for a project such as Linux.
If a greater understanding of the memory model does eventually manifest amongst C/C++ developers (by which I mean, the beginners advice is really treated as such and there is a widespread intuition about ordering guarantees, as opposed to the need to use formal tools), then surely the tools and libraries will stabilise and provide uniform semantics across the 25+ architectures that Linux currently supports. If *that* happens, this discussion is certainly worth having again. Will
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, Feb 07, 2014 at 05:06:54PM +0000, Peter Zijlstra wrote:
> On Fri, Feb 07, 2014 at 04:55:48PM +0000, Will Deacon wrote:
> > Hi Paul,
> >
> > On Fri, Feb 07, 2014 at 04:50:28PM +0000, Paul E. McKenney wrote:
> > > On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote:
> > > > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote:
> > > > > Hopefully some discussion of out-of-thin-air values as well.
> > > >
> > > > Yes, absolutely shoot store speculation in the head already. Then
> > > > drive a wooden stake through its heart.
> > > >
> > > > C11/C++11 should not be allowed to claim itself a memory model
> > > > until that is sorted.
> > >
> > > There actually is a proposal being put forward, but it might not make
> > > ARM and Power people happy because it involves adding a compare, a
> > > branch, and an ISB/isync after every relaxed load... Me, I agree
> > > with you, much preferring the no-store-speculation approach.
> >
> > Can you elaborate a bit on this please? We don't permit speculative
> > stores in the ARM architecture, so it seems counter-intuitive that GCC
> > needs to emit any additional instructions to prevent that from
> > happening.
> >
> > Stores can, of course, be observed out-of-order but that's a lot more
> > reasonable :)
>
> This is more about the compiler speculating on stores; imagine:
>
>     if (x)
>         y = 1;
>     else
>         y = 2;
>
> The compiler is allowed to change that into:
>
>     y = 2;
>     if (x)
>         y = 1;
>
> Which is of course a big problem when you want to rely on the ordering.

Understood, but that doesn't explain why Paul wants to add ISB/isync
instructions, which affect the *CPU* rather than the compiler!

Will
Re: [RFC][PATCH 0/5] arch: atomic rework
Hi Paul,

On Fri, Feb 07, 2014 at 04:50:28PM +0000, Paul E. McKenney wrote:
> On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote:
> > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote:
> > > Hopefully some discussion of out-of-thin-air values as well.
> >
> > Yes, absolutely shoot store speculation in the head already. Then drive
> > a wooden stake through its heart.
> >
> > C11/C++11 should not be allowed to claim itself a memory model until
> > that is sorted.
>
> There actually is a proposal being put forward, but it might not make ARM
> and Power people happy because it involves adding a compare, a branch,
> and an ISB/isync after every relaxed load... Me, I agree with you,
> much preferring the no-store-speculation approach.

Can you elaborate a bit on this please? We don't permit speculative
stores in the ARM architecture, so it seems counter-intuitive that GCC
needs to emit any additional instructions to prevent that from happening.

Stores can, of course, be observed out-of-order but that's a lot more
reasonable :)

Will
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 10, 2014 at 11:48:13AM +0000, Peter Zijlstra wrote:
> On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote:
> > As near as I can tell, compiler writers hate the idea of prohibiting
> > speculative-store optimizations because it requires them to introduce
> > both control and data dependency tracking into their compilers. Many of
> > them seem to hate dependency tracking with a purple passion. At least,
> > such a hatred would go a long way towards explaining the incomplete
> > and high-overhead implementations of memory_order_consume, the long
> > and successful use of idioms based on the memory_order_consume pattern
> > notwithstanding [*]. ;-)
>
> Just tell them that because the hardware provides control dependencies
> we actually use and rely on them.

s/control/address/ ?

Will
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 10, 2014 at 03:04:43PM +, Paul E. McKenney wrote: > On Mon, Feb 10, 2014 at 11:49:29AM +0000, Will Deacon wrote: > > On Mon, Feb 10, 2014 at 11:48:13AM +, Peter Zijlstra wrote: > > > On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote: > > > > As near as I can tell, compiler writers hate the idea of prohibiting > > > > speculative-store optimizations because it requires them to introduce > > > > both control and data dependency tracking into their compilers. Many of > > > > them seem to hate dependency tracking with a purple passion. At least, > > > > such a hatred would go a long way towards explaining the incomplete > > > > and high-overhead implementations of memory_order_consume, the long > > > > and successful use of idioms based on the memory_order_consume pattern > > > > notwithstanding [*]. ;-) > > > > > > Just tell them that because the hardware provides control dependencies > > > we actually use and rely on them. > > > > s/control/address/ ? > > Both are important, but as Peter's reply noted, it was control > dependencies under discussion. Data dependencies (which include the > ARM/PowerPC notion of address dependencies) are called out by the standard > already, but control dependencies are not. I am not all that satisfied > by current implementations of data dependencies, admittedly. Should > be an interesting discussion. ;-) Ok, but since you can't use control dependencies to order LOAD -> LOAD, it's a pretty big ask of the compiler to make use of them for things like consume, where a data dependency will suffice for any combination of accesses. Will
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 17, 2014 at 06:59:31PM +, Joseph S. Myers wrote: > On Sat, 15 Feb 2014, Torvald Riegel wrote: > > > glibc is a counterexample that comes to mind, although it's a smaller > > code base. (It's currently not using C11 atomics, but transitioning > > there makes sense, and something I want to get to eventually.) > > glibc is using C11 atomics (GCC builtins rather than _Atomic / > <stdatomic.h>, but using __atomic_* with explicitly specified memory model > rather than the older __sync_*) on AArch64, plus in certain cases on ARM > and MIPS. Hmm, actually that results in a change in behaviour for the __sync_* primitives on AArch64. The documentation for those states that: `In most cases, these built-in functions are considered a full barrier. That is, no memory operand is moved across the operation, either forward or backward. Further, instructions are issued as necessary to prevent the processor from speculating loads across the operation and from queuing stores after the operation.' which is stronger than simply mapping them to memory_model_seq_cst, which seems to be what the AArch64 compiler is doing (so you get acquire + release instead of a full fence). Will
Re: [PATCH 5/5] gcc-plugins/stackleak: Don't instrument vgettimeofday.c in arm64 VDSO
On Thu, Jun 04, 2020 at 04:49:57PM +0300, Alexander Popov wrote: > Don't try instrumenting functions in arch/arm64/kernel/vdso/vgettimeofday.c. > Otherwise that can cause issues if the cleanup pass of stackleak gcc plugin > is disabled. > > Signed-off-by: Alexander Popov > --- > arch/arm64/kernel/vdso/Makefile | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/arch/arm64/kernel/vdso/Makefile b/arch/arm64/kernel/vdso/Makefile > index 3862cad2410c..9b84cafbd2da 100644 > --- a/arch/arm64/kernel/vdso/Makefile > +++ b/arch/arm64/kernel/vdso/Makefile > @@ -32,7 +32,8 @@ UBSAN_SANITIZE := n > OBJECT_FILES_NON_STANDARD:= y > KCOV_INSTRUMENT := n > > -CFLAGS_vgettimeofday.o = -O2 -mcmodel=tiny -fasynchronous-unwind-tables > +CFLAGS_vgettimeofday.o = -O2 -mcmodel=tiny -fasynchronous-unwind-tables \ > + $(DISABLE_STACKLEAK_PLUGIN) I can pick this one up via arm64, thanks. Are there any other plugins we should be wary of? It looks like x86 filters out $(GCC_PLUGINS_CFLAGS) when building the vDSO. Will
Re: [PATCH 5/5] gcc-plugins/stackleak: Don't instrument vgettimeofday.c in arm64 VDSO
On Tue, Jun 09, 2020 at 12:09:27PM -0700, Kees Cook wrote: > On Thu, Jun 04, 2020 at 02:58:06PM +0100, Will Deacon wrote: > > On Thu, Jun 04, 2020 at 04:49:57PM +0300, Alexander Popov wrote: > > > Don't try instrumenting functions in > > > arch/arm64/kernel/vdso/vgettimeofday.c. > > > Otherwise that can cause issues if the cleanup pass of stackleak gcc > > > plugin > > > is disabled. > > > > > > Signed-off-by: Alexander Popov > > > --- > > > arch/arm64/kernel/vdso/Makefile | 3 ++- > > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > > > diff --git a/arch/arm64/kernel/vdso/Makefile > > > b/arch/arm64/kernel/vdso/Makefile > > > index 3862cad2410c..9b84cafbd2da 100644 > > > --- a/arch/arm64/kernel/vdso/Makefile > > > +++ b/arch/arm64/kernel/vdso/Makefile > > > @@ -32,7 +32,8 @@ UBSAN_SANITIZE := n > > > OBJECT_FILES_NON_STANDARD:= y > > > KCOV_INSTRUMENT := n > > > > > > -CFLAGS_vgettimeofday.o = -O2 -mcmodel=tiny -fasynchronous-unwind-tables > > > +CFLAGS_vgettimeofday.o = -O2 -mcmodel=tiny -fasynchronous-unwind-tables \ > > > + $(DISABLE_STACKLEAK_PLUGIN) > > > > I can pick this one up via arm64, thanks. Are there any other plugins we > > should be wary of? It looks like x86 filters out $(GCC_PLUGINS_CFLAGS) > > when building the vDSO. > > I didn't realize/remember that arm64 retained the kernel build flags for > vDSO builds. (I'm used to x86 throwing all its flags away for its vDSO.) > > How does 32-bit ARM do its vDSO? > > My quick run-through on plugins: > > arm_ssp_per_task_plugin.c > 32-bit ARM only (but likely needs disabling for 32-bit ARM vDSO?) On arm64, the 32-bit toolchain is picked up via CC_COMPAT -- does that still get the plugins? > cyc_complexity_plugin.c > compile-time reporting only > > latent_entropy_plugin.c > this shouldn't get triggered for the vDSO (no __latent_entropy > nor __init attributes in vDSO), but perhaps explicitly disabling > it would be a sensible thing to do, just for robustness? > > randomize_layout_plugin.c > this shouldn't get triggered (again, lacking attributes), but > should likely be disabled too. > > sancov_plugin.c > This should be tracking the KCOV directly (see > scripts/Makefile.kcov), which is already disabled here. > > structleak_plugin.c > This should be fine in the vDSO, but there's no security > boundary here, so it wouldn't be important to KEEP it enabled. Thanks for going through these. In general though, it seems like an opt-in strategy would make more sense, as it doesn't make an awful lot of sense to me for the plugins to be used to build the vDSO. So I would prefer that this patch filters out $(GCC_PLUGINS_CFLAGS). Will
Re: [PATCH v2 3/5] arm64: vdso: Don't use gcc plugins for building vgettimeofday.c
On Wed, Jun 24, 2020 at 03:33:28PM +0300, Alexander Popov wrote: > Don't use gcc plugins for building arch/arm64/kernel/vdso/vgettimeofday.c > to avoid unneeded instrumentation. > > Signed-off-by: Alexander Popov > --- > arch/arm64/kernel/vdso/Makefile | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/arm64/kernel/vdso/Makefile b/arch/arm64/kernel/vdso/Makefile > index 556d424c6f52..0f1ad63b3326 100644 > --- a/arch/arm64/kernel/vdso/Makefile > +++ b/arch/arm64/kernel/vdso/Makefile > @@ -29,7 +29,7 @@ ldflags-y := -shared -nostdlib -soname=linux-vdso.so.1 > --hash-style=sysv \ > ccflags-y := -fno-common -fno-builtin -fno-stack-protector -ffixed-x18 > ccflags-y += -DDISABLE_BRANCH_PROFILING > > -CFLAGS_REMOVE_vgettimeofday.o = $(CC_FLAGS_FTRACE) -Os $(CC_FLAGS_SCS) > +CFLAGS_REMOVE_vgettimeofday.o = $(CC_FLAGS_FTRACE) -Os $(CC_FLAGS_SCS) > $(GCC_PLUGINS_CFLAGS) > KBUILD_CFLAGS+= $(DISABLE_LTO) > KASAN_SANITIZE := n > UBSAN_SANITIZE := n > -- > 2.25.4 I'll pick this one up as a fix for 5.8, please let me know if that's a problem. Will
Re: [PATCH v2 0/5] Improvements of the stackleak gcc plugin
On Wed, 24 Jun 2020 15:33:25 +0300, Alexander Popov wrote: > This is the v2 of the patch series with various improvements of the > stackleak gcc plugin. > > The first three patches disable unneeded gcc plugin instrumentation for > some files. > > The fourth patch is the main improvement. It eliminates an unwanted > side-effect of kernel code instrumentation performed by stackleak gcc > plugin. This patch is a deep reengineering of the idea described on > grsecurity blog: > https://grsecurity.net/resolving_an_unfortunate_stackleak_interaction > > [...] Applied to arm64 (for-next/fixes), thanks! [1/1] arm64: vdso: Don't use gcc plugins for building vgettimeofday.c https://git.kernel.org/arm64/c/e56404e8e475 Cheers, -- Will https://fixes.arm64.dev https://next.arm64.dev https://will.arm64.dev
Re: Re: typeof and operands in named address spaces
On Tue, Nov 17, 2020 at 11:31:57AM -0800, Linus Torvalds wrote: > On Tue, Nov 17, 2020 at 11:25 AM Jakub Jelinek wrote: > > > > It would need to be typeof( (typeof(type)) (type) ) to not be that > > constrained on what kind of expressions it accepts as arguments. > > Yup. > > > Anyway, it won't work with array types at least, > > int a[10]; > > typeof ((typeof (a)) (a)) b; > > is an error (in both gcc and clang), while typeof (a) b; will work > > (but not drop the qualifiers). Don't know if the kernel cares or not. > > Well, the kernel already doesn't allow that, because our existing > horror only handles simple integer scalar types. > > So that macro is a clear improvement - if it actually works (local > testing says it does, but who knows about random compiler versions > etc) I'll give it a go now, although if it works I honestly won't know whether to laugh or cry. Will
Re: Re: typeof and operands in named address spaces
On Tue, Nov 17, 2020 at 09:10:53PM +, Will Deacon wrote: > On Tue, Nov 17, 2020 at 11:31:57AM -0800, Linus Torvalds wrote: > > On Tue, Nov 17, 2020 at 11:25 AM Jakub Jelinek wrote: > > > > > > It would need to be typeof( (typeof(type)) (type) ) to not be that > > > constrained on what kind of expressions it accepts as arguments. > > > > Yup. > > > > > Anyway, it won't work with array types at least, > > > int a[10]; > > > typeof ((typeof (a)) (a)) b; > > > is an error (in both gcc and clang), while typeof (a) b; will work > > > (but not drop the qualifiers). Don't know if the kernel cares or not. > > > > Well, the kernel already doesn't allow that, because our existing > > horror only handles simple integer scalar types. > > > > So that macro is a clear improvement - if it actually works (local > > testing says it does, but who knows about random compiler versions > > etc) > > I'll give it a go now, although if it works I honestly won't know whether > to laugh or cry. GCC 9 and Clang 11 both seem to generate decent code for aarch64 defconfig with: #define __unqual_scalar_typeof(x) typeof( (typeof(x)) (x)) replacing the current monstrosity. allnoconfig and allmodconfig build fine too. However, GCC 4.9.0 goes mad and starts spilling to the stack when dealing with a pointer to volatile, as though we were just using typeof(). I tried GCC 5.4.0 and that looks ok, so I think if anybody cares about the potential performance regression with 4.9 then perhaps they should consider upgrading their toolchain. In other words, let's do it. Will
Re: [PATCH] arm64/io: Remind compiler that there is a memory side effect
On Sun, Apr 03, 2022 at 09:47:47AM +0200, Ard Biesheuvel wrote: > On Sun, 3 Apr 2022 at 09:47, Ard Biesheuvel wrote: > > On Sun, 3 Apr 2022 at 09:38, Andrew Pinski wrote: > > > It might not be the most restricted fix but it is a fix. > > > The best fix is to tell that you are writing to that location of memory. > > > volatile asm does not do what you think it does. > > > You didn't read further down about memory clobbers: > > > https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Clobbers-and-Scratch-Registers > > > Specifically this part: > > > The "memory" clobber tells the compiler that the assembly code > > > performs memory reads or writes to items other than those listed in > > > the input and output operands > > > > > > > So should we be using "m"(*addr) instead of "r"(addr) here? > > > > (along with the appropriately sized casts) > > I mean "=m" not "m" That can generate writeback addressing modes, which I think breaks MMIO virtualisation. We usually end up using "Q" instead but the codegen tends to be worse iirc. In any case, for this specific problem I think we either need a fixed compiler or some kbuild magic to avoid using it / disable the new behaviour. We rely on 'asm volatile' not being elided in other places too. Will