Re: Plans for Linux ELF "i686+" ABI ? Like SPARC V8+ ?
Andrew Haley wrote:
> This doesn't sound very different from the small memory model.  With
> the small model, the program and its symbols must be linked in the
> lower 2 GB of the address space, but pointers are still 64 bits.
> This is the default model for gcc on GNU/Linux.  It would be possible
> in theory to have a `smaller' memory model that only allowed 32-bit
> pointers, I suppose.

Small memory model?  I don't understand the point of the "small memory
model", nor can I substantiate anything you are saying.  I was not
aware of a small memory model until now, let alone that I may actually
be using it already and that it is already the thing I am making an
inquiry about.

I wonder if you are aware of the differences between the SPARC V8 and
V8+ ABIs: both are 32-bit runtime memory models using 32-bit pointers,
but V8+ only works on a 64-bit capable CPU, on a kernel that supports
both 32-bit and 64-bit userspace and is itself 64-bit internally.

Yes, maybe there is a restriction in the ELF format on the maximum
executable size, but linking has little to do with the issue of
improving performance via better linkage ABI rules, i.e. passing
function arguments in registers [64-bit] versus on the stack [32-bit].
The problem was not that I need to create a final executable file near
the 2 GB limit, but rather what benefit can be had on newer 64-bit
capable IA32 CPUs from using the extra available registers to run
32-bit code in a 32-bit address space.

One of the issues is that pointer storage is 8 bytes, so all structure
sizes increase, so more memory is needed.  Pointer-intensive
applications therefore have poor memory footprints as 64-bit
applications, and most desktop applications do not require more than
3 GB of exec+heap+stacks+etc.  I also believe code size will reduce
too, due to less code being needed to manage passing arguments via the
stack, and better code generation.
> There is no way that going from the ia32 to (presumably) the x86_64
> small model should more than double memory consumption.  Where has
> all that memory gone?  I think some analysis of memory consumption
> is needed.

I presume all the memory is eaten up dealing with 64-bit issues; both
Mozilla (using XPCOM) and Eclipse (running in a Java JVM) make
extensive use of pointers.  Mozilla is a little harder for me to
measure comparatively, but I have never gotten it over 1 GB Resident
Set Size (with no swap in use).

Darryl
Re: Plans for Linux ELF "i686+" ABI ? Like SPARC V8+ ?
Andrew Haley wrote:
> Darryl Miles writes:
> > I'm not aware of a small memory model until now, let alone that I
> > may actually be using it already and that it's already what I'm
> > making an inquiry about.
>
> Reading the gcc documentation would help you here.  Section 3.17.13,
> Intel 386 and AMD x86-64 Options.

It's just a shame that the section number relates to a different CPU
depending upon which version of GCC's documentation you are looking
at.  Ah, found it, for those following:

http://gcc.gnu.org/onlinedocs/gcc-4.2.2/gcc/i386-and-x86_002d64-Options.html#i386-and-x86_002d64-Options

I think the main point I was getting at was that this ABI would run
within a 32-bit address space at runtime (just like 32-bit code does).
It would be possible for it to see and use the 64-bit versions of
kernel calls, even if the values for things like mmap() were always
returned within the bottom 4 GB of address space.  The purpose of the
new ABI was to make use of AMD64/EM64T features, knowing there is a
runtime guarantee they are available.  This would ultimately end up
with an entire distribution being recompiled for this ABI to see the
speed benefits (if there are any to be had?).

> > I presume all the memory is eaten up dealing with 64-bit issues;
> > both Mozilla (using XPCOM) and Eclipse (running in a Java JVM)
> > make extensive use of pointers.  Mozilla is a little harder for me
> > to measure comparatively, but I've never gotten it over 1 GB
> > Resident Set Size (with no swap in use).
>
> That's interesting.  I certainly have seen some increase in memory
> consumption going from 32-bit to 64-bit applications on x86_64, but
> the fact that in your case it more than doubles is some cause for
> concern.  Even if pointers were 50% of the allocated memory pool,
> which is a pretty extreme assumption, that would only increase
> memory use by 50%.  In your case, however, memory use has increased
> by 150%.  This needs explanation.
There is the possibility of extra padding from alignment issues (so
there may be a benefit within these applications to reordering
structure members to account for 64-bit alignment rules).  There is an
increased stack size requirement, and increased stack usage, in both
these apps, which make use of many threads.

I'm not sure if mmap()ed files that are paged in count towards the
RSS; I guess in the 32-bit runtime model the strategy for using mmap()
to access files may differ from 64-bit.  For example, I had to
increase the open fd limit on Eclipse to 2048, since I could easily
break the default 1024 limit by doing a complete workspace rebuild.
Most of these open files are due to mmap()s.  Maybe there are other
reasons I can't think of right now.

That's a job for someone: ensure enough information is emitted in the
-g3 debug information going into the executable to reconstruct all
structures/datatypes, so that a tool could extract the amount of
wasted padding and calculate the potential savings to be made through
reordering structure members and the like.

Darryl
Re: Optimization of conditional access to globals: thread-unsafe?
Comments inline below.

Tomash Brechko wrote:
> Consider this piece of code:
>
>   extern int v;
>
>   void
>   f(int set_v)
>   {
>     if (set_v)
>       v = 1;
>   }
>
>   f:
>           pushl   %ebp
>           movl    %esp, %ebp
>           cmpl    $0, 8(%ebp)
>           movl    $1, %eax
>           cmove   v, %eax        ; load (maybe)
>           movl    %eax, v        ; store (always)
>           popl    %ebp
>           ret
>
> Note the last unconditional store to v.  Now, if some thread would
> modify v between our load and store (acquiring the mutex first),
> then we will overwrite the new value with the old one (and would do
> that in a thread-unsafe manner, not acquiring the mutex).  So, do
> the calls to f(0) require the mutex, or is it a GCC bug?

The "unintended write access" optimization is a massive headache for
developers of multi-threaded code.  The problem here is the mandatory
write access to a memory location for which the as-written code path
does not indicate a write access should occur.

This is a tricky one.  Optimizations which have the effect of causing
an unintended write access to some resource, when the code path does
not intend this to happen, cross a line IMHO.  I think that GCC should
understand where that line is and have a compile-time parameter to
configure whether that line may be crossed.  It's a matter for debate
what the default should be, and/or whether -O6 should allow the line
to be crossed, but having no mechanism to control it is the real
bummer.

Even if the offered interpretation of the C language standard says the
line may be crossed, from a practical point of view this is one aspect
of optimization that a developer would want complete control over.  So
much control that I would also like to see a pair of
__attribute__((optimization_hint_keywords)) attached to the variable
declaration to provide fine-grained control.  Such a solution to the
problem would keep everybody happy.

> Here are some pieces from C99:
>
> ...SNIP...
>
> Sec 3.1 par 4:
>   NOTE 3  Expressions that are not evaluated do not access objects.

Hmm... on this point there can be a problem.
There are two major types of access: read from memory (load) and write
to memory (store).  It is very possible to end up performing an
optimistic read, only to throw away the value due to a compare/jump.
This is usually considered a safe optimization.  But reading the
statement above as-is, and in the context of this problem, might make
some believe this "optimistic read" optimization is breaking the
rules.

Maybe GCC should have C99 adherence levels:

 * strict mode: where this C99 clause is adhered to.  But this is much
   like compiling code without optimization, as when debugging.  During
   debugging you always want nice clear per-line / per-expression
   separation so you can walk through execution with a debugger.

 * may-optimize-read-access mode: the normal case for optimization,
   where you might interleave a 'compare reg with immediate' and a
   'load from memory', then perform a 'conditional branch' that ends
   up at code which never uses the value loaded from memory.  The only
   rare case where this is a problem is a read from special memory,
   but volatile exists in GCC for that, or you could move all accesses
   to that memory away from regular C language syntax and into a
   function call.

 * may-optimize-read-and-write-access mode: the problem case you are
   seeing.  The same as the mode above, but it also permits the
   unintended write access, though only to write back the same value
   as before (based on the compiler's thread-naive perception of
   execution, at least!).

> So, could someone explain to me why this GCC optimization is valid,
> and, if so, where lies the boundary below which I may safely assume
> GCC won't try to store to objects that aren't stored to explicitly
> during a particular execution path?  Or maybe the named bug report
> is valid after all?

As has been pointed out by others, there is no specification on what
happens between threads.
Your route out of this problem is to write your own implementation of:

  atomic_int_set(int *ptr, int value);

which always uses an atomic single-instruction store.  This is
thread-safe with respect to ensuring that no other concurrent read or
write to that location will ever see a corrupted value.  A corrupted
value in this case would be some value other than "the previous value
of 'v'" or "the value of '1'" you are setting; also, once a concurrent
access first observes "the value of '1'", it will not be possible to
observe the previous value on a subsequent read (the value doesn't
flap about once it changes, it changes for good).

  if (set_v)
    atomic_int_set(&v, 1);

By doing the above you are programmatically dictating the method of
thread-safety in two directions.  One direction in terms of something
that is agreeable with a compiler and something it can'
Re: Optimization of conditional access to globals: thread-unsafe?
Dave Korn wrote:
> On 27 October 2007 18:27, Darryl Miles wrote:
> > The "unintended write access" optimization is a massive headache
> > for developers of multi-threaded code.
>
> But it's a boon for the autovectoriser, because it means it can
> transform code with a branch into straight-line code.

Then write to the stack or a register, but not to the heap when the
programmer didn't explicitly permit the compiler to do so, because no
memory-targeted lvalue expression was to be executed.  This basic rule
is what a threaded programmer expects of the C language, even if there
is no written law in C99.

I don't want to stop people using this optimization technique (there
will always be a useful case for it); I just want to be able to turn
that one off while keeping all other optimizations.

> > ...SNIP...  So much control that I would also like to see a pair
> > of __attribute__((optimization_hint_keywords)) attached to the
> > variable declaration to provide fine-grained control.  Such a
> > solution to the problem would keep everybody happy.
>
> How about attaching the 'volatile' keyword to the variable?

No.  We are _HAPPY_ to allow "may optimize read access mode" but not
happy to allow "may optimize read and write access mode" (as per my
previous description).  Volatile cannot differentiate between the two.
Nor can volatile instruct the compiler which method to use to perform
the load or store, for example for a 64-bit long long type on i386.
Volatile has its uses, but it is pretty much a sledgehammer for this
problem domain.

> > Hmm... on this point there can be a problem.  There are two major
> > types of access: read from memory (load) and write to memory
> > (store).  It is very possible to end up performing an optimistic
> > read, only to throw away the value due to a compare/jump.  This is
> > usually considered a safe optimization.
>
> As embedded programmers who have to deal with registers containing
> auto-resetting status bits have known for many years, this is not a
> safe optimisation at all.  We use 'volatile' to suppress it.
It is safe for general programming usage, which was the original case.
See my comment (which you failed to cite) over the use of volatile for
the situation you describe.  I have already covered this case for you.

NB: Marking the variable 'volatile' does not mean anything useful in
the situation you are in.  The exact meaning of 'volatile' can be a
problem between compilers, but in the case of GCC it can stop the
reordering, and the caching-of-value-in-register aspect, of your
entire problem.  But it will never enforce the method used to perform
the load/store, nor will it (at this time) stop the unintended write
access.

> Huh?  When I tried compiling the test case, it did exactly that.
> Hang on, I'll check:

We differ slightly in our understanding of volatile.  It does not
provide exactly what a threaded programmer wants, even though to you
it addresses the problem when used with GCC in the cases you have
tried.  The example you cite is coincidental; that's just how GCC
happens to generate code.

Darryl
Re: Optimization of conditional access to globals: thread-unsafe?
David Miller wrote:
> The compiler simply cannot speculatively load or store to variables
> with global visibility.

s/with global visibility/with visibility outside the scope of the
functional unit the compiler is able to see at compile time/

Which basically means the compiler is king at doing these tricks with
CPU registers, areas of the stack, and inlined functional units, where
it can be 100% sure about its access to the data.

What are the issues with "speculative loads"?  Is there such a thing
as a write-only page used by any system GCC targets?  For general
usage the x86 concept of read-only or read-write fits well, which
means that speculative loads are usually a safe optimization.

But I'd be all for a way to allow/disallow each optimization
independently (this gives the developer more choice in the matter),
with "speculative loads" enabled by default and "speculative stores"
disabled by default for any multi-threaded code.  As per my other
posting, have the ability to
__attribute__((disallow_speculative_load,disallow_speculative_store))
or __attribute__((allow_speculative_load,allow_speculative_store)) to
pin the issue, with -fdisallow-speculative-load
-fallow-speculative-load etc. for the defaults for the entire file
being compiled.

Darryl
Re: Optimization of conditional access to globals: thread-unsafe?
David Miller wrote:
> From: Darryl Miles <[EMAIL PROTECTED]>
> Date: Mon, 29 Oct 2007 04:53:49 +
>
> > What are the issues with "speculative loads"?
>
> The conditional might be protecting whether the pointer is valid and
> can be dereferenced at all.

This then leads into the question: is a pointer allowed to be invalid?
I'm sure I have read a comment on this before, along the lines of the
spec saying it must be valid, or one of a certain number of other
values (like zero, or one past being valid).  But I cannot cite
chapter and verse on whether this is true.

I would agree, however (before you say it), that 'counter' by itself
is just a variable, and it is only when execution allows it to be
dereferenced that the issues about its validity come into play.  This
is practical common-law usage.

> And in another module that GCC can't see when compiling foo():

I agree, any external symbols might not even be in the 'C' language,
but those symbols do conform to the ABI, and the unwritten rules of
what the value represents are implied when you assign an equivalent
type 'int' to it in a C variable declaration.

Darryl
Re: Optimization of conditional access to globals: thread-unsafe?
skaller wrote:
> Ah .. ok I think I finally see.  Thanks!  The code ensures
> well-definedness by checking the establishment of the required
> invariant and dynamically choosing whether or not to do the access
> on that basis .. and the optimisation above defeats that by lifting
> the access out of the conditional.  In the single-threaded case the
> lift works because it relies on sequential access, which is the only
> possibility for a single thread.

But this is clearly not a similar case.  There is a clear
read-modify-write cycle taking place (the -= operator), and you
describe the problem in a way that a decrement by the value of zero is
allowed.  The problem domain of atomic read-modify-write is not the
same as that of atomic assignment, which is the basis of the original
issue.  Moreover, the original issue was a write access to a variable
where none was described in the code for that given circumstance.

Along the lines of my first post to this thread, if you want atomic
read-modify-write then you are going to have to create your own
atomic_int_dec(int *intptr) function, or atomic_int_sub(int *intptr,
int value), which makes use of the IA32 CPU lock-prefix instructions.
But for many other platforms (almost all RISC) you are going to have
to obtain a mutex lock, then perform a 'load from memory to register',
'subtract value', 'store to memory from register'.

Darryl
Re: Optimization of conditional access to globals: thread-unsafe?
Michael Matz wrote:
> if (condition)
>   *p = value;
>
> (i.e. without any synchronization primitive, or in fact anything
> else after the store in the control region) and expect that the
> store indeed only happens in that control region.  And this
> expectation is misguided.  Had they written it like:
>
> if (condition) {
>   *p = value;
>   membarrier();
> }
>
> it would have worked just fine.

Don't you need the barrier before?  That is, to ensure it completed
the condition test completely first, before it then processed the
assignment expression:

if (condition) {
  somebarrier();
  *p = value;
}

The issue is not that the store is done too late, but that a write
access is done too early.

Darryl
Re: Optimization of conditional access to globals: thread-unsafe?
Michael Matz wrote:
> > Don't you need the barrier before?  That is, to ensure it
> > completed the condition test completely first, before it then
> > processed the assignment expression:
> >
> > if (condition) {
> >   somebarrier();
> >   *p = value;
> > }
> >
> > The issue is not that the store is done too late, but that a write
> > access is done too early.
>
> No.  The initial cause for this needless thread was that a store was
> moved down, out of its control region.  Of course it doesn't help
> when people keep shifting their point of focus in such discussions.
> Now it already moved to fear that GCC would somehow introduce new
> traps.  Without the people discussing that fear even bothering to
> check if it really happens :-(

No, the initial problem was that the store was done when the code
execution path clearly indicates no store should be performed.  The
store was a re-write of the same, existing value in *p.  The optimizer
tried to interleave the compare/test with the load from memory.

By inserting the barrier between the test and the assignment, that
interleave is stopped from taking place: since the compiler can't
optimize across the barrier, it must perform the test and branch
first, before it stores to memory.  It may optionally interleave the
'load from memory into register for the "value" variable'.  That would
be a speculative load, and it would be safe, as the value of 'value'
may go unused (thrown away) if the branch is taken to skip the store
to *p.

Now, the original case was shown as a simple function with just:

  if (condition) { *v = 1; }

I would agree with you that a barrier() afterwards would be needed if
there were any statement beyond the closing brace of the test within
the same function.  This is to ensure the store is not deferred any
later, where it might be accessed via another alias to the same memory
which the compiler could not see at compile time.  But there isn't;
there is a function return, which does the trick nicely.
From a purist perspective this makes it:

void foo(int value)
{
    if (condition) {
        somebarrier();
        *v = value;
        somebarrier();
    }
    /* more statements here that may access *v */
    /* if you don't have any statements here, then you can omit the
       2nd somebarrier() call */
    return;
}

Darryl