Re: Plans for Linux ELF "i686+" ABI ? Like SPARC V8+ ?

2007-10-15 Thread Darryl Miles

Andrew Haley wrote:

This doesn't sound very different from the small memory model.  With
the small model, the program and its symbols must be linked in the
lower 2 GB of the address space, but pointers are still 64 bits.  This
is the default model for gcc on GNU/Linux.  It would be possible in
theory to have a `smaller' memory model that only allowed 32-bit
pointers, I suppose.


Small memory model ?  I don't understand the point of the "small memory 
model", nor can I substantiate anything you are saying from 
littlepinkcloud (LOL, what a domain).


I wasn't aware of a small memory model until now, let alone that I may 
actually be using it already and that it's already what I'm making an 
inquiry about.


I wonder if you are aware of the differences between the SPARC V8 and V8+ 
ABIs: how they are both 32-bit runtime memory models using 32-bit 
pointers, but V8+ only works on a 64-bit capable CPU and on a kernel 
that supports both 32- and 64-bit userspace and is itself 64-bit internally ?



Yes, maybe there is a restriction in the ELF format on the maximum 
executable size, but linking has little to do with the issue of improving 
performance via better linkage ABI rules, i.e. passing function arguments 
in registers [64-bit] versus on the stack [32-bit].


The problem was not that I need to create a final executable file near 
the 2Gb limit, but rather what benefits can be had on newer 64-bit 
capable IA32 CPUs from using the extra registers available while running 
32-bit code in a 32-bit address space.


One of the issues is that pointer storage is 8 bytes, so all structure 
sizes increase, so more memory is needed.  So pointer-intensive 
applications have poor memory footprints as 64-bit applications, and most 
desktop applications do not require more than 3Gb of exec+heap+stacks+etc.


I also believe code size will reduce too, due to less code needed to 
manage passing arguments via the stack and better code generation.





This is amazing!  There is no way that going from the ia32 to
(presumably) the x86_64 small model should more than double memory
consumption.  Where has all that memory gone?  I think some analysis
of memory consumption is needed.


I presume all the memory is eaten up dealing with 64-bit issues; both 
Mozilla (using XPCOM) and Eclipse (running in a Java JVM) make extensive 
use of pointers.  Mozilla is a little harder for me to measure 
comparatively, but I've never gotten it over 1Gb Resident Set Size (with 
no swap in use).




Darryl



Re: Plans for Linux ELF "i686+" ABI ? Like SPARC V8+ ?

2007-10-15 Thread Darryl Miles

Andrew Haley wrote:

Darryl Miles writes:
 > Andrew Haley wrote:
 > I'm not aware of a small memory model until now, let alone that I maybe 
 > actually using it already and that its already what I'm making an 
 > inquiry about.


Reading the gcc documentation would help you here.  Section 3.17.13,
Intel 386 and AMD x86-64 Options.


It's just a shame that the section number refers to a different CPU 
depending upon which version of GCC's documentation you are looking at.


Ah found it for those following:

http://gcc.gnu.org/onlinedocs/gcc-4.2.2/gcc/i386-and-x86_002d64-Options.html#i386-and-x86_002d64-Options


I think the main point I was getting at was that this ABI would run within 
a 32-bit address space at runtime (just like 32-bit code does).  It would 
be possible for it to see and use the 64-bit versions of kernel calls, 
however, even if the values for stuff like mmap() were always returned 
within the bottom 4Gb of address space.


The purpose of the new ABI was to make use of AMD64/EM64T features 
knowing there is a runtime guarantee they are available.  This would 
ultimately end up with an entire distribution being recompiled for this 
ABI to see the speed benefits (if there are any to be had?).




 > I presume all the memory is eaten up dealing with 64bit issues both
 > Mozilla (using XPCom) and Eclipse (running in a Java JVM) make
 > extensive use of pointers.  Although Mozilla is a little harder for
 > me to measure comparatively but I've never gotten it over 1Gb
 > Resident Set Size (with no swap in use).

That's interesting.  I certainly have seen some increase in memory
consumption going from 32-bit to 64-bit applications on x86_64, but
the fact that in your case it more than doubles is some cause for
concern.  Even if pointers were 50% of the allocated memory pool,
which is a pretty extreme assumption, that would only increase memory
use by 50%.  In your case, however, memory use has increased by 150%.
This needs explanation.


There is a possibility of extra padding/alignment waste (so there may be 
benefits within these applications to reordering structure members to 
account for 64-bit alignment rules).


There are increased stack size requirements and increased stack usage in 
both these apps, which make use of many threads.


I'm not sure whether mmap()ed files that are paged in count towards the 
RSS; I guess in the 32-bit runtime model the strategy for using mmap() to 
access files may be different than in 64-bit.  For example, I had to 
increase the open fd limit for Eclipse to 2048, since I could easily make 
it break the default 1024 limit by doing a complete workspace rebuild.  
Most of these open files are due to mmap()s.


Maybe there are other reasons I can't think of right now.



That's a job for someone: ensure enough information is emitted in the -g3 
debug information going into the executable to reconstruct all 
structures/datatypes, so that a tool could extract the amount of wasted 
padding and calculate the potential savings to be made through 
reordering structure members and the like.



Darryl



Re: Optimization of conditional access to globals: thread-unsafe?

2007-10-27 Thread Darryl Miles


Comments inline below v


Tomash Brechko wrote:

Consider this piece of code:

extern int v;
  
void

f(int set_v)
{
  if (set_v)
v = 1;
}

f:
pushl   %ebp
movl%esp, %ebp
cmpl$0, 8(%ebp)
movl$1, %eax
cmove   v, %eax; load (maybe)
movl%eax, v; store (always)
popl%ebp
ret

Note the last unconditional store to v.  Now, if some thread would
modify v between our load and store (acquiring the mutex first), then
we will overwrite the new value with the old one (and would do that in
a thread-unsafe manner, not acquiring the mutex).

So, do the calls to f(0) require the mutex, or it's a GCC bug?


The "unintended write-access" optimization is a massive headache for 
developers of multi-threaded code.



The problem here is the mandatory write access to a memory location for 
which the as-written code path does not indicate that a write memory 
access should occur.


This is a tricky one; an optimization which has the effect of causing an 
"unintended write access to some resource" when the code path does not 
intend this to happen crosses a line, IMHO.


I think that GCC should understand where that line is and have a 
compile-time parameter to configure whether that line may be crossed.  
It's a matter for debate what the default should be and/or whether -O6 
should allow that line to be crossed, but having no mechanism to control 
it is the real bummer.


Even if the offered interpretation of the C language standard says the 
line may be crossed, from a practical point of view this is one aspect of 
optimization that a developer would want to have complete control over.


So much control that I would also like to see a pair of 
__attribute__((optimization_hint_keywords)) attached to the variable 
declaration to provide fine-grained control.  Such a solution to the 
problem would keep everybody happy.




Here are some pieces from C99:


...SNIP...

Sec 3.1 par 4: NOTE 3 Expressions that are not evaluated do not access
   objects.


Hmm... on this point there can be a problem.  There are two major types 
of access: read from memory (load) and write to memory (store).  It is 
very possible to end up performing an optimistic read, only to throw away 
the value due to a compare/jump.  This is usually considered a 
safe optimization.


But reading the statement above as-is, and in the context of this 
problem, might make some believe this "optimistic read" optimization is 
breaking the rules.



Maybe in GCC there should be C99 adherence levels:

strict mode: Where this C99 clause is adhered to.  But this is much like 
compiling code without optimization, as when debugging, since during 
debugging you always want nice clear per-line / per-expression 
separation so you can walk through execution with a debugger.


may optimize read access mode: This is the normal case for optimization, 
where you might interleave a 'compare reg with immediate' and a 'load 
from memory', then perform a 'conditional branch' that ends up at code 
that never uses the value loaded from memory.  The only rare case where 
this is a problem is a read from special memory, but volatile in GCC 
exists for that, or you could move all accesses to that memory away from 
regular C language syntax and into a function call.


may optimize read and write access mode: This is the problem case you 
are seeing.  The same as the mode above, but it also permits the 
unintended write access, albeit only to write back the same value as 
before (based on the compiler's thread-naive perception of execution, at 
least!).





So, could someone explain me why this GCC optimization is valid, and,
if so, where lies the boundary below which I may safely assume GCC
won't try to store to objects that aren't stored to explicitly during
particular execution path?  Or maybe the named bug report is valid
after all?


As has been pointed out by others there is no specification on what 
happens between threads.



Your route out of this problem is to write your own implementation of:

atomic_int_set(int *ptr, int value);

This always uses an atomic single-instruction store, which is 
thread-safe with respect to ensuring that no other concurrent read or 
write to that location will ever see a corrupted value.  A corrupted 
value in this case would be some value other than "the previous value 
of 'v'" or "the value of '1'" you are setting; also, once a concurrent 
access first observes "the value of '1'", it will not be possible to 
observe the previous value on a subsequent read (the value doesn't flap 
about once it changes; it changes for good).


if (set_v)
  atomic_int_set(&v, 1);



By doing the above you are programmatically dictating the method of 
thread-safety in two directions.


One direction in terms of something that is agreeable with a compiler 
and something it can'

Re: Optimization of conditional access to globals: thread-unsafe?

2007-10-27 Thread Darryl Miles

Dave Korn wrote:

On 27 October 2007 18:27, Darryl Miles wrote:

The "unintended write-access" optimization is a massive headache for
developers of multi-threaded code.


  But it's a boon for the autovectoriser, because it means it can transform
code with a branch into straight-line code.


Then write to the stack or to a register, but not to the heap when the 
programmer didn't explicitly permit the compiler to do so, because no 
memory-targeted lvalue expression was to be executed.


This basic rule is what a threaded programmer expects of the C 
language, even if there is no written law in C99.



I don't want to stop people using this optimization technique; there will 
always be a useful case for it.  I just want to be able to turn that one 
off, while keeping all other optimizations.



...SNIP...



So much control that I would also like to see a pair of
__attribute__((optimization_hint_keywords)) attached to the variable
declaration to provide fine grain control.  Such a solution to the
problem would keep everybody happy.


  How about attaching the 'volatile' keyword to the variable?


No, we are _HAPPY_ to allow "may optimize read access mode" but not happy 
to allow "may optimize read and write access mode" (as per my previous 
description).  Volatile cannot differentiate these.  Nor can volatile 
instruct the compiler which method to use to perform the load or store, 
for example for a 64-bit long long type on i386.


Volatile has its uses, but it is pretty much a sledgehammer for this 
problem domain.




Hmm... on this point there can be a problem.  There are 2 major types of
access read from memory (load) and write to memory (store).  It is very
possible to end up performing an optimistic read; only to throw away the
value contained due to a compare/jump.  This is usually considered a
safe optimization.


  As embedded programmers who have to deal with registers containing
auto-resetting status bits have known for many years, this is not a safe
optimisation at all.  We use 'volatile' to suppress it.


It is safe for general programming usage, which was the original case. 
See my comment (which you failed to cite) over the use of volatile for 
the situation you describe.  I've already covered this case for you.




NB Marking the variable 'volatile' does not mean anything useful in the
situation you are in.  The exact meaning of 'volatile' can be a
problem between compilers, but in the case of GCC it can stop the
re-ordering and the caching-of-value-in-register aspects of your entire
problem.  But it will never enforce the method used to perform the
load/store, nor will it (at this time) stop the unintended write-access.


  Huh?  When I tried compiling the test case, it did exactly that.  Hang on,
I'll check:


We differ slightly in our understanding of volatile.  It does not 
provide exactly what a threaded programmer wants, even though to you it 
addresses the problem when used with GCC in the cases you have tried.


The example you cite is coincidental; that's just how GCC generates code.


Darryl


Re: Optimization of conditional access to globals: thread-unsafe?

2007-10-28 Thread Darryl Miles

David Miller wrote:

The compiler simply cannot speculatively load or store to variables
with global visibility.


s/with global visibility/with visibility outside the scope of the 
functional unit the compiler is able to see at compile time/


Which basically means the compiler is king for doing these tricks with 
CPU registers, areas of the stack, and inlined functional units, in which 
it can be 100% sure about its access to this data.



What are the issues with "speculative loads" ?  Is there such a thing as 
a write-only page used by any system GCC targets ?  For general usage 
the x86 concept of read-only or read-write fits well, which means that 
speculative loads are usually a safe optimization.


But I'd be all for a way to allow/disallow each optimization 
independently (this gives the developer more choice in the matter), with 
"speculative loads" enabled by default and "speculative stores" disabled 
by default for any multi-threaded code.


As per my other posting, have the ability to 
__attribute__((disallow_speculative_load,disallow_speculative_store)) or 
__attribute__((allow_speculative_load,allow_speculative_store)) to 
pin the issue, with -fdisallow-speculative-load, 
-fallow-speculative-load, etc. for the defaults for the entire file 
being compiled.



Darryl


Re: Optimization of conditional access to globals: thread-unsafe?

2007-10-29 Thread Darryl Miles

David Miller wrote:

From: Darryl Miles <[EMAIL PROTECTED]>
Date: Mon, 29 Oct 2007 04:53:49 +


What are the issues with "speculative loads" ?


The conditional might be protecting whether the pointer is valid and
can be dereferenced at all.


This then leads to the question: is a pointer allowed to be invalid?

I'm sure I have read a comment on this before, along the lines that the 
spec says it must be valid or one of a certain number of other values 
(like null, or one past the end of a valid object).  But I cannot cite 
chapter and verse on whether this is true.


I would agree, however (before you say it), that 'counter' by itself is 
just a variable, and it is only when execution allows it to be 
dereferenced that the issues about its validity come into play.  This is 
practical common-law usage.




And in another module that GCC can't see when compiling foo():


I agree; any external symbols might not even be in the 'C' language, but 
those symbols do conform to the ABI, and the unwritten rules of what the 
value represents are implied when you assign an equivalent type 'int' to 
it in a C variable declaration.



Darryl


Re: Optimization of conditional access to globals: thread-unsafe?

2007-10-29 Thread Darryl Miles

skaller wrote:

Ah .. ok I think I finally see. Thanks! The code ensures
well definedness by checking the establishment of the
required invariant and dynamically chosing whether or not
to do the access on that basis .. and the optimisation
above defeats that by lifting the access out of the
conditional.

In the single threaded case the lift works because it
relies on sequential access, which is the only possibility
for a single thread.



But this is clearly not a similar case.  There is a clear 
read-modify-write cycle taking place (the -= operator), and you describe 
the problem in a way that allows a decrement when the value is zero.

The problem domain of atomic read-modify-write is not the same as that of 
atomic assignment, which is the basis of the original issue.  Moreover, 
the original issue was a write access to a variable where none was 
described in the code for the given circumstance.



Along the lines of my first post to this thread, if you want atomic 
read-modify-write then you are going to have to create your 
atomic_int_dec(int *intptr) function, or atomic_int_sub(int *intptr, int 
value), which makes use of IA32 CPU lock-prefix instructions.  But for 
many other platforms (almost all RISC) you are going to have to obtain a 
mutex lock, then perform a 'load from memory to register', 'subtract 
value', 'store to memory from register'.



Darryl


Re: Optimization of conditional access to globals: thread-unsafe?

2007-10-29 Thread Darryl Miles

Michael Matz wrote:

  if (condition)
*p = value;

(i.e. without any synchronization primitive or in fact anything else after 
the store in the control region) and expect that the store indeed only 
happens in that control region.  And this expectation is misguided.  Had 
they written it like:


  if (condition) {
*p = value;
membarrier();
  }

it would have worked just fine.



Don't you need the barrier before?  This is to ensure the condition test 
has completed fully before the assignment expression is processed.


if(condition) {
 somebarrier();
 *p = value;
}

The issue is not that the store is done too late, but that a 
write-access is done too early.



Darryl


Re: Optimization of conditional access to globals: thread-unsafe?

2007-10-29 Thread Darryl Miles

Michael Matz wrote:

Don't you need the barrier before?  This is to ensure the condition test 
has completed fully before the assignment expression is processed.

if(condition) {
 somebarrier();
 *p = value;
}

The issue is not that the store is done too late, but that a 
write-access is done too early.


No.  The initial cause for this needless thread was that a store was moved 
down, out of its control region.  Of course it doesn't help when people 
keep shifting their point of focus in such discussions.  Now it already 
moved to fear that GCC would somehow introduce new traps.  Without the 
people discussing about that fear even bothering to check if that really 
happens :-(


No, the initial problem was that the store was done when the code 
execution path clearly indicated that no store should be performed.  The 
store was a re-write of the same, existing value in *p.


The optimizer tried to interleave the compare/test with the load from 
memory.  Inserting the barrier between the test and the assignment 
would stop that interleave from taking place: since it can't optimize 
across the barrier, it must perform the test and branch first, before it 
stores to memory.


It may optionally interleave the 'load from memory into register for the 
"value" variable'.  This would be a speculative load, and it would be 
safe, as the value of 'value' may go unused (thrown away) if the branch 
is taken to skip the store to *p.



Now, the original case was shown as a simple function with just the

if(condition) {
 *v = 1;
}

I would agree with you that a barrier() afterwards would be needed if 
there were any statement beyond the closing brace of the test within the 
same function.  This is to ensure the store is not deferred any later, as 
*v may be accessed via another alias to the same memory which the 
compiler could not see at compile time.


But there isn't; there is a function return, which does the trick nicely.

From a purist perspective this makes it:

void foo(int value) {
 if(condition) {
  somebarrier();
  *v = value;
  somebarrier();
 }

 // more statements here that may access *v
 // if you don't have any statements here, then you can omit the
 // 2nd somebarrier() call


 return;
}


Darryl