On Thu, 29 Jun 2006, Jeff Squyres (jsquyres) wrote:

> I think you may have caught us in an unintentional breakage.  If your 
> Open MPI was compiled as shared libraries and dynamic shared objects (the 
> default), this error should not have happened since we did not change 
> mpi.h.

Sure, I simply use the default.

> So there must be a second-order effect going on here (somehow the 
> size of a back-end data structure caused a problem.  Hrm.). 

> We'll look into this, because for where all of OMPI's functionality is 
> in shared libraries and components, the user's application should be 
> isolated from internal changes like this (i.e., and we can provide 
> forward compatibility).
> 
> I suspect that something deeper is going on, so let us go investigate 
> and come back with a more definitive statement.

Well, following the warnings, I checked the size of the ompi_mpi_comm_null 
and ompi_mpi_comm_world symbols in both the library and the executable 
with objdump -T:

OpenMPI 1.1 library:
00000000001e8140 g    DO .bss   00000000000001c8  Base        ompi_mpi_comm_null
00000000001e83a0 g    DO .bss   00000000000001c8  Base        ompi_mpi_comm_world


OpenMPI 1.0.2 executable:
000000000058f3d0 g    DO .bss   00000000000001c0              ompi_mpi_comm_world
000000000058ef00 g    DO .bss   00000000000001c0              ompi_mpi_comm_null

So the size has indeed changed. Now, MPI_COMM_WORLD is an opaque 
pointer, so a change in the internal data structure should have no 
effect on the functioning of the executable.
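
The size comparison between the two objdump -T listings above can be 
automated. The following is a minimal sketch, not part of any Open MPI 
tooling: the function names parse_dynsym and size_mismatches are 
hypothetical, and the parser assumes the column layout shown in the 
listings (address, flags, section, size, optional version, name).

```python
# Hypothetical helpers for comparing symbol sizes in `objdump -T` output.

def parse_dynsym(text):
    """Return {symbol_name: size_in_bytes} for the symbol lines in text."""
    sizes = {}
    for line in text.splitlines():
        fields = line.split()
        # Locate the section column (".bss", ".text", "*UND*", ...).
        idx = next(
            (i for i, f in enumerate(fields) if i > 0 and f[0] in ".*"),
            None,
        )
        # A symbol line needs at least a size and a name after the section.
        if idx is None or idx + 2 >= len(fields):
            continue
        try:
            sizes[fields[-1]] = int(fields[idx + 1], 16)
        except ValueError:
            continue  # not a symbol line after all
    return sizes


def size_mismatches(lib_text, exe_text):
    """Return {name: (exe_size, lib_size)} for symbols whose sizes differ."""
    lib, exe = parse_dynsym(lib_text), parse_dynsym(exe_text)
    return {
        name: (exe[name], lib[name])
        for name in sorted(lib.keys() & exe.keys())
        if lib[name] != exe[name]
    }


# The two listings quoted in this mail.
LIB_1_1 = """\
00000000001e8140 g    DO .bss   00000000000001c8  Base        ompi_mpi_comm_null
00000000001e83a0 g    DO .bss   00000000000001c8  Base        ompi_mpi_comm_world
"""
EXE_1_0_2 = """\
000000000058f3d0 g    DO .bss   00000000000001c0              ompi_mpi_comm_world
000000000058ef00 g    DO .bss   00000000000001c0              ompi_mpi_comm_null
"""

if __name__ == "__main__":
    for name, (exe_size, lib_size) in size_mismatches(LIB_1_1, EXE_1_0_2).items():
        print(f"{name}: executable {exe_size:#x}, library {lib_size:#x}, "
              f"difference {lib_size - exe_size} bytes")
```

In practice the two strings would be the captured output of objdump -T 
run on the shared library and on the executable.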

However, note that the ompi_mpi_comm_* symbols are not just referenced 
in the 1.0.2 executable; they are declared there, with a size! The most 
likely cause is that they were declared in the assembler file using .comm.

The dynamic linker will merge both declarations. Merging two symbols 
of different sizes is hard, so the linker has to make a choice. 
Suppose it chooses the declaration in the executable. Then the image in 
memory will contain ompi_mpi_comm_* data structures of 0x1c0 bytes, while 
the library expects them to be 0x1c8 bytes.
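
To make the consequence concrete, here is a small sketch of the 
arithmetic under the assumption stated above, namely that the 
executable's declaration wins. It only uses the addresses and sizes from 
the objdump listing of the 1.0.2 executable; the function name 
overrun_range is hypothetical, and what actually lives at the overrun 
addresses is not known from the listing.

```python
# Sketch: which addresses would the library touch beyond the space the
# executable reserved, if the executable's (smaller) declaration is used?

def overrun_range(address, reserved_size, expected_size):
    """Return (start, end) of the bytes past the reserved area, or None.

    `end` is exclusive. None means the reserved area is large enough.
    """
    if expected_size <= reserved_size:
        return None
    return (address + reserved_size, address + expected_size)


# Addresses from the 1.0.2 executable listing in this mail.
EXE_SYMBOLS = {
    "ompi_mpi_comm_null": 0x58EF00,
    "ompi_mpi_comm_world": 0x58F3D0,
}
RESERVED = 0x1C0  # size declared in the 1.0.2 executable
EXPECTED = 0x1C8  # size the 1.1 library assumes

if __name__ == "__main__":
    for name, address in EXE_SYMBOLS.items():
        start, end = overrun_range(address, RESERVED, EXPECTED)
        print(f"{name}: {end - start} bytes past the reserved area, "
              f"at {start:#x}..{end - 1:#x}")
```

Under that assumption, the library would access 8 bytes beyond each 
structure that belong to whatever the linker placed next in .bss.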

Conclusion: opaque pointers should not be declared with .comm; they should 
just be referenced.

I haven't given my system details yet: I'm using OpenSuSE 10 on the x86_64 
architecture. The compiler does not seem to have any influence: the 
result is the same with GCC, Intel C and PathScale.

Daniël Mantione
