On Thu, 29 Jun 2006, Jeff Squyres (jsquyres) wrote:
> I think you may have caught us in an unintentional breakage. If your
> Open MPI was compiled as shared libraries and dynamic shared objects (the
> default), this error should not have happened since we did not change
> mpi.h.

Sure, I simply use the default.

> So there must be a second-order effect going on here (somehow the
> size of a back-end data structure caused a problem. Hrm.).
> We'll look into this, because for where all of OMPI's functionality is
> in shared libraries and components, the user's application should be
> isolated from internal changes like this (i.e., and we can provide
> forward compatibility).
>
> I suspect that something deeper is going on, so let us go investigate
> and come back with a more definitive statement.

Well, following the warnings, I checked the size of the ompi_mpi_comm_null
and ompi_mpi_comm_world symbols in both the library and the executable
with objdump -T:

OpenMPI 1.1 library:

00000000001e8140 g    DO .bss   00000000000001c8  Base        ompi_mpi_comm_null
00000000001e83a0 g    DO .bss   00000000000001c8  Base        ompi_mpi_comm_world

OpenMPI 1.0.2 executable:

000000000058f3d0 g    DO .bss   00000000000001c0              ompi_mpi_comm_world
000000000058ef00 g    DO .bss   00000000000001c0              ompi_mpi_comm_null

So the size has indeed changed, from 0x1c0 to 0x1c8 bytes. Now,
MPI_COMM_WORLD is an opaque pointer, so if the internal data structure
changes, this should have no effect on the functioning of the executable.

However, note that the ompi_mpi_comm_* symbols are not merely referenced
in the 1.0.2 executable, they are declared there! The most likely cause of
this is that they were declared in the assembler file using .comm. The
dynamic linker will merge both declarations. Merging two symbols with
different sizes is hard; the linker has to make a choice. Suppose it
chooses the declaration in the executable. Then the image in memory will
contain ompi_mpi_comm_* data structures of 0x1c0 bytes, while the library
expects them to be 0x1c8 bytes, so the library could read or write 8 bytes
past the end of each object.
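For anyone who wants to repeat the size comparison on other symbols, here is a minimal C sketch of it. The function names parse_dynsym_line and dynsym_size_delta are made up for this mail and are not part of Open MPI or binutils. The parser assumes the objdump -T column layout shown above (address, flags, a section name starting with '.', size, optional version, symbol name) and gives up on anything else.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: pull the size field and the symbol name out of one
 * line of `objdump -T` output. Assumes the layout seen in this mail:
 *   address flags... .section size [version] name
 * Returns 0 on success, -1 if the line does not match that layout. */
int parse_dynsym_line(const char *line, unsigned long *size,
                      char *name, size_t name_len)
{
    char buf[256];
    char *tok;
    char *last = NULL;
    int after_section = 0;
    int have_size = 0;

    assert(line != NULL && size != NULL && name != NULL);

    if (strlen(line) >= sizeof buf)
        return -1;
    strcpy(buf, line);

    for (tok = strtok(buf, " \t\n"); tok != NULL; tok = strtok(NULL, " \t\n")) {
        if (after_section && !have_size) {
            /* The token right after the section name is the size, in hex. */
            char *end;
            *size = strtoul(tok, &end, 16);
            if (*end != '\0')
                return -1;
            have_size = 1;
        } else if (tok[0] == '.') {
            after_section = 1;
        }
        last = tok; /* the final token on the line is the symbol name */
    }

    if (!have_size || last == NULL || strlen(last) >= name_len)
        return -1;
    strcpy(name, last);
    return 0;
}

/* Hypothetical helper: compute (library size - executable size) for two
 * lines that must describe the same symbol. Returns 0 on success. */
int dynsym_size_delta(const char *lib_line, const char *exe_line, long *delta)
{
    unsigned long lib_size, exe_size;
    char lib_name[128], exe_name[128];

    if (parse_dynsym_line(lib_line, &lib_size, lib_name, sizeof lib_name) != 0 ||
        parse_dynsym_line(exe_line, &exe_size, exe_name, sizeof exe_name) != 0)
        return -1;
    if (strcmp(lib_name, exe_name) != 0)
        return -1;

    *delta = (long)lib_size - (long)exe_size;
    return 0;
}
```

Fed the two ompi_mpi_comm_world lines above, dynsym_size_delta reports a delta of 8 bytes (0x1c8 - 0x1c0). Undefined symbols (section *UND*) are rejected on purpose, since they carry no meaningful size.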
Conclusion: the objects behind opaque pointers should not be declared with
.comm; they should only be referenced.

I haven't given my system details yet: I'm using OpenSuSE 10 on the x86_64
architecture. The compiler does not seem to have any influence: the result
is the same with GCC, Intel C and PathScale.

Daniël Mantione