Re: [OMPI users] unresolvable R_X86_64_64 relocation against symbol `mpi_fortran_*
Hi,

Here's a reprocase, the same one as mentioned here:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=608901

marcusmae@loveland:~/Programming/mpitest$ cat mpitest.f90
program main
include 'mpif.h'
integer ierr
call mpi_init(ierr)
end

marcusmae@loveland:~/Programming/mpitest$ mpif90 -g mpitest.f90
/usr/bin/ld: /tmp/cc3NLduM.o(.debug_info+0x542): unresolvable R_X86_64_64 relocation against symbol `mpi_fortran_argv_null_'
/usr/bin/ld: /tmp/cc3NLduM.o(.debug_info+0x55c): unresolvable R_X86_64_64 relocation against symbol `mpi_fortran_argv_null_'
/usr/bin/ld: /tmp/cc3NLduM.o(.debug_info+0x5d2): unresolvable R_X86_64_64 relocation against symbol `mpi_fortran_errcodes_ignore_'
/usr/bin/ld: /tmp/cc3NLduM.o(.debug_info+0x5ec): unresolvable R_X86_64_64 relocation against symbol `mpi_fortran_errcodes_ignore_'

Remove "-g", and the error will be gone.

marcusmae@loveland:~/Programming/mpitest$ mpif90 --showme -g mpitest.f90
gfortran -g mpitest.f90 -I/opt/openmpi_gcc-1.5.4/include -pthread -I/opt/openmpi_gcc-1.5.4/lib -L/opt/openmpi_gcc-1.5.4/lib -lmpi_f90 -lmpi_f77 -lmpi -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl

marcusmae@loveland:~/Programming/mpitest$ mpif90 -v
Using built-in specs.
COLLECT_GCC=/usr/bin/gfortran
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.6.1/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro
4.6.1-9ubuntu3'
--with-bugurl=file:///usr/share/doc/gcc-4.6/README.Bugs
--enable-languages=c,c++,fortran,objc,obj-c++,go --prefix=/usr
--program-suffix=-4.6 --enable-shared --enable-linker-build-id
--with-system-zlib --libexecdir=/usr/lib --without-included-gettext
--enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.6
--libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu
--enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-plugin
--enable-objc-gc --disable-werror --with-arch-32=i686
--with-tune=generic --enable-checking=release --build=x86_64-linux-gnu
--host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.6.1 (Ubuntu/Linaro 4.6.1-9ubuntu3)

2011/9/28 Dmitry N. Mikushin :
> Hi,
>
> Interestingly, the errors are gone after I removed "-g" from the app
> compile options.
>
> I tested again on the fresh Ubuntu 11.10 install: both 1.4.3 and 1.5.4
> compile fine, but with the same error.
> Also I tried hard to find any 32-bit object or library and failed.
> They all are 64-bit.
>
> - D.
>
> 2011/9/24 Jeff Squyres :
>> Check the output from when you ran Open MPI's configure and "make all" --
>> did it decide to build the F77 interface?
>>
>> Also check that gcc and gfortran output .o files of the same bitness / type.
>>
>> On Sep 24, 2011, at 8:07 AM, Dmitry N. Mikushin wrote:
>>
>>> Compile and link - yes, but it turns out there was some unnoticed
>>> compilation error because
>>>
>>> ./hellompi: error while loading shared libraries: libmpi_f77.so.1:
>>> cannot open shared object file: No such file or directory
>>>
>>> and this library does not exist.
>>>
>>> Hm.
>>>
>>> 2011/9/24 Jeff Squyres :
>>>> Can you compile / link simple OMPI applications without this problem?
>>>>
>>>> On Sep 24, 2011, at 7:54 AM, Dmitry N. Mikushin wrote:
>>>>
>>>>> Hi Jeff,
>>>>>
>>>>> Today I've verified this application on Fedora 15 x86_64, where
>>>>> I'm usually building OpenMPI from source using the same method.
>>>>> Result: no link errors there! So, the issue is likely ubuntu-specific.
>>>>>
>>>>> Target application is compiled and linked with mpif90 pointing to
>>>>> /opt/openmpi_gcc-1.5.4/bin/mpif90 I built.
>>>>>
>>>>> Regarding architectures, everything in target folders and OpenMPI
>>>>> installation is
>>>>> ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically
>>>>> linked, not stripped
>>>>>
>>>>> - D.
>>>>>
>>>>> 2011/9/24 Jeff Squyres :
>>>>>> How does the target application compile / link itself?
>>>>>>
>>>>>> Try running "file" on the Open MPI libraries and/or your target
>>>>>> application .o files to see what their bitness is, etc.
>>>>>>
>>>>>> On Sep 22, 2011, at 3:15 PM, Dmitry N. Mikushin wrote:
>>>>>>
>>>>>>> Hi Jeff,
>>>>>>>
>>>>>>> You're right because I also tried 1.4.3, and it's the same issue
>>>>>>> there. But what could be wrong? I'm using the simplest form -
>>>>>>> ../configure --prefix=/opt/openmpi_gcc-1.4.3/ and only installed
>>>>>>> compilers are system-default gcc and gfortran 4.6.1. Distro is ubuntu
>>>>>>> 11.10. There is no MPI installed from packages, and no -m32
>>>>>>> options around. What else could be the source?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> - D.
>>>>>>>
>>>>>>> 2011/9/22 Jeff Squyres :
>>>>>>>> This usually means that you're mixing compiler/linker flags somehow
>>>>>>>> (e.g., built something with 32 bit, built something else with 64 bit,
>>>>>>>> try to link them together). Can you verify that everything was built
>>>>>>>> with all the same 32/64?
>>>>>>>>
>>>>>>>> On Sep 22, 2011, at 1:21 PM, Dmi
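Jeff's suggestion in the quoted thread, running "file" on the libraries and .o files to compare their bitness, can also be scripted. The sketch below is only an illustration of that check, not something posted in the thread: it reads the EI_CLASS byte of the ELF header (1 means 32-bit, 2 means 64-bit). The function names elf_class and group_by_class are made up for this example. In this thread the check found nothing, since every object turned out to be 64-bit.

```python
# Minimal sketch: report the ELF class (32/64-bit) of object files and libraries.
# elf_class() and group_by_class() are hypothetical helper names for this example.

ELF_MAGIC = b"\x7fELF"
# EI_CLASS is the byte at offset 4 of the ELF identification block.
ELF_CLASSES = {1: "ELF32", 2: "ELF64"}


def elf_class(path):
    """Return "ELF32", "ELF64", or None if the file is not a recognisable ELF object."""
    with open(path, "rb") as f:
        ident = f.read(5)
    if len(ident) < 5 or ident[:4] != ELF_MAGIC:
        return None
    return ELF_CLASSES.get(ident[4])


def group_by_class(paths):
    """Group file paths by ELF class so that a 32/64-bit mix stands out."""
    groups = {}
    for path in paths:
        groups.setdefault(elf_class(path), []).append(path)
    return groups
```

If group_by_class() returns more than one key for the files going into a single link, that would point at the kind of 32/64-bit mix Jeff describes.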
Re: [OMPI users] [SOLVED] unresolvable R_X86_64_64 relocation against symbol `mpi_fortran_*
Ok, here's the solution: remove the --as-needed option from the compiler's
internal linker invocation command line. Steps to do this:

1) Dump compiler specs:

$ gcc -dumpspecs > specs

2) Open the specs file for editing and remove --as-needed from the line

*link:
%{!r:--build-id} --no-add-needed --as-needed %{!static:--eh-frame-hdr} %{!m32:-m elf_x86_64} %{m32:-m elf_i386} --hash-style=gnu %{shared:-shared} %{!shared: %{!static: %{rdynamic:-export-dynamic} %{m32:-dynamic-linker %{muclibc:/lib/ld-uClibc.so.0;:%{mbionic:/system/bin/linker;:/lib/ld-linux.so.2}}} %{!m32:-dynamic-linker %{muclibc:/lib/ld64-uClibc.so.0;:%{mbionic:/system/bin/linker64;:/lib64/ld-linux-x86-64.so.2}}}} %{static:-static}}

resulting in

*link:
%{!r:--build-id} --no-add-needed %{!static:--eh-frame-hdr} %{!m32:-m elf_x86_64} %{m32:-m elf_i386} --hash-style=gnu %{shared:-shared} %{!shared: %{!static: %{rdynamic:-export-dynamic} %{m32:-dynamic-linker %{muclibc:/lib/ld-uClibc.so.0;:%{mbionic:/system/bin/linker;:/lib/ld-linux.so.2}}} %{!m32:-dynamic-linker %{muclibc:/lib/ld64-uClibc.so.0;:%{mbionic:/system/bin/linker64;:/lib64/ld-linux-x86-64.so.2}}}} %{static:-static}}

3) Save the specs file into the compiler's folder
/usr/lib/gcc///
For example, in case of Ubuntu 11.10 with gcc 4.6.1 it's
/usr/lib/gcc/x86_64-linux-gnu/4.6.1/

With this change there are no unresolvable relocations anymore!

- D.

2011/10/3 Dmitry N. Mikushin :
> Hi,
>
> Here's a reprocase, the same one as mentioned here:
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=608901
>
> marcusmae@loveland:~/Programming/mpitest$ cat mpitest.f90
> program main
> include 'mpif.h'
> integer ierr
> call mpi_init(ierr)
> end
>
> marcusmae@loveland:~/Programming/mpitest$ mpif90 -g mpitest.f90
> /usr/bin/ld: /tmp/cc3NLduM.o(.debug_info+0x542): unresolvable
> R_X86_64_64 relocation against symbol `mpi_fortran_argv_null_'
> /usr/bin/ld: /tmp/cc3NLduM.o(.debug_info+0x55c): unresolvable
> R_X86_64_64 relocation against symbol `mpi_fortran_argv_null_'
> /usr/bin/ld: /tmp/cc3NLduM.o(.debug_info+0x5d2): unresolvable
> R_X86_64_64 relocation against symbol `mpi_fortran_errcodes_ignore_'
> /usr/bin/ld: /tmp/cc3NLduM.o(.debug_info+0x5ec): unresolvable
> R_X86_64_64 relocation against symbol `mpi_fortran_errcodes_ignore_'
>
> Remove "-g", and the error will be gone.
>
> marcusmae@loveland:~/Programming/mpitest$ mpif90 --showme -g mpitest.f90
> gfortran -g mpitest.f90 -I/opt/openmpi_gcc-1.5.4/include -pthread
> -I/opt/openmpi_gcc-1.5.4/lib -L/opt/openmpi_gcc-1.5.4/lib -lmpi_f90
> -lmpi_f77 -lmpi -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl
>
> marcusmae@loveland:~/Programming/mpitest$ mpif90 -v
> Using built-in specs.
> COLLECT_GCC=/usr/bin/gfortran
> COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.6.1/lto-wrapper
> Target: x86_64-linux-gnu
> Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro
> 4.6.1-9ubuntu3'
> --with-bugurl=file:///usr/share/doc/gcc-4.6/README.Bugs
> --enable-languages=c,c++,fortran,objc,obj-c++,go --prefix=/usr
> --program-suffix=-4.6 --enable-shared --enable-linker-build-id
> --with-system-zlib --libexecdir=/usr/lib --without-included-gettext
> --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.6
> --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu
> --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-plugin
> --enable-objc-gc --disable-werror --with-arch-32=i686
> --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu
> --host=x86_64-linux-gnu --target=x86_64-linux-gnu
> Thread model: posix
> gcc version 4.6.1 (Ubuntu/Linaro 4.6.1-9ubuntu3)
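The hand edit of the specs file described in the [SOLVED] message (drop --as-needed from the *link: section of the text produced by "gcc -dumpspecs") could also be done by a small script. The following is a rough sketch only, under the assumption that spec sections start with a line of the form "*name:", as in the dump shown in the thread. The function name strip_as_needed is invented for this example and is not part of GCC or Open MPI.

```python
import re

# Matches --as-needed only as a standalone token (so --no-as-needed is left alone),
# plus one trailing space or tab so no double space remains.
_AS_NEEDED = re.compile(r"(?<!\S)--as-needed(?!\S)[ \t]?")


def strip_as_needed(specs_text):
    """Return the specs text with --as-needed removed from the *link: section only.

    Assumption: each section begins with a header line starting with '*',
    for example '*link:', and the spec body follows on the next line(s).
    """
    out = []
    in_link = False
    for line in specs_text.splitlines(keepends=True):
        if line.startswith("*"):
            # A new section header: remember whether it is the link spec.
            in_link = line.strip() == "*link:"
        elif in_link:
            line = _AS_NEEDED.sub("", line)
        out.append(line)
    return "".join(out)
```

One way to use it would be to read the file written by "gcc -dumpspecs > specs", pass its contents through strip_as_needed(), write the result back, and then copy the file into the compiler's folder as in step 3. It is worth comparing the result with the original dump before installing it.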
Re: [OMPI users] Proper way to stop MPI process
You might want to double check this -- mpirun shouldn't be waiting on you hitting return. Check to make sure you don't just have line-buffered output in python, or somesuch. Or better yet, check from python that the PID has actually disappeared and don't rely on stdout, or something like that.

On Oct 2, 2011, at 8:35 AM, Xin Tong wrote:

> I am using 1.4.3. I send the SIGTERM from a Python script. Then I wait; the
> processes do not terminate until I keep pressing enter on the keyboard.
>
> Thanks
>
> Xin
>
> On Fri, Sep 30, 2011 at 10:10 PM, Ralph Castain wrote:
> Sigterm should work - what version are you using?
> Ralph
>
> Sent from my iPad
>
> On Sep 28, 2011, at 1:40 PM, Xin Tong wrote:
>
> > I am wondering what the proper way is to stop an mpirun process and the child
> > process it created. I tried to send SIGTERM, but it does not respond to it.
> > What kind of signal should I be sending to it?
> >
> > Thanks
> >
> > Xin
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
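Checking from Python that the PID has really gone away, instead of watching stdout, might look roughly like the sketch below. It is an illustration under a few assumptions and not code from the thread: a POSIX system, mpirun started through subprocess.Popen, and helper names (terminate_and_wait, pid_exists) that are made up here. The fallback to SIGKILL after a timeout is an addition of this sketch, not something recommended in the thread.

```python
import os
import signal
import subprocess


def terminate_and_wait(proc, timeout=10.0):
    """Send SIGTERM to a Popen child and wait for it to exit.

    Returns the child's return code (negative signal number if it was killed
    by a signal). If it has not exited after `timeout` seconds, fall back to
    SIGKILL; that escalation is this sketch's own choice.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        return proc.wait()


def pid_exists(pid):
    """Return True if a process with this PID still exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check without signalling
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

A caller would start the job with something like subprocess.Popen(["mpirun", "-np", "2", "./a.out"]) (the command line is a placeholder), call terminate_and_wait() on the returned object, and then use pid_exists() on the recorded PID as the final check.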
Re: [OMPI users] Segfault on any MPI communication on head node
I went into the directory that I used to install 1.4.3, did the following:

make clean
./configure --enable-debug
make -j8 all install

and it hangs at this when I try to run my code (I commented out all the host name stuff, so it's just MPI code now)

[hostname:16574] [[17705,0],0] ORTE_ERROR_LOG: Buffer type (described vs non-described) mismatch - operation not allowed in file base/odls_base_default_fns.c at line 2600

I'm googling for more info but does anyone have any ideas?

On 9/28/11 8:30 PM, Jeff Squyres wrote:

Use --enable-debug on your configure line. This will add in some debugging code to OMPI, and it'll compile everything with -g so that you can get stack traces.

Beware that the extra debugging junk makes OMPI slightly slower; don't do any benchmarking with this install, etc.

On Sep 28, 2011, at 6:27 PM, Phillip Vassenkov wrote:

I tried 1.4.4rc4, same problem. Where do I get a debugging version?

On 9/28/11 8:32 AM, Jeff Squyres wrote:

Agreed that the original program had the char*[20]/char[20] bug, but his segv is occurring before trying to use that array. So it's a bug - but he just hadn't hit it yet. :-)

I'd still like to see a debugging version so that we can get a real stack trace, and/or try the latest 1.4.4 RC (posted yesterday).

On Sep 27, 2011, at 3:08 PM, German Hoecht wrote:

char* name[20]; yields 20 (undefined) pointers to char, guess you mean
char name[20];

So Brent's suggestion should work as well(?)

To be safe I would also add:
gethostname(name,maxlen);
name[19] = '\0';
printf("Hello, world. I am %d of %d and host %s \n", rank, ...

Cheers

On 09/27/2011 07:40 PM, Phillip Vassenkov wrote:

Thanks, but my main concern is the segfault :P I changed it and, as I expected, it still segfaults.

On 9/27/11 9:48 AM, Henderson, Brent wrote:

Here is another possibly non-helpful suggestion. :) Change:

char* name[20];
int maxlen = 20;

To:

char name[256];
int maxlen = 256;

gethostname() is supposed to properly truncate the hostname it returns if the actual name is longer than the length provided, but since you have at least one that is longer than 20 characters, I'm curious.

Brent

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres
Sent: Tuesday, September 27, 2011 6:29 AM
To: Open MPI Users
Subject: Re: [OMPI users] Segfault on any MPI communication on head node

Hmm. It's not immediately clear to me what's going wrong here.

I hate to ask, but could you install a debugging version of Open MPI and capture a proper stack trace of the segv?

Also, could you try the 1.4.4 rc and see if that magically fixes the problem? (I'm about to post a new 1.4.4 rc later this morning, but either the current one or the one from later today would be a good datapoint)

On Sep 26, 2011, at 5:09 PM, Phillip Vassenkov wrote:

Yep, Fedora Core 14 and OpenMPI 1.4.3

On 9/24/11 7:02 AM, Jeff Squyres wrote:

Are you running the same OS version and Open MPI version between the head node and regular nodes?

On Sep 23, 2011, at 5:27 PM, Vassenkov, Phillip wrote:

Hey all,
I've been racking my brains over this for several days and was hoping anyone could enlighten me. I'll describe only the relevant parts of the network/computer systems. There is one head node and a multitude of regular nodes. The regular nodes are all identical to each other. If I run an MPI program from one of the regular nodes to any other regular nodes, everything works. If I include the head node in the hosts file, I get segfaults which I'll paste below along with sample code. The machines are all networked via InfiniBand and Ethernet. The issue only arises when MPI communication occurs. By this I mean, MPI_Init might succeed but the segfault always occurs on MPI_Barrier or MPI_Send/MPI_Recv. I found a workaround by disabling the openib btl and enforcing that communications go over InfiniBand (if I don't force InfiniBand, it'll go over Ethernet). This command works when the head node is included in the hosts file:

mpirun --hostfile hostfile --mca btl ^openib --mca btl_tcp_if_include ib0 -np 2 ./b.out

Sample Code:

#include "mpi.h"
#include
int main(int argc, char *argv[])
{
    int rank, nprocs;
    char* name[20];
    int maxlen = 20;
    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD,&nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD,&rank);
    MPI_Barrier(MPI_COMM_WORLD);
    gethostname(name,maxlen);
    printf("Hello, world. I am %d of %d and host %s \n", rank, nprocs,name);
    fflush(stdout);
    MPI_Finalize();
    return 0;
}

Segfault:

[pastec:19917] *** Process received signal ***
[pastec:19917] Signal: Segmentation fault (11)
[pastec:19917] Signal code: Address not mapped (1)
[pastec:19917] Failing at address: 0x8
[pastec:19917] [ 0] /lib64/libpthread.so.0() [0x34a880eeb0]
[pastec:19917] [ 1] /usr/lib64/libmthca-rdmav2.so(+0x36aa) [0x7eff6430b6aa]
[pastec:19917] [ 2] /usr/lib64/openmpi/lib/openm
Re: [OMPI users] [SOLVED] unresolvable R_X86_64_64 relocation against symbol `mpi_fortran_*
Wow -- painful! Glad you figured it out; thanks for posting it back here to make it google-able.

On Oct 3, 2011, at 9:21 AM, Dmitry N. Mikushin wrote:

> Ok, here's the solution: remove the --as-needed option from the compiler's
> internal linker invocation command line. Steps to do this:
>
> 1) Dump compiler specs: $ gcc -dumpspecs > specs
> 2) Open the specs file for editing and remove --as-needed from the line
>
> *link:
> %{!r:--build-id} --no-add-needed --as-needed %{!static:--eh-frame-hdr}
> %{!m32:-m elf_x86_64} %{m32:-m elf_i386} --hash-style=gnu
> %{shared:-shared} %{!shared: %{!static:
> %{rdynamic:-export-dynamic} %{m32:-dynamic-linker
> %{muclibc:/lib/ld-uClibc.so.0;:%{mbionic:/system/bin/linker;:/lib/ld-linux.so.2}}}
> %{!m32:-dynamic-linker
> %{muclibc:/lib/ld64-uClibc.so.0;:%{mbionic:/system/bin/linker64;:/lib64/ld-linux-x86-64.so.2}}}}
> %{static:-static}}
>
> resulting in
>
> *link:
> %{!r:--build-id} --no-add-needed %{!static:--eh-frame-hdr} %{!m32:-m
> elf_x86_64} %{m32:-m elf_i386} --hash-style=gnu %{shared:-shared}
> %{!shared: %{!static: %{rdynamic:-export-dynamic}
> %{m32:-dynamic-linker
> %{muclibc:/lib/ld-uClibc.so.0;:%{mbionic:/system/bin/linker;:/lib/ld-linux.so.2}}}
> %{!m32:-dynamic-linker
> %{muclibc:/lib/ld64-uClibc.so.0;:%{mbionic:/system/bin/linker64;:/lib64/ld-linux-x86-64.so.2}}}}
> %{static:-static}}
>
> 3) Save the specs file into the compiler's folder
> /usr/lib/gcc/// For example, in case of Ubuntu 11.10
> with gcc 4.6.1 it's /usr/lib/gcc/x86_64-linux-gnu/4.6.1/
>
> With this change there are no unresolvable relocations anymore!
>
> - D.
Re: [OMPI users] Segfault on any MPI communication on head node
That means you have mismatched installations around - one configured as debug, and one not. They have to match.

Sent from my iPad

On Oct 3, 2011, at 2:44 PM, Phillip Vassenkov wrote:

> I went into the directory that I used to install 1.4.3, did the following:
> make clean
> ./configure --enable-debug
> make -j8 all install
>
> and it hangs at this when I try to run my code (I commented out all the host
> name stuff, so it's just MPI code now)
>
> [hostname:16574] [[17705,0],0] ORTE_ERROR_LOG: Buffer type (described vs
> non-described) mismatch - operation not allowed in file
> base/odls_base_default_fns.c at line 2600
>
> I'm googling for more info but does anyone have any ideas?
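One way to start looking for the mismatched installations mentioned in this reply is to list every mpirun that is reachable through PATH on a node; more than one hit suggests a second Open MPI install that might be picked up in some situations. The sketch below is only an illustration of that idea. The helper name find_all_on_path is made up for this example, and it inspects only the machine it runs on, so it would have to be run on the head node and on a regular node separately.

```python
import os
from typing import List, Optional


def find_all_on_path(name: str, path: Optional[str] = None) -> List[str]:
    """Return every executable called `name` in the PATH directories, in search order.

    Entries that resolve to the same real file (for example through symlinks)
    are reported only once.
    """
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    seen = set()
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, name)
        if not (os.path.isfile(candidate) and os.access(candidate, os.X_OK)):
            continue
        real = os.path.realpath(candidate)
        if real not in seen:
            seen.add(real)
            hits.append(candidate)
    return hits
```

Calling find_all_on_path("mpirun") and comparing the result between nodes gives a first hint of which installation each node actually uses. It does not by itself tell whether an installation was configured with --enable-debug.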