Hi, I want to test process bindings with a rankfile in openmpi-1.6.2. Both machines are dual-processor dual-core machines running Solaris 10 x86_64.

tyr fd1026 138 cat host_sunpc0_1
sunpc0 slots=4
sunpc1 slots=4

tyr fd1026 139 cat rankfile
rank 0=sunpc0 slot=0:0-1,1:0-1
rank 1=sunpc1 slot=0:0-1
rank 2=sunpc1 slot=1:0
rank 3=sunpc1 slot=1:1

tyr fd1026 140 mpiexec -rf rankfile hostname
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------

Is something wrong with my rankfile, must I add a hostfile, or is it a bug?

I get the following error when I add a hostfile.

tyr fd1026 141 mpiexec -hostfile host_sunpc0_1 -rf rankfile hostname
[tyr.informatik.hs-fulda.de:20227] [[27927,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../openmpi-1.6.2/orte/mca/odls/base/odls_base_default_fns.c at line 927
^Cmpiexec: abort is already in progress...hit ctrl-c again to forcibly terminate

I get the following outputs when I use Linux instead of Solaris (same hardware).

tyr fd1026 146 mpiexec -rf rankfile_linux hostname
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------

tyr fd1026 147 mpiexec -hostfile host_linpc0_1 -rf rankfile_linux hostname
[tyr.informatik.hs-fulda.de:20260] [[27952,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../openmpi-1.6.2/orte/mca/odls/base/odls_base_default_fns.c at line 927
[tyr:20260] *** Process received signal ***
[tyr:20260] Signal: Bus Error (10)
[tyr:20260] Signal code: Invalid address alignment (1)
[tyr:20260] Failing at address: 7463703a2f2f3129
/export2/prog/SunOS_sparc/openmpi-1.6.2_64_cc/lib64/libopen-rte.so.4.0.0:opal_backtrace_print+0x14
/export2/prog/SunOS_sparc/openmpi-1.6.2_64_cc/lib64/libopen-rte.so.4.0.0:0x335b48
/lib/sparcv9/libc.so.1:0xd88a4
/lib/sparcv9/libc.so.1:0xcc418
/lib/sparcv9/libc.so.1:0xcc624
/lib/sparcv9/libc.so.1:0x64394 [ Signal 2131043744 (?)]
/lib/sparcv9/libc.so.1:free+0x30
/export2/prog/SunOS_sparc/openmpi-1.6.2_64_cc/lib64/libopen-rte.so.4.0.0:orte_odls_base_default_construct_child_list+0x20b8
/export2/prog/SunOS_sparc/openmpi-1.6.2_64_cc/lib64/openmpi/mca_odls_default.so:0x11c80
...

"tyr" is a Sparc machine running Solaris 10. I get a similar error if I run the command on a Linux machine.

tyr fd1026 148 ssh linpc4
linpc4 fd1026 100 mpiexec -rf rankfile_linux hostname
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------

linpc4 fd1026 101 mpiexec -hostfile host_linpc0_1 -rf rankfile_linux hostname
[linpc4:08079] [[49559,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../openmpi-1.6.2/orte/mca/odls/base/odls_base_default_fns.c at line 927
[linpc4:08079] *** Process received signal ***
[linpc4:08079] Signal: Segmentation fault (11)
[linpc4:08079] Signal code: Address not mapped (1)
[linpc4:08079] Failing at address: 0x900306368
[linpc4:08079] [ 0] /lib64/libpthread.so.0(+0xfd00) [0x7fbe174bcd00]
[linpc4:08079] [ 1] /lib64/libc.so.6(cfree+0x14) [0x7fbe17197d24]
[linpc4:08079] [ 2] /usr/local/openmpi-1.6.2_64_cc/lib64/libopen-rte.so.4(orte_odls_base_default_construct_child_list+0x2091) [0x7fbe182e4d21]
[linpc4:08079] [ 3] /usr/local/openmpi-1.6.2_64_cc/lib64/openmpi/mca_odls_default.so(+0x10dba) [0x7fbe15415dba]
...

Thank you very much in advance for any suggestions.

Kind regards

Siegmar