Hi,

I want to test process bindings with a rankfile in openmpi-1.6.2. Both
machines are dual-processor dual-core machines running Solaris 10 x86_64.

tyr fd1026 138 cat host_sunpc0_1 
sunpc0 slots=4
sunpc1 slots=4

tyr fd1026 139 cat rankfile 
rank 0=sunpc0 slot=0:0-1,1:0-1
rank 1=sunpc1 slot=0:0-1
rank 2=sunpc1 slot=1:0
rank 3=sunpc1 slot=1:1

tyr fd1026 140 mpiexec -rf rankfile hostname
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------
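
To rule out a simple oversubscription, the two files shown above can be cross-checked with a small script. This is only a rough sketch: it assumes the plain `host slots=N` and `rank N=host slot=SPEC` line formats used above, it is not Open MPI's own parser, and the function names (`parse_hostfile`, `parse_rankfile`, `ranks_per_host`, `oversubscribed`) are made up for illustration.

```python
import re
from collections import Counter

# File contents copied from the listings above.
HOSTFILE = """\
sunpc0 slots=4
sunpc1 slots=4
"""

RANKFILE = """\
rank 0=sunpc0 slot=0:0-1,1:0-1
rank 1=sunpc1 slot=0:0-1
rank 2=sunpc1 slot=1:0
rank 3=sunpc1 slot=1:1
"""


def parse_hostfile(text):
    """Return {host: slots} for lines of the form 'host slots=N'."""
    slots = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = re.fullmatch(r"(\S+)\s+slots=(\d+)", line)
        if m is None:
            raise ValueError(f"unrecognized hostfile line: {line!r}")
        slots[m.group(1)] = int(m.group(2))
    return slots


def parse_rankfile(text):
    """Return [(rank, host, slot_spec)] for lines 'rank N=host slot=SPEC'."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = re.fullmatch(r"rank\s+(\d+)=(\S+)\s+slot=(\S+)", line)
        if m is None:
            raise ValueError(f"unrecognized rankfile line: {line!r}")
        entries.append((int(m.group(1)), m.group(2), m.group(3)))
    return entries


def ranks_per_host(entries):
    """Count how many ranks the rankfile places on each host."""
    return Counter(host for _, host, _ in entries)


def oversubscribed(slots, entries):
    """Hosts with more ranks than slots (hosts missing from the hostfile count as 0 slots)."""
    counts = ranks_per_host(entries)
    return sorted(h for h, n in counts.items() if n > slots.get(h, 0))


if __name__ == "__main__":
    slots = parse_hostfile(HOSTFILE)
    entries = parse_rankfile(RANKFILE)
    for host, n in sorted(ranks_per_host(entries).items()):
        print(f"{host}: {n} rank(s), {slots.get(host, 0)} slot(s)")
    print("oversubscribed hosts:", oversubscribed(slots, entries))
```

For the files above this counts one rank on sunpc0 and three on sunpc1, so neither host exceeds its four slots. The check only compares rank counts with slot counts; it does not validate the socket:core specifications.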

Is something wrong with my rankfile, do I have to add a hostfile, or is
this a bug? When I add a hostfile, I get the following error.


tyr fd1026 141 mpiexec -hostfile host_sunpc0_1 -rf rankfile hostname
[tyr.informatik.hs-fulda.de:20227] [[27927,0],0] ORTE_ERROR_LOG:
  Data unpack would read past end of buffer in file
  ../../../../openmpi-1.6.2/orte/mca/odls/base/odls_base_default_fns.c
  at line 927
^Cmpiexec: abort is already in progress...hit ctrl-c again to forcibly
  terminate


I get the following outputs when I use Linux instead of Solaris
(same hardware).

tyr fd1026 146 mpiexec -rf rankfile_linux hostname
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------

tyr fd1026 147 mpiexec -hostfile host_linpc0_1 -rf rankfile_linux hostname
[tyr.informatik.hs-fulda.de:20260] [[27952,0],0] ORTE_ERROR_LOG:
  Data unpack would read past end of buffer in file
  ../../../../openmpi-1.6.2/orte/mca/odls/base/odls_base_default_fns.c
  at line 927
[tyr:20260] *** Process received signal ***
[tyr:20260] Signal: Bus Error (10)
[tyr:20260] Signal code: Invalid address alignment (1)
[tyr:20260] Failing at address: 7463703a2f2f3129
/export2/prog/SunOS_sparc/openmpi-1.6.2_64_cc/lib64/libopen-rte.so.4.0.0:opal_backtrace_print+0x14
/export2/prog/SunOS_sparc/openmpi-1.6.2_64_cc/lib64/libopen-rte.so.4.0.0:0x335b48
/lib/sparcv9/libc.so.1:0xd88a4
/lib/sparcv9/libc.so.1:0xcc418
/lib/sparcv9/libc.so.1:0xcc624
/lib/sparcv9/libc.so.1:0x64394 [ Signal 2131043744 (?)]
/lib/sparcv9/libc.so.1:free+0x30
/export2/prog/SunOS_sparc/openmpi-1.6.2_64_cc/lib64/libopen-rte.so.4.0.0:orte_odls_base_default_construct_child_list+0x20b8
/export2/prog/SunOS_sparc/openmpi-1.6.2_64_cc/lib64/openmpi/mca_odls_default.so:0x11c80
...

"tyr" is a Sparc machine running Solaris 10. I get a similar error if
I run the same commands directly on a Linux machine.

tyr fd1026 148 ssh linpc4
linpc4 fd1026 100  mpiexec -rf rankfile_linux hostname
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------

linpc4 fd1026 101 mpiexec -hostfile host_linpc0_1 -rf rankfile_linux hostname
[linpc4:08079] [[49559,0],0] ORTE_ERROR_LOG:
  Data unpack would read past end of buffer in file
  ../../../../openmpi-1.6.2/orte/mca/odls/base/odls_base_default_fns.c
  at line 927
[linpc4:08079] *** Process received signal ***
[linpc4:08079] Signal: Segmentation fault (11)
[linpc4:08079] Signal code: Address not mapped (1)
[linpc4:08079] Failing at address: 0x900306368
[linpc4:08079] [ 0] /lib64/libpthread.so.0(+0xfd00) [0x7fbe174bcd00]
[linpc4:08079] [ 1] /lib64/libc.so.6(cfree+0x14) [0x7fbe17197d24]
[linpc4:08079] [ 2] /usr/local/openmpi-1.6.2_64_cc/lib64/libopen-rte.so.4(orte_odls_base_default_construct_child_list+0x2091) [0x7fbe182e4d21]
[linpc4:08079] [ 3] /usr/local/openmpi-1.6.2_64_cc/lib64/openmpi/mca_odls_default.so(+0x10dba) [0x7fbe15415dba]
...

Thank you very much in advance for any suggestions.


Kind regards

Siegmar
