Hi,

yesterday I installed openmpi-1.7.4rc2r30323 on our machines ("Solaris 10 x86_64", "Solaris 10 Sparc", and "openSUSE Linux 12.1 x86_64", with Sun C 5.12). My rankfile "rf_linpc_sunpc_tyr" contains the following lines.
rank 0=linpc0 slot=0:0-1;1:0-1
rank 1=linpc1 slot=0:0-1
rank 2=sunpc1 slot=1:0
rank 3=tyr slot=1:0

I get no output when I run the following command.

mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname

"dbx" reports the following problem.

/opt/solstudio12.3/bin/sparcv9/dbx \
  /usr/local/openmpi-1.7.4_64_cc/bin/mpiexec
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.9' in your .dbxrc
Reading mpiexec
Reading ld.so.1
...
Reading libmd.so.1
(dbx) run -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
Running: mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
(process id 22337)
Reading libc_psr.so.1
...
Reading mca_dfs_test.so
execution completed, exit code is 1
(dbx) check -all
access checking - ON
memuse checking - ON
(dbx) run -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
Running: mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc_tyr hostname
(process id 22344)
Reading rtcapihook.so
...
RTC: Running program...
Read from uninitialized (rui) on thread 1:
Attempting to read 1 byte at address 0xffffffff7fffbf8b
    which is 459 bytes above the current stack pointer
Variable is 'cwd'
t@1 (l@1) stopped in opal_getcwd at line 65 in file "opal_getcwd.c"
   65       if (0 != strcmp(pwd, cwd)) {
(dbx) quit
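To illustrate what RTC is complaining about, here is a small stand-alone sketch (my own illustration, not the Open MPI code) in which a stack buffer named "cwd" is only filled on one branch but compared unconditionally. Under "check -all" this should produce the same kind of "read from uninitialized" report for the variable "cwd". Whether opal_getcwd() really contains such a pattern, or whether the report is a false positive (e.g. from a strcmp() that reads word-sized chunks past the terminating NUL), I cannot tell from the dbx output alone.

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char cwd[PATH_MAX];              /* stack buffer, not initialized */
    char *pwd = getenv("PWD");

    /* 'cwd' is only filled when $PWD is not set; when $PWD is set, the
       strcmp() below reads the uninitialized stack contents of 'cwd',
       which RTC reports as an RUI for the variable 'cwd'. */
    if (NULL == pwd) {
        if (NULL == getcwd(cwd, sizeof(cwd))) {
            perror("getcwd");
            return 1;
        }
        pwd = cwd;
    }

    if (0 != strcmp(pwd, cwd)) {     /* may read uninitialized 'cwd' */
        printf("$PWD and the stack buffer differ\n");
    }
    return 0;
}

Initializing the buffer (memset(cwd, 0, sizeof(cwd))) or comparing it only after getcwd() has filled it silences the report in this sketch; whether the same would be appropriate in opal_getcwd.c is only a guess on my part.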
Rankfiles work "fine" on x86_64 architectures. These are the contents of my rankfile "rf_linpc_sunpc".

rank 0=linpc1 slot=0:0-1;1:0-1
rank 1=sunpc1 slot=0:0-1
rank 2=sunpc1 slot=1:0
rank 3=sunpc1 slot=1:1

mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc hostname
[sunpc1:13489] MCW rank 1 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B][./.]
[sunpc1:13489] MCW rank 2 bound to socket 1[core 2[hwt 0]]: [./.][B/.]
[sunpc1:13489] MCW rank 3 bound to socket 1[core 3[hwt 0]]: [./.][./B]
sunpc1
sunpc1
sunpc1
[linpc1:29997] MCW rank 0 is not bound (or bound to all available processors)
linpc1

Unfortunately, "dbx" nevertheless reports a problem.

/opt/solstudio12.3/bin/amd64/dbx \
  /usr/local/openmpi-1.7.4_64_cc/bin/mpiexec
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.9' in your .dbxrc
Reading mpiexec
Reading ld.so.1
...
Reading libmd.so.1
(dbx) run -report-bindings -np 4 -rf rf_linpc_sunpc hostname
Running: mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc hostname
(process id 18330)
Reading mca_shmem_mmap.so
...
Reading mca_dfs_test.so
[sunpc1:18330] MCW rank 1 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B][./.]
[sunpc1:18330] MCW rank 2 bound to socket 1[core 2[hwt 0]]: [./.][B/.]
[sunpc1:18330] MCW rank 3 bound to socket 1[core 3[hwt 0]]: [./.][./B]
sunpc1
sunpc1
sunpc1
[linpc1:30148] MCW rank 0 is not bound (or bound to all available processors)
linpc1
execution completed, exit code is 0
(dbx) check -all
access checking - ON
memuse checking - ON
(dbx) run -report-bindings -np 4 -rf rf_linpc_sunpc hostname
Running: mpiexec -report-bindings -np 4 -rf rf_linpc_sunpc hostname
(process id 18340)
Reading rtcapihook.so
...
RTC: Running program...
Reading disasm.so
Read from uninitialized (rui) on thread 1:
Attempting to read 1 byte at address 0x436d57
    which is 15 bytes into a heap block of size 16 bytes at 0x436d48
This block was allocated from:
    [1] vasprintf() at 0xfffffd7fdc9b335a
    [2] asprintf() at 0xfffffd7fdc9b3452
    [3] opal_output_init() at line 184 in "output.c"
    [4] do_open() at line 548 in "output.c"
    [5] opal_output_open() at line 219 in "output.c"
    [6] opal_malloc_init() at line 68 in "malloc.c"
    [7] opal_init_util() at line 250 in "opal_init.c"
    [8] orterun() at line 658 in "orterun.c"
t@1 (l@1) stopped in do_open at line 638 in file "output.c"
  638           info[i].ldi_prefix = strdup(lds->lds_prefix);
(dbx)
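I cannot judge whether this second report is a real problem or an RTC false positive. The following stand-alone sketch only mirrors the layout from the report: a 16-byte heap block that holds a short prefix string and is later copied with strdup(). In output.c the block comes from asprintf(); I replace that with malloc() plus snprintf() here purely so that the tail of the block is guaranteed to stay uninitialized, and the format string is just my guess based on the "[host:pid]" prefixes in the output. If strdup()'s internal strlen() reads word-sized chunks, it may touch that uninitialized tail, which could explain a report at offset 15 of a 16-byte block.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* A 16-byte heap block, like the one in the RTC report.  malloc()
       plus snprintf() is used instead of asprintf() only so that the
       tail of the block is guaranteed to stay uninitialized. */
    char *prefix = malloc(16);
    if (NULL == prefix) {
        return 1;
    }
    /* Writes 10 bytes ("[sunpc1] " plus NUL); bytes 10-15 of the block
       stay uninitialized. */
    snprintf(prefix, 16, "[%s] ", "sunpc1");

    /* strdup() determines the length of 'prefix'; an optimized strlen()
       that reads word-sized chunks can touch the uninitialized bytes at
       the end of the block, which RTC may report as an RUI even though
       the copied string is correct. */
    char *copy = strdup(prefix);

    printf("prefix = \"%s\", copy = \"%s\"\n", prefix, copy ? copy : "(null)");
    free(copy);
    free(prefix);
    return 0;
}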
I can also manually bind processes to hwthreads on our Sun M4000 server (two quad-core Sparc VII processors with two hwthreads per core).

mpiexec --report-bindings -np 4 --bind-to hwthread hostname
[rs0.informatik.hs-fulda.de:09531] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [../B./../..][../../../..]
[rs0.informatik.hs-fulda.de:09531] MCW rank 2 bound to socket 1[core 4[hwt 0]]: [../../../..][B./../../..]
[rs0.informatik.hs-fulda.de:09531] MCW rank 3 bound to socket 1[core 5[hwt 0]]: [../../../..][../B./../..]
[rs0.informatik.hs-fulda.de:09531] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B./../../..][../../../..]
rs0.informatik.hs-fulda.de
rs0.informatik.hs-fulda.de
rs0.informatik.hs-fulda.de
rs0.informatik.hs-fulda.de

It doesn't work with cores. I know that it wasn't possible last summer, and it seems that it is still not possible now (perhaps the small hwloc test program at the end of this mail helps to see what hwloc itself reports about binding support).

mpiexec --report-bindings -np 4 --bind-to core hostname
--------------------------------------------------------------------------
Open MPI tried to bind a new process, but something went wrong.  The
process was killed without launching the target application.  Your job
will now abort.

  Local host:        rs0
  Application name:  /usr/local/bin/hostname
  Error message:     hwloc indicates cpu binding cannot be enforced
  Location:          ../../../../../openmpi-1.9a1r30290/orte/mca/odls/default/odls_default_module.c:500
--------------------------------------------------------------------------
4 total processes failed to start

Is it possible to specify hwthreads in a rankfile, so that I can use a rankfile for these machines?

I get the expected output if I use two M4000 servers, although the above-mentioned error still exists.

/opt/solstudio12.3/bin/sparcv9/dbx \
  /usr/local/openmpi-1.7.4_64_cc/bin/mpiexec
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.9' in your .dbxrc
Reading mpiexec
Reading ld.so.1
...
Reading libmd.so.1
(dbx) run --report-bindings --host rs0,rs1 -np 4 \
  --bind-to hwthread hostname
Running: mpiexec --report-bindings --host rs0,rs1 -np 4 --bind-to hwthread hostname
(process id 9599)
Reading libc_psr.so.1
...
Reading mca_dfs_test.so
[rs0.informatik.hs-fulda.de:09599] MCW rank 1 bound to socket 1[core 4[hwt 0]]: [../../../..][B./../../..]
[rs0.informatik.hs-fulda.de:09599] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B./../../..][../../../..]
rs0.informatik.hs-fulda.de
rs0.informatik.hs-fulda.de
rs1.informatik.hs-fulda.de
[rs1.informatik.hs-fulda.de:13398] MCW rank 2 bound to socket 0[core 0[hwt 0]]: [B./../../..][../../../..]
[rs1.informatik.hs-fulda.de:13398] MCW rank 3 bound to socket 1[core 4[hwt 0]]: [../../../..][B./../../..]
rs1.informatik.hs-fulda.de
execution completed, exit code is 0
(dbx) check -all
access checking - ON
memuse checking - ON
(dbx) run --report-bindings --host rs0,rs1 -np 4 \
  --bind-to hwthread hostname
Running: mpiexec --report-bindings --host rs0,rs1 -np 4 --bind-to hwthread hostname
(process id 9607)
Reading rtcapihook.so
...
RTC: Running program...
Read from uninitialized (rui) on thread 1:
Attempting to read 1 byte at address 0xffffffff7fffc80b
    which is 459 bytes above the current stack pointer
Variable is 'cwd'
dbx: warning: can't find file ".../openmpi-1.7.4rc2r30323-SunOS.sparc.64_cc/opal/util/../../../openmpi-1.7.4rc2r30323/opal/util/opal_getcwd.c"
dbx: warning: see `help finding-files'
t@1 (l@1) stopped in opal_getcwd at line 65 in file "opal_getcwd.c"
(dbx)

Our M4000 server has no access to the source code, which is why dbx couldn't find the file. Nevertheless, it is the same error message as above. Would it be possible for someone to solve this problem? Thank you very much in advance for any help.

Kind regards

Siegmar
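P.S.: Regarding the message "hwloc indicates cpu binding cannot be enforced": perhaps the following small test program is useful to see what hwloc itself reports about CPU binding support on a machine. It only uses the public hwloc API (hwloc_topology_get_support()); I do not know whether this is exactly the check that Open MPI performs in odls_default_module.c, so please treat it only as a rough diagnostic sketch.

#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topology;
    const struct hwloc_topology_support *support;

    if (0 != hwloc_topology_init(&topology)) {
        fprintf(stderr, "hwloc_topology_init failed\n");
        return 1;
    }
    if (0 != hwloc_topology_load(topology)) {
        fprintf(stderr, "hwloc_topology_load failed\n");
        hwloc_topology_destroy(topology);
        return 1;
    }

    /* Ask hwloc which kinds of CPU binding it believes it can perform
       on this system (1 = supported, 0 = not supported). */
    support = hwloc_topology_get_support(topology);
    printf("set_thisproc_cpubind:   %d\n", (int)support->cpubind->set_thisproc_cpubind);
    printf("set_proc_cpubind:       %d\n", (int)support->cpubind->set_proc_cpubind);
    printf("set_thisthread_cpubind: %d\n", (int)support->cpubind->set_thisthread_cpubind);
    printf("set_thread_cpubind:     %d\n", (int)support->cpubind->set_thread_cpubind);

    hwloc_topology_destroy(topology);
    return 0;
}

It can be compiled with something like "cc hwloc_support.c -o hwloc_support -lhwloc" (the file name is arbitrary; include and library paths may have to be adjusted to where hwloc is installed).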