Hi,

yesterday I installed openmpi-1.9a1r30100 on "Solaris 10 x86_64", "Solaris 10 Sparc", and "openSUSE Linux 12.1 x86_64" with Sun C 5.12. First of all the good news: "configure", "make", "make install", and "make check" completed without errors, i.e., "make check" no longer produces a SIGBUS error on Solaris Sparc and no longer blocks in or after "opal_path_nfs" on Linux. I reported both problems before. Thank you very much to everybody who solved them.
Unfortunately I still get a SIGBUS error on Solaris Sparc for "ompi_info -a".

tyr openmpi-1.9 99 ompi_info | grep MPI:
                Open MPI: 1.9a1r30100
tyr openmpi-1.9 100 ompi_info -a |& grep Signal
[tyr:09699] Signal: Bus Error (10)
[tyr:09699] Signal code: Invalid address alignment (1)
.../openmpi-1.9_64_cc/lib64/libopen-pal.so.0.0.0:0x1321b8
[ Signal 2099900312 (?)] Bus error
tyr openmpi-1.9 101

I can compile and run a small MPI program without a SIGBUS error. Jeff, thank you very much for solving this problem.

tyr small_prog 110 mpicc init_finalize.c
tyr small_prog 111 mpiexec -np 1 a.out
Hello!
tyr small_prog 112

"make install" didn't install the Javadoc documentation for the new Java interface. Is it necessary to install it in a separate step?

tyr small_prog 118 ls -l /usr/local/openmpi-1.9_64_cc/share/
total 6
drwxr-xr-x 5 root root  512 Dec 31 12:03 man
drwxr-xr-x 3 root root 3584 Dec 31 12:05 openmpi
drwxr-xr-x 3 root root  512 Dec 31 12:04 vampirtrace
tyr small_prog 119

In the past I could run a small program in a real heterogeneous system with little-endian (sunpc1, linpc1) and big-endian (rs0, tyr) machines.

tyr small_prog 101 ompi_info | grep MPI:
                Open MPI: 1.6.6a1r29175
tyr small_prog 102 mpiexec -np 3 -host rs0,sunpc1,linpc1 rank_size
I'm process 1 of 3 available processes running on sunpc1.
MPI standard 2.1 is supported.
I'm process 0 of 3 available processes running on rs0.informatik.hs-fulda.de.
MPI standard 2.1 is supported.
I'm process 2 of 3 available processes running on linpc1.
MPI standard 2.1 is supported.
tyr small_prog 103

Now I get no output at all.

tyr small_prog 130 ompi_info | grep MPI:
                Open MPI: 1.9a1r30100
tyr small_prog 131 mpiexec -np 3 -host rs0,sunpc1,linpc1 rank_size
tyr small_prog 132 mpiexec -np 3 -host rs0,sunpc1,linpc1 \
  --hetero-nodes --hetero-apps rank_size
tyr small_prog 133

Perhaps this behaviour is intended, because Open MPI doesn't support little-endian and big-endian machines in the same cluster or virtual computer (I only know LAM/MPI, which works in such an environment). On the other hand: does it make sense for the command to finish without any output if it doesn't work (even if "mpiexec" returns "1")? Nevertheless I have another problem with my small program.

tyr small_prog 158 uname -p
sparc
tyr small_prog 159 ssh rs0 uname -p
sparc
tyr small_prog 160 mpiexec rank_size
I'm process 0 of 1 available processes running on tyr.informatik.hs-fulda.de.
MPI standard 2.2 is supported.
tyr small_prog 161 ssh rs0 mpiexec rank_size
I'm process 0 of 1 available processes running on rs0.informatik.hs-fulda.de.
MPI standard 2.2 is supported.
tyr small_prog 162 mpiexec -np 2 -host tyr,rs0 rank_size
tyr small_prog 163 echo $status
1
tyr small_prog 164

The command works as expected on little-endian machines.

linpc1 small_prog 93 mpiexec -np 2 -host linpc1,sunpc1 rank_size
I'm process 0 of 2 available processes running on linpc1.
MPI standard 2.2 is supported.
I'm process 1 of 2 available processes running on sunpc1.
MPI standard 2.2 is supported.
linpc1 small_prog 94
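For completeness, "rank_size" is only a small test program. A minimal sketch of what it does, reconstructed from the output shown above (variable names are mine, and the real source may differ slightly):

/* rank_size.c - minimal sketch, reconstructed from the program's output */
#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[])
{
  int  ntasks, mytask, version, subversion, namelen;
  char processor_name[MPI_MAX_PROCESSOR_NAME];

  MPI_Init (&argc, &argv);
  MPI_Comm_size (MPI_COMM_WORLD, &ntasks);
  MPI_Comm_rank (MPI_COMM_WORLD, &mytask);
  MPI_Get_processor_name (processor_name, &namelen);
  MPI_Get_version (&version, &subversion);
  printf ("I'm process %d of %d available processes running on %s.\n"
          "MPI standard %d.%d is supported.\n",
          mytask, ntasks, processor_name, version, subversion);
  MPI_Finalize ();
  return 0;
}

It is built the usual way, e.g. "mpicc rank_size.c -o rank_size".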
Next I tried process binding.

rf_linpc:
---------
rank 0=linpc1 slot=0:0,1;1:0,1

rf_linpc_linpc:
---------------
rank 0=linpc0 slot=0:0-1;1:0-1
rank 1=linpc1 slot=0:0-1
rank 2=linpc1 slot=1:0
rank 3=linpc1 slot=1:1

rf_linpc_linpc_comma:
---------------------
rank 0=linpc0 slot=0:0,1;1:0,1
rank 1=linpc1 slot=0:0,1
rank 2=linpc1 slot=1:0
rank 3=linpc1 slot=1:1

linpc1 openmpi_1.7.x_or_newer 103 mpiexec -report-bindings -np 1 \
  -rf rf_linpc hostname
[linpc1:08461] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 1[core 2[hwt 0]], socket 1[core 3[hwt 0]]: [B/B][B/B]
linpc1
linpc1 openmpi_1.7.x_or_newer 104

That's the output which I expected, but I don't get the expected output for the following commands.

linpc1 openmpi_1.7.x_or_newer 105 mpiexec -report-bindings -np 4 \
  -rf rf_linpc_linpc hostname
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2 slots
that were requested by the application:
  hostname

Either request fewer slots for your application, or make more slots
available for use.
--------------------------------------------------------------------------
linpc1 openmpi_1.7.x_or_newer 106
linpc1 openmpi_1.7.x_or_newer 110 mpiexec -report-bindings -np 4 \
  -rf rf_linpc_linpc_comma hostname
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2 slots
that were requested by the application:
  hostname

Either request fewer slots for your application, or make more slots
available for use.
--------------------------------------------------------------------------
linpc1 openmpi_1.7.x_or_newer 111

It works well in Open MPI 1.6.x (similar rank file, but using "," to separate sockets due to a different syntax).

linpc1 openmpi_1.6.x 109 mpiexec -report-bindings -np 4 \
  -rf rf_linpc_linpc hostname
[linpc1:08675] MCW rank 1 bound to socket 0[core 0-1]: [B B][. .] (slot list 0:0-1)
[linpc1:08675] MCW rank 2 bound to socket 1[core 0]: [. .][B .] (slot list 1:0)
[linpc1:08675] MCW rank 3 bound to socket 1[core 1]: [. .][. B] (slot list 1:1)
linpc1
linpc1
[linpc0:00677] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
linpc0
linpc1
linpc1 openmpi_1.6.x 110

Open MPI 1.6.x even supports little- and big-endian machines for this simple command.

linpc1 openmpi_1.6.x 112 ompi_info | grep MPI:
                Open MPI: 1.6.6a1r29175
linpc1 openmpi_1.6.x 113 mpiexec -report-bindings -np 4 \
  -rf rf_linpc_sunpc_tyr hostname
[linpc1:08697] MCW rank 1 bound to socket 0[core 0-1]: [B B][. .] (slot list 0:0-1)
[linpc0:00758] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
linpc0
linpc1
tyr.informatik.hs-fulda.de
[tyr.informatik.hs-fulda.de:10286] MCW rank 3 bound to socket 1[core 0]: [.][B] (slot list 1:0)
[sunpc1:21136] MCW rank 2 bound to socket 1[core 0]: [. .][B .] (slot list 1:0)
sunpc1
linpc1 openmpi_1.6.x 114

Option "-bycore" isn't available any longer. Is this intended?

linpc1 openmpi_1.7.x_or_newer 111 mpiexec -report-bindings -np 4 \
  -host linpc0,linpc1,sunpc0,sunpc1 -cpus-per-proc 4 -bycore \
  -bind-to-core hostname
mpiexec: Error: unknown option "-bycore"
Type 'mpiexec --help' for usage.
linpc1 openmpi_1.7.x_or_newer 112
linpc1 openmpi_1.7.x_or_newer 112 mpiexec -report-bindings \
  -np 4 -host linpc0,linpc1,sunpc0,sunpc1 -cpus-per-proc 4 \
  -bind-to-core hostname
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        linpc0
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------
linpc1 openmpi_1.7.x_or_newer 113

It worked with Open MPI 1.6.x.

linpc1 openmpi_1.6.x 105 mpiexec -report-bindings -np 4 \
  -host linpc0,linpc1,sunpc0,sunpc1 -cpus-per-proc 4 -bycore \
  -bind-to-core hostname
[linpc1:09465] MCW rank 1 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B]
linpc1
[linpc0:01036] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B]
linpc0
[sunpc0:03796] MCW rank 2 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B]
sunpc0
[sunpc1:21335] MCW rank 3 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B]
sunpc1
linpc1 openmpi_1.6.x 106

Have you changed the syntax once more, so that I can still get the expected bindings with different command line options, or is this a problem in Open MPI 1.9.x?

I have similar problems with Java.

tyr java 197 mpiexec -np 4 java BcastIntArrayMain
Process 0 running on tyr.informatik.hs-fulda.de.
  intValues[0]: 0  intValues[1]: 11  intValues[2]: 22  intValues[3]: 33
Process 1 running on tyr.informatik.hs-fulda.de.
  intValues[0]: 0  intValues[1]: 11  intValues[2]: 22  intValues[3]: 33
Process 2 running on tyr.informatik.hs-fulda.de.
  intValues[0]: 0  intValues[1]: 11  intValues[2]: 22  intValues[3]: 33
Process 3 running on tyr.informatik.hs-fulda.de.
  intValues[0]: 0  intValues[1]: 11  intValues[2]: 22  intValues[3]: 33
tyr java 198 mpiexec -np 4 -host rs0,tyr java BcastIntArrayMain
tyr java 199 echo $status
1
tyr java 200

Why? Both machines are big-endian machines.

By the way, I have similar problems with openmpi-1.7.x. Java isn't available there at the moment, as I reported before.

tyr small_prog 103 ompi_info | grep MPI:
                Open MPI: 1.7.4rc2r30094
tyr small_prog 104 ompi_info -a |& grep Signal
[tyr:10441] Signal: Bus Error (10)
[tyr:10441] Signal code: Invalid address alignment (1)
.../openmpi-1.7.4_64_cc/lib64/libopen-pal.so.6.1.0:0x137af8
[ Signal 2099922960 (?)] Bus error
tyr small_prog 105
tyr small_prog 105 mpicc init_finalize.c
tyr small_prog 106 mpiexec -np 1 a.out
Hello!
tyr small_prog 107
tyr small_prog 107 mpiexec -np 3 -host rs0,sunpc1,linpc1 rank_size
tyr small_prog 108 mpiexec -np 3 -host rs0,sunpc1,linpc1 \
? --hetero-nodes --hetero-apps rank_size
tyr small_prog 109

and so on.

I'm sorry that I still cause trouble, but on the other hand I would be very grateful if somebody could solve these problems. Thank you very much in advance for any help.

Kind regards

Siegmar