Re: [OMPI users] I have still a problem with rankfiles in openmpi-1.6.4rc3
Hi,

today I tried a different rankfile and once more ran into a problem. :-((

> > thank you very much for your patch. I have applied the patch to
> > openmpi-1.6.4rc4.
> >
> > Open MPI: 1.6.4rc4r28022
> > : [B .][. .] (slot list 0:0)
> > : [. B][. .] (slot list 0:1)
> > : [B B][. .] (slot list 0:0-1)
> > : [. .][B .] (slot list 1:0)
> > : [. .][. B] (slot list 1:1)
> > : [. .][B B] (slot list 1:0-1)
> > : [B B][B B] (slot list 0:0-1,1:0-1)
>
> That looks great. I'll file a CMR to get this patch into 1.6.
> Unless you indicate otherwise, I'll assume this issue is understood
> for 1.6.

Rankfile rf_6 is the same as last time. I have added one more line in rf_7,
and in rf_8 I switched the order of the hosts. Everything is still fine with
rf_6, but I don't get any output for rank 1 with rf_7, and I get an error for
rf_8. Both machines use the same hardware.

sunpc1 rankfiles 106 cat rf_6
# mpiexec -report-bindings -rf rf_6 hostname
rank 0=sunpc1 slot=0:0-1,1:0-1

sunpc1 rankfiles 107 cat rf_7
# mpiexec -report-bindings -rf rf_7 hostname
rank 0=sunpc1 slot=0:0-1,1:0-1
rank 1=sunpc0 slot=0:0-1

sunpc1 rankfiles 108 cat rf_8
# mpiexec -report-bindings -rf rf_8 hostname
rank 0=sunpc0 slot=0:0-1,1:0-1
rank 1=sunpc1 slot=0:0-1

sunpc1 rankfiles 109 mpiexec -report-bindings -rf rf_6 hostname
[sunpc1:09779] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)

sunpc1 rankfiles 110 mpiexec -report-bindings -rf rf_7 hostname
[sunpc1:09782] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)

sunpc1 rankfiles 111 mpiexec -report-bindings -rf rf_8 hostname
--------------------------------------------------------------------------
The rankfile that was used claimed that a host was either not allocated or
oversubscribed its slots. Please review your rank-slot assignments and your
host allocation to ensure a proper match. Also, some systems may require
using full hostnames, such as "host1.example.com" (instead of just plain
"host1").

  Host: sunpc0
--------------------------------------------------------------------------

I get the following output if I use sunpc0 as the local host.

sunpc0 rankfiles 102 mpiexec -report-bindings -rf rf_6 hostname
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------

sunpc0 rankfiles 103 mpiexec -report-bindings -rf rf_7 hostname
--------------------------------------------------------------------------
The rankfile that was used claimed that a host was either not allocated or
oversubscribed its slots. Please review your rank-slot assignments and your
host allocation to ensure a proper match. Also, some systems may require
using full hostnames, such as "host1.example.com" (instead of just plain
"host1").

  Host: sunpc1
--------------------------------------------------------------------------

sunpc0 rankfiles 104 mpiexec -report-bindings -rf rf_8 hostname
[sunpc0:19027] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)

I get the following output if I use tyr as the local host.

tyr rankfiles 218 mpiexec -report-bindings -rf rf_6 hostname
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------

tyr rankfiles 219 mpiexec -report-bindings -rf rf_7 hostname
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------

tyr rankfiles 220 mpiexec -report-bindings -rf rf_8 hostname
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------

Do you have any ideas why this happens? Thank you very much for any help in
advance.

Kind regards

Siegmar
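For readers unfamiliar with the notation used above: each rankfile line maps
one MPI rank to a host and a slot list of the form <socket>:<core list>. A
minimal sketch with placeholder hostnames (not the machines from this report):

# rank <N>=<host> slot=<socket>:<cores>[,<socket>:<cores>]
rank 0=hostA slot=0:0-1,1:0-1   # rank 0 on hostA, cores 0-1 of both sockets
rank 1=hostB slot=0:0-1         # rank 1 on hostB, cores 0-1 of socket 0

and the matching launch command would be something like
"mpiexec -report-bindings -rf my_rankfile hostname".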
[OMPI users] newbie: Submitting Open MPI jobs to SGE ( `qsh -pe orte 4` fails)
(cross-posted on SO: http://stackoverflow.com/questions/14775451)

Hi,

I'm very new to Open MPI and I'm trying to submit an Open MPI job to SGE.

I've installed Open MPI, not in /usr/... but in /commun/data/packages/openmpi/,
and it was compiled with --with-sge.

I've added a new PE in SGE with qconf as described in
http://docs.oracle.com/cd/E19080-01/n1.grid.eng6/817-5677/6ml49n2c0/index.html

# /commun/data/packages/openmpi/bin/ompi_info | grep gridengine
MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.6.3)

# qconf -sq all.q | grep pe_
pe_list make orte

Without SGE, the program runs without any problem, using several processors:

/commun/data/packages/openmpi/bin/orterun -np 20 ./a.out args

Now I want to submit my program to SGE. In the Open MPI FAQ, I read:

# Allocate a SGE interactive job with 4 slots
# from a parallel environment (PE) named 'orte'
shell$ qsh -pe orte 4

but my output is:

qsh -pe orte 4
Your job 84550 ("INTERACTIVE") has been submitted
waiting for interactive job to be scheduled ...
Could not start interactive job.

I've also tried the mpirun command embedded in a script:

$ cat ompi.sh
#!/bin/sh
/commun/data/packages/openmpi/bin/mpirun \
   /path/to/a.out args

but it fails:

$ cat ompi.sh.e84552
error: executing task of job 84552 failed: execution daemon on host "node02" didn't accept task
--------------------------------------------------------------------------
A daemon (pid 18327) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
error: executing task of job 84552 failed: execution daemon on host "node01" didn't accept task
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.

How can I fix this? Many thanks
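One way around both the failing interactive qsh and the missing-library error
is a regular batch job that requests the PE and exports the Open MPI paths
itself. A minimal sketch, assuming the PE is named 'orte' and using the
install prefix from the post; the job name and slot count are placeholders:

#!/bin/sh
# Minimal SGE batch script sketch (job name and slot count are placeholders).
#$ -N ompi_test
#$ -cwd
#$ -j y
#$ -pe orte 20

# Make sure the execution hosts can find mpirun, orted and the Open MPI
# shared libraries (the prefix below is the one from the post).
export PATH=/commun/data/packages/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/commun/data/packages/openmpi/lib:$LD_LIBRARY_PATH

# $NSLOTS is filled in by SGE with the number of granted slots.
mpirun -np $NSLOTS /path/to/a.out args

Such a script would be submitted with "qsub ompi.sh" rather than run directly.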
Re: [OMPI users] newbie: Submitting Open MPI jobs to SGE ( `qsh -pe orte 4` fails)
Hi,

Am 08.02.2013 um 19:36 schrieb Pierre LINDENBAUM:

> (cross-posted on SO: http://stackoverflow.com/questions/14775451)
>
> I'm very new to Open MPI and I'm trying to submit an Open MPI job to SGE:
>
> I've installed Open MPI, not in
> /usr/...
> but in
> /commun/data/packages/openmpi/
>
> it was compiled with --with-sge.
>
> I've added a new PE in SGE with qconf as described in
> http://docs.oracle.com/cd/E19080-01/n1.grid.eng6/817-5677/6ml49n2c0/index.html
>
> # /commun/data/packages/openmpi/bin/ompi_info | grep gridengine
> MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.6.3)
>
> # qconf -sq all.q | grep pe_
> pe_list make orte
>
> Without SGE, the program runs without any problem, using several processors.
>
> /commun/data/packages/openmpi/bin/orterun -np 20 ./a.out args
>
> Now I want to submit my program to SGE
>
> In the Open MPI FAQ, I read:
>
> # Allocate a SGE interactive job with 4 slots
> # from a parallel environment (PE) named 'orte'
> shell$ qsh -pe orte 4
>
> but my output is:
>
> qsh -pe orte 4
> Your job 84550 ("INTERACTIVE") has been submitted
> waiting for interactive job to be scheduled ...
> Could not start interactive job.

An INTERACTIVE job is more like an immediate job, i.e. "-now y". Do you have
an interactive queue configured, and is the cluster free right now?

> I've also tried the mpirun command embedded in a script:
>
> $ cat ompi.sh
> #!/bin/sh
> /commun/data/packages/openmpi/bin/mpirun \
> /path/to/a.out args
>
> but it fails
>
> $ cat ompi.sh.e84552
> error: executing task of job 84552 failed: execution daemon on host
> "node02" didn't accept task

This is a good sign: it is already trying to use `qrsh -inherit ...`.
Can you confirm the following settings:

$ qconf -sp orte
...
control_slaves TRUE

$ qconf -sq all.q
...
shell_start_mode unix_behavior

-- Reuti

> --------------------------------------------------------------------------
> A daemon (pid 18327) died unexpectedly with status 1 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> error: executing task of job 84552 failed: execution daemon on host
> "node01" didn't accept task
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
>
> How can I fix this?
>
> Many thanks
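For comparison, a PE definition of the kind Reuti asks about might look
roughly like the sketch below. The slot count and allocation rule are
placeholders; the important line is control_slaves TRUE, which allows Open MPI
to start its remote daemons through qrsh -inherit:

$ qconf -sp orte
pe_name            orte
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min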
[OMPI users] Hi, I am working on topic "Topology aware mapping of processes in intra node environment". I need to find the binding of each rank on the local machine. How do I do this? I am using OPENM
Hi,

I am working on the topic "Topology aware mapping of processes in intra node
environment". I need to find the binding of each rank on the local machine.
How do I do this? I am using Open MPI version 1.4.1.

Thank You
--
Kranthi
Re: [OMPI users] Hi, I am working on topic "Topology aware mapping of processes in intra node environment". I need to find the binding of each rank on the local machine. How do I do this? I am using O
Ummm... you might want to look at the developer's trunk, as we do topology
aware mapping there today. It will be released soon in the 1.7.0 release.

On Feb 8, 2013, at 5:48 PM, Kranthi Kumar wrote:

> Hi,
> I am working on the topic "Topology aware mapping of processes in intra node
> environment". I need to find the binding of each rank on the local machine.
> How do I do this? I am using Open MPI version 1.4.1.
>
> Thank You
>
> --
> Kranthi
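If it helps in the meantime: on a Linux node each rank can query its own
binding directly from the operating system. The program below is only an
illustrative sketch, not an Open MPI API; sched_getaffinity() is
Linux-specific.

/* Each rank prints the CPUs it is allowed to run on, as seen by the OS. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, cpu;
    long ncpus;
    char host[256];
    cpu_set_t mask;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    gethostname(host, sizeof(host));

    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
        ncpus = sysconf(_SC_NPROCESSORS_ONLN);
        printf("rank %d on %s is bound to CPUs:", rank, host);
        for (cpu = 0; cpu < ncpus; cpu++)
            if (CPU_ISSET(cpu, &mask))
                printf(" %d", cpu);
        printf("\n");
    }

    MPI_Finalize();
    return 0;
}

Compiled with mpicc and launched with mpirun, each rank then reports its own
affinity mask, which can be compared against what -report-bindings claims.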