Patrick, thanks for the report.

Can you confirm that what happened was:
- you defined OMPI_MCA_plm_rsh_agent=oarshmost
- oarshmost was not in the $PATH
- mpirun silently ignored the remote nodes

If that is correct, then I think mpirun should have reported an error
(oarshmost not found, or cannot remotely start orted) instead of this silent
behaviour.
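
In the meantime, a small sanity check in the job script should catch a
missing agent before mpirun even tries to start the remote orted daemons.
Something along these lines (just a sketch, the agent name and hostfile are
the ones from your setup, the rest is generic shell):

    # fail early if the OAR launch agent cannot be resolved,
    # otherwise mpirun quietly falls back to the local node only
    export OMPI_MCA_plm_rsh_agent=oarshmost
    if ! command -v oarshmost >/dev/null 2>&1; then
        echo "ERROR: oarshmost not found in PATH" >&2
        exit 1
    fi
    mpirun --hostfile $OAR_NODEFILE --bind-to core ./location.exe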

Cheers,

Gilles

On Mon, Sep 21, 2015 at 11:43 PM, Patrick Begou
<patrick.be...@legi.grenoble-inp.fr> wrote:
> Hi Gilles,
>
> I've made a big mistake! When compiling the patched version of OpenMPI and
> creating a new module, I forgot to add the path to the oarshmost command
> while OMPI_MCA_plm_rsh_agent=oarshmost was set....
> OpenMPI was silently ignoring the oarshmost command as it was unable to
> find it, and so only one node was available!
>
> The good thing is that with your patch, oversubscribing does not occur
> anymore on the nodes; it seems to solve the problem we had quite
> efficiently. I'll keep this patched version in production for the users, as
> the previous one was allowing 2 processes on some cores from time to time,
> with randomly bad code performance in these cases.
>
> Yes, this computer is the biggest one of the CIMENT mesocenter; it is
> called... froggy, and all the nodes are little frogs :-)
> https://ciment.ujf-grenoble.fr/wiki-pub/index.php/Hardware:Froggy
>
> I was using $OAR_NODEFILE and frog.txt to check different syntaxes: one
> with a list of nodes (one line with a node name for each available core)
> and the second with one line per node and the "slots" information for the
> number of cores. E.g.:
>
> [begou@frog7 MPI_TESTS]$ cat $OAR_NODEFILE
> frog7
> frog7
> frog7
> frog7
> frog8
> frog8
> frog8
> frog8
>
> [begou@frog7 MPI_TESTS]$ cat frog.txt
> frog7 slots=4
> frog8 slots=4
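>
> (For what it's worth, the second form can be generated from the first with
> a one-liner along these lines, just the idea, untested here:
>
>   sort $OAR_NODEFILE | uniq -c | awk '{print $2" slots="$1}' > frog.txt
>
> It counts how many times each node appears and turns that count into the
> "slots" value.)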
>
> Thanks again for the patch and your help.
>
> Patrick
>
>
> Gilles Gouaillardet wrote:
>
> Thanks Patrick,
>
> Could you please try again with the --hetero-nodes mpirun option ?
> (I am afk, and not 100% sure about the syntax)
>
> Could you also submit a job with 2 nodes and 4 cores on each node, that
> does:
>   cat /proc/self/status
>   oarshmost <remote host> cat /proc/self/status
>
> By the way, is there any reason why you use a machine file (frog.txt)
> instead of using $OAR_NODEFILE directly ?
> /* not to mention I am surprised a French supercomputer is called "frog" ;-) */
>
> Cheers,
>
> Gilles
>
> On Friday, September 18, 2015, Patrick Begou
> <patrick.be...@legi.grenoble-inp.fr> wrote:
>>
>> Gilles Gouaillardet wrote:
>>
>> Patrick,
>>
>> By the way, this will work when running on a single node.
>>
>> I do not know what will happen when you run on multiple nodes...
>> Since there is no OAR integration in openmpi, I guess you are using ssh
>> to start orted on the remote nodes (unless you instructed ompi to use an
>> OARified version of ssh).
>>
>> Yes, OMPI_MCA_plm_rsh_agent=oarshmost
>> This also exports the needed environment instead of multiple -x options,
>> to be as similar as possible to the environments on the French national
>> supercomputers.
>>
>> My concern is that the remote orted might not run within the cpuset that
>> was created by OAR for this job, so you might end up using all the cores
>> on the remote nodes.
>>
>> The OAR environment does this. With older OpenMPI versions everything
>> works fine.
>>
>> Please let us know how that works for you.
>>
>> Cheers,
>>
>> Gilles
>>
>>
>> On 9/18/2015 5:02 PM, Gilles Gouaillardet wrote:
>>
>> Patrick,
>>
>> I just filed PR 586 https://github.com/open-mpi/ompi-release/pull/586
>> for the v1.10 series.
>>
>> This is only a three-line patch. Could you please give it a try ?
>>
>>
>> This patch solves the problem when OpenMPI uses one node, but now I'm
>> unable to use more than one node.
>> On one node, with 4 cores in the cpuset:
>>
>> mpirun --bind-to core --hostfile $OAR_NODEFILE ./location.exe |grep 'thread is now running on PU' |sort
>> (process 0) thread is now running on PU logical index 0 (OS/physical index 12) on system frog26
>> (process 1) thread is now running on PU logical index 1 (OS/physical index 13) on system frog26
>> (process 2) thread is now running on PU logical index 2 (OS/physical index 14) on system frog26
>> (process 3) thread is now running on PU logical index 3 (OS/physical index 15) on system frog26
>>
>> [begou@frog26 MPI_TESTS]$ mpirun -np 5 --bind-to core --hostfile $OAR_NODEFILE ./location.exe
>> --------------------------------------------------------------------------
>> A request was made to bind to that would result in binding more
>> processes than cpus on a resource:
>>
>>    Bind to:     CORE
>>    Node:        frog26
>>    #processes:  2
>>    #cpus:       1
>>
>> You can override this protection by adding the "overload-allowed"
>> option to your binding directive.
>>
>>
>> But if I request two nodes (4 cores on each), only 4 processes can start
>> on the local cores, none on the second host:
>> [begou@frog5 MPI_TESTS]$ cat $OAR_NODEFILE
>> frog5
>> frog5
>> frog5
>> frog5
>> frog6
>> frog6
>> frog6
>> frog6
>>
>> [begou@frog5 MPI_TESTS]$ cat ./frog.txt
>> frog5 slots=4
>> frog6 slots=4
>>
>> But only 4 processes are launched:
>> [begou@frog5 MPI_TESTS]$ mpirun --hostfile frog.txt --bind-to core ./location.exe |grep 'thread is now running on PU'
>> (process 0) thread is now running on PU logical index 0 (OS/physical index 12) on system frog5
>> (process 1) thread is now running on PU logical index 1 (OS/physical index 13) on system frog5
>> (process 2) thread is now running on PU logical index 2 (OS/physical index 14) on system frog5
>> (process 3) thread is now running on PU logical index 3 (OS/physical index 15) on system frog5
>>
>> If I explicitly ask for 8 processes (one for each of the 4 cores on the
>> 2 nodes):
>> [begou@frog5 MPI_TESTS]$ mpirun --hostfile frog.txt -np 8 --bind-to core ./location.exe
>> --------------------------------------------------------------------------
>> A request was made to bind to that would result in binding more
>> processes than cpus on a resource:
>>
>>    Bind to:     CORE
>>    Node:        frog5
>>    #processes:  2
>>    #cpus:       1
>>
>> You can override this protection by adding the "overload-allowed"
>> option to your binding directive.
>>
>>
>> Cheers,
>>
>> Gilles
>>
>> On 9/18/2015 4:54 PM, Patrick Begou wrote:
>>
>> Ralph Castain wrote:
>>
>> As I said, if you don't provide an explicit slot count in your hostfile,
>> we default to allowing oversubscription. We don't have OAR integration in
>> OMPI, and so mpirun isn't recognizing that you are running under a
>> resource manager - it thinks this is just being controlled by a hostfile.
>>
>>
>> What looks strange to me is that in this case (default) oversubscription
>> is allowed up to the number of cores of one CPU (8), not the number of
>> cores available in the node (16) or unlimited...
>>
>> If you want us to error out on oversubscription, you can either add the
>> flag you identified, or simply change your hostfile to:
>>
>> frog53 slots=4
>>
>> Either will work.
>>
>> This syntax in the host file doesn't change anything about the
>> oversubscribing problem. It is still allowed, with the same maximum
>> number of processes, for this test case:
>>
>> [begou@frog7 MPI_TESTS]$ mpirun -np 8 --hostfile frog7.txt --bind-to core ./location.exe|grep 'thread is now running on PU' |sort
>> (process 0) thread is now running on PU logical index 0 (OS/physical index 0) on system frog7
>> (process 1) thread is now running on PU logical index 2 (OS/physical index 6) on system frog7
>> (process 2) thread is now running on PU logical index 0 (OS/physical index 0) on system frog7
>> (process 3) thread is now running on PU logical index 2 (OS/physical index 6) on system frog7
>> (process 4) thread is now running on PU logical index 3 (OS/physical index 7) on system frog7
>> (process 5) thread is now running on PU logical index 1 (OS/physical index 5) on system frog7
>> (process 6) thread is now running on PU logical index 2 (OS/physical index 6) on system frog7
>> (process 7) thread is now running on PU logical index 3 (OS/physical index 7) on system frog7
>>
>> [begou@frog7 MPI_TESTS]$ cat frog7.txt
>> frog7 slots=4
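>>
>> (If it helps, the same overload can be seen without my test code, for
>> example with something along the lines of:
>>
>>   mpirun -np 8 --hostfile frog7.txt --bind-to core --report-bindings /bin/true
>>
>> which makes mpirun itself print where each rank gets bound.)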
>>
>> Patrick
>>
>>
>> On Sep 16, 2015, at 1:00 AM, Patrick Begou
>> <patrick.be...@legi.grenoble-inp.fr> wrote:
>>
>> Thanks all for your answers, I've added some details about the tests I
>> have run. See below.
>>
>>
>> Ralph Castain wrote:
>>
>> Not precisely correct. It depends on the environment.
>>
>> If there is a resource manager allocating nodes, or you provide a
>> hostfile that specifies the number of slots on the nodes, or you use
>> -host, then we default to no-oversubscribe.
>>
>> I'm using a batch scheduler (OAR).
>> # cat /dev/cpuset/oar/begou_7955553/cpuset.cpus
>> 4-7
>>
>> So 4 cores allowed. Nodes have two eight-core CPUs.
>>
>> Node file contains:
>> # cat $OAR_NODEFILE
>> frog53
>> frog53
>> frog53
>> frog53
>>
>> # mpirun --hostfile $OAR_NODEFILE -bind-to core location.exe
>> is okay (my test code shows one process on each core):
>> (process 3) thread is now running on PU logical index 1 (OS/physical index 5) on system frog53
>> (process 0) thread is now running on PU logical index 3 (OS/physical index 7) on system frog53
>> (process 1) thread is now running on PU logical index 0 (OS/physical index 4) on system frog53
>> (process 2) thread is now running on PU logical index 2 (OS/physical index 6) on system frog53
>>
>> # mpirun -np 5 --hostfile $OAR_NODEFILE -bind-to core location.exe
>> oversubscribes with:
>> (process 0) thread is now running on PU logical index 3 (OS/physical index 7) on system frog53
>> (process 1) thread is now running on PU logical index 1 (OS/physical index 5) on system frog53
>> (process 3) thread is now running on PU logical index 2 (OS/physical index 6) on system frog53
>> (process 4) thread is now running on PU logical index 0 (OS/physical index 4) on system frog53
>> (process 2) thread is now running on PU logical index 2 (OS/physical index 6) on system frog53
>> This is not allowed with OpenMPI 1.7.3.
>>
>> I can increase up to the maximum core count of this first processor
>> (8 cores):
>> # mpirun -np 8 --hostfile $OAR_NODEFILE -bind-to core location.exe |grep 'thread is now running on PU'
>> (process 5) thread is now running on PU logical index 1 (OS/physical index 5) on system frog53
>> (process 7) thread is now running on PU logical index 3 (OS/physical index 7) on system frog53
>> (process 4) thread is now running on PU logical index 0 (OS/physical index 4) on system frog53
>> (process 6) thread is now running on PU logical index 2 (OS/physical index 6) on system frog53
>> (process 2) thread is now running on PU logical index 1 (OS/physical index 5) on system frog53
>> (process 0) thread is now running on PU logical index 2 (OS/physical index 6) on system frog53
>> (process 1) thread is now running on PU logical index 0 (OS/physical index 4) on system frog53
>> (process 3) thread is now running on PU logical index 0 (OS/physical index 4) on system frog53
>>
>> But I cannot overload beyond the 8 cores (the maximum core count of one
>> CPU):
>> # mpirun -np 9 --hostfile $OAR_NODEFILE -bind-to core location.exe
>> A request was made to bind to that would result in binding more
>> processes than cpus on a resource:
>>
>>    Bind to:     CORE
>>    Node:        frog53
>>    #processes:  2
>>    #cpus:       1
>>
>> You can override this protection by adding the "overload-allowed"
>> option to your binding directive.
>>
>> Now if I add --nooversubscribe the problem doesn't exist anymore (no more
>> than 4 processes, one on each core). So it looks like the default
>> behavior is a no-oversubscribe limited to the number of cores of the
>> socket???
>>
>> Again, with 1.7.3 this problem doesn't occur at all.
>>
>> Patrick
>>
>>
>>
>> If you provide a hostfile that doesn't specify slots, then we use the
>> number of cores we find on each node, and we allow oversubscription.
>>
>> What is being described sounds like more of a bug than an intended
>> feature. I'd need to know more about it, though, to be sure. Can you tell
>> me how you are specifying this cpuset?
>>
>>
>> On Sep 15, 2015, at 4:44 PM, Matt Thompson <fort...@gmail.com> wrote:
>>
>> Looking at the Open MPI 1.10.0 man page:
>>
>> https://www.open-mpi.org/doc/v1.10/man1/mpirun.1.php
>>
>> it looks like perhaps -oversubscribe (which was an option) is now the
>> default behavior. Instead we have:
>>
>>   -nooversubscribe, --nooversubscribe
>>     Do not oversubscribe any nodes; error (without starting any
>>     processes) if the requested number of processes would cause
>>     oversubscription. This option implicitly sets "max_slots" equal to
>>     the "slots" value for each node.
>>
>> It also looks like -map-by has a way to implement it as well (see man
>> page).
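>>
>> Presumably running with the flag added explicitly, e.g. something like
>>
>>   mpirun -np 8 --nooversubscribe --bind-to core my_application
>>
>> should give back the old error-out behaviour.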
>>
>> Thanks for letting me/us know about this. On a system of mine I sort of
>> depend on the -nooversubscribe behavior!
>>
>> Matt
>>
>>
>>
>> On Tue, Sep 15, 2015 at 11:17 AM, Patrick Begou
>> <patrick.be...@legi.grenoble-inp.fr> wrote:
>>>
>>> Hi,
>>>
>>> I'm running OpenMPI 1.10.0 built with Intel 2015 compilers on a Bullx
>>> System.
>>> I have some trouble with the bind-to core option when using a cpuset.
>>> If the cpuset contains fewer cores than a full CPU (e.g. 4 cores allowed
>>> on an 8-core CPU), OpenMPI 1.10.0 allows overloading these cores up to
>>> the maximum number of cores of the CPU.
>>> With this config, and because the cpuset only allows 4 cores, I can
>>> reach 2 processes/core if I use:
>>>
>>>   mpirun -np 8 --bind-to core my_application
>>>
>>> OpenMPI 1.7.3 doesn't show the problem in the same situation:
>>>
>>>   mpirun -np 8 --bind-to-core my_application
>>>
>>> returns:
>>>
>>>   A request was made to bind to that would result in binding more
>>>   processes than cpus on a resource
>>>
>>> and that's okay of course.
>>>
>>>
>>> Is there a way to avoid this overloading with OpenMPI 1.10.0 ?
>>>
>>> Thanks
>>>
>>> Patrick
>
> --
> ===================================================================
> | Equipe M.O.S.T.         |                                       |
> | Patrick BEGOU           | mailto:patrick.be...@grenoble-inp.fr  |
> | LEGI                    |                                       |
> | BP 53 X                 | Tel 04 76 82 51 35                    |
> | 38041 GRENOBLE CEDEX    | Fax 04 76 82 52 71                    |
> ===================================================================