So, very heterogeneous. I did some testing and I couldn't make it happen below 32 cores. Not sure if this is the real issue or if it requires a specific layout:
[brockp@nyx5512 ~]$ cat $PBS_NODEFILE | sort | uniq -c
      1 nyx5512
      1 nyx5515
      1 nyx5518
      1 nyx5523
      1 nyx5527
      2 nyx5537
      1 nyx5542
      1 nyx5560
      2 nyx5561
      2 nyx5562
      3 nyx5589
      1 nyx5591
      1 nyx5593
      1 nyx5617
      2 nyx5620
      1 nyx5622
      5 nyx5629
      1 nyx5630
      1 nyx5770
      1 nyx5771
      2 nyx5772
      1 nyx5780
      3 nyx5784
      2 nyx5820
     10 nyx5844
      2 nyx5847
      1 nyx5849
      1 nyx5852
      2 nyx5856
      1 nyx5870
      8 nyx5872
      1 nyx5894

This sort of layout gives me that warning if I leave -np 64 on the command line:

A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        nyx5589
   #processes:  2
   #cpus:       1

If I omit the -np ##, it works, and nyx5589 does get 3 processes started.

If I look at the binding of the three ranks on nyx5589 that it complains about, they appear correct:

[root@nyx5589 ~]# hwloc-bind --get --pid 24826
0x00000080   -> 7
[root@nyx5589 ~]# hwloc-bind --get --pid 24827
0x00000400   -> 10
[root@nyx5589 ~]# hwloc-bind --get --pid 24828
0x00001000   -> 12

I think I found the problem, though, and it's on the Torque side: while the cpuset on the node contains cores 7, 10, and 12, the PBS server thinks it gave out 6, 7, and 10. That is where the "#processes: 2" count comes from. I checked some of the other jobs, and for those the cpusets and the PBS server CPU lists are the same. More investigation required. (A quick way to compare the two is sketched after the quoted thread below.)

Still, it's strange that it gives that message at all. Why would Open MPI care, and why only when -np ## is given?

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
bro...@umich.edu
(734)936-1985


On Sep 23, 2014, at 3:27 PM, Maxime Boissonneault <maxime.boissonnea...@calculquebec.ca> wrote:

> Do you know the topology of the cores allocated by Torque (i.e. were they all
> on the same nodes, or 8 per node, or a heterogeneous distribution, for
> example)?
>
>
> On 2014-09-23 15:05, Brock Palen wrote:
>> Yes, the request to Torque was procs=64.
>>
>> We are using cpusets.
>>
>> The mpirun without -np 64 creates 64 spawned hostnames.
>>
>> Brock Palen
>> www.umich.edu/~brockp
>> CAEN Advanced Computing
>> XSEDE Campus Champion
>> bro...@umich.edu
>> (734)936-1985
>>
>>
>>
>> On Sep 23, 2014, at 3:02 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> FWIW: that warning has been removed from the upcoming 1.8.3 release.
>>>
>>>
>>> On Sep 23, 2014, at 11:45 AM, Reuti <re...@staff.uni-marburg.de> wrote:
>>>
>>>> On 23.09.2014 at 19:53, Brock Palen wrote:
>>>>
>>>>> I found a fun head scratcher: with Open MPI 1.8.2 and Torque 5 built
>>>>> with TM support, on hetero core layouts I get the fun thing:
>>>>>
>>>>> mpirun -report-bindings hostname          <-------- Works
>>>>
>>>> And you get 64 lines of output?
>>>>
>>>>
>>>>> mpirun -report-bindings -np 64 hostname   <--------- Wat?
>>>>> --------------------------------------------------------------------------
>>>>> A request was made to bind to that would result in binding more
>>>>> processes than cpus on a resource:
>>>>>
>>>>>    Bind to:     CORE
>>>>>    Node:        nyx5518
>>>>>    #processes:  2
>>>>>    #cpus:       1
>>>>>
>>>>> You can override this protection by adding the "overload-allowed"
>>>>> option to your binding directive.
>>>>> --------------------------------------------------------------------------
>>>>
>>>> How many cores are physically installed on this machine - two as mentioned
>>>> above?
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>> I ran with --oversubscribe and got the expected host list, which matched
>>>>> $PBS_NODEFILE and was 64 entries long:
>>>>>
>>>>> mpirun -overload-allowed -report-bindings -np 64 --oversubscribe hostname
>>>>>
>>>>> What did I do wrong? I'm stumped why one works and one doesn't, but the
>>>>> one that doesn't, if you force it, appears correct.
>>>>>
>>>>>
>>>>> Brock Palen
>>>>> www.umich.edu/~brockp
>>>>> CAEN Advanced Computing
>>>>> XSEDE Campus Champion
>>>>> bro...@umich.edu
>>>>> (734)936-1985
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> us...@open-mpi.org
>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>> Link to this post: http://www.open-mpi.org/community/lists/users/2014/09/25375.php
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> Link to this post: http://www.open-mpi.org/community/lists/users/2014/09/25376.php
>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post: http://www.open-mpi.org/community/lists/users/2014/09/25378.php
>>
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: http://www.open-mpi.org/community/lists/users/2014/09/25379.php
>
>
> --
> ---------------------------------
> Maxime Boissonneault
> Scientific computing analyst - Calcul Québec, Université Laval
> Ph.D. in Physics
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2014/09/25380.php
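
P.S. For anyone who wants to make the same cpuset-versus-server comparison, here is a rough sketch, not a tested recipe: the job id is a placeholder, and the cpuset path assumes Torque's usual mount under /dev/cpuset/torque/<jobid>, which may be named differently on other setups.

    # Sketch only -- job id, node name, and cpuset mount point are
    # placeholders/assumptions; adjust for your site.
    JOBID=1234567        # hypothetical Torque job id
    NODE=nyx5589

    # CPUs in the cpuset the MOM actually created on the node
    # (some kernels/mounts expose this file as cpuset.cpus instead of cpus):
    ssh $NODE cat /dev/cpuset/torque/$JOBID/cpus

    # CPU slots the pbs_server believes it handed out; exec_host is of the
    # form node/cpu+node/cpu+... and may wrap over several lines:
    qstat -f $JOBID | grep -A 2 exec_host

If the two lists disagree for a node, that is the mismatch described above.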
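
P.P.S. A side note on the hwloc-bind output above: if you want the raw masks translated into PU indexes automatically rather than by eye, hwloc-calc can do it. A couple of hedged examples (standard hwloc usage, not verified on this exact install):

    # Translate a binding mask into logical PU indexes (7 in this case).
    hwloc-calc --intersect pu 0x00000080

    # Or query and translate a running rank's binding in one go.
    hwloc-calc --intersect pu $(hwloc-bind --get --pid 24826)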