So, very heterogeneous. I did some testing and I couldn't make it happen below 32 cores. Not sure if this is the real issue or if it requires a specific layout:
[brockp@nyx5512 ~]$ cat $PBS_NODEFILE | sort | uniq -c
1 nyx5512
1 nyx5515
1 nyx5518
1 nyx5523
1 nyx5527
2 nyx5537
1 nyx5542
1 nyx5560
2 nyx5561
2 nyx5562
3 nyx5589
1 nyx5591
1 nyx5593
1 nyx5617
2 nyx5620
1 nyx5622
5 nyx5629
1 nyx5630
1 nyx5770
1 nyx5771
2 nyx5772
1 nyx5780
3 nyx5784
2 nyx5820
10 nyx5844
2 nyx5847
1 nyx5849
1 nyx5852
2 nyx5856
1 nyx5870
8 nyx5872
1 nyx5894
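As a quick sanity check, the per-node counts above should sum to the -np value; a minimal sketch, assuming the usual one-line-per-slot $PBS_NODEFILE format:

wc -l < $PBS_NODEFILE                                        # total slots; should be 64 here
sort $PBS_NODEFILE | uniq -c | awk '{s+=$1} END {print s}'   # same total, summed per node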
This sort of layout gives me that warning if I keep -np 64:
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        nyx5589
   #processes:  2
   #cpus:       1
If I omit the -np ## it works, and nyx5589 does get 3 processes started.
If I look at the bindings of the three ranks on nyx5589 that it complains
about, they appear correct:
[root@nyx5589 ~]# hwloc-bind --get --pid 24826
0x00000080 -> 7
[root@nyx5589 ~]# hwloc-bind --get --pid 24827
0x00000400 -> 10
[root@nyx5589 ~]# hwloc-bind --get --pid 24828
0x00001000 -> 12
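To dump the binding of every task Torque confined on that node in one go, something like this should work; a rough sketch, assuming Torque's usual cpuset mount under /dev/cpuset/torque and a hypothetical job id:

jobid=12345.nyx.example             # hypothetical; substitute the real job id
for pid in $(cat /dev/cpuset/torque/$jobid/tasks); do   # every pid in the job's cpuset
    hwloc-bind --get --pid $pid     # print that pid's CPU mask
done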
I think I found the problem, though, and it's on the Torque side: while the
cpuset is set up with cores 7, 10, and 12, the PBS server thinks it gave out
6, 7, and 10. That mismatch is where the "2 processes" in the warning comes
from.
I checked some of the other jobs, and their cpusets and the PBS server's CPU
lists do match.
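For reference, this is roughly how the two views can be compared side by side; the cpuset path again assumes Torque's default mount, and the job id is hypothetical:

jobid=12345.nyx.example                  # hypothetical; use the real job id
cat /dev/cpuset/torque/$jobid/cpus       # cores the MOM actually put in the cpuset
pbsnodes nyx5589 | grep 'jobs ='         # core-slot/job pairs per the PBS server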
More investigation required. Still, it's strange: why would it give that
message at all? Why would Open MPI care, and why only when -np ## is given?
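One way to probe that is to have mpirun print the allocation it parsed from TM and the resulting process map; --display-allocation and --display-map are standard mpirun options in the 1.8 series:

mpirun --display-allocation --display-map -report-bindings -np 64 hostname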
Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
[email protected]
(734)936-1985
On Sep 23, 2014, at 3:27 PM, Maxime Boissonneault
<[email protected]> wrote:
> Do you know the topology of the cores allocated by Torque (i.e. were they all
> on the same nodes, or 8 per node, or a heterogeneous distribution, for
> example)?
>
>
> On 2014-09-23 15:05, Brock Palen wrote:
>> Yes, the request to Torque was procs=64.
>>
>> We are using cpusets.
>>
>> Running mpirun without -np 64 spawns 64 hostname processes.
>>
>> Brock Palen
>> www.umich.edu/~brockp
>> CAEN Advanced Computing
>> XSEDE Campus Champion
>> [email protected]
>> (734)936-1985
>>
>>
>>
>> On Sep 23, 2014, at 3:02 PM, Ralph Castain <[email protected]> wrote:
>>
>>> FWIW: that warning has been removed from the upcoming 1.8.3 release
>>>
>>>
>>> On Sep 23, 2014, at 11:45 AM, Reuti <[email protected]> wrote:
>>>
>>>> On 23.09.2014 at 19:53, Brock Palen wrote:
>>>>
>>>>> I found a fun head scratcher: with Open MPI 1.8.2 and Torque 5 built
>>>>> with TM support, on heterogeneous core layouts I get the fun thing:
>>>>> mpirun -report-bindings hostname <-------- Works
>>>> And you get 64 lines of output?
>>>>
>>>>
>>>>> mpirun -report-bindings -np 64 hostname <--------- Wat?
>>>>> --------------------------------------------------------------------------
>>>>> A request was made to bind to that would result in binding more
>>>>> processes than cpus on a resource:
>>>>>
>>>>>    Bind to:     CORE
>>>>>    Node:        nyx5518
>>>>>    #processes:  2
>>>>>    #cpus:       1
>>>>>
>>>>> You can override this protection by adding the "overload-allowed"
>>>>> option to your binding directive.
>>>>> --------------------------------------------------------------------------
>>>> How many cores are physically installed on this machine - two as mentioned
>>>> above?
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>> I ran with --oversubscribe and got the expected host list, which matched
>>>>> $PBS_NODEFILE and was 64 entries long:
>>>>>
>>>>> mpirun -overload-allowed -report-bindings -np 64 --oversubscribe hostname
>>>>>
>>>>> What did I do wrong? I'm stumped why one works and one doesn't, but the
>>>>> one that doesn't appears correct if you force it.
>>>>>
>>>>>
>>>>> Brock Palen
>>>>> www.umich.edu/~brockp
>>>>> CAEN Advanced Computing
>>>>> XSEDE Campus Champion
>>>>> [email protected]
>>>>> (734)936-1985
>>>>>
>>>>>
>>>>>
>
>
> --
> ---------------------------------
> Maxime Boissonneault
> Analyste de calcul - Calcul Québec, Université Laval
> Ph. D. en physique
>
