Re: [OMPI users] Unable to mpirun from within torque

2016-09-08 Thread r...@open-mpi.org
Someone has done some work there since I last did, but I can see the issue. 
Torque indeed always provides an ordered file - the only way you can get an 
unordered one is for someone to edit it, and that is forbidden - i.e., you get 
what you deserve because you are messing around with a system-defined file :-)

The problem is that Torque internally assigns a “launch ID” which is just the 
integer position of the nodename in the PBS_NODEFILE. So if you modify that 
position, then we get the wrong index - and everything goes down the drain from 
there. In your example, n1.cluster changed index from 3 to 2 because of your 
edit. Torque thinks that index 2 is just another reference to n0.cluster, and 
so we merrily launch a daemon onto the wrong node.
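
A sketch of what that means with the nodefile from your example (positions counted from the top of the file):

   Torque-generated file          edited file
   position 1: n0.cluster         position 1: n0.cluster
   position 2: n0.cluster         position 2: n1.cluster
   position 3: n1.cluster         position 3: n0.cluster
   position 4: n1.cluster         position 4: n1.cluster

Torque still associates position 2 with n0.cluster, so the daemon you meant for n1.cluster lands on n0.cluster.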

They have a good reason for doing things this way. It allows you to launch a 
process against each launch ID, and the pattern will reflect the original qsub 
request in what we would call a map-by slot round-robin mode. This maximizes 
the use of shared memory, and is expected to provide good performance for a 
range of apps.

Lesson to be learned: never, ever muddle around with a system-generated file. 
If you want to modify where things go, then use one or more of the mpirun 
options to do so. We give you lots and lots of knobs for just that reason.
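
For example (illustrative only - check mpirun's help for the exact option names in your release):

   mpirun --map-by node ./a.out        # round-robin ranks across nodes instead of packing by slot
   mpirun -npernode 1 ./a.out          # one rank per allocated node
   mpirun --host n0,n1 -np 2 ./a.out   # restrict the run to a subset of the allocation

All of these work within the Torque allocation and never require touching $PBS_NODEFILE.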



> On Sep 7, 2016, at 10:53 PM, Gilles Gouaillardet  wrote:
> 
> Ralph,
> 
> 
> there might be an issue within Open MPI.
> 
> 
> on the cluster i used, hostname returns the FQDN, and $PBS_NODEFILE uses the 
> FQDN too.
> 
> my $PBS_NODEFILE has one line per task, and it is ordered
> 
> e.g.
> 
> n0.cluster
> 
> n0.cluster
> 
> n1.cluster
> 
> n1.cluster
> 
> 
> in my torque script, i rewrote the machinefile like this
> 
> n0.cluster
> 
> n1.cluster
> 
> n0.cluster
> 
> n1.cluster
> 
> and updated the PBS environment variable to point to my new file.
> 
> 
> then i invoked
> 
> mpirun hostname
> 
> 
> 
> in the first case, 2 tasks run on n0 and 2 tasks run on n1
> in the second case, 4 tasks run on n0, and none on n1.
> 
> so i am thinking we might not support unordered $PBS_NODEFILE.
> 
> as a reminder, the submit command was
> qsub -l nodes=3:ppn=1
> but for some reason I do not know, only two nodes were allocated (two slots on 
> the first one, one on the second one)
> and if I understand correctly, $PBS_NODEFILE was not ordered.
> (e.g. n0 n1 n0 and *not* n0 n0 n1)
> 
> i tried to reproduce this without hacking $PBS_NODEFILE, but my jobs hang in 
> the queue if only two nodes with 16 slots each are available and i request
> -l nodes=3:ppn=1
> i guess this is a different scheduler configuration, and i cannot change that.
> 
> Could you please have a look at this ?
> 
> Cheers,
> 
> Gilles
> 
> On 9/7/2016 11:15 PM, r...@open-mpi.org wrote:
>> The usual cause of this problem is that the nodename in the machinefile is 
>> given as a00551, while Torque is assigning the node name as 
>> a00551.science.domain. Thus, mpirun thinks those are two separate nodes and 
>> winds up spawning an orted on its own node.
>> 
>> You might try ensuring that your machinefile is using the exact same name as 
>> provided in your allocation
>> 
>> 
>>> On Sep 7, 2016, at 7:06 AM, Gilles Gouaillardet 
>>>  wrote:
>>> 
>>> Thanks for the logs
>>> 
>>> From what i see now, it looks like a00551 is running both mpirun and orted, 
>>> though it should only run mpirun, and orted should run only on a00553
>>> 
>>> I will check the code and see what could be happening here
>>> 
>>> Btw, what is the output of
>>> hostname
>>> hostname -f
>>> On a00551 ?
>>> 
>>> Out of curiosity, is a previous version of Open MPI (e.g. v1.10.4) 
>>> installed and running correctly on your cluster ?
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> Oswin Krause  wrote:
 Hi Gilles,
 
 Thanks for the hint with the machinefile. I know it is not equivalent
 and i do not intend to use that approach. I just wanted to know whether
 I could start the program successfully at all.
 
 Outside torque(4.2), rsh seems to be used which works fine, querying a
 password if no kerberos ticket is there
 
 Here is the output:
 [zbh251@a00551 ~]$ mpirun -V
 mpirun (Open MPI) 2.0.1
 [zbh251@a00551 ~]$ ompi_info | grep ras
 MCA ras: loadleveler (MCA v2.1.0, API v2.0.0, Component
 v2.0.1)
 MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component
 v2.0.1)
 MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component
 v2.0.1)
 MCA ras: tm (MCA v2.1.0, API v2.0.0, Component v2.0.1)
 [zbh251@a00551 ~]$ mpirun --mca plm_base_verbose 10 --tag-output
 -display-map hostname
 [a00551.science.domain:04104] mca: base: components_register:
 registering framework plm components
 [a00551.science.domain:04104] mca: base: components_register: found
 loaded component isolated
 [a00551.science.domain:04104] mca: base: components_register: component
 is

Re: [OMPI users] Unable to mpirun from within torque

2016-09-08 Thread r...@open-mpi.org
If you are correctly analyzing things, then there would be an issue in the 
code. When we get an allocation from a resource manager, we set a flag 
indicating that it is “gospel” - i.e., that we do not directly sense the number 
of cores on a node and set the #slots equal to that value. Instead, we take the 
RM-provided allocation as ultimate truth.

This should be true even if you add a machinefile, as the machinefile is only 
used to “filter” the nodelist provided by the RM. It shouldn’t cause the #slots 
to be modified.
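
A sketch of what that means in practice (hostnames reused from the earlier example; the -machinefile spelling may vary by release): with a Torque allocation covering n0.cluster and n1.cluster,

   mpirun -machinefile myhosts hostname

where myhosts contains only n0.cluster should launch everything on n0.cluster, while n0.cluster keeps whatever slot count Torque reported - the machinefile selects nodes from the allocation, it does not redefine them.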

Taking a quick glance at the v2.x code, it looks to me like all is being done 
correctly. Again, output from a debug build would resolve that question


> On Sep 7, 2016, at 10:56 PM, Gilles Gouaillardet  wrote:
> 
> Oswin,
> 
> 
> unfortunately some important info is missing.
> 
> i guess the root cause is Open MPI was not configure'd with --enable-debug
> 
> 
> could you please update your torque script and simply add the following 
> snippet before invoking mpirun
> 
> 
> echo PBS_NODEFILE
> 
> cat $PBS_NODEFILE
> 
> echo ---
> 
> 
> as i wrote in an other email, i suspect hosts are not ordered (and i'd like 
> to confirm that) and Open MPI does not handle that correctly
> 
> 
> Cheers,
> 
> 
> Gilles
> 
> On 9/7/2016 10:25 PM, Oswin Krause wrote:
>> Hi Gilles,
>> 
>> Thanks for the hint with the machinefile. I know it is not equivalent and i 
>> do not intend to use that approach. I just wanted to know whether I could 
>> start the program successfully at all.
>> 
>> Outside torque(4.2), rsh seems to be used which works fine, querying a 
>> password if no kerberos ticket is there
>> 
>> Here is the output:
>> [zbh251@a00551 ~]$ mpirun -V
>> mpirun (Open MPI) 2.0.1
>> [zbh251@a00551 ~]$ ompi_info | grep ras
>> MCA ras: loadleveler (MCA v2.1.0, API v2.0.0, Component 
>> v2.0.1)
>> MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>> MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>> MCA ras: tm (MCA v2.1.0, API v2.0.0, Component v2.0.1)
>> [zbh251@a00551 ~]$ mpirun --mca plm_base_verbose 10 --tag-output 
>> -display-map hostname
>> [a00551.science.domain:04104] mca: base: components_register: registering 
>> framework plm components
>> [a00551.science.domain:04104] mca: base: components_register: found loaded 
>> component isolated
>> [a00551.science.domain:04104] mca: base: components_register: component 
>> isolated has no register or open function
>> [a00551.science.domain:04104] mca: base: components_register: found loaded 
>> component rsh
>> [a00551.science.domain:04104] mca: base: components_register: component rsh 
>> register function successful
>> [a00551.science.domain:04104] mca: base: components_register: found loaded 
>> component slurm
>> [a00551.science.domain:04104] mca: base: components_register: component 
>> slurm register function successful
>> [a00551.science.domain:04104] mca: base: components_register: found loaded 
>> component tm
>> [a00551.science.domain:04104] mca: base: components_register: component tm 
>> register function successful
>> [a00551.science.domain:04104] mca: base: components_open: opening plm 
>> components
>> [a00551.science.domain:04104] mca: base: components_open: found loaded 
>> component isolated
>> [a00551.science.domain:04104] mca: base: components_open: component isolated 
>> open function successful
>> [a00551.science.domain:04104] mca: base: components_open: found loaded 
>> component rsh
>> [a00551.science.domain:04104] mca: base: components_open: component rsh open 
>> function successful
>> [a00551.science.domain:04104] mca: base: components_open: found loaded 
>> component slurm
>> [a00551.science.domain:04104] mca: base: components_open: component slurm 
>> open function successful
>> [a00551.science.domain:04104] mca: base: components_open: found loaded 
>> component tm
>> [a00551.science.domain:04104] mca: base: components_open: component tm open 
>> function successful
>> [a00551.science.domain:04104] mca:base:select: Auto-selecting plm components
>> [a00551.science.domain:04104] mca:base:select:(  plm) Querying component 
>> [isolated]
>> [a00551.science.domain:04104] mca:base:select:(  plm) Query of component 
>> [isolated] set priority to 0
>> [a00551.science.domain:04104] mca:base:select:(  plm) Querying component 
>> [rsh]
>> [a00551.science.domain:04104] mca:base:select:(  plm) Query of component 
>> [rsh] set priority to 10
>> [a00551.science.domain:04104] mca:base:select:(  plm) Querying component 
>> [slurm]
>> [a00551.science.domain:04104] mca:base:select:(  plm) Querying component [tm]
>> [a00551.science.domain:04104] mca:base:select:(  plm) Query of component 
>> [tm] set priority to 75
>> [a00551.science.domain:04104] mca:base:select:(  plm) Selected component [tm]
>> [a00551.science.domain:04104] mca: base: close: component isolated closed
>> [a00551.science.domain:04104] mca: base: close: unloading component isolated
>> [a00

Re: [OMPI users] Unable to mpirun from within torque

2016-09-08 Thread Oswin Krause

Hi,

Thanks for all the hints. Only issue is: this is the file generated by 
torque. Torque - or at least the torque 4.2 provided by my redhat 
version - gives me an unordered file.

Should I rebuild torque?

Best,
Oswin

I am currently rebuilding the package with --enable-debug.
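
(For reference, a debug build of Open MPI against Torque looks something like this - the prefix and tm path below are illustrative, not my exact paths:

   ./configure --prefix=$HOME/openmpi-2.0.1-dbg --enable-debug --with-tm=/usr
   make -j 8 install

--with-tm points at the Torque installation so that the tm components get built.)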

On 2016-09-08 09:57, r...@open-mpi.org wrote:

Someone has done some work there since I last did, but I can see the
issue. Torque indeed always provides an ordered file - the only way
you can get an unordered one is for someone to edit it, and that is
forbidden - i.e., you get what you deserve because you are messing
around with a system-defined file :-)

The problem is that Torque internally assigns a “launch ID” which is
just the integer position of the nodename in the PBS_NODEFILE. So if
you modify that position, then we get the wrong index - and everything
goes down the drain from there. In your example, n1.cluster changed
index from 3 to 2 because of your edit. Torque thinks that index 2 is
just another reference to n0.cluster, and so we merrily launch a
daemon onto the wrong node.

They have a good reason for doing things this way. It allows you to
launch a process against each launch ID, and the pattern will reflect
the original qsub request in what we would call a map-by slot
round-robin mode. This maximizes the use of shared memory, and is
expected to provide good performance for a range of apps.

Lesson to be learned: never, ever muddle around with a
system-generated file. If you want to modify where things go, then use
one or more of the mpirun options to do so. We give you lots and lots
of knobs for just that reason.



On Sep 7, 2016, at 10:53 PM, Gilles Gouaillardet  
wrote:


Ralph,


there might be an issue within Open MPI.


on the cluster i used, hostname returns the FQDN, and $PBS_NODEFILE 
uses the FQDN too.


my $PBS_NODEFILE has one line per task, and it is ordered

e.g.

n0.cluster

n0.cluster

n1.cluster

n1.cluster


in my torque script, i rewrote the machinefile like this

n0.cluster

n1.cluster

n0.cluster

n1.cluster

and updated the PBS environment variable to point to my new file.


then i invoked

mpirun hostname



in the first case, 2 tasks run on n0 and 2 tasks run on n1
in the second case, 4 tasks run on n0, and none on n1.

so i am thinking we might not support unordered $PBS_NODEFILE.

as a reminder, the submit command was
qsub -l nodes=3:ppn=1
but for some reason I do not know, only two nodes were allocated (two 
slots on the first one, one on the second one)

and if I understand correctly, $PBS_NODEFILE was not ordered.
(e.g. n0 n1 n0 and *not* n0 n0 n1)

i tried to reproduce this without hacking $PBS_NODEFILE, but my jobs 
hang in the queue if only two nodes with 16 slots each are available 
and i request

-l nodes=3:ppn=1
i guess this is a different scheduler configuration, and i cannot 
change that.


Could you please have a look at this ?

Cheers,

Gilles

On 9/7/2016 11:15 PM, r...@open-mpi.org wrote:
The usual cause of this problem is that the nodename in the 
machinefile is given as a00551, while Torque is assigning the node 
name as a00551.science.domain. Thus, mpirun thinks those are two 
separate nodes and winds up spawning an orted on its own node.


You might try ensuring that your machinefile is using the exact same 
name as provided in your allocation



On Sep 7, 2016, at 7:06 AM, Gilles Gouaillardet 
 wrote:


Thanks for the logs

From what i see now, it looks like a00551 is running both mpirun and 
orted, though it should only run mpirun, and orted should run only 
on a00553


I will check the code and see what could be happening here

Btw, what is the output of
hostname
hostname -f
On a00551 ?

Out of curiosity, is a previous version of Open MPI (e.g. v1.10.4) 
installed and running correctly on your cluster ?


Cheers,

Gilles

Oswin Krause  wrote:

Hi Gilles,

Thanks for the hint with the machinefile. I know it is not 
equivalent
and i do not intend to use that approach. I just wanted to know 
whether

I could start the program successfully at all.

Outside torque(4.2), rsh seems to be used which works fine, 
querying a

password if no kerberos ticket is there

Here is the output:
[zbh251@a00551 ~]$ mpirun -V
mpirun (Open MPI) 2.0.1
[zbh251@a00551 ~]$ ompi_info | grep ras
MCA ras: loadleveler (MCA v2.1.0, API v2.0.0, 
Component

v2.0.1)
MCA ras: simulator (MCA v2.1.0, API v2.0.0, 
Component

v2.0.1)
MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component
v2.0.1)
MCA ras: tm (MCA v2.1.0, API v2.0.0, Component 
v2.0.1)

[zbh251@a00551 ~]$ mpirun --mca plm_base_verbose 10 --tag-output
-display-map hostname
[a00551.science.domain:04104] mca: base: components_register:
registering framework plm components
[a00551.science.domain:04104] mca: base: components_register: found
loaded component isolated
[a00551.science.domain:04104] mca: base: components_register: 
component

isolated has no register or open function
[a00551.s

Re: [OMPI users] Unable to mpirun from within torque

2016-09-08 Thread Gilles Gouaillardet

Ralph,


i am not sure i am reading you correctly, so let me clarify.


I did not hack $PBS_NODEFILE for fun or profit; I was simply trying to 
reproduce an issue I could not reproduce otherwise.


/* my job submitted with -l nodes=3:ppn=1 does not start if there are only 
two nodes available, whereas the same user job starts on two nodes */

Thanks for the explanation of the Torque internals; my hack was 
incomplete and not a valid one, I do acknowledge it.



i re-read the email that started this thread and i found the information 
i was looking for




echo $PBS_NODEFILE
/var/lib/torque/aux//278.a00552.science.domain
cat $PBS_NODEFILE
a00551.science.domain
a00553.science.domain
a00551.science.domain 



So, assuming the end user did not edit his $PBS_NODEFILE, and Torque is 
correctly configured and not busted, then



Torque indeed always provides an ordered file - the only way you can get an 
unordered one is for someone to edit it

might be updated to

"Torque used to always provide an ordered file, but recent versions 
might not do that."



makes sense ?
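
(For example, a quick way to check whether a given $PBS_NODEFILE is grouped by host - an illustrative shell one-liner, nothing more:

   test $(uniq $PBS_NODEFILE | wc -l) -eq $(sort -u $PBS_NODEFILE | wc -l) && echo ordered || echo unordered

uniq only collapses adjacent duplicates, so the two counts match exactly when every host's entries are contiguous.)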


Cheers,

Gilles


On 9/8/2016 4:57 PM, r...@open-mpi.org wrote:

Someone has done some work there since I last did, but I can see the issue. 
Torque indeed always provides an ordered file - the only way you can get an 
unordered one is for someone to edit it, and that is forbidden - i.e., you get 
what you deserve because you are messing around with a system-defined file :-)

The problem is that Torque internally assigns a “launch ID” which is just the 
integer position of the nodename in the PBS_NODEFILE. So if you modify that 
position, then we get the wrong index - and everything goes down the drain from 
there. In your example, n1.cluster changed index from 3 to 2 because of your 
edit. Torque thinks that index 2 is just another reference to n0.cluster, and 
so we merrily launch a daemon onto the wrong node.

They have a good reason for doing things this way. It allows you to launch a 
process against each launch ID, and the pattern will reflect the original qsub 
request in what we would call a map-by slot round-robin mode. This maximizes 
the use of shared memory, and is expected to provide good performance for a 
range of apps.

Lesson to be learned: never, ever muddle around with a system-generated file. 
If you want to modify where things go, then use one or more of the mpirun 
options to do so. We give you lots and lots of knobs for just that reason.




On Sep 7, 2016, at 10:53 PM, Gilles Gouaillardet  wrote:

Ralph,


there might be an issue within Open MPI.


on the cluster i used, hostname returns the FQDN, and $PBS_NODEFILE uses the 
FQDN too.

my $PBS_NODEFILE has one line per task, and it is ordered

e.g.

n0.cluster

n0.cluster

n1.cluster

n1.cluster


in my torque script, i rewrote the machinefile like this

n0.cluster

n1.cluster

n0.cluster

n1.cluster

and updated the PBS environment variable to point to my new file.


then i invoked

mpirun hostname



in the first case, 2 tasks run on n0 and 2 tasks run on n1
in the second case, 4 tasks run on n0, and none on n1.

so i am thinking we might not support unordered $PBS_NODEFILE.

as a reminder, the submit command was
qsub -l nodes=3:ppn=1
but for some reason I do not know, only two nodes were allocated (two slots on the 
first one, one on the second one)
and if I understand correctly, $PBS_NODEFILE was not ordered.
(e.g. n0 n1 n0 and *not* n0 n0 n1)

i tried to reproduce this without hacking $PBS_NODEFILE, but my jobs hang in 
the queue if only two nodes with 16 slots each are available and i request
-l nodes=3:ppn=1
i guess this is a different scheduler configuration, and i cannot change that.

Could you please have a look at this ?

Cheers,

Gilles

On 9/7/2016 11:15 PM, r...@open-mpi.org wrote:

The usual cause of this problem is that the nodename in the machinefile is 
given as a00551, while Torque is assigning the node name as 
a00551.science.domain. Thus, mpirun thinks those are two separate nodes and 
winds up spawning an orted on its own node.

You might try ensuring that your machinefile is using the exact same name as 
provided in your allocation



On Sep 7, 2016, at 7:06 AM, Gilles Gouaillardet  
wrote:

Thanks for the logs

 From what i see now, it looks like a00551 is running both mpirun and orted, 
though it should only run mpirun, and orted should run only on a00553

I will check the code and see what could be happening here

Btw, what is the output of
hostname
hostname -f
On a00551 ?

Out of curiosity, is a previous version of Open MPI (e.g. v1.10.4) installed 
and running correctly on your cluster ?

Cheers,

Gilles

Oswin Krause  wrote:

Hi Gilles,

Thanks for the hint with the machinefile. I know it is not equivalent
and i do not intend to use that approach. I just wanted to know whether
I could start the program successfully at all.

Outside torque(4.2), rsh seems to be used which works fine, querying a
password if no kerberos ticket is there

Here is the

Re: [OMPI users] Unable to mpirun from within torque

2016-09-08 Thread Gilles Gouaillardet

Oswin,


that might be off topic and/or premature ...

PBS Pro has been made free (and open source too) and is available at 
http://www.pbspro.org/


this is something you might be interested in (unless you are using 
torque because of the MOAB scheduler),


and it might be friendlier to Open MPI (e.g. it may always generate an 
ordered $PBS_NODEFILE)



Cheers,


Gilles


On 9/8/2016 5:09 PM, Oswin Krause wrote:

Hi,

Thanks for all the hints. Only issue is: this is the file generated by 
torque. Torque - or at least the torque 4.2 provided by my redhat 
version - gives me an unordered file.

Should I rebuild torque?

Best,
Oswin

I am currently rebuilding the package with --enable-debug.

On 2016-09-08 09:57, r...@open-mpi.org wrote:

Someone has done some work there since I last did, but I can see the
issue. Torque indeed always provides an ordered file - the only way
you can get an unordered one is for someone to edit it, and that is
forbidden - i.e., you get what you deserve because you are messing
around with a system-defined file :-)

The problem is that Torque internally assigns a “launch ID” which is
just the integer position of the nodename in the PBS_NODEFILE. So if
you modify that position, then we get the wrong index - and everything
goes down the drain from there. In your example, n1.cluster changed
index from 3 to 2 because of your edit. Torque thinks that index 2 is
just another reference to n0.cluster, and so we merrily launch a
daemon onto the wrong node.

They have a good reason for doing things this way. It allows you to
launch a process against each launch ID, and the pattern will reflect
the original qsub request in what we would call a map-by slot
round-robin mode. This maximizes the use of shared memory, and is
expected to provide good performance for a range of apps.

Lesson to be learned: never, ever muddle around with a
system-generated file. If you want to modify where things go, then use
one or more of the mpirun options to do so. We give you lots and lots
of knobs for just that reason.



On Sep 7, 2016, at 10:53 PM, Gilles Gouaillardet  
wrote:


Ralph,


there might be an issue within Open MPI.


on the cluster i used, hostname returns the FQDN, and $PBS_NODEFILE 
uses the FQDN too.


my $PBS_NODEFILE has one line per task, and it is ordered

e.g.

n0.cluster

n0.cluster

n1.cluster

n1.cluster


in my torque script, i rewrote the machinefile like this

n0.cluster

n1.cluster

n0.cluster

n1.cluster

and updated the PBS environment variable to point to my new file.


then i invoked

mpirun hostname



in the first case, 2 tasks run on n0 and 2 tasks run on n1
in the second case, 4 tasks run on n0, and none on n1.

so i am thinking we might not support unordered $PBS_NODEFILE.

as a reminder, the submit command was
qsub -l nodes=3:ppn=1
but for some reason I do not know, only two nodes were allocated (two 
slots on the first one, one on the second one)

and if I understand correctly, $PBS_NODEFILE was not ordered.
(e.g. n0 n1 n0 and *not* n0 n0 n1)

i tried to reproduce this without hacking $PBS_NODEFILE, but my jobs 
hang in the queue if only two nodes with 16 slots each are available 
and i request

-l nodes=3:ppn=1
i guess this is a different scheduler configuration, and i cannot 
change that.


Could you please have a look at this ?

Cheers,

Gilles

On 9/7/2016 11:15 PM, r...@open-mpi.org wrote:
The usual cause of this problem is that the nodename in the 
machinefile is given as a00551, while Torque is assigning the node 
name as a00551.science.domain. Thus, mpirun thinks those are two 
separate nodes and winds up spawning an orted on its own node.


You might try ensuring that your machinefile is using the exact 
same name as provided in your allocation



On Sep 7, 2016, at 7:06 AM, Gilles Gouaillardet 
 wrote:


Thanks for the logs

From what i see now, it looks like a00551 is running both mpirun 
and orted, though it should only run mpirun, and orted should run 
only on a00553


I will check the code and see what could be happening here

Btw, what is the output of
hostname
hostname -f
On a00551 ?

Out of curiosity, is a previous version of Open MPI (e.g. v1.10.4) 
installed and running correctly on your cluster ?


Cheers,

Gilles

Oswin Krause  wrote:

Hi Gilles,

Thanks for the hint with the machinefile. I know it is not 
equivalent
and i do not intend to use that approach. I just wanted to know 
whether

I could start the program successfully at all.

Outside torque(4.2), rsh seems to be used which works fine, 
querying a

password if no kerberos ticket is there

Here is the output:
[zbh251@a00551 ~]$ mpirun -V
mpirun (Open MPI) 2.0.1
[zbh251@a00551 ~]$ ompi_info | grep ras
MCA ras: loadleveler (MCA v2.1.0, API v2.0.0, 
Component

v2.0.1)
MCA ras: simulator (MCA v2.1.0, API v2.0.0, 
Component

v2.0.1)
MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component
v2.0.1)
MCA ras: tm (MCA v2.1.0, API v2.0.0, Component 
v2.0

Re: [OMPI users] Unable to mpirun from within torque

2016-09-08 Thread Oswin Krause

Hi Gilles, Hi Ralph,

I have just rebuilt Open MPI; there is quite a lot more information now. As I 
said, I did not tinker with the PBS_NODEFILE. I think the issue might be NUMA 
here. I can try to go through the process, reconfigure to non-NUMA, 
and see whether this works. The issue might be that the node allocation 
looks like this:


a00551.science.domain-0
a00552.science.domain-0
a00551.science.domain-1

and the last part then gets shortened, which leads to the issue. Not sure 
whether this makes sense, but this is my explanation.
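
I can double-check what Torque actually assigned with something like (standard Torque commands; the exact output format may differ):

   qstat -f $PBS_JOBID | grep exec_host
   pbsnodes a00551 a00553

If exec_host lists a00551 twice, that would fit the NUMA theory above.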


Here is the output:
$PBS_NODEFILE
/var/lib/torque/aux//285.a00552.science.domain
PBS_NODEFILE
a00551.science.domain
a00553.science.domain
a00551.science.domain
-
[a00551.science.domain:16986] mca: base: components_register: 
registering framework plm components
[a00551.science.domain:16986] mca: base: components_register: found 
loaded component isolated
[a00551.science.domain:16986] mca: base: components_register: component 
isolated has no register or open function
[a00551.science.domain:16986] mca: base: components_register: found 
loaded component rsh
[a00551.science.domain:16986] mca: base: components_register: component 
rsh register function successful
[a00551.science.domain:16986] mca: base: components_register: found 
loaded component slurm
[a00551.science.domain:16986] mca: base: components_register: component 
slurm register function successful
[a00551.science.domain:16986] mca: base: components_register: found 
loaded component tm
[a00551.science.domain:16986] mca: base: components_register: component 
tm register function successful
[a00551.science.domain:16986] mca: base: components_open: opening plm 
components
[a00551.science.domain:16986] mca: base: components_open: found loaded 
component isolated
[a00551.science.domain:16986] mca: base: components_open: component 
isolated open function successful
[a00551.science.domain:16986] mca: base: components_open: found loaded 
component rsh
[a00551.science.domain:16986] mca: base: components_open: component rsh 
open function successful
[a00551.science.domain:16986] mca: base: components_open: found loaded 
component slurm
[a00551.science.domain:16986] mca: base: components_open: component 
slurm open function successful
[a00551.science.domain:16986] mca: base: components_open: found loaded 
component tm
[a00551.science.domain:16986] mca: base: components_open: component tm 
open function successful
[a00551.science.domain:16986] mca:base:select: Auto-selecting plm 
components
[a00551.science.domain:16986] mca:base:select:(  plm) Querying component 
[isolated]
[a00551.science.domain:16986] mca:base:select:(  plm) Query of component 
[isolated] set priority to 0
[a00551.science.domain:16986] mca:base:select:(  plm) Querying component 
[rsh]
[a00551.science.domain:16986] [[INVALID],INVALID] plm:rsh_lookup on 
agent ssh : rsh path NULL
[a00551.science.domain:16986] mca:base:select:(  plm) Query of component 
[rsh] set priority to 10
[a00551.science.domain:16986] mca:base:select:(  plm) Querying component 
[slurm]
[a00551.science.domain:16986] mca:base:select:(  plm) Querying component 
[tm]
[a00551.science.domain:16986] mca:base:select:(  plm) Query of component 
[tm] set priority to 75
[a00551.science.domain:16986] mca:base:select:(  plm) Selected component 
[tm]
[a00551.science.domain:16986] mca: base: close: component isolated 
closed
[a00551.science.domain:16986] mca: base: close: unloading component 
isolated

[a00551.science.domain:16986] mca: base: close: component rsh closed
[a00551.science.domain:16986] mca: base: close: unloading component rsh
[a00551.science.domain:16986] mca: base: close: component slurm closed
[a00551.science.domain:16986] mca: base: close: unloading component 
slurm
[a00551.science.domain:16986] plm:base:set_hnp_name: initial bias 16986 
nodename hash 2226275586

[a00551.science.domain:16986] plm:base:set_hnp_name: final jobfam 33770
[a00551.science.domain:16986] [[33770,0],0] plm:base:receive start comm
[a00551.science.domain:16986] [[33770,0],0] plm:base:setup_job
[a00551.science.domain:16986] [[33770,0],0] plm:base:setup_vm
[a00551.science.domain:16986] [[33770,0],0] plm:base:setup_vm creating 
map
[a00551.science.domain:16986] [[33770,0],0] plm:base:setup_vm add new 
daemon [[33770,0],1]
[a00551.science.domain:16986] [[33770,0],0] plm:base:setup_vm assigning 
new daemon [[33770,0],1] to node a00553.science.domain

[a00551.science.domain:16986] [[33770,0],0] plm:tm: launching vm
[a00551.science.domain:16986] [[33770,0],0] plm:tm: final top-level 
argv:
	orted --hnp-topo-sig 2N:2S:2L3:20L2:20L1:20C:40H:x86_64 -mca ess tm 
-mca ess_base_jobid 2213150720 -mca ess_base_vpid  -mca 
ess_base_num_procs 2 -mca orte_hnp_uri 
2213150720.0;usock;tcp://130.226.12.194:53397;tcp6://[fe80::225:90ff:feeb:f6d5]:42821 
--mca plm_base_verbose 10
[a00551.science.domain:16986] [[33770,0],0] plm:tm: launching on node 
a00553.science.domain

[a00551.science.domain:16986] [[33770,0],0] plm:tm: executing:
	orted

Re: [OMPI users] Unable to mpirun from within torque

2016-09-08 Thread Oswin Krause

Hi,

I reconfigured so that each machine now counts as only one physical node. Still no 
success, but the nodefile now looks better. I still get the errors:


[a00551.science.domain:18021] [[34768,0],1] bind() failed on error 
Address already in use (98)
[a00551.science.domain:18021] [[34768,0],1] ORTE_ERROR_LOG: Error in 
file oob_usock_component.c at line 228
[a00551.science.domain:18022] [[34768,0],2] bind() failed on error 
Address already in use (98)
[a00551.science.domain:18022] [[34768,0],2] ORTE_ERROR_LOG: Error in 
file oob_usock_component.c at line 228


(btw: for some reason the bind errors were missing. sorry!)

PBS_NODEFILE
a00551.science.domain
a00554.science.domain
a00553.science.domain
---
mpirun --mca plm_base_verbose 10 --tag-output -display-map hostname
[a00551.science.domain:18097] mca: base: components_register: 
registering framework plm components
[a00551.science.domain:18097] mca: base: components_register: found 
loaded component isolated
[a00551.science.domain:18097] mca: base: components_register: component 
isolated has no register or open function
[a00551.science.domain:18097] mca: base: components_register: found 
loaded component rsh
[a00551.science.domain:18097] mca: base: components_register: component 
rsh register function successful
[a00551.science.domain:18097] mca: base: components_register: found 
loaded component slurm
[a00551.science.domain:18097] mca: base: components_register: component 
slurm register function successful
[a00551.science.domain:18097] mca: base: components_register: found 
loaded component tm
[a00551.science.domain:18097] mca: base: components_register: component 
tm register function successful
[a00551.science.domain:18097] mca: base: components_open: opening plm 
components
[a00551.science.domain:18097] mca: base: components_open: found loaded 
component isolated
[a00551.science.domain:18097] mca: base: components_open: component 
isolated open function successful
[a00551.science.domain:18097] mca: base: components_open: found loaded 
component rsh
[a00551.science.domain:18097] mca: base: components_open: component rsh 
open function successful
[a00551.science.domain:18097] mca: base: components_open: found loaded 
component slurm
[a00551.science.domain:18097] mca: base: components_open: component 
slurm open function successful
[a00551.science.domain:18097] mca: base: components_open: found loaded 
component tm
[a00551.science.domain:18097] mca: base: components_open: component tm 
open function successful
[a00551.science.domain:18097] mca:base:select: Auto-selecting plm 
components
[a00551.science.domain:18097] mca:base:select:(  plm) Querying component 
[isolated]
[a00551.science.domain:18097] mca:base:select:(  plm) Query of component 
[isolated] set priority to 0
[a00551.science.domain:18097] mca:base:select:(  plm) Querying component 
[rsh]
[a00551.science.domain:18097] [[INVALID],INVALID] plm:rsh_lookup on 
agent ssh : rsh path NULL
[a00551.science.domain:18097] mca:base:select:(  plm) Query of component 
[rsh] set priority to 10
[a00551.science.domain:18097] mca:base:select:(  plm) Querying component 
[slurm]
[a00551.science.domain:18097] mca:base:select:(  plm) Querying component 
[tm]
[a00551.science.domain:18097] mca:base:select:(  plm) Query of component 
[tm] set priority to 75
[a00551.science.domain:18097] mca:base:select:(  plm) Selected component 
[tm]
[a00551.science.domain:18097] mca: base: close: component isolated 
closed
[a00551.science.domain:18097] mca: base: close: unloading component 
isolated

[a00551.science.domain:18097] mca: base: close: component rsh closed
[a00551.science.domain:18097] mca: base: close: unloading component rsh
[a00551.science.domain:18097] mca: base: close: component slurm closed
[a00551.science.domain:18097] mca: base: close: unloading component 
slurm
[a00551.science.domain:18097] plm:base:set_hnp_name: initial bias 18097 
nodename hash 2226275586

[a00551.science.domain:18097] plm:base:set_hnp_name: final jobfam 34561
[a00551.science.domain:18097] [[34561,0],0] plm:base:receive start comm
[a00551.science.domain:18097] [[34561,0],0] plm:base:setup_job
[a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm
[a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm creating 
map
[a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm add new 
daemon [[34561,0],1]
[a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm assigning 
new daemon [[34561,0],1] to node a00554.science.domain
[a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm add new 
daemon [[34561,0],2]
[a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm assigning 
new daemon [[34561,0],2] to node a00553.science.domain

[a00551.science.domain:18097] [[34561,0],0] plm:tm: launching vm
[a00551.science.domain:18097] [[34561,0],0] plm:tm: final top-level 
argv:
	orted --hnp-topo-sig 2N:2S:2L3:20L2:20L1:20C:40H:x86_64 -mca ess tm 
-mca ess_base_jobid 2264989696 -mca ess_base_vpid  -mca 
ess_base_num_procs 3

Re: [OMPI users] Unable to mpirun from within torque

2016-09-08 Thread Gilles Gouaillardet

Oswin,


can you please run again (one task per physical node) with

mpirun --mca ess_base_verbose 10 --mca plm_base_verbose 10 --mca 
ras_base_verbose 10 hostname
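
also, since the bind() errors suggest two orted daemons ended up on the same node, you could check that directly on a00551 while the job is running, for example (assuming standard tools):

   pgrep -fl orted

two orted processes there would confirm that both remote daemons landed on the mpirun node.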



Cheers,


Gilles


On 9/8/2016 6:42 PM, Oswin Krause wrote:

Hi,

i reconfigured to only have one physical node. Still no success, but 
the nodefile now looks better. I still get the errors:


[a00551.science.domain:18021] [[34768,0],1] bind() failed on error 
Address already in use (98)
[a00551.science.domain:18021] [[34768,0],1] ORTE_ERROR_LOG: Error in 
file oob_usock_component.c at line 228
[a00551.science.domain:18022] [[34768,0],2] bind() failed on error 
Address already in use (98)
[a00551.science.domain:18022] [[34768,0],2] ORTE_ERROR_LOG: Error in 
file oob_usock_component.c at line 228


(btw: for some reason the bind errors were missing. sorry!)

PBS_NODEFILE
a00551.science.domain
a00554.science.domain
a00553.science.domain
---
mpirun --mca plm_base_verbose 10 --tag-output -display-map hostname
[a00551.science.domain:18097] mca: base: components_register: 
registering framework plm components
[a00551.science.domain:18097] mca: base: components_register: found 
loaded component isolated
[a00551.science.domain:18097] mca: base: components_register: 
component isolated has no register or open function
[a00551.science.domain:18097] mca: base: components_register: found 
loaded component rsh
[a00551.science.domain:18097] mca: base: components_register: 
component rsh register function successful
[a00551.science.domain:18097] mca: base: components_register: found 
loaded component slurm
[a00551.science.domain:18097] mca: base: components_register: 
component slurm register function successful
[a00551.science.domain:18097] mca: base: components_register: found 
loaded component tm
[a00551.science.domain:18097] mca: base: components_register: 
component tm register function successful
[a00551.science.domain:18097] mca: base: components_open: opening plm 
components
[a00551.science.domain:18097] mca: base: components_open: found loaded 
component isolated
[a00551.science.domain:18097] mca: base: components_open: component 
isolated open function successful
[a00551.science.domain:18097] mca: base: components_open: found loaded 
component rsh
[a00551.science.domain:18097] mca: base: components_open: component 
rsh open function successful
[a00551.science.domain:18097] mca: base: components_open: found loaded 
component slurm
[a00551.science.domain:18097] mca: base: components_open: component 
slurm open function successful
[a00551.science.domain:18097] mca: base: components_open: found loaded 
component tm
[a00551.science.domain:18097] mca: base: components_open: component tm 
open function successful
[a00551.science.domain:18097] mca:base:select: Auto-selecting plm 
components
[a00551.science.domain:18097] mca:base:select:(  plm) Querying 
component [isolated]
[a00551.science.domain:18097] mca:base:select:(  plm) Query of 
component [isolated] set priority to 0
[a00551.science.domain:18097] mca:base:select:(  plm) Querying 
component [rsh]
[a00551.science.domain:18097] [[INVALID],INVALID] plm:rsh_lookup on 
agent ssh : rsh path NULL
[a00551.science.domain:18097] mca:base:select:(  plm) Query of 
component [rsh] set priority to 10
[a00551.science.domain:18097] mca:base:select:(  plm) Querying 
component [slurm]
[a00551.science.domain:18097] mca:base:select:(  plm) Querying 
component [tm]
[a00551.science.domain:18097] mca:base:select:(  plm) Query of 
component [tm] set priority to 75
[a00551.science.domain:18097] mca:base:select:(  plm) Selected 
component [tm]

[a00551.science.domain:18097] mca: base: close: component isolated closed
[a00551.science.domain:18097] mca: base: close: unloading component 
isolated

[a00551.science.domain:18097] mca: base: close: component rsh closed
[a00551.science.domain:18097] mca: base: close: unloading component rsh
[a00551.science.domain:18097] mca: base: close: component slurm closed
[a00551.science.domain:18097] mca: base: close: unloading component slurm
[a00551.science.domain:18097] plm:base:set_hnp_name: initial bias 
18097 nodename hash 2226275586

[a00551.science.domain:18097] plm:base:set_hnp_name: final jobfam 34561
[a00551.science.domain:18097] [[34561,0],0] plm:base:receive start comm
[a00551.science.domain:18097] [[34561,0],0] plm:base:setup_job
[a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm
[a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm creating 
map
[a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm add new 
daemon [[34561,0],1]
[a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm 
assigning new daemon [[34561,0],1] to node a00554.science.domain
[a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm add new 
daemon [[34561,0],2]
[a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm 
assigning new daemon [[34561,0],2] to node a00553.science.domain

[a00551.science.domain:18097] [[34561,0],0] plm:tm: launchin

Re: [OMPI users] Unable to mpirun from within torque

2016-09-08 Thread Oswin Krause

Hi Gilles,

There you go:

[zbh251@a00551 ~]$ cat $PBS_NODEFILE
a00551.science.domain
a00554.science.domain
a00553.science.domain
[zbh251@a00551 ~]$ mpirun --mca ess_base_verbose 10 --mca 
plm_base_verbose 10 --mca ras_base_verbose 10 hostname
[a00551.science.domain:18889] mca: base: components_register: 
registering framework ess components
[a00551.science.domain:18889] mca: base: components_register: found 
loaded component pmi
[a00551.science.domain:18889] mca: base: components_register: component 
pmi has no register or open function
[a00551.science.domain:18889] mca: base: components_register: found 
loaded component tool
[a00551.science.domain:18889] mca: base: components_register: component 
tool has no register or open function
[a00551.science.domain:18889] mca: base: components_register: found 
loaded component env
[a00551.science.domain:18889] mca: base: components_register: component 
env has no register or open function
[a00551.science.domain:18889] mca: base: components_register: found 
loaded component hnp
[a00551.science.domain:18889] mca: base: components_register: component 
hnp has no register or open function
[a00551.science.domain:18889] mca: base: components_register: found 
loaded component singleton
[a00551.science.domain:18889] mca: base: components_register: component 
singleton register function successful
[a00551.science.domain:18889] mca: base: components_register: found 
loaded component slurm
[a00551.science.domain:18889] mca: base: components_register: component 
slurm has no register or open function
[a00551.science.domain:18889] mca: base: components_register: found 
loaded component tm
[a00551.science.domain:18889] mca: base: components_register: component 
tm has no register or open function
[a00551.science.domain:18889] mca: base: components_open: opening ess 
components
[a00551.science.domain:18889] mca: base: components_open: found loaded 
component pmi
[a00551.science.domain:18889] mca: base: components_open: component pmi 
open function successful
[a00551.science.domain:18889] mca: base: components_open: found loaded 
component tool
[a00551.science.domain:18889] mca: base: components_open: component tool 
open function successful
[a00551.science.domain:18889] mca: base: components_open: found loaded 
component env
[a00551.science.domain:18889] mca: base: components_open: component env 
open function successful
[a00551.science.domain:18889] mca: base: components_open: found loaded 
component hnp
[a00551.science.domain:18889] mca: base: components_open: component hnp 
open function successful
[a00551.science.domain:18889] mca: base: components_open: found loaded 
component singleton
[a00551.science.domain:18889] mca: base: components_open: component 
singleton open function successful
[a00551.science.domain:18889] mca: base: components_open: found loaded 
component slurm
[a00551.science.domain:18889] mca: base: components_open: component 
slurm open function successful
[a00551.science.domain:18889] mca: base: components_open: found loaded 
component tm
[a00551.science.domain:18889] mca: base: components_open: component tm 
open function successful
[a00551.science.domain:18889] mca:base:select: Auto-selecting ess 
components
[a00551.science.domain:18889] mca:base:select:(  ess) Querying component 
[pmi]
[a00551.science.domain:18889] mca:base:select:(  ess) Querying component 
[tool]
[a00551.science.domain:18889] mca:base:select:(  ess) Querying component 
[env]
[a00551.science.domain:18889] mca:base:select:(  ess) Querying component 
[hnp]
[a00551.science.domain:18889] mca:base:select:(  ess) Query of component 
[hnp] set priority to 100
[a00551.science.domain:18889] mca:base:select:(  ess) Querying component 
[singleton]
[a00551.science.domain:18889] mca:base:select:(  ess) Querying component 
[slurm]
[a00551.science.domain:18889] mca:base:select:(  ess) Querying component 
[tm]
[a00551.science.domain:18889] mca:base:select:(  ess) Selected component 
[hnp]

[a00551.science.domain:18889] mca: base: close: component pmi closed
[a00551.science.domain:18889] mca: base: close: unloading component pmi
[a00551.science.domain:18889] mca: base: close: component tool closed
[a00551.science.domain:18889] mca: base: close: unloading component tool
[a00551.science.domain:18889] mca: base: close: component env closed
[a00551.science.domain:18889] mca: base: close: unloading component env
[a00551.science.domain:18889] mca: base: close: component singleton 
closed
[a00551.science.domain:18889] mca: base: close: unloading component 
singleton

[a00551.science.domain:18889] mca: base: close: component slurm closed
[a00551.science.domain:18889] mca: base: close: unloading component 
slurm

[a00551.science.domain:18889] mca: base: close: component tm closed
[a00551.science.domain:18889] mca: base: close: unloading component tm
[a00551.science.domain:18889] mca: base: components_register: 
registering framework plm components
[a00551.science.domain:18889] mca: base: components_register: 

Re: [OMPI users] Unable to mpirun from within torque

2016-09-08 Thread Gilles Gouaillardet
Oswin,

So it seems that Open MPI thinks it tm_spawn()s orted on the remote nodes, but 
orted ends up running on the same node as mpirun.

On your compute nodes, can you run
ldd /.../lib/openmpi/mca_plm_tm.so

and confirm it is linked with the same libtorque.so that was built/provided 
with torque ?
Check the path and md5sum on both compute nodes and on the node on which you 
built torque (ideally from both the build and install dirs)
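
For example (the Open MPI and Torque paths below are only illustrative - substitute your actual install locations):

   ldd /opt/openmpi-2.0.1/lib/openmpi/mca_plm_tm.so | grep torque
   md5sum /usr/lib64/libtorque.so.2

run on each compute node and on the build host, then compare the results.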

Cheers,

Gilles

Oswin Krause  wrote:
>Hi Gilles,
>
>There you go:
>
>[zbh251@a00551 ~]$ cat $PBS_NODEFILE
>a00551.science.domain
>a00554.science.domain
>a00553.science.domain
>[zbh251@a00551 ~]$ mpirun --mca ess_base_verbose 10 --mca 
>plm_base_verbose 10 --mca ras_base_verbose 10 hostname
>[a00551.science.domain:18889] mca: base: components_register: 
>registering framework ess components
>[a00551.science.domain:18889] mca: base: components_register: found 
>loaded component pmi
>[a00551.science.domain:18889] mca: base: components_register: component 
>pmi has no register or open function
>[a00551.science.domain:18889] mca: base: components_register: found 
>loaded component tool
>[a00551.science.domain:18889] mca: base: components_register: component 
>tool has no register or open function
>[a00551.science.domain:18889] mca: base: components_register: found 
>loaded component env
>[a00551.science.domain:18889] mca: base: components_register: component 
>env has no register or open function
>[a00551.science.domain:18889] mca: base: components_register: found 
>loaded component hnp
>[a00551.science.domain:18889] mca: base: components_register: component 
>hnp has no register or open function
>[a00551.science.domain:18889] mca: base: components_register: found 
>loaded component singleton
>[a00551.science.domain:18889] mca: base: components_register: component 
>singleton register function successful
>[a00551.science.domain:18889] mca: base: components_register: found 
>loaded component slurm
>[a00551.science.domain:18889] mca: base: components_register: component 
>slurm has no register or open function
>[a00551.science.domain:18889] mca: base: components_register: found 
>loaded component tm
>[a00551.science.domain:18889] mca: base: components_register: component 
>tm has no register or open function
>[a00551.science.domain:18889] mca: base: components_open: opening ess 
>components
>[a00551.science.domain:18889] mca: base: components_open: found loaded 
>component pmi
>[a00551.science.domain:18889] mca: base: components_open: component pmi 
>open function successful
>[a00551.science.domain:18889] mca: base: components_open: found loaded 
>component tool
>[a00551.science.domain:18889] mca: base: components_open: component tool 
>open function successful
>[a00551.science.domain:18889] mca: base: components_open: found loaded 
>component env
>[a00551.science.domain:18889] mca: base: components_open: component env 
>open function successful
>[a00551.science.domain:18889] mca: base: components_open: found loaded 
>component hnp
>[a00551.science.domain:18889] mca: base: components_open: component hnp 
>open function successful
>[a00551.science.domain:18889] mca: base: components_open: found loaded 
>component singleton
>[a00551.science.domain:18889] mca: base: components_open: component 
>singleton open function successful
>[a00551.science.domain:18889] mca: base: components_open: found loaded 
>component slurm
>[a00551.science.domain:18889] mca: base: components_open: component 
>slurm open function successful
>[a00551.science.domain:18889] mca: base: components_open: found loaded 
>component tm
>[a00551.science.domain:18889] mca: base: components_open: component tm 
>open function successful
>[a00551.science.domain:18889] mca:base:select: Auto-selecting ess 
>components
>[a00551.science.domain:18889] mca:base:select:(  ess) Querying component 
>[pmi]
>[a00551.science.domain:18889] mca:base:select:(  ess) Querying component 
>[tool]
>[a00551.science.domain:18889] mca:base:select:(  ess) Querying component 
>[env]
>[a00551.science.domain:18889] mca:base:select:(  ess) Querying component 
>[hnp]
>[a00551.science.domain:18889] mca:base:select:(  ess) Query of component 
>[hnp] set priority to 100
>[a00551.science.domain:18889] mca:base:select:(  ess) Querying component 
>[singleton]
>[a00551.science.domain:18889] mca:base:select:(  ess) Querying component 
>[slurm]
>[a00551.science.domain:18889] mca:base:select:(  ess) Querying component 
>[tm]
>[a00551.science.domain:18889] mca:base:select:(  ess) Selected component 
>[hnp]
>[a00551.science.domain:18889] mca: base: close: component pmi closed
>[a00551.science.domain:18889] mca: base: close: unloading component pmi
>[a00551.science.domain:18889] mca: base: close: component tool closed
>[a00551.science.domain:18889] mca: base: close: unloading component tool
>[a00551.science.domain:18889] mca: base: close: component env closed
>[a00551.science.domain:18889] mca: base: close: unloading component env
>[a00551.science.domain:18889] mca: base: close: c

Re: [OMPI users] Unable to mpirun from within torque

2016-09-08 Thread Gilles Gouaillardet
Oswin,

One more thing, can you

pbsdsh -v hostname

before invoking mpirun ?
Hopefully this should print the three hostnames

Then you can
ldd `which pbsdsh`
And see which libtorque.so is linked with it

Cheers,

Gilles

Oswin Krause  wrote:
>Hi Gilles,
>
>There you go:
>
>[zbh251@a00551 ~]$ cat $PBS_NODEFILE
>a00551.science.domain
>a00554.science.domain
>a00553.science.domain
>[zbh251@a00551 ~]$ mpirun --mca ess_base_verbose 10 --mca 
>plm_base_verbose 10 --mca ras_base_verbose 10 hostname
>[a00551.science.domain:18889] mca: base: components_register: 
>registering framework ess components
>[a00551.science.domain:18889] mca: base: components_register: found 
>loaded component pmi
>[a00551.science.domain:18889] mca: base: components_register: component 
>pmi has no register or open function
>[a00551.science.domain:18889] mca: base: components_register: found 
>loaded component tool
>[a00551.science.domain:18889] mca: base: components_register: component 
>tool has no register or open function
>[a00551.science.domain:18889] mca: base: components_register: found 
>loaded component env
>[a00551.science.domain:18889] mca: base: components_register: component 
>env has no register or open function
>[a00551.science.domain:18889] mca: base: components_register: found 
>loaded component hnp
>[a00551.science.domain:18889] mca: base: components_register: component 
>hnp has no register or open function
>[a00551.science.domain:18889] mca: base: components_register: found 
>loaded component singleton
>[a00551.science.domain:18889] mca: base: components_register: component 
>singleton register function successful
>[a00551.science.domain:18889] mca: base: components_register: found 
>loaded component slurm
>[a00551.science.domain:18889] mca: base: components_register: component 
>slurm has no register or open function
>[a00551.science.domain:18889] mca: base: components_register: found 
>loaded component tm
>[a00551.science.domain:18889] mca: base: components_register: component 
>tm has no register or open function
>[a00551.science.domain:18889] mca: base: components_open: opening ess 
>components
>[a00551.science.domain:18889] mca: base: components_open: found loaded 
>component pmi
>[a00551.science.domain:18889] mca: base: components_open: component pmi 
>open function successful
>[a00551.science.domain:18889] mca: base: components_open: found loaded 
>component tool
>[a00551.science.domain:18889] mca: base: components_open: component tool 
>open function successful
>[a00551.science.domain:18889] mca: base: components_open: found loaded 
>component env
>[a00551.science.domain:18889] mca: base: components_open: component env 
>open function successful
>[a00551.science.domain:18889] mca: base: components_open: found loaded 
>component hnp
>[a00551.science.domain:18889] mca: base: components_open: component hnp 
>open function successful
>[a00551.science.domain:18889] mca: base: components_open: found loaded 
>component singleton
>[a00551.science.domain:18889] mca: base: components_open: component 
>singleton open function successful
>[a00551.science.domain:18889] mca: base: components_open: found loaded 
>component slurm
>[a00551.science.domain:18889] mca: base: components_open: component 
>slurm open function successful
>[a00551.science.domain:18889] mca: base: components_open: found loaded 
>component tm
>[a00551.science.domain:18889] mca: base: components_open: component tm 
>open function successful
>[a00551.science.domain:18889] mca:base:select: Auto-selecting ess 
>components
>[a00551.science.domain:18889] mca:base:select:(  ess) Querying component 
>[pmi]
>[a00551.science.domain:18889] mca:base:select:(  ess) Querying component 
>[tool]
>[a00551.science.domain:18889] mca:base:select:(  ess) Querying component 
>[env]
>[a00551.science.domain:18889] mca:base:select:(  ess) Querying component 
>[hnp]
>[a00551.science.domain:18889] mca:base:select:(  ess) Query of component 
>[hnp] set priority to 100
>[a00551.science.domain:18889] mca:base:select:(  ess) Querying component 
>[singleton]
>[a00551.science.domain:18889] mca:base:select:(  ess) Querying component 
>[slurm]
>[a00551.science.domain:18889] mca:base:select:(  ess) Querying component 
>[tm]
>[a00551.science.domain:18889] mca:base:select:(  ess) Selected component 
>[hnp]
>[a00551.science.domain:18889] mca: base: close: component pmi closed
>[a00551.science.domain:18889] mca: base: close: unloading component pmi
>[a00551.science.domain:18889] mca: base: close: component tool closed
>[a00551.science.domain:18889] mca: base: close: unloading component tool
>[a00551.science.domain:18889] mca: base: close: component env closed
>[a00551.science.domain:18889] mca: base: close: unloading component env
>[a00551.science.domain:18889] mca: base: close: component singleton 
>closed
>[a00551.science.domain:18889] mca: base: close: unloading component 
>singleton
>[a00551.science.domain:18889] mca: base: close: component slurm closed
>[a00551.science.domain:18889] mca: 

Re: [OMPI users] Unable to mpirun from within torque

2016-09-08 Thread r...@open-mpi.org
I’m pruning this email thread so I can actually read the blasted thing :-)

Guys: you are off in the wilderness chasing ghosts! Please stop.

When I say that Torque uses an “ordered” file, I am _not_ saying that all the 
host entries of the same name have to be listed consecutively. I am saying that 
the _position_ of each entry has meaning, and you cannot just change it.

I have honestly totally lost the root of this discussion in all the white noise 
about the PBS_NODEFILE. Can we reboot?
Ralph


> On Sep 8, 2016, at 5:26 AM, Gilles Gouaillardet 
>  wrote:
> 
> Oswin,
> 
> One more thing, can you
> 
> pbsdsh -v hostname
> 
> before invoking mpirun ?
> Hopefully this should print the three hostnames
> 
> Then you can
> ldd `which pbsdsh`
> And see which libtorque.so is linked with it
> 
> Cheers,
> 
> Gilles
> 
> 


Re: [OMPI users] Unable to mpirun from within torque

2016-09-08 Thread Oswin Krause

Hi,

Okay, let's reboot, even though Gilles' last mail was onto something.

The problem is that I failed to start programs with mpirun when more 
than one node was involved. I mentioned that it is likely some 
configuration problem with my server, especially authentication (we 
have some Kerberos nightmare going on here).


We then tried to find out where mpirun went wrong, got sidetracked by 
the nodefile, and I reconfigured everything so that we could at least 
factor out any NUMA stuff.


Now I see that I am on the wrong mailing list and that this is likely a 
problem with pbs/torque, since pbsdsh itself places all three tasks on a00551:


pbsdsh -v hostname

pbsdsh(): spawned task 0
pbsdsh(): spawned task 1
pbsdsh(): spawned task 2
pbsdsh(): spawn event returned: 0 (3 spawns and 0 obits outstanding)
pbsdsh(): sending obit for task 0
a00551.science.domain
pbsdsh(): spawn event returned: 1 (2 spawns and 1 obits outstanding)
pbsdsh(): sending obit for task 1
a00551.science.domain
pbsdsh(): spawn event returned: 2 (1 spawns and 2 obits outstanding)
pbsdsh(): sending obit for task 2
a00551.science.domain
pbsdsh(): obit event returned: 0 (0 spawns and 3 obits outstanding)
pbsdsh(): task 0 exit status 0
pbsdsh(): obit event returned: 1 (0 spawns and 2 obits outstanding)
pbsdsh(): task 1 exit status 0
pbsdsh(): obit event returned: 2 (0 spawns and 1 obits outstanding)
pbsdsh(): task 2 exit status 0

Best,
Oswin


On 2016-09-08 15:42, r...@open-mpi.org wrote:
I’m pruning this email thread so I can actually read the blasted thing 
:-)


Guys: you are off in the wilderness chasing ghosts! Please stop.

When I say that Torque uses an “ordered” file, I am _not_ saying that
all the host entries of the same name have to be listed consecutively.
I am saying that the _position_ of each entry has meaning, and you
cannot just change it.

I have honestly totally lost the root of this discussion in all the
white noise about the PBS_NODEFILE. Can we reboot?
Ralph


On Sep 8, 2016, at 5:26 AM, Gilles Gouaillardet 
 wrote:


Oswin,

One more thing, can you

pbsdsh -v hostname

before invoking mpirun ?
Hopefully this should print the three hostnames

Then you can
ldd `which pbsdsh`
And see which libtorque.so is linked with it

Cheers,

Gilles





___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users