Re: [OMPI users] Unable to mpirun from within torque
Someone has done some work there since I last did, but I can see the issue. Torque indeed always provides an ordered file - the only way you can get an unordered one is for someone to edit it, and that is forbidden - i.e., you get what you deserve because you are messing around with a system-defined file :-) The problem is that Torque internally assigns a “launch ID” which is just the integer position of the nodename in the PBS_NODEFILE. So if you modify that position, then we get the wrong index - and everything goes down the drain from there. In your example, n1.cluster changed index from 3 to 2 because of your edit. Torque thinks that index 2 is just another reference to n0.cluster, and so we merrily launch a daemon onto the wrong node. They have a good reason for doing things this way. It allows you to launch a process against each launch ID, and the pattern will reflect the original qsub request in what we would call a map-by slot round-robin mode. This maximizes the use of shared memory, and is expected to provide good performance for a range of apps. Lesson to be learned: never, ever muddle around with a system-generated file. If you want to modify where things go, then use one or more of the mpirun options to do so. We give you lots and lots of knobs for just that reason. > On Sep 7, 2016, at 10:53 PM, Gilles Gouaillardet wrote: > > Ralph, > > > there might be an issue within Open MPI. > > > on the cluster i used, hostname returns the FQDN, and $PBS_NODEFILE uses the > FQDN too. > > my $PBS_NODEFILE has one line per task, and it is ordered > > e.g. > > n0.cluster > > n0.cluster > > n1.cluster > > n1.cluster > > > in my torque script, i rewrote the machinefile like this > > n0.cluster > > n1.cluster > > n0.cluster > > n1.cluster > > and updated the PBS environment variable to point to my new file. > > > then i invoked > > mpirun hostname > > > > in the first case, 2 tasks run on n0 and 2 tasks run on n1 > in the second case, 4 tasks run on n0, and none on n1. > > so i am thinking we might not support unordered $PBS_NODEFILE. > > as a reminder, the submit command was > qsub -l nodes=3:ppn=1 > but for some reasons i ignore, only two nodes were allocated (two slots on > the first one, one on the second one) > and if i understand correctly, $PBS_NODEFILE was not ordered. > (e.g. n0 n1 n0 and *not * n0 n0 n1) > > i tried to reproduce this without hacking $PBS_NODEFILE, but my jobs hang in > the queue if only two nodes with 16 slots each are available and i request > -l nodes=3:ppn=1 > i guess this is a different scheduler configuration, and i cannot change that. > > Could you please have a look at this ? > > Cheers, > > Gilles > > On 9/7/2016 11:15 PM, r...@open-mpi.org wrote: >> The usual cause of this problem is that the nodename in the machinefile is >> given as a00551, while Torque is assigning the node name as >> a00551.science.domain. Thus, mpirun thinks those are two separate nodes and >> winds up spawning an orted on its own node. >> >> You might try ensuring that your machinefile is using the exact same name as >> provided in your allocation >> >> >>> On Sep 7, 2016, at 7:06 AM, Gilles Gouaillardet >>> wrote: >>> >>> Thanjs for the ligs >>> >>> From what i see now, it looks like a00551 is running both mpirun and orted, >>> though it should only run mpirun, and orted should run only on a00553 >>> >>> I will check the code and see what could be happening here >>> >>> Btw, what is the output of >>> hostname >>> hostname -f >>> On a00551 ? 
>>> >>> Out of curiosity, is a previous version of Open MPI (e.g. v1.10.4) >>> installled and running correctly on your cluster ? >>> >>> Cheers, >>> >>> Gilles >>> >>> Oswin Krause wrote: Hi Gilles, Thanks for the hint with the machinefile. I know it is not equivalent and i do not intend to use that approach. I just wanted to know whether I could start the program successfully at all. Outside torque(4.2), rsh seems to be used which works fine, querying a password if no kerberos ticket is there Here is the output: [zbh251@a00551 ~]$ mpirun -V mpirun (Open MPI) 2.0.1 [zbh251@a00551 ~]$ ompi_info | grep ras MCA ras: loadleveler (MCA v2.1.0, API v2.0.0, Component v2.0.1) MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component v2.0.1) MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v2.0.1) MCA ras: tm (MCA v2.1.0, API v2.0.0, Component v2.0.1) [zbh251@a00551 ~]$ mpirun --mca plm_base_verbose 10 --tag-output -display-map hostname [a00551.science.domain:04104] mca: base: components_register: registering framework plm components [a00551.science.domain:04104] mca: base: components_register: found loaded component isolated [a00551.science.domain:04104] mca: base: components_register: component is
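For illustration, the placement Gilles was after can be requested directly through those mpirun knobs instead of by editing the nodefile. A minimal sketch, assuming Open MPI 2.x option names and the same "mpirun hostname" test used in the thread:

  # spread ranks across the allocated nodes round-robin instead of filling each node first
  mpirun --map-by node hostname

  # or cap the job at one rank per allocated node
  mpirun -npernode 1 hostname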
Re: [OMPI users] Unable to mpirun from within torque
If you are correctly analyzing things, then there would be an issue in the code. When we get an allocation from a resource manager, we set a flag indicating that it is “gospel” - i.e., that we do not directly sense the number of cores on a node and set the #slots equal to that value. Instead, we take the RM-provided allocation as ultimate truth. This should be true even if you add a machinefile, as the machinefile is only used to “filter” the nodelist provided by the RM. It shouldn’t cause the #slots to be modified. Taking a quick glance at the v2.x code, it looks to me like all is being done correctly. Again, output from a debug build would resolve that question > On Sep 7, 2016, at 10:56 PM, Gilles Gouaillardet wrote: > > Oswin, > > > unfortunatly some important info is missing. > > i guess the root cause is Open MPI was not configure'd with --enable-debug > > > could you please update your torque script and simply add the following > snippet before invoking mpirun > > > echo PBS_NODEFILE > > cat $PBS_NODEFILE > > echo --- > > > as i wrote in an other email, i suspect hosts are not ordered (and i'd like > to confirm that) and Open MPI does not handle that correctly > > > Cheers, > > > Gilles > > On 9/7/2016 10:25 PM, Oswin Krause wrote: >> Hi Gilles, >> >> Thanks for the hint with the machinefile. I know it is not equivalent and i >> do not intend to use that approach. I just wanted to know whether I could >> start the program successfully at all. >> >> Outside torque(4.2), rsh seems to be used which works fine, querying a >> password if no kerberos ticket is there >> >> Here is the output: >> [zbh251@a00551 ~]$ mpirun -V >> mpirun (Open MPI) 2.0.1 >> [zbh251@a00551 ~]$ ompi_info | grep ras >> MCA ras: loadleveler (MCA v2.1.0, API v2.0.0, Component >> v2.0.1) >> MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component v2.0.1) >> MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v2.0.1) >> MCA ras: tm (MCA v2.1.0, API v2.0.0, Component v2.0.1) >> [zbh251@a00551 ~]$ mpirun --mca plm_base_verbose 10 --tag-output >> -display-map hostname >> [a00551.science.domain:04104] mca: base: components_register: registering >> framework plm components >> [a00551.science.domain:04104] mca: base: components_register: found loaded >> component isolated >> [a00551.science.domain:04104] mca: base: components_register: component >> isolated has no register or open function >> [a00551.science.domain:04104] mca: base: components_register: found loaded >> component rsh >> [a00551.science.domain:04104] mca: base: components_register: component rsh >> register function successful >> [a00551.science.domain:04104] mca: base: components_register: found loaded >> component slurm >> [a00551.science.domain:04104] mca: base: components_register: component >> slurm register function successful >> [a00551.science.domain:04104] mca: base: components_register: found loaded >> component tm >> [a00551.science.domain:04104] mca: base: components_register: component tm >> register function successful >> [a00551.science.domain:04104] mca: base: components_open: opening plm >> components >> [a00551.science.domain:04104] mca: base: components_open: found loaded >> component isolated >> [a00551.science.domain:04104] mca: base: components_open: component isolated >> open function successful >> [a00551.science.domain:04104] mca: base: components_open: found loaded >> component rsh >> [a00551.science.domain:04104] mca: base: components_open: component rsh open >> function successful >> [a00551.science.domain:04104] mca: base: 
components_open: found loaded >> component slurm >> [a00551.science.domain:04104] mca: base: components_open: component slurm >> open function successful >> [a00551.science.domain:04104] mca: base: components_open: found loaded >> component tm >> [a00551.science.domain:04104] mca: base: components_open: component tm open >> function successful >> [a00551.science.domain:04104] mca:base:select: Auto-selecting plm components >> [a00551.science.domain:04104] mca:base:select:( plm) Querying component >> [isolated] >> [a00551.science.domain:04104] mca:base:select:( plm) Query of component >> [isolated] set priority to 0 >> [a00551.science.domain:04104] mca:base:select:( plm) Querying component >> [rsh] >> [a00551.science.domain:04104] mca:base:select:( plm) Query of component >> [rsh] set priority to 10 >> [a00551.science.domain:04104] mca:base:select:( plm) Querying component >> [slurm] >> [a00551.science.domain:04104] mca:base:select:( plm) Querying component [tm] >> [a00551.science.domain:04104] mca:base:select:( plm) Query of component >> [tm] set priority to 75 >> [a00551.science.domain:04104] mca:base:select:( plm) Selected component [tm] >> [a00551.science.domain:04104] mca: base: close: component isolated closed >> [a00551.science.domain:04104] mca: base: close: unloading component isolated >> [a00
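For illustration, the snippet Gilles asks for would sit at the top of the Torque job script, right before mpirun. A minimal sketch, reusing the resource request and mpirun arguments that appear elsewhere in the thread:

  #!/bin/bash
  #PBS -l nodes=3:ppn=1
  echo PBS_NODEFILE
  cat $PBS_NODEFILE
  echo ---
  mpirun --tag-output -display-map hostname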
Re: [OMPI users] Unable to mpirun from within torque
Hi, Thanks for all the hints. Only issue is: this is the file generated by torque. Torque - or at least the torque 4.2 provided by my redhat version - gives me an unordered file. Should I rebuild torque? Best, Oswin I am currently rebuilding the package with --enable-debug. On 2016-09-08 09:57, r...@open-mpi.org wrote: Someone has done some work there since I last did, but I can see the issue. Torque indeed always provides an ordered file - the only way you can get an unordered one is for someone to edit it, and that is forbidden - i.e., you get what you deserve because you are messing around with a system-defined file :-) The problem is that Torque internally assigns a “launch ID” which is just the integer position of the nodename in the PBS_NODEFILE. So if you modify that position, then we get the wrong index - and everything goes down the drain from there. In your example, n1.cluster changed index from 3 to 2 because of your edit. Torque thinks that index 2 is just another reference to n0.cluster, and so we merrily launch a daemon onto the wrong node. They have a good reason for doing things this way. It allows you to launch a process against each launch ID, and the pattern will reflect the original qsub request in what we would call a map-by slot round-robin mode. This maximizes the use of shared memory, and is expected to provide good performance for a range of apps. Lesson to be learned: never, ever muddle around with a system-generated file. If you want to modify where things go, then use one or more of the mpirun options to do so. We give you lots and lots of knobs for just that reason. On Sep 7, 2016, at 10:53 PM, Gilles Gouaillardet wrote: Ralph, there might be an issue within Open MPI. on the cluster i used, hostname returns the FQDN, and $PBS_NODEFILE uses the FQDN too. my $PBS_NODEFILE has one line per task, and it is ordered e.g. n0.cluster n0.cluster n1.cluster n1.cluster in my torque script, i rewrote the machinefile like this n0.cluster n1.cluster n0.cluster n1.cluster and updated the PBS environment variable to point to my new file. then i invoked mpirun hostname in the first case, 2 tasks run on n0 and 2 tasks run on n1 in the second case, 4 tasks run on n0, and none on n1. so i am thinking we might not support unordered $PBS_NODEFILE. as a reminder, the submit command was qsub -l nodes=3:ppn=1 but for some reasons i ignore, only two nodes were allocated (two slots on the first one, one on the second one) and if i understand correctly, $PBS_NODEFILE was not ordered. (e.g. n0 n1 n0 and *not * n0 n0 n1) i tried to reproduce this without hacking $PBS_NODEFILE, but my jobs hang in the queue if only two nodes with 16 slots each are available and i request -l nodes=3:ppn=1 i guess this is a different scheduler configuration, and i cannot change that. Could you please have a look at this ? Cheers, Gilles On 9/7/2016 11:15 PM, r...@open-mpi.org wrote: The usual cause of this problem is that the nodename in the machinefile is given as a00551, while Torque is assigning the node name as a00551.science.domain. Thus, mpirun thinks those are two separate nodes and winds up spawning an orted on its own node. 
You might try ensuring that your machinefile is using the exact same name as provided in your allocation On Sep 7, 2016, at 7:06 AM, Gilles Gouaillardet wrote: Thanjs for the ligs From what i see now, it looks like a00551 is running both mpirun and orted, though it should only run mpirun, and orted should run only on a00553 I will check the code and see what could be happening here Btw, what is the output of hostname hostname -f On a00551 ? Out of curiosity, is a previous version of Open MPI (e.g. v1.10.4) installled and running correctly on your cluster ? Cheers, Gilles Oswin Krause wrote: Hi Gilles, Thanks for the hint with the machinefile. I know it is not equivalent and i do not intend to use that approach. I just wanted to know whether I could start the program successfully at all. Outside torque(4.2), rsh seems to be used which works fine, querying a password if no kerberos ticket is there Here is the output: [zbh251@a00551 ~]$ mpirun -V mpirun (Open MPI) 2.0.1 [zbh251@a00551 ~]$ ompi_info | grep ras MCA ras: loadleveler (MCA v2.1.0, API v2.0.0, Component v2.0.1) MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component v2.0.1) MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v2.0.1) MCA ras: tm (MCA v2.1.0, API v2.0.0, Component v2.0.1) [zbh251@a00551 ~]$ mpirun --mca plm_base_verbose 10 --tag-output -display-map hostname [a00551.science.domain:04104] mca: base: components_register: registering framework plm components [a00551.science.domain:04104] mca: base: components_register: found loaded component isolated [a00551.science.domain:04104] mca: base: components_register: component isolated has no register or open function [a00551.s
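For reference, the --enable-debug rebuild Oswin mentions amounts to reconfiguring Open MPI from source. A minimal sketch, where the install prefix and the Torque location passed to --with-tm are assumptions to adapt to the local layout:

  # --with-tm must point at the Torque install that provides tm.h and libtorque
  ./configure --prefix=$HOME/openmpi-2.0.1-debug --enable-debug --with-tm=/opt/torque
  make -j 4 && make install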
Re: [OMPI users] Unable to mpirun from within torque
Ralph, i am not sure i am reading you correctly, so let me clarify. i did not hack $PBS_NODEFILE for fun nor profit, i was simply trying to reproduce an issue i could not reproduce otherwise. /* my job submitted with -l nodes=3:ppn=1 do not start if there are only two nodes available, whereas the same user job starts on two nodes */ thanks for the explanation of the torque internals, my hack was incomplete and not a valid one, i do acknowledge it. i re-read the email that started this thread and i found the information i was looking for echo $PBS_NODEFILE /var/lib/torque/aux//278.a00552.science.domain cat $PBS_NODEFILE a00551.science.domain a00553.science.domain a00551.science.domain so, assuming the enduser did not edit his $PBS_NODEFILE, and torque is correctly configured and not busted, then Torque indeed always provides an ordered file - the only way you can get an unordered one is for someone to edit it might be updated to "Torque used to always provide an ordered file, but recent versions might not do that." makes sense ? Cheers, Gilles On 9/8/2016 4:57 PM, r...@open-mpi.org wrote: Someone has done some work there since I last did, but I can see the issue. Torque indeed always provides an ordered file - the only way you can get an unordered one is for someone to edit it, and that is forbidden - i.e., you get what you deserve because you are messing around with a system-defined file :-) The problem is that Torque internally assigns a “launch ID” which is just the integer position of the nodename in the PBS_NODEFILE. So if you modify that position, then we get the wrong index - and everything goes down the drain from there. In your example, n1.cluster changed index from 3 to 2 because of your edit. Torque thinks that index 2 is just another reference to n0.cluster, and so we merrily launch a daemon onto the wrong node. They have a good reason for doing things this way. It allows you to launch a process against each launch ID, and the pattern will reflect the original qsub request in what we would call a map-by slot round-robin mode. This maximizes the use of shared memory, and is expected to provide good performance for a range of apps. Lesson to be learned: never, ever muddle around with a system-generated file. If you want to modify where things go, then use one or more of the mpirun options to do so. We give you lots and lots of knobs for just that reason. On Sep 7, 2016, at 10:53 PM, Gilles Gouaillardet wrote: Ralph, there might be an issue within Open MPI. on the cluster i used, hostname returns the FQDN, and $PBS_NODEFILE uses the FQDN too. my $PBS_NODEFILE has one line per task, and it is ordered e.g. n0.cluster n0.cluster n1.cluster n1.cluster in my torque script, i rewrote the machinefile like this n0.cluster n1.cluster n0.cluster n1.cluster and updated the PBS environment variable to point to my new file. then i invoked mpirun hostname in the first case, 2 tasks run on n0 and 2 tasks run on n1 in the second case, 4 tasks run on n0, and none on n1. so i am thinking we might not support unordered $PBS_NODEFILE. as a reminder, the submit command was qsub -l nodes=3:ppn=1 but for some reasons i ignore, only two nodes were allocated (two slots on the first one, one on the second one) and if i understand correctly, $PBS_NODEFILE was not ordered. (e.g. 
n0 n1 n0 and *not * n0 n0 n1) i tried to reproduce this without hacking $PBS_NODEFILE, but my jobs hang in the queue if only two nodes with 16 slots each are available and i request -l nodes=3:ppn=1 i guess this is a different scheduler configuration, and i cannot change that. Could you please have a look at this ? Cheers, Gilles On 9/7/2016 11:15 PM, r...@open-mpi.org wrote: The usual cause of this problem is that the nodename in the machinefile is given as a00551, while Torque is assigning the node name as a00551.science.domain. Thus, mpirun thinks those are two separate nodes and winds up spawning an orted on its own node. You might try ensuring that your machinefile is using the exact same name as provided in your allocation On Sep 7, 2016, at 7:06 AM, Gilles Gouaillardet wrote: Thanjs for the ligs From what i see now, it looks like a00551 is running both mpirun and orted, though it should only run mpirun, and orted should run only on a00553 I will check the code and see what could be happening here Btw, what is the output of hostname hostname -f On a00551 ? Out of curiosity, is a previous version of Open MPI (e.g. v1.10.4) installled and running correctly on your cluster ? Cheers, Gilles Oswin Krause wrote: Hi Gilles, Thanks for the hint with the machinefile. I know it is not equivalent and i do not intend to use that approach. I just wanted to know whether I could start the program successfully at all. Outside torque(4.2), rsh seems to be used which works fine, querying a password if no kerberos ticket is there Here is the
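To make the ordering point concrete, the two nodefiles from Gilles' example look like this (node names as in his mail; same four task slots, only the line order differs):

  # ordered, as generated by Torque
  n0.cluster
  n0.cluster
  n1.cluster
  n1.cluster

  # unordered, after the manual edit
  n0.cluster
  n1.cluster
  n0.cluster
  n1.cluster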
Re: [OMPI users] Unable to mpirun from within torque
Oswin, that might be off topic and or/premature ... PBS Pro has been made free (and opensource too) and is available at http://www.pbspro.org/ this is something you might be interested in (unless you are using torque because of the MOAB scheduler), and it might be more friendly (e.g. always generate ordered $PBS_NODEFILE) to Open MPI Cheers, Gilles On 9/8/2016 5:09 PM, Oswin Krause wrote: Hi, Thanks for all the hints. Only issue is: this is the file generated by torque. Torque - or at least the torque 4.2 provided by my redhat version - gives me an unordered file. Should I rebuild torque? Best, Oswin I am currently rebuilding the package with --enable-debug. On 2016-09-08 09:57, r...@open-mpi.org wrote: Someone has done some work there since I last did, but I can see the issue. Torque indeed always provides an ordered file - the only way you can get an unordered one is for someone to edit it, and that is forbidden - i.e., you get what you deserve because you are messing around with a system-defined file :-) The problem is that Torque internally assigns a “launch ID” which is just the integer position of the nodename in the PBS_NODEFILE. So if you modify that position, then we get the wrong index - and everything goes down the drain from there. In your example, n1.cluster changed index from 3 to 2 because of your edit. Torque thinks that index 2 is just another reference to n0.cluster, and so we merrily launch a daemon onto the wrong node. They have a good reason for doing things this way. It allows you to launch a process against each launch ID, and the pattern will reflect the original qsub request in what we would call a map-by slot round-robin mode. This maximizes the use of shared memory, and is expected to provide good performance for a range of apps. Lesson to be learned: never, ever muddle around with a system-generated file. If you want to modify where things go, then use one or more of the mpirun options to do so. We give you lots and lots of knobs for just that reason. On Sep 7, 2016, at 10:53 PM, Gilles Gouaillardet wrote: Ralph, there might be an issue within Open MPI. on the cluster i used, hostname returns the FQDN, and $PBS_NODEFILE uses the FQDN too. my $PBS_NODEFILE has one line per task, and it is ordered e.g. n0.cluster n0.cluster n1.cluster n1.cluster in my torque script, i rewrote the machinefile like this n0.cluster n1.cluster n0.cluster n1.cluster and updated the PBS environment variable to point to my new file. then i invoked mpirun hostname in the first case, 2 tasks run on n0 and 2 tasks run on n1 in the second case, 4 tasks run on n0, and none on n1. so i am thinking we might not support unordered $PBS_NODEFILE. as a reminder, the submit command was qsub -l nodes=3:ppn=1 but for some reasons i ignore, only two nodes were allocated (two slots on the first one, one on the second one) and if i understand correctly, $PBS_NODEFILE was not ordered. (e.g. n0 n1 n0 and *not * n0 n0 n1) i tried to reproduce this without hacking $PBS_NODEFILE, but my jobs hang in the queue if only two nodes with 16 slots each are available and i request -l nodes=3:ppn=1 i guess this is a different scheduler configuration, and i cannot change that. Could you please have a look at this ? Cheers, Gilles On 9/7/2016 11:15 PM, r...@open-mpi.org wrote: The usual cause of this problem is that the nodename in the machinefile is given as a00551, while Torque is assigning the node name as a00551.science.domain. 
Thus, mpirun thinks those are two separate nodes and winds up spawning an orted on its own node. You might try ensuring that your machinefile is using the exact same name as provided in your allocation On Sep 7, 2016, at 7:06 AM, Gilles Gouaillardet wrote: Thanjs for the ligs From what i see now, it looks like a00551 is running both mpirun and orted, though it should only run mpirun, and orted should run only on a00553 I will check the code and see what could be happening here Btw, what is the output of hostname hostname -f On a00551 ? Out of curiosity, is a previous version of Open MPI (e.g. v1.10.4) installled and running correctly on your cluster ? Cheers, Gilles Oswin Krause wrote: Hi Gilles, Thanks for the hint with the machinefile. I know it is not equivalent and i do not intend to use that approach. I just wanted to know whether I could start the program successfully at all. Outside torque(4.2), rsh seems to be used which works fine, querying a password if no kerberos ticket is there Here is the output: [zbh251@a00551 ~]$ mpirun -V mpirun (Open MPI) 2.0.1 [zbh251@a00551 ~]$ ompi_info | grep ras MCA ras: loadleveler (MCA v2.1.0, API v2.0.0, Component v2.0.1) MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component v2.0.1) MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v2.0.1) MCA ras: tm (MCA v2.1.0, API v2.0.0, Component v2.0
Re: [OMPI users] Unable to mpirun from within torque
Hi Gilles, Hi Ralph, I have just rebuild openmpi. quite a lot more of information. As I said, i did not tinker with the PBS_NODEFILE. I think the issue might be NUMA here. I can try to go through the process and reconfigure to non-numa and see whether this works. The issue might be that the node allocation looks like this: a00551.science.domain-0 a00552.science.domain-0 a00551.science.domain-1 and the last part then gets shortened which leads to the issue. Not sure whether this makes sense but this is my explanation. Here the output: $PBS_NODEFILE /var/lib/torque/aux//285.a00552.science.domain PBS_NODEFILE a00551.science.domain a00553.science.domain a00551.science.domain - [a00551.science.domain:16986] mca: base: components_register: registering framework plm components [a00551.science.domain:16986] mca: base: components_register: found loaded component isolated [a00551.science.domain:16986] mca: base: components_register: component isolated has no register or open function [a00551.science.domain:16986] mca: base: components_register: found loaded component rsh [a00551.science.domain:16986] mca: base: components_register: component rsh register function successful [a00551.science.domain:16986] mca: base: components_register: found loaded component slurm [a00551.science.domain:16986] mca: base: components_register: component slurm register function successful [a00551.science.domain:16986] mca: base: components_register: found loaded component tm [a00551.science.domain:16986] mca: base: components_register: component tm register function successful [a00551.science.domain:16986] mca: base: components_open: opening plm components [a00551.science.domain:16986] mca: base: components_open: found loaded component isolated [a00551.science.domain:16986] mca: base: components_open: component isolated open function successful [a00551.science.domain:16986] mca: base: components_open: found loaded component rsh [a00551.science.domain:16986] mca: base: components_open: component rsh open function successful [a00551.science.domain:16986] mca: base: components_open: found loaded component slurm [a00551.science.domain:16986] mca: base: components_open: component slurm open function successful [a00551.science.domain:16986] mca: base: components_open: found loaded component tm [a00551.science.domain:16986] mca: base: components_open: component tm open function successful [a00551.science.domain:16986] mca:base:select: Auto-selecting plm components [a00551.science.domain:16986] mca:base:select:( plm) Querying component [isolated] [a00551.science.domain:16986] mca:base:select:( plm) Query of component [isolated] set priority to 0 [a00551.science.domain:16986] mca:base:select:( plm) Querying component [rsh] [a00551.science.domain:16986] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL [a00551.science.domain:16986] mca:base:select:( plm) Query of component [rsh] set priority to 10 [a00551.science.domain:16986] mca:base:select:( plm) Querying component [slurm] [a00551.science.domain:16986] mca:base:select:( plm) Querying component [tm] [a00551.science.domain:16986] mca:base:select:( plm) Query of component [tm] set priority to 75 [a00551.science.domain:16986] mca:base:select:( plm) Selected component [tm] [a00551.science.domain:16986] mca: base: close: component isolated closed [a00551.science.domain:16986] mca: base: close: unloading component isolated [a00551.science.domain:16986] mca: base: close: component rsh closed [a00551.science.domain:16986] mca: base: close: unloading component rsh 
[a00551.science.domain:16986] mca: base: close: component slurm closed [a00551.science.domain:16986] mca: base: close: unloading component slurm [a00551.science.domain:16986] plm:base:set_hnp_name: initial bias 16986 nodename hash 2226275586 [a00551.science.domain:16986] plm:base:set_hnp_name: final jobfam 33770 [a00551.science.domain:16986] [[33770,0],0] plm:base:receive start comm [a00551.science.domain:16986] [[33770,0],0] plm:base:setup_job [a00551.science.domain:16986] [[33770,0],0] plm:base:setup_vm [a00551.science.domain:16986] [[33770,0],0] plm:base:setup_vm creating map [a00551.science.domain:16986] [[33770,0],0] plm:base:setup_vm add new daemon [[33770,0],1] [a00551.science.domain:16986] [[33770,0],0] plm:base:setup_vm assigning new daemon [[33770,0],1] to node a00553.science.domain [a00551.science.domain:16986] [[33770,0],0] plm:tm: launching vm [a00551.science.domain:16986] [[33770,0],0] plm:tm: final top-level argv: orted --hnp-topo-sig 2N:2S:2L3:20L2:20L1:20C:40H:x86_64 -mca ess tm -mca ess_base_jobid 2213150720 -mca ess_base_vpid -mca ess_base_num_procs 2 -mca orte_hnp_uri 2213150720.0;usock;tcp://130.226.12.194:53397;tcp6://[fe80::225:90ff:feeb:f6d5]:42821 --mca plm_base_verbose 10 [a00551.science.domain:16986] [[33770,0],0] plm:tm: launching on node a00553.science.domain [a00551.science.domain:16986] [[33770,0],0] plm:tm: executing: orted
Re: [OMPI users] Unable to mpirun from within torque
Hi, i reconfigured to only have one physical node. Still no success, but the nodefile now looks better. I still get the errors: [a00551.science.domain:18021] [[34768,0],1] bind() failed on error Address already in use (98) [a00551.science.domain:18021] [[34768,0],1] ORTE_ERROR_LOG: Error in file oob_usock_component.c at line 228 [a00551.science.domain:18022] [[34768,0],2] bind() failed on error Address already in use (98) [a00551.science.domain:18022] [[34768,0],2] ORTE_ERROR_LOG: Error in file oob_usock_component.c at line 228 (btw: for some reason the bind errors where missing. sorry!) PBS_NODEFILE a00551.science.domain a00554.science.domain a00553.science.domain --- mpirun --mca plm_base_verbose 10 --tag-output -display-map hostname [a00551.science.domain:18097] mca: base: components_register: registering framework plm components [a00551.science.domain:18097] mca: base: components_register: found loaded component isolated [a00551.science.domain:18097] mca: base: components_register: component isolated has no register or open function [a00551.science.domain:18097] mca: base: components_register: found loaded component rsh [a00551.science.domain:18097] mca: base: components_register: component rsh register function successful [a00551.science.domain:18097] mca: base: components_register: found loaded component slurm [a00551.science.domain:18097] mca: base: components_register: component slurm register function successful [a00551.science.domain:18097] mca: base: components_register: found loaded component tm [a00551.science.domain:18097] mca: base: components_register: component tm register function successful [a00551.science.domain:18097] mca: base: components_open: opening plm components [a00551.science.domain:18097] mca: base: components_open: found loaded component isolated [a00551.science.domain:18097] mca: base: components_open: component isolated open function successful [a00551.science.domain:18097] mca: base: components_open: found loaded component rsh [a00551.science.domain:18097] mca: base: components_open: component rsh open function successful [a00551.science.domain:18097] mca: base: components_open: found loaded component slurm [a00551.science.domain:18097] mca: base: components_open: component slurm open function successful [a00551.science.domain:18097] mca: base: components_open: found loaded component tm [a00551.science.domain:18097] mca: base: components_open: component tm open function successful [a00551.science.domain:18097] mca:base:select: Auto-selecting plm components [a00551.science.domain:18097] mca:base:select:( plm) Querying component [isolated] [a00551.science.domain:18097] mca:base:select:( plm) Query of component [isolated] set priority to 0 [a00551.science.domain:18097] mca:base:select:( plm) Querying component [rsh] [a00551.science.domain:18097] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL [a00551.science.domain:18097] mca:base:select:( plm) Query of component [rsh] set priority to 10 [a00551.science.domain:18097] mca:base:select:( plm) Querying component [slurm] [a00551.science.domain:18097] mca:base:select:( plm) Querying component [tm] [a00551.science.domain:18097] mca:base:select:( plm) Query of component [tm] set priority to 75 [a00551.science.domain:18097] mca:base:select:( plm) Selected component [tm] [a00551.science.domain:18097] mca: base: close: component isolated closed [a00551.science.domain:18097] mca: base: close: unloading component isolated [a00551.science.domain:18097] mca: base: close: component rsh closed 
[a00551.science.domain:18097] mca: base: close: unloading component rsh [a00551.science.domain:18097] mca: base: close: component slurm closed [a00551.science.domain:18097] mca: base: close: unloading component slurm [a00551.science.domain:18097] plm:base:set_hnp_name: initial bias 18097 nodename hash 2226275586 [a00551.science.domain:18097] plm:base:set_hnp_name: final jobfam 34561 [a00551.science.domain:18097] [[34561,0],0] plm:base:receive start comm [a00551.science.domain:18097] [[34561,0],0] plm:base:setup_job [a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm [a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm creating map [a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm add new daemon [[34561,0],1] [a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm assigning new daemon [[34561,0],1] to node a00554.science.domain [a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm add new daemon [[34561,0],2] [a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm assigning new daemon [[34561,0],2] to node a00553.science.domain [a00551.science.domain:18097] [[34561,0],0] plm:tm: launching vm [a00551.science.domain:18097] [[34561,0],0] plm:tm: final top-level argv: orted --hnp-topo-sig 2N:2S:2L3:20L2:20L1:20C:40H:x86_64 -mca ess tm -mca ess_base_jobid 2264989696 -mca ess_base_vpid -mca ess_base_num_procs 3
Re: [OMPI users] Unable to mpirun from within torque
Oswin, can you please run again (one task per physical node) with mpirun --mca ess_base_verbose 10 --mca plm_base_verbose 10 --mca ras_base_verbose 10 hostname Cheers, Gilles On 9/8/2016 6:42 PM, Oswin Krause wrote: Hi, i reconfigured to only have one physical node. Still no success, but the nodefile now looks better. I still get the errors: [a00551.science.domain:18021] [[34768,0],1] bind() failed on error Address already in use (98) [a00551.science.domain:18021] [[34768,0],1] ORTE_ERROR_LOG: Error in file oob_usock_component.c at line 228 [a00551.science.domain:18022] [[34768,0],2] bind() failed on error Address already in use (98) [a00551.science.domain:18022] [[34768,0],2] ORTE_ERROR_LOG: Error in file oob_usock_component.c at line 228 (btw: for some reason the bind errors where missing. sorry!) PBS_NODEFILE a00551.science.domain a00554.science.domain a00553.science.domain --- mpirun --mca plm_base_verbose 10 --tag-output -display-map hostname [a00551.science.domain:18097] mca: base: components_register: registering framework plm components [a00551.science.domain:18097] mca: base: components_register: found loaded component isolated [a00551.science.domain:18097] mca: base: components_register: component isolated has no register or open function [a00551.science.domain:18097] mca: base: components_register: found loaded component rsh [a00551.science.domain:18097] mca: base: components_register: component rsh register function successful [a00551.science.domain:18097] mca: base: components_register: found loaded component slurm [a00551.science.domain:18097] mca: base: components_register: component slurm register function successful [a00551.science.domain:18097] mca: base: components_register: found loaded component tm [a00551.science.domain:18097] mca: base: components_register: component tm register function successful [a00551.science.domain:18097] mca: base: components_open: opening plm components [a00551.science.domain:18097] mca: base: components_open: found loaded component isolated [a00551.science.domain:18097] mca: base: components_open: component isolated open function successful [a00551.science.domain:18097] mca: base: components_open: found loaded component rsh [a00551.science.domain:18097] mca: base: components_open: component rsh open function successful [a00551.science.domain:18097] mca: base: components_open: found loaded component slurm [a00551.science.domain:18097] mca: base: components_open: component slurm open function successful [a00551.science.domain:18097] mca: base: components_open: found loaded component tm [a00551.science.domain:18097] mca: base: components_open: component tm open function successful [a00551.science.domain:18097] mca:base:select: Auto-selecting plm components [a00551.science.domain:18097] mca:base:select:( plm) Querying component [isolated] [a00551.science.domain:18097] mca:base:select:( plm) Query of component [isolated] set priority to 0 [a00551.science.domain:18097] mca:base:select:( plm) Querying component [rsh] [a00551.science.domain:18097] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL [a00551.science.domain:18097] mca:base:select:( plm) Query of component [rsh] set priority to 10 [a00551.science.domain:18097] mca:base:select:( plm) Querying component [slurm] [a00551.science.domain:18097] mca:base:select:( plm) Querying component [tm] [a00551.science.domain:18097] mca:base:select:( plm) Query of component [tm] set priority to 75 [a00551.science.domain:18097] mca:base:select:( plm) Selected component [tm] 
[a00551.science.domain:18097] mca: base: close: component isolated closed [a00551.science.domain:18097] mca: base: close: unloading component isolated [a00551.science.domain:18097] mca: base: close: component rsh closed [a00551.science.domain:18097] mca: base: close: unloading component rsh [a00551.science.domain:18097] mca: base: close: component slurm closed [a00551.science.domain:18097] mca: base: close: unloading component slurm [a00551.science.domain:18097] plm:base:set_hnp_name: initial bias 18097 nodename hash 2226275586 [a00551.science.domain:18097] plm:base:set_hnp_name: final jobfam 34561 [a00551.science.domain:18097] [[34561,0],0] plm:base:receive start comm [a00551.science.domain:18097] [[34561,0],0] plm:base:setup_job [a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm [a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm creating map [a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm add new daemon [[34561,0],1] [a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm assigning new daemon [[34561,0],1] to node a00554.science.domain [a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm add new daemon [[34561,0],2] [a00551.science.domain:18097] [[34561,0],0] plm:base:setup_vm assigning new daemon [[34561,0],2] to node a00553.science.domain [a00551.science.domain:18097] [[34561,0],0] plm:tm: launchin
Re: [OMPI users] Unable to mpirun from within torque
Hi Gilles, There you go: [zbh251@a00551 ~]$ cat $PBS_NODEFILE a00551.science.domain a00554.science.domain a00553.science.domain [zbh251@a00551 ~]$ mpirun --mca ess_base_verbose 10 --mca plm_base_verbose 10 --mca ras_base_verbose 10 hostname [a00551.science.domain:18889] mca: base: components_register: registering framework ess components [a00551.science.domain:18889] mca: base: components_register: found loaded component pmi [a00551.science.domain:18889] mca: base: components_register: component pmi has no register or open function [a00551.science.domain:18889] mca: base: components_register: found loaded component tool [a00551.science.domain:18889] mca: base: components_register: component tool has no register or open function [a00551.science.domain:18889] mca: base: components_register: found loaded component env [a00551.science.domain:18889] mca: base: components_register: component env has no register or open function [a00551.science.domain:18889] mca: base: components_register: found loaded component hnp [a00551.science.domain:18889] mca: base: components_register: component hnp has no register or open function [a00551.science.domain:18889] mca: base: components_register: found loaded component singleton [a00551.science.domain:18889] mca: base: components_register: component singleton register function successful [a00551.science.domain:18889] mca: base: components_register: found loaded component slurm [a00551.science.domain:18889] mca: base: components_register: component slurm has no register or open function [a00551.science.domain:18889] mca: base: components_register: found loaded component tm [a00551.science.domain:18889] mca: base: components_register: component tm has no register or open function [a00551.science.domain:18889] mca: base: components_open: opening ess components [a00551.science.domain:18889] mca: base: components_open: found loaded component pmi [a00551.science.domain:18889] mca: base: components_open: component pmi open function successful [a00551.science.domain:18889] mca: base: components_open: found loaded component tool [a00551.science.domain:18889] mca: base: components_open: component tool open function successful [a00551.science.domain:18889] mca: base: components_open: found loaded component env [a00551.science.domain:18889] mca: base: components_open: component env open function successful [a00551.science.domain:18889] mca: base: components_open: found loaded component hnp [a00551.science.domain:18889] mca: base: components_open: component hnp open function successful [a00551.science.domain:18889] mca: base: components_open: found loaded component singleton [a00551.science.domain:18889] mca: base: components_open: component singleton open function successful [a00551.science.domain:18889] mca: base: components_open: found loaded component slurm [a00551.science.domain:18889] mca: base: components_open: component slurm open function successful [a00551.science.domain:18889] mca: base: components_open: found loaded component tm [a00551.science.domain:18889] mca: base: components_open: component tm open function successful [a00551.science.domain:18889] mca:base:select: Auto-selecting ess components [a00551.science.domain:18889] mca:base:select:( ess) Querying component [pmi] [a00551.science.domain:18889] mca:base:select:( ess) Querying component [tool] [a00551.science.domain:18889] mca:base:select:( ess) Querying component [env] [a00551.science.domain:18889] mca:base:select:( ess) Querying component [hnp] [a00551.science.domain:18889] mca:base:select:( ess) 
Query of component [hnp] set priority to 100 [a00551.science.domain:18889] mca:base:select:( ess) Querying component [singleton] [a00551.science.domain:18889] mca:base:select:( ess) Querying component [slurm] [a00551.science.domain:18889] mca:base:select:( ess) Querying component [tm] [a00551.science.domain:18889] mca:base:select:( ess) Selected component [hnp] [a00551.science.domain:18889] mca: base: close: component pmi closed [a00551.science.domain:18889] mca: base: close: unloading component pmi [a00551.science.domain:18889] mca: base: close: component tool closed [a00551.science.domain:18889] mca: base: close: unloading component tool [a00551.science.domain:18889] mca: base: close: component env closed [a00551.science.domain:18889] mca: base: close: unloading component env [a00551.science.domain:18889] mca: base: close: component singleton closed [a00551.science.domain:18889] mca: base: close: unloading component singleton [a00551.science.domain:18889] mca: base: close: component slurm closed [a00551.science.domain:18889] mca: base: close: unloading component slurm [a00551.science.domain:18889] mca: base: close: component tm closed [a00551.science.domain:18889] mca: base: close: unloading component tm [a00551.science.domain:18889] mca: base: components_register: registering framework plm components [a00551.science.domain:18889] mca: base: components_register:
Re: [OMPI users] Unable to mpirun from within torque
Oswin, So it seems that Open MPI think it tm_spawn orted on the remote nodes, but orted ends up running on the same node than mpirun. On your compute nodes, can you ldd /.../lib/openmpi/mca_plm_tm.so And confirm it is linked with the same libtorque.so that was built/provided with torque ? Check path and md5sum on both compute nodes and the node on which you built torque (ideally from both build and install dir) Cheers, Gilles Oswin Krause wrote: >Hi Gilles, > >There you go: > >[zbh251@a00551 ~]$ cat $PBS_NODEFILE >a00551.science.domain >a00554.science.domain >a00553.science.domain >[zbh251@a00551 ~]$ mpirun --mca ess_base_verbose 10 --mca >plm_base_verbose 10 --mca ras_base_verbose 10 hostname >[a00551.science.domain:18889] mca: base: components_register: >registering framework ess components >[a00551.science.domain:18889] mca: base: components_register: found >loaded component pmi >[a00551.science.domain:18889] mca: base: components_register: component >pmi has no register or open function >[a00551.science.domain:18889] mca: base: components_register: found >loaded component tool >[a00551.science.domain:18889] mca: base: components_register: component >tool has no register or open function >[a00551.science.domain:18889] mca: base: components_register: found >loaded component env >[a00551.science.domain:18889] mca: base: components_register: component >env has no register or open function >[a00551.science.domain:18889] mca: base: components_register: found >loaded component hnp >[a00551.science.domain:18889] mca: base: components_register: component >hnp has no register or open function >[a00551.science.domain:18889] mca: base: components_register: found >loaded component singleton >[a00551.science.domain:18889] mca: base: components_register: component >singleton register function successful >[a00551.science.domain:18889] mca: base: components_register: found >loaded component slurm >[a00551.science.domain:18889] mca: base: components_register: component >slurm has no register or open function >[a00551.science.domain:18889] mca: base: components_register: found >loaded component tm >[a00551.science.domain:18889] mca: base: components_register: component >tm has no register or open function >[a00551.science.domain:18889] mca: base: components_open: opening ess >components >[a00551.science.domain:18889] mca: base: components_open: found loaded >component pmi >[a00551.science.domain:18889] mca: base: components_open: component pmi >open function successful >[a00551.science.domain:18889] mca: base: components_open: found loaded >component tool >[a00551.science.domain:18889] mca: base: components_open: component tool >open function successful >[a00551.science.domain:18889] mca: base: components_open: found loaded >component env >[a00551.science.domain:18889] mca: base: components_open: component env >open function successful >[a00551.science.domain:18889] mca: base: components_open: found loaded >component hnp >[a00551.science.domain:18889] mca: base: components_open: component hnp >open function successful >[a00551.science.domain:18889] mca: base: components_open: found loaded >component singleton >[a00551.science.domain:18889] mca: base: components_open: component >singleton open function successful >[a00551.science.domain:18889] mca: base: components_open: found loaded >component slurm >[a00551.science.domain:18889] mca: base: components_open: component >slurm open function successful >[a00551.science.domain:18889] mca: base: components_open: found loaded >component tm 
>[a00551.science.domain:18889] mca: base: components_open: component tm >open function successful >[a00551.science.domain:18889] mca:base:select: Auto-selecting ess >components >[a00551.science.domain:18889] mca:base:select:( ess) Querying component >[pmi] >[a00551.science.domain:18889] mca:base:select:( ess) Querying component >[tool] >[a00551.science.domain:18889] mca:base:select:( ess) Querying component >[env] >[a00551.science.domain:18889] mca:base:select:( ess) Querying component >[hnp] >[a00551.science.domain:18889] mca:base:select:( ess) Query of component >[hnp] set priority to 100 >[a00551.science.domain:18889] mca:base:select:( ess) Querying component >[singleton] >[a00551.science.domain:18889] mca:base:select:( ess) Querying component >[slurm] >[a00551.science.domain:18889] mca:base:select:( ess) Querying component >[tm] >[a00551.science.domain:18889] mca:base:select:( ess) Selected component >[hnp] >[a00551.science.domain:18889] mca: base: close: component pmi closed >[a00551.science.domain:18889] mca: base: close: unloading component pmi >[a00551.science.domain:18889] mca: base: close: component tool closed >[a00551.science.domain:18889] mca: base: close: unloading component tool >[a00551.science.domain:18889] mca: base: close: component env closed >[a00551.science.domain:18889] mca: base: close: unloading component env >[a00551.science.domain:18889] mca: base: close: c
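For illustration, the check Gilles describes could look like this, run on each compute node and on the host where Torque was built; the library paths here are assumptions and need to be adapted to the local install:

  # which libtorque.so does the Open MPI tm plugin resolve to?
  ldd /usr/local/lib/openmpi/mca_plm_tm.so | grep libtorque
  # compare the checksum of that library across nodes
  md5sum /usr/local/lib/libtorque.so.*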
Re: [OMPI users] Unable to mpirun from within torque
Oswin, One more thing, can you pbsdsh -v hostname before invoking mpirun ? Hopefully this should print the three hostnames Then you can ldd `which pbsdsh` And see which libtorque.so is linked with it Cheers, Gilles Oswin Krause wrote: >Hi Gilles, > >There you go: > >[zbh251@a00551 ~]$ cat $PBS_NODEFILE >a00551.science.domain >a00554.science.domain >a00553.science.domain >[zbh251@a00551 ~]$ mpirun --mca ess_base_verbose 10 --mca >plm_base_verbose 10 --mca ras_base_verbose 10 hostname >[a00551.science.domain:18889] mca: base: components_register: >registering framework ess components >[a00551.science.domain:18889] mca: base: components_register: found >loaded component pmi >[a00551.science.domain:18889] mca: base: components_register: component >pmi has no register or open function >[a00551.science.domain:18889] mca: base: components_register: found >loaded component tool >[a00551.science.domain:18889] mca: base: components_register: component >tool has no register or open function >[a00551.science.domain:18889] mca: base: components_register: found >loaded component env >[a00551.science.domain:18889] mca: base: components_register: component >env has no register or open function >[a00551.science.domain:18889] mca: base: components_register: found >loaded component hnp >[a00551.science.domain:18889] mca: base: components_register: component >hnp has no register or open function >[a00551.science.domain:18889] mca: base: components_register: found >loaded component singleton >[a00551.science.domain:18889] mca: base: components_register: component >singleton register function successful >[a00551.science.domain:18889] mca: base: components_register: found >loaded component slurm >[a00551.science.domain:18889] mca: base: components_register: component >slurm has no register or open function >[a00551.science.domain:18889] mca: base: components_register: found >loaded component tm >[a00551.science.domain:18889] mca: base: components_register: component >tm has no register or open function >[a00551.science.domain:18889] mca: base: components_open: opening ess >components >[a00551.science.domain:18889] mca: base: components_open: found loaded >component pmi >[a00551.science.domain:18889] mca: base: components_open: component pmi >open function successful >[a00551.science.domain:18889] mca: base: components_open: found loaded >component tool >[a00551.science.domain:18889] mca: base: components_open: component tool >open function successful >[a00551.science.domain:18889] mca: base: components_open: found loaded >component env >[a00551.science.domain:18889] mca: base: components_open: component env >open function successful >[a00551.science.domain:18889] mca: base: components_open: found loaded >component hnp >[a00551.science.domain:18889] mca: base: components_open: component hnp >open function successful >[a00551.science.domain:18889] mca: base: components_open: found loaded >component singleton >[a00551.science.domain:18889] mca: base: components_open: component >singleton open function successful >[a00551.science.domain:18889] mca: base: components_open: found loaded >component slurm >[a00551.science.domain:18889] mca: base: components_open: component >slurm open function successful >[a00551.science.domain:18889] mca: base: components_open: found loaded >component tm >[a00551.science.domain:18889] mca: base: components_open: component tm >open function successful >[a00551.science.domain:18889] mca:base:select: Auto-selecting ess >components >[a00551.science.domain:18889] mca:base:select:( ess) 
Querying component >[pmi] >[a00551.science.domain:18889] mca:base:select:( ess) Querying component >[tool] >[a00551.science.domain:18889] mca:base:select:( ess) Querying component >[env] >[a00551.science.domain:18889] mca:base:select:( ess) Querying component >[hnp] >[a00551.science.domain:18889] mca:base:select:( ess) Query of component >[hnp] set priority to 100 >[a00551.science.domain:18889] mca:base:select:( ess) Querying component >[singleton] >[a00551.science.domain:18889] mca:base:select:( ess) Querying component >[slurm] >[a00551.science.domain:18889] mca:base:select:( ess) Querying component >[tm] >[a00551.science.domain:18889] mca:base:select:( ess) Selected component >[hnp] >[a00551.science.domain:18889] mca: base: close: component pmi closed >[a00551.science.domain:18889] mca: base: close: unloading component pmi >[a00551.science.domain:18889] mca: base: close: component tool closed >[a00551.science.domain:18889] mca: base: close: unloading component tool >[a00551.science.domain:18889] mca: base: close: component env closed >[a00551.science.domain:18889] mca: base: close: unloading component env >[a00551.science.domain:18889] mca: base: close: component singleton >closed >[a00551.science.domain:18889] mca: base: close: unloading component >singleton >[a00551.science.domain:18889] mca: base: close: component slurm closed >[a00551.science.domain:18889] mca:
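In other words, from inside the job and before mpirun, the two checks Gilles lists are simply (the grep is only added here to trim the ldd output):

  pbsdsh -v hostname              # should print the three allocated hostnames
  ldd `which pbsdsh` | grep libtorque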
Re: [OMPI users] Unable to mpirun from within torque
I’m pruning this email thread so I can actually read the blasted thing :-)

Guys: you are off in the wilderness chasing ghosts! Please stop.

When I say that Torque uses an “ordered” file, I am _not_ saying that all the host entries of the same name have to be listed consecutively. I am saying that the _position_ of each entry has meaning, and you cannot just change it.

I have honestly totally lost the root of this discussion in all the white noise about the PBS_NODEFILE. Can we reboot?
Ralph

> On Sep 8, 2016, at 5:26 AM, Gilles Gouaillardet wrote:
>
> Oswin,
>
> One more thing, can you
>
> pbsdsh -v hostname
>
> before invoking mpirun ?
> Hopefully this should print the three hostnames
>
> Then you can
> ldd `which pbsdsh`
> And see which libtorque.so is linked with it
>
> Cheers,
>
> Gilles
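One way to picture the positional meaning Ralph describes: number the lines of the nodefile, and each line's position is the launch ID Torque hands out, no matter how often a hostname repeats. Using the PBS_NODEFILE reported earlier in the thread:

  $ cat -n $PBS_NODEFILE
       1  a00551.science.domain
       2  a00553.science.domain
       3  a00551.science.domain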
Re: [OMPI users] Unable to mpirun from within torque
Hi,

Okay, let's reboot, even though Gilles' last mail was onto something.

The problem is that I failed to start programs with mpirun when more than one node was involved. I mentioned that it is likely some configuration problem with my server, especially authentication (we have some kerberos nightmare going on here). We then tried to find out where mpirun got into the wrong corner, got sidetracked by the nodefile, and I reconfigured everything so that we can at least factor out any NUMA stuff.

Now I see that I am on the wrong mailing list and it is likely a problem with pbs/torque, as

pbsdsh -v hostname
pbsdsh(): spawned task 0
pbsdsh(): spawned task 1
pbsdsh(): spawned task 2
pbsdsh(): spawn event returned: 0 (3 spawns and 0 obits outstanding)
pbsdsh(): sending obit for task 0
a00551.science.domain
pbsdsh(): spawn event returned: 1 (2 spawns and 1 obits outstanding)
pbsdsh(): sending obit for task 1
a00551.science.domain
pbsdsh(): spawn event returned: 2 (1 spawns and 2 obits outstanding)
pbsdsh(): sending obit for task 2
a00551.science.domain
pbsdsh(): obit event returned: 0 (0 spawns and 3 obits outstanding)
pbsdsh(): task 0 exit status 0
pbsdsh(): obit event returned: 1 (0 spawns and 2 obits outstanding)
pbsdsh(): task 1 exit status 0
pbsdsh(): obit event returned: 2 (0 spawns and 1 obits outstanding)
pbsdsh(): task 2 exit status 0

Best,
Oswin

On 2016-09-08 15:42, r...@open-mpi.org wrote:
I’m pruning this email thread so I can actually read the blasted thing :-)
Guys: you are off in the wilderness chasing ghosts! Please stop.
When I say that Torque uses an “ordered” file, I am _not_ saying that all the host entries of the same name have to be listed consecutively. I am saying that the _position_ of each entry has meaning, and you cannot just change it.
I have honestly totally lost the root of this discussion in all the white noise about the PBS_NODEFILE. Can we reboot?
Ralph

On Sep 8, 2016, at 5:26 AM, Gilles Gouaillardet wrote:
Oswin,
One more thing, can you
pbsdsh -v hostname
before invoking mpirun?
Hopefully this should print the three hostnames
Then you can
ldd `which pbsdsh`
And see which libtorque.so is linked with it
Cheers,
Gilles