Hi Gilles, Hi Ralph,

I have just rebuilt Open MPI, and there is now quite a lot more information. As I said, I did not tinker with the PBS_NODEFILE. I think the issue might be NUMA here; I can try to go through the process again and reconfigure the nodes as non-NUMA to see whether that works. My suspicion is that the node allocation looks like this:
a00551.science.domain-0
a00552.science.domain-0
a00551.science.domain-1

and the trailing "-0"/"-1" part then gets shortened away, which leads to the issue. Not sure whether this makes sense, but that is my explanation.
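To check this, a quick comparison of the raw allocation with the expanded nodefile should show whether those suffixes are the culprit (just a sketch; the exec_host attribute and its node/index format are assumptions about this Torque setup):

  qstat -f $PBS_JOBID | grep -A1 exec_host   # raw allocation with the per-node index suffixes
  cat $PBS_NODEFILE                          # the expanded file, one line per task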
Here is the output:

$PBS_NODEFILE
/var/lib/torque/aux//285.a00552.science.domain

PBS_NODEFILE
a00551.science.domain
a00553.science.domain
a00551.science.domain
---------
[a00551.science.domain:16986] mca: base: components_register: registering framework plm components
[a00551.science.domain:16986] mca: base: components_register: found loaded component isolated
[a00551.science.domain:16986] mca: base: components_register: component isolated has no register or open function
[a00551.science.domain:16986] mca: base: components_register: found loaded component rsh
[a00551.science.domain:16986] mca: base: components_register: component rsh register function successful
[a00551.science.domain:16986] mca: base: components_register: found loaded component slurm
[a00551.science.domain:16986] mca: base: components_register: component slurm register function successful
[a00551.science.domain:16986] mca: base: components_register: found loaded component tm
[a00551.science.domain:16986] mca: base: components_register: component tm register function successful
[a00551.science.domain:16986] mca: base: components_open: opening plm components
[a00551.science.domain:16986] mca: base: components_open: found loaded component isolated
[a00551.science.domain:16986] mca: base: components_open: component isolated open function successful
[a00551.science.domain:16986] mca: base: components_open: found loaded component rsh
[a00551.science.domain:16986] mca: base: components_open: component rsh open function successful
[a00551.science.domain:16986] mca: base: components_open: found loaded component slurm
[a00551.science.domain:16986] mca: base: components_open: component slurm open function successful
[a00551.science.domain:16986] mca: base: components_open: found loaded component tm
[a00551.science.domain:16986] mca: base: components_open: component tm open function successful
[a00551.science.domain:16986] mca:base:select: Auto-selecting plm components
[a00551.science.domain:16986] mca:base:select:( plm) Querying component [isolated]
[a00551.science.domain:16986] mca:base:select:( plm) Query of component [isolated] set priority to 0
[a00551.science.domain:16986] mca:base:select:( plm) Querying component [rsh]
[a00551.science.domain:16986] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[a00551.science.domain:16986] mca:base:select:( plm) Query of component [rsh] set priority to 10
[a00551.science.domain:16986] mca:base:select:( plm) Querying component [slurm]
[a00551.science.domain:16986] mca:base:select:( plm) Querying component [tm]
[a00551.science.domain:16986] mca:base:select:( plm) Query of component [tm] set priority to 75
[a00551.science.domain:16986] mca:base:select:( plm) Selected component [tm]
[a00551.science.domain:16986] mca: base: close: component isolated closed
[a00551.science.domain:16986] mca: base: close: unloading component isolated
[a00551.science.domain:16986] mca: base: close: component rsh closed
[a00551.science.domain:16986] mca: base: close: unloading component rsh
[a00551.science.domain:16986] mca: base: close: component slurm closed
[a00551.science.domain:16986] mca: base: close: unloading component slurm
[a00551.science.domain:16986] plm:base:set_hnp_name: initial bias 16986 nodename hash 2226275586
[a00551.science.domain:16986] plm:base:set_hnp_name: final jobfam 33770
[a00551.science.domain:16986] [[33770,0],0] plm:base:receive start comm
[a00551.science.domain:16986] [[33770,0],0] plm:base:setup_job
[a00551.science.domain:16986] [[33770,0],0] plm:base:setup_vm
[a00551.science.domain:16986] [[33770,0],0] plm:base:setup_vm creating map
[a00551.science.domain:16986] [[33770,0],0] plm:base:setup_vm add new daemon [[33770,0],1]
[a00551.science.domain:16986] [[33770,0],0] plm:base:setup_vm assigning new daemon [[33770,0],1] to node a00553.science.domain
[a00551.science.domain:16986] [[33770,0],0] plm:tm: launching vm
[a00551.science.domain:16986] [[33770,0],0] plm:tm: final top-level argv: orted --hnp-topo-sig 2N:2S:2L3:20L2:20L1:20C:40H:x86_64 -mca ess tm -mca ess_base_jobid 2213150720 -mca ess_base_vpid <template> -mca ess_base_num_procs 2 -mca orte_hnp_uri 2213150720.0;usock;tcp://130.226.12.194:53397;tcp6://[fe80::225:90ff:feeb:f6d5]:42821 --mca plm_base_verbose 10
[a00551.science.domain:16986] [[33770,0],0] plm:tm: launching on node a00553.science.domain
[a00551.science.domain:16986] [[33770,0],0] plm:tm: executing: orted --hnp-topo-sig 2N:2S:2L3:20L2:20L1:20C:40H:x86_64 -mca ess tm -mca ess_base_jobid 2213150720 -mca ess_base_vpid 1 -mca ess_base_num_procs 2 -mca orte_hnp_uri 2213150720.0;usock;tcp://130.226.12.194:53397;tcp6://[fe80::225:90ff:feeb:f6d5]:42821 --mca plm_base_verbose 10
[a00551.science.domain:16986] [[33770,0],0] plm:tm:launch: finished spawning orteds
[a00551.science.domain:16986] [[33770,0],0] plm:base:orted_report_launch from daemon [[33770,0],1]
[a00551.science.domain:16986] [[33770,0],0] plm:base:orted_report_launch from daemon [[33770,0],1] on node a00551
[a00551.science.domain:16986] [[33770,0],0] RECEIVED TOPOLOGY FROM NODE a00551
[a00551.science.domain:16986] [[33770,0],0] ADDING TOPOLOGY PER USER REQUEST TO NODE a00553.science.domain
[a00551.science.domain:16986] [[33770,0],0] plm:base:orted_report_launch completed for daemon [[33770,0],1] at contact 2213150720.1;tcp://130.226.12.194:38025;tcp6://[fe80::225:90ff:feeb:f6d5]:39080
[a00551.science.domain:16986] [[33770,0],0] plm:base:orted_report_launch recvd 2 of 2 reported daemons
[a00551.science.domain:16986] [[33770,0],0] plm:base:setting topo to that from node a00553.science.domain
Data for JOB [33770,1] offset 0

======================== JOB MAP ========================

Data for node: a00551  Num slots: 2  Max slots: 0  Num procs: 2
  Process OMPI jobid: [33770,1] App: 0 Process rank: 0
    Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
  Process OMPI jobid: [33770,1] App: 0 Process rank: 1
    Bound: socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
Data for node: a00553.science.domain  Num slots: 1  Max slots: 0  Num procs: 1
  Process OMPI jobid: [33770,1] App: 0 Process rank: 2
    Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
=============================================================

[a00551.science.domain:16986] [[33770,0],0] complete_setup on job [33770,1]
[a00551.science.domain:16986] [[33770,0],0] plm:base:launch_apps for job [33770,1]
[a00551.science.domain:16986] [[33770,0],0] plm:base:receive processing msg
[a00551.science.domain:16986] [[33770,0],0] plm:base:receive update proc state command from [[33770,0],1]
[a00551.science.domain:16986] [[33770,0],0] plm:base:receive got update_proc_state for job [33770,1]
[a00551.science.domain:16986] [[33770,0],0] plm:base:receive got update_proc_state for vpid 2 state RUNNING exit_code 0
[a00551.science.domain:16986] [[33770,0],0] plm:base:receive done processing commands
[1,0]<stdout>:a00551.science.domain
[1,2]<stdout>:a00551.science.domain
[a00551.science.domain:16986] [[33770,0],0] plm:base:receive processing msg
[a00551.science.domain:16986] [[33770,0],0] plm:base:receive update proc state command from [[33770,0],1]
[a00551.science.domain:16986] [[33770,0],0] plm:base:receive got update_proc_state for job [33770,1]
[a00551.science.domain:16986] [[33770,0],0] plm:base:receive got update_proc_state for vpid 2 state NORMALLY TERMINATED exit_code 0
[a00551.science.domain:16986] [[33770,0],0] plm:base:receive done processing commands
[1,1]<stdout>:a00551.science.domain
[a00551.science.domain:16986] [[33770,0],0] plm:base:launch wiring up iof for job [33770,1]
[a00551.science.domain:16986] [[33770,0],0] plm:base:launch job [33770,1] is not a dynamic spawn
[a00551.science.domain:16986] [[33770,0],0] plm:base:orted_cmd sending orted_exit commands
[a00551.science.domain:16986] [[33770,0],0] plm:base:receive stop comm
[a00551.science.domain:16986] mca: base: close: component tm closed
[a00551.science.domain:16986] mca: base: close: unloading component tm

On 2016-09-08 10:18, Gilles Gouaillardet wrote:
Ralph,

I am not sure I am reading you correctly, so let me clarify. I did not hack $PBS_NODEFILE for fun nor profit; I was simply trying to reproduce an issue I could not reproduce otherwise.
/* my job submitted with -l nodes=3:ppn=1 does not start if there are only two nodes available, whereas the same job from the end user does start on two nodes */

Thanks for the explanation of the Torque internals. My hack was incomplete and not a valid one, I do acknowledge that.

I re-read the email that started this thread and found the information I was looking for:

echo $PBS_NODEFILE
/var/lib/torque/aux//278.a00552.science.domain
cat $PBS_NODEFILE
a00551.science.domain
a00553.science.domain
a00551.science.domain

So, assuming the end user did not edit his $PBS_NODEFILE, and Torque is correctly configured and not busted, then

"Torque indeed always provides an ordered file - the only way you can get an unordered one is for someone to edit it"

might be updated to "Torque used to always provide an ordered file, but recent versions might not do that."

Makes sense?

Cheers,
Gilles

On 9/8/2016 4:57 PM, r...@open-mpi.org wrote:

Someone has done some work there since I last did, but I can see the issue. Torque indeed always provides an ordered file - the only way you can get an unordered one is for someone to edit it, and that is forbidden - i.e., you get what you deserve because you are messing around with a system-defined file :-)

The problem is that Torque internally assigns a "launch ID", which is just the integer position of the nodename in the PBS_NODEFILE. So if you modify that position, then we get the wrong index - and everything goes down the drain from there. In your example, n1.cluster changed index from 3 to 2 because of your edit. Torque thinks that index 2 is just another reference to n0.cluster, and so we merrily launch a daemon onto the wrong node.

They have a good reason for doing things this way. It allows you to launch a process against each launch ID, and the pattern will reflect the original qsub request in what we would call a map-by slot round-robin mode. This maximizes the use of shared memory, and is expected to provide good performance for a range of apps.

Lesson to be learned: never, ever muddle around with a system-generated file. If you want to modify where things go, then use one or more of the mpirun options to do so. We give you lots and lots of knobs for just that reason.

On Sep 7, 2016, at 10:53 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

Ralph,

there might be an issue within Open MPI.

On the cluster I used, hostname returns the FQDN, and $PBS_NODEFILE uses the FQDN too. My $PBS_NODEFILE has one line per task, and it is ordered, e.g.
n0.cluster
n0.cluster
n1.cluster
n1.cluster

In my Torque script, I rewrote the machinefile like this
n0.cluster
n1.cluster
n0.cluster
n1.cluster
and updated the PBS environment variable to point to my new file. Then I invoked
mpirun hostname
In the first case, 2 tasks run on n0 and 2 tasks run on n1; in the second case, 4 tasks run on n0 and none on n1. So I am thinking we might not support an unordered $PBS_NODEFILE.

As a reminder, the submit command was
qsub -l nodes=3:ppn=1
but for some reason I ignore, only two nodes were allocated (two slots on the first one, one on the second one), and, if I understand correctly, $PBS_NODEFILE was not ordered (e.g. n0, n1, n0 and *not* n0, n0, n1).
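As a minimal illustration of the indexing Ralph describes above (assuming the launch ID really is just the 0-based line position in $PBS_NODEFILE), one can print the mapping a given file would produce:

  awk '{ printf "launch id %d -> %s\n", NR-1, $0 }' "$PBS_NODEFILE"

Reordering the file shifts those indices, which would be consistent with the daemon ending up on the wrong node.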
I tried to reproduce this without hacking $PBS_NODEFILE, but my jobs hang in the queue if only two nodes with 16 slots each are available and I request
-l nodes=3:ppn=1
I guess this is a different scheduler configuration, and I cannot change that.

Could you please have a look at this?

Cheers,
Gilles

On 9/7/2016 11:15 PM, r...@open-mpi.org wrote:

The usual cause of this problem is that the nodename in the machinefile is given as a00551, while Torque is assigning the node name as a00551.science.domain. Thus, mpirun thinks those are two separate nodes and winds up spawning an orted on its own node.

You might try ensuring that your machinefile is using the exact same name as provided in your allocation.

On Sep 7, 2016, at 7:06 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

Thanks for the logs.

From what I see now, it looks like a00551 is running both mpirun and orted, though it should only run mpirun, and orted should run only on a00553. I will check the code and see what could be happening here.

Btw, what is the output of
hostname
hostname -f
on a00551?

Out of curiosity, is a previous version of Open MPI (e.g. v1.10.4) installed and running correctly on your cluster?

Cheers,
Gilles

Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote:

Hi Gilles,

Thanks for the hint with the machinefile. I know it is not equivalent and I do not intend to use that approach; I just wanted to know whether I could start the program successfully at all. Outside Torque (4.2), rsh seems to be used, which works fine, querying a password if no Kerberos ticket is there.

Here is the output:

[zbh251@a00551 ~]$ mpirun -V
mpirun (Open MPI) 2.0.1
[zbh251@a00551 ~]$ ompi_info | grep ras
MCA ras: loadleveler (MCA v2.1.0, API v2.0.0, Component v2.0.1)
MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component v2.0.1)
MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v2.0.1)
MCA ras: tm (MCA v2.1.0, API v2.0.0, Component v2.0.1)
[zbh251@a00551 ~]$ mpirun --mca plm_base_verbose 10 --tag-output -display-map hostname
[a00551.science.domain:04104] mca: base: components_register: registering framework plm components
[a00551.science.domain:04104] mca: base: components_register: found loaded component isolated
[a00551.science.domain:04104] mca: base: components_register: component isolated has no register or open function
[a00551.science.domain:04104] mca: base: components_register: found loaded component rsh
[a00551.science.domain:04104] mca: base: components_register: component rsh register function successful
[a00551.science.domain:04104] mca: base: components_register: found loaded component slurm
[a00551.science.domain:04104] mca: base: components_register: component slurm register function successful
[a00551.science.domain:04104] mca: base: components_register: found loaded component tm
[a00551.science.domain:04104] mca: base: components_register: component tm register function successful
[a00551.science.domain:04104] mca: base: components_open: opening plm components
[a00551.science.domain:04104] mca: base: components_open: found loaded component isolated
[a00551.science.domain:04104] mca: base: components_open: component isolated open function successful
[a00551.science.domain:04104] mca: base: components_open: found loaded component rsh
[a00551.science.domain:04104] mca: base: components_open: component rsh open function successful
[a00551.science.domain:04104] mca: base: components_open: found loaded component slurm
[a00551.science.domain:04104] mca: base: components_open: component slurm open function successful
[a00551.science.domain:04104] mca: base: components_open: found loaded component tm
[a00551.science.domain:04104] mca: base: components_open: component tm open function successful
[a00551.science.domain:04104] mca:base:select: Auto-selecting plm components
[a00551.science.domain:04104] mca:base:select:( plm) Querying component [isolated]
[a00551.science.domain:04104] mca:base:select:( plm) Query of component [isolated] set priority to 0
[a00551.science.domain:04104] mca:base:select:( plm) Querying component [rsh]
[a00551.science.domain:04104] mca:base:select:( plm) Query of component [rsh] set priority to 10
[a00551.science.domain:04104] mca:base:select:( plm) Querying component [slurm]
[a00551.science.domain:04104] mca:base:select:( plm) Querying component [tm]
[a00551.science.domain:04104] mca:base:select:( plm) Query of component [tm] set priority to 75
[a00551.science.domain:04104] mca:base:select:( plm) Selected component [tm]
[a00551.science.domain:04104] mca: base: close: component isolated closed
[a00551.science.domain:04104] mca: base: close: unloading component isolated
[a00551.science.domain:04104] mca: base: close: component rsh closed
[a00551.science.domain:04104] mca: base: close: unloading component rsh
[a00551.science.domain:04104] mca: base: close: component slurm closed
[a00551.science.domain:04104] mca: base: close: unloading component slurm
[a00551.science.domain:04109] mca: base: components_register: registering framework plm components
[a00551.science.domain:04109] mca: base: components_register: found loaded component rsh
[a00551.science.domain:04109] mca: base: components_register: component rsh register function successful
[a00551.science.domain:04109] mca: base: components_open: opening plm components
[a00551.science.domain:04109] mca: base: components_open: found loaded component rsh
[a00551.science.domain:04109] mca: base: components_open: component rsh open function successful
[a00551.science.domain:04109] mca:base:select: Auto-selecting plm components
[a00551.science.domain:04109] mca:base:select:( plm) Querying component [rsh]
[a00551.science.domain:04109] mca:base:select:( plm) Query of component [rsh] set priority to 10
[a00551.science.domain:04109] mca:base:select:( plm) Selected component [rsh]
[a00551.science.domain:04109] [[53688,0],1] bind() failed on error Address already in use (98)
[a00551.science.domain:04109] [[53688,0],1] ORTE_ERROR_LOG: Error in file oob_usock_component.c at line 228

Data for JOB [53688,1] offset 0

======================== JOB MAP ========================

Data for node: a00551  Num slots: 2  Max slots: 0  Num procs: 2
  Process OMPI jobid: [53688,1] App: 0 Process rank: 0
    Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
  Process OMPI jobid: [53688,1] App: 0 Process rank: 1
    Bound: socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]

Data for node: a00553.science.domain  Num slots: 1  Max slots: 0  Num procs: 1
  Process OMPI jobid: [53688,1] App: 0 Process rank: 2
    Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]

=============================================================

[a00551.science.domain:04104] [[53688,0],0] complete_setup on job [53688,1]
[a00551.science.domain:04104] [[53688,0],0] plm:base:receive update proc state command from [[53688,0],1]
[a00551.science.domain:04104] [[53688,0],0] plm:base:receive got update_proc_state for job [53688,1]
[1,0]<stdout>:a00551.science.domain
[1,2]<stdout>:a00551.science.domain
[a00551.science.domain:04104] [[53688,0],0] plm:base:receive update proc state command from [[53688,0],1]
[a00551.science.domain:04104] [[53688,0],0] plm:base:receive got update_proc_state for job [53688,1]
[1,1]<stdout>:a00551.science.domain
[a00551.science.domain:04109] mca: base: close: component rsh closed
[a00551.science.domain:04109] mca: base: close: unloading component rsh
[a00551.science.domain:04104] mca: base: close: component tm closed
[a00551.science.domain:04104] mca: base: close: unloading component tm

On 2016-09-07 14:41, Gilles Gouaillardet wrote:

Hi,

Which version of Open MPI are you running?

I noted that though you are asking for three nodes and one task per node, you have been allocated 2 nodes only. I do not know if this is related to this issue.

Note that if you use the machinefile, a00551 has two slots (since it appears twice in the machinefile), but a00553 has 20 slots (since it appears once in the machinefile, the number of slots is automatically detected).

Can you run
mpirun --mca plm_base_verbose 10 ...
so we can confirm tm is used?

Before invoking mpirun, you might want to clean up the ompi directory in /tmp.

Cheers,
Gilles

Oswin Krause <oswin.kra...@ruhr-uni-bochum.de> wrote:

Hi,

I am currently trying to set up OpenMPI in Torque. OpenMPI is built with tm support. Torque is correctly assigning nodes, and I can run MPI programs on single nodes just fine.
The problem starts when processes are split between nodes. For example, I create an interactive session with Torque and start a program by

qsub -I -n -l nodes=3:ppn=1
mpirun --tag-output -display-map hostname

which leads to

[a00551.science.domain:15932] [[65415,0],1] bind() failed on error Address already in use (98)
[a00551.science.domain:15932] [[65415,0],1] ORTE_ERROR_LOG: Error in file oob_usock_component.c at line 228

Data for JOB [65415,1] offset 0

======================== JOB MAP ========================

Data for node: a00551  Num slots: 2  Max slots: 0  Num procs: 2
  Process OMPI jobid: [65415,1] App: 0 Process rank: 0
    Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
  Process OMPI jobid: [65415,1] App: 0 Process rank: 1
    Bound: socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]

Data for node: a00553.science.domain  Num slots: 1  Max slots: 0  Num procs: 1
  Process OMPI jobid: [65415,1] App: 0 Process rank: 2
    Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
=============================================================

[1,0]<stdout>:a00551.science.domain
[1,2]<stdout>:a00551.science.domain
[1,1]<stdout>:a00551.science.domain

If I log in on a00551 and start using the hostfile generated by the PBS_NODEFILE, everything works:

(from within the interactive session)
echo $PBS_NODEFILE
/var/lib/torque/aux//278.a00552.science.domain
cat $PBS_NODEFILE
a00551.science.domain
a00553.science.domain
a00551.science.domain

(from within the separate login)
mpirun --hostfile /var/lib/torque/aux//278.a00552.science.domain -np 3 --tag-output -display-map hostname

Data for JOB [65445,1] offset 0

======================== JOB MAP ========================

Data for node: a00551  Num slots: 2  Max slots: 0  Num procs: 2
  Process OMPI jobid: [65445,1] App: 0 Process rank: 0
    Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]
  Process OMPI jobid: [65445,1] App: 0 Process rank: 1
    Bound: socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]], socket 1[core 16[hwt 0-1]], socket 1[core 17[hwt 0-1]], socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]:[../../../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]

Data for node: a00553.science.domain  Num slots: 20  Max slots: 0  Num procs: 1
  Process OMPI jobid: [65445,1] App: 0 Process rank: 2
    Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]]:[BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../../../..]

=============================================================

[1,0]<stdout>:a00551.science.domain
[1,2]<stdout>:a00553.science.domain
[1,1]<stdout>:a00551.science.domain

I am kind of lost about what's going on here. Does anyone have an idea? I was seriously considering this to be a problem of the Kerberos authentication that we have to work with, but I fail to see how that should affect the sockets.
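Since the --hostfile run above auto-detects 20 slots on a00553, one possible direction for a workaround (an untested sketch on my part, not a fix for the tm issue; hostfile.pbs is just an illustrative file name) would be to rebuild a hostfile with explicit per-node slot counts from $PBS_NODEFILE and launch against that:

  sort $PBS_NODEFILE | uniq -c | awk '{ printf "%s slots=%d\n", $2, $1 }' > hostfile.pbs
  mpirun --hostfile hostfile.pbs -np 3 --tag-output -display-map hostname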
Best,
Oswin
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users