Hi Gilles,
There you go:
[zbh251@a00551 ~]$ cat $PBS_NODEFILE
a00551.science.domain
a00554.science.domain
a00553.science.domain
[zbh251@a00551 ~]$ mpirun --mca ess_base_verbose 10 --mca
plm_base_verbose 10 --mca ras_base_verbose 10 hostname
[a00551.science.domain:18889] mca: base: components_re
Oswin,
can you please run again (one task per physical node) with
mpirun --mca ess_base_verbose 10 --mca plm_base_verbose 10 --mca
ras_base_verbose 10 hostname
Cheers,
Gilles
On 9/8/2016 6:42 PM, Oswin Krause wrote:
Hi,
I reconfigured to have only one physical node. Still no success, but the
nodefile now looks better. I still get these errors:
[a00551.science.domain:18021] [[34768,0],1] bind() failed on error
Address already in use (98)
[a00551.science.domain:18021] [[34768,0],1] ORTE_ERROR_LOG: Error in
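For reference, error 98 is EADDRINUSE: a daemon could not bind its TCP
port because another process already holds it, which would fit two ORTE
daemons landing on the same physical host. A quick check one might run
on the node while the job is failing (a troubleshooting sketch, not from
the thread):

# list listening TCP sockets with their owning processes, to see
# what already holds the address the daemon tried to bind
ss -tlnp
# or, on systems without ss
netstat -tlnp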
Hi Gilles, Hi Ralph,
I have just rebuilt Open MPI; there is quite a lot more information now. As
I said, I did not tinker with the PBS_NODEFILE. I think the issue might be
NUMA here. I can try to go through the process and reconfigure to non-NUMA
and see whether this works. The issue might be that the no
Oswin,
that might be off topic and/or premature ...
PBS Pro has been made free (and open source too) and is available at
http://www.pbspro.org/
This is something you might be interested in (unless you are using
Torque because of the MOAB scheduler), and it might be more friendly
(e.g. alwa
Ralph,
I am not sure I am reading you correctly, so let me clarify.
I did not hack $PBS_NODEFILE for fun or profit; I was simply trying to
reproduce an issue I could not reproduce otherwise.
/* my job submitted with -l nodes=3:ppn=1 does not start if there are only
two nodes available, wher
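For context, such a job would be submitted along these lines (the script
name is illustrative, not from the thread):

# request 3 nodes with 1 process slot each; with only two nodes
# available, the scheduler should hold the job in the queue
qsub -l nodes=3:ppn=1 myjob.sh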
If you are correctly analyzing things, then there would be an issue in the
code. When we get an allocation from a resource manager, we set a flag
indicating that it is “gospel” - i.e., that we do not directly sense the number
of cores on a node and set the #slots equal to that value. Instead, we
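One way to see the node and slot counts Open MPI actually derived from
the allocation is to ask mpirun to print them before launching anything
(assuming a version that supports --display-allocation):

# print the parsed allocation (nodes and slots), then run a trivial task
mpirun --display-allocation hostname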
Oswin,
unfortunately some important info is missing.
I guess the root cause is that Open MPI was not configured with
--enable-debug.
Could you please update your torque script and simply add the following
snippet before invoking mpirun:
echo PBS_NODEFILE
cat $PBS_NODEFILE
echo ---
as I wrote
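For illustration, the snippet would sit in the job script like this (the
resource request and mpirun command are assumptions, not from the thread):

#!/bin/bash
#PBS -l nodes=3:ppn=1
# show which nodefile torque handed us, and its contents
echo PBS_NODEFILE
cat $PBS_NODEFILE
echo ---
mpirun hostname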
Oswin,
Does the torque library show up if you run
$ ldd mpirun
That would indicate that Torque support is compiled in.
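A concrete way to check from the command line (the grep pattern is an
assumption; the exact library name varies across Torque installs):

# if mpirun was built with torque/tm support, the torque library
# should appear among its shared-library dependencies
ldd $(which mpirun) | grep -i torque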
Also, what happens if you use the same hostfile, or some hostfile, as
an explicit argument when you run mpirun from within the torque job?
-- bennet
On Wed, Sep 7, 2016 at
Hi,
Sorry, I forgot:
The node allocation seems to be correct as the nodes are NUMA. The node
allocation in torque is
a00551.science.domain-0
a00551.science.domain-1
a00553.science.domain-0
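For reference, the -0/-1 suffixes are the NUMA board indices torque
appends to each host. A minimal sketch for collapsing the list back to
physical hostnames (assuming the suffix is always a trailing dash plus
digits):

# strip the per-NUMA-node suffix and deduplicate
sed 's/-[0-9]*$//' $PBS_NODEFILE | sort -u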
On 2016-09-07 14:41, Gilles Gouaillardet wrote:
Hi Gilles,
Thanks for the hint with the machinefile. I know it is not equivalent,
and I do not intend to use that approach. I just wanted to know whether
I could start the program successfully at all.
Outside torque (4.2), rsh seems to be used, which works fine, querying a
password if no kerberos
Hi,
Which version of Open MPI are you running?
I noted that though you are asking for three nodes and one task per node,
you have been allocated only 2 nodes.
I do not know if this is related to this issue.
Note that if you use the machinefile, a00551 has two slots (since it
appears twice in the machinefile).
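For illustration, listing a host twice and declaring slots=2 are
equivalent in an Open MPI machinefile (a sketch; the filename and -np
value are assumptions, not from the thread):

# hosts.txt, variant A: one line per slot
a00551.science.domain
a00551.science.domain
a00553.science.domain
# hosts.txt, variant B: explicit slot counts
a00551.science.domain slots=2
a00553.science.domain slots=1
# either variant can be passed as:
mpirun -machinefile hosts.txt -np 3 hostname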