Someone has done some work there since I last did, but I can see the issue.
Torque indeed always provides an ordered file - the only way you can get an
unordered one is for someone to edit it, and that is forbidden - i.e., you get
what you deserve because you are messing around with a system-def
If you are correctly analyzing things, then there would be an issue in the
code. When we get an allocation from a resource manager, we set a flag
indicating that it is “gospel” - i.e., that we do not directly sense the number
of cores on a node and set the #slots equal to that value. Instead, we
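(A rough sketch of what that counting amounts to, assuming $PBS_NODEFILE is
set by Torque inside the job: the slot count assigned to each node
corresponds to how often its hostname appears in the file, which can be
eyeballed with

sort $PBS_NODEFILE | uniq -c

i.e. one output line per node, with the count being the number of slots.)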
Hi,
Thanks for all the hints. The only issue is: this is the file generated by
Torque. Torque - or at least the Torque 4.2 provided by my Red Hat
version - gives me an unordered file.
Should I rebuild torque?
Best,
Oswin
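(One quick way to check whether the file is really "unordered" in the sense
of same-host entries being interleaved with other hosts - just a sketch:

uniq $PBS_NODEFILE | sort | uniq -d

prints a hostname only if its entries are not consecutive in the file; no
output means every host's entries are grouped together.)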
I am currently rebuilding the package with --enable-debug.
Ralph,
I am not sure I am reading you correctly, so let me clarify.
I did not hack $PBS_NODEFILE for fun or profit; I was simply trying to
reproduce an issue I could not reproduce otherwise.
/* my job submitted with -l nodes=3:ppn=1 does not start if there are only
two nodes available, wher
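(For reference, a minimal way to get such an allocation interactively with
stock Torque, assuming qsub is available on the submit host:

qsub -I -l nodes=3:ppn=1

which drops you into a shell on the first allocated node with $PBS_NODEFILE
already populated.)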
Oswin,
that might be off topic and/or premature ...
PBS Pro has been made free (and opensource too) and is available at
http://www.pbspro.org/
this is something you might be interested in (unless you are using
torque because of the MOAB scheduler),
and it might be more friendly (e.g. alwa
Hi Gilles, Hi Ralph,
I have just rebuilt Open MPI. Quite a lot more information now. As I said,
I did not tinker with the PBS_NODEFILE. I think the issue might be NUMA
here. I can try to go through the process and reconfigure to non-numa
and see whether this works. The issue might be that the no
Hi,
I reconfigured to have only one physical node. Still no success, but the
nodefile now looks better. I still get the errors:
[a00551.science.domain:18021] [[34768,0],1] bind() failed on error Address already in use (98)
[a00551.science.domain:18021] [[34768,0],1] ORTE_ERROR_LOG: Error in
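(The log is truncated here, but for an "Address already in use" failure a
generic check on the affected node is something like

ss -tlnp          # or, on older systems: netstat -tlnp

to see which process already holds the port orted tried to bind; the actual
port number is not shown in the excerpt, so this is only a diagnostic sketch.)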
Oswin,
can you please run again (one task per physical node) with
mpirun --mca ess_base_verbose 10 --mca plm_base_verbose 10 --mca
ras_base_verbose 10 hostname
Cheers,
Gilles
On 9/8/2016 6:42 PM, Oswin Krause wrote:
Hi,
i reconfigured to only have one physical node. Still no success,
Hi Gilles,
There you go:
[zbh251@a00551 ~]$ cat $PBS_NODEFILE
a00551.science.domain
a00554.science.domain
a00553.science.domain
[zbh251@a00551 ~]$ mpirun --mca ess_base_verbose 10 --mca
plm_base_verbose 10 --mca ras_base_verbose 10 hostname
[a00551.science.domain:18889] mca: base: components_re
Oswin,
So it seems that Open MPI thinks it tm_spawn'ed orted on the remote nodes, but
orted ends up running on the same node as mpirun.
On your compute nodes, can you
ldd /.../lib/openmpi/mca_plm_tm.so
and confirm it is linked with the same libtorque.so that was built/provided
with Torque?
Chec
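(As a concrete sketch, on each compute node something like

ldd /.../lib/openmpi/mca_plm_tm.so | grep libtorque

should show a single libtorque.so whose path matches the one Torque itself
was built/installed with; the /... prefix stands in for the actual Open MPI
install prefix on your system.)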
Oswin,
One more thing, can you
pbsdsh -v hostname
before invoking mpirun?
Hopefully this should print the three hostnames
Then you can
ldd `which pbsdsh`
And see which libtorque.so is linked with it
Cheers,
Gilles
Oswin Krause wrote:
>Hi Gilles,
>
>There you go:
>
>[zbh251@a00551 ~]$ cat $
I’m pruning this email thread so I can actually read the blasted thing :-)
Guys: you are off in the wilderness chasing ghosts! Please stop.
When I say that Torque uses an “ordered” file, I am _not_ saying that all the
host entries of the same name have to be listed consecutively. I am saying tha
Hi,
Okay, let's reboot, even though Gilles' last mail was onto something.
The problem is that I failed to start programs with mpirun when more
than one node was involved. I mentioned that it is likely some
configuration problem with my server, especially authentication (we
have some kerberos ni