Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Joshua Baker-LePain

On Wed, 14 Mar 2012 at 5:50pm, Ralph Castain wrote


On Mar 14, 2012, at 5:44 PM, Reuti wrote:


(I was just typing when Ralph's message came in: I can confirm this. To 
avoid it, it would mean for Open MPI to collect all lines from the 
hostfile which are on the same machine. SGE creates entries for each 
queue/host pair in the machine file).


Hmmm…I can take a look at the allocator module and see why we aren't 
doing it. Would the host names be the same for the two queues?


I can't speak authoritatively like Reuti can, but here's what a hostfile
looks like on my cluster (note that all our name resolution is done via 
/etc/hosts -- there's no DNS involved):


iq103 8 lab.q@iq103 
iq103 1 test.q@iq103 
iq104 8 lab.q@iq104 
iq104 1 test.q@iq104 
opt221 2 lab.q@opt221 
opt221 1 test.q@opt221 

@Ralph: it could work if SGE would have a facility to request the 
desired queue in `qrsh -inherit ...`, because then the $TMPDIR would be 
unique for each orted again (assuming it's using different ports for 
each).


Gotcha! I suspect getting the allocator to handle this cleanly is the 
better solution, though.


If I can help (testing patches, e.g.), let me know.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Joshua Baker-LePain

On Thu, 15 Mar 2012 at 12:44am, Reuti wrote

Which version of SGE are you using? The traditional rsh startup was 
replaced by the builtin startup some time ago (although it should still 
work).


We're currently running the rather ancient 6.1u4 (due to the "If it ain't 
broke..." philosophy).  The hardware for our new queue master recently 
arrived and I'll soon be upgrading to the most recent Open Grid Scheduler 
release.  Are you saying that the upgrade with the new builtin startup 
method should avoid this problem?


Maybe this already shows the problem: there are two `qrsh -inherit` calls, as 
Open MPI thinks these are different machines (I ran with only one slot 
on each host, hence didn't catch it at first, but can reproduce it now). But 
for SGE both may end up in the same queue, overwriting the openmpi-session 
in $TMPDIR.


Although it's running: do you get all the output? If I request 4 slots and get 
one from each queue on both machines, the mpihello outputs only 3 lines: 
the "Hello World from Node 3" line is always missing.


I do seem to get all the output -- there are indeed 64 Hello World lines.

Thanks again for all the help on this.  This is one of the most productive 
exchanges I've had on a mailing list in far too long.


--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF


Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Rayson Ho
Hi Joshua,

I don't think the new built-in rsh in later versions of Grid Engine is
going to make any difference - the orted is the real starter of the
MPI tasks and should have a greater influence on the task environment.

However, it would help if you could record the nice value and resource
limits of each MPI task - you can easily do so by using a shell
wrapper like this one:


#!/bin/sh

# resource limits
ulimit -a > /tmp/mpijob.$$

# nice value
ps -eo pid,user,nice,command | grep $$

# run the real executable (assumed here to be passed as the wrapper's arguments)
"$@"

exit $?


Use mpirun to launch it as if it were the real MPI application - then
you can see if there are limits introduced by Grid Engine that are
causing issues...
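For example (a sketch only -- the names "wrapper.sh" and "mpihello", the PE
name "orte" and the 64-slot request are placeholders drawn from this thread,
not something prescribed here), the wrapper could be launched from a job
script like this:

#!/bin/sh
#$ -pe orte 64
#$ -cwd
# mpirun starts the wrapper on every slot; the wrapper records the limits
# and nice value and then runs the real binary passed as its argument
mpirun -np $NSLOTS ./wrapper.sh ./mpihello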

Rayson

=
Open Grid Scheduler / Grid Engine
http://gridscheduler.sourceforge.net/

Scalable Grid Engine Support Program
http://www.scalablelogic.com/






-- 
==
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/



Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Reuti
On 15.03.2012 at 05:22, Joshua Baker-LePain wrote:

> On Wed, 14 Mar 2012 at 5:50pm, Ralph Castain wrote
> 
>> On Mar 14, 2012, at 5:44 PM, Reuti wrote:
> 
>>> (I was just typing when Ralph's message came in: I can confirm this. To 
>>> avoid it, it would mean for Open MPI to collect all lines from the hostfile 
>>> which are on the same machine. SGE creates entries for each queue/host pair 
>>> in the machine file).
>> 
>> Hmmm…I can take a look at the allocator module and see why we aren't doing 
>> it. Would the host names be the same for the two queues?
> 
> I can't speak authoritatively like Reuti can, but here's what a hostfile
> looks like on my cluster (note that all our name resolution is done via 
> /etc/hosts -- there's no DNS involved):
> 
> iq103 8 lab.q@iq103 
> iq103 1 test.q@iq103 
> iq104 8 lab.q@iq104 
> iq104 1 test.q@iq104 
> opt221 2 lab.q@opt221 
> opt221 1 test.q@opt221 

Yes, exactly this needs to be parsed, adding up all entries therein for one 
and the same machine.

If you need it instantly, it could be put in a wrapper for start_proc_args of 
the PE (with Open MPI compiled without SGE support), so that a custom-built 
machinefile can be used. In this case the rsh (or ssh) call also needs to be 
caught.
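
A minimal sketch of that summing step, e.g. inside such a start_proc_args
wrapper (an illustration only, assuming the $PE_HOSTFILE layout shown above;
the output file name "machines" is arbitrary):

#!/bin/sh
# add up the slot counts of all queue/host lines referring to the same host
awk '{ slots[$1] += $2 } END { for (h in slots) print h, slots[h] }' \
    "$PE_HOSTFILE" > "$TMPDIR/machines"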

Often the opposite is desired in an SGE setup: tune it so that all slots 
come from one queue only.

But I still wonder whether it is possible to tune your setup in a similar way: 
allow one slot more in the high priority queue (long.q) in case it's a 
parallel job, with an RQS (assuming 8 cores with one core of oversubscription):

limit queues long.q pes * to slots=9
limit queues long.q to slots=8

while you have an additional short.q (the low priority queue) there with one 
slot. The overall limit is still set at the exechost level to 9. The PE is then 
only attached to long.q.
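
For reference, a complete resource quota set built around those two limits
might look like the following (a sketch under the assumptions above; the rule
set name is made up). It could be loaded with `qconf -Arqs <file>`:

{
   name         parallel_extra_slot
   description  NONE
   enabled      TRUE
   limit        queues long.q pes * to slots=9
   limit        queues long.q to slots=8
}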

-- Reuti

PS: In your example you also had the case of 2 slots in the low priority 
queue; what is the actual setup in your cluster?






Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Ralph Castain
Just to be clear: I take it that the first entry is the host name, and the 
second is the number of slots allocated on that host?

FWIW: I see the problem. Our parser was apparently written assuming every line 
was a unique host, so it doesn't even check to see if there is duplication. 
Easy fix - can shoot it to you today.





Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Reuti
On 15.03.2012 at 15:37, Ralph Castain wrote:

> Just to be clear: I take it that the first entry is the host name, and the 
> second is the number of slots allocated on that host?

This is correct.


> FWIW: I see the problem. Our parser was apparently written assuming every 
> line was a unique host, so it doesn't even check to see if there is 
> duplication. Easy fix - can shoot it to you today.

But even with the fix the nice value will be the same for all processes forked 
there: they will all have either the nice value of the low priority queue or 
that of the high priority queue.

-- Reuti






Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Ralph Castain

On Mar 15, 2012, at 8:46 AM, Reuti wrote:

> Am 15.03.2012 um 15:37 schrieb Ralph Castain:
> 
>> Just to be clear: I take it that the first entry is the host name, and the 
>> second is the number of slots allocated on that host?
> 
> This is correct.
> 
> 
>> FWIW: I see the problem. Our parser was apparently written assuming every 
>> line was a unique host, so it doesn't even check to see if there is 
>> duplication. Easy fix - can shoot it to you today.
> 
> But even with the fix the nice value will be the same for all processes 
> forked there. Either all have the nice value of his low priority queue or the 
> high priority queue.

Agreed - nothing I can do about that, though. We only do the one qrsh call, so 
the daemons are going to fall into a single queue, and so will all their 
children. In this scenario, it isn't clear to me (from this discussion) that I 
can control which queue gets used - can I? Should I?






Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Reuti
On 15.03.2012 at 15:50, Ralph Castain wrote:

> 
> On Mar 15, 2012, at 8:46 AM, Reuti wrote:
> 
>> On 15.03.2012 at 15:37, Ralph Castain wrote:
>> 
>>> Just to be clear: I take it that the first entry is the host name, and the 
>>> second is the number of slots allocated on that host?
>> 
>> This is correct.
>> 
>> 
>>> FWIW: I see the problem. Our parser was apparently written assuming every 
>>> line was a unique host, so it doesn't even check to see if there is 
>>> duplication. Easy fix - can shoot it to you today.
>> 
>> But even with the fix the nice value will be the same for all processes 
>> forked there. Either all have the nice value of his low priority queue or 
>> the high priority queue.
> 
> Agreed - nothing I can do about that, though. We only do the one qrsh call, 
> so the daemons are going to fall into a single queue, and so will all their 
> children. In this scenario, it isn't clear to me (from this discussion) that 
> I can control which queue gets used

Correct.


> - can I?

No. As posted, I created an issue for it. But if it did work, then you would 
already get different $TMPDIRs for each queue.


> Should I?

I can't speak for the community. Personally I would say: don't distribute 
parallel jobs among different queues at all, as some applications will use 
internal communication to distribute the environment variables of the master 
process to the slaves (even if SGE's `qrsh -inherit ...` is called 
without -V, and even if Open MPI is not told to forward any specific 
environment variable). If you have a custom application it can work of course, 
but with closed source ones you can only test and learn from experience whether 
it works or not.

Not to mention the timing issue of differently niced processes. Adjusting the 
SGE setup of the OP would be the smarter way IMO.

If it's fixed in Open MPI to add up all the granted slots on one machine, some 
users may think it's an Open MPI error to attach all processes to one queue 
only, as they expect different queues to be used. So this "workaround" should 
be noted somewhere: >>As it's not possible to reach a specific queue on a slave 
machine with SGE's tight integration commands (`qrsh -inherit ...`), as a 
workaround the slots granted across different queues are added up in SGE's 
$PE_HOSTFILE and the processes are started in the queue SGE chooses for the 
first issued `qrsh -inherit ...`. Which one is taken can't be predicted 
though.<<

-- Reuti

Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Joshua Baker-LePain

On Thu, 15 Mar 2012 at 1:53pm, Reuti wrote

PS: In your example you also had the case of 2 slots in the low priority 
queue; what is the actual setup in your cluster?


Our actual setup is:

 o lab.q, slots=numprocs, load_thresholds=np_load_avg=1.5, labs (=SGE
   projects) limited by RQS to a number of slots equal to their "share" of
   the cluster, seq_no=0, priority=0.

 o long.q, slots=numprocs, load_thresholds=np_load_avg=0.9, seq_no=1,
   priority=19

 o short.q, slots=numprocs, load_thresholds=np_load_avg=1.25, users
   limited by RQS to 200 slots, runtime limited to 30 minutes, seq_no=2,
   priority=10

Users are instructed to not select a queue when submitting jobs.  The 
theory is that even if non-contributing users have filled the cluster with 
long.q jobs, contributing users will still have instant access to "their" 
lab.q slots, overloading nodes with jobs running at a higher priority than 
the long.q jobs.  long.q jobs won't start on nodes full of lab.q jobs. 
And short.q is for quick, high priority jobs regardless of cluster status 
(the main use case being processing MRI data into images while a patient 
is physically in the scanner).


The truth is our cluster is primarily used for, and thus SGE is tuned for, 
large numbers of serial jobs.  We do have *some* folks running parallel 
code, and it *is* starting to get to the point where I need to reconfigure 
things to make that part work better.


--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF


Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Joshua Baker-LePain

On Thu, 15 Mar 2012 at 4:41pm, Reuti wrote


On 15.03.2012 at 15:50, Ralph Castain wrote:


On Mar 15, 2012, at 8:46 AM, Reuti wrote:


On 15.03.2012 at 15:37, Ralph Castain wrote:

FWIW: I see the problem. Our parser was apparently written assuming 
every line was a unique host, so it doesn't even check to see if 
there is duplication. Easy fix - can shoot it to you today.


But even with the fix the nice value will be the same for all 
processes forked there. Either all have the nice value of his low 
priority queue or the high priority queue.


Agreed - nothing I can do about that, though. We only do the one qrsh 
call, so the daemons are going to fall into a single queue, and so will 
all their children. In this scenario, it isn't clear to me (from this 
discussion) that I can control which queue gets used


Correct.


Which I understand.  Our queue setup is admittedly a bit wonky (which is
probably why I'm the first one to have this issue).  I'm much more 
concerned with things not crashing than with them absolutely having the 
"right" nice levels.  :)



Should I?


I can't speak for the community. Personally I would say: don't 
distribute parallel jobs among different queues at all, as some 
applications will use some internal communication about the environment 
variables of the master process to distribute them to the slaves (even 
if SGE's `qrsh -inherit ...` is called without -V, and even if Open MPI 
is not told to forward any specific environment variable). If you have a 
custom application it can work of course, but with closed source ones 
you can only test and get the experience whether it's working or not.


Not to mention the timing issue of differently niced processes. 
Adjusting the SGE setup of the OP would be the smarter way IMO.


And I agree with that as well.  I understand if the decision is made to 
leave the parser the way it is, given that my setup is outside the norm.


--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF


Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Ralph Castain
No, I'll fix the parser as we should be able to run anyway. Just can't 
guarantee which queue the job will end up in, but at least it -will- run.





Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Joshua Baker-LePain

On Thu, 15 Mar 2012 at 11:38am, Ralph Castain wrote

No, I'll fix the parser as we should be able to run anyway. Just can't 
guarantee which queue the job will end up in, but at least it -will- 
run.


Makes sense to me.  Thanks!

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF


Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Ralph Castain
Here's the patch: I've set it up to go into 1.5, but not 1.4 as that series is 
being closed out. Please let me know if this solves the problem for you.


Modified: orte/mca/ras/gridengine/ras_gridengine_module.c
==============================================================================
--- orte/mca/ras/gridengine/ras_gridengine_module.c (original)
+++ orte/mca/ras/gridengine/ras_gridengine_module.c 2012-03-15 13:45:50 EDT (Thu, 15 Mar 2012)
@@ -64,6 +64,8 @@
     int rc;
     FILE *fp;
     orte_node_t *node;
+    opal_list_item_t *item;
+    bool found;
 
     /* show the Grid Engine's JOB_ID */
     if (mca_ras_gridengine_component.show_jobid ||
@@ -92,22 +94,36 @@
         queue = strtok_r(NULL, " \n", &tok);
         arch = strtok_r(NULL, " \n", &tok);
 
-        /* create a new node entry */
-        node = OBJ_NEW(orte_node_t);
-        if (NULL == node) {
-            fclose(fp);
-            return ORTE_ERR_OUT_OF_RESOURCE;
+        /* see if we already have this node */
+        found = false;
+        for (item = opal_list_get_first(nodelist);
+             item != opal_list_get_end(nodelist);
+             item = opal_list_get_next(item)) {
+            node = (orte_node_t*)item;
+            if (0 == strcmp(ptr, node->name)) {
+                /* just add the slots */
+                node->slots += (int)strtol(num, (char **)NULL, 10);
+                found = true;
+                break;
+            }
+        }
+        if (!found) {
+            /* create a new node entry */
+            node = OBJ_NEW(orte_node_t);
+            if (NULL == node) {
+                fclose(fp);
+                return ORTE_ERR_OUT_OF_RESOURCE;
+            }
+            node->name = strdup(ptr);
+            node->state = ORTE_NODE_STATE_UP;
+            node->slots_inuse = 0;
+            node->slots_max = 0;
+            node->slots = (int)strtol(num, (char **)NULL, 10);
+            opal_output(mca_ras_gridengine_component.verbose,
+                        "ras:gridengine: %s: PE_HOSTFILE shows slots=%d",
+                        node->name, node->slots);
+            opal_list_append(nodelist, &node->super);
         }
-        node->name = strdup(ptr);
-        node->state = ORTE_NODE_STATE_UP;
-        node->slots_inuse = 0;
-        node->slots_max = 0;
-        node->slots = (int)strtol(num, (char **)NULL, 10);
-        opal_output(mca_ras_gridengine_component.verbose,
-                    "ras:gridengine: %s: PE_HOSTFILE shows slots=%d",
-                    node->name, node->slots);
-        opal_list_append(nodelist, &node->super);
-
     } /* finished reading the $PE_HOSTFILE */
 
 cleanup:





Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Reuti
On 15.03.2012 at 18:14, Joshua Baker-LePain wrote:

> On Thu, 15 Mar 2012 at 1:53pm, Reuti wrote
> 
>> PS: In your example you also had the case of 2 slots in the low priority 
>> queue; what is the actual setup in your cluster?
> 
> Our actual setup is:
> 
> o lab.q, slots=numprocs, load_thresholds=np_load_avg=1.5, labs (=SGE
>   projects) limited by RQS to a number of slots equal to their "share" of
>   the cluster, seq_no=0, priority=0.
> 
> o long.q, slots=numprocs, load_thresholds=np_load_avg=0.9, seq_no=1,
>   priority=19
> 
> o short.q, slots=numprocs, load_thresholds=np_load_avg=1.25, users
>   limited by RQS to 200 slots, runtime limited to 30 minutes, seq_no=2,
>   priority=10
> 
> Users are instructed to not select a queue when submitting jobs.  The theory 
> is that even if non-contributing users have filled the cluster with long.q 
> jobs, contributing users will still have instant access to "their" lab.q 
> slots, overloading nodes with jobs running at a higher priority than the 
> long.q jobs.  long.q jobs won't start on nodes full of lab.q jobs. And 
> short.q is for quick, high priority jobs regardless of cluster status (the 
> main use case being processing MRI data into images while a patient is 
> physically in the scanner).

Thanks for posting the information. Avoiding getting slots from different 
queues isn't complex:

1. Define the PE three times, e.g. "orte_lab", "orte_long" and "orte_short". 
Attach the corresponding one, and only that one, to each queue, i.e. "long.q" 
gets "orte_long" etc.

2. The `qsub` command needs to include a wildcard like "-pe orte* 64" instead 
of the plain "orte" which I guess is used right now.

Once SGE has selected a PE for the job, it will stay in this PE, and as the PE 
is attached to only one queue, no foreign slots will be assigned any longer. 
Jobs may have to wait a little bit longer, since right now the slots are 
collected from all queues.

NB: Do you already use "-R y" and a set h_rt to avoid starvation of parallel 
jobs?
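
As a rough sketch of those two points (the per-queue PE names are the ones
suggested above, "orte" is assumed to be the current PE name, and the h_rt
value is an arbitrary placeholder):

# clone the existing PE under a per-queue name and add it
qconf -sp orte | sed 's/^pe_name.*/pe_name            orte_long/' > /tmp/orte_long
qconf -Ap /tmp/orte_long
# attach it (and only it) to long.q: set "pe_list  orte_long" in the queue config
qconf -mq long.q
# ... repeat for orte_short/short.q and orte_lab/lab.q, then submit with a
# wildcard, reservation and a runtime limit:
qsub -R y -l h_rt=24:00:00 -pe "orte*" 64 job.sh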

-- Reuti






Re: [OMPI users] MPI_Testsome with incount=0, NULL array_of_indices and array_of_statuses causes MPI_ERR_ARG

2012-03-15 Thread Eugene Loh

On 03/13/12 13:25, Jeffrey Squyres wrote:

On Mar 9, 2012, at 5:17 PM, Jeremiah Willcock wrote:

On Open MPI 1.5.1, when I call MPI_Testsome with incount=0 and the two output 
arrays NULL, I get an argument error (MPI_ERR_ARG).  Is this the intended 
behavior?  If incount=0, no requests can complete, so the output arrays can 
never be written to.  I do not see anything in the MPI 2.2 standard that says 
either way whether this is allowed.

I have no strong opinions here, so I coded up a patch to just return 
MPI_SUCCESS in this scenario (attached).

If no one objects, we can probably get this in 1.6.


It isn't enough just to return MPI_SUCCESS when the count is zero.  The 
man pages indicate what behavior is expected when count==0 and the MTT 
tests (ibm/pt2pt/[test|wait][any|some|all].c) check for this behavior.  
Put another way, a bunch of MTT tests started failing since r26138 due 
to quick return on count=0.


Again, the trunk since r26138 sets no output values when count=0.  In 
contrast, the ibm/pt2pt/*.c tests correctly check for the count=0 
behavior that we document in our man pages.  Here are excerpts from our 
man pages:


  Testall

Returns flag = true if all communications associated
with active handles in the array have completed (this
includes the case where no handle in the list is active).

  Testany

MPI_Testany tests for completion of either one or none
of the operations associated with active handles.  In
the latter case (no operation completed), it returns
flag = false, returns a value of MPI_UNDEFINED in index,
and status is undefined.

The array may contain null or inactive handles. If the
array contains no active handles then the call returns
immediately with flag = true, index = MPI_UNDEFINED,
and an empty status.

  Testsome

If there is no active handle in the list, it returns
outcount = MPI_UNDEFINED.

  Waitall

[...no issues...]

  Waitany

The array_of_requests list may contain null or inactive
handles.  If the list contains no active handles (list
has length zero or all entries are null or inactive),
then the call returns immediately with index = MPI_UNDEFINED,
and an empty status.

  Waitsome

If the list contains no active handles, then the call
returns immediately with outcount = MPI_UNDEFINED.
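
(To make the documented count==0 behavior concrete, a self-contained check
along these lines -- an illustrative sketch, not one of the MTT tests -- shows
what a caller should see once the patch below is in:)

/* hypothetical demonstration of the count==0 / incount==0 behavior above */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int flag, indx, outcount;

    MPI_Init(&argc, &argv);

    /* no requests at all: flag must come back true */
    MPI_Testall(0, NULL, &flag, MPI_STATUSES_IGNORE);

    /* no active handles: flag = true, indx = MPI_UNDEFINED */
    MPI_Testany(0, NULL, &indx, &flag, MPI_STATUS_IGNORE);

    /* no active handles: outcount = MPI_UNDEFINED */
    MPI_Testsome(0, NULL, &outcount, NULL, MPI_STATUSES_IGNORE);

    printf("flag=%d indx=%d outcount=%d (MPI_UNDEFINED=%d)\n",
           flag, indx, outcount, MPI_UNDEFINED);

    MPI_Finalize();
    return 0;
}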

I'll test and put back the attached patch.
Index: trunk/ompi/mpi/c/testall.c
===================================================================
--- trunk/ompi/mpi/c/testall.c  (revision 26147)
+++ trunk/ompi/mpi/c/testall.c  (working copy)
@@ -67,6 +67,7 @@
     }
 
     if (OPAL_UNLIKELY(0 == count)) {
+        *flag = true;
         return MPI_SUCCESS;
     }
 
Index: trunk/ompi/mpi/c/waitany.c
===================================================================
--- trunk/ompi/mpi/c/waitany.c  (revision 26147)
+++ trunk/ompi/mpi/c/waitany.c  (working copy)
@@ -67,6 +67,7 @@
     }
 
     if (OPAL_UNLIKELY(0 == count)) {
+        *indx = MPI_UNDEFINED;
         return MPI_SUCCESS;
     }
 
Index: trunk/ompi/mpi/c/testany.c
===================================================================
--- trunk/ompi/mpi/c/testany.c  (revision 26147)
+++ trunk/ompi/mpi/c/testany.c  (working copy)
@@ -67,6 +67,8 @@
     }
 
     if (OPAL_UNLIKELY(0 == count)) {
+        *completed = true;
+        *indx = MPI_UNDEFINED;
         return MPI_SUCCESS;
     }
 
Index: trunk/ompi/mpi/c/waitsome.c
===================================================================
--- trunk/ompi/mpi/c/waitsome.c (revision 26147)
+++ trunk/ompi/mpi/c/waitsome.c (working copy)
@@ -69,6 +69,7 @@
     }
 
     if (OPAL_UNLIKELY(0 == incount)) {
+        *outcount = MPI_UNDEFINED;
         return MPI_SUCCESS;
     }
 
Index: trunk/ompi/mpi/c/testsome.c
===================================================================
--- trunk/ompi/mpi/c/testsome.c (revision 26147)
+++ trunk/ompi/mpi/c/testsome.c (working copy)
@@ -69,6 +69,7 @@
     }
 
     if (OPAL_UNLIKELY(0 == incount)) {
+        *outcount = MPI_UNDEFINED;
         return OMPI_SUCCESS;
     }



Re: [OMPI users] MPI_Testsome with incount=0, NULL array_of_indices and array_of_statuses causes MPI_ERR_ARG

2012-03-15 Thread Jeffrey Squyres
Many thanks for doing this, Eugene.



-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Joshua Baker-LePain

On Thu, 15 Mar 2012 at 11:49am, Ralph Castain wrote

Here's the patch: I've set it up to go into 1.5, but not 1.4 as that 
series is being closed out. Please let me know if this solves the 
problem for you.


I couldn't get the included inline patch to apply to 1.5.4 (probably my 
issue), but I downloaded it from 
 and applied that.  My 
test job ran just fine, and looking at the nodes verified a single orted 
process per node despite SGE assigning slots in multiple queues.
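
(For anyone repeating that check, one way to count the daemons on an allocated
node is something like the following -- a hypothetical command using one of the
host names above, not necessarily what was run here:)

ssh iq103 pgrep -c -x orted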


In short, WORKSFORME.

Thanks!

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF


Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

2012-03-15 Thread Ralph Castain
Great - thanks!
