Re: [OMPI users] Segfaults w/ both 1.4 and 1.5 on CentOS 6.2/SGE

Ralph Castain Wed, 14 Mar 2012 19:50:08 -0400

On Mar 14, 2012, at 5:44 PM, Reuti wrote:

> On 14.03.2012 at 23:48, Joshua Baker-LePain wrote:
> 
>> On Wed, 14 Mar 2012 at 6:31pm, Reuti wrote
>> 
>>> I just tested with two different queues on two machines and a small 
>>> mpihello and it is working as expected.
>> 
>> At this point the narrative is getting very confused, even for me.  So I 
>> tried to find a clear-cut case where I can change one thing to flip between 
>> "it works" and "it doesn't":
>> 
>> Case "it works":
>> o Set up 2 queues -- lab.q and test.q.  Both run at priority 0.  lab.q has
>>  slots=cores on each host, test.q has 1 slot per host.
>> 
>> o Submit job via:
>>  qsub -q "lab.q|test.q" -l mem_free=150M -pe ompi 64 jobscript.sh
>> 
>> o Job runs just fine.  Running 'ps aufx' on one of the nodes shows 2 orted
>>  processes, one with 4 children (the processes running in the lab.q
>>  slots) and one with 1 child (the process running in the test.q slot),
>>  all happily running (caution: very long lines ahead):
>> 
>> sge       9673  0.0  0.0  14224  1204 ?        S    14:31   0:00  \_ sge_shepherd-6997934 -bg
>> root      9674  0.0  0.0  11272   892 ?        Ss   14:31   0:00  |   \_ /ccpr1/sge6/utilbin/lx24-amd64/rshd -l
> 
> Which version of SGE are you using? The traditional rsh startup was replaced 
> by the builtin startup some time ago (although it should still work).
> 
> 
>> jlb       9677  0.0  0.0   8988   700 ?        S    14:31   0:00  |       \_ /ccpr1/sge6/utilbin/lx24-amd64/qrsh_starter /var/spool/sge/opt95/active_jobs/6997934.1/1.opt95
>> jlb       9679  0.1  0.0  47932  2008 ?        S    14:31   0:00  |           \_ orted -mca ess env -mca orte_ess_jobid 1517355008 -mca orte_ess_vpid 5 -mca orte_ess_num_procs 24 --hnp-uri 1517355008.0;tcp://172.19.12.104:47527
>> jlb       9690 53.6  0.0 157376  3832 ?        R    14:31   0:02  |               \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
>> jlb       9691 50.8  0.0 157376  3832 ?        R    14:31   0:02  |               \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
>> jlb       9692 37.0  0.0 157376  3828 ?        R    14:31   0:01  |               \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
>> jlb       9693 49.2  0.0 157376  3824 ?        R    14:31   0:02  |               \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
>> sge       9675  0.0  0.0  14228  1208 ?        S    14:31   0:00  \_ sge_shepherd-6997934 -bg
>> root      9676  0.0  0.0  11268   888 ?        Ss   14:31   0:00      \_ /ccpr1/sge6/utilbin/lx24-amd64/rshd -l
>> jlb       9678  0.0  0.0   8992   708 ?        S    14:31   0:00          \_ /ccpr1/sge6/utilbin/lx24-amd64/qrsh_starter /var/spool/sge/opt95/active_jobs/6997934.1/2.opt95
>> jlb       9680  0.0  0.0  47932  2000 ?        S    14:31   0:00              \_ orted -mca ess env -mca orte_ess_jobid 1517355008 -mca orte_ess_vpid 6 -mca orte_ess_num_procs 24 --hnp-uri 1517355008.0;tcp://172.19.12.104:47527
>> jlb       9689 36.8  0.0  89776  3672 ?        R    14:31   0:01                  \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
> 
> Maybe this already shows the problem: there are two `qrsh -inherit` calls, as 
> Open MPI thinks these are different machines (I ran with only one slot on each 
> host, hence didn't see it at first, but can reproduce it now). But for SGE both 
> may end up in the same queue, overwriting the openmpi-session in $TMPDIR.
> 
> Although it's running: do you get all the output? If I request 4 slots and 
> get one from each queue on both machines, mpihello outputs only 3 lines: the 
> "Hello World from Node 3" line is always missing.
> 
> (I was just typing when Ralph's message came in: I can confirm this. Avoiding 
> it would require Open MPI to merge all lines from the hostfile that refer to 
> the same machine. SGE creates an entry for each queue/host pair in the 
> machine file.)


Hmmm…I can take a look at the allocator module and see why we aren't doing it. 
Would the host names be the same for the two queues?
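
A minimal sketch of the merging Reuti describes, in Python rather than the C
of the actual allocator. It assumes each $PE_HOSTFILE line has the form
"host slots queue@host processor-range" and that both queues report the same
host name; it then sums the slots per host. The function name
merge_pe_hostfile and the sample slot counts are hypothetical, and this is
not Open MPI's code.

```python
from collections import OrderedDict


def merge_pe_hostfile(lines):
    """Sum slot counts per host across all queue/host entries.

    Assumes (not confirmed in this thread) that each line looks like:
        <host> <slots> <queue>@<host> <processor-range>
    """
    slots = OrderedDict()  # preserves first-seen host order
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        host, count = fields[0], int(fields[1])
        slots[host] = slots.get(host, 0) + count
    return slots


# Hypothetical hostfile contents: two queues contribute slots on each host.
sample = [
    "opt95 4 lab.q@opt95 UNDEFINED",
    "opt95 1 test.q@opt95 UNDEFINED",
    "opt65 4 lab.q@opt65 UNDEFINED",
    "opt65 2 test.q@opt65 UNDEFINED",
]

merged = merge_pe_hostfile(sample)
for host, count in merged.items():
    print(host, count)
```

With entries merged this way, only one orted would be needed per physical
host, so only one `qrsh -inherit` and one session directory in $TMPDIR.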

> 
> ==
> 
> @Ralph: it could work if SGE had a facility to request the desired 
> queue in `qrsh -inherit ...`, because then the $TMPDIR would be unique for 
> each orted again (assuming it's using different ports for each).

Gotcha! I suspect getting the allocator to handle this cleanly is the better 
solution, though.


> 
> -- Reuti
> 
> 
>> Case "it doesn't":
>> o Take the above queue setup, and simply change test.q to have 2 slots
>>  per host.
>> 
>> o Submit job with the same qsub line.
>> 
>> o Job crashes.  I had 'ps aufx' running in a continuous loop on one of the
>>  nodes.  This was the last output which showed the job processes.  Note
>>  that the actual mpihello processes never got into the "R" state:
>> 
>> sge      12423  0.0  0.0  14224  1196 ?        S    14:41   0:00  \_ sge_shepherd-6997938 -bg
>> root     12425  0.0  0.0  11272   896 ?        Ss   14:41   0:00  |   \_ /ccpr1/sge6/utilbin/lx24-amd64/rshd -l
>> jlb      12428  0.0  0.0   8988   700 ?        S    14:41   0:00  |       \_ /ccpr1/sge6/utilbin/lx24-amd64/qrsh_starter /var/spool/sge/opt65/active_jobs/6997938.1/1.opt65
>> jlb      12430  0.0  0.0  47932  2016 ?        S    14:41   0:00  |           \_ orted -mca ess env -mca orte_ess_jobid 1468006400 -mca orte_ess_vpid 7 -mca orte_ess_num_procs 20 --hnp-uri 1468006400.0;tcp://172.19.12.104:39940
>> jlb      12798  1.0  0.0 153244  3752 ?        S    14:41   0:00  |               \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
>> jlb      12799  2.0  0.0 153244  3752 ?        S    14:41   0:00  |               \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
>> jlb      12800  1.0  0.0 153244  3752 ?        S    14:41   0:00  |               \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
>> sge      12436  0.0  0.0  14228  1208 ?        S    14:41   0:00  \_ sge_shepherd-6997938 -bg
>> root     12437  0.0  0.0  11268   884 ?        Ss   14:41   0:00      \_ /ccpr1/sge6/utilbin/lx24-amd64/rshd -l
>> jlb      12439  0.0  0.0   8992   712 ?        S    14:41   0:00          \_ /ccpr1/sge6/utilbin/lx24-amd64/qrsh_starter /var/spool/sge/opt65/active_jobs/6997938.1/2.opt65
>> jlb      12441  0.1  0.0  47932  2012 ?        S    14:41   0:00              \_ orted -mca ess env -mca orte_ess_jobid 1468006400 -mca orte_ess_vpid 8 -mca orte_ess_num_procs 20 --hnp-uri 1468006400.0;tcp://172.19.12.104:39940
>> jlb      12795  1.0  0.0 153100  3128 ?        S    14:41   0:00                  \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
>> jlb      12796  2.0  0.0 153232  3752 ?        S    14:41   0:00                  \_ /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4.3-debug
> 
> 
> 
>>> Joshua: is the CentOS 6 install the same on all nodes, and have you 
>>> recompiled the application with the current version of the library? By 
>>> "threads", do you mean "processes"?
>> 
>> All the nodes are installed from the same kickstart file and kept fully
>> up to date.  And, yes, the application is compiled against the exact
>> library I'm running it with.
>> 
>> Thanks again to all for looking at this.
>> 
>> -- 
>> Joshua Baker-LePain
>> QB3 Shared Cluster Sysadmin
>> UCSF
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
