Out of curiosity: could you send along the mpirun command line you are using to 
launch these jobs? I'm wondering if the SGE integration itself is the problem, 
and it only shows up in the shared-memory (sm) code.
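(For reference, a tight-integration launch under SGE usually looks something like 
the following; the parallel environment name "orte" and the program name are 
placeholders, not taken from this thread:

  # submit 16 slots through a parallel environment (name is site-specific):
  qsub -pe orte 16 myjob.sh

  # inside myjob.sh, no -np or hostfile is needed, since mpirun reads the
  # slot allocation from the SGE environment:
  mpirun ./mpihello

Knowing how far your command line deviates from that pattern would help narrow 
things down.)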


On Mar 13, 2012, at 4:57 PM, Gutierrez, Samuel K wrote:

> 
> On Mar 13, 2012, at 4:07 PM, Joshua Baker-LePain wrote:
> 
>> On Tue, 13 Mar 2012 at 9:15pm, Gutierrez, Samuel K wrote
>> 
>>>>> Any more information surrounding your failures in 1.5.4 is greatly 
>>>>> appreciated.
>>>> 
>>>> I'm happy to provide, but what exactly are you looking for?  The test code 
>>>> I'm running is *very* simple:
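(The source of mpihello-long.c isn't quoted in the thread; judging from the 
backtrace further down, it is essentially a standard MPI hello-world along these 
lines, a hypothetical sketch rather than the original file:

  /* sketch of a minimal MPI test program; the real mpihello-long.c is not shown here */
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, size;
      MPI_Init(&argc, &argv);    /* the segfault reported below occurs inside this call */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      printf("Hello from rank %d of %d\n", rank, size);
      MPI_Finalize();
      return 0;
  }

The crash happens during MPI_Init itself, before any user-level communication, 
which is why a program this simple is enough to trigger it.)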
>>> 
>>> If you experience this type of failure with 1.4.5, can you send another 
>>> backtrace?  We'll go from there.
>> 
> 
> Fooey.  What compiler are you using to build Open MPI and how are you 
> configuring your build?  Can you also run with a debug build of Open MPI so 
> we can see the line numbers?
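(For the debug build, something along these lines should be enough; the install 
prefix is a placeholder:

  ./configure --prefix=/path/to/ompi-1.4.5-debug --enable-debug
  make all install

Then recompile the test program with -g using the mpicc from that installation 
and re-run; the stack frames should then show file and line numbers instead of 
bare offsets.)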
> 
>> In an odd way, I'm relieved to say that 1.4.5 failed in the same way.  From 
>> the SGE log of the run, here's the error message from one of the processes 
>> that segfaulted:
>> [iq104:05697] *** Process received signal ***
>> [iq104:05697] Signal: Segmentation fault (11)
>> [iq104:05697] Signal code: Address not mapped (1)
>> [iq104:05697] Failing at address: 0x2ad032188e8c
>> [iq104:05697] [ 0] /lib64/libpthread.so.0() [0x3e5420f4a0]
>> [iq104:05697] [ 1] /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_btl_sm.so(+0x3c4c) [0x2b0099ec4c4c]
>> [iq104:05697] [ 2] /netapp/sali/jlb/ompi-1.4.5/lib/libopen-pal.so.0(opal_progress+0x6a) [0x2b00967737ca]
>> [iq104:05697] [ 3] /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_grpcomm_bad.so(+0x18d5) [0x2b00975ef8d5]
>> [iq104:05697] [ 4] /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0(+0x38a24) [0x2b009628da24]
>> [iq104:05697] [ 5] /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0(MPI_Init+0x1b0) [0x2b00962b24f0]
>> [iq104:05697] [ 6] /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4-debug(main+0x22) [0x400826]
>> [iq104:05697] [ 7] /lib64/libc.so.6(__libc_start_main+0xfd) [0x3e53e1ecdd]
>> [iq104:05697] [ 8] /netapp/sali/jlb/mybin/mpihello-long.ompi-1.4-debug() [0x400749]
>> [iq104:05697] *** End of error message ***
>> 
>> And the backtrace of the resulting core file:
>> #0  0x00002b0099ec4c4c in mca_btl_sm_component_progress ()
>>  from /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_btl_sm.so
>> #1  0x00002b00967737ca in opal_progress ()
>>  from /netapp/sali/jlb/ompi-1.4.5/lib/libopen-pal.so.0
>> #2  0x00002b00975ef8d5 in barrier ()
>>  from /netapp/sali/jlb/ompi-1.4.5/lib/openmpi/mca_grpcomm_bad.so
>> #3  0x00002b009628da24 in ompi_mpi_init ()
>>  from /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0
>> #4  0x00002b00962b24f0 in PMPI_Init ()
>>  from /netapp/sali/jlb/ompi-1.4.5/lib/libmpi.so.0
>> #5  0x0000000000400826 in main (argc=1, argv=0x7fff9fe113f8)
>>   at mpihello-long.c:11
>> 
>>> Another question.  How reproducible is this on your system?
>> 
>> In my testing today, it's been 100% reproducible.
> 
> That's surprising.
> 
> Thanks,
> 
> Sam
> 
>> 
>> -- 
>> Joshua Baker-LePain
>> QB3 Shared Cluster Sysadmin
>> UCSF

