Use --enable-debug on your configure line.  This will add in some debugging 
code to OMPI, and it'll compile everything with -g so that you can get stack 
traces.

Beware that the extra debugging junk makes OMPI slightly slower; don't do any 
benchmarking with this install, etc.
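
Separately, on the char*[20]/char[20] issue discussed below: here is a minimal 
sketch of a hostname lookup that sidesteps it. The helper name safe_hostname 
is made up for illustration (it is not part of Open MPI or of the original 
test program), and it assumes a POSIX system. It only addresses the buffer 
bug; it does nothing about the segv inside the openib BTL.

```c
/* Expose gethostname() even under strict -std=c99/c11 on glibc. */
#define _POSIX_C_SOURCE 200112L

#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only.
 * Fills buf with the local hostname and always NUL-terminates it, because
 * POSIX leaves termination unspecified when the name gets truncated.
 * Returns gethostname()'s result (0 on success, -1 on error), or -1
 * without touching buf if buf is NULL or len is 0.
 */
static int safe_hostname(char *buf, size_t len)
{
    int rc;

    if (buf == NULL || len == 0) {
        return -1;
    }
    rc = gethostname(buf, len);
    buf[len - 1] = '\0';   /* terminate even on truncation or error */
    return rc;
}
```

In the test program this would stand in for the gethostname(name,maxlen) 
call: declare "char name[256];" (a real char array, not an array of 
pointers) and call safe_hostname(name, sizeof name) before the printf.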


On Sep 28, 2011, at 6:27 PM, Phillip Vassenkov wrote:

> I tried 1.4.4rc4, same problem. Where do I get a debugging version?
> 
> On 9/28/11 8:32 AM, Jeff Squyres wrote:
>> Agreed that the original program had the char*[20]/char[20] bug, but his 
>> segv is occurring before trying to use that array.  So it's a bug - but he 
>> just hadn't hit it yet.  :-)
>> 
>> I'd still like to see a debugging version so that we can get a real stack 
>> trace, and/or try the latest 1.4.4 RC (posted yesterday).
>> 
>> 
>> On Sep 27, 2011, at 3:08 PM, German Hoecht wrote:
>> 
>>> char* name[20]; yields 20 (undefined) pointers to char, guess you mean
>>> char name[20];
>>> 
>>> So Brent's suggestion should work as well(?)
>>> 
>>> To be safe I would also add:
>>> gethostname(name,maxlen);
>>> name[19] = '\0';
>>> printf("Hello, world.  I am %d of %d and host %s \n", rank, ...
>>> 
>>> Cheers
>>> 
>>> On 09/27/2011 07:40 PM, Phillip Vassenkov wrote:
>>>> Thanks, but my main concern is the segfault :P I changed and as I
>>>> expected it still segfaults.
>>>> 
>>>> On 9/27/11 9:48 AM, Henderson, Brent wrote:
>>>>> Here is another possibly non-helpful suggestion.  :)  Change:
>>>>> 
>>>>>      char* name[20];
>>>>>      int maxlen = 20;
>>>>> 
>>>>> To:
>>>>> 
>>>>>      char name[256];
>>>>>      int maxlen = 256;
>>>>> 
>>>>> gethostname() is supposed to properly truncate the hostname it returns
>>>>> if the actual name is longer than the length provided, but since you
>>>>> have at least one that is longer than 20 characters, I'm curious.
>>>>> 
>>>>> Brent
>>>>> 
>>>>> 
>>>>> -----Original Message-----
>>>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
>>>>> On Behalf Of Jeff Squyres
>>>>> Sent: Tuesday, September 27, 2011 6:29 AM
>>>>> To: Open MPI Users
>>>>> Subject: Re: [OMPI users] Segfault on any MPI communication on head node
>>>>> 
>>>>> Hmm.  It's not immediately clear to me what's going wrong here.
>>>>> 
>>>>> I hate to ask, but could you install a debugging version of Open MPI
>>>>> and capture a proper stack trace of the segv?
>>>>> 
>>>>> Also, could you try the 1.4.4 rc and see if that magically fixes the
>>>>> problem? (I'm about to post a new 1.4.4 rc later this morning, but
>>>>> either the current one or the one from later today would be a good
>>>>> datapoint)
>>>>> 
>>>>> 
>>>>> On Sep 26, 2011, at 5:09 PM, Phillip Vassenkov wrote:
>>>>> 
>>>>>> Yep, Fedora Core 14 and OpenMPI 1.4.3
>>>>>> 
>>>>>> On 9/24/11 7:02 AM, Jeff Squyres wrote:
>>>>>>> Are you running the same OS version and Open MPI version between the
>>>>>>> head node and regular nodes?
>>>>>>> 
>>>>>>> On Sep 23, 2011, at 5:27 PM, Vassenkov, Phillip wrote:
>>>>>>> 
>>>>>>>> Hey all,
>>>>>>>> I've been racking my brains over this for several days and was
>>>>>>>> hoping anyone could enlighten me. I'll describe only the relevant
>>>>>>>> parts of the network/computer systems. There is one head node and a
>>>>>>>> multitude of regular nodes. The regular nodes are all identical to
>>>>>>>> each other. If I run an mpi program from one of the regular nodes
>>>>>>>> to any other regular nodes, everything works. If I include the head
>>>>>>>> node in the hosts file, I get segfaults which I'll paste below
>>>>>>>> along with sample code. The machines are all networked via
>>>>>>>> infiniband and Ethernet. The issue only arises when mpi
>>>>>>>> communication occurs. By this I mean, MPI_Init might succeed but
>>>>>>>> the segfault always occurs on MPI_Barrier or MPI_Send/MPI_Recv. I found
>>>>>>>> a workaround by disabling the openib btl and enforcing that
>>>>>>>> communications go over infiniband (if I don't force infiniband,
>>>>>>>> it'll go over Ethernet). This command works when the head node is
>>>>>>>> included in the hosts file:
>>>>>>>> mpirun --hostfile hostfile --mca btl ^openib --mca btl_tcp_if_include ib0 -np 2 ./b.out
>>>>>>>> 
>>>>>>>> Sample Code:
>>>>>>>> #include "mpi.h"
>>>>>>>> #include<stdio.h>
>>>>>>>> int main(int argc, char *argv[])
>>>>>>>> {
>>>>>>>>    int rank, nprocs;
>>>>>>>>     char* name[20];
>>>>>>>>     int maxlen = 20;
>>>>>>>>     MPI_Init(&argc,&argv);
>>>>>>>>     MPI_Comm_size(MPI_COMM_WORLD,&nprocs);
>>>>>>>>     MPI_Comm_rank(MPI_COMM_WORLD,&rank);
>>>>>>>>     MPI_Barrier(MPI_COMM_WORLD);
>>>>>>>>     gethostname(name,maxlen);
>>>>>>>>     printf("Hello, world.  I am %d of %d and host %s \n", rank, nprocs,name);
>>>>>>>>     fflush(stdout);
>>>>>>>>     MPI_Finalize();
>>>>>>>>     return 0;
>>>>>>>> 
>>>>>>>> }
>>>>>>>> 
>>>>>>>> Segfault:
>>>>>>>> [pastec:19917] *** Process received signal ***
>>>>>>>> [pastec:19917] Signal: Segmentation fault (11)
>>>>>>>> [pastec:19917] Signal code: Address not mapped (1)
>>>>>>>> [pastec:19917] Failing at address: 0x8
>>>>>>>> [pastec:19917] [ 0] /lib64/libpthread.so.0() [0x34a880eeb0]
>>>>>>>> [pastec:19917] [ 1] /usr/lib64/libmthca-rdmav2.so(+0x36aa)
>>>>>>>> [0x7eff6430b6aa]
>>>>>>>> [pastec:19917] [ 2]
>>>>>>>> /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x133c9)
>>>>>>>> [0x7eff66a163c9]
>>>>>>>> [pastec:19917] [ 3]
>>>>>>>> /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x1eb70)
>>>>>>>> [0x7eff66a21b70]
>>>>>>>> [pastec:19917] [ 4]
>>>>>>>> /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x1ec89)
>>>>>>>> [0x7eff66a21c89]
>>>>>>>> [pastec:19917] [ 5]
>>>>>>>> /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x1403d)
>>>>>>>> [0x7eff66a1703d]
>>>>>>>> [pastec:19917] [ 6]
>>>>>>>> /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(+0x120e6)
>>>>>>>> [0x7eff676670e6]
>>>>>>>> [pastec:19917] [ 7]
>>>>>>>> /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(+0x6273)
>>>>>>>> [0x7eff6765b273]
>>>>>>>> [pastec:19917] [ 8]
>>>>>>>> /usr/lib64/openmpi/lib/openmpi/mca_coll_tuned.so(+0x1b2f)
>>>>>>>> [0x7eff65539b2f]
>>>>>>>> [pastec:19917] [ 9]
>>>>>>>> /usr/lib64/openmpi/lib/openmpi/mca_coll_tuned.so(+0xa5cf)
>>>>>>>> [0x7eff655425cf]
>>>>>>>> [pastec:19917] [10]
>>>>>>>> /usr/lib64/openmpi/lib/libmpi.so.0(MPI_Barrier+0x9e) [0x3a54c4c94e]
>>>>>>>> [pastec:19917] [11] ./b.out(main+0x6e) [0x400a42]
>>>>>>>> [pastec:19917] [12] /lib64/libc.so.6(__libc_start_main+0xfd)
>>>>>>>> [0x34a841ee5d]
>>>>>>>> [pastec:19917] [13] ./b.out() [0x400919]
>>>>>>>> [pastec:19917] *** End of error message ***
>>>>>>>> [pastec.gtri.gatech.edu:19913] [[18526,0],0]-[[18526,1],1]
>>>>>>>> mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> 
>>>>>>>> mpirun noticed that process rank 1 with PID 19917 on node
>>>>>>>> pastec.gtri.gatech.edu exited on signal 11 (Segmentation fault).
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> _______________________________________________
>>>>>>>> users mailing list
>>>>>>>> us...@open-mpi.org
>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

