char* name[20]; declares an array of 20 (uninitialized) pointers to char; I guess you mean char name[20];
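
To spell out the difference (just a sketch, using the sample code's names):

    char* name[20];   /* what the sample declares: 20 uninitialized pointers to char */
    char  name[20];   /* what gethostname() expects: a buffer of 20 chars it can fill */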
So Brent's suggestion should work as well(?) To be safe I would also add:

    gethostname(name, maxlen);
    name[19] = '\0';
    printf("Hello, world. I am %d of %d and host %s \n", rank, ...

Cheers

On 09/27/2011 07:40 PM, Phillip Vassenkov wrote:
> Thanks, but my main concern is the segfault :P I changed it and, as I
> expected, it still segfaults.
>
> On 9/27/11 9:48 AM, Henderson, Brent wrote:
>> Here is another possibly non-helpful suggestion. :) Change:
>>
>> char* name[20];
>> int maxlen = 20;
>>
>> To:
>>
>> char name[256];
>> int maxlen = 256;
>>
>> gethostname() is supposed to properly truncate the hostname it returns
>> if the actual name is longer than the length provided, but since you
>> have at least one that is longer than 20 characters, I'm curious.
>>
>> Brent
>>
>> -----Original Message-----
>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
>> On Behalf Of Jeff Squyres
>> Sent: Tuesday, September 27, 2011 6:29 AM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] Segfault on any MPI communication on head node
>>
>> Hmm. It's not immediately clear to me what's going wrong here.
>>
>> I hate to ask, but could you install a debugging version of Open MPI
>> and capture a proper stack trace of the segv?
>>
>> Also, could you try the 1.4.4 rc and see if that magically fixes the
>> problem? (I'm about to post a new 1.4.4 rc later this morning, but
>> either the current one or the one from later today would be a good
>> datapoint.)
>>
>> On Sep 26, 2011, at 5:09 PM, Phillip Vassenkov wrote:
>>
>>> Yep, Fedora Core 14 and OpenMPI 1.4.3.
>>>
>>> On 9/24/11 7:02 AM, Jeff Squyres wrote:
>>>> Are you running the same OS version and Open MPI version between the
>>>> head node and regular nodes?
>>>>
>>>> On Sep 23, 2011, at 5:27 PM, Vassenkov, Phillip wrote:
>>>>
>>>>> Hey all,
>>>>> I've been racking my brains over this for several days and was hoping
>>>>> anyone could enlighten me. I'll describe only the relevant parts of
>>>>> the network/computer systems. There is one head node and a multitude
>>>>> of regular nodes. The regular nodes are all identical to each other.
>>>>> If I run an MPI program from one of the regular nodes to any other
>>>>> regular node, everything works. If I include the head node in the
>>>>> hosts file, I get segfaults, which I'll paste below along with sample
>>>>> code. The machines are all networked via InfiniBand and Ethernet. The
>>>>> issue only arises when MPI communication occurs. By this I mean,
>>>>> MPI_Init might succeed, but the segfault always occurs on MPI_Barrier
>>>>> or MPI_Send/Recv. I found a workaround by disabling the openib BTL and
>>>>> forcing communications to go over InfiniBand (if I don't force
>>>>> InfiniBand, they go over Ethernet). This command works when the head
>>>>> node is included in the hosts file:
>>>>>
>>>>> mpirun --hostfile hostfile --mca btl ^openib --mca btl_tcp_if_include ib0 -np 2 ./b.out
>>>>>
>>>>> Sample Code:
>>>>>
>>>>> #include "mpi.h"
>>>>> #include <stdio.h>
>>>>> int main(int argc, char *argv[])
>>>>> {
>>>>>     int rank, nprocs;
>>>>>     char* name[20];
>>>>>     int maxlen = 20;
>>>>>     MPI_Init(&argc, &argv);
>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>     MPI_Barrier(MPI_COMM_WORLD);
>>>>>     gethostname(name, maxlen);
>>>>>     printf("Hello, world. I am %d of %d and host %s \n", rank, nprocs, name);
>>>>>     fflush(stdout);
>>>>>     MPI_Finalize();
>>>>>     return 0;
>>>>> }
>>>>>
>>>>> Segfault:
>>>>> [pastec:19917] *** Process received signal ***
>>>>> [pastec:19917] Signal: Segmentation fault (11)
>>>>> [pastec:19917] Signal code: Address not mapped (1)
>>>>> [pastec:19917] Failing at address: 0x8
>>>>> [pastec:19917] [ 0] /lib64/libpthread.so.0() [0x34a880eeb0]
>>>>> [pastec:19917] [ 1] /usr/lib64/libmthca-rdmav2.so(+0x36aa) [0x7eff6430b6aa]
>>>>> [pastec:19917] [ 2] /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x133c9) [0x7eff66a163c9]
>>>>> [pastec:19917] [ 3] /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x1eb70) [0x7eff66a21b70]
>>>>> [pastec:19917] [ 4] /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x1ec89) [0x7eff66a21c89]
>>>>> [pastec:19917] [ 5] /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x1403d) [0x7eff66a1703d]
>>>>> [pastec:19917] [ 6] /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(+0x120e6) [0x7eff676670e6]
>>>>> [pastec:19917] [ 7] /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(+0x6273) [0x7eff6765b273]
>>>>> [pastec:19917] [ 8] /usr/lib64/openmpi/lib/openmpi/mca_coll_tuned.so(+0x1b2f) [0x7eff65539b2f]
>>>>> [pastec:19917] [ 9] /usr/lib64/openmpi/lib/openmpi/mca_coll_tuned.so(+0xa5cf) [0x7eff655425cf]
>>>>> [pastec:19917] [10] /usr/lib64/openmpi/lib/libmpi.so.0(MPI_Barrier+0x9e) [0x3a54c4c94e]
>>>>> [pastec:19917] [11] ./b.out(main+0x6e) [0x400a42]
>>>>> [pastec:19917] [12] /lib64/libc.so.6(__libc_start_main+0xfd) [0x34a841ee5d]
>>>>> [pastec:19917] [13] ./b.out() [0x400919]
>>>>> [pastec:19917] *** End of error message ***
>>>>> [pastec.gtri.gatech.edu:19913] [[18526,0],0]-[[18526,1],1] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
>>>>> --------------------------------------------------------------------------
>>>>> mpirun noticed that process rank 1 with PID 19917 on node
>>>>> pastec.gtri.gatech.edu exited on signal 11 (Segmentation fault).
>>>>> --------------------------------------------------------------------------
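
P.S. For reference, a corrected version of the sample program folding in both suggestions above (a char array instead of an array of pointers, a 256-byte buffer, and explicit null termination). The #include <unistd.h> for gethostname() is my addition and wasn't in the original listing; treat this as a sketch, not a tested fix for the openib segfault:

    #include "mpi.h"
    #include <stdio.h>
    #include <unistd.h>   /* gethostname() */

    int main(int argc, char *argv[])
    {
        int rank, nprocs;
        char name[256];            /* array of char, not char* name[20] */
        int maxlen = 256;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Barrier(MPI_COMM_WORLD);

        gethostname(name, maxlen);
        name[maxlen - 1] = '\0';   /* gethostname() may not null-terminate on truncation */

        printf("Hello, world. I am %d of %d and host %s\n", rank, nprocs, name);
        fflush(stdout);

        MPI_Finalize();
        return 0;
    }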