That means you have mismatched installations around - one configured as debug, and one not. They have to match.
Sent from my iPad

On Oct 3, 2011, at 2:44 PM, Phillip Vassenkov <phillip.vassen...@gtri.gatech.edu> wrote:

> I went into the directory that I used to install 1.4.3 and did the following:
>
>   make clean
>   ./configure --enable-debug
>   make -j8 all install
>
> and it hangs at this when I try to run my code (I commented out all the host name stuff, so it's just MPI code now):
>
>   [hostname:16574] [[17705,0],0] ORTE_ERROR_LOG: Buffer type (described vs non-described) mismatch - operation not allowed in file base/odls_base_default_fns.c at line 2600
>
> I'm googling for more info, but does anyone have any ideas?
>
> On 9/28/11 8:30 PM, Jeff Squyres wrote:
>> Use --enable-debug on your configure line. This will add some debugging code to OMPI, and it'll compile everything with -g so that you can get stack traces.
>>
>> Beware that the extra debugging junk makes OMPI slightly slower; don't do any benchmarking with this install, etc.
>>
>> On Sep 28, 2011, at 6:27 PM, Phillip Vassenkov wrote:
>>
>>> I tried 1.4.4rc4, same problem. Where do I get a debugging version?
>>>
>>> On 9/28/11 8:32 AM, Jeff Squyres wrote:
>>>> Agreed that the original program had the char*[20]/char[20] bug, but his segv is occurring before trying to use that array. So it's a bug - but he just hadn't hit it yet. :-)
>>>>
>>>> I'd still like to see a debugging version so that we can get a real stack trace, and/or try the latest 1.4.4 RC (posted yesterday).
>>>>
>>>> On Sep 27, 2011, at 3:08 PM, German Hoecht wrote:
>>>>
>>>>> char* name[20]; yields 20 (undefined) pointers to char; I guess you mean char name[20];
>>>>>
>>>>> So Brent's suggestion should work as well(?)
>>>>>
>>>>> To be safe I would also add:
>>>>>
>>>>>   gethostname(name, maxlen);
>>>>>   name[19] = '\0';
>>>>>   printf("Hello, world. I am %d of %d and host %s \n", rank, ...
>>>>>
>>>>> Cheers
>>>>>
>>>>> On 09/27/2011 07:40 PM, Phillip Vassenkov wrote:
>>>>>> Thanks, but my main concern is the segfault :P I changed it and, as I expected, it still segfaults.
>>>>>>
>>>>>> On 9/27/11 9:48 AM, Henderson, Brent wrote:
>>>>>>> Here is another possibly non-helpful suggestion. :) Change:
>>>>>>>
>>>>>>>   char* name[20];
>>>>>>>   int maxlen = 20;
>>>>>>>
>>>>>>> to:
>>>>>>>
>>>>>>>   char name[256];
>>>>>>>   int maxlen = 256;
>>>>>>>
>>>>>>> gethostname() is supposed to properly truncate the hostname it returns if the actual name is longer than the length provided, but since you have at least one that is longer than 20 characters, I'm curious.
>>>>>>>
>>>>>>> Brent
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres
>>>>>>> Sent: Tuesday, September 27, 2011 6:29 AM
>>>>>>> To: Open MPI Users
>>>>>>> Subject: Re: [OMPI users] Segfault on any MPI communication on head node
>>>>>>>
>>>>>>> Hmm. It's not immediately clear to me what's going wrong here.
>>>>>>>
>>>>>>> I hate to ask, but could you install a debugging version of Open MPI and capture a proper stack trace of the segv?
>>>>>>>
>>>>>>> Also, could you try the 1.4.4 rc and see if that magically fixes the problem? (I'm about to post a new 1.4.4 rc later this morning, but either the current one or the one from later today would be a good datapoint.)
>>>>>>>
>>>>>>> On Sep 26, 2011, at 5:09 PM, Phillip Vassenkov wrote:
>>>>>>>
>>>>>>>> Yep, Fedora Core 14 and Open MPI 1.4.3.
>>>>>>>>
>>>>>>>> On 9/24/11 7:02 AM, Jeff Squyres wrote:
>>>>>>>>> Are you running the same OS version and Open MPI version between the head node and regular nodes?
>>>>>>>>>
>>>>>>>>> On Sep 23, 2011, at 5:27 PM, Vassenkov, Phillip wrote:
>>>>>>>>>
>>>>>>>>>> Hey all,
>>>>>>>>>> I've been racking my brains over this for several days and was hoping anyone could enlighten me. I'll describe only the relevant parts of the network/computer systems. There is one head node and a multitude of regular nodes. The regular nodes are all identical to each other. If I run an MPI program from one of the regular nodes to any other regular node, everything works. If I include the head node in the hosts file, I get segfaults, which I'll paste below along with sample code. The machines are all networked via InfiniBand and Ethernet. The issue only arises when MPI communication occurs: MPI_Init might succeed, but the segfault always occurs on MPI_Barrier or MPI_Send/MPI_Recv. I found a workaround by disabling the openib BTL and forcing communication over InfiniBand (if I don't force InfiniBand, it goes over Ethernet). This command works when the head node is included in the hosts file:
>>>>>>>>>>
>>>>>>>>>>   mpirun --hostfile hostfile --mca btl ^openib --mca btl_tcp_if_include ib0 -np 2 ./b.out
>>>>>>>>>>
>>>>>>>>>> Sample code:
>>>>>>>>>>
>>>>>>>>>>   #include "mpi.h"
>>>>>>>>>>   #include <stdio.h>
>>>>>>>>>>   int main(int argc, char *argv[])
>>>>>>>>>>   {
>>>>>>>>>>       int rank, nprocs;
>>>>>>>>>>       char* name[20];
>>>>>>>>>>       int maxlen = 20;
>>>>>>>>>>       MPI_Init(&argc, &argv);
>>>>>>>>>>       MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
>>>>>>>>>>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>>>>>       MPI_Barrier(MPI_COMM_WORLD);
>>>>>>>>>>       gethostname(name, maxlen);
>>>>>>>>>>       printf("Hello, world. I am %d of %d and host %s \n", rank, nprocs, name);
>>>>>>>>>>       fflush(stdout);
>>>>>>>>>>       MPI_Finalize();
>>>>>>>>>>       return 0;
>>>>>>>>>>   }
>>>>>>>>>>
>>>>>>>>>> Segfault:
>>>>>>>>>>
>>>>>>>>>>   [pastec:19917] *** Process received signal ***
>>>>>>>>>>   [pastec:19917] Signal: Segmentation fault (11)
>>>>>>>>>>   [pastec:19917] Signal code: Address not mapped (1)
>>>>>>>>>>   [pastec:19917] Failing at address: 0x8
>>>>>>>>>>   [pastec:19917] [ 0] /lib64/libpthread.so.0() [0x34a880eeb0]
>>>>>>>>>>   [pastec:19917] [ 1] /usr/lib64/libmthca-rdmav2.so(+0x36aa) [0x7eff6430b6aa]
>>>>>>>>>>   [pastec:19917] [ 2] /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x133c9) [0x7eff66a163c9]
>>>>>>>>>>   [pastec:19917] [ 3] /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x1eb70) [0x7eff66a21b70]
>>>>>>>>>>   [pastec:19917] [ 4] /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x1ec89) [0x7eff66a21c89]
>>>>>>>>>>   [pastec:19917] [ 5] /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x1403d) [0x7eff66a1703d]
>>>>>>>>>>   [pastec:19917] [ 6] /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(+0x120e6) [0x7eff676670e6]
>>>>>>>>>>   [pastec:19917] [ 7] /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(+0x6273) [0x7eff6765b273]
>>>>>>>>>>   [pastec:19917] [ 8] /usr/lib64/openmpi/lib/openmpi/mca_coll_tuned.so(+0x1b2f) [0x7eff65539b2f]
>>>>>>>>>>   [pastec:19917] [ 9] /usr/lib64/openmpi/lib/openmpi/mca_coll_tuned.so(+0xa5cf) [0x7eff655425cf]
>>>>>>>>>>   [pastec:19917] [10] /usr/lib64/openmpi/lib/libmpi.so.0(MPI_Barrier+0x9e) [0x3a54c4c94e]
>>>>>>>>>>   [pastec:19917] [11] ./b.out(main+0x6e) [0x400a42]
>>>>>>>>>>   [pastec:19917] [12] /lib64/libc.so.6(__libc_start_main+0xfd) [0x34a841ee5d]
>>>>>>>>>>   [pastec:19917] [13] ./b.out() [0x400919]
>>>>>>>>>>   [pastec:19917] *** End of error message ***
>>>>>>>>>>   [pastec.gtri.gatech.edu:19913] [[18526,0],0]-[[18526,1],1] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
>>>>>>>>>>   --------------------------------------------------------------------------
>>>>>>>>>>   mpirun noticed that process rank 1 with PID 19917 on node pastec.gtri.gatech.edu exited on signal 11 (Segmentation fault).
>>>>>>>>>>   --------------------------------------------------------------------------
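For reference, a minimal corrected sketch of the test program that folds in the two fixes suggested in the thread: a plain char array in place of char* name[20] (Brent's suggestion) and explicit NUL termination after gethostname() (German's suggestion). The #include <unistd.h> for gethostname() is an addition not shown in the original post. This only fixes the hostname handling; it does not address the openib segfault that the debug build and the ^openib workaround are meant to diagnose.

  #include "mpi.h"
  #include <stdio.h>
  #include <unistd.h>              /* gethostname(); not included in the original post */

  int main(int argc, char *argv[])
  {
      int rank, nprocs;
      char name[256];              /* a char array, not an array of 20 uninitialized char* */
      int maxlen = 256;

      MPI_Init(&argc, &argv);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Barrier(MPI_COMM_WORLD);

      gethostname(name, maxlen);
      name[maxlen - 1] = '\0';     /* ensure termination if the hostname was truncated */

      printf("Hello, world. I am %d of %d and host %s\n", rank, nprocs, name);
      fflush(stdout);

      MPI_Finalize();
      return 0;
  }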