I went into the directory that I used to install 1.4.3 and did the following:
make clean
./configure --enable-debug
make -j8 all install
Now it hangs with the following error when I try to run my code (I commented
out all the host name stuff, so it's just MPI code now):
[hostname:16574] [[17705,0],0] ORTE_ERROR_LOG: Buffer type (described vs non-described) mismatch - operation not allowed in file base/odls_base_default_fns.c at line 2600
I'm googling for more info, but does anyone have any ideas?
On 9/28/11 8:30 PM, Jeff Squyres wrote:
Use --enable-debug on your configure line. This will add in some debugging
code to OMPI, and it'll compile everything with -g so that you can get stack
traces.
Beware that the extra debugging junk makes OMPI slightly slower; don't do any
benchmarking with this install, etc.
On Sep 28, 2011, at 6:27 PM, Phillip Vassenkov wrote:
I tried 1.4.4rc4, same problem. Where do I get a debugging version?
On 9/28/11 8:32 AM, Jeff Squyres wrote:
Agreed that the original program had the char*[20]/char[20] bug, but his segv
is occurring before trying to use that array. So it's a bug - but he just
hadn't hit it yet. :-)
I'd still like to see a debugging version so that we can get a real stack
trace, and/or try the latest 1.4.4 RC (posted yesterday).
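To make the char*[20]/char[20] distinction concrete, here is a small sketch that compares the storage behind the two declarations. It is not from the original program; the helper names pointer_array_bytes and char_array_bytes are made up for illustration.

```c
#include <stddef.h>

/* Hypothetical helpers, for illustration only. */

/* Storage occupied by the declaration used in the original program. */
size_t pointer_array_bytes(void)
{
    char *name[20];      /* 20 pointers to char, none of them initialized */
    return sizeof name;  /* 20 * sizeof(char *) */
}

/* Storage occupied by the declaration that was probably intended. */
size_t char_array_bytes(void)
{
    char name[20];       /* 20 chars: a 19-character string plus '\0' */
    return sizeof name;  /* always 20 */
}
```

On a platform with 8-byte pointers the first helper returns 160, so gethostname(name, 20) would not run past the end of that storage. That may be why the mistake had not caused visible trouble yet, although the argument type is still wrong and a compiler should warn about it.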
On Sep 27, 2011, at 3:08 PM, German Hoecht wrote:
char* name[20]; yields 20 (undefined) pointers to char, guess you mean
char name[20];
So Brent's suggestion should work as well(?)
To be safe I would also add:
gethostname(name,maxlen);
name[19] = '\0';
printf("Hello, world. I am %d of %d and host %s \n", rank, ...
Cheers
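The suggestion above (a plain char buffer plus an explicit terminator) could be wrapped up roughly as follows. This is a minimal sketch, not code from the thread: safe_hostname is a hypothetical helper name, and it terminates at the last byte of whatever buffer it is given rather than at a hard-coded index 19.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>   /* gethostname() */

/*
 * Hypothetical helper (not part of the original program): copy the host
 * name into buf and guarantee that buf holds a terminated string.
 * Returns 0 on success and -1 if gethostname() reports an error.
 */
int safe_hostname(char *buf, size_t len)
{
    assert(buf != NULL && len > 0);

    int rc = gethostname(buf, len);

    /* POSIX leaves it unspecified whether a truncated name is terminated,
     * so terminate unconditionally. */
    buf[len - 1] = '\0';

    if (rc != 0) {
        buf[0] = '\0';   /* leave an empty string behind on failure */
        return -1;
    }
    return 0;
}
```

In the sample program this would replace the bare gethostname() call, for example char name[256]; safe_hostname(name, sizeof name); before the printf.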
On 09/27/2011 07:40 PM, Phillip Vassenkov wrote:
Thanks, but my main concern is the segfault :P I changed it and, as I
expected, it still segfaults.
On 9/27/11 9:48 AM, Henderson, Brent wrote:
Here is another possibly non-helpful suggestion. :) Change:
char* name[20];
int maxlen = 20;
To:
char name[256];
int maxlen = 256;
gethostname() is supposed to properly truncate the hostname it returns
if the actual name is longer than the length provided, but since you
have at least one that is longer than 20 characters, I'm curious.
Brent
-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
On Behalf Of Jeff Squyres
Sent: Tuesday, September 27, 2011 6:29 AM
To: Open MPI Users
Subject: Re: [OMPI users] Segfault on any MPI communication on head node
Hmm. It's not immediately clear to me what's going wrong here.
I hate to ask, but could you install a debugging version of Open MPI
and capture a proper stack trace of the segv?
Also, could you try the 1.4.4 rc and see if that magically fixes the
problem? (I'm about to post a new 1.4.4 rc later this morning, but
either the current one or the one from later today would be a good
datapoint)
On Sep 26, 2011, at 5:09 PM, Phillip Vassenkov wrote:
Yep, Fedora Core 14 and OpenMPI 1.4.3
On 9/24/11 7:02 AM, Jeff Squyres wrote:
Are you running the same OS version and Open MPI version between the
head node and regular nodes?
On Sep 23, 2011, at 5:27 PM, Vassenkov, Phillip wrote:
Hey all,
I've been racking my brains over this for several days and was
hoping someone could enlighten me. I'll describe only the relevant
parts of the network/computer systems. There is one head node and a
multitude of regular nodes. The regular nodes are all identical to
each other. If I run an MPI program from one of the regular nodes
to any other regular nodes, everything works. If I include the head
node in the hosts file, I get segfaults, which I'll paste below
along with sample code. The machines are all networked via
InfiniBand and Ethernet. The issue only arises when MPI
communication occurs. By this I mean MPI_Init might succeed, but
the segfault always occurs on MPI_Barrier or MPI_Send/MPI_Recv. I found
a workaround by disabling the openib BTL and forcing
communication over the InfiniBand interface, ib0 (if I don't force
InfiniBand, it'll go over Ethernet). This command works when the head
node is included in the hosts file:
mpirun --hostfile hostfile --mca btl ^openib --mca btl_tcp_if_include ib0 -np 2 ./b.out
Sample Code:
#include "mpi.h"
#include <stdio.h>
int main(int argc, char *argv[])
{
    int rank, nprocs;
    char* name[20];
    int maxlen = 20;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);
    gethostname(name, maxlen);
    printf("Hello, world. I am %d of %d and host %s \n", rank, nprocs, name);
    fflush(stdout);
    MPI_Finalize();
    return 0;
}
Segfault:
[pastec:19917] *** Process received signal ***
[pastec:19917] Signal: Segmentation fault (11)
[pastec:19917] Signal code: Address not mapped (1)
[pastec:19917] Failing at address: 0x8
[pastec:19917] [ 0] /lib64/libpthread.so.0() [0x34a880eeb0]
[pastec:19917] [ 1] /usr/lib64/libmthca-rdmav2.so(+0x36aa) [0x7eff6430b6aa]
[pastec:19917] [ 2] /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x133c9) [0x7eff66a163c9]
[pastec:19917] [ 3] /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x1eb70) [0x7eff66a21b70]
[pastec:19917] [ 4] /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x1ec89) [0x7eff66a21c89]
[pastec:19917] [ 5] /usr/lib64/openmpi/lib/openmpi/mca_btl_openib.so(+0x1403d) [0x7eff66a1703d]
[pastec:19917] [ 6] /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(+0x120e6) [0x7eff676670e6]
[pastec:19917] [ 7] /usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so(+0x6273) [0x7eff6765b273]
[pastec:19917] [ 8] /usr/lib64/openmpi/lib/openmpi/mca_coll_tuned.so(+0x1b2f) [0x7eff65539b2f]
[pastec:19917] [ 9] /usr/lib64/openmpi/lib/openmpi/mca_coll_tuned.so(+0xa5cf) [0x7eff655425cf]
[pastec:19917] [10] /usr/lib64/openmpi/lib/libmpi.so.0(MPI_Barrier+0x9e) [0x3a54c4c94e]
[pastec:19917] [11] ./b.out(main+0x6e) [0x400a42]
[pastec:19917] [12] /lib64/libc.so.6(__libc_start_main+0xfd) [0x34a841ee5d]
[pastec:19917] [13] ./b.out() [0x400919]
[pastec:19917] *** End of error message ***
[pastec.gtri.gatech.edu:19913] [[18526,0],0]-[[18526,1],1] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 19917 on node pastec.gtri.gatech.edu exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users