Sorry, I completely forgot to mention:
Open MPI 1.2.2, on Intel x86.

The segmentation fault is still there, but I'm waiting for my admin to fix the routing so I can check whether that was the problem. On my stand-alone Linux machine with correct routing it does not appear - but that's not proof.

The segmentation fault in MPI_Barrier occurs only when I run my program on the head node alone, e.g.:
mpirun -np 4 ./myproggy

If I also use the worker nodes:
mpirun -hostfile ./../hosts -np 50 ./myproggy

I get an error in MPI_Init:
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort (...)
 PML add procs failed
 --> Returned "Unreachable" (-12) instead of "Success" (0)

So with the hostfile and worker nodes it behaves correctly, i.e. it reports an error right at the beginning. But I wouldn't be too enthusiastic yet: there is a lot of strange and complicated code before the MPI_Barrier, which I updated recently, so it could be my own mistake that corrupted some memory used by Open MPI (valgrind does not complain, though). Who knows... I'll wait for my admin to fix the routing and will post more information as soon as possible. Also, since I recently upgraded Open MPI from a 1.x.x version, maybe some libraries got mixed up, or something else is wrong. In the meantime I want to rule out my own code with a minimal barrier-only test, see the sketch below.
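This is only a sketch of the kind of reproducer I have in mind (the file name is a placeholder, nothing here comes from my real application):

/* minimal_barrier.c -- minimal reproducer sketch: does MPI_Barrier alone
 * already segfault on the head node, without any of my application code? */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank = -1, size = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("rank %d of %d: before barrier\n", rank, size);

    /* If this call crashes with the same backtrace, the problem is not in
     * my application's memory handling. */
    MPI_Barrier(MPI_COMM_WORLD);

    printf("rank %d of %d: after barrier\n", rank, size);

    MPI_Finalize();
    return 0;
}

Compiled with mpicc and run the same way as above (mpirun -np 4 ./minimal_barrier).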

         greets, Marcin




Jelena Pjesivac-Grbovic wrote:
Hi Marcin,

what version of Open MPI did you use?
Is it still occurring?
It is also possible that the connection went down during execution... although a segfault really should not occur.

Thanks,
Jelena

On Tue, 29 May 2007, Marcin Skoczylas wrote:

hello,

Recently my administrator made some changes on our cluster, and now I have a crash during MPI_Barrier:

[our-host:12566] *** Process received signal ***
[our-host:12566] Signal: Segmentation fault (11)
[our-host:12566] Signal code: Address not mapped (1)
[our-host:12566] Failing at address: 0x4
[our-host:12566] [ 0] /lib/tls/libpthread.so.0 [0xa22f80]
[our-host:12566] [ 1] /usr/lib/openmpi/mca_btl_sm.so(mca_btl_sm_component_progress+0x68f) [0xcd86d7]
[our-host:12566] [ 2] /usr/lib/openmpi/mca_bml_r2.so(mca_bml_r2_progress+0x32) [0xcb7e3a]
[our-host:12566] [ 3] /usr/lib/libopen-pal.so.0(opal_progress+0xed) [0xc2b221]
[our-host:12566] [ 4] /usr/lib/libmpi.so.0 [0x3aecc5]
[our-host:12566] [ 5] /usr/lib/libmpi.so.0(ompi_request_wait_all+0xec) [0x3ae784]
[our-host:12566] [ 6] /usr/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_sendrecv_actual+0x77) [0xd025bb]
[our-host:12566] [ 7] /usr/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_barrier_intra_recursivedoubling+0xde) [0xd05e3a]
[our-host:12566] [ 8] /usr/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_barrier_intra_dec_fixed+0x44) [0xd027d8]
[our-host:12566] [ 9] /usr/lib/libmpi.so.0(PMPI_Barrier+0x176) [0x3c0cea]

Actually, I did a small investigation and realised that:

[user@our-host]$ ssh our-host
ssh(12704) ssh: connect to host our-host port 22: No route to host

That could be the cause; I'm going to talk with my admin about this routing change soon. However, if this really is the problem, shouldn't it be detected during startup, e.g. in MPI_Init? Actually, I'm not sure...
your comments?
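One idea I might try (just a sketch, untested here; it won't catch an actual segfault, but it should at least turn reported errors like "Unreachable" into something I can print) is to switch MPI_COMM_WORLD to MPI_ERRORS_RETURN and check the return codes explicitly:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rc, rank = -1, len = 0;
    char msg[MPI_MAX_ERROR_STRING];

    MPI_Init(&argc, &argv);

    /* Ask MPI to return error codes instead of aborting the job, so that
     * failures in later calls can be reported explicitly. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    rc = MPI_Barrier(MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "rank %d: MPI_Barrier failed: %s\n", rank, msg);
    }

    MPI_Finalize();
    return 0;
}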

            greetings, Marcin




--
Jelena Pjesivac-Grbovic, Pjesa
Graduate Research Assistant
Innovative Computing Laboratory
Computer Science Department, UTK
Claxton Complex 350
(865) 974 - 6722 (865) 974 - 6321
jpjes...@utk.edu

Murphy's Law of Research:
         Enough research will tend to support your theory.

