Hi Jeff,
Thank you for your reply. I switched to another cluster that does not have
InfiniBand and ran HPL with:

mpirun --mca btl tcp,self -np 144 --hostfile /root/research/hostfile ./xhpl

It ran successfully, but if I remove "--mca btl tcp,self" it no longer runs,
so I suspect that Open MPI 1.2 cannot identify the proper network interface
and set the correct parameters for it on its own.
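If it helps narrow things down, I can also try pinning the TCP interface
explicitly on that cluster; eth0 below is only a placeholder, and I would
take the real device name from ifconfig on the compute nodes:
mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 -np 144 --hostfile /root/research/hostfile ./xhpl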
I then went back to the previous cluster, which has InfiniBand, and ran the
same command. It hangs forever.

I changed the command to:

mpirun --mca btl_tcp_if_include ib0 --hostfile /root/research/hostfile-ib -np 48 ./xhpl

It launches successfully, but gives the following errors when HPL tries to
split the communicator:

[node1.novalocal:09562] *** An error occurred in MPI_Comm_split
[node1.novalocal:09562] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
[node1.novalocal:09562] *** MPI_ERR_IN_STATUS: error code in status
[node1.novalocal:09562] *** MPI_ERRORS_ARE_FATAL (goodbye)
[node1.novalocal:09583] *** An error occurred in MPI_Comm_split
[node1.novalocal:09583] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
[node1.novalocal:09583] *** MPI_ERR_IN_STATUS: error code in status
[node1.novalocal:09583] *** MPI_ERRORS_ARE_FATAL (goodbye)
[node1.novalocal:09637] *** An error occurred in MPI_Comm_split
[node1.novalocal:09637] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
[node1.novalocal:09637] *** MPI_ERR_IN_STATUS: error code in status
[node1.novalocal:09637] *** MPI_ERRORS_ARE_FATAL (goodbye)
[node1.novalocal:09994] *** An error occurred in MPI_Comm_split
[node1.novalocal:09994] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
[node1.novalocal:09994] *** MPI_ERR_IN_STATUS: error code in status
[node1.novalocal:09994] *** MPI_ERRORS_ARE_FATAL (goodbye)
mpirun noticed that job rank 0 with PID 46005 on node test-ib exited on signal 15 (Terminated).
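
For completeness, a combination I could also try on the InfiniBand cluster is
forcing TCP-only transport over the IPoIB interface, i.e. both settings at
once (assuming ib0 is the IPoIB device reported by ifconfig):
mpirun --mca btl tcp,self --mca btl_tcp_if_include ib0 --hostfile /root/research/hostfile-ib -np 48 ./xhpl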

I hope you can give me some suggestions. Thank you.

Kaiming Ouyang, Research Assistant.
Department of Computer Science and Engineering
University of California, Riverside
900 University Avenue, Riverside, CA 92521


On Mon, Mar 19, 2018 at 7:35 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> That's actually failing in a shared memory section of the code.
>
> But to answer your question, yes, Open MPI 1.2 did have IB support.
>
> That being said, I have no idea what would cause this shared memory segv --
> it's quite possible that it's simple bit rot (i.e., v1.2.9 was released 9
> years ago -- see https://www.open-mpi.org/software/ompi/versions/timeline.php.
> Perhaps it does not function correctly on modern glibc/Linux kernel-based
> platforms).
>
> Can you upgrade to a [much] newer Open MPI?
>
>
>
> > On Mar 19, 2018, at 8:29 PM, Kaiming Ouyang <kouya...@ucr.edu> wrote:
> >
> > Hi everyone,
> > Recently I needed to compile the High-Performance Linpack (HPL) code with
> > Open MPI 1.2 (a fairly old version). When the compilation finishes and I
> > try to run, I get the following errors:
> >
> > [test:32058] *** Process received signal ***
> > [test:32058] Signal: Segmentation fault (11)
> > [test:32058] Signal code: Address not mapped (1)
> > [test:32058] Failing at address: 0x14a2b84b6304
> > [test:32058] [ 0] /lib64/libpthread.so.0(+0xf5e0) [0x14eb116295e0]
> > [test:32058] [ 1] /root/research/lib/openmpi-1.2.9/lib/openmpi/mca_btl_sm.so(mca_btl_sm_component_progress+0x28a) [0x14eaa81258aa]
> > [test:32058] [ 2] /root/research/lib/openmpi-1.2.9/lib/openmpi/mca_bml_r2.so(mca_bml_r2_progress+0x2b) [0x14eaa853219b]
> > [test:32058] [ 3] /root/research/lib/openmpi-1.2.9/lib/libopen-pal.so.0(opal_progress+0x4a) [0x14eb128dbaaa]
> > [test:32058] [ 4] /root/research/lib/openmpi-1.2.9/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_wait+0x1d) [0x14eaf41e6b4d]
> > [test:32058] [ 5] /root/research/lib/openmpi-1.2.9/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_recv+0x3a5) [0x14eaf41eac45]
> > [test:32058] [ 6] /root/research/lib/openmpi-1.2.9/lib/libopen-rte.so.0(mca_oob_recv_packed+0x33) [0x14eb12b62223]
> > [test:32058] [ 7] /root/research/lib/openmpi-1.2.9/lib/openmpi/mca_gpr_proxy.so(orte_gpr_proxy_put+0x1f9) [0x14eaf3dd7db9]
> > [test:32058] [ 8] /root/research/lib/openmpi-1.2.9/lib/libopen-rte.so.0(orte_smr_base_set_proc_state+0x31d) [0x14eb12b7893d]
> > [test:32058] [ 9] /root/research/lib/openmpi-1.2.9/lib/libmpi.so.0(ompi_mpi_init+0x8d6) [0x14eb13202136]
> > [test:32058] [10] /root/research/lib/openmpi-1.2.9/lib/libmpi.so.0(MPI_Init+0x6a) [0x14eb1322461a]
> > [test:32058] [11] ./xhpl(main+0x5d) [0x404e7d]
> > [test:32058] [12] /lib64/libc.so.6(__libc_start_main+0xf5) [0x14eb11278c05]
> > [test:32058] [13] ./xhpl() [0x4056cb]
> > [test:32058] *** End of error message ***
> > mpirun noticed that job rank 0 with PID 31481 on node test.novalocal exited on signal 15 (Terminated).
> > 23 additional processes aborted (not shown)
> >
> > The machine has InfiniBand, so I wonder whether Open MPI 1.2 supports
> > InfiniBand by default. I also tried running it not through InfiniBand, but
> > then the program only handles small input sizes; when I increase the
> > problem size and grid size, it just hangs. The program is a standard
> > benchmark, so I don't think the problem is in its code. Any idea? Thanks.
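> > (A minimal sketch of what I mean by running it not through InfiniBand,
> > assuming forcing the TCP and self BTLs is the right way to do that, with
> > the process count only as an example:
> > mpirun --mca btl tcp,self -np 24 --hostfile /root/research/hostfile ./xhpl)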
> >
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com