Thank you.
I am using the newest version of HPL.
I forgot to mention that I can run HPL with openmpi-3.0 over InfiniBand. The
reason I want to use the old version is that I need to compile a library
that only supports the old version of Open MPI, so I am trying to make that
work. Anyway, thank you for your reply, Jeff, and have a good day.

Kaiming Ouyang, Research Assistant.
Department of Computer Science and Engineering
University of California, Riverside
900 University Avenue, Riverside, CA 92521


On Mon, Mar 19, 2018 at 8:39 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> I'm sorry; I can't help debug a version from 9 years ago.  The best
> suggestion I have is to use a modern version of Open MPI.
>
> Note, however, your use of "--mca btl ..." is going to have the same
> meaning for all versions of Open MPI.  The problem you showed in the first
> mail was with the shared memory transport.  Using "--mca btl tcp,self"
> means you're not using the shared memory transport.  If you don't specify
> "--mca btl tcp,self", Open MPI will automatically use the shared memory
> transport.  Hence, you could be running into the same (or similar/related)
> problem that you mentioned in the first mail -- i.e., something is going
> wrong with how the v1.2.9 shared memory transport is interacting with your
> system.
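>
> As a quick check (just a sketch -- adjust the process count and host file
> to your setup; this reuses the host file path from your later command),
> you could leave the shared memory transport out while keeping the
> IB-native path with something like:
>
>   mpirun --mca btl openib,self -np 48 --hostfile /root/research/hostfile-ib ./xhpl
>
> If that runs cleanly, the shared memory (sm) BTL is the likely culprit.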
>
> Likewise, "--mca btl_tcp_if_include ib0" tells the TCP BTL plugin to use
> the "ib0" network.  But if you have the openib BTL available (i.e., the
> IB-native plugin), that will be used instead of the TCP BTL because native
> verbs over IB performs much better than TCP over IB.  Meaning: if you
> specify btl_tcp_if_include without specifying "--mca btl tcp,self", then
> (assuming openib is available) the TCP BTL likely isn't used and the
> btl_tcp_if_include value is therefore ignored.
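>
> In other words, if you really do want to force TCP over the IB interface,
> both settings have to be given together -- for example (reusing your host
> file path; adjust the process count as needed):
>
>   mpirun --mca btl tcp,self --mca btl_tcp_if_include ib0 --hostfile /root/research/hostfile-ib -np 48 ./xhpl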
>
> Also, what version of Linpack are you using?  The error you show is
> usually indicative of an MPI application bug (the MPI_COMM_SPLIT error).
> If you're running an old version of xhpl, you should upgrade to the latest.
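>
> For reference, here is a minimal sketch of a correct MPI_Comm_split call
> (this is not HPL's actual code; the 4-column process grid is just an
> illustrative assumption):
>
>   #include <mpi.h>
>
>   int main(int argc, char **argv)
>   {
>       int rank, q = 4;       /* assume a grid with 4 columns */
>       MPI_Comm row_comm;
>
>       MPI_Init(&argc, &argv);
>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>       /* Ranks with the same color (row index) join the same new
>          communicator; the key (column index) orders them within it. */
>       MPI_Comm_split(MPI_COMM_WORLD, rank / q, rank % q, &row_comm);
>
>       MPI_Comm_free(&row_comm);
>       MPI_Finalize();
>       return 0;
>   }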
>
>
>
>
> > On Mar 19, 2018, at 9:59 PM, Kaiming Ouyang <kouya...@ucr.edu> wrote:
> >
> > Hi Jeff,
> > Thank you for your reply. I just switched to another cluster that does
> not have InfiniBand. I ran HPL with:
> > mpirun --mca btl tcp,self -np 144 --hostfile /root/research/hostfile ./xhpl
> >
> > It ran successfully, but if I remove "--mca btl tcp,self", it no longer
> runs. So I suspect that Open MPI 1.2 cannot identify the proper network
> interfaces and set the correct parameters for them.
> > Then I went back to the previous cluster with InfiniBand and typed the
> same command as above. It hangs forever.
> >
> > I changed the command to:
> > mpirun --mca btl_tcp_if_include ib0 --hostfile /root/research/hostfile-ib -np 48 ./xhpl
> >
> > It launches successfully, but gives me the following errors when HPL
> tries to split the communicator:
> >
> > [node1.novalocal:09562] *** An error occurred in MPI_Comm_split
> > [node1.novalocal:09562] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
> > [node1.novalocal:09562] *** MPI_ERR_IN_STATUS: error code in status
> > [node1.novalocal:09562] *** MPI_ERRORS_ARE_FATAL (goodbye)
> > [node1.novalocal:09583] *** An error occurred in MPI_Comm_split
> > [node1.novalocal:09583] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
> > [node1.novalocal:09583] *** MPI_ERR_IN_STATUS: error code in status
> > [node1.novalocal:09583] *** MPI_ERRORS_ARE_FATAL (goodbye)
> > [node1.novalocal:09637] *** An error occurred in MPI_Comm_split
> > [node1.novalocal:09637] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
> > [node1.novalocal:09637] *** MPI_ERR_IN_STATUS: error code in status
> > [node1.novalocal:09637] *** MPI_ERRORS_ARE_FATAL (goodbye)
> > [node1.novalocal:09994] *** An error occurred in MPI_Comm_split
> > [node1.novalocal:09994] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
> > [node1.novalocal:09994] *** MPI_ERR_IN_STATUS: error code in status
> > [node1.novalocal:09994] *** MPI_ERRORS_ARE_FATAL (goodbye)
> > mpirun noticed that job rank 0 with PID 46005 on node test-ib exited on signal 15 (Terminated).
> >
> > Hope you can give me some suggestions. Thank you.
> >
> > Kaiming Ouyang, Research Assistant.
> > Department of Computer Science and Engineering
> > University of California, Riverside
> > 900 University Avenue, Riverside, CA 92521
> >
> >
> > On Mon, Mar 19, 2018 at 7:35 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> > That's actually failing in a shared memory section of the code.
> >
> > But to answer your question, yes, Open MPI 1.2 did have IB support.
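> >
> > (If you want to verify that on your install, ompi_info from the same
> > 1.2.9 build should list the BTL components that were actually built --
> > e.g., something like "ompi_info | grep btl" should show an openib entry
> > if IB support was compiled in.)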
> >
> > That being said, I have no idea what would cause this shared memory segv
> -- it's quite possible that it's simple bit rot (i.e., v1.2.9 was released
> 9 years ago -- see https://www.open-mpi.org/software/ompi/versions/timeline.php.
> Perhaps it does not function correctly on modern glibc/Linux kernel-based
> platforms).
> >
> > Can you upgrade to a [much] newer Open MPI?
> >
> >
> >
> > > On Mar 19, 2018, at 8:29 PM, Kaiming Ouyang <kouya...@ucr.edu> wrote:
> > >
> > > Hi everyone,
> > > Recently I needed to compile the High-Performance Linpack (HPL) code
> with Open MPI version 1.2 (a bit old). When I finish compiling and try to
> run it, I get the following errors:
> > >
> > > [test:32058] *** Process received signal ***
> > > [test:32058] Signal: Segmentation fault (11)
> > > [test:32058] Signal code: Address not mapped (1)
> > > [test:32058] Failing at address: 0x14a2b84b6304
> > > [test:32058] [ 0] /lib64/libpthread.so.0(+0xf5e0) [0x14eb116295e0]
> > > [test:32058] [ 1] /root/research/lib/openmpi-1.2.9/lib/openmpi/mca_btl_sm.so(mca_btl_sm_component_progress+0x28a) [0x14eaa81258aa]
> > > [test:32058] [ 2] /root/research/lib/openmpi-1.2.9/lib/openmpi/mca_bml_r2.so(mca_bml_r2_progress+0x2b) [0x14eaa853219b]
> > > [test:32058] [ 3] /root/research/lib/openmpi-1.2.9/lib/libopen-pal.so.0(opal_progress+0x4a) [0x14eb128dbaaa]
> > > [test:32058] [ 4] /root/research/lib/openmpi-1.2.9/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_wait+0x1d) [0x14eaf41e6b4d]
> > > [test:32058] [ 5] /root/research/lib/openmpi-1.2.9/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_recv+0x3a5) [0x14eaf41eac45]
> > > [test:32058] [ 6] /root/research/lib/openmpi-1.2.9/lib/libopen-rte.so.0(mca_oob_recv_packed+0x33) [0x14eb12b62223]
> > > [test:32058] [ 7] /root/research/lib/openmpi-1.2.9/lib/openmpi/mca_gpr_proxy.so(orte_gpr_proxy_put+0x1f9) [0x14eaf3dd7db9]
> > > [test:32058] [ 8] /root/research/lib/openmpi-1.2.9/lib/libopen-rte.so.0(orte_smr_base_set_proc_state+0x31d) [0x14eb12b7893d]
> > > [test:32058] [ 9] /root/research/lib/openmpi-1.2.9/lib/libmpi.so.0(ompi_mpi_init+0x8d6) [0x14eb13202136]
> > > [test:32058] [10] /root/research/lib/openmpi-1.2.9/lib/libmpi.so.0(MPI_Init+0x6a) [0x14eb1322461a]
> > > [test:32058] [11] ./xhpl(main+0x5d) [0x404e7d]
> > > [test:32058] [12] /lib64/libc.so.6(__libc_start_main+0xf5) [0x14eb11278c05]
> > > [test:32058] [13] ./xhpl() [0x4056cb]
> > > [test:32058] *** End of error message ***
> > > mpirun noticed that job rank 0 with PID 31481 on node test.novalocal exited on signal 15 (Terminated).
> > > 23 additional processes aborted (not shown)
> > >
> > > The machine has InfiniBand, so I wonder whether Open MPI 1.2 supports
> InfiniBand by default. I also tried running it without InfiniBand, but the
> program can only handle small input sizes. When I increase the input size
> and grid size, it just hangs. The program I am running is a standard
> benchmark, so I don't think the problem is in the code itself. Any ideas?
> Thanks.
> > >
> >
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> >
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
