Re: [OMPI users] How to build OMPI with Checkpoint/restart.
On Sep 16, 2009, at 8:30 AM, Marcin Stolarek wrote:

> Hi,
> It seems I solved my problem. The root of the error was that I hadn't loaded the blcr kernel module, so I couldn't checkpoint even a one-thread application.

I am glad to hear that you have things working now.

> However, I still can't find MCA:blcr in ompi_info --all, even though it's working.

This may have been a red herring, sorry. I think ompi_info will only show the 'none' component due to the way it searches for components in the system. This is a bug in how the CRS selection logic interacts with ompi_info. I will take a note/file a bug to look into fixing it. Unfortunately I do not have a workaround other than looking in the install directory for the mca_crs_blcr.so file.

-- Josh

> marcin

2009/9/15 Marcin Stolarek

Hi,
I've done everything from the beginning:

rm -r $ompi_install
make clean
make
make install

In $ompi_install, I've got the files you mentioned:

mstol@halo2:/home/guests/mstol/openmpi/lib/openmpi# ls mca_crs_bl*
mca_crs_blcr.la  mca_crs_blcr.so

but, when I try:

mstol@halo2:/home/guests/mstol/openmpi/openmpi-1.3.3# ompi_info --all | grep "crs:"
    MCA crs: none (MCA v2.0, API v2.0, Component v1.3.3)
    MCA crs: parameter "crs_base_verbose" (current value: "0", data source: default value)
    MCA crs: parameter "crs" (current value: "none", data source: default value)
    MCA crs: parameter "crs_none_select_warning" (current value: "0", data source: default value)
    MCA crs: parameter "crs_none_priority" (current value: "0", data source: default value)

I don't have the crs: blcr component.

marcin

2009/9/14 Josh Hursey

The config.log looked fine, so I think you have fixed the configure problem that you previously posted about. Though the config.log indicates that the BLCR component is scheduled for compile, ompi_info does not indicate that it is available.

I suspect that the error below is because the CRS could not find any CRS components to select (though there should have been an error displayed indicating as such).

I would check your Open MPI installation to make sure that it is the one that you configured. Specifically, I would check that the following files are in the installation location:

$install_dir/lib/openmpi/mca_crs_blcr.so
$install_dir/lib/openmpi/mca_crs_blcr.la

If that checks out, then I would remove the old installation directory and try reinstalling fresh.

Let me know how it goes.

-- Josh

On Sep 13, 2009, at 5:49 AM, Marcin Stolarek wrote:

I've tried another time. Here is what I get when trying to run using 1.4a1r21964:

(terminus:~) mstol% mpirun --am ft-enable-cr ./a.out
--------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_cr_init() failed failed
  --> Returned value -1 instead of OPAL_SUCCESS
--------------------------------------------------------------------------
[terminus:06120] [[INVALID],INVALID] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at line 79
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  ompi_mpi_init: orte_init failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[terminus:6120] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------

I've included config.log and the ompi_info --all output in the attachment. LD_LIBRARY_PATH is set correctly.
Any idea?

marcin

2009/9/12 Marcin Stolarek

Hi,
I'm trying to compile Open MPI with checkpoint/restart via BLCR. I'm not sure which path I should set as the value of the --with-blcr option. I'm using the 1.3.3 release; which version of BLCR should I use? I've compiled the newest version of BLCR with --prefix=$BLCR, and I've put
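For anyone hitting the same build question, a configure recipe along these lines is what typically produces the blcr CRS component; the install prefix and the BLCR location are placeholders that need to match your own paths:

./configure --prefix=$HOME/openmpi-cr \
            --with-ft=cr --enable-ft-thread --enable-mpi-threads \
            --with-blcr=/usr/local/blcr
make all install

# Verify that the component landed in the install tree:
ls $HOME/openmpi-cr/lib/openmpi/mca_crs_blcr.*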
[OMPI users] infiniband question
Hi,

I'm new to InfiniBand. I installed the rdma_cm, rdma_ucm and ib_uverbs kernel modules. When I run a ring-test Open MPI code, I get:

[Lidia][0,1,1][btl_openib_endpoint.c:992:mca_btl_openib_endpoint_qp_init_query] Set MTU to IBV value 4 (2048 bytes)
[Lidia][0,1,1][btl_openib_endpoint.c:992:mca_btl_openib_endpoint_qp_init_query] Set MTU to IBV value 4 (2048 bytes)
[Lilou][0,1,0][btl_openib_endpoint.c:992:mca_btl_openib_endpoint_qp_init_query] Set MTU to IBV value 4 (2048 bytes)
[Lilou][0,1,0][btl_openib_endpoint.c:992:mca_btl_openib_endpoint_qp_init_query] Set MTU to IBV value 4 (2048 bytes)

And then the program hangs. I thought I only need RDMA communications, and don't need the DAPL lib (with the IPoIB module). Am I wrong?

Thanks,

Yann

--
Yann JOBIC
HPC engineer
Polytech Marseille DME
IUSTI-CNRS UMR 6595
Technopôle de Château Gombert
5 rue Enrico Fermi
13453 Marseille cedex 13
Tel : (33) 4 91 10 69 39 ou (33) 4 91 10 69 43
Fax : (33) 4 91 10 69 69
Re: [OMPI users] infiniband question
Correct, you don't need DAPL.

Can you send all the information listed here:

    http://www.open-mpi.org/community/help/

On Sep 17, 2009, at 9:17 AM, Yann JOBIC wrote:
> [...]

--
Jeff Squyres
jsquy...@cisco.com
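While gathering that, a couple of quick sanity checks usually help narrow down a hang like this; the commands below assume the standard verbs utilities are installed, and "ring" stands in for your own test program:

# Confirm the HCA is visible and its ports are ACTIVE:
ibv_devinfo | grep -E 'hca_id|state'

# Restrict Open MPI to the openib and self BTLs so a TCP fallback cannot mask the problem:
mpirun --mca btl openib,self -np 2 ./ring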
Re: [OMPI users] Multi-threading with OpenMPI ?
On Sep 16, 2009, at 9:53 PM, Ralph Castain wrote:
> Only the obvious, and not very helpful one: comm_spawn isn't thread safe at this time. You'll need to serialize your requests to that function.

This is likely the cause of your issues if you are calling MPI_COMM_SPAWN in multiple threads simultaneously. Can you verify?

If not, we'll need to dig a little deeper to figure out what's going on. But Ralph is right -- read up on the THREAD_MULTIPLE constraints (check the OMPI README file) to see if that's what's biting you.

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI users] How does OpenMPI decided to use which algorithm inMPI_Bcast????????????????
Search through the mailing list archives -- this question has been discussed a few times.

On Sep 3, 2009, at 2:03 AM, shan axida wrote:
> Hi,
> I had a glance at the Open MPI source code, and there are several algorithms for the MPI_Bcast function. My question is: how is the algorithm chosen for a given MPI_Bcast call? By message size? Can anyone give me a little more detailed information on this?
> Thanks a lot.
> Axida

--
Jeff Squyres
jsquy...@cisco.com
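For what it's worth, the broadcast decision logic lives in the "tuned" collective component, and its choice (largely driven by message and communicator size) can be inspected or overridden with MCA parameters. The parameter names below are what I believe the 1.3 series uses; please double-check them with ompi_info on your own install:

# List the tuned component's broadcast-related parameters:
ompi_info --param coll tuned | grep bcast

# Force a particular broadcast algorithm (0 lets the component decide):
mpirun --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_bcast_algorithm 6 -np 16 ./a.out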
Re: [OMPI users] OpenMPI much slower than Mpich2
Sorry for the delay in replying; my INBOX has become a disaster recently. More below.

On Sep 14, 2009, at 5:08 AM, Sam Verboven wrote:
> Dear All,
> I'm having the following problem. If I execute the exact same application using both Open MPI and MPICH2, the former takes more than 2 times as long. When we compared the ganglia output we could see that Open MPI generates more than 60 percent System CPU whereas MPICH2 only has about 5, the remaining cycles all going to User CPU. This about explains the slowdown: when using Open MPI we lose more than half the cycles to operating-system overhead. We would very much like to know why our Open MPI installation incurs such a dramatic overhead.
> The only reason I could think of myself is the fact that we use bridged network interfaces on the cluster. Open MPI would not run properly until we specified the MCA option to use the br0 interface instead of the physical eth0. MPICH2 does not require any extra parameters.

What did Open MPI do when you did not specify br0? I assume that br0 is a combination of some other devices, like eth0 and eth1? If so, what happens if you use "btl_tcp_if_include eth0,eth1" instead of br0?

> The calculations themselves are done using Fortran. The operating system is Ubuntu 9.04, we have 14 dual quad-core nodes, and both Open MPI and MPICH2 are compiled from source without any configure options.
>
> Full command Open MPI:
> mpirun.openmpi --mca btl_tcp_if_include br0 --prefix /usr/shares/mpi/openmpi -hostfile hostfile -np 224 /home/arickx/bin/Linux/F_me_Kl1l2_3cl_mpi_2
>
> Full command MPICH2:
> mpiexec.mpich2 -machinefile machinefile -np 113 /home/arickx/bin/Linux/F_me_Kl1l2_3cl_mpi_2

I notice that you're running almost 2x the number of processes for Open MPI as MPICH2 -- does increasing the number of processes increase the problem size, or have some other effect on overall run-time?

--
Jeff Squyres
jsquy...@cisco.com
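As a concrete version of the suggestion above, using the poster's own command line with the bridge replaced by its member interfaces (the interface names are placeholders for whatever actually sits under br0):

mpirun.openmpi --mca btl_tcp_if_include eth0,eth1 \
    --prefix /usr/shares/mpi/openmpi -hostfile hostfile -np 224 \
    /home/arickx/bin/Linux/F_me_Kl1l2_3cl_mpi_2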
Re: [OMPI users] Application hangs when checkpointing application (update)
Interesting. I'll try to take a look and see if I can reproduce today.

-- Josh

On Sep 14, 2009, at 4:54 PM, Jean Potsam wrote:

Hi Josh,

Thanks for the response. I am actually testing it on a single node (though in the near future I will run it on a set of nodes). Therefore, my application is running on the same machine as mpirun. When I run the application and trigger the checkpointing mechanism from a separate terminal, it checkpoints fine. However, when I try to checkpoint it from within the main program as shown below, it hangs.

kind regards,

Jean

--- On Mon, 14/9/09, Josh Hursey wrote:

From: Josh Hursey
Subject: Re: [OMPI users] Application hangs when checkpointing application (update)
To: "Open MPI Users"
Date: Monday, 14 September, 2009, 1:27 PM

Is your application running on the same machine as mpirun? How did you configure Open MPI?

Note that this program will not work without the FT thread enabled, which would be one reason why it would seem to hang (since it is waiting for the application to enter the MPI library):
  --enable-ft-thread --enable-mpi-threads

I do not think the message that you saw is related. Often orte_checkpoint cannot figure out the jobid on first contact with the HNP/mpirun process, so this is displayed as an INVALID handle.

-- Josh

On Sep 11, 2009, at 9:50 AM, Jean Potsam wrote:

> Hi Everyone,
> I noticed that it hangs just before displaying the following while trying to checkpoint the application.
>
> [sun06:15252] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
>
> Can it be related to the above?
>
> Thanks
>
> --
> Hi Everyone,
> I wrote a small program with a function to trigger the checkpointing mechanism as follows:
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <mpi.h>
>
> void trigger_checkpoint();
>
> int main(int argc, char **argv)
> {
>     int rank, size;
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>     printf("I am processor no %d of a total of %d procs \n", rank, size);
>     system("sleep 10");
>     trigger_checkpoint();
>     printf("I am processor no %d of a total of %d procs \n", rank, size);
>     system("sleep 10");
>     printf("I am processor no %d of a total of %d procs \n", rank, size);
>     system("sleep 10");
>     printf("bye \n");
>     MPI_Finalize();
>     return 0;
> }
>
> void trigger_checkpoint()
> {
>     printf("hi\n");
>     system("ompi-checkpoint -v `pidof mpirun` ");
> }
>
> The application works fine on my laptop with Ubuntu as the OS. However, when I tried running it on one of the machines at my uni, with SUSE Linux installed, the application hangs as soon as ompi-checkpoint is triggered. This is what I get:
>
> I am processor no 0 of a total of 1 procs
> hi
> I am processor no 0 of a total of 1 procs
> [sun06:15426] orte_checkpoint: Checkpointing...
> [sun06:15426]   PID 15411
> [sun06:15426]   Connected to Mpirun [[12727,0],0]
> [sun06:15426] orte_checkpoint: notify_hnp: Contact Head Node Process PID 15411
>
> Does anyone have some ideas about this?
>
> Thanks a lot
>
> Jean.
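For reference, this is roughly how I exercise checkpoint/restart on a single node; it assumes Open MPI was built with --with-ft=cr, --enable-ft-thread and --enable-mpi-threads, and the application name is a placeholder:

# Launch with checkpoint/restart support enabled:
mpirun -am ft-enable-cr -np 2 ./my_app

# From another terminal (or from within the program, as above):
ompi-checkpoint -v `pidof mpirun`

# Restart later from the resulting global snapshot (the PID suffix will differ):
ompi-restart ompi_global_snapshot_<pid_of_mpirun>.ckpt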
Re: [OMPI users] Multi-threading with OpenMPI ?
Hi Jeff, Ralph,

Yes, I call MPI_COMM_SPAWN in multiple threads simultaneously. Because I need to expose my parallel algorithm as a web service, I need multiple clients to connect and execute my logic at the same time (i.e. multiple threads). For each client, a new thread is created (by the web service framework), and inside that thread MPI_Init_thread() is called if MPI hasn't been initialized. The thread then calls MPI_COMM_SPAWN and creates new processes.

So, if this is the case, aren't there any workarounds?

Thanks in advance,
umanga

Jeff Squyres wrote:
> [...]
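One thing worth checking in that setup is whether the library actually grants MPI_THREAD_MULTIPLE; if the provided level comes back lower, calling MPI from several threads is unsafe regardless of comm_spawn. A minimal sketch (assuming the build was configured with --enable-mpi-threads):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int provided;

    /* Request full multi-threaded support and check what we actually got. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "Warning: MPI only provides thread level %d\n", provided);
    }

    MPI_Finalize();
    return 0;
}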
[OMPI users] running open mpi on ubuntu 9.04
Dear Open MPI people:

I'm trying to run a simple "hello world" program on Ubuntu 9.04. It's on a dual-core laptop; no other machines. Here is the output:

erin@erin-laptop:~$ mpirun -np 2 a.out
ssh: connect to host erin-laptop port 22: Connection refused
--------------------------------------------------------------------------
A daemon (pid 11854) died unexpectedly with status 255 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished

erin@erin-laptop:~$

Any help would be much appreciated.

Sincerely,
Erin

Erin M. Hodgess, PhD
Associate Professor
Department of Computer and Mathematical Sciences
University of Houston - Downtown
mailto: hodge...@uhd.edu
Re: [OMPI users] Multi-threading with OpenMPI ?
Only thing I can suggest is to place a thread lock around the call to comm_spawn so that only one thread at a time can execute that function. The call to mpi_init_thread is fine - you just need to explicitly protect the call to comm_spawn.

On Sep 17, 2009, at 7:44 PM, Ashika Umanga Umagiliya wrote:
> [...]
Re: [OMPI users] running open mpi on ubuntu 9.04
I gather you must be running a version of the old 1.2 series? Or are you running 1.3? It does make a difference as to the nature of the problem, and the recommended solution.

Thanks
Ralph

On Sep 17, 2009, at 8:51 PM, Hodgess, Erin wrote:
> [...]
Re: [OMPI users] running open mpi on ubuntu 9.04
It's 1.3, please.

Thanks,
Erin

Erin M. Hodgess, PhD
Associate Professor
Department of Computer and Mathematical Sciences
University of Houston - Downtown
mailto: hodge...@uhd.edu

-----Original Message-----
From: users-boun...@open-mpi.org on behalf of Ralph Castain
Sent: Thu 9/17/2009 10:39 PM
To: Open MPI Users
Subject: Re: [OMPI users] running open mpi on ubuntu 9.04

> [...]
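In case it helps while the version question is sorted out: the "ssh: connect to host erin-laptop" line suggests mpirun is treating the local machine as a remote node. One workaround worth trying on a single laptop (just a guess from the symptom; a.out stands for your own program) is to point the launch explicitly at localhost:

mpirun -np 2 --host localhost ./a.out

# or, equivalently, with a one-line hostfile:
echo "localhost slots=2" > myhosts
mpirun -np 2 --hostfile myhosts ./a.out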
Re: [OMPI users] Multi-threading with OpenMPI ?
Thanks Ralph,

I don't have much experience in this area. Shall I use pthread_mutex_lock()/pthread_mutex_unlock(), etc., or the following, which I saw in the Open MPI source:

static opal_mutex_t ompi_lock;

OPAL_THREAD_LOCK(&ompi_lock);
// ...
OPAL_THREAD_UNLOCK(&ompi_lock);

Thanks in advance,
umanga

Ralph Castain wrote:
> [...]
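In case a concrete sketch helps: a plain pthread mutex around the spawn call is enough for the serialization Ralph describes, and it avoids relying on Open MPI's internal OPAL macros (which are not part of the public API). This is only an illustration; the command name and process count are placeholders for whatever your web-service threads actually spawn:

#include <pthread.h>
#include <mpi.h>

/* One process-wide lock shared by all web-service threads. */
static pthread_mutex_t spawn_lock = PTHREAD_MUTEX_INITIALIZER;

/* Serialize MPI_Comm_spawn: only one thread may be inside it at a time. */
static int spawn_workers(const char *cmd, int nprocs, MPI_Comm *intercomm)
{
    int rc;

    pthread_mutex_lock(&spawn_lock);
    rc = MPI_Comm_spawn((char *)cmd, MPI_ARGV_NULL, nprocs, MPI_INFO_NULL,
                        0, MPI_COMM_SELF, intercomm, MPI_ERRCODES_IGNORE);
    pthread_mutex_unlock(&spawn_lock);

    return rc;
}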