Re: [OMPI users] openmpi+torque: How to run a job in a subset of the allocation?
Thanks to Ralph, Gus and Georg for your input! I was diverted to other things for a week, but am now back on track... You deserve to have the question marks straightened out first:

The two applications are (1) a commercial solver and (2) an in-house code handling some special physics. The two codes run sequentially, taking turns one time step at a time: the commercial solver runs one time step, outputs some data to file, waits for the in-house code to do its work, and then goes on to the next time step. The in-house code is actually restarted each time it has work to do. (This is what I mean by "loosely coupled".) As the two codes never work at the same time, I would like them to use the same hardware. The commercial solver may be reluctant to release its cores to other processes ("aggressive"), but I hope it will at the end of each time step...

The commercial code will just be started as usual, using the full allocation of the MOAB job. The in-house code is the one using Open MPI, and I want it to use all of the cores in the first node of the allocation, and only those. As Ralph suggests, it seems very convenient to use the -host option with relative node syntax. I also found some other references on how mpirun handles host information from the resource managers. Starting the two codes as background jobs, like Georg does, sounds good.

I will simply give it a spin and see how it works... Thanks again for your time,
Ola
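For reference, Open MPI's mpirun accepts relative node indexing on its -host/--host option, which is one way to confine the in-house code to the first node of the Torque/MOAB allocation while the commercial solver keeps the full allocation. This is only a sketch: the executable name inhouse_code and the core count 8 are placeholders, and the exact spelling of the relative node syntax (+n0 for the first allocated node) should be checked against the mpirun man page for your Open MPI version.

# run the in-house code on the first node of the allocation only,
# one rank per core (replace 8 with the actual cores per node)
mpirun -np 8 -host +n0 ./inhouse_code &

The trailing & starts it as a background job inside the batch script, along the lines Georg described.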
[OMPI users] Re: Re: can you help me please? Thanks
I have a server with 12 cores. When I run my MPI program with 10 processes, only three processors do any work. Here is a picture of the problem. Why? Is the problem with process scheduling?

------------------ Original Message ------------------
From: "Bruno Coutinho"
Date: Dec 6, 2013, 11:14
To: "Open MPI Users"
Subject: Re: [OMPI users] Re: can you help me please ? thanks

Probably it was the change from the eager to the rendezvous protocol, as Jeff said.

If you don't know what these are, read this:
https://computing.llnl.gov/tutorials/mpi_performance/#Protocols
http://blogs.cisco.com/performance/what-is-an-mpi-eager-limit/
http://blogs.cisco.com/performance/eager-limits-part-2/

You can tune the eager limit by changing the MCA parameters btl_tcp_eager_limit (for TCP), btl_self_eager_limit (communication from one process to itself), btl_sm_eager_limit (shared memory), and btl_udapl_eager_limit or btl_openib_eager_limit (if you use InfiniBand).

2013/12/6 Jeff Squyres (jsquyres)
I sent you some further questions yesterday:

http://www.open-mpi.org/community/lists/users/2013/12/23158.php

On Dec 6, 2013, at 1:35 AM, <781578...@qq.com> wrote:

> Here is my code:
> int *a = (int*)malloc(sizeof(int)*number);
> MPI_Send(a, number, MPI_INT, 1, 1, MPI_COMM_WORLD);
>
> int *b = (int*)malloc(sizeof(int)*number);
> MPI_Recv(b, number, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
>
> number here is the size of my array (i.e., of a or b).
> I have tried it on my local computer and on my ROCKS cluster. On the ROCKS cluster, one process on the frontend node uses MPI_Send to send a message, and the other processes on the compute nodes use MPI_Recv to receive it.
> When number is less than 1, the other processes receive the message quickly;
> but when number is more than 15000, they receive it slowly.
> Why? Is it because of the Open MPI API, or some other problem?
>
> It has cost me a few days, and I want your help. Thanks to all readers, and good luck to you.
>
> ------------------ Original Message ------------------
> From: "Ralph Castain"
> Date: Dec 5, 2013, 6:52
> To: "Open MPI Users"
> Subject: Re: [OMPI users] can you help me please ? thanks
>
> You are running 15000 ranks on two nodes?? My best guess is that you are swapping like crazy as your memory footprint exceeds available physical memory.
>
> On Thu, Dec 5, 2013 at 1:04 AM, <781578...@qq.com> wrote:
> My ROCKS cluster includes one frontend and two compute nodes. In my program I use the Open MPI API, such as MPI_Send and MPI_Recv. When I run the program with 3 processes, one process sends a message and the others receive it. Here is some code.
> int *a = (int*)malloc(sizeof(int)*number);
> MPI_Send(a, number, MPI_INT, 1, 1, MPI_COMM_WORLD);
>
> int *b = (int*)malloc(sizeof(int)*number);
> MPI_Recv(b, number, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
>
> When number is less than 1, it runs fast;
> but when number is more than 15000, it runs slowly.
> Why? Is it because of the Open MPI API, or some other problem?
>
> ------------------ Original Message ------------------
> From: "Ralph Castain"
> Date: Dec 3, 2013, 1:39
> To: "Open MPI Users"
> Subject: Re: [OMPI users] can you help me please ? thanks
>
> On Mon, Dec 2, 2013 at 9:23 PM, <781578...@qq.com> wrote:
> A simple program on my 4-node ROCKS cluster runs fine with the command:
> /opt/openmpi/bin/mpirun -np 4 -machinefile machines ./sort_mpi6
>
> Another, bigger program runs fine on the head node only with the command:
>
> cd ./sphere; /opt/openmpi/bin/mpirun -np 4 ../bin/sort_mpi6
>
> But with the command:
>
> cd /sphere; /opt/openmpi/bin/mpirun -np 4 -machinefile ../machines ../bin/sort_mpi6
>
> it gives this output:
>
> ../bin/sort_mpi6: error while loading shared libraries: libgdal.so.1: cannot open shared object file: No such file or directory
> ../bin/sort_mpi6: error while loading shared libraries: libgdal.so.1: cannot open shared object file: No such file or directory
> ../bin/sort_mpi6: error while loading shared libraries: libgdal.so.1: cannot open shared object file: No such file or directory

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
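For readers who want to reproduce the timing test discussed above, the quoted fragments can be assembled into a small self-contained program. This is only a sketch: the original program is not shown in full, so the array length is taken here from the command line, and the file name sendrecv.c and the default of 15000 are chosen purely for illustration.

/* sendrecv.c -- minimal reconstruction of the snippet quoted above.
   Build:  mpicc sendrecv.c -o sendrecv
   Run:    mpirun -np 2 ./sendrecv 15000                                */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    int number;                               /* array length in ints */
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    number = (argc > 1) ? atoi(argv[1]) : 15000;

    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    if (rank == 0) {
        int *a = calloc(number, sizeof(int)); /* contents do not matter for a timing test */
        MPI_Send(a, number, MPI_INT, 1, 1, MPI_COMM_WORLD);
        free(a);
    } else if (rank == 1) {
        int *b = malloc(sizeof(int) * number);
        MPI_Recv(b, number, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        free(b);
    }

    MPI_Finalize();
    return 0;
}

Whether the slowdown around number = 15000 really comes from crossing the eager limit can then be probed by rerunning with a larger limit, for example mpirun -np 2 --mca btl_tcp_eager_limit 131072 ./sendrecv 15000 (the value 131072 is only an example), using the MCA parameters Bruno lists above.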
Re: [OMPI users] Re: Re: can you help me please? Thanks
Forgive me, but I have no idea what that output means. Why do you think only 3 processors are being used?

On Dec 9, 2013, at 5:05 AM, <781578...@qq.com> wrote:

> I have a server with 12 cores. When I run my MPI program with 10 processes, only three processors do any work. Here is a picture of the problem:
>
> <40f6d...@e690af16.27c0a552.jpg>
>
> Why? Is the problem with process scheduling?
>
> [...]
[OMPI users] Re: Re: Re: can you help me please? Thanks
It means that only 3 processors have done any work; the other processors have done nothing. Why?

------------------ Original Message ------------------
From: "Ralph Castain"
Date: Dec 9, 2013, 11:18
To: "Open MPI Users"
Subject: Re: [OMPI users] Re: Re: can you help me please ? thanks

Forgive me, but I have no idea what that output means. Why do you think only 3 processors are being used?

On Dec 9, 2013, at 5:05 AM, <781578...@qq.com> wrote:

> I have a server with 12 cores. When I run my MPI program with 10 processes, only three processors do any work. Here is a picture of the problem:
>
> <40f6d...@e690af16.27c0a552.jpg>
>
> Why? Is the problem with process scheduling?
>
> [...]
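As a general way to check how many ranks actually start and where they end up, Open MPI's mpirun can report process bindings. This is only a diagnostic sketch, not a diagnosis of the problem reported above; the binding option spelling differs between Open MPI versions (e.g. --bind-to-core in the 1.6 series versus --bind-to core in later releases), and your_program is a placeholder for the real executable.

# confirm that all 10 ranks really start
mpirun -np 10 hostname
# show how each rank is bound to cores
mpirun -np 10 --report-bindings --bind-to-core ./your_program

If all 10 ranks appear but only a few cores show activity in top, the next thing to look at is how the work is actually distributed across the ranks.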