Re: [OMPI users] openmpi+torque: How to run a job in a subset of the allocation?

2013-12-09 Thread Ola . Widlund
Thanks to Ralph, Gus and Georg for your input!

I was diverted to other things for a week, but now back on track...

You deserve to have the question marks straightened out first: the two 
applications are (1) a commercial solver and (2) an in-house code handling 
some special physics. The two codes run sequentially, taking turns one time 
step at a time. The commercial solver runs one time step, outputs some data 
to file, then waits for the in-house code to do some work, then goes on to 
the next time step. The in-house code is actually restarted each time it 
has work to do. (This is what I mean by "loosely coupled".)

As the two codes never work at the same time, I would like them to use the 
same hardware. The commercial solver may be reluctant to release its cores 
to other processes ("aggressive"), but I hope it will at the end of each 
time step... The commercial code will just be started as usual, using the 
full allocation of the MOAB job. The in-house code is the one using Open 
MPI, and I want it to use all of the cores in the first node of the 
allocation, and only those.

As Ralph suggests, it seems very convenient to use the -host option with 
relative node syntax. I also found some other references on how mpirun 
handles host information from resource managers. Starting the two codes as 
background jobs, like Georg does, sounds good, along the lines of the 
sketch below. I will simply give it a spin and see how it works...
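
A rough sketch of what I have in mind for the job script (untested; the 
program names and the core count are placeholders of mine, and "+n0" is 
Open MPI's relative node syntax for the first node of the allocation):

#!/bin/bash
# The commercial solver is started as usual and uses the full
# MOAB/Torque allocation.
commercial_solver -case mycase &

# The in-house code is restricted to the first node of the allocation;
# "-np 8" stands in for the (assumed) number of cores on that node.
mpirun -host +n0 -np 8 ./inhouse_code &

# Wait for both background jobs to finish.
wait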

Thanks again for your time,

Ola



[OMPI users] Reply: Reply: can you help me please? Thanks

2013-12-09 Thread <781578...@qq.com>
I have a server with 12 cores. When I run my MPI program with 10 processes, 
only three processors work. Here is a picture of the problem:

<40f6d...@e690af16.27c0a552.jpg>

Why? Is the problem with process scheduling?

------ Original Message ------
From: "Bruno Coutinho"
Date: 2013-12-06 (Fri) 11:14
To: "Open MPI Users"
Subject: Re: [OMPI users] Reply: can you help me please? Thanks

 

Probably it was the change from eager to rendezvous protocols, as Jeff 
said.

If you don't know what these are, read these:
https://computing.llnl.gov/tutorials/mpi_performance/#Protocols
http://blogs.cisco.com/performance/what-is-an-mpi-eager-limit/
http://blogs.cisco.com/performance/eager-limits-part-2/

 

You can tune the eager limit by changing the MCA parameters 
btl_tcp_eager_limit (for TCP), btl_self_eager_limit (communication from one 
process to itself), btl_sm_eager_limit (shared memory), and 
btl_udapl_eager_limit or btl_openib_eager_limit (if you use InfiniBand).
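
For example (an untested sketch; the 64 KiB values are purely illustrative, 
and the program name is made up):

mpirun --mca btl_sm_eager_limit 65536 \
       --mca btl_tcp_eager_limit 65536 \
       -np 10 ./my_mpi_program

Messages up to the eager limit are pushed to the receiver right away; 
larger messages fall back to the rendezvous protocol, which waits until the 
receiver is ready.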
 

 2013/12/6 Jeff Squyres (jsquyres) 
 I sent you some further questions yesterday:

http://www.open-mpi.org/community/lists/users/2013/12/23158.php
  

On Dec 6, 2013, at 1:35 AM,  <781578...@qq.com> wrote:

> Here is my code:
> int *a = (int*)malloc(sizeof(int)*number);
> MPI_Send(a, number, MPI_INT, 1, 1, MPI_COMM_WORLD);
>
> int *b = (int*)malloc(sizeof(int)*number);
> MPI_Recv(b, number, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
>
> number here is the size of my array (e.g., a or b).
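>
> A minimal self-contained version of this test (a reconstructed sketch: 
> the MPI setup around the two calls above was not shown, so it is filled 
> in here, and only ranks 0 and 1 take part) would be:
>
> #include <stdlib.h>
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>     int rank;
>     int number = 15000;            /* size of the array being sent */
>     MPI_Status status;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>     if (rank == 0) {
>         /* rank 0 sends "number" ints to rank 1 */
>         int *a = (int *)malloc(sizeof(int) * number);
>         MPI_Send(a, number, MPI_INT, 1, 1, MPI_COMM_WORLD);
>         free(a);
>     } else if (rank == 1) {
>         /* rank 1 receives the message from rank 0 */
>         int *b = (int *)malloc(sizeof(int) * number);
>         MPI_Recv(b, number, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD,
>                  &status);
>         free(b);
>     }
>
>     MPI_Finalize();
>     return 0;
> }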
> I have tried it on my local computer and on my ROCKS cluster. On the 
> ROCKS cluster, one process on the frontend node uses MPI_Send to send a 
> message, and the other processes on the compute nodes use MPI_Recv to 
> receive it.
> When number is less than 1, the other processes receive the message fast; 
> but when number is more than 15000, they receive it slowly.
> Why? Is it because of the Open MPI API, or some other problem?
>
> I have spent a few days on this and would appreciate your help. Thanks to 
> all readers, and good luck to you.
>
>
>
>
> ------ Original Message ------
> From: "Ralph Castain"
> Date: 2013-12-05 (Thu) 6:52
> To: "Open MPI Users"
> Subject: Re: [OMPI users] can you help me please? Thanks
>
> You are running 15000 ranks on two nodes?? My best guess is that you are 
> swapping like crazy as your memory footprint exceeds available physical 
> memory.
>
>
>
> On Thu, Dec 5, 2013 at 1:04 AM,  <781578...@qq.com> wrote:
> My ROCKS cluster includes one frontend and two compute nodes. In my 
> program I use the Open MPI API, such as MPI_Send and MPI_Recv, but when I 
> run the program with 3 processes, one process sends a message and the 
> others receive it. Here is some code:
> int *a = (int*)malloc(sizeof(int)*number);
> MPI_Send(a, number, MPI_INT, 1, 1, MPI_COMM_WORLD);
>
> int *b = (int*)malloc(sizeof(int)*number);
> MPI_Recv(b, number, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
>
> When number is less than 1, it runs fast;
> but when number is more than 15000, it runs slowly.
>
> Why? Is it because of the Open MPI API, or some other problem?
> ------ Original Message ------
> From: "Ralph Castain"
> Date: 2013-12-03 (Tue) 1:39
> To: "Open MPI Users"
> Subject: Re: [OMPI users] can you help me please? Thanks
>
>
>
>
>
> On Mon, Dec 2, 2013 at 9:23 PM,  <781578...@qq.com> wrote:
> A simple program on my 4-node ROCKS cluster runs fine with the command:
> /opt/openmpi/bin/mpirun -np 4 -machinefile machines ./sort_mpi6
>
>
> Another, bigger program runs fine on the head node only, with the command:
>
> cd ./sphere; /opt/openmpi/bin/mpirun -np 4 ../bin/sort_mpi6
>
> But with the command:
>
> cd /sphere; /opt/openmpi/bin/mpirun -np 4 -machinefile ../machines
> ../bin/sort_mpi6
>
> it gives this output:
>
> ../bin/sort_mpi6: error while loading shared libraries: libgdal.so.1: 
> cannot open shared object file: No such file or directory
> ../bin/sort_mpi6: error while loading shared libraries: libgdal.so.1: 
> cannot open shared object file: No such file or directory
> ../bin/sort_mpi6: error while loading shared libraries: libgdal.so.1: 
> cannot open shared object file: No such file or directory




--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] Reply: Reply: can you help me please? Thanks

2013-12-09 Thread Ralph Castain
Forgive me, but I have no idea what that output means. Why do you think only 3 
processors are being used?

On Dec 9, 2013, at 5:05 AM,  <781578...@qq.com> wrote:

> I have a server with 12 cores. When I run my MPI program with 10 
> processes, only three processors work. Here is a picture of the problem:
>
> <40f6d...@e690af16.27c0a552.jpg>
>
> Why? Is the problem with process scheduling?

[OMPI users] Reply: Reply: Reply: can you help me please? Thanks

2013-12-09 Thread <781578...@qq.com>
It means that only 3 processors have worked, and the other processors have 
done nothing. Why?

------ Original Message ------
From: "Ralph Castain"
Date: 2013-12-09 (Mon) 11:18
To: "Open MPI Users"
Subject: Re: [OMPI users] Reply: Reply: can you help me please? Thanks

Forgive me, but I have no idea what that output means. Why do you think 
only 3 processors are being used?