[OMPI users] qsub - mpirun problem

2008-09-28 Thread Zhiliang Hu
I have asked this question on the TorqueUsers list.  Responses from that list 
suggest that the question be asked on this list:

The situation is:

I can submit my jobs as in:
> qsub -l nodes=6:ppn=2 /path/to/mpi_program

where "mpi_program" is:
/path/to/mpirun -np 12 /path/to/my_program

-- however, everything ends up running on the head node (once, on the first 
compute node).  The jobs do complete anyway.

While mpirun does run on its own when a "-machinefile" is specified, it was 
pointed out by Glen among others, and also on this web site 
http://wiki.hpc.ufl.edu/index.php/Common_Problems (I got the same error as the 
last example on that web page), that it is not a good idea to provide a 
machinefile since the node list is "already handled by OpenMPI and Torque".

My question is: why are OpenMPI and Torque not distributing the jobs across 
all the nodes?

ps 1:
OpenMPI was configured and installed with the "--with-tm" option, and 
"ompi_info" does show these lines:

 MCA ras: tm (MCA v1.0, API v1.3, Component v1.2.7)
 MCA pls: tm (MCA v1.0, API v1.3, Component v1.2.7)

ps 2:
"/path/to/mpirun -np 12 -machinefile /path/to/machinefile /path/to/my_program"
works normally (it sends jobs to all nodes).

Thanks,

Zhiliang 
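(Side note: for reference, a minimal sketch of the kind of job script a tm-enabled Open MPI is normally used with; the paths are the placeholders from the post above, and -np and -machinefile are deliberately left out so that mpirun can size and place the job from the Torque allocation:)

  #!/bin/sh
  #PBS -l nodes=6:ppn=2
  # Inside a Torque job, a tm-enabled mpirun should pick up the allocated
  # nodes from the PBS_* environment (e.g. $PBS_NODEFILE) on its own, so
  # no -machinefile is given and the process count is left to mpirun.
  cd $PBS_O_WORKDIR
  /path/to/mpirun /path/to/my_program

This would be submitted with a plain "qsub mpi_program"; the -l request can equally stay on the qsub command line, as in the post above.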



Re: [OMPI users] qsub - mpirun problem

2008-09-28 Thread Zhiliang Hu
Ralph,

Thank you for your quick response.

Indeed, as you expected, "printenv | grep PBS" produced nothing.

BTW, I have:

> qmgr -c 'p s'

# Create queues and set their attributes.
#
#
# Create and define queue default
#
create queue default
set queue default queue_type = Execution
set queue default resources_default.nodes = 7
set queue default enabled = True
set queue default started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = nagrp2
set server default_queue = default
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_available.nodect = 6
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server next_job_number = 793

- I am not sure what is missing from my configuration (do you mean the 
installation "configure" step with its optional directives, or something else?).

Thank you,

Zhiliang
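(Side note: one quick sanity check on the Torque side is whether pbs_server actually sees the compute nodes as usable, since jobs cannot be spread across nodes it considers down; a minimal sketch:)

  # Run on the head node: list every node and its state as pbs_server sees it.
  pbsnodes -a
  # A usable node should report something like "state = free"; "state = down"
  # usually means its pbs_mom is not reporting in.

For a job that is already running, "qstat -n" likewise shows which hosts it was actually placed on.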

At 07:16 PM 9/28/2008 -0600, you wrote:
>Hi Zhiliang
>
>First thing to check is that your Torque system is defining and  
>setting the environmental variables we are expecting in a Torque  
>system. It is quite possible that your Torque system isn't configured  
>as we expect.
>
>Can you run a job and send us the output from "printenv | grep PBS"?  
>We should see a PBS jobid, the name of the file containing the names  
>of the allocated nodes, etc.
>
>Since you are able to run with -machinefile, my guess is that your  
>system isn't setting those environmental variables as we expect. In  
>that case, you will have to keep specifying the machinefile by hand.
>
>Thanks
>Ralph
>
>On Sep 28, 2008, at 7:02 PM, Zhiliang Hu wrote:
>
>>I have asked this question on TorqueUsers list.  Responses from that  
>>list suggests that the question be asked on this list:
>>
>>The situation is:
>>
>>I can submit my jobs as in:
>>>qsub -l nodes=6:ppn=2 /path/to/mpi_program
>>
>>where "mpi_program" is:
>>/path/to/mpirun -np 12 /path/to/my_program
>>
>>-- however everything went to run on the head node (one time on the  
>>first compute node).  Jobs can be done anyway.
>>
>>While the mpirun can run on its own by specifying a "-machinefile",  
>>it is pointed out by Glen among others, and also on this web site 
>>http://wiki.hpc.ufl.edu/index.php/Common_Problems  (I got the same error as 
>>the last example on that web page) that  
>>it's not a good idea to provide machinefile since it's "already  
>>handled by OpenMPI and Torque".
>>
>>My question is, why the OpenMPI and Torque is not handling the jobs  
>>to all nodes?
>>
>>ps 1:
>>The OpenMPI is configured and installed with the "--with-tm" option,  
>>and the "ompi_info" does show lines:
>>
>>MCA ras: tm (MCA v1.0, API v1.3, Component v1.2.7)
>>MCA pls: tm (MCA v1.0, API v1.3, Component v1.2.7)
>>
>>ps 2:
>>"/path/to/mpirun -np 12 -machinefile /path/to/machinefile /path/to/ 
>>my_program"
>>works normal (send jobs to all nodes).
>>
>>Thanks,
>>
>>Zhiliang
>>



Re: [OMPI users] qsub - mpirun problem

2008-09-29 Thread Zhiliang Hu
At 11:29 AM 9/29/2008 -0400, Jeff Squyres wrote:
>On Sep 28, 2008, at 10:07 PM, Zhiliang Hu wrote:
>
>>Indeed as you expected, "printenv | grep PBS" produced nothing.
>
>Are you *sure*?  I find it very hard to believe that if you run that  
>command ***in a Torque job*** that you will get no output.  Torque  
>would have to be *seriously* misbehaving for that to occur.
>
>-- 
>Jeff Squyres
>Cisco Systems

That was run from a command line outside of a Torque job.

Zhiliang
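(Side note: one minimal way to run the same check from inside a Torque job is to submit a trivial batch script and read its output file afterwards. The script name and the small node request below are arbitrary placeholders.)

  #!/bin/sh
  # pbs_env_check.sh -- dump the Torque-provided environment from inside a job
  printenv | grep PBS
  cat $PBS_NODEFILE

submitted with, say:

  qsub -l nodes=2:ppn=2 pbs_env_check.sh

After the job finishes, the PBS_* variables and the allocated host list should appear in the job's standard-output file (by default something like pbs_env_check.sh.o<jobid> in the submission directory).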



Re: [OMPI users] qsub - mpirun problem

2008-09-29 Thread Zhiliang Hu
How do you run that command line from *inside a Torque* job?

-- I am only a poor biologist, reading through the manuals/tutorials, but I 
still don't have a good clue... (thanks in advance ;-)

Zhiliang


At 11:48 AM 9/29/2008 -0400, you wrote:
>We need to see that command line from *inside a Torque* job.  That's  
>the only place where those PBS_* environment variables will exists --  
>OMPI's mpirun should be seeing these environment variables (when  
>inside a Torque job) and then reacting to them by using the Torque  
>native launcher, etc.
>
>Just to be sure: you are launching OMPI's "mpirun" inside your Torque  
>job, correct?

As shown in my original post, I tried to (1) run the mpirun job without Torque, 
which works; and (2) submit it with 'qsub', which ends up with everything on 
one node.


>On Sep 29, 2008, at 11:41 AM, Zhiliang Hu wrote:
>
>>At 11:29 AM 9/29/2008 -0400, Jeff Squyres wrote:
>>>On Sep 28, 2008, at 10:07 PM, Zhiliang Hu wrote:
>>>
>>>>Indeed as you expected, "printenv | grep PBS" produced nothing.
>>>
>>>Are you *sure*?  I find it very hard to believe that if you run that
>>>command ***in a Torque job*** that you will get no output.  Torque
>>>would have to be *seriously* misbehaving for that to occur.
>>>
>>>-- 
>>>Jeff Squyres
>>>Cisco Systems
>>
>>That's a command line without a torque job.
>>
>>Zhiliang



Re: [OMPI users] qsub - mpirun problem

2008-09-29 Thread Zhiliang Hu
I am the "system admin" here (so far so good on several servers over several 
years but this PBS thing appear to be daunting ;-)

I suppose **run ... from *inside a Torque*** is to run things with a PBS 
script.  I thought "qsub -l nodes=6:ppn=2 mpirun ..." already bring things into 
a PBS environment context.

Hope I don't have to take another school to get this to work ;-)

Zhiliang


At 12:38 PM 9/29/2008 -0400, you wrote:
>On Sep 29, 2008, at 12:27 PM, Zhiliang Hu wrote:
>
>>How you run that command line from *inside a Torque* job?
>>
>>-- I am only a poor biologist, reading through the manuals/tutorials  
>>but still don't have good clues... (thanks in advance ;-)
>
>Ah, gotcha.
>
>I'm guessing that you're running OMPI outside of a Torque job, and  
>that's why it's running entirely on a single machine (or on the  
>machines where you listed in a hostfile).
>
>You need to run your Open MPI job inside the Torque job that you  
>submit; OMPI should then detect that it is inside a Torque job and  
>automatically use the hosts that have been allocated by Torque.
>
>You probably want to consult your local sysadmin / cluster admin to  
>help you get Torque setup in your account, show you how to submit job  
>scripts, etc. (the specific instructions for how to use Torque can  
>vary from site to site and are a bit outside the scope of this list).
>
>Good luck!
>
>-- 
>Jeff Squyres
>Cisco Systems
>



Re: [OMPI users] qsub - mpirun problem

2008-09-29 Thread Zhiliang Hu
At 06:55 PM 9/29/2008 +0200, Reuti wrote:
>Am 29.09.2008 um 18:27 schrieb Zhiliang Hu:
>
>>How you run that command line from *inside a Torque* job?
>>
>>-- I am only a poor biologist, reading through the manuals/ tutorials but 
>>still don't have good clues... (thanks in advance ;-)
>
>What is the content of your jobscript? Did you request more than one  
>node for your job?
>
>-- Reuti

"-l nodes=6:ppn=2" is all I have to specify the node requests:

UNIX_PROMPT> qsub -l nodes=6:ppn=2 /path/to/mpi_program
where "mpi_program" is a file with one line:
  /path/to/mpirun -np 12 /path/to/my_program

Zhiliang

ps: "my_program" is a parallel program. 



Re: [OMPI users] qsub - mpirun problem

2008-09-29 Thread Zhiliang Hu
At 07:37 PM 9/29/2008 +0200, Reuti wrote:

>>"-l nodes=6:ppn=2" is all I have to specify the node requests:
>
>this might help: http://www.open-mpi.org/faq/?category=tm

Essentially, the examples given on that web page are no different from what I 
did.  The only thing new is "qsub -I", which I suppose is for interactive mode.  
When I did this:

  qsub -I -l nodes=7 mpiblastn.sh 

It hangs on "qsub: waiting for job 798.nagrp2.ansci.iastate.edu to start".


>>UNIX_PROMPT> qsub -l nodes=6:ppn=2 /path/to/mpi_program
>>where "mpi_program" is a file with one line:
>>  /path/to/mpirun -np 12 /path/to/my_program
>
>Can you please try this jobscript instead:
>
>#!/bin/sh
>set | grep PBS
>/path/to/mpirun /path/to/my_program
>
>All should be handled by Open MPI automatically. With the "set" bash  
>command you will get a list with all defined variables for further  
>analysis; and where you can check for the variables set by Torque.
>
>-- Reuti

"set | grep PBS" part had nothing in output.

Zhiliang
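(Side note: the hang at "waiting for job ... to start" points at the Torque/scheduler side rather than Open MPI. A minimal sketch of the interactive check, assuming no script argument is passed with -I:)

  qsub -I -l nodes=2:ppn=2
  # Once "qsub: job ... ready" appears and a shell opens on a compute node:
  printenv | grep PBS
  cat $PBS_NODEFILE    # should list one line per allocated processor slot
  # If the prompt never appears, pbs_sched/pbs_mom and the node list are the
  # place to look, not mpirun.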






Re: [OMPI users] qsub - mpirun problem

2008-09-29 Thread Zhiliang Hu
At 10:45 PM 9/29/2008 +0200, you wrote:
>Am 29.09.2008 um 22:33 schrieb Zhiliang Hu:
>
>>At 07:37 PM 9/29/2008 +0200, Reuti wrote:
>>
>>>>"-l nodes=6:ppn=2" is all I have to specify the node requests:
>>>
>>>this might help: http://www.open-mpi.org/faq/?category=tm
>>
>>Essentially the examples given on this web is no difference from  
>>what I did.
>>Only thing new is, I suppose "qsub -I " is for interactive mode.   
>>When I did this:
>>
>>  qsub -I -l nodes=7 mpiblastn.sh
>>
>>It hangs on "qsub: waiting for job 798.nagrp2.ansci.iastate.edu to  
>>start".
>>
>>
>>>>UNIX_PROMPT> qsub -l nodes=6:ppn=2 /path/to/mpi_program
>>>>where "mpi_program" is a file with one line:
>>>> /path/to/mpirun -np 12 /path/to/my_program
>>>
>>>Can you please try this jobscript instead:
>>>
>>>#!/bin/sh
>>>set | grep PBS
>>>/path/to/mpirun /path/to/my_program
>>>
>>>All should be handled by Open MPI automatically. With the "set" bash
>>>command you will get a list with all defined variables for further
>>>analysis; and where you can check for the variables set by Torque.
>>>
>>>-- Reuti
>>
>>"set | grep PBS" part had nothing in output.
>
>Strange - you checked the .o end .e files of the job? - Reuti

There is nothing in the -o nor the -e output.  I had to kill the job.
I checked the Torque log; it shows (/var/spool/torque/server_logs):

09/29/2008 15:52:16;0100;PBS_Server;Job;799.xxx.xxx.xxx;enqueuing into default, 
state 1 hop 1
09/29/2008 15:52:16;0008;PBS_Server;Job;799.xxx.xxx.xxx;Job Queued at request 
of z...@xxx.xxx.xxx, owner = z...@xxx.xxx.xxx, job name = mpiblastn.sh, queue = 
default
09/29/2008 15:52:16;0040;PBS_Server;Svr;xxx.xxx.xxx;Scheduler sent command new
09/29/2008 15:52:16;0008;PBS_Server;Job;799.xxx.xxx.xxx;Job Modified at request 
of schedu...@xxx.xxx.xxx
09/29/2008 15:52:27;0008;PBS_Server;Job;799.xxx.xxx.xxx;Job deleted at request 
of z...@xxx.xxx.xxx
09/29/2008 15:52:27;0100;PBS_Server;Job;799.xxx.xxx.xxx;dequeuing from default, 
state EXITING
09/29/2008 15:52:27;0040;PBS_Server;Svr;xxx.xxx.xxx;Scheduler sent command term
09/29/2008 15:52:47;0001;PBS_Server;Svr;PBS_Server;is_request, bad attempt to 
connect from 172.16.100.1:1021 (address not trusted - check entry in 
server_priv/nodes)

where the server_priv/nodes has:
node001 np=4
node002 np=4
node003 np=4
node004 np=4
node005 np=4
node006 np=4
node007 np=4

which was set up by the vendor.

What is "address not trusted"?

Zhiliang
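(Side note: that log line is pbs_server refusing a connection from an address it cannot match against its node list. A minimal way to check the mapping, using the IP from the log and the usual hosts-file location:)

  # Which hostname does the untrusted address resolve to on the head node?
  getent hosts 172.16.100.1
  grep 172.16.100.1 /etc/hosts
  # If that address belongs to the head node's cluster-side interface, or to a
  # node known under a name not listed in server_priv/nodes (node001..node007),
  # pbs_server will keep rejecting it until the names and addresses line up.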






Re: [OMPI users] qsub - mpirun problem

2008-09-29 Thread Zhiliang Hu
At 02:15 PM 9/29/2008 -0700, you wrote:
>It sounds like you may not have setup paswordless ssh between all  
>your nodes.
>
>Doug Reeder

That's not the case.  Passwordless ssh is set up and works fine --
that's how I can run "mpirun -np 6 -machinefile ..." in the first place.

Zhiliang


>On Sep 29, 2008, at 2:12 PM, Zhiliang Hu wrote:
>
>>At 10:45 PM 9/29/2008 +0200, you wrote:
>>>Am 29.09.2008 um 22:33 schrieb Zhiliang Hu:
>>>
>>>>At 07:37 PM 9/29/2008 +0200, Reuti wrote:
>>>>
>>>>>>"-l nodes=6:ppn=2" is all I have to specify the node requests:
>>>>>
>>>>>this might help: http://www.open-mpi.org/faq/?category=tm
>>>>
>>>>Essentially the examples given on this web is no difference from
>>>>what I did.
>>>>Only thing new is, I suppose "qsub -I " is for interactive mode.
>>>>When I did this:
>>>>
>>>> qsub -I -l nodes=7 mpiblastn.sh
>>>>
>>>>It hangs on "qsub: waiting for job 798.nagrp2.ansci.iastate.edu to
>>>>start".
>>>>
>>>>
>>>>>>UNIX_PROMPT> qsub -l nodes=6:ppn=2 /path/to/mpi_program
>>>>>>where "mpi_program" is a file with one line:
>>>>>>/path/to/mpirun -np 12 /path/to/my_program
>>>>>
>>>>>Can you please try this jobscript instead:
>>>>>
>>>>>#!/bin/sh
>>>>>set | grep PBS
>>>>>/path/to/mpirun /path/to/my_program
>>>>>
>>>>>All should be handled by Open MPI automatically. With the "set"  
>>>>>bash
>>>>>command you will get a list with all defined variables for further
>>>>>analysis; and where you can check for the variables set by Torque.
>>>>>
>>>>>-- Reuti
>>>>
>>>>"set | grep PBS" part had nothing in output.
>>>
>>>Strange - you checked the .o end .e files of the job? - Reuti
>>
>>There is nothing in -o nor -e output.  I had to kill the job.
>>I checked torque log, it shows (/var/spool/torque/server_logs):
>>
>>09/29/2008 15:52:16;0100;PBS_Server;Job;799.xxx.xxx.xxx;enqueuing  
>>into default, state 1 hop 1
>>09/29/2008 15:52:16;0008;PBS_Server;Job;799.xxx.xxx.xxx;Job Queued  
>>at request of z...@xxx.xxx.xxx, owner = z...@xxx.xxx.xxx, job name =  
>>mpiblastn.sh, queue = default
>>09/29/2008 15:52:16;0040;PBS_Server;Svr;xxx.xxx.xxx;Scheduler sent  
>>command new
>>09/29/2008 15:52:16;0008;PBS_Server;Job;799.xxx.xxx.xxx;Job  
>>Modified at request of schedu...@xxx.xxx.xxx
>>09/29/2008 15:52:27;0008;PBS_Server;Job;799.xxx.xxx.xxx;Job deleted  
>>at request of z...@xxx.xxx.xxx
>>09/29/2008 15:52:27;0100;PBS_Server;Job;799.xxx.xxx.xxx;dequeuing  
>>from default, state EXITING
>>09/29/2008 15:52:27;0040;PBS_Server;Svr;xxx.xxx.xxx;Scheduler sent  
>>command term
>>09/29/2008 15:52:47;0001;PBS_Server;Svr;PBS_Server;is_request, bad  
>>attempt to connect from 172.16.100.1:1021 (address not trusted -  
>>check entry in server_priv/nodes)
>>
>>where the server_priv/nodes has:
>>node001 np=4
>>node002 np=4
>>node003 np=4
>>node004 np=4
>>node005 np=4
>>node006 np=4
>>node007 np=4
>>
>>which was set up by the vender.
>>
>>What is "address not trusted"?
>>
>>Zhiliang
>>
>>
>>
>>



Re: [OMPI users] qsub - mpirun problem

2008-09-29 Thread Zhiliang Hu
At 12:10 AM 9/30/2008 +0200, you wrote:

>>>>>Can you please try this jobscript instead:
>>>>>
>>>>>#!/bin/sh
>>>>>set | grep PBS
>>>>>/path/to/mpirun /path/to/my_program
>>>>>
>>>>>All should be handled by Open MPI automatically. With the "set" bash
>>>>>command you will get a list with all defined variables for further
>>>>>analysis; and where you can check for the variables set by Torque.
>>>>>
>>>>>-- Reuti
>>>>
>>>>"set | grep PBS" part had nothing in output.
>>>
>>>Strange - you checked the .o end .e files of the job? - Reuti
>>
>>There is nothing in -o nor -e output.  I had to kill the job.
>>I checked torque log, it shows (/var/spool/torque/server_logs):
>>
>>09/29/2008 15:52:16;0100;PBS_Server;Job;799.xxx.xxx.xxx;enqueuing
>>into default, state 1 hop 1
>>09/29/2008 15:52:16;0008;PBS_Server;Job;799.xxx.xxx.xxx;Job Queued
>>at request of z...@xxx.xxx.xxx, owner = z...@xxx.xxx.xxx, job name =
>>mpiblastn.sh, queue = default
>>09/29/2008 15:52:16;0040;PBS_Server;Svr;xxx.xxx.xxx;Scheduler sent
>>command new
>>09/29/2008 15:52:16;0008;PBS_Server;Job;799.xxx.xxx.xxx;Job
>>Modified at request of schedu...@xxx.xxx.xxx
>>09/29/2008 15:52:27;0008;PBS_Server;Job;799.xxx.xxx.xxx;Job
>>deleted at request of z...@xxx.xxx.xxx
>>09/29/2008 15:52:27;0100;PBS_Server;Job;799.xxx.xxx.xxx;dequeuing
>>from default, state EXITING
>>09/29/2008 15:52:27;0040;PBS_Server;Svr;xxx.xxx.xxx;Scheduler sent
>>command term
>>09/29/2008 15:52:47;0001;PBS_Server;Svr;PBS_Server;is_request, bad
>>attempt to connect from 172.16.100.1:1021 (address not trusted -
>>check entry in server_priv/nodes)
>
>As you blank out some addresses: have the nodes and the headnode one  
>or two network cards installed? All the names like node001 et al. are  
>known on neach node by the correct address? I.e. 172.16.100.1 = node001?
>
>-- Reuti

There should be no problem in this regard -- the setup was done by a 
commercial company. I can ssh from any node to any node (passwordless).

Zhiliang



Re: [OMPI users] qsub - mpirun problem

2008-09-29 Thread Zhiliang Hu
Thanks to the several people who tried to help diagnose this and shared their 
thoughts on this thread.  That gave me more clues, and the courage to go back 
to our vendor.

My question on the Torque list is still awaiting replies...

Best regards to you all,

Zhiliang


At 11:22 AM 9/30/2008 +1000, you wrote:

>On Mon, 2008-09-29 at 17:30 -0500, Zhiliang Hu wrote:
>> >As you blank out some addresses: have the nodes and the headnode one  
>> >or two network cards installed? All the names like node001 et al. are  
>> >known on neach node by the correct address? I.e. 172.16.100.1 = node001?
>> >
>> >-- Reuti
>> 
>> There should be no problem in this regard -- the set up is by a 
>> commercial company. I can ssh from any node to any node (passwdless).
>> 
>> Zhiliang
>
>Your faith in commercial enterprises is touching.  Unfortunately, it's
>at odds with my experience, on two continents.
>
>Like Reuti said, if you paid someone to set up a cluster to run parallel
>jobs and it won't run parallel jobs, then yell at them loud and long.
>
>I'll also reiterate that this sounds like a PBS problem rather than
>(yet) an OpenMPI problem.  It seems you left the PBS discussion
>prematurely.
>
>