On 15.04.2011, at 23:02, Derrick LIN wrote:

> - what is your SGE configuration `qconf -sconf`?
>
> <snip>
> rlogin_daemon    /usr/sbin/sshd -i
> rlogin_command   /usr/bin/ssh
> qlogin_daemon    /usr/sbin/sshd -i
> qlogin_command   /usr/share/gridengine/qlogin-wrapper
> rsh_daemon       /usr/sbin/sshd -i
> rsh_command      /usr/bin/ssh
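(For comparison, these `ssh`-based entries replace SGE's builtin startup methods. A hedged sketch of reverting them to the builtin mechanism, run by an SGE admin user, would be the following; the `builtin` values are standard Grid Engine keywords, everything else here is illustrative:)

```shell
# Hypothetical sketch: open the global cluster configuration in an editor
# and switch the remote-startup methods back to SGE's builtin mechanism,
# so no sshd is involved in starting interactive or parallel tasks.
qconf -mconf
# then set in the editor:
#   rlogin_daemon   builtin
#   rlogin_command  builtin
#   qlogin_daemon   builtin
#   qlogin_command  builtin
#   rsh_daemon      builtin
#   rsh_command     builtin
```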
So you route the SGE startup mechanism to use `ssh`; nevertheless it should work, of course. A small difference from a conventional `ssh` is that SGE will start a private daemon for each job on the nodes, listening on a random port. When you use only one host, forks will be created but no `ssh` call. Does your test use more than one node? Did you copy your SGE-aware version to all nodes at the same location? Are you getting the correct `mpiexec` and shared libraries in your job script? Does the output of:

#!/bin/sh
which mpiexec
echo $LD_LIBRARY_PATH
ldd ompi_job

show the expected ones (ompi_job is the binary and ompi_job.sh the script) when submitted with a PE request?

-- Reuti

> jsv_url                      none
> jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
>
> # my queue setting is:
>
> qname                 dev.q
> hostlist              sgeqexec01.domain.com.au
> seq_no                0
> load_thresholds       np_load_avg=1.75
> suspend_thresholds    NONE
> nsuspend              1
> suspend_interval      00:05:00
> priority              0
> min_cpu_interval      00:05:00
> processors            UNDEFINED
> qtype                 BATCH INTERACTIVE
> ckpt_list             NONE
> pe_list               make orte
> rerun                 FALSE
> slots                 8
> tmpdir                /tmp
> shell                 /bin/bash
> prolog                NONE
> epilog                NONE
> shell_start_mode      posix_compliant
> starter_method        NONE
> suspend_method        NONE
> resume_method         NONE
> terminate_method      NONE
> notify                00:00:60
> owner_list            NONE
> user_lists            NONE
> xuser_lists           NONE
> subordinate_list      NONE
> complex_values        NONE
> projects              NONE
> xprojects             NONE
> calendar              NONE
> initial_state         default
> s_rt                  INFINITY
> h_rt                  INFINITY
> s_cpu                 INFINITY
> h_cpu                 INFINITY
> s_fsize               INFINITY
> h_fsize               INFINITY
> s_data                INFINITY
> h_data                INFINITY
> s_stack               INFINITY
> h_stack               INFINITY
> s_core                INFINITY
> h_core                INFINITY
> s_rss                 INFINITY
> h_rss                 INFINITY
> s_vmem                INFINITY
> h_vmem                INFINITY
>
> # my PE setting is:
>
> pe_name             orte
> slots               4
> user_lists          NONE
> xuser_lists         NONE
> start_proc_args     /bin/true
> stop_proc_args      /bin/true
> allocation_rule     $round_robin
> control_slaves      TRUE
> job_is_first_task   FALSE
> urgency_slots       min
> accounting_summary  FALSE
>
> a) you are testing from master to a node, but jobs are running between nodes.
>
> b) unless you need X11 forwarding, using SGE's -builtin- communication works
> fine; this way you can have a cluster without `rsh` or `ssh` (or limited to
> admin staff) and can still run parallel jobs.
>
> Sorry for the misleading snip. All the hosts (both master and execution
> hosts) in the cluster can reach each other passwordlessly without an issue.
> As my 2) states, I could run a generic Open MPI job without SGE
> successfully. So I do not think it is a communication issue.
>
> Then you are bypassing SGE's slot allocation and will have wrong accounting
> and no job control of the slave tasks.
>
> I know it's not a proper submission as a PE job. I simply ran out of ideas
> about what to do next. Even though it's not the proper way, that Open MPI
> error didn't happen and the job completed. I am wondering why.
>
> The correct version of my Open MPI is 1.4.1, not 1.3 as in my first post.
>
> I installed Open MPI on the submission host and the master later, but it
> didn't help. So I guess Open MPI is needed on the execution hosts only.
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
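(To illustrate the proper PE submission discussed in this thread, a minimal job script might look like the sketch below. The `orte` PE and the `ompi_job`/`ompi_job.sh` names come from the thread itself; the specific `qsub` options are assumptions, not the poster's actual script:)

```shell
#!/bin/sh
# ompi_job.sh -- hypothetical sketch of a tightly integrated Open MPI job.
#$ -S /bin/sh
#$ -N ompi_test
#$ -cwd
#$ -pe orte 4        # request 4 slots from the "orte" PE shown above
# With SGE support compiled into Open MPI (--with-sge), mpiexec picks up
# the granted slot count and host list from SGE's environment
# ($NSLOTS, $PE_HOSTFILE), so no -np or hostfile argument is needed.
mpiexec ompi_job
```

Submitted with `qsub ompi_job.sh`; because the PE has `control_slaves TRUE`, Open MPI starts its remote daemons under SGE's control (via `qrsh -inherit`), so accounting and job control cover the slave tasks as well.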