I am having two problems with the integration of OpenMPI 1.3 and SGE 6.2u1, both of which are new to us. The problems are getting jobs to suspend/resume and getting CPU time collected correctly.

For suspend/resume, I have added the following to my mpirun command:

    --mca orte_forward_job_control 1 --mca plm_rsh_daemonize_qrsh 1

and adjusted the suspend_method for the queue that the job runs in. So far I have not gotten it to place any process into the T state. This is not a huge problem, but I hope to have it working in the future.

My main problem is getting the CPU time correct. On a multiple-CPU job, only the master node shows the correct CPU time for its process; the others are very short, and I am not sure what they are measuring (I believe startup time). Here's an example:

    cpu 0.360
    cpu 0.480
    cpu 0.470
    cpu 0.490
    cpu 0.530
    cpu 0.470
    cpu 0.680
    cpu 464.305

From watching the runs, that last value is close to the wall clock time and matches what I see for that single process.

I have gotten it to report what I believe are correct values, but only by including the --debug-daemons option in our mpirun command. With that I get the following:

    cpu 73.146
    cpu 72.982
    cpu 73.381
    cpu 73.142
    cpu 73.029
    cpu 73.183
    cpu 73.117
    cpu 73.265
    cpu 73.236

I have noticed that when I get the correct CPU time, qrsh processes start up (my understanding is that these are what start the processes on the remote machines) and stay running until the job is finished. When I don't get the correct CPU time, I see the qrsh processes start on the master node, but they die off once they have started the processes on the remote nodes.

The PE environment looks like the following:

    pe_name            orte
    slots              560
    user_lists         NONE
    xuser_lists        NONE
    start_proc_args    /bin/true
    stop_proc_args     /bin/true
    allocation_rule    $round_robin
    control_slaves     TRUE
    job_is_first_task  FALSE
    urgency_slots      min
    accounting_summary FALSE

Please let me know if I can provide any more information to help figure this out.

Thanks,

Scott Malone
Manager, High Performance Computing Facility
Information Sciences - Research Informatics
St. Jude Children's Research Hospital
332 North Lauderdale
Memphis, TN 38105
901.495.4947
scott.mal...@stjude.org