Hi All,

I am having an OpenMPI issue that seems to be related to job scheduling on TACC, one of the TeraGrid resources.
The program I am trying to run, ABySS, seems to run fine without scheduling, i.e. when I run it on the login nodes without going through qsub. But when I submit that same command through qsub, the job fails.

Here is the qsub script, FYI:

    #!/bin/bash
    #$ -N homo47
    #$ -j y
    #$ -o homo47
    #$ -pe 16way 128
    #$ -q normal
    #$ -l h_rt=00:30:00
    #$ -M macma...@gmail.com
    #$ -m be
    cd /work/01301/mmacmane/abyss-1.1.2/bin
    #$ -cwd
    #$ -V
    ibrun ./abyss-pe k=19 in='/work/01301/mmacmane/homo/*.fastq' name='homo_47' n=5 s=200 c=13

I get an error message:

    TACC: Done.
    TACC: Starting up job 1299149
    TACC: Setting up parallel environment for OpenMPI mpirun.
    TACC: Setup complete. Running job script.
    TACC: starting parallel tasks...
    /opt/apps/pgi7_2/openmpi/1.3/bin/mpirun -np 64 ABYSS-P -k19 -c13 --coverage-hist=coverage.hist -s bubbles.fa -o homo_61-1.fa
      /work/01301/mmacmane/homo/SRR001665_1.fastq
      /work/01301/mmacmane/homo/SRR001665_2.fastq
      /work/01301/mmacmane/homo/SRR002271_1.fastq
      /work/01301/mmacmane/homo/SRR002271_2.fastq
      /work/01301/mmacmane/homo/SRR002273_1.fastq
      /work/01301/mmacmane/homo/SRR002273_2.fastq
      /work/01301/mmacmane/homo/SRR002274_1.fastq
      /work/01301/mmacmane/homo/SRR002274_2.fastq
      /work/01301/mmacmane/homo/SRR002275_1.fastq
      /work/01301/mmacmane/homo/SRR002275_2.fastq
      /work/01301/mmacmane/homo/SRR002276_1.fastq
      /work/01301/mmacmane/homo/SRR002276_2.fastq
      /work/01301/mmacmane/homo/SRR002291_1.fastq
      /work/01301/mmacmane/homo/SRR002291_2.fastq
      /work/01301/mmacmane/homo/SRR002295_1.fastq
      /work/01301/mmacmane/homo/SRR002295_2.fastq
      /work/01301/mmacmane/homo/SRR002297_1.fastq
      /work/01301/mmacmane/homo/SRR002297_2.fastq
      /work/01301/mmacmane/homo/SRR029337_1.fastq
      /work/01301/mmacmane/homo/SRR029337_2.fastq
    ...many many lines of this...
    [i178-302.ranger.tacc.utexas.edu:28340] [[5795,1],19] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 105
    [i178-302.ranger.tacc.utexas.edu:28340] [[5795,1],19] could not get route to [[INVALID],INVALID]
    [i178-302.ranger.tacc.utexas.edu:28340] [[5795,1],19] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file base/plm_base_proxy.c at line 85
    [i176-303.ranger.tacc.utexas.edu:05045] [[5795,1],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 105
    [i176-303.ranger.tacc.utexas.edu:05045] [[5795,1],1] could not get route to [[INVALID],INVALID]
    [i176-303.ranger.tacc.utexas.edu:05045] [[5795,1],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file base/plm_base_proxy.c at line 85
    [i178-302.ranger.tacc.utexas.edu:28325] [[5795,1],18] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 105
    [i178-302.ranger.tacc.utexas.edu:28325] [[5795,1],18] could not get route to [[INVALID],INVALID]
    ...many many lines of this...
    TACC: Cleaning up after job: 1299149
    TACC: Done.

The TACC systems administrators don't seem to have a great solution, and the authors of the program say it's MPI-related...

_________________________________
Matthew MacManes
PhD Candidate
University of California- Berkeley
Museum of Vertebrate Zoology
Phone: 510-495-5833
Lab Website: http://ib.berkeley.edu/labs/lacey
Personal Website: http://macmanes.com/
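P.S. Since the error output runs to many lines, a condensed view may be easier to read. Below is a minimal sketch in Python that tallies the ORTE_ERROR_LOG lines by source file and line number, and lists the ranks and hosts that reported them. The function name `summarize_orte_errors` and the regular expression are my own illustration, written only against the line format shown above; they are not part of OpenMPI or ABySS, and other OpenMPI versions may format these messages differently.

```python
import re
from collections import Counter

# Pattern for lines of the form seen in the job output above, e.g.
# [host:pid] [[jobid,n],rank] ORTE_ERROR_LOG: <message> in file <file> at line <n>
# This is an assumption based on that output only, not an official format.
ORTE_LINE = re.compile(
    r"\[(?P<host>[^\]:]+):(?P<pid>\d+)\]\s+"
    r"\[\[(?P<job>\d+),(?P<n>\d+)\],(?P<rank>\d+)\]\s+"
    r"ORTE_ERROR_LOG:.*? in file (?P<file>\S+) at line (?P<line>\d+)"
)


def summarize_orte_errors(text):
    """Return (error counts per (file, line), sorted ranks, sorted hosts)."""
    by_site = Counter()
    ranks = set()
    hosts = set()
    for match in ORTE_LINE.finditer(text):
        by_site[(match.group("file"), int(match.group("line")))] += 1
        ranks.add(int(match.group("rank")))
        hosts.add(match.group("host"))
    return by_site, sorted(ranks), sorted(hosts)


# A few lines copied from the job output; the "could not get route" line
# is not an ORTE_ERROR_LOG line and is deliberately skipped by the pattern.
SAMPLE = """\
[i178-302.ranger.tacc.utexas.edu:28340] [[5795,1],19] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 105
[i178-302.ranger.tacc.utexas.edu:28340] [[5795,1],19] could not get route to [[INVALID],INVALID]
[i178-302.ranger.tacc.utexas.edu:28340] [[5795,1],19] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file base/plm_base_proxy.c at line 85
[i176-303.ranger.tacc.utexas.edu:05045] [[5795,1],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 105
"""

if __name__ == "__main__":
    sites, ranks, hosts = summarize_orte_errors(SAMPLE)
    for (filename, line), count in sites.most_common():
        print(f"{count:4d}  {filename}:{line}")
    print("ranks:", ranks)
    print("hosts:", hosts)
```

To use it on the real output, read the job's output file into a string and pass it to `summarize_orte_errors` in place of `SAMPLE`.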