Hi All,

I am having an OpenMPI issue that seems to be related to job scheduling on
TACC, one of the TeraGrid resources.

The program I am trying to run, ABySS, runs fine without the scheduler,
i.e. when I run it on the login nodes without going through qsub. But when
I submit that same command through qsub, the job fails.
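
For reference, the login-node invocation is essentially the same command as
in the qsub script below, just without ibrun; roughly:

# Same ABySS command as in the job script, run directly (no ibrun)
cd /work/01301/mmacmane/abyss-1.1.2/bin
./abyss-pe k=19 in='/work/01301/mmacmane/homo/*.fastq' \
    name='homo_47' n=5 s=200 c=13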

Here is the qsub script, fyi:

#!/bin/bash
#$ -N homo47
#$ -j y
#$ -o homo47
#$ -pe 16way 128
#$ -q normal    

#$ -l h_rt=00:30:00     
#$ -M   macma...@gmail.com
#$ -m be
cd /work/01301/mmacmane/abyss-1.1.2/bin
#$ -cwd
#$ -V
ibrun ./abyss-pe k=19 in='/work/01301/mmacmane/homo/*.fastq' \
    name='homo_47' n=5 s=200 c=13


I get an error message:

TACC: Done.
TACC: Starting up job 1299149
TACC: Setting up parallel environment for OpenMPI mpirun.
TACC: Setup complete. Running job script.
TACC: starting parallel tasks...
/opt/apps/pgi7_2/openmpi/1.3/bin/mpirun -np 64 ABYSS-P -k19 -c13
--coverage-hist=coverage.hist -s bubbles.fa  -o homo_61-1.fa
/work/01301/mmacmane/homo/SRR001665_1.fastq
/work/01301/mmacmane/homo/SRR001665_2.fastq
/work/01301/mmacmane/homo/SRR002271_1.fastq
/work/01301/mmacmane/homo/SRR002271_2.fastq
/work/01301/mmacmane/homo/SRR002273_1.fastq
/work/01301/mmacmane/homo/SRR002273_2.fastq
/work/01301/mmacmane/homo/SRR002274_1.fastq
/work/01301/mmacmane/homo/SRR002274_2.fastq
/work/01301/mmacmane/homo/SRR002275_1.fastq
/work/01301/mmacmane/homo/SRR002275_2.fastq
/work/01301/mmacmane/homo/SRR002276_1.fastq
/work/01301/mmacmane/homo/SRR002276_2.fastq
/work/01301/mmacmane/homo/SRR002291_1.fastq
/work/01301/mmacmane/homo/SRR002291_2.fastq
/work/01301/mmacmane/homo/SRR002295_1.fastq
/work/01301/mmacmane/homo/SRR002295_2.fastq
/work/01301/mmacmane/homo/SRR002297_1.fastq
/work/01301/mmacmane/homo/SRR002297_2.fastq
/work/01301/mmacmane/homo/SRR029337_1.fastq
/work/01301/mmacmane/homo/SRR029337_2.fastq

...many many lines of this...

[i178-302.ranger.tacc.utexas.edu:28340] [[5795,1],19] ORTE_ERROR_LOG:
A message is attempting to be sent to a process whose contact
information is unknown in file rml_oob_send.c at line 105
[i178-302.ranger.tacc.utexas.edu:28340] [[5795,1],19] could not get
route to [[INVALID],INVALID]
[i178-302.ranger.tacc.utexas.edu:28340] [[5795,1],19] ORTE_ERROR_LOG:
A message is attempting to be sent to a process whose contact
information is unknown in file base/plm_base_proxy.c at line 85
[i176-303.ranger.tacc.utexas.edu:05045] [[5795,1],1] ORTE_ERROR_LOG: A
message is attempting to be sent to a process whose contact
information is unknown in file rml_oob_send.c at line 105
[i176-303.ranger.tacc.utexas.edu:05045] [[5795,1],1] could not get
route to [[INVALID],INVALID]
[i176-303.ranger.tacc.utexas.edu:05045] [[5795,1],1] ORTE_ERROR_LOG: A
message is attempting to be sent to a process whose contact
information is unknown in file base/plm_base_proxy.c at line 85
[i178-302.ranger.tacc.utexas.edu:28325] [[5795,1],18] ORTE_ERROR_LOG:
A message is attempting to be sent to a process whose contact
information is unknown in file rml_oob_send.c at line 105
[i178-302.ranger.tacc.utexas.edu:28325] [[5795,1],18] could not get
route to [[INVALID],INVALID]

...many many lines of this...

TACC: Cleaning up after job: 1299149
TACC: Done.

The TACC systems administrators don't seem to have a great solution, and the
authors of the program say it's MPI-related.
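
In case it helps narrow things down, here is a stripped-down test job I could
submit with the same settings, just to check whether the OpenMPI startup
works at all under the scheduler without ABySS in the picture (hostname is
only a trivial stand-in executable):

#!/bin/bash
#$ -N mpitest
#$ -j y
#$ -o mpitest
#$ -pe 16way 128
#$ -q normal
#$ -l h_rt=00:10:00
#$ -cwd
#$ -V
# Show which MPI the job environment is picking up
module list
which mpirun
# Exercise the OpenMPI/ibrun launch across all tasks without ABySS
ibrun hostname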

_________________________________
Matthew MacManes
PhD Candidate
University of California- Berkeley
Museum of Vertebrate Zoology
Phone: 510-495-5833
Lab Website: http://ib.berkeley.edu/labs/lacey
Personal Website: http://macmanes.com/
