I have updated to OpenMPI 1.2.6 and had the user rerun his jobs. He's getting similar output:
[root@aeolus logs]# more 2047.aeolus.OU
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
data directory is /mnt/pvfs2/patton/data/chem/aa1
exec directory is /mnt/pvfs2/patton/exec/chem/aa1
arch directory is /mnt/pvfs2/patton/data/chem/aa1
mpirun: killing job...

Terminated
--------------------------------------------------------------------------
WARNING: mpirun is in the process of killing a job, but has detected an
interruption (probably control-C).

It is dangerous to interrupt mpirun while it is killing a job (proper
termination may not be guaranteed). Hit control-C again within 1
second if you really want to kill mpirun immediately.
--------------------------------------------------------------------------
[compute-0-0.local:03444] OOB: Connection to HNP lost
--------------------------

Any suggestions? At the moment we both seem to be stuck in a
finger-pointing cycle: neither of us can find anything wrong on our own
side, and neither of us can come up with any further debugging
information. The job dies at random intervals (in both timesteps and
wall-clock time), and the user assures me it's always dying well before
the allowed walltime.
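In case it helps anyone spot a pattern, here is a minimal sketch of the
extra instrumentation we could wrap around the mpirun call in the job
script. The timestamp/status lines and the mpirun.log.$time filename are
illustrative additions of mine, not part of the user's actual script:

---------
# illustrative only; not in the user's current script
# record the start time and the nodes Torque handed us
date              >  mpirun.log.$time
cat $PBS_NODEFILE >> mpirun.log.$time
#
# run the solver, then grab mpirun's exit status before anything else
# can overwrite it ($status is csh's exit-status variable)
mpirun -n 2 ./lesmpi.a > $les_output
set mpistat=$status
#
date                          >> mpirun.log.$time
echo "mpirun status $mpistat" >> mpirun.log.$time
---------

The idea is just to see whether the termination lines up with anything
in the pbs_mom logs on the nodes the job landed on.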
Thanks!

--Jim

On Tue, May 20, 2008 at 1:23 PM, Jim Kusznir <jkusz...@gmail.com> wrote:
> Hello all:
>
> I've got a user on our ROCKS 4.3 cluster that's having some strange
> errors. I have other users using the cluster without any such errors
> reported, but this user also runs this code on other clusters without
> any problems, so I'm not really sure where the problem lies. They are
> getting logs with the following:
>
> --------
> Warning: no access to tty (Bad file descriptor).
> Thus no job control in this shell.
> data directory is /mnt/pvfs2/patton/data/chem/aa1
> exec directory is /mnt/pvfs2/patton/exec/chem/aa1
> arch directory is /mnt/pvfs2/patton/data/chem/aa1
> mpirun: killing job...
>
> Terminated
> --------------------------------------------------------------------------
> WARNING: mpirun is in the process of killing a job, but has detected an
> interruption (probably control-C).
>
> It is dangerous to interrupt mpirun while it is killing a job (proper
> termination may not be guaranteed). Hit control-C again within 1
> second if you really want to kill mpirun immediately.
> --------------------------------------------------------------------------
> mpirun noticed that job rank 0 with PID 14126 on node
> compute-0-23.local exited on signal 15 (Terminated).
> [compute-0-23.local:14124] [0,0,0]-[0,0,1] mca_oob_tcp_msg_recv: readv
> failed: Connection reset by peer (104)
> ---------
>
> The job was submitted with:
> ---------
> #!/bin/csh
> ##PBS -N for.chem.aa1
> #PBS -l nodes=2
> #PBS -l walltime=0:30:00
> #PBS -m n
> #PBS -j oe
> #PBS -o /home/patton/logs
> #PBS -e /home/patton/logs
> #PBS -V
> #
> # ------ set case specific parameters
> #        and setup directory structure
> #
> set time=000001_000100
> #
> set case=aa1
> set type=chem
> #
> # ---- set up directories
> #
> set SCRATCH=/mnt/pvfs2/patton
> mkdir -p $SCRATCH
>
> set datadir=$SCRATCH/data/$type/$case
> set execdir=$SCRATCH/exec/$type/$case
> set archdir=$SCRATCH/data/$type/$case
> set les_output=les.$type.$case.out.$time
>
> set compdir=$HOME/compile/$type/$case
> #set compdir=$HOME/compile/free/aa1
>
> echo 'data directory is ' $datadir
> echo 'exec directory is ' $execdir
> echo 'arch directory is ' $archdir
>
> mkdir -p $datadir
> mkdir -p $execdir
> #
> cd $execdir
> rm -fr *
> cp $compdir/* .
> #
> # ------- build machine file for code to read setup
> #
> # ------------ set imachine=0 for NCAR IBM SP   : bluevista
> #                  imachine=1 for NCAR IBM SP   : bluesky
> #                  imachine=2 for ASC SGI Altix : eagle
> #                  imachine=3 for ERDC Cray XT3 : sapphire
> #                  imachine=4 for ASC HP XC     : falcon
> #                  imachine=5 for NERSC Cray XT4: franklin
> #                  imachine=6 for WSU Cluster   : aeolus
> #
> set imachine=6
> set store_files=1
> set OMP_NUM_THREADS=1
> #
> echo $imachine > mach.file
> echo $store_files >> mach.file
> echo $datadir >> mach.file
> echo $archdir >> mach.file
> #
> # ---- submit the run
> #
> mpirun -n 2 ./lesmpi.a > $les_output
> #
> # ------ clean up
> #
> mv $execdir/u.* $datadir
> mv $execdir/p.* $datadir
> mv $execdir/his.* $datadir
> cp $execdir/$les_output $datadir
> #
> echo 'job ended '
> exit
> #
> -------------
> (It's possible this particular script doesn't match this particular
> error... The user ran the job, and this is what I assembled from
> conversations with him. In any case, it's representative of the jobs
> he's running, and they're all returning similar errors.)
>
> The error occurs at varying time steps in the runs, and if run without
> MPI, it runs fine to completion.
>
> Here's the version info:
>
> [kusznir@aeolus ~]$ rpm -qa |grep pgi
> pgilinux86-64-707-1
> openmpi-pgi-docs-1.2.4-1
> openmpi-pgi-devel-1.2.4-1
> roll-pgi-usersguide-4.3-0
> openmpi-pgi-runtime-1.2.4-1
> mpich-ethernet-pgi-1.2.7p1-1
> pgi-rocks-4.3-0
>
> The OpenMPI rpms were built from the supplied spec (or nearly so,
> anyway) with the following command line:
> CC=pgcc CXX=pgCC F77=pgf77 FC=pgf90 rpmbuild -bb \
>     --define 'install_in_opt 1' --define 'install_modulefile 1' \
>     --define 'modules_rpm_name environment-modules' \
>     --define 'build_all_in_one_rpm 0' \
>     --define 'configure_options --with-tm=/opt/torque' \
>     --define '_name openmpi-pgi' --define 'use_default_rpm_opt_flags 0' \
>     openmpi.spec
>
> Any suggestions?
>
> Thanks!
>
> --Jim
>
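One more sanity check that might be worth running (just a suggestion; we
have not actually done this yet): the rpm list in the quoted message
shows both openmpi-pgi and mpich-ethernet-pgi installed, so it may be
worth confirming that the batch environment and the user's binary both
resolve to the Open MPI build rather than the MPICH one. Something along
these lines, run on a compute node (the ldd line assumes lesmpi.a is
dynamically linked):

---------
# which launcher is first in the PATH the job inherits
which mpirun
#
# confirm the Open MPI version that launcher belongs to
ompi_info | head
#
# see which MPI library the user's executable is actually linked against
ldd /mnt/pvfs2/patton/exec/chem/aa1/lesmpi.a | grep -i mpi
---------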