I've asked for verification, but I recall the original verbal complaint claiming the failure point was random, sometimes as little as 2 minutes into a job.
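One cheap way to instrument for this is to timestamp the run inside the batch script itself, so a kill near the 30-minute walltime limit can be told apart from a genuinely random mid-run failure. A minimal sketch (in POSIX sh rather than the user's csh; the 1800 s value mirrors the script's `#PBS -l walltime=0:30:00`, everything else is placeholder):

```shell
#!/bin/sh
# Hypothetical timing wrapper: record wall-clock seconds around the MPI
# run.  If the job always dies right at the limit, suspect the scheduler;
# if the elapsed time varies wildly, the failure is in the job itself.
WALLTIME_LIMIT=1800     # mirrors "#PBS -l walltime=0:30:00"

start=$(date +%s)
# ... the real job would run here, e.g.  mpirun -n 2 ./lesmpi.a ...
sleep 1
end=$(date +%s)
elapsed=$((end - start))

if [ "$elapsed" -ge "$WALLTIME_LIMIT" ]; then
    echo "ran the full walltime -- suspect a scheduler kill"
else
    echo "died after ${elapsed}s, well inside the limit"
fi
```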
They have said they've run more tests with more instrumentation on their code, and it always fails in a random place... Same job, different results.

--Jim

On Fri, May 23, 2008 at 11:54 AM, Jeff Squyres <jsquy...@cisco.com> wrote:
> This may be a dumb question, but is there a chance that his job is
> running beyond 30 minutes, and PBS/Torque/whatever is killing it?
>
> On May 20, 2008, at 4:23 PM, Jim Kusznir wrote:
>
>> Hello all:
>>
>> I've got a user on our ROCKS 4.3 cluster that's having some strange
>> errors. I have other users using the cluster without any such errors
>> reported, but this user also runs this code on other clusters without
>> any problems, so I'm not really sure where the problem lies. They are
>> getting logs with the following:
>>
>> --------
>> Warning: no access to tty (Bad file descriptor).
>> Thus no job control in this shell.
>> data directory is /mnt/pvfs2/patton/data/chem/aa1
>> exec directory is /mnt/pvfs2/patton/exec/chem/aa1
>> arch directory is /mnt/pvfs2/patton/data/chem/aa1
>> mpirun: killing job...
>>
>> Terminated
>> --------------------------------------------------------------------------
>> WARNING: mpirun is in the process of killing a job, but has detected an
>> interruption (probably control-C).
>>
>> It is dangerous to interrupt mpirun while it is killing a job (proper
>> termination may not be guaranteed). Hit control-C again within 1
>> second if you really want to kill mpirun immediately.
>> --------------------------------------------------------------------------
>> mpirun noticed that job rank 0 with PID 14126 on node
>> compute-0-23.local exited on signal 15 (Terminated).
>> [compute-0-23.local:14124] [0,0,0]-[0,0,1] mca_oob_tcp_msg_recv: readv
>> failed: Connection reset by peer (104)
>> ---------
>>
>> The job was submitted with:
>> ---------
>> #!/bin/csh
>> ##PBS -N for.chem.aa1
>> #PBS -l nodes=2
>> #PBS -l walltime=0:30:00
>> #PBS -m n
>> #PBS -j oe
>> #PBS -o /home/patton/logs
>> #PBS -e /home/patton/logs
>> #PBS -V
>> #
>> # ------ set case specific parameters
>> #        and setup directory structure
>> #
>> set time=000001_000100
>> #
>> set case=aa1
>> set type=chem
>> #
>> # ---- set up directories
>> #
>> set SCRATCH=/mnt/pvfs2/patton
>> mkdir -p $SCRATCH
>>
>> set datadir=$SCRATCH/data/$type/$case
>> set execdir=$SCRATCH/exec/$type/$case
>> set archdir=$SCRATCH/data/$type/$case
>> set les_output=les.$type.$case.out.$time
>>
>> set compdir=$HOME/compile/$type/$case
>> #set compdir=$HOME/compile/free/aa1
>>
>> echo 'data directory is ' $datadir
>> echo 'exec directory is ' $execdir
>> echo 'arch directory is ' $archdir
>>
>> mkdir -p $datadir
>> mkdir -p $execdir
>> #
>> cd $execdir
>> rm -fr *
>> cp $compdir/* .
>> #
>> # ------- build machine file for code to read setup
>> #
>> # ------------ set imachine=0 for NCAR IBM SP    : bluevista
>> #                  imachine=1 for NCAR IBM SP    : bluesky
>> #                  imachine=2 for ASC SGI Altix  : eagle
>> #                  imachine=3 for ERDC Cray XT3  : sapphire
>> #                  imachine=4 for ASC HP XC      : falcon
>> #                  imachine=5 for NERSC Cray XT4 : franklin
>> #                  imachine=6 for WSU Cluster    : aeolus
>> #
>> set imachine=6
>> set store_files=1
>> set OMP_NUM_THREADS=1
>> #
>> echo $imachine > mach.file
>> echo $store_files >> mach.file
>> echo $datadir >> mach.file
>> echo $archdir >> mach.file
>> #
>> # ---- submit the run
>> #
>> mpirun -n 2 ./lesmpi.a > $les_output
>> #
>> # ------ clean up
>> #
>> mv $execdir/u.* $datadir
>> mv $execdir/p.* $datadir
>> mv $execdir/his.* $datadir
>> cp $execdir/$les_output $datadir
>> #
>> echo 'job ended '
>> exit
>> #
>> -------------
>> (It's possible this particular script doesn't match this particular
>> error... The user ran the job, and this is what I assembled from
>> conversations with him. In any case, it's representative of the jobs
>> he's running, and they're all returning similar errors.)
>>
>> The error occurs at varying time steps in the runs, and if run without
>> MPI, it runs fine to completion.
>>
>> Here's the version info:
>>
>> [kusznir@aeolus ~]$ rpm -qa |grep pgi
>> pgilinux86-64-707-1
>> openmpi-pgi-docs-1.2.4-1
>> openmpi-pgi-devel-1.2.4-1
>> roll-pgi-usersguide-4.3-0
>> openmpi-pgi-runtime-1.2.4-1
>> mpich-ethernet-pgi-1.2.7p1-1
>> pgi-rocks-4.3-0
>>
>> The OpenMPI rpms were built from the supplied spec (or nearly so,
>> anyway) with the following command line:
>> CC=pgcc CXX=pgCC F77=pgf77 FC=pgf90 rpmbuild -bb \
>>   --define 'install_in_opt 1' \
>>   --define 'install_modulefile 1' \
>>   --define 'modules_rpm_name environment-modules' \
>>   --define 'build_all_in_one_rpm 0' \
>>   --define 'configure_options --with-tm=/opt/torque' \
>>   --define '_name openmpi-pgi' \
>>   --define 'use_default_rpm_opt_flags 0' \
>>   openmpi.spec
>>
>> Any suggestions?
>>
>> Thanks!
>>
>> --Jim
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> --
> Jeff Squyres
> Cisco Systems
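For readers hitting the same symptom: whether Torque actually delivered the SIGTERM can usually be confirmed after the fact from the server/MOM logs (Torque's `tracejob` command, where installed, collates them per job id). A sketch of the check; the log line below is simulated for illustration, not taken from this cluster, and the exact wording of Torque's message may differ:

```shell
#!/bin/sh
# With Torque installed one would run something like
#     tracejob -n 3 <jobid>
# and look for a walltime/kill entry for the job.  The line below is a
# *simulated* server-log entry used to show the classification logic.
logline='05/23/2008 12:24:01;0008;PBS_Server;Job;1234.aeolus;Job exceeded its walltime limit'

case "$logline" in
  *walltime*) echo "scheduler kill: walltime limit reached" ;;
  *)          echo "no walltime entry: job died on its own" ;;
esac
```

If no such entry exists for a failed job, the SIGTERM came from somewhere else and the walltime theory can be ruled out.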