I've asked for verification, but as I recall, the original verbal
complaint was that the wall-clock time at which it failed was random,
sometimes as short as 2 minutes into a job.

They have said they've run more tests with more instrumentation in
their code, and it always fails in a random place.  Same job,
different results.
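
One thing I may ask him to try, to rule the 30-minute walltime limit
in or out (just a sketch: it assumes Torque's tracejob utility is on
the head node's path, and the job ID below is only a placeholder):

# --- in the job script, around the mpirun line (csh) ---
echo "job $PBS_JOBID started:  `date`"
mpirun -n 2 ./lesmpi.a > $les_output
echo "mpirun exited with status $status at:  `date`"

# --- afterwards, on the head node, with the real job ID ---
tracejob 12345

If Torque were enforcing the walltime, the two timestamps should be
about 30 minutes apart and tracejob should show the server killing
the job; with failures as early as 2 minutes in, I'd expect to see
neither.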

--Jim

On Fri, May 23, 2008 at 11:54 AM, Jeff Squyres <jsquy...@cisco.com> wrote:
> This may be a dumb question, but is there a chance that his job is
> running beyond 30 minutes, and PBS/Torque/whatever is killing it?
>
> On May 20, 2008, at 4:23 PM, Jim Kusznir wrote:
>
>> Hello all:
>>
>> I've got a user on our ROCKS 4.3 cluster that's having some strange
>> errors.  I have other users using the cluster without any such errors
>> reported, but this user also runs this code on other clusters without
>> any problems, so I'm not really sure where the problem lies.  They are
>> getting logs with the following:
>>
>> --------
>> Warning: no access to tty (Bad file descriptor).
>> Thus no job control in this shell.
>> data directory is  /mnt/pvfs2/patton/data/chem/aa1
>> exec directory is  /mnt/pvfs2/patton/exec/chem/aa1
>> arch directory is  /mnt/pvfs2/patton/data/chem/aa1
>> mpirun: killing job...
>>
>> Terminated
>> --------------------------------------------------------------------------
>> WARNING: mpirun is in the process of killing a job, but has detected
>> an
>> interruption (probably control-C).
>>
>> It is dangerous to interrupt mpirun while it is killing a job (proper
>> termination may not be guaranteed).  Hit control-C again within 1
>> second if you really want to kill mpirun immediately.
>> --------------------------------------------------------------------------
>> mpirun noticed that job rank 0 with PID 14126 on node
>> compute-0-23.local exited on signal 15 (Terminated).
>> [compute-0-23.local:14124] [0,0,0]-[0,0,1] mca_oob_tcp_msg_recv: readv
>> failed: Connection reset by peer (104)
>> ---------
>>
>> The job was submitted with:
>> ---------
>> #!/bin/csh
>> ##PBS -N for.chem.aa1
>> #PBS -l nodes=2
>> #PBS -l walltime=0:30:00
>> #PBS -m n
>> #PBS -j oe
>> #PBS -o /home/patton/logs
>> #PBS -e /home/patton/logs
>> #PBS -V
>> #
>> # ------ set case specific parameters
>> #        and setup directory structure
>> #
>> set time=000001_000100
>> #
>> set case=aa1
>> set type=chem
>> #
>> # ---- set up directories
>> #
>> set SCRATCH=/mnt/pvfs2/patton
>> mkdir -p $SCRATCH
>>
>> set datadir=$SCRATCH/data/$type/$case
>> set execdir=$SCRATCH/exec/$type/$case
>> set archdir=$SCRATCH/data/$type/$case
>> set les_output=les.$type.$case.out.$time
>>
>> set compdir=$HOME/compile/$type/$case
>> #set compdir=$HOME/compile/free/aa1
>>
>> echo 'data directory is ' $datadir
>> echo 'exec directory is ' $execdir
>> echo 'arch directory is ' $archdir
>>
>> mkdir -p $datadir
>> mkdir -p $execdir
>> #
>> cd $execdir
>> rm -fr *
>> cp $compdir/* .
>> #
>> # ------- build machine file for code to read setup
>> #
>> # ------------ set imachine=0 for NCAR IBM SP    : bluevista
>> #                  imachine=1 for NCAR IBM SP    : bluesky
>> #                  imachine=2 for ASC SGI Altix  : eagle
>> #                  imachine=3 for ERDC Cray XT3  : sapphire
>> #                  imachine=4 for ASC HP XC      : falcon
>> #                  imachine=5 for NERSC Cray XT4 : franklin
>> #                  imachine=6 for WSU Cluster    : aeolus
>> #
>> set imachine=6
>> set store_files=1
>> set OMP_NUM_THREADS=1
>> #
>> echo $imachine > mach.file
>> echo $store_files >> mach.file
>> echo $datadir >> mach.file
>> echo $archdir >> mach.file
>> #
>> # ---- submit the run
>> #
>> mpirun -n 2 ./lesmpi.a > $les_output
>> #
>> # ------ clean up
>> #
>> mv $execdir/u.* $datadir
>> mv $execdir/p.* $datadir
>> mv $execdir/his.* $datadir
>> cp $execdir/$les_output $datadir
>> #
>> echo 'job ended '
>> exit
>> #
>> -------------
>> (It's possible this particular script doesn't match this particular
>> error... The user ran the job, and this is what I assembled from
>> conversations with him.  In any case, it's representative of the jobs
>> he's running, and they're all returning similar errors.)
>>
>> The error occurs at varying time steps in the runs, and if run without
>> MPI, it runs fine to completion.
>>
>> Here's the version info:
>>
>> [kusznir@aeolus ~]$ rpm -qa |grep pgi
>> pgilinux86-64-707-1
>> openmpi-pgi-docs-1.2.4-1
>> openmpi-pgi-devel-1.2.4-1
>> roll-pgi-usersguide-4.3-0
>> openmpi-pgi-runtime-1.2.4-1
>> mpich-ethernet-pgi-1.2.7p1-1
>> pgi-rocks-4.3-0
>>
>> The OpenMPI rpms were built from the supplied spec (or nearly so,
>> anyway) with the following command line:
>> CC=pgcc CXX=pgCC F77=pgf77 FC=pgf90 rpmbuild -bb \
>>   --define 'install_in_opt 1' \
>>   --define 'install_modulefile 1' \
>>   --define 'modules_rpm_name environment-modules' \
>>   --define 'build_all_in_one_rpm 0' \
>>   --define 'configure_options --with-tm=/opt/torque' \
>>   --define '_name openmpi-pgi' \
>>   --define 'use_default_rpm_opt_flags 0' \
>>   openmpi.spec
>>
>> Any suggestions?
>>
>> Thanks!
>>
>> --Jim
>
>
> --
> Jeff Squyres
> Cisco Systems
>
