"Lane, William" <william.l...@cshs.org> writes:

> Ralph,
>
> For the following openMPI job submission:
>
> qsub -q short.q -V -pe make 84 -b y mpirun -np 84 --prefix
> /hpc/apps/mpi/openmpi/1.10.1/ --hetero-nodes --mca btl ^sm --mca
> plm_base_verbose 5 /hpc/home/lanew/mpi/openmpi/a_1_10_1.out
>
> I have some more information on this issue. All the server daemons are
> started without error, and before I ever see the
>
> [csclprd3-5-2:10512] [[42154,0],0] plm:base:receive got
> update_proc_state for job [42154,1]
> [csclprd3-6-12:30667] *** Process received signal ***
> [csclprd3-6-12:30667] Signal: Bus error (7)
>
> qrsh throws the following error for various nodes taking part in the
> openMPI compute ring:
>
> unable to write to file /tmp/285507.1.short.q/qrsh_error: No space
> left on device
> [csclprd3-4-3:08052] [[24964,0],17] plm:rsh: using
> "/opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V -verbose" for
> launching
>
> Does each and every node taking part in the openMPI compute ring need
> to write to a temporary directory?
If you don't believe the maintainer, please check the code or run it yourself to verify that it's pointless to continue with no space for SGE's tmpdir (which means you basically have a broken system). Really. I'd expect shmem to stop with an error message if it couldn't write the memory-mapped file, so maybe there's an OMPI bug there, but it won't happen on a healthy node with enough space. It's also possible that SGE should have seen the job failure as a system problem and disabled the queue, if it didn't, but I don't know how the job failure will have appeared.
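As a quick sanity check before chasing OMPI itself, you can verify that the tmpdir on each node actually has space. A minimal sketch (assumptions: POSIX `df` and `awk`; `TMPDIR` is the per-job directory SGE exports, falling back to `/tmp` here; the 100 MB threshold is arbitrary):

```shell
#!/bin/sh
# Report available space on the filesystem backing the job's tmpdir.
# TMPDIR is set by SGE for each job; outside a job we fall back to /tmp.
tmpdir="${TMPDIR:-/tmp}"

# df -Pk gives POSIX-format output in 1 KB blocks; field 4 of the
# second line is the available space.
avail_kb=$(df -Pk "$tmpdir" | awk 'NR==2 {print $4}')

# 102400 KB = 100 MB, an arbitrary illustrative threshold.
if [ "$avail_kb" -lt 102400 ]; then
    echo "WARNING: only ${avail_kb} KB free on $tmpdir"
else
    echo "OK: ${avail_kb} KB free on $tmpdir"
fi
```

You'd run this on every node in the ring (e.g. via `qrsh` to each host) to spot the one that's full before the job even launches.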