"Lane, William" <william.l...@cshs.org> writes:

> Ralph,
>
> For the following openMPI job submission:
>
> qsub -q short.q -V -pe make 84 -b y mpirun -np 84 --prefix
> /hpc/apps/mpi/openmpi/1.10.1/ --hetero-nodes --mca btl ^sm --mca
> plm_base_verbose 5 /hpc/home/lanew/mpi/openmpi/a_1_10_1.out
>
> I have some more information on this issue. All the server daemons are
> started without error, and before I ever see the
>
> [csclprd3-5-2:10512] [[42154,0],0] plm:base:receive got
> update_proc_state for job [42154,1]
> [csclprd3-6-12:30667] *** Process received signal ***
> [csclprd3-6-12:30667] Signal: Bus error (7)
>
> qrsh throws the following error for various nodes taking part in the
> openMPI compute ring:
>
> unable to write to file /tmp/285507.1.short.q/qrsh_error: No space
> left on device
> [csclprd3-4-3:08052] [[24964,0],17] plm:rsh: using
> "/opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V -verbose" for
> launching
>
> Does each and every node taking part in the openMPI compute ring need
> to write to a temporary directory?
If you don't believe the maintainer, please check the code or run it yourself to verify that it's pointless to continue with no space for SGE's tmpdir (which means you basically have a broken system). Really. I'd expect shmem to stop with an error message if it couldn't write the memory-mapped file, so maybe there's an OMPI bug there, but it won't happen on a healthy node with enough space. It's also possible that SGE should have seen the job failure as a system problem and disabled the queue, if it didn't, but I don't know how the job failure will have appeared.
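As a quick sanity check before chasing OMPI itself, you can verify that the tmpdir on each node actually has space. A minimal sketch (assumptions: POSIX `df` and `awk`; `TMPDIR` is the per-job directory SGE exports, falling back to `/tmp` here; the 100 MB threshold is arbitrary):

```shell
#!/bin/sh
# Report available space on the filesystem backing the job's tmpdir.
# TMPDIR is set by SGE for each job; outside a job we fall back to /tmp.
tmpdir="${TMPDIR:-/tmp}"

# df -Pk gives POSIX-format output in 1 KB blocks; field 4 of the
# second line is the available space.
avail_kb=$(df -Pk "$tmpdir" | awk 'NR==2 {print $4}')

# 102400 KB = 100 MB, an arbitrary illustrative threshold.
if [ "$avail_kb" -lt 102400 ]; then
    echo "WARNING: only ${avail_kb} KB free on $tmpdir"
else
    echo "OK: ${avail_kb} KB free on $tmpdir"
fi
```

You'd run this on every node in the ring (e.g. via `qrsh` to each host) to spot the one that's full before the job even launches.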