I forget - have you tried this launch with the "-mca plm_rsh_no_tree_spawn 1" option? It might let you get further in the launch, but I suspect you will then hit the shared memory error again.
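For example, starting from your earlier 84-slot command line, something like this might be worth a try (an untested sketch on my part; /scratch/tmp below is only a placeholder for whatever filesystem actually has room, per the tmpdir point in the next paragraph):

qsub -q short.q -V -pe make 84 -b y mpirun -np 84 --prefix /hpc/apps/mpi/openmpi/1.10.1/ --hetero-nodes --mca plm_rsh_no_tree_spawn 1 --mca orte_tmpdir_base /scratch/tmp --mca btl ^sm --mca plm_base_verbose 5 /hpc/home/lanew/mpi/openmpi/a_1_10_1.out

The orte_tmpdir_base MCA parameter relocates Open MPI's session directory (and with it the shared memory backing files) off /tmp; exporting TMPDIR to the daemons should have a similar effect, provided the alternate directory exists on every node. Note that the qrsh_error path in your log sits under Grid Engine's own per-job tmp directory, so the qrsh side may still require /tmp to be cleaned up or SGE's tmpdir setting pointed elsewhere.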
The problem is that your tmp file system is flooded and so we are hitting either qrsh being unable to launch on a backend node, or shared memory failing because it cannot create the backing file. The real question is: why is your tmp directory full? Perhaps you can point things at a different tmpdir where there is more room?

> On Mar 18, 2016, at 4:05 PM, Lane, William <william.l...@cshs.org> wrote:
> 
> Ralph,
> 
> For the following openMPI job submission:
> 
> qsub -q short.q -V -pe make 84 -b y mpirun -np 84 --prefix /hpc/apps/mpi/openmpi/1.10.1/ --hetero-nodes --mca btl ^sm --mca plm_base_verbose 5 /hpc/home/lanew/mpi/openmpi/a_1_10_1.out
> 
> I have some more information on this issue. All the server daemons are started without error and before I ever see the
> 
> [csclprd3-5-2:10512] [[42154,0],0] plm:base:receive got update_proc_state for job [42154,1]
> [csclprd3-6-12:30667] *** Process received signal ***
> [csclprd3-6-12:30667] Signal: Bus error (7)
> 
> qrsh throws the following error for various nodes taking part in the openmpi compute ring:
> 
> unable to write to file /tmp/285507.1.short.q/qrsh_error: No space left on device
> [csclprd3-4-3:08052] [[24964,0],17] plm:rsh: using "/opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V -verbose" for launching
> 
> Does each and every node taking part in the openMPI compute ring need to write to a temporary directory?
> 
> Also, the submit node for this cluster has a soft open files limit of 1024 and a hard open files limit of 4096. All compute nodes have a hard/soft open files limit of 4096.
> 
> -Bill L.
> 
> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
> Sent: Thursday, March 17, 2016 6:02 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI 1.10.1 *ix hard/soft open files limits >= 4096 still required?
> 
> Yeah, it looks like something is wrong with the mmap backend for some reason. It gets used by both vader and sm, so no help there.
> 
> I’m afraid I’ll have to defer to Nathan from here as he is more familiar with it than I.
> 
>> On Mar 17, 2016, at 4:55 PM, Lane, William <william.l...@cshs.org> wrote:
>> 
>> I ran OpenMPI using the "-mca btl ^vader" switch Ralph recommended and I'm still getting the same errors
>> 
>> qsub -q short.q -V -pe make 206 -b y mpirun -np 206 --prefix /hpc/apps/mpi/openmpi/1.10.1/ --hetero-nodes --mca btl ^vader --mca plm_base_verbose 5 /hpc/home/lanew/mpi/openmpi/a_1_10_1.out
>> 
>> [csclprd3-5-2:10512] [[42154,0],0] plm:base:receive got update_proc_state for job [42154,1]
>> [csclprd3-6-12:30667] *** Process received signal ***
>> [csclprd3-6-12:30667] Signal: Bus error (7)
>> [csclprd3-6-12:30667] Signal code: Non-existant physical address (2)
>> [csclprd3-6-12:30667] Failing at address: 0x2b1b18a2d000
>> [csclprd3-6-12:30667] [ 0] /lib64/libpthread.so.0(+0xf500)[0x2b1b0e06c500]
>> [csclprd3-6-12:30667] [ 1] /hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_shmem_mmap.so(+0x1524)[0x2b1b0f5fd524]
>> [csclprd3-6-12:30667] [ 2] /hpc/apps/mpi/openmpi/1.10.1/lib/libmca_common_sm.so.4(mca_common_sm_module_create_and_attach+0x56)[0x2b1b124daab6]
>> [csclprd3-6-12:30667] [ 3] /hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_btl_sm.so(+0x39cb)[0x2b1b12d749cb]
>> [csclprd3-6-12:30667] [ 4] /hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_btl_sm.so(+0x3f2a)[0x2b1b12d74f2a]
>> [csclprd3-6-12:30667] [ 5] /hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(mca_btl_base_select+0x117)[0x2b1b0ddfdb07]
>> [csclprd3-6-12:30667] [ 6] /hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x12)[0x2b1b126de7b2]
>> [csclprd3-6-12:30667] [ 7] /hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(mca_bml_base_init+0x99)[0x2b1b0ddfd309]
>> [csclprd3-6-12:30667] [ 8] /hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_pml_ob1.so(+0x538c)[0x2b1b133a138c]
>> [csclprd3-6-12:30667] [ 9] /hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(mca_pml_base_select+0x1e0)[0x2b1b0de0e780]
>> [csclprd3-6-12:30667] [10] /hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(ompi_mpi_init+0x51d)[0x2b1b0ddc017d]
>> [csclprd3-6-12:30667] [11] /hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(MPI_Init+0x170)[0x2b1b0dddf820]
>> [csclprd3-6-12:30667] [12] /hpc/home/lanew/mpi/openmpi/a_1_10_1.out[0x400ad0]
>> [csclprd3-6-12:30667] [13] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2b1b0e298cdd]
>> [csclprd3-6-12:30667] [14] /hpc/home/lanew/mpi/openmpi/a_1_10_1.out[0x400999]
>> [csclprd3-6-12:30667] *** End of error message ***
>> 
>> -Bill L.
>> 
>> From: users [users-boun...@open-mpi.org] on behalf of Lane, William [william.l...@cshs.org]
>> Sent: Thursday, March 17, 2016 4:49 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] OpenMPI 1.10.1 *ix hard/soft open files limits >= 4096 still required?
>> 
>> I apologize Ralph, I forgot to include my command line for invoking OpenMPI on SoGE:
>> 
>> qsub -q short.q -V -pe make 87 -b y mpirun -np 87 --prefix /hpc/apps/mpi/openmpi/1.10.1/ --hetero-nodes --mca btl ^sm --mca plm_base_verbose 5 /hpc/home/lanew/mpi/openmpi/a_1_10_1.out
>> 
>> a_1_10_1.out is my OpenMPI test code binary compiled under OpenMPI 1.10.1.
>> 
>> Thanks for the quick response!
>> 
>> -Bill L.
>> 
>> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
>> Sent: Thursday, March 17, 2016 4:44 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] OpenMPI 1.10.1 *ix hard/soft open files limits >= 4096 still required?
>> 
>> No, that shouldn’t be the issue any more - and that isn’t what the backtrace indicates. It looks instead like there was a problem with the shared memory backing file on a remote node, and that caused the vader shared memory BTL to segfault.
>> 
>> Try turning vader off and see if that helps - I’m not sure what you are using, but maybe “-mca btl ^vader” will suffice
>> 
>> Nathan - any other suggestions?
>> 
>> 
>>> On Mar 17, 2016, at 4:40 PM, Lane, William <william.l...@cshs.org> wrote:
>>> 
>>> I remember years ago, OpenMPI (version 1.3.3) required the hard/soft open files limits be >= 4096 in order to function when large numbers of slots were requested (with 1.3.3 this was at roughly 85 slots). Is this requirement still present for OpenMPI versions 1.10.1 and greater?
>>> 
>>> I'm having some issues now with OpenMPI version 1.10.1 that remind me of the issues I had w/1.3.3, where OpenMPI worked fine as long as I didn't request too many slots.
>>> 
>>> When I look at ulimit -a (soft limits) I see:
>>> open files (-n) 1024
>>> 
>>> ulimit -Ha (hard limits) gives:
>>> open files (-n) 4096
>>> 
>>> I'm getting errors of the form:
>>> [csclprd3-5-5:15248] [[40732,0],0] plm:base:receive got update_proc_state for job [40732,1]
>>> [csclprd3-6-12:30567] *** Process received signal ***
>>> [csclprd3-6-12:30567] Signal: Bus error (7)
>>> [csclprd3-6-12:30567] Signal code: Non-existant physical address (2)
>>> [csclprd3-6-12:30567] Failing at address: 0x2b3d19f72000
>>> [csclprd3-6-12:30568] *** Process received signal ***
>>> [csclprd3-6-12:30567] [ 0] /lib64/libpthread.so.0(+0xf500)[0x2b3d0f71f500]
>>> [csclprd3-6-12:30567] [ 1] /hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_shmem_mmap.so(+0x1524)[0x2b3d10cb0524]
>>> [csclprd3-6-12:30567] [ 2] /hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_btl_vader.so(+0x3674)[0x2b3d18494674]
>>> [csclprd3-6-12:30567] [ 3] /hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(mca_btl_base_select+0x117)[0x2b3d0f4b0b07]
>>> [csclprd3-6-12:30567] [ 4] /hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x12)[0x2b3d13d917b2]
>>> [csclprd3-6-12:30567] [ 5] /hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(mca_bml_base_init+0x99)[0x2b3d0f4b0309]
>>> [csclprd3-6-12:30567] [ 6] /hpc/apps/mpi/openmpi/1.10.1/lib/openmpi/mca_pml_ob1.so(+0x538c)[0x2b3d18ac238c]
>>> [csclprd3-6-12:30567] [ 7] /hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(mca_pml_base_select+0x1e0)[0x2b3d0f4c1780]
>>> [csclprd3-6-12:30567] [ 8] /hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(ompi_mpi_init+0x51d)[0x2b3d0f47317d]
>>> [csclprd3-6-12:30567] [ 9] /hpc/apps/mpi/openmpi/1.10.1/lib/libmpi.so.12(MPI_Init+0x170)[0x2b3d0f492820]
>>> [csclprd3-6-12:30567] [10] /hpc/home/lanew/mpi/openmpi/a_1_10_1.out[0x400ad0]
>>> [csclprd3-6-12:30567] [11] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2b3d0f94bcdd]
>>> [csclprd3-6-12:30567] [12] /hpc/home/lanew/mpi/openmpi/a_1_10_1.out[0x400999]
>>> [csclprd3-6-12:30567] *** End of error message ***
>>> 
>>> Ugh.
>>> 
>>> Bill L.