It isn't really Torque that is imposing those constraints:
- the torque_mom initscript inherits from the OS whatever ulimits are in effect at that time;
- each job inherits the ulimits from the pbs_mom.
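To see what jobs are actually inheriting, you can look at the limits of the running pbs_mom. A rough sketch, assuming a Linux node where the daemon runs under the name pbs_mom:

# limits the mom (and therefore its jobs) is currently running with
cat /proc/$(pgrep -o pbs_mom)/limits

# or check from inside an interactive job
qsub -I -l nodes=1
ulimit -l    # max locked memory; should be large or unlimited for InfiniBand
ulimit -n    # open files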
Thus, you need to change the ulimits from whatever is set at startup time, e.g., in /etc/sysconfig/torque_mom: ulimit -d unlimited ulimit -s unlimited ulimit -n 32768 ulimit -l 2097152 or whatever you consider to be reasonable. Cheers, Martin -- Martin Siegert WestGrid/ComputeCanada Simon Fraser University Burnaby, British Columbia On Wed, Jun 11, 2014 at 10:20:08PM +0000, Jeff Squyres (jsquyres) wrote: > +1 > > On Jun 11, 2014, at 6:01 PM, Ralph Castain <r...@open-mpi.org> > wrote: > > > Yeah, I think we've seen that somewhere before too... > > > > > > On Jun 11, 2014, at 2:59 PM, Joshua Ladd <jladd.m...@gmail.com> wrote: > > > >> Agreed. The problem is not with UDCM. I don't think something is wrong > >> with the system. I think his Torque is imposing major constraints on the > >> maximum size that can be locked into memory. > >> > >> Josh > >> > >> > >> On Wed, Jun 11, 2014 at 5:49 PM, Nathan Hjelm <hje...@lanl.gov> wrote: > >> Probably won't help to use RDMACM though as you will just see the > >> resource failure somewhere else. UDCM is not the problem. Something is > >> wrong with the system. Allocating a 512 entry CQ should not fail. > >> > >> -Nathan > >> > >> On Wed, Jun 11, 2014 at 05:03:31PM -0400, Joshua Ladd wrote: > >> > I'm guessing it's a resource limitation issue coming from Torque. > >> > > >> > Hmmmm...I found something interesting on the interwebs that looks > >> > awfully > >> > similar: > >> > > >> > http://www.supercluster.org/pipermail/torqueusers/2008-February/006916.html > >> > > >> > Greg, if the suggestion from the Torque users doesn't resolve your > >> > issue ( > >> > "...adding the following line 'ulimit -l unlimited' to pbs_mom and > >> > restarting pbs_mom." ) doesn't work, try using the RDMACM CPC > >> > (instead of > >> > UDCM, which is a pretty recent addition to the openIB BTL.) by > >> > setting: > >> > > >> > -mca btl_openib_cpc_include rdmacm > >> > > >> > Josh > >> > > >> > On Wed, Jun 11, 2014 at 4:04 PM, Jeff Squyres (jsquyres) > >> > <jsquy...@cisco.com> wrote: > >> > > >> > Mellanox -- > >> > > >> > What would cause a CQ to fail to be created? > >> > > >> > On Jun 11, 2014, at 3:42 PM, "Fischer, Greg A." > >> > <fisch...@westinghouse.com> wrote: > >> > > >> > > Is there any other work around that I might try? Something that > >> > avoids UDCM? > >> > > > >> > > -----Original Message----- > >> > > From: Fischer, Greg A. > >> > > Sent: Tuesday, June 10, 2014 2:59 PM > >> > > To: Nathan Hjelm > >> > > Cc: Open MPI Users; Fischer, Greg A. > >> > > Subject: RE: [OMPI users] openib segfaults with Torque > >> > > > >> > > [binf316:fischega] $ ulimit -m > >> > > unlimited > >> > > > >> > > Greg > >> > > > >> > > -----Original Message----- > >> > > From: Nathan Hjelm [mailto:hje...@lanl.gov] > >> > > Sent: Tuesday, June 10, 2014 2:58 PM > >> > > To: Fischer, Greg A. > >> > > Cc: Open MPI Users > >> > > Subject: Re: [OMPI users] openib segfaults with Torque > >> > > > >> > > Out of curiosity what is the mlock limit on your system? If it is > >> > too > >> > low that can cause ibv_create_cq to fail. To check run ulimit -m. > >> > > > >> > > -Nathan Hjelm > >> > > Application Readiness, HPC-5, LANL > >> > > > >> > > On Tue, Jun 10, 2014 at 02:53:58PM -0400, Fischer, Greg A. wrote: > >> > >> Yes, this fails on all nodes on the system, except for the head > >> > node. > >> > >> > >> > >> The uptime of the system isn't significant. Maybe 1 week, and > >> > it's > >> > received basically no use. 
> >> > >> > >> > >> -----Original Message----- > >> > >> From: Nathan Hjelm [mailto:hje...@lanl.gov] > >> > >> Sent: Tuesday, June 10, 2014 2:49 PM > >> > >> To: Fischer, Greg A. > >> > >> Cc: Open MPI Users > >> > >> Subject: Re: [OMPI users] openib segfaults with Torque > >> > >> > >> > >> > >> > >> Well, thats interesting. The output shows that ibv_create_cq is > >> > failing. Strange since an identical call had just succeeded (udcm > >> > creates two completion queues). Some questions that might indicate > >> > where > >> > the failure might be: > >> > >> > >> > >> Does this fail on any other node in your system? > >> > >> > >> > >> How long has the node been up? > >> > >> > >> > >> -Nathan Hjelm > >> > >> Application Readiness, HPC-5, LANL > >> > >> > >> > >> On Tue, Jun 10, 2014 at 02:06:54PM -0400, Fischer, Greg A. wrote: > >> > >>> Jeff/Nathan, > >> > >>> > >> > >>> I ran the following with my debug build of OpenMPI 1.8.1 - after > >> > opening a terminal on a compute node with "qsub -l nodes 2 -I": > >> > >>> > >> > >>> mpirun -mca btl openib,self -mca btl_base_verbose 100 -np 2 > >> > >>> ring_c &> output.txt > >> > >>> > >> > >>> Output and backtrace are attached. Let me know if I can provide > >> > anything else. > >> > >>> > >> > >>> Thanks for looking into this, > >> > >>> Greg > >> > >>> > >> > >>> -----Original Message----- > >> > >>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of > >> > Jeff > >> > >>> Squyres (jsquyres) > >> > >>> Sent: Tuesday, June 10, 2014 10:31 AM > >> > >>> To: Nathan Hjelm > >> > >>> Cc: Open MPI Users > >> > >>> Subject: Re: [OMPI users] openib segfaults with Torque > >> > >>> > >> > >>> Greg: > >> > >>> > >> > >>> Can you run with "--mca btl_base_verbose 100" on your debug > >> > build so > >> > that we can get some additional output to see why UDCM is failing to > >> > setup properly? > >> > >>> > >> > >>> > >> > >>> > >> > >>> On Jun 10, 2014, at 10:25 AM, Nathan Hjelm <hje...@lanl.gov> > >> > wrote: > >> > >>> > >> > >>>> On Tue, Jun 10, 2014 at 12:10:28AM +0000, Jeff Squyres > >> > (jsquyres) > >> > wrote: > >> > >>>>> I seem to recall that you have an IB-based cluster, right? > >> > >>>>> > >> > >>>>> From a *very quick* glance at the code, it looks like this > >> > might > >> > be a simple incorrect-finalization issue. That is: > >> > >>>>> > >> > >>>>> - you run the job on a single server > >> > >>>>> - openib disqualifies itself because you're running on a > >> > single > >> > >>>>> server > >> > >>>>> - openib then goes to finalize/close itself > >> > >>>>> - but openib didn't fully initialize itself (because it > >> > >>>>> disqualified itself early in the initialization process), and > >> > >>>>> something in the finalization process didn't take that into > >> > >>>>> account > >> > >>>>> > >> > >>>>> Nathan -- is that anywhere close to correct? > >> > >>>> > >> > >>>> Nope. udcm_module_finalize is being called because there was an > >> > >>>> error setting up the udcm state. See > >> > btl_openib_connect_udcm.c:476. > >> > >>>> The opal_list_t destructor is getting an assert failure. > >> > Probably > >> > >>>> because the constructor wasn't called. I can rearrange the > >> > >>>> constructors to be called first but there appears to be a > >> > deeper > >> > >>>> issue with the user's > >> > >>>> system: udcm_module_init should not be failing! It creates a > >> > >>>> couple of CQs, allocates a small number of registered bufferes > >> > and > >> > >>>> starts monitoring the fd for the completion channel. 
All these > >> > >>>> things are also done in the setup of the openib btl itself. > >> > Keep > >> > >>>> in mind that the openib btl will not disqualify itself when > >> > running > >> > single server. > >> > >>>> Openib may be used to communicate on node and is needed for the > >> > dynamics case. > >> > >>>> > >> > >>>> The user might try adding -mca btl_base_verbose 100 to shed > >> > some > >> > >>>> light on what the real issue is. > >> > >>>> > >> > >>>> BTW, I no longer monitor the user mailing list. If something > >> > needs > >> > >>>> my attention forward it to me directly. > >> > >>>> > >> > >>>> -Nathan > >> > >>> > >> > >>> > >> > >>> -- > >> > >>> Jeff Squyres > >> > >>> jsquy...@cisco.com > >> > >>> For corporate legal information go to: > >> > >>> http://www.cisco.com/web/about/doing_business/legal/cri/ > >> > >>> > >> > >>> _______________________________________________ > >> > >>> users mailing list > >> > >>> us...@open-mpi.org > >> > >>> http://www.open-mpi.org/mailman/listinfo.cgi/users > >> > >>> > >> > >>> > >> > >> > >> > >>> Core was generated by `ring_c'. > >> > >>> Program terminated with signal 6, Aborted. > >> > >>> #0 0x00007f8b6ae1cb55 in raise () from /lib64/libc.so.6 > >> > >>> #0 0x00007f8b6ae1cb55 in raise () from /lib64/libc.so.6 > >> > >>> #1 0x00007f8b6ae1e0c5 in abort () from /lib64/libc.so.6 > >> > >>> #2 0x00007f8b6ae15a10 in __assert_fail () from /lib64/libc.so.6 > >> > >>> #3 0x00007f8b664b684b in udcm_module_finalize (btl=0x717060, > >> > >>> cpc=0x7190c0) at > >> > >>> > >> > ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_ > >> > >>> co > >> > >>> nnect_udcm.c:734 > >> > >>> #4 0x00007f8b664b5474 in udcm_component_query (btl=0x717060, > >> > >>> cpc=0x718a48) at > >> > >>> > >> > ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_ > >> > >>> co > >> > >>> nnect_udcm.c:476 > >> > >>> #5 0x00007f8b664ae316 in > >> > >>> ompi_btl_openib_connect_base_select_for_local_port > >> > (btl=0x717060) at > >> > >>> > >> > ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_ > >> > >>> co > >> > >>> nnect_base.c:273 > >> > >>> #6 0x00007f8b66497817 in btl_openib_component_init > >> > (num_btl_modules=0x7fffe34cebe0, enable_progress_threads=false, > >> > enable_mpi_threads=false) > >> > >>> at > >> > >>> > >> > > >> > ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component. 
> >> > >>> c:2703 > >> > >>> #7 0x00007f8b6b43fa5e in mca_btl_base_select > >> > >>> (enable_progress_threads=false, enable_mpi_threads=false) at > >> > >>> > >> > ../../../../openmpi-1.8.1/ompi/mca/btl/base/btl_base_select.c:108 > >> > >>> #8 0x00007f8b666d9d42 in mca_bml_r2_component_init > >> > (priority=0x7fffe34cecb4, enable_progress_threads=false, > >> > enable_mpi_threads=false) > >> > >>> at > >> > >>> > >> > ../../../../../openmpi-1.8.1/ompi/mca/bml/r2/bml_r2_component.c:88 > >> > >>> #9 0x00007f8b6b43ed1b in mca_bml_base_init > >> > >>> (enable_progress_threads=false, enable_mpi_threads=false) at > >> > >>> ../../../../openmpi-1.8.1/ompi/mca/bml/base/bml_base_init.c:69 > >> > >>> #10 0x00007f8b655ff739 in mca_pml_ob1_component_init > >> > (priority=0x7fffe34cedf0, enable_progress_threads=false, > >> > enable_mpi_threads=false) > >> > >>> at > >> > >>> > >> > ../../../../../openmpi-1.8.1/ompi/mca/pml/ob1/pml_ob1_component.c:27 > >> > >>> 1 > >> > >>> #11 0x00007f8b6b4659b2 in mca_pml_base_select > >> > >>> (enable_progress_threads=false, enable_mpi_threads=false) at > >> > >>> > >> > ../../../../openmpi-1.8.1/ompi/mca/pml/base/pml_base_select.c:128 > >> > >>> #12 0x00007f8b6b3d233c in ompi_mpi_init (argc=1, > >> > >>> argv=0x7fffe34cf0e8, requested=0, provided=0x7fffe34cef98) at > >> > >>> ../../openmpi-1.8.1/ompi/runtime/ompi_mpi_init.c:604 > >> > >>> #13 0x00007f8b6b407386 in PMPI_Init (argc=0x7fffe34cefec, > >> > >>> argv=0x7fffe34cefe0) at pinit.c:84 > >> > >>> #14 0x000000000040096f in main (argc=1, argv=0x7fffe34cf0e8) at > >> > >>> ring_c.c:19 > >> > >>> > >> > >> > >> > >>> [binf316:24591] mca: base: components_register: registering btl > >> > >>> components [binf316:24591] mca: base: components_register: found > >> > >>> loaded component openib [binf316:24592] mca: base: > >> > >>> components_register: registering btl components [binf316:24592] > >> > mca: > >> > >>> base: components_register: found loaded component openib > >> > >>> [binf316:24591] mca: base: components_register: component openib > >> > >>> register function successful [binf316:24591] mca: base: > >> > >>> components_register: found loaded component self [binf316:24591] > >> > mca: > >> > >>> base: components_register: component self register function > >> > >>> successful [binf316:24591] mca: base: components_open: opening > >> > btl > >> > >>> components [binf316:24591] mca: base: components_open: found > >> > loaded > >> > >>> component openib [binf316:24591] mca: base: components_open: > >> > >>> component openib open function successful [binf316:24591] mca: > >> > base: > >> > components_open: > >> > >>> found loaded component self [binf316:24591] mca: base: > >> > >>> components_open: component self open function successful > >> > >>> [binf316:24592] mca: base: components_register: component openib > >> > >>> register function successful [binf316:24592] mca: base: > >> > >>> components_register: found loaded component self [binf316:24592] > >> > mca: > >> > >>> base: components_register: component self register function > >> > >>> successful [binf316:24592] mca: base: components_open: opening > >> > btl > >> > >>> components [binf316:24592] mca: base: components_open: found > >> > loaded > >> > >>> component openib [binf316:24592] mca: base: components_open: > >> > >>> component openib open function successful [binf316:24592] mca: > >> > base: > >> > components_open: > >> > >>> found loaded component self [binf316:24592] mca: base: > >> > >>> components_open: component self open function successful 
> >> > >>> [binf316:24591] select: initializing btl component openib > >> > >>> [binf316:24592] select: initializing btl component openib > >> > >>> > >> > [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/ope > >> > >>> ni b/btl_openib_ip.c:364:add_rdma_addr] Adding addr 9.9.10.75 > >> > >>> (0x4b0a0909) subnet 0x9090000 as mlx4_0:1 > >> > >>> > >> > [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/ope > >> > >>> ni b/btl_openib_ip.c:364:add_rdma_addr] Adding addr 9.9.10.75 > >> > >>> (0x4b0a0909) subnet 0x9090000 as mlx4_0:1 > >> > >>> > >> > [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/ope > >> > >>> ni b/btl_openib_component.c:686:init_one_port] looking for > >> > mlx4_0:1 > >> > >>> GID index 0 > >> > >>> > >> > [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/ope > >> > >>> ni b/btl_openib_component.c:717:init_one_port] my IB subnet_id > >> > for > >> > >>> HCA > >> > >>> mlx4_0 port 1 is fe80000000000000 > >> > >>> > >> > [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/ope > >> > >>> ni b/btl_openib_component.c:1294:setup_qps] pp: rd_num is 256 > >> > rd_low > >> > >>> is > >> > >>> 192 rd_win 128 rd_rsv 4 > >> > >>> > >> > [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/ope > >> > >>> ni b/btl_openib_component.c:1339:setup_qps] srq: rd_num is 1024 > >> > >>> rd_low is > >> > >>> 1008 sd_max is 64 rd_max is 256 srq_limit is 48 > >> > >>> > >> > [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/ope > >> > >>> ni b/btl_openib_component.c:1339:setup_qps] srq: rd_num is 1024 > >> > >>> rd_low is > >> > >>> 1008 sd_max is 64 rd_max is 256 srq_limit is 48 > >> > >>> > >> > [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/ope > >> > >>> ni b/btl_openib_component.c:1339:setup_qps] srq: rd_num is 1024 > >> > >>> rd_low is > >> > >>> 1008 sd_max is 64 rd_max is 256 srq_limit is 48 > >> > >>> > >> > [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/ope > >> > >>> ni > >> > >>> > >> > b/connect/btl_openib_connect_rdmacm.c:1840:rdmacm_component_query] > >> > >>> rdmacm_component_query > >> > >>> > >> > [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/ope > >> > >>> ni b/btl_openib_ip.c:132:mca_btl_openib_rdma_get_ipv4addr] > >> > Looking > >> > >>> for > >> > >>> mlx4_0:1 in IP address list > >> > >>> > >> > [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/ope > >> > >>> ni b/btl_openib_ip.c:141:mca_btl_openib_rdma_get_ipv4addr] > >> > FOUND: > >> > >>> mlx4_0:1 is 9.9.10.75 (0x4b0a0909) > >> > >>> > >> > [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/ope > >> > >>> ni b/connect/btl_openib_connect_rdmacm.c:1750:ipaddrcheck] Found > >> > >>> device > >> > >>> mlx4_0:1 = IP address 9.9.10.75 (0x4b0a0909):51845 > >> > >>> > >> > [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/ope > >> > >>> ni b/connect/btl_openib_connect_rdmacm.c:1776:ipaddrcheck] > >> > creating > >> > >>> new server to listen on 9.9.10.75 (0x4b0a0909):51845 > >> > [binf316:24591] > >> > >>> openib BTL: rdmacm CPC available for use on mlx4_0:1 > >> > >>> > >> > [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/ope > >> > >>> ni b/connect/btl_openib_connect_udcm.c:542:udcm_module_init] > >> > created > >> > >>> cpc module 0x719220 for btl 0x716ee0 > >> > >>> > >> > [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/ope > >> > >>> ni b/btl_openib_component.c:686:init_one_port] looking for > >> 
> mlx4_0:1 > >> > >>> GID index 0 > >> > >>> > >> > [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/ope > >> > >>> ni b/btl_openib_component.c:717:init_one_port] my IB subnet_id > >> > for > >> > >>> HCA > >> > >>> mlx4_0 port 1 is fe80000000000000 > >> > >>> > >> > [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/ope > >> > >>> ni b/connect/btl_openib_connect_udcm.c:565:udcm_module_init] > >> > error > >> > >>> creating ud send completion queue > >> > >>> ring_c: > >> > > >> > ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734: > >> > udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + > >> > 0xdeafbeedULL) > >> > == ((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' > >> > failed. > >> > >>> [binf316:24591] *** Process received signal *** [binf316:24591] > >> > >>> Signal: Aborted (6) [binf316:24591] Signal code: (-6) > >> > >>> > >> > [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/ope > >> > >>> ni b/btl_openib_component.c:1294:setup_qps] pp: rd_num is 256 > >> > rd_low > >> > >>> is > >> > >>> 192 rd_win 128 rd_rsv 4 > >> > >>> > >> > [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/ope > >> > >>> ni b/btl_openib_component.c:1339:setup_qps] srq: rd_num is 1024 > >> > >>> rd_low is > >> > >>> 1008 sd_max is 64 rd_max is 256 srq_limit is 48 > >> > >>> > >> > [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/ope > >> > >>> ni b/btl_openib_component.c:1339:setup_qps] srq: rd_num is 1024 > >> > >>> rd_low is > >> > >>> 1008 sd_max is 64 rd_max is 256 srq_limit is 48 > >> > >>> > >> > [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/ope > >> > >>> ni b/btl_openib_component.c:1339:setup_qps] srq: rd_num is 1024 > >> > >>> rd_low is > >> > >>> 1008 sd_max is 64 rd_max is 256 srq_limit is 48 > >> > >>> > >> > [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/ope > >> > >>> ni > >> > >>> > >> > b/connect/btl_openib_connect_rdmacm.c:1840:rdmacm_component_query] > >> > >>> rdmacm_component_query > >> > >>> > >> > [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/ope > >> > >>> ni b/btl_openib_ip.c:132:mca_btl_openib_rdma_get_ipv4addr] > >> > Looking > >> > >>> for > >> > >>> mlx4_0:1 in IP address list > >> > >>> > >> > [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/ope > >> > >>> ni b/btl_openib_ip.c:141:mca_btl_openib_rdma_get_ipv4addr] > >> > FOUND: > >> > >>> mlx4_0:1 is 9.9.10.75 (0x4b0a0909) > >> > >>> > >> > [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/ope > >> > >>> ni b/connect/btl_openib_connect_rdmacm.c:1750:ipaddrcheck] Found > >> > >>> device > >> > >>> mlx4_0:1 = IP address 9.9.10.75 (0x4b0a0909):57734 > >> > >>> > >> > [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/ope > >> > >>> ni b/connect/btl_openib_connect_rdmacm.c:1776:ipaddrcheck] > >> > creating > >> > >>> new server to listen on 9.9.10.75 (0x4b0a0909):57734 > >> > [binf316:24592] > >> > >>> openib BTL: rdmacm CPC available for use on mlx4_0:1 > >> > >>> > >> > [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/ope > >> > >>> ni b/connect/btl_openib_connect_udcm.c:542:udcm_module_init] > >> > created > >> > >>> cpc module 0x7190c0 for btl 0x717060 > >> > >>> > >> > [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/ope > >> > >>> ni b/connect/btl_openib_connect_udcm.c:565:udcm_module_init] > >> > error > >> > >>> creating ud send completion queue > >> > >>> ring_c: > 
>> > > >> > ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734: > >> > udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + > >> > 0xdeafbeedULL) > >> > == ((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' > >> > failed. > >> > >>> [binf316:24592] *** Process received signal *** [binf316:24592] > >> > >>> Signal: Aborted (6) [binf316:24592] Signal code: (-6) > >> > >>> [binf316:24591] [ 0] > >> > /lib64/libpthread.so.0(+0xf7c0)[0x7fb35959c7c0] > >> > >>> [binf316:24591] [ 1] > >> > /lib64/libc.so.6(gsignal+0x35)[0x7fb359248b55] > >> > >>> [binf316:24591] [ 2] > >> > /lib64/libc.so.6(abort+0x181)[0x7fb35924a131] > >> > >>> [binf316:24591] [ 3] > >> > >>> /lib64/libc.so.6(__assert_fail+0xf0)[0x7fb359241a10] > >> > >>> [binf316:24591] [ 4] > >> > >>> > >> > /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_ > >> > >>> bt l_openib.so(+0x3784b)[0x7fb3548e284b] > >> > >>> [binf316:24591] [ 5] > >> > >>> > >> > /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_ > >> > >>> bt l_openib.so(+0x36474)[0x7fb3548e1474] > >> > >>> [binf316:24591] [ 6] > >> > >>> > >> > /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_ > >> > >>> bt > >> > >>> > >> > l_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b > >> > >>> )[ 0x7fb3548da316] [binf316:24591] [ 7] > >> > >>> > >> > /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_ > >> > >>> bt l_openib.so(+0x18817)[0x7fb3548c3817] > >> > >>> [binf316:24591] [ 8] [binf316:24592] [ 0] > >> > >>> /lib64/libpthread.so.0(+0xf7c0)[0x7f8b6b1707c0] > >> > >>> [binf316:24592] [ 1] > >> > >>> > >> > /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1( > >> > >>> mc a_btl_base_select+0x1b2)[0x7fb35986ba5e] > >> > >>> [binf316:24591] [ 9] > >> > /lib64/libc.so.6(gsignal+0x35)[0x7f8b6ae1cb55] > >> > >>> [binf316:24592] [ 2] > >> > >>> > >> > /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_ > >> > >>> bm l_r2.so(mca_bml_r2_component_init+0x20)[0x7fb354b05d42] > >> > >>> [binf316:24591] [10] > >> > /lib64/libc.so.6(abort+0x181)[0x7f8b6ae1e131] > >> > >>> [binf316:24592] [ 3] > >> > >>> > >> > /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1( > >> > >>> mc a_bml_base_init+0xd6)[0x7fb35986ad1b] > >> > >>> [binf316:24591] [11] > >> > >>> > >> > /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_ > >> > >>> pm l_ob1.so(+0x7739)[0x7fb353a2b739] [binf316:24591] [12] > >> > >>> /lib64/libc.so.6(__assert_fail+0xf0)[0x7f8b6ae15a10] > >> > >>> [binf316:24592] [ 4] > >> > >>> > >> > /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_ > >> > >>> bt l_openib.so(+0x3784b)[0x7f8b664b684b] > >> > >>> [binf316:24592] [ 5] > >> > >>> > >> > /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_ > >> > >>> bt l_openib.so(+0x36474)[0x7f8b664b5474] > >> > >>> [binf316:24592] [ 6] > >> > >>> > >> > /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1( > >> > >>> mc a_pml_base_select+0x26e)[0x7fb3598919b2] > >> > >>> [binf316:24591] [13] > >> > >>> > >> > /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_ > >> > >>> bt > >> > >>> > >> > l_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b > >> > >>> )[ 0x7f8b664ae316] [binf316:24592] [ 7] > >> > >>> > >> > /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_ > >> > >>> bt l_openib.so(+0x18817)[0x7f8b66497817] > >> > >>> [binf316:24592] [ 8] > >> > >>> 
> >> > /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1( > >> > >>> om > >> > >>> pi_mpi_init+0x5f6)[0x7fb3597fe33c] > >> > >>> [binf316:24591] [14] > >> > >>> > >> > /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1( > >> > >>> mc a_btl_base_select+0x1b2)[0x7f8b6b43fa5e] > >> > >>> [binf316:24592] [ 9] > >> > >>> > >> > /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_ > >> > >>> bm l_r2.so(mca_bml_r2_component_init+0x20)[0x7f8b666d9d42] > >> > >>> [binf316:24592] [10] > >> > >>> > >> > /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1( > >> > >>> MP > >> > >>> I_Init+0x17e)[0x7fb359833386] > >> > >>> [binf316:24591] [15] ring_c[0x40096f] [binf316:24591] [16] > >> > >>> > >> > /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1( > >> > >>> mc a_bml_base_init+0xd6)[0x7f8b6b43ed1b] > >> > >>> [binf316:24592] [11] > >> > >>> > >> > /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_ > >> > >>> pm l_ob1.so(+0x7739)[0x7f8b655ff739] [binf316:24592] [12] > >> > >>> > >> > /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1( > >> > >>> mc a_pml_base_select+0x26e)[0x7f8b6b4659b2] > >> > >>> [binf316:24592] [13] > >> > >>> /lib64/libc.so.6(__libc_start_main+0xe6)[0x7fb359234c36] > >> > >>> [binf316:24591] [17] ring_c[0x400889] [binf316:24591] *** End of > >> > >>> error message *** > >> > >>> > >> > /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1( > >> > >>> om > >> > >>> pi_mpi_init+0x5f6)[0x7f8b6b3d233c] > >> > >>> [binf316:24592] [14] > >> > >>> > >> > /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1( > >> > >>> MP > >> > >>> I_Init+0x17e)[0x7f8b6b407386] > >> > >>> [binf316:24592] [15] ring_c[0x40096f] [binf316:24592] [16] > >> > >>> /lib64/libc.so.6(__libc_start_main+0xe6)[0x7f8b6ae08c36] > >> > >>> [binf316:24592] [17] ring_c[0x400889] [binf316:24592] *** End of > >> > >>> error message *** > >> > >>> > >> > -------------------------------------------------------------------- > >> > >>> -- > >> > >>> ---- mpirun noticed that process rank 0 with PID 24591 on node > >> > >>> xxxx316 exited on signal 6 (Aborted). 
> >> > >>> > >> > -------------------------------------------------------------------- > >> > >>> -- > >> > >>> ---- > >> > >> > >> > >> > >> > > > >> > > _______________________________________________ > >> > > users mailing list > >> > > us...@open-mpi.org > >> > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > >> > > Link to this post: > >> > http://www.open-mpi.org/community/lists/users/2014/06/24632.php > >> > > >> > -- > >> > Jeff Squyres > >> > jsquy...@cisco.com > >> > For corporate legal information go to: > >> > http://www.cisco.com/web/about/doing_business/legal/cri/ > >> > > >> > _______________________________________________ > >> > users mailing list > >> > us...@open-mpi.org > >> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > >> > Link to this post: > >> > http://www.open-mpi.org/community/lists/users/2014/06/24633.php > >> > >> > _______________________________________________ > >> > users mailing list > >> > us...@open-mpi.org > >> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > >> > Link to this post: > >> > http://www.open-mpi.org/community/lists/users/2014/06/24634.php > >> > >> > >> _______________________________________________ > >> users mailing list > >> us...@open-mpi.org > >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > >> Link to this post: > >> http://www.open-mpi.org/community/lists/users/2014/06/24636.php > >> > >> _______________________________________________ > >> users mailing list > >> us...@open-mpi.org > >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > >> Link to this post: > >> http://www.open-mpi.org/community/lists/users/2014/06/24637.php > > > > _______________________________________________ > > users mailing list > > us...@open-mpi.org > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > > Link to this post: > > http://www.open-mpi.org/community/lists/users/2014/06/24638.php > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > _______________________________________________ > users mailing list > us...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users > Link to this post: > http://www.open-mpi.org/community/lists/users/2014/06/24639.php