Running out of file descriptors sounds likely here - with 12 procs on each of your 20 nodes and a fully connected pattern, every proc opens a TCP socket to each of the 228 procs on the other nodes (tcp isn't used between procs on the same node), so each node ends up holding 12*228 = 2736 connections, with each connection requiring a file descriptor.
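For what it's worth, here is a minimal sketch of that arithmetic, plus how a process can inspect and (up to the hard limit) raise its own descriptor limit with getrlimit/setrlimit. The node and proc counts are just the ones from this thread; nothing here is Open MPI code.

/* fd_check.c: rough sketch of the descriptor arithmetic above and a check
 * of the per-process RLIMIT_NOFILE limit.  Counts are illustrative only. */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    const int nodes          = 20;   /* 20 nodes in the SGE job          */
    const int procs_per_node = 12;   /* 12-core nodes, one rank per core */
    const int total_procs    = nodes * procs_per_node;

    /* With the TCP BTL, a rank talks TCP only to off-node ranks. */
    const int peers_per_rank = total_procs - procs_per_node;      /* 228  */
    const int socks_per_node = procs_per_node * peers_per_rank;   /* 2736 */

    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }

    printf("sockets per rank : %d\n", peers_per_rank);
    printf("sockets per node : %d\n", socks_per_node);
    printf("RLIMIT_NOFILE    : soft %llu, hard %llu\n",
           (unsigned long long)rl.rlim_cur,
           (unsigned long long)rl.rlim_max);

    /* Raising the soft limit up to the hard limit needs no privileges;
     * going past the hard limit takes root / limits.conf changes. */
    if (rl.rlim_cur < rl.rlim_max) {
        rl.rlim_cur = rl.rlim_max;
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
            perror("setrlimit");
    }
    return 0;
}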
On Apr 4, 2014, at 11:26 AM, Vince Grimes <tom.gri...@ttu.edu> wrote:

> Dear all:
>
> The subject heading is a little misleading because this is in response
> to part of that original contact. I tried the first two suggestions below
> (disabling eager RDMA and using the tcp btl), but to no avail. In all
> cases I am running over 20 12-core nodes through SGE. In the first case,
> I get the errors:
>
> ***
> [[30430,1],234][btl_openib_component.c:3492:handle_wc] from
> compute-1-18.local to: compute-6-10 error polling HP CQ with status WORK
> REQUEST FLUSHED ERROR status number 5 for wr_id 2c41e80 opcode 128 vendor
> error 244 qp_idx 0
> --------------------------------------------------------------------------
> WARNING: A process refused to die!
>
> Host: compute-4-13.local
> PID: 22356
>
> This process may still be running and/or consuming resources.
>
> --------------------------------------------------------------------------
> [compute-6-1.local:22658] 2 more processes have sent help message
> help-odls-default.txt / odls-default:could-not-kill
> [compute-6-1.local:22658] Set MCA parameter "orte_base_help_aggregate" to 0
> to see all help / error messages
> --------------------------------------------------------------------------
> ***
>
> The first error is at the same place as before
> ([btl_openib_component.c:3492:handle_wc]) and the message is only slightly
> different (LP -> HP).
>
> For the second suggestion, using the tcp btl, I got a whole load of these:
>
> ***
> [compute-3-1.local][[20917,1],74][btl_tcp_endpoint.c:653:mca_btl_tcp_endpoint_complete_connect]
> connect() to 10.7.36.244 failed: Connection timed out (110)
> ***
>
> There are 1826 "Connection timed out" errors, at an earlier spot in the
> code than in the case above. I checked iptables and there is no reason the
> connection would have been refused. Is it possible I'm out of file
> descriptors (because sockets count as files)? `ulimit -n` yields 1024.
>
> T. Vince Grimes, Ph.D.
> CCC System Administrator
>
> Texas Tech University
> Dept. of Chemistry and Biochemistry (10A)
> Box 41061
> Lubbock, TX 79409-1061
>
> (806) 834-0813 (voice); (806) 742-1289 (fax)
>
> On 03/22/2014 11:00 AM, users-requ...@open-mpi.org wrote:
>> Date: Fri, 21 Mar 2014 20:16:31 +0000
>> From: Joshua Ladd <josh...@mellanox.com>
>> To: Open MPI Users <us...@open-mpi.org>
>> Subject: Re: [OMPI users] Call stack upon MPI routine error
>>
>> Hi, Vince
>>
>> Couple of ideas off the top of my head:
>>
>> 1. Try disabling eager RDMA. Eager RDMA can consume significant
>> resources: "-mca btl_openib_use_eager_rdma 0"
>>
>> 2. Try using the TCP BTL - is the error still present?
>>
>> 3. Try the poor man's debugger - print the pid and hostname of the
>> process, then put a while(1) at btl_openib_component.c:3492 so that the
>> process will hang when it hits this error. Hop over to the node and
>> attach to the hung process. You can move up the call stack from there.
>> Best,
>>
>> Josh
>>
>> -----Original Message-----
>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Vince Grimes
>> Sent: Friday, March 21, 2014 3:52 PM
>> To: us...@open-mpi.org
>> Subject: [OMPI users] Call stack upon MPI routine error
>>
>> OpenMPI folks:
>>
>> I have mentioned before a problem with an in-house code (ScalIT) that
>> generates the error message
>>
>> [[31552,1],84][btl_openib_component.c:3492:handle_wc] from
>> compute-4-5.local to: compute-4-13 error polling LP CQ with status LOCAL
>> QP OPERATION ERROR status number 2 for wr_id 246f300 opcode 128 vendor
>> error 107 qp_idx 0
>>
>> at a specific, reproducible point. It was suggested that the error could
>> be due to memory problems, such as the amount of registered memory. I
>> have already corrected the amount of registered memory per the URLs that
>> were given to me. My question today is two-fold:
>>
>> First, is it possible that ScalIT uses so much memory that there is no
>> memory to register for IB communications? ScalIT is very memory-intensive
>> and has to run distributed just to get a large matrix in memory (split
>> between nodes).
>>
>> Second, is there a way to trap that error so I can see the call stack,
>> showing the MPI function called and exactly where in the code the error
>> was generated?
>>
>> --
>> T. Vince Grimes, Ph.D.
>> CCC System Administrator
>>
>> Texas Tech University
>> Dept. of Chemistry and Biochemistry (10A)
>> Box 41061
>> Lubbock, TX 79409-1061
>>
>> (806) 834-0813 (voice); (806) 742-1289 (fax)
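For completeness, the "poor man's debugger" in Josh's item 3 usually amounts to a few lines like the sketch below, called right where the error message is printed (btl_openib_component.c:3492 in this case, or anywhere in ScalIT itself). This is a generic illustration under that assumption, not code from the Open MPI tree, and the function name wait_for_debugger is made up; it also gives one answer to the "how do I see the call stack" question in the original post.

/* poor_mans_debugger.c: sketch of the hang-and-attach trick from item 3
 * above.  Generic illustration; not code from the Open MPI sources. */
#include <stdio.h>
#include <unistd.h>

/* Call this right where the error is reported; the process announces its
 * location and then spins until a debugger clears the flag. */
static void wait_for_debugger(void)
{
    volatile int keep_waiting = 1;   /* volatile so gdb can overwrite it */
    char host[256];

    gethostname(host, sizeof(host));
    printf("PID %d on %s waiting for debugger attach\n",
           (int)getpid(), host);
    fflush(stdout);

    while (keep_waiting)
        sleep(1);                    /* spin until "set var keep_waiting = 0" */
}

int main(void)                       /* standalone demo of the hang */
{
    wait_for_debugger();
    return 0;
}

Build with -g, note the printed pid and host, then on that node run "gdb -p <pid>", select the wait_for_debugger frame, "set var keep_waiting = 0", and "bt" shows the full call stack up through the MPI routine that hit the error.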