Running out of file descriptors sounds likely here. With 20 nodes at 12 
procs/node you have 240 processes; if they fully connect over TCP, each 
process opens a socket to each of its 228 off-node peers (TCP is not used 
between procs on the same node), so each node ends up with roughly 
12 * 228 = 2,736 connections, and every one of those sockets needs a file 
descriptor.
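
To sanity-check that against the 1024-descriptor limit mentioned below, here 
is a minimal sketch a rank could run at startup. The node and proc counts are 
the ones from this thread; the one-socket-per-off-node-peer assumption is 
mine, and other open files count against the same per-process limit.

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    /* Job shape reported in the thread: 20 nodes x 12 procs/node,
     * TCP BTL, fully connected, no TCP between on-node procs.     */
    const int nodes = 20, procs_per_node = 12;
    const int total_procs    = nodes * procs_per_node;             /* 240  */
    const int peers_per_proc = total_procs - procs_per_node;       /* 228  */
    const int sockets_per_node = procs_per_node * peers_per_proc;  /* 2736 */

    struct rlimit rl;
    getrlimit(RLIMIT_NOFILE, &rl);   /* the per-process "ulimit -n" value */

    printf("TCP sockets per process: %d\n", peers_per_proc);
    printf("TCP sockets per node   : %d\n", sockets_per_node);
    printf("fd soft limit          : %lu (hard %lu)\n",
           (unsigned long)rl.rlim_cur, (unsigned long)rl.rlim_max);

    /* Each process also holds stdio, pipes to the local daemon, and its
     * own data files against the same soft limit.                      */
    return 0;
}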


On Apr 4, 2014, at 11:26 AM, Vince Grimes <tom.gri...@ttu.edu> wrote:

> Dear all:
> 
>       The subject heading is a little misleading because this is in response 
> to part of that original contact. I tried the first two suggestions below 
> (disabling eager RDMA and using the TCP BTL), but to no avail. In all cases 
> I am running across 20 12-core nodes through SGE. In the first case, I get 
> these errors:
> 
> ***
> [[30430,1],234][btl_openib_component.c:3492:handle_wc] from 
> compute-1-18.local to: compute-6-10 error polling HP CQ with status WORK 
> REQUEST FLUSHED ERROR status number 5 for wr_id 2c41e80 opcode 128 vendor 
> error 244 qp_idx 0
> --------------------------------------------------------------------------
> WARNING: A process refused to die!
> 
> Host: compute-4-13.local
> PID:  22356
> 
> This process may still be running and/or consuming resources.
> 
> --------------------------------------------------------------------------
> [compute-6-1.local:22658] 2 more processes have sent help message 
> help-odls-default.txt / odls-default:could-not-kill
> [compute-6-1.local:22658] Set MCA parameter "orte_base_help_aggregate" to 0 
> to see all help / error messages
> --------------------------------------------------------------------------
> ***
> 
> The first error is at the same place as before 
> ([btl_openib_component.c:3492:handle_wc]) and the message is only slightly 
> different (LP -> HP).
> 
>       For the second suggestion, using the TCP BTL, I got a whole load of these:
> 
> ***
> [compute-3-1.local][[20917,1],74][btl_tcp_endpoint.c:653:mca_btl_tcp_endpoint_complete_connect]
>  connect() to 10.7.36.244 failed: Connection timed out (110)
> ***
> 
> There are 1,826 "Connection timed out" errors, at an earlier spot in the code 
> than in the case above. I checked iptables and there is no reason the 
> connections should have been blocked. Is it possible I'm out of file 
> descriptors (because sockets count as files)? `ulimit -n` yields 1024.
> 
> T. Vince Grimes, Ph.D.
> CCC System Administrator
> 
> Texas Tech University
> Dept. of Chemistry and Biochemistry (10A)
> Box 41061
> Lubbock, TX 79409-1061
> 
> (806) 834-0813 (voice);     (806) 742-1289 (fax)
> 
> On 03/22/2014 11:00 AM, users-requ...@open-mpi.org wrote:
>> Date: Fri, 21 Mar 2014 20:16:31 +0000
>> From: Joshua Ladd <josh...@mellanox.com>
>> To: Open MPI Users <us...@open-mpi.org>
>> Subject: Re: [OMPI users] Call stack upon MPI routine error
>> 
>> Hi, Vince
>> 
>> Couple of ideas off the top of my head:
>> 
>> 1. Try disabling eager RDMA. Eager RDMA can consume significant resources: 
>> "-mca btl_openib_use_eager_rdma 0"
>> 
>> 2. Try using the TCP BTL - is the error still present?
>> 
>> 3. Try the poor man's debugger - print the pid and hostname of the process, 
>> then put a while(1) at btl_openib_component.c:3492 so that the process will 
>> hang when it hits this error (see the sketch just below). Hop over to the 
>> node and attach to the hung process. You can move up the call stack from 
>> there.
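>> 
>> A minimal sketch of that hack (the helper name and the volatile flag are 
>> my additions; a bare while(1) works just as well):
>> 
>> #include <stdio.h>
>> #include <unistd.h>
>> 
>> /* Call this at the error branch in handle_wc (or any suspect spot);
>>  * the process parks until you attach with "gdb -p <pid>" on that
>>  * node and run "set var go = 1" to let it continue.               */
>> static void wait_for_debugger(void)
>> {
>>     volatile int go = 0;
>>     char host[256];
>>     gethostname(host, sizeof(host));
>>     fprintf(stderr, "PID %d on %s waiting for debugger attach\n",
>>             (int)getpid(), host);
>>     while (!go)
>>         sleep(1);
>> }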
>> 
>> Best,
>> 
>> Josh
>> 
>> -----Original Message-----
>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Vince Grimes
>> Sent: Friday, March 21, 2014 3:52 PM
>> To: us...@open-mpi.org
>> Subject: [OMPI users] Call stack upon MPI routine error
>> 
>> OpenMPI folks:
>> 
>>      I have mentioned before a problem with an in-house code (ScalIT) that 
>> generates the error message
>> 
>> [[31552,1],84][btl_openib_component.c:3492:handle_wc] from compute-4-5.local 
>> to: compute-4-13 error polling LP CQ with status LOCAL QP OPERATION ERROR 
>> status number 2 for wr_id 246f300 opcode 128  vendor error 107 qp_idx 0
>> 
>> at a specific, reproducible point. It was suggested that the error could be 
>> due to memory problems, such as the amount of registered memory. I have 
>> already corrected the amount of registered memory per the URLs that were 
>> given to me. My question today is two-fold:
>> 
>> First, is it possible that ScalIT uses so much memory that there is none 
>> left to register for IB communications? ScalIT is very memory-intensive 
>> and has to run distributed just to get a large matrix in memory (split 
>> between nodes).
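>> 
>> (A quick way to check the registration question, sketched here as an aside 
>> rather than taken from this thread: registered IB memory must be pinned, so 
>> it is worth confirming the locked-memory limit the ranks actually inherit 
>> when launched under SGE, e.g. by running something like this through 
>> mpirun.)
>> 
>> #include <stdio.h>
>> #include <sys/resource.h>
>> 
>> int main(void)
>> {
>>     /* Pinned (registered) memory is bounded by RLIMIT_MEMLOCK, the
>>      * "ulimit -l" value; launching this under the batch system shows
>>      * the limit the compute ranks see, not the login shell's.       */
>>     struct rlimit rl;
>>     getrlimit(RLIMIT_MEMLOCK, &rl);
>>     if (rl.rlim_cur == RLIM_INFINITY)
>>         printf("locked-memory limit: unlimited\n");
>>     else
>>         printf("locked-memory limit: %lu bytes\n",
>>                (unsigned long)rl.rlim_cur);
>>     return 0;
>> }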
>> 
>> Second, is there a way to trap that error so I can see the call stack, 
>> showing the MPI function called and exactly where in the code the error was 
>> generated?
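>> 
>> (One way to attempt that, sketched below rather than taken from this 
>> thread, is to install an MPI error handler that prints a backtrace. 
>> Whether this particular openib failure reaches the handler before the job 
>> is torn down is not guaranteed.)
>> 
>> #include <execinfo.h>
>> #include <stdio.h>
>> #include <mpi.h>
>> 
>> /* Print a C-level backtrace when an MPI call returns an error, then
>>  * abort; compile with -g -rdynamic so the frames carry symbol names. */
>> static void backtrace_errhandler(MPI_Comm *comm, int *code, ...)
>> {
>>     void *frames[64];
>>     int n = backtrace(frames, 64);
>>     fprintf(stderr, "MPI error %d caught, backtrace:\n", *code);
>>     backtrace_symbols_fd(frames, n, 2);   /* 2 = stderr */
>>     MPI_Abort(*comm, *code);
>> }
>> 
>> int main(int argc, char **argv)
>> {
>>     MPI_Errhandler eh;
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_create_errhandler(backtrace_errhandler, &eh);
>>     MPI_Comm_set_errhandler(MPI_COMM_WORLD, eh);
>>     /* ... the application's existing MPI calls ... */
>>     MPI_Finalize();
>>     return 0;
>> }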
>> 
>> --
>> T. Vince Grimes, Ph.D.
>> CCC System Administrator
>> 
>> Texas Tech University
>> Dept. of Chemistry and Biochemistry (10A) Box 41061 Lubbock, TX 79409-1061
>> 
>> (806) 834-0813 (voice);     (806) 742-1289 (fax)
