> So the error output is not showing what you two think should be
> the default value, 20, but then nor is it showing what I think I
> have set it to globally, again, 20.
>
> But anyroad, what I wanted from this is confirmation that the output
> is telling me the value that the job was running with, 10.
>
> Now to find out why it appears as 10, because,
>
> a) that's seemingly not the default
> b) it's not being set to 10 globally by me as the admin
> c) it wasn't being set to anything by the user's submission script
>
> I'll have a dig around and get back to the thread,

So, getting back,

there have been two runs of jobs that invoked mpirun with these
Open MPI parameter-setting flags (basically, these mimic what I have
in the global config file)

 -mca btl_openib_ib_timeout 20 -mca btl_openib_ib_min_rnr_timer 25
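
For reference, this is my reading of what the equivalent entries in
the global openmpi-mca-params.conf file would look like (not a copy
of the actual file):

    btl_openib_ib_timeout = 20
    btl_openib_ib_min_rnr_timer = 25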

when both of the jobs failed, the error output was

-----8<----------8<----------8<----------8<----------8<-----

[[31705,1],77][btl_openib_component.c:2951:handle_wc] from
scifachpc-c06n01 to: scifachpc-c06n03 error polling LP CQ with status
RETRY EXCEEDED ERROR status number 12 for wr_id 294230912 opcode 1  vendor
error 129 qp_idx 1
--------------------------------------------------------------------------
The InfiniBand retry count between two MPI processes has been
exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
(section 12.7.38):

    The total number of times that the sender wishes the receiver to
    retry timeout, packet sequence, etc. errors before posting a
    completion error.

This error typically means that there is something awry within the
InfiniBand fabric itself.  You should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.

Two MCA parameters can be used to control Open MPI's behavior with
respect to the retry count:

* btl_openib_ib_retry_count - The number of times the sender will
  attempt to retry (defaulted to 7, the maximum value).
* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
  to 10).  The actual timeout value used is calculated as:

     4.096 microseconds * (2^btl_openib_ib_timeout)

  See the InfiniBand spec 1.2 (section 12.7.34) for more details.

Below is some information about the host that raised the error and the
peer to which it was connected:

  Local host:   somename
  Local device: mlx4_0
  Peer host:    someothername

You may need to consult with your system administrator to get this
problem fixed.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 77 with PID 14705 on
node somename exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------


-----8<----------8<----------8<----------8<----------8<-----


Note that the error output is still showing that mysterious "10"
for the btl_openib_ib_timeout value.
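
For reference, plugging the two values into the formula quoted in the
error text (these are my own back-of-the-envelope numbers, not output
from the job):

    4.096 microseconds * (2^10)  ~  4.2 milliseconds
    4.096 microseconds * (2^20)  ~  4.3 seconds

so a job genuinely running with 20 should be far more tolerant of
slow ACKs than one running with 10.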

When I run ompi_info from a login shell on the node, I see

-----8<----------8<----------8<----------8<----------8<-----

ompi_info --param btl openib | grep ib_timeout
                 MCA btl: parameter "btl_openib_ib_timeout" (current
value: "20", data source: file
[/usr/lib64/openmpi/1.4-gcc/etc/openmpi-mca-params.conf])
                          InfiniBand transmit timeout, plugged into
formula: 4.096 microseconds *
(2^btl_openib_ib_timeout)(must be >= 0 and <=
31)

-----8<----------8<----------8<----------8<----------8<-----
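
One caveat with the above: ompi_info in a login shell only reports
what that shell's environment gives it. If the batch environment were
exporting an OMPI_MCA_btl_openib_ib_timeout variable, that would
override the value from the params file at run time (environment
beats file in the MCA precedence). That's only a guess at where the
10 might be coming from, so as a sanity check I'll add something like

    env | grep OMPI_MCA
    ompi_info --param btl openib | grep ib_timeout

to a test submission script and see what the job itself reports.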

For info,

the underlying IB kit is a Mellanox ConnectX HCA running on a stock
RHEL 5.6 OS with the following Open MPI package:

openmpi-1.4-4.el5

indeed, everything is pretty much out of the box here.

I have noticed that the same node is appearing in the error output
each time, so I'll try taking that one out of the test PE that the
jobs are executing in and see if I can tie this to hardware.


-- 
Kevin M. Buckley                                  Room:  CO327
School of Engineering and                         Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand
