Allen Barnett wrote:
Thanks for the pointer!
Do you know if these sizes are dependent on the hardware?
They can be; the following file sets up the defaults for some known cards:
ompi/mca/btl/openib/mca-btl-openib-device-params.ini
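Entries in that file are plain INI-style sections keyed by the HCA's
vendor and part IDs. Just to show the shape of an entry (the values
below are illustrative, not necessarily the exact shipped defaults for
any particular card):

  [Mellanox Tavor Infinihost]
  vendor_id = 0x2c9,0x5ad,0x66a,0x8f1,0x1708
  vendor_part_id = 23108
  use_eager_rdma = 1
  mtu = 1024

A device that cannot do shared receive queues can also get a
per-peer-only receive queue override there, along the lines of:

  receive_queues = P,65536,256,192,128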
--td
Thanks,
Allen
On Tue, 2010-08-03 at 10:29 -0400, Terry Dontje wrote:
Sorry, I didn't see your prior question; glad you found the
btl_openib_receive_queues parameter. There is no FAQ entry for this,
but I found the following in the openib BTL help file, which spells
out the parameters for per-peer receive queues (i.e., a receive queue
setting with "P" as the first field).
Per-peer receive queues require between 2 and 5 parameters:
1. Buffer size in bytes (mandatory)
2. Number of buffers (mandatory)
3. Low buffer count watermark (optional; defaults to (num_buffers / 2))
4. Credit window size (optional; defaults to (low_watermark / 2))
5. Number of buffers reserved for credit messages (optional; defaults
to ((num_buffers * 2) - 1) / credit_window)
Example: P,128,256,128,16
- 128 byte buffers
- 256 buffers to receive incoming MPI messages
- When the number of available buffers reaches 128, re-post 128 more
buffers to reach a total of 256
- If the number of available credits reaches 16, send an explicit
credit message to the sender
- Since the fifth parameter is not given, it defaults to
((256 * 2) - 1) / 16 = 31; this many buffers are reserved for
explicit credit messages
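If you want to experiment with those values, the setting is passed on
the mpirun line like any other MCA parameter, e.g. (the process count
and executable name here are just placeholders):

  mpirun -np 2 -mca btl openib,sm,self \
      -mca btl_openib_receive_queues P,128,256,128,16 ./your_app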
--td
Allen Barnett wrote:
Hi: In response to my own question, by studying the file
mca-btl-openib-device-params.ini, I discovered that this option in
OMPI-1.4.2:
-mca btl_openib_receive_queues P,65536,256,192,128
was sufficient to prevent OMPI from trying to create shared receive
queues and allowed my application to run to completion using the IB
hardware.
I guess my question now is: What do these numbers mean? Presumably the
size (or counts?) of buffers to allocate? Are there limits or a way to
tune these values?
Thanks,
Allen
On Mon, 2010-08-02 at 12:49 -0400, Allen Barnett wrote:
Hi Terry:
It is indeed the case that the openib BTL has not been initialized. I
ran with your tcp-disabled MCA option and it aborted in MPI_Init.
The OFED stack is what's included in RHEL4. It appears to be made up of
the RPMs:
openib-1.4-1.el4
opensm-3.2.5-1.el4
libibverbs-1.1.2-1.el4
How can I determine if SRQ is supported? Is there an MCA option to
disable it? (Our in-house cluster has more recent Mellanox IB hardware,
runs this same IB stack, and OMPI 1.4.2 works OK there, so I suspect
SRQ is supported by the OpenFabrics stack itself; I'm not certain,
though.)
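(One thing I can try, if I'm reading the verbs utilities correctly, is
to ask the device for its SRQ limits, e.g.:

  ibv_devinfo -v | grep -i srq

My assumption is that a max_srq of 0 would mean the device/driver
cannot create shared receive queues, though I'm not sure that check is
authoritative.)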
Thanks,
Allen
On Mon, 2010-08-02 at 06:47 -0400, Terry Dontje wrote:
My guess, based on the message below saying "(openib) BTL failed to
initialize", is that the code is probably running over TCP. To prove
this conclusively, you can restrict Open MPI to only the openib, sm,
and self BTLs, eliminating the tcp BTL, by adding the following to the
mpirun line: "-mca btl openib,sm,self". I believe that with this
restriction the code will abort rather than run to completion.
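For example (the process count and executable name here are just
placeholders for your actual job):

  mpirun -np 4 -mca btl openib,sm,self ./your_app

With no tcp BTL left to fall back on, a failed openib initialization
should cause MPI_Init to abort.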
What version of the OFED stack are you using? I wonder whether SRQ is
supported on your system.
--td
Allen Barnett wrote:
Hi: A customer is attempting to run our OpenMPI 1.4.2-based application
on a cluster of machines running RHEL4 with the standard OFED stack. The
HCAs are identified as:
03:01.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev a1)
04:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1)
ibv_devinfo says that one port on the HCAs is active but the other is
down:
hca_id: mthca0
fw_ver: 3.0.2
node_guid: 0006:6a00:9800:4c78
sys_image_guid: 0006:6a00:9800:4c78
vendor_id: 0x066a
vendor_part_id: 23108
hw_ver: 0xA1
phys_port_cnt: 2
port: 1
state: active (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 1
port_lid: 26
port_lmc: 0x00
port: 2
state: down (1)
max_mtu: 2048 (4)
active_mtu: 512 (2)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
When the OMPI application is run, it prints the error message:
--------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to initialize while trying to
create an internal queue. This typically indicates a failed
OpenFabrics installation, faulty hardware, or that Open MPI is
attempting to use a feature that is not supported on your hardware
(i.e., is a shared receive queue specified in the
btl_openib_receive_queues MCA parameter with a device that does not
support it?). The failure occured here:
Local host: machine001.lan
OMPI source: /software/openmpi-1.4.2/ompi/mca/btl/openib/btl_openib.c:250
Function: ibv_create_srq()
Error: Invalid argument (errno=22)
Device: mthca0
You may need to consult with your system administrator to get this
problem fixed.
--------------------------------------------------------------------
The full log of a run with "btl_openib_verbose 1" is attached. My
application appears to run to completion, but I can't tell if it's just
running on TCP and not using the IB hardware.
I would appreciate any suggestions on how to proceed to fix this error.
Thanks,
Allen
--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.650.633.7054
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email: terry.don...@oracle.com