I'm trying out Open MPI for the first time on
a cluster of dual AMD Opterons with Myrinet
interconnect using GM.  There are two outstanding,
but possibly connected, problems: (a) how to interact
correctly with the LSF job manager and (b) how to
use the gm interconnect.

Open MPI 1.1 compiled without problems and appears to have
correctly built the GM btl:
$ ompi_info -a | egrep "\bgm\b|_gm_"
               MCA mpool: gm (MCA v1.0, API v1.0, Component v1.1)
                 MCA btl: gm (MCA v1.0, API v1.0, Component v1.1)
               MCA mpool: parameter "mpool_gm_priority" (current value: "0")
                 MCA btl: parameter "btl_gm_free_list_num" (current value: "8")
                 MCA btl: parameter "btl_gm_free_list_max" (current value: "-1")
                 MCA btl: parameter "btl_gm_free_list_inc" (current value: "8")
                 MCA btl: parameter "btl_gm_debug" (current value: "0")
                 MCA btl: parameter "btl_gm_mpool" (current value: "gm")
                 MCA btl: parameter "btl_gm_max_ports" (current value: "16")
                 MCA btl: parameter "btl_gm_max_boards" (current value: "4")
                 MCA btl: parameter "btl_gm_max_modules" (current value: "4")
                 MCA btl: parameter "btl_gm_num_high_priority" (current value: 
"8")
                 MCA btl: parameter "btl_gm_num_repost" (current value: "4")
                 MCA btl: parameter "btl_gm_num_mru" (current value: "64")
                 MCA btl: parameter "btl_gm_port_name" (current value: "OMPI")
                 MCA btl: parameter "btl_gm_exclusivity" (current value: "1024")
                 MCA btl: parameter "btl_gm_eager_limit" (current value: 
"32768")
                 MCA btl: parameter "btl_gm_min_send_size" (current value: 
"32768")
                 MCA btl: parameter "btl_gm_max_send_size" (current value: 
"65536")
                 MCA btl: parameter "btl_gm_min_rdma_size" (current value: 
"524288")
                 MCA btl: parameter "btl_gm_max_rdma_size" (current value: 
"131072")
                 MCA btl: parameter "btl_gm_flags" (current value: "2")
                 MCA btl: parameter "btl_gm_bandwidth" (current value: "250")
                 MCA btl: parameter "btl_gm_priority" (current value: "0")

However, I have been unable to set up a parallel run which uses gm.
If I start a run using the Open MPI mpirun command, the program executes
correctly in parallel, but the timings suggest that it is actually
using tcp, and the command executed on the node looks like:

  orted --bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0 --nodename
scarf-cn001.rl.ac.uk --universe
cse0...@scarf-cn001.rl.ac.uk:default-universe-28588 --nsreplica
"0.0.0;tcp://192.168.1.1:52491;tcp://130.246.142.1:52491" --gprreplica
"0.0.0;tcp://192.168.1.1:52491;t

Furthermore, if I attempt to start with the mpirun arguments "--mca btl
gm,self,^tcp", the run aborts at the MPI_INIT call.

Q1:  Is there anything else I have to do to get openmpi to use gm?
Q2:  Is there any way of diagnosing which btl is actually being used,
     and why?  Neither the "-v" option to mpirun, nor "-mca btl
     btl_base_verbose", nor "-mca btl btl_gm_debug=1" makes any
     difference or produces any more output.  (A sketch of what I have
     been trying is below.)
Q3:  Is there a way to make openmpi work with the LSF commands?  So far
     I have constructed a hostfile from the LSF environment variable
     LSB_HOSTS and used the openmpi mpirun command to start the
     parallel executable (see the fragment below).

Sincerely

Keith Refson
