I'm trying out openmpi for the first time on a cluster of dual AMD Opterons with a Myrinet interconnect using GM. There are two outstanding, but possibly connected, problems: (1) how to interact correctly with the LSF job manager, and (2) how to use the GM interconnect.
The compile of openmpi 1.1 went without problems and appears to have correctly built the GM btl:

$ ompi_info -a | egrep "\bgm\b|_gm_"
   MCA mpool: gm (MCA v1.0, API v1.0, Component v1.1)
     MCA btl: gm (MCA v1.0, API v1.0, Component v1.1)
   MCA mpool: parameter "mpool_gm_priority" (current value: "0")
     MCA btl: parameter "btl_gm_free_list_num" (current value: "8")
     MCA btl: parameter "btl_gm_free_list_max" (current value: "-1")
     MCA btl: parameter "btl_gm_free_list_inc" (current value: "8")
     MCA btl: parameter "btl_gm_debug" (current value: "0")
     MCA btl: parameter "btl_gm_mpool" (current value: "gm")
     MCA btl: parameter "btl_gm_max_ports" (current value: "16")
     MCA btl: parameter "btl_gm_max_boards" (current value: "4")
     MCA btl: parameter "btl_gm_max_modules" (current value: "4")
     MCA btl: parameter "btl_gm_num_high_priority" (current value: "8")
     MCA btl: parameter "btl_gm_num_repost" (current value: "4")
     MCA btl: parameter "btl_gm_num_mru" (current value: "64")
     MCA btl: parameter "btl_gm_port_name" (current value: "OMPI")
     MCA btl: parameter "btl_gm_exclusivity" (current value: "1024")
     MCA btl: parameter "btl_gm_eager_limit" (current value: "32768")
     MCA btl: parameter "btl_gm_min_send_size" (current value: "32768")
     MCA btl: parameter "btl_gm_max_send_size" (current value: "65536")
     MCA btl: parameter "btl_gm_min_rdma_size" (current value: "524288")
     MCA btl: parameter "btl_gm_max_rdma_size" (current value: "131072")
     MCA btl: parameter "btl_gm_flags" (current value: "2")
     MCA btl: parameter "btl_gm_bandwidth" (current value: "250")
     MCA btl: parameter "btl_gm_priority" (current value: "0")

However, I have been unable to set up a parallel run which uses gm. If I start a run using the openmpi mpirun command, the program executes correctly in parallel, but the timings suggest that it is using tcp, and the command executed on the node looks like:

orted --bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0 --nodename scarf-cn001.rl.ac.uk --universe cse0...@scarf-cn001.rl.ac.uk:default-universe-28588 --nsreplica "0.0.0;tcp://192.168.1.1:52491;tcp://130.246.142.1:52491" --gprreplica "0.0.0;tcp://192.168.1.1:52491;t

Furthermore, if I attempt to start with the mpirun arguments "--mca btl gm,self,^tcp", the run aborts at the MPI_INIT call.

Q1: Is there anything else I have to do to get openmpi to use gm?

Q2: Is there any way of diagnosing which btl is actually being used, and why? Neither the "-v" option to mpirun, nor "-mca btl btl_base_verbose", nor "-mca btl btl_gm_debug=1" makes any difference or produces any more output.

Q3: Is there a way to make openmpi work with the LSF commands? So far I have constructed a hostfile from the LSF environment variable LSB_HOSTS and used the openmpi mpirun command to start the parallel executable (a sketch of this is in the P.S. below).

Sincerely

Keith Refson
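P.S. In case it is useful, here is roughly how I currently build the hostfile and launch the job from inside the LSF batch script. The executable name, process count and file names are placeholders, and the exact commands are reproduced from memory, so please read this as a sketch of what I am doing rather than the literal script:

  # LSB_HOSTS holds one hostname per allocated slot, separated by
  # spaces, so turn it into a one-host-per-line hostfile.
  HOSTFILE=hostfile.$LSB_JOBID
  echo $LSB_HOSTS | tr ' ' '\n' > $HOSTFILE

  # Launch with openmpi's own mpirun, one process per allocated slot.
  NPROCS=`echo $LSB_HOSTS | wc -w`
  mpirun -np $NPROCS --hostfile $HOSTFILE ./my_code

The run that aborts at MPI_INIT is the same launch with the btl selection added:

  mpirun -np $NPROCS --hostfile $HOSTFILE --mca btl gm,self,^tcp ./my_code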