On Wed, 6 Feb 2008, Christian Bell wrote:
> Hi Daniel --
>
> PSM should determine your node setup and enable shared contexts
> accordingly, but it looks like something isn't working right. You
> can apply the patch I've attached to this e-mail and things should
> work again.
Alas, it doesn't compile (the patch was applied to Open MPI 1.2.5):
mtl_psm.c(109): error: struct "orte_proc_info_t" has no field "num_local_procs"
if (orte_process_info.num_local_procs > 0) {
^
mtl_psm.c(111): error: struct "orte_proc_info_t" has no field "num_local_procs"
snprintf(buf, sizeof buf - 1, "%d", orte_process_info.num_local_procs);
^
mtl_psm.c(113): error: struct "orte_proc_info_t" has no field "local_rank"
snprintf(buf, sizeof buf - 1, "%d", orte_process_info.local_rank);
^
compilation aborted for mtl_psm.c (code 2)
> However, it would be useful to identify what's going wrong. Can
> you compile a hello world program and run it with the machinefile
> you're trying to use. Send me the output from:
>
> mpirun -machinefile .... env PSM_TRACEMASK=0x101 ./hello_world
>
> I understand your failure mode only if somehow the 8-core node is
> detected to be a 4-core node. The output should tell us this.
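Attached. PSM does seem to try to enable context sharing
(PSM_SHAREDCONTEXTS => YES, and PORT_INFO reports nproc=8), but for some
reason /dev/ipath still returns "Device or resource busy" for four of
the eight processes.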
Daniël
node017.23692env IPATH_DISABLE_MMAP_MALLOC Disable mmap for malloc()
=> NO
node017.23692env IPATH_NO_CPUAFFINITY Prevent PSM from setting affinity
=> NO
node017.23692env IPATH_UNIT Device Unit number (-1 autodetects)
=> -1
node017.23692env PSM_DEVICES Ordered list of PSM-level devices
=> shm,ipath (default was self,shm,ipath)
node017.23692psmi_parse_devices: PSM Device allocation order: amsh,ips
node017.23692env PSM_MEMORY Memory usage mode (normal or large)
=> normal
node017.23692env PSM_SHAREDCONTEXTS Enable shared contexts
=> YES (default was YES)
node017.23692ipath_setaffinity: PORT_INFO returned
unit_id=0/1,port=1/4,hwports=4,subport=0/0,nproc=8
node017.23692ipath_setaffinity: Set CPU affinity to 0, port 0:1:0 (1 active
chips)
node017.23692ipath_userinit: Driver is not QLogic-built
node017.23692ipath_userinit: Runtime flags are 0x46, explicit mallopt mmap
disable in malloc is off
node017.23692psmi_port_open: Opened port 1.0 on device /dev/ipath
(LID=14,epid=e0001,flags=46)
node017.23692env PSM_RCVTHREAD Recv thread flags (0 disables
thread) => 0x1
node017:1.0.env PSM_MQ_SENDREQS_MAX Max num of isend requests in flight
=> 1048576
node017:1.0.env PSM_MQ_RECVREQS_MAX Max num of irecv requests in flight
=> 1048576
node017:1.0.env PSM_MQ_RNDV_IPATH_THRESH ipath eager-to-rendezvous switchover
=> 64000
node017:1.0.env PSM_MQ_RNDV_SHM_THRESH shm eager-to-rendezvous switchover
=> 16000
node017:1.0.ips_spio_init: PIO copy uses forced ordering
node017:1.0.env PSM_TID Tid proto flags (0 disables
protocol) => 0x1
node017:1.0.ips_protoexp_init: Tid control message settings: timeout
min=200us/max=1000us, interrupt when trying attempt #2
node017:1.0.ips_proto_init: Tid error control: warning every 30 secs, fatal
error after 250 tid errors
node017:1.0.ips_proto_init: Ethernet Host IP=10.141.0.17 and PID=23692
node017:1.0.psmi_shm_attach: Registered as master to key
/psm_shm.d999e196-868e-c6e6-0d4a-bc2c78de85f1
node017:1.0.psmi_shm_attach: Mapped shm control object at 0x2aaaab25a000
node017:1.0.psmi_shm_attach: Mapped and initalized shm object control page at
0x2aaaab25a000,size=4096
node017:1.0.psmi_shm_attach: Grabbed shmidx 0
node017:1.0.amsh_init_segment: Grew shared segment for 1 procs, size=5.93 MB
node017:1.0.am_remap_segment: shm segment remap from 0x2aaaab25a000..4096 to
0x2aaab26b3000..6217728 (relocated=YES)
node017:1.0.ips_ptl_pollintr: Enabled communication thread on URG packets
node017.23691env IPATH_DISABLE_MMAP_MALLOC Disable mmap for malloc()
=> NO
node017.23691env IPATH_NO_CPUAFFINITY Prevent PSM from setting affinity
=> NO
node017.23691env IPATH_UNIT Device Unit number (-1 autodetects)
=> -1
node017.23691env PSM_DEVICES Ordered list of PSM-level devices
=> shm,ipath (default was self,shm,ipath)
node017.23691psmi_parse_devices: PSM Device allocation order: amsh,ips
node017.23691env PSM_MEMORY Memory usage mode (normal or large)
=> normal
node017.23691env PSM_SHAREDCONTEXTS Enable shared contexts
=> YES (default was YES)
node017.23691ipath_setaffinity: PORT_INFO returned
unit_id=0/1,port=2/4,hwports=4,subport=0/0,nproc=8
node017.23691ipath_setaffinity: Set CPU affinity to 1, port 0:2:0 (1 active
chips)
node017.23691ipath_userinit: Driver is not QLogic-built
node017.23691ipath_userinit: Runtime flags are 0x46, explicit mallopt mmap
disable in malloc is off
node017.23691psmi_port_open: Opened port 2.0 on device /dev/ipath
(LID=14,epid=e0002,flags=46)
node017.23691env PSM_RCVTHREAD Recv thread flags (0 disables
thread) => 0x1
node017:2.0.env PSM_MQ_SENDREQS_MAX Max num of isend requests in flight
=> 1048576
node017:2.0.env PSM_MQ_RECVREQS_MAX Max num of irecv requests in flight
=> 1048576
node017:2.0.env PSM_MQ_RNDV_IPATH_THRESH ipath eager-to-rendezvous switchover
=> 64000
node017:2.0.env PSM_MQ_RNDV_SHM_THRESH shm eager-to-rendezvous switchover
=> 16000
node017:2.0.ips_spio_init: PIO copy uses forced ordering
node017:2.0.env PSM_TID Tid proto flags (0 disables
protocol) => 0x1
node017:2.0.ips_protoexp_init: Tid control message settings: timeout
min=200us/max=1000us, interrupt when trying attempt #2
node017:2.0.ips_proto_init: Tid error control: warning every 30 secs, fatal
error after 250 tid errors
node017:2.0.ips_proto_init: Ethernet Host IP=10.141.0.17 and PID=23691
node017:2.0.psmi_shm_attach: Registered as slave to key
/psm_shm.d999e196-868e-c6e6-0d4a-bc2c78de85f1
node017:2.0.psmi_shm_attach: Mapped shm control object at 0x2aaaab25a000
node017:2.0.psmi_shm_attach: Slave synchronized object control page at
0x2aaaab25a000, size=4096
node017:2.0.psmi_shm_attach: Grabbed shmidx 1
node017:2.0.amsh_init_segment: Grew shared segment for 2 procs, size=11.86 MB
node017:2.0.am_remap_segment: shm segment remap from 0x2aaaab25a000..4096 to
0x2aaab26b3000..12431360 (relocated=YES)
node017.23694env IPATH_DISABLE_MMAP_MALLOC Disable mmap for malloc()
=> NO
node017.23694env IPATH_NO_CPUAFFINITY Prevent PSM from setting affinity
=> NO
node017:2.0.ips_ptl_pollintr: Enabled communication thread on URG packets
node017.23694env IPATH_UNIT Device Unit number (-1 autodetects)
=> -1
node017.23694env PSM_DEVICES Ordered list of PSM-level devices
=> shm,ipath (default was self,shm,ipath)
node017.23694psmi_parse_devices: PSM Device allocation order: amsh,ips
node017.23694env PSM_MEMORY Memory usage mode (normal or large)
=> normal
node017.23694env PSM_SHAREDCONTEXTS Enable shared contexts
=> YES (default was YES)
node017.23694ipath_setaffinity: PORT_INFO returned
unit_id=0/1,port=3/4,hwports=4,subport=0/0,nproc=8
node017.23694ipath_setaffinity: Set CPU affinity to 2, port 0:3:0 (1 active
chips)
node017.23698env IPATH_DISABLE_MMAP_MALLOC Disable mmap for malloc()
=> NO
node017.23698env IPATH_NO_CPUAFFINITY Prevent PSM from setting affinity
=> NO
node017.23698env IPATH_UNIT Device Unit number (-1 autodetects)
=> -1
node017.23698env PSM_DEVICES Ordered list of PSM-level devices
=> shm,ipath (default was self,shm,ipath)
node017.23698psmi_parse_devices: PSM Device allocation order: amsh,ips
node017.23698env PSM_MEMORY Memory usage mode (normal or large)
=> normal
node017.23698env PSM_SHAREDCONTEXTS Enable shared contexts
=> YES (default was YES)
node017.23698ipath_setaffinity: PORT_INFO returned
unit_id=0/1,port=4/4,hwports=4,subport=0/0,nproc=8
node017.23694ipath_userinit: Driver is not QLogic-built
node017.23694ipath_userinit: Runtime flags are 0x46, explicit mallopt mmap
disable in malloc is off
node017.23698ipath_setaffinity: Set CPU affinity to 3, port 0:4:0 (1 active
chips)
node017.23694psmi_port_open: Opened port 3.0 on device /dev/ipath
(LID=14,epid=e0003,flags=46)
node017.23694env PSM_RCVTHREAD Recv thread flags (0 disables
thread) => 0x1
node017:3.0.env PSM_MQ_SENDREQS_MAX Max num of isend requests in flight
=> 1048576
node017:3.0.env PSM_MQ_RECVREQS_MAX Max num of irecv requests in flight
=> 1048576
node017:3.0.env PSM_MQ_RNDV_IPATH_THRESH ipath eager-to-rendezvous switchover
=> 64000
node017:3.0.env PSM_MQ_RNDV_SHM_THRESH shm eager-to-rendezvous switchover
=> 16000
node017:3.0.ips_spio_init: PIO copy uses forced ordering
node017.23698ipath_userinit: Driver is not QLogic-built
node017.23698ipath_userinit: Runtime flags are 0x46, explicit mallopt mmap
disable in malloc is off
node017.23698psmi_port_open: Opened port 4.0 on device /dev/ipath
(LID=14,epid=e0004,flags=46)
node017.23698env PSM_RCVTHREAD Recv thread flags (0 disables
thread) => 0x1
node017.23696env IPATH_DISABLE_MMAP_MALLOC Disable mmap for malloc()
=> NO
node017:4.0.env PSM_MQ_SENDREQS_MAX Max num of isend requests in flight
=> 1048576
node017.23696env IPATH_NO_CPUAFFINITY Prevent PSM from setting affinity
=> NO
node017.23696env IPATH_UNIT Device Unit number (-1 autodetects)
=> -1
node017.23696env PSM_DEVICES Ordered list of PSM-level devices
=> shm,ipath (default was self,shm,ipath)
node017.23696psmi_parse_devices: PSM Device allocation order: amsh,ips
node017.23696env PSM_MEMORY Memory usage mode (normal or large)
=> normal
node017:3.0.env PSM_TID Tid proto flags (0 disables
protocol) => 0x1
node017.23696env PSM_SHAREDCONTEXTS Enable shared contexts
=> YES (default was YES)
node017.23696ipath_userinit: assign_port command failed: Device or resource busy
node017:4.0.env PSM_MQ_RECVREQS_MAX Max num of irecv requests in flight
=> 1048576
node017.23696psmi_port_open: /dev/ipath open failed: 25 (Device or resource
busy)
[node017:23696] Open MPI failed to open a PSM endpoint: No free InfiniPath
contexts available on /dev/ipath
[node017:23696] Error in psm_ep_open (error No free ports could be obtained)
node017:4.0.env PSM_MQ_RNDV_IPATH_THRESH ipath eager-to-rendezvous switchover
=> 64000
node017:4.0.env PSM_MQ_RNDV_SHM_THRESH shm eager-to-rendezvous switchover
=> 16000
node017:4.0.ips_spio_init: PIO copy uses forced ordering
node017:3.0.ips_protoexp_init: Tid control message settings: timeout
min=200us/max=1000us, interrupt when trying attempt #2
node017:3.0.ips_proto_init: Tid error control: warning every 30 secs, fatal
error after 250 tid errors
node017:4.0.env PSM_TID Tid proto flags (0 disables
protocol) => 0x1
node017:3.0.ips_proto_init: Ethernet Host IP=10.141.0.17 and PID=23694
node017:4.0.ips_protoexp_init: Tid control message settings: timeout
min=200us/max=1000us, interrupt when trying attempt #2
node017:4.0.ips_proto_init: Tid error control: warning every 30 secs, fatal
error after 250 tid errors
node017:3.0.psmi_shm_attach: Registered as slave to key
/psm_shm.d999e196-868e-c6e6-0d4a-bc2c78de85f1
node017:3.0.psmi_shm_attach: Mapped shm control object at 0x2aaaab25a000
node017:3.0.psmi_shm_attach: Slave synchronized object control page at
0x2aaaab25a000, size=4096
node017:3.0.psmi_shm_attach: Grabbed shmidx 2
node017:3.0.amsh_init_segment: Grew shared segment for 3 procs, size=17.78 MB
node017:3.0.am_remap_segment: shm segment remap from 0x2aaaab25a000..4096 to
0x2aaab26b3000..18644992 (relocated=YES)
node017:4.0.ips_proto_init: Ethernet Host IP=10.141.0.17 and PID=23698
node017:4.0.psmi_shm_attach: Registered as slave to key
/psm_shm.d999e196-868e-c6e6-0d4a-bc2c78de85f1
node017:4.0.psmi_shm_attach: Mapped shm control object at 0x2aaaab25a000
node017:4.0.psmi_shm_attach: Slave synchronized object control page at
0x2aaaab25a000, size=4096
node017:4.0.ips_ptl_pollintr: Enabled communication thread on URG packets
node017.23695env IPATH_DISABLE_MMAP_MALLOC Disable mmap for malloc()
=> NO
node017.23695env IPATH_NO_CPUAFFINITY Prevent PSM from setting affinity
=> NO
node017.23695env IPATH_UNIT Device Unit number (-1 autodetects)
=> -1
node017.23695env PSM_DEVICES Ordered list of PSM-level devices
=> shm,ipath (default was self,shm,ipath)
node017.23695psmi_parse_devices: PSM Device allocation order: amsh,ips
node017.23695env PSM_MEMORY Memory usage mode (normal or large)
=> normal
node017.23695env PSM_SHAREDCONTEXTS Enable shared contexts
=> YES (default was YES)
node017.23695ipath_userinit: assign_port command failed: Device or resource busy
node017.23695psmi_port_open: /dev/ipath open failed: 25 (Device or resource
busy)
[node017:23695] Open MPI failed to open a PSM endpoint: No free InfiniPath
contexts available on /dev/ipath
[node017:23695] Error in psm_ep_open (error No free ports could be obtained)
node017.23697env IPATH_DISABLE_MMAP_MALLOC Disable mmap for malloc()
=> NO
node017.23697env IPATH_NO_CPUAFFINITY Prevent PSM from setting affinity
=> NO
node017.23697env IPATH_UNIT Device Unit number (-1 autodetects)
=> -1
node017.23697env PSM_DEVICES Ordered list of PSM-level devices
=> shm,ipath (default was self,shm,ipath)
node017.23697psmi_parse_devices: PSM Device allocation order: amsh,ips
node017.23697env PSM_MEMORY Memory usage mode (normal or large)
=> normal
node017.23693env IPATH_DISABLE_MMAP_MALLOC Disable mmap for malloc()
=> NO
node017.23697env PSM_SHAREDCONTEXTS Enable shared contexts
=> YES (default was YES)
node017.23693env IPATH_NO_CPUAFFINITY Prevent PSM from setting affinity
=> NO
node017.23693env IPATH_UNIT Device Unit number (-1 autodetects)
=> -1
node017.23697ipath_userinit: assign_port command failed: Device or resource busy
node017.23693env PSM_DEVICES Ordered list of PSM-level devices
=> shm,ipath (default was self,shm,ipath)
node017.23693psmi_parse_devices: PSM Device allocation order: amsh,ips
node017.23693env PSM_MEMORY Memory usage mode (normal or large)
=> normal
node017.23697psmi_port_open: /dev/ipath open failed: 25 (Device or resource
busy)
[node017:23697] Open MPI failed to open a PSM endpoint: No free InfiniPath
contexts available on /dev/ipath
[node017:23697] Error in psm_ep_open (error No free ports could be obtained)
node017.23693env PSM_SHAREDCONTEXTS Enable shared contexts
=> YES (default was YES)
node017.23693ipath_userinit: assign_port command failed: Device or resource busy
node017.23693psmi_port_open: /dev/ipath open failed: 25 (Device or resource
busy)
[node017:23693] Open MPI failed to open a PSM endpoint: No free InfiniPath
contexts available on /dev/ipath
[node017:23693] Error in psm_ep_open (error No free ports could be obtained)
node017:4.0.psmi_shm_attach: Grabbed shmidx 3
node017:4.0.amsh_init_segment: Grew shared segment for 4 procs, size=23.71 MB
node017:4.0.am_remap_segment: shm segment remap from 0x2aaaab25a000..4096 to
0x2aaab26b3000..24858624 (relocated=YES)
node017:3.0.ips_ptl_pollintr: Enabled communication thread on URG packets
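
Because the output of the eight processes is interleaved, it can be hard
to see at a glance which PIDs got a port and which did not. Filtering on
the psmi_port_open lines helps. The following is only a minimal sketch
in Python and is not part of PSM or Open MPI: the function name
summarize_trace and both regular expressions are assumptions based on
the line format seen in this trace, and the sample text is copied from
it.

```python
import re

# Both patterns assume the "<host>.<pid>psmi_port_open: ..." prefix seen in
# this trace; other PSM versions may format these lines differently.
OPENED = re.compile(
    r"^(\S+?)\.(\d+)psmi_port_open: Opened port (\d+)\.(\d+)", re.MULTILINE
)
FAILED = re.compile(
    r"^(\S+?)\.(\d+)psmi_port_open: \S+ open failed", re.MULTILINE
)


def summarize_trace(text):
    """Return ({pid: (port, subport)}, [pids whose port open failed])."""
    opened = {
        int(m.group(2)): (int(m.group(3)), int(m.group(4)))
        for m in OPENED.finditer(text)
    }
    failed = sorted(int(m.group(2)) for m in FAILED.finditer(text))
    return opened, failed


# Sample lines copied from the trace, including the mail client's wrapping.
sample = """\
node017.23692psmi_port_open: Opened port 1.0 on device /dev/ipath
(LID=14,epid=e0001,flags=46)
node017.23696psmi_port_open: /dev/ipath open failed: 25 (Device or resource
busy)
"""

opened, failed = summarize_trace(sample)
print("opened:", opened)
print("failed:", failed)
```

Each pattern only needs the first physical line of a message, so the
wrapped continuation lines do no harm. To run it on a full trace, pass
the contents of the saved mpirun output to summarize_trace.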