On Wed, 6 Feb 2008, Christian Bell wrote:

> Hi Daniel --
> 
>   PSM should determine your node setup and enable shared contexts
>   accordingly, but it looks like something isn't working right.  You
>   can apply the patch I've attached to this e-mail and things should
>   work again.

Alas, it doesn't compile (the patch was applied to Open MPI 1.2.5):

mtl_psm.c(109): error: struct "orte_proc_info_t" has no field "num_local_procs"
      if (orte_process_info.num_local_procs > 0) {
                            ^

mtl_psm.c(111): error: struct "orte_proc_info_t" has no field "num_local_procs"
         snprintf(buf, sizeof buf - 1, "%d", orte_process_info.num_local_procs);
                                                               ^

mtl_psm.c(113): error: struct "orte_proc_info_t" has no field "local_rank"
         snprintf(buf, sizeof buf - 1, "%d", orte_process_info.local_rank);
                                                               ^

compilation aborted for mtl_psm.c (code 2)

  
>   However, it would be useful to identify what's going wrong.  Can
>   you compile a hello world program and run it with the machinefile
>   you're trying to use.  Send me the output from:
> 
>   mpirun -machinefile .... env PSM_TRACEMASK=0x101 ./hello_world
> 
>   I understand your failure mode only if somehow the 8-core node is
>   detected to be a 4-core node.  The output should tell us this.

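The output is attached. PSM does seem to try to enable context sharing
(PSM_SHAREDCONTEXTS => YES in every process), but only four of the eight
processes manage to open a port. For the other four, /dev/ipath still
returns "Device or resource busy".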
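
To make the attached trace easier to read, here is a minimal sketch that
tallies, per PID, which processes opened a port and which got the busy
error. The regular expressions are only modelled on the two message
formats visible in this trace ("psmi_port_open: Opened port" and
"ipath_userinit: assign_port command failed"); the function name
summarize() is my own, and none of this is part of PSM or Open MPI.

```python
import re

# Patterns modelled on the trace lines in this message (an assumption:
# other PSM versions may format these messages differently).
OPENED = re.compile(r"^(\S+?)\.(\d+)psmi_port_open: Opened port (\d+)\.(\d+)")
BUSY = re.compile(r"^(\S+?)\.(\d+)ipath_userinit: assign_port command failed")


def summarize(lines):
    """Return (opened, busy).

    opened maps pid -> (port, subport) for processes that opened a port;
    busy is the set of pids whose assign_port command failed.
    """
    opened, busy = {}, set()
    for line in lines:
        m = OPENED.match(line)
        if m:
            opened[int(m.group(2))] = (int(m.group(3)), int(m.group(4)))
            continue
        m = BUSY.match(line)
        if m:
            busy.add(int(m.group(2)))
    return opened, busy


if __name__ == "__main__":
    # Two lines copied from the trace, as a smoke test.
    sample = [
        "node017.23692psmi_port_open: Opened port 1.0 on device /dev/ipath",
        "node017.23696ipath_userinit: assign_port command failed: "
        "Device or resource busy",
    ]
    opened, busy = summarize(sample)
    print("opened:", opened)
    print("busy:", sorted(busy))
```

Run over the full trace (with the wrapped lines left as they are), this
should list PIDs 23692, 23691, 23694 and 23698 as having opened ports 1.0
through 4.0, and PIDs 23696, 23695, 23697 and 23693 as busy. Every
successful open reports subport 0.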

Daniël
node017.23692env  IPATH_DISABLE_MMAP_MALLOC Disable mmap for malloc()           
     => NO
node017.23692env  IPATH_NO_CPUAFFINITY      Prevent PSM from setting affinity   
     => NO
node017.23692env  IPATH_UNIT                Device Unit number (-1 autodetects) 
     => -1
node017.23692env  PSM_DEVICES               Ordered list of PSM-level devices   
     => shm,ipath (default was self,shm,ipath)
node017.23692psmi_parse_devices: PSM Device allocation order: amsh,ips
node017.23692env  PSM_MEMORY                Memory usage mode (normal or large) 
     => normal
node017.23692env  PSM_SHAREDCONTEXTS        Enable shared contexts              
     => YES (default was YES)
node017.23692ipath_setaffinity: PORT_INFO returned 
unit_id=0/1,port=1/4,hwports=4,subport=0/0,nproc=8
node017.23692ipath_setaffinity: Set CPU affinity to 0, port 0:1:0 (1 active 
chips)
node017.23692ipath_userinit: Driver is not QLogic-built
node017.23692ipath_userinit: Runtime flags are 0x46, explicit mallopt mmap 
disable in malloc is off
node017.23692psmi_port_open: Opened port 1.0 on device /dev/ipath 
(LID=14,epid=e0001,flags=46)
node017.23692env  PSM_RCVTHREAD             Recv thread flags (0 disables 
thread)    =>                 0x1
node017:1.0.env  PSM_MQ_SENDREQS_MAX       Max num of isend requests in flight  
    => 1048576
node017:1.0.env  PSM_MQ_RECVREQS_MAX       Max num of irecv requests in flight  
    => 1048576
node017:1.0.env  PSM_MQ_RNDV_IPATH_THRESH  ipath eager-to-rendezvous switchover 
    => 64000
node017:1.0.env  PSM_MQ_RNDV_SHM_THRESH    shm eager-to-rendezvous switchover   
    => 16000
node017:1.0.ips_spio_init: PIO copy uses forced ordering
node017:1.0.env  PSM_TID                   Tid proto flags (0 disables 
protocol)    =>                  0x1
node017:1.0.ips_protoexp_init: Tid control message settings: timeout 
min=200us/max=1000us, interrupt when trying attempt #2
node017:1.0.ips_proto_init: Tid error control: warning every 30 secs, fatal 
error after 250 tid errors
node017:1.0.ips_proto_init: Ethernet Host IP=10.141.0.17 and PID=23692
node017:1.0.psmi_shm_attach: Registered as master to key 
/psm_shm.d999e196-868e-c6e6-0d4a-bc2c78de85f1
node017:1.0.psmi_shm_attach: Mapped shm control object at 0x2aaaab25a000
node017:1.0.psmi_shm_attach: Mapped and initalized shm object control page at 
0x2aaaab25a000,size=4096
node017:1.0.psmi_shm_attach: Grabbed shmidx 0
node017:1.0.amsh_init_segment: Grew shared segment for 1 procs, size=5.93 MB
node017:1.0.am_remap_segment: shm segment remap from 0x2aaaab25a000..4096 to 
0x2aaab26b3000..6217728 (relocated=YES)
node017:1.0.ips_ptl_pollintr: Enabled communication thread on URG packets
node017.23691env  IPATH_DISABLE_MMAP_MALLOC Disable mmap for malloc()           
     => NO
node017.23691env  IPATH_NO_CPUAFFINITY      Prevent PSM from setting affinity   
     => NO
node017.23691env  IPATH_UNIT                Device Unit number (-1 autodetects) 
     => -1
node017.23691env  PSM_DEVICES               Ordered list of PSM-level devices   
     => shm,ipath (default was self,shm,ipath)
node017.23691psmi_parse_devices: PSM Device allocation order: amsh,ips
node017.23691env  PSM_MEMORY                Memory usage mode (normal or large) 
     => normal
node017.23691env  PSM_SHAREDCONTEXTS        Enable shared contexts              
     => YES (default was YES)
node017.23691ipath_setaffinity: PORT_INFO returned 
unit_id=0/1,port=2/4,hwports=4,subport=0/0,nproc=8
node017.23691ipath_setaffinity: Set CPU affinity to 1, port 0:2:0 (1 active 
chips)
node017.23691ipath_userinit: Driver is not QLogic-built
node017.23691ipath_userinit: Runtime flags are 0x46, explicit mallopt mmap 
disable in malloc is off
node017.23691psmi_port_open: Opened port 2.0 on device /dev/ipath 
(LID=14,epid=e0002,flags=46)
node017.23691env  PSM_RCVTHREAD             Recv thread flags (0 disables 
thread)    =>                 0x1
node017:2.0.env  PSM_MQ_SENDREQS_MAX       Max num of isend requests in flight  
    => 1048576
node017:2.0.env  PSM_MQ_RECVREQS_MAX       Max num of irecv requests in flight  
    => 1048576
node017:2.0.env  PSM_MQ_RNDV_IPATH_THRESH  ipath eager-to-rendezvous switchover 
    => 64000
node017:2.0.env  PSM_MQ_RNDV_SHM_THRESH    shm eager-to-rendezvous switchover   
    => 16000
node017:2.0.ips_spio_init: PIO copy uses forced ordering
node017:2.0.env  PSM_TID                   Tid proto flags (0 disables 
protocol)    =>                  0x1
node017:2.0.ips_protoexp_init: Tid control message settings: timeout 
min=200us/max=1000us, interrupt when trying attempt #2
node017:2.0.ips_proto_init: Tid error control: warning every 30 secs, fatal 
error after 250 tid errors
node017:2.0.ips_proto_init: Ethernet Host IP=10.141.0.17 and PID=23691
node017:2.0.psmi_shm_attach: Registered as slave to key 
/psm_shm.d999e196-868e-c6e6-0d4a-bc2c78de85f1
node017:2.0.psmi_shm_attach: Mapped shm control object at 0x2aaaab25a000
node017:2.0.psmi_shm_attach: Slave synchronized object control page at 
0x2aaaab25a000, size=4096
node017:2.0.psmi_shm_attach: Grabbed shmidx 1
node017:2.0.amsh_init_segment: Grew shared segment for 2 procs, size=11.86 MB
node017:2.0.am_remap_segment: shm segment remap from 0x2aaaab25a000..4096 to 
0x2aaab26b3000..12431360 (relocated=YES)
node017.23694env  IPATH_DISABLE_MMAP_MALLOC Disable mmap for malloc()           
     => NO
node017.23694env  IPATH_NO_CPUAFFINITY      Prevent PSM from setting affinity   
     => NO
node017:2.0.ips_ptl_pollintr: Enabled communication thread on URG packets
node017.23694env  IPATH_UNIT                Device Unit number (-1 autodetects) 
     => -1
node017.23694env  PSM_DEVICES               Ordered list of PSM-level devices   
     => shm,ipath (default was self,shm,ipath)
node017.23694psmi_parse_devices: PSM Device allocation order: amsh,ips
node017.23694env  PSM_MEMORY                Memory usage mode (normal or large) 
     => normal
node017.23694env  PSM_SHAREDCONTEXTS        Enable shared contexts              
     => YES (default was YES)
node017.23694ipath_setaffinity: PORT_INFO returned 
unit_id=0/1,port=3/4,hwports=4,subport=0/0,nproc=8
node017.23694ipath_setaffinity: Set CPU affinity to 2, port 0:3:0 (1 active 
chips)
node017.23698env  IPATH_DISABLE_MMAP_MALLOC Disable mmap for malloc()           
     => NO
node017.23698env  IPATH_NO_CPUAFFINITY      Prevent PSM from setting affinity   
     => NO
node017.23698env  IPATH_UNIT                Device Unit number (-1 autodetects) 
     => -1
node017.23698env  PSM_DEVICES               Ordered list of PSM-level devices   
     => shm,ipath (default was self,shm,ipath)
node017.23698psmi_parse_devices: PSM Device allocation order: amsh,ips
node017.23698env  PSM_MEMORY                Memory usage mode (normal or large) 
     => normal
node017.23698env  PSM_SHAREDCONTEXTS        Enable shared contexts              
     => YES (default was YES)
node017.23698ipath_setaffinity: PORT_INFO returned 
unit_id=0/1,port=4/4,hwports=4,subport=0/0,nproc=8
node017.23694ipath_userinit: Driver is not QLogic-built
node017.23694ipath_userinit: Runtime flags are 0x46, explicit mallopt mmap 
disable in malloc is off
node017.23698ipath_setaffinity: Set CPU affinity to 3, port 0:4:0 (1 active 
chips)
node017.23694psmi_port_open: Opened port 3.0 on device /dev/ipath 
(LID=14,epid=e0003,flags=46)
node017.23694env  PSM_RCVTHREAD             Recv thread flags (0 disables 
thread)    =>                 0x1
node017:3.0.env  PSM_MQ_SENDREQS_MAX       Max num of isend requests in flight  
    => 1048576
node017:3.0.env  PSM_MQ_RECVREQS_MAX       Max num of irecv requests in flight  
    => 1048576
node017:3.0.env  PSM_MQ_RNDV_IPATH_THRESH  ipath eager-to-rendezvous switchover 
    => 64000
node017:3.0.env  PSM_MQ_RNDV_SHM_THRESH    shm eager-to-rendezvous switchover   
    => 16000
node017:3.0.ips_spio_init: PIO copy uses forced ordering
node017.23698ipath_userinit: Driver is not QLogic-built
node017.23698ipath_userinit: Runtime flags are 0x46, explicit mallopt mmap 
disable in malloc is off
node017.23698psmi_port_open: Opened port 4.0 on device /dev/ipath 
(LID=14,epid=e0004,flags=46)
node017.23698env  PSM_RCVTHREAD             Recv thread flags (0 disables 
thread)    =>                 0x1
node017.23696env  IPATH_DISABLE_MMAP_MALLOC Disable mmap for malloc()           
     => NO
node017:4.0.env  PSM_MQ_SENDREQS_MAX       Max num of isend requests in flight  
    => 1048576
node017.23696env  IPATH_NO_CPUAFFINITY      Prevent PSM from setting affinity   
     => NO
node017.23696env  IPATH_UNIT                Device Unit number (-1 autodetects) 
     => -1
node017.23696env  PSM_DEVICES               Ordered list of PSM-level devices   
     => shm,ipath (default was self,shm,ipath)
node017.23696psmi_parse_devices: PSM Device allocation order: amsh,ips
node017.23696env  PSM_MEMORY                Memory usage mode (normal or large) 
     => normal
node017:3.0.env  PSM_TID                   Tid proto flags (0 disables 
protocol)    =>                  0x1
node017.23696env  PSM_SHAREDCONTEXTS        Enable shared contexts              
     => YES (default was YES)
node017.23696ipath_userinit: assign_port command failed: Device or resource busy
node017:4.0.env  PSM_MQ_RECVREQS_MAX       Max num of irecv requests in flight  
    => 1048576
node017.23696psmi_port_open: /dev/ipath open failed: 25 (Device or resource 
busy)
[node017:23696] Open MPI failed to open a PSM endpoint: No free InfiniPath 
contexts available on /dev/ipath
[node017:23696] Error in psm_ep_open (error No free ports could be obtained)
node017:4.0.env  PSM_MQ_RNDV_IPATH_THRESH  ipath eager-to-rendezvous switchover 
    => 64000
node017:4.0.env  PSM_MQ_RNDV_SHM_THRESH    shm eager-to-rendezvous switchover   
    => 16000
node017:4.0.ips_spio_init: PIO copy uses forced ordering
node017:3.0.ips_protoexp_init: Tid control message settings: timeout 
min=200us/max=1000us, interrupt when trying attempt #2
node017:3.0.ips_proto_init: Tid error control: warning every 30 secs, fatal 
error after 250 tid errors
node017:4.0.env  PSM_TID                   Tid proto flags (0 disables 
protocol)    =>                  0x1
node017:3.0.ips_proto_init: Ethernet Host IP=10.141.0.17 and PID=23694
node017:4.0.ips_protoexp_init: Tid control message settings: timeout 
min=200us/max=1000us, interrupt when trying attempt #2
node017:4.0.ips_proto_init: Tid error control: warning every 30 secs, fatal 
error after 250 tid errors
node017:3.0.psmi_shm_attach: Registered as slave to key 
/psm_shm.d999e196-868e-c6e6-0d4a-bc2c78de85f1
node017:3.0.psmi_shm_attach: Mapped shm control object at 0x2aaaab25a000
node017:3.0.psmi_shm_attach: Slave synchronized object control page at 
0x2aaaab25a000, size=4096
node017:3.0.psmi_shm_attach: Grabbed shmidx 2
node017:3.0.amsh_init_segment: Grew shared segment for 3 procs, size=17.78 MB
node017:3.0.am_remap_segment: shm segment remap from 0x2aaaab25a000..4096 to 
0x2aaab26b3000..18644992 (relocated=YES)
node017:4.0.ips_proto_init: Ethernet Host IP=10.141.0.17 and PID=23698
node017:4.0.psmi_shm_attach: Registered as slave to key 
/psm_shm.d999e196-868e-c6e6-0d4a-bc2c78de85f1
node017:4.0.psmi_shm_attach: Mapped shm control object at 0x2aaaab25a000
node017:4.0.psmi_shm_attach: Slave synchronized object control page at 
0x2aaaab25a000, size=4096
node017:4.0.ips_ptl_pollintr: Enabled communication thread on URG packets
node017.23695env  IPATH_DISABLE_MMAP_MALLOC Disable mmap for malloc()           
     => NO
node017.23695env  IPATH_NO_CPUAFFINITY      Prevent PSM from setting affinity   
     => NO
node017.23695env  IPATH_UNIT                Device Unit number (-1 autodetects) 
     => -1
node017.23695env  PSM_DEVICES               Ordered list of PSM-level devices   
     => shm,ipath (default was self,shm,ipath)
node017.23695psmi_parse_devices: PSM Device allocation order: amsh,ips
node017.23695env  PSM_MEMORY                Memory usage mode (normal or large) 
     => normal
node017.23695env  PSM_SHAREDCONTEXTS        Enable shared contexts              
     => YES (default was YES)
node017.23695ipath_userinit: assign_port command failed: Device or resource busy
node017.23695psmi_port_open: /dev/ipath open failed: 25 (Device or resource 
busy)
[node017:23695] Open MPI failed to open a PSM endpoint: No free InfiniPath 
contexts available on /dev/ipath
[node017:23695] Error in psm_ep_open (error No free ports could be obtained)
node017.23697env  IPATH_DISABLE_MMAP_MALLOC Disable mmap for malloc()           
     => NO
node017.23697env  IPATH_NO_CPUAFFINITY      Prevent PSM from setting affinity   
     => NO
node017.23697env  IPATH_UNIT                Device Unit number (-1 autodetects) 
     => -1
node017.23697env  PSM_DEVICES               Ordered list of PSM-level devices   
     => shm,ipath (default was self,shm,ipath)
node017.23697psmi_parse_devices: PSM Device allocation order: amsh,ips
node017.23697env  PSM_MEMORY                Memory usage mode (normal or large) 
     => normal
node017.23693env  IPATH_DISABLE_MMAP_MALLOC Disable mmap for malloc()           
     => NO
node017.23697env  PSM_SHAREDCONTEXTS        Enable shared contexts              
     => YES (default was YES)
node017.23693env  IPATH_NO_CPUAFFINITY      Prevent PSM from setting affinity   
     => NO
node017.23693env  IPATH_UNIT                Device Unit number (-1 autodetects) 
     => -1
node017.23697ipath_userinit: assign_port command failed: Device or resource busy
node017.23693env  PSM_DEVICES               Ordered list of PSM-level devices   
     => shm,ipath (default was self,shm,ipath)
node017.23693psmi_parse_devices: PSM Device allocation order: amsh,ips
node017.23693env  PSM_MEMORY                Memory usage mode (normal or large) 
     => normal
node017.23697psmi_port_open: /dev/ipath open failed: 25 (Device or resource 
busy)
[node017:23697] Open MPI failed to open a PSM endpoint: No free InfiniPath 
contexts available on /dev/ipath
[node017:23697] Error in psm_ep_open (error No free ports could be obtained)
node017.23693env  PSM_SHAREDCONTEXTS        Enable shared contexts              
     => YES (default was YES)
node017.23693ipath_userinit: assign_port command failed: Device or resource busy
node017.23693psmi_port_open: /dev/ipath open failed: 25 (Device or resource 
busy)
[node017:23693] Open MPI failed to open a PSM endpoint: No free InfiniPath 
contexts available on /dev/ipath
[node017:23693] Error in psm_ep_open (error No free ports could be obtained)
node017:4.0.psmi_shm_attach: Grabbed shmidx 3
node017:4.0.amsh_init_segment: Grew shared segment for 4 procs, size=23.71 MB
node017:4.0.am_remap_segment: shm segment remap from 0x2aaaab25a000..4096 to 
0x2aaab26b3000..24858624 (relocated=YES)
node017:3.0.ips_ptl_pollintr: Enabled communication thread on URG packets
