lstopo will tell you -- if there is more than one "PU" (hwloc terminology for "processing unit") per core, then hyper threading is enabled. If there's only one PU per core, then hyper threading is disabled.
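For a quick scriptable check, something along these lines should work (a sketch, not a definitive recipe; it assumes the hwloc-calc utility that ships alongside lstopo, with lscpu from util-linux as a cross-check):

    # Count logical PUs vs. physical cores; equal counts mean HT is off.
    hwloc-calc -N pu all      # number of processing units (hardware threads)
    hwloc-calc -N core all    # number of cores

    # Or ask lscpu directly: "2" means HT enabled, "1" means disabled.
    lscpu | grep 'Thread(s) per core'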
> On Jun 29, 2015, at 4:42 PM, Lane, William <william.l...@cshs.org> wrote:
>
> Would the output of dmidecode -t processor and/or lstopo tell me conclusively
> if hyperthreading is enabled or not? Hyperthreading is supposed to be enabled
> for all the IBM x3550 M3 and M4 nodes, but I'm not sure if it actually is and
> I don't have access to the BIOS settings.
>
> -Bill L.
>
> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
> Sent: Saturday, June 27, 2015 7:21 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI 1.8.6, CentOS 6.3, too many slots = crash
>
> Bill - this is such a jumbled collection of machines that I’m having trouble
> figuring out what I should replicate. I can create something artificial here
> so I can try to debug this, but I need to know exactly what I’m up against -
> can you tell me:
>
> * the architecture of each type - how many sockets, how many cores/socket, HT
>   on or off. If two nodes have the same physical setup but one has HT on and
>   the other off, then please consider those two different types
>
> * how many nodes of each type
>
> Looking at your map output, it looks like the map is being done correctly,
> but somehow the binding locale isn’t getting set in some cases. Your latest
> error output would seem out-of-step with your prior reports, so something
> else may be going on there. As I said earlier, this is the most hetero
> environment we’ve seen, and so there may be some code paths you’re hitting
> that haven’t been well exercised before.
>
>> On Jun 26, 2015, at 5:22 PM, Lane, William <william.l...@cshs.org> wrote:
>>
>> Well, I managed to get a successful mpirun @ a slot count of 132 using
>> --mca btl ^sm, however when I increased the slot count to 160, mpirun
>> crashed without any error output:
>>
>> mpirun -np 160 -display-devel-map --prefix /hpc/apps/mpi/openmpi/1.8.6/
>> --hostfile hostfile-noslots --mca btl ^sm --hetero-nodes --bind-to core
>> /hpc/home/lanew/mpi/openmpi/ProcessColors3 >> out.txt 2>&1
>>
>> --------------------------------------------------------------------------
>> WARNING: a request was made to bind a process. While the system
>> supports binding the process itself, at least one node does NOT
>> support binding memory to the process location.
>>
>>    Node: csclprd3-6-1
>>
>> This usually is due to not having the required NUMA support installed
>> on the node. In some Linux distributions, the required support is
>> contained in the libnumactl and libnumactl-devel packages.
>> This is a warning only; your job will continue, though performance may
>> be degraded.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> A request was made to bind to that would result in binding more
>> processes than cpus on a resource:
>>
>>    Bind to:     CORE
>>    Node:        csclprd3-6-1
>>    #processes:  2
>>    #cpus:       1
>>
>> You can override this protection by adding the "overload-allowed"
>> option to your binding directive.
>> --------------------------------------------------------------------------
>>
>> But csclprd3-6-1 (a blade) does have 2 CPUs on 2 separate sockets w/2 cores
>> apiece, as shown in my dmidecode output:
>>
>> csclprd3-6-1 ~]# dmidecode -t processor
>> # dmidecode 2.11
>> SMBIOS 2.4 present.
>>
>> Handle 0x0008, DMI type 4, 32 bytes
>> Processor Information
>>     Socket Designation: Socket 1 CPU 1
>>     Type: Central Processor
>>     Family: Xeon
>>     Manufacturer: GenuineIntel
>>     ID: F6 06 00 00 01 03 00 00
>>     Signature: Type 0, Family 6, Model 15, Stepping 6
>>     Flags:
>>         FPU (Floating-point unit on-chip)
>>         CX8 (CMPXCHG8 instruction supported)
>>         APIC (On-chip APIC hardware supported)
>>     Version: Intel Xeon
>>     Voltage: 2.9 V
>>     External Clock: 333 MHz
>>     Max Speed: 4000 MHz
>>     Current Speed: 3000 MHz
>>     Status: Populated, Enabled
>>     Upgrade: ZIF Socket
>>     L1 Cache Handle: 0x0004
>>     L2 Cache Handle: 0x0005
>>     L3 Cache Handle: Not Provided
>>
>> Handle 0x0009, DMI type 4, 32 bytes
>> Processor Information
>>     Socket Designation: Socket 2 CPU 2
>>     Type: Central Processor
>>     Family: Xeon
>>     Manufacturer: GenuineIntel
>>     ID: F6 06 00 00 01 03 00 00
>>     Signature: Type 0, Family 6, Model 15, Stepping 6
>>     Flags:
>>         FPU (Floating-point unit on-chip)
>>         CX8 (CMPXCHG8 instruction supported)
>>         APIC (On-chip APIC hardware supported)
>>     Version: Intel Xeon
>>     Voltage: 2.9 V
>>     External Clock: 333 MHz
>>     Max Speed: 4000 MHz
>>     Current Speed: 3000 MHz
>>     Status: Populated, Enabled
>>     Upgrade: ZIF Socket
>>     L1 Cache Handle: 0x0006
>>     L2 Cache Handle: 0x0007
>>     L3 Cache Handle: Not Provided
>>
>> csclprd3-6-1 ~]# lstopo
>> Machine (16GB)
>>   Socket L#0 + L2 L#0 (4096KB)
>>     L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
>>     L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#2)
>>   Socket L#1 + L2 L#1 (4096KB)
>>     L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#1)
>>     L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
>>
>> csclprd3-0-1 information (which looks correct, as this particular x3550
>> should have one socket (of two) populated with a 6-core Xeon, or 12 cores
>> w/hyperthreading turned on):
>>
>> csclprd3-0-1 ~]# lstopo
>> Machine (71GB)
>>   Socket L#0 + L3 L#0 (12MB)
>>     L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
>>     L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
>>     L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
>>     L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
>>     L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
>>     L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
>>
>> csclprd3-0-1 ~]# dmidecode -t processor
>> # dmidecode 2.11
>> # SMBIOS entry point at 0x7f6be000
>> SMBIOS 2.5 present.
>>
>> Handle 0x0001, DMI type 4, 40 bytes
>> Processor Information
>>     Socket Designation: Node 1 Socket 1
>>     Type: Central Processor
>>     Family: Xeon MP
>>     Manufacturer: Intel(R) Corporation
>>     ID: C2 06 02 00 FF FB EB BF
>>     Signature: Type 0, Family 6, Model 44, Stepping 2
>>     Flags:
>>         FPU (Floating-point unit on-chip)
>>         VME (Virtual mode extension)
>>         DE (Debugging extension)
>>         PSE (Page size extension)
>>         TSC (Time stamp counter)
>>         MSR (Model specific registers)
>>         PAE (Physical address extension)
>>         MCE (Machine check exception)
>>         CX8 (CMPXCHG8 instruction supported)
>>         APIC (On-chip APIC hardware supported)
>>         SEP (Fast system call)
>>         MTRR (Memory type range registers)
>>         PGE (Page global enable)
>>         MCA (Machine check architecture)
>>         CMOV (Conditional move instruction supported)
>>         PAT (Page attribute table)
>>         PSE-36 (36-bit page size extension)
>>         CLFSH (CLFLUSH instruction supported)
>>         DS (Debug store)
>>         ACPI (ACPI supported)
>>         MMX (MMX technology supported)
>>         FXSR (FXSAVE and FXSTOR instructions supported)
>>         SSE (Streaming SIMD extensions)
>>         SSE2 (Streaming SIMD extensions 2)
>>         SS (Self-snoop)
>>         HTT (Multi-threading)
>>         TM (Thermal monitor supported)
>>         PBE (Pending break enabled)
>>     Version: Intel(R) Xeon(R) CPU E5645 @ 2.40GHz
>>     Voltage: 1.2 V
>>     External Clock: 5866 MHz
>>     Max Speed: 4400 MHz
>>     Current Speed: 2400 MHz
>>     Status: Populated, Enabled
>>     Upgrade: ZIF Socket
>>     L1 Cache Handle: 0x0002
>>     L2 Cache Handle: 0x0003
>>     L3 Cache Handle: 0x0004
>>     Serial Number: Not Specified
>>     Asset Tag: Not Specified
>>     Part Number: Not Specified
>>     Core Count: 6
>>     Core Enabled: 6
>>     Thread Count: 6
>>     Characteristics:
>>         64-bit capable
>>
>> Handle 0x005A, DMI type 4, 40 bytes
>> Processor Information
>>     Socket Designation: Node 1 Socket 2
>>     Type: Central Processor
>>     Family: Xeon MP
>>     Manufacturer: Not Specified
>>     ID: 00 00 00 00 00 00 00 00
>>     Signature: Type 0, Family 0, Model 0, Stepping 0
>>     Flags: None
>>     Version: Not Specified
>>     Voltage: 1.2 V
>>     External Clock: 5866 MHz
>>     Max Speed: 4400 MHz
>>     Current Speed: Unknown
>>     Status: Unpopulated
>>     Upgrade: ZIF Socket
>>     L1 Cache Handle: Not Provided
>>     L2 Cache Handle: Not Provided
>>     L3 Cache Handle: Not Provided
>>     Serial Number: Not Specified
>>     Asset Tag: Not Specified
>>     Part Number: Not Specified
>>     Characteristics: None
>>
>>
>> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
>> Sent: Wednesday, June 24, 2015 6:06 AM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] OpenMPI 1.8.6, CentOS 6.3, too many slots = crash
>>
>> I think trying with --mca btl ^sm makes a lot of sense and may solve the
>> problem. I also noted that we are having trouble with the topology of
>> several of the nodes - seeing only one socket, non-HT where you say we
>> should see two sockets and HT-enabled. In those cases, the locality is
>> "unknown" - given that those procs are on remote nodes from the one being
>> impacted, I don't think it should cause a problem. However, it isn't
>> correct, and that raises flags.
>>
>> My best guess of the root cause of that error is either we are getting bad
>> topology info on those nodes, or we have a bug that is mishandling this
>> scenario. It would probably be good to get this error fixed to ensure it
>> isn't the source of the eventual crash, even though I'm not sure they are
>> related.
>>
>> Bill: Can we examine one of the problem nodes?
>> Let's pick csclprd3-0-1 (or take another one from your list - just look for
>> one where "locality" is reported as "unknown" for the procs in the output
>> map). Can you run lstopo on that node and send us the output? In the above
>> map, it is reporting a single socket with 6 cores, non-HT. Is that what
>> lstopo shows when run on the node? Is it what you expected?
>>
>>
>> On Wed, Jun 24, 2015 at 4:07 AM, Gilles Gouaillardet
>> <gilles.gouaillar...@gmail.com> wrote:
>> Bill,
>>
>> were you able to get a core file and analyze the stack with gdb?
>>
>> I suspect the error occurs in mca_btl_sm_add_procs but this is just my best
>> guess. If this is correct, can you check the value of
>> mca_btl_sm_component.num_smp_procs?
>>
>> as a workaround, can you try
>> mpirun --mca btl ^sm ...
>>
>> I do not see how I can tackle the root cause without being able to
>> reproduce the issue :-(
>>
>> can you try to reproduce the issue with the smallest hostfile, and then run
>> lstopo on all the nodes?
>> btw, you are not mixing 32-bit and 64-bit OS, are you?
>>
>> Cheers,
>>
>> Gilles
>>
>> On Wednesday, June 24, 2015, Lane, William <william.l...@cshs.org> wrote:
>> Gilles,
>>
>> All the blades only have two-core Xeons (without hyperthreading) populating
>> both their sockets. All the x3550 nodes have hyperthreading-capable Xeons
>> and Sandy Bridge server CPUs. It's possible hyperthreading has been
>> disabled on some of these nodes though. The 3-0-n nodes are all IBM x3550
>> nodes while the 3-6-n nodes are all blade nodes.
>>
>> I have run this exact same test code successfully in the past on another
>> cluster (~200 Sunfire X2100 nodes, 2x dual-core Opterons) w/no issues on
>> upwards of 390 slots. I even tested it recently on OpenMPI 1.8.5 on another
>> smaller R&D cluster consisting of 10 Sunfire X2100 nodes (w/2 dual-core
>> Opterons apiece). On this particular cluster I've had success running this
>> code on < 132 slots.
>>
>> Anyway, here are the results of the following mpirun:
>>
>> mpirun -np 132 -display-devel-map --prefix /hpc/apps/mpi/openmpi/1.8.6/
>> --hostfile hostfile-noslots --mca btl_tcp_if_include eth0 --hetero-nodes
>> --bind-to core /hpc/home/lanew/mpi/openmpi/ProcessColors3 >> out.txt 2>&1
>>
>> --------------------------------------------------------------------------
>> WARNING: a request was made to bind a process. While the system
>> supports binding the process itself, at least one node does NOT
>> support binding memory to the process location.
>>
>>    Node: csclprd3-6-1
>>
>> This usually is due to not having the required NUMA support installed
>> on the node. In some Linux distributions, the required support is
>> contained in the libnumactl and libnumactl-devel packages.
>> This is a warning only; your job will continue, though performance may
>> be degraded.
>> -------------------------------------------------------------------------- >> Data for JOB [51718,1] offset 0 >> >> Mapper requested: NULL Last mapper: round_robin Mapping policy: BYSOCKET >> Ranking policy: SLOT >> Binding policy: CORE Cpu set: NULL PPR: NULL Cpus-per-rank: 1 >> Num new daemons: 0 New daemon starting vpid INVALID >> Num nodes: 15 >> >> Data for node: csclprd3-6-1 Launch id: -1 State: 0 >> Daemon: [[51718,0],1] Daemon launched: True >> Num slots: 4 Slots in use: 4 Oversubscribed: FALSE >> Num slots allocated: 4 Max slots: 0 >> Username on node: NULL >> Num procs: 4 Next node_rank: 4 >> Data for proc: [[51718,1],0] >> Pid: 0 Local rank: 0 Node rank: 0 App rank: 0 >> State: INITIALIZED App_context: 0 >> Locale: [B/B][./.] >> Binding: [B/.][./.] >> Data for proc: [[51718,1],1] >> Pid: 0 Local rank: 1 Node rank: 1 App rank: 1 >> State: INITIALIZED App_context: 0 >> Locale: [./.][B/B] >> Binding: [./.][B/.] >> Data for proc: [[51718,1],2] >> Pid: 0 Local rank: 2 Node rank: 2 App rank: 2 >> State: INITIALIZED App_context: 0 >> Locale: [B/B][./.] >> Binding: [./B][./.] >> Data for proc: [[51718,1],3] >> Pid: 0 Local rank: 3 Node rank: 3 App rank: 3 >> State: INITIALIZED App_context: 0 >> Locale: [./.][B/B] >> Binding: [./.][./B] >> >> Data for node: csclprd3-6-5 Launch id: -1 State: 0 >> Daemon: [[51718,0],2] Daemon launched: True >> Num slots: 4 Slots in use: 4 Oversubscribed: FALSE >> Num slots allocated: 4 Max slots: 0 >> Username on node: NULL >> Num procs: 4 Next node_rank: 4 >> Data for proc: [[51718,1],4] >> Pid: 0 Local rank: 0 Node rank: 0 App rank: 4 >> State: INITIALIZED App_context: 0 >> Locale: [B/B][./.] >> Binding: [B/.][./.] >> Data for proc: [[51718,1],5] >> Pid: 0 Local rank: 1 Node rank: 1 App rank: 5 >> State: INITIALIZED App_context: 0 >> Locale: [./.][B/B] >> Binding: [./.][B/.] >> Data for proc: [[51718,1],6] >> Pid: 0 Local rank: 2 Node rank: 2 App rank: 6 >> State: INITIALIZED App_context: 0 >> Locale: [B/B][./.] >> Binding: [./B][./.] >> Data for proc: [[51718,1],7] >> Pid: 0 Local rank: 3 Node rank: 3 App rank: 7 >> State: INITIALIZED App_context: 0 >> Locale: [./.][B/B] >> Binding: [./.][./B] >> >> Data for node: csclprd3-0-0 Launch id: -1 State: 0 >> Daemon: [[51718,0],3] Daemon launched: True >> Num slots: 12 Slots in use: 12 Oversubscribed: FALSE >> Num slots allocated: 12 Max slots: 0 >> Username on node: NULL >> Num procs: 12 Next node_rank: 12 >> Data for proc: [[51718,1],8] >> Pid: 0 Local rank: 0 Node rank: 0 App rank: 8 >> State: INITIALIZED App_context: 0 >> Locale: [B/B/B/B/B/B][./././././.] >> Binding: [B/././././.][./././././.] >> Data for proc: [[51718,1],9] >> Pid: 0 Local rank: 1 Node rank: 1 App rank: 9 >> State: INITIALIZED App_context: 0 >> Locale: [./././././.][B/B/B/B/B/B] >> Binding: [./././././.][B/././././.] >> Data for proc: [[51718,1],10] >> Pid: 0 Local rank: 2 Node rank: 2 App rank: 10 >> State: INITIALIZED App_context: 0 >> Locale: [B/B/B/B/B/B][./././././.] >> Binding: [./B/./././.][./././././.] >> Data for proc: [[51718,1],11] >> Pid: 0 Local rank: 3 Node rank: 3 App rank: 11 >> State: INITIALIZED App_context: 0 >> Locale: [./././././.][B/B/B/B/B/B] >> Binding: [./././././.][./B/./././.] >> Data for proc: [[51718,1],12] >> Pid: 0 Local rank: 4 Node rank: 4 App rank: 12 >> State: INITIALIZED App_context: 0 >> Locale: [B/B/B/B/B/B][./././././.] >> Binding: [././B/././.][./././././.] 
>> Data for proc: [[51718,1],13] >> Pid: 0 Local rank: 5 Node rank: 5 App rank: 13 >> State: INITIALIZED App_context: 0 >> Locale: [./././././.][B/B/B/B/B/B] >> Binding: [./././././.][././B/././.] >> Data for proc: [[51718,1],14] >> Pid: 0 Local rank: 6 Node rank: 6 App rank: 14 >> State: INITIALIZED App_context: 0 >> Locale: [B/B/B/B/B/B][./././././.] >> Binding: [./././B/./.][./././././.] >> Data for proc: [[51718,1],15] >> Pid: 0 Local rank: 7 Node rank: 7 App rank: 15 >> State: INITIALIZED App_context: 0 >> Locale: [./././././.][B/B/B/B/B/B] >> Binding: [./././././.][./././B/./.] >> Data for proc: [[51718,1],16] >> Pid: 0 Local rank: 8 Node rank: 8 App rank: 16 >> State: INITIALIZED App_context: 0 >> Locale: [B/B/B/B/B/B][./././././.] >> Binding: [././././B/.][./././././.] >> Data for proc: [[51718,1],17] >> Pid: 0 Local rank: 9 Node rank: 9 App rank: 17 >> State: INITIALIZED App_context: 0 >> Locale: [./././././.][B/B/B/B/B/B] >> Binding: [./././././.][././././B/.] >> Data for proc: [[51718,1],18] >> Pid: 0 Local rank: 10 Node rank: 10 App rank: 18 >> State: INITIALIZED App_context: 0 >> Locale: [B/B/B/B/B/B][./././././.] >> Binding: [./././././B][./././././.] >> Data for proc: [[51718,1],19] >> Pid: 0 Local rank: 11 Node rank: 11 App rank: 19 >> State: INITIALIZED App_context: 0 >> Locale: [./././././.][B/B/B/B/B/B] >> Binding: [./././././.][./././././B] >> >> Data for node: csclprd3-0-1 Launch id: -1 State: 0 >> Daemon: [[51718,0],4] Daemon launched: True >> Num slots: 6 Slots in use: 6 Oversubscribed: FALSE >> Num slots allocated: 6 Max slots: 0 >> Username on node: NULL >> Num procs: 6 Next node_rank: 6 >> Data for proc: [[51718,1],20] >> Pid: 0 Local rank: 0 Node rank: 0 App rank: 20 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [B/././././.] >> Data for proc: [[51718,1],21] >> Pid: 0 Local rank: 1 Node rank: 1 App rank: 21 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [./B/./././.] >> Data for proc: [[51718,1],22] >> Pid: 0 Local rank: 2 Node rank: 2 App rank: 22 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [././B/././.] >> Data for proc: [[51718,1],23] >> Pid: 0 Local rank: 3 Node rank: 3 App rank: 23 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [./././B/./.] >> Data for proc: [[51718,1],24] >> Pid: 0 Local rank: 4 Node rank: 4 App rank: 24 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [././././B/.] >> Data for proc: [[51718,1],25] >> Pid: 0 Local rank: 5 Node rank: 5 App rank: 25 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [./././././B] >> >> Data for node: csclprd3-0-2 Launch id: -1 State: 0 >> Daemon: [[51718,0],5] Daemon launched: True >> Num slots: 6 Slots in use: 6 Oversubscribed: FALSE >> Num slots allocated: 6 Max slots: 0 >> Username on node: NULL >> Num procs: 6 Next node_rank: 6 >> Data for proc: [[51718,1],26] >> Pid: 0 Local rank: 0 Node rank: 0 App rank: 26 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [B/././././.] >> Data for proc: [[51718,1],27] >> Pid: 0 Local rank: 1 Node rank: 1 App rank: 27 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [./B/./././.] >> Data for proc: [[51718,1],28] >> Pid: 0 Local rank: 2 Node rank: 2 App rank: 28 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [././B/././.] >> Data for proc: [[51718,1],29] >> Pid: 0 Local rank: 3 Node rank: 3 App rank: 29 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [./././B/./.] 
>> Data for proc: [[51718,1],30] >> Pid: 0 Local rank: 4 Node rank: 4 App rank: 30 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [././././B/.] >> Data for proc: [[51718,1],31] >> Pid: 0 Local rank: 5 Node rank: 5 App rank: 31 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [./././././B] >> >> Data for node: csclprd3-0-3 Launch id: -1 State: 0 >> Daemon: [[51718,0],6] Daemon launched: True >> Num slots: 6 Slots in use: 6 Oversubscribed: FALSE >> Num slots allocated: 6 Max slots: 0 >> Username on node: NULL >> Num procs: 6 Next node_rank: 6 >> Data for proc: [[51718,1],32] >> Pid: 0 Local rank: 0 Node rank: 0 App rank: 32 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [B/././././.] >> Data for proc: [[51718,1],33] >> Pid: 0 Local rank: 1 Node rank: 1 App rank: 33 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [./B/./././.] >> Data for proc: [[51718,1],34] >> Pid: 0 Local rank: 2 Node rank: 2 App rank: 34 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [././B/././.] >> Data for proc: [[51718,1],35] >> Pid: 0 Local rank: 3 Node rank: 3 App rank: 35 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [./././B/./.] >> Data for proc: [[51718,1],36] >> Pid: 0 Local rank: 4 Node rank: 4 App rank: 36 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [././././B/.] >> Data for proc: [[51718,1],37] >> Pid: 0 Local rank: 5 Node rank: 5 App rank: 37 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [./././././B] >> >> Data for node: csclprd3-0-4 Launch id: -1 State: 0 >> Daemon: [[51718,0],7] Daemon launched: True >> Num slots: 6 Slots in use: 6 Oversubscribed: FALSE >> Num slots allocated: 6 Max slots: 0 >> Username on node: NULL >> Num procs: 6 Next node_rank: 6 >> Data for proc: [[51718,1],38] >> Pid: 0 Local rank: 0 Node rank: 0 App rank: 38 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [B/././././.] >> Data for proc: [[51718,1],39] >> Pid: 0 Local rank: 1 Node rank: 1 App rank: 39 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [./B/./././.] >> Data for proc: [[51718,1],40] >> Pid: 0 Local rank: 2 Node rank: 2 App rank: 40 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [././B/././.] >> Data for proc: [[51718,1],41] >> Pid: 0 Local rank: 3 Node rank: 3 App rank: 41 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [./././B/./.] >> Data for proc: [[51718,1],42] >> Pid: 0 Local rank: 4 Node rank: 4 App rank: 42 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [././././B/.] >> Data for proc: [[51718,1],43] >> Pid: 0 Local rank: 5 Node rank: 5 App rank: 43 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [./././././B] >> >> Data for node: csclprd3-0-5 Launch id: -1 State: 0 >> Daemon: [[51718,0],8] Daemon launched: True >> Num slots: 6 Slots in use: 6 Oversubscribed: FALSE >> Num slots allocated: 6 Max slots: 0 >> Username on node: NULL >> Num procs: 6 Next node_rank: 6 >> Data for proc: [[51718,1],44] >> Pid: 0 Local rank: 0 Node rank: 0 App rank: 44 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [B/././././.] >> Data for proc: [[51718,1],45] >> Pid: 0 Local rank: 1 Node rank: 1 App rank: 45 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [./B/./././.] 
>> Data for proc: [[51718,1],46] >> Pid: 0 Local rank: 2 Node rank: 2 App rank: 46 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [././B/././.] >> Data for proc: [[51718,1],47] >> Pid: 0 Local rank: 3 Node rank: 3 App rank: 47 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [./././B/./.] >> Data for proc: [[51718,1],48] >> Pid: 0 Local rank: 4 Node rank: 4 App rank: 48 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [././././B/.] >> Data for proc: [[51718,1],49] >> Pid: 0 Local rank: 5 Node rank: 5 App rank: 49 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [./././././B] >> >> Data for node: csclprd3-0-6 Launch id: -1 State: 0 >> Daemon: [[51718,0],9] Daemon launched: True >> Num slots: 6 Slots in use: 6 Oversubscribed: FALSE >> Num slots allocated: 6 Max slots: 0 >> Username on node: NULL >> Num procs: 6 Next node_rank: 6 >> Data for proc: [[51718,1],50] >> Pid: 0 Local rank: 0 Node rank: 0 App rank: 50 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [B/././././.] >> Data for proc: [[51718,1],51] >> Pid: 0 Local rank: 1 Node rank: 1 App rank: 51 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [./B/./././.] >> Data for proc: [[51718,1],52] >> Pid: 0 Local rank: 2 Node rank: 2 App rank: 52 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [././B/././.] >> Data for proc: [[51718,1],53] >> Pid: 0 Local rank: 3 Node rank: 3 App rank: 53 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [./././B/./.] >> Data for proc: [[51718,1],54] >> Pid: 0 Local rank: 4 Node rank: 4 App rank: 54 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [././././B/.] >> Data for proc: [[51718,1],55] >> Pid: 0 Local rank: 5 Node rank: 5 App rank: 55 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [./././././B] >> >> Data for node: csclprd3-0-7 Launch id: -1 State: 0 >> Daemon: [[51718,0],10] Daemon launched: True >> Num slots: 16 Slots in use: 16 Oversubscribed: FALSE >> Num slots allocated: 16 Max slots: 0 >> Username on node: NULL >> Num procs: 16 Next node_rank: 16 >> Data for proc: [[51718,1],56] >> Pid: 0 Local rank: 0 Node rank: 0 App rank: 56 >> State: INITIALIZED App_context: 0 >> Locale: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> Binding: [BB/../../../../../../..][../../../../../../../..] >> Data for proc: [[51718,1],57] >> Pid: 0 Local rank: 1 Node rank: 1 App rank: 57 >> State: INITIALIZED App_context: 0 >> Locale: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> Binding: [../../../../../../../..][BB/../../../../../../..] >> Data for proc: [[51718,1],58] >> Pid: 0 Local rank: 2 Node rank: 2 App rank: 58 >> State: INITIALIZED App_context: 0 >> Locale: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> Binding: [../BB/../../../../../..][../../../../../../../..] >> Data for proc: [[51718,1],59] >> Pid: 0 Local rank: 3 Node rank: 3 App rank: 59 >> State: INITIALIZED App_context: 0 >> Locale: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> Binding: [../../../../../../../..][../BB/../../../../../..] >> Data for proc: [[51718,1],60] >> Pid: 0 Local rank: 4 Node rank: 4 App rank: 60 >> State: INITIALIZED App_context: 0 >> Locale: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> Binding: [../../BB/../../../../..][../../../../../../../..] 
>> Data for proc: [[51718,1],61] >> Pid: 0 Local rank: 5 Node rank: 5 App rank: 61 >> State: INITIALIZED App_context: 0 >> Locale: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> Binding: [../../../../../../../..][../../BB/../../../../..] >> Data for proc: [[51718,1],62] >> Pid: 0 Local rank: 6 Node rank: 6 App rank: 62 >> State: INITIALIZED App_context: 0 >> Locale: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> Binding: [../../../BB/../../../..][../../../../../../../..] >> Data for proc: [[51718,1],63] >> Pid: 0 Local rank: 7 Node rank: 7 App rank: 63 >> State: INITIALIZED App_context: 0 >> Locale: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> Binding: [../../../../../../../..][../../../BB/../../../..] >> Data for proc: [[51718,1],64] >> Pid: 0 Local rank: 8 Node rank: 8 App rank: 64 >> State: INITIALIZED App_context: 0 >> Locale: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> Binding: [../../../../BB/../../..][../../../../../../../..] >> Data for proc: [[51718,1],65] >> Pid: 0 Local rank: 9 Node rank: 9 App rank: 65 >> State: INITIALIZED App_context: 0 >> Locale: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> Binding: [../../../../../../../..][../../../../BB/../../..] >> Data for proc: [[51718,1],66] >> Pid: 0 Local rank: 10 Node rank: 10 App rank: 66 >> State: INITIALIZED App_context: 0 >> Locale: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> Binding: [../../../../../BB/../..][../../../../../../../..] >> Data for proc: [[51718,1],67] >> Pid: 0 Local rank: 11 Node rank: 11 App rank: 67 >> State: INITIALIZED App_context: 0 >> Locale: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> Binding: [../../../../../../../..][../../../../../BB/../..] >> Data for proc: [[51718,1],68] >> Pid: 0 Local rank: 12 Node rank: 12 App rank: 68 >> State: INITIALIZED App_context: 0 >> Locale: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> Binding: [../../../../../../BB/..][../../../../../../../..] >> Data for proc: [[51718,1],69] >> Pid: 0 Local rank: 13 Node rank: 13 App rank: 69 >> State: INITIALIZED App_context: 0 >> Locale: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> Binding: [../../../../../../../..][../../../../../../BB/..] >> Data for proc: [[51718,1],70] >> Pid: 0 Local rank: 14 Node rank: 14 App rank: 70 >> State: INITIALIZED App_context: 0 >> Locale: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> Binding: [../../../../../../../BB][../../../../../../../..] >> Data for proc: [[51718,1],71] >> Pid: 0 Local rank: 15 Node rank: 15 App rank: 71 >> State: INITIALIZED App_context: 0 >> Locale: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> Binding: [../../../../../../../..][../../../../../../../BB] >> >> Data for node: csclprd3-0-8 Launch id: -1 State: 0 >> Daemon: [[51718,0],11] Daemon launched: True >> Num slots: 16 Slots in use: 16 Oversubscribed: FALSE >> Num slots allocated: 16 Max slots: 0 >> Username on node: NULL >> Num procs: 16 Next node_rank: 16 >> Data for proc: [[51718,1],72] >> Pid: 0 Local rank: 0 Node rank: 0 App rank: 72 >> State: INITIALIZED App_context: 0 >> Locale: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> Binding: [BB/../../../../../../..][../../../../../../../..] >> Data for proc: [[51718,1],73] >> Pid: 0 Local rank: 1 Node rank: 1 App rank: 73 >> State: INITIALIZED App_context: 0 >> Locale: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> Binding: [../../../../../../../..][BB/../../../../../../..] 
>> Data for proc: [[51718,1],74] >> Pid: 0 Local rank: 2 Node rank: 2 App rank: 74 >> State: INITIALIZED App_context: 0 >> Locale: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> Binding: [../BB/../../../../../..][../../../../../../../..] >> Data for proc: [[51718,1],75] >> Pid: 0 Local rank: 3 Node rank: 3 App rank: 75 >> State: INITIALIZED App_context: 0 >> Locale: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> Binding: [../../../../../../../..][../BB/../../../../../..] >> Data for proc: [[51718,1],76] >> Pid: 0 Local rank: 4 Node rank: 4 App rank: 76 >> State: INITIALIZED App_context: 0 >> Locale: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> Binding: [../../BB/../../../../..][../../../../../../../..] >> Data for proc: [[51718,1],77] >> Pid: 0 Local rank: 5 Node rank: 5 App rank: 77 >> State: INITIALIZED App_context: 0 >> Locale: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> Binding: [../../../../../../../..][../../BB/../../../../..] >> Data for proc: [[51718,1],78] >> Pid: 0 Local rank: 6 Node rank: 6 App rank: 78 >> State: INITIALIZED App_context: 0 >> Locale: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> Binding: [../../../BB/../../../..][../../../../../../../..] >> Data for proc: [[51718,1],79] >> Pid: 0 Local rank: 7 Node rank: 7 App rank: 79 >> State: INITIALIZED App_context: 0 >> Locale: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> Binding: [../../../../../../../..][../../../BB/../../../..] >> Data for proc: [[51718,1],80] >> Pid: 0 Local rank: 8 Node rank: 8 App rank: 80 >> State: INITIALIZED App_context: 0 >> Locale: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> Binding: [../../../../BB/../../..][../../../../../../../..] >> Data for proc: [[51718,1],81] >> Pid: 0 Local rank: 9 Node rank: 9 App rank: 81 >> State: INITIALIZED App_context: 0 >> Locale: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> Binding: [../../../../../../../..][../../../../BB/../../..] >> Data for proc: [[51718,1],82] >> Pid: 0 Local rank: 10 Node rank: 10 App rank: 82 >> State: INITIALIZED App_context: 0 >> Locale: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> Binding: [../../../../../BB/../..][../../../../../../../..] >> Data for proc: [[51718,1],83] >> Pid: 0 Local rank: 11 Node rank: 11 App rank: 83 >> State: INITIALIZED App_context: 0 >> Locale: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> Binding: [../../../../../../../..][../../../../../BB/../..] >> Data for proc: [[51718,1],84] >> Pid: 0 Local rank: 12 Node rank: 12 App rank: 84 >> State: INITIALIZED App_context: 0 >> Locale: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> Binding: [../../../../../../BB/..][../../../../../../../..] >> Data for proc: [[51718,1],85] >> Pid: 0 Local rank: 13 Node rank: 13 App rank: 85 >> State: INITIALIZED App_context: 0 >> Locale: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> Binding: [../../../../../../../..][../../../../../../BB/..] >> Data for proc: [[51718,1],86] >> Pid: 0 Local rank: 14 Node rank: 14 App rank: 86 >> State: INITIALIZED App_context: 0 >> Locale: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> Binding: [../../../../../../../BB][../../../../../../../..] 
>> Data for proc: [[51718,1],87] >> Pid: 0 Local rank: 15 Node rank: 15 App rank: 87 >> State: INITIALIZED App_context: 0 >> Locale: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> Binding: [../../../../../../../..][../../../../../../../BB] >> >> Data for node: csclprd3-0-10 Launch id: -1 State: 0 >> Daemon: [[51718,0],12] Daemon launched: True >> Num slots: 16 Slots in use: 16 Oversubscribed: FALSE >> Num slots allocated: 16 Max slots: 0 >> Username on node: NULL >> Num procs: 16 Next node_rank: 16 >> Data for proc: [[51718,1],88] >> Pid: 0 Local rank: 0 Node rank: 0 App rank: 88 >> State: INITIALIZED App_context: 0 >> Locale: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> Binding: [BB/../../../../../../..][../../../../../../../..] >> Data for proc: [[51718,1],89] >> Pid: 0 Local rank: 1 Node rank: 1 App rank: 89 >> State: INITIALIZED App_context: 0 >> Locale: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> Binding: [../../../../../../../..][BB/../../../../../../..] >> Data for proc: [[51718,1],90] >> Pid: 0 Local rank: 2 Node rank: 2 App rank: 90 >> State: INITIALIZED App_context: 0 >> Locale: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> Binding: [../BB/../../../../../..][../../../../../../../..] >> Data for proc: [[51718,1],91] >> Pid: 0 Local rank: 3 Node rank: 3 App rank: 91 >> State: INITIALIZED App_context: 0 >> Locale: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> Binding: [../../../../../../../..][../BB/../../../../../..] >> Data for proc: [[51718,1],92] >> Pid: 0 Local rank: 4 Node rank: 4 App rank: 92 >> State: INITIALIZED App_context: 0 >> Locale: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> Binding: [../../BB/../../../../..][../../../../../../../..] >> Data for proc: [[51718,1],93] >> Pid: 0 Local rank: 5 Node rank: 5 App rank: 93 >> State: INITIALIZED App_context: 0 >> Locale: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> Binding: [../../../../../../../..][../../BB/../../../../..] >> Data for proc: [[51718,1],94] >> Pid: 0 Local rank: 6 Node rank: 6 App rank: 94 >> State: INITIALIZED App_context: 0 >> Locale: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> Binding: [../../../BB/../../../..][../../../../../../../..] >> Data for proc: [[51718,1],95] >> Pid: 0 Local rank: 7 Node rank: 7 App rank: 95 >> State: INITIALIZED App_context: 0 >> Locale: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> Binding: [../../../../../../../..][../../../BB/../../../..] >> Data for proc: [[51718,1],96] >> Pid: 0 Local rank: 8 Node rank: 8 App rank: 96 >> State: INITIALIZED App_context: 0 >> Locale: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> Binding: [../../../../BB/../../..][../../../../../../../..] >> Data for proc: [[51718,1],97] >> Pid: 0 Local rank: 9 Node rank: 9 App rank: 97 >> State: INITIALIZED App_context: 0 >> Locale: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> Binding: [../../../../../../../..][../../../../BB/../../..] >> Data for proc: [[51718,1],98] >> Pid: 0 Local rank: 10 Node rank: 10 App rank: 98 >> State: INITIALIZED App_context: 0 >> Locale: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> Binding: [../../../../../BB/../..][../../../../../../../..] >> Data for proc: [[51718,1],99] >> Pid: 0 Local rank: 11 Node rank: 11 App rank: 99 >> State: INITIALIZED App_context: 0 >> Locale: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> Binding: [../../../../../../../..][../../../../../BB/../..] 
>> Data for proc: [[51718,1],100] >> Pid: 0 Local rank: 12 Node rank: 12 App rank: 100 >> State: INITIALIZED App_context: 0 >> Locale: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> Binding: [../../../../../../BB/..][../../../../../../../..] >> Data for proc: [[51718,1],101] >> Pid: 0 Local rank: 13 Node rank: 13 App rank: 101 >> State: INITIALIZED App_context: 0 >> Locale: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> Binding: [../../../../../../../..][../../../../../../BB/..] >> Data for proc: [[51718,1],102] >> Pid: 0 Local rank: 14 Node rank: 14 App rank: 102 >> State: INITIALIZED App_context: 0 >> Locale: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> Binding: [../../../../../../../BB][../../../../../../../..] >> Data for proc: [[51718,1],103] >> Pid: 0 Local rank: 15 Node rank: 15 App rank: 103 >> State: INITIALIZED App_context: 0 >> Locale: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> Binding: [../../../../../../../..][../../../../../../../BB] >> >> Data for node: csclprd3-0-11 Launch id: -1 State: 0 >> Daemon: [[51718,0],13] Daemon launched: True >> Num slots: 16 Slots in use: 16 Oversubscribed: FALSE >> Num slots allocated: 16 Max slots: 0 >> Username on node: NULL >> Num procs: 16 Next node_rank: 16 >> Data for proc: [[51718,1],104] >> Pid: 0 Local rank: 0 Node rank: 0 App rank: 104 >> State: INITIALIZED App_context: 0 >> Locale: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> Binding: [BB/../../../../../../..][../../../../../../../..] >> Data for proc: [[51718,1],105] >> Pid: 0 Local rank: 1 Node rank: 1 App rank: 105 >> State: INITIALIZED App_context: 0 >> Locale: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> Binding: [../../../../../../../..][BB/../../../../../../..] >> Data for proc: [[51718,1],106] >> Pid: 0 Local rank: 2 Node rank: 2 App rank: 106 >> State: INITIALIZED App_context: 0 >> Locale: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> Binding: [../BB/../../../../../..][../../../../../../../..] >> Data for proc: [[51718,1],107] >> Pid: 0 Local rank: 3 Node rank: 3 App rank: 107 >> State: INITIALIZED App_context: 0 >> Locale: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> Binding: [../../../../../../../..][../BB/../../../../../..] >> Data for proc: [[51718,1],108] >> Pid: 0 Local rank: 4 Node rank: 4 App rank: 108 >> State: INITIALIZED App_context: 0 >> Locale: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> Binding: [../../BB/../../../../..][../../../../../../../..] >> Data for proc: [[51718,1],109] >> Pid: 0 Local rank: 5 Node rank: 5 App rank: 109 >> State: INITIALIZED App_context: 0 >> Locale: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> Binding: [../../../../../../../..][../../BB/../../../../..] >> Data for proc: [[51718,1],110] >> Pid: 0 Local rank: 6 Node rank: 6 App rank: 110 >> State: INITIALIZED App_context: 0 >> Locale: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> Binding: [../../../BB/../../../..][../../../../../../../..] >> Data for proc: [[51718,1],111] >> Pid: 0 Local rank: 7 Node rank: 7 App rank: 111 >> State: INITIALIZED App_context: 0 >> Locale: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> Binding: [../../../../../../../..][../../../BB/../../../..] >> Data for proc: [[51718,1],112] >> Pid: 0 Local rank: 8 Node rank: 8 App rank: 112 >> State: INITIALIZED App_context: 0 >> Locale: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> Binding: [../../../../BB/../../..][../../../../../../../..] 
>> Data for proc: [[51718,1],113] >> Pid: 0 Local rank: 9 Node rank: 9 App rank: 113 >> State: INITIALIZED App_context: 0 >> Locale: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> Binding: [../../../../../../../..][../../../../BB/../../..] >> Data for proc: [[51718,1],114] >> Pid: 0 Local rank: 10 Node rank: 10 App rank: 114 >> State: INITIALIZED App_context: 0 >> Locale: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> Binding: [../../../../../BB/../..][../../../../../../../..] >> Data for proc: [[51718,1],115] >> Pid: 0 Local rank: 11 Node rank: 11 App rank: 115 >> State: INITIALIZED App_context: 0 >> Locale: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> Binding: [../../../../../../../..][../../../../../BB/../..] >> Data for proc: [[51718,1],116] >> Pid: 0 Local rank: 12 Node rank: 12 App rank: 116 >> State: INITIALIZED App_context: 0 >> Locale: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> Binding: [../../../../../../BB/..][../../../../../../../..] >> Data for proc: [[51718,1],117] >> Pid: 0 Local rank: 13 Node rank: 13 App rank: 117 >> State: INITIALIZED App_context: 0 >> Locale: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> Binding: [../../../../../../../..][../../../../../../BB/..] >> Data for proc: [[51718,1],118] >> Pid: 0 Local rank: 14 Node rank: 14 App rank: 118 >> State: INITIALIZED App_context: 0 >> Locale: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..] >> Binding: [../../../../../../../BB][../../../../../../../..] >> Data for proc: [[51718,1],119] >> Pid: 0 Local rank: 15 Node rank: 15 App rank: 119 >> State: INITIALIZED App_context: 0 >> Locale: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB] >> Binding: [../../../../../../../..][../../../../../../../BB] >> >> Data for node: csclprd3-0-12 Launch id: -1 State: 0 >> Daemon: [[51718,0],14] Daemon launched: True >> Num slots: 6 Slots in use: 6 Oversubscribed: FALSE >> Num slots allocated: 6 Max slots: 0 >> Username on node: NULL >> Num procs: 6 Next node_rank: 6 >> Data for proc: [[51718,1],120] >> Pid: 0 Local rank: 0 Node rank: 0 App rank: 120 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [BB/../../../../..] >> Data for proc: [[51718,1],121] >> Pid: 0 Local rank: 1 Node rank: 1 App rank: 121 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [../BB/../../../..] >> Data for proc: [[51718,1],122] >> Pid: 0 Local rank: 2 Node rank: 2 App rank: 122 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [../../BB/../../..] >> Data for proc: [[51718,1],123] >> Pid: 0 Local rank: 3 Node rank: 3 App rank: 123 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [../../../BB/../..] >> Data for proc: [[51718,1],124] >> Pid: 0 Local rank: 4 Node rank: 4 App rank: 124 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [../../../../BB/..] >> Data for proc: [[51718,1],125] >> Pid: 0 Local rank: 5 Node rank: 5 App rank: 125 >> State: INITIALIZED App_context: 0 >> Locale: UNKNOWN >> Binding: [../../../../../BB] >> >> Data for node: csclprd3-0-13 Launch id: -1 State: 0 >> Daemon: [[51718,0],15] Daemon launched: True >> Num slots: 12 Slots in use: 6 Oversubscribed: FALSE >> Num slots allocated: 12 Max slots: 0 >> Username on node: NULL >> Num procs: 6 Next node_rank: 6 >> Data for proc: [[51718,1],126] >> Pid: 0 Local rank: 0 Node rank: 0 App rank: 126 >> State: INITIALIZED App_context: 0 >> Locale: [BB/BB/BB/BB/BB/BB][../../../../../..] >> Binding: [BB/../../../../..][../../../../../..] 
>> Data for proc: [[51718,1],127] >> Pid: 0 Local rank: 1 Node rank: 1 App rank: 127 >> State: INITIALIZED App_context: 0 >> Locale: [../../../../../..][BB/BB/BB/BB/BB/BB] >> Binding: [../../../../../..][BB/../../../../..] >> Data for proc: [[51718,1],128] >> Pid: 0 Local rank: 2 Node rank: 2 App rank: 128 >> State: INITIALIZED App_context: 0 >> Locale: [BB/BB/BB/BB/BB/BB][../../../../../..] >> Binding: [../BB/../../../..][../../../../../..] >> Data for proc: [[51718,1],129] >> Pid: 0 Local rank: 3 Node rank: 3 App rank: 129 >> State: INITIALIZED App_context: 0 >> Locale: [../../../../../..][BB/BB/BB/BB/BB/BB] >> Binding: [../../../../../..][../BB/../../../..] >> Data for proc: [[51718,1],130] >> Pid: 0 Local rank: 4 Node rank: 4 App rank: 130 >> State: INITIALIZED App_context: 0 >> Locale: [BB/BB/BB/BB/BB/BB][../../../../../..] >> Binding: [../../BB/../../..][../../../../../..] >> Data for proc: [[51718,1],131] >> Pid: 0 Local rank: 5 Node rank: 5 App rank: 131 >> State: INITIALIZED App_context: 0 >> Locale: [../../../../../..][BB/BB/BB/BB/BB/BB] >> Binding: [../../../../../..][../../BB/../../..] >> [csclprd3-0-13:31619] *** Process received signal *** >> [csclprd3-0-13:31619] Signal: Bus error (7) >> [csclprd3-0-13:31619] Signal code: Non-existant physical address (2) >> [csclprd3-0-13:31619] Failing at address: 0x7f1374267a00 >> [csclprd3-0-13:31620] *** Process received signal *** >> [csclprd3-0-13:31620] Signal: Bus error (7) >> [csclprd3-0-13:31620] Signal code: Non-existant physical address (2) >> [csclprd3-0-13:31620] Failing at address: 0x7fcc702a7980 >> [csclprd3-0-13:31615] *** Process received signal *** >> [csclprd3-0-13:31615] Signal: Bus error (7) >> [csclprd3-0-13:31615] Signal code: Non-existant physical address (2) >> [csclprd3-0-13:31615] Failing at address: 0x7f8128367880 >> [csclprd3-0-13:31616] *** Process received signal *** >> [csclprd3-0-13:31616] Signal: Bus error (7) >> [csclprd3-0-13:31616] Signal code: Non-existant physical address (2) >> [csclprd3-0-13:31616] Failing at address: 0x7fe674227a00 >> [csclprd3-0-13:31617] *** Process received signal *** >> [csclprd3-0-13:31617] Signal: Bus error (7) >> [csclprd3-0-13:31617] Signal code: Non-existant physical address (2) >> [csclprd3-0-13:31617] Failing at address: 0x7f061c32db80 >> [csclprd3-0-13:31618] *** Process received signal *** >> [csclprd3-0-13:31618] Signal: Bus error (7) >> [csclprd3-0-13:31618] Signal code: Non-existant physical address (2) >> [csclprd3-0-13:31618] Failing at address: 0x7fb8402eaa80 >> [csclprd3-0-13:31618] [ 0] /lib64/libpthread.so.0(+0xf500)[0x7fb851851500] >> [csclprd3-0-13:31618] [ 1] [csclprd3-0-13:31616] [ 0] >> /lib64/libpthread.so.0(+0xf500)[0x7fe6843a4500] >> [csclprd3-0-13:31616] [ 1] [csclprd3-0-13:31620] [ 0] >> /lib64/libpthread.so.0(+0xf500)[0x7fcc80c54500] >> [csclprd3-0-13:31620] [ 1] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0x167f61)[0x7fcc80fc9f61] >> [csclprd3-0-13:31620] [ 2] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0x168047)[0x7fcc80fca047] >> [csclprd3-0-13:31620] [ 3] [csclprd3-0-13:31615] [ 0] >> /lib64/libpthread.so.0(+0xf500)[0x7f81385ca500] >> [csclprd3-0-13:31615] [ 1] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0x167f61)[0x7f813893ff61] >> [csclprd3-0-13:31615] [ 2] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0x168047)[0x7f8138940047] >> [csclprd3-0-13:31615] [ 3] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0x167f61)[0x7fb851bc6f61] >> [csclprd3-0-13:31618] [ 2] >> 
/hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0x168047)[0x7fb851bc7047] >> [csclprd3-0-13:31618] [ 3] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0x55670)[0x7fb851ab4670] >> [csclprd3-0-13:31618] [ 4] [csclprd3-0-13:31617] [ 0] >> /lib64/libpthread.so.0(+0xf500)[0x7f062cfe5500] >> [csclprd3-0-13:31617] [ 1] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0x167f61)[0x7f062d35af61] >> [csclprd3-0-13:31617] [ 2] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0x168047)[0x7f062d35b047] >> [csclprd3-0-13:31617] [ 3] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0x55670)[0x7f062d248670] >> [csclprd3-0-13:31617] [ 4] [csclprd3-0-13:31619] [ 0] >> /lib64/libpthread.so.0(+0xf500)[0x7f1384fde500] >> [csclprd3-0-13:31619] [ 1] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0x167f61)[0x7f1385353f61] >> [csclprd3-0-13:31619] [ 2] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0x167f61)[0x7fe684719f61] >> [csclprd3-0-13:31616] [ 2] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0x168047)[0x7fe68471a047] >> [csclprd3-0-13:31616] [ 3] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0x55670)[0x7fe684607670] >> [csclprd3-0-13:31616] [ 4] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0x168047)[0x7f1385354047] >> [csclprd3-0-13:31619] [ 3] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0x55670)[0x7f1385241670] >> [csclprd3-0-13:31619] [ 4] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_grow+0x3b9)[0x7f13852425ab] >> [csclprd3-0-13:31619] [ 5] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_resize_mt+0xfb)[0x7f1385242751] >> [csclprd3-0-13:31619] [ 6] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_btl_sm_add_procs+0x671)[0x7f13853501c9] >> [csclprd3-0-13:31619] [ 7] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0x14a628)[0x7f1385336628] >> [csclprd3-0-13:31619] [ 8] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0x55670)[0x7fcc80eb7670] >> [csclprd3-0-13:31620] [ 4] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_grow+0x3b9)[0x7fcc80eb85ab] >> [csclprd3-0-13:31620] [ 5] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_resize_mt+0xfb)[0x7fcc80eb8751] >> [csclprd3-0-13:31620] [ 6] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_btl_sm_add_procs+0x671)[0x7fcc80fc61c9] >> [csclprd3-0-13:31620] [ 7] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0x14a628)[0x7fcc80fac628] >> [csclprd3-0-13:31620] [ 8] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xff)[0x7fcc8111fd61] >> [csclprd3-0-13:31620] [ 9] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0x55670)[0x7f813882d670] >> [csclprd3-0-13:31615] [ 4] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_grow+0x3b9)[0x7f813882e5ab] >> [csclprd3-0-13:31615] [ 5] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_resize_mt+0xfb)[0x7f813882e751] >> [csclprd3-0-13:31615] [ 6] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_btl_sm_add_procs+0x671)[0x7f813893c1c9] >> [csclprd3-0-13:31615] [ 7] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0x14a628)[0x7f8138922628] >> [csclprd3-0-13:31615] [ 8] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xff)[0x7f8138a95d61] >> [csclprd3-0-13:31615] [ 9] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0xbda)[0x7f813885d747] >> [csclprd3-0-13:31615] [10] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_grow+0x3b9)[0x7fb851ab55ab] >> [csclprd3-0-13:31618] [ 5] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_resize_mt+0xfb)[0x7fb851ab5751] >> 
[csclprd3-0-13:31618] [ 6] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_btl_sm_add_procs+0x671)[0x7fb851bc31c9] >> [csclprd3-0-13:31618] [ 7] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0x14a628)[0x7fb851ba9628] >> [csclprd3-0-13:31618] [ 8] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xff)[0x7fb851d1cd61] >> [csclprd3-0-13:31618] [ 9] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0xbda)[0x7fb851ae4747] >> [csclprd3-0-13:31618] [10] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_grow+0x3b9)[0x7f062d2495ab] >> [csclprd3-0-13:31617] [ 5] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_resize_mt+0xfb)[0x7f062d249751] >> [csclprd3-0-13:31617] [ 6] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_btl_sm_add_procs+0x671)[0x7f062d3571c9] >> [csclprd3-0-13:31617] [ 7] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0x14a628)[0x7f062d33d628] >> [csclprd3-0-13:31617] [ 8] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xff)[0x7f062d4b0d61] >> [csclprd3-0-13:31617] [ 9] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0xbda)[0x7f062d278747] >> [csclprd3-0-13:31617] [10] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_grow+0x3b9)[0x7fe6846085ab] >> [csclprd3-0-13:31616] [ 5] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_free_list_resize_mt+0xfb)[0x7fe684608751] >> [csclprd3-0-13:31616] [ 6] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_btl_sm_add_procs+0x671)[0x7fe6847161c9] >> [csclprd3-0-13:31616] [ 7] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(+0x14a628)[0x7fe6846fc628] >> [csclprd3-0-13:31616] [ 8] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xff)[0x7fe68486fd61] >> [csclprd3-0-13:31616] [ 9] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0xbda)[0x7fe684637747] >> [csclprd3-0-13:31616] [10] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x185)[0x7fe68467750b] >> [csclprd3-0-13:31616] [11] >> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0] >> [csclprd3-0-13:31616] [12] >> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7fe684021cdd] >> [csclprd3-0-13:31616] [13] >> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999] >> [csclprd3-0-13:31616] *** End of error message *** >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x185)[0x7f062d2b850b] >> [csclprd3-0-13:31617] [11] >> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0] >> [csclprd3-0-13:31617] [12] >> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f062cc62cdd] >> [csclprd3-0-13:31617] [13] >> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999] >> [csclprd3-0-13:31617] *** End of error message *** >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(mca_pml_ob1_add_procs+0xff)[0x7f13854a9d61] >> [csclprd3-0-13:31619] [ 9] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0xbda)[0x7f1385271747] >> [csclprd3-0-13:31619] [10] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x185)[0x7f13852b150b] >> [csclprd3-0-13:31619] [11] >> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0] >> [csclprd3-0-13:31619] [12] >> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f1384c5bcdd] >> [csclprd3-0-13:31619] [13] >> /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999] >> [csclprd3-0-13:31619] *** End of error message *** >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(ompi_mpi_init+0xbda)[0x7fcc80ee7747] >> [csclprd3-0-13:31620] [10] >> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x185)[0x7fcc80f2750b] >> [csclprd3-0-13:31620] [11] >> 
/hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
>> [csclprd3-0-13:31620] [12] /lib64/libc.so.6(__libc_start_main+0xfd)[0x7fcc808d1cdd]
>> [csclprd3-0-13:31620] [13] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
>> [csclprd3-0-13:31620] *** End of error message ***
>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x185)[0x7f813889d50b]
>> [csclprd3-0-13:31615] [11] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
>> [csclprd3-0-13:31615] [12] /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f8138247cdd]
>> [csclprd3-0-13:31615] [13] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
>> [csclprd3-0-13:31615] *** End of error message ***
>> /hpc/apps/mpi/openmpi/1.8.6/lib/libmpi.so.1(MPI_Init+0x185)[0x7fb851b2450b]
>> [csclprd3-0-13:31618] [11] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400ad0]
>> [csclprd3-0-13:31618] [12] /lib64/libc.so.6(__libc_start_main+0xfd)[0x7fb8514cecdd]
>> [csclprd3-0-13:31618] [13] /hpc/home/lanew/mpi/openmpi/ProcessColors3[0x400999]
>> [csclprd3-0-13:31618] *** End of error message ***
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 126 with PID 0 on node csclprd3-0-13 exited
>> on signal 7 (Bus error).
>> --------------------------------------------------------------------------
>>
>> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
>> Sent: Tuesday, June 23, 2015 6:20 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] OpenMPI 1.8.6, CentOS 6.3, too many slots = crash
>>
>> Wow - that is one sick puppy! I see that some nodes are reporting not-bound
>> for their procs, and the rest are binding to socket (as they should). Some
>> of your nodes clearly do not have hyper threads enabled (or only have
>> single-thread cores on them), and have 2 cores/socket. Other nodes have 8
>> cores/socket with hyper threads enabled, while still others have 6
>> cores/socket and HT enabled.
>>
>> I don't see anyone binding to a single HT if multiple HTs/core are
>> available. I think you are being fooled by those nodes that either don't
>> have HT enabled, or have only 1 HT/core.
>>
>> In both cases, it is node 13 that fails. I also note that you said
>> everything works okay with < 132 ranks, and node 13 hosts ranks 127-131. So
>> node 13 would host ranks even if you reduced the number in the job to 131.
>> This would imply that it probably isn't something wrong with the node
>> itself.
>>
>> Is there any way you could run a job of this size on a homogeneous cluster?
>> The procs all show bindings that look right, but I'm wondering if the
>> heterogeneity is the source of the trouble here. We may be communicating the
>> binding pattern incorrectly and giving bad info to the backend daemon.
>>
>> Also, rather than --report-bindings, use "--display-devel-map" on the
>> command line and let's see what the mapper thinks it did. If there is a
>> problem with placement, that is where it would exist.
>>
>>
>> On Tue, Jun 23, 2015 at 5:12 PM, Lane, William <william.l...@cshs.org> wrote:
>> Ralph,
>>
>> There is something funny going on: the traces from the runs w/the debug
>> build aren't showing any differences from what I got earlier. However, I
>> did do a run w/the --bind-to core switch and was surprised to see that
>> hyperthreading cores were sometimes being used.
>> 
>> Here are the traces that I have:
>> 
>> mpirun -np 132 -report-bindings --prefix /hpc/apps/mpi/openmpi/1.8.6/ --hostfile hostfile-noslots --mca btl_tcp_if_include eth0 --hetero-nodes /hpc/home/lanew/mpi/openmpi/ProcessColors3
>> [csclprd3-0-5:16802] MCW rank 44 is not bound (or bound to all available processors)
>> [csclprd3-0-5:16802] MCW rank 45 is not bound (or bound to all available processors)
>> [csclprd3-0-5:16802] MCW rank 46 is not bound (or bound to all available processors)
>> [csclprd3-6-5:12480] MCW rank 4 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B][./.]
>> [csclprd3-6-5:12480] MCW rank 5 bound to socket 1[core 2[hwt 0]], socket 1[core 3[hwt 0]]: [./.][B/B]
>> [csclprd3-6-5:12480] MCW rank 6 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B][./.]
>> [csclprd3-6-5:12480] MCW rank 7 bound to socket 1[core 2[hwt 0]], socket 1[core 3[hwt 0]]: [./.][B/B]
>> [csclprd3-0-5:16802] MCW rank 47 is not bound (or bound to all available processors)
>> [csclprd3-0-5:16802] MCW rank 48 is not bound (or bound to all available processors)
>> [csclprd3-0-5:16802] MCW rank 49 is not bound (or bound to all available processors)
>> [csclprd3-0-1:14318] MCW rank 22 is not bound (or bound to all available processors)
>> [csclprd3-0-1:14318] MCW rank 23 is not bound (or bound to all available processors)
>> [csclprd3-0-1:14318] MCW rank 24 is not bound (or bound to all available processors)
>> [csclprd3-6-1:24682] MCW rank 3 bound to socket 1[core 2[hwt 0]], socket 1[core 3[hwt 0]]: [./.][B/B]
>> [csclprd3-6-1:24682] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B][./.]
>> [csclprd3-0-1:14318] MCW rank 25 is not bound (or bound to all available processors)
>> [csclprd3-0-1:14318] MCW rank 20 is not bound (or bound to all available processors)
>> [csclprd3-0-3:13827] MCW rank 34 is not bound (or bound to all available processors)
>> [csclprd3-0-1:14318] MCW rank 21 is not bound (or bound to all available processors)
>> [csclprd3-0-3:13827] MCW rank 35 is not bound (or bound to all available processors)
>> [csclprd3-6-1:24682] MCW rank 1 bound to socket 1[core 2[hwt 0]], socket 1[core 3[hwt 0]]: [./.][B/B]
>> [csclprd3-0-3:13827] MCW rank 36 is not bound (or bound to all available processors)
>> [csclprd3-6-1:24682] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B][./.]
>> [csclprd3-0-6:30371] MCW rank 51 is not bound (or bound to all available processors)
>> [csclprd3-0-6:30371] MCW rank 52 is not bound (or bound to all available processors)
>> [csclprd3-0-6:30371] MCW rank 53 is not bound (or bound to all available processors)
>> [csclprd3-0-2:05825] MCW rank 30 is not bound (or bound to all available processors)
>> [csclprd3-0-6:30371] MCW rank 54 is not bound (or bound to all available processors)
>> [csclprd3-0-3:13827] MCW rank 37 is not bound (or bound to all available processors)
>> [csclprd3-0-2:05825] MCW rank 31 is not bound (or bound to all available processors)
>> [csclprd3-0-3:13827] MCW rank 32 is not bound (or bound to all available processors)
>> [csclprd3-0-6:30371] MCW rank 55 is not bound (or bound to all available processors)
>> [csclprd3-0-3:13827] MCW rank 33 is not bound (or bound to all available processors)
>> [csclprd3-0-6:30371] MCW rank 50 is not bound (or bound to all available processors)
>> [csclprd3-0-2:05825] MCW rank 26 is not bound (or bound to all available processors)
>> [csclprd3-0-2:05825] MCW rank 27 is not bound (or bound to all available processors)
>> [csclprd3-0-2:05825] MCW rank 28 is not bound (or bound to all available processors)
>> [csclprd3-0-2:05825] MCW rank 29 is not bound (or bound to all available processors)
>> [csclprd3-0-12:12383] MCW rank 121 is not bound (or bound to all available processors)
>> [csclprd3-0-12:12383] MCW rank 122 is not bound (or bound to all available processors)
>> [csclprd3-0-12:12383] MCW rank 123 is not bound (or bound to all available processors)
>> [csclprd3-0-12:12383] MCW rank 124 is not bound (or bound to all available processors)
>> [csclprd3-0-12:12383] MCW rank 125 is not bound (or bound to all available processors)
>> [csclprd3-0-12:12383] MCW rank 120 is not bound (or bound to all available processors)
>> [csclprd3-0-0:31079] MCW rank 13 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
>> [csclprd3-0-0:31079] MCW rank 14 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
>> [csclprd3-0-0:31079] MCW rank 15 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
>> [csclprd3-0-0:31079] MCW rank 16 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
>> [csclprd3-0-7:20515] MCW rank 68 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
>> [csclprd3-0-10:19096] MCW rank 100 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
>> [csclprd3-0-7:20515] MCW rank 69 bound to socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
>> [csclprd3-0-10:19096] MCW rank 101 bound to socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
>> [csclprd3-0-0:31079] MCW rank 17 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
>> [csclprd3-0-7:20515] MCW rank 70 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
>> [csclprd3-0-10:19096] MCW rank 102 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
>> [csclprd3-0-11:31636] MCW rank 116 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
>> [csclprd3-0-11:31636] MCW rank 117 bound to socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
>> [csclprd3-0-0:31079] MCW rank 18 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
>> [csclprd3-0-11:31636] MCW rank 118 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
>> [csclprd3-0-0:31079] MCW rank 19 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
>> [csclprd3-0-7:20515] MCW rank 71 bound to socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
>> [csclprd3-0-10:19096] MCW rank 103 bound to socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
>> [csclprd3-0-0:31079] MCW rank 8 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
>> [csclprd3-0-0:31079] MCW rank 9 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
>> [csclprd3-0-10:19096] MCW rank 88 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
>> [csclprd3-0-11:31636] MCW rank 119 bound to socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
>> [csclprd3-0-7:20515] MCW rank 56 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
>> [csclprd3-0-0:31079] MCW rank 10 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
>> [csclprd3-0-7:20515] MCW rank 57 bound to socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
>> [csclprd3-0-10:19096] MCW rank 89 bound to socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
>> [csclprd3-0-11:31636] MCW rank 104 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
>> [csclprd3-0-0:31079] MCW rank 11 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
>> [csclprd3-0-0:31079] MCW rank 12 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
>> [csclprd3-0-4:30348] MCW rank 42 is not bound (or bound to all

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/