On May 6, 2010, at 2:01 PM, Gus Correa wrote:

> 1) Now I can see and use the btl_sm_num_fifos component:
>
> I had already committed "btl = ^sm" to the openmpi-mca-params.conf
> file. This apparently hides btl_sm_num_fifos from ompi_info.
>
> After I switched to no options in openmpi-mca-params.conf,
> ompi_info then showed the btl_sm_num_fifos component.
>
> ompi_info --all | grep btl_sm_num_fifos
> MCA btl: parameter "btl_sm_num_fifos" (current value: "1",
> data source: default value)
>
> A side comment:
> This means that the system administrator can
> hide some Open MPI options from the users, depending on what
> he puts in the openmpi-mca-params.conf file, right?
Correct. BUT: a user can always override the "btl" MCA param and see them again. For example, you could also have done this:

    echo "btl =" > ~/.openmpi/mca-params.conf
    ompi_info --all | grep btl_sm_num_fifos
    # ...will show the sm params...

> 2) However, running with "sm" still breaks, unfortunately:
>
> Boomer!

Doh!

> I get the same errors that I reported in my very
> first email, if I increase the number of processes to 16,
> to explore the hyperthreading range.
>
> This is using "sm" (i.e. not excluded in the mca config file),
> and btl_sm_num_fifos (mpiexec command line)
>
> The machine hangs, requires a hard reboot, etc, etc,
> as reported earlier. See below, please.

I saw that only some probably-unrelated dmesg messages were emitted. Was there anything else revealing on the console and/or in the /var/log/* files? Hard reboots absolutely should not be caused by Open MPI.

> So, I guess the conclusion is that I can use sm,
> but I have to remain within the range of physical cores (8),
> not oversubscribe, not try to explore the HT range.
> Should I expect it to work also for np > number of physical cores?

Your prior explanations of when HT is useful seemed pretty reasonable to me. Meaning: Nehalem HT will help only some kinds of codes; dense computation codes with few conditional branches may not benefit much from HT.

But Open MPI applications should always run *correctly*, with or without HT -- even if you're oversubscribing. Performance may suffer (sometimes dramatically) if you oversubscribe physical cores with dense computational code, but it should always run *correctly*.

> I wonder if this would still work with np<=8, but with heavier code.
> (I only used hello_c.c so far.)

If hello_c is crashing your computer -- even if you're running np>8 or np>16 -- something is wrong outside of Open MPI. I routinely run hello_c with np=100 on my machines (a minimal sketch of that kind of test program is appended below, after my signature).

> $ mpiexec -mca btl_sm_num_fifos 16 -np 16 a.out
> --------------------------------------------------------------------------
> mpiexec noticed that process rank 8 with PID 3659 on node
> spinoza.ldeo.columbia.edu exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> $
>
> Message from syslogd@spinoza at May 6 13:38:13 ...
> kernel:------------[ cut here ]------------
>
> Message from syslogd@spinoza at May 6 13:38:13 ...
> kernel:invalid opcode: 0000 [#1] SMP
>
> Message from syslogd@spinoza at May 6 13:38:13 ...
> kernel:last sysfs file:
> /sys/devices/system/cpu/cpu15/topology/physical_package_id
>
> Message from syslogd@spinoza at May 6 13:38:13 ...
> kernel:Stack:
>
> Message from syslogd@spinoza at May 6 13:38:13 ...
> kernel:Call Trace:
>
> Message from syslogd@spinoza at May 6 13:38:13 ...
> kernel:Code: 48 89 45 a0 4c 89 ff e8 e0 dd 2b 00 41 8b b6 58 03 00 00 4c 89
> e7 ff c6 e8 b5 bc ff ff 41 8b 96 5c 03 00 00 48 98 48 39 d0 73 04 <0f> 0b eb
> fe 48 29 d0 48 89 45 a8 66 41 ff 07 49 8b 94 24 00 01

I unfortunately don't know what these messages mean...

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
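
For reference, here is a minimal MPI "hello world" in the spirit of the hello_c test program discussed above -- a sketch only; the hello_c.c shipped in Open MPI's examples/ directory may differ slightly:

    /* hello_c.c -- minimal MPI test program (sketch) */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        /* Start up the MPI runtime */
        MPI_Init(&argc, &argv);

        /* Find out this process's rank and the total number of processes */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        printf("Hello, world, I am %d of %d\n", rank, size);

        /* Shut down the MPI runtime */
        MPI_Finalize();
        return 0;
    }

Compile and run it with, for example:

    mpicc hello_c.c -o hello_c
    mpiexec -np 16 ./hello_c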