On May 6, 2010, at 2:01 PM, Gus Correa wrote:

> 1) Now I can see and use the btl_sm_num_fifos component:
>
> I had already committed "btl = ^sm" to the openmpi-mca-params.conf
> file. This apparently hides btl_sm_num_fifos from ompi_info.
>
> After I switched to no options in openmpi-mca-params.conf,
> ompi_info then showed the btl_sm_num_fifos component.
>
> ompi_info --all | grep btl_sm_num_fifos
> MCA btl: parameter "btl_sm_num_fifos" (current value: "1",
> data source: default value)
>
> A side comment:
> This means that the system administrator can
> hide some Open MPI options from the users, depending on what
> he puts in the openmpi-mca-params.conf file, right?
Correct. BUT: a user can always override the "btl" MCA param and see them again. For example, you could also have done this:

    echo "btl =" > ~/.openmpi/mca-params.conf
    ompi_info --all | grep btl_sm_num_fifos
    # ...will show the sm params...

> 2) However, running with "sm" still breaks, unfortunately:
>
> Boomer!

Doh!

> I get the same errors that I reported in my very
> first email, if I increase the number of processes to 16,
> to explore the hyperthreading range.
>
> This is using "sm" (i.e. not excluded in the mca config file),
> and btl_sm_num_fifos (mpiexec command line)
>
> The machine hangs, requires a hard reboot, etc, etc,
> as reported earlier. See below, please.

I saw that only some probably-unrelated dmesg messages were emitted. Was there anything else revealing on the console and/or in the /var/log/* files? Hard reboots absolutely should not be caused by Open MPI.

> So, I guess the conclusion is that I can use sm,
> but I have to remain within the range of physical cores (8),
> not oversubscribe, not try to explore the HT range.
> Should I expect it to work also for np > number of physical cores?

Your prior explanations of when HT is useful seemed pretty reasonable to me. Meaning: Nehalem HT will help only some kinds of codes; dense computation codes with few conditional branches may not benefit much from HT.

But Open MPI applications should always run *correctly*, with or without HT -- even if you're oversubscribing. Performance may suffer (sometimes dramatically) if you oversubscribe physical cores with dense computational code, but it should always run *correctly*.

> I wonder if this would still work with np<=8, but with heavier code.
> (I only used hello_c.c so far.)

If hello_c is crashing your computer -- even if you're running np>8 or np>16 -- something is wrong outside of Open MPI. I routinely run hello_c with np=100 on my machines (a minimal sketch of that kind of test program is appended below, after my signature).

> $ mpiexec -mca btl_sm_num_fifos 16 -np 16 a.out
> --------------------------------------------------------------------------
> mpiexec noticed that process rank 8 with PID 3659 on node
> spinoza.ldeo.columbia.edu exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> $
>
> Message from syslogd@spinoza at May 6 13:38:13 ...
> kernel:------------[ cut here ]------------
>
> Message from syslogd@spinoza at May 6 13:38:13 ...
> kernel:invalid opcode: 0000 [#1] SMP
>
> Message from syslogd@spinoza at May 6 13:38:13 ...
> kernel:last sysfs file:
> /sys/devices/system/cpu/cpu15/topology/physical_package_id
>
> Message from syslogd@spinoza at May 6 13:38:13 ...
> kernel:Stack:
>
> Message from syslogd@spinoza at May 6 13:38:13 ...
> kernel:Call Trace:
>
> Message from syslogd@spinoza at May 6 13:38:13 ...
> kernel:Code: 48 89 45 a0 4c 89 ff e8 e0 dd 2b 00 41 8b b6 58 03 00 00 4c 89
> e7 ff c6 e8 b5 bc ff ff 41 8b 96 5c 03 00 00 48 98 48 39 d0 73 04 <0f> 0b eb
> fe 48 29 d0 48 89 45 a8 66 41 ff 07 49 8b 94 24 00 01

I unfortunately don't know what these messages mean...

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
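
For reference, here is a minimal MPI "hello world" in the spirit of the hello_c test program discussed above -- a sketch only; the hello_c.c shipped in Open MPI's examples/ directory may differ slightly:

    /* hello_c.c -- minimal MPI test program (sketch) */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        /* Start up the MPI runtime */
        MPI_Init(&argc, &argv);

        /* Find out this process's rank and the total number of processes */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        printf("Hello, world, I am %d of %d\n", rank, size);

        /* Shut down the MPI runtime */
        MPI_Finalize();
        return 0;
    }

Compile and run it with, for example:

    mpicc hello_c.c -o hello_c
    mpiexec -np 16 ./hello_c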