Hi Eugene

Thanks for the detailed answer.

*************

1) Now I can see and use the btl_sm_num_fifos parameter:

I had already committed "btl = ^sm" to the openmpi-mca-params.conf
file.  This apparently hides the sm BTL parameters, including
btl_sm_num_fifos, from ompi_info.

After I switched to an empty openmpi-mca-params.conf (no options set),
ompi_info showed the btl_sm_num_fifos parameter.

ompi_info --all | grep btl_sm_num_fifos
MCA btl: parameter "btl_sm_num_fifos" (current value: "1", data source: default value)

A side comment:
This means that the system administrator can effectively
hide some Open MPI options from the users, depending on what
goes into the openmpi-mca-params.conf file, right?
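
For reference, this is all I had in the file (the system-wide copy
lives under the install prefix, something like
$prefix/etc/openmpi-mca-params.conf, if I am not mistaken):

# openmpi-mca-params.conf
# Excluding the sm BTL here also hides its parameters from ompi_info:
btl = ^sm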

*************

2) However, running with "sm" still breaks, unfortunately:

Boomer!
I get the same errors that I reported in my very
first email, if I increase the number of processes to 16,
to explore the hyperthreading range.

This is with "sm" enabled (i.e. not excluded in the MCA config file)
and with btl_sm_num_fifos set on the mpiexec command line.

The machine hangs, requires a hard reboot, etc., etc.,
as reported earlier.  Please see the output below.

So, I guess the conclusion is that I can use sm,
but I have to remain within the range of physical cores (8),
not oversubscribe, and not try to explore the HT range.
Should I expect it to work for np > number of physical cores as well?
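
(For what it's worth, this is how I count physical cores versus
hardware threads on this Linux box, assuming lscpu is available:)

$ lscpu | grep -E 'CPU\(s\)|Thread|Core|Socket'
$ grep -c ^processor /proc/cpuinfo   # logical CPUs, i.e. including HT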

I wonder if this would still work with np <= 8, but with heavier code.
(I have only used hello_c.c so far; see the sketch below.)
Not sure I'll be able to test this, though, since the user wants the machine back.
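
(For completeness, hello_c.c is essentially the textbook MPI hello
world; mine is along these lines:)

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, size;

    /* Initialize MPI, report this process's rank and the world size. */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello, world, I am %d of %d\n", rank, size);
    MPI_Finalize();

    return 0;
}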


$ mpiexec -mca btl_sm_num_fifos 4 -np 4 a.out
Hello, world, I am 0 of 4
Hello, world, I am 1 of 4
Hello, world, I am 2 of 4
Hello, world, I am 3 of 4

$ mpiexec -mca btl_sm_num_fifos 8 -np 8 a.out
Hello, world, I am 0 of 8
Hello, world, I am 1 of 8
Hello, world, I am 2 of 8
Hello, world, I am 3 of 8
Hello, world, I am 4 of 8
Hello, world, I am 5 of 8
Hello, world, I am 6 of 8
Hello, world, I am 7 of 8

$ mpiexec -mca btl_sm_num_fifos 16 -np 16 a.out
--------------------------------------------------------------------------
mpiexec noticed that process rank 8 with PID 3659 on node spinoza.ldeo.columbia.edu exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
$

Message from syslogd@spinoza at May  6 13:38:13 ...
 kernel:------------[ cut here ]------------

Message from syslogd@spinoza at May  6 13:38:13 ...
 kernel:invalid opcode: 0000 [#1] SMP

Message from syslogd@spinoza at May  6 13:38:13 ...
kernel:last sysfs file: /sys/devices/system/cpu/cpu15/topology/physical_package_id

Message from syslogd@spinoza at May  6 13:38:13 ...
 kernel:Stack:

Message from syslogd@spinoza at May  6 13:38:13 ...
 kernel:Call Trace:

Message from syslogd@spinoza at May  6 13:38:13 ...
kernel:Code: 48 89 45 a0 4c 89 ff e8 e0 dd 2b 00 41 8b b6 58 03 00 00 4c 89 e7 ff c6 e8 b5 bc ff ff 41 8b 96 5c 03 00 00 48 98 48 39 d0 73 04 <0f> 0b eb fe 48 29 d0 48 89 45 a8 66 41 ff 07 49 8b 94 24 00 01

*****************

Many thanks,
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


Eugene Loh wrote:
Gus Correa wrote:

Hi Eugene

Thank you for answering one of my original questions.

However, there seems to be a problem with the syntax.
Is it really "-mca btl btl_sm_num_fifos=some_number"?

No.  Try "--mca btl_sm_num_fifos 4".  Or,

% setenv OMPI_MCA_btl_sm_num_fifos 4
% ompi_info -a | grep btl_sm_num_fifos # check that things were set correctly
% mpirun -n 4 a.out

When I grep for any parameter starting with btl_sm I get nothing:

ompi_info --all | grep btl_sm
(No output)

I'm no guru, but I think the reason has something to do with dynamically loaded somethings. E.g.,

% /home/eugene/ompi/bin/ompi_info --all | grep btl_sm_num_fifos
(no output)
% setenv OPAL_PREFIX /home/eugene/ompi
% set path = ( $OPAL_PREFIX/bin $path )
% ompi_info --all | grep btl_sm_num_fifos
MCA btl: parameter "btl_sm_num_fifos" (current value: "1", data source: default value)
