[OMPI users] Change behavior of --output-filename

2019-11-12 Thread Max Sagebaum via users
Hello @ all,

Short question: How do I select the behavior of --output-filename?

Long question:
I am used to the option --output-filename file.out generating the files
file.out.0, file.out.1, etc.
The default logic has switched to file.out/1/rank.*/stdout

man mpirun gives me:
-output-filename, --output-filename 
  Redirect the stdout, stderr, and stddiag of all processes to a 
process-unique version of the specified filename. ...

mpirun --help output gives me:
-output-filename|--output-filename 
 Redirect output from application processes into
 filename/job/rank/std[out,err,diag]. A relative
 path value will be converted to an absolute path

So man reports the wrong logic, while --help gives the correct one. A look at
the code shows that the man page text comes from
./orte/orted/orted_main.c:
{ "orte_output_filename", '\0', "output-filename", "output-filename", 1,
  NULL, OPAL_CMD_LINE_TYPE_STRING,
  "Redirect output from application processes into filename.rank" },

The mpirun output comes from
./orte/mca/schizo/ompi/schizo_ompi.c
{ "orte_output_filename", '\0', "output-filename", "output-filename", 1,
  &orte_cmd_options.output_filename, OPAL_CMD_LINE_TYPE_STRING,
  "Redirect output from application processes into filename/job/rank/std[out,err,diag]. A relative path value will be converted to an absolute path",
  OPAL_CMD_LINE_OTYPE_OUTPUT },

These two behaviors can probably be selected with some MCA parameter, but I
could not figure out which one.

So the final question: how can I configure which default backend is used for
the command-line parameter --output-filename?
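For reference, the two layouts can be sketched as follows (a hypothetical
2-rank run; ./a.out and the exact file names are illustrative assumptions,
not output from my system):

```shell
# Old behavior (Open MPI <= 2.x): one flat file per rank
#   mpirun -np 2 --output-filename file.out ./a.out
#   -> file.out.0, file.out.1

# New behavior (Open MPI >= 3.x): a directory tree per job/rank
mpirun -np 2 --output-filename file.out ./a.out
ls file.out/1/rank.0
#   -> stdout  stderr  (and stddiag, if any)
```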

Best regards

Max Sagebaum

uname -a: 5.3.8-200.fc30.x86_6
ompi_info:
Package: Open MPI mockbu...@buildvm-08.phx2.fedoraproject.org Distribution
Open MPI: 3.1.4
  Open MPI repo revision: v3.1.4
   Open MPI release date: Apr 15, 2019
Open RTE: 3.1.4
  Open RTE repo revision: v3.1.4
   Open RTE release date: Apr 15, 2019
OPAL: 3.1.4
  OPAL repo revision: v3.1.4
   OPAL release date: Apr 15, 2019
 MPI API: 3.1.0
Ident string: 3.1.4
  Prefix: /usr/lib64/openmpi
 Configured architecture: x86_64-unknown-linux-gnu
  Configure host: buildvm-08.phx2.fedoraproject.org
   Configured by: mockbuild
   Configured on: Thu Oct 17 03:20:12 UTC 2019
  Configure host: buildvm-08.phx2.fedoraproject.org
  Configure command line: '--prefix=/usr/lib64/openmpi'
'--mandir=/usr/share/man/openmpi-x86_64'
'--includedir=/usr/include/openmpi-x86_64' '--sysconfdir=/etc/openmpi-x86_64'
'--disable-silent-rules' '--enable-builtin-atomics'
'--enable-mpi-thread-multiple' '--enable-mpi-cxx' '--enable-mpi-java'
'--with-sge' '--with-valgrind' '--enable-memchecker' '--with-hwloc=/usr'
'--with-libevent=external' '--with-pmix=external' 'CC=gcc' 'CXX=g++'
'LDFLAGS=-Wl,-z,relro -Wl,--as-needed -Wl,-z,now
-specs=/usr/lib/rpm/redhat/redhat-hardened-ld'
'CFLAGS= -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2
-Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong
-grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1
-specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic
-fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection'
'CXXFLAGS= -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2
-Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong
-grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1
-specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic
-fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection'
'FC=gfortran' 'FCFLAGS= -O2 -g -pipe -Wall -Werror=format-security
-Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions
-fstack-protector-strong -grecord-gcc-switches
-specs=/usr/lib/rpm/redhat/redhat-hardened-cc1
-specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic
-fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection'
Built by: mockbuild
...
If required I can post the full output.




Re: [OMPI users] Change behavior of --output-filename

2019-11-12 Thread Ralph Castain via users
The man page is simply out of date - see 
https://github.com/open-mpi/ompi/issues/7095 for further thinking


On Nov 12, 2019, at 1:26 AM, Max Sagebaum via users <users@lists.open-mpi.org> wrote:

> [quoted text of the original message trimmed]





[OMPI users] optimized cuda rdma mca-params.conf?

2019-11-12 Thread Douglas Duckworth via users
Good Morning

We are running OpenMPI 4.0.2 on several CentOS 7.8 nodes with V-100s, driver 
version 418.87.01, persistent mode enabled, and nvidia peer memory module 
present.

We are seeing a low penalty when running jobs across multiple GPU nodes. I am
relatively new to tuning Open MPI, so I wanted to ask the community whether the
configuration below could be optimized further.

We have an FDR interconnect, with plans to upgrade, and ConnectX-3 cards
running the latest OFED and firmware.

openmpi-mca-params.conf

btl=openib,self
btl_openib_if_include=mlx3_0,mlx4_0,mlx5_0
btl_openib_warn_nonexistent_if=0
btl_tcp_if_include=ib0
btl_openib_allow_ib=true
orte_base_help_aggregate=0
btl_openib_want_cuda_gdr=true
oob_tcp_if_include=ib0
#mpi_common_cuda_verbose=100
#opal_cuda_verbose=10
#btl_openib_verbose=true
orte_keep_fqdn_hostname=true
oob_tcp_if_include=ib0
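
One way to check whether GPUDirect RDMA is actually in use is a CUDA-aware
bandwidth test between two nodes. This sketch assumes the OSU micro-benchmarks
built with CUDA support; the host names and binary path are placeholders:

```shell
# Device-to-device bandwidth between two GPU nodes.
# Large-message bandwidth near link speed suggests GDR is active;
# re-running with btl_openib_want_cuda_gdr disabled shows the difference.
mpirun -np 2 -H node1,node2 \
    --mca btl_openib_want_cuda_gdr 1 \
    ./osu_bw -d cuda D D
```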


Thank you!
Doug


--
Thanks,

Douglas Duckworth, MSc, LFCS
HPC System Administrator
Scientific Computing Unit
Weill Cornell Medicine
E: d...@med.cornell.edu
O: 212-746-6305
F: 212-746-8690


Re: [OMPI users] Change behavior of --output-filename

2019-11-12 Thread Jeff Squyres (jsquyres) via users
On Nov 12, 2019, at 9:17 AM, Ralph Castain via users
<users@lists.open-mpi.org> wrote:

The man page is simply out of date - see 
https://github.com/open-mpi/ompi/issues/7095 for further thinking

And https://github.com/open-mpi/ompi/issues/7133 for what might happen going 
forward.

--
Jeff Squyres
jsquy...@cisco.com



[OMPI users] qelr_alloc_context: Failed to allocate context for device.

2019-11-12 Thread Matteo Guglielmi via users
I'm trying to get Open MPI over RoCE working with this setup:

card: https://www.gigabyte.com/Accessory/CLNOQ42-rev-10#ov


OS: CentOS 7.7


modinfo qede

filename:   
/lib/modules/3.10.0-1062.4.1.el7.x86_64/kernel/drivers/net/ethernet/qlogic/qede/qede.ko.xz
version:8.37.0.20
license:GPL
description:QLogic FastLinQ 4 Ethernet Driver
retpoline:  Y
rhelversion:7.7
srcversion: A6AFD0788918644F2EFFF31
alias:  pci:v1077d8090sv*sd*bc*sc*i*
alias:  pci:v1077d8070sv*sd*bc*sc*i*
alias:  pci:v1077d1664sv*sd*bc*sc*i*
alias:  pci:v1077d1656sv*sd*bc*sc*i*
alias:  pci:v1077d1654sv*sd*bc*sc*i*
alias:  pci:v1077d1644sv*sd*bc*sc*i*
alias:  pci:v1077d1636sv*sd*bc*sc*i*
alias:  pci:v1077d1666sv*sd*bc*sc*i*
alias:  pci:v1077d1634sv*sd*bc*sc*i*
depends:ptp,qed
intree: Y
vermagic:   3.10.0-1062.4.1.el7.x86_64 SMP mod_unload modversions
signer: CentOS Linux kernel signing key
sig_key:60:48:F2:5B:83:1E:C4:47:02:00:E2:36:02:C5:CA:83:1D:18:CF:8F
sig_hashalgo:   sha256
parm:   debug: Default debug msglevel (uint)

modinfo qedr

filename:   
/lib/modules/3.10.0-1062.4.1.el7.x86_64/kernel/drivers/infiniband/hw/qedr/qedr.ko.xz
license:Dual BSD/GPL
author: QLogic Corporation
description:QLogic 40G/100G ROCE Driver
retpoline:  Y
rhelversion:7.7
srcversion: B5B65473217AA2B1F2F619B
depends:qede,qed,ib_core
intree: Y
vermagic:   3.10.0-1062.4.1.el7.x86_64 SMP mod_unload modversions
signer: CentOS Linux kernel signing key
sig_key:60:48:F2:5B:83:1E:C4:47:02:00:E2:36:02:C5:CA:83:1D:18:CF:8F
sig_hashalgo:   sha256

ibv_devinfo

hca_id: qedr0
transport: InfiniBand (0)
fw_ver: 8.37.7.0
node_guid: b62e:99ff:fea7:8439
sys_image_guid: b62e:99ff:fea7:8439
vendor_id: 0x1077
vendor_part_id: 32880
hw_ver: 0x0
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet

hca_id: qedr1
transport: InfiniBand (0)
fw_ver: 8.37.7.0
node_guid: b62e:99ff:fea7:843a
sys_image_guid: b62e:99ff:fea7:843a
vendor_id: 0x1077
vendor_part_id: 32880
hw_ver: 0x0
phys_port_cnt: 1
port: 1
state: PORT_DOWN (1)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet




RDMA actually works at the system level, which means that I can run
rdma ping-pong tests etc.



But when I try to run Open MPI with these options:



mpirun --mca btl openib,self,vader --mca btl_openib_cpc_include rdmacm ...





I get the following error messages:




--
WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them).  This is most certainly not what you wanted.  Check your
cables, subnet manager configuration, etc.  The openib BTL will be
ignored for this job.

  Local host: node001
--
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
--
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:   node002
  Local device: qedr0
  Local port:   1
  CPCs attempted:   rdmacm
--
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.

...


I've tried several things such as:


1) upgrade the 3.10 kernel's qed* drivers to the latest stable version 8.42.9

2) upgrade the CentOS kernel from 3.10 to 5.3 via elrepo

3) install the latest OFED-4.17-1.tgz stack


but the error messages never go away and always remain the same.



Any advice is highly appreciated.
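
One generic next step (an editorial suggestion, not something reported in the
thread) is to raise the verbosity of the BTL framework to see why rdmacm
rejects the port; btl_base_verbose is a standard Open MPI MCA parameter, and
./a.out stands in for the actual application:

```shell
# Re-run with verbose BTL output to see why the openib/rdmacm setup fails.
mpirun --mca btl openib,self,vader \
    --mca btl_openib_cpc_include rdmacm \
    --mca btl_base_verbose 100 \
    ./a.out
```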