Hi Joseph,

Thanks for reporting!
Regarding your second point about the missing output files: there seems to be a problem with the current working directory detection on the remote nodes. While on the first node - the one on which mpirun is executed - the output folder is created in the current working directory, the processes on the other nodes seem to write their files into $HOME/output.log/.

As a workaround you can use an absolute directory path (see the sketch at the end of this message):

  --output-filename $PWD/output.log

Best
Christoph

On Friday, 9 February 2018 15:52:31 CET Joseph Schuchart wrote:
> All,
>
> I am trying to debug my MPI application using good ol' printf and I am
> running into an issue with Open MPI's output redirection (using
> --output-filename).
>
> The system I'm running on is an IB cluster with the home directory
> mounted through NFS.
>
> 1) Sometimes I get the following error message and the application hangs:
>
> ```
> $ mpirun -n 2 -N 1 --output-filename output.log ls
> [n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file /path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/iof/base/iof_base_setup.c at line 314
> [n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file /path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/iof/orted/iof_orted.c at line 184
> [n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file /path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/iof/base/iof_base_setup.c at line 237
> [n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file /path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/odls/base/odls_base_default_fns.c at line 1147
> ```
>
> So far I have only seen this error when running straight out of my home
> directory, not when running from a subdirectory.
>
> If this error does not appear, all log files are written correctly.
>
> 2) If I call mpirun from within a subdirectory I am only seeing output
> files from processes running on the same node as rank 0. I have not seen
> the above error messages in this case.
>
> Example:
>
> ```
> # two procs, one per node
> ~/test $ mpirun -n 2 -N 1 --output-filename output.log ls
> output.log
> output.log
> ~/test $ ls output.log/*
> rank.0
> # two procs, single node
> ~/test $ mpirun -n 2 -N 2 --output-filename output.log ls
> output.log
> output.log
> ~/test $ ls output.log/*
> rank.0  rank.1
> ```
>
> Using Open MPI 2.1.1, I can observe a similar effect:
> ```
> # two procs, one per node
> ~/test $ mpirun --output-filename output.log -n 2 -N 1 ls
> ~/test $ ls
> output.log.1.0
> # two procs, single node
> ~/test $ mpirun --output-filename output.log -n 2 -N 2 ls
> ~/test $ ls
> output.log.1.0  output.log.1.1
> ```
>
> Any idea why this happens and/or how to debug this?
>
> In case this helps, the NFS mount flags are:
> (rw,nosuid,nodev,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=<addr>,mountvers=3,mountport=<port>,mountproto=udp,local_lock=none,addr=<addr>)
>
> I also tested the above commands with MPICH, which gives me the expected
> output for all processes on all nodes.
>
> Any help would be much appreciated!
>
> Cheers,
> Joseph

--
Christoph Niethammer
High Performance Computing Center Stuttgart (HLRS)
Nobelstrasse 19
70569 Stuttgart

Tel: ++49(0)711-685-87203
email: nietham...@hlrs.de
http://www.hlrs.de/people/niethammer
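For reference, a minimal sketch of the suggested workaround applied to the two-node case from the report; the prompt and the final listing are illustrative and assume the absolute path makes the remote ranks write into the same NFS-mounted directory:

```
# two procs, one per node, now with an absolute output path so every node
# resolves the same directory instead of falling back to $HOME
# (stdout echoed by the ranks on the terminal is omitted here)
~/test $ mpirun -n 2 -N 1 --output-filename $PWD/output.log ls
# expected: output files from both ranks, including the one on the remote node
~/test $ ls output.log/*
rank.0  rank.1
```

The key point is that $PWD is expanded by the local shell before mpirun starts, so every process receives the same absolute path regardless of how its working directory is detected on the remote node.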