Hi Joseph,

Thanks for reporting!

Regarding your second point about the missing output files: there seems
to be a problem with the current working directory detection on the
remote nodes. While on the first node - on which mpirun is executed -
the output folder is created in the current working directory, the
processes on the other nodes seem to write their files into
$HOME/output.log/.
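
You can verify this directly on one of the remote nodes (<remote-node>
below is just a placeholder for a node from your allocation):

ssh <remote-node> ls $HOME/output.log

If my guess is right, the missing rank.* files should show up there.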

As a workaround, you can use an absolute directory path:
--output-filename $PWD/output.log
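
For example, matching your test above, this should place all rank.*
files under ~/test/output.log/ regardless of the node they run on:

~/test $ mpirun -n 2 -N 1 --output-filename $PWD/output.log ls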

Best
Christoph



On Friday, 9 February 2018 15:52:31 CET Joseph Schuchart wrote:
> All,
> 
> I am trying to debug my MPI application using good ol' printf and I am
> running into an issue with Open MPI's output redirection (using
> --output-filename).
> 
> The system I'm running on is an IB cluster with the home directory
> mounted through NFS.
> 
> 1) Sometimes I get the following error message and the application hangs:
> 
> ```
> $ mpirun -n 2 -N 1 --output-filename output.log ls
> [n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file /path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/iof/base/iof_base_setup.c at line 314
> [n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file /path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/iof/orted/iof_orted.c at line 184
> [n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file /path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/iof/base/iof_base_setup.c at line 237
> [n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file /path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/odls/base/odls_base_default_fns.c at line 1147
> ```
> 
> So far I have only seen this error when running straight out of my home
> directory, not when running from a subdirectory.
> 
> When this error does not appear, all log files are written correctly.
> 
> 2) If I call mpirun from within a subdirectory, I only see output
> files from processes running on the same node as rank 0. I have not seen
> the above error messages in this case.
> 
> Example:
> 
> ```
> # two procs, one per node
> ~/test $ mpirun -n 2 -N 1 --output-filename output.log ls
> output.log
> output.log
> ~/test $ ls output.log/*
> rank.0
> # two procs, single node
> ~/test $ mpirun -n 2 -N 2 --output-filename output.log ls
> output.log
> output.log
> ~/test $ ls output.log/*
> rank.0  rank.1
> ```
> 
> Using Open MPI 2.1.1, I can observe a similar effect:
> ```
> # two procs, one per node
> ~/test $ mpirun --output-filename output.log -n 2 -N 1 ls
> ~/test $ ls
> output.log.1.0
> # two procs, single node
> ~/test $ mpirun --output-filename output.log -n 2 -N 2 ls
> ~/test $ ls
> output.log.1.0  output.log.1.1
> ```
> 
> Any idea why this happens and/or how to debug this?
> 
> In case this helps, the NFS mount flags are:
> (rw,nosuid,nodev,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=<addr>,mountvers=3,mountport=<port>,mountproto=udp,local_lock=none,addr=<addr>)
> 
> I also tested the above commands with MPICH, which gives me the expected
> output for all processes on all nodes.
> 
> Any help would be much appreciated!
> 
> Cheers,
> Joseph
-- 
Christoph Niethammer
High Performance Computing Center Stuttgart (HLRS)
Nobelstrasse 19
70569 Stuttgart

Tel: ++49(0)711-685-87203
email: nietham...@hlrs.de
http://www.hlrs.de/people/niethammer

