Dear all,

I have run into a strange issue that I have never encountered before. When I
try to run the simplest "Hello world" program, I get:

$ cat hello.c
#include <mpi.h>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    MPI_Finalize();

    return 0;
}
$ mpicc hello.c -o hello
$ mpirun -np 1 ./hello
--------------------------------------------------------------------------
WARNING: The accept(3) system call failed on a TCP socket.  While this
should generally never happen on a well-configured HPC system, the
most common causes when it does occur are:

  * The process ran out of file descriptors
  * The operating system ran out of file descriptors
  * The operating system ran out of memory

Your Open MPI job will likely hang until the failure resason is fixed
(e.g., more file descriptors and/or memory becomes available), and may
eventually timeout / abort.

  Local host:     M17xR4
  Errno:          9 (Bad file descriptor)
  Probable cause: Unknown cause; job will try to continue
--------------------------------------------------------------------------

Further tracing shows the following:

[pid 13498] accept(0, 0x7f2ec8000960, 0x7f2ee6740e7c) = -1 EBADF (Bad file descriptor)
[pid 13498] shutdown(0, SHUT_RDWR)      = -1 EBADF (Bad file descriptor)
[pid 13498] close(0)                    = -1 EBADF (Bad file descriptor)
[pid 13498] open("/usr/share/openmpi/help-oob-tcp.txt", O_RDONLY) = 0
[pid 13498] ioctl(0, TCGETS, 0x7f2ee6740be0) = -1 ENOTTY (Inappropriate ioctl for device)
[pid 13499] <... nanosleep resumed> NULL) = 0
[pid 13498] fstat(0,  <unfinished ...>
[pid 13499] nanosleep({0, 100000},  <unfinished ...>
[pid 13498] <... fstat resumed> {st_mode=S_IFREG|0644, st_size=3025, ...}) = 0
[pid 13498] read(0, "# -*- text -*-\n#\n# Copyright (c)"..., 8192) = 3025
[pid 13498] read(0, "", 4096)           = 0
[pid 13498] read(0, "", 8192)           = 0
[pid 13498] ioctl(0, TCGETS, 0x7f2ee6740b40) = -1 ENOTTY (Inappropriate ioctl for device)
[pid 13498] close(0)                    = 0
[pid 13499] <... nanosleep resumed> NULL) = 0
[pid 13499] nanosleep({0, 100000},  <unfinished ...>
[pid 13498] write(1, "--------------------------------"...,
768--------------------------------------------------------------------------
WARNING: The accept(3) system call failed on a TCP socket.  While this
should generally never happen on a well-configured HPC system, the
most common causes when it does occur are:

  * The process ran out of file descriptors
  * The operating system ran out of file descriptors
  * The operating system ran out of memory

Your Open MPI job will likely hang until the failure resason is fixed
(e.g., more file descriptors and/or memory becomes available), and may
eventually timeout / abort.

  Local host:     M17xR4
  Errno:          9 (Bad file descriptor)
  Probable cause: Unknown cause; job will try to continue
--------------------------------------------------------------------------
) = 768

In fact, "Bad file descriptor" first occurs a bit earlier, here:

[pid 13499] open("/proc/self/fd", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 20
[pid 13499] fstat(20, {st_mode=S_IFDIR|0500, st_size=0, ...}) = 0
[pid 13499] getdents(20, /* 25 entries */, 32768) = 600
[pid 13499] close(3)                    = 0
[pid 13499] close(4)                    = 0
[pid 13499] close(5)                    = 0
[pid 13499] close(6)                    = 0
[pid 13499] close(7)                    = 0
[pid 13499] close(8)                    = 0
[pid 13499] close(9)                    = 0
[pid 13499] close(10)                   = 0
[pid 13499] close(11)                   = 0
[pid 13499] close(12)                   = 0
[pid 13499] close(13)                   = 0
[pid 13499] close(14)                   = 0
[pid 13499] close(15)                   = 0
[pid 13499] close(16)                   = 0
[pid 13499] close(17)                   = 0
[pid 13499] close(18)                   = 0
[pid 13499] close(19)                   = 0
[pid 13499] close(20)                   = 0
[pid 13499] getdents(20, 0x1cc04a0, 32768) = -1 EBADF (Bad file descriptor)
[pid 13499] close(20)                   = -1 EBADF (Bad file descriptor)
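
The sequence above looks like a close-everything loop that walks
/proc/self/fd and also closes the descriptor it is reading the directory
through (fd 20), so the next getdents() and the final close() both fail with
EBADF. Below is a minimal sketch of that pattern with the directory
descriptor skipped. It assumes Linux with /proc mounted, and
close_fds_from() is a hypothetical helper written for illustration, not
Open MPI's actual code:

```c
#define _POSIX_C_SOURCE 200809L  /* for dirfd() under strict -std= modes */
#include <dirent.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper (not Open MPI code): close every descriptor >= lowfd
 * that is listed in /proc/self/fd, except the descriptor that backs the
 * directory stream itself. Returns the number of descriptors closed, or -1
 * if /proc/self/fd cannot be opened. */
int close_fds_from(int lowfd)
{
    DIR *dir = opendir("/proc/self/fd");
    if (dir == NULL)
        return -1;

    int self = dirfd(dir);  /* corresponds to fd 20 in the trace */
    int closed = 0;
    struct dirent *entry;

    while ((entry = readdir(dir)) != NULL) {
        char *end;
        long fd = strtol(entry->d_name, &end, 10);

        if (end == entry->d_name || *end != '\0')
            continue;       /* not a number: "." or ".." */
        if (fd < lowfd || fd == self)
            continue;       /* keep low fds and the directory fd */
        if (close((int)fd) == 0)
            closed++;
    }

    closedir(dir);          /* releases the directory fd exactly once */
    return closed;
}
```

Calling close_fds_from(3) in a forked child before exec would leave only
stdin, stdout and stderr open. Whether this loop is related to the
accept(0, ...) failure is not clear from the trace alone.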

Any idea how to fix this? The system is Ubuntu 16.04:

Linux M17xR4 4.10.0-42-generic #46~16.04.1-Ubuntu SMP Mon Dec 4 15:57:59 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Kind regards,
- Dmitry.