ping

2018-06-01 22:29 GMT+03:00 Dmitry N. Mikushin <maemar...@gmail.com>:

> Dear all,
>
> I have run into a weird issue that I have never encountered before. While
> trying to run the simplest "Hello world" program, I get:
>
> $ cat hello.c
> #include <mpi.h>
>
> int main(int argc, char* argv[])
> {
>     MPI_Init(&argc, &argv);
>
>     MPI_Finalize();
>
>     return 0;
> }
> $ mpicc hello.c -o hello
> $ mpirun -np 1 ./hello
> --------------------------------------------------------------------------
> WARNING: The accept(3) system call failed on a TCP socket.  While this
> should generally never happen on a well-configured HPC system, the
> most common causes when it does occur are:
>
>   * The process ran out of file descriptors
>   * The operating system ran out of file descriptors
>   * The operating system ran out of memory
>
> Your Open MPI job will likely hang until the failure resason is fixed
> (e.g., more file descriptors and/or memory becomes available), and may
> eventually timeout / abort.
>
>   Local host:     M17xR4
>   Errno:          9 (Bad file descriptor)
>   Probable cause: Unknown cause; job will try to continue
> --------------------------------------------------------------------------
>
> Further tracing shows the following:
>
> [pid 13498] accept(0, 0x7f2ec8000960, 0x7f2ee6740e7c) = -1 EBADF (Bad file descriptor)
> [pid 13498] shutdown(0, SHUT_RDWR)      = -1 EBADF (Bad file descriptor)
> [pid 13498] close(0)                    = -1 EBADF (Bad file descriptor)
> [pid 13498] open("/usr/share/openmpi/help-oob-tcp.txt", O_RDONLY) = 0
> [pid 13498] ioctl(0, TCGETS, 0x7f2ee6740be0) = -1 ENOTTY (Inappropriate ioctl for device)
> [pid 13499] <... nanosleep resumed> NULL) = 0
> [pid 13498] fstat(0,  <unfinished ...>
> [pid 13499] nanosleep({0, 100000},  <unfinished ...>
> [pid 13498] <... fstat resumed> {st_mode=S_IFREG|0644, st_size=3025, ...}) = 0
> [pid 13498] read(0, "# -*- text -*-\n#\n# Copyright (c)"..., 8192) = 3025
> [pid 13498] read(0, "", 4096)           = 0
> [pid 13498] read(0, "", 8192)           = 0
> [pid 13498] ioctl(0, TCGETS, 0x7f2ee6740b40) = -1 ENOTTY (Inappropriate ioctl for device)
> [pid 13498] close(0)                    = 0
> [pid 13499] <... nanosleep resumed> NULL) = 0
> [pid 13499] nanosleep({0, 100000},  <unfinished ...>
> [pid 13498] write(1, "--------------------------------"..., 768--------------------------------------------------------------------------
> WARNING: The accept(3) system call failed on a TCP socket.  While this
> should generally never happen on a well-configured HPC system, the
> most common causes when it does occur are:
>
>   * The process ran out of file descriptors
>   * The operating system ran out of file descriptors
>   * The operating system ran out of memory
>
> Your Open MPI job will likely hang until the failure resason is fixed
> (e.g., more file descriptors and/or memory becomes available), and may
> eventually timeout / abort.
>
>   Local host:     M17xR4
>   Errno:          9 (Bad file descriptor)
>   Probable cause: Unknown cause; job will try to continue
> --------------------------------------------------------------------------
> ) = 768
>
> In fact, "Bad file descriptor" first occurs a bit earlier, here:
>
> [pid 13499] open("/proc/self/fd", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 20
> [pid 13499] fstat(20, {st_mode=S_IFDIR|0500, st_size=0, ...}) = 0
> [pid 13499] getdents(20, /* 25 entries */, 32768) = 600
> [pid 13499] close(3)                    = 0
> [pid 13499] close(4)                    = 0
> [pid 13499] close(5)                    = 0
> [pid 13499] close(6)                    = 0
> [pid 13499] close(7)                    = 0
> [pid 13499] close(8)                    = 0
> [pid 13499] close(9)                    = 0
> [pid 13499] close(10)                   = 0
> [pid 13499] close(11)                   = 0
> [pid 13499] close(12)                   = 0
> [pid 13499] close(13)                   = 0
> [pid 13499] close(14)                   = 0
> [pid 13499] close(15)                   = 0
> [pid 13499] close(16)                   = 0
> [pid 13499] close(17)                   = 0
> [pid 13499] close(18)                   = 0
> [pid 13499] close(19)                   = 0
> [pid 13499] close(20)                   = 0
> [pid 13499] getdents(20, 0x1cc04a0, 32768) = -1 EBADF (Bad file descriptor)
> [pid 13499] close(20)                   = -1 EBADF (Bad file descriptor)
>
> Any idea how to fix this? The system is Ubuntu 16.04:
>
> Linux M17xR4 4.10.0-42-generic #46~16.04.1-Ubuntu SMP Mon Dec 4 15:57:59 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
>
> Kind regards,
> - Dmitry.
>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users