New submission from Nir Soffer <nir...@gmail.com>: Using io.FileIO can hang all threads when accessing an inaccessible NFS server.
To reproduce this, you need to open the file like this: fd = os.open(filename, ...) fio = io.FileIO(fd, "r+", closefd=True) Inside fileio_init, there is a checkfd call, calling fstat without releasing the GIL. This will hang all threads. The expected behavior is blocking only the thread blocked on the system call, so a system stay responsive and can serve other tasks. Here is the log showing this issue, created with the attached reproducer script (fileio_nfs_test.py). # python fileio_nfs_test.py mnt/fileio.out dumbo.tlv.redhat.com 2017-11-30 18:41:49,159 - (MainThread) - pid=3436 2017-11-30 18:41:49,159 - (MainThread) - Opening mnt/fileio.out 2017-11-30 18:41:49,160 - (MainThread) - OK fd=3 2017-11-30 18:41:49,161 - (MainThread) - Starting canary thread 2017-11-30 18:41:49,161 - (Canary) - Blocking access to storage 2017-11-30 18:41:49,169 - (Canary) - If this test is hang, please run: iptables -D OUTPUT -p tcp -d dumbo.tlv.redhat.com --dport 2049 -j DROP 2017-11-30 18:41:49,169 - (Canary) - check 0 2017-11-30 18:41:49,169 - (MainThread) - Waiting until storage is blocked... 2017-11-30 18:41:50,170 - (Canary) - check 1 2017-11-30 18:41:51,170 - (Canary) - check 2 2017-11-30 18:41:52,171 - (Canary) - check 3 2017-11-30 18:41:53,171 - (Canary) - check 4 2017-11-30 18:41:54,172 - (Canary) - check 5 2017-11-30 18:41:55,172 - (Canary) - check 6 2017-11-30 18:41:56,172 - (Canary) - check 7 2017-11-30 18:41:57,173 - (Canary) - check 8 2017-11-30 18:41:58,173 - (Canary) - check 9 2017-11-30 18:41:59,174 - (MainThread) - Opening io.FileIO Everything is hang now! After some time I run this from another shell: iptables -D OUTPUT -p tcp -d dumbo.tlv.redhat.com --dport 2049 -j DROP And now the script is unblocked and finish. 2017-11-30 18:45:29,683 - (MainThread) - OK 2017-11-30 18:45:29,684 - (MainThread) - Creating mmap 2017-11-30 18:45:29,684 - (Canary) - check 10 2017-11-30 18:45:29,684 - (MainThread) - OK 2017-11-30 18:45:29,685 - (MainThread) - Filling mmap 2017-11-30 18:45:29,685 - (MainThread) - OK 2017-11-30 18:45:29,685 - (MainThread) - Writing mmap to storage 2017-11-30 18:45:29,719 - (MainThread) - OK 2017-11-30 18:45:29,719 - (MainThread) - Syncing 2017-11-30 18:45:29,719 - (MainThread) - OK 2017-11-30 18:45:29,720 - (MainThread) - Done We have a canary thread logging every second. Once we tried to open the FileIO, the canary thread stopped - this is possible only if the io extension module was holding the GIL during a blocking call. And here is the backtrace of the hang process in the kernel: # cat /proc/3436/stack [<ffffffffc03228e4>] rpc_wait_bit_killable+0x24/0xb0 [sunrpc] [<ffffffffc03240d4>] __rpc_execute+0x154/0x410 [sunrpc] [<ffffffffc03257b8>] rpc_execute+0x68/0xb0 [sunrpc] [<ffffffffc0315776>] rpc_run_task+0xf6/0x150 [sunrpc] [<ffffffffc07eb8c3>] nfs4_call_sync_sequence+0x63/0xa0 [nfsv4] [<ffffffffc07ec94c>] _nfs4_proc_getattr+0xcc/0xf0 [nfsv4] [<ffffffffc07f88f2>] nfs4_proc_getattr+0x72/0xf0 [nfsv4] [<ffffffffc07b543f>] __nfs_revalidate_inode+0xbf/0x310 [nfs] [<ffffffffc07b5a55>] nfs_getattr+0x95/0x250 [nfs] [<ffffffff81206146>] vfs_getattr+0x46/0x80 [<ffffffff812061c5>] vfs_fstat+0x45/0x80 [<ffffffff812068e4>] SYSC_newfstat+0x24/0x60 [<ffffffff81206ade>] SyS_newfstat+0xe/0x10 [<ffffffff816b5089>] system_call_fastpath+0x16/0x1b [<ffffffffffffffff>] 0xffffffffffffffff You cannot attach to the process with gdb, since it is in D state, but once the process is unblocked, gbd takes control, and we see: Thread 2 (Thread 0x7f97a2ea5700 (LWP 4799)): #0 0x00007f97ab925a0b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0 #1 0x00007f97ab925a9f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0 #2 0x00007f97ab925b3b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0 #3 0x00007f97abc455f5 in PyThread_acquire_lock () from /lib64/libpython2.7.so.1.0 #4 0x00007f97abc11156 in PyEval_RestoreThread () from /lib64/libpython2.7.so.1.0 #5 0x00007f97a44f9086 in time_sleep () from /usr/lib64/python2.7/lib-dynload/timemodule.so #6 0x00007f97abc18bb0 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0 #7 0x00007f97abc1aefd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0 #8 0x00007f97abc183fc in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0 #9 0x00007f97abc1aefd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0 #10 0x00007f97abc183fc in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0 #11 0x00007f97abc1aefd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0 #12 0x00007f97abba494d in function_call () from /lib64/libpython2.7.so.1.0 #13 0x00007f97abb7f9a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0 #14 0x00007f97abc135bd in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0 #15 0x00007f97abc1857d in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0 #16 0x00007f97abc1857d in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0 #17 0x00007f97abc1aefd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0 #18 0x00007f97abba4858 in function_call () from /lib64/libpython2.7.so.1.0 #19 0x00007f97abb7f9a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0 #20 0x00007f97abb8e995 in instancemethod_call () from /lib64/libpython2.7.so.1.0 #21 0x00007f97abb7f9a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0 #22 0x00007f97abc117b7 in PyEval_CallObjectWithKeywords () from /lib64/libpython2.7.so.1.0 #23 0x00007f97abc496e2 in t_bootstrap () from /lib64/libpython2.7.so.1.0 #24 0x00007f97ab91fe25 in start_thread () from /lib64/libpthread.so.0 #25 0x00007f97aaf4434d in clone () from /lib64/libc.so.6 Thread 1 (Thread 0x7f97ac10d740 (LWP 4798)): #0 0x00007f97aaf35154 in _fxstat () from /lib64/libc.so.6 #1 0x00007f97a470ad32 in fileio_init () from /usr/lib64/python2.7/lib-dynload/_io.so #2 0x00007f97abbd565f in type_call () from /lib64/libpython2.7.so.1.0 #3 0x00007f97abb7f9a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0 #4 0x00007f97abc140f6 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0 #5 0x00007f97abc1aefd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0 Missing separate debuginfos, use: debuginfo-install python-2.7.5-58.el7.x86_64 ---Type <return> to continue, or q <return> to quit--- #6 0x00007f97abc1b002 in PyEval_EvalCode () from /lib64/libpython2.7.so.1.0 #7 0x00007f97abc3443f in run_mod () from /lib64/libpython2.7.so.1.0 #8 0x00007f97abc355fe in PyRun_FileExFlags () from /lib64/libpython2.7.so.1.0 #9 0x00007f97abc36889 in PyRun_SimpleFileExFlags () from /lib64/libpython2.7.so.1.0 #10 0x00007f97abc47a3f in Py_Main () from /lib64/libpython2.7.so.1.0 #11 0x00007f97aae6dc05 in __libc_start_main () from /lib64/libc.so.6 #12 0x000000000040071e in _start () Looking at python code - there are two helpers in fileio.c that call fstat without releasing the GIL during the call: - https://github.com/python/cpython/blob/2.7/Modules/_io/fileio.c#L145 - https://github.com/python/cpython/blob/2.7/Modules/_io/fileio.c#L161 And both helpers are called from fileio_init (the implementation of io.FileIO()) - https://github.com/python/cpython/blob/2.7/Modules/_io/fileio.c#L332 - https://github.com/python/cpython/blob/2.7/Modules/_io/fileio.c#L366 Reported by RHV user, see https://bugzilla.redhat.com/1518676 (the bug may be private, sorry if you cannot access it). ---------- components: Extension Modules files: fileio_nfs_test.py messages: 307335 nosy: nirs priority: normal severity: normal status: open title: io.FileIO hang all threads if fstat blocks on inaccessible NFS server versions: Python 2.7, Python 3.5, Python 3.6, Python 3.7, Python 3.8 Added file: https://bugs.python.org/file47309/fileio_nfs_test.py _______________________________________ Python tracker <rep...@bugs.python.org> <https://bugs.python.org/issue32186> _______________________________________ _______________________________________________ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com