Jack O'Connor <oconnor...@gmail.com> added the comment:

I'm late to the party, but I want to explain what's going on here in case it's 
helpful to folks. The issue you're seeing here has to do with whether a child 
process has been "reaped". (Windows is different from Unix here, because the 
parent keeps an open handle to the child, so this is mostly a Unix thing.) In 
short, when a child exits, it leaves a "zombie" process whose only job is to 
hold some metadata and keep the child's PID reserved.  When the parent calls 
wait/waitpid/waitid or similar, that zombie process is cleaned up. That means 
that waiting has important correctness properties apart from just blocking the 
parent -- signaling after wait returns is unsafe, and forgetting to wait also 
leaks kernel resources.

Here's a short example demonstrating this:

```
import os
import signal
import subprocess
import time

# Start a child process and sleep a little bit so that we know it's exited.
child = subprocess.Popen(["true"])
time.sleep(1)

# Signal it. Even though it's definitely exited, this is not an error.
os.kill(child.pid, signal.SIGKILL)
print("signaling before waiting works fine")

# Now wait on it. We could also use os.waitpid or os.waitid here. This reaps
# the zombie child.
child.wait()

# Try to signal it again. This raises ProcessLookupError, because the child's
# PID has been freed. But note that Popen.kill() would be a no-op here,
# because it knows the child has already been waited on.
os.kill(child.pid, signal.SIGKILL)
```
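
The last comment in that example can be checked on its own. This is a minimal sketch, assuming a Unix-like system; the child command is only a placeholder that exits immediately.

```python
import subprocess
import sys

# Placeholder child that exits right away.
child = subprocess.Popen([sys.executable, "-c", "pass"])

# wait() reaps the zombie and records the exit status on the Popen object.
child.wait()

# Because returncode is already set, Popen.kill() returns without sending
# a signal to the (now freed) PID, so no ProcessLookupError is raised.
child.kill()
print(child.returncode)
```

The protection comes from the Popen object's own bookkeeping, which is why it does not extend to a raw os.kill() on the same PID.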

With that in mind, the original behavior with communicate() that started this 
bug is expected. The docs say that communicate() "waits for process to 
terminate and sets the returncode attribute." That means internally it calls 
waitpid, so your terminate() thread is racing against process exit. Catching 
the exception thrown by terminate() will hide the problem, but the underlying 
race condition means your program might end up killing an unrelated process 
that just happens to reuse the same PID at the wrong time. Doing this properly 
requires using waitid(WNOWAIT), which is...tricky.
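
To give a rough idea of what the waitid(WNOWAIT) approach might look like, here is a sketch rather than a vetted implementation. The SharedChild class and its method names are hypothetical, not part of subprocess, and it assumes a platform where os.waitid is available (for example Linux). The idea is that a waiter blocks without reaping, and that both reaping and signaling happen only while holding a lock, so a signal is never sent to a PID that has already been freed.

```python
import os
import signal
import subprocess
import sys
import threading


class SharedChild:
    """Hypothetical wrapper: one thread may wait while another kills."""

    def __init__(self, args):
        self._child = subprocess.Popen(args)
        self._lock = threading.Lock()

    @property
    def returncode(self):
        return self._child.returncode

    def wait(self):
        try:
            # Block until the child exits, but leave the zombie in place
            # (WNOWAIT), so the PID stays reserved while no lock is held.
            os.waitid(os.P_PID, self._child.pid, os.WEXITED | os.WNOWAIT)
        except ChildProcessError:
            # Another waiter already reaped the child.
            pass
        with self._lock:
            # Reaping only ever happens under the lock. If the child was
            # already reaped, this returns the cached returncode.
            return self._child.wait()

    def kill(self):
        with self._lock:
            # returncode is None means nobody has reaped the child yet, so
            # the PID still refers to our child (running or zombie).
            if self._child.returncode is None:
                os.kill(self._child.pid, signal.SIGKILL)


# Usage: one thread waits while the main thread kills.
child = SharedChild([sys.executable, "-c", "import time; time.sleep(60)"])
waiter = threading.Thread(target=child.wait)
waiter.start()
child.kill()
waiter.join()
print(child.returncode)
```

Holding the lock across a blocking wait would make kill() wait for the child to exit, which defeats the purpose; that is the reason for blocking in waitid(WNOWAIT) first and only taking the lock for the final, non-blocking reap.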

----------
nosy: +oconnor663

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue40550>
_______________________________________