Hello, I have been investigating the EINTR bug for a while now, and I'm finally confident about what the problem is. As a reminder, the bug causes sudo to sometimes fail with EPERM, usually when it tries to open /etc/sudoers because a reauthentication has failed silently.
While we originally thought the problem was related to dead-name notifications on the rendezvous port, it turns out to be caused by regular interruptions. Both auth_user_authenticate() and auth_server_authenticate() occasionally get such interruptions while running sudo in a loop. In fact, there seems to be some kind of cascade effect that leads to a flood of interruptions. Since an interruption on a port affects all current RPCs to that port, and in turn gets propagated to any port that is in use by those RPCs, I cannot be certain what causes these interruptions. It doesn't really matter where they come from, though, since the authentication protocol should be able to handle any interruption regardless.

What happens is this: once the auth server is handling both auth_user_authenticate() and auth_server_authenticate(), and has matched up the two calls using the rendezvous port, auth_server_authenticate() gets interrupted, which causes the server to retry the call. However, auth_user_authenticate() still returns successfully and deallocates the rendezvous port, which leaves the server thinking that the client has abandoned the authentication. After this, the server will not permit any operations on the port handle that was returned to the client (or perhaps the client is treated as an unknown user, I'm not sure).

So what happens if auth_user_authenticate() is interrupted and auth_server_authenticate() returns success? Then the client hangs waiting for the server to call auth_server_authenticate() again with the same rendezvous, which will never happen. If the client is doing this from setauth(), it will be blocking all signals from being delivered to the client process, so it can only be killed with SIGKILL. Actually, I'm not sure this can happen in setauth() because of the signal block, but I have observed this behavior while testing the bug.
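To make the two failure modes above easier to reason about, here is a toy model in C. It is only a sketch and not code from auth or libc: every name in it (toy_rendezvous, toy_match, toy_handshake, the TOY_* results) is made up for illustration. It also assumes that if both calls are interrupted, both sides retry and can be matched again, which I have not verified.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names exist in the Hurd sources. */

enum toy_result {
    TOY_BOTH_AUTHENTICATED,    /* both replies arrived, nothing is retried */
    TOY_SERVER_SEES_ABANDONED, /* server retries, but the user side is gone */
    TOY_CLIENT_HANGS,          /* client retries, but the server never calls again */
    TOY_BOTH_RETRY             /* both retry and can be matched again (assumption) */
};

struct toy_rendezvous {
    bool user_waiting;   /* an auth_user_authenticate() call is queued */
    bool server_waiting; /* an auth_server_authenticate() call is queued */
};

/* Pair up the two calls if both are queued; consume them on success. */
static bool toy_match(struct toy_rendezvous *r)
{
    if (!r->user_waiting || !r->server_waiting)
        return false;
    r->user_waiting = false;
    r->server_waiting = false;
    return true;
}

/* Model one handshake in which either call may be interrupted after the
   auth server has already matched the pair. */
static enum toy_result toy_handshake(bool user_interrupted,
                                     bool server_interrupted)
{
    struct toy_rendezvous r = { true, true };

    /* Both calls are queued, so the first match always succeeds. */
    bool matched = toy_match(&r);
    assert(matched);
    (void) matched;

    /* An interrupted call is retried with the same rendezvous; a call that
       returned normally is not issued again. */
    r.user_waiting = user_interrupted;
    r.server_waiting = server_interrupted;

    if (toy_match(&r))
        return TOY_BOTH_RETRY;
    if (r.server_waiting)
        return TOY_SERVER_SEES_ABANDONED; /* no user call left to pair with */
    if (r.user_waiting)
        return TOY_CLIENT_HANGS;          /* waits for a server call forever */
    return TOY_BOTH_AUTHENTICATED;
}
```

In this model, toy_handshake(false, true) corresponds to the EPERM case seen with sudo, and toy_handshake(true, false) to the hanging client.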
The code doesn't expect an interruption, since it doesn't even retry on EINTR, so the restart must be done at some lower level. This might be a symptom of some other bug lurking here somewhere, but at least it is a separate bug ;-). The code still shouldn't ignore EINTR though, since it could be caused by an interrupt in a different process that is using the same auth port.

The best way to fix this is to make the client restart the entire authentication process on EINTR, and only reply to the client with success after auth has successfully replied to the server. However, for this to work we must make sure that auth_user_authenticate() isn't restarted automatically, and that the server really has gotten the reply. If not, we'll need to change the authentication protocol so that client and server sync up after authentication, which won't be easy.

Regards,
Fredrik