Hello again,

I can now positively confirm the race condition in cygserver w.r.t. the named
pipe used to serialize SYSV requests through the server.  The race is due to
that transport_layer_pipes::accept() (bool *const recoverable) (file: 
transport_pipes.cc) does actually _create_ the pipe when pipe_instance == 0
(ironically, transport_layer_pipes::listen() does not create any OS primitives
at all!).

This means that under heavy load, cygserver threads may all end up processing
their requests and closing all instances of the pipe (bringing pipe_instance == 
0)
yet not being able to get to the point of accepting new request (that is, to
re-create the pipe).  For the client (user process), this looks like the pipe
does not exist (during that very tiny period of time), and the following message
gets printed:

Iteration 3016
      1 [main] a 4872 transport_layer_pipes::connect: lost connection to 
cygserver, error = 2

Note "error 2" for ERROR_FILE_NOT_FOUND.

The error is so subtle it is very difficult to reproduce with the default 
number of
threads in cygserver, 10 (although, not impossible, esp. on a 
moderately-to-heavily
loaded system).

The error is not due to deadlocking (reported earlier today) -- the server 
remains
afloat coming to CreateNamedPipe() just a jiffy too late, it's the client that 
suffers
from the pipe disappearance.  Deadlocking of the logging cygserver seems to 
remain a
yet another issue.

Anton Lavrentiev
Contractor NIH/NLM/NCBI

P.S. The code to reveal the race is the very same test program posted today
(stripped of the BUGx conditionals):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/sem.h>


#define SEMKEY 1


union semun {
    int val;                    /* value for SETVAL */
    struct semid_ds *buf;       /* buffer for IPC_STAT, IPC_SET */
    unsigned short  *array;     /* array for GETALL, SETALL */
};


static void doCriticalWork(void)
{
    return;
}


int main(void)
{
    struct sembuf lock[2];
    int n, semid;

    if ((semid = semget(SEMKEY, 2, IPC_CREAT | 0666)) < 0) {
        perror("semget(IPC_CREATE)");
        return 1;
    }

    lock[0].sem_num = 0;
    lock[0].sem_op  = 0;
    lock[0].sem_flg = IPC_NOWAIT;
    lock[1].sem_num = 0;
    lock[1].sem_op  = 1;

    lock[1].sem_flg = SEM_UNDO;

    if (semop(semid, lock, 2) != 0) {
        perror("semop(LOCK[0])");
        return 1;
    }

    for (n = 0; ; ++n) {
        static const union semun arg = { 0 };
        int error;

        printf("Iteration %d\n", n);

        lock[0].sem_num = 1;
        lock[0].sem_op  = 0; /* precondition:  [1] == 0 */
        lock[0].sem_flg = 0;
        lock[1].sem_num = 1;
        lock[1].sem_op  = 1; /* postcondition: [1] == 1 */
        lock[1].sem_flg = SEM_UNDO;

        if (semop(semid, lock, 2) < 0) {
            error = errno;
            fprintf(stderr, "semop(LOCK[1]): %d, %s\n",
                    error, strerror(error));
            break;
        }

        doCriticalWork();

        lock[0].sem_num =  1;
        lock[0].sem_op  = -1;
        lock[0].sem_flg =  SEM_UNDO | IPC_NOWAIT;

        if (semop(semid, lock, 1) < 0) {
            error = errno;
            fprintf(stderr, "semop(UNLOCK[1]): %d, %s\n",
                    error, strerror(error));
            break;
        }
    }

    return 1;
}

Reply via email to