lunn left a comment (kamailio/kamailio#4185)

I've been helping Mathias with this problem. I've not understood why it is
deadlocking, but I have found something along the way.

Some background. https://docs.openssl.org/1.1.1/man7/RAND_DRBG/ documents the 
"deterministic random bit generator". This is however from version 1.1.1 of 
openssl, not version 3. Version 3 does not include this documentation any more. 
However, the basics still seem valid.

drbg makes use of stacked random number generators. The parent generator is 
connected to the entropy source. Thus it is seeded from entropy. The child 
generators pull seeds from the parent generator. Seeding happens once when the 
generator is created, and is then repeated after a time limit, or when 
sufficient bytes have been taken out of the generator.
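
To make the reseed rule concrete, here is a minimal sketch of a generator that reseeds after a time limit or after a usage limit. The struct, its field names and both helper functions are hypothetical and are not OpenSSL's. It also assumes the usage limit is counted in generate requests, whereas the description above talks about bytes taken out.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical toy generator; field names are illustrative, not OpenSSL's. */
struct toy_drbg {
    unsigned int generate_count;  /* requests served since the last reseed */
    unsigned int reseed_interval; /* max requests between reseeds, 0 = no limit */
    time_t last_reseed;           /* when the generator was last seeded */
    time_t reseed_time_interval;  /* max seconds between reseeds, 0 = no limit */
};

/* Returns true when either the usage limit or the time limit is reached. */
bool toy_needs_reseed(const struct toy_drbg *d, time_t now)
{
    if (d->reseed_interval > 0 && d->generate_count >= d->reseed_interval)
        return true;
    if (d->reseed_time_interval > 0
        && now - d->last_reseed >= d->reseed_time_interval)
        return true;
    return false;
}

/* Record a reseed: reset both the usage counter and the clock. */
void toy_mark_reseeded(struct toy_drbg *d, time_t now)
{
    d->generate_count = 0;
    d->last_reseed = now;
}
```

A caller would check `toy_needs_reseed()` before each request, pull a fresh seed from the parent when it returns true, and then call `toy_mark_reseeded()`.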

The documentation indicates the child generators are expected to be per thread, 
and so can be accessed without locking. The parent generator is however 
accessed by multiple children, so does perform locking, and it is explicitly 
documented as being thread-safe.

Kamailio however does not use a thread model, but a process model with shared 
memory. As a result, there is a lot of fun and games to make openssl work 
correctly in a model it is not intended for. 

Openssl is set up in the first process. This causes the parent generator and one
child generator to be created. Since the openssl memory allocation functions 
have been replaced with kamailio versions, these generators end up in the 
shared memory. The address of the child generator is stored into a thread local 
key by openssl.

The worker processes are then forked off. They then go and overwrite the thread 
local key of the child generator, setting it back to 0. As soon as there is 
need for the child generator, openssl will create a new one for the worker 
process. Since the collection of thread local keys is per process, each worker
process gets its own child generator.
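
The reset-then-recreate behaviour described above can be sketched with plain pthread keys. This is only an illustration of the pattern: `toy_child`, `toy_get_child()` and `toy_forget_child()` are made-up names, and the real openssl code uses its own wrappers and allocators.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for a child generator. */
struct toy_child { int id; };

static pthread_key_t child_key;
static pthread_once_t child_key_once = PTHREAD_ONCE_INIT;
static int children_created;

static void make_child_key(void)
{
    pthread_key_create(&child_key, NULL);
}

/* Return the caller's child generator, creating it on first use. */
struct toy_child *toy_get_child(void)
{
    pthread_once(&child_key_once, make_child_key);
    struct toy_child *c = pthread_getspecific(child_key);
    if (c == NULL) {
        c = malloc(sizeof(*c));
        if (c == NULL)
            return NULL;
        c->id = ++children_created;
        pthread_setspecific(child_key, c);
    }
    return c;
}

/*
 * What a freshly forked worker would do: clear the inherited key value.
 * The old object is deliberately not freed here, since in the setup
 * described above it still belongs to the first process.
 */
void toy_forget_child(void)
{
    pthread_once(&child_key_once, make_child_key);
    pthread_setspecific(child_key, NULL);
}
```

After `toy_forget_child()`, the next `toy_get_child()` call sees an empty key and builds a new generator, which is the effect each worker relies on.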

The child generators however share the parent generator, which is in the shared 
memory which all processes have access to. The locking used on the parent looks 
at first glance to work happily for both threads and processes using shared 
memory. The pthread library uses atomic operations to try to do as much as it 
can in userspace.  I've not seen anything which indicates user space atomic 
operations are not valid on shared memory. When the locks need to block, they 
call into the kernel on a futex. The futex man page also indicates this is 
valid, so long as you are not using a FUTEX_PRIVATE_FLAG operation.
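
For comparison, this is roughly what an explicitly process-shared lock looks like. POSIX only guarantees a mutex across processes when it is initialised with `PTHREAD_PROCESS_SHARED`; whether the locks openssl creates in this setup end up with that attribute is not something I have established here. The sketch uses an anonymous `mmap()` region as a stand-in for kamailio's shared memory, and all names in it are hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1 /* for MAP_ANONYMOUS under strict -std= modes */
#endif
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

struct shared_counter {
    pthread_mutex_t lock;
    long value;
};

/* Map one counter into memory that stays shared across fork(). */
struct shared_counter *shared_counter_create(void)
{
    struct shared_counter *sc = mmap(NULL, sizeof(*sc), PROT_READ | PROT_WRITE,
                                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (sc == MAP_FAILED)
        return NULL;

    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    /* Without this attribute the mutex is process-private by default. */
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&sc->lock, &attr);
    pthread_mutexattr_destroy(&attr);
    sc->value = 0;
    return sc;
}

/* Increment the counter n times, taking the lock for each step. */
static void bump(struct shared_counter *sc, int n)
{
    for (int i = 0; i < n; i++) {
        pthread_mutex_lock(&sc->lock);
        sc->value++;
        pthread_mutex_unlock(&sc->lock);
    }
}

/* Parent and one forked child each add n; returns the total, or -1 on error. */
long run_shared_counter_demo(int n)
{
    struct shared_counter *sc = shared_counter_create();
    if (sc == NULL)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(sc, sizeof(*sc));
        return -1;
    }
    if (pid == 0) {
        bump(sc, n);
        _exit(0); /* child: leave without running the parent's cleanup */
    }
    bump(sc, n);
    waitpid(pid, NULL, 0);

    long total = sc->value;
    pthread_mutex_destroy(&sc->lock);
    munmap(sc, sizeof(*sc));
    return total;
}
```

If the lock works across both processes, `run_shared_counter_demo(n)` returns `2 * n`.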

So the basic scheme looks O.K.

What I did notice however is that our deadlock happens when the parent
generator is reseeding. And all child generators are also reseeding. And all
child generator processes are trying to reseed the parent generator. Why are
they trying to reseed the parent?

```
    fork_id = openssl_get_fork_id();

    if (drbg->fork_id != fork_id) {
        drbg->fork_id = fork_id;
        reseed_required = 1;
    }
```
There is additional documentation for drbg->fork_id:
```
    /*
     * Stores the return value of openssl_get_fork_id() as of when we last
     * reseeded.  The DRBG reseeds automatically whenever drbg->fork_id !=
     * openssl_get_fork_id().  Used to provide fork-safety and reseed this
     * DRBG in the child process.
     */
    int fork_id;
```

and `openssl_get_fork_id()` is:

```
int openssl_get_fork_id(void)
{
    return getpid();
}
```

Since kamailio is using a process model, not a thread model, each process has
its own pid. The shared primary has a single fork_id, which holds the pid of
whichever process reseeded it last. So with 8 processes running in parallel on
a loaded system, it is very likely that the calling pid differs from the stored
one every time there is a request for the shared primary to generate random
data, and so reseeding is happening pretty much every time, rather than
infrequently.
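
The effect is easy to reproduce in a toy model. The sketch below applies the same comparison as the quoted check to a single shared structure and counts how often it fires. The struct, the function names and the fake pids are all hypothetical, and a strict round-robin between processes is assumed, which is the worst case rather than what a real scheduler does.

```c
/* Toy stand-in for the shared parent generator; names are hypothetical. */
struct toy_parent {
    int fork_id;          /* pid recorded at the last reseed */
    unsigned int reseeds; /* how many times a reseed was triggered */
};

/* Same comparison as the quoted check, with the caller's pid passed in. */
void toy_generate(struct toy_parent *p, int caller_pid)
{
    if (p->fork_id != caller_pid) {
        p->fork_id = caller_pid;
        p->reseeds++;
    }
}

/* Serve `requests` calls, rotating round-robin over `nprocs` fake pids. */
unsigned int toy_count_reseeds(int nprocs, int requests)
{
    struct toy_parent p = { 0, 0 };

    for (int i = 0; i < requests; i++)
        toy_generate(&p, 1000 + i % nprocs);
    return p.reseeds;
}
```

With one process, `toy_count_reseeds(1, 1000)` gives a single reseed. With eight processes taking turns, `toy_count_reseeds(8, 1000)` reseeds on all 1000 requests, because the stored pid never matches the next caller.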

As a quick test, I hacked out this fork_id check, so that the primary did not
reseed so often. Our test which deadlocks within a handful of seconds ran for a 
handful of hours without deadlocking. The deadlock is probably still there, but 
we less frequently get into a situation where the deadlock could happen.

I've not traced where the primary is getting its entropy from. If it is system
entropy, that is probably not good for the system as a whole. Might other
random number generators on the machine end up producing numbers that are less
random? This would be my primary concern with the way openssl is being used.

-- 
Reply to this email directly or view it on GitHub:
https://github.com/kamailio/kamailio/issues/4185#issuecomment-2794383648