lunn left a comment (kamailio/kamailio#4185)
I've been helping Mathias with this problem. I have not yet worked out why it
is deadlocking, but I have found something along the way.
Some background. https://docs.openssl.org/1.1.1/man7/RAND_DRBG/ documents the
"deterministic random bit generator". This is however from version 1.1.1 of
openssl, not version 3. Version 3 does not include this documentation any more.
However, the basics still seem valid.
The DRBG makes use of stacked random number generators. The parent generator is
connected to the entropy source. Thus it is seeded from entropy. The child
generators pull seeds from the parent generator. Seeding happens once when the
generator is created, and is then repeated after a time limit, or when
sufficient bytes have been taken out of the generator.
The documentation indicates the child generators are expected to be per thread,
and so can be accessed without locking. The parent generator is however
accessed by multiple children, so does perform locking, and it is explicitly
documented as being thread-safe.
Kamailio however does not use a thread model, but a process model with shared
memory. As a result, there is a lot of fun and games to make openssl work
correctly in a model it is not intended for.
OpenSSL is set up in the first process. This causes the parent generator and one
child generator to be created. Since the openssl memory allocation functions
have been replaced with kamailio versions, these generators end up in the
shared memory. The address of the child generator is stored into a thread local
key by openssl.
The worker processes are then forked off. They then go and overwrite the thread
local key of the child generator, setting it back to 0. As soon as there is
need for the child generator, openssl will create a new one for the worker
process. Since the collection of thread local keys is per process, each worker
process gets its own child generator.
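The reset-then-recreate pattern described above can be sketched as follows. This is only an illustration: it uses a C11 `_Thread_local` pointer in place of OpenSSL's thread local key, plain `malloc` in place of the shared memory allocator, and the names `child_drbg`, `get_child` and `forget_inherited_child` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a child generator. */
struct child_drbg {
    unsigned long id;
};

/* Stand-in for the thread local key that holds the child's address. */
static _Thread_local struct child_drbg *local_child = NULL;
static unsigned long next_id = 1;

/* Lazily create the child generator the first time it is needed. */
static struct child_drbg *get_child(void)
{
    if (local_child == NULL) {
        local_child = malloc(sizeof(*local_child));
        assert(local_child != NULL);
        local_child->id = next_id++;
    }
    return local_child;
}

/* What a worker does after fork(): drop the inherited pointer without
 * freeing it, since the object still belongs to the first process. */
static void forget_inherited_child(void)
{
    local_child = NULL;
}
```

After `forget_inherited_child()`, the next `get_child()` call returns a fresh object, which is the per-worker child generator in the description above.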
The child generators however share the parent generator, which is in the shared
memory which all processes have access to. The locking used on the parent looks
at first glance to work happily for both threads and processes using shared
memory. The pthread library uses atomic operations to try to do as much as it
can in userspace. I've not seen anything which indicates user space atomic
operations are not valid on shared memory. When the locks need to block, they
call into the kernel on a futex. The futex man page also indicates this is
valid, so long as you are not using a FUTEX_PRIVATE_FLAG operation.
So the basic scheme looks O.K.
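For comparison, this is how a pthread lock is normally prepared when it is meant to be used by several processes through shared memory. POSIX makes a lock process-private by default and leaves cross-process use of such a lock undefined, so the process-shared attribute has to be requested explicitly. Whether the lock on the parent generator is initialised this way is not established here. The sketch is not Kamailio or OpenSSL code; `shared_state` and `run_shared_counter` are hypothetical names.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std= modes */
#endif
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical shared state: a lock plus the data it protects. */
struct shared_state {
    pthread_mutex_t lock;
    long counter;
};

/* Place the state in memory that stays shared across fork(). */
static struct shared_state *shared_state_create(void)
{
    struct shared_state *s = mmap(NULL, sizeof(*s), PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (s == MAP_FAILED)
        return NULL;

    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc == 0) {
        /* Without this call the mutex is process-private. */
        rc = pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        if (rc == 0)
            rc = pthread_mutex_init(&s->lock, &attr);
        pthread_mutexattr_destroy(&attr);
    }
    if (rc != 0) {
        munmap(s, sizeof(*s));
        return NULL;
    }
    s->counter = 0;
    return s;
}

/* Fork nproc workers that each increment the counter iters times under
 * the lock. Returns the final counter, or -1 on error. */
static long run_shared_counter(int nproc, long iters)
{
    assert(nproc > 0 && iters >= 0);
    struct shared_state *s = shared_state_create();
    if (s == NULL)
        return -1;

    int started = 0;
    for (int i = 0; i < nproc; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;
        if (pid == 0) { /* worker process */
            for (long n = 0; n < iters; n++) {
                pthread_mutex_lock(&s->lock);
                s->counter++;
                pthread_mutex_unlock(&s->lock);
            }
            _exit(0);
        }
        started++;
    }
    for (int i = 0; i < started; i++)
        wait(NULL); /* reap every worker that was started */

    long result = (started == nproc) ? s->counter : -1;
    pthread_mutex_destroy(&s->lock);
    munmap(s, sizeof(*s));
    return result;
}
```

With four workers doing 1000 increments each, the counter should end at 4000 if the lock really serialises the processes.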
What I did notice, however, is that our deadlock happens when the parent
generator is reseeding, all the child generators are reseeding as well, and
all the processes owning those child generators are trying to reseed the
parent generator. Why are they trying to reseed the parent?
```
fork_id = openssl_get_fork_id();
if (drbg->fork_id != fork_id) {
    drbg->fork_id = fork_id;
    reseed_required = 1;
}
```
There is additional documentation for drbg->fork_id:
```
/*
* Stores the return value of openssl_get_fork_id() as of when we last
* reseeded. The DRBG reseeds automatically whenever drbg->fork_id !=
* openssl_get_fork_id(). Used to provide fork-safety and reseed this
* DRBG in the child process.
*/
int fork_id;
```
and `openssl_get_fork_id()` is:
```
int openssl_get_fork_id(void)
{
    return getpid();
}
```
Since kamailio is using a process model, not a thread model, each process has
its own pid. So with 8 processes running in parallel on a loaded system, it is
very likely that the calling pid differs from the stored fork_id every time
the shared primary (the parent generator) is asked for random data, and so
reseeding is happening pretty much every time, rather than infrequently.
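The effect can be illustrated with a small simulation of the check quoted above. This is a sketch, not OpenSSL code: `parent_drbg` models only the `fork_id` field, the pids are made up, and the callers are assumed to arrive in strict round-robin order, which is the worst case. The real interleaving depends on the scheduler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared parent generator; only the field
 * involved in the fork check is modelled. */
struct parent_drbg {
    int fork_id;           /* pid recorded at the last reseed */
    unsigned long reseeds; /* how many reseeds the check has triggered */
};

/* Same shape as the quoted check: reseed whenever the caller's pid
 * differs from the stored one. Returns 1 if a reseed was triggered. */
static int parent_generate(struct parent_drbg *drbg, int caller_pid)
{
    if (drbg->fork_id != caller_pid) {
        drbg->fork_id = caller_pid;
        drbg->reseeds++;
        return 1;
    }
    return 0;
}

/* Simulate nproc workers taking turns on one shared parent for a total
 * of `calls` requests, and report how many of them forced a reseed. */
static unsigned long count_reseeds(int nproc, unsigned long calls)
{
    assert(nproc > 0);
    /* 1000 stands for the pid of the first process, which set things up. */
    struct parent_drbg drbg = { .fork_id = 1000, .reseeds = 0 };
    for (unsigned long i = 0; i < calls; i++) {
        int pid = 1001 + (int)(i % (unsigned long)nproc);
        parent_generate(&drbg, pid);
    }
    return drbg.reseeds;
}
```

Under these assumptions `count_reseeds(8, 1000)` reports 1000 reseeds, one per request, while `count_reseeds(1, 1000)` reports a single reseed, which is the behaviour the fork check was presumably designed for.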
As a quick test, I hacked out this fork_id check, so that the primary did not
reseed so often. Our test, which normally deadlocks within a handful of
seconds, ran for a handful of hours without deadlocking. The deadlock is
probably still there, but we get into a situation where it can happen far less
often.
I've not traced where the primary is getting its entropy from. If it is system
entropy, that is probably not good for the system as a whole. Other random
number generators on the machine might be producing less random numbers? This
would be my primary concern with the way openssl is being used.
--
Reply to this email directly or view it on GitHub:
https://github.com/kamailio/kamailio/issues/4185#issuecomment-2794383648