henningw created an issue (kamailio/kamailio#4209)
Frequent hangs in Kamailio, probably related to lock contention in the
xhttp_prom module.
### Environment:
The systems use 32 Kamailio worker processes for the relevant network
interface. The cfg performs more than 30 Prometheus counter increment
operations during INVITE processing. Kamailio otherwise uses no database or
other I/O-related services. The Kamailio version is 5.8.3, but no relevant
changes in the xhttp_prom module could be found.
### Quick summary of the findings:
Multiple systems in a customer setup showed frequent hangs of their Kamailio
servers. Usually after a few hours, all Kamailio processes get blocked and no
more traffic can be processed on the affected system.
I have analysed three stack traces of Kamailio on one of the systems that
showed the behaviour: two taken without problems and one taken during a period
where the server had problems.
### Details:
Here are some details of the stack traces from a problematic case.
The relevant processes are from PID 494551 to 494582.
The majority of these processes are blocked in paths related to the
Prometheus module (PIDs 494551 to 494576):
#### PID 494551:
```
#1 0x00007fc69ea5f053 in futex_get (lock=0x7fc4a147acd0) at
../../core/mem/../futexlock.h:108
v = 2
i = 1024
#2 0x00007fc69ea70f99 in prom_counter_inc (s_name=0x7fffe339ffa0, number=1,
l1=0x7fffe339ff90, l2=0x0, l3=0x0) at prom_metric.c:1154
p = 0x6b
__func__ = "prom_counter_inc"
[…]
#14 0x00000000005ab179 in receive_msg (buf=0x9f47e0 <buf> "INVITE
sip:+1yyyyyyy...@10.xxx.XXX107 SIP/2.0\r\nRecord-Route:
sip:10.1XXX.XXX.104;lr=on;ftag=HK507HSy55p9F;dlgcor=62b91.985c3\r\nRecord-Route:
sip:10.XXX.XXX.117;r2=on;lr;ftag=HK507HSy55p9F\r\nRecord-R"...,
len=2819, rcv_info=0x7fffe33a28d0) at core/receive.c:518
```
Most of the worker processes are in the same state as shown above.
Some of the processes are also busy in other Prometheus-related operations:
#### PID 494554:
```
#0 prom_metric_timeout_delete (p_m=0x7fc49ea102f0) at prom_metric.c:646
current = 0x7fc4a2810fd0
ts = 1744143945433
__func__ = "prom_metric_timeout_delete"
l = 0x7fc4abffc808
#1 0x00007fc69ea676ce in prom_metric_list_timeout_delete () at prom_metric.c:668
p = 0x7fc49ea102f0
```
### Problem hypothesis:
My hypothesis is that the hang is caused by lock contention around the
Prometheus module. The relevant code uses only one lock, and this, together
with the extensive use of counter increments, probably causes these issues
under high load.
The majority of the worker processes are occupied in the Prometheus path and
are not working on SIP packets. This of course causes the UDP receive queue to
grow, leading to the described problems.
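For illustration, the pattern described by this hypothesis looks roughly like
the following (a minimal sketch, not the actual xhttp_prom code; threads stand
in for the Kamailio worker processes, the numbers are arbitrary, and it is an
assumption here that the timeout cleanup seen in PID 494554 runs under the same
lock):
```
/* Illustrative only: one global lock guards both the hot path (counter
 * increments, ~30 per INVITE in the cfg) and the slow path (walking the
 * metric list for timeout deletion), so all workers queue behind it. */
#include <pthread.h>
#include <stdio.h>

#define N_WORKERS 32

static pthread_mutex_t metrics_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;

/* hot path: every worker takes the same lock for each increment */
static void counter_inc(void)
{
	pthread_mutex_lock(&metrics_lock);
	counter++;
	pthread_mutex_unlock(&metrics_lock);
}

/* slow path: assumed to hold the same lock while it scans the metric list */
static void timeout_maintenance(void)
{
	pthread_mutex_lock(&metrics_lock);
	for (volatile long i = 0; i < 1000000; i++)
		; /* stands in for list traversal and label deletion */
	pthread_mutex_unlock(&metrics_lock);
}

static void *worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < 100000; i++) {
		counter_inc();
		if (i % 10000 == 0)
			timeout_maintenance();
	}
	return NULL;
}

int main(void)
{
	pthread_t t[N_WORKERS];
	for (int i = 0; i < N_WORKERS; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (int i = 0; i < N_WORKERS; i++)
		pthread_join(t[i], NULL);
	printf("counter = %ld\n", counter);
	return 0;
}
```
With 32 workers contending on a single lock like this, most of them end up
waiting in the lock acquisition, which matches the futex_get frames in the
backtraces above.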
To test this hypothesis we temporarily removed the Prometheus logic from the
kamailio cfg to see whether the issue still persists. The issue did not show
up again after two days of testing, whereas before it was observed after a few
hours.
### Possible solutions:
The Prometheus module probably needs some improvements to better support
high-load and high-concurrency situations. One common approach is to split the
locks, e.g. using a per-process lock array and then combining the individual
values in a second pass when they are read from outside.
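As an illustration of that idea, here is a minimal sketch (not Kamailio or
xhttp_prom code; the slot layout and names are made up): each worker updates
only its own slot, so the hot path needs no shared lock, and the read side sums
the slots when the metrics are scraped.
```
/* Illustrative only: per-worker counter shards combined on read.  In
 * Kamailio the slots would live in shared memory, indexed by the worker
 * process rank. */
#include <stdatomic.h>
#include <stdio.h>

#define N_WORKERS 32

/* pad each slot to its own cache line to avoid false sharing */
struct shard {
	_Atomic long value;
	char pad[64 - sizeof(_Atomic long)];
};

static struct shard shards[N_WORKERS];

/* hot path: lock-free, touches only the caller's own slot */
static void counter_inc(int worker_rank, long n)
{
	atomic_fetch_add_explicit(&shards[worker_rank].value, n,
			memory_order_relaxed);
}

/* read path (e.g. when /metrics is scraped): combine the shards */
static long counter_read(void)
{
	long sum = 0;
	for (int i = 0; i < N_WORKERS; i++)
		sum += atomic_load_explicit(&shards[i].value,
				memory_order_relaxed);
	return sum;
}

int main(void)
{
	counter_inc(0, 1);
	counter_inc(5, 1);
	printf("total = %ld\n", counter_read());
	return 0;
}
```
The trade-off is a slightly more expensive read path and only approximately
consistent totals during a scrape, which is usually acceptable for Prometheus
counters.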
Alternatively, the xhttp_prom module should be used only with care in
high-concurrency setups.
I have the full backtrace available; if helpful, just let me know.