[sr-dev] Re: [kamailio/kamailio] cdp: restructure locking order to prevent rare deadlock (PR #4191)

Dragos Vingarzan via sr-dev Sun, 30 Mar 2025 10:46:47 -0700

vingarzan left a comment (kamailio/kamailio#4191)

Hey @mtryfoss!


I think I wrote some big parts of this code, but a very long time ago and it 
might have a lot of changes by now that I could not be fully aware of. I also 
further developed and heavily modified version of this code is in production in 
the closed source project from where it originally came, so hopefully it should 
be fixable. But there were also many changes here.

In hindsight, having Diameter session logic in the stack was a mistake, since 
IRL it's better to let the "application" handle it cleanly outside, yet that 
might be too late here. But, it's also not a huge amount of logic that it does, 
so let me try and help hopefully.

Seems like there is a dead-lock between the lock on a peer and the one on 
sessions. There is maybe also a 3rd lock involved, but I don't see that in your 
changes. And it might be just a side-effect, which would clear if the root 
cause is fixed. But the idea is the same: **we should try to eliminate any 
overlap between critical zones, in order to have some good guarantees about 
avoiding a dead/live-lock situation.**

So IMHO a proper fix/change would be to not have `peer *p` as a parameter to 
`Snd_Message()`. That would mean that you wouldn't hold `p->lock`. One way 
could be to use the FQDN instead, the other would be to copy (before unlocking) 
what the deeper functions need and avoid re-getting/re-locking in 
`peer_send_msg()`. In short: I'm advocating to actually have the same treatment 
to `Snd_Message()` as `Rcv_Process()` and call it also without a lock after the 
peer state machine big switch in `sm_process()`.

To explain why `Rcv_Process()` on purpose does not hold a lock while we call 
the handler in the application layer: imagine that you would want to send back 
an answer to a request that you have received, to the same peer. The handler 
will dead-lock while sending, trying to get a lock on a peer which the 
handler's caller already has a lock on. For this reason I'd also eliminate the 
peer pointer from the `Rcv_Process()` call. Yet... that's exported, so seems 
like a much deeper change needed to fix it thoroughly.

Back to your dead-lock though: I think that the hack/fix-attempt in 
`get_first_connected_route()` which you are eliminating is/was a bad idea in 
the first place, so I encourage your removal. But I see now that the session 
pointer is all over the place in calls to route messages. Seems like someone 
tried to add a mechanism of session-stickiness-to-peers, but there was no 
consideration for avoiding overlapping locks... I'm actually surprised that it 
doesn't crash&burn much more often, because again: if you release a lock and 
you'd be acquiring it again a few lines below, you are just asking for trouble, 
since someone might've already free that session or peer pointer and you would 
be reading/writing from bad addresses. And of course, it's a bad design, which 
breaks code modularity between Diameter peers and sessions.

A better design for the unlock is to have a function with a double pointer 
peer/session parameter, which resets the pointer in the process, forcing a 
seg-fault if someone tries such unlock/do-something/relock-without-get patterns.

Anyway, long-story short, this seems like a deeper change might be required. If 
you wish, I'd offer a call (hopefully time-boxed) to help you navigate `cdp` or 
maybe to discuss a deeper change for peers and sessions.

Cheers,
-Dragos

-- 
Reply to this email directly or view it on GitHub:
https://github.com/kamailio/kamailio/pull/4191#issuecomment-2764667575
You are receiving this because you are subscribed to this thread.

Message ID: <kamailio/kamailio/pull/4191/c2764667...@github.com>

_______________________________________________
Kamailio - Development Mailing List -- sr-dev@lists.kamailio.org
To unsubscribe send an email to sr-dev-le...@lists.kamailio.org
Important: keep the mailing list in the recipients, do not reply only to the 
sender!

[sr-dev] Re: [kamailio/kamailio] cdp: restructure locking order to prevent rare deadlock (PR #4191)

Reply via email to