vingarzan left a comment (kamailio/kamailio#4191)
Hey @mtryfoss!
I think I wrote some big parts of this code, but a very long time ago and it
might have a lot of changes by now that I could not be fully aware of. I also
further developed and heavily modified version of this code is in production in
the closed source project from where it originally came, so hopefully it should
be fixable. But there were also many changes here.
In hindsight, having Diameter session logic in the stack was a mistake, since
IRL it's better to let the "application" handle it cleanly outside, yet that
might be too late here. But, it's also not a huge amount of logic that it does,
so let me try and help hopefully.
Seems like there is a dead-lock between the lock on a peer and the one on
sessions. There is maybe also a 3rd lock involved, but I don't see that in your
changes. And it might be just a side-effect, which would clear if the root
cause is fixed. But the idea is the same: **we should try to eliminate any
overlap between critical zones, in order to have some good guarantees about
avoiding a dead/live-lock situation.**
So IMHO a proper fix/change would be to not have `peer *p` as a parameter to
`Snd_Message()`. That would mean that you wouldn't hold `p->lock`. One way
could be to use the FQDN instead, the other would be to copy (before unlocking)
what the deeper functions need and avoid re-getting/re-locking in
`peer_send_msg()`. In short: I'm advocating to actually have the same treatment
to `Snd_Message()` as `Rcv_Process()` and call it also without a lock after the
peer state machine big switch in `sm_process()`.
To explain why `Rcv_Process()` on purpose does not hold a lock while we call
the handler in the application layer: imagine that you would want to send back
an answer to a request that you have received, to the same peer. The handler
will dead-lock while sending, trying to get a lock on a peer which the
handler's caller already has a lock on. For this reason I'd also eliminate the
peer pointer from the `Rcv_Process()` call. Yet... that's exported, so seems
like a much deeper change needed to fix it thoroughly.
Back to your dead-lock though: I think that the hack/fix-attempt in
`get_first_connected_route()` which you are eliminating is/was a bad idea in
the first place, so I encourage your removal. But I see now that the session
pointer is all over the place in calls to route messages. Seems like someone
tried to add a mechanism of session-stickiness-to-peers, but there was no
consideration for avoiding overlapping locks... I'm actually surprised that it
doesn't crash&burn much more often, because again: if you release a lock and
you'd be acquiring it again a few lines below, you are just asking for trouble,
since someone might've already free that session or peer pointer and you would
be reading/writing from bad addresses. And of course, it's a bad design, which
breaks code modularity between Diameter peers and sessions.
A better design for the unlock is to have a function with a double pointer
peer/session parameter, which resets the pointer in the process, forcing a
seg-fault if someone tries such unlock/do-something/relock-without-get patterns.
Anyway, long-story short, this seems like a deeper change might be required. If
you wish, I'd offer a call (hopefully time-boxed) to help you navigate `cdp` or
maybe to discuss a deeper change for peers and sessions.
Cheers,
-Dragos
--
Reply to this email directly or view it on GitHub:
https://github.com/kamailio/kamailio/pull/4191#issuecomment-2764667575
You are receiving this because you are subscribed to this thread.
Message ID: <kamailio/kamailio/pull/4191/c2764667...@github.com>
_______________________________________________
Kamailio - Development Mailing List -- sr-dev@lists.kamailio.org
To unsubscribe send an email to sr-dev-le...@lists.kamailio.org
Important: keep the mailing list in the recipients, do not reply only to the
sender!