So I guess now you reminded me of this thread ...

> The Timetravel socket can be read from the SIGIO handler, which may
> catch a RUN message that was expected to be received in the wait loop.
>
> It can happen that the device simulation only "created" one interrupt,
> but an uncertain (i.e., hard to predict by the calendar/device
> simulation) number of SIGIOs are emitted, with each SIGIO interrupt
> producing a GET message. The calendar might try to send a RUN message
> after the first GET message, which is then processed in the SIGIO
> handler, waiting for the response to a subsequent GET message. However,
> time is advanced by the received RUN message. Typically, this doesn't
> pose problems because the interrupt implies that the next RUN message
> should be at the current time (as sent by the calendar with the GET
> message). But there are corner cases, such as simultaneous other
> interrupts, which may desynchronize the current time in the UML instance
> from the expected time from the calendar. Since there is no real use
> case for advancing time in the SIGIO handler with RUN messages (in
> contrast to UPDATE messages), we logically only expect time to advance
> in the idle loop in TT-ext mode. Therefore, with this patch, we restrict
> time changes to the idle loop.
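If I read that right, the shape of the change is roughly this - my own
sketch of the idea, all names here are made up, not your actual diff:

#include <stdbool.h>

enum tt_op { TT_GET, TT_RUN, TT_UPDATE };

struct tt_msg {
	enum tt_op op;
	unsigned long long time;
};

static bool tt_in_idle_loop;		/* set while the wait loop polls */
static unsigned long long tt_time;	/* local notion of simulation time */

static void tt_handle_message(const struct tt_msg *msg)
{
	switch (msg->op) {
	case TT_UPDATE:
		/* UPDATEs may still advance time from any context */
		tt_time = msg->time;
		break;
	case TT_RUN:
		/*
		 * a RUN that happens to be consumed in the SIGIO path
		 * no longer moves the clock; only the idle loop applies
		 * RUN times
		 */
		if (tt_in_idle_loop)
			tt_time = msg->time;
		break;
	default:
		break;
	}
}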
OK, that mostly makes sense, so far. Although IIUC now you lost the time
update, so you have to recover from that again somehow when you actually
do the RUN (which we don't do from the SIGIO handler).

> Additionally, since both the idle loop and the signal/interrupt handlers
> do blocking reads from the TT socket, a deadlock can occur if a RUN
> message intended for the idle loop is received in the SIGIO handler. In
> this situation, the calendar expects the UML instance to run, but it
> actually waits for another message, either in the SIGIO handler (e.g.,
> a second interrupt) or in a poll in the idle loop, as the previous
> message was handled by the signal handler, which returned execution to
> the main loop and ultimately entered the idle loop.

I'm not sure I understand this part. How can you get a RUN message when
someone else is still running and sending you interrupts?

Or are you sending interrupts (with time travel handlers because of
patch 3?) from _outside_ of the simulation?! I guess that'd be a case of
"don't do that then"? :)

Or ... I'm still wrapping my head around the system you built ... are
you saying that interrupts are actually delayed and we have something
like this:

Participant A --- message ----------------> UML  (socket)
Participant A --- WAIT --> calendar
                  SIGIO ----> UML
calendar <-- GET --------------------- UML  (time_travel_handler)
calendar --- RUN --------------------> UML  (reads "RUN" instead of GET response)

But then that's again just because "you're doing it wrong", Participant A
should be waiting for an ACK message? At least that was the intent.
You're not using it that way with the line interrupts. But honestly, I
think that's just another reason not to do things the way you're doing
in patches 2 and 3.

> Therefore, this patch also allows checking whether the current RUN
> message was handled by the idle loop or by the signal/interrupt handlers
> in the ACK of the RUN.
>
> With the information in the ACK of the RUN message, the calendar knows
> whether the RUN was answered in a signal handler and can act
> accordingly.

TBH, I don't like this at all. If I'm not completely off with my
analysis above (or it's a similar situation), this just puts workaround
requirements on the calendar because you don't synchronise communication
in _other_ parts of the system properly.

In the simulation that we had intended to build, the exact timing of the
SIGIO wouldn't matter because UML would _always_ be in WAIT state when
getting SIGIO, and would always have to fully process the SIGIO (and
send an ACK on the respective socket) before the other side continues
and we can get to any other part of the system.

That's a pretty fundamental rule of the system because it allows us to
reason about how information flows, which part can be doing what at
which time, and (mostly) ignore the inherently asynchronous nature of
today's computing in the system, since things are always locally
serialised.

You broke that rule, and the system is coming apart at the seams. You
already found - as I pointed out in the other email - that you now have
to make some assumptions on the sender side of the message about how the
interrupt processing will be done on the receiver (although that
assumption is perhaps not more than "may be immediate so I need to
return to the calendar").
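To be explicit about the rule I keep referring to, the sender of such a
message is supposed to do roughly this - again just a sketch, none of
these names are the real API:

#include <stddef.h>
#include <unistd.h>

/*
 * intended serialisation: whoever injects a message (and thus a SIGIO)
 * into the UML instance blocks on the ACK for it before telling the
 * calendar it is done, so UML is guaranteed to still be in WAIT when
 * the SIGIO is processed and no RUN can race with the GET response
 */
static void participant_send(int msg_fd, int calendar_fd,
			     const void *msg, size_t len)
{
	char ack;
	char wait_req = 'W';	/* stand-in for the real WAIT message */

	write(msg_fd, msg, len);	/* raises SIGIO on the UML side */

	/* block until UML has fully processed the interrupt */
	read(msg_fd, &ack, 1);

	/* only now hand control back to the calendar */
	write(calendar_fd, &wait_req, 1);
}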
You also found that when you broke that rule, you got other asynchronous
things going on, and now need to handle the fact that you can suddenly
get a RUN message inside a SIGIO handler, because that's async and
nothing waits for it to have finished.

I honestly don't believe that removing this rule will do the system good
in the long term. You'll almost certainly find other corner cases - we
still find corner cases in the system _with_ the rule, and that's a
_much_ simpler system.

I also tend to think that the more complex system (without that rule)
will be much harder to optimise. For example, with the shared memory
table of run requests (yes yes, I'll send it in a bit), you don't even
have a GET message in time_travel_add_irq_event() any more. In fact you
have no messages at all there! That maybe makes this part easier? But it
also means now you have to reason about why it's actually still OK to
not get the RUN message where you previously expected it, etc.

Or if we consider further optimisations we've been thinking about (but
not implemented yet) - one of them is to use signals for wait/run so we
can hand off running between processes without going through the
calendar (with the calendar still maintaining the new distributed
protocol on behalf of clients that don't participate in a distributed
protocol yet). Not having the ACK would actually _defeat_ this
optimisation, because you wouldn't actually know who to hand it off to;
only the calendar would be able to do the handoff, since it's the only
one communicating with everyone. With the ACK you know that as soon as
you have the ACK, the shared table must be updated, and you can make
the hand-off decision locally instead of going to the calendar.

So ... I think all of this is an even longer-winded way than the last
discussion over on patch 3 of saying I don't like your simulation model
that doesn't require an ACK, and I don't think we should support it.

FWIW, it really shouldn't be that hard to replace the 'lines' with
virtio serial consoles or so. We've actually done that, we have a
device-side implementation of this:

#define VIRTIO_ID_RPROC_SERIAL 11 /* virtio remoteproc serial link */

(which is really just a virtio console, shows up as /dev/vport*p0 inside
the virtual system, rather than /dev/ssl* or whatever you have there
now). With the usfstl vhost device abstraction [1] it's less than 150
LOC, and much of it is boilerplate/struct filling...

[1] https://github.com/linux-test-project/usfstl/blob/main/include/usfstl/vhost.h

Although our implementation only supports a single connection and is
built into whatever generates the messages, if you want multiple UML
instances to communicate you'd have to have an implementation that has
a socket for each and a config file saying which is connected to where.

johannes
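PS: from inside the virtual system such a port is just a character
device, so the guest side of what you do with /dev/ssl* today would look
roughly like this (sketch only; the actual vport name depends on how the
ports enumerate in the guest):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[128];
	ssize_t n;
	/* rproc-serial/console port instead of /dev/ssl* */
	int fd = open("/dev/vport0p0", O_RDWR);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* whatever the simulation participant wants to exchange */
	write(fd, "hello", 5);
	n = read(fd, buf, sizeof(buf));
	if (n > 0)
		printf("got %zd bytes back\n", n);

	close(fd);
	return 0;
}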