On 02/04/2011 05:03 AM, Jes Sorensen wrote:
On 02/03/11 18:41, Michael Roth wrote:
On 02/02/2011 02:48 AM, Jes Sorensen wrote:
Yes very much so[1] - one reason why it would be nice to have virtagent
use threads to execute the actual commands. We should probably add a
flag to agent commands indicating whether they issue disk I/O or not, so
we can block attempts to execute commands that do so, while the guest is
frozen.
**Warning, epic response**
For things like logging and i/o on a frozen system...I agree we'd need
some flag for these kinds of situations. Maybe a disable_logging()
flag....i really don't like this though... I'd imagine even syslogd()
could block virtagent in this type of situation, so that would need to
be disabled as well.
One way to resolve this would be to have the logging handled in it's own
thread, which uses non blocking calls to do the actual logging.
Obviously we'd have to use a non file system based communication method
between the main thread and the logging thread :)
I suspect syslogd() already buffers to some extent. But the problem
there, as well the problem with having our own buffered logging
implementation, is that we can't rely on being able to log an arbitrary
number of messages. As some point that interface would need to either
block, or stop dropping log messages.
If it blocks, we're deadlocked again. If it drops messages, it's trivial
for someone to flood it with messages after the fsfreeze() to get back
into a state where they can execute RPC without any accounting.
So a seperate logging thread doesn't buy us much, and what we come up
with is rely gonna come down to syslogd:
1) if syslogd blocks, we must disable logging after fsfreeze().
Buffering, on syslogd's side or our own, only buys us time. If we
disable logging, we must only allow absolutely required/benign RPCs.
fsstatus/fsthaw are required, things like va.ping are benign, and
potentially useful, in this particular situation.
copyfile/shutdown/exec/etc should be considered "high-risk"...we want to
make sure these are logged.
2) if syslogd doesnt block, it either drops messages at some point with
no indication to us, or it drops them and provides some indication. If
there's no indication, we must treat this the same way we treat 1),
since we must assume that logging is effectively disabled. So only
required or benign RPCs.
if we get some indication when a call to syslog() fails to log/buffer,
we can allow all RPCs, but treat failures to log as a cue to immediately
cleanup and exit the RPC. fsfreeze() under this circumstance will need
to make sure it only logs after it does the unfreeze, else we may never
be able to unfreeze at that point forward.
So the solution is dependent on syslogd's behavior. Ill have to do some
testing to confirm...but I think the above covers the possibilities.
But doing so completely subverts our attempts and providing proper
accounting of what the agent is doing to the user. A user can freeze the
filesystem, knowing that logging would be disabled, then prod at
whatever he wants. So the handling should be something specific to
fsfreeze, with stricter requirements:
If a user calls fsfreeze(), we disable logging, but also disable the
ability to do anything other than fsthaw() or fsstatus(). This actually
solves the potential deadlocking problem for other RPCs as well...since
they cant be executed in the first place.
I disagree that we should block all calls, except for fsfreeze/fsstatus/
fsthaw in this case. There are other calls that could be valid in this
situations, so I think it needs to be evaluated on a case by case basis.
So a solution for these situations is still needed, and I'm starting to
agree that threads are needed, but I don't think we should do RPCs
concurrently (not sure if that's what is being suggested or not). At
least, there's no pressing reason for it as things currently stand
(there aren't currently any RPCs where fast response times are all that
important, so it's okay to serialize them behind previous RPCs, and
HMP/QMP are command at a time), and it's something that Im fairly
confident can be added if the need arises in the future.
Eventually I think we will need to be able to support concurrent RPC
calls. There can be situations where an operation takes a long time
while it is valid to be able to ping the guest agent to verify that it
is still alive etc.
Currently we're limited to 1 RPC at a time by the monitor/hmp/qmp
(except for some niche cases where virtagent has multiple requests in
flight). Ideally this will remain the interface we expose to users, but
there could be situations where this isn't the case...for instance,
other bits of qemu making direct calls into virtagent. I think we can
add support for concurrent RPCs incrementally though...I've thought over
the implementation details a bit and it seems to be fairly
straightforward. I think we need to get the basic use cases down first
though.
Cheers,
Jes