Daniel P. Berrangé <berra...@redhat.com> writes:

> On Fri, Oct 14, 2022 at 11:31:13AM +0200, Markus Armbruster wrote:
>> Daniel P. Berrangé <berra...@redhat.com> writes:
>> 
>> > On Thu, Oct 13, 2022 at 05:00:26PM +0200, Markus Armbruster wrote:
>> >> Denis Plotnikov <den-plotni...@yandex-team.ru> writes:
>> >> 
>> >> > Add "start" & "end" time values to qmp command responses.
>> >> 
>> >> Please spell it QMP.  More of the same below.
>> >> 
>> >> >
>> >> > These time values are added to let the qemu management layer get the
>> >> > exact command execution time without any other time variance which
>> >> > might be brought in by other parts of the management layer or qemu
>> >> > internals. This is particularly useful for management-layer logging
>> >> > for resolving problems later.
>> >> 
>> >> I'm still having difficulties seeing the value add over existing
>> >> tracepoints and logging.
>> >> 
>> >> Can you tell me about a problem you cracked (or could have cracked) with
>> >> the help of this?
>> >
>> > Consider your QMP client is logging all commands and replies in its
>> > own logfile (libvirt can do this). Having these start/end timestamps
>> > included means the QMP client log is self-contained.
>> 
>> A QMP client can include client-side timestamps in its log.  What value
>> is being added by server-side timestamps?  According to the commit
>> message, it's for getting "the exact command execution time without any
>> other time variance which might be brought by other parts of management
>> layer or qemu internals."  Why is that useful?  In particular, why is
>> excluding network and QEMU queueing delays (inbound and outbound)
>> useful?
>
> Let's say some command normally runs in ~100ms, but occasionally
> runs in 2s, and you want to understand why.
>
> A first step is understanding whether the command itself is slow
> to execute, or whether its execution was merely delayed by some
> other aspect of QEMU. If the server timestamps show it was very
> fast, then that indicates delayed processing. Thus instead of
> debugging the slow command, I can think about what scenarios would
> be responsible for the delay. Perhaps a previous QMP command was
> very slow, or maybe there is simply a large volume of QMP commands
> backlogged, or some part of QEMU got blocked.
>
> Another case would be a command that is normally fast, and sometimes
> is slower, but still relatively fast. The network and queueing side
> might be a significant enough proportion of the total time to obscure
> the slowdown. If you can eliminate the non-execution time, you can
> see the performance trends over time to spot the subtle slowdowns
> and detect abnormal behaviour before it becomes too terrible.
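
The kind of trend analysis described here is easy to sketch, assuming
replies carried the proposed "start"/"end" fields (field names per the
patch; the reply records, baseline, and threshold below are
hypothetical):

```python
# Hypothetical sketch: flag QMP commands whose server-side execution
# time drifts well above the typical baseline.  Assumes each archived
# reply carries the proposed "start"/"end" timestamps (seconds).
from statistics import median

def flag_slow(replies, factor=3.0):
    """Return (exec_time, reply) pairs noticeably slower than the median."""
    times = [(r["end"] - r["start"], r) for r in replies]
    baseline = median(t for t, _ in times)
    return [(t, r) for t, r in times if t > factor * baseline]

replies = [
    {"id": 1, "start": 100.000, "end": 100.005},
    {"id": 2, "start": 101.000, "end": 101.004},
    {"id": 3, "start": 102.000, "end": 102.250},  # abnormally slow
]
print(flag_slow(replies))
```

A real implementation would need per-command baselines and some care
with clock granularity; this only illustrates the idea.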

This is troubleshooting.  Asking for better troubleshooting tools is
fair.

However, the proposed timestamps provide much more limited insight than
existing tracepoints.  For instance, enabling

    monitor_qmp_*
    handle_qmp_command

results in something like

    2656634@1665750359.529763:handle_qmp_command mon 0x55f93a4375d0 req: {"execute": "qmp_capabilities", "arguments": {"enable": ["oob"]}}
    2656634@1665750359.529789:monitor_qmp_in_band_enqueue 0x7fbaa8003870 mon 0x55f93a4375d0 len 0
    2656629@1665750359.529997:monitor_qmp_in_band_dequeue 0x7fbaa8003870 len 0
    2656629@1665750359.530081:monitor_qmp_cmd_in_band 
    2656629@1665750359.530166:monitor_qmp_respond mon 0x55f93a4375d0 resp: {"return": {}}

This gives me timestamps for

    complete JSON value read from socket and parsed
    value enqueued in-band (with mutex held)
    value dequeued in-band (with mutex held)
    in-band command dispatched
    reply enqueued

The last two are pretty close to the timestamps added by the patch.
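
The stage-by-stage breakdown can be read straight off such a trace; a
small sketch using the timestamps printed above (parsing of the actual
trace file is elided):

```python
# Stage timestamps copied from the trace output above (seconds, taken
# from the pid@timestamp prefix of each trace line).
stages = [
    ("parsed",     1665750359.529763),  # handle_qmp_command
    ("enqueued",   1665750359.529789),  # monitor_qmp_in_band_enqueue
    ("dequeued",   1665750359.529997),  # monitor_qmp_in_band_dequeue
    ("dispatched", 1665750359.530081),  # monitor_qmp_cmd_in_band
    ("responded",  1665750359.530166),  # monitor_qmp_respond
]

# Time spent between consecutive stages, in microseconds.
for (prev, t0), (name, t1) in zip(stages, stages[1:]):
    print(f"{prev:>10} -> {name:<10} {(t1 - t0) * 1e6:4.0f} us")
```

The patch's timestamps would only ever expose the last two of these
five points.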

The patch's timestamps together with client-side timestamps can tell me
whether the command is delayed somewhere between the client's timestamp
(presumably close to socket send) and QEMU's dispatch.  It could be the
network, QEMU's networking code, QEMU's JSON parser, QEMU's queuing, or
QEMU's coroutine scheduling.
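
Put differently, with client-side send/receive timestamps plus the
proposed server-side pair, a round trip splits into exactly three
segments, and everything outside the middle one stays an
undifferentiated blob. A hypothetical helper (which also quietly
assumes the client and server clocks are comparable):

```python
def decompose(client_send, client_recv, server_start, server_end):
    """Split a client-observed QMP round trip using server timestamps.

    Hypothetical helper; all four arguments are seconds on comparable
    clocks.  Only "execution" is what the patch actually measures.
    """
    return {
        # network + QEMU's JSON parsing, queuing, coroutine scheduling,
        # lumped together with no way to tell them apart
        "pre_dispatch": server_start - client_send,
        # command dispatch to reply enqueue
        "execution": server_end - server_start,
        # response queuing + network back to the client
        "post_response": client_recv - server_end,
    }

print(decompose(10.000, 10.120, 10.100, 10.105))
```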

To narrow down the delay to something useful, I still need to resort to
tracepoints.

The value added by having timestamps in QMP is that they are "always
on".  But is that really worthwhile?  What would you *do* with
timestamps fished out of logs?

My gut feeling is that client-side timestamps suffice to signal "there
is a problem", but the additional server-side QMP timestamps are
unlikely to be enough to figure out the problem, so you enable
tracepoints and wait for the problem to return.

That's why I'm asking for concrete experience supporting "yes, it
is worthwhile".

