Stefano Stabellini writes:
> On Fri, 15 Jan 2021, Volodymyr Babchuk wrote:
[...]
>> >> [ 83.873162] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
>> >> [ 94.136632] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
>> >> [ 104.400098] rt_eval_tmu e6fc0000.tmu: Mean: 50 (1500 ns) stddev: 129 (3870 ns)
>> >> [ 114.663828] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
>> >> [ 124.927296] rt_eval_tmu e6fc0000.tmu: Mean: 56 (1680 ns) stddev: 183 (5490 ns)
>> >>
>> >> This is baremetal Linux. And this is Dom0:
>> >>
>> >> [ 237.431003] rt_eval_tmu e6fc0000.tmu: Mean: 306 (9180 ns) stddev: 25 (750 ns)
>> >> [ 247.694506] rt_eval_tmu e6fc0000.tmu: Mean: 302 (9060 ns) stddev: 17 (510 ns)
>> >>
>> >> The driver outputs both the raw timer value (e.g. 44) and the same
>> >> value scaled to nanoseconds (e.g. 1320 ns). As you can see, the
>> >> baremetal setup is much faster. But experiments showed that Linux
>> >> does not provide consistent values, even when running in baremetal
>> >> mode. You can see sporadic spikes in the "stddev" values.
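To clarify the units: the counter here appears to be the same 33.28 MHz
one that the Zephyr app reports below, and the driver rounds the period
to 30 ns per tick. A minimal sketch of the conversion (not the actual
driver code):

#include <stdint.h>

#define COUNTER_HZ 33280000ULL          /* 33.28 MHz counter */

/* exact conversion: 44 ticks -> ~1322 ns */
static uint64_t ticks_to_ns_exact(uint64_t ticks)
{
    return ticks * 1000000000ULL / COUNTER_HZ;
}

/* what the driver apparently prints: rounded 30 ns period, 44 -> 1320 ns */
static uint64_t ticks_to_ns(uint64_t ticks)
{
    return ticks * 30;
}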
>> >
>> > So baremetal IRQ latency is 1320-1680ns and Linux IRQ latency is
>> > 9060-9180ns. I am not surprised that Linux results are inconsistent but
>> > I have a couple of observations:
>> >
>> > - 9us is high for Linux
>> > If the system is idle, the latency should be lower, around 2-3us. I
>> > imagine you are actually running some sort of interference from dom0? Or
>> > using RTDS and descheduling vCPUs?
>>
>> 9us was in the idle state. Interestingly enough, I got a latency of
>> 3us while Dom0 was doing some CPU-intensive tasks. So, under load the
>> latency is lower than in the idle state. I didn't investigate this, so
>> I can't tell you what causes this behavior.
>
> Did you use vwfi=native? You should definitely be able to see ~3us
> without interference and with 1vCPU <-> 1pCPU.
With vwfi=native things in Dom0 got better, but it is still not 3us:
[ 41.563904] rt_eval_tmu e6fc0000.tmu: Mean: 173 (5190 ns) stddev: 15 (450 ns) min: 125 (3750 ns) max: 276102 (8283060 ns)
[ 51.820403] rt_eval_tmu e6fc0000.tmu: Mean: 172 (5160 ns) stddev: 8 (240 ns) min: 125 (3750 ns) max: 276102 (8283060 ns)
[ 62.076898] rt_eval_tmu e6fc0000.tmu: Mean: 172 (5160 ns) stddev: 9 (270 ns) min: 125 (3750 ns) max: 276102 (8283060 ns)
[ 72.333377] rt_eval_tmu e6fc0000.tmu: Mean: 171 (5130 ns) stddev: 9 (270 ns) min: 125 (3750 ns) max: 276102 (8283060 ns)
[ 82.589852] rt_eval_tmu e6fc0000.tmu: Mean: 172 (5160 ns) stddev: 10 (300 ns) min: 125 (3750 ns) max: 276102 (8283060 ns)
And DomU (as Julien suggested) is extremely unhappy:
Mean: 282 (8460 ns) stddev: 412 (12360 ns) above thr: 4% [109 (3270 ns) - 2245 (67350 ns)] global [108 (3240 ns) 74261 (2227830 ns)]
Mean: 958 (28740 ns) stddev: 860 (25800 ns) above thr: 50% [120 (3600 ns) - 2253 (67590 ns)] global [108 (3240 ns) 74261 (2227830 ns)]
Mean: 713 (21390 ns) stddev: 533 (15990 ns) above thr: 37% [114 (3420 ns) - 2186 (65580 ns)] global [108 (3240 ns) 74261 (2227830 ns)]
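For reference, vwfi=native is a Xen boot option; my command line for
these runs looks roughly like this (a sketch, exact memory values are
board-specific, and sched=null only applies to the 1vCPU <-> 1pCPU
experiments):

dom0_mem=512M dom0_max_vcpus=1 sched=null vwfi=native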
>
>> > - the stddev of 3870ns is high for baremetal
>> > In the baremetal case the stddev should be minimal if the system is
>> > idle.
>>
>> This is what I expected too. But nevertheless there were spikes. I
>> didn't investigate this either, so I can only speculate: my rootfs is
>> on NFS, so maybe the network driver caused these spikes.
>
> Yeah, maybe it would be best to reduce the sources of possible spikes
> and get rid of NFS.
I agree there, but I am working remotely, so NFS is easier. I'll try
with eMMC later.
>
>> >
>> >
>> >> So my next step was to use a proper RT OS to do the measurements. I
>> >> chose Zephyr. My PR that adds Xen support to Zephyr can be found at
>> >> [1]. Support for RCAR Gen3 is not upstreamed, but is present on my
>> >> GitHub ([2]). At [3] you can find the source code for the application
>> >> that does the latency measurements. It behaves exactly like my Linux
>> >> driver, but provides a bit more information:
>> >>
>> >> *** Booting Zephyr OS build zephyr-v2.4.0-2750-g0f2c858a39fc ***
>> >> RT Eval app
>> >>
>> >> Counter freq is 33280000 Hz. Period is 30 ns
>> >> Set alarm in 0 sec (332800 ticks)
>> >> Mean: 600 (18000 ns) stddev: 3737 (112110 ns) above thr: 0% [265 (7950 ns) - 66955 (2008650 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
>> >> Mean: 388 (11640 ns) stddev: 2059 (61770 ns) above thr: 0% [266 (7980 ns) - 58830 (1764900 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
>> >> Mean: 358 (10740 ns) stddev: 1796 (53880 ns) above thr: 0% [265 (7950 ns) - 57780 (1733400 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
>>
>> This is Zephyr running as DomU.
>>
>> >> ...
>> >>
>> >> So there you can see: the mean time, standard deviation, % of
>> >> interrupts that were processed above the 30us threshold, minimum and
>> >> maximum latency values for the current 10s run, and the global
>> >> minimum and maximum.
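For the curious, the statistics are accumulated in the straightforward
way. A minimal C sketch of what the app computes per 10s batch (not the
actual Zephyr code; a struct stats is zero-initialized before each
batch):

#include <stdint.h>
#include <stdio.h>
#include <math.h>

#define TICK_NS   30            /* counter period in ns */
#define THR_TICKS 1000          /* 30 us threshold = 1000 ticks */

struct stats {
    uint64_t n, sum, sum_sq, above;
    uint32_t min, max;
};

static void stats_add(struct stats *s, uint32_t ticks)
{
    s->n++;
    s->sum += ticks;
    s->sum_sq += (uint64_t)ticks * ticks;
    if (ticks > THR_TICKS)
        s->above++;
    if (s->n == 1 || ticks < s->min)
        s->min = ticks;
    if (ticks > s->max)
        s->max = ticks;
}

static void stats_print(const struct stats *s)
{
    double mean = (double)s->sum / s->n;
    /* Var(X) = E[X^2] - (E[X])^2 */
    double sd = sqrt((double)s->sum_sq / s->n - mean * mean);

    printf("Mean: %.0f (%.0f ns) stddev: %.0f (%.0f ns) above thr: %.0f%% "
           "[%u (%u ns) - %u (%u ns)]\n",
           mean, mean * TICK_NS, sd, sd * TICK_NS,
           100.0 * s->above / s->n,
           s->min, s->min * TICK_NS, s->max, s->max * TICK_NS);
}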
>> >>
>> >> Zephyr running as baremetal showed very stable results (this is an
>> >> older build, so no extended statistics there):
>> >>
>> >> ## Starting application at 0x480803C0 ...
>> >> *** Booting Zephyr OS build zephyr-v2.4.0-1137-g5803ee1e8183 ***
>> >> RT Eval app
>> >>
>> >> Counter freq is 33280000 Hz. Period is 30 ns
>> >> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> >> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> >> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> >> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> >> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> >> Mean: 31 (930 ns) stddev: 0 (0 ns)
>>
>> And this is Zephyr running as baremetal.
>>
>> >> ...
>> >>
>> >> As Zephyr provided stable readouts with no jitter, I used it to do all
>> >> subsequent measurements.
>> >
>> > I am a bit confused here. Looking at the numbers above the stddev is
>> > 112110 ns in the first instance. That is pretty high. Am I looking at
>> > the wrong numbers?
>>
>> I added some clarification above. As for the 112110ns in the very first
>> instance - I always ignored the first instance, assuming that things
>> need to settle after the domain is created.
>>
>> But your comment is actually correct: what exactly should "settle"?
>> The domain is already created. Everything should run smoothly. So,
>> this is worth investigating.
>
> It is fair to ignore the first 2 measurements.
>
> However, the numbers above have a high stddev even at the 3rd
> measurement: 53us is high and above the target 30us. I take it you
> didn't yet apply the alloc_heap_pages and serial workarounds?
Yes, this was the very first try, without the workarounds and proper
scheduler configuration. With all the fixes it is much better:
Mean: 307 (9210 ns) stddev: 4 (120 ns) above thr: 0% [265 (7950 ns) - 376 (11280 ns)] global [265 (7950 ns) 376 (11280 ns)]
Mean: 306 (9180 ns) stddev: 3 (90 ns) above thr: 0% [263 (7890 ns) - 360 (10800 ns)] global [263 (7890 ns) 376 (11280 ns)]
Mean: 306 (9180 ns) stddev: 3 (90 ns) above thr: 0% [298 (8940 ns) - 373 (11190 ns)] global [263 (7890 ns) 376 (11280 ns)]
Mean: 307 (9210 ns) stddev: 17 (510 ns) above thr: 0% [265 (7950 ns) - 856 (25680 ns)] global [263 (7890 ns) 1251 (37530 ns)]
Mean: 304 (9120 ns) stddev: 7 (210 ns) above thr: 0% [286 (8580 ns) - 486 (14580 ns)] global [263 (7890 ns) 1251 (37530 ns)]
Mean: 307 (9210 ns) stddev: 47 (1410 ns) above thr: 0% [260 (7800 ns) - 1299 (38970 ns)] global [260 (7800 ns) 1299 (38970 ns)]
Mean: 291 (8730 ns) stddev: 41 (1230 ns) above thr: 0% [250 (7500 ns) - 1324 (39720 ns)] global [250 (7500 ns) 1324 (39720 ns)]
>
>> >> IMPORTANT! All subsequent tests were conducted with only 1 CPU core
>> >> enabled. My goal was to ensure that the system can react to an
>> >> external interrupt in a timely manner even under load.
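To be concrete, "1 CPU core enabled" means booting Xen with maxcpus=1,
so that all domains compete for the same pCPU. A sketch of the
configuration (values are illustrative):

# Xen command line: restrict the whole system to one physical core
maxcpus=1 dom0_max_vcpus=1

# DomU config (xl): one vCPU, sharing pCPU0 with Dom0
vcpus = 1
cpus = "0"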
>> >
> All right. FYI I have no frame of reference for 2 vCPUs on 1 pCPU; all
> my tests were done with 1vCPU <-> 1pCPU and the null scheduler.
>>
>> As I said, I had no issues with the 1vCPU <-> 1pCPU setup, so I quickly
>> moved on to the cases which had issues.
>>
>> >> Test results and latency sources
>> >>
>> >> As you can see, the baremetal OS provides a latency of ~0.9us without
>> >> any deviation. The same code running as a DomU on an idle system shows
>> >> a mean latency of 12us and a deviation of about 10us. The range of
>> >> latencies in a 10s batch can vary from 8us to 25us. This fits into the
>> >> required 30us threshold, so no big issue there.
>> >>
>> >> But this worsens under certain conditions.
>> >>
>> >> 1. Serial console. The RCAR serial driver (scif) works in synchronous
>> >> mode, so any heavy serial console output leads to higher latency.
>> >> Tests show a mean latency of 1000us and a deviation of 1332us; 54%
>> >> of interrupts are handled outside of the 30us window. The real
>> >> values may be even higher, because in some cases the timer would
>> >> complete a full cycle, and instead of, say, 11ms the ISR would read
>> >> 11 % 10 = 1ms of latency. I tried to enable asynchronous mode for
>> >> the serial console. This made things better, but it led to gaps in
>> >> the output, so I just turned the serial console off completely.
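To illustrate the wrap-around: the measurement timer reloads every
10 ms, so the ISR can only ever observe the latency modulo the timer
period. A sketch, assuming a 10 ms auto-reload period:

#include <stdint.h>

#define PERIOD_NS 10000000ULL   /* 10 ms timer reload period */

/* the ISR reads elapsed time since the last reload, so any latency
   above one period is silently folded back into [0, 10 ms) */
static uint64_t observed_latency_ns(uint64_t real_latency_ns)
{
    return real_latency_ns % PERIOD_NS;   /* 11 ms is reported as 1 ms */
}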
>> >
>> > That's very interesting. I wonder if other serial drivers would cause
>> > similar issues, e.g. PL011.
>>
>> They should. This behavior is programmed in serial.c. A driver can
>> enable async mode by calling serial_async_transmit(). As far as I can
>> see, only the ns16550 driver does this.
>>
>> Maybe you didn't see any problems there because you had more pCPUs
>> enabled and Xen used some other pCPU to do the UART operations.
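To show what I mean, here is a standalone sketch of the two modes (this
is not Xen's actual serial.c; the uart_*/txbuf_* helpers are
hypothetical):

#include <stdbool.h>

/* hypothetical HW helpers, assumed for the sketch */
extern bool uart_tx_ready(void);
extern void uart_putc(char c);
extern void uart_enable_tx_irq(void);
extern void txbuf_put(char c);         /* hypothetical TX ring buffer */

/* synchronous mode: the caller busy-waits for every character, so a
   long print from Xen stalls that pCPU (~87 us per character at
   115200 baud) */
void sync_puts(const char *s)
{
    for (; *s; s++) {
        while (!uart_tx_ready())
            ;                          /* spin until the FIFO drains */
        uart_putc(*s);
    }
}

/* asynchronous mode (what serial_async_transmit() enables): the caller
   only copies into a buffer; the TX-ready interrupt drains it later */
void async_puts(const char *s)
{
    for (; *s; s++)
        txbuf_put(*s);
    uart_enable_tx_irq();
}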
>
> Good to know. A workaround would be to let Dom0/DomU have direct access
> to the UART, not giving any UART to Xen at all.
Yes. Or use SSH. I chose the latter.
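For the record, "not giving any UART to Xen" on Arm amounts to dropping
the console options from Xen's command line, e.g. (a sketch, untested):

# with console=dtuart / dtuart= omitted, Xen does not claim the SCIF,
# and Dom0 drives it through its normal device tree node
dom0_mem=512M dom0_max_vcpus=1 sched=null vwfi=native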
[...]
--
Volodymyr Babchuk at EPAM