Hi Ryan,

On 01/26/2018 12:20 AM, Ryan Thoryk wrote:
> Package: linux-image-4.9.0-5-amd64
> Version: 4.9.65-3+deb9u2
> Severity: normal
>
> I'm having an issue with CPU usage reporting, tested on kernels 4.9.0-3
> and 4.9.0-5. The machines are running on Amazon EC2, which could be
> related. With the "sar" utility, after some time, the system's "steal"
> value periodically is 100%,

This means that your vcpus want to execute work but are not being
scheduled on a physical cpu core. Either the physical machine is getting
more work than it can handle from all the virtual machines requesting
cpu time, or something else is going on, like your virtual machine
getting paused (e.g. during live migration there's a handover moment
when it's briefly paused and then resumed, which also shows up as a
short 100% steal spike).

> and the normal CPU user/system values,
> including idle, are always 0. When running a cpu-intensive app and
> using the "top" utility, the user and system values are always 0, the
> "idle" field stays at 100%, and only the "wait" field increases.

Sounds a lot like this one:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=871608

A patch to fix that cpu accounting breakage (picked from linux 4.15) was
included in 4.9.65-3. So 4.9.0-3 is the only one of the two kernels on
which you could be seeing that bug (which exact package version is
that?).

> The attached file shows the "sar" output around the time the issue
> started. This has happened on 2 separate machines (started at different
> times on each), and a reboot appears to (temporarily) fix the issue.
> I'm wondering if anyone else has this issue, and if it could be
> something to do with the hypervisor.

Because the steal time fix mentioned above was included in a version in
between the 2 versions you mention, my first suggestion would be to
check whether the symptoms on the old and the new kernel are exactly the
same, or only similar.

Hans
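
P.S. In case it helps with comparing the two kernels: the percentages
that sar and top show are derived from the counters on the aggregate
"cpu" line of /proc/stat. Below is a minimal sketch in Python of that
calculation, assuming the column order documented in proc(5). The
function names are made up for illustration, and this is not how sar
itself is implemented.

```python
# Column order of the aggregate "cpu" line in /proc/stat, per proc(5).
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")


def parse_cpu_line(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat text as a dict."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            # Older kernels expose fewer columns; treat missing ones as 0.
            values += [0] * (len(FIELDS) - len(values))
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def percentages(before, after):
    """Share of each counter in the time elapsed between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    # Guest time is already counted inside user/nice, so leave it out of
    # the total to avoid counting it twice.
    total = sum(d for name, d in deltas.items()
                if name not in ("guest", "guest_nice"))
    if total <= 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * d / total for name, d in deltas.items()}


def read_sample(path="/proc/stat"):
    """Read and parse one sample from the running system (Linux only)."""
    with open(path) as f:
        return parse_cpu_line(f.read())
```

Taking two read_sample() results a second or so apart and passing them
to percentages() on both the old and the new kernel would show whether
the raw counters behave the same way on each, independent of sar or top.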