The reply below is way too long, but oh well. The summary is: I am 100% certain that my ExprCount() loop is CPU bound. It is unfortunate that 'time' gives inaccurate information about this. It would be nice if the documentation acknowledged that, in order to save future generations from confusion. I think my test shows this quite effectively, and I have verified the CPU-bound nature of the test in many ways. While your results are different, they also show 'time' failing to recognize that the loop is CPU bound. I believe that your system is showing behavior consistent with mine -- the scheduler is working correctly, but 'time' is being misleading.
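For concreteness, the loop is essentially this (a simplified sketch, not necessarily the exact counttest.sh; the real script is in the blog post linked below):

    ExprCount() {
      i=0
      while [ $i -lt 1000 ]; do
        i=$(expr $i + 1)    # forks and execs a new expr process each iteration
      done
      echo "$i"
    }
    time ExprCount

Each iteration pays for a full fork/exec/exit of expr, so one run is roughly 1,000 process creations. That is why the CPU time lands in many short-lived expr processes rather than in bash itself, and why per-process tools miss it.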
You seem convinced that the task is not CPU bound on my machine, but I'm convinced that it is. The main reason I believe this is that if it is not CPU bound -- if there are significant delays between when bash and expr are running -- then that is a pri-0 bug in the Linux kernel. This would have massive implications for the efficiency of web servers and other systems that depend on Linux. Therefore a high burden of proof is needed to justify that claim.

The only evidence I saw that my system was not CPU bound when running my test script was 'time'. That's it. On the other side, as evidence that it is CPU bound (i.e., one CPU core's worth of CPU time being consumed per second) I have the following:

- It should be, and if it isn't it is a serious kernel bug.
- "mpstat -P ALL 1" shows 99.7% idle when the script is not running, and this drops to ~86.8% idle when it is running. That's a difference of 12.9%, which represents ~one core on my eight-core home system.
- "iostat -c 1" shows identical results.
- 'top' shows idle time of ~99.7% and ~86.8% in the summary section at the top, depending on whether the test is running. However, in the list of processes it only shows counttest.sh, using 15% of a core. That means that top is internally inconsistent -- it fails to show where the other 85% of a core's time is being spent. This is (IIRC) a known weakness of top: short-lived processes don't show up in the process list.
- The scheduler graph shown in my blog post clearly shows what is happening. You can see the full version of it here: http://www.cygnus-software.com/images/ZoomScreenshot_croppedbig.png It shows bash (in red) running for fractions of a millisecond, and each expr instance running about 0.5 ms. The OS is able to switch between processes far more often than every 24 ms because when expr ends the operating system *knows* that bash was waiting for it, so it immediately wakes up bash. The full scheduler doesn't have to run -- it just wakes up whoever was waiting. That is a crucial feature of the Windows and Linux kernels.
- xosview shows the load at 12-13% when running ExprCount(), and 0.1% when not.
- qps only showed four of my eight CPUs and I couldn't see a summary of total CPU usage. I'm not sure what scheduler information it was supposed to show.
- "perf stat counttest.sh ExprTest" shows 0.984 CPUs utilized.

So, *lots* of evidence that the bash process is CPU bound on my system. And this is correct and proper -- the scheduler in my kernel is working correctly. The kernel I'm running is shown below (uname output); I have a 64-bit 3.5 kernel at work -- same behavior.

Note that 'top' is potentially confusing because the units of the global CPU percentages and the process CPU percentages are different. The global CPU percentage is usage/num-cores, whereas the process percentage is usage-of-one-core. It takes eight processes at 100% to get the global percentage on an eight-core system to 100% (see the worked example below).

> At full load, my system shows 1200% cpu as measured by the standard
> definition of CPU seconds / real time.

On my version of 'top' (3.2.8) it does not use that definition *in its summary area*. It uses CPU seconds / (real time * num-cores). So if I have all eight cores fully loaded *then* it shows 100% CPU utilization. In the individual process area it uses your definition, but at the top it shows the percentage of total-available-CPU power.

> But 'top' never shows over 20% usage

Where? In the total of user and system at the top, or in the per-process section?
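To make the difference in units concrete, here is the arithmetic for one fully busy core on an eight-core system:

    per-process %CPU:  CPU seconds / real time               = 1 core     -> 100%
    summary %CPU used: CPU seconds / (real time * num cores) = 1/8        -> 12.5%
    summary %idle:                                             100 - 12.5 -> 87.5%

That 12.5%-used / 87.5%-idle split is exactly the ~13% drop in idle time that mpstat, top's summary area, and xosview all report while my test is running: one core's worth of work.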
Also, you said 'time' was showing your test using 71.48% CPU, and then you said that 'top' was showing 20% -- that's a heck of an anomaly.

> If your kernel is spending 80% of its time in the scheduler when there
> are no tasks being scheduled, I'd say you have a problem.

It's not. It's spending 80% of its time in process creation. This is shown clearly in the screenshot, and in the sampling profiler data that I examined.

> You can control that with cpuaffinity.

Yes, I know. I mention that technique in my blog post. I also mention that it is generally a bad idea unless your machine is devoted to a single task.

sched_latency_ns on my system is 24000000, just as on yours.

$ uname -a
Linux brucedglados2 3.2.0-38-generic-pae #60-Ubuntu SMP Wed Feb 13 13:47:26 UTC 2013 i686 i686 i386 GNU/Linux

I tried grabbing /proc/config.gz but that file doesn't exist.

Blog post is here:
http://randomascii.wordpress.com/2013/03/18/counting-to-ten-on-linux/

> If that latency of 24 is how long before the scheduler wakes up and
> switches tasks

No. Definitely not. The 24 ms scheduler latency is for rescheduling of long-running tasks. It does not apply to the case where a process is waiting on some event, such as a child process dying. I am 100% positive about this, on Linux and on Windows. Both operating systems can, if conditions demand it, schedule threads/processes hundreds of times in 24 ms. I look at process scheduling data on Windows all the time, and if Linux were that much less efficient then Steve Ballmer would be laughing maniacally and Linux wouldn't be used on servers.

> Note.. you said your loop took 2 seconds on your 3.2GHz sys.

Yes, but that was when my system was running at 1.1 GHz. When my system ramps up to its full speed, the 1,000 iterations are done in less than a second. Our systems are quite consistent there.

-----Original Message-----
From: Linda Walsh [mailto:b...@tlinx.org]
Sent: Tuesday, March 19, 2013 3:41 PM
To: Bruce Dawson
Cc: 'bug-bash'
Subject: Re: Bug/limitation in 'time' (kernel setings?)...

Linda Walsh wrote:
>
> Bruce Dawson wrote:
>> How many CPUs do you have? 20% usage doesn't mean anything without
>> knowing that. On a five-core system 20% usage would imply one full core.
---
Another tool to check out if you have it is 'qps'... it can show the time slice/schedule interval... I can see it scheduling 24 ms intervals for bash when it runs a separate proc/loop. It sounds a bit sketchy to me why it would take 76 ms for it to be scheduled for another run, but... that's about what it measures out to in real time. The fact that your kernel doesn't show you that... I don't know what to say, other than it's how your kernel is configured.

Now looking in my /proc/sys/kernel I see sched_latency_ns = 24000000 (24 ms). I also see the scheduler doesn't wake up all the time, but about every 4 ms. (I inserted the commas):

/proc/sys/kernel# for i in sched*; do
> echo $i: $(<$i)
> done
sched_autogroup_enabled: 1
sched_child_runs_first: 0
sched_domain:
sched_latency_ns: 24,000,000
sched_migration_cost_ns: 500,000
sched_min_granularity_ns: 3,000,000
sched_nr_migrate: 32
sched_rt_period_us: 1,000,000
sched_rt_runtime_us: 950,000
sched_shares_window_ns: 10,000,000
sched_time_avg_ms: 1000
sched_tunable_scaling: 1
sched_wakeup_granularity_ns: 4,000,000

So if the child runs first = 0, the parent keeps running at first. If that latency of 24 is how long before the scheduler wakes up and switches tasks, that's 24 ms wasted, then the same for the child.
That's about a 33% duty cycle there -- which can likely be adjusted by tweaking the above numbers...