On 11/29/2017 11:55 PM, Sagar Arun Kamble wrote:
On 11/30/2017 12:45 PM, John Harrison wrote:
On 11/29/2017 10:19 PM, Sagar Arun Kamble wrote:
On 11/30/2017 8:34 AM, John Harrison wrote:
On 11/24/2017 6:12 AM, Chris Wilson wrote:
Quoting Michał Winiarski (2017-11-24 12:37:56)
Since we see the effects for GuC preemption, let's gather some evidence.
(SKL)
intel_guc_send_mmio latency: 100 rounds of gem_exec_latency --r '*-preemption'
drm-tip:
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 44 | |
16 -> 31 : 1088 | |
32 -> 63 : 832 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 12 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 29899 |********* |
2048 -> 4095 : 131033 |****************************************|
Such pretty graphs. Reminds me of the bpf hist output, I wonder if we
could create a tracepoint/kprobe that would output a histogram for each
waiter (filterable ofc). Benefit? Just thinking of tuning the
spin/sleep, in which case overall metrics are best
(intel_wait_for_register needs to be optimised for the typical case). I
am wondering if we could tune the spin period down to 5us, 2us? And then
have the 10us sleep.
We would also need a typical workload to run, it's profile-guided
optimisation after all. Hmm.
-Chris
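For readers following along, the fast-spin-then-sleep pattern being
discussed boils down to something like the sketch below. This is an
illustration only: the helper name poll_spin_then_sleep() is made up,
and the spin budget / 10us sleep are the values mooted above, not the
actual intel_wait_for_register() implementation.

#include <linux/delay.h>   /* usleep_range() */
#include <linux/errno.h>
#include <linux/ktime.h>   /* ktime_get(), ktime_us_delta() */

/*
 * Busy-spin for a short, tunable period (e.g. 5us or 2us) in the hope
 * the condition clears quickly, then fall back to sleeping waits of
 * ~10us until the timeout expires.
 */
static int poll_spin_then_sleep(bool (*done)(void *ctx), void *ctx,
				unsigned int spin_us,
				unsigned int timeout_us)
{
	ktime_t start = ktime_get();

	/* Fast path: spin for the typical (short) completion time. */
	while (ktime_us_delta(ktime_get(), start) < spin_us) {
		if (done(ctx))
			return 0;
		cpu_relax();
	}

	/* Slow path: sleep in ~10us steps for the remainder. */
	while (ktime_us_delta(ktime_get(), start) < timeout_us) {
		if (done(ctx))
			return 0;
		usleep_range(10, 20);
	}

	return done(ctx) ? 0 : -ETIMEDOUT;
}

Tuning spin_us is then exactly the profile-guided question raised
above: it should cover the typical completion time without burning CPU
on the outliers.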
It took me a while to get back to this, but I've now had a chance to
run with this exponential backoff scheme on the original system
that showed the problem. It was a slightly messy back port due to
the customer tree being much older than current nightly. I'm pretty
sure I got it correct though. However, I'm not sure what the
recommendation is for the two timeout values. Using the default of
'10, 10' in the patch, I still get lots of very long delays.
The currently recommended settings are Wmin=10, Wmax=10 for wait_for_us
and Wmin=10, Wmax=1000 for wait_for.
Exponential backoff is most helpful inside wait_for when the
wait_for_us that precedes it is short.
Setting Wmax less than Wmin effectively changes the backoff strategy
to just linear waits of Wmin.
I have to up the Wmin value to at least 140 to get a stall-free
result, which is plausible given that the big spike in the results
of any fast version is at 110-150us. Also of note is that a Wmin
between 10 and 110 actually makes things worse. Changing Wmax has
no effect.
In the following table, 'original' is the original driver before
any changes and 'retry loop' is the version using the first
workaround of just running the busy poll wait in a 10x loop. The
other columns are using the backoff patch with the given Wmin/Wmax
values. Note that the times are bucketed to 10us up to 500us and
then in 500us lumps thereafter. The value listed is the lower
limit, i.e. there were no times of <10us measured. Each case was
run for 1000 samples.
The settings below, as in current nightly, should suit this workload
and, as you have found, will likely complete most waits in <150us.
If many samples had fallen beyond 160us but under 300us, we might
have needed to change Wmin to maybe 15 or 20 to ensure the
exponential rise caps out around 300us.
wait_for_us(): Wmin=10, Wmax=10
wait_for():    Wmin=10, Wmax=1000
(#define wait_for(COND, MS) _wait_for((COND), (MS) * 1000, 10, 1000))
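For context, the exponential backoff in question amounts to the loop
sketched below, a simplified function-style rendering of the
_wait_for() macro behaviour (the name wait_backoff() is illustrative
and error paths are trimmed): sleep usleep_range(w, 2w) between polls,
with w starting at Wmin and doubling each iteration until it is
clamped at Wmax.

#include <linux/delay.h>    /* usleep_range() */
#include <linux/errno.h>
#include <linux/jiffies.h>  /* usecs_to_jiffies(), time_after() */
#include <linux/kernel.h>   /* might_sleep(), min() */

static int wait_backoff(bool (*cond)(void *ctx), void *ctx,
			unsigned int timeout_us,
			unsigned int wmin_us, unsigned int wmax_us)
{
	unsigned long timeout = jiffies + usecs_to_jiffies(timeout_us) + 1;
	unsigned int w = wmin_us;

	might_sleep();
	for (;;) {
		bool expired = time_after(jiffies, timeout);

		if (cond(ctx))
			return 0;
		if (expired)
			return -ETIMEDOUT;

		/* Back off: double the sleep each time, capped at Wmax. */
		usleep_range(w, w * 2);
		if (w < wmax_us)
			w = min(w * 2, wmax_us);
	}
}

With Wmin=10/Wmax=10 this degenerates into fixed 10-20us sleeps; with
Wmin=10/Wmax=1000 the sleep period grows 10, 20, 40, ... up to 1ms.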
But as shown in the table, a setting of 10/10 does not work well for
this workload. The best results possible are a large spike of waits
in the 120-130us bucket with a small tail out to 150us, whereas the
10/10 setting produces a spike from 150-170us with the tail extending
to 240us and an appreciable number of samples stretching all the way
out to the 1-10ms range. A regular delay of multiple milliseconds is
not acceptable when this path is supposed to be a low latency
pre-emption to switch to some super high priority time critical task.
And as noted, I did try a bunch of different settings for Wmax but
nothing seemed to make much of a difference. E.g. 10/10 vs 10/1000
produced pretty much identical results. Hence it didn't seem worth
including those in the table.
Wmin = 10us leads to a total delay of 150us in 3 loops (this might be
tight for catching most conditions).
Wmin = 25us can lead to a total delay of 175us in 3 loops.
Since most conditions are likely to complete around 140us-160us, it
looks like a Wmin of 25 to 30 (25,1000 or 30,1000) will suit this
workload, but since this is profile-guided optimization I am wondering
about the optimal Wmin point.
This wait is very time critical. An exponential rise might not be a
good strategy at higher wait times, and usleep_range might also be
adding extra latency.
Maybe we should do exponential backoff only for waits with US >= 1000
and do periodic backoff for US < 1000 with a period of 50us?
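In other words, something along these lines (purely illustrative,
building on the wait_backoff() sketch above; the 1000us threshold and
50us period are simply the values proposed here):

static int wait_hybrid(bool (*cond)(void *ctx), void *ctx,
		       unsigned int timeout_us)
{
	/* Short waits: fixed-period 50us polling, no exponential rise. */
	if (timeout_us < 1000)
		return wait_backoff(cond, ctx, timeout_us, 50, 50);

	/* Long waits: exponential backoff from 10us up to 1ms. */
	return wait_backoff(cond, ctx, timeout_us, 10, 1000);
}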
The results I am seeing do not correspond. First off, it seems I get
different results depending upon the context. That is, in the context of
the pre-emption GuC send action command I get the results previously
posted. If I just run usleep_range(x, y) in a loop 1000 times from the
context of dumping a debugfs file, I get something very different.
Basically, the minimum sleep time is 110-120us irrespective of the
values of X and Y. Pushing X and Y beyond 120 seems to make it complete
in Y+10-20us. E.g. u_r(100,200) completes in 210-230us for 80% of
samples. On the other hand, I don't get anywhere near so many samples in
the millisecond range as when called in the send action code path.
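The kind of measurement loop described above looks roughly like this
(a hypothetical helper, shown only to make the measurement explicit;
it is not the exact code used for the numbers quoted):

#include <linux/delay.h>
#include <linux/ktime.h>
#include <linux/printk.h>

/* Time 1000 calls to usleep_range(x, y) and report the spread. */
static void measure_usleep_range(unsigned int x, unsigned int y)
{
	s64 min_us = 0, max_us = 0;
	int i;

	for (i = 0; i < 1000; i++) {
		ktime_t t0 = ktime_get();
		s64 delta;

		usleep_range(x, y);
		delta = ktime_us_delta(ktime_get(), t0);
		if (i == 0 || delta < min_us)
			min_us = delta;
		if (delta > max_us)
			max_us = delta;
	}

	pr_info("usleep_range(%u, %u): min=%lldus max=%lldus\n",
		x, y, min_us, max_us);
}

Running the same loop from both the debugfs context and the send
action path makes the two contexts directly comparable.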
However, it sounds like the underlying issue might actually be a
back-port merge problem. The customer tree in question is actually a
combination of a 4.1 base kernel with a 4.11 DRM dropped on top. As
noted in a separate thread, this tree also has a problem with the
mutex_lock() call stalling even when the mutex is very definitely not
acquired (using mutex_trylock() eliminates the stall completely).
Apparently the back port process did hit a bunch of conflicts in the
base kernel's scheduling code. So there is a strong possibility that all
the issues we are seeing in that tree are an artifact of a merge issue.
So I think it is probably safe to ignore the results I am seeing in
terms of what the best upstream solution should be.
Time      Original  10/10  50/10  100/10  110/10  130/10  140/10  RetryLoop
10us:     2 2 2 2 2 2 2 2
30us:     1 1 1 1 1
50us:     1
70us:     14 63 56 64 63 61
80us:     8 41 52 44 46 41
90us:     6 24 10 28 12 17
100us:    2 4 20 16 17 17 22
110us:    13 21 14 13 11
120us:    6 366 633 636 660 650
130us:    2 2 46 125 95 86 95
140us:    3 2 16 18 32 46 48
150us:    210 3 12 13 37 32 31
160us:    322 1 18 10 14 12 17
170us:    157 4 5 5 3 5 2
180us:    62 11 3 1 2 1 1
190us:    32 212 1 1 2
200us:    27 266 1 1
210us:    16 181 1
220us:    16 51 1
230us:    10 43 4
240us:    12 22 62 1
250us:    4 12 112 3
260us:    3 13 73 8
270us:    5 12 12 8 2
280us:    4 7 12 5 1
290us:    9 4
300us:    1 3 9 1 1
310us:    2 3 5 1 1
320us:    1 4 2 3
330us:    1 5 1
340us:    1 2 1
350us:    2 1
360us:    2 1
370us:    2 2
380us:    1
390us:    2 1 2 1
410us:    1
420us:    3
430us:    2 2 1
440us:    2 1
450us:    4
460us:    3 1
470us:    3 1
480us:    2 2
490us:    1
500us:    19 13 17
1000us:   249 22 30 11
1500us:   393 4 4 2 1
2000us:   132 7 8 8 2 1 1
2500us:   63 4 4 6 1 1 1
3000us:   59 9 7 6 1
3500us:   34 2 1 1
4000us:   17 9 4 1
4500us:   8 2 1 1
5000us:   7 1 2
5500us:   7 2 1
6000us:   4 2 1 1
6500us:   3 1
7000us:   6 2 1
7500us:   4 1 1
8000us:   5 1
8500us:   1 1
9000us:   2
9500us:   2 1
>10000us: 3 1
John.
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx