* Dominik Brodowski <li...@dominikbrodowski.net> wrote:

> On Wed, Feb 07, 2018 at 12:44:41PM -0800, Linus Torvalds wrote:
> > On Wed, Feb 7, 2018 at 12:15 PM, Dominik Brodowski
> > <li...@dominikbrodowski.net> wrote:
> > >
> > > Note: The testb $3, CS(%rsp) instruction in idtentry() does not need
> > > modification. Previously %rsp was manually decreased by 15*8; with
> > > this patch, %rsp is decreased by 15 pushq instructions. Moreover,
> > > error_entry did and does the exact same test (with offset=8) after
> > > the registers have been moved/pushed and cleared.
> > 
> > So this has the problem that now those save/clear instructions will
> > all be done in that idtentry macro.
> > 
> > So now that code will be duplicated for all the users of that macro.
> > 
> > The old code did the saving in the common error_entry and
> > paranoid_entry routines, in order to be able to share all the code,
> > and making the duplicated stub functions generated by the idtentry
> > macro smaller.
> > 
> > Now, admittedly the new push sequence is much smaller than the old
> > movq sequence, so the duplication doesn't hurt as much, but it's still
> > likely quite noticeable.
> > 
> > So this removes lines of asm code, but it adds a lot of instructions
> > to the end result thanks to the macro, I think.
> 
> Indeed, that is the case (see below). However, if we want to switch to
> PUSH instructions and do this in a routine which is call'ed and which
> ret'urns, %rsp needs to be moved around even more often than the old
> ALLOC_PT_GPREGS_ON_STACK macro did (which you wanted to get rid of,
> IIUYC). Or do I miss something?
> 
>    text          data     bss     dec     hex filename
>   19500             0       0   19500    4c2c arch/x86/entry/entry_64.o-orig
>   19510             0       0   19510    4c36 arch/x86/entry/entry_64.o-3_of_7
>   21105             0       0   21105    5271 arch/x86/entry/entry_64.o-5_of_7
>   24307             0       0   24307    5ef3 arch/x86/entry/entry_64.o-7_of_7

So while this shows a +~5K increase in text size, these numbers also _very_ 
significantly over-represent the extra footprint. The assumptions that resulted 
in 
us compressing the IRQ entry code have changed very significantly with the new 
x86 
IRQ allocation code we introduced in the last year:

- IRQ vectors are usually populated in tightly clustered groups.

  With our new vector allocator code the typical per CPU allocation percentage 
on
  x86 systems is ~3 device vectors and ~10 fixed vectors out of ~220 vectors -
  i.e. a very low ~6% utilization (!). This means that the _real_ text footprint
  increase is probably closer to 0.1K of text...

  This is a very typical picture on an average desktop system:

  /sys/kernel/debug/irq/domains/VECTORS:

   Online bitmaps:       40
   Global available:   8007
   Global reserved:      24
   Total allocated:      73
   System: 42: 0-19,32,50,128,237-255

 | CPU | avl | man | act | vectors
     0   199     0     3  33-34,48
     1   199     0     3  33-35
     2   199     0     3  33-35
     3   199     0     3  33-35
     4   199     0     3  33-35
     5   199     0     3  33-35
     6   199     0     3  33-35
     7   199     0     3  33-35
     8   199     0     3  33-35
     9   199     0     3  33-35
    10   201     0     1  33
    11   201     0     1  33
    12   201     0     1  33
    13   201     0     1  33
    14   201     0     1  33
    15   201     0     1  33
    16   201     0     1  33
    17   201     0     1  33
    18   201     0     1  33
    19   201     0     1  33
    20   199     0     3  33-35
    21   199     0     3  33-35
    22   199     0     3  33-35
    23   199     0     3  33-35
    24   199     0     3  33-35
    25   199     0     3  33-35
    26   199     0     3  33-35
    27   199     0     3  33-35
    28   199     0     3  33-35
    29   199     0     3  33-35
    30   201     0     1  33
    31   201     0     1  33
    32   201     0     1  33
    33   202     0     0  
    34   202     0     0  
    35   202     0     0  
    36   202     0     0  
    37   202     0     0  
    38   202     0     0  
    39   202     0     0  

  I.e. the average number of (device) IRQ vectors per CPU is around 2, with the 
  max being 3.

  The days where we allocated a lot of vectors on every CPU and the compression 
  of the IRQ entry code text mattered are over.

- Another issue is that only a small minority of vectors is frequent enough to
  actually matter to cache utilization in practice: 3-4 key IPIs and 1-2 device 
  IRQs at most - and those vectors tend to be tightly clustered as well into 
about 
  two groups, and are probably already on 2-3 cache lines in practice.

  For the common case of 'cache cold' IRQs it's the depth of the call chain and 
  the fragmentation of the resulting I$ that should be the main performance 
limit
  - not the overall size of it.

- The CPU side cost of IRQ delivery is still very expensive even in the best, 
most 
  cached case, as in 'over a thousand cycles'. So much stuff is done that maybe 
  contemporary x86 IRQ entry microcode already prefetches the IDT entry and its 
  expected call target address.

So for those reasons I'm really tempted by the all around simplification 
offered 
by this series:

   2 files changed, 62 insertions(+), 160 deletions(-)

Basically in this specific case I'd like to turn the argument around,
use this simpler design and put the burden of proof on the patches that
want to _complicate it_ beyond this simple, straightforward model: show us
the numbers that it truly makes a difference: it's not that hard to 
spray a CPU with IRQs by using IPIs, and the cache miss rate and the
overall perf counter stats should tell a clear story about whether it
makes sense to cache-compress (and complicate) the IRQ entry sequences.

Thanks,

        Ingo

Reply via email to