Re: [PATCH] rw_semaphores, optimisations

D . W . Howells Sun, 22 Apr 2001 15:36:14 -0700

Hello Andrea,

Interesting benchmarks... did you compile the test programs with "make 
SCHED=yes" by any chance? Also what other software are you running?

The reason I ask is that running a full blown KDE setup running in the 
background, I get the following numbers on the rwsem-ro test (XADD optimised 
kernel):

    SCHED: 4615646, 4530769, 4534453 and 4628365
    no SCHED: 6311620, 6312776, 6327772 and 6325508

Also quite stable as you can see.

> (ah and btw the machine is a 2-way PII 450mhz). 

Your numbers were "4274607" and "4280280" for this kernel and test This I 
find a little suprising. I'd expect them to be about 10% higher than I get on 
my machine given your faster CPUs.

What compiler are you using? I'm using the following:

   Reading specs from /usr/lib/gcc-lib/i386-redhat-linux/2.96/specs
   gcc version 2.96 20000731 (Red Hat Linux 7.1 2.96-80)

Something else that I noticed: Playing a music CD appears to improve the 
benchmarks all round:-) Must be some interrupt effect of some sort, or maybe 
they just like the music...

> rwsem-2.4.4-pre6 + my new generic rwsem (fast path in C inlined)

Linus wants out of line generic code only, I believe. Hence why I made my 
generic code out of line.

I have noticed one glaring potential slowdown in my generic code's down 
functions. I've got the following in _both_ fastpaths!:

    struct task_struct *tsk = current;

It shouldn't hurt _too_ much (its only reg->reg anyway), but it will have an 
effect. I'll have to move it and post another patch tomorrow.

I've also been comparing the assembly from the two generic spinlock 
implementations (having out-of-lined yours in what I think is the you'd have 
done it). I've noticed a number of things:

  (1) My fastpaths have slightly fewer instructions in them

  (2) gcc-2.96-20000731 produces what looks like much less efficient code
      than gcc-snapshot-20010409 (to be expected, I suppose).

  (3) Both compilers do insane things to registers (like in one instruction
      moving %eax to %edx and then moving it back again in the next).

  (4) If _any_ inline assembly is used, the compiler grabs extra chunks of
      stack which it does not then use. It will then pop these into registers
      under some circumstances. It'll also save random registers it doesn't
      clobber under others.

(Basically, I had a lot of frustrating fun playing around with the spinlock 
asm constraints trying to avoid the compiler clobbering registers 
unnecessarily because of them).

I've attached the source file I've been playing with and an example 
disassembly dump for your amusement. I used the snapshot gcc to do this (it 
emits conditional chunks of code out of line more intelligently than the 
older one.

It's also interesting that your generic out-of-line semaphores are faster 
given the fact that you muck around with EFLAGS and CLI/STI, and I don't. 
Maybe I'm getting hit by an interrupt. I'll have to play around with it and 
benchmark it again.

David

/* slowpath.c: description
 *
 * Copyright (c) 2001   David Howells ([EMAIL PROTECTED]).
 */
#define __KERNEL__
#include <asm/types.h>
#include <asm/current.h>
#include <asm/system.h>
#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/spinlock.h>

struct rw_semaphore_dwh {
	__u32			active;
	__u32			waiting;
	spinlock_t		wait_lock;
};

extern void FASTCALL(dwh_down_read(struct rw_semaphore_dwh *));
extern void FASTCALL(dwh_up_read(struct rw_semaphore_dwh *));
extern void FASTCALL(dwh_down_read_failed(struct rw_semaphore_dwh *,
					  struct task_struct *));
extern void FASTCALL(dwh__rwsem_do_wake(struct rw_semaphore_dwh *));

struct rw_semaphore_aa {
	spinlock_t lock;
	int count;
	struct list_head wait;
};

#define RWSEM_WAITQUEUE_READ 0

extern void FASTCALL(aa_down_read(struct rw_semaphore_aa *));
extern void FASTCALL(aa_up_read(struct rw_semaphore_aa *));
extern void FASTCALL(aa_down_failed(struct rw_semaphore_aa *, int));
extern void FASTCALL(aa_rwsem_wake(struct rw_semaphore_aa *));

void dwh_down_read(struct rw_semaphore_dwh *sem)
{
	struct task_struct *tsk = current;
	spin_lock(&sem->wait_lock);

	if (sem->waiting) {
		sem->active++;
		spin_unlock(&sem->wait_lock);
		goto out;
	}
	sem->waiting++;
	dwh_down_read_failed(sem,tsk);
	spin_unlock(&sem->wait_lock);
 out:
}

void aa_down_read(struct rw_semaphore_aa *sem)
{
	spin_lock_irq(&sem->lock);
	if (sem->count < 0 || !list_empty(&sem->wait))
		goto slow_path;
	sem->count++;
 out:
	spin_unlock_irq(&sem->lock);
	return;

 slow_path:
	aa_down_failed(sem, RWSEM_WAITQUEUE_READ);
	goto out;
}

void dwh_up_read(struct rw_semaphore_dwh *sem)
{
	spin_lock(&sem->wait_lock);

	if (--sem->active==0 && sem->waiting)
		dwh__rwsem_do_wake(sem);

	spin_unlock(&sem->wait_lock);
}

void aa_up_read(struct rw_semaphore_aa *sem)
{
	unsigned long flags;

	spin_lock_irqsave(&sem->lock, flags);
	if (!--sem->count && !list_empty(&sem->wait))
		aa_rwsem_wake(sem);
	spin_unlock_irqrestore(&sem->lock, flags);
}


slowpath.o:     file format elf32-i386

Disassembly of section .text:

00000000 <dwh_down_read>:
   0:   53                      push   %ebx
   1:   83 ec 08                sub    $0x8,%esp
   4:   89 c3                   mov    %eax,%ebx
   6:   ba 00 e0 ff ff          mov    $0xffffe000,%edx
   b:   21 e2                   and    %esp,%edx
   d:   f0 fe 4b 08             lock decb 0x8(%ebx)
  11:   0f 88 fc ff ff ff       js     13 <dwh_down_read+0x13>
  17:   8b 4b 04                mov    0x4(%ebx),%ecx
  1a:   85 c9                   test   %ecx,%ecx
  1c:   74 0a                   je     28 <dwh_down_read+0x28>
  1e:   ff 03                   incl   (%ebx)
  20:   c6 43 08 01             movb   $0x1,0x8(%ebx)
  24:   58                      pop    %eax
  25:   5a                      pop    %edx
  26:   5b                      pop    %ebx
  27:   c3                      ret    
  28:   c7 43 04 01 00 00 00    movl   $0x1,0x4(%ebx)
  2f:   89 d8                   mov    %ebx,%eax
  31:   e8 fc ff ff ff          call   32 <dwh_down_read+0x32>
  36:   eb e8                   jmp    20 <dwh_down_read+0x20>

00000038 <aa_down_read>:
  38:   53                      push   %ebx
  39:   83 ec 08                sub    $0x8,%esp
  3c:   89 c3                   mov    %eax,%ebx
  3e:   fa                      cli    
  3f:   f0 fe 0b                lock decb (%ebx)
  42:   0f 88 09 00 00 00       js     51 <aa_down_read+0x19>
  48:   8b 53 04                mov    0x4(%ebx),%edx
  4b:   85 d2                   test   %edx,%edx
  4d:   78 19                   js     68 <aa_down_read+0x30>
  4f:   8d 43 08                lea    0x8(%ebx),%eax
  52:   39 43 08                cmp    %eax,0x8(%ebx)
  55:   75 11                   jne    68 <aa_down_read+0x30>
  57:   8d 42 01                lea    0x1(%edx),%eax
  5a:   89 43 04                mov    %eax,0x4(%ebx)
  5d:   c6 03 01                movb   $0x1,(%ebx)
  60:   fb                      sti    
  61:   5b                      pop    %ebx
  62:   58                      pop    %eax
  63:   5b                      pop    %ebx
  64:   c3                      ret    
  65:   8d 76 00                lea    0x0(%esi),%esi
  68:   31 d2                   xor    %edx,%edx
  6a:   89 d8                   mov    %ebx,%eax
  6c:   e8 fc ff ff ff          call   6d <aa_down_read+0x35>
  71:   eb ea                   jmp    5d <aa_down_read+0x25>
  73:   90                      nop    

00000074 <dwh_up_read>:
  74:   53                      push   %ebx
  75:   83 ec 08                sub    $0x8,%esp
  78:   89 c3                   mov    %eax,%ebx
  7a:   f0 fe 4b 08             lock decb 0x8(%ebx)
  7e:   0f 88 15 00 00 00       js     99 <dwh_up_read+0x25>
  84:   8b 03                   mov    (%ebx),%eax
  86:   48                      dec    %eax
  87:   89 03                   mov    %eax,(%ebx)
  89:   85 c0                   test   %eax,%eax
  8b:   74 0b                   je     98 <dwh_up_read+0x24>
  8d:   c6 43 08 01             movb   $0x1,0x8(%ebx)
  91:   5a                      pop    %edx
  92:   59                      pop    %ecx
  93:   5b                      pop    %ebx
  94:   c3                      ret    
  95:   8d 76 00                lea    0x0(%esi),%esi
  98:   8b 43 04                mov    0x4(%ebx),%eax
  9b:   85 c0                   test   %eax,%eax
  9d:   74 ee                   je     8d <dwh_up_read+0x19>
  9f:   89 d8                   mov    %ebx,%eax
  a1:   e8 fc ff ff ff          call   a2 <dwh_up_read+0x2e>
  a6:   eb e5                   jmp    8d <dwh_up_read+0x19>

000000a8 <aa_up_read>:
  a8:   56                      push   %esi
  a9:   53                      push   %ebx
  aa:   51                      push   %ecx
  ab:   89 c3                   mov    %eax,%ebx
  ad:   9c                      pushf  
  ae:   5e                      pop    %esi
  af:   fa                      cli    
  b0:   f0 fe 0b                lock decb (%ebx)
  b3:   0f 88 22 00 00 00       js     db <aa_up_read+0x33>
  b9:   8b 43 04                mov    0x4(%ebx),%eax
  bc:   48                      dec    %eax
  bd:   89 43 04                mov    %eax,0x4(%ebx)
  c0:   85 c0                   test   %eax,%eax
  c2:   74 0c                   je     d0 <aa_up_read+0x28>
  c4:   c6 03 01                movb   $0x1,(%ebx)
  c7:   56                      push   %esi
  c8:   9d                      popf   
  c9:   5a                      pop    %edx
  ca:   5b                      pop    %ebx
  cb:   5e                      pop    %esi
  cc:   c3                      ret    
  cd:   8d 76 00                lea    0x0(%esi),%esi
  d0:   8d 43 08                lea    0x8(%ebx),%eax
  d3:   39 43 08                cmp    %eax,0x8(%ebx)
  d6:   74 ec                   je     c4 <aa_up_read+0x1c>
  d8:   89 d8                   mov    %ebx,%eax
  da:   e8 fc ff ff ff          call   db <aa_up_read+0x33>
  df:   eb e3                   jmp    c4 <aa_up_read+0x1c>
Disassembly of section .text.lock:

00000000 <.text.lock>:
   0:   80 7b 08 00             cmpb   $0x0,0x8(%ebx)
   4:   f3 90                   repz nop 
   6:   7e f8                   jle    0 <.text.lock>
   8:   e9 09 00 00 00          jmp    16 <.text.lock+0x16>
   d:   80 3b 00                cmpb   $0x0,(%ebx)
  10:   f3 90                   repz nop 
  12:   7e f9                   jle    d <.text.lock+0xd>
  14:   e9 3b 00 00 00          jmp    54 <aa_down_read+0x1c>
  19:   80 7b 08 00             cmpb   $0x0,0x8(%ebx)
  1d:   f3 90                   repz nop 
  1f:   7e f8                   jle    19 <.text.lock+0x19>
  21:   e9 76 00 00 00          jmp    9c <dwh_up_read+0x28>
  26:   80 3b 00                cmpb   $0x0,(%ebx)
  29:   f3 90                   repz nop 
  2b:   7e f9                   jle    26 <.text.lock+0x26>
  2d:   e9 ac 00 00 00          jmp    de <aa_up_read+0x36>

Re: [PATCH] rw_semaphores, optimisations

Reply via email to