Added to TODO list. ---------------------------------------------------------------------------
Simon Riggs wrote: > On Wed, 2005-09-14 at 13:32 -0400, Tom Lane wrote: > > I wrote: > > > Another thought came to mind: maybe the current data layout for LWLocks > > > is bad. Right now, the spinlock that protects each LWLock data struct > > > is itself part of the struct, and since the structs aren't large (circa > > > 20 bytes), the whole thing is usually all in the same cache line. ... > > > Maybe it'd be better to allocate the spinlocks off by themselves. > > > > Well, this is odd. I made up a patch to do this (attached) and found > > that it pretty much sucks. Still the 4-way Opteron, previous best > > (slock-no-cmpb and spin-delay-2): > > 1 31s 2 42s 4 51s 8 100s > > with lwlock-separate added: > > 1 31s 2 52s 4 106s 8 213s > > > > What I had expected to see was a penalty in the single-thread case (due > > to instructions added to the LWLock functions), but there isn't any. > > I did not expect to see a factor-of-2 penalty for four threads. > > > > I guess what this means is that there's no real problem with losing the > > cache line while manipulating the LWLock, which is what the patch was > > intended to prevent. Instead, we're paying for swapping two cache lines > > (the spinlock and the LWLock) across processors instead of just one line. > > But that should at worst be a 2x inflation of the time previously spent > > in LWLockAcquire/Release, which is surely not yet all of the application > > ;-). Why the heck is this so bad? > > (Just going back through the whole thread for completeness.) > > > Should we expect that apparently > > minor changes in shared data structures might be costing equivalently > > huge penalties in SMP performance elsewhere? > > Yes. That's the advice from Intel and AMD; but we should add that there > is potential for improving performance also. > > The possible problem we were trying to avoid here was false sharing of > the cache line by lock requestors of locks whose spin locks were > adjacent in memory. > > Splitting the data structure was just one of the possible ways of > avoiding that. The above shows that the possible solution described > above didn't work, but there could be others. > > One thing we tried in February was padding out the statically defined > locks with dummy lock definitions in the enum. This has the effect of > ensuring that the most contested locks are very definitely in their own > cache line and not shared with others. > That showed a noticeable improvement in performance, probably because it > costs very little to implement, even if the code would require some > explanatory documentation. > > The lwlock structure in include/storage/lwlock.h looks like > > typedef enum LWLockId > { > BufMappingLock, > BufFreelistLock, > LockMgrLock, > OidGenLock, > XidGenLock, > ProcArrayLock, > SInvalLock, > FreeSpaceLock, > WALInsertLock, > WALWriteLock, > ... > > Notice that the heavily contested locks (i.e. first 3 and the WAL locks) > are adjacent to at least one other heavily contested lock. So they are > certain to be in the same cache line and therefore to cause false > sharing (on all CPU types, not just Intel and AMD (ref: Manufacturer's > optimization handbooks). > > This could be replaced by... > > typedef enum LWLockId > { > BufMappingLock, > BufMappingLock_Padding1, > BufMappingLock_Padding2, > BufMappingLock_Padding3, > BufMappingLock_Padding4, > BufMappingLock_Padding5, > BufMappingLock_Padding6, > BufMappingLock_Padding7, > BufFreelistLock, > BufFreelistLock_Padding1, > BufFreelistLock_Padding2, > BufFreelistLock_Padding3, > BufFreelistLock_Padding4, > BufFreelistLock_Padding5, > BufFreelistLock_Padding6, > BufFreelistLock_Padding7, > LockMgrLock, > LockMgrLock_Padding1, > LockMgrLock_Padding2, > LockMgrLock_Padding3, > LockMgrLock_Padding4, > LockMgrLock_Padding5, > LockMgrLock_Padding6, > LockMgrLock_Padding7, > OidGenLock, > XidGenLock, > ProcArrayLock, > SInvalLock, > FreeSpaceLock, > WALInsertLock, > WALInsertLock_Padding1, > WALInsertLock_Padding2, > WALInsertLock_Padding3, > WALInsertLock_Padding4, > WALInsertLock_Padding5, > WALInsertLock_Padding6, > WALInsertLock_Padding7, > WALWriteLock, > WALWriteLock_Padding1, > WALWriteLock_Padding2, > WALWriteLock_Padding3, > WALWriteLock_Padding4, > WALWriteLock_Padding5, > WALWriteLock_Padding6, > WALWriteLock_Padding7, > ... > > where the number of padding locks is determined by how many lock > structures fit within a 128 byte cache line. > > This isn't exactly elegant coding, but it provides a useful improvement > on an 8-way SMP box when run on 8.0 base. OK, lets be brutal: this looks > pretty darn stupid. But it does follow the CPU optimization handbook > advice and I did see a noticeable improvement in performance and a > reduction in context switching. > > I'm not in a position to try this again now on 8.1beta, but I'd welcome > a performance test result from anybody that is. I'll supply a patch > against 8.1beta for anyone wanting to test this. > > Best Regards, Simon Riggs > > > ---------------------------(end of broadcast)--------------------------- > TIP 3: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faq > -- Bruce Momjian http://candle.pha.pa.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly