On 2014-11-17 21:03:07 +0100, Tomas Vondra wrote: > On 17.11.2014 19:46, Andres Freund wrote: > > On 2014-11-17 19:42:25 +0100, Tomas Vondra wrote: > >> On 17.11.2014 18:04, Andres Freund wrote: > >>> Hi, > >>> > >>> On 2014-11-16 23:31:51 -0800, Jeff Davis wrote: > >>>> *** a/src/include/nodes/memnodes.h > >>>> --- b/src/include/nodes/memnodes.h > >>>> *************** > >>>> *** 60,65 **** typedef struct MemoryContextData > >>>> --- 60,66 ---- > >>>> MemoryContext nextchild; /* next child of same parent */ > >>>> char *name; /* context name (just > >>>> for debugging) */ > >>>> bool isReset; /* T = no space alloced > >>>> since last reset */ > >>>> + uint64 mem_allocated; /* track memory allocated for > >>>> this context */ > >>>> #ifdef USE_ASSERT_CHECKING > >>>> bool allowInCritSection; /* allow palloc in > >>>> critical section */ > >>>> #endif > >>> > >>> That's quite possibly one culprit for the slowdown. Right now one > >>> AllocSetContext struct fits precisely into three cachelines. After > >>> your change it doesn't anymore. > >> > >> I'm no PPC64 expert, but I thought the cache lines are 128 B on that > >> platform, since at least Power6? > > > > Hm, might be true. > > > >> Also, if I'm counting right, the MemoryContextData structure is 56B > >> without the 'mem_allocated' field (and without the allowInCritSection), > >> and 64B with it (at that particular place). sizeof() seems to confirm > >> that. (But I'm on x86, so maybe the alignment on PPC64 is different?). > > > > The MemoryContextData struct is embedded into AllocSetContext. > > Oh, right. That makes is slightly more complicated, though, because > AllocSetContext adds 6 x 8B fields plus an in-line array of freelist > pointers. Which is 11x8 bytes. So in total 56+56+88=200B, without the > additional field. There might be some difference because of alignment, > but I still don't see how that one additional field might impact cachelines?
It's actually 196 bytes: struct AllocSetContext { MemoryContextData header; /* 0 56 */ AllocBlock blocks; /* 56 8 */ /* --- cacheline 1 boundary (64 bytes) --- */ AllocChunk freelist[11]; /* 64 88 */ /* --- cacheline 2 boundary (128 bytes) was 24 bytes ago --- */ Size initBlockSize; /* 152 8 */ Size maxBlockSize; /* 160 8 */ Size nextBlockSize; /* 168 8 */ Size allocChunkLimit; /* 176 8 */ AllocBlock keeper; /* 184 8 */ /* --- cacheline 3 boundary (192 bytes) --- */ /* size: 192, cachelines: 3, members: 8 */ }; And thus one additional field tipps it over the edge. "pahole" is a very useful utility. > But if we separated the freelist, that might actually make it faster, at > least for calls that don't touch the freelist at all, no? Because most > of the palloc() calls will be handled from the current block. I seriously doubt it. The additional indirection + additional branches are likely to make it worse. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers