On Mon, 14/03/2022 at 17:12 +0900, Kyotaro Horiguchi wrote:
> At Mon, 14 Mar 2022 09:15:11 +0300, Yura Sokolov <y.soko...@postgrespro.ru>
> wrote in
> > On Mon, 14/03/2022 at 14:31 +0900, Kyotaro Horiguchi wrote:
> > > I'd like to ask you to remove nalloced from partitions, then add a
> > > global atomic for the same use?
> > 
> > I really believe it should be global. I made it per-partition to
> > avoid overcomplicating the first versions. Glad you mentioned it.
> > 
> > I thought to protect it with freeList[0].mutex, but an atomic is
> > probably the better idea here. But which atomic to choose: uint64 or
> > uint32? Based on sizeof(long)?
> > Ok, I'll do it in the next version.
> 
> The current nentries is a long (= int64 on CentOS), and uint32 can
> support only roughly 2^32 * 8192 = 32TB of shared buffers, which
> doesn't seem safe enough. So it would be uint64.
> 
> > The whole get_hash_entry looks strange.
> > Wouldn't it be better to cycle through partitions and only then go to
> > get_hash_entry?
> > Maybe there should be a bitmap of non-empty free lists? 32 bits for
> > 32 partitions. But wouldn't the bitmap become a contention point
> > itself?
> 
> The code puts significance on avoiding contention caused by visiting
> freelists of other partitions, and perhaps assumes that freelist
> shortage rarely happens.
> 
> I tried pgbench runs with scale 100 (with 10 threads, 10 clients) on
> 128kB shared buffers and I saw that get_hash_entry never takes the
> !element_alloc() path and always allocates a fresh entry, then
> saturates at 30 new elements allocated around the middle of a
> 100-second run.
> 
> Then I tried the same with the patch, and I was surprised to see that
> the rise in the number of newly allocated elements didn't stop and
> went up to 511 elements after the 100-second run. So I found that my
> concern was valid. The change in dynahash actually
> continuously/repeatedly causes a lack of free list entries. I'm not
> sure how much impact it would have on performance if we changed
> get_hash_entry to prefer other freelists, though.

Well, it is quite strange that SharedBufHash is not allocated as
HASH_FIXED_SIZE. Could you check what happens with this flag set? I'll
try as well.
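To be concrete, I mean something like the following in InitBufTable()
(buf_table.c). This is untested and only a sketch of where the flag
would go; the sizing stays as it is today:

    /*
     * Sketch: keep the current size (NBuffers + NUM_BUFFER_PARTITIONS),
     * but add HASH_FIXED_SIZE so element_alloc() refuses to grow the
     * table after the initial allocation and every entry has to come
     * from the preallocated freelists.
     */
    SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
                                  size, size,
                                  &info,
                                  HASH_ELEM | HASH_BLOBS |
                                  HASH_PARTITION | HASH_FIXED_SIZE);

If I read dynahash right, a shared table preallocates all of its
entries at creation anyway, so the flag should only forbid growing past
that and shouldn't change the startup allocation.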
Another way to reduce the effect you observed is to remember
freelist_idx for a reused entry. I didn't believe it mattered much,
since entries migrate nevertheless, but probably due to some hot
buffers there is a tendency to crowd a particular freelist.

> By the way, there's the following comment in StrategyInitialize.
> 
> > * Initialize the shared buffer lookup hashtable.
> > *
> > * Since we can't tolerate running out of lookup table entries, we must be
> > * sure to specify an adequate table size here.  The maximum steady-state
> > * usage is of course NBuffers entries, but BufferAlloc() tries to insert
> > * a new entry before deleting the old.  In principle this could be
> > * happening in each partition concurrently, so we could need as many as
> > * NBuffers + NUM_BUFFER_PARTITIONS entries.
> > */
> > InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);
> 
> "but BufferAlloc() tries to insert a new entry before deleting the
> old." becomes false with this patch, but the additional room is still
> needed for the stashed entries. That comment seems to need a fix.
> 
> regards.
> 
> -- 
> Kyotaro Horiguchi
> NTT Open Source Software Center
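And to be concrete about the global nalloced counter discussed at the
top: what I have in mind is roughly the following in dynahash.c. This
is only a sketch -- the exact field placement and names may well end up
different in the next version of the patch:

    /*
     * HASHHDR (sketch): one global counter instead of per-freelist
     * nalloced.  uint64, because uint32 would only cover about
     * 2^32 * 8192 bytes = 32TB of shared buffers.
     */
    pg_atomic_uint64 nalloced;      /* total number of allocated entries */

    /* when the header is initialized: */
    pg_atomic_init_u64(&hctl->nalloced, 0);

    /* in element_alloc(), after carving nelem new elements: */
    pg_atomic_fetch_add_u64(&hctl->nalloced, nelem);

    /* and readers simply do: */
    uint64 allocated = pg_atomic_read_u64(&hctl->nalloced);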