At Thu, 07 Apr 2022 14:14:59 +0300, Yura Sokolov <y.soko...@postgrespro.ru> wrote in 
> On Thu, 07/04/2022 at 16:55 +0900, Kyotaro Horiguchi wrote:
> > Hi, Yura.
> > 
> > At Wed, 06 Apr 2022 16:17:28 +0300, Yura Sokolov <y.soko...@postgrespro.ru>
> > wrote in 
> > > Ok, I got access to a stronger server, did the benchmark, found weird
> > > things, and so here is a new version :-)
> > 
> > Thanks for the new version and benchmarking.
> > 
> > > First I found that if the table size is strictly limited to NBuffers
> > > and FIXED, then under high concurrency get_hash_entry may not find a
> > > free entry even though one must be there.  It seems that while a
> > > process scans the free lists, other concurrent processes "move the
> > > entry around", i.e. one concurrent process fetches it from one free
> > > list while another process puts a new entry into another freelist,
> > > and the unfortunate process misses it since it tests each freelist
> > > only once.
> > 
> > StrategyGetBuffer believes that entries don't move across freelists
> > and it was true before this patch.
> 
> StrategyGetBuffer knows nothing about dynahash's freelists.
> It knows about the buffer manager's freelist, which is not partitioned.

Yeah, right. I meant get_hash_entry.
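To spell out the problematic pattern, the borrowing pass in question is
roughly the following (a simplified sketch, not the actual dynahash code;
entry accounting and the home-freelist handling are elided):

    HASHBUCKET  entry = NULL;
    int         i;

    for (i = 1; i < NUM_FREELISTS; i++)
    {
        int     idx = (freelist_idx + i) % NUM_FREELISTS;

        SpinLockAcquire(&freeList[idx].mutex);
        entry = freeList[idx].freeList;             /* look exactly once */
        if (entry != NULL)
            freeList[idx].freeList = entry->link;
        SpinLockRelease(&freeList[idx].mutex);

        if (entry != NULL)
            return entry;                           /* borrowed one */

        /*
         * This freelist is never revisited.  An entry that a concurrent
         * backend returns to a freelist we have already passed stays
         * invisible to this scan, so with a strictly sized (isfixed)
         * table the caller can conclude "no free entry" although one
         * exists somewhere.
         */
    }
    return NULL;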
> > > To fix these issues I made the following:
> > > 
> > > # Concurrency
> > > 
> > > First, I limit concurrency by introducing another lwlock tranche -
> > > BufferEvict.  It is 8 times larger than the BufferMapping tranche
> > > (1024 vs 128).
> > > If a backend doesn't find a buffer in the buffer table and wants to
> > > introduce it, it first calls
> > >     LWLockAcquireOrWait(newEvictPartitionLock, LW_EXCLUSIVE)
> > > If the lock was acquired, it proceeds to the eviction and replacement
> > > process.  Otherwise, it waits for the lock to be released and repeats
> > > the search.
> > > 
> > > This greatly improves performance for > 400 clients in pgbench.
> > 
> > So the performance difference between the existing code and v11 is that
> > the latter has a collision cross section eight times smaller than the
> > former?
> 
> No. Acquiring EvictPartitionLock
> 1. doesn't block readers, since readers don't acquire EvictPartitionLock
> 2. doesn't form a "tree of lock dependencies", since EvictPartitionLock
>    is independent from PartitionLock.
> 
> Problem with the existing code:
> 1. Process A locks P1 and P2
> 2. Process B (p3-old, p1-new) locks P3 and wants to lock P1
> 3. Process C (p4-new, p1-old) locks P4 and wants to lock P1
> 4. Process D (p5-new, p4-old) locks P5 and wants to lock P4
> At this moment P1, P2, P3, P4 and P5 are all locked and waiting
> for Process A.
> And readers can't read from those same five partitions.
> 
> With the new code:
> 1. Process A locks E1 (evict partition) and locks P2,
>    then releases P2 and locks P1.
> 2. Process B tries to lock E1, waits and retries the search.
> 3. Process C locks E4, locks P1, then releases P1 and locks P4
> 4. Process D locks E5, locks P4, then releases P4 and locks P5
> So, there is no network of locks.
> Process A doesn't block Process D at any moment:
> - either A blocks C, but C doesn't block D at this moment
> - or A doesn't block C.
> And readers don't see five simultaneously held locks that all
> depend on a single Process A.

Thanks for the detailed explanation. I see that.

> > + * Prevent "thundering herd" problem and limit concurrency.
> > 
> > this is something like pressing the accelerator and the brake pedal at
> > the same time.  If it improves performance, wouldn't just increasing
> > the number of buffer partitions work?
> 
> To be honest: of course a simple increase of NUM_BUFFER_PARTITIONS
> does improve the average case.
> But it is better to cure the problem than to anesthetize it.
> Increasing NUM_BUFFER_PARTITIONS reduces the probability and relative
> weight of the lock network, but doesn't eliminate it.

Agreed.

> > It's also not great that follower backends run a busy loop on the
> > lock until the top-runner backend inserts the new buffer into the
> > buftable and then releases the newPartitionLock.
> > 
> > > I tried another variant as well:
> > > - first insert an entry with a dummy buffer index into the buffer table.
> > > - if such an entry was already there, then wait for it to be filled.
> > > - otherwise find a victim buffer and replace the dummy index with the
> > >   new one.
> > > Waiting was done with a shared lock on EvictPartitionLock as well.
> > > This variant performed about the same.
> > 
> > This one looks better to me.  Since a partition can be shared by two or
> > more new-buffers, a condition variable seems to work better here...
> > 
> > > Logically I like that variant more, but there is one gotcha:
> > > FlushBuffer could fail with elog(ERROR).  Therefore there is
> > > a need to reliably remove the entry with the dummy index.
> > 
> > Perhaps UnlockBuffers can do that.
> 
> Thanks for the suggestion. I'll try to investigate and retry this
> approach in the patch.
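Just to make sure we have the same flow in mind, my reading of that
variant is roughly the following (a pseudo-code sketch; the helper names
are invented for illustration and are not from the patch):

retry:
    e = BufTableInsertOrFind(&newTag, newHash, DUMMY_BUF_ID, &found);
    if (found)
    {
        /*
         * Somebody else is already reading the page in.  Wait (with a
         * shared lock on the EvictPartitionLock in the tested variant)
         * until the dummy id has been replaced by a real buffer id,
         * then redo the lookup.
         */
        WaitForRealBufferId(e);
        goto retry;
    }

    /* We own the dummy entry: evict a victim, then publish it. */
    victim = ChooseVictimBuffer();
    FlushBuffer(victim, NULL);      /* may elog(ERROR); on error the dummy
                                     * entry must be removed reliably,
                                     * e.g. from UnlockBuffers() during
                                     * error cleanup */
    e->id = victim->buf_id;         /* replace the dummy index */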
> > > And after all, I still need to hold EvictPartitionLock to notice
> > > waiters.
> > > I've tried to use ConditionVariable, but its performance was much
> > > worse.
> > 
> > How many CVs did you use?
> 
> I've tried both NUM_PARTITION_LOCKS and NUM_PARTITION_LOCKS*8.
> It doesn't matter.
> Looks like the use of WaitLatch (which uses epoll) and/or triple
> SpinLockAcquire per good case (with two list traversals) is much worse
> than PGSemaphoreLock (which uses futex) and a single wait-list action.

Sure. I unintentionally neglected the overhead of our CV implementation.
It cannot be used in such a hot path.

> Another possibility is that while ConditionVariable eliminates the
> thundering herd effect, it doesn't limit concurrency enough... but
> that's just theory.
> 
> In reality, I'd like to try to make BufferLookupEnt->id atomic and add
> an LWLock to BufferLookupEnt.  I'll test it, but I doubt it could be
> merged, since there is no way to initialize dynahash's entries reliably.

Yeah, that's what came to my mind first (though with a CV rather than an
LWLock), but I gave up because of the additional size.  The size of
BufferLookupEnt is 24 and sizeof(ConditionVariable) is 12.  By the way,
sizeof(LWLock) is 16..  So I think we don't take the per-bufentry
approach here, for the reason of additional memory usage.

> > I don't think that causes a significant performance hit, but I don't
> > understand how it improves the freelist hit ratio other than by
> > accident.  Could you give some reasoning for it?
> 
> Since free_reused_entry returns the entry into a random free_list, this
> probability is quite high. In tests, I see stabilisa

Maybe.  Doesn't it improve the efficiency if we prioritize emptied
freelists when returning an element?  I tried it with an atomic_u32 to
remember empty freelists; on the uint32, each bit represents a freelist
index.  I saw it eliminated calls to element_alloc.  (A rough sketch of
what I tried is further below, after the migration discussion.)  I also
tried to remember a single freelist index in an atomic, but there was a
case where two freelists were emptied at once, and that led to an
element_alloc call.

> > By the way, the change to get_hash_entry looks somewhat wrong.
> > 
> > If I understand it correctly, it visits num_freelists/4 freelists at
> > once, then tries element_alloc.  If element_alloc() fails (that must
> > happen), it only tries freeList[freelist_idx] and gives up, even
> > though there must be an element in the other 3/4 freelists.
> 
> No. If element_alloc fails, it tries all NUM_FREELISTS again.
> - condition: `ntries || !allocFailed`. `allocFailed` becomes true,
>   so `ntries` remains.
> - `ntries = num_freelists;` regardless of `allocFailed`.
> Therefore, all `NUM_FREELISTS` are retried for a partitioned table.

Ah, okay.  ntries is set to num_freelists after calling element_alloc.
I think we (I?) need more comments there.  By the way, why is it
num_freelists / 4 + 1?

> > > `free_reused_entry` now returns the entry to a random position.  It
> > > flattens the spread of free entries.  Although it is not enough
> > > without the other changes (thundering herd mitigation and probing
> > > more lists in get_hash_entry).
> > 
> > If "thundering herd" means "many backends rush trying to read in the
> > same page at once", isn't it avoided by the change in BufferAlloc?
> 
> The "thundering herd" reduces the speed of entry migration a lot.  But
> the `simple_select` benchmark is too biased: it looks like the btree
> root is evicted from time to time.  So entries slowly migrate to or
> from the freelist of its partition.
> Without the "thundering herd" fix this migration is very fast.

Ah, that observation agrees with the seemingly unidirectional migration
of free entries.  I remember it has been raised on this list several
times that index pages should be prioritized in shared buffers..
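Coming back to the freelist bookkeeping above, what I tried looks roughly
like the following (a from-memory sketch using the existing pg_atomic and
pg_bitutils primitives; the names are hypothetical and this is not the
actual diff I tested; it assumes NUM_FREELISTS <= 32 so that one bit per
freelist fits in a uint32):

    /*
     * One bit per freelist: bit i set means freeList[i] was found empty
     * and should be refilled first when an element is returned.  In a
     * real patch this would live in the shared HASHHDR and be
     * pg_atomic_init_u32()'d at hash creation.
     */
    static pg_atomic_uint32 empty_freelists;

    /* called from get_hash_entry() when freeList[idx] turns out empty */
    static inline void
    mark_freelist_empty(int idx)
    {
        pg_atomic_fetch_or_u32(&empty_freelists, (uint32) 1 << idx);
    }

    /*
     * called from free_reused_entry(): prefer an emptied freelist over
     * the caller's (e.g. random) choice.
     */
    static inline int
    choose_freelist_for_return(int fallback_idx)
    {
        uint32      mask = pg_atomic_read_u32(&empty_freelists);

        while (mask != 0)
        {
            int         idx = pg_rightmost_one_pos32(mask);
            uint32      bit = (uint32) 1 << idx;

            /* claim the bit; if another backend already cleared it, retry */
            if (pg_atomic_fetch_and_u32(&empty_freelists, ~bit) & bit)
                return idx;
            mask = pg_atomic_read_u32(&empty_freelists);
        }
        return fallback_idx;
    }

Since each emptied list gets its own bit, the case where two freelists
become empty at once (which defeated the single-index version) is
handled naturally.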
> > Up to about 15%(?) of gain is great.
> 
> Up to 35% in the "2 socket 3 key 1GB" case.
> Up to 44% in the "2 socket 1 key 128MB" case.

Oh, even better!

> > I'm not sure it is okay that it seems to slow down by about 1%..
> 
> Well, in fact some of the degradation is not reproducible.
> Surprisingly, results change a bit from time to time.

Yeah.

> I just didn't rerun the whole `master` branch bench again after the v11
> bench, since each whole test run costs me 1.5 hours.

Thanks for the labor.

> But I confirm the regression on the "1 socket 1 key 1GB" test case
> between 83 and 211 connections.  It was reproducible on a more powerful
> Xeon 8354H, although it was less visible.
> 
> Other fluctuations close to 1% are not reliable.

I'm glad to hear that.  It is not surprising that some fluctuation
happens.

> For example, sometimes I see degradation or improvement with 2GB shared
> buffers (and even more than 1%).  But 2GB is enough for the whole test
> dataset (scale 100 pgbench is 1.5GB on disk).
> Therefore the modified code is not involved in the benchmark at all.
> How could that be explained?
> That is why I don't post the 2GB benchmark results. (Yeah, I'm cheating
> a bit.)

If buffer replacement doesn't happen, theoretically this patch cannot be
involved in the fluctuation, so I think we can consider it a measurement
error.  It might come from the placement of other variables.  I have
sometimes been annoyed by such a small but steady change of performance
that persists until I recompile the whole tree.  But, sorry, I don't
have a clear idea of how such a performance shift happens..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center