Hello,

I ran into someone with a system where big queries scanning 8GB+ of
all-in-cache data consistently took ~2.5x longer on a primary server
than on a replica.  Both servers had concurrent activity on them but
plenty of spare capacity and similar specs.  After some investigation
it turned out that on the primary there were (1) some select() syscalls
waiting for 1ms, which might indicate contended SpinLockAcquire()
back-offs (see the sketch at the end of this mail), and (2) a huge
amount of time spent in:
+   93,31%   0,00%  postgres  postgres  [.] index_getnext
+   93,30%   0,00%  postgres  postgres  [.] index_fetch_heap
+   81,66%   0,01%  postgres  postgres  [.] heap_page_prune_opt
+   75,85%   0,00%  postgres  postgres  [.] TransactionIdLimitedForOldSnapshots
+   75,83%   0,01%  postgres  postgres  [.] RelationHasUnloggedIndex
+   75,79%   0,00%  postgres  postgres  [.] RelationGetIndexList
+   75,79%  75,78%  postgres  postgres  [.] list_copy

The large tables in question have around 30 indexes.

I see that heap_page_prune_opt()'s call to
TransactionIdLimitedForOldSnapshots() acquires a couple of system-wide
spinlocks, and also tests RelationAllowsEarlyPruning(), which calls
RelationHasUnloggedIndex(), which says:

 * Tells whether any index for the relation is unlogged.
 *
 * Note: There doesn't seem to be any way to have an unlogged index attached
 * to a permanent table, but it seems best to keep this general so that it
 * returns sensible results even when they seem obvious (like for an unlogged
 * table) and to handle possible future unlogged indexes on permanent tables.

It calls RelationGetIndexList(), which conses up a new copy of the list
every time so that we can spin through it looking for unlogged indexes
(and in this user's case there are none); a paraphrased sketch of that
function appears below.

I didn't try to poke at this under lab conditions, but from a glance at
the code, I guess heap_page_prune_opt() runs for every index tuple
except those that reference the same heap page as the one before, so it
presumably happens a lot when the heap is not physically correlated
with the index keys.  Ouch.
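On (1): here is a minimal sketch, assuming I'm reading s_lock.c and
pgsleep.c correctly, of why contended SpinLockAcquire() can surface as
1ms select() waits.  The contended path sleeps via pg_usleep(), whose
Unix implementation uses select(), and the minimum spin-delay happens
to be exactly 1ms.  The sketch_* names and the GCC builtin standing in
for TAS() are mine, for illustration only:

#include <sys/select.h>

#define MIN_DELAY_USEC 1000L            /* 1ms: matches the waits in (1) */
#define MAX_DELAY_USEC 1000000L

static void
sketch_pg_usleep(long microsec)
{
    struct timeval delay;

    delay.tv_sec = microsec / 1000000L;
    delay.tv_usec = microsec % 1000000L;
    (void) select(0, NULL, NULL, NULL, &delay);  /* <-- the 1ms waits */
}

static void
sketch_spin_lock(volatile int *lock)
{
    long        cur_delay = 0;

    /* the real code also spins ~spins_per_delay times before sleeping */
    while (__sync_lock_test_and_set(lock, 1))    /* stand-in for TAS() */
    {
        /* lost the race: back off, starting at 1ms */
        if (cur_delay == 0)
            cur_delay = MIN_DELAY_USEC;
        sketch_pg_usleep(cur_delay);

        /* the real loop grows the delay by a random factor of 1x-2x */
        cur_delay *= 2;
        if (cur_delay > MAX_DELAY_USEC)
            cur_delay = MAX_DELAY_USEC;
    }
}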
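On (2): for reference, here is the shape of the offending function as I
read it in relcache.c, paraphrased and abbreviated rather than quoted
verbatim.  Two observations from the surrounding code, hedged because I
haven't traced them in a debugger: TransactionIdLimitedForOldSnapshots()
only gets this far when old_snapshot_threshold is enabled, and
heap_page_prune_opt() returns early during recovery, which would explain
why the replica doesn't pay this cost.

static bool
RelationHasUnloggedIndex(Relation rel)
{
    List       *indexoidlist;
    ListCell   *indexoidscan;
    bool        result = false;

    /* RelationGetIndexList() hands back list_copy(rel->rd_indexlist) */
    indexoidlist = RelationGetIndexList(rel);

    foreach(indexoidscan, indexoidlist)
    {
        /*
         * One syscache lookup per index; result becomes true if any
         * index has relpersistence == RELPERSISTENCE_UNLOGGED.
         */
    }

    list_free(indexoidlist);    /* ~30 cells consed up and freed per page */

    return result;
}

-- 
Thomas Munro
https://enterprisedb.com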