Sorry for such a late reply. I've been thinking about possible approaches.
A hashtable-based KnownAssignedXids was in fact implemented long ago and
rejected [0].

> On 3 Aug 2021, at 22:35, Andres Freund <and...@anarazel.de> wrote:
> 
> On 2021-08-03 10:33:50 +0500, Andrey Borodin wrote:
>>> On 3 Aug 2021, at 03:01, Andres Freund <and...@anarazel.de> wrote:
>>> On 2021-08-03 00:07:23 +0300, Michail Nikolaev wrote:
>>>> The main idea is simple optimistic optimization - store offset to next
>>>> valid entry. So, in most cases, we could just skip all the gaps.
>>>> Of course, it adds some additional impact for workloads without long
>>>> (few seconds) transactions but it is almost not detectable (because of
>>>> CPU caches).
>>> 
>>> I'm doubtful that that's really the right direction. For workloads that
>>> are replay heavy we already often can see the cost of maintaining the
>>> known xids datastructures show up significantly - not surprising, given
>>> the datastructure. And for standby workloads with active primaries the
>>> cost of searching through the array in all backends is noticeable as
>>> well.  I think this needs a bigger data structure redesign.
>> 
>> KnownAssignedXids implements simple membership test idea. What kind of
>> redesign would you suggest? Proposed optimisation makes it close to optimal,
>> but needs eventual compression.
> 
> Binary searches are very ineffecient on modern CPUs (unpredictable memory
> accesses, unpredictable branches). We constantly need to do binary searches
> during replay to remove xids from the array. I don't see how you can address
> that with just the current datastructure.
The current patch addresses a different problem: in the presence of a
sufficiently old transaction, enumerating KnownAssignedXids under a shared lock
prevents adding new transactions under an exclusive lock, and recovery
effectively pauses.
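
To illustrate the mechanism (my own simplification of the idea, not the patch
text verbatim): a parallel KnownAssignedXidsNext[] array holds, for each slot,
the offset to the next potentially valid entry (initially 1), so enumeration
jumps over runs of invalidated slots instead of testing them one by one.

    int     prev = -1;

    for (int i = tail; i < head; i += KnownAssignedXidsNext[i])
    {
        TransactionId xid = KnownAssignedXids[i];

        if (!TransactionIdIsValid(xid))
        {
            /* widen the previous jump so the next scan skips this gap at once */
            if (prev >= 0)
                KnownAssignedXidsNext[prev] = i + KnownAssignedXidsNext[i] - prev;
            continue;
        }

        prev = i;
        /* ... collect xid for the snapshot ... */
    }

With no long transactions the offsets stay at 1 and the scan degenerates to the
current linear pass; with an old transaction holding the tail, the gaps behind
it are skipped in a few jumps.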

Binary searches can incur 10-15 cache misses, which is an unreasonable amount of
memory waits. But that's a somewhat different problem.
Also, binary search is not that expensive when we compress KnownAssignedXids often.

>> Maybe use a hashtable of running transactions? It will be slightly faster
>> when adding\removing single transactions. But much worse when doing
>> KnownAssignedXidsRemove().
> 
> Why would it be worse for KnownAssignedXidsRemove()? Were you intending to
> write KnownAssignedXidsGet[AndSetXmin]()?
I was thinking about KnownAssignedXidsRemovePreceding() being inefficient with a
hashtable. But that is probably not such a frequent operation.

>> Maybe use a tree? (AVL\RB or something like that) It will be slightly 
>> better, because it does not need eventual compression like exiting array.
> 
> I'm not entirely sure what datastructure would work best. I can see something
> like a radix tree work well, or a skiplist style approach. Or a hashtable:
> 
> I'm not sure that we need to care as much about the cost of
> KnownAssignedXidsGetAndSetXmin() - for one, the caching we now have makes that
> less frequent. But more importantly, it'd not be hard to maintain an
> occasionally (or opportunistically) maintained denser version for
> GetSnapshotData() - there's only a single writer, making the concurrency
> issues a lot simpler.

I've been prototyping a radix tree for a while.
Here every 4 xids are summarized by the minimum Xid and the number of underlying Xids.
Of course, 4 is an arbitrary number; the summarization area should really be of
cacheline size.
┌───────┐                                                       
│ 1 / 9 │                                                       
├───────┴────┐                                                  
│            └────┐                                             
│                 └────┐                                        
│                      └────┐                                   
▼                           └───▶                               
┌───────────────────────────────┐                               
│ 1 / 3 | 5 / 0 | 9 / 3 | D / 3 │                               
├───────┬───────┬────────┬──────┴────┐                          
│       └─┐     └───┐    └────┐      └─────┐                    
│         └─┐       └──┐      └────┐       └─────┐              
│           └─┐        └──┐        └────┐        └────┐         
▼             └─┐         └──┐          └───┐         └────┐    
┌───────────────▼────────────┴─▶────────────┴──▶───────────┴───▶
│ 1 | 2 |   | 4 |   |   |   |   | 9 |   | B | C | D | E | F |  │
└──────────────────────────────────────────────────────────────┘
The bottom layer is the current array (TransactionId *KnownAssignedXids).
When we remove an Xid we touch close to the theoretical minimum of cachelines:
I'd say 5-7 instead of the 10-15 of a binary search (in the case of millions of
entries in KnownAssignedXids).
Enumerating running Xids is not that difficult either: we would need to scan
O(xip) memory, not the whole KnownAssignedXids array.
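
In pseudo-C the middle layer could look roughly like this (all names here are
invented for illustration, this is not code from the prototype):

    #define SUMMARY_FANOUT  4   /* arbitrary; should really cover one cacheline of xids */

    typedef struct XidGroupSummary
    {
        TransactionId   min_xid;    /* smallest xid covered by this group */
        uint32          running;    /* valid xids still present underneath */
    } XidGroupSummary;

    /*
     * Removing an xid touches one slot per level: clear the leaf and
     * decrement the counter of the covering group, roughly one cacheline
     * per level of the tree.
     */
    static void
    summary_remove(XidGroupSummary *mid, TransactionId *leaf_xids, int leaf_slot)
    {
        leaf_xids[leaf_slot] = InvalidTransactionId;
        mid[leaf_slot / SUMMARY_FANOUT].running--;
    }

Enumeration can then skip a whole group whose running counter is zero without
looking at its leaves, which is how the scan stays at O(xip) memory instead of
the full array.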

But the overall complexity of this approach does not seem good to me.

All in all, I think the proposed "KnownAssignedXidsNext" patch solves a real
problem, and the problem of binary searches should be addressed by compressing
KnownAssignedXids more often.

Currently we do not compress the array when
        if (nelements < 4 * PROCARRAY_MAXPROCS ||       // it's not that big yet, OR
            nelements < 2 * pArray->numKnownAssignedXids) // it contains less than half bloat
            return;
From my POV the arbitrary number 4 is just too high.
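
For example, just as an illustration of the direction (not a concrete proposal),
dropping the multiplier to 1 would trigger compression as soon as the array
holds more than one PROCARRAY_MAXPROCS worth of entries and at least half of the
slots are gaps:

        if (nelements < PROCARRAY_MAXPROCS ||             // still small enough, OR
            nelements < 2 * pArray->numKnownAssignedXids) // less than half of it is gaps
            return;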

Summary: I think the "KnownAssignedXidsNext" patch plus compressing
KnownAssignedXids more often is better than a major KnownAssignedXids redesign.


Best regards, Andrey Borodin.

[0] 
https://github.com/postgres/postgres/commit/2871b4618af1acc85665eec0912c48f8341504c4
