From: Bharath Rupireddy <bharath.rupireddyforpostg...@gmail.com>
Sent: Wednesday, May 26, 2021 7:22 PM

> Thanks for trying that out.
>
> Please see the code around the use_fsm flag in RelationGetBufferForTuple
> for more understanding of the points below.
>
> What happens if the FSM is skipped, i.e. myState->ti_options =
> TABLE_INSERT_SKIP_FSM;?
> 1) The flag use_fsm will be false in heap_insert->RelationGetBufferForTuple.
> 2) Each worker initially gets a block and keeps inserting into it until it
> is full. When the block is full, the worker doesn't look in the FSM
> (GetPageWithFreeSpace) as use_fsm is false. It directly goes for relation
> extension and tries to acquire the relation extension lock with
> LockRelationForExtension. Note that the bulk extension of blocks with
> RelationAddExtraBlocks is not reached, as use_fsm is false.
> 3) After acquiring the relation extension lock, it adds one extra new
> block with ReadBufferBI(relation, P_NEW, ...); see the comment "In
> addition to whatever extension we performed above, we always add at least
> one block to satisfy our own request." The tuple is inserted into this new
> block.
>
> Basically, the workers can't look for empty pages among the pages added
> by other workers; each worker keeps doing the above steps in its own silo.
>
> What happens if the FSM is not skipped, i.e. myState->ti_options = 0;?
> 1) The flag use_fsm will be true in heap_insert->RelationGetBufferForTuple.
> 2) Each worker initially gets a block and keeps inserting into it until it
> is full. When the block is full, the worker looks for a page with free
> space in the FSM (GetPageWithFreeSpace) as use_fsm is true.
> 3) If it can't find any page with the required amount of free space, it
> goes for bulk relation extension (RelationAddExtraBlocks) after acquiring
> the relation extension lock with ConditionalLockRelationForExtension. The
> worker then adds extraBlocks = Min(512, lockWaiters * 20); new blocks in
> RelationAddExtraBlocks and immediately updates the bottom level of the FSM
> for each block (see the comment around RecordPageWithFreeSpace for why
> only the bottom level, not the entire FSM tree). After all the blocks are
> added, it updates the entire FSM tree with FreeSpaceMapVacuumRange.
> 4) After the bulk extension, the worker adds another block (see the
> comment "In addition to whatever extension we performed above, we always
> add at least one block to satisfy our own request.") and inserts the tuple
> into this new block.
>
> Basically, the workers can benefit from the bulk extension of the
> relation, and they can always look for empty pages among the pages added
> by other workers. There is a high chance that blocks will be available
> after bulk extension. Having said that, if the added extra blocks are
> consumed by the workers very fast, i.e. if the tuple sizes are big so that
> there are very few tuples per page, then even the bulk extension can't
> help much and there will be more contention on the relation extension
> lock. Well, one might think to add more blocks at a time, say Min(1024,
> lockWaiters * 128/256/512), instead of the current extraBlocks = Min(512,
> lockWaiters * 20);. This will work (i.e. we don't see any regression with
> parallel inserts in the CTAS patches), but it can't be a practical
> solution, because the total number of pages for the relation will be
> higher, with many pages having more free space. Furthermore, future
> sequential scans on that relation might take a lot of time.
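>
> To make the two paths concrete, here is a heavily simplified sketch of the
> relevant control flow in RelationGetBufferForTuple()
> (src/backend/access/heap/hio.c); the loop over candidate pages, buffer
> locking, and the recheck-after-waiting details are omitted:
>
>     bool use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
>
>     /* ... the current target block is full; look for another page ... */
>     if (use_fsm)
>         targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
>     else
>         targetBlock = InvalidBlockNumber;   /* go straight to extension */
>
>     if (targetBlock == InvalidBlockNumber)
>     {
>         if (!use_fsm)
>             LockRelationForExtension(relation, ExclusiveLock);
>         else if (!ConditionalLockRelationForExtension(relation,
>                                                       ExclusiveLock))
>         {
>             /* Couldn't get the lock immediately; wait for it. */
>             LockRelationForExtension(relation, ExclusiveLock);
>
>             /*
>              * Bulk-extend: adds extraBlocks = Min(512, lockWaiters * 20)
>              * blocks, records each in the bottom FSM level with
>              * RecordPageWithFreeSpace, then fixes the upper levels with
>              * FreeSpaceMapVacuumRange.
>              */
>             RelationAddExtraBlocks(relation, bistate);
>         }
>
>         /*
>          * "In addition to whatever extension we performed above, we
>          * always add at least one block to satisfy our own request."
>          */
>         buffer = ReadBufferBI(relation, P_NEW, ...);
>     }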
>
> If myState->ti_options = TABLE_INSERT_SKIP_FSM; is set in only the one
> place (within if (myState->is_parallel)), then it will be effective only
> for the leader, i.e. the leader will not look in the FSM, but all the
> workers will, because within if (myState->is_parallel_worker) in
> intorel_startup, myState->ti_options = 0; for the workers (see the sketch
> at the end of this message).
>
> I ran tests with the configuration shown at [1] for case 4 (2 bigint (of
> 8 bytes each) columns, 16 name (of 64 bytes each) columns, tuple size 1064
> bytes, 10 million tuples) with leader participation, where I'm seeing
> regression:
>
> 1) When myState->ti_options = TABLE_INSERT_SKIP_FSM; for both leader and
> workers, my results are as follows:
> 0 workers - 116934.137, 2 workers - 209802.060, 4 workers - 248580.275
> 2) When myState->ti_options = 0; for both leader and workers, my results
> are as follows:
> 0 workers - 1116184.718, 2 workers - 139798.055, 4 workers - 143022.409
>
> I hope the above explanation and the test results clarify that skipping
> the FSM doesn't solve the problem. Let me know if anything is not clear or
> I'm missing something.
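>
> For reference, a minimal sketch of the leader/worker split in the patch's
> intorel_startup as described above (is_parallel and is_parallel_worker are
> the patch's fields; the exact patch code may differ):
>
>     if (myState->is_parallel)
>     {
>         /* parallel leader: skip the FSM for its own inserts */
>         myState->ti_options = TABLE_INSERT_SKIP_FSM;
>     }
>     else if (myState->is_parallel_worker)
>     {
>         /* workers: 0 means the FSM is consulted in heap_insert */
>         myState->ti_options = 0;
>     }
>
> The two test cases above set both branches to the same value, so leader
> and workers behave alike.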
Thanks for the explanation. I followed your test steps above with the
configuration below, but my test results are a little different from yours.
I am not sure of the exact reason; maybe it is because of the hardware.

Test: INSERT 10000000 rows (2 bigint (of 8 bytes each) and 16 name (of 64
bytes each) columns):

SERIAL: 22023.631 ms
PARALLEL 2 WORKER [NOT SKIP FSM]: 21824.934 ms  [SKIP FSM]: 19381.474 ms
PARALLEL 4 WORKER [NOT SKIP FSM]: 20481.117 ms  [SKIP FSM]: 18381.305 ms

I am afraid that using the FSM does not seem to give a stable performance
gain (at least on my machine). I will take a deeper look into this to
figure out the difference. A naive idea is that the benefit the bulk
extension brings is not much greater than the cost of maintaining the FSM.
Do you have some ideas on it?

My test machine:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              40
On-line CPU(s) list: 0-39
Thread(s) per core:  2
Core(s) per socket:  10
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
Stepping:            7
CPU MHz:             2901.005
CPU max MHz:         3200.0000
CPU min MHz:         1000.0000
BogoMIPS:            4400.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            14080K

Best regards,
houzj

> [1] postgresql.conf parameters used:
> shared_buffers = 40GB
> max_worker_processes = 32
> max_parallel_maintenance_workers = 24
> max_parallel_workers = 32
> synchronous_commit = off
> checkpoint_timeout = 1d
> max_wal_size = 24GB
> min_wal_size = 15GB
> autovacuum = off
> port = 5440
>
> System Configuration:
> RAM: 528GB
> Disk Type: SSD
> Disk Size: 1.5TB
>
> lscpu
> Architecture:        x86_64
> CPU op-mode(s):      32-bit, 64-bit
> Byte Order:          Little Endian
> CPU(s):              128
> On-line CPU(s) list: 0-127
> Thread(s) per core:  2
> Core(s) per socket:  8
> Socket(s):           8
> NUMA node(s):        8
> Vendor ID:           GenuineIntel
> CPU family:          6
> Model:               47
> Model name:          Intel(R) Xeon(R) CPU E7- 8830 @ 2.13GHz
> Stepping:            2
> CPU MHz:             1064.000
> CPU max MHz:         2129.0000
> CPU min MHz:         1064.0000
> BogoMIPS:            4266.62
> Virtualization:      VT-x
> L1d cache:           32K
> L1i cache:           32K
> L2 cache:            256K
> L3 cache:            24576K