Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Sat, Feb 27, 2016 at 5:14 AM, Andres Freund wrote:
> > > > >> ./pgbench -j$ -c$ -T300 -M prepared -S postgres
> > > > >>
> > > > >> Client    Base      Patch
> > > > >> 1         17169     16454
> > > > >> 8         108547    105559
> > > > >> 32        241619    262818
> > > > >> 64        206868    233606
> > > > >> 128       137084    217013
> > >
> > > So, there's a small regression on low client counts. That's worth
> > > addressing.
> >
> > Interesting. I'll try to reproduce it.
>
> Any progress here?

On a multi-socket machine with 8 sockets and 64 cores I have seen a larger regression than in my previous run on POWER8 with 2 sockets. So far I have tested a read-only workload with 5-minute runs; when I get time I will run for longer and confirm again.

Shared Buffer = 8GB
Scale Factor = 300
./pgbench -j$ -c$ -T300 -M prepared -S postgres

Client    Base      Patch
1         7057      5230
2         10043     9573
4         20140     18188

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Mon, Feb 29, 2016 at 8:26 AM, Amit Kapila wrote:
> Have you tried by reverting the commits 6150a1b0 and ac1d794, which I
> think affect read-only performance and sometimes create variation in TPS
> across different runs? Here the second might have less impact, but the first one
> could impact performance. Is it possible for you to get perf data with and
> without the patch and share it with others?

I only reverted the ac1d794 commit in my test; in my next run I will revert 6150a1b0 as well and test.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Wed, Feb 10, 2016 at 7:06 PM, Dilip Kumar wrote:
> I have tested the relation extension patch from various aspects; the performance results and other statistical data are explained in this mail.

Test 1: Identify whether the heavyweight lock is the problem or the actual context switches are.

1. I converted the RelationExtensionLock to a simple LWLock and tested with a single relation. Results are below. This is a simple script that COPYs 1 record of size 4 bytes per transaction.

Client    Base    LWLock    multi_extend by 50 blocks
1         155     156       160
2         282     276       284
4         248     319       428
8         161     267       675
16        143     241       889

LWLock performance is better than base; the obvious reason may be that we save some instructions by converting to an LWLock, but it doesn't scale any better than the base code.

Test 2: Identify whether the improvement in the multi-extend case is because of avoiding context switches or some other factor, like reusing blocks between backends by putting them in the FSM.

1. Test by just extending multiple blocks and reusing them in the extending backend itself (don't put them in the FSM). Insert 1K records; data doesn't fit in shared buffers; 512MB shared buffers.

Client    Base    Extend 800 blocks, self use    Extend 1000 blocks
1         117     131                            118
2         111     203                            140
3         51      242                            178
4         51      231                            190
5         52      259                            224
6         51      263                            243
7         43      253                            254
8         43      240                            254
16        40      190                            243

We can see the same improvement when the backend uses the blocks itself, which shows that sharing the blocks between backends was not the win; avoiding the context switches was the major win.

2. Measured the number of ProcSleep calls during the run. This is the same simple script that COPYs 1 record of size 4 bytes per transaction.

          BASE CODE                       PATCH MULTI EXTEND
Client    Base TPS    ProcSleep Count     Extend by 10 blocks TPS    ProcSleep Count
2         280         457,506             311                        62,641
3         235         1,098,701           358                        141,624
4         216         1,155,735           368                        188,173

What we can see in the above test is that with the base code performance degrades after 2 threads while the ProcSleep count increases enormously. Compared to that, with the patch extending 10 blocks at a time, the ProcSleep count drops to ~1/8 and performance keeps scaling.

The ProcSleep test for the insert case, where data doesn't fit in shared buffers and a big record of 1024 bytes is inserted, is currently running; once I get the data I will post it.

Posting the rebased version and moving it to the next CF.

Open points:
1. After getting the lock, recheck the FSM in case some other backend has already added extra blocks, and reuse them.
2. Is it a good idea to have a user-level parameter for extend_by_block, or can we try some approach to internally identify how many blocks are needed and add only that many? That would make it more flexible.
-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c index 86b9ae1..78e81dd 100644 --- a/src/backend/access/common/reloptions.c +++ b/src/backend/access/common/reloptions.c @@ -268,6 +268,16 @@ static relopt_int intRelOpts[] = #endif }, + { + { + "extend_by_blocks", + "Number of blocks to be added to relation in every extend call", + RELOPT_KIND_HEAP, + AccessExclusiveLock + }, + 1, 1, 1 + }, + /* list terminator */ {{NULL}} }; @@ -1291,7 +1301,9 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind) {"autovacuum_analyze_scale_factor", RELOPT_TYPE_REAL, offsetof(StdRdOptions, autovacuum) +offsetof(AutoVacOpts, analyze_scale_factor)}, {"user_catalog_table", RELOPT_TYPE_BOOL, - offsetof(StdRdOptions, user_catalog_table)} + offsetof(StdRdOptions, user_catalog_table)}, + {"extend_by_blocks", RELOPT_TYPE_INT, + offsetof(StdRdOptions, extend_by_blocks)} }; options = parseRelOptions(reloptions, validate, kind, &numoptions); diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c index 8140418..eb3ce17 100644 --- a/src/backend/access/heap/hio.c +++ b/src/backend/access/heap/hio.c @@ -238,6 +238,7 @@ RelationGetBufferForTuple(Relation relation, Size len, BlockNumber targetBlock, otherBlock; bool needLock; + int extraBlocks; len = MAXALIGN(len); /* be conservativ
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Mon, Feb 29, 2016 at 5:18 PM, Alexander Korotkov <a.korot...@postgrespro.ru> wrote:
> I didn't reproduce the regression. I had access to a multicore machine but
> didn't see either a regression at low client counts or improvements at high client counts.
> In the attached patch the spinlock delay was exposed in s_lock.h and used
> in LockBufHdr().
> Dilip, could you try this version of the patch? Could you also run perf or
> another profiler in the case of regression? It would be nice to compare
> profiles with and without the patch. We could probably find the cause of the
> regression.

OK, I will test it sometime this week.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Tue, Mar 1, 2016 at 10:19 AM, Dilip Kumar wrote:
> OK, I will test it sometime this week.

I have tested this patch on my laptop, and there I did not see any regression at 1 client. Shared buffers 10GB, 5-minute runs with pgbench, read-only test:

        Base     Patch
run1    22187    24334
run2    26288    27440
run3    26306    27411

Maybe in a day or two I will test it on the same machine where I reported the regression, and if the regression is there I will check the instruction counts using Callgrind.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Tue, Mar 1, 2016 at 4:36 PM, Amit Kapila wrote:
> One thing that is slightly unclear is that whether there is any overhead
> due to buffer eviction especially when the buffer to be evicted is already
> dirty and needs XLogFlush(). One reason why it might not hurt is that by
> the time we tries to evict the buffer, corresponding WAL is already flushed
> by WALWriter or other possibility could be that even if it is getting done
> during buffer eviction, the impact for same is much lesser. Can we try to
> measure the number of flush calls which happen during buffer eviction?

Good idea, I will do this test and post the results.

> > Proc Sleep test for Insert test when data don't fits in shared buffer and
> > inserting big record of 1024 bytes, is currently running
> > once I get the data will post the same.
>
> Okay. However, I wonder if the performance data for the case when data
> doesn't fit into shared buffer also shows similar trend, then it might be
> worth to try by doing extend w.r.t load in system. I mean to say we can
> batch the extension requests (as we have done in ProcArrayGroupClearXid)
> and extend accordingly, if that works out then the benefit could be that we
> don't need any configurable knob for the same.

1. One option can be, as you suggested, something like ProcArrayGroupClearXid, with some modification: if we wait for the requests and extend with respect to them, we may again face the context-switch problem, so maybe we should extend by some multiple of the requests. (But we need to decide whether to give a block directly to the requester, add it into the FSM, or do both, i.e. give at least one block to the requester and put some multiple into the FSM.)

2. Another option is to analyse the current load on extension and speed up or slow down the extension pace accordingly. A simple algorithm could look like this:

    If (GotBlockFromFSM())
        success++            /* we got the block from the FSM */
    Else
        failure++            /* did not get a block from the FSM; had to extend myself */

    Now, while extending --

    Speed up:
    If (failure - success > Threshold)    /* Threshold can be some number, assume 10 */
    {
        /* Many failures mean the load is high, so increase ExtendByBlock. */
        ExtendByBlock += failure - success;
        /* Reset, so that we measure the latest trend. */
        failure = success = 0;
    }

    Slow down:
    /*
     * It is possible that the demand for blocks has dropped but ExtendByBlock
     * is still high, so analyse the statistics and slow down the extension pace.
     */
    If (success - failure > Threshold)
    {
        /*
         * We cannot reduce it by a big number, because more requests may be
         * getting satisfied precisely because this is the right amount; so
         * gradually decrease the pace and re-analyse the statistics next time.
         */
        ExtendByBlock--;
        failure = success = 0;
    }

Any suggestions are welcome.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
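For illustration, a rough C sketch of that second, adaptive heuristic might look like the following. This is only an illustration of the idea above, not code from any posted patch; the names extend_by_blocks, fsm_success, fsm_failure and the threshold of 10 are invented for the example.

    #define EXTEND_ADJUST_THRESHOLD 10

    static int extend_by_blocks = 1;   /* how many blocks to add per extension */
    static int fsm_success = 0;        /* times a block was found in the FSM */
    static int fsm_failure = 0;        /* times we had to extend ourselves */

    static void
    adjust_extend_by_blocks(void)
    {
        if (fsm_failure - fsm_success > EXTEND_ADJUST_THRESHOLD)
        {
            /* Load is high: many callers missed the FSM, so extend more next time. */
            extend_by_blocks += fsm_failure - fsm_success;
            fsm_failure = fsm_success = 0;
        }
        else if (fsm_success - fsm_failure > EXTEND_ADJUST_THRESHOLD)
        {
            /* Demand has dropped: back off gradually rather than all at once. */
            if (extend_by_blocks > 1)
                extend_by_blocks--;
            fsm_failure = fsm_success = 0;
        }
    }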
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Wed, Mar 2, 2016 at 10:39 AM, Andres Freund wrote:
> Sounds like a ppc vs. x86 issue. The regression was on the former, right?

Well, of the regressions I reported the last two times, one was on x86 and the other was on PPC. Copied from the older threads:

On PPC:
>> ./pgbench -j$ -c$ -T300 -M prepared -S postgres
>>
>> Client    Base      Patch
>> 1         17169     16454
>> 8         108547    105559
>> 32        241619    262818
>> 64        206868    233606
>> 128       137084    217013

On x86:
>> Shared Buffer = 8GB
>> Scale Factor = 300
>> ./pgbench -j$ -c$ -T300 -M prepared -S postgres
>>
>> Client    Base     Patch
>> 1         7057     5230
>> 2         10043    9573
>> 4         20140    18188

And this latest result (no regression) is on x86, but on my local machine. I did not look closely at what this new version of the patch does differently, so I will test this version on the other machines as well and see the results.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Wed, Mar 2, 2016 at 10:31 AM, Dilip Kumar wrote:
> 1. One option can be as you suggested like ProcArrayGroupClearXid, With
> some modification, because when we wait for the request and extend w.r.t
> that, may be again we face the Context Switch problem, So may be we can
> extend in some multiple of the Request.
> (But we need to take care whether to give block directly to requester or
> add it into FSM or do both i.e. give at-least one block to requester and
> put some multiple in FSM)
>
> 2. Other option can be that we analyse the current Load on the extend and
> then speed up or slow down the extending speed.
> Simple algorithm can look like this

I have tried the group-extend approach:

1. We convert the extension lock to a try-lock; if we get the lock, we extend by one block.
2. If we don't get the lock, we use the group-leader concept where only one process extends for all. A slight change from ProcArrayGroupClearXid is that, besides satisfying the requesting backends, we add some extra blocks to the FSM, say GroupSize*10.
3. So we cannot measure the exact load, but we still have a factor, the group size, that tells us the contention level, and we extend by a multiple of that.

Performance Analysis:
Performance is scaling with this approach. It is slightly lower than the previous patch, where we directly pass an extend_by_block parameter and extend by multiple blocks; I think that's expected, because in the group-extend case extra blocks are added only after contention actually happens on the lock, whereas the former patch extended extra blocks optimistically.

Test1 (COPY):
./psql -d postgres -c "COPY (select g.i::text FROM generate_series(1, 1) g(i)) TO '/tmp/copybinary' WITH BINARY";
echo COPY data from '/tmp/copybinary' WITH BINARY; > copy_script
./pgbench -f copy_script -T 300 -c$ -j$ postgres

Shared Buffer: 8GB    max_wal_size: 10GB
Storage: Magnetic Disk    WAL: SSD

Client    Base    multi_extend by 20 pages    group_extend patch (groupsize*10)
1         105     157                         149
2         217     255                         282
4         210     494                         452
8         166     702                         629
16        145     563                         561
32        124     480                         566

Test2 (INSERT):
./psql -d postgres -c "create table test_data(a int, b text)"
./psql -d postgres -c "insert into test_data values(generate_series(1,1000),repeat('x', 1024))"
./psql -d postgres -c "create table data (a int, b text)"
echo "insert into data select * from test_data;" >> insert_script

shared_buffers=512GB    max_wal_size=20GB    checkpoint_timeout=10min
./pgbench -c $ -j $ -f insert_script -T 300 postgres

Client    Base    Multi-extend by 1000    group extend (group*10)    group extend (group*100)
1         117     118                     125                        122
2         111     140                     161                        159
4         51      190                     141                        187
8         43      254                     148                        172
16        40      243                     150                        173

* (group*10) -> means that inside the code the group leader sees how many members are in the group wanting blocks; we satisfy the request of every member and also put extra blocks into the FSM. The number of extra blocks to extend = (group*10), where 10 is just some constant.

Summary:
1. With the group-extend patch there is no configuration knob for how many blocks to extend; that is decided by the current load on the system (contention on the extension lock).
2. With a small multiplier, i.e. 10, we get a fairly good improvement compared to the base code, but when the load is high, e.g. a record size of 1KB, increasing the multiplier gives better results.

* Note: Currently this is a POC patch. It has only one group-extend list, so it can handle group extension for only one relation at present.
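To make the group-leader mechanics a bit more concrete, here is a rough C sketch of how a backend could join such a group, in the style of ProcArrayGroupClearXid. This is purely illustrative and not taken from the attached POC patch: the PGPROC field extendGroupNext and the function name JoinRelExtendGroup are invented for the example.

    /*
     * Push ourselves onto the head of a lock-free group list.  The backend
     * that finds the list empty becomes the leader and performs the relation
     * extension for every member (plus GroupSize * 10 extra blocks for the FSM).
     */
    static bool
    JoinRelExtendGroup(PGPROC *proc, pg_atomic_uint32 *groupFirst)
    {
        uint32 nextidx = pg_atomic_read_u32(groupFirst);

        for (;;)
        {
            /* Remember the current head as our successor (hypothetical field). */
            pg_atomic_write_u32(&proc->extendGroupNext, nextidx);

            /*
             * Try to install ourselves as the new head; on failure nextidx is
             * refreshed with the current head and we retry.
             */
            if (pg_atomic_compare_exchange_u32(groupFirst, &nextidx,
                                               (uint32) proc->pgprocno))
                break;
        }

        /* If the list was empty before we attached, we are the group leader. */
        return (nextidx == INVALID_PGPROCNO);
    }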
-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c index 8140418..adb82ba 100644 --- a/src/backend/access/heap/hio.c +++ b/src/backend/access/heap/hio.c @@ -23,6 +23,7 @@ #include "storage/freespace.h" #include "storage/lmgr.h" #include "storage/smgr.h" +#include "storage/proc.h" /* @@ -168,6 +169,160 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2, } } + +static Bloc
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Wed, Mar 2, 2016 at 11:05 AM, Dilip Kumar wrote:
> And this latest result (no regression) is on X86 but on my local machine.
>
> I did not exactly saw what this new version of patch is doing different,
> so I will test this version in other machines also and see the results.

I tested this on PPC again, this time in varying order (sometimes patch first and then base first). I tested with the latest patch, *pinunpin-cas-2.patch*, on POWER8.

Shared Buffer = 8GB
./pgbench -j$ -c$ -T300 -M prepared -S postgres

BASE
Clients    run1     run2     run3
1          21200    18754    20537
2          40331    39520    38746

Patch
Clients    run1     run2     run3
1          20225    19806    19778
2          39830    41898    36620

I think we cannot see any regression here. (If I take the median it may look slightly lower with the patch, so I am posting all 3 readings.)

Note: only the ac1d794 commit was reverted in my test.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Tue, Mar 8, 2016 at 8:34 PM, Amit Kapila wrote:
> > >> Hmm. Can we drive this off of the heavyweight lock manager's idea of
> > >> how big the relation extension lock wait queue is, instead of adding
> > >> more stuff to PGPROC?
> >
> > One idea to make it work without adding additional stuff in PGPROC is that
> > after acquiring relation extension lock, check if there is any available
> > block in fsm, if it founds any block, then release the lock and proceed,
> > else extend the relation by one block and then check lock's wait queue size
> > or number of lock requests (nRequested) and extend the relation further in
> > proportion to wait queue size and then release the lock and proceed. Here,
> > I think we can check for wait queue size even before extending the relation
> > by one block.

I have come up with this patch. If this approach looks fine then I will prepare the final patch (more comments, indentation, and some code improvements) and do some long-run testing (the current results are from 5-minute runs).

The idea is the same as what Robert and Amit suggested upthread:

    /*
     * First try the lock in no-wait mode; if we get it, just extend one block.
     * This gives two benefits: 1. single-thread performance is not hurt by
     * checking lock waiters and so on, and 2. checking the waiters in the else
     * branch gives time for more waiters to collect, so we get a better
     * estimate of the contention.
     */
    TryRelExtLock()
    {
        extend one block
    }
    else
    {
        RelExtLock();
        if (recheck the FSM and somebody else has already added blocks for me)
            don't extend any block, just reuse
        else
        {
            /* we have to extend blocks */
            waiters = lock->nRequested;
            /* add extra blocks to the FSM */
            extraBlocks = waiters * 20;
        }
    }

The results look like this:

Test1 (COPY):
./psql -d postgres -c "COPY (select g.i::text FROM generate_series(1, 1) g(i)) TO '/tmp/copybinary' WITH BINARY";
echo COPY data from '/tmp/copybinary' WITH BINARY; > copy_script
./pgbench -f copy_script -T 300 -c$ -j$ postgres

Shared Buffer: 8GB    max_wal_size: 10GB
Storage: Magnetic Disk    WAL: SSD

Client    Base    multi_extend by 20 pages    lock_waiter patch (waiter*20)
1         105     157                         148
2         217     255                         252
4         210     494                         442
8         166     702                         645
16        145     563                         773
32        124     480                         957

Note: at 1 thread there should not be any improvement, so this is probably run-to-run variance.

Test2 (INSERT):
./psql -d postgres -c "create table test_data(a int, b text)"
./psql -d postgres -c "insert into test_data values(generate_series(1,1000),repeat('x', 1024))"
./psql -d postgres -c "create table data (a int, b text)"
echo "insert into data select * from test_data;" >> insert_script

shared_buffers=512GB    max_wal_size=20GB    checkpoint_timeout=10min
./pgbench -c $ -j $ -f insert_script -T 300 postgres

Client    Base    Multi-extend by 1000    lock_waiter patch (waiter*20)
1         117     118                     117
2         111     140                     132
4         51      190                     134
8         43      254                     148
16        40      243                     206
32        -       -                       264

* (waiter*20) -> the first process that gets the lock finds the number of lock waiters and adds waiter*20 extra blocks.

In the next run I will go beyond 32 clients as well; as we can see, it is still increasing even at 32 clients. So clearly, when it sees more contention it adapts and adds more blocks.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

multi_extend_v5.patch
Description: Binary data

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Relation extension scalability
On Wed, Mar 9, 2016 at 1:39 AM, Robert Haas wrote: > LockWaiterCount() bravely accesses a shared memory data structure that > is mutable with no locking at all. That might actually be safe for > our purposes, but I think it would be better to take the partition > lock in shared mode if it doesn't cost too much performance. If > that's too expensive, then it should at least have a comment > explaining (1) that it is doing this without the lock and (2) why > that's safe (sketch: the LOCK can't go away because we hold it, and > nRequested could change but we don't mind a stale value, and a 32-bit > read won't be torn). > With LWLock also performance are equally good so added the lock. > > A few of the other functions in here also lack comments, and perhaps > should have them. > > The changes to RelationGetBufferForTuple need a visit from the > refactoring police. Look at the way you are calling > RelationAddOneBlock. The logic looks about like this: > > if (needLock) > { > if (trylock relation for extension) > RelationAddOneBlock(); > else > { > lock relation for extension; > if (use fsm) > { > complicated; > } > else > RelationAddOneBlock(); > } > else > RelationAddOneBlock(); > > So basically you want to do the RelationAddOneBlock() thing if > !needLock || !use_fsm || can't trylock. See if you can rearrange the > code so that there's only one fallthrough call to > RelationAddOneBlock() instead of three separate ones. > Actually in every case we need one blocks, So I have re factored it and RelationAddOneBlock is now out of any condition. > > Also, consider moving the code that adds multiple blocks at a time to > its own function instead of including it all in line. > Done Attaching a latest patch. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c index 8140418..eb4ee0c 100644 --- a/src/backend/access/heap/hio.c +++ b/src/backend/access/heap/hio.c @@ -169,6 +169,86 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2, } /* + * RelationAddOneBlock + * + * Extend relation by one block and lock the buffer + */ +static Buffer +RelationAddOneBlock(Relation relation, Buffer otherBuffer, BulkInsertState bistate) +{ + Buffer buffer; + /* + * XXX This does an lseek - rather expensive - but at the moment it is the + * only way to accurately determine how many blocks are in a relation. Is + * it worth keeping an accurate file length in shared memory someplace, + * rather than relying on the kernel to do it for us? + */ + buffer = ReadBufferBI(relation, P_NEW, bistate); + + /* + * We can be certain that locking the otherBuffer first is OK, since + * it must have a lower page number. + */ + if (otherBuffer != InvalidBuffer) + LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE); + + /* + * Now acquire lock on the new page. + */ + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + return buffer; +} +/* + * RelationAddExtraBlocks + * + * Extend extra blocks for the relations to avoid the future contention + * on the relation extension lock. + */ + +static void +RelationAddExtraBlocks(Relation relation, BulkInsertState bistate) +{ + Page page; + Size freespace; + BlockNumber blockNum; + int extraBlocks = 0; + int lockWaiters = 0; + Buffer buffer; + + /* + * For calculating number of extra blocks to extend, find the level + * of contention on this lock, by getting the requester of this lock + * and add extra blocks in multiple of waiters. 
+ */ + lockWaiters = RelationExtensionLockWaiter(relation); + + extraBlocks = lockWaiters * 20; + + while (extraBlocks--) + { + /* + * Here we are adding extra blocks to the relation after + * adding each block update the information in FSM so that + * other backend running parallel can find the block. + */ + buffer = ReadBufferBI(relation, P_NEW, bistate); + + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + page = BufferGetPage(buffer); + PageInit(page, BufferGetPageSize(buffer), 0); + freespace = PageGetHeapFreeSpace(page); + MarkBufferDirty(buffer); + blockNum = BufferGetBlockNumber(buffer); + + UnlockReleaseBuffer(buffer); + + RecordPageWithFreeSpace(relation, blockNum, freespace); + } +} + +/* * RelationGetBufferForTuple * * Returns pinned and exclusive-locked buffer of a page in given relation @@ -233,10 +313,11 @@ RelationGetBufferForTuple(Relation relation, Size len, bool use_fsm = !(options & HEAP_INSERT_SKIP_FSM); Buffer buffer = InvalidBuffer; Page page; - Size pageFreeSpace, -saveFreeSpace; + Size pageFreeSpace = 0, +saveFreeSpace = 0; BlockNumber targetBlock, -otherBlock; +otherBlock, +lastValidBlock = InvalidBlockNumber; bool needLock; len
Re: [HACKERS] Relation extension scalability
On Thu, Mar 10, 2016 at 7:55 AM, Petr Jelinek wrote: Thanks for the comments.. > Hmm, why did you remove the comment above the call to > UnlockRelationForExtension? While re factoring I lose this comment.. Fixed it > It still seems relevant, maybe with some minor modification? > > Also there is a bit of whitespace mess inside the conditional lock block. Fixed I got the result of 10 mins run so posting it.. Note: Base code results are copied from up thread... Results For 10 Mins run of COPY 1 records of size 4 bytes script and configuration are same as used in up thread ClientBasePatch 1105111 2217246 4210428 8166653 16 145808 32 124988 64---974 Results For 10 Mins run of INSERT 1000 records of size 1024 bytes(data don't fits in shared buffer) -- ClientBasePatch 1117120 2111126 4 51 130 8 43 147 16 40 209 32 --- 254 64 --- 205 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c index 8140418..b73535c 100644 --- a/src/backend/access/heap/hio.c +++ b/src/backend/access/heap/hio.c @@ -169,6 +169,87 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2, } /* + * RelationAddOneBlock + * + * Extend relation by one block and lock the buffer + */ +static Buffer +RelationAddOneBlock(Relation relation, Buffer otherBuffer, BulkInsertState bistate) +{ + Buffer buffer; + /* + * XXX This does an lseek - rather expensive - but at the moment it is the + * only way to accurately determine how many blocks are in a relation. Is + * it worth keeping an accurate file length in shared memory someplace, + * rather than relying on the kernel to do it for us? + */ + buffer = ReadBufferBI(relation, P_NEW, bistate); + + /* + * We can be certain that locking the otherBuffer first is OK, since + * it must have a lower page number. + */ + if (otherBuffer != InvalidBuffer) + LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE); + + /* + * Now acquire lock on the new page. + */ + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + return buffer; +} + +/* + * RelationAddExtraBlocks + * + * Extend extra blocks for the relations to avoid the future contention + * on the relation extension lock. + */ + +static void +RelationAddExtraBlocks(Relation relation, BulkInsertState bistate) +{ + Page page; + Size freespace; + BlockNumber blockNum; + int extraBlocks = 0; + int lockWaiters = 0; + Buffer buffer; + + /* + * For calculating number of extra blocks to extend, find the level + * of contention on this lock, by getting the requester of this lock + * and add extra blocks in multiple of waiters. + */ + lockWaiters = RelationExtensionLockWaiter(relation); + + extraBlocks = lockWaiters * 20; + + while (extraBlocks--) + { + /* + * Here we are adding extra blocks to the relation after + * adding each block update the information in FSM so that + * other backend running parallel can find the block. 
+ */ + buffer = ReadBufferBI(relation, P_NEW, bistate); + + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + page = BufferGetPage(buffer); + PageInit(page, BufferGetPageSize(buffer), 0); + freespace = PageGetHeapFreeSpace(page); + MarkBufferDirty(buffer); + blockNum = BufferGetBlockNumber(buffer); + + UnlockReleaseBuffer(buffer); + + RecordPageWithFreeSpace(relation, blockNum, freespace); + } +} + +/* * RelationGetBufferForTuple * * Returns pinned and exclusive-locked buffer of a page in given relation @@ -233,10 +314,11 @@ RelationGetBufferForTuple(Relation relation, Size len, bool use_fsm = !(options & HEAP_INSERT_SKIP_FSM); Buffer buffer = InvalidBuffer; Page page; - Size pageFreeSpace, -saveFreeSpace; + Size pageFreeSpace = 0, +saveFreeSpace = 0; BlockNumber targetBlock, -otherBlock; +otherBlock, +lastValidBlock = InvalidBlockNumber; bool needLock; len = MAXALIGN(len); /* be conservative */ @@ -308,6 +390,7 @@ RelationGetBufferForTuple(Relation relation, Size len, } } +loop: while (targetBlock != InvalidBlockNumber) { /* @@ -388,6 +471,8 @@ RelationGetBufferForTuple(Relation relation, Size len, otherBlock, targetBlock, vmbuffer_other, vmbuffer); + lastValidBlock = targetBlock; + /* * Now we can check to see if there's enough free space here. If so, * we're done. @@ -441,27 +526,47 @@ RelationGetBufferForTuple(Relation relation, Size len, needLock = !RELATION_IS_LOCAL(relation); if (needLock) - LockRelationForExtension(relation, ExclusiveLock);
Re: [HACKERS] Relation extension scalability
On Fri, Mar 11, 2016 at 12:04 AM, Petr Jelinek wrote:

Thanks for looking.

> Those look good. The patch looks good in general now. I am a bit scared by
> the lockWaiters * 20 as it can result in relatively big changes in rare
> corner cases when for example a lot of backends were waiting for lock on
> relation and suddenly all try to extend it. I wonder if we should clamp it
> to something sane (although what's sane today might be small in couple of
> years).

But in that condition, when all are waiting on the lock, only one waiter at a time will get the lock, and it can easily count the lock waiters and extend by a multiple of that. We also have the check that when any waiter gets the lock it will first look in the FSM to see whether anybody else has already added blocks. And the other waiters will not get the lock until the first waiter has extended all the blocks and released the lock.

One possible case is that, as soon as we extend the blocks, new requesters find them directly in the FSM and never come for the lock, while an old waiter, after getting the lock, finds nothing in the FSM. But IMHO even in such cases it is good that the other waiter also extends more blocks, because this can only happen when the request rate is very high.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Thu, Mar 10, 2016 at 8:26 PM, Alexander Korotkov <a.korot...@postgrespro.ru> wrote:
> I don't think we can rely on median that much if we have only 3 runs.
> For 3 runs we can only apply Kornfeld method which claims that confidence
> interval should be between lower and upper values.
> Since confidence intervals for master and patched versions are overlapping
> we can't conclude that expected TPS numbers are different.
> Dilip, could you do more runs? 10, for example. Using such statistics we
> would be able to conclude something.

Here are the readings for 10 runs.

Median Result
---------------------------
Client    Base     Patch
---------------------------
1         19873    19739
2         38658    38276
4         68812    62075

Full results of all 10 runs...

Base
--------------------------------------------
Runs    1 Client    2 Client    4 Client
--------------------------------------------
1       19442       34866       49023
2       19605       35139       51695
3       19726       37104       53899
4       19835       38439       55708
5       19866       38638       67473
6       19880       38679       70152
7       19973       38720       70920
8       20048       38737       71342
9       20057       38815       71403
10      20344       41423       77953
--------------------------------------------

Patch
--------------------------------------------
Runs    1 Client    2 Client    4 Client
--------------------------------------------
1       18881       30161       54928
2       19415       32031       59741
3       19564       35022       61060
4       19627       36712       61839
5       19670       37659       62011
6       19808       38894       62139
7       19857       39081       62983
8       19910       39923       75358
9       20169       39937       77481
10      20181       40003       78462
--------------------------------------------

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Sat, Mar 12, 2016 at 5:31 AM, Jim Nasby wrote:
> FWIW, this is definitely a real possibility in any shop that has very high
> downtime costs and high transaction rates.
>
> I also think some kind of clamp is a good idea. It's not that uncommon to
> run max_connections significantly higher than 100, so the extension could
> be way larger than 16MB. In those cases this patch could actually make
> things far worse as everyone backs up waiting on the OS to extend many MB
> when all you actually needed were a couple dozen more pages.

I agree; we can have some maximum limit on the number of extra pages. What do others think?

> BTW, how was *20 arrived at? ISTM that if you have a lot of concurrent
> demand for extension that means you're running lots of small DML
> operations, not really big ones. I'd think that would make *1 more
> appropriate.

*1 will not solve this problem. The main problem here was that many backends sleep/wake up on the extension lock, and that was causing the bottleneck. If we do *1, it only satisfies the current requesters that have already waited on the lock, but our goal is to keep backends from requesting this lock at all. The idea is to use the number of requesters as a statistic for the load on the lock and extend by a multiple of that load, so that the situation is avoided for a long time; when it happens again, we again extend by a multiple of the load.

How did 20 come about? I tested with multiple client loads (1..64) and multiple record sizes (4-byte records to 1KB records), with COPY and INSERT, and found that 20 works best.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Sat, Mar 12, 2016 at 8:37 AM, Amit Kapila wrote:
> Can you post the numbers for 1, 5, 10, 15, 25 or whatever other multiplier
> you have tried, so that it is clear that 20 is best?

I had tried with 1, 10, 20 and 50.

1. With multiplier 1, it was almost the same as the base code.
2. With 10, the data matches my previous group-extend patch data posted upthread: http://www.postgresql.org/message-id/CAFiTN-tyEu+Wf0-jBc3TGfCoHdEAjNTx=wvuxpoa1vddyst...@mail.gmail.com
3. Beyond 20, with 50, I did not see any extra performance benefit compared to 20. (I tested 50 only with 4-byte COPY; I did not test other data sizes with 50.)

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Mon, Mar 14, 2016 at 5:02 AM, Jim Nasby wrote:
> Well, 16MB is 2K pages, which is what you'd get if 100 connections were
> all blocked and we're doing 20 pages per waiter. That seems like a really
> extreme scenario, so maybe 4MB is a good compromise. That's unlikely to be
> hit in most cases, unlikely to put a ton of stress on IO, even with
> magnetic media (assuming the whole 4MB is queued to write in one shot...).
> 4MB would still reduce the number of locks by 500x.

In the performance results I posted upthread, we get maximum performance at 32 clients, which means we are extending at most about 32*20 = 640 pages at a time. So with a 4MB limit (max 512 pages) the results should look similar. We need to decide whether 4MB is a good limit; should I change it?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Thu, Mar 17, 2016 at 1:31 PM, Petr Jelinek wrote: > Great. > > Just small notational thing, maybe this would be simpler?: > extraBlocks = Min(512, lockWaiters * 20); > Done, new patch attached. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c index 8140418..a9e18dc 100644 --- a/src/backend/access/heap/hio.c +++ b/src/backend/access/heap/hio.c @@ -169,6 +169,91 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2, } /* + * RelationAddOneBlock + * + * Extend relation by one block and lock the buffer + */ +static Buffer +RelationAddOneBlock(Relation relation, Buffer otherBuffer, BulkInsertState bistate) +{ + Buffer buffer; + /* + * XXX This does an lseek - rather expensive - but at the moment it is the + * only way to accurately determine how many blocks are in a relation. Is + * it worth keeping an accurate file length in shared memory someplace, + * rather than relying on the kernel to do it for us? + */ + buffer = ReadBufferBI(relation, P_NEW, bistate); + + /* + * We can be certain that locking the otherBuffer first is OK, since + * it must have a lower page number. + */ + if (otherBuffer != InvalidBuffer) + LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE); + + /* + * Now acquire lock on the new page. + */ + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + return buffer; +} + +/* + * RelationAddExtraBlocks + * + * Extend extra blocks for the relations to avoid the future contention + * on the relation extension lock. + */ + +static void +RelationAddExtraBlocks(Relation relation, BulkInsertState bistate) +{ + Page page; + Size freespace; + BlockNumber blockNum; + int extraBlocks = 0; + int lockWaiters = 0; + Buffer buffer; + + /* + * For calculating number of extra blocks to extend, find the level + * of contention on this lock, by getting the requester of this lock + * and add extra blocks in multiple of waiters. + */ + lockWaiters = RelationExtensionLockWaiter(relation); + + /* To avoid the cases when there are huge number of lock waiters, and + * extend file size by big amount at a time, put some limit on the + * max number of pages to be extended at a time. + */ + extraBlocks = Min(512, lockWaiters * 20); + + while (extraBlocks--) + { + /* + * Here we are adding extra blocks to the relation after + * adding each block update the information in FSM so that + * other backend running parallel can find the block. 
+ */ + buffer = ReadBufferBI(relation, P_NEW, bistate); + + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + page = BufferGetPage(buffer); + PageInit(page, BufferGetPageSize(buffer), 0); + freespace = PageGetHeapFreeSpace(page); + MarkBufferDirty(buffer); + blockNum = BufferGetBlockNumber(buffer); + + UnlockReleaseBuffer(buffer); + + RecordPageWithFreeSpace(relation, blockNum, freespace); + } +} + +/* * RelationGetBufferForTuple * * Returns pinned and exclusive-locked buffer of a page in given relation @@ -233,10 +318,11 @@ RelationGetBufferForTuple(Relation relation, Size len, bool use_fsm = !(options & HEAP_INSERT_SKIP_FSM); Buffer buffer = InvalidBuffer; Page page; - Size pageFreeSpace, -saveFreeSpace; + Size pageFreeSpace = 0, +saveFreeSpace = 0; BlockNumber targetBlock, -otherBlock; +otherBlock, +lastValidBlock = InvalidBlockNumber; bool needLock; len = MAXALIGN(len); /* be conservative */ @@ -308,6 +394,7 @@ RelationGetBufferForTuple(Relation relation, Size len, } } +loop: while (targetBlock != InvalidBlockNumber) { /* @@ -388,6 +475,8 @@ RelationGetBufferForTuple(Relation relation, Size len, otherBlock, targetBlock, vmbuffer_other, vmbuffer); + lastValidBlock = targetBlock; + /* * Now we can check to see if there's enough free space here. If so, * we're done. @@ -441,27 +530,47 @@ RelationGetBufferForTuple(Relation relation, Size len, needLock = !RELATION_IS_LOCAL(relation); if (needLock) - LockRelationForExtension(relation, ExclusiveLock); - - /* - * XXX This does an lseek - rather expensive - but at the moment it is the - * only way to accurately determine how many blocks are in a relation. Is - * it worth keeping an accurate file length in shared memory someplace, - * rather than relying on the kernel to do it for us? - */ - buffer = ReadBufferBI(relation, P_NEW, bistate); - - /* - * We can be certain that locking the otherBuffer first is OK, since it - * must have a lower page number. - */ - if (otherBuffer != InvalidBuffer) - LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE); + { + /* + * First try to get the lock in no-wait mode, if succeed extend one + * block, else get the lock in normal mode and after we get the lock + * extend some extra blocks, extra blocks will be added to satisfy + * request of other waiters and avoid future contention. Here instead + * of directly ta
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Mon, Mar 14, 2016 at 3:09 AM, Alexander Korotkov <a.korot...@postgrespro.ru> wrote:
> I've drawn graphs for these measurements. The variation doesn't look
> random here. TPS is going higher from measurement to measurement. I bet
> you did measurements sequentially.
> I think we should do more measurements until TPS stop growing and beyond
> to see accumulate average statistics. Could you please do the same tests
> but 50 runs.

I have taken readings at different client counts (1, 2 and 4).

1. Avg: the average is taken after discarding the lowest 5 and the highest 5 readings.
2. With 4 clients I have only 30 readings.

Summary: I think if we look at the avg or the median, at 1 or 2 clients head always looks like the winner, but the same is not the case at 4 clients. And with 4 clients I can see much more fluctuation in the readings.

           Head (1 Client)  Patch (1 Client)  Head (2 Client)  Patch (2 Client)  Head (4 Client)  Patch (4 Client)
Avg:       19628            19578             37180            36536             70044            70731
Median:    19663            19581             37967            37484             73003            75376

Below are all the readings (arranged in sorted order).

Runs   Head (1 Client)   Patch (1 Client)   Head (2 Client)   Patch (2 Client)   Head (4 Client)   Patch (4 Client)
1      18191             18153              29454             26128              49931             47210
2      18365             18768              31218             26956              53358             47287
3      19053             18769              31808             29286              53575             55458
4      19128             18915              31860             30558              54282             55794
5      19160             18948              32411             30945              56629             56253
6      19177             19055              32453             31316              57595             58147
7      19351             19232              32571             31703              59490             58543
8      19353             19239              33294             32029              63887             58990
9      19361             19255              33936             32217              64718             60233
10     19390             19297              34361             32767              65737             68210
11     19452             19368              34563             34487              65881             71409
12     19478             19387              36110             34907              67151             72108
13     19488             19388              36221             34936              70974             73977
14     19490             19395              36461             35068              72212             74891
15     19512             19406              36712             35298              73003             75376
16     19538             19450              37104             35492              74842             75468
17     19547             19487              37246             36648              75400             75515
18     19592             19521              37567             37263              75573             75626
19     19627             19536              37707             37430              75798             75745
20     19661             19556              37958             37461              75834             75868
21     19666             19607              37976             37507              76240             76161
22     19701             19624              38060             37557              76426             76162
23     19708             19643              38244             37717              76770             76333
24     19748             19684              38272             38285              77011             77009
25     19751             19694              38467             38294              77114             77168
26     19776             19695              38524             38469              77630             77318
27     19781             19709              38625             38642              77865             77550
28     19786             19765              38756             38643              77912             77904
29     19796             19823              38939             38649              78242             78736
30     19826             19847              39139             38659
31     19837             19899              39208             38713
32     19849             19909              39211             38837
33     19854             19932              39230             38876
34     19867             19949              39249             39088
35     19891             19990              39259             39148
36     20038             20085              39286             39453
37     20083             20128              39435             39563
38     20143             20166              39448             39959
39     20191             20198              39475             40495
40     20437             20455              40375             40664

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Mon, Mar 14, 2016 at 8:26 AM, Petr Jelinek wrote: > Well any value we choose will be very arbitrary. If we look at it from the > point of maximum absolute disk space we allocate for relation at once, > the 4MB limit would represent 2.5 orders of magnitude change. That sounds > like enough for one release cycle, I think we can further tune it if the > need arises in next one. (with my love for round numbers I would have > suggested 8MB as that's 3 orders of magnitude, but I am fine with 4MB as > well) > I have modified the patch, this contains the max limit on extra pages, 512(4MB) pages is the max limit. I have measured the performance also and that looks equally good. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c index 8140418..fcd42b9 100644 --- a/src/backend/access/heap/hio.c +++ b/src/backend/access/heap/hio.c @@ -169,6 +169,94 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2, } /* + * RelationAddOneBlock + * + * Extend relation by one block and lock the buffer + */ +static Buffer +RelationAddOneBlock(Relation relation, Buffer otherBuffer, BulkInsertState bistate) +{ + Buffer buffer; + /* + * XXX This does an lseek - rather expensive - but at the moment it is the + * only way to accurately determine how many blocks are in a relation. Is + * it worth keeping an accurate file length in shared memory someplace, + * rather than relying on the kernel to do it for us? + */ + buffer = ReadBufferBI(relation, P_NEW, bistate); + + /* + * We can be certain that locking the otherBuffer first is OK, since + * it must have a lower page number. + */ + if (otherBuffer != InvalidBuffer) + LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE); + + /* + * Now acquire lock on the new page. + */ + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + return buffer; +} + +/* + * RelationAddExtraBlocks + * + * Extend extra blocks for the relations to avoid the future contention + * on the relation extension lock. + */ + +static void +RelationAddExtraBlocks(Relation relation, BulkInsertState bistate) +{ + Page page; + Size freespace; + BlockNumber blockNum; + int extraBlocks = 0; + int lockWaiters = 0; + Buffer buffer; + + /* + * For calculating number of extra blocks to extend, find the level + * of contention on this lock, by getting the requester of this lock + * and add extra blocks in multiple of waiters. + */ + lockWaiters = RelationExtensionLockWaiter(relation); + + extraBlocks = lockWaiters * 20; + + /* To avoid the cases when there are huge number of lock waiters, and + * extend file size by big amount at a time, put some limit on the + * max number of pages to be extended at a time. + */ + if (extraBlocks > 512) + extraBlocks = 512; + + while (extraBlocks--) + { + /* + * Here we are adding extra blocks to the relation after + * adding each block update the information in FSM so that + * other backend running parallel can find the block. 
+ */ + buffer = ReadBufferBI(relation, P_NEW, bistate); + + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + page = BufferGetPage(buffer); + PageInit(page, BufferGetPageSize(buffer), 0); + freespace = PageGetHeapFreeSpace(page); + MarkBufferDirty(buffer); + blockNum = BufferGetBlockNumber(buffer); + + UnlockReleaseBuffer(buffer); + + RecordPageWithFreeSpace(relation, blockNum, freespace); + } +} + +/* * RelationGetBufferForTuple * * Returns pinned and exclusive-locked buffer of a page in given relation @@ -233,10 +321,11 @@ RelationGetBufferForTuple(Relation relation, Size len, bool use_fsm = !(options & HEAP_INSERT_SKIP_FSM); Buffer buffer = InvalidBuffer; Page page; - Size pageFreeSpace, -saveFreeSpace; + Size pageFreeSpace = 0, +saveFreeSpace = 0; BlockNumber targetBlock, -otherBlock; +otherBlock, +lastValidBlock = InvalidBlockNumber; bool needLock; len = MAXALIGN(len); /* be conservative */ @@ -308,6 +397,7 @@ RelationGetBufferForTuple(Relation relation, Size len, } } +loop: while (targetBlock != InvalidBlockNumber) { /* @@ -388,6 +478,8 @@ RelationGetBufferForTuple(Relation relation, Size len, otherBlock, targetBlock, vmbuffer_other, vmbuffer); + lastValidBlock = targetBlock; + /* * Now we can check to see if there's enough free space here. If so, * we're done. @@ -441,27 +533,47 @@ RelationGetBufferForTuple(Relation relation, Size len, needLock = !RELATION_IS_LOCAL(relation); if (needLock) - LockRelationForExtension(relation, ExclusiveLock); - - /* - * XXX This does an lseek - rather expensive - but at the moment it is the - * only way to accurately determine how many blocks are in a relation. Is - * it worth keeping an accurate file length in shared memory someplace, - * rather than relying on the kernel to do it for us? - */ - buffer = ReadBufferBI(relation, P_NEW
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Sun, Mar 20, 2016 at 4:10 AM, Alexander Korotkov <a.korot...@postgrespro.ru> wrote:
> Actually, we behave like old code and do such modifications without
> increasing number of atomic operations. We can just calculate new value of
> state (including unset of BM_LOCKED flag) and write it to the buffer. In
> this case we don't require more atomic operations. However, it becomes not
> safe to use atomic decrement in UnpinBuffer(). We can use there loop of
> CAS which wait buffer header to be unlocked like PinBuffer() does. I'm not
> sure if this affects multicore scalability, but I think it worth trying.
>
> So, this idea is implemented in attached patch. Please, try it for both
> regression on lower number of clients and scalability on large number of
> clients.

Good, it seems like we have reduced some instructions; I will test it soon.

> Other changes in this patch:
> 1) PinBuffer() and UnpinBuffer() use exponential backoff in spindelay
> like LockBufHdr() does.
> 2) Previous patch contained revert
> of ac1d7945f866b1928c2554c0f80fd52d7f92. This was not intentional.

No, it doesn't contain this revert.

Some other comments..

*** ReadBuffer_common(SMgrRelation smgr, cha
*** 828,837
  */
  do
  {
! LockBufHdr(bufHdr);
! Assert(bufHdr->flags & BM_VALID);
! bufHdr->flags &= ~BM_VALID;
! UnlockBufHdr(bufHdr);
  } while (!StartBufferIO(bufHdr, true));
  }
  }
--- 826,834
  */
  do
  {
! uint32 state = LockBufHdr(bufHdr);
! state &= ~(BM_VALID | BM_LOCKED);
! pg_atomic_write_u32(&bufHdr->state, state);
  } while (!StartBufferIO(bufHdr, true));

It would be better to write a comment here about the fact that we clear BM_LOCKED from the state directly and so need not call UnlockBufHdr explicitly; otherwise it is confusing.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Tue, Mar 22, 2016 at 12:31 PM, Dilip Kumar wrote:
> ! pg_atomic_write_u32(&bufHdr->state, state);
> } while (!StartBufferIO(bufHdr, true));
>
> It would be better to write a comment here about the fact that we clear
> BM_LOCKED from the state directly and so need not call UnlockBufHdr
> explicitly; otherwise it is confusing.

A few more comments..

*** 828,837
  */
  do
  {
! LockBufHdr(bufHdr);
! Assert(bufHdr->flags & BM_VALID);
! bufHdr->flags &= ~BM_VALID;
! UnlockBufHdr(bufHdr);
  } while (!StartBufferIO(bufHdr, true));
  }
  }
--- 826,834
  */
  do
  {
! uint32 state = LockBufHdr(bufHdr);
! state &= ~(BM_VALID | BM_LOCKED);
! pg_atomic_write_u32(&bufHdr->state, state);
  } while (!StartBufferIO(bufHdr, true));

1. Previously there was an Assert; is there any reason why we removed it?
   Assert(bufHdr->flags & BM_VALID);

2. Also, if we don't need the Assert, then can't we directly clear the BM_VALID flag from the state variable (when it's not locked), like pin/unpin buffer do, without taking the buffer header lock?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
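For what it's worth, point 2 above would presumably look something like the sketch below, i.e. a compare-and-swap loop similar to what PinBuffer()/UnpinBuffer() do in the patched code. This is only an illustration of the question, not code from the patch, and the helper name is invented.

    /*
     * Sketch: clear BM_VALID without taking the buffer header spinlock.
     * Wait (by re-reading) while somebody else holds the header lock, then
     * try to install a state value with BM_VALID cleared.
     */
    static void
    ClearBufferValidFlag(BufferDesc *bufHdr)
    {
        uint32 oldstate = pg_atomic_read_u32(&bufHdr->state);

        for (;;)
        {
            uint32 newstate;

            if (oldstate & BM_LOCKED)
            {
                /* Header is locked by someone else; re-read and retry. */
                oldstate = pg_atomic_read_u32(&bufHdr->state);
                continue;
            }

            newstate = oldstate & ~BM_VALID;

            /* On failure, oldstate is refreshed with the current value. */
            if (pg_atomic_compare_exchange_u32(&bufHdr->state, &oldstate, newstate))
                break;
        }
    }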
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Thu, Mar 17, 2016 at 11:39 AM, Amit Kapila wrote:

I have reviewed the patch. Here are some review comments; I will continue to review.

1.
+
+ /*
+ * Add the proc to list, if the clog page where we need to update the
+ */
+ if (nextidx != INVALID_PGPROCNO &&
+ ProcGlobal->allProcs[nextidx].clogGroupMemberPage != proc->clogGroupMemberPage)
+ return false;

Should we clear all the structure members that we set above in the case where we do not add ourselves to the group? I can see that it will not cause any problem even if we don't clear them; if we don't want to clear them, I think we can add a comment mentioning that.

+ proc->clogGroupMember = true;
+ proc->clogGroupMemberXid = xid;
+ proc->clogGroupMemberXidStatus = status;
+ proc->clogGroupMemberPage = pageno;
+ proc->clogGroupMemberLsn = lsn;

2. Here we are updating our own proc; I think we don't need an atomic operation here, since we have not yet been added to the list.

+ if (nextidx != INVALID_PGPROCNO &&
+ ProcGlobal->allProcs[nextidx].clogGroupMemberPage != proc->clogGroupMemberPage)
+ return false;
+
+ pg_atomic_write_u32(&proc->clogGroupNext, nextidx);

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
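For point 1, the early-return path could either reset those fields or simply carry a comment saying that leaving them set is harmless. A sketch of the reset variant, purely for illustration (the field names follow the quoted patch, but the reset values are my assumption, not taken from the patch):

    if (nextidx != INVALID_PGPROCNO &&
        ProcGlobal->allProcs[nextidx].clogGroupMemberPage != proc->clogGroupMemberPage)
    {
        /*
         * We are not going to join this group after all, so undo the member
         * fields we filled in above (alternatively, a comment could note that
         * leaving them set is safe because they are overwritten on the next
         * attempt).
         */
        proc->clogGroupMember = false;
        proc->clogGroupMemberXid = InvalidTransactionId;
        proc->clogGroupMemberPage = -1;
        return false;
    }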
Re: [HACKERS] Relation extension scalability
On Mon, Mar 21, 2016 at 8:10 PM, Amit Kapila wrote:
> Review comments:

Thanks for the review; please find my responses inline.

> 1.
> /*
> + * RelationAddOneBlock
> + *
> + * Extend relation by one block and lock the buffer
> + */
> +static Buffer
> +RelationAddOneBlock(Relation relation, Buffer otherBuffer, BulkInsertState bistate)
>
> Shall we mention in comments that this API returns locked buffer and it's
> the responsibility of caller to unlock it.

Fixed.

> 2.
> + /* To avoid the cases when there are huge number of lock waiters, and
> + * extend file size by big amount at a time, put some limit on the
>
> first line in multi-line comments should not contain anything.

Fixed.

> 3.
> + extraBlocks = Min(512, lockWaiters * 20);
>
> I think we should explain in comments about the reason of choosing 20 in
> above calculation.

Fixed. Please check whether the explanation is enough or we need to add something more.

> 4. Sometime back [1], you agreed on doing some analysis for the overhead
> that XLogFlush can cause during buffer eviction, but I don't see the
> results of same, are you planing to complete the same?

OK, I will test this.

> 5.
> + if (LOCKACQUIRE_OK
> + != RelationExtensionLockConditional(relation, ExclusiveLock))
>
> I think the coding style is to keep constant on right side of condition,
> did you see any other place in code which uses the check in a similar way?

Fixed. I'm not sure about any other place. (I used to follow the habit of keeping the constant on the left to avoid the case where "=" is typed by mistake instead of "==", which would silently become an assignment instead of throwing an error.) But in PostgreSQL style the constant should be on the right, so I will take care of that.

> 6.
> - /*
> - * Now acquire lock on the new page.
> + /* For all case we need to add at least one block to satisfy our
> + * own request.
> */
>
> Same problem as in point 2.

Fixed.

> 7.
> @@ -472,6 +581,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
> if (needLock)
> UnlockRelationForExtension(relation, ExclusiveLock);
> +
> +
>
> Spurious newline addition.

Fixed.

> 8.
> +int
> +RelationExtensionLockWaiter(Relation relation)
>
> How about naming this function as RelationExtensionLockWaiterCount?

Done.

> 9.
> + /* Other waiter has extended the block for us*/
>
> Provide an extra space before ending the comment.

Fixed.

> 10.
> + if (use_fsm)
> + {
> + if (lastValidBlock != InvalidBlockNumber)
> + {
> + targetBlock = RecordAndGetPageWithFreeSpace(relation,
> + lastValidBlock,
> + pageFreeSpace,
> + len + saveFreeSpace);
> + }
>
> Are you using RecordAndGetPageWithFreeSpace() instead of
> GetPageWithFreeSpace() to get the page close to the previous target page?
> If yes, then do you see enough benefit of the same that it can compensate
> the additional write operation which Record* API might cause due to
> additional dirtying of buffer?

We are calling RecordAndGetPageWithFreeSpace instead of GetPageWithFreeSpace because another backend that got the lock might have added extra blocks to the FSM, and it is possible that the FSM tree has not been updated yet, so we would not find the page by searching from the top using GetPageWithFreeSpace. We therefore need to search the leaf page directly using our last valid target block. I explained the same in the comments.

> 11.
> + {
> + /*
> + * First try to get the lock in no-wait mode, if succeed extend one
> + * block, else get the lock in normal mode and after we get the lock
> - LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
> + buffer = RelationAddOneBlock(relation, otherBuffer, bistate);
>
> Won't this cause one extra block addition after backend extends the
> relation for multiple blocks, what is the need of same?

This is the block for this backend; the extra extension is for future requests and has already been added to the FSM. I could have counted this block in with the extra blocks in RelationAddExtraBlocks, but in that case I would need an extra "if" to save one buffer for this backend and then return that specific buffer to the caller, and the caller would also need to distinguish between the case where it wants to add one block and the case where it got one block added along with the extra blocks. I think the code is simpler this way: everybody that comes down here adds one block for its own use, and all the other functionality and logic is above, i.e. whether to take the lock or not and whether to add extra blocks or not.

> 12. I think it is good to once test pgbench read-write tests to ensure
> that this doesn't introduce any new regression.

I will test this and post the results.

Latest patch attached.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

multi_extend_v10.patch
Description: Binary data

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Relation extension scalability
On Thu, Mar 24, 2016 at 10:44 AM, Robert Haas wrote:
> On Wed, Mar 23, 2016 at 9:43 PM, Amit Kapila wrote:
> > RecordAndGetPageWithFreeSpace() tries to search from the oldPage passed to
> > it, rather than from top, so even if RecordPageWithFreeSpace() doesn't
> > update till root, it will be able to search the newly added page. I agree
> > with whatever you have said in another mail that we should introduce a new
> > API to do a more targeted search for such cases.
>
> OK, let's do that, then.

OK, I have added a new API which just finds a free block, starting the search from the last given page.

1. I have named the new function GetPageWithFreeSpaceUsingOldPage. I don't like this name, but I could not come up with anything better; please suggest one. Also, the body of GetPageWithFreeSpaceUsingOldPage looks almost identical to RecordAndGetPageWithFreeSpace; I tried to merge the two, but for that we would need to pass an extra parameter to the function.

2. I also had to write one more function, fsm_search_from_addr, instead of using fsm_set_and_search, so that we can find a block without updating the other slot.

I have done a performance test just to confirm the result, and performance is the same as before with both COPY and INSERT.

3. I have also run the pgbench read-write test that Amit suggested upthread. There is no regression or improvement with the pgbench workload.

Client    Base     Patch
1         899      914
8         5397     5413
32        18170    18052
64        29850    29941

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

multi_extend_v12.patch
Description: Binary data

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
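For context, the calling pattern in hio.c for such an API would presumably be along these lines. This is just a sketch of the intended usage, not the attached patch, and it assumes a (relation, oldPage, spaceNeeded) style signature for the new function:

    /*
     * Another backend may have pre-extended the relation.  Look for one of
     * the new pages starting from the FSM leaf page that covers our last
     * known target block, since the upper FSM levels may not have been
     * updated yet and a search from the root could miss them.
     */
    if (use_fsm && lastValidBlock != InvalidBlockNumber)
    {
        targetBlock = GetPageWithFreeSpaceUsingOldPage(relation,
                                                       lastValidBlock,
                                                       len + saveFreeSpace);
    }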
Re: [HACKERS] Relation extension scalability
On Thu, Mar 24, 2016 at 6:13 PM, Amit Kapila wrote: > > 1. > +GetPageWithFreeSpaceUsingOldPage(Relation rel, BlockNumber oldPage, > + Size spaceNeeded) > { > .. > + /* > + * If fsm_set_and_search found a suitable new block, return that. > + * Otherwise, search as usual. > + */ > .. > } > > In the above comment, you are referring wrong function. > Fixed > > 2. > +static int > +fsm_search_from_addr(Relation rel, FSMAddress addr, uint8 minValue) > +{ > + Buffer buf; > + int newslot = -1; > + > + buf = fsm_readbuf(rel, addr, true); > + LockBuffer(buf, BUFFER_LOCK_SHARE); > + > + if (minValue != 0) > + { > + /* Search while we still hold the lock */ > + newslot = fsm_search_avail(buf, minValue, > + addr.level == FSM_BOTTOM_LEVEL, > + false); > > In this new API, I don't understand why we need minValue != 0 check, > basically if user of API doesn't want to search for space > 0, then what is > the need of calling this API? I think this API should use Assert for > minValue!=0 unless you see reason for not doing so. > > Agree, it should be assert. > > > > GetNearestPageWithFreeSpace? (although not sure that's accurate > description, maybe Nearby would be better) > > > > Better than what is used in patch. > > Yet another possibility could be to call it as > GetPageWithFreeSpaceExtended and call it from GetPageWithFreeSpace with > value of oldPage as InvalidBlockNumber. > Yes I like this.. Changed the same. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com multi_extend_v13.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
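So with this change the old entry point just becomes a thin wrapper, something like:

BlockNumber
GetPageWithFreeSpace(Relation rel, Size spaceNeeded)
{
    /* no old page to start from: plain top-down search */
    return GetPageWithFreeSpaceExtended(rel, InvalidBlockNumber, spaceNeeded);
}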
Re: [HACKERS] Relation extension scalability
On Fri, Mar 25, 2016 at 3:04 AM, Robert Haas wrote: > 1. Callers who use GetPageWithFreeSpace() rather than > GetPageFreeSpaceExtended() will fail to find the new pages if the > upper map levels haven't been updated by VACUUM. > > 2. Even callers who use GetPageFreeSpaceExtended() may fail to find > the new pages. This can happen in two separate ways, namely (a) the > Yeah, that's the issue, if extended pages spills to next FSM page, then other waiters will not find those page, and one by one all waiters will end up adding extra pages. for example, if there are ~30 waiters then total blocks extended = (25(25+1)/2) *20 =~ 6500 pages. This is not the case every time but whenever heap block go to new FSM page this will happen. - FSM page case hold 4096 heap blk info, so after every 8th extend (assume 512 block will extend in one time), it will extend ~6500 pages - Any new request to RelationGetBufferForTuple will be able to find those page, because by that time the backend which is extending the page would have set new block using RelationSetTargetBlock. (there are still chances that some blocks can be completely left unused, until vacuum comes). I have changed the patch as per the suggestion (just POC because performance number are not that great) Below is the performance number comparison of base, previous patch(v13) and latest patch (v14). performance of patch v14 is significantly low compared to v13, mainly I guess below reasons 1. As per above calculation v13 extend ~6500 block (after every 8th extend), and that's why it's performing well. 2. In v13 as soon as we extend the block we add to FSM so immediately available for new requester, (In this patch also I tried to add one by one to FSM and updated fsm tree till root after all pages added to FSM, but no significant improvement). 3. fsm_update_recursive doesn't seems like problem to me. does it ? Copy 1 tuples, of 4 bytes each.. - Client base patch v13 patch v14 1 118 147126 2 217 276269 4 210 421347 8 166 630375 16 145 813415 32 124 985451 64 974455 Insert 1000 tuples of 1K size each. Clientbase patch v13 patch v14 1 117 124119 2 111 126119 4 51 128124 8 43 149131 16 40 217120 32 263115 64 248109 Note: I think one thread number can be just run to run variance.. Does anyone see problem in updating the FSM tree, I have debugged and saw that we are able to get the pages properly from tree and same is visible in performance number of v14 compared to base. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c index 8140418..9ac1eb7 100644 --- a/src/backend/access/heap/hio.c +++ b/src/backend/access/heap/hio.c @@ -169,6 +169,62 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2, } /* + * Extend a relation by multiple blocks to avoid future contention on the + * relation extension lock. Our goal is to pre-extend the relation by an + * amount which ramps up as the degree of contention ramps up, but limiting + * the result to some sane overall value. + */ +static void +RelationAddExtraBlocks(Relation relation, BulkInsertState bistate) +{ + Page page; + BlockNumber blockNum, +firstBlock, +lastBlock; + int extraBlocks = 0; + int lockWaiters = 0; + Buffer buffer; + + /* + * We use the length of the lock wait queue to judge how much to extend. + * It might seem like multiplying the number of lock waiters by as much + * as 20 is too aggressive, but benchmarking revealed that smaller numbers + * were insufficient. 
512 is just an arbitrary cap to prevent pathological + * results (and excessive wasted disk space). + */ + lockWaiters = RelationExtensionLockWaiterCount(relation); + extraBlocks = Min(512, lockWaiters * 20); + + firstBlock = lastBlock = InvalidBlockNumber; + + while (extraBlocks-- >= 0) + { + /* Ouch - an unnecessary lseek() each time through the loop! */ + buffer = ReadBufferBI(relation, P_NEW, bistate); + + /* Extend by one page. */ + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + page = BufferGetPage(buffer); + PageInit(page, BufferGetPageSize(buffer), 0); + MarkBufferDirty(buffer); + blockNum = BufferGetBlockNumber(buffer); + UnlockReleaseBuffer(buffer); + + if (firstBlock == InvalidBlockNumber) + firstBlock = blockNum; + + lastBlock = blockNum; + } + + /* + * Put the page in the freespace map so other backends can find it. + * This
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Fri, Mar 25, 2016 at 8:09 PM, Alexander Korotkov < a.korot...@postgrespro.ru> wrote: > Could anybody run benchmarks? Feature freeze is soon, but it would be > *very nice* to fit it into 9.6 release cycle, because it greatly improves > scalability on large machines. Without this patch PostgreSQL 9.6 will be > significantly behind competitors like MySQL 5.7. I have run the performance and here are the results.. With latest patch I did not see any regression at lower client count (median of 3 reading). scale factor 1000 shared buffer 8GB readonly *Client Base patch* 1 12957 13068 2 24931 25816 4 46311 48767 32 300921 310062 64 387623 493843 128 249635 583513 scale factor 300 shared buffer 8GB readonly *Client Base patch* 1 14537 14586--> one thread number looks little less, generally I get ~18000 (will recheck). 2 34703 33929--> may be run to run variance (once I get time, will recheck) 4 67744 69069 32 312575 336012 64 213312 539056 128 190139 380122 *Summary:* Actually with 64 client we have seen ~470,000 TPS with head also, by revering commit 6150a1b0. refer this thread: ( http://www.postgresql.org/message-id/caa4ek1+zeb8pmwwktf+3brs0pt4ux6rs6aom0uip8c6shjw...@mail.gmail.com ) I haven't tested this patch by reverting commit 6150a1b0, so not sure can this patch give even better performance ? It also points to the case, what Andres has mentioned in this thread. http://www.postgresql.org/message-id/20160226191158.3vidtk3ktcmhi...@alap3.anarazel.de Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Sat, Mar 26, 2016 at 8:07 AM, Robert Haas wrote: > I think we need to start testing these patches not only in terms of > how *fast* they are but how *large* the relation ends up being when > we're done. A patch that inserts the rows slower but the final > relation is smaller may be better overall. Can you retest v13, v14, > and master, and post not only the timings but the relation size > afterwards? And maybe post the exact script you are using? > I have tested the size and performance, scripts are attached in the mail. COPY 1-10 bytes tuple from 32 Clients Base V13 V14 - - TPS 123 874 446 No. Of Tuples 14827 104998 53637 Relpages 656089 4652593 2485482 INSERT 1028 bytes Tuples From 16 Clients Base V13 V14 - TPS 42 211 120 No. Of Tuples 5149000 25343000 14524000 Rel Pages 735577 3620765 2140612 As per above results If we calculate the tuple number of tuples and respective relpages, then neither in v13 nor v14 there are extra unused pages. As per my calculation for INSERT (1028 byte tuple) each page contain 7 tuples so number of pages required Base: 5149000/7 = 735571 (from relpages we can see 6 pages are extra) V13 : 25343000/7= 3620428 (from relpages we can see ~300 pages are extra). V14 : 14524000/7= 2074857 (from relpages we can see ~7 pages are extra). With V14 we have found max pages number of extra pages, I expected V13 to have max unused pages, but it's v14 and I tested it in multiple runs and v13 is always the winner. I tested with multiple client count also like 8, 32 and v13 always have only ~60-300 extra pages out of total ~2-4 Million Pages. Attached files: --- test_size_ins.sh --> automated script to run insert test and calculate tuple and relpages. test_size_copy --> automated script to run copy test and calculate tuple and relpages. copy_script -> copy pg_bench script used by test_size_copy.sh insert_script --> insert pg_bench script used by test_size_ins.sh > Maybe something like this would help: > > if (needLock) > { > if (!use_fsm) > LockRelationForExtension(relation, ExclusiveLock); > else if (!ConditionLockRelationForExtension(relation, > ExclusiveLock)) > { > BlockNumber last_blkno = > RelationGetNumberOfBlocks(relation); > > targetBlock = GetPageWithFreeSpaceExtended(relation, > last_blkno, len + saveFreeSpace); > if (targetBlock != InvalidBlockNumber) > goto loop; > > LockRelationForExtension(relation, ExclusiveLock); > targetBlock = GetPageWithFreeSpace(relation, len + > saveFreeSpace); > if (targetBlock != InvalidBlockNumber) > { > UnlockRelationForExtension(relation, ExclusiveLock); > goto loop; > } > RelationAddExtraBlocks(relation, bistate); > } > } > > I think this is better than what you had before with lastValidBlock, > because we're actually interested in searching the free space map at > the *very end* of the relation, not wherever the last target block > happens to have been. > > We could go further still and have GetPageWithFreeSpace() always > search the last, say, two pages of the FSM in all cases. But that > might be expensive. The extra call to RelationGetNumberOfBlocks seems > cheap enough here because the alternative is to wait for a contended > heavyweight lock. > I will try the test with this also and post the results. 
-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com copy_script Description: Binary data insert_script Description: Binary data test_size_copy.sh Description: Bourne shell script test_size_ins.sh Description: Bourne shell script -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Relation extension scalability
On Sat, Mar 26, 2016 at 3:18 PM, Dilip Kumar wrote: > search the last, say, two pages of the FSM in all cases. But that >> might be expensive. The extra call to RelationGetNumberOfBlocks seems >> cheap enough here because the alternative is to wait for a contended >> heavyweight lock. >> > > I will try the test with this also and post the results. > I have changed v14 as per this suggestion and results are same as v14. I have again measured the relation size, this time directly using size function so results are better understandable. Relation Size - INSERT : 16000 transaction from 32 Client Base v13 v14_1 - - TPS 37 255 112 Rel Size 17GB 17GB 18GB COPY: 32000 transaction from 32 client Base v13v14_1 --- TPS 121 823 427 Rel Size 11GB11GB 11GB Script are attached in the mail = test_size_ins.sh --> run insert test and calculate relation size. test_size_copy --> run copy test and relation size copy_script -> copy pg_bench script used by test_size_copy.sh insert_script --> insert pg_bench script used by test_size_ins.sh multi_extend_v14_poc_v1.patch --> modified patch of v14. I also tried modifying v14 from different different angle. One is like below--> - In AddExtraBlock { I add page to FSM one by one like v13 does. then update the full FSM tree up till root } Results: -- 1. With this performance is little less than v14 but the problem of extra relation size is solved. 2. With this we can conclude that extra size of relation in v14 is because some while extending the pages, its not immediately available and at end some of the pages are left unused. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com test_size_copy.sh Description: Bourne shell script test_size_ins.sh Description: Bourne shell script copy_script Description: Binary data insert_script Description: Binary data multi_extend_v14_poc_v1.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Relation extension scalability
On Sat, Mar 26, 2016 at 3:18 PM, Dilip Kumar wrote: > We could go further still and have GetPageWithFreeSpace() always >> search the last, say, two pages of the FSM in all cases. But that >> might be expensive. The extra call to RelationGetNumberOfBlocks seems >> cheap enough here because the alternative is to wait for a contended >> heavyweight lock. >> > > I will try the test with this also and post the results. > **Something went wrong in last mail, seems like become separate thread, so resending the same mail ** I have changed v14 as per this suggestion and results are same as v14. I have again measured the relation size, this time directly using size function so results are better understandable. Relation Size - INSERT : 16000 transaction from 32 Client Base v13 v14_1 - - TPS 37 255 112 Rel Size 17GB 17GB 18GB COPY: 32000 transaction from 32 client Base v13v14_1 --- TPS 121 823427 Rel Size 11GB11GB 11GB Script are attached in the mail = test_size_ins.sh --> run insert test and calculate relation size. test_size_copy --> run copy test and relation size copy_script -> copy pg_bench script used by test_size_copy.sh insert_script --> insert pg_bench script used by test_size_ins.sh multi_extend_v14_poc_v1.patch --> modified patch of v14. I also tried modifying v14 from different different angle. One is like below--> - In AddExtraBlock { I add page to FSM one by one like v13 does. then update the full FSM tree up till root } Results: -- 1. With this performance is little less than v14 but the problem of extra relation size is solved. 2. With this we can conclude that extra size of relation in v14 is because some while extending the pages, its not immediately available and at end some of the pages are left unused. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com copy_script Description: Binary data insert_script Description: Binary data multi_extend_v14_poc_v1.patch Description: Binary data test_size_copy.sh Description: Bourne shell script test_size_ins.sh Description: Bourne shell script -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Sun, Mar 27, 2016 at 5:37 PM, Andres Freund wrote: > On what hardware did you run these tests? IBM POWER8 machine:
Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                192
Thread(s) per core:    8
Core(s) per socket:    1
Socket(s):             24
NUMA node(s):          4
-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Mon, Mar 28, 2016 at 1:55 AM, Robert Haas wrote: > > One is like below--> > > - > > In AddExtraBlock > > { > >I add page to FSM one by one like v13 does. > >then update the full FSM tree up till root > > } > > Not following this. Did you attach this version? > No I did not attached this.. During rough experiment, tried this, did not produced any patch, I will send this. > > Results: > > -- > > 1. With this performance is little less than v14 but the problem of extra > > relation size is solved. > > 2. With this we can conclude that extra size of relation in v14 is > because > > some while extending the pages, its not immediately available and at end > > some of the pages are left unused. > > I agree with that conclusion. I'm not quite sure where that leaves > us, though. We can go back to v13, but why isn't that producing extra > pages? It seems like it should: whenever a bulk extend rolls over to > a new FSM page, concurrent backends will search either the old or the > new one but not both. > > Maybe we could do this - not sure if it's what you were suggesting above: > > 1. Add the pages one at a time, and do RecordPageWithFreeSpace after each > one. > 2. After inserting them all, go back and update the upper levels of > the FSM tree up the root. > Yes same, I wanted to explained the same above. > Another idea is: > > If ConditionalLockRelationForExtension fails to get the lock > immediately, search the last *two* pages of the FSM for a free page. > > Just brainstorming here. I think this is better option, Since we will search last two pages of FSM tree, then no need to update the upper level of the FSM tree. Right ? I will test and post the result with this option. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Mon, Mar 28, 2016 at 11:00 AM, Amit Kapila wrote: > I have not debugged the flow, but by looking at v13 code, it looks like it > will search both old and new. In > function > GetPageWithFreeSpaceExtended()->fsm_search_from_addr()->fsm_search_avail(), > the basic idea of search is: Start the search from the target slot. At > every step, move one > node to the right, then climb up to the parent. Stop when we reach a node > with enough free space (as we must, since the root has enough space). > So shouldn't it be able to find the new FSM page where the bulk extend > rolls over? > The FSM is actually a multi-level tree, and each FSM page contains one slot tree. So fsm_search_avail() searches only the slot tree inside a single FSM page, but here we want to move on to the next FSM page. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
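To illustrate: a heap block maps to one particular bottom-level FSM page, essentially what fsm_get_location() computes, so a search that stays inside one FSM page's slot tree can never reach blocks that spilled over to the next FSM page:

/* Each FSM page holds a small binary tree of slots; one bottom-level
 * FSM page covers SlotsPerFSMPage (roughly 4000) consecutive heap blocks. */
addr.level     = FSM_BOTTOM_LEVEL;
addr.logpageno = heapBlk / SlotsPerFSMPage;   /* which FSM page */
slot           = heapBlk % SlotsPerFSMPage;   /* which slot inside that page */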
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Sun, Mar 27, 2016 at 5:48 PM, Andres Freund wrote: > > What's sizeof(BufferDesc) after applying these patches? It should better > be <= 64... > It is 72. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Mon, Mar 28, 2016 at 7:21 AM, Dilip Kumar wrote: > I agree with that conclusion. I'm not quite sure where that leaves >> us, though. We can go back to v13, but why isn't that producing extra >> pages? It seems like it should: whenever a bulk extend rolls over to >> a new FSM page, concurrent backends will search either the old or the >> new one but not both. >> > Our open question was why V13 is not producing extra pages, I tried to print some logs and debug. It seems to me that, when blocks are spilling to next FSM pages, that time all backends who are waiting on lock will not get the block because searching in old FSM page. But the backend which is extending the pages will set RelationSetTargetBlock to latest blocks, and that will make new FSM page available for search by new requesters. 1. So this is why v13 (in normal cases*1) not producing unused pages. 2. But it will produce extra pages (which will be consumed by new requesters), because all waiter will come one by one and extend 512 pages. *1 : Above I have mentioned normal case, I mean there is some case exist where V13 can leave unused page, Like one by one each waiter Get the lock and extend the page, but no one go down till RelationSetTargetBlock so till now new pages are not available by new requester, and time will come when blocks will spill to third FSM page, now one by one all backends go down and set RelationSetTargetBlock, and suppose last one set it to the block which is in 3rd FSM page, in this case, pages in second FSM pages are unused. Maybe we could do this - not sure if it's what you were suggesting above: >> >> 1. Add the pages one at a time, and do RecordPageWithFreeSpace after each >> one. >> 2. After inserting them all, go back and update the upper levels of >> the FSM tree up the root. >> > I think this is better option, Since we will search last two pages of FSM > tree, then no need to update the upper level of the FSM tree. Right ? > > I will test and post the result with this option. > I have created this patch and results are as below. * All test scripts are same attached upthread 1. Relation Size : No change in size, its same as base and v13 2. INSERT 1028 Byte 1000 tuple performance --- Client base v13 v15 1 117 124 122 2 111 126 123 4 51 128 125 8 43 149 135 16 40 217 147 32 35 263 141 3. COPY 1 Tuple performance. -- Client base v13 v15 1 118 147 155 2 217 276 273 4 210 421 457 8 166 630 643 16 145 813 595 32 124 985 598 Conclusion: --- 1. I think v15 is solving the problem exist with v13 and performance is significantly high compared to base, and relation size is also stable, So IMHO V15 is winner over other solution, what other thinks ? 2. And no point in thinking that V13 is better than V15 because, V13 has bug of sometime extending more than expected pages and that is uncontrolled and same can be the reason also of v13 performing better. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com multi_extend_v15.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Relation extension scalability
On Mon, Mar 28, 2016 at 3:02 PM, Dilip Kumar wrote: > 1. Relation Size : No change in size, its same as base and v13 > > 2. INSERT 1028 Byte 1000 tuple performance > --- > Client base v13 v15 > 1 117 124 122 > 2 111 126 123 > 4 51 128 125 > 8 43 149 135 > 16 40 217 147 > 32 35 263 141 > > 3. COPY 1 Tuple performance. > -- > Client base v13 v15 > 1 118 147 155 > 2 217 276 273 > 4 210 421 457 > 8 166 630 643 > 16 145 813 595 > 32 124 985 598 > > Conclusion: > --- > 1. I think v15 is solving the problem exist with v13 and performance is > significantly high compared to base, and relation size is also stable, So > IMHO V15 is winner over other solution, what other thinks ? > > 2. And no point in thinking that V13 is better than V15 because, V13 has > bug of sometime extending more than expected pages and that is uncontrolled > and same can be the reason also of v13 performing better. > Found one problem with V15, so sending the new version. In V15 I am taking prev_blkno as targetBlock instead it should be the last block of the relation at that time. Attaching new patch. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com multi_extend_v16.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Relation extension scalability
On Tue, Mar 29, 2016 at 7:26 AM, Robert Haas wrote: > Well, it's less important in that case, but I think it's still worth > doing. Some people are going to do just plain GetPageWithFreeSpace(). > I am attaching new version v17. Its like this... In *RelationAddExtraBlocks* { -- Add Block one by one to FSM. -- Update FSM tree all the way to root } In *RelationGetBufferForTuple* --- Same as v14, search the FSM tree from root. GetPageWithFreeSpace *Summary:* *--* 1. By Adding block to FSM tree one by one it solves the problem of unused blocks in V14. 2. It Update the FSM tree all they up to root, so anybody search from root can get the block. 3. It also search the block from root, so it don't have problem like v15 has(Exactly identifying which two FSM page to search). 4. This solves the performance problem of V14 by some optimizations in logic of updating FSM tree till root. *Performance Data*: -- Client base v17 1 117 120 2 111 123 4 51 124 8 43 135 16 40 145 32 35 144 64 -- 140 Client base v17 --- --- -- 1 118 117 2 217 220 4 210 379 8 166 574 16 145 594 32 124 599 64 609 Notes: - If I do some change in this patch in strategy of searching the block, performance remains almost the same. 1. Like search in two block like v15 or v17 does. 2. Search first using target block and if don't get then search from top. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com multi_extend_v17.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
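In code form, RelationAddExtraBlocks in v17 does roughly the following (simplified sketch, not the exact patch):

extraBlocks = Min(512, RelationExtensionLockWaiterCount(relation) * 20);

while (extraBlocks-- >= 0)
{
    buffer = ReadBufferBI(relation, P_NEW, bistate);

    LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
    page = BufferGetPage(buffer);
    PageInit(page, BufferGetPageSize(buffer), 0);
    freespace = PageGetHeapFreeSpace(page);
    MarkBufferDirty(buffer);
    blockNum = BufferGetBlockNumber(buffer);
    UnlockReleaseBuffer(buffer);

    if (firstBlock == InvalidBlockNumber)
        firstBlock = blockNum;

    /* record each new block in its FSM leaf page as we go */
    RecordPageWithFreeSpace(relation, blockNum, freespace);
}

/*
 * Then push the new values up the FSM tree all the way to the root
 * (UpdateFreeSpaceMap), so that a normal top-down GetPageWithFreeSpace()
 * search can find the blocks immediately.
 */
UpdateFreeSpaceMap(relation, firstBlock, blockNum, freespace);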
Re: [HACKERS] Relation extension scalability
On Tue, Mar 29, 2016 at 2:09 PM, Dilip Kumar wrote: > > Attaching new version v18 - Some cleanup work on v17. - Improved *UpdateFreeSpaceMap *function. - Performance and space utilization are same as V17 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com multi_extend_v18.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Tue, Mar 29, 2016 at 10:43 PM, Andres Freund wrote: > My gut feeling is that we should do both 1) and 2). > > Dilip, could you test performance of reducing ppc's spinlock to 1 byte? > Cross-compiling suggest that doing so "just works". I.e. replace the > #if defined(__ppc__) typedef from an int to a char. > I set that, but after that it hangs; even initdb hangs..

int
s_lock(volatile slock_t *lock, const char *file, int line)
{
    SpinDelayStatus delayStatus;

    init_spin_delay(&delayStatus, (Pointer) lock, file, line);

    while (TAS_SPIN(lock))
    {
        make_spin_delay(&delayStatus);
    }

I did not try to find the reason; I just built in debug mode and found that it never comes out of this loop. I did a clean build multiple times but the problem persists. Does it have a dependency on some other variable defined under PPC in some other place? I don't know..

/* PowerPC */
#if defined(__ppc__) || defined(__powerpc__) || defined(__ppc64__) || defined(__powerpc64__)
#define HAS_TEST_AND_SET

typedef unsigned int slock_t;   --> changed like this

#define TAS(lock) tas(lock)

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Wed, Mar 30, 2016 at 7:19 AM, Robert Haas wrote: Thanks for review and better comments.. > hio.c: In function ‘RelationGetBufferForTuple’: > hio.c:231:20: error: ‘freespace’ may be used uninitialized in this > function [-Werror=uninitialized] > hio.c:185:7: note: ‘freespace’ was declared here > hio.c:231:20: error: ‘blockNum’ may be used uninitialized in this > function [-Werror=uninitialized] > hio.c:181:14: note: ‘blockNum’ was declared here > I have fixed those in v20 > > There's nothing whatsoever to prevent RelationExtensionLockWaiterCount > from returning 0. > > It's also rather ugly that the call to UpdateFreeSpaceMap() assumes > that the last value returned by PageGetHeapFreeSpace() is as good as > any other, but maybe we can just install a comment explaining that > point; there's not an obviously better approach that I can see. > Added comments.. + if (lockWaiters) + /* + * Here we are using same freespace for all the Blocks, but that + * is Ok, because all are newly added blocks and have same freespace + * And even some block which we just added to FreespaceMap above, is + * used by some backend and now freespace is not same, will not harm + * anything, because actual freespace will be calculated by user + * after getting the page. + */ + UpdateFreeSpaceMap(relation, firstBlock, blockNum, freespace); Does this look good ? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com multi_extend_v20.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Relation extension scalability
On Wed, Mar 30, 2016 at 7:51 AM, Dilip Kumar wrote: > + if (lockWaiters) > + /* > + * Here we are using same freespace for all the Blocks, but that > + * is Ok, because all are newly added blocks and have same freespace > + * And even some block which we just added to FreespaceMap above, is > + * used by some backend and now freespace is not same, will not harm > + * anything, because actual freespace will be calculated by user > + * after getting the page. > + */ > + UpdateFreeSpaceMap(relation, firstBlock, blockNum, freespace); > > > Does this look good ? Done in better way.. + lockWaiters = RelationExtensionLockWaiterCount(relation); + + if (lockWaiters == 0) + return; + -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com multi_extend_v21.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Relation extension scalability
On Tue, Mar 29, 2016 at 10:08 AM, Amit Kapila wrote: > Yes, that makes sense. One more point is that if the reason for v13 > giving better performance is extra blocks (which we believe in certain > cases can leak till the time Vacuum updates the FSM tree), do you think it > makes sense to once test by increasing lockWaiters * 20 limit to may > be lockWaiters * 25 or lockWaiters * 30. I tested COPY 1 record, by increasing the number of blocks just to find out why we are not as good as V13 with extraBlocks = Min( lockWaiters * 40, 2048) and got below results.. COPY 1 Client Patch(extraBlocks = Min( lockWaiters * 40, 2048)) - 16 752 32 708 This proves that main reason of v13 being better is its adding extra blocks without control. though v13 is better than these results, I think we can get that also by changing multiplier and max limit . But I think we are ok with the max size as 4MB (512 blocks) right?. Does this test make sense ? Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Thu, Dec 24, 2015 at 8:26 AM, Michael Paquier wrote: > On Sun, Dec 13, 2015 at 11:05 PM, Amit Kapila > wrote: > > On Fri, Dec 11, 2015 at 6:34 PM, Andres Freund > wrote: > I was looking into this patch, overall patch looks good to me, I want to know what is the current state of this patch, is there some pending task in this patch ? Patch was not applying on head so i have re based it and re based version is attached in the mail. I have done some performance testing also.. Summary: --- 1. In my test for readonly workload i have observed some improvement for scale factor 1000. 2. I have also observed some regression with scale factor 300 (I can't say it's actual regression or just run to run variance), I thought that may be problem with lower scale factor so tested with scale factor 100 but with s.f. 100 looks fine. Machine Detail: cpu : POWER8 cores: 24 (192 with HT) Non Default Parameter: Shared Buffer= 30GB max_wal_size= 10GB max_connections=500 Test1: pgbench -i -s 1000 postgres pgbench -c$ -j$ -Mprepared -S postgres Client Base Pached 1 19753 19493 32 344059 336773 64 495708 540425 128 564358 685212 256 466562 639059 Test2: pgbench -i -s 300 postgres pgbench -c$ -j$ -Mprepared -S postgres Client Base Pached 1 2055519404 32 375919 332670 64 509067 440680 128431346 415121 256380926 379176 Test3: pgbench -i -s 100 postgres pgbench -c$ -j$ -Mprepared -S postgres Client Base Pached 1 2055519404 32 375919 332670 64 509067 440680 128431346 415121 256380926 379176 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c index b423aa7..04862d7 100644 --- a/src/backend/storage/buffer/buf_init.c +++ b/src/backend/storage/buffer/buf_init.c @@ -121,12 +121,9 @@ InitBufferPool(void) BufferDesc *buf = GetBufferDescriptor(i); CLEAR_BUFFERTAG(buf->tag); - buf->flags = 0; - buf->usage_count = 0; - buf->refcount = 0; - buf->wait_backend_pid = 0; - SpinLockInit(&buf->buf_hdr_lock); + pg_atomic_init_u32(&buf->state, 0); + buf->wait_backend_pid = 0; buf->buf_id = i; diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 7141eb8..f69faeb 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -51,7 +51,6 @@ #include "utils/resowner_private.h" #include "utils/timestamp.h" - /* Note: these two macros only work on shared buffers, not local ones! */ #define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ)) #define BufferGetLSN(bufHdr) (PageGetLSN(BufHdrGetBlock(bufHdr))) @@ -126,7 +125,7 @@ static BufferDesc *PinCountWaitBuf = NULL; * entry using ReservePrivateRefCountEntry() and then later, if necessary, * fill it with NewPrivateRefCountEntry(). That split lets us avoid doing * memory allocations in NewPrivateRefCountEntry() which can be important - * because in some scenarios it's called with a spinlock held... + * because in some scenarios it's called with a header lock held... 
*/ static struct PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES]; static HTAB *PrivateRefCountHash = NULL; @@ -775,9 +774,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, */ if (isLocalBuf) { - /* Only need to adjust flags */ - Assert(bufHdr->flags & BM_VALID); - bufHdr->flags &= ~BM_VALID; + Assert(pg_atomic_read_u32(&bufHdr->state) & BM_VALID); + pg_atomic_fetch_and_u32(&bufHdr->state, ~BM_VALID); } else { @@ -789,8 +787,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, do { LockBufHdr(bufHdr); -Assert(bufHdr->flags & BM_VALID); -bufHdr->flags &= ~BM_VALID; +Assert(pg_atomic_read_u32(&bufHdr->state) & BM_VALID); +pg_atomic_fetch_and_u32(&bufHdr->state, ~BM_VALID); UnlockBufHdr(bufHdr); } while (!StartBufferIO(bufHdr, true)); } @@ -808,7 +806,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, * it's not been recycled) but come right back here to try smgrextend *
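For anyone skimming the diff, the essence of the change: the refcount, usage count and flag bits that used to be separate BufferDesc fields protected by the buffer-header spinlock are packed into the single pg_atomic_uint32 state field, so that hot paths like PinBuffer can use a compare-and-exchange loop instead of taking the spinlock. Very roughly (the BUF_* macro names for the packed fields are the ones the patch series introduces; treat the details as illustrative):

uint32      old_state = pg_atomic_read_u32(&buf->state);

for (;;)
{
    uint32      new_state = old_state;

    /* bump the shared reference count inside the packed word */
    new_state += BUF_REFCOUNT_ONE;

    /* and the usage count, saturating at BM_MAX_USAGE_COUNT */
    if (BUF_STATE_GET_USAGECOUNT(new_state) < BM_MAX_USAGE_COUNT)
        new_state += BUF_USAGECOUNT_ONE;

    /* on success we hold the pin; on failure old_state is refreshed, retry */
    if (pg_atomic_compare_exchange_u32(&buf->state, &old_state, new_state))
        break;
}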
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Tue, Jan 19, 2016 at 5:44 PM, Michael Paquier wrote: > > Test3: > > pgbench -i -s 100 postgres > > pgbench -c$ -j$ -Mprepared -S postgres > > > > Client Base Pached > > > > 1 2055519404 > > 32 375919 332670 > > 64 509067 440680 > > 128431346 415121 > > 256380926 379176 > > It seems like you did a copy-paste of the results with s=100 and > s=300. Both are showing the exact same numbers. Oops, my mistake, re-pasting the correct results for s=100 pgbench -i -s 100 postgres pgbench -c$ -j$ -Mprepared -S postgres Client Base Pached 12054820791 32 372633 355356 64 532052 552148 128412755 478826 256 346701 372057 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex
On Tue, Jan 12, 2016 at 1:50 PM, Aleksander Alekseev < a.aleks...@postgrespro.ru> wrote: > > increasing number of lock partitions (see columns "no locks", "lwlock" > and "spinlock array"). Previously it couldn't help us (see "master" > column) because of a bottleneck. > > If you have any other questions regarding this patch please don't > hesitate to ask. > I have done some performance bench marking for this patch.(dynahash-optimization-v10-step1.patch) Machine Detail: cpu : POWER8 cores: 24 (192 with HT) Non Default Parameter: Shared Buffer= 30GB max_wal_size= 10GB max_connections=500 Test1: *schema.sql* and *pgbench.sql* are same files which Aleksander has attached in first mail of the thread. psql -d postgres -f schema.sql pgbench -c$ -j$ -f pgbench.sql postgres clientbasepatch 1145145 2262258 4453472 8749732 1611141128 3217271789 6427292793 12835343520 With this test i did not see any improvement, though i think it should improve the performance, do you have any suggestion to see the results same as yours ? I have also dump stacks using some script and I have observed many stacks dumps as you mentioned in initial thread. And after patch, I found that number of lock waiting stacks are reduced. Stack Dump: --- #0 0x7f25842899a7 in semop () from /lib64/libc.so.6 #1 0x007116d0 in PGSemaphoreLock (sema=0x7f257cb170d8) at pg_sema.c:387 #2 0x0078955f in LWLockAcquire (lock=0x7f247698a980, mode=LW_EXCLUSIVE) at lwlock.c:1028 #3 0x007804c4 in LockAcquireExtended #4 0x0077fe91 in LockAcquire #5 0x0077e862 in LockRelationOid #6 0x0053bc7b in find_inheritance_children #7 0x0053bd4f in find_all_inheritors #8 0x006fc0a2 in expand_inherited_rtentry #9 0x006fbf91 in expand_inherited_tables I have tried to analyze using perf also, I can see that amount of time taken in hash_search_with_hash_value is reduced from 3.86%(master) to 1.72%(patch). I have plan to do further investigation, in different scenarios of dynahash. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
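For anyone else following along, my understanding of what dynahash-optimization-v10-step1.patch changes (names below are how I read the patch, simplified): rather than a single free list protected by HASHHDR.mutex, the hash header keeps an array of free lists, each with its own spinlock, and allocations are spread across them by hash code:

typedef struct
{
    slock_t     mutex;          /* spinlock protecting this free list */
    long        nentries;       /* number of entries allocated from it */
    HASHELEMENT *freeList;      /* chain of free elements */
} FreeListData;

struct HASHHDR
{
    FreeListData freeList[NUM_FREELISTS];
    /* ... rest of the header unchanged ... */
};

/* only partitioned (shared) hash tables spread load across the lists */
#define FREELIST_IDX(hctl, hashcode) \
    (IS_PARTITIONED(hctl) ? (hashcode) % NUM_FREELISTS : 0)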
Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex
On Fri, Jan 22, 2016 at 3:44 PM, Aleksander Alekseev < a.aleks...@postgrespro.ru> wrote: > This patch affects header files. By any chance didn't you forget to run > `make clean` after applying it? As we discussed above, when you > change .h files autotools doesn't rebuild dependent .c files: > Yes, actually I always compile using "make clean; make -j20; make install". If you want, I will run it again, maybe today or tomorrow, and post the results. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Sat, Jan 23, 2016 at 12:19 PM, Amit Kapila wrote > > > Few comments about patch: > Thanks for reviewing.. > 1. > Patch is not getting compiled. > > 1>src/backend/access/heap/hio.c(480): error C2065: 'buf' : undeclared > identifier > 1>src/backend/access/heap/hio.c(480): error C2065: 'buf' : undeclared > identifier > 1>src/backend/access/heap/hio.c(480): error C2065: 'buf' : undeclared > identifier > Oh, My mistake, my preprocessor is ignoring this error and relacing it with BLKSIZE I will fix in next version of patch. > 2. > ! page = BufferGetPage(buffer); > ! PageInit(page, BufferGetPageSize > (buf), 0); > ! > ! freespace = PageGetHeapFreeSpace(page); > ! > MarkBufferDirty(buffer); > ! UnlockReleaseBuffer(buffer); > ! > RecordPageWithFreeSpace(relation, BufferGetBlockNumber(buffer), freespace); > > What is the need to mark page dirty here, won't it automatically > be markerd dirty once the page is used? I think it is required > if you wish to WAL-log this action. > These pages are not going to be used immediately and we have done PageInit so i think it should be marked dirty before adding to FSM, so that if buffer get replaced out then it flushes the init data. > 3. I think you don't need to multi-extend a relation if > HEAP_INSERT_SKIP_FSM is used, as for that case it anyways try to > get a new page by extending a relation. > Yes, if HEAP_INSERT_SKIP_FSM is used and we use multi-extend atleast in current transaction it will not take pages from FSM and everytime it will do multi-extend, however pages will be used if there are parallel backend, but still not a good idea to extend every time in multiple chunk in current backend. So i will change this.. 4. Again why do you need this multi-extend optimization for local > relations (those only accessible to current backend)? > I think we can change this while adding the table level "extend_by_blocks" for local table we will not allow this property, so no need to change at this place. What do you think ? 5. Do we need this for nbtree as well? One way to check that > is by Copying large data in table having index. > > Ok, i will try this test and update. > Note: Test with both data and WAL on Magnetic Disk : No significant >> improvement visible >> -- I think wall write is becoming bottleneck in this case. >> >> > In that case, can you try the same test with un-logged tables? > OK > > Also, it is good to check the performance of patch with read-write work > load to ensure that extending relation in multiple-chunks should not > regress such cases. > Ok > > Currently i have kept extend_num_page as session level parameter but i >> think later we can make this as table property. >> Any suggestion on this ? >> >> > I think we should have a new storage_parameter at table level > extend_by_blocks or something like that instead of GUC. The > default value of this parameter should be 1 which means retain > current behaviour by default. > +1 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Sat, Jan 23, 2016 at 4:28 PM, Amit Kapila wrote: > I found one more problem with patch. > > ! UnlockReleaseBuffer(buffer); > ! RecordPageWithFreeSpace(relation, BufferGetBlockNumber(buffer), > freespace); > > You can't call BufferGetBlockNumber(buffer) after releasing > the pin on buffer which will be released by > UnlockReleaseBuffer(). Get the block number before unlocking > the buffer. > Good catch, will fix this also in next version. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex
On Wed, Jan 27, 2016 at 5:15 PM, Aleksander Alekseev < a.aleks...@postgrespro.ru> wrote: > Most likely HASHHDR.mutex is not only bottleneck in your case so my > patch doesn't help much. Unfortunately I don't have access to any > POWER8 server so I can't investigate this issue. I suggest to use a > gettimeofday trick I described in a first message of this thread. Its > time consuming but it gives a clear understanding which code is keeping > a lock. > I have also tested the pgbench read-only case when the data doesn't fit into shared buffers, because in this case HASHHDR.mutex access will be quite frequent. And in this case I do see a very good improvement on the POWER8 server. Test Result: Scale Factor: 300 Shared Buffer: 512MB pgbench -c$ -j$ -S -M Prepared postgres
Client    base      patch
64        222173    318318
128       195805    262290
-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Mon, Jan 25, 2016 at 11:59 AM, Dilip Kumar wrote: 1. >> Patch is not getting compiled. >> >> 1>src/backend/access/heap/hio.c(480): error C2065: 'buf' : undeclared >> identifier >> > Oh, My mistake, my preprocessor is ignoring this error and relacing it > with BLKSIZE > FIXED > 3. I think you don't need to multi-extend a relation if >> HEAP_INSERT_SKIP_FSM is used, as for that case it anyways try to >> get a new page by extending a relation. >> > > So i will change this.. > FIXED > > 4. Again why do you need this multi-extend optimization for local >> relations (those only accessible to current backend)? >> > > I think we can change this while adding the table level > "extend_by_blocks" for local table we will not allow this property, so no > need to change at this place. > What do you think ? > Now I have added table level parameter for specifying the number of blocks, So do you think that we still need to block it, as user can control it, Moreover i think if user want to use for local table then no harm in it at least by extending in one shot he avoid multiple call of Extension lock, though there will be no contention. What is your opinion ? 5. Do we need this for nbtree as well? One way to check that >> is by Copying large data in table having index. >> >> Ok, i will try this test and update. > I tried to load data to table with Index and tried to analyze bottleneck using perf, and found btcompare was taking maximum time, still i don't deny that it can not get benefited by multi extend. So i tried quick POC for this, but i realize that even though we extend multiple page for index and add to FSM, it will be updated only in current page, Information about new free page will be propagated to root page only during vacuum, And unlike heap Btree always search FSM from root and it will not find the extra added pages. So i think we can analyze this topic separately for index multi extend and find is there are cases where index multi extend can give benefit. Note: Test with both data and WAL on Magnetic Disk : No significant >>> improvement visible >>> -- I think wall write is becoming bottleneck in this case. >>> >>> >> In that case, can you try the same test with un-logged tables? >> > Date with un-logged table Test Init: ./psql -d postgres -c "COPY (select g.i::text FROM generate_series(1, 1) g(i)) TO '/tmp/copybinary' WITH BINARY"; echo COPY data from '/tmp/copybinary' WITH BINARY; > copy_script ./psql -d postgres -c "create unlogged table data (data text)" --> base ./psql -d postgres -c "create unlogged table data (data text) with(extend_by_blocks=50)" --patch test_script: ./psql -d postgres -c "truncate table data" ./psql -d postgres -c "checkpoint" ./pgbench -f copy_script -T 120 -c$ -j$ postgres Shared Buffer48GB Table:Unlogged Table ench -c$ -j$ -f -M Prepared postgres ClientsBasepatch 1178 180 2337 338 4265 601 8167 805 Also, it is good to check the performance of patch with read-write work >> load to ensure that extending relation in multiple-chunks should not >> regress such cases. >> > > Ok > I did not find in regression in normal case. Note: I tested it with previous patch extend_num_pages=10 (guc parameter) so that we can see any impact on overall system. > Currently i have kept extend_num_page as session level parameter but i >>> think later we can make this as table property. >>> Any suggestion on this ? >>> >>> >> I think we should have a new storage_parameter at table level >> extend_by_blocks or something like that instead of GUC. 
The >> default value of this parameter should be 1 which means retain >> current behaviour by default. >> > +1 > Changed it to table level storage parameter. I kept max_limit to 100 any suggestion on this ? should we increase it ? latest patch is attached.. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c index 86b9ae1..76f9a21 100644 --- a/src/backend/access/common/reloptions.c +++ b/src/backend/access/common/reloptions.c @@ -268,6 +268,16 @@ static relopt_int intRelOpts[] = #endif }, + { + { + "extend_by_blocks", + "Number of blocks to be added to relation in every extend call", + RELOPT_KIND_HEAP, + AccessExclusiveLock + }, + 1, 1, 100 + }, + /* list terminator */ {{NULL}} }; @@ -1291,7 +1301,9 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind) {"autovacuum
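Once the reloption is in place, the extension code can pick it up from the relation in the usual way; I would expect something along these lines (the field and accessor names here are illustrative, not necessarily what the patch uses):

typedef struct StdRdOptions
{
    /* ... existing fields ... */
    int         extend_by_blocks;   /* blocks to add per extension */
} StdRdOptions;

/* fall back to the default of 1 block when no options are set */
#define RelationGetExtendByBlocks(relation) \
    ((relation)->rd_options ? \
     ((StdRdOptions *) (relation)->rd_options)->extend_by_blocks : 1)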
Re: [HACKERS] Relation extension scalability
On Thu, Jan 28, 2016 at 4:53 PM, Dilip Kumar wrote: > I did not find in regression in normal case. > Note: I tested it with previous patch extend_num_pages=10 (guc parameter) > so that we can see any impact on overall system. > I just forgot to mention that I have run the pgbench read-write case. S.F: 300 ./pgbench -j $ -c $ -T 1800 -M Prepared postgres Tested with 1, 8, 16, 32 and 64 threads and did not see any regression with the patch. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Sat, Jan 30, 2016 at 3:05 AM, Merlin Moncure wrote: > Probably want to run for at least 5 minutes via -T 300 Last time I ran for 5 minutes and took the median of three runs; I just missed mentioning "-T 300" in the mail.. Looking at the results with scale factors 1000 and 100, I don't see any reason why it would regress with scale factor 300. Still, I will run the test again with scale factor 300, and this time I am planning to run 2 cases. 1. when data fits in shared buffer 2. when data doesn't fit in shared buffer. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Sun, Jan 31, 2016 at 11:44 AM, Dilip Kumar wrote: > By looking at the results with scale factor 1000 and 100 i don't see any > reason why it will regress with scale factor 300. > > So I will run the test again with scale factor 300 and this time i am > planning to run 2 cases. > 1. when data fits in shared buffer > 2. when data doesn't fit in shared buffer. > I have run the test again with 300 S.F and found no regression, in fact there is improvement with the patch like we saw with 1000 scale factor. Shared Buffer= 8GB max_connections=150 Scale Factor=300 ./pgbench -j$ -c$ -T300 -M prepared -S postgres ClientBasePatch 11974419382 8125923126395 3231393151 64387339496830 128306412350610 Shared Buffer= 512MB max_connections=150 Scale Factor=300 ./pgbench -j$ -c$ -T300 -M prepared -S postgres ClientBasePatch 11716916454 8108547105559 32241619262818 64206868233606 128 137084 217013 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Fri, Feb 5, 2016 at 4:50 PM, Amit Kapila wrote: > > Could you also measure how this behaves for an INSERT instead of a COPY > > workload? > > I think such a test will be useful. > I have measured the performance with insert to see the behavior when it don't use strategy. I have tested multiple option, small tuple, big tuple, data fits in shared buffer and doesn't fit in shared buffer. Observation: -- Apart from this test I have also used some tool which can take a many stack traces with some delay. 1. I have observed with base code (when data don't fits in shared buffer) almost all the stack traces are waiting on "LockRelationForExtension" and many on "FlushBuffer" also (Flushing the dirty buffer). Total Stack Captured: 204, FlushBuffer: 13, LockRelationForExtension: 187 (This test run with 8 thread (shared buf 512MB) and after every 5 second stack is captured.) 2. If I change shared buf 48GB then Obviously FlushBuffer disappeared but still LockRelationForExtension remains in very high number. 3.Performance of base code in both the cases when Data fits in shared buffers or doesn't fits in shared buffer remain very low and non-scaling(we can see that in below results). Test--1 (big record insert and Data fits in shared Buffer) setup ./psql -d postgres -c "create table test_data(a int, b text)" ./psql -d postgres -c "insert into test_data values(generate_series(1,1000),repeat('x', 1024))" ./psql -d postgres -c "create table data (a int) with(extend_by_blocks=$variable)" {create table data (a int) for base code} echo "insert into data select * from test_data;" >> copy_script test: - shared_buffers=48GB max_wal_size=20GB checkpoint_timeout=10min ./pgbench -c $ -j $ -f copy_script -T 120 postgres clientbaseextend_by_block=50extend_by_block=1000 1113115118 450 220216 843 202302 Test--2 (big record insert and Data doesn't fits in shared Buffer) -- setup: --- ./psql -d postgres -c "create table test_data(a int, b text)" ./psql -d postgres -c "insert into test_data values(generate_series(1,1000),repeat('x', 1024))" ./psql -d postgres -c "create table data (a int) with(extend_by_blocks=1000)" echo "insert into data select * from test_data;" >> copy_script test: -- shared_buffers=512MB max_wal_size=20GB checkpoint_timeout=10min ./pgbench -c $ -j $ -f copy_script -T 120 postgres clientbaseextend_by_block=1000 1 125125 4 49 236 8 41 294 1639 279 Test--3 (small record insert and Data fits in shared Buffer) -- setup: ./psql -d postgres -c "create table test_data(a int)" ./psql -d postgres -c "insert into test_data values(generate_series(1,1))" ./psql -d postgres -c "create table data (a int) with(extend_by_blocks=20)" echo "insert into data select * from test_data;" >> copy_script test: - shared_buffers=48GB -c max_wal_size=20GB -c checkpoint_timeout=10min ./pgbench -c $ -j $ -f copy_script -T 120 postgres clientbasePatch-extend_by_block=20 1 137 143 2 269 250 4 377 443 8 170 690 16145 745 *All test done with Data on MD and Wal on SSD Note: Last patch have max limit of extend_by_block=100 so for taking performance with extend_by_block=1000 i localy changed it. I will send the modified patch once we finalize on which approach to proceed with. > > I'm doubtful that anything that does the victim buffer search while > > holding the extension lock will actually scale in a wide range of > > scenarios. 
> > > > I think the problem for victim buffer could be visible if the blocks > are dirty and it needs to write the dirty buffer and especially as > the patch is written where after acquiring the extension lock, it again > tries to extend the relation without checking if it can get a page with > space from FSM. It seems to me that we should re-check the > availability of page because while one backend is waiting on extension > lock, other backend might have added pages. To re-check the > availability we might want to use something similar to > LWLockAcquireOrWait() semantics as used during WAL writing. > I will work on this in next version... -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
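In code terms the recheck would look roughly like this in RelationGetBufferForTuple (just the idea, not a finished patch):

LockRelationForExtension(relation, ExclusiveLock);

/*
 * While we were waiting for the lock, other backends may already have
 * extended the relation and advertised the new pages in the FSM, so
 * look there again before extending ourselves.
 */
targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
if (targetBlock != InvalidBlockNumber)
{
    UnlockRelationForExtension(relation, ExclusiveLock);
    goto loop;                  /* go back and try to use that page */
}

/* otherwise, extend the relation as before */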
Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c
On Wed, Aug 17, 2016 at 6:10 PM, Robert Haas wrote: >> If you are making changes in plpgsql_validator(), then shouldn't we >> make changes in plperl_validator() or plpython_validator()? I see >> that you have made changes to function CheckFunctionValidatorAccess() >> which seems to be getting called from *_validator() (* indicates >> plpgsql/plperl/plpython) functions. Is there a reason for changing >> the *_validator() function? Yes you are right, making changes in CheckFunctionValidatorAccess is sufficient, no need to make changes in *_validator (* indicates plpgsql/plperl/plpython) functions. I have changed this in attached patch.. > > Yeah, when I glanced briefly at this patch, I found myself wondering > whether all of these call sites were actually reachable (and why the > patch contained no test cases to prove it). I have also added test cases to cover all my changes. Basically this patch changes error at three places. 1. getBaseTypeAndTypmod: This is being called from domain_in exposed function (domain_in-> domain_state_setup-> getBaseTypeAndTypmod). Though this function is being called from many other places which are not exposed function, but I don't this will have any imapct, will this ? 2. lookup_type_cache: This is being called from record_in (record_in->lookup_rowtype_tupdesc-> lookup_rowtype_tupdesc_internal->lookup_type_cache). 3. CheckFunctionValidatorAccess: This is being called from all language validator functions. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com cache_lookup_failure_v4.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
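To show the flavor of the change (illustrative of the pattern rather than an exact hunk from the patch): where the OID comes straight from user input, for example the function OID passed to the *_validator functions and checked in CheckFunctionValidatorAccess, the internal "cache lookup failed" elog becomes a user-facing error, along these lines:

tuple = SearchSysCache1(PROCOID, ObjectIdGetDatum(functionOid));
if (!HeapTupleIsValid(tuple))
    ereport(ERROR,
            (errcode(ERRCODE_UNDEFINED_FUNCTION),
             errmsg("function with OID %u does not exist", functionOid)));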
Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c
On Wed, Sep 7, 2016 at 8:52 AM, Haribabu Kommi wrote: > I reviewed and tested the patch. The changes are fine. > This patch provides better error message compared to earlier. > > Marked the patch as "Ready for committer" in commit-fest. Thanks for the review ! -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c
On Thu, Sep 8, 2016 at 2:23 AM, Tom Lane wrote: > I really don't like this solution. Changing this one function, out of the > dozens just like it in lsyscache.c, seems utterly random and arbitrary. > > I'm actually not especially convinced that passing domain_in a random > OID needs to not produce an XX000 error. That isn't a user-facing > function and people shouldn't be calling it by hand at all. The same > goes for record_in. But if we do want those cases to avoid this, > I think it's appropriate to fix it in those functions, not decree > that essentially-random other parts of the system should bear the > responsibility. (Thought experiment: if we changed the order of > operations in domain_in so that getBaseTypeAndTypmod were no longer > the first place to fail, would we undo this change and then change > wherever the failure had moved to? That's pretty messy.) > > In the case of record_in, it seems like it'd be easy enough to use > lookup_rowtype_tupdesc_noerror() instead of lookup_rowtype_tupdesc() > and then throw a user-facing error right there. Actually when we are passing invalid type to "record_in", error is thrown in "lookup_type_cache" function, and this function is not taking "no_error" input (it's only limited to lookup_rowtype_tupdesc_internal). One option is to pass "no_error" parameter to "lookup_type_cache" function also, and throw error only in record_in function. But problem is, "lookup_type_cache" has lot of references. And also will it be good idea to throw one generic error instead of multiple specific errors. ? Another option is to do same for "record_in" also what you have suggested for domain_in (an extra syscache lookup). ? >Not sure about a > nice solution for domain_in. We might have to resort to an extra > syscache lookup there, but even if we did, it should happen only > once per query so that's not very awful. > >> 3. CheckFunctionValidatorAccess: This is being called from all >> language validator functions. > > This part seems reasonable, since the validator functions are documented > as something users might call, and CheckFunctionValidatorAccess seems > like an apropos place to handle it. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c
On Thu, Sep 8, 2016 at 7:21 PM, Tom Lane wrote:
> We could make it work like that without breaking the ABI if we were
> to add a NOERROR bit to the allowed "flags". However, after looking
> around a bit I'm no longer convinced what I said above is a good idea.
> In particular, if we put the responsibility of reporting the error on
> callers then we'll lose the ability to distinguish "no such pg_type OID"
> from "type exists but it's only a shell". So I'm now thinking it's okay
> to promote lookup_type_cache's elog to a full ereport, especially since
> it's already using ereport for some of the related error cases such as
> the shell-type failure.

Okay.

> That still leaves us with what to do for domain_in. A really simple
> fix would be to move the InitDomainConstraintRef call before the
> getBaseTypeAndTypmod call, because the former fetches the typcache
> entry and would then benefit from lookup_type_cache's ereport.
> But that seems a bit magic.
>
> I'm tempted to propose that domain_state_setup start with
>
> typentry = lookup_type_cache(domainType, TYPECACHE_DOMAIN_INFO);
> if (typentry->typtype != TYPTYPE_DOMAIN)
>     ereport(ERROR,
>             (errcode(ERRCODE_DATATYPE_MISMATCH),
>              errmsg("type %s is not a domain",
>                     format_type_be(domainType))));
>
> removing the need to verify that after getBaseTypeAndTypmod.

Seems like a better option; I have done it this way. Attaching the modified patch.

> The cache loading is basically free because InitDomainConstraintRef
> would do it anyway; so the extra cost here is only one dynahash search.
> You could imagine buying back those cycles by teaching the typcache
> to be able to cache the result of getBaseTypeAndTypmod, but I'm doubtful
> that we really care. This whole setup sequence only happens once per
> query anyway.

Agreed.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com

cache_lookup_failure_v5.patch Description: Binary data

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c
On Fri, Sep 9, 2016 at 6:51 PM, Tom Lane wrote: > Pushed with cosmetic adjustments --- mostly, I thought we needed some > comments about the topic. Okay, Thanks. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Mon, Sep 5, 2016 at 9:33 AM, Amit Kapila wrote:
> USE_CONTENT_LOCK on my windows box, you can try by commenting that as
> well, if it works for you). So, in short we have to compare three
> approaches here.
>
> 1) Group mode to reduce CLOGControlLock contention
> 2) Use granular locking model
> 3) Use atomic operations

I have tested performance with approach 1 and approach 2.

1. Transaction (script.sql): I have used the transaction below to run my benchmark. We can argue that this may not be an ideal workload, but I tested it to put more load on ClogControlLock during transaction commit.

---
\set aid random (1,3000)
\set tid random (1,3000)

BEGIN;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
SAVEPOINT s1;
SELECT tbalance FROM pgbench_tellers WHERE tid = :tid for UPDATE;
SAVEPOINT s2;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
END;
---

2. Results
./pgbench -c $threads -j $threads -T 10 -M prepared postgres -f script.sql
scale factor: 300

Clients   head (tps)   grouplock (tps)   granular (tps)
-------   ----------   ---------------   --------------
128       29367        39326             37421
180       29777        37810             36469
256       28523        37418             35882

grouplock --> 1) Group mode to reduce CLOGControlLock contention
granular  --> 2) Use granular locking model

I will test with the 3rd approach also, whenever I get time.

3. Summary:
1. I can see that over head we are gaining almost ~30% performance at the higher client counts (128 and beyond).
2. Group lock is ~5% better compared to granular lock.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
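For readers who have not opened the patches: approach (1), the "group mode", batches the CLOG status updates of many backends behind a single lock acquisition, in the same spirit as ProcArrayGroupClearXid. The sketch below is only my illustration of that general leader/follower pattern; every name in it (GroupUpdateSlot, group_first, group_update_xid_status) is made up, and it is not code from either patch.

    /*
     * Illustrative sketch of a "group update" of transaction status.
     * group_first is assumed to start out as INVALID_SLOT.
     */
    #define INVALID_SLOT	0xFFFFFFFF

    typedef struct GroupUpdateSlot
    {
        TransactionId xid;          /* transaction whose status to set */
        XidStatus     status;       /* target status */
        uint32        next;         /* next queued slot, or INVALID_SLOT */
        volatile bool done;         /* set by the leader when finished */
    } GroupUpdateSlot;

    static pg_atomic_uint32 group_first;    /* head of the pending queue */
    static GroupUpdateSlot slots[1024];     /* one slot per backend */

    static void
    group_update_xid_status(uint32 myslot)
    {
        uint32  head = pg_atomic_read_u32(&group_first);

        slots[myslot].done = false;

        /* push ourselves onto the queue with a CAS loop */
        for (;;)
        {
            slots[myslot].next = head;
            if (pg_atomic_compare_exchange_u32(&group_first, &head, myslot))
                break;
            /* on failure, 'head' was updated to the current value; retry */
        }

        if (head != INVALID_SLOT)
        {
            /* someone was queued before us: a leader will update us too */
            while (!slots[myslot].done)
                pg_usleep(10);      /* real code would sleep on a semaphore */
            return;
        }

        /* we are the leader: take the lock once and drain the whole queue */
        LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
        head = pg_atomic_exchange_u32(&group_first, INVALID_SLOT);
        while (head != INVALID_SLOT)
        {
            uint32  next = slots[head].next;

            /* ... update the CLOG bits for slots[head].xid here ... */
            slots[head].done = true;
            head = next;
        }
        LWLockRelease(CLogControlLock);
    }

The point of the pattern is that under contention only one backend pays for the exclusive CLogControlLock acquisition per batch, which is why the benefit shows up mostly at high client counts.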
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Wed, Sep 14, 2016 at 10:25 AM, Dilip Kumar wrote: > I have tested performance with approach 1 and approach 2. > > 1. Transaction (script.sql): I have used below transaction to run my > bench mark, We can argue that this may not be an ideal workload, but I > tested this to put more load on ClogControlLock during commit > transaction. > > --- > \set aid random (1,3000) > \set tid random (1,3000) > > BEGIN; > SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE; > SAVEPOINT s1; > SELECT tbalance FROM pgbench_tellers WHERE tid = :tid for UPDATE; > SAVEPOINT s2; > SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE; > END; > --- > > 2. Results > ./pgbench -c $threads -j $threads -T 10 -M prepared postgres -f script.sql > scale factor: 300 > Clients head(tps)grouplock(tps) granular(tps) > --- - -- --- > 12829367 3932637421 > 18029777 3781036469 > 25628523 3741835882 > > > grouplock --> 1) Group mode to reduce CLOGControlLock contention > granular --> 2) Use granular locking model > > I will test with 3rd approach also, whenever I get time. > > 3. Summary: > 1. I can see on head we are gaining almost ~30 % performance at higher > client count (128 and beyond). > 2. group lock is ~5% better compared to granular lock. Forgot to mention that, this test is on unlogged tables. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Wed, Sep 14, 2016 at 8:59 PM, Robert Haas wrote:
> Sure, but you're testing at *really* high client counts here. Almost
> nobody is going to benefit from a 5% improvement at 256 clients.

I agree with your point, but here we need to consider one more thing: over head we are gaining ~30% with both of the approaches. So for comparing these two patches we can consider:

A. Other workloads (one can be as below)
   -> Load on CLogControlLock at commit (exclusive mode) + load on CLogControlLock at transaction status check (shared mode). I think we can mix (savepoints + updates).
B. Simplicity of the patch (if both are performing almost equally in all practical scenarios).
C. Based on the algorithm, whichever seems the winner.

I will try to test these patches with other workloads...

> You
> need to test 64 clients and 32 clients and 16 clients and 8 clients
> and see what happens there. Those cases are a lot more likely than
> these stratospheric client counts.

I tested with 64 clients as well:
1. Over head we are gaining ~15% with both the patches.
2. But group lock vs granular lock is almost the same.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Sat, Sep 17, 2016 at 6:54 AM, Tomas Vondra wrote:
> Although most of the changes probably does not matter much for unlogged
> tables (I planned to see how this affects regular tables, but as I see no
> difference for unlogged ones, I haven't done that yet).
>
> So the question is why Dilip sees +30% improvement, while my results are
> almost exactly the same. Looking at Dilip's benchmark, I see he only ran the
> test for 10 seconds, and I'm not sure how many runs he did, warmup etc.
> Dilip, can you provide additional info?

Actually I ran the test for 10 minutes. Sorry for the confusion (I copy-pasted my script, manually replaced the variables, and made a mistake).

My script is like this:

scale_factor=300
shared_bufs=8GB
time_for_reading=600

./postgres -c shared_buffers=8GB -c checkpoint_timeout=40min -c max_wal_size=20GB -c max_connections=300 -c maintenance_work_mem=1GB&
./pgbench -i -s $scale_factor --unlogged-tables postgres
./pgbench -c $threads -j $threads -T $time_for_reading -M prepared postgres -f ../../script.sql >> test_results.txt

I am taking the median of three readings. With the script below I can repeat my results every time (a 15% gain over head at 64 clients and a 30% gain over head at 128+ clients). I will repeat my test with 8, 16 and 32 clients and post the results soon.

> \set aid random (1,3000)
> \set tid random (1,3000)
>
> BEGIN;
> SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
> SAVEPOINT s1;
> SELECT tbalance FROM pgbench_tellers WHERE tid = :tid for UPDATE;
> SAVEPOINT s2;
> SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
> END;
> ---

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Mon, Sep 19, 2016 at 2:41 AM, Tomas Vondra wrote:
> But now that I look at the first post, I see it apparently used a plain
> tpc-b pgbench (with synchronous_commit=on) to show the benefits, which is
> the workload I'm running right now (results sometime tomorrow).

Good option; we can test plain TPC-B also. I have some more results: I have got the result for "update with no savepoint". Below is my script:

\set aid random (1,3000)
\set tid random (1,3000)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;

Results: (median of three 10-minute runs)

Clients   Head    GroupLock
16        21452   21589
32        42422   42688
64        42460   52590     (~23%)
128       22683   56825     (~150%)
256       18748   54867

With this workload I observed that the gain is bigger than with my previous workload (select for update with 2 savepoints).

Just to confirm whether the gain we are seeing is because of CLOG lock contention removal or something else, I ran 128 clients with perf for 5 minutes; below is my result. I can see that after applying the group lock patch, LWLockAcquire drops from 28% to just 4%, and that is almost entirely the CLOG lock.

On Head:
- 28.45% 0.24% postgres postgres [.] LWLockAcquire
   - LWLockAcquire
      + 53.49% TransactionIdSetPageStatus
      + 40.83% SimpleLruReadPage_ReadOnly
      + 1.16% BufferAlloc
      + 0.92% GetSnapshotData
      + 0.89% GetNewTransactionId
      + 0.72% LockBuffer
      + 0.70% ProcArrayGroupClearXid

After Group Lock Patch:
---
- 4.47% 0.26% postgres postgres [.] LWLockAcquire
   - LWLockAcquire
      + 27.11% GetSnapshotData
      + 21.57% GetNewTransactionId
      + 11.44% SimpleLruReadPage_ReadOnly
      + 10.13% BufferAlloc
      + 7.24% ProcArrayGroupClearXid
      + 4.74% LockBuffer
      + 4.08% LockAcquireExtended
      + 2.91% TransactionGroupUpdateXidStatus
      + 2.71% LockReleaseAll
      + 1.90% WALInsertLockAcquire
      + 0.94% LockRelease
      + 0.91% VirtualXactLockTableInsert
      + 0.90% VirtualXactLockTableCleanup
      + 0.72% MultiXactIdSetOldestMember
      + 0.66% LockRefindAndRelease

Next I will test "update with 2 savepoints" and "select for update with no savepoints". I will also test the granular lock and atomic lock patches in the next run.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Tue, Sep 20, 2016 at 8:37 AM, Amit Kapila wrote: > I think it is former (8 socket machine). I confirm this is 8 sockets machine(cthulhu) > > > You point related to high-client count is valid and I am sure it won't > give noticeable benefit at lower client-count as the the > CLOGControlLock contention starts impacting only at high-client count. > I am not sure if it is good idea to reject a patch which helps in > stabilising the performance (helps in falling off the cliff) when the > processes increases the number of cores (or hardware threads) > >> If you have to work that hard >> to find a big win, and tests under more reasonable conditions show no >> benefit, it's not clear to me that it's really worth the time we're >> all spending benchmarking and reviewing this, or the risk of bugs, or >> the damage to the SLRU abstraction layer. > > I agree with you unless it shows benefit on somewhat more usual > scenario's, we should not accept it. So shouldn't we wait for results > of other workloads like simple-update or tpc-b on bigger machines > before reaching to conclusion? +1 My test are under run, I will post it soon.. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Tue, Sep 20, 2016 at 9:15 AM, Dilip Kumar wrote:
> +1
>
> My test are under run, I will post it soon..

I have some more results now.

8 socket machine
10 min run (median of 3 runs)
synchronous_commit = off
scale factor = 300
shared buffers = 8GB

test1: Simple update (pgbench)

Clients   Head    GroupLock
32        45702   45402
64        46974   51627
128       35056   55362

test2: TPC-B (pgbench)

Clients   Head    GroupLock
32        27969   27765
64        33140   34786
128       21555   38848

Summary:
--
At 32 clients there is no gain; I think at this workload the CLOG lock is not a problem.
At 64 clients we can see a ~10% gain with simple update and ~5% with TPC-B.
At 128 clients we can see a > 50% gain.

Currently I have tested with synchronous_commit = off; later I can try with it on. I can also test at 80 clients; I think we will see some significant gain at that client count as well, but as of now I haven't tested it.

With the above results, what do we think? Should we continue our testing?

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Wed, Sep 21, 2016 at 8:47 AM, Dilip Kumar wrote:
> Summary:
> --
> At 32 clients no gain, I think at this workload Clog Lock is not a problem.
> At 64 Clients we can see ~10% gain with simple update and ~5% with TPCB.
> At 128 Clients we can see > 50% gain.
>
> Currently I have tested with synchronous commit=off, later I can try
> with on. I can also test at 80 client, I think we will see some
> significant gain at this client count also, but as of now I haven't
> yet tested.
>
> With above results, what we think ? should we continue our testing ?

I have done further testing with the TPC-B workload to see the impact on the performance gain of increasing the scale factor. Again, at 32 clients there is no gain, but at 64 clients the gain is 12% and at 128 clients it is 75%. This shows that the improvement with the group lock is better at a higher scale factor (at scale factor 300 the gain was 5% at 64 clients and 50% at 128 clients).

8 socket machine (kernel 3.10)
10 min run (median of 3 runs)
synchronous_commit = off
scale factor = 1000
shared buffers = 40GB

Test results:

client   head    group lock
32       27496   27178
64       31275   35205
128      20656   34490

LWLOCK_STATS: approx. block count on ClogControlLock ("lwlock main 11")

client   head    group lock
32       8       6
64       15      10
128      14      7

Note: These are approximate block counts; I have the detailed LWLOCK_STATS result in case someone wants to look into it. LWLOCK_STATS shows that the ClogControlLock block count is reduced by 25% at 32 clients, 33% at 64 clients and 50% at 128 clients.

Conclusion:
1. I think both LWLOCK_STATS and the performance data show that we get a significant contention reduction on ClogControlLock with the patch.
2. It also shows that even though we are not seeing any performance gain at 32 clients, there is still a contention reduction with the patch.

I am planning to do some more tests with a higher scale factor (3000 or more).

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Improve bitmap costing for lossy pages
On Thu, Oct 5, 2017 at 8:15 PM, Robert Haas wrote:
> On Sun, Sep 17, 2017 at 7:04 AM, Dilip Kumar wrote:
>> I used lossy_pages = max(0, total_pages - maxentries / 2). as
>> suggesed by Alexander.
>
> Does that formula accurately estimate the number of lossy pages?

I have printed total_pages, exact_pages and lossy_pages during planning time, and for testing purposes I tweaked the code a bit so that it doesn't consider lossy_pages in the cost calculation (same as the base code). I have tested TPC-H scale factor 20 at different work_mem settings (4MB, 20MB, 64MB) and noted down the estimated pages vs the actual pages.

Analysis: The estimated value of lossy_pages is way higher than its actual value, and the reason is that the total_pages calculated by the "Mackert and Lohman formula" is not correct.

work_mem = 4MB
query 4:  estimated: total_pages=552472.00 exact_pages=32768.00 lossy_pages=519704.00; actual: exact=18548 lossy=146141
query 6:  estimated: total_pages=1541449.00 exact_pages=32768.00 lossy_pages=1508681.00; actual: exact=13417 lossy=430385
query 8:  estimated: total_pages=552472.00 exact_pages=32768.00 lossy_pages=519704.00; actual: exact=56869 lossy=495603
query 14: estimated: total_pages=1149603.00 exact_pages=32768.00 lossy_pages=1116835.00; actual: exact=17115 lossy=280949

work_mem = 20MB
query 4:  estimated: total_pages=552472.00 exact_pages=163840.00 lossy_pages=388632.00; actual: exact=109856 lossy=57761
query 6:  estimated: total_pages=1541449.00 exact_pages=163840.00 lossy_pages=1377609.00; actual: exact=59771 lossy=397956
query 8:  estimated: total_pages=552472.00 exact_pages=163840.00 lossy_pages=388632.00; actual: Heap Blocks: exact=221777 lossy=330695
query 14: estimated: total_pages=1149603.00 exact_pages=163840.00 lossy_pages=985763.00; actual: exact=63381 lossy=235513

work_mem = 64MB
query 4:  estimated: total_pages=552472.00 exact_pages=552472.00 lossy_pages=0.00; actual: exact=166005 lossy=0
query 6:  estimated: total_pages=1541449.00 exact_pages=524288.00 lossy_pages=1017161.00; actual: exact=277717 lossy=185919
query 8:  estimated: total_pages=552472.00 exact_pages=552472.00 lossy_pages=0.00; actual: exact=552472 lossy=0
query 14: estimated: total_pages=1149603.00 exact_pages=524288.00 lossy_pages=625315.00; actual: exact=309091 lossy=0

> The performance results look good, but that's a slightly different
> thing from whether the estimate is accurate.
>
> +nbuckets = tbm_calculate_entires(maxbytes);
>
> entires?

changed to
+ tbm->maxentries = (int) tbm_calculate_entires(maxbytes);

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com

improve_bitmap_cost_v4.patch Description: Binary data

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
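For reference, the estimate being evaluated above is just the formula quoted at the top of the message, clamped to the page count; a minimal sketch of it (the helper name and signature are mine, not the patch's, and maxentries is whatever the TIDBitmap can hold in work_mem):

    /*
     * Illustrative only: lossy_pages = max(0, total_pages - maxentries / 2),
     * clamped so it can never exceed the number of pages the bitmap touches.
     */
    static double
    estimated_lossy_pages(double total_pages, double maxentries)
    {
        double  lossy_pages = Max(0.0, total_pages - maxentries / 2.0);

        return Min(lossy_pages, total_pages);
    }

The numbers above show that the formula itself is not the main problem; the total_pages input it receives is.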
Re: [HACKERS] Proposal: Improve bitmap costing for lossy pages
On Fri, Oct 6, 2017 at 6:36 PM, Robert Haas wrote: > On Fri, Oct 6, 2017 at 2:12 AM, Dilip Kumar wrote: >>> The performance results look good, but that's a slightly different >>> thing from whether the estimate is accurate. >>> >>> +nbuckets = tbm_calculate_entires(maxbytes); >>> >>> entires? >> >> changed to >> + tbm->maxentries = (int) tbm_calculate_entires(maxbytes); > > My point was not that you should add a cast, but that you wrote > "entires" rather than "entries". My bad, I thought you were suggesting the variable name as "entries" instead of "nbuckets" so I removed the variable "nbuckets". I will fix the typo in the function name and post the patch. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Improve bitmap costing for lossy pages
On Fri, Oct 6, 2017 at 6:08 PM, Alexander Kuzmenkov wrote:
>
>> Analysis: The estimated value of the lossy_pages is way higher than
>> its actual value and reason is that the total_pages calculated by the
>> "Mackert and Lohman formula" is not correct.
>
> I think the problem might be that the total_pages includes cache effects and
> rescans. For bitmap entries we should use something like relation pages *
> selectivity.

I have noticed that for the TPC-H case, if I use "pages * selectivity" it gives me better results, but IMHO directly multiplying the pages by the selectivity may not be the correct way to calculate the number of heap pages: it can only give the correct result when all the TIDs being fetched are clustered. On the other hand, the "Mackert and Lohman" formula assumes that all the TIDs are evenly distributed across the heap pages, which can also give a wrong estimate, as we are seeing in our TPC-H case.

> Meanwhile, I ran TPC-H briefly with the v3 patch: scale factor 25, 2
> workers, SSD storage.
> It shows significant improvement on 4MB work_mem and no change on 2GB.
>
> Here are the results (execution time in seconds, average of 5 runs):
>
> work_mem        4MB                 2GB
> Query      master   patch      master   patch
> 4          179      174        168      167
> 5          190      168        155      156
> 6          280      87         227      229
> 8          197      114        179      172
> 10         269      227        190      192
> 14         110      108        106      105

Thanks for the testing; the numbers look good to me.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Improve bitmap costing for lossy pages
On Fri, Oct 6, 2017 at 7:04 PM, Dilip Kumar wrote: > On Fri, Oct 6, 2017 at 6:36 PM, Robert Haas wrote: >> On Fri, Oct 6, 2017 at 2:12 AM, Dilip Kumar wrote: >>>> The performance results look good, but that's a slightly different >>>> thing from whether the estimate is accurate. >>>> >>>> +nbuckets = tbm_calculate_entires(maxbytes); >>>> >>>> entires? >>> >>> changed to >>> + tbm->maxentries = (int) tbm_calculate_entires(maxbytes); >> >> My point was not that you should add a cast, but that you wrote >> "entires" rather than "entries". > > My bad, I thought you were suggesting the variable name as "entries" > instead of "nbuckets" so I removed the variable "nbuckets". I will > fix the typo in the function name and post the patch. Fixed in the attached version. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com improve_bitmap_cost_v5.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Improve bitmap costing for lossy pages
On Fri, Oct 6, 2017 at 7:24 PM, Dilip Kumar wrote:
> On Fri, Oct 6, 2017 at 6:08 PM, Alexander Kuzmenkov wrote:
>>
>>> Analysis: The estimated value of the lossy_pages is way higher than
>>> its actual value and reason is that the total_pages calculated by the
>>> "Mackert and Lohman formula" is not correct.
>>
>> I think the problem might be that the total_pages includes cache effects and
>> rescans. For bitmap entries we should use something like relation pages *
>> selectivity.
>
> I have noticed that for the TPCH case if I use "pages * selectivity"
> it give me better results, but IMHO directly multiplying the pages
> with selectivity may not be the correct way to calculate the number of
> heap pages it can only give the correct result when all the TID being
> fetched are clustered. But on the other hand "Mackert and Lohman
> formula" formulae consider that all the TID's are evenly distributed
> across the heap pages which can also give the wrong estimation like we
> are seeing in our TPCH case.

I agree with the point that total_pages includes the cache effects and rescans when loop_count > 1; that can be avoided if we always calculate heap_pages the way it is calculated in the else branch (loop_count = 0). Fortunately, in all the TPC-H query plans I posted upthread, the bitmap scan was never on the inner side of a nested loop join, so loop_count was always 0. I will fix this.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
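For readers following along, the non-rescan branch being referred to computes the page estimate roughly as below. This is a simplified paraphrase of the single-scan Mackert-Lohman approximation used by the bitmap heap scan costing, not a verbatim copy of the planner code; T is the number of heap pages and tuples_fetched the expected number of matching tuples.

    /*
     * Simplified paraphrase, for illustration: single-scan page estimate
     * without any rescan/caching adjustment.
     */
    static double
    single_scan_pages_fetched(double tuples_fetched, double T)
    {
        double  pages_fetched;

        pages_fetched = (2.0 * T * tuples_fetched) / (2.0 * T + tuples_fetched);
        if (pages_fetched > T)
            pages_fetched = T;

        return ceil(pages_fetched);
    }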
Re: [HACKERS] Proposal: Improve bitmap costing for lossy pages
On Fri, Oct 6, 2017 at 9:21 PM, Dilip Kumar wrote: > On Fri, Oct 6, 2017 at 7:24 PM, Dilip Kumar wrote: >> On Fri, Oct 6, 2017 at 6:08 PM, Alexander Kuzmenkov >> wrote: >>> >>>> Analysis: The estimated value of the lossy_pages is way higher than >>>> its actual value and reason is that the total_pages calculated by the >>>> "Mackert and Lohman formula" is not correct. >>> >>> >>> I think the problem might be that the total_pages includes cache effects and >>> rescans. For bitmap entries we should use something like relation pages * >>> selectivity. >> >> I have noticed that for the TPCH case if I use "pages * selectivity" >> it give me better results, but IMHO directly multiplying the pages >> with selectivity may not be the correct way to calculate the number of >> heap pages it can only give the correct result when all the TID being >> fetched are clustered. But on the other hand "Mackert and Lohman >> formula" formulae consider that all the TID's are evenly distributed >> across the heap pages which can also give the wrong estimation like we >> are seeing in our TPCH case. > > I agree with the point that the total_pages included the cache effects > and rescan when loop_count > 1, that can be avoided if we always > calculate heap_pages as it is calculated in the else part > (loop_count=0). Fortunately, in all the TPCH query plan what I posted > up thread bitmap scan was never at the inner side of the NLJ so > loop_count was always 0. I will fix this. I have fixed the issue. Now, for calculating the lossy pages it will not consider the rescan. As mentioned above it will not affect the TPCH plan so haven't measured the performance again. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com improve_bitmap_cost_v6.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel Bitmap Heap Scans segfaults due to (tbm->dsa==NULL) on PostgreSQL 10
On Thu, Oct 12, 2017 at 4:31 PM, Tomas Vondra wrote:
> Hi,
>
> It seems that Q19 from TPC-H is consistently failing with segfaults due
> to calling tbm_prepare_shared_iterate() with (tbm->dsa==NULL).
>
> I'm not very familiar with how the dsa is initialized and passed around,
> but I only see the failures when the bitmap is constructed by a mix of
> BitmapAnd and BitmapOr operations.

I think I have found the issue: bitmap_subplan_mark_shared is not properly pushing the isshared flag down to the lower-level bitmap index node, and because of that tbm_create is passing a NULL dsa while creating the tidbitmap. So this problem only occurs in a very specific combination of BitmapOr and BitmapAnd, when a BitmapAnd is the first subplan of the BitmapOr. If a BitmapIndexScan is the first subplan under the BitmapOr then there is no problem, because the BitmapOr node will create the tbm by itself and isshared is set for the BitmapOr.

The attached patch fixes the issue for me. I will thoroughly test this patch with other scenarios as well. Thanks for reporting.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com

diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 5c934f2..cc7590b 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -4922,7 +4922,11 @@ bitmap_subplan_mark_shared(Plan *plan)
 		bitmap_subplan_mark_shared(
 						linitial(((BitmapAnd *) plan)->bitmapplans));
 	else if (IsA(plan, BitmapOr))
+	{
 		((BitmapOr *) plan)->isshared = true;
+		bitmap_subplan_mark_shared(
+						linitial(((BitmapOr *) plan)->bitmapplans));
+	}
 	else if (IsA(plan, BitmapIndexScan))
 		((BitmapIndexScan *) plan)->isshared = true;
 	else

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel Bitmap Heap Scans segfaults due to (tbm->dsa==NULL) on PostgreSQL 10
On Thu, Oct 12, 2017 at 6:37 PM, Tomas Vondra wrote: > > > On 10/12/2017 02:40 PM, Dilip Kumar wrote: >> On Thu, Oct 12, 2017 at 4:31 PM, Tomas Vondra >> wrote: >>> Hi, >>> >>> It seems that Q19 from TPC-H is consistently failing with segfaults due >>> to calling tbm_prepare_shared_iterate() with (tbm->dsa==NULL). >>> >>> I'm not very familiar with how the dsa is initialized and passed around, >>> but I only see the failures when the bitmap is constructed by a mix of >>> BitmapAnd and BitmapOr operations. >>> >> I think I have got the issue, bitmap_subplan_mark_shared is not >> properly pushing the isshared flag to lower level bitmap index node, >> and because of that tbm_create is passing NULL dsa while creating the >> tidbitmap. So this problem will come in very specific combination of >> BitmapOr and BitmapAnd when BitmapAnd is the first subplan for the >> BitmapOr. If BitmapIndex is the first subplan under BitmapOr then >> there is no problem because BitmapOr node will create the tbm by >> itself and isshared is set for BitmapOr. >> >> Attached patch fixing the issue for me. I will thoroughly test this >> patch with other scenario as well. Thanks for reporting. >> > > Yep, this fixes the failures for me. > Thanks for confirming. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise aggregation/grouping
On Fri, Oct 13, 2017 at 12:06 PM, Jeevan Chalke wrote: > While playing around with the patch I have noticed one regression with the partial partition-wise aggregate. I am consistently able to reproduce this on my local machine. Scenario: Group by on non-key column and only one tuple per group. Complete Test: create table t(a int,b int) partition by range(a); create table t1 partition of t for values from (1) to (10); create table t2 partition of t for values from (10) to (20); insert into t values (generate_series(1,19),generate_series(1, 19)); postgres=# explain analyze select sum(a) from t group by b; QUERY PLAN -- Finalize GroupAggregate (cost=20379.55..28379.51 rows=19 width=12) (actual time=102.311..322.969 rows=19 loops=1) Group Key: t1.b -> Merge Append (cost=20379.55..25379.53 rows=19 width=12) (actual time=102.303..232.310 rows=19 loops=1) Sort Key: t1.b -> Partial GroupAggregate (cost=10189.72..11939.70 rows=9 width=12) (actual time=52.164..108.967 rows=9 loops=1) Group Key: t1.b -> Sort (cost=10189.72..10439.72 rows=9 width=8) (actual time=52.158..66.236 rows=9 loops=1) Sort Key: t1.b Sort Method: external merge Disk: 1768kB -> Seq Scan on t1 (cost=0.00..1884.99 rows=9 width=8) (actual time=0.860..20.388 rows=9 loops=1) -> Partial GroupAggregate (cost=10189.82..11939.82 rows=10 width=12) (actual time=50.134..102.976 rows=10 loops=1) Group Key: t2.b -> Sort (cost=10189.82..10439.82 rows=10 width=8) (actual time=50.128..63.362 rows=10 loops=1) Sort Key: t2.b Sort Method: external merge Disk: 1768kB -> Seq Scan on t2 (cost=0.00..1885.00 rows=10 width=8) (actual time=0.498..20.977 rows=10 loops=1) Planning time: 0.190 ms Execution time: 339.929 ms (18 rows) postgres=# set enable_partition_wise_agg=off; SET postgres=# explain analyze select sum(a) from t group by b; QUERY PLAN -- GroupAggregate (cost=26116.53..29616.51 rows=19 width=12) (actual time=139.413..250.751 rows=19 loops=1) Group Key: t1.b -> Sort (cost=26116.53..26616.52 rows=19 width=8) (actual time=139.406..168.775 rows=19 loops=1) Sort Key: t1.b Sort Method: external merge Disk: 3544kB -> Result (cost=0.00..5769.98 rows=19 width=8) (actual time=0.674..76.392 rows=19 loops=1) -> Append (cost=0.00..3769.99 rows=19 width=8) (actual time=0.672..40.291 rows=19 loops=1) -> Seq Scan on t1 (cost=0.00..1884.99 rows=9 width=8) (actual time=0.672..12.408 rows=9 loops=1) -> Seq Scan on t2 (cost=0.00..1885.00 rows=10 width=8) (actual time=1.407..11.689 rows=10 loops=1) Planning time: 0.146 ms Execution time: 263.678 ms (11 rows) -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise aggregation/grouping
On Tue, Oct 17, 2017 at 10:44 PM, Jeevan Chalke wrote:
>
> I didn't get what you mean by regression here. Can you please explain?
>
> I see that PWA plan is selected over regular plan when enabled on the basis
> of costing.
> Regular planning need a Result node due to which costing increases where as
> PWA don't need that and thus wins.

Sorry for not explaining clearly. I meant that with the normal plan the execution time is 263.678 ms, whereas with PWA it is 339.929 ms. I only set enable_partition_wise_agg=on, it switched to PWA, and the execution time increased by 30%. I understand that this is the worst case for PWA, where the FinalizeAggregate is getting all the tuples.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] path toward faster partition pruning
On Mon, Oct 30, 2017 at 12:20 PM, Amit Langote wrote: > On 2017/10/30 14:55, Amit Langote wrote: >> Fixed in the attached updated patch, along with a new test in 0001 to >> cover this case. Also, made a few tweaks to 0003 and 0005 (moved some >> code from the former to the latter) around the handling of >> ScalarArrayOpExprs. > > Sorry, I'd forgotten to include some changes. > > In the previous versions, RT index of the table needed to be passed to > partition.c, which I realized is no longer needed, so I removed that > requirement from the interface. As a result, patches 0002 and 0003 have > changed in this version. Some Minor comments: + * get_rel_partitions + * Return the list of partitions of rel that pass the clauses mentioned + * rel->baserestrictinfo + * + * Returned list contains the AppendRelInfos of chosen partitions. + */ +static List * +get_append_rel_partitions(PlannerInfo *root, Function name in function header is not correct. + !DatumGetBool(((Const *) clause)->constvalue)) + { + *constfalse = true; + continue; DatumGetBool will return true if the first byte of constvalue is nonzero otherwise false. IIUC, this is not the intention here. Or I am missing something? + * clauses in this function ourselves, for example, having both a > 1 and + * a = 0 the list a = 0 the list -> a = 0 in the list +static bool +partkey_datum_from_expr(const Expr *expr, Datum *value) +{ + /* + * Add more expression types here as needed to support higher-level + * code. + */ + switch (nodeTag(expr)) + { + case T_Const: + *value = ((Const *) expr)->constvalue; + return true; I think for evaluating other expressions (e.g. T_Param) we will have to pass ExprContext to this function. Or we can do something cleaner because if we want to access the ExprContext inside partkey_datum_from_expr then we may need to pass it to "get_partitions_from_clauses" which is a common function for optimizer and executor. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
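To make the last review point concrete: if a ParamListInfo were threaded down to partkey_datum_from_expr (which is exactly the interface question raised above), external Params could be handled with a helper along the lines below. This is purely hypothetical illustration, not code from the patch.

    /*
     * Hypothetical helper: extract a Datum from a client-supplied (extern)
     * Param, given access to the ParamListInfo.  Returns false if the value
     * cannot be determined at this point.
     */
    static bool
    partkey_datum_from_param(Param *param, ParamListInfo params, Datum *value)
    {
        if (param->paramkind == PARAM_EXTERN &&
            params != NULL &&
            param->paramid >= 1 &&
            param->paramid <= params->numParams)
        {
            ParamExternData *prm = &params->params[param->paramid - 1];

            if (OidIsValid(prm->ptype) && !prm->isnull)
            {
                *value = prm->value;
                return true;
            }
        }
        return false;
    }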
Re: [HACKERS] Proposal: Improve bitmap costing for lossy pages
On Sat, Nov 11, 2017 at 3:25 AM, Robert Haas wrote: > On Thu, Nov 9, 2017 at 3:55 AM, amul sul wrote: >> It took me a little while to understand this calculation. You have moved >> this >> code from tbm_create(), but I think you should move the following >> comment as well: > > I made an adjustment that I hope will address your concern here, made > a few other adjustments, and committed this. > Thanks, Robert. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Thu, Sep 29, 2016 at 6:40 AM, Tomas Vondra wrote: > Yes, definitely - we're missing something important, I think. One difference > is that Dilip is using longer runs, but I don't think that's a problem (as I > demonstrated how stable the results are). > > I wonder what CPU model is Dilip using - I know it's x86, but not which > generation it is. I'm using E5-4620 v1 Xeon, perhaps Dilip is using a newer > model and it makes a difference (although that seems unlikely). I am using "Intel(R) Xeon(R) CPU E7- 8830 @ 2.13GHz " -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Thu, Sep 29, 2016 at 8:05 PM, Robert Haas wrote: > OK, another theory: Dilip is, I believe, reinitializing for each run, > and you are not. Yes, I am reinitializing for each run. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Dynamic shared memory areas
On Wed, Oct 5, 2016 at 3:00 AM, Thomas Munro wrote:
> Here's a new version that does that.

While testing this patch I found an issue in the following code:

+ total_size = DSA_INITIAL_SEGMENT_SIZE;
+ total_pages = total_size / FPM_PAGE_SIZE;
+ metadata_bytes =
+ MAXALIGN(sizeof(dsa_area_control)) +
+ MAXALIGN(sizeof(FreePageManager)) +
+ total_pages * sizeof(dsa_pointer);
+ /* Add padding up to next page boundary. */
+ if (metadata_bytes % FPM_PAGE_SIZE != 0)
+ metadata_bytes += FPM_PAGE_SIZE - (metadata_bytes % FPM_PAGE_SIZE);
+ usable_pages =
+ (total_size - metadata_bytes) / FPM_PAGE_SIZE;
+ segment = dsm_create(total_size, 0);
+ dsm_pin_segment(segment);

The actual problem is that the size of dsa_area_control is bigger than DSA_INITIAL_SEGMENT_SIZE, but we are allocating a segment of only DSA_INITIAL_SEGMENT_SIZE bytes.

(gdb) p sizeof(dsa_area_control)
$8 = 67111000
(gdb) p DSA_INITIAL_SEGMENT_SIZE
$9 = 1048576

In dsa-v1 the problem did not exist because DSA_MAX_SEGMENTS was 1024, but in dsa-v2 I think it is calculated wrongly.

(gdb) p DSA_MAX_SEGMENTS
$10 = 16777216

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
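To spell out the invariant that is being violated here, a defensive check written only in terms of the quantities from the quoted code would look like the following. This is illustrative, not part of the patch; the point is that the control object, the FreePageManager and the per-page handle array for the initial segment must all fit inside that first segment.

    /*
     * Illustrative sanity check: the first segment's own metadata must fit
     * within DSA_INITIAL_SEGMENT_SIZE, otherwise dsm_create() is asked for a
     * segment that is too small to hold its bookkeeping.
     */
    Assert(MAXALIGN(sizeof(dsa_area_control)) +
           MAXALIGN(sizeof(FreePageManager)) +
           (DSA_INITIAL_SEGMENT_SIZE / FPM_PAGE_SIZE) * sizeof(dsa_pointer) <
           DSA_INITIAL_SEGMENT_SIZE);

With DSA_MAX_SEGMENTS inflated to 16777216, sizeof(dsa_area_control) alone exceeds the 1MB initial segment, so the check fails immediately.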
Re: [HACKERS] Hash tables in dynamic shared memory
On Wed, Oct 5, 2016 at 3:10 AM, Thomas Munro wrote:
> Here is an experimental hash table implementation that uses DSA memory
> so that hash tables can be shared by multiple backends and yet grow
> dynamically. Development name: "DHT".

While reviewing, I found that in the dht_iterate_begin function we are not initializing iterator->last_item_pointer to InvalidDsaPointer, and during the scan this variable is used in the advance_iterator function. (I think this can create a problem, possibly even the loss of some tuples?)

+dht_iterate_begin(dht_hash_table *hash_table,
+ dht_iterator *iterator,
+ bool exclusive)
+{
+ Assert(hash_table->control->magic == DHT_MAGIC);
+
+ iterator->hash_table = hash_table;
+ iterator->exclusive = exclusive;
+ iterator->partition = 0;
+ iterator->bucket = 0;
+ iterator->item = NULL;
+ iterator->locked = false;

+advance_iterator(dht_iterator *iterator, dsa_pointer bucket_head,
+ dsa_pointer *next)
+{
+ dht_hash_table_item *item;
+
+ while (DsaPointerIsValid(bucket_head))
+ {
+ item = dsa_get_address(iterator->hash_table->area,
+ bucket_head);
+ if ((!DsaPointerIsValid(iterator->last_item_pointer) ||
+ bucket_head < iterator->last_item_pointer) &&

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
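Concretely, the initialization being pointed out would make dht_iterate_begin() look like the quoted code plus one line (shown only to make the report unambiguous; this is the suggested change, not code taken from a posted fix):

    void
    dht_iterate_begin(dht_hash_table *hash_table,
                      dht_iterator *iterator,
                      bool exclusive)
    {
        Assert(hash_table->control->magic == DHT_MAGIC);

        iterator->hash_table = hash_table;
        iterator->exclusive = exclusive;
        iterator->partition = 0;
        iterator->bucket = 0;
        iterator->item = NULL;
        iterator->last_item_pointer = InvalidDsaPointer;    /* the missing line */
        iterator->locked = false;
    }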
[HACKERS] Proposal: scan key push down to heap [WIP]
Hi Hackers,

I would like to propose a patch for pushing down scan keys to the heap. Currently, scan keys are pushed down only for system table scans. I have implemented a POC patch to do the same for normal table scans.

This patch extracts the expressions from the qual and prepares scan keys from them. Currently, the POC version only supports quals of the form "var OP const", because these can be pushed down using the existing framework.

The purpose of this work is to first implement the basic functionality and analyze the results. If the results are good then we can extend it to other types of expressions. However, in the future when we try to expand the support to complex expressions, we need to be very careful while selecting pushable expressions. It should not happen that we push something very complex, which may cause contention with other write operations (as HeapKeyTest is done under the page lock).

Performance Test: (test done on my local machine, with all default settings)

Setup:
--
create table test(a int, b varchar, c varchar, d int);
insert into test values(generate_series(1,1000), repeat('x', 30), repeat('y', 30), generate_series(1,1000));
analyze test;

Test query:
--
select count(*) from test where a < $1;

Results: (execution time in ms)

Selectivity   Head (ms)   Patch (ms)   gain
0.01          612         307          49%
0.1           623         353          43%
0.2           645         398          38%
0.5           780         535          31%
0.8           852         590          30%
1             913         730          20%

Instructions: (CPU instructions measured with the callgrind tool)

Query: select count(*) from test where a < 10;
Head:  10,815,730,925
Patch:  4,780,047,331

Summary:
--
1. ~50% reduction in both instructions as well as execution time.
2. Here we can see a ~20% execution time reduction even at selectivity 1 (when all tuples are selected). The reason for this can be that HeapKeyTest is much simpler compared to ExecQual. It's possible that in the future, when we try to support a wider variety of keys, the gain at high selectivity may come down.

WIP patch attached.

Thoughts ?

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com

heap_scankey_pushdown_v1.patch Description: Binary data

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
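To make the "var OP const" case concrete: a qual such as "a < 10" on an int4 column essentially becomes a ScanKey that the heap scan evaluates itself via HeapKeyTest. A hand-written equivalent would look roughly like the sketch below; the attribute number and the rel/snapshot variables are assumed to be set up by the caller, and this is only an illustration of the mechanism, not a fragment of the patch.

    /*
     * Minimal sketch: push "a < 10" (attribute 1, int4) into the heap scan.
     * HeapKeyTest only calls the comparison proc, so F_INT4LT does the work;
     * the strategy number is informational here.
     */
    ScanKeyData skey;
    HeapScanDesc scan;

    ScanKeyInit(&skey,
                1,                      /* attno of column "a" */
                BTLessStrategyNumber,   /* a < const */
                F_INT4LT,               /* int4 "<" comparison proc */
                Int32GetDatum(10));     /* the constant */

    scan = heap_beginscan(rel, snapshot, 1, &skey);

The patch's job is essentially to build such keys automatically from the baserestrictinfo quals at executor startup, instead of evaluating the same condition through ExecQual for every tuple.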
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
connections establishing)
tps = 52320.442694 (excluding connections establishing)

28564  Client        | ClientRead
3663
1320   LWLockNamed   | ProcArrayLock
742    Lock          | transactionid
534    LWLockNamed   | CLogControlLock
255    Lock          | extend
108    LWLockNamed   | XidGenLock
81     LWLockTranche | buffer_content
44     LWLockTranche | lock_manager
29     Lock          | tuple
6      LWLockTranche | wal_insert
1      Tuples only is on.
1      LWLockTranche | buffer_mapping

With Patch:
tps = 47505.582315 (including connections establishing)
tps = 47505.773351 (excluding connections establishing)

28186  Client        | ClientRead
3655
1172   LWLockNamed   | ProcArrayLock
619    Lock          | transactionid
289    LWLockNamed   | CLogControlLock
237    Lock          | extend
81     LWLockTranche | buffer_content
48     LWLockNamed   | XidGenLock
28     LWLockTranche | lock_manager
23     Lock          | tuple
6      LWLockTranche | wal_insert

I think from a client count of 96 onwards the contention on CLogControlLock is clearly visible, and it is completely solved with the group lock patch. At the lower client counts (32, 64) the contention on CLogControlLock is not significant, so we cannot see any gain with the group lock patch (though we can see that the contention on CLogControlLock is somewhat reduced at 64 clients).

Note: I have taken only one set of readings, and at 32 clients my reading shows some regression with the group lock patch, which may be run-to-run variance (earlier I never saw this regression; I can confirm again with multiple runs).

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com

[@power2 ~]$ uname -mrs
Linux 3.10.0-229.14.1.ael7b.ppc64le ppc64le
[@power2 ~]$ lscpu
Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                192
On-line CPU(s) list:   0-191
Thread(s) per core:    8
Core(s) per socket:    1
Socket(s):             24
NUMA node(s):          4
Model:                 IBM,8286-42A
L1d cache:             64K
L1i cache:             32K
L2 cache:              512K
L3 cache:              8192K
NUMA node0 CPU(s):     0-47
NUMA node1 CPU(s):     48-95
NUMA node2 CPU(s):     96-143
NUMA node3 CPU(s):     144-191
[@power2 ~]$ nproc --all
192

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: scan key push down to heap [WIP]
On Thu, Oct 13, 2016 at 6:05 AM, Robert Haas wrote:
> I seriously doubt that this should EVER be supported for anything
> other than "var op const", and even then only for very simple
> operators.

Yes, with the existing key push down infrastructure only "var op const" is possible, but if we extend it I think we can cover many other simple expressions, i.e.

Unary operators: ISNULL, NOTNULL
Other simple expressions: "Var op Const", "Var op Var", "Var op SimpleExpr(Var, Const)" ...

We cannot push down function expressions because we cannot guarantee that they are safe, but can we push down built-in functions? (I think we can identify their safety based on their data type, but I am not too sure about this point.)

> For example, texteq() is probably too complicated, because
> it might de-TOAST, and that might involve doing a LOT of work while
> holding the buffer content lock. But int4eq() would be OK.
>
> Part of the trick if we want to make this work is going to be figuring
> out how we'll identify which operators are safe.

Yes, I agree that's the difficult part. Can't we process the full qual list and decide whether the operators are safe based on their datatype? What I mean is that instead of checking the safety of each operator like texteq(), text_le(), ... we can directly discard any operator involving such data types.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
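One cheap, conservative version of the datatype-based check being discussed would be to allow only pass-by-value types: they are fixed-length by definition and can never require de-TOASTing while the buffer content lock is held. The helper below is illustrative only (its name is mine), and a real patch would presumably want additional checks on the operator's function as well.

    /*
     * Illustrative only: a very conservative notion of a "safe" datatype for
     * pushdown.  Pass-by-value types are fixed-length and cannot be toasted.
     */
    static bool
    pushdown_safe_type(Oid typid)
    {
        return get_typbyval(typid);
    }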
Re: [HACKERS] Parallel bitmap heap scan
There is major chance in tidbitmap.c file after efficient hash table commit [1] and my patch need to be rebased. Only parallel-bitmap-heap-scan need to be rebased, all other patch can be applied on head as is. Rebased version (v2) of parallel-bitmap-heap-scan is attached. [1] commit-http://git.postgresql.org/pg/commitdiff/75ae538bc3168bf44475240d4e0487ee2f3bb376 On Fri, Oct 7, 2016 at 11:46 AM, Dilip Kumar wrote: > Hi Hackers, > > I would like to propose parallel bitmap heap scan feature. After > running TPCH benchmark, It was observed that many of TPCH queries are > using bitmap scan (@TPCH_plan.tar.gz attached below). Keeping this > point in mind we thought that many query will get benefited with > parallel bitmap scan. > > Robert has also pointed out the same thing in his blog related to parallel > query > http://rhaas.blogspot.in/2016/04/postgresql-96-with-parallel-query-vs.html > > Currently Bitmap heap plan look like this : > -- > Bitmap Heap Scan > -> Bitmap Index Scan > > After this patch : > - > Parallel Bitmap Heap Scan > -> Bitmap Index Scan > > As part of this work I have implemented parallel processing in > BitmapHeapScan node. BitmapIndexScan is still non parallel. > > Brief design idea: > --- > #1. Shared TIDBitmap creation and initialization > First worker to see the state as parallel bitmap info as PBM_INITIAL > become leader and set the state to PBM_INPROGRESS All other workers > see the state as PBM_INPROGRESS will wait for leader to complete the > TIDBitmap. > > #2 At this level TIDBitmap is ready and all workers are awake. > > #3. Bitmap processing (Iterate and process the pages). > In this phase each worker will iterate over page and chunk array and > select heap pages one by one. If prefetch is enable then there will be > two iterator. Since multiple worker are iterating over same page and > chunk array we need to have a shared iterator, so we grab a spin lock > and iterate within a lock, so that each worker get and different page > to process. > > Note: For more detail on design, please refer comment of > BitmapHeapNext API in "parallel-bitmap-heap-scan-v1.patch" file. > > Attached patch details: > -- > 1. parallel-bitmap-heap-scan-v1.patch: This is the main patch to make > bitmap heap scan node parallel aware. > > 2. dht-return-dsa-v1.patch: This patch will provide new API, where we > can scan full DHT[1], and get the dsa_pointers (a relative pointer). > The dsa_pointer values can be shared with other processes. We need > this because, after TIDBitmap is created, only one worker will process > whole TIDBitmap and convert it to a page and chunk array. So we need > to store the generic pointer, so that later on each worker can convert > those to their local pointer before start processing. > > My patch depends on following patches. > -- > 1. conditional_variable > https://www.postgresql.org/message-id/CAEepm%3D0zshYwB6wDeJCkrRJeoBM%3DjPYBe%2B-k_VtKRU_8zMLEfA%40mail.gmail.com > > 2. dsa_area > https://www.postgresql.org/message-id/CAEepm%3D024p-MeAsDmG%3DR3%2BtR4EGhuGJs_%2BrjFKF0eRoSTmMJnA%40mail.gmail.com > > 3. Creating a DSA area to provide work space for parallel execution > https://www.postgresql.org/message-id/CAEepm%3D0HmRefi1%2BxDJ99Gj5APHr8Qr05KZtAxrMj8b%2Bay3o6sA%40mail.gmail.com > > 4. Hash table in dynamic shared memory (DHT) [1] > https://www.postgresql.org/message-id/CAEepm%3D0VrMt3s_REDhQv6z1pHL7FETOD7Rt9V2MQ3r-2ss2ccA%40mail.gmail.com > > Order in which patches should be applied: > > 1. conditional_variable > 2. dsa_area > 3. 
Creating a DSA area to provide work space for parallel execution > 4. Hash table in dynamic shared memory. > 5. dht-return-dsa-v1.patch > 6. parallel-bitmap-heap-scan-v1.patch > > Performance Results: > - > Summary : > 1. After this patch, I observed currently 4 queries are getting > significant improvement (Q4, Q6, Q14, Q15). >- Q4, is converting from parallel seqscan to parallel bitmap heap scan. >- Other queries are converted from a regular bitmap heap scan to a > parallel bitmap heap scan. > 2. Benefit is more visible at lower workers (upto 4), after that some > of the queries are selecting ParallelSeqScan over ParallelBitmapScan. > And, I think this is expected, because so far we have only made > BitmapHeap node as parallel whereas ParallelSeqScan is completely > parallel so at higher worker count ParallelSeqScan is better choice. > 3. Detailed result is attached
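For readers who do not want to open the patch, the leader-election handshake described in design point #1 of the quoted proposal boils down to something like the following sketch. It uses the condition-variable API from the dependency list above; apart from the PBM_* state names taken from the description, all names and details here are mine, not the patch's, and locking is simplified.

    typedef enum
    {
        PBM_INITIAL,        /* nobody has started building the bitmap */
        PBM_INPROGRESS,     /* one worker (the leader) is building it */
        PBM_FINISHED        /* shared TIDBitmap is ready to iterate */
    } PBMState;

    typedef struct ParallelBitmapInfo
    {
        slock_t     mutex;
        PBMState    state;
        ConditionVariable cv;
    } ParallelBitmapInfo;

    /* Returns true if this worker must build the shared TIDBitmap. */
    static bool
    pbm_become_leader(ParallelBitmapInfo *pbm)
    {
        bool    leader = false;

        SpinLockAcquire(&pbm->mutex);
        if (pbm->state == PBM_INITIAL)
        {
            pbm->state = PBM_INPROGRESS;
            leader = true;
        }
        SpinLockRelease(&pbm->mutex);

        if (!leader)
        {
            /* wait until the leader has published the bitmap */
            ConditionVariablePrepareToSleep(&pbm->cv);
            while (pbm->state != PBM_FINISHED)
                ConditionVariableSleep(&pbm->cv, 0 /* wait event id */ );
            ConditionVariableCancelSleep();
        }
        return leader;
    }

    /* Called by the leader once the shared TIDBitmap has been built. */
    static void
    pbm_publish_bitmap(ParallelBitmapInfo *pbm)
    {
        SpinLockAcquire(&pbm->mutex);
        pbm->state = PBM_FINISHED;
        SpinLockRelease(&pbm->mutex);
        ConditionVariableBroadcast(&pbm->cv);
    }

After this handshake, every worker (leader included) iterates over the shared page/chunk arrays under a spinlock, as described in the design notes.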
Re: [HACKERS] Parallel bitmap heap scan
On Tue, Oct 18, 2016 at 1:42 AM, Robert Haas wrote: > But what's the impact on performance? Presumably parallel bitmap heap > scan was already slower than the non-parallel version, and that commit > presumably widens the gap. Seems like something to worry about... I have checked the performance in my local machine and there is no impact on the gap. If you see the explain analyze output of all queries which got benefited which parallel bitmap map heap scan, BitmapIndex node is taking very less time compare to BitmapHeap. Actual execution time on head (before efficient hash table patch) BitmapHeapNode BitmapIndexNode Q6 38997 6951 Q14 14516569 Q15 28530 1442 Out of 4 queries, Q4 is converted from parallel seqscan to parallel bitmap scan so no impact. Q14, Q15 time spent in BitmapIndex node is < 5% of time spent in BitmapHeap Node. Q6 it's 20% but I did not see much impact on this in my local machine. However I will take the complete performance reading and post the data on my actual performance machine. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel bitmap heap scan
On Tue, Oct 18, 2016 at 1:45 AM, Andres Freund wrote: > I don't quite understand why the bitmap has to be parallel at all. As > far as I understand your approach as described here, the only thing that > needs to be shared are the iteration arrays. Since they never need to > be resized and such, it seems to make a lot more sense to just add an > API to share those, instead of the whole underlying hash. You are right that we only share iteration arrays. But only point is that each entry of iteration array is just a pointer to hash entry. So either we need to build hash in shared memory (my current approach) or we need to copy each hash element at shared location (I think this is going to be expensive). Let me know if I am missing something.. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel bitmap heap scan
On Wed, Oct 19, 2016 at 12:39 PM, Andres Freund wrote: > Try measuring with something more heavy on bitmap scan time > itself. E.g. > SELECT SUM(l_extendedprice) FROM lineitem WHERE (l_shipdate >= > '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date); > or similar. The tpch queries don't actually spend that much time in the > bitmapscan itself - the parallization of the rest of the query is what > matters... Yeah, I agree. I have tested with this query, with exact filter condition it was taking parallel sequence scan, so I have modified the filter a bit and tested. Tested with all default configuration in my local machine. I think I will generate more such test cases and do detail testing in my performance machine. Explain Analyze results: - On Head: postgres=# explain analyze SELECT SUM(l_extendedprice) FROM lineitem WHERE (l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-03-31'::date); QUERY PLAN --- Aggregate (cost=848805.90..848805.91 rows=1 width=32) (actual time=12440.165..12440.166 rows=1 loops=1) -> Bitmap Heap Scan on lineitem (cost=143372.40..834833.25 rows=5589057 width=8) (actual time=1106.217..11183.722 rows=5678841 loops=1) Recheck Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-03-31'::date)) Rows Removed by Index Recheck: 20678739 Heap Blocks: exact=51196 lossy=528664 -> Bitmap Index Scan on idx_lineitem_shipdate (cost=0.00..141975.13 rows=5589057 width=0) (actual time=1093.376..1093.376 rows=5678841 loops=1) Index Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-03-31'::date)) Planning time: 0.185 ms Execution time: 12440.819 ms (9 rows) After Patch: --- postgres=# explain analyze SELECT SUM(l_extendedprice) FROM lineitem WHERE (l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-03-31'::date); QUERY PLAN -- - Finalize Aggregate (cost=792751.16..792751.17 rows=1 width=32) (actual time=6660.157..6660.157 rows=1 loops=1) -> Gather (cost=792750.94..792751.15 rows=2 width=32) (actual time=6659.378..6660.117 rows=3 loops=1) Workers Planned: 2 Workers Launched: 2 -> Partial Aggregate (cost=791750.94..791750.95 rows=1 width=32) (actual time=6655.941..6655.941 rows=1 loops=3) -> Parallel Bitmap Heap Scan on lineitem (cost=143372.40..785929.00 rows=2328774 width=8) (actual time=1980.797..6204.232 rows=1892947 loops= 3) Recheck Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-03-31'::date)) Rows Removed by Index Recheck: 6930269 Heap Blocks: exact=17090 lossy=176443 -> Bitmap Index Scan on idx_lineitem_shipdate (cost=0.00..141975.13 rows=5589057 width=0) (actual time=1933.454..1933.454 rows=5678841 loops=1) Index Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-03-31'::date)) Planning time: 0.207 ms Execution time: 6669.195 ms (13 rows) Summary: -> With patch overall execution is 2 time faster compared to head. -> Bitmap creation with patch is bit slower compared to head and thats because of DHT vs efficient hash table. I found one defect in v2 patch, that I induced during last rebasing. That is fixed in v3. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com parallel-bitmap-heap-scan-v3.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
Test4: number of clients: 32

Head:
tps = 27587.12 (including connections establishing)
tps = 27588.119611 (excluding connections establishing)

[dilip.kumar@cthulhu bin]$ cat 1000_32_ul.txt
 117762 |
   4031
    614 LWLockNamed   | ProcArrayLock
    379 LWLockNamed   | CLogControlLock
    344 Lock          | transactionid
    183 Lock          | extend
    102 LWLockTranche | buffer_mapping
     71 LWLockTranche | buffer_content
     39 LWLockNamed   | XidGenLock
     25 LWLockTranche | lock_manager
      3 LWLockTranche | wal_insert
      3 LWLockTranche | clog
      2 LWLockNamed   | CheckpointerCommLock
      2 Lock          | tuple

Patch:
tps = 28291.428848 (including connections establishing)
tps = 28291.586435 (excluding connections establishing)

[dilip.kumar@cthulhu bin]$ cat 1000_32_ul.txt
 116596 |
   4041
    757 LWLockNamed   | ProcArrayLock
    407 LWLockNamed   | CLogControlLock
    358 Lock          | transactionid
    183 Lock          | extend
    142 LWLockTranche | buffer_mapping
     77 LWLockTranche | buffer_content
     68 LWLockNamed   | XidGenLock
     35 LWLockTranche | lock_manager
     15 LWLockTranche | wal_insert
      7 LWLockTranche | clog
      7 Lock          | tuple
      4 LWLockNamed   | CheckpointerCommLock
      1 Tuples only is on.

Summary:
- At 96 clients and above we can see CLogControlLock at the top.
- With the patch, contention on CLogControlLock is reduced significantly.

I think this behaviour is the same as what we saw on POWER.

With 300 scale factor:
- Contention on CLogControlLock is significant only at 192 clients (the transactionid lock is still on top), and it is completely removed with the group lock patch.

For 300 scale factor, I am posting data only for the 192 client count (if anyone is interested in other data, I can post it).

Head:
scaling factor: 300
query mode: prepared
number of clients: 192
number of threads: 192
duration: 1800 s
number of transactions actually processed: 65930726
latency average: 5.242 ms
tps = 36621.827041 (including connections establishing)
tps = 36622.064081 (excluding connections establishing)

[dilip.kumar@cthulhu bin]$ cat 300_192_ul.txt
 437848 |
 118966 Lock          | transactionid
  88869 LWLockNamed   | CLogControlLock
  18558 Lock          | tuple
   6183 LWLockTranche | buffer_content
   5664 LWLockTranche | lock_manager
   3995 LWLockNamed   | ProcArrayLock
   3646
   1748 Lock          | extend
   1635 LWLockNamed   | XidGenLock
    401 LWLockTranche | wal_insert
     33 BufferPin     | BufferPin
      5 LWLockTranche | proc
      3 LWLockTranche | buffer_mapping

Patch:
scaling factor: 300
query mode: prepared
number of clients: 192
number of threads: 192
duration: 1800 s
number of transactions actually processed: 82616270
latency average: 4.183 ms
tps = 45894.737813 (including connections establishing)
tps = 45894.995634 (excluding connections establishing)

 120372 Lock          | transactionid
  16346 Lock          | tuple
   7489 LWLockTranche | lock_manager
   4514 LWLockNamed   | ProcArrayLock
   3632
   3310 LWLockNamed   | CLogControlLock
   2287 LWLockNamed   | XidGenLock
   2271 Lock          | extend
    709 LWLockTranche | buffer_content
    490 LWLockTranche | wal_insert
     30 BufferPin     | BufferPin
     10 LWLockTranche | proc
      6 LWLockTranche | buffer_mapping

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
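PS: For anyone skimming the numbers above, the reason the group update patch pushes CLogControlLock down the list is roughly the following pattern (heavily simplified sketch, not the actual patch code; the push/wait/wake helpers and the PGPROC field names are stand-ins):

/*
 * Instead of every backend acquiring CLogControlLock exclusively to set
 * its own transaction status, backends publish their request and queue
 * themselves; the first one becomes the leader, takes the lock once and
 * applies the whole batch, so N status updates cost one lock acquisition.
 */
static void
group_set_transaction_status(PGPROC *proc, TransactionId xid, XidStatus status)
{
    proc->groupClogXid = xid;                 /* stand-in field names */
    proc->groupClogStatus = status;

    if (!push_self_and_check_leader(proc))    /* lock-free CAS on a list head */
    {
        wait_until_leader_wakes_us(proc);     /* follower: just sleep */
        return;
    }

    /* leader: one exclusive acquisition covers every queued member */
    LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
    for (PGPROC *member = first_queued_member(); member != NULL;
         member = next_queued_member(member))
        clog_set_status_bit(member->groupClogXid, member->groupClogStatus);
    LWLockRelease(CLogControlLock);

    wake_all_queued_members();
}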
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
 302317 |
  18836 Lock          | transactionid
  12912 LWLockNamed   | CLogControlLock
   4120 LWLockNamed   | ProcArrayLock
   3662
   1700 Lock          | tuple
   1305 Lock          | extend
   1030 LWLockTranche | buffer_content
    828 LWLockTranche | lock_manager
    730 LWLockNamed   | XidGenLock
    107 LWLockTranche | wal_insert
      4 LWLockTranche | buffer_mapping
      1 Tuples only is on.
      1 LWLockTranche | proc
      1 BufferPin     | BufferPin

Group Lock Patch:
scaling factor: 300
query mode: prepared
number of clients: 96
number of threads: 96
duration: 1800 s
number of transactions actually processed: 61608756
latency average: 2.805 ms
tps = 44385.885080 (including connections establishing)
tps = 44386.297364 (excluding connections establishing)

[dilip.kumar@cthulhu bin]$ cat 300_96_ul.txt
 237842 |
  14379 Lock          | transactionid
   3335 LWLockNamed   | ProcArrayLock
   2850
   1374 LWLockNamed   | CLogControlLock
   1200 Lock          | tuple
    992 Lock          | extend
    717 LWLockNamed   | XidGenLock
    625 LWLockTranche | lock_manager
    259 LWLockTranche | buffer_content
    105 LWLockTranche | wal_insert
      4 LWLockTranche | buffer_mapping
      2 LWLockTranche | proc

Head:
scaling factor: 300
query mode: prepared
number of clients: 192
number of threads: 192
duration: 1800 s
number of transactions actually processed: 65930726
latency average: 5.242 ms
tps = 36621.827041 (including connections establishing)
tps = 36622.064081 (excluding connections establishing)

[dilip.kumar@cthulhu bin]$ cat 300_192_ul.txt
 437848 |
 118966 Lock          | transactionid
  88869 LWLockNamed   | CLogControlLock
  18558 Lock          | tuple
   6183 LWLockTranche | buffer_content
   5664 LWLockTranche | lock_manager
   3995 LWLockNamed   | ProcArrayLock
   3646
   1748 Lock          | extend
   1635 LWLockNamed   | XidGenLock
    401 LWLockTranche | wal_insert
     33 BufferPin     | BufferPin
      5 LWLockTranche | proc
      3 LWLockTranche | buffer_mapping

GroupLock Patch:
scaling factor: 300
query mode: prepared
number of clients: 192
number of threads: 192
duration: 1800 s
number of transactions actually processed: 82616270
latency average: 4.183 ms
tps = 45894.737813 (including connections establishing)
tps = 45894.995634 (excluding connections establishing)

 120372 Lock          | transactionid
  16346 Lock          | tuple
   7489 LWLockTranche | lock_manager
   4514 LWLockNamed   | ProcArrayLock
   3632
   3310 LWLockNamed   | CLogControlLock
   2287 LWLockNamed   | XidGenLock
   2271 Lock          | extend
    709 LWLockTranche | buffer_content
    490 LWLockTranche | wal_insert
     30 BufferPin     | BufferPin
     10 LWLockTranche | proc
      6 LWLockTranche | buffer_mapping

Summary: On the X86 8-socket machine at 300 scale factor, I did not observe significant wait on CLogControlLock up to 96 clients. However, at 192 clients we can see significant wait on CLogControlLock, though still not as bad as what we see on POWER.

> > I've taken some time to build a simple web-based reports from the results
> > collected so far (also included in the git repository), and pushed them
> > here:
> >
> > http://tvondra.bitbucket.org
> >
> > For each of the completed runs, there's a report comparing tps for different
> > client counts with master and the three patches (average tps, median and
> > stddev), and it's possible to download a more thorough text report with wait
> > event stats, comparison of individual runs etc.

I saw your report; I think presenting it this way gives a very clear idea.

> > If you want to cooperate on this, I'm available - i.e. I can help you get
> > the tooling running, customize it etc.

That will be really helpful; then next time I can also present my reports in the same format.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Thu, Oct 20, 2016 at 9:15 PM, Robert Haas wrote:
> So here's my theory. The whole reason why Tomas is having difficulty
> seeing any big effect from these patches is because he's testing on
> x86. When Dilip tests on x86, he doesn't see a big effect either,
> regardless of workload. But when Dilip tests on POWER, which I think
> is where he's mostly been testing, he sees a huge effect, because for
> some reason POWER has major problems with this lock that don't exist
> on x86.

Right. On POWER we can see big contention on CLogControlLock at 300 scale factor even at 96 clients, but on x86 at 300 scale factor there is almost no contention on CLogControlLock. However, at 1000 scale factor we can see significant contention on CLogControlLock on the x86 machine. I want to test on POWER with 1000 scale factor to see whether contention on CLogControlLock becomes much worse there; I will run this test and post the results.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Fri, Oct 21, 2016 at 7:57 AM, Dilip Kumar wrote:
> On Thu, Oct 20, 2016 at 9:03 PM, Tomas Vondra wrote:
>> In the results you've posted on 10/12, you've mentioned a regression with 32
>> clients, where you got 52k tps on master but only 48k tps with the patch (so
>> ~10% difference). I have no idea what scale was used for those tests,
>
> That test was with scale factor 300 on POWER 4 socket machine. I think
> I need to repeat this test with multiple reading to confirm it was
> regression or run to run variation. I will do that soon and post the
> results.

As promised, I have rerun my test (3 times), and I did not see any regression. The median of the 3 runs is the same on head and with the group lock patch. I am posting the results of all three runs anyway. In my earlier report we saw ~48K TPS with the patch, but over multiple runs we get that reading with head as well as with the patch.

Head:

run1:
transaction type:
scaling factor: 300
query mode: prepared
number of clients: 32
number of threads: 32
duration: 1800 s
number of transactions actually processed: 87784836
latency average = 0.656 ms
tps = 48769.327513 (including connections establishing)
tps = 48769.543276 (excluding connections establishing)

run2:
transaction type:
scaling factor: 300
query mode: prepared
number of clients: 32
number of threads: 32
duration: 1800 s
number of transactions actually processed: 91240374
latency average = 0.631 ms
tps = 50689.069717 (including connections establishing)
tps = 50689.263505 (excluding connections establishing)

run3:
transaction type:
scaling factor: 300
query mode: prepared
number of clients: 32
number of threads: 32
duration: 1800 s
number of transactions actually processed: 90966003
latency average = 0.633 ms
tps = 50536.639303 (including connections establishing)
tps = 50536.836924 (excluding connections establishing)

With group lock patch:

run1:
transaction type:
scaling factor: 300
query mode: prepared
number of clients: 32
number of threads: 32
duration: 1800 s
number of transactions actually processed: 87316264
latency average = 0.660 ms
tps = 48509.008040 (including connections establishing)
tps = 48509.194978 (excluding connections establishing)

run2:
transaction type:
scaling factor: 300
query mode: prepared
number of clients: 32
number of threads: 32
duration: 1800 s
number of transactions actually processed: 91950412
latency average = 0.626 ms
tps = 51083.507790 (including connections establishing)
tps = 51083.704489 (excluding connections establishing)

run3:
transaction type:
scaling factor: 300
query mode: prepared
number of clients: 32
number of threads: 32
duration: 1800 s
number of transactions actually processed: 90378462
latency average = 0.637 ms
tps = 50210.225983 (including connections establishing)
tps = 50210.405401 (excluding connections establishing)

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Proposal: scan key push down to heap [WIP]
On Fri, Oct 14, 2016 at 11:54 AM, Andres Freund wrote:
> I don't think it's a good idea to do this under the content lock in any
> case. But luckily I don't think we have to do so at all.
>
> Due to pagemode - which is used by pretty much everything iterating over
> heaps, and definitely seqscans - the key test already happens without
> the content lock held, in heapgettup_pagemode():

Yes, you are right. After this clarification, it's clear that even if we push down the key we are not evaluating it under the buffer content lock.

For now, I have done my further analysis keeping in mind that we only push down 'var op const'. Below are my observations.

#1. If we process the qual in the executor, all temporary memory allocations are made in the "per_tuple_context", which gets reset after each tuple is processed. On the other hand, if we do it in the heap layer, allocations will end up in the per-query context. This restricts us from pushing down any qual that needs to do memory allocation, e.g. on a toastable field. Is there any other way to handle this? (A rough sketch of the executor pattern I mean is at the end of this mail.)

#2. Currently quals are ordered based on cost (see order_qual_clauses), but once we push down some of the quals, those quals will always be executed first. Can this create a problem? Consider the example below:

create or replace function f1(anyelement) returns bool as
$$
begin
  raise notice '%', $1;
  return true;
end;
$$ LANGUAGE plpgsql cost 0.01;

select * from t3 where a>1 and f1(b);

In this case "f1(b)" will always be executed as the first qual (its cost is set very low by the user), hence it will raise a notice for every tuple. But if we push the "a>1" qual down to the heap, then only qualifying tuples will be passed to "f1(b)". Is that a behaviour change? It can also impact performance: when the user has given some function a very low cost, it should be executed first, as it may discard most of the tuples.

Possible solutions:
1. Only push down if all quals have the same cost.
2. Push down quals only up to the first non-pushable qual (this way we maintain the same qual execution order as the existing system).

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
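PS: Re #1, this is roughly the executor pattern I am comparing against (condensed sketch, not exact code; FetchNextTuple is a stand-in for the real scan-next call):

/*
 * The executor evaluates quals per tuple inside the per-tuple memory
 * context.  Anything a qual allocates (e.g. a detoasted value) lives only
 * until ResetExprContext() runs for the next tuple, so nothing accumulates
 * for the life of the query.
 */
for (;;)
{
    TupleTableSlot *slot = FetchNextTuple(scanstate);   /* stand-in */

    if (TupIsNull(slot))
        break;

    ResetExprContext(econtext);           /* frees the previous tuple's allocations */
    econtext->ecxt_scantuple = slot;

    if (ExecQual(qual, econtext, false))  /* runs in ecxt_per_tuple_memory */
        return slot;                      /* tuple passes the filter */
}

/*
 * By contrast, HeapKeyTest() inside the heap layer has no per-tuple
 * context to reset; whatever the key comparison allocates goes into the
 * currently active (effectively per-query) context, which is the concern
 * for toastable columns above.
 */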
Re: [HACKERS] Proposal: scan key push down to heap [WIP]
On Tue, Oct 25, 2016 at 10:35 PM, Alvaro Herrera wrote:
> We don't promise order of execution (which is why we can afford to sort
> on cost), but I think it makes sense to keep a rough ordering based on
> cost, and not let this push-down affect those ordering decisions too
> much.

Ok, thanks for the clarification.

> I think it's fine to push-down quals that are within the same order of
> magnitude of the cost of a pushable condition, while keeping any other
> much-costlier qual (which could otherwise be pushed down) together with
> non-pushable conditions; this would sort-of guarantee within-order-of-
> magnitude order of execution of quals.
>
> Hopefully an example clarifies what I mean. Let's suppose you have
> three quals, where qual2 is non-pushable but 1 and 3 are. cost(1)=10,
> cost(2)=11, cost(3)=12. Currently, they are executed in that order.
>
> If you were to compare costs in the straightforward way, you would push
> only 1 (because 3 is costlier than 2 which is not push-down-able). With
> fuzzy comparisons, you'd push both 1 and 3, because cost of 3 is close
> enough to that of qual 2.
>
> But if cost(3)=100 then only push qual 1, and let qual 3 be evaluated
> together with 2 outside the scan node.

After putting more thought into this, IMHO it need not be so complicated. Currently we are only talking about pushing down "var op const", and the cost of all such operator functions is very low and fixed at "1". Do we really need to take care of a user-defined function that is declared with a very low cost? While building index conditions we also don't take care of such things: index conditions are always evaluated first, and only then is the filter applied. Am I missing something?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
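PS: For reference, this is how I read the fuzzy rule from the quoted mail, as a small self-contained sketch (illustrative only; the Qual struct, the helper and the order-of-magnitude test are stand-ins, not planner code):

#include <math.h>
#include <stdbool.h>

typedef struct Qual
{
    double cost;
    bool   pushable;     /* e.g. a simple "var op const" */
    bool   push_down;    /* decision computed below */
} Qual;

/* costs are assumed > 0 */
static bool
same_order_of_magnitude(double a, double b)
{
    return floor(log10(a)) <= floor(log10(b));
}

/* quals[] is assumed already sorted by increasing cost */
static void
choose_pushdown_quals(Qual *quals, int nquals)
{
    double cheapest_nonpushable = -1.0;

    for (int i = 0; i < nquals; i++)
    {
        if (!quals[i].pushable && cheapest_nonpushable < 0.0)
            cheapest_nonpushable = quals[i].cost;
    }

    for (int i = 0; i < nquals; i++)
    {
        if (!quals[i].pushable)
            quals[i].push_down = false;
        else if (cheapest_nonpushable < 0.0)
            quals[i].push_down = true;      /* every qual is pushable */
        else
            quals[i].push_down =
                same_order_of_magnitude(quals[i].cost, cheapest_nonpushable);
    }
}

With the quoted example (cost(1)=10, cost(2)=11 non-pushable, cost(3)=12) this pushes quals 1 and 3; with cost(3)=100 it pushes only qual 1, exactly as described above.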
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Thu, Oct 27, 2016 at 5:14 PM, Amit Kapila wrote:
>>> Thanks Tomas and Dilip for doing detailed performance tests for this
>>> patch. I would like to summarise the performance testing results.
>>>
>>> 1. With update intensive workload, we are seeing gains from 23%~192%
>>> at client count >=64 with group_update patch [1].

this is with unlogged table

>>> 2. With tpc-b pgbench workload (at 1000 scale factor), we are seeing
>>> gains from 12% to ~70% at client count >=64 [2]. Tests are done on
>>> 8-socket intel m/c.

this is with synchronous_commit=off

>>> 3. With pgbench workload (both simple-update and tpc-b at 300 scale
>>> factor), we are seeing gain 10% to > 50% at client count >=64 [3].
>>> Tests are done on 8-socket intel m/c.

this is with synchronous_commit=off

>>> 4. To see why the patch only helps at higher client count, we have
>>> done wait event testing for various workloads [4], [5] and the results
>>> indicate that at lower clients, the waits are mostly due to
>>> transactionid or clientread. At client-counts where contention due to
>>> CLOGControlLock is significant, this patch helps a lot to reduce that
>>> contention. These tests are done on 8-socket intel m/c and
>>> 4-socket power m/c

these both are with synchronous_commit=off + unlogged table

>>> 5. With pgbench workload (unlogged tables), we are seeing gains from
>>> 15% to > 300% at client count >=72 [6].
>>>
>>
>> It's not entirely clear which of the above tests were done on unlogged
>> tables, and I don't see that in the referenced e-mails. That would be an
>> interesting thing to mention in the summary, I think.
>>
>
> One thing is clear that all results are on either
> synchronous_commit=off or on unlogged tables. I think Dilip can
> answer better which of those are on unlogged and which on
> synchronous_commit=off.

I have mentioned this above, under each of your test points.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Proposal: scan key push down to heap [WIP]
On Wed, Oct 26, 2016 at 12:01 PM, Andres Freund wrote:
> The gains are quite noticeable in some cases. So if we can make it work
> without noticeable downsides...
>
> What I'm worried about though is that this, afaics, will quite
> noticeably *increase* total cost in cases with a noticeable number of
> columns and a not that selective qual. The reason for that being that
> HeapKeyTest() uses heap_getattr(), whereas upper layers use
> slot_getattr(). The latter "caches" repeated deforms, the former
> doesn't... That'll lead to deforming being essentially done twice, and
> it's quite often already a major cost of query processing.

What about putting a slot reference inside HeapScanDesc? I know it would make the heap layer use an executor structure, but it's just a thought.

I have quickly hacked it this way: we keep a slot reference in HeapScanDesc and use slot_getattr directly inside HeapKeyTest (only if we have a valid slot, otherwise fall back to heap_getattr), and measured the worst-case performance you mention above.

My test: a 21-column table with a varchar column near the beginning, quals on the last few columns, and varying selectivity.

postgres=# \d test
      Table "public.test"
 Column |       Type        | Modifiers
--------+-------------------+-----------
 f1     | integer           |
 f2     | character varying |
 f3     | integer           |
 f4     | integer           |
 f5     | integer           |
 f6     | integer           |
 f7     | integer           |
 f8     | integer           |
 f9     | integer           |
 f10    | integer           |
 f11    | integer           |
 f12    | integer           |
 f13    | integer           |
 f14    | integer           |
 f15    | integer           |
 f16    | integer           |
 f17    | integer           |
 f18    | integer           |
 f19    | integer           |
 f20    | integer           |
 f21    | integer           |

tuple count: 10 Million

explain analyze select * from test where f21< $1 and f20 < $1 and f19 < $1 and f15 < $1 and f10 < $1;
($1 varies from 1 Million to 10 Million.)

Target code bases:
1. Head
2. heap_scankey_pushdown_v1
3. My hack of keeping a slot reference in HeapScanDesc (v1+use_slot_in_HeapKeyTest)

Result:
Selectivity   Head   scan_key_pushdown_v1   v1+use_slot_in_HeapKeyTest
0.1           3880   2980                   2747
0.2           4041   3187                   2914
0.5           5051   4921                   3626
0.8           5378   7296                   3879
1.0           6161   8525                   4575

A performance graph is attached to the mail.

Observations:
1. heap_scankey_pushdown_v1 starts degrading at very high selectivity (this behaviour is only visible if the table has 20 or more columns; I also tested with 10 columns and did not see any regression in v1).
2. (v1+use_slot_in_HeapKeyTest) is always the winner, even at very high selectivity.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
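PS: The hack is essentially this (rough sketch, simplified from the real HeapKeyTest macro; the extra slot argument is the new, experimental part):

/*
 * HeapKeyTest() normally fetches attributes with heap_getattr(), which
 * re-deforms the tuple for every key.  If the scan descriptor carries the
 * executor's slot, we can use slot_getattr() instead, which deforms each
 * tuple only once and caches the result in tts_values[]/tts_isnull[].
 */
static bool
heap_key_test(HeapTuple tuple, TupleDesc tupdesc,
              TupleTableSlot *slot,          /* new: may be NULL */
              int nkeys, ScanKey keys)
{
    for (int i = 0; i < nkeys; i++)
    {
        ScanKey key = &keys[i];
        Datum   value;
        bool    isnull;

        if (slot != NULL)
            value = slot_getattr(slot, key->sk_attno, &isnull);   /* cached deform */
        else
            value = heap_getattr(tuple, key->sk_attno, tupdesc, &isnull);

        if (isnull || (key->sk_flags & SK_ISNULL))
            return false;

        if (!DatumGetBool(FunctionCall2Coll(&key->sk_func,
                                            key->sk_collation,
                                            value, key->sk_argument)))
            return false;
    }
    return true;
}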
[HACKERS] Re: [BUGS] BUG #14350: VIEW with INSTEAD OF INSERT TRIGGER and COPY. Missing feature or working as designed.
On Tue, Nov 1, 2016 at 12:40 PM, Haribabu Kommi wrote:
> COPY command is treated as an UTILITY command, During the query
> processing, the rules are not applied on the COPY command, and in the
> execution of COPY command, it just inserts the data into the heap and
> indexes with direct calls.
>
> I feel supporting the COPY command for the case-2 is little bit complex
> and may not be that much worth.

I agree, it will be complex.

> Attached is the updated patch with doc changes.

Now that we are adding support for INSTEAD OF triggers in the COPY FROM command, wouldn't it be good to provide a HINT to the user when an INSTEAD OF trigger does not exist, like we do in the INSERT case?

INSERT case:
postgres=# insert into ttt_v values(7);
2016-11-01 14:31:39.845 IST [72343] ERROR:  cannot insert into view "ttt_v"
2016-11-01 14:31:39.845 IST [72343] DETAIL:  Views that do not select from a single table or view are not automatically updatable.
2016-11-01 14:31:39.845 IST [72343] HINT:  To enable inserting into the view, provide an INSTEAD OF INSERT trigger or an unconditional ON INSERT DO INSTEAD rule.

COPY case:
postgres=# COPY ttt_v FROM stdin;
2016-11-01 14:31:52.235 IST [72343] ERROR:  cannot copy to view "ttt_v"
2016-11-01 14:31:52.235 IST [72343] STATEMENT:  COPY ttt_v FROM stdin;

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
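PS: Roughly, what I have in mind in copy.c is something like this (sketch only; the helper and the exact wording/condition would follow whatever the patch does):

/*
 * Sketch: when COPY targets a view that has no INSTEAD OF INSERT trigger,
 * give the same kind of hint the INSERT path gives.  Simplified; not the
 * actual patch code.
 */
if (rel->rd_rel->relkind == RELKIND_VIEW &&
    !view_has_instead_of_insert_trigger(rel))       /* hypothetical helper */
    ereport(ERROR,
            (errcode(ERRCODE_WRONG_OBJECT_TYPE),
             errmsg("cannot copy to view \"%s\"",
                    RelationGetRelationName(rel)),
             errhint("To enable copying to the view, provide an INSTEAD OF INSERT trigger.")));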