Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Sat, Feb 27, 2016 at 5:14 AM, Andres Freund wrote:
> > > > >> ./pgbench -j$ -c$ -T300 -M prepared -S postgres
> > > > >>
> > > > >> Client    Base      Patch
> > > > >> 1         17169     16454
> > > > >> 8         108547    105559
> > > > >> 32        241619    262818
> > > > >> 64        206868    233606
> > > > >> 128       137084    217013
> > >
> > > So, there's a small regression on low client counts. That's worth
> > > addressing.
> >
> > Interesting. I'll try to reproduce it.
>
> Any progress here?

On a multi-socket machine with 8 sockets and 64 cores I have seen a larger regression than in my previous run on POWER8 with 2 sockets. So far I have tested a read-only workload with 5-minute runs; when I get time I will run for longer and confirm again.

Shared Buffer = 8GB
Scale Factor = 300
./pgbench -j$ -c$ -T300 -M prepared -S postgres

Client    Base      Patch
1         7057      5230
2         10043     9573
4         20140     18188

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Mon, Feb 29, 2016 at 8:26 AM, Amit Kapila wrote:
> Have you tried by reverting the commits 6150a1b0 and ac1d794, which I
> think affect read-only performance and sometimes create variation in TPS
> across different runs? Here the second might have less impact, but the first one
> could impact performance. Is it possible for you to get perf data with and
> without the patch and share it with others?

I only reverted the ac1d794 commit in my test; in my next run I will revert 6150a1b0 as well and test.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Wed, Feb 10, 2016 at 7:06 PM, Dilip Kumar wrote:
> I have tested the relation extension patch from various aspects; the performance results and other statistical data are explained in this mail.

Test 1: Identify whether the heavyweight lock is the problem or the actual context switches are.

1. I converted the RelationExtensionLock to a simple LWLock and tested with a single relation. Results are below. This is a simple script that COPYs 1 record of size 4 bytes per transaction.

Client    Base    LWLock    multi_extend by 50 blocks
1         155     156       160
2         282     276       284
4         248     319       428
8         161     267       675
16        143     241       889

LWLock performance is better than base; the obvious reason may be that we save some instructions by converting to an LWLock, but it doesn't scale any better than the base code.

Test 2: Identify whether the improvement in the multi-extend case is because of avoiding context switches or some other factor, like reusing blocks between backends by putting them in the FSM.

1. Test by just extending multiple blocks and reusing them in the extending backend itself (don't put them in the FSM). Insert 1K records; data doesn't fit in shared buffers; 512MB shared buffers.

Client    Base    Extend 800 blocks, self use    Extend 1000 blocks
1         117     131                            118
2         111     203                            140
3         51      242                            178
4         51      231                            190
5         52      259                            224
6         51      263                            243
7         43      253                            254
8         43      240                            254
16        40      190                            243

We can see the same improvement when the backend uses the blocks itself, which shows that sharing the blocks between backends was not the win; avoiding the context switches was the major win.

2. Measured the number of ProcSleep calls during the run. This is the same simple script that COPYs 1 record of size 4 bytes per transaction.

          BASE CODE                       PATCH MULTI EXTEND
Client    Base TPS    ProcSleep Count     Extend by 10 blocks TPS    ProcSleep Count
2         280         457,506             311                        62,641
3         235         1,098,701           358                        141,624
4         216         1,155,735           368                        188,173

What we can see in the above test is that with the base code performance degrades after 2 threads while the ProcSleep count increases enormously. Compared to that, with the patch extending 10 blocks at a time, the ProcSleep count drops to ~1/8 and performance keeps scaling.

The ProcSleep test for the insert case, where data doesn't fit in shared buffers and a big record of 1024 bytes is inserted, is currently running; once I get the data I will post it.

Posting the rebased version and moving it to the next CF.

Open points:
1. After getting the lock, recheck the FSM in case some other backend has already added extra blocks, and reuse them.
2. Is it a good idea to have a user-level parameter for extend_by_block, or can we try some approach to internally identify how many blocks are needed and add only that many? That would make it more flexible.
-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c index 86b9ae1..78e81dd 100644 --- a/src/backend/access/common/reloptions.c +++ b/src/backend/access/common/reloptions.c @@ -268,6 +268,16 @@ static relopt_int intRelOpts[] = #endif }, + { + { + "extend_by_blocks", + "Number of blocks to be added to relation in every extend call", + RELOPT_KIND_HEAP, + AccessExclusiveLock + }, + 1, 1, 1 + }, + /* list terminator */ {{NULL}} }; @@ -1291,7 +1301,9 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind) {"autovacuum_analyze_scale_factor", RELOPT_TYPE_REAL, offsetof(StdRdOptions, autovacuum) +offsetof(AutoVacOpts, analyze_scale_factor)}, {"user_catalog_table", RELOPT_TYPE_BOOL, - offsetof(StdRdOptions, user_catalog_table)} + offsetof(StdRdOptions, user_catalog_table)}, + {"extend_by_blocks", RELOPT_TYPE_INT, + offsetof(StdRdOptions, extend_by_blocks)} }; options = parseRelOptions(reloptions, validate, kind, &numoptions); diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c index 8140418..eb3ce17 100644 --- a/src/backend/access/heap/hio.c +++ b/src/backend/access/heap/hio.c @@ -238,6 +238,7 @@ RelationGetBufferForTuple(Relation relation, Size len, BlockNumber targetBlock, otherBlock; bool needLock; + int extraBlocks; len = MAXALIGN(len); /* be conservativ
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Mon, Feb 29, 2016 at 5:18 PM, Alexander Korotkov <a.korot...@postgrespro.ru> wrote:
> I didn't reproduce the regression. I had access to a multicore machine but
> didn't see either a regression at low client counts or improvements at high client counts.
> In the attached patch the spinlock delay was exposed in s_lock.h and used
> in LockBufHdr().
> Dilip, could you try this version of the patch? Could you also run perf or
> another profiler in the case of regression? It would be nice to compare
> profiles with and without the patch. We could probably find the cause of the
> regression.

OK, I will test it sometime this week.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Tue, Mar 1, 2016 at 10:19 AM, Dilip Kumar wrote:
> OK, I will test it sometime this week.

I have tested this patch on my laptop, and there I did not see any regression at 1 client. Shared buffers 10GB, 5-minute runs with pgbench, read-only test:

        Base     Patch
run1    22187    24334
run2    26288    27440
run3    26306    27411

Maybe in a day or two I will test it on the same machine where I reported the regression, and if the regression is there I will check the instruction counts using Callgrind.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Tue, Mar 1, 2016 at 4:36 PM, Amit Kapila wrote:
> One thing that is slightly unclear is that whether there is any overhead
> due to buffer eviction especially when the buffer to be evicted is already
> dirty and needs XLogFlush(). One reason why it might not hurt is that by
> the time we tries to evict the buffer, corresponding WAL is already flushed
> by WALWriter or other possibility could be that even if it is getting done
> during buffer eviction, the impact for same is much lesser. Can we try to
> measure the number of flush calls which happen during buffer eviction?

Good idea, I will do this test and post the results.

> > Proc Sleep test for Insert test when data don't fits in shared buffer and
> > inserting big record of 1024 bytes, is currently running
> > once I get the data will post the same.
>
> Okay. However, I wonder if the performance data for the case when data
> doesn't fit into shared buffer also shows similar trend, then it might be
> worth to try by doing extend w.r.t load in system. I mean to say we can
> batch the extension requests (as we have done in ProcArrayGroupClearXid)
> and extend accordingly, if that works out then the benefit could be that we
> don't need any configurable knob for the same.

1. One option can be, as you suggested, something like ProcArrayGroupClearXid, with some modification: if we wait for the requests and extend with respect to them, we may again face the context-switch problem, so maybe we should extend by some multiple of the requests. (But we need to decide whether to give a block directly to the requester, add it into the FSM, or do both, i.e. give at least one block to the requester and put some multiple into the FSM.)

2. Another option is to analyse the current load on extension and speed up or slow down the extension pace accordingly. A simple algorithm could look like this:

    If (GotBlockFromFSM())
        success++            /* we got the block from the FSM */
    Else
        failure++            /* did not get a block from the FSM; had to extend myself */

    Now, while extending --

    Speed up:
    If (failure - success > Threshold)    /* Threshold can be some number, assume 10 */
    {
        /* Many failures mean the load is high, so increase ExtendByBlock. */
        ExtendByBlock += failure - success;
        /* Reset, so that we measure the latest trend. */
        failure = success = 0;
    }

    Slow down:
    /*
     * It is possible that the demand for blocks has dropped but ExtendByBlock
     * is still high, so analyse the statistics and slow down the extension pace.
     */
    If (success - failure > Threshold)
    {
        /*
         * We cannot reduce it by a big number, because more requests may be
         * getting satisfied precisely because this is the right amount; so
         * gradually decrease the pace and re-analyse the statistics next time.
         */
        ExtendByBlock--;
        failure = success = 0;
    }

Any suggestions are welcome.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
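For illustration, a rough C sketch of that second, adaptive heuristic might look like the following. This is only an illustration of the idea above, not code from any posted patch; the names extend_by_blocks, fsm_success, fsm_failure and the threshold of 10 are invented for the example.

    #define EXTEND_ADJUST_THRESHOLD 10

    static int extend_by_blocks = 1;   /* how many blocks to add per extension */
    static int fsm_success = 0;        /* times a block was found in the FSM */
    static int fsm_failure = 0;        /* times we had to extend ourselves */

    static void
    adjust_extend_by_blocks(void)
    {
        if (fsm_failure - fsm_success > EXTEND_ADJUST_THRESHOLD)
        {
            /* Load is high: many callers missed the FSM, so extend more next time. */
            extend_by_blocks += fsm_failure - fsm_success;
            fsm_failure = fsm_success = 0;
        }
        else if (fsm_success - fsm_failure > EXTEND_ADJUST_THRESHOLD)
        {
            /* Demand has dropped: back off gradually rather than all at once. */
            if (extend_by_blocks > 1)
                extend_by_blocks--;
            fsm_failure = fsm_success = 0;
        }
    }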
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Wed, Mar 2, 2016 at 10:39 AM, Andres Freund wrote:
> Sounds like a ppc vs. x86 issue. The regression was on the former, right?

Well, of the regressions I reported the last two times, one was on x86 and the other was on PPC. Copied from the older threads:

On PPC:
>> ./pgbench -j$ -c$ -T300 -M prepared -S postgres
>>
>> Client    Base      Patch
>> 1         17169     16454
>> 8         108547    105559
>> 32        241619    262818
>> 64        206868    233606
>> 128       137084    217013

On x86:
>> Shared Buffer = 8GB
>> Scale Factor = 300
>> ./pgbench -j$ -c$ -T300 -M prepared -S postgres
>>
>> Client    Base     Patch
>> 1         7057     5230
>> 2         10043    9573
>> 4         20140    18188

And this latest result (no regression) is on x86, but on my local machine. I did not look closely at what this new version of the patch does differently, so I will test this version on the other machines as well and see the results.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Wed, Mar 2, 2016 at 10:31 AM, Dilip Kumar wrote:
> 1. One option can be as you suggested like ProcArrayGroupClearXid, With
> some modification, because when we wait for the request and extend w.r.t
> that, may be again we face the Context Switch problem, So may be we can
> extend in some multiple of the Request.
> (But we need to take care whether to give block directly to requester or
> add it into FSM or do both i.e. give at-least one block to requester and
> put some multiple in FSM)
>
> 2. Other option can be that we analyse the current Load on the extend and
> then speed up or slow down the extending speed.
> Simple algorithm can look like this

I have tried the group-extend approach:

1. We convert the extension lock to a try-lock; if we get the lock, we extend by one block.
2. If we don't get the lock, we use the group-leader concept where only one process extends for all. A slight change from ProcArrayGroupClearXid is that, besides satisfying the requesting backends, we add some extra blocks to the FSM, say GroupSize*10.
3. So we cannot measure the exact load, but we still have a factor, the group size, that tells us the contention level, and we extend by a multiple of that.

Performance Analysis:
Performance is scaling with this approach. It is slightly lower than the previous patch, where we directly pass an extend_by_block parameter and extend by multiple blocks; I think that's expected, because in the group-extend case extra blocks are added only after contention actually happens on the lock, whereas the former patch extended extra blocks optimistically.

Test1 (COPY):
./psql -d postgres -c "COPY (select g.i::text FROM generate_series(1, 1) g(i)) TO '/tmp/copybinary' WITH BINARY";
echo COPY data from '/tmp/copybinary' WITH BINARY; > copy_script
./pgbench -f copy_script -T 300 -c$ -j$ postgres

Shared Buffer: 8GB    max_wal_size: 10GB
Storage: Magnetic Disk    WAL: SSD

Client    Base    multi_extend by 20 pages    group_extend patch (groupsize*10)
1         105     157                         149
2         217     255                         282
4         210     494                         452
8         166     702                         629
16        145     563                         561
32        124     480                         566

Test2 (INSERT):
./psql -d postgres -c "create table test_data(a int, b text)"
./psql -d postgres -c "insert into test_data values(generate_series(1,1000),repeat('x', 1024))"
./psql -d postgres -c "create table data (a int, b text)"
echo "insert into data select * from test_data;" >> insert_script

shared_buffers=512GB    max_wal_size=20GB    checkpoint_timeout=10min
./pgbench -c $ -j $ -f insert_script -T 300 postgres

Client    Base    Multi-extend by 1000    group extend (group*10)    group extend (group*100)
1         117     118                     125                        122
2         111     140                     161                        159
4         51      190                     141                        187
8         43      254                     148                        172
16        40      243                     150                        173

* (group*10) -> means that inside the code the group leader sees how many members are in the group wanting blocks; we satisfy the request of every member and also put extra blocks into the FSM. The number of extra blocks to extend = (group*10), where 10 is just some constant.

Summary:
1. With the group-extend patch there is no configuration knob for how many blocks to extend; that is decided by the current load on the system (contention on the extension lock).
2. With a small multiplier, i.e. 10, we get a fairly good improvement compared to the base code, but when the load is high, e.g. a record size of 1KB, increasing the multiplier gives better results.

* Note: Currently this is a POC patch. It has only one group-extend list, so it can handle group extension for only one relation at present.
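To make the group-leader mechanics a bit more concrete, here is a rough C sketch of how a backend could join such a group, in the style of ProcArrayGroupClearXid. This is purely illustrative and not taken from the attached POC patch: the PGPROC field extendGroupNext and the function name JoinRelExtendGroup are invented for the example.

    /*
     * Push ourselves onto the head of a lock-free group list.  The backend
     * that finds the list empty becomes the leader and performs the relation
     * extension for every member (plus GroupSize * 10 extra blocks for the FSM).
     */
    static bool
    JoinRelExtendGroup(PGPROC *proc, pg_atomic_uint32 *groupFirst)
    {
        uint32 nextidx = pg_atomic_read_u32(groupFirst);

        for (;;)
        {
            /* Remember the current head as our successor (hypothetical field). */
            pg_atomic_write_u32(&proc->extendGroupNext, nextidx);

            /*
             * Try to install ourselves as the new head; on failure nextidx is
             * refreshed with the current head and we retry.
             */
            if (pg_atomic_compare_exchange_u32(groupFirst, &nextidx,
                                               (uint32) proc->pgprocno))
                break;
        }

        /* If the list was empty before we attached, we are the group leader. */
        return (nextidx == INVALID_PGPROCNO);
    }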
-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c index 8140418..adb82ba 100644 --- a/src/backend/access/heap/hio.c +++ b/src/backend/access/heap/hio.c @@ -23,6 +23,7 @@ #include "storage/freespace.h" #include "storage/lmgr.h" #include "storage/smgr.h" +#include "storage/proc.h" /* @@ -168,6 +169,160 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2, } } + +static Bloc
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Wed, Mar 2, 2016 at 11:05 AM, Dilip Kumar wrote:
> And this latest result (no regression) is on X86 but on my local machine.
>
> I did not exactly saw what this new version of patch is doing different,
> so I will test this version in other machines also and see the results.

I tested this on PPC again, this time in varying order (sometimes patch first and then base first). I tested with the latest patch, *pinunpin-cas-2.patch*, on POWER8.

Shared Buffer = 8GB
./pgbench -j$ -c$ -T300 -M prepared -S postgres

BASE
Clients    run1     run2     run3
1          21200    18754    20537
2          40331    39520    38746

Patch
Clients    run1     run2     run3
1          20225    19806    19778
2          39830    41898    36620

I think we cannot see any regression here. (If I take the median it may look slightly lower with the patch, so I am posting all 3 readings.)

Note: only the ac1d794 commit was reverted in my test.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Tue, Mar 8, 2016 at 8:34 PM, Amit Kapila wrote:
> > >> Hmm. Can we drive this off of the heavyweight lock manager's idea of
> > >> how big the relation extension lock wait queue is, instead of adding
> > >> more stuff to PGPROC?
> >
> > One idea to make it work without adding additional stuff in PGPROC is that
> > after acquiring relation extension lock, check if there is any available
> > block in fsm, if it founds any block, then release the lock and proceed,
> > else extend the relation by one block and then check lock's wait queue size
> > or number of lock requests (nRequested) and extend the relation further in
> > proportion to wait queue size and then release the lock and proceed. Here,
> > I think we can check for wait queue size even before extending the relation
> > by one block.

I have come up with this patch. If this approach looks fine then I will prepare the final patch (more comments, indentation, and some code improvements) and do some long-run testing (the current results are from 5-minute runs).

The idea is the same as what Robert and Amit suggested upthread:

    /*
     * First try the lock in no-wait mode; if we get it, just extend one block.
     * This gives two benefits: 1. single-thread performance is not hurt by
     * checking lock waiters and so on, and 2. checking the waiters in the else
     * branch gives time for more waiters to collect, so we get a better
     * estimate of the contention.
     */
    TryRelExtLock()
    {
        extend one block
    }
    else
    {
        RelExtLock();
        if (recheck the FSM and somebody else has already added blocks for me)
            don't extend any block, just reuse
        else
        {
            /* we have to extend blocks */
            waiters = lock->nRequested;
            /* add extra blocks to the FSM */
            extraBlocks = waiters * 20;
        }
    }

The results look like this:

Test1 (COPY):
./psql -d postgres -c "COPY (select g.i::text FROM generate_series(1, 1) g(i)) TO '/tmp/copybinary' WITH BINARY";
echo COPY data from '/tmp/copybinary' WITH BINARY; > copy_script
./pgbench -f copy_script -T 300 -c$ -j$ postgres

Shared Buffer: 8GB    max_wal_size: 10GB
Storage: Magnetic Disk    WAL: SSD

Client    Base    multi_extend by 20 pages    lock_waiter patch (waiter*20)
1         105     157                         148
2         217     255                         252
4         210     494                         442
8         166     702                         645
16        145     563                         773
32        124     480                         957

Note: at 1 thread there should not be any improvement, so this is probably run-to-run variance.

Test2 (INSERT):
./psql -d postgres -c "create table test_data(a int, b text)"
./psql -d postgres -c "insert into test_data values(generate_series(1,1000),repeat('x', 1024))"
./psql -d postgres -c "create table data (a int, b text)"
echo "insert into data select * from test_data;" >> insert_script

shared_buffers=512GB    max_wal_size=20GB    checkpoint_timeout=10min
./pgbench -c $ -j $ -f insert_script -T 300 postgres

Client    Base    Multi-extend by 1000    lock_waiter patch (waiter*20)
1         117     118                     117
2         111     140                     132
4         51      190                     134
8         43      254                     148
16        40      243                     206
32        -       -                       264

* (waiter*20) -> the first process that gets the lock finds the number of lock waiters and adds waiter*20 extra blocks.

In the next run I will go beyond 32 clients as well; as we can see, it is still increasing even at 32 clients. So clearly, when it sees more contention it adapts and adds more blocks.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

multi_extend_v5.patch
Description: Binary data

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Relation extension scalability
On Wed, Mar 9, 2016 at 1:39 AM, Robert Haas wrote: > LockWaiterCount() bravely accesses a shared memory data structure that > is mutable with no locking at all. That might actually be safe for > our purposes, but I think it would be better to take the partition > lock in shared mode if it doesn't cost too much performance. If > that's too expensive, then it should at least have a comment > explaining (1) that it is doing this without the lock and (2) why > that's safe (sketch: the LOCK can't go away because we hold it, and > nRequested could change but we don't mind a stale value, and a 32-bit > read won't be torn). > With LWLock also performance are equally good so added the lock. > > A few of the other functions in here also lack comments, and perhaps > should have them. > > The changes to RelationGetBufferForTuple need a visit from the > refactoring police. Look at the way you are calling > RelationAddOneBlock. The logic looks about like this: > > if (needLock) > { > if (trylock relation for extension) > RelationAddOneBlock(); > else > { > lock relation for extension; > if (use fsm) > { > complicated; > } > else > RelationAddOneBlock(); > } > else > RelationAddOneBlock(); > > So basically you want to do the RelationAddOneBlock() thing if > !needLock || !use_fsm || can't trylock. See if you can rearrange the > code so that there's only one fallthrough call to > RelationAddOneBlock() instead of three separate ones. > Actually in every case we need one blocks, So I have re factored it and RelationAddOneBlock is now out of any condition. > > Also, consider moving the code that adds multiple blocks at a time to > its own function instead of including it all in line. > Done Attaching a latest patch. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c index 8140418..eb4ee0c 100644 --- a/src/backend/access/heap/hio.c +++ b/src/backend/access/heap/hio.c @@ -169,6 +169,86 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2, } /* + * RelationAddOneBlock + * + * Extend relation by one block and lock the buffer + */ +static Buffer +RelationAddOneBlock(Relation relation, Buffer otherBuffer, BulkInsertState bistate) +{ + Buffer buffer; + /* + * XXX This does an lseek - rather expensive - but at the moment it is the + * only way to accurately determine how many blocks are in a relation. Is + * it worth keeping an accurate file length in shared memory someplace, + * rather than relying on the kernel to do it for us? + */ + buffer = ReadBufferBI(relation, P_NEW, bistate); + + /* + * We can be certain that locking the otherBuffer first is OK, since + * it must have a lower page number. + */ + if (otherBuffer != InvalidBuffer) + LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE); + + /* + * Now acquire lock on the new page. + */ + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + return buffer; +} +/* + * RelationAddExtraBlocks + * + * Extend extra blocks for the relations to avoid the future contention + * on the relation extension lock. + */ + +static void +RelationAddExtraBlocks(Relation relation, BulkInsertState bistate) +{ + Page page; + Size freespace; + BlockNumber blockNum; + int extraBlocks = 0; + int lockWaiters = 0; + Buffer buffer; + + /* + * For calculating number of extra blocks to extend, find the level + * of contention on this lock, by getting the requester of this lock + * and add extra blocks in multiple of waiters. 
+ */ + lockWaiters = RelationExtensionLockWaiter(relation); + + extraBlocks = lockWaiters * 20; + + while (extraBlocks--) + { + /* + * Here we are adding extra blocks to the relation after + * adding each block update the information in FSM so that + * other backend running parallel can find the block. + */ + buffer = ReadBufferBI(relation, P_NEW, bistate); + + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + page = BufferGetPage(buffer); + PageInit(page, BufferGetPageSize(buffer), 0); + freespace = PageGetHeapFreeSpace(page); + MarkBufferDirty(buffer); + blockNum = BufferGetBlockNumber(buffer); + + UnlockReleaseBuffer(buffer); + + RecordPageWithFreeSpace(relation, blockNum, freespace); + } +} + +/* * RelationGetBufferForTuple * * Returns pinned and exclusive-locked buffer of a page in given relation @@ -233,10 +313,11 @@ RelationGetBufferForTuple(Relation relation, Size len, bool use_fsm = !(options & HEAP_INSERT_SKIP_FSM); Buffer buffer = InvalidBuffer; Page page; - Size pageFreeSpace, -saveFreeSpace; + Size pageFreeSpace = 0, +saveFreeSpace = 0; BlockNumber targetBlock, -otherBlock; +otherBlock, +lastValidBlock = InvalidBlockNumber; bool needLock; len
Re: [HACKERS] Relation extension scalability
On Thu, Mar 10, 2016 at 7:55 AM, Petr Jelinek wrote: Thanks for the comments.. > Hmm, why did you remove the comment above the call to > UnlockRelationForExtension? While re factoring I lose this comment.. Fixed it > It still seems relevant, maybe with some minor modification? > > Also there is a bit of whitespace mess inside the conditional lock block. Fixed I got the result of 10 mins run so posting it.. Note: Base code results are copied from up thread... Results For 10 Mins run of COPY 1 records of size 4 bytes script and configuration are same as used in up thread ClientBasePatch 1105111 2217246 4210428 8166653 16 145808 32 124988 64---974 Results For 10 Mins run of INSERT 1000 records of size 1024 bytes(data don't fits in shared buffer) -- ClientBasePatch 1117120 2111126 4 51 130 8 43 147 16 40 209 32 --- 254 64 --- 205 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c index 8140418..b73535c 100644 --- a/src/backend/access/heap/hio.c +++ b/src/backend/access/heap/hio.c @@ -169,6 +169,87 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2, } /* + * RelationAddOneBlock + * + * Extend relation by one block and lock the buffer + */ +static Buffer +RelationAddOneBlock(Relation relation, Buffer otherBuffer, BulkInsertState bistate) +{ + Buffer buffer; + /* + * XXX This does an lseek - rather expensive - but at the moment it is the + * only way to accurately determine how many blocks are in a relation. Is + * it worth keeping an accurate file length in shared memory someplace, + * rather than relying on the kernel to do it for us? + */ + buffer = ReadBufferBI(relation, P_NEW, bistate); + + /* + * We can be certain that locking the otherBuffer first is OK, since + * it must have a lower page number. + */ + if (otherBuffer != InvalidBuffer) + LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE); + + /* + * Now acquire lock on the new page. + */ + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + return buffer; +} + +/* + * RelationAddExtraBlocks + * + * Extend extra blocks for the relations to avoid the future contention + * on the relation extension lock. + */ + +static void +RelationAddExtraBlocks(Relation relation, BulkInsertState bistate) +{ + Page page; + Size freespace; + BlockNumber blockNum; + int extraBlocks = 0; + int lockWaiters = 0; + Buffer buffer; + + /* + * For calculating number of extra blocks to extend, find the level + * of contention on this lock, by getting the requester of this lock + * and add extra blocks in multiple of waiters. + */ + lockWaiters = RelationExtensionLockWaiter(relation); + + extraBlocks = lockWaiters * 20; + + while (extraBlocks--) + { + /* + * Here we are adding extra blocks to the relation after + * adding each block update the information in FSM so that + * other backend running parallel can find the block. 
+ */ + buffer = ReadBufferBI(relation, P_NEW, bistate); + + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + page = BufferGetPage(buffer); + PageInit(page, BufferGetPageSize(buffer), 0); + freespace = PageGetHeapFreeSpace(page); + MarkBufferDirty(buffer); + blockNum = BufferGetBlockNumber(buffer); + + UnlockReleaseBuffer(buffer); + + RecordPageWithFreeSpace(relation, blockNum, freespace); + } +} + +/* * RelationGetBufferForTuple * * Returns pinned and exclusive-locked buffer of a page in given relation @@ -233,10 +314,11 @@ RelationGetBufferForTuple(Relation relation, Size len, bool use_fsm = !(options & HEAP_INSERT_SKIP_FSM); Buffer buffer = InvalidBuffer; Page page; - Size pageFreeSpace, -saveFreeSpace; + Size pageFreeSpace = 0, +saveFreeSpace = 0; BlockNumber targetBlock, -otherBlock; +otherBlock, +lastValidBlock = InvalidBlockNumber; bool needLock; len = MAXALIGN(len); /* be conservative */ @@ -308,6 +390,7 @@ RelationGetBufferForTuple(Relation relation, Size len, } } +loop: while (targetBlock != InvalidBlockNumber) { /* @@ -388,6 +471,8 @@ RelationGetBufferForTuple(Relation relation, Size len, otherBlock, targetBlock, vmbuffer_other, vmbuffer); + lastValidBlock = targetBlock; + /* * Now we can check to see if there's enough free space here. If so, * we're done. @@ -441,27 +526,47 @@ RelationGetBufferForTuple(Relation relation, Size len, needLock = !RELATION_IS_LOCAL(relation); if (needLock) - LockRelationForExtension(relation, ExclusiveLock);
Re: [HACKERS] Relation extension scalability
On Fri, Mar 11, 2016 at 12:04 AM, Petr Jelinek wrote:

Thanks for looking.

> Those look good. The patch looks good in general now. I am a bit scared by
> the lockWaiters * 20 as it can result in relatively big changes in rare
> corner cases when for example a lot of backends were waiting for lock on
> relation and suddenly all try to extend it. I wonder if we should clamp it
> to something sane (although what's sane today might be small in couple of
> years).

But in that condition, when all are waiting on the lock, only one waiter at a time will get the lock, and it can easily count the lock waiters and extend by a multiple of that. We also have the check that when any waiter gets the lock it will first look in the FSM to see whether anybody else has already added blocks. And the other waiters will not get the lock until the first waiter has extended all the blocks and released the lock.

One possible case is that, as soon as we extend the blocks, new requesters find them directly in the FSM and never come for the lock, while an old waiter, after getting the lock, finds nothing in the FSM. But IMHO even in such cases it is good that the other waiter also extends more blocks, because this can only happen when the request rate is very high.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Thu, Mar 10, 2016 at 8:26 PM, Alexander Korotkov <a.korot...@postgrespro.ru> wrote:
> I don't think we can rely on median that much if we have only 3 runs.
> For 3 runs we can only apply Kornfeld method which claims that confidence
> interval should be between lower and upper values.
> Since confidence intervals for master and patched versions are overlapping
> we can't conclude that expected TPS numbers are different.
> Dilip, could you do more runs? 10, for example. Using such statistics we
> would be able to conclude something.

Here are the readings for 10 runs.

Median Result
---------------------------
Client    Base     Patch
---------------------------
1         19873    19739
2         38658    38276
4         68812    62075

Full results of all 10 runs...

Base
--------------------------------------------
Runs    1 Client    2 Client    4 Client
--------------------------------------------
1       19442       34866       49023
2       19605       35139       51695
3       19726       37104       53899
4       19835       38439       55708
5       19866       38638       67473
6       19880       38679       70152
7       19973       38720       70920
8       20048       38737       71342
9       20057       38815       71403
10      20344       41423       77953
--------------------------------------------

Patch
--------------------------------------------
Runs    1 Client    2 Client    4 Client
--------------------------------------------
1       18881       30161       54928
2       19415       32031       59741
3       19564       35022       61060
4       19627       36712       61839
5       19670       37659       62011
6       19808       38894       62139
7       19857       39081       62983
8       19910       39923       75358
9       20169       39937       77481
10      20181       40003       78462
--------------------------------------------

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Sat, Mar 12, 2016 at 5:31 AM, Jim Nasby wrote:
> FWIW, this is definitely a real possibility in any shop that has very high
> downtime costs and high transaction rates.
>
> I also think some kind of clamp is a good idea. It's not that uncommon to
> run max_connections significantly higher than 100, so the extension could
> be way larger than 16MB. In those cases this patch could actually make
> things far worse as everyone backs up waiting on the OS to extend many MB
> when all you actually needed were a couple dozen more pages.

I agree; we can have some maximum limit on the number of extra pages. What do others think?

> BTW, how was *20 arrived at? ISTM that if you have a lot of concurrent
> demand for extension that means you're running lots of small DML
> operations, not really big ones. I'd think that would make *1 more
> appropriate.

*1 will not solve this problem. The main problem here was that many backends sleep/wake up on the extension lock, and that was causing the bottleneck. If we do *1, it only satisfies the current requesters that have already waited on the lock, but our goal is to keep backends from requesting this lock at all. The idea is to use the number of requesters as a statistic for the load on the lock and extend by a multiple of that load, so that the situation is avoided for a long time; when it happens again, we again extend by a multiple of the load.

How did 20 come about? I tested with multiple client loads (1..64) and multiple record sizes (4-byte records to 1KB records), with COPY and INSERT, and found that 20 works best.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Sat, Mar 12, 2016 at 8:37 AM, Amit Kapila wrote:
> Can you post the numbers for 1, 5, 10, 15, 25 or whatever other multiplier
> you have tried, so that it is clear that 20 is best?

I had tried with 1, 10, 20 and 50.

1. With multiplier 1, it was almost the same as the base code.
2. With 10, the data matches my previous group-extend patch data posted upthread: http://www.postgresql.org/message-id/CAFiTN-tyEu+Wf0-jBc3TGfCoHdEAjNTx=wvuxpoa1vddyst...@mail.gmail.com
3. Beyond 20, with 50, I did not see any extra performance benefit compared to 20. (I tested 50 only with 4-byte COPY; I did not test other data sizes with 50.)

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Mon, Mar 14, 2016 at 5:02 AM, Jim Nasby wrote:
> Well, 16MB is 2K pages, which is what you'd get if 100 connections were
> all blocked and we're doing 20 pages per waiter. That seems like a really
> extreme scenario, so maybe 4MB is a good compromise. That's unlikely to be
> hit in most cases, unlikely to put a ton of stress on IO, even with
> magnetic media (assuming the whole 4MB is queued to write in one shot...).
> 4MB would still reduce the number of locks by 500x.

In the performance results I posted upthread, we get maximum performance at 32 clients, which means we are extending at most about 32*20 = 640 pages at a time. So with a 4MB limit (max 512 pages) the results should look similar. We need to decide whether 4MB is a good limit; should I change it?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Thu, Mar 17, 2016 at 1:31 PM, Petr Jelinek wrote: > Great. > > Just small notational thing, maybe this would be simpler?: > extraBlocks = Min(512, lockWaiters * 20); > Done, new patch attached. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c index 8140418..a9e18dc 100644 --- a/src/backend/access/heap/hio.c +++ b/src/backend/access/heap/hio.c @@ -169,6 +169,91 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2, } /* + * RelationAddOneBlock + * + * Extend relation by one block and lock the buffer + */ +static Buffer +RelationAddOneBlock(Relation relation, Buffer otherBuffer, BulkInsertState bistate) +{ + Buffer buffer; + /* + * XXX This does an lseek - rather expensive - but at the moment it is the + * only way to accurately determine how many blocks are in a relation. Is + * it worth keeping an accurate file length in shared memory someplace, + * rather than relying on the kernel to do it for us? + */ + buffer = ReadBufferBI(relation, P_NEW, bistate); + + /* + * We can be certain that locking the otherBuffer first is OK, since + * it must have a lower page number. + */ + if (otherBuffer != InvalidBuffer) + LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE); + + /* + * Now acquire lock on the new page. + */ + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + return buffer; +} + +/* + * RelationAddExtraBlocks + * + * Extend extra blocks for the relations to avoid the future contention + * on the relation extension lock. + */ + +static void +RelationAddExtraBlocks(Relation relation, BulkInsertState bistate) +{ + Page page; + Size freespace; + BlockNumber blockNum; + int extraBlocks = 0; + int lockWaiters = 0; + Buffer buffer; + + /* + * For calculating number of extra blocks to extend, find the level + * of contention on this lock, by getting the requester of this lock + * and add extra blocks in multiple of waiters. + */ + lockWaiters = RelationExtensionLockWaiter(relation); + + /* To avoid the cases when there are huge number of lock waiters, and + * extend file size by big amount at a time, put some limit on the + * max number of pages to be extended at a time. + */ + extraBlocks = Min(512, lockWaiters * 20); + + while (extraBlocks--) + { + /* + * Here we are adding extra blocks to the relation after + * adding each block update the information in FSM so that + * other backend running parallel can find the block. 
+ */ + buffer = ReadBufferBI(relation, P_NEW, bistate); + + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + page = BufferGetPage(buffer); + PageInit(page, BufferGetPageSize(buffer), 0); + freespace = PageGetHeapFreeSpace(page); + MarkBufferDirty(buffer); + blockNum = BufferGetBlockNumber(buffer); + + UnlockReleaseBuffer(buffer); + + RecordPageWithFreeSpace(relation, blockNum, freespace); + } +} + +/* * RelationGetBufferForTuple * * Returns pinned and exclusive-locked buffer of a page in given relation @@ -233,10 +318,11 @@ RelationGetBufferForTuple(Relation relation, Size len, bool use_fsm = !(options & HEAP_INSERT_SKIP_FSM); Buffer buffer = InvalidBuffer; Page page; - Size pageFreeSpace, -saveFreeSpace; + Size pageFreeSpace = 0, +saveFreeSpace = 0; BlockNumber targetBlock, -otherBlock; +otherBlock, +lastValidBlock = InvalidBlockNumber; bool needLock; len = MAXALIGN(len); /* be conservative */ @@ -308,6 +394,7 @@ RelationGetBufferForTuple(Relation relation, Size len, } } +loop: while (targetBlock != InvalidBlockNumber) { /* @@ -388,6 +475,8 @@ RelationGetBufferForTuple(Relation relation, Size len, otherBlock, targetBlock, vmbuffer_other, vmbuffer); + lastValidBlock = targetBlock; + /* * Now we can check to see if there's enough free space here. If so, * we're done. @@ -441,27 +530,47 @@ RelationGetBufferForTuple(Relation relation, Size len, needLock = !RELATION_IS_LOCAL(relation); if (needLock) - LockRelationForExtension(relation, ExclusiveLock); - - /* - * XXX This does an lseek - rather expensive - but at the moment it is the - * only way to accurately determine how many blocks are in a relation. Is - * it worth keeping an accurate file length in shared memory someplace, - * rather than relying on the kernel to do it for us? - */ - buffer = ReadBufferBI(relation, P_NEW, bistate); - - /* - * We can be certain that locking the otherBuffer first is OK, since it - * must have a lower page number. - */ - if (otherBuffer != InvalidBuffer) - LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE); + { + /* + * First try to get the lock in no-wait mode, if succeed extend one + * block, else get the lock in normal mode and after we get the lock + * extend some extra blocks, extra blocks will be added to satisfy + * request of other waiters and avoid future contention. Here instead + * of directly ta
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Mon, Mar 14, 2016 at 3:09 AM, Alexander Korotkov <a.korot...@postgrespro.ru> wrote:
> I've drawn graphs for these measurements. The variation doesn't look
> random here. TPS is going higher from measurement to measurement. I bet
> you did measurements sequentially.
> I think we should do more measurements until TPS stop growing and beyond
> to see accumulate average statistics. Could you please do the same tests
> but 50 runs.

I have taken readings at different client counts (1, 2 and 4).

1. Avg: the average is taken after discarding the lowest 5 and the highest 5 readings.
2. With 4 clients I have only 30 readings.

Summary: I think if we look at the avg or the median, at 1 or 2 clients head always looks like the winner, but the same is not the case at 4 clients. And with 4 clients I can see much more fluctuation in the readings.

           Head (1 Client)  Patch (1 Client)  Head (2 Client)  Patch (2 Client)  Head (4 Client)  Patch (4 Client)
Avg:       19628            19578             37180            36536             70044            70731
Median:    19663            19581             37967            37484             73003            75376

Below are all the readings (arranged in sorted order).

Runs   Head (1 Client)   Patch (1 Client)   Head (2 Client)   Patch (2 Client)   Head (4 Client)   Patch (4 Client)
1      18191             18153              29454             26128              49931             47210
2      18365             18768              31218             26956              53358             47287
3      19053             18769              31808             29286              53575             55458
4      19128             18915              31860             30558              54282             55794
5      19160             18948              32411             30945              56629             56253
6      19177             19055              32453             31316              57595             58147
7      19351             19232              32571             31703              59490             58543
8      19353             19239              33294             32029              63887             58990
9      19361             19255              33936             32217              64718             60233
10     19390             19297              34361             32767              65737             68210
11     19452             19368              34563             34487              65881             71409
12     19478             19387              36110             34907              67151             72108
13     19488             19388              36221             34936              70974             73977
14     19490             19395              36461             35068              72212             74891
15     19512             19406              36712             35298              73003             75376
16     19538             19450              37104             35492              74842             75468
17     19547             19487              37246             36648              75400             75515
18     19592             19521              37567             37263              75573             75626
19     19627             19536              37707             37430              75798             75745
20     19661             19556              37958             37461              75834             75868
21     19666             19607              37976             37507              76240             76161
22     19701             19624              38060             37557              76426             76162
23     19708             19643              38244             37717              76770             76333
24     19748             19684              38272             38285              77011             77009
25     19751             19694              38467             38294              77114             77168
26     19776             19695              38524             38469              77630             77318
27     19781             19709              38625             38642              77865             77550
28     19786             19765              38756             38643              77912             77904
29     19796             19823              38939             38649              78242             78736
30     19826             19847              39139             38659
31     19837             19899              39208             38713
32     19849             19909              39211             38837
33     19854             19932              39230             38876
34     19867             19949              39249             39088
35     19891             19990              39259             39148
36     20038             20085              39286             39453
37     20083             20128              39435             39563
38     20143             20166              39448             39959
39     20191             20198              39475             40495
40     20437             20455              40375             40664

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Mon, Mar 14, 2016 at 8:26 AM, Petr Jelinek wrote: > Well any value we choose will be very arbitrary. If we look at it from the > point of maximum absolute disk space we allocate for relation at once, > the 4MB limit would represent 2.5 orders of magnitude change. That sounds > like enough for one release cycle, I think we can further tune it if the > need arises in next one. (with my love for round numbers I would have > suggested 8MB as that's 3 orders of magnitude, but I am fine with 4MB as > well) > I have modified the patch, this contains the max limit on extra pages, 512(4MB) pages is the max limit. I have measured the performance also and that looks equally good. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c index 8140418..fcd42b9 100644 --- a/src/backend/access/heap/hio.c +++ b/src/backend/access/heap/hio.c @@ -169,6 +169,94 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2, } /* + * RelationAddOneBlock + * + * Extend relation by one block and lock the buffer + */ +static Buffer +RelationAddOneBlock(Relation relation, Buffer otherBuffer, BulkInsertState bistate) +{ + Buffer buffer; + /* + * XXX This does an lseek - rather expensive - but at the moment it is the + * only way to accurately determine how many blocks are in a relation. Is + * it worth keeping an accurate file length in shared memory someplace, + * rather than relying on the kernel to do it for us? + */ + buffer = ReadBufferBI(relation, P_NEW, bistate); + + /* + * We can be certain that locking the otherBuffer first is OK, since + * it must have a lower page number. + */ + if (otherBuffer != InvalidBuffer) + LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE); + + /* + * Now acquire lock on the new page. + */ + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + return buffer; +} + +/* + * RelationAddExtraBlocks + * + * Extend extra blocks for the relations to avoid the future contention + * on the relation extension lock. + */ + +static void +RelationAddExtraBlocks(Relation relation, BulkInsertState bistate) +{ + Page page; + Size freespace; + BlockNumber blockNum; + int extraBlocks = 0; + int lockWaiters = 0; + Buffer buffer; + + /* + * For calculating number of extra blocks to extend, find the level + * of contention on this lock, by getting the requester of this lock + * and add extra blocks in multiple of waiters. + */ + lockWaiters = RelationExtensionLockWaiter(relation); + + extraBlocks = lockWaiters * 20; + + /* To avoid the cases when there are huge number of lock waiters, and + * extend file size by big amount at a time, put some limit on the + * max number of pages to be extended at a time. + */ + if (extraBlocks > 512) + extraBlocks = 512; + + while (extraBlocks--) + { + /* + * Here we are adding extra blocks to the relation after + * adding each block update the information in FSM so that + * other backend running parallel can find the block. 
+ */ + buffer = ReadBufferBI(relation, P_NEW, bistate); + + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + + page = BufferGetPage(buffer); + PageInit(page, BufferGetPageSize(buffer), 0); + freespace = PageGetHeapFreeSpace(page); + MarkBufferDirty(buffer); + blockNum = BufferGetBlockNumber(buffer); + + UnlockReleaseBuffer(buffer); + + RecordPageWithFreeSpace(relation, blockNum, freespace); + } +} + +/* * RelationGetBufferForTuple * * Returns pinned and exclusive-locked buffer of a page in given relation @@ -233,10 +321,11 @@ RelationGetBufferForTuple(Relation relation, Size len, bool use_fsm = !(options & HEAP_INSERT_SKIP_FSM); Buffer buffer = InvalidBuffer; Page page; - Size pageFreeSpace, -saveFreeSpace; + Size pageFreeSpace = 0, +saveFreeSpace = 0; BlockNumber targetBlock, -otherBlock; +otherBlock, +lastValidBlock = InvalidBlockNumber; bool needLock; len = MAXALIGN(len); /* be conservative */ @@ -308,6 +397,7 @@ RelationGetBufferForTuple(Relation relation, Size len, } } +loop: while (targetBlock != InvalidBlockNumber) { /* @@ -388,6 +478,8 @@ RelationGetBufferForTuple(Relation relation, Size len, otherBlock, targetBlock, vmbuffer_other, vmbuffer); + lastValidBlock = targetBlock; + /* * Now we can check to see if there's enough free space here. If so, * we're done. @@ -441,27 +533,47 @@ RelationGetBufferForTuple(Relation relation, Size len, needLock = !RELATION_IS_LOCAL(relation); if (needLock) - LockRelationForExtension(relation, ExclusiveLock); - - /* - * XXX This does an lseek - rather expensive - but at the moment it is the - * only way to accurately determine how many blocks are in a relation. Is - * it worth keeping an accurate file length in shared memory someplace, - * rather than relying on the kernel to do it for us? - */ - buffer = ReadBufferBI(relation, P_NEW
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Sun, Mar 20, 2016 at 4:10 AM, Alexander Korotkov <a.korot...@postgrespro.ru> wrote:
> Actually, we behave like old code and do such modifications without
> increasing number of atomic operations. We can just calculate new value of
> state (including unset of BM_LOCKED flag) and write it to the buffer. In
> this case we don't require more atomic operations. However, it becomes not
> safe to use atomic decrement in UnpinBuffer(). We can use there loop of
> CAS which wait buffer header to be unlocked like PinBuffer() does. I'm not
> sure if this affects multicore scalability, but I think it worth trying.
>
> So, this idea is implemented in attached patch. Please, try it for both
> regression on lower number of clients and scalability on large number of
> clients.

Good, it seems like we have reduced some instructions; I will test it soon.

> Other changes in this patch:
> 1) PinBuffer() and UnpinBuffer() use exponential backoff in spindelay
> like LockBufHdr() does.
> 2) Previous patch contained revert
> of ac1d7945f866b1928c2554c0f80fd52d7f92. This was not intentional.

No, it doesn't contain this revert.

Some other comments..

*** ReadBuffer_common(SMgrRelation smgr, cha
*** 828,837
  */
  do
  {
! LockBufHdr(bufHdr);
! Assert(bufHdr->flags & BM_VALID);
! bufHdr->flags &= ~BM_VALID;
! UnlockBufHdr(bufHdr);
  } while (!StartBufferIO(bufHdr, true));
  }
  }
--- 826,834
  */
  do
  {
! uint32 state = LockBufHdr(bufHdr);
! state &= ~(BM_VALID | BM_LOCKED);
! pg_atomic_write_u32(&bufHdr->state, state);
  } while (!StartBufferIO(bufHdr, true));

It would be better to write a comment here about the fact that we clear BM_LOCKED from the state directly and so need not call UnlockBufHdr explicitly; otherwise it is confusing.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Tue, Mar 22, 2016 at 12:31 PM, Dilip Kumar wrote:
> ! pg_atomic_write_u32(&bufHdr->state, state);
> } while (!StartBufferIO(bufHdr, true));
>
> It would be better to write a comment here about the fact that we clear
> BM_LOCKED from the state directly and so need not call UnlockBufHdr
> explicitly; otherwise it is confusing.

A few more comments..

*** 828,837
  */
  do
  {
! LockBufHdr(bufHdr);
! Assert(bufHdr->flags & BM_VALID);
! bufHdr->flags &= ~BM_VALID;
! UnlockBufHdr(bufHdr);
  } while (!StartBufferIO(bufHdr, true));
  }
  }
--- 826,834
  */
  do
  {
! uint32 state = LockBufHdr(bufHdr);
! state &= ~(BM_VALID | BM_LOCKED);
! pg_atomic_write_u32(&bufHdr->state, state);
  } while (!StartBufferIO(bufHdr, true));

1. Previously there was an Assert; is there any reason why we removed it?
   Assert(bufHdr->flags & BM_VALID);

2. Also, if we don't need the Assert, then can't we directly clear the BM_VALID flag from the state variable (when it's not locked), like pin/unpin buffer do, without taking the buffer header lock?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
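For what it's worth, point 2 above would presumably look something like the sketch below, i.e. a compare-and-swap loop similar to what PinBuffer()/UnpinBuffer() do in the patched code. This is only an illustration of the question, not code from the patch, and the helper name is invented.

    /*
     * Sketch: clear BM_VALID without taking the buffer header spinlock.
     * Wait (by re-reading) while somebody else holds the header lock, then
     * try to install a state value with BM_VALID cleared.
     */
    static void
    ClearBufferValidFlag(BufferDesc *bufHdr)
    {
        uint32 oldstate = pg_atomic_read_u32(&bufHdr->state);

        for (;;)
        {
            uint32 newstate;

            if (oldstate & BM_LOCKED)
            {
                /* Header is locked by someone else; re-read and retry. */
                oldstate = pg_atomic_read_u32(&bufHdr->state);
                continue;
            }

            newstate = oldstate & ~BM_VALID;

            /* On failure, oldstate is refreshed with the current value. */
            if (pg_atomic_compare_exchange_u32(&bufHdr->state, &oldstate, newstate))
                break;
        }
    }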
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Thu, Mar 17, 2016 at 11:39 AM, Amit Kapila wrote:

I have reviewed the patch. Here are some review comments; I will continue to review.

1.
+
+ /*
+ * Add the proc to list, if the clog page where we need to update the
+ */
+ if (nextidx != INVALID_PGPROCNO &&
+ ProcGlobal->allProcs[nextidx].clogGroupMemberPage != proc->clogGroupMemberPage)
+ return false;

Should we clear all the structure members that we set above in the case where we do not add ourselves to the group? I can see that it will not cause any problem even if we don't clear them; if we don't want to clear them, I think we can add a comment mentioning that.

+ proc->clogGroupMember = true;
+ proc->clogGroupMemberXid = xid;
+ proc->clogGroupMemberXidStatus = status;
+ proc->clogGroupMemberPage = pageno;
+ proc->clogGroupMemberLsn = lsn;

2. Here we are updating our own proc; I think we don't need an atomic operation here, since we have not yet been added to the list.

+ if (nextidx != INVALID_PGPROCNO &&
+ ProcGlobal->allProcs[nextidx].clogGroupMemberPage != proc->clogGroupMemberPage)
+ return false;
+
+ pg_atomic_write_u32(&proc->clogGroupNext, nextidx);

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
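For point 1, the early-return path could either reset those fields or simply carry a comment saying that leaving them set is harmless. A sketch of the reset variant, purely for illustration (the field names follow the quoted patch, but the reset values are my assumption, not taken from the patch):

    if (nextidx != INVALID_PGPROCNO &&
        ProcGlobal->allProcs[nextidx].clogGroupMemberPage != proc->clogGroupMemberPage)
    {
        /*
         * We are not going to join this group after all, so undo the member
         * fields we filled in above (alternatively, a comment could note that
         * leaving them set is safe because they are overwritten on the next
         * attempt).
         */
        proc->clogGroupMember = false;
        proc->clogGroupMemberXid = InvalidTransactionId;
        proc->clogGroupMemberPage = -1;
        return false;
    }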
Re: [HACKERS] Relation extension scalability
On Mon, Mar 21, 2016 at 8:10 PM, Amit Kapila wrote:
> Review comments:

Thanks for the review; please find my responses inline.

> 1.
> /*
> + * RelationAddOneBlock
> + *
> + * Extend relation by one block and lock the buffer
> + */
> +static Buffer
> +RelationAddOneBlock(Relation relation, Buffer otherBuffer, BulkInsertState bistate)
>
> Shall we mention in comments that this API returns locked buffer and it's
> the responsibility of caller to unlock it.

Fixed.

> 2.
> + /* To avoid the cases when there are huge number of lock waiters, and
> + * extend file size by big amount at a time, put some limit on the
>
> first line in multi-line comments should not contain anything.

Fixed.

> 3.
> + extraBlocks = Min(512, lockWaiters * 20);
>
> I think we should explain in comments about the reason of choosing 20 in
> above calculation.

Fixed. Please check whether the explanation is enough or we need to add something more.

> 4. Sometime back [1], you agreed on doing some analysis for the overhead
> that XLogFlush can cause during buffer eviction, but I don't see the
> results of same, are you planing to complete the same?

OK, I will test this.

> 5.
> + if (LOCKACQUIRE_OK
> + != RelationExtensionLockConditional(relation, ExclusiveLock))
>
> I think the coding style is to keep constant on right side of condition,
> did you see any other place in code which uses the check in a similar way?

Fixed. I'm not sure about any other place. (I used to follow the habit of keeping the constant on the left to avoid the case where "=" is typed by mistake instead of "==", which would silently become an assignment instead of throwing an error.) But in PostgreSQL style the constant should be on the right, so I will take care of that.

> 6.
> - /*
> - * Now acquire lock on the new page.
> + /* For all case we need to add at least one block to satisfy our
> + * own request.
> */
>
> Same problem as in point 2.

Fixed.

> 7.
> @@ -472,6 +581,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
> if (needLock)
> UnlockRelationForExtension(relation, ExclusiveLock);
> +
> +
>
> Spurious newline addition.

Fixed.

> 8.
> +int
> +RelationExtensionLockWaiter(Relation relation)
>
> How about naming this function as RelationExtensionLockWaiterCount?

Done.

> 9.
> + /* Other waiter has extended the block for us*/
>
> Provide an extra space before ending the comment.

Fixed.

> 10.
> + if (use_fsm)
> + {
> + if (lastValidBlock != InvalidBlockNumber)
> + {
> + targetBlock = RecordAndGetPageWithFreeSpace(relation,
> + lastValidBlock,
> + pageFreeSpace,
> + len + saveFreeSpace);
> + }
>
> Are you using RecordAndGetPageWithFreeSpace() instead of
> GetPageWithFreeSpace() to get the page close to the previous target page?
> If yes, then do you see enough benefit of the same that it can compensate
> the additional write operation which Record* API might cause due to
> additional dirtying of buffer?

We are calling RecordAndGetPageWithFreeSpace instead of GetPageWithFreeSpace because another backend that got the lock might have added extra blocks to the FSM, and it is possible that the FSM tree has not been updated yet, so we would not find the page by searching from the top using GetPageWithFreeSpace. We therefore need to search the leaf page directly using our last valid target block. I explained the same in the comments.

> 11.
> + {
> + /*
> + * First try to get the lock in no-wait mode, if succeed extend one
> + * block, else get the lock in normal mode and after we get the lock
> - LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
> + buffer = RelationAddOneBlock(relation, otherBuffer, bistate);
>
> Won't this cause one extra block addition after backend extends the
> relation for multiple blocks, what is the need of same?

This is the block for this backend; the extra extension is for future requests and has already been added to the FSM. I could have counted this block in with the extra blocks in RelationAddExtraBlocks, but in that case I would need an extra "if" to save one buffer for this backend and then return that specific buffer to the caller, and the caller would also need to distinguish between the case where it wants to add one block and the case where it got one block added along with the extra blocks. I think the code is simpler this way: everybody that comes down here adds one block for its own use, and all the other functionality and logic is above, i.e. whether to take the lock or not and whether to add extra blocks or not.

> 12. I think it is good to once test pgbench read-write tests to ensure
> that this doesn't introduce any new regression.

I will test this and post the results.

Latest patch attached.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

multi_extend_v10.patch
Description: Binary data

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Relation extension scalability
On Thu, Mar 24, 2016 at 10:44 AM, Robert Haas wrote:
> On Wed, Mar 23, 2016 at 9:43 PM, Amit Kapila wrote:
> > RecordAndGetPageWithFreeSpace() tries to search from the oldPage passed to
> > it, rather than from top, so even if RecordPageWithFreeSpace() doesn't
> > update till root, it will be able to search the newly added page. I agree
> > with whatever you have said in another mail that we should introduce a new
> > API to do a more targeted search for such cases.
>
> OK, let's do that, then.

OK, I have added a new API which just finds a free block, starting the search from the last given page.

1. I have named the new function GetPageWithFreeSpaceUsingOldPage. I don't like this name, but I could not come up with anything better; please suggest one. Also, the body of GetPageWithFreeSpaceUsingOldPage looks almost identical to RecordAndGetPageWithFreeSpace; I tried to merge the two, but for that we would need to pass an extra parameter to the function.

2. I also had to write one more function, fsm_search_from_addr, instead of using fsm_set_and_search, so that we can find a block without updating the other slot.

I have done a performance test just to confirm the result, and performance is the same as before with both COPY and INSERT.

3. I have also run the pgbench read-write test that Amit suggested upthread. There is no regression or improvement with the pgbench workload.

Client    Base     Patch
1         899      914
8         5397     5413
32        18170    18052
64        29850    29941

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

multi_extend_v12.patch
Description: Binary data

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
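For context, the calling pattern in hio.c for such an API would presumably be along these lines. This is just a sketch of the intended usage, not the attached patch, and it assumes a (relation, oldPage, spaceNeeded) style signature for the new function:

    /*
     * Another backend may have pre-extended the relation.  Look for one of
     * the new pages starting from the FSM leaf page that covers our last
     * known target block, since the upper FSM levels may not have been
     * updated yet and a search from the root could miss them.
     */
    if (use_fsm && lastValidBlock != InvalidBlockNumber)
    {
        targetBlock = GetPageWithFreeSpaceUsingOldPage(relation,
                                                       lastValidBlock,
                                                       len + saveFreeSpace);
    }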
Re: [HACKERS] Relation extension scalability
On Thu, Mar 24, 2016 at 6:13 PM, Amit Kapila wrote: > > 1. > +GetPageWithFreeSpaceUsingOldPage(Relation rel, BlockNumber oldPage, > + Size spaceNeeded) > { > .. > + /* > + * If fsm_set_and_search found a suitable new block, return that. > + * Otherwise, search as usual. > + */ > .. > } > > In the above comment, you are referring wrong function. > Fixed > > 2. > +static int > +fsm_search_from_addr(Relation rel, FSMAddress addr, uint8 minValue) > +{ > + Buffer buf; > + int newslot = -1; > + > + buf = fsm_readbuf(rel, addr, true); > + LockBuffer(buf, BUFFER_LOCK_SHARE); > + > + if (minValue != 0) > + { > + /* Search while we still hold the lock */ > + newslot = fsm_search_avail(buf, minValue, > + addr.level == FSM_BOTTOM_LEVEL, > + false); > > In this new API, I don't understand why we need minValue != 0 check, > basically if user of API doesn't want to search for space > 0, then what is > the need of calling this API? I think this API should use Assert for > minValue!=0 unless you see reason for not doing so. > > Agree, it should be assert. > > > > GetNearestPageWithFreeSpace? (although not sure that's accurate > description, maybe Nearby would be better) > > > > Better than what is used in patch. > > Yet another possibility could be to call it as > GetPageWithFreeSpaceExtended and call it from GetPageWithFreeSpace with > value of oldPage as InvalidBlockNumber. > Yes I like this.. Changed the same. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com multi_extend_v13.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
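So with this change the old entry point just becomes a thin wrapper, something like:

BlockNumber
GetPageWithFreeSpace(Relation rel, Size spaceNeeded)
{
    /* no old page to start from: plain top-down search */
    return GetPageWithFreeSpaceExtended(rel, InvalidBlockNumber, spaceNeeded);
}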
Re: [HACKERS] Relation extension scalability
On Fri, Mar 25, 2016 at 3:04 AM, Robert Haas wrote: > 1. Callers who use GetPageWithFreeSpace() rather than > GetPageFreeSpaceExtended() will fail to find the new pages if the > upper map levels haven't been updated by VACUUM. > > 2. Even callers who use GetPageFreeSpaceExtended() may fail to find > the new pages. This can happen in two separate ways, namely (a) the > Yeah, that's the issue, if extended pages spills to next FSM page, then other waiters will not find those page, and one by one all waiters will end up adding extra pages. for example, if there are ~30 waiters then total blocks extended = (25(25+1)/2) *20 =~ 6500 pages. This is not the case every time but whenever heap block go to new FSM page this will happen. - FSM page case hold 4096 heap blk info, so after every 8th extend (assume 512 block will extend in one time), it will extend ~6500 pages - Any new request to RelationGetBufferForTuple will be able to find those page, because by that time the backend which is extending the page would have set new block using RelationSetTargetBlock. (there are still chances that some blocks can be completely left unused, until vacuum comes). I have changed the patch as per the suggestion (just POC because performance number are not that great) Below is the performance number comparison of base, previous patch(v13) and latest patch (v14). performance of patch v14 is significantly low compared to v13, mainly I guess below reasons 1. As per above calculation v13 extend ~6500 block (after every 8th extend), and that's why it's performing well. 2. In v13 as soon as we extend the block we add to FSM so immediately available for new requester, (In this patch also I tried to add one by one to FSM and updated fsm tree till root after all pages added to FSM, but no significant improvement). 3. fsm_update_recursive doesn't seems like problem to me. does it ? Copy 1 tuples, of 4 bytes each.. - Client base patch v13 patch v14 1 118 147126 2 217 276269 4 210 421347 8 166 630375 16 145 813415 32 124 985451 64 974455 Insert 1000 tuples of 1K size each. Clientbase patch v13 patch v14 1 117 124119 2 111 126119 4 51 128124 8 43 149131 16 40 217120 32 263115 64 248109 Note: I think one thread number can be just run to run variance.. Does anyone see problem in updating the FSM tree, I have debugged and saw that we are able to get the pages properly from tree and same is visible in performance number of v14 compared to base. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c index 8140418..9ac1eb7 100644 --- a/src/backend/access/heap/hio.c +++ b/src/backend/access/heap/hio.c @@ -169,6 +169,62 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2, } /* + * Extend a relation by multiple blocks to avoid future contention on the + * relation extension lock. Our goal is to pre-extend the relation by an + * amount which ramps up as the degree of contention ramps up, but limiting + * the result to some sane overall value. + */ +static void +RelationAddExtraBlocks(Relation relation, BulkInsertState bistate) +{ + Page page; + BlockNumber blockNum, +firstBlock, +lastBlock; + int extraBlocks = 0; + int lockWaiters = 0; + Buffer buffer; + + /* + * We use the length of the lock wait queue to judge how much to extend. + * It might seem like multiplying the number of lock waiters by as much + * as 20 is too aggressive, but benchmarking revealed that smaller numbers + * were insufficient. 
512 is just an arbitrary cap to prevent pathological + * results (and excessive wasted disk space). + */ + lockWaiters = RelationExtensionLockWaiterCount(relation); + extraBlocks = Min(512, lockWaiters * 20); + + firstBlock = lastBlock = InvalidBlockNumber; + + while (extraBlocks-- >= 0) + { + /* Ouch - an unnecessary lseek() each time through the loop! */ + buffer = ReadBufferBI(relation, P_NEW, bistate); + + /* Extend by one page. */ + LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); + page = BufferGetPage(buffer); + PageInit(page, BufferGetPageSize(buffer), 0); + MarkBufferDirty(buffer); + blockNum = BufferGetBlockNumber(buffer); + UnlockReleaseBuffer(buffer); + + if (firstBlock == InvalidBlockNumber) + firstBlock = blockNum; + + lastBlock = blockNum; + } + + /* + * Put the page in the freespace map so other backends can find it. + * This
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Fri, Mar 25, 2016 at 8:09 PM, Alexander Korotkov < a.korot...@postgrespro.ru> wrote: > Could anybody run benchmarks? Feature freeze is soon, but it would be > *very nice* to fit it into 9.6 release cycle, because it greatly improves > scalability on large machines. Without this patch PostgreSQL 9.6 will be > significantly behind competitors like MySQL 5.7. I have run the performance and here are the results.. With latest patch I did not see any regression at lower client count (median of 3 reading). scale factor 1000 shared buffer 8GB readonly *Client Base patch* 1 12957 13068 2 24931 25816 4 46311 48767 32 300921 310062 64 387623 493843 128 249635 583513 scale factor 300 shared buffer 8GB readonly *Client Base patch* 1 14537 14586--> one thread number looks little less, generally I get ~18000 (will recheck). 2 34703 33929--> may be run to run variance (once I get time, will recheck) 4 67744 69069 32 312575 336012 64 213312 539056 128 190139 380122 *Summary:* Actually with 64 client we have seen ~470,000 TPS with head also, by revering commit 6150a1b0. refer this thread: ( http://www.postgresql.org/message-id/caa4ek1+zeb8pmwwktf+3brs0pt4ux6rs6aom0uip8c6shjw...@mail.gmail.com ) I haven't tested this patch by reverting commit 6150a1b0, so not sure can this patch give even better performance ? It also points to the case, what Andres has mentioned in this thread. http://www.postgresql.org/message-id/20160226191158.3vidtk3ktcmhi...@alap3.anarazel.de Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Sat, Mar 26, 2016 at 8:07 AM, Robert Haas wrote: > I think we need to start testing these patches not only in terms of > how *fast* they are but how *large* the relation ends up being when > we're done. A patch that inserts the rows slower but the final > relation is smaller may be better overall. Can you retest v13, v14, > and master, and post not only the timings but the relation size > afterwards? And maybe post the exact script you are using? > I have tested the size and performance, scripts are attached in the mail. COPY 1-10 bytes tuple from 32 Clients Base V13 V14 - - TPS 123 874 446 No. Of Tuples 14827 104998 53637 Relpages 656089 4652593 2485482 INSERT 1028 bytes Tuples From 16 Clients Base V13 V14 - TPS 42 211 120 No. Of Tuples 5149000 25343000 14524000 Rel Pages 735577 3620765 2140612 As per above results If we calculate the tuple number of tuples and respective relpages, then neither in v13 nor v14 there are extra unused pages. As per my calculation for INSERT (1028 byte tuple) each page contain 7 tuples so number of pages required Base: 5149000/7 = 735571 (from relpages we can see 6 pages are extra) V13 : 25343000/7= 3620428 (from relpages we can see ~300 pages are extra). V14 : 14524000/7= 2074857 (from relpages we can see ~7 pages are extra). With V14 we have found max pages number of extra pages, I expected V13 to have max unused pages, but it's v14 and I tested it in multiple runs and v13 is always the winner. I tested with multiple client count also like 8, 32 and v13 always have only ~60-300 extra pages out of total ~2-4 Million Pages. Attached files: --- test_size_ins.sh --> automated script to run insert test and calculate tuple and relpages. test_size_copy --> automated script to run copy test and calculate tuple and relpages. copy_script -> copy pg_bench script used by test_size_copy.sh insert_script --> insert pg_bench script used by test_size_ins.sh > Maybe something like this would help: > > if (needLock) > { > if (!use_fsm) > LockRelationForExtension(relation, ExclusiveLock); > else if (!ConditionLockRelationForExtension(relation, > ExclusiveLock)) > { > BlockNumber last_blkno = > RelationGetNumberOfBlocks(relation); > > targetBlock = GetPageWithFreeSpaceExtended(relation, > last_blkno, len + saveFreeSpace); > if (targetBlock != InvalidBlockNumber) > goto loop; > > LockRelationForExtension(relation, ExclusiveLock); > targetBlock = GetPageWithFreeSpace(relation, len + > saveFreeSpace); > if (targetBlock != InvalidBlockNumber) > { > UnlockRelationForExtension(relation, ExclusiveLock); > goto loop; > } > RelationAddExtraBlocks(relation, bistate); > } > } > > I think this is better than what you had before with lastValidBlock, > because we're actually interested in searching the free space map at > the *very end* of the relation, not wherever the last target block > happens to have been. > > We could go further still and have GetPageWithFreeSpace() always > search the last, say, two pages of the FSM in all cases. But that > might be expensive. The extra call to RelationGetNumberOfBlocks seems > cheap enough here because the alternative is to wait for a contended > heavyweight lock. > I will try the test with this also and post the results. 
-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com copy_script Description: Binary data insert_script Description: Binary data test_size_copy.sh Description: Bourne shell script test_size_ins.sh Description: Bourne shell script -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Relation extension scalability
On Sat, Mar 26, 2016 at 3:18 PM, Dilip Kumar wrote: > search the last, say, two pages of the FSM in all cases. But that >> might be expensive. The extra call to RelationGetNumberOfBlocks seems >> cheap enough here because the alternative is to wait for a contended >> heavyweight lock. >> > > I will try the test with this also and post the results. > I have changed v14 as per this suggestion and results are same as v14. I have again measured the relation size, this time directly using size function so results are better understandable. Relation Size - INSERT : 16000 transaction from 32 Client Base v13 v14_1 - - TPS 37 255 112 Rel Size 17GB 17GB 18GB COPY: 32000 transaction from 32 client Base v13v14_1 --- TPS 121 823 427 Rel Size 11GB11GB 11GB Script are attached in the mail = test_size_ins.sh --> run insert test and calculate relation size. test_size_copy --> run copy test and relation size copy_script -> copy pg_bench script used by test_size_copy.sh insert_script --> insert pg_bench script used by test_size_ins.sh multi_extend_v14_poc_v1.patch --> modified patch of v14. I also tried modifying v14 from different different angle. One is like below--> - In AddExtraBlock { I add page to FSM one by one like v13 does. then update the full FSM tree up till root } Results: -- 1. With this performance is little less than v14 but the problem of extra relation size is solved. 2. With this we can conclude that extra size of relation in v14 is because some while extending the pages, its not immediately available and at end some of the pages are left unused. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com test_size_copy.sh Description: Bourne shell script test_size_ins.sh Description: Bourne shell script copy_script Description: Binary data insert_script Description: Binary data multi_extend_v14_poc_v1.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Relation extension scalability
On Sat, Mar 26, 2016 at 3:18 PM, Dilip Kumar wrote: > We could go further still and have GetPageWithFreeSpace() always >> search the last, say, two pages of the FSM in all cases. But that >> might be expensive. The extra call to RelationGetNumberOfBlocks seems >> cheap enough here because the alternative is to wait for a contended >> heavyweight lock. >> > > I will try the test with this also and post the results. > **Something went wrong in last mail, seems like become separate thread, so resending the same mail ** I have changed v14 as per this suggestion and results are same as v14. I have again measured the relation size, this time directly using size function so results are better understandable. Relation Size - INSERT : 16000 transaction from 32 Client Base v13 v14_1 - - TPS 37 255 112 Rel Size 17GB 17GB 18GB COPY: 32000 transaction from 32 client Base v13v14_1 --- TPS 121 823427 Rel Size 11GB11GB 11GB Script are attached in the mail = test_size_ins.sh --> run insert test and calculate relation size. test_size_copy --> run copy test and relation size copy_script -> copy pg_bench script used by test_size_copy.sh insert_script --> insert pg_bench script used by test_size_ins.sh multi_extend_v14_poc_v1.patch --> modified patch of v14. I also tried modifying v14 from different different angle. One is like below--> - In AddExtraBlock { I add page to FSM one by one like v13 does. then update the full FSM tree up till root } Results: -- 1. With this performance is little less than v14 but the problem of extra relation size is solved. 2. With this we can conclude that extra size of relation in v14 is because some while extending the pages, its not immediately available and at end some of the pages are left unused. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com copy_script Description: Binary data insert_script Description: Binary data multi_extend_v14_poc_v1.patch Description: Binary data test_size_copy.sh Description: Bourne shell script test_size_ins.sh Description: Bourne shell script -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Sun, Mar 27, 2016 at 5:37 PM, Andres Freund wrote: > On what hardware did you run these tests? IBM POWER8 machine:
Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                192
Thread(s) per core:    8
Core(s) per socket:    1
Socket(s):             24
NUMA node(s):          4
-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Mon, Mar 28, 2016 at 1:55 AM, Robert Haas wrote: > > One is like below--> > > - > > In AddExtraBlock > > { > >I add page to FSM one by one like v13 does. > >then update the full FSM tree up till root > > } > > Not following this. Did you attach this version? > No I did not attached this.. During rough experiment, tried this, did not produced any patch, I will send this. > > Results: > > -- > > 1. With this performance is little less than v14 but the problem of extra > > relation size is solved. > > 2. With this we can conclude that extra size of relation in v14 is > because > > some while extending the pages, its not immediately available and at end > > some of the pages are left unused. > > I agree with that conclusion. I'm not quite sure where that leaves > us, though. We can go back to v13, but why isn't that producing extra > pages? It seems like it should: whenever a bulk extend rolls over to > a new FSM page, concurrent backends will search either the old or the > new one but not both. > > Maybe we could do this - not sure if it's what you were suggesting above: > > 1. Add the pages one at a time, and do RecordPageWithFreeSpace after each > one. > 2. After inserting them all, go back and update the upper levels of > the FSM tree up the root. > Yes same, I wanted to explained the same above. > Another idea is: > > If ConditionalLockRelationForExtension fails to get the lock > immediately, search the last *two* pages of the FSM for a free page. > > Just brainstorming here. I think this is better option, Since we will search last two pages of FSM tree, then no need to update the upper level of the FSM tree. Right ? I will test and post the result with this option. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Mon, Mar 28, 2016 at 11:00 AM, Amit Kapila wrote: > I have not debugged the flow, but by looking at v13 code, it looks like it > will search both old and new. In > function > GetPageWithFreeSpaceExtended()->fsm_search_from_addr()->fsm_search_avail(), > the basic idea of search is: Start the search from the target slot. At > every step, move one > node to the right, then climb up to the parent. Stop when we reach a node > with enough free space (as we must, since the root has enough space). > So shouldn't it be able to find the new FSM page where the bulk extend > rolls over? > The FSM is actually a multi-level tree, and each FSM page contains one slot tree. So fsm_search_avail() searches only the slot tree inside a single FSM page, but here we want to move on to the next FSM page. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
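To illustrate: a heap block maps to one particular bottom-level FSM page, essentially what fsm_get_location() computes, so a search that stays inside one FSM page's slot tree can never reach blocks that spilled over to the next FSM page:

/* Each FSM page holds a small binary tree of slots; one bottom-level
 * FSM page covers SlotsPerFSMPage (roughly 4000) consecutive heap blocks. */
addr.level     = FSM_BOTTOM_LEVEL;
addr.logpageno = heapBlk / SlotsPerFSMPage;   /* which FSM page */
slot           = heapBlk % SlotsPerFSMPage;   /* which slot inside that page */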
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Sun, Mar 27, 2016 at 5:48 PM, Andres Freund wrote: > > What's sizeof(BufferDesc) after applying these patches? It should better > be <= 64... > It is 72. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Mon, Mar 28, 2016 at 7:21 AM, Dilip Kumar wrote: > I agree with that conclusion. I'm not quite sure where that leaves >> us, though. We can go back to v13, but why isn't that producing extra >> pages? It seems like it should: whenever a bulk extend rolls over to >> a new FSM page, concurrent backends will search either the old or the >> new one but not both. >> > Our open question was why V13 is not producing extra pages, I tried to print some logs and debug. It seems to me that, when blocks are spilling to next FSM pages, that time all backends who are waiting on lock will not get the block because searching in old FSM page. But the backend which is extending the pages will set RelationSetTargetBlock to latest blocks, and that will make new FSM page available for search by new requesters. 1. So this is why v13 (in normal cases*1) not producing unused pages. 2. But it will produce extra pages (which will be consumed by new requesters), because all waiter will come one by one and extend 512 pages. *1 : Above I have mentioned normal case, I mean there is some case exist where V13 can leave unused page, Like one by one each waiter Get the lock and extend the page, but no one go down till RelationSetTargetBlock so till now new pages are not available by new requester, and time will come when blocks will spill to third FSM page, now one by one all backends go down and set RelationSetTargetBlock, and suppose last one set it to the block which is in 3rd FSM page, in this case, pages in second FSM pages are unused. Maybe we could do this - not sure if it's what you were suggesting above: >> >> 1. Add the pages one at a time, and do RecordPageWithFreeSpace after each >> one. >> 2. After inserting them all, go back and update the upper levels of >> the FSM tree up the root. >> > I think this is better option, Since we will search last two pages of FSM > tree, then no need to update the upper level of the FSM tree. Right ? > > I will test and post the result with this option. > I have created this patch and results are as below. * All test scripts are same attached upthread 1. Relation Size : No change in size, its same as base and v13 2. INSERT 1028 Byte 1000 tuple performance --- Client base v13 v15 1 117 124 122 2 111 126 123 4 51 128 125 8 43 149 135 16 40 217 147 32 35 263 141 3. COPY 1 Tuple performance. -- Client base v13 v15 1 118 147 155 2 217 276 273 4 210 421 457 8 166 630 643 16 145 813 595 32 124 985 598 Conclusion: --- 1. I think v15 is solving the problem exist with v13 and performance is significantly high compared to base, and relation size is also stable, So IMHO V15 is winner over other solution, what other thinks ? 2. And no point in thinking that V13 is better than V15 because, V13 has bug of sometime extending more than expected pages and that is uncontrolled and same can be the reason also of v13 performing better. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com multi_extend_v15.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Relation extension scalability
On Mon, Mar 28, 2016 at 3:02 PM, Dilip Kumar wrote: > 1. Relation Size : No change in size, its same as base and v13 > > 2. INSERT 1028 Byte 1000 tuple performance > --- > Client base v13 v15 > 1 117 124 122 > 2 111 126 123 > 4 51 128 125 > 8 43 149 135 > 16 40 217 147 > 32 35 263 141 > > 3. COPY 1 Tuple performance. > -- > Client base v13 v15 > 1 118 147 155 > 2 217 276 273 > 4 210 421 457 > 8 166 630 643 > 16 145 813 595 > 32 124 985 598 > > Conclusion: > --- > 1. I think v15 is solving the problem exist with v13 and performance is > significantly high compared to base, and relation size is also stable, So > IMHO V15 is winner over other solution, what other thinks ? > > 2. And no point in thinking that V13 is better than V15 because, V13 has > bug of sometime extending more than expected pages and that is uncontrolled > and same can be the reason also of v13 performing better. > Found one problem with V15, so sending the new version. In V15 I am taking prev_blkno as targetBlock instead it should be the last block of the relation at that time. Attaching new patch. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com multi_extend_v16.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Relation extension scalability
On Tue, Mar 29, 2016 at 7:26 AM, Robert Haas wrote: > Well, it's less important in that case, but I think it's still worth > doing. Some people are going to do just plain GetPageWithFreeSpace(). > I am attaching new version v17. Its like this... In *RelationAddExtraBlocks* { -- Add Block one by one to FSM. -- Update FSM tree all the way to root } In *RelationGetBufferForTuple* --- Same as v14, search the FSM tree from root. GetPageWithFreeSpace *Summary:* *--* 1. By Adding block to FSM tree one by one it solves the problem of unused blocks in V14. 2. It Update the FSM tree all they up to root, so anybody search from root can get the block. 3. It also search the block from root, so it don't have problem like v15 has(Exactly identifying which two FSM page to search). 4. This solves the performance problem of V14 by some optimizations in logic of updating FSM tree till root. *Performance Data*: -- Client base v17 1 117 120 2 111 123 4 51 124 8 43 135 16 40 145 32 35 144 64 -- 140 Client base v17 --- --- -- 1 118 117 2 217 220 4 210 379 8 166 574 16 145 594 32 124 599 64 609 Notes: - If I do some change in this patch in strategy of searching the block, performance remains almost the same. 1. Like search in two block like v15 or v17 does. 2. Search first using target block and if don't get then search from top. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com multi_extend_v17.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
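In code form, RelationAddExtraBlocks in v17 does roughly the following (simplified sketch, not the exact patch):

extraBlocks = Min(512, RelationExtensionLockWaiterCount(relation) * 20);

while (extraBlocks-- >= 0)
{
    buffer = ReadBufferBI(relation, P_NEW, bistate);

    LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
    page = BufferGetPage(buffer);
    PageInit(page, BufferGetPageSize(buffer), 0);
    freespace = PageGetHeapFreeSpace(page);
    MarkBufferDirty(buffer);
    blockNum = BufferGetBlockNumber(buffer);
    UnlockReleaseBuffer(buffer);

    if (firstBlock == InvalidBlockNumber)
        firstBlock = blockNum;

    /* record each new block in its FSM leaf page as we go */
    RecordPageWithFreeSpace(relation, blockNum, freespace);
}

/*
 * Then push the new values up the FSM tree all the way to the root
 * (UpdateFreeSpaceMap), so that a normal top-down GetPageWithFreeSpace()
 * search can find the blocks immediately.
 */
UpdateFreeSpaceMap(relation, firstBlock, blockNum, freespace);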
Re: [HACKERS] Relation extension scalability
On Tue, Mar 29, 2016 at 2:09 PM, Dilip Kumar wrote: > > Attaching new version v18 - Some cleanup work on v17. - Improved *UpdateFreeSpaceMap *function. - Performance and space utilization are same as V17 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com multi_extend_v18.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Tue, Mar 29, 2016 at 10:43 PM, Andres Freund wrote: > My gut feeling is that we should do both 1) and 2). > > Dilip, could you test performance of reducing ppc's spinlock to 1 byte? > Cross-compiling suggest that doing so "just works". I.e. replace the > #if defined(__ppc__) typedef from an int to a char. > I set that, but after that it hangs; even initdb hangs..

int
s_lock(volatile slock_t *lock, const char *file, int line)
{
    SpinDelayStatus delayStatus;

    init_spin_delay(&delayStatus, (Pointer) lock, file, line);

    while (TAS_SPIN(lock))
    {
        make_spin_delay(&delayStatus);
    }

I did not try to find the reason; I just built in debug mode and found that it never comes out of this loop. I did a clean build multiple times but the problem persists. Does it have a dependency on some other variable defined under PPC in some other place? I don't know..

/* PowerPC */
#if defined(__ppc__) || defined(__powerpc__) || defined(__ppc64__) || defined(__powerpc64__)
#define HAS_TEST_AND_SET

typedef unsigned int slock_t;   --> changed like this

#define TAS(lock) tas(lock)

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Wed, Mar 30, 2016 at 7:19 AM, Robert Haas wrote: Thanks for review and better comments.. > hio.c: In function ‘RelationGetBufferForTuple’: > hio.c:231:20: error: ‘freespace’ may be used uninitialized in this > function [-Werror=uninitialized] > hio.c:185:7: note: ‘freespace’ was declared here > hio.c:231:20: error: ‘blockNum’ may be used uninitialized in this > function [-Werror=uninitialized] > hio.c:181:14: note: ‘blockNum’ was declared here > I have fixed those in v20 > > There's nothing whatsoever to prevent RelationExtensionLockWaiterCount > from returning 0. > > It's also rather ugly that the call to UpdateFreeSpaceMap() assumes > that the last value returned by PageGetHeapFreeSpace() is as good as > any other, but maybe we can just install a comment explaining that > point; there's not an obviously better approach that I can see. > Added comments.. + if (lockWaiters) + /* + * Here we are using same freespace for all the Blocks, but that + * is Ok, because all are newly added blocks and have same freespace + * And even some block which we just added to FreespaceMap above, is + * used by some backend and now freespace is not same, will not harm + * anything, because actual freespace will be calculated by user + * after getting the page. + */ + UpdateFreeSpaceMap(relation, firstBlock, blockNum, freespace); Does this look good ? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com multi_extend_v20.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Relation extension scalability
On Wed, Mar 30, 2016 at 7:51 AM, Dilip Kumar wrote: > + if (lockWaiters) > + /* > + * Here we are using same freespace for all the Blocks, but that > + * is Ok, because all are newly added blocks and have same freespace > + * And even some block which we just added to FreespaceMap above, is > + * used by some backend and now freespace is not same, will not harm > + * anything, because actual freespace will be calculated by user > + * after getting the page. > + */ > + UpdateFreeSpaceMap(relation, firstBlock, blockNum, freespace); > > > Does this look good ? Done in better way.. + lockWaiters = RelationExtensionLockWaiterCount(relation); + + if (lockWaiters == 0) + return; + -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com multi_extend_v21.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Relation extension scalability
On Tue, Mar 29, 2016 at 10:08 AM, Amit Kapila wrote: > Yes, that makes sense. One more point is that if the reason for v13 > giving better performance is extra blocks (which we believe in certain > cases can leak till the time Vacuum updates the FSM tree), do you think it > makes sense to once test by increasing lockWaiters * 20 limit to may > be lockWaiters * 25 or lockWaiters * 30. I tested COPY 1 record, by increasing the number of blocks just to find out why we are not as good as V13 with extraBlocks = Min( lockWaiters * 40, 2048) and got below results.. COPY 1 Client Patch(extraBlocks = Min( lockWaiters * 40, 2048)) - 16 752 32 708 This proves that main reason of v13 being better is its adding extra blocks without control. though v13 is better than these results, I think we can get that also by changing multiplier and max limit . But I think we are ok with the max size as 4MB (512 blocks) right?. Does this test make sense ? Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Thu, Dec 24, 2015 at 8:26 AM, Michael Paquier wrote: > On Sun, Dec 13, 2015 at 11:05 PM, Amit Kapila > wrote: > > On Fri, Dec 11, 2015 at 6:34 PM, Andres Freund > wrote: > I was looking into this patch, overall patch looks good to me, I want to know what is the current state of this patch, is there some pending task in this patch ? Patch was not applying on head so i have re based it and re based version is attached in the mail. I have done some performance testing also.. Summary: --- 1. In my test for readonly workload i have observed some improvement for scale factor 1000. 2. I have also observed some regression with scale factor 300 (I can't say it's actual regression or just run to run variance), I thought that may be problem with lower scale factor so tested with scale factor 100 but with s.f. 100 looks fine. Machine Detail: cpu : POWER8 cores: 24 (192 with HT) Non Default Parameter: Shared Buffer= 30GB max_wal_size= 10GB max_connections=500 Test1: pgbench -i -s 1000 postgres pgbench -c$ -j$ -Mprepared -S postgres Client Base Pached 1 19753 19493 32 344059 336773 64 495708 540425 128 564358 685212 256 466562 639059 Test2: pgbench -i -s 300 postgres pgbench -c$ -j$ -Mprepared -S postgres Client Base Pached 1 2055519404 32 375919 332670 64 509067 440680 128431346 415121 256380926 379176 Test3: pgbench -i -s 100 postgres pgbench -c$ -j$ -Mprepared -S postgres Client Base Pached 1 2055519404 32 375919 332670 64 509067 440680 128431346 415121 256380926 379176 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c index b423aa7..04862d7 100644 --- a/src/backend/storage/buffer/buf_init.c +++ b/src/backend/storage/buffer/buf_init.c @@ -121,12 +121,9 @@ InitBufferPool(void) BufferDesc *buf = GetBufferDescriptor(i); CLEAR_BUFFERTAG(buf->tag); - buf->flags = 0; - buf->usage_count = 0; - buf->refcount = 0; - buf->wait_backend_pid = 0; - SpinLockInit(&buf->buf_hdr_lock); + pg_atomic_init_u32(&buf->state, 0); + buf->wait_backend_pid = 0; buf->buf_id = i; diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 7141eb8..f69faeb 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -51,7 +51,6 @@ #include "utils/resowner_private.h" #include "utils/timestamp.h" - /* Note: these two macros only work on shared buffers, not local ones! */ #define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ)) #define BufferGetLSN(bufHdr) (PageGetLSN(BufHdrGetBlock(bufHdr))) @@ -126,7 +125,7 @@ static BufferDesc *PinCountWaitBuf = NULL; * entry using ReservePrivateRefCountEntry() and then later, if necessary, * fill it with NewPrivateRefCountEntry(). That split lets us avoid doing * memory allocations in NewPrivateRefCountEntry() which can be important - * because in some scenarios it's called with a spinlock held... + * because in some scenarios it's called with a header lock held... 
*/ static struct PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES]; static HTAB *PrivateRefCountHash = NULL; @@ -775,9 +774,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, */ if (isLocalBuf) { - /* Only need to adjust flags */ - Assert(bufHdr->flags & BM_VALID); - bufHdr->flags &= ~BM_VALID; + Assert(pg_atomic_read_u32(&bufHdr->state) & BM_VALID); + pg_atomic_fetch_and_u32(&bufHdr->state, ~BM_VALID); } else { @@ -789,8 +787,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, do { LockBufHdr(bufHdr); -Assert(bufHdr->flags & BM_VALID); -bufHdr->flags &= ~BM_VALID; +Assert(pg_atomic_read_u32(&bufHdr->state) & BM_VALID); +pg_atomic_fetch_and_u32(&bufHdr->state, ~BM_VALID); UnlockBufHdr(bufHdr); } while (!StartBufferIO(bufHdr, true)); } @@ -808,7 +806,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, * it's not been recycled) but come right back here to try smgrextend *
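For anyone skimming the diff, the essence of the change: the refcount, usage count and flag bits that used to be separate BufferDesc fields protected by the buffer-header spinlock are packed into the single pg_atomic_uint32 state field, so that hot paths like PinBuffer can use a compare-and-exchange loop instead of taking the spinlock. Very roughly (the BUF_* macro names for the packed fields are the ones the patch series introduces; treat the details as illustrative):

uint32      old_state = pg_atomic_read_u32(&buf->state);

for (;;)
{
    uint32      new_state = old_state;

    /* bump the shared reference count inside the packed word */
    new_state += BUF_REFCOUNT_ONE;

    /* and the usage count, saturating at BM_MAX_USAGE_COUNT */
    if (BUF_STATE_GET_USAGECOUNT(new_state) < BM_MAX_USAGE_COUNT)
        new_state += BUF_USAGECOUNT_ONE;

    /* on success we hold the pin; on failure old_state is refreshed, retry */
    if (pg_atomic_compare_exchange_u32(&buf->state, &old_state, new_state))
        break;
}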
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Tue, Jan 19, 2016 at 5:44 PM, Michael Paquier wrote: > > Test3: > > pgbench -i -s 100 postgres > > pgbench -c$ -j$ -Mprepared -S postgres > > > > Client Base Pached > > > > 1 2055519404 > > 32 375919 332670 > > 64 509067 440680 > > 128431346 415121 > > 256380926 379176 > > It seems like you did a copy-paste of the results with s=100 and > s=300. Both are showing the exact same numbers. Oops, my mistake, re-pasting the correct results for s=100 pgbench -i -s 100 postgres pgbench -c$ -j$ -Mprepared -S postgres Client Base Pached 12054820791 32 372633 355356 64 532052 552148 128412755 478826 256 346701 372057 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex
On Tue, Jan 12, 2016 at 1:50 PM, Aleksander Alekseev < a.aleks...@postgrespro.ru> wrote: > > increasing number of lock partitions (see columns "no locks", "lwlock" > and "spinlock array"). Previously it couldn't help us (see "master" > column) because of a bottleneck. > > If you have any other questions regarding this patch please don't > hesitate to ask. > I have done some performance bench marking for this patch.(dynahash-optimization-v10-step1.patch) Machine Detail: cpu : POWER8 cores: 24 (192 with HT) Non Default Parameter: Shared Buffer= 30GB max_wal_size= 10GB max_connections=500 Test1: *schema.sql* and *pgbench.sql* are same files which Aleksander has attached in first mail of the thread. psql -d postgres -f schema.sql pgbench -c$ -j$ -f pgbench.sql postgres clientbasepatch 1145145 2262258 4453472 8749732 1611141128 3217271789 6427292793 12835343520 With this test i did not see any improvement, though i think it should improve the performance, do you have any suggestion to see the results same as yours ? I have also dump stacks using some script and I have observed many stacks dumps as you mentioned in initial thread. And after patch, I found that number of lock waiting stacks are reduced. Stack Dump: --- #0 0x7f25842899a7 in semop () from /lib64/libc.so.6 #1 0x007116d0 in PGSemaphoreLock (sema=0x7f257cb170d8) at pg_sema.c:387 #2 0x0078955f in LWLockAcquire (lock=0x7f247698a980, mode=LW_EXCLUSIVE) at lwlock.c:1028 #3 0x007804c4 in LockAcquireExtended #4 0x0077fe91 in LockAcquire #5 0x0077e862 in LockRelationOid #6 0x0053bc7b in find_inheritance_children #7 0x0053bd4f in find_all_inheritors #8 0x006fc0a2 in expand_inherited_rtentry #9 0x006fbf91 in expand_inherited_tables I have tried to analyze using perf also, I can see that amount of time taken in hash_search_with_hash_value is reduced from 3.86%(master) to 1.72%(patch). I have plan to do further investigation, in different scenarios of dynahash. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
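For anyone else following along, my understanding of what dynahash-optimization-v10-step1.patch changes (names below are how I read the patch, simplified): rather than a single free list protected by HASHHDR.mutex, the hash header keeps an array of free lists, each with its own spinlock, and allocations are spread across them by hash code:

typedef struct
{
    slock_t     mutex;          /* spinlock protecting this free list */
    long        nentries;       /* number of entries allocated from it */
    HASHELEMENT *freeList;      /* chain of free elements */
} FreeListData;

struct HASHHDR
{
    FreeListData freeList[NUM_FREELISTS];
    /* ... rest of the header unchanged ... */
};

/* only partitioned (shared) hash tables spread load across the lists */
#define FREELIST_IDX(hctl, hashcode) \
    (IS_PARTITIONED(hctl) ? (hashcode) % NUM_FREELISTS : 0)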
Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex
On Fri, Jan 22, 2016 at 3:44 PM, Aleksander Alekseev < a.aleks...@postgrespro.ru> wrote: > This patch affects header files. By any chance didn't you forget to run > `make clean` after applying it? As we discussed above, when you > change .h files autotools doesn't rebuild dependent .c files: > Yes, actually I always compile using "make clean; make -j20; make install". If you want, I will run it again, maybe today or tomorrow, and post the results. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Sat, Jan 23, 2016 at 12:19 PM, Amit Kapila wrote > > > Few comments about patch: > Thanks for reviewing.. > 1. > Patch is not getting compiled. > > 1>src/backend/access/heap/hio.c(480): error C2065: 'buf' : undeclared > identifier > 1>src/backend/access/heap/hio.c(480): error C2065: 'buf' : undeclared > identifier > 1>src/backend/access/heap/hio.c(480): error C2065: 'buf' : undeclared > identifier > Oh, My mistake, my preprocessor is ignoring this error and relacing it with BLKSIZE I will fix in next version of patch. > 2. > ! page = BufferGetPage(buffer); > ! PageInit(page, BufferGetPageSize > (buf), 0); > ! > ! freespace = PageGetHeapFreeSpace(page); > ! > MarkBufferDirty(buffer); > ! UnlockReleaseBuffer(buffer); > ! > RecordPageWithFreeSpace(relation, BufferGetBlockNumber(buffer), freespace); > > What is the need to mark page dirty here, won't it automatically > be markerd dirty once the page is used? I think it is required > if you wish to WAL-log this action. > These pages are not going to be used immediately and we have done PageInit so i think it should be marked dirty before adding to FSM, so that if buffer get replaced out then it flushes the init data. > 3. I think you don't need to multi-extend a relation if > HEAP_INSERT_SKIP_FSM is used, as for that case it anyways try to > get a new page by extending a relation. > Yes, if HEAP_INSERT_SKIP_FSM is used and we use multi-extend atleast in current transaction it will not take pages from FSM and everytime it will do multi-extend, however pages will be used if there are parallel backend, but still not a good idea to extend every time in multiple chunk in current backend. So i will change this.. 4. Again why do you need this multi-extend optimization for local > relations (those only accessible to current backend)? > I think we can change this while adding the table level "extend_by_blocks" for local table we will not allow this property, so no need to change at this place. What do you think ? 5. Do we need this for nbtree as well? One way to check that > is by Copying large data in table having index. > > Ok, i will try this test and update. > Note: Test with both data and WAL on Magnetic Disk : No significant >> improvement visible >> -- I think wall write is becoming bottleneck in this case. >> >> > In that case, can you try the same test with un-logged tables? > OK > > Also, it is good to check the performance of patch with read-write work > load to ensure that extending relation in multiple-chunks should not > regress such cases. > Ok > > Currently i have kept extend_num_page as session level parameter but i >> think later we can make this as table property. >> Any suggestion on this ? >> >> > I think we should have a new storage_parameter at table level > extend_by_blocks or something like that instead of GUC. The > default value of this parameter should be 1 which means retain > current behaviour by default. > +1 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Sat, Jan 23, 2016 at 4:28 PM, Amit Kapila wrote: > I found one more problem with patch. > > ! UnlockReleaseBuffer(buffer); > ! RecordPageWithFreeSpace(relation, BufferGetBlockNumber(buffer), > freespace); > > You can't call BufferGetBlockNumber(buffer) after releasing > the pin on buffer which will be released by > UnlockReleaseBuffer(). Get the block number before unlocking > the buffer. > Good catch, will fix this also in next version. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex
On Wed, Jan 27, 2016 at 5:15 PM, Aleksander Alekseev < a.aleks...@postgrespro.ru> wrote: > Most likely HASHHDR.mutex is not only bottleneck in your case so my > patch doesn't help much. Unfortunately I don't have access to any > POWER8 server so I can't investigate this issue. I suggest to use a > gettimeofday trick I described in a first message of this thread. Its > time consuming but it gives a clear understanding which code is keeping > a lock. > I have also tested the pgbench read-only case when the data doesn't fit into shared buffers, because in this case HASHHDR.mutex access will be quite frequent. And in this case I do see a very good improvement on the POWER8 server. Test Result: Scale Factor: 300 Shared Buffer: 512MB pgbench -c$ -j$ -S -M Prepared postgres
Client    base      patch
64        222173    318318
128       195805    262290
-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Mon, Jan 25, 2016 at 11:59 AM, Dilip Kumar wrote: 1. >> Patch is not getting compiled. >> >> 1>src/backend/access/heap/hio.c(480): error C2065: 'buf' : undeclared >> identifier >> > Oh, My mistake, my preprocessor is ignoring this error and relacing it > with BLKSIZE > FIXED > 3. I think you don't need to multi-extend a relation if >> HEAP_INSERT_SKIP_FSM is used, as for that case it anyways try to >> get a new page by extending a relation. >> > > So i will change this.. > FIXED > > 4. Again why do you need this multi-extend optimization for local >> relations (those only accessible to current backend)? >> > > I think we can change this while adding the table level > "extend_by_blocks" for local table we will not allow this property, so no > need to change at this place. > What do you think ? > Now I have added table level parameter for specifying the number of blocks, So do you think that we still need to block it, as user can control it, Moreover i think if user want to use for local table then no harm in it at least by extending in one shot he avoid multiple call of Extension lock, though there will be no contention. What is your opinion ? 5. Do we need this for nbtree as well? One way to check that >> is by Copying large data in table having index. >> >> Ok, i will try this test and update. > I tried to load data to table with Index and tried to analyze bottleneck using perf, and found btcompare was taking maximum time, still i don't deny that it can not get benefited by multi extend. So i tried quick POC for this, but i realize that even though we extend multiple page for index and add to FSM, it will be updated only in current page, Information about new free page will be propagated to root page only during vacuum, And unlike heap Btree always search FSM from root and it will not find the extra added pages. So i think we can analyze this topic separately for index multi extend and find is there are cases where index multi extend can give benefit. Note: Test with both data and WAL on Magnetic Disk : No significant >>> improvement visible >>> -- I think wall write is becoming bottleneck in this case. >>> >>> >> In that case, can you try the same test with un-logged tables? >> > Date with un-logged table Test Init: ./psql -d postgres -c "COPY (select g.i::text FROM generate_series(1, 1) g(i)) TO '/tmp/copybinary' WITH BINARY"; echo COPY data from '/tmp/copybinary' WITH BINARY; > copy_script ./psql -d postgres -c "create unlogged table data (data text)" --> base ./psql -d postgres -c "create unlogged table data (data text) with(extend_by_blocks=50)" --patch test_script: ./psql -d postgres -c "truncate table data" ./psql -d postgres -c "checkpoint" ./pgbench -f copy_script -T 120 -c$ -j$ postgres Shared Buffer48GB Table:Unlogged Table ench -c$ -j$ -f -M Prepared postgres ClientsBasepatch 1178 180 2337 338 4265 601 8167 805 Also, it is good to check the performance of patch with read-write work >> load to ensure that extending relation in multiple-chunks should not >> regress such cases. >> > > Ok > I did not find in regression in normal case. Note: I tested it with previous patch extend_num_pages=10 (guc parameter) so that we can see any impact on overall system. > Currently i have kept extend_num_page as session level parameter but i >>> think later we can make this as table property. >>> Any suggestion on this ? >>> >>> >> I think we should have a new storage_parameter at table level >> extend_by_blocks or something like that instead of GUC. 
The >> default value of this parameter should be 1 which means retain >> current behaviour by default. >> > +1 > Changed it to table level storage parameter. I kept max_limit to 100 any suggestion on this ? should we increase it ? latest patch is attached.. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c index 86b9ae1..76f9a21 100644 --- a/src/backend/access/common/reloptions.c +++ b/src/backend/access/common/reloptions.c @@ -268,6 +268,16 @@ static relopt_int intRelOpts[] = #endif }, + { + { + "extend_by_blocks", + "Number of blocks to be added to relation in every extend call", + RELOPT_KIND_HEAP, + AccessExclusiveLock + }, + 1, 1, 100 + }, + /* list terminator */ {{NULL}} }; @@ -1291,7 +1301,9 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind) {"autovacuum
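Once the reloption is in place, the extension code can pick it up from the relation in the usual way; I would expect something along these lines (the field and accessor names here are illustrative, not necessarily what the patch uses):

typedef struct StdRdOptions
{
    /* ... existing fields ... */
    int         extend_by_blocks;   /* blocks to add per extension */
} StdRdOptions;

/* fall back to the default of 1 block when no options are set */
#define RelationGetExtendByBlocks(relation) \
    ((relation)->rd_options ? \
     ((StdRdOptions *) (relation)->rd_options)->extend_by_blocks : 1)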
Re: [HACKERS] Relation extension scalability
On Thu, Jan 28, 2016 at 4:53 PM, Dilip Kumar wrote: > I did not find in regression in normal case. > Note: I tested it with previous patch extend_num_pages=10 (guc parameter) > so that we can see any impact on overall system. > I just forgot to mention that I have run the pgbench read-write case. S.F: 300 ./pgbench -j $ -c $ -T 1800 -M Prepared postgres Tested with 1, 8, 16, 32 and 64 threads and did not see any regression with the patch. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Sat, Jan 30, 2016 at 3:05 AM, Merlin Moncure wrote: > Probably want to run for at least 5 minutes via -T 300 Last time I ran for 5 minutes and took the median of three runs; I just missed mentioning "-T 300" in the mail.. Looking at the results with scale factors 1000 and 100, I don't see any reason why it would regress with scale factor 300. Still, I will run the test again with scale factor 300, and this time I am planning to run 2 cases. 1. when data fits in shared buffer 2. when data doesn't fit in shared buffer. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Sun, Jan 31, 2016 at 11:44 AM, Dilip Kumar wrote: > By looking at the results with scale factor 1000 and 100 i don't see any > reason why it will regress with scale factor 300. > > So I will run the test again with scale factor 300 and this time i am > planning to run 2 cases. > 1. when data fits in shared buffer > 2. when data doesn't fit in shared buffer. > I have run the test again with 300 S.F and found no regression, in fact there is improvement with the patch like we saw with 1000 scale factor. Shared Buffer= 8GB max_connections=150 Scale Factor=300 ./pgbench -j$ -c$ -T300 -M prepared -S postgres ClientBasePatch 11974419382 8125923126395 3231393151 64387339496830 128306412350610 Shared Buffer= 512MB max_connections=150 Scale Factor=300 ./pgbench -j$ -c$ -T300 -M prepared -S postgres ClientBasePatch 11716916454 8108547105559 32241619262818 64206868233606 128 137084 217013 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Relation extension scalability
On Fri, Feb 5, 2016 at 4:50 PM, Amit Kapila wrote: > > Could you also measure how this behaves for an INSERT instead of a COPY > > workload? > > I think such a test will be useful. > I have measured the performance with insert to see the behavior when it don't use strategy. I have tested multiple option, small tuple, big tuple, data fits in shared buffer and doesn't fit in shared buffer. Observation: -- Apart from this test I have also used some tool which can take a many stack traces with some delay. 1. I have observed with base code (when data don't fits in shared buffer) almost all the stack traces are waiting on "LockRelationForExtension" and many on "FlushBuffer" also (Flushing the dirty buffer). Total Stack Captured: 204, FlushBuffer: 13, LockRelationForExtension: 187 (This test run with 8 thread (shared buf 512MB) and after every 5 second stack is captured.) 2. If I change shared buf 48GB then Obviously FlushBuffer disappeared but still LockRelationForExtension remains in very high number. 3.Performance of base code in both the cases when Data fits in shared buffers or doesn't fits in shared buffer remain very low and non-scaling(we can see that in below results). Test--1 (big record insert and Data fits in shared Buffer) setup ./psql -d postgres -c "create table test_data(a int, b text)" ./psql -d postgres -c "insert into test_data values(generate_series(1,1000),repeat('x', 1024))" ./psql -d postgres -c "create table data (a int) with(extend_by_blocks=$variable)" {create table data (a int) for base code} echo "insert into data select * from test_data;" >> copy_script test: - shared_buffers=48GB max_wal_size=20GB checkpoint_timeout=10min ./pgbench -c $ -j $ -f copy_script -T 120 postgres clientbaseextend_by_block=50extend_by_block=1000 1113115118 450 220216 843 202302 Test--2 (big record insert and Data doesn't fits in shared Buffer) -- setup: --- ./psql -d postgres -c "create table test_data(a int, b text)" ./psql -d postgres -c "insert into test_data values(generate_series(1,1000),repeat('x', 1024))" ./psql -d postgres -c "create table data (a int) with(extend_by_blocks=1000)" echo "insert into data select * from test_data;" >> copy_script test: -- shared_buffers=512MB max_wal_size=20GB checkpoint_timeout=10min ./pgbench -c $ -j $ -f copy_script -T 120 postgres clientbaseextend_by_block=1000 1 125125 4 49 236 8 41 294 1639 279 Test--3 (small record insert and Data fits in shared Buffer) -- setup: ./psql -d postgres -c "create table test_data(a int)" ./psql -d postgres -c "insert into test_data values(generate_series(1,1))" ./psql -d postgres -c "create table data (a int) with(extend_by_blocks=20)" echo "insert into data select * from test_data;" >> copy_script test: - shared_buffers=48GB -c max_wal_size=20GB -c checkpoint_timeout=10min ./pgbench -c $ -j $ -f copy_script -T 120 postgres clientbasePatch-extend_by_block=20 1 137 143 2 269 250 4 377 443 8 170 690 16145 745 *All test done with Data on MD and Wal on SSD Note: Last patch have max limit of extend_by_block=100 so for taking performance with extend_by_block=1000 i localy changed it. I will send the modified patch once we finalize on which approach to proceed with. > > I'm doubtful that anything that does the victim buffer search while > > holding the extension lock will actually scale in a wide range of > > scenarios. 
> > > > I think the problem for victim buffer could be visible if the blocks > are dirty and it needs to write the dirty buffer and especially as > the patch is written where after acquiring the extension lock, it again > tries to extend the relation without checking if it can get a page with > space from FSM. It seems to me that we should re-check the > availability of page because while one backend is waiting on extension > lock, other backend might have added pages. To re-check the > availability we might want to use something similar to > LWLockAcquireOrWait() semantics as used during WAL writing. > I will work on this in next version... -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
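In code terms the recheck would look roughly like this in RelationGetBufferForTuple (just the idea, not a finished patch):

LockRelationForExtension(relation, ExclusiveLock);

/*
 * While we were waiting for the lock, other backends may already have
 * extended the relation and advertised the new pages in the FSM, so
 * look there again before extending ourselves.
 */
targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
if (targetBlock != InvalidBlockNumber)
{
    UnlockRelationForExtension(relation, ExclusiveLock);
    goto loop;                  /* go back and try to use that page */
}

/* otherwise, extend the relation as before */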
Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c
On Wed, Aug 17, 2016 at 6:10 PM, Robert Haas wrote: >> If you are making changes in plpgsql_validator(), then shouldn't we >> make changes in plperl_validator() or plpython_validator()? I see >> that you have made changes to function CheckFunctionValidatorAccess() >> which seems to be getting called from *_validator() (* indicates >> plpgsql/plperl/plpython) functions. Is there a reason for changing >> the *_validator() function? Yes you are right, making changes in CheckFunctionValidatorAccess is sufficient, no need to make changes in *_validator (* indicates plpgsql/plperl/plpython) functions. I have changed this in attached patch.. > > Yeah, when I glanced briefly at this patch, I found myself wondering > whether all of these call sites were actually reachable (and why the > patch contained no test cases to prove it). I have also added test cases to cover all my changes. Basically this patch changes error at three places. 1. getBaseTypeAndTypmod: This is being called from domain_in exposed function (domain_in-> domain_state_setup-> getBaseTypeAndTypmod). Though this function is being called from many other places which are not exposed function, but I don't this will have any imapct, will this ? 2. lookup_type_cache: This is being called from record_in (record_in->lookup_rowtype_tupdesc-> lookup_rowtype_tupdesc_internal->lookup_type_cache). 3. CheckFunctionValidatorAccess: This is being called from all language validator functions. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com cache_lookup_failure_v4.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
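To show the flavor of the change (illustrative of the pattern rather than an exact hunk from the patch): where the OID comes straight from user input, for example the function OID passed to the *_validator functions and checked in CheckFunctionValidatorAccess, the internal "cache lookup failed" elog becomes a user-facing error, along these lines:

tuple = SearchSysCache1(PROCOID, ObjectIdGetDatum(functionOid));
if (!HeapTupleIsValid(tuple))
    ereport(ERROR,
            (errcode(ERRCODE_UNDEFINED_FUNCTION),
             errmsg("function with OID %u does not exist", functionOid)));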
Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c
On Wed, Sep 7, 2016 at 8:52 AM, Haribabu Kommi wrote: > I reviewed and tested the patch. The changes are fine. > This patch provides better error message compared to earlier. > > Marked the patch as "Ready for committer" in commit-fest. Thanks for the review ! -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c
On Thu, Sep 8, 2016 at 2:23 AM, Tom Lane wrote: > I really don't like this solution. Changing this one function, out of the > dozens just like it in lsyscache.c, seems utterly random and arbitrary. > > I'm actually not especially convinced that passing domain_in a random > OID needs to not produce an XX000 error. That isn't a user-facing > function and people shouldn't be calling it by hand at all. The same > goes for record_in. But if we do want those cases to avoid this, > I think it's appropriate to fix it in those functions, not decree > that essentially-random other parts of the system should bear the > responsibility. (Thought experiment: if we changed the order of > operations in domain_in so that getBaseTypeAndTypmod were no longer > the first place to fail, would we undo this change and then change > wherever the failure had moved to? That's pretty messy.) > > In the case of record_in, it seems like it'd be easy enough to use > lookup_rowtype_tupdesc_noerror() instead of lookup_rowtype_tupdesc() > and then throw a user-facing error right there. Actually when we are passing invalid type to "record_in", error is thrown in "lookup_type_cache" function, and this function is not taking "no_error" input (it's only limited to lookup_rowtype_tupdesc_internal). One option is to pass "no_error" parameter to "lookup_type_cache" function also, and throw error only in record_in function. But problem is, "lookup_type_cache" has lot of references. And also will it be good idea to throw one generic error instead of multiple specific errors. ? Another option is to do same for "record_in" also what you have suggested for domain_in (an extra syscache lookup). ? >Not sure about a > nice solution for domain_in. We might have to resort to an extra > syscache lookup there, but even if we did, it should happen only > once per query so that's not very awful. > >> 3. CheckFunctionValidatorAccess: This is being called from all >> language validator functions. > > This part seems reasonable, since the validator functions are documented > as something users might call, and CheckFunctionValidatorAccess seems > like an apropos place to handle it. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c
On Thu, Sep 8, 2016 at 7:21 PM, Tom Lane wrote:
> We could make it work like that without breaking the ABI if we were
> to add a NOERROR bit to the allowed "flags". However, after looking
> around a bit I'm no longer convinced what I said above is a good idea.
> In particular, if we put the responsibility of reporting the error on
> callers then we'll lose the ability to distinguish "no such pg_type OID"
> from "type exists but it's only a shell". So I'm now thinking it's okay
> to promote lookup_type_cache's elog to a full ereport, especially since
> it's already using ereport for some of the related error cases such as
> the shell-type failure.

Okay.

> That still leaves us with what to do for domain_in. A really simple
> fix would be to move the InitDomainConstraintRef call before the
> getBaseTypeAndTypmod call, because the former fetches the typcache
> entry and would then benefit from lookup_type_cache's ereport.
> But that seems a bit magic.
>
> I'm tempted to propose that domain_state_setup start with
>
> typentry = lookup_type_cache(domainType, TYPECACHE_DOMAIN_INFO);
> if (typentry->typtype != TYPTYPE_DOMAIN)
>     ereport(ERROR,
>             (errcode(ERRCODE_DATATYPE_MISMATCH),
>              errmsg("type %s is not a domain",
>                     format_type_be(domainType))));
>
> removing the need to verify that after getBaseTypeAndTypmod.

Seems like a better option; I have done it this way. Attaching the modified patch.

> The cache loading is basically free because InitDomainConstraintRef
> would do it anyway; so the extra cost here is only one dynahash search.
> You could imagine buying back those cycles by teaching the typcache
> to be able to cache the result of getBaseTypeAndTypmod, but I'm doubtful
> that we really care. This whole setup sequence only happens once per
> query anyway.

Agreed.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com

cache_lookup_failure_v5.patch Description: Binary data

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c
On Fri, Sep 9, 2016 at 6:51 PM, Tom Lane wrote: > Pushed with cosmetic adjustments --- mostly, I thought we needed some > comments about the topic. Okay, Thanks. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Mon, Sep 5, 2016 at 9:33 AM, Amit Kapila wrote:
> USE_CONTENT_LOCK on my windows box, you can try by commenting that as
> well, if it works for you). So, in short we have to compare three
> approaches here.
>
> 1) Group mode to reduce CLOGControlLock contention
> 2) Use granular locking model
> 3) Use atomic operations

I have tested performance with approach 1 and approach 2.

1. Transaction (script.sql): I have used the transaction below to run my benchmark. We can argue that this may not be an ideal workload, but I tested it to put more load on ClogControlLock during transaction commit.

---
\set aid random (1,3000)
\set tid random (1,3000)

BEGIN;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
SAVEPOINT s1;
SELECT tbalance FROM pgbench_tellers WHERE tid = :tid for UPDATE;
SAVEPOINT s2;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
END;
---

2. Results
./pgbench -c $threads -j $threads -T 10 -M prepared postgres -f script.sql
scale factor: 300

Clients   head (tps)   grouplock (tps)   granular (tps)
-------   ----------   ---------------   --------------
128       29367        39326             37421
180       29777        37810             36469
256       28523        37418             35882

grouplock --> 1) Group mode to reduce CLOGControlLock contention
granular  --> 2) Use granular locking model

I will test with the 3rd approach also, whenever I get time.

3. Summary:
1. I can see that over head we are gaining almost ~30% performance at the higher client counts (128 and beyond).
2. Group lock is ~5% better compared to granular lock.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
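For readers who have not opened the patches: approach (1), the "group mode", batches the CLOG status updates of many backends behind a single lock acquisition, in the same spirit as ProcArrayGroupClearXid. The sketch below is only my illustration of that general leader/follower pattern; every name in it (GroupUpdateSlot, group_first, group_update_xid_status) is made up, and it is not code from either patch.

    /*
     * Illustrative sketch of a "group update" of transaction status.
     * group_first is assumed to start out as INVALID_SLOT.
     */
    #define INVALID_SLOT	0xFFFFFFFF

    typedef struct GroupUpdateSlot
    {
        TransactionId xid;          /* transaction whose status to set */
        XidStatus     status;       /* target status */
        uint32        next;         /* next queued slot, or INVALID_SLOT */
        volatile bool done;         /* set by the leader when finished */
    } GroupUpdateSlot;

    static pg_atomic_uint32 group_first;    /* head of the pending queue */
    static GroupUpdateSlot slots[1024];     /* one slot per backend */

    static void
    group_update_xid_status(uint32 myslot)
    {
        uint32  head = pg_atomic_read_u32(&group_first);

        slots[myslot].done = false;

        /* push ourselves onto the queue with a CAS loop */
        for (;;)
        {
            slots[myslot].next = head;
            if (pg_atomic_compare_exchange_u32(&group_first, &head, myslot))
                break;
            /* on failure, 'head' was updated to the current value; retry */
        }

        if (head != INVALID_SLOT)
        {
            /* someone was queued before us: a leader will update us too */
            while (!slots[myslot].done)
                pg_usleep(10);      /* real code would sleep on a semaphore */
            return;
        }

        /* we are the leader: take the lock once and drain the whole queue */
        LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
        head = pg_atomic_exchange_u32(&group_first, INVALID_SLOT);
        while (head != INVALID_SLOT)
        {
            uint32  next = slots[head].next;

            /* ... update the CLOG bits for slots[head].xid here ... */
            slots[head].done = true;
            head = next;
        }
        LWLockRelease(CLogControlLock);
    }

The point of the pattern is that under contention only one backend pays for the exclusive CLogControlLock acquisition per batch, which is why the benefit shows up mostly at high client counts.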
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Wed, Sep 14, 2016 at 10:25 AM, Dilip Kumar wrote: > I have tested performance with approach 1 and approach 2. > > 1. Transaction (script.sql): I have used below transaction to run my > bench mark, We can argue that this may not be an ideal workload, but I > tested this to put more load on ClogControlLock during commit > transaction. > > --- > \set aid random (1,3000) > \set tid random (1,3000) > > BEGIN; > SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE; > SAVEPOINT s1; > SELECT tbalance FROM pgbench_tellers WHERE tid = :tid for UPDATE; > SAVEPOINT s2; > SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE; > END; > --- > > 2. Results > ./pgbench -c $threads -j $threads -T 10 -M prepared postgres -f script.sql > scale factor: 300 > Clients head(tps)grouplock(tps) granular(tps) > --- - -- --- > 12829367 3932637421 > 18029777 3781036469 > 25628523 3741835882 > > > grouplock --> 1) Group mode to reduce CLOGControlLock contention > granular --> 2) Use granular locking model > > I will test with 3rd approach also, whenever I get time. > > 3. Summary: > 1. I can see on head we are gaining almost ~30 % performance at higher > client count (128 and beyond). > 2. group lock is ~5% better compared to granular lock. Forgot to mention that, this test is on unlogged tables. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Wed, Sep 14, 2016 at 8:59 PM, Robert Haas wrote:
> Sure, but you're testing at *really* high client counts here. Almost
> nobody is going to benefit from a 5% improvement at 256 clients.

I agree with your point, but here we need to consider one more thing: over head we are gaining ~30% with both of the approaches. So for comparing these two patches we can consider:

A. Other workloads (one can be as below)
   -> Load on CLogControlLock at commit (exclusive mode) + load on CLogControlLock at transaction status check (shared mode). I think we can mix (savepoints + updates).
B. Simplicity of the patch (if both are performing almost equally in all practical scenarios).
C. Based on the algorithm, whichever seems the winner.

I will try to test these patches with other workloads...

> You
> need to test 64 clients and 32 clients and 16 clients and 8 clients
> and see what happens there. Those cases are a lot more likely than
> these stratospheric client counts.

I tested with 64 clients as well:
1. Over head we are gaining ~15% with both the patches.
2. But group lock vs granular lock is almost the same.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Sat, Sep 17, 2016 at 6:54 AM, Tomas Vondra wrote:
> Although most of the changes probably does not matter much for unlogged
> tables (I planned to see how this affects regular tables, but as I see no
> difference for unlogged ones, I haven't done that yet).
>
> So the question is why Dilip sees +30% improvement, while my results are
> almost exactly the same. Looking at Dilip's benchmark, I see he only ran the
> test for 10 seconds, and I'm not sure how many runs he did, warmup etc.
> Dilip, can you provide additional info?

Actually I ran the test for 10 minutes. Sorry for the confusion (I copy-pasted my script, manually replaced the variables, and made a mistake).

My script is like this:

scale_factor=300
shared_bufs=8GB
time_for_reading=600

./postgres -c shared_buffers=8GB -c checkpoint_timeout=40min -c max_wal_size=20GB -c max_connections=300 -c maintenance_work_mem=1GB&
./pgbench -i -s $scale_factor --unlogged-tables postgres
./pgbench -c $threads -j $threads -T $time_for_reading -M prepared postgres -f ../../script.sql >> test_results.txt

I am taking the median of three readings. With the script below I can repeat my results every time (a 15% gain over head at 64 clients and a 30% gain over head at 128+ clients). I will repeat my test with 8, 16 and 32 clients and post the results soon.

> \set aid random (1,3000)
> \set tid random (1,3000)
>
> BEGIN;
> SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
> SAVEPOINT s1;
> SELECT tbalance FROM pgbench_tellers WHERE tid = :tid for UPDATE;
> SAVEPOINT s2;
> SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
> END;
> ---

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Mon, Sep 19, 2016 at 2:41 AM, Tomas Vondra wrote:
> But now that I look at the first post, I see it apparently used a plain
> tpc-b pgbench (with synchronous_commit=on) to show the benefits, which is
> the workload I'm running right now (results sometime tomorrow).

Good option; we can test plain TPC-B also. I have some more results: I have got the result for "update with no savepoint". Below is my script:

\set aid random (1,3000)
\set tid random (1,3000)
\set delta random(-5000, 5000)
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;

Results: (median of three 10-minute runs)

Clients   Head    GroupLock
16        21452   21589
32        42422   42688
64        42460   52590     (~23%)
128       22683   56825     (~150%)
256       18748   54867

With this workload I observed that the gain is bigger than with my previous workload (select for update with 2 savepoints).

Just to confirm whether the gain we are seeing is because of CLOG lock contention removal or something else, I ran 128 clients with perf for 5 minutes; below is my result. I can see that after applying the group lock patch, LWLockAcquire drops from 28% to just 4%, and that is almost entirely the CLOG lock.

On Head:
- 28.45% 0.24% postgres postgres [.] LWLockAcquire
   - LWLockAcquire
      + 53.49% TransactionIdSetPageStatus
      + 40.83% SimpleLruReadPage_ReadOnly
      + 1.16% BufferAlloc
      + 0.92% GetSnapshotData
      + 0.89% GetNewTransactionId
      + 0.72% LockBuffer
      + 0.70% ProcArrayGroupClearXid

After Group Lock Patch:
---
- 4.47% 0.26% postgres postgres [.] LWLockAcquire
   - LWLockAcquire
      + 27.11% GetSnapshotData
      + 21.57% GetNewTransactionId
      + 11.44% SimpleLruReadPage_ReadOnly
      + 10.13% BufferAlloc
      + 7.24% ProcArrayGroupClearXid
      + 4.74% LockBuffer
      + 4.08% LockAcquireExtended
      + 2.91% TransactionGroupUpdateXidStatus
      + 2.71% LockReleaseAll
      + 1.90% WALInsertLockAcquire
      + 0.94% LockRelease
      + 0.91% VirtualXactLockTableInsert
      + 0.90% VirtualXactLockTableCleanup
      + 0.72% MultiXactIdSetOldestMember
      + 0.66% LockRefindAndRelease

Next I will test "update with 2 savepoints" and "select for update with no savepoints". I will also test the granular lock and atomic lock patches in the next run.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Tue, Sep 20, 2016 at 8:37 AM, Amit Kapila wrote: > I think it is former (8 socket machine). I confirm this is 8 sockets machine(cthulhu) > > > You point related to high-client count is valid and I am sure it won't > give noticeable benefit at lower client-count as the the > CLOGControlLock contention starts impacting only at high-client count. > I am not sure if it is good idea to reject a patch which helps in > stabilising the performance (helps in falling off the cliff) when the > processes increases the number of cores (or hardware threads) > >> If you have to work that hard >> to find a big win, and tests under more reasonable conditions show no >> benefit, it's not clear to me that it's really worth the time we're >> all spending benchmarking and reviewing this, or the risk of bugs, or >> the damage to the SLRU abstraction layer. > > I agree with you unless it shows benefit on somewhat more usual > scenario's, we should not accept it. So shouldn't we wait for results > of other workloads like simple-update or tpc-b on bigger machines > before reaching to conclusion? +1 My test are under run, I will post it soon.. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Tue, Sep 20, 2016 at 9:15 AM, Dilip Kumar wrote:
> +1
>
> My test are under run, I will post it soon..

I have some more results now.

8 socket machine
10 min run (median of 3 runs)
synchronous_commit = off
scale factor = 300
shared buffers = 8GB

test1: Simple update (pgbench)

Clients   Head    GroupLock
32        45702   45402
64        46974   51627
128       35056   55362

test2: TPC-B (pgbench)

Clients   Head    GroupLock
32        27969   27765
64        33140   34786
128       21555   38848

Summary:
--
At 32 clients there is no gain; I think at this workload the CLOG lock is not a problem.
At 64 clients we can see a ~10% gain with simple update and ~5% with TPC-B.
At 128 clients we can see a > 50% gain.

Currently I have tested with synchronous_commit = off; later I can try with it on. I can also test at 80 clients; I think we will see some significant gain at that client count as well, but as of now I haven't tested it.

With the above results, what do we think? Should we continue our testing?

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Wed, Sep 21, 2016 at 8:47 AM, Dilip Kumar wrote:
> Summary:
> --
> At 32 clients no gain, I think at this workload Clog Lock is not a problem.
> At 64 Clients we can see ~10% gain with simple update and ~5% with TPCB.
> At 128 Clients we can see > 50% gain.
>
> Currently I have tested with synchronous commit=off, later I can try
> with on. I can also test at 80 client, I think we will see some
> significant gain at this client count also, but as of now I haven't
> yet tested.
>
> With above results, what we think ? should we continue our testing ?

I have done further testing with the TPC-B workload to see the impact on the performance gain of increasing the scale factor. Again, at 32 clients there is no gain, but at 64 clients the gain is 12% and at 128 clients it is 75%. This shows that the improvement with the group lock is better at a higher scale factor (at scale factor 300 the gain was 5% at 64 clients and 50% at 128 clients).

8 socket machine (kernel 3.10)
10 min run (median of 3 runs)
synchronous_commit = off
scale factor = 1000
shared buffers = 40GB

Test results:

client   head    group lock
32       27496   27178
64       31275   35205
128      20656   34490

LWLOCK_STATS: approx. block count on ClogControlLock ("lwlock main 11")

client   head    group lock
32       8       6
64       15      10
128      14      7

Note: These are approximate block counts; I have the detailed LWLOCK_STATS result in case someone wants to look into it. LWLOCK_STATS shows that the ClogControlLock block count is reduced by 25% at 32 clients, 33% at 64 clients and 50% at 128 clients.

Conclusion:
1. I think both LWLOCK_STATS and the performance data show that we get a significant contention reduction on ClogControlLock with the patch.
2. It also shows that even though we are not seeing any performance gain at 32 clients, there is still a contention reduction with the patch.

I am planning to do some more tests with a higher scale factor (3000 or more).

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Improve bitmap costing for lossy pages
On Thu, Oct 5, 2017 at 8:15 PM, Robert Haas wrote:
> On Sun, Sep 17, 2017 at 7:04 AM, Dilip Kumar wrote:
>> I used lossy_pages = max(0, total_pages - maxentries / 2). as
>> suggesed by Alexander.
>
> Does that formula accurately estimate the number of lossy pages?

I have printed total_pages, exact_pages and lossy_pages during planning time, and for testing purposes I tweaked the code a bit so that it doesn't consider lossy_pages in the cost calculation (same as the base code). I have tested TPC-H scale factor 20 at different work_mem settings (4MB, 20MB, 64MB) and noted down the estimated pages vs the actual pages.

Analysis: The estimated value of lossy_pages is way higher than its actual value, and the reason is that the total_pages calculated by the "Mackert and Lohman formula" is not correct.

work_mem = 4MB
query 4:  estimated: total_pages=552472.00 exact_pages=32768.00 lossy_pages=519704.00; actual: exact=18548 lossy=146141
query 6:  estimated: total_pages=1541449.00 exact_pages=32768.00 lossy_pages=1508681.00; actual: exact=13417 lossy=430385
query 8:  estimated: total_pages=552472.00 exact_pages=32768.00 lossy_pages=519704.00; actual: exact=56869 lossy=495603
query 14: estimated: total_pages=1149603.00 exact_pages=32768.00 lossy_pages=1116835.00; actual: exact=17115 lossy=280949

work_mem = 20MB
query 4:  estimated: total_pages=552472.00 exact_pages=163840.00 lossy_pages=388632.00; actual: exact=109856 lossy=57761
query 6:  estimated: total_pages=1541449.00 exact_pages=163840.00 lossy_pages=1377609.00; actual: exact=59771 lossy=397956
query 8:  estimated: total_pages=552472.00 exact_pages=163840.00 lossy_pages=388632.00; actual: Heap Blocks: exact=221777 lossy=330695
query 14: estimated: total_pages=1149603.00 exact_pages=163840.00 lossy_pages=985763.00; actual: exact=63381 lossy=235513

work_mem = 64MB
query 4:  estimated: total_pages=552472.00 exact_pages=552472.00 lossy_pages=0.00; actual: exact=166005 lossy=0
query 6:  estimated: total_pages=1541449.00 exact_pages=524288.00 lossy_pages=1017161.00; actual: exact=277717 lossy=185919
query 8:  estimated: total_pages=552472.00 exact_pages=552472.00 lossy_pages=0.00; actual: exact=552472 lossy=0
query 14: estimated: total_pages=1149603.00 exact_pages=524288.00 lossy_pages=625315.00; actual: exact=309091 lossy=0

> The performance results look good, but that's a slightly different
> thing from whether the estimate is accurate.
>
> +nbuckets = tbm_calculate_entires(maxbytes);
>
> entires?

changed to
+ tbm->maxentries = (int) tbm_calculate_entires(maxbytes);

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com

improve_bitmap_cost_v4.patch Description: Binary data

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
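For reference, the estimate being evaluated above is just the formula quoted at the top of the message, clamped to the page count; a minimal sketch of it (the helper name and signature are mine, not the patch's, and maxentries is whatever the TIDBitmap can hold in work_mem):

    /*
     * Illustrative only: lossy_pages = max(0, total_pages - maxentries / 2),
     * clamped so it can never exceed the number of pages the bitmap touches.
     */
    static double
    estimated_lossy_pages(double total_pages, double maxentries)
    {
        double  lossy_pages = Max(0.0, total_pages - maxentries / 2.0);

        return Min(lossy_pages, total_pages);
    }

The numbers above show that the formula itself is not the main problem; the total_pages input it receives is.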
Re: [HACKERS] Proposal: Improve bitmap costing for lossy pages
On Fri, Oct 6, 2017 at 6:36 PM, Robert Haas wrote: > On Fri, Oct 6, 2017 at 2:12 AM, Dilip Kumar wrote: >>> The performance results look good, but that's a slightly different >>> thing from whether the estimate is accurate. >>> >>> +nbuckets = tbm_calculate_entires(maxbytes); >>> >>> entires? >> >> changed to >> + tbm->maxentries = (int) tbm_calculate_entires(maxbytes); > > My point was not that you should add a cast, but that you wrote > "entires" rather than "entries". My bad, I thought you were suggesting the variable name as "entries" instead of "nbuckets" so I removed the variable "nbuckets". I will fix the typo in the function name and post the patch. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Improve bitmap costing for lossy pages
On Fri, Oct 6, 2017 at 6:08 PM, Alexander Kuzmenkov wrote:
>
>> Analysis: The estimated value of the lossy_pages is way higher than
>> its actual value and reason is that the total_pages calculated by the
>> "Mackert and Lohman formula" is not correct.
>
> I think the problem might be that the total_pages includes cache effects and
> rescans. For bitmap entries we should use something like relation pages *
> selectivity.

I have noticed that for the TPC-H case, if I use "pages * selectivity" it gives me better results, but IMHO directly multiplying the pages by the selectivity may not be the correct way to calculate the number of heap pages: it can only give the correct result when all the TIDs being fetched are clustered. On the other hand, the "Mackert and Lohman" formula assumes that all the TIDs are evenly distributed across the heap pages, which can also give a wrong estimate, as we are seeing in our TPC-H case.

> Meanwhile, I ran TPC-H briefly with the v3 patch: scale factor 25, 2
> workers, SSD storage.
> It shows significant improvement on 4MB work_mem and no change on 2GB.
>
> Here are the results (execution time in seconds, average of 5 runs):
>
> work_mem        4MB                 2GB
> Query      master   patch      master   patch
> 4          179      174        168      167
> 5          190      168        155      156
> 6          280      87         227      229
> 8          197      114        179      172
> 10         269      227        190      192
> 14         110      108        106      105

Thanks for the testing; the numbers look good to me.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Improve bitmap costing for lossy pages
On Fri, Oct 6, 2017 at 7:04 PM, Dilip Kumar wrote: > On Fri, Oct 6, 2017 at 6:36 PM, Robert Haas wrote: >> On Fri, Oct 6, 2017 at 2:12 AM, Dilip Kumar wrote: >>>> The performance results look good, but that's a slightly different >>>> thing from whether the estimate is accurate. >>>> >>>> +nbuckets = tbm_calculate_entires(maxbytes); >>>> >>>> entires? >>> >>> changed to >>> + tbm->maxentries = (int) tbm_calculate_entires(maxbytes); >> >> My point was not that you should add a cast, but that you wrote >> "entires" rather than "entries". > > My bad, I thought you were suggesting the variable name as "entries" > instead of "nbuckets" so I removed the variable "nbuckets". I will > fix the typo in the function name and post the patch. Fixed in the attached version. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com improve_bitmap_cost_v5.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Improve bitmap costing for lossy pages
On Fri, Oct 6, 2017 at 7:24 PM, Dilip Kumar wrote:
> On Fri, Oct 6, 2017 at 6:08 PM, Alexander Kuzmenkov wrote:
>>
>>> Analysis: The estimated value of the lossy_pages is way higher than
>>> its actual value and reason is that the total_pages calculated by the
>>> "Mackert and Lohman formula" is not correct.
>>
>> I think the problem might be that the total_pages includes cache effects and
>> rescans. For bitmap entries we should use something like relation pages *
>> selectivity.
>
> I have noticed that for the TPCH case if I use "pages * selectivity"
> it give me better results, but IMHO directly multiplying the pages
> with selectivity may not be the correct way to calculate the number of
> heap pages it can only give the correct result when all the TID being
> fetched are clustered. But on the other hand "Mackert and Lohman
> formula" formulae consider that all the TID's are evenly distributed
> across the heap pages which can also give the wrong estimation like we
> are seeing in our TPCH case.

I agree with the point that total_pages includes the cache effects and rescans when loop_count > 1; that can be avoided if we always calculate heap_pages the way it is calculated in the else branch (loop_count = 0). Fortunately, in all the TPC-H query plans I posted upthread, the bitmap scan was never on the inner side of a nested loop join, so loop_count was always 0. I will fix this.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
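For readers following along, the non-rescan branch being referred to computes the page estimate roughly as below. This is a simplified paraphrase of the single-scan Mackert-Lohman approximation used by the bitmap heap scan costing, not a verbatim copy of the planner code; T is the number of heap pages and tuples_fetched the expected number of matching tuples.

    /*
     * Simplified paraphrase, for illustration: single-scan page estimate
     * without any rescan/caching adjustment.
     */
    static double
    single_scan_pages_fetched(double tuples_fetched, double T)
    {
        double  pages_fetched;

        pages_fetched = (2.0 * T * tuples_fetched) / (2.0 * T + tuples_fetched);
        if (pages_fetched > T)
            pages_fetched = T;

        return ceil(pages_fetched);
    }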
Re: [HACKERS] Proposal: Improve bitmap costing for lossy pages
On Fri, Oct 6, 2017 at 9:21 PM, Dilip Kumar wrote: > On Fri, Oct 6, 2017 at 7:24 PM, Dilip Kumar wrote: >> On Fri, Oct 6, 2017 at 6:08 PM, Alexander Kuzmenkov >> wrote: >>> >>>> Analysis: The estimated value of the lossy_pages is way higher than >>>> its actual value and reason is that the total_pages calculated by the >>>> "Mackert and Lohman formula" is not correct. >>> >>> >>> I think the problem might be that the total_pages includes cache effects and >>> rescans. For bitmap entries we should use something like relation pages * >>> selectivity. >> >> I have noticed that for the TPCH case if I use "pages * selectivity" >> it give me better results, but IMHO directly multiplying the pages >> with selectivity may not be the correct way to calculate the number of >> heap pages it can only give the correct result when all the TID being >> fetched are clustered. But on the other hand "Mackert and Lohman >> formula" formulae consider that all the TID's are evenly distributed >> across the heap pages which can also give the wrong estimation like we >> are seeing in our TPCH case. > > I agree with the point that the total_pages included the cache effects > and rescan when loop_count > 1, that can be avoided if we always > calculate heap_pages as it is calculated in the else part > (loop_count=0). Fortunately, in all the TPCH query plan what I posted > up thread bitmap scan was never at the inner side of the NLJ so > loop_count was always 0. I will fix this. I have fixed the issue. Now, for calculating the lossy pages it will not consider the rescan. As mentioned above it will not affect the TPCH plan so haven't measured the performance again. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com improve_bitmap_cost_v6.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel Bitmap Heap Scans segfaults due to (tbm->dsa==NULL) on PostgreSQL 10
On Thu, Oct 12, 2017 at 4:31 PM, Tomas Vondra wrote:
> Hi,
>
> It seems that Q19 from TPC-H is consistently failing with segfaults due
> to calling tbm_prepare_shared_iterate() with (tbm->dsa==NULL).
>
> I'm not very familiar with how the dsa is initialized and passed around,
> but I only see the failures when the bitmap is constructed by a mix of
> BitmapAnd and BitmapOr operations.

I think I have found the issue: bitmap_subplan_mark_shared is not properly pushing the isshared flag down to the lower-level bitmap index node, and because of that tbm_create is passing a NULL dsa while creating the tidbitmap. So this problem only occurs in a very specific combination of BitmapOr and BitmapAnd, when a BitmapAnd is the first subplan of the BitmapOr. If a BitmapIndexScan is the first subplan under the BitmapOr then there is no problem, because the BitmapOr node will create the tbm by itself and isshared is set for the BitmapOr.

The attached patch fixes the issue for me. I will thoroughly test this patch with other scenarios as well. Thanks for reporting.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com

diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 5c934f2..cc7590b 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -4922,7 +4922,11 @@ bitmap_subplan_mark_shared(Plan *plan)
 		bitmap_subplan_mark_shared(
 						linitial(((BitmapAnd *) plan)->bitmapplans));
 	else if (IsA(plan, BitmapOr))
+	{
 		((BitmapOr *) plan)->isshared = true;
+		bitmap_subplan_mark_shared(
+						linitial(((BitmapOr *) plan)->bitmapplans));
+	}
 	else if (IsA(plan, BitmapIndexScan))
 		((BitmapIndexScan *) plan)->isshared = true;
 	else

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel Bitmap Heap Scans segfaults due to (tbm->dsa==NULL) on PostgreSQL 10
On Thu, Oct 12, 2017 at 6:37 PM, Tomas Vondra wrote: > > > On 10/12/2017 02:40 PM, Dilip Kumar wrote: >> On Thu, Oct 12, 2017 at 4:31 PM, Tomas Vondra >> wrote: >>> Hi, >>> >>> It seems that Q19 from TPC-H is consistently failing with segfaults due >>> to calling tbm_prepare_shared_iterate() with (tbm->dsa==NULL). >>> >>> I'm not very familiar with how the dsa is initialized and passed around, >>> but I only see the failures when the bitmap is constructed by a mix of >>> BitmapAnd and BitmapOr operations. >>> >> I think I have got the issue, bitmap_subplan_mark_shared is not >> properly pushing the isshared flag to lower level bitmap index node, >> and because of that tbm_create is passing NULL dsa while creating the >> tidbitmap. So this problem will come in very specific combination of >> BitmapOr and BitmapAnd when BitmapAnd is the first subplan for the >> BitmapOr. If BitmapIndex is the first subplan under BitmapOr then >> there is no problem because BitmapOr node will create the tbm by >> itself and isshared is set for BitmapOr. >> >> Attached patch fixing the issue for me. I will thoroughly test this >> patch with other scenario as well. Thanks for reporting. >> > > Yep, this fixes the failures for me. > Thanks for confirming. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise aggregation/grouping
On Fri, Oct 13, 2017 at 12:06 PM, Jeevan Chalke wrote: > While playing around with the patch I have noticed one regression with the partial partition-wise aggregate. I am consistently able to reproduce this on my local machine. Scenario: Group by on non-key column and only one tuple per group. Complete Test: create table t(a int,b int) partition by range(a); create table t1 partition of t for values from (1) to (10); create table t2 partition of t for values from (10) to (20); insert into t values (generate_series(1,19),generate_series(1, 19)); postgres=# explain analyze select sum(a) from t group by b; QUERY PLAN -- Finalize GroupAggregate (cost=20379.55..28379.51 rows=19 width=12) (actual time=102.311..322.969 rows=19 loops=1) Group Key: t1.b -> Merge Append (cost=20379.55..25379.53 rows=19 width=12) (actual time=102.303..232.310 rows=19 loops=1) Sort Key: t1.b -> Partial GroupAggregate (cost=10189.72..11939.70 rows=9 width=12) (actual time=52.164..108.967 rows=9 loops=1) Group Key: t1.b -> Sort (cost=10189.72..10439.72 rows=9 width=8) (actual time=52.158..66.236 rows=9 loops=1) Sort Key: t1.b Sort Method: external merge Disk: 1768kB -> Seq Scan on t1 (cost=0.00..1884.99 rows=9 width=8) (actual time=0.860..20.388 rows=9 loops=1) -> Partial GroupAggregate (cost=10189.82..11939.82 rows=10 width=12) (actual time=50.134..102.976 rows=10 loops=1) Group Key: t2.b -> Sort (cost=10189.82..10439.82 rows=10 width=8) (actual time=50.128..63.362 rows=10 loops=1) Sort Key: t2.b Sort Method: external merge Disk: 1768kB -> Seq Scan on t2 (cost=0.00..1885.00 rows=10 width=8) (actual time=0.498..20.977 rows=10 loops=1) Planning time: 0.190 ms Execution time: 339.929 ms (18 rows) postgres=# set enable_partition_wise_agg=off; SET postgres=# explain analyze select sum(a) from t group by b; QUERY PLAN -- GroupAggregate (cost=26116.53..29616.51 rows=19 width=12) (actual time=139.413..250.751 rows=19 loops=1) Group Key: t1.b -> Sort (cost=26116.53..26616.52 rows=19 width=8) (actual time=139.406..168.775 rows=19 loops=1) Sort Key: t1.b Sort Method: external merge Disk: 3544kB -> Result (cost=0.00..5769.98 rows=19 width=8) (actual time=0.674..76.392 rows=19 loops=1) -> Append (cost=0.00..3769.99 rows=19 width=8) (actual time=0.672..40.291 rows=19 loops=1) -> Seq Scan on t1 (cost=0.00..1884.99 rows=9 width=8) (actual time=0.672..12.408 rows=9 loops=1) -> Seq Scan on t2 (cost=0.00..1885.00 rows=10 width=8) (actual time=1.407..11.689 rows=10 loops=1) Planning time: 0.146 ms Execution time: 263.678 ms (11 rows) -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Partition-wise aggregation/grouping
On Tue, Oct 17, 2017 at 10:44 PM, Jeevan Chalke wrote:
>
> I didn't get what you mean by regression here. Can you please explain?
>
> I see that PWA plan is selected over regular plan when enabled on the basis
> of costing.
> Regular planning need a Result node due to which costing increases where as
> PWA don't need that and thus wins.

Sorry for not explaining clearly. I meant that with the normal plan the execution time is 263.678 ms, whereas with PWA it is 339.929 ms. I only set enable_partition_wise_agg=on, it switched to PWA, and the execution time increased by 30%. I understand that this is the worst case for PWA, where the FinalizeAggregate is getting all the tuples.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] path toward faster partition pruning
On Mon, Oct 30, 2017 at 12:20 PM, Amit Langote wrote: > On 2017/10/30 14:55, Amit Langote wrote: >> Fixed in the attached updated patch, along with a new test in 0001 to >> cover this case. Also, made a few tweaks to 0003 and 0005 (moved some >> code from the former to the latter) around the handling of >> ScalarArrayOpExprs. > > Sorry, I'd forgotten to include some changes. > > In the previous versions, RT index of the table needed to be passed to > partition.c, which I realized is no longer needed, so I removed that > requirement from the interface. As a result, patches 0002 and 0003 have > changed in this version. Some Minor comments: + * get_rel_partitions + * Return the list of partitions of rel that pass the clauses mentioned + * rel->baserestrictinfo + * + * Returned list contains the AppendRelInfos of chosen partitions. + */ +static List * +get_append_rel_partitions(PlannerInfo *root, Function name in function header is not correct. + !DatumGetBool(((Const *) clause)->constvalue)) + { + *constfalse = true; + continue; DatumGetBool will return true if the first byte of constvalue is nonzero otherwise false. IIUC, this is not the intention here. Or I am missing something? + * clauses in this function ourselves, for example, having both a > 1 and + * a = 0 the list a = 0 the list -> a = 0 in the list +static bool +partkey_datum_from_expr(const Expr *expr, Datum *value) +{ + /* + * Add more expression types here as needed to support higher-level + * code. + */ + switch (nodeTag(expr)) + { + case T_Const: + *value = ((Const *) expr)->constvalue; + return true; I think for evaluating other expressions (e.g. T_Param) we will have to pass ExprContext to this function. Or we can do something cleaner because if we want to access the ExprContext inside partkey_datum_from_expr then we may need to pass it to "get_partitions_from_clauses" which is a common function for optimizer and executor. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
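To make the last review point concrete: if a ParamListInfo were threaded down to partkey_datum_from_expr (which is exactly the interface question raised above), external Params could be handled with a helper along the lines below. This is purely hypothetical illustration, not code from the patch.

    /*
     * Hypothetical helper: extract a Datum from a client-supplied (extern)
     * Param, given access to the ParamListInfo.  Returns false if the value
     * cannot be determined at this point.
     */
    static bool
    partkey_datum_from_param(Param *param, ParamListInfo params, Datum *value)
    {
        if (param->paramkind == PARAM_EXTERN &&
            params != NULL &&
            param->paramid >= 1 &&
            param->paramid <= params->numParams)
        {
            ParamExternData *prm = &params->params[param->paramid - 1];

            if (OidIsValid(prm->ptype) && !prm->isnull)
            {
                *value = prm->value;
                return true;
            }
        }
        return false;
    }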
Re: [HACKERS] Proposal: Improve bitmap costing for lossy pages
On Sat, Nov 11, 2017 at 3:25 AM, Robert Haas wrote: > On Thu, Nov 9, 2017 at 3:55 AM, amul sul wrote: >> It took me a little while to understand this calculation. You have moved >> this >> code from tbm_create(), but I think you should move the following >> comment as well: > > I made an adjustment that I hope will address your concern here, made > a few other adjustments, and committed this. > Thanks, Robert. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Thu, Sep 29, 2016 at 6:40 AM, Tomas Vondra wrote: > Yes, definitely - we're missing something important, I think. One difference > is that Dilip is using longer runs, but I don't think that's a problem (as I > demonstrated how stable the results are). > > I wonder what CPU model is Dilip using - I know it's x86, but not which > generation it is. I'm using E5-4620 v1 Xeon, perhaps Dilip is using a newer > model and it makes a difference (although that seems unlikely). I am using "Intel(R) Xeon(R) CPU E7- 8830 @ 2.13GHz " -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Thu, Sep 29, 2016 at 8:05 PM, Robert Haas wrote: > OK, another theory: Dilip is, I believe, reinitializing for each run, > and you are not. Yes, I am reinitializing for each run. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Dynamic shared memory areas
On Wed, Oct 5, 2016 at 3:00 AM, Thomas Munro wrote:
> Here's a new version that does that.

While testing this patch I found an issue in the following code:

+ total_size = DSA_INITIAL_SEGMENT_SIZE;
+ total_pages = total_size / FPM_PAGE_SIZE;
+ metadata_bytes =
+ MAXALIGN(sizeof(dsa_area_control)) +
+ MAXALIGN(sizeof(FreePageManager)) +
+ total_pages * sizeof(dsa_pointer);
+ /* Add padding up to next page boundary. */
+ if (metadata_bytes % FPM_PAGE_SIZE != 0)
+ metadata_bytes += FPM_PAGE_SIZE - (metadata_bytes % FPM_PAGE_SIZE);
+ usable_pages =
+ (total_size - metadata_bytes) / FPM_PAGE_SIZE;
+ segment = dsm_create(total_size, 0);
+ dsm_pin_segment(segment);

The actual problem is that the size of dsa_area_control is bigger than DSA_INITIAL_SEGMENT_SIZE, but we are allocating a segment of only DSA_INITIAL_SEGMENT_SIZE bytes.

(gdb) p sizeof(dsa_area_control)
$8 = 67111000
(gdb) p DSA_INITIAL_SEGMENT_SIZE
$9 = 1048576

In dsa-v1 the problem did not exist because DSA_MAX_SEGMENTS was 1024, but in dsa-v2 I think it is calculated wrongly.

(gdb) p DSA_MAX_SEGMENTS
$10 = 16777216

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
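To spell out the invariant that is being violated here, a defensive check written only in terms of the quantities from the quoted code would look like the following. This is illustrative, not part of the patch; the point is that the control object, the FreePageManager and the per-page handle array for the initial segment must all fit inside that first segment.

    /*
     * Illustrative sanity check: the first segment's own metadata must fit
     * within DSA_INITIAL_SEGMENT_SIZE, otherwise dsm_create() is asked for a
     * segment that is too small to hold its bookkeeping.
     */
    Assert(MAXALIGN(sizeof(dsa_area_control)) +
           MAXALIGN(sizeof(FreePageManager)) +
           (DSA_INITIAL_SEGMENT_SIZE / FPM_PAGE_SIZE) * sizeof(dsa_pointer) <
           DSA_INITIAL_SEGMENT_SIZE);

With DSA_MAX_SEGMENTS inflated to 16777216, sizeof(dsa_area_control) alone exceeds the 1MB initial segment, so the check fails immediately.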
Re: [HACKERS] Hash tables in dynamic shared memory
On Wed, Oct 5, 2016 at 3:10 AM, Thomas Munro wrote:
> Here is an experimental hash table implementation that uses DSA memory
> so that hash tables can be shared by multiple backends and yet grow
> dynamically. Development name: "DHT".

While reviewing, I found that in the dht_iterate_begin function we are not initializing iterator->last_item_pointer to InvalidDsaPointer, and during the scan this variable is used in the advance_iterator function. (I think this can create a problem, possibly even the loss of some tuples?)

+dht_iterate_begin(dht_hash_table *hash_table,
+ dht_iterator *iterator,
+ bool exclusive)
+{
+ Assert(hash_table->control->magic == DHT_MAGIC);
+
+ iterator->hash_table = hash_table;
+ iterator->exclusive = exclusive;
+ iterator->partition = 0;
+ iterator->bucket = 0;
+ iterator->item = NULL;
+ iterator->locked = false;

+advance_iterator(dht_iterator *iterator, dsa_pointer bucket_head,
+ dsa_pointer *next)
+{
+ dht_hash_table_item *item;
+
+ while (DsaPointerIsValid(bucket_head))
+ {
+ item = dsa_get_address(iterator->hash_table->area,
+ bucket_head);
+ if ((!DsaPointerIsValid(iterator->last_item_pointer) ||
+ bucket_head < iterator->last_item_pointer) &&

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
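Concretely, the initialization being pointed out would make dht_iterate_begin() look like the quoted code plus one line (shown only to make the report unambiguous; this is the suggested change, not code taken from a posted fix):

    void
    dht_iterate_begin(dht_hash_table *hash_table,
                      dht_iterator *iterator,
                      bool exclusive)
    {
        Assert(hash_table->control->magic == DHT_MAGIC);

        iterator->hash_table = hash_table;
        iterator->exclusive = exclusive;
        iterator->partition = 0;
        iterator->bucket = 0;
        iterator->item = NULL;
        iterator->last_item_pointer = InvalidDsaPointer;    /* the missing line */
        iterator->locked = false;
    }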
[HACKERS] Proposal: scan key push down to heap [WIP]
Hi Hackers,

I would like to propose a patch for pushing down scan keys to the heap. Currently, scan keys are pushed down only for system table scans. I have implemented a POC patch to do the same for normal table scans.

This patch extracts the expressions from the qual and prepares scan keys from them. Currently, the POC version only supports quals of the form "var OP const", because these can be pushed down using the existing framework.

The purpose of this work is to first implement the basic functionality and analyze the results. If the results are good then we can extend it to other types of expressions. However, in the future when we try to expand the support to complex expressions, we need to be very careful while selecting pushable expressions. It should not happen that we push something very complex, which may cause contention with other write operations (as HeapKeyTest is done under the page lock).

Performance Test: (test done on my local machine, with all default settings)

Setup:
--
create table test(a int, b varchar, c varchar, d int);
insert into test values(generate_series(1,1000), repeat('x', 30), repeat('y', 30), generate_series(1,1000));
analyze test;

Test query:
--
select count(*) from test where a < $1;

Results: (execution time in ms)

Selectivity   Head (ms)   Patch (ms)   gain
0.01          612         307          49%
0.1           623         353          43%
0.2           645         398          38%
0.5           780         535          31%
0.8           852         590          30%
1             913         730          20%

Instructions: (CPU instructions measured with the callgrind tool)

Query: select count(*) from test where a < 10;
Head:  10,815,730,925
Patch:  4,780,047,331

Summary:
--
1. ~50% reduction in both instructions as well as execution time.
2. Here we can see a ~20% execution time reduction even at selectivity 1 (when all tuples are selected). The reason for this can be that HeapKeyTest is much simpler compared to ExecQual. It's possible that in the future, when we try to support a wider variety of keys, the gain at high selectivity may come down.

WIP patch attached.

Thoughts ?

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com

heap_scankey_pushdown_v1.patch Description: Binary data

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
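To make the "var OP const" case concrete: a qual such as "a < 10" on an int4 column essentially becomes a ScanKey that the heap scan evaluates itself via HeapKeyTest. A hand-written equivalent would look roughly like the sketch below; the attribute number and the rel/snapshot variables are assumed to be set up by the caller, and this is only an illustration of the mechanism, not a fragment of the patch.

    /*
     * Minimal sketch: push "a < 10" (attribute 1, int4) into the heap scan.
     * HeapKeyTest only calls the comparison proc, so F_INT4LT does the work;
     * the strategy number is informational here.
     */
    ScanKeyData skey;
    HeapScanDesc scan;

    ScanKeyInit(&skey,
                1,                      /* attno of column "a" */
                BTLessStrategyNumber,   /* a < const */
                F_INT4LT,               /* int4 "<" comparison proc */
                Int32GetDatum(10));     /* the constant */

    scan = heap_beginscan(rel, snapshot, 1, &skey);

The patch's job is essentially to build such keys automatically from the baserestrictinfo quals at executor startup, instead of evaluating the same condition through ExecQual for every tuple.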
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
connections establishing)
tps = 52320.442694 (excluding connections establishing)

28564  Client        | ClientRead
3663
1320   LWLockNamed   | ProcArrayLock
742    Lock          | transactionid
534    LWLockNamed   | CLogControlLock
255    Lock          | extend
108    LWLockNamed   | XidGenLock
81     LWLockTranche | buffer_content
44     LWLockTranche | lock_manager
29     Lock          | tuple
6      LWLockTranche | wal_insert
1      Tuples only is on.
1      LWLockTranche | buffer_mapping

With Patch:
tps = 47505.582315 (including connections establishing)
tps = 47505.773351 (excluding connections establishing)

28186  Client        | ClientRead
3655
1172   LWLockNamed   | ProcArrayLock
619    Lock          | transactionid
289    LWLockNamed   | CLogControlLock
237    Lock          | extend
81     LWLockTranche | buffer_content
48     LWLockNamed   | XidGenLock
28     LWLockTranche | lock_manager
23     Lock          | tuple
6      LWLockTranche | wal_insert

I think from a client count of 96 onwards the contention on CLogControlLock is clearly visible, and it is completely solved with the group lock patch. At the lower client counts (32, 64) the contention on CLogControlLock is not significant, so we cannot see any gain with the group lock patch (though we can see that the contention on CLogControlLock is somewhat reduced at 64 clients).

Note: I have taken only one set of readings, and at 32 clients my reading shows some regression with the group lock patch, which may be run-to-run variance (earlier I never saw this regression; I can confirm again with multiple runs).

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com

[@power2 ~]$ uname -mrs
Linux 3.10.0-229.14.1.ael7b.ppc64le ppc64le
[@power2 ~]$ lscpu
Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                192
On-line CPU(s) list:   0-191
Thread(s) per core:    8
Core(s) per socket:    1
Socket(s):             24
NUMA node(s):          4
Model:                 IBM,8286-42A
L1d cache:             64K
L1i cache:             32K
L2 cache:              512K
L3 cache:              8192K
NUMA node0 CPU(s):     0-47
NUMA node1 CPU(s):     48-95
NUMA node2 CPU(s):     96-143
NUMA node3 CPU(s):     144-191
[@power2 ~]$ nproc --all
192

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: scan key push down to heap [WIP]
On Thu, Oct 13, 2016 at 6:05 AM, Robert Haas wrote:
> I seriously doubt that this should EVER be supported for anything
> other than "var op const", and even then only for very simple
> operators.

Yes, with the existing key push down infrastructure only "var op const" is possible, but if we extend it I think we can cover many other simple expressions, i.e.

Unary operators: ISNULL, NOTNULL
Other simple expressions: "Var op Const", "Var op Var", "Var op SimpleExpr(Var, Const)" ...

We cannot push down function expressions because we cannot guarantee that they are safe, but can we push down built-in functions? (I think we can identify their safety based on their data type, but I am not too sure about this point.)

> For example, texteq() is probably too complicated, because
> it might de-TOAST, and that might involve doing a LOT of work while
> holding the buffer content lock. But int4eq() would be OK.
>
> Part of the trick if we want to make this work is going to be figuring
> out how we'll identify which operators are safe.

Yes, I agree that's the difficult part. Can't we process the full qual list and decide whether the operators are safe based on their datatype? What I mean is that instead of checking the safety of each operator like texteq(), text_le(), ... we can directly discard any operator involving such data types.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
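One cheap, conservative version of the datatype-based check being discussed would be to allow only pass-by-value types: they are fixed-length by definition and can never require de-TOASTing while the buffer content lock is held. The helper below is illustrative only (its name is mine), and a real patch would presumably want additional checks on the operator's function as well.

    /*
     * Illustrative only: a very conservative notion of a "safe" datatype for
     * pushdown.  Pass-by-value types are fixed-length and cannot be toasted.
     */
    static bool
    pushdown_safe_type(Oid typid)
    {
        return get_typbyval(typid);
    }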
Re: [HACKERS] Parallel bitmap heap scan
There is major chance in tidbitmap.c file after efficient hash table commit [1] and my patch need to be rebased. Only parallel-bitmap-heap-scan need to be rebased, all other patch can be applied on head as is. Rebased version (v2) of parallel-bitmap-heap-scan is attached. [1] commit-http://git.postgresql.org/pg/commitdiff/75ae538bc3168bf44475240d4e0487ee2f3bb376 On Fri, Oct 7, 2016 at 11:46 AM, Dilip Kumar wrote: > Hi Hackers, > > I would like to propose parallel bitmap heap scan feature. After > running TPCH benchmark, It was observed that many of TPCH queries are > using bitmap scan (@TPCH_plan.tar.gz attached below). Keeping this > point in mind we thought that many query will get benefited with > parallel bitmap scan. > > Robert has also pointed out the same thing in his blog related to parallel > query > http://rhaas.blogspot.in/2016/04/postgresql-96-with-parallel-query-vs.html > > Currently Bitmap heap plan look like this : > -- > Bitmap Heap Scan > -> Bitmap Index Scan > > After this patch : > - > Parallel Bitmap Heap Scan > -> Bitmap Index Scan > > As part of this work I have implemented parallel processing in > BitmapHeapScan node. BitmapIndexScan is still non parallel. > > Brief design idea: > --- > #1. Shared TIDBitmap creation and initialization > First worker to see the state as parallel bitmap info as PBM_INITIAL > become leader and set the state to PBM_INPROGRESS All other workers > see the state as PBM_INPROGRESS will wait for leader to complete the > TIDBitmap. > > #2 At this level TIDBitmap is ready and all workers are awake. > > #3. Bitmap processing (Iterate and process the pages). > In this phase each worker will iterate over page and chunk array and > select heap pages one by one. If prefetch is enable then there will be > two iterator. Since multiple worker are iterating over same page and > chunk array we need to have a shared iterator, so we grab a spin lock > and iterate within a lock, so that each worker get and different page > to process. > > Note: For more detail on design, please refer comment of > BitmapHeapNext API in "parallel-bitmap-heap-scan-v1.patch" file. > > Attached patch details: > -- > 1. parallel-bitmap-heap-scan-v1.patch: This is the main patch to make > bitmap heap scan node parallel aware. > > 2. dht-return-dsa-v1.patch: This patch will provide new API, where we > can scan full DHT[1], and get the dsa_pointers (a relative pointer). > The dsa_pointer values can be shared with other processes. We need > this because, after TIDBitmap is created, only one worker will process > whole TIDBitmap and convert it to a page and chunk array. So we need > to store the generic pointer, so that later on each worker can convert > those to their local pointer before start processing. > > My patch depends on following patches. > -- > 1. conditional_variable > https://www.postgresql.org/message-id/CAEepm%3D0zshYwB6wDeJCkrRJeoBM%3DjPYBe%2B-k_VtKRU_8zMLEfA%40mail.gmail.com > > 2. dsa_area > https://www.postgresql.org/message-id/CAEepm%3D024p-MeAsDmG%3DR3%2BtR4EGhuGJs_%2BrjFKF0eRoSTmMJnA%40mail.gmail.com > > 3. Creating a DSA area to provide work space for parallel execution > https://www.postgresql.org/message-id/CAEepm%3D0HmRefi1%2BxDJ99Gj5APHr8Qr05KZtAxrMj8b%2Bay3o6sA%40mail.gmail.com > > 4. Hash table in dynamic shared memory (DHT) [1] > https://www.postgresql.org/message-id/CAEepm%3D0VrMt3s_REDhQv6z1pHL7FETOD7Rt9V2MQ3r-2ss2ccA%40mail.gmail.com > > Order in which patches should be applied: > > 1. conditional_variable > 2. dsa_area > 3. 
Creating a DSA area to provide work space for parallel execution > 4. Hash table in dynamic shared memory. > 5. dht-return-dsa-v1.patch > 6. parallel-bitmap-heap-scan-v1.patch > > Performance Results: > - > Summary : > 1. After this patch, I observed currently 4 queries are getting > significant improvement (Q4, Q6, Q14, Q15). >- Q4, is converting from parallel seqscan to parallel bitmap heap scan. >- Other queries are converted from a regular bitmap heap scan to a > parallel bitmap heap scan. > 2. Benefit is more visible at lower workers (upto 4), after that some > of the queries are selecting ParallelSeqScan over ParallelBitmapScan. > And, I think this is expected, because so far we have only made > BitmapHeap node as parallel whereas ParallelSeqScan is completely > parallel so at higher worker count ParallelSeqScan is better choice. > 3. Detailed result is attached
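For readers who do not want to open the patch, the leader-election handshake described in design point #1 of the quoted proposal boils down to something like the following sketch. It uses the condition-variable API from the dependency list above; apart from the PBM_* state names taken from the description, all names and details here are mine, not the patch's, and locking is simplified.

    typedef enum
    {
        PBM_INITIAL,        /* nobody has started building the bitmap */
        PBM_INPROGRESS,     /* one worker (the leader) is building it */
        PBM_FINISHED        /* shared TIDBitmap is ready to iterate */
    } PBMState;

    typedef struct ParallelBitmapInfo
    {
        slock_t     mutex;
        PBMState    state;
        ConditionVariable cv;
    } ParallelBitmapInfo;

    /* Returns true if this worker must build the shared TIDBitmap. */
    static bool
    pbm_become_leader(ParallelBitmapInfo *pbm)
    {
        bool    leader = false;

        SpinLockAcquire(&pbm->mutex);
        if (pbm->state == PBM_INITIAL)
        {
            pbm->state = PBM_INPROGRESS;
            leader = true;
        }
        SpinLockRelease(&pbm->mutex);

        if (!leader)
        {
            /* wait until the leader has published the bitmap */
            ConditionVariablePrepareToSleep(&pbm->cv);
            while (pbm->state != PBM_FINISHED)
                ConditionVariableSleep(&pbm->cv, 0 /* wait event id */ );
            ConditionVariableCancelSleep();
        }
        return leader;
    }

    /* Called by the leader once the shared TIDBitmap has been built. */
    static void
    pbm_publish_bitmap(ParallelBitmapInfo *pbm)
    {
        SpinLockAcquire(&pbm->mutex);
        pbm->state = PBM_FINISHED;
        SpinLockRelease(&pbm->mutex);
        ConditionVariableBroadcast(&pbm->cv);
    }

After this handshake, every worker (leader included) iterates over the shared page/chunk arrays under a spinlock, as described in the design notes.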
Re: [HACKERS] Parallel bitmap heap scan
On Tue, Oct 18, 2016 at 1:42 AM, Robert Haas wrote: > But what's the impact on performance? Presumably parallel bitmap heap > scan was already slower than the non-parallel version, and that commit > presumably widens the gap. Seems like something to worry about... I have checked the performance in my local machine and there is no impact on the gap. If you see the explain analyze output of all queries which got benefited which parallel bitmap map heap scan, BitmapIndex node is taking very less time compare to BitmapHeap. Actual execution time on head (before efficient hash table patch) BitmapHeapNode BitmapIndexNode Q6 38997 6951 Q14 14516569 Q15 28530 1442 Out of 4 queries, Q4 is converted from parallel seqscan to parallel bitmap scan so no impact. Q14, Q15 time spent in BitmapIndex node is < 5% of time spent in BitmapHeap Node. Q6 it's 20% but I did not see much impact on this in my local machine. However I will take the complete performance reading and post the data on my actual performance machine. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel bitmap heap scan
On Tue, Oct 18, 2016 at 1:45 AM, Andres Freund wrote: > I don't quite understand why the bitmap has to be parallel at all. As > far as I understand your approach as described here, the only thing that > needs to be shared are the iteration arrays. Since they never need to > be resized and such, it seems to make a lot more sense to just add an > API to share those, instead of the whole underlying hash. You are right that we only share iteration arrays. But only point is that each entry of iteration array is just a pointer to hash entry. So either we need to build hash in shared memory (my current approach) or we need to copy each hash element at shared location (I think this is going to be expensive). Let me know if I am missing something.. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel bitmap heap scan
On Wed, Oct 19, 2016 at 12:39 PM, Andres Freund wrote: > Try measuring with something more heavy on bitmap scan time > itself. E.g. > SELECT SUM(l_extendedprice) FROM lineitem WHERE (l_shipdate >= > '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date); > or similar. The tpch queries don't actually spend that much time in the > bitmapscan itself - the parallization of the rest of the query is what > matters... Yeah, I agree. I have tested with this query, with exact filter condition it was taking parallel sequence scan, so I have modified the filter a bit and tested. Tested with all default configuration in my local machine. I think I will generate more such test cases and do detail testing in my performance machine. Explain Analyze results: - On Head: postgres=# explain analyze SELECT SUM(l_extendedprice) FROM lineitem WHERE (l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-03-31'::date); QUERY PLAN --- Aggregate (cost=848805.90..848805.91 rows=1 width=32) (actual time=12440.165..12440.166 rows=1 loops=1) -> Bitmap Heap Scan on lineitem (cost=143372.40..834833.25 rows=5589057 width=8) (actual time=1106.217..11183.722 rows=5678841 loops=1) Recheck Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-03-31'::date)) Rows Removed by Index Recheck: 20678739 Heap Blocks: exact=51196 lossy=528664 -> Bitmap Index Scan on idx_lineitem_shipdate (cost=0.00..141975.13 rows=5589057 width=0) (actual time=1093.376..1093.376 rows=5678841 loops=1) Index Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-03-31'::date)) Planning time: 0.185 ms Execution time: 12440.819 ms (9 rows) After Patch: --- postgres=# explain analyze SELECT SUM(l_extendedprice) FROM lineitem WHERE (l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-03-31'::date); QUERY PLAN -- - Finalize Aggregate (cost=792751.16..792751.17 rows=1 width=32) (actual time=6660.157..6660.157 rows=1 loops=1) -> Gather (cost=792750.94..792751.15 rows=2 width=32) (actual time=6659.378..6660.117 rows=3 loops=1) Workers Planned: 2 Workers Launched: 2 -> Partial Aggregate (cost=791750.94..791750.95 rows=1 width=32) (actual time=6655.941..6655.941 rows=1 loops=3) -> Parallel Bitmap Heap Scan on lineitem (cost=143372.40..785929.00 rows=2328774 width=8) (actual time=1980.797..6204.232 rows=1892947 loops= 3) Recheck Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-03-31'::date)) Rows Removed by Index Recheck: 6930269 Heap Blocks: exact=17090 lossy=176443 -> Bitmap Index Scan on idx_lineitem_shipdate (cost=0.00..141975.13 rows=5589057 width=0) (actual time=1933.454..1933.454 rows=5678841 loops=1) Index Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-03-31'::date)) Planning time: 0.207 ms Execution time: 6669.195 ms (13 rows) Summary: -> With patch overall execution is 2 time faster compared to head. -> Bitmap creation with patch is bit slower compared to head and thats because of DHT vs efficient hash table. I found one defect in v2 patch, that I induced during last rebasing. That is fixed in v3. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com parallel-bitmap-heap-scan-v3.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
Test4: number of clients: 32

Head:
tps = 27587.12 (including connections establishing)
tps = 27588.119611 (excluding connections establishing)

[dilip.kumar@cthulhu bin]$ cat 1000_32_ul.txt
 117762 |
   4031
    614 LWLockNamed   | ProcArrayLock
    379 LWLockNamed   | CLogControlLock
    344 Lock          | transactionid
    183 Lock          | extend
    102 LWLockTranche | buffer_mapping
     71 LWLockTranche | buffer_content
     39 LWLockNamed   | XidGenLock
     25 LWLockTranche | lock_manager
      3 LWLockTranche | wal_insert
      3 LWLockTranche | clog
      2 LWLockNamed   | CheckpointerCommLock
      2 Lock          | tuple

Patch:
tps = 28291.428848 (including connections establishing)
tps = 28291.586435 (excluding connections establishing)

[dilip.kumar@cthulhu bin]$ cat 1000_32_ul.txt
 116596 |
   4041
    757 LWLockNamed   | ProcArrayLock
    407 LWLockNamed   | CLogControlLock
    358 Lock          | transactionid
    183 Lock          | extend
    142 LWLockTranche | buffer_mapping
     77 LWLockTranche | buffer_content
     68 LWLockNamed   | XidGenLock
     35 LWLockTranche | lock_manager
     15 LWLockTranche | wal_insert
      7 LWLockTranche | clog
      7 Lock          | tuple
      4 LWLockNamed   | CheckpointerCommLock
      1 Tuples only is on.

Summary:
- At 96 clients and above we can see CLogControlLock at the top.
- With the patch, contention on CLogControlLock is reduced significantly.

I think this behaviour is the same as what we saw on POWER.

With 300 scale factor:
- Contention on CLogControlLock is significant only at 192 clients (the transactionid lock is still on top), and it is completely removed with the group lock patch.

For 300 scale factor, I am posting data only for the 192 client count (if anyone is interested in other data, I can post it).

Head:
scaling factor: 300
query mode: prepared
number of clients: 192
number of threads: 192
duration: 1800 s
number of transactions actually processed: 65930726
latency average: 5.242 ms
tps = 36621.827041 (including connections establishing)
tps = 36622.064081 (excluding connections establishing)

[dilip.kumar@cthulhu bin]$ cat 300_192_ul.txt
 437848 |
 118966 Lock          | transactionid
  88869 LWLockNamed   | CLogControlLock
  18558 Lock          | tuple
   6183 LWLockTranche | buffer_content
   5664 LWLockTranche | lock_manager
   3995 LWLockNamed   | ProcArrayLock
   3646
   1748 Lock          | extend
   1635 LWLockNamed   | XidGenLock
    401 LWLockTranche | wal_insert
     33 BufferPin     | BufferPin
      5 LWLockTranche | proc
      3 LWLockTranche | buffer_mapping

Patch:
scaling factor: 300
query mode: prepared
number of clients: 192
number of threads: 192
duration: 1800 s
number of transactions actually processed: 82616270
latency average: 4.183 ms
tps = 45894.737813 (including connections establishing)
tps = 45894.995634 (excluding connections establishing)

 120372 Lock          | transactionid
  16346 Lock          | tuple
   7489 LWLockTranche | lock_manager
   4514 LWLockNamed   | ProcArrayLock
   3632
   3310 LWLockNamed   | CLogControlLock
   2287 LWLockNamed   | XidGenLock
   2271 Lock          | extend
    709 LWLockTranche | buffer_content
    490 LWLockTranche | wal_insert
     30 BufferPin     | BufferPin
     10 LWLockTranche | proc
      6 LWLockTranche | buffer_mapping

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
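PS: For anyone skimming the numbers above, the reason the group update patch pushes CLogControlLock down the list is roughly the following pattern (heavily simplified sketch, not the actual patch code; the push/wait/wake helpers and the PGPROC field names are stand-ins):

/*
 * Instead of every backend acquiring CLogControlLock exclusively to set
 * its own transaction status, backends publish their request and queue
 * themselves; the first one becomes the leader, takes the lock once and
 * applies the whole batch, so N status updates cost one lock acquisition.
 */
static void
group_set_transaction_status(PGPROC *proc, TransactionId xid, XidStatus status)
{
    proc->groupClogXid = xid;                 /* stand-in field names */
    proc->groupClogStatus = status;

    if (!push_self_and_check_leader(proc))    /* lock-free CAS on a list head */
    {
        wait_until_leader_wakes_us(proc);     /* follower: just sleep */
        return;
    }

    /* leader: one exclusive acquisition covers every queued member */
    LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
    for (PGPROC *member = first_queued_member(); member != NULL;
         member = next_queued_member(member))
        clog_set_status_bit(member->groupClogXid, member->groupClogStatus);
    LWLockRelease(CLogControlLock);

    wake_all_queued_members();
}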
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
 302317 |
  18836 Lock          | transactionid
  12912 LWLockNamed   | CLogControlLock
   4120 LWLockNamed   | ProcArrayLock
   3662
   1700 Lock          | tuple
   1305 Lock          | extend
   1030 LWLockTranche | buffer_content
    828 LWLockTranche | lock_manager
    730 LWLockNamed   | XidGenLock
    107 LWLockTranche | wal_insert
      4 LWLockTranche | buffer_mapping
      1 Tuples only is on.
      1 LWLockTranche | proc
      1 BufferPin     | BufferPin

Group Lock Patch:
scaling factor: 300
query mode: prepared
number of clients: 96
number of threads: 96
duration: 1800 s
number of transactions actually processed: 61608756
latency average: 2.805 ms
tps = 44385.885080 (including connections establishing)
tps = 44386.297364 (excluding connections establishing)

[dilip.kumar@cthulhu bin]$ cat 300_96_ul.txt
 237842 |
  14379 Lock          | transactionid
   3335 LWLockNamed   | ProcArrayLock
   2850
   1374 LWLockNamed   | CLogControlLock
   1200 Lock          | tuple
    992 Lock          | extend
    717 LWLockNamed   | XidGenLock
    625 LWLockTranche | lock_manager
    259 LWLockTranche | buffer_content
    105 LWLockTranche | wal_insert
      4 LWLockTranche | buffer_mapping
      2 LWLockTranche | proc

Head:
scaling factor: 300
query mode: prepared
number of clients: 192
number of threads: 192
duration: 1800 s
number of transactions actually processed: 65930726
latency average: 5.242 ms
tps = 36621.827041 (including connections establishing)
tps = 36622.064081 (excluding connections establishing)

[dilip.kumar@cthulhu bin]$ cat 300_192_ul.txt
 437848 |
 118966 Lock          | transactionid
  88869 LWLockNamed   | CLogControlLock
  18558 Lock          | tuple
   6183 LWLockTranche | buffer_content
   5664 LWLockTranche | lock_manager
   3995 LWLockNamed   | ProcArrayLock
   3646
   1748 Lock          | extend
   1635 LWLockNamed   | XidGenLock
    401 LWLockTranche | wal_insert
     33 BufferPin     | BufferPin
      5 LWLockTranche | proc
      3 LWLockTranche | buffer_mapping

GroupLock Patch:
scaling factor: 300
query mode: prepared
number of clients: 192
number of threads: 192
duration: 1800 s
number of transactions actually processed: 82616270
latency average: 4.183 ms
tps = 45894.737813 (including connections establishing)
tps = 45894.995634 (excluding connections establishing)

 120372 Lock          | transactionid
  16346 Lock          | tuple
   7489 LWLockTranche | lock_manager
   4514 LWLockNamed   | ProcArrayLock
   3632
   3310 LWLockNamed   | CLogControlLock
   2287 LWLockNamed   | XidGenLock
   2271 Lock          | extend
    709 LWLockTranche | buffer_content
    490 LWLockTranche | wal_insert
     30 BufferPin     | BufferPin
     10 LWLockTranche | proc
      6 LWLockTranche | buffer_mapping

Summary: On the X86 8-socket machine at 300 scale factor, I did not observe significant wait on CLogControlLock up to 96 clients. However, at 192 clients we can see significant wait on CLogControlLock, though still not as bad as what we see on POWER.

> > I've taken some time to build a simple web-based reports from the results
> > collected so far (also included in the git repository), and pushed them
> > here:
> >
> > http://tvondra.bitbucket.org
> >
> > For each of the completed runs, there's a report comparing tps for different
> > client counts with master and the three patches (average tps, median and
> > stddev), and it's possible to download a more thorough text report with wait
> > event stats, comparison of individual runs etc.

I saw your report; I think presenting it this way gives a very clear idea.

> > If you want to cooperate on this, I'm available - i.e. I can help you get
> > the tooling running, customize it etc.

That will be really helpful; then next time I can also present my reports in the same format.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Thu, Oct 20, 2016 at 9:15 PM, Robert Haas wrote:
> So here's my theory. The whole reason why Tomas is having difficulty
> seeing any big effect from these patches is because he's testing on
> x86. When Dilip tests on x86, he doesn't see a big effect either,
> regardless of workload. But when Dilip tests on POWER, which I think
> is where he's mostly been testing, he sees a huge effect, because for
> some reason POWER has major problems with this lock that don't exist
> on x86.

Right. On POWER we can see big contention on CLogControlLock at 300 scale factor even at 96 clients, but on x86 at 300 scale factor there is almost no contention on CLogControlLock. However, at 1000 scale factor we can see significant contention on CLogControlLock on the x86 machine. I want to test on POWER with 1000 scale factor to see whether contention on CLogControlLock becomes much worse there; I will run this test and post the results.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Fri, Oct 21, 2016 at 7:57 AM, Dilip Kumar wrote:
> On Thu, Oct 20, 2016 at 9:03 PM, Tomas Vondra wrote:
>> In the results you've posted on 10/12, you've mentioned a regression with 32
>> clients, where you got 52k tps on master but only 48k tps with the patch (so
>> ~10% difference). I have no idea what scale was used for those tests,
>
> That test was with scale factor 300 on POWER 4 socket machine. I think
> I need to repeat this test with multiple reading to confirm it was
> regression or run to run variation. I will do that soon and post the
> results.

As promised, I have rerun my test (3 times), and I did not see any regression. The median of the 3 runs is the same on head and with the group lock patch. I am posting the results of all three runs anyway. In my earlier report we saw ~48K TPS with the patch, but over multiple runs we get that reading with head as well as with the patch.

Head:

run1:
transaction type:
scaling factor: 300
query mode: prepared
number of clients: 32
number of threads: 32
duration: 1800 s
number of transactions actually processed: 87784836
latency average = 0.656 ms
tps = 48769.327513 (including connections establishing)
tps = 48769.543276 (excluding connections establishing)

run2:
transaction type:
scaling factor: 300
query mode: prepared
number of clients: 32
number of threads: 32
duration: 1800 s
number of transactions actually processed: 91240374
latency average = 0.631 ms
tps = 50689.069717 (including connections establishing)
tps = 50689.263505 (excluding connections establishing)

run3:
transaction type:
scaling factor: 300
query mode: prepared
number of clients: 32
number of threads: 32
duration: 1800 s
number of transactions actually processed: 90966003
latency average = 0.633 ms
tps = 50536.639303 (including connections establishing)
tps = 50536.836924 (excluding connections establishing)

With group lock patch:

run1:
transaction type:
scaling factor: 300
query mode: prepared
number of clients: 32
number of threads: 32
duration: 1800 s
number of transactions actually processed: 87316264
latency average = 0.660 ms
tps = 48509.008040 (including connections establishing)
tps = 48509.194978 (excluding connections establishing)

run2:
transaction type:
scaling factor: 300
query mode: prepared
number of clients: 32
number of threads: 32
duration: 1800 s
number of transactions actually processed: 91950412
latency average = 0.626 ms
tps = 51083.507790 (including connections establishing)
tps = 51083.704489 (excluding connections establishing)

run3:
transaction type:
scaling factor: 300
query mode: prepared
number of clients: 32
number of threads: 32
duration: 1800 s
number of transactions actually processed: 90378462
latency average = 0.637 ms
tps = 50210.225983 (including connections establishing)
tps = 50210.405401 (excluding connections establishing)

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Proposal: scan key push down to heap [WIP]
On Fri, Oct 14, 2016 at 11:54 AM, Andres Freund wrote:
> I don't think it's a good idea to do this under the content lock in any
> case. But luckily I don't think we have to do so at all.
>
> Due to pagemode - which is used by pretty much everything iterating over
> heaps, and definitely seqscans - the key test already happens without
> the content lock held, in heapgettup_pagemode():

Yes, you are right. After this clarification, it's clear that even if we push down the key we are not evaluating it under the buffer content lock.

For now, I have done my further analysis keeping in mind that we only push down 'var op const'. Below are my observations.

#1. If we process the qual in the executor, all temporary memory allocations are made in the "per_tuple_context", which gets reset after each tuple is processed. On the other hand, if we do it in the heap layer, allocations will end up in the per-query context. This restricts us from pushing down any qual that needs to do memory allocation, e.g. on a toastable field. Is there any other way to handle this? (A rough sketch of the executor pattern I mean is at the end of this mail.)

#2. Currently quals are ordered based on cost (see order_qual_clauses), but once we push down some of the quals, those quals will always be executed first. Can this create a problem? Consider the example below:

create or replace function f1(anyelement) returns bool as
$$
begin
  raise notice '%', $1;
  return true;
end;
$$ LANGUAGE plpgsql cost 0.01;

select * from t3 where a>1 and f1(b);

In this case "f1(b)" will always be executed as the first qual (its cost is set very low by the user), hence it will raise a notice for every tuple. But if we push the "a>1" qual down to the heap, then only qualifying tuples will be passed to "f1(b)". Is that a behaviour change? It can also impact performance: when the user has given some function a very low cost, it should be executed first, as it may discard most of the tuples.

Possible solutions:
1. Only push down if all quals have the same cost.
2. Push down quals only up to the first non-pushable qual (this way we maintain the same qual execution order as the existing system).

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
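PS: Re #1, this is roughly the executor pattern I am comparing against (condensed sketch, not exact code; FetchNextTuple is a stand-in for the real scan-next call):

/*
 * The executor evaluates quals per tuple inside the per-tuple memory
 * context.  Anything a qual allocates (e.g. a detoasted value) lives only
 * until ResetExprContext() runs for the next tuple, so nothing accumulates
 * for the life of the query.
 */
for (;;)
{
    TupleTableSlot *slot = FetchNextTuple(scanstate);   /* stand-in */

    if (TupIsNull(slot))
        break;

    ResetExprContext(econtext);           /* frees the previous tuple's allocations */
    econtext->ecxt_scantuple = slot;

    if (ExecQual(qual, econtext, false))  /* runs in ecxt_per_tuple_memory */
        return slot;                      /* tuple passes the filter */
}

/*
 * By contrast, HeapKeyTest() inside the heap layer has no per-tuple
 * context to reset; whatever the key comparison allocates goes into the
 * currently active (effectively per-query) context, which is the concern
 * for toastable columns above.
 */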
Re: [HACKERS] Proposal: scan key push down to heap [WIP]
On Tue, Oct 25, 2016 at 10:35 PM, Alvaro Herrera wrote:
> We don't promise order of execution (which is why we can afford to sort
> on cost), but I think it makes sense to keep a rough ordering based on
> cost, and not let this push-down affect those ordering decisions too
> much.

Ok, thanks for the clarification.

> I think it's fine to push-down quals that are within the same order of
> magnitude of the cost of a pushable condition, while keeping any other
> much-costlier qual (which could otherwise be pushed down) together with
> non-pushable conditions; this would sort-of guarantee within-order-of-
> magnitude order of execution of quals.
>
> Hopefully an example clarifies what I mean. Let's suppose you have
> three quals, where qual2 is non-pushable but 1 and 3 are. cost(1)=10,
> cost(2)=11, cost(3)=12. Currently, they are executed in that order.
>
> If you were to compare costs in the straightforward way, you would push
> only 1 (because 3 is costlier than 2 which is not push-down-able). With
> fuzzy comparisons, you'd push both 1 and 3, because cost of 3 is close
> enough to that of qual 2.
>
> But if cost(3)=100 then only push qual 1, and let qual 3 be evaluated
> together with 2 outside the scan node.

After putting more thought into this, IMHO it need not be so complicated. Currently we are only talking about pushing down "var op const", and the cost of all such operator functions is very low and fixed at "1". Do we really need to take care of a user-defined function that is declared with a very low cost? While building index conditions we also don't take care of such things: index conditions are always evaluated first, and only then is the filter applied. Am I missing something?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
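PS: For reference, this is how I read the fuzzy rule from the quoted mail, as a small self-contained sketch (illustrative only; the Qual struct, the helper and the order-of-magnitude test are stand-ins, not planner code):

#include <math.h>
#include <stdbool.h>

typedef struct Qual
{
    double cost;
    bool   pushable;     /* e.g. a simple "var op const" */
    bool   push_down;    /* decision computed below */
} Qual;

/* costs are assumed > 0 */
static bool
same_order_of_magnitude(double a, double b)
{
    return floor(log10(a)) <= floor(log10(b));
}

/* quals[] is assumed already sorted by increasing cost */
static void
choose_pushdown_quals(Qual *quals, int nquals)
{
    double cheapest_nonpushable = -1.0;

    for (int i = 0; i < nquals; i++)
    {
        if (!quals[i].pushable && cheapest_nonpushable < 0.0)
            cheapest_nonpushable = quals[i].cost;
    }

    for (int i = 0; i < nquals; i++)
    {
        if (!quals[i].pushable)
            quals[i].push_down = false;
        else if (cheapest_nonpushable < 0.0)
            quals[i].push_down = true;      /* every qual is pushable */
        else
            quals[i].push_down =
                same_order_of_magnitude(quals[i].cost, cheapest_nonpushable);
    }
}

With the quoted example (cost(1)=10, cost(2)=11 non-pushable, cost(3)=12) this pushes quals 1 and 3; with cost(3)=100 it pushes only qual 1, exactly as described above.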
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Thu, Oct 27, 2016 at 5:14 PM, Amit Kapila wrote:
>>> Thanks Tomas and Dilip for doing detailed performance tests for this
>>> patch. I would like to summarise the performance testing results.
>>>
>>> 1. With update intensive workload, we are seeing gains from 23%~192%
>>> at client count >=64 with group_update patch [1].

this is with unlogged table

>>> 2. With tpc-b pgbench workload (at 1000 scale factor), we are seeing
>>> gains from 12% to ~70% at client count >=64 [2]. Tests are done on
>>> 8-socket intel m/c.

this is with synchronous_commit=off

>>> 3. With pgbench workload (both simple-update and tpc-b at 300 scale
>>> factor), we are seeing gain 10% to > 50% at client count >=64 [3].
>>> Tests are done on 8-socket intel m/c.

this is with synchronous_commit=off

>>> 4. To see why the patch only helps at higher client count, we have
>>> done wait event testing for various workloads [4], [5] and the results
>>> indicate that at lower clients, the waits are mostly due to
>>> transactionid or clientread. At client-counts where contention due to
>>> CLOGControlLock is significant, this patch helps a lot to reduce that
>>> contention. These tests are done on 8-socket intel m/c and
>>> 4-socket power m/c

these both are with synchronous_commit=off + unlogged table

>>> 5. With pgbench workload (unlogged tables), we are seeing gains from
>>> 15% to > 300% at client count >=72 [6].
>>>
>>
>> It's not entirely clear which of the above tests were done on unlogged
>> tables, and I don't see that in the referenced e-mails. That would be an
>> interesting thing to mention in the summary, I think.
>>
>
> One thing is clear that all results are on either
> synchronous_commit=off or on unlogged tables. I think Dilip can
> answer better which of those are on unlogged and which on
> synchronous_commit=off.

I have mentioned this above, under each of your test points.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Proposal: scan key push down to heap [WIP]
On Wed, Oct 26, 2016 at 12:01 PM, Andres Freund wrote:
> The gains are quite noticeable in some cases. So if we can make it work
> without noticeable downsides...
>
> What I'm worried about though is that this, afaics, will quite
> noticeably *increase* total cost in cases with a noticeable number of
> columns and a not that selective qual. The reason for that being that
> HeapKeyTest() uses heap_getattr(), whereas upper layers use
> slot_getattr(). The latter "caches" repeated deforms, the former
> doesn't... That'll lead to deforming being essentially done twice, and
> it's quite often already a major cost of query processing.

What about putting a slot reference inside HeapScanDesc? I know it would make the heap layer use an executor structure, but it's just a thought.

I have quickly hacked it this way: we keep a slot reference in HeapScanDesc and use slot_getattr directly inside HeapKeyTest (only if we have a valid slot, otherwise fall back to heap_getattr), and measured the worst-case performance you mention above.

My test: a 21-column table with a varchar column near the beginning, quals on the last few columns, and varying selectivity.

postgres=# \d test
      Table "public.test"
 Column |       Type        | Modifiers
--------+-------------------+-----------
 f1     | integer           |
 f2     | character varying |
 f3     | integer           |
 f4     | integer           |
 f5     | integer           |
 f6     | integer           |
 f7     | integer           |
 f8     | integer           |
 f9     | integer           |
 f10    | integer           |
 f11    | integer           |
 f12    | integer           |
 f13    | integer           |
 f14    | integer           |
 f15    | integer           |
 f16    | integer           |
 f17    | integer           |
 f18    | integer           |
 f19    | integer           |
 f20    | integer           |
 f21    | integer           |

tuple count: 10 Million

explain analyze select * from test where f21< $1 and f20 < $1 and f19 < $1 and f15 < $1 and f10 < $1;
($1 varies from 1 Million to 10 Million.)

Target code bases:
1. Head
2. heap_scankey_pushdown_v1
3. My hack of keeping a slot reference in HeapScanDesc (v1+use_slot_in_HeapKeyTest)

Result:
Selectivity   Head   scan_key_pushdown_v1   v1+use_slot_in_HeapKeyTest
0.1           3880   2980                   2747
0.2           4041   3187                   2914
0.5           5051   4921                   3626
0.8           5378   7296                   3879
1.0           6161   8525                   4575

A performance graph is attached to the mail.

Observations:
1. heap_scankey_pushdown_v1 starts degrading at very high selectivity (this behaviour is only visible if the table has 20 or more columns; I also tested with 10 columns and did not see any regression in v1).
2. (v1+use_slot_in_HeapKeyTest) is always the winner, even at very high selectivity.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
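PS: The hack is essentially this (rough sketch, simplified from the real HeapKeyTest macro; the extra slot argument is the new, experimental part):

/*
 * HeapKeyTest() normally fetches attributes with heap_getattr(), which
 * re-deforms the tuple for every key.  If the scan descriptor carries the
 * executor's slot, we can use slot_getattr() instead, which deforms each
 * tuple only once and caches the result in tts_values[]/tts_isnull[].
 */
static bool
heap_key_test(HeapTuple tuple, TupleDesc tupdesc,
              TupleTableSlot *slot,          /* new: may be NULL */
              int nkeys, ScanKey keys)
{
    for (int i = 0; i < nkeys; i++)
    {
        ScanKey key = &keys[i];
        Datum   value;
        bool    isnull;

        if (slot != NULL)
            value = slot_getattr(slot, key->sk_attno, &isnull);   /* cached deform */
        else
            value = heap_getattr(tuple, key->sk_attno, tupdesc, &isnull);

        if (isnull || (key->sk_flags & SK_ISNULL))
            return false;

        if (!DatumGetBool(FunctionCall2Coll(&key->sk_func,
                                            key->sk_collation,
                                            value, key->sk_argument)))
            return false;
    }
    return true;
}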
[HACKERS] Re: [BUGS] BUG #14350: VIEW with INSTEAD OF INSERT TRIGGER and COPY. Missing feature or working as designed.
On Tue, Nov 1, 2016 at 12:40 PM, Haribabu Kommi wrote:
> COPY command is treated as an UTILITY command, During the query
> processing, the rules are not applied on the COPY command, and in the
> execution of COPY command, it just inserts the data into the heap and
> indexes with direct calls.
>
> I feel supporting the COPY command for the case-2 is little bit complex
> and may not be that much worth.

I agree, it will be complex.

> Attached is the updated patch with doc changes.

Now that we are adding support for INSTEAD OF triggers in the COPY FROM command, wouldn't it be good to provide a HINT to the user when an INSTEAD OF trigger does not exist, like we do in the INSERT case?

INSERT case:
postgres=# insert into ttt_v values(7);
2016-11-01 14:31:39.845 IST [72343] ERROR:  cannot insert into view "ttt_v"
2016-11-01 14:31:39.845 IST [72343] DETAIL:  Views that do not select from a single table or view are not automatically updatable.
2016-11-01 14:31:39.845 IST [72343] HINT:  To enable inserting into the view, provide an INSTEAD OF INSERT trigger or an unconditional ON INSERT DO INSTEAD rule.

COPY case:
postgres=# COPY ttt_v FROM stdin;
2016-11-01 14:31:52.235 IST [72343] ERROR:  cannot copy to view "ttt_v"
2016-11-01 14:31:52.235 IST [72343] STATEMENT:  COPY ttt_v FROM stdin;

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
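PS: Roughly, what I have in mind in copy.c is something like this (sketch only; the helper and the exact wording/condition would follow whatever the patch does):

/*
 * Sketch: when COPY targets a view that has no INSTEAD OF INSERT trigger,
 * give the same kind of hint the INSERT path gives.  Simplified; not the
 * actual patch code.
 */
if (rel->rd_rel->relkind == RELKIND_VIEW &&
    !view_has_instead_of_insert_trigger(rel))       /* hypothetical helper */
    ereport(ERROR,
            (errcode(ERRCODE_WRONG_OBJECT_TYPE),
             errmsg("cannot copy to view \"%s\"",
                    RelationGetRelationName(rel)),
             errhint("To enable copying to the view, provide an INSTEAD OF INSERT trigger.")));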