Hi, I've prepared a slightly updated patch, based on the previous review. See it attached.
On 18.10.2012 04:28, 花田 茂 wrote:
> Hi Tomas,
>
> On 2012/10/17, at 20:45, Tomas Vondra <t...@fuzzy.cz> wrote:
>>
>> On 17.10.2012 12:34, Shigeru HANADA wrote:
>>> Performance test
>>> ================
>>> I tested the 1000 tables case (each a copy of pgbench_branches with
>>> 100000 rows) on a server with 1GB shared_buffers. Please note that I
>>> tested on a MacBook Air, i.e. the storage is not HDD but SSD. Here is
>>> the test procedure:
>>>
>>> 1) loop 1000 times
>>> 1-1) create copy table of pgbench_accounts as accounts$i
>>> 1-2) load 100000 rows
>>> 1-3) add primary key
>>> 1-4) select all rows to cache pages in shared buffer
>>> 2) BEGIN
>>> 3) loop 1000 times
>>> 3-1) DROP TABLE accounts$i
>>> 4) COMMIT
>>
>> I don't think the 'load rows' and 'select all rows' steps are really
>> necessary. And AFAIK sequential scans use a small circular buffer
>> so as not to pollute shared buffers, so I'd guess the pages are not
>> cached in shared buffers anyway. Have you verified that, e.g. by
>> pg_buffercache?
>
> Oops, you're right. I omitted 1-3 and 1-4 in the actual measurement,
> but IMO loading the data is necessary to fill the shared buffers up,
> because the number of buffers deleted during COMMIT is the major
> factor for this patch. And yes, I verified that all shared buffers
> are used after all the loading has finished.

I don't think it's an important factor, as the original code does this:

    for (i = 0; i < NBuffers; i++)
    {
        volatile BufferDesc *bufHdr = &BufferDescriptors[i];
        ...
    }

in DropRelFileNodeAllBuffers. That loops through all shared buffers no
matter what, so IMHO the performance in this case depends on the total
size of the shared buffers. But maybe I'm missing something important.

>>>> Our system creates a lot of "working tables" (even 100,000) and we
>>>> need to perform garbage collection (dropping obsolete tables)
>>>> regularly. This often took ~ 1 hour, because we're using big AWS
>>>> instances with lots of RAM (which tends to be slower than RAM on
>>>> bare hw). After applying this patch and dropping tables in groups
>>>> of 100, the gc runs in less than 4 minutes (i.e. a 15x speed-up).
>>>
>>> Hm, my environment seems very different from yours. Could you show
>>> the setting of shared_buffers in your environment? I'd like to make
>>> my test environment as similar as possible to yours.
>>
>> We're using m2.4xlarge instances (70GB of RAM) with 10GB shared
>> buffers.
>
> Thank you, that's much larger than I expected. I'm not sure whether my
> boss will allow me to use such a rich environment... :(

I've done a simple benchmark on my laptop with 2GB shared buffers; it's
attached as drop-test.py (it's a bit messy, but it works). It does four
things:

1) creates N tables
2) drops them one by one
3) creates N tables
4) drops them in batches (batch = one transaction)

To run the script, simply specify the number of tables you want to work
with (say 10000), the size of the batch (e.g. 100) and a connection
string (e.g. 'host=localhost dbname=mydb').

With those parameters, I got these numbers on the laptop:

    creating 10000 tables
      all tables created in 3.298694 seconds
    dropping 10000 tables one by one ...
      all tables dropped in 32.692478 seconds
    creating 10000 tables
      all tables created in 3.458178 seconds
    dropping 10000 tables in batches by 100...
      all tables dropped in 3.28268 seconds

So it's 33 seconds vs. 3.3 seconds, i.e. a 10x speedup. On AWS we
usually see larger differences, as we use larger shared buffers and the
memory is significantly slower there.
Regarding the other comments:

> * As Robert commented, this patch adds DropRelFileNodeAllBuffersList
>   by copying code from DropRelFileNodeAllBuffers. Please refactor it
>   to avoid code duplication.

Yes, I've merged the two functions, i.e. I've removed the original
DropRelFileNodeAllBuffers and reused the name for the new one (taking
an array instead of a single relnode). Then I've modified the existing
calls to use

    DropRelFileNodeAllBuffers(&relnode, 1)

instead of the original

    DropRelFileNodeAllBuffers(relnode)

Maybe this should be done for smgrdounlink/smgrdounlinkall too.

> * In smgrDoPendingDeletes, you free srels explicitly. Can't we leave
>   them to the memory context stuff? Even if it is required, at least
>   pfree must be called in the nrels == 0 case too.

Hmmm, right. I'm not sure which choice is better, so for now I've moved
the pfree out of the 'if' block, so it's executed for nrels == 0 too.

> * In smgrDoPendingDeletes, the buffer srels is expanded in every
>   iteration. This seems a bit inefficient. How about doubling the
>   capacity when it's used up? This requires an additional variable,
>   but reduces the repalloc calls considerably.

Done, although I haven't seen any significant speed improvement.

> * Just an idea, but if we can remove entries for local relations from
>   the rnodes array before the buffer loop in
>   DropRelFileNodeAllBuffersList, the following bsearch might be more
>   efficient, though dropping many temporary tables might be rare.

My reasoning exactly. But maybe it should be done anyway, to keep the
code clean, i.e. not letting temp tables into code paths where they are
not expected (see the sketch below).

Tomas
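PS: To be concrete about that last point, here is a rough sketch (not
part of the attached patch) of how the beginning of
DropRelFileNodeAllBuffers might filter the local relations out of the
rnodes array before the shared-buffer loop; 'n' is a new local variable
introduced just for this sketch:

    int     n = 0;

    for (i = 0; i < nnodes; i++)
    {
        if (RelFileNodeBackendIsTemp(rnodes[i]))
        {
            /* local relations are localbuf.c's problem */
            if (rnodes[i].backend == MyBackendId)
                DropRelFileNodeAllLocalBuffers(rnodes[i].node);
        }
        else
            rnodes[n++] = rnodes[i];    /* keep shared relations only */
    }
    nnodes = n;

    /* no shared relations left, nothing to do in shared buffers */
    if (nnodes == 0)
        return;

    /* sort the remaining entries so the per-buffer bsearch works */
    pg_qsort(rnodes, nnodes, sizeof(RelFileNodeBackend), rnode_comparator);

That would also keep the array searched by bsearch a bit smaller,
although, as you say, dropping many temporary tables at once is
probably rare.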
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 993bc49..fcd9387 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -335,6 +335,11 @@ smgrDoPendingDeletes(bool isCommit)
 	PendingRelDelete *pending;
 	PendingRelDelete *prev;
 	PendingRelDelete *next;
+
+	SMgrRelation *srels = palloc(sizeof(SMgrRelation));
+	int			nrels = 0,
+				i = 0,
+				maxrels = 1;
 
 	prev = NULL;
 	for (pending = pendingDeletes; pending != NULL; pending = next)
@@ -358,14 +363,31 @@ smgrDoPendingDeletes(bool isCommit)
 				SMgrRelation srel;
 
 				srel = smgropen(pending->relnode, pending->backend);
-				smgrdounlink(srel, false);
-				smgrclose(srel);
+
+				/* extend the array if needed (double the size) */
+				if (maxrels <= nrels) {
+					maxrels *= 2;
+					srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+				}
+
+				srels[nrels++] = srel;
 			}
 			/* must explicitly free the list entry */
 			pfree(pending);
 			/* prev does not change */
 		}
 	}
+
+	if (nrels > 0)
+	{
+		smgrdounlinkall(srels, nrels, false);
+
+		for (i = 0; i < nrels; i++)
+			smgrclose(srels[i]);
+	}
+
+	pfree(srels);
+
 }
 
 /*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index bdcbe47..aa934a8 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -108,6 +108,7 @@ static volatile BufferDesc *BufferAlloc(SMgrRelation smgr,
 static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
 static void AtProcExit_Buffers(int code, Datum arg);
 
+static int rnode_comparator(const void * p1, const void * p2);
 
 /*
  * PrefetchBuffer -- initiate asynchronous read of a block of a relation
@@ -2094,31 +2095,37 @@ DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber forkNum,
  * --------------------------------------------------------------------
  */
 void
-DropRelFileNodeAllBuffers(RelFileNodeBackend rnode)
+DropRelFileNodeAllBuffers(RelFileNodeBackend * rnodes, int nnodes)
 {
-	int			i;
+	int			i;
+
+	/* sort the list of rnodes */
+	pg_qsort(rnodes, nnodes, sizeof(RelFileNodeBackend), rnode_comparator);
 
 	/* If it's a local relation, it's localbuf.c's problem. */
-	if (RelFileNodeBackendIsTemp(rnode))
+	for (i = 0; i < nnodes; i++)
 	{
-		if (rnode.backend == MyBackendId)
-			DropRelFileNodeAllLocalBuffers(rnode.node);
-		return;
+		if (RelFileNodeBackendIsTemp(rnodes[i]))
+		{
+			if (rnodes[i].backend == MyBackendId)
+				DropRelFileNodeAllLocalBuffers(rnodes[i].node);
+		}
 	}
 
 	for (i = 0; i < NBuffers; i++)
 	{
 		volatile BufferDesc *bufHdr = &BufferDescriptors[i];
+		RelFileNodeBackend *rnode = bsearch((const void *) &(bufHdr->tag.rnode),
+											rnodes,
+											nnodes, sizeof(RelFileNodeBackend),
+											rnode_comparator);
 
-		/*
-		 * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
-		 * and saves some cycles.
-		 */
-		if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+		/* buffer does not belong to any of the relations */
+		if (rnode == NULL)
 			continue;
 
 		LockBufHdr(bufHdr);
-		if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+		if (RelFileNodeEquals(bufHdr->tag.rnode, rnode->node))
 			InvalidateBuffer(bufHdr);	/* releases spinlock */
 		else
 			UnlockBufHdr(bufHdr);
@@ -2953,3 +2960,31 @@ local_buffer_write_error_callback(void *arg)
 		pfree(path);
 	}
 }
+
+/*
+ * Used to sort relfilenode array (ordered by [relnode, dbnode, spcnode]), so
+ * that it's suitable for bsearch.
+ */
+static int
+rnode_comparator(const void * p1, const void * p2)
+{
+	RelFileNodeBackend n1 = * (RelFileNodeBackend *) p1;
+	RelFileNodeBackend n2 = * (RelFileNodeBackend *) p2;
+
+	if (n1.node.relNode < n2.node.relNode)
+		return -1;
+	else if (n1.node.relNode > n2.node.relNode)
+		return 1;
+
+	if (n1.node.dbNode < n2.node.dbNode)
+		return -1;
+	else if (n1.node.dbNode > n2.node.dbNode)
+		return 1;
+
+	if (n1.node.spcNode < n2.node.spcNode)
+		return -1;
+	else if (n1.node.spcNode > n2.node.spcNode)
+		return 1;
+	else
+		return 0;
+}
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 5dff8b3..40a9d0b 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -390,7 +390,7 @@ smgrdounlink(SMgrRelation reln, bool isRedo)
 	 * Get rid of any remaining buffers for the relation. bufmgr will just
 	 * drop them without bothering to write the contents.
 	 */
-	DropRelFileNodeAllBuffers(rnode);
+	DropRelFileNodeAllBuffers(&rnode, 1);
 
 	/*
 	 * It'd be nice to tell the stats collector to forget it immediately, too.
@@ -419,6 +419,64 @@ smgrdounlink(SMgrRelation reln, bool isRedo)
 	(*(smgrsw[which].smgr_unlink)) (rnode, InvalidForkNumber, isRedo);
 }
 
+void smgrdounlinkall(SMgrRelation * rels, int nrels, bool isRedo)
+{
+	int			i = 0;
+	RelFileNodeBackend rnodes[nrels];
+	ForkNumber	forknum;
+
+	for (i = 0; i < nrels; i++)
+	{
+		RelFileNodeBackend rnode = rels[i]->smgr_rnode;
+		int			which = rels[i]->smgr_which;
+
+		rnodes[i] = rnode;
+
+		/* Close the forks at smgr level */
+		for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+			(*(smgrsw[which].smgr_close)) (rels[i], forknum);
+	}
+
+	/*
+	 * Get rid of any remaining buffers for the relation. bufmgr will just
+	 * drop them without bothering to write the contents.
+	 */
+	DropRelFileNodeAllBuffers(rnodes, nrels);
+
+	/*
+	 * It'd be nice to tell the stats collector to forget it immediately, too.
+	 * But we can't because we don't know the OID (and in cases involving
+	 * relfilenode swaps, it's not always clear which table OID to forget,
+	 * anyway).
+	 */
+
+	/*
+	 * Send a shared-inval message to force other backends to close any
+	 * dangling smgr references they may have for this rel. We should do this
+	 * before starting the actual unlinking, in case we fail partway through
+	 * that step. Note that the sinval message will eventually come back to
+	 * this backend, too, and thereby provide a backstop that we closed our
+	 * own smgr rel.
+	 */
+	for (i = 0; i < nrels; i++) {
+		CacheInvalidateSmgr(rnodes[i]);
+	}
+
+	/*
+	 * Delete the physical file(s).
+	 *
+	 * Note: smgr_unlink must treat deletion failure as a WARNING, not an
+	 * ERROR, because we've already decided to commit or abort the current
+	 * xact.
+	 */
+
+	for (i = 0; i < nrels; i++) {
+		int			which = rels[i]->smgr_which;
+		for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+			(*(smgrsw[which].smgr_unlink)) (rnodes[i], forknum, isRedo);
+	}
+}
+
 /*
  * smgrdounlinkfork() -- Immediately unlink one fork of a relation.
  *
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 51eb77b..1deee42 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -188,7 +188,7 @@ extern void FlushRelationBuffers(Relation rel);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
 					   ForkNumber forkNum, BlockNumber firstDelBlock);
-extern void DropRelFileNodeAllBuffers(RelFileNodeBackend rnode);
+extern void DropRelFileNodeAllBuffers(RelFileNodeBackend * rnodes, int nnodes);
 extern void DropDatabaseBuffers(Oid dbid);
 
 #define RelationGetNumberOfBlocks(reln) \
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 9eccf76..272476b 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -85,6 +85,7 @@ extern void smgrcloseall(void);
 extern void smgrclosenode(RelFileNodeBackend rnode);
 extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern void smgrdounlink(SMgrRelation reln, bool isRedo);
+extern void smgrdounlinkall(SMgrRelation * rels, int nrels, bool isRedo);
 extern void smgrdounlinkfork(SMgrRelation reln, ForkNumber forknum,
 				 bool isRedo);
 extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
 		   BlockNumber blocknum, char *buffer, bool skipFsync);
#!/bin/env python

import datetime
import psycopg2
import sys

if __name__ == '__main__':

    if len(sys.argv) < 4:
        print "ERROR: not enough parameters"
        print "HINT: drop-test.py num-of-tables drop-num 'connection string'"
        sys.exit(1)

    ntables = int(sys.argv[1])
    nlimit = int(sys.argv[2])
    connstr = str(sys.argv[3])
    debug = False

    conn = psycopg2.connect(connstr)
    cur = conn.cursor()

    print 'creating %s tables' % (ntables,)
    start = datetime.datetime.now()
    for i in range(ntables):
        cur.execute('CREATE TABLE tab_%s (id INT)' % (i,))
        if (i % 1000 == 0) and debug:
            print '  tables created: %s' % (i,)
    conn.commit()
    end = datetime.datetime.now()
    print '  all tables created in %s seconds' % ((end-start).total_seconds(),)

    # set to autocommit mode
    conn.autocommit = True

    start = datetime.datetime.now()
    print 'dropping %s tables one by one ...' % (ntables,)
    for i in range(ntables):
        cur.execute('DROP TABLE tab_%s' % (i,))
        if (i % 1000 == 0) and debug:
            print '  tables dropped: %s' % (i,)
    end = datetime.datetime.now()
    print '  all tables dropped in %s seconds' % ((end-start).total_seconds(),)

    # cancel the autocommit mode
    conn.autocommit = False

    # recreate tables
    print 'creating %s tables' % (ntables,)
    start = datetime.datetime.now()
    for i in range(ntables):
        cur.execute('CREATE TABLE tab_%s (id INT)' % (i,))
        if (i % 1000 == 0) and debug:
            print '  tables created: %s' % (i,)
    conn.commit()
    end = datetime.datetime.now()
    print '  all tables created in %s seconds' % ((end-start).total_seconds(),)

    # drop the tables in batches
    print 'dropping %s tables in batches by %s...' % (ntables, nlimit)
    start = datetime.datetime.now()
    for i in range(ntables):
        cur.execute('DROP TABLE tab_%s' % (i,))
        if i % nlimit == 0:
            conn.commit()
            if debug:
                print '  tables dropped: %s' % (i,)
    conn.commit()
    end = datetime.datetime.now()
    print '  all tables dropped in %s seconds' % ((end-start).total_seconds(),)

    conn.close()