Re: [HACKERS] [sqlsmith] FailedAssertion("!(XLogCtl->Insert.exclusiveBackup)", File: "xlog.c", Line: 10200)

2016-08-04 Thread Michael Paquier
On Thu, Aug 4, 2016 at 2:19 AM, Andreas Seltenreich  wrote:
> testing with sqlsmith shows that the following assertion doesn't hold:
>
> FailedAssertion("!(XLogCtl->Insert.exclusiveBackup)", File: "xlog.c", 
> Line: 10200)
>
> The triggering statements always contain a call to pg_start_backup with
> the third argument 'true', i.e. it's trying to start an exclusive backup.
>
> I didn't manage to put together a stand-alone testcase yet.

While I have not been able to trigger this assertion directly, I have
bumped into the fact that pg_stop_backup can unconditionally reset
XLogCtl->Insert.exclusiveBackup *before* pg_start_backup finishes, or
even before it creates the backup_label file. The in-memory state then
says that no backup is running at all (exclusive or non-exclusive),
but a backup_label file may still be present. In this state, it is not
possible to run pg_start_backup or pg_stop_backup again unless the
backup_label file is removed manually.

In do_pg_stop_backup, the two steps would be better reversed, as in the
attached patch. What pg_stop_backup should actually do is first check
whether the backup_label file exists, and only then reset the in-memory
flag, because the last thing do_pg_start_backup does is write the
backup_label file. This does not completely close the window, though:
after the backup_label file is created, the assertion could still be
triggered if an error occurs on the tablespace map file.

This window exists on back-branches as well, by the way; it is not new
to 9.6. Magnus, what do you think?
-- 
Michael
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f13f9c1..9380225 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10274,40 +10274,6 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
 			  errmsg("WAL level not sufficient for making an online backup"),
  errhint("wal_level must be set to \"replica\" or \"logical\" at server start.")));
 
-	/*
-	 * OK to update backup counters and forcePageWrites
-	 */
-	WALInsertLockAcquireExclusive();
-	if (exclusive)
-	{
-		if (!XLogCtl->Insert.exclusiveBackup)
-		{
-			WALInsertLockRelease();
-			ereport(ERROR,
-	(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-	 errmsg("exclusive backup not in progress")));
-		}
-		XLogCtl->Insert.exclusiveBackup = false;
-	}
-	else
-	{
-		/*
-		 * The user-visible pg_start/stop_backup() functions that operate on
-		 * exclusive backups can be called at any time, but for non-exclusive
-		 * backups, it is expected that each do_pg_start_backup() call is
-		 * matched by exactly one do_pg_stop_backup() call.
-		 */
-		Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
-		XLogCtl->Insert.nonExclusiveBackups--;
-	}
-
-	if (!XLogCtl->Insert.exclusiveBackup &&
-		XLogCtl->Insert.nonExclusiveBackups == 0)
-	{
-		XLogCtl->Insert.forcePageWrites = false;
-	}
-	WALInsertLockRelease();
-
 	if (exclusive)
 	{
 		/*
@@ -10362,6 +10328,40 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
 	}
 
 	/*
+	 * OK to update backup counters and forcePageWrites
+	 */
+	WALInsertLockAcquireExclusive();
+	if (exclusive)
+	{
+		if (!XLogCtl->Insert.exclusiveBackup)
+		{
+			WALInsertLockRelease();
+			ereport(ERROR,
+	(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+	 errmsg("exclusive backup not in progress")));
+		}
+		XLogCtl->Insert.exclusiveBackup = false;
+	}
+	else
+	{
+		/*
+		 * The user-visible pg_start/stop_backup() functions that operate on
+		 * exclusive backups can be called at any time, but for non-exclusive
+		 * backups, it is expected that each do_pg_start_backup() call is
+		 * matched by exactly one do_pg_stop_backup() call.
+		 */
+		Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
+		XLogCtl->Insert.nonExclusiveBackups--;
+	}
+
+	if (!XLogCtl->Insert.exclusiveBackup &&
+		XLogCtl->Insert.nonExclusiveBackups == 0)
+	{
+		XLogCtl->Insert.forcePageWrites = false;
+	}
+	WALInsertLockRelease();
+
+	/*
 	 * Read and parse the START WAL LOCATION line (this code is pretty crude,
 	 * but we are not expecting any variability in the file format).
 	 */

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reviewing freeze map code

2016-08-04 Thread Andres Freund
Hi,

On 2016-08-02 10:55:18 -0400, Noah Misch wrote:
> [Action required within 72 hours.  This is a generic notification.]
> 
> The above-described topic is currently a PostgreSQL 9.6 open item.  Andres,
> since you committed the patch believed to have created it, you own this open
> item.

Well, kinda (it was a partial fix for something not originally by me),
but I'll deal with it. Reading now, will commit tomorrow.

Regards,

Andres




Re: [HACKERS] [sqlsmith] FailedAssertion("!(XLogCtl->Insert.exclusiveBackup)", File: "xlog.c", Line: 10200)

2016-08-04 Thread Michael Paquier
On Thu, Aug 4, 2016 at 4:03 PM, Michael Paquier
 wrote:
> This window exists as well on back-branches btw, this is not new to
> 9.6. Magnus, what do you think?

Actually, a completely watertight solution is to use a dedicated
in-memory flag for exclusive backups during the time pg_start_backup()
is running, at the cost of taking WALInsertLockAcquireExclusive once
more at the end of do_pg_start_backup(). That seems an acceptable cost,
because pg_start_backup is not a hot code path. Using a separate flag
also has the advantage of giving the user a better error message when
pg_stop_backup is called while pg_start_backup is still running.

Andreas, with the patch attached is the assertion still triggered?

Thoughts from others are welcome as well.
-- 
Michael
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f13f9c1..c4bc332 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -505,10 +505,13 @@ typedef struct XLogCtlInsert
 	 * of streaming base backups currently in progress. forcePageWrites is set
 	 * to true when either of these is non-zero. lastBackupStart is the latest
 	 * checkpoint redo location used as a starting point for an online backup.
+	 * exclusiveBackupStarted is true while pg_start_backup() is being called
+	 * during an exclusive backup.
 	 */
 	bool		exclusiveBackup;
 	int			nonExclusiveBackups;
 	XLogRecPtr	lastBackupStart;
+	bool		exclusiveBackupStarted;
 
 	/*
 	 * WAL insertion locks.
@@ -9833,7 +9836,8 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 	WALInsertLockAcquireExclusive();
 	if (exclusive)
 	{
-		if (XLogCtl->Insert.exclusiveBackup)
+		if (XLogCtl->Insert.exclusiveBackup ||
+			XLogCtl->Insert.exclusiveBackupStarted)
 		{
 			WALInsertLockRelease();
 			ereport(ERROR,
@@ -9842,6 +9846,7 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 	 errhint("Run pg_stop_backup() and try again.")));
 		}
 		XLogCtl->Insert.exclusiveBackup = true;
+		XLogCtl->Insert.exclusiveBackupStarted = true;
 	}
 	else
 		XLogCtl->Insert.nonExclusiveBackups++;
@@ -10178,6 +10183,16 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 	PG_END_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
 
 	/*
+	 * Mark that start phase has correctly finished for an exclusive backup.
+	 */
+	if (exclusive)
+	{
+		WALInsertLockAcquireExclusive();
+		XLogCtl->Insert.exclusiveBackupStarted = false;
+		WALInsertLockRelease();
+	}
+
+	/*
 	 * We're done.  As a convenience, return the starting WAL location.
 	 */
 	if (starttli_p)
@@ -10195,8 +10210,10 @@ pg_start_backup_callback(int code, Datum arg)
 	WALInsertLockAcquireExclusive();
 	if (exclusive)
 	{
-		Assert(XLogCtl->Insert.exclusiveBackup);
+		Assert(XLogCtl->Insert.exclusiveBackup &&
+			   XLogCtl->Insert.exclusiveBackupStarted);
 		XLogCtl->Insert.exclusiveBackup = false;
+		XLogCtl->Insert.exclusiveBackupStarted = false;
 	}
 	else
 	{
@@ -10287,6 +10304,13 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
 	(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 	 errmsg("exclusive backup not in progress")));
 		}
+		if (XLogCtl->Insert.exclusiveBackupStarted)
+		{
+			WALInsertLockRelease();
+			ereport(ERROR,
+	(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+	 errmsg("exclusive backup currently starting")));
+		}
 		XLogCtl->Insert.exclusiveBackup = false;
 	}
 	else



[HACKERS] [RFC] Change the default of update_process_title to off

2016-08-04 Thread Tsunakawa, Takayuki
Hello,

I'd like to propose changing the default value of update_process_title to off, 
at least on Windows.  I'll submit a patch if we see no big problem.


PROBLEM


Our customer is trying to certify PostgreSQL with their packaged software 
product.  Currently, the product supports a famous DBMS (let me call it DBMS-X 
hereafter).  They evaluated the performance of PostgreSQL and DBMS-X.

The performance of PostgreSQL was very bad on Windows.  The results were
as follows (the numbers are request throughput).  These were measured on
the same machine.

* DBMS-X on Windows: 750
* PostgreSQL on Windows: 250
* PostgreSQL on Linux: 870

The performance on Windows was considered unacceptable.  Using pgbench, we
could see a similar result -- the performance on Linux is about three times
higher than on Windows.


CAUSE


The CreateEvent() and CloseHandle() Win32 API calls from postgres.exe were
consuming much CPU time.  While stressing the system by running the
select-only mode of pgbench, Windows Performance Monitor showed 50% user
time, 40% privileged time, and 10% idle time.  Windows Performance Toolkit,
which corresponds to perf on Linux, revealed that half of the privileged
time was spent in CreateEvent() and CloseHandle() called from
set_ps_display().  Those calls are performed when update_process_title is on.

With update_process_title off, the performance became much closer to Linux,
as follows.  The scaling factor is 300.  The pgbench client was run on a
different Windows machine with 12 CPU cores.  The effect was minimal on Linux.

C:\> pgbench -h  -T 30 -c #clients -j 12 -S benchdb

[Windows]
#clients      on     off
      12   29793   38169
      24   31587   87237
      48   32588   83335
      96   34261   67668

[Linux]
#clients      on     off
      12   52823   52976
      24   90712   91955
      48  108653  108762
      96  107167  107140


PROPOSAL AND CONSIDERATIONS


I think we should change the default of update_process_title to off on Windows 
because:

1. The performance gain is huge.
2. It's almost useless because we can only see the postgres command line with 
Process Explorer, which the user must download from Microsoft and install.
3. I don't see the benefit of update_process_title=on at the expense of 
performance.
4. The default settings of PostgreSQL parameters should be friendly.  I'm
afraid many users cannot trace the cause of poor performance back to
update_process_title.  I heard that MySQL's popularity was partly because it
ran smoothly on Windows in the early days.  PostgreSQL should run smoothly
there, too.

The question is, do we want to change the default to off on other OSes?  Is the 
command line really useful?  If useful, does it need to be on by default?
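For reference, users affected today can already opt out themselves; the
proposal amounts to flipping this existing postgresql.conf setting (shown
here purely for illustration):

```ini
# postgresql.conf
# On Windows, each set_ps_display() call incurs a CreateEvent()/CloseHandle()
# pair; disabling process-title updates avoids that overhead at the cost of
# less informative process listings.
update_process_title = off
```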

Regards
Takayuki Tsunakawa






Re: [HACKERS] New version numbering practices

2016-08-04 Thread Craig Ringer
On 4 August 2016 at 12:45, Tom Lane  wrote:

> Craig Ringer  writes:
> > On 4 August 2016 at 02:15, Tom Lane  wrote:
> >> So it seems like fixing libpq's parsing of server_version_num is
> >> something we definitely want to fix ASAP in all back branches.
>
> > Well, this seems like a good time to make server_version_num GUC_REPORT
> > as well...
>
> To what end?  Existing versions of libpq wouldn't know about it, and new
> versions of libpq couldn't rely on it to get reported by older servers,
> so it'd still be the path of least resistance to examine server_version.

Because it's really silly that we don't, and since we're making a change
that will affect clients anyway (the argument against doing it before),
let's do it.

Otherwise why bother ever adding anything, since it'll take time for
clients to use it?
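For context on the parsing issue Tom mentions upthread, here is a
standalone sketch (my own illustration, not libpq's actual implementation)
of converting a server_version string into the integer form that
PQserverVersion() returns, which now has to handle both the old three-part
and the new two-part numbering schemes:

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative only -- not libpq code.  Pre-10 servers report three
 * parts ("9.6.4" -> 90604); version 10 and later report two
 * ("10.1" -> 100001).  Trailing text such as "devel" or "beta1"
 * simply stops the scan, which the real parser must also tolerate. */
static int
parse_server_version(const char *v)
{
	int			parts[3] = {0, 0, 0};
	int			nparts = 0;
	char	   *end;

	while (nparts < 3)
	{
		long		n = strtol(v, &end, 10);

		if (end == v)
			break;				/* no digits, e.g. the "devel" in "10devel" */
		parts[nparts++] = (int) n;
		if (*end != '.')
			break;
		v = end + 1;
	}
	if (nparts == 0)
		return 0;				/* unparsable */
	if (parts[0] >= 10)			/* new two-part scheme */
		return parts[0] * 10000 + parts[1];
	return parts[0] * 10000 + parts[1] * 100 + parts[2];
}
```

The key point is that a parser assuming exactly three dotted components
breaks on "10.1" and on "10devel", which is exactly the back-branch
breakage being discussed.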

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] [RFC] Change the default of update_process_title to off

2016-08-04 Thread Michael Paquier
On Thu, Aug 4, 2016 at 4:41 PM, Tsunakawa, Takayuki
 wrote:
> 1. The performance gain is huge.
> 2. It's almost useless because we can only see the postgres command line with 
> Process Explorer, which the user must download from Microsoft and install.
> 3. I don't see the benefit of update_process_title=on at the expense of 
> performance.
> 4. The default setting of PostgreSQL parameters should be friendly.  I'm 
> afraid many users cannot track the cause of poor performance to 
> update_process_title.  I heard that MySQL's popularity was partly because it 
> ran smoothly on Windows in the early days.  PostgreSQL should be, too.
>
> The question is, do we want to change the default to off on other OSes?

I don't think so.

> Is the command line really useful?
> If useful, does it need to be on by default?

I'd vote for keeping it on by default, because this information in ps
output is really useful for all kinds of deployments, testing, etc.

Here is a different proposal: instead, document that disabling this
parameter on Windows can improve performance, at the cost of less
verbose process information.
-- 
Michael




Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown plans

2016-08-04 Thread Etsuro Fujita

On 2016/08/02 22:02, Kouhei Kaigai wrote:

I wrote:

I removed the Relations line.  Here is an updated version of the patch.

* As I said upthread, I left the upper-relation handling for another
patch.  Currently, the patch prints "Foreign Scan" with no relations in
that case.

* I kept custom joins as-is.  We would need discussions about how to
choose relations we print in EXPLAIN, so I'd also like to leave that for
yet another patch.



Please don't rely on the fs_relids bitmap to pick up relations to be printed.
It just holds a set of underlying relations; it does not mean all of
these relations are actually scanned inside the ForeignScan.


First, I'd like to discuss what the relations printed after
"Foreign Join" would mean.  I think they would mean the relations
involved in that foreign join (in other words, the ones that participate
in that foreign join) rather than the relations to be scanned by the
ForeignScan representing that foreign join.  Wouldn't that make sense?


In that sense I think it's reasonable to print all relations specified  
in fs_relids after "Foreign Join".  (And in that sense I was thinking we  
could handle the custom join case the same way as the foreign join case.)



You didn't answer the following scenario I pointed out in the upthread.

| Please assume an enhancement of postgres_fdw that reads a small local table 
(tbl_1)
| and parse them as VALUES() clause within a remote query to execute remote join
| with foreign tables (ftbl_2, ftbl_3).
| This ForeignScan node has three underlying relations; tbl_1, ftbl_2 and 
ftbl_3.
| Likely, tbl_1 will be scanned by SeqScan, not ForeignScan itself.
| In this case, which relations shall be printed around ForeignScan?
| Is it possible to choose proper relation names without hint by the extension?
|

To support this kind of FDW usage, you should add an alternative bitmap
rather than fs_relids/custom_relids.  My suggestion is to have
explain_relids for that purpose.


We currently don't allow such a join to be performed as a foreign join,
because we allow a join to be performed remotely only if the relations
of the join are all foreign tables belonging to the same remote server,
as you know.


So, as I said upthread, I think it would make sense to introduce such a  
bitmap when we extend the existing foreign-join pushdown infrastructure  
so that we can do such a thing as proposed by you.  I guess that that  
would need to extend the concept of foreign joins, though.



Also, the logic to print "Foreign (Scan|Insert|Update|Delete)" is different
from what I suggested.  I'm suggesting that we allow the extension to give
a label to fill in the "Foreign %s" format.

Please explain why your choice is better than my proposition.


No, I haven't done anything about that yet.  I kept the behavior as-is.


At least, my proposition can be applied to both foreign scan and custom
scan, and it needs no future maintenance if and when FDW gets support for
remote Aggregation, Sort, or others.


I'd like to discuss this issue separately, maybe in a new thread.

Best regards,
Etsuro Fujita






Re: [HACKERS] [RFC] Change the default of update_process_title to off

2016-08-04 Thread Andres Freund
On 2016-08-04 16:48:11 +0900, Michael Paquier wrote:
> Here is a different proposal: documenting instead that disabling that
> parameter on Windows can improve performance, at the cost of losing
> information verbosity for processes.

The benefit on Windows seems pretty marginal, given the way it has to be
viewed. People installing Process Explorer et al. to view a value they
have to know about will be able to change the config.




Re: [HACKERS] Leaking memory in text_overlay function

2016-08-04 Thread Anastasia Lubennikova
I found this thread on the open CF. As I see it, the discussion ended
with a decision to reject the patch.

I've just changed the status to "Rejected".


02.07.2016 18:11, Dirk Rudolph:
Well, that's good to know. It just seemed curious that, in my case, I
only saw this in this particular function.

Anyway, I will consider handling the memory the same way, if this is
the recommendation.

Many thanks.

/Closed

On Sat, Jul 2, 2016 at 4:45 PM, Tom Lane wrote:

Dirk Rudolph <dirk.rudo...@netcentric.biz> writes:
> while implementing my own C function, I noticed that some memory is
> not freed by the text_overlay function in varlena.c

By and large, that's intentional.  SQL-called functions normally run in
short-lived memory contexts, so that any memory they don't bother to
pfree will be cleaned up automatically at the end of the tuple cycle.
If we did not follow that approach, there would need to be many
thousands more explicit pfree calls than there are today, and the
system would be net slower overall because multiple retail pfree's are
slower than a MemoryContextReset.

There are places where it's important to avoid leaking memory,
certainly.  But I don't think this is one of them, unless you can
demonstrate a scenario with query-lifespan or worse leakage.
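Tom's point about retail pfree's versus a single context reset can be
illustrated with a toy arena allocator (my own sketch; this is not
PostgreSQL's MemoryContext code):

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy arena illustrating the short-lived-context idea: each allocation
 * is a pointer bump, and releasing *everything* is a single reset
 * instead of one free() per allocation. */
typedef struct
{
	char	   *base;
	size_t		used;
	size_t		size;
} Arena;

static Arena
arena_create(size_t size)
{
	Arena		a = {malloc(size), 0, size};

	return a;
}

static void *
arena_alloc(Arena *a, size_t n)
{
	if (a->base == NULL || a->used + n > a->size)
		return NULL;			/* a real context would chain a new block */
	{
		void	   *p = a->base + a->used;

		a->used += n;
		return p;
	}
}

/* The analogue of MemoryContextReset: O(1) no matter how many
 * allocations were made since the last reset. */
static void
arena_reset(Arena *a)
{
	a->used = 0;
}
```

Resetting the arena is constant-time, which is why thousands of small
allocations per tuple cycle can be cheaper to discard wholesale than to
free individually.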

> Particularly I mean both substrings allocated by text_substring
> (according to the documentation of text_substring, "The result is
> always a freshly palloc'd datum.") and one of the concatenation
> results.  I'm aware of the MemoryContext being deleted in my case,
> but IMHO built-in functions should be memory safe.

That is not the project policy.

regards, tom lane




--

Dirk Rudolph | Senior Software Engineer

Netcentric AG

M: +41 79 642 37 11
D: +49 174 966 84 34

dirk.rudo...@netcentric.biz | www.netcentric.biz




--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Why we lost Uber as a user

2016-08-04 Thread Torsten Zuehlsdorff



On 03.08.2016 21:05, Robert Haas wrote:

On Wed, Aug 3, 2016 at 2:23 PM, Tom Lane  wrote:

Robert Haas  writes:

I don't think they are saying that logical replication is more
reliable than physical replication, nor do I believe that to be true.
I think they are saying that if logical corruption happens, you can
fix it by typing SQL statements to UPDATE, INSERT, or DELETE the
affected rows, whereas if physical corruption happens, there's no
equally clear path to recovery.


Well, that's not an entirely unreasonable point, but I dispute the
implication that it makes recovery from corruption an easy thing to do.
How are you going to know what SQL statements to issue?  If the master
database is changing 24x7, how are you going to keep up with that?


I think in many cases people fix their data using business logic.  For
example, suppose your database goes down and you have to run
pg_resetxlog to get it back up.  You dump-and-restore, as one does,
and find that you can't rebuild one of your unique indexes because
there are now two records with that same PK.  Well, what you do is you
look at them and judge which one has the correct data, often the one
that looks more complete or the one with the newer timestamp.  Or,
maybe you need to merge them somehow.  In my experience helping users
through problems of this type, once you explain the problem to the
user and tell them they have to square it on their end, the support
call ends.  The user may not always be entirely thrilled about having
to, say, validate a problematic record against external sources of
truth, but they usually know how to do it.  Database bugs aren't the
only way that databases become inaccurate.  If the database that they
use to keep track of land ownership in the jurisdiction where I live
says that two different people own the same piece of property,
somewhere there is a paper deed in a filing cabinet.  Fishing that out
to understand what happened may not be fun, but a DBA can explain that
problem to other people in the organization and those people can get
it fixed.  It's a problem, but it's fixable.

On the other hand, if a heap tuple contains invalid infomask bits that
cause an error every time you read the page (this actually happened to
an EnterpriseDB customer!), the DBA can't tell other people how to fix
it and can't fix it personally either.  Instead, the DBA calls me.


After reading this statement, the ZFS filesystem pops into my mind. It
has protection built in against various problems (data degradation,
current spikes, phantom writes, etc.).


For me this raises two questions:

1) Would the use of ZFS prevent such errors?

My feeling says yes, but I have no idea how an invalid infomask bit
could occur.


2) Would it be possible to add such protection to PostgreSQL?

I know this could add massive overhead, but if it's optional, it could
be a fine thing?


Greetings,
Torsten




Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown plans

2016-08-04 Thread Kouhei Kaigai
> -Original Message-
> From: Etsuro Fujita [mailto:fujita.ets...@lab.ntt.co.jp]
> Sent: Thursday, August 04, 2016 4:42 PM
> To: Kaigai Kouhei(海外 浩平); pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown 
> plans
> 
> On 2016/08/02 22:02, Kouhei Kaigai wrote:
> 
> I wrote:
> >> I removed the Relations line.  Here is an updated version of the patch.
> >>
> >> * As I said upthread, I left the upper-relation handling for another
> >> patch.  Currently, the patch prints "Foreign Scan" with no relations in
> >> that case.
> >>
> >> * I kept custom joins as-is.  We would need discussions about how to
> >> choose relations we print in EXPLAIN, so I'd also like to leave that for
> >> yet another patch.
> 
> > Please don't rely on the fs_relids bitmap to pick up relations to be printed.
> > It just holds a set of underlying relations; it does not mean all of
> > these relations are actually scanned inside the ForeignScan.
> 
> Firstly, I'd like to discuss about what the relations printed after
> "Foreign join" would mean.  I think they would mean the relations
> involved in that foreign join (in other words, the ones that participate
> in that foreign join) rather than the relations to be scanned by a
> Foreign Scan representing that foreign join.  Wouldn't that make sense?
> 
> In that sense I think it's reasonable to print all relations specified
> in fs_relids after "Foreign Join".  (And in that sense I was thinking we
> could handle the custom join case the same way as the foreign join case.)
> 
> > You didn't answer the following scenario I pointed out in the upthread.
> >
> > | Please assume an enhancement of postgres_fdw that reads a small local 
> > table (tbl_1)
> > | and parse them as VALUES() clause within a remote query to execute remote 
> > join
> > | with foreign tables (ftbl_2, ftbl_3).
> > | This ForeignScan node has three underlying relations; tbl_1, ftbl_2 and 
> > ftbl_3.
> > | Likely, tbl_1 will be scanned by SeqScan, not ForeignScan itself.
> > | In this case, which relations shall be printed around ForeignScan?
> > | Is it possible to choose proper relation names without hint by the 
> > extension?
> > |
> >
> > To care about these FDW usage, you should add an alternative bitmap rather
> > than fs_relids/custom_relids. My suggestion is, having explain_relids for
> > the purpose.
> 
> We currently don't allow such a join to be performed as a foreign join,
> because we allow a join to be performed remotely only if the relations of
> the join are all foreign tables belonging to the same remote server, as
> you know.
> 
> So, as I said upthread, I think it would make sense to introduce such a
> bitmap when we extend the existing foreign-join pushdown infrastructure
> so that we can do such a thing as proposed by you.  I guess that that
> would need to extend the concept of foreign joins, though.
>
OK, right now FDW does not support taking arbitrary child nodes, unlike
CustomScan. However, this implies that your proposed infrastructure will
also have to be revised once FDW gets enhanced in the near future.

> > Also, the logic to print "Foreign (Scan|Insert|Update|Delete)" is different
> > from what I suggested. I'm suggesting to allow extension giving a label
> > to fill up "Foreign %s" format.
> >
> > Please explain why your choice is better than my proposition.
> 
> No, I haven't done anything about that yet.  I kept the behavior as-is.
> 
> > At least, my proposition is available to apply on both of foreign-scan and
> > custom-scan, and no need to future maintenance if and when FDW gets support
> > remote Aggregation, Sort or others.
> 
> I'd like to discuss this issue separately, maybe in a new thread.
>
Why do you try to re-invent a similar infrastructure twice and separately?

What I proposed perfectly covers what you want to do, and has more benefits:
- A common manner for both ForeignScan and CustomScan
- Flexibility to control the "Foreign XXX" label and the relation names to
  be printed.

Even if it is sufficient for the current usage of FDW, I've been saying
that your proposition is not sufficient for CustomScan nowadays, and will
not be for ForeignScan in the near future.
It is not an answer to ignore the CustomScan side, because we have to
enhance the infrastructure of the CustomScan side to follow up FDW sooner
or later. However, we would then have to apply a different manner on the
CustomScan side, or overwrite what you proposed on the FDW side, at that
time. That is not a desirable future.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 





Re: [HACKERS] Oddity in EXPLAIN for foreign/custom join pushdown plans

2016-08-04 Thread Etsuro Fujita

On 2016/08/04 18:03, Kouhei Kaigai wrote:

Kaigai-san wrote:

Also, the logic to print "Foreign (Scan|Insert|Update|Delete)" is different
from what I suggested. I'm suggesting to allow extension giving a label
to fill up "Foreign %s" format.

Please explain why your choice is better than my proposition.


I wrote:

No, I haven't done anything about that yet.  I kept the behavior as-is.



At least, my proposition is available to apply on both of foreign-scan and
custom-scan, and no need to future maintenance if and when FDW gets support
remote Aggregation, Sort or others.



I'd like to discuss this issue separately, maybe in a new thread.



Why do you try to re-invent a similar infrastructure twice and separately?


As I said above, I haven't changed the behavior of EXPLAIN for *upper  
relation processing* such as aggregation or sorting in a ForeignScan or  
CustomScan node.



What I proposed perfectly covers what you want to do, and has more benefits.
- A common manner for both of ForeignScan and CustomScan
- Flexibility to control "Foreign XXX" label and relation names to be printed.


That may be so or not, but more importantly, this is more like a user  
interface problem, so each person would have different opinions about that.



Even if it is sufficient for the current usage of FDW, I've been saying your
proposition is not sufficient for CustomScan nowadays, and ForeignScan in the
near future.


Again I haven't done anything about the EXPLAIN for upper relation  
processing in both ForeignScan and CustomScan cases.  I kept the  
behavior as-is, but I don't think the behavior as-is is OK, either.



It is not an answer to ignore the CustomScan side, because we have to enhanced
the infrastructure of CustomScan side to follow up FDW sooner or later.
However, we will have to apply a different manner on CustomScan side, or 
overwrite
what you proposed on FDW side, at that time.
It is not a desirable future.


I agree on the point that it's better to handle the ForeignScan and  
CustomScan cases the same way.


Best regards,
Etsuro Fujita






[HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Pavan Deolasee
Write Amplification Reduction Method (WARM)


A few years back, we developed HOT to address the problems associated with
MVCC and frequent updates, and it has served us very well. But in light of
Uber's recent technical blog highlighting some of the problems that still
remain, especially for workloads where HOT does not help to the full
extent, Simon, I, and others at 2ndQuadrant discussed several ideas, and
what I present below is an outcome of that. This is not to take away credit
from anybody else. Others may have thought about similar ideas, but I
haven’t seen any concrete proposal so far.

At present, WARM looks to be a better idea than LITE, though we have that
as a backstop also (says Simon).

Background
==

As a background, HOT primarily helps in two ways.

1. It avoids additional index entries when a tuple is updated in a way such
that no index columns are changed. Before we implemented HOT, every update
would generate a new index entry in every index on the table. This not only
resulted in bigger and more bloated indexes, but also made vacuuming
difficult, because heap tuples can't be removed without first removing the
corresponding index entries. HOT made it possible to avoid additional
index entries whenever possible and ensured that most dead heap tuples can
be removed at the page level.

2. HOT also helped other cases such as DELETE and non-HOT updates by
reducing such heap tuples to a DEAD line pointer stub, until a regular
vacuum happens. But the space consumed by the heap tuple is freed and
returned to the FSM for future utilization.

Given that HOT was going to change the very core of MVCC and how tuples are
stored and accessed, we made a decision to keep things simple, and two
important conditions were spelled out for doing a HOT update.

1. There must be enough space in the same page to store the new version of
the tuple.
2. None of the index columns must be updated.

The first condition was generally easy to satisfy, either by leaving some
free space in heap pages or because the system would anyway reach a stable
state after the initial few updates. The second condition is somewhat
harder to control. Those who are aware would carefully design their schema
and choose indexes so that the HOT conditions are met. But that may not be
possible for every workload, as seen from the Uber example, and many others
may have faced similar situations.

This proposal is to relax the second condition, so it’s not quite HOT, just
WARM.

Basic Idea
==

The basic idea is to allow updates that change an index column without
requiring a new index entry in every index on the table. We still
require the first HOT condition i.e. the heap page must have sufficient
free space to store the new version of the tuple.

Indexes whose values do not change do not require new index pointers. Only
the index whose key is being changed will need a new index entry. The new
index entry will be set to the CTID of the root line pointer.

This method succeeds in reducing the write amplification, but causes other
issues which also need to be solved. WARM breaks the invariant that all
tuples in a HOT chain have the same index values and so an IndexScan would
need to re-check the index scan conditions against the visible tuple
returned from heap_hot_search(). We must still check visibility, so we only
need to re-check scan conditions on that one tuple version.

We don’t want to force a recheck for every index fetch because that will
slow everything down. So we would like a simple and efficient way of
knowing about the existence of a WARM tuple in a chain and do a recheck in
only those cases, ideally O(1). Having a HOT chain contain a WARM tuple is
discussed below as being a “WARM chain”, implying it needs re-check.

We could solve this by either 1) marking both old and new index tuples as
"recheck-required" or 2) marking the HOT chain on the heap as
"recheck-required" such that any tuple returned from the chain requires a
recheck.

The advantage of doing 1) is that the recheck is required only for the
compromised index.
Note that in presence of multiple index keys pointing to the same root line
pointer, the chain needs re-check for both the old as well as the new index
key. The old index key must not fetch the new tuple(s) and the new index
key must not fetch the old tuple(s). So irrespective of whether an old or a
new tuple is being returned, the caller must know about the WARM tuples in
the chains. So marking the index would require us to mark both old and new
index tuples as requiring re-check. That requires an additional index scan
to locate the old row and then an additional write to force it to re-check,
which is algorithmically O(NlogN) on table size.

Marking the heap, (2) is O(1) per updated index, so seems better. But since
we don't know with respect to which index or indexes the chain is broken,
all index scans must do a recheck for a tuple returned from a WARM chain.
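As a stand-alone illustration of that recheck rule, here is a hedged sketch in C. This is not PostgreSQL code: TupleVersion, TupleChain and fetch_visible are invented names standing in for the real heap-chain machinery.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Illustrative sketch only -- not PostgreSQL code. */
typedef struct TupleVersion
{
	const char *indexed_value;	/* value of the indexed column */
	bool		visible;		/* visible to our snapshot? */
	struct TupleVersion *next;	/* next version in the HOT/WARM chain */
} TupleVersion;

typedef struct
{
	TupleVersion *root;			/* tuple at the root line pointer */
	bool		warm;			/* does the chain contain a WARM tuple? */
} TupleChain;

/*
 * Follow the chain from the root to the visible version, as
 * heap_hot_search() would, then apply the WARM recheck: on a plain HOT
 * chain the index pointer is trustworthy, but on a WARM chain the
 * visible version may not match the key of the index entry that led us
 * here, so the scan condition must be re-verified.
 */
static const TupleVersion *
fetch_visible(const TupleChain *chain, const char *scan_key)
{
	const TupleVersion *tv;

	for (tv = chain->root; tv != NULL; tv = tv->next)
	{
		if (!tv->visible)
			continue;
		if (chain->warm && strcmp(tv->indexed_value, scan_key) != 0)
			return NULL;		/* recheck failed: not a real match */
		return tv;
	}
	return NULL;				/* no visible version */
}
```

With a WARM chain whose dead version has key "foo" and visible version "bar", a fetch via the old "foo" index entry is filtered out by the recheck, while a fetch via the new "bar" entry still succeeds.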

How d

[HACKERS] Unused argument in PinBuffer function

2016-08-04 Thread Ildar Musin

Hi all,

I was looking through the buffer manager's code and noticed that the 
function PinBuffer has an unused 'strategy' argument. It seems that 
after the refactoring by Alexander Korotkov and Andres Freund 
(48354581a49c30f5757c203415aa8412d85b0f70) it became useless. The 
attached patch removes it. Perhaps someone more familiar with the code 
could edit the function description to reflect the changes?


Regards,

--
Ildar Musin
i.mu...@postgrespro.ru

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 76ade37..1eaa1cb 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -432,7 +432,7 @@ static Buffer ReadBuffer_common(SMgrRelation reln, char relpersistence,
   ForkNumber forkNum, BlockNumber blockNum,
   ReadBufferMode mode, BufferAccessStrategy strategy,
   bool *hit);
-static bool PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy);
+static bool PinBuffer(BufferDesc *buf);
 static void PinBuffer_Locked(BufferDesc *buf);
 static void UnpinBuffer(BufferDesc *buf, bool fixOwner);
 static void BufferSync(int flags);
@@ -1020,7 +1020,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		 */
 		buf = GetBufferDescriptor(buf_id);
 
-		valid = PinBuffer(buf, strategy);
+		valid = PinBuffer(buf);
 
 		/* Can release the mapping lock as soon as we've pinned it */
 		LWLockRelease(newPartitionLock);
@@ -1232,7 +1232,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 			buf = GetBufferDescriptor(buf_id);
 
-			valid = PinBuffer(buf, strategy);
+			valid = PinBuffer(buf);
 
 			/* Can release the mapping lock as soon as we've pinned it */
 			LWLockRelease(newPartitionLock);
@@ -1563,7 +1563,7 @@ ReleaseAndReadBuffer(Buffer buffer,
  * some callers to avoid an extra spinlock cycle.
  */
 static bool
-PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
+PinBuffer(BufferDesc *buf)
 {
 	Buffer		b = BufferDescriptorGetBuffer(buf);
 	bool		result;



[HACKERS] Bug in WaitForBackgroundWorkerShutdown() [REL9_5_STABLE]

2016-08-04 Thread Dmitry Ivanov
Recently I've encountered a strange crash while calling elog(ERROR, "...") 
after the WaitForBackgroundWorkerShutdown() function. It turns out that 
_returns_ inside the PG_TRY()..PG_CATCH() block are *highly* undesirable, 
since they leave PG_exception_stack pointing to a local struct in a dead 
frame, which is an obvious UB. I've attached a patch which fixes this behavior 
in the aforementioned function, but there might be more errors like that 
elsewhere.

-- 
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index 76619e9..5ecb55a 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -1025,13 +1025,14 @@ WaitForBackgroundWorkerShutdown(BackgroundWorkerHandle *handle)
 
 			status = GetBackgroundWorkerPid(handle, &pid);
 			if (status == BGWH_STOPPED)
-return status;
+break;
 
 			rc = WaitLatch(&MyProc->procLatch,
 		   WL_LATCH_SET | WL_POSTMASTER_DEATH, 0);
 
 			if (rc & WL_POSTMASTER_DEATH)
-return BGWH_POSTMASTER_DIED;
+status = BGWH_POSTMASTER_DIED;
+break;
 
 			ResetLatch(&MyProc->procLatch);
 		}



Re: [HACKERS] Bug in WaitForBackgroundWorkerShutdown() [REL9_5_STABLE]

2016-08-04 Thread Dmitry Ivanov
> Recently I've encountered a strange crash while calling elog(ERROR, "...")
> after the WaitForBackgroundWorkerShutdown() function. It turns out that
> _returns_ inside the PG_TRY()..PG_CATCH() block are *highly* undesirable,
> since they leave PG_exception_stack pointing to a local struct in a dead
> frame, which is an obvious UB. I've attached a patch which fixes this
> behavior in the aforementioned function, but there might be more errors
> like that elsewhere.

Forgot some curly braces, my bad. v2 attached.

-- 
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index 76619e9..fd55262 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -1025,13 +1025,16 @@ WaitForBackgroundWorkerShutdown(BackgroundWorkerHandle *handle)
 
 			status = GetBackgroundWorkerPid(handle, &pid);
 			if (status == BGWH_STOPPED)
-return status;
+break;
 
 			rc = WaitLatch(&MyProc->procLatch,
 		   WL_LATCH_SET | WL_POSTMASTER_DEATH, 0);
 
 			if (rc & WL_POSTMASTER_DEATH)
-return BGWH_POSTMASTER_DIED;
+			{
+status = BGWH_POSTMASTER_DIED;
+break;
+			}
 
 			ResetLatch(&MyProc->procLatch);
 		}



Re: [HACKERS] Bug in WaitForBackgroundWorkerShutdown() [REL9_5_STABLE]

2016-08-04 Thread Yury Zhuravlev

Dmitry Ivanov wrote:

Recently I've encountered a strange crash while calling elog(ERROR, "...")
after the WaitForBackgroundWorkerShutdown() function. It turns out that
_returns_ inside the PG_TRY()..PG_CATCH() block are *highly* undesirable,
since they leave PG_exception_stack pointing to a local struct in a dead
frame, which is an obvious UB. I've attached a patch which fixes this
behavior in the aforementioned function, but there might be more errors
like that elsewhere.


Forgot some curly braces, my bad. v2 attached.



Good catch. But in 9.6 it has been fixed by commit 
db0f6cad4884bd4c835156d3a720d9a79dbd63a9. 



--
Yury Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] [Patch] Temporary tables that do not bloat pg_catalog (a.k.a fast temp tables)

2016-08-04 Thread Aleksander Alekseev
Thanks everyone for your remarks and comments!

Here is an improved version of the patch.

Main changes:
* Patch passes `make installcheck`
* Code is fully commented, also no more TODO's

I wish I had sent this version of the patch last time. Now I realize the
previous one was really hard to read and understand. I hope I have
managed to correct this flaw. If you believe that some parts of the code
are still poorly commented or could be improved in any other way, please
let me know.

And as usual, any other comments, remarks or questions are highly
appreciated!

-- 
Best regards,
Aleksander Alekseev
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 8f5332a..ac272e1 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -1693,7 +1693,7 @@
   
   
p = permanent table, u = unlogged table,
-   t = temporary table
+   t = temporary table, f = fast temporary table
   
  
 
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index 1fa6de0..56de4dc 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/access/common
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = heaptuple.o indextuple.o printtup.o reloptions.o scankey.o \
+OBJS = fasttab.o heaptuple.o indextuple.o printtup.o reloptions.o scankey.o \
 	tupconvert.o tupdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/common/fasttab.c b/src/backend/access/common/fasttab.c
new file mode 100644
index 000..610790d
--- /dev/null
+++ b/src/backend/access/common/fasttab.c
@@ -0,0 +1,1884 @@
+/*-
+ *
+ * fasttab.c
+ *	  virtual catalog and fast temporary tables
+ *
+ * This file contains the implementation of a special type of temporary table:
+ * fast temporary tables (FTT). From a user's perspective they work exactly as
+ * regular temporary tables. However, there are no records of FTTs in
+ * pg_catalog. These records are stored in the backend's memory instead and mixed
+ * with regular records during scans of catalog tables. We refer to the
+ * corresponding tuples of catalog tables as "in-memory" or "virtual" tuples
+ * and to all these tuples together --- as the "in-memory" or "virtual" catalog.
+ *
+ * Note that since temporary tables are visible only in one session, there is
+ * no need to use shared memory or locks for FTTs. Transaction support is
+ * very simple too: there is no need to track xmin/xmax, etc.
+ *
+ * FTTs are designed to solve the pg_catalog bloat problem. There are
+ * applications that create and delete a lot of temporary tables. This causes
+ * bloat of pg_catalog and frequent autovacuum runs on it, which is quite an
+ * expensive operation and affects whole-database performance.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/common/fasttab.c
+ *
+ *-
+ */
+
+#include "c.h"
+#include "postgres.h"
+#include "pgstat.h"
+#include "miscadmin.h"
+#include "access/amapi.h"
+#include "access/fasttab.h"
+#include "access/relscan.h"
+#include "access/valid.h"
+#include "access/sysattr.h"
+#include "access/htup_details.h"
+#include "catalog/pg_class.h"
+#include "catalog/pg_type.h"
+#include "catalog/pg_depend.h"
+#include "catalog/pg_inherits.h"
+#include "catalog/pg_statistic.h"
+#include "storage/bufmgr.h"
+#include "utils/rel.h"
+#include "utils/inval.h"
+#include "utils/memutils.h"
+
+/*
+		  TYPEDEFS, MACRO DECLARATIONS AND CONST STATIC VARIABLES
+ */
+
+/* #define FASTTAB_DEBUG 1 */
+
+#ifdef FASTTAB_DEBUG
+static int32 fasttab_scan_tuples_counter = -1;
+#endif
+
+/* List of in-memory tuples. */
+typedef struct
+{
+	dlist_node	node;
+	HeapTuple	tup;
+}	DListHeapTupleData;
+
+typedef DListHeapTupleData *DListHeapTuple;
+
+/* Like strcmp but for integer types --- int, uint32, Oid, etc. */
+#define FasttabCompareInts(x, y) ( (x) == (y) ? 0 : ( (x) > (y) ? 1 : -1 ))
+
+/* Forward declaration is required for relation_is_inmem_tuple_function */
+struct FasttabSnapshotData;
+typedef struct FasttabSnapshotData *FasttabSnapshot;
+
+/* Predicate that determines whether given tuple should be stored in-memory */
+typedef bool (*relation_is_inmem_tuple_function)
+			(Relation relation, HeapTuple tup, FasttabSnapshot fasttab_snapshot,
+		 int tableIdx);
+
+/* Capacity of FasttabRelationMethods->attrNumbers, see below */
+#define FasttabRelationMaxOidAttributes 2
+
+/* FasttabRelationMethodsTable entry */
+typedef const struct
+{
+	/* relation oid */
+	Oid			relationId;
+	/* predicate that determines whether tuple s

Re: [HACKERS] [Patch] Temporary tables that do not bloat pg_catalog (a.k.a fast temp tables)

2016-08-04 Thread Geoff Winkless
On 30 July 2016 at 13:42, Tomas Vondra  wrote:
> I'd argue that if you mess with catalogs directly, you're on your own.

Interesting. What would you suggest people use instead?

Geoff




Re: [HACKERS] max_parallel_degree > 0 for 9.6 beta

2016-08-04 Thread Robert Haas
On Thu, Aug 4, 2016 at 12:56 AM, Tom Lane  wrote:
> Noah Misch  writes:
>> On Wed, Apr 20, 2016 at 02:28:15PM -0400, Tom Lane wrote:
>>> +1, but let's put an entry on the 9.6 open-items page to remind us to
>>> make that decision at the right time.
>
>> It's that time.  Do we restore the max_parallel_workers_per_gather=0 default,
>> or is enabling this by default the right thing after all?
>
> At this point I'd have to vote against enabling by default in 9.6.  The
> fact that in the past week we've found bugs as bad as e1a93dd6a does not
> give me a warm fuzzy feeling about the parallel-query code being ready
> for prime time.
>
> Of course the question is how do we ever get to that point if we chicken
> out with enabling it by default now.  Maybe we could keep it turned on
> in HEAD.

What do other people think about this topic?

Personally, I've found the process of fixing the parallel query bugs
rather exhausting, and if we leave it turned on by default in 9.6,
we'll probably get a lot more of those bug reports a lot sooner than
we will otherwise.  But I'd kind of rather get it out of the way than
put it off: those bugs need to be fixed at some point, and we won't
find them if nobody runs the code.  However, we have a long track
record of being cautious about things like this, so I would be OK with
the idea of disabling this in the 9.6 branch and leaving it turned on
in master, with the hope of shipping it enabled-by-default in v10.  I
think it would not be good to leave it disabled by default in master
-- the code won't get much testing then, we won't find the bugs, and
we'll build more stuff on top of what we've already got without
finding the cracks in the foundation.  I think users will be more
inclined to forgive us for parallel query bugs in 2016 than in 2026;
better to debug it before the honeymoon wears off.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON

2016-08-04 Thread Amit Kapila
On Wed, Aug 3, 2016 at 5:09 PM, Pavel Stehule  wrote:
>
>
> 2016-08-03 12:16 GMT+02:00 Rahila Syed :
>>
>> Should changing the value from OFF to ON automatically either commit or
>> rollback transaction in progress?
>>
>>
>> FWIW, running  set autocommit through ecpg commits the ongoing transaction
>> when autocommit is set to ON from OFF. Should such behaviour be implemented
>> for \set AUTOCOMMIT ON as well?
>
>
> I dislike automatic commit or rollback here.
>

What problem do you see with doing so, perhaps also mentioning it in the
docs?  Anyway, I think we should make the behaviour of ecpg and psql the
same.

> What about raising warning if
> some transaction is open?
>

Not sure what benefit we would get by raising a warning.  I think it is
better to choose one behaviour (automatic commit, or leaving the
transaction open as psql currently does) and make it consistent across
all clients.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] handling unconvertible error messages

2016-08-04 Thread Tom Lane
Victor Wagner  writes:
> If error occurs during processing of StartMessage protocol message,
> i.e. client request connection to unexisting database,
> ErrorResponse would contain message in the server default locale,
> despite of client encoding being specified in the StartMessage.

Yeah.  I'm inclined to think that we should reset the message locale
to C as soon as we've forked away from the postmaster, and leave it
that way until we've absorbed settings from the startup packet.
Sending messages of this sort in English isn't great, but it's better
than sending completely-unreadable ones.  Or is that just my
English-centricity showing?

regards, tom lane
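The sequence Tom suggests could be sketched roughly as below. This is an assumption-laden illustration, not actual backend code: the function names are invented, and it relies on the POSIX LC_MESSAGES category (which does not exist on Windows).

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Hedged sketch: pin messages to the C locale right after forking from
 * the postmaster, then switch to the client's locale once the startup
 * packet has been absorbed.  Invented names, not real backend code. */
static void
reset_message_locale_to_c(void)
{
	setlocale(LC_MESSAGES, "C");
}

static const char *
absorb_startup_locale(const char *requested)
{
	/* setlocale() returns NULL if the requested locale is unavailable;
	 * in that case we simply stay with the current ("C") setting. */
	const char *result = setlocale(LC_MESSAGES, requested);

	return result ? result : setlocale(LC_MESSAGES, NULL);
}
```

Any error raised between the two calls would then come out in English rather than in an encoding the client cannot read.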




Re: [HACKERS] New version numbering practices

2016-08-04 Thread Tom Lane
Craig Ringer  writes:
> On 4 August 2016 at 12:45, Tom Lane  wrote:
>> Craig Ringer  writes:
>>> Well, this seems like a good time to make server_version_num GUC_REPORT
>>> as well...

>> To what end?  Existing versions of libpq wouldn't know about it, and new
>> versions of libpq couldn't rely on it to get reported by older servers,
>> so it'd still be the path of least resistance to examine server_version.

> Because it's really silly that we don't,

Sorry, but I don't buy that.  I think sending both server_version and
server_version_num would be silly, and we're certainly not going to stop
sending server_version.

regards, tom lane




Re: [HACKERS] [RFC] Change the default of update_process_title to off

2016-08-04 Thread Tom Lane
Andres Freund  writes:
> On 2016-08-04 16:48:11 +0900, Michael Paquier wrote:
>> Here is a different proposal: documenting instead that disabling that
>> parameter on Windows can improve performance, at the cost of losing
>> information verbosity for processes.

> The benefit on windows seems pretty marginal, given the way it has to be
> viewed. People installing processexplorer et. al. to view a handle that
> have to know about, will be able to change the config.

Yeah, I think I agree.  It would be bad to disable it by default on
Unix, because ps(1) is a very standard tool there, but the same argument
doesn't hold for Windows.

Another route to a solution would be to find a cheaper way to update
the process title on Windows ... has anyone looked for alternatives?

regards, tom lane




Re: [HACKERS] New version numbering practices

2016-08-04 Thread Vladimir Sitnikov
>
> Sorry, but I don't buy that.  I think sending both server_version and
> server_version_num would be silly, and we're certainly not going to stop
> sending server_version.
>

What is wrong with sending a machine-readable value?

Vladimir


[HACKERS] "Some tests to cover hash_index"

2016-08-04 Thread Mithun Cy
I am attaching a patch to improve coverage of the hash index code [1].
I have added some basic tests, which mainly cover overflow pages. They add
about 2 seconds to the parallel schedule on my machine.




                                        Hit   Total   Coverage
old tests          Line Coverage        780    1478     52.7 %
                   Function Coverage     63      85     74.1 %
after new tests    Line Coverage       1181    1478     79.9 %
                   Function Coverage     78      85     91.8 %


[1] Concurrent Hash Index. 

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


commit-hash_coverage_test
Description: Binary data



Re: [HACKERS] New version numbering practices

2016-08-04 Thread Tom Lane
Vladimir Sitnikov  writes:
>> Sorry, but I don't buy that.  I think sending both server_version and
>> server_version_num would be silly, and we're certainly not going to stop
>> sending server_version.

> What is wrong with sending machine-readable value?

[ shrug... ]  What do you claim is not machine-readable about
server_version?

regards, tom lane




Re: [HACKERS] max_parallel_degree > 0 for 9.6 beta

2016-08-04 Thread David G. Johnston
On Thu, Aug 4, 2016 at 9:25 AM, Robert Haas  wrote:

> On Thu, Aug 4, 2016 at 12:56 AM, Tom Lane  wrote:
> > Noah Misch  writes:
> >> On Wed, Apr 20, 2016 at 02:28:15PM -0400, Tom Lane wrote:
> >>> +1, but let's put an entry on the 9.6 open-items page to remind us to
> >>> make that decision at the right time.
> >
> >> It's that time.  Do we restore the max_parallel_workers_per_gather=0
> default,
> >> or is enabling this by default the right thing after all?
> >
> > At this point I'd have to vote against enabling by default in 9.6.  The
> > fact that in the past week we've found bugs as bad as e1a93dd6a does not
> > give me a warm fuzzy feeling about the parallel-query code being ready
> > for prime time.
> >
> > Of course the question is how do we ever get to that point if we chicken
> > out with enabling it by default now.  Maybe we could keep it turned on
> > in HEAD.
>
> However, we have a long track
> record of being cautious about things like this, so I would be OK with
> the idea of disabling this in the 9.6 branch and leaving it turned on
> in master, with the hope of shipping it enabled-by-default in v10.


​My initial reaction was +1 but now I'm leaning toward enabled by default.

Those who would upgrade to 9.6 within a year of its release are most
likely, process- and personality-wise, to be those for whom the risks and
rewards of new features are part of their everyday routine.

If indeed we release 10.0 with it enabled by default we should be confident
in its stability at that point in the 9.6.x branch.

David J.​


Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON

2016-08-04 Thread Pavel Stehule
2016-08-04 15:37 GMT+02:00 Amit Kapila :

> On Wed, Aug 3, 2016 at 5:09 PM, Pavel Stehule 
> wrote:
> >
> >
> > 2016-08-03 12:16 GMT+02:00 Rahila Syed :
> >>
> >> Should changing the value from OFF to ON automatically either commit or
> >> rollback transaction in progress?
> >>
> >>
> >> FWIW, running  set autocommit through ecpg commits the ongoing
> transaction
> >> when autocommit is set to ON from OFF. Should such behaviour be
> implemented
> >> for \set AUTOCOMMIT ON as well?
> >
> >
> > I dislike automatic commit or rollback here.
> >
>
> What problem do you see with doing so, perhaps also mentioning it in
> the docs?  Anyway, I think we should make the behaviour of ecpg and
> psql the same.
>

An implicit COMMIT can be dangerous - an implicit ROLLBACK can be an unfriendly surprise.


> > What about raising warning if
> > some transaction is open?
> >
>
> Not sure what benefit we would get by raising a warning.  I think it is
> better to choose one behaviour (automatic commit, or leaving the
> transaction open as psql currently does) and make it consistent across
> all clients.
>

I am not sure about the value of the ecpg precedent for this case. It is
used by 0.0001% of users; probably nobody in the Czech Republic knows this
client.

A warning forces the user to make a decision - I don't think we can make
this decision well for them.

Regards

Pavel



>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>


Re: [HACKERS] Hash Indexes

2016-08-04 Thread Mithun Cy
I did some basic testing of the same, and found one issue with cursors.

+BEGIN;
+SET enable_seqscan = OFF;
+SET enable_bitmapscan = OFF;
+CREATE FUNCTION declares_cursor(int)
+ RETURNS void
+ AS 'DECLARE c CURSOR FOR SELECT * from con_hash_index_table WHERE keycol = $1;'
+LANGUAGE SQL;
+
+SELECT declares_cursor(1);
+MOVE FORWARD ALL FROM c;
+MOVE BACKWARD 1 FROM c;
+CLOSE c;
WARNING: buffer refcount leak: [5835] (rel=base/16384/30537, blockNum=327, flags=0x9380, refcount=1 1)
+ROLLBACK;

Closing the cursor produces a warning saying that we forgot to unpin the
buffer.

I have also added tests [1] for coverage improvements.

[1] Some tests to cover hash_index.



On Thu, Jul 14, 2016 at 4:33 PM, Amit Kapila 
wrote:

> On Wed, Jun 22, 2016 at 8:48 PM, Robert Haas 
> wrote:
> > On Wed, Jun 22, 2016 at 5:14 AM, Amit Kapila 
> wrote:
> >> We can do it in the way as you are suggesting, but there is another
> thing
> >> which we need to consider here.  As of now, the patch tries to finish
> the
> >> split if it finds split-in-progress flag in either old or new bucket.
> We
> >> need to lock both old and new buckets to finish the split, so it is
> quite
> >> possible that two different backends try to lock them in opposite order
> >> leading to a deadlock.  I think the correct way to handle is to always
> try
> >> to lock the old bucket first and then new bucket.  To achieve that, if
> the
> >> insertion on new bucket finds that split-in-progress flag is set on a
> >> bucket, it needs to release the lock and then acquire the lock first on
> old
> >> bucket, ensure pincount is 1 and then lock new bucket again and ensure
> that
> >> pincount is 1. I have already maintained the order of locks in scan (old
> >> bucket first and then new bucket; refer changes in _hash_first()).
> >> Alternatively, we can try to  finish the splits only when someone tries
> to
> >> insert in old bucket.
> >
> > Yes, I think locking buckets in increasing order is a good solution.
> > I also think it's fine to only try to finish the split when the insert
> > targets the old bucket.  Finishing the split enables us to remove
> > tuples from the old bucket, which lets us reuse space instead of
> > accelerating more.  So there is at least some potential benefit to the
> > backend inserting into the old bucket.  On the other hand, a process
> > inserting into the new bucket derives no direct benefit from finishing
> > the split.
> >
>
> Okay, following this suggestion, I have updated the patch so that only
> insertion into old-bucket can try to finish the splits.  Apart from
> that, I have fixed the issue reported by Mithun upthread.  I have
> updated the README to explain the locking used in patch.   Also, I
> have changed the locking around vacuum, so that it can work with
> concurrent scans when ever possible.  In previous patch version,
> vacuum used to take cleanup lock on a bucket to remove the dead
> tuples, moved-due-to-split tuples and squeeze operation, also it holds
> the lock on bucket till end of cleanup.  Now, also it takes cleanup
> lock on a bucket to out-wait scans, but it releases the lock as it
> proceeds to clean the overflow pages.  The idea is first we need to
> lock the next bucket page and  then release the lock on current bucket
> page.  This ensures that any concurrent scan started after we start
> cleaning the bucket will always be behind the cleanup.  Allowing scans
> to cross vacuum will allow it to remove tuples required for sanctity
> of scan.  Also for squeeze-phase we are just checking if the pincount
> of buffer is one (we already have Exclusive lock on buffer of bucket
> by that time), then only proceed, else will try to squeeze next time
> the cleanup is required for that bucket.
>
> Thoughts/Suggestions?
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>
>


-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com
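The deadlock-avoidance rule discussed in this thread (always lock the old bucket first, then the new bucket) can be sketched as below. This is a toy illustration with invented names, not PostgreSQL code: the "locks" are plain flags, since the point here is the global ordering discipline, not the lock implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock: just a flag plus an identifier. */
typedef struct
{
	bool		held;
	int			bucket_no;
} BucketLock;

/* Record the order in which locks were taken, so it can be inspected. */
static int	acquire_sequence[8];
static int	acquire_count = 0;

static void
lock_bucket(BucketLock *b)
{
	assert(!b->held);			/* toy lock: no real blocking */
	b->held = true;
	acquire_sequence[acquire_count++] = b->bucket_no;
}

static void
lock_for_split_finish(BucketLock *old_bucket, BucketLock *new_bucket)
{
	/* Canonical order: old bucket first, then new bucket.  If every
	 * backend follows the same order, two backends can never each hold
	 * one of the pair while waiting for the other, so the classic
	 * two-lock deadlock cannot occur. */
	lock_bucket(old_bucket);
	lock_bucket(new_bucket);
}
```

This is why an inserter that finds the split-in-progress flag on the new bucket must release its lock and re-acquire both locks starting from the old bucket.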


Re: [HACKERS] New version numbering practices

2016-08-04 Thread Vladimir Sitnikov
Tom Lane :

> [ shrug... ]  What do you claim is not machine-readable about
> server_version?
>

0) server_version needs a dance to parse.
For instance, the recent "Stamp version 10devel" patch had to touch the
"server_version" parsing in fe-exec.c:
https://github.com/vlsi/postgres/pull/2/files#diff-2df5cad06efe4485ad362b0eb765cec0L986
Of course it may be that this was just a bug in fe-exec.c, but it is a
smell: lots of clients might need to update their server_version parsing
logic for no reason other than to support "10devel"-style versions.
There are cases when the dance is locale-specific:
https://github.com/pgjdbc/pgjdbc/pull/190

1) By saying "just parse server_version" you are basically forcing every
PostgreSQL client (except libpq) to implement, test, and maintain its own
parsing logic. Do we care about having robust clients that will not break
when tested against 10.whatever?

2) Several examples were provided by Craig:
https://www.postgresql.org/message-id/CAMsr%2BYFt1NcjseExt_Ov%2Bfrk0Jzb0-DKqYKt8ALzVEXHBM0jKg%40mail.gmail.com
Craig> This means that at least when connecting to newer servers clients no
longer have to do any stupid dances around parsing "9.5beta1",
"9.4.0mycustompatchedPg"

3) The official documentation suggests "see also server_version_num for a
machine-readable version":
https://www.postgresql.org/docs/9.5/static/functions-info.html

Even though "one can simply try to parse the server_version value", I claim
that server_version_num is far more machine-readable than server_version.

Vladimir


Re: [HACKERS] handling unconvertible error messages

2016-08-04 Thread Victor Wagner
On Thu, 04 Aug 2016 09:42:10 -0400
Tom Lane  wrote:

> Victor Wagner  writes:
> > If error occurs during processing of StartMessage protocol message,
> > i.e. client request connection to unexisting database,
> > ErrorResponse would contain message in the server default locale,
> > despite of client encoding being specified in the StartMessage.  
> 
> Yeah.  I'm inclined to think that we should reset the message locale
> to C as soon as we've forked away from the postmaster, and leave it
> that way until we've absorbed settings from the startup packet.
> Sending messages of this sort in English isn't great, but it's better
> than sending completely-unreadable ones.  Or is that just my
> English-centricity showing?

From my Russian point of view, English messages are definitely better
than a transliteration of Russian into Latin letters (although that is
not completely unreadable), not to mention a wrong encoding or strings of
question marks.

Really, if this response is sent after the backend has been forked, the
problem can probably be fixed in a better way: the StartupMessage contains
the desired client encoding, so that field should simply be processed
earlier than any other information in the message that can cause errors
(such as the database name).

If these errors are sent from the postmaster itself, things are worse,
because I don't think the locale subsystem is designed to be
reinitialized many times in the same process.
But the postmaster itself can use non-localized messages. Its messages in
the logs are typically analyzed by more or less qualified DBAs and
system administrators, not by end users.





-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] regression test for extended query protocol

2016-08-04 Thread Alvaro Herrera
Michael Paquier wrote:
> On Thu, Aug 4, 2016 at 12:14 AM, Dave Cramer  wrote:
> > We currently run tests every time a PR is created on github, but I don't
> > think there are any animals running the JDBC test suite
> >
> > We can add tests, what exactly do we want to test. Then setting up an animal
> > to run the tests would be fairly straight forward.
> 
> It may be interesting to implement that as a module in the buildfarm
> client, I think, one that any animal could include at will. Just a thought.

Implementing things as buildfarm modules gets tedious, though -- consider
for instance that src/test/recovery is only being run by hamster, and
only because you patched the buildfarm client; only Andrew can get
buildfarm animals to run that test, by integrating the module into the
client script, and even that is limited by having to ask animal
caretakers to upgrade that script.

If somebody had some spare time to devote to this, I would suggest
implementing something in core that can be used to specify a list of
commands to run, and a list of files-to-be-saved-in-bf-log emitted by
each command.  We could have one such base file in the core repo that
specifies some tests to run (such as "make -C src/test/recovery check"),
and an additional file could be supplied by buildfarm animals to run custom
tests, without having to write BF modules for each thing.  With that,
pgsql committers could simply add a new test to be run by all buildfarm
animals by adding it to the list in core.
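A purely hypothetical sketch of such a base file (the filename and syntax
here are invented, just to illustrate the shape of the idea):

```
# src/test/bf_tests (hypothetical): one command per "run" line, with the
# files to capture into the buildfarm log after each command completes.
run:  make -C src/test/recovery check
save: src/test/recovery/tmp_check/log/*

run:  make -C src/test/ssl check
save: src/test/ssl/tmp_check/log/*
```

An animal-local file with the same syntax could then append custom tests
without any change to the buildfarm client script.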

-- 
Álvaro Herrera    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[HACKERS] improved DefElem list processing

2016-08-04 Thread Peter Eisentraut
Here are two WIP patches to improve the DefElem list processing that is
used by many utility commands.

One factors out the duplicate checks, which are currently taking up a
lot of space with duplicate code.  I haven't applied this everywhere
yet, but the patch shows how much boring code can be saved.

The other adds a location field to the DefElem node.  This allows
showing an error pointer if there is a problem in one of the options,
which can be useful for complex commands such as COPY or CREATE USER.

At the moment, these patches are independent and don't work together,
but the second one would become much easier if the first one is accepted.

If these ideas are acceptable, I'll produce a complete patch series.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 89af91f91944ba70c538d249f042eebf14fdd2cd Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 4 Aug 2016 11:29:27 -0400
Subject: [PATCH] Add location field to DefElem

---
 contrib/file_fdw/file_fdw.c |  5 ++-
 src/backend/commands/copy.c | 93 +
 src/backend/commands/sequence.c | 36 +---
 src/backend/commands/user.c | 41 +++---
 src/backend/nodes/copyfuncs.c   |  1 +
 src/backend/nodes/equalfuncs.c  |  2 +
 src/backend/nodes/makefuncs.c   |  1 +
 src/backend/nodes/outfuncs.c|  1 +
 src/backend/nodes/readfuncs.c   |  1 +
 src/backend/parser/gram.y   | 44 +++
 src/backend/tcop/utility.c  | 36 +---
 src/include/commands/copy.h |  7 ++--
 src/include/commands/sequence.h |  5 ++-
 src/include/commands/user.h |  3 +-
 src/include/nodes/parsenodes.h  |  1 +
 15 files changed, 190 insertions(+), 87 deletions(-)

diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
index c049131..85d9913 100644
--- a/contrib/file_fdw/file_fdw.c
+++ b/contrib/file_fdw/file_fdw.c
@@ -293,7 +293,7 @@ file_fdw_validator(PG_FUNCTION_ARGS)
 	/*
 	 * Now apply the core COPY code's validation logic for more checks.
 	 */
-	ProcessCopyOptions(NULL, true, other_options);
+	ProcessCopyOptions(NULL, NULL, true, other_options);
 
 	/*
 	 * Filename option is required for file_fdw foreign tables.
@@ -632,7 +632,8 @@ fileBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Create CopyState from FDW options.  We always acquire all columns, so
 	 * as to match the expected ScanTupleSlot signature.
 	 */
-	cstate = BeginCopyFrom(node->ss.ss_currentRelation,
+	cstate = BeginCopyFrom(NULL,
+		   node->ss.ss_currentRelation,
 		   filename,
 		   false,
 		   NIL,
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index f45b330..1f85cda 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -279,12 +279,12 @@ static const char BinarySignature[11] = "PGCOPY\n\377\r\n\0";
 
 
 /* non-export function prototypes */
-static CopyState BeginCopy(bool is_from, Relation rel, Node *raw_query,
-		  const char *queryString, const Oid queryRelId, List *attnamelist,
+static CopyState BeginCopy(ParseState *pstate, bool is_from, Relation rel, Node *raw_query,
+		   const Oid queryRelId, List *attnamelist,
 		  List *options);
 static void EndCopy(CopyState cstate);
 static void ClosePipeToProgram(CopyState cstate);
-static CopyState BeginCopyTo(Relation rel, Node *query, const char *queryString,
+static CopyState BeginCopyTo(ParseState *pstate, Relation rel, Node *query,
 			const Oid queryRelId, const char *filename, bool is_program,
 			List *attnamelist, List *options);
 static void EndCopyTo(CopyState cstate);
@@ -787,7 +787,7 @@ CopyLoadRawBuf(CopyState cstate)
  * the table or the specifically requested columns.
  */
 Oid
-DoCopy(const CopyStmt *stmt, const char *queryString, uint64 *processed)
+DoCopy(ParseState *pstate, const CopyStmt *stmt, uint64 *processed)
 {
 	CopyState	cstate;
 	bool		is_from = stmt->is_from;
@@ -936,7 +936,7 @@ DoCopy(const CopyStmt *stmt, const char *queryString, uint64 *processed)
 			PreventCommandIfReadOnly("COPY FROM");
 		PreventCommandIfParallelMode("COPY FROM");
 
-		cstate = BeginCopyFrom(rel, stmt->filename, stmt->is_program,
+		cstate = BeginCopyFrom(pstate, rel, stmt->filename, stmt->is_program,
 			   stmt->attlist, stmt->options);
 		cstate->range_table = range_table;
 		*processed = CopyFrom(cstate);	/* copy from file to database */
@@ -944,7 +944,7 @@ DoCopy(const CopyStmt *stmt, const char *queryString, uint64 *processed)
 	}
 	else
 	{
-		cstate = BeginCopyTo(rel, query, queryString, relid,
+		cstate = BeginCopyTo(pstate, rel, query, relid,
 			 stmt->filename, stmt->is_program,
 			 stmt->attlist, stmt->options);
 		*processed = DoCopyTo(cstate);	/* copy from database to file */
@@ -980,7 +980,8 @@ DoCopy(const CopyStmt *stmt, const char *queryString, uint64 *processed)
  * self-consistency of the options list.
  */
 void
-ProcessCopyOptions(CopyS

Re: [HACKERS] Design for In-Core Logical Replication

2016-08-04 Thread Simon Riggs
On 29 July 2016 at 16:53, Robert Haas  wrote:

> I think that to really understand exactly what you and Petr have in
> mind, we'd need a description of where publication and subscription
> data is stored within the server, and exactly what gets stored.
> Perhaps that will come in a later email.  I'm not bashing the design,
> exactly, I just can't quite see how all of the pieces fit together
> yet.

Sure no problem.

It's clear there are about 12 ways of putting this together and each
way has various terms needed.

We need input from many people to get this right, since this is much
more about UI than system architecture.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Andres Freund
Hi,

On 2016-08-04 16:29:09 +0530, Pavan Deolasee wrote:
> Indexes whose values do not change do not require new index pointers. Only
> the index whose key is being changed will need a new index entry. The new
> index entry will be set to the CTID of the root line pointer.

That seems to require tracing all hot-chains in a page, to actually
figure out what the root line pointer of a warm-updated HOT tuple is,
provided it's HOT_UPDATED itself.  Or have you found a smart way to
figure that out?

Greetings,

Andres Freund




Re: [HACKERS] Why we lost Uber as a user

2016-08-04 Thread Alfred Perlstein



On 8/4/16 2:00 AM, Torsten Zuehlsdorff wrote:



On 03.08.2016 21:05, Robert Haas wrote:

On Wed, Aug 3, 2016 at 2:23 PM, Tom Lane  wrote:

Robert Haas  writes:

I don't think they are saying that logical replication is more
reliable than physical replication, nor do I believe that to be true.
I think they are saying that if logical corruption happens, you can
fix it by typing SQL statements to UPDATE, INSERT, or DELETE the
affected rows, whereas if physical corruption happens, there's no
equally clear path to recovery.


Well, that's not an entirely unreasonable point, but I dispute the
implication that it makes recovery from corruption an easy thing to do.
How are you going to know what SQL statements to issue?  If the master
database is changing 24x7, how are you going to keep up with that?


I think in many cases people fix their data using business logic.  For
example, suppose your database goes down and you have to run
pg_resetxlog to get it back up.  You dump-and-restore, as one does,
and find that you can't rebuild one of your unique indexes because
there are now two records with that same PK.  Well, what you do is you
look at them and judge which one has the correct data, often the one
that looks more complete or the one with the newer timestamp. Or,
maybe you need to merge them somehow.  In my experience helping users
through problems of this type, once you explain the problem to the
user and tell them they have to square it on their end, the support
call ends.  The user may not always be entirely thrilled about having
to, say, validate a problematic record against external sources of
truth, but they usually know how to do it.  Database bugs aren't the
only way that databases become inaccurate.  If the database that they
use to keep track of land ownership in the jurisdiction where I live
says that two different people own the same piece of property,
somewhere there is a paper deed in a filing cabinet.  Fishing that out
to understand what happened may not be fun, but a DBA can explain that
problem to other people in the organization and those people can get
it fixed.  It's a problem, but it's fixable.

On the other hand, if a heap tuple contains invalid infomask bits that
cause an error every time you read the page (this actually happened to
an EnterpriseDB customer!), the DBA can't tell other people how to fix
it and can't fix it personally either.  Instead, the DBA calls me.


After reading this statement the ZFS filesystem pops into my mind. It
has protection built in against various problems (data degradation,
current spikes, phantom writes, etc.).


For me this raises two questions:

1) would the usage of ZFS prevent such errors?

My feeling would say yes, but I have no idea how an invalid
infomask bit could occur.


2) would it be possible to add such prevention to PostgreSQL

I know this could add massive overhead, but if it's optional this
could be a fine thing?
Postgresql is very "zfs-like" in its internals.  The problem was a bug 
in postgresql that caused it to just write data to the wrong place.


Some vendors use ZFS under databases to provide very cool services such 
as backup snapshots, test snapshots and other such uses.  I think Joyent 
is one such vendor but I'm not 100% sure.


-Alfred




Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Bruce Momjian
On Thu, Aug  4, 2016 at 09:31:18AM -0700, Andres Freund wrote:
> Hi,
> 
> On 2016-08-04 16:29:09 +0530, Pavan Deolasee wrote:
> > Indexes whose values do not change do not require new index pointers. Only
> > the index whose key is being changed will need a new index entry. The new
> > index entry will be set to the CTID of the root line pointer.
> 
> That seems to require tracing all hot-chains in a page, to actually
> figure out what the root line pointer of a warm-updated HOT tuple is,
> provided it's HOT_UPDATED itself.  Or have you found a smart way to
> figure that out?

The index points to the head of the HOT chain, and you just walk the
chain on the page --- no need to look at other chains on the page.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +




Re: [HACKERS] Lossy Index Tuple Enhancement (LITE)

2016-08-04 Thread Simon Riggs
On 3 August 2016 at 20:37, Claudio Freire  wrote:
> On Wed, Aug 3, 2016 at 4:20 AM, Simon Riggs  wrote:
>> == IndexScan ==
>>
>> Note that the executor code for IndexScan appears identical between
>> the two optimizations. The difference between duplicate and range LITE
>> tuples is needed only at INSERT time (or UPDATE indexed column to a
>> new value).
>>
>> When we do an IndexScan if we see a LITE tuple we do a scan of the
>> linepointer ranges covered by this index tuple (which might eventually
>> go to a full block scan). For example with bit1 set we would scan 16
>> linepointers (on an 8192 byte block) and with 2 bits set we would scan
>> 32 linepointers etc.., not necessarily consecutive ranges. So the
>> IndexScan can return potentially many heap rows per index entry, which
>> need to be re-checked and may also need to be sorted prior to
>> returning them. If no rows are returned we can kill the index pointer,
>> just as we do now for btrees, so they will be removed eventually from
>> the index without the need for VACUUM. An BitmapIndexScan that sees a
>> lossy pointer can also use the lossy TID concept, so we can have
>> partially lossy bitmaps.
>
> Wouldn't this risk scanning rows more than once unless it's part of a
> bitmap scan?

It's always the job of the index insert logic to ensure a valid index exists.

I'm not sure what you see that changes that here. Say more?

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Andres Freund
On 2016-08-04 12:55:25 -0400, Bruce Momjian wrote:
> On Thu, Aug  4, 2016 at 09:31:18AM -0700, Andres Freund wrote:
> > Hi,
> > 
> > On 2016-08-04 16:29:09 +0530, Pavan Deolasee wrote:
> > > Indexes whose values do not change do not require new index pointers. Only
> > > the index whose key is being changed will need a new index entry. The new
> > > index entry will be set to the CTID of the root line pointer.
> > 
> > That seems to require tracing all hot-chains in a page, to actually
> > figure out what the root line pointer of a warm-updated HOT tuple is,
> > provided it's HOT_UPDATED itself.  Or have you found a smart way to
> > figure that out?
> 
> The index points to the head of the HOT chain, and you just walk the
> > chain on the page --- no need to look at other chains on the page.

When doing an update, you'll need to re-find the root tuple, to insert
the root ctid for the new index tuples.




Re: [HACKERS] Lossy Index Tuple Enhancement (LITE)

2016-08-04 Thread Simon Riggs
On 4 August 2016 at 02:13, Bruce Momjian  wrote:

> How do you plan to clear the bitmask so that, over time, it doesn't end
> up being all-set?

I don't have a plan, though thoughts welcome.

It's a similar situation to the way our current indexes don't recover from
bloat, a situation made worse by non-HOT updates.

So, we have a choice of which disadvantage to accept.


> Also, why not use this bitmap for all indexes, not just update chains?

I don't understand where you get this update chains thing from.

The bitmap can apply to multiple tuples on one page, which is described.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Bruce Momjian
On Thu, Aug  4, 2016 at 04:29:09PM +0530, Pavan Deolasee wrote:
> Write Amplification Reduction Method (WARM)
> 
> 
> A few years back, we developed HOT to address the problem associated with MVCC
> and frequent updates and it has served us very well. But in the light of 
> Uber's
> recent technical blog highlighting some of the problems that are still
> remaining, especially for workloads where HOT does not help to the full 
> extent,
> Simon, myself and others at 2ndQuadrant discussed several ideas and what I
> present below is an outcome of that. This is not to take away credit from
> anybody else. Others may have thought about similar ideas, but I haven’t seen
> any concrete proposal so far.

HOT was a huge win for Postgres and I am glad you are reviewing
improvements.

> This method succeeds in reducing the write amplification, but causes other
> issues which also need to be solved. WARM breaks the invariant that all tuples
> in a HOT chain have the same index values and so an IndexScan would need to
> re-check the index scan conditions against the visible tuple returned from
> heap_hot_search(). We must still check visibility, so we only need to re-check
> scan conditions on that one tuple version.
>   
> We don’t want to force a recheck for every index fetch because that will slow
> everything down. So we would like a simple and efficient way of knowing about
> the existence of a WARM tuple in a chain and do a recheck in only those cases,
> ideally O(1). Having a HOT chain contain a WARM tuple is discussed below as
> being a “WARM chain”, implying it needs re-check.

In summary, we are already doing visibility checks on the HOT chain, so
a recheck of whether the heap tuple matches the index value is done at
most on the one visible tuple in the chain, not on every tuple in the chain.

> 2. Mark the root line pointer (or the root tuple) with a special
> HEAP_RECHECK_REQUIRED flag to tell us about the presence of a WARM tuple in 
> the
> chain. Since all indexes point to the root line pointer, it should be enough 
> to
> just mark the root line pointer (or tuple) with this flag. 

Yes, I think #2 is the easiest.  Also, if we modify the index page, we
would have to WAL the change and possibly WAL log the full page write
of the index page.  :-(

> Approach 2 seems more reasonable and simple. 
> 
> There are only 2 bits for lp_flags and all combinations are already used. But
> for LP_REDIRECT root line pointer, we could use the lp_len field to store this
> special flag, which is not used for LP_REDIRECT line pointers. So we are able
> to mark the root line pointer.

Uh, as I understand it, we only use LP_REDIRECT when we have _removed_
the tuple that the ctid was pointing to, but it seems you would need to
set HEAP_RECHECK_REQUIRED earlier than that.

Also, what is currently in the lp_len field for LP_REDIRECT?  Zeros, or
random data?  I am asking for pg_upgrade purposes.

> One idea is to somehow do this as part of the vacuum where we collect root 
> line
> pointers of  WARM chains during the first phase of heap scan, check the 
> indexes
> for all such tuples (may be combined with index vacuum) and then clear the 
> heap
> flags during the second pass, unless new tuples are added to the WARM chain. 
> We
> can detect that by checking that all tuples in the WARM chain still have XID
> less than the OldestXmin that VACUUM is using.

Yes, it seems natural to clear the ctid HEAP_RECHECK_REQUIRED flag where
you are adjusting the HOT chain anyway.

> It’s important to ensure that the flag is set when it is absolutely necessary,
> while having false positives is not a problem. We might do a little wasteful
> work if the flag is incorrectly set. Since flag will be set only during
> heap_update() and the operation is already WAL logged, this can be piggybacked
> with the heap_update WAL record. Similarly, when a root tuple is pruned to a
> redirect line pointer, the operation is already WAL logged and we can 
> piggyback
> setting of line pointer flag with that WAL record.
> 
> Flag clearing need not be WAL logged, unless we can piggyback that to some
> existing WAL logging.

Agreed, good point.

Very nice!

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +




Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Simon Riggs
On 4 August 2016 at 17:31, Andres Freund  wrote:
> Hi,
>
> On 2016-08-04 16:29:09 +0530, Pavan Deolasee wrote:
>> Indexes whose values do not change do not require new index pointers. Only
>> the index whose key is being changed will need a new index entry. The new
>> index entry will be set to the CTID of the root line pointer.
>
> That seems to require tracing all hot-chains in a page, to actually
> figure out what the root line pointer of a warm-updated HOT tuple is,
> provided it's HOT_UPDATED itself.  Or have you found a smart way to
> figure that out?

Hmm, sorry, I did think of that point and I thought I had added it to the doc.

So, yes, I agree - re-locating the root is the biggest downside to
this idea. Perhaps Pavan has other thoughts?

I think it's doable, but it will be fiddly.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Bruce Momjian
On Thu, Aug  4, 2016 at 09:58:29AM -0700, Andres Freund wrote:
> On 2016-08-04 12:55:25 -0400, Bruce Momjian wrote:
> > On Thu, Aug  4, 2016 at 09:31:18AM -0700, Andres Freund wrote:
> > > Hi,
> > > 
> > > On 2016-08-04 16:29:09 +0530, Pavan Deolasee wrote:
> > > > Indexes whose values do not change do not require new index pointers. 
> > > > Only
> > > > the index whose key is being changed will need a new index entry. The 
> > > > new
> > > > index entry will be set to the CTID of the root line pointer.
> > > 
> > > That seems to require tracing all hot-chains in a page, to actually
> > > figure out what the root line pointer of a warm-updated HOT tuple is,
> > > provided it's HOT_UPDATED itself.  Or have you found a smart way to
> > > figure that out?
> > 
> > The index points to the head of the HOT chain, and you just walk the
> > chain on the page --- no need to look at other chains on the page.
> 
> When doing an update, you'll need to re-find the root tuple, to insert
> the root ctid for the new index tuples.

Ah, I had not thought of that --- good point.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +




Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Claudio Freire
On Thu, Aug 4, 2016 at 7:59 AM, Pavan Deolasee  wrote:
> We don’t want to force a recheck for every index fetch because that will
> slow everything down. So we would like a simple and efficient way of knowing
> about the existence of a WARM tuple in a chain and do a recheck in only
> those cases, ideally O(1). Having a HOT chain contain a WARM tuple is
> discussed below as being a “WARM chain”, implying it needs re-check.
>
> We could solve this by either 1) marking both old and new index tuples as
> "recheck-required" or 2) marking the HOT chain on the heap as
> "recheck-required" such that any tuple returned from the chain requires a
> recheck.
>
> The advantage of doing 1) is that the recheck is required only for the
> compromised index.
> Note that in presence of multiple index keys pointing to the same root line
> pointer, the chain needs re-check for both the old as well as the new index
> key. The old index key must not fetch the new tuple(s) and the new index key
> must not fetch the old tuple(s). So irrespective of whether an old or a new
> tuple is being returned, the caller must know about the WARM tuples in the
> chains. So marking the index would require us to mark both old and new index
> tuples as requiring re-check. That requires an additional index scan to
> locate the old row and then an additional write to force it to re-check,
> which is algorithmically O(NlogN) on table size.
>
> Marking the heap, (2) is O(1) per updated index, so seems better. But since
> we don't know with respect to which index or indexes the chain is broken,
> all index scans must do a recheck for a tuple returned from a WARM chain.
>
> How do we mark a chain as WARM? How do we clear the flag? There are a few
> possible approaches:
>
> 1. Mark the first WARM tuple which causes HOT chain to become WARM with a
> special HEAP_RECHECK_REQUIRED flag (this flag must then be carried over to
> any further additions to the chain. Then whenever a tuple if returned from a
> chain, we must look forward into the chain for WARM tuple and let the caller
> know about potential breakage. This seems simple but would require us to
> follow HOT chain every time to check if it’s HOT or WARM.
>
> 2. Mark the root line pointer (or the root tuple) with a special
> HEAP_RECHECK_REQUIRED flag to tell us about the presence of a WARM tuple in
> the chain. Since all indexes point to the root line pointer, it should be
> enough to just mark the root line pointer (or tuple) with this flag.
>
> 3. We could also adopt a hybrid approach and mark the new index tuple and
> new heap tuple with HEAP_RECHECK_REQUIRED flag and presence of either of the
> flag in either tuple forces recheck.


I've actually been thinking of another possibility in the context of
LITE, and what I was pondering looks eerily similar to WARM, so let me
summarize (I was going to write a lengthier POC but I thought I better
comment now since you're introducing the same concept):

Marks are troublesome because, as you mention, you would need one
"WARM" bit per index to be efficient. We also don't want a recheck on
every index access.

HOT is very close to WARM already, even the code. The only problem
with HOT is that it stops scanning the update chain when a tuple is
not HOT-updated. If heap_hot_search_buffer knew to keep on following
the update chain when there is a WARM update on the index in question,
this could work.

But it won't work if it's implemented naively, because if there was
another item pointer in another page of the index pointing to a heap
tuple that belongs to an already-visited chain, which is possible in
the below case, then the scan would produce duplicate rows:

N: Row versions:
->: update
* : causes a new index tuple on index I

1 -> 2 -> 3 -> 4 -> 5
*         *    *
a         b    a        <- key

If the scan qual is "key in ('a','b')", then the index scan will visit
versions 1, 3 and 4 of the row. Suppose only 5 is visible, and
suppose the whole chain lies in the same page.

There is one update chain, 1..5, but three WARM chains: 1..2, 3, 4..5

When visiting item pointers, 1 will be found, the chain will be
followed and eventually version 5 will pass the visibility check and
the tuple will pass the scan quals, it will even match the key in the
index item, so no key check will fix this. 5 will be yielded when
visiting version 1.

Next, version 3 will be visited. There's no easy way to know that we
already visited 3, so we'll follow the chain all the way to 5, and,
again, it will be yielded. Here, the key won't match, so maybe this
case could be filtered out with a key check.

Finally, version 4 will be visited, the chain will be followed to 5,
and as in the first case, it will be yielded.

So the scan yielded the same heap tuple thrice, which isn't an ideal situation.

It would be totally acceptable for a bitmap index scan, so the
expensive logic I will be outlining below may be skipped when building
a bitmap (lossy or not).

To fix this, what I w

Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Simon Riggs
On 4 August 2016 at 18:05, Bruce Momjian  wrote:

>> Approach 2 seems more reasonable and simple.
>>
>> There are only 2 bits for lp_flags and all combinations are already used. But
>> for LP_REDIRECT root line pointer, we could use the lp_len field to store 
>> this
>> special flag, which is not used for LP_REDIRECT line pointers. So we are able
>> to mark the root line pointer.
>
> Uh, as I understand it, we only use LP_REDIRECT when we have _removed_
> the tuple that the ctid was pointing to, but it seems you would need to
> set HEAP_RECHECK_REQUIRED earlier than that.

Hmm. Mostly there will be one, so this is just for the first update
after any VACUUM.

Adding a new linepointer just to hold this seems kludgy and could mean
we run out of linepointers.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Bruce Momjian
On Thu, Aug  4, 2016 at 09:58:29AM -0700, Andres Freund wrote:
> On 2016-08-04 12:55:25 -0400, Bruce Momjian wrote:
> > On Thu, Aug  4, 2016 at 09:31:18AM -0700, Andres Freund wrote:
> > > Hi,
> > > 
> > > On 2016-08-04 16:29:09 +0530, Pavan Deolasee wrote:
> > > > Indexes whose values do not change do not require new index pointers. 
> > > > Only
> > > > the index whose key is being changed will need a new index entry. The 
> > > > new
> > > > index entry will be set to the CTID of the root line pointer.
> > > 
> > > That seems to require tracing all hot-chains in a page, to actually
> > > figure out what the root line pointer of a warm-updated HOT tuple is,
> > > provided it's HOT_UPDATED itself.  Or have you found a smart way to
> > > figure that out?
> > 
> > The index points to the head of the HOT chain, and you just walk the
> > chain on the page --- no need to look at other chains on the page.
> 
> When doing an update, you'll need to re-find the root tuple, to insert
> the root ctid for the new index tuples.

Also, I was thinking we could just point into the middle of the HOT
chain to limit the number of rows we have to check, but the problem
there is that HOT pruning could then not remove that middle ctid, which
is bad.

I wonder if walking all HOT chains on a single page to find the root is
cheap enough.  The ctid tells you if it is part of a HOT chain, right?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +




Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Andres Freund
On 2016-08-04 18:11:17 +0100, Simon Riggs wrote:
> On 4 August 2016 at 17:31, Andres Freund  wrote:
> > Hi,
> >
> > On 2016-08-04 16:29:09 +0530, Pavan Deolasee wrote:
> >> Indexes whose values do not change do not require new index pointers. Only
> >> the index whose key is being changed will need a new index entry. The new
> >> index entry will be set to the CTID of the root line pointer.
> >
> > That seems to require tracing all hot-chains in a page, to actually
> > figure out what the root line pointer of a warm-updated HOT tuple is,
> > provided it's HOT_UPDATED itself.  Or have you found a smart way to
> > figure that out?
> 
> Hmm, sorry, I did think of that point and I thought I had added it to the doc.
> 
> So, yes, I agree - re-locating the root is the biggest downside to
> this idea. Perhaps Pavan has other thoughts?
> 
I think it's doable, but it will be fiddly.

I'm less afraid of the fiddlyness of finding the root tuple, than the
cost. It's not cheap to walk through, potentially, all hot chains to
find the root ctid.

Have you considered what it'd take to allow index pointers to allow to
point to "intermediate" root tuples? I.e. have some indexes point into
the middle of an update-chain, whereas others still point to the
beginning? There's certainly some complications involved with that, but
it'd also have the advantage in reducing the amount of rechecking that
needs to be done.




Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Simon Riggs
On 4 August 2016 at 18:17, Andres Freund  wrote:
> On 2016-08-04 18:11:17 +0100, Simon Riggs wrote:
>> On 4 August 2016 at 17:31, Andres Freund  wrote:
>> > Hi,
>> >
>> > On 2016-08-04 16:29:09 +0530, Pavan Deolasee wrote:
>> >> Indexes whose values do not change do not require new index pointers. Only
>> >> the index whose key is being changed will need a new index entry. The new
>> >> index entry will be set to the CTID of the root line pointer.
>> >
>> > That seems to require tracing all hot-chains in a page, to actually
>> > figure out what the root line pointer of a warm-updated HOT tuple is,
>> > provided it's HOT_UPDATED itself.  Or have you found a smart way to
>> > figure that out?
>>
>> Hmm, sorry, I did think of that point and I thought I had added it to the 
>> doc.
>>
>> So, yes, I agree - re-locating the root is the biggest downside to
>> this idea. Perhaps Pavan has other thoughts?
>>
>> I think its doable, but it will be fiddly.
>
> I'm less afraid of the fiddlyness of finding the root tuple, than the
> cost. It's not cheap to walk through, potentially, all hot chains to
> find the root ctid.
>
> Have you considered what it'd take to allow index pointers to allow to
> point to "intermediate" root tuples? I.e. have some indexes point into
> the middle of an update-chain, whereas others still point to the
> beginning? There's certainly some complications involved with that, but
> it'd also have the advantage in reducing the amount of rechecking that
> needs to be done.

"Intermediate root tuples" was my first attempt at this and it didn't
work. I'll dig up the details, some problem in VACUUM, IIRC.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Bruce Momjian
On Thu, Aug  4, 2016 at 06:16:02PM +0100, Simon Riggs wrote:
> On 4 August 2016 at 18:05, Bruce Momjian  wrote:
> 
> >> Approach 2 seems more reasonable and simple.
> >>
> >> There are only 2 bits for lp_flags and all combinations are already used. 
> >> But
> >> for LP_REDIRECT root line pointer, we could use the lp_len field to store 
> >> this
> >> special flag, which is not used for LP_REDIRECT line pointers. So we are 
> >> able
> >> to mark the root line pointer.
> >
> > Uh, as I understand it, we only use LP_REDIRECT when we have _removed_
> > the tuple that the ctid was pointing to, but it seems you would need to
> > set HEAP_RECHECK_REQUIRED earlier than that.
> 
> Hmm. Mostly there will be one, so this is just for the first update
> after any VACUUM.
> 
> Adding a new linepointer just to hold this seems kludgy and could mean
> we run out of linepointers.

Ah, so in cases where there isn't an existing LP_REDIRECT for the chain,
you create one and use the lp_len to identify it as a WARM chain?  Hmm.

You can't update the indexes pointing to the existing ctid, so what you
would really have to do is to write over the existing ctid with
LP_REDIRECT plus WARM marker, and move the old ctid to a new ctid slot?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +




Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Andres Freund
On 2016-08-04 18:18:47 +0100, Simon Riggs wrote:
> On 4 August 2016 at 18:17, Andres Freund  wrote:
> > On 2016-08-04 18:11:17 +0100, Simon Riggs wrote:
> > I'm less afraid of the fiddlyness of finding the root tuple, than the
> > cost. It's not cheap to walk through, potentially, all hot chains to
> > find the root ctid.
> >
> > Have you considered what it'd take to allow index pointers to allow to
> > point to "intermediate" root tuples? I.e. have some indexes point into
> > the middle of an update-chain, whereas others still point to the
> > beginning? There's certainly some complications involved with that, but
> > it'd also have the advantage in reducing the amount of rechecking that
> > needs to be done.
> 
> "Intermediate root tuples" was my first attempt at this and it didn't
> work. I'll dig up the details, some problem in VACUUM, IIRC.

I think it's doable, but the question is how to do cleanup. It's kind of
easy to end up with a page full of line pointers which are all reachable
from indexes, but otherwise useless.  But I'm not sure that's actually
inevitable:

a) Given that we fully scan indexes anyway, we could apply redirections
   to the indexes themselves, during vacuum. Then redirections could be
   removed afterwards.
b) Start fixing up indexes, by refinding entries in the index from the
   heap tuple. That requires expressions to actually be immutable, but
   personally I think it's stupid to cater for expressions that are
   not. Then we can update indexes to point to the "final tuple" after
   pruning.  It'd also pave the way to avoid the full index scan at the
   end of vacuum (in some cases).


Andres




Re: [HACKERS] Lossy Index Tuple Enhancement (LITE)

2016-08-04 Thread Bruce Momjian
On Thu, Aug  4, 2016 at 06:03:34PM +0100, Simon Riggs wrote:
> On 4 August 2016 at 02:13, Bruce Momjian  wrote:
> 
> > How do plan to clear the bitmask so it, over time, doesn't end up being
> > all-set?
> 
> I don't have a plan, though thoughts welcome.
> 
> Similar situation that our current indexes don't recover from bloat, a
> situation made worse by non-HOT updates.
> 
> So, we have a choice of which disadvantage to accept.
> 
> 
> > Also, why not use this bitmap for all indexes, not just update chains?
> 
> I don't understand where you get this update chains thing from.
> 
> The bitmap can apply to multiple tuples on one page, which is described.

I am asking if we could use this idea for all rows on a page, not just
HOT-updated ones, e.g. we have five rows that have col1=4 in an index --- why
not just use one index pointer for all of them, with a bitmap to point to
the possible matches?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +




Re: [HACKERS] Lossy Index Tuple Enhancement (LITE)

2016-08-04 Thread Claudio Freire
On Thu, Aug 4, 2016 at 1:58 PM, Simon Riggs  wrote:
> On 3 August 2016 at 20:37, Claudio Freire  wrote:
>> On Wed, Aug 3, 2016 at 4:20 AM, Simon Riggs  wrote:
>>> == IndexScan ==
>>>
>>> Note that the executor code for IndexScan appears identical between
>>> the two optimizations. The difference between duplicate and range LITE
>>> tuples is needed only at INSERT time (or UPDATE indexed column to a
>>> new value).
>>>
>>> When we do an IndexScan if we see a LITE tuple we do a scan of the
>>> linepointer ranges covered by this index tuple (which might eventually
>>> go to a full block scan). For example with bit1 set we would scan 16
>>> linepointers (on an 8192 byte block) and with 2 bits set we would scan
>>> 32 linepointers etc.., not necessarily consecutive ranges. So the
>>> IndexScan can return potentially many heap rows per index entry, which
>>> need to be re-checked and may also need to be sorted prior to
>>> returning them. If no rows are returned we can kill the index pointer,
>>> just as we do now for btrees, so they will be removed eventually from
>>> the index without the need for VACUUM. An BitmapIndexScan that sees a
>>> lossy pointer can also use the lossy TID concept, so we can have
>>> partially lossy bitmaps.
>>
>> Wouldn't this risk scanning rows more than once unless it's part of a
>> bitmap scan?
>
> It's always the job of the index insert logic to ensure a valid index exists.
>
> I'm not sure what you see that changes that here. Say more?

I just explained it further in the WARM thread, which is actually the
idea I was arriving at.




Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Bruce Momjian
On Thu, Aug  4, 2016 at 06:18:47PM +0100, Simon Riggs wrote:
> > Have you considered what it'd take to allow index pointers to allow to
> > point to "intermediate" root tuples? I.e. have some indexes point into
> > the middle of an update-chain, whereas others still point to the
> > beginning? There's certainly some complications involved with that, but
> > it'd also have the advantage in reducing the amount of rechecking that
> > needs to be done.
> 
> "Intermediate root tuples" was my first attempt at this and it didn't
> work. I'll dig up the details, some problem in VACUUM, IIRC.

I think the problem is that HOT chain pruning becomes almost impossible
except via VACUUM.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +




Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Pavan Deolasee
On Thu, Aug 4, 2016 at 10:41 PM, Simon Riggs  wrote:

> On 4 August 2016 at 17:31, Andres Freund  wrote:
> > Hi,
> >
> > On 2016-08-04 16:29:09 +0530, Pavan Deolasee wrote:
> >> Indexes whose values do not change do not require new index pointers.
> Only
> >> the index whose key is being changed will need a new index entry. The
> new
> >> index entry will be set to the CTID of the root line pointer.
> >
> > That seems to require tracing all hot-chains in a page, to actually
> > figure out what the root line pointer of a warm-updated HOT tuple is,
> > provided it's HOT_UPDATED itself.  Or have you found a smart way to
> > figure that out?
>
> Hmm, sorry, I did think of that point and I thought I had added it to the
> doc.
>
> So, yes, I agree - re-locating the root is the biggest downside to
> this idea. Perhaps Pavan has other thoughts?
>
>
I clearly missed this problem, so what I say now is not fully thought
through. I see a couple of options:

1. Most often the tuple that's being updated will have been fetched by a
recent index scan. So we can potentially cache the last (few) mappings of
CTIDs to their root line pointers. Maybe we can store that in RelationData
on a per-relation basis.
2. I also think that most often the tuple that will be updated will have
its t_ctid pointing to itself. This sounds more invasive, but can we
somehow utilise the t_ctid field to store the OffsetNumber of the root line
pointer? The root line pointer only exists when the tuple under
consideration has the HEAP_ONLY_TUPLE flag set, and we know that for such
tuples the BlockNumber in t_ctid must be the same as the current block
(unless HEAP_UPDATED is also set and the updating transaction eventually
aborted).

We may have to fall back to a full page scan to find the root line pointer,
but we can limit that to corner cases.

Thanks,
Pavan


-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Pavan Deolasee
On Thu, Aug 4, 2016 at 10:49 PM, Bruce Momjian  wrote:

> On Thu, Aug  4, 2016 at 06:16:02PM +0100, Simon Riggs wrote:
> > On 4 August 2016 at 18:05, Bruce Momjian  wrote:
> >
> > >> Approach 2 seems more reasonable and simple.
> > >>
> > >> There are only 2 bits for lp_flags and all combinations are already
> used. But
> > >> for LP_REDIRECT root line pointer, we could use the lp_len field to
> store this
> > >> special flag, which is not used for LP_REDIRECT line pointers. So we
> are able
> > >> to mark the root line pointer.
> > >
> > > Uh, as I understand it, we only use LP_REDIRECT when we have _removed_
> > > the tuple that the ctid was pointing to, but it seems you would need to
> > > set HEAP_RECHECK_REQUIRED earlier than that.
> >
> > Hmm. Mostly there will be one, so this is just for the first update
> > after any VACUUM.
> >
> > Adding a new linepointer just to hold this seems kludgy and could mean
> > we run out of linepointers.
>
> Ah, so in cases where there isn't an existing LP_REDIRECT for the chain,
> you create one and use the lp_len to identify it as a WARM chain?  Hmm.
>
>
If the root tuple still exists, we store the WARM flag (or
HEAP_RECHECK_REQUIRED as used in the original post) in the tuple header
itself. When the root tuple becomes dead and HOT prune decides to replace
it with a LP_REDIRECT line pointer, the information is moved to lp_len
(which is currently set to 0 for LP_REDIRECT items). Does that answer your
question?


> You can't update the indexes pointing to the existing ctid, so what you
> would really have to do is to write over the existing ctid with
> LP_REDIRECT plus WARM marker, and move the old ctid to a new ctid slot?
>
>
Not really. I hope the above answers this, but please let me know if you
mean something else.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Claudio Freire
On Thu, Aug 4, 2016 at 2:14 PM, Claudio Freire  wrote:
> To fix this, what I was pondering, and I believe it would be viable,
> is to teach heap_hot_search_buffer about the WARM invariant. It will
> make correctness heavily depend on the invariant being true, which
> means either flagging item pointers as WARM chains, or requiring a
> reindex during upgrade to make sure all indexes verify the invariant -
> as old indices won't.
>
> When an item pointer is WARM (either implicit or explicit),
> heap_hot_search_buffer will only stop following the update chain if it
> reaches not the end of a HOT chain, but rather a key change. For this,
> it will need to recheck the key of all but the first TID in the chain,
> so singleton chains (ie: not-updated or frozen tuples) won't incurr
> this extra cost. When walking the chain, it will need to form an index
> tuple for both the old and new versions, and compare them, and follow
> the chain *only* if the keys are equal.
>
> As an optimization, if the tuple is marked HOT-updated, the key check
> may be skipped.
>
> Vacuum will need to be taught not to break the invariant, but I
> believe this is viable and efficient.


And I didn't mention the invariant.

The invariant would be that all WARM chains:

 * contain entire HOT chains (ie: no item pointers pointing to the
middle of a HOT chain)
 * start at the item pointer, and end when the t_ctid of the heap
tuple either points to another page, or a tuple with a different index
key
 * there is exactly one WARM chain containing any heap tuple,
including the singleton chain represented by a tuple whose WARM chain
contains only itself

This is maintained by:

 * No item pointer will be created for row versions whose immediately
previous version is already on the index with the same key

Cleanup can only proceed by applying cleanup on HOT chains, by moving
the head of a WARM chain forward, or by removing it entirely. It may
be that VACUUM already works like that, but I'm unsure.




[HACKERS] psql: Missing option to print current buffer to current output channel (i.e., \qprint)

2016-08-04 Thread David G. Johnston
"\echo" has "\qecho" - I'm basically looking for a "\qprint"

"\write" doesn't apply

The following doesn't work either...

#bash (stock Ubuntu 14.04)
psql --set=query_in_var='SELECT version();' > /dev/null <<'SQL'
\o /tmp/psql-test.txt
\set ECHO queries
--still ends up on stdout, hence the redirect to /dev/null to avoid it
showing on screen along with the cat output below
:query_in_var
\o
SQL
cat /tmp/psql-test.txt

Can anyone offer a reason not to add "\qprint"?

David J.


[HACKERS] Pgbench performance tuning?

2016-08-04 Thread Greg Stark
I'm trying to run pgbench on a moderately beefy machine (4-core 3.4GHz
with 32G of ram and md mirrored spinning rust drives) at scale 300
with 32 clients with duration of 15min. I'm getting TPS numbers
between 60-150 which seems surprisingly low to me and also to several
people on IRC.

Now pg_test_fsync does seem to indicate that's not an unreasonable
commit rate if there was very little commit grouping going on:

Compare file sync methods using one 8kB write:
open_datasync  100.781 ops/sec   9922 usecs/op
fdatasync       71.088 ops/sec  14067 usecs/op

Compare file sync methods using two 8kB writes:
open_datasync   50.286 ops/sec  19886 usecs/op
fdatasync       80.349 ops/sec  12446 usecs/op

And iostat does seem to indicate the drives are ~ 80% utilized with
high write await times. So maybe this is just what the system is
capable of with synchronous_commit?

Is anyone really familiar with pg_bench on similar hardware? Are these
numbers reasonable? Any suggestion for how to run it to get the most
realistic measure of Postgres on this machine?

starting vacuum...end.
transaction type: 
scaling factor: 300
query mode: simple
number of clients: 32
number of threads: 4
duration: 900 s
number of transactions actually processed: 109424
latency average: 263.196 ms
tps = 121.464536 (including connections establishing)
tps = 121.464824 (excluding connections establishing)
script statistics:
 - statement latencies in milliseconds:
 0.002  \set aid random(1, 100000 * :scale)
 0.001  \set bid random(1, 1 * :scale)
 0.001  \set tid random(1, 10 * :scale)
 0.001  \set delta random(-5000, 5000)
 0.229  BEGIN;
12.589  UPDATE pgbench_accounts SET abalance = abalance +
:delta WHERE aid = :aid;
 0.280  SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
 1.442  UPDATE pgbench_tellers SET tbalance = tbalance +
:delta WHERE tid = :tid;
12.435  UPDATE pgbench_branches SET bbalance = bbalance +
:delta WHERE bid = :bid;
 0.222  INSERT INTO pgbench_history (tid, bid, aid, delta,
mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
   237.623  END;

And the iostat for the period the pg_bench is running:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1,31    0,00    0,43   20,48    0,00   77,78

Device: rrqm/s  wrqm/s    r/s     w/s   rkB/s    wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sdb       0,00    2,47  14,31  181,13  175,96  2897,77    31,45    80,85 413,63   67,55  440,97   4,16  81,35
sda       0,01    2,47  23,77  181,13  292,04  2897,77    31,14    66,72 325,59   51,08  361,61   3,92  80,40
md0       0,00    0,00   0,05    0,00    0,20     0,00     8,00     0,00   0,00    0,00    0,00   0,00   0,00
md1       0,00    0,00   0,00    0,00    0,00     0,00     0,00     0,00   0,00    0,00    0,00   0,00   0,00
md2       0,00    0,00  38,04  182,79  467,79  2895,73    30,46     0,00   0,00    0,00    0,00   0,00   0,00
md3       0,00    0,00   0,00    0,01    0,00     0,04     8,00     0,00   0,00    0,00    0,00   0,00   0,00




Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Bruce Momjian
On Thu, Aug  4, 2016 at 11:04:49PM +0530, Pavan Deolasee wrote:
> If the root tuple still exists, we store the WARM flag (or
> HEAP_RECHECK_REQUIRED as used in the original post) in the tuple header 
> itself.
> When the root tuple becomes dead and HOT prune decides to replace it with a
> LP_REDIRECT line pointer, the information is moved to lp_len (which is
> currently set to 0 for LP_REDIRECT items). Does that answer your question?

Ah, brilliant!

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +




Re: [HACKERS] Pgbench performance tuning?

2016-08-04 Thread Andres Freund
On 2016-08-04 18:48:31 +0100, Greg Stark wrote:
> I'm trying to run pgbench on a moderately beefy machine (4-core 3.4GHz
> with 32G of ram and md mirrored spinning rust drives) at scale 300
> with 32 clients with duration of 15min. I'm getting TPS numbers
> between 60-150 which seems surprisingly low to me and also to several
> people on IRC.

What's the config? Version? What activity does pidstat -d -l indicate?
How much WAL was generated?

Greetings,

Andres Freund




Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Bruce Momjian
On Thu, Aug  4, 2016 at 01:05:53PM -0400, Bruce Momjian wrote:
> > Approach 2 seems more reasonable and simple. 
> > 
> > There are only 2 bits for lp_flags and all combinations are already used. 
> > But
> > for LP_REDIRECT root line pointer, we could use the lp_len field to store 
> > this
> > special flag, which is not used for LP_REDIRECT line pointers. So we are 
> > able
> > to mark the root line pointer.
> 
> Uh, as I understand it, we only use LP_REDIRECT when we have _removed_
> the tuple that the ctid was pointing to, but it seems you would need to
> set HEAP_RECHECK_REQUIRED earlier than that.
> 
> Also, what is currently in the lp_len field for LP_REDIRECT?  Zeros, or
> random data?  I am asking for pg_upgrade purposes.

It seems LP_REDIRECT's lp_len is always zero:  :-)

/*
 * ItemIdSetRedirect
 *  Set the item identifier to be REDIRECT, with the specified link.
 *  Beware of multiple evaluations of itemId!
 */
#define ItemIdSetRedirect(itemId, link) \
( \
(itemId)->lp_flags = LP_REDIRECT, \
(itemId)->lp_off = (link), \
(itemId)->lp_len = 0 \
  ^^
)

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +




Re: [HACKERS] [RFC] Change the default of update_process_title to off

2016-08-04 Thread Robert Haas
On Thu, Aug 4, 2016 at 3:52 AM, Andres Freund  wrote:
> On 2016-08-04 16:48:11 +0900, Michael Paquier wrote:
>> Here is a different proposal: documenting instead that disabling that
>> parameter on Windows can improve performance, at the cost of losing
>> information verbosity for processes.
>
> The benefit on windows seems pretty marginal, given the way it has to be
> viewed. People installing processexplorer et al. to view a handle they
> have to know about will be able to change the config.

I agree.  I think it would be fine to disable this on Windows, but I
wouldn't want to do the same thing on other platforms.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Bruce Momjian
On Thu, Aug  4, 2016 at 02:35:32PM -0300, Claudio Freire wrote:
> > Vacuum will need to be taught not to break the invariant, but I
> > believe this is viable and efficient.
> 
> 
> And I didn't mention the invariant.
> 
> The invariant would be that all WARM chains:
> 
>  * contain entire HOT chains (ie: no item pointers pointing to the
> middle of a HOT chain)

You mean no _index_ item pointers, right?

>  * start at the item pointer, and end when the t_ctid of the heap
> tuple either points to another page, or a tuple with a different index
> key

Uh, there is only one visible tuple in a HOT chain, so I don't see the
different index key as being a significant check.  That is, I think we
should check the visibility first, as we do now, and then, for the
only-visible tuple, check the key.  I don't see a value in (expensive)
checking the key of an invisible tuple in hopes of stoping the HOT chain
traversal.

Also, what happens when a tuple chain goes off-page, then returns to the
original page?  That is a new HOT chain given our current code, but
would the new tuple addition realize it needs to create a new index
tuple?  The new tuple code needs to check the index's root HOT tid for a
non-match.

> This is maintained by:
> 
>  * No item pointer will be created for row versions whose immediately
> previous version is already on the index with the same key

You really only have the ctid HOT head stored in the index, not the
immediate ctid.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +




Re: [HACKERS] improved DefElem list processing

2016-08-04 Thread Tom Lane
Peter Eisentraut  writes:
> Here are two WIP patches to improve the DefElem list processing that is
> used by many utility commands.

> One factors out the duplicate checks, which are currently taking up a
> lot of space with duplicate code.  I haven't applied this everywhere
> yet, but the patch shows how much boring code can be saved.

+1 on the general idea, but I wonder if it's a problem that you replaced
an O(N) check with an O(N^2) check.  I don't think there are any commands
that would be likely to have so many options that this would become a
serious issue, but ...

> The other adds a location field to the DefElem node.

+1 for sure, lots of places where that would be a good thing
(the duplicate check itself, for starters).

regards, tom lane




Re: [HACKERS] Double invocation of InitPostmasterChild in bgworker with -DEXEC_BACKEND

2016-08-04 Thread Robert Haas
On Tue, Aug 2, 2016 at 6:40 PM, Tom Lane  wrote:
> Thomas Munro  writes:
>> I discovered that if you build with -DEXEC_BACKEND on a Unix system
>> and then try to start a background worker, it dies in
>> InitializeLatchSupport:
>
>> TRAP: FailedAssertion("!(selfpipe_readfd == -1)", File: "latch.c", Line: 161)
>
>> That's because InitPostmasterChild is called twice.  I can
>> successfully use regular parallel query workers and bgworkers created
>> by extensions if I apply the attached patch.
>
> Confirmed, fix pushed.  I wonder if we should have a buildfarm member
> running the Unix EXEC_BACKEND code ...

That would get my vote.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Lossy Index Tuple Enhancement (LITE)

2016-08-04 Thread Simon Riggs
On 4 August 2016 at 18:27, Bruce Momjian  wrote:

>> > Also, why not use this bitmap for all indexes, not just update chains?
>>
>> I don't understand where you get this update chains thing from.
>>
>> The bitmap can apply to multiple tuples on one page, which is described.
>
> I am asking if we could use this idea for all rows on a page, not just
> HOT updated, e.g. we have five rows that have col1=4 in an index --- why
> not just use one index pointer for all of them, with a bitmap to point to
> the possible matches?

Yes. I described that...

"The same situation can occur if we have many INSERTs with same values
on the same block. This could lead to a reduction in size of the btree
for indexes with many duplicate values."

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Pgbench performance tuning?

2016-08-04 Thread Andres Freund
On 2016-08-04 19:15:43 +0100, Greg Stark wrote:
> On Thu, Aug 4, 2016 at 6:53 PM, Andres Freund  wrote:
> > What's the config? Version? What activity does pidstat -d -l indicate?
> > How much WAL was generated?
> 
> I know the specifics matter but I was also trying to avoid dumping too
> much into the email.
> 
> The shared buffers is set to 16384 (128MB). Otherwise it's a default
> config (For 9.4 and before I set checkpoint_segments to 32 as well).

Well, with 128MB I don't find that a very surprising result. You're
going to push data out to disk constantly. Given the averaged random
access pattern of pgbench, that's not really all that interesting.


> Never seen pidstat before but pidstat -d looks like it prints very
> similar output to the iostat output I was gathering already. There's
> nothing else running on the machine.

The question is which backends are doing the IO.


Greetings,

Andres Freund




Re: [HACKERS] improved DefElem list processing

2016-08-04 Thread Tom Lane
I wrote:
> Peter Eisentraut  writes:
>> The other adds a location field to the DefElem node.

> +1 for sure, lots of places where that would be a good thing
> (the duplicate check itself, for starters).

Forgot to mention: seems like you should have added a location
argument to makeDefElem.

regards, tom lane




Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Claudio Freire
On Thu, Aug 4, 2016 at 3:07 PM, Bruce Momjian  wrote:
> On Thu, Aug  4, 2016 at 02:35:32PM -0300, Claudio Freire wrote:
>> > Vacuum will need to be taught not to break the invariant, but I
>> > believe this is viable and efficient.
>>
>>
>> And I didn't mention the invariant.
>>
>> The invariant would be that all WARM chains:
>>
>>  * contain entire HOT chains (ie: no item pointers pointing to the
>> middle of a HOT chain)
>
> You mean no _index_ item pointers, right?

Yes

>>  * start at the item pointer, and end when the t_ctid of the heap
>> tuple either points to another page, or a tuple with a different index
>> key
>
> Uh, there is only one visible tuple in a HOT chain, so I don't see the
> different index key as being a significant check.

For a HOT chain, sure. That's why I mentioned it as an optimization.

But t_ctid, I checked, unless I read the code wrong, is set for
non-HOT updates too.

So following t_ctid regardless of the HOT-update flags should yield
update chains that can walk several buffers, and that's in fact the
mechanism I'm proposing, only just stop following the chain when it
jumps buffers.

>  That is, I think we
> should check the visibility first, as we do now, and then, for the
> only-visible tuple, check the key.

No, because, as I explained in the example, doing that will result in
duplicates being returned.

The whole update chain can span multiple WARM chains, and each WARM
chain can span multiple HOT chains.

Consider, assuming there's an index on id and another on val:

INSERT INTO TABLE t (id,val,dat) VALUES (32,'a','blabla');
UPDATE t SET dat='blabla2' where id = 32;
UPDATE t SET dat='blabla3', id=33 where id = 32;
UPDATE t SET val='b' where id = 33;
UPDATE t SET dat='blabla4' where id = 33;
UPDATE t SET val='a' where id = 33;

This produces a case similar to the one I described. Assuming all
updates happen on the same page, the first will be HOT, so no key
check is needed. The second cannot be HOT, because it updates the id.
But it can be WARM, if it's done on the same page, so it can avoid
updating the index on val. The third will need a new index item
pointer, because it has a new key. The t_ctid chain will walk all the
versions of the row, but the one with 'blabla2' will not have the
HOT-updated bit set. So regular HOT chain walking will stop there, I
propose to not stop there, so the index on val can follow the chain
into blabla3 as if it were a HOT update. Correctness is achieved only
if there is no index item pointer on that index pointing to that index
version, and since there is no efficient way of checking, the
invariants I propose aim to guarantee it so it can be assumed. Since
WARM updates don't create new index item pointers, what needs to be
checked is whether updating from version 2 to version 3 would be a
WARM update on this index or not, and that equals to checking the
index keys for equality.

>  I don't see a value in (expensive)
> checking the key of an invisible tuple in hopes of stoping the HOT chain
> traversal.

Not HOT chain, the point is to guarantee correctness by properly
identifying the end of the WARM (not HOT) chain, which is left
implicit.

> Also, what happens when a tuple chain goes off-page, then returns to the
> original page?

The WARM chain ends on the page hop, and when it returns it's a new
WARM chain. So traversal would not follow t_ctid pointers that hop
pages, which makes it efficient, and also guarantees correctness
(since if there is a page hop, we know the index will contain an item
pointer to the version in that other page, so we'll visit it when we
get there).

> That is a new HOT chain given our current code, but
> would the new tuple addition realize it needs to create a new index
> tuple?  The new tuple code needs to check the index's root HOT tid for a
> non-match.

I'm not proposing to change how HOT works, but adding another very
similar mechanism on top of it called WARM.

So HOT will keep working like that, HOT pruning will happen as usual,
but we'll have the concept (immaterialized in the physical
representation of the data, just implicit) of WARM chains. WARM chains
would span one or several HOT chains, so they're "bigger".

>> This is maintained by:
>>
>>  * No item pointer will be created for row versions whose immediately
>> previous version is already on the index with the same key
>
> You really only have the ctid HOT head stored in the index, not the
> immediate ctid.

I know. I mentioned it just in case there was a transient time
during cleanup when it isn't true, but thinking about it a little more,
if it weren't, HOT would also cause duplicates during index scans, so
it's probably not the case (or protected with a lock).




Re: [HACKERS] handling unconvertible error messages

2016-08-04 Thread Tom Lane
Victor Wagner  writes:
> On Thu, 04 Aug 2016 09:42:10 -0400
> Tom Lane  wrote:
>> Yeah.  I'm inclined to think that we should reset the message locale
>> to C as soon as we've forked away from the postmaster, and leave it
>> that way until we've absorbed settings from the startup packet.

> Really, if this response is sent after backend has been forked, problem
> probably can be easily fixed better way - StartupMessage contain
> information about desired client encoding, so this information just
> should be processed earlier than any other information from this
> message, which can cause errors (such as database name).

I think that's wishful thinking.  There will *always* be errors that
come out before we can examine the contents of the startup message.
Moreover, until we've done authentication, we should be very wary of
applying client-specified settings at all: they might be malicious.

regards, tom lane




Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Claudio Freire
On Thu, Aug 4, 2016 at 3:23 PM, Claudio Freire  wrote:
>> That is a new HOT chain given our current code, but
>> would the new tuple addition realize it needs to create a new index
>> tuple?  The new tuple code needs to check the index's root HOT tid for a
>> non-match.
>
> I'm not proposing to change how HOT works, but adding another very
> similar mechanism on top of it called WARM.
>
> So HOT will keep working like that, HOT pruning will happen as usual,
> but we'll have the concept (immaterialized in the physical
> representation of the data, just implicit) of WARM chains. WARM chains
> would span one or several HOT chains, so they're "bigger".


Answering your question, there's no need to find the root page on index updates.

When creating the new tuple in nodeModifyTable.c, the code currently
skips updating indexes if it's using HOT.

We would add a check for WARM too. It will use WARM for index X iff both:

 * ItemPointerGetBlockNumber(oldtuplectid) ==
ItemPointerGetBlockNumber(newtuplectid)
 * satisfies HOT for only this index X




Re: [HACKERS] Pgbench performance tuning?

2016-08-04 Thread Greg Stark
On Thu, Aug 4, 2016 at 6:53 PM, Andres Freund  wrote:
> What's the config? Version? What activity does pidstat -d -l indicate?
> How much WAL was generated?

I know the specifics matter but I was also trying to avoid dumping too
much into the email.

The shared buffers is set to 16384 (128MB). Otherwise it's a default
config (For 9.4 and before I set checkpoint_segments to 32 as well).

Never seen pidstat before but pidstat -d looks like it prints very
similar output to the iostat output I was gathering already. There's
nothing else running on the machine.

I would have to rerun the benchmarks to measure the WAL output. Since
there's no replication and the database shut down cleanly it looks
like it recycled all the log files proactively and the last checkpoint
is in the earliest numbered xlog file.

I'll add checking the xlog position before and after pg_bench to the
script. I also wanted to run the database under a binary calling
getrusage to report RUSAGE_CHILDREN for the database. It looks like
there's no stock program to do this but it shouldn't be hard to hack
one up.

-- 
greg




Re: [HACKERS] New version numbering practices

2016-08-04 Thread Peter Eisentraut
On 8/3/16 1:20 PM, Tom Lane wrote:
> I was just wondering whether it would be worth starting to use "git tag -a".

One should always use annotated tags for public releases.  That way, the
tag is its own Git object that cannot be later moved around or easily
faked (modulo GPG signatures -- a whole different discussion).

Unannotated tags are more intended for private labels.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] New version numbering practices

2016-08-04 Thread Robert Haas
On Thu, Aug 4, 2016 at 9:57 AM, Tom Lane  wrote:
> Vladimir Sitnikov  writes:
>>> Sorry, but I don't buy that.  I think sending both server_version and
>>> server_version_num would be silly, and we're certainly not going to stop
>>> sending server_version.
>
>> What is wrong with sending machine-readable value?
>
> [ shrug... ]  What do you claim is not machine-readable about
> server_version?

Surely you can't have missed the connection between the issue at hand
and what Craig is talking about.  If libpq were using the
machine-readable version rather than PARSING A STRING, switching to a
two-part numbering scheme wouldn't force a compatibility break.  Every
driver that has independently implemented the PostgreSQL wire protocol
is going to have to be updated for this, if they're doing something
similar to libpq, and you're still asking why sending
server_version_num is potentially beneficial?

I think it's entirely reasonable to ask whether burdening
connection startup with a few extra bytes of essentially duplicative
information is a good idea on performance grounds, and I don't know
the answer to that question.  But pretending like it wouldn't help
anything when it would fix *the exact problem we are currently talking
about* is just sticking your head in the sand.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Keeping CURRENT_DATE and similar constructs in original format

2016-08-04 Thread Peter Eisentraut
On 5/12/16 6:14 PM, Tom Lane wrote:
> So what I've wanted to do for some time is invent a new expression node
> type that represents any one of these functions and can be reverse-listed
> in the same format that the input had.  The attached proposed patch does
> that.

I was experimenting with this as well when I found your patch, and I
think this is the right solution.  Your patch works fine for me.

Because of the refactoring in 2f153ddfdd318b211590dd5585f65f357d85c2de,
you will need to update your patch a bit.

> (I'm not particularly in love with the node type name
> ValueFunction; anybody got a better idea?)

I think this is fine.  The only other idea I have would be
SQLValueFunction, to emphasize that this is about SQL-mandated special
cases.

> By the by, a scan through gram.y reveals other stuff we aren't trying
> to reverse-list in original form:
> 
>   a_expr AT TIME ZONE a_expr
>   LIKE, ILIKE, SIMILAR TO
>   OVERLAPS
>   BETWEEN
>   COLLATION FOR '(' a_expr ')'
>   EXTRACT '(' extract_list ')'
>   OVERLAY '(' overlay_list ')'
>   POSITION '(' position_list ')'
>   SUBSTRING '(' substr_list ')'
>   TREAT '(' a_expr AS Typename ')'
>   TRIM '(' BOTH trim_list ')'
>   TRIM '(' LEADING trim_list ')'
>   TRIM '(' TRAILING trim_list ')'
>   TRIM '(' trim_list ')'
> 
> Each of these gets converted to some PG-specific function or operator
> name, and then will get reverse-listed using that name and ordinary
> function or operator syntax, rather than using the SQL-approved special
> syntax.

I think those could be addressed by having ruleutils.c *always* convert
matching function calls back to the special syntax.  Alternatively, tag
the function call node in the grammar with "this is special syntax" and
then look for that in ruleutils.c.  This is sort of what I was playing
with, except that the several levels of casting for the datetime
functions make that a mess.  If it's only one function call, it should
be easier.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Bug in WaitForBackgroundWorkerShutdown() [REL9_5_STABLE]

2016-08-04 Thread Tom Lane
Yury Zhuravlev  writes:
> Dmitry Ivanov wrote:
>>> Recently I've encountered a strange crash while calling elog(ERROR, "...")
>>> after the WaitForBackgroundWorkerShutdown() function. It turns out that
>>> _returns_ inside the PG_TRY()..PG_CATCH() block are *highly* undesirable,
>>> since they leave PG_exception_stack pointing to a local struct in a dead
>>> frame, which is an obvious UB. I've attached a patch which fixes this
>>> behavior in the aforementioned function, but there might be more errors
>>> like that elsewhere.

> Good catch. But in 9.6 it has been fixed by 
> db0f6cad4884bd4c835156d3a720d9a79dbd63a9 commit. 

Only accidentally, I'd say.  I chose to push this into HEAD as well,
so that WaitForBackgroundWorkerShutdown would look more like
WaitForBackgroundWorkerStartup, and would not have a risk of future
breakage if someone adds code after the for-loop.

I was pretty worried about the "more errors like that" point, because
we recently fixed another similar mistake in plpython, cf 1d2fe56e4.
However, a search using the attached quick-hack script to look for
"return" etc between PG_TRY and PG_CATCH did not find any other
instances.

It'd be nice to have some automatic way of catching future mistakes
like this, but I'm not sure about a reliable way to do that.
One idea is to add an Assert to PG_CATCH:

 #define PG_TRY()  \
 do { \
 sigjmp_buf *save_exception_stack = PG_exception_stack; \
 ErrorContextCallback *save_context_stack = error_context_stack; \
 sigjmp_buf local_sigjmp_buf; \
 if (sigsetjmp(local_sigjmp_buf, 0) == 0) \
 { \
 PG_exception_stack = &local_sigjmp_buf
 
 #define PG_CATCH()\
+Assert(PG_exception_stack == &local_sigjmp_buf); \
 } \
 else \
 { \
 PG_exception_stack = save_exception_stack; \
 error_context_stack = save_context_stack

but there are a couple of disadvantages to this.  One is that you don't
find out about a problem unless the bad code path is actually exercised.
I'm afraid that the most tempting places to do something like this would
be low-probability error cases, and thus that the Assert would do little
to catch them.  (Case in point: the postmaster-death path fixed in this
patch would surely never have been caught by testing.)  The other and
really worse problem is that when the Assert does fire, it's nowhere near
the scene of the crime, so while it may tell you there's a problem it
does not give much help in locating the bug.

Anyone have a better idea?

regards, tom lane

#!/bin/sh

# search DIR for PG_TRY blocks containing WORD
# return, break, continue, goto are good things to look for

dir="$1"
word="$2"

for f in `find "$dir" -type f -name '*.c'`
do
  sed -n -e '/PG_TRY()/,/PG_CATCH()/ { /'"$word"'/ =; /'"$word"'/ p }' $f | \
  sed -e '/^[0-9]\+$/ { s|^[0-9]\+$|'"$f"': &: |; N }'
done

exit 0



Re: [HACKERS] New version numbering practices

2016-08-04 Thread Tom Lane
Robert Haas  writes:
> On Thu, Aug 4, 2016 at 9:57 AM, Tom Lane  wrote:
>> [ shrug... ]  What do you claim is not machine-readable about
>> server_version?

> Surely you can't have missed the connection between the issue at hand
> and what Craig is talking about.  If libpq were using the
> machine-readable version rather than PARSING A STRING, switching to a
> two-part numbering scheme wouldn't force a compatibility break.

Well, yeah, this specific case would not have broken, because we
intentionally chose to maintain backwards compatibility of the
PG_VERSION_NUM data format.  But I think there's nothing much except
wishful thinking backing up the idea that starting to send
server_version_num now will prevent a bug in future.  If we ever do
change our version numbering scheme again, it would quite possibly
be in a way that breaks PG_VERSION_NUM as well.  A fairly obvious
potential reason to need to change something there is overflow of
the two-digit fields: clients relying on PG_VERSION_NUM would be
*more* at risk, not less so, than clients looking at a string.

In short: if we'd done it that way all along, it would've been nice.
But encouraging people to change horses now isn't really going to
make things better.  What is far more likely to happen, if we were
to do that, is that people will make clients that gratuitously fail
on old server versions because they didn't bother to support or test
for the case that the server doesn't send server_version_num.

You don't really want to encourage multiple ways of identifying which
software version you're dealing with.  That way madness lies.

regards, tom lane




Re: [HACKERS] Fwd: [BUGS] BUG #14247: COMMENT is restored on wrong database

2016-08-04 Thread Robert Haas
On Tue, Aug 2, 2016 at 5:42 PM, David G. Johnston
 wrote:
> Moving to -hackers since this is getting into details
>
> Bug Report # 14247
>
> On Tue, Aug 2, 2016 at 4:58 PM, Tom Lane  wrote:
>>
>> "David G. Johnston"  writes:
>> > Do you have an opinion on this following?
>>
>> I think the real problem in this area is that the division of labor
>> we have between pg_dump and pg_dumpall isn't very good for real-world
>> use cases.
>
>
> I won't disagree here.
>
>>
>>   I'm not at all sure what a better answer would look like.
>> But I'd rather see us take two steps back and consider the issue
>> holistically instead of trying to band-aid individual misbehaviors.
>>
>> The fact that pg_dump is emitting COMMENT ON DATABASE at all is
>> fundamentally wrong given the existing division-of-labor decisions,
>> namely that pg_dump is responsible for objects within a database
>> not for database-level properties.

I think a while back somebody had the idea of making COMMENT ON
CURRENT_DATABASE or COMMENT ON CURRENT DATABASE work, which seems like
an elegant solution to me.  Of course, I just work here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] New version numbering practices

2016-08-04 Thread Tom Lane
Peter Eisentraut  writes:
> On 8/3/16 1:20 PM, Tom Lane wrote:
>> I was just wondering whether it would be worth starting to use "git tag -a".

> One should always use annotated tags for public releases.  That way, the
> tag is its own Git object that cannot be later moved around or easily
> faked (modulo GPG signatures -- a whole different discussion).

AFAIK, you can't really move around a plain tag either.  Certainly that's
why we go out of our way not to tag releases until we're certain we don't
need a re-wrap.

regards, tom lane




Re: [HACKERS] [Patch] RBTree iteration interface improvement

2016-08-04 Thread Robert Haas
On Thu, Jul 28, 2016 at 1:57 PM, Tom Lane  wrote:
> Aleksander Alekseev  writes:
>>> Can you explain use case where you need it?
>
>> ... Or maybe you have different objects, e.g. IndexScanDesc's, that should
>> iterate over some tree's independently somewhere in indexam.c
>> procedures. Exact order may depend on user's query so you don't even
>> control it.
>
> It seems clear to me that the existing arrangement is hazardous for any
> RBTree that hasn't got exactly one consumer.  I think Aleksander's plan to
> decouple the iteration state is probably a good one (NB: I've not read the
> patch, so this is not an endorsement of details).  I'd go so far as to say
> that we should remove the old API as being dangerous, rather than preserve
> it on backwards-compatibility grounds.  We make bigger changes than that
> in internal APIs all the time.
>
> Having said that, though: if the iteration state is not part of the
> object, it's not very clear whether we can behave sanely if someone
> changes the tree while an iteration is open.  It will need careful
> thought as to what sort of guarantees we can make about that.  If it's
> too weak, then a separated-state version would have enough hazards of
> its own that it's not necessarily any safer.

+1 to all of that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Fwd: [BUGS] BUG #14247: COMMENT is restored on wrong database

2016-08-04 Thread Tom Lane
Robert Haas  writes:
> On Tue, Aug 2, 2016 at 5:42 PM, David G. Johnston
>  wrote:
> The fact that pg_dump is emitting COMMENT ON DATABASE at all is
> fundamentally wrong given the existing division-of-labor decisions,
> namely that pg_dump is responsible for objects within a database
> not for database-level properties.

> I think a while back somebody had the idea of making COMMENT ON
> CURRENT_DATABASE or COMMENT ON CURRENT DATABASE work, which seems like
> an elegant solution to me.  Of course, I just work here.

I'm fairly annoyed at David for having selectively quoted from private
email in a public forum, but that was one of the points I touched on
in material that he cut.  The point I tried to make to him is that
possibly COMMENT ON CURRENT DATABASE is a portion of a holistic solution,
but it's only a portion.  We need to rethink exactly what pg_dump is
supposed to do with database-level properties.  And if it does need
COMMENT ON CURRENT DATABASE, it likely also needs GRANT ... ON CURRENT
DATABASE, ALTER CURRENT DATABASE OWNER TO, ALTER CURRENT DATABASE SET,
and maybe some other things.

regards, tom lane




Re: [HACKERS] Fwd: [BUGS] BUG #14247: COMMENT is restored on wrong database

2016-08-04 Thread David G. Johnston
On Thu, Aug 4, 2016 at 4:50 PM, Tom Lane  wrote:

> Robert Haas  writes:
> > On Tue, Aug 2, 2016 at 5:42 PM, David G. Johnston
> >  wrote:
> > The fact that pg_dump is emitting COMMENT ON DATABASE at all is
> > fundamentally wrong given the existing division-of-labor decisions,
> > namely that pg_dump is responsible for objects within a database
> > not for database-level properties.
>
> > I think a while back somebody had the idea of making COMMENT ON
> > CURRENT_DATABASE or COMMENT ON CURRENT DATABASE work, which seems like
> > an elegant solution to me.  Of course, I just work here.
>
> I'm fairly annoyed at David for having selectively quoted from private
> email in a public forum, but that was one of the points I touched on
> in material that he cut.



> The point I tried to make to him is that
> possibly COMMENT ON CURRENT DATABASE is a portion of a holistic solution,
> but it's only a portion.


I should have asked first and I'll take the heat for choosing to re-post
publicly, but I kept this aspect of your recommendation precisely because
that was indeed your position.

TL>> It's entirely possible that some feature like COMMENT ON CURRENT DATABASE
TL>> would be a necessary component of a holistic solution, [ various
TL>> other ON CURRENT commands elided ]
I'm all for an elegant solution here though at some point having a working
solution now beats waiting for someone to willingly dive more deeply into
pg_dump.  I too seem to recall previous proposals for COMMENT ON CURRENT
DATABASE yet here we are...

David J.


Re: [HACKERS] Pgbench performance tuning?

2016-08-04 Thread Jeff Janes
On Thu, Aug 4, 2016 at 10:48 AM, Greg Stark  wrote:
> I'm trying to run pgbench on a moderately beefy machine (4-core 3.4GHz
> with 32G of ram and md mirrored spinning rust drives) at scale 300
> with 32 clients with duration of 15min. I'm getting TPS numbers
> between 60-150 which seems surprisingly low to me and also to several
> people on IRC.

I am assuming "md mirrored" means you have two drives, so there is no
striping.  What is their RPM?

Your IO system is inadequate and I'm only slightly surprised at the low TPS.

> Now pg_test_fsync does seem to indicate that's not an unreasonable
> commit rate if there was very little commit grouping going on:
>
> Compare file sync methods using one 8kB write:
> open_datasync                 100.781 ops/sec      9922 usecs/op
> fdatasync                      71.088 ops/sec     14067 usecs/op
>
> Compare file sync methods using two 8kB writes:
> open_datasync                  50.286 ops/sec     19886 usecs/op
> fdatasync                      80.349 ops/sec     12446 usecs/op
>
> And iostat does seem to indicate the drives are ~ 80% utilized with
> high write await times So maybe this is just what the system is
> capable of with synchronous_commit?

As an experiment, turn off synchronous_commit and see what happens.

Most likely you are getting adequate commit grouping behavior, but
that doesn't apply to the table data.  Each transaction dirties some
random page in the 3.9GB pgbench_accounts table, and there is no
effective grouping of those writes. That data isn't written
synchronously, but in the steady state it doesn't really matter
because at some point the write rate has to equilibrate with the
dirtying rate.  If you can't make the disks faster then the TPS has to
drop to meet them. The most likely mechanism for this to happen is
that the disks are so harried trying to keep up with the dirty data
eviction, that they can't service the sync calls from the commits in a
timely manner.  But if you took the sync calls out, the bottleneck
would likely just move somewhere else with only modest overall
improvement.

The way to tune it would be to make shared_buffers large enough that
all of pgbench_accounts fits in it, and increase checkpoint_segments
and checkpoint_timeout as much as you can afford, and increase
checkpoint_completion_target.
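A hedged example of what that tuning might look like in postgresql.conf, for a 32GB machine and the ~3.9GB scale-300 accounts table (the exact values are illustrative, not a recommendation):

```
shared_buffers = 6GB                 # large enough to hold pgbench_accounts (~3.9GB)
checkpoint_timeout = 30min           # spread checkpoints out as far as affordable
checkpoint_completion_target = 0.9   # smooth checkpoint writes over the interval
checkpoint_segments = 256            # 9.4 and earlier; on 9.5+ use max_wal_size instead
```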

Cheers,

Jeff




[HACKERS] Oddity with NOT IN

2016-08-04 Thread Jim Nasby

I've got a customer that discovered something odd...

SELECT f1 FROM v1 WHERE f2 not in (SELECT bad FROM v2 WHERE f3 = 1);

does not error, even though bad doesn't exist, but

SELECT bad FROM v2 WHERE f3 = 1;
gives

ERROR:  column "bad" does not exist

Is that expected?

This is on 9.4.8, and both v1 and v2 are views. The only "odd" thing 
that I see is that v1 is a UNION ALL and v2 is a UNION. I don't think 
there are any tables in common between the two views.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461




Re: [HACKERS] Oddity with NOT IN

2016-08-04 Thread Marko Tiikkaja

On 2016-08-04 11:23 PM, Jim Nasby wrote:

> I've got a customer that discovered something odd...
>
> SELECT f1 FROM v1 WHERE f2 not in (SELECT bad FROM v2 WHERE f3 = 1);
>
> does not error, even though bad doesn't exist, but


I'm guessing there's a v1.bad?

This is a common mistake, and also why I recommend always
table-qualifying column references when there's more than one table in scope.



.m




Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Bruce Momjian
On Thu, Aug  4, 2016 at 03:23:10PM -0300, Claudio Freire wrote:
> INSERT INTO TABLE t (id,val,dat) VALUES (32,'a','blabla');
> UPDATE t SET dat='blabla2' where id = 32;
> UPDATE t SET dat='blabla3', id=33 where id = 32;
> UPDATE t SET val='b' where id = 33;
> UPDATE t SET dat='blabla4' where id = 33;
> UPDATE t SET val='a' where id = 33;
> 
> This produces a case similar to the one I described. Assuming all
> updates happen on the same page, the first will be HOT, so no key
> check is needed. The second cannot be HOT, because it updates the id.
> But it can be WARM, if it's done on the same page, so it can avoid
> updating the index on val. The third will need a new index item
> pointer, because it has a new key. The t_ctid chain will walk all the
> versions of the row, but the one with 'blabla2' will not have the
> HOT-updated bit set. So regular HOT chain walking will stop there, I
> propose to not stop there, so the index on val can follow the chain
> into blabla3 as if it were a HOT update. Correctness is achieved only
> if there is no index item pointer on that index pointing to that index
> version, and since there is no efficient way of checking, the
> invariants I propose aim to guarantee it so it can be assumed. Since
> WARM updates don't create new index item pointers, what needs to be
> checked is whether updating from version 2 to version 3 would be a
> WARM update on this index or not, and that equals to checking the
> index keys for equality.

OK, it took me a while to understand this, so let me see if I can
restate it.  You have an update chain in a single page which is made up
of one or more HOT chains, some of which might be WARM.  A flag
indicates a WARM chain (see below).  A WARM chain cannot go outside of a
HOT chain.

The HOT chains have identical keys for all indexes, and the WARM chains
have identical keys for some indexes.  If an update chain is all on the
same page, it can have multiple HOT chains.  I think this is possible if
an update changes _all_ indexed columns in the middle of an update
chain, turning off HOT and WARM.

I think we agreed that WARM index tuples will point to the head of the
HOT update chain on the same page.  We will store the
HEAP_RECHECK_REQUIRED flag on the update chain head tuple, or on the
LP_REDIRECT ctid if we removed the tuple during HOT cleanup.

So, I think the case you are discussing is walking WARM beyond the end
of the HOT chain so you don't have to create a second index entry for
the second HOT chain in the same update chain for matching values.  

For example, you have a HOT chain, and you set the indexed col1=2, then
later, in the same HOT chain, change it to col1=4, then later, in the
same HOT chain, back to col1=2.  I assume the last update would see
there was already an index entry pointing to the head of the same HOT
chain, and would skip adding to the index.

However, you might be saying that the last update of col1=2 makes a
non-HOT entry in the update chain, and you want to find that without
adding an index entry.

That seems overly complex and not worth it.

Another approach would be to keep updates on the same page in a single
WARM chain, even if all indexes have changed columns.  We will already
have code to walk the HOT chain looking for matches, and at most one
tuple in the update chain is visible.  The downside of this is
unnecessary checking if the tuple matches the index.

> >  I don't see a value in (expensive)
> > checking the key of an invisible tuple in hopes of stopping the HOT chain
> > traversal.
> 
> Not HOT chain, the point is to guarantee correctness by properly
> identifying the end of the WARM (not HOT) chain, which is left
> implicit.

Why should WARM chains span beyond HOT chains?

> > Also, what happens when a tuple chain goes off-page, then returns to the
> > original page?
> 
> The WARM chain ends on the page hop, and when it returns it's a new
> WARM chain. So traversal would not follow t_ctid pointers that hop
> pages, which makes it efficient, and also guarantees correctness
> (since if there is a page hop, we know the index will contain an item
> pointer to the version in that other page, so we'll visit it when we
> get there).
> 
> > That is a new HOT chain given our current code, but
> > would the new tuple addition realize it needs to create a new index
> > tuple?  The new tuple code needs to check the index's root HOT tid for a
> > non-match.
> 
> I'm not proposing to change how HOT works, but adding another very
> similar mechanism on top of it called WARM.
> 
> So HOT will keep working like that, HOT pruning will happen as usual,
> but we'll have the concept (immaterialized in the physical
> representation of the data, just implicit) of WARM chains. WARM chains
> would span one or several HOT chains, so they're "bigger".

Yes, I think I got it, but what is the benefit?  Fewer index entries?

As I see it, the only way with WARM you could have an update chain on
the same page that has multiple HOT chains is

Re: [HACKERS] New version numbering practices

2016-08-04 Thread Andrew Dunstan



On 08/01/2016 04:25 PM, Tom Lane wrote:

Andrew Dunstan  writes:

Somewhat related is how we name the git branches. It would help me from
a buildfarm POV if we kept lexically them sortable, which could be done
at least for the next 90 major releases :-) by adding an underscore
after the REL piece, thus: REL_10_STABLE. I realise that's a way off,
but it's worth bringing up while we're discussing the topic.

Hmm, sounds a bit C-locale-centric, but I have no objection to inserting
an underscore there if it seems helpful.



Good. FTR I tested this with every locale on my system and the results 
were identical.
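The sorting concern is easy to demonstrate in any language that compares strings bytewise; a quick sketch (branch names are illustrative):

```python
# Why the underscore matters for lexical sorting of branch names:
# without it, "REL10_STABLE" sorts before "REL9_6_STABLE" in bytewise
# order, because the digit '1' (0x31) compares less than '9' (0x39).
old_style = sorted(["REL9_5_STABLE", "REL9_6_STABLE", "REL10_STABLE"])
print(old_style)   # REL10_STABLE lands first -- wrong release order

# With the underscore, '_' (0x5F) compares greater than any digit, so
# "REL_10_STABLE" sorts after every "REL9_x_STABLE" branch, as desired.
new_style = sorted(["REL9_5_STABLE", "REL9_6_STABLE", "REL_10_STABLE"])
print(new_style)   # REL_10_STABLE correctly lands last
```

(The "next 90 major releases" caveat is because "REL_100_STABLE" would again sort before "REL_99_STABLE".)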





What I thought would be worth discussing is whether to continue using the
"_STABLE" suffix.  It seems rather like a noise word for our purposes.
OTOH, dropping it might be a headache for scripts that deal with branch
names --- any thoughts?





Just from the buildfarm POV it poses no difficulties.

cheers

andrew





[HACKERS] Bogus ANALYZE results for an otherwise-unique column with many nulls

2016-08-04 Thread Tom Lane
I looked into the problem described at
https://www.postgresql.org/message-id/flat/VisenaEmail.26.df42f82acae38a58.156463942b8%40tc7-visena
and I believe I've reproduced it: the requirement is that the inner join
column for the antijoin must contain a lot of NULL values, and what isn't
NULL must be unique or nearly so.  If ANALYZE doesn't come across any
duplicated values, it will set the column's stadistinct value to "-1",
which causes the planner to believe that each row of the inner table
produces a unique value, resulting in a bogus answer from
get_variable_numdistinct() and thence a bogus join size estimate.

Here's an example in the regression database, making use of the existing
unique column tenk1.unique1:

regression=# create table manynulls as select case when random() < 0.1 then 
unique1 else null end as unique1 from tenk1;
SELECT 1
regression=# analyze manynulls; 
ANALYZE
regression=# explain analyze select * from tenk1 t where not exists(select 1 
from manynulls m where m.unique1 = t.unique1);
QUERY PLAN  
   
---
 Hash Anti Join  (cost=261.00..756.50 rows=1 width=244) (actual 
time=4.632..14.729 rows=8973 loops=1)
   Hash Cond: (t.unique1 = m.unique1)
   ->  Seq Scan on tenk1 t  (cost=0.00..458.00 rows=1 width=244) (actual 
time=0.015..2.683 rows=1 loops=1)
   ->  Hash  (cost=136.00..136.00 rows=1 width=4) (actual time=4.553..4.553 
rows=1027 loops=1)
 Buckets: 16384  Batches: 1  Memory Usage: 165kB
 ->  Seq Scan on manynulls m  (cost=0.00..136.00 rows=1 width=4) 
(actual time=0.019..2.668 rows=1 loops=1)
 Planning time: 0.808 ms
 Execution time: 15.670 ms
(8 rows)

So the antijoin size estimate is way off, but it's hardly the planner's
fault because the stats are insane:

regression=# select attname,null_frac,n_distinct from pg_stats where tablename 
= 'manynulls';
 attname | null_frac | n_distinct 
-+---+
 unique1 |0.8973 | -1
(1 row)

With the patch attached below, ANALYZE produces

regression=# analyze manynulls;
ANALYZE
regression=# select attname,null_frac,n_distinct from pg_stats where tablename 
= 'manynulls';
 attname | null_frac | n_distinct 
-+---+
 unique1 |0.8973 |-0.1027
(1 row)

and now the join size estimate is dead on:

regression=# explain analyze select * from tenk1 t where not exists(select 1 
from manynulls m where m.unique1 = t.unique1);
QUERY PLAN  
   
---
 Hash Anti Join  (cost=261.00..847.69 rows=8973 width=244) (actual 
time=4.501..13.888 rows=8973 loops=1)
   Hash Cond: (t.unique1 = m.unique1)
   ->  Seq Scan on tenk1 t  (cost=0.00..458.00 rows=1 width=244) (actual 
time=0.031..4.959 rows=1 loops=1)
   ->  Hash  (cost=136.00..136.00 rows=1 width=4) (actual time=4.404..4.404 
rows=1027 loops=1)
 Buckets: 16384  Batches: 1  Memory Usage: 165kB
 ->  Seq Scan on manynulls m  (cost=0.00..136.00 rows=1 width=4) 
(actual time=0.034..2.576 rows=1 loops=1)
 Planning time: 1.388 ms
 Execution time: 14.542 ms
(8 rows)

What I did in the patch is to scale the formerly fixed "-1.0" stadistinct
estimate to discount the fraction of nulls we found.  An alternative
possibility might be to decree that a fractional stadistinct considers
only non-nulls, but then all the other paths through ANALYZE would be
wrong.  The spec for it in pg_statistic.h doesn't suggest any such
interpretation, either.
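To make the arithmetic concrete, here is a rough sketch of the estimate math from the manynulls example above (plain Python, not PostgreSQL code; the figures come from the session shown earlier, and a negative stadistinct means "distinct values = |stadistinct| * ntuples"):

```python
# Numbers from the manynulls example: 10000 rows, ~89.73% NULL, and
# every non-null value unique.
ntuples = 10000
null_frac = 0.8973
nonnull = ntuples * (1.0 - null_frac)     # ~1027 non-null rows

# Old behavior: ANALYZE saw no duplicates, set stadistinct = -1.0, so
# the planner believed every one of the 10000 rows yields a distinct
# value -- wildly high for the antijoin estimate.
old_estimate = -(-1.0) * ntuples
print(old_estimate)                        # 10000.0

# Patched behavior: scale the -1.0 by the non-null fraction first.
stadistinct = -1.0 * (1.0 - null_frac)     # -0.1027
new_estimate = -stadistinct * ntuples
print(new_estimate)                        # ~1027, matching reality
```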

Looking around, there are a couple of places outside commands/analyze.c
that are making the same mistake, so this patch isn't complete, but it
illustrates what needs to be done.

This is a bit reminiscent of the nulls-accounting problem we fixed in
commit be4b4dc75, though that tended to result in underestimates not
overestimates of the number of distinct values.  We didn't back-patch
that fix, so probably we shouldn't back-patch this either.  On the other
hand, it is far more open-and-shut that this is wrong.  Thoughts?

regards, tom lane

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 5fcedd7..9ac7122 100644
*** a/src/backend/commands/analyze.c
--- b/src/backend/commands/analyze.c
*** compute_distinct_stats(VacAttrStatsP sta
*** 2049,2056 
  
  		if (nmultiple == 0)
  		{
! 			/* If we found no repeated values, assume it's a unique column */
! 			stats->stadistinct = -1.0;
  		}
  		else if (track_cnt < track_max && toowide_cnt == 0 &&
   nmultiple == track_cnt)
--- 2049,2059 
  
  		if (nmultiple == 0)
  		{
! 	

Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Bruce Momjian
On Thu, Aug  4, 2016 at 03:33:16PM -0300, Claudio Freire wrote:
> On Thu, Aug 4, 2016 at 3:23 PM, Claudio Freire  wrote:
> >> That is a new HOT chain given our current code, but
> >> would the new tuple addition realize it needs to create a new index
> >> tuple?  The new tuple code needs to check the index's root HOT tid for a
> >> non-match.
> >
> > I'm not proposing to change how HOT works, but adding another very
> > similar mechanism on top of it called WARM.
> >
> > So HOT will keep working like that, HOT pruning will happen as usual,
> > but we'll have the concept (immaterialized in the physical
> > representation of the data, just implicit) of WARM chains. WARM chains
> > would span one or several HOT chains, so they're "bigger".
> 
> 
> Answering your question, there's no need to find the root page on index 
> updates.
> 
> When creating the new tuple in nodeModifyTable.c, the code currently
> skips updating indexes if it's using HOT.
> 
> We would add a check for WARM too. It will use WARM for index X iff both:
> 
>  * ItemPointerGetBlockNumber(oldtuplectid) ==
> ItemPointerGetBlockNumber(newtuplectid)
>  * satisfies HOT for only this index X

OK, I see why you like walking the entire update chain instead of just
walking the HOT chain --- you think it avoids us having to find the HOT
chain root.  However, how do you check the index to find out of the
update chain root is already in the index?  I don't think you can just
look at the current tuple to know that since a previous tuple in the
update chain might have put it there even if the current old tuple
wouldn't have.

My point is there can be multiple update chains on the same page for
different rows with identical indexed values, so I don't see how
checking just for the page number helps here.  Are we going to check the
entire page?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +




Re: [HACKERS] Bogus ANALYZE results for an otherwise-unique column with many nulls

2016-08-04 Thread Tom Lane
I wrote:
> Looking around, there are a couple of places outside commands/analyze.c
> that are making the same mistake, so this patch isn't complete, but it
> illustrates what needs to be done.

Here's a more complete patch.

regards, tom lane

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 5fcedd7..9ac7122 100644
*** a/src/backend/commands/analyze.c
--- b/src/backend/commands/analyze.c
*** compute_distinct_stats(VacAttrStatsP sta
*** 2049,2056 
  
  		if (nmultiple == 0)
  		{
! 			/* If we found no repeated values, assume it's a unique column */
! 			stats->stadistinct = -1.0;
  		}
  		else if (track_cnt < track_max && toowide_cnt == 0 &&
   nmultiple == track_cnt)
--- 2049,2059 
  
  		if (nmultiple == 0)
  		{
! 			/*
! 			 * If we found no repeated non-null values, assume it's a unique
! 			 * column; but be sure to discount for any nulls we found.
! 			 */
! 			stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
  		}
  		else if (track_cnt < track_max && toowide_cnt == 0 &&
   nmultiple == track_cnt)
*** compute_scalar_stats(VacAttrStatsP stats
*** 2426,2433 
  
  		if (nmultiple == 0)
  		{
! 			/* If we found no repeated values, assume it's a unique column */
! 			stats->stadistinct = -1.0;
  		}
  		else if (toowide_cnt == 0 && nmultiple == ndistinct)
  		{
--- 2429,2439 
  
  		if (nmultiple == 0)
  		{
! 			/*
! 			 * If we found no repeated non-null values, assume it's a unique
! 			 * column; but be sure to discount for any nulls we found.
! 			 */
! 			stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
  		}
  		else if (toowide_cnt == 0 && nmultiple == ndistinct)
  		{
*** compute_scalar_stats(VacAttrStatsP stats
*** 2753,2759 
  		else
  			stats->stawidth = stats->attrtype->typlen;
  		/* Assume all too-wide values are distinct, so it's a unique column */
! 		stats->stadistinct = -1.0;
  	}
  	else if (null_cnt > 0)
  	{
--- 2759,2765 
  		else
  			stats->stawidth = stats->attrtype->typlen;
  		/* Assume all too-wide values are distinct, so it's a unique column */
! 		stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
  	}
  	else if (null_cnt > 0)
  	{
diff --git a/src/backend/tsearch/ts_typanalyze.c b/src/backend/tsearch/ts_typanalyze.c
index 0f851ea..817453c 100644
*** a/src/backend/tsearch/ts_typanalyze.c
--- b/src/backend/tsearch/ts_typanalyze.c
*** compute_tsvector_stats(VacAttrStats *sta
*** 295,301 
  		stats->stawidth = total_width / (double) nonnull_cnt;
  
  		/* Assume it's a unique column (see notes above) */
! 		stats->stadistinct = -1.0;
  
  		/*
  		 * Construct an array of the interesting hashtable items, that is,
--- 295,301 
  		stats->stawidth = total_width / (double) nonnull_cnt;
  
  		/* Assume it's a unique column (see notes above) */
! 		stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
  
  		/*
  		 * Construct an array of the interesting hashtable items, that is,
diff --git a/src/backend/utils/adt/rangetypes_typanalyze.c b/src/backend/utils/adt/rangetypes_typanalyze.c
index fcb71d3..56504fc 100644
*** a/src/backend/utils/adt/rangetypes_typanalyze.c
--- b/src/backend/utils/adt/rangetypes_typanalyze.c
*** compute_range_stats(VacAttrStats *stats,
*** 203,209 
  		/* Do the simple null-frac and width stats */
  		stats->stanullfrac = (double) null_cnt / (double) samplerows;
  		stats->stawidth = total_width / (double) non_null_cnt;
! 		stats->stadistinct = -1.0;
  
  		/* Must copy the target values into anl_context */
  		old_cxt = MemoryContextSwitchTo(stats->anl_context);
--- 203,211 
  		/* Do the simple null-frac and width stats */
  		stats->stanullfrac = (double) null_cnt / (double) samplerows;
  		stats->stawidth = total_width / (double) non_null_cnt;
! 
! 		/* Estimate that non-null values are unique */
! 		stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
  
  		/* Must copy the target values into anl_context */
  		old_cxt = MemoryContextSwitchTo(stats->anl_context);
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index cc2a9a1..56943f2 100644
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
*** double
*** 4738,4743 
--- 4738,4744 
  get_variable_numdistinct(VariableStatData *vardata, bool *isdefault)
  {
  	double		stadistinct;
+ 	double		stanullfrac = 0.0;
  	double		ntuples;
  
  	*isdefault = false;
*** get_variable_numdistinct(VariableStatDat
*** 4745,4751 
  	/*
  	 * Determine the stadistinct value to use.  There are cases where we can
  	 * get an estimate even without a pg_statistic entry, or can get a better
! 	 * value than is in pg_statistic.
  	 */
  	if (HeapTupleIsValid(vardata->statsTuple))
  	{
--- 4746,4753 
  	/*
  	 * Determine the stadistinct value to use.  There are cases where we can
  	 * get an estimate even without a pg_statistic entry,

Re: [HACKERS] Possible duplicate release of buffer lock.

2016-08-04 Thread Peter Geoghegan
On Wed, Aug 3, 2016 at 1:31 AM, Kyotaro HORIGUCHI
 wrote:
> The first line is emitted for simultaneous deletion of a index
> page, which is impossible by design in a consistent state so the
> complained situation should be the result of an index corruption
> before upgading, specifically, inconsistent sibling pointers
> around a deleted page.

> My point here is that if concurrent deletion can't be performed by
> the current implementation, this while loop could be removed and
> immediately error out or log a message,

Perhaps I have misunderstood, but I must ask: Are you aware that 9.4
has commit efada2b8, which fixes a bug with page deletion, that never
made it into earlier branches?

It is worth noting that anything that can make VACUUM in particular
raise an error is considered problematic. For example, we do not allow
VACUUM to finish incomplete B-Tree page splits, even though we could,
because there might be no disk space for even one extra page at the
worst possible time (we VACUUM in part to free disk space, of course).
We really want to let VACUUM finish, if at all possible, even if it
sees something that indicates data corruption.

I remember that during the review of the similar B-Tree page split
patch (not efada2b8, but rather commit 40dae7ec, which was also in
9.4), I found bugs due to functions not being very clear about whether
a buffer lock and/or pin were held upon returning from a function in
all cases, including very rare cases or "can't happen" cases. So, this
seems kind of familiar.

Code comments in the relevant function say this:

 * Returns 'false' if the page could not be unlinked (shouldn't happen).
 * If the (new) right sibling of the page is empty, *rightsib_empty is set
 * to true.
 */
static bool
_bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
{

What we see here is that this "shouldn't happen" condition does, in
fact, happen very rarely, including in cases where the target is not a
leaf page.

I think there might well be a real bug here, because we are not
careful enough in ensuring that the _bt_unlink_halfdead_page()
"contract" holds when something that "can't happen" does, in fact,
happen:

/*
 * Release the target, if it was not the leaf block.  The leaf is always
 * kept locked.
 */
if (target != leafblkno)
_bt_relbuf(rel, buf);

return true;
}

(Assuming you can call this code at the end of
_bt_unlink_halfdead_page() its "contract".)

It's not good at all if the code is stuck in a way that prevents
VACUUM from finishing, as I said. That seems to be the case here,
judging from Horiguchi-san's log output, which shows an ERROR from
"lock main 13879 is not held". The problem is not so much that a
"can't happen" thing happens (data corruption will sometimes make "the
impossible" possible, after all -- we know this, and write code
defensively because of this). The bug is that there is any ERROR for
VACUUM that isn't absolutely unavoidable (the damage has been done
anyway -- the index is already corrupt).

I'll need to think about this some more, when I have more time.
Perhaps tomorrow.

-- 
Peter Geoghegan




Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Bruce Momjian
On Thu, Aug  4, 2016 at 10:58:33PM +0530, Pavan Deolasee wrote:
> 2. I also think that most often the tuple that will be updated will have its
> t_ctid pointing to itself. This sounds more invasive, but can we somehow

Agreed!  I think it has to.  Great idea.

> utilise the t_ctid field to store the OffsetNumber of the root line pointer?
> The root line pointer only exists when the tuple under consideration has
> HEAP_ONLY_TUPLE flag set and we know for such tuples, BlockNumber in t_ctid
> must be same as current block (unless HEAP_UPDATED is also set and the 
> updating
> transaction eventually aborted).

Oh, aborts, so the tuple we are looking at is the head of the chain and
points to a tuple that was aborted.  So we would still need a page-scan
to handle cases where the t_ctid was over-written by an aborted tuple,
but as you said it would be rare.

So, if the tuple has HEAP_ONLY_TUPLE set, we know the block number
ip_blkid equals our block number, so any ip_blkid value that doesn't
equal our block number is a special flag that indicates that ip_posid
points to the head of our HOT chain.  I guess we could just set it to
(our block number + 1), with rollover possible and desired.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +




Re: [HACKERS] Fwd: [BUGS] BUG #14247: COMMENT is restored on wrong database

2016-08-04 Thread Craig Ringer
On 5 August 2016 at 05:03, David G. Johnston 
wrote:


> I'm all for an elegant solution here though at some point having a working
> solution now beats waiting for someone to willingly dive more deeply into
> pg_dump.  I too seem to recall previous proposals for COMMON ON CURRENT
> DATABASE yet here we are...
>
>
SECURITY LABEL ... ON CURRENT DATABASE is needed too, and has caused
real-world pain.

I tend to agree that adding and using ... ON CURRENT DATABASE is worthwhile
now. It's guaranteed to be at least as correct as what pg_dump emits now.

We do have a horrible mess with how pg_dump handles database level
properties, but I'd rather not try to deal with that at the same time, get
into a long discussion and land up fixing nothing.



-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] New version numbering practices

2016-08-04 Thread Craig Ringer
On 5 August 2016 at 04:31, Tom Lane  wrote:


>  In short: if we'd done it that way all along, it would've been nice.

> But encouraging people to change horses now isn't really going to
> make things better.  What is far more likely to happen, if we were
> to do that, is that people will make clients that gratuitously fail
> on old server versions because they didn't bother to support or test
> for the case that the server doesn't send server_version_num.
>

Yup, if someone's writing a completely new client or bypassing their
database driver's version detection.

But the same argument would've applied when the server_version_num GUC was
added in the first place. The world didn't end then, and it won't now.

Yes, it'll add a small cost to the startup packet, but that's it. It
should've been done when server_version_num was added in the first place
and it should be done now. Small variation in one-off packet size is pretty
unimportant anyway TBH, since it's latency and roundtrip costs and that
hurt us.

> You don't really want to encourage multiple ways of identifying which
> software version you're dealing with.  That way madness lies.
>
>
Yet we have a server_version_num GUC. Because it's useful and people
benefited from it.

It's a pain in the butt having 9.4beta, etc and having to deal with those.

I like to think we think in the long term here. The same arguments you make
against adding server_version_num as GUC_REPORT apply to pretty much every
improvement that exposes any kind of UI, including new SQL. Users won't be
able to use it for ages, it'll break apps that connect to older servers if
they use the new feature, etc etc.

Lets just add it. It should've been there in the first place, it was an
oversight not to add it when initially adding server_version_num, and it
should be fixed.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Claudio Freire
On Thu, Aug 4, 2016 at 7:21 PM, Bruce Momjian  wrote:
> On Thu, Aug  4, 2016 at 03:23:10PM -0300, Claudio Freire wrote:
>> INSERT INTO TABLE t (id,val,dat) VALUES (32,'a','blabla');
>> UPDATE t SET dat='blabla2' where id = 32;
>> UPDATE t SET dat='blabla3', id=33 where id = 32;
>> UPDATE t SET val='b' where id = 33;
>> UPDATE t SET dat='blabla4' where id = 33;
>> UPDATE t SET val='a' where id = 33;
>>
>> This produces a case similar to the one I described. Assuming all
>> updates happen on the same page, the first will be HOT, so no key
>> check is needed. The second cannot be HOT, because it updates the id.
>> But it can be WARM, if it's done on the same page, so it can avoid
>> updating the index on val. The third will need a new index item
>> pointer, because it has a new key. The t_ctid chain will walk all the
>> versions of the row, but the one with 'blabla2' will not have the
>> HOT-updated bit set. So regular HOT chain walking will stop there, I
>> propose to not stop there, so the index on val can follow the chain
>> into blabla3 as if it were a HOT update. Correctness is achieved only
>> if there is no index item pointer on that index pointing to that index
>> version, and since there is no efficient way of checking, the
>> invariants I propose aim to guarantee it so it can be assumed. Since
>> WARM updates don't create new index item pointers, what needs to be
>> checked is whether updating from version 2 to version 3 would be a
>> WARM update on this index or not, and that equals to checking the
>> index keys for equality.
>
> OK, it took me a while to understand this, so let me see if I can
> restate it.  You have an update chain in a single page which is made up
> of one or more HOT chains, some of which might be WARM.  A flag
> indicates a WARM chain (see below).  A WARM chain cannot go outside of a
> HOT chain.

It's the other way around. A WARM chain has to span whole HOT chains.
The WARM chain always begins in the head of a HOT chain (or non-HOT
tuple of course), and cannot end before the HOT chain ends.

That is, all within a single page:

  [  v1 .. v2 .. v3  .. v4 .. v5 .. v6  v7 ]
  [ hot chain 1   ][ hot chain 2 ] [NH]
  [ WARM CHAIN  ] [NW]

(NH = non-HOT)
(NW = non-WARM)

The flag could be useful for the upgrade path, but if you ignore
having to live with old-format indexes, it's not necessary.

A WARM chain starts at any index item pointer. It ends when there's a
key change or a page change (ie: the WARM chain cannot span multiple
pages). That cannot happen within a HOT chain, so that point will also
be the end of a HOT chain.

You can think of a WARM chain as a chain of HOT chains.
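To illustrate, here is a toy model of that "chain of HOT chains" traversal. It mirrors nothing in the actual PostgreSQL source; the tuple layout and stop conditions are simplified assumptions, keyed to the six-UPDATE example earlier in the thread:

```python
def warm_chain(versions, start, index_key, page):
    """Walk the update chain from `start`, collecting versions that a
    scan of one specific index may treat as a single WARM chain: stop
    on a page hop or on a change of this index's key."""
    chain = []
    for v in versions[start:]:
        if v["page"] != page or v[index_key] != versions[start][index_key]:
            break          # page hop or key change ends the WARM chain
        chain.append(v)
    return chain

# Versions from the example: id changes at v3 (ending the id index's
# chain) while val stays 'a' until v4 (ending the val index's chain).
versions = [
    {"page": 1, "id": 32, "val": "a"},  # v1  INSERT
    {"page": 1, "id": 32, "val": "a"},  # v2  HOT update (dat only)
    {"page": 1, "id": 33, "val": "a"},  # v3  id changed: WARM for val index
    {"page": 1, "id": 33, "val": "b"},  # v4  val changed
    {"page": 1, "id": 33, "val": "b"},  # v5  HOT update (dat only)
    {"page": 1, "id": 33, "val": "a"},  # v6  val changed back
]

# Different indexes see different chaining of the same update chain:
print(len(warm_chain(versions, 0, "val", 1)))  # 3 -- v1..v3 for val
print(len(warm_chain(versions, 0, "id", 1)))   # 2 -- v1..v2 for id
```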

> The HOT chains have identical keys for all indexes, and the WARM chains
> have identical keys for some indexes.

Lets highlight that a WARM chain concept only exists in the context of
a specific index. WARM chains (the way I describe them) apply to only
one index at a time.

So, different indexes will have different chaining of HOT chains into
WARM chains. While index A may chain HOT chain 1 and HOT chain 2
together, index B may not. In fact, the way WARM works guarantees that
at least one index won't chain them.

So WARM chains have identical keys for a specific index. Different
indexes have different WARM chains.

>  If an update chain is all on the
> same page, it can have multiple HOT chains.  I think this is possible if
> an update changes _all_ indexed columns in the middle of an update
> chain, turning off HOT and WARM.

Yes

> I think we agreed that WARM index tuples will point to the head of the
> HOT update chain on the same page.

The head of the WARM chain, which may not be the HOT update chain.

In the above example, the WARM index tuple would point to the head of
the hot chain 1, not hot chain 2.

> We will store the
> HEAP_RECHECK_REQUIRED flag on the update chain head tuple, or on the
> LP_REDIRECT ctid if we removed the tuple during HOT cleanup.

I wouldn't put it on the update chain head, I'd put it on the
just-inserted heap tuple, and I'd call it HEAP_WARM_TUPLE. I wouldn't
use a corresponding HEAP_WARM_UPDATED, there's no need to update the
old tuple's flags.

The purpose of the flag is to be able to discern new-style index
pointers from old-style ones, since they require different ways of
following the update chain.

In the example above, old indexes will have item pointers to the head
of hot chain 1 and 2. A scan that finds the item pointer for hot chain
1 has to scan the HOT chain only, and not step into hot chain 2,
because there's an item pointer for hot chain 2 already. New indexes
need to follow through the end of HOT chain 1 into HOT chain 2,
because there won't be an item pointer pointing to the head of HOT
chain 2. The flag being set in the tuple would signal that this is a
new-style update, so it needs to follow the WARM chain, whereas if the
flag is 0, it needs not. It allows for a painless upgrade, but that's
it.

The flag alone doesn't remove the need

Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Bruce Momjian
On Thu, Aug  4, 2016 at 06:21:48PM -0400, Bruce Momjian wrote:
> Another approach would be to keep updates on the same page in a single
> WARM chain, even if all indexes have changed columns.  We will already
> have code to walk the HOT chain looking for matches, and at most one
> tuple in the update chain is visible.  The downside of this is
> unnecessary checking if the tuple matches the index.

Forget this idea I had.  If all the indexed keys change, we don't save
anything in fewer index tuples, and we extend the length of the HOT
chain because we always have to point to the root of the HOT chain in
the index.

The only advantage would be allowing reuse of the old index tuple if the
indexed value changes back to an old indexed value in the same chain,
which seems rare.

Let me say I am excited at the progress that has been made on this item
in the past 36 hours and I think we are triangulating on a solution.

I know we are restricted by the existing page format and pg_upgrade in
adding this feature, but frankly that limitation seems minor --- even
if designing a new storage system, figuring out how to do WARM chains
efficiently would be hard.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +




Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Claudio Freire
On Thu, Aug 4, 2016 at 7:59 AM, Pavan Deolasee  wrote:
> So marking the index would require us to mark both old and new index tuples
> as requiring re-check. That requires an additional index scan to locate the
> old row and then an additional write to force it to re-check, which is
> algorithmically O(N log N) on table size.


So, basically, I'm saying this isn't really O(N log N); it's O(N),
and it manifests in low-cardinality indexes.
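To see where the O(N) comes from, here is a toy model (my own sketch,
not code from the thread; `locate` and the entry layout are
illustrative): marking the old index tuple for recheck first requires
finding the entry whose heap pointer matches the old row, and among
duplicates of one key that search degenerates to a linear scan.

```python
# Toy btree lookup: descend to the first entry with the key in
# O(log N) via bisect, then scan duplicates linearly for the ctid.
import bisect

def locate(entries, key, ctid):
    """entries: sorted list of (key, ctid).  Return (index, steps),
    where steps counts duplicate entries examined after the descent."""
    i = bisect.bisect_left(entries, (key,))   # (key,) sorts before (key, 0)
    steps = 0
    while i < len(entries) and entries[i][0] == key:
        steps += 1
        if entries[i][1] == ctid:
            return i, steps
        i += 1
    raise KeyError("no entry for that ctid")

# Low-cardinality extreme: one key, N duplicates -> the duplicate scan
# alone is O(N) in the worst case.
N = 1000
entries = [(42, t) for t in range(N)]
_, steps = locate(entries, 42, N - 1)
assert steps == N   # walked every duplicate to find the last ctid
```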




Re: [HACKERS] max_parallel_degree > 0 for 9.6 beta

2016-08-04 Thread Michael Paquier
On Thu, Aug 4, 2016 at 10:52 PM, David G. Johnston
 wrote:
> My initial reaction was +1 but now I'm leaning toward enabled by default.
>
> Those who would upgrade to 9.6 within a year of its release are most likely,
> process and personality wise, to be those for whom the risks and rewards of
> new features are part of their everyday routine.
>
> If indeed we release 10.0 with it enabled by default we should be confident
> in its stability at that point in the 9.6.x branch.

Being cautious pays off more in the long term, so given the number of
bugs that have shown up I'd rather vote for having it disabled by
default in the stable 9.6 branch, and enabled on master with the aim
of enabling it in 10.0.
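For context, flipping the default is a single GUC in postgresql.conf.
A minimal sketch (note the GUC name changed during the 9.6 cycle; the
final 9.6 release calls it max_parallel_workers_per_gather):

```ini
# postgresql.conf -- parallel query knob, beta-era naming from this thread
max_parallel_degree = 2              # 0 disables parallel query

# equivalent setting under the renamed GUC in the 9.6 final release:
# max_parallel_workers_per_gather = 2
```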
-- 
Michael




Re: [HACKERS] Heap WARM Tuples - Design Draft

2016-08-04 Thread Bruce Momjian
On Thu, Aug  4, 2016 at 09:53:31PM -0300, Claudio Freire wrote:
> It's the other way around. A WARM chain has to span whole HOT chains.
> The WARM chain always begins in the head of a HOT chain (or non-HOT
> tuple of course), and cannot end before the HOT chain ends.
> 
> That is, all within a single page:
> 
>   [ v1 .. v2 .. v3 ][ v4 .. v5 .. v6 ][ v7 ]
>   [  hot chain 1   ][  hot chain 2   ][ NH ]
>   [            WARM chain            ][ NW ]
> 
> (NH = non-HOT)
> (NW = non-WARM)
> 
> The flag could be useful for the upgrade path, but if you ignore
> having to live with old-format indexes, it's not necessary.
> 
> A WARM chain starts at any index item pointer. It ends when there's a
> key change or a page change (ie: the WARM chain cannot span multiple
> pages). That cannot happen within a HOT chain, so that point will also
> be the end of a HOT chain.
> 
> You can think of a WARM chain as a chain of HOT chains.

OK, that's a lot of text, and I am confused.  What is the advantage of
having an index point to a string of HOT chains rather than to a
single one?  Is the advantage that the index points into the middle of
the HOT chain rather than at the beginning?  I can see some value in
that, but having more ctid HOT headers that can't be removed except by
VACUUM seems bad, plus the need for index rechecks as you cross to the
next HOT chain seems bad.
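Concretely, the recheck at each HOT-chain boundary can be sketched as
follows (a toy model of my own, not PostgreSQL code; the tuple fields
and warm_fetch are illustrative): every time the walk crosses into a
new HOT chain, the indexed key must be compared again, because the
chain may have left the key the index entry was made for.

```python
# Toy walk of a "chain of HOT chains": crossing into a new HOT chain
# forces a key recheck against the index key that led us here.
from dataclasses import dataclass
from typing import Optional

@dataclass
class HeapTuple:
    key: int                        # indexed column value
    visible: bool                   # visible to our snapshot?
    next: Optional["HeapTuple"] = None
    next_breaks_hot: bool = False   # successor starts a new HOT chain

def warm_fetch(root, index_key):
    """Walk from the WARM root; return (visible tuple or None, rechecks)."""
    rechecks = 0
    t, crossed = root, False
    while t is not None:
        if crossed:                 # just entered a new HOT chain
            rechecks += 1
            if t.key != index_key:  # recheck failed: not our entry's tuple
                return None, rechecks
        if t.visible:
            return t, rechecks
        crossed = t.next_breaks_hot
        t = t.next
    return None, rechecks

# HOT chain 1: v1 -> v2 (both dead); HOT chain 2: v3 (visible), same key.
v3 = HeapTuple(key=10, visible=True)
v2 = HeapTuple(key=10, visible=False, next=v3, next_breaks_hot=True)
v1 = HeapTuple(key=10, visible=False, next=v2)
t, n = warm_fetch(v1, index_key=10)
assert t is v3 and n == 1           # one recheck at the chain boundary
```

A plain HOT traversal never pays that comparison, which is the extra
per-boundary cost being debated here.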

The value of WARM is to avoid index bloat --- right now we traverse
the HOT chain on a single page just fine, with no complaints about
speed, so I don't see value in optimizing that traversal, and I think
the key rechecks and ctid bloat will make it all much slower.

It also seems much more complicated.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +



