Re: proposal \gcsv

2020-04-01 Thread Vik Fearing
On 4/1/20 1:53 AM, Tom Lane wrote:
> Consider some syntax along the lines of
> 
> \gpset (pset-option-name [pset-option-value]) ... filename
> 
> or if you don't like parentheses, choose some other punctuation to wrap
> the \pset options in.  I initially thought of square brackets, but I'm
> afraid they might be just too darn confusing to document --- how could
> you make them distinct from metasyntax square brackets, especially in
> plain-ASCII docs?  Also it'd have to be punctuation that's unlikely to
> start a file name --- but parens are already reserved in most shells.


If parens are going to be required, why don't we just add them to \g?

TABLE blah \g (format csv) filename
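For context, a rough sketch of the status quo that either proposed spelling would shorten, assuming psql's existing \pset-based CSV support (today the output format has to be toggled by hand around the \g):

    \pset format csv
    TABLE blah \g filename
    \pset format aligned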
-- 
Vik Fearing




Re: Tab completion for \gx

2020-04-01 Thread Vik Fearing
On 4/1/20 5:01 AM, Bruce Momjian wrote:
> 
> Patch applied through PG 10, thanks.

Thanks!
-- 
Vik Fearing




Re: pg_stat_statements issue with parallel maintenance (Was Re: WAL usage calculation patch)

2020-04-01 Thread Dilip Kumar
On Wed, Apr 1, 2020 at 8:51 AM Dilip Kumar  wrote:
>
> On Wed, Apr 1, 2020 at 8:26 AM Masahiko Sawada
>  wrote:
> >
> > On Wed, 1 Apr 2020 at 11:46, Amit Kapila  wrote:
> > >
> > > On Tue, Mar 31, 2020 at 7:32 PM Dilip Kumar  wrote:
> > > >
> > > > While testing I found one issue.  Basically, during a parallel
> > > > vacuum, it was showing a higher shared_blks_hit + shared_blks_read
> > > > count.  After some investigation, I found that during the cleanup
> > > > phase nworkers is -1, and because of this we didn't try to launch
> > > > workers, but "lps->pcxt->nworkers_launched" still had the old launched
> > > > worker count, and shared memory also had old buffer read data which
> > > > was never updated since we did not try to launch the workers.
> > > >
> > > > diff --git a/src/backend/access/heap/vacuumlazy.c
> > > > b/src/backend/access/heap/vacuumlazy.c
> > > > index b97b678..5dfaf4d 100644
> > > > --- a/src/backend/access/heap/vacuumlazy.c
> > > > +++ b/src/backend/access/heap/vacuumlazy.c
> > > > @@ -2150,7 +2150,8 @@ lazy_parallel_vacuum_indexes(Relation *Irel,
> > > > IndexBulkDeleteResult **stats,
> > > >  * Next, accumulate buffer usage.  (This must wait for the 
> > > > workers to
> > > >  * finish, or we might get incomplete data.)
> > > >  */
> > > > -   for (i = 0; i < lps->pcxt->nworkers_launched; i++)
> > > > +   nworkers = Min(nworkers, lps->pcxt->nworkers_launched);
> > > > +   for (i = 0; i < nworkers; i++)
> > > > InstrAccumParallelQuery(&lps->buffer_usage[i]);
> > > >
> > > > It worked after the above fix.
> > > >
> > >
> > > Good catch.  I think we should not even call
> > > WaitForParallelWorkersToFinish for such a case.  So, I guess the fix
> > > could be,
> > >
> > > if (workers > 0)
> > > {
> > > WaitForParallelWorkersToFinish();
> > > for (i = 0; i < lps->pcxt->nworkers_launched; i++)
> > >  InstrAccumParallelQuery(&lps->buffer_usage[i]);
> > > }
> > >
> >
> > Agreed. I've attached the updated patch.
> >
> > Thank you for testing, Dilip!
>
> Thanks!  One hunk is failing on the latest head.  And, I have rebased
> the patch for my testing so posting the same.  I have done some more
> testing to test multi-pass vacuum.
>
> postgres[114321]=# show maintenance_work_mem ;
>  maintenance_work_mem
> --
>  1MB
> (1 row)
>
> --Test case
> select pg_stat_statements_reset();
> drop table test;
> CREATE TABLE test (a int, b int);
> CREATE INDEX idx1 on test(a);
> CREATE INDEX idx2 on test(b);
> INSERT INTO test SELECT i, i FROM GENERATE_SERIES(1,200) as i;
> DELETE FROM test where a%2=0;
> VACUUM (PARALLEL n) test;
> select query, total_time, shared_blks_hit, shared_blks_read,
> shared_blks_hit + shared_blks_read as total_read_blks,
> shared_blks_dirtied, shared_blks_written from pg_stat_statements where
> query like 'VACUUM%';
>
>           query           | total_time  | shared_blks_hit | shared_blks_read | total_read_blks | shared_blks_dirtied | shared_blks_written
> --------------------------+-------------+-----------------+------------------+-----------------+---------------------+---------------------
>  VACUUM (PARALLEL 0) test | 5964.282408 |           92447 |                6 |           92453 |               19789 |                   0
>
>
>           query           |   total_time   | shared_blks_hit | shared_blks_read | total_read_blks | shared_blks_dirtied | shared_blks_written
> --------------------------+----------------+-----------------+------------------+-----------------+---------------------+---------------------
>  VACUUM (PARALLEL 1) test | 3957.765881003 |           92447 |                6 |           92453 |               19789 |                   0
> (1 row)
>
> So I am getting correct results with the multi-pass vacuum.

I have done some testing for the parallel "create index".

postgres[99536]=# show maintenance_work_mem ;
 maintenance_work_mem
--
 1MB
(1 row)

CREATE TABLE test (a int, b int);
INSERT INTO test SELECT i, i FROM GENERATE_SERIES(1,200) as i;
CREATE INDEX idx1 on test(a);
select query, total_time, shared_blks_hit, shared_blks_read,
shared_blks_hit + shared_blks_read as total_read_blks,
shared_blks_dirtied, shared_blks_written from pg_stat_statements where
query like 'CREATE INDEX%';


SET max_parallel_maintenance_workers TO 0;
            query             |   total_time   | shared_blks_hit | shared_blks_read | total_read_blks | shared_blks_dirtied | shared_blks_written
------------------------------+----------------+-----------------+------------------+-----------------+---------------------+---------------------
 CREATE INDEX idx1 on test(a) | 1947.495997999 |            8947 |               11 |            8958 |                   5 |                   0

SET max_parallel_maintenance_workers TO 2;

query | total_time | shared_blks_hit |
shared_blks_read | total_read_blks | shared_blks_dirtied |
shared_blks_writte

Re: recovery_target_action=pause with confusing hint

2020-04-01 Thread Fujii Masao




On 2020/04/01 11:42, movead...@highgo.ca wrote:



When I tested the patch, I found an issue: I started a standby with the
'promote_trigger_file' GUC set, and executed pg_wal_replay_pause() during
recovery; as shown below, it reported success in pausing the first time.
I think it used an initial 'SharedPromoteIsTriggered' value the first time
I executed pg_wal_replay_pause().

Hm. Are you sure this is related to this patch? Could you explain the exact
timing? I mean log_statement = all and the relevant logs.
Most likely this is an expected change from
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=496ee647ecd2917369ffcf1eaa0b2cdca07c8730
My proposal does not change the behavior after this commit; it only changes
the lines in the logs.

I tested it again with (92d31085e926253aa650b9d1e1f2f09934d0ddfc), and the
issue appeared again. Here is my test method, which is quite simple:
1. Set up a base backup with pg_basebackup.
2. Insert lots of data on the master, so that I have enough time to execute
    pg_wal_replay_pause() when the replication starts up.
3. Configure the 'promote_trigger_file' GUC and create the trigger file.
4. Start the backup (standby), connect to it immediately, and execute
pg_wal_replay_pause().
Then the issue appears; a test log is attached.

I mean that when I execute pg_wal_replay_pause() for the first time, nobody has
checked the trigger state via CheckForStandbyTrigger(), so it uses an initial
'SharedPromoteIsTriggered' value.
The attached patch can solve the issue.


Thanks for the explanation!

But, sorry,,, I failed to understand the issue that you reported, yet...
You mean that the first call of pg_wal_replay_pause() in the step #2
should check whether the trigger file exists or not? If so, could you
tell me why we should do that?

BTW, right now only the startup process is allowed to call
CheckForStandbyTrigger(). So the backend process calling
pg_wal_replay_pause() and PromoteIsTriggered() is not allowed to call
CheckForStandbyTrigger(). The current logic is that the startup process
is responsible for checking the trigger file and set the flag in the shmem
if promotion is triggered. Then other processes like backend know
whether promotion is ongoing or not from the shmem. So basically
the backend doesn't need to check the trigger file itself.
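To illustrate the division of labor being described, here is a minimal C sketch.
The struct and function names are invented for the example; in the real server
the flag lives in shared memory (XLogCtl) and is read and written under a lock,
and only the startup process inspects the trigger file.

    #include <stdbool.h>
    #include <sys/stat.h>

    /* Simplified stand-in for the shared-memory state. */
    typedef struct PromoteStateSketch
    {
        bool    shared_promote_is_triggered;
    } PromoteStateSketch;

    /* Startup process only: called periodically from its WAL-wait loop. */
    static void
    startup_check_promote_trigger(PromoteStateSketch *state,
                                  const char *trigger_file)
    {
        struct stat st;

        if (trigger_file != NULL && trigger_file[0] != '\0' &&
            stat(trigger_file, &st) == 0)
            state->shared_promote_is_triggered = true;
    }

    /* Any other process (e.g. a backend behind PromoteIsTriggered()):
     * it only reads the shared flag, never the trigger file itself. */
    static bool
    backend_promote_is_triggered(const PromoteStateSketch *state)
    {
        return state->shared_promote_is_triggered;
    }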

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters




Re: Add A Glossary

2020-04-01 Thread Jürgen Purtz



On 31.03.20 19:58, Justin Pryzby wrote:

On Tue, Mar 31, 2020 at 04:13:00PM +0200, Jürgen Purtz wrote:

Please find some minor suggestions in the attachment. They are based on
Corey's last patch 0001-glossary-v4.patch.
@@ -220,7 +220,7 @@
Records to the file system and creates a special
checkpoint record. This process is initiated when predefined
conditions are met, such as a specified amount of time has passed, or
-  a certain volume of records have been collected.
+  a certain volume of records has been collected.

I think you're correct in that "volume" is singular.  But I think "collected"
is the wrong word.  I suggested "written".

"collected" is not optimal. I suggest "created". Please avoid "written", 
the WAL records will be written when the Checkpointer is running, not 
before. So:


 "a certain volume of WAL records has been 
collected."



Everything else is ok for me.

Kind regards, Jürgen




Re: Add A Glossary

2020-04-01 Thread Jürgen Purtz



On 31.03.20 20:07, Justin Pryzby wrote:

On Mon, Mar 30, 2020 at 01:10:19PM -0400, Corey Huinker wrote:

+   
+Aggregating
+
+ 
+  The act of combining a collection of data (input) values into
+  a single output value, which may not be of the same type as the
+  input values.

I think we maybe already tried to address this ; but could we define a noun
form ?  But not "aggregate" since it's the same word as the verb form.  I think
it would maybe be best to merge with "aggregate function", below.


Yes, combine the two. Or remove "aggregating" at all.



+ 

+Log Writer
+
+ 
+  If activated and parameterized, the

I still don't know what parameterized means here.


Remove "and parameterized". The Log Writer always has (default) parameters.


Everything else is ok for me.

Kind regards, Jürgen




Re: Re: recovery_target_action=pause with confusing hint

2020-04-01 Thread movead...@highgo.ca

>But, sorry,,, I failed to understand the issue that you reported, yet...
>You mean that the first call of pg_wal_replay_pause() in the step #2
>should check whether the trigger file exists or not? If so, could you
>tell me why we should do that?
Sorry about my poor English.  The pg_wal_replay_pause() is executed
in step 4. I mentioned it in step 2 to explain why so much data needs to be
inserted.

>BTW, right now only the startup process is allowed to call
>CheckForStandbyTrigger(). So the backend process calling
>pg_wal_replay_pause() and PromoteIsTriggered() is not allowed to call
>CheckForStandbyTrigger(). The current logic is that the startup process
>is responsible for checking the trigger file and set the flag in the shmem
That's the point: the startup process does not call CheckForStandbyTrigger() to check
the trigger file until a pg_wal_replay_pause() or PromoteIsTriggered() comes,
so the first time pg_wal_replay_pause() is called, it uses a wrong
'SharedPromoteIsTriggered' value.


>if promotion is triggered. Then other processes like backend know
>whether promotion is ongoing or not from the shmem. So basically
>the backend doesn't need to check the trigger file itself.
If the backend is not allowed to call CheckForStandbyTrigger(), then you should
find another way to handle it.
In other words, if I add the trigger file during recovery, the startup process
does not know about it at all until after a pg_wal_replay_pause() comes.

In addition, although the first time I execute pg_wal_replay_pause() it shows
success, the startup process keeps replaying WAL (it does not stop).
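Roughly the sequence being reported, on the standby (these are the stock
recovery-control functions; the exact outcome depends on when the startup
process notices the trigger file):

    SELECT pg_wal_replay_pause();      -- returns successfully
    -- ... yet the startup process then notices the promote trigger file,
    -- silently leaves the paused state, and replay keeps advancing:
    SELECT pg_last_wal_replay_lsn();   -- keeps moving forward on repeated calls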




Regards,
Highgo Software (Canada/China/Pakistan) 
URL : www.highgo.ca 
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca


Re: WAL usage calculation patch

2020-04-01 Thread Julien Rouhaud
So here's a v9, rebased on top of the latest versions of Sawada-san's bug fixes
(Amit's v6 for vacuum and Sawada-san's v2 for create index), with all
previously mentioned changes.

Note that I'm only attaching those patches for convenience and to make sure
that cfbot is happy.
>From a0fb471f9f498f7479eb4ae2182271374d694e46 Mon Sep 17 00:00:00 2001
From: Amit Kapila 
Date: Wed, 1 Apr 2020 11:50:57 +0530
Subject: [PATCH v9 1/6] Allow parallel vacuum to accumulate buffer usage.

Commit 40d964ec99 allowed vacuum command to process indexes in parallel but
forgot to accumulate the buffer usage stats of parallel workers.  This
allows leader backend to accumulate buffer usage stats of all the parallel
workers.

Reported-by: Julien Rouhaud
Author: Sawada Masahiko
Reviewed-by: Dilip Kumar, Amit Kapila and Julien Rouhaud
Discussion: https://postgr.es/m/20200328151721.GB12854@nol
---
 src/backend/access/heap/vacuumlazy.c | 47 ++--
 1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c 
b/src/backend/access/heap/vacuumlazy.c
index 04b12342b8..9f9596c718 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -65,6 +65,7 @@
 #include "commands/dbcommands.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
+#include "executor/instrument.h"
 #include "miscadmin.h"
 #include "optimizer/paths.h"
 #include "pgstat.h"
@@ -137,6 +138,7 @@
 #define PARALLEL_VACUUM_KEY_SHARED 1
 #define PARALLEL_VACUUM_KEY_DEAD_TUPLES2
 #define PARALLEL_VACUUM_KEY_QUERY_TEXT 3
+#define PARALLEL_VACUUM_KEY_BUFFER_USAGE   4
 
 /*
  * Macro to check if we are in a parallel vacuum.  If true, we are in the
@@ -270,6 +272,9 @@ typedef struct LVParallelState
/* Shared information among parallel vacuum workers */
LVShared   *lvshared;
 
+   /* Points to buffer usage area in DSM */
+   BufferUsage *buffer_usage;
+
/*
 * The number of indexes that support parallel index bulk-deletion and
 * parallel index cleanup respectively.
@@ -2137,8 +2142,20 @@ lazy_parallel_vacuum_indexes(Relation *Irel, 
IndexBulkDeleteResult **stats,
parallel_vacuum_index(Irel, stats, lps->lvshared,
  vacrelstats->dead_tuples, 
nindexes, vacrelstats);
 
-   /* Wait for all vacuum workers to finish */
-   WaitForParallelWorkersToFinish(lps->pcxt);
+   /*
+* Next, accumulate buffer usage.  (This must wait for the workers to
+* finish, or we might get incomplete data.)
+*/
+   if (nworkers > 0)
+   {
+   int i;
+
+   /* Wait for all vacuum workers to finish */
+   WaitForParallelWorkersToFinish(lps->pcxt);
+
+   for (i = 0; i < lps->pcxt->nworkers_launched; i++)
+   InstrAccumParallelQuery(&lps->buffer_usage[i]);
+   }
 
/*
 * Carry the shared balance value to heap scan and disable shared 
costing
@@ -3153,6 +3170,7 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, 
LVRelStats *vacrelstats,
ParallelContext *pcxt;
LVShared   *shared;
LVDeadTuples *dead_tuples;
+   BufferUsage *buffer_usage;
bool   *can_parallel_vacuum;
longmaxtuples;
char   *sharedquery;
@@ -3236,6 +3254,17 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, 
LVRelStats *vacrelstats,
shm_toc_estimate_chunk(&pcxt->estimator, est_deadtuples);
shm_toc_estimate_keys(&pcxt->estimator, 1);
 
+   /*
+* Estimate space for BufferUsage -- PARALLEL_VACUUM_KEY_BUFFER_USAGE.
+*
+* If there are no extensions loaded that care, we could skip this.  We
+* have no way of knowing whether anyone's looking at pgBufferUsage, so 
do
+* it unconditionally.
+*/
+   shm_toc_estimate_chunk(&pcxt->estimator,
+  
mul_size(sizeof(BufferUsage), pcxt->nworkers));
+   shm_toc_estimate_keys(&pcxt->estimator, 1);
+
/* Finally, estimate PARALLEL_VACUUM_KEY_QUERY_TEXT space */
querylen = strlen(debug_query_string);
shm_toc_estimate_chunk(&pcxt->estimator, querylen + 1);
@@ -3270,6 +3299,12 @@ begin_parallel_vacuum(Oid relid, Relation *Irel, 
LVRelStats *vacrelstats,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_TUPLES, dead_tuples);
vacrelstats->dead_tuples = dead_tuples;
 
+   /* Allocate space for each worker's BufferUsage; no need to initialize 
*/
+   buffer_usage = shm_toc_allocate(pcxt->toc,
+   
mul_size(sizeof(BufferUsage), pcxt->nworkers));
+   shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, 
buffer_usage);
+   lps->buffer_usage = buffer_usage;
+
/* Store query string for workers */
  

Re: Berserk Autovacuum (let's save next Mandrill)

2020-04-01 Thread Dean Rasheed
On Tue, 31 Mar 2020 at 22:16, Tom Lane  wrote:
>
> > Dean Rasheed  writes:
> >> ...
> >> It looks to me as though the problem is that statext_store() needs to
> >> take its lock on pg_statistic_ext_data *before* searching for the
> >> stats tuple to update.
>
> > Hmm, yeah, that seems like clearly a bad idea.
>
> I pushed a fix for that

Thanks for doing that (looks like it was my mistake originally).

> but I think it must be unrelated to the
> buildfarm failures we're seeing.  For that coding to be a problem,
> it would have to run concurrently with a VACUUM FULL or CLUSTER
> on pg_statistic_ext_data (which would give all the tuples new TIDs).
> AFAICS that won't happen with the tests that are giving trouble.
>

Yeah, that makes sense. I still can't see what might be causing those
failures. The tests that were doing an ALTER COLUMN and then expecting
to see the results of a non-analysed table ought to be fixed by
0936d1b6f, but that doesn't match the buildfarm failures. Possibly
0936d1b6f will help with those anyway, but if so, it'll be annoying
not understanding why.

Regards,
Dean




Re: pgbench - add \aset to store results of a combined query

2020-04-01 Thread Michael Paquier
On Mon, Mar 30, 2020 at 03:30:58PM +0900, Michael Paquier wrote:
> Except for the addition of a test case to skip empty results when
> \aset is used, I think that we are pretty good here.

While hacking on the patch more by myself, I found that mixing tests
for \gset and \aset was rather messy.  A test for an empty result
also leads to a failure with the pgbench command, as we want to make
sure that the variable does not exist in this case using debug().  So
let's split the tests into three parts:
- the set for \gset is left alone.
- addition of a new set for the valid cases of \aset.
- addition of an invalid test for \aset (the empty set one).

Fabien, what do you think about the attached?  Perhaps we should also
have a test where we return more than 1 row for \gset?  The last point
is unrelated to this thread though.
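For readers who have not followed the feature, a tiny pgbench-script sketch of
what \aset does according to the patch below (the script and column names are
illustrative): each sub-query of a combined query (\;) has the columns of its
last row stored into variables, and an empty result is silently skipped instead
of failing the way \gset would.

    SELECT 1 AS one \; SELECT 2 AS two, 3 AS three \aset
    SELECT :one + :two + :three;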
--
Michael
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index b8864c6ae5..354231e82d 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -480,6 +480,7 @@ typedef enum MetaCommand
 	META_SHELL,	/* \shell */
 	META_SLEEP,	/* \sleep */
 	META_GSET,	/* \gset */
+	META_ASET,	/* \aset */
 	META_IF,	/* \if */
 	META_ELIF,	/* \elif */
 	META_ELSE,	/* \else */
@@ -509,9 +510,10 @@ static const char *QUERYMODE[] = {"simple", "extended", "prepared"};
  * argv			Command arguments, the first of which is the command or SQL
  *string itself.  For SQL commands, after post-processing
  *argv[0] is the same as 'lines' with variables substituted.
- * varprefix 	SQL commands terminated with \gset have this set
+ * varprefix 	SQL commands terminated with \gset or \aset have this set
  *to a non NULL value.  If nonempty, it's used to prefix the
  *variable name that receives the value.
+ * aset			do gset on all possible queries of a combined query (\;).
  * expr			Parsed expression, if needed.
  * stats		Time spent in this command.
  */
@@ -2489,6 +2491,8 @@ getMetaCommand(const char *cmd)
 		mc = META_ENDIF;
 	else if (pg_strcasecmp(cmd, "gset") == 0)
 		mc = META_GSET;
+	else if (pg_strcasecmp(cmd, "aset") == 0)
+		mc = META_ASET;
 	else
 		mc = META_NONE;
 	return mc;
@@ -2711,17 +2715,20 @@ sendCommand(CState *st, Command *command)
  * Process query response from the backend.
  *
  * If varprefix is not NULL, it's the variable name prefix where to store
- * the results of the *last* command.
+ * the results of the *last* command (META_GSET) or *all* commands (META_ASET).
  *
  * Returns true if everything is A-OK, false if any error occurs.
  */
 static bool
-readCommandResponse(CState *st, char *varprefix)
+readCommandResponse(CState *st, MetaCommand meta, char *varprefix)
 {
 	PGresult   *res;
 	PGresult   *next_res;
 	int			qrynum = 0;
 
+	Assert((meta == META_NONE && varprefix == NULL) ||
+		   ((meta == META_GSET || meta == META_ASET) && varprefix != NULL));
+
 	res = PQgetResult(st->con);
 
 	while (res != NULL)
@@ -2736,7 +2743,7 @@ readCommandResponse(CState *st, char *varprefix)
 		{
 			case PGRES_COMMAND_OK:	/* non-SELECT commands */
 			case PGRES_EMPTY_QUERY: /* may be used for testing no-op overhead */
-if (is_last && varprefix != NULL)
+if (is_last && meta == META_GSET)
 {
 	pg_log_error("client %d script %d command %d query %d: expected one row, got %d",
  st->id, st->use_file, st->command, qrynum, 0);
@@ -2745,16 +2752,24 @@ readCommandResponse(CState *st, char *varprefix)
 break;
 
 			case PGRES_TUPLES_OK:
-if (is_last && varprefix != NULL)
+if ((is_last && meta == META_GSET) || meta == META_ASET)
 {
-	if (PQntuples(res) != 1)
+	int			ntuples = PQntuples(res);
+
+	if (meta == META_GSET && ntuples != 1)
 	{
+		/* under \gset, report the error */
 		pg_log_error("client %d script %d command %d query %d: expected one row, got %d",
 	 st->id, st->use_file, st->command, qrynum, PQntuples(res));
 		goto error;
 	}
+	else if (meta == META_ASET && ntuples <= 0)
+	{
+		/* coldly skip empty result under \aset */
+		break;
+	}
 
-	/* store results into variables */
+	/* store (last/only) row into possibly prefixed variables */
 	for (int fld = 0; fld < PQnfields(res); fld++)
 	{
 		char	   *varname = PQfname(res, fld);
@@ -2763,9 +2778,9 @@ readCommandResponse(CState *st, char *varprefix)
 		if (*varprefix != '\0')
 			varname = psprintf("%s%s", varprefix, varname);
 
-		/* store result as a string */
-		if (!putVariable(st, "gset", varname,
-		 PQgetvalue(res, 0, fld)))
+		/* store last row result as a string */
+		if (!putVariable(st, meta == META_ASET ? "aset" : "gset", varname,
+		 PQgetvalue(res, ntuples - 1, fld)))
 		{
 			/* internal error */
 			pg_log_error("client %d script %d command %d query %d: error storing into variable %s",
@@ -3181,7 +3196,9 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 	return;		/* don't

Re: recovery_target_action=pause with confusing hint

2020-04-01 Thread Fujii Masao




On 2020/04/01 16:53, movead...@highgo.ca wrote:



But, sorry,,, I failed to understand the issue that you reported, yet
You mean that the first call of pg_wal_replay_pause() in the step #2
should check whether the trigger file exists or not? If so, could you
tell me why we should do that?

Sorry about my poor English.  The pg_wal_replay_pause() is executed
in step 4. I mentioned it in step 2 to explain why so much data needs to be
inserted.


BTW, right now only the startup process is allowed to call
CheckForStandbyTrigger(). So the backend process calling
pg_wal_replay_pause() and PromoteIsTriggered() is not allowed to call
CheckForStandbyTrigger(). The current logic is that the startup process
is responsible for checking the trigger file and set the flag in the shmem

That's the point: the startup process does not call CheckForStandbyTrigger() to check
the trigger file until a pg_wal_replay_pause() or PromoteIsTriggered() comes,
so the first time pg_wal_replay_pause() is called, it uses a wrong
'SharedPromoteIsTriggered' value.



if promotion is triggered. Then other processes like backend know
whether promotion is ongoing or not from the shmem. So basically
the backend doesn't need to check the trigger file itself.

If the backend is not allowed to call CheckForStandbyTrigger(), then you should
find another way to handle it.
In other words, if I add the trigger file during recovery, the startup process
does not know about it at all until after a pg_wal_replay_pause() comes.


Thanks for the explanation again! Maybe I understand your point.

As far as I read the code, in the standby mode, the startup process
periodically checks the trigger file in WaitForWALToBecomeAvailable().
No?

There can be a small delay between the creation of the trigger file
and the periodic call to CheckForStandbyTrigger() by the startup process.
If you execute pg_wal_replay_pause() during that delay, it would succeed.

But you'd like to get rid of that delay completely? In other words,
both the startup process and the backend calling pg_wal_replay_pause()
should detect the existence of the trigger file immediately after
it's created?

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters




Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2020-04-01 Thread Alexey Kondratov

On 2020-04-01 05:19, Michael Paquier wrote:

On Tue, Mar 31, 2020 at 03:48:21PM +0900, Michael Paquier wrote:

Thanks, committed 0001 after fixing the order of the headers.  One
patch left.


And committed now 0002, meaning that we are officially done.  Thanks
Alexey for your patience.



Thanks for doing that! And to all others who participated.


--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: recovery_target_action=pause with confusing hint

2020-04-01 Thread movead...@highgo.ca
>Thanks for the explanation again! Maybe I understand your point.
Great.

>As far as I read the code, in the standby mode, the startup process
>periodically checks the trigger file in WaitForWALToBecomeAvailable().
>No?
Yes it is. 

>There can be a small delay between the creation of the trigger file
>and the periodic call to CheckForStandbyTrigger() by the startup process.
>If you execute pg_wal_replay_pause() during that delay, it would succeed.
If there is a huge number of WAL segments that the standby needs to catch up on,
then it cannot be a 'small delay'.

>But you'd like to get rid of that delay completely? In other words,
>both the startup process and the backend calling pg_wal_replay_pause()
>should detect the existence of the trigger file immediately after
>it's created?
The point I want to make is that pg_wal_replay_pause() shows success but
the startup process keeps replaying WAL, which may cause confusion. So it's
better to solve the issue.





Regards,
Highgo Software (Canada/China/Pakistan) 
URL : www.highgo.ca 
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca


Re: control max length of parameter values logged

2020-04-01 Thread Alexey Bashtanov

Hi,

+If greater than zero, bind parameter values reported in non-error
+statement-logging messages are trimmed to no more than this many bytes.

Can I suggest to say:

"Limit bind parameter values reported by non-error statement-logging messages
to this many bytes". Or,

"The maximum length of bind parameter values to log with non-error
statement-logging messages".

Okay I'll rephrase.

Could you make zero a normal value and -1 the "special" value to disable
trimming ?

Setting to zero will avoid displaying parameters at all, setting to -1 will
display values in full.
I can, but then for the sake of consistency I'll have to do the same for 
log_parameter_max_length.
Then we'll end up with two ways to enable/disable parameter logging on 
error:

using the primary boolean setting and setting length to 0.
One of them will require superuser privileges, the other one won't.
Do you think it's okay? I have no objections but I'm a bit worried 
someone may find it clumsy.
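To make the trade-off concrete, a purely illustrative postgresql.conf sketch
using the GUC names from this thread (their exact semantics and special values
were still being debated at this point):

    # as currently proposed: a boolean switch for the error path plus a trim length
    log_parameters_on_error = on      # superuser-only switch for logging on error
    log_parameter_max_length = 64     # values > 0 trim logged bind parameter values

    # the alternative discussed above: fold the on/off behavior into the length GUC
    # log_parameter_max_length = -1   # log parameter values in full
    # log_parameter_max_length = 0    # do not log parameter values at all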


Best, Alex




Re: Planning counters in pg_stat_statements (using pgss_store)

2020-04-01 Thread Fujii Masao




On 2020/04/01 3:42, Julien Rouhaud wrote:

On Wed, Apr 01, 2020 at 02:43:10AM +0900, Fujii Masao wrote:



On 2020/03/31 16:33, Julien Rouhaud wrote:


v12 attached!


Thanks for updating the patch! The patch looks good to me.

I applied minor and cosmetic changes into the patch. Attached is
the updated version of the patch. Barring any objection, I'd like to
commit this version.

BTW, the minor and cosmetic changes that I applied are, for example,

- Rename pgss_planner_hook to pgss_planner for the sake of consistency.
Other hook functions in pgss don't use "_hook" in their names either.
- Make pgss_planner use PG_FINALLY() instead of PG_CATCH().
- Make PGSS_NUMKIND the last value in enum pgssStoreKind.



+1, and the PGSS_INVALID is also way better.



- Update the sample output in the document.
etc



Thanks a lot.  It all looks good to me!


Thanks for the check!

I tried to pick up the names of the authors and reviewers of this patch
from the past discussions. I'm thinking of writing the following
in the commit log. Are there any other developers who should be
credited as author or reviewer?

Author: Julien Rouhaud, Pascal Legrand, Thomas Munro, Fujii Masao
Reviewed-by: Sergei Kornilov, Tomas Vondra, Yoshikazu Imai, Haribabu Kommi, Tom 
Lane

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters




Re: control max length of parameter values logged

2020-04-01 Thread Justin Pryzby
On Wed, Apr 01, 2020 at 10:10:55AM +0100, Alexey Bashtanov wrote:
> Hi,
> > > +If greater than zero, bind parameter values reported in non-error
> > > +statement-logging messages are trimmed to no more than this many 
> > > bytes.
> > Can I suggest to say:
> > 
> > "Limit bind parameter values reported by non-error statement-logging 
> > messages
> > to this many bytes". Or,
> > 
> > "The maximum length of bind parameter values to log with non-error
> > statement-logging messages".
> Okay I'll rephrase.
> > Could you make zero a normal value and -1 the "special" value to disable
> > trimming ?
> > 
> > Setting to zero will avoid displaying parameters at all, setting to -1 will
> > display values in full.
> I can, but then for the sake of consistency I'll have to do the same for
> log_parameter_max_length.
> Then we'll end up with two ways to enable/disable parameter logging on
> error:
> using the primary boolean setting and setting length to 0.
> One of them will require superuser privileges, the other one won't.

I guess you're referring to log_parameters_on_error.
Does it have to be SUSET ?

Or maybe log_parameters_on_error doesn't need to exist at all, and setting
log_parameter_max_length=0 can be used to disable parameter logging.

I showed up late to this thread, so let's see what others think.

-- 
Justin




Re: recovery_target_action=pause with confusing hint

2020-04-01 Thread Fujii Masao




On 2020/04/01 17:58, movead...@highgo.ca wrote:

Thanks for the explanation again! Maybe I understand your point.

Great.


As far as I read the code, in the standby mode, the startup process
periodically checks the trigger file in WaitForWALToBecomeAvailable().
No?

Yes it is.


There can be a small delay between the creation of the trigger file
and the periodic call to CheckForStandbyTrigger() by the startup process.
If you execute pg_wal_replay_pause() during that delay, it would succeed.

If there is a huge number of WAL segments that the standby needs to catch up on,
then it cannot be a 'small delay'.


Yeah, that's true.


But you'd like to get rid of that delay completely? In other words,
both the startup process and the backend calling pg_wal_replay_pause()
should detect the existence of the trigger file immediately after
it's created?

The point I want to make is that pg_wal_replay_pause() shows success but
the startup process keeps replaying WAL, which may cause confusion. So it's
better to solve the issue.


This happens because the startup process detects the trigger file
after pg_wal_replay_pause() succeeds, and then make the recovery
get out of the paused state. It might be problematic to end the paused
state silently? So, to make the situation less confusing, what about
emitting a log message when ending the paused state because of
the promotion?

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters




Re: recovery_target_action=pause with confusing hint

2020-04-01 Thread movead...@highgo.ca

>This happens because the startup process detects the trigger file
>after pg_wal_replay_pause() succeeds, and then make the recovery
>get out of the paused state. 
Yes that is.

>It might be problematic to end the paused
>state silently? So, to make the situation less confusing, what about
>emitting a log message when ending the paused state because of
>the promotion?
But where should it be emitted? I don't think emitting it to the log file is
good enough, because the user will not check it every time. BTW, why can't
CheckForStandbyTrigger() be called in a backend?



Regards,
Highgo Software (Canada/China/Pakistan) 
URL : www.highgo.ca 
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca


Re: recovery_target_action=pause with confusing hint

2020-04-01 Thread Fujii Masao




On 2020/04/01 18:56, movead...@highgo.ca wrote:


 >This happens because the startup process detects the trigger file

after pg_wal_replay_pause() succeeds, and then make the recovery
get out of the paused state.

Yes that is.

 >It might be problematic to end the paused

state silently? So, to make the situation less confusing, what about
emitting a log message when ending the paused state because of
the promotion?

But where should it be emitted? I don't think emitting it to the log file is
good enough, because the user will not check it every time.


Yeah, I'm thinking to emit the message to log file, like the startup process
does in other places :)


BTW, why can't
CheckForStandbyTrigger() be called in a backend?


Because

1) promote_signaled flag that IsPromoteSignaled() sees is set
 only in the startup process
2) the trigger file can be removed or promote_trigger_file can be
 changed after the backend detects it but before the startup process
 does. That is, the result of the trigger file detection can be different
 between the processes.

Of course we can change CheckForStandbyTrigger() so that it can
be called by backends, but I'm not sure if it's worth doing that.

Or another idea to reduce the delay between the request for
the promotion and the detection of it is to make the startup process
call CheckForStandbyTrigger() more frequently. Calling it for every
replayed WAL record would be overkill and decrease recovery
performance. Calling it for every WAL file might be OK. I'm not sure
if this is really an improvement or not, though...

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters




Re: WAL usage calculation patch

2020-04-01 Thread Amit Kapila
On Wed, Apr 1, 2020 at 1:32 PM Julien Rouhaud  wrote:
>
> So here's a v9, rebased on top of the latest versions of Sawada-san's bug 
> fixes
> (Amit's v6 for vacuum and Sawada-san's v2 for create index), with all
> previously mentioned changes.
>

Few other comments:
v9-0003-Add-infrastructure-to-track-WAL-usage
1.
 static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
-
+static void WalUsageAdd(WalUsage *dst, WalUsage *add);

Looks like a spurious line removal

2.
+ /* Report a full page imsage constructed for the WAL record */
+ *num_fpw += 1;

Typo. /imsage/image

3. Doing some testing with and without parallelism to ensure WAL usage
data is correct would be great and if possible, share the results?

v9-0005-Keep-track-of-WAL-usage-in-pg_stat_statements
4.
+-- SELECT usage data, check WAL usage is reported, wal_records equal
rows count for INSERT/UPDATE/DELETE
+SELECT query, calls, rows,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated,
+wal_records = rows as wal_records_as_rows
+FROM pg_stat_statements ORDER BY query COLLATE "C";
+                               query                              | calls | rows | wal_bytes_generated | wal_records_generated | wal_records_as_rows
+------------------------------------------------------------------+-------+------+---------------------+-----------------------+---------------------
+ DELETE FROM pgss_test WHERE a > $1                               |     1 |    1 | t                   | t                     | t
+ DROP TABLE pgss_test                                             |     1 |    0 | t                   | t                     | f
+ INSERT INTO pgss_test (a, b) VALUES ($1, $2), ($3, $4), ($5, $6) |     1 |    3 | t                   | t                     | t
+ INSERT INTO pgss_test VALUES(generate_series($1, $2), $3)        |     1 |   10 | t                   | t                     | t
+ SELECT * FROM pgss_test ORDER BY a                               |     1 |   12 | f                   | f                     | f
+ SELECT * FROM pgss_test WHERE a > $1 ORDER BY a                  |     2 |    4 | f                   | f                     | f
+ SELECT * FROM pgss_test WHERE a IN ($1, $2, $3, $4, $5)          |     1 |    8 | f                   | f                     | f
+ SELECT pg_stat_statements_reset()                                |     1 |    1 | f                   | f                     | f
+ SET pg_stat_statements.track_utility = FALSE                     |     1 |    0 | f                   | f                     | t
+ UPDATE pgss_test SET b = $1 WHERE a = $2                         |     6 |    6 | t                   | t                     | t
+ UPDATE pgss_test SET b = $1 WHERE a > $2                         |     1 |    3 | t                   | t                     | t
+(11 rows)
+

I am not sure if the above tests make much sense, as they are just
testing whether WAL is generated for these commands.  I understand it
is not easy to make these tests reliable, but in that case we can
think of some simple tests.  It seems to me that the difficulty is due
to full_page_writes, as that depends on the checkpoint.  Can we make
full_page_writes = off for these tests and check some simple
Insert/Update/Delete cases?  Alternatively, if you can present the
reason why such tests are unstable or tricky to write, then we can simply
get rid of them, because I don't see tests for BufferUsage.  Let's
not write tests for the sake of writing them unless they can detect bugs
in the future or meaningfully cover the new code added.
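A sketch of the kind of simpler, more deterministic check being suggested here
(illustrative only; it assumes full_page_writes has been disabled for the test
run, which has to happen in the server configuration since it is not a
per-session setting, and it relies on the wal_records/wal_bytes columns added
by this patch set):

    SELECT pg_stat_statements_reset();
    CREATE TABLE pgss_wal_test (a int);
    INSERT INTO pgss_wal_test SELECT generate_series(1, 100);
    UPDATE pgss_wal_test SET a = a + 1 WHERE a <= 10;
    SELECT query,
           wal_records > 0 AS generated_wal,
           wal_records = rows AS one_record_per_row
      FROM pg_stat_statements
     WHERE query LIKE 'INSERT%' OR query LIKE 'UPDATE%'
     ORDER BY query COLLATE "C";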

5.
-SELECT query, calls, rows FROM pg_stat_statements ORDER BY query COLLATE "C";
-   query   | calls | rows
+---+--
- SELECT $1::TEXT   | 1 |1
- SELECT PLUS_ONE($1)   | 2 |2
- SELECT PLUS_TWO($1)   | 2 |2
- SELECT pg_stat_statements_reset() | 1 |1
+SELECT query, calls, rows, wal_bytes, wal_records FROM
pg_stat_statements ORDER BY query COLLATE "C";
+   query   | calls | rows | wal_bytes | wal_records
+---+---+--+---+-
+ SELECT $1::TEXT   | 1 |1 | 0 |   0
+ SELECT PLUS_ONE($1)   | 2 |2 | 0 |   0
+ SELECT PLUS_TWO($1)   | 2 |2 | 0 |   0
+ SELECT pg_stat_statements_reset() | 1 |1 | 0 |   0
 (4 rows)

Again, I am not sure if these modifications make much sense?

6.
 static void pgss_shmem_startup(void);
@@ -313,6 +318,7 @@ static void pgss_store(const char *query, uint64 queryId,
 int query_location, int query_len,
 double total_time, uint64 rows,
 const BufferUsage *bufusage,
+const WalUsage* walusage,
 pgssJumbleState *jstate);

The alignment for walusage doesn't seem to be correct. Running
pgindent will fix this.

Re: WAL usage calculation patch

2020-04-01 Thread Amit Kapila
On Wed, Apr 1, 2020 at 4:29 PM Amit Kapila  wrote:
>
> v9-0005-Keep-track-of-WAL-usage-in-pg_stat_statements
>

One more comment related to this patch.
+
+ snprintf(buf, sizeof buf, UINT64_FORMAT, tmp.wal_bytes);
+
+ /* Convert to numeric. */
+ wal_bytes = DirectFunctionCall3(numeric_in,
+ CStringGetDatum(buf),
+ ObjectIdGetDatum(0),
+ Int32GetDatum(-1));
+
+ values[i++] = wal_bytes;

I see that other places that display uint64 values use BIGINT datatype
in SQL, so why can't we do the same here?  See the usage of queryid in
pg_stat_statements or internal_pages, *_pages exposed via
pgstatindex.c.
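In code form, the alternative being suggested would be roughly the following
(a sketch only; it assumes that exposing the value as a signed 64-bit bigint is
acceptable, similar to how queryid is already exposed):

    /* instead of the numeric_in conversion above */
    values[i++] = Int64GetDatum((int64) tmp.wal_bytes);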

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: recovery_target_action=pause with confusing hint

2020-04-01 Thread Fujii Masao



On 2020/04/01 19:37, Fujii Masao wrote:



On 2020/04/01 18:56, movead...@highgo.ca wrote:


 >This happens because the startup process detects the trigger file

after pg_wal_replay_pause() succeeds, and then make the recovery
get out of the paused state.

Yes that is.

 >It might be problematic to end the paused

state silently? So, to make the situation less confusing, what about
emitting a log message when ending the paused state because of
the promotion?

But where should it be emitted? I don't think emitting it to the log file is
good enough, because the user will not check it every time.


Yeah, I'm thinking to emit the message to log file, like the startup process
does in other places :)


So I'd like to propose the attached patch. The patch changes the message
logged when a promotion is requested, based on whether the recovery is
in paused state or not.

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters
diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index 977d448f50..c7255ca24c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12472,7 +12472,12 @@ CheckForStandbyTrigger(void)
fast_promote = false;
}
 
-   ereport(LOG, (errmsg("received promote request")));
+   if (RecoveryIsPaused())
+   ereport(LOG,
+   (errmsg("received promote request while 
recovery is paused"),
+errdetail("The paused state ends and 
the promotion continues.")));
+   else
+   ereport(LOG, (errmsg("received promote request")));
 
ResetPromoteSignaled();
SetPromoteIsTriggered();
@@ -12484,8 +12489,14 @@ CheckForStandbyTrigger(void)
 
if (stat(PromoteTriggerFile, &stat_buf) == 0)
{
-   ereport(LOG,
-   (errmsg("promote trigger file found: %s", 
PromoteTriggerFile)));
+   if (RecoveryIsPaused())
+   ereport(LOG,
+   (errmsg("promote trigger file found 
while recovery is paused: %s",
+   PromoteTriggerFile),
+errdetail("The paused state ends and 
the promotion continues.")));
+   else
+   ereport(LOG,
+   (errmsg("promote trigger file found: 
%s", PromoteTriggerFile)));
unlink(PromoteTriggerFile);
SetPromoteIsTriggered();
fast_promote = true;


Re: [PATCH] Opclass parameters

2020-04-01 Thread Alexander Korotkov
On Tue, Mar 31, 2020 at 12:15 PM Alexander Korotkov
 wrote:
> What is XXX supposed to be?
>
> The rest of patch looks good to me.

I've pushed the patch excepts XXX.  Thank you.
You're welcome to clarify XXX and/or do additional corrections.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-04-01 Thread Justin Pryzby
On Wed, Apr 01, 2020 at 03:03:34PM +0900, Michael Paquier wrote:
> On Tue, Mar 31, 2020 at 01:56:07PM +0300, Alexey Kondratov wrote:
> > I am fine with allowing REINDEX (CONCURRENTLY), but then we will have to
> > support both syntaxes as we already do for VACUUM. Anyway, if we agree to
> > add parenthesized options to REINDEX/CLUSTER, then it should be done as a
> > separated patch before the current patch set.
> 
> I would honestly prefer that from now on we only add the parenthesized
> version of an option if something new is added to such utility
> commands (vacuum, analyze, reindex, etc.) as that's much more
> extensible from the point of view of the parser.  And this, even if
> you need to rework things a bit more around
> reindex_option_elem for the tablespace option proposed here.

Thanks for your input.

I'd already converted VACUUM and REINDEX to use a parenthesized TABLESPACE
option, and just converted CLUSTER to take an option list and do the same.

Alexey suggested that those changes should be done as a separate patch, with
the tablespace options built on top.  Which makes sense.  I had quite some fun
rebasing these with patches in that order.

However, I've kept my changes separate from Alexey's patch, to make it easier
for him to integrate.  So there's "fix!" commits which are not logically
separate and should be read as if they're merged with their parent commits.
That makes the patchset look kind of dirty.  So I'm first going to send the
"before rebase" patchset.  There's a few fixme items, but I think this is in
pretty good shape, and I'd appreciate review.

I'll follow up later with the "after rebase" patchset.  Maybe Alexey will want
to integrate that.

I claimed it would be easy, so I also implemented the (INDEX_TABLESPACE ..) option:

template1=# VACUUM (TABLESPACE pg_default, INDEX_TABLESPACE ts, FULL) t;
template1=# CLUSTER (TABLESPACE pg_default, INDEX_TABLESPACE ts) t;

-- 
Justin
>From f108192a89a663214d46b84d86c8ce4fdce44663 Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Mon, 23 Mar 2020 21:10:29 +0300
Subject: [PATCH v15 1/6] Allow REINDEX to change tablespace

REINDEX already does full relation rewrite, this patch adds a
possibility to specify a new tablespace where new relfilenode
will be created.
---
 doc/src/sgml/ref/reindex.sgml |  24 +++-
 src/backend/catalog/index.c   |  89 --
 src/backend/commands/cluster.c|   2 +-
 src/backend/commands/indexcmds.c  | 139 --
 src/backend/commands/tablecmds.c  |   2 +-
 src/backend/nodes/copyfuncs.c |   1 +
 src/backend/nodes/equalfuncs.c|   1 +
 src/backend/parser/gram.y |  14 ++-
 src/backend/tcop/utility.c|   6 +-
 src/bin/psql/tab-complete.c   |   6 +
 src/include/catalog/index.h   |   7 +-
 src/include/commands/defrem.h |   6 +-
 src/include/nodes/parsenodes.h|   1 +
 src/test/regress/input/tablespace.source  |  39 ++
 src/test/regress/output/tablespace.source |  50 
 15 files changed, 354 insertions(+), 33 deletions(-)

diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index c54a7c420d..b97bd673a5 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -21,7 +21,7 @@ PostgreSQL documentation
 
  
 
-REINDEX [ ( option [, ...] ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } [ CONCURRENTLY ] name
+REINDEX [ ( option [, ...] ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } [ CONCURRENTLY ] name [ TABLESPACE new_tablespace ]
 
 where option can be one of:
 
@@ -174,6 +174,28 @@ REINDEX [ ( option [, ...] ) ] { IN
 

 
+   
+TABLESPACE
+
+ 
+  This specifies a tablespace, where all rebuilt indexes will be created.
+  Cannot be used with "mapped" relations. If SCHEMA,
+  DATABASE or SYSTEM is specified, then
+  all unsuitable relations will be skipped and a single WARNING
+  will be generated.
+ 
+
+   
+
+   
+new_tablespace
+
+ 
+  The name of a specific tablespace to store rebuilt indexes.
+ 
+
+   
+

 VERBOSE
 
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index bd7ec923e9..5267ecfffb 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1247,9 +1247,13 @@ index_create(Relation heapRelation,
  * Create concurrently an index based on the definition of the one provided by
  * caller.  The index is inserted into catalogs and needs to be built later
  * on.  This is called during concurrent reindex processing.
+ *
+ * "tablespaceOid" is the new tablespace to use for this index.  If
+ * InvalidOid, use the tablespace in-use instead.
  */
 Oid
-index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId, const char *newName)
+index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
+			   Oid tablespaceOid, const char *newN

Re: [PATCH] Redudant initilization

2020-04-01 Thread Ranier Vilela
Hi,
New patch with your suggestions.

best regards,
Ranier Vilela


v2_redundant_initialization.patch
Description: Binary data


Re: [PATCH] Opclass parameters

2020-04-01 Thread Justin Pryzby
On Wed, Apr 01, 2020 at 02:53:41PM +0300, Alexander Korotkov wrote:
> On Tue, Mar 31, 2020 at 12:15 PM Alexander Korotkov
>  wrote:
> > What is XXX supposed to be?
> >
> > The rest of patch looks good to me.
> 
> I've pushed the patch excepts XXX.  Thank you.
> You're welcome to clarify XXX and/or do additional corrections.

XXX was a reminder to myself to check the accuracy of that description, rather
than just its language.  As you can see, I didn't manage to do that, but the
language is at least consistent.

Thanks for pushing the update.

-- 
Justin




Re: [PATCH] Opclass parameters

2020-04-01 Thread Alexander Korotkov
On Wed, Apr 1, 2020 at 2:59 PM Justin Pryzby  wrote:
> On Wed, Apr 01, 2020 at 02:53:41PM +0300, Alexander Korotkov wrote:
> > On Tue, Mar 31, 2020 at 12:15 PM Alexander Korotkov
> >  wrote:
> > > What is XXX supposed to be?
> > >
> > > The rest of patch looks good to me.
> >
> > I've pushed the patch excepts XXX.  Thank you.
> > You're welcome to clarify XXX and/or do additional corrections.
>
> XXX was a reminder to myself to check the accuracy of that description, rather
> than just its language.  As you can see, I didn't manage to do that, but the
> language is at least consistent.
>
> Thanks for pushing the update.

Thank you!

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: pg_stat_statements issue with parallel maintenance (Was Re: WAL usage calculation patch)

2020-04-01 Thread Dilip Kumar
On Wed, Apr 1, 2020 at 12:01 PM Amit Kapila  wrote:
>
> On Wed, Apr 1, 2020 at 8:51 AM Dilip Kumar  wrote:
> >
> > > Agreed. I've attached the updated patch.
> > >
> > > Thank you for testing, Dilip!
> >
> > Thanks!  One hunk is failing on the latest head.  And, I have rebased
> > the patch for my testing so posting the same.  I have done some more
> > testing to test multi-pass vacuum.
> >
>
> The patch looks good to me.  I have done a few minor modifications (a)
> moved the declaration of variable closer to where it is used, (b)
> changed a comment, (c) ran pgindent.  I have also done some additional
> testing with more number of indexes and found that vacuum and parallel
> vacuum used the same number of total_read_blks and that is what is
> expected here.
>
> Let me know what you think of the attached?

The patch looks fine to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [Proposal] Global temporary tables

2020-04-01 Thread Prabhat Sahu
Hi Wenjing,
I think we need to change the below error message.

postgres=# create global temporary table gtt(c1 int) on commit preserve
rows;
CREATE TABLE

postgres=# create materialized view mvw as select * from gtt;
ERROR: materialized views must not use global temporary tables or views

Anyway, we are not allowed to create a "global temporary view",
so the above ERROR message should change (i.e., " or views" needs to be
removed from the error message) to something like:
"ERROR: materialized views must not use global temporary tables"

-- 

With Regards,
Prabhat Kumar Sahu
EnterpriseDB: http://www.enterprisedb.com


Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-04-01 Thread Justin Pryzby
On Wed, Apr 01, 2020 at 06:57:18AM -0500, Justin Pryzby wrote:
> Alexey suggested that those changes should be done as a separate patch, with
> the tablespace options built on top.  Which makes sense.  I had quite some fun
> rebasing these with patches in that order.
> 
> However, I've kept my changes separate from Alexey's patch, to make it easier
> for him to integrate.  So there's "fix!" commits which are not logically
> separate and should be read as if they're merged with their parent commits.
> That makes the patchset look kind of dirty.  So I'm first going to send the
> "before rebase" patchset.  There's a few fixme items, but I think this is in
> pretty good shape, and I'd appreciate review.
> 
> I'll follow up later with the "after rebase" patchset.  

Attached.  As I said, the v15 patches might be easier to review, even though
v16 is closer to what's desirable to merge.

> Maybe Alexey will want to integrate that.

Or maybe you'd want me to squish my changes into yours and resend after any
review comments ?

-- 
Justin
>From cbb8e856e0913f1a94d28d7164adabad1362d58c Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Fri, 27 Mar 2020 17:50:46 -0500
Subject: [PATCH v16 1/7] Change REINDEX/CLUSTER to accept an option list..

..like reindex (CONCURRENTLY)

Change docs in the style of VACUUM.  See also: 52dcfda48778d16683c64ca4372299a099a15b96
---
 doc/src/sgml/ref/cluster.sgml| 13 
 doc/src/sgml/ref/reindex.sgml| 26 +++
 src/backend/commands/cluster.c   | 21 -
 src/backend/commands/indexcmds.c | 41 +---
 src/backend/nodes/copyfuncs.c|  2 ++
 src/backend/nodes/equalfuncs.c   |  2 ++
 src/backend/parser/gram.y| 54 +++-
 src/backend/tcop/utility.c   | 33 ---
 src/include/commands/cluster.h   |  3 +-
 src/include/commands/defrem.h|  7 ++---
 src/include/nodes/parsenodes.h   |  4 ++-
 11 files changed, 149 insertions(+), 57 deletions(-)

diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index 4da60d8d56..bd0682ddfd 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -82,31 +82,32 @@ CLUSTER [VERBOSE]
 
   

-table_name
+VERBOSE
 
  
-  The name (possibly schema-qualified) of a table.
+  Prints a progress report as each table is clustered.
  
 

 

-index_name
+table_name
 
  
-  The name of an index.
+  The name (possibly schema-qualified) of a table.
  
 

 

-VERBOSE
+index_name
 
  
-  Prints a progress report as each table is clustered.
+  The name of an index.
  
 

+
   
  
 
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index c54a7c420d..200526b6f4 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -141,19 +141,6 @@ REINDEX [ ( option [, ...] ) ] { IN
 

 
-   
-name
-
- 
-  The name of the specific index, table, or database to be
-  reindexed.  Index and table names can be schema-qualified.
-  Presently, REINDEX DATABASE and REINDEX SYSTEM
-  can only reindex the current database, so their parameter must match
-  the current database's name.
- 
-
-   
-

 CONCURRENTLY
 
@@ -182,6 +169,19 @@ REINDEX [ ( option [, ...] ) ] { IN
  
 

+
+   
+name
+
+ 
+  The name of the specific index, table, or database to be
+  reindexed.  Index and table names can be schema-qualified.
+  Presently, REINDEX DATABASE and REINDEX SYSTEM
+  can only reindex the current database, so their parameter must match
+  the current database's name.
+ 
+
+   
   
  
 
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index fc1cea0236..6c3fdd3874 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -35,6 +35,7 @@
 #include "catalog/pg_am.h"
 #include "catalog/toasting.h"
 #include "commands/cluster.h"
+#include "commands/defrem.h"
 #include "commands/progress.h"
 #include "commands/tablecmds.h"
 #include "commands/vacuum.h"
@@ -99,8 +100,26 @@ static List *get_tables_to_cluster(MemoryContext cluster_context);
  *---
  */
 void
-cluster(ClusterStmt *stmt, bool isTopLevel)
+cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
 {
+	ListCell *lc;
+
+	/* Parse list of generic parameter not handled by the parser */
+	foreach(lc, stmt->params)
+	{
+		DefElem	*opt = (DefElem *) lfirst(lc);
+
+		if (strcmp(opt->defname, "verbose") == 0)
+			stmt->options |= CLUOPT_VERBOSE;
+			// XXX: handle boolean opt: VERBOSE off
+		else
+			ereport(ERROR,
+	(errcode(ERRCODE_SYNTAX_ERROR),
+	 errmsg("unrecognized CLUSTER option \"%s\"",
+		 opt->defname),
+	 parser_errposition(pstate, opt->location)));
+	}
+
 	if (stmt->relation != NU

Re: Some problems of recovery conflict wait events

2020-04-01 Thread Fujii Masao




On 2020/03/30 20:10, Masahiko Sawada wrote:

On Fri, 27 Mar 2020 at 17:54, Fujii Masao  wrote:




On 2020/03/04 14:31, Masahiko Sawada wrote:

On Wed, 4 Mar 2020 at 13:48, Fujii Masao  wrote:




On 2020/03/04 13:27, Michael Paquier wrote:

On Wed, Mar 04, 2020 at 01:13:19PM +0900, Masahiko Sawada wrote:

Yeah, so 0001 patch sets existing wait events to recovery conflict
resolution. For instance, it sets (PG_WAIT_LOCK | LOCKTAG_TRANSACTION)
to the recovery conflict on a snapshot. 0003 patch improves these wait
events by adding the new type of wait event such as
WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT. Therefore 0001 (and 0002) patch
is the fix for existing versions and 0003 patch is an improvement for
only PG13. Did you mean even 0001 patch doesn't fit for back-patching?


Yes, it looks like an improvement rather than a bug fix.



Okay, understand.


I got my eyes on this patch set.  The full patch set is in my opinion
just a set of improvements, and not bug fixes, so I would refrain from
back-backpatching.


I think that the issue (i.e., "waiting" is reported twice needlessly
in PS display) that 0002 patch tries to fix is a bug. So it should be
fixed even in the back branches.


So we need only two patches: one fixes the process title issue and the other
improves the wait events. I've attached updated patches.


I started reading 
v2-0002-Improve-wait-events-for-recovery-conflict-resolut.patch.

-   ProcWaitForSignal(PG_WAIT_BUFFER_PIN);
+   ProcWaitForSignal(WAIT_EVENT_RECOVERY_CONFLICT_BUFFER_PIN);

Currently the wait event indicating the wait for buffer pin has already
been reported. But the above change in the patch changes the name of
wait event for buffer pin only in the startup process. Is this really useful?
Isn't the existing wait event for buffer pin enough?

-   /* Wait to be signaled by the release of the Relation Lock */
-   ProcWaitForSignal(PG_WAIT_LOCK | locktag.locktag_type);
+   /* Wait to be signaled by the release of the Relation Lock */
+   ProcWaitForSignal(WAIT_EVENT_RECOVERY_CONFLICT_LOCK);

Same as above. Isn't the existing wait event enough?


Yeah, we can use the existing wait events for buffer pin and lock.



-   /*
-* Progressively increase the sleep times, but not to more than 1s, 
since
-* pg_usleep isn't interruptible on some platforms.
-*/
-   standbyWait_us *= 2;
-   if (standbyWait_us > 100)
-   standbyWait_us = 100;
+   WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_TIMEOUT,
+ STANDBY_WAIT_MS,
+ wait_event_info);
+   ResetLatch(MyLatch);

ResetLatch() should be called before WaitLatch()?


Fixed.



Could you tell me why you dropped the "increase-sleep-times" logic?


I thought we could remove it because WaitLatch is interruptible, but my
observation was not correct. The waiting startup process is not
necessarily woken up by a signal. I think it's still better not to wait
more than 1 sec even if it's an interruptible wait.
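
For reference, the "increase-sleep-times" logic under discussion is just a
capped exponential backoff around the sleep call. A minimal standalone sketch
of that shape (names and the stub condition are illustrative, not the actual
standby.c code):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define INITIAL_WAIT_US 1000      /* start at 1 ms */
#define MAX_WAIT_US     1000000   /* never sleep longer than 1 s at a time */

/* stand-in for pg_usleep(): sleep for the given number of microseconds */
static void
sleep_us(uint64_t usec)
{
    struct timespec ts;

    ts.tv_sec = usec / 1000000;
    ts.tv_nsec = (usec % 1000000) * 1000;
    nanosleep(&ts, NULL);
}

/* stand-in for "has the conflicting backend gone away yet?" */
static bool
conflict_resolved(void)
{
    static int calls = 0;

    return ++calls > 5;
}

int
main(void)
{
    uint64_t wait_us = INITIAL_WAIT_US;

    while (!conflict_resolved())
    {
        sleep_us(wait_us);

        /* progressively increase the sleep time, but cap it at 1 s */
        wait_us *= 2;
        if (wait_us > MAX_WAIT_US)
            wait_us = MAX_WAIT_US;
    }
    printf("conflict resolved after backoff\n");
    return 0;
}

Capping each individual sleep at about one second is what keeps the process
reasonably responsive even when the wait is not interruptible.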


So we don't need to use WaitLatch() there, i.e., it's ok to keep using
pg_usleep()?


Attached patch fixes the above and introduces only two wait events of
conflict resolution: snapshot and tablespace.


Many thanks for updating the patch!

-   /* Wait to be signaled by the release of the Relation Lock */
-   ProcWaitForSignal(PG_WAIT_LOCK | locktag.locktag_type);
+   /* Wait to be signaled by the release of the Relation Lock */
+   ProcWaitForSignal(PG_WAIT_LOCK | locktag.locktag_type);
+   }

Is this change really valid? What happens if the latch is set during
ResolveRecoveryConflictWithVirtualXIDs()?
ResolveRecoveryConflictWithVirtualXIDs() can return after the latch
is set but before WaitLatch() in WaitExceedsMaxStandbyDelay() is reached.

+   default:
+   event_name = "unknown wait event";
+   break;

It seems this default case should be removed. Please see the other
pgstat_get_wait_xxx() functions; none of them have such code.


I also removed the wait
event for conflict resolution on a database, since it's unlikely to be
user-visible or a long sleep, as we discussed before.


Is it worth defining a new wait event type RecoveryConflict only for
those two events? ISTM that it's ok to use the IPC event type here.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: backend type in log_line_prefix?

2020-04-01 Thread Peter Eisentraut

On 2020-04-01 03:55, Bruce Momjian wrote:

Agreed.  I ended up moving "wal" as a separate word, since it looks
cleaner;  patch attached.  Tools that look for the backend type in
pg_stat_activity would need to be adjusted;  it would be an
incompatibility.  Maybe changing it would cause too much disruption.


Yeah, it's probably not worth the change for that reason.  There is no 
confusion what the "archiver" is.  Also, we have archive_mode, 
archive_command, etc. without a wal_ prefix.  Let's leave it as is.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: color by default

2020-04-01 Thread Peter Eisentraut

On 2020-03-30 10:03, Michael Paquier wrote:

On Sun, Mar 29, 2020 at 02:55:37PM +0200, Juan José Santamaría Flecha wrote:

Add it to the tests done when PG_COLOR is "auto".


FWIW, I am not sure that it is a good idea to stick into the code
knowledge inherent to TERM.  That would likely rot depending on how
terminals evolve in the future, and it is easy to test if a terminal
supports color or not but just switching PG_COLOR in a given
environment and look at the error message produced by anything
able to support coloring.


There could be some value in this, I think.  Other systems also do this 
in some variant.  However, it's unclear to me to what extent this is 
legacy behavior or driven by current needs.  I'd be willing to refine 
this, but it should be based on some actual needs.  What terminals (or 
terminal-like things) don't support color, and how do we detect them?
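
For comparison, the heuristic most tools use for "auto" is roughly: honor an
explicit always/never override, otherwise emit color only when the stream is a
terminal and TERM is set to something other than "dumb". A small standalone
sketch of that heuristic; it is purely illustrative and not a claim about what
psql currently does:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static bool
should_colorize(FILE *stream)
{
    const char *pg_color = getenv("PG_COLOR");
    const char *term = getenv("TERM");

    if (pg_color != NULL)
    {
        if (strcmp(pg_color, "always") == 0)
            return true;
        if (strcmp(pg_color, "never") == 0)
            return false;
        /* anything else, including "auto", falls through to the heuristic */
    }

    if (!isatty(fileno(stream)))
        return false;
    if (term == NULL || strcmp(term, "dumb") == 0)
        return false;
    return true;
}

int
main(void)
{
    printf("colorize stderr: %s\n", should_colorize(stderr) ? "yes" : "no");
    return 0;
}

The open question above is essentially whether the TERM/"dumb" part of such a
heuristic carries its weight, or whether isatty() plus the PG_COLOR override
is already enough in practice.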


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WAL usage calculation patch

2020-04-01 Thread Dilip Kumar
On Wed, Apr 1, 2020 at 5:01 PM Amit Kapila  wrote:
>
> On Wed, Apr 1, 2020 at 4:29 PM Amit Kapila  wrote:
> >
> > v9-0005-Keep-track-of-WAL-usage-in-pg_stat_statements
> >
>
> One more comment related to this patch.
> +
> + snprintf(buf, sizeof buf, UINT64_FORMAT, tmp.wal_bytes);
> +
> + /* Convert to numeric. */
> + wal_bytes = DirectFunctionCall3(numeric_in,
> + CStringGetDatum(buf),
> + ObjectIdGetDatum(0),
> + Int32GetDatum(-1));
> +
> + values[i++] = wal_bytes;
>
> I see that other places that display uint64 values use BIGINT datatype
> in SQL, so why can't we do the same here?  See the usage of queryid in
> pg_stat_statements or internal_pages, *_pages exposed via
> pgstatindex.c.
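
The trade-off behind that question, schematically: wal_bytes is an unsigned
64-bit counter, so exposing it as bigint is simpler but would show negative
values once the counter passes 2^63-1 (the same caveat queryid already has),
which is presumably why the patch routes it through numeric. A backend-style
sketch of the two options, not the patch itself, with error handling elided:

    /*
     * Option A: expose as bigint -- cheap, but values above 2^63-1 would
     * appear negative on the SQL level.
     */
    values[i++] = Int64GetDatumFast((int64) tmp.wal_bytes);

    /*
     * Option B: print to text and convert to numeric -- lossless for the
     * full uint64 range.
     */
    {
        char        buf[32];

        snprintf(buf, sizeof(buf), UINT64_FORMAT, tmp.wal_bytes);
        values[i++] = DirectFunctionCall3(numeric_in,
                                          CStringGetDatum(buf),
                                          ObjectIdGetDatum(InvalidOid),
                                          Int32GetDatum(-1));
    }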

I have reviewed 0003 and 0004; I have a few comments.
v9-0003-Add-infrastructure-to-track-WAL-usage

1.
  /* Points to buffer usage area in DSM */
  BufferUsage *buffer_usage;
+ /* Points to WAL usage area in DSM */
+ WalUsage   *wal_usage;

Better to give one blank line between the previous statement/variable
declaration and the next comment line.

  /* Points to buffer usage area in DSM */
  BufferUsage *buffer_usage;
-Empty line here
+ /* Points to WAL usage area in DSM */
+ WalUsage   *wal_usage;

2.
@@ -2154,7 +2157,7 @@ lazy_parallel_vacuum_indexes(Relation *Irel,
IndexBulkDeleteResult **stats,
  WaitForParallelWorkersToFinish(lps->pcxt);

  for (i = 0; i < lps->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&lps->buffer_usage[i]);
+ InstrAccumParallelQuery(&lps->buffer_usage[i], &lps->wal_usage[i]);
  }

The existing comment above this loop mentions only the buffer usage,
not the WAL usage, so I guess we need to change that.
" /*
* Next, accumulate buffer usage.  (This must wait for the workers to
* finish, or we might get incomplete data.)
*/"


v9-0004-Add-option-to-report-WAL-usage-in-EXPLAIN-and-aut

3.
+ if (usage->wal_num_fpw > 0)
+ appendStringInfo(es->str, " full page records=%ld",
+usage->wal_num_fpw);
+ if (usage->wal_bytes > 0)
+ appendStringInfo(es->str, " bytes=" UINT64_FORMAT,
+usage->wal_bytes);

Shall we change to 'full page writes' or 'full page images' instead of
'full page records'?

Apart from this, I have some testing to see the wal_usage with the
parallel vacuum and the results look fine.

postgres[104248]=# CREATE TABLE test (a int, b int);
CREATE TABLE
postgres[104248]=# INSERT INTO test SELECT i, i FROM
GENERATE_SERIES(1,200) as i;
INSERT 0 200
postgres[104248]=# CREATE INDEX idx1 on test(a);
CREATE INDEX
postgres[104248]=# VACUUM (PARALLEL 1) test;
VACUUM
postgres[104248]=# select query , wal_bytes, wal_records, wal_num_fpw
from pg_stat_statements where query like 'VACUUM%';
  query   | wal_bytes | wal_records | wal_num_fpw
--+---+-+-
 VACUUM (PARALLEL 1) test |  72814331 |8857 |8855



postgres[106479]=# CREATE TABLE test (a int, b int);
CREATE TABLE
postgres[106479]=# INSERT INTO test SELECT i, i FROM
GENERATE_SERIES(1,200) as i;
INSERT 0 200
postgres[106479]=# CREATE INDEX idx1 on test(a);
CREATE INDEX
postgres[106479]=# VACUUM (PARALLEL 0) test;
VACUUM
postgres[106479]=# select query , wal_bytes, wal_records, wal_num_fpw
from pg_stat_statements where query like 'VACUUM%';
  query   | wal_bytes | wal_records | wal_num_fpw
--+---+-+-
 VACUUM (PARALLEL 0) test |  72814331 |8857 |8855

By tomorrow, I will try to finish reviewing 0005 and 0006.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: color by default

2020-04-01 Thread Peter Eisentraut

On 2020-03-30 10:08, Michael Paquier wrote:

On Sun, Mar 29, 2020 at 11:56:15AM +0200, Peter Eisentraut wrote:

I didn't do this because it would create additional complications in the man
pages.  But there is now an index entry, so it's possible to find more
information.


Cannot you add a link to the page for color support in each one of
them?  That seems more user-friendly to me.


Do you have a specific phrasing or look in mind?


I'm not sure how to do that.  The full list of possible values is huge, and
exactly what is supported depends on the terminal.


An idea is to add a reference to SGR parameters directly from
wikipedia:
https://en.wikipedia.org/wiki/ANSI_escape_code
However I recall that you don't like adding references to
Wiki-sensei.  Please feel free to discard this idea if you don't like
it.


Yeah, we could perhaps do this.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Robert Haas
On Wed, Apr 1, 2020 at 2:40 AM Andres Freund  wrote:
> The problem is that there's no protection against the xids in the
> ringbuffer getting old enough to wrap around. Given that practical uses
> of old_snapshot_threshold are likely to be several hours to several
> days, that's not particularly hard to hit.

Presumably this could be fixed by changing it to use FullTransactionId.
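
The reason that would help, in miniature: 32-bit xid comparisons are only
meaningful while the two values are within 2^31 of each other, whereas an
epoch-qualified 64-bit xid compares as a plain integer. A small self-contained
illustration (FullXid is a stand-in type here, not the real FullTransactionId
declaration):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* 32-bit xid comparison, the way TransactionIdPrecedes() does it */
static bool
xid32_precedes(uint32_t a, uint32_t b)
{
    return (int32_t) (a - b) < 0;   /* only valid within 2^31 of each other */
}

/* epoch-qualified 64-bit xid: comparison is an ordinary integer compare */
typedef struct FullXid { uint64_t value; } FullXid;

static bool
full_xid_precedes(FullXid a, FullXid b)
{
    return a.value < b.value;
}

int
main(void)
{
    /* once two xids drift more than 2^31 apart, the 32-bit test inverts */
    FullXid a = { 3 };
    FullXid b = { 0x80000004u };

    printf("xid32: 3 precedes 0x80000004? %d\n", xid32_precedes(3, 0x80000004u));
    printf("full:  3 precedes 0x80000004? %d\n", full_xid_precedes(a, b));
    return 0;
}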

> The problem, as far as I can tell, is that
> oldSnapshotControl->head_timestamp appears to be intended to be the
> oldest value in the ring. But we update it unconditionally in the "need
> a new bucket, but it might not be the very next one" branch of
> MaintainOldSnapshotTimeMapping().

I agree, that doesn't look right. It's correct, I think, for the "if
(advance >= OLD_SNAPSHOT_TIME_MAP_ENTRIES)" case, but not in the
"else" case. In the "else" case, it should advance by 1 (wrapping if
needed) each time we take the "if (oldSnapshotControl->count_used ==
OLD_SNAPSHOT_TIME_MAP_ENTRIES)" branch, and should remain unchanged in
the "else" branch for that if statement.

> As far as I can tell, this code has been wrong since the feature has
> been committed.  The tests don't show a problem, because none of this
> code is reached when old_snapshot_threshold = 0 (which has no real world
> use, it's purely for testing).

I'm pretty sure I complained about the fact that only the
old_snapshot_threshold = 0 case was tested at the time this went in,
but I don't think Kevin was too convinced that we needed to do
anything else, and realistically, if he'd tried for a regression test
that ran for 15 minutes, Tom would've gone ballistic.

> I really don't know what to do here. The feature never worked and will
> silently cause wrong query results. Fixing it seems like a pretty large
> task - there's a lot more bugs. But ripping out a feature in stable
> branches is pretty bad too.

I don't know what other bugs there are, but the two you mention above
look fixable. Even if we decide that the feature can't be salvaged, I
would vote against ripping it out in back branches. I would instead
argue for telling people not to use it and ripping it out in master.
However, much as I'm not in love with all of the complexity this
feature adds, I don't see the problems you've reported here as serious
enough to justify ripping it out.

What exactly is the interaction of this patch with your snapshot
scalability work?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Andres Freund
Hi,

On 2020-03-31 23:40:08 -0700, Andres Freund wrote:
> I added some debug output to print the mapping before/after changes by
> MaintainOldSnapshotTimeMapping() (note that I used timestamps relative
> to the server start in minutes/seconds to make it easier to interpret).

Now attached.

Greetings,

Andres Freund
diff --git i/src/backend/utils/time/snapmgr.c w/src/backend/utils/time/snapmgr.c
index 1c063c592ce..cbfc462ee26 100644
--- i/src/backend/utils/time/snapmgr.c
+++ w/src/backend/utils/time/snapmgr.c
@@ -123,6 +123,7 @@ typedef struct OldSnapshotControlData
 	int			head_offset;	/* subscript of oldest tracked time */
 	TimestampTz head_timestamp; /* time corresponding to head xid */
 	int			count_used;		/* how many slots are in use */
+	TimestampTz starttime; // rounded to minute
 	TransactionId xid_by_minute[FLEXIBLE_ARRAY_MEMBER];
 } OldSnapshotControlData;
 
@@ -290,6 +291,8 @@ SnapMgrInit(void)
 		oldSnapshotControl->head_offset = 0;
 		oldSnapshotControl->head_timestamp = 0;
 		oldSnapshotControl->count_used = 0;
+		oldSnapshotControl->starttime =
+			AlignTimestampToMinuteBoundary(GetCurrentTimestamp());
 	}
 }
 
@@ -1762,6 +1765,39 @@ SetOldSnapshotThresholdTimestamp(TimestampTz ts, TransactionId xlimit)
 	SpinLockRelease(&oldSnapshotControl->mutex_threshold);
 }
 
+static void PrintOldSnapshotMapping(const char *head, bool already_locked)
+{
+	StringInfoData s;
+
+	initStringInfo(&s);
+
+	if (!already_locked)
+		LWLockAcquire(OldSnapshotTimeMapLock, LW_SHARED);
+
+
+	appendStringInfo(&s, "old snapshot mapping at \"%s\" with head ts: %lu, current entries: %d max entries: %d, offset: %d\n",
+	 head,
+	 (oldSnapshotControl->head_timestamp - oldSnapshotControl->starttime) / USECS_PER_MINUTE,
+	 oldSnapshotControl->count_used,
+	 OLD_SNAPSHOT_TIME_MAP_ENTRIES,
+	 oldSnapshotControl->head_offset);
+
+	for (int off = 0; off < oldSnapshotControl->count_used; off++)
+	{
+		int ringoff = (off + oldSnapshotControl->head_offset) % OLD_SNAPSHOT_TIME_MAP_ENTRIES;
+
+		appendStringInfo(&s, "  entry %d (ring %d): min %ld: xid %d\n",
+		 off, ringoff,
+		 (oldSnapshotControl->head_timestamp - oldSnapshotControl->starttime) / USECS_PER_MINUTE + off,
+		 oldSnapshotControl->xid_by_minute[ringoff]);
+	}
+
+	if (!already_locked)
+		LWLockRelease(OldSnapshotTimeMapLock);
+
+	elog(WARNING, "%s", s.data);
+}
+
 /*
  * TransactionIdLimitedForOldSnapshots
  *
@@ -1824,6 +1860,7 @@ TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
 
 		if (!same_ts_as_threshold)
 		{
+			PrintOldSnapshotMapping("non cached limit", false);
 			if (ts == update_ts)
 			{
 xlimit = latest_xmin;
@@ -1923,14 +1960,14 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
 	 */
 	if (whenTaken < 0)
 	{
-		elog(DEBUG1,
+		elog(PANIC,
 			 "MaintainOldSnapshotTimeMapping called with negative whenTaken = %ld",
 			 (long) whenTaken);
 		return;
 	}
 	if (!TransactionIdIsNormal(xmin))
 	{
-		elog(DEBUG1,
+		elog(PANIC,
 			 "MaintainOldSnapshotTimeMapping called with xmin = %lu",
 			 (unsigned long) xmin);
 		return;
@@ -1938,6 +1975,8 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
 
 	LWLockAcquire(OldSnapshotTimeMapLock, LW_EXCLUSIVE);
 
+	PrintOldSnapshotMapping("before update", true);
+
 	Assert(oldSnapshotControl->head_offset >= 0);
 	Assert(oldSnapshotControl->head_offset < OLD_SNAPSHOT_TIME_MAP_ENTRIES);
 	Assert((oldSnapshotControl->head_timestamp % USECS_PER_MINUTE) == 0);
@@ -1956,7 +1995,7 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
 	{
 		/* old ts; log it at DEBUG */
 		LWLockRelease(OldSnapshotTimeMapLock);
-		elog(DEBUG1,
+		elog(PANIC,
 			 "MaintainOldSnapshotTimeMapping called with old whenTaken = %ld",
 			 (long) whenTaken);
 		return;
@@ -1971,6 +2010,12 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
  / USECS_PER_MINUTE))
 		% OLD_SNAPSHOT_TIME_MAP_ENTRIES;
 
+		elog(WARNING, "head %ld s: updating existing bucket %d for sec %ld with xmin %u",
+			 (oldSnapshotControl->head_timestamp - oldSnapshotControl->starttime) / USECS_PER_SEC,
+			 bucket,
+			 (ts - oldSnapshotControl->starttime) / USECS_PER_SEC,
+			 xmin);
+
 		if (TransactionIdPrecedes(oldSnapshotControl->xid_by_minute[bucket], xmin))
 			oldSnapshotControl->xid_by_minute[bucket] = xmin;
 	}
@@ -1980,6 +2025,13 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
 		int			advance = ((ts - oldSnapshotControl->head_timestamp)
 			   / USECS_PER_MINUTE);
 
+		elog(WARNING, "head %ld s: filling %d buckets starting at %d for sec %ld with xmin %u",
+			 (oldSnapshotControl->head_timestamp - oldSnapshotControl->starttime) / USECS_PER_SEC,
+			 advance,
+			 oldSnapshotControl->head_offset,
+			 (ts - oldSnapshotControl->starttime) / USECS_PER_SEC,
+			 xmin);
+
 		oldSnapshotControl->head_timestamp = ts;
 
 		if (advance >= OLD_SNAPSHOT_TIME_MAP_ENTRIES)
@@ -2021,6 +2073,8 

Re: allow online change primary_conninfo

2020-04-01 Thread Peter Eisentraut

On 2020-03-28 11:49, Sergei Kornilov wrote:

I attached updated patch: walreceiver will use configured primary_slot_name as 
temporary slot name if wal_receiver_create_temp_slot is enabled.


The original setup worked consistently with pg_basebackup.  In 
pg_basebackup, if you specify -S/--slot, then it uses a permanent slot 
with the given name.  This is the same behavior as primary_slot_name has 
now.  We should be careful that we don't create two sets of 
options/settings that look similar but combine in different ways.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Commitfest 2020-03 Now in Progress

2020-04-01 Thread David Steele

On 3/17/20 8:10 AM, David Steele wrote:

On 3/1/20 4:10 PM, David Steele wrote:

The last Commitfest for v13 is now in progress!

Current stats for the Commitfest are:

Needs review: 192
Waiting on Author: 19
Ready for Committer: 4
Total: 215


Halfway through, here's where we stand.  Note that a CF entry was added 
after the stats above were recorded so the total has increased.


Needs review: 151 (-41)
Waiting on Author: 20 (+1)
Ready for Committer: 9 (+5)
Committed: 25
Moved to next CF: 1
Withdrawn: 4
Returned with Feedback: 6
Total: 216


After regulation time here's where we stand:

Needs review: 111 (-40)
Waiting on Author: 26 (+6)
Ready for Committer: 11 (+2)
Committed: 51 (+11)
Moved to next CF: 2 (+1)
Returned with Feedback: 10 (+4)
Rejected: 1
Withdrawn: 5 (+1)
Total: 217

We have one more patch in total, so it appears that a patch that was in a 
completed state (committed, etc.) at the beginning of the CF has been moved 
back to an uncompleted state. Or perhaps my math is just bad.


The RMT has determined that the CF will be extended for one week so I'll 
hold off on moving and marking patches until April 8.


Regards,
--
-David
da...@pgmasters.net




Re: potential stuck lock in SaveSlotToPath()

2020-04-01 Thread Peter Eisentraut

On 2020-03-27 08:48, Michael Paquier wrote:

On Thu, Mar 26, 2020 at 02:16:05PM +0100, Peter Eisentraut wrote:

committed and backpatched


The patch committed does that in three places:
 /* rename to permanent file, fsync file and directory */
 if (rename(tmppath, path) != 0)
 {
+   LWLockRelease(&slot->io_in_progress_lock);
ereport(elevel,
 (errcode_for_file_access(),
  errmsg("could not rename file \"%s\" to \"%s\": %m",

But why do you assume that LWLockRelease() never changes errno?  It
seems to me that you should save errno before calling LWLockRelease(),
and then restore it back before using %m in the log message, no?  See
for example the case where trace_lwlocks is set.


Good catch.  How about the attached patch?

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 484d734e9a985fa3e9242644ca9173658b9efe2c Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 1 Apr 2020 16:24:47 +0200
Subject: [PATCH] Save errno across LWLockRelease() calls

Fixup for "Drop slot's LWLock before returning from SaveSlotToPath()"

Reported-by: Michael Paquier 
---
 src/backend/replication/slot.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index d90c7235e9..47851ec4c1 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1259,9 +1259,13 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, 
int elevel)
/*
 * If not an ERROR, then release the lock before returning.  In 
case
 * of an ERROR, the error recovery path automatically releases 
the
-* lock, but no harm in explicitly releasing even in that case.
+* lock, but no harm in explicitly releasing even in that case. 
 Note
+* that LWLockRelease() could affect errno.
 */
+   int save_errno = errno;
+
LWLockRelease(&slot->io_in_progress_lock);
+   errno = save_errno;
ereport(elevel,
(errcode_for_file_access(),
 errmsg("could not create file \"%s\": %m",
@@ -1325,7 +1329,10 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, 
int elevel)
 
if (CloseTransientFile(fd) != 0)
{
+   int save_errno = errno;
+
LWLockRelease(&slot->io_in_progress_lock);
+   errno = save_errno;
ereport(elevel,
(errcode_for_file_access(),
 errmsg("could not close file \"%s\": %m",
@@ -1336,7 +1343,10 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, 
int elevel)
/* rename to permanent file, fsync file and directory */
if (rename(tmppath, path) != 0)
{
+   int save_errno = errno;
+
LWLockRelease(&slot->io_in_progress_lock);
+   errno = save_errno;
ereport(elevel,
(errcode_for_file_access(),
 errmsg("could not rename file \"%s\" to 
\"%s\": %m",
-- 
2.26.0



Re: WAL usage calculation patch

2020-04-01 Thread Julien Rouhaud
Hi,

I'm replying here to all reviews that have been sent, thanks a lot!

On Wed, Apr 01, 2020 at 04:29:16PM +0530, Amit Kapila wrote:
> On Wed, Apr 1, 2020 at 1:32 PM Julien Rouhaud  wrote:
> >
> > So here's a v9, rebased on top of the latest versions of Sawada-san's bug 
> > fixes
> > (Amit's v6 for vacuum and Sawada-san's v2 for create index), with all
> > previously mentionned changes.
> >
> 
> Few other comments:
> v9-0003-Add-infrastructure-to-track-WAL-usage
> 1.
>  static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
> -
> +static void WalUsageAdd(WalUsage *dst, WalUsage *add);
> 
> Looks like a spurious line removal


Fixed.


> 2.
> + /* Report a full page imsage constructed for the WAL record */
> + *num_fpw += 1;
> 
> Typo. /imsage/image


Ah sorry, I thought I had fixed it previously; fixed now.


> 3. Doing some testing with and without parallelism to ensure WAL usage
> data is correct would be great and if possible, share the results?


I just saw that Dilip did some testing, but just in case here is some
additional testing:

- vacuum, after a truncate, loading 1M rows and an "UPDATE t1 SET id = id"

=# select query, calls, wal_bytes, wal_records, wal_num_fpw from 
pg_stat_statements where query ilike '%vacuum%';
 query  | calls | wal_bytes | wal_records | wal_num_fpw
+---+---+-+-
 vacuum (parallel 3) t1 | 1 |  20098962 |   34104 |   2
 vacuum (parallel 0) t1 | 1 |  20098962 |   34104 |   2
(2 rows)

- create index, overriding t1's parallel_workers, using the 1M rows just
  vacuumed:

=# alter table t1 set (parallel_workers = 2);
ALTER TABLE

=# create index t1_parallel_2 on t1(id);
CREATE INDEX

=# alter table t1 set (parallel_workers = 0);
ALTER TABLE

=# create index t1_parallel_0 on t1(id);
CREATE INDEX

=# select query, calls, wal_bytes, wal_records, wal_num_fpw from 
pg_stat_statements where query ilike '%create index%';
                query                 | calls | wal_bytes | wal_records | wal_num_fpw
--------------------------------------+-------+-----------+-------------+-------------
 create index t1_parallel_0 on t1(id) |     1 |  20355540 |        2762 |        2745
 create index t1_parallel_2 on t1(id) |     1 |  20406811 |        2762 |        2758
(2 rows)

It all looks good to me.


> v9-0005-Keep-track-of-WAL-usage-in-pg_stat_statements
> 4.
> +-- SELECT usage data, check WAL usage is reported, wal_records equal
> rows count for INSERT/UPDATE/DELETE
> +SELECT query, calls, rows,
> +wal_bytes > 0 as wal_bytes_generated,
> +wal_records > 0 as wal_records_generated,
> +wal_records = rows as wal_records_as_rows
> +FROM pg_stat_statements ORDER BY query COLLATE "C";
> +  query   |
> calls | rows | wal_bytes_generated | wal_records_generated |
> wal_records_as_rows
> +--+---+--+-+---+-
> + DELETE FROM pgss_test WHERE a > $1   |
>   1 |1 | t   | t | t
> + DROP TABLE pgss_test |
>   1 |0 | t   | t | f
> + INSERT INTO pgss_test (a, b) VALUES ($1, $2), ($3, $4), ($5, $6) |
>   1 |3 | t   | t | t
> + INSERT INTO pgss_test VALUES(generate_series($1, $2), $3)|
>   1 |   10 | t   | t | t
> + SELECT * FROM pgss_test ORDER BY a   |
>   1 |   12 | f   | f | f
> + SELECT * FROM pgss_test WHERE a > $1 ORDER BY a  |
>   2 |4 | f   | f | f
> + SELECT * FROM pgss_test WHERE a IN ($1, $2, $3, $4, $5)  |
>   1 |8 | f   | f | f
> + SELECT pg_stat_statements_reset()|
>   1 |1 | f   | f | f
> + SET pg_stat_statements.track_utility = FALSE |
>   1 |0 | f   | f | t
> + UPDATE pgss_test SET b = $1 WHERE a = $2 |
>   6 |6 | t   | t | t
> + UPDATE pgss_test SET b = $1 WHERE a > $2 |
>   1 |3 | t   | t | t
> +(11 rows)
> +
> 
> I am not sure if the above tests make much sense as they are just
> testing that if WAL is generated for these commands.  I understand it
> is not easy to make these tests reliable but in that case, we can
> think of some simple tests.  It seems to me that the difficulty is due
> to full_page_writes as that depends on the checkpoint.  Can we make
> full_page_writes = off for these tests and check some simple
> Insert/Update/Delete cases? 

Re: Less-silly selectivity for JSONB matching operators

2020-04-01 Thread Tom Lane
Alexey Bashtanov  writes:
> On 31/03/2020 18:53, Tom Lane wrote:
>> Renamed "matchsel" to "matchingsel" etc, added DEFAULT_MATCHING_SEL,
>> rebased over commit 911e70207.  Since that commit already created
>> new versions of the relevant contrib modules, I think we can just
>> redefine what those versions contain, rather than making yet-newer
>> versions.  (Of course, that assumes we're going to include this in
>> v13.)

> Looks good to me.

Pushed, thanks for reviewing!

regards, tom lane




Re: control max length of parameter values logged

2020-04-01 Thread Tom Lane
Justin Pryzby  writes:
> On Wed, Apr 01, 2020 at 10:10:55AM +0100, Alexey Bashtanov wrote:
>>> Could you make zero a normal value and -1 the "special" value to disable
>>> trimming ?

>> I can, but then for the sake of consistency I'll have to do the same for
>> log_parameter_max_length.
>> Then we'll end up with two ways to enable/disable parameter logging on
>> error:
>> using the primary boolean setting and setting length to 0.
>> One of them will require superuser privileges, the other one won't.

> I guess you're referring to log_parameters_on_error.
> Does it have to be SUSET ?
> Or maybe log_parameters_on_error doesn't need to exist at all, and setting
> log_parameter_max_length=0 can be used to disable parameter logging.
> I showed up late to this thread, so let's see what others think.

I think Justin's got a point: defining zero this way is weirdly
inconsistent.  -1, being clearly outside the domain of possible
length limits, makes more sense as a marker for "don't trim".

Alexey's right that having a separate boolean flag is pointless, but
I think we could just drop the boolean; we haven't shipped that yet.
The privilege argument seems irrelevant to me.  We already decided
that the plan is (a) SUSET for non-error statement logging purposes and
(b) USERSET for logging caused by errors, and that would have to apply
to length limits as well as enable/disable ability.  Otherwise a user
could pretty effectively disable logging by setting the length to 1.

regards, tom lane




Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Andres Freund
Hi,

On 2020-04-01 10:01:07 -0400, Robert Haas wrote:
> On Wed, Apr 1, 2020 at 2:40 AM Andres Freund  wrote:
> > The problem is that there's no protection against the xids in the
> > ringbuffer getting old enough to wrap around. Given that practical uses
> > of old_snapshot_threshold are likely to be several hours to several
> > days, that's not particularly hard to hit.
> 
> Presumably this could be fixed by changing it to use FullTransactionId.

That doesn't exist in all the back branches. Think it'd be easier to add
code to explicitly prune it during MaintainOldSnapshotTimeMapping().
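
One possible shape for that pruning, sketched with stand-in parameters rather
than the real OldSnapshotControlData fields: on every maintenance call, drop
head buckets whose xid has drifted close to 2^31 behind the xmin currently
being registered, before wraparound can make comparisons against them
ambiguous. This is only an illustration of the idea, not a proposed patch:

#include <stdint.h>

#define MAP_ENTRIES      15
#define USECS_PER_MINUTE 60000000LL

/*
 * Distance from 'older' to 'newer' in modulo-2^32 xid space; meaningful as
 * long as the true distance is below 2^31.
 */
static uint32_t
xid_distance(uint32_t older, uint32_t newer)
{
    return newer - older;
}

static void
prune_stale_buckets(uint32_t xid_by_minute[], int *head_offset,
                    int64_t *head_timestamp, int *count_used,
                    uint32_t current_xmin)
{
    const uint32_t max_age = 0x7FFF0000u;   /* stay well clear of 2^31 */

    while (*count_used > 0 &&
           xid_distance(xid_by_minute[*head_offset], current_xmin) > max_age)
    {
        *head_offset = (*head_offset + 1) % MAP_ENTRIES;
        *head_timestamp += USECS_PER_MINUTE;
        (*count_used)--;
    }
}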


> > The problem, as far as I can tell, is that
> > oldSnapshotControl->head_timestamp appears to be intended to be the
> > oldest value in the ring. But we update it unconditionally in the "need
> > a new bucket, but it might not be the very next one" branch of
> > MaintainOldSnapshotTimeMapping().
> 
> I agree, that doesn't look right. It's correct, I think, for the "if
> (advance >= OLD_SNAPSHOT_TIME_MAP_ENTRIES)" case, but not in the
> "else" case. In the "else" case, it should advance by 1 (wrapping if
> needed) each time we take the "if (oldSnapshotControl->count_used ==
> OLD_SNAPSHOT_TIME_MAP_ENTRIES)" branch, and should remain unchanged in
> the "else" branch for that if statement.

Yea.


> > As far as I can tell, this code has been wrong since the feature has
> > been committed.  The tests don't show a problem, because none of this
> > code is reached when old_snapshot_threshold = 0 (which has no real world
> > use, it's purely for testing).
> 
> I'm pretty sure I complained about the fact that only the
> old_snapshot_threshold = 0 case was tested at the time this went in,
> but I don't think Kevin was too convinced that we needed to do
> anything else, and realistically, if he'd tried for a regression test
> that ran for 15 minutes, Tom would've gone ballistic.

I think it's not just Tom that'd have gone ballistic. I think it's the
reason why, as I think is pretty clear, the feature was *never* actually
tested. The results of whats being removed are not quite random, but
it's not far from it. And there's long stretches of time where it never
removes things.

It's also a completely self-made problem.

There's really no reason at all to have bins of one minute. As it's a
PGC_POSTMASTER GUC, it should just have divided time into bins of
(old_snapshot_threshold * USEC_PER_SEC) / 100 or such. For a threshold
of a week there's no need to keep 10k bins, and the minimum threshold of
1 minute obviously is problematic.


> > I really don't know what to do here. The feature never worked and will
> > silently cause wrong query results. Fixing it seems like a pretty large
> > task - there's a lot more bugs. But ripping out a feature in stable
> > branches is pretty bad too.
> 
> I don't know what other bugs there are, but the two you mention above
> look fixable.

They probably are fixable. But there's a lot more, I think:

Looking at TransactionIdLimitedForOldSnapshots() I think the ts ==
update_ts threshold actually needs to be ts >= update_ts, right now we
don't handle being newer than the newest bin correctly afaict (mitigated
by autovacuum=on with naptime=1s doing a snapshot more often). It's hard
to say, because there's no comments.

The whole lock nesting is very hazardous. Most (all?)
TestForOldSnapshot() calls happen with locks on buffers held, and can
acquire lwlocks itself. In some older branches we do entire *catalog
searches* with the buffer lwlock held (for RelationHasUnloggedIndex()).

GetSnapshotData() using snapshot->lsn = GetXLogInsertRecPtr(); as the
basis to detect conflicts seems dangerous to me. Isn't that ignoring
inserts that are already in progress?


> Even if we decide that the feature can't be salvaged, I would vote
> against ripping it out in back branches. I would instead argue for
> telling people not to use it and ripping it out in master.

It currently silently causes wrong query results. There's no
infrastructure to detect / protect against that (*).

I'm sure we can fix individual instances of problems. But I don't know
how one is supposed to verify that the fixes actually work. There's
currently no tests for the actual feature. And manual tests are painful
due to the multi-minute thresholds needed, and it's really hard to
manually verify that only the right rows are removed due to the feature,
and that all necessary errors are thrown. Given e.g. the bugs in my
email upthread, there's periods of several minutes where we'd not see
any row removed and then periods where the wrong ones would be removed,
so the manual tests would have to be repeated numerous times to actually
ensure anything.

If somebody wants to step up to the plate and fix these, it'd perhaps be
more realistic to say that we'll keep the feature. But even if somebody
does, I think it'd require a lot of development in the back branches. On
a feature whose purpose is to eat data that is still required.

I think even if we dec

Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Robert Haas
On Wed, Apr 1, 2020 at 2:40 AM Andres Freund  wrote:
> I added some debug output to print the mapping before/after changes by
> MaintainOldSnapshotTimeMapping() (note that I used timestamps relative
> to the server start in minutes/seconds to make it easier to interpret).
>
> And the output turns out to be something like:
>
> WARNING:  old snapshot mapping at "before update" with head ts: 7, current 
> entries: 8 max entries: 15, offset: 0
>   entry 0 (ring 0): min 7: xid 582921233
>   entry 1 (ring 1): min 8: xid 654154155
>   entry 2 (ring 2): min 9: xid 661972949
>   entry 3 (ring 3): min 10: xid 666899382
>   entry 4 (ring 4): min 11: xid 644169619
>   entry 5 (ring 5): min 12: xid 644169619
>   entry 6 (ring 6): min 13: xid 644169619
>   entry 7 (ring 7): min 14: xid 644169619
>
> WARNING:  head 420 s: updating existing bucket 4 for sec 660 with xmin 
> 666899382
>
> WARNING:  old snapshot mapping at "after update" with head ts: 7, current 
> entries: 8 max entries: 15, offset: 0
>   entry 0 (ring 0): min 7: xid 582921233
>   entry 1 (ring 1): min 8: xid 654154155
>   entry 2 (ring 2): min 9: xid 661972949
>   entry 3 (ring 3): min 10: xid 666899382
>   entry 4 (ring 4): min 11: xid 666899382
>   entry 5 (ring 5): min 12: xid 644169619
>   entry 6 (ring 6): min 13: xid 644169619
>   entry 7 (ring 7): min 14: xid 644169619
>
> It's pretty obvious that the xids don't make a ton of sense, I think:
> They're not monotonically ordered. The same values exist multiple times,
> despite xids being constantly used. Also, despite the ringbuffer
> supposedly having 15 entries (that's snapshot_too_old = 5min + the 10 we
> always add), and the workload having run for 14min, we only have 8
> entries.

The function header comment for MaintainOldSnapshotTimeMapping could
hardly be more vague, as it's little more than a restatement of the
function name. However, it looks to me like the idea is that this
function might get called multiple times for the same or similar
values of whenTaken. I suppose that's the point of this code:

else if (ts <= (oldSnapshotControl->head_timestamp +
((oldSnapshotControl->count_used - 1)
 * USECS_PER_MINUTE)))
{
/* existing mapping; advance xid if possible */
int bucket = (oldSnapshotControl->head_offset
  + ((ts - oldSnapshotControl->head_timestamp)
 / USECS_PER_MINUTE))
% OLD_SNAPSHOT_TIME_MAP_ENTRIES;

if (TransactionIdPrecedes(oldSnapshotControl->xid_by_minute[bucket],
xmin))
oldSnapshotControl->xid_by_minute[bucket] = xmin;
}

What I interpret this to be doing is saying - if we got a new call to
this function with a rounded-to-the-minute timestamp that we've seen
previously and for which we still have an entry, and if the XID passed
to this function is newer than the one passed by the previous call,
then advance the xid_by_minute[] bucket to the newer value. Now that
begs the question - what does this XID actually represent? The
comments don't seem to answer that question, not even the comments for
OldSnapshotControlData, which say that we should "Keep one xid per
minute for old snapshot error handling." but don't say which XIDs we
should keep or how they'll be used. However, the only call to
MaintainOldSnapshotTimeMapping() is in GetSnapshotData(). It appears
that we call this function each time a new snapshot is taken and pass
the current time (modulo some fiddling) and snapshot xmin. Given that,
one would expect that any updates to the map would be tight races,
i.e. a bunch of processes that all took their snapshots right around
the same time would all update the same map entry in quick succession,
with the newest value winning.

And that makes the debugging output which I quoted from your message
above really confusing. At this point, the "head timestamp" is 7
minutes after this facility started up. The first entry we have is
for minute 7, and the last is for minute 14. But the one we're
updating is for minute 11. How the heck can that happen? I might
suspect that you'd stopped a process inside GetSnapshotData() with a
debugger, but that can't explain it either, because GetSnapshotData()
gets the xmin first and only afterwards gets the timestamp - so if
you'd stopped it for ~3 minutes just before the call to
MaintainOldSnapshotTimeMapping(), it would've been updating the map
with an *old* XID. In reality, though, it changed the XID from
644169619 to 666899382, advancing over 22 million XIDs. I don't
understand what's going on there. How is this function getting called
with a 4-minute old value of whenTaken?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Verify true root on replicas with amcheck

2020-04-01 Thread David Steele

On 1/9/20 3:55 AM, godjan • wrote:

Hi, we have trouble detecting true root corruption on replicas. I made a patch 
that resolves it by locking the meta page and the potential root page. I heard 
that amcheck has an invariant of locking no more than 1 page at a time to 
avoid deadlocks. Is a deadlock situation possible here?


This patch no longer applies cleanly: 
http://cfbot.cputube.org/patch_27_2418.log


The CF entry has been updated to Waiting on Author.

Also, it would be a good idea to answer Peter's questions down-thread if 
you are interested in moving this patch forward.


Regards,
--
-David
da...@pgmasters.net




Re: proposal \gcsv

2020-04-01 Thread Tom Lane
Vik Fearing  writes:
> On 4/1/20 1:53 AM, Tom Lane wrote:
>> Consider some syntax along the lines of
>> \gpset (pset-option-name [pset-option-value]) ... filename

> If parens are going to be required, why don't we just add them to \g?
> TABLE blah \g (format csv) filename

Yeah, if we're willing to assume that nobody uses filenames beginning
with '(', we could just extend \g's syntax rather than adding a new
command.

After sleeping on it, though, I'm liking my Plan B idea better than
Plan A.  Plan B is very clearly implementable without needing surgery
on the backslash-command parser (I didn't look at the lexer to see
what paren-handling would involve, but it might be painful).  And it
doesn't put any new limits on what pset parameters can look like;
Plan A would likely result in some problems if anybody wants to use
parens in future pset options.

I think that maybe the best terminology for Plan B would be to say
that there's an "alternate" formatting parameter set, which is
manipulated by \apset and then used by \ga.

Another thought, bearing in mind the dictum that the only good numbers
in computer science are 0, 1, and N, is to introduce a concept of named
formatting parameter sets, which you'd manipulate with say
\npset set-name [param-name [param-value]]
and use with
\gn set-name filename-or-command
A likely usage pattern for that would be to set up a few favorite
formats in your ~/.psqlrc, and then they'd be available to just use
immediately in \gn.  (In this universe, \gn should not destroy or
reset the parameter set it uses.)

This is way beyond what anyone has asked for, so I'm not seriously
proposing that we do it right now, but it might be something to keep
in mind as a possible future direction.  The main thing that it calls
into question is whether we really want \ga to reset the alternate
parameter values after use.  Maybe it'd be better not to --- I can
think of about-equally-likely usage patterns where you would want
that or not.  We could invent an explicit "\apset reset" command
instead of auto-reset.  I could see having a command to copy the
current primary formatting parameters to the alternate area, too.

There's an argument that this is all way too complicated, of course,
and maybe it is.  But considering that we've already had two requests
for things you can't do with \gfmt as it stands, I think the patch
is too simple as it is.

regards, tom lane




Re: Online checksums patch - once again

2020-04-01 Thread David Steele

On 1/18/20 6:18 PM, Daniel Gustafsson wrote:


Attached is a v16 rebased on top of current master which addresses the above
commented points, and which I am basing the concurrency work on.


This patch no longer applies cleanly: 
http://cfbot.cputube.org/patch_27_2260.log


The CF entry has been updated to Waiting on Author.

Regards,

--
-David
da...@pgmasters.net




Re: Removing unneeded self joins

2020-04-01 Thread David Steele



On 1/27/20 11:10 PM, Andrey Lepikhov wrote:

Rebased version v.22.
- Added enable_self_join_removal GUC (true is default)
- The joinquals of the relation that is being removed, redistributed in 
accordance with the remove_rel_from_query () machinery.


This patch no longer applies cleanly on 
src/test/regress/sql/equivclass.sql: 
http://cfbot.cputube.org/patch_27_1712.log


The CF entry has been updated to Waiting on Author.

Regards,
--
-David
da...@pgmasters.net




Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Andres Freund
Hi,

On 2020-04-01 11:15:14 -0400, Robert Haas wrote:
> On Wed, Apr 1, 2020 at 2:40 AM Andres Freund  wrote:
> > I added some debug output to print the mapping before/after changes by
> > MaintainOldSnapshotTimeMapping() (note that I used timestamps relative
> > to the server start in minutes/seconds to make it easier to interpret).
> >
> > And the output turns out to be something like:
> >
> > WARNING:  old snapshot mapping at "before update" with head ts: 7, current 
> > entries: 8 max entries: 15, offset: 0
> >   entry 0 (ring 0): min 7: xid 582921233
> >   entry 1 (ring 1): min 8: xid 654154155
> >   entry 2 (ring 2): min 9: xid 661972949
> >   entry 3 (ring 3): min 10: xid 666899382
> >   entry 4 (ring 4): min 11: xid 644169619
> >   entry 5 (ring 5): min 12: xid 644169619
> >   entry 6 (ring 6): min 13: xid 644169619
> >   entry 7 (ring 7): min 14: xid 644169619
> >
> > WARNING:  head 420 s: updating existing bucket 4 for sec 660 with xmin 
> > 666899382
> >
> > WARNING:  old snapshot mapping at "after update" with head ts: 7, current 
> > entries: 8 max entries: 15, offset: 0
> >   entry 0 (ring 0): min 7: xid 582921233
> >   entry 1 (ring 1): min 8: xid 654154155
> >   entry 2 (ring 2): min 9: xid 661972949
> >   entry 3 (ring 3): min 10: xid 666899382
> >   entry 4 (ring 4): min 11: xid 666899382
> >   entry 5 (ring 5): min 12: xid 644169619
> >   entry 6 (ring 6): min 13: xid 644169619
> >   entry 7 (ring 7): min 14: xid 644169619
> >
> > It's pretty obvious that the xids don't make a ton of sense, I think:
> > They're not monotonically ordered. The same values exist multiple times,
> > despite xids being constantly used. Also, despite the ringbuffer
> > supposedly having 15 entries (that's snapshot_too_old = 5min + the 10 we
> > always add), and the workload having run for 14min, we only have 8
> > entries.
> 
> The function header comment for MaintainOldSnapshotTimeMapping could
> hardly be more vague, as it's little more than a restatement of the
> function name. However, it looks to me like the idea is that this
> function might get called multiple times for the same or similar
> values of whenTaken. I suppose that's the point of this code:

Right. We enforce whenTaken to be monotonic
(cf. GetSnapshotCurrentTimestamp()), but since
GetSnapshotCurrentTimestamp() reduces the granularity of the timestamp
to one-minute (the AlignTimestampToMinuteBoundary() call), it's
obviously possible to end up in the same bin as a previous


> What I interpret this to be doing is saying - if we got a new call to
> this function with a rounded-to-the-minute timestamp that we've seen
> previously and for which we still have an entry, and if the XID passed
> to this function is newer than the one passed by the previous call,
> then advance the xid_by_minute[] bucket to the newer value. Now that
> begs the question - what does this XID actually represent? The
> comments don't seem to answer that question, not even the comments for
> OldSnapshotControlData, which say that we should "Keep one xid per
> minute for old snapshot error handling." but don't say which XIDs we
> should keep or how they'll be used. However, the only call to
> MaintainOldSnapshotTimeMapping() is in GetSnapshotData(). It appears
> that we call this function each time a new snapshot is taken and pass
> the current time (modulo some fiddling) and snapshot xmin. Given that,
> one would expect that any updates to the map would be tight races,
> i.e. a bunch of processes that all took their snapshots right around
> the same time would all update the same map entry in quick succession,
> with the newest value winning.

Right.


> And that makes the debugging output which I quoted from your message
> above really confusing. At this point, the "head timestamp" is 7
> minutes after this facility started up. The first we entry we have is
> for minute 7, and the last is for minute 14. But the one we're
> updating is for minute 11. How the heck can that happen?

If I understand what you're referring to correctly, I think that is because,
due to the bug, the "need a new bucket" branch doesn't just extend by
one bucket, it extends it by many in common cases. Basically filling
buckets "into the future".

the advance = ... variable in the branch will not always be 1, even when
we continually call Maintain*.  Here's some debug output showing that
(slightly modified from the patch I previously sent):

WARNING:  old snapshot mapping at "before update" with head ts: 1, current 
entries: 2 max entries: 15, offset: 0
  entry 0 (ring 0): min 1: xid 1089371384
  entry 1 (ring 1): min 2: xid 1099553206

WARNING:  head 1 min: filling 2 buckets starting at 0 for whenTaken 3 min, with 
xmin 1109840204

WARNING:  old snapshot mapping at "after update" with head ts: 3, current 
entries: 4 max entries: 15, offset: 0
  entry 0 (ring 0): min 3: xid 1089371384
  entry 1 (ring 1): min 4: xid 1099553206
  entry 2 (ring 2): min 5: xid 1109840204
  entry 3 (ring 3): min 6: xid 1109840204

Re: BufFileRead() error signalling

2020-04-01 Thread David Steele

Hi Thomas,

On 11/29/19 9:46 PM, Thomas Munro wrote:


Ok.  Here is a first attempt at that.


It's been a few CFs since this patch received an update, though there 
has been plenty of discussion.


Perhaps it would be best to mark it RwF until you have a chance to 
produce an updated patch?


Regards,
--
-David
da...@pgmasters.net




Re: proposal \gcsv

2020-04-01 Thread Daniel Verite
Tom Lane wrote:

>  I could see having a command to copy the current primary formatting
> parameters to the alternate area, too.

We could have a stack to store parameters before temporary
changes, for instance if you want to do one csv export and
come back to normal without assuming what "normal"
values were.

\pset push format csv_fieldsep
\pset format csv
\pset csv_fieldsep '\t'
some command \g somefile
\pset pop

So \pset pop would reset the pushed parameters
to their values when pushed, which also could be all
parameters:

\pset push all
\pset param1 something
\pset param2 something-else
...other commands...
\pset pop

or

\pset push all
\i somescript.sql
\pset pop
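
Mechanically, the push/pop idea needs very little machinery: a stack of saved
copies of the formatting state. A sketch in which PsetParams is a made-up
stand-in for psql's real formatting state, not its actual struct:

#include <stdbool.h>
#include <stdlib.h>

typedef struct PsetParams
{
    char format[16];
    char csv_fieldsep[8];
    /* ... the rest of the \pset settings ... */
} PsetParams;

typedef struct PsetStackEntry
{
    PsetParams             saved;
    struct PsetStackEntry *next;
} PsetStackEntry;

static PsetStackEntry *pset_stack = NULL;

/* \pset push: snapshot the current settings */
static void
pset_push(const PsetParams *current)
{
    PsetStackEntry *e = malloc(sizeof(PsetStackEntry));

    e->saved = *current;        /* struct copy */
    e->next = pset_stack;
    pset_stack = e;
}

/* \pset pop: restore the most recently pushed settings, if any */
static bool
pset_pop(PsetParams *current)
{
    PsetStackEntry *e = pset_stack;

    if (e == NULL)
        return false;
    *current = e->saved;
    pset_stack = e->next;
    free(e);
    return true;
}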


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite




Re: proposal \gcsv

2020-04-01 Thread Isaac Morland
On Wed, 1 Apr 2020 at 11:52, Daniel Verite  wrote:

> Tom Lane wrote:
>
> >  I could see having a command to copy the current primary formatting
> > parameters to the alternate area, too.
>
> We could have a stack to store parameters before temporary
> changes, for instance if you want to do one csv export and
> come back to normal without assuming what "normal"
> values were.
>

I think it might be a good idea to decide whether psql is to be a
programming environment, or just a command shell.

If it is to be a programming environment, we should either adopt an
existing language or strike a committee of programming language experts to
design a new one.

If it is to be a command shell, new features should be evaluated in part on
whether they move psql significantly closer to being a programming language
and rejected if they do.


Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Robert Haas
On Wed, Apr 1, 2020 at 11:09 AM Andres Freund  wrote:
> That doesn't exist in all the back branches. Think it'd be easier to add
> code to explicitly prune it during MaintainOldSnapshotTimeMapping().

That's reasonable.

> There's really no reason at all to have bins of one minute. As it's a
> PGC_POSTMASTER GUC, it should just have didided time into bins of
> (old_snapshot_threshold * USEC_PER_SEC) / 100 or such. For a threshold
> of a week there's no need to keep 10k bins, and the minimum threshold of
> 1 minute obviously is problematic.

I am very doubtful that this approach would have been adequate. It
would mean that, with old_snapshot_threshold set to a week, the
threshold for declaring a snapshot "old" would jump forward 1.68 hours
at a time. It's hard for me to make a coherent argument right now as
to exactly what problems that would create, but it's not very
granular, and a lot of bloat-related things really benefit from more
granularity. I also don't really see what the problem with keeping a
bucket per minute in memory is, even for a week. It's only 60 * 24 * 7
= ~10k buckets, isn't it? That's not really insane for an in-memory
data structure. I agree that the code that does that maintenance being
buggy is a problem to whatever extent that is the case, but writing
the code to have fewer buckets wouldn't necessarily have made it any
less buggy.

> They probably are fixable. But there's a lot more, I think:
>
> Looking at TransactionIdLimitedForOldSnapshots() I think the ts ==
> update_ts threshold actually needs to be ts >= update_ts, right now we
> don't handle being newer than the newest bin correctly afaict (mitigated
> by autovacuum=on with naptime=1s doing a snapshot more often). It's hard
> to say, because there's no comments.

That test and the following one for "if (ts == update_ts)" both make
me nervous too.  If only two of <, >, and = are expected, there should
be an Assert() to that effect, at least. If all three values are
expected then we need an explanation of why we're only checking for
equality.

> The whole lock nesting is very hazardous. Most (all?)
> TestForOldSnapshot() calls happen with locks on on buffers held, and can
> acquire lwlocks itself. In some older branches we do entire *catalog
> searches* with the buffer lwlock held (for RelationHasUnloggedIndex()).

The catalog searches are clearly super-bad, but I'm not sure that the
other ones have a deadlock risk or anything. They might, but I think
we'd need some evidence of that.

> GetSnapshotData() using snapshot->lsn = GetXLogInsertRecPtr(); as the
> basis to detect conflicts seems dangerous to me. Isn't that ignoring
> inserts that are already in progress?

How so?

> It currently silently causes wrong query results. There's no
> infrastructure to detect / protect against that (*).

Sure, and what if you break more stuff ripping it out? Ripping this
volume of code out in a supposedly-stable branch is totally insane
almost no matter how broken the feature is. I also think, and we've
had this disagreement before, that you're far too willing to say
"well, that's wrong so we need to hit it with a nuke." I complained
when you added those error checks to vacuum in back-branches, and
since that release went out people are regularly tripping those checks
and taking prolonged outages for a problem that wasn't making them
unhappy before. I know that in theory those people are better off
because their database was always corrupted and now they know. But for
some of them, those prolonged outages are worse than the problem they
had before. I believe it was irresponsible to decide on behalf of our
entire user base that they were better off with such a behavior change
in a supposedly-stable branch, and I believe the same thing here.

I have no objection to the idea that *if* the feature is hopelessly
broken, it should be removed. But I don't have confidence at this
point that you've established that, and I think ripping out thousands
of lines of codes in the back-branches is terrible. Even
hard-disabling the feature in the back-branches without actually
removing the code is an awfully strong reaction, but it could be
justified if we find out that things are actually super-bad and not
really fixable. Actually removing the code is unnecessary, protects
nobody, and has risk.

> Do we actually have any evidence of this feature ever beeing used? I
> didn't find much evidence for that in the archives (except Thomas
> finding a problem). Given that it currently will switch between not
> preventing bloat and causing wrong query results, without that being
> noticed...

I believe that at least one EnterpriseDB customer used it, and
possibly more than one. I am not sure how extensively, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pgbench - add pseudo-random permutation function

2020-04-01 Thread David Steele

Hi Fabien,

On 2/1/20 5:12 AM, Fabien COELHO wrote:


Attached is an attempt at improving things. I have added an explicit note 
and hijacked an existing example to better illustrate the purpose of the 
function.


This patch does not build on Linux due to some unused functions and 
variables: http://cfbot.cputube.org/patch_27_1712.log


The CF entry has been updated to Waiting on Author.

A few committers have expressed doubts about whether this patch is 
needed and it doesn't make sense to keep moving it from CF to CF.


I'm planning to mark this patch RwF on April 8 and I suggest you 
resubmit if you are able to get some consensus.


Regards,
--
-David
da...@pgmasters.net




Re: proposal \gcsv

2020-04-01 Thread Pavel Stehule
st 1. 4. 2020 v 17:52 odesílatel Daniel Verite 
napsal:

> Tom Lane wrote:
>
> >  I could see having a command to copy the current primary formatting
> > parameters to the alternate area, too.
>
> We could have a stack to store parameters before temporary
> changes, for instance if you want to do one csv export and
> come back to normal without assuming what "normal"
> values were.
>
> \pset push format csv_fieldsep
> \pset format csv
> \pset csv_fieldsep '\t'
> some command \g somefile
> \pset pop
>
> So \pset pop would reset the pushed parameters
> to their values when pushed, which also could be all
> parameters:
>
> \pset push all
> \pset param1 something
> \pset param2 something-else
> ...other commands...
> \pset pop
>
> or
>
> \pset push all
> \i somescript.sql
> \pset pop
>
>
It can work, but it is not user friendly - my proposal was motivated by
using some quick csv exports to gplot's pipe.

Regards

Pavel

>
> Best regards,
> --
> Daniel Vérité
> PostgreSQL-powered mailer: http://www.manitou-mail.org
> Twitter: @DanielVerite
>


Re: A bug when use get_bit() function for a long bytea string

2020-04-01 Thread Tom Lane
"movead...@highgo.ca"  writes:
> [ long_bytea_string_bug_fix_ver5.patch ]

I don't think this has really solved the overflow hazards.  For example,
in binary_encode we've got

resultlen = enc->encode_len(VARDATA_ANY(data), datalen);
result = palloc(VARHDRSZ + resultlen);

and all you've done about that is changed resultlen from int to int64.
On a 64-bit machine, sure, palloc will be able to detect if the
result exceeds what can be allocated --- but on a 32-bit machine
it'd be possible for the size_t argument to overflow altogether.
(Or if you want to argue that it can't overflow because no encoder
expands the data more than 4X, then we don't need to be making this
change at all.)

I don't think there's really any way to do that safely without an
explicit check before we call palloc.
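
Concretely, the kind of explicit check being suggested might look like the
fragment below. MaxAllocSize, the uint64 return type, and the error wording are
assumptions layered on top of the snippet quoted above, so treat this as a
shape rather than the final patch:

    uint64      resultlen;
    bytea      *result;

    resultlen = enc->encode_len(VARDATA_ANY(data), datalen);

    /* refuse anything that cannot fit in a valid allocation request */
    if (resultlen > (uint64) (MaxAllocSize - VARHDRSZ))
        ereport(ERROR,
                (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                 errmsg("result of encoding conversion is too large")));

    result = palloc(VARHDRSZ + (Size) resultlen);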

I also still find the proposed signatures for the encoding-specific
functions to be just plain weird:

-   unsigned(*encode_len) (const char *data, unsigned dlen);
-   unsigned(*decode_len) (const char *data, unsigned dlen);
-   unsigned(*encode) (const char *data, unsigned dlen, char *res);
-   unsigned(*decode) (const char *data, unsigned dlen, char *res);
+   int64   (*encode_len) (const char *data, unsigned dlen);
+   int64   (*decode_len) (const char *data, unsigned dlen);
+   int64   (*encode) (const char *data, unsigned dlen, char *res);
+   int64   (*decode) (const char *data, unsigned dlen, char *res);

Why did you change the outputs from unsigned to signed?  Why didn't
you change the dlen inputs?  I grant that we know that the input
can't exceed 1GB in Postgres' usage, but these signatures are just
randomly inconsistent, and you didn't even add a comment explaining
why.  Personally I think I'd make them like

uint64  (*encode_len) (const char *data, size_t dlen);

which makes it explicit that the dlen argument describes the length
of a chunk of allocated memory, while the result might exceed that.

Lastly, there is a component of this that can be back-patched and
a component that can't --- we do not change system catalog contents
in released branches.  Cramming both parts into the same patch
is forcing the committer to pull them apart, which is kind of
unfriendly.

regards, tom lane




Re: wraparound dangers in snapshot too old

2020-04-01 Thread Andres Freund
Hi,

On 2020-03-31 21:53:04 -0700, Andres Freund wrote:
> I am trying to change the snapshot too old infrastructure so it
> cooperates with my snapshot scalability patch. While trying to
> understand the code sufficiently, I think I found a fairly serious
> issue:

I accidentally sent this email, I was intending to instead only send
https://www.postgresql.org/message-id/20200401064008.qob7bfnnbu4w5cw4%40alap3.anarazel.de

It started to happen because of some default keybinding changes between
mutt / emacs / magit, leading to emacs sending a buffer as an email,
which I didn't intend.

Sorry for that,

Andres Freund




Re: proposal \gcsv

2020-04-01 Thread Tom Lane
Pavel Stehule  writes:
> It can work, but it is not user friendly - my proposal was motivated by
> using some quick csv exports to gplot's pipe.

I kind of liked the stack idea, myself.  It's simpler than what I was
suggesting and it covers probably 90% of the use-case.

However, if we prefer something closer to Plan A ... I took a look at
the psql lexer, and the only difference between OT_FILEPIPE and OT_NORMAL
parsing is if the argument starts with '|'.  So we could make it work
I think.  I'd modify my first proposal so far as to make it

\g ( pset-option pset-value ... ) filename-or-pipe

That is, require spaces around the parens, and require a value for each
pset-option (no fair using the shortcuts like "\pset expanded").  Then
it's easy to separate the option names and values from the paren markers.
The \g parser would consume its first argument in OT_FILEPIPE mode, and
then if it sees '(' it would consume arguments in OT_NORMAL mode until
it's found the ')'.

This way also narrows the backwards-compatibility problem from "fails if
your filename starts with '('" to "fails if your filename is exactly '('",
which seems acceptably improbable to me.
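
For example, hypothetical invocations of the syntax proposed above (nothing
here exists in psql today; the option names simply mirror \pset):

    SELECT * FROM blah ORDER BY 1 \g ( format csv ) /tmp/blah.csv
    SELECT * FROM blah ORDER BY 1 \g ( format csv csv_fieldsep '\t' ) |gnuplot_pipe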

regards, tom lane




Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Peter Geoghegan
On Wed, Apr 1, 2020 at 9:02 AM Robert Haas  wrote:
> I complained
> when you added those error checks to vacuum in back-branches, and
> since that release went out people are regularly tripping those checks
> and taking prolonged outages for a problem that wasn't making them
> unhappy before. I know that in theory those people are better off
> because their database was always corrupted and now they know. But for
> some of them, those prolonged outages are worse than the problem they
> had before. I believe it was irresponsible to decide on behalf of our
> entire user base that they were better off with such a behavior change
> in a supposedly-stable branch, and I believe the same thing here.

I agreed with that decision, FWIW. Though I don't deny that there is
some merit in what you say. This is the kind of high level
philosophical question where large differences of opinion are quite
normal.

I don't think that it's fair to characterize Andres' actions in that
situation as in any way irresponsible. We had an extremely complicated
data corruption bug that he went to great lengths to fix, following
two other incorrect fixes. He was jet lagged from travelling to India
at the time. He went to huge lengths to make sure that the bug was
correctly squashed.

> Actually removing the code is unnecessary, protects
> nobody, and has risk.

Every possible approach has risk. We are deciding among several
unpleasant and risky alternatives here, no?

-- 
Peter Geoghegan




Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2020-04-01 Thread Ashutosh Bapat
On Thu, 26 Mar 2020 at 00:35, Tomas Vondra 
wrote:

> Hi,
>
> I've started reviewing the patch a couple days ago. I haven't done any
> extensive testing, but I do have a bunch of initial comments that I can
> share now.
>
> 1) I wonder if this needs to update src/backend/optimizer/README, which
> does have a section about partitionwise joins. It seems formulated in a
> way that that probably covers even this more advanced algorithm, but
> maybe it should mention handling of default partitions etc.?
>

Done. Please check the wording. It might need some wordsmithing.


>
> There certainly needs to be some description of the algorithm somewhere,
> either in a README or before a suitable function. It doesn't have to be
> particularly detailed, a rough outline of the matching would be enough,
> so that readers don't have to rebuild the knowledge from pieces
> scattered around various comments.
>

The algorithm for list and range partitioned tables is slightly different.
So, I have added separate prologue to each list_merge_bounds() and
range_merge_bounds(). Please check if that serves the purpose.


>
> 2) Do we need another GUC enabling this more complex algorithm? In PG11
> the partitionwise join is disabled by default, on the grounds that it's
> expensive and not worth it by default. How much more expensive is this?
> Maybe it makes sense to allow enabling only the "simple" approach?
>

We have reduced the complexity of merging bounds quite a bit, so this
shouldn't be costly. Furthermore, we handle the usual case of equal bounds
quickly using the merge flag, so most cases should be fine. It's only
when two partitioned tables with the same partition scheme are joined but do
not have merge-able bounds that this algorithm would not yield a useful
result - but that would be rare in the field. enable_partitionwise_join =
false should take care of such scenarios easily. I am not in favour of
adding another GUC which we set to false by default and then take another
few releases to make it true by default.


>
> 3) This comment in try_partitionwise_join is now incorrect, because the
> condition may be true even for partitioned tables with (nparts == 0).
>
>  /* Nothing to do, if the join relation is not partitioned. */
>  if (joinrel->part_scheme == NULL || joinrel->nparts == 0)
>  return;
>
>
If part_scheme = NULL, nparts should be 0; fixed that in
build_joinrel_partition_info(). If part_scheme != NULL but the bounds cannot
be merged, nparts = 0. So this condition is correct. Encapsulated it in a
macro IS_JOINREL_NOT_PARTITIONED() and added comments for the macro. Given
that the macro is used in exactly one place, it may not be necessary to
define it, but it looks *nice*.


> Moreover, the condition used to be
>
>  if (!IS_PARTITIONED_REL(joinrel))
>  return;
>
> which is way more readable. I think it's net negative to replace these
> "nice" macros with clear meaning with complex conditions. If needed, we
> can invent new macros. There are many other places where the patch
> replaces macros with less readable conditions.
>
>
The only other place where we have replaced a *nice* macro is in
build_joinrel_partition_info(). But I think it's a legit replacement. I
have added a comment there.


>
> 4) I'm a bit puzzled how we could get here with non-partitioned rels?
>
>  /*
>   * We can not perform partitionwise join if either of the joining
> relations
>   * is not partitioned.
>   */
>  if (!IS_PARTITIONED_REL(rel1) || !IS_PARTITIONED_REL(rel2))
>  return;
>

See the comment I have added in build_joinrel_partition_info(). Not all
joining pairs for a given relation are partitioned.


>
> 5) I find the "merged" flag in RelOptInfo rather unclear, because it
> does not clearly indicate what was merged. Maybe something like
> partbounds_merged would be better?
>

Done.


>
> 6) The try_partitionwise_join function is getting a bit too long and
> harder to understand. The whole block in
>
>  if (joinrel->nparts == -1)
>  {
>  ...
>  }
>
> seems rather well isolated, so I propose to move it to a separate
> function responsible only for the merging. We can simply call it on the
> joinrel, and make it return right away if (joinrel->nparts == -1).
>

Looks like you have already taken care of this one in one of your patches.


>
> 7) I'd suggest not to reference exact functions in comments unless
> absolutely necessary, because it's harder to maintain and it does not
> really explain purpose of the struct/code. E.g. consider this:
>
>  /* Per-partitioned-relation data for
> merge_list_bounds()/merge_range_bounds() */
>  typedef struct PartitionMap
>  { ... }
>
> Why does it matter where is the struct used? That's pretty trivial to
> find using 'git grep' or something. Instead the comment should explain
> the purpose of the struct.
>

Adjusted names and comments a bit.

-- 
Best Wishes,
Ashutosh


Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Robert Haas
On Wed, Apr 1, 2020 at 1:03 PM Peter Geoghegan  wrote:
> I don't think that it's fair to characterize Andres' actions in that
> situation as in any way irresponsible. We had an extremely complicated
> data corruption bug that he went to great lengths to fix, following
> two other incorrect fixes. He was jet lagged from travelling to India
> at the time. He went to huge lengths to make sure that the bug was
> correctly squashed.

I don't mean it as a personal attack on Andres, and I know and am glad
that he worked hard on the problem, but I don't agree that it was the
right decision. Perhaps "irresponsible" is the wrong word, but it's
certainly caused problems for multiple EnterpriseDB customers, and in
my view, those problems weren't necessary. Either a WARNING or an
ERROR would have shown up in the log, but an ERROR terminates VACUUM
for that table and thus basically causes autovacuum to be completely
broken. That is a really big problem. Perhaps you will want to argue,
as Andres did, that the value of having ERROR rather than WARNING in
the log justifies that outcome, but I sure don't agree.

> > Actually removing the code is unnecessary, protects
> > nobody, and has risk.
>
> Every possible approach has risk. We are deciding among several
> unpleasant and risky alternatives here, no?

Sure, but not all levels of risk are equal. Jumping out of a plane
carries some risk of death whether or not you have a parachute, but
that does not mean that we shouldn't worry about whether you have one
or not before you jump.

In this case, I think it is pretty clear that hard-disabling the
feature by always setting old_snapshot_threshold to -1 carries less
risk of breaking unrelated things than removing code that caters to
the feature all over the code base. Perhaps it is not quite as
dramatic as my parachute example, but I think it is pretty clear all
the same that one is a lot more likely to introduce new bugs than the
other. A carefully targeted modification of a few lines of code in 1
file just about has to carry less risk than ~1k lines of code spread
across 40 or so files.

However, I still think that without some more analysis, it's not clear
whether we should go this direction at all. Andres's results suggest
that there are some bugs here, but I think we need more senior hackers
to study the situation before we make a decision about what to do
about them. I certainly haven't had enough time to even fully
understand the problems yet, and nobody else has posted on that topic
at all. I have the highest respect for Andres and his technical
ability, and if he says this stuff has problems, I'm sure it does. Yet
I'm not willing to conclude that because he's tired and frustrated
with this stuff right now, it's unsalvageable. For the benefit of the
whole community, such a claim deserves scrutiny from multiple people.

Is there any chance that you're planning to look into the details?
That would certainly be welcome from my perspective.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: BUG #16109: Postgres planning time is high across version (Expose buffer usage during planning in EXPLAIN)

2020-04-01 Thread Fujii Masao



On 2020/03/31 10:31, Justin Pryzby wrote:

On Wed, Jan 29, 2020 at 12:15:59PM +0100, Julien Rouhaud wrote:

Rebase due to conflict with 3ec20c7091e97.


This is failing to apply probably since 
4a539a25ebfc48329fd656a95f3c1eb2cda38af3.
Could you rebase?   (Also, not sure if this can be set as RFC?)


I updated the patch. Attached.

+/* Compute the difference between two BufferUsage */
+BufferUsage
+ComputeBufferCounters(BufferUsage *start, BufferUsage *stop)

Since BufferUsageAccumDiff() was exported, ComputeBufferCounters() is
no longer necessary. In the patched version, BufferUsageAccumDiff() is
used to calculate the difference of buffer usage.

+   if (es->summary && (planduration || es->buffers))
+   ExplainOpenGroup("Planning", "Planning", true, es);

Isn't it more appropriate to check "bufusage" instead of "es->buffers" here?
The patch changes the code so that "bufusage" is checked.

+   "Planning Time": N.N,+
+ "Shared Hit Blocks": N,+
+ "Shared Read Blocks": N,   +
+ "Shared Dirtied Blocks": N,+

Doesn't this indent look strange? IMO no indent for the buffer usage is
necessary when the format is json, xml, or yaml. This looks
better, at least to me. OTOH, in text format, it seems better to indent
the buffer usage for more readability. Thoughts?
The patch changes the code so that "es->indent" is
incremented/decremented only when the format is text.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index ee0e638f33..b1f3fe13c6 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -372,7 +372,10 @@ ExplainOneQuery(Query *query, int cursorOptions,
PlannedStmt *plan;
instr_time  planstart,
planduration;
+   BufferUsage bufusage_start,
+   bufusage;
 
+   bufusage_start = pgBufferUsage;
INSTR_TIME_SET_CURRENT(planstart);
 
/* plan the query */
@@ -381,9 +384,13 @@ ExplainOneQuery(Query *query, int cursorOptions,
INSTR_TIME_SET_CURRENT(planduration);
INSTR_TIME_SUBTRACT(planduration, planstart);
 
+   /* calc differences of buffer counters. */
+   memset(&bufusage, 0, sizeof(BufferUsage));
+   BufferUsageAccumDiff(&bufusage, &pgBufferUsage, 
&bufusage_start);
+
/* run it (if needed) and produce output */
ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
-  &planduration);
+  &planduration, &bufusage);
}
 }
 
@@ -476,7 +483,8 @@ ExplainOneUtility(Node *utilityStmt, IntoClause *into, 
ExplainState *es,
 void
 ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
   const char *queryString, ParamListInfo params,
-  QueryEnvironment *queryEnv, const instr_time 
*planduration)
+  QueryEnvironment *queryEnv, const instr_time 
*planduration,
+  const BufferUsage *bufusage)
 {
DestReceiver *dest;
QueryDesc  *queryDesc;
@@ -560,6 +568,9 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, 
ExplainState *es,
/* Create textual dump of plan tree */
ExplainPrintPlan(es, queryDesc);
 
+   if (es->summary && (planduration || bufusage))
+   ExplainOpenGroup("Planning", "Planning", true, es);
+
if (es->summary && planduration)
{
double  plantime = INSTR_TIME_GET_DOUBLE(*planduration);
@@ -567,6 +578,19 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, 
ExplainState *es,
ExplainPropertyFloat("Planning Time", "ms", 1000.0 * plantime, 
3, es);
}
 
+   /* Show buffer usage */
+   if (es->summary && bufusage)
+   {
+   if (es->format == EXPLAIN_FORMAT_TEXT)
+   es->indent++;
+   show_buffer_usage(es, bufusage);
+   if (es->format == EXPLAIN_FORMAT_TEXT)
+   es->indent--;
+   }
+
+   if (es->summary && (planduration || bufusage))
+   ExplainCloseGroup("Planning", "Planning", true, es);
+
/* Print info about runtime of triggers */
if (es->analyze)
ExplainPrintTriggers(es, queryDesc);
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 284a5bfbec..d54568e7b2 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -616,7 +616,10 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause 
*into, ExplainState *es,
EState *estate = NULL;
instr_time  planstart;
instr_ti

Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Andres Freund
Hi,

On 2020-04-01 12:02:18 -0400, Robert Haas wrote:
> On Wed, Apr 1, 2020 at 11:09 AM Andres Freund  wrote:
> > There's really no reason at all to have bins of one minute. As it's a
> > PGC_POSTMASTER GUC, it should just have didided time into bins of
> > (old_snapshot_threshold * USEC_PER_SEC) / 100 or such. For a threshold
> > of a week there's no need to keep 10k bins, and the minimum threshold of
> > 1 minute obviously is problematic.
>
> I am very doubtful that this approach would have been adequate. It
> would mean that, with old_snapshot_threshold set to a week, the
> threshold for declaring a snapshot "old" would jump forward 16.8 hours
> at a time. It's hard for me to make a coherent argument right now as
> to exactly what problems that would create, but it's not very
> granular, and a lot of bloat-related things really benefit from more
> granularity. I also don't really see what the problem with keeping a
> bucket per minute in memory is, even for a week. It's only 60 * 24 * 7
> = ~10k buckets, isn't it? That's not really insane for an in-memory
> data structure. I agree that the code that does that maintenance being
> buggy is a problem to whatever extent that is the case, but writing
> the code to have fewer buckets wouldn't necessarily have made it any
> less buggy.

My issue isn't really that it's too many buckets right now, but that it
doesn't scale down to smaller thresholds. I think to be able to develop
this reasonably, it'd need to be able to support thresholds in the
sub-second range. And I don't see how you can have the same binning for
such small thresholds, and for multi-day thresholds - we'd quickly go to
millions of buckets for longer thresholds.

I really think we'd need to support millisecond resolution to make this
properly testable.
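
To sketch what threshold-proportional bins could look like (purely
illustrative; the bucket count, the millisecond floor, and the function name
are my assumptions, not code from any patch):

    /* Illustrative only: derive the bucket width from the threshold. */
    #define OLD_SNAPSHOT_TIME_MAP_BUCKETS 100   /* hypothetical bucket count */

    static int64
    old_snapshot_bucket_width_usecs(int threshold_secs)
    {
        int64       threshold_usecs = (int64) threshold_secs * USECS_PER_SEC;

        /* keep a millisecond floor so sub-second thresholds remain testable */
        return Max(threshold_usecs / OLD_SNAPSHOT_TIME_MAP_BUCKETS, 1000);
    }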


> > GetSnapshotData() using snapshot->lsn = GetXLogInsertRecPtr(); as the
> > basis to detect conflicts seems dangerous to me. Isn't that ignoring
> > inserts that are already in progress?

> How so?

Because it returns the end of the reserved WAL, not how far we've
actually inserted. I.e. there can be in-progress, but not finished,
modifications that will have an LSN < GetXLogInsertRecPtr(). But the
whenTaken timestamp could reflect one that should throw an error for
these in-progress modifications (since the transaction limiting happens
before the WAL logging).

I am not 100%, but I suspect that that could lead to errors not being
thrown that should, because TestForOldSnapshot() will not see these
in-progress modifications as conflicting.

Hm, also, shouldn't
&& PageGetLSN(page) > (snapshot)->lsn)
in TestForOldSnapshot() be an >=?


> > It currently silently causes wrong query results. There's no
> > infrastructure to detect / protect against that (*).
>
> Sure, and what if you break more stuff ripping it out? Ripping this
> volume of code out in a supposedly-stable branch is totally insane
> almost no matter how broken the feature is.

For the back branches I was just thinking of forcing the GUC to be off
(either by disallowing it to be set to on, or just warning when it's set
to true, but not propagating the value).
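
To make that concrete, a sketch of the "disallow setting it" variant, using
a GUC check hook (the hook name and message are my own assumptions, not from
any patch; the alternative would be to clamp *newval to -1 with a WARNING):

    static bool
    check_old_snapshot_threshold(int *newval, void **extra, GucSource source)
    {
        /* refuse anything but the "disabled" value */
        if (*newval >= 0)
        {
            GUC_check_errmsg("old_snapshot_threshold is disabled in this release");
            return false;
        }
        return true;
    }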


> I have no objection to the idea that *if* the feature is hopelessly
> broken, it should be removed.

I would be a lot less inclined to go that way if old_snapshot_threshold

a) weren't explicitly about removing still-needed data - in contrast to
  a lot of other features, where the effects of bugs is temporary, here
  it can be much larger.
b) were a previously working feature, but as far as I can tell, it never really 
did
c) had tests that verify that my fixes actually do the right thing. As
  it stands, I'd not just have to fix the bugs, I'd also have to develop
  a test framework that can test this

While I wish I had been more forceful, and reviewed more of the code to
point out more of the quality issues, I did argue hard against the
feature going in. On account of it being architecturally bad and
impactful. Which I think it has proven to be several times over by
now. And now I'm kind of on the hook to fix it, it seems?


> I also think, and we've had this disagreement before, that you're far
> too willing to say "well, that's wrong so we need to hit it with a
> nuke." I complained when you added those error checks to vacuum in
> back-branches, and since that release went out people are regularly
> tripping those checks and taking prolonged outages for a problem that
> wasn't making them unhappy before. I know that in theory those people
> are better off because their database was always corrupted and now
> they know. But for some of them, those prolonged outages are worse
> than the problem they had before.

I think this is somewhat revisionist. Sure, the errors were added
after like the 10th data corruption bug around freezing that we didn't
find for a long time, because of the lack of errors being thrown. But
the error checks weren't primarily added to find further bugs, but to
prevent data loss due 

Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Peter Geoghegan
On Wed, Apr 1, 2020 at 10:28 AM Robert Haas  wrote:
> Sure, but not all levels of risk are equal. Jumping out of a plane
> carries some risk of death whether or not you have a parachute, but
> that does not mean that we shouldn't worry about whether you have one
> or not before you jump.
>
> In this case, I think it is pretty clear that hard-disabling the
> feature by always setting old_snapshot_threshold to -1 carries less
> risk of breaking unrelated things than removing code that caters to
> the feature all over the code base. Perhaps it is not quite as
> dramatic as my parachute example, but I think it is pretty clear all
> the same that one is a lot more likely to introduce new bugs than the
> other. A carefully targeted modification of a few lines of code in 1
> file just about has to carry less risk than ~1k lines of code spread
> across 40 or so files.

Yeah, that's certainly true. But is that fine point really what
anybody disagrees about? I didn't think that Andres was focussed on
literally ripping it out over just disabling it.

> Is there any chance that you're planning to look into the details?
> That would certainly be welcome from my perspective.

I had a few other things that I was going to work on this week, but
those seem less urgent. I'll take a look into it, and report back
what I find.

-- 
Peter Geoghegan




Re: proposal \gcsv

2020-04-01 Thread Alvaro Herrera
On 2020-Apr-01, Pavel Stehule wrote:

> It can work, but it is not user friendly - my proposal was motivated by
> using some quick csv exports to gplot's pipe.

Can we fix that by adding some syntax to allow command aliases?
So you could add to your .psqlrc something like

\alias \gcsv \pset push all \; \cbuf; \; \pset pop

where the \cbuf is a hypothetical "function" that expands to the current
query buffer.  This needs some refining I guess, but it'd allow you to
create your own shortcuts for the most common features you want without
excessive typing effort.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Andres Freund
Hi,

On 2020-04-01 11:04:43 -0700, Peter Geoghegan wrote:
> On Wed, Apr 1, 2020 at 10:28 AM Robert Haas  wrote:
> > Is there any chance that you're planning to look into the details?
> > That would certainly be welcome from my perspective.

+1

This definitely needs more eyes. I am not even close to understanding
the code fully.


> I had a few other things that I was going to work on this week, but
> those seem less urgent. I'll take a look into it, and report back
> what I find.

Thanks you!

I attached a slightly evolved version of my debugging patch.


Greetings,

Andres Freund
diff --git i/src/backend/utils/time/snapmgr.c w/src/backend/utils/time/snapmgr.c
index 1c063c592ce..6a722863bcf 100644
--- i/src/backend/utils/time/snapmgr.c
+++ w/src/backend/utils/time/snapmgr.c
@@ -123,6 +123,7 @@ typedef struct OldSnapshotControlData
 	int			head_offset;	/* subscript of oldest tracked time */
 	TimestampTz head_timestamp; /* time corresponding to head xid */
 	int			count_used;		/* how many slots are in use */
+	TimestampTz starttime; // rounded to minute
 	TransactionId xid_by_minute[FLEXIBLE_ARRAY_MEMBER];
 } OldSnapshotControlData;
 
@@ -290,6 +291,8 @@ SnapMgrInit(void)
 		oldSnapshotControl->head_offset = 0;
 		oldSnapshotControl->head_timestamp = 0;
 		oldSnapshotControl->count_used = 0;
+		oldSnapshotControl->starttime =
+			AlignTimestampToMinuteBoundary(GetCurrentTimestamp());
 	}
 }
 
@@ -1762,6 +1765,39 @@ SetOldSnapshotThresholdTimestamp(TimestampTz ts, TransactionId xlimit)
 	SpinLockRelease(&oldSnapshotControl->mutex_threshold);
 }
 
+static void PrintOldSnapshotMapping(const char *head, bool already_locked)
+{
+	StringInfoData s;
+
+	initStringInfo(&s);
+
+	if (!already_locked)
+		LWLockAcquire(OldSnapshotTimeMapLock, LW_SHARED);
+
+
+	appendStringInfo(&s, "old snapshot mapping at \"%s\" with head ts: %lu, current entries: %d max entries: %d, offset: %d\n",
+	 head,
+	 (oldSnapshotControl->head_timestamp - oldSnapshotControl->starttime) / USECS_PER_MINUTE,
+	 oldSnapshotControl->count_used,
+	 OLD_SNAPSHOT_TIME_MAP_ENTRIES,
+	 oldSnapshotControl->head_offset);
+
+	for (int off = 0; off < oldSnapshotControl->count_used; off++)
+	{
+		int ringoff = (off + oldSnapshotControl->head_offset) % OLD_SNAPSHOT_TIME_MAP_ENTRIES;
+
+		appendStringInfo(&s, "  entry %d (ring %d): min %ld: xid %d\n",
+		 off, ringoff,
+		 (oldSnapshotControl->head_timestamp - oldSnapshotControl->starttime) / USECS_PER_MINUTE + off,
+		 oldSnapshotControl->xid_by_minute[ringoff]);
+	}
+
+	if (!already_locked)
+		LWLockRelease(OldSnapshotTimeMapLock);
+
+	elog(WARNING, "%s", s.data);
+}
+
 /*
  * TransactionIdLimitedForOldSnapshots
  *
@@ -1826,9 +1862,15 @@ TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
 		{
 			if (ts == update_ts)
 			{
+PrintOldSnapshotMapping("non cached limit via update_ts", false);
+
 xlimit = latest_xmin;
 if (NormalTransactionIdFollows(xlimit, recentXmin))
+{
+	elog(LOG, "increasing threshold from %u to %u (via update_ts)",
+		 recentXmin, xlimit);
 	SetOldSnapshotThresholdTimestamp(ts, xlimit);
+}
 			}
 			else
 			{
@@ -1839,6 +1881,8 @@ TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
 {
 	int			offset;
 
+	PrintOldSnapshotMapping("non cached limit via bins", true);
+
 	offset = ((ts - oldSnapshotControl->head_timestamp)
 			  / USECS_PER_MINUTE);
 	if (offset > oldSnapshotControl->count_used - 1)
@@ -1848,7 +1892,15 @@ TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
 	xlimit = oldSnapshotControl->xid_by_minute[offset];
 
 	if (NormalTransactionIdFollows(xlimit, recentXmin))
+	{
+		elog(LOG, "increasing threshold from %u to %u (via bins)",
+			 recentXmin, xlimit);
 		SetOldSnapshotThresholdTimestamp(ts, xlimit);
+	}
+}
+else
+{
+	// currently debugging output here is pretty darn verbose
 }
 
 LWLockRelease(OldSnapshotTimeMapLock);
@@ -1869,7 +1921,11 @@ TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
 			xlimit = latest_xmin;
 
 		if (NormalTransactionIdFollows(xlimit, recentXmin))
+		{
+			elog(LOG, "increasing prune threshold from %u to %u",
+ recentXmin, xlimit);
 			return xlimit;
+		}
 	}
 
 	return recentXmin;
@@ -1923,14 +1979,14 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
 	 */
 	if (whenTaken < 0)
 	{
-		elog(DEBUG1,
+		elog(PANIC,
 			 "MaintainOldSnapshotTimeMapping called with negative whenTaken = %ld",
 			 (long) whenTaken);
 		return;
 	}
 	if (!TransactionIdIsNormal(xmin))
 	{
-		elog(DEBUG1,
+		elog(PANIC,
 			 "MaintainOldSnapshotTimeMapping called with xmin = %lu",
 			 (unsigned long) xmin);
 		return;
@@ -1938,6 +1994,8 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
 
 	LWLockAcquire(OldSnapshotTimeMapLock, LW_EXCLUSIVE);
 
+	PrintOldSnapshotMapping("before update", true);
+
 	Assert(oldSnapshotControl->head_offset >

Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Andres Freund
Hi,

On 2020-04-01 13:27:56 -0400, Robert Haas wrote:
> Perhaps "irresponsible" is the wrong word, but it's certainly caused
> problems for multiple EnterpriseDB customers, and in my view, those
> problems weren't necessary. Either a WARNING or an ERROR would have
> shown up in the log, but an ERROR terminates VACUUM for that table and
> thus basically causes autovacuum to be completely broken. That is a
> really big problem. Perhaps you will want to argue, as Andres did,
> that the value of having ERROR rather than WARNING in the log
> justifies that outcome, but I sure don't agree.

If that had been a really viable option, I would have done so. At the
very least in the back branches, but quite possibly also in master. Or
if somebody had brought them up as an issue at the time.

What is heap_prepare_freeze_tuple/FreezeMultiXactId supposed to do after
issuing a WARNING in these cases? Without the ERROR, e.g.,

    if (!TransactionIdDidCommit(xid))
        ereport(ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg_internal("uncommitted xmin %u from before xid cutoff %u needs to be frozen",
                                 xid, cutoff_xid)));

would make a deleted tuple visible.

    if (TransactionIdPrecedes(xid, relfrozenxid))
        ereport(ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg_internal("found xmin %u from before relfrozenxid %u",
                                 xid, relfrozenxid)));

would go on to replace the xmin of a potentially uncommitted tuple with
relfrozenxid, making it appear visible.

    if (TransactionIdPrecedes(xid, relfrozenxid))
        ereport(ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg_internal("found xmax %u from before relfrozenxid %u",
                                 xid, relfrozenxid)));

would replace the xmax indicating a potentially deleted tuple with ?, either
making the tuple become, potentially wrongly, visible/invisible

or

    else if (MultiXactIdPrecedes(multi, relminmxid))
        ereport(ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg_internal("found multixact %u from before relminmxid %u",
                                 multi, relminmxid)));

or ...


Just continuing is easier said than done. Especially with the background
of knowing that several users had hit the bug that allowed all of the
above to be hit, and that advancing relfrozenxid further would make it
worse.

Greetings,

Andres Freund




Re: BUG #16109: Postgres planning time is high across version (Expose buffer usage during planning in EXPLAIN)

2020-04-01 Thread Julien Rouhaud
On Wed, Apr 1, 2020 at 7:51 PM Fujii Masao  wrote:
>
>
> On 2020/03/31 10:31, Justin Pryzby wrote:
> > On Wed, Jan 29, 2020 at 12:15:59PM +0100, Julien Rouhaud wrote:
> >> Rebase due to conflict with 3ec20c7091e97.
> >
> > This is failing to apply probably since 
> > 4a539a25ebfc48329fd656a95f3c1eb2cda38af3.
> > Could you rebase?   (Also, not sure if this can be set as RFC?)
>
> I updated the patch. Attached.

Thanks a lot!  I'm sorry I missed Justin's ping, and I just
realized that my cron job that used to warn me about cfbot failures was
broken :(

> +/* Compute the difference between two BufferUsage */
> +BufferUsage
> +ComputeBufferCounters(BufferUsage *start, BufferUsage *stop)
>
> Since BufferUsageAccumDiff() was exported, ComputeBufferCounters() is
> no longer necessary. In the patched version, BufferUsageAccumDiff() is
> used to calculate the difference of buffer usage.

Indeed, exposing BufferUsageAccumDiff was definitely a good thing!

> +   if (es->summary && (planduration || es->buffers))
> +   ExplainOpenGroup("Planning", "Planning", true, es);
>
> Isn't it more appropriate to check "bufusage" instead of "es->buffers" here?
> The patch changes the code so that "bufusage" is checked.

AFAICS not unless ExplainOneQuery is also changed to pass a NULL
pointer if the BUFFER option wasn't provided (and maybe also
optionally skip the planning buffer computation).  With this version
you now get:

=# explain (analyze, buffers off) update t1 set id = id;
  QUERY PLAN
---
 Update on t1  (cost=0.00..22.70 rows=1270 width=42) (actual
time=0.170..0.170 rows=0 loops=1)
   ->  Seq Scan on t1  (cost=0.00..22.70 rows=1270 width=42) (actual
time=0.050..0.054 rows=1 loops=1)
 Planning Time: 1.461 ms
   Buffers: shared hit=25
 Execution Time: 1.071 ms
(5 rows)

which seems wrong to me.

I reused the es->buffers to avoid having needing something like:

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index b1f3fe13c6..9dbff97a32 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -375,7 +375,9 @@ ExplainOneQuery(Query *query, int cursorOptions,
BufferUsage bufusage_start,
bufusage;

-   bufusage_start = pgBufferUsage;
+   if (es->buffers)
+   bufusage_start = pgBufferUsage;
+
INSTR_TIME_SET_CURRENT(planstart);

/* plan the query */
@@ -384,13 +386,16 @@ ExplainOneQuery(Query *query, int cursorOptions,
INSTR_TIME_SET_CURRENT(planduration);
INSTR_TIME_SUBTRACT(planduration, planstart);

-   /* calc differences of buffer counters. */
-   memset(&bufusage, 0, sizeof(BufferUsage));
-   BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
+   if (es->buffers)
+   {
+   /* calc differences of buffer counters. */
+   memset(&bufusage, 0, sizeof(BufferUsage));
+   BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
+   }

/* run it (if needed) and produce output */
ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
-  &planduration, &bufusage);
+  &planduration, (es->buffers ? &bufusage : NULL));
}

which seemed like a win, but I'm not opposed to doing that if you prefer.

>
> +   "Planning Time": N.N,+
> + "Shared Hit Blocks": N,+
> + "Shared Read Blocks": N,   +
> + "Shared Dirtied Blocks": N,+
>
> Doesn't this indent look strange? IMO no indent for buffer usage is
> necessary when the format is either json, xml, and yaml. This looks
> better at least for me. OTOH, in text format, it seems better to indent
> the buffer usage for more readability. Thought?
> The patch changes the code so that "es->indent" is
> increment/decrement only when the format is text.

Indeed, that's way better!




Re: [PATCH] Check operator when creating unique index on partition table

2020-04-01 Thread Tom Lane
Guancheng Luo  writes:
> On Mar 26, 2020, at 01:00, Tom Lane  wrote:
>> This would reject, for example, a hash index associated with a btree-based
>> partition constraint, but I'm not sure we're losing anything much thereby.

> There are cases when a BTREE index is associated with a HASH partition key, but I
> think we should allow them,
> as long as their equality operators consider the same value as equal.

Ah, yeah, I see we already have regression test cases that require that.

I pushed the patch with some cosmetic revisions.  I left out the
regression test case though; it seemed pretty expensive considering
that the code is already being exercised by existing cases.

Thanks for the report and patch!

regards, tom lane




Re: tweaking perfect hash multipliers

2020-04-01 Thread John Naylor
On Tue, Mar 31, 2020 at 4:05 PM John Naylor  wrote:
>
> On Tue, Mar 31, 2020 at 2:31 AM Andres Freund  wrote:
> > I think the form of lea generated here is among the ones that can only
> > be executed on port 1. Whereas e.g. an register+register/immediate add
> > can be executed on four different ports.
>
> I looked into slow vs. fast leas, and I think the above are actually
> fast because they have 2 operands.

No, scratch that, it seems the two forms of lea are:

leal (,%rdx,8), %ecx
leal (%rdx,%rdx,8), %ecx

The first operand in both is the implicit zero, so with 3 and 5 we do
get the slow lea on some architectures. So I've only kept the
shift-and-add multipliers in v2. I also changed the order of iteration
of the parameters, for speed. Before, it took over 30 seconds to build
the unicode quick check tables, now it takes under 2 seconds.
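
For anyone following along, a tiny illustration of what "shift-and-add
multipliers" means here (my own example, not code from the patch):

    /* multiply by 9 = 8 + 1: one shift and one add, no slow lea required */
    static inline uint32
    mul9(uint32 h)
    {
        return (h << 3) + h;
    }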

--
John Naylor                   https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


v2-tweak-perfect-hash-gen.patch
Description: Binary data


Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Robert Haas
On Wed, Apr 1, 2020 at 2:37 PM Andres Freund  wrote:
> Just continuing is easier said than done. Especially with the background
> of knowing that several users had hit the bug that allowed all of the
> above to be hit, and that advancing relfrozenxid further would make it
> worse.

Fair point, but it seems we're arguing over nothing here, or at least
nothing relevant to this thread, because it sounds like, if we are
going to disable it, you're OK with doing that by just shutting it
off in the code rather than trying to remove it all. I had the opposite
impression from your first email.

Sorry to have derailed the thread, and for my poor choice of words.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Kevin Grittner
On Wed, Apr 1, 2020 at 10:09 AM Andres Freund  wrote:

First off, many thanks to Andres for investigating this, and apologies for
the bugs.  Also thanks to Michael for making sure I saw the thread.  I must
also apologize for not being able to track the community lists
consistently due to health issues that are exacerbated by stress, and the
fact that these lists often push past my current limits.  I'll try to help
in this as best I can.

> Do we actually have any evidence of this feature ever being used? I
> didn't find much evidence for that in the archives (except Thomas
> finding a problem).


This was added because a very large company trying to convert from Oracle
had a test that started to show some slowdown on PostgreSQL after 8 hours,
serious slowdown by 24 hours, and crashed hard before it could get to 48
hours -- due to lingering WITH HOLD cursors left by ODBC code.  They had
millions of lines of code that would need to be rewritten without this
feature.  With this feature (set to 20 minutes, if I recall correctly),
their unmodified code ran successfully for at least three months solid
without failure or corruption.  Last I heard, they were converting a large
number of instances from Oracle to PostgreSQL, and those would all fail
hard within days of running with this feature removed or disabled.

Also, VMware is using PostgreSQL as an embedded part of many products, and
this feature was enabled to deal with similar failures due to ODBC cursors;
so the number of instances running 24/7 under high load which have shown a
clear benefit from enabling this feature has a lot of zeros.

Perhaps the lack of evidence for usage in the archives indicates a low
frequency of real-world failures due to the feature, rather than lack of
use?  I'm not doubting that Andres found real issues that should be fixed,
but perhaps not very many people who are using the feature have more than
two billion transactions within the time threshold, and perhaps the other
problems are not as big as the problems solved by use of the feature -- at
least in some cases.

To save readers who have not yet done the math some effort, at the 20
minute threshold used by the initial user, they would need to have a
sustained rate of consumption of transaction IDs of over 66 million per
second to experience wraparound problems, and at the longest threshold I
have seen it would need to exceed an average of 461,893 TPS for three days
solid to hit wraparound.  Those aren't impossible rates to hit, but in
practice it might not be a frequent occurrence yet on modern hardware with
some real-world applications.  Hopefully we can find a way to fix this
before those rates become common.

I am reviewing the issue and patches now, and hope I can make some useful
contribution to the discussion.

-- 
Kevin Grittner
VMware vCenter Server
https://www.vmware.com/


Re: error context for vacuum to include block number

2020-04-01 Thread Andres Freund
On 2020-04-01 07:54:45 +0530, Amit Kapila wrote:
> Pushed. I think we are done here.  The patch is marked as committed in
> CF.  Thank you!

Awesome! Thanks for all your work on this, all. This'll make it a lot
easier to debug errors during autovacuum.




Re: error context for vacuum to include block number

2020-04-01 Thread Alvaro Herrera
On 2020-Apr-01, Andres Freund wrote:

> On 2020-04-01 07:54:45 +0530, Amit Kapila wrote:
> > Pushed. I think we are done here.  The patch is marked as committed in
> > CF.  Thank you!
> 
> Awesome! Thanks for all your work on this, all. This'll make it a lot
> easier to debug errors during autovacuum.

Seconded!

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Andres Freund
Hi,

Nice to have you back for a bit! Even if the circumstances aren't
great...

It's very understandable that the lists are past your limits, I barely
keep up these days. Without any health issues.


On 2020-04-01 14:10:09 -0500, Kevin Grittner wrote:
> Perhaps the lack of evidence for usage in the archives indicates a low
> frequency of real-world failures due to the feature, rather than lack of
> use?  I'm not doubting that Andres found real issues that should be fixed,
> but perhaps not very many people who are using the feature have more than
> two billion transactions within the time threshold, and perhaps the other
> problems are not as big as the problems solved by use of the feature -- at
> least in some cases.

> To save readers who have not yet done the math some effort, at the 20
> minute threshold used by the initial user, they would need to have a
> sustained rate of consumption of transaction IDs of over 66 million per
> second to experience wraparound problems, and at the longest threshold I
> have seen it would need to exceed an average of 461,893 TPS for three days
> solid to hit wraparound.  Those aren't impossible rates to hit, but in
> practice it might not be a frequent occurrence yet on modern hardware with
> some real-world applications.  Hopefully we can find a way to fix this
> before those rates become common.

The wraparound issue on their own wouldn't be that bad - when I found it
I did play around with a few ideas for how to fix it. The most practical
would probably be to have MaintainOldSnapshotTimeMapping() scan all
buckets when a new oldSnapshotControl->oldest_xid is older than
RecentGlobalXmin. There's no benefit in the contents of those buckets
anyway, since we know that we can freeze those independent of
old_snapshot_threshold.

The thing that makes me really worried is that the contents of the time
mapping seem very wrong. I've reproduced query results in a REPEATABLE
READ transaction changing (pruned without triggering an error). And I've
reproduced rows not getting removed for much longer than they
should, according to old_snapshot_threshold.


I suspect one reason for users not noticing either is that

a) it's plausible that users of the feature would mostly have
  long-running queries/transactions querying immutable or insert only
  data. Those would not notice that, on other tables, rows are getting
  removed, where access would not trigger the required error.

b) I observe long-ish phases where no cleanup is happening (due to
  oldSnapshotControl->head_timestamp getting updated more often than
  correct). But if old_snapshot_threshold is small enough in relation to
  the time the generated bloat becomes problematic, there will still be
  occasions to actually perform cleanup.

Greetings,

Andres Freund




Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Kevin Grittner
On Wed, Apr 1, 2020 at 2:43 PM Andres Freund  wrote:

> The thing that makes me really worried is that the contents of the time
> mapping seem very wrong. I've reproduced query results in a REPEATABLE
> READ transaction changing (pruned without triggering an error).


That is a very big problem.  On the sort-of bright side (ironic in light of
the fact that I'm a big proponent of using serializable transactions), none
of the uses that I have personally seen of this feature use anything other
than the default READ COMMITTED isolation level.  That might help explain
the lack of complaints for those using the feature.  But yeah, I REALLY
want to see a solid fix for that!


> And I've
> reproduced rows not getting removed for much longer than they
> should, according to old_snapshot_threshold.
>
> I suspect one reason for users not noticing either is that
>
> a) it's plausible that users of the feature would mostly have
>   long-running queries/transactions querying immutable or insert only
>   data. Those would not notice that, on other tables, rows are getting
>   removed, where access would not trigger the required error.
>
> b) I observe long-ish phases where no cleanup is happening (due to
>   oldSnapshotControl->head_timestamp getting updated more often than
>   correct). But if old_snapshot_threshold is small enough in relation to
>   the time the generated bloat becomes problematic, there will still be
>   occasions to actually perform cleanup.
>

Keep in mind that the real goal of this feature is not to eagerly _see_
"snapshot too old" errors, but to prevent accidental debilitating bloat due
to one misbehaving user connection.  This is particularly easy to see (and
therefore unnervingly common) for those using ODBC, which in my experience
tends to correspond to the largest companies which are using PostgreSQL.
In some cases, the snapshot which is preventing removal of the rows will
never be used again; removal of the rows will not actually affect the
result of any query, but only the size and performance of the database.
This is a "soft limit" -- kinda like max_wal_size.  Where there was a
trade-off between accuracy of the limit and performance, the less accurate
way was intentionally chosen.  I apologize for not making that more clear
in comments.

While occasional "snapshot too old" errors are an inconvenient side effect
of achieving the primary goal, it might be of interest to know that the
initial (very large corporate) user of this feature had, under Oracle,
intentionally used a cursor that would be held open as long as a user chose
to leave a list open for scrolling around.  They used cursor features for
as long as the cursor allowed.  This could be left open for days or weeks
(or longer?).  Their query ordered by a unique index, and tracked the ends
of the currently displayed portion of the list so that if they happened to
hit the "snapshot too old" error they could deallocate and restart the
cursor and reposition before moving forward or back to the newly requested
rows.  They were not willing to convert to PostgreSQL unless this approach
continued to work.

In Summary:
(1) It's not urgent that rows always be removed as soon as possible after
the threshold is crossed as long as they don't often linger too awfully far
past that limit and allow debilitating bloat.
(2) It _is_ a problem if results inconsistent with the snapshot are
returned -- a "snapshot too old" error is necessary.
(3) Obviously, wraparound problems need to be solved.

I hope this is helpful.

-- 
Kevin Grittner
VMware vCenter Server
https://www.vmware.com/


Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Robert Haas
On Wed, Apr 1, 2020 at 3:43 PM Andres Freund  wrote:
> The thing that makes me really worried is that the contents of the time
> mapping seem very wrong. I've reproduced query results in a REPEATABLE
> READ transaction changing (pruned without triggering an error). And I've
> reproduced rows not getting removed for much longer than they
> should, according to old_snapshot_threshold.

I think it would be a good idea to add a system view that shows the
contents of the mapping. We could make it a contrib module, if you
like, so that it can even be installed on back branches. We'd need to
move the structure definition from snapmgr.c to a header file, but
that doesn't seem like such a big deal.

Maybe that contrib module could even have some functions to simulate
aging without the passage of any real time. Like, say you have a
function or procedure old_snapshot_pretend_time_has_passed(integer),
and it moves oldSnapshotControl->head_timestamp backwards by that
amount. Maybe that would require updating some other fields in
oldSnapshotControl too but it doesn't seem like we'd need to do a
whole lot.
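
A sketch of what the SQL surface of such a contrib module could look like
(all of these names are hypothetical; only old_snapshot_pretend_time_has_passed
is taken from the suggestion above):

    -- expose the mapping for inspection
    CREATE FUNCTION pg_old_snapshot_time_mapping(
        OUT array_offset int4,
        OUT end_timestamp timestamptz,
        OUT newest_xmin xid)
    RETURNS SETOF record
    AS 'MODULE_PATHNAME', 'pg_old_snapshot_time_mapping'
    LANGUAGE C STRICT;

    -- simulate aging without waiting for wall-clock time to pass
    CREATE FUNCTION old_snapshot_pretend_time_has_passed(minutes int4)
    RETURNS void
    AS 'MODULE_PATHNAME', 'old_snapshot_pretend_time_has_passed'
    LANGUAGE C STRICT;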

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: backup manifests

2020-04-01 Thread Andres Freund
Hi,

On 2020-03-31 22:15:04 -0700, Noah Misch wrote:
> On Tue, Mar 31, 2020 at 03:50:34PM -0700, Andres Freund wrote:
> > On 2020-03-31 14:10:34 -0400, Robert Haas wrote:
> > > +/*
> > > + * Attempt to parse the WAL files required to restore from backup using
> > > + * pg_waldump.
> > > + */
> > > +static void
> > > +parse_required_wal(validator_context *context, char *pg_waldump_path,
> > > +char *wal_directory, manifest_wal_range 
> > > *first_wal_range)
> > > +{
> > > + manifest_wal_range *this_wal_range = first_wal_range;
> > > +
> > > + while (this_wal_range != NULL)
> > > + {
> > > + char *pg_waldump_cmd;
> > > +
> > > + pg_waldump_cmd = psprintf("\"%s\" --quiet --path=\"%s\" 
> > > --timeline=%u --start=%X/%X --end=%X/%X\n",
> > > +pg_waldump_path, wal_directory, this_wal_range->tli,
> > > +(uint32) (this_wal_range->start_lsn >> 32),
> > > +(uint32) this_wal_range->start_lsn,
> > > +(uint32) (this_wal_range->end_lsn >> 32),
> > > +(uint32) this_wal_range->end_lsn);
> > > + if (system(pg_waldump_cmd) != 0)
> > > + report_backup_error(context,
> > > + "WAL parsing 
> > > failed for timeline %u",
> > > + 
> > > this_wal_range->tli);
> > > +
> > > + this_wal_range = this_wal_range->next;
> > > + }
> > > +}
> > 
> > Should we have a function to properly escape paths in cases like this?
> > Not that it's likely or really problematic, but the quoting for path
> > could be "circumvented".
> 
> Are you looking for appendShellString(), or something different?

Looks like that'd be it. Thanks.
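
For concreteness, a sketch of how the command construction quoted above might
use appendShellString() instead of the hand-rolled "\"%s\"" quoting (purely
illustrative, not the patch's code; it reuses the variables shown upthread):

    PQExpBuffer cmd = createPQExpBuffer();

    /* the path to pg_waldump itself arguably deserves the same treatment */
    appendShellString(cmd, pg_waldump_path);
    appendPQExpBufferStr(cmd, " --quiet --path=");
    appendShellString(cmd, wal_directory);
    appendPQExpBuffer(cmd, " --timeline=%u --start=%X/%X --end=%X/%X",
                      this_wal_range->tli,
                      (uint32) (this_wal_range->start_lsn >> 32),
                      (uint32) this_wal_range->start_lsn,
                      (uint32) (this_wal_range->end_lsn >> 32),
                      (uint32) this_wal_range->end_lsn);
    if (system(cmd->data) != 0)
        report_backup_error(context, "WAL parsing failed for timeline %u",
                            this_wal_range->tli);
    destroyPQExpBuffer(cmd);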

Greetings,

Andres Freund




Re: backup manifests

2020-04-01 Thread Andres Freund
Hi,

On 2020-03-31 14:56:07 +0530, Amit Kapila wrote:
> On Tue, Mar 31, 2020 at 11:10 AM Noah Misch  wrote:
> > On Mon, Mar 30, 2020 at 12:16:31PM -0700, Andres Freund wrote:
> > > On 2020-03-30 15:04:55 -0400, Robert Haas wrote:
> > > I'm mildly inclined to name it pg_validate, pg_validate_dbdir or
> > > such. And eventually (definitely not this release) subsume pg_checksums
> > > in it. That way we can add other checkers too.
> >
> > Works for me; of those two, I prefer pg_validate.
> >
> 
> pg_validate sounds like a tool with a much bigger purpose.  I think
> even things like amcheck could also fall under it.

Intentionally so. We don't serve our users by collecting a lot of
differently named commands to work with data directories. A I wrote
above, the point would be to eventually have that tool also perform
checksum validation etc.  Potentially even in a single pass over the
data directory.


> This patch has two parts (a) Generate backup manifests for base
> backups, and (b) Validate backup (manifest).  It seems to me that
> there are not many things pending for (a), can't we commit that first
> or is it the case that (a) depends on (b)?  This is *not* a suggestion
> to leave pg_validatebackup from this release rather just to commit if
> something is ready and meaningful on its own.

IDK, it seems easier to be able to modify both at the same time.

Greetings,

Andres Freund




Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Andres Freund
Hi,

On 2020-04-01 15:11:52 -0500, Kevin Grittner wrote:
> On Wed, Apr 1, 2020 at 2:43 PM Andres Freund  wrote:
>
> > The thing that makes me really worried is that the contents of the time
> > mapping seem very wrong. I've reproduced query results in a REPEATABLE
> > READ transaction changing (pruned without triggering an error).
>
>
> That is a very big problem.  On the sort-of bright side (ironic in light of
> the fact that I'm a big proponent of using serializable transactions), none
> of the uses that I have personally seen of this feature use anything other
> than the default READ COMMITTED isolation level.  That might help explain
> the lack of complaints for those using the feature.  But yeah, I REALLY
> want to see a solid fix for that!

I don't think it's dependent on RR - it's just a bit easier to verify
that the query results are wrong that way.


> > And I've
> > reproduced rows not getting removed for much longer than they
> > should, according to old_snapshot_threshold.
> >
> > I suspect one reason for users not noticing either is that
> >
> > a) it's plausible that users of the feature would mostly have
> >   long-running queries/transactions querying immutable or insert only
> >   data. Those would not notice that, on other tables, rows are getting
> >   removed, where access would not trigger the required error.
> >
> > b) I observe long-ish phases where no cleanup is happening (due to
> >   oldSnapshotControl->head_timestamp getting updated more often than
> >   correct). But if old_snapshot_threshold is small enough in relation to
> >   the time the generated bloat becomes problematic, there will still be
> >   occasions to actually perform cleanup.
> >
>
> Keep in mind that the real goal of this feature is not to eagerly _see_
> "snapshot too old" errors, but to prevent accidental debilitating bloat due
> to one misbehaving user connection.

I don't think it's an "intentional" inaccuracy issue leading to
this. The map contents are just wrong; in particular, the head_timestamp
most of the time is so new that
TransactionIdLimitedForOldSnapshots() never even consults the mapping. When
filling a new bucket, MaintainOldSnapshotTimeMapping() unconditionally updates
oldSnapshotControl->head_timestamp to be the current minute, which means
it'll take old_snapshot_threshold minutes till
TransactionIdLimitedForOldSnapshots() even looks at the mapping again.

As far as I can tell, with a large old_snapshot_threshold, it can take a
very long time to get to a head_timestamp that's old enough for
TransactionIdLimitedForOldSnapshots() to do anything.  Look at this
trace of a pgbench run with old_snapshot_threshold enabled, showing some of
the debugging output added in the patch upthread.

This is with a threshold of 10min, in a freshly started database:
> 2020-04-01 13:49:00.000 PDT [1268502][2/43571:2068881994] WARNING:  head 0 
> min: filling 1 buckets starting at 0 for whenTaken 1 min, with xmin 2068881994
> 2020-04-01 13:49:00.000 PDT [1268502][2/43571:2068881994] WARNING:  old 
> snapshot mapping at "after update" with head ts: 1, current entries: 2 max 
> entries: 20, offset: 0
> entry 0 (ring 0): min 1: xid 2068447214
> entry 1 (ring 1): min 2: xid 2068881994
>
> 2020-04-01 13:50:00.000 PDT [1268505][5/122542:0] WARNING:  old snapshot 
> mapping at "before update" with head ts: 1, current entries: 2 max entries: 
> 20, offset: 0
> entry 0 (ring 0): min 1: xid 2068447214
> entry 1 (ring 1): min 2: xid 2068881994
>
> 2020-04-01 13:50:00.000 PDT [1268505][5/122542:0] WARNING:  head 1 min: 
> updating existing bucket 1 for whenTaken 2 min, with xmin 2069199511
> 2020-04-01 13:50:00.000 PDT [1268505][5/122542:0] WARNING:  old snapshot 
> mapping at "after update" with head ts: 1, current entries: 2 max entries: 
> 20, offset: 0
> entry 0 (ring 0): min 1: xid 2068447214
> entry 1 (ring 1): min 2: xid 2069199511
>
> 2020-04-01 13:51:00.000 PDT [1268502][2/202674:2069516501] WARNING:  old 
> snapshot mapping at "before update" with head ts: 1, current entries: 2 max 
> entries: 20, offset: 0
> entry 0 (ring 0): min 1: xid 2068447214
> entry 1 (ring 1): min 2: xid 2069199511
>
> 2020-04-01 13:51:00.000 PDT [1268502][2/202674:2069516501] WARNING:  head 1 
> min: filling 2 buckets starting at 0 for whenTaken 3 min, with xmin 2069516499
> 2020-04-01 13:51:00.000 PDT [1268502][2/202674:2069516501] WARNING:  old 
> snapshot mapping at "after update" with head ts: 3, current entries: 4 max 
> entries: 20, offset: 0
> entry 0 (ring 0): min 3: xid 2068447214
> entry 1 (ring 1): min 4: xid 2069199511
> entry 2 (ring 2): min 5: xid 2069516499
> entry 3 (ring 3): min 6: xid 2069516499
> ...
> 2020-04-01 14:03:00.000 PDT [1268504][4/1158832:2075918094] WARNING:  old 
> snapshot mapping at "before update" with head ts: 7, current entries: 8 max 
> entries: 20, offset: 0
> entry 0 (ring 0): min 7: xid 2068447214
> entry 1 (ring 1): min 8

Proposal: Expose oldest xmin as SQL function for monitoring

2020-04-01 Thread James Coleman
Currently there's no good way that I'm aware of for monitoring
software to check what the xmin horizon is being blocked at. You can
check pg_stat_replication and pg_replication_slots and
txid_snapshot_xmin(txid_current_snapshot()) and so on, but that list
can grow, and it means monitoring setups need to update any time any
new feature might hold another snapshot and expose it in a different
way.
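
For illustration, the sort of piecemeal checks I mean (a sketch only,
certainly not an exhaustive list -- which is exactly the problem):

-- oldest xmin held by any backend
SELECT pid, backend_xmin, age(backend_xmin) AS xmin_age
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 5;

-- replication slots (including inactive ones) pinning an xmin
SELECT slot_name, active, xmin, catalog_xmin
FROM pg_replication_slots;

-- prepared transactions
SELECT gid, prepared, transaction AS xid
FROM pg_prepared_xacts
ORDER BY prepared;

-- xmin of the snapshot we just took
SELECT txid_snapshot_xmin(txid_current_snapshot());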

To my knowledge the current oldest xmin (GetOldestXmin() if I'm not
mistaken) isn't exposed directly in any view or function by Postgres.

Am I missing anything in the above description? And if not, would
there be any reason why we would want to avoid exposing that
information? And if not, then would exposing it as a function be
acceptable?

Thanks,
James




Re: backup manifests

2020-04-01 Thread David Steele

On 3/31/20 7:57 AM, Robert Haas wrote:

On Mon, Mar 30, 2020 at 7:24 PM David Steele  wrote:

I'm confused as to why you're not seeing that. What's the exact
sequence of steps?


$ pg_basebackup -D test/backup5 --manifest-checksums=SHA256

$ vi test/backup5/backup_manifest
  * Add 'X' to the checksum of backup_label

$ pg_validatebackup test/backup5
pg_validatebackup: fatal: invalid checksum for file "backup_label":
"a98e9164fd59d498d14cfdf19c67d1c2208a30e7b939d1b4a09f524c7adfc11fX"

No mention of the manifest checksum being invalid.  But if I remove the
backup label file from the manifest:

pg_validatebackup: fatal: manifest checksum mismatch


Oh, I see what's happening now. If the checksum is not an even-length
string of hexadecimal characters, it's treated as a syntax error, so
it bails out at that point. Generally, a syntax error in the manifest
file is treated as a fatal error, and you just die right there. You'd
get the same behavior if you had malformed JSON, like a stray { or }
or [ or ] someplace that it doesn't belong according to the rules of
JSON. On the other hand, if you corrupt the checksum by adding AA or
EE or 54 or some other even-length string of hex characters, then you
have (in this code's view) a semantic error rather than a syntax
error, so it will finish loading all the manifest data and then bail
because the checksum doesn't match.

We really can't avoid bailing out early sometimes, because if the file
is totally malformed at the JSON level, there's just no way to
continue. We could cause this particular error to get treated as a
semantic error rather than a syntax error, but I don't really see much
advantage in so doing. This way was easier to code, and I don't think
it really matters which error we find first.


I think it would be good to know that the manifest checksum is bad in 
all cases because that may well inform other errors.


That said, I know you have a lot on your plate with this patch so I'm 
not going to make a fuss about such a minor gripe. Perhaps this can be 
considered for future improvement.


Regards,
--
-David
da...@pgmasters.net




Re: [PATCH] ltree, lquery, and ltxtquery binary protocol support

2020-04-01 Thread Tom Lane
I wrote:
> Fortunately for the odds of getting this patch accepted, we just
> pushed an ALTER TYPE improvement that will solve your problem [1].
> Please replace ltree--1.2.sql with an upgrade script that uses
> that, and resubmit.

I decided it would be a shame for this to miss v13, seeing that
(a) it'd provide real-world use of that ALTER TYPE code, and
(b) we already bumped ltree's extension version for v13.

So I went ahead and rebased this, reviewed and pushed it.  The
main non-cosmetic thing I changed, other than using ALTER TYPE,
was that you had the output functions looking like this:

pq_sendtext(&buf, res, pg_mbstrlen(res));

That's just wrong; it should be plain strlen().  Did you copy
that coding from someplace?  I could not find any similar
occurrences in our code tree.

regards, tom lane




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-04-01 Thread Tomas Vondra

On Wed, Apr 01, 2020 at 09:05:27AM -0400, James Coleman wrote:

On Tue, Mar 31, 2020 at 11:07 PM James Coleman  wrote:


On Tue, Mar 31, 2020 at 10:44 PM Tomas Vondra
 wrote:
>
> On Tue, Mar 31, 2020 at 10:12:29PM -0400, James Coleman wrote:
> >On Tue, Mar 31, 2020 at 9:59 PM Tomas Vondra
> > wrote:
> >>
> >> On Tue, Mar 31, 2020 at 08:42:47PM -0400, James Coleman wrote:
> >> >On Tue, Mar 31, 2020 at 8:38 PM Tomas Vondra
> >> > wrote:
> >> >>
> >> >> On Tue, Mar 31, 2020 at 08:11:15PM -0400, James Coleman wrote:
> >> >> >On Tue, Mar 31, 2020 at 7:56 PM Tomas Vondra
> >> >> > wrote:

...

> >> >> >One small idea, but I'm not yet sure it helps us a whole lot: if the
> >> >> >query pathkeys is only length 1, then we could skip the additional
> >> >> >path creation.
> >> >> >
> >> >>
> >> >> I don't follow. Why would we create incremental sort in this case at
> >> >> all? With single-element query_pathkeys the path is either unsorted or
> >> >> fully sorted - there's no room for incremental sort. No?
> >> >
> >> >Well, we shouldn't, that's what I'm getting at. But I didn't see anything
> >> >in the code now that explicitly excludes that case when deciding
> >> >whether or not to create an incremental sort path, unless I'm missing
> >> >something obvious.
> >>
> >> Well, my point is that create_ordered_paths() looks like this:
> >>
> >>  is_sorted = pathkeys_common_contained_in(root->sort_pathkeys, ...);
> >>
> >>  if (is_sorted)
> >>  {
> >>  ... old code
> >>  }
> >>  else
> >>  {
> >>  if (input_path == cheapest_input_path)
> >>  {
> >>  ... old code
> >>  }
> >>
> >>  /* With incremental sort disabled, don't build those paths. */
> >>  if (!enable_incrementalsort)
> >>  continue;
> >>
> >>  /* Likewise, if the path can't be used for incremental sort. */
> >>  if (!presorted_keys)
> >>  continue;
> >>
> >>  ... incremental sort path
> >>  }
> >>
> >> Now, with single-item sort_pathkeys, the input path can't be partially
> >> sorted. It's either fully sorted - in which case it's handled by the
> >> first branch. Or it's not sorted at all, so presorted_keys==0 and we
> >> never get to the incremental path.
> >>
> >> Or did you mean to use the optimization somewhere else?
> >
> >Hmm, yes, I didn't think through that properly. I'll have to look at
> >the other cases to confirm the same logic applies there.


I looked through this more carefully, and I did end up finding a few
places where we can skip iterating through a list of paths entirely
with this check, so I added it there. I also cleaned up some comments,
added comments and asserts to the other places where
list_length(pathkeys) should be guaranteed to be > 1, removed a few
asserts I found unnecessary, and merged duplicative
pathkeys_[count_]_contained_in calls.



OK


> >One other thing: in the code above we create the regular sort path
> >inside of `if (input_path == cheapest_input_path)`, but incremental
> >sort is outside of that condition. I'm not sure I'm remembering why
> >that was, and it's not obvious to me reading it right now (though it's
> >getting late here, so maybe I'm just not thinking clearly). Do you
> >happen to remember why that is?
> >
>
> It's because for the regular sort, the path is either already sorted or
> it requires a full sort. But full sort only makes sense on the cheapest
> path, because we assume the additional sort cost is independent of the
> input cost, essentially
>
> cost(path + Sort) = cost(path) + cost(Sort)
>
> and it's always
>
>  cost(path) + cost(Sort) >= cost(cheapest path) + cost(Sort)
>
> and by checking for cheapest path we simply skip building all the paths
> that we'd end up discarding anyway.
>
> With incremental sort we can't do this, because the cost of the incremental
> sort depends on how well presorted the input path is.


Thanks for the explanation. I've added a comment to that effect.



Thanks.

I've realized the way get_useful_pathkeys_for_relation() is coded kinda
works against the fastpath we added when comparing pathkeys. That
depends on comparing pointers to the list, but we've been building new
lists (and then returning those), which defeats the optimization. Attached
is a patch that returns the original list in most cases (and only
creates a copy when really necessary). This might also save a few cycles
on building the new list, of course.

I've done a bunch of read-only pgbench tests with fairly small scales (1
and 10). First with the built-in read-only transaction, and also with a
simple custom query doing an order-by. And I did this both on the
default schema and with a bunch of extra indexes. The script I used to
run this is attached, along with a summary of results.

There are results for master and v40 and v50 patches (the v50 also
includes the extra patch fixing get_useful_pathkeys_for_relation).

Overall, I'm happy with those results - the v50 seems to be 

Re: Proposal: Expose oldest xmin as SQL function for monitoring

2020-04-01 Thread Tom Lane
James Coleman  writes:
> To my knowledge the current oldest xmin (GetOldestXmin() if I'm not
> mistaken) isn't exposed directly in any view or function by Postgres.

You could do something like

select max(age(backend_xmin)) from pg_stat_activity;

though I'm not sure whether that accounts for absolutely every process.

> Am I missing anything in the above description? And if not, would
> there be any reason why we would want to avoid exposing that
> information? And if not, then would exposing it as a function be
> acceptable?

The fact that I had to use max(age(...)) in that sample query
hints at one reason: it's really hard to do arithmetic correctly
on raw XIDs.  Dealing with wraparound is a problem, and knowing
what's past or future is even harder.  What use-case do you
foresee exactly?

regards, tom lane




Re: [PATCH] ltree, lquery, and ltxtquery binary protocol support

2020-04-01 Thread Nino Floris
Hi Tom,

Thanks a lot for pushing this through.

In complete agreement on fixing mbstrlen, it would clearly have led to cut
off string sends, or worse (does the binary protocol use null terminated
strings, or are they length prefixed?). Apologies anyways, it's been a
while so I don't know how it may have ended up there, thanks for catching
it.

Even though it was a bit bumpy, and vastly different from my usual github
contribution flow, I've had a good time contributing my first patch!

Best regards,
Nino Floris


On Apr 1, 2020 at 23:37, Tom Lane  wrote:

I wrote:
> Fortunately for the odds of getting this patch accepted, we just
> pushed an ALTER TYPE improvement that will solve your problem [1].
> Please replace ltree--1.2.sql with an upgrade script that uses
> that, and resubmit.

I decided it would be a shame for this to miss v13, seeing that
(a) it'd provide real-world use of that ALTER TYPE code, and
(b) we already bumped ltree's extension version for v13.

So I went ahead and rebased this, reviewed and pushed it.  The
main non-cosmetic thing I changed, other than using ALTER TYPE,
was that you had the output functions looking like this:

pq_sendtext(&buf, res, pg_mbstrlen(res));

That's just wrong; it should be plain strlen().  Did you copy
that coding from someplace?  I could not find any similar
occurrences in our code tree.

regards, tom lane


Re: Proposal: Expose oldest xmin as SQL function for monitoring

2020-04-01 Thread Alvaro Herrera
On 2020-Apr-01, Tom Lane wrote:

> James Coleman  writes:
> > To my knowledge the current oldest xmin (GetOldestXmin() if I'm not
> > mistaken) isn't exposed directly in any view or function by Postgres.
> 
> You could do something like
> 
> select max(age(backend_xmin)) from pg_stat_activity;
> 
> though I'm not sure whether that accounts for absolutely every process.
>
> > Am I missing anything in the above description? And if not, would
> > there be any reason why we would want to avoid exposing that
> > information? And if not, then would exposing it as a function be
> > acceptable?
> 
> The fact that I had to use max(age(...)) in that sample query
> hints at one reason: it's really hard to do arithmetic correctly
> on raw XIDs.  Dealing with wraparound is a problem, and knowing
> what's past or future is even harder.  What use-case do you
> foresee exactly?

Maybe it would make sense to start exposing fullXids in these views and
functions, for this reason.  There's no good reason to continue to
expose bare Xids to userspace, we should use them only for storage.

But I think James' point is precisely that it's not easy to know where
to look for things that keep Xmin from advancing.  Currently it's
backends, replication slots, prepared transactions, and replicas with
hot_standby_feedback.  If you forget to monitor just one of these, your
vacuums might be useless and you won't notice until disaster strikes.


Maybe a useful value to publish in some monitoring view is
RecentGlobalXmin -- which has a valid value when reading a view, since
you had to acquire a snapshot to read the view in the first place.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Peter Geoghegan
On Wed, Apr 1, 2020 at 1:25 PM Robert Haas  wrote:
> Maybe that contrib module could even have some functions to simulate
> aging without the passage of any real time. Like, say you have a
> function or procedure old_snapshot_pretend_time_has_passed(integer),
> and it moves oldSnapshotControl->head_timestamp backwards by that
> amount. Maybe that would require updating some other fields in
> oldSnapshotControl too but it doesn't seem like we'd need to do a
> whole lot.

I like that idea. I think that I've spotted what may be an independent
bug, but I have to wait around for a minute or two to reproduce it
each time. Makes it hard to get to a minimal test case.

-- 
Peter Geoghegan




Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Peter Geoghegan
On Wed, Apr 1, 2020 at 3:00 PM Peter Geoghegan  wrote:
> I like that idea. I think that I've spotted what may be an independent
> bug, but I have to wait around for a minute or two to reproduce it
> each time. Makes it hard to get to a minimal test case.

I now have simple steps to reproduce a bug when I start Postgres
master with "--old_snapshot_threshold=1"  (1 minute).

This example shows wrong answers to queries in session 2:

Session 1:

pg@regression:5432 [1444078]=# create table snapshot_bug (col int4);
CREATE TABLE
pg@regression:5432 [1444078]=# create index on snapshot_bug (col );
CREATE INDEX
pg@regression:5432 [1444078]=# insert into snapshot_bug select i from
generate_series(1, 500) i;
INSERT 0 500

Session 2 starts, and views the data in a serializable transaction:

pg@regression:5432 [1444124]=# begin isolation level serializable ;
BEGIN
pg@regression:5432 [1444124]=*# select col from snapshot_bug where col
>= 0 order by col limit 14;
┌─┐
│ col │
├─┤
│   1 │
│   2 │
│   3 │
│   4 │
│   5 │
│   6 │
│   7 │
│   8 │
│   9 │
│  10 │
│  11 │
│  12 │
│  13 │
│  14 │
└─┘
(14 rows)

So far so good. Now session 2 continues:

pg@regression:5432 [1444078]=# delete from snapshot_bug where col < 15;
DELETE 14

Session 1:

(repeats the same "select col from snapshot_bug where col >= 0 order
by col limit 14" query every 100 ms using psql's \watch 0.1)

Session 2:

pg@regression:5432 [1444078]=# vacuum snapshot_bug ;
VACUUM

Before too long, we see the following over in session 2 -- the answer
the query gives changes, even though this is a serializable
transaction:

Wed 01 Apr 2020 03:12:59 PM PDT (every 0.1s)

┌─┐
│ col │
├─┤
│   1 │
│   2 │
│   3 │
│   4 │
│   5 │
│   6 │
│   7 │
│   8 │
│   9 │
│  10 │
│  11 │
│  12 │
│  13 │
│  14 │
└─┘
(14 rows)

Wed 01 Apr 2020 03:13:00 PM PDT (every 0.1s)

┌─┐
│ col │
├─┤
│  15 │
│  16 │
│  17 │
│  18 │
│  19 │
│  20 │
│  21 │
│  22 │
│  23 │
│  24 │
│  25 │
│  26 │
│  27 │
│  28 │
└─┘
(14 rows)

Wed 01 Apr 2020 03:13:00 PM PDT (every 0.1s)

┌─┐
│ col │
├─┤
│  15 │
│  16 │
│  17 │
│  18 │
│  19 │
│  20 │
│  21 │
│  22 │
│  23 │
│  24 │
│  25 │
│  26 │
│  27 │
│  28 │
└─┘
(14 rows)

We continue to get this wrong answer for almost another minute (at
least on this occasion). Eventually we get "snapshot too old". Note
that the answer changes when we cross the "minute threshold"

Andres didn't explain anything to me that contributed to finding the
bug (though it could be a known bug, I don't think that it is). It
took me a surprisingly short amount of time to stumble upon this bug
-- I didn't find it because I have good intuitions about how to break
the feature.

-- 
Peter Geoghegan


Re: Proposal: Expose oldest xmin as SQL function for monitoring

2020-04-01 Thread James Coleman
On Wed, Apr 1, 2020 at 5:58 PM Alvaro Herrera  wrote:
>
> On 2020-Apr-01, Tom Lane wrote:
>
> > James Coleman  writes:
> > > To my knowledge the current oldest xmin (GetOldestXmin() if I'm not
> > > mistaken) isn't exposed directly in any view or function by Postgres.
> >
> > You could do something like
> >
> > select max(age(backend_xmin)) from pg_stat_activity;
> >
> > though I'm not sure whether that accounts for absolutely every process.

That doesn't account for replication slots that aren't active, right?

> > > Am I missing anything in the above description? And if not, would
> > > there be any reason why we would want to avoid exposing that
> > > information? And if not, then would exposing it as a function be
> > > acceptable?
> >
> > The fact that I had to use max(age(...)) in that sample query
> > hints at one reason: it's really hard to do arithmetic correctly
> > on raw XIDs.  Dealing with wraparound is a problem, and knowing
> > what's past or future is even harder.  What use-case do you
> > foresee exactly?

So the use case we've encountered multiple times is some (at that
point unknown) process or object that's preventing the xmin from
advancing, and thus blocking vacuum. That kind of situation can pretty
quickly lead to query plans that can result in significant business
impact.

As such, it'd be helpful to be able to monitor something like "how old
is the current xmin on the cluster". Ideally it would also tell you
what process or object is holding that xmin, but starting with the
xmin itself would at least alert to the problem so you can
investigate.

On that note, for this particular use case it would be sufficient to
have something like pg_timestamp_of_oldest_xmin() (given we have
commit timestamps tracking enabled) or even a function returning the
number of xids consumed since the oldest xmin, but it seems more
broadly useful to provide a function that gives the oldest xmin and
allow users to build on top of that.
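
(For example, with commit timestamps you can already approximate the
wall-clock age -- a sketch only, assuming track_commit_timestamp = on,
and it is only meaningful if the oldest backend_xmin happens to be a
committed transaction, which is precisely the kind of caveat that makes
me want a proper builtin:)

-- rough wall-clock age of the oldest backend xmin; returns NULL (or
-- errors if track_commit_timestamp is off) when no timestamp is recorded
SELECT now() - pg_xact_commit_timestamp(backend_xmin) AS approx_age
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 1;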

> Maybe it would make sense to start exposing fullXids in these views and
> functions, for this reason.  There's no good reason to continue to
> expose bare Xids to userspace, we should use them only for storage.

This would be useful too (and for more reasons than the above).

> But I think James' point is precisely that it's not easy to know where
> to look for things that keep Xmin from advancing.  Currently it's
> backends, replication slots, prepared transactions, and replicas with
> hot_standby_feedback.  If you forget to monitor just one of these, your
> vacuums might be useless and you won't notice until disaster strikes.
>
>
> Maybe a useful value to publish in some monitoring view is
> RecentGlobalXmin -- which has a valid value when reading a view, since
> you had to acquire a snapshot to read the view in the first place.

If we went down that path what view do you think would be best -- an
existing one or a new one?

I go back and forth on whether this is best exposed as a monitoring
oriented view or as part of the suite of txid functions we already
have that seem to have broader applicability.

James




Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Andres Freund
Hi,

On 2020-04-01 14:11:11 -0700, Andres Freund wrote:
> As far as I can tell, with a large old_snapshot_threshold, it can take a
> very long time to get to a head_timestamp that's old enough for
> TransactionIdLimitedForOldSnapshots() to do anything.  Look at this
> trace of a pgbench run with old_snapshot_threshold enabled, showing some of
> the debugging output added in the patch upthread.
>
> This is with a threshold of 10min, in a freshly started database:
> [...]

It took a lot longer till the queries started to be cancelled. The last
mapping update before that was:

> 2020-04-01 14:28:00.000 PDT [1268503][3/1894126:2078853871] WARNING:  old 
> snapshot mapping at "before update" with head ts: 31, current entries: 20 max 
> entries: 20, offset: 12
> entry 0 (ring 12): min 31: xid 2078468128
> entry 1 (ring 13): min 32: xid 2078642746
> entry 2 (ring 14): min 33: xid 2078672303
> entry 3 (ring 15): min 34: xid 2078700941
> entry 4 (ring 16): min 35: xid 2078728729
> entry 5 (ring 17): min 36: xid 2078755425
> entry 6 (ring 18): min 37: xid 2078781089
> entry 7 (ring 19): min 38: xid 2078805567
> entry 8 (ring 0): min 39: xid 2078830065
> entry 9 (ring 1): min 40: xid 2078611853
> entry 10 (ring 2): min 41: xid 2078611853
> entry 11 (ring 3): min 42: xid 2078611853
> entry 12 (ring 4): min 43: xid 2078611853
> entry 13 (ring 5): min 44: xid 2078611853
> entry 14 (ring 6): min 45: xid 2078611853
> entry 15 (ring 7): min 46: xid 2078611853
> entry 16 (ring 8): min 47: xid 2078611853
> entry 17 (ring 9): min 48: xid 2078611853
> entry 18 (ring 10): min 49: xid 2078611853
> entry 19 (ring 11): min 50: xid 2078611853
>
> 2020-04-01 14:28:00.000 PDT [1268503][3/1894126:2078853871] WARNING:  head 31 
> min: updating existing bucket 1 for whenTaken 40 min, with xmin 2078853870
> 2020-04-01 14:28:00.000 PDT [1268503][3/1894126:2078853871] WARNING:  old 
> snapshot mapping at "after update" with head ts: 31, current entries: 20 max 
> entries: 20, offset: 12
> entry 0  (ring 12): min 31: xid 2078468128
> entry 1  (ring 13): min 32: xid 2078642746
> entry 2  (ring 14): min 33: xid 2078672303
> entry 3  (ring 15): min 34: xid 2078700941
> entry 4  (ring 16): min 35: xid 2078728729
> entry 5  (ring 17): min 36: xid 2078755425
> entry 6  (ring 18): min 37: xid 2078781089
> entry 7  (ring 19): min 38: xid 2078805567
> entry 8  (ring 0 ): min 39: xid 2078830065
> entry 9  (ring 1 ): min 40: xid 2078853870
> entry 10 (ring 2 ): min 41: xid 2078611853
> entry 11 (ring 3 ): min 42: xid 2078611853
> entry 12 (ring 4 ): min 43: xid 2078611853
> entry 13 (ring 5 ): min 44: xid 2078611853
> entry 14 (ring 6 ): min 45: xid 2078611853
> entry 15 (ring 7 ): min 46: xid 2078611853
> entry 16 (ring 8 ): min 47: xid 2078611853
> entry 17 (ring 9 ): min 48: xid 2078611853
> entry 18 (ring 10): min 49: xid 2078611853
> entry 19 (ring 11): min 50: xid 2078611853


A query ran for forty minutes during this, without getting cancelled.



A good while later this happens:
> 2020-04-01 15:30:00.000 PDT [1268503][3/2518699:2081262046] WARNING:  old 
> snapshot mapping at "before update" with head ts: 82, current entries: 20 max 
> entries: 20, offset: 12
> entry 0 (ring 12): min 82: xid 2080333207
> entry 1 (ring 13): min 83: xid 2080527298
> entry 2 (ring 14): min 84: xid 2080566990
> entry 3 (ring 15): min 85: xid 2080605960
> entry 4 (ring 16): min 86: xid 2080644554
> entry 5 (ring 17): min 87: xid 2080682957
> entry 6 (ring 18): min 88: xid 2080721936
> entry 7 (ring 19): min 89: xid 2080760947
> entry 8 (ring 0): min 90: xid 2080799843
> entry 9 (ring 1): min 91: xid 2080838696
> entry 10 (ring 2): min 92: xid 2080877550
> entry 11 (ring 3): min 93: xid 2080915870
> entry 12 (ring 4): min 94: xid 2080954151
> entry 13 (ring 5): min 95: xid 2080992556
> entry 14 (ring 6): min 96: xid 2081030980
> entry 15 (ring 7): min 97: xid 2081069403
> entry 16 (ring 8): min 98: xid 2081107811
> entry 17 (ring 9): min 99: xid 2081146322
> entry 18 (ring 10): min 100: xid 2081185023
> entry 19 (ring 11): min 101: xid 2081223632
>
> 2020-04-01 15:30:00.000 PDT [1268503][3/2518699:2081262046] WARNING:  head 82 
> min: filling 20 buckets starting at 12 for whenTaken 102 min, with xmin 
> 2081262046
> 2020-04-01 15:30:00.000 PDT [1268503][3/2518699:2081262046] WARNING:  old 
> snapshot mapping at "after update" with head ts: 102, current entries: 1 max 
> entries: 20, offset: 0
> entry 0 (ring 0): min 102: xid 2081262046

The entire mapping reset, i.e. it'll take another f

Re: Ltree syntax improvement

2020-04-01 Thread Tom Lane
Nikita Glukhov  writes:
> [ latest version of ltree syntax extension ]

This is going to need another rebase after all the other ltree hacking
that just got done.  However, I did include 0001 (use a switch) in
the commit I just pushed, so you don't need to worry about that.

regards, tom lane




Re: Proposal: Expose oldest xmin as SQL function for monitoring

2020-04-01 Thread Tom Lane
Alvaro Herrera  writes:
> On 2020-Apr-01, Tom Lane wrote:
>> The fact that I had to use max(age(...)) in that sample query
>> hints at one reason: it's really hard to do arithmetic correctly
>> on raw XIDs.  Dealing with wraparound is a problem, and knowing
>> what's past or future is even harder.  What use-case do you
>> foresee exactly?

> Maybe it would make sense to start exposing fullXids in these views and
> functions, for this reason.  There's no good reason to continue to
> expose bare Xids to userspace, we should use them only for storage.

+1, that would help a lot.

> But I think James' point is precisely that it's not easy to know where
> to look for things that keep Xmin from advancing.  Currently it's
> backends, replication slots, prepared transactions, and replicas with
> hot_standby_feedback.  If you forget to monitor just one of these, your
> vacuums might be useless and you won't notice until disaster strikes.

Agreed, but just knowing what the oldest xmin is doesn't help you
find *where* it is.  Maybe what we need is a view showing all of
these potential sources of an old xmin.
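
Something along these lines, perhaps (only a sketch; the view name, column
names and source labels are invented, and the list of sources is surely
incomplete):

CREATE VIEW xmin_horizon_candidates AS
SELECT 'backend'::text AS source, pid::text AS holder, backend_xmin AS held_xmin
FROM pg_stat_activity WHERE backend_xmin IS NOT NULL
UNION ALL
SELECT 'replication slot', slot_name::text, xmin
FROM pg_replication_slots WHERE xmin IS NOT NULL
UNION ALL
SELECT 'replication slot (catalog)', slot_name::text, catalog_xmin
FROM pg_replication_slots WHERE catalog_xmin IS NOT NULL
UNION ALL
SELECT 'prepared transaction', gid, transaction
FROM pg_prepared_xacts
UNION ALL
SELECT 'standby feedback', application_name, backend_xmin
FROM pg_stat_replication WHERE backend_xmin IS NOT NULL;

-- "who is holding back the oldest xmin?"
SELECT source, holder, held_xmin, age(held_xmin) AS xmin_age
FROM xmin_horizon_candidates
ORDER BY age(held_xmin) DESC LIMIT 1;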

regards, tom lane




Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Andres Freund
Hi,

On 2020-04-01 15:30:39 -0700, Peter Geoghegan wrote:
> On Wed, Apr 1, 2020 at 3:00 PM Peter Geoghegan  wrote:
> > I like that idea. I think that I've spotted what may be an independent
> > bug, but I have to wait around for a minute or two to reproduce it
> > each time. Makes it hard to get to a minimal test case.
>
> I now have simple steps to reproduce a bug when I start Postgres
> master with "--old_snapshot_threshold=1"  (1 minute).

Thanks, that's super helpful.


> This example shows wrong answers to queries in session 2:
>
> Session 1:
>
> pg@regression:5432 [1444078]=# create table snapshot_bug (col int4);
> CREATE TABLE
> pg@regression:5432 [1444078]=# create index on snapshot_bug (col );
> CREATE INDEX
> pg@regression:5432 [1444078]=# insert into snapshot_bug select i from
> generate_series(1, 500) i;
> INSERT 0 500
>
> Session 2 starts, and views the data in a serializable transaction:
>
> pg@regression:5432 [1444124]=# begin isolation level serializable ;
> BEGIN
> pg@regression:5432 [1444124]=*# select col from snapshot_bug where col
> >= 0 order by col limit 14;
> ┌─┐
> │ col │
> ├─┤
> │   1 │
> │   2 │
> │   3 │
> │   4 │
> │   5 │
> │   6 │
> │   7 │
> │   8 │
> │   9 │
> │  10 │
> │  11 │
> │  12 │
> │  13 │
> │  14 │
> └─┘
> (14 rows)
>
> So far so good. Now session 2 continues:

> pg@regression:5432 [1444078]=# delete from snapshot_bug where col < 15;
> DELETE 14

I got a bit confused here - you seemed to have switched session 1 and 2
around? Doesn't seem to matter much though, I was able to reproduce this.

This indeed seems a separate bug.

The backtrace to the point where the xmin horizon is affected by
TransactionIdLimitedForOldSnapshots() is:

#0  TransactionIdLimitedForOldSnapshots (recentXmin=2082816071, 
relation=0x7f52ff3b56f8) at 
/home/andres/src/postgresql/src/backend/utils/time/snapmgr.c:1870
#1  0x5567f4cd1a55 in heap_page_prune_opt (relation=0x7f52ff3b56f8, 
buffer=175) at 
/home/andres/src/postgresql/src/backend/access/heap/pruneheap.c:106
#2  0x5567f4cc70e2 in heapam_index_fetch_tuple (scan=0x5567f6db3028, 
tid=0x5567f6db2e40, snapshot=0x5567f6d67d68, slot=0x5567f6db1b60,
call_again=0x5567f6db2e46, all_dead=0x7ffce13d78de) at 
/home/andres/src/postgresql/src/backend/access/heap/heapam_handler.c:137
#3  0x5567f4cdf5e6 in table_index_fetch_tuple (scan=0x5567f6db3028, 
tid=0x5567f6db2e40, snapshot=0x5567f6d67d68, slot=0x5567f6db1b60,
call_again=0x5567f6db2e46, all_dead=0x7ffce13d78de) at 
/home/andres/src/postgresql/src/include/access/tableam.h:1020
#4  0x5567f4ce0767 in index_fetch_heap (scan=0x5567f6db2de0, 
slot=0x5567f6db1b60) at 
/home/andres/src/postgresql/src/backend/access/index/indexam.c:577
#5  0x5567f4f19191 in IndexOnlyNext (node=0x5567f6db16a0) at 
/home/andres/src/postgresql/src/backend/executor/nodeIndexonlyscan.c:169
#6  0x5567f4ef4bc4 in ExecScanFetch (node=0x5567f6db16a0, 
accessMtd=0x5567f4f18f20 , recheckMtd=0x5567f4f1951c 
)
at /home/andres/src/postgresql/src/backend/executor/execScan.c:133
#7  0x5567f4ef4c39 in ExecScan (node=0x5567f6db16a0, 
accessMtd=0x5567f4f18f20 , recheckMtd=0x5567f4f1951c 
)
at /home/andres/src/postgresql/src/backend/executor/execScan.c:182
#8  0x5567f4f195d4 in ExecIndexOnlyScan (pstate=0x5567f6db16a0) at 
/home/andres/src/postgresql/src/backend/executor/nodeIndexonlyscan.c:317
#9  0x5567f4ef0f71 in ExecProcNodeFirst (node=0x5567f6db16a0) at 
/home/andres/src/postgresql/src/backend/executor/execProcnode.c:444
#10 0x5567f4f1d694 in ExecProcNode (node=0x5567f6db16a0) at 
/home/andres/src/postgresql/src/include/executor/executor.h:245
#11 0x5567f4f1d7d2 in ExecLimit (pstate=0x5567f6db14b8) at 
/home/andres/src/postgresql/src/backend/executor/nodeLimit.c:95
#12 0x5567f4ef0f71 in ExecProcNodeFirst (node=0x5567f6db14b8) at 
/home/andres/src/postgresql/src/backend/executor/execProcnode.c:444
#13 0x5567f4ee57c3 in ExecProcNode (node=0x5567f6db14b8) at 
/home/andres/src/postgresql/src/include/executor/executor.h:245
#14 0x5567f4ee83dd in ExecutePlan (estate=0x5567f6db1280, 
planstate=0x5567f6db14b8, use_parallel_mode=false, operation=CMD_SELECT, 
sendTuples=true,
numberTuples=0, direction=ForwardScanDirection, dest=0x5567f6db3c78, 
execute_once=true)
at /home/andres/src/postgresql/src/backend/executor/execMain.c:1646
#15 0x5567f4ee5e23 in standard_ExecutorRun (queryDesc=0x5567f6d0c490, 
direction=ForwardScanDirection, count=0, execute_once=true)
at /home/andres/src/postgresql/src/backend/executor/execMain.c:364
#16 0x5567f4ee5c35 in ExecutorRun (queryDesc=0x5567f6d0c490, 
direction=ForwardScanDirection, count=0, execute_once=true)
at /home/andres/src/postgresql/src/backend/executor/execMain.c:308
#17 0x5567f510c4de in PortalRunSelect (portal=0x5567f6d49260, forward=true, 
count=0, dest=0x5567f6db3c78)
at /home/andres/src/postgresql/src/backend/tcop/pquery.c:912
#18 0x5567f510c191 in PortalRun (portal=0x5567f6d49260

Re: Add A Glossary

2020-04-01 Thread Alvaro Herrera
On 2020-Apr-01, Jürgen Purtz wrote:

> 
> On 31.03.20 19:58, Justin Pryzby wrote:
> > On Tue, Mar 31, 2020 at 04:13:00PM +0200, Jürgen Purtz wrote:
> > > Please find some minor suggestions in the attachment. They are based on
> > > Corey's last patch 0001-glossary-v4.patch.
> > > @@ -220,7 +220,7 @@
> > > Records to the file system and creates a special
> > > checkpoint record. This process is initiated when predefined
> > > conditions are met, such as a specified amount of time has 
> > > passed, or
> > > -  a certain volume of records have been collected.
> > > +  a certain volume of records has been collected.
> > I think you're correct in that "volume" is singular.  But I think 
> > "collected"
> > is the wrong word.  I suggested "written".
> > 
> "collected" is not optimal. I suggest "created". Please avoid "written", the
> WAL records will be written when the Checkpointer is running, not before.

Actually, you're mistaken; the checkpointer hardly writes any WAL
records.  In fact, it only writes *one* wal record, which is the
checkpoint record itself.  All the other wal records are written either
by the backends that produce it, or by the wal writer process.  By the
time the checkpoint runs, the wal records are expected to have long since been written.

Anyway I changed a lot of terms again, as well as changing the way the
terms are marked up -- for two reasons:

1. I didn't like the way the WAL-related entries were structured.  I
created a new entry called "Write-Ahead Log", which explains what WAL
is; this replaces the term "WAL Log", which is redundant (since the L in
WAL stands for "log" already). I kept the id as glossary-wal, though,
because it's shorter and *shrug*.  The definition uses the terms "wal
record" and "wal file", which I also rewrote.

2. I found out that "see xyz" and "see also" have bespoke markup in
Docbook -- glosssee and glossseealso.  I changed some glossentries
to use those, removing some glossdefs and changing a couple of paras to
glossseealsos.  I also removed all "id" properties from glossentries
that are just glosssee, because I think it's a mistake to have
references to entries that will make the reader look up a different
term; for me as a reader that's annoying, and I don't like to annoy
people.


While at it, I again came across "analytic", which is a term we don't
use much, so I made it a glosssee for "window function"; and while at it
I realized we didn't clearly explain what a window was. So I added
"window frame" for that.  I considered adding the term "partition" which
is used in this context, but decided it wasn't necessary.

I also added "(process)" to terms that define processes.  So
now we have "checkpointer (process)" and so on.

I rewrote the definition for "atomic" once again.  Made it two
glossdefs, because I can.  If you don't like this, I can undo.

I added "recycling".

I still have to go through some other defs.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 1043d0f7ab..cf21ef857e 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -170,6 +170,7 @@
 
 
 
+
 
 
 
diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml
new file mode 100644
index 00..3c417f2fd3
--- /dev/null
+++ b/doc/src/sgml/glossary.sgml
@@ -0,0 +1,1526 @@
+
+ Glossary
+ 
+  This is a list of terms and their meaning in the context of
+  PostgreSQL and relational database
+  systems in general.
+ 
+
+ 
+  
+   Aggregating
+   
+
+ The act of combining a collection of data (input) values into
+ a single output value, which may not be of the same type as the
+ input values.
+
+   
+  
+
+  
+   Aggregate Function
+   
+
+ A function that
+ combines multiple input values,
+ for example by counting, averaging or adding them all together,
+ yielding a single output value.
+
+
+ For more information, see
+ .
+
+
+   
+  
+
+  
+   Analytic Function
+   
+  
+
+  
+   Atomic
+   
+
+ In reference to a datum:
+ the fact that its value that cannot be broken down into smaller
+ components.
+
+   
+   
+
+ In reference to a
+ database transaction:
+ the fact that all the operations in the transaction either complete as
+ a whole, or none of them become visible.
+
+   
+  
+
+  
+   Atomicity
+   
+
+ One of the ACID properties. This is the state of
+ being Atomic in the operational/transactional sense.
+
+   
+  
+
+  
+   Attribute
+   
+
+ An element with a certain name and data type found within a
+ Tuple or Table.
+
+   
+  
+
+  
+   Autovacuum
+   
+
+ Background processes that routinely perform
+ Vacuum and Analyze
+ operations.
+
+
+ For more information, see
+ .
+
+   
+  
+
+  
+   Backend Process
+   
+
+ Proc

Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Andres Freund
Hi,

On 2020-04-01 16:59:51 -0700, Andres Freund wrote:
> The primary issue here is that there is no TestForOldSnapshot() in
> heap_hot_search_buffer(). Therefore index fetches will never even try to
> detect that tuples it needs actually have already been pruned away.

FWIW, with autovacuum=off the query does not get killed until a manual
vacuum, nor if fewer rows are deleted and the table has previously been
vacuumed.

The vacuum in the second session isn't required. There just needs to be
something consuming an xid, so that oldSnapshotControl->latest_xmin is
increased. A single SELECT txid_current(); or such in a separate session
is sufficient.

Greetings,

Andres Freund




Re: Allow continuations in "pg_hba.conf" files

2020-04-01 Thread David Zhang
Hi Fabien,
Should we consider the case "\ ", i.e. one or more spaces after the backslash?
For example, if I replace a user map 
"mymap   /^(.*)@mydomain\.com$  \1" with 
"mymap   /^(.*)@mydomain\.com$  \ "
"\1"
by adding one extra space after the backslash, then I get pg_role="\\",
but I think what we expect is pg_role="\\1".

Re: snapshot too old issues, first around wraparound and then more.

2020-04-01 Thread Peter Geoghegan
On Wed, Apr 1, 2020 at 4:59 PM Andres Freund  wrote:
> Thanks, that's super helpful.

Glad I could help.

> I got a bit confused here - you seemed to have switched session 1 and 2
> around? Doesn't seem to matter much though, I was able to reproduce this.

Yeah, I switched the session numbers because I was in a hurry. Sorry about that.

As you have already worked out, one session does all the DDL and
initial loading of data, while the other session queries the data
repeatedly in a serializable (or RR) xact. The latter session exhibits
the bug.

> This indeed seems a separate bug.

> The primary issue here is that there is no TestForOldSnapshot() in
> heap_hot_search_buffer(). Therefore index fetches will never even try to
> detect that tuples it needs actually have already been pruned away.

I suspected that heap_hot_search_buffer() was missing something.

> The wrong queries I saw took longer to reproduce, so I've not been able
> to debug the precise reasons.

How hard would it be to write a debug patch that reduces the quantum
used in places like TransactionIdLimitedForOldSnapshots() to something
much less than the current 1 minute quantum? That made reproducing the
bug *very* tedious.

-- 
Peter Geoghegan




Re: control max length of parameter values logged

2020-04-01 Thread Alexey Bashtanov

Hi,

> The privilege argument seems irrelevant to me.  We already decided
> that the plan is (a) SUSET for non-error statement logging purposes and
> (b) USERSET for logging caused by errors, and that would have to apply
> to length limits as well as enable/disable ability.  Otherwise a user
> could pretty effectively disable logging by setting the length to 1.
The only privilege a user can gain if we drop the boolean is to
*enable* logging parameters on error.
That gives the user a slightly easier way to fill up the disk with logs,
but they can do that anyway if they want to.
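
To spell out the intended usage (a sketch only -- these GUC names are the
ones proposed in the attached patch and don't exist in any released server):

-- non-error statement logging: superuser-only (SUSET); trim logged bind
-- parameters to 512 bytes
ALTER SYSTEM SET log_parameter_max_length = 512;
SELECT pg_reload_conf();

-- logging on error: any user can opt in for their own session (USERSET)
SET log_parameter_max_length_on_error = -1;   -- log parameter values in full
SET log_parameter_max_length_on_error = 0;    -- default: don't log them at all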

If that's okay with everyone, please see the new version attached.

Best, Alex
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2de21903a1..3d7db57d74 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6690,29 +6690,48 @@ log_line_prefix = '%m [%p] %q%u@%d/%a '
   
  
 
- 
-  log_parameters_on_error (boolean)
+ 
+  log_parameter_max_length (integer)
   
-   log_parameters_on_error configuration parameter
+   log_parameter_max_length configuration parameter
   
   
   

-Controls whether bind parameters are logged when a statement is logged
-as a result of .
-It adds some overhead, as PostgreSQL will
-compute and store textual representations of parameter values in
-memory for all statements, even if they eventually do not get logged.
-This setting has no effect on statements logged due to
- or
- settings, as they are always logged
-with parameters.
-The default is off.
+If greater than zero, bind parameter values reported in non-error
+statement-logging messages are trimmed to no more than this many bytes.
+If this value is specified without units, it is taken as bytes.
+Zero disables logging parameters with statements.
+-1 (the default) makes parameters logged in full.
 Only superusers can change this setting.

   
  
 
+ 
+  log_parameter_max_length_on_error (integer)
+  
+   log_parameter_max_length_on_error configuration parameter
+  
+  
+  
+   
+If greater than zero, bind parameter values reported in error messages
+are trimmed to no more than this many bytes.
+If this value is specified without units, it is taken as bytes.
+Zero (the default) disables logging parameters on error.
+-1 makes parameters logged in full.
+
+This setting only affects statements logged as a result of
+.
+Non-zero values add some overhead, as
+PostgreSQL will compute and store textual
+representations of parameter values in memory for all statements,
+even if they eventually do not get logged.
+   
+  
+ 
+
  
   log_statement (enum)
   
diff --git a/src/backend/nodes/params.c b/src/backend/nodes/params.c
index a57f9eea76..77328e17ff 100644
--- a/src/backend/nodes/params.c
+++ b/src/backend/nodes/params.c
@@ -269,7 +269,7 @@ RestoreParamList(char **start_address)
  * can contain NULLs for any unknown individual values.  NULL can be given if
  * no parameters are known.
  *
- * If maxlen is not zero, that's the maximum number of bytes of any one
+ * If maxlen is not -1, that's the maximum number of bytes of any one
  * parameter value to be printed; an ellipsis is added if the string is
  * longer.  (Added quotes are not considered in this calculation.)
  */
@@ -280,6 +280,7 @@ BuildParamLogString(ParamListInfo params, char **knownTextValues, int maxlen)
 oldCxt;
 	StringInfoData buf;
 
+	Assert(maxlen == -1 || maxlen > 0);
 	/*
 	 * NB: think not of returning params->paramValuesStr!  It may have been
 	 * generated with a different maxlen, and so be unsuitable.  Besides that,
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 5b677863b9..cbe4907e61 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -1838,7 +1838,7 @@ exec_bind_message(StringInfo input_message)
  */
 if (pstring)
 {
-	if (log_parameters_on_error)
+	if (log_parameter_max_length_on_error != 0)
 	{
 		MemoryContext	oldcxt;
 
@@ -1846,13 +1846,24 @@ exec_bind_message(StringInfo input_message)
 		if (knownTextValues == NULL)
 			knownTextValues =
 palloc0(numParams * sizeof(char *));
-		/*
-		 * Note: must copy at least two more full characters
-		 * than BuildParamLogString wants to see; otherwise it
-		 * might fail to include the ellipsis.
-		 */
-		knownTextValues[paramno] =
-			pnstrdup(pstring, 64 + 2 * MAX_MULTIBYTE_CHAR_LEN);
+
+		if (log_parameter_max_length_on_error > 0)
+		{
+			/*
+			 * We can trim the saved string, knowing that we
+			 * won't print all of it.  But we must copy at
+			 * least two more full characters than
+		
