Re: Parallel copy

2020-09-24 Thread Bharath Rupireddy
Thanks Greg for the testing.

On Thu, Sep 24, 2020 at 8:27 AM Greg Nancarrow  wrote:
>
> > 3. Could you please run the test case 3 times at least? Just to ensure
> > the consistency of the issue.
>
> Yes, have run 4 times. Seems to be a performance hit (whether normal
> copy or parallel-1 copy) on the first COPY run on a freshly created
> database. After that, results are consistent.
>

From the logs, I see that it is happening only with the default
postgresql.conf, and there's inconsistency in table insertion times,
especially from the 1st run to the 2nd. The variation in table insertion
time is also larger. This is expected with the default postgresql.conf
because of interference from background processes. That's the reason we
usually run with a custom configuration to correctly measure the performance
gain.

br_default_0_1.log:
2020-09-23 22:32:36.944 JST [112616] LOG:  totaltableinsertiontime =
155068.244 ms
2020-09-23 22:33:57.615 JST [11426] LOG:  totaltableinsertiontime =
42096.275 ms
2020-09-23 22:37:39.192 JST [43097] LOG:  totaltableinsertiontime =
29135.262 ms
2020-09-23 22:38:56.389 JST [54205] LOG:  totaltableinsertiontime =
38953.912 ms
2020-09-23 22:40:27.573 JST [66485] LOG:  totaltableinsertiontime =
27895.326 ms
2020-09-23 22:41:34.948 JST [77523] LOG:  totaltableinsertiontime =
28929.642 ms
2020-09-23 22:43:18.938 JST [89857] LOG:  totaltableinsertiontime =
30625.015 ms
2020-09-23 22:44:21.938 JST [101372] LOG:  totaltableinsertiontime =
24624.045 ms

br_default_1_0.log:
2020-09-24 11:12:14.989 JST [56146] LOG:  totaltableinsertiontime =
192068.350 ms
2020-09-24 11:13:38.228 JST [88455] LOG:  totaltableinsertiontime =
30999.942 ms
2020-09-24 11:15:50.381 JST [108935] LOG:  totaltableinsertiontime =
31673.204 ms
2020-09-24 11:17:14.260 JST [118541] LOG:  totaltableinsertiontime =
31367.027 ms
2020-09-24 11:20:18.975 JST [17270] LOG:  totaltableinsertiontime =
26858.924 ms
2020-09-24 11:22:17.822 JST [26852] LOG:  totaltableinsertiontime =
66531.442 ms
2020-09-24 11:24:09.221 JST [47971] LOG:  totaltableinsertiontime =
38943.384 ms
2020-09-24 11:25:30.955 JST [58849] LOG:  totaltableinsertiontime =
28286.634 ms

br_custom_0_1.log:
2020-09-24 10:29:44.956 JST [110477] LOG:  totaltableinsertiontime =
20207.928 ms
2020-09-24 10:30:49.570 JST [120568] LOG:  totaltableinsertiontime =
23360.006 ms
2020-09-24 10:32:31.659 JST [2753] LOG:  totaltableinsertiontime =
19837.588 ms
2020-09-24 10:35:49.245 JST [31118] LOG:  totaltableinsertiontime =
21759.253 ms
2020-09-24 10:36:54.834 JST [41763] LOG:  totaltableinsertiontime =
23547.323 ms
2020-09-24 10:38:53.507 JST [56779] LOG:  totaltableinsertiontime =
21543.984 ms
2020-09-24 10:39:58.713 JST [67489] LOG:  totaltableinsertiontime =
25254.563 ms

br_custom_1_0.log:
2020-09-24 10:49:03.242 JST [15308] LOG:  totaltableinsertiontime =
16541.201 ms
2020-09-24 10:50:11.848 JST [23324] LOG:  totaltableinsertiontime =
15076.577 ms
2020-09-24 10:51:24.497 JST [35394] LOG:  totaltableinsertiontime =
16400.777 ms
2020-09-24 10:52:32.354 JST [42953] LOG:  totaltableinsertiontime =
15591.051 ms
2020-09-24 10:54:30.327 JST [61136] LOG:  totaltableinsertiontime =
16700.954 ms
2020-09-24 10:55:38.377 JST [68719] LOG:  totaltableinsertiontime =
15435.150 ms
2020-09-24 10:57:08.927 JST [83335] LOG:  totaltableinsertiontime =
17133.251 ms
2020-09-24 10:58:17.420 JST [90905] LOG:  totaltableinsertiontime =
15352.753 ms

>
> Test results show that Parallel COPY with 1 worker is performing
> better than normal COPY in the test scenarios run.
>

Good to know :)

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


Re: Range checks of pg_test_fsync --secs-per-test and pg_test_timing --duration

2020-09-24 Thread Michael Paquier
On Wed, Sep 23, 2020 at 08:11:59AM +0200, Peter Eisentraut wrote:
> This patch mixes up unsigned int and uint32 in random ways.  The variable is
> uint32, but the format is %u and the max constant is UINT_MAX.
> 
> I think just use unsigned int as the variable type.  There is no need to use
> the bit-exact types.  Note that the argument of alarm() is of type unsigned
> int.

Makes sense, thanks.
--
Michael
diff --git a/src/bin/pg_test_fsync/.gitignore b/src/bin/pg_test_fsync/.gitignore
index f3b5932498..5eb5085f45 100644
--- a/src/bin/pg_test_fsync/.gitignore
+++ b/src/bin/pg_test_fsync/.gitignore
@@ -1 +1,3 @@
 /pg_test_fsync
+
+/tmp_check/
diff --git a/src/bin/pg_test_fsync/Makefile b/src/bin/pg_test_fsync/Makefile
index 7632c94eb7..c4f9ae0664 100644
--- a/src/bin/pg_test_fsync/Makefile
+++ b/src/bin/pg_test_fsync/Makefile
@@ -22,6 +22,12 @@ install: all installdirs
 installdirs:
 	$(MKDIR_P) '$(DESTDIR)$(bindir)'
 
+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
+
 uninstall:
 	rm -f '$(DESTDIR)$(bindir)/pg_test_fsync$(X)'
 
diff --git a/src/bin/pg_test_fsync/pg_test_fsync.c b/src/bin/pg_test_fsync/pg_test_fsync.c
index 6e47293123..c66492eba4 100644
--- a/src/bin/pg_test_fsync/pg_test_fsync.c
+++ b/src/bin/pg_test_fsync/pg_test_fsync.c
@@ -5,6 +5,7 @@
 
 #include "postgres_fe.h"
 
+#include 
 #include 
 #include 
 #include 
@@ -62,7 +63,7 @@ do { \
 
 static const char *progname;
 
-static int	secs_per_test = 5;
+static unsigned int secs_per_test = 5;
 static int	needs_unlink = 0;
 static char full_buf[DEFAULT_XLOG_SEG_SIZE],
 		   *buf,
@@ -148,6 +149,8 @@ handle_args(int argc, char *argv[])
 
 	int			option;			/* Command line option */
 	int			optindex = 0;	/* used by getopt_long */
+	unsigned long	optval;		/* used for option parsing */
+	char	   *endptr;
 
 	if (argc > 1)
 	{
@@ -173,7 +176,24 @@ handle_args(int argc, char *argv[])
 break;
 
 			case 's':
-secs_per_test = atoi(optarg);
+errno = 0;
+optval = strtoul(optarg, &endptr, 10);
+
+if (endptr == optarg || *endptr != '\0' ||
+	errno != 0 || optval != (unsigned int) optval)
+{
+	pg_log_error("invalid argument for option %s", "--secs-per-test");
+	fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
+	exit(1);
+}
+
+secs_per_test = (unsigned int) optval;
+if (secs_per_test == 0)
+{
+	pg_log_error("%s must be in range %u..%u",
+ "--secs-per-test", 1, UINT_MAX);
+	exit(1);
+}
 break;
 
 			default:
@@ -193,8 +213,8 @@ handle_args(int argc, char *argv[])
 		exit(1);
 	}
 
-	printf(ngettext("%d second per test\n",
-	"%d seconds per test\n",
+	printf(ngettext("%u second per test\n",
+	"%u seconds per test\n",
 	secs_per_test),
 		   secs_per_test);
 #if PG_O_DIRECT != 0
diff --git a/src/bin/pg_test_fsync/t/001_basic.pl b/src/bin/pg_test_fsync/t/001_basic.pl
new file mode 100644
index 00..fe9c295c49
--- /dev/null
+++ b/src/bin/pg_test_fsync/t/001_basic.pl
@@ -0,0 +1,25 @@
+use strict;
+use warnings;
+
+use Config;
+use TestLib;
+use Test::More tests => 12;
+
+#
+# Basic checks
+
+program_help_ok('pg_test_fsync');
+program_version_ok('pg_test_fsync');
+program_options_handling_ok('pg_test_fsync');
+
+#
+# Test invalid option combinations
+
+command_fails_like(
+	[ 'pg_test_fsync', '--secs-per-test', 'a' ],
+	qr/\Qpg_test_fsync: error: invalid argument for option --secs-per-test\E/,
+	'pg_test_fsync: invalid argument for option --secs-per-test');
+command_fails_like(
+	[ 'pg_test_fsync', '--secs-per-test', '0' ],
+	qr/\Qpg_test_fsync: error: --secs-per-test must be in range 1..4294967295\E/,
+	'pg_test_fsync: --secs-per-test must be in range');
diff --git a/src/bin/pg_test_timing/.gitignore b/src/bin/pg_test_timing/.gitignore
index f6c664c765..e5aac2ab12 100644
--- a/src/bin/pg_test_timing/.gitignore
+++ b/src/bin/pg_test_timing/.gitignore
@@ -1 +1,3 @@
 /pg_test_timing
+
+/tmp_check/
diff --git a/src/bin/pg_test_timing/Makefile b/src/bin/pg_test_timing/Makefile
index 334d6ff5c0..52994b4103 100644
--- a/src/bin/pg_test_timing/Makefile
+++ b/src/bin/pg_test_timing/Makefile
@@ -22,6 +22,12 @@ install: all installdirs
 installdirs:
 	$(MKDIR_P) '$(DESTDIR)$(bindir)'
 
+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
+
 uninstall:
 	rm -f '$(DESTDIR)$(bindir)/pg_test_timing$(X)'
 
diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index e14802372b..894bb1d0d4 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -6,15 +6,17 @@
 
 #include "postgres_fe.h"
 
+#include 
+
 #include "getopt_long.h"
 #include "portability/instr_time.h"
 
 static const char *progname;
 
-static int32 test_duration = 3;
+static unsigned int test_duration = 3;
 
 static void handle_args(int argc, char *argv[]);
-static uint64 test_timing(int32);
+static uint6

Re: Online checksums patch - once again

2020-09-24 Thread Daniel Gustafsson
> On 24 Sep 2020, at 06:27, Michael Paquier  wrote:
> 
> On Wed, Sep 23, 2020 at 02:34:36PM +0200, Daniel Gustafsson wrote:
>> While in the patch I realized that the relationlist saved the relkind but the
>> code wasn't actually using it, so I've gone ahead and removed that with a lot
>> fewer palloc calls as a result. The attached v22 fixes that and the above.
> 
> Some of the TAP tests are blowing up here, as the CF bot is telling:
> t/003_standby_checksum.pl .. 1/11 # Looks like you planned 11 tests but ran 4.
> # Looks like your test exited with 29 just after 4.
> t/003_standby_checksum.pl .. Dubious, test returned 29 (wstat 7424, 0x1d00)

Interesting, I've been unable to trigger a fault and the latest Travis build
was green.  I'll continue to try and see if I can shake something loose.

cheers ./daniel





Re: [PATCH] Add features to pg_stat_statements

2020-09-24 Thread Katsuragi Yuta

On 2020-09-24 14:15, legrand legrand wrote:

> Hi,
> Both are probably not needed.
> I would prefer it accessible in a view like:
>
> Event | date | victims
> Eviction...
> Reset...
> Part reset ...
>
> As there are other needs to track reset times.


Let me confirm one thing.
Is the number of records of this view fixed to three?
Or, will a new record be appended every time
an event (Eviction or Reset or Part Reset) happens?

Regards,
Katsuragi Yuta




Re: A little enhancement for hashdisk testset

2020-09-24 Thread Daniel Gustafsson
> On 24 Sep 2020, at 07:45, Hou, Zhijie  wrote:

> I found that some tables are not dropped at the end of the SQL script.
> It does not cause any problem, but I think it's better to drop the tables in
> time.

There is value in keeping a representative set of objects around after the
regression tests, as the database which is left after a regression test run is
used as the input to the pg_upgrade tests.  That being said, I don't think that
was the intention for these relations in commit 92c58fd94801, but in general
we should avoid cleaning up too much to ensure that we stress pg_upgrade well
enough (there is a delicate balance to keep test runtime short as well, of
course).

cheers ./daniel



Re: 回复:how to create index concurrently on partitioned table

2020-09-24 Thread Michael Paquier
On Mon, Sep 14, 2020 at 09:31:03AM -0500, Justin Pryzby wrote:
> Anyway, for now this is rebased on 83158f74d.

I have not thought yet about all the details of CIC and DIC and what
you said upthread, but I have gone through CLUSTER for now, as a
start.  So, the goal of the patch, as I would define it, is to give a
way to CLUSTER to work on a partitioned table using a given
partitioned index.  Hence, we would perform CLUSTER automatically from
the top-most parent for any partitions that have an index on the same
partition tree as the partitioned index defined in USING, switching
indisclustered automatically depending on the index used.

+CREATE TABLE clstrpart3 PARTITION OF clstrpart DEFAULT PARTITION BY RANGE(a);
+CREATE TABLE clstrpart33 PARTITION OF clstrpart3 DEFAULT;
 CREATE INDEX clstrpart_idx ON clstrpart (a);
-ALTER TABLE clstrpart CLUSTER ON clstrpart_idx
So..  For any testing of partitioned trees, we should be careful to
check if relfilenodes have been changed or not as part of an
operation, to see if the operation has actually done something.

From what I can see, attempting to use a CLUSTER on a top-most
partitioned table fails to work on child tables, but isn't the goal of
the patch to make sure that if we attempt to do the operation on a
partitioned table using a partitioned index, then the operation should
be done as well on the partitions with the partition index part of the
same partition tree as the parent partitioned index?  If using CLUSTER
on a new partitioned index with USING, it seems to me that we may want
to check that indisclustered is correctly set for all the partitions
concerned in the regression tests.  It would be good also to check if
we have a partition index tree that maps partially with a partition
table tree (aka not all table partitions have a partition index), where
these don't get clustered because there is no index to work on.

+   MemoryContext old_context = MemoryContextSwitchTo(cluster_context);
+   inhoids = find_all_inheritors(indexOid, NoLock, NULL);
+   MemoryContextSwitchTo(old_context);
Er, isn't that incorrect?  I would have assumed that what should be
saved in the context of cluster is the list of RelToCluster items.
And we should not do any lookup of the partitions in a different
context, because this may do allocations of things we don't really
need to keep around for the context dedicated to CLUSTER.  Isn't
NoLock unsafe here, even knowing that an exclusive lock is taken on
the parent?  It seems to me that at least schema changes should be
prevented with a ShareLock in the first transaction building the list
of elements to cluster.

+   /*
+* We have a full list of direct and indirect children, so skip
+* partitioned tables and just handle their children.
+*/
+   if (get_rel_relkind(relid) == RELKIND_PARTITIONED_TABLE)
+   continue;
It would be better to use RELKIND_HAS_STORAGE here.
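
Putting the two points above together, a rough sketch of how that block
could look is below.  The function and type names come from the quoted
patch and cluster.c; everything else is an assumption, not the actual
patch code:

/* Hypothetical sketch only, not the actual patch code. */
List	   *rtcs = NIL;
List	   *inhoids;
ListCell   *lc;

/*
 * Look up the index partition tree in the current, short-lived context,
 * taking ShareLock so that schema changes are blocked while we build
 * the list of things to cluster.
 */
inhoids = find_all_inheritors(indexOid, ShareLock, NULL);

foreach(lc, inhoids)
{
	Oid			partidxoid = lfirst_oid(lc);
	Oid			parttbloid = IndexGetRelation(partidxoid, false);
	RelToCluster *rtc;
	MemoryContext old_context;

	/* Skip anything without storage, e.g. partitioned tables. */
	if (!RELKIND_HAS_STORAGE(get_rel_relkind(parttbloid)))
		continue;

	/*
	 * Only the RelToCluster items have to survive across transactions,
	 * so only they are allocated in cluster_context.
	 */
	old_context = MemoryContextSwitchTo(cluster_context);
	rtc = (RelToCluster *) palloc(sizeof(RelToCluster));
	rtc->tableOid = parttbloid;
	rtc->indexOid = partidxoid;
	rtcs = lappend(rtcs, rtc);
	MemoryContextSwitchTo(old_context);
}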

All this stuff needs to be documented clearly.
--
Michael




RE: Transactions involving multiple postgres foreign servers, take 2

2020-09-24 Thread tsunakawa.ta...@fujitsu.com
From: Masahiko Sawada 
> So with your idea, I think we require FDW developers to not call
> ereport(ERROR) as much as possible. If they need to use a function
> including palloc, lappend etc that could call ereport(ERROR), they
> need to use PG_TRY() and PG_CATCH() and return the control along with
> the error message to the transaction manager rather than raising an
> error. Then the transaction manager will emit the error message at an
> error level lower than ERROR (e.g., WARNING), and call commit/rollback
> API again. But normally we do some cleanup on error but in this case
> the retrying commit/rollback is performed without any cleanup. Is that
> right? I’m not sure it’s safe though.


Yes.  It's legitimate to require the FDW commit routine to return control,
because the prepare of 2PC is a promise to commit successfully.  The
second-phase commit should avoid doing anything that could fail.  For example,
if some memory is needed for commit, it should be allocated in prepare or
before.
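
As a concrete illustration of that requirement, a second-phase commit
routine could be shaped roughly like below.  This is only a hypothetical
sketch: the connection type, the callback name and the send function are
made up, and demoting the error to a WARNING is just one possible choice.

#include "postgres.h"

/* Hypothetical FDW-specific connection type and send routine. */
typedef struct ExampleFdwConn ExampleFdwConn;
extern void example_fdw_send_commit_prepared(ExampleFdwConn *conn,
											  const char *gid);

/*
 * Hypothetical second-phase commit callback.  Everything that can fail
 * (memory allocation, connection establishment) is assumed to have been
 * done at PREPARE time; here we only send COMMIT PREPARED, and any
 * failure is handed back to the transaction manager as a WARNING plus
 * a false return value instead of an ERROR.
 */
static bool
example_fdw_commit_prepared(ExampleFdwConn *conn, const char *gid)
{
	MemoryContext oldcontext = CurrentMemoryContext;
	bool		ok = true;

	PG_TRY();
	{
		example_fdw_send_commit_prepared(conn, gid);
	}
	PG_CATCH();
	{
		ErrorData  *edata;

		MemoryContextSwitchTo(oldcontext);
		edata = CopyErrorData();
		FlushErrorState();

		ereport(WARNING,
				(errmsg("committing prepared foreign transaction \"%s\" failed: %s",
						gid, edata->message)));
		FreeErrorData(edata);
		ok = false;
	}
	PG_END_TRY();

	return ok;
}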


> IIUC aggregation functions can be pushed down to the foreign server
> but I have no idea whether a normal UDF in the select list is pushed down.
> I wonder if it isn't.

Oh, so that's the current situation.  Understood.  I thought the UDF call was
also pushed down, as I saw that Greenplum does so.  (Reading the manual,
Greenplum disallows data updates in a UDF when it's executed on the remote
segment server.)

(Aren't we overlooking something else that updates data on the remote server 
while the local server is unaware?)


Regards
Takayuki Tsunakawa



RE: [Patch] Optimize dropping of relation buffers using dlist

2020-09-24 Thread k.jami...@fujitsu.com
On Thursday, September 24, 2020 1:27 PM, Tsunakawa-san wrote:

> (1)
> + for (cur_blk = firstDelBlock[j]; cur_blk <
> nblocks; cur_blk++)
> 
> The right side of "cur_blk <" should not be nblocks, because nblocks is no
> longer the total number of blocks of the relation fork.

Right. Fixed. It should be the total number of (n)blocks of the relation.

> (2)
> + BlockNumber nblocks;
> + nblocks = smgrnblocks(smgr_reln, forkNum[j]) -
> firstDelBlock[j];
> 
> You should either:
> 
> * Combine the two lines into one: BlockNumber nblocks = ...;
> 
> or
> 
> * Put an empty line between the two lines to separate declarations and
> execution statements.

Right. I separated them in the updated patch. And to prevent confusion,
instead of nblocks, nTotalBlocks & nBlocksToInvalidate are used.

/* Get the total number of blocks for the supplied relation's fork */
nTotalBlocks = smgrnblocks(smgr_reln, forkNum[j]);

/* Get the total number of blocks to be invalidated for the specified fork */
nBlocksToInvalidate = nTotalBlocks - firstDelBlock[j];
 

> After correcting these, I think you can check the recovery performance.

I'll send performance measurement results in the next email. Thanks a lot for 
the reviews!

Regards,
Kirk Jamison


v16-Optimize-DropRelFileNodeBuffers-during-recovery.patch
Description: v16-Optimize-DropRelFileNodeBuffers-during-recovery.patch


v1-Prevent-invalidating-blocks-in-smgrextend-during-recovery.patch
Description:  v1-Prevent-invalidating-blocks-in-smgrextend-during-recovery.patch


Re: [Patch] Optimize dropping of relation buffers using dlist

2020-09-24 Thread Kyotaro Horiguchi
Hello.

At Wed, 23 Sep 2020 05:37:24 +, "tsunakawa.ta...@fujitsu.com" 
 wrote in 
> From: Jamison, Kirk/ジャミソン カーク 

# Wow. I'm surprised to read it..

> > I revised the patch based from my understanding of Horiguchi-san's comment,
> > but I could be wrong.
> > Quoting:
> > 
> > "
> > +   /* Get the number of blocks for the supplied relation's
> > fork */
> > +   nblocks = smgrnblocks(smgr_reln,
> > forkNum[fork_num]);
> > +   Assert(BlockNumberIsValid(nblocks));
> > +
> > +   if (nblocks < BUF_DROP_FULLSCAN_THRESHOLD)
> > 
> > As mentioned upthread, the criteria whether we do full-scan or
> > lookup-drop is how large portion of NBUFFERS this relation-drop can be
> > going to invalidate.  So the nblocks above should be the sum of number
> > of blocks to be truncated (not just the total number of blocks) of all
> > designated forks.  Then once we decided to do lookup-drop method, we
> > do that for all forks."
> 
> One takeaway from Horiguchi-san's comment is to use the number of blocks to 
> invalidate for comparison, instead of all blocks in the fork.  That is, use
> 
> nblocks = smgrnblocks(fork) - firstDelBlock[fork];
> 
> Does this make sense?
> 
> What do you think is the reason for summing up all forks?  I didn't 
> understand why.  Typically, FSM and VM forks are very small.  If the main 
> fork is larger than NBuffers / 500, then v14 scans the entire shared buffers 
> for the FSM and VM forks as well as the main fork, resulting in three scans 
> in total.

I thought of summing up smgrnblocks(fork) - firstDelBlock[fork] of all
forks. I don't mind omitting non-main forks but a comment to explain
the reason or reasoning would be needed.
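
A minimal sketch of what I mean, reusing the names from the quoted patch
(the rest is assumed, not actual patch code):

/* Hypothetical sketch: sum the blocks to be invalidated over all forks. */
BlockNumber nBlocksToInvalidate = 0;

for (fork_num = 0; fork_num < nforks; fork_num++)
{
	BlockNumber nTotalBlocks = smgrnblocks(smgr_reln, forkNum[fork_num]);

	Assert(BlockNumberIsValid(nTotalBlocks));
	nBlocksToInvalidate += nTotalBlocks - firstDelBlock[fork_num];
}

if (nBlocksToInvalidate < BUF_DROP_FULLSCAN_THRESHOLD)
{
	/* look up and drop the to-be-truncated buffers of every fork */
}
else
{
	/* fall back to one full scan of shared buffers */
}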

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: I'd like to discuss scaleout at PGCon

2020-09-24 Thread tsunakawa.ta...@fujitsu.com
From: MauMau 
> I intentionally have put little conclusion on our specification and
> design.  I'd like you to look at recent distributed databases, and
> then think about and discuss what we want to aim for together.  I feel
> it's better to separate a thread per topic or group of topics.

Finally, MariaDB is now equipped with a scale-out feature.

How MariaDB achieves global scale with Xpand
https://www.infoworld.com/article/3574077/how-mariadb-achieves-global-scale-with-xpand.html

I haven't read its documentation, and am not planning to read it now, but the 
feature looks nice.  I hope this will also be a good stimulus.  I believe we 
should be aware of competitiveness when designing -- "If we adopt this 
architecture or design to simplify the first version, will it really naturally 
evolve to a competitive product in the future without distorting the concept, 
design, and interface?"


Regards
Takayuki Tsunakawa



Re: Parallel copy

2020-09-24 Thread Bharath Rupireddy
>
> > Have you tested your patch when encoding conversion is needed? If so,
> > could you please point out the email that has the test results.
> >
>
> We have not yet done encoding testing, we will do and post the results
> separately in the coming days.
>

Hi Ashutosh,

I ran the tests ensuring pg_server_to_any() gets called from copy.c. I
specified the encoding option of COPY command, with client and server
encodings being UTF-8.

Tests are performed with custom postgresql.conf[1], 10million rows, 5.2GB
data. The results are of the triplet form (exec time in sec, number of
workers, gain)

Use case 1: 2 indexes on integer columns, 1 index on text column
(1174.395, 0, 1X), (1127.792, 1, 1.04X), (644.260, 2, 1.82X), (341.284, 4,
3.43X), (204.423, 8, 5.74X), (140.692, 16, 8.34X), (129.843, 20, 9.04X),
(134.511, 30, 8.72X)

Use case 2: 1 gist index on text column
(811.412, 0, 1X), (772.203, 1, 1.05X), (437.364, 2, 1.85X), (263.575, 4,
3.08X), (175.135, 8, 4.63X), (155.355, 16, 5.22X), (178.704, 20, 4.54X),
(199.402, 30, 4.06)

Use case 3: 3 indexes on integer columns
(220.680, 0, 1X), (185.096, 1, 1.19X), (134.811, 2, 1.64X), (114.585, 4,
1.92X), (107.707, 8, 2.05X), (101.253, 16, 2.18X), (100.749, 20, 2.19X),
(100.656, 30, 2.19X)

The results are similar to our earlier runs[2].

[1]
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off

[2]
https://www.postgresql.org/message-id/CALDaNm13zK%3DJXfZWqZJsm3%2B2yagYDJc%3DeJBgE4i77-4PPNj7vw%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


Re: [PATCH] Add features to pg_stat_statements

2020-09-24 Thread legrand legrand
Not limited to 3, Like an history table.

Will have to think if data is saved at shutdown.



--
Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html




Re: problem with RETURNING and update row movement

2020-09-24 Thread Etsuro Fujita
On Thu, Sep 24, 2020 at 2:47 PM Amit Langote  wrote:
> On Thu, Sep 24, 2020 at 4:25 AM Etsuro Fujita  wrote:
> BTW, the discussion so far on the other thread is oblivious to the
> issue being discussed here, where we need to find a way to transfer
> system attributes between a pair of partitions that are possibly
> incompatible with each other in terms of what set of system attributes
> they support.

Yeah, we should discuss the two issues together.

> Although, if we prevent accessing system attributes
> when performing the operation on partitioned tables, like what you
> seem to propose below, then we wouldn't really have that problem.

Yeah, I think so.

> > Yeah, but for the other issue, I started thinking that we should just
> > forbid referencing xmin/xmax/cmin/cmax in 12, 13, and HEAD...
>
> When the command is being performed on a partitioned table you mean?

Yes.  One concern about that is triggers: IIUC, triggers on a
partition as-is can or cannot reference xmin/xmax/cmin/cmax depending
on whether a dedicated tuple slot for the partition is used or not.
We should do something about this if we go in that direction, shouldn't we?

> That is, it'd be okay to reference them when the command is being
> performed directly on a leaf partition, although it's another thing
> whether the leaf partitions themselves have sensible values to provide
> for them.

I think so too.

Best regards,
Etsuro Fujita




Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.

2020-09-24 Thread Dilip Kumar
On Thu, Sep 24, 2020 at 11:55 AM Amit Kapila  wrote:
>
> On Tue, Sep 22, 2020 at 5:15 PM Dilip Kumar  wrote:
> >
> > On Tue, Sep 22, 2020 at 12:02 PM Amit Kapila  
> > wrote:
> > >
> > >
> > > I am not sure if this suggestion makes it better than what is purposed
> > > by Dilip but I think we can declare them in define number order like
> > > below:
> > > #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
> > > #define LOGICALREP_PROTO_VERSION_NUM 1
> > > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
> > > #define LOGICALREP_PROTO_MAX_VERSION_NUM 
> > > LOGICALREP_PROTO_STREAM_VERSION_NUM
> >
> > Done this way.
> >
>
> - options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
> + options.proto.logical.proto_version = MySubscription->stream ?
> + LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
>
> Here, I think instead of using MySubscription->stream, we should use
> server/walrecv version number as we used at one place in tablesync.c.

I am not sure how we can do this.  If the PG version number is 14 then we
will always send LOGICALREP_PROTO_STREAM_VERSION_NUM, and then again
we will face the same error, right?  I think it should be strictly
based on whether we have enabled streaming or not.  Because the
logical replication protocol is still the same; only if streaming
is enabled do we expect the streaming protocol, otherwise not.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.

2020-09-24 Thread Amit Kapila
On Thu, Sep 24, 2020 at 4:31 PM Dilip Kumar  wrote:
>
> On Thu, Sep 24, 2020 at 11:55 AM Amit Kapila  wrote:
> >
> > On Tue, Sep 22, 2020 at 5:15 PM Dilip Kumar  wrote:
> > >
> > > On Tue, Sep 22, 2020 at 12:02 PM Amit Kapila  
> > > wrote:
> > > >
> > > >
> > > > I am not sure if this suggestion makes it better than what is purposed
> > > > by Dilip but I think we can declare them in define number order like
> > > > below:
> > > > #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
> > > > #define LOGICALREP_PROTO_VERSION_NUM 1
> > > > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
> > > > #define LOGICALREP_PROTO_MAX_VERSION_NUM 
> > > > LOGICALREP_PROTO_STREAM_VERSION_NUM
> > >
> > > Done this way.
> > >
> >
> > - options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
> > + options.proto.logical.proto_version = MySubscription->stream ?
> > + LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
> >
> > Here, I think instead of using MySubscription->stream, we should use
> > server/walrecv version number as we used at one place in tablesync.c.
>
> I am not sure how can we do this?
>

Have you checked what will function walrcv_server_version() will
return? I was thinking that if we know that subscriber is connected to
Publisher version < 14 then we can send the right value.


-- 
With Regards,
Amit Kapila.




Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.

2020-09-24 Thread Dilip Kumar
On Thu, Sep 24, 2020 at 4:45 PM Amit Kapila  wrote:
>
> On Thu, Sep 24, 2020 at 4:31 PM Dilip Kumar  wrote:
> >
> > On Thu, Sep 24, 2020 at 11:55 AM Amit Kapila  
> > wrote:
> > >
> > > On Tue, Sep 22, 2020 at 5:15 PM Dilip Kumar  wrote:
> > > >
> > > > On Tue, Sep 22, 2020 at 12:02 PM Amit Kapila  
> > > > wrote:
> > > > >
> > > > >
> > > > > I am not sure if this suggestion makes it better than what is purposed
> > > > > by Dilip but I think we can declare them in define number order like
> > > > > below:
> > > > > #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
> > > > > #define LOGICALREP_PROTO_VERSION_NUM 1
> > > > > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
> > > > > #define LOGICALREP_PROTO_MAX_VERSION_NUM 
> > > > > LOGICALREP_PROTO_STREAM_VERSION_NUM
> > > >
> > > > Done this way.
> > > >
> > >
> > > - options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
> > > + options.proto.logical.proto_version = MySubscription->stream ?
> > > + LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
> > >
> > > Here, I think instead of using MySubscription->stream, we should use
> > > server/walrecv version number as we used at one place in tablesync.c.
> >
> > I am not sure how can we do this?
> >
>
> Have you checked what will function walrcv_server_version() will
> return? I was thinking that if we know that subscriber is connected to
> Publisher version < 14 then we can send the right value.

But suppose we know the publisher version is < 14 and the user has set
streaming on while creating the subscription; will we still send the
right version?  I think in tablesync we are forming a query, so we
form it as per the publisher's version.  But here we are sending which
protocol version we are expecting for the data transfer, so I feel it
should be LOGICALREP_PROTO_VERSION_NUM if we expect non-streaming
transfer and LOGICALREP_PROTO_STREAM_VERSION_NUM if we expect
streaming transfer.  Now let the publisher decide whether it supports
the protocol we have asked for or not, and if it doesn't support it then
let it give an error.  Am I missing something here?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Report error position in partition bound check

2020-09-24 Thread Amit Langote
On Thu, Sep 24, 2020 at 7:19 AM Tom Lane  wrote:
> I looked this over and pushed it with some minor adjustments.

Thank you.

> However, while I was looking at it I couldn't help noticing that
> transformPartitionBoundValue's handling of collation concerns seems
> less than sane.  There are two things bugging me:
>
> 1. Why does it care about the expression's collation only when there's
> a top-level CollateExpr?  For example, that means we get an error for
>
> regression=# create table p (f1 text collate "C") partition by list(f1);
> CREATE TABLE
> regression=# create table c1 partition of p for values in ('a' collate 
> "POSIX");
> ERROR:  collation of partition bound value for column "f1" does not match 
> partition key collation "C"
>
> but not this:
>
> regression=# create table c2 partition of p for values in ('a' || 'b' collate 
> "POSIX");
> CREATE TABLE
>
> Given that we will override the expression's collation with the partition
> column's collation anyway, I don't see why we have this check at all,
> so my preference is to just rip out the entire stanza beginning with
> "if (IsA(value, CollateExpr))".  If we keep it, though, I think it needs
> to do something else that is more general.
>
> 2. Nothing is doing assign_expr_collations() on the partition expression.
> This can trivially be shown to cause problems:
>
> regression=# create table p (f1 bool) partition by list(f1);
> CREATE TABLE
> regression=# create table cf partition of p for values in ('a' < 'b');
> ERROR:  could not determine which collation to use for string comparison
> HINT:  Use the COLLATE clause to set the collation explicitly.
>
>
> If we want to rip out the collation mismatch error altogether, then
> fixing #2 would just require inserting assign_expr_collations() before
> the expression_planner() call.  The other direction that would make
> sense to me is to perform assign_expr_collations() after
> coerce_to_target_type(), and then to complain if exprCollation()
> isn't default and doesn't match the partition collation.  In any
> case a specific test for a CollateExpr seems quite wrong.

I tried implementing that as attached and one test failed:

create table test_part_coll_posix (a text) partition by range (a
collate "POSIX");
...
create table test_part_coll_cast2 partition of test_part_coll_posix
for values from (name 's') to ('z');
+ERROR:  collation of partition bound value for column "a" does not
match partition key collation "POSIX"
+LINE 1: ...ion of test_part_coll_posix for values from (name 's') to ('...

I dug up the discussion which resulted in this test being added and
found that Peter E had opined that this failure should not occur [1].
Maybe that is why I put that half-baked guard consisting of checking
if the erroneous collation comes from an explicit COLLATE clause.  Now
I think maybe giving an error is alright but we should tell in the
DETAIL message what the expression's collation is, like as follows:

create table test_part_coll_cast2 partition of test_part_coll_posix
for values from (name 's') to ('z');
+ERROR:  collation of partition bound value for column "a" does not
match partition key collation "POSIX"
+LINE 1: ...ion of test_part_coll_posix for values from (name 's') to ('...
+ ^
+DETAIL:  The collation of partition bound value is "C".
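
In transformPartitionBoundValue that could be done with something along
these lines (just a sketch; the variable names are as in the existing code):

/* sketch only: add an errdetail with the bound expression's collation */
ereport(ERROR,
		(errcode(ERRCODE_DATATYPE_MISMATCH),
		 errmsg("collation of partition bound value for column \"%s\" does not match partition key collation \"%s\"",
				colName, get_collation_name(partCollation)),
		 errdetail("The collation of partition bound value is \"%s\".",
				   get_collation_name(exprCollOid)),
		 parser_errposition(pstate, exprLocation(value))));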

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

[1] 
https://www.postgresql.org/message-id/04661508-b6f5-177e-6f6b-c4cb8426b9f0%402ndquadrant.com


partition-bound-collation-check.patch
Description: Binary data


Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.

2020-09-24 Thread Amit Kapila
On Thu, Sep 24, 2020 at 5:11 PM Dilip Kumar  wrote:
>
> On Thu, Sep 24, 2020 at 4:45 PM Amit Kapila  wrote:
> >
> >
> > Have you checked what will function walrcv_server_version() will
> > return? I was thinking that if we know that subscriber is connected to
> > Publisher version < 14 then we can send the right value.
>
> But, suppose we know the publisher version is < 14 and user has set
> streaming on while  creating subscriber then still we will send the
> right version?
>

Yeah we can send the version depending on which server we are talking to?

>  I think tablesync we are forming a query so we are
> forming as per the publisher's version.  But here we are sending which
> protocol version we are expecting for the data transfer so I feel it
> should be LOGICALREP_PROTO_VERSION_NUM if we expect non-streaming
> transfer and LOGICALREP_PROTO_STREAM_VERSION_NUM if we expect the
> streaming transfer.
>

I am not sure if this is the right strategy.  See
libpqrcv_startstreaming: even if the user asked for streaming, unless
the server supports it we don't send the streaming option to the server.
Another thing is that this check will become more complicated if we need
another feature, like decode prepared, to send a different version, or even
if it is the same version we might need additional checks.  Why do you
want to send a streaming protocol version when we know the server
doesn't support it?  Let's behave similarly to what we do in
libpqrcv_startstreaming.
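
For what it's worth, a rough sketch of such a combined check could look
like this (hypothetical code; "wrconn" stands for the apply worker's
connection, and 140000 for the first server version that understands the
streaming protocol):

/*
 * Hypothetical sketch: request the streaming protocol only when the
 * subscription asked for it and the publisher is new enough, similar to
 * how libpqrcv_startstreaming only passes the streaming option when the
 * server supports it.
 */
if (MySubscription->stream &&
	walrcv_server_version(wrconn) >= 140000)
	options.proto.logical.proto_version = LOGICALREP_PROTO_STREAM_VERSION_NUM;
else
	options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;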

-- 
With Regards,
Amit Kapila.




Re: Resetting spilled txn statistics in pg_stat_replication

2020-09-24 Thread Amit Kapila
On Sat, Sep 19, 2020 at 1:48 PM Amit Kapila  wrote:
>
> On Tue, Sep 8, 2020 at 7:02 PM Amit Kapila  wrote:
> >
> > On Tue, Sep 8, 2020 at 7:53 AM Masahiko Sawada
> >  wrote:
>
> I have fixed these review comments in the attached patch.
>
>
> Apart from the above,
> (a) fixed one bug in ReorderBufferSerializeTXN() where we were
> updating the stats even when we have not spilled anything.
> (b) made changes in pgstat_read_db_statsfile_timestamp to return false
> when the replication slot entry is corrupt.
> (c) move the declaration and definitions in pgstat.c to make them
> consistent with existing code
> (d) made another couple of cosmetic fixes and changed a few comments
> (e) Tested the patch by using a guc which allows spilling all the
> changes. See v4-0001-guc-always-spill
>

I have found a way to write the test case for this patch. This is
based on the idea we used in stats.sql. As of now, I have kept the
test as a separate patch. We can decide to commit the test part
separately as it is slightly timing dependent but OTOH as it is based
on existing logic in stats.sql so there shouldn't be much problem. I
have not changed anything apart from the test patch in this version.
Note that the first patch is just a debugging kind of tool to test the
patch.

Thoughts?

-- 
With Regards,
Amit Kapila.


v5-0001-guc-always-spill.patch
Description: Binary data


v5-0002-Track-statistics-for-spilling-of-changes-from-Reo.patch
Description: Binary data


v5-0003-Test-stats.patch
Description: Binary data


Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.

2020-09-24 Thread Ashutosh Sharma
Hi Amit,

On Thu, Sep 24, 2020 at 11:55 AM Amit Kapila  wrote:
>
> On Tue, Sep 22, 2020 at 5:15 PM Dilip Kumar  wrote:
> >
> > On Tue, Sep 22, 2020 at 12:02 PM Amit Kapila  
> > wrote:
> > >
> > >
> > > I am not sure if this suggestion makes it better than what is purposed
> > > by Dilip but I think we can declare them in define number order like
> > > below:
> > > #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
> > > #define LOGICALREP_PROTO_VERSION_NUM 1
> > > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
> > > #define LOGICALREP_PROTO_MAX_VERSION_NUM 
> > > LOGICALREP_PROTO_STREAM_VERSION_NUM
> >
> > Done this way.
> >
>
> - options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
> + options.proto.logical.proto_version = MySubscription->stream ?
> + LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;
>
> Here, I think instead of using MySubscription->stream, we should use
> server/walrecv version number as we used at one place in tablesync.c.

Should subscribers be setting the LR protocol value based on the
publisher version they are communicating with, or should it be set
based on whether streaming was enabled or not while creating the
subscription? AFAIU, if we set this value based on the publisher
version (which is, let's say, >= 14), then it's quite possible that the
subscriber will start streaming changes for the in-progress
transactions even if streaming was disabled while creating the
subscription, won't it?

Please let me know if I am missing something here.

Thanks,

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com




Re: fixing old_snapshot_threshold's time->xid mapping

2020-09-24 Thread Hamid Akhtar
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested

Patch looks good to me.

Re: PostgreSQL 13 Release Timeline

2020-09-24 Thread Jonathan S. Katz
On 9/10/20 1:08 PM, Jonathan S. Katz wrote:
> Hi,
> 
> On 9/2/20 2:13 PM, Jonathan S. Katz wrote:
> 
>> * PostgreSQL 13 Release Candidate 1 (RC1) will be released on September
>> 17, 2020.
>>
>> * In absence of any critical issues, PostgreSQL 13 will become generally
>> available on September 24, 2020.
>>
>> The aim is to have all outstanding open items committed before RC1.
>> Please ensure everything is committed by September 16, 2020 AoE. We will
>> send out a reminder prior to that date.
> 
> As a friendly reminder, RC1 is one week away. I also realized I made an
> error in the commit date above for RC1. Sorry :( Corrected date below.
> 
> Please ensure you have all of your commits in by **September 13, 2020
> AoE** so we can wrap RC1 and get it out the door.
> 
> Presuming RC1 is successful, commits for GA must be in by **September
> 20, 2020 AoE**.
> 
> Please try to close out open items[1] before RC1.

I am pleased to report that PostgreSQL 13 is now GA:

https://www.postgresql.org/about/news/2077/

Thank you everyone for your hard work and congratulations on another
successful release!

Jonathan







Re: Parallel copy

2020-09-24 Thread Ashutosh Sharma
On Thu, Sep 24, 2020 at 3:00 PM Bharath Rupireddy
 wrote:
>
> >
> > > Have you tested your patch when encoding conversion is needed? If so,
> > > could you please point out the email that has the test results.
> > >
> >
> > We have not yet done encoding testing, we will do and post the results
> > separately in the coming days.
> >
>
> Hi Ashutosh,
>
> I ran the tests ensuring pg_server_to_any() gets called from copy.c. I 
> specified the encoding option of COPY command, with client and server 
> encodings being UTF-8.
>

Thanks Bharath for the testing. The results look impressive.

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com




Re: PostgreSQL 13 Release Timeline

2020-09-24 Thread Simon Riggs
> I am pleased to report that PostgreSQL 13 is now GA:
>
> https://www.postgresql.org/about/news/2077/
>
> Thank you everyone for your hard work and congratulations on another
> successful release!

Well done team!

-- 
Simon Riggs    http://www.2ndQuadrant.com/
Mission Critical Databases




Re: problem with RETURNING and update row movement

2020-09-24 Thread Amit Langote
On Thu, Sep 24, 2020 at 7:30 PM Etsuro Fujita  wrote:
> On Thu, Sep 24, 2020 at 2:47 PM Amit Langote  wrote:
> > On Thu, Sep 24, 2020 at 4:25 AM Etsuro Fujita  
> > wrote:
> > > Yeah, but for the other issue, I started thinking that we should just
> > > forbid referencing xmin/xmax/cmin/cmax in 12, 13, and HEAD...
> >
> > When the command is being performed on a partitioned table you mean?
>
> Yes.  One concern about that is triggers: IIUC, triggers on a
> partition as-is can or can not reference xmin/xmax/cmin/cmax depending
> on whether a dedicated tuple slot for the partition is used or not.
> We should do something about this if we go in that direction?

Maybe I'm missing something, but assuming that we're talking about
prohibiting system attribute access in the RETURNING clause, how does
that affect what triggers can or cannot do?  AFAICS, only AFTER
row-level triggers may sensibly access system attributes and
whether/how they can do so has not much to do with the slot that
ExecInsert() gets the new tuple in.  It seems that the AFTER trigger
infrastructure remembers an affected tuple's ctid and fetches it just
before calling trigger function by asking the result relation's (e.g.,
a partition's) access method.

To illustrate, with HEAD:

create table foo (a int, b int) partition by range (a);
create table foo1 partition of foo for values from (1) to (2);

create or replace function report_system_info () returns trigger
language plpgsql as $$
begin
  raise notice 'ctid: %', new.ctid;
  raise notice 'xmin: %', new.xmin;
  raise notice 'xmax: %', new.xmax;
  raise notice 'cmin: %', new.cmin;
  raise notice 'cmax: %', new.cmax;
  raise notice 'tableoid: %', new.tableoid;
  return NULL;
end; $$;

create trigger foo_after_trig after insert on foo for each row execute
function report_system_info();

begin;

insert into foo values (1);
NOTICE:  ctid: (0,1)
NOTICE:  xmin: 532
NOTICE:  xmax: 0
NOTICE:  cmin: 0
NOTICE:  cmax: 0
NOTICE:  tableoid: 16387

insert into foo values (1);
NOTICE:  ctid: (0,2)
NOTICE:  xmin: 532
NOTICE:  xmax: 0
NOTICE:  cmin: 1
NOTICE:  cmax: 1
NOTICE:  tableoid: 16387

insert into foo values (1);
NOTICE:  ctid: (0,3)
NOTICE:  xmin: 532
NOTICE:  xmax: 0
NOTICE:  cmin: 2
NOTICE:  cmax: 2
NOTICE:  tableoid: 16387

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com




Re: PostgreSQL 13 Release Timeline

2020-09-24 Thread James Coleman
On Thu, Sep 24, 2020 at 9:30 AM Jonathan S. Katz  wrote:
>
> I am pleased to report that PostgreSQL 13 is now GA:
>
> https://www.postgresql.org/about/news/2077/

I'm not sure if there was a "release announcement draft" thread
(searching my folder didn't find one for GA, but it's also easily
possible I missed it): I'm wondering if the link for incremental sort
[1] should instead/also point to the examples in [2] which actually
explain what incremental sort does.

Actually, should the GUC page link there? There's not an obvious
anchor tag on the page (that I can tell) to use for such a link
though...

If it'd be better to split off this to a new thread, I can do that too.

James

1: 
https://www.postgresql.org/docs/13/runtime-config-query.html#GUC-ENABLE-INCREMENTAL-SORT
2: https://www.postgresql.org/docs/13/using-explain.html




Re: PostgreSQL 13 Release Timeline

2020-09-24 Thread Jonathan S. Katz
On 9/24/20 10:11 AM, James Coleman wrote:
> On Thu, Sep 24, 2020 at 9:30 AM Jonathan S. Katz  wrote:
>>
>> I am pleased to report that PostgreSQL 13 is now GA:
>>
>> https://www.postgresql.org/about/news/2077/
> 
> I'm not sure if there was a "release announcement draft" thread
> (searching my folder didn't find one for GA, but it's also easily
> possible I missed it)

Yes, there always is [1].

> : I'm wondering if the link for incremental sort
> [1] should instead/also point to the examples in [2] which actually
> explain what incremental sort does.
>
> 
> Actually, should the GUC page link there? There's not an obvious
> anchor tag on the page (that I can tell) to use for such a link
> though...

When researching it, I believe I chose the GUC because it was anchorable.

> If it'd be better to split off this to a new thread, I can do that too.

Probably, because it will trigger discussion ;)

Jonathan

[1]
https://www.postgresql.org/message-id/dfc987e3-bb8c-96f5-3bd2-11b9dcc96f19%40postgresql.org





Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2

2020-09-24 Thread Daniel Gustafsson
> On 24 Sep 2020, at 04:53, Michael Paquier  wrote:

> Enabling FIPS with OpenSSL 1.0.2 causes direct calls to the SHAXXX
> routines to fail:
> "Low level API call to digest SHA256 forbidden in fips mode"

Confirmed by running 1.0.2s-fips with fips_mode=yes in the alg_section in
openssl.cnf.

> This got discussed back in 2018, but I never got back to it:
> https://www.postgresql.org/message-id/20180911030250.ga27...@paquier.xyz
> 
> One thing I did not like back in the past patch was that we did not
> handle failures if one of OpenSSL's call failed, but this can easily
> be handled by using a trick similar to jsonapi.c to fail hard if that
> happens.
> 
> It is worth noting that the low-level SHA routines are not recommended
> for years in OpenSSL, and that these have been officially marked as
> deprecated in 3.0.0.  So, while the changes in sha2.h don't make this
> stuff back-patchable per the ABI breakage it introduces, switching 
> sha2_openssl.c to use EVP is a better move in the long term,

Agreed.  Commit 5ff4a67f63f [0] moved contrib/pgcrypto from using low-level
functions for pretty much exactly the same reasons: they are advised against
and break FIPS.

With your patch I can run the SSL tests successfully both with and without
FIPS. I'm in favor of moving to the EVP API.
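
For reference, the EVP calls in question boil down to something like the
following minimal sketch of a SHA-256 computation (not the patch itself,
just an illustration of the API):

#include <stdbool.h>
#include <stddef.h>
#include <openssl/evp.h>

/* Minimal sketch: compute the 32-byte SHA-256 digest of "data". */
static bool
sha256_evp(const unsigned char *data, size_t len, unsigned char *digest)
{
	EVP_MD_CTX *ctx = EVP_MD_CTX_create();
	bool		ok;

	if (ctx == NULL)
		return false;

	/* Any of these can fail, e.g. when FIPS rejects an algorithm. */
	ok = EVP_DigestInit_ex(ctx, EVP_sha256(), NULL) == 1 &&
		EVP_DigestUpdate(ctx, data, len) == 1 &&
		EVP_DigestFinal_ex(ctx, digest, NULL) == 1;

	EVP_MD_CTX_destroy(ctx);
	return ok;
}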

> ..even if
> that means that SCRAM+FIPS would not work with PG 10~13, so the
> attached is something for HEAD, even if this would be possible to do
> in older releases as the routines used in the attached are available
> in versions of OpenSSL older than 1.0.1.

That's kind of a shame, but given the low volume of reports to -bugs and
-hackers, the combination SCRAM and FIPS might not be all too common.  Since
OpenSSL FIPS is EOL it might also not increase until 3.0.0 comes with FIPS
certification?

If we really want to support it (which would require more evidence of it being
a problem IMO), using the non-OpenSSL sha256 code would be one option I guess?

cheers ./daniel

[0] https://postgr.es/m/561274f1.1030...@iki.fi



Re: [patch] Fix checksum verification in base backups for zero page headers

2020-09-24 Thread Anastasia Lubennikova



On 22.09.2020 17:30, Michael Banck wrote:
> Hi,
>
> On Tuesday, 22.09.2020 at 16:26 +0200, Michael Banck wrote:
>> On Wednesday, 02.09.2020 at 16:50 +0300, Anastasia Lubennikova wrote:
>>> I've looked through the previous discussion. As far as I got it, most of
>>> the controversy was about online checksums improvements.
>>>
>>> The warning about pd_upper inconsistency that you've added is a good
>>> addition. The patch is a bit messy, though, because a huge code block
>>> was shifted.
>>>
>>> Will it be different, if you just leave
>>>   "if (!PageIsNew(page) && PageGetLSN(page) < startptr)"
>>> block as it was, and add
>>>   "else if (PageIsNew(page) && !PageIsZero(page))" ?
>>
>> Thanks, that does indeed look better as a patch and I think it's fine
>> as-is for the code as well, I've attached a v2.
>
> Sorry, forgot to add you as reviewer in the proposed commit message,
> I've fixed that up now in V3.
>
> Michael


Great. This version looks good to me.
Thank you for answering my questions. I agree that we can work on them
in separate threads.


So I mark this one as ReadyForCommitter.

The only minor problem is a typo (?) in the proposed commit message.
"If a page is all zero, consider that a checksum failure." It should be 
"If a page is NOT all zero...".


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Refactor pg_rewind code and make it work against a standby

2020-09-24 Thread Heikki Linnakangas
Thanks for the review! I'll post a new version shortly, with your 
comments incorporated. More detailed response to a few of them below:


On 18/09/2020 10:41, Kyotaro Horiguchi wrote:

> I don't think filemap_finalize needs to iterate over filemap twice.


True, but I thought it's more clear that way, doing one thing at a time.


> hash_string_pointer is a copy of that of pg_verifybackup.c. Is it
> worth having in hashfn.h or .c?


I think it's fine for now. Maybe in the future if more copies crop up.


> --- a/src/bin/pg_rewind/pg_rewind.c
> +++ b/src/bin/pg_rewind/pg_rewind.c
> ...
> +   filemap_t  *filemap;
> ..
> +   filemap_init();
> ...
> +   filemap = filemap_finalize();
>
> I'm a bit confused by this, and realized that what filemap_init
> initializes is *not* the filemap, but the filehash. So for example,
> the names of the functions should perhaps be something like this?
>
> filehash_init()
> filemap = filehash_finalize()/create_filemap()


My thinking was that filemap_* is the prefix for the operations in 
filemap.c, hence filemap_init(). I can see the confusion, though, and I 
think you're right that renaming would be good. I renamed them to 
filehash_init(), and decide_file_actions(). I think those names make the 
calling code in pg_rewind.c quite clear.



> /*
>  * Also check that full_page_writes is enabled.  We can get torn pages if
>  * a page is modified while we read it with pg_read_binary_file(), and we
>  * rely on full page images to fix them.
>  */
> str = run_simple_query(conn, "SHOW full_page_writes");
> if (strcmp(str, "on") != 0)
> 	pg_fatal("full_page_writes must be enabled in the source server");
> pg_free(str);
>
> This is a part not changed by this patch set. But if we allow connecting
> to a standby, this check can be tricked by setting it off on the
> primary and "on" on the standby (FWIW, though). Some protection
> measure might be necessary.


Good point, the value in the primary is what matters. In fact, even when 
connected to the primary, the value might change while pg_rewind is 
running. I'm not sure how we could tighten that up.



> +   if (chunksize > rq->length)
> +   {
> +   pg_fatal("received more than requested for file \"%s\"",
> +            rq->path);
> +   /* receiving less is OK, though */
>
> Don't we need to truncate the target file, though?


If a file is truncated in the source while pg_rewind is running, there 
should be a WAL record about the truncation that gets replayed when you 
start the server after pg_rewind has finished. We could truncate the 
file if we wanted to, but it's not necessary. I'll add a comment.


- Heikki




Re: Report error position in partition bound check

2020-09-24 Thread Tom Lane
[ cc'ing Peter, since his opinion seems to have got us here in the first place ]

Amit Langote  writes:
> On Thu, Sep 24, 2020 at 7:19 AM Tom Lane  wrote:
>> However, while I was looking at it I couldn't help noticing that
>> transformPartitionBoundValue's handling of collation concerns seems
>> less than sane.  There are two things bugging me:
>> 
>> 1. Why does it care about the expression's collation only when there's
>> a top-level CollateExpr?  For example, that means we get an error for
>> 
>> regression=# create table p (f1 text collate "C") partition by list(f1);
>> CREATE TABLE
>> regression=# create table c1 partition of p for values in ('a' collate 
>> "POSIX");
>> ERROR:  collation of partition bound value for column "f1" does not match 
>> partition key collation "C"
>> 
>> but not this:
>> 
>> regression=# create table c2 partition of p for values in ('a' || 'b' 
>> collate "POSIX");
>> CREATE TABLE
>> 
>> Given that we will override the expression's collation with the partition
>> column's collation anyway, I don't see why we have this check at all,
>> so my preference is to just rip out the entire stanza beginning with
>> "if (IsA(value, CollateExpr))".  If we keep it, though, I think it needs
>> to do something else that is more general.

> I dug up the discussion which resulted in this test being added and
> found that Peter E had opined that this failure should not occur [1].

Well, I agree with Peter to that extent, but my opinion is that *none*
of these cases ought to be errors.  What we're doing here is performing
an implicit assignment-level coercion of the expression to the type of
the column, and changing the collation is allowed as part of that:

regression=# create table foo (f1 text collate "C");
CREATE TABLE
regression=# insert into foo values ('a' COLLATE "POSIX");
INSERT 0 1
regression=# update foo set f1 = 'b' COLLATE "POSIX";
UPDATE 1

So I find it completely inconsistent that the partitioning logic
complains about equivalent cases.  I think we should just rip the
whole thing out, as per the attached draft.  This causes several
regression test results to change, but AFAICS those are only there
to exercise the error tests that I think we should get rid of.

regards, tom lane

diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c
index 164312d60e..0dc03dd984 100644
--- a/src/backend/parser/parse_utilcmd.c
+++ b/src/backend/parser/parse_utilcmd.c
@@ -4183,50 +4183,6 @@ transformPartitionBoundValue(ParseState *pstate, Node *val,
 	 */
 	Assert(!contain_var_clause(value));
 
-	/*
-	 * Check that the input expression's collation is compatible with one
-	 * specified for the parent's partition key (partcollation).  Don't throw
-	 * an error if it's the default collation which we'll replace with the
-	 * parent's collation anyway.
-	 */
-	if (IsA(value, CollateExpr))
-	{
-		Oid			exprCollOid = exprCollation(value);
-
-		/*
-		 * Check we have a collation iff it is a collatable type.  The only
-		 * expected failures here are (1) COLLATE applied to a noncollatable
-		 * type, or (2) partition bound expression had an unresolved
-		 * collation.  But we might as well code this to be a complete
-		 * consistency check.
-		 */
-		if (type_is_collatable(colType))
-		{
-			if (!OidIsValid(exprCollOid))
-ereport(ERROR,
-		(errcode(ERRCODE_INDETERMINATE_COLLATION),
-		 errmsg("could not determine which collation to use for partition bound expression"),
-		 errhint("Use the COLLATE clause to set the collation explicitly.")));
-		}
-		else
-		{
-			if (OidIsValid(exprCollOid))
-ereport(ERROR,
-		(errcode(ERRCODE_DATATYPE_MISMATCH),
-		 errmsg("collations are not supported by type %s",
-								format_type_be(colType))));
-		}
-
-		if (OidIsValid(exprCollOid) &&
-			exprCollOid != DEFAULT_COLLATION_OID &&
-			exprCollOid != partCollation)
-			ereport(ERROR,
-	(errcode(ERRCODE_DATATYPE_MISMATCH),
-	 errmsg("collation of partition bound value for column \"%s\" does not match partition key collation \"%s\"",
-			colName, get_collation_name(partCollation)),
-	 parser_errposition(pstate, exprLocation(value))));
-	}
-
 	/*
 	 * Coerce to the correct type.  This might cause an explicit coercion step
 	 * to be added on top of the expression, which must be evaluated before
@@ -4253,6 +4209,7 @@ transformPartitionBoundValue(ParseState *pstate, Node *val,
 	 */
 	if (!IsA(value, Const))
 	{
+		assign_expr_collations(pstate, value);
 		value = (Node *) expression_planner((Expr *) value);
 		value = (Node *) evaluate_expr((Expr *) value, colType, colTypmod,
 	   partCollation);


Re: "cert" + clientcert=verify-ca in pg_hba.conf?

2020-09-24 Thread Bruce Momjian
On Thu, Sep 24, 2020 at 12:44:01PM +0900, Michael Paquier wrote:
> On Tue, Sep 01, 2020 at 10:27:03PM -0400, Bruce Momjian wrote:
> > OK, good.  Let's wait a few days and I will then apply it for PG 14.
> 
> It has been a few days, and nothing has happened here.  I have not
> looked at the patch in details, so I cannot say if that's fine or not,
> but please note that the patch fails to apply per the CF bot.

I will handle it.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Custom options for building extensions with --with--llvm

2020-09-24 Thread Konstantin Knizhnik

Hi,

In my extension I want to define some custom options for the compiler.
I do it in the following way:

ifdef USE_DISK
CUSTOM_COPT += -DIMCS_DISK_SUPPORT
endif


So if I want to enable disk support, I am building my extension as

make USE_DISK=1 clean all install

and it normally works... unless Postgres is built with --enable-llvm.
In this case it tries to build llvm code using clang and is not using 
CUSTOM_COPTS:


gcc -Wall -Wmissing-prototypes -Wpointer-arith 
-Wdeclaration-after-statement -Werror=vla -Wendif-labels 
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing 
-fwrapv -fexcess-precision=standard -g -O3 -Wall -pthread 
-DIMCS_DISK_SUPPORT -fPIC -I. -I. -I../../src/include -D_GNU_SOURCE   -c 
-o disk.o disk.c


vs.

/usr/bin/clang -Wno-ignored-attributes -fno-strict-aliasing -fwrapv -O2  
-I. -I. -I../../src/include  -D_GNU_SOURCE  -flto=thin -emit-llvm -c -o 
disk.bc disk.c



I wonder: is there any way to pass custom compile options to clang?
Thanks in advance,

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2

2020-09-24 Thread Heikki Linnakangas

On 24/09/2020 17:21, Daniel Gustafsson wrote:

If we really want to support it (which would require more evidence of it being
a problem IMO), using the non-OpenSSL sha256 code would be one option I guess?


That would technically work, but wouldn't it make the product as a whole 
not FIPS compliant? I'm not a FIPS lawyer, but as I understand it the 
point of FIPS is that all the crypto code is encapsulated in a certified 
module. Having your own SHA-256 implementation would defeat that.


- Heikki




Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2

2020-09-24 Thread Daniel Gustafsson
> On 24 Sep 2020, at 18:21, Heikki Linnakangas  wrote:
> 
> On 24/09/2020 17:21, Daniel Gustafsson wrote:
>> If we really want to support it (which would require more evidence of it 
>> being
>> a problem IMO), using the non-OpenSSL sha256 code would be one option I 
>> guess?
> 
> That would technically work, but wouldn't it make the product as whole not 
> FIPS compliant? I'm not a FIPS lawyer, but as I understand it the point of 
> FIPS is that all the crypto code is encapsulated in a certified module. 
> Having your own SHA-256 implementation would defeat that.

Doh, of course, I blame a lack of caffeine this afternoon.  Having a private
local sha256 implementation using the EVP_* API inside scram-common would
maintain FIPS compliance and ABI compatibility, but would also be rather ugly.

cheers ./daniel



Incremental sort docs and release announcement

2020-09-24 Thread James Coleman
I'm breaking out a few questions I'd posed on another thread about the
release timeline [1] into this new thread.

I noticed on the PG13 release announcement that the link for
incremental sort goes to the GUC docs [2] because (as Jonathan Katz
confirmed in the linked thread) that page actually has helpful anchor
tags.

But I'm wondering if instead/also it should point to the examples in
the EXPLAIN docs [3] which actually explain what incremental sort
does. In the initial patch discussion we ended up putting the
explanation there because there was a desire to keep the GUC
descriptions short.
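
(To illustrate for anyone skimming this thread, the kind of plan those
EXPLAIN docs describe looks roughly like the sketch below; the table,
column, and index names here are made up:)

-- An index on (station) provides a partial ordering, so only recorded_at
-- needs to be sorted within each group of equal station values.
CREATE TABLE measurements (station int, recorded_at timestamptz, reading numeric);
CREATE INDEX ON measurements (station);
EXPLAIN SELECT * FROM measurements ORDER BY station, recorded_at;
-- When the planner picks this strategy, the plan shows an "Incremental Sort"
-- node with "Sort Key: station, recorded_at" and "Presorted Key: station".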

But that raises a larger question: should the GUC page also link to
the EXPLAIN examples? There's not an obvious anchor tag on that page
(that I can tell) to use for such a link, though... so that could get
into an even larger question about adding those anchors.

It seems to me that we don't have a particularly great place in the docs
for detailed explanations of the algorithms/nodes/etc. we
use... unless the new glossary section in the docs could fill that
role?

Any thoughts?

James

1: 
https://www.postgresql.org/message-id/CAAaqYe_kiFZ2T_KhHf8%2BjR_f4YnH%3DhdnN3QDCA1D%2B%2BF0GuHpdg%40mail.gmail.com
2:  
https://www.postgresql.org/docs/13/runtime-config-query.html#GUC-ENABLE-INCREMENTAL-SORT
3: https://www.postgresql.org/docs/13/using-explain.html




Re: Refactor pg_rewind code and make it work against a standby

2020-09-24 Thread Heikki Linnakangas

On 20/09/2020 23:44, Soumyadeep Chakraborty wrote:

Before getting into the code review for the patch, I wanted to know why
we don't use a Bitmapset for target_modified_pages?


Bitmapset is not available in client code. Perhaps it could be moved to 
src/common with some changes, but doesn't seem worth it until there's 
more client code that would need it.


I'm not sure that a bitmap is the best data structure for tracking the 
changed blocks in the first place. A hash table might be better if there 
are only a few changed blocks, or something like 
src/backend/lib/integerset.c if there are many. But as long as the 
simple bitmap gets the job done, let's keep it simple.



2. Rename target_modified_pages to target_pages_to_overwrite?
target_modified_pages can lead to confusion as to whether it includes pages
that were modified on the target but not even present in the source and
the other exclusionary cases. target_pages_to_overwrite is much clearer.


Agreed, I'll rename it.

Conceptually, while we're scanning source WAL, we're just making note of 
the modified blocks. The decision on what to do with them happens only 
later, in decide_file_action(). The difference is purely theoretical, 
though. There is no real decision to be made, all the modified blocks 
will be overwritten. So on the whole, I agree 'target_page_to_overwrite' 
is better.



/*
  * If this is a relation file, copy the modified blocks.
  *
  * This is in addition to any other changes.
  */
iter = datapagemap_iterate(&entry->target_modified_pages);
while (datapagemap_next(iter, &blkno))
{
offset = blkno * BLCKSZ;

source->queue_fetch_range(source, entry->path, offset, BLCKSZ);
}
pg_free(iter);


Can we put this hunk into a static function overwrite_pages()?


Meh, it's only about 10 lines of code, and one caller.


4.


* block that have changed in the target system.  It makes note of all the
* changed blocks in the pagemap of the file.


Can we replace the above with:


* block that has changed in the target system.  It decides if the given

blkno in the target relfile needs to be overwritten from the source.


Ok. Again conceptually though, process_target_wal_block_change() just 
collects information, and the decisions are made later. But you're right 
that we do leave out truncated-away blocks here, so we are doing more 
than just noting all the changed blocks.



/*
  * Doesn't exist in either server. Why does it have an entry in the
  * first place??
  */
return FILE_ACTION_NONE;


Can we delete the above hunk and add the following assert to the very
top of decide_file_action():

Assert(entry->target_exists || entry->source_exists);


I would like to keep the check even when assertions are not enabled. 
I'll add an Assert(false) there.



7. Please address the FIXME for the symlink case:
/* FIXME: Check if it points to the same target? */


It's not a new issue. Would be nice to fix, of course. I'm not sure what 
the right thing to do would be. If you have e.g. replaced 
postgresql.conf with a symlink that points outside the data directory, 
would it be appropriate to overwrite it? Or perhaps we should throw an 
error? We also throw an error if a file is a symlink in the source but a 
regular file in the target, or vice versa.



8.

* it anyway. But there's no harm in copying it now.)

and

* copy them here. But we don't know which scenario we're
* dealing with, and there's no harm in copying the missing
* blocks now, so do it now.

Could you add a line or two explaining why there is "no harm" in these
two hunks above?


The previous sentences explain that there's a WAL record covering them. 
So they will be overwritten by WAL replay anyway. Does it need more 
explanation?



14. queue_overwrite_range(), finish_overwrite() instead of
queue_fetch_range(), finish_fetch()? Similarly update\
*_fetch_file_range() and *_finish_fetch()


15. Let's have local_source.c and libpq_source.c instead of *_fetch.c


Good idea! And fetch.h -> rewind_source.h.

I also moved the code in execute_file_actions() function to pg_rewind.c, 
into a new function: perform_rewind(). It felt a bit silly to have just 
execute_file_actions() in a file of its own. perform_rewind() now does 
all the modifications to the data directory, writing the backup file. 
Except for writing the recovery config: that also needs to be when 
there's no rewind to do, so it makes sense to keep it separate. What do 
you think?



16.


conn = PQconnectdb(connstr_source);

if (PQstatus(conn) == CONNECTION_BAD)
pg_fatal("could not connect to server: %s",
PQerrorMessage(conn));

if (showprogress)
pg_log_info("connected to server");



The above hunk should be part of init_libpq_source(). Consequently,
init_libpq_source() should take a connection string instead of a conn.


The libpq connection is also needed by WriteRecoveryConfig(); that's why 
it's not fully encapsulated in libpq_source.



19.


typedef struct
{
const char *path; /* path relative to data directory root */
uint64

Re: proposal: possibility to read dumped table's name from file

2020-09-24 Thread Pavel Stehule
Hi

rebase + minor change - using pg_get_line_buf instead of pg_get_line_append

Regards

Pavel
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index e09ed0a4c3..9d7ebaf922 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -757,6 +757,56 @@ PostgreSQL documentation
   
  
 
+ 
+  --filter=filename
+  
+   
+Read objects filters from the specified file.
+If you use "-" as a filename, the filters are read from stdin.
+The lines of this file must have the following format:
+
+(+|-)[tnfd] objectname
+
+   
+
+   
+The first character specifies whether the object is to be included
+(+) or excluded (-), and the
+second character specifies the type of object to be filtered:
+t (table),
+n (schema),
+f (foreign server),
+d (table data).
+   
+
+   
+With the following filter file, the dump would include table
+mytable1 and data from foreign tables of
+some_foreign_server foreign server, but exclude data
+from table mytable2.
+
++t mytable1
++f some_foreign_server
+-d mytable2
+
+   
+
+   
+The lines starting with symbol # are ignored.
+Previous white chars (spaces, tabs) are not allowed. These
+lines can be used for comments, notes. Empty lines are ignored too.
+   
+
+   
+The --filter option works just like the other
+options to include or exclude tables, schemas, table data, or foreign
+tables, and both forms may be combined.  Note that there are no options
+to exclude a specific foreign table or to include a specific table's
+data.
+   
+  
+ 
+
  
   --if-exists
   
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index f021bb72f4..a7824c5bf6 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -53,9 +53,11 @@
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
+#include "common/string.h"
 #include "dumputils.h"
 #include "fe_utils/string_utils.h"
 #include "getopt_long.h"
+#include "lib/stringinfo.h"
 #include "libpq/libpq-fs.h"
 #include "parallel.h"
 #include "pg_backup_db.h"
@@ -290,6 +292,7 @@ static void appendReloptionsArrayAH(PQExpBuffer buffer, const char *reloptions,
 static char *get_synchronized_snapshot(Archive *fout);
 static void setupDumpWorker(Archive *AHX);
 static TableInfo *getRootTableInfo(TableInfo *tbinfo);
+static void read_patterns_from_file(char *filename, DumpOptions *dopt);
 
 
 int
@@ -364,6 +367,7 @@ main(int argc, char **argv)
 		{"enable-row-security", no_argument, &dopt.enable_row_security, 1},
 		{"exclude-table-data", required_argument, NULL, 4},
 		{"extra-float-digits", required_argument, NULL, 8},
+		{"filter", required_argument, NULL, 12},
 		{"if-exists", no_argument, &dopt.if_exists, 1},
 		{"inserts", no_argument, NULL, 9},
 		{"lock-wait-timeout", required_argument, NULL, 2},
@@ -603,6 +607,10 @@ main(int argc, char **argv)
 		  optarg);
 break;
 
+			case 12:			/* filter implementation */
+read_patterns_from_file(optarg, &dopt);
+break;
+
 			default:
 fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
 exit_nicely(1);
@@ -1022,6 +1030,8 @@ help(const char *progname)
 			 "   access to)\n"));
 	printf(_("  --exclude-table-data=PATTERN do NOT dump data for the specified table(s)\n"));
 	printf(_("  --extra-float-digits=NUM override default setting for extra_float_digits\n"));
+	printf(_("  --filter=FILENAMEdump objects and data based on the filter expressions\n"
+			 "   from the filter file\n"));
 	printf(_("  --if-exists  use IF EXISTS when dropping objects\n"));
 	printf(_("  --include-foreign-data=PATTERN\n"
 			 "   include data of foreign tables on foreign\n"
@@ -18470,3 +18480,160 @@ appendReloptionsArrayAH(PQExpBuffer buffer, const char *reloptions,
 	if (!res)
 		pg_log_warning("could not parse reloptions array");
 }
+
+/*
+ * Print error message and exit.
+ */
+static void
+exit_invalid_filter_format(FILE *fp, char *filename, char *message, char *line, int lineno)
+{
+	pg_log_error("invalid format of filter file \"%s\": %s",
+ filename,
+ message);
+
+	fprintf(stderr, "%d: %s\n", lineno, line);
+
+	if (fp != stdin)
+		fclose(fp);
+
+	exit_nicely(-1);
+}
+
+/*
+ * Read dumped object specification from file
+ */
+static void
+read_patterns_from_file(char *filename, DumpOptions *dopt)
+{
+	FILE	   *fp;
+	int			lineno = 0;
+	StringInfoData line;
+
+	/* use "-" as symbol for stdin */
+	if (strcmp(filename, "-") != 0)
+	{
+		fp = fopen(filename, "r");
+		if (!fp)
+			fatal("could not open the input file \"%s\": %m",
+  filename);
+	}
+	else
+		fp = stdin;
+
+	initStringInfo(&line);
+
+	while (pg_ge

Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2

2020-09-24 Thread Peter Eisentraut

On 2020-09-24 18:21, Heikki Linnakangas wrote:

That would technically work, but wouldn't it make the product as whole
not FIPS compliant? I'm not a FIPS lawyer, but as I understand it the
point of FIPS is that all the crypto code is encapsulated in a certified
module. Having your own SHA-256 implementation would defeat that.


Depends on what one considers to be covered by FIPS.  The entire rest of 
SCRAM is custom code, so running it on top of the world's greatest 
SHA-256 implementation isn't going to make the end product any more 
trustworthy.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Custom options for building extensions with --with--llvm

2020-09-24 Thread Andres Freund
Hi,

On 2020-09-24 19:15:22 +0300, Konstantin Knizhnik wrote:
> In my extension I want to define some custom options for compiler.
> I do it in the following way:
> 
> ifdef USE_DISK
> CUSTOM_COPT += -DIMCS_DISK_SUPPORT
> endif

Why aren't you adding it to PG_CPPFLAGS? That should work, and I think
that's what several contrib modules are using.

My understanding of CUSTOM_COPT is that it's for use in Makefile.custom,
not for contrib modules etc?


> I wonder is there any way to pass custom compile options to clang?
> Thanks in advance,

It probably wouldn't hurt to add COPT to the respective rules in
Makefile.global.in in some way. Shouldn't be too hard to do, if you want
to write a patch. Probably just apply it to BITCODE_CFLAGS as well.

Greetings,

Andres Freund




Re: proposal: schema variables

2020-09-24 Thread Pavel Stehule
čt 24. 9. 2020 v 5:58 odesílatel Pavel Stehule 
napsal:

>
>
> čt 24. 9. 2020 v 5:56 odesílatel Michael Paquier 
> napsal:
>
>> On Sat, Jul 11, 2020 at 06:44:24AM +0200, Pavel Stehule wrote:
>> > rebase
>>
>> Per the CF bot, this needs an extra rebase as it does not apply
>> anymore.  This has not attracted much the attention of committers as
>> well.
>>
>
> I'll fix it today
>

fixed patch attached

Regards

Pavel


> --
>> Michael
>>
>


schema-variables-20200924.patch.gz
Description: application/gzip


[patch] Concurrent table reindex per-index progress reporting

2020-09-24 Thread Matthias van de Meent
Hi,

While working on a PG12 instance, I noticed that the progress reporting for
concurrent reindexing of non-index relations (i.e. REINDEX TABLE CONCURRENTLY)
fails to update the index and relation OIDs it is currently working on in the
pg_stat_progress_create_index view after creating the new indexes. Please find
attached a patch which properly sets these values at the appropriate places.

Any thoughts?

Matthias van de Meent
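
(For anyone who wants to observe the reported OIDs while this runs, a minimal
sketch; the table name is made up:)

-- Session 1: reindex a table that has several indexes.
REINDEX TABLE CONCURRENTLY some_table;

-- Session 2: watch which relation/index the view claims to be working on.
SELECT relid::regclass, index_relid::regclass, command, phase
FROM pg_stat_progress_create_index;
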
From f41da096b1f36118917fe345e2a6fc89530a40c9 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent 
Date: Thu, 24 Sep 2020 20:41:10 +0200
Subject: [PATCH] Report the active index for reindex table concurrently

The pgstat_progress reporting for the multi-index path of reindexing
failed to set the index and relation OIDs correctly for various stages,
which resulted in a pg_stat_progress_create_index view that did not
accurately represent the index that the progress was being reported on.

This commit tags the correct index and relation for each of the concurrent
index creation stages.

Signed-off-by: Matthias van de Meent 
---
 src/backend/commands/indexcmds.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index f1b5f87e6a..b2f04012a4 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -3431,6 +3431,15 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 		heapId = indexRel->rd_index->indrelid;
 		index_close(indexRel, NoLock);
 
+		/*
+		 * Configure progress reporting to report for this index.
+		 * While we're at it, reset the phase as it is cleared by start_command.
+		 */
+		pgstat_progress_start_command(PROGRESS_COMMAND_CREATE_INDEX, heapId);
+		pgstat_progress_update_param(PROGRESS_CREATEIDX_INDEX_OID, newIndexId);
+		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+	 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+
 		/* Perform concurrent build of new index */
 		index_concurrently_build(heapId, newIndexId);
 
@@ -3477,6 +3486,15 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
 		PushActiveSnapshot(snapshot);
 
+		/*
+		 * Configure progress reporting to report for this index.
+		 * While we're at it, reset the phase as it is cleared by start_command.
+		 */
+		pgstat_progress_start_command(PROGRESS_COMMAND_CREATE_INDEX, heapId);
+		pgstat_progress_update_param(PROGRESS_CREATEIDX_INDEX_OID, newIndexId);
+		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+	 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+
 		validate_index(heapId, newIndexId, snapshot);
 
 		/*
-- 
2.20.1



Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2

2020-09-24 Thread Robert Haas
On Thu, Sep 24, 2020 at 1:57 PM Peter Eisentraut
 wrote:
> Depends on what one considers to be covered by FIPS.  The entire rest of
> SCRAM is custom code, so running it on top of the world's greatest
> SHA-256 implementation isn't going to make the end product any more
> trustworthy.

I mean, the issue here, as is so often the case, is not what is
actually more secure, but what meets the terms of some security
standard. At least in the US, FIPS 140-2 compliance is a reasonably
common need, so if we can make it easier for people who have that need
to be compliant, they are more likely to use PostgreSQL, which seems
like something that we should want. Our opinions about that standard
do not matter to the users who are legally required to comply with it;
the opinions of their lawyers and auditors do.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Compatible defaults for LEAD/LAG

2020-09-24 Thread Pavel Stehule
út 22. 9. 2020 v 2:33 odesílatel Tom Lane  napsal:

> Pavel Stehule  writes:
> > I see few possibilities how to finish and close this issue:
> > 1. use anycompatible type and add note to documentation so expression of
> > optional argument can change a result type, and so this is Postgres
> > specific - other databases and ANSI SQL disallow this.
> > It is the most simple solution, and it is good enough for this case, that
> > is not extra important.
> > 2. choose the correct type somewhere inside the parser - for these two
> > functions.
> > 3. introduce and implement some generic solution for this case - it can
> be
> > implemented on C level via some function helper or on syntax level
> >CREATE OR REPLACE FUNCTION lag(a anyelement, offset int, default
> defval
> > a%type) ...
> > "defval argname%type" means for caller's expression "CAST(defval AS
> > typeof(argname))"
>
> I continue to feel that the spec's definition of this is not so
> obviously right that we should jump through hoops to duplicate it.
> In fact, I don't even agree that we need a disclaimer in the docs
> saying that it's not quite the same.  Only the most nitpicky
> language lawyers would ever even notice.
>
> In hopes of moving this along a bit, I experimented with converting
> the other functions I listed to use anycompatible.  I soon found that
> touching any functions/operators that are listed in operator classes
> would open a can of worms far larger than I'd previously supposed.
> To maintain consistency, we'd have to propagate the datatype changes
> to *every* function/operator associated with those opfamilies --- many
> of which only have one any-foo input and thus aren't on my original
> list.  (An example here is that extending btree array_ops will require
> changing the max(anyarray) and min(anyarray) aggregates too.)  We'd
> then end up with a situation that would be rather confusing from a
> user's standpoint, in that it'd be quite un-obvious why some array
> functions use anyarray while other ones use anycompatiblearray.
>
> So I backed off to just changing the functions/operators that have
> no opclass connections, such as array_cat.  Even that has some
> downsides --- for example, in the attached patch, it's necessary
> to change some polymorphism.sql examples that explicitly reference
> array_cat(anyarray).  I wonder whether this change would break a
> significant number of user-defined aggregates or operators.
>
> (Note that I think we'd have to resist the temptation to fix that
> by letting CREATE AGGREGATE et al accept support functions that
> take anyarray/anycompatiblearray (etc) interchangeably.  A lot of
> the security analysis that went into CVE-2020-14350 depended on
> the assumption that utility commands only do exact lookups of
> support functions.  If we tried to be lax about this, we'd
> re-introduce the possibility for hostile capture of function
> references in extension scripts.)
>
> Another interesting issue, not seen in the attached but which
> came up while I was experimenting with the more aggressive patch,
> was this failure in the polymorphism test:
>
>  select max(histogram_bounds) from pg_stats where tablename = 'pg_am';
> -ERROR:  cannot compare arrays of different element types
> +ERROR:  function max(anyarray) does not exist
>
> That's because we don't accept pg_stats.histogram_bounds (of
> declared type anyarray) as a valid input for anycompatiblearray.
> I wonder if that isn't a bug we need to fix in the anycompatible
> patch, independently of anything else.  We'd not be able to deduce
> an actual element type from such an input, but we already cannot
> do so when we match it to an anyarray parameter.
>
> Anyway, attached find
>
> 0001 - updates Vik's original patch to HEAD and tweaks the
> grammar in the docs a bit.
>
> 0002 - add-on patch to convert array_append, array_prepend,
> array_cat, array_position, array_positions, array_remove,
> array_replace, and width_bucket to use anycompatiblearray.
>
> I think 0001 is committable, but 0002 is just WIP since
> I didn't touch the docs.  I'm slightly discouraged about
> whether 0002 is worth proceeding with.  Any thoughts?
>

I think 0002 makes sense - more than the docs, I miss related regression
tests, but that is partially covered by the anycompatible tests
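
(For example, a case such tests would cover - a sketch of the behaviour under
the anycompatible-based signature from patch 0001:)

-- With lag(anycompatible, integer, anycompatible), a numeric default
-- promotes the result type of an integer column to numeric:
SELECT x, lag(x, 1, 0.0) OVER (ORDER BY x) AS prev
FROM (VALUES (1), (2), (3)) AS t(x);
-- prev comes back as numeric (0.0, 1, 2) rather than integer.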

Anyway, I tested both patches and there is no problem with compilation or
building the docs, and make check-world passed

I'll mark this patch as ready for committer

Best regards

Pavel


>
> regards, tom lane
>
>


Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2

2020-09-24 Thread Daniel Gustafsson
> On 24 Sep 2020, at 21:22, Robert Haas  wrote:
> 
> On Thu, Sep 24, 2020 at 1:57 PM Peter Eisentraut
>  wrote:
>> Depends on what one considers to be covered by FIPS.  The entire rest of
>> SCRAM is custom code, so running it on top of the world's greatest
>> SHA-256 implementation isn't going to make the end product any more
>> trustworthy.
> 
> I mean, the issue here, as is so often the case, is not what is
> actually more secure, but what meets the terms of some security
> standard.

Correct, IIUC in order to be FIPS compliant all cryptographic modules used must
be FIPS certified.

> At least in the US, FIPS 140-2 compliance is a reasonably
> common need, so if we can make it easier for people who have that need
> to be compliant, they are more likely to use PostgreSQL, which seems
> like something that we should want.

The proposed patch makes SCRAM+FIPS work for 14; the question is whether we
need/want to try and address v10-13.

cheers ./daniel



Re: fixing old_snapshot_threshold's time->xid mapping

2020-09-24 Thread Robert Haas
On Wed, Sep 23, 2020 at 9:16 PM Thomas Munro  wrote:
> LGTM.

Committed.

Thomas, with respect to your part of this patch set, I wonder if we
can make the functions that you're using to write tests safe enough
that we could add them to contrib/old_snapshot and let users run them
if they want. As you have them, they are hedged around with vague and
scary warnings, but is that really justified? And if so, can it be
fixed? It would be nicer not to end up with two loadable modules here,
and maybe the right sorts of functions could even have some practical
use.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Custom options for building extensions with --with--llvm

2020-09-24 Thread Konstantin Knizhnik




On 24.09.2020 21:37, Andres Freund wrote:

Hi,

On 2020-09-24 19:15:22 +0300, Konstantin Knizhnik wrote:

In my extension I want to define some custom options for compiler.
I do it in the following way:

ifdef USE_DISK
CUSTOM_COPT += -DIMCS_DISK_SUPPORT
endif

Why aren't you adding it to PG_CPPFLAGS? That should work, and I think
that's what several contrib modules are using.


Thank you.
PG_CPPFLAGS works correctly.





Re: Inconsistent Japanese name order in v13 contributors list

2020-09-24 Thread Bruce Momjian
On Mon, Sep 21, 2020 at 10:31:31AM -0300, Alvaro Herrera wrote:
> On 2020-Sep-18, Bruce Momjian wrote:
> 
> > This thread from 2015 is the most comprehensive discussion I remember of
> > Japanese name ordering, including a suggestion to use small caps:
> > 
> > 
> > https://www.postgresql.org/message-id/flat/20150613231826.GY133018%40postgresql.org#88d245a5cdd2b32e1e3e80fc07eab6f2
> > 
> > I have been following this guidance ever since.
> 
> Right.
> 
> About smallcaps, we didn't do it then because there was no way known to
> us to use them in our then-current toolchain.  But we've changed now to
> XML and apparently it is possible to use them -- we could try something
> like <emphasis role="caps_lastname">lastname</emphasis> and define a CSS rule like
> 
> .caps_lastname {font-variant: small-caps;}

Yes, that's what I use for CSS.

> (Apparently you also need to set 'emphasis.propagates.style' to 1 in the
> stylesheet).  This does it for HTML.  You also need some FO trick to
> cover the PDF (probably something like what a042750646db did to change
> catalog_table_entry formatting.)

Yeah, we have PDF output to worry about too.  It is easy in LaTeX.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: [PATCH] Automatic HASH and LIST partition creation

2020-09-24 Thread Anastasia Lubennikova

On 24.09.2020 06:27, Michael Paquier wrote:

On Mon, Sep 14, 2020 at 02:38:56PM +0300, Anastasia Lubennikova wrote:

Fixed. This was also caught by cfbot. This version should pass it clean.

Please note that regression tests are failing, because of 6b2c4e59.
--
Michael


Thank you. Updated patch is attached.

Open issues for review:
- new syntax;
- generation of partition names;
- overall patch review and testing, especially with complex partitioning 
clauses.


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

commit faa5805b839effd9d8220eff787fb0995276c370
Author: anastasia 
Date:   Mon Sep 14 11:34:42 2020 +0300

Auto generated HASH and LIST partitions.

New syntax:

CREATE TABLE tbl_hash (i int) PARTITION BY HASH (i)
CONFIGURATION (modulus 3);

CREATE TABLE tbl_list (i int) PARTITION BY LIST (i)
CONFIGURATION (values in (1, 2), (3, 4) DEFAULT PARTITION tbl_default);

With documentation draft.

diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index 087cad184c..ff9a7eda09 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -29,6 +29,7 @@ CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [ IF NOT EXI
 ] )
 [ INHERITS ( parent_table [, ... ] ) ]
 [ PARTITION BY { RANGE | LIST | HASH } ( { column_name | ( expression ) } [ COLLATE collation ] [ opclass ] [, ... ] ) ]
+[ CONFIGURATION ( partition_bound_auto_spec ) ]
 [ USING method ]
 [ WITH ( storage_parameter [= value] [, ... ] ) | WITHOUT OIDS ]
 [ ON COMMIT { PRESERVE ROWS | DELETE ROWS | DROP } ]
@@ -41,6 +42,7 @@ CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [ IF NOT EXI
 [, ... ]
 ) ]
 [ PARTITION BY { RANGE | LIST | HASH } ( { column_name | ( expression ) } [ COLLATE collation ] [ opclass ] [, ... ] ) ]
+[ CONFIGURATION ( partition_bound_auto_spec ) ]
 [ USING method ]
 [ WITH ( storage_parameter [= value] [, ... ] ) | WITHOUT OIDS ]
 [ ON COMMIT { PRESERVE ROWS | DELETE ROWS | DROP } ]
@@ -53,6 +55,7 @@ CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [ IF NOT EXI
 [, ... ]
 ) ] { FOR VALUES partition_bound_spec | DEFAULT }
 [ PARTITION BY { RANGE | LIST | HASH } ( { column_name | ( expression ) } [ COLLATE collation ] [ opclass ] [, ... ] ) ]
+[ CONFIGURATION ( partition_bound_auto_spec ) ]
 [ USING method ]
 [ WITH ( storage_parameter [= value] [, ... ] ) | WITHOUT OIDS ]
 [ ON COMMIT { PRESERVE ROWS | DELETE ROWS | DROP } ]
@@ -96,6 +99,11 @@ FROM ( { partition_bound_expr | MIN
   TO ( { partition_bound_expr | MINVALUE | MAXVALUE } [, ...] ) |
 WITH ( MODULUS numeric_literal, REMAINDER numeric_literal )
 
+and partition_bound_auto_spec is:
+
+VALUES IN ( partition_bound_expr [, ...] ), [( partition_bound_expr [, ...] )] [, ...] [DEFAULT PARTITION defailt_part_name]
+MODULUS numeric_literal
+
 index_parameters in UNIQUE, PRIMARY KEY, and EXCLUDE constraints are:
 
 [ INCLUDE ( column_name [, ... ] ) ]
@@ -383,6 +391,11 @@ WITH ( MODULUS numeric_literal, REM
   however, you can define these constraints on individual partitions.
  
 
+ 
+  Range and list partitioning also support automatic creation of partitions
+  with an optional CONFIGURATION clause.
+
+
  
   See  for more discussion on table
   partitioning.
@@ -391,6 +404,38 @@ WITH ( MODULUS numeric_literal, REM
 

 
+   
+CONFIGURATION ( partition_bound_auto_spec ) ] 
+
+ 
+  The optional CONFIGURATION clause used together
+  with PARTITION BY specifies a rule of generating bounds
+  for partitions of the partitioned table. All partitions are created automatically
+  along with the parent table.
+  
+  Any indexes, constraints and user-defined row-level triggers that exist
+  in the parent table are cloned on the new partitions.
+ 
+
+ 
+  The partition_bound_auto_spec
+  must correspond to the partitioning method and partition key of the
+  parent table, and must not overlap with any existing partition of that
+  parent.  The form with VALUES IN is used for list partitioning 
+  and the form with MODULUS is used for hash partitioning.
+  List partitioning can also provide a default partition using 
+  DEFAULT PARTITION.
+ 
+
+ 
+ Automatic range partitioning is not supported yet.
+ 
+
+
+
+
+   
+

 PARTITION OF parent_table { FOR VALUES partition_bound_spec | DEFAULT }
 
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 0409a40b82..6893fa5495 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4628,6 +4628,7 @@ _copyPartitionSpec(const PartitionSpec *from)
 
 	COPY_STRING_FIELD(strategy);
 	COPY_NODE_FIELD(partParams);
+	COPY_NODE_FIELD(autopart);
 	COPY_LOCATION_FIELD(location);
 
 	return newnode;
@@ -4650,6 +465

Re: WIP: BRIN multi-range indexes

2020-09-24 Thread John Naylor
On Mon, Sep 21, 2020 at 3:56 PM Tomas Vondra
 wrote:
>
> On Mon, Sep 21, 2020 at 01:42:34PM -0400, John Naylor wrote:

> >While playing around with the numbers I had an epiphany: At the
> >defaults, the filter already takes up ~4.3kB, over half the page.
> >There is no room for another tuple, so if we're only indexing one
> >column, we might as well take up the whole page.
>
> Hmm, yeah. I may be wrong but IIRC indexes don't support external
> storage but compression is still allowed. So even if those defaults are
> a bit higher than needed that should make the bloom filters a bit more
> compressible, and thus fit multiple BRIN tuples on a single page.

> Not sure about how much we want to rely on these optimizations, though,
> considering multi-column indexes kinda break this.

Yeah. Okay, then it sounds like we should go in the other direction,
as the block comment at the top of brin_bloom.c implies. Indexes with
multiple bloom-indexed columns already don't fit in one 8kB page, so I
think every documented example should have a much lower
pages_per_range. Using 32 pages per range with max tuples gives n =
928. With default p, that's about 1.1 kB per brin tuple, so one brin
page can index 224 pages, much more than with the default 128.
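
(Concretely, the documented examples would then look something like this
sketch; the opclass name is assumed from the WIP patch and may well change,
and the table is made up:)

CREATE TABLE t (x int);
CREATE INDEX ON t USING brin (x int4_bloom_ops) WITH (pages_per_range = 32);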

Hmm, how ugly would it be to change the default range size depending
on the opclass?

If indexes don't support external storage, that sounds like a pain to
add. Also, with very small fpr, you can easily get into many megabytes
of filter space, which kind of defeats the purpose of brin in the
first place.

There is already this item from the brin readme:

* Different-size page ranges?
  In the current design, each "index entry" in a BRIN index covers the same
  number of pages.  There's no hard reason for this; it might make sense to
  allow the index to self-tune so that some index entries cover smaller page
  ranges, if this allows the summary values to be more compact.  This
would incur
  larger BRIN overhead for the index itself, but might allow better pruning of
  page ranges during scan.  In the limit of one index tuple per page, the index
  itself would occupy too much space, even though we would be able to skip
  reading the most heap pages, because the summary values are tight; in the
  opposite limit of a single tuple that summarizes the whole table, we wouldn't
  be able to prune anything even though the index is very small.  This can
  probably be made to work by using the range map as an index in itself.

This sounds like a lot of work, but would be robust.

Anyway, given that this is a general problem and not specific to the
prime partition algorithm, I'll leave that out of the attached patch,
named as a .txt to avoid confusing the cfbot.

> >We could also generate the primes via a sieve instead, which is really
> >fast (and more code). That would be more robust, but that would require
> >the filter to store the actual primes used, so 20 more bytes at max k =
> >10. We could hard-code space for that, or to keep from hard-coding
> >maximum k and thus lowest possible false positive rate, we'd need more
> >bookkeeping.
> >
>
> I don't think the efficiency of this code matters too much - it's only
> used once when creating the index, so the simpler the better. Certainly
> for now, while testing the partitioning approach.

To check my understanding, isn't bloom_init() called for every tuple?
Agreed on simplicity so done this way.

> >So, with the partition approach, we'd likely have to set in stone
> >either max nbits, or min target false positive rate. The latter option
> >seems more principled, not only for the block size, but also since the
> >target fp rate is already fixed by the reloption, and as I've
> >demonstrated, we can often go above and beyond the reloption even
> >without changing k.
> >
>
> That seems like a rather annoying limitation, TBH.

I don't think the latter is that bad. I've capped k at 10 for
demonstration's sake:

(928 is from using 32 pages per range)

  n   k   m      p
928   7  8895   0.01
928  10  13343  0.001  (lowest p supported in patch set)
928  13  17790  0.0001
928  10  18280  0.0001 (still works with lower k, needs higher m)
928  10  17790  0.00012 (even keeping m from row #3, capping k doesn't
degrade p much)
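
(For reference, these m and k values are just the textbook Bloom filter
sizing formulas, nothing specific to this patch:

    m = \frac{-n \ln p}{(\ln 2)^2}, \qquad k = \frac{m}{n} \ln 2

e.g. n = 928, p = 0.01 gives m ≈ 8895 bits and k ≈ 6.6, rounded up to 7.)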

Also, k seems pretty robust against small changes as long as m isn't
artificially constrained and as long as p is small.

So I *think* it's okay to cap k at 10 or 12, and not bother with
adjusting m, which worsens space issues. As I found before, lowering k
raises target fpr, but seems more robust to overshooting ndistinct. In
any case, we only need k * 2 bytes to store the partition lengths.

The only way I see to avoid any limitation is to make the array of
primes variable length, which could be done by putting the filter
offset calculation into a macro. But having two variable-length arrays
seems messy.

> >Hmm, I'm not sure I understand you. I can see not caring to trim wasted
> >bits, but we can't set/read off the end

Re: Add session statistics to pg_stat_database

2020-09-24 Thread Soumyadeep Chakraborty
Hello Laurenz,

Thanks for submitting this! Please find my feedback below.

* Are we trying to capture ONLY client-initiated disconnects in
m_aborted (we are not handling other kinds of disconnects, since we don't
account for EOF, e.g. if psql was killed)? If yes, why?

* pgstat_send_connstats(): How about renaming the "force" argument to
"disconnected"?

*
> static TimestampTz pgStatActiveStart = DT_NOBEGIN;
> static PgStat_Counter pgStatActiveTime = 0;
> static TimestampTz pgStatTransactionIdleStart = DT_NOBEGIN;
> static PgStat_Counter pgStatTransactionIdleTime = 0;
> static bool pgStatSessionReported = false;
> bool pgStatSessionDisconnected = false;

I think we can house all of these globals inside PgBackendStatus and can
follow the protocol for reading/writing fields in PgBackendStatus.
Refer: PGSTAT_{BEGIN|END}_WRITE_ACTIVITY

Also, some of these fields are not required:

I don't think we need pgStatActiveStart and pgStatTransactionIdleStart -
instead of these two we could use
PgBackendStatus.st_state_start_timestamp which marks the beginning TS of
the backend's current state (st_state). We can look at that field along
with the current and to-be-transitioned-to states inside
pgstat_report_activity() when there is a transition away from
STATE_RUNNING, STATE_IDLEINTRANSACTION or
STATE_IDLEINTRANSACTION_ABORTED, in order to update pgStatActiveTime and
pgStatTransactionIdleTime. We would also need to update those counters
on disconnect/PGSTAT_STAT_INTERVAL timeout if the backend's current
state was STATE_RUNNING, STATE_IDLEINTRANSACTION or
STATE_IDLEINTRANSACTION_ABORTED (in pgstat_send_connstats())

pgStatSessionDisconnected is not required, as whether a session has been
disconnected can be determined by looking at the force argument to
pgstat_report_stat() [unless we want to distinguish client-initiated
disconnects from other kinds, though I am not sure why we would, as I
brought up above].

pgStatSessionReported is not required. We can glean this information by
checking if the function local static last_report in
pgstat_report_stat() is 0 and passing this on as another param
"first_report" to pgstat_send_connstats().


* PGSTAT_FILE_FORMAT_ID needs to be updated when a stats collector data
structure changes and we had a change in PgStat_StatDBEntry.

* We can directly use PgBackendStatus.st_proc_start_timestamp for
calculating m_session_time. We can also choose to report session uptime
even when the report is for the not-disconnect case
(PGSTAT_STAT_INTERVAL elapsed). No reason why not. Then we would need to
pass in the value of last_report to pgstat_send_connstats() and calculate
m_session_time as the number of time units from
PgBackendStatus.st_proc_start_timestamp for the first report, and then the
number of time units since last_report for all subsequent reports.

* We would need to bump the catalog version since we have made
changes to system views. Refer: #define CATALOG_VERSION_NO


Regards,
Soumyadeep (VMware)




RE: Transactions involving multiple postgres foreign servers, take 2

2020-09-24 Thread tsunakawa.ta...@fujitsu.com
From: Ashutosh Bapat 
> The way I am looking at is to put the parallelism in the resolution
> worker and not in the FDW. If we use multiple resolution workers, they
> can fire commit/abort on multiple foreign servers at a time.

From a single session's view, yes.  However, the requests from multiple 
sessions are processed one at a time within each resolver, because the resolver 
has to call the synchronous FDW prepare/commit routines and wait for the 
response from the remote server.  That's too limiting.


> But if we want parallelism within a single resolution worker, we will
> need a separate FDW APIs for firing asynchronous commit/abort prepared
> txn and fetching their results resp. But given the variety of FDWs,
> not all of them will support asynchronous API, so we have to support
> synchronous API anyway, which is what can be targeted in the first
> version.

I agree in that most FDWs will be unlikely to have asynchronous prepare/commit 
functions, as demonstrated by the fact that even Oracle and Db2 don't implement 
XA asynchronous APIs.  That's one problem of using FDW for Postgres scale-out.  
When we enhance FDW, we have to take care of other DBMSs to make the FDW 
interface practical.  OTOH, we want to make maximum use of Postgres features, 
such as libpq asynchronous API, to make Postgres scale-out as performant as 
possible.  But the scale-out design is bound by the FDW interface.  I don't 
feel that accepting such a less performant design fits the attitude of this 
community, as people here are strict about even a 1 or 2 percent performance drop.


> Thinking more about it, the core may support an API which accepts a
> list of prepared transactions, their foreign servers and user mappings
> and let FDW resolve all those either in parallel or one by one. So
> parallelism is responsibility of FDW and not the core. But then we
> loose parallelism across FDWs, which may not be a common case.

Hmm, I understand asynchronous FDW relation scan is being developed now, in the 
form of cooperation between the FDW and the executor.  If we make just the FDW 
responsible for prepare/commit parallelism, the design becomes asymmetric.  As 
you say, I'm not sure if the parallelism is wanted among different types, say, 
Postgres and Oracle.  In fact, major DBMSs don't implement XA asynchronous API. 
 But such lack of parallelism may be one cause of the bad reputation that 2PC 
(of XA) is slow.


> Given the complications around this, I think we should go ahead
> supporting synchronous API first and in second version introduce
> optional asynchronous API.

How about the following?

* Add synchronous and asynchronous versions of prepare/commit/abort routines 
and a routine to wait for completion of asynchronous execution in FdwRoutine.  
They are optional.
postgres_fdw can implement the asynchronous routines using libpq asynchronous 
functions.  Other DBMSs can implement XA asynchronous API for them in theory.

* The client backend uses asynchronous FDW routines if available:

/* Issue asynchronous prepare | commit | rollback to FDWs that support it */
foreach (per each foreign server used in the transaction)
{
if (fdwroutine->{prepare | commit | rollback}_async_func)
fdwroutine->{prepare | commit | rollback}_async_func(...);
}

/* Wait for completion of asynchronous prepare | commit | rollback */
foreach (per each foreign server used in the transaction)
{
if (fdwroutine->{prepare | commit | rollback}_async_func)
ret = fdwroutine->wait_for_completion(...);
}

/* Issue synchronous prepare | commit | rollback to FDWs that don't support it 
*/
foreach (per each foreign server used in the transaction)
{
if (fdwroutine->{prepare | commit | rollback}_async_func == NULL)
ret = fdwroutine->{prepare | commit | rollback}_func(...);
}

* The client backend asks the resolver to commit or rollback the remote 
transaction only when the remote transaction fails (due to the failure of 
remote server or network.)  That is, the resolver is not involved during normal 
operation.


This will not be complex, and can be included in the first version, if we 
really want to use FDW for Postgres scale-out.


Regards
Takayuki Tsunakawa



Re: Fix inconsistency in jsonpath .datetime()

2020-09-24 Thread Alexander Korotkov
On Sun, Sep 20, 2020 at 2:23 AM Nikita Glukhov  wrote:
> The beta-tester of PG13 reported an inconsistency in our current jsonpath
> datetime() method implementation.  By the standard, format strings in
> datetime() allow only the characters "-./,':; " to be used as separators
> in format strings.
> But our to_json[b]() serializes timestamps into XSD format with "T" separator
> between date and time, so the serialized data cannot be parsed back by 
> jsonpath
> and it looks inconsistent:
>
> =# SELECT to_jsonb('2020-09-19 23:45:06'::timestamp);
>to_jsonb
> ---
>  "2020-09-19T23:45:06"
> (1 row)
>
> =# SELECT jsonb_path_query(to_jsonb('2020-09-19 23:45:06'::timestamp),
>'$.datetime()');
> ERROR:  datetime format is not recognized: "2020-09-19T23:45:06"
> HINT:  Use a datetime template argument to specify the input data format.
>
> =# SELECT jsonb_path_query(to_jsonb('2020-09-19 23:45:06'::timestamp),
>'$.datetime("yyyy-mm-dd HH:MI:SS")');
> ERROR:  unmatched format separator " "
>
> =# SELECT jsonb_path_query(to_jsonb('2020-09-19 23:45:06'::timestamp),
>'$.datetime("yyyy-mm-dd\"T\"HH:MI:SS")');
> ERROR:  invalid datetime format separator: """
>
>
>
> Excerpt from the SQL-2016 standard (5.3 , page 197):
>
>  ::=
> 
>
>  ::=
>[  ]
>
>  ::=
>  
>
>
>
> Attached patch #2 tries to fix this problem by enabling escaped characters in
> standard mode.  I'm not sure is it better to enable the whole set of text
> separators or only the problematic "T" character, allow only quoted text
> separators or not.
>
> Patch #1 is a more simple fix (so it comes first) removing excess space 
> between
> time and timezone fields in built-in format strings used for datetime type
> recognition.  (It seemed to work as expected with extra space in earlier
> version of the patch in which standard mode has not yet been introduced).

Jsonpath .datetime() was developed as an implementation of
corresponding parts of SQL Standard.  Patch #1 fixes inconsistency
between our implementation and Standard.  I'm going to backpatch it to
v13.

There is also an inconsistency between to_json[b]() and jsonpath
.datetime().  In this case, I wouldn't say the problem is on the
jsonpath side.  to_json[b]() makes special exceptions for datetime
types and converts them not using standard output function, but using
javascript-compatible format (see f30015b6d7).  Luckily, our input
function for timestamp[tz] datatypes doesn't use strict format
parsing, so it can work with output of to_json[b]().  But according to
SQL Standard, jsonpath .datetime() implements strict format parsing,
so it can't work with output of to_json[b]().  So, I wouldn't say in
this case it's an inconsistency in the jsonpath .datetime() method.
But, given now it's not an appropriate time for redesigning
to_json[b](), we should probably improve jsonpath .datetime() method
to understand more formats.
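
(A quick illustration of that leniency, runnable as-is: the timestamp input
function happily accepts the "T" separator that to_jsonb() emits, while the
strict .datetime() parsing quoted above does not:)

SELECT '2020-09-19T23:45:06'::timestamp;   -- accepted: 2020-09-19 23:45:06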

So, patch #2 is probably acceptable, and even might be backpatched
v13.  One thing I don't particularly like is "In standard mode format
string characters are strictly matched or matched to spaces."
Instead, I would like to just strictly match characters and just add
more options to fmt_str[].

Other opinions?

--
Regards,
Alexander Korotkov




Re: history file on replica and double switchover

2020-09-24 Thread David Zhang

Hi,

My understanding is that the "archiver" won't even start if 
"archive_mode = on" has been set on a "replica". Therefore, either 
(XLogArchiveMode == ARCHIVE_MODE_ALWAYS) or (XLogArchiveMode != 
ARCHIVE_MODE_OFF) will produce the same result.


Please see how the "archiver" is started in 
src/backend/postmaster/postmaster.c


5227 /*
5228  * Start the archiver if we're responsible for 
(re-)archiving received

5229  * files.
5230  */
5231 Assert(PgArchPID == 0);
5232 if (XLogArchivingAlways())
5233 PgArchPID = pgarch_start();

I did run the nice script "double_switchover.sh" using either "always" 
or "on" on patch v1 and v2. They all generate the same results below. In 
other words, whether the history file (0003.history) is archived or not 
depends on the "archive_mode" setting.


echo "archive_mode = always" >> ${PGDATA_NODE2}/postgresql.auto.conf

$ ls -l archive
-rw--- 1 david david 16777216 Sep 24 14:40 00010002
... ...
-rw--- 1 david david 16777216 Sep 24 14:40 0002000A
-rw--- 1 david david   41 Sep 24 14:40 0002.history
-rw--- 1 david david   83 Sep 24 14:40 0003.history

echo "archive_mode = on" >> ${PGDATA_NODE2}/postgresql.auto.conf

$ ls -l archive
-rw--- 1 david david 16777216 Sep 24 14:47 00010002
... ...
-rw--- 1 david david 16777216 Sep 24 14:47 0002000A
-rw--- 1 david david   41 Sep 24 14:47 0002.history


Personally, I prefer patch v2 since it aligns with the statement here, 
https://www.postgresql.org/docs/13/warm-standby.html#CONTINUOUS-ARCHIVING-IN-STANDBY


"If archive_mode is set to on, the archiver is not enabled during 
recovery or standby mode. If the standby server is promoted, it will 
start archiving after the promotion, but will not archive any WAL it did 
not generate itself."


By the way, I think the last part of the sentence should be changed to 
something like below:


"but will not archive any WAL, which was not generated by itself."


Best regards,

David

On 2020-09-17 10:18 a.m., Fujii Masao wrote:



On 2020/09/04 13:53, Fujii Masao wrote:



On 2020/09/04 8:29, Anastasia Lubennikova wrote:

On 27.08.2020 16:02, Grigory Smolkin wrote:

Hello!

I`ve noticed, that when running switchover replica to master and 
back to replica, new history file is streamed to replica, but not 
archived,
which is not great, because it breaks PITR if archiving is running 
on replica. The fix looks trivial.

Bash script to reproduce the problem and patch are attached.


Thanks for the report. I agree that it looks like a bug.


+1

+    /* Mark history file as ready for archiving */
+    if (XLogArchiveMode != ARCHIVE_MODE_OFF)
+    XLogArchiveNotify(fname);

I agree that history file should be archived in the standby when
archive_mode=always. But why do we need to do when archive_mode=on?
I'm just concerned about the case where the primary and standby
have the shared archive area, and archive_mode is on.


So I updated the patch so that walreceiver marks the streamed history 
file

as ready for archiving only when archive_mode=always. Patch attached.
Thought?

Regards,


--
David

Software Engineer
Highgo Software Inc. (Canada)
www.highgo.ca





Re: Handing off SLRU fsyncs to the checkpointer

2020-09-24 Thread Thomas Munro
On Wed, Sep 23, 2020 at 1:56 PM Thomas Munro  wrote:
> As for the ShutdownXXX() functions, I haven't yet come up with any
> reason for this code to exist.  Emboldened by a colleague's inability
> to explain to me what that code is doing for us, here is a new version
> that just rips it all out.

Rebased.

Tom, do you have any thoughts on ShutdownCLOG() etc?
From 07196368e6cd36df5910b536e93570c9acff4ff2 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 25 Sep 2020 11:16:03 +1200
Subject: [PATCH v7] Defer flushing of SLRU files.

Previously, we called fsync() after writing out individual pg_xact,
pg_multixact and pg_commit_ts pages due to cache pressure, leading to
regular I/O stalls in user backends and recovery.  Collapse requests for
the same file into a single system call as part of the next checkpoint,
as we already did for relation files, using the infrastructure developed
by commit 3eb77eba.  This causes a significant improvement to recovery
performance.

Remove the Shutdown{CLOG,CommitTS,SUBTRANS,MultiXact}() functions,
because they were redundant after the shutdown checkpoint that
immediately precedes them.

Hoist ProcessSyncRequests() up into CheckPointGuts() to make it clearer
that it applies to all the SLRU mini-buffer-pools as well as the main
buffer pool.  Also rearrange things so that data collected in
CheckpointStats includes SLRU activity.

Tested-by: Jakub Wartak 
Discussion: https://postgr.es/m/CA+hUKGLJ=84yt+nvhkeedauutvhmfq9i-n7k_o50jmq6rpj...@mail.gmail.com
---
 src/backend/access/transam/clog.c  |  36 +++---
 src/backend/access/transam/commit_ts.c |  32 +++--
 src/backend/access/transam/multixact.c |  53 +
 src/backend/access/transam/slru.c  | 154 +
 src/backend/access/transam/subtrans.c  |  25 +---
 src/backend/access/transam/xlog.c  |  18 ++-
 src/backend/commands/async.c   |   5 +-
 src/backend/storage/buffer/bufmgr.c|   7 --
 src/backend/storage/lmgr/predicate.c   |   8 +-
 src/backend/storage/sync/sync.c|  28 -
 src/include/access/clog.h  |   3 +
 src/include/access/commit_ts.h |   3 +
 src/include/access/multixact.h |   4 +
 src/include/access/slru.h  |  14 ++-
 src/include/storage/sync.h |   7 +-
 src/tools/pgindent/typedefs.list   |   1 +
 16 files changed, 233 insertions(+), 165 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 9e352d2658..c4c20f508b 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -42,6 +42,7 @@
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "storage/proc.h"
+#include "storage/sync.h"
 
 /*
  * Defines for CLOG page sizes.  A page is the same BLCKSZ as is used
@@ -691,7 +692,8 @@ CLOGShmemInit(void)
 {
 	XactCtl->PagePrecedes = CLOGPagePrecedes;
 	SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
-  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER);
+  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
+  SYNC_HANDLER_CLOG);
 }
 
 /*
@@ -808,34 +810,15 @@ TrimCLOG(void)
 	LWLockRelease(XactSLRULock);
 }
 
-/*
- * This must be called ONCE during postmaster or standalone-backend shutdown
- */
-void
-ShutdownCLOG(void)
-{
-	/* Flush dirty CLOG pages to disk */
-	TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(false);
-	SimpleLruFlush(XactCtl, false);
-
-	/*
-	 * fsync pg_xact to ensure that any files flushed previously are durably
-	 * on disk.
-	 */
-	fsync_fname("pg_xact", true);
-
-	TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(false);
-}
-
 /*
  * Perform a checkpoint --- either during shutdown, or on-the-fly
  */
 void
 CheckPointCLOG(void)
 {
-	/* Flush dirty CLOG pages to disk */
+	/* Write dirty CLOG pages to disk */
 	TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(true);
-	SimpleLruFlush(XactCtl, true);
+	SimpleLruWriteAll(XactCtl, true);
 	TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true);
 }
 
@@ -1026,3 +1009,12 @@ clog_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "clog_redo: unknown op code %u", info);
 }
+
+/*
+ * Entrypoint for sync.c to sync clog files.
+ */
+int
+clogsyncfiletag(const FileTag *ftag, char *path)
+{
+	return SlruSyncFileTag(XactCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index f6a7329ba3..af5d9baae5 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -555,7 +555,8 @@ CommitTsShmemInit(void)
 	CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
 	SimpleLruInit(CommitTsCtl, "CommitTs", CommitTsShmemBuffers(), 0,
   CommitTsSLRULock, "pg_commit_ts",
-  LWTRANCHE_COMMITTS_BUFFER);
+  LWTRANCHE_COMMITTS_BUFFER,
+  SYNC_HANDLER_COMMIT_TS);
 
 	commitTsShared = ShmemInitStruct("CommitTs shared",
 	 sizeof(CommitTimestampShared),
@@ -798,30 +799,14 @@ DeactivateCommitTs(void)
 	LWLockRelease(CommitTsSLRULock);
 }
 
-/*
- * This must be called ONCE during postmaster or st

Re: WIP: BRIN multi-range indexes

2020-09-24 Thread Tomas Vondra

On Thu, Sep 24, 2020 at 05:18:03PM -0400, John Naylor wrote:

On Mon, Sep 21, 2020 at 3:56 PM Tomas Vondra
 wrote:


On Mon, Sep 21, 2020 at 01:42:34PM -0400, John Naylor wrote:



>While playing around with the numbers I had an epiphany: At the
>defaults, the filter already takes up ~4.3kB, over half the page.
>There is no room for another tuple, so if we're only indexing one
>column, we might as well take up the whole page.

Hmm, yeah. I may be wrong but IIRC indexes don't support external
storage but compression is still allowed. So even if those defaults are
a bit higher than needed that should make the bloom filters a bit more
compressible, and thus fit multiple BRIN tuples on a single page.



Not sure about how much we want to rely on these optimizations, though,
considering multi-column indexes kinda break this.


Yeah. Okay, then it sounds like we should go in the other direction,
as the block comment at the top of brin_bloom.c implies. Indexes with
multiple bloom-indexed columns already don't fit in one 8kB page, so I
think every documented example should have a much lower
pages_per_range. Using 32 pages per range with max tuples gives n =
928. With default p, that's about 1.1 kB per brin tuple, so one brin
page can index 224 pages, much more than with the default 128.

Hmm, how ugly would it be to change the default range size depending
on the opclass?



Not sure. What would happen for multi-column BRIN indexes with different
opclasses?


If indexes don't support external storage, that sounds like a pain to
add. Also, with very small fpr, you can easily get into many megabytes
of filter space, which kind of defeats the purpose of brin in the
first place.



True.


There is already this item from the brin readme:

* Different-size page ranges?
 In the current design, each "index entry" in a BRIN index covers the same
 number of pages.  There's no hard reason for this; it might make sense to
 allow the index to self-tune so that some index entries cover smaller page
 ranges, if this allows the summary values to be more compact.  This
would incur
 larger BRIN overhead for the index itself, but might allow better pruning of
 page ranges during scan.  In the limit of one index tuple per page, the index
 itself would occupy too much space, even though we would be able to skip
 reading the most heap pages, because the summary values are tight; in the
 opposite limit of a single tuple that summarizes the whole table, we wouldn't
 be able to prune anything even though the index is very small.  This can
 probably be made to work by using the range map as an index in itself.

This sounds like a lot of work, but would be robust.



Yeah. I think it's a fairly independent / orthogonal project.


Anyway, given that this is a general problem and not specific to the
prime partition algorithm, I'll leave that out of the attached patch,
named as a .txt to avoid confusing the cfbot.


>We could also generate the primes via a sieve instead, which is really
>fast (and more code). That would be more robust, but that would require
>the filter to store the actual primes used, so 20 more bytes at max k =
>10. We could hard-code space for that, or to keep from hard-coding
>maximum k and thus lowest possible false positive rate, we'd need more
>bookkeeping.
>

I don't think the efficiency of this code matters too much - it's only
used once when creating the index, so the simpler the better. Certainly
for now, while testing the partitioning approach.


To check my understanding: isn't bloom_init() called for every tuple?
Agreed on simplicity, so I've done it that way.



No, it's only called for the first non-NULL value in the page range
(unless I made a boo boo when writing that code).


>So, with the partition approach, we'd likely have to set in stone
>either max nbits, or min target false positive rate. The latter option
>seems more principled, not only for the block size, but also since the
>target fp rate is already fixed by the reloption, and as I've
>demonstrated, we can often go above and beyond the reloption even
>without changing k.
>

That seems like a rather annoying limitation, TBH.


I don't think the latter is that bad. I've capped k at 10 for
demonstration's sake:

(928 is from using 32 pages per range)

n     k   m      p
928    7   8895  0.01
928   10  13343  0.001   (lowest p supported in patch set)
928   13  17790  0.0001
928   10  18280  0.0001  (still works with lower k, needs higher m)
928   10  17790  0.00012 (even keeping m from row #3, capping k doesn't degrade p much)
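
For reference, a minimal standalone sketch in plain C, using only the
textbook Bloom sizing formulas (not the brin_bloom.c code itself), that
roughly reproduces the m and k values in this table:

#include <math.h>
#include <stdio.h>

/*
 * Textbook Bloom filter sizing:
 *   m = -n * ln(p) / (ln 2)^2     total filter bits
 *   k = round((m / n) * ln 2)     number of hash functions
 */
static void
bloom_sizing(int n, double p)
{
	double		m = ceil(-n * log(p) / pow(log(2), 2));
	int			k = (int) round((m / n) * log(2));

	printf("n = %d, p = %g -> m = %.0f bits (~%.1f kB), k = %d\n",
		   n, p, m, m / 8.0 / 1024.0, k);
}

int
main(void)
{
	bloom_sizing(928, 0.01);	/* ~8895 bits (~1.1 kB), k = 7 */
	bloom_sizing(928, 0.001);	/* ~13343 bits, k = 10 */
	bloom_sizing(928, 0.0001);	/* ~17790 bits, k = 13 */
	return 0;
}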

Also, k seems pretty robust against small changes as long as m isn't
artificially constrained and as long as p is small.

So I *think* it's okay to cap k at 10 or 12, and not bother with
adjusting m, which worsens space issues. As I found before, lowering k
raises target fpr, but seems more robust to overshooting ndistinct. In
any case, we only need k * 2 bytes to store the partition lengths.

The only way I see to avoid any limitation is to m

Re: Refactor pg_rewind code and make it work against a standby

2020-09-24 Thread Soumyadeep Chakraborty
On Thu, Sep 24, 2020 at 10:27 AM Heikki Linnakangas  wrote:
> >> /*
> >>   * If this is a relation file, copy the modified blocks.
> >>   *
> >>   * This is in addition to any other changes.
> >>   */
> >> iter = datapagemap_iterate(&entry->target_modified_pages);
> >> while (datapagemap_next(iter, &blkno))
> >> {
> >> offset = blkno * BLCKSZ;
> >>
> >> source->queue_fetch_range(source, entry->path, offset, BLCKSZ);
> >> }
> >> pg_free(iter);
> >
> > Can we put this hunk into a static function overwrite_pages()?
>
> Meh, it's only about 10 lines of code, and one caller.

Fair.

>
> > 7. Please address the FIXME for the symlink case:
> > /* FIXME: Check if it points to the same target? */
>
> It's not a new issue. Would be nice to fix, of course. I'm not sure what
> the right thing to do would be. If you have e.g. replaced
> postgresql.conf with a symlink that points outside the data directory,
> would it be appropriate to overwrite it? Or perhaps we should throw an
> error? We also throw an error if a file is a symlink in the source but a
> regular file in the target, or vice versa.
>

Hmm, I can imagine a use case for 2 different symlink targets on the
source and target clusters. For example the primary's pg_wal directory
can have a different symlink target as compared to a standby's
(different mount points on the same network maybe?). An end user might
not want pg_rewind meddling with that setup, or might want pg_rewind to
treat the source as the source of truth here as well and overwrite the
target's symlink. Maybe doing a check and emitting a warning with a
hint/detail, while taking no action, is the prudent thing to do here.
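
A minimal sketch of that idea (not from the patch; source_target and
target_target are hypothetical strings assumed to have been filled in by
readlink() on each cluster, and pg_log_warning() is the frontend logging
routine pg_rewind already uses):

/*
 * Hedged sketch: if the symlink exists on both sides but points somewhere
 * different, warn and leave the target's link alone instead of overwriting
 * it or erroring out.
 */
static void
warn_on_symlink_mismatch(const char *path,
						 const char *source_target,
						 const char *target_target)
{
	if (strcmp(source_target, target_target) != 0)
		pg_log_warning("symbolic link \"%s\" points to \"%s\" in the source but to \"%s\" in the target, leaving it unchanged",
					   path, source_target, target_target);
}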


> > 8.
> >
> > * it anyway. But there's no harm in copying it now.)
> >
> > and
> >
> > * copy them here. But we don't know which scenario we're
> > * dealing with, and there's no harm in copying the missing
> > * blocks now, so do it now.
> >
> > Could you add a line or two explaining why there is "no harm" in these
> > two hunks above?
>
> The previous sentences explain that there's a WAL record covering them.
> So they will be overwritten by WAL replay anyway. Does it need more
> explanation?

Yeah you are right, that is reason enough.

> > 14. queue_overwrite_range(), finish_overwrite() instead of
> > queue_fetch_range(), finish_fetch()? Similarly update\
> > *_fetch_file_range() and *_finish_fetch()
> >
> >
> > 15. Let's have local_source.c and libpq_source.c instead of *_fetch.c
>
> Good idea! And fetch.h -> rewind_source.h.

+1. You might have missed the changes to rename "fetch" -> "overwrite"
as was mentioned in 14.

>
> I also moved the code in execute_file_actions() function to pg_rewind.c,
> into a new function: perform_rewind(). It felt a bit silly to have just
> execute_file_actions() in a file of its own. perform_rewind() now does
> all the modifications to the data directory, writing the backup file.
> Except for writing the recovery config: that also needs to be when
> there's no rewind to do, so it makes sense to keep it separate. What do
> you think?

I don't mind inlining that functionality into perform_rewind(). +1.
Heads up: The function declaration for execute_file_actions() is still
there in rewind_source.h.

> > 16.
> >
> >> conn = PQconnectdb(connstr_source);
> >>
> >> if (PQstatus(conn) == CONNECTION_BAD)
> >> pg_fatal("could not connect to server: %s",
> >> PQerrorMessage(conn));
> >>
> >> if (showprogress)
> >> pg_log_info("connected to server");
> >
> >
> > The above hunk should be part of init_libpq_source(). Consequently,
> > init_libpq_source() should take a connection string instead of a conn.
>
> The libpq connection is also needed by WriteRecoveryConfig(), that's why
> it's not fully encapsulated in libpq_source.

Ah. I find it pretty weird that we need to specify --source-server to
have write-recovery-conf work. From the code, we only need the conn
for calling PQserverVersion(), something we could easily get by slurping
pg_controldata on the source side. Maybe we can remove this limitation?

Regards,
Soumyadeep (VMware)




Re: history file on replica and double switchover

2020-09-24 Thread Fujii Masao




On 2020/09/25 8:15, David Zhang wrote:

Hi,

My understanding is that the "archiver" won't even start if "archive_mode = on" has been 
set on a "replica". Therefore, either (XLogArchiveMode == ARCHIVE_MODE_ALWAYS) or (XLogArchiveMode 
!= ARCHIVE_MODE_OFF) will produce the same result.


Yes, the archiver isn't invoked in the standby when archive_mode=on.
But, with the original patch, walreceiver creates .ready archive status file
even when archive_mode=on and no archiver is running. This causes
the history file to be archived after the standby is promoted and
the archiver is invoked.

With my patch, walreceiver creates .ready archive status for the history file
only when archive_mode=always, like it does for the streamed WAL files.
This is the difference between those two patches. To prevent the streamed
timeline history file from being archived, IMO we should use the condition
archive_mode==always in the walreceiver.
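
For the streamed WAL segments, walreceiver already does roughly the
following when a segment is closed; the idea is to apply the same test when
writing a streamed timeline history file (a hedged sketch, with histfname
standing in for the history file name):

	/*
	 * Create a .done file forcibly unless archive_mode = always, so that
	 * the streamed timeline history file is never archived by a standby
	 * running with archive_mode = on; with always, create .ready instead.
	 */
	if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
		XLogArchiveForceDone(histfname);
	else
		XLogArchiveNotify(histfname);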




Please see how the "archiver" is started in src/backend/postmaster/postmaster.c

    /*
     * Start the archiver if we're responsible for (re-)archiving received
     * files.
     */
    Assert(PgArchPID == 0);
    if (XLogArchivingAlways())
        PgArchPID = pgarch_start();

I did run the nice script "double_switchover.sh" using either "always" or "on" on patch 
v1 and v2. They all generate the same results below. In other words, whether history file (0003.history) is 
archived or not depends on "archive_mode" settings.

echo "archive_mode = always" >> ${PGDATA_NODE2}/postgresql.auto.conf

$ ls -l archive
-rw--- 1 david david 16777216 Sep 24 14:40 00010002
... ...
-rw--- 1 david david 16777216 Sep 24 14:40 0002000A
-rw--- 1 david david   41 Sep 24 14:40 0002.history
-rw--- 1 david david   83 Sep 24 14:40 0003.history

echo "archive_mode = on" >> ${PGDATA_NODE2}/postgresql.auto.conf

$ ls -l archive
-rw--- 1 david david 16777216 Sep 24 14:47 00010002
... ...
-rw--- 1 david david 16777216 Sep 24 14:47 0002000A
-rw--- 1 david david   41 Sep 24 14:47 0002.history


Personally, I prefer patch v2 since it aligns with the statement here, 
https://www.postgresql.org/docs/13/warm-standby.html#CONTINUOUS-ARCHIVING-IN-STANDBY

"If archive_mode is set to on, the archiver is not enabled during recovery or 
standby mode. If the standby server is promoted, it will start archiving after the 
promotion, but will not archive any WAL it did not generate itself."

By the way, I think the last part of the sentence should be changed to 
something like below:

"but will not archive any WAL, which was not generated by itself."


I'm not sure if this change is an improvement or not. But if we apply
the patch I proposed, maybe we should mention also history file here.
For example, "but will not archive any WAL or timeline history files
that it did not generate itself".

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Handing off SLRU fsyncs to the checkpointer

2020-09-24 Thread Tom Lane
Thomas Munro  writes:
> Tom, do you have any thoughts on ShutdownCLOG() etc?

Hm, if we cannot reach that without first completing a shutdown checkpoint,
it does seem a little pointless.

It'd likely be a good idea to add a comment to CheckPointCLOG et al
explaining that we expect $what-exactly to fsync the data we are writing
before the checkpoint is considered done.

regards, tom lane




Re: "cert" + clientcert=verify-ca in pg_hba.conf?

2020-09-24 Thread Kyotaro Horiguchi
At Thu, 24 Sep 2020 11:43:40 -0400, Bruce Momjian  wrote in 
> On Thu, Sep 24, 2020 at 12:44:01PM +0900, Michael Paquier wrote:
> > On Tue, Sep 01, 2020 at 10:27:03PM -0400, Bruce Momjian wrote:
> > > OK, good.  Let's wait a few days and I will then apply it for PG 14.
> > 
> > It has been a few days, and nothing has happened here.  I have not
> > looked at the patch in details, so I cannot say if that's fine or not,
> > but please note that the patch fails to apply per the CF bot.
> 
> I will handle it.

Thank you Bruce, Michael. This is a rebased version.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 2978479ada887284eae0ed36c8acf29f1a002feb Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Tue, 21 Jul 2020 23:01:27 +0900
Subject: [PATCH v2] Allow directory name for GUC ssl_crl_file and connection
 option sslcrl

X509_STORE_load_locations accepts a directory, which enables an
on-demand loading method in which only the relevant CRLs are
loaded.
---
 doc/src/sgml/config.sgml  | 10 ++--
 doc/src/sgml/libpq.sgml   |  8 ++--
 doc/src/sgml/runtime.sgml | 30 
 src/backend/libpq/be-secure-openssl.c | 12 -
 src/interfaces/libpq/fe-secure-openssl.c  | 12 -
 src/test/ssl/Makefile | 20 +++-
 src/test/ssl/ssl/both-cas-1.crt   | 46 +--
 src/test/ssl/ssl/both-cas-2.crt   | 46 +--
 src/test/ssl/ssl/client+client_ca.crt | 28 +--
 src/test/ssl/ssl/client-crldir/9bb9e3c3.r0| 11 +
 src/test/ssl/ssl/client-revoked.crt   | 14 +++---
 src/test/ssl/ssl/client.crl   | 16 +++
 src/test/ssl/ssl/client.crt   | 14 +++---
 src/test/ssl/ssl/client_ca.crt| 14 +++---
 .../ssl/ssl/root+client-crldir/9bb9e3c3.r0| 11 +
 .../ssl/ssl/root+client-crldir/a3d11bff.r0| 11 +
 src/test/ssl/ssl/root+client.crl  | 32 ++---
 src/test/ssl/ssl/root+client_ca.crt   | 32 ++---
 .../ssl/ssl/root+server-crldir/a3d11bff.r0| 11 +
 .../ssl/ssl/root+server-crldir/a836cc2d.r0| 11 +
 src/test/ssl/ssl/root+server.crl  | 32 ++---
 src/test/ssl/ssl/root+server_ca.crt   | 32 ++---
 src/test/ssl/ssl/root.crl | 16 +++
 src/test/ssl/ssl/root_ca.crt  | 18 
 src/test/ssl/ssl/server-cn-and-alt-names.crt  | 14 +++---
 src/test/ssl/ssl/server-cn-only.crt   | 14 +++---
 src/test/ssl/ssl/server-crldir/a836cc2d.r0| 11 +
 .../ssl/ssl/server-multiple-alt-names.crt | 14 +++---
 src/test/ssl/ssl/server-no-names.crt  | 16 +++
 src/test/ssl/ssl/server-revoked.crt   | 14 +++---
 src/test/ssl/ssl/server-single-alt-name.crt   | 16 +++
 src/test/ssl/ssl/server-ss.crt| 18 
 src/test/ssl/ssl/server.crl   | 16 +++
 src/test/ssl/ssl/server_ca.crt| 14 +++---
 src/test/ssl/t/001_ssltests.pl| 31 -
 src/test/ssl/t/SSLServer.pm   |  5 +-
 36 files changed, 418 insertions(+), 252 deletions(-)
 create mode 100644 src/test/ssl/ssl/client-crldir/9bb9e3c3.r0
 create mode 100644 src/test/ssl/ssl/root+client-crldir/9bb9e3c3.r0
 create mode 100644 src/test/ssl/ssl/root+client-crldir/a3d11bff.r0
 create mode 100644 src/test/ssl/ssl/root+server-crldir/a3d11bff.r0
 create mode 100644 src/test/ssl/ssl/root+server-crldir/a836cc2d.r0
 create mode 100644 src/test/ssl/ssl/server-crldir/a836cc2d.r0

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8eabf93834..1ec4b54fcd 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1182,11 +1182,11 @@ include_dir 'conf.d'
   

 Specifies the name of the file containing the SSL server certificate
-revocation list (CRL).
-Relative paths are relative to the data directory.
-This parameter can only be set in the postgresql.conf
-file or on the server command line.
-The default is empty, meaning no CRL file is loaded.
+revocation list (CRL) or the directory containing CRL files.  Relative
+paths are relative to the data directory.  This parameter can only be
+set in the postgresql.conf file or on the server
+command line.  The default is empty, meaning no CRL file is
+loaded. See  for details.

   
  
diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 3315f1dd05..1de85612f3 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -1711,10 +1711,12 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
   

 This parameter specifies the file name of the SSL certificate
-revocation list (CRL).  Certificates listed in this file, if it
-exists, will be rejected while attempting to aut

Re: Handing off SLRU fsyncs to the checkpointer

2020-09-24 Thread Thomas Munro
On Fri, Sep 25, 2020 at 12:05 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > Tom, do you have any thoughts on ShutdownCLOG() etc?
>
> Hm, if we cannot reach that without first completing a shutdown checkpoint,
> it does seem a little pointless.

Thanks for the sanity check.

> It'd likely be a good idea to add a comment to CheckPointCLOG et al
> explaining that we expect $what-exactly to fsync the data we are writing
> before the checkpoint is considered done.

Good point.  Done like this:

+   /*
+    * Write dirty CLOG pages to disk.  This may result in sync requests queued
+    * for later handling by ProcessSyncRequests(), as part of the checkpoint.
+    */
    TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(true);
-   SimpleLruFlush(XactCtl, true);
+   SimpleLruWriteAll(XactCtl, true);
    TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true);

Here's a new version.  The final thing I'm contemplating before
pushing this is whether there may be hidden magical dependencies in
the order of operations in CheckPointGuts(), which I've changed
around.  Andres, any comments?
From a722d71c4296e5fae778a783eb8140be7887eced Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 25 Sep 2020 11:16:03 +1200
Subject: [PATCH v8] Defer flushing of SLRU files.

Previously, we called fsync() after writing out individual pg_xact,
pg_multixact and pg_commit_ts pages due to cache pressure, leading to
regular I/O stalls in user backends and recovery.  Collapse requests for
the same file into a single system call as part of the next checkpoint,
as we already did for relation files, using the infrastructure developed
by commit 3eb77eba.  This causes a significant improvement to recovery
performance.

Remove the Shutdown{CLOG,CommitTS,SUBTRANS,MultiXact}() functions,
because they were redundant after the shutdown checkpoint that
immediately precedes them.

Hoist ProcessSyncRequests() up into CheckPointGuts() to make it clearer
that it applies to all the SLRU mini-buffer-pools as well as the main
buffer pool.  Also rearrange things so that data collected in
CheckpointStats includes SLRU activity.

Reviewed-by: Tom Lane  (ShutdownXXX removal part)
Tested-by: Jakub Wartak 
Discussion: https://postgr.es/m/CA+hUKGLJ=84yt+nvhkeedauutvhmfq9i-n7k_o50jmq6rpj...@mail.gmail.com
---
 src/backend/access/transam/clog.c  |  39 +++
 src/backend/access/transam/commit_ts.c |  36 +++---
 src/backend/access/transam/multixact.c |  57 +
 src/backend/access/transam/slru.c  | 154 +
 src/backend/access/transam/subtrans.c  |  25 +---
 src/backend/access/transam/xlog.c  |  18 ++-
 src/backend/commands/async.c   |   5 +-
 src/backend/storage/buffer/bufmgr.c|   7 --
 src/backend/storage/lmgr/predicate.c   |   8 +-
 src/backend/storage/sync/sync.c|  28 -
 src/include/access/clog.h  |   3 +
 src/include/access/commit_ts.h |   3 +
 src/include/access/multixact.h |   4 +
 src/include/access/slru.h  |  14 ++-
 src/include/storage/sync.h |   7 +-
 src/tools/pgindent/typedefs.list   |   1 +
 16 files changed, 244 insertions(+), 165 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 9e352d2658..2ec6a924db 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -42,6 +42,7 @@
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "storage/proc.h"
+#include "storage/sync.h"
 
 /*
  * Defines for CLOG page sizes.  A page is the same BLCKSZ as is used
@@ -691,7 +692,8 @@ CLOGShmemInit(void)
 {
 	XactCtl->PagePrecedes = CLOGPagePrecedes;
 	SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
-  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER);
+  XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
+  SYNC_HANDLER_CLOG);
 }
 
 /*
@@ -808,34 +810,18 @@ TrimCLOG(void)
 	LWLockRelease(XactSLRULock);
 }
 
-/*
- * This must be called ONCE during postmaster or standalone-backend shutdown
- */
-void
-ShutdownCLOG(void)
-{
-	/* Flush dirty CLOG pages to disk */
-	TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(false);
-	SimpleLruFlush(XactCtl, false);
-
-	/*
-	 * fsync pg_xact to ensure that any files flushed previously are durably
-	 * on disk.
-	 */
-	fsync_fname("pg_xact", true);
-
-	TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(false);
-}
-
 /*
  * Perform a checkpoint --- either during shutdown, or on-the-fly
  */
 void
 CheckPointCLOG(void)
 {
-	/* Flush dirty CLOG pages to disk */
+	/*
+	 * Write dirty CLOG pages to disk.  This may result in sync requests queued
+	 * for later handling by ProcessSyncRequests(), as part of the checkpoint.
+	 */
 	TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(true);
-	SimpleLruFlush(XactCtl, true);
+	SimpleLruWriteAll(XactCtl, true);
 	TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true);
 }
 
@@ -1026,3 +1012,12 @@ clog_redo(XLogReaderState *record)
 	else
 		elog(PANIC, "clog_redo: unknown op code %u", info);
 }
+
+/*
+ *

pg_upgrade: fail early if a tablespace dir already exists for new cluster version

2020-09-24 Thread Justin Pryzby
This should be caught during --check, or earlier in the upgrade, rather than
only after dumping the schema.

Typically, the tablespace dir would be left behind by a previous failed upgrade
attempt, causing a cascade of failed upgrades.  I guess I may not be the only
one with a 3-year-old wrapper which has a hack to check for this.

|rm -fr pgsql12.dat pgsql13.dat
|/usr/pgsql-12/bin/initdb -D pgsql12.dat --no-sync
|/usr/pgsql-13/bin/initdb -D pgsql13.dat --no-sync
|/usr/pgsql-12/bin/postgres -D pgsql12.dat -c port=5678 -k /tmp
|mkdir tblspc tblspc/PG_13_202007203
|psql -h /tmp -p 5678 postgres -c "CREATE TABLESPACE one LOCATION 
'/home/pryzbyj/tblspc'"
|/usr/pgsql-13/bin/pg_upgrade -D pgsql13.dat -d pgsql12.dat -b /usr/pgsql-12/bin
|pg_upgrade_utility.log:
|CREATE TABLESPACE "one" OWNER "pryzbyj" LOCATION '/home/pryzbyj/tblspc';
|psql:pg_upgrade_dump_globals.sql:27: ERROR:  directory 
"/home/pryzbyj/tblspc/PG_13_202007201" already in use as a tablespace
>From 3df104610d73833da60bffcc54756a6ecb2f390f Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Thu, 24 Sep 2020 19:49:40 -0500
Subject: [PATCH v1] pg_upgrade --check to avoid tablespace failure mode

---
 src/bin/pg_upgrade/check.c | 32 
 1 file changed, 32 insertions(+)

diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 2f7aa632c5..b07e63dab2 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -13,6 +13,7 @@
 #include "fe_utils/string_utils.h"
 #include "mb/pg_wchar.h"
 #include "pg_upgrade.h"
+#include "common/relpath.h"
 
 static void check_new_cluster_is_empty(void);
 static void check_databases_are_compatible(void);
@@ -27,6 +28,7 @@ static void check_for_tables_with_oids(ClusterInfo *cluster);
 static void check_for_reg_data_type_usage(ClusterInfo *cluster);
 static void check_for_jsonb_9_4_usage(ClusterInfo *cluster);
 static void check_for_pg_role_prefix(ClusterInfo *cluster);
+static void check_for_existing_tablespace_dirs(ClusterInfo *new_cluster);
 static char *get_canonical_locale_name(int category, const char *locale);
 
 
@@ -187,6 +189,8 @@ check_new_cluster(void)
 	check_is_install_user(&new_cluster);
 
 	check_for_prepared_transactions(&new_cluster);
+
+	check_for_existing_tablespace_dirs(&new_cluster);
 }
 
 
@@ -541,6 +545,34 @@ create_script_for_cluster_analyze(char **analyze_script_file_name)
 	check_ok();
 }
 
+/*
+ * Check that tablespace dirs for new cluster do not already exist.
+ * If they did, they'd cause an error while restoring global objects.
+ * This allows failing earlier rather than only after dumping schema.
+ */
+static void
+check_for_existing_tablespace_dirs(ClusterInfo *new_cluster)
+{
+	char	old_tablespace_dir[MAXPGPATH];
+
+	prep_status("Checking for pre-existing tablespace directories for new cluster version");
+
+// if (GET_MAJOR_VERSION(old_cluster.major_version) <= 804)
+	for (int tblnum = 0; tblnum < os_info.num_old_tablespaces; tblnum++)
+	{
+		struct stat statbuf;
+		snprintf(old_tablespace_dir, MAXPGPATH, "%s%c%s",
+os_info.old_tablespaces[tblnum],
+PATH_SEPARATOR, TABLESPACE_VERSION_DIRECTORY);
+
+		canonicalize_path(old_tablespace_dir);
+
+		// XXX: lstat ?
+		if (stat(old_tablespace_dir, &statbuf) == 0)
+			pg_fatal("tablespace directory already exists for new cluster version: \"%s\"\n",
+	 old_tablespace_dir);
+	}
+}
 
 /*
  * create_script_for_old_cluster_deletion()
-- 
2.17.0



Re: "cert" + clientcert=verify-ca in pg_hba.conf?

2020-09-24 Thread Michael Paquier
On Thu, Sep 24, 2020 at 11:43:40AM -0400, Bruce Momjian wrote:
> I will handle it.

Thanks.  I have marked the patch as waiting on author for now, due to the
complaint of the CF bot, but if you feel that this does not
require an extra round of review after the new rebase, of course
please feel free.
--
Michael


signature.asc
Description: PGP signature


Re: Memory allocation abstraction in pgcrypto

2020-09-24 Thread Michael Paquier
On Wed, Sep 23, 2020 at 04:57:44PM +0900, Michael Paquier wrote:
> I doubt that anybody has been compiling with PX_OWN_ALLOC in the last
> years, so let's remove this abstraction.  And +1 for your patch.

So, I have looked at that this morning again, and applied the thing.
--
Michael


signature.asc
Description: PGP signature


Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.

2020-09-24 Thread Dilip Kumar
On Thu, Sep 24, 2020 at 5:31 PM Amit Kapila  wrote:
>
> On Thu, Sep 24, 2020 at 5:11 PM Dilip Kumar  wrote:
> >
> > On Thu, Sep 24, 2020 at 4:45 PM Amit Kapila  wrote:
> > >
> > >
> > > Have you checked what the function walrcv_server_version() will
> > > return? I was thinking that if we know that the subscriber is connected to
> > > a publisher version < 14 then we can send the right value.
> >
> > But suppose we know the publisher version is < 14 and the user has set
> > streaming on while creating the subscription; will we still send the
> > right version?
> >
>
> Yeah we can send the version depending on which server we are talking to?

Ok

> >  I think tablesync we are forming a query so we are
> > forming as per the publisher's version.  But here we are sending which
> > protocol version we are expecting for the data transfer so I feel it
> > should be LOGICALREP_PROTO_VERSION_NUM if we expect non-streaming
> > transfer and LOGICALREP_PROTO_STREAM_VERSION_NUM if we expect the
> > streaming transfer.
> >
>
> I am not sure if this is the right strategy. See
> libpqrcv_startstreaming, even if the user asked for streaming unless
> the server supports it we don't send the streaming option to the user.
> Another thing is this check will become more complicated if we need
> another feature like decode prepared to send different version or even
> if it is the same version, we might need additional checks. Why do you
> want to send a streaming protocol version when we know the server
> doesn't support it, lets behave similar to what we do in
> libpqrcv_startstreaming.

Okay, so even if streaming is enabled we only send the streaming option
to server versions >= 14, which makes sense.  But I still doubt that it
is the right thing to send the protocol version based on the server
version.  For example, suppose that in a future PG20 we change the
streaming protocol version to 3; that means that between PG14 and PG20
we may not support streaming transmission, but we would still be able to
support normal transmission.  Now, if streaming is off then ideally we
should send LOGICALREP_PROTO_VERSION_NUM, which is still supported by
the publisher; but if we send LOGICALREP_PROTO_STREAM_VERSION_NUM then
it will error out, because the streaming protocol changed in the latest
subscriber and is not supported by the older publisher (14).  So
non-streaming transmission from 14 to 20 should be allowed, but if we
always send LOGICALREP_PROTO_STREAM_VERSION_NUM based on the server
version, that will be a problem because our streaming protocol version
has changed.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: fixing old_snapshot_threshold's time->xid mapping

2020-09-24 Thread Michael Paquier
On Thu, Sep 24, 2020 at 03:46:14PM -0400, Robert Haas wrote:
> Committed.

Cool, thanks.

> Thomas, with respect to your part of this patch set, I wonder if we
> can make the functions that you're using to write tests safe enough
> that we could add them to contrib/old_snapshot and let users run them
> if they want. As you have them, they are hedged around with vague and
> scary warnings, but is that really justified? And if so, can it be
> fixed? It would be nicer not to end up with two loadable modules here,
> and maybe the right sorts of functions could even have some practical
> use.

I have marked this item as waiting on author in the CF app then, as
we are not completely done yet.
--
Michael


signature.asc
Description: PGP signature


Re: "cert" + clientcert=verify-ca in pg_hba.conf?

2020-09-24 Thread Tom Lane
Kyotaro Horiguchi  writes:
> Thank you Bruce, Michael. This is a rebased version.

I really strongly object to all the encoded data in this patch.
One cannot read it, one cannot even easily figure out how long
it is until the tests break by virtue of the certificates expiring.

One can, however, be entirely certain that they *will* break at
some point.  I don't like the idea of time bombs in our test suite.
That being the case, it'd likely be better to drop all the pre-made
certificates and have the test scripts create them on the fly.
That'd remove both the documentation problem (i.e., having readable
info as to how the certificates were made) and the expiration problem.

regards, tom lane




Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.

2020-09-24 Thread Amit Kapila
On Fri, Sep 25, 2020 at 7:21 AM Dilip Kumar  wrote:
>
> On Thu, Sep 24, 2020 at 5:31 PM Amit Kapila  wrote:
> >
> > On Thu, Sep 24, 2020 at 5:11 PM Dilip Kumar  wrote:
> > >
> > > On Thu, Sep 24, 2020 at 4:45 PM Amit Kapila  
> > > wrote:
> > > >
> > > >
> > > > Have you checked what the function walrcv_server_version() will
> > > > return? I was thinking that if we know that the subscriber is connected to
> > > > a publisher version < 14 then we can send the right value.
> > >
> > > But suppose we know the publisher version is < 14 and the user has set
> > > streaming on while creating the subscription; will we still send the
> > > right version?
> > >
> >
> > Yeah we can send the version depending on which server we are talking to?
>
> Ok
>
> > >  I think tablesync we are forming a query so we are
> > > forming as per the publisher's version.  But here we are sending which
> > > protocol version we are expecting for the data transfer so I feel it
> > > should be LOGICALREP_PROTO_VERSION_NUM if we expect non-streaming
> > > transfer and LOGICALREP_PROTO_STREAM_VERSION_NUM if we expect the
> > > streaming transfer.
> > >
> >
> > I am not sure if this is the right strategy. See
> > libpqrcv_startstreaming, even if the user asked for streaming unless
> > the server supports it we don't send the streaming option to the user.
> > Another thing is this check will become more complicated if we need
> > another feature like decode prepared to send different version or even
> > if it is the same version, we might need additional checks. Why do you
> > want to send a streaming protocol version when we know the server
> > doesn't support it, lets behave similar to what we do in
> > libpqrcv_startstreaming.
>
> Okay, so even if streaming is enabled we only send the streaming option to
> server versions >= 14, which makes sense.  But I still doubt that it is the
> right thing to send the protocol version based on the server version.  For
> example, suppose that in a future PG20 we change the streaming protocol
> version to 3; that means that between PG14 and PG20 we may not support
> streaming transmission, but we would still be able to support normal
> transmission.  Now, if streaming is off then ideally we should send
> LOGICALREP_PROTO_VERSION_NUM, which is still supported by the publisher;
> but if we send LOGICALREP_PROTO_STREAM_VERSION_NUM then it will error out,
> because the streaming protocol changed in the latest subscriber and is not
> supported by the older publisher (14).
>

No, that won't happen; we will send the version based on what the
server supports (in this case it would be 2 when we are talking to a
version-14 publisher). We can define a new macro or something like
that. I feel the protocol should depend on whether the server supports
it, rather than on some user-specified option, because that will make
our life easier if tomorrow we want to enable streaming by default
instead of based on an option specified by the user. You can look at it
this way: how would you handle this if we didn't have a user-specified
option (streaming)? That is generally how it should work.
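
A minimal sketch of that on the subscriber side (a fragment, not the
actual patch; conn stands for the walreceiver connection the apply worker
already holds, and 140000 is the usual PQserverVersion()-style number
for v14):

	/*
	 * Hedged sketch: pick the protocol version from what the publisher
	 * supports rather than from the subscription's streaming option.
	 */
	options.proto.logical.proto_version =
		walrcv_server_version(conn) >= 140000 ?
		LOGICALREP_PROTO_STREAM_VERSION_NUM :
		LOGICALREP_PROTO_VERSION_NUM;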

-- 
With Regards,
Amit Kapila.




Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.

2020-09-24 Thread Amit Kapila
On Thu, Sep 24, 2020 at 6:33 PM Ashutosh Sharma  wrote:
>
> Hi Amit,
>
> > Here, I think instead of using MySubscription->stream, we should use
> > server/walrecv version number as we used at one place in tablesync.c.
>
> Should subscribers be setting the LR protocol value based on what is
> the publisher version it is communicating with or should it be set
> based on whether streaming was enabled or not while creating that
> subscription? AFAIU if we set this value based on the publisher
> version (which is lets say >= 14), then it's quite possible that the
> subscriber will start streaming changes for the in-progress
> transactions even if the streaming was disabled while creating the
> subscription, won't it?
>

No, that won't happen, because we send this option to the server
(the publisher in this case) only when its version is >= 14 and the user
has specified this option. See the check below in function
libpqrcv_startstreaming()
{
..
if (options->proto.logical.streaming &&
PQserverVersion(conn->streamConn) >= 14)
appendStringInfo(&cmd, ", streaming 'on'");
..
}

-- 
With Regards,
Amit Kapila.




Re: New statistics for tuning WAL buffer size

2020-09-24 Thread Masahiro Ikeda

On 2020-09-18 11:11, Kyotaro Horiguchi wrote:

At Fri, 18 Sep 2020 09:40:11 +0900, Masahiro Ikeda
 wrote in

Thanks. I confirmed that it causes HOT pruning or killing of
dead index tuple if DecodeCommit() is called.

As you said, DecodeCommit() may access the system table.

...

The wals are generated only when logical replication is performed.
So, I added pgstat_send_wal() in XLogSendLogical().

But, I concerned that it causes poor performance
since pgstat_send_wal() is called per wal record,


I think that's too frequent.  If we want to send any stats to the
collector, it is usually done at commit time using
pgstat_report_stat(), and the function avoids sending stats too
frequently. For logrep-worker, apply_handle_commit() is calling it. It
seems to be the place if we want to send the wal stats.  Or it may be
better to call pgstat_send_wal() via pgstat_report_stat(), like
pg_stat_slru().


Thanks for your comments.
Since I changed the code to use pgstat_report_stat(), and DecodeCommit()
is calling it, the frequency of sending statistics is not so high.


Currently the logrep-launcher, logrep-worker and autovac-launcher (and some
other processes?) don't seem (AFAICS) to be sending scan stats at all, but
according to the discussion here, we should let such processes send
stats.


I added pgstat_report_stat() to the logrep-launcher and autovac-launcher.
As you said, logrep-worker already calls apply_handle_commit() and 
pgstat_report_stat().


The checkpointer doesn't seem to call pgstat_report_stat() currently,
but since it may have WAL statistics to send, I added
pgstat_report_stat() there as well.


Regards,
--
Masahiro Ikeda
NTT DATA CORPORATIONdiff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4e0193a967..dd292fe27c 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -424,6 +424,13 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
  
  
 
+ 
+  pg_stat_walpg_stat_wal
+  One row only, showing statistics about the WAL writing activity. See
+for details.
+  
+ 
+
  
   pg_stat_databasepg_stat_database
   One row per database, showing database-wide statistics. See
@@ -3280,6 +3287,56 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
 
  
 
+ 
+   pg_stat_wal
+
+  
+   pg_stat_wal
+  
+
+  
+   The pg_stat_wal view will always have a
+   single row, containing data about the WAL writing activity of the cluster.
+  
+
+  
+   pg_stat_wal View
+   
+
+ 
+  
+   Column Type
+  
+  
+   Description
+  
+ 
+
+
+
+ 
+  
+   wal_buffers_full bigint
+  
+  
+   Number of WAL writes when the  are full
+  
+ 
+
+ 
+  
+   stats_reset timestamp with time zone
+  
+  
+   Time at which these statistics were last reset
+  
+ 
+ 
+   
+  
+
+
+
  
   pg_stat_database
 
@@ -4668,8 +4725,9 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
 argument.  The argument can be bgwriter to reset
 all the counters shown in
 the pg_stat_bgwriter
-view, or archiver to reset all the counters shown in
-the pg_stat_archiver view.
+view, archiver to reset all the counters shown in
+the pg_stat_archiver view, or wal
+to reset all the counters shown in the pg_stat_wal view.


 This function is restricted to superusers by default, but other users
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 61754312e2..3a06cacefb 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2195,6 +2195,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 	WriteRqst.Flush = 0;
 	XLogWrite(WriteRqst, false);
 	LWLockRelease(WALWriteLock);
+	WalStats.m_wal_buffers_full++;
 	TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
 }
 /* Re-acquire WALBufMappingLock and retry */
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index ed4f3f142d..643445c189 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -979,6 +979,11 @@ CREATE VIEW pg_stat_bgwriter AS
 pg_stat_get_buf_alloc() AS buffers_alloc,
 pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
 
+CREATE VIEW pg_stat_wal AS
+SELECT
+pg_stat_get_wal_buffers_full() AS wal_buffers_full,
+pg_stat_get_wal_stat_reset_time() AS stats_reset;
+
 CREATE VIEW pg_stat_progress_analyze AS
 SELECT
 S.pid AS pid, S.datid AS datid, D.datname AS datname,
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 2cef56f115..1b224d479d 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -795,6 +795,8 @@ AutoVacLauncherMain(int argc, char *argv[

Re: Parallel INSERT (INTO ... SELECT ...)

2020-09-24 Thread Amit Kapila
On Thu, Sep 24, 2020 at 7:51 AM Andres Freund  wrote:
>
> On 2020-09-22 14:55:21 +1000, Greg Nancarrow wrote:
>
>
> > diff --git a/src/backend/access/heap/heapam.c 
> > b/src/backend/access/heap/heapam.c
> > index 1585861..94c8507 100644
> > --- a/src/backend/access/heap/heapam.c
> > +++ b/src/backend/access/heap/heapam.c
> > @@ -2049,11 +2049,6 @@ heap_prepare_insert(Relation relation, HeapTuple 
> > tup, TransactionId xid,
> >* inserts in general except for the cases where inserts generate a 
> > new
> >* CommandId (eg. inserts into a table having a foreign key column).
> >*/
> > - if (IsParallelWorker())
> > - ereport(ERROR,
> > - (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
> > -  errmsg("cannot insert tuples in a parallel 
> > worker")));
> > -
>
> I'm afraid that this weakens our checks more than I'd like.
>

I think we need to change/remove this check to allow inserts by
parallel workers. I am not sure but maybe we can add an Assert to
ensure that it is safe to perform insert via parallel worker.

> What if this
> ends up being invoked from inside C code?
>

I think it shouldn't be a problem unless one is trying to do something
like an insert into a table with a foreign key. So probably we can have an
Assert to catch it if possible. Do you have any other ideas?

-- 
With Regards,
Amit Kapila.




Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2

2020-09-24 Thread Michael Paquier
On Thu, Sep 24, 2020 at 06:28:25PM +0200, Daniel Gustafsson wrote:
> Doh, of course, I blame a lack of caffeine this afternoon.  Having a private
> local sha256 implementation using the EVP_* API inside scram-common would
> maintain FIPS compliance and ABI compatibility, but would also be rather ugly.

Even if we'd try to force our internal implementation of SHA256 on
already-released branches instead of the one of OpenSSL, this would be
an ABI break for compiled modules expected to work on this released
branch as OpenSSL's internal SHA structures don't exactly match with
our own implementation (think just about sizeof() or such).
--
Michael


signature.asc
Description: PGP signature


Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2

2020-09-24 Thread Michael Paquier
On Thu, Sep 24, 2020 at 09:44:57PM +0200, Daniel Gustafsson wrote:
> On 24 Sep 2020, at 21:22, Robert Haas  wrote:
>> I mean, the issue here, as is so often the case, is not what is
>> actually more secure, but what meets the terms of some security
>> standard.
> 
> Correct, IIUC in order to be FIPS compliant all cryptographic modules used 
> must
> be FIPS certified.

/me whistles, thinking about not using src/common/md5.c when building
with OpenSSL, to actually get a complaint if FIPS gets used.

>> At least in the US, FIPS 140-2 compliance is a reasonably
>> common need, so if we can make it easier for people who have that need
>> to be compliant, they are more likely to use PostgreSQL, which seems
>> like something that we should want.
> 
> The proposed patch makes SCRAM+FIPS work for 14, question is if we need/want 
> to
> try and address v10-13.

Unfortunately, I don't have a good answer for that, except for the
answers involving an ABI breakage.  FWIW, the only users of those APIs
I can find in the open wild are pgpool, which actually bundles a copy
of the code in src/common/ so it does not matter, and pgbouncer, that
uses a different compatibility layer to make the code compilable:
https://sources.debian.org/src/pgbouncer/1.14.0-1/include/common/postgres_compat.h/?hl=26#L26

But it is not really possible to say if there is any closed code
relying on that, so I'd like to fix that just on HEAD, about which I
guess there would be no objections?
--
Michael


signature.asc
Description: PGP signature


Re: Parallel INSERT (INTO ... SELECT ...)

2020-09-24 Thread Greg Nancarrow
> > What if this
> > ends up being invoked from inside C code?
> >
>
> I think it shouldn't be a problem unless one is trying to do something
> like insert into foreign key table. So, probably we can have an Assert
> to catch it if possible. Do you have any other idea?
>

Note that the planner code updated by the patch does avoid creating a
Parallel INSERT plan in the case of inserting into a table with a
foreign key (so commandIds won't be created in the parallel-worker
code).
I'm not sure how to distinguish the "invoked from inside C code" case though.

Regards,
Greg Nancarrow
Fujitsu Australia




Re: history file on replica and double switchover

2020-09-24 Thread David Zhang

On 2020-09-24 5:00 p.m., Fujii Masao wrote:



On 2020/09/25 8:15, David Zhang wrote:

Hi,

My understanding is that the "archiver" won't even start if 
"archive_mode = on" has been set on a "replica". Therefore, either 
(XLogArchiveMode == ARCHIVE_MODE_ALWAYS) or (XLogArchiveMode != 
ARCHIVE_MODE_OFF) will produce the same result.


Yes, the archiver isn't invoked in the standby when archive_mode=on.
But, with the original patch, walreceiver creates .ready archive 
status file

even when archive_mode=on and no archiver is running. This causes
the history file to be archived after the standby is promoted and
the archiver is invoked.

With my patch, walreceiver creates .ready archive status for the 
history file

only when archive_mode=always, like it does for the streamed WAL files.
This is the difference between those two patches. To prevent the streamed
timeline history file from being archived, IMO we should use the 
condition

archive_mode==always in the walreceiver.

+1





Please see how the "archiver" is started in
src/backend/postmaster/postmaster.c:

    /*
     * Start the archiver if we're responsible for (re-)archiving received
     * files.
     */
    Assert(PgArchPID == 0);
    if (XLogArchivingAlways())
        PgArchPID = pgarch_start();

I did run the nice script "double_switchover.sh" using either 
"always" or "on" on patch v1 and v2. They all generate the same 
results below. In other words, whether history file 
(0003.history) is archived or not depends on "archive_mode" 
settings.


echo "archive_mode = always" >> ${PGDATA_NODE2}/postgresql.auto.conf

$ ls -l archive
-rw--- 1 david david 16777216 Sep 24 14:40 00010002
... ...
-rw--- 1 david david 16777216 Sep 24 14:40 0002000A
-rw--- 1 david david   41 Sep 24 14:40 0002.history
-rw--- 1 david david   83 Sep 24 14:40 0003.history

echo "archive_mode = on" >> ${PGDATA_NODE2}/postgresql.auto.conf

$ ls -l archive
-rw--- 1 david david 16777216 Sep 24 14:47 00010002
... ...
-rw--- 1 david david 16777216 Sep 24 14:47 0002000A
-rw--- 1 david david   41 Sep 24 14:47 0002.history


Personally, I prefer patch v2 since it aligns with the statement here, 
https://www.postgresql.org/docs/13/warm-standby.html#CONTINUOUS-ARCHIVING-IN-STANDBY


"If archive_mode is set to on, the archiver is not enabled during 
recovery or standby mode. If the standby server is promoted, it will 
start archiving after the promotion, but will not archive any WAL it 
did not generate itself."


By the way, I think the last part of the sentence should be changed 
to something like below:


"but will not archive any WAL, which was not generated by itself."


I'm not sure if this change is an improvement or not. But if we apply
the patch I proposed, maybe we should mention also history file here.
For example, "but will not archive any WAL or timeline history files
that it did not generate itself".

Makes sense to me.


Regards,


--
David

Software Engineer
Highgo Software Inc. (Canada)
www.highgo.ca





Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2

2020-09-24 Thread Michael Paquier
On Fri, Sep 25, 2020 at 12:19:44PM +0900, Michael Paquier wrote:
> Even if we'd try to force our internal implementation of SHA256 on
> already-released branches instead of the one of OpenSSL, this would be
> an ABI break for compiled modules expected to work on this released
> branch as OpenSSL's internal SHA structures don't exactly match with
> our own implementation (think just about sizeof() or such).

Well, we could as well add one extra SHA API layer pointing to the EVP
structures and APIs with new names, leaving the original ones in
place, and then have SCRAM use the new ones, but I'd rather not go
down that road for the back-branches.
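
For reference, a minimal sketch of the EVP-based digest calls such a layer
would wrap (assuming OpenSSL 1.1.0 or newer for EVP_MD_CTX_new(); this is
not proposed code, just the underlying API):

#include <stddef.h>
#include <openssl/evp.h>

/*
 * Hedged sketch of an EVP-backed SHA-256 one-shot, the kind of call a new
 * wrapper layer could expose so that the FIPS-certified module does the
 * actual hashing.  out must have room for at least 32 bytes (the SHA-256
 * digest length).  Returns 0 on success, -1 on failure.
 */
static int
sha256_evp(const unsigned char *data, size_t len,
		   unsigned char *out, unsigned int *outlen)
{
	EVP_MD_CTX *ctx = EVP_MD_CTX_new();

	if (ctx == NULL)
		return -1;
	if (EVP_DigestInit_ex(ctx, EVP_sha256(), NULL) != 1 ||
		EVP_DigestUpdate(ctx, data, len) != 1 ||
		EVP_DigestFinal_ex(ctx, out, outlen) != 1)
	{
		EVP_MD_CTX_free(ctx);
		return -1;
	}
	EVP_MD_CTX_free(ctx);
	return 0;
}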
--
Michael


signature.asc
Description: PGP signature


Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2

2020-09-24 Thread Tom Lane
Michael Paquier  writes:
> On Fri, Sep 25, 2020 at 12:19:44PM +0900, Michael Paquier wrote:
>> Even if we'd try to force our internal implementation of SHA256 on
>> already-released branches instead of the one of OpenSSL, this would be
>> an ABI break for compiled modules expected to work on this released
>> branch as OpenSSL's internal SHA structures don't exactly match with
>> our own implementation (think just about sizeof() or such).

> Well, we could as well add one extra SHA API layer pointing to the EVP
> structures and APIs with new names, leaving the original ones in
> place, and then have SCRAM use the new ones, but I'd rather not go
> down that road for the back-branches.

Given the tiny number of complaints to date, it seems sufficient to me
to deal with this in HEAD.

regards, tom lane




Re: Parallel INSERT (INTO ... SELECT ...)

2020-09-24 Thread Greg Nancarrow
Hi Andres,

On Thu, Sep 24, 2020 at 12:21 PM Andres Freund  wrote:
>
>I think it'd be good if you outlined what your approach is to make this
>safe.

Some prior work has already been done to establish the necessary
infrastructure to allow parallel INSERTs, in general, to be safe,
except for cases where new commandIds would be generated in the
parallel-worker code (such as inserts into a table having a foreign
key) - these cases need to be avoided.
See the following commits.

85f6b49 Allow relation extension lock to conflict among parallel group members
3ba59cc Allow page lock to conflict among parallel group members

The planner code updated by the patch avoids creating a Parallel
INSERT plan in the case of inserting into a table that has a foreign
key.


>> For cases where it can't be allowed (e.g. INSERT into a table with
>> foreign keys, or INSERT INTO ... SELECT ... ON CONFLICT ... DO UPDATE
>> ...") it at least allows parallelism of the SELECT part.
>
>I think it'd be good to do this part separately and first, independent
>of whether the insert part can be parallelized.

OK then, I'll try to extract that as a separate patch.


>> Obviously I've had to update the planner and executor and
>> parallel-worker code to make this happen, hopefully not breaking too
>> many things along the way.
>
>Hm, it looks like you've removed a fair bit of checks, it's not clear to
>me why that's safe in each instance.

It should be safe for Parallel INSERT - but you are right, these are
brute force removals (for the purpose of a POC patch) that should be
tightened up wherever possible to disallow unsafe paths into that
code. Problem is, currently there's not a lot of context information
available to easily allow that, so some work needs to be done.


>> - Currently only for "INSERT INTO ... SELECT ...". To support "INSERT
>> INTO ... VALUES ..." would need additional Table AM functions for
>> dividing up the INSERT work amongst the workers (currently only exists
>> for scans).
>
>Hm, not entirely following. What precisely are you thinking of here?

All I was saying is that for SELECTs, the work done by each parallel
worker is effectively divided up by parallel-worker-related functions
in tableam.c and indexam.c, and no such technology currently exists
for dividing up work for the "INSERT ... VALUES" case.


>I doubt it's really worth adding parallelism support for INSERT
>... VALUES, the cost of spawning workers will almost always higher than
>the benefit.

You're probably right in doubting any benefit, but I wasn't entirely sure.


>> @@ -116,7 +117,7 @@ toast_save_datum(Relation rel, Datum value,
>>   TupleDesc   toasttupDesc;
>>   Datum   t_values[3];
>>   boolt_isnull[3];
>> - CommandId   mycid = GetCurrentCommandId(true);
>> + CommandId   mycid = GetCurrentCommandId(!IsParallelWorker());
>>   struct varlena *result;
>>   struct varatt_external toast_pointer;
>>   union
>
>Hm? Why do we need this in the various places you have made this change?

It's because for Parallel INSERT, we're assigning the same command-id
to each worker up-front during worker initialization (the commandId
has been retrieved by the leader and passed through to each worker)
and "currentCommandIdUsed" has been set true. See the
AssignCommandIdForWorker() function in the patch.
If you look at the code of GetCurrentCommandId(), you'll see it Asserts
that it's not being run by a parallel worker if the parameter is true.
I didn't want to remove yet another check, without being able to know
the context of the caller, because only for Parallel INSERT do I know
that "currentCommandIdUsed was already true at the start of the
parallel operation". See the comment in that function. Anyway, that's
why I'm passing "false" to relevant GetCurrentCommandId() calls if
they're being run by a parallel (INSERT) worker.


>> @@ -822,19 +822,14 @@ heapam_relation_copy_for_cluster(Relation OldHeap, 
>> Relation NewHeap,
>>   isdead = false;
>>   break;
>>   case HEAPTUPLE_INSERT_IN_PROGRESS:
>> -
>>   /*
>>* Since we hold exclusive lock on the 
>> relation, normally the
>>* only way to see this is if it was inserted 
>> earlier in our
>>* own transaction.  However, it can happen in 
>> system
>>* catalogs, since we tend to release write 
>> lock before >commit
>> -  * there.  Give a warning if neither case 
>> applies; but in any
>> -  * case we had better copy it.
>> +  * there. In any case we had better copy it.
>>*/
>> - if (!is_system_catalog &&
>> - 
>> !TransactionIdIsCurrentTransactionId>(HeapTuple

Feature improvement of tab completion for DEALLOCATE

2020-09-24 Thread btnakamichin

Hello!

I’d like to improve the deallocate tab completion.

The deallocate tab completion can’t complete “ALL”. This patch fixes
the problem.


Regards,

Naoki Nakamichi

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 9c6f5ecb6a..d877cc86f0 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2911,7 +2911,7 @@ psql_completion(const char *text, int start, int end)
 
 /* DEALLOCATE */
 	else if (Matches("DEALLOCATE"))
-		COMPLETE_WITH_QUERY(Query_for_list_of_prepared_statements);
+		COMPLETE_WITH_QUERY(Query_for_list_of_prepared_statements " UNION SELECT 'ALL'");
 
 /* DECLARE */
 	else if (Matches("DECLARE", MatchAny))


Re: Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away

2020-09-24 Thread Bharath Rupireddy
On Wed, Sep 23, 2020 at 8:19 PM Fujii Masao  wrote:
>
> > Please let me know if okay with the above agreed points, I will work on the 
> > new patch.
>
> Yes, please work on the patch! Thanks! I may revisit the above points later, 
> though ;)
>

Thanks, attaching v4 patch. Please consider this for further review.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v4-Retry-Cached-Remote-Connections-For-postgres_fdw.patch
Description: Binary data


Feature improvement for FETCH tab completion

2020-09-24 Thread btnakamichin

Hello!

I’d like to improve the FETCH tab completion.

The FETCH tab completion is missing some keywords. This patch fixes the problem.

Previously, the completion offered one of FORWARD, BACKWARD, RELATIVE,
ABSOLUTE; now it offers one of FORWARD, BACKWARD, RELATIVE, ABSOLUTE,
ALL, NEXT, PRIOR, FIRST, LAST, and it also handles the subsequent IN and
FROM clauses.


Regards,

Naoki Nakamichi

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index d877cc86f0..eb103b784e 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -3076,19 +3076,22 @@ psql_completion(const char *text, int start, int end)
 		COMPLETE_WITH("SELECT", "INSERT", "DELETE", "UPDATE", "DECLARE");
 
 /* FETCH && MOVE */
-	/* Complete FETCH with one of FORWARD, BACKWARD, RELATIVE */
+	/* Complete FETCH with one of FORWARD, BACKWARD, RELATIVE, ALL, NEXT, PRIOR, FIRST, LAST */
 	else if (Matches("FETCH|MOVE"))
-		COMPLETE_WITH("ABSOLUTE", "BACKWARD", "FORWARD", "RELATIVE");
-	/* Complete FETCH  with one of ALL, NEXT, PRIOR */
-	else if (Matches("FETCH|MOVE", MatchAny))
-		COMPLETE_WITH("ALL", "NEXT", "PRIOR");
+		COMPLETE_WITH("ABSOLUTE", "BACKWARD", "FORWARD", "RELATIVE", "ALL", "NEXT", "PRIOR", "FIRST", "LAST");
+	/* Complete FETCH BACKWARD or FORWARD with one of ALL */
+	else if (Matches("FETCH|MOVE", "BACKWARD|FORWARD"))
+		COMPLETE_WITH("ALL");
 
 	/*
-	 * Complete FETCH   with "FROM" or "IN". These are equivalent,
+	 * Complete FETCH  with "FROM" or "IN". These are equivalent,
 	 * but we may as well tab-complete both: perhaps some users prefer one
 	 * variant or the other.
 	 */
-	else if (Matches("FETCH|MOVE", MatchAny, MatchAny))
+	else if (Matches("FETCH|MOVE", "ABSOLUTE|BACKWARD|FORWARD|RELATIVE", MatchAny))
+		COMPLETE_WITH("FROM", "IN");
+
+	else if (Matches("FETCH|MOVE", "ALL|NEXT|PRIOR|FIRST|LAST"))
 		COMPLETE_WITH("FROM", "IN");
 
 /* FOREIGN DATA WRAPPER */


Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2

2020-09-24 Thread Peter Eisentraut

On 2020-09-24 21:44, Daniel Gustafsson wrote:

On 24 Sep 2020, at 21:22, Robert Haas  wrote:

On Thu, Sep 24, 2020 at 1:57 PM Peter Eisentraut
 wrote:

Depends on what one considers to be covered by FIPS.  The entire rest of
SCRAM is custom code, so running it on top of the world's greatest
SHA-256 implementation isn't going to make the end product any more
trustworthy.


I mean, the issue here, as is so often the case, is not what is
actually more secure, but what meets the terms of some security
standard.


Correct, IIUC in order to be FIPS compliant all cryptographic modules used must
be FIPS certified.


As I read FIPS 140-2, it just specifies what must be true of 
cryptographic modules that claim to follow that standard, it doesn't say 
that all cryptographic activity in an application or platform must only 
use such modules.  (Notably, it doesn't even seem to define 
"cryptographic".)  The latter may well be a requirement of a user or 
customer on top of the actual standard.  However, again, the SCRAM 
implementation would already appear to fail that requirement because it 
uses a custom HMAC implementation, and HMAC is listed in FIPS 140-2 as a 
covered algorithm.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Feature improvement of tab completion for DEALLOCATE

2020-09-24 Thread Fujii Masao




On 2020/09/25 13:45, btnakamichin wrote:

Hello!

I’d like to improve the deallocate tab completion.

The deallocate tab completion can’t complete “ALL”. Therefore, this patch 
fixes the problem.


Thanks for the patch! It looks basically good, but I think it's better
to add newline just before UNION to avoid long line, as follows.

-   COMPLETE_WITH_QUERY(Query_for_list_of_prepared_statements);
+   COMPLETE_WITH_QUERY(Query_for_list_of_prepared_statements
+   " UNION SELECT 'ALL'");

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2

2020-09-24 Thread Tom Lane
Peter Eisentraut  writes:
> On 2020-09-24 21:44, Daniel Gustafsson wrote:
>> Correct, IIUC in order to be FIPS compliant all cryptographic modules used 
>> must
>> be FIPS certified.

> As I read FIPS 140-2, it just specifies what must be true of 
> cryptographic modules that claim to follow that standard, it doesn't say 
> that all cryptographic activity in an application or platform must only 
> use such modules.  (Notably, it doesn't even seem to define 
> "cryptographic".)  The latter may well be a requirement of a user or 
> customer on top of the actual standard.

Hm ... AFAICT, the intent of the "FIPS mode" in Red Hat's implementation,
and probably other Linux distros, is exactly that thou shalt not use
any non-FIPS-approved crypto code.  By your reading above, there would
be no need at all for a kernel-level enforcement switch.

> However, again, the SCRAM 
> implementation would already appear to fail that requirement because it 
> uses a custom HMAC implementation, and HMAC is listed in FIPS 140-2 as a 
> covered algorithm.

Ugh.  But is there any available FIPS-approved library code that could be
used instead?

regards, tom lane




Re: vs formatting in the docs

2020-09-24 Thread Peter Eisentraut

On 2020-07-10 17:05, Peter Eisentraut wrote:

On 2020-06-21 18:57, Dagfinn Ilmari Mannsåker wrote:

Attached are two patches: the first adds the missing  tags,
the second adds  to all the SQL commands (specifically anything
with 7).


I have committed the first one.

I have some concerns about the second one.  If you look at the diff of
the source of the man pages before and after, it creates a bit of a
mess, even though the man pages look okay when rendered.  I'll need to
think about this a bit more.


I asked about this on a DocBook discussion list.  While the general 
answer is that you can do anything you want, it was clear that putting 
markup into title elements requires more careful additional 
customization and testing, and it's preferable to handle appearance 
issues on the link source side.  (See also us dialing back the number of 
xreflabels recently.)  This is also the direction that DocBook 5 appears 
to be taking, where you can put linkend attributes into most inline 
tags, so you could maybe do .  This 
doesn't work for us yet, but the equivalent of that would be 
.


So, based on that, I think the patch proposed here is not the right one, 
and we should instead be marking up the link sources appropriately.


(This also implies that the already committed 0001 patch should be 
reverted.)


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Add features to pg_stat_statements

2020-09-24 Thread Katsuragi Yuta

On 2020-09-24 18:55, legrand legrand wrote:

Not limited to 3; like a history table.

Will have to think about whether the data is saved at shutdown.


Thanks.

This design might introduce some additional complexity
and things to be considered.
For example, the maximum size of "history table",
how to evict entries from the history table and how to manage
the information maintained by evicted entries.

Also, providing a history table looks similar to logging.

Providing the original view (# of deallocs and last_dealloc timestamp)
and optional logging looks like a simpler and better way.

Regards,
Katsuragi Yuta




Re: Range checks of pg_test_fsync --secs-per-test and pg_test_timing --duration

2020-09-24 Thread Peter Eisentraut

On 2020-09-24 09:12, Michael Paquier wrote:

On Wed, Sep 23, 2020 at 08:11:59AM +0200, Peter Eisentraut wrote:

This patch mixes up unsigned int and uint32 in random ways.  The variable is
uint32, but the format is %u and the max constant is UINT_MAX.

I think just use unsigned int as the variable type.  There is no need to use
the bit-exact types.  Note that the argument of alarm() is of type unsigned
int.


Makes sense, thanks.


looks good to me
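
For illustration, here is a minimal standalone sketch of that kind of range
check using plain unsigned int (my own sketch, not the committed code; the
error message wording is an assumption):

#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Parse a --secs-per-test / --duration style argument as an unsigned int. */
static unsigned int
parse_secs(const char *arg)
{
	char	   *endptr;
	unsigned long val;

	errno = 0;
	val = strtoul(arg, &endptr, 10);
	if (!isdigit((unsigned char) arg[0]) ||
		errno != 0 || *endptr != '\0' || val == 0 || val > UINT_MAX)
	{
		fprintf(stderr, "number of seconds must be in range 1..%u\n", UINT_MAX);
		exit(1);
	}
	return (unsigned int) val;
}

int
main(int argc, char **argv)
{
	unsigned int secs = parse_secs(argc > 1 ? argv[1] : "5");

	alarm(secs);		/* alarm() takes an unsigned int, matching the variable type */
	printf("running each test for %u seconds\n", secs);
	return 0;
}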

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.

2020-09-24 Thread Ashutosh Sharma
On Fri, Sep 25, 2020 at 8:12 AM Amit Kapila  wrote:
>
> On Thu, Sep 24, 2020 at 6:33 PM Ashutosh Sharma  wrote:
> >
> > Hi Amit,
> >
> > > Here, I think instead of using MySubscription->stream, we should use
> > > server/walrecv version number as we used at one place in tablesync.c.
> >
> > Should subscribers be setting the LR protocol value based on what is
> > the publisher version it is communicating with or should it be set
> > based on whether streaming was enabled or not while creating that
> > subscription? AFAIU if we set this value based on the publisher
> > version (which is lets say >= 14), then it's quite possible that the
> > subscriber will start streaming changes for the in-progress
> > transactions even if the streaming was disabled while creating the
> > subscription, won't it?
> >
>
> No that won't happen because we send this option to the server
> (publisher in this case) only when version is >=14 and user has
> specified this option. See the below check in function
> libpqrcv_startstreaming()
> {
> ..
> if (options->proto.logical.streaming &&
> PQserverVersion(conn->streamConn) >= 14)
> appendStringInfo(&cmd, ", streaming 'on'");
> ..
> }
>

Ah, okay, understood, thanks. So, that means we can use the streaming
protocol version if the server (publisher) supports it, regardless of
whether the client (subscriber) has the streaming option enabled or
not.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com




Re: Feature improvement of tab completion for DEALLOCATE

2020-09-24 Thread btnakamichin

On 2020-09-25 14:30, Fujii Masao wrote:

On 2020/09/25 13:45, btnakamichin wrote:

Hello!

I’d like to improve the deallocate tab completion.

The deallocate tab completion can’t complete “ALL”. Therefore, this 
patch fixes the problem.


Thanks for the patch! It looks basically good, but I think it's better
to add newline just before UNION to avoid long line, as follows.

-   COMPLETE_WITH_QUERY(Query_for_list_of_prepared_statements);
+   COMPLETE_WITH_QUERY(Query_for_list_of_prepared_statements
+   " UNION SELECT 'ALL'");


Regards,


Thank you, I appreciate your comment.

I have attached the patch with the newline.

Regards,

Naoki Nakamichi

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index d877cc86f0..75f81d66dc 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2911,7 +2911,8 @@ psql_completion(const char *text, int start, int end)
 
 /* DEALLOCATE */
 	else if (Matches("DEALLOCATE"))
-		COMPLETE_WITH_QUERY(Query_for_list_of_prepared_statements " UNION SELECT 'ALL'");
+		COMPLETE_WITH_QUERY(Query_for_list_of_prepared_statements 
+		" UNION SELECT 'ALL'");
 
 /* DECLARE */
 	else if (Matches("DECLARE", MatchAny))


Problem of ko.po on branch REL9_5_STABLE

2020-09-24 Thread Yang, Rong
Hi,

When I checked the encoding of the .po files, I noticed that 
src/bin/pg_config/po/ko.po on the REL9_5_STABLE branch seemed a little different.
The ‘Content-Type’ of this file and the file’s encoding differ from 
those under other modules, as follows:
 
src/bin/pg_config/po/ko.po:
   "Content-Type: text/plain; charset=euc-kr\n"
src/bin/initdb/po/ko.po:
   "Content-Type: text/plain; charset=UTF-8\n"
 
    These are even different from the other branches, as follows:

REL9_5_STABLE src/bin/pg_config/po/ko.po:
   "Content-Type: text/plain; charset=euc-kr\n"
REL9_6_STABLE src/bin/pg_config/po/ko.po
   "Content-Type: text/plain; charset=UTF-8\n"
 
Is there any particular reason for doing this? Or are there any 
compatibility concerns?
Thanks in advance.


--
Yang Rong
Development Department II Software Division II Nanjing Fujitsu Nanda Software 
Tech. Co., Ltd.(FNST)
ADDR.: No.6 Wenzhu Road, Software Avenue,
  Nanjing, 210012, China
TEL  : +86+25-86630566-9431
COINS: 7998-9431
FAX  : +86+25-83317685
MAIL: yangr.f...@cn.fujitsu.com
--






Re: Load TIME fields - proposed performance improvement

2020-09-24 Thread Peter Smith
The patch has been re-implemented based on previous advice.

Please see attached.

~

Test:

A test table was created and 20 million rows inserted as follows:

test=# create table t1 (id int, a timestamp, b time without time zone
default '01:02:03', c date default CURRENT_DATE, d time with time zone
default CURRENT_TIME, e time with time zone default LOCALTIME);
CREATE TABLE

$ time psql -d test -c "insert into t1(id, a)
values(generate_series(1,20000000), timestamp 'now');"

~

Observations:

BEFORE PATCH

perf results
6.18% GetSQLCurrentTime
5.73% GetSQLCurrentDate
5.20% GetSQLLocalTime
4.67% GetCurrentDateTime
-.--% GetCurrentTimeUsec

elapsed time
Run1 1m57s
Run2 1m58s
Run3 2m00s

AFTER PATCH

perf results
1.77% GetSQLCurrentTime
0.12% GetSQLCurrentDate
0.50% GetSQLLocalTime
0.36% GetCurrentDateTime
-.--% GetCurrentTimeUsec

elapsed time
Run1 1m36s
Run2 1m36s
Run3 1m36s

(represents 19% improvement for this worst case table/data)

~

Note: I patched the function GetCurrentTimeUsec consistently with the
others, but actually I was not able to discover any SQL syntax which
could cause that function to be invoked multiple times. Perhaps the
patch for that function should be removed?
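
For illustration, a rough standalone sketch of the caching idea measured
above (my own illustration with simplified stand-in types and names, not
the attached patch):

#include <stdbool.h>
#include <time.h>

typedef long long Timestamp;	/* stand-in for PostgreSQL's TimestampTz */

static Timestamp cached_ts;
static struct tm cached_tm;
static bool cache_valid = false;

/*
 * Return broken-down time for ts, reusing the previous conversion when the
 * (transaction-start) timestamp has not changed since the last call.
 */
static void
get_current_datetime(Timestamp ts, struct tm *out)
{
	if (!cache_valid || ts != cached_ts)
	{
		time_t		t = (time_t) ts;

		gmtime_r(&t, &cached_tm);	/* the expensive conversion, done once */
		cached_ts = ts;
		cache_valid = true;
	}
	*out = cached_tm;
}

int
main(void)
{
	struct tm	tm1, tm2;
	Timestamp	now = (Timestamp) time(NULL);

	get_current_datetime(now, &tm1);
	get_current_datetime(now, &tm2);	/* second call hits the cache */
	return (tm1.tm_min == tm2.tm_min) ? 0 : 1;
}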

---

Kind Regards,
Peter Smith
Fujitsu Australia

On Tue, Sep 22, 2020 at 1:06 PM Peter Smith  wrote:
>
> Hi Tom.
>
> Thanks for your feedback.
>
> On Tue, Sep 22, 2020 at 12:44 PM Tom Lane  wrote:
>
> > Still, for the size of the patch I'm envisioning, it'd be well
> > worth the trouble.
>
> The OP patch I gave was just a POC to test the effect and to see if
> the idea was judged as worthwhile...
>
> I will rewrite/fix it based on your suggestions.
>
> Kind Regards,
> Peter Smith.
> Fujitsu Australia.


PS_cache_pg_tm-v01.patch
Description: Binary data


Re: Feature improvement for FETCH tab completion

2020-09-24 Thread Fujii Masao




On 2020/09/25 14:24, btnakamichin wrote:

Hello!

I’d like to improve the FETCH tab completion.

The FETCH tab completion is missing several keywords. Therefore, this patch 
fixes the problem.

Previously the code completed FETCH with one of FORWARD, BACKWARD, RELATIVE, 
ABSOLUTE; now it completes with one of FORWARD, BACKWARD, RELATIVE, ABSOLUTE, 
ALL, NEXT, PRIOR, FIRST, LAST, and also handles the subsequent IN and FROM 
clauses.


Thanks for the patch! Here are review comments.

+   /* Complete FETCH BACKWARD or FORWARD with one of ALL */
+   else if (Matches("FETCH|MOVE", "BACKWARD|FORWARD"))
+   COMPLETE_WITH("ALL");

Not only "ALL" but also "FROM" and "IN" should be displayed here
because they also can follow "BACKWARD" and "FORWARD"?

else if (Matches("FETCH|MOVE", MatchAny, MatchAny))
+   else if (Matches("FETCH|MOVE", "ABSOLUTE|BACKWARD|FORWARD|RELATIVE", 
MatchAny))
+   COMPLETE_WITH("FROM", "IN");

This change seems to cause "FETCH FORWARD FROM " to display "FROM"
and "IN". To avoid this confusing tab-completion, we should use something like
MatchAnyExcept("FROM|IN") here, instead?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: [patch] Concurrent table reindex per-index progress reporting

2020-09-24 Thread Michael Paquier
On Thu, Sep 24, 2020 at 09:19:18PM +0200, Matthias van de Meent wrote:
> While working on a PG12-instance I noticed that the progress reporting of
> concurrent index creation for non-index relations fails to update the
> index/relation OIDs that it's currently working on in the
> pg_stat_progress_create_index view after creating the indexes. Please find
> attached a patch which properly sets these values at the appropriate places.
> 
> Any thoughts?

I agree that this is a bug and that we had better correctly report the
heap and index IDs being worked on, as these could also belong to a toast
relation of the parent table being reindexed.  However, your patch is
visibly incorrect in the two places you are updating.

> +  * Configure progress reporting to report for this index.
> +  * While we're at it, reset the phase as it is cleared by 
> start_command.
> +  */
> + pgstat_progress_start_command(PROGRESS_COMMAND_CREATE_INDEX, 
> heapId);
> + pgstat_progress_update_param(PROGRESS_CREATEIDX_INDEX_OID, 
> newIndexId);
> + pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
> +  
> PROGRESS_CREATEIDX_PHASE_WAIT_1);

First, updating PROGRESS_CREATEIDX_PHASE to WAIT_1 here makes no
sense, because this is not a wait phase, and index_build(), called
within index_concurrently_build(), will set this state correctly a bit
later, so IMO it is fine to leave it unset here; the build phase
is by far the bulk of the operation that needs tracking.  I think that
you are also missing updates of PROGRESS_CREATEIDX_COMMAND to
PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY, and of
PROGRESS_CREATEIDX_ACCESS_METHOD_OID to the correct index AM OID,
similarly to reindex_index().

> + /*
> +  * Configure progress reporting to report for this index.
> +  * While we're at it, reset the phase as it is cleared by 
> start_command.
> +  */
> + pgstat_progress_start_command(PROGRESS_COMMAND_CREATE_INDEX, 
> heapId);
> + pgstat_progress_update_param(PROGRESS_CREATEIDX_INDEX_OID, 
> newIndexId);
> + pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
> +  
> PROGRESS_CREATEIDX_PHASE_WAIT_2);
> +
>   validate_index(heapId, newIndexId, snapshot);

Basically the same three comments here: setting PROGRESS_CREATEIDX_PHASE
to WAIT_2 makes no real sense, and validate_index() will update the
phase as it should.  This should set PROGRESS_CREATEIDX_COMMAND to
PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY, and
PROGRESS_CREATEIDX_ACCESS_METHOD_OID to the correct index AM OID.

While reviewing this code, I also noticed that the wait state set just
before dropping the indexes was wrong.  The code was using WAIT_4, but
this has to be WAIT_5.

And this gives me the attached.  This also means that we still leave the
table and index OIDs set to the last index worked on for WAIT_2, WAIT_4
and WAIT_5, but we cannot start the command with the relationOid given
as input of ReindexRelationConcurrently() either, as this could be a
single index given for REINDEX INDEX CONCURRENTLY.  Matthias, does
that look right to you?
--
Michael
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index f1b5f87e6a..d43051aea9 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -3409,6 +3409,7 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 		Oid			oldIndexId = lfirst_oid(lc);
 		Oid			newIndexId = lfirst_oid(lc2);
 		Oid			heapId;
+		Oid			indexam;
 
 		/* Start new transaction for this index's concurrent build */
 		StartTransactionCommand();
@@ -3429,8 +3430,21 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 		 */
 		indexRel = index_open(oldIndexId, ShareUpdateExclusiveLock);
 		heapId = indexRel->rd_index->indrelid;
+		/* The access method of the old and new indexes match */
+		indexam = indexRel->rd_rel->relam;
 		index_close(indexRel, NoLock);
 
+		/*
+		 * Update progress for the index to build, with the correct parent
+		 * table involved.
+		 */
+		pgstat_progress_start_command(PROGRESS_COMMAND_CREATE_INDEX, heapId);
+		pgstat_progress_update_param(PROGRESS_CREATEIDX_COMMAND,
+	 PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY);
+		pgstat_progress_update_param(PROGRESS_CREATEIDX_INDEX_OID, newIndexId);
+		pgstat_progress_update_param(PROGRESS_CREATEIDX_ACCESS_METHOD_OID,
+	 indexam);
+
 		/* Perform concurrent build of new index */
 		index_concurrently_build(heapId, newIndexId);
 
@@ -3458,6 +3472,8 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 		Oid			heapId;
 		TransactionId limitXmin;
 		Snapshot	snapshot;
+		Relation	newIndexRel;
+		Oid			indexam;
 
 		StartTransactionCommand();
 
@@ -3468,8 +3484,6 @@ ReindexRelationConcurrently(Oid relationOid, in

Re: [patch] Fix checksum verification in base backups for zero page headers

2020-09-24 Thread Michael Banck
Hi,

Am Donnerstag, den 24.09.2020, 17:24 +0300 schrieb Anastasia
Lubennikova:
> So I mark this one as ReadyForCommitter.

Thanks!

> The only minor problem is a typo (?) in the proposed commit message.
> "If a page is all zero, consider that a checksum failure." It should be 
> "If a page is NOT all zero...".

Oh right, I've fixed up the commit message in the attached V4.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
From af83f4a42403e8a994e101086affafa86d67b52a Mon Sep 17 00:00:00 2001
From: Michael Banck 
Date: Tue, 22 Sep 2020 16:09:36 +0200
Subject: [PATCH] Fix checksum verification in base backups for zero page
 headers

We currently do not checksum a page if it is considered new by PageIsNew().
However, this means that if the whole page header is zero, PageIsNew() will
consider that page new due to the pd_upper field being zero and no subsequent
checksum verification is done.

To fix, factor out the all-zeroes check from PageIsVerified() into a new method
PageIsZero() and call that for pages considered new by PageIsNew(). If a page
is not totally empty, consider that a checksum failure.

Add one test to the pg_basebackup TAP tests to check this error.

Reported-By: Andres Freund
Reviewed-By: Asif Rehman, Anastasia Lubennikova
Discussion: https://postgr.es/m/20190319200050.ncuxejradurja...@alap3.anarazel.de
---
 src/backend/replication/basebackup.c | 30 +
 src/backend/storage/page/bufpage.c   | 35 +++-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl | 18 +-
 src/include/storage/bufpage.h| 11 +++---
 4 files changed, 73 insertions(+), 21 deletions(-)

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index b89df01fa7..c82765a429 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1746,6 +1746,36 @@ sendFile(const char *readfilename, const char *tarfilename,
 			"be reported", readfilename)));
 	}
 }
+else
+{
+	/*
+	 * We skip completely new pages after checking they are
+	 * all-zero, since they don't have a checksum yet.
+	 */
+	if (PageIsNew(page) && !PageIsZero(page))
+	{
+		/*
+		 * pd_upper is zero, but the page is not all zero.  We
+		 * cannot run pg_checksum_page() on the page as it
+		 * would throw an assertion failure.  Consider this a
+		 * checksum failure.
+		 */
+		checksum_failures++;
+
+		if (checksum_failures <= 5)
+			ereport(WARNING,
+	(errmsg("checksum verification failed in "
+			"file \"%s\", block %d: pd_upper "
+			"is zero but page is not all-zero",
+			readfilename, blkno)));
+		if (checksum_failures == 5)
+			ereport(WARNING,
+	(errmsg("further checksum verification "
+			"failures in file \"%s\" will not "
+			"be reported", readfilename)));
+	}
+}
+
 block_retry = false;
 blkno++;
 			}
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 4bc2bf955d..8be3cd6190 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -82,11 +82,8 @@ bool
 PageIsVerified(Page page, BlockNumber blkno)
 {
 	PageHeader	p = (PageHeader) page;
-	size_t	   *pagebytes;
-	int			i;
 	bool		checksum_failure = false;
 	bool		header_sane = false;
-	bool		all_zeroes = false;
 	uint16		checksum = 0;
 
 	/*
@@ -120,18 +117,7 @@ PageIsVerified(Page page, BlockNumber blkno)
 	}
 
 	/* Check all-zeroes case */
-	all_zeroes = true;
-	pagebytes = (size_t *) page;
-	for (i = 0; i < (BLCKSZ / sizeof(size_t)); i++)
-	{
-		if (pagebytes[i] != 0)
-		{
-			all_zeroes = false;
-			break;
-		}
-	}
-
-	if (all_zeroes)
+	if (PageIsZero(page))
 		return true;
 
 	/*
@@ -154,6 +140,25 @@ PageIsVerified(Page page, BlockNumber blkno)
 	return false;
 }
 
+/*
+ * PageIsZero
+ *		Check that the page consists only of zero bytes.
+ *
+ */
+bool
+PageIsZero(Page page)
+{
+	int			i;
+	size_t	   *pagebytes = (size_t *) page;
+
+	for (i = 0; i < (BLCKSZ / sizeof(size_t)); i++)
+	{
+		if (pagebytes[i] != 0)
+			return false;
+	}
+
+	return true;
+}
 
 /*
  *	PageAddItemExtended
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index f674a7c94e..f5735569c5 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -6,7 +6,7 @@ use File::Basename qw(basename dirname);
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 109;
+use Test::Mo

Re: scram-sha-256 broken with FIPS and OpenSSL 1.0.2

2020-09-24 Thread Michael Paquier
On Fri, Sep 25, 2020 at 01:36:44AM -0400, Tom Lane wrote:
> Peter Eisentraut  writes:
>> However, again, the SCRAM 
>> implementation would already appear to fail that requirement because it 
>> uses a custom HMAC implementation, and HMAC is listed in FIPS 140-2 as a 
>> covered algorithm.
> 
> Ugh.  But is there any available FIPS-approved library code that could be
> used instead?

That's a good point, and I think that this comes down to using OpenSSL's
HMAC_* interface for this job when building with OpenSSL:
https://www.openssl.org/docs/man1.1.1/man3/HMAC.html

Worth noting that these have been deprecated in 3.0.0 as per the
rather-recent commit dbde472, where they recommend the use of
EVP_MAC_*() instead.
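
For illustration, a minimal standalone use of that one-shot HMAC() interface
(my own sketch, not a proposed patch; the key and message values are made up):

#include <stdio.h>
#include <openssl/evp.h>
#include <openssl/hmac.h>
#include <openssl/sha.h>

int
main(void)
{
	const unsigned char key[] = "some key material";
	const unsigned char msg[] = "Client Key";	/* arbitrary example input */
	unsigned char	mac[SHA256_DIGEST_LENGTH];
	unsigned int	maclen = 0;

	/* One-shot HMAC-SHA-256: digest, key, key length, data, data length */
	if (HMAC(EVP_sha256(), key, (int) (sizeof(key) - 1),
			 msg, sizeof(msg) - 1, mac, &maclen) == NULL)
	{
		fprintf(stderr, "HMAC failed\n");
		return 1;
	}

	for (unsigned int i = 0; i < maclen; i++)
		printf("%02x", mac[i]);
	printf("\n");
	return 0;
}

(Compile with something like: cc hmac_example.c -lcrypto)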
--
Michael


signature.asc
Description: PGP signature