Re: [patch] some PQExpBuffer are not destroyed in pg_dump

2020-04-15 Thread Michael Paquier
On Tue, Apr 14, 2020 at 10:11:56AM +0900, Michael Paquier wrote:
> Indeed.  Any code path of pg_dump calling buildACLQueries() cleans things
> up, and I think that it is better practice to clean up PQExpBuffer data
> properly even if there is always the argument that pg_dump is a tool
> running in a "short"-term context.  So I will backpatch that
> unless there are any objections from others.

And done as of 8f4ee44.
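
For anyone following along, the pattern in question is simply pairing each
createPQExpBuffer() with a destroyPQExpBuffer().  A minimal, illustrative
sketch (hypothetical helper, not the committed change):

#include "libpq-fe.h"
#include "pqexpbuffer.h"

/* Illustrative only: build a query string and free it when done. */
static void
run_some_query(PGconn *conn)
{
    PQExpBuffer query = createPQExpBuffer();

    appendPQExpBufferStr(query, "SELECT relacl FROM pg_class WHERE relname = 'foo'");

    /* ... execute query->data with PQexec() and use the result ... */

    destroyPQExpBuffer(query);  /* frees the struct and its internal buffer */
}
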
--
Michael




Allow pg_read_all_stats to read pg_stat_progress_*

2020-04-15 Thread Andrey M. Borodin
Hi!

One of our users asked me why they cannot read details of 
pg_stat_progress_vacuum while they have the pg_read_all_stats role.
Maybe I'm missing something, but I think they should be able to read these stats...

PFA fix.
This affects pg_stat_progress_analyze, pg_stat_progress_basebackup, 
pg_stat_progress_cluster, pg_stat_progress_create_index and 
pg_stat_progress_vacuum.

With patch
postgres=# set role pg_read_all_stats ;
postgres=> select * from pg_stat_progress_vacuum ;
  pid  | datid | datname  | relid |     phase     | heap_blks_total | heap_blks_scanned | heap_blks_vacuumed | index_vacuum_count | max_dead_tuples | num_dead_tuples
-------+-------+----------+-------+---------------+-----------------+-------------------+--------------------+--------------------+-----------------+-----------------
 76331 | 12923 | postgres |  1247 | scanning heap |              10 |                 1 |                  0 |                  0 |            2910 |               0
(1 row)

Without patch
postgres=# set role pg_read_all_stats  ;
SET
postgres=> select * from pg_stat_progress_vacuum ;
  pid  | datid | datname  | relid | phase | heap_blks_total | heap_blks_scanned | heap_blks_vacuumed | index_vacuum_count | max_dead_tuples | num_dead_tuples
-------+-------+----------+-------+-------+-----------------+-------------------+--------------------+--------------------+-----------------+-----------------
 76331 | 12923 | postgres |       |       |                 |                   |                    |                    |                 |
(1 row)

Thanks!

Best regards, Andrey Borodin.



allow_read_all_stats.diff
Description: Binary data


Parallel Append can break run-time partition pruning

2020-04-15 Thread David Rowley
I had a report from the wilds that run-time partition pruning was not
working in certain cases.

After some investigation and obtaining a mockup of the actual case,
I discovered that the problem was down to accumulate_append_subpath()
hitting the case where it does not pull up a Parallel Append whose
first partial subpath is not at index 0.

What's actually happening is that the plan is left with a nested
Append, and in this particular case the top-level Append has only a
single subpath, so the code from 8edd0e794 (Suppress Append and
MergeAppend plan nodes that have a single child) causes the nested
Append to be pulled up to become the main Append.  This breaks
run-time pruning, since we only attach the pruning information to the
top-level Append.

The most simplified test case I can find to demonstrate this issue is:

create table list (a int, b int) partition by list(a);
create table list_12 partition of list for values in(1,2) partition by list(a);
create table list_12_1 partition of list_12 for values in(1);
create table list_12_2 partition of list_12 for values in(2);

insert into list select 2,0 from generate_Series(1,100) x;
vacuum analyze list;

explain (analyze on, costs off, timing off, summary off)
select * from list where a = (select 1) and b > 0;

-- force the 2nd subnode of the Append to be non-parallel.
alter table list_12_1 set (parallel_workers=0);

explain (analyze on, costs off, timing off, summary off)
select * from list where a = (select 1) and b > 0;


The results of this in master are:

postgres=# explain (analyze on, costs off, timing off, summary off)
select * from list where a = (select 1) and b > 0;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Gather (actual rows=0 loops=1)
   Workers Planned: 2
   Params Evaluated: $0
   Workers Launched: 2
   InitPlan 1 (returns $0)
     ->  Result (actual rows=1 loops=1)
   ->  Parallel Append (actual rows=0 loops=3)
         ->  Parallel Seq Scan on list_12_2 list_2 (never executed)
               Filter: ((b > 0) AND (a = $0))
         ->  Parallel Seq Scan on list_12_1 list_1 (actual rows=0 loops=1)
               Filter: ((b > 0) AND (a = $0))
(11 rows)


postgres=# alter table list_12_1 set (parallel_workers=0);
ALTER TABLE
postgres=# explain (analyze on, costs off, timing off, summary off)
select * from list where a = (select 1) and b > 0;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Gather (actual rows=0 loops=1)
   Workers Planned: 2
   Params Evaluated: $0
   Workers Launched: 2
   InitPlan 1 (returns $0)
     ->  Result (actual rows=1 loops=1)
   ->  Parallel Append (actual rows=0 loops=3)
         ->  Seq Scan on list_12_1 list_1 (actual rows=0 loops=1)
               Filter: ((b > 0) AND (a = $0))
         ->  Parallel Seq Scan on list_12_2 list_2 (actual rows=0 loops=3)
               Filter: ((b > 0) AND (a = $0))
               Rows Removed by Filter: 33
(12 rows)

Notice that we don't get "(never executed)" for list_12_2 in the 2nd case.

I'm a bit divided on what the correct fix is.  One view is that Parallel
Append is to blame for not trying hard enough to pull up the lower
Append in accumulate_append_subpath().  However, perhaps run-time
pruning should instead be attaching PartitionPruneInfo to more than
just top-level Appends.  Fixing the latter case is, code-wise, about as
simple as removing the "rel->reloptkind == RELOPT_BASEREL &&" line from
create_append_plan().  Certainly, if the outer Append hadn't had just a
single subpath, we wouldn't have pulled up the lower-level Append, so
perhaps we should be doing run-time pruning on lower-level Appends too.
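
For reference, the guard in create_append_plan() is roughly of the following
shape (paraphrased from memory rather than quoted, so the details may differ
from the actual source):

    /*
     * Paraphrased sketch, not the exact source: pruning info is only built
     * for Appends on a baserel, i.e. top-level Appends.  Removing the
     * RELOPT_BASEREL test would let nested Appends carry pruning info too.
     */
    if (enable_partition_pruning &&
        rel->reloptkind == RELOPT_BASEREL &&    /* the line suggested for removal */
        best_path->partitioned_rels != NIL)
    {
        /* build a PartitionPruneInfo and attach it to the Append plan node */
    }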

What do other people think?

(copying in Robert and Amit K due to their work on Parallel Append,
Tom as I seem to remember him complaining about
accumulate_append_subpath() at some point and Amit L because...
partitioning...)

David




Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-15 Thread Kyotaro Horiguchi
At Wed, 15 Apr 2020 11:35:58 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> I'm looking into this more closely.

It looks to be in the right direction to me.

As mentioned in the previous mail, I removed is_sync_standby from
SyncRepStandbyData. But just doing that breaks pg_stat_get_wal_senders.
That is an existing issue, but the logic for sync_state (values[10]) looks
odd. Fixed in the attached.

SyncRepInitConfig now uses the per-walsender mutex instead of SyncRepLock.
Since the integrity of sync_standby_priority is not guaranteed anyway, that
seems OK to me. It seems fine to remove the assertion and the requirement
to hold SyncRepLock from SyncRepGetSyncRecPtr for the same reason.
(Actually the lock is still held there, though.)

SyncRepGetSyncStandbysPriority doesn't seem worth keeping as a separate
function. It is removed in the attached.

+   num_standbys = SyncRepGetSyncStandbys(&sync_standbys);

The list no longer consists only of synchronous standbys.  I changed the
function name and variable name, and tried to adjust the related
comments.

It's not something the patch did, but I don't understand why
SyncRepGetNthLatestSyncRecPtr takes SyncRepConfig->num_sync as a parameter
while SyncRepGetOldest.. accesses it directly.  I changed the *Oldest*
function in the attached.  I didn't go further, but ultimately the two
functions could be consolidated just by moving the selection logic
currently in SyncRepGetSyncRecPtr into the new function.


The resulting patch is attached.

- Removed is_sync_standby from SyncRepStandbyData
- Fixed the logic for values[10] in pg_stat_get_wal_senders
- Changed the signature of SyncRepGetOldestSyncRecPtr
- Adjusted some comments to the behavioral change of
  SyncRepGet(Sync)Standbys.

regards.


-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index ffd5b31eb2..e8a47f98b3 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -108,14 +108,17 @@ static bool SyncRepGetSyncRecPtr(XLogRecPtr *writePtr,
 static void SyncRepGetOldestSyncRecPtr(XLogRecPtr *writePtr,
 	   XLogRecPtr *flushPtr,
 	   XLogRecPtr *applyPtr,
-	   List *sync_standbys);
+	   SyncRepStandbyData *sync_standbys,
+	   int num_standbys,
+	   uint8 nsyncs);
 static void SyncRepGetNthLatestSyncRecPtr(XLogRecPtr *writePtr,
 		  XLogRecPtr *flushPtr,
 		  XLogRecPtr *applyPtr,
-		  List *sync_standbys, uint8 nth);
+		  SyncRepStandbyData *sync_standbys,
+		  int num_standbys,
+		  uint8 nth);
 static int	SyncRepGetStandbyPriority(void);
-static List *SyncRepGetSyncStandbysPriority(bool *am_sync);
-static List *SyncRepGetSyncStandbysQuorum(bool *am_sync);
+static int	standby_priority_comparator(const void *a, const void *b);
 static int	cmp_lsn(const void *a, const void *b);
 
 #ifdef USE_ASSERT_CHECKING
@@ -406,9 +409,10 @@ SyncRepInitConfig(void)
 	priority = SyncRepGetStandbyPriority();
 	if (MyWalSnd->sync_standby_priority != priority)
 	{
-		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+		SpinLockAcquire(&MyWalSnd->mutex);
 		MyWalSnd->sync_standby_priority = priority;
-		LWLockRelease(SyncRepLock);
+		SpinLockRelease(&MyWalSnd->mutex);
+
 		ereport(DEBUG1,
 (errmsg("standby \"%s\" now has synchronous standby priority %u",
 		application_name, priority)));
@@ -523,8 +527,6 @@ SyncRepReleaseWaiters(void)
 /*
  * Calculate the synced Write, Flush and Apply positions among sync standbys.
  *
- * The caller must hold SyncRepLock.
- *
  * Return false if the number of sync standbys is less than
  * synchronous_standby_names specifies. Otherwise return true and
  * store the positions into *writePtr, *flushPtr and *applyPtr.
@@ -536,27 +538,41 @@ static bool
 SyncRepGetSyncRecPtr(XLogRecPtr *writePtr, XLogRecPtr *flushPtr,
 	 XLogRecPtr *applyPtr, bool *am_sync)
 {
-	List	   *sync_standbys;
-
-	Assert(LWLockHeldByMe(SyncRepLock));
+	SyncRepStandbyData *standbys;
+	int			num_standbys;
+	int			i;
 
+	/* Initialize default results */
 	*writePtr = InvalidXLogRecPtr;
 	*flushPtr = InvalidXLogRecPtr;
 	*applyPtr = InvalidXLogRecPtr;
 	*am_sync = false;
 
-	/* Get standbys that are considered as synchronous at this moment */
-	sync_standbys = SyncRepGetSyncStandbys(am_sync);
+	/* Quick out if not even configured to be synchronous */
+	if (SyncRepConfig == NULL)
+		return false;
+
+	/* Get standbys.  They are in priority order if we are in priority mode */
+	num_standbys = SyncRepGetStandbys(&standbys);
+
+	/* Am I among the candidate sync standbys? */
+	for (i = 0; i < num_standbys; i++)
+	{
+		if (standbys[i].is_me)
+		{
+			*am_sync = (i < SyncRepConfig->num_sync);
+			break;
+		}
+	}
 
 	/*
-	 * Quick exit if we are not managing a sync standby or there are not
-	 * enough synchronous standbys.
+	 * Nothing more to do if we are not managing a sync standby or there are
+	 * not enough synchronous standbys.
 	 */
 	if (!(*am_sync) ||
-		SyncRepConfig == NULL ||
-		list_length(sync_standbys) < Sy

Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-15 Thread Kyotaro Horiguchi
At Wed, 15 Apr 2020 13:01:02 +0900, Masahiko Sawada 
 wrote in 
> On Tue, 14 Apr 2020 at 18:35, Kyotaro Horiguchi  
> wrote:
> >
> > At Tue, 14 Apr 2020 13:06:14 +0900, Masahiko Sawada 
> >  wrote in
> > > On Tue, 14 Apr 2020 at 10:34, Tom Lane  wrote:
> > > >
> > > > Kyotaro Horiguchi  writes:
> > > > > At Sat, 11 Apr 2020 18:30:30 -0400, Tom Lane  
> > > > > wrote in
> > > > >> What I think we should do about this is, essentially, to get rid of
> > > > >> SyncRepGetSyncStandbys.  Instead, let's have each walsender advertise
> > > > >> whether *it* believes that it is a sync standby, based on its last
> > > > >> evaluation of the relevant GUCs.  This would be a bool that it'd
> > > > >> compute and set alongside sync_standby_priority.  (Hm, maybe we'd not
> > > >
> > > > > Mmm..  SyncRepGetStandbyPriority returns the "priority" that a
> > > > > walsender thinks it is at, among synchronous_standby_names.  Then to
> > > > > decide "I am a sync standby" we need to know how many walsenders with
> > > > > higher priority are alive now.  SyncRepGetSyncStandbyPriority does the
> > > > > judgment now and suffers from the inconsistency of priority values.
> > > >
> > > > Yeah.  After looking a bit closer, I think that the current definition
> > > > of sync_standby_priority (that is, as the result of local evaluation
> > > > of SyncRepGetStandbyPriority()) is OK.  The problem is what we're doing
> > > > with it.  I suggest that what we should do in SyncRepGetSyncRecPtr()
> > > > is make one sweep across the WalSnd array, collecting PID,
> > > > sync_standby_priority, *and* the WAL pointers from each valid entry.
> > > > Then examine that data and decide which WAL value we need, without 
> > > > assuming
> > > > that the sync_standby_priority values are necessarily totally 
> > > > consistent.
> > > > But in any case we must examine each entry just once while holding its
> > > > mutex, not go back to it later expecting it to still be the same.
> >
> > SyncRepGetSyncStandbysPriority() runs while holding SyncRepLock, so
> > sync_standby_priority of any walsender cannot be changed while the
> > function is scanning walsenders. The issue is that we already have
> > inconsistent walsender information before we enter the function.  Thus
> > how many times we scan the array doesn't make any difference.
> >
> > I think we need to do one of the following:
> >
> >  A) prevent SyncRepGetSyncStandbysPriority from being entered while
> > walsender priority is inconsistent.
> >
> >  B) make SyncRepGetSyncStandbysPriority tolerant of priority
> > inconsistency.
> >
> >  C) protect the walsender priority array from being inconsistent.
> >
> > (B) is a band-aid. To achieve (A) we need a central controller
> > of priority config handling.  (C) is:
> >
> > > Can we have a similar approach of sync_standby_defined for
> > > sync_standby_priority? That is, checkpointer is responsible for
> > > changing sync_standby_priority of all walsenders when SIGHUP. That
> > > way, all walsenders can see a consistent view of
> > > sync_standby_priority. And when a walsender starts, it sets
> > > sync_standby_priority by itself. The logic to decide who's a sync
> > > standby doesn't change. SyncRepGetSyncRecPtr() gets all walsenders
> > > having higher priority along with their WAL positions.
> >
> > Yeah, it works if we do that, but the problem with that approach is that
> > to determine the priority of walsenders, we need to know which walsenders
> > are running. That is, when a new walsender arrives (or leaves), the
> > process needs to be aware of it right away and reassign the priority of
> > every walsender again.
> 
> I think we don't need to reassign the priority when a new walsender
> comes or leaves. IIUC the priority is calculated based only on
> synchronous_standby_names. A walsender coming or leaving doesn't
> affect others' priorities.

Sorry, the "priority" in this area is a bit confusing.  The "priority"
defined by synchronous_standby_names is determined in isolation from
the presence of walsenders.  The "priority" in
walsnd->sync_standby_priority needs walsender presence to determine.
I thought of the latter in the discussion.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Parallel copy

2020-04-15 Thread Ants Aasma
On Tue, 14 Apr 2020 at 22:40, Kuntal Ghosh  wrote:
> 1. Each worker scans a distinct fixed sized chunk of the CSV file and
> collects the following three stats from the chunk:
> a) number of quotes
> b) position of the first new line after even number of quotes
> c) position of the first new line after odd number of quotes
> 2. Once stats from all the chunks are collected, the leader identifies
> the adjusted chunk boundaries by iterating over the stats linearly:
> - For the k-th chunk, the leader adds the number of quotes in k-1 chunks.
> - If the number is even, then the k-th chunk does not start in the
> middle of a quoted field, and the first newline after an even number
> of quotes (the second collected information) is the first record
> delimiter in this chunk.
> - Otherwise, if the number is odd, the first newline after an odd
> number of quotes (the third collected information) is the first record
> delimiter.
> - The end position of the adjusted chunk is obtained based on the
> starting position of the next adjusted chunk.
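
A rough sketch of that leader pass, just to make the quoted scheme concrete
(hypothetical types and names, not from any posted patch):

typedef struct ChunkStats
{
    long    nquotes;        /* quotes counted in this chunk */
    long    first_nl_even;  /* first newline after an even number of quotes, or -1 */
    long    first_nl_odd;   /* first newline after an odd number of quotes, or -1 */
    long    start;          /* adjusted start offset, filled in by the leader */
} ChunkStats;

static void
adjust_chunk_boundaries(ChunkStats *chunks, int nchunks, long chunk_size)
{
    long    quotes_before = 0;

    for (int k = 0; k < nchunks; k++)
    {
        long    base = (long) k * chunk_size;

        if (k == 0)
            chunks[k].start = 0;
        else if (quotes_before % 2 == 0)
            chunks[k].start = base + chunks[k].first_nl_even + 1;   /* not inside a quote */
        else
            chunks[k].start = base + chunks[k].first_nl_odd + 1;    /* inside a quote */

        quotes_before += chunks[k].nquotes;
    }
    /* the adjusted end of chunk k is chunks[k + 1].start (or EOF for the last chunk) */
}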

The trouble is that, at least with current coding, the number of
quotes in a chunk can depend on whether the chunk started in a quote
or not. That's because escape characters only count inside quotes. See
for example the following csv:

foo,\"bar
baz",\"xyz"

This currently parses as one line and the number of parsed quotes
doesn't change if you add a quote in front.

But the general approach of doing the tokenization in parallel and
then a serial pass over the tokenization would still work. The quote
counting and newline finding just have to be done for both the
starting-in-quote and the not-starting-in-quote cases.
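
A minimal sketch of such a per-chunk scan, computing the stats under both
starting-state assumptions (illustrative only, assuming backslash escaping as
in the example above):

#include <stdbool.h>

typedef struct ScanResult
{
    long    nquotes;        /* quotes seen under this starting assumption */
    long    first_newline;  /* first newline outside quotes, or -1 */
} ScanResult;

/* out[0]: assuming the chunk starts outside a quote; out[1]: inside one */
static void
scan_chunk(const char *buf, long len, ScanResult out[2])
{
    for (int assume_in_quote = 0; assume_in_quote < 2; assume_in_quote++)
    {
        bool    in_quote = (assume_in_quote == 1);
        long    nquotes = 0;
        long    first_nl = -1;

        for (long i = 0; i < len; i++)
        {
            char    c = buf[i];

            if (in_quote && c == '\\')
                i++;                /* escapes only matter inside quotes */
            else if (c == '"')
            {
                in_quote = !in_quote;
                nquotes++;
            }
            else if (c == '\n' && !in_quote && first_nl < 0)
                first_nl = i;
        }
        out[assume_in_quote].nquotes = nquotes;
        out[assume_in_quote].first_newline = first_nl;
    }
}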

Using phases doesn't look like the correct approach - the tokenization
can be prepared just in time for the serial pass and processing the
chunk can proceed immediately after. This could all be done by having
the data in a single ringbuffer with a processing pipeline where one
process does the reading, then workers grab tokenization chunks as
they become available, then one process handles determining the chunk
boundaries, after which the chunks are processed.

But I still don't think this is something to worry about for the first
version. Just a better line splitting algorithm should go a long way
in feeding a large number of workers, even when inserting into an
unindexed, unlogged table. If we get the SIMD line splitting in, it
will be enough to overwhelm most I/O subsystems available today.

Regards,
Ants Aasma




Re: WIP/PoC for parallel backup

2020-04-15 Thread Ahsan Hadi
On Wed, 15 Apr 2020 at 1:49 AM, Robert Haas  wrote:

> On Tue, Apr 14, 2020 at 10:37 AM Asif Rehman 
> wrote:
> > I forgot to make a check for no-manifest. Fixed. Attached is the updated
> patch.
>
> +typedef struct
> +{
> ...
> +} BackupFile;
> +
> +typedef struct
> +{
> ...
> +} BackupState;
>
> These structures need comments.
>
> +list_wal_files_opt_list:
> +   SCONST SCONST
> {
> - $$ = makeDefElem("manifest_checksums",
> -
> (Node *)makeString($2), -1);
> +   $$ = list_make2(
> +   makeDefElem("start_wal_location",
> +   (Node *)makeString($2),
> -1),
> +   makeDefElem("end_wal_location",
> +   (Node *)makeString($2),
> -1));
> +
> }
>
> This seems like an unnecessarily complicated parse representation. The
> DefElems seem to be completely unnecessary here.
>
> @@ -998,7 +1110,37 @@ SendBaseBackup(BaseBackupCmd *cmd)
> set_ps_display(activitymsg);
> }
>
> -   perform_base_backup(&opt);
> +   switch (cmd->cmdtag)
>
> So the design here is that SendBaseBackup() is now going to do a bunch
> of things that are NOT sending a base backup? With no updates to the
> comments of that function and no change to the process title it sets?
>
> -   return (manifest->buffile != NULL);
> +   return (manifest && manifest->buffile != NULL);
>
> Heck no. It appears that you didn't even bother reading the function
> header comment.
>
> + * Send a single resultset containing XLogRecPtr record (in text format)
> + * TimelineID and backup label.
>   */
>  static void
> -SendXlogRecPtrResult(XLogRecPtr ptr, TimeLineID tli)
> +SendXlogRecPtrResult(XLogRecPtr ptr, TimeLineID tli,
> +StringInfo label, char *backupid)
>
> This just casually breaks wire protocol compatibility, which seems
> completely unacceptable.
>
> +   if (strlen(opt->tablespace) > 0)
> +   sendTablespace(opt->tablespace, NULL, true, NULL, &files);
> +   else
> +   sendDir(".", 1, true, NIL, true, NULL, NULL, &files);
> +
> +   SendFilesHeader(files);
>
> So I guess the idea here is that we buffer the entire list of files in
> memory, regardless of size, and then we send it out afterwards. That
> doesn't seem like a good idea. The list of files might be very large.
> We probably need some code refactoring here rather than just piling
> more and more different responsibilities onto sendTablespace() and
> sendDir().
>
> +   if (state->parallel_mode)
> +   SpinLockAcquire(&state->lock);
> +
> +   state->throttling_counter += increment;
> +
> +   if (state->parallel_mode)
> +   SpinLockRelease(&state->lock);
>
> I don't like this much. It seems to me that we would do better to use
> atomics here all the time, instead of conditional spinlocks.
>
> +static void
> +send_file(basebackup_options *opt, char *file, bool missing_ok)
> ...
> +   if (file == NULL)
> +   return;
>
> That seems totally inappropriate.
>
> +   sendFile(file, file + basepathlen, &statbuf,
> true, InvalidOid, NULL, NULL);
>
> Maybe I'm misunderstanding, but this looks like it's going to write a
> tar header, even though we're not writing a tarfile.
>
> +   else
> +   ereport(WARNING,
> +   (errmsg("skipping special file
> or directory \"%s\"", file)));
>
> So, if the user asks for a directory or symlink, what's going to
> happen is that they're going to receive an empty file, and get a
> warning. That sounds like terrible behavior.
>
> +   /*
> +* Check for checksum failures. If there are failures across
> multiple
> +* processes it may not report total checksum count, but it will
> error
> +* out,terminating the backup.
> +*/
>
> In other words, the patch breaks the feature. Not that the feature in
> question works particularly well as things stand, but this makes it
> worse.
>
> I think this patch (0003) is in really bad shape. I'm having second
> thoughts about the design, but it's kind of hard to even have a
> discussion about the design when the patch is riddled with minor
> problems like inadequate comments, failure to update existing
> comments, and breaking a bunch of things. I understand that sometimes
> things get missed, but this is version 14 of a patch that's been
> kicking around since last August.


Fair enough. Some of this is also due to backup-related features, i.e. the
backup manifest and progress reporting, that got committed to master towards
the tail end of PG 13. Rushing to get the parallel backup feature compatible
with these features also caused some of the oversights.


>
> --
> Robert Haas
>

Re: Parallel copy

2020-04-15 Thread Ants Aasma
On Mon, 13 Apr 2020 at 23:16, Andres Freund  wrote:
> > Still, if the reader does the splitting, then you don't need as much
> > IPC, right? The shared memory data structure is just a ring of bytes,
> > and whoever reads from it is responsible for the rest.
>
> I don't think so. If only one process does the splitting, the
> exclusively locked section is just popping off a bunch of offsets of the
> ring. And that could fairly easily be done with atomic ops (since what
> we need is basically a single producer multiple consumer queue, which
> can be done lock free fairly easily ). Whereas in the case of each
> process doing the splitting, the exclusively locked part is splitting
> along lines - which takes considerably longer than just popping off a
> few offsets.

I see the benefit of having one process responsible for splitting as
being able to run ahead of the workers to queue up work when many of
them need new data at the same time. I don't think the locking
benefits of a ring are important in this case. At current rather
conservative chunk sizes we are looking at ~100k chunks per second at
best, so normal locking should be perfectly adequate. And chunk size can
easily be increased. I see the main value in it being simple.
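
For illustration, the single-producer/multiple-consumer hand-off could look
something like this with plain C11 atomics (a sketch, not PostgreSQL's
atomics layer; it assumes the splitter never runs more than one ring's worth
ahead of the slowest worker):

#include <stdatomic.h>
#include <stdbool.h>

#define RING_SIZE 1024

typedef struct ChunkQueue
{
    long        offsets[RING_SIZE];     /* start offsets of split chunks */
    atomic_long produced;               /* chunks published by the splitter */
    atomic_long consumed;               /* next chunk index to hand out */
} ChunkQueue;

/* splitter (single producer): publish one chunk boundary */
static void
push_chunk(ChunkQueue *q, long offset)
{
    long    n = atomic_load_explicit(&q->produced, memory_order_relaxed);

    q->offsets[n % RING_SIZE] = offset;
    /* release: consumers must see the offset before the new count */
    atomic_store_explicit(&q->produced, n + 1, memory_order_release);
}

/* worker (one of many consumers): claim the next chunk, if any */
static bool
pop_chunk(ChunkQueue *q, long *offset)
{
    long    idx = atomic_load_explicit(&q->consumed, memory_order_relaxed);

    for (;;)
    {
        if (idx >= atomic_load_explicit(&q->produced, memory_order_acquire))
            return false;       /* nothing available yet; caller retries or sleeps */
        if (atomic_compare_exchange_weak_explicit(&q->consumed, &idx, idx + 1,
                                                  memory_order_acq_rel,
                                                  memory_order_relaxed))
        {
            *offset = q->offsets[idx % RING_SIZE];
            return true;
        }
        /* CAS failure reloaded idx with the current value; just retry */
    }
}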

But there is a point that having a layer of indirection instead of a
linear buffer allows some workers to fall behind - either because the
kernel scheduled them out for a time slice, because they need to do I/O,
or because inserting some tuple hit a unique-constraint conflict and needs
to wait for another transaction to commit or abort to resolve it. With a
ring buffer, reading has to wait on the slowest worker still reading its
chunk. Having workers copy the data to a local buffer as the first step
would reduce the probability of hitting such issues. But still, at GB/s
rates, hiding a 10 ms time slice of delay would need tens of megabytes of
buffer.

FWIW. I think just increasing the buffer is good enough - the CPUs
processing this workload are likely to have tens to hundreds of
megabytes of cache on board.




Re: WIP/PoC for parallel backup

2020-04-15 Thread Rajkumar Raghuwanshi
Hi Asif,

In the below scenario, backup verification failed for a tablespace when the
backup was taken with the parallel option. Without parallel, for the same
scenario, pg_verifybackup passes without any error.

[edb@localhost bin]$ mkdir /tmp/test_bkp/tblsp1
[edb@localhost bin]$ ./psql postgres -p 5432 -c "create tablespace tblsp1
location '/tmp/test_bkp/tblsp1';"
CREATE TABLESPACE
[edb@localhost bin]$ ./psql postgres -p 5432 -c "create table test (a text)
tablespace tblsp1;"
CREATE TABLE
[edb@localhost bin]$ ./psql postgres -p 5432 -c "insert into test values
('parallel_backup with -T tablespace option');"
INSERT 0 1
[edb@localhost bin]$ ./pg_basebackup -p 5432 -D /tmp/test_bkp/bkp -T
/tmp/test_bkp/tblsp1=/tmp/test_bkp/tblsp2 -j 4
[edb@localhost bin]$ ./pg_verifybackup /tmp/test_bkp/bkp
pg_verifybackup: error: "pg_tblspc/16384/PG_13_202004074/13530/16390" is
present on disk but not in the manifest
pg_verifybackup: error: "pg_tblspc/16384/PG_13_202004074/13530/16388" is
present on disk but not in the manifest
pg_verifybackup: error: "pg_tblspc/16384/PG_13_202004074/13530/16385" is
present on disk but not in the manifest
pg_verifybackup: error: "/PG_13_202004074/13530/16388" is present in the
manifest but not on disk
pg_verifybackup: error: "/PG_13_202004074/13530/16390" is present in the
manifest but not on disk
pg_verifybackup: error: "/PG_13_202004074/13530/16385" is present in the
manifest but not on disk

--without parallel backup
[edb@localhost bin]$ ./pg_basebackup -p 5432 -D /tmp/test_bkp/bkp1 -T
/tmp/test_bkp/tblsp1=/tmp/test_bkp/tblsp3 -j 1
[edb@localhost bin]$ ./pg_verifybackup /tmp/test_bkp/bkp1
backup successfully verified


Thanks & Regards,
Rajkumar Raghuwanshi


On Wed, Apr 15, 2020 at 2:19 PM Ahsan Hadi  wrote:

>
>
> On Wed, 15 Apr 2020 at 1:49 AM, Robert Haas  wrote:
>
>> On Tue, Apr 14, 2020 at 10:37 AM Asif Rehman 
>> wrote:
>> > I forgot to make a check for no-manifest. Fixed. Attached is the
>> updated patch.
>>
>> +typedef struct
>> +{
>> ...
>> +} BackupFile;
>> +
>> +typedef struct
>> +{
>> ...
>> +} BackupState;
>>
>> These structures need comments.
>>
>> +list_wal_files_opt_list:
>> +   SCONST SCONST
>> {
>> - $$ = makeDefElem("manifest_checksums",
>> -
>> (Node *)makeString($2), -1);
>> +   $$ = list_make2(
>> +   makeDefElem("start_wal_location",
>> +   (Node *)makeString($2),
>> -1),
>> +   makeDefElem("end_wal_location",
>> +   (Node *)makeString($2),
>> -1));
>> +
>> }
>>
>> This seems like an unnecessarily complicated parse representation. The
>> DefElems seem to be completely unnecessary here.
>>
>> @@ -998,7 +1110,37 @@ SendBaseBackup(BaseBackupCmd *cmd)
>> set_ps_display(activitymsg);
>> }
>>
>> -   perform_base_backup(&opt);
>> +   switch (cmd->cmdtag)
>>
>> So the design here is that SendBaseBackup() is now going to do a bunch
>> of things that are NOT sending a base backup? With no updates to the
>> comments of that function and no change to the process title it sets?
>>
>> -   return (manifest->buffile != NULL);
>> +   return (manifest && manifest->buffile != NULL);
>>
>> Heck no. It appears that you didn't even bother reading the function
>> header comment.
>>
>> + * Send a single resultset containing XLogRecPtr record (in text format)
>> + * TimelineID and backup label.
>>   */
>>  static void
>> -SendXlogRecPtrResult(XLogRecPtr ptr, TimeLineID tli)
>> +SendXlogRecPtrResult(XLogRecPtr ptr, TimeLineID tli,
>> +StringInfo label, char *backupid)
>>
>> This just casually breaks wire protocol compatibility, which seems
>> completely unacceptable.
>>
>> +   if (strlen(opt->tablespace) > 0)
>> +   sendTablespace(opt->tablespace, NULL, true, NULL, &files);
>> +   else
>> +   sendDir(".", 1, true, NIL, true, NULL, NULL, &files);
>> +
>> +   SendFilesHeader(files);
>>
>> So I guess the idea here is that we buffer the entire list of files in
>> memory, regardless of size, and then we send it out afterwards. That
>> doesn't seem like a good idea. The list of files might be very large.
>> We probably need some code refactoring here rather than just piling
>> more and more different responsibilities onto sendTablespace() and
>> sendDir().
>>
>> +   if (state->parallel_mode)
>> +   SpinLockAcquire(&state->lock);
>> +
>> +   state->throttling_counter += increment;
>> +
>> +   if (state->parallel_mode)
>> +   SpinLockRelease(&state->lock);
>>
>> I don't like this much. It seems to me that we would do better to use
>> atomics here all the time, instead of conditional spinlocks.
>>
>> +static void
>> +send_file(basebackup_options *opt, char *

[PATCH] Fix minor memory leak in pg_basebackup

2020-04-15 Thread Zhang, Jie
Hi all

In some cases, PGresult is not cleared.

File: src/bin/pg_basebackup/streamutil.c

bool
RetrieveWalSegSize(PGconn *conn)
{
    PGresult   *res;
    ..
    res = PQexec(conn, "SHOW wal_segment_size");
    if (PQresultStatus(res) != PGRES_TUPLES_OK)
    {
        pg_log_error("could not send replication command \"%s\": %s",
                     "SHOW wal_segment_size", PQerrorMessage(conn));

        PQclear(res);                       // *** res is cleared ***
        return false;
    }
    ..
    /* fetch xlog value and unit from the result */
    if (sscanf(PQgetvalue(res, 0, 0), "%d%s", &xlog_val, xlog_unit) != 2)
    {
        pg_log_error("WAL segment size could not be parsed");
        return false;                       // *** res is not cleared ***
    }
    ..
    if (!IsValidWalSegSize(WalSegSz))
    {
        pg_log_error(ngettext("WAL segment size must be a power of two between 1 MB and 1 GB, but the remote server reported a value of %d byte",
                              "WAL segment size must be a power of two between 1 MB and 1 GB, but the remote server reported a value of %d bytes",
                              WalSegSz),
                     WalSegSz);
        return false;                       // *** res is not cleared ***
    }
    ..
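
Presumably the fix just adds PQclear() before those early returns, e.g. for
the second case (a sketch; the attached patch may differ in detail):

    /* fetch xlog value and unit from the result */
    if (sscanf(PQgetvalue(res, 0, 0), "%d%s", &xlog_val, xlog_unit) != 2)
    {
        pg_log_error("WAL segment size could not be parsed");
        PQclear(res);           /* free the result on this error path too */
        return false;
    }

and likewise a PQclear(res) before the return in the IsValidWalSegSize()
branch.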


Here is a patch.

Best Regards!







pg_basebackup.patch
Description: pg_basebackup.patch


Re: Allow pg_read_all_stats to read pg_stat_progress_*

2020-04-15 Thread Magnus Hagander
On Wed, Apr 15, 2020 at 9:14 AM Andrey M. Borodin 
wrote:

> Hi!
>
> One of our users asked me why they cannot read details of
> pg_stat_progress_vacuum while they have pg_read_all_stats role.
> Maybe I'm missing something, but I think they should be able to read
> stats...
>
> PFA fix.
> This affects pg_stat_progress_analyze, pg_stat_progress_basebackup,
> pg_stat_progress_cluster, pg_stat_progress_create_index and
> pg_stat_progress_vacuum.
>
> With patch
> postgres=# set role pg_read_all_stats ;
> postgres=> select * from pg_stat_progress_vacuum ;
>   pid  | datid | datname  | relid |     phase     | heap_blks_total | heap_blks_scanned | heap_blks_vacuumed | index_vacuum_count | max_dead_tuples | num_dead_tuples
> -------+-------+----------+-------+---------------+-----------------+-------------------+--------------------+--------------------+-----------------+-----------------
>  76331 | 12923 | postgres |  1247 | scanning heap |              10 |                 1 |                  0 |                  0 |            2910 |               0
> (1 row)
>
> Without patch
> postgres=# set role pg_read_all_stats  ;
> SET
> postgres=> select * from pg_stat_progress_vacuum ;
>   pid  | datid | datname  | relid | phase | heap_blks_total | heap_blks_scanned | heap_blks_vacuumed | index_vacuum_count | max_dead_tuples | num_dead_tuples
> -------+-------+----------+-------+-------+-----------------+-------------------+--------------------+--------------------+-----------------+-----------------
>  76331 | 12923 | postgres |       |       |                 |                   |                    |                    |                 |
> (1 row)
>

I think that makes perfect sense. The documentation explicitly says "can
read all pg_stat_* views", which is clearly wrong -- so either the code or
the docs should be fixed, and it looks to me like it's the code that should
be fixed.

As for the patch, one could argue that we should just store the resulting
boolean instead of re-running the check (e.g. have a "bool
has_stats_privilege" or such), but perhaps that's an unnecessary
micro-optimization. Something like the attached.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 175f4fd26b..60fc94d89c 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -466,6 +466,7 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	MemoryContext per_query_ctx;
 	MemoryContext oldcontext;
+	bool		has_stats_role = is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS);
 
 	/* check to see if caller supports us returning a tuplestore */
 	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
@@ -537,7 +538,8 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
 		values[1] = ObjectIdGetDatum(beentry->st_databaseid);
 
 		/* show rest of the values including relid only to role members */
-		if (has_privs_of_role(GetUserId(), beentry->st_userid))
+		if (has_stats_role ||
+			has_privs_of_role(GetUserId(), beentry->st_userid))
 		{
 			values[2] = ObjectIdGetDatum(beentry->st_progress_command_target);
 			for (i = 0; i < PGSTAT_NUM_PROGRESS_PARAM; i++)


Re: Allow pg_read_all_stats to read pg_stat_progress_*

2020-04-15 Thread Andrey M. Borodin



> On 15 Apr 2020, at 15:25, Magnus Hagander wrote:
> 
> 
> I think that makes perfect sense. The documentation explicitly says "can read 
> all pg_stat_* views", which is clearly wrong -- so either the code or the 
> docs should be fixed, and it looks like it's the code that should be fixed to 
> me.
Should this be treated as a bug fix or as a v14 feature?

Also, pgstatfuncs.c contains many more checks of
has_privs_of_role(GetUserId(), beentry->st_userid).
Maybe pg_read_all_stats should be granted access in all of them?

> As for the patch, one could argue that we should just store the resulting 
> boolean instead of re-running the check (e.g. have a "bool 
> has_stats_privilege" or such), but perhaps that's an unnecessary 
> micro-optimization, like the attached.

Looks good to me.

Thanks!


Best regards, Andrey Borodin.



Re: Parallel copy

2020-04-15 Thread Amit Kapila
On Wed, Apr 15, 2020 at 1:10 AM Kuntal Ghosh  wrote:
>
> Hence, I was trying to think whether we can leverage this idea for
> implementing parallel COPY in PG. We can design an algorithm similar
> to parallel hash-join where the workers pass through different phases.
> 1. Phase 1 - Read fixed size chunks in parallel, store the chunks and
> the small stats about each chunk in the shared memory. If the shared
> memory is full, go to phase 2.
> 2. Phase 2 - Allow a single worker to process the stats and decide the
> actual chunk boundaries so that no tuple spans across two different
> chunks. Go to phase 3.
>
> 3. Phase 3 - Each worker picks one adjusted chunk, parse and process
> tuples from the same. Once done with one chunk, it picks the next one
> and so on.
>
> 4. If there are still some unread contents, go back to phase 1.
>
> We can probably use separate workers for phase 1 and phase 3 so that
> they can work concurrently.
>
> Advantages:
> 1. Each worker spends some significant time in each phase. Gets
> benefit of the instruction cache - at least in phase 1.
> 2. It also has the same advantage of parallel hash join - fast workers
> get to work more.
> 3. We can extend this solution for reading data from STDIN. Of course,
> the phase 1 and phase 2 must be performed by the leader process who
> can read from the socket.
>
> Disadvantages:
> 1. Surely doesn't work if we don't have enough shared memory.
> 2. Probably, this approach is just impractical for PG due to certain
> limitations.
>

As I understand this, it needs to parse the lines twice (the second time
in phase 3), and until the first two phases are over, we can't start the
tuple-processing work that is done in phase 3.  So even if the
tokenization is done a bit faster, we will lose some time on processing
the tuples, which might not be an overall win and can in fact be
worse than the single-reader approach being discussed.
Now, if the work done in tokenization were a major (or significant)
portion of the copy, then thinking of such a technique might be useful,
but that is not the case as seen in the data shared above in this
thread (the tokenize time is very small compared to the data-processing
time).

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: where should I stick that backup?

2020-04-15 Thread Robert Haas
On Tue, Apr 14, 2020 at 9:50 PM Andres Freund  wrote:
> On 2020-04-14 11:38:03 -0400, Robert Haas wrote:
> > I'm fairly deeply uncomfortable with what Andres is proposing. I see
> > that it's very powerful, and can do a lot of things, and that if
> > you're building something that does sophisticated things with storage,
> > you probably want an API like that. It does a great job making
> > complicated things possible. However, I feel that it does a lousy job
> > making simple things simple.
>
> I think it's pretty much exactly the opposite. Your approach seems to
> move all the complexity to the user, having to build entire combination
> of commands themselves.  Instead of having one or two default commands
> that do backups in common situations, everyone has to assemble them from
> pieces.

I think we're mostly talking about different things. I was speaking
mostly about the difficulty of developing it. I agree that a project
which is easier to develop is likely to provide fewer benefits to the
end user. On the other hand, it might be more likely to get done, and
projects that don't get done provide few benefits to users. I strongly
believe we need an incremental approach here.

> In general, I think it's good to give expert users the ability to
> customize things like backups and archiving. But defaulting to every
> non-expert user having to do all that expert work (or copying it from bad
> examples) is one of the most user-hostile things in postgres.

I'm not against adding more built-in compression algorithms, but I
also believe (as I have several times now) that the world moves a lot
faster than PostgreSQL, which has not added a single new compression
algorithm to pg_basebackup ever. We had 1 compression algorithm in
2011, and we still have that same 1 algorithm today. So, either nobody
cares, or adding new algorithms is sufficiently challenging - for
either technical or political reasons - that nobody's managed to get
it done. I think having a simple framework in pg_basebackup for
plugging in new algorithms would make it noticeably simpler to add LZ4
or whatever your favorite compression algorithm is. And I think having
that framework also be able to use shell commands, so that users don't
have to wait a decade or more for new choices to show up, is also a
good idea.

I don't disagree that the situation around things like archive_command
is awful, but a good part of that is that every time somebody shows up
and says "hey, let's try to make a small improvement," between two and
forty people show up and start explaining why it's still going to be
terrible. Eventually the pile of requirements get so large, and/or
there are enough contradictory opinions, that the person who made the
proposal for how to improve things gives up and leaves. So then we
still have the documentation suggesting "cp". When people - it happens
to be me in this case, but the problem is much more general - show up
and propose improvements to difficult areas, we can and should give
them good advice on how to improve their proposals. But we should not
insist that they have to build something incredibly complex and
grandiose and solve every problem in that area. We should be happy if
we get ANY improvement in a difficult area, not send dozens of angry
emails complaining that their proposal is imperfect.

> I really really don't understand this. Are you suggesting that for
> server side compression etc we're going to add the ability to specify
> shell commands as argument to the base backup command?  That seems so
> obviously a non-starter?  A good default for backup configurations
> should be that the PG user that the backup is done under is only allowed
> to do that, and not that it directly has arbitrary remote command
> execution.

I hadn't really considered that aspect, and that's certainly a
concern. But I also don't understand why you think it's somehow a big
deal. My point is not that clients should have the ability to execute
arbitrary commands on the server. It's that shelling out to an
external binary provided by the operating system is a reasonable thing
to do, versus having everything have to be done by binaries that we
create. Which I think is what you are also saying right here:

> But the tool speaking the protocol can just allow piping through
> whatever tool?  Given that there likely is benefits to either doing
> things on the client side or on the server side, it seems inevitable
> that there's multiple places that would make sense to have the
> capability for?

Unless I am misunderstanding you, this is exactly what I was
proposing, and have been proposing since the first email on the
thread.

> > It's possibly not the exact same thing. A special might, for example,
> > use multiple threads for parallel compression rather than multiple
> > processes, perhaps gaining a bit of efficiency. But it's doubtful
> > whether all users care about such marginal improvements.
>
> Marginal improvements? Compression scales decently we

Re: documenting the backup manifest file format

2020-04-15 Thread Robert Haas
On Tue, Apr 14, 2020 at 11:49 PM Fujii Masao
 wrote:
> While reading the document that you pushed, I thought that it's better
> to define index term for backup manifest, so that we can easily reach
> this document from the index page. Thought? Patch attached.

Fine with me. I tend not to think about the index very much, so I'm
glad you are. :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: WIP/PoC for parallel backup

2020-04-15 Thread Robert Haas
On Wed, Apr 15, 2020 at 4:49 AM Ahsan Hadi  wrote:
> Fair enough. Some of this is also due to backup related features i.e backup 
> manifest, progress reporting that got committed to master towards the tail 
> end of PG-13. Rushing to get parallel backup feature compatible with these 
> features also caused some of the oversights.

Sure, but there's also no point in rushing out a feature that's in a
state where it's got no chance of being acceptable, and quite a number
of these problems are not new, either.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [DOC] Document concurrent index builds waiting on each other

2020-04-15 Thread James Coleman
On Wed, Mar 25, 2020 at 3:58 PM Andres Freund  wrote:
>
> Hi,
>
> On 2020-03-25 16:30:10 -0300, Alvaro Herrera wrote:
> > I posted this in November
> > https://postgr.es/m/20191101203310.GA12239@alvherre.pgsql but I didn't
> > put time to go through the issues there.
>
> Oh, missed that.
>
>
> > I don't know if my approach is exactly what Andres has in mind
>
> Not quite. I don't think it's generally correct for CIC to set
> PROC_IN_VACUUM. I'm doubtful it's the case even just for plain indexes -
> we don't want rows to be pruned away from under us. I also think we'd
> want to set such a flag during all of the CIC phases?
>
> What I was thinking of was a new flag, with a distinct value from
> PROC_IN_VACUUM. It'd currently just be specified in the
> GetCurrentVirtualXIDs() calls in WaitForOlderSnapshots(). That'd avoid
> needing to wait for other CICs on different relations. Since CIC is not
> permitted on system tables, and CIC doesn't do DML on normal tables, it
> seems fairly obviously correct to exclude other CICs.

That would keep CIC from blocking other CICs, but it wouldn't solve
the problem of CIC blocking vacuum on unrelated tables, right? Perhaps
that's orthogonal though.

James




Re: Parallel copy

2020-04-15 Thread Robert Haas
On Wed, Apr 15, 2020 at 7:15 AM Amit Kapila  wrote:
> As I understand this, it needs to parse the lines twice (second time
> in phase-3) and till the first two phases are over, we can't start the
> tuple processing work which is done in phase-3.  So even if the
> tokenization is done a bit faster but we will lose some on processing
> the tuples which might not be an overall win and in fact, it can be
> worse as compared to the single reader approach being discussed.
> Now, if the work done in tokenization is a major (or significant)
> portion of the copy then thinking of such a technique might be useful
> but that is not the case as seen in the data shared above (the
> tokenize time is very less as compared to data processing time) in
> this email.

It seems to me that a good first step here might be to forget about
parallelism for a minute and just write a patch to make the line
splitting as fast as possible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Perl modules for testing/viewing/corrupting/repairing your heap files

2020-04-15 Thread Mark Dilger



> On Apr 14, 2020, at 6:17 PM, Peter Geoghegan  wrote:
> 
> On Wed, Apr 8, 2020 at 3:51 PM Mark Dilger  
> wrote:
>> Recently, as part of testing something else, I had need of a tool to create
>> surgically precise corruption within heap pages.  I wanted to make the
>> corruption from within TAP tests, so I wrote the tool as a set of perl 
>> modules.
> 
> There is also pg_hexedit:
> 
> https://github.com/petergeoghegan/pg_hexedit

I steered away from software released under the GPL, such as pg_hexedit, owing 
to difficulties in getting anything I develop accepted.  (That's a hard enough 
problem without licensing issues.)  I'm not taking a political stand for or 
against the GPL here, just a pragmatic position that I wouldn't be able to 
integrate pg_hexedit into a postgres submission.

(Thanks for writing pg_hexedit, BTW.  I'm not criticizing it.)

The purpose of these perl modules is not the viewing of files, but the 
intentional and targeted corruption of files from within TAP tests.  There are 
limited examples of tests in the postgres source tree that intentionally 
corrupt files, and as I read them, they employ a blunt force trauma approach:

In src/bin/pg_basebackup/t/010_pg_basebackup.pl:

> # induce corruption
> system_or_bail 'pg_ctl', '-D', $pgdata, 'stop';
> open $file, '+<', "$pgdata/$file_corrupt1";
> seek($file, $pageheader_size, 0);
> syswrite($file, "\0\0\0\0\0\0\0\0\0");
> close $file;
> system_or_bail 'pg_ctl', '-D', $pgdata, 'start';

In src/bin/pg_checksums/t/002_actions.pl:
> # Time to create some corruption
> open my $file, '+<', "$pgdata/$file_corrupted";
> seek($file, $pageheader_size, 0);
> syswrite($file, "\0\0\0\0\0\0\0\0\0");
> close $file;

These blunt force trauma tests are fine, as far as they go.  But I wanted to be 
able to do things like

# Corrupt the tuple to look like it has lots of attributes, some of
# them null.  This falsely creates the impression that the t_bits
# array is longer than just one byte, but t_hoff still says otherwise.
$tup->{HEAP_HASNULL} = 1;
$tup->{HEAP_NATTS_MASK} = 0x3FF;
$tup->{t_bits} = 0xAA;

or

# Same as above, but this time t_hoff plays along
$tup->{HEAP_HASNULL} = 1;
$tup->{HEAP_NATTS_MASK} = 0x3FF;
$tup->{t_bits} = 0xAA;
$tup->{t_hoff} = 32;

That's hard to do from a TAP test without modules like this, as you have to 
calculate by hand the offsets where you're going to write the corruption, and 
the bit pattern you are going to write to that location.  Even if you do all 
that, nobody else is likely going to be able to read and maintain your tests.

I'd like an easy way from within TAP tests to selectively corrupt files, to 
test whether various parts of the system fail gracefully in the presence of 
corruption.  What happens when a child partition is corrupted?  Does that 
impact queries that only access other partitions?  What kinds of corruption 
cause pg_upgrade to fail? ...to expand the scope of the corruption?  What 
happens to logical replication when there is corruption on the primary? ...on 
the standby?  What kinds of corruption cause a query to return data from 
neighboring tuples that the querying role has not permission to view?  What 
happens when a NAS is only intermittently corrupt?

The modules I've submitted thus far are incomplete for this purpose.  They 
don't yet handle toast tables, btree, hash, gist, gin, fsm, or vm, and I might 
be forgetting a few other things in the list.  Before I go and implement all of 
that, I thought perhaps others would express preferences about how this should 
all work, even stuff like, "Don't bother implementing that in perl, as I'm 
reimplementing the entire testing structure in COBOL", or similarly unexpected 
feedback.


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: wrong relkind error messages

2020-04-15 Thread Robert Haas
On Tue, Apr 14, 2020 at 7:02 PM Alvaro Herrera  wrote:
> On 2020-Apr-13, Robert Haas wrote:
> > + ereport(ERROR,
> > + (errcode(ERRCODE_WRONG_OBJECT_TYPE),
> > + errmsg("action cannot be performed on relation \"%s\"",
> > + RelationGetRelationName(rel)),
> >
> > Super-vague.
>
> Maybe, but note that the patch proposed to replace this current error
> message:
>   ERROR:  foo is not an index or foreign table
> with
>   ERROR:  action cannot be performed on "foo"
>   DETAIL:  "foo" is a materialized view.

Sure, but the point is that this case is not improved nearly as much
as most of the others. In a whole bunch of cases, he made the error
message describe the attempted operation, but here he didn't. I'm not
saying that makes it worse than what we had before, just that it would
be better if we could make this look like the other cases the patch
also changes.

> This could be improved if we had stringification of ALTER TABLE
> subcommand types:
>
>   ERROR:  ALTER TABLE ... ADD COLUMN cannot be performed on "foo"
>   DETAIL:  "foo" is a gummy bear.
> or
>   ERROR:  ALTER TABLE ... ADD COLUMN cannot be performed on foo
>   DETAIL:  This action cannot be performed on gummy bears.
>
> but that seems material for a different patch.

Even without that, you could at least say "this form of ALTER TABLE is
not supported for foo" or something like that.

I'm not trying to block the patch. I think it's a good patch. I was
just making an observation about some parts of it where it seems like
we could try slightly harder to do better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PG compilation error with Visual Studio 2015/2017/2019

2020-04-15 Thread Ranier Vilela
On Wed, Apr 15, 2020 at 03:08, davinder singh <
davindersingh2...@gmail.com> wrote:

>
> Thanks for the review comments.
>
> On Tue, Apr 14, 2020 at 9:12 PM Ranier Vilela  wrote:
>
>> >> I'm still working on testing this patch. If anyone has an idea, please
>> >> suggest.
>> I still see problems with this patch.
>>
>> 1. Variable loct has a redundant initialization; it would be enough to
>> declare it as: _locale_t loct;
>> 2. Style: whitespace in the rc variable declaration.
>> 3. Style: the scope of variable cp_index can be reduced.
>> if (tmp != NULL) {
>>size_t cp_index;
>>
>> cp_index = (size_t)(tmp - winlocname);
>> strncpy(loc_name, winlocname, cp_index);
>> loc_name[cp_index] = '\0';
>> 4. Memory leak if _WIN32_WINNT >= 0x0600 is true, _free_locale(loct); is
>> not called.
>>
> I resolved the above comments.
>
Ok, thanks.


>
>> 5. Why call _create_locale if _WIN32_WINNT >= 0x0600 is true and loct is
>> not used?
>>
> _create_locale can accept a wider range of inputs than GetLocaleInfoEx. But
> we are interested in *language[_country-region[.code-page]]*. We are using
> _create_locale to validate the given input, because we can't verify the
> locale name with GetLocaleInfoEx if a code page is appended. So before
> parsing, we verify that the whole input locale name is valid by using
> _create_locale. I hope that answers your question.
>
Understood. In this case, _create_locale is being used only to validate
the input.
Perhaps, in addition, you could create a separate function that only
validates winlocname, without having to create structures or use malloc,
to be used when _WIN32_WINNT >= 0x0600 is true -- but that is only a
suggestion, if you think it is necessary.

But there is one last problem: in case GetLocaleInfoEx fails, there is
still a memory leak; before returning NULL we need to call
_free_locale(loct).
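
A sketch of that error path (with hypothetical variable names such as
wc_locale_name and buffer, not the actual patch):

    loct = _create_locale(LC_COLLATE, winlocname);
    if (loct == NULL)
        return NULL;            /* not a valid locale name */

    if (GetLocaleInfoEx(wc_locale_name, LOCALE_SNAME,
                        buffer, LOCALE_NAME_MAX_LENGTH) == 0)
    {
        _free_locale(loct);     /* avoid leaking the locale on failure */
        return NULL;            /* could also log GetLastError() here */
    }
    _free_locale(loct);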

Another suggestion: if GetLocaleInfoEx fails, wouldn't it be good to log
the error and the error number?

regards,
Ranier Vilela


Re: sqlsmith crash incremental sort

2020-04-15 Thread Tomas Vondra

On Tue, Apr 14, 2020 at 01:16:33PM -0400, James Coleman wrote:

On Sun, Apr 12, 2020 at 8:09 PM Tomas Vondra
 wrote:


On Sun, Apr 12, 2020 at 12:44:45AM +0200, Tomas Vondra wrote:
>Hi,
>
>I've looked into this a bit, and at first I thought that maybe the
>issue is in how cost_incremental_sort picks the EC members. It simply
>does this:
>
>EquivalenceMember *member = (EquivalenceMember *)
>linitial(key->pk_eclass->ec_members);
>
>so I was speculating that maybe there are multiple EC members and the
>one we need is not the first one. That would have been easy to fix.
>
>But that doesn't seem to be the case - in this example the EC only has a
>single EC member anyway.
>
>(gdb) p key->pk_eclass->ec_members
>$14 = (List *) 0x12eb958
>(gdb) p *key->pk_eclass->ec_members
>$15 = {type = T_List, length = 1, max_length = 5, elements = 0x12eb970, 
initial_elements = 0x12eb970}
>
>and the member is a Var with varno=0 (with a RelabelType on top, but
>that's irrelevant).
>
>(gdb) p *(Var*)((RelabelType*)member->em_expr)->arg
>$12 = {xpr = {type = T_Var}, varno = 0, varattno = 1, vartype = 12445, 
vartypmod = -1, varcollid = 950, varlevelsup = 0, varnosyn = 0, varattnosyn = 1, 
location = -1}
>
>which then triggers the assert in find_base_rel. When looking for other
>places calling estimate_num_groups I found this in prepunion.c:
>
> * XXX you don't really want to know about this: we do the estimation
> * using the subquery's original targetlist expressions, not the
> * subroot->processed_tlist which might seem more appropriate.  The
> * reason is that if the subquery is itself a setop, it may return a
> * processed_tlist containing "varno 0" Vars generated by
> * generate_append_tlist, and those would confuse estimate_num_groups
> * mightily.  We ought to get rid of the "varno 0" hack, but that
> * requires a redesign of the parsetree representation of setops, so
> * that there can be an RTE corresponding to each setop's output.
>
>which seems pretty similar to the issue at hand, because the subpath is
>T_UpperUniquePath (not sure if that passes as setop, but the symptoms
>match nicely).
>
>Not sure what to do about it in cost_incremental_sort, though :-(
>

I've been messing with this the whole day, without much progress :-(

I'm 99.% sure it's the same issue described by the quoted comment,
because the plan looks like this:

  Nested Loop Left Join
->  Sample Scan on pg_namespace
  Sampling: system ('7.2'::real)
->  Incremental Sort
  Sort Key: ...
  Presorted Key: ...
  ->  Unique
->  Sort
  Sort Key: ...
  ->  Append
->  Nested Loop
...
->  Nested Loop
...

so yeah, the plan does have set operations, and generate_append_tlist
does generate Vars with varno == 0, causing this issue.


This is a bit of an oddly shaped plan anyway, right? In an ideal world
the sort for the unique would have knowledge about what would be
useful for the parent node, and we wouldn't need the incremental sort
at all.



Well, yeah. The problem is the Unique simply compares the columns in the
order it sees them, and it does not match the column order desired by
incremental sort. But we don't push down this information at all :-(

In fact, there may be other reasons to reorder the comparisons, e.g.
when the cost is different for different columns. There was a patch by
Teodor doing exactly that, IIRC.


I'm not sure that that kind of thing is really a new problem, though,
and it might not even be entirely possible to fix directly by trying
to push down knowledge about useful sort keys to whatever created that
sort path; it might only be fixable by having the incremental sort (or
even regular sort) path creation know to "subsume" a sort underneath
it.

Anyway, I think that's a bit off topic, but it stood out to me.



It's not a new problem. It's an optimization we don't have.


But I'm not entirely sure what to do about it in cost_incremental_sort.
The comment (introduced by 89deca582a in 2017) suggests a proper fix
would require redesigning the parsetree representation of setops, and
it's a bit too late for that.

So I wonder what a possible solution might look like. I was hoping we
might grab the original target list and use that, similarly to
recurse_set_operations, but I'm not sure how/where to get it.


This is also not an area I'm familiar with. Reading through the
prepunion.c code alongside cost_incremental_sort, it seems that we
don't have access to the same level of information as the prepunion
code (i.e., we're only looking at the result of the union, not the
components of it), and trying to descend down into it seems even more
gross, so, see below...



Yeah. And I'm not even sure having that information would allow good
estimates e.g. for

Re: Incremental sorts and EXEC_FLAG_REWIND

2020-04-15 Thread James Coleman
On Tue, Apr 14, 2020 at 2:53 AM Michael Paquier  wrote:
>
> Hi,
>
> When initializing an incremental sort node, we have the following as
> of ExecInitIncrementalSort():
> /*
>  * Incremental sort can't be used with either EXEC_FLAG_REWIND,
>  * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we only one of many sort
>  * batches in the current sort state.
>  */
>  Assert((eflags & (EXEC_FLAG_BACKWARD |
>EXEC_FLAG_MARK)) == 0);
> While I don't quite follow why EXEC_FLAG_REWIND should be allowed here
> to begin with (because incremental sorts don't support rescans without
> parameter changes, right?), the comment and the assertion are telling
> a different story.

I remember changing this assertion in response to an issue I'd found
which led to rewriting the rescan implementation, but I must have
missed updating the comment.

> And I can see that child nodes of an
> IncrementalSort one use a set of eflags where these three are removed:
> /*
>  * Initialize child nodes.
>  *
>  * We shield the child node from the need to support REWIND, BACKWARD, or
>  * MARK/RESTORE.
>  */
>  eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
>
> I can also spot one case in the regression tests where we actually pass
> down a REWIND flag (see incremental_sort.sql) when initializing an
> IncrementalSort node:
> -- We force the planner to choose a plan with incremental sort on the right 
> side
> -- of a nested loop join node. That way we trigger the rescan code path.
> set local enable_hashjoin = off;
> set local enable_mergejoin = off;
> set local enable_material = off;
> set local enable_sort = off;
> explain (costs off) select * from t left join (select * from (select *
> from t order by a) v order by a, b) s on s.a = t.a where t.a in (1,
> 2);
> select * from t left join (select * from (select * from t order by a)
> v order by a, b) s on s.a = t.a where t.a in (1, 2);
>
> Alexander, Tomas, any thoughts?

I'll try to respond more fully later today when I can dig up the
specific change.

In the meantime, your question is primarily about making sure the
code/comments/etc. are consistent and not a behavioral problem or
failure you've seen in testing?

James




Re: wrong relkind error messages

2020-04-15 Thread Alvaro Herrera
On 2020-Apr-15, Robert Haas wrote:

> [good arguments]

I don't disagree with anything you said, and I don't have anything to
add for now.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Parallel copy

2020-04-15 Thread Kuntal Ghosh
On Wed, Apr 15, 2020 at 2:15 PM Ants Aasma  wrote:
>
> On Tue, 14 Apr 2020 at 22:40, Kuntal Ghosh  wrote:
> > 1. Each worker scans a distinct fixed sized chunk of the CSV file and
> > collects the following three stats from the chunk:
> > a) number of quotes
> > b) position of the first new line after even number of quotes
> > c) position of the first new line after odd number of quotes
> > 2. Once stats from all the chunks are collected, the leader identifies
> > the adjusted chunk boundaries by iterating over the stats linearly:
> > - For the k-th chunk, the leader adds the number of quotes in k-1 chunks.
> > - If the number is even, then the k-th chunk does not start in the
> > middle of a quoted field, and the first newline after an even number
> > of quotes (the second collected information) is the first record
> > delimiter in this chunk.
> > - Otherwise, if the number is odd, the first newline after an odd
> > number of quotes (the third collected information) is the first record
> > delimiter.
> > - The end position of the adjusted chunk is obtained based on the
> > starting position of the next adjusted chunk.
>
> The trouble is that, at least with current coding, the number of
> quotes in a chunk can depend on whether the chunk started in a quote
> or not. That's because escape characters only count inside quotes. See
> for example the following csv:
>
> foo,\"bar
> baz",\"xyz"
>
> This currently parses as one line and the number of parsed quotes
> doesn't change if you add a quote in front.
>
> But the general approach of doing the tokenization in parallel and
> then a serial pass over the tokenization would still work. The quote
> counting and new line finding just has to be done for both starting in
> quote and not starting in quote case.
>
Yeah, right.
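
For what it's worth, a rough sketch of the per-chunk scan described in
the quoted steps, extended to track the first newline for both possible
starting parities (escape handling, alternative quote characters and
multi-byte encodings deliberately ignored), could look like this:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct ChunkStats
{
	uint64_t	nquotes;		/* number of quote characters in the chunk */
	int64_t		first_nl_even;	/* first '\n' after an even quote count */
	int64_t		first_nl_odd;	/* first '\n' after an odd quote count */
} ChunkStats;

static ChunkStats
scan_chunk(const char *data, size_t len)
{
	ChunkStats	s = {0, -1, -1};

	for (size_t i = 0; i < len; i++)
	{
		if (data[i] == '"')
			s.nquotes++;
		else if (data[i] == '\n')
		{
			if (s.nquotes % 2 == 0 && s.first_nl_even < 0)
				s.first_nl_even = (int64_t) i;
			else if (s.nquotes % 2 == 1 && s.first_nl_odd < 0)
				s.first_nl_odd = (int64_t) i;
		}
	}
	return s;
}

int
main(void)
{
	const char *chunk = "ab\"cd\nef\"gh\nij\n";
	ChunkStats	s = scan_chunk(chunk, strlen(chunk));

	/*
	 * The leader sums nquotes over chunks 1..k-1; if that sum is even, the
	 * k-th chunk's first record delimiter is at first_nl_even, otherwise at
	 * first_nl_odd.
	 */
	printf("quotes=%llu first_nl_even=%lld first_nl_odd=%lld\n",
		   (unsigned long long) s.nquotes,
		   (long long) s.first_nl_even,
		   (long long) s.first_nl_odd);
	return 0;
}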

> Using phases doesn't look like the correct approach - the tokenization
> can be prepared just in time for the serial pass and processing the
> chunk can proceed immediately after. This could all be done by having
> the data in a single ringbuffer with a processing pipeline where one
> process does the reading, then workers grab tokenization chunks as
> they become available, then one process handles determining the chunk
> boundaries, after which the chunks are processed.
>
I was thinking from this point of view - the sooner we introduce
parallelism in the process, the greater the benefits. Probably there
isn't any way to avoid a single pass over the data (phase 2 in the
above case) to tokenise the chunks. So yeah, if the reading and
tokenisation phase doesn't take much time, parallelising it would
just be overkill. As pointed out by Andres and you, using a lock-free
circular buffer implementation sounds like the way to go forward. AFAIK,
a CAS-based FIFO circular queue suffers from two problems - 1. (as you
pointed out) slow workers may block producers; 2. since it doesn't
partition the queue among the workers, it doesn't achieve good locality
and cache-friendliness, which limits scalability on NUMA systems.

> But I still don't think this is something to worry about for the first
> version. Just a better line splitting algorithm should go a looong way
> in feeding a large number of workers, even when inserting to an
> unindexed unlogged table. If we get the SIMD line splitting in, it
> will be enough to overwhelm most I/O subsystems available today.
>
Yeah. Parsing text is a great use case for data parallelism, which can
be achieved with SIMD instructions. Consider processing 8-bit ASCII
characters in a 512-bit SIMD word. A lot of code and complexity from
CopyReadLineText would surely go away. And further (I'm not sure on
this point), if we can use the schema of the table, perhaps JIT could
generate machine code to read fields efficiently based on their
types.
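
Just to illustrate the line-splitting part (a minimal sketch only, using
SSE2 rather than 512-bit vectors, and not tied to CopyReadLineText at
all), finding newline candidates in a 16-byte block could look like the
following; newlines inside quoted fields would of course still need the
serial fix-up discussed above:

/*
 * Sketch: locate newline candidates in one 16-byte block with SSE2.
 * __builtin_ctz is a GCC/Clang builtin.
 */
#include <emmintrin.h>
#include <stdint.h>
#include <stdio.h>

static void
report_newlines(const char *buf, size_t off)
{
	__m128i		block = _mm_loadu_si128((const __m128i *) (buf + off));
	__m128i		nl = _mm_set1_epi8('\n');
	uint32_t	mask = (uint32_t) _mm_movemask_epi8(_mm_cmpeq_epi8(block, nl));

	while (mask != 0)
	{
		int			bit = __builtin_ctz(mask);	/* lowest set bit = next newline */

		printf("newline candidate at offset %zu\n", off + (size_t) bit);
		mask &= mask - 1;		/* clear that bit and continue */
	}
}

int
main(void)
{
	/* 16-byte block; the newline at offset 8 sits inside a quoted field */
	const char	buf[16] = "a,b\nc,\"d\ne\",f\n";

	report_newlines(buf, 0);
	return 0;
}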


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: documenting the backup manifest file format

2020-04-15 Thread Jehan-Guillaume de Rorthais
On Tue, 14 Apr 2020 12:56:49 -0400
Robert Haas  wrote:

> On Mon, Apr 13, 2020 at 5:43 PM Alvaro Herrera 
> wrote:
> > Yeah, I guess I'm just saying that it feels brittle to have a file
> > format that's supposed to be good for data exchange and then make it
> > itself depend on representation details such as the order that fields
> > appear in, the letter case, or the format of newlines.  Maybe this isn't
> > really of concern, but it seemed strange.  
> 
> I didn't want to use JSON for this at all, but I got outvoted. When I
> raised this issue, it was suggested that I deal with it in this way,
> so I did. I can't really defend it too far beyond that, although I do
> think that one nice thing about this is that you can verify the
> checksum using shell commands if you want. Just figure out the number
> of lines in the file, minus one, and do head -n$LINES backup_manifest
> | shasum -a256 and boom. If there were some whitespace-skipping thing
> figuring out how to reproduce the checksum calculation would be hard.

FWIW, shell commands (md5sum and sha*sum) read checksums from a separate file
with a very simple format: one file per line with format "CHECKSUM FILEPATH".

Thanks to json, it is fairly easy to extract checksums and filenames from the
current manifest file format and check them all with one command:

  jq -r '.Files|.[]|.Checksum+" "+.Path' backup_manifest > checksums.sha256
  sha256sum --check --quiet checksums.sha256

You can even pipe these commands together to avoid the intermediary file.

But for backup_manifest, it's kind of a shame we have to check the checksum
against a transformed version of the file. Did you consider creating e.g. a
separate backup_manifest.sha256 file?

I'm very sorry in advance if this has been discussed previously.

Regards,




Re: remove_useless_groupby_columns does not need to record constraint dependencies

2020-04-15 Thread Tom Lane
David Rowley  writes:
> Over in [1], Tom and I had a discussion in response to some confusion
> about why remove_useless_groupby_columns() goes to the trouble of
> recording a dependency on the PRIMARY KEY constraint when removing
> surplus columns from the GROUP BY clause.

> The outcome was that we don't need to do this since
> remove_useless_groupby_columns() is used only as a plan-time
> optimisation, we don't need to record any dependency.

Right.  I think it would be good for the comments to emphasize that
a relcache inval will be forced if the *index* underlying the pkey
constraint is dropped; the code doesn't care so much about the constraint
as such.  (This is also why it'd be safe to use a plain unique index
for the same optimization, assuming you can independently verify
non-nullness of the columns.  Maybe we should trash the existing coding
and just have it look for unique indexes + attnotnull flags.)

> To prevent future confusion, I'd like to remove dependency recording
> code from remove_useless_groupby_columns() and update the misleading
> comment. Likely this should also be backpatched to 9.6.

+1 for removing the dependency and improving the comments in HEAD.
Minus quite a lot for back-patching: this is not a bug fix, and
there's a nonzero risk that we've overlooked something.  I'd rather
find that out in beta testing than from bug reports against stable
branches.

regards, tom lane




Re: sqlsmith crash incremental sort

2020-04-15 Thread James Coleman
On Wed, Apr 15, 2020 at 10:47 AM Tomas Vondra
 wrote:
>
> On Tue, Apr 14, 2020 at 01:16:33PM -0400, James Coleman wrote:
> >On Sun, Apr 12, 2020 at 8:09 PM Tomas Vondra
> > wrote:
> >>
> >> On Sun, Apr 12, 2020 at 12:44:45AM +0200, Tomas Vondra wrote:
> >> >Hi,
> >> >
> >> >I've looked into this a bit, and at first I thought that maybe the
> >> >issue is in how cost_incremental_sort picks the EC members. It simply
> >> >does this:
> >> >
> >> >EquivalenceMember *member = (EquivalenceMember *)
> >> >linitial(key->pk_eclass->ec_members);
> >> >
> >> >so I was speculating that maybe there are multiple EC members and the
> >> >one we need is not the first one. That would have been easy to fix.
> >> >
> >> >But that doesn't seem to be the case - in this example the EC ony has a
> >> >single EC member anyway.
> >> >
> >> >(gdb) p key->pk_eclass->ec_members
> >> >$14 = (List *) 0x12eb958
> >> >(gdb) p *key->pk_eclass->ec_members
> >> >$15 = {type = T_List, length = 1, max_length = 5, elements = 
> >> > 0x12eb970, initial_elements = 0x12eb970}
> >> >
> >> >and the member is a Var with varno=0 (with a RelabelType on top, but
> >> >that's irrelevant).
> >> >
> >> >(gdb) p *(Var*)((RelabelType*)member->em_expr)->arg
> >> >$12 = {xpr = {type = T_Var}, varno = 0, varattno = 1, vartype = 
> >> > 12445, vartypmod = -1, varcollid = 950, varlevelsup = 0, varnosyn = 0, 
> >> > varattnosyn = 1, location = -1}
> >> >
> >> >which then triggers the assert in find_base_rel. When looking for other
> >> >places calling estimate_num_groups I found this in prepunion.c:
> >> >
> >> > * XXX you don't really want to know about this: we do the estimation
> >> > * using the subquery's original targetlist expressions, not the
> >> > * subroot->processed_tlist which might seem more appropriate.  The
> >> > * reason is that if the subquery is itself a setop, it may return a
> >> > * processed_tlist containing "varno 0" Vars generated by
> >> > * generate_append_tlist, and those would confuse estimate_num_groups
> >> > * mightily.  We ought to get rid of the "varno 0" hack, but that
> >> > * requires a redesign of the parsetree representation of setops, so
> >> > * that there can be an RTE corresponding to each setop's output.
> >> >
> >> >which seems pretty similar to the issue at hand, because the subpath is
> >> >T_UpperUniquePath (not sure if that passes as setop, but the symptoms
> >> >match nicely).
> >> >
> >> >Not sure what to do about it in cost_incremental_sort, though :-(
> >> >
> >>
> >> I've been messing with this the whole day, without much progress :-(
> >>
> >> I'm 99.% sure it's the same issue described by the quoted comment,
> >> because the plan looks like this:
> >>
> >>   Nested Loop Left Join
> >> ->  Sample Scan on pg_namespace
> >>   Sampling: system ('7.2'::real)
> >> ->  Incremental Sort
> >>   Sort Key: ...
> >>   Presorted Key: ...
> >>   ->  Unique
> >> ->  Sort
> >>   Sort Key: ...
> >>   ->  Append
> >> ->  Nested Loop
> >> ...
> >> ->  Nested Loop
> >> ...
> >>
> >> so yeah, the plan does have set operations, and generate_append_tlist
> >> does generate Vars with varno == 0, causing this issue.
> >
> >This is a bit of an oddly shaped plan anyway, right? In an ideal world
> >the sort for the unique would have knowledge about what would be
> >useful for the parent node, and we wouldn't need the incremental sort
> >at all.
> >
>
> Well, yeah. The problem is the Unique simply compares the columns in the
> order it sees them, and it does not match the column order desired by
> incremental sort. But we don't push down this information at all :-(
>
> In fact, there may be other reasons to reorder the comparisons, e.g.
> when the cost is different for different columns. There was a patch by
> Teodor IIRC correctly doing exactly that.
>
> >I'm not sure that that kind of thing is really a new problem, though,
> >and it might not even be entirely possible to fix directly by trying
> >to push down knowledge about useful sort keys to whatever created that
> >sort path; it might only be fixable by having the incremental sort (or
> >even regular sort) path creation know to "subsume" a sort underneath
> >it.
> >
> >Anyway, I think that's a bit off topic, but it stood out to me.
> >
>
> It's not a new problem. It's an optimization we don't have.
>
> >> But I'm not entirely sure what to do about it in cost_incremental_sort.
> >> The comment (introduced by 89deca582a in 2017) suggests a proper fix
> >> would require redesigning the parsetree representation of setops, and
> >> it's a bit too late for that.
> >>
> >> So I wonder what a possible solution might look like. I was hoping we
> >> might grab the original target list and use that, simi

Re: Race condition in SyncRepGetSyncStandbysPriority

2020-04-15 Thread Tom Lane
Kyotaro Horiguchi  writes:
> At Tue, 14 Apr 2020 16:32:40 -0400, Tom Lane  wrote in 
> + stby->is_sync_standby = true;   /* might change below */

> I'm uneasy with that.  In quorum mode all running standbys are marked
> as "sync" and that's bogus.

I don't follow that?  The existing coding of SyncRepGetSyncStandbysQuorum
returns all the candidates in its list, so this is isomorphic to that.

Possibly a different name for the flag would be more suited?

> On the other hand sync_standbys is already sorted in priority order so I 
> think we can get rid of the member by setting *am_sync as the follows.

> SyncRepGetSyncRecPtr:
>   if (sync_standbys[i].is_me)
>   {
>   *am_sync = (i < SyncRepConfig->num_sync);
>   break;
>   }

I disagree with this, it will change the behavior in the quorum case.

In any case, a change like this will cause callers to know way more than
they ought to about the ordering of the array.  In my mind, the fact that
SyncRepGetSyncStandbysPriority is sorting the array is an internal
implementation detail; I do not want it to be part of the API.

(Apropos to that, I realized from working on this patch that there's
another, completely undocumented assumption in the existing code, that
the integer list will be sorted by walsender index for equal priorities.
I don't like that either, and not just because it's undocumented.)

regards, tom lane




Re: Poll: are people okay with function/operator table redesign?

2020-04-15 Thread Isaac Morland
On Wed, 15 Apr 2020 at 11:26, Pierre Giraud 
wrote:


> The best way to achieve this is to use some styling (font style and color).
>
> Attached you will find two different options I worked on very quickly.
>

I really like the first. Just a couple of suggestions I would make:

- leave a space between the function name and (. Regardless of opinions on
what source code should look like, your documentation has space between
each parameter and the next one, between the ) and the ->, and between the
-> and the return type, so it seems crowded not to have space between the
function name and the (.
- At this point it's not really a table any more; I would get rid of the
lines, maybe tweak the spacing, and possibly use <dl> (definition
list) rather than table-related HTML elements. See
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/dl.

I think the bolding really makes stand out the crucial parts one needs to
find.


Re: Poll: are people okay with function/operator table redesign?

2020-04-15 Thread Pavel Stehule
On Wed, 15 Apr 2020 at 17:43, Isaac Morland 
wrote:

> On Wed, 15 Apr 2020 at 11:26, Pierre Giraud 
> wrote:
>
>
>> The best way to achieve this is to use some styling (font style and
>> color).
>>
>> Attached you will find two different options I worked on very quickly.
>>
>
> I really like the first. Just a couple of suggestions I would make:
>

yes, it is very well readable

Pavel


> - leave a space between the function name and (. Regardless of opinions on
> what source code should look like, your documentation has space between
> each parameter and the next one, and between the ) and the -> and the ->.
> and the return type so it seems crowded not to have space between the
> function name and the (.
> - At this point it's not really a table any more; I would get rid of the
> lines, maybe tweak the spacing, and possibly use(definition
> list) rather than table-related HTML elements. See
> https://developer.mozilla.org/en-US/docs/Web/HTML/Element/dl.
>
> I think the bolding really makes stand out the crucial parts one needs to
> find.
>
>


design for parallel backup

2020-04-15 Thread Robert Haas
Hi,

Over at 
http://postgr.es/m/CADM=JehKgobEknb+_nab9179HzGj=9eitzwmod2mpqr_rif...@mail.gmail.com
there's a proposal for a parallel backup patch which works in the way
that I have always thought parallel backup would work: instead of
having a monolithic command that returns a series of tarballs, you
request individual files from a pool of workers. Leaving aside the
quality-of-implementation issues in that patch set, I'm starting to
think that the design is fundamentally wrong and that we should take a
whole different approach. The problem I see is that it makes a
parallel backup and a non-parallel backup work very differently, and
I'm starting to realize that there are good reasons why you might want
them to be similar.

Specifically, as Andres recently pointed out[1], almost anything that
you might want to do on the client side, you might also want to do on
the server side. We already have an option to let the client compress
each tarball, but you might also want the server to, say, compress
each tarball[2]. Similarly, you might want either the client or the
server to be able to encrypt each tarball, or compress but with a
different compression algorithm than gzip. If, as is presently the
case, the server is always returning a set of tarballs, it's pretty
easy to see how to make this work in the same way on either the client
or the server, but if the server returns a set of tarballs in
non-parallel backup cases, and a set of tarballs in parallel backup
cases, it's a lot harder to see how that any sort of server-side
processing should work, or how the same mechanism could be used on
either the client side or the server side.

So, my new idea for parallel backup is that the server will return
tarballs, but just more of them. Right now, you get base.tar and
${tablespace_oid}.tar for each tablespace. I propose that if you do a
parallel backup, you should get base-${N}.tar and
${tablespace_oid}-${N}.tar for some or all values of N between 1 and
the number of workers, with the server deciding which files ought to
go in which tarballs. This is more or less the naming convention that
BART uses for its parallel backup implementation, which, incidentally,
I did not write. I don't really care if we pick something else, but it
seems like a sensible choice. The reason why I say "some or all" is
that some workers might not get any of the data for a given
tablespace. In fact, it's probably desirable to have different workers
work on different tablespaces as far as possible, to maximize parallel
I/O, but it's quite likely that you will have more workers than
tablespaces. So you might end up, with pg_basebackup -j4, having the
server send you base-1.tar and base-2.tar and base-4.tar, but not
base-3.tar, because worker 3 spent all of its time on user-defined
tablespaces, or was just out to lunch.

Now, if you use -Fp, those tar files are just going to get extracted
anyway by pg_basebackup itself, so you won't even know they exist.
However, if you use -Ft, you're going to end up with more files than
before. This seems like something of a wart, because you wouldn't
necessarily expect that the set of output files produced by a backup
would depend on the degree of parallelism used to take it. However,
I'm not sure I see a reasonable alternative. The client could try to
glue all of the related tar files sent by the server together into one
big tarfile, but that seems like it would slow down the process of
writing the backup by forcing the different server connections to
compete for the right to write to the same file. Moreover, if you end
up needing to restore the backup, having a bunch of smaller tar files
instead of one big one means you can try to untar them in parallel if
you like, so it seems not impossible that it could be advantageous to
have them split in that case as well.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] http://postgr.es/m/20200412191702.ul7ohgv5gus3t...@alap3.anarazel.de
[2] 
https://www.postgresql.org/message-id/20190823172637.GA16436%40tamriel.snowman.net




Re: documenting the backup manifest file format

2020-04-15 Thread Robert Haas
On Wed, Apr 15, 2020 at 11:23 AM Jehan-Guillaume de Rorthais
 wrote:
> But for backup_manifest, it's kind of shame we have to check the checksum
> against an transformed version of the file. Did you consider creating eg. a
> separate backup_manifest.sha256 file?
>
> I'm very sorry in advance if this has been discussed previously.

It was briefly mentioned in the original (lengthy) discussion, but I
think there was one vote in favor and two votes against or something
like that, so it didn't go anywhere. I didn't realize that there were
handy command-line tools for manipulating json like that, or I
probably would have considered that idea more strongly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Poll: are people okay with function/operator table redesign?

2020-04-15 Thread Robert Haas
On Wed, Apr 15, 2020 at 11:54 AM Pavel Stehule  wrote:
> st 15. 4. 2020 v 17:43 odesílatel Isaac Morland  
> napsal:
>> On Wed, 15 Apr 2020 at 11:26, Pierre Giraud  wrote:
>>> The best way to achieve this is to use some styling (font style and color).
>>>
>>> Attached you will find two different options I worked on very quickly.
>>
>> I really like the first. Just a couple of suggestions I would make:
>
> yes, it is very well readable

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Parallel copy

2020-04-15 Thread Andres Freund
On 2020-04-15 10:12:14 -0400, Robert Haas wrote:
> On Wed, Apr 15, 2020 at 7:15 AM Amit Kapila  wrote:
> > As I understand this, it needs to parse the lines twice (second time
> > in phase-3) and till the first two phases are over, we can't start the
> > tuple processing work which is done in phase-3.  So even if the
> > tokenization is done a bit faster but we will lose some on processing
> > the tuples which might not be an overall win and in fact, it can be
> > worse as compared to the single reader approach being discussed.
> > Now, if the work done in tokenization is a major (or significant)
> > portion of the copy then thinking of such a technique might be useful
> > but that is not the case as seen in the data shared above (the
> > tokenize time is very less as compared to data processing time) in
> > this email.
> 
> It seems to me that a good first step here might be to forget about
> parallelism for a minute and just write a patch to make the line
> splitting as fast as possible.

+1

Compared to all the rest of the efforts during COPY a fast "split rows"
implementation should not be a bottleneck anymore.




Re: Poll: are people okay with function/operator table redesign?

2020-04-15 Thread Steven Pousty
Is there a way to get a heavier line between each function? It would be
helpful to have a clearer demarcation of what belongs to each function.

On Wed, Apr 15, 2020 at 9:04 AM Robert Haas  wrote:

> On Wed, Apr 15, 2020 at 11:54 AM Pavel Stehule 
> wrote:
> > st 15. 4. 2020 v 17:43 odesílatel Isaac Morland 
> napsal:
> >> On Wed, 15 Apr 2020 at 11:26, Pierre Giraud 
> wrote:
> >>> The best way to achieve this is to use some styling (font style and
> color).
> >>>
> >>> Attached you will find two different options I worked on very quickly.
> >>
> >> I really like the first. Just a couple of suggestions I would make:
> >
> > yes, it is very well readable
>
> +1.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>
>


Re: Parallel copy

2020-04-15 Thread Andres Freund
Hi,

On 2020-04-15 20:36:39 +0530, Kuntal Ghosh wrote:
> I was thinking from this point of view - the sooner we introduce
> parallelism in the process, the greater the benefits.

I don't really agree. Sure, that's true from a theoretical perspective,
but the incremental gains may be very small, and the cost in complexity
very high. If we can get single threaded splitting of rows to be >4GB/s,
which should very well be attainable, the rest of the COPY work is going
to dominate the time.  We shouldn't add complexity to parallelize more
of the line splitting, caring too much about scalable datastructures,
etc when the bottleneck after some straightforward optimization is
usually still in the parallelized part.

I'd expect that for now we'd likely hit scalability issues in other
parts of the system first (e.g. extension locks, buffer mapping).

Greetings,

Andres Freund




Re: Poll: are people okay with function/operator table redesign?

2020-04-15 Thread Tom Lane
Steven Pousty  writes:
> Is there a way to get a heavier line between each function? It would be
> helpful to have a clearer demarcation of what belongs to each function.

The first alternative I posted at

https://www.postgresql.org/message-id/31833.1586817876%40sss.pgh.pa.us

seems like it would accomplish that pretty well, by having lines
*only* between functions.  The last couple of things that have been
posted seem way more cluttered than that one.

regards, tom lane




Re: Incremental sorts and EXEC_FLAG_REWIND

2020-04-15 Thread James Coleman
On Wed, Apr 15, 2020 at 11:02 AM James Coleman  wrote:
>
> On Tue, Apr 14, 2020 at 2:53 AM Michael Paquier  wrote:
> >
> > Hi,
> >
> > When initializing an incremental sort node, we have the following as
> > of ExecInitIncrementalSort():
> > /*
> >  * Incremental sort can't be used with either EXEC_FLAG_REWIND,
> >  * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we only one of many 
> > sort
> >  * batches in the current sort state.
> >  */
> >  Assert((eflags & (EXEC_FLAG_BACKWARD |
> >EXEC_FLAG_MARK)) == 0);
> > While I don't quite follow why EXEC_FLAG_REWIND should be allowed here
> > to begin with (because incremental sorts don't support rescans without
> > parameter changes, right?), the comment and the assertion are telling
> > a different story.
>
> I remember changing this assertion in response to an issue I'd found
> which led to rewriting the rescan implementation, but I must have
> missed updating the comment.

All right, here are the most relevant messages:

[1]: Here I'd said:
--
While working on finding a test case to show rescan isn't implemented
properly yet, I came across a bug. At the top of
ExecInitIncrementalSort, we assert that eflags does not contain
EXEC_FLAG_REWIND. But the following query (with merge and hash joins
disabled) breaks that assertion:

select * from t join (select * from t order by a, b) s on s.a = t.a
where t.a in (1,2);

The comments about this flag in src/include/executor/executor.h say:

* REWIND indicates that the plan node should try to efficiently support
* rescans without parameter changes. (Nodes must support ExecReScan calls
* in any case, but if this flag was not given, they are at liberty to do it
* through complete recalculation. Note that a parameter change forces a
* full recalculation in any case.)

Now we know that except in rare cases (as just discussed recently up
thread) we can't implement rescan efficiently.

So is this a planner bug (i.e., should we try not to generate
incremental sort plans that require efficient rewind)? Or can we just
remove that part of the assertion and know that we'll implement the
rescan, albeit inefficiently? We already explicitly declare that we
don't support backwards scanning, but I don't see a way to declare the
same for rewind.
--

So it seems to me that we can't disallow REWIND, and we have to
support rescan, but, we could try to mitigate the effects (without a
param change) with a materialize node, as noted below.

[2]: Here, in response to my questioning above if this was a planner
bug, I'd said:
--
Other nodes seem to get a materialization node placed above them to
support this case "better". Is that something we should be doing?
--

I never got any reply on this point; if we _did_ introduce a
materialize node here, then it would mean we could start disallowing
REWIND again. See the email for full details of a specific plan that I
encountered that reproduced this.

Thoughts?

> In the meantime, your question is primarily about making sure the
> code/comments/etc. are consistent and not a behavioral problem or
> failure you've seen in testing?

Still want to confirm this is the case.

James

[1]: 
https://www.postgresql.org/message-id/CAAaqYe9%2Bap2SbU_E2WaC4F9ZMF4oa%3DpJZ1NBwaKDMP6GFUA77g%40mail.gmail.com
[2]: 
https://www.postgresql.org/message-id/CAAaqYe-sOp2o%3DL7nvGZDJ6GsL9%3Db_ztrGE1rhyi%2BF82p3my2bQ%40mail.gmail.com




Re: Parallel copy

2020-04-15 Thread Andres Freund
Hi,

On 2020-04-15 12:05:47 +0300, Ants Aasma wrote:
> I see the benefit of having one process responsible for splitting as
> being able to run ahead of the workers to queue up work when many of
> them need new data at the same time.

Yea, I agree.


> I don't think the locking benefits of a ring are important in this
> case. At current rather conservative chunk sizes we are looking at
> ~100k chunks per second at best, normal locking should be perfectly
> adequate. And chunk size can easily be increased. I see the main value
> in it being simple.

I think not needing to hold a lock *while* splitting (as we'd need in
some proposals floated earlier) is already likely to be beneficial. I
don't think we need to worry about the scalability of the lock
protecting the queue of already-split data, for now.

I don't think we really want a much larger chunk size, btw. It makes it
more likely that the data handed to different workers takes an uneven
amount of time to process.


> But there is a point that having a layer of indirection instead of a
> linear buffer allows for some workers to fall behind.

Yea. It'd probably make sense to read the input data into an array of
evenly sized blocks, and have the datastructure (still think a
ringbuffer makes sense) of split boundaries point into those entries. If
we don't require the input blocks to be in-order in that array, we can
reuse blocks therein that are fully processed, even if "earlier" data in
the input has not yet been fully processed.


> With a ring buffer reading has to wait on the slowest worker reading
> its chunk.

To be clear, I was only thinking of using a ringbuffer to indicate split
boundaries. And that workers would just pop entries from it before they
actually process the data (stored outside of the ringbuffer). Since the
split boundaries will always be read in order by workers, and the
entries will be tiny, there's no need to avoid copying out entries.


So basically what I was thinking we *eventually* may want (I'd forgo some
of this initially) is something like:

struct InputBlock
{
uint32 unprocessed_chunk_parts;
uint32 following_block;
char data[INPUT_BLOCK_SIZE]
};

// array of input data, with > 2*nworkers entries
InputBlock *input_blocks;

struct ChunkedInputBoundary
{
uint32 firstblock;
uint32 startoff;
};

struct ChunkedInputBoundaries
{
uint32 read_pos;
uint32 write_end;
ChunkedInputBoundary ring[RINGSIZE];
};

Where the leader would read data into InputBlocks with
unprocessed_chunk_parts == 0. Then it'd split the read input data into
chunks (presumably with chunk size << input block size), putting
identified chunks into ChunkedInputBoundaries. For each
ChunkedInputBoundary it'd increment the unprocessed_chunk_parts of each
InputBlock containing parts of the chunk.  For chunks across >1
InputBlocks each InputBlock's following_block would be set accordingly.

Workers would just pop an entry from the ringbuffer (making that entry
reusable), and process the chunk. The underlying data would not be
copied out of the InputBlocks, but obviously readers would need to take
care to handle InputBlock boundaries. Whenever a chunk is fully read, or
when crossing a InputBlock boundary, the InputBlock's
unprocessed_chunk_parts would be decremented.

Recycling of InputBlocks could probably just be an occasional linear
search for buffers with unprocessed_chunk_parts == 0.


Something roughly like this should not be too complicated to
implement. Unless extremely unlucky (very wide input data spanning many
InputBlocks) a straggling reader would not prevent global progress, it'd
just prevent reuse of the InputBlocks with data for its chunk (normally
that'd be two InputBlocks, not more).


> Having workers copy the data to a local buffer as the first
> step would reduce the probability of hitting any issues. But still, at
> GB/s rates, hiding a 10ms timeslice of delay would need 10's of
> megabytes of buffer.

Yea. Given the likelihood of blocking on resources (reading in index
data, writing out dirty buffers for reclaim, row locks for uniqueness
checks, extension locks, ...), as well as non uniform per-row costs
(partial indexes, index splits, ...) I think we ought to try to cope
well with that. IMO/IME it'll be common to see stalls that are much
longer than 10ms for processes that do COPY, even when the system is not
overloaded.


> FWIW. I think just increasing the buffer is good enough - the CPUs
> processing this workload are likely to have tens to hundreds of
> megabytes of cache on board.

It'll not necessarily be a cache shared between leader / workers though,
and some of the cache-cache transfers will be more expensive even within
a socket (between core complexes for AMD, multi chip processors for
Intel).

Greetings,

Andres Freund




Re: PG compilation error with Visual Studio 2015/2017/2019

2020-04-15 Thread Juan José Santamaría Flecha
On Wed, Apr 15, 2020 at 4:46 PM Ranier Vilela  wrote:

> Em qua., 15 de abr. de 2020 às 03:08, davinder singh <
> davindersingh2...@gmail.com> escreveu:
>
>>
>> 5. Why call _create_locale if _WIN32_WINNT >= 0x0600 is true and loct is
>>> not used?
>>>
>> _create_locale can take bigger input than GetLocaleInfoEx. But we are
>> interested in
>> *language[_country-region[.code-page]]*. We are using _create_locale to
>> validate
>> the given input. The reason is we can't verify the locale name if it is
>> appended with
>> code-page by using GetLocaleInfoEx. So before parsing, we verify if the
>> whole input
>> locale name is valid by using _create_locale. I hope that answers your
>> question.
>>
> Understood. In this case, _create_locale, is being used only to validate
> the input.
> Perhaps, in addition, you could create an additional function, which only
> validates winlocname, without having to create structures or use malloc, to
> be used when _WIN32_WINNT> = 0x0600 is true, but it is only a suggestion,
> if you think it is necessary.
>

Looking at the comments for IsoLocaleName() I see: "MinGW headers declare
_create_locale(), but msvcrt.dll lacks that symbol". This is outdated
[1][2], and  _create_locale() could be used from Windows 8, but I think we
should use GetLocaleInfoEx() as a complete alternative to  _create_locale().
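
For anyone unfamiliar with that API, a minimal (and purely hypothetical,
not the attached patch) standalone check of whether Windows recognizes a
locale name could look like this; note that it does not accept the legacy
language_country.codepage form, which is the parsing question above:

#include <windows.h>
#include <stdio.h>
#include <wchar.h>

static int
locale_name_is_valid(const wchar_t *name)
{
	wchar_t		buf[LOCALE_NAME_MAX_LENGTH];

	/* GetLocaleInfoEx() returns 0 for names Windows does not know about */
	return GetLocaleInfoEx(name, LOCALE_SNAME, buf, LOCALE_NAME_MAX_LENGTH) > 0;
}

int
main(void)
{
	wprintf(L"en-US: %d\n", locale_name_is_valid(L"en-US"));
	wprintf(L"no-such-locale: %d\n", locale_name_is_valid(L"no-such-locale"));
	return 0;
}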

Please find attached a patch doing so.

[1]
https://sourceforge.net/p/mingw-w64/mailman/mingw-w64-public/?limit=250&page=2
[2] https://github.com/mirror/mingw-w64/commit/b508bb87ad

Regards,

Juan José Santamaría Flecha

>
>


0001-PG-compilation-error-with-VS-2015-2017-2019_v05.patch
Description: Binary data


Re: Poll: are people okay with function/operator table redesign?

2020-04-15 Thread David G. Johnston
On Mon, Apr 13, 2020 at 10:13 AM Tom Lane  wrote:

> As discussed in the thread at [1], I've been working on redesigning
> the tables we use to present SQL functions and operators.  The
> first installment of that is now up; see tables 9.30 and 9.31 at
>
> https://www.postgresql.org/docs/devel/functions-datetime.html
>
> and table 9.33 at
>
> https://www.postgresql.org/docs/devel/functions-enum.html
>
>
The centering of the headers doesn't do it for me.  Too much gap and the
data itself is not centered so there is a large disconnect between the
headers and the values.

The run-on aspect of the left-aligned setup is of some concern but maybe
just adding some left padding to the second column - and right padding to
the first - can provide the desired negative space without adding so much
as to break usability.

David J.


Re: PG compilation error with Visual Studio 2015/2017/2019

2020-04-15 Thread Ranier Vilela
On Wed, Apr 15, 2020 at 15:28, Juan José Santamaría Flecha <
juanjo.santama...@gmail.com> wrote:

>
>
> On Wed, Apr 15, 2020 at 4:46 PM Ranier Vilela  wrote:
>
>> Em qua., 15 de abr. de 2020 às 03:08, davinder singh <
>> davindersingh2...@gmail.com> escreveu:
>>
>>>
>>> 5. Why call _create_locale if _WIN32_WINNT >= 0x0600 is true and loct is
 not used?

>>> _create_locale can take bigger input than GetLocaleInfoEx. But we are
>>> interested in
>>> *language[_country-region[.code-page]]*. We are using _create_locale to
>>> validate
>>> the given input. The reason is we can't verify the locale name if it is
>>> appended with
>>> code-page by using GetLocaleInfoEx. So before parsing, we verify if the
>>> whole input
>>> locale name is valid by using _create_locale. I hope that answers your
>>> question.
>>>
>> Understood. In this case, _create_locale, is being used only to validate
>> the input.
>> Perhaps, in addition, you could create an additional function, which only
>> validates winlocname, without having to create structures or use malloc, to
>> be used when _WIN32_WINNT> = 0x0600 is true, but it is only a suggestion,
>> if you think it is necessary.
>>
>
> Looking at the comments for IsoLocaleName() I see: "MinGW headers declare
> _create_locale(), but msvcrt.dll lacks that symbol". This is outdated
> [1][2], and  _create_locale() could be used from Windows 8, but I think we
> should use GetLocaleInfoEx() as a complete alternative to  _create_locale().
>
Sounds good to me; maybe with the exception of logging an error in case of failure?

regards,
Ranier Vilela


Re: cleaning perl code

2020-04-15 Thread Andrew Dunstan

On 4/14/20 4:44 PM, Alvaro Herrera wrote:
> On 2020-Apr-14, Andrew Dunstan wrote:
>
>> One of the things that's a bit sad is that perlcritic doesn't generally
>> let you apply policies to a given set of files or files matching some
>> pattern. It would be nice, for instance, to be able to apply some
>> additional standards to strategic library files like PostgresNode.pm,
>> TestLib.pm and Catalog.pm. There are good reasons as suggested upthread
>> to apply higher standards to library files than to, say, a TAP test
>> script. The only easy way I can see to do that would be to have two
>> different perlcriticrc files and adjust pgperlcritic to make two runs.
>> If people think that's worth it I'll put a little work into it. If not,
>> I'll just leave things here.
> I think being more strict about it in strategic files (I'd say that's
> Catalog.pm plus src/test/perl/*.pm) might be a good idea.  Maybe give it
> a try and see what comes up.
>


OK, in fact those files are in reasonably good shape. I also took a pass
through the library files in src/tools/msvc, which had a few more issues.


Here's a patch that does the stricter testing for those library files,
and fixes them so we get a clean pass.


This brings to an end my perl gardening project.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

diff --git a/src/backend/catalog/Catalog.pm b/src/backend/catalog/Catalog.pm
index dd39a086ce..bd9eac0c80 100644
--- a/src/backend/catalog/Catalog.pm
+++ b/src/backend/catalog/Catalog.pm
@@ -67,7 +67,7 @@ sub ParseHeader
 		if (!$is_client_code)
 		{
 			# Strip C-style comments.
-			s;/\*(.|\n)*\*/;;g;
+			s;/\*(?:.|\n)*\*/;;g;
 			if (m;/\*;)
 			{
 
@@ -260,7 +260,9 @@ sub ParseData
 # We're treating the input line as a piece of Perl, so we
 # need to use string eval here. Tell perlcritic we know what
 # we're doing.
-eval '$hash_ref = ' . $_;   ## no critic (ProhibitStringyEval)
+## no critic (ProhibitStringyEval)
+## no critic (RequireCheckingReturnValueOfEval)
+eval '$hash_ref = ' . $_;
 if (!ref $hash_ref)
 {
 	die "$input_file: error parsing line $.:\n$_\n";
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 1d5450758e..5249053ee2 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -385,7 +385,7 @@ sub set_replication_conf
 	$self->host eq $test_pghost
 	  or croak "set_replication_conf only works with the default host";
 
-	open my $hba, '>>', "$pgdata/pg_hba.conf";
+	open my $hba, '>>', "$pgdata/pg_hba.conf" || die;
 	print $hba "\n# Allow replication (set up by PostgresNode.pm)\n";
 	if ($TestLib::windows_os && !$TestLib::use_unix_sockets)
 	{
@@ -439,7 +439,7 @@ sub init
 	TestLib::system_or_bail($ENV{PG_REGRESS}, '--config-auth', $pgdata,
 		@{ $params{auth_extra} });
 
-	open my $conf, '>>', "$pgdata/postgresql.conf";
+	open my $conf, '>>', "$pgdata/postgresql.conf" || die;
 	print $conf "\n# Added by PostgresNode.pm\n";
 	print $conf "fsync = off\n";
 	print $conf "restart_after_crash = off\n";
@@ -1254,7 +1254,7 @@ END
 		$node->clean_node if $exit_code == 0 && TestLib::all_tests_passing();
 	}
 
-	$? = $exit_code;
+	$? = $exit_code;  ## no critic (RequireLocalizedPunctuationVars)
 }
 
 =pod
@@ -1462,8 +1462,8 @@ sub psql
 	# https://metacpan.org/pod/release/ETHER/Try-Tiny-0.24/lib/Try/Tiny.pm
 	do
 	{
-		local $@;
-		eval {
+		local $@ = "";
+		eval { ## no critic (RequireCheckingReturnValueOfEval)
 			my @ipcrun_opts = (\@psql_params, '<', \$sql);
 			push @ipcrun_opts, '>',  $stdout if defined $stdout;
 			push @ipcrun_opts, '2>', $stderr if defined $stderr;
@@ -2074,8 +2074,8 @@ sub pg_recvlogical_upto
 
 	do
 	{
-		local $@;
-		eval {
+		local $@ = "";
+		eval { ## no critic (RequireCheckingReturnValueOfEval)
 			IPC::Run::run(\@cmd, ">", \$stdout, "2>", \$stderr, $timeout);
 			$ret = $?;
 		};
diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm
index 0e6c4819e4..fd3bbc1979 100644
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -162,6 +162,8 @@ INIT
 	open my $testlog, '>', $test_logfile
 	  or die "could not open STDOUT to logfile \"$test_logfile\": $!";
 
+	# don't need to  check the result of these dup operations
+	## no critic (RequireCheckedOpen)
 	# Hijack STDOUT and STDERR to the log file
 	open(my $orig_stdout, '>&', \*STDOUT);
 	open(my $orig_stderr, '>&', \*STDERR);
@@ -409,7 +411,7 @@ Return the full contents of the specified file.
 sub slurp_file
 {
 	my ($filename) = @_;
-	local $/;
+	local $/ = undef;
 	my $contents;
 	if ($Config{osname} ne 'MSWin32')
 	{
diff --git a/src/tools/msvc/Install.pm b/src/tools/msvc/Install.pm
index e65ac6fc66..34797d97c3 100644
--- a/src/tools/msvc/Install.pm
+++ b/src/tools/msvc/Install.pm
@@ -12,9 +12,8 @@ use File::Basename;
 use File::Copy;
 use File::Find ();
 
-use Exporter;
-our (@ISA, @EXPORT_OK);
-@ISA   = qw(Ex

Re: Parallel copy

2020-04-15 Thread Kuntal Ghosh
On Wed, Apr 15, 2020 at 10:45 PM Andres Freund  wrote:
>
> Hi,
>
> On 2020-04-15 20:36:39 +0530, Kuntal Ghosh wrote:
> > I was thinking from this point of view - the sooner we introduce
> > parallelism in the process, the greater the benefits.
>
> I don't really agree. Sure, that's true from a theoretical perspective,
> but the incremental gains may be very small, and the cost in complexity
> very high. If we can get single threaded splitting of rows to be >4GB/s,
> which should very well be attainable, the rest of the COPY work is going
> to dominate the time.  We shouldn't add complexity to parallelize more
> of the line splitting, caring too much about scalable datastructures,
> etc when the bottleneck after some straightforward optimization is
> usually still in the parallelized part.
>
> I'd expect that for now we'd likely hit scalability issues in other
> parts of the system first (e.g. extension locks, buffer mapping).
>
Got your point. In this particular case, a single producer is fast
enough (or probably we can make it fast enough) to generate enough
chunks for multiple consumers so that they don't stay idle and wait
for work.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




tadarida vs REL_12_STABLE

2020-04-15 Thread Tom Turelinckx
Hi,

> On 2020-Apr-13, I wrote to buildfarm-admins:
> > As skate and snapper are increasingly difficult to keep alive and 
> > building on debian sparc, and as there aren't many sparc animals 
> > in general, I've set up four new debian sparc64 animals, two on 
> > stretch and two on buster.
> > 
> > All four animals are already successfully building all current 
> > branches, just not submitting results yet.

It turns out I'd missed one failure:

https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=tadarida&br=REL_12_STABLE

Only tadarida fails the sequence regression test, and only 
for REL_12_STABLE. It fails with -O2 and -O1 but succeeds with -O0.

Other than that, all branches for all four animals succeed:

> mussurana | Debian | 9 Stretch | gcc | 6 | with cassert
> tadarida | Debian | 9 Stretch | gcc | 6 | without cassert
> ibisbill | Debian | 10 Buster | gcc | 8 | with cassert
> kittiwake | Debian | 10 Buster | gcc | 8 | without cassert

Both mussurana and tadarida are Stretch 9.12 with gcc 6.3.0-18+deb9u1.
There is no newer gcc source pkg for stretch or for stretch-backports.

Should I keep the -O0 flag for REL_12_STABLE for tadarida?

Thanks,
Tom




Re: pgsql: Rationalize GetWalRcv{Write,Flush}RecPtr().

2020-04-15 Thread Alvaro Herrera
On 2020-Apr-09, Alvaro Herrera wrote:

> It seems worth pointing out that the new GetWalRcvWriteRecPtr function
> has a different signature from the original one -- so any third-party
> code using the original function will now get a compile failure that
> should alert them that they need to change their code to call
> GetWalRcvFlushRecPtr instead.  Maybe we should add a line or two in the
> comments GetWalRcvWriteRecPtr to make this explicit.

After using codesearch.debian.net and finding no results, I decided that
this is not worth the effort.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Comment explaining why vacuum needs to push snapshot seems insufficient.

2020-04-15 Thread Alvaro Herrera
On 2020-Apr-05, Andres Freund wrote:

> vacuum_rel() has the following comment:
>   /*
>* Functions in indexes may want a snapshot set.  Also, setting a 
> snapshot
>* ensures that RecentGlobalXmin is kept truly recent.
>*/
>   PushActiveSnapshot(GetTransactionSnapshot());
> 
> which was added quite a while ago in
> 
> commit d53a56687f3d4772d17ffa0013a33231b7163731

Note that what that commit did was change the snapshot acquisition from
occurring solely during vacuum full to being acquired always -- and
added a small additional bit of knowledge:

-   if (vacstmt->full)
-   {
-   /* functions in indexes may want a snapshot set */
-   PushActiveSnapshot(GetTransactionSnapshot());
-   }
-   else
+   /*
+* Functions in indexes may want a snapshot set.  Also, setting
+* a snapshot ensures that RecentGlobalXmin is kept truly recent.
+*/
+   PushActiveSnapshot(GetTransactionSnapshot());

so I wouldn't blame that commit for failing to understand all the side
effects of acquiring a snapshot there and then.

> But to me that's understating the issue. Don't we e.g. need a snapshot
> to ensure that pg_subtrans won't be truncated away?  I thought xlog.c
> doesn't pass PROCARRAY_FLAGS_VACUUM to GetOldestXmin when truncating?
>   TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));

Nice find.  This bug would probably be orders-of-magnitude easier to hit
now than it was in 2008 -- given both the hardware advances and the
increased transaction rates.

> Also, without an xmin set, it seems possible that vacuum could see some
> of the transaction ids it uses (in local memory) wrap around.

Ditto.

> How about replacing it with something like
>   /*
>* Need to acquire a snapshot to prevent pg_subtrans from being 
> truncated,
>* cutoff xids in local memory wrapping around, and to have updated xmin
>* horizons.
>*/

"While we don't typically need a snapshot, we need the side effects of
acquiring one: having an xmin prevents concurrent pg_subtrans truncation
and prevents our cutoff Xids from becoming wrapped-around; this also
updates our Xmin horizons.  Lastly, functions in indexes may want a
snapshot set."

(You omitted that last point without explanation -- maybe on purpose?
is it no longer needed?)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: where should I stick that backup?

2020-04-15 Thread Andres Freund
Hi,

On 2020-04-15 09:23:30 -0400, Robert Haas wrote:
> On Tue, Apr 14, 2020 at 9:50 PM Andres Freund  wrote:
> > On 2020-04-14 11:38:03 -0400, Robert Haas wrote:
> > > I'm fairly deeply uncomfortable with what Andres is proposing. I see
> > > that it's very powerful, and can do a lot of things, and that if
> > > you're building something that does sophisticated things with storage,
> > > you probably want an API like that. It does a great job making
> > > complicated things possible. However, I feel that it does a lousy job
> > > making simple things simple.
> >
> > I think it's pretty much exactly the opposite. Your approach seems to
> > move all the complexity to the user, having to build entire combination
> > of commands themselves.  Instead of having one or two default commands
> > that do backups in common situations, everyone has to assemble them from
> > pieces.
> 
> I think we're mostly talking about different things.

That certainly would explain some misunderstandings ;)


I mostly still am trying to define where we eventually want to be on a
medium to high level. And I don't think we really have agreement on
that.  My original understanding of your eventual goal is that it's the
example invocation of pg_basebackup upthread, namely a bunch of shell
arguments to pg_basebackup.  And I don't think that's good enough. IMO
it only really make to design incremental steps after we have a rough
agreement on the eventual goal. Otherwise we'll just end up supporting
the outcomes of missteps for a long time.


> I was speaking mostly about the difficulty of developing it. I agree
> that a project which is easier to develop is likely to provide fewer
> benefits to the end user. On the other hand, it might be more likely
> to get done, and projects that don't get done provide few benefits to
> users. I strongly believe we need an incremental approach here.

I agree.  My concern is just that we should not expose things to the
user that will make it much harder to evolve going forward.


> I'm not against adding more built-in compression algorithms, but I
> also believe (as I have several times now) that the world moves a lot
> faster than PostgreSQL, which has not added a single new compression
> algorithm to pg_basebackup ever. We had 1 compression algorithm in
> 2011, and we still have that same 1 algorithm today. So, either nobody
> cares, or adding new algorithms is sufficiently challenging - for
> either technical or political reasons - that nobody's managed to get
> it done.

Imo most of the discussion has been around toast, and there the
situation imo is much more complicated than just about adding the
compression algorithm.  I don't recall a discussion about adding an
optional dependency to other compression algorithms to pg_basebackup
that didn't go anywhere for either technical or political reasons.


> I think having a simple framework in pg_basebackup for plugging in new
> algorithms would make it noticeably simpler to add LZ4 or whatever
> your favorite compression algorithm is. And I think having that
> framework also be able to use shell commands, so that users don't have
> to wait a decade or more for new choices to show up, is also a good
> idea.

As long as here's sensible defaults, and so that the user doesn't have
to specify paths to binaries for the common cases, I'm OK with that. I'm
not ok with requiring the user to specify shell fragments for things
that should be built in.

If we think the appropriate way to implement extensible compression is
by piping to commandline binaries ([1]), I'd imo e.g. be ok if we had a
builtin list of [{fileending, shell-fragment-for-compression}] that is
filled with appropriate values detected at build time for a few common
cases. But then also allowed adding new methods via commandline options.
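
To be concrete, the kind of builtin table I mean could look roughly like
the sketch below; the struct, the entries and the idea of filling them in
at build time are assumptions, not an existing pg_basebackup interface:

#include <stdio.h>

typedef struct CompressionMethod
{
	const char *suffix;			/* file ending, e.g. ".gz" */
	const char *shell_fragment;	/* command the tar stream is piped through */
} CompressionMethod;

static const CompressionMethod built_in_methods[] = {
	{".gz", "gzip -c"},
	{".lz4", "lz4 -q -c"},
	{".zst", "zstd -q -c"},
};

int
main(void)
{
	/* additional user-specified {suffix, fragment} pairs would be appended here */
	for (size_t i = 0; i < sizeof(built_in_methods) / sizeof(built_in_methods[0]); i++)
		printf("%-5s -> %s\n", built_in_methods[i].suffix,
			   built_in_methods[i].shell_fragment);
	return 0;
}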


I guess what I perceived to be the fundamental difference, before this
email, between our positions is that I (still) think that exposing
detailed postprocessing shell fragment style arguments to pg_basebackup,
especially as the only option to use the new capabilities, will nail us
into a corner - but you don't necessarily think so?  Where I had/have no
problems with implementing features by *internally* piping through
external binaries, as long as the user doesn't have to always specify
them.


[1] I am not sure, nor the opposite, that piping is a great idea medium
term. One concern is that IIRC windows pipe performance is not great,
and that there's some other portability problems as well. I think
there's also valid concerns about per-file overhead, which might be a
problem for some future uses.



> > I really really don't understand this. Are you suggesting that for
> > server side compression etc we're going to add the ability to specify
> > shell commands as argument to the base backup command?  That seems so
> > obviously a non-starter?  A good default for backup configurations
> > should be that the PG user that the backup is done under is only allowed
> >

Re: [DOC] Document concurrent index builds waiting on each other

2020-04-15 Thread Andres Freund
Hi,

On 2020-04-15 09:31:58 -0400, James Coleman wrote:
> On Wed, Mar 25, 2020 at 3:58 PM Andres Freund  wrote:
> > On 2020-03-25 16:30:10 -0300, Alvaro Herrera wrote:
> > > I posted this in November
> > > https://postgr.es/m/20191101203310.GA12239@alvherre.pgsql but I didn't
> > > put time to go through the issues there.
> >
> > Oh, missed that.
> >
> >
> > > I don't know if my approach is exactly what Andres has in mind
> >
> > Not quite. I don't think it's generally correct for CIC to set
> > PROC_IN_VACUUM. I'm doubtful it's the case even just for plain indexes -
> > we don't want rows to be pruned away from under us. I also think we'd
> > want to set such a flag during all of the CIC phases?
> >
> > What I was thinking of was a new flag, with a distinct value from
> > PROC_IN_VACUUM. It'd currently just be specified in the
> > GetCurrentVirtualXIDs() calls in WaitForOlderSnapshots(). That'd avoid
> > needing to wait for other CICs on different relations. Since CIC is not
> > permitted on system tables, and CIC doesn't do DML on normal tables, it
> > seems fairly obviously correct to exclude other CICs.
> 
> That would keep CIC from blocking other CICs, but it wouldn't solve
> the problem of CIC blocking vacuum on unrelated tables, right? Perhaps
> that's orthogonal though.

I am not sure what blocking you are referring to here? CIC shouldn't
block vacuum on other tables from running? Or do you just mean that
vacuum will not be able to remove some rows due to the snapshot from the
CIC? That'd be an orthogonal problem, yes.

If it's about the xmin horizon for vacuum: I think we could probably
avoid that using the same flag. As vacuum cannot be run against a table
that has a CIC running (although it'd theoretically be possible to allow
that), it should be safe to ignore PROC_IN_CIC backends in vacuum's
GetOldestXmin() call. That might not be true for system relations, but
we don't allow CIC on those.

Greetings,

Andres Freund




Re: documenting the backup manifest file format

2020-04-15 Thread Jehan-Guillaume de Rorthais
On Wed, 15 Apr 2020 12:03:28 -0400
Robert Haas  wrote:

> On Wed, Apr 15, 2020 at 11:23 AM Jehan-Guillaume de Rorthais
>  wrote:
> > But for backup_manifest, it's kind of a shame we have to check the checksum
> > against a transformed version of the file. Did you consider creating e.g. a
> > separate backup_manifest.sha256 file?
> >
> > I'm very sorry in advance if this has been discussed previously.  
> 
> It was briefly mentioned in the original (lengthy) discussion, but I
> think there was one vote in favor and two votes against or something
> like that, so it didn't go anywhere.

Argh.

> I didn't realize that there were handy command-line tools for manipulating
> json like that, or I probably would have considered that idea more strongly.

That was indeed a lengthy thread with various details discussed. I'm sorry I
didn't catch the ball back then.

Regards,




Re: documenting the backup manifest file format

2020-04-15 Thread David Steele

On 4/15/20 6:43 PM, Jehan-Guillaume de Rorthais wrote:

On Wed, 15 Apr 2020 12:03:28 -0400
Robert Haas  wrote:


On Wed, Apr 15, 2020 at 11:23 AM Jehan-Guillaume de Rorthais
 wrote:

But for backup_manifest, it's kind of a shame we have to check the checksum
against a transformed version of the file. Did you consider creating e.g. a
separate backup_manifest.sha256 file?

I'm very sorry in advance if this has been discussed previously.


It was briefly mentioned in the original (lengthy) discussion, but I
think there was one vote in favor and two votes against or something
like that, so it didn't go anywhere.


Argh.


I didn't realize that there were handy command-line tools for manipulating
json like that, or I probably would have considered that idea more strongly.


That was indeed a lengthy thread with various details discussed. I'm sorry I
didn't catch the ball back then.


One of the reasons to use JSON was to be able to use command line tools 
like jq to do tasks (I use it myself). But I think only the 
pg_verifybackup tool should be used to verify the internal checksum.


Two thoughts:

1) You can always generate an external checksum when you generate the 
backup if you want to do your own verification without running 
pg_verifybackup.


2) Perhaps it would be good if the pg_verifybackup command had a 
--verify-manifest-checksum option (or something) to check that the 
manifest file looks valid without checking any files. That's not going 
to happen for PG13, but it's possible for PG14.


Regards,
--
-David
da...@pgmasters.net




Re: Comment explaining why vacuum needs to push snapshot seems insufficient.

2020-04-15 Thread Andres Freund
Hi,

On 2020-04-15 17:56:58 -0400, Alvaro Herrera wrote:
> On 2020-Apr-05, Andres Freund wrote:
> 
> > vacuum_rel() has the following comment:
> > /*
> >  * Functions in indexes may want a snapshot set.  Also, setting a 
> > snapshot
> >  * ensures that RecentGlobalXmin is kept truly recent.
> >  */
> > PushActiveSnapshot(GetTransactionSnapshot());
> > 
> > which was added quite a while ago in
> > 
> > commit d53a56687f3d4772d17ffa0013a33231b7163731
> 
> Note that what that commit did was change the snapshot acquisition from
> occurring solely during vacuum full to being acquired always -- and
> added a small additional bit of knowledge:
> 
> -   if (vacstmt->full)
> -   {
> -   /* functions in indexes may want a snapshot set */
> -   PushActiveSnapshot(GetTransactionSnapshot());
> -   }
> -   else
> +   /*
> +* Functions in indexes may want a snapshot set.  Also, setting
> +* a snapshot ensures that RecentGlobalXmin is kept truly recent.
> +*/
> +   PushActiveSnapshot(GetTransactionSnapshot());
> 
> so I wouldn't blame that commit for failing to understand all the side
> effects of acquiring a snapshot there and then.

Fair enough. I just looked far enough to find where the comment was
introduced. But I'm not sure the logic leading to that is correct - see
below.


> > But to me that's understating the issue. Don't we e.g. need a snapshot
> > to ensure that pg_subtrans won't be truncated away?  I thought xlog.c
> > doesn't pass PROCARRAY_FLAGS_VACUUM to GetOldestXmin when truncating?
> > TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
> 
> Nice find.  This bug would probably be orders-of-magnitude easier to hit
> now than it was in 2008 -- given both the hardware advances and the
> increased transaction rates.

Yea. I think there's a lot of very old sloppiness around xmin horizons
that we just avoided hitting frequently due to what you say. Although
I'm fairly sure that we've declared some of the resulting bugs
OS/hardware level issues that actually were horizon related...


> > How about replacing it with something like
> > /*
> >  * Need to acquire a snapshot to prevent pg_subtrans from being 
> > truncated,
> >  * cutoff xids in local memory wrapping around, and to have updated xmin
> >  * horizons.
> >  */
> 
> "While we don't typically need a snapshot, we need the side effects of
> acquiring one: having an xmin prevents concurrent pg_subtrans truncation
> and prevents our cutoff Xids from becoming wrapped-around; this also
> updates our Xmin horizons.  Lastly, functions in indexes may want a
> snapshot set."
> 
> (You omitted that last point without explanation -- maybe on purpose?
> is it no longer needed?)

I left out the "typically need a snapshot" part because it doesn't
really seem to say something meaningful. To me it's adding confusion
without removing any.  I guess we could reformulate it to explain that
while we don't use the snapshot to make mvcc visibility determinations
(since only definitely dead rows matter to vacuum), nor do we use it to
prevent concurrent removal of dead tuples (since we're fine with that
happening), we still need to ensure that resources like pg_subtrans stay
around.

I left out the "in indexes" part because I didn't see it
saying/guaranteeing much. Thinking more about it, that's probably not
quite right. In fact, I wonder if PushActiveSnapshot() does anything
meaningful for the case of expression indexes.  If a snapshot is needed
for some indexes, I assume that would be because visibility tests are
done. But because we set PROC_IN_VACUUM it'd not at all be safe to
actually do visibility tests - rows that are still visible could get
removed.

It's not clear to me which functions this is talking about. For vacuum
full, where the comment originated from, it's obvious (new index being
built) that we need to evaluate arbitrary functions, in particular for
expression indexes. But there's no different behaviour for expression
indexes during normal vacuums, all the expression related work was done
during index insertion.  And if the index operators themselves needed a
valid snapshot, it'd not be safe to set PROC_IN_VACUUM.

I think, at least for btree, it's presumably OK that we invoke btree
operators without a valid snapshot. Because we need to be able to
compare tuples on inner pages during normal operation - tuples whose
underlying data may have been removed long ago - btree operators need
to be safe against underlying data vanishing anyway (which is why
e.g. enum values are hard to remove).

Greetings,

Andres Freund




xid wraparound danger due to INDEX_CLEANUP false

2020-04-15 Thread Andres Freund
Hi,

commit a96c41feec6b6616eb9d5baee9a9e08c20533c38
Author: Robert Haas 
Date:   2019-04-04 14:58:53 -0400

Allow VACUUM to be run with index cleanup disabled.

This commit adds a new reloption, vacuum_index_cleanup, which
controls whether index cleanup is performed for a particular
relation by default.  It also adds a new option to the VACUUM
command, INDEX_CLEANUP, which can be used to override the
reloption.  If neither the reloption nor the VACUUM option is
used, the default is true, as before.

Masahiko Sawada, reviewed and tested by Nathan Bossart, Alvaro
Herrera, Kyotaro Horiguchi, Darafei Praliaskouski, and me.
The wording of the documentation is mostly due to me.

Discussion: 
http://postgr.es/m/CAD21AoAt5R3DNUZSjOoXDUY=naypuouffvsrzutymz29ylz...@mail.gmail.com

made the index scan that is part of vacuum optional.  I'm afraid that it
is not safe to do so unconditionally. Until this commit indexes could
rely on at least the amvacuumcleanup callback being called once per
vacuum. Which guaranteed that an index could ensure that there are no
too-old xids anywhere in the index.

But now that's not the case anymore:

vacrelstats->useindex = (nindexes > 0 &&
 params->index_cleanup 
== VACOPT_TERNARY_ENABLED);
...
/* Do post-vacuum cleanup */
if (vacrelstats->useindex)
lazy_cleanup_all_indexes(Irel, indstats, vacrelstats, lps, 
nindexes);

E.g. btree has xids both in the metapage contents, as well as using it
on normal index pages as part of page deletion.

The slightly older feature to avoid unnecessary scans during cleanup
protects against this issue by skipping the scan inside the index AM:

/*
 * _bt_vacuum_needs_cleanup() -- Checks if index needs cleanup assuming that
 *  btbulkdelete() wasn't called.
 */
static bool
_bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
{
...
else if (TransactionIdIsValid(metad->btm_oldest_btpo_xact) &&
 TransactionIdPrecedes(metad->btm_oldest_btpo_xact,
   
RecentGlobalXmin))
{
/*
 * If oldest btpo.xact in the deleted pages is older than
 * RecentGlobalXmin, then at least one deleted page can be 
recycled.
 */
result = true;
}

which will afaict result in all such xids getting removed (or at least
give the AM the choice to do so).


It's possible that something protects against dangers in the case of
INDEX CLEANUP false, or that the consequences aren't too bad. But I
didn't see any comments about the dangers in the patch.

Greetings,

Andres Freund




Re: Include sequence relation support in logical replication

2020-04-15 Thread Andres Freund
Hi,

On 2020-03-26 15:33:33 -0700, Cary Huang wrote:

> >> For the replication to make sense, the patch actually disables the WAL
> >> update at every 32 nextval() calls, so every call to nextval() will
> >> emit a WAL update for proper replication. This is done by setting
> >> SEQ_LOG_VALS to 0 in sequence.c
> >
> > Why is that needed? ISTM updating in increments of 32 is fine for
> > replication purposes? It's good imo, because sending out more granular
> > increments would increase the size of the WAL stream?
>
> Yes, updating the WAL at every 32 increments is good and has huge
> performance benefits according to Michael, but when it is replicated
> logically to subscribers, the sequence value they receive would not make
> much sense to them.
>
> For example, if I have a sequence called "seq" with current value = 100
> and increment = 5, the nextval('seq') call will return 105 to the client
> but will write 260 to the WAL record (100 + (5*32)), because that is the
> value after 32 increments. Internally it also maintains a "log_cnt"
> counter that tracks how many nextval() calls have been invoked since the
> last WAL write, so it can kind of derive backwards to find the proper
> next value to return to the client.
>
> But the subscriber for this sequence will receive a change of 260 instead
> of 105, and that does not represent the current sequence status. The
> subscriber is not able to derive backwards because it does not know the
> increment size in its schema.

What is the problem with the subscriber seeing 260? This already can
happen on the primary today, if there is a crash / immediate
restart. But that is fine - sequences don't guarantee that they are free
of gaps, just that each call will return a bigger value than before.
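
As a quick illustration of that property (sequence name and increment
below are made up for the example), a rolled-back transaction already
consumes values, so gaps appear even without a crash:

    CREATE SEQUENCE seq_demo INCREMENT BY 5;

    BEGIN;
    SELECT nextval('seq_demo');   -- returns 1
    ROLLBACK;

    SELECT nextval('seq_demo');   -- returns 6, not 1; the advance made by
                                  -- the rolled-back transaction is not undone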


> Setting SEQ_LOG_VALS to 0 in sequence.c basically disables this
> 32-increment behavior and makes the WAL update happen at every nextval()
> call, and therefore the subscriber to this sequence will receive the same
> value (105) as the client, at the cost of more WAL writes.
>
> I would like to ask if you have some suggestions or ideas that can make
> the subscriber receive the current value without the need to disable the
> 32-increment behavior?

It simply shouldn't expect that to be the case.  What do you need it
for?

As far as I can tell replicating sequence values is useful to allow
failover, by ensuring all sequences will return sensible values going
forward. That does not require knowing precise values.


Greetings,

Andres Freund




Re: where should I stick that backup?

2020-04-15 Thread Robert Haas
On Wed, Apr 15, 2020 at 6:13 PM Andres Freund  wrote:
> I guess what I perceived to be the fundamental difference, before this
> email, between our positions is that I (still) think that exposing
> detailed postprocessing shell fragment style arguments to pg_basebackup,
> especially as the only option to use the new capabilities, will nail us
> into a corner - but you don't necessarily think so?  Where I had/have no
> problems with implementing features by *internally* piping through
> external binaries, as long as the user doesn't have to always specify
> them.

My principal concern is actually around having a C API and a flexible
command-line interface. If we rearrange the code and the pg_basebackup
command line syntax so that it's easy to add new "filters" and
"targets", then I think that's a very good step forward. It's of less
concern to me whether those "filters" and "targets" are (1) C code
that we ship as part of pg_basebackup, (2) C code by extension authors
that we dynamically load into pg_basebackup, (3) off-the-shelf
external programs that we invoke, or (4) special external programs
that we provide which do special magic. However, of those options, I
like #4 least, because it seems like a pain in the tail to implement.
It may turn out to be the most powerful and flexible, though I'm not
completely sure about that yet.

As to exactly how far we can get with #3, I think it depends a good
deal on the answer to this question you pose in a footnote:

> [1] I am not sure, nor the opposite, that piping is a great idea medium
> term. One concern is that IIRC windows pipe performance is not great,
> and that there's some other portability problems as well. I think
> there's also valid concerns about per-file overhead, which might be a
> problem for some future uses.

If piping stuff through shell commands performs well for use cases
like compression, then I think we can get pretty far with piping
things through shell commands. It means we can use any compression at
all with no build-time dependency on that compressor. People can
install anything they want, stick it in $PATH, and away they go. I see
no particular reason to dislike that kind of thing; in fact, I think
it offers many compelling advantages. On the other hand, if we really
need to interact directly with the library to get decent performance,
because, say, pipes are too slow, then the approach of piping things
through arbitrary shell commands is a lot less exciting.

Even then, though, I wonder how many runtime dependencies we're
seriously willing to add. I imagine we can add one or two more
compression algorithms without giving everybody fits, even if it means
adding optional build-time and run-time dependencies on some external
libraries. Any more than that is likely to provoke a backlash. And I
doubt whether we're willing to have the postgresql operating system
package depend on something like libgcrypt at all. I would expect such
a proposal to meet with vigorous objections. But without such a
dependency, how would we realistically get encrypted backups except by
piping through a shell command? I don't really see a way, and letting
a user specify a shell fragment to define what happens there seems
pretty reasonable to me. I'm also not very sure we can assume, with
either compression or encryption, that one size fits all. If
there are six popular compression libraries and four popular
encryption libraries, does anyone really believe that it's going to be
OK for 'yum install postgresql-server' to suck in all of those things?
Or, even if that were OK or if it we could somehow avoid it, what are
the chances that we'd actually go to the trouble of building
interfaces to all of those things? I'd rate them as slim to none; we
suck at that sort of thing. Exhibit A: The work to make PostgreSQL
support more than one SSL library.

I'm becoming fairly uncertain as to how far we can get with shell
commands; some of the concerns raised about, for example, connection
management when talking to stuff like S3 are very worrying. At the
same time, I think we need to think pretty seriously about some of the
upsides of shell commands. The average user cannot write a C library
that implements an API. The average user cannot write a C binary that
speaks a novel, PostgreSQL-specific protocol. Even the above-average
user who is capable of doing those things probably won't have the time
to actually do it. So if the thing you have to do to make PostgreSQL talk
to the new sljgsjl compressor is either of those things, then we will
not have sljgsjl compression support for probably a decade after it
becomes the gold standard that everyone else in the industry is using.
If what you have to do is 'yum install sljgsjl' and then pg_basebackup
--client-filter='shell sljgsjl', people can start using it as soon as
their favorite distro packages it, without anyone who reads this
mailing list needing to do any work whatsoever.  If what you have to
do is create a 'sljgsjl.

Re: xid wraparound danger due to INDEX_CLEANUP false

2020-04-15 Thread Robert Haas
On Wed, Apr 15, 2020 at 7:38 PM Andres Freund  wrote:
> It's possible that something protects against dangers in the case of
> INDEX CLEANUP false, or that the consequences aren't too bad. But I
> didn't see any comments about the dangers in the patch.

I seem to recall Simon raising this issue at the time that the patch
was being discussed, and I thought that we had eventually decided that
it was OK for some reason. But I don't remember the details, and it is
possible that we got it wrong. :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: tadarida vs REL_12_STABLE

2020-04-15 Thread Tom Lane
"Tom Turelinckx"  writes:
> It turns out I'd missed one failure:
> https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=tadarida&br=REL_12_STABLE
> Only tadarida fails the sequence regression test, and only 
> for REL_12_STABLE. It fails with -O2 and -O1 but succeeds with -O0.

Yeah, I saw that.  The failure mode is really pretty odd:

 CREATE SEQUENCE sequence_test9 AS integer INCREMENT BY -1;
+ERROR:  MINVALUE (-9223372036854775808) is out of range for sequence data type integer

It's not difficult to see where things must be going wrong: sequence.c,
around line 1490 in HEAD, must be choosing to set the sequence's seqmin
to PG_INT64_MIN instead of PG_INT32_MIN as it should.  But that code is
exactly the same from HEAD back to v11 (and probably further, though
I didn't look).
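
For reference, on a build without the codegen problem the expected bounds
are easy to confirm from the catalog (a quick sketch; the values follow
from the integer type limits):

    CREATE SEQUENCE sequence_test9 AS integer INCREMENT BY -1;

    SELECT seqmin, seqmax FROM pg_sequence
    WHERE seqrelid = 'sequence_test9'::regclass;
    --    seqmin    | seqmax
    -- -------------+--------
    --  -2147483648 |     -1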

The next two failures are the same thing for smallint, and the rest is
just fallout from the sequence-creation failures.  So that's one extremely
specific codegen bug in the whole test suite.  I wonder if it's related to
the branch-delay-slot codegen bug we identified for sparc32 awhile back.

Not sure what to tell you, other than that it's darn odd that this only
fails in v12.  But I don't have much faith that "use -O0 in v12 only"
is going to be a long-term answer.

regards, tom lane




Re: xid wraparound danger due to INDEX_CLEANUP false

2020-04-15 Thread Peter Geoghegan
On Wed, Apr 15, 2020 at 4:57 PM Robert Haas  wrote:
> I seem to recall Simon raising this issue at the time that the patch
> was being discussed, and I thought that we had eventually decided that
> it was OK for some reason. But I don't remember the details, and it is
> possible that we got it wrong. :-(

It must be unreliable because it's based on something that is known to
be unreliable:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/nbtree/README;h=c5b0a30e4ebd4fe3bd4a6f8192284c452d1170b9;hb=refs/heads/REL_12_STABLE#l331

Also, the commit message of 6655a729 says that nbtree has had this
problem "since time immemorial". I am planning to work on that
problem, eventually.

-- 
Peter Geoghegan




Re: [DOC] Document concurrent index builds waiting on each other

2020-04-15 Thread James Coleman
On Wed, Apr 15, 2020 at 6:31 PM Andres Freund  wrote:
>
> Hi,
>
> On 2020-04-15 09:31:58 -0400, James Coleman wrote:
> > On Wed, Mar 25, 2020 at 3:58 PM Andres Freund  wrote:
> > > On 2020-03-25 16:30:10 -0300, Alvaro Herrera wrote:
> > > > I posted this in November
> > > > https://postgr.es/m/20191101203310.GA12239@alvherre.pgsql but I didn't
> > > > put time to go through the issues there.
> > >
> > > Oh, missed that.
> > >
> > >
> > > > I don't know if my approach is exactly what Andres has in mind
> > >
> > > Not quite. I don't think it's generally correct for CIC to set
> > > PROC_IN_VACUUM. I'm doubtful it's the case even just for plain indexes -
> > > we don't want rows to be pruned away from under us. I also think we'd
> > > want to set such a flag during all of the CIC phases?
> > >
> > > What I was thinking of was a new flag, with a distinct value from
> > > PROC_IN_VACUUM. It'd currently just be specified in the
> > > GetCurrentVirtualXIDs() calls in WaitForOlderSnapshots(). That'd avoid
> > > needing to wait for other CICs on different relations. Since CIC is not
> > > permitted on system tables, and CIC doesn't do DML on normal tables, it
> > > seems fairly obviously correct to exclude other CICs.
> >
> > That would keep CIC from blocking other CICs, but it wouldn't solve
> > the problem of CIC blocking vacuum on unrelated tables, right? Perhaps
> > that's orthogonal though.
>
> I am not sure what blocking you are referring to here? CIC shouldn't
> block vacuum on other tables from running? Or do you just mean that
> vacuum will not be able to remove some rows due to the snapshot from the
> CIC? That'd be an orthogonal problem, yes.
>
> If it's about the xmin horizon for vacuum: I think we could probably
> avoid that using the same flag. As vacuum cannot be run against a table
> that has a CIC running (although it'd theoretically be possible to allow
> that), it should be safe to ignore PROC_IN_CIC backends in vacuum's
> GetOldestXmin() call. That might not be true for system relations, but
> we don't allow CIC on those.

Yeah, I mean that if I have a CIC running on table X then vacuum can't
remove dead tuples (from after the CIC's snapshot) on table Y.

That's a pretty significant danger, given the combination of:
1. Index builds on very large tables can take many days, and
2. The well understood problems of high update tables with dead tuples
and poor plans.
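
The held-back horizon is easy to observe while a long CIC runs; roughly
something like the query below, run from another session (column names
as in recent releases):

    SELECT pid, backend_xmin, query
    FROM pg_stat_activity
    WHERE query ILIKE 'create index concurrently%';

    -- dead tuples newer than the oldest backend_xmin reported here cannot
    -- be removed by VACUUM on any table, not just the one being indexed.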

I've previously discussed this with other hackers and the reasoning
they'd understood was that we couldn't always safely ignore
PROC_IN_CIC backends in the vacuum's oldest xmin call because of
function indexes, and the fact that (despite clear recommendations to
the contrary) there's nothing actually preventing someone from adding
a function index on table X that queries table Y.

I'm not sure I buy that we should care about people doing something
clearly so dangerous, but...I grant that it'd be nice not to cause new
crashes.

James




Re: remove_useless_groupby_columns does not need to record constraint dependencies

2020-04-15 Thread David Rowley
On Thu, 16 Apr 2020 at 03:24, Tom Lane  wrote:
>
> David Rowley  writes:
> > Over in [1], Tom and I had a discussion in response to some confusion
> > about why remove_useless_groupby_columns() goes to the trouble of
> > recording a dependency on the PRIMARY KEY constraint when removing
> > surplus columns from the GROUP BY clause.
>
> > The outcome was that we don't need to do this since
> > remove_useless_groupby_columns() is used only as a plan-time
> > optimisation, we don't need to record any dependency.
>
> Right.  I think it would be good for the comments to emphasize that
> a relcache inval will be forced if the *index* underlying the pkey
> constraint is dropped; the code doesn't care so much about the constraint
> as such.  (This is also why it'd be safe to use a plain unique index
> for the same optimization, assuming you can independently verify
> non-nullness of the columns.

I've reworded the comment in the attached version.
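
For anyone following along, the optimisation in question is the
plan-time removal illustrated below (throwaway example; the exact plan
shape can vary):

    CREATE TABLE t (id int PRIMARY KEY, payload text);

    EXPLAIN (COSTS OFF)
    SELECT id, payload FROM t GROUP BY id, payload;
    --  HashAggregate
    --    Group Key: id      <- "payload" removed from the grouping key
    --    ->  Seq Scan on t     because it is functionally dependent on "id"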

> Maybe we should trash the existing coding
> and just have it look for unique indexes + attnotnull flags.)

I'd like to, but the timing seems off. Perhaps after we branch for PG14.

> > To prevent future confusion, I'd like to remove dependency recording
> > code from remove_useless_groupby_columns() and update the misleading
> > comment. Likely this should also be backpatched to 9.6.
>
> +1 for removing the dependency and improving the comments in HEAD.
> Minus quite a lot for back-patching: this is not a bug fix, and
> there's a nonzero risk that we've overlooked something.  I'd rather
> find that out in beta testing than from bug reports against stable
> branches.

That seems fair.

David


fix_remove_useless_groupby_columns2.patch
Description: Binary data


Re: cleaning perl code

2020-04-15 Thread Noah Misch
On Wed, Apr 15, 2020 at 03:43:36PM -0400, Andrew Dunstan wrote:
> On 4/14/20 4:44 PM, Alvaro Herrera wrote:
> > On 2020-Apr-14, Andrew Dunstan wrote:
> >> One of the things that's a bit sad is that perlcritic doesn't generally
> >> let you apply policies to a given set of files or files matching some
> >> pattern. It would be nice, for instance, to be able to apply some
> >> additional standards to strategic library files like PostgresNode.pm,
> >> TestLib.pm and Catalog.pm. There are good reasons as suggested upthread
> >> to apply higher standards to library files than to, say, a TAP test
> >> script. The only easy way I can see to do that would be to have two
> >> different perlcriticrc files and adjust pgperlcritic to make two runs.
> >> If people think that's worth it I'll put a little work into it. If not,
> >> I'll just leave things here.
> > I think being more strict about it in strategic files (I'd say that's
> > Catalog.pm plus src/test/perl/*.pm) might be a good idea.  Maybe give it
> > a try and see what comes up.
> 
> OK, in fact those files are in reasonably good shape. I also took a pass
> through the library files in src/tools/msvc, which had a few more issues.

It would be an unpleasant surprise to cause a perlcritic buildfarm failure by
moving a function, verbatim, from a non-strategic file to a strategic file.
Having two Perl style regimes in one tree is itself a liability.

> --- a/src/backend/catalog/Catalog.pm
> +++ b/src/backend/catalog/Catalog.pm
> @@ -67,7 +67,7 @@ sub ParseHeader
>   if (!$is_client_code)
>   {
>   # Strip C-style comments.
> - s;/\*(.|\n)*\*/;;g;
> + s;/\*(?:.|\n)*\*/;;g;

This policy against unreferenced groups makes the code harder to read, and the
chance of preventing a bug is too low to justify that.

> --- a/src/tools/perlcheck/pgperlcritic
> +++ b/src/tools/perlcheck/pgperlcritic
> @@ -14,7 +14,21 @@ PERLCRITIC=${PERLCRITIC:-perlcritic}
>  
>  . src/tools/perlcheck/find_perl_files
>  
> -find_perl_files | xargs $PERLCRITIC \
> +flist=`mktemp`
> +find_perl_files > $flist
> +
> +pattern='src/test/perl/|src/backend/catalog/Catalog.pm|src/tools/msvc/[^/]*.pm'

I don't find these files to be especially strategic, and I'm mostly shrugging
about the stricter policy's effect on code quality.  -1 for this patch.




Re: pg_restore: could not close data file: Success

2020-04-15 Thread Kyotaro Horiguchi
Hello.

Added -hackers.

At Wed, 15 Apr 2020 12:14:25 +0200, "Peter J. Holzer"  wrote 
in 
> On 2020-04-15 12:01:46 +0200, Peter J. Holzer wrote:
> > I'm trying to restore a backup on a different machine and it terminates
> > with the not really helpful messages:
> > 
> > pg_restore: [directory archiver] could not close data file: Success
> > pg_restore: [parallel archiver] a worker process died unexpectedly
> [...]
> > My guess is that maybe one of the data files is damaged
> 
> As is often the case the matter became obvious a few minutes after
> writing the mail. 
> 
> There were indeed two file with length 0 in the dump. That happened
> because the backup failed because it couldn't obtain a lock on a table.
> 
> A nicer error message (something like "cannot decompress '13503.dat.gz':
> Empty file") would have helped.

Unfortunately, just emptying the .dat.gz file didn't work for me.
Anyway, the message is emitted the following way.

pg_backup_directoy.c:
>  if (cfclose(cfp) !=0)
>fatal("could not close data file: %m");

%m doesn't work for some kinds of errors about compressed files, but
cfclose conceals the true cause.

I'm surprised to find an old thread about the same issue.

https://www.postgresql.org/message-id/20160307.174354.251049100.horiguchi.kyotaro%40lab.ntt.co.jp

But I don't think it's acceptable to use a fake errno for gzclose;
cfclose properly passes through the error code from gzclose, so it is
enough that the caller recognizes the difference.

Please find the attached.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/bin/pg_dump/compress_io.c b/src/bin/pg_dump/compress_io.c
index 1417401086..3a8394d7f2 100644
--- a/src/bin/pg_dump/compress_io.c
+++ b/src/bin/pg_dump/compress_io.c
@@ -645,6 +645,13 @@ cfgets(cfp *fp, char *buf, int len)
 		return fgets(buf, len, fp->uncompressedfp);
 }
 
+/*
+ * cfclose closes the stream
+ *
+ * Returns 0 if the cfp was successfully closed. Most errors are reported as -1
+ * and errno is set.  Otherwise the return value is the return value from
+ * gzclose and errno doesn't hold a meaningful value.
+ */
 int
 cfclose(cfp *fp)
 {
@@ -665,6 +672,11 @@ cfclose(cfp *fp)
 #endif
 	{
 		result = fclose(fp->uncompressedfp);
+
+		/* normalize error return, just in case EOF is not -1 */
+		if (result != 0)
+			result = -1;
+
 		fp->uncompressedfp = NULL;
 	}
 	free_keep_errno(fp);
diff --git a/src/bin/pg_dump/pg_backup_directory.c b/src/bin/pg_dump/pg_backup_directory.c
index c9cce5ed8a..ecc6aa5fbb 100644
--- a/src/bin/pg_dump/pg_backup_directory.c
+++ b/src/bin/pg_dump/pg_backup_directory.c
@@ -108,6 +108,7 @@ void
 InitArchiveFmt_Directory(ArchiveHandle *AH)
 {
 	lclContext *ctx;
+	int			ret;
 
 	/* Assuming static functions, this can be copied for each format. */
 	AH->ArchiveEntryPtr = _ArchiveEntry;
@@ -218,8 +219,14 @@ InitArchiveFmt_Directory(ArchiveHandle *AH)
 		ReadToc(AH);
 
 		/* Nothing else in the file, so close it again... */
-		if (cfclose(tocFH) != 0)
-			fatal("could not close TOC file: %m");
+		ret = cfclose(tocFH);
+		if (ret < 0)
+		{
+			if (ret == -1)
+fatal("could not close TOC file: %m");
+			else
+fatal("could not close TOC file: zlib error (%d)", ret);
+		}
 		ctx->dataFH = NULL;
 	}
 }
@@ -378,6 +385,7 @@ _PrintFileData(ArchiveHandle *AH, char *filename)
 	char	   *buf;
 	size_t		buflen;
 	cfp		   *cfp;
+	int			ret;
 
 	if (!filename)
 		return;
@@ -396,8 +404,15 @@ _PrintFileData(ArchiveHandle *AH, char *filename)
 	}
 
 	free(buf);
-	if (cfclose(cfp) !=0)
-		fatal("could not close data file: %m");
+
+	ret = cfclose(cfp);
+	if (ret < 0)
+	{
+		if (ret == -1)
+			fatal("could not close data file: %m");
+		else
+			fatal("could not close data file: zlib error (%d)", ret);
+	}
 }
 
 /*
@@ -429,6 +444,7 @@ _LoadBlobs(ArchiveHandle *AH)
 	lclContext *ctx = (lclContext *) AH->formatData;
 	char		fname[MAXPGPATH];
 	char		line[MAXPGPATH];
+	int			ret;
 
 	StartRestoreBlobs(AH);
 
@@ -460,9 +476,16 @@ _LoadBlobs(ArchiveHandle *AH)
 		fatal("error reading large object TOC file \"%s\"",
 			  fname);
 
-	if (cfclose(ctx->blobsTocFH) != 0)
-		fatal("could not close large object TOC file \"%s\": %m",
-			  fname);
+	ret = cfclose(ctx->blobsTocFH);
+	if (ret < 0)
+	{
+		if (ret == -1)
+			fatal("could not close large object TOC file \"%s\": %m",
+  fname);
+		else
+			fatal("could not close large object TOC file \"%s\": zlib error (%d)",
+  fname, ret);
+	}
 
 	ctx->blobsTocFH = NULL;
 
@@ -555,6 +578,7 @@ _CloseArchive(ArchiveHandle *AH)
 	{
 		cfp		   *tocFH;
 		char		fname[MAXPGPATH];
+		int			ret;
 
 		setFilePath(AH, fname, "toc.dat");
 
@@ -576,8 +600,14 @@ _CloseArchive(ArchiveHandle *AH)
 		WriteHead(AH);
 		AH->format = archDirectory;
 		WriteToc(AH);
-		if (cfclose(tocFH) != 0)
-			fatal("could not close TOC file: %m");
+		ret = cfclose(tocFH);
+		if (ret < 0)
+		{
+			if (ret == -1)
+fatal("could not close TOC file: %m");
+			else
+fatal("could not close TOC file: zlib

Re: Allow pg_read_all_stats to read pg_stat_progress_*

2020-04-15 Thread Kyotaro Horiguchi
At Wed, 15 Apr 2020 15:58:05 +0500, "Andrey M. Borodin"  
wrote in 
> > 15 апр. 2020 г., в 15:25, Magnus Hagander  написал(а):
> > I think that makes perfect sense. The documentation explicitly says "can 
> > read all pg_stat_* views", which is clearly wrong -- so either the code or 
> > the docs should be fixed, and it looks like it's the code that should be 
> > fixed to me.
> Should it be bug or v14 feature?
> 
> Also pgstatfuncs.c contains a lot more checks of 
> has_privs_of_role(GetUserId(), beentry->st_userid).
> Maybe grant them all?
> 
> > As for the patch, one could argue that we should just store the resulting 
> > boolean instead of re-running the check (e.g. have a "bool 
> > has_stats_privilege" or such), but perhaps that's an unnecessary 
> > micro-optimization, like the attached.
> 
> Looks good to me.

pg_stat_get_activity checks (has_privs_of_role() ||
is_member_of_role()) in-place for every entry.  It's not necessary, but
I suppose that doing the same thing for pg_stat_progress_info might be
better.

It's another issue, but the pg_stat_get_backend_* functions don't
consider pg_read_all_stats. I suppose that the functions should work
under the same criteria as the pg_stat views, and maybe that should be
explicitly documented?
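
To illustrate, as a member of pg_read_all_stats (exact behaviour may
differ between versions):

    SELECT query FROM pg_stat_activity;   -- other sessions' queries visible

    SELECT pg_stat_get_backend_activity(s.backendid)
    FROM (SELECT pg_stat_get_backend_idset() AS backendid) AS s;
    -- "<insufficient privilege>" for backends of other users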

If we do that, it may be better to define "PGSTAT_VIEW_PRIV()" or
something like that, and replace all occurrences of the idiomatic
condition with it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Poll: are people okay with function/operator table redesign?

2020-04-15 Thread Kyotaro Horiguchi
At Wed, 15 Apr 2020 12:04:34 -0400, Robert Haas  wrote 
in 
> On Wed, Apr 15, 2020 at 11:54 AM Pavel Stehule  
> wrote:
> > st 15. 4. 2020 v 17:43 odesílatel Isaac Morland  
> > napsal:
> >> On Wed, 15 Apr 2020 at 11:26, Pierre Giraud  
> >> wrote:
> >>> The best way to achieve this is to use some styling (font style and 
> >>> color).
> >>>
> >>> Attached you will find two different options I worked on very quickly.
> >>
> >> I really like the first. Just a couple of suggestions I would make:
> >
> > yes, it is very well readable
> 
> +1.

+1.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Include sequence relation support in logical replication

2020-04-15 Thread Craig Ringer
On Thu, 16 Apr 2020 at 07:44, Andres Freund  wrote:

>
> > I would like to ask if you have some suggestions or ideas that can make
> > the subscriber receive the current value without the need to disable
> > the 32-increment behavior?
>
> It simply shouldn't expect that to be the case.  What do you need it
> for?
>
> As far as I can tell replicating sequence values is useful to allow
> failover, by ensuring all sequences will return sensible values going
> forward. That does not require knowing precise values.


Totally agree. Code that relies on getting specific sequence values is
broken code. Alas, very common, but still broken.

Cary, by way of background a large part of why this wasn't supported by
logical decoding back in 9.4 is that until the pg_catalog.pg_sequence
relation was introduced in PostgreSQL 10, the sequence relfilenode
intermixed a bunch of transactional and non-transactional state in a very
messy way. This made it very hard to achieve sensible behaviour for logical
decoding.

As it is, make sure your regression tests carefully cover the following
cases, as TAP tests in src/test/recovery, probably a new module for logical
decoding of sequences:

1.

* Begin txn
* Create sequence
* Call nextval() on sequence over generate_series() and discard results
* Rollback
* Issue a dummy insert+commit to some other table to force logical decoding
to send something
* Ensure subscription catches up successfully

This checks that we cope with advances for a sequence that doesn't get
created.
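
On the publisher side, case 1 boils down to something like the following
(names are placeholders; the catch-up check itself lives in the TAP
harness):

    BEGIN;
    CREATE SEQUENCE decode_test_seq;
    SELECT nextval('decode_test_seq') FROM generate_series(1, 1000);
    ROLLBACK;

    -- dummy change so logical decoding has something to send:
    INSERT INTO some_other_table DEFAULT VALUES;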

2.

* Begin 1st txn
* Create a sequence
* Use the sequence to populate a temp table with enough rows to ensure
sequence updates are written
* Begin a 2nd txn
* Issue a dummy insert+commit to some other table to force logical decoding
to send something
* Commit the 2nd txn
* Commit the 1st txn
* Wait for subscription catchup
* Check that the sequence value on the subscriber reflects the value after
sequence advance, not the value at creation time

This makes sure that sequence advances are handled sensibly when they
arrive for a sequence that does not yet exist in the catalogs.

You'll need to run psql in an IPC::Run background session for that. We
should really have a helper for this. I'll see if I'm allowed to post the
one I use for my own TAP tests to the list.


Re: pg_basebackup, manifests and backends older than ~12

2020-04-15 Thread Michael Paquier
On Tue, Apr 14, 2020 at 08:09:22PM -0400, Tom Lane wrote:
> Michael Paquier  writes:
>> On Mon, Apr 13, 2020 at 07:04:20PM -0400, Alvaro Herrera wrote:
>>> I agree, I think forcing users to specify --no-manifest when run on old
>>> servers will cause users to write bad scripts; I vote for silently
>>> disabling checksums.
> 
>> Okay, thanks.  Are there any other opinions?
> 
> FWIW, I concur with silently disabling the feature if the source
> server can't support it.

Thanks.  I have applied the patch, then.
--
Michael


signature.asc
Description: PGP signature


Re: Do we need to handle orphaned prepared transactions in the server?

2020-04-15 Thread Craig Ringer
On Wed, 15 Apr 2020 at 03:19, Robert Haas  wrote:

> On Wed, Feb 19, 2020 at 10:05 AM Hamid Akhtar 
> wrote:
> > Attached is version 1 of POC patch for notifying of orphaned
> > prepared transactions via warnings emitted to a client
> > application and/or log file. It applies to PostgreSQL branch
> > "master" on top of "e2e02191" commit.
>
> I think this is a bad idea and that we should reject the patch. It's
> true that forgotten prepared transactions are a problem, but it's also
> true that you can monitor for that yourself using the
> pg_prepared_xacts view. If you do, you will have a lot more
> flexibility than this patch gives you, or than any similar patch ever
> can give you.
>

I agree. It's going to cause nothing but problems.

I am generally a fan of improvements that make PostgreSQL easier to use,
easier to monitor and understand, harder to break accidentally, etc. But
not when those improvements come at the price of correct behaviour for
standard, externally specified interfaces.

Nothing allows us to just throw away prepared xacts. Doing so violates the
very definition of what a prepared xact is. It's like saying "hey, this
table is bloated, let's discard all rows with xmin < foo because we figure
the user probably doesn't care about them; though they're visible to some
still-running xacts, but those xacts haven't accessed the table.". Um. No.
We can't do that.

If you want this, write an extension that does it as a background worker.
You can munge the prepared xacts state in any manner you please from there.

I advocated for visibility / monitoring improvements upthread that might
help mitigate the operational issues. Because I do agree that there's a
problem with users having to watch the logs or query obscure state to
understand what the system is doing and why bloat is being created by
abandoned prepared xacts.
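
For what it's worth, the kind of monitoring query Robert refers to can be
as simple as the sketch below (the one-hour threshold is only an example):

    SELECT gid, prepared, owner, database, age(transaction) AS xid_age
    FROM pg_prepared_xacts
    WHERE prepared < now() - interval '1 hour'
    ORDER BY prepared;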

Just discarding the prepared xacts is not the answer though.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: Do we need to handle orphaned prepared transactions in the server?

2020-04-15 Thread Craig Ringer
On Thu, 16 Apr 2020 at 13:23, Craig Ringer  wrote:

>
> Just discarding the prepared xacts is not the answer though.
>

... however, I have wondered a few times about making vacuum smarter about
cases where the xmin is held down by prepared xacts or by replication
slots. If we could record the oldest *and newest* xid needed by such
resource retention markers we could potentially teach vacuum to remove
intermediate dead rows. For high-churn workloads like workqueue
applications that could be a really big win.

We wouldn't need to track a fine-grained snapshot with an in-progress list
(or inverted in-progress list like historic snapshots) for these. We'd just
remember the needed xid range in [xmin,xmax] form. And we could even do the
same for live backends' PGXACT - it might not be worth the price there, but
if you have workloads that have batch xacts + high churn rate xacts it'd be
pretty appealing.

It wouldn't help with xid wraparound concerns, but it could help a lot with
bloat caused by old snapshots for some very common workloads.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: pg_restore: could not close data file: Success

2020-04-15 Thread Michael Paquier
On Thu, Apr 16, 2020 at 12:08:09PM +0900, Kyotaro Horiguchi wrote:
> I'm surprised to find an old thread about the same issue.
> 
> https://www.postgresql.org/message-id/20160307.174354.251049100.horiguchi.kyotaro%40lab.ntt.co.jp
> 
> But I don't think it's not acceptable that use fake errno for gzclose,
> but cfclose properly passes-through the error code from gzclose, so it
> is enought that the caller should recognize the difference.

A problem with this patch is that we may forget again to add this
special error handling if more code paths use cfclose().

As of HEAD, there are three code paths where cfclose() is called but
it does not generate an error: two when ending a blob and one when
ending a data file.  Perhaps it would make sense to just move all this
error handling into the routine itself?  Note that it would also mean
registering file names in lclContext or equivalent as that's an
important piece of the error message.
--
Michael


signature.asc
Description: PGP signature


Re: Vacuum o/p with (full 1, parallel 0) option throwing an error

2020-04-15 Thread Amit Kapila
On Wed, Apr 15, 2020 at 9:12 AM Amit Kapila  wrote:
>
> On Wed, Apr 15, 2020 at 9:03 AM Justin Pryzby  wrote:
> >
> > On Wed, Apr 15, 2020 at 08:54:17AM +0530, Amit Kapila wrote:
> > > On Tue, Apr 14, 2020 at 8:55 AM Amit Kapila  
> > > wrote:
> > > >
> > > > On Tue, Apr 14, 2020 at 7:52 AM Michael Paquier  
> > > > wrote:
> > > > >
> > > >
> > > > > -VACUUM (PARALLEL 1) tmp; -- disables parallel vacuum option
> > > > > +VACUUM (PARALLEL 1) tmp; -- parallel vacuum disabled for temp tables
> > > > >  WARNING:  disabling parallel option of vacuum on "tmp" --- cannot  
> > > > > vacuum temporary tables in parallel
> > > > > +VACUUM (PARALLEL 0, FULL TRUE) tmp; -- can specify parallel disabled 
> > > > > (even though that's implied by FULL)
> > > > >
> > > > > To fully close the gap in the tests, I would also add a test for
> > > > > (PARALLEL 1, FULL false) where FULL directly specified, even if that
> > > > > sounds like a nit.  That's fine to test even on a temporary table.
> > > > >
> > > >
> > > > Okay, I will do this once we agree on the error message stuff.
> > > >
> > >
> > > I have changed one of the existing tests to test the option suggested
> > > by you.  Additionally, I have changed the docs as per suggestion from
> > > Sawada-san.  I haven't changed the error message.  Let me know if you
> > > have any more comments?
> >
> > You did:
> > |...then the number of workers is determined based on the number of
> > |indexes that support parallel vacuum operation on the 
> > [-relation,-]{+relation+} and is further
> > |limited by .
> >
> > I'd suggest to say instead:
> > |...then the number of workers is determined based on the number of
> > |indexes ON THE RELATION that support parallel vacuum operation, and is 
> > further
> > |limited by .
> >
>
> I have not changed this now but I find your suggestion better than
> existing wording.  I'll change this before committing the patch unless
> there are more comments.
>

Pushed.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Poll: are people okay with function/operator table redesign?

2020-04-15 Thread Pierre Giraud


Le 16/04/2020 à 00:18, Tom Lane a écrit :
> As I threatened to do earlier, I made a pass at converting table 9.10
> to a couple of the styles under discussion.  (This is just a
> draft-quality patch, so it might have some minor bugs --- the point
> is just to see what these styles look like.)
> 
> I've concluded after looking around that the ideas involving not having
> a  at all, but just a  or the like, are not very
> well-advised.  That would eliminate, or at least greatly degrade, the
> visual distinction between the per-function material and the surrounding
> commentary.  Which does not seem like a winner to me; for example it
> would make it quite hard to skip over the detailed material when you're
> just trying to skim the docs.
> 
> We did have a number of people suggesting that just reordering things as
> "description, signature, examples" might be a good idea, so I gave that
> a try; attached is a rendition of a portion of 9.10 in that style (the
> "v1" image).  It's not bad, but there's still going to be a lot of
> wasted whitespace in tables that include even one long function name.
> (9.10's longest is "regexp_split_to_array", so it's showing this problem
> significantly.)
> 
> I also experimented with Jonathan's idea of dropping the separate
> function name and allowing the function signature to span left into
> that column -- see "v2" images.  This actually works really well,
> and would work even better (IMO) if we could get rid of the inter-row
> and inter-column rules within a function entry.  I failed to
> accomplish that with rowsep/colsep annotations, but from remarks
> upthread I suppose there might be a CSS way to accomplish it.  (But
> the rowsep/colsep annotations *do* work in PDF output, so I kept them;
> that means we only need a CSS fix and not some kind of flow-object
> magic for PDF.)
> 
> To allow direct comparison of these 9.10 images against the situation
> in HEAD, I've also attached an extract of 9.10 as rendered by my
> browser with "STYLE=website".  As you can see this is *not* quite
> identical to how it renders on postgresql.org, so there is still some
> unexplained differential in font or margins or something.  But if you
> look at those three PNGs you can see that either v1 or v2 has a pretty
> substantial advantage over HEAD in terms of the amount of space
> needed.  v2 would be even further ahead if we could eliminate some of
> the vertical space around the intra-function row split, which again
> might be doable with CSS magic.
> 
> The main disadvantage I can see to the v2 design is that we're back
> to having two  per function, which is inevitably going to result
> in PDF builds putting page breaks between those rows.  But you can't
> have everything ... and maybe we could find a way to discourage such
> breaks if we tried.

What about putting everything into one  and using a block with
some left padding/margin for description + example?
This would solve the PDF page break issue as well as the column
separation border one.

The screenshot attached uses a  tag for the description/example block.

> 
> Another issue is that v2 won't adapt real well to operator tables;
> the operator name won't be at the left.  I don't have a lot of faith
> in the proposal to fix that with font tricks.  Maybe we could stick
> to something close to the layout that table 9.30 has in HEAD (ie
> repeating the operator name in column 1), since we won't have long
> operator names messing up the format.  Again, CSS'ing our way
> out of the internal lines and extra vertical space within a single
> logical table cell would make that layout look nicer.
> 
> On balance I quite like the v2 layout and would prefer to move forward
> with that, assuming we can solve the remaining issues via CSS or style
> sheets.
> 
> In addition to screenshots, I've attached patches against HEAD that
> convert both tables 9.10 and 9.33 into v1 and v2 styles.
> 
>   regards, tom lane
> 


Re: [PATHC] Fix minor memory leak in pg_basebackup

2020-04-15 Thread Michael Paquier
On Wed, Apr 15, 2020 at 10:06:52AM +, Zhang, Jie wrote:
> In some cases , PGresult is not cleared.
> 
> File: src\bin\pg_basebackup\streamutil.c
> 
> bool
> RetrieveWalSegSize(PGconn *conn)
> {
>   PGresult   *res;

RetrieveWalSegSize() gets called only once at the beginning of
pg_basebackup and pg_receivewal, so that's not an issue that has major
effects, but it's still an issue.  The first PQclear() is needed where
you say.  Now for the second one, I would just move it to once the
code is done with the query result, aka after calling PQgetvalue().
What do you think?  Please see the attached.
--
Michael
diff --git a/src/bin/pg_basebackup/streamutil.c b/src/bin/pg_basebackup/streamutil.c
index 1b9005722a..410116492e 100644
--- a/src/bin/pg_basebackup/streamutil.c
+++ b/src/bin/pg_basebackup/streamutil.c
@@ -313,9 +313,12 @@ RetrieveWalSegSize(PGconn *conn)
 	if (sscanf(PQgetvalue(res, 0, 0), "%d%s", &xlog_val, xlog_unit) != 2)
 	{
 		pg_log_error("WAL segment size could not be parsed");
+		PQclear(res);
 		return false;
 	}
 
+	PQclear(res);
+
 	/* set the multiplier based on unit to convert xlog_val to bytes */
 	if (strcmp(xlog_unit, "MB") == 0)
 		multiplier = 1024 * 1024;
@@ -334,7 +337,6 @@ RetrieveWalSegSize(PGconn *conn)
 		return false;
 	}
 
-	PQclear(res);
 	return true;
 }
 


signature.asc
Description: PGP signature