Is it possible to set the end-of-data marker for the COPY statement?

2020-09-01 Thread Junfeng Yang
Hi hackers,

As described in the doc https://www.postgresql.org/docs/current/sql-copy.html, 
the TEXT format recognizes
backslash-period (\.) as end-of-data marker.

The example below will raise an error for a line that contains `\.`.

CREATE TABLE test (
    id int,
    name text,
    dep text
);

Data in file "/tmp/data".

122,as\.d,adad
133,sa dad,adadad

Then execute

copy test from '/tmp/data' DELIMITER ',';

An "end-of-copy marker corrupt" error will be raised.

This requires users to escape the end-of-data marker manually in their data.
Why don't we have a mechanism to define other characters as the end-of-data marker?
Or are there other ways to avoid escaping the end-of-data marker in the data?
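
(A minimal sketch of the failure and of one apparent workaround, assuming the table and file above; in CSV format, backslash is not a special character and \. is only treated as the end-of-data marker when it appears alone on a line:)

COPY test FROM '/tmp/data' DELIMITER ',';        -- fails: end-of-copy marker corrupt
COPY test FROM '/tmp/data' WITH (FORMAT csv);    -- loads "as\.d" as a plain value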

Regards,
Junfeng




RE: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

2020-09-01 Thread Sait Talha Nisanci
Hi,

The WAL size for "SSD, full_page_writes=on" was 36GB. I currently don't have 
the exact size for the other rows because my test VMs got auto-deleted. I can 
possibly redo the benchmark to get pg_waldump stats for each row.

Best,
Talha.


-Original Message-
From: Stephen Frost  
Sent: Sunday, August 30, 2020 3:24 PM
To: Tomas Vondra 
Cc: Robert Haas ; Andres Freund ; 
Sait Talha Nisanci ; Thomas Munro 
; Dmitry Dolgov <9erthali...@gmail.com>; David Steele 
; Alvaro Herrera ; pgsql-hackers 

Subject: Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> On Thu, Aug 27, 2020 at 04:28:54PM -0400, Stephen Frost wrote:
> >* Robert Haas (robertmh...@gmail.com) wrote:
> >>On Thu, Aug 27, 2020 at 2:51 PM Stephen Frost  wrote:
> >>> > Hm? At least earlier versions didn't do prefetching for records with an 
> >>> > fpw, and only for subsequent records affecting the same or if not in 
> >>> > s_b anymore.
> >>>
> >>> We don't actually read the page when we're replaying an FPW though..?
> >>> If we don't read it, and we entirely write the page from the FPW, 
> >>> how is pre-fetching helping..?
> >>
> >>Suppose there is a checkpoint. Then we replay a record with an FPW, 
> >>pre-fetching nothing. Then the buffer gets evicted from 
> >>shared_buffers, and maybe the OS cache too. Then, before the next 
> >>checkpoint, we again replay a record for the same page. At this 
> >>point, pre-fetching should be helpful.
> >
> >Sure- but if we're talking about 25GB of WAL, on a server that's got 
> >32GB, then why would those pages end up getting evicted from memory 
> >entirely?  Particularly, enough of them to end up with such a huge 
> >difference in replay time..
> >
> >I do agree that if we've got more outstanding WAL between checkpoints 
> >than the system's got memory then that certainly changes things, but 
> >that wasn't what I understood the case to be here.
> 
> I don't think it's very clear how much WAL there actually was in each 
> case - the message only said there was more than 25GB, but who knows 
> how many checkpoints that covers? In the cases with FPW=on this may 
> easily be much less than one checkpoint (because with scale 45GB an 
> update to every page will log 45GB of full-page images). It'd be 
> interesting to see some stats from pg_waldump etc.

Also in the message was this:

--
In order to avoid checkpoints during benchmark, max_wal_size(200GB) and
checkpoint_timeout(200 mins) are set to a high value.
--

Which led me to suspect, at least, that this was much less than a checkpoint,
as you suggest.  Also, given that the comment was 'run is cancelled when there
is a reasonable amount of WAL (>25GB)', it seems likely that it's at least *around*
there.

Ultimately though, there just isn't enough information provided to really be 
able to understand what's going on.  I agree, pg_waldump stats would be useful.

> >>Admittedly, I don't quite understand whether that is what is 
> >>happening in this test case, or why SDD vs. HDD should make any 
> >>difference. But there doesn't seem to be any reason why it doesn't 
> >>make sense in theory.
> >
> >I agree that this could be a reason, but it doesn't seem to quite fit 
> >in this particular case given the amount of memory and WAL.  I'm 
> >suspecting that it's something else and I'd very much like to know if 
> >it's a general "this applies to all (most?  a lot of?) SSDs because 
> >the hardware has a larger than 8KB page size and therefore the kernel 
> >has to read it", or if it's something odd about this particular 
> >system and doesn't apply generally.
> 
> Not sure. I doubt it has anything to do with the hardware page size, 
> that's mostly transparent to the kernel anyway. But it might be that 
> the prefetching on a particular SSD has more overhead than what it saves.

Right- I wouldn't have thought the hardware page size would matter either, but 
it's entirely possible that assumption is wrong and that it does matter for 
some reason- perhaps with just some SSDs, or maybe with a lot of them, or maybe 
there's something else entirely going on.  About all I feel like I can say at 
the moment is that I'm very interested in ways to make WAL replay go faster and 
it'd be great to get more information about what's going on here to see if 
there's something we can do to generally improve WAL replay.

Thanks,

Stephen




Re: Documentation patch for backup manifests in protocol.sgml

2020-09-01 Thread Bernd Helmle
Am Montag, den 31.08.2020, 18:48 -0400 schrieb Bruce Momjian:
> > So confirmed.
> 
> 
> Patch applied through 13.

Thanks!






Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-09-01 Thread Masahiko Sawada
On Fri, 28 Aug 2020 at 23:39, Robert Haas  wrote:
>
> On Fri, Aug 28, 2020 at 4:07 AM Masahiko Sawada
>  wrote:
> > You've removed the description about executing VACUUM with
> > DISABLE_PAGE_SKIPPING option on the target relation after using
> > pg_surgery functions from the doc but I guess it’s better to recommend
> > that in the doc for safety. Could you please tell me the reason for
> > removing that?
>
> Well, I think that was added because there wasn't code to clear the
> visibility map bits, either page-level in the map, but we added code
> for that, so now I don't really see why it's necessary or even
> desirable.
>
> Here are a few example scenarios:
>
> 1. My table is not corrupt. For no particular reason, I force-freeze
> or force-kill a tuple which is neither dead nor all-visible.
> Concurrent queries might return wrong answers, but the table is not
> corrupt. It does not require VACUUM and would not benefit from it.
> Actually, it doesn't need anything at all.
>
> 2. My table is not corrupt. For no particular reason, I force-freeze a
> tuple which is dead. I believe it's possible that the index entries
> for that tuple might be gone already, but VACUUM won't fix that.
> REINDEX or a table rewrite would, though. It's also possible, if the
> dead tuple was added by an aborted transaction which added columns to
> the table, that the tuple might have been created using a tuple
> descriptor that differs from the table's current tuple descriptor. If
> so, I think scanning the table could produce a crash. VACUUM won't fix
> this, either. I would need to delete or force-kill the offending
> tuple.
>
> 3. I have one or more tuples in my table that are intact except that
> they have garbage values for xmin, resulting in VACUUM failure or
> possibly even SELECT failure if the CLOG entries are also missing. I
> force-kill or force-freeze them. If by chance the affected tuples were
> also omitted from one or more indexes, a REINDEX or table rewrite is
> needed to fix them, but a VACUUM will not help. On the other hand, if
> those tuples are present in the indexes, there's no remaining problem
> and VACUUM is not needed for the purpose of restoring the integrity of
> the table. If the problem has been ongoing for a while, VACUUM might
> be needed to advance relfrozenxid, but that doesn't require
> DISABLE_PAGE_SKIPPING.
>
> 4. I have some pages in my table that have incorrect visibility map
> bits. In this case, I need VACUUM (DISABLE_PAGE_SKIPPING). However, I
> don't need the functions we're talking about here at all unless I also
> have tuples with corrupted visibility information. If I do happen to
> have both tuples with corrupted visibility information and also pages
> with incorrect visibility map bits, then I suppose I need both these
> tools and also VACUUM (DISABLE_PAGE_SKIPPING). Probably, I'll want to
> do the VACUUM second. But, if I happened to do the VACUUM first and
> then use these functions afterward, the worst thing that could happen
> is that I might end up with some dead tuples that could've gotten
> removed faster if I'd switched the order. And that's not a disaster.
>
> Basically, I can see no real reason to recommend VACUUM
> (DISABLE_PAGE_SKIPPING) here. There are problems that can be fixed
> with that command, and there are problems that can be fixed by this
> method, but they are mostly independent of each other. We should not
> recommend that people run VACUUM "just in case." That kind of fuzzy
> thinking seems relatively prevalent already, and it leads to people
> spending a lot of time running slow maintenance commands that do
> nothing to help them, and which occasionally make things worse.
>

Thank you for your explanation. That makes a lot of sense to me.

If VACUUM could fix a particular kind of problem when used together with
pg_surgery, we could recommend using VACUUM. But I agree that this is not
the case for heap table corruption.
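
For reference, a minimal sketch of the repair sequence described in scenario 3, assuming the function names and signatures from the pg_surgery patch under discussion and a hypothetical table name:

CREATE EXTENSION pg_surgery;
-- Force-kill a tuple with a garbage xmin, identified by its TID:
SELECT heap_force_kill('damaged_table'::regclass, ARRAY['(0,1)'::tid]);
-- If the affected tuple might also be missing from indexes, rebuild them:
REINDEX TABLE damaged_table;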

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Disk-based hash aggregate's cost model

2020-09-01 Thread Tomas Vondra

On Mon, Aug 31, 2020 at 11:34:34PM -0700, Jeff Davis wrote:

On Sun, 2020-08-30 at 17:03 +0200, Tomas Vondra wrote:

So I'm wondering if the hashagg is not ignoring similar non-I/O costs
for the spilling case. In particular, the initial section computing
startup_cost seems to ignore that we may need to do some of the stuff
repeatedly - for example we'll repeat hash lookups for spilled tuples,
and so on.


To fix that, we'd also need to change the cost of in-memory HashAgg,
right?



Why? I don't think we need to change costing of in-memory HashAgg. My
assumption was we'd only tweak startup_cost for cases with spilling by
adding something like (cpu_operator_cost * npartitions * ntuples).


The other thing is that sort seems to be doing only about half the
physical I/O (as measured by iosnoop) compared to hashagg, even though
the estimates of pages / input_bytes are exactly the same. For hashagg
the iosnoop shows 5921MB reads and 7185MB writes, while sort only does
2895MB reads and 3655MB writes. Which kinda matches the observed sizes
of temp files in the two cases, so the input_bytes for sort seems to be
a bit overestimated.


Hmm, interesting.



FWIW I suspect some of this difference may be due to logical vs.
physical I/O. iosnoop only tracks physical I/O sent to the device, but
maybe we do much more logical I/O and it simply does not expire from
page cache for the sort. It might behave differently for larger data
set, longer query, ...


How reasonable is it to be making these kinds of changes to the cost
model right now? I think your analysis is solid, but I'm worried about
making more intrusive changes very late in the cycle.

I had originally tried to limit the cost model changes to the new plans
I am introducing -- that is, HashAgg plans expected to require disk.
That's why I came up with a rather arbitrary penalty.



I don't know. I certainly understand the desire not to change things
this late. OTOH I'm worried that we'll end up receiving a lot of poor
plans post release.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [patch] demote

2020-09-01 Thread Jehan-Guillaume de Rorthais
On Tue, 18 Aug 2020 17:41:31 +0200
Jehan-Guillaume de Rorthais  wrote:

> Hi,
> 
> Please find in attachment v5 of the patch set rebased on master after various
> conflicts.
> 
> Regards,
> 
> On Wed, 5 Aug 2020 00:04:53 +0200
> Jehan-Guillaume de Rorthais  wrote:
> 
> > Demote now keeps backends with no active xid alive. Smart mode keeps all
> > backends: it waits for them to finish their xact and enter read-only. Fast
> > mode terminates backends with an active xid and keeps all other ones.
> > Backends enter "read-only" using LocalXLogInsertAllowed=0 and flip it to -1
> > (check recovery state) once demoted.
> > During demote, no new session is allowed.
> > 
> > As backends with no active xid survive, a new SQL admin function
> > "pg_demote(fast bool, wait bool, wait_seconds int)" had been added.

Just to keep the list informed, I found a race condition leading to backends
trying to write to XLog after they processed the demote signal. E.g.:

  [postmaster]  LOG:  all backends in read only
  [checkpointer]   LOG:  demoting
  [backend]  PANIC:  cannot make new WAL entries during recovery
 STATEMENT:  UPDATE pgbench_accounts [...]

Because of this, the postmaster enters crash recovery while the demote
operation is still in progress.

I have a couple of other subjects right now, but I plan to get back to it soon.

Regards,




Re: Switch to multi-inserts for pg_depend

2020-09-01 Thread Daniel Gustafsson
> On 14 Aug 2020, at 20:23, Alvaro Herrera  wrote:

> The logic to keep track of the number of used slots is baroque, though -- that
> could use a lot of simplification.

What if slot_init was an integer which increments together with the loop
variable until max_slots is reached?  If so, maybe it should be renamed
slot_init_count and slotCount renamed slot_stored_count to make their use
clearer?

> On 31 Aug 2020, at 09:56, Michael Paquier  wrote:

> It has been a couple of weeks, and I am not really sure what is the
> suggestion here.  So I would like to move on with this patch set as
> the changes are straight-forward using the existing routines for
> object addresses by grouping all insert dependencies of the same type.
> Are there any objections?

I'm obviously biased but I think this patchset is a net win.  There are more
things we can do in this space, but it's a good start.

> Attached is a rebased set, where I have added in indexing.h a unique
> definition for the hard limit of 64kB for the amount of data that can
> be inserted at once, based on the suggestion from Alvaro and Andres.

+#define MAX_CATALOG_INSERT_BYTES 65535
This name, and inclusion in a header file, implies that the definition is
somewhat generic as opposed to its actual use.  Using MULTIINSERT rather than
INSERT in the name would clarify that, I reckon.

A few other comments:

+   /*
+* Allocate the slots to use, but delay initialization until we know 
that
+* they will be used.
+*/
I think this comment warrants a longer explanation of why they won't all be
used, or perhaps none of them (which is the real optimization win here).

+   ObjectAddresses *addrs_auto;
+   ObjectAddresses *addrs_normal;
It's not for this patch, but it seems a logical next step would be to be able
to record the DependencyType as well when collecting deps rather than having to
create multiple buckets.

+   /* Normal dependency from a domain to its collation. */
+   /* We know the default collation is pinned, so don't bother recording 
it */
It's just moved and not introduced in this patch, but shouldn't these two lines
be joined into a normal block comment?

cheers ./daniel



Re: New statistics for tuning WAL buffer size

2020-09-01 Thread Fujii Masao




On 2020/08/24 21:00, Masahiro Ikeda wrote:

On 2020-08-24 20:45, Masahiro Ikeda wrote:

Hi, thanks for useful comments.


I agree to expose the number of WAL write caused by full of WAL buffers.
It's helpful when tuning wal_buffers size. Haribabu separated that number
into two fields in his patch; one is the number of WAL write by backend,
and another is by background processes and workers. But I'm not sure
how useful such separation is. I'm ok with just one field for that number.

I agree with you.  I don't think we need to separate the numbers for foreground processes 
and background ones.  WAL buffer is a single resource.  So "Writes due to full WAL 
buffer are happening.  We may be able to boost performance by increasing 
wal_buffers" would be enough.


I made a patch to expose the number of WAL writes caused by the WAL buffers being full.
I'm going to submit this patch to the commitfest.

As Fujii-san and Tsunakawa-san said, it exposes the total number,
since I agreed that we don't need to separate the numbers for
foreground processes and background ones.

By the way, do we need to add other metrics related to WAL?
For example, is the total number of WAL writes to the buffers useful
for calculating the dirty WAL write ratio?

Is it enough as a first step?


I forgot to rebase the current master.
I've attached the rebased patch.


Thanks for the patch!

+/* --
+ * Backend types
+ * --

You seem to have forgotten to add "*/" to the above comment.
This issue could cause the following compiler warning.

../../src/include/pgstat.h:761:1: warning: '/*' within block comment [-Wcomment]


The contents of pg_stat_walwrites are reset when the server
is restarted. Isn't this problematic? IMO since pg_stat_walwrites
is a collected statistics view, basically its contents should be
kept even in the case of server restart.
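
As a usage sketch, this is how the view could feed wal_buffers tuning (the view name comes from this thread; the column name below is only an assumption about the patch):

-- Hypothetical column name; the counter tracks WAL writes caused by full WAL buffers.
SELECT dirty_writes FROM pg_stat_walwrites;

-- If the counter keeps growing under load, wal_buffers may be too small.
-- (Changing wal_buffers requires a server restart.)
ALTER SYSTEM SET wal_buffers = '64MB';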

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Parallel copy

2020-09-01 Thread Greg Nancarrow
Hi Vignesh,

>Can you share with me the script you used to generate the data & the ddl of
>the table, so that it will help me check the scenario where you faced the problem.

Unfortunately I can't directly share it (considered company IP),
though having said that it's only doing something that is relatively
simple and unremarkable, so I'd expect it to be much like what you are
currently doing. I can describe it in general.

The table being used contains 100 columns (as I pointed out earlier),
with the first column of "bigserial" type, and the others of different
types like "character varying(255)", "numeric", "date" and "time
without time zone". There are about 60 of the "character varying(255)"
columns overall, with the other types interspersed.

When testing with indexes, 4 b-tree indexes were used, each of which
included the first column and then 9 other distinct columns.

A CSV record (row) template file was created with test data
(corresponding to the table), and that was simply copied and appended
over and over with a record prefix in order to create the test data
file.
The following shell-script basically does it (but very slowly). I was
using a small C program to do similar, a lot faster.
In my case, N=255 produced about a 5GB CSV file.

file_out=data.csv; N=255; for i in $(seq 1 "$N"); do echo -n "$i," >> "$file_out";
cat sample_record.csv >> "$file_out"; done

One other thing I should mention is that between each test run, I
cleared the OS page cache, as described here:
https://linuxhint.com/clear_cache_linux/
That way, each COPY FROM is not taking advantage of any OS-cached data
from a previous COPY FROM.

If your data is somehow significantly different and you want to (and
can) share your script, then I can try it in my environment.


Regards,
Greg




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-09-01 Thread Justin Pryzby
This patch seems to be missing a call to RelationAssumeNewRelfilenode() in
reindex_index().

That's maybe related to the cause of the crashes I pointed out earlier this
year.

Alexey's v4 patch changed RelationSetNewRelfilenode() to accept a tablespace
parameter, but Michael seemed to object to that.  However that seems cleaner
and ~30 lines shorter.

Michael, would you comment on that ?  The v4 patch and your comments are here.
https://www.postgresql.org/message-id/attachment/105574/v4-0001-Allow-REINDEX-and-REINDEX-CONCURRENTLY-to-change-tablespace.patch
https://www.postgresql.org/message-id/20191127035416.GG5435%40paquier.xyz
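
For context, the user-facing operation under discussion looks roughly like this (syntax as proposed by the patch series and not committed at the time of writing, so details may change):

REINDEX (TABLESPACE new_tablespace) INDEX my_index;
REINDEX (TABLESPACE new_tablespace) TABLE my_table;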

> --- a/src/backend/catalog/index.c
> +++ b/src/backend/catalog/index.c
> @@ -3480,6 +3518,47 @@ reindex_index(Oid indexId, bool 
> skip_constraint_checks, char persistence,
>*/
>   CheckTableNotInUse(iRel, "REINDEX INDEX");
>  
> + if (tablespaceOid == MyDatabaseTableSpace)
> + tablespaceOid = InvalidOid;
> +
> + /*
> +  * Set the new tablespace for the relation.  Do that only in the
> +  * case where the reindex caller wishes to enforce a new tablespace.
> +  */
> + if (set_tablespace &&
> + tablespaceOid != iRel->rd_rel->reltablespace)
> + {
> + Relationpg_class;
> + Form_pg_class   rd_rel;
> + HeapTuple   tuple;
> +
> + /* First get a modifiable copy of the relation's pg_class row */
> + pg_class = table_open(RelationRelationId, RowExclusiveLock);
> +
> + tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(indexId));
> + if (!HeapTupleIsValid(tuple))
> + elog(ERROR, "cache lookup failed for relation %u", 
> indexId);
> + rd_rel = (Form_pg_class) GETSTRUCT(tuple);
> +
> + /*
> +  * Mark the relation as ready to be dropped at transaction 
> commit,
> +  * before making visible the new tablespace change so as this 
> won't
> +  * miss things.
> +  */
> + RelationDropStorage(iRel);
> +
> + /* Update the pg_class row */
> + rd_rel->reltablespace = tablespaceOid;
> + CatalogTupleUpdate(pg_class, &tuple->t_self, tuple);
> +
> + heap_freetuple(tuple);
> +
> + table_close(pg_class, RowExclusiveLock);
> +
> + /* Make sure the reltablespace change is visible */
> + CommandCounterIncrement();





[patch] Fix checksum verification in base backups for zero page headers

2020-09-01 Thread Michael Banck
Hi,

as a continuation of [1], I've split-off the zero page header case from
the last patch, as this one seems less contentious.


Michael

[1] https://commitfest.postgresql.org/28/2308/

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
From 8784d27ff258a6ff1d80f18b22e2b275a3944991 Mon Sep 17 00:00:00 2001
From: Michael Banck 
Date: Tue, 1 Sep 2020 12:14:16 +0200
Subject: [PATCH] Fix checksum verification in base backups for zero page
 headers

We currently do not checksum a page if it is considered new by PageIsNew().
However, this means that if the whole page header is zero, PageIsNew() will
consider that page new due to the pd_upper field being zero and no subsequent
checksum verification is done.

To fix, factor out the all-zeroes check from PageIsVerified() into a new method
PageIsZero() and call that for pages considered new by PageIsNew(). If a page
is all zero, consider that a checksum failure.

Add one more test to the pg_basebackup TAP tests to check this error.

Reported-By: Andres Freund
Reviewed-By: Asif Rehman
Discussion: https://postgr.es/m/20190319200050.ncuxejradurja...@alap3.anarazel.de
---
 src/backend/replication/basebackup.c | 138 ---
 src/backend/storage/page/bufpage.c   |  35 +++--
 src/bin/pg_basebackup/t/010_pg_basebackup.pl |  18 ++-
 src/include/storage/bufpage.h|  11 +-
 4 files changed, 128 insertions(+), 74 deletions(-)

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 6064384e32..30cd5905ed 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1672,70 +1672,28 @@ sendFile(const char *readfilename, const char *tarfilename,
 page = buf + BLCKSZ * i;
 
 /*
- * Only check pages which have not been modified since the
- * start of the base backup. Otherwise, they might have been
- * written only halfway and the checksum would not be valid.
- * However, replaying WAL would reinstate the correct page in
- * this case. We also skip completely new pages, since they
- * don't have a checksum yet.
+ * We skip completely new pages after checking they are
+ * all-zero, since they don't have a checksum yet.
  */
-if (!PageIsNew(page) && PageGetLSN(page) < startptr)
+if (PageIsNew(page))
 {
-	checksum = pg_checksum_page((char *) page, blkno + segmentno * RELSEG_SIZE);
-	phdr = (PageHeader) page;
-	if (phdr->pd_checksum != checksum)
+	if (!PageIsZero(page))
 	{
 		/*
-		 * Retry the block on the first failure.  It's
-		 * possible that we read the first 4K page of the
-		 * block just before postgres updated the entire block
-		 * so it ends up looking torn to us.  We only need to
-		 * retry once because the LSN should be updated to
-		 * something we can ignore on the next pass.  If the
-		 * error happens again then it is a true validation
-		 * failure.
+		 * pd_upper is zero, but the page is not all zero.  We
+		 * cannot run pg_checksum_page() on the page as it
+		 * would throw an assertion failure.  Consider this a
+		 * checksum failure.
 		 */
-		if (block_retry == false)
-		{
-			int			reread_cnt;
-
-			/* Reread the failed block */
-			reread_cnt =
-basebackup_read_file(fd, buf + BLCKSZ * i,
-	 BLCKSZ, len + BLCKSZ * i,
-	 readfilename,
-	 false);
-			if (reread_cnt == 0)
-			{
-/*
- * If we hit end-of-file, a concurrent
- * truncation must have occurred, so break out
- * of this loop just as if the initial fread()
- * returned 0. We'll drop through to the same
- * code that handles that case. (We must fix
- * up cnt first, though.)
- */
-cnt = BLCKSZ * i;
-break;
-			}
-
-			/* Set flag so we know a retry was attempted */
-			block_retry = true;
-
-			/* Reset loop to validate the block again */
-			i--;
-			continue;
-		}
 
 		checksum_failures++;
 
 		if (checksum_failures <= 5)
 			ereport(WARNING,
 	(errmsg("checksum verification failed in "
-			"file \"%s\", block %d: calculated "
-			"%X but expected %X",
-			readfilename, blkno, checksum,
-			phdr->pd_checksum)));
+			"file \"%s\", block %d: pd_upper "
+			"is zero but page is not all-zero",
+			readfilename, blkno)));
 		if (checksum_failures == 5)
 			ereport(WARNING,
 	(errmsg("further checksum verification "
@@ -1743,6 +1701,80 

Re: REINDEX SCHEMA/DATABASE/SYSTEM weak with dropped relations

2020-09-01 Thread Anastasia Lubennikova

On 01.09.2020 04:38, Michael Paquier wrote:

I have added some extra comments.  There is one in
ReindexRelationConcurrently() to mention that there should be no extra
use of MISSING_OK once the list of indexes is built as session locks
are taken where needed.
Great, it took me a moment to understand the logic around the index list
check on the first pass.

Does the version attached look fine to you?  I have done one round of
indentation while on it.


Yes, this version is good.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Include access method in listTables output

2020-09-01 Thread Georgios



‐‐‐ Original Message ‐‐‐
On Tuesday, 1 September 2020 07:41, Michael Paquier  wrote:

> On Thu, Aug 20, 2020 at 08:16:19AM +, Georgios wrote:
>
> > Please find version 7 attached which hopefully addresses the error along 
> > with a proper
> > expansion of the test coverage and removal of recently introduced
> > whitespace warnings.
>
> +CREATE ROLE conditional_tableam_display_role;
> As a convention, regression tests need to have roles prefixed with
> "regress_" or this would cause some buildfarm members to turn red.
> Please see -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS (you could use
> that in your environment for example).


I was wondering about the name. I hoped that it would either come up during 
review, or that it would be fine.

Thank you for the detailed explanation.

>
> So, as of the tests.. The role gets added to make sure that when
> using \d+ on the full schema as well as the various \d*+ variants we
> have a consistent owner. The addition of the relation size for the
> sequence and the btree index in the output generated is a problem
> though, because that's not really portable when compiling with other
> page sizes. It is true that there are other tests failing in this
> case, but I think that we should try to limit that if we can. In
> short, I agree that having some tests is better than nothing, but I
> would suggest to reduce their scope, as per the attached.

I could not agree more. I have only succumbed to the pressure of reviewing.

>
> Adding \dE as there are no foreign tables does not make much sense,
> and also I wondered why \dt+ was not added.
>
> Does the attached look correct to you?

You have my :+1:

>
> ---
>
> Michael






Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-09-01 Thread Alexey Kondratov

On 2020-09-01 13:12, Justin Pryzby wrote:

This patch seems to be missing a call to RelationAssumeNewRelfilenode() in
reindex_index().

That's maybe related to the cause of the crashes I pointed out earlier this
year.

Alexey's v4 patch changed RelationSetNewRelfilenode() to accept a tablespace
parameter, but Michael seemed to object to that.  However that seems cleaner
and ~30 lines shorter.

Michael, would you comment on that ?  The v4 patch and your comments are here.

https://www.postgresql.org/message-id/attachment/105574/v4-0001-Allow-REINDEX-and-REINDEX-CONCURRENTLY-to-change-tablespace.patch
https://www.postgresql.org/message-id/20191127035416.GG5435%40paquier.xyz



Actually, the last time we discussed this point I only got the gut
feeling that this is a subtle place and it is very easy to break things
with these changes. However, it isn't clear to me how exactly. That
said, I'd be glad if Michael could reword his explanation, so it'd be
clearer for me as well.


BTW, I've started doing a review of the last patch set yesterday and 
will try to post some comments later.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Reloptions for table access methods

2020-09-01 Thread gkokolatos






‐‐‐ Original Message ‐‐‐
On Tuesday, 1 September 2020 09:18, Jeff Davis  wrote:

> A custom table access method might want to add a new reloption to
> control something specific to that table access method. Unfortunately,
> if you add a new option of type RELOPT_KIND_HEAP, it will immediately
> fail because of the validation that happens in fillRelOptions().
>
> Right now, heap reloptions (e.g. FILLFACTOR) are validated in two
> places: parseRelOptions() and fillRelOptions().
>
> parseRelOptions() validates against boolRelOpts[], intRelOpts[], etc.
> This validation is extensible by add_bool_reloption(), etc.
>
> fillRelOptions() validates when filling in a struct to make sure there
> aren't "leftover" options. It does this using a hard-coded parsing
> table that is not extensible.
>
> Index access methods get total control over validation of reloptions,
> but that doesn't fit well with heaps, because all heaps need the
> vacuum-related options.
>
> I considered some other approaches, but they all seemed like over-
> engineering, so the attached patch just passes validate=false to
> fillRelOptions() for heaps. That allows custom table access methods to
> just define new options of kind RELOPT_KIND_HEAP.

I have yet to look at the diff. I simply wanted to give you my wholehearted +1 
to the idea. It is a necessary and an essential part of developing access 
methods.

I have worked extensively on the merge of PG12 into Greenplum, with a focus on
the tableam API. Handling reloptions has been quite a pain to do cleanly. Given
the nature of Greenplum's table access methods, we were forced to take it a
couple of steps further. We did use an approach which I am certain that you
have considered and discarded as over-engineering for postgres.

In short, I am really excited to see a patch for this topic!
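
For illustration, the kind of usage this would enable (the access method and option names below are made up; the option would be registered by the extension, e.g. via add_int_reloption()):

CREATE TABLE events (id bigint, payload text)
    USING my_tableam
    WITH (compress_level = 4);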

>
> Regards,
> Jeff Davis






Docs: inaccurate description about config settings

2020-09-01 Thread Li Japin
Hi, hackers

When I set up streaming replication, I found that the documentation says that
the promote_trigger_file parameter can only be set in the postgresql.conf file
or on the server command line; however, it can also be put into
postgresql.auto.conf. If I use include to import a new config file, it works too.

There are many parameters that use this description:
$ grep 'This parameter can only be set in the' -rn doc/
doc/src/sgml/sepgsql.sgml:273:  This parameter can only be set in the 
postgresql.conf
doc/src/sgml/config.sgml:1001:This parameter can only be set in the 
postgresql.conf
doc/src/sgml/config.sgml:1040:for details. This parameter can only be 
set in the
doc/src/sgml/config.sgml:1071:This parameter can only be set in the 
postgresql.conf
doc/src/sgml/config.sgml:1133:This parameter can only be set in the 
postgresql.conf
doc/src/sgml/config.sgml:1151:This parameter can only be set in the 
postgresql.conf
doc/src/sgml/config.sgml:1169:This parameter can only be set in the 
postgresql.conf
doc/src/sgml/config.sgml:1187:This parameter can only be set in the 
postgresql.conf
doc/src/sgml/config.sgml:1204:This parameter can only be set in the 
postgresql.conf
doc/src/sgml/config.sgml:1232:This parameter can only be set in the
doc/src/sgml/config.sgml:1307:This parameter can only be set in the 
postgresql.conf
doc/src/sgml/config.sgml:1334:This parameter can only be set in the 
postgresql.conf
doc/src/sgml/config.sgml:1411:This parameter can only be set in the 
postgresql.conf
doc/src/sgml/config.sgml:1445:This parameter can only be set in the 
postgresql.conf
doc/src/sgml/config.sgml:1471:This parameter can only be set in the 
postgresql.conf
doc/src/sgml/config.sgml:2180: of 10.  This parameter can only be set 
in the
doc/src/sgml/config.sgml:2199: This parameter can only be set in the 
postgresql.conf
doc/src/sgml/config.sgml:2227: This parameter can only be set in the 
postgresql.conf
doc/src/sgml/config.sgml:2259: This parameter can only be set in the 
postgresql.conf
doc/src/sgml/config.sgml:2817:This parameter can only be set in the 
postgresql.conf
doc/src/sgml/config.sgml:2863:This parameter can only be set in the 
postgresql.conf
doc/src/sgml/config.sgml:3012:next higher multiple of 10. This 
parameter can only be set in the
doc/src/sgml/config.sgml:3036:This parameter can only be set in the
….

I think this description is misleading; shouldn't we correct it?
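
For example, a parameter carrying that wording can in fact be set via ALTER SYSTEM, which writes postgresql.auto.conf (the file path below is just a placeholder):

ALTER SYSTEM SET promote_trigger_file = '/tmp/promote.trigger';
SELECT pg_reload_conf();  -- promote_trigger_file is reloadable, so no restart is needed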

Regards,
Japin Li


Group by reordering optimization

2020-09-01 Thread Dmitry Dolgov
Hi,

Better late than never: to follow up on the original thread [1], I would like to
continue the discussion with another version of the patch for group by
reordering optimization. As a reminder, it's about reordering of group by clauses
to do sorting more efficiently. The patch is rebased and modified to address
(at least partially) the suggestions about making it consider new additional
paths instead of changing original ones. It is still pretty much a
proof-of-concept version with many blind spots, but I wanted to start
kicking it and post at least something, otherwise it will never happen. An
incremental approach, so to say.

In many ways it still contains the original code from Teodor. Changes and notes:

* Instead of changing the order directly, the patch now creates another path with
  a modified order of clauses. It does so for the normal sort as well as for
  incremental sort. The whole thing is done in two steps: first it finds a
  potentially better ordering taking into account the number of groups, widths and
  comparison costs; afterwards this information is used to produce a cost
  estimation. This is implemented via a separate create_reordered_sort_path to
  not introduce too many changes; I couldn't find any better place.

* Function get_func_cost was removed at some point, but unfortunately this
  patch was implemented before that, so it's still present there.

* For simplicity I've removed support in create_partial_grouping_paths, since
  they were not covered by the existing tests anyway.

* The costing part is pretty rudimentary and looks only at the first group.
  It's mostly hand crafted to pass the existing tests.

The question about handling skewed data sets is not addressed yet.
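
To illustrate the kind of reordering the patch targets (table and column names are made up):

CREATE TABLE t (a int, b int, payload text);

SELECT a, b, count(*)
FROM t
GROUP BY a, b;

-- The result is identical whichever order the planner groups by, so it is free to
-- sort by (b, a) instead of (a, b) when that ordering is estimated to be cheaper
-- (based on the number of groups, column widths and comparison costs); the patch
-- costs such alternative orderings and adds them as additional sort paths.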

[1]: 
https://www.postgresql.org/message-id/flat/7c79e6a5-8597-74e8-0671-1c39d124c9d6%40sigaev.ru


0001-Group-by-optimization.patch
Description: Binary data


Re: Docs: inaccurate description about config settings

2020-09-01 Thread Ian Lawrence Barwick
2020-09-01 (Tue) 19:37 Li Japin :
>
> Hi, hackers
>
> When I setup a stream replication I found that the documentation says that 
> promote_trigger_file
> parameter can only be set in the postgresql.conf file or on the server 
> command line, however, it
> can also be put into postgresql.auto.conf. If I use include to import a new 
> config, it works too.
>
> There are many parameters use this description:
> $ grep 'This parameter can only be set in the' -rn doc/
(...)
>
> I think this description is misleading. we should correct it, isn't it?

I must admit every time I see this wording, it strikes me as very specific
and potentially confusing given the alternative files the parameter could be
placed in.

I think it would be clearer for anyone not familiar with the configuration file
system to change occurrences of this wording to something like:

  This parameter can only be set in the configuration file

which would link to:

  
https://www.postgresql.org/docs/current/config-setting.html#CONFIG-SETTING-CONFIGURATION-FILE

which provides more information. Though on that page it seems like it would be
also sensible to bundle the section about include directives in the
configuration
file (19.1.5) together with the section about the configuration itself (19.1.2).


Regards

Ian Barwick

-- 
Ian Barwick   https://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Evaluate expression at planning time for two more cases

2020-09-01 Thread Surafel Temesgen
Hi,

Thank you for looking into this

On Fri, Aug 28, 2020 at 9:48 AM Ashutosh Bapat 
wrote:

>  }
>  else
>  has_nonconst_input = true;
> @@ -3382,7 +3395,47 @@ eval_const_expressions_mutator(Node *node,
>
> +
> +if (pkattnos != NULL &&
> bms_is_member(var->varattno - FirstLowInvalidHeapAttributeNumber,
> pkattnos)
> +&& !check_null_side(context->root, relid))
>
> Since this is working on parse->rtable this will work only for top level
> tables
> as against the inherited tables or partitions which may have their own
> primary
> key constraints if the parent doesn't have those.
>
>

In that case the table has to be specified in the FROM clause, otherwise it's an
error.

e.g.

postgres=# CREATE TABLE cities (
    name text,
    population float,
    altitude int
);
CREATE TABLE

postgres=# CREATE TABLE capitals (
    id serial primary key,
    state char(2)
) INHERITS (cities);
CREATE TABLE

postgres=# EXPLAIN SELECT * FROM cities WHERE id is not null;
ERROR: column "id" does not exist
LINE 1: EXPLAIN SELECT * FROM cities WHERE id is not null;


It will not work even on the child table, because the primary key constraint
on the parent table is not enforced on the child table.
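
To illustrate the case the patch does handle (a hypothetical example; the plan shape is paraphrased):

CREATE TABLE t (id int PRIMARY KEY, v text);
EXPLAIN SELECT * FROM t WHERE id IS NOT NULL;
-- With the patch, "id IS NOT NULL" can be evaluated to true at planning time,
-- because id is a primary key column and t is not on the nullable side of an
-- outer join, so the qual drops out of the plan.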



> This better be done when planning individual relations, plain or join or
> upper,
> where all the necessary information is already available with each of the
> relations and also the quals, derived as well as user specified, are
> distributed to individual relations where they should be evalutated. My
> memory
> is hazy but it might be possible do this while distributing the quals
> themselves (distribute_qual_to_rels()).
>
>
The place where all the necessary information is available is in
reduce_outer_joins, as the comment of the function states, but the downside
is that it will only be inexpensive if the query contains an outer join.


> Said that, to me, this looks more like something we should be able to do
> at the
> time of constraint exclusion. But IIRC, we just prove whether constraints
> refute a qual and not necessarily whether constraints imply a qual, making
> it
> redundant, as is required here. E.g. primary key constraint implies key NOT
> NULL rendering a "key IS NOT NULL" qual redundant. It might be better to
> test
> the case when col IS NOT NULL is specified on a column which already has a
> NOT
> NULL constraint. That may be another direction to take. We may require much
> lesser code.
>
>
I didn't add the NOT NULL constraint optimization to the patch because cached
plans are not invalidated in case of a change to a NOT NULL constraint.


> Please add the patch to the next commitfest
> https://commitfest.postgresql.org/.
>
>
I added it here: https://commitfest.postgresql.org/29/2699/
Thank you.

regards
Surafel


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-09-01 Thread k.jami...@fujitsu.com
On Tuesday, August 18, 2020 3:05 PM (GMT+9), Amit Kapila wrote: 
> On Fri, Aug 7, 2020 at 9:33 AM Tom Lane  wrote:
> >
> > Amit Kapila  writes:
> > > On Sat, Aug 1, 2020 at 1:53 AM Andres Freund 
> wrote:
> > >> We could also just use pg_class.relpages. It'll probably mostly be
> > >> accurate enough?
> >
> > > Don't we need the accurate 'number of blocks' if we want to
> > > invalidate all the buffers? Basically, I think we need to perform
> > > BufTableLookup for all the blocks in the relation and then Invalidate all
> buffers.
> >
> > Yeah, there is no room for "good enough" here.  If a dirty buffer
> > remains in the system, the checkpointer will eventually try to flush
> > it, and fail (because there's no file to write it to), and then
> > checkpointing will be stuck.  So we cannot afford to risk missing any
> buffers.
> >
> 
> Today, again thinking about this point it occurred to me that during recovery
> we can reliably find the relation size and after Thomas's recent commit
> c5315f4f44 (Cache smgrnblocks() results in recovery), we might not need to
> even incur the cost of lseek. Why don't we fix this first for 'recovery' (by
> following something on the lines of what Andres suggested) and then later
> once we have a generic mechanism for "caching the relation size" [1], we can
> do it for non-recovery paths.
> I think that will at least address the reported use case with some minimal
> changes.
> 
> [1] -
> https://www.postgresql.org/message-id/CAEepm%3D3SSw-Ty1DFcK%3D1r
> U-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com
> 

Attached is an updated V9 version with minimal code changes only, which
avoids the previous overhead in BufferAlloc. This time, I only updated
the recovery path as suggested by Amit, and followed Andres' suggestion
of referring to the cached blocks in smgrnblocks.
The layering is kinda tricky so the logic may be wrong. But as of now,
it passes the regression tests. I'll follow up with the performance results.
It seems there's a regression for smaller shared_buffers. I'll update if I find
bugs.
But I'd also appreciate your reviews in case I missed something.

Regards,
Kirk Jamison




v9-Speedup-dropping-of-relation-buffers-during-recovery.patch
Description:  v9-Speedup-dropping-of-relation-buffers-during-recovery.patch


clarify "rewritten" in pg_checksums docs

2020-09-01 Thread Michael Banck
Hi,

the pg_checksums docs mention that "When enabling checksums, every file
in the cluster is rewritten".

From IRC discussions, "rewritten" seems ambiguous, it could mean that a
second copy of the file is written and then switched over, implying
increased storage demand during the operation.

So maybe "rewritten in-place" is better, as per the attached?


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
diff --git a/doc/src/sgml/ref/pg_checksums.sgml b/doc/src/sgml/ref/pg_checksums.sgml
index 8e7807f86b..1dd4e54ff1 100644
--- a/doc/src/sgml/ref/pg_checksums.sgml
+++ b/doc/src/sgml/ref/pg_checksums.sgml
@@ -47,8 +47,8 @@ PostgreSQL documentation
 
   
When verifying checksums, every file in the cluster is scanned. When
-   enabling checksums, every file in the cluster is rewritten. Disabling
-   checksums only updates the file pg_control.
+   enabling checksums, every file in the cluster is rewritten in-place.
+   Disabling checksums only updates the file pg_control.
   
  
 


Re: WIP: WAL prefetch (another approach)

2020-09-01 Thread Tomas Vondra

On Thu, Aug 13, 2020 at 06:57:20PM +1200, Thomas Munro wrote:

On Thu, Aug 6, 2020 at 10:47 PM Tomas Vondra
 wrote:

On Thu, Aug 06, 2020 at 02:58:44PM +1200, Thomas Munro wrote:
>On Tue, Aug 4, 2020 at 3:47 AM Tomas Vondra
>> Any luck trying to reproduce thigs? Should I try again and collect some
>> additional debug info?
>
>No luck.  I'm working on it now, and also trying to reduce the
>overheads so that we're not doing extra work when it doesn't help.

OK, I'll see if I can still reproduce it.


Since someone else ask me off-list, here's a rebase, with no
functional changes.  Soon I'll post a new improved version, but this
version just fixes the bitrot and hopefully turns cfbot green.


I've decided to do some tests with this patch version, but I immediately
ran into issues. What I did was initialize a 32GB pgbench database,
back it up (shutdown + tar) and then run pgbench for 2h with archiving.
And then I restored the backed-up data directory and instructed it to
replay WAL from the archive. There's about 16k WAL segments, so about
256GB of WAL.

Unfortunately, the very first thing that happens after starting the
recovery is this:

LOG:  starting archive recovery
LOG:  restored log file "000100160080" from archive
LOG:  consistent recovery state reached at 16/80A0
LOG:  redo starts at 16/80A0
LOG:  database system is ready to accept read only connections
LOG:  recovery started prefetching on timeline 1 at 0/80A0
LOG:  recovery no longer prefetching: unexpected pageaddr 8/8400 in log 
segment 000100160081, offset 0
LOG:  restored log file "000100160081" from archive
LOG:  restored log file "000100160082" from archive

So we start applying 000100160081 and it fails almost
immediately on the first segment. This is confirmed by prefetch stats,
which look like this:

-[ RECORD 1 ]---+-
stats_reset | 2020-09-01 15:02:31.18766+02
prefetch| 1044
skip_hit| 1995
skip_new| 87
skip_fpw| 2108
skip_seq| 27
distance| 0
queue_depth | 0
avg_distance| 135838.95
avg_queue_depth | 8.852459

So we do a little bit of prefetching and then it gets disabled :-(

The segment looks perfectly fine when inspected using pg_waldump, see
the attached file.

I've tested this applied on 6ca547cf75ef6e922476c51a3fb5e253eef5f1b6,
and the failure seems fairly similar to what I reported before, except
that now it happened right at the very beginning.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


000100160081.log.gz
Description: application/gzip


Re: clarify "rewritten" in pg_checksums docs

2020-09-01 Thread Daniel Gustafsson
> On 1 Sep 2020, at 15:13, Michael Banck  wrote:

> the pg_checksums docs mention that "When enabling checksums, every file
> in the cluster is rewritten".
> 
> From IRC discussions, "rewritten" seems ambiguous, it could mean that a
> second copy of the file is written and then switched over, implying
> increased storage demand during the operation.

Makes sense, I can see that confusion.

> So maybe "rewritten in-place" is better, as per the attached?

Isn't "modified in-place" a more accurate description of the process?

cheers ./daniel




Re: clarify "rewritten" in pg_checksums docs

2020-09-01 Thread Michael Banck
Hi,

Am Dienstag, den 01.09.2020, 15:29 +0200 schrieb Daniel Gustafsson:
> > On 1 Sep 2020, at 15:13, Michael Banck  wrote:
> > the pg_checksums docs mention that "When enabling checksums, every file
> > in the cluster is rewritten".
> > 
> > From IRC discussions, "rewritten" seems ambiguous, it could mean that a
> > second copy of the file is written and then switched over, implying
> > increased storage demand during the operation.
> 
> Makes sense, I can see that confusion.
> 
> > So maybe "rewritten in-place" is better, as per the attached?
> 
> Isn't "modified in-place" a more accurate description of the process?

AIUI we do rewrite the whole file (block by block, after updating the
page header with the checksum), so yeah, I thought about using "modified"
instead but then decided "rewritten" is as accurate (or even more so).


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz





Re: WIP: WAL prefetch (another approach)

2020-09-01 Thread Thomas Munro
On Wed, Sep 2, 2020 at 1:14 AM Tomas Vondra
 wrote:
> from the archive

Ahh, so perhaps that's the key.

> I've tested this applied on 6ca547cf75ef6e922476c51a3fb5e253eef5f1b6,
> and the failure seems fairly similar to what I reported before, except
> that now it happened right at the very beginning.

Thanks, will see if I can work out why.  My newer version probably has
the same problem.




Re: WIP: WAL prefetch (another approach)

2020-09-01 Thread Tomas Vondra

On Wed, Sep 02, 2020 at 02:05:10AM +1200, Thomas Munro wrote:

On Wed, Sep 2, 2020 at 1:14 AM Tomas Vondra
 wrote:

from the archive


Ahh, so perhaps that's the key.



Maybe. For the record, the commands look like this:

archive_command = 'gzip -1 -c %p > /mnt/raid/wal-archive/%f.gz'

restore_command = 'gunzip -c /mnt/raid/wal-archive/%f.gz > %p.tmp && mv %p.tmp 
%p'


I've tested this applied on 6ca547cf75ef6e922476c51a3fb5e253eef5f1b6,
and the failure seems fairly similar to what I reported before, except
that now it happened right at the very beginning.


Thanks, will see if I can work out why.  My newer version probably has
the same problem.


OK.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: clarify "rewritten" in pg_checksums docs

2020-09-01 Thread Daniel Gustafsson
> On 1 Sep 2020, at 15:34, Michael Banck  wrote:
> Am Dienstag, den 01.09.2020, 15:29 +0200 schrieb Daniel Gustafsson:

>> Isn't "modified in-place" a more accurate description of the process?
> 
> AIUI we do rewrite the whole file (block by block, after updating the
> page header with the checksum), so yeah, I though about using modified
> instead but then decided rewritten is pretty (or even more) accurate.

Well, I was thinking less technically accurate and more descriptive for end
users, hiding the implementation details.  "Rewrite" sounds to me more like
changing data rather than amending pages with a checksum keeping data intact.
Either way, adding "in-place" is an improvement IMO.

cheers ./daniel




Re: Online checksums patch - once again

2020-09-01 Thread Daniel Gustafsson
> On 28 Jul 2020, at 04:33, Justin Pryzby  wrote:
> 
> On Thu, Jun 25, 2020 at 11:43:00AM +0200, Daniel Gustafsson wrote:
>> The attached v19 fixes a few doc issues I had missed.
> 
> +   They can also be enabled or disabled at a later timne, either as an 
> offline
> => time

Fixed.

> +  * awaiting shutdown, but we can continue turning off checksums anyway
> => a waiting

This was intentional, as it refers to the impending requested shutdown and not
one which is blocked waiting.  I've reworded the comment to make this clearer.

> +  * We are starting a checksumming process scratch, and need to start by
> => FROM scratch

Fixed.

> +  * to inprogress new relations will set relhaschecksums in pg_class so 
> it
> => inprogress COMMA

Fixed.

> +  * Relation no longer exist. We don't consider this an error 
> since
> => exists

Fixed.

> +  * so when the cluster comes back up processing will habe to be resumed.
> => have

Fixed.

> +  "completed one pass over all databases for checksum 
> enabling, %i databases processed",
> => I think this will be confusing to be hardcoded "one".  It'll say "one" over
> and over.

Good point, I've reworded this based on the number of processed databases to
indicate what will happen next.  This call should probably be removed in case
we merge and ship this feature, but it's handy for this part of the patch
process.

> +  * still exist.
> => exists

Fixed.

> In many places, you refer to "datachecksumsworker" (sums) but in nine places
> you refer to datachecksumworker (sum).

Good catch, they should all be using "sums". Fixed.

> +ProcessSingleRelationFork(Relation reln, ForkNumber forkNum, 
> BufferAccessStrategy strategy)
> +{
> + BlockNumber numblocks = RelationGetNumberOfBlocksInFork(reln, forkNum);
> 
> => I think looping over numblocks is safe since new blocks are intended to be
> written with checksum, right?  Maybe it's good to say that here.

Fixed.

> + BlockNumber b;
> 
> blknum will be easier to grep for

Fixed.

> + (errmsg("background worker \"datachecksumsworker\" 
> starting for database oid %d",
> => Should be %u or similar (several of these)

Fixed.  As per Roberts review downthread, these will be reworded in a future
version.

> It looks like you rewrite every page, even if it already has correct checksum,
> to handle replicas.  I wonder if it's possible/reasonable/good to skip pages
> with correct checksum when wal_level=minimal ?

That would AFAICT be possible, but I'm not sure it's worth adding that before
the patch is deemed safe in its simpler form.  I've added a comment to record
this as a potential future optimization.

> It looks like it's not possible to change the checksum delay while a checksum
> worker is already running.  That may be important to allow: 1) decreased delay
> during slow periods; 2) increased delay if the process is significantly done,
> but needs to be throttled to avoid disrupting production environment.
> 
> Have you collaborated with Julien about this one?  His patch adds new GUCs:
> https://www.postgresql.org/message-id/20200714090808.GA20780@nol
> checksum_cost_delay
> checksum_cost_page
> checksum_cost_limit

I honestly hadn't thought about that, but I very much agree that any controls
introduced should work the same for both of these patches.

> Maybe you'd say that Julien's pg_check_relation() should accept parameters
> instead of adding GUCs.  I think you should be in agreement on that.  It'd be
> silly if the verification function added three GUCs and allowed adjusting
> throttle midcourse, but the checksum writer process didn't use them.

Agreed.  I'm not a fan of using yet more GUCs for controlling this, but I don't
have a good argument against it.  It's in line with the cost-based vacuum delays
so I guess it's the most appropriate interface.

> If you used something like that, I guess you'd also want to distinguish
> checksum_cost_page_read vs write.  Possibly, the GUCs part should be a
> preliminary shared patch 0001 that you both used.

+1.

Thanks for the review!  I will attach a v20 to Robert's email with these changes
included as well.

cheers ./daniel





Re: New default role- 'pg_read_all_data'

2020-09-01 Thread Georgios Kokolatos
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested

Version 2 of the patch implements a useful feature. Based on the mailing list
discussion, it is also a feature that the community desires.

The code seems to be correct and it follows the style. The patch comes complete
with tests and documentation.

As a non-native English speaker, I did not notice any syntactical or
grammatical errors in the documentation. Yet that should not mean a lot.

As far as I am concerned, this version of the patch is ready for a committer.

Please feel free to contest my review, if you think I am wrong.

The new status of this patch is: Ready for Committer


Re: Docs: inaccurate description about config settings

2020-09-01 Thread Li Japin


On Sep 1, 2020, at 8:20 PM, Ian Lawrence Barwick  wrote:

2020-09-01 (Tue) 19:37 Li Japin :

Hi, hackers

When I setup a stream replication I found that the documentation says that 
promote_trigger_file
parameter can only be set in the postgresql.conf file or on the server command 
line, however, it
can also be put into postgresql.auto.conf. If I use include to import a new 
config, it works too.

There are many parameters that use this description:
$ grep 'This parameter can only be set in the' -rn doc/
(...)

I think this description is misleading. we should correct it, isn't it?

I must admit every time I see this wording, it strikes me as very specific
and potentially confusing given the alternative files the parameter could be
placed in.

I think it would be clearer for anyone not familiar with the configuration file
system to change occurrences of this wording to something like:

 This parameter can only be set in the configuration file

which would link to:

 
https://www.postgresql.org/docs/current/config-setting.html#CONFIG-SETTING-CONFIGURATION-FILE

which provides more information. Though on that page it seems like it would
also be sensible to bundle the section about include directives in the
configuration
file (19.1.5) together with the section about the configuration itself (19.1.2).

+1

Attached is a fix following Ian Barwick’s suggestion.

Regards,
Japin Li






fix-confusing-configuration-parameters-setting-desc.patch
Description: fix-confusing-configuration-parameters-setting-desc.patch


Re: Remove line length restriction in passwordFromFile()

2020-09-01 Thread Tom Lane
Fujii Masao  writes:
> The patch looks good to me, except the following minor thing.
> + if (fgets(buf.data + buf.len, buf.maxlen - buf.len - 1, fp) == 
> NULL)
> IIUC fgets() reads the data with the specified size - 1, so ISTM that -1 of
> "buf.maxlen - buf.len - 1" is not necessary.

Good point, I was being unduly conservative.  Thanks for reviewing
the patch!

regards, tom lane




Kerberos support broken on MSVC builds for Windows x64?

2020-09-01 Thread Dave Page
I was experimenting with building with MIT Kerberos support on 64 bit
Windows using MSVC and ran into a number of linker errors along the lines
of:

"C:\Users\dpage\Downloads\postgresql-12.4\pgsql.sln" (default target) (1) ->
"C:\Users\dpage\Downloads\postgresql-12.4\zic.vcxproj" (default target) (2)
->
(Link target) ->
  LINK : fatal error LNK1181: cannot open input file
'C:\Progra~1\MIT\Kerberos\lib.obj'
[C:\Users\dpage\Downloads\postgresql-12.4\zic.vcxproj]

That was after I had to manually add the include and lib paths in
buildenv.pl. Diving in a bit further I found a couple of things:

1) The only buildfarm machine doing 64bit Windows Kerberos enabled builds
with MSVC is hammerkop. It enables it by setting the "krb5" option in
config.pl, however, as far as I can see (going back to 9.5), the option is
actually "gss". I can't see any sign in the log for the make step that it
actually is making any attempt to build with Kerberos, despite the UI
showing the icon for it.

2) I can't find anything in the MSVC build scripts in src/tools/msvc to
deal with 64bit Kerberos builds - Solution.pm seems to unconditionally try
to link with the 32bit libraries (e.g. lib/i386/krb5_32.lib instead of
lib/amd64/krb5_64.lib).

I'm assuming no one has tried a build with 64bit Kerberos, or am I missing
something?

Sidenote: I'm not sure even a 32bit Kerberos build will work, as
Solution.pm assumes the headers are in $self->{options}->{gss} .
'\inc\krb5', however in at least the latest installer from MIT they're
actually in $self->{options}->{gss} . '\include'.

-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EDB: http://www.enterprisedb.com


Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-09-01 Thread Alvaro Herrera
On 2020-Aug-11, Justin Pryzby wrote:
> On Tue, Aug 11, 2020 at 02:39:45PM +0900, Michael Paquier wrote:

> > The grammar that has been committed was the one that got the most
> > support, so we need to live with that.  I wonder if we should simplify
> > ReindexStmt and move the "concurrent" flag to be under "options", but
> > that may not be worth the time spent on as long as we don't have
> > CONCURRENTLY part of the parenthesized grammar.
> 
> I think it's kind of a good idea, since the next patch does exactly that
> (parenthesize (CONCURRENTLY)).
> 
> I included that as a new 0002, but it doesn't save anything though, so maybe
> it's not a win.

The advantage of using a parenthesized option list is that you can add
*further* options without making the new keywords reserved.  Of course,
we already reserve CONCURRENTLY and VERBOSE pretty severely, so there's
no change.  If you wanted REINDEX FLUFFY then it wouldn't work without
making that at least type_func_name_keyword I think; but REINDEX
(FLUFFY) would work just fine.  And of course the new feature at hand
can be implemented.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-09-01 Thread Justin Pryzby
On Tue, Sep 01, 2020 at 11:40:18AM -0400, Alvaro Herrera wrote:
> On 2020-Aug-11, Justin Pryzby wrote:
> > On Tue, Aug 11, 2020 at 02:39:45PM +0900, Michael Paquier wrote:
> 
> > > The grammar that has been committed was the one that got the most
> > > support, so we need to live with that.  I wonder if we should simplify
> > > ReindexStmt and move the "concurrent" flag to be under "options", but
> > > that may not be worth the time spent on as long as we don't have
> > > CONCURRENTLY part of the parenthesized grammar.
> > 
> > I think it's kind of a good idea, since the next patch does exactly that
> > (parenthesize (CONCURRENTLY)).
> > 
> > I included that as a new 0002, but it doesn't save anything though, so maybe
> > it's not a win.
> 
> The advantage of using a parenthesized option list is that you can add
> *further* options without making the new keywords reserved.  Of course,
> we already reserve CONCURRENTLY and VERBOSE pretty severely, so there's
> no change.  If you wanted REINDEX FLUFFY then it wouldn't work without
> making that at least type_func_name_keyword I think; but REINDEX
> (FLUFFY) would work just fine.  And of course the new feature at hand
> can be implemented.

The question isn't whether to use a parenthesized option list.  I realized that
long ago (even though Alexey didn't initially like it).  Check 0002, which gets
rid of "bool concurrent" in favour of stmt->options&REINDEXOPT_CONCURRENT.

-- 
Justin




Re: "cert" + clientcert=verify-ca in pg_hba.conf?

2020-09-01 Thread Bruce Momjian
On Tue, Sep  1, 2020 at 01:59:25PM +0900, Kyotaro Horiguchi wrote:
> At Mon, 31 Aug 2020 11:34:29 -0400, Bruce Momjian  wrote in 
> > On Mon, Aug 31, 2020 at 05:56:58PM +0900, Kyotaro Horiguchi wrote:
> > > Ok, this is that. If we spcify clientcert=no-verify other than for
> > > "cert" authentication, server complains as the following at startup.
> > 
> > Why does clientcert=no-verify have any value, even for a
> > cert-authentication line?
> > 
> > > > LOG:  no-verify or 0 is the default setting that is discouraged to use 
> > > > explicitly for clientcert option
> > > > HINT:  Consider removing the option instead. This option value is going 
> > > > to be deprecated in later version.
> > > > CONTEXT:  line 90 of configuration file 
> > > > "/home/horiguti/data/data_noverify/pg_hba.conf"
> > 
> > I think it should just be removed in PG 14.  This is a configuration
> > setting, not an SQL-level item that needs a deprecation period.
> 
> Ok, it is changed to just error out. I was tempted to show a suggestion to
> remove the option in that case like the following, but *didn't* in
> this version of the patch.

OK, I have developed the attached patch based on yours.  I reordered the
tests, simplified the documentation, and removed the hint since they
will already get a good error message, and we will document this change
in the release notes.  It is also good you removed the 0/1 values for
this, since that was also confusing.  We will put that removal in the
release notes too.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

diff --git a/doc/src/sgml/client-auth.sgml b/doc/src/sgml/client-auth.sgml
new file mode 100644
index 5cd88b4..a0d584f
*** a/doc/src/sgml/client-auth.sgml
--- b/doc/src/sgml/client-auth.sgml
*** host ... radius radiusservers="server1,s
*** 2042,2054 
 
  
 
! In a pg_hba.conf record specifying certificate
! authentication, the authentication option clientcert is
! assumed to be verify-ca or verify-full,
! and it cannot be turned off since a client certificate is necessary for this
! method. What the cert method adds to the basic
! clientcert certificate validity test is a check that the
! cn attribute matches the database user name.
 

  
--- 2042,2051 
 
  
 
! It is redundant to use the clientcert option with
! cert authentication because cert
! authentication is effectively trust authentication
! with clientcert=verify-full.
 

  
diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml
new file mode 100644
index 6cda39f..322ac8e
*** a/doc/src/sgml/runtime.sgml
--- b/doc/src/sgml/runtime.sgml
*** pg_dumpall -p 5432 | psql -d postgres -p
*** 2282,2290 
 The clientcert authentication option is available for
 all authentication methods, but only in pg_hba.conf lines
 specified as hostssl.  When clientcert is
!not specified or is set to no-verify, the server will still
!verify any presented client certificates against its CA file, if one is
!configured — but it will not insist that a client certificate be presented.

  

--- 2282,2289 
 The clientcert authentication option is available for
 all authentication methods, but only in pg_hba.conf lines
 specified as hostssl.  When clientcert is
!not specified, the server verifies the client certificate against its CA
!file only if a client certificate is presented and the CA is configured.

  

diff --git a/src/backend/libpq/hba.c b/src/backend/libpq/hba.c
new file mode 100644
index 9d63830..3a87673
*** a/src/backend/libpq/hba.c
--- b/src/backend/libpq/hba.c
*** parse_hba_auth_opt(char *name, char *val
*** 1708,1736 
  			*err_msg = "clientcert can only be configured for \"hostssl\" rows";
  			return false;
  		}
! 		if (strcmp(val, "1") == 0
! 			|| strcmp(val, "verify-ca") == 0)
! 		{
! 			hbaline->clientcert = clientCertCA;
! 		}
! 		else if (strcmp(val, "verify-full") == 0)
  		{
  			hbaline->clientcert = clientCertFull;
  		}
! 		else if (strcmp(val, "0") == 0
!  || strcmp(val, "no-verify") == 0)
  		{
  			if (hbaline->auth_method == uaCert)
  			{
  ereport(elevel,
  		(errcode(ERRCODE_CONFIG_FILE_ERROR),
! 		 errmsg("clientcert can not be set to \"no-verify\" when using \"cert\" authentication"),
  		 errcontext("line %d of configuration file \"%s\"",
  	line_num, HbaFileName)));
! *err_msg = "clientcert can not be set to \"no-verify\" when using \"cert\" authentication";
  return false;
  			}
! 			hbaline->clientcert = clientCertOff;
  		}
  		else
  		{
--- 1708,1732 
  			*err_msg = "clientcert can only be configured for \"hostssl\" rows";
  			return false;
  		}
! 
! 		if (strcmp(val, "verify-full") == 0)
  		{
  			hbaline->clientcert = clientCertFull;
  		}
! 		else i

Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-09-01 Thread Alvaro Herrera
On 2020-Sep-01, Justin Pryzby wrote:

> On Tue, Sep 01, 2020 at 11:40:18AM -0400, Alvaro Herrera wrote:

> > The advantage of using a parenthesized option list is that you can add
> > *further* options without making the new keywords reserved.  Of course,
> > we already reserve CONCURRENTLY and VERBOSE pretty severely, so there's
> > no change.  If you wanted REINDEX FLUFFY then it wouldn't work without
> > making that at least type_func_name_keyword I think; but REINDEX
> > (FLUFFY) would work just fine.  And of course the new feature at hand
> > can be implemented.
> 
> The question isn't whether to use a parenthesized option list.  I realized 
> that
> long ago (even though Alexey didn't initially like it).  Check 0002, which 
> gets
> rid of "bool concurrent" in favour of stmt->options&REINDEXOPT_CONCURRENT.

Ah!  I see, sorry for the noise.  Well, respectfully, having a separate
boolean to store one option when you already have a bitmask for options
is silly.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Is it possible to set end-of-data marker for COPY statement.

2020-09-01 Thread Bruce Momjian
On Tue, Sep  1, 2020 at 06:14:45AM +, Junfeng Yang wrote:
> Hi hackers,
> 
> As described in the doc https://www.postgresql.org/docs/current/sql-copy.html,
> the TEXT format recognizes
> backslash-period (\.) as end-of-data marker.
> 
> The example below will raise an error for the line contains `\.`.
> 
> CREATE TABLE test (
> id int,
> name text,
> dep text
> )
> 
> Data in file "/tmp/data".
> 
> 122,as\.d,adad
> 133,sa dad,adadad
> 
> Then execute
> 
> copy test from '/tmp/data' DELIMITER ',';
> 
> An end-of-copy marker corrupt error will be raised.
> 
> This requires users to escape the end-of-data marker manually in their data.
> Why we don't have a mechanism to define other characters as end-of-data 
> marker?
> Or there are other ways to avoid escape the end-of-data in data?

This is the first I am hearing of this.  The problem is that the system
can't decide if \. is escaping a delimiter, or the end-of-copy marker. 
I think we need to just disable period as a delimiter.  I don't think
there is enough demand to allow the end-of-data marker to be
configurable.

Interestingly, you can use period as a delimiter if you are copying from
a file that doesn't need an end-of-data marker and you never need to
escape the delimiter, but that seems like too rare a use case to allow
period to be supported as a delimiter.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Kerberos support broken on MSVC builds for Windows x64?

2020-09-01 Thread Dave Page
On Tue, Sep 1, 2020 at 4:22 PM Dave Page  wrote:

> I was experimenting with building with MIT Kerberos support on 64 bit
> Windows using MSVC and ran into a number of linker errors along the lines
> of:
>
> "C:\Users\dpage\Downloads\postgresql-12.4\pgsql.sln" (default target) (1)
> ->
> "C:\Users\dpage\Downloads\postgresql-12.4\zic.vcxproj" (default target)
> (2) ->
> (Link target) ->
>   LINK : fatal error LNK1181: cannot open input file
> 'C:\Progra~1\MIT\Kerberos\lib.obj'
> [C:\Users\dpage\Downloads\postgresql-12.4\zic.vcxproj]
>
> That was after I had to manually add the include and lib paths in
> buildenv.pl. Diving in a bit further I found a couple of things:
>
> 1) The only buildfarm machine doing 64bit Windows Kerberos enabled builds
> with MSVC is hammerkop. It enables it by setting the "krb5" option in
> config.pl, however, as far as I can see (going back to 9.5), the option
> is actually "gss". I can't see any sign in the log for the make step that
> it actually is making any attempt to build with Kerberos, despite the UI
> showing the icon for it.
>
> 2) I can't find anything in the MSVC build scripts in src/tools/msvc to
> deal with 64bit Kerberos builds - Solution.pm seems to unconditionally try
> to link with the 32bit libraries (e.g. lib/i386/krb5_32.lib instead of
> lib/amd64/krb5_64.lib).
>
> I'm assuming no one has tried a build with 64bit Kerberos, or am I missing
> something?
>
> Sidenote: I'm not sure even a 32bit Kerberos build will work, as
> Solution.pm assumes the headers are in $self->{options}->{gss} .
> '\inc\krb5', however in at least the latest installer from MIT they're
> actually in $self->{options}->{gss} . '\include'.
>

Attached is a patch against 12.4 for the build system in case anyone wants
to play (I'll do it properly against the head branch later). I'm guessing
this will work for < 12, as with 12 I'm now getting the following which
looks like it's related to GSS encryption:

"C:\Users\dpage\Downloads\postgresql-12.4\pgsql.sln" (default target) (1) ->
"C:\Users\dpage\Downloads\postgresql-12.4\pgcrypto.vcxproj" (default
target) (2) ->
"C:\Users\dpage\Downloads\postgresql-12.4\postgres.vcxproj" (default
target) (3) ->
(Link target) ->
  be-secure-gssapi.obj : error LNK2019: unresolved external symbol setenv
referenced in function secure_open_gssapi
[C:\Users\dpage\Downloads\postgresql-12.4\postgres.vcxproj]
  .\Release\postgres\postgres.exe : fatal error LNK1120: 1 unresolved
externals [C:\Users\dpage\Downloads\postgresql-12.4\postgres.vcxproj]

I'll dig into that some more.

-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EDB: http://www.enterprisedb.com


msvc64-kerberos.diff
Description: Binary data


Re: Reloptions for table access methods

2020-09-01 Thread Alvaro Herrera
On 2020-Aug-31, Jeff Davis wrote:

> fillRelOptions() validates when filling in a struct to make sure there
> aren't "leftover" options. It does this using a hard-coded parsing
> table that is not extensible.

Hmm, I think that if we're going to do this, we should do it for all
AMs, not just table AMs, since surely index AMs also want extensible
reloptions; and I think that makes the 'validate' mechanism dead code
and so we should remove it.

I think this means that you can have tables with mistyped options, and
you'd not get an error message about them.  Is that right?  Is that what
we want?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Kerberos support broken on MSVC builds for Windows x64?

2020-09-01 Thread Stephen Frost
Greetings,

* Dave Page (dp...@pgadmin.org) wrote:
> Attached is a patch against 12.4 for the build system in case anyone wants
> to play (I'll do it properly against the head branch later). I'm guessing
> this will work for < 12, as with 12 I'm now getting the following which
> looks like it's related to GSS encryption:
> 
> "C:\Users\dpage\Downloads\postgresql-12.4\pgsql.sln" (default target) (1) ->
> "C:\Users\dpage\Downloads\postgresql-12.4\pgcrypto.vcxproj" (default
> target) (2) ->
> "C:\Users\dpage\Downloads\postgresql-12.4\postgres.vcxproj" (default
> target) (3) ->
> (Link target) ->
>   be-secure-gssapi.obj : error LNK2019: unresolved external symbol setenv
> referenced in function secure_open_gssapi
> [C:\Users\dpage\Downloads\postgresql-12.4\postgres.vcxproj]
>   .\Release\postgres\postgres.exe : fatal error LNK1120: 1 unresolved
> externals [C:\Users\dpage\Downloads\postgresql-12.4\postgres.vcxproj]
> 
> I'll dig into that some more.

Yes, that'd be in the GSSENC code, which I hadn't been expecting to be
used under Windows.  If you're successful, I don't have any issue
helping to make that work, though I'm curious if you're trying to build
with MIT KfW (which is rather ancient these days, being based on krb5
1.13 and not updated since..) or with a more current release...?

Of course, it'd be good to get a buildfarm animal in place that's
actually testing this if we're going to make it work.

Regarding the setenv() call, should be able to use pgwin32_putenv() in
place on Windows, I'd think..?
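
For what it's worth, a minimal shim along these lines might be all the GSSENC
code needs on that platform (rough sketch, not a posted patch; the wrapper name
and its placement are assumptions, while pgwin32_putenv() is the existing
function from src/port/):

    #ifdef WIN32
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    extern int pgwin32_putenv(const char *);

    /* hypothetical setenv() stand-in built on pgwin32_putenv() */
    static int
    pgwin32_setenv_shim(const char *name, const char *value, int overwrite)
    {
        char   *envstr;

        if (!overwrite && getenv(name) != NULL)
            return 0;

        /* not freed: the string may become part of the environment */
        envstr = malloc(strlen(name) + strlen(value) + 2);
        if (envstr == NULL)
            return -1;
        sprintf(envstr, "%s=%s", name, value);

        return pgwin32_putenv(envstr);
    }
    #endif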

Thanks,

Stephen




Re: Is it possible to set end-of-data marker for COPY statement.

2020-09-01 Thread David G. Johnston
On Tue, Sep 1, 2020 at 9:05 AM Bruce Momjian  wrote:

> On Tue, Sep  1, 2020 at 06:14:45AM +, Junfeng Yang wrote:
> > Hi hackers,
> >
>
> > Data in file "/tmp/data".
> >
> > 122,as\.d,adad
> > 133,sa dad,adadad
> >
> > Then execute
> >
> > copy test from '/tmp/data' DELIMITER ',';
> >
> > An end-of-copy marker corrupt error will be raised.
>
> This is the first I am hearing of this.  The problem is that the system
> can't decide if \. is escaping a delimiter, or the end-of-copy marker.
> I think we need to just disable period as a delimiter.  I don't think
> there is enough demand to allow the end-of-data marker to be
> configurable.
>
> Interestingly, you can use period as a delimiter if you are copying from
> a file that doesn't need an end-of-data marker and you never need to
> escape the delimiter, but that seems like too rare a use case to allow
> period to be supported as a delimiter.
>

Something isn't right here because the rules for end-of-copy are explicit
that the \. must appear on a line all by itself.  That isn't the case with
the shown test data.

The system should do one of two things with that input (it seems option 2
is the one we've chosen):

One, see that the character following the backslash is not an action
character and just treat the backslash as data.
Two, complain that the character following the backslash is not a valid
action character.

The system is reporting an error; it's just trying to be helpful and, seeing
the period, it incorrectly reports that the error has something to do with
the end-of-copy marker, when in reality all that can be said is that "a
period in this location is not valid" (unless the command uses DELIMITER
'' at least, in which case the period is now valid and \. means a
literal period since it's not alone on a line.)

The only limitation our definition of end-of-copy imposes is that a single
column input file cannot contain a record that is only a period.  It does
not impose a limitation on which delimiters are valid.

David J.


Re: v13: show extended stats target in \d

2020-09-01 Thread Georgios Kokolatos
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested

Hi,

I will humbly disagree with the current review. I shall refrain from changing
the status though because I do not have a very strong feeling about it.

I am in agreement that it is a helpful feature and as implemented, the result 
seems to be the desired one.

However the patch contains:

- "  'm' = any(stxkind) 
AS mcv_enabled\n"
+ "  'm' = any(stxkind) 
AS mcv_enabled,\n"
+ "  %s"
  "FROM 
pg_catalog.pg_statistic_ext stat "
  "WHERE stxrelid = 
'%s'\n"
  "ORDER BY 1;",
+ (pset.sversion >= 
13 ? "stxstattarget\n" : "-1\n"),
  oid);

This seems to break the pattern in describeOneTableDetails() a bit.
First, it is customary to write the whole query for each version in its own
block. I find this pattern rather helpful and clean, so in my humble opinion
the ternary expression should be replaced with a distinct if block that
implements stxstattarget. Second, setting the value to -1 as an indication of
absence breaks the pattern a bit further. More on that below.

+           if (strcmp(PQgetvalue(result, i, 8), "-1") != 0)
+               appendPQExpBuffer(&buf, " STATISTICS %s",
+                                 PQgetvalue(result, i, 8));
+

In the same function, one will find a slightly different pattern for printing
the statistics, e.g.

    if (strcmp(PQgetvalue(result, i, 7), "t") == 0)

I will again hold the opinion that if the query gets crafted a bit differently,
one can inspect whether the table has `stxstattarget` and then go ahead and
print the value.

Something along the lines of:

    if (strcmp(PQgetvalue(result, i, 8), "t") == 0)
        appendPQExpBuffer(&buf, " STATISTICS %s",
                          PQgetvalue(result, i, 9));

Finally, the patch might be more complete if a note were added to the
documentation. Have you considered adding something to
doc/src/sgml/ref/psql-ref.sgml? If not, would you consider it? If you did
consider it and decided against it, why?

Regards,
Georgios

Re: autovac issue with large number of tables

2020-09-01 Thread Kasahara Tatsuhito
Hi,

On Wed, Aug 12, 2020 at 2:46 AM Tom Lane  wrote:
> So I think Kasahara-san's point is that the shared memory stats collector
> might wipe out those costs, depending on how it's implemented.  (I've not
> looked at that patch in a long time either, so I don't know how much it'd
> cut the reader-side costs.  But maybe it'd be substantial.)
Thanks for your clarification, that's what I wanted to say.
Sorry for the lack of explanation.

> I think the real issue here is autovac_refresh_stats's insistence that it
> shouldn't throttle pgstats re-reads in workers.
I agree that.

> I wonder if we could have table_recheck_autovac do two probes of the stats
> data.  First probe the existing stats data, and if it shows the table to
> be already vacuumed, return immediately.  If not, *then* force a stats
> re-read, and check a second time.
Does the above mean that the second and subsequent table_recheck_autovac()
will be improved to first check using the previously refreshed statistics?
I think that certainly works.

If that's correct, I'll try to create a patch for the PoC.
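
Roughly, the two-probe flow being discussed would look something like this
(sketch only, not a patch; recheck_relation_needs_vacanalyze() is a placeholder
for the logic that already lives inside table_recheck_autovac(), while
pgstat_fetch_stat_tabentry() and autovac_refresh_stats() are the existing
functions):

    /* first probe: use the stats snapshot we already have */
    tabentry = pgstat_fetch_stat_tabentry(relid);
    if (tabentry != NULL &&
        !recheck_relation_needs_vacanalyze(relid, tabentry))
        return NULL;        /* someone else already vacuumed it */

    /* only now pay for the forced stats re-read */
    autovac_refresh_stats();
    tabentry = pgstat_fetch_stat_tabentry(relid);
    /* ... and recheck against the fresh data as today ... */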

Best regards,

-- 
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com




Re: Reloptions for table access methods

2020-09-01 Thread Jeff Davis
On Tue, 2020-09-01 at 12:20 -0400, Alvaro Herrera wrote:
> Hmm, I think that if we're going to do this, we should do it for all
> AMs, not just table AMs, since surely index AMs also want extensible
> reloptions; and I think that makes the 'validate' mechanism dead code
> and so we should remove it.

Index AMs totally control the validation, so as they add options with
add_bool_reloption, they can also add to their custom parsing table.

I'm fine removing the "validate" parameter completely for the sake of
consistency.

> I think this means that you can have tables with mistyped options,
> and
> you'd not get an error message about them.  Is that right?  Is that
> what
> we want?

No, mistyped options (e.g. "fillfactory=50") will still cause an error,
because there are two layers of validation.

The problem case would be a situation like the following:

1. Someone loads an extension that creates a new reloption
"custom_reloption" of kind RELOPT_KIND_HEAP for their table access
method "my_tableam". 

2. They then create a table and forget to specify "USING my_tableam",
but use the option "custom_reloption=123".

Ideally, that would throw an error because "custom_reloption" is only
valid for "my_tableam"; but with my patch, no error would be thrown
because the extension has already added the reloption. It would just
create a normal heap and "custom_reloption=123" would be ignored.



I went with the simple approach because fixing that problem seemed a
bit over-engineered. Here are some thoughts on what we could do:

* Consider StdRdOptions to be an extensible struct (where postgres just
looks at the StdRdOptions fields, but the extension can cast it to the
more specialized struct with extra fields). This is awkward, because
it's not a "normal" struct. Strings are represented as offsets that
point to memory past the end of the struct. We'd need an extra AM hook
that allocateReloptStruct can call into.

* We could refactor the validation logic so that extra options don't
count immediately as an error, but we instead save them in a different
array, and pass them to an AM-specific validation routine. But we have
no place to really put these options once they are parsed, because rel-
>rd_options is already holding the StdRdOptions struct, and there's no
separate place in the catalog for AM-specific reloptions. So the
extension would then need to re-parse them, and stash them in
rd_amcache or something.

* We could introduce a new AM routine that returns a custom
relopt_kind, created with add_reloption_kind(). Then, we could change
heap_reloptions to call default_reloptions like:

return default_reloptions(reloptions, validate,
RELOPT_KIND_HEAP|customReloptKind);

This still requires my patch, because we need to avoid the second
validation step. It also requires some extra code for TOAST tables,
because they can also be a custom table access method. The benefit over
my current patch is that extensions wouldn't be creating new options
for RELOPT_KIND_HEAP, they would create them only for their own custom
relopt kind, so if you try to use those options with a heap, you would
get an error.


Suggestions welcome.

Regards,
Jeff Davis






Re: Is it possible to set end-of-data marker for COPY statement.

2020-09-01 Thread Bruce Momjian
On Tue, Sep  1, 2020 at 12:05:02PM -0400, Bruce Momjian wrote:
> > copy test from '/tmp/data' DELIMITER ',';
> > 
> > An end-of-copy marker corrupt error will be raised.
> > 
> > This requires users to escape the end-of-data marker manually in their data.
> > Why we don't have a mechanism to define other characters as end-of-data 
> > marker?
> > Or there are other ways to avoid escape the end-of-data in data?
> 
> This is the first I am hearing of this.  The problem is that the system
> can't decide if \. is escaping a delimiter, or the end-of-copy marker. 
> I think we need to just disable period as a delimiter.  I don't think
> there is enough demand to allow the end-of-data marker to be
> configurable.
> 
> Interestingly, you can use period as a delimiter if you are copying from
> a file that doesn't need an end-of-data marker and you never need to
> escape the delimiter, but that seems like too rare a use case to allow
> period to be supported as a delimiter.

I am sorry I mis-read this email.  The example uses _comma_ for the
delimiter, rather than period.  Let me reply again.  We already disallow
period for delimiters:

COPY test TO '/u/postgres/tmp/x' WITH (DELIMITER '.');
ERROR:  COPY delimiter cannot be "."

COPY test TO STDOUT WITH (DELIMITER '.');
ERROR:  COPY delimiter cannot be "."

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Is it possible to set end-of-data marker for COPY statement.

2020-09-01 Thread Bruce Momjian
On Tue, Sep  1, 2020 at 06:14:45AM +, Junfeng Yang wrote:
> Hi hackers,
> 
> As described in the doc https://www.postgresql.org/docs/current/sql-copy.html,
> the TEXT format recognizes
> backslash-period (\.) as end-of-data marker.
> 
> The example below will raise an error for the line contains `\.`.
> 
> CREATE TABLE test (
> id int,
> name text,
> dep text
> )
> 
> Data in file "/tmp/data".
> 
> 122,as\.d,adad
> 133,sa dad,adadad
> 
> Then execute
> 
> copy test from '/tmp/data' DELIMITER ',';
> 
> An end-of-copy marker corrupt error will be raised.
> 
> This requires users to escape the end-of-data marker manually in their data.
> Why we don't have a mechanism to define other characters as end-of-data 
> marker?
> Or there are other ways to avoid escape the end-of-data in data?

So, you are using comma as the delimiter, but have \. (backslash-period)
as a data value.  You need to double-up backslashes in your input data,
no matter what is after the backslash.  You just happen to hit backslash
period, but other things like \N could cause problems --- literal
backslashes have to be doubled.
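
For example, to feed the literal value as\.d through the text format, the data
file itself has to carry the backslash doubled, i.e. something like:

    122,as\\.d,adad
    133,sa dad,adadad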

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Is it possible to set end-of-data marker for COPY statement.

2020-09-01 Thread Tom Lane
Bruce Momjian  writes:
> On Tue, Sep  1, 2020 at 06:14:45AM +, Junfeng Yang wrote:
>> Data in file "/tmp/data".
>> 
>> 122,as\.d,adad
>> 133,sa dad,adadad

> So, you are using comma as the delimiter, but have \. (backslash-period)
> as a data value.  You need to double-up backslashes in your input data,
> no matter what is after the backslash.  You just happen to hit backslash
> period, but other things like \N could cause problems --- literal
> backslashes have to be doubled.

As mentioned upthread, using CSV format might work better.  In CSV
you aren't necessarily stuck with backslash being treated as an
escape character.

regards, tom lane




Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

2020-09-01 Thread Alvaro Herrera
On 2020-Aug-27, Robert Haas wrote:

> On Wed, Aug 26, 2020 at 7:40 PM Alvaro Herrera  
> wrote:
> > To mark it detached means to set pg_inherits.inhdetached=true.  That
> > column name is a bit of a misnomer, since that column really means "is
> > in the process of being detached"; the pg_inherits row goes away
> > entirely once the detach process completes.  This mark guarantees that
> > everyone will see that row because the detaching session waits long
> > enough after committing the first transaction and before deleting the
> > pg_inherits row, until everyone else has moved on.
> 
> OK. Do you just wait until the XID of the transaction that set
> inhdetached is all-visible, or how do you do it?

I'm just doing WaitForLockers( ... AccessExclusiveLock ...)  on the
partitioned table at the start of the second transaction.  That will
wait until all lockers that have obtained a partition descriptor with
the old definition are gone.  Note we don't actually lock the
partitioned table with that lock level.

In the second transaction we additionally obtain AccessExclusiveLock on
the partition itself, but that's after nobody sees it as a partition
anymore.  That lock level is needed for some of the internal DDL
changes, and should not cause problems.

I thought about using WaitForOlderSnapshots() instead of waiting for
lockers, but it didn't seem to solve any actual problem.

Note that on closing the first transaction, the locks on both tables are
released.  This avoids the deadlock hazards because of the lock upgrades
that would otherwise occur.  This means that the tables could be dropped
or changed in the meantime.  The case where either relation is dropped
is handled by using try_relation_open() in the second transaction; if
either table is gone, then we can just mark the operation as completed.
This part is a bit fuzzy.  One thing that should probably be done is
have a few operations (such as other ALTER TABLE) raise an error when
run on a table with inhdetached=true, because that might get things out
of step and potentially cause other problems.  I've not done that yet.  

> So all the plans that were created before you set
> inhdetached=true have to be guaranteed to be invaliated or gone
> altogether before you can actually delete the pg_inherits row. It
> seems like it ought to be possible to ensure that, though I am not
> surely of the details exactly. Is it sufficient to wait for all
> transactions that have locked the table to go away? I'm not sure
> exactly how this stuff interacts with the plan cache.

Hmm, any cached plan should be released with relcache inval events, per
PlanCacheRelCallback().  There are some comments in plancache.h about
"unsaved" cached plans that I don't really understand :-(

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Maximum password length

2020-09-01 Thread Tom Lane
I wrote:
> Note this patch is intended to be applied over my patch at [2],
> since it modifies the test case added there.

I've now pushed that patch, so the patch in my previous mail should
directly apply to HEAD.

I'd originally been wondering whether we need to back-patch this patch.
But unless someone wants to make a case for the max password length
being more than 1024, it seems like this is mostly cleanup and could
just be done in HEAD.  At 1024, the actual behavior of pg_saslprep()
isn't changing at all, and the behavior of recv_password_packet() isn't
changing by much.  The real impact is just that the places that prompt
for a password will accept passwords up to 1K instead of 100 bytes.
Which, TBH, seems like neatnik-ism rather than fixing anything useful.
Surely nobody is going to manually enter passwords exceeding 100 bytes.
And, since simple_prompt insists on reading /dev/tty not stdin, there
is no very easy way to pass a machine-generated password through that
code path.  The practical ways to deal with a long password are either
to set it as PGPASSWORD (has always worked) or put it in .pgpass
(works as of now).

Anyway, I added this thread to the upcoming CF, in case anyone wants to
discuss it further.

regards, tom lane




Re: Disk-based hash aggregate's cost model

2020-09-01 Thread Jeff Davis
On Tue, 2020-09-01 at 11:19 +0200, Tomas Vondra wrote:
> Why? I don't think we need to change costing of in-memory HashAgg. My
> assumption was we'd only tweak startup_cost for cases with spilling
> by
> adding something like (cpu_operator_cost * npartitions * ntuples).

The code above (the in-memory case) has a clause:

  startup_cost += (cpu_operator_cost * numGroupCols) * input_tuples;

which seems to account only for the hash calculation, because it's
multiplying by the number of grouping columns.

Your calculation would also use cpu_operator_cost, but just for the
lookup. I'm OK with that, but it's a little inconsistent to only count
it for the tuples that spill to disk.

But why multiply by the number of partitions? Wouldn't it be the depth?
A wide fanout will not increase the number of lookups.

> FWIW I suspect some of this difference may be due to logical vs.
> physical I/O. iosnoop only tracks physical I/O sent to the device,
> but
> maybe we do much more logical I/O and it simply does not expire from
> page cache for the sort. It might behave differently for larger data
> set, longer query, ...

That would suggest something like a penalty for HashAgg for being a
worse IO pattern. Or do you have another suggestion?

> I don't know. I certainly understand the desire not to change things
> this late. OTOH I'm worried that we'll end up receiving a lot of poor
> plans post release.

I was reacting mostly to changing the cost of Sort. Do you think
changes to Sort are required or did I misunderstand?

Regards,
Jeff Davis






Re: Maximum password length

2020-09-01 Thread Bossart, Nathan
On 8/31/20, 5:55 PM, "Tom Lane"  wrote:
> I set the proposed limit at 1024 bytes, but given that we now know
> of use-cases needing up to 800 bytes, maybe there should be a little
> more headroom?  I don't want to make it enormous, though, seeing that
> we're allocating static buffers of that size.

For the use-case described in [0], I ended up bumping the server-side
limit in libpq/auth.c to 8192 bytes for RDS instances.  This appears
to be the PqRecvBuffer size, too.  In any case, these tokens regularly
exceed 1024 bytes, so I would definitely argue for more headroom if
possible.  Otherwise, I like the idea of unifying all the limits.

Nathan

[0] 
https://www.postgresql.org/message-id/flat/CAOhmDze1nqG2vfegpSsTFCgaiFRsqgjO6yLsbmhroz2zGmJHog%40mail.gmail.com



Re: v13: show extended stats target in \d

2020-09-01 Thread Alvaro Herrera
+1 on fixing this, since the ability to change stats target is new in
pg13.

On 2020-Aug-31, Justin Pryzby wrote:

> Maybe it should have a comma, like ", STATISTICS %s"?

It does need some separator.  Maybe a comma is sufficient, but I'm not
sure: that will fail when we add cross-relation stats, because the
FROM clause will have more relations and possibly have commas too.

Table "public.ab1"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | 
Description
+-+---+--+-+-+--+-
 a  | integer |   |  | | plain   |  |
 b  | integer |   |  | | plain   |  |
Statistics objects:
"public"."ab1_a_b_stats" (ndistinct, dependencies, mcv) ON a, b FROM ab1


> Also, now I wonder if CREATE STATISTICS should support some syntax to set the
> initial target.  Like:
> 
> |CREATE STATISTICS abstats ON a,b FROM child.abc_202006 WITH(STATISTICS 111);

Umm, if true (on which I'm not sold), maybe it should appear in the
parenthesized list that currently is just the stats type list:

 |CREATE STATISTICS abstats (STATISTICS 111) ON a,b FROM child.abc_202006;

I'm not really convinced we need this.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Group by reordering optimization

2020-09-01 Thread Tomas Vondra

On Tue, Sep 01, 2020 at 01:15:31PM +0200, Dmitry Dolgov wrote:

Hi,

Better late than never, to follow up on the original thread [1] I would like to
continue the discussion with another version of the patch for group by
reordering optimization. To remind, it's about reordering of group by clauses
to do sorting more efficiently. The patch is rebased and modified to address
(at least partially) the suggestions about making it consider new additional
paths instead of changing the original ones. It is still pretty much a
proof-of-concept version with many blind spots, but I wanted to start
kicking it and post at least something, otherwise it will never happen. An
incremental approach, so to speak.

In many ways it still contains the original code from Teodor. Changes and notes:

* Instead of changing the order directly, the patch now creates another path with
 a modified order of clauses. It does so for the normal sort as well as for
 incremental sort. The whole thing is done in two steps: first it finds a
 potentially better ordering taking into account number of groups, widths and
 comparison costs; afterwards this information is used to produce a cost
 estimation. This is implemented via a separate create_reordered_sort_path to
 not introduce too many changes, I couldn't find any better place.



I haven't tested the patch with any queries, but I agree this seems like
the right approach in general.

I'm a bit worried about how complex the code in planner.c is getting -
the incremental sort patch already made it a bit too complex, and this
is just another step in that direction.  I suppose we should refactor
add_paths_to_grouping_rel() by breaking it into smaller / more readable
pieces ...


* Function get_func_cost was removed at some point, but unfortunately this
 patch was implemented before that, so it's still present there.

* For simplicity I've removed support in create_partial_grouping_paths, since
 they were not covered by the existing tests anyway.



Hmmm, OK. I think that's something we'll need to address for the final
patch, but I agree we can add it after improving the costing etc.


* The costing part is pretty rudimentary and looks only at the first group.
 It's mostly hand crafted to pass the existing tests.



OK, understood.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Maximum password length

2020-09-01 Thread Peter Eisentraut

On 2020-09-01 02:54, Tom Lane wrote:

Therefore, I propose setting this up with a #define symbol in
pg_config_manual.h and leaving it at that.  Giving documentation in
pg_config_manual.h seems sufficient to me.  Attached is a revised
version of Nathan's patches that does it like that.

I set the proposed limit at 1024 bytes, but given that we now know
of use-cases needing up to 800 bytes, maybe there should be a little
more headroom?  I don't want to make it enormous, though, seeing that
we're allocating static buffers of that size.


ISTM that it's only going to be a matter of time before that will be 
exceeded.  Why have a limit at all?  Accept whatever StringInfo accepts.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Disk-based hash aggregate's cost model

2020-09-01 Thread Tomas Vondra

On Tue, Sep 01, 2020 at 12:58:59PM -0700, Jeff Davis wrote:

On Tue, 2020-09-01 at 11:19 +0200, Tomas Vondra wrote:

Why? I don't think we need to change costing of in-memory HashAgg. My
assumption was we'd only tweak startup_cost for cases with spilling
by
adding something like (cpu_operator_cost * npartitions * ntuples).


The code above (the in-memory case) has a clause:

 startup_cost += (cpu_operator_cost * numGroupCols) * input_tuples;

which seems to account only for the hash calculation, because it's
multiplying by the number of grouping columns.

Your calculation would also use cpu_operator_cost, but just for the
lookup. I'm OK with that, but it's a little inconsistent to only count
it for the tuples that spill to disk.

But why multiply by the number of partitions? Wouldn't it be the depth?
A wide fanout will not increase the number of lookups.



Yeah, I think you're right it should be depth, not number of partitions.

FWIW I don't know if this is enough to "fix" the costing, it's just
something I noticed while looking at the code.
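
Just to make the idea concrete, the adjustment being discussed might look
roughly like this (pure sketch; nbatches, fanout and spilled_tuples are
placeholders, not existing planner fields):

    /* charge the extra lookups once per recursion level, not per partition */
    depth = ceil(log((double) nbatches) / log((double) fanout));
    startup_cost += cpu_operator_cost * depth * spilled_tuples;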


FWIW I suspect some of this difference may be due to logical vs.
physical I/O. iosnoop only tracks physical I/O sent to the device,
but
maybe we do much more logical I/O and it simply does not expire from
page cache for the sort. It might behave differently for larger data
set, longer query, ...


That would suggest something like a penalty for HashAgg for being a
worse IO pattern. Or do you have another suggestion?



Possibly, yes. I think it'd be good to measure logical I/O (e.g. by
adding some instrumentation to LogicalTapeSet) to see if this hypothesis
is actually true.

FWIW any thoughts about the difference in temp size compared to
CP_SMALL_TLIST?


I don't know. I certainly understand the desire not to change things
this late. OTOH I'm worried that we'll end up receiving a lot of poor
plans post release.


I was reacting mostly to changing the cost of Sort. Do you think
changes to Sort are required or did I misunderstand?



Not sure I'm following. I don't think anyone proposed changing costing
for Sort. Or did I miss something?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: v13: show extended stats target in \d

2020-09-01 Thread Tomas Vondra

On Tue, Sep 01, 2020 at 05:08:25PM -0400, Alvaro Herrera wrote:

+1 on fixing this, since the ability to change stats target is new in
pg13.

On 2020-Aug-31, Justin Pryzby wrote:


Maybe it should have a comma, like ", STATISTICS %s"?


It does need some separator.  Maybe a comma is sufficient, but I'm not
sure: that will fail when we add cross-relation stats, because the
FROM clause will have more relations and possibly have commas too.

   Table "public.ab1"
Column |  Type   | Collation | Nullable | Default | Storage | Stats target | 
Description
+-+---+--+-+-+--+-
a  | integer |   |  | | plain   |  |
b  | integer |   |  | | plain   |  |
Statistics objects:
   "public"."ab1_a_b_stats" (ndistinct, dependencies, mcv) ON a, b FROM ab1



Also, now I wonder if CREATE STATISTICS should support some syntax to set the
initial target.  Like:

|CREATE STATISTICS abstats ON a,b FROM child.abc_202006 WITH(STATISTICS 111);


Umm, if true (on which I'm not sold), maybe it should appear in the
parenthesized list that currently is just the stats type list:

|CREATE STATISTICS abstats (STATISTICS 111) ON a,b FROM child.abc_202006;

I'm not really convinced we need this.



Me neither. IMHO it's somewhat similar to per-column statistics target,
where we don't allow specifying that in CREATE TABLE and you have to do

ALTER TABLE ... ALTER COLUMN ... SET STATISTICS n;

as a separate step.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Disk-based hash aggregate's cost model

2020-09-01 Thread Peter Geoghegan
On Tue, Sep 1, 2020 at 2:19 AM Tomas Vondra
 wrote:
> FWIW I suspect some of this difference may be due to logical vs.
> physical I/O. iosnoop only tracks physical I/O sent to the device, but
> maybe we do much more logical I/O and it simply does not expire from
> page cache for the sort. It might behave differently for larger data
> set, longer query, ...

There is also the fact that the LogicalTapeSetBlocks() instrumentation
is known to have problems that we still need to fix:

https://www.postgresql.org/message-id/CAH2-Wzn5PCBLUrrds=hD439LtWP+PD7ekRTd=8ldtqj+ko5...@mail.gmail.com

I'm not suggesting that this is a significant factor here. But I can't
rule it out just yet either.

> I don't know. I certainly understand the desire not to change things
> this late. OTOH I'm worried that we'll end up receiving a lot of poor
> plans post release.

I think that this needs to get fixed before release.

-- 
Peter Geoghegan




Re: Maximum password length

2020-09-01 Thread Tom Lane
Peter Eisentraut  writes:
> ISTM that it's only going to be a matter of time before that will be 
> exceeded.  Why have a limit at all?  Accept whatever StringInfo accepts.

Hmm, that would require some refactoring of simple_prompt for starters.
I agree there's no hard reason why we have to have any specific limit,
if we're willing to do that.
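
The core of it would presumably be a loop like the following (sketch only; the
real simple_prompt() also has to deal with echo suppression and the Windows
console, which is omitted here, and the function name is made up):

    #include <stdio.h>
    #include <stdlib.h>

    /* read an arbitrarily long line from a stream into malloc'd memory */
    static char *
    read_unbounded_line(FILE *termin)
    {
        size_t  len = 0;
        size_t  cap = 128;
        char   *buf = malloc(cap);
        int     c;

        if (buf == NULL)
            return NULL;

        while ((c = fgetc(termin)) != EOF && c != '\n')
        {
            if (len + 1 >= cap)
            {
                char   *newbuf = realloc(buf, cap * 2);

                if (newbuf == NULL)
                {
                    free(buf);
                    return NULL;
                }
                buf = newbuf;
                cap *= 2;
            }
            buf[len++] = (char) c;
        }
        buf[len] = '\0';
        return buf;
    }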

regards, tom lane




Re: Group by reordering optimization

2020-09-01 Thread Peter Geoghegan
On Tue, Sep 1, 2020 at 2:09 PM Tomas Vondra
 wrote:
> >* Instead of changing the order directly, now patch creates another patch 
> >with
> >  modifier order of clauses. It does so for the normal sort as well as for
> >  incremental sort. The whole thing is done in two steps: first it finds a
> >  potentially better ordering taking into account number of groups, widths 
> > and
> >  comparison costs; afterwards this information is used to produce a cost
> >  estimation. This is implemented via a separate create_reordered_sort_path 
> > to
> >  not introduce too many changes, I couldn't find any better place.
> >
>
> I haven't tested the patch with any queries, but I agree this seems like
> the right approach in general.

If we're creating a new sort path anyway, then perhaps we can also
change the collation -- it might be possible to "reduce" it to the "C"
collation without breaking queries.

This is admittedly pretty hard to do well. It could definitely work
out when we have to do a sort anyway -- a sort with high cardinality
abbreviated keys will be very fast (though we can't use abbreviated
keys with libc collations right now). OTOH, it would be quite
counterproductive if we were hoping to get an incremental sort that
used some available index that happens to use the default collation
(which is not the C collation in cases where this optimization is
expected to help).

-- 
Peter Geoghegan




Re: Problems with the FSM, heap fillfactor, and temporal locality

2020-09-01 Thread Peter Geoghegan
On Wed, Aug 26, 2020 at 1:46 AM John Naylor  wrote:
> The fact that that logic extends by 20 * numwaiters to get optimal
> performance is a red flag that resources aren't being allocated
> efficiently.

I agree that that's pretty suspicious.

> I have an idea to ignore fp_next_slot entirely if we have
> extended by multiple blocks: The backend that does the extension
> stores in the FSM root page 1) the number of blocks added and 2) the
> end-most block number. Any request for space will look for a valid
> value here first before doing the usual search. If there is then the
> block to try is based on a hash of the xid. Something like:
>
> candidate-block = prev-end-of-relation + 1 + (xid % (num-new-blocks))

I was thinking of doing something in shared memory, and not using the
FSM here at all. If we're really giving 20 pages out to each backend,
we will probably benefit from explicitly assigning contiguous ranges
of pages to each backend, and making some effort to respect that they
own the blocks in some general sense. Hopefully without also losing
access to the free space in corner cases (e.g. one of the backends
has an error shortly after receiving its contiguous range of blocks).

> To guard against collisions, then peak in the FSM at that slot and if
> it's not completely empty, then search FSM using a "look-nearby" API
> and increment a counter every time we collide. When the counter gets
> to some-value, clear the special area in the root page so that future
> backends use the usual search.

The backends already use a look nearby API, sort of --
RecordAndGetPageWithFreeSpace() already behaves that way. I'm not sure
exactly how well it works in practice, but it definitely works to some
degree.

> Also num-new-blocks above could be scaled down from the actual number
> of blocks added, just to make sure writes aren't happening all over
> the place.

Or scaled up, perhaps.

-- 
Peter Geoghegan




Re: [HACKERS] Custom compression methods

2020-09-01 Thread Mark Dilger



> On Aug 13, 2020, at 4:48 AM, Dilip Kumar  wrote:
> 
> v1-0001: As suggested by Robert, it provides the syntax support for
> setting the compression method for a column while creating a table and
> adding columns.  However, we don't support changing the compression
> method for the existing column.  As part of this patch, there is only
> one built-in compression method that can be set (pglz).  In this, we
> have one in-build am (pglz) and the compressed attributes will directly
> store the oid of the AM.  In this patch, I have removed the
> pg_attr_compresion as we don't support changing the compression
> for the existing column so we don't need to preserve the old
> compressions.

I do not like the way pglz compression is handled in this patch.  After 
upgrading PostgreSQL to the first version with this patch included, 
pre-existing on-disk compressed data will not include any custom compression 
Oid in the header, and toast_decompress_datum will notice that and decompress 
the data directly using pglz_decompress.   If the same data were then written 
back out, perhaps to another table, into a column with no custom compression 
method defined, it will get compressed by toast_compress_datum using 
DefaultCompressionOid, which is defined as PGLZ_COMPRESSION_AM_OID.  That isn't 
a proper round-trip for the data, as when it gets re-compressed, the 
PGLZ_COMPRESSION_AM_OID gets written into the header, which makes the data a 
bit longer, but also means that it is not byte-for-byte the same as it was, 
which is counter-intuitive.  Given that any given pglz compressed datum now has 
two totally different formats that might occur on disk, code may have to 
consider both of them, which increases code complexity, and regression tests 
will need to be written with coverage for both of them, which increases test 
complexity.  It's also not easy to write the extra tests, as there isn't any 
way (that I see) to intentionally write out the traditional shorter form from a 
newer database server; you'd have to do something like a pg_upgrade test where 
you install an older server to write the older format, upgrade, and then check 
that the new server can handle it.

The cleanest solution to this would seem to be removal of the compression am's 
Oid from the header for all compression ams, so that pre-patch written data and 
post-patch written data look exactly the same.  The other solution is to give 
pglz pride-of-place as the original compression mechanism, and just say that 
when pglz is the compression method, no Oid gets written to the header, and 
only when other compression methods are used does the Oid get written.  This 
second option seems closer to the implementation that you already have, because 
you already handle the decompression of data where the Oid is lacking, so all 
you have to do is intentionally not write the Oid when compressing using pglz.

Or did I misunderstand your implementation?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: logtape.c stats don't account for unused "prefetched" block numbers

2020-09-01 Thread Alvaro Herrera
On 2020-Jul-30, Peter Geoghegan wrote:

> Commit 896ddf9b added prefetching to logtape.c to avoid excessive
> fragmentation in the context of hash aggs that spill and have many
> batches/tapes. Apparently the preallocation doesn't actually perform
> any filesystem operations, so the new mechanism should be zero
> overhead when "preallocated" blocks aren't actually used after all
> (right?). However, I notice that this breaks the statistics shown by
> things like trace_sort, and even EXPLAIN ANALYZE.
> LogicalTapeSetBlocks() didn't get the memo about preallocation.

This open item hasn't received any replies.  I think Peter knows how to
fix it already, but no patch has been posted ... It'd be good to get a
move on it.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Manager for commit fest 2020-09

2020-09-01 Thread Michael Paquier
On Tue, Sep 01, 2020 at 10:43:47AM +0900, Michael Paquier wrote:
> As of the moment this message is written, 10 hours remain until we are
> the 1st of September AoE, where I'll switch the commit fest as
> officially in progress.  It will not be possible to register new
> patches to 2020-09 after that.

The commit fest is marked as officially running now.  Not seeing any
volunteers around, I'll do the CFM duties for this one.
--
Michael




Re: REINDEX SCHEMA/DATABASE/SYSTEM weak with dropped relations

2020-09-01 Thread Michael Paquier
On Tue, Sep 01, 2020 at 01:25:27PM +0300, Anastasia Lubennikova wrote:
> Yes, this version is good.

Thanks.  I have added an extra comment for the case of RELKIND_INDEX
with REINDEXOPT_MISSING_OK while at it, as it was not really obvious
why the parent relation needs to be locked (or at least attempted to be) at
this stage.  And applied the change.  Thanks for the review,
Anastasia.
--
Michael


signature.asc
Description: PGP signature


Re: logtape.c stats don't account for unused "prefetched" block numbers

2020-09-01 Thread Peter Geoghegan
On Tue, Sep 1, 2020 at 4:36 PM Alvaro Herrera  wrote:
> This open item hasn't received any replies.  I think Peter knows how to
> fix it already, but no patch has been posted ... It'd be good to get a
> move on it.

I picked this up again today.

It's not obvious what we should do. It's true that the instrumentation
doesn't accurately reflect the on-disk temp file overhead. That is, it
doesn't agree with the high watermark temp file size I see in the
pgsql_tmp directory, which is a clear regression compared to earlier
releases (where tuplesort was the only user of logtape.c). But it's
also true that we need to use somewhat more temp file space for a
tuplesort in Postgres 13, because we use the preallocation stuff for
tuplesort -- though probably without getting any benefit from it.

I haven't figured out how to correct the accounting just yet. In fact,
I'm not sure that this isn't some kind of leak of blocks from the
freelist, which shouldn't happen at all. The code is complicated
enough that I wasn't able to work that out in the couple of hours I
spent on it today. I can pick it up again tomorrow.

BTW, this MaxAllocSize freeBlocksLen check is wrong -- doesn't match
the later repalloc allocation:

if (lts->nFreeBlocks >= lts->freeBlocksLen)
{
/*
 * If the freelist becomes very large, just return and leak this free
 * block.
 */
if (lts->freeBlocksLen * 2 > MaxAllocSize)
return;

lts->freeBlocksLen *= 2;
lts->freeBlocks = (long *) repalloc(lts->freeBlocks,
lts->freeBlocksLen * sizeof(long));
}
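For reference, a guard that actually matched that repalloc would presumably
have to be expressed in bytes rather than in array elements, along these
lines (sketch only, not a tested fix):

/* Compare the doubled allocation size in bytes, as the repalloc below does. */
if (lts->freeBlocksLen * 2 * sizeof(long) > MaxAllocSize)
    return;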

-- 
Peter Geoghegan




Re: Manager for commit fest 2020-09

2020-09-01 Thread Ian Barwick

On 2020/09/01 10:43, Michael Paquier wrote:

On Mon, Aug 31, 2020 at 04:37:12PM +0900, Michael Paquier wrote:

We are going to be in September in a couple of hours, meaning that the
second commit fest for Postgres 14 will begin soon.  Do we have any
volunteers to take the role of CFM this time?


As of the moment this message is written, 10 hours remain until we are
the 1st of September AoE, where I'll switch the commit fest as
officially in progress.  It will not be possible to register new
patches to 2020-09 after that.


2020-11 is also now showing as "in progress", is that correct?


Regards

Ian Barwick


--
Ian Barwick   https://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Manager for commit fest 2020-09

2020-09-01 Thread Michael Paquier
On Wed, Sep 02, 2020 at 09:28:07AM +0900, Ian Barwick wrote:
> 2020-11 is also now showing as "in progress", is that correct?

It was not.  Thanks!
--
Michael


signature.asc
Description: PGP signature


Re: Dependencies for partitioned indexes are still a mess

2020-09-01 Thread Alvaro Herrera
On 2020-Aug-12, Alvaro Herrera wrote:

> On 2020-Jul-15, Tom Lane wrote:

> > (There seem to be some other problems as well, but most of the 54 complaints
> > are related to partitioned indexes/constraints.)
> 
> In my run of it there's a good dozen remaining problems, all alike: we
> do DROP TYPE widget CASCADE (which works) followed by DROP FUNCTION
> public.widget_out(widget), which fails complaining that type widget
> doesn't exist.  But in reality the error is innocuous, since that
> function was dropped by the DROP TYPE CASCADE anyway.  You could say
> that the same thing is happening with these noisy DROP INDEX of index
> partitions: the complaints are right in that each partition's DROP INDEX
> command doesn't actually work, but the indexes are dropped later anyway,
> so the effect is the same.

I pushed the typo fix that was in this patch.  Other than that, I think
this patch should not be pushed; ISTM it would break the logic.
(Consider that the partition with its index might exist beforehand and
be an independent table.  If we wanted --clean to work properly, it
should definitely drop that index.)

That said, I'm doubtful that it makes sense to do DROP INDEX when the
table is going to be dropped completely, even for regular tables.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Maximum password length

2020-09-01 Thread Tom Lane
I wrote:
> Peter Eisentraut  writes:
>> ISTM that it's only going to be a matter of time before that will be 
>> exceeded.  Why have a limit at all?  Accept whatever StringInfo accepts.

> Hmm, that would require some refactoring of simple_prompt for starters.

To use StringInfo, we have to move sprompt.c into src/common/ where
the stringinfo stuff lives; but that seems fine to me, because it had
little if any business being in src/port/.  Here's a draft patch
that does it that way.

This could be refined; in particular, I think that most of the
password-prompting sites could drop their separate have_password
flags in favor of checking whether the password pointer is NULL
or not.  That would likely also prove that some of the free(password)
calls I sprinkled in are unnecessary.
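In other words, each prompting call site could presumably collapse to
something like this (a sketch of the suggested follow-up, not part of the
attached patch):

/* Use the pointer itself as the "have we collected a password?" flag. */
if (password == NULL)
    password = simple_prompt("Password: ", false);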

regards, tom lane

diff --git a/contrib/oid2name/oid2name.c b/contrib/oid2name/oid2name.c
index 91b7958c48..753996585d 100644
--- a/contrib/oid2name/oid2name.c
+++ b/contrib/oid2name/oid2name.c
@@ -12,6 +12,7 @@
 #include "catalog/pg_class_d.h"
 #include "common/connect.h"
 #include "common/logging.h"
+#include "common/string.h"
 #include "getopt_long.h"
 #include "libpq-fe.h"
 #include "pg_getopt.h"
@@ -294,7 +295,7 @@ sql_conn(struct options *my_opts)
 {
 	PGconn	   *conn;
 	bool		have_password = false;
-	char		password[100];
+	char	   *password = NULL;
 	bool		new_pass;
 	PGresult   *res;
 
@@ -339,7 +340,9 @@ sql_conn(struct options *my_opts)
 			!have_password)
 		{
 			PQfinish(conn);
-			simple_prompt("Password: ", password, sizeof(password), false);
+			if (password)
+free(password);
+			password = simple_prompt("Password: ", false);
 			have_password = true;
 			new_pass = true;
 		}
diff --git a/contrib/vacuumlo/vacuumlo.c b/contrib/vacuumlo/vacuumlo.c
index e4019fafaa..e09362cc51 100644
--- a/contrib/vacuumlo/vacuumlo.c
+++ b/contrib/vacuumlo/vacuumlo.c
@@ -24,6 +24,7 @@
 #include "catalog/pg_class_d.h"
 #include "common/connect.h"
 #include "common/logging.h"
+#include "common/string.h"
 #include "getopt_long.h"
 #include "libpq-fe.h"
 #include "pg_getopt.h"
@@ -70,12 +71,14 @@ vacuumlo(const char *database, const struct _param *param)
 	bool		new_pass;
 	bool		success = true;
 	static bool have_password = false;
-	static char password[100];
+	static char *password = NULL;
 
 	/* Note: password can be carried over from a previous call */
 	if (param->pg_prompt == TRI_YES && !have_password)
 	{
-		simple_prompt("Password: ", password, sizeof(password), false);
+		if (password)
+			free(password);
+		password = simple_prompt("Password: ", false);
 		have_password = true;
 	}
 
@@ -119,7 +122,9 @@ vacuumlo(const char *database, const struct _param *param)
 			param->pg_prompt != TRI_NO)
 		{
 			PQfinish(conn);
-			simple_prompt("Password: ", password, sizeof(password), false);
+			if (password)
+free(password);
+			password = simple_prompt("Password: ", false);
 			have_password = true;
 			new_pass = true;
 		}
diff --git a/src/backend/libpq/auth.c b/src/backend/libpq/auth.c
index 02b6c3f127..36565df4fc 100644
--- a/src/backend/libpq/auth.c
+++ b/src/backend/libpq/auth.c
@@ -698,7 +698,7 @@ recv_password_packet(Port *port)
 	}
 
 	initStringInfo(&buf);
-	if (pq_getmessage(&buf, 1000))	/* receive password */
+	if (pq_getmessage(&buf, 0)) /* receive password */
 	{
 		/* EOF - pq_getmessage already logged a suitable message */
 		pfree(buf.data);
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 786672b1b6..b62f8f6a5e 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -67,6 +67,7 @@
 #include "common/file_utils.h"
 #include "common/logging.h"
 #include "common/restricted_token.h"
+#include "common/string.h"
 #include "common/username.h"
 #include "fe_utils/string_utils.h"
 #include "getaddrinfo.h"
@@ -1481,23 +1482,25 @@ setup_auth(FILE *cmdfd)
 static void
 get_su_pwd(void)
 {
-	char		pwd1[100];
-	char		pwd2[100];
+	char	   *pwd1;
 
 	if (pwprompt)
 	{
 		/*
 		 * Read password from terminal
 		 */
+		char	   *pwd2;
+
 		printf("\n");
 		fflush(stdout);
-		simple_prompt("Enter new superuser password: ", pwd1, sizeof(pwd1), false);
-		simple_prompt("Enter it again: ", pwd2, sizeof(pwd2), false);
+		pwd1 = simple_prompt("Enter new superuser password: ", false);
+		pwd2 = simple_prompt("Enter it again: ", false);
 		if (strcmp(pwd1, pwd2) != 0)
 		{
 			fprintf(stderr, _("Passwords didn't match.\n"));
 			exit(1);
 		}
+		free(pwd2);
 	}
 	else
 	{
@@ -1510,7 +1513,7 @@ get_su_pwd(void)
 		 * for now.
 		 */
 		FILE	   *pwf = fopen(pwfilename, "r");
-		int			i;
+		char		pwdbuf[8192];
 
 		if (!pwf)
 		{
@@ -1518,7 +1521,7 @@ get_su_pwd(void)
 		 pwfilename);
 			exit(1);
 		}
-		if (!fgets(pwd1, sizeof(pwd1), pwf))
+		if (!fgets(pwdbuf, sizeof(pwdbuf), pwf))
 		{
 			if (ferror(pwf))
 pg_log_error("could not read password from file \"%s\": %m",
@@ -1530,12 +1533,11 @@ get_su_pwd(void)
 		}
 		fclose(pwf);
 
-		i = strlen(pwd1);
-		while (i > 0 && (

Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-09-01 Thread Michael Paquier
On Tue, Sep 01, 2020 at 11:48:30AM -0400, Alvaro Herrera wrote:
> On 2020-Sep-01, Justin Pryzby wrote:
>> The question isn't whether to use a parenthesized option list.  I realized 
>> that
>> long ago (even though Alexey didn't initially like it).  Check 0002, which 
>> gets
>> rid of "bool concurrent" in favour of stmt->options&REINDEXOPT_CONCURRENT.
> 
> Ah!  I see, sorry for the noise.  Well, respectfully, having a separate
> boolean to store one option when you already have a bitmask for options
> is silly.

Yeah, I am all for removing "concurrent" from ReindexStmt, but I don't
think that the proposed 0002 is that, because it is based on the
assumption that we'd want more than just boolean-based options in
those statements, and this case is not justified yet except if it
becomes possible to enforce tablespaces.  At this stage, I think that
it is more sensible to just update gram.y and add a
REINDEXOPT_CONCURRENTLY.  I also think that it would also make sense
to pass down "options" within ReindexIndexCallbackState() (for example
imagine a SKIP_LOCKED for REINDEX).
--
Michael


signature.asc
Description: PGP signature


Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-09-01 Thread Alvaro Herrera
On 2020-Sep-02, Michael Paquier wrote:

> Yeah, I am all for removing "concurrent" from ReindexStmt, but I don't
> think that the proposed 0002 is that, because it is based on the
> assumption that we'd want more than just boolean-based options in
> those statements, and this case is not justified yet except if it
> becomes possible to enforce tablespaces.  At this stage, I think that
> it is more sensible to just update gram.y and add a
> REINDEXOPT_CONCURRENTLY.

Yes, agreed.  I had not seen the "params" business.

> I also think that it would also make sense to pass down "options"
> within ReindexIndexCallbackState() (for example imagine a SKIP_LOCKED
> for REINDEX).

Seems sensible, but only to be done when actually needed, right?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-09-01 Thread Kyotaro Horiguchi
Hello.

At Tue, 1 Sep 2020 13:02:28 +, "k.jami...@fujitsu.com" 
 wrote in 
> On Tuesday, August 18, 2020 3:05 PM (GMT+9), Amit Kapila wrote: 
> > Today, again thinking about this point it occurred to me that during 
> > recovery
> > we can reliably find the relation size and after Thomas's recent commit
> > c5315f4f44 (Cache smgrnblocks() results in recovery), we might not need to
> > even incur the cost of lseek. Why don't we fix this first for 'recovery' (by
> > following something on the lines of what Andres suggested) and then later
> > once we have a generic mechanism for "caching the relation size" [1], we can
> > do it for non-recovery paths.
> > I think that will at least address the reported use case with some minimal
> > changes.
> > 
> > [1] -
> > https://www.postgresql.org/message-id/CAEepm%3D3SSw-Ty1DFcK%3D1r
> > U-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com

Isn't a relation always locked access-exclusively at truncation
time?  If so, isn't even the result of lseek reliable enough? And if
we don't care about the cost of lseek, we can do the same optimization also
for non-recovery paths. Since we perform the actual file truncation
just after this anyway, I think the cost of lseek is negligible here.

> Attached is an updated V9 version with minimal code changes only and
> avoids the previous overhead in the BufferAlloc. This time, I only updated
> the recovery path as suggested by Amit, and followed Andres' suggestion
> of referring to the cached blocks in smgrnblocks.
> The layering is kinda tricky so the logic may be wrong. But as of now,
> it passes the regression tests. I'll follow up with the performance results.
> It seems there's regression for smaller shared_buffers. I'll update if I find 
> bugs.
> But I'd also appreciate your reviews in case I missed something.

BUF_DROP_THRESHOLD seems to be misused. IIUC it defines the maximum
number of file pages up to which we make a relation-targeted search for
buffers; otherwise we scan through all buffers. On the other hand, the
latest patch just leaves all buffers alone for relation forks longer than
the threshold.

I think we should determine whether to do a targeted scan or a full scan
based on the ratio of the (expected maximum) total number of pages for
all (specified) forks in a relation against the total number of buffers.
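In code form, that decision could look roughly like this (the helper names
and BUF_DROP_FULLSCAN_DIVISOR are purely illustrative, not part of the
patch):

/* Sum the sizes of the forks being dropped and only do per-block targeted
 * lookups when the total is small compared to shared_buffers. */
BlockNumber total_blocks = 0;
int         i;

for (i = 0; i < nforks; i++)
    total_blocks += smgrnblocks(reln, forkNum[i]);   /* cached in recovery */

if (total_blocks < (BlockNumber) (NBuffers / BUF_DROP_FULLSCAN_DIVISOR))
    DropRelFileNodeBuffersTargeted(reln, forkNum, nforks, firstDelBlock);
else
    DropRelFileNodeBuffersFullScan(reln, forkNum, nforks, firstDelBlock);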

By the way

> #define BUF_DROP_THRESHOLD    500    /* NBuffers divided by 2 */

NBuffers is not a constant. Even if we wanted to set the macro as
described in the comment, we should have used (NBuffers/2) instead of
"500". But I suppose you might have wanted to use (NBuffers / 500) as Tom
suggested upthread.  And the name of the macro seems too generic. I
think a more specific name like BUF_DROP_FULLSCAN_THRESHOLD would be
better.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
 




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-09-01 Thread Kyotaro Horiguchi
I'd like to make a small correction.

At Wed, 02 Sep 2020 10:31:22 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> By the way
> 
> > #define BUF_DROP_THRESHOLD  500 /* NBuffers divided by 2 */
> 
> NBuffers is not a constant. Even if we wanted to set the macro as
> described in the comment, we should have used (NBuffers/2) instead of
> "500". But I suppose you might have wanted to use (NBuffers / 500) as Tom
> suggested upthread.  And the name of the macro seems too generic. I

The suggestion was made by Andres, not Tom. Sorry for the mistake.

> think more specific names like BUF_DROP_FULLSCAN_THRESHOLD would be
> better.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: 回复:how to create index concurrently on partitioned table

2020-09-01 Thread Michael Paquier
On Fri, Aug 14, 2020 at 09:29:45AM +0900, Michael Paquier wrote:
> Once this gets done, we should then be able to get rid of the extra
> session locking taken when building the list of partitions, limiting
> session locks to only be taken during the concurrent reindex of a
> single partition (the table itself for a partition table, and the
> parent table for a partition index), making the whole operation less
> invasive.

The problem with dropped relations in REINDEX has been addressed by
1d65416, so I have gone through this patch again and simplified the
use of session locks, these being taken only when doing a REINDEX
CONCURRENTLY for a given partition.  This part is in a rather
committable shape IMO, so I would like to get it done first, before
looking more at the other cases with CIC and CLUSTER.  I am still
planning to go through it once again.
--
Michael
From 618760af9ce3229f775e1fdfb635837865aadb66 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Wed, 2 Sep 2020 10:37:46 +0900
Subject: [PATCH] Add support for partitioned tables and indexes with REINDEX

Until now, REINDEX was not able to work with partitioned tables and
indexes, forcing users to reindex partitions one by one.  This extends
REINDEX INDEX and REINDEX TABLE so that they can accept a partitioned
index or table as input, respectively, and reindex all the partitions
assigned to them.

This shares some logic with schema and database REINDEX as each
partition gets processed in its own transaction (this choice minimizes
the number of invalid indexes to one partition with REINDEX CONCURRENTLY
in case of a cancellation or failure in-flight, and the duration of each
transaction).  Isolation tests are added to emulate cases I bumped
into, particularly the concurrent drop of a leaf partition while it is
being reindexed.

Per their multi-transaction nature, these new flavors cannot run in a
transaction block.

Author: Justin Pryzby, Michael Paquier
Discussion: https://postgr.es/m/db12e897-73ff-467e-94cb-4af03705435f.adger...@alibaba-inc.com
---
 src/include/commands/defrem.h |   6 +-
 src/backend/catalog/index.c   |  25 +-
 src/backend/commands/indexcmds.c  | 242 ++
 src/backend/tcop/utility.c|   6 +-
 .../isolation/expected/reindex-partitions.out | 107 
 src/test/isolation/isolation_schedule |   1 +
 .../isolation/specs/reindex-partitions.spec   |  61 +
 src/test/regress/expected/create_index.out| 158 +++-
 src/test/regress/sql/create_index.sql |  88 ++-
 doc/src/sgml/ref/reindex.sgml |  13 +-
 .../postgres_fdw/expected/postgres_fdw.out|  21 ++
 contrib/postgres_fdw/sql/postgres_fdw.sql |  20 ++
 src/tools/pgindent/typedefs.list  |   1 +
 13 files changed, 657 insertions(+), 92 deletions(-)
 create mode 100644 src/test/isolation/expected/reindex-partitions.out
 create mode 100644 src/test/isolation/specs/reindex-partitions.spec

diff --git a/src/include/commands/defrem.h b/src/include/commands/defrem.h
index c26a102b17..22db3f660d 100644
--- a/src/include/commands/defrem.h
+++ b/src/include/commands/defrem.h
@@ -34,8 +34,10 @@ extern ObjectAddress DefineIndex(Oid relationId,
  bool check_not_in_use,
  bool skip_build,
  bool quiet);
-extern void ReindexIndex(RangeVar *indexRelation, int options, bool concurrent);
-extern Oid	ReindexTable(RangeVar *relation, int options, bool concurrent);
+extern void ReindexIndex(RangeVar *indexRelation, int options, bool concurrent,
+		 bool isTopLevel);
+extern Oid	ReindexTable(RangeVar *relation, int options, bool concurrent,
+		 bool isTopLevel);
 extern void ReindexMultipleTables(const char *objectName, ReindexObjectType objectKind,
   int options, bool concurrent);
 extern char *makeObjectName(const char *name1, const char *name2,
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 1d662e9af4..1764739ff5 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -77,6 +77,7 @@
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/pg_rusage.h"
+#include "utils/rel.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tuplesort.h"
@@ -3473,11 +3474,12 @@ reindex_index(Oid indexId, bool skip_constraint_checks, char persistence,
 	 iRel->rd_rel->relam);
 
 	/*
-	 * The case of reindexing partitioned tables and indexes is handled
-	 * differently by upper layers, so this case shouldn't arise.
+	 * Partitioned indexes should never get processed here, as they have no
+	 * physical storage.
 	 */
 	if (iRel->rd_rel->relkind == RELKIND_PARTITIONED_INDEX)
-		elog(ERROR, "unsupported relation kind for index \"%s\"",
+		elog(ERROR, "cannot reindex partitioned index \"%s.%s\"",
+			 get_namespace_name(RelationGetNamespace(iRel)),
 			 RelationGetRelationName(iRel));
 
 	/*
@@ -3694,20 +3696,13 @@ reindex_relation(Oid relid, int flags, int

Re: "cert" + clientcert=verify-ca in pg_hba.conf?

2020-09-01 Thread Kyotaro Horiguchi
Hello.

At Tue, 1 Sep 2020 11:47:34 -0400, Bruce Momjian  wrote in 
> On Tue, Sep  1, 2020 at 01:59:25PM +0900, Kyotaro Horiguchi wrote:
> > At Mon, 31 Aug 2020 11:34:29 -0400, Bruce Momjian  wrote 
> > in 
> > > On Mon, Aug 31, 2020 at 05:56:58PM +0900, Kyotaro Horiguchi wrote:
> > > > Ok, this is that. If we specify clientcert=no-verify other than for
> > > > "cert" authentication, server complains as the following at startup.
> > > 
> > > Why does clientcert=no-verify have any value, even for a
> > > cert-authentication line?
> > > 
> > > > > LOG:  no-verify or 0 is the default setting that is discouraged to 
> > > > > use explicitly for clientcert option
> > > > > HINT:  Consider removing the option instead. This option value is 
> > > > > going to be deprecated in later version.
> > > > > CONTEXT:  line 90 of configuration file 
> > > > > "/home/horiguti/data/data_noverify/pg_hba.conf"
> > > 
> > > I think it should just be removed in PG 14.  This is a configuration
> > > setting, not an SQL-level item that needs a deprecation period.
> > 
> > Ok, it is changed to just error out. I was tempted to show a suggestion
> > to remove the option in that case, like the following, but *didn't* in
> > this version of the patch.
> 
> OK, I have developed the attached patch based on yours.  I reordered the
> tests, simplified the documentation, and removed the hint since they

Looks good to me.

> will already get a good error message, and we will document this change

Oops! I thought I had removed that in the patch. Sorry for the mistake
and that also looks good to me.

> in the release notes.  It is also good you removed the 0/1 values for
> this, since that was also confusing.  We will put that removal in the
> release notes too.

Thank you for your assistance, Bruce!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-09-01 Thread Michael Paquier
On Tue, Sep 01, 2020 at 09:29:28PM -0400, Alvaro Herrera wrote:
> Seems sensible, but only to be done when actually needed, right?

Of course.
--
Michael


signature.asc
Description: PGP signature


Re: SyncRepLock acquired exclusively in default configuration

2020-09-01 Thread Fujii Masao




On 2020/08/28 21:20, Masahiko Sawada wrote:

On Fri, 28 Aug 2020 at 10:33, Fujii Masao  wrote:




On 2020/08/27 15:59, Asim Praveen wrote:



On 26-Aug-2020, at 11:10 PM, Fujii Masao  wrote:

I added the following comments based on the suggestion by Sawada-san upthread. 
Thought?

+ * Since this routine gets called every commit time, it's important to
+ * exit quickly if sync replication is not requested. So we check
+ * WalSndCtl->sync_standbys_define without the lock and exit
+ * immediately if it's false. If it's true, we check it again later
+ * while holding the lock, to avoid the race condition described
+ * in SyncRepUpdateSyncStandbysDefined().



+1.  May I suggest the following addition to the above comment (feel free to
rephrase / reject)?

"If sync_standbys_defined was being set from false to true and we observe it as
false, it is ok to skip the wait.  Replication was async and it is in the process
of being changed to sync, due to user request.  Subsequent commits will observe
the change and start waiting.”


Thanks for the suggestion! I'm not sure if it's worth adding this because
it seems like an obvious thing. But maybe you are implying that we need to comment
why the lock is not necessary when sync_standbys_defined is false. Right?
If so, what about updating the comments as follows?

+* Since this routine gets called every commit time, it's important to
+* exit quickly if sync replication is not requested. So we check
+* WalSndCtl->sync_standbys_defined flag without the lock and exit
+* immediately if it's false. If it's true, we need to check it again 
later
+* while holding the lock, to check the flag and operate the sync rep
+* queue atomically. This is necessary to avoid the race condition
+* described in SyncRepUpdateSyncStandbysDefined(). On the other
+* hand, if it's false, the lock is not necessary because we don't touch
+* the queue.





Attached is the updated version of the patch. I didn't change how to
fix the issue. But I changed the check for fast exit so that it's called
before setting the "mode", to save a few cycles.
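Roughly, that fast exit has the familiar unlocked-read-then-recheck shape
(just a sketch of the idea, not the committed code):

/* Quick exit without the lock if sync replication is not requested. */
if (!WalSndCtl->sync_standbys_defined)
    return;

/* Re-check under the lock before touching the sync rep queue. */
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
if (!WalSndCtl->sync_standbys_defined)
{
    LWLockRelease(SyncRepLock);
    return;
}
/* ... add ourselves to the queue and wait ... */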




Looks good to me.  There is a typo in the comment:

  s/sync_standbys_define/sync_standbys_defined/


Fixed. Thanks!



Both v2 and v3 look good to me too. IMO I'm okay with or without the
last sentence.


Asim and Sawada-san, thanks for the review! I pushed the patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: SyncRepLock acquired exclusively in default configuration

2020-09-01 Thread Andres Freund
On 2020-09-02 10:58:58 +0900, Fujii Masao wrote:
> Asim and Sawada-san, thanks for the review! I pushed the patch.

Thanks for all your combined work!




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-09-01 Thread Justin Pryzby
On Wed, Sep 02, 2020 at 10:00:12AM +0900, Michael Paquier wrote:
> On Tue, Sep 01, 2020 at 11:48:30AM -0400, Alvaro Herrera wrote:
> > On 2020-Sep-01, Justin Pryzby wrote:
> >> The question isn't whether to use a parenthesized option list.  I realized 
> >> that
> >> long ago (even though Alexey didn't initially like it).  Check 0002, which 
> >> gets
> >> rid of "bool concurrent" in favour of stmt->options&REINDEXOPT_CONCURRENT.
> > 
> > Ah!  I see, sorry for the noise.  Well, respectfully, having a separate
> > boolean to store one option when you already have a bitmask for options
> > is silly.
> 
> Yeah, I am all for removing "concurrent" from ReindexStmt, but I don't
> think that the proposed 0002 is that, because it is based on the
> assumption that we'd want more than just boolean-based options in
> those statements, and this case is not justified yet except if it
> becomes possible to enforce tablespaces.  At this stage, I think that
> it is more sensible to just update gram.y and add a
> REINDEXOPT_CONCURRENTLY.  I also think that it would also make sense
> to pass down "options" within ReindexIndexCallbackState() (for example
> imagine a SKIP_LOCKED for REINDEX).

Uh, this whole thread is about implementing REINDEX (TABLESPACE foo), and the
preliminary patch 0001 is to keep separate the tablespace parts of that
content.  0002 is a minor tangent which I assume would be squished into 0001
which cleans up historic cruft, using new params in favour of historic options.

I think my change is probably incomplete, and the ReindexStmt node should not
have an int options field.  parse_reindex_params() would parse it into local
int *options and char **tablespacename parameters.

-- 
Justin




Re: "cert" + clientcert=verify-ca in pg_hba.conf?

2020-09-01 Thread Bruce Momjian
On Wed, Sep  2, 2020 at 10:45:30AM +0900, Kyotaro Horiguchi wrote:
> At Tue, 1 Sep 2020 11:47:34 -0400, Bruce Momjian  wrote in 
> > OK, I have developed the attached patch based on yours.  I reordered the
> > tests, simplified the documentation, and removed the hint since they
> 
> Looks good to me.
> 
> > will already get a good error message, and we will document this change
> 
> Oops! I thought I had removed that in the patch. Sorry for the mistake
> and that also looks good to me.
> 
> > in the release notes.  It is also good you removed the 0/1 values for
> > this, since that was also confusing.  We will put that removal in the
> > release notes too.
> 
> Thank you for your assistance, Bruce!

OK, good.  Let's wait a few days and I will then apply it for PG 14.
Thanks.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: REINDEX SCHEMA/DATABASE/SYSTEM weak with dropped relations

2020-09-01 Thread Justin Pryzby
> diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
> index 47d4c07306..23840bb8e6 100644
> --- a/src/include/nodes/parsenodes.h
> +++ b/src/include/nodes/parsenodes.h
> @@ -3352,6 +3352,7 @@ typedef struct ConstraintsSetStmt
>  /* Reindex options */
>  #define REINDEXOPT_VERBOSE (1 << 0) /* print progress info */
>  #define REINDEXOPT_REPORT_PROGRESS (1 << 1) /* report pgstat progress */
> +#define REINDEXOPT_MISSING_OK (2 << 1)   /* skip missing relations */

I think you probably intended to write: 1<<2

Even though it's the same, someone is likely to be confused if they try to use
3<<1 vs 1<<3.
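(For instance, (2 << 1) and (1 << 2) are both 4, but someone extending the
list in the first style might write (3 << 1), which is 6 and sets two bits,
instead of the intended (1 << 3), which is 8.)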

I noticed this while resolving a merge conflict.

-- 
Justin




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-09-01 Thread Amit Kapila
On Wed, Sep 2, 2020 at 7:01 AM Kyotaro Horiguchi
 wrote:
>
> Hello.
>
> At Tue, 1 Sep 2020 13:02:28 +, "k.jami...@fujitsu.com" 
>  wrote in
> > On Tuesday, August 18, 2020 3:05 PM (GMT+9), Amit Kapila wrote:
> > > Today, again thinking about this point it occurred to me that during 
> > > recovery
> > > we can reliably find the relation size and after Thomas's recent commit
> > > c5315f4f44 (Cache smgrnblocks() results in recovery), we might not need to
> > > even incur the cost of lseek. Why don't we fix this first for 'recovery' 
> > > (by
> > > following something on the lines of what Andres suggested) and then later
> > > once we have a generic mechanism for "caching the relation size" [1], we 
> > > can
> > > do it for non-recovery paths.
> > > I think that will at least address the reported use case with some minimal
> > > changes.
> > >
> > > [1] -
> > > https://www.postgresql.org/message-id/CAEepm%3D3SSw-Ty1DFcK%3D1r
> > > U-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com
>
> Isn't a relation always locked access-exclusively, at truncation
> time?  If so, isn't even the result of lseek reliable enough?
>

Even if the relation is locked, background processes like the checkpointer
can still touch the relation, which might cause problems. Consider a
case where we extend the relation but didn't flush the newly added
pages. Now during truncate operation, checkpointer can still flush
those pages which can cause trouble for truncate. But, I think in the
recovery path such cases won't cause a problem.

-- 
With Regards,
Amit Kapila.




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-09-01 Thread Tom Lane
Amit Kapila  writes:
> Even if the relation is locked, background processes like checkpointer
> can still touch the relation which might cause problems. Consider a
> case where we extend the relation but didn't flush the newly added
> pages. Now during truncate operation, checkpointer can still flush
> those pages which can cause trouble for truncate. But, I think in the
> recovery path such cases won't cause a problem.

I wouldn't count on that staying true ...

https://www.postgresql.org/message-id/ca+hukgj8nrsqgkzensnrc2mfrobv-jcnacbyvtpptk2a9yy...@mail.gmail.com

regards, tom lane




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-09-01 Thread k.jami...@fujitsu.com
On Wednesday, September 2, 2020 10:31 AM, Kyotaro Horiguchi wrote:
> Hello.
> 
> At Tue, 1 Sep 2020 13:02:28 +, "k.jami...@fujitsu.com"
>  wrote in
> > On Tuesday, August 18, 2020 3:05 PM (GMT+9), Amit Kapila wrote:
> > > Today, again thinking about this point it occurred to me that during
> > > recovery we can reliably find the relation size and after Thomas's
> > > recent commit
> > > c5315f4f44 (Cache smgrnblocks() results in recovery), we might not
> > > need to even incur the cost of lseek. Why don't we fix this first
> > > for 'recovery' (by following something on the lines of what Andres
> > > suggested) and then later once we have a generic mechanism for
> > > "caching the relation size" [1], we can do it for non-recovery paths.
> > > I think that will at least address the reported use case with some
> > > minimal changes.
> > >
> > > [1] -
> > >
> https://www.postgresql.org/message-id/CAEepm%3D3SSw-Ty1DFcK%3D1r
> > > U-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com
> 
> Isn't a relation always locked access-exclusively, at truncation time?  If 
> so,
> isn't even the result of lseek reliable enough? And if we don't care the cost 
> of
> lseek, we can do the same optimization also for non-recovery paths. Since
> anyway we perform actual file-truncation just after so I think the cost of 
> lseek
> is negligible here.

The reason for that is that when I read the comment in smgrnblocks in smgr.c,
I thought that smgrnblocks can only be reliably used during recovery here
to ensure that we have the correct size.
Please correct me if my understanding is wrong, and I'll fix the patch.

 * For now, we only use cached values in recovery due to lack of a 
shared
 * invalidation mechanism for changes in file size.
 */
if (InRecovery && reln->smgr_cached_nblocks[forknum] != 
InvalidBlockNumber)
return reln->smgr_cached_nblocks[forknum]; 

> > Attached is an updated V9 version with minimal code changes only and
> > avoids the previous overhead in the BufferAlloc. This time, I only
> > updated the recovery path as suggested by Amit, and followed Andres'
> > suggestion of referring to the cached blocks in smgrnblocks.
> > The layering is kinda tricky so the logic may be wrong. But as of now,
> > it passes the regression tests. I'll follow up with the performance results.
> > It seems there's regression for smaller shared_buffers. I'll update if I 
> > find
> bugs.
> > But I'd also appreciate your reviews in case I missed something.
> 
> BUF_DROP_THRESHOLD seems to be misused. IIUC it defines the maximum
> number of file pages up to which we make a relation-targeted search for buffers.
> Otherwise we scan through all buffers. On the other hand the latest patch just
> leaves all buffers alone for relation forks longer than the threshold.

Right, I missed the condition for that part. Fixed in the latest version.
 
> I think we should determine whether to do a targeted scan or a full scan based
> on the ratio of the (expected maximum) total number of pages for all (specified)
> forks in a relation against total number of buffers.

> By the way
> 
> > #define BUF_DROP_THRESHOLD  500 /* NBuffers divided
> by 2 */
> 
> NBuffers is not a constant. Even if we wanted to set the macro as described
> in the comment, we should have used (NBuffers/2) instead of "500". But I
> suppose you might have wanted to use (NBuffers / 500) as Tom suggested
> upthread.  And the name of the macro seems too generic. I think more
> specific names like BUF_DROP_FULLSCAN_THRESHOLD would be better.

Fixed.

Thank you for the review!
Attached is the v10 of the patch.

Best regards,
Kirk Jamison


v10-Speedup-dropping-of-relation-buffers-during-recovery.patch
Description:  v10-Speedup-dropping-of-relation-buffers-during-recovery.patch


Re: Hybrid Hash/Nested Loop joins and caching results from subplans

2020-09-01 Thread David Rowley
On Sat, 29 Aug 2020 at 02:54, David Rowley  wrote:
>
> On Wed, 26 Aug 2020 at 03:52, Andres Freund  wrote:
> > There'll be a significant reduction in increase in performance.
>
> So I did a very rough-cut change to the patch to have the caching be
> part of Nested Loop.  It can be applied on top of the other 3 v7
> patches.
>
> For the performance, the test I did results in the performance
> actually being reduced from having the Result Cache as a separate
> node.  The reason for this is mostly because Nested Loop projects.

I spoke to Andres off-list this morning in regards to what can be done
to remove this performance regression over the separate Result Cache
node version of the patch. He mentioned that I could create another
ProjectionInfo for when reading from the cache's slot and use that to
project with.

I've hacked this up in the attached. It looks like another version of
the joinqual would also need to be created so that the MinimalTuple
from the cache is properly deformed. I've not done this yet.

The performance does improve this time. Using the same two test
queries from [1], I get:

v7 (Separate Result Cache node)

Query 1:
postgres=# explain (analyze, timing off) select count(l.a) from
hundredk hk inner join lookup100 l on hk.one = l.a;
  QUERY
PLAN
--
 Aggregate  (cost=152861.19..152861.20 rows=1 width=8) (actual rows=1 loops=1)
   ->  Nested Loop  (cost=0.45..127891.79 rows=9987763 width=4)
(actual rows=1000 loops=1)
 ->  Seq Scan on hundredk hk  (cost=0.00..1637.00 rows=10
width=4) (actual rows=10 loops=1)
 ->  Result Cache  (cost=0.45..2.53 rows=100 width=4) (actual
rows=100 loops=10)
   Cache Key: hk.one
   Hits: 9  Misses: 1  Evictions: 0  Overflows: 0
   ->  Index Only Scan using lookup100_a_idx on lookup100
l  (cost=0.43..2.52 rows=100 width=4) (actual rows=100 loops=1)
 Index Cond: (a = hk.one)
 Heap Fetches: 0
 Planning Time: 0.045 ms
 Execution Time: 894.003 ms
(11 rows)

Query 2:
postgres=# explain (analyze, timing off) select * from hundredk hk
inner join lookup100 l on hk.one = l.a;
   QUERY PLAN

 Nested Loop  (cost=0.45..127891.79 rows=9987763 width=28) (actual
rows=1000 loops=1)
   ->  Seq Scan on hundredk hk  (cost=0.00..1637.00 rows=10
width=24) (actual rows=10 loops=1)
   ->  Result Cache  (cost=0.45..2.53 rows=100 width=4) (actual
rows=100 loops=10)
 Cache Key: hk.one
 Hits: 9  Misses: 1  Evictions: 0  Overflows: 0
 ->  Index Only Scan using lookup100_a_idx on lookup100 l
(cost=0.43..2.52 rows=100 width=4) (actual rows=100 loops=1)
   Index Cond: (a = hk.one)
   Heap Fetches: 0
 Planning Time: 0.077 ms
 Execution Time: 854.950 ms
(10 rows)

v7 + hacks_V3 (caching done in Nested Loop)

Query 1:
explain (analyze, timing off) select count(l.a) from hundredk hk inner
join lookup100 l on hk.one = l.a;
   QUERY PLAN

 Aggregate  (cost=378570.41..378570.42 rows=1 width=8) (actual rows=1 loops=1)
   ->  Nested Loop Cached  (cost=0.43..353601.00 rows=9987763 width=4)
(actual rows=1000 loops=1)
 Cache Key: $0
 Hits: 9  Misses: 1  Evictions: 0  Overflows: 0
 ->  Seq Scan on hundredk hk  (cost=0.00..1637.00 rows=10
width=4) (actual rows=10 loops=1)
 ->  Index Only Scan using lookup100_a_idx on lookup100 l
(cost=0.43..2.52 rows=100 width=4) (actual rows=100 loops=1)
   Index Cond: (a = hk.one)
   Heap Fetches: 0
 Planning Time: 0.103 ms
 Execution Time: 770.470 ms
(10 rows)

Query 2
explain (analyze, timing off) select * from hundredk hk inner join
lookup100 l on hk.one = l.a;
QUERY PLAN
--
 Nested Loop Cached  (cost=0.43..353601.00 rows=9987763 width=28)
(actual rows=1000 loops=1)
   Cache Key: $0
   Hits: 9  Misses: 1  Evictions: 0  Overflows: 0
   ->  Seq Scan on hundredk hk  (cost=0.00..1637.00 rows=10
width=24) (actual rows=10 loops=1)
   ->  Index Only Scan using lookup100_a_idx on lookup100 l
(cost=0.43..2.52 rows=100 width=4) (actual rows=100 loops=1)
 Index Cond: (a = hk.one)
 Heap Fetches: 0
 Planning Time: 0.090 ms
 Execution Time: 779.181 ms
(9 rows)

Also, I'd just like 

Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-09-01 Thread Justin Pryzby
On Tue, Sep 01, 2020 at 09:24:10PM -0500, Justin Pryzby wrote:
> On Wed, Sep 02, 2020 at 10:00:12AM +0900, Michael Paquier wrote:
> > On Tue, Sep 01, 2020 at 11:48:30AM -0400, Alvaro Herrera wrote:
> > > On 2020-Sep-01, Justin Pryzby wrote:
> > >> The question isn't whether to use a parenthesized option list.  I 
> > >> realized that
> > >> long ago (even though Alexey didn't initially like it).  Check 0002, 
> > >> which gets
> > >> rid of "bool concurrent" in favour of 
> > >> stmt->options&REINDEXOPT_CONCURRENT.
> > > 
> > > Ah!  I see, sorry for the noise.  Well, respectfully, having a separate
> > > boolean to store one option when you already have a bitmask for options
> > > is silly.
> > 
> > Yeah, I am all for removing "concurrent" from ReindexStmt, but I don't
> > think that the proposed 0002 is that, because it is based on the
> > assumption that we'd want more than just boolean-based options in
> > those statements, and this case is not justified yet except if it
> > becomes possible to enforce tablespaces.  At this stage, I think that
> > it is more sensible to just update gram.y and add a
> > REINDEXOPT_CONCURRENTLY.  I also think that it would also make sense
> > to pass down "options" within ReindexIndexCallbackState() (for example
> > imagine a SKIP_LOCKED for REINDEX).
> 
> Uh, this whole thread is about implementing REINDEX (TABLESPACE foo), and the
> preliminary patch 0001 is to keep separate the tablespace parts of that
> content.  0002 is a minor tangent which I assume would be squished into 0001
> which cleans up historic cruft, using new params in favour of historic 
> options.
> 
> I think my change is probably incomplete, and ReindexStmt node should not have
> an int options.  parse_reindex_params() would parse it into local int *options
> and char **tablespacename params.

Done in the attached, which is also rebased on 1d6541666.

And added RelationAssumeNewRelfilenode() as I mentioned - but I'm hoping to
hear from Michael about any reason not to call RelationSetNewRelfilenode()
instead of directly calling the things it would itself call.

-- 
Justin
>From 9d881a6aa401cebef0a2ce781d34a2a5ea8ded60 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Fri, 27 Mar 2020 17:50:46 -0500
Subject: [PATCH v23 1/5] Change REINDEX/CLUSTER to accept an option list..

..like EXPLAIN (..), VACUUM (..), and ANALYZE (..).

Change docs in the style of VACUUM.  See also: 52dcfda48778d16683c64ca4372299a099a15b96
---
 doc/src/sgml/ref/cluster.sgml| 28 ++---
 doc/src/sgml/ref/reindex.sgml| 43 +---
 src/backend/commands/cluster.c   | 23 -
 src/backend/commands/indexcmds.c | 41 --
 src/backend/nodes/copyfuncs.c|  2 ++
 src/backend/nodes/equalfuncs.c   |  2 ++
 src/backend/parser/gram.y| 35 +++---
 src/backend/tcop/utility.c   | 36 +++---
 src/bin/psql/tab-complete.c  | 23 +
 src/include/commands/cluster.h   |  3 ++-
 src/include/commands/defrem.h|  7 +++---
 src/include/nodes/parsenodes.h   |  2 ++
 12 files changed, 179 insertions(+), 66 deletions(-)

diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index 4da60d8d56..110fb3083e 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -21,8 +21,13 @@ PostgreSQL documentation
 
  
 
-CLUSTER [VERBOSE] table_name [ USING index_name ]
-CLUSTER [VERBOSE]
+CLUSTER [VERBOSE] [ ( option [, ...] ) ] table_name [ USING index_name ]
+CLUSTER [VERBOSE] [ ( option [, ...] ) ]
+
+where option can be one of:
+
+VERBOSE [ boolean ]
+
 
  
 
@@ -81,6 +86,16 @@ CLUSTER [VERBOSE]
   Parameters
 
   
+   
+VERBOSE
+
+ 
+  Prints a progress report as each table is clustered.
+
+ 
+
+   
+

 table_name
 
@@ -100,10 +115,15 @@ CLUSTER [VERBOSE]

 

-VERBOSE
+boolean
 
  
-  Prints a progress report as each table is clustered.
+  Specifies whether the selected option should be turned on or off.
+  You can write TRUE, ON, or
+  1 to enable the option, and FALSE,
+  OFF, or 0 to disable it.  The
+  boolean value can also
+  be omitted, in which case TRUE is assumed.
  
 

diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index c16f223e4e..a32e192a87 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -25,7 +25,7 @@ REINDEX [ ( option [, ...] ) ] { IN
 
 where option can be one of:
 
-VERBOSE
+VERBOSE [ boolean ]
 
  
 
@@ -141,19 +141,6 @@ REINDEX [ ( option [, ...] ) ] { IN
 

 
-   
-name
-
- 
-  The name of the specific index, table, or database to be
-  reindexed.  Index and table names can be schema-qualified.
-  Presently, REINDEX DATABASE and REINDEX SYSTEM
-  can only reindex the current database, so their parameter must match
-  the current data

describe-config issue

2020-09-01 Thread vignesh C
Hi,

Postgres's --describe-config option prints reset_val for int & real
configuration parameters, which is not useful as that value is not updated.
reset_val is updated from boot_val while the server is starting up, but in
the case of postgres --describe-config this value is not updated, so printing
boot_val is more appropriate here. Attached is a patch for the same.
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
From 847406e95ed35b5ce77c407a4b1b990e42e49125 Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Wed, 2 Sep 2020 10:06:06 +0530
Subject: [PATCH] describe-config issue

describe-config prints reset_val for int & real configuration parameters which
is not useful as it is not updated. Printing boot_val is better in this case.
---
 src/backend/utils/misc/help_config.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/misc/help_config.c b/src/backend/utils/misc/help_config.c
index c0120e1..7e5338a 100644
--- a/src/backend/utils/misc/help_config.c
+++ b/src/backend/utils/misc/help_config.c
@@ -103,14 +103,14 @@ printMixedStruct(mixedStruct *structToPrint)
 
 		case PGC_INT:
 			printf("INTEGER\t%d\t%d\t%d\t",
-   structToPrint->integer.reset_val,
+   structToPrint->integer.boot_val,
    structToPrint->integer.min,
    structToPrint->integer.max);
 			break;
 
 		case PGC_REAL:
 			printf("REAL\t%g\t%g\t%g\t",
-   structToPrint->real.reset_val,
+   structToPrint->real.boot_val,
    structToPrint->real.min,
    structToPrint->real.max);
 			break;
-- 
1.8.3.1



A micro-optimisation for walkdir()

2020-09-01 Thread Thomas Munro
Hello hackers,

You don't need to call stat() just to find out if a dirent is a file
or directory, most of the time.  Please see attached.
From cc2f0fd4a078728a67d862e2deec0332fb8b3555 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 2 Sep 2020 16:15:09 +1200
Subject: [PATCH] Skip unnecessary stat() calls in walkdir().

Often the kernel has already told us whether we have a file or a
directory, so we can avoid a call to stat().
---
 src/backend/storage/file/fd.c | 57 +++
 1 file changed, 44 insertions(+), 13 deletions(-)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index f376a97ed6..99562f1ec7 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -3340,8 +3340,9 @@ walkdir(const char *path,
 	while ((de = ReadDirExtended(dir, path, elevel)) != NULL)
 	{
 		char		subpath[MAXPGPATH * 2];
-		struct stat fst;
-		int			sret;
+		bool		is_unknown = false;
+		bool		is_reg = false;
+		bool		is_dir = false;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -3351,22 +3352,52 @@ walkdir(const char *path,
 
 		snprintf(subpath, sizeof(subpath), "%s/%s", path, de->d_name);
 
-		if (process_symlinks)
-			sret = stat(subpath, &fst);
-		else
-			sret = lstat(subpath, &fst);
+		/*
+		 * We want to know if it's a file or directory.  Some systems can tell
+		 * us that already without an extra system call, but that's a BSD/GNU
+		 * extension not mandated by POSIX, and even when the interface is
+		 * present, sometimes the type is unknown, depending on the filesystem
+		 * in use or in some cases options used at filesystem creation time.
+		 */
+#if defined(DT_UNKNOWN) && defined(DT_REG) && defined(DT_DIR) && defined(DT_LNK)
+		if (de->d_type == DT_UNKNOWN ||
+			(de->d_type == DT_LNK && process_symlinks))
+			is_unknown = true;
+		else if (de->d_type == DT_REG)
+			is_reg = true;
+		else if (de->d_type == DT_DIR)
+			is_dir = true;
+#else
+		is_unknown = true;
+#endif
 
-		if (sret < 0)
+		if (is_unknown)
 		{
-			ereport(elevel,
-	(errcode_for_file_access(),
-	 errmsg("could not stat file \"%s\": %m", subpath)));
-			continue;
+			struct stat fst;
+			int			sret;
+
+
+			if (process_symlinks)
+sret = stat(subpath, &fst);
+			else
+sret = lstat(subpath, &fst);
+
+			if (sret < 0)
+			{
+ereport(elevel,
+		(errcode_for_file_access(),
+		 errmsg("could not stat file \"%s\": %m", subpath)));
+continue;
+			}
+			if (S_ISREG(fst.st_mode))
+is_reg = true;
+			else if (S_ISDIR(fst.st_mode))
+is_dir = true;
 		}
 
-		if (S_ISREG(fst.st_mode))
+		if (is_reg)
 			(*action) (subpath, false, elevel);
-		else if (S_ISDIR(fst.st_mode))
+		else if (is_dir)
 			walkdir(subpath, action, false, elevel);
 	}
 
-- 
2.20.1



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-09-01 Thread Amit Kapila
On Tue, Sep 1, 2020 at 8:33 PM Amit Kapila  wrote:
>
> I have fixed all the comments except
..
> 3. +# Change the local values of the extra columns on the subscriber,
> +# update publisher, and check that subscriber retains the expected
> +# values
> +$node_subscriber->safe_psql('postgres', "UPDATE test_tab SET c =
> 'epoch'::timestamptz + 987654321 * interval '1s'");
> +$node_publisher->safe_psql('postgres', "UPDATE test_tab SET b = 
> md5(a::text)");
> +
> +wait_for_caught_up($node_publisher, $appname);
> +
> +$result =
> +  $node_subscriber->safe_psql('postgres', "SELECT count(*),
> count(extract(epoch from c) = 987654321), count(d = 999) FROM
> test_tab");
> +is($result, qq(3334|3334|3334), 'check extra columns contain locally
> changed data');
>
> Again, how this test is relevant to streaming mode?
>

I think we can keep this test in one of the newly added test files, say
015_stream_simple.pl, to ensure that after a streaming transaction, a
non-streaming one behaves as expected. So we can change the comment to
"Change the local values of the extra columns on the subscriber,
update publisher, and check that subscriber retains the expected
values. This is to ensure that non-streaming transactions behave
properly after a streaming transaction."

We can remove this test from the other two places
016_stream_subxact.pl and 020_stream_binary.pl.

> 4. Apart from the above, I think we should think of minimizing the
> test cases which can be committed with the base patch. We can later
> add more tests.
>

We can combine the tests in 015_stream_simple.pl and
020_stream_binary.pl as I can't see a good reason to keep them
separate. Then, I think we can keep only this part with the main patch
and extract other tests into a separate patch. Basically, we can
commit the basic tests with the main patch and then keep the advanced
tests separately. I am afraid that there are some tests that don't add
much value so we can review them separately.

One minor comment: for the option 'streaming = on', the spacing should
be consistent in all the tests.

Similarly, we can combine 017_stream_ddl.pl and 021_stream_schema.pl
as both contains similar tests. As per the above suggestion, this will
be in a separate patch though.

If you agree with the above suggestions then kindly make these
adjustments and send the updated patch.

-- 
With Regards,
Amit Kapila.




builtin functions, parameter names and psql's \df

2020-09-01 Thread Andres Freund
Hi,

on a regular basis I remember a builtin function's name, or can figure it out
using \df etc, but can't remember the argument order. A typical example is
regexp_*, where I never remember whether the pattern or the input string comes
first.

Unfortunatly \df does not really help with that:

=# \df regexp_split_to_table
┌┬───┬──┬─┬──┐
│   Schema   │ Name  │ Result data type │ Argument data types │ 
Type │
├┼───┼──┼─┼──┤
│ pg_catalog │ regexp_split_to_table │ SETOF text   │ text, text  │ 
func │
│ pg_catalog │ regexp_split_to_table │ SETOF text   │ text, text, text│ 
func │
└┴───┴──┴─┴──┘

If the parameters were named, however, it'd be clear:

=# CREATE OR REPLACE FUNCTION pg_catalog.regexp_split_to_table(string text, 
pattern text)
 RETURNS SETOF text
 LANGUAGE internal
 IMMUTABLE PARALLEL SAFE STRICT
AS $function$regexp_split_to_table_no_flags$function$

=# \df regexp_split_to_table
┌┬───┬──┬──┬──┐
│   Schema   │ Name  │ Result data type │   Argument data types 
   │ Type │
├┼───┼──┼──┼──┤
│ pg_catalog │ regexp_split_to_table │ SETOF text   │ string text, pattern 
text │ func │
│ pg_catalog │ regexp_split_to_table │ SETOF text   │ text, text, text  
   │ func │
└┴───┴──┴──┴──┘

(I intentionally left the three-parameter version unchanged, to show the 
difference)


In the docs we already name the parameters using SQL-like syntax, see [1]. How
about we actually do so for at least the more common / complicated functions?
It may not be worth adding operator names for every comparator, but for
functions we expect to be used directly it seems worthwhile?

It sure would be some initial work, but it seems doable.

Comments?


A mildly related note: It's a bit annoying that the "Pattern Matching"
documentation page [2] does not appear to contain a link to the documentation
about the individual pattern matching functions [1].  Am I missing something?

Greetings,

Andres Freund

[1] 
https://www.postgresql.org/docs/current/functions-string.html#FUNCTIONS-STRING-OTHER
[2] https://www.postgresql.org/docs/current/functions-matching.html




Re: Parallel copy

2020-09-01 Thread vignesh C
On Tue, Sep 1, 2020 at 3:39 PM Greg Nancarrow  wrote:
>
> Hi Vignesh,
>
> > Can you share with me the script you used to generate the data & the ddl of
> > the table, so that it will help me check the scenario where you faced the
> > problem.
>
> Unfortunately I can't directly share it (considered company IP),
> though having said that it's only doing something that is relatively
> simple and unremarkable, so I'd expect it to be much like what you are
> currently doing. I can describe it in general.
>
> The table being used contains 100 columns (as I pointed out earlier),
> with the first column of "bigserial" type, and the others of different
> types like "character varying(255)", "numeric", "date" and "time
> without timezone". There's about 60 of the "character varying(255)"
> overall, with the other types interspersed.
>
> When testing with indexes, 4 b-tree indexes were used that each
> included the first column and then distinctly 9 other columns.
>
> A CSV record (row) template file was created with test data
> (corresponding to the table), and that was simply copied and appended
> over and over with a record prefix in order to create the test data
> file.
> The following shell-script basically does it (but very slowly). I was
> using a small C program to do similar, a lot faster.
> In my case, N=255 produced about a 5GB CSV file.
>
> file_out=data.csv; for i in {1..N}; do echo -n "$i," >> $file_out;
> cat sample_record.csv >> $file_out; done
>
> One other thing I should mention is that between each test run, I
> cleared the OS page cache, as described here:
> https://linuxhint.com/clear_cache_linux/
> That way, each COPY FROM is not taking advantage of any OS-cached data
> from a previous COPY FROM.

I will try with a similar test and check if I can reproduce.

> If your data is somehow significantly different and you want to (and
> can) share your script, then I can try it in my environment.

I have attached the scripts that I used for the test results I
mentioned in my previous mail. The create.sql file has the table that I
used, and insert_data_gen.txt has the insert data generation script. I
varied the count in insert_data_gen to generate csv files of 1GB, 2GB
& 5GB, and varied the data to generate 1 char, 10 char & 100 char for
each column for the various tests. You can rename insert_data_gen.txt to
insert_data_gen.sh & generate the csv file.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
#!/bin/bash
for i in {1..10}
do
   echo 
"1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890,1234567890123456789012345678901234567890123456789012345678901234567890123456789

Re: REINDEX SCHEMA/DATABASE/SYSTEM weak with dropped relations

2020-09-01 Thread Michael Paquier
On Tue, Sep 01, 2020 at 09:41:48PM -0500, Justin Pryzby wrote:
> I think you probably intended to write: 1<<2

Thanks, fixed.
--
Michael


signature.asc
Description: PGP signature


Re: More tests with USING INDEX replident and dropped indexes

2020-09-01 Thread Michael Paquier
On Wed, Sep 02, 2020 at 09:27:33AM +0530, Rahila Syed wrote:
> TBH, I am not sure.  I think it is a reasonable change. It is even
> indicated in the
> comment above index_set_state_flags() that it can be made transactional.
> At the same time, probably we can just go ahead with current
> inconsistent update of relisreplident and indisvalid flags. Can't see what
> will break with that.

Yeah, that's true as well.  Still, I would like to see first if people
are fine with changing this code path to be transactional.  This way,
we will have zero history in the tree where there was a risk of an
inconsistent window.
--
Michael


signature.asc
Description: PGP signature