Re: [PATCH] Missing links between system catalog documentation pages
Hello Dagfinn,

The attached patch applies cleanly, doc generation is ok. I'm ok with
adding such links systematically.

>>> makes the first mention of another system catalog or view (as well
>>> as pg_hba.conf in pg_hba_file_lines) a link, for easier navigation.
>>
>> Why only the first mention? It seems unlikely that I would ever read
>> such a chapter linearly, and even so that I would want to jump
>> especially on the first occurrence but not on others, so ISTM that it
>> should be done for all mentions?
>
> It's the first mention in the introductory paragraph of _each_ catalog
> table/view page, not the first mention in the entire catalogs.sgml
> file. E.g. https://www.postgresql.org/docs/current/catalog-pg-aggregate.html
> has two mentions of pg_proc one word apart:
>
>   Each entry in pg_aggregate is an extension of an entry in pg_proc.
>   The pg_proc entry carries the aggregate's name, …
>
> I didn't think there was much point in linkifying both in that case,
> and other similar situations.

The point is that the user reads a sentence, attempts to jump but
sometimes can't, because it is not the first occurrence. I'd go for all
mentions of another relation being links.

Also, ISTM you missed some, maybe you could consider adding them? eg
pg_database in the very first paragraph of the file, pg_attrdef in the
pg_attribute description, quite a few in pg_class…

-- 
Fabien.
Re: suggest to rename enable_incrementalsort
On Sun, Jun 21, 2020 at 8:26 AM Peter Eisentraut wrote: > > I suggest to rename enable_incrementalsort to enable_incremental_sort. > This is obviously more readable and also how we have named recently > added multiword planner parameters. > > See attached patch. +1, this is a way better name (and patch LGTM on REL_13_STABLE).
tag typos in "catalog.sgml"
Hello, While reviewing a documentation patch, I noticed that a few tags where wrong in "catalog.sgml". Attached patch fixes them. -- Fabien.diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 700271fd40..5a66115df1 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -897,7 +897,7 @@ An entry's amopmethod must match the - opfmethod of its containing operator family (including + opfmethod of its containing operator family (including amopmethod here is an intentional denormalization of the catalog structure for performance reasons). Also, amoplefttype and amoprighttype must match @@ -5064,10 +5064,10 @@ SCRAM-SHA-256$:&l An operator class's opcmethod must match the - opfmethod of its containing operator family. + opfmethod of its containing operator family. Also, there must be no more than one pg_opclass - row having opcdefault true for any given combination of - opcmethod and opcintype. + row having opcdefault true for any given combination of + opcmethod and opcintype.
Missing "Up" navigation link between parts and doc root?
Hello devs,

I've been annoyed that the documentation navigation does not always have
an "Up" link. It has them inside parts, but the link disappears when you
are on the root page of a part, and you have to go for the "Home" link
which is far on the right. Is there a good reason not to have the "Up"
link there as well?

-- 
Fabien.
Raising stop and warn limits
In brief, I'm proposing to raise xidWrapLimit-xidStopLimit to 3M and
xidWrapLimit-xidWarnLimit to 40M. Likewise for mxact counterparts.

PostgreSQL has three "stop limit" values beyond which only single-user
mode will assign new values of a certain counter:

- xidStopLimit protects pg_xact, pg_commit_ts, pg_subtrans, and
  pg_serial. SetTransactionIdLimit() withholds a million XIDs, and
  warnings start ten million before that.
- multiStopLimit protects pg_multixact/offsets. SetMultiXactIdLimit()
  withholds 100 mxacts, and warnings start at ten million.
- offsetStopLimit protects pg_multixact/members. SetOffsetVacuumLimit()
  withholds [1,2) SLRU segments thereof (50k-100k member XIDs). No
  warning phase for this one.

Reasons to like a larger stop limit:

1. VACUUM, to advance a limit, may assign IDs subject to one of the
limits. VACUUM formerly consumed XIDs, not mxacts. It now consumes
mxacts, not XIDs. I think a DBA can suppress VACUUM's mxact consumption
by stopping all transactions older than vacuum_freeze_min_age, including
prepared transactions.

2. We currently have edge-case bugs when assigning values in the last
few dozen pages before the wrap limit
(https://postgr.es/m/20190214072623.ga1139...@rfd.leadboat.com and
https://postgr.es/m/20200525070033.ga1591...@rfd.leadboat.com). A higher
stop limit could make this class of bug unreachable outside of
single-user mode. That's valuable against undiscovered bugs of this
class.

3. Any bug in stop limit enforcement is less likely to have user impact.
For a live example, see the XXX comment that
https://postgr.es/m/attachment/111084/slru-truncate-modulo-v3.patch adds
to CheckPointPredicate().

Raising a stop limit prompts an examination of warn limits, which
represent the time separating the initial torrent of warnings from the
service outage. The current limits appeared in 2005; transaction rates
have grown, while human reaction times have not. I like warnings
starting when an SLRU is 98% consumed (40M XIDs or mxacts remaining).
That doesn't change things enough to make folks reconfigure VACUUM, and
it buys back some of the grace period DBAs had in 2005. I considered
95-97%, but the max_val of autovacuum_freeze_max_age would then start
the warnings before the autovacuum. While that wouldn't rule out a value
lower than 98%, 98% felt fine anyhow.

For the new stop limits, I propose allowing 99.85% SLRU fullness (stop
with 3M XIDs or mxacts remaining). If stopping this early will bother
users, an alternative is 3M for XIDs and 0.2M for others. Either way
leaves at least two completely-idle segments for each SLRU, which I
expect to mitigate present and future edge-case bugs.

Changing this does threaten clusters that experience pg_upgrade when
close to a limit. pg_upgrade can fail or, worse, yield a cluster that
spews warnings shortly after the upgrade. I could implement
countermeasures, but they would take effect only when one upgrades a
cluster having a 98%-full SLRU. I propose not to change pg_upgrade; some
sites may find cause to do a whole-cluster VACUUM before pg_upgrade. Do
you agree or disagree with that choice?

I am attaching a patch (not for commit) that demonstrates the pg_upgrade
behavior that nearly-full-SLRU clusters would see.
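For a sense of scale, the headroom these limits leave can be watched from
SQL. A rough sketch (approximate: it treats the wraparound point as 2^31
transactions past the oldest frozen XID and ignores the reserved special
XIDs):

    SELECT datname,
           age(datfrozenxid) AS xid_age,
           2147483648 - age(datfrozenxid) AS approx_xids_until_wraparound
    FROM pg_database
    ORDER BY xid_age DESC;

Today the stop limit is reached about 1M XIDs before the wrap point, with
warnings starting 10M before that; the proposal moves those distances to
3M and 40M from the wrap point.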
Thanks, nm diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c index f3da40a..c43ebbf 100644 --- a/src/backend/access/transam/clog.c +++ b/src/backend/access/transam/clog.c @@ -795,6 +795,10 @@ TrimCLOG(void) int slotno; char *byteptr; + /* hack for pg_resetwal moving us to the middle of a page */ + if (!SimpleLruDoesPhysicalPageExist(XactCtl, pageno)) + SimpleLruZeroPage(XactCtl, pageno); + slotno = SimpleLruReadPage(XactCtl, pageno, false, xid); byteptr = XactCtl->shared->page_buffer[slotno] + byteno; diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c index ce84dac..ed47ce0 100644 --- a/src/backend/access/transam/multixact.c +++ b/src/backend/access/transam/multixact.c @@ -2042,6 +2042,10 @@ TrimMultiXact(void) int slotno; MultiXactOffset *offptr; + /* hack for pg_resetwal moving us to the middle of a page */ + if (!SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno)) + SimpleLruZeroPage(MultiXactOffsetCtl, pageno); + slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact); offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno]; offptr += entryno; diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c index 2334418..454ff8b 100644 --- a/src/bin/pg_resetwal/pg_resetwal.c +++ b/src/bin/pg_resetwal/pg_resetwal.c @@ -434,6
Re: pg_regress cleans up tablespace twice.
On Sun, Jun 21, 2020 at 12:08:37PM +1200, Thomas Munro wrote:
> I'm not sure what needs to change, but in the meantime I told it to
> comment out the offending test from the schedule files:
>
> +before_test:
> + - 'perl -p -i.bak -e "s/^test: tablespace/#test: tablespace/"
> src/test/regress/serial_schedule'
> + - 'perl -p -i.bak -e "s/^test: tablespace/#test: tablespace/"
> src/test/regress/parallel_schedule'
>
> Now the results are slowly turning green again.

Thanks, and sorry for the trouble. What actually happened back in 2018?
I can see c2ff3c68 in the git history of the cfbot code, but it does not
give much detail.

-- 
Michael
Re: new heapcheck contrib module
On Sat, Jun 13, 2020 at 2:36 AM Mark Dilger wrote: > > > > > On Jun 11, 2020, at 11:35 PM, Dilip Kumar wrote: > > > > On Fri, Jun 12, 2020 at 12:40 AM Mark Dilger > > wrote: > >> > >> > >> > >>> On Jun 11, 2020, at 9:14 AM, Dilip Kumar wrote: > >>> > >>> I have just browsed through the patch and the idea is quite > >>> interesting. I think we can expand it to check that whether the flags > >>> set in the infomask are sane or not w.r.t other flags and xid status. > >>> Some examples are > >>> > >>> - If HEAP_XMAX_LOCK_ONLY is set in infomask then HEAP_KEYS_UPDATED > >>> should not be set in new_infomask2. > >>> - If HEAP_XMIN(XMAX)_COMMITTED is set in the infomask then can we > >>> actually cross verify the transaction status from the CLOG and check > >>> whether is matching the hint bit or not. > >>> > >>> While browsing through the code I could not find that we are doing > >>> this kind of check, ignore if we are already checking this. > >> > >> Thanks for taking a look! > >> > >> Having both of those bits set simultaneously appears to fall into a > >> different category than what I wrote verify_heapam.c to detect. > > > > Ok > > > > > >> It doesn't violate any assertion in the backend, nor does it cause > >> the code to crash. (At least, I don't immediately see how it does > >> either of those things.) At first glance it appears invalid to have > >> those bits both set simultaneously, but I'm hesitant to enforce that > >> without good reason. If it is a good thing to enforce, should we also > >> change the backend code to Assert? > > > > Yeah, it may not hit assert or crash but it could lead to a wrong > > result. But I agree that it could be an assertion in the backend > > code. > > For v7, I've added an assertion for this. Per heap/README.tuplock, "We > currently never set the HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit > is set." I added an assertion for that, too. Both new assertions are in > RelationPutHeapTuple(). I'm not sure if that is the best place to put the > assertion, but I am confident that the assertion needs to only check tuples > destined for disk, as in memory tuples can and do violate the assertion. > > Also for v7, I've updated contrib/amcheck to report these two conditions as > corruption. > > > What about the other check, like hint bit is saying the > > transaction is committed but actually as per the clog the status is > > something else. I think in general processing it is hard to check > > such things in backend no? because if the hint bit is set saying that > > the transaction is committed then we will directly check its > > visibility with the snapshot. I think a corruption checker may be a > > good tool for catching such anomalies. > > I already made some design changes to this patch to avoid taking the > CLogTruncationLock too often. I'm happy to incorporate this idea, but > perhaps you could provide a design on how to do it without all the extra > locking? If not, I can try to get this into v8 as an optional check, so > users can turn it on at their discretion. Having the check enabled by > default is probably a non-starter. Okay, even I can't think a way to do it without an extra locking. I have looked into 0001 patch and I have a few comments. 1. + + /* Skip over unused/dead/redirected line pointers */ + if (!ItemIdIsUsed(ctx.itemid) || + ItemIdIsDead(ctx.itemid) || + ItemIdIsRedirected(ctx.itemid)) + continue; Isn't it a good idea to verify the Redirected Itemtid? 
Because we will still access the redirected item id to find the actual tuple from the index scan. Maybe not exactly at this level, but we can verify that the link itemid store in that is within the itemid range of the page or not. 2. + /* Check for tuple header corruption */ + if (ctx->tuphdr->t_hoff < SizeofHeapTupleHeader) + { + confess(ctx, + psprintf("t_hoff < SizeofHeapTupleHeader (%u < %u)", + ctx->tuphdr->t_hoff, + (unsigned) SizeofHeapTupleHeader)); + fatal = true; + } I think we can also check that if there is no NULL attributes (if (!(t_infomask & HEAP_HASNULL)) then ctx->tuphdr->t_hoff should be equal to SizeofHeapTupleHeader. 3. + ctx->offset = 0; + for (ctx->attnum = 0; ctx->attnum < ctx->natts; ctx->attnum++) + { + if (!check_tuple_attribute(ctx)) + break; + } + ctx->offset = -1; + ctx->attnum = -1; So we are first setting ctx->offset to 0, then inside check_tuple_attribute, we will keep updating the offset as we process the attributes and after the loop is over we set ctx->offset to -1, I did not understand that why we need to reset it to -1, do we ever check for that. We don't even initialize the ctx->offset to -1 while initializing the context for the tuple so I do not understand what is the meaning of the random value -1. 4. + if (!VARATT_IS_EXTENDED(chunk)) + { + chunksize = VARSIZE(chunk) - VARHDRSZ; + chunkdata = VARDATA(chunk); + } + else if (VARATT_IS_SHORT(chunk)) + { + /* + * could happen due to heap_form_tuple doing its thing + */
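A convenient way to eyeball the header fields being discussed here
(t_infomask, t_infomask2, t_hoff) on real pages is the pageinspect
extension; a sketch, with 'some_table' as a placeholder name:

    CREATE EXTENSION IF NOT EXISTS pageinspect;

    -- Tuples on block 0 whose infomask bits combine HEAP_XMAX_LOCK_ONLY
    -- (0x0080 in t_infomask) with HEAP_KEYS_UPDATED (0x2000 in
    -- t_infomask2), the combination the new assertion/corruption check
    -- is about.
    SELECT lp, t_hoff, t_infomask, t_infomask2
    FROM heap_page_items(get_raw_page('some_table', 0))
    WHERE (t_infomask & 128) <> 0
      AND (t_infomask2 & 8192) <> 0;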
Re: tag typos in "catalog.sgml"
On Sun, Jun 21, 2020 at 09:10:35AM +0200, Fabien COELHO wrote:
> While reviewing a documentation patch, I noticed that a few tags where
> wrong in "catalog.sgml". Attached patch fixes them.

Good catches, thanks Fabien. I will fix that tomorrow or so.

-- 
Michael
Re: pg_regress cleans up tablespace twice.
On Sun, Jun 21, 2020 at 8:42 PM Michael Paquier wrote: > On Sun, Jun 21, 2020 at 12:08:37PM +1200, Thomas Munro wrote: > > I'm not sure what needs to change, but in the meantime I told it to > > comment out the offending test from the schedule files: > > > > +before_test: > > + - 'perl -p -i.bak -e "s/^test: tablespace/#test: tablespace/" > > src/test/regress/serial_schedule' > > + - 'perl -p -i.bak -e "s/^test: tablespace/#test: tablespace/" > > src/test/regress/parallel_schedule' > > > > Now the results are slowly turning green again. > > Thanks, and sorry for the trouble. What actually happened back in > 2018? I can see c2ff3c68 in the git history of the cfbot code, but it > does not give much details. commit ce5d3424d6411f7a7228fd4463242cb382af3e0c Author: Andrew Dunstan Date: Sat Oct 20 09:02:36 2018 -0400 Lower privilege level of programs calling regression_main On Windows this mean that the regression tests can now safely and successfully run as Administrator, which is useful in situations like Appveyor. Elsewhere it's a no-op. Backpatch to 9.5 - this is harder in earlier branches and not worth the trouble. Discussion: https://postgr.es/m/650b0c29-9578-8571-b1d2-550d7f89f...@2ndquadrant.com
Re: suggest to rename enable_incrementalsort
On Sun, Jun 21, 2020 at 09:05:32AM +0200, Julien Rouhaud wrote:
> On Sun, Jun 21, 2020 at 8:26 AM Peter Eisentraut wrote:
>> I suggest to rename enable_incrementalsort to enable_incremental_sort.
>> This is obviously more readable and also how we have named recently
>> added multiword planner parameters.
>>
>> See attached patch.
>
> +1, this is a way better name (and patch LGTM on REL_13_STABLE).

The reason why I kept the single-word variant is consistency with other
GUCs that affect planning, like enable_indexscan, enable_hashjoin and
many others. That being said, I'm not particularly attached to this
choice, so if you think this is better I'm OK with it.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [PATCH] Initial progress reporting for COPY command
po 15. 6. 2020 v 2:18 odesílatel Michael Paquier napsal: > Hi Josef, > > On Sun, Jun 14, 2020 at 02:32:33PM +0200, Josef Šimánek wrote: > > Hello, as proposed by Pavel Stěhule and discussed on local czech > PostgreSQL > > maillist ( > > > https://groups.google.com/d/msgid/postgresql-cz/CAFj8pRCZ42CBCa1bPHr7htffSV%2BNAcgcHHG0dVqOog4bsu2LFw%40mail.gmail.com?utm_medium=email&utm_source=footer > ), > > I have prepared an initial patch for COPY command progress reporting. > > Sounds like a good idea to me. > Great. I will continue working on this. > > I have not found any tests for progress reporting. Are there any? It > would > > need two backends running (one running COPY, one checking output of > report > > view). Is there any similar test I can inspire at? In theory, it should > be > > possible to use dblink_send_query to run async COPY command in the > > background. > > We don't have any tests in core. I think that making deterministic > test cases is rather tricky here as long as we don't have a more > advanced testing framework that allows is to lock certain code paths > and keep around an expected state until a second session comes around > and looks at the progress catalog (even that would need adding more > code to core to mark the extra point looked at). So I think that it is > fine to not focus on that for this feature. The important parts are > the choice of the progress points and the data sent to MyProc, and > both should be chosen wisely. > Thanks for the info. I'm focusing exactly at looking for right spots to report the progress. I'll attach new patch with better places and supporting more options of reporting (including STDIN, STDOUT) soon and also I'll try to add it to commitfest. > > > My initial (attached) patch also doesn't introduce documentation for this > > system view. I can add that later once this patch is finalized (if that > > happens). > > You may want to add it to the next commit fest: > https://commitfest.postgresql.org/28/ > Documentation is necessary, and having some would ease reviews. > -- > Michael >
Re: [PATCH] Initial progress reporting for COPY command
po 15. 6. 2020 v 6:39 odesílatel Fujii Masao napsal: > > > On 2020/06/14 21:32, Josef Šimánek wrote: > > Hello, as proposed by Pavel Stěhule and discussed on local czech > PostgreSQL maillist ( > https://groups.google.com/d/msgid/postgresql-cz/CAFj8pRCZ42CBCa1bPHr7htffSV%2BNAcgcHHG0dVqOog4bsu2LFw%40mail.gmail.com?utm_medium=email&utm_source=footer), > I have prepared an initial patch for COPY command progress reporting. > > Sounds nice! > > > > file - bool - is file is used? > > program - bool - is program used? > > Are these fields really necessary in a progress view? > What values are reported when STDOUT/STDIN is specified in COPY command? > For STDOUT and STDIN file is true and program is false. > > file_bytes_processed - amount of bytes processed when file is used > (otherwise 0), works for both direction ( > > FROM/TO) when file is used (file = t) > > What value is reported when STDOUT/STDIN is specified in COPY command? For my first patch nothing was reported on STDOUT/STDIN usage. I'll attach new patch soon supporting those as well. > > > Regards, > > > -- > Fujii Masao > Advanced Computing Technology Center > Research and Development Headquarters > NTT DATA CORPORATION >
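For anyone wanting to watch the progress while testing, the check from a
second session would look something like the sketch below. The view name
pg_stat_progress_copy and the file/program/file_bytes_processed columns
are taken from the proposed patch (the pid column is assumed by analogy
with the existing pg_stat_progress_* views), so the details may change
before commit:

    SELECT pid, file, program, file_bytes_processed
    FROM pg_stat_progress_copy;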
Re: [PATCH] Initial progress reporting for COPY command
po 15. 6. 2020 v 2:18 odesílatel Michael Paquier napsal: > Hi Josef, > > On Sun, Jun 14, 2020 at 02:32:33PM +0200, Josef Šimánek wrote: > > Hello, as proposed by Pavel Stěhule and discussed on local czech > PostgreSQL > > maillist ( > > > https://groups.google.com/d/msgid/postgresql-cz/CAFj8pRCZ42CBCa1bPHr7htffSV%2BNAcgcHHG0dVqOog4bsu2LFw%40mail.gmail.com?utm_medium=email&utm_source=footer > ), > > I have prepared an initial patch for COPY command progress reporting. > > Sounds like a good idea to me. > > > I have not found any tests for progress reporting. Are there any? It > would > > need two backends running (one running COPY, one checking output of > report > > view). Is there any similar test I can inspire at? In theory, it should > be > > possible to use dblink_send_query to run async COPY command in the > > background. > > We don't have any tests in core. I think that making deterministic > test cases is rather tricky here as long as we don't have a more > advanced testing framework that allows is to lock certain code paths > and keep around an expected state until a second session comes around > and looks at the progress catalog (even that would need adding more > code to core to mark the extra point looked at). So I think that it is > fine to not focus on that for this feature. The important parts are > the choice of the progress points and the data sent to MyProc, and > both should be chosen wisely. > > > My initial (attached) patch also doesn't introduce documentation for this > > system view. I can add that later once this patch is finalized (if that > > happens). > > You may want to add it to the next commit fest: > https://commitfest.postgresql.org/28/ > Documentation is necessary, and having some would ease reviews. > I have added documentation, more code comments and I'll upload patch to commit fest. > -- > Michael >
Re: [PATCH] Initial progress reporting for COPY command
po 15. 6. 2020 v 7:34 odesílatel Bharath Rupireddy < bharath.rupireddyforpostg...@gmail.com> napsal: > > I'm using ftell to get current position in file to populate > file_bytes_processed without error handling (ftell can return -1L and also > populate errno on problems). > > > > 1. Is that a good way to get progress of file processing? > > IMO, it's better to handle the error cases. One possible case where > ftell can return -1 and set errno is when the total bytes processed is > more than LONG_MAX. > > Will your patch handle file_bytes_processed reporting for COPY FROM > STDIN cases? For this case, ftell can't be used. > > Instead of using ftell and worrying about the errors, a simple > approach could be to have a uint64 variable in CopyStateData to track > the number of bytes read whenever CopyGetData is called. This approach > can also handle the case of COPY FROM STDIN. > Thanks for suggestion. I used this approach and latest patch supports both STDIN and STDOUT now. > With Regards, > Bharath Rupireddy. > EnterpriseDB: http://www.enterprisedb.com >
Re: Possible NULL pointer deferenced (src/interfaces/libpq/fe-exec.c (line 563)
Em dom., 21 de jun. de 2020 às 02:16, Tom Lane escreveu: > Ranier Vilela writes: > > The res->curBlock pointer possibly, can be NULL here (line 563). > > No, it can't. > > To get to that line, nBytes has to be > 0, which means res->spaceLeft > has to be > 0, which cannot happen while res->curBlock is NULL. > Hi Tom, thanks for answer. regards, Ranier Vilela
Re: [PATCH] Missing links between system catalog documentation pages
Hi Fabien, Fabien COELHO writes: >> It's the first mention in the introductory paragraph of _each_ catalog >> table/view page, not the first mention in the entire catalogs.sgml file. >> E.g. https://www.postgresql.org/docs/current/catalog-pg-aggregate.html >> has two mentions of pg_proc one word apart: >> >> Each entry in pg_aggregate is an extension of an entry in pg_proc. The >> pg_proc entry carries the aggregate's name, … >> >> I didn't think there was much point in linkifying both in that case, and >> other similar situations. > > The point is that the user reads a sentence, attempts to jump but > sometimes can't, because the is not the first occurrence. I'd go for all > mentions of another relation should be link. Okay, I'll make them all links, except the pg_aggregate aggfnoid column, which I've changed from "pg_proc OID of the aggregate function" to just "OID of the agregate function", since pg_proc is linked immediately prior in the "references" section, and we generally don't mention the catalog table again in similar cases elsehwere. > Alse, ISTM you missed some, maybe you could consider adding them? eg > pg_database in the very first paragraph of the file, pg_attrdef in > pg_attribute description, quite a few in pg_class… Yes, I only looked at the intro paragraphs of the per-catalog pages, not the overview section nor the text after the column tables. I've gone through them all now and linked them. Updated patch attached. - ilmari -- "A disappointingly low fraction of the human race is, at any given time, on fire." - Stig Sandbeck Mathisen >From b2940b93f5ed17c0e739d6ed0c9411a44bd87428 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= Date: Sun, 24 May 2020 21:48:29 +0100 Subject: [PATCH v2 1/2] Add missing cross-links in system catalog documentation This makes all mentions of other system catalogs and views in the system system catalog and view documentation pages hyperlinks, for easier navigation. Also linkify the first mention of pg_hba.conf in pg_hba_file_rules, as that's more specific and easier to spot than the link to the client authentication chapter at the bottom. --- doc/src/sgml/catalogs.sgml | 142 - 1 file changed, 78 insertions(+), 64 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 700271fd40..051fa7d361 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -16,7 +16,8 @@ Normally, one should not change the system catalogs by hand, there are normally SQL commands to do that. (For example, CREATE DATABASE inserts a row into the - pg_database catalog — and actually + pg_database + catalog — and actually creates the database on disk.) There are some exceptions for particularly esoteric operations, but many of those have been made available as SQL commands over time, and so the need for direct manipulation @@ -381,9 +382,10 @@ sum, count, and max. Each entry in pg_aggregate is an extension of an entry - in pg_proc. The pg_proc - entry carries the aggregate's name, input and output data types, and - other information that is similar to ordinary functions. + in pg_proc. + The pg_proc entry carries the aggregate's name, + input and output data types, and other information that is similar to + ordinary functions. @@ -407,7 +409,7 @@ (references pg_proc.oid) - pg_proc OID of the aggregate function + OID of the aggregate function @@ -902,7 +904,7 @@ catalog structure for performance reasons). 
Also, amoplefttype and amoprighttype must match the oprleft and oprright fields of the - referenced pg_operator entry. + referenced pg_operator entry. @@ -1099,7 +1101,8 @@ table columns. There will be exactly one pg_attribute row for every column in every table in the database. (There will also be attribute entries for - indexes, and indeed all objects that have pg_class + indexes, and indeed all objects that have + pg_class entries.) @@ -1270,7 +1273,7 @@ This column has a default expression or generation expression, in which case there will be a corresponding entry in the - pg_attrdef catalog that actually defines the + pg_attrdef catalog that actually defines the expression. (Check attgenerated to determine whether this is a default or a generation expression.) @@ -1402,9 +1405,9 @@ In a dropped column's pg_attribute entry, atttypid is reset to zero, but attlen and the other fields copied from - pg_type are still valid. This arrangement is needed + pg_type are still valid. This arrangement is needed to cope with the situation where the dropped column's data type was - later dropped, and so there is no pg_type row anymore. + later dropped, and so there is no pg_type row anymore. attlen and
Re: [PATCH] Missing links between system catalog documentation pages
Fabien COELHO writes:
>> I didn't think there was much point in linkifying both in that case, and
>> other similar situations.

> The point is that the user reads a sentence, attempts to jump but
> sometimes can't, because the is not the first occurrence. I'd go for all
> mentions of another relation should be link.

That has not been our practice up to now, eg in comparable cases in
discussions of GUC variables, only the first reference is xref-ified.
I think it could be kind of annoying to make every reference a link,
both for regular readers (the link decoration is too bold in most
browsers) and for users of screen-reader software.

There is a fair question as to how far apart two references should
be before we link both of them. But I think that distance
does need to be more than zero, and probably more than one para.

regards, tom lane
vs for command line tools in the docs
Hi Hackers,

While I was looking at linkifying SQL commands in the system catalog
docs, I noticed that catalog.sgml uses the tag to refer to initdb, while
I'd expected it to use . Looking for patterns, I grepped for the two
tags with contents consisting only of lower-case letters, numbers,
hyphens and underscores, to restrict it to things that look like shell
command names, not SQL commands or full command lines.

When referring to binaries shipped with postgres itself, is by far the
most common, except for initdb, postgres, ecpg, pg_ctl, pg_resetwal, and
pg_config. For external programs it's more of a mix, but overall there
are more than twice as many instances of of this form than of .

I'm not proposing going through all 1500+ instances in one fell swoop,
but some consistency (and a policy going forward) would be nice.

- ilmari
-- 
"A disappointingly low fraction of the human race is, at any given time,
on fire." - Stig Sandbeck Mathisen
Re: [PATCH] Missing links between system catalog documentation pages
On 2020-Jun-21, Tom Lane wrote:

> That has not been our practice up to now, eg in comparable cases in
> discussions of GUC variables, only the first reference is xref-ified.
> I think it could be kind of annoying to make every reference a link,
> both for regular readers (the link decoration is too bold in most
> browsers) and for users of screen-reader software.

In the glossary I also changed things so that only the first reference
of a term in another term's definition is linked; my experience reading
the originals as submitted (which did link them all at some point) is
that the extra links are very distracting, bad for readability. So +1
for not adding links to every single mention.

> There is a fair question as to how far apart two references should
> be before we link both of them. But I think that distance
> does need to be more than zero, and probably more than one para.

Nod.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: vs for command line tools in the docs
ilm...@ilmari.org (Dagfinn Ilmari Mannsåker) writes:
> While I was looking at linkifying SQL commands in the system catalog
> docs, I noticed that catalog.sgml uses the tag to refer to
> initdb, while I'd expected it to use .

I agree that the latter is what we generally use.

regards, tom lane
vs formatting in the docs
Hi Hackers,

While looking at making more SQL commands into links, I noticed that
<xref> loses the monospace formatting of <command>, and can't itself be
wrapped in <command>. This becomes particularly apparent when you have
one link that can be an <xref> next to another that's a <link> ...
because it's actually referring to a specific variant of the command.

By some trial and error I found that putting <command> inside the
<refentrytitle> tag propagates the formatting to the <xref> contents.
We already do this with <application> for (most of) the client and
server applications. Is this something we want to do consistently for
both?

- ilmari
-- 
"The surreality of the universe tends towards a maximum" -- Skud's Law
"Never formulate a law or axiom that you're not prepared to live with
the consequences of." -- Skud's Meta-Law
Re: vs formatting in the docs
On 2020-Jun-21, Dagfinn Ilmari Mannsåker wrote: > While looking at making more SQL into links, I > noticed that loses the monospace formatting of , and > can't itself be wrapped in . Ouch. > By some trial and error I found that putting inside the > tag propagates the formatting to the contents. > We already do this with for (most of) the client and > server applications. Is this something we want to do consistently for > both? Looking at the ones that use , it looks like manpages are not damaged, so +1. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: vs for command line tools in the docs
On 06/21/20 11:16, Tom Lane wrote:
> ilm...@ilmari.org (Dagfinn Ilmari Mannsåker) writes:
>> docs, I noticed that catalog.sgml uses the tag to refer to
>> initdb, while I'd expected it to use .
> I agree that the latter is what we generally use.

'The latter' is in the Subject:, in the body

Regards,
-Chap
Re: vs formatting in the docs
Alvaro Herrera writes: > On 2020-Jun-21, Dagfinn Ilmari Mannsåker wrote: > >> While looking at making more SQL into links, I >> noticed that loses the monospace formatting of , and >> can't itself be wrapped in . > > Ouch. > >> By some trial and error I found that putting inside the >> tag propagates the formatting to the contents. >> We already do this with for (most of) the client and >> server applications. Is this something we want to do consistently for >> both? > > Looking at the ones that use , it looks like manpages are > not damaged, so +1. Attached are two patches: the first adds the missing tags, the second adds to all the SQL commands (specifically anything with 7). I'll add it to the commitfest. - ilmari -- - Twitter seems more influential [than blogs] in the 'gets reported in the mainstream press' sense at least. - Matt McLeod - That'd be because the content of a tweet is easier to condense down to a mainstream media article. - Calle Dybedahl >From 739e80f8e25b193af4fba54bceb1a7b37c47a793 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= Date: Sun, 21 Jun 2020 16:26:45 +0100 Subject: [PATCH 1/2] Add missing tags in application doc s Most of them already have this, but some were missing --- doc/src/sgml/ref/initdb.sgml | 2 +- doc/src/sgml/ref/pg_basebackup.sgml | 2 +- doc/src/sgml/ref/pg_config-ref.sgml | 2 +- doc/src/sgml/ref/pg_dump.sgml | 2 +- doc/src/sgml/ref/pg_receivewal.sgml | 2 +- doc/src/sgml/ref/pg_restore.sgml | 2 +- doc/src/sgml/ref/pg_verifybackup.sgml | 2 +- 7 files changed, 7 insertions(+), 7 deletions(-) diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml index 1635fcb1fd..b6a55ce105 100644 --- a/doc/src/sgml/ref/initdb.sgml +++ b/doc/src/sgml/ref/initdb.sgml @@ -9,7 +9,7 @@ - initdb + initdb 1 Application diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml index db480be674..ce1653faf7 100644 --- a/doc/src/sgml/ref/pg_basebackup.sgml +++ b/doc/src/sgml/ref/pg_basebackup.sgml @@ -9,7 +9,7 @@ - pg_basebackup + pg_basebackup 1 Application diff --git a/doc/src/sgml/ref/pg_config-ref.sgml b/doc/src/sgml/ref/pg_config-ref.sgml index 6c797746cc..e177769188 100644 --- a/doc/src/sgml/ref/pg_config-ref.sgml +++ b/doc/src/sgml/ref/pg_config-ref.sgml @@ -9,7 +9,7 @@ - pg_config + pg_config 1 Application diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml index 2f0807e912..8cd5875f67 100644 --- a/doc/src/sgml/ref/pg_dump.sgml +++ b/doc/src/sgml/ref/pg_dump.sgml @@ -9,7 +9,7 @@ - pg_dump + pg_dump 1 Application diff --git a/doc/src/sgml/ref/pg_receivewal.sgml b/doc/src/sgml/ref/pg_receivewal.sgml index cad4689ae6..865ec84262 100644 --- a/doc/src/sgml/ref/pg_receivewal.sgml +++ b/doc/src/sgml/ref/pg_receivewal.sgml @@ -9,7 +9,7 @@ - pg_receivewal + pg_receivewal 1 Application diff --git a/doc/src/sgml/ref/pg_restore.sgml b/doc/src/sgml/ref/pg_restore.sgml index 232f88024f..6cb06d4910 100644 --- a/doc/src/sgml/ref/pg_restore.sgml +++ b/doc/src/sgml/ref/pg_restore.sgml @@ -9,7 +9,7 @@ - pg_restore + pg_restore 1 Application diff --git a/doc/src/sgml/ref/pg_verifybackup.sgml b/doc/src/sgml/ref/pg_verifybackup.sgml index 44f4e67d57..527af5ad28 100644 --- a/doc/src/sgml/ref/pg_verifybackup.sgml +++ b/doc/src/sgml/ref/pg_verifybackup.sgml @@ -9,7 +9,7 @@ - pg_verifybackup + pg_verifybackup 1 Application -- 2.26.2 >From e8cd316e87e467cf70b1989c7c68d3110dbe5323 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= Date: Sun, 21 Jun 2020 16:52:47 +0100 
Subject: [PATCH 2/2] Add to SQL command tags This means that links to them will get the proper monospace formatting. Manpage formatting is not affected. --- doc/src/sgml/ref/abort.sgml | 2 +- doc/src/sgml/ref/alter_aggregate.sgml | 2 +- doc/src/sgml/ref/alter_collation.sgml | 2 +- doc/src/sgml/ref/alter_conversion.sgml| 2 +- doc/src/sgml/ref/alter_database.sgml | 2 +- doc/src/sgml/ref/alter_default_privileges.sgml| 2 +- doc/src/sgml/ref/alter_domain.sgml| 2 +- doc/src/sgml/ref/alter_event_trigger.sgml | 2 +- doc/src/sgml/ref/alter_extension.sgml | 2 +- doc/src/sgml/ref/alter_foreign_data_wrapper.sgml | 2 +- doc/src/sgml/ref/alter_foreign_table.sgml | 2 +- doc/src/sgml/ref/alter_function.sgml | 2 +- doc/src/sgml/ref/alter_group.sgml | 2 +- doc/src/sgml/ref/alter_index.sgml | 2 +- doc/src/sgml/ref/alter_language.sgml | 2 +- doc/src/sgml/ref/alter_large_object.sgml | 2 +- doc/src/sgml/ref/alter_materialized_view.sgml | 2 +- doc/src/sgml/ref/alter_opclass.sgml | 2 +- doc/src/sgml
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
Hi, On 6/5/20 11:51 AM, Alvaro Herrera wrote: > On 2020-Jun-05, Dave Cramer wrote: > >> On Thu, 4 Jun 2020 at 19:46, Alvaro Herrera >> wrote: > >>> Ouch ... so they made IDENT in the replication grammar be a trigger to >>> enter the regular grammar. Crazy. No way to put those worms back in >>> the tin now, I guess. >> >> Is that documented ? > > I don't think it is. > >>> It is still my opinion that we should prohibit a logical replication >>> connection from being used to do physical replication. Horiguchi-san, >>> Sawada-san and Masao-san are all of the same opinion. Dave Cramer (of >>> the JDBC team) is not opposed to the change -- he says they're just >>> using it because they didn't realize they should be doing differently. >> >> I think my exact words were >> >> "I don't see this is a valid reason to keep doing something. If it is >> broken then fix it. >> Clients can deal with the change." >> >> in response to: >> >>> Well, I don't really think that we can just break a behavior that >>> exists since 9.4 as you could break applications relying on the >>> existing behavior, and that's also the point of Vladimir upthread. >> >> Which is different than not being opposed to the change. I don't see this >> as broken, and it's quite possible that some of our users are using >> it. > > Apologies for misinterpreting. > >> It certainly needs to be documented > > I'd rather not. The PG13 RMT had a discussion about this thread, and while the initial crash has been fixed, we decided to re-open the Open Item around whether we should allow physical replication to be initiated in a logical replication session. We anticipate a resolution for PG13, whether it is explicitly disallowing physical replication from occurring on a logical replication slot, maintaining the status quo, or something else such that there is consensus on the approach. Thanks, Jonathan signature.asc Description: OpenPGP digital signature
Re: [PATCH] Missing links between system catalog documentation pages
Alvaro Herrera writes: > On 2020-Jun-21, Tom Lane wrote: > >> That has not been our practice up to now, eg in comparable cases in >> discussions of GUC variables, only the first reference is xref-ified. >> I think it could be kind of annoying to make every reference a link, >> both for regular readers (the link decoration is too bold in most >> browsers) and for users of screen-reader software. > > In the glossary I also changed things so that only the first reference > of a term in another term's definition is linked; my experience reading > the originals as submitted (which did link them all at some point) is > that the extra links are very distracting, bad for readability. So +1 > for not adding links to every single mention. There were only three cases of multiple mentions of the same table in a single paragraph, I've removed them in the attached patch. I've also added a second patch that makes the SQL commands links. There were some cases of the same commands being mentioned in the descriptions of multiple columns in the same table, but I've left those in place, since that feels less disruptive than in prose. >> There is a fair question as to how far apart two references should >> be before we both of them. But I think that distance >> does need to be more than zero, and probably more than one para. > > Nod. I tend to agree. - ilmari -- "A disappointingly low fraction of the human race is, at any given time, on fire." - Stig Sandbeck Mathisen >From 165c4aa4613d572a0547a2f2d01dcf06dfa93a24 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= Date: Sun, 24 May 2020 21:48:29 +0100 Subject: [PATCH v3 1/2] Add missing cross-links in system catalog documentation This makes the first mention of a system catalog or view in each paragraph in the system system catalog and view documentation pages hyperlinks, for easier navigation. Also linkify the first mention of pg_hba.conf in pg_hba_file_rules, as that's more specific and easier to spot than the link to the client authentication chapter at the bottom. --- doc/src/sgml/catalogs.sgml | 140 - 1 file changed, 77 insertions(+), 63 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 700271fd40..47d78d2ff8 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -16,7 +16,8 @@ Normally, one should not change the system catalogs by hand, there are normally SQL commands to do that. (For example, CREATE DATABASE inserts a row into the - pg_database catalog — and actually + pg_database + catalog — and actually creates the database on disk.) There are some exceptions for particularly esoteric operations, but many of those have been made available as SQL commands over time, and so the need for direct manipulation @@ -381,9 +382,10 @@ sum, count, and max. Each entry in pg_aggregate is an extension of an entry - in pg_proc. The pg_proc - entry carries the aggregate's name, input and output data types, and - other information that is similar to ordinary functions. + in pg_proc. + The pg_proc entry carries the aggregate's name, + input and output data types, and other information that is similar to + ordinary functions. @@ -407,7 +409,7 @@ (references pg_proc.oid) - pg_proc OID of the aggregate function + OID of the aggregate function @@ -902,7 +904,7 @@ catalog structure for performance reasons). Also, amoplefttype and amoprighttype must match the oprleft and oprright fields of the - referenced pg_operator entry. + referenced pg_operator entry. 
@@ -1099,7 +1101,8 @@ table columns. There will be exactly one pg_attribute row for every column in every table in the database. (There will also be attribute entries for - indexes, and indeed all objects that have pg_class + indexes, and indeed all objects that have + pg_class entries.) @@ -1270,7 +1273,7 @@ This column has a default expression or generation expression, in which case there will be a corresponding entry in the - pg_attrdef catalog that actually defines the + pg_attrdef catalog that actually defines the expression. (Check attgenerated to determine whether this is a default or a generation expression.) @@ -1402,7 +1405,7 @@ In a dropped column's pg_attribute entry, atttypid is reset to zero, but attlen and the other fields copied from - pg_type are still valid. This arrangement is needed + pg_type are still valid. This arrangement is needed to cope with the situation where the dropped column's data type was later dropped, and so there is no pg_type row anymore. attlen and the other fields can be used @@ -1835,14 +1838,16 @@ The catalog pg_class catalogs tables and most - everything else that has columns or is
Re: [PATCH] Add support for choosing huge page size
> Documentation syntax error "2MB" shows up as: Ops, sorry, should be fixed now. > The build is currently failing on Windows: Ahh, thanks. Looks like the Windows stuff isn't autogenerated, so maybe this new patch works.. > When using huge_pages=on, huge_page_size=1GB, but default shared_buffers, I noticed that the error message reports the wrong (unrounded) size in this message: Ahh, yes, that is correct. Switched to printing the _real_ allocsize now! > 1GB pages are so big that it becomes a little tricky to set shared buffers large enough without wasting RAM. What I mean is, if I want to use shared_buffers=16GB, I need to have at least 17 huge pages available, but the 17th page is nearly entirely wasted! Imagine that on POWER 16GB pages. That makes me wonder if we should actually redefine these GUCs differently so that you state the total, or at least use the rounded memory for buffers... I think we could consider that to be a separate problem with a separate patch though. Yes, that is a good point! But as you say, I guess that fits better in another patch. > Just for fun, I compared 4KB, 2MB and 1GB pages for a hash join of a 3.5GB table against itself. [...] Thanks for the results! Will look into your patch when I get time, but it certainly looks cool! I have a 4-node numa machine with ~100GiB of memory and a single node numa machine, so i'll take some benchmarks when I get time! > I wondered if this was something to do > with NUMA effects on this two node box, so I tried running that again > with postgres under numactl --cpunodebind 0 --membind 0 and I got: [...] Yes, making this "properly" numa aware to avoid/limit cross-numa memory access is kinda tricky. When reserving huge pages they are distributed more or less evenly between the nodes, and they can be found by using `grep -R "" /sys/devices/system/node/node*/hugepages/hugepages-*/nr_hugepages` (can also be written to), so there _may_ be a chance that the huge pages you got was on another node than 0 (due to the fact that there not were enough), but that is just guessing. tor. 18. jun. 2020 kl. 06:01 skrev Thomas Munro : > > Hi Odin, > > Documentation syntax error "2MB" shows up as: > > config.sgml:1605: parser error : Opening and ending tag mismatch: > literal line 1602 and para > > ^ > > Please install the documentation tools > https://www.postgresql.org/docs/devel/docguide-toolsets.html, rerun > configure and "make docs" to see these kinds of errors. > > The build is currently failing on Windows: > > undefined symbol: HAVE_DECL_MAP_HUGE_MASK at src/include/pg_config.h > line 143 at src/tools/msvc/Mkvcbuild.pm line 851. > > I think that's telling us that you need to add this stuff into > src/tools/msvc/Solution.pm, so that we can say it doesn't have it. I > don't have Windows but whenever you post a new version we'll see if > Windows likes it here: > > http://cfbot.cputube.org/odin-ugedal.html > > When using huge_pages=on, huge_page_size=1GB, but default > shared_buffers, I noticed that the error message reports the wrong > (unrounded) size in this message: > > 2020-06-18 02:06:30.407 UTC [73552] HINT: This error usually means > that PostgreSQL's request for a shared memory segment exceeded > available memory, swap space, or huge pages. To reduce the request > size (currently 149069824 bytes), reduce PostgreSQL's shared memory > usage, perhaps by reducing shared_buffers or max_connections. 
> > The request size was actually: > > mmap(NULL, 1073741824, PROT_READ|PROT_WRITE, > MAP_SHARED|MAP_ANONYMOUS|MAP_HUGETLB|30< ENOMEM (Cannot allocate memory) > > 1GB pages are so big that it becomes a little tricky to set shared > buffers large enough without wasting RAM. What I mean is, if I want > to use shared_buffers=16GB, I need to have at least 17 huge pages > available, but the 17th page is nearly entirely wasted! Imagine that > on POWER 16GB pages. That makes me wonder if we should actually > redefine these GUCs differently so that you state the total, or at > least use the rounded memory for buffers... I think we could consider > that to be a separate problem with a separate patch though. > > Just for fun, I compared 4KB, 2MB and 1GB pages for a hash join of a > 3.5GB table against itself. Hash joins are the perfect way to > exercise the TLB because they're very likely to miss. I also applied > my patch[1] to allow parallel queries to use shared memory from the > main shared memory area, so that they benefit from the configured page > size, using pages that are allocated once at start up. (Without that, > you'd have to mess around with /dev/shm mount options, and then hope > that pages were available at query time, and it'd also be slower for > other stupid implementation reasons). > > # echo never > /sys/kernel/mm/transparent_hugepage/enabled > # echo 8500 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages > # echo 17 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages > > shared_buffers=
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
Hi, On 2020-06-21 13:45:36 -0400, Jonathan S. Katz wrote: > The PG13 RMT had a discussion about this thread, and while the initial > crash has been fixed, we decided to re-open the Open Item around whether > we should allow physical replication to be initiated in a logical > replication session. Since this is a long-time issue, this doesn't quite seem like an issue for the RMT? > We anticipate a resolution for PG13, whether it is explicitly > disallowing physical replication from occurring on a logical replication > slot, maintaining the status quo, or something else such that there is > consensus on the approach. s/logical replication slot/logical replication connection/? I still maintain that adding restrictions here is a bad idea. Even disregarding the discussion of running normal queries interspersed, it's useful to be able to both request WAL and receive logical changes over the same connection. E.g. for creating a logical replica by first doing a physical base backup (vastly faster), or fetching WAL for decoding large transactions onto a standby. And I just don't see any reasons to disallow it. There's basically no reduction in complexity by doing so. Greetings, Andres Freund
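To make the scenario concrete: a walsender connection opened with
replication=database accepts both the replication grammar and plain SQL,
so a single psql session (connected with e.g.
psql "dbname=postgres replication=database") can mix commands like the
following sketch — IDENTIFY_SYSTEM from the replication grammar, an
ordinary SELECT handled by the regular grammar, and the physical
streaming request this thread is about (0/0 is just an arbitrary start
position):

    IDENTIFY_SYSTEM;
    SELECT pg_current_wal_lsn();
    START_REPLICATION 0/0;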
Re: [PATCH] Add support for choosing huge page size
Hi, On 2020-06-18 16:00:49 +1200, Thomas Munro wrote: > Unfortunately I can't access the TLB miss counters on this system due > to virtualisation restrictions, and the systems where I can don't have > 1GB pages. According to cpuid(1) this system has a fairly typical > setup: > >cache and TLB information (2): > 0x63: data TLB: 2M/4M pages, 4-way, 32 entries > data TLB: 1G pages, 4-way, 4 entries > 0x03: data TLB: 4K pages, 4-way, 64 entries Hm. Doesn't that system have a second level of TLB (STLB) with more 1GB entries? I think there's some errata around what intel exposes via cpuid around this :( Guessing that this is a skylake server chip? https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)#Memory_Hierarchy > [...] Additionally there is a unified L2 TLB (STLB) > [...] STLB > [...] 1 GiB page translations: > [...] 16 entries; 4-way set associative > This operation is touching about 8GB of data (scanning 3.5GB of table, > building a 4.5GB hash table) so 4 x 1GB is not enough do this without > TLB misses. I assume this uses 7 workers? > Let's try that again, except this time with shared_buffers=4GB, > dynamic_shared_memory_main_size=4GB, and only half as many tuples in > t, so it ought to fit: > > 4KB pages: 6.37 seconds > 2MB pages: 4.96 seconds > 1GB pages: 5.07 seconds > > Well that's disappointing. Hm, I don't actually know the answer to this: If this actually uses multiple workers, won't the fact that each has an independent page table (despite having overlapping contents) lead to there being fewer actually available 1GB entries available? Obviously depends on how processes are scheduled (iirc hyperthreading shares dTLBs). Might be worth looking at whether there are cpu migrations or testing with a single worker. > I wondered if this was something to do > with NUMA effects on this two node box, so I tried running that again > with postgres under numactl --cpunodebind 0 --membind 0 and I got: > 4KB pages: 5.43 seconds > 2MB pages: 4.05 seconds > 1GB pages: 4.00 seconds > > From this I can't really conclude that it's terribly useful to use > larger page sizes, but it's certainly useful to have the ability to do > further testing using the proposed GUC. Due to the low number of 1GB entries they're quite likely to be problematic imo. Especially when there's more concurrent misses than there are page table entries. I'm somewhat doubtful that it's useful to use 1GB entries for all of our shared memory when that's bigger than the maximum covered size. I suspect that it'd better to use 1GB entries for some and smaller entries for the rest of the memory. Greetings, Andres Freund
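For anyone reproducing these runs, the relevant settings can be
double-checked from SQL once the server is up; note that huge_page_size
here is the GUC proposed by this patch, not an existing setting:

    SELECT name, setting, unit
    FROM pg_settings
    WHERE name IN ('huge_pages', 'huge_page_size', 'shared_buffers');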
Re: [PATCH] Missing links between system catalog documentation pages
Hello Tom,

>>> I didn't think there was much point in linkifying both in that case,
>>> and other similar situations.
>>
>> The point is that the user reads a sentence, attempts to jump but
>> sometimes can't, because the is not the first occurrence. I'd go for
>> all mentions of another relation should be link.
>
> That has not been our practice up to now, eg in comparable cases in
> discussions of GUC variables, only the first reference is xref-ified.
> I think it could be kind of annoying to make every reference a link,
> both for regular readers (the link decoration is too bold in most
> browsers)

Hmmm. That looks like an underlying CSS issue, not that links are
intrinsically bad. I find it annoying that the same thing appears
differently from one line to the next. It seems I'm the only one who
likes things to be uniform, though.

> and for users of screen-reader software.

I do not know about those, and what constraints it puts on markup.

-- 
Fabien.
Re: Parallel Seq Scan vs kernel read ahead
On Sat, 20 Jun 2020 at 08:00, Robert Haas wrote:
>
> On Thu, Jun 18, 2020 at 10:10 PM David Rowley wrote:
> > Here's a patch which caps the maximum chunk size to 131072. If
> > someone doubles the page size then that'll be 2GB instead of 1GB. I'm
> > not personally worried about that.
>
> Maybe use RELSEG_SIZE?

I was hoping to keep the guarantees that the chunk size is always a
power of 2. If, for example, someone configured PostgreSQL
--with-segsize=3, then RELSEG_SIZE would be 393216 with the standard
BLCKSZ. Not having it a power of 2 does mean the ramp-down is more
uneven when the sizes become very small:

postgres=# select 393216>>x from generate_Series(0,18)x;
 ?column?
----------
   393216
   196608
    98304
    49152
    24576
    12288
     6144
     3072
     1536
      768
      384
      192
       96
       48
       24
       12
        6
        3
        1
(19 rows)

Perhaps that's not a problem though, but then again, perhaps just
keeping it at 131072 regardless of RELSEG_SIZE and BLCKSZ is also ok.
The benchmarks I did on Windows [1] showed that the returns diminished
once we started making the step size some decent amount so my thoughts
are that I've set PARALLEL_SEQSCAN_MAX_CHUNK_SIZE to something large
enough that it'll make no difference to the performance anyway. So
there's probably not much point in giving it too much thought. Perhaps
pg_nextpower2_32(RELSEG_SIZE) would be okay though.

David

[1] https://www.postgresql.org/message-id/caaphdvoppka+q5y_k_6cuv4u6dphmz771veumuzls3d3mwy...@mail.gmail.com
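For comparison, a power-of-2 cap such as the proposed 131072 halves
evenly at every step of the ramp-down:

postgres=# select 131072>>x from generate_series(0,17)x;
 ?column?
----------
   131072
    65536
    32768
    16384
     8192
     4096
     2048
     1024
      512
      256
      128
       64
       32
       16
        8
        4
        2
        1
(18 rows)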
Re: suggest to rename enable_incrementalsort
On Sun, 21 Jun 2020 at 23:22, Tomas Vondra wrote:
>
> On Sun, Jun 21, 2020 at 09:05:32AM +0200, Julien Rouhaud wrote:
> > On Sun, Jun 21, 2020 at 8:26 AM Peter Eisentraut wrote:
> >>
> >> I suggest to rename enable_incrementalsort to enable_incremental_sort.
> >> This is obviously more readable and also how we have named recently
> >> added multiword planner parameters.
> >>
> >> See attached patch.
> >
> > +1, this is a way better name (and patch LGTM on REL_13_STABLE).
>
> The reason why I kept the single-word variant is consistency with other
> GUCs that affect planning, like enable_indexscan, enable_hashjoin and
> many others.

Looking at the other enable_* GUCs, for all the ones that aim to disable
a certain executor node type, with the exception of enable_hashagg and
enable_bitmapscan, they're all pretty consistent in naming the GUC after
the executor node's .c file:

enable_bitmapscan               nodeBitmapHeapscan.c
enable_gathermerge              nodeGatherMerge.c
enable_hashagg                  nodeAgg.c
enable_hashjoin                 nodeHashjoin.c
enable_incrementalsort          nodeIncrementalSort.c
enable_indexonlyscan            nodeIndexonlyscan.c
enable_indexscan                nodeIndexscan.c
enable_material                 nodeMaterial.c
enable_mergejoin                nodeMergejoin.c
enable_nestloop                 nodeNestloop.c
enable_parallel_append          nodeAppend.c
enable_parallel_hash            nodeHash.c
enable_partition_pruning
enable_partitionwise_aggregate
enable_partitionwise_join
enable_seqscan                  nodeSeqscan.c
enable_sort                     nodeSort.c
enable_tidscan                  nodeTidscan.c

enable_partition_pruning, enable_partitionwise_aggregate,
enable_partitionwise_join are the odd ones out here as they're not
really related to a specific node type.

Going by that, it does seem the current name for enable_incrementalsort
is consistent with the majority. Naming it enable_incremental_sort looks
like it would be more suited if the feature had been added by
overloading nodeSort.c. In that regard, it would be similar to
enable_parallel_append and enable_parallel_hash, where the middle word
becomes a modifier.

David
Get rid of runtime handling of AlternativeSubPlan?
Back in bd3daddaf232d95b0c9ba6f99b0170a0147dd8af, which introduced AlternativeSubPlans, I wrote: There is a lot more that could be done based on this infrastructure: in particular it's interesting to consider switching to the hash plan if we start out using the non-hashed plan but find a lot more upper rows going by than we expected. I have therefore left some minor inefficiencies in place, such as initializing both subplans even though we will currently only use one. That commit will be twelve years old come August, and nobody has either built anything else atop it or shown any interest in making the plan choice switchable mid-run. So it seems like kind of a failed experiment. Therefore, I'm considering the idea of ripping out all executor support for AlternativeSubPlan and instead having the planner replace an AlternativeSubPlan with the desired specific SubPlan somewhere late in planning, possibly setrefs.c. Admittedly, the relevant executor support only amounts to a couple hundred lines, but that's not nothing. A perhaps-more-useful effect is to get rid of the confusing and poorly documented EXPLAIN output that you get for an AlternativeSubPlan. I also noted that the existing subplan-selection logic in ExecInitAlternativeSubPlan is really pretty darn bogus, in that it uses a one-size-fits-all execution count estimate of parent->plan->plan_rows, no matter which subexpression the subplan is in. This is only appropriate for subplans in the plan node's targetlist, and can be either too high or too low elsewhere. It'd be relatively easy for setrefs.c to do better, I think, since it knows which subexpression it's working on at any point. Thoughts? regards, tom lane
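For readers who have not run into the EXPLAIN output being criticized here: an uncorrelated NOT IN is one of the classic ways to end up with an AlternativeSubPlan, since it cannot be converted to an anti-join. The example below is only a sketch — the table names are invented and the quoted plan line is approximate:

-- big_tbl and small_tbl are hypothetical tables.
explain (costs off)
select * from big_tbl b
where b.a not in (select s.a from small_tbl s);
-- On branches where the executor still picks the subplan at run time, the
-- plan's Filter line reads roughly:
--   Filter: (NOT (alternatives: SubPlan 1 or hashed SubPlan 2))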
Re: Get rid of runtime handling of AlternativeSubPlan?
On Mon, 22 Jun 2020 at 12:20, Tom Lane wrote: > > Back in bd3daddaf232d95b0c9ba6f99b0170a0147dd8af, which introduced > AlternativeSubPlans, I wrote: > > There is a lot more that could be done based on this infrastructure: in > particular it's interesting to consider switching to the hash plan if we > start > out using the non-hashed plan but find a lot more upper rows going by than > we > expected. I have therefore left some minor inefficiencies in place, such as > initializing both subplans even though we will currently only use one. > > That commit will be twelve years old come August, and nobody has either > built anything else atop it or shown any interest in making the plan choice > switchable mid-run. So it seems like kind of a failed experiment. > > Therefore, I'm considering the idea of ripping out all executor support > for AlternativeSubPlan and instead having the planner replace an > AlternativeSubPlan with the desired specific SubPlan somewhere late in > planning, possibly setrefs.c. > > Admittedly, the relevant executor support only amounts to a couple hundred > lines, but that's not nothing. A perhaps-more-useful effect is to get rid > of the confusing and poorly documented EXPLAIN output that you get for an > AlternativeSubPlan. > > I also noted that the existing subplan-selection logic in > ExecInitAlternativeSubPlan is really pretty darn bogus, in that it uses a > one-size-fits-all execution count estimate of parent->plan->plan_rows, no > matter which subexpression the subplan is in. This is only appropriate > for subplans in the plan node's targetlist, and can be either too high > or too low elsewhere. It'd be relatively easy for setrefs.c to do > better, I think, since it knows which subexpression it's working on > at any point.

When I was working on [1] a few weeks ago, I did wonder if I'd have to use an AlternativeSubPlan when doing result caching for subqueries. The problem is likely the same as why they were invented in the first place; we basically don't know how many rows the parent will produce when planning the subplan. For my case, I have an interest in both the number of rows in the outer plan, and the ndistinct estimate on the subplan parameters. If the parameters for the subquery are all distinct, then there's not much sense in trying to cache results to use later. We're never going to need them. Right now, if I wanted to use AlternativeSubPlan to delay the choice of this until run-time, then I'd be missing information about the ndistinct estimation since we don't have that information available in the final plan. Perhaps that's an argument for doing this in setrefs.c instead. I could look up the ndistinct estimate there.

As for switching plans on the fly during execution, I can see the sense in that as an idea. For the hashed subplan case, we'd likely want to switch to hashing mode if we discovered that there were many more rows in the outer query than we had thought there would be. However, I'm uncertain whether Result Cache would ever need anything similar, as technically we could just switch off the caching if we discovered our cache hit ratio was either terrible or 0. We would have an additional node to pull tuples through, however. Switching would also require that the tupleslot type was the same between the alternatives.

David

[1] https://www.postgresql.org/message-id/caaphdvrpcqyqdwergywx8j+2dlungxu+fosbq1uscxrunyx...@mail.gmail.com
Re: Improve planner cost estimations for alternative subplans
I wrote: > Nope. The entire reason why we have that kluge is that we don't know > until much later how many times we expect to execute the subplan. > AlternativeSubPlan allows the decision which subplan form to use to be > postponed till runtime; but when we're doing things like estimating the > cost and selectivity of a where-clause, we don't know that. > Maybe there's some way to recast things to avoid that problem, > but I have little clue what it'd look like. Actually ... maybe it's not that bad. Clearly there would be a circularity issue for selectivity estimation, but all the alternatives should have the same selectivity. Cost estimation is a different story: by the time we need to do cost estimation for a subexpression, we do in many cases have an idea how often the subexpression will be executed. I experimented with adding a number-of-evaluations parameter to cost_qual_eval, and found that the majority of call sites do have something realistic they can pass. The attached very-much-WIP patch shows my results so far. There's a lot of loose ends: * Any call site using COST_QUAL_EVAL_DUMMY_NUM_EVALS is a potential spot for future improvement. The only one that seems like it might be fundamentally unsolvable is cost_subplan's usage; but that would only matter if a subplan's testexpr contains an AlternativeSubPlan, which is probably a negligible case in practice. The other ones seem to just require more refactoring than I cared to do on a Sunday afternoon. * I did not do anything for postgres_fdw.c beyond making it compile. We can surely do better there, but it might require some rethinking of the way that plan costs get cached. * The merge and hash join costsize.c functions estimate costs of qpquals (i.e. quals to be applied at the join that are not being used as merge or hash conditions) by computing cost_qual_eval of the whole joinrestrictlist and then subtracting off the cost of the merge or hash quals. This is kind of broken if we want to use different num_eval estimates for the qpquals and the merge/hash quals, which I think we do. This probably just needs some refactoring to fix. We also need to save the relevant rowcounts in the join Path nodes so that createplan.c can do the right thing. * I think it might be possible to improve the situation in get_agg_clause_costs() if we're willing to postpone collection of the actual aggregate costs till later. This'd require two passes over the aggregate expressions, but that seems like it might not be terribly expensive. (I'd be inclined to also look at the idea of merging duplicate agg calls at plan time not run time, if we refactor that.) * I had to increase the number of table rows in one updatable_views.sql test to keep the plans the same. Without that, the planner realized that a seqscan would be cheaper than an indexscan. The result wasn't wrong exactly, but it failed to prove that leakproof quals could be used as indexquals, so I think we need to keep the plan choice the same. Anyway, this is kind of invasive, but I think it shouldn't really add noticeable costs as long as we save relevant rowcounts rather than recomputing them in createplan.c. Is it worth doing? I dunno. AlternativeSubPlan is pretty much a backwater, I think --- if it were interesting performance-wise to a lot of people, more would have been done with it by now. 
regards, tom lane diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c index 9fc53cad68..1149c298ba 100644 --- a/contrib/postgres_fdw/postgres_fdw.c +++ b/contrib/postgres_fdw/postgres_fdw.c @@ -656,7 +656,8 @@ postgresGetForeignRelSize(PlannerInfo *root, JOIN_INNER, NULL); - cost_qual_eval(&fpinfo->local_conds_cost, fpinfo->local_conds, root); + cost_qual_eval(&fpinfo->local_conds_cost, fpinfo->local_conds, + COST_QUAL_EVAL_DUMMY_NUM_EVALS, root); /* * Set # of retrieved rows and cached relation costs to some negative @@ -2766,7 +2767,8 @@ estimate_path_cost_size(PlannerInfo *root, /* Add in the eval cost of the locally-checked quals */ startup_cost += fpinfo->local_conds_cost.startup; total_cost += fpinfo->local_conds_cost.per_tuple * retrieved_rows; - cost_qual_eval(&local_cost, local_param_join_conds, root); + cost_qual_eval(&local_cost, local_param_join_conds, + COST_QUAL_EVAL_DUMMY_NUM_EVALS, root); startup_cost += local_cost.startup; total_cost += local_cost.per_tuple * retrieved_rows; @@ -2782,7 +2784,8 @@ estimate_path_cost_size(PlannerInfo *root, { QualCost tlist_cost; - cost_qual_eval(&tlist_cost, fdw_scan_tlist, root); + cost_qual_eval(&tlist_cost, fdw_scan_tlist, + COST_QUAL_EVAL_DUMMY_NUM_EVALS, root); startup_cost -= tlist_cost.startup; total_cost -= tlist_cost.startup; total_cost -= tlist_cost.per_tuple * rows; @@ -2870,9 +2873,11 @@ estimate_path_cost_size(PlannerInfo *root, /* * Calculate the cost of clauses push
Re: pg_ls_tmpdir to show directories and shared filesets (and pg_ls_*)
On Sun, Jun 07, 2020 at 10:07:19AM +0200, Fabien COELHO wrote: > Hello Justin, > > Rebased onto 7b48f1b490978a8abca61e9a9380f8de2a56f266 and renumbered OIDs. Rebased again on whatever broke func.sgml. > pg_stat_file() and pg_stat_dir_files() now return a char type, as well as > the function which call them, but the documentation does not seem to say > that it is the case. Fixed, thanks > I must admit that I'm not a fan on the argument management of > pg_ls_dir_metadata and pg_ls_dir_metadata_1arg and others. I understand that > it saves a few lines though, so maybe let it be. I think you're saying that you don't like the _1arg functions, but they're needed to allow the regression tests to pass: | * note: this wrapper is necessary to pass the sanity check in opr_sanity, | * which checks that all built-in functions that share the implementing C | * function take the same number of arguments > There is a comment in pg_ls_dir_files which talks about pg_ls_dir. > > Could pg_ls_*dir functions C implementations be dropped in favor of a pure > SQL implementation, like you did with recurse? I'm still waiting to hear feedback from a commiter if this is a good idea to put this into the system catalog. Right now, ts_debug is the only nontrivial function. > If so, ISTM that pg_ls_dir_files() could be significantly simplified by > moving its filtering flag to SQL conditions on "type" and others. That could > allow not to change the existing function output a keep the "isdir" column > defined as "type = 'd'" where it was used previously, if someone complains, > but still have the full capability of "ls". That would also allow to drop > the "*_1arg" hacks. Basically I'm advocating having 1 or 2 actual C > functions, and all other variants managed at the SQL level. You want to get rid of the 1arg stuff and just have one function. I see your point, but I guess the C function would still need to accept a "missing_ok" argument, so we need two functions, so there's not much utility in getting rid of the "include_dot_dirs" argument, which is there for consistency with pg_ls_dir. Conceivably we could 1) get rid of pg_ls_dir, and 2) get rid of the include_dot_dirs argument and 3) maybe make "missing_ok" a required argument; and, 4) get rid of the C wrapper functions, and replace with a bunch of stuff like this: SELECT name, size, access, modification, change, creation, type='d' AS isdir FROM pg_ls_dir_metadata('pg_wal') WHERE substring(name,1,1)!='.' AND type!='d'; Where the defaults I changed in this patchset still remain to be discussed: with or without metadata, hidden files, dotdirs. As I'm still waiting for committer feedback on the first 10 patches, so not intending to add more. -- Justin >From 60593b397f03ec840a97f8004d97177dec463752 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Mon, 16 Mar 2020 14:12:55 -0500 Subject: [PATCH v19 01/10] Document historic behavior of links to directories.. Backpatch to 9.5: pg_stat_file --- doc/src/sgml/func.sgml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index b7c450ea29..9f47745c5a 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -25881,7 +25881,7 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8'); Returns a record containing the file's size, last access time stamp, last modification time stamp, last file status change time stamp (Unix platforms only), file creation time stamp (Windows only), and a flag -indicating if it is a directory. 
+indicating if it is a directory (or a symbolic link to a directory). This function is restricted to superusers by default, but other users -- 2.17.0 >From c1acf58316869e176bc2b215e2545f9183f100e3 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Mon, 30 Mar 2020 18:59:16 -0500 Subject: [PATCH v19 02/10] pg_stat_file and pg_ls_dir_* to use lstat().. pg_ls_dir_* will now skip (no longer show) symbolic links, same as other non-regular file types, as we advertize we do since 8b6d94cf6. That seems to be the intented behavior, since irregular file types are 1) less portable; and, 2) we don't currently show a file's type except for "bool is_dir". pg_stat_file will now 1) show metadata of links themselves, rather than their target; and, 2) specifically, show links to directories with "is_dir=false"; and, 3) not error on broken symlinks. --- doc/src/sgml/func.sgml | 2 +- src/backend/utils/adt/genfile.c | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 9f47745c5a..b7c450ea29 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -25881,7 +25881,7 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8'); Returns a record containing the file's size, last access time stamp, last modification time stamp, last file status ch
Re: [PATCH] Initial progress reporting for COPY command
On 2020/06/21 20:33, Josef Šimánek wrote:

On Mon, Jun 15, 2020 at 6:39 AM Fujii Masao <masao.fu...@oss.nttdata.com> wrote:

On 2020/06/14 21:32, Josef Šimánek wrote:
> Hello, as proposed by Pavel Stěhule and discussed on local czech PostgreSQL maillist (https://groups.google.com/d/msgid/postgresql-cz/CAFj8pRCZ42CBCa1bPHr7htffSV%2BNAcgcHHG0dVqOog4bsu2LFw%40mail.gmail.com?utm_medium=email&utm_source=footer), I have prepared an initial patch for COPY command progress reporting.

Sounds nice!

> file - bool - is file is used?
> program - bool - is program used?

Are these fields really necessary in a progress view? What values are reported when STDOUT/STDIN is specified in the COPY command?

For STDOUT and STDIN, file is true and program is false.

Could you tell me why these columns are necessary in a *progress* view? If we want to see what COPY command is actually running, we can look at pg_stat_activity instead. For example:

SELECT pc.*, a.query
FROM pg_stat_progress_copy pc, pg_stat_activity a
WHERE pc.pid = a.pid;

> file_bytes_processed - amount of bytes processed when file is used (otherwise 0), works for both directions (FROM/TO) when file is used (file = t)

What value is reported when STDOUT/STDIN is specified in the COPY command?

For my first patch, nothing was reported on STDOUT/STDIN usage. I'll attach a new patch soon supporting those as well.

Thanks for the patch! With the patch, pg_stat_progress_copy seems to report the progress of the processing on file_fdw. Is this intentional?

Regards,

-- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: Resetting spilled txn statistics in pg_stat_replication
On Sun, Jun 21, 2020 at 3:27 AM Tomas Vondra wrote: > > On Thu, Jun 18, 2020 at 12:21:17PM +0530, Amit Kapila wrote: > >On Thu, Jun 18, 2020 at 8:01 AM Masahiko Sawada > > wrote: > >> > >> On Wed, 17 Jun 2020 at 20:14, Amit Kapila wrote: > >> > > >> > > >> > I had written above in the context of persisting these stats. I mean > >> > to say if the process has bounced or server has restarted then the > >> > previous stats might not make much sense because we were planning to > >> > use pid [1], so the stats from process that has exited might not make > >> > much sense or do you think that is okay? If we don't want to persist > >> > and the lifetime of these stats is till the process is alive then we > >> > are fine. > >> > > >> > >> Sorry for confusing you. The above my idea is about having the stats > >> per slots. That is, we add spill_txns, spill_count and spill_bytes to > >> pg_replication_slots or a new view pg_stat_logical_replication_slots > >> with some columns: slot_name plus these stats columns and stats_reset. > >> The idea is that the stats values accumulate until either the slot is > >> dropped, the server crashed, the user executes the reset function, or > >> logical decoding is performed with different logical_decoding_work_mem > >> value than the previous time. In other words, the stats values are > >> reset in either case. That way, I think the stats values always > >> correspond to logical decoding using the same slot with the same > >> logical_decoding_work_mem value. > >> > > > >What if the decoding has been performed by multiple backends using the > >same slot? In that case, it will be difficult to make the judgment > >for the value of logical_decoding_work_mem based on stats. It would > >make sense if we provide a way to set logical_decoding_work_mem for a > >slot but not sure if that is better than what we have now. > > > >What problems do we see in displaying these for each process? I think > >users might want to see the stats for the exited processes or after > >server restart but I think both of those are not even possible today. > >I think the stats are available till the corresponding WALSender > >process is active. > > > > I don't quite see what the problem is. We're in this exact position with > many other stats we track and various GUCs. If you decide to tune the > setting for a particular slot, you simply need to be careful which > backends decode the slot and what GUC values they used. > What problem do you if we allow it to display per-process (WALSender or backend)? They are incurred by the WALSender or by backends so displaying them accordingly seems more straightforward and logical to me. As of now, we don't allow it to be set for a slot, so it won't be convenient for the user to tune it per slot. I think we can allow to set it per-slot but not sure if there is any benefit for the same. > I really think we should not be inventing something that automatically > resets the stats when someone happens to change the GUC. > I agree with that. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Resetting spilled txn statistics in pg_stat_replication
On Mon, Jun 22, 2020 at 8:22 AM Amit Kapila wrote: > > On Sun, Jun 21, 2020 at 3:27 AM Tomas Vondra > wrote: > > > > On Thu, Jun 18, 2020 at 12:21:17PM +0530, Amit Kapila wrote: > > >On Thu, Jun 18, 2020 at 8:01 AM Masahiko Sawada > > > wrote: > > >> > > >> On Wed, 17 Jun 2020 at 20:14, Amit Kapila > > >> wrote: > > >> > > > >> > > > >> > I had written above in the context of persisting these stats. I mean > > >> > to say if the process has bounced or server has restarted then the > > >> > previous stats might not make much sense because we were planning to > > >> > use pid [1], so the stats from process that has exited might not make > > >> > much sense or do you think that is okay? If we don't want to persist > > >> > and the lifetime of these stats is till the process is alive then we > > >> > are fine. > > >> > > > >> > > >> Sorry for confusing you. The above my idea is about having the stats > > >> per slots. That is, we add spill_txns, spill_count and spill_bytes to > > >> pg_replication_slots or a new view pg_stat_logical_replication_slots > > >> with some columns: slot_name plus these stats columns and stats_reset. > > >> The idea is that the stats values accumulate until either the slot is > > >> dropped, the server crashed, the user executes the reset function, or > > >> logical decoding is performed with different logical_decoding_work_mem > > >> value than the previous time. In other words, the stats values are > > >> reset in either case. That way, I think the stats values always > > >> correspond to logical decoding using the same slot with the same > > >> logical_decoding_work_mem value. > > >> > > > > > >What if the decoding has been performed by multiple backends using the > > >same slot? In that case, it will be difficult to make the judgment > > >for the value of logical_decoding_work_mem based on stats. It would > > >make sense if we provide a way to set logical_decoding_work_mem for a > > >slot but not sure if that is better than what we have now. > > > > > >What problems do we see in displaying these for each process? I think > > >users might want to see the stats for the exited processes or after > > >server restart but I think both of those are not even possible today. > > >I think the stats are available till the corresponding WALSender > > >process is active. > > > > > > > I don't quite see what the problem is. We're in this exact position with > > many other stats we track and various GUCs. If you decide to tune the > > setting for a particular slot, you simply need to be careful which > > backends decode the slot and what GUC values they used. > > > > What problem do you if we allow it to display per-process (WALSender > or backend)? They are incurred by the WALSender or by backends so > displaying them accordingly seems more straightforward and logical to > me. > > As of now, we don't allow it to be set for a slot, so it won't be > convenient for the user to tune it per slot. I think we can allow to > set it per-slot but not sure if there is any benefit for the same. > If we display stats as discussed in email [1] (pid, slot_name, spill_txns, spill_count, etc.), then we can even find the stats w.r.t each slot. [1] - https://www.postgresql.org/message-id/CA%2Bfd4k5nqeFdhpnCULpTh9TR%2B15rHZSbz0SDC6sZhr_v99SeKA%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
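Purely as a sketch of the per-slot alternative being debated here: if the view proposed upthread were added with the columns mentioned (pg_stat_logical_replication_slots with slot_name, spill_txns, spill_count, spill_bytes and stats_reset — none of which exist yet), monitoring might look something like:

-- Hypothetical view and columns, taken from the proposal in this thread.
select slot_name, spill_txns, spill_count,
       pg_size_pretty(spill_bytes) as spilled,
       stats_reset
from pg_stat_logical_replication_slots;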
proposal: unescape_text function
Hi,

There is one user request for an unescape function in core:

https://stackoverflow.com/questions/20124393/convert-escaped-unicode-character-back-to-actual-character-in-postgresql/20125412?noredirect=1#comment110502526_20125412

The request is about being able to do through a functional interface what we can currently do only with string literals. I wrote a plpgsql function, but a built-in function could be simpler:

CREATE OR REPLACE FUNCTION public.unescape(text, text)
 RETURNS text
 LANGUAGE plpgsql
AS $function$
DECLARE
  result text;
BEGIN
  EXECUTE format('SELECT U&%s UESCAPE %s',
                 quote_literal(replace($1, '\u', '^')),
                 quote_literal($2))
    INTO result;
  RETURN result;
END;
$function$

postgres=# select unescape('Odpov\u011Bdn\u00E1 osoba','^');
    unescape
-----------------
 Odpovědná osoba
(1 row)

What do you think about this?

Regards

Pavel
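For reference, the dynamic SQL that the EXECUTE above builds is just a Unicode-escaped string literal with a custom escape character; the equivalent hand-written literal for the example call would be something like this (requires standard_conforming_strings, which is on by default):

-- What unescape('Odpov\u011Bdn\u00E1 osoba','^') ends up running:
SELECT U&'Odpov^011Bdn^00E1 osoba' UESCAPE '^';
-- returns: Odpovědná osoba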
Re: Add support for INDEX_CLEANUP and TRUNCATE to vacuumdb
On Fri, Jun 19, 2020 at 10:57:01AM +0900, Michael Paquier wrote: > Thanks. This flavor looks good to me in terms of code, and the test > coverage is what's needed for all the code paths added. This version > is using my suggestion of upthread for the option names: --no-truncate > and --no-index-cleanup. Are people fine with this choice? Okay. I have gone through the patch again, and applied it as of 9550ea3. Thanks. -- Michael signature.asc Description: PGP signature
Re: tag typos in "catalog.sgml"
On Sun, Jun 21, 2020 at 07:31:16PM +0900, Michael Paquier wrote: > Good catches, thanks Fabien. I will fix that tomorrow or so. And applied to HEAD. -- Michael signature.asc Description: PGP signature
Re: I'd like to discuss scaleout at PGCon
Hi, I read through the symfora paper and it is a nice technique. I am not very sure about where Hyder is used commercially but given that it has come out of Microsoft Research so some microsoft products might be using it/some of these concepts already. With Regards, Sumanta Mukherjee. EnterpriseDB: http://www.enterprisedb.com On Wed, Jun 17, 2020 at 9:38 PM tsunakawa.ta...@fujitsu.com < tsunakawa.ta...@fujitsu.com> wrote: > Hello, > > > > It seems you didn't include pgsql-hackers. > > > > > > From: Sumanta Mukherjee > > > I saw the presentation and it is great except that it seems to be > unclear of both SD and SN if the storage and the compute are being > explicitly separated. Separation of storage and compute would have some > cost advantages as per my understanding. The following two work (ref below) > has some information about the usefulness of this technique for scale out > and so it would be an interesting addition to see if in the SN > architecture that is being proposed could be modified to take care of this > phenomenon and reap the gain. > > > > Thanks. Separation of compute and storage is surely to be considered. > Unlike the old days when the shared storage was considered to be a > bottleneck with slow HDDs and FC-SAN, we could now expect high speed shared > storage thanks to flash memory, NVMe-oF, and RDMA. > > > > > 1. Philip A. Bernstein, Colin W. Reid, and Sudipto Das. 2011. Hyder - A > > > Transactional Record Manager for Shared Flash. In CIDR 2011. > > > > This is interesting. I'll go into this. Do you know there's any product > based on Hyder? OTOH, Hyder seems to require drastic changes when adopting > for Postgres -- OCC, log-structured database, etc. I'd like to hear how > feasible those are. However, its scale-out capability without the need for > data or application partitioning appears appealing. > > > > > > To explore another possibility that would have more affinity with the > current Postgres, let me introduce our proprietary product called > Symfoware. It's not based on Postgres. > > > > It has shared nothing scale-out functionality with full ACID based on 2PC, > conventional 2PL locking and distributed deadlock resolution. Despite > being shared nothing, all the database files and transaction logs are > stored on shared storage. > > > > The database is divided into "log groups." Each log group has one > transaction log and multiple tablespaces (it's called "database space" > instead of tablespace.) > > > > Each DB instance in the cluster owns multiple log groups, and handles > reads/writes to the data in its owning log groups. When a DB instance > fails, other surviving DB instances take over the log groups of the failed > DB instance, recover the data using the transaction log of the log group, > and accepts reads/writes to the data in the log group. The DBA configures > which DB instance initially owns which log groups and which DB instances > are candidates to take over which log groups. > > > > This way, no server is idle as a standby. All DB instances work hard to > process read-write transactions. This "no idle server for HA" is one of > the things Oracle RAC users want in terms of cost. > > > > However, it still requires data and application partitioning unlike > Hyder. Does anyone think of a way to eliminate partitioning? Data and > application partitioning is what Oracle RAC users want to avoid or cannot > tolerate. > > > > Ref: Introduction of the Symfoware shared nothing scale-out called "load > share." 
> > > https://pdfs.semanticscholar.org/8b60/163593931cebc58e9f637cfb501500230adc.pdf > > > > > > Regards > > Takayuki Tsunakawa > > > > > > --- below is Sumanta's original mail --- > > *From:* Sumanta Mukherjee > *Sent:* Wednesday, June 17, 2020 5:34 PM > *To:* Tsunakawa, Takayuki/綱川 貴之 > *Cc:* Bruce Momjian ; Merlin Moncure ; > Robert Haas ; maumau...@gmail.com > *Subject:* Re: I'd like to discuss scaleout at PGCon > > > > Hello, > > > > I saw the presentation and it is great except that it seems to be unclear > of both SD and SN if the storage and the compute are being explicitly > separated. Separation of storage and compute would have some cost > advantages as per my understanding. The following two work (ref below) has > some information about the usefulness of this technique for scale out and > so it would be an interesting addition to see if in the SN architecture > that is being proposed could be modified to take care of this phenomenon > and reap the gain. > > > > 1. Philip A. Bernstein, Colin W. Reid, and Sudipto Das. 2011. Hyder - A > Transactional Record Manager for Shared Flash. In CIDR 2011. > > > > 2. Dhruba Borthakur. 2017. The Birth of RocksDB-Cloud. http://rocksdb. > blogspot.com/2017/05/the-birth-of-rocksdb-cloud.html. > > > > With Regards, > > Sumanta Mukherjee. > > EnterpriseDB: http://www.enterprisedb.com > > > > > > > >
Backpatch b61d161c14
I propose to backpatch b61d161c14 [1] (Introduce vacuum errcontext to display additional information.). In the recent past, we have seen an error report similar to "ERROR: found xmin 2157740646 from before relfrozenxid 1197" from multiple EDB customers. A similar report is seen on pgsql-bugs as well [2] which I think has triggered the implementation of this feature for v13. Such reports mostly indicate database corruption rather than any postgres bug which is also indicated by the error-code (from before relfrozenxid) for this message. I think there is a good reason to back-patch this as multiple users are facing similar issues. This patch won't fix this issue but it will help us in detecting the problematic part of the heap/index and then if users wish they can delete the portion of data that appeared to be corrupted and resume the operations on that relation. I have tried to back-patch this for v12 and attached is the result. The attached patch passes make check-world but I have yet to test it manually and also prepare the patch for other branches once we agree on this proposal. Thoughts? [1] - commit b61d161c146328ae6ba9ed937862d66e5c8b035a Author: Amit Kapila Date: Mon Mar 30 07:33:38 2020 +0530 Introduce vacuum errcontext to display additional information. The additional information displayed will be block number for error occurring while processing heap and index name for error occurring while processing the index. [2] - https://www.postgresql.org/message-id/20190807235154.erbmr4o4bo6vgnjv%40alap3.anarazel.de -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com Introduce-vacuum-errcontext-to-display-additional-in.v12.patch Description: Binary data
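To illustrate what the back-patch would buy DBAs in practice: with the errcontext in place, the same corruption error arrives with enough context to locate the damaged block. The session below is only a sketch — the table name and block number are invented and the exact CONTEXT wording is approximate:

-- Hypothetical session against a corrupted table:
VACUUM some_corrupted_table;
-- ERROR:  found xmin 2157740646 from before relfrozenxid 1197
-- CONTEXT: while scanning block 4132 of relation "public.some_corrupted_table"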
Re: Failures with wal_consistency_checking and 13~
On Sat, Jun 20, 2020 at 05:43:19PM +0300, Alexander Korotkov wrote: > I have discovered and fixed the issue in a44dd932ff. spg_mask() > masked unused space only when pagehdr->pd_lower > > SizeOfPageHeaderData. But during the vacuum regression tests, one > page has been erased completely and pagehdr->pd_lower was set to > SizeOfPageHeaderData. Actually, 13 didn't introduce any issue, it > just added a test that spotted the issue. The issue is here since > a507b86900. Thanks Alexander for looking at this issue! My runs with wal_consistency_checking are all clear now. -- Michael signature.asc Description: PGP signature
Re: min_safe_lsn column in pg_replication_slots view
On Sat, Jun 20, 2020 at 03:53:54PM +0900, Michael Paquier wrote: > On Sat, Jun 20, 2020 at 09:45:52AM +0530, Amit Kapila wrote: >> Isn't this information specific to checkpoints, so maybe better to >> display in view pg_stat_bgwriter? > > Not sure that's a good match. If we decide to expose that, a separate > function returning a LSN based on the segment number from > XLogGetLastRemovedSegno() sounds fine to me, like > pg_wal_last_recycled_lsn(). Perhaps somebody has a better name in > mind? I was thinking on this one for the last couple of days, and came up with the name pg_wal_oldest_lsn(), as per the attached, traking the oldest WAL location still available. That's unfortunately too late for beta2, but let's continue the discussion. -- Michael diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h index 7644147cf5..7de4338910 100644 --- a/src/include/catalog/catversion.h +++ b/src/include/catalog/catversion.h @@ -53,6 +53,6 @@ */ /* mmddN */ -#define CATALOG_VERSION_NO 202006151 +#define CATALOG_VERSION_NO 202006221 #endif diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index 61f2c2f5b4..1a07877086 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -6090,6 +6090,10 @@ proname => 'pg_current_wal_flush_lsn', provolatile => 'v', prorettype => 'pg_lsn', proargtypes => '', prosrc => 'pg_current_wal_flush_lsn' }, +{ oid => '9054', descr => 'oldest wal location still available', + proname => 'pg_wal_oldest_lsn', provolatile => 'v', + prorettype => 'pg_lsn', proargtypes => '', + prosrc => 'pg_wal_oldest_lsn' }, { oid => '2850', descr => 'wal filename and byte offset, given a wal location', proname => 'pg_walfile_name_offset', prorettype => 'record', @@ -10063,9 +10067,9 @@ proname => 'pg_get_replication_slots', prorows => '10', proisstrict => 'f', proretset => 't', provolatile => 's', prorettype => 'record', proargtypes => '', - proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn,text,pg_lsn}', - proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o}', - proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn,wal_status,min_safe_lsn}', + proallargtypes => '{name,name,text,oid,bool,bool,int4,xid,xid,pg_lsn,pg_lsn,text}', + proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o}', + proargnames => '{slot_name,plugin,slot_type,datoid,temporary,active,active_pid,xmin,catalog_xmin,restart_lsn,confirmed_flush_lsn,wal_status}', prosrc => 'pg_get_replication_slots' }, { oid => '3786', descr => 'set up a logical replication slot', proname => 'pg_create_logical_replication_slot', provolatile => 'v', diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c index 290658b22c..ccb3b5d5db 100644 --- a/src/backend/access/transam/xlogfuncs.c +++ b/src/backend/access/transam/xlogfuncs.c @@ -387,6 +387,31 @@ pg_current_wal_flush_lsn(PG_FUNCTION_ARGS) PG_RETURN_LSN(current_recptr); } +/* + * Report the oldest WAL location still available after WAL segment removal + * + * This is useful to monitor how much WAL retention is happening with + * replication slots and concurrent checkpoints. NULL means that no WAL + * segments have been removed since startup yet. 
+ */ +Datum +pg_wal_oldest_lsn(PG_FUNCTION_ARGS) +{ + XLogRecPtr oldestptr; + XLogSegNo last_removed_seg; + + last_removed_seg = XLogGetLastRemovedSegno(); + + /* Leave if no segments have been removed since startup */ + if (last_removed_seg == 0) + PG_RETURN_NULL(); + + XLogSegNoOffsetToRecPtr(last_removed_seg + 1, 0, + wal_segment_size, oldestptr); + + PG_RETURN_LSN(oldestptr); +} + /* * Report the last WAL receive location (same format as pg_start_backup etc) * diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 5314e9348f..507b602a08 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -878,8 +878,7 @@ CREATE VIEW pg_replication_slots AS L.catalog_xmin, L.restart_lsn, L.confirmed_flush_lsn, -L.wal_status, -L.min_safe_lsn +L.wal_status FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid = D.oid); diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index 06e4955de7..590f7054d6 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -236,7 +236,7 @@ pg_drop_replication_slot(PG_FUNCTION_ARGS) Datum pg_get_replication_slots(PG_FUNCTION_ARGS) { -#define PG_GET_REPLICATION_SLOTS_COLS 13 +#define PG_GET_REPLICATION_SLOTS_COLS 12 ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; TupleDesc tupdesc; Tuplestorestate *tupstore; @@ -282,7 +282,6 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) Datum values[PG_GET_REPLI
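Assuming the attached patch were applied, the new function could be combined with the existing LSN functions for monitoring; a sketch only (pg_wal_oldest_lsn() exists solely with this patch):

-- How much WAL is still on disk between the oldest retained segment and
-- the current insert position. Returns NULL until at least one segment
-- has been removed since startup.
select pg_wal_oldest_lsn() as oldest_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),
                                      pg_wal_oldest_lsn())) as wal_on_disk;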
Re: Parallel Seq Scan vs kernel read ahead
On Mon, 22 Jun 2020 at 16:54, David Rowley wrote:
> I also tested this an AMD machine running Ubuntu 20.04 on kernel
> version 5.4.0-37. I used the same 100GB table I mentioned in [1], but
> with the query "select * from t where a < 0;", which saves having to
> do any aggregate work.

I just wanted to add a note here that Thomas and I just discussed this a bit offline. He recommended I try setting the kernel readahead a bit higher. It was set to 128kB, so I cranked it up to 2MB with:

sudo blockdev --setra 4096 /dev/nvme0n1p2

I didn't want to run the full test again as it took quite a long time, so I just tried with 32 workers. The first two results here are taken from the test results I just posted 1 hour ago.

Master   readahead=128kB = 89921.283 ms
v2 patch readahead=128kB = 36085.642 ms
Master   readahead=2MB   = 60984.905 ms
v2 patch readahead=2MB   = 22611.264 ms

There must be a fairly large element of reading from the page cache there, since 22.6 seconds means 4528MB/sec over the 100GB table. The maximum for a PCIe 3.0 x4 slot is 3940MB/s.

David
[PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
Hi,

When a query on a foreign table is executed from a local session using postgres_fdw, the local backend opens a connection which, as expected, causes a remote session/backend to be opened on the remote postgres server for query execution. One observation is that, even after the query has finished, the remote session/backend still persists on the remote server. Upon researching, I found that there is a concept of connection caching for the remote connections made using postgres_fdw: each local backend/session can cache up to 8 different connections. This caching is useful as it avoids the cost of re-establishing a new connection for every foreign query.

However, there may be situations where long-lasting local sessions execute very few foreign queries and the rest are local queries; in this scenario, the remote sessions opened by the local sessions/backends are not useful, as they remain idle and eat up the remote server's connection capacity. This problem gets even worse (though this use case is a bit imaginary) if all max_connections local sessions (default 100, with each backend caching 8 remote connections) open remote sessions and cache them.

I propose to have a new session-level GUC called "enable_connectioncache" (the name can be changed if it doesn't convey the purpose correctly) with a default value of true, meaning that all remote connections are cached. If set to false, the connections are not cached, and the remote sessions are closed by the local backend/session at the end of each remote transaction.

Attached are the initial patch (based on commit 9550ea3027aa4f290c998afd8836a927df40b09d) and the test setup. Another approach to solve this problem could be automatic clean-up of cache entries (based on Robert's idea [1]), but letting users decide on caching also seems good. Please note that documentation is still pending.

Thoughts?

Test case:

without patch:
1. Run the query on the foreign table.
2. Look for the backend/session opened on the remote postgres server; it exists as long as the local session remains active.

with patch:
1. SET enable_connectioncache TO false;
2. Run the query on the foreign table.
3. Look for the backend/session opened on the remote postgres server; it should not exist.

[1] - https://www.postgresql.org/message-id/CA%2BTgmob_ksTOgmbXhno%2Bk5XXPOK%2B-JYYLoU3MpXuutP4bH7gzA%40mail.gmail.com

With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com

testcase Description: Binary data

From d4594067b29ab1414e9741a7e6abd91daf078201 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy
Date: Mon, 22 Jun 2020 10:07:53 +0530
Subject: [PATCH v1] enable_connectioncache GUC for postgres_fdw connection caching

postgres_fdw connection caching - cause remote sessions linger till the local session exit.
--- contrib/postgres_fdw/connection.c | 3 ++- src/backend/utils/hash/dynahash.c | 11 +++ src/backend/utils/misc/guc.c | 15 +++ src/backend/utils/misc/postgresql.conf.sample | 2 +- src/include/utils/hsearch.h | 4 5 files changed, 33 insertions(+), 2 deletions(-) diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c index 52d1fe3563..de994f678b 100644 --- a/contrib/postgres_fdw/connection.c +++ b/contrib/postgres_fdw/connection.c @@ -869,7 +869,8 @@ pgfdw_xact_callback(XactEvent event, void *arg) */ if (PQstatus(entry->conn) != CONNECTION_OK || PQtransactionStatus(entry->conn) != PQTRANS_IDLE || - entry->changing_xact_state) + entry->changing_xact_state || + scan.enableconncache == false) { elog(DEBUG3, "discarding connection %p", entry->conn); disconnect_pg_server(entry); diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c index 2688e27726..a5448f4af4 100644 --- a/src/backend/utils/hash/dynahash.c +++ b/src/backend/utils/hash/dynahash.c @@ -276,6 +276,9 @@ static bool has_seq_scans(HTAB *hashp); */ static MemoryContext CurrentDynaHashCxt = NULL; +/* parameter for enabling fdw connection hashing */ +bool enable_connectioncache = true; + static void * DynaHashAlloc(Size size) { @@ -1383,6 +1386,14 @@ hash_seq_init(HASH_SEQ_STATUS *status, HTAB *hashp) status->hashp = hashp; status->curBucket = 0; status->curEntry = NULL; + status->enableconncache = true; + + /* see if the cache was for postgres_fdw connections and + user chose to disable connection caching*/ + if ((strcmp(hashp->tabname,"postgres_fdw connections") == 0) && + !enable_connectioncache) + status->enableconncache = false; + if (!hashp->frozen) register_seq_scan(hashp); } diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 75fc6f11d6..f7a57415fc 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -127,6 +127,7 @@ extern char *temp_tablespaces; extern bool ignore_checksum_failure; extern bool i
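A quick usage sketch of the behaviour described above (the GUC only exists with the proposed patch applied; ft is an invented postgres_fdw foreign table):

-- Disable connection caching for this session, then run a foreign query.
SET enable_connectioncache TO false;
SELECT count(*) FROM ft;
-- The remote backend opened for this query is discarded at the end of the
-- transaction instead of being kept in the local backend's connection cache.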
Re: [POC] Fast COPY FROM command for the table with foreign partitions
19.06.2020 19:58, Etsuro Fujita пишет: On Tue, Jun 2, 2020 at 2:51 PM Andrey Lepikhov wrote: Hiding the COPY code under the buffers management machinery allows us to generalize buffers machinery, execute one COPY operation on each buffer and simplify error handling. I'm not sure that it's really a good idea that the bulk-insert API is designed the way it's tightly coupled with the bulk-insert machinery in the core, because 1) some FDWs might want to send tuples provided by the core to the remote, one by one, without storing them in a buffer, or 2) some other FDWs might want to store the tuples in the buffer and send them in a lump as postgres_fdw in the proposed patch but might want to do so independently of MAX_BUFFERED_TUPLES and/or MAX_BUFFERED_BYTES defined in the bulk-insert machinery. I agree that we would need special handling for cases you mentioned above if we design this API based on something like the idea I proposed in that thread. Agreed As i understand, main idea of the thread, mentioned by you, is to add "COPY FROM" support without changes in FDW API. I don't think so; I think we should introduce new API for this feature to keep the ExecForeignInsert() API simple. Ok All that I can offer in this place now is to introduce one new ExecForeignBulkInsert(buf) routine that will execute single "COPY FROM STDIN" operation, send tuples and close the operation. We can use the ExecForeignInsert() routine for each buffer tuple if ExecForeignBulkInsert() is not supported. Agreed. In the next version (see attachment) of the patch i removed Begin/End fdwapi routines. Now we have only the ExecForeignBulkInsert() routine. -- Andrey Lepikhov Postgres Professional >From 108dc421cec88ab5afd092f40da3fa31b8fcfbc5 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Mon, 22 Jun 2020 10:28:42 +0500 Subject: [PATCH] Fast COPY FROM into the foreign or sharded table. This feature enables bulk COPY into foreign table in the case of multi inserts is possible and foreign table has non-zero number of columns. --- contrib/postgres_fdw/deparse.c| 60 - .../postgres_fdw/expected/postgres_fdw.out| 33 ++- contrib/postgres_fdw/postgres_fdw.c | 87 contrib/postgres_fdw/postgres_fdw.h | 1 + contrib/postgres_fdw/sql/postgres_fdw.sql | 28 +++ src/backend/commands/copy.c | 206 -- src/include/commands/copy.h | 5 + src/include/foreign/fdwapi.h | 8 + 8 files changed, 344 insertions(+), 84 deletions(-) diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c index ad37a74221..a37981ff66 100644 --- a/contrib/postgres_fdw/deparse.c +++ b/contrib/postgres_fdw/deparse.c @@ -184,6 +184,8 @@ static void appendAggOrderBy(List *orderList, List *targetList, static void appendFunctionName(Oid funcid, deparse_expr_cxt *context); static Node *deparseSortGroupClause(Index ref, List *tlist, bool force_colno, deparse_expr_cxt *context); +static List *deparseRelColumnList(StringInfo buf, Relation rel, + bool enclose_in_parens); /* * Helper functions @@ -1758,6 +1760,20 @@ deparseInsertSql(StringInfo buf, RangeTblEntry *rte, withCheckOptionList, returningList, retrieved_attrs); } +/* + * Deparse COPY FROM into given buf. + * We need to use list of parameters at each query. 
+ */ +void +deparseCopyFromSql(StringInfo buf, Relation rel) +{ + appendStringInfoString(buf, "COPY "); + deparseRelation(buf, rel); + (void) deparseRelColumnList(buf, rel, true); + + appendStringInfoString(buf, " FROM STDIN "); +} + /* * deparse remote UPDATE statement * @@ -2061,6 +2077,30 @@ deparseAnalyzeSizeSql(StringInfo buf, Relation rel) */ void deparseAnalyzeSql(StringInfo buf, Relation rel, List **retrieved_attrs) +{ + appendStringInfoString(buf, "SELECT "); + *retrieved_attrs = deparseRelColumnList(buf, rel, false); + + /* Don't generate bad syntax for zero-column relation. */ + if (list_length(*retrieved_attrs) == 0) + appendStringInfoString(buf, "NULL"); + + /* + * Construct FROM clause + */ + appendStringInfoString(buf, " FROM "); + deparseRelation(buf, rel); +} + +/* + * Construct the list of columns of given foreign relation in the order they + * appear in the tuple descriptor of the relation. Ignore any dropped columns. + * Use column names on the foreign server instead of local names. + * + * Optionally enclose the list in parantheses. + */ +static List * +deparseRelColumnList(StringInfo buf, Relation rel, bool enclose_in_parens) { Oid relid = RelationGetRelid(rel); TupleDesc tupdesc = RelationGetDescr(rel); @@ -2069,10 +2109,8 @@ deparseAnalyzeSql(StringInfo buf, Relation rel, List **retrieved_attrs) List *options; ListCell *lc; bool first = true; + List *retrieved_attrs = NIL; - *retrieved_attrs = NIL; - - appendStringInfoString(buf, "SELECT "); for (i = 0; i < tupdesc->natts; i++) { /* Ignore
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Tue, Jun 16, 2020 at 2:37 PM Amit Kapila wrote: > > On Mon, Jun 15, 2020 at 6:29 PM Amit Kapila wrote: > > > > I have few more comments on the patch > > 0013-Change-buffile-interface-required-for-streaming-.patch: > > > > Review comments on 0014-Worker-tempfile-use-the-shared-buffile-infrastru: > 1. > The subxact file is only create if there > + * are any suxact info under this xid. > + */ > +typedef struct StreamXidHash > > Lets slightly reword the part of the comment as "The subxact file is > created iff there is any suxact info under this xid." Done > > 2. > @@ -710,6 +740,9 @@ apply_handle_stream_stop(StringInfo s) > subxact_info_write(MyLogicalRepWorker->subid, stream_xid); > stream_close_file(); > > + /* Commit the per-stream transaction */ > + CommitTransactionCommand(); > > Before calling commit, ensure that we are in a valid transaction. I > think we can have an Assert for IsTransactionState(). Done > 3. > @@ -761,11 +791,13 @@ apply_handle_stream_abort(StringInfo s) > > int64 i; > int64 subidx; > - int fd; > + BufFile*fd; > bool found = false; > char path[MAXPGPATH]; > + StreamXidHash *ent; > > subidx = -1; > + ensure_transaction(); > subxact_info_read(MyLogicalRepWorker->subid, xid); > > Why to call ensure_transaction here? Is there any reason that we > won't have a valid transaction by now? If not, then its better to > have an Assert for IsTransactionState(). We are only starting transaction from stream_start to stream_stop, so at stream_abort we will not have the transaction. > 4. > - if (write(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts)) > + if (BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts)) > { > - int save_errno = errno; > + int save_errno = errno; > > - CloseTransientFile(fd); > + BufFileClose(fd); > > On error, won't these files be close automatically? If so, why at > this place and before other errors, we need to close this? Yes, that's correct. I have fixed those. > 5. > if ((len > 0) && ((BufFileRead(fd, subxacts, len)) != len)) > { > int save_errno = errno; > > BufFileClose(fd); > errno = save_errno; > ereport(ERROR, > (errcode_for_file_access(), > errmsg("could not read file \"%s\": %m", > > Can we change the error message to "could not read from streaming > transactions file .." or something like that and similarly we can > change the message for failure in reading changes file? Done > 6. > if (BufFileWrite(fd, &nsubxacts, sizeof(nsubxacts)) != sizeof(nsubxacts)) > { > int save_errno = errno; > > BufFileClose(fd); > errno = save_errno; > ereport(ERROR, > (errcode_for_file_access(), > errmsg("could not write to file \"%s\": %m", > > Similar to previous, can we change it to "could not write to streaming > transactions file BufFileWrite is not returning failure anymore. > 7. > @@ -2855,17 +2844,32 @@ stream_open_file(Oid subid, TransactionId xid, > bool first_segment) > * for writing, in append mode. > */ > if (first_segment) > - flags = (O_WRONLY | O_CREAT | O_EXCL | PG_BINARY); > - else > - flags = (O_WRONLY | O_APPEND | PG_BINARY); > + { > + /* > + * Shared fileset handle must be allocated in the persistent context. > + */ > + SharedFileSet *fileset = > + MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet)); > > - stream_fd = OpenTransientFile(path, flags); > + PrepareTempTablespaces(); > + SharedFileSetInit(fileset, NULL); > > Why are we calling PrepareTempTablespaces here? It is already called > in SharedFileSetInit. My bad, First I tired using SharedFileSetInit but later it got changed for forgot to remove this part. > 8. 
> + /* > + * Start a transaction on stream start, this transaction will be committed > + * on the stream stop. We need the transaction for handling the buffile, > + * used for serializing the streaming data and subxact info. > + */ > + ensure_transaction(); > > I think we need this for PrepareTempTablespaces to set the > temptablespaces. Also, isn't it required for a cleanup of buffile > resources at the transaction end? Are there any other reasons for it > as well? The comment should be a bit more clear for why we need a > transaction here. I am not sure that will it make sense to add a comment here that why buffile and sharedfileset need a transaction? Do you think that we should add comment in buffile/shared fileset API that it should be called under a transaction? > 9. > * Open a file for streamed changes from a toplevel transaction identified > * by stream_xid (global variable). If it's the first chunk of streamed > * changes for this transaction, perform cleanup by removing existing > * files after a possible previous crash. > .. > stream_open_file(Oid subid, TransactionId xid, bool first_segment) > > The above part comment atop stream_open_file needs to be changed after > new implementation. Done > 10. > * enabled. This context is reeset on each stream stop. > */ > LogicalStreamingContext = AllocSetContextCreate(ApplyContext, > > /reeset/reset Done
Testing big endian code with Travis CI's new s390x support
Hi,

This is something I've wanted several times in the past, so I thought others here could be interested: if you're looking for a way to run your development branch through check-world on a big endian box, the new s390x support[1] on Travis is good for that. Capacity is a bit limited, so I don't think I'll point cfbot.cputube.org at it just yet (maybe I need to invent a separate slow build cycle).

I tried it just now and found that cfbot's .travis.yml file[2] just needed this at the top:

arch:
  - s390x

... and then it needed these lines commented out:

#before_install:
#  - echo '/tmp/%e-%s-%p.core' | sudo tee /proc/sys/kernel/core_pattern

I didn't look into why, but otherwise that fails with a permission error on that environment, so it'd be nice to figure out what's up with that so we can still get back traces from cores.

[1] https://blog.travis-ci.com/2019-11-12-multi-cpu-architecture-ibm-power-ibm-z
[2] https://github.com/macdice/cfbot/blob/master/travis/.travis.yml
pg_resetwal --next-transaction-id may cause database failed to restart.
Hello hackers,

When I try to use the pg_resetwal tool to skip some transaction IDs, I run into a problem: the tool accepts any transaction ID given with the '-x' option, but the database may then fail to restart because it cannot read a file under $PGDATA/pg_xact. For example, if the 'NextXID' in a database is 1000 and you pass '-x 32769', the database fails to restart.

The pg_resetwal documentation tells me to write a 'safe value', but I think pg_resetwal should report the problem and refuse to perform the WAL reset when given an unsafe value, rather than letting it surface only when the user restarts the database.

I wrote an initial patch to limit the input. It now accepts a transaction ID in two cases:
1. The transaction ID is on the same CLOG page as the 'NextXID' in pg_control.
2. The transaction ID is right at the end of a CLOG page.

Limiting the input this way ensures the database can restart successfully. The same situation exists for the multixact and multixact-offset options, and I made the same change for them in the patch.

Do you think this is an issue?

Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca

pg_resetwal_transaction_limit.patch Description: Binary data
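For readers wondering why 32769 in particular breaks the restart: each pg_xact (CLOG) page stores two status bits per transaction, i.e. 4 transactions per byte, so a page holds BLCKSZ * 4 statuses. The arithmetic below is only an illustration and assumes the default BLCKSZ of 8192:

-- 4 two-bit transaction statuses per byte => 8192 * 4 = 32768 XIDs per page,
-- so XID 1000 and XID 32769 land on different (and possibly missing) pages.
SELECT 1000  / (8192 * 4) AS clog_page_of_nextxid,   -- page 0
       32769 / (8192 * 4) AS clog_page_of_new_xid;   -- page 1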