Re: Problem while setting the fpw with SIGHUP

2018-09-18 Thread Kyotaro HORIGUCHI
At Fri, 14 Sep 2018 16:30:37 +0530, Amit Kapila  wrote 
in 
> On Fri, Sep 14, 2018 at 12:57 PM Michael Paquier  wrote:
> >
> > On Thu, Sep 06, 2018 at 04:37:28PM -0700, Michael Paquier wrote:
> > > /*
> > >  * Properly accept or ignore signals the postmaster might send us.
> > >  */
> > > -   pqsignal(SIGHUP, StartupProcSigHupHandler); /* reload config file 
> > > */
> > > +   pqsignal(SIGHUP, SIG_IGN);  /* ignore reload config */
> > >
> > > I am finally coming back to this patch set, and that's one of the first
> > > things I am going to help moving on for this CF.  And this bit from the
> > > last patch series is not acceptable as there are some parameters which
> > > are used by the startup process which can be reloaded.  One of them is
> > > wal_retrieve_retry_interval for tuning when fetching WAL at recovery.
> >
> > So, I have been working on this problem again and I have reviewed the
> > thread, and there have been many things discussed in the last couple of
> > months:
> > 1) We do not want to initialize XLogInsert stuff unconditionally for all
> > processes at the moment recovery begins, but we just want to initialize
> > it once WAL write is open for business.
> > 2) Both the checkpointer and the startup process can call
> > UpdateFullPageWrites() which can cause Insert->fullPageWrites to get
> > incorrect values.
> 
> Can you share the steps to reproduce this problem?

The window for the issue is small.

It happens if the checkpointer first sees that SharedRecoveryInProgress
is turned off at the beginning of the CheckpointerMain loop.
The window needs to be widened to make sure the issue happens.

Applying the first patch attached, the following steps will cause
the issue with high reliability.

1. initdb  (full_page_writes = on by default)

2. put a recovery.conf in place so that the server can accept a promote signal

3. touch /tmp/hoge
   change full_page_writes to off in postgresql.conf

4. pg_ctl promote

The attached second file is a script that performs the above steps.
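A minimal sketch of such a script (the paths, the recovery.conf contents
and the sed line are my assumptions; the real attached script may differ):

initdb -D data                       # full_page_writes = on by default
cat > data/recovery.conf <<EOF
standby_mode = 'on'
EOF
pg_ctl -D data -w start              # server starts and stays in recovery
touch /tmp/hoge                      # hook the first patch uses to widen the window
sed -i 's/^#*full_page_writes.*/full_page_writes = off/' data/postgresql.conf
pg_ctl -D data promote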
The server writes the following log message and dies.

| 2018-09-18 13:55:49.928 JST [16405] LOG:  database system is ready to accept connections
| TRAP: FailedAssertion("!(CritSectionCount == 0)", File: "mcxt.c", Line: 731)
| 2018-09-18 13:55:50.546 JST [16405] LOG:  checkpointer process (PID 16407) was terminated by signal 6: Aborted

We can preallocate the XLogInsert buffer just to prevent the
crash. This is done by calling RecoveryInProgress() before
UpdateFullPageWrites() finds it turned to false. That fixes this
isolated problem, but there is another issue: the FPW_CHANGE record
can be missing, or the FPW state can be wrong, after recovery ends.

It comes from the fact that the responsibility to update the flag is
not atomically passed from startup to checkpointer. (The window is
after the UpdateFullPageWrites() call and until
SharedRecoveryInProgress is set to false.)

My latest patch tries to remove the window by imposing all
responsibility for applying config file changes to the shared FPW
flag on the checkpointer. RecoveryInProgress() is changed to be
called prior to UpdateFullPageWrites() as part of that.
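Roughly, the ordering the patch aims for on the checkpointer side looks
like this (a sketch only, not the actual patch text):

	/*
	 * RecoveryInProgress() also (re)initializes the local XLogInsert
	 * machinery on the transition out of recovery, so the subsequent
	 * UpdateFullPageWrites() cannot palloc() inside its critical section.
	 */
	if (!RecoveryInProgress())
		UpdateFullPageWrites();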


> > 3) We do not want a palloc() in a critical section because of
> > RecoveryinProgress being called.
> >
> > And the root issue here is 2), because the checkpointer tries to update
> > Insert->fullPageWrites but it does not need to do so until recovery has
> > been finished.  So in order to fix the original issue I am proposing a
> > simple fix: let's make sure that the checkpointer does not update
> > Insert->fullPageWrites until recovery finishes, and let's have the
> > startup process do the first update once it finishes recovery and
> > inserts by itself the XLOG_PARAMETER_CHANGE.  This way the order of
> > events is purely sequential and we don't need to worry about having the
> > checkpointer and the startup process eat on each other's plate because
> > the checkpointer would only try to work on updating the shared memory
> > value of full_page_writes once SharedRecoveryInProgress is switched to
> > true, and that happens after the startup process does its initial call
> > to UpdateFullPageWrites().  I have improved as well all the comments
> > around to make clear the behavior wanted.
> >
> > Thoughts?

InRecovery is turned off after the last UpdateFullPageWrites() and
before SharedRecoveryInProgress is turned off. So it still leaves
the window. Thus, the checkpointer stops flipping the value before
SharedRecoveryInProgress is turned off. (I don't understand the
InRecovery condition..) However, this lets the checkpointer omit
reloading after the startup process's UpdateFullPageWrites() until
SharedRecoveryInProgress is turned off.

>  UpdateFullPageWrites(void)
>  {
>   XLogCtlInsert *Insert = &XLogCtl->Insert;
> + /*
> + * Check if recovery is still in progress before entering this critical
> + * section, as some memory allocation could happen at the end of
> + * recovery.  There is nothing to do for a system still in recovery.
> + * Note that we need to process things here at 

Re: Difference in TO_TIMESTAMP results.

2018-09-18 Thread Alexander Korotkov
On Tue, Sep 18, 2018 at 8:44 AM Prabhat Sahu
 wrote:
>
> I have found below difference in TO_TIMESTAMP results.
>
> postgres[114552]=# select to_timestamp('15-07-1984 23:30:32','dd-mm-yyyy
> hh24:mi:ss');
>to_timestamp
> ---
>  1984-07-15 23:30:32+05:30
> (1 row)
>
> postgres[114552]=# select to_timestamp('15-07-84 23:30:32','dd-mm-yyyy
> hh24:mi:ss');
>  to_timestamp
> --
>  0084-07-15 23:30:32+05:53:28
> (1 row)
>
> My doubt is the ":28" in the timezone part of the 2nd result, is it expected?

Hmm, that looks strange.  My first idea was that it's related to
cf984672.  But I found that the same behavior existed before cf984672.
I'm going to investigate more on that.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Removing LEFT JOINs in more cases

2018-09-18 Thread Antonin Houska
David Rowley  wrote:

> I've attached an updated version of this patch.

Following are the findings from my review.

On the LATERAL references:

This query (proposed upthread by Tom and adjusted by me so it can be executed
on your test tables)

select distinct t1.id from t1 left join t2 on t1.id=t2.b, lateral 
nextval('foo');

is subject to join removal because in rel_distinctified_after_join() you check
brel->lateral_vars, which is NIL in this case.

On the other hand I'm not sure that the user should expect any particular
number of executions of the function in this case because the actual join
order can change: it can happen that the function scan is first joined to t1
and t2 is joined to the result. The case that requires more caution is that
the function references both t1 and t2, i.e. *all* tables of the left join
from which you want to remove the inner relation. Maybe that's the reason you
introduced the test for brel->lateral_vars, but you forgot to check the actual
variables in the list.


On the volatile function in the targetlist:

A volatile function in the DISTINCT ON expression isn't necessarily noticed by
your checks:

select distinct on (nextval('foo')) t1.id from t1 left join t2 on t1.id=t2.b;

Perhaps you should simply check the whole targetlist if there's no GROUP BY
clause.


On coding:

  * join_is_removable(): Just like rel_distinctified_after_join() is called
 before rel_is_distinct_for() - obviously because it's cheaper - I suggest
 calling query_distinctifies_rows() before rel_supports_distinctness().

  * header comment of query_distinctifies_rows() mentions rel_tuples_needed,
but I can't find such a function.

  * rel_distinctified_after_join() does not really use the "rel"
argument. Besides removing it you probably should rename the function.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at



RE: Changing the setting of wal_sender_timeout per standby

2018-09-18 Thread Tsunakawa, Takayuki
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
> At Fri, 14 Sep 2018 18:22:33 +0900, Masahiko Sawada 
> wrote in
> 
> > On Thu, Sep 13, 2018 at 12:32 PM, Michael Paquier 
> wrote:
> > > On Thu, Sep 13, 2018 at 01:14:12AM +, Tsunakawa, Takayuki wrote:
> > >> Some customer wants to change the setting per standby, i.e., a shorter
> > >> timeout for a standby in the same region to enable faster detection of
> > >> failure and faster failover, and a longer timeout for a standby in the remote
> > >> region (for disaster recovery) to avoid mis-judging its health.
> > >
> > > This argument is sensible.
> > >
> > >> The current PGC_HUP allows to change the setting by editing
> > >> postgresql.conf or ALTER SYSTEM and then sending SIGHUP to a specific
> > >> walsender.  But that's not easy to use.  The user has to do it upon
> > >> every switchover and failover.
> > >>
> > >> With PGC_BACKEND, the user would be able to tune the timeout as follows:
> > >>
> > >> [recovery.conf]
> > >> primary_conninfo = '... options=''-c wal_sender_timeout=6'' ...'
> > >>
> > >> With PGC_USERSET, the user would be able to use different user
> > >> accounts for each standby, and tune the setting as follows:
> > >>
> > >> ALTER USER repluser_remote SET wal_sender_timeout = 6;
> > >
> > > It seems to me that switching to PGC_BACKEND would already cover all
> the
> > > use-cases you are mentioning, as at the end one would just want to
> > > adjust the WAL sender timeout on a connection basis depending on the
> > > geographical location of the receiver and the latency between primary
> > > and standby.
> >
> > +1 for PGC_BACKEND. It looks enough for most use cases.
> 
> +1, and we need a means to see the actual value, in
> pg_stat_replication?

Thanks, the patch is attached.  I'll add this to the next CF shortly.  As Michael
said, I think viewing the configured value would be a separate feature.


Regards
Takayuki Tsunakawa




walsender_timeout_PGC_BACKEND.patch
Description: walsender_timeout_PGC_BACKEND.patch


Re: Difference in TO_TIMESTAMP results.

2018-09-18 Thread Andrew Gierth
> "Prabhat" == Prabhat Sahu  writes:

 Prabhat> postgres[114552]=# select to_timestamp('15-07-84 23:30:32','dd-mm-yyyy
 Prabhat> hh24:mi:ss');
 Prabhat>  to_timestamp
 Prabhat> --
 Prabhat>  0084-07-15 23:30:32+05:53:28
 Prabhat> (1 row)

 Prabhat> My doubt is the ":28" in the timezone part in the 2nd result, is it
 Prabhat> expected?

Yes, it's expected.

The timezone db as a general rule gives the actual local mean solar time
offset (down to seconds) at the reference location (Kolkata in this
case) as the UTC offset for times before the earliest historical data
known to the tzdb maintainers.
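For example (a sketch; any timestamp predating the zone's recorded history
behaves the same way):

SET timezone = 'Asia/Kolkata';
SELECT timestamptz '1850-01-01 00:00:00';
--  1850-01-01 00:00:00+05:53:28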

-- 
Andrew (irc:RhodiumToad)



Re: [PATCH] Fix for infinite signal loop in parallel scan

2018-09-18 Thread Oleksii Kliukin



> On 18. Sep 2018, at 03:18, Thomas Munro  wrote:
> 
> On Tue, Sep 18, 2018 at 1:15 AM Chris Travers  
> wrote:
>> On Mon, Sep 17, 2018 at 2:59 PM Oleksii Kliukin  wrote:
>>> With the patch applied, the posix_fallocate loop terminated right away 
>>> (because
>>> of QueryCancelPending flag set to true) and the backend went through the
>>> cleanup, showing an ERROR of cancelling due to the conflict with recovery.
>>> Without the patch, it looped indefinitely in dsm_impl_posix_resize,
>>> while the startup process was looping forever, trying to send SIGUSR1.
> 
> Thanks for testing!
> 
>>> One thing I’m wondering is whether we could do the same by just blocking 
>>> SIGUSR1
>>> for the duration of posix_fallocate?
>> 
>> If we were to do that, I would say we should mask all signals we can mask 
>> during the call.
>> 
>> I don't have a problem going down that road instead if people think it is 
>> better.
> 
> We discussed that when adding posix_fallocate() and decided that
> retrying is better:
> 
> https://www.postgresql.org/message-id/20170628230458.n5ehizmvhoerr5yq%40alap3.anarazel.de
> 
> Here is a patch that I propose to commit and back-patch to 9.4.  I
> just wrote a suitable commit message, edited the comments lightly and
> fixed some whitespace.

Thanks!

Apart from the fact that the reviewer's name is “Murat Kabilov” and
not “Murak Kabilov” the back-patch looks good to me.

Cheers,
Oleksii Kliukin


Re: Online verification of checksums

2018-09-18 Thread Michael Banck
Hi,

Am Montag, den 17.09.2018, 14:09 -0400 schrieb Stephen Frost:
> > 5. There seems to be no consensus on whether the number of skipped pages
> > should be summarized at the end.
> 
> I agree with printing the number of skipped pages, that does seem like
> a nice to have.  I don’t know that actually printing the pages
> themselves is all that useful though. 

Oh ok - I never intended to print out the block numbers themselves, just
the final number of skipped blocks in the summary. So I guess that's
fine and I will add that in my branch.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz



Code of Conduct

2018-09-18 Thread Dave Page
The PostgreSQL Core team are pleased to announce that following a long
consultation process, the project’s Code of Conduct (CoC) has now been
finalised and published at https://www.postgresql.org/about/policies/coc/.

Please take time to read and understand the CoC, which is intended to
ensure that PostgreSQL remains an open and enjoyable project for anyone to
join and participate in.

A Code of Conduct Committee has been formed to handle any complaints. This
consists of the following volunteers:

- Stacey Haysler (Chair)
- Lætitia Avrot
- Vik Fearing
- Jonathan Katz
- Ilya Kosmodemiansky

We would like to extend our thanks and gratitude to Stacey Haysler for her
patience and expertise in helping develop the Code of Conduct, forming the
committee and guiding the work to completion.

-- 
Dave Page
PostgreSQL Core Team
http://www.postgresql.org/



Re: Online verification of checksums

2018-09-18 Thread Michael Banck
Hi.

Am Montag, den 17.09.2018, 20:45 -0400 schrieb Stephen Frost:
> > You're right it's not about the fsync, sorry for the confusion. My point
> > is that using the checkpoint LSN gives us a guarantee that write is no
> > longer in progress, and so we can't see a page torn because of it. And
> > if we see a partial write due to a new write, it's guaranteed to update
> > the page LSN (and we'll notice it).
> 
> Right, no worries about the confusion, I hadn't been fully thinking
> through the LSN bit either and that what we really need is some external
> confirmation of a write having *completed* (not just started) and that
> makes a definite difference.
> 
> > > Right, I'm in agreement with doing that and it's what is done in
> > > pgbasebackup and pgBackRest.
> > 
> > OK. All I'm saying is pg_verify_checksums should probably do the same
> > thing, i.e. grab checkpoint LSN and roll with that.
> 
> Agreed.

I've attached the patch I added to my branch to swap out the pg_sleep()
with a check against the checkpoint LSN on a recheck verification
failure.

Let me know if there are still issues with it. I'll send a new patch for
the whole online verification feature in a bit.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz

commit 3b7d80fa7431098de4aab19a85f7d7b01c43f8a8
Author: Michael Banck 
Date:   Tue Sep 18 11:53:36 2018 +0200

Skip pages with new LSNs on recheck verification failure.

Before, we only skipped new pages and added a pg_sleep during the recheck.
This logic is broken, so replace it with a check against the (startup)
checkpoint LSN on a verification failure on recheck. If the LSN in the page
header is newer than the checkpoint LSN, we assume a torn page and skip it in
this run of pg_verify_checksums.

This now follows the logic of base backup checksum verification; however,
contrary to that, we do not skip a page immediately if the LSN is newer than
the checkpoint, but try to (re-)verify it first and only skip it if its LSN is
newer on a recheck verification failure.

diff --git a/src/bin/pg_verify_checksums/pg_verify_checksums.c b/src/bin/pg_verify_checksums/pg_verify_checksums.c
index 7c2a06390c..2f8b5bf167 100644
--- a/src/bin/pg_verify_checksums/pg_verify_checksums.c
+++ b/src/bin/pg_verify_checksums/pg_verify_checksums.c
@@ -26,6 +26,7 @@ static int64 files = 0;
 static int64 blocks = 0;
 static int64 badblocks = 0;
 static ControlFileData *ControlFile;
+static XLogRecPtr checkpointLSN;
 
 static char *only_relfilenode = NULL;
 static bool verbose = false;
@@ -125,14 +126,14 @@ scan_file(const char *fn, BlockNumber segmentno)
 			 * possible that we read the first 4K page of
 			 * the block just before postgres updated the
 			 * entire block so it ends up looking torn to
-			 * us. If there were less than 10 bad blocks so
-			 * far, we wait for 500ms before we retry.
+			 * us.  We only need to retry once because the
+			 * LSN should be updated to something we can
+			 * ignore on the next pass.  If the error
+			 * happens again then it is a true validation
+			 * failure.
 			 */
 			if (block_retry == false)
 			{
-				if (badblocks < 10)
-					pg_usleep(500);
-
 				/* Seek to the beginning of the failed block */
 				if (lseek(f, -BLCKSZ, SEEK_CUR) == -1)
 				{
@@ -150,6 +151,16 @@ scan_file(const char *fn, BlockNumber segmentno)
 				continue;
 			}
 
+			/*
+			 * The checksum verification failed on retry as well.
+			 * Check if the page has been modified since the
+			 * checkpoint and skip it in this case.
+			 */
+			if (PageGetLSN(buf.data) > checkpointLSN)
+			{
+				block_retry = false;
+				continue;
+			}
+
 			if (ControlFile->data_checksum_version == PG_DATA_CHECKSUM_VERSION)
 				fprintf(stderr, _("%s: checksum verification failed in file \"%s\", block %u: calculated checksum %X but block contains %X\n"),
 						progname, fn, blockno, csum, header->pd_checksum);
@@ -344,6 +355,9 @@ main(int argc, char *argv[])
 		exit(1);
 	}
 
+	/* Get checkpoint LSN */
+	checkpointLSN = ControlFile->checkPoint;
+
 	/* Scan all files */
 	scan_directory(DataDir, "global");
 	scan_directory(DataDir, "base");


Is it possible for postgres_fdw to push down queries on co-located tables?

2018-09-18 Thread Jinhua Luo
That is, if table `foo` and table `bar` are both tables on the same
remote server, then when I do `select * from foo, bar`, can it
delegate the whole query to the remote side, rather than fetching rows
from both tables one by one and doing the merge on the local side?
For example:

```
foo=> explain select * from foreign_test2, foreign_test where m = id;
 QUERY PLAN
-
 Merge Join  (cost=444.06..590.63 rows=9316 width=72)
   Merge Cond: (foreign_test2.m = foreign_test.id)
   ->  Sort  (cost=222.03..225.44 rows=1365 width=36)
         Sort Key: foreign_test2.m
         ->  Foreign Scan on foreign_test2  (cost=100.00..150.95 rows=1365 width=36)
   ->  Sort  (cost=222.03..225.44 rows=1365 width=36)
         Sort Key: foreign_test.id
         ->  Foreign Scan on foreign_test  (cost=100.00..150.95 rows=1365 width=36)
```

If the planning involves remote tables, could the planner push down
the whole query once to gain the best efficiency?



What is OLD in INSTEAD OF trigger?

2018-09-18 Thread Jinhua Luo
If the view definition is too complex to be automatically updatable, then
how does PostgreSQL define OLD in an INSTEAD OF trigger? It cannot bind to
any column of any underlying table directly then, right?

Unless PostgreSQL materializes the view as a row source before executing the
trigger? Then it could filter (via the WHERE clause of the UPDATE/DELETE) the
rows to be updated, and use those rows as OLD?
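For example, in a setup like this (names are mine):

CREATE TABLE t (id int PRIMARY KEY, name text);
CREATE VIEW v AS SELECT id, upper(name) AS name FROM t;

CREATE FUNCTION v_update() RETURNS trigger AS $$
BEGIN
    -- Is OLD here the view row as computed by the view's query,
    -- rather than a binding to any column of t?
    UPDATE t SET name = NEW.name WHERE id = OLD.id;
    RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER v_update_trg INSTEAD OF UPDATE ON v
    FOR EACH ROW EXECUTE PROCEDURE v_update();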



Re: [HACKERS] Bug in to_timestamp().

2018-09-18 Thread Prabhat Sahu
Hi All,

A few more findings from to_timestamp() tests with HEAD.

postgres[3493]=# select to_timestamp('15-07-1984 23:30:32',' dd- mm- yyyy
hh24: mi: ss');
   to_timestamp
---
 1984-07-15 23:30:32+05:30
(1 row)

postgres[3493]=# select to_timestamp('15-07-1984 23:30:32','9dd-9mm-9yyyy
9hh24:9mi:9ss');
 to_timestamp
--
 0084-07-05 03:00:02+05:53:28
(1 row)

If there are spaces before any format pattern then the output is fine (1st
output), but if we have a digit instead of the spaces then we get wrong output.


On Mon, Sep 10, 2018 at 12:23 AM Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> On Thu, Sep 6, 2018 at 3:58 PM Alexander Korotkov <
> a.korot...@postgrespro.ru> wrote:
>
>> On Thu, Sep 6, 2018 at 2:40 PM Alexander Korotkov
>>  wrote:
>> >
>> > On Wed, Sep 5, 2018 at 7:28 PM amul sul  wrote:
>> > > On Wed, Sep 5, 2018, 6:35 PM Alexander Korotkov <
>> a.korot...@postgrespro.ru> wrote:
>> > >> On Wed, Sep 5, 2018 at 3:10 PM amul sul  wrote:
>> > >> > On Wed, Sep 5, 2018 at 3:05 PM Alexander Korotkov
>> > >> >  wrote:
>> > >> > > On Wed, Sep 5, 2018 at 1:22 AM David G. Johnston
>> > >> > >  wrote:
>> > >> > > > From those results the question is how important is it to
>> force the following breakage on our users (i.e., introduce FX exact symbol
>> matching):
>> > >> > > >
>> > >> > > > SELECT to_timestamp('97/Feb/16', 'FXYY:Mon:DD');
>> > >> > > > - to_timestamp
>> > >> > > > ---
>> > >> > > > - Sun Feb 16 00:00:00 1997 PST
>> > >> > > > -(1 row)
>> > >> > > > -
>> > >> > > > +ERROR:  unexpected character "/", expected character ":"
>> > >> > > > +HINT:  In FX mode, punctuation in the input string must
>> exactly match the format string.
>> > >> > > >
>> > >> > > > There seemed to be some implicit approvals of this breakage
>> some 30 emails and 10 months ago but given that this is the only change
>> from a correct result to a failure I'd like to officially put it out there
>> for opinion/vote gathering.   Mine is a -1; though keeping the distinction
>> between space and non-alphanumeric characters is expected.
>> > >> > >
>> > >> > > Do I understand correctly that you're -1 to changes to FX mode,
>> but no
>> > >> > > objection to changes in non-FX mode?
>> > >> > >
>> > >> > Ditto.
>> > >>
>> > >> So, if no objections for non-FX mode changes, then I'll extract that
>> > >> part and commit it separately.
>> > >
>> > >
>> > > Yeah, that makes sense to me, thank you.
>> >
>> > OK!  I've removed FX changes from the patch.  The result is attached.
>> > I'm going to commit this if no objections.
>>
>> Attached revision fixes usage of two subsequent spaces in the
>> documentation.
>>
>
> So, pushed!  Thanks to every thread participant for review and feedback.
>
> --
> Alexander Korotkov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>


-- 


With Regards,

Prabhat Kumar Sahu
Skype ID: prabhat.sahu1984
EnterpriseDB Corporation

The Postgres Database Company


EXPLAIN stored procedures

2018-09-18 Thread Jinhua Luo
Normally, EXPLAIN does not include the commands run by stored code,
e.g. aggregate functions or triggers, correct?

So how can one review the plans of those executions?



Re: Changing the setting of wal_sender_timeout per standby

2018-09-18 Thread Masahiko Sawada
On Tue, Sep 18, 2018 at 5:27 PM, Tsunakawa, Takayuki
 wrote:
> From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
>> At Fri, 14 Sep 2018 18:22:33 +0900, Masahiko Sawada 
>> wrote in
>> 
>> > On Thu, Sep 13, 2018 at 12:32 PM, Michael Paquier 
>> wrote:
>> > > On Thu, Sep 13, 2018 at 01:14:12AM +, Tsunakawa, Takayuki wrote:
>> > >> Some customer wants to change the setting per standby, i.e., a shorter
>> > >> timeout for a standby in the same region to enable faster detection of
>> > >> failure and faster failover, and a longer timeout for a standby in the remote
>> > >> region (for disaster recovery) to avoid mis-judging its health.
>> > >
>> > > This argument is sensible.
>> > >
>> > >> The current PGC_HUP allows to change the setting by editing
>> > >> postgresql.conf or ALTER SYSTEM and then sending SIGHUP to a specific
>> > >> walsender.  But that's not easy to use.  The user has to do it upon
>> > >> every switchover and failover.
>> > >>
>> > >> With PGC_BACKEND, the user would be able to tune the timeout as follows:
>> > >>
>> > >> [recovery.conf]
>> > >> primary_conninfo = '... options=''-c wal_sender_timeout=6'' ...'
>> > >>
>> > >> With PGC_USERSET, the user would be able to use different user
>> > >> accounts for each standby, and tune the setting as follows:
>> > >>
>> > >> ALTER USER repluser_remote SET wal_sender_timeout = 6;
>> > >
>> > > It seems to me that switching to PGC_BACKEND would already cover all
>> the
>> > > use-cases you are mentioning, as at the end one would just want to
>> > > adjust the WAL sender timeout on a connection basis depending on the
>> > > geographical location of the receiver and the latency between primary
>> > > and standby.
>> >
>> > +1 for PGC_BACKEND. It looks enough for most use cases.
>>
>> +1, and we need a means to see the actual value, in
>> pg_stat_replication?
>
> Thanks, the patch is attached.  I'll add this to the next CF shortly.  As
> Michael said, I think viewing the configured value would be a separate
> feature.
>

Thank you for the patch.

+   
+
+ The %q escape is useful when including information that is
+ You can also set this in recovery.conf as follows.  This allows you to set a
+ longer timeout for a standby in the remote data center across the slow WAN.
+
+primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass options=''-c wal_sender_timeout=5000'''
+
+
+   

I didn't follow the first sentence of the above hunk. Is
wal_sender_timeout related to %q?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Code of Conduct

2018-09-18 Thread James Keener
> following a long consultation process

It's not a consultation if any dissenting voice is simply ignored. Don't
sugar-coat or politicize it like this -- it was rammed down everyone's throats.
That is core's right, but don't act as if everyone's opinions and concerns were
taken into consideration. There are a good number of folks who are concerned
that this CoC is overreaching and is ripe for abuse. Those concerns were always
simply, plainly, and purposely ignored.

> Please take time to read and understand the CoC, which is intended to ensure 
> that PostgreSQL remains an open and enjoyable project for anyone to join and 
> participate in.

I sincerely hope so, and that it doesn't become a tool to enforce social 
ideology like in other groups I've been part of.  Especially since this is the 
main place to come to get help for PostgreSQL and not a social club.

Jim

On September 18, 2018 6:27:56 AM EDT, Dave Page  wrote:
>The PostgreSQL Core team are pleased to announce that following a long
>consultation process, the project’s Code of Conduct (CoC) has now been
>finalised and published at
>https://www.postgresql.org/about/policies/coc/.
>
>Please take time to read and understand the CoC, which is intended to
>ensure that PostgreSQL remains an open and enjoyable project for anyone
>to
>join and participate in.
>
>A Code of Conduct Committee has been formed to handle any complaints.
>This
>consists of the following volunteers:
>
>- Stacey Haysler (Chair)
>- Lætitia Avrot
>- Vik Fearing
>- Jonathan Katz
>- Ilya Kosmodemiansky
>
>We would like to extend our thanks and gratitude to Stacey Haysler for
>her
>patience and expertise in helping develop the Code of Conduct, forming
>the
>committee and guiding the work to completion.
>
>-- 
>Dave Page
>PostgreSQL Core Team
>http://www.postgresql.org/
>

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Re: [PATCH] Fix for infinite signal loop in parallel scan

2018-09-18 Thread Thomas Munro
On Tue, Sep 18, 2018 at 9:25 PM Oleksii Kliukin  wrote:
> > On 18. Sep 2018, at 03:18, Thomas Munro  
> > wrote:
> > Here is a patch that I propose to commit and back-patch to 9.4.  I
> > just wrote a suitable commit message, edited the comments lightly and
> > fixed some whitespace.
>
> Thanks!
>
> Apart from the fact that the reviewer's name is “Murat Kabilov” and
> not “Murak Kabilov” the back-patch looks good to me.

Oops, fixed.  Pushed.  Thanks all for the report, patch and reviews.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: EXPLAIN stored procedures

2018-09-18 Thread Andrew Gierth
> "Jinhua" == Jinhua Luo  writes:

 Jinhua> Normally, EXPLAIN does not include the commands run by stored
 Jinhua> code, e.g. aggregate functions or triggers, correct?

 Jinhua> So how can one review the plans of those executions?

auto_explain with its log_nested_statements option.
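For example, a minimal postgresql.conf sketch (values are illustrative):

shared_preload_libraries = 'auto_explain'  # or LOAD 'auto_explain' in a session
auto_explain.log_min_duration = 0          # log plans for all statements
auto_explain.log_nested_statements = on    # include statements inside functions/triggers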

-- 
Andrew (irc:RhodiumToad)



Re: Problem while updating a foreign table pointing to a partitioned table on foreign server

2018-09-18 Thread Kyotaro HORIGUCHI
Hello.

At Fri, 14 Sep 2018 22:01:39 +0900, Etsuro Fujita  
wrote in <5b9bb133.1060...@lab.ntt.co.jp>
> (2018/08/24 16:58), Kyotaro HORIGUCHI wrote:
> > At Tue, 21 Aug 2018 11:01:32 +0900 (Tokyo Standard Time), Kyotaro
> > HORIGUCHI wrote
> > in<20180821.110132.261184472.horiguchi.kyot...@lab.ntt.co.jp>
> >>> You wrote:
>  Several places seem to be assuming that fdw_scan_tlist may be
>  used for foreign scans on simple relations but I didn't find that
>  actually happening.
> >>>
> >>> Yeah, currently, postgres_fdw and file_fdw don't use that list for
> >>> simple foreign table scans, but it could be used to improve the
> >>> efficiency for those scans, as explained in fdwhandler.sgml:
> > ...
> >> I'll put more consideration on using fdw_scan_tlist in the
> >> documented way.
> >
> > Done. postgres_fdw now generates a full fdw_scan_tlist (as
> > documented) for foreign relations with junk columns, with a
> > small change on the core side. However it is far less invasive than
> > the previous version and I believe that it doesn't harm
> > maybe-existing use of fdw_scan_tlist on non-join rels (that is,
> > the case of a subset of relation columns).
> 
> Yeah, the changes to the core by the new version are really small, which is
> great, but I'm not sure it's a good idea to modify the catalog info on
> the target table on the fly:
> 
> @@ -126,8 +173,18 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId,
>  bool inhparent,
> (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>  errmsg("cannot access temporary or unlogged relations during recovery")));
> 
> +   max_attrnum = RelationGetNumberOfAttributes(relation);
> +
> + /* Foreign table may have expanded this relation with junk columns */
> + if (root->simple_rte_array[varno]->relkind == RELKIND_FOREIGN_TABLE)
> +   {
> +     AttrNumber maxattno = max_varattno(root->parse->targetList, varno);
> +
> +     if (max_attrnum < maxattno)
> +       max_attrnum = maxattno;
> +   }
> +
> rel->min_attr = FirstLowInvalidHeapAttributeNumber + 1;
> -   rel->max_attr = RelationGetNumberOfAttributes(relation);
> +   rel->max_attr = max_attrnum;
> rel->reltablespace = RelationGetForm(relation)->reltablespace;
> 
> This breaks the fundamental assumption that rel->max_attr is equal to
> RelationGetNumberOfAttributes of that table.  My concern is: this
> change would probably be a wart, so it would be bug-prone in future
> versions.

Hmm. I believe that once a RelOptInfo is created, all attributes
defined in it can be safely accessed. Is that a wrong assumption?
Actually, RelationGetNumberOfAttributes is used in a few distinct
places while planning.

expand_targetlist uses it to scan the source relation's nonjunk
attributes. get_rel_data_width uses it to scan the widths of
attributes in the statistics. It fails to add the junk columns'
widths, but that doesn't do much harm. build_physical_tlist is not
used for foreign relations. build_path_tlist creates a tlist without
proper resjunk flags, but create_modifytable_plan immediately
fixes that.

If we don't accept the expanded tupdesc for base relations, the
other way I can find is transforming the foreign relation into
something else, like a subquery, or allowing expansion of the
attribute list of a base relation...

> Another thing on the new version:
> 
> @@ -1575,6 +1632,19 @@ build_physical_tlist(PlannerInfo *root,
> RelOptInfo *rel)
> relation = heap_open(rte->relid, NoLock);
> 
> numattrs = RelationGetNumberOfAttributes(relation);
> +
> +   /*
> + * Foreign tables may have been expanded with some junk columns. Punt
> + * in that case.
...
> I think this would disable the optimization on projection in foreign
> scans, causing performance regression.

Well, in update/delete cases, create_plan_recurse on a foreign scan
is called with CP_EXACT_TLIST in create_modifytable_plan, so the
code path is not actually used. Just replacing the if clause with an
Assert seems to change nothing. I'm not sure we will add junk
columns in other cases, but it's not likely.

> > One arguable behavior change is about whole-row vars. Currently it
> > refers to the local tuple with all columns, but it is explicitly
> > fetched as ROW() after this patch is applied. This could be fixed
> > but not just now.
> >
> > Part of 0004-:
> > -  Output: f1, ''::text, ctid, rem1.*
> > -  Remote SQL: SELECT f1, f2, ctid FROM public.loc1 FOR UPDATE
> > +  Output: f1, ''::text, tableoid, ctid, rem1.*
> > + Remote SQL: SELECT f1, tableoid, ctid, ROW(f1, f2) FROM public.loc1
> > FOR UPDATE
> 
> That would be also performance regression.  If we go in this
> direction, that should be fixed.

Agreed. Will consider it soon.

> > Since this uses fdw_scan_tlist so it is theoretically
> > back-patchable back to 9.6.
> 
> IIRC, the fdw_scan_tlist stuff was introduced in PG9.5 as part of join
> pushdown infrastructure, so I think your patch can be back-patched to
> PG9.5, but I don't think that's enough; IIRC, this issue 

Re: Add RESPECT/IGNORE NULLS and FROM FIRST/LAST options

2018-09-18 Thread Krasiyan Andreev
Hi,
Patch applies and compiles; all included tests and building of the docs
pass.
I have been using the previous version for more than two months in a
production environment with real data and I haven't found any bugs,
so I'm marking this patch as ready for committer in the commitfest app.

On Sat, 28 Jul 2018 at 22:00, David Fetter wrote:

> On Fri, Jul 13, 2018 at 01:52:00PM +0100, Oliver Ford wrote:
> > Adds the options RESPECT/IGNORE NULLS (null treatment clause) and FROM
> > FIRST/LAST to the non-aggregate window functions.
>
> Please find attached an updated version for OID drift.
>
> Best,
> David.
> --
> David Fetter  http://fetter.org/
> Phone: +1 415 235 3778
>
> Remember to vote!
> Consider donating to Postgres: http://www.postgresql.org/about/donate
>


Re: Unused argument from execute_sql_string()

2018-09-18 Thread Yugo Nagata
On Thu, 13 Sep 2018 17:03:28 +0900
Michael Paquier  wrote:

> On Thu, Sep 13, 2018 at 03:47:26PM +0900, Tatsuo Ishii wrote:
> > Anyway, considering it's a static function, chance of breaking
> > backward compatibility is minimum, I think.
> > 
> > So +1 to remove the unused argument.
> 
> Same opinion and arguments here, so I have committed the patch.

Thanks!

-- 
Yugo Nagata 



Re: Collation versioning

2018-09-18 Thread Stephen Frost
Greetings,

* Thomas Munro (thomas.mu...@enterprisedb.com) wrote:
> On Mon, Sep 17, 2018 at 9:02 AM Stephen Frost  wrote:
> > * Thomas Munro (thomas.mu...@enterprisedb.com) wrote:
> > > On Mon, Sep 17, 2018 at 6:13 AM Douglas Doole  wrote:
> > > > On Sun, Sep 16, 2018 at 1:20 AM Thomas Munro 
> > > >  wrote:
> > > >> 3.  Fix the tracking of when reindexes need to be rebuilt, so that you
> > > >> can't get it wrong (as you're alluding to above).
> > > >
> > > > I've mentioned this in the past, but didn't seem to get any traction, 
> > > > so I'll try it again ;-)
> > >
> > > Probably because we agree with you, but don't have all the answers :-)
> >
> > Agreed.
> >
> > > > The focus on indexes when a collation changes is, in my opinion, the 
> > > > least of the problems. You definitely have to worry about indexes, but 
> > > > they can be easily rebuilt. What about other places where collation is 
> > > > hardened into the system, such as constraints?
> > >
> > > We have to start somewhere and indexes are the first thing that people
> > > notice, and are much likely to actually be a problem (personally I've
> > > encountered many cases of index corruption due to collation changes in
> > > the wild, but never a constraint corruption, though I fully understand
> > > the theoretical concern).  Several of us have observed specifically
> > > that the same problems apply to CHECK constraints and PARTITION
> > > boundaries, and there may be other things like that.  You could
> > > imagine tracking collation dependencies on those, requiring a RECHECK
> > > or REPARTITION operation to update them after a depended-on collation
> > > version changes.
> > >
> > > Perhaps that suggests that there should be a more general way to store
> > > collation dependencies -- something more like pg_depend, rather than
> > > bolting something like indcollversion onto indexes and every other
> > > kind of catalog that might need it.  I don't know.
> >
> > Agreed.  If we start thinking about pg_depend then maybe we realize
> > that this all comes back to pg_attribute as the holder of the
> > column-level information and maybe what we should be thinking about is a
> > way to encode version information into the typmod for text-based
> > types...
> 
> So to be more concrete: pg_depend could have a new column
> "refobjversion".  Whenever indexes are created or rebuilt, we'd
> capture the current version string in the pg_depend rows that link
> index attributes and collations.  Then we'd compare those against the
> current value when we first open an index and complain if they don't
> match.  (In this model there would be no "collversion" column in the
> pg_collation catalog.)
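For concreteness, a sketch of how that might be queried (refobjversion is
hypothetical and does not exist today):

SELECT objid::regclass AS index, refobjid AS collation_oid, refobjversion
FROM pg_depend
WHERE refclassid = 'pg_collation'::regclass;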

I'm really not sure why you're pushing to have this in pg_depend..

> That'd leave a place for other kinds of database objects (CHECKs,
> PARTITIONS, ...) to store their version dependency, if someone later
> wants to add support for that.

Isn't what matters here where the data's stored, as in, in a column..?

All of those would already have dependencies on the column so that they
can be tracked back there.

> I'm not sure if my idea about updating the default collation row in
> newly created databases has legs though.  Any thoughts on that?

My initial reaction is that we should have a version included basically
everywhere and then let users decide how they want to change it.  For a
new cluster, I'd agree with using the latest available (while allowing
it to be chosen if a user wishes for something else) but I'm not sure
I'd go farther than that.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: [HACKERS] Bug in to_timestamp().

2018-09-18 Thread Alexander Korotkov
On Tue, Sep 18, 2018 at 2:08 PM Prabhat Sahu
 wrote:
>
> Few more findings on to_timestamp() test with HEAD.
>
> postgres[3493]=# select to_timestamp('15-07-1984 23:30:32',' dd- mm- yyyy
> hh24: mi: ss');
>to_timestamp
> ---
>  1984-07-15 23:30:32+05:30
> (1 row)
>
> postgres[3493]=# select to_timestamp('15-07-1984 23:30:32','9dd-9mm-9yyyy
> 9hh24:9mi:9ss');
>  to_timestamp
> --
>  0084-07-05 03:00:02+05:53:28
> (1 row)
>
> If there are spaces before any format pattern then the output is fine (1st
> output), but if we have a digit instead of the spaces then we get wrong output.

This behavior might look strange, but it wasn't introduced by
cf9846724.  to_timestamp() behaves so because a digit in the format
string gets NODE_TYPE_CHAR type.  And for NODE_TYPE_CHAR we just "eat" a
single character of the input string, regardless of what it is.

But I found a related issue in cf9846724.  Before, it was:

# select to_timestamp('2018 01 01', '9MM9DD');
  to_timestamp

 2018-01-01 00:00:00+03
(1 row)

But after that commit, it becomes:

# select to_timestamp('2018 01 01', '9MM9DD');
ERROR:  invalid value "1 " for "MM"
DETAIL:  Field requires 2 characters, but only 1 could be parsed.
HINT:  If your source string is not fixed-width, try using the "FM" modifier.

That happens because we've already skipped the space "for free", and then
NODE_TYPE_CHAR eats the digit.  I've checked that Oracle doesn't allow
random characters/digits to appear in the format string.

select to_timestamp('2018 01 01', '9MM9DD') from dual
ORA-01821: date format not recognized

So, Oracle compatibility isn't an argument here.  Therefore I'm going to
propose the following fix: let NODE_TYPE_CHAR eat characters only
if we haven't already skipped more input string characters than there
were in the format string.  I'm sorry for the vague explanation.  I'll
come up with a patch later, and it should be clear then.


--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Is it possible for postgres_fdw to push down queries on co-located tables?

2018-09-18 Thread Tom Lane
Jinhua Luo  writes:
> That is, if table `foo` and table `bar` are both tables on the same
> remote server, then when I do `select * from foo, bar`, can it
> delegate the whole query to the remote side, rather than fetching rows
> from both tables one by one and doing the merge on the local side?

Reasonably recent releases can do that.  What version are you testing?

> foo=> explain select * from foreign_test2, foreign_test where m = id;
>  QUERY PLAN
> -
> >  Merge Join  (cost=444.06..590.63 rows=9316 width=72)
> >    Merge Cond: (foreign_test2.m = foreign_test.id)
> >    ->  Sort  (cost=222.03..225.44 rows=1365 width=36)
> >          Sort Key: foreign_test2.m
> >          ->  Foreign Scan on foreign_test2  (cost=100.00..150.95 rows=1365 width=36)
> >    ->  Sort  (cost=222.03..225.44 rows=1365 width=36)
> >          Sort Key: foreign_test.id
> >          ->  Foreign Scan on foreign_test  (cost=100.00..150.95 rows=1365 width=36)
> ```

I don't find this particular example to be very compelling.  Taking
the amount of data pulled from the foreign server as the main cost
factor, the plan as given requires pulling 1365*2 rows, whereas if
it were to push down the join, it'd have to retrieve 9316 rows
(or so the planner estimates, anyway).  So it's quite possible that
the planner just rejected the remote join as a net loss.  If you
think it isn't a net loss, you might want to twiddle the cost
parameters for this foreign server.
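For example (server name is hypothetical; use_remote_estimate and the
fdw_* cost options are the knobs meant here):

ALTER SERVER myserver OPTIONS (ADD use_remote_estimate 'true');  -- ask the remote server for estimates
ALTER SERVER myserver OPTIONS (ADD fdw_startup_cost '10');       -- default 100, per-foreign-query overhead
ALTER SERVER myserver OPTIONS (ADD fdw_tuple_cost '0.005');      -- default 0.01, per-row transfer cost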

regards, tom lane



Re: Is it possible for postgres_fdw to push down queries on co-located tables?

2018-09-18 Thread Jinhua Luo
I was testing PG10.

Sorry, the example is not so proper. I just think that even for a
simple example, e.g. joining two co-located tables, the planner should
work out how to push it down. Can you confirm that PostgreSQL can detect
co-located tables on the same foreign server and push down queries on
them? Could you give an actual example or point out the relevant
source code paths for reference?

(Let me clarify the context of this question: if the planner supports
co-located push-down, then it's meaningful for manual sharding via
partitioning to remote tables, where it's often necessary to join two
or more co-located parent tables in complex queries. If not, the
PostgreSQL instance on which the parent tables are placed (let's say
it's a coordinator node) would likely be the bottleneck.)

Tom Lane wrote on Tue, Sep 18, 2018 at 9:43 PM:
>
> Jinhua Luo  writes:
> > That is, if table `foo` and table `bar` are both tables on the same
> > remote server, then when I do `select * from foo, bar`, can it
> > delegate the whole query to the remote side, rather than fetching rows
> > from both tables one by one and doing the merge on the local side?
>
> Reasonably recent releases can do that.  What version are you testing?
>
> > foo=> explain select * from foreign_test2, foreign_test where m = id;
> >  QUERY PLAN
> > -
> >  Merge Join  (cost=444.06..590.63 rows=9316 width=72)
> >    Merge Cond: (foreign_test2.m = foreign_test.id)
> >    ->  Sort  (cost=222.03..225.44 rows=1365 width=36)
> >          Sort Key: foreign_test2.m
> >          ->  Foreign Scan on foreign_test2  (cost=100.00..150.95 rows=1365 width=36)
> >    ->  Sort  (cost=222.03..225.44 rows=1365 width=36)
> >          Sort Key: foreign_test.id
> >          ->  Foreign Scan on foreign_test  (cost=100.00..150.95 rows=1365 width=36)
> > ```
>
> I don't find this particular example to be very compelling.  Taking
> the amount of data pulled from the foreign server as the main cost
> factor, the plan as given requires pulling 1365*2 rows, whereas if
> it were to push down the join, it'd have to retrieve 9316 rows
> (or so the planner estimates, anyway).  So it's quite possible that
> the planner just rejected the remote join as a net loss.  If you
> think it isn't a net loss, you might want to twiddle the cost
> parameters for this foreign server.
>
> regards, tom lane



Re: Code of Conduct

2018-09-18 Thread Tomas Vondra

On 09/18/2018 01:47 PM, James Keener wrote:

> > following a long consultation process
>
> It's not a consultation if any dissenting voice is simply ignored.
> Don't sugar-coat or politicize it like this -- it was rammed down
> everyone's throats. That is core's right, but don't act as if everyone's
> opinions and concerns were taken into consideration.


I respectfully disagree.

I'm not sure which dissenting voices you think were ignored, but from 
what I've observed in the various CoC threads the core team took the 
time to respond to all comments. That does not necessarily mean the 
resulting CoC makes everyone happy, but unfortunately that's not quite 
possible. And it does not mean it was not an honest consultation.


IMO the core team did a good job in listening to comments, tweaking the 
wording and/or explaining the reasoning. Kudos to them.



> There are a good number of folks who are concerned that this CoC is
> overreaching and is ripe for abuse. Those concerns were always
> simply, plainly, and purposely ignored.

No, they were not. There were multiple long discussions about exactly
these dangers. You may dislike the outcome, but it was not ignored.


> > Please take time to read and understand the CoC, which is intended to
> > ensure that PostgreSQL remains an open and enjoyable project for anyone
> > to join and participate in.
>
> I sincerely hope so, and that it doesn't become a tool to enforce social
> ideology like in other groups I've been part of. Especially since this
> is the main place to come to get help for PostgreSQL and not a social club.




Ultimately, it's a matter of trust that the CoC committee and core team 
apply the CoC in a careful and cautious way. Based on my personal 
experience with most of the people involved in both groups I'm not 
worried about this part.



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Online verification of checksums

2018-09-18 Thread Michael Banck
Hi,

please find attached version 2 of the patch.

Am Donnerstag, den 26.07.2018, 13:59 +0200 schrieb Michael Banck:
> I've now forward-ported this change to pg_verify_checksums, in order to
> make this application useful for online clusters, see attached patch.
> 
> I've tested this in a tight loop (while true; do pg_verify_checksums -D
> data1 -d > /dev/null || /bin/true; done)[2] while doing "while true; do
> createdb pgbench; pgbench -i -s 10 pgbench > /dev/null; dropdb pgbench;
> done", which I already used to develop the original code in the fork and
> which brought up a few bugs.
> 
> I got one checksums verification failure this way, all others were
> caught by the recheck (I've introduced a 500ms delay for the first ten
> failures) like this:
> 
> > pg_verify_checksums: checksum verification failed on first attempt in
> > file "data1/base/16837/16850", block 7770: calculated checksum 785 but
> > expected 5063
> > pg_verify_checksums: block 7770 in file "data1/base/16837/16850"
> > verified ok on recheck

I have now changed this from the pg_sleep() to a check against the
checkpoint LSN as discussed upthread.

> However, I am also seeing sporadic (maybe 0.5 times per pgbench run)
> failures like this:
> 
> > pg_verify_checksums: short read of block 2644 in file
> > "data1/base/16637/16650", got only 4096 bytes
> 
> This is not strictly a verification failure, should we do anything about
> this? In my fork, I am also rechecking on this[3] (and I am happy to
> extend the patch that way), but that makes the code and the patch more
> complicated and I wanted to check the general opinion on this case
> first.

I have now added a retry for this as well, without a pg_sleep(). This
catches around 80% of the half-reads, but a few slip through. At
that point we bail out with exit(1), and the user can try again, which I
think is fine?

Alternatively, we could just skip to the next file then and not make
it count as a checksum failure.

Other changes from V1:

1. Rebased to 422952ee
2. Ignore ENOENT failure during file open and skip to next file
3. Mention total number of skipped blocks during the summary at the end
of the run
4. Skip files starting with pg_internal.init*


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz

diff --git a/src/bin/pg_verify_checksums/pg_verify_checksums.c b/src/bin/pg_verify_checksums/pg_verify_checksums.c
index 1bc020ab6c..c3cc7b90b5 100644
--- a/src/bin/pg_verify_checksums/pg_verify_checksums.c
+++ b/src/bin/pg_verify_checksums/pg_verify_checksums.c
@@ -1,7 +1,7 @@
 /*
  * pg_verify_checksums
  *
- * Verifies page level checksums in an offline cluster
+ * Verifies page level checksums in a cluster
  *
  *	Copyright (c) 2010-2018, PostgreSQL Global Development Group
  *
@@ -25,7 +25,9 @@
 static int64 files = 0;
 static int64 blocks = 0;
 static int64 badblocks = 0;
+static int64 skippedblocks = 0;
 static ControlFileData *ControlFile;
+static XLogRecPtr checkpointLSN;
 
 static char *only_relfilenode = NULL;
 static bool verbose = false;
@@ -54,6 +56,7 @@ static const char *const skip[] = {
 	"pg_filenode.map",
 	"pg_internal.init",
 	"PG_VERSION",
+	"pgsql_tmp",
 	NULL,
 };
 
@@ -67,8 +70,14 @@ skipfile(const char *fn)
 		return true;
 
 	for (f = skip; *f; f++)
+	{
 		if (strcmp(*f, fn) == 0)
 			return true;
+		if (strcmp(*f, "pg_internal.init") == 0)
+			if (strncmp(*f, fn, strlen(*f)) == 0)
+				return true;
+	}
+
 	return false;
 }
 
@@ -79,10 +88,17 @@ scan_file(const char *fn, BlockNumber segmentno)
 	PageHeader	header = (PageHeader) buf.data;
 	int			f;
 	BlockNumber blockno;
+	bool		block_retry = false;
 
 	f = open(fn, O_RDONLY | PG_BINARY, 0);
 	if (f < 0)
 	{
+		if (errno == ENOENT)
+		{
+			/* File was removed in the meantime */
+			return;
+		}
+
 		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
 progname, fn, strerror(errno));
 		exit(1);
@@ -99,24 +115,99 @@ scan_file(const char *fn, BlockNumber segmentno)
 			break;
 		if (r != BLCKSZ)
 		{
-			fprintf(stderr, _("%s: could not read block %u in file \"%s\": read %d of %d\n"),
-					progname, blockno, fn, r, BLCKSZ);
-			exit(1);
+			if (block_retry)
+			{
+				/* We already tried once to reread the block, bail out */
+				fprintf(stderr, _("%s: could not read block %u in file \"%s\": read %d of %d\n"),
+						progname, blockno, fn, r, BLCKSZ);
+				exit(1);
+			}
+
+			/*
+			 * Retry the block. It's possible that we read the block while it
+			 * was extended or shrunk, so it ends up looking torn to us.
+			 */
+
+			/*
+			 * Seek back by the amount of bytes we read to the beginning of
+			 * the failed block.
+	

Re: pgsql: Allow concurrent-safe open() and fopen() in frontend code for Wi

2018-09-18 Thread Tom Lane
Michael Paquier  writes:
> On Tue, Sep 18, 2018 at 09:11:43AM +0900, Michael Paquier wrote:
>> What I think I broke is that CreateFile ignores what _fmode uses, which
>> has caused the breakage, while calling directly open() or fopen() does
>> the work.  There are also other things assuming that binary mode is
>> used, you can grep for "setmode" and see how miscinit.c or pg_basebackup
>> do the job.

> I have been playing with this stuff, and hacked the attached.  Now, while
> TAP tests of initdb and pgbench are happy (I can actually see the past
> failure as well), pg_dump complains at authentication time when using
> plain-text mode when using databases with all ASCII characters.  That's
> not something I expected first, but _get_fmode also influences
> operations like pipe(), which is something that pg_dump uses, and
> setmode is enforced to binary mode only when adapted.

> I am getting to wonder if what's present on HEAD represents actually the
> best deal we could get.  Attached is the patch I used for reference.
> Thoughts?

Well, we have to do something.  I have a report from EDB's packagers
that in 11beta4, "initdb --pwfile" is failing on Windows (ie, one can't
connect afterwards using the specified password).  It seems nearly
certain to me that the reason is that the file is read with

FILE   *pwf = fopen(pwfilename, "r");

and so the \r isn't getting stripped from what's used as the password.
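To illustrate the mode difference (a sketch with a hypothetical password file):

#include <stdio.h>

int
main(void)
{
	char	buf[100];
	FILE   *pwf = fopen("pwfile", "r");	/* text mode on Windows: "secret\r\n" is read as "secret\n" */

	if (pwf && fgets(buf, sizeof(buf), pwf))
		printf("%s", buf);
	/* If the CRT default is switched to binary, the same fgets() keeps the
	 * \r, and the stray carriage return ends up inside the password. */
	return 0;
}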

So my opinion is that it's not negotiable that we have to restore
the previous behavior in this realm.  I don't want to be running
around finding other cases one at a time, and I absolutely won't
hold still for putting an "#ifdef WIN32" around each such case.
(Even if we fixed all our own code, we'd still be breaking 3rd-party
code.)

What I don't understand yet is why your latest patch doesn't restore
the previous behavior.  What exactly is still different?

In the meantime, we might be well advised to revert this patch in
v11 and just continue to work on the problem in HEAD.  I see now
that this wasn't something to cram in during late beta ...

regards, tom lane



Is it really difficult for postgres_fdw to implement READ COMMITTED isolation?

2018-09-18 Thread Jinhua Luo
https://www.postgresql.org/docs/current/static/postgres-fdw.html#id-1.11.7.43.12

As the doc said, the REPEATABLE READ isolation level is used to get
snapshot-consistent results.

But is it possible that postgres_fdw could get to know which remote
queries are involved in each top-level command of the local transaction,
and use the same snapshot on the remote server to execute them
sequentially? For example, could we use PREPARE TRANSACTION and SET
TRANSACTION SNAPSHOT to achieve this goal? Then we could use READ
COMMITTED on both sides?
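The building blocks for that do exist locally, e.g. (a sketch; the snapshot
identifier is only an example value):

-- session 1
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot();                  -- e.g. '00000003-0000001B-1'

-- session 2, same server
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000003-0000001B-1';

Whether postgres_fdw could drive something like this per statement is the
open question.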



Re: pgsql: Allow concurrent-safe open() and fopen() in frontend code for Wi

2018-09-18 Thread Laurenz Albe
Tom Lane wrote:
> Well, we have to do something.  I have a report from EDB's packagers
> that in 11beta4, "initdb --pwfile" is failing on Windows (ie, one can't
> connect afterwards using the specified password).  It seems nearly
> certain to me that the reason is that the file is read with
> 
> FILE   *pwf = fopen(pwfilename, "r");
> 
> and so the \r isn't getting stripped from what's used as the password.

Perhaps there is something obvious that I'm missing, but it seems that
all the problems we observe are caused by frontend code suddenly defaulting
to binary mode when it was text mode before.

Would it be an option to have pgwin32_open default to text mode in
frontend code and to binary mode in backend code?
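A sketch of that idea (FRONTEND is the define PostgreSQL already uses to
distinguish frontend compiles; the macro name is made up):

#ifdef FRONTEND
#define PG_TEXTMODE_DEFAULT	_O_TEXT		/* keep \r\n translation for client code */
#else
#define PG_TEXTMODE_DEFAULT	_O_BINARY	/* backend wants binary-clean I/O */
#endif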

Yours,
Laurenz Albe




Re: [HACKERS] proposal - Default namespaces for XPath expressions (PostgreSQL 11)

2018-09-18 Thread Tom Lane
Pavel Stehule  writes:
> [ xml-xpath-default-ns-7.patch ]

At Andrew's prompting, I took a look over this patch.  I don't know much
of anything about XML, so I have no idea as to standards compliance here,
but I do have some comments:

* I'm fairly uncomfortable with the idea that we're going to maintain
our own XPath parser.  That seems like a recipe for lots of future
work ... and the code is far too underdocumented for anyone to actually
maintain it.  (Case in point: _transformXPath's arguments are not
documented at all, and in fact there's hardly a word about what it's
even supposed to *do*.)

* I think the business with "pgdefnamespace.pgsqlxml.internal" is just
plain awful.  It's a wart, and I don't think it's even saving you any
code once you account for all the places where you have to throw errors
for somebody trying to use that as a regular name.  This should be done
with out-of-band signaling if possible.  The existing convention above
this code is that a NULL pointer means a default namespace; can't that
be continued throughout?  If that's not practical, can you pick a string
that simply can't syntactically be a namespace name?  (In the SQL world,
for example, an empty string is an illegal identifier so that that could
be used for the purpose.  But I don't know if that applies to XML.)
Or perhaps you can build a random name that is chosen just to make it
different from any of the listed namespaces?  If none of those work,
I think passing around an explicit "isdefault" boolean alongside the name
would be better than having a reserved word.

* _transformXPath recurses, so doesn't it need check_stack_depth()?

* I'm not especially in love with using function names that start
with an underscore.  I do not think that adds anything, and it's
unlike the style in most places in PG.

* This is a completely unhelpful error message:
+   if (!parser->buffer_is_empty)
+   elog(ERROR, "internal error");
If you can't be bothered to write a message that says something
useful, either drop the test or turn it into an Assert.  I see some
other internal errors that aren't up to project norms either.

* Either get rid of the "elog(DEBUG1)"s, or greatly lower their
message priority.  They might've been appropriate for developing
this patch, but they are not okay to commit that way.

* Try to reduce the amount of random whitespace changes in the patch.

* .h files should never #include "postgres.h", per project policy.

* I'm not sure I'd bother with putting the new code into a separate
file rather than cramming it into xml.c.  The main reason why not
is that you're going to get "empty translation unit" warnings from
some compilers in builds without USE_LIBXML.

* Documentation, comments, and error messages could all use some
copy-editing by a native English speaker (you knew that of course).

regards, tom lane



Re: pgsql: Allow concurrent-safe open() and fopen() in frontend code for Wi

2018-09-18 Thread Tom Lane
Laurenz Albe  writes:
> Would it be an option to have pgwin32_open default to text mode in
> frontend code and to binary mode in backend code?

Well, the question is why Michael's latest proposed patch doesn't
accomplish that.

regards, tom lane



Re: Online verification of checksums

2018-09-18 Thread Stephen Frost
Greetings,

* Michael Banck (michael.ba...@credativ.de) wrote:
> please find attached version 2 of the patch.
> 
> Am Donnerstag, den 26.07.2018, 13:59 +0200 schrieb Michael Banck:
> > I've now forward-ported this change to pg_verify_checksums, in order to
> > make this application useful for online clusters, see attached patch.
> > 
> > I've tested this in a tight loop (while true; do pg_verify_checksums -D
> > data1 -d > /dev/null || /bin/true; done)[2] while doing "while true; do
> > createdb pgbench; pgbench -i -s 10 pgbench > /dev/null; dropdb pgbench;
> > done", which I already used to develop the original code in the fork and
> > which brought up a few bugs.
> > 
> > I got one checksums verification failure this way, all others were
> > caught by the recheck (I've introduced a 500ms delay for the first ten
> > failures) like this:
> > 
> > > pg_verify_checksums: checksum verification failed on first attempt in
> > > file "data1/base/16837/16850", block 7770: calculated checksum 785 but
> > > expected 5063
> > > pg_verify_checksums: block 7770 in file "data1/base/16837/16850"
> > > verified ok on recheck
> 
> I have now changed this from the pg_sleep() to a check against the
> checkpoint LSN as discussed upthread.

Ok.
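For context, the recheck being described amounts to something like the
following sketch (not the patch itself; csum, header, buf and
ControlFile are as in pg_verify_checksums' scan_file(), while
blocks_skipped is a hypothetical counter):

    /*
     * A page whose LSN is newer than the last checkpoint's redo
     * pointer may have been rewritten while we read it, and WAL holds
     * a full-page image of it anyway, so a checksum mismatch there is
     * not evidence of corruption.
     */
    if (csum != header->pd_checksum &&
        PageGetLSN(buf.data) > ControlFile->checkPointCopy.redo)
    {
        blocks_skipped++;
        continue;
    }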

> > However, I am also seeing sporadic (maybe 0.5 times per pgbench run)
> > failures like this:
> > 
> > > pg_verify_checksums: short read of block 2644 in file
> > > "data1/base/16637/16650", got only 4096 bytes
> > 
> > This is not strictly a verification failure; should we do anything about
> > this? In my fork, I am also rechecking on this[3] (and I am happy to
> > extend the patch that way), but that makes the code and the patch more
> > complicated and I wanted to check the general opinion on this case
> > first.
> 
> I have now added a retry for this too, without a pg_sleep().

> This catches around 80% of the half-reads, but a few slip through. At
> that point we bail out with exit(1), and the user can try again, which I
> think is fine? 

No, this is perfectly normal behavior, as is having completely blank
pages, now that I think about it.  If we get a short read then I'd say
we simply check that we got an EOF and, in that case, we just move on.
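A minimal sketch of that handling, assuming the read loop in
scan_file() (the one-byte probe and its placement are illustrative,
not taken from the patch):

    r = read(f, buf.data, BLCKSZ);
    if (r == 0)
        break;                  /* clean EOF on a block boundary */
    if (r > 0 && r < BLCKSZ)
    {
        char    probe;

        /* short read: are we simply at end-of-file? */
        if (read(f, &probe, 1) == 0)
            break;              /* yes: move on to the next file */
        /* no: a true partial read, worth retrying or reporting */
    }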

> Alternatively, we could just skip to the next file then and don't make
> it count as a checksum failure.

No, I wouldn't count it as a checksum failure.  We could possibly count
it towards the skipped pages, though I'm even on the fence about that.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: Progress reporting for pg_verify_checksums

2018-09-18 Thread Michael Banck
Hi,

Am Freitag, den 31.08.2018, 14:50 +0200 schrieb Michael Banck:
> my colleague Bernd Helmle recently added progress reporting to our
> pg_checksums application[1]. I have now forward ported this to
> pg_verify_checksums for the September commitfest, please see the
> attached patch.
> 
> Here's the description:
> 
> This optionally prints the progress of pg_verify_checksums, in
> kilobytes read, to the terminal with the new command line argument -P.
> 
> If -P was forgotten and pg_verify_checksums operates on a large cluster,
> the caller can send SIGUSR1 to pg_verify_checksums to turn progress
> status reporting on during runtime.

Version 2 of the patch is attached. This is rebased to master as of
422952ee and changes the signal handling to be a toggle as suggested by
Alvaro.


Michael

-- 
Michael Banck
Project Manager / Senior Consultant
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Management: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Our handling of personal data is subject to the following
provisions: https://www.credativ.de/datenschutz

diff --git a/src/bin/pg_verify_checksums/pg_verify_checksums.c b/src/bin/pg_verify_checksums/pg_verify_checksums.c
index 1bc020ab6c..24ac99d204 100644
--- a/src/bin/pg_verify_checksums/pg_verify_checksums.c
+++ b/src/bin/pg_verify_checksums/pg_verify_checksums.c
@@ -10,6 +10,8 @@
 #include "postgres_fe.h"
 
 #include 
+#include 
+#include 
 #include 
 #include 
 
@@ -21,7 +23,6 @@
 #include "storage/checksum.h"
 #include "storage/checksum_impl.h"
 
-
 static int64 files = 0;
 static int64 blocks = 0;
 static int64 badblocks = 0;
@@ -29,9 +30,18 @@ static ControlFileData *ControlFile;
 
 static char *only_relfilenode = NULL;
 static bool verbose = false;
+static bool show_progress = false;
 
 static const char *progname;
 
+/*
+ * Progress status information.
+ */
+int64		total_size = 0;
+int64		current_size = 0;
+pg_time_t	last_progress_update;
+pg_time_t	scan_started;
+
 static void
 usage(void)
 {
@@ -42,6 +52,7 @@ usage(void)
 	printf(_(" [-D, --pgdata=]DATADIR  data directory\n"));
 	printf(_("  -v, --verbose  output verbose messages\n"));
 	printf(_("  -r RELFILENODE check only relation with specified relfilenode\n"));
+	printf(_("  -P, --progress show progress information\n"));
 	printf(_("  -V, --version  output version information, then exit\n"));
 	printf(_("  -?, --help show this help, then exit\n"));
 	printf(_("\nIf no data directory (DATADIR) is specified, "
@@ -57,6 +68,72 @@ static const char *const skip[] = {
 	NULL,
 };
 
+static void
+toggle_progress_report(int signo,
+	   siginfo_t * siginfo,
+	   void *context)
+{
+	/* we handle SIGUSR1 only */
+	if (signo == SIGUSR1)
+		show_progress = !show_progress;
+}
+
+/*
+ * Report current progress status. Parts borrowed from
+ * PostgreSQLs' src/bin/pg_basebackup.c
+ */
+static void
+report_progress(bool force)
+{
+	pg_time_t	now = time(NULL);
+	int			total_percent = 0;
+
+	char		totalstr[32];
+	char		currentstr[32];
+	char		currspeedstr[32];
+
+	/* Make sure we report at most once per second, unless forced */
+	if ((now == last_progress_update) && !force)
+		return;
+
+	/* Save current time */
+	last_progress_update = now;
+
+	/*
+	 * Calculate current percent done, based on KiB...
+	 */
+	total_percent = total_size ? (int64) ((current_size / 1024) * 100 / (total_size / 1024)) : 0;
+
+	/* Don't display larger than 100% */
+	if (total_percent > 100)
+		total_percent = 100;
+
+	/* If the current size exceeds the estimated total, adjust the total */
+	if (current_size > total_size)
+		total_size = current_size;
+
+	snprintf(totalstr, sizeof(totalstr), INT64_FORMAT,
+			 total_size / 1024);
+	snprintf(currentstr, sizeof(currentstr), INT64_FORMAT,
+			 current_size / 1024);
+	snprintf(currspeedstr, sizeof(currspeedstr), INT64_FORMAT,
+			 (current_size / 1024) / (((time(NULL) - scan_started) == 0) ? 1 : (time(NULL) - scan_started)));
+	fprintf(stderr, "%s/%s kB (%d%%, %s kB/s)",
+			currentstr, totalstr, total_percent, currspeedstr);
+
+	if (isatty(fileno(stderr)))
+		fprintf(stderr, "\r");
+	else
+		fprintf(stderr, "\n");
+}
+
 static bool
 skipfile(const char *fn)
 {
@@ -73,7 +150,7 @@ skipfile(const char *fn)
 }
 
 static void
-scan_file(const char *fn, BlockNumber segmentno)
+scan_file(const char *fn, BlockNumber segmentno, bool sizeonly)
 {
 	PGAlignedBlock buf;
 	PageHeader	header = (PageHeader) buf.data;
@@ -110,6 +187,7 @@ scan_file(const char *fn, BlockNumber segmentno)
 			continue;
 
 		csum = pg_checksum_page(buf.data, blockno + segmentno * RELSEG_SIZE);
+		current_size += r;
 		if (csum != header->pd_checksum)
 		{
 			if (ControlFile->data_checksum_version == PG_DATA_CHECKSUM_VERSION)
@@ -117,6 +195,9 @@ scan_file(const char *fn, BlockNumber segmentno)
 		progname, fn, bloc

Re: pgsql: Allow concurrent-safe open() and fopen() in frontend code for Wi

2018-09-18 Thread Laurenz Albe
Tom Lane wrote:
> Laurenz Albe  writes:
> > Would it be an option to have pgwin32_open default to text mode in
> > frontend code and to binary mode in backend code?
> 
> Well, the question is why Michael's latest proposed patch doesn't
> accomplish that.

I was thinking of something trivial like this:

--- a/src/port/open.c
+++ b/src/port/open.c
@@ -71,6 +71,12 @@ pgwin32_open(const char *fileName, int fileFlags,...)
 _O_SHORT_LIVED | O_DSYNC | O_DIRECT |
 (O_CREAT | O_TRUNC | O_EXCL) | (O_TEXT | O_BINARY))) 
== fileFlags);
 
+#ifdef FRONTEND
+   /* default to text mode in frontend code */
+   if ((fileFlags & O_BINARY) == 0)
+   fileFlags |= O_TEXT;
+#endif
+
sa.nLength = sizeof(sa);
sa.bInheritHandle = TRUE;
sa.lpSecurityDescriptor = NULL;

That wouldn't influence pipes, which was what Michael said was a
problem for pg_dump.

I currently have no Windows system at hand, so I cannot test...

Yours,
Laurenz Albe




Re: Is it really difficult for postgres_fdw to implement READ COMMITTED isolation?

2018-09-18 Thread Ashutosh Bapat
On Tue, Sep 18, 2018 at 8:28 PM Jinhua Luo  wrote:

>
> https://www.postgresql.org/docs/current/static/postgres-fdw.html#id-1.11.7.43.12
>
> As the doc said, the REPEATABLE READ isolation level is used to get
> snapshot-consistent results.
>
> But would it be possible for postgres_fdw to know which remote
> queries are involved in each top-level command of the local
> transaction, and to use the same snapshot on the remote server to
> execute them sequentially? For example, could we use PREPARE
> TRANSACTION and SET TRANSACTION SNAPSHOT to achieve this goal? Then
> we could use READ COMMITTED on both sides?
>
>
I guess the problems are: 1. exporting snapshots is not cheap, and 2.
tracking prepared transactions is not implemented. See a nearby thread
on "transaction involving multiple foreign server".

--
Best Wishes,
Ashutosh Bapat


Re: Is it possible for postgres_fdw to push down queries on co-located tables?

2018-09-18 Thread Dilip Kumar
On Tue, Sep 18, 2018 at 7:50 PM, Jinhua Luo  wrote:
> I was testing PG10.
>
> Sorry, the example is not a proper one. I just think that even for a
> simple example, e.g. joining two co-located tables, the planner should
> work out how to push it down. Can you confirm that PostgreSQL can
> detect co-located tables on the same foreign server and push down
> queries on them? Could you give an actual example or point out the
> relevant source code paths for reference?
>

You can check "postgresGetForeignJoinPaths" function or you can refer
commit "e4106b2528727c4b48639c0e12bf2f70a766b910".

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: PostgreSQL 11 {Beta 4, RC1} Release: 2018-09-20

2018-09-18 Thread Jonathan S. Katz
On 9/13/18 12:51 PM, Jonathan S. Katz wrote:
> Hi,
> 
> We are planning to have another release of PostgreSQL 11, either Beta 4
> or RC1, next week on Thursday, 2018-09-20. The version will be
> determined based on the state of the open items list[1] around the time
> of stamping.

PostgreSQL 11 was stamped Beta 4[1] yesterday evening for release on
2018-09-20.

There is a copy of the Beta 4 release notes draft here[2]. I would
greatly appreciate any review and feedback on them, particularly around
technical accuracy and any omissions that should be included.

Thanks!

Jonathan

[1]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3d65e406d1ea82060ad13a7bc41178ed22c599d1
[2]
https://git.postgresql.org/gitweb/?p=press.git;a=blob;f=releases/11/beta/11beta4.md;



signature.asc
Description: OpenPGP digital signature


Re: Code of Conduct

2018-09-18 Thread James Keener
>
>  You may dislike the outcome, but it was not ignored.


I can accept that I don't like the outcome, but I can point to maybe a
dozen people in the last exchange worried about the CoC being used to
further political goals, and the only response was "well, the CoC
Committee will handle it reasonably", which is not a good answer,
because that's exactly what we are worried won't happen! These concerns
were never actually addressed, just brushed aside -- that's what I
found bothersome and worrisome.

We shouldn't have to count on the rules being applied fairly in order
to counter actual abuses of the rules. I've seen it in other groups and
have been the target of such actions. (I had the gall to claim that
hiring practices that require submitting side or open-source work
aren't only detrimental to women, who statistically shoulder more of
the housework and childcare, but also to husbands and fathers who take
an active role in the household and childcare. It wasn't intended to
diminish the effect this hiring practice has on women, but to suggest
that it's a broader problem than the conversation at that point was
making it out to be. I was subsequently silenced and eventually booted
from the group for that incident and another, in a social channel,
where a discussion on guns was taking place and someone said that the
discussion was sexist and the reason there are so few female
programmers; I had the impertinence to say that I know more women who
hunt and shoot for sport than men (it's ~50-50 in this area). Forgive
me for not having a favourable view of CoCs.)

So, it's not that I don't trust the CoC Committee, but I just really don't
trust most people. The clearer
the rules the better. As it stands, the rules are extremely vague and
overreaching.

Jim


Re: Code of Conduct

2018-09-18 Thread Chris Travers
On Tue, Sep 18, 2018 at 4:35 PM Tomas Vondra 
wrote:

> On 09/18/2018 01:47 PM, James Keener wrote:
> >  > following a long consultation process
> >
> > It's not a consultation if any dissenting voice is simply ignored.
> > Don't sugar-coat or politicize it like this -- it was rammed down
> > everyone's throats. That is core's right, but don't act as everyone's
> > opinions and concerns were taken into consideration.
>
> I respectfully disagree.
>
> I'm not sure which dissenting voices you think were ignored, but from
> what I've observed in the various CoC threads the core team took the
> time to respond to all comments. That does not necessarily mean the
> resulting CoC makes everyone happy, but unfortunately that's not quite
> possible. And it does not mean it was not an honest consultation.
>
> IMO the core team did a good job in listening to comments, tweaking the
> wording and/or explaining the reasoning. Kudos to them.
>

I said I would set aside my objections after the last time I mentioned
them, but I did not feel that my particular objection and concern with
regard to one specific added sentence got much of a hearing.  This being
said, it is genuinely hard to sort through the noise and reach the
signal.  I think the resurgence of the debate about whether we need a code
of conduct at all made it very difficult to discuss specific objections to
specific wording.  So to be honest, the breakdown was mutual.

>
> > There are a good number of folks who are concerned that this CoC is
> > overreaching and is ripe for abuse. Those concerns were always
> > simply, plainly, and purposely ignored.
> No, they were not. There were multiple long discussions about exactly
> these dangers, You may dislike the outcome, but it was not ignored.
>

Also those of us who had specific, actionable concerns were often drowned
out by the noise.  That's deeply unfortunate.

I think those of us who had specific concerns about one specific sentence
that was added were drowned out by those who seemed to be opposed to the
idea of a code of conduct generally.

I would have appreciated at least a reason why my concerns that the
addition a) doesn't cover what it needs to cover, and b) will attract
complaints that it shouldn't cover, were not considered valid.
But I can understand that given the noise-to-signal ratio of the discussion
made such discussion next to impossible.

Again I find that regrettable.

>
> >  > Please take time to read and understand the CoC, which is intended to
> > ensure that PostgreSQL remains an open and enjoyable project for anyone
> > to join and participate in.
> >
> > I sincerely hope so, and that it doesn't become a tool to enforce social
> > ideology like in other groups I've been part of. Especially since this
> > is the main place to come to get help for PostgreSQL and not a social
> club.
> >
>
> Ultimately, it's a matter of trust that the CoC committee and core team
> apply the CoC in a careful and cautious way. Based on my personal
> experience with most of the people involved in both groups I'm not
> worried about this part.
>

I would actually go further than you here.  The CoC committee *cannot*
apply the CoC in the way that the opponents fear.  The fact is, Europe has
anti-discrimination laws regarding social and political ideology (something
the US might want to consider as it would help avoid problems on this list
;-) ).  And different continents have different norms on these sorts of
things.  Pushing a social ideology via the code of conduct would, I
suspect, result in everything from legal action to large emerging markets
going elsewhere.  So I don't think it is a question of "trust us" but
rather that the community won't let that sort of abuse happen no matter who
is on the CoC committee.

>
>
> regards
>
> --
> Tomas Vondra  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
>

-- 
Best Wishes,
Chris Travers

Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor
lock-in.
http://www.efficito.com/learn_more


Re: Code of Conduct

2018-09-18 Thread Stephen Frost
Greetings,

* Chris Travers (chris.trav...@gmail.com) wrote:
> I said I would set aside my objections after the last time I mentioned
> them, but I did not feel that my particular objection and concern with
> regard to one specific added sentence got much of a hearing.  This being
> said, it is genuinely hard to sort through the noise and reach the
> signal.  I think the resurgence of the debate about whether we need a code
> of conduct at all made it very difficult to discuss specific objections to
> specific wording.  So to be honest, the breakdown was mutual.

I would ask that you, and anyone else who has a suggestion for how to
improve or revise the CoC, submit your ideas to the committee by
emailing c...@postgresql.org.

As was discussed previously, the current CoC isn't written in stone and
it will be changed and amended as needed.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: pgbench - add pseudo-random permutation function

2018-09-18 Thread Hironobu SUZUKI

Hi Fabien-san,

I reviewed 'pgbench-prp-func/pgbench-prp-func-4.patch'.

I could apply it and ran the TAP test successfully. I could not find
any typos in the documentation or comments.


To make sure, I checked the new routine which contains the 
__builtin_popcountll() function on Linux + gcc 7.3.1 and I confirmed 
that it works well.


I think this patch is fine.


Best regards,


On 2018/09/16 21:05, Fabien COELHO wrote:


Hello Hironobu-san,

Here is a v4, based on our out-of-list discussion:
  - the mask function is factored out
  - the popcount builtin is used if available


Attached a v3, based on your fix, plus some additional changes:
- explicitly declare unsigned variables where appropriate, to avoid casts
- use smaller 24 bits primes instead of 27-29 bits
- add a shortcut for multiplier below 24 bits and y value below 40 bits,
  which should avoid the manually implemented multiplication in most
  practical cases (tables with over 2^40 rows are pretty rare...).
- change the existing shortcut to look at the number of bits instead of
  using 32 limits.
- add a test for minimal code coverage with over 40 bits sizes
- attempt to improve the documentation
- some comments were updated, hopefully for the better







Re: Online verification of checksums

2018-09-18 Thread David Steele
On 9/18/18 11:45 AM, Stephen Frost wrote:
> * Michael Banck (michael.ba...@credativ.de) wrote:

>> I have now added a retry for this too, without a pg_sleep().
> 
>> This catches around 80% of the half-reads, but a few slip through. At
>> that point we bail out with exit(1), and the user can try again, which I
>> think is fine? 
> 
> No, this is perfectly normal behavior, as is having completely blank
> pages, now that I think about it.  If we get a short read then I'd say
> we simply check that we got an EOF and, in that case, we just move on.
> 
>> Alternatively, we could just skip to the next file then and don't make
>> it count as a checksum failure.
> 
> No, I wouldn't count it as a checksum failure.  We could possibly count
> it towards the skipped pages, though I'm even on the fence about that.

+1 for it not being a failure.  Personally I'd count it as a skipped
page, since we know the page exists but it can't be verified.

The other option is to wait for the page to stabilize, which doesn't
seem like it would take very long in most cases -- unless you are doing
this test from another host with shared storage.  Then I would expect to
see all kinds of interesting torn pages after the last checkpoint.

Regards,
-- 
-David
da...@pgmasters.net



signature.asc
Description: OpenPGP digital signature


Re: pgbench - add pseudo-random permutation function

2018-09-18 Thread Fabien COELHO



I reviewed 'pgbench-prp-func/pgbench-prp-func-4.patch'. [...] I think 
this patch is fine.


Thanks!

You may consider turning it "ready" in the CF app, so as to see whether 
some committer agrees with you.


--
Fabien.



fast default vs triggers

2018-09-18 Thread Andrew Dunstan




Tomas Vondra has pointed out to me that there's an issue with triggers 
not getting expanded tuples for columns with fast defaults. Here is an 
example that shows the issue:



   andrew=# create table blurfl (id int);
   CREATE TABLE
   andrew=# insert into blurfl select x from generate_series(1,5) x;
   INSERT 0 5
   andrew=# alter table blurfl add column x int default 100;
   ALTER TABLE
   andrew=# create or replace function showmej() returns trigger
   language plpgsql as $$ declare j json; begin j := to_json(old);
   raise notice 'old x: %', j->>'x'; return new; end; $$;
   CREATE FUNCTION
   andrew=# create trigger show_x before update on blurfl for each row
   execute procedure showmej();
   CREATE TRIGGER
   andrew=# update blurfl set id = id where id = 1;
   NOTICE:  old x: 
   UPDATE 1
   andrew=# update blurfl set id = id where id = 1;
   NOTICE:  old x: 100
   UPDATE 1
   andrew=#


The error is fixed with this patch:


   diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
   index 2436692..f34a72a 100644
   --- a/src/backend/commands/trigger.c
   +++ b/src/backend/commands/trigger.c
   @@ -3396,7 +3396,11 @@ ltrmark:;
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
}
 
   -   result = heap_copytuple(&tuple);
   +   if (HeapTupleHeaderGetNatts(tuple.t_data) < relation->rd_att->natts)
   +   result = heap_expand_tuple(&tuple, relation->rd_att);
   +   else
   +   result = heap_copytuple(&tuple);
   +
ReleaseBuffer(buffer);
 
return result;


I'm going to re-check the various places where this might have been
missed. I guess it belongs on the open items list.



cheers


andrew


--
Andrew Dunstan            https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Collation versioning

2018-09-18 Thread Thomas Munro
On Wed, Sep 19, 2018 at 12:48 AM Stephen Frost  wrote:
> * Thomas Munro (thomas.mu...@enterprisedb.com) wrote:
> > So to be more concrete: pg_depend could have a new column
> > "refobjversion".  Whenever indexes are created or rebuilt, we'd
> > capture the current version string in the pg_depend rows that link
> > index attributes and collations.  Then we'd compare those against the
> > current value when we first open an index and complain if they don't
> > match.  (In this model there would be no "collversion" column in the
> > pg_collation catalog.)
>
> I'm really not sure why you're pushing to have this in pg_depend..
>
> > That'd leave a place for other kinds of database objects (CHECKs,
> > PARTITIONS, ...) to store their version dependency, if someone later
> > wants to add support for that.
>
> Isn't what matters here where the data's stored, as in, in a column..?
>
> All of those would already have dependencies on the column so that they
> can be tracked back there.

Suppose I have a table "emp" and indexes "emp_firstname_idx" and
"emp_lastname_idx".  Suppose I created them in a sequence like this:

0: collation fr_CA has version "30"
1: create table emp (firstname text collate "fr_CA", lastname text
collate "fr_CA");
2: create index on emp(firstname);
3: [upgrade operating system]; now collation fr_CA has version "31"
4: create index on emp(lastname);

Now I have two indexes, built when different versions of the collation
were in effect.  One of them is potentially corrupted, the other
isn't.  Where are you going to record that?  Earlier I suggested that
pg_index could have an indcollversion column, so that
emp_firstname_idx's row would hold {"30"} and emp_lastname_idx's row
would hold {"31"}.  It would be captured at CREATE INDEX time, and
after that the only way to change it would be to REINDEX, and whenever
it disagrees with the current version according to the provider you'd
get a warning that you can only clear by running REINDEX.  Then I
suggested that perhaps pg_depend might be a better place for it,
because it would generalise to other kinds of object too.

For example, suppose I create a constraint CHECK (foo < 'côté') [evil
laugh].  The pg_depend row that links the constraint and the collation
could record the current version as of the moment the constraint was
defined.  After an OS upgrade that changes the reported version, I'd
see a warning whenever loading the check constraint, and the only way
to clear it would be to drop and recreate the constraint.  (I'm not
proposing we do that, just trying to demonstrate that pg_depend might
be a tidier and more general solution than adding 'collation version'
columns holding arrays of version strings to multiple catalogs.  So
help me Codd.)

Just an idea...

-- 
Thomas Munro
http://www.enterprisedb.com



Re: pgsql: Allow concurrent-safe open() and fopen() in frontend code for Wi

2018-09-18 Thread Michael Paquier
On Tue, Sep 18, 2018 at 10:45:09AM -0400, Tom Lane wrote:
> In the meantime, we might be well advised to revert this patch in
> v11 and just continue to work on the problem in HEAD.  I see now
> that this wasn't something to cram in during late beta ...

I can see that you have reverted the change just before I was able to
reply.  Thanks, I was going to do the same.  Also, if we cannot find a
clear solution for HEAD rather soon, say by tomorrow, let's remove it
there as well and go back to the blackboard.  I still want to test a
couple of other solutions first.
--
Michael


signature.asc
Description: PGP signature


Re: Collation versioning

2018-09-18 Thread Stephen Frost
Greetings,

* Thomas Munro (thomas.mu...@enterprisedb.com) wrote:
> On Wed, Sep 19, 2018 at 12:48 AM Stephen Frost  wrote:
> > * Thomas Munro (thomas.mu...@enterprisedb.com) wrote:
> > > So to be more concrete: pg_depend could have a new column
> > > "refobjversion".  Whenever indexes are created or rebuilt, we'd
> > > capture the current version string in the pg_depend rows that link
> > > index attributes and collations.  Then we'd compare those against the
> > > current value when we first open an index and complain if they don't
> > > match.  (In this model there would be no "collversion" column in the
> > > pg_collation catalog.)
> >
> > I'm really not sure why you're pushing to have this in pg_depend..
> >
> > > That'd leave a place for other kinds of database objects (CHECKs,
> > > PARTITIONS, ...) to store their version dependency, if someone later
> > > wants to add support for that.
> >
> > Isn't what matters here where the data's stored, as in, in a column..?
> >
> > All of those would already have dependencies on the column so that they
> > can be tracked back there.
> 
> Suppose I have a table "emp" and indexes "emp_firstname_idx" and
> "emp_lastname_idx".  Suppose I created them in a sequence like this:
> 
> 0: collation fr_CA has version "30"
> 1: create table emp (firstname text collate "fr_CA", lastname text
> collate "fr_CA");
> 2: create index on emp(firstname);
> 3: [upgrade operating system]; now collation fr_CA has version "31"
> 4: create index on emp(lastname);
> 
> Now I have two indexes, built when different versions of the collation
> were in effect.  One of them is potentially corrupted, the other
> isn't.  Where are you going to record that?  Earlier I suggested that
> pg_index could have an indcollversion column, so that
> emp_firstname_idx's row would hold {"30"} and emp_lastname_idx's row
> would hold {"31"}.  It would be captured at CREATE INDEX time, and
> after that the only way to change it would be to REINDEX, and whenever
> it disagrees with the current version according to the provider you'd
> get a warning that you can only clear by running REINDEX.  Then I
> suggested that perhaps pg_depend might be a better place for it,
> because it would generalise to other kinds of object too.

For indexes, just like for tables, we have entries in pg_attribute where
that information would go.

> For example, suppose I create a constraint CHECK (foo < 'côté') [evil
> laugh].  The pg_depend row that links the constraint and the collation
> could record the current version as of the moment the constraint was
> defined.  After an OS upgrade that changes the reported version, I'd
> see a warning whenever loading the check constraint, and the only way
> to clear it would be to drop and recreate the constraint.  (I'm not
> proposing we do that, just trying to demonstrate that pg_depend might
> be a tidier and more general solution than adding 'collation version'
> columns holding arrays of version strings to multiple catalogs.  So
> help me Codd.)

The CHECK constraint doesn't need to directly track that information-
it should have a dependency on the column in the table and that's where
the information would be recorded about the current collation version.

Maybe I'm missing something but I have to admit that I feel like
pg_depend is being looked at here because everything goes through it-
but everything goes through it because it's simple and we just use it to
get to other things that have the complete definition of the object.

Lots and lots of things in pg_depend would have zero use for such a
field and I'm a bit worried you'd possibly also get into cases where
you've got different collation versions for the same object because of
the different dependencies into it...  even if you don't, you're
duplicating that information into every dependency, aren't you?

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: Collation versioning

2018-09-18 Thread Douglas Doole
>
> The CHECK constraint doesn't need to directly track that information-
> it should have a dependency on the column in the table and that's where
> the information would be recorded about the current collation version.
>

Just to have fun throwing odd cases out, how would something like this be
recorded?

Database default collation: en_US

CREATE TABLE t (c1 TEXT, c2 TEXT, c3 TEXT,
 CHECK (c1 COLLATE "fr_FR" BETWEEN c2 COLLATE "fr_FR" AND c3 COLLATE
"fr_FR"));

You could even be really warped and apply multiple collations on a single
column in a single constraint.


Re: Collation versioning

2018-09-18 Thread Stephen Frost
Greetings,

* Douglas Doole (dougdo...@gmail.com) wrote:
> > The CHECK constraint doesn't need to directly track that information-
> > it should have a dependency on the column in the table and that's where
> > the information would be recorded about the current collation version.
> 
> Just to have fun throwing odd cases out, how would something like this be
> recorded?
> 
> Database default collation: en_US
> 
> CREATE TABLE t (c1 TEXT, c2 TEXT, c3 TEXT,
>  CHECK (c1 COLLATE "fr_FR" BETWEEN c2 COLLATE "fr_FR" AND c3 COLLATE
> "fr_FR"));
> 
> You could even be really warped and apply multiple collations on a single
> column in a single constraint.

Once it gets to an expression and not just a simple check, I'd think
we'd record it in the expression..

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: [HACKERS] [PATCH] kNN for SP-GiST

2018-09-18 Thread Alexander Korotkov
On Mon, Sep 17, 2018 at 12:42 PM Andrey Borodin  wrote:
> > On 17 Sep 2018, at 2:03, Alexander Korotkov
> > wrote:
> >
> > Also, it appears to me that it's OK to be a single patch
>
> +1, ISTM that these 6 patches represent an atomic unit of work.

Thank you, pushed.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Collation versioning

2018-09-18 Thread Thomas Munro
On Wed, Sep 19, 2018 at 10:09 AM Stephen Frost  wrote:
> * Douglas Doole (dougdo...@gmail.com) wrote:
> > > The CHECK constraint doesn't need to directly track that information-
> > > it should have a dependency on the column in the table and that's where
> > > the information would be recorded about the current collation version.
> >
> > Just to have fun throwing odd cases out, how would something like this be
> > recorded?
> >
> > Database default collation: en_US
> >
> > CREATE TABLE t (c1 TEXT, c2 TEXT, c3 TEXT,
> >  CHECK (c1 COLLATE "fr_FR" BETWEEN c2 COLLATE "fr_FR" AND c3 COLLATE
> > "fr_FR"));
> >
> > You could even be really warped and apply multiple collations on a single
> > column in a single constraint.
>
> Once it gets to an expression and not just a simple check, I'd think
> we'd record it in the expression..

Maybe I misunderstood, but I don't think it makes sense to have a
collation version "on the column in the table", because (1) that fails
to capture the fact that two CHECK constraints that were defined at
different times might have become dependent on two different versions
(you created one constraint before upgrading and the other after, now
the older one is invalidated and sounds the alarm but the second one
is fine), and (2) the table itself doesn't care about collation
versions since heap tables are unordered; there is no particular
operation on the table that would be the correct time to update the
collation version on a table/column.  What we're trying to track is
when objects that in some way depend on the version become
invalidated, so wherever we store it there's going to have to be a
version recorded per dependent object at its creation time, so that's
either new columns on every interested catalog table, or ...

-- 
Thomas Munro
http://www.enterprisedb.com



RE: Changing the setting of wal_sender_timeout per standby

2018-09-18 Thread Tsunakawa, Takayuki
From: Masahiko Sawada [mailto:sawada.m...@gmail.com]
> I didn't follow the first sentence of the above hunk. Is
> wal_sender_timeout relevant to %q?

Ouch, that's a careless mistake.  I copied the paragraph from another parameter 
and failed to remove a leftover sentence.  Patch revised.

Regards
Takayuki Tsunakawa




walsender_timeout_PGC_BACKEND_v2.patch
Description: walsender_timeout_PGC_BACKEND_v2.patch


heap_sync seems rather oblivious to partitioned tables (wal_level=minimal)

2018-09-18 Thread Tomas Vondra
Hi,

I've been doing some testing today, and it seems heap_sync is somewhat
confused by partitioned tables. I'm doing a COPY into a partitioned
table (lineitem from TPC-H partitioned per month) like this:

-
BEGIN;

CREATE TABLE lineitem (...) PARTITION BY RANGE (l_shipdate);

CREATE TABLE lineitem_1992_01 PARTITION OF lineitem FOR VALUES FROM
('1992-01-01') TO ('1992-02-01');

...

CREATE TABLE lineitem_1998_12 PARTITION OF lineitem FOR VALUES FROM
('1998-12-01') TO ('1999-01-01');

COPY lineitem FROM ...

COMMIT;
-

I'm observing COPY failing like this:

ERROR: could not open file "base/16384/16427": No such file or directory

where the relfilenode matches the "lineitem" relation (which is
guaranteed to be empty, as it's partitioned, so it's not surprising the
relfilenode does not exist).

After adding an Assert(false) into mdopen(), which is where the error
comes from, I got this backtrace (full backtrace attached):

...
#3  0x00766387 in mdopen (...) at md.c:609
#4  0x007674b7 in mdnblocks (...) at md.c:876
#5  0x00767a61 in mdimmedsync (...) at md.c:1034
#6  0x004c62b5 in heap_sync (...) at heapam.c:9399
#7  0x005a2bdd in CopyFrom (...) at copy.c:2890
#8  0x005a1621 in DoCopy (...) at copy.c:992
...

So apparently CopyFrom() invokes heap_sync() on the partitioned
relation, which attempts to do mdimmedsync() only on the root. That
seems like a bug to me.

Obviously this only applies to wal_level=minimal. There are multiple
callers of heap_sync(), but only the COPY seems to be affected by this,
because the rest can't see partitioned tables.

So it seems heap_sync() needs to be improved to sync all partitions.
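
One possible shape of such a fix, as a hedged sketch (heap_sync_recurse
is a hypothetical name; find_all_inheritors, heap_open and heap_sync
are the existing backend facilities):

    static void
    heap_sync_recurse(Relation rel)
    {
        if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
        {
            List       *children;
            ListCell   *lc;

            /* collect every relation below the partitioned root */
            children = find_all_inheritors(RelationGetRelid(rel),
                                           AccessShareLock, NULL);
            foreach(lc, children)
            {
                Relation    child = heap_open(lfirst_oid(lc),
                                              AccessShareLock);

                /* only leaf relations have storage to sync */
                if (child->rd_rel->relkind == RELKIND_RELATION)
                    heap_sync(child);
                heap_close(child, AccessShareLock);
            }
        }
        else
            heap_sync(rel);
    }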


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x7f79267aeb7d in __GI_abort () at abort.c:90
#2  0x00878ef4 in ExceptionalCondition (conditionName=, 
errorType=, fileName=, lineNumber=) at assert.c:54
#3  0x00766387 in mdopen (reln=, forknum=, behavior=1) at md.c:609
#4  0x007674b7 in mdnblocks (reln=0x174c420, forknum=MAIN_FORKNUM) at 
md.c:876
#5  0x00767a61 in mdimmedsync (reln=0x174c420, forknum=MAIN_FORKNUM) at 
md.c:1034
#6  0x004c62b5 in heap_sync (rel=0x7f78a1bdf5c0) at heapam.c:9399
#7  0x005a2bdd in CopyFrom (cstate=) at copy.c:2890
#8  0x005a1621 in DoCopy (pstate=, stmt=0x150c298, 
stmt_location=, stmt_len=, processed=) at copy.c:992
#9  0x00770ae1 in standard_ProcessUtility (pstmt=0x150c388, 
queryString=0x150b578 "COPY lineitem FROM '/mnt/raid/tpch/lineitem.csv' WITH 
(FORMAT csv, DELIMITER '|');", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, 
queryEnv=0x0, dest=0x150c678, 
completionTag=0x7ffe998b7b20 "") at utility.c:548
#10 0x007704c2 in ProcessUtility (pstmt=0x2, queryString=0x7ffe998b7420 
"", context=PROCESS_UTILITY_TOPLEVEL, params=0x7f79267acfa0 <__GI_raise+272>, 
queryEnv=0x0, dest=0x7ffe998b7420, completionTag=0x7ffe998b7b20 "") at 
utility.c:360
#11 0x007700a3 in PortalRunUtility (portal=0x1572028, pstmt=0x150c388, 
isTopLevel=, setHoldSnapshot=, dest=0x150c678, 
completionTag=0x7ffe998b7b20 "") at pquery.c:1178
#12 0x0076f7b6 in PortalRunMulti (portal=0x1572028, isTopLevel=true, 
setHoldSnapshot=false, dest=0x150c678, altdest=0x150c678, 
completionTag=0x7ffe998b7b20 "") at pquery.c:1331
#13 0x0076f141 in PortalRun (portal=0x1572028, 
count=9223372036854775807, isTopLevel=true, run_once=, 
dest=, altdest=0x150c678, completionTag=0x7ffe998b7b20 "") at 
pquery.c:799
#14 0x0076e108 in exec_simple_query (query_string=0x150b578 "COPY 
lineitem FROM '/mnt/raid/tpch/lineitem.csv' WITH (FORMAT csv, DELIMITER '|');") 
at postgres.c:1122
#15 0x0076c045 in PostgresMain (argc=, argv=, dbname=, username=) at postgres.c:4155
#16 0x006ef3d8 in BackendRun (port=0x152c620) at postmaster.c:4361
#17 0x006eea96 in BackendStartup (port=) at 
postmaster.c:4033
#18 ServerLoop () at postmaster.c:1706
#19 0x006ebd21 in PostmasterMain (argc=, argv=) at postmaster.c:1379
#20 0x00660dd4 in main (argc=3, argv=0x1505fd0) at main.c:228
(gdb) print wal_level
$1 = 0



(gdb) bt full
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
set = {__val = {0, 9779315902183899136, 0, 0, 0, 2305878331024736256, 
0, 0, 4594464784623501632, 0, 0, 0, 4594464784623501626, 0, 
4619490161887593984, 0}}
pid = 
tid = 
ret = 
#1  0x7f79267aeb7d in __GI_abort () at abort.c:90
save_stage = 2
act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, 
sa_mask = {__val = {0 , 19}}, sa_flags = 1471333120, 
sa_restorer = 0x0}
  

incorrect comment or possible lock upgrade hazards in executor

2018-09-18 Thread David Rowley
While reviewing Amit's 0001 patch in [1] I noticed Amit pulled out the
code that inits the ResultRelInfos during InitPlan() but didn't update
the comment which says:

/*
* initialize result relation stuff, and open/lock the result rels.
*
* We must do this before initializing the plan tree, else we might try to
* do a lock upgrade if a result rel is also a source rel.
*/

This got me thinking about what Tom pointed out in [2].

Tom wrote:
> It struck me this morning that a whole lot of the complication here is
> solely due to needing to identify the right type of relation lock to take
> during executor startup, and that THAT WORK IS TOTALLY USELESS. In every
> case, we must already hold a suitable lock before we ever get to the
> executor; either one acquired during the parse/plan pipeline, or one
> re-acquired by AcquireExecutorLocks in the case of a cached plan.
> Otherwise it's entirely possible that the plan has been invalidated by
> concurrent DDL --- and it's not the executor's job to detect that and
>  re-plan; that *must* have been done upstream.

This seems to mean that the above code comment was never true. How can
we possibly be preventing a lock upgrade hazard if the locks are
already all obtained?

Amit's change to delay the ResultRelInfo creation until the node is
initialized seems to not make the problem worse as the locks are
already taken, but I do wonder if there is something I'm not seeing
here.  I do have a case here that does deadlock because of the lock
upgrade, but it feels pretty fabricated.

S1: set plan_cache_mode = 'force_generic_plan';
S1: prepare q1 as select * from t1 cross join t2 cross join t1 t1a for
update of t2,t1a;

S1: execute q1;
-- break after AccessShareLock on t1. (the 1st lock that's obtained)
S2: BEGIN;
S2: LOCK TABLE t1 IN EXCLUSIVE MODE;
-- continue S1 until RowShareLock on t2 (the 2nd lock that's obtained)
S2: LOCK TABLE t2 IN EXCLUSIVE MODE;
-- detach S1 from the debugger.
S1: ERROR:  deadlock detected


I think we're also protected to some degree because result relations
come first in the rangetable. This is why I used FOR UPDATE in my
example as I can control the range table index based on where I put
the rel in the query.

Is it safe to remove the 2nd part of the code comment? Or is there
something I'm not seeing here?

[1] 
https://www.postgresql.org/message-id/a20151ff-9d3b-bec8-8c64-5336676cfda3%40lab.ntt.co.jp
[2] https://www.postgresql.org/message-id/4114.1531674...@sss.pgh.pa.us

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: heap_sync seems rather oblivious to partitioned tables (wal_level=minimal)

2018-09-18 Thread David Rowley
On 19 September 2018 at 12:21, Tomas Vondra
 wrote:
> So apparently CopyFrom() invokes heap_sync() on the partitioned
> relation, which attempts to do mdimmedsync() only on the root. That
> seems like a bug to me.
>
> Obviously this only applies to wal_level=minimal. There are multiple
> callers of heap_sync(), but only the COPY seems to be affected by this,
> because the rest can't see partitioned tables.
>
> So it seems heap_sync() needs to be improved to sync all partitions.

Wouldn't it be better to modify copy.c to perform the heap_sync
on just the partitions it touches?

After https://commitfest.postgresql.org/19/1690/ it's very efficient
to know which partitions were used. The bulk load might just be
hitting a single partition, so it makes sense to try to just do the
ones we need to, rather than needlessly performing heap_sync() on all
of them.

For the back branches, the current form of PartitionDispatch is not
particularly inefficient to just skip over the uninitialized
partitions.

6b78231d918 did recently just make the PartitionDispatch struct's
members private to execPartition.c, so doing this will probably
require some API function that can be passed a callback function to do
what we need.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Collation versioning

2018-09-18 Thread Stephen Frost
Greetings,

* Thomas Munro (thomas.mu...@enterprisedb.com) wrote:
> On Wed, Sep 19, 2018 at 10:09 AM Stephen Frost  wrote:
> > * Douglas Doole (dougdo...@gmail.com) wrote:
> > > > The CHECK constraint doesn't need to directly track that information-
> > > > it should have a dependency on the column in the table and that's where
> > > > the information would be recorded about the current collation version.
> > >
> > > Just to have fun throwing odd cases out, how would something like this be
> > > recorded?
> > >
> > > Database default collation: en_US
> > >
> > > CREATE TABLE t (c1 TEXT, c2 TEXT, c3 TEXT,
> > >  CHECK (c1 COLLATE "fr_FR" BETWEEN c2 COLLATE "fr_FR" AND c3 COLLATE
> > > "fr_FR"));
> > >
> > > You could even be really warped and apply multiple collations on a single
> > > column in a single constraint.
> >
> > Once it gets to an expression and not just a simple check, I'd think
> > we'd record it in the expression..
> 
> Maybe I misunderstood, but I don't think it makes sense to have a
> collation version "on the column in the table", because (1) that fails
> to capture the fact that two CHECK constraints that were defined at
> different times might have become dependent on two different versions
> (you created one constraint before upgrading and the other after, now
> the older one is invalidated and sounds the alarm but the second one
> is fine), and (2) the table itself doesn't care about collation
> versions since heap tables are unordered; there is no particular
> operation on the table that would be the correct time to update the
> collation version on a table/column.  What we're trying to track is
> when objects that in some way depend on the version become
> invalidated, so wherever we store it there's going to have to be a
> version recorded per dependent object at its creation time, so that's
> either new columns on every interested catalog table, or ...

Today, we work out what operators to use for a given column based on the
data type.  This is why I was trying to get at the idea of using typmod
earlier, but if we can't make that work then I'm inclined to try and
figure out a way to get it as close as possible to being associated with
the type.

Yes, the heap is unordered, but the type *is* ordered and we track that
with the type system.  Maybe we just extend that somehow, rather than
using the typmod or adding columns into catalogs where we store type
information (such as pg_attribute).  Perhaps typmod could be made larger
to be able to support this, that might be another approach.

Nowhere in this does it strike me that it makes sense to push this into
pg_depend though.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: pread() and pwrite()

2018-09-18 Thread Thomas Munro
On Fri, Sep 7, 2018 at 2:17 AM Jesper Pedersen
 wrote:
> > This needs a rebase again.

Done.

> Would it be of benefit to update these call sites
>
> * slru.c
>- SlruPhysicalReadPage
>- SlruPhysicalWritePage
> * xlogutils.c
>- XLogRead
> * pg_receivewal.c
>- FindStreamingStart
> * rewriteheap.c
>- heap_xlog_logical_rewrite
> * walreceiver.c
>- XLogWalRcvWrite

It certainly wouldn't hurt... but more pressing to get this committed
would be Windows support IMHO.  I think the thing to do is to open
files with the FILE_FLAG_OVERLAPPED flag, and then use ReadFile() and
WriteFile() with an LPOVERLAPPED struct that holds an offset, but I'm
not sure if I can write that myself.  I tried doing some semi-serious
Windows development for the fsyncgate patch using only AppVeyor CI a
couple of weeks ago and it was like visiting the dentist.
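
In rough outline, the positioned read described above might look like
this (an untested sketch, assuming <windows.h>, <string.h> and
<stdint.h>; pg_pread_win32 is a hypothetical name):

    static int
    pg_pread_win32(HANDLE h, void *buf, DWORD nbytes, uint64_t offset)
    {
        OVERLAPPED  ov;
        DWORD       nread;

        memset(&ov, 0, sizeof(ov));
        /* the read offset travels in the OVERLAPPED struct */
        ov.Offset = (DWORD) offset;
        ov.OffsetHigh = (DWORD) (offset >> 32);

        if (!ReadFile(h, buf, nbytes, &nread, &ov))
            return -1;
        return (int) nread;
    }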

On Thu, Sep 6, 2018 at 6:42 AM Jesper Pedersen
 wrote:
> Once resolved the patch passes make check-world, and a strace analysis
> shows the associated read()/write() have been turned into
> pread64()/pwrite64(). All lseek()'s are SEEK_END's.

Yeah :-)  Just for fun, here is the truss output for a single pgbench
transaction here:

recvfrom(9,"B\0\0\0\^R\0P0_4\0\0\0\0\0\0\^A"...,8192,0,NULL,0x0) = 41 (0x29)
sendto(9,"2\0\0\0\^Dn\0\0\0\^DC\0\0\0\nBEG"...,27,0,NULL,0) = 27 (0x1b)
recvfrom(9,"B\0\0\0&\0P0_5\0\0\0\0\^B\0\0\0"...,8192,0,NULL,0x0) = 61 (0x3d)
pread(22,"\0\0\0\0\^P\M^C\M-@P\0\0\0\0\M-X"...,8192,0x960a000) = 8192 (0x2000)
pread(20,"\0\0\0\0\M^X\^D\M^Iq\0\0\0\0\^T"...,8192,0x380fe000) = 8192 (0x2000)
sendto(9,"2\0\0\0\^Dn\0\0\0\^DC\0\0\0\rUPD"...,30,0,NULL,0) = 30 (0x1e)
recvfrom(9,"B\0\0\0\^]\0P0_6\0\0\0\0\^A\0\0"...,8192,0,NULL,0x0) = 52 (0x34)
sendto(9,"2\0\0\0\^DT\0\0\0!\0\^Aabalance"...,75,0,NULL,0) = 75 (0x4b)
recvfrom(9,"B\0\0\0"\0P0_7\0\0\0\0\^B\0\0\0"...,8192,0,NULL,0x0) = 57 (0x39)
sendto(9,"2\0\0\0\^Dn\0\0\0\^DC\0\0\0\rUPD"...,30,0,NULL,0) = 30 (0x1e)
recvfrom(9,"B\0\0\0!\0P0_8\0\0\0\0\^B\0\0\0"...,8192,0,NULL,0x0) = 56 (0x38)
lseek(29,0x0,SEEK_END)   = 8192 (0x2000)
sendto(9,"2\0\0\0\^Dn\0\0\0\^DC\0\0\0\rUPD"...,30,0,NULL,0) = 30 (0x1e)
recvfrom(9,"B\0\0\0003\0P0_9\0\0\0\0\^D\0\0"...,8192,0,NULL,0x0) = 74 (0x4a)
sendto(9,"2\0\0\0\^Dn\0\0\0\^DC\0\0\0\^OIN"...,32,0,NULL,0) = 32 (0x20)
recvfrom(9,"B\0\0\0\^S\0P0_10\0\0\0\0\0\0\^A"...,8192,0,NULL,0x0) = 42 (0x2a)
pwrite(33,"\M^X\M-P\^E\0\^A\0\0\0\0\M-`\M^A"...,16384,0x81e000) = 16384 (0x4000)
fdatasync(0x21)  = 0 (0x0)
sendto(9,"2\0\0\0\^Dn\0\0\0\^DC\0\0\0\vCOM"...,28,0,NULL,0) = 28 (0x1c)

There is only one lseek() left.  I actually have another patch that
gets rid of even that (by caching sizes in SMgrRelation using a shared
invalidation counter which I'm not yet sure about).  Then pgbench's
7-round-trip transaction makes only the strictly necessary 18 syscalls
(every one an explainable network message, disk page or sync).
Unpatched master has 5 extra lseek()s.
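
In miniature, the syscall difference being measured is just this
(a sketch, not the patch):

    /* before: two syscalls, and the seek offset is shared mutable state */
    lseek(fd, offset, SEEK_SET);
    read(fd, buf, BLCKSZ);

    /* after: one syscall, with the offset passed explicitly */
    pread(fd, buf, BLCKSZ, offset);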

-- 
Thomas Munro
http://www.enterprisedb.com


0001-Use-pread-pwrite-instead-of-lseek-read-write-v4.patch
Description: Binary data


Re: Usage of epoch in txid_current

2018-09-18 Thread Thomas Munro
On Tue, Jul 24, 2018 at 5:24 PM Thomas Munro
 wrote:
> Here's a new version.

Bitrot, rebased.

-- 
Thomas Munro
http://www.enterprisedb.com


0001-Track-the-next-xid-using-64-bits-v6.patch
Description: Binary data


Re: pgsql: Allow concurrent-safe open() and fopen() in frontend code for Wi

2018-09-18 Thread Michael Paquier
On Tue, Sep 18, 2018 at 06:04:58PM +0200, Laurenz Albe wrote:
> That wouldn't influence pipes, which was what Michael said was a
> problem for pg_dump.

Yeah, the authentication blows up badly on that...  You can see all the
tests using user and database names covering the full ASCII range turn red.

> I currently have no Windows system close, so I cannot test...

I have spent a large portion of my morning trying to test all the
solutions proposed, and a winner shows up.  Attached are three patches
which present all the solutions mentioned, attached here for visibility:
- win32-open-michael.patch, or the solution I was working on.  This has
actually led me nowhere.  Even trying to enforce _fmode only on the
frontend has caused breakages of pg_dump for authentication.  Trying to
be smarter than what other binaries do is nice and consistent with the
Windows experience, I think, but it looks like this can lead to
breakages when a utility uses setmode() for some of the output files.
I noticed that pgwin32_open failed to enforce the mode used when
calling _fdmode, but that solves nothing.
- win32-open-tom.patch, which switches pgwin32_fopen() to use text mode
all the time if binary is not specified.  While this looked promising,
I noticed that it causes the same set of errors in the pg_dump TAP
tests as my previous patch.  Anyway, a solution also needs to cover
pgwin32_open(), as we want direct calls to it to behave correctly.
- win32-open-laurenz.patch, which enforces text mode only if binary
mode is not defined, mapping strictly to what pre-11 does when calling
the system _open or _fopen.  And surprisingly, this is proving to pass
all the tests I ran: bincheck (including pgbench and pg_dump),
upgradecheck, recoverycheck, check, etc.  initdb --pwfile is not
complaining to me either.

So the odds are that Laurenz's solution is what we are looking for.
Laurenz, Tom, any thoughts to share?  I won't back-patch that ;)
--
Michael
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index b53d6eb9cc..cb8c7450d9 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -491,15 +491,7 @@ readfile(const char *path)
 	char	   *buffer;
 	int			c;
 
-#ifdef WIN32
-	/*
-	 * On Windows, we have to open the file in text mode so that carriage
-	 * returns are stripped.
-	 */
-	if ((infile = fopen(path, "rt")) == NULL)
-#else
 	if ((infile = fopen(path, "r")) == NULL)
-#endif
 	{
 		fprintf(stderr, _("%s: could not open file \"%s\" for reading: %s\n"),
 progname, path, strerror(errno));
diff --git a/src/port/open.c b/src/port/open.c
index a3ad946a60..9174cdce66 100644
--- a/src/port/open.c
+++ b/src/port/open.c
@@ -122,11 +122,35 @@ pgwin32_open(const char *fileName, int fileFlags,...)
 		return -1;
 	}
 
+	/*
+	 * If caller has set neither O_BINARY nor O_TEXT, then look for what
+	 * the default mode is for this process, then enforce it.  This can
+	 * be done only after opening the file and before switching the file
+	 * mode.
+	 */
+	if ((fileFlags & (O_BINARY | O_TEXT)) == 0)
+	{
+		int		pmode = 0;
+
+		/* only MSVC newer than 2015 support _get_fmode */
+#if (_MSC_VER >= 1900)
+		if (_get_fmode(&pmode) < 0)
+		{
+			/* _get_fmode sets errno */
+			CloseHandle(h);
+			return -1;
+		}
+#else
+		pmode = _fmode;
+#endif
+
+		fileFlags |= pmode;
+	}
+
 	/* _open_osfhandle will, on error, set errno accordingly */
 	if ((fd = _open_osfhandle((intptr_t) h, fileFlags & O_APPEND)) < 0)
 		CloseHandle(h);			/* will not affect errno */
-	else if (fileFlags & (O_TEXT | O_BINARY) &&
-			 _setmode(fd, fileFlags & (O_TEXT | O_BINARY)) < 0)
+	else if (_setmode(fd, fileFlags & (O_TEXT | O_BINARY)) < 0)
 	{
 		_close(fd);
 		return -1;
@@ -138,6 +162,7 @@ pgwin32_open(const char *fileName, int fileFlags,...)
 FILE *
 pgwin32_fopen(const char *fileName, const char *mode)
 {
+	char		fdmode[32];
 	int			openmode = 0;
 	int			fd;
 
@@ -160,7 +185,36 @@ pgwin32_fopen(const char *fileName, const char *mode)
 	fd = pgwin32_open(fileName, openmode);
 	if (fd == -1)
 		return NULL;
-	return _fdopen(fd, mode);
+
+	strcpy(fdmode, mode);
+
+	/*
+	 * Like pgwin32_open, look for the default mode to be used for this
+	 * file descriptor.  Note that it is important to do that again here
+	 * for _fdopen below.
+	 */
+	if ((openmode & (O_BINARY | O_TEXT)) == 0)
+	{
+		int		pmode = 0;
+
+		/* only MSVC newer than 2015 support _get_fmode */
+#if (_MSC_VER >= 1900)
+		if (_get_fmode(&pmode) < 0)
+		{
+			/* _get_fmode sets errno */
+			_close(fd);
+			return NULL;
+		}
+#else
+		pmode = _fmode;
+#endif
+
+		snprintf(fdmode + strlen(fdmode), sizeof(fdmode) - strlen(fdmode),
+				 "%s%s", (pmode & O_BINARY) != 0 ? "b" : "",
+				 (pmode & O_TEXT) != 0 ? "t" : "");
+	}
+
+	return _fdopen(fd, fdmode);
 }
 
 #endif

Re: heap_sync seems rather oblivious to partitioned tables (wal_level=minimal)

2018-09-18 Thread Michael Paquier
On Wed, Sep 19, 2018 at 01:14:10PM +1200, David Rowley wrote:
> On 19 September 2018 at 12:21, Tomas Vondra
>  wrote:
>> So apparently CopyFrom() invokes heap_sync() on the partitioned
>> relation, which attempts to do mdimmedsync() only on the root. That
>> seems like a bug to me.

And the root has no physical storage anyway, so attempting to sync it
is wrong.

>> Obviously this only applies to wal_level=minimal. There are multiple
>> callers of heap_sync(), but only the COPY seems to be affected by this,
>> because the rest can't see partitioned tables.
>>
>> So it seems heap_sync() needs to be improved to sync all partitions.
> 
> Wouldn't it be better to modify copy.c to just perform the heap_sync
> on just the partitions it touches?

Yeah, my gut tells me that this would be the best approach for now,
though I am not sure that it is the best move in the long term.  None
of the other callers of heap_sync care about partitioned tables, so we
could add an assertion on RELKIND_PARTITIONED_TABLE.  Still, one crazy
argument against that approach would be that we'd need to do the same
thing again if one day CTAS or matviews allow some sort of automatic
creation of partitions.
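
For reference, here is a minimal sketch of the copy.c approach being
discussed, not a committed patch: it assumes a hypothetical
touched_rels list of the leaf relations CopyFrom() routed tuples to,
which does not exist today and would need to be maintained.

#include "postgres.h"

#include "access/heapam.h"
#include "catalog/pg_class.h"
#include "nodes/pg_list.h"
#include "utils/rel.h"

/*
 * Sketch only: sync each leaf relation touched by COPY, skipping
 * partitioned tables, which have no physical storage to sync.
 * "touched_rels" is a hypothetical List of Relation pointers.
 */
static void
sync_touched_relations(List *touched_rels)
{
	ListCell   *lc;

	foreach(lc, touched_rels)
	{
		Relation	rel = (Relation) lfirst(lc);

		if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
			continue;

		heap_sync(rel);
	}
}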
--
Michael




Re: when set track_commit_timestamp on, database system abort startup

2018-09-18 Thread Kyotaro HORIGUCHI
At Sat, 15 Sep 2018 19:26:39 +0900, Masahiko Sawada  
wrote in 
> >> To fix that, maybe we can disable commitTs when
> >> controlFile->track_commit_timestamp == false even if the GUC
> >> track_commit_timestamp == true, including in crash recovery.
> >
> > Hmm, so keep it off while crash recovery runs, and once it's out of that
> > then enable it automatically?
> 
> Yes. The attached patch changes it to check
> controlFile->track_commit_timestamp even in the crash recovery case. If
> track_commit_timestamp is set to true in the config file, it's enabled
> at the end of recovery.
> 
> > That might work -- by definition we don't
> > care about the commit TSs of the transactions replayed during crash
> > recovery, since they were executed on a primary that didn't have
> > commitTs enabled anyway.
> >
> > It seems like the first thing we need is TAP cases that reproduce these
> > two crash scenarios.
> 
> I attached a TAP test that reproduces this issue. We can reproduce it
> even with a single server, by making postgres replay a commit WAL
> record during crash recovery after consuming transactions and enabling
> track_commit_timestamp.

The fix looks good to me.  The TAP test works fine.

In the TAP test:


The test script lacks a general description of its objective.


+$node->append_conf('postgresql.conf',
+ "track_commit_timestamp = off");
+$node->start;
+
+# When we start for the first time after initdb, a PARAMETER_CHANGES
+# record is emitted at the end of recovery, which would disable
+# track_commit_timestamp if crash recovery replayed that WAL.
+# Therefore we restart the server so that we can recover from a
+# point that doesn't contain that WAL.
+$node->restart;

The first startup actually doesn't emit a PARAMETER_CHANGES record. If
track_commit_timestamp were set to on, we would get one. However, I
agree that it is reasonable to eliminate the possibility of being
affected by the record. How about something like this?

+# We don't want to replay a PARAMETER_CHANGES record during the
+# crash recovery test below. None is expected to have been emitted
+# thus far, but we restart the server to get rid of any just in
+# case.



+# During the crash recovery we replay the commit WAL that sets
+# the commit timestamp to a new page.
+$node->start;

The comment describes the unexpected symptom. Shouldn't it describe
the desired behavior instead?

+# During crash recovery the server will replay COMMIT records
+# emitted while commit timestamps were off. The server can start
+# only if we correctly avoid processing commit timestamps for
+# those records.


The following error message is seen when the issue happens.

> DETAIL:  Could not read from file "pg_commit_ts/" at offset 8192: Success.

This seems to be what 811b6e36a9 tried to fix. It should look like the following:

> DETAIL:  Could not read from file "pg_commit_ts/" at offset 8192: read 0 
> of 8192

As one of the reviewers of that commit, I don't remember how this was
overlooked, but it also needs a fix, either in this patch or as a
separate issue.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: when set track_commit_timestamp on, database system abort startup

2018-09-18 Thread Michael Paquier
On Wed, Sep 19, 2018 at 12:12:44PM +0900, Kyotaro HORIGUCHI wrote:
> The fix looks good to me.  The TAP test works fine.
> 
> In the TAP test:
> 
> 
> The test script lacks a general description about its objective.

That's always welcome.  The patch looks sensible to me.

> The following error message is seen when the issue happens.
> 
>> DETAIL:  Could not read from file "pg_commit_ts/" at offset 8192: 
>> Success.
> 
> This seems what 811b6e36a9 tried to fix. This should be like the follows.
> 
>> DETAIL:  Could not read from file "pg_commit_ts/" at offset 8192: read 0 
>> of 8192
> 
> As one of the reviewers of that commit, I don't remember how this was
> overlooked, but it also needs a fix, either in this patch or as a
> separate issue.

Not completely overlooked...  Please see here why I have left out the
bits in slru.c, on purpose:
https://www.postgresql.org/message-id/20180613035038.ge3...@paquier.xyz
If you want to address that as well, please feel free to send a patch.
Reporting the number of bytes read would be the nicest error report, but
this requires a new field similar to slru_errno which includes an error
string to use if errno is 0.
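
For illustration, here is a standalone sketch of that reporting idea.
The names are hypothetical: slru_errdetail is made up, only slru_errno
resembles the existing field, and the segment name in the output is
just an example.

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/types.h>

/* Hypothetical state, in the spirit of the existing slru_errno. */
static int	slru_errno;
static char slru_errdetail[64];		/* used when errno would be 0 */

static void
record_read_result(ssize_t got, size_t wanted)
{
	if (got < 0)
		slru_errno = errno;			/* a real system error */
	else
	{
		/* Short read: errno stays 0, so keep an explicit message. */
		slru_errno = 0;
		snprintf(slru_errdetail, sizeof(slru_errdetail),
				 "read %zd of %zu", got, wanted);
	}
}

static const char *
read_error_string(void)
{
	return slru_errno != 0 ? strerror(slru_errno) : slru_errdetail;
}

int
main(void)
{
	record_read_result(0, 8192);	/* simulate reading past EOF */
	printf("Could not read from file \"pg_commit_ts/0000\": %s\n",
		   read_error_string());
	return 0;
}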
--
Michael




Re: Something fishy happening on frogmouth

2018-09-18 Thread Kyotaro HORIGUCHI
Thank you for finding and fixing this.

At Sat, 15 Sep 2018 18:21:52 -0400, Tom Lane  wrote in 
<1.1537050...@sss.pgh.pa.us>
> Noah Misch  writes:
> > Usually, the first srandom() call happens early in PostmasterMain().  I plan
> > to add one to InitStandaloneProcess(), which substitutes for several tasks
> > otherwise done in PostmasterMain().  That seems like a good thing even if DSM
> > weren't in the picture.  Also, initdb needs an srandom() somewhere;
> > choose_dsm_implementation() itself seems fine.  Attached.
> 
> +1, but some comments would be good.
> 
>   regards, tom lane
> 

+1, too.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: hot_standby_feedback vs excludeVacuum and snapshots

2018-09-18 Thread Thomas Munro
On Fri, Sep 14, 2018 at 6:09 PM Kyotaro HORIGUCHI
 wrote:
> At Sat, 4 Aug 2018 14:09:18 +1200, Thomas Munro 
>  wrote in
> > Does anyone know why StandbyReleaseLocks() releases all locks if
> > passed InvalidTransactionId?  When would that happen?
>
> AFAICS, it used to be used at shutdown time since hot standby was
> introduced by efc16ea520 from
> ShutdownRecoveryTransactionEnvironment/StandbyReleaseAllLocks.
>
> c172b7b02e (Jan 23 2012) modified StandbyReleaseAllLocks not to
> call StandbyReleaseLocks with InvalidTransactionId, at which point
> the feature became dead code, and it still is.
>
> So I think the feature has been obsolete for a long time.

Thank you for this analysis.  It looks like dead code that we should
remove in master at least.

> As a similar issue, the following commands leave an AEL even though
> the savepoint is rolled back.
>
> BEGIN; SAVEPOINT s; LOCK foo; CHECKPOINT; ROLLBACK TO SAVEPOINT s;
>
> This is because the checkpoint issues XLOG_STANDBY_LOCK on foo
> with the top-transaction XID.
>
> Every checkpoint issues it for all existing locks, so the
> RecoveryLockList(s) can be bloated with duplicate lock entries and
> inflated lock counts. Although it doesn't seem common to hold
> AELs long enough for the list length to hurt, I don't think such
> duplication is good. Patch attached.

I noticed that too.  It seems like it would take a very long time to
cause a problem.
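
For illustration, here is a standalone sketch of the kind of duplicate
check such a patch could add before recording a replayed lock.  The
structures are simplified and hypothetical; the real standby.c keeps a
hash table of per-xid lock lists.

#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in for the per-xid lock list entries in standby.c. */
typedef struct LockEntry
{
	unsigned int xid;
	unsigned int dbOid;
	unsigned int relOid;
	struct LockEntry *next;
} LockEntry;

static bool
lock_already_held(const LockEntry *head, unsigned int xid,
				  unsigned int dbOid, unsigned int relOid)
{
	for (const LockEntry *e = head; e != NULL; e = e->next)
	{
		if (e->xid == xid && e->dbOid == dbOid && e->relOid == relOid)
			return true;
	}
	return false;
}

int
main(void)
{
	LockEntry	held = {100, 1, 16384, NULL};

	/* Replaying the same XLOG_STANDBY_LOCK a second time should be a no-op. */
	printf("%s\n", lock_already_held(&held, 100, 1, 16384)
		   ? "duplicate, skip" : "add entry");
	return 0;
}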

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Horrible CREATE DATABASE Performance in High Sierra

2018-09-18 Thread Tom Lane
I wrote:
> Peter Eisentraut  writes:
>> What is the status of this?  Is performance on High Sierra still bad?

> I committed the fix at 643c27e36.  If Apple have done anything about the
> underlying problem, you couldn't tell it from their non-response to my
> bug report.

So, after just about one year of radio silence, I received email from
Apple saying that (a) they'd closed my bug report as a duplicate of
another one, and (b) that other one was also closed, and (c) not one
other iota of information.

This seems to be standard practice for them, though I'm at a loss
to say why they consider it even minimally acceptable.  From here,
it seems like a great way to piss people off and ensure they won't
bother filing any future bug reports.

Anyway, reading the tea leaves and considering the timing, one might
guess that they actually fixed the problem as of macOS Mojave.

regards, tom lane



Re: Problem while setting the fpw with SIGHUP

2018-09-18 Thread Michael Paquier
On Tue, Sep 18, 2018 at 04:15:42PM +0900, Kyotaro HORIGUCHI wrote:
> My latest patch tries to remove the window by imposing all
> responsibility to apply config file changes to the shared FPW
> flag on the checkpointer. RecoveryInProgress() is changed to be
> called prior to UpdateFullPageWrites along the way.

I still need to look at that in detail.  It may be better than what I
am proposing.  At a quick glance what I propose is simpler, and does
not need to enforce a checkpoint using SIGINT.

> InRecovery is turned off after the last UpdateFullPageWrites() and
> before SharedRecoveryInProgress is turned off. So it is still
> leaving the window. Thus, the checkpointer stops flipping the value
> before SharedRecoveryInProgress is turned off. (I don't
> understand the InRecovery condition..) However, this lets the
> checkpointer omit reloading after UpdateFullPageWrites() in
> startup until SharedRecoveryInProgress is turned off.

That won't matter in this case, as RecoveryInProgress() gets called
outside the critical section in the previous patch I sent, so there is
no failure.  Let's not forget that the first issue is the allocation in
the critical section.  The second issue is that UpdateFullPageWrites may
be called concurrently across multiple processes, which is not something
it is designed for.  The second issue gets addressed in my proposal by
making sure that the checkpointer never tries to update full_page_writes
by itself until recovery has finished, and that the startup process is
the only updater once recovery ends.
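
To illustrate the intended handoff, here is a standalone toy model
using C11 atomics and pthreads, not PostgreSQL code: the
"checkpointer" thread refuses to touch the shared flag while recovery
is in progress, and the "startup" thread performs the one forced
update before recovery is marked as finished.

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_bool recovery_in_progress = true;
static atomic_bool shared_fpw = true;

/* Mirrors the check-RecoveryInProgress()-before-updating rule. */
static void
update_full_page_writes(bool force, bool new_value)
{
	if (atomic_load(&recovery_in_progress) && !force)
		return;					/* nothing to do while in recovery */
	atomic_store(&shared_fpw, new_value);
}

static void *
checkpointer(void *arg)
{
	(void) arg;
	/* A reload arriving during recovery is ignored, not half-applied. */
	update_full_page_writes(false, false);
	return NULL;
}

int
main(void)
{
	pthread_t	t;

	pthread_create(&t, NULL, checkpointer, NULL);
	pthread_join(t, NULL);

	/* Startup process: one forced update, then recovery officially ends. */
	update_full_page_writes(true, false);
	atomic_store(&recovery_in_progress, false);

	printf("full_page_writes = %s\n",
		   atomic_load(&shared_fpw) ? "on" : "off");
	return 0;
}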

> Agreed. If we need to do that in the startup process, we need to
> change the shared flag and always issue FPW_CHANGE when the
> database state differs from the configuration file, regardless of
> what StartupXLOG() did up to that point.

Definitely my mistake here.  Attached is a patch to show what I have in
mind, by making sure that the startup process generates a record when
switching full_page_writes even when started normally.  This removes
the condition based on InRecovery, and uses a new argument for
UpdateFullPageWrites() instead.  Your test case, as well as what I do
manually for testing, passes without triggering the assertion.

+   /* DEBUG: cause a reload */
+   {
+   struct stat b;
+   if (stat("/tmp/hoge", &b) == 0)
+   {
+   elog(LOG, "STARTUP SLEEP FOR 1s");
+   sleep(1);
+   elog(LOG, "DONE.");
+   DirectFunctionCall1(pg_reload_conf, 0);
+   }
+   }
Patching the backend this way is always nice to see, as it makes bugs
easy to reproduce, and it actually helped in reproducing the assertion
failure correctly ;)
--
Michael
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3025d0badb..079e1814c1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7719,10 +7719,15 @@ StartupXLOG(void)
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
 	 * record is written.
+	 *
+	 * It is safe to check the shared full_page_writes without the lock,
+	 * because there is no concurrently running process able to update it.
+	 * The only other process able to update full_page_writes is the
+	 * checkpointer, still it is unable to do so until recovery finishes.
 	 */
 	Insert->fullPageWrites = lastFullPageWrites;
 	LocalSetXLogInsertAllowed();
-	UpdateFullPageWrites();
+	UpdateFullPageWrites(true);
 	LocalXLogInsertAllowed = -1;
 
 	if (InRecovery)
@@ -9693,14 +9698,28 @@ XLogReportParameters(void)
  * Update full_page_writes in shared memory, and write an
  * XLOG_FPW_CHANGE record if necessary.
  *
- * Note: this function assumes there is no other process running
- * concurrently that could update it.
+ * Note: this can be called from the checkpointer, or the startup process
+ * at the end of recovery.  One could think that this routine should be
+ * careful with its lock handling, however this is a no-op for the
+ * checkpointer until the startup process marks the end of recovery,
+ * so only one of them can do the work of this routine at once.
  */
 void
-UpdateFullPageWrites(void)
+UpdateFullPageWrites(bool force)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
 
+	/*
+	 * Check if recovery is still in progress before entering this critical
+	 * section, as some memory allocation could happen at the end of
+	 * recovery.  There is nothing to do for a system still in recovery.
+	 * Note that the caller may still want to enforce an update to happen.
+	 * One such example is the startup process, which updates full_page_writes
+	 * at the end of recovery.
+	 */
+	if (RecoveryInProgress() && !force)
+		return;
+
 	/*
 	 * Do nothing if full_page_writes has not been changed.
 	 *
@@ -9731,7 +9750,7 @@ UpdateFullPageWrites(void)
 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep track of
 	 * full_page_writes during archive recovery, if required.
 	 */
-	if (XL

Re: [HACKERS] Horrible CREATE DATABASE Performance in High Sierra

2018-09-18 Thread Gavin Flower

On 19/09/2018 16:38, Tom Lane wrote:
> I wrote:
>> Peter Eisentraut  writes:
>>> What is the status of this?  Is performance on High Sierra still bad?
>
>> I committed the fix at 643c27e36.  If Apple have done anything about the
>> underlying problem, you couldn't tell it from their non-response to my
>> bug report.
>
> So, after just about one year of radio silence, I received email from
> Apple saying that (a) they'd closed my bug report as a duplicate of
> another one, and (b) that other one was also closed, and (c) not one
> other iota of information.
>
> This seems to be standard practice for them, though I'm at a loss
> to say why they consider it even minimally acceptable.  From here,
> it seems like a great way to piss people off and ensure they won't
> bother filing any future bug reports.
>
> Anyway, reading the tea leaves and considering the timing, one might
> guess that they actually fixed the problem as of macOS Mojave.
>
> regards, tom lane


Don't worry your tiny little mind, Apple knows best what is good for you!


[Smileys omitted, due to budget constraints]





Re: Changing the setting of wal_sender_timeout per standby

2018-09-18 Thread Michael Paquier
On Wed, Sep 19, 2018 at 12:14:57AM +, Tsunakawa, Takayuki wrote:
> From: Masahiko Sawada [mailto:sawada.m...@gmail.com]
> > I didn't follow the first sentence of the above hunk. Is the
> > wal_sender_timeout relevant with %q?
> 
> Ouch, that's a careless mistake.  I copied the paragraph from another
> parameter and failed to remove some sentence.  Patch revised.

-A value of zero disables the timeout mechanism.  This parameter
-can only be set in
-the postgresql.conf file or on the server
-command line.
+A value of zero disables the timeout mechanism.
+Only superusers can change this parameter at session start,
+and it cannot be changed at all within a session.
 The default value is 60 seconds.

Parameters classified as PGC_BACKEND can be updated by any user, and
those marked as PGC_SU_BACKEND can only be updated by superusers.
Replication users are not superusers, which is why PGC_BACKEND is the
most adapted.  Your description should just say "This parameter cannot
be changed after session start."
--
Michael




Re: Tid scan improvements

2018-09-18 Thread Edmund Horner
On Mon, 17 Sep 2018 at 23:21, David Rowley  wrote:
> On 15 August 2018 at 11:11, Edmund Horner  wrote:
>> So we'd extend that to:
>>   - Include in the OR-list "range" subquals of the form (ctid > ? AND
>> ctid < ?) (either side could be optional, and we have to deal with >=
>> and <= and having ctid on the rhs, etc.).
>>   - Cost the range subquals by assuming they don't overlap, and
>> estimating how many blocks and tuples they span.
>>   - When beginning the scan, evaluate all the ?s and build an array of
>> "tid ranges" to fetch.  A tid range is a struct with a starting tid,
>> and an ending tid, and might just be a single tid item.
>>   - Sort and remove duplicates.
>>   - Iterate over the array, using a single fetch for single-item tid
>> ranges, and starting/ending a heap scan for multi-item tid ranges.
>>
>> I think I'll try implementing this.
>
>
> I've set this patch as waiting on author in the commitfest app.

Thanks David.

Between work commitments I have found time here and there to work on
it, but making a path type that handles all of the above is turning out
to be surprisingly hard, much harder than my tid range scan was.

In the earlier discussion from 2012, Tom Lane said:

> Bruce Momjian  writes:
> > On Wed, Jun 13, 2012 at 03:21:17PM -0500, Merlin Moncure wrote:
> >> IMNSHO, it's a no-brainer for the todo (but I think it's more
> >> complicated than adding some comparisons -- which are working now):
>
> > I see. Seems we have to add index smarts to those comparisons. That
> > might be complicated.
>
> Uh, the whole point of a TID scan is to *not* need an index.
>
> What would be needed is for tidpath.c to let through more kinds of TID
> comparison quals than it does now, and then for nodeTidscan.c to know
> what to do with them. The latter logic might well look something like
> btree indexscan qual preparation, but it wouldn't be the same code.

I have been generally following this approach (handling more kinds of
TID comparisons), and have found myself doing things like pairing up >
with <, estimating how much of a table is covered by some set of >, <,
or "> AND <" quals, etc. -- things that I'm sure are handled in a more
advanced way by index paths.  Unfortunately I didn't see any easily
reusable code in the index path code, so I've ended up writing
special-case code for TID scans.  Hopefully it will be worth it.
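
For what it's worth, here is a standalone sketch of the
sort-and-remove-duplicates step from the plan quoted above.  The
TidRange struct is hypothetical, with tids flattened to plain integer
keys for brevity; the real executor would work with ItemPointerData
pairs.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical range representation: start/end tids, inclusive. */
typedef struct TidRange
{
	long		start;			/* first tid of the range */
	long		end;			/* last tid, inclusive */
} TidRange;

static int
tidrange_cmp(const void *a, const void *b)
{
	const TidRange *ra = a;
	const TidRange *rb = b;

	return (ra->start > rb->start) - (ra->start < rb->start);
}

/* Sort ranges, then merge overlaps and duplicates in place. */
static int
merge_tid_ranges(TidRange *ranges, int n)
{
	int			out = 0;

	if (n == 0)
		return 0;
	qsort(ranges, n, sizeof(TidRange), tidrange_cmp);
	for (int i = 1; i < n; i++)
	{
		if (ranges[i].start <= ranges[out].end + 1)
		{
			if (ranges[i].end > ranges[out].end)
				ranges[out].end = ranges[i].end;
		}
		else
			ranges[++out] = ranges[i];
	}
	return out + 1;
}

int
main(void)
{
	TidRange	r[] = {{10, 20}, {15, 25}, {40, 40}, {10, 20}};
	int			n = merge_tid_ranges(r, 4);

	for (int i = 0; i < n; i++)
		printf("[%ld, %ld]\n", r[i].start, r[i].end);
	return 0;
}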

Edmund



Re: Changing the setting of wal_sender_timeout per standby

2018-09-18 Thread Michael Paquier
On Wed, Sep 19, 2018 at 02:40:31PM +0900, Michael Paquier wrote:
> Parameters classified as PGC_BACKEND can be updated by any user, and
> those marked as PGC_SU_BACKEND can only be updated by superusers.
> Replication users are not superusers, which is why PGC_BACKEND is the
> most adapted.  Your description should just say "This parameter cannot
> be changed after session start."

Actually, now that I look at guc.c, using PGC_BACKEND breaks one case:
such parameters cannot be changed even with SIGHUP, and that holds even
if the session relies on the default value.  So you could break
applications relying on reloads.  PGC_USERSET would actually do the
work: if the session uses a non-default value specified by SET or within
the connection string, then SIGHUP updates have no effect.  On the
contrary, if the client relies on the default value, then SIGHUP updates
take effect.  Sorry for the confusion, I should have looked at guc.c
more carefully.
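
To make the PGC_USERSET semantics concrete, here is a standalone toy
model (not the guc.c data structures) of how an explicit session-level
setting shadows later reloads while a default-tracking session follows
them:

#include <stdbool.h>
#include <stdio.h>

/* Toy model of one GUC; field names are illustrative only. */
typedef struct Guc
{
	int			file_value;		/* from postgresql.conf */
	int			session_value;	/* from SET or the connection string */
	bool		session_set;
} Guc;

static int
effective_value(const Guc *g)
{
	return g->session_set ? g->session_value : g->file_value;
}

int
main(void)
{
	Guc			wal_sender_timeout = {.file_value = 60000};

	/* A SIGHUP reload changes the file value... */
	wal_sender_timeout.file_value = 30000;
	printf("default-tracking session sees %d ms\n",
		   effective_value(&wal_sender_timeout));

	/* ...but an explicit session-level setting shadows later reloads. */
	wal_sender_timeout.session_value = 10000;
	wal_sender_timeout.session_set = true;
	wal_sender_timeout.file_value = 45000;
	printf("SET session sees %d ms\n",
		   effective_value(&wal_sender_timeout));
	return 0;
}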
--
Michael




RE: Changing the setting of wal_sender_timeout per standby

2018-09-18 Thread Tsunakawa, Takayuki
From: Michael Paquier [mailto:mich...@paquier.xyz]
> Parameters classified as PGC_BACKEND can be updated by any user, and those
> marked as PGC_SU_BACKEND can only be updated by superusers.
> Replication users are not superusers, which is why PGC_BACKEND is the
> most adapted.  Your description should just say "This parameter cannot be
> changed after session start."

How embarrassing...  I'm sorry to have caused you the trouble of
pointing out a silly mistake like this (I thought I had written it as
you suggested, but it seems I was not my usual self.)  The revised patch
is attached.


Regards
Takayuki Tsunakawa





walsender_timeout_PGC_BACKEND_v3.patch
Description: walsender_timeout_PGC_BACKEND_v3.patch


Re: PostgreSQL 11 {Beta 4, RC1} Release: 2018-09-20

2018-09-18 Thread Noah Misch
On Tue, Sep 18, 2018 at 12:15:45PM -0400, Jonathan S. Katz wrote:
> On 9/13/18 12:51 PM, Jonathan S. Katz wrote:
> > We are planning to have another release of PostgreSQL 11, either Beta 4
> > or RC1, next week on Thursday, 2018-09-20. The version will be
> > determined based on the state of the open items list[1] around the time
> > of stamping.
> 
> PostgreSQL 11 was stamped Beta 4[1] yesterday evening for release on
> 2018-09-20.
> 
> There is a copy of the Beta 4 release notes draft here[2]. I would
> greatly appreciate any review and feedback on them, particularly around
> technical accuracy and any omissions that should be included.

s/verifiy/verify/.  Looks good otherwise.



Re: Tid scan improvements

2018-09-18 Thread David Rowley
On 19 September 2018 at 18:04, Edmund Horner  wrote:
> I have been generally following this approach (handling more kinds of
> TID comparisons), and have found myself doing things like pairing up >
> with <, estimating how much of a table is covered by some set of >, <,
> or "> AND <" quals, etc.  Things that I'm sure are handled in an
> advanced way by index paths; unfortunately I didn't see any easily
> reusable code in the index path code.  So I've ended up writing
> special-case code for TID scans.  Hopefully it will be worth it.

I don't think it would need to be as complex as the index matching
code. Just looping over the quals and gathering up all compatible ctid
quals should be fine.  I imagine the complex handling of sorting the
quals by ctid and removal of redundant quals that are covered by some
range would be done in the executor.

Probably the costing will get more complex. At the moment it seems we
add a random_page_cost per ctid, but you'd probably need to make that
better and loop over the quals in each implicitly ANDed set, find the
max ctid for the > / >= quals and the min ctid for the < / <= quals,
then get the page number from each and charge (max - min) *
seq_page_cost, then add random_page_cost for any remaining equality
quals.  The costs from other OR branches can likely just be added on.
This would double count if someone did WHERE ctid BETWEEN '(0,0)' AND
'(100,300)' OR ctid BETWEEN '(0,0)' AND '(100,300)';  The current code
seems to double count for duplicate ctids anyway. It even double counts
if the ctid being compared to is on the same page as another ctid, so I
don't think that would be unacceptable.
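
As a rough illustration of that costing arithmetic, here is a
standalone sketch with made-up cost constants; in the planner the real
values come from the seq_page_cost and random_page_cost GUCs, and the
function name is hypothetical.

#include <stdio.h>

/* Made-up constants standing in for the planner's cost GUCs. */
#define SEQ_PAGE_COST		1.0
#define RANDOM_PAGE_COST	4.0

/*
 * Cost one implicitly ANDed set of ctid quals: charge a sequential
 * page cost per page in [first_page, last_page], plus one random
 * page cost per leftover equality qual.
 */
static double
tid_range_cost(int first_page, int last_page, int n_equality_quals)
{
	return (last_page - first_page + 1) * SEQ_PAGE_COST +
		n_equality_quals * RANDOM_PAGE_COST;
}

int
main(void)
{
	/* e.g. ctid > '(0,1)' AND ctid < '(100,300)', plus one ctid = qual */
	printf("estimated cost = %.1f\n", tid_range_cost(0, 100, 1));
	return 0;
}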

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services