RE: Log a sample of transactions
Dear Adrien, I applied your file and ran into a possible bug. (I have not read your source yet. Sorry.) When I tried to use tab-completion, some unexpected messages appeared. The messages I saw are attached to this mail. Best Regards, Hayato Kuroda Fujitsu LIMITED postgres=# SET client_min_messages to LOG; SET postgres=# SET log_transaction_sample_rate to 1; SET postgres=# select * from LOG: duration: 4.906 ms statement: SELECT pg_catalog.quote_ident(c.relname) FROM pg_catalog.pg_class c WHERE c.relkind IN ('r', 'S', 'v', 'm', 'f', 'p') AND substring(pg_catalog.quote_ident(c.relname),1,0)='' AND pg_catalog.pg_table_is_visible(c.oid) AND c.relnamespace <> (SELECT oid FROM pg_catalog.pg_namespace WHERE nspname = 'pg_catalog') UNION SELECT pg_catalog.quote_ident(n.nspname) || '.' FROM pg_catalog.pg_namespace n WHERE substring(pg_catalog.quote_ident(n.nspname) || '.',1,0)='' AND (SELECT pg_catalog.count(*) FROM pg_catalog.pg_namespace WHERE substring(pg_catalog.quote_ident(nspname) || '.',1,0) = substring('',1,pg_catalog.length(pg_catalog.quote_ident(nspname))+1)) > 1 UNION SELECT pg_catalog.quote_ident(n.nspname) || '.' || pg_catalog.quote_ident(c.relname) FROM pg_catalog.pg_class c, pg_catalog.pg_namespace n WHERE c.relnamespace = n.oid AND c.relkind IN ('r', 'S', 'v', 'm', 'f', 'p') AND substring(pg_catalog.quote_ident(n.nspname) || '.' || pg_catalog.quote_ident(c.relname),1,0)='' AND substring(pg_catalog.quote_ident(n.nspname) || '.',1,0) = substring('',1,pg_catalog.length(pg_catalog.quote_ident(n.nspname))+1) AND (SELECT pg_catalog.count(*) FROM pg_catalog.pg_namespace WHERE substring(pg_catalog.quote_ident(nspname) || '.',1,0) = substring('',1,pg_catalog.length(pg_catalog.quote_ident(nspname))+1)) = 1 LIMIT 1000 LOG: duration: 1.871 ms statement: SELECT pg_catalog.quote_ident(c.relname) FROM pg_catalog.pg_class c WHERE c.relkind IN ('r', 'S', 'v', 'm', 'f', 'p') AND substring(pg_catalog.quote_ident(c.relname),1,0)='' AND pg_catalog.pg_table_is_visible(c.oid) AND c.relnamespace <> (SELECT oid FROM pg_catalog.pg_namespace WHERE nspname = 'pg_catalog') UNION SELECT pg_catalog.quote_ident(n.nspname) || '.' FROM pg_catalog.pg_namespace n WHERE substring(pg_catalog.quote_ident(n.nspname) || '.',1,0)='' AND (SELECT pg_catalog.count(*) FROM pg_catalog.pg_namespace WHERE substring(pg_catalog.quote_ident(nspname) || '.',1,0) = substring('',1,pg_catalog.length(pg_catalog.quote_ident(nspname))+1)) > 1 UNION SELECT pg_catalog.quote_ident(n.nspname) || '.' || pg_catalog.quote_ident(c.relname) FROM pg_catalog.pg_class c, pg_catalog.pg_namespace n WHERE c.relnamespace = n.oid AND c.relkind IN ('r', 'S', 'v', 'm', 'f', 'p') AND substring(pg_catalog.quote_ident(n.nspname) || '.' || pg_catalog.quote_ident(c.relname),1,0)='' AND substring(pg_catalog.quote_ident(n.nspname) || '.',1,0) = substring('',1,pg_catalog.length(pg_catalog.quote_ident(n.nspname))+1) AND (SELECT pg_catalog.count(*) FROM pg_catalog.pg_namespace WHERE substring(pg_catalog.quote_ident(nspname) || '.',1,0) = substring('',1,pg_catalog.length(pg_catalog.quote_ident(nspname))+1)) = 1 LIMIT 1000 foo pg_catalog. pg_toast.public. information_schema. pg_temp_1. pg_toast_temp_1.
Re: O_DIRECT for relations and SLRUs (Prototype)
> On 12 Jan 2019, at 9:46, Michael Paquier wrote: > > Note that pg_attribute_aligned() cannot be used as that's not an > option with clang and a couple of other compilers as far as I know, so > the patch uses a simple set of placeholder buffers large enough to be > aligned with the OS pages, which should be 4k for Linux by the way, > and not set to BLCKSZ, but for WAL's O_DIRECT we don't really care > much with such details. Is it possible to avoid those memcpy's by aligning available buffers instead? I couldn't understand this from the patch and this thread. Best regards, Andrey Borodin.
Re: O_DIRECT for relations and SLRUs (Prototype)
On Tue, Jan 15, 2019 at 11:19:48AM +0500, Andrey Borodin wrote: > Is it possible to avoid those memcpy's by aligning available buffers > instead? I couldn't understand this from the patch and this thread. Sure, it had better do that. That's just a lazy implementation. -- Michael signature.asc Description: PGP signature
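For readers wondering what "aligning available buffers" could look like in practice, here is a minimal sketch of the usual trick: over-allocate a static buffer and round the pointer up to the I/O alignment boundary. The 4096-byte alignment value and the helper name are assumptions for illustration, not code from the actual prototype patch.

#include "postgres.h"

/* Assumed OS page / O_DIRECT alignment; Linux typically wants 4 kB. */
#define OS_IO_ALIGN 4096

static char raw_io_buf[BLCKSZ + OS_IO_ALIGN];

static char *
aligned_io_buffer(void)
{
	/* TYPEALIGN (from c.h) rounds the address up to the requested boundary */
	return (char *) TYPEALIGN(OS_IO_ALIGN, raw_io_buf);
}

Doing the read/write directly into such an aligned buffer would remove the extra memcpy the prototype currently does into its placeholder buffers.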
Re: current_logfiles not following group access and instead follows log_file_mode permissions
On Tue, Jan 15, 2019 at 4:15 PM Michael Paquier wrote: > On Tue, Jan 15, 2019 at 03:08:41PM +1100, Haribabu Kommi wrote: > > current_logfiles is a meta data file, that stores the current log writing > > file, and this file presents in the data directory. This file > > doesn't follow the group access mode set at the initdb time, but it > > follows the log_file_mode permissions. > > > > Without group access permissions, backup with group access can lead to > > failure. Attached patch fix the problem. > > initdb enforces log_file_mode to 0640 when using the group mode, still > if one enforces the parameter value then current_logfiles would just > stick with it. This is not really user-friendly. This impacts also > normal log files as these get included in base backups if the log path > is within the data folder (not everybody uses an absolute path out of > the data folder for the logs). > We hit this problem when log_file_mode is set to 0600 but the database files have group access permissions. In our scenario, the log files are outside the data folder, so we faced the problem only with the current_logfiles file. > One way to think about this is that we may want to worry also about > normal log files and document that one had better be careful with the > setting of log_file_mode? Still, as we are talking about a file > aiming at storing meta-data for log files, something like what you > suggest can make sense. > Yes, with a log_file_mode more restrictive than 0640, keeping the log files inside the data directory can lead to backup failure. And yes, we could add extra information about group access to the documentation for log_file_mode. Another option: how about not letting the user choose a mode more restrictive than 0640 when group access is enabled? > When discussing about pg_current_logfile(), I raised the point about > not including as well in base backups which would also address the > problem reported here. However we decided to keep it because it can > be helpful to know what's the last log file associated to a base > backup for debugging purposes: > > https://www.postgresql.org/message-id/50b58f25-ab07-f6bd-7a68-68f29f214...@dalibo.com > > Instead of what you are proposing, why not revisiting that and just > exclude the file from base backups. I would be in favor of just doing > that instead of switching the file's permission from log_file_mode to > pg_file_create_mode. > I am not sure how useful it is to have the details of the log file in the backup. It may be useful when there is a problem with the backup. Excluding the file from the backup can solve the problem of a backup by an unprivileged user. Are there any scenarios where it can cause problems if it doesn't follow the group access mode? Regards, Haribabu Kommi Fujitsu Australia
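For anyone trying to reproduce the scenario, it boils down to something like the following (paths and values are made up for illustration; the failing step is a file-level backup run by a group-only OS user, not pg_basebackup through the server):

initdb -D /pgdata --allow-group-access       # data files 0640, directories 0750

# postgresql.conf
logging_collector = on
log_directory = '/var/log/postgresql'        # log files kept outside the data dir
log_file_mode = 0600                         # current_logfiles in /pgdata inherits this

# A tar/rsync backup of /pgdata run as an OS user that only has group access
# can read everything except current_logfiles, so the backup fails on that file.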
Re: Pluggable Storage - Andres's take
On Sat, 12 Jan 2019 at 18:11, Dmitry Dolgov <9erthali...@gmail.com> wrote: > > On Sat, Jan 12, 2019 at 1:44 AM Andres Freund wrote: > > > /* other fields were zeroed above */ > > > > > > @@ -9355,7 +9370,7 @@ dumpComment(Archive *fout, const char *type, const > > > char *name, > > >* post-data. > > >*/ > > > ArchiveEntry(fout, nilCatalogId, createDumpId(), > > > - tag->data, namespace, NULL, owner, > > > + tag->data, namespace, NULL, owner, > > > NULL, > > >"COMMENT", SECTION_NONE, > > >query->data, "", NULL, > > >&(dumpId), 1, > > > > We really ought to move the arguments to a struct, so we don't generate > > quite as much useless diffs whenever we do a change around one of > > these... > > That's what I thought too. Maybe then I'll suggest a mini-patch to the master > to > refactor these arguments out into a separate struct, so we can leverage it > here. Then for each of the calls, we would need to declare that structure variable (with = {0}) and assign the required fields before passing it to ArchiveEntry(). But a major reason for ArchiveEntry() is to avoid doing this and instead conveniently pass those fields as parameters. This would add unnecessary lines of code. I think a better way is to have an ArchiveEntry() function with a limited number of parameters, and an ArchiveEntryEx() with those extra parameters which are not needed in usual cases. E.g. we can have tablespace, tableam, dumpFn and dumpArg as the extra arguments of ArchiveEntryEx(), because in most places these are passed as NULL. All future arguments would go in ArchiveEntryEx(). Comments? -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
regress tests fails
Hi make check-world fails t/008_diff_schema.pl .. ok t/009_matviews.pl . ok t/010_truncate.pl . ok t/011_generated.pl # Looks like your test exited with 29 before it could output anything. t/011_generated.pl Dubious, test returned 29 (wstat 7424, 0x1d00) Failed 2/2 subtests Test Summary Report --- t/011_generated.pl (Wstat: 7424 Tests: 0 Failed: 0) Non-zero exit status: 29 Parse errors: Bad plan. You planned 2 tests but ran 0. Files=11, Tests=50, 62 wallclock secs ( 0.09 usr 0.03 sys + 21.82 cusr 7.63 csys = 29.57 CPU) Result: FAIL make[2]: *** [Makefile:19: check] Chyba 1 make[2]: Opouští se adresář „/home/pavel/src/postgresql.master/src/test/subscription“ make[1]: *** [Makefile:48: check-subscription-recurse] Chyba 2 make[1]: Opouští se adresář „/home/pavel/src/postgresql.master/src/test“ make: *** [GNUmakefile:70: check-world-src/test-recurse] Chyba 2 Regards Pavel
Re: regress tests fails
On Tue, 15 Jan 2019 at 10:47, Pavel Stehule wrote: > Hi > > make check-world fails > > t/008_diff_schema.pl .. ok > t/009_matviews.pl . ok > t/010_truncate.pl . ok > t/011_generated.pl # Looks like your test exited with 29 before it > could output anything. > t/011_generated.pl Dubious, test returned 29 (wstat 7424, 0x1d00) > Failed 2/2 subtests > > Test Summary Report > --- > t/011_generated.pl (Wstat: 7424 Tests: 0 Failed: 0) > Non-zero exit status: 29 > Parse errors: Bad plan. You planned 2 tests but ran 0. > Files=11, Tests=50, 62 wallclock secs ( 0.09 usr 0.03 sys + 21.82 cusr > 7.63 csys = 29.57 CPU) > Result: FAIL > make[2]: *** [Makefile:19: check] Chyba 1 > make[2]: Opouští se adresář > „/home/pavel/src/postgresql.master/src/test/subscription“ > make[1]: *** [Makefile:48: check-subscription-recurse] Chyba 2 > make[1]: Opouští se adresář „/home/pavel/src/postgresql.master/src/test“ > make: *** [GNUmakefile:70: check-world-src/test-recurse] Chyba 2 > > I am sorry for the noise. Looks like garbage in my repository. Pavel > Regards > > Pavel >
Re: Pluggable Storage - Andres's take
On Tue, 15 Jan 2019 at 12:27, Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > On Mon, Jan 14, 2019 at 2:07 PM Amit Khandekar > > wrote: > > > > createPQExpBuffer() should be moved after the below statement, so that > > it does not leak memory > > Thanks for noticing, fixed. Looks good. > > > So how about bumping up the archive version and doing these checks ? > > Yeah, you're right, I've added this check. Need to bump K_VERS_MINOR as well. On Mon, 14 Jan 2019 at 18:36, Amit Khandekar wrote: > +static void _selectTableAccessMethod(ArchiveHandle *AH, const char > *tablespace); > tablespace => tableam This is yet to be addressed. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: Log a sample of transactions
On 1/15/19 9:00 AM, Kuroda, Hayato wrote: Dear Adrien, I applied your file and faced a bug-candidate. (I have not read your source yet. Sorry.) When I tried to use tab-complition, some dirty messages were appeared. Messages I faced are attached in this mail. Best Regards, Hayato Kuroda Fujitsu LIMITED Hello, Thanks for your review ;) Your messages looks normal to me, you will have same messages if you put log_min_duration to 0. However, my compiler warn me of multiple definition of xact_is_sampled. I fix it in this 2nd patch. Regards, diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index b6f5822b84..3eb418cf99 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -5780,6 +5780,24 @@ local0.*/var/log/postgresql + + log_transaction_sample_rate (real) + + log_transaction_sample_rate configuration parameter + + + + + Set the fraction of transactions whose statements are logged. It applies + to each new transaction regardless of their duration. The default is + 0, meaning do not log statements from this transaction. + Setting this to 1 logs all statements for all transactions. + log_transaction_sample_rate is helpful to track a + sample of transaction. + + + + diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index f665e38ecf..19306a9547 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -1821,6 +1821,9 @@ StartTransaction(void) s->state = TRANS_START; s->transactionId = InvalidTransactionId; /* until assigned */ + /* Determine if we log statements in this transaction */ + xact_is_sampled = random() <= log_xact_sample_rate * MAX_RANDOM_VALUE; + /* * initialize current transaction state fields * diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index 0c0891b33e..faa0485796 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -100,7 +100,8 @@ int max_stack_depth = 100; /* wait N seconds to allow attach from a debugger */ int PostAuthDelay = 0; - +/* flag for logging statements in this transaction */ +bool xact_is_sampled = false; /* * private variables @@ -2203,6 +2204,8 @@ check_log_statement(List *stmt_list) * check_log_duration * Determine whether current command's duration should be logged. * If log_statement_sample_rate < 1.0, log only a sample. + * We also check if this statement in this transaction must be logged + * (regardless of its duration). 
* * Returns: * 0 if no logging is needed @@ -2218,7 +2221,7 @@ check_log_statement(List *stmt_list) int check_log_duration(char *msec_str, bool was_logged) { - if (log_duration || log_min_duration_statement >= 0) + if (log_duration || log_min_duration_statement >= 0 || xact_is_sampled) { long secs; int usecs; @@ -2252,11 +2255,11 @@ check_log_duration(char *msec_str, bool was_logged) (log_statement_sample_rate == 1 || random() <= log_statement_sample_rate * MAX_RANDOM_VALUE); - if ((exceeded && in_sample) || log_duration) + if ((exceeded && in_sample) || log_duration || xact_is_sampled) { snprintf(msec_str, 32, "%ld.%03d", secs * 1000 + msecs, usecs % 1000); - if (exceeded && !was_logged) + if ((exceeded || xact_is_sampled) && !was_logged) return 2; else return 1; diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index c216ed0922..06a5e668aa 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -494,6 +494,7 @@ int client_min_messages = NOTICE; int log_min_duration_statement = -1; int log_temp_files = -1; double log_statement_sample_rate = 1.0; +double log_xact_sample_rate = 0; int trace_recovery_messages = LOG; int temp_file_limit = -1; @@ -3347,13 +3348,25 @@ static struct config_real ConfigureNamesReal[] = {"log_statement_sample_rate", PGC_SUSET, LOGGING_WHEN, gettext_noop("Fraction of statements over log_min_duration_statement to log."), gettext_noop("If you only want a sample, use a value between 0 (never " - "log) and 1.0 (always log).") + "log) and 1 (always log).") }, &log_statement_sample_rate, 1.0, 0.0, 1.0, NULL, NULL, NULL }, + { + {"log_transaction_sample_rate", PGC_SUSET, LOGGING_WHEN, + gettext_noop("Set the fraction of transactions to log for new transactions."), + gettext_noop("Logs all statements from a fraction of transactions. " + "Use a value between 0 (never log) and 1 (log all " + "statements for all transactions).") + }, + &log_xact_sample_rate, + 0.0, 0.0, 1.0, + NULL, NULL, NULL + }, + /* End-of-list marker */ { {NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL, NULL diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresq
Re: FETCH FIRST clause WITH TIES option
On Wed, Jan 2, 2019 at 6:19 PM Tomas Vondra wrote: > After looking at the "FETCH FIRST ... PERCENT" patch, I wonder if this > patch should tweak estimates in some way. Currently, the cardinality > estimate is the same as for plain LIMIT, using the requested number of > rows. But let's say there are very few large groups - that will > naturally increase the number of rows produced. > > As an example, let's say the subplan produces 1M rows, and there are > 1000 groups (when split according to the ORDER BY clause). > Can we reliably use the ORDER BY column's raw statistics in the limit node? It seems to me they can be affected by other operations in the subplan, like filter conditions. regards Surafel
Re: Synchronous replay take III
On Sat, Dec 1, 2018 at 10:49 AM Thomas Munro wrote: > > On Sat, Dec 1, 2018 at 9:06 AM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > Unfortunately, cfbot says that patch can't be applied without conflicts, > > could > > you please post a rebased version and address commentaries from Masahiko? > > Right, it conflicted with 4c703369 and cfdf4dc4. While rebasing on > top of those, I found myself wondering why syncrep.c thinks it needs > special treatment for postmaster death. I don't see any reason why we > shouldn't just use WL_EXIT_ON_PM_DEATH, so I've done it like that in > this new version. If you kill -9 the postmaster, I don't see any > reason to think that the existing coding is more correct than simply > exiting immediately. > > On Thu, Nov 15, 2018 at 6:34 PM Masahiko Sawada wrote: > > On Thu, Mar 1, 2018 at 10:40 AM Thomas Munro > > wrote: > > > I was pinged off-list by a fellow -hackers denizen interested in the > > > synchronous replay feature and wanting a rebased patch to test. Here > > > it goes, just in time for a Commitfest. Please skip to the bottom of > > > this message for testing notes. > > > > Thank you for working on this. The overview and your summary was > > helpful for me to understand this feature, thank you. I've started to > > review this patch for PostgreSQL 12. I've tested this patch and found > > some issue but let me ask you questions about the high-level design > > first. Sorry if these have been already discussed. > > Thanks for your interest in this work! > > > > This is a design choice favouring read-mostly workloads at the expense > > > of write transactions. Hot standbys' whole raison for existing is to > > > move *some* read-only workloads off the primary server. This proposal > > > is for users who are prepared to trade increased primary commit > > > latency for a guarantee about visibility on the standbys, so that > > > *all* read-only work could be moved to hot standbys. > > > > To be clear what did you mean read-mostly workloads? > > I mean workloads where only a small percentage of transactions perform > a write. If you need write-scalability, then hot_standby is not the > solution for you (with or without this patch). > > The kind of user who would be interested in this feature is someone > who already uses some kind of heuristics to move some queries to > read-only standbys. For example, some people send transaction for > logged-in users to the primary database (because only logged-in users > generate write queries), and all the rest to standby servers (for > example "public" users who can only read content). Another technique > I have seen is to keep user sessions "pinned" on the primary server > for N minutes after they perform a write transaction. These types of > load balancing policies are primitive ways of achieving > read-your-writes consistency, but they are conservative and > pessimistic: they probably send too many queries to the primary node. > > This proposal is much more precise, allowing you to run the minimum > number of transactions on the primary node (ie transactions that > actually need to perform a write), and the maximum number of > transactions on the hot standbys. > > As discussed, making reads wait for a token would be a useful > alternative (and I am willing to help make that work too), but: > > 1. For users that do more many more reads than writes, would you > rather make (say) 80% of transactions slower or 20%? (Or 99% vs 1% as > the case may be, depending on your application.) > > 2. 
If you are also using synchronous_commit = on for increased > durability, then you are already making writers wait, and you might be > able to tolerate a small increase. > > Peter Eisentraut expressed an interesting point of view against this > general line of thinking: > > https://www.postgresql.org/message-id/5643933F.4010701%40gmx.net > > My questions are: Why do we have hot_standby mode? Is load balancing > a style of usage we want to support? Do we want a technology that > lets you do more of it? > > > I think there are two kind of reads on standbys: a read happend after > > writes and a directly read (e.g. reporting). The former usually > > requires the causal reads as you mentioned in order to read its own > > writes but the latter might be different: it often wants to read the > > latest data on the master at the time. IIUC even if we send a > > read-only query directly to a synchronous replay server we could get a > > stale result if the standby delayed for less than > > synchronous_replay_max_lag. So this synchronous replay feature would > > be helpful for the former case(i.e. a few writes and many reads wants > > to see them) whereas for the latter case perhaps the keeping the reads > > waiting on standby seems a reasonable solution. > > I agree 100% that this is not a solution for all users. But I also > suspect a token system would be quite complicated, and can't be done > in a way that is transparent to applications w
Re: Ryu floating point output patch
> "Andres" == Andres Freund writes: >> In particular, how does it know how every strtod() on the planet >> will react to specific input? Andres> strtod()'s job ought to computationally be significantly easier Andres> than the other way round, no? And if there's buggy strtod() Andres> implementations out there, why would they be guaranteed to do Andres> the correct thing with our current output, but not with ryu's? Funny thing: I've been devoting considerable effort to testing this, and the one failure mode I've found is very interesting; it's not a problem with strtod(), in fact it's a bug in our float4in caused by _misuse_ of strtod(). In particular, float4in thinks it's ok to do strtod() on the input, and then round the result to a float. It is wrong to think that, and here's why: Consider the (float4) bit pattern x15ae43fd. The true decimal value of this, and its adjacent values are: x15ae43fc = 7.0385300755536269169437150392273653469292493678466371420654468238353729248046875e-26 midpoint 7.0385303837024180189014515281838361605176203339429008565275580622255802154541015625e-28 x15ae43fd = 7.038530691851209120859188017140306974105991300039164570989669300615787506103515625e-26 midpoint 7.0385310222816924506096876943622661354282854517805390059947967529296875e-26 x15ae43fe = 7.0385313081487913247746609950532486012827332322316911389177739620208740234375e-26 Now look at what happens if you pass '7.038531e-26' as input for float4. >From the above values, the correct result is x15ae43fd, because any other value would be strictly further away. But if you input the value as a double first, here are the adjacent representations: x3ab5c87fafff = 7.0385309074873222531206633286975087635036133540...e-26 midpoint 7.03853096488450735186517055373347249505857809130...e-28 x3ab5c87fb000 = 7.03853102228169245060968769436226613542828545...e-28 midpoint 7.038531079678877549354185003805399958168507565784...e-28 x3ab5c87fb001 = 7.038531137076062648098692228841363689723472303024...e-28 So the double value is x3ab5c87fb000, which has a mantissa of (1).0101110010110... which when rounded to 23 bits (excluding the implied (1)), is 010 1110 0100 0011 1110 = 2e43fe So the rounded result is incorrectly x15ae43fe, rather than the expected correct value of x15ae43fd. strtof() exists as part of the relevant standards, so float4in should be using that in place of strtod, and that seems to fix this case for me. -- Andrew (irc:RhodiumToad)
Re: Log a sample of transactions
On Sat, Jan 5, 2019 at 12:57 AM Adrien Mobile wrote: > > Le 4 janvier 2019 13:20:09 GMT+01:00, Peter Eisentraut > a écrit : > >On 12/12/2018 22:32, Adrien Nayrat wrote: > >> Per idea of Nikolay Samokhvalov[1] I propose this patch to add the > >possibility > >> to log all statements from a fraction of transactions. > >> > >> I have several questions: > >> * Do we want this feature? > > > >It's not clear to me what the use is. The per-statement sampling > >allows > >you to capture slow queries without overwhelming the logs. We don't > >have any logging of slow transactions or any other transaction scope > >logging, so what will this sample? > > > >-- > >Peter Eisentraut http://www.2ndQuadrant.com/ > >PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services > > Hello, > When you troubleshoot applicative issues with multi-statements transaction, > you may have to log all queries to find all statements of one transaction. > With high throughput, it could be hard to log all queries without causing > troubles. Hm, can we use log_min_duration_statement to find slow queries of a transaction instead? Could you please elaborate on the use-case? Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: SPI Interface to Call Procedure with Transaction Control Statements?
> "Jack" == Jack LIU writes: Jack> Hi Andrew, Jack> This is my code to call the procedure with Jack> SPI_connect_ext(SPI_OPT_NONATOMIC). Ah. You need to take a look at exec_stmt_call in plpgsql, and do the same things it does with snapshot management (specifically, setting the no_snapshot flag on the plan that you're going to execute). SPI forces atomic mode if the normal snapshot management is in use, because otherwise a commit inside the procedure would warn about still having a snapshot open. -- Andrew (irc:RhodiumToad)
Re: SPI Interface to Call Procedure with Transaction Control Statements?
On 15/01/2019 11:49, Andrew Gierth wrote: >> "Jack" == Jack LIU writes: > > Jack> Hi Andrew, > Jack> This is my code to call the procedure with > Jack> SPI_connect_ext(SPI_OPT_NONATOMIC). > > Ah. You need to take a look at exec_stmt_call in plpgsql, and do the > same things it does with snapshot management (specifically, setting the > no_snapshot flag on the plan that you're going to execute). SPI forces > atomic mode if the normal snapshot management is in use, because > otherwise a commit inside the procedure would warn about still having a > snapshot open. Yeah, eventually we might want to add a new SPI function to do non-atomic calls, but right now you need to go the manual route. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: FETCH FIRST clause WITH TIES option
On 1/15/19 11:07 AM, Surafel Temesgen wrote: > > > On Wed, Jan 2, 2019 at 6:19 PM Tomas Vondra wrote: > > After looking at the "FETCH FIRST ... PERCENT" patch, I wonder if this > patch should tweak estimates in some way. Currently, the cardinality > estimate is the same as for plain LIMIT, using the requested number of > rows. But let's say there are very few large groups - that will > naturally increase the number of rows produced. > > As an example, let's say the subplan produces 1M rows, and there are > 1000 groups (when split according to the ORDER BY clause). > > > > > can we use ORDER BY column raw statistic in limit node reliably? > because it seems to me it can be affected by other operation in a > subplan like filter condition > What do you mean by "raw statistic"? Using stats from the underlying table is not quite possible, because you might be operating on top of a join or something like that. IMHO the best thing you can do is call estimate_num_groups() and combine that with the number of input rows. That shall benefit from ndistinct coefficients when available, etc. I've been thinking that considering the unreliability of grouping estimates we should use a multiple of the average group size (because there may be much larger groups), but I think that's quite unprincipled and I'd much rather try without it first. But maybe we can do better when there really is a single table to consider, in which case we might look at MCV lists and estimate the largest group. That would give us a much better idea of the worst case. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
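A rough sketch of the adjustment being suggested (nothing here is committed code; the function and the surrounding names are illustrative only):

#include "postgres.h"
#include "utils/selfuncs.h"

/*
 * Estimate output rows of LIMIT n ... WITH TIES, given the subplan's row
 * count and the ORDER BY expressions.  Sketch only.
 */
static double
estimate_with_ties_rows(PlannerInfo *root, List *orderby_exprs,
						double input_rows, double limit_rows)
{
	double		ngroups;
	double		avg_group_size;

	/* benefits from extended statistics / ndistinct coefficients */
	ngroups = estimate_num_groups(root, orderby_exprs, input_rows, NULL);
	avg_group_size = input_rows / Max(ngroups, 1.0);

	/*
	 * WITH TIES returns the requested rows plus the remainder of the last
	 * peer group, so pad the plain-LIMIT estimate by roughly one group.
	 */
	return Min(input_rows, limit_rows + avg_group_size);
}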
Re: Pluggable Storage - Andres's take
> On Tue, Jan 15, 2019 at 10:52 AM Amit Khandekar > wrote: > > Need to bump K_VERS_MINOR as well. I've bumped it up, but somehow this change escaped the previous version. Now it should be there, thanks! > On Mon, 14 Jan 2019 at 18:36, Amit Khandekar wrote: > > +static void _selectTableAccessMethod(ArchiveHandle *AH, const char > > *tablespace); > > tablespace => tableam > > This is yet to be addressed. Fixed. Also, I guess the other attached patch should address the psql part, namely displaying a table access method with \d+ and the possibility to hide it with a psql variable (HIDE_TABLEAM, but I'm open to suggestions about the name). pg_dump_access_method_v4.patch Description: Binary data psql_describe_am.patch Description: Binary data
Re: Connection slots reserved for replication
On Wed, Jan 2, 2019, at 12:36, Alexander Kukushkin wrote: > Hi, > > On Wed, 2 Jan 2019 at 12:17, Oleksii Kliukin wrote: > > > Thank you. Attached the new version(called it v8) with the following > > changes to the documentation: > > Thank you for jumping on it. Your changes look good to me. > > > I was also thinking about changing the value in PG_CONTROL_VERSION, > because we added the new field into the control file, but decided to > leave this change to committer. Sounds reasonable, not sure what the 'official policy' is on this. As no further reviews or issues have been found for almost 2 weeks since the last one, should we move it to RFC? Cheers, Oleksii
Re: Reviving the "Stopping logical replication protocol" patch from Vladimir Gordichuk
Dave Cramer On Sun, 13 Jan 2019 at 23:19, Craig Ringer wrote: > On Mon, 3 Dec 2018 at 19:38, Dave Cramer wrote: > >> Dmitry, >> >> Please see attached rebased patches >> > > I'm fine with patch 0001, though I find this comment a bit hard to follow: > > + * The send_data callback must enqueue complete CopyData messages to libpq > + * using pq_putmessage_noblock or similar, since the walsender loop may > send > + * CopyDone then exit and return to command mode in response to a client > + * CopyDone between calls to send_data. > > I think it needs splitting up into a couple of sentences. > > Fair point, remember it was originally written by a non-english speaker > > In patch 0002, stopping during a txn. It's important that once we skip any > action, we continue skipping. In patch 0002 I'd like it to be clearer that > we will *always* skip the rb->commit callback if rb->continue_decoding_cb() > returned false and aborted the while loop. I suggest storing the result of > (rb->continue_decoding_cb == NULL || rb->continue_decoding_cb()) in an > assignment within the while loop, and testing that result later. > > e.g. > > (continue_decoding = (rb->continue_decoding_cb == NULL || > rb->continue_decoding_cb())) > > and later > > if (continue_decoding) { > rb->commit(rb, txn, commit_lsn); > } > > Will do > I don't actually think it's necessary to re-test the continue_decoding > callback and skip commit here. If we've sent all the of the txn > except the commit, we might as well send the commit, it's tiny and might > save some hassle later. > > > I think a comment on the skipped commit would be good too: > > /* > * If we skipped any changes due to a client CopyDone we must not send a > commit > * otherwise the client would incorrectly think it received a complete > transaction. > */ > > I notice that the fast-path logic in WalSndWriteData is removed by this > patch. It isn't moved; there's no comparison of last_reply_timestamp > and wal_sender_timeout now. What's the rationale there? You want to ensure > that we reach ProcessRepliesIfAny() ? Can we cheaply test for a readable > client socket then still take the fast-path if it's not readable? > This may have been a mistake as I am taking this over from a very old patch that I did not originally write. I'll have a look at this I > > -- > Craig Ringer http://www.2ndQuadrant.com/ > 2ndQuadrant - PostgreSQL Solutions for the Enterprise >
Re: Libpq support to connect to standby server as priority
On Mon, 14 Jan 2019 at 21:19, Tsunakawa, Takayuki < tsunakawa.ta...@jp.fujitsu.com> wrote: > From: Tom Lane [mailto:t...@sss.pgh.pa.us] > > The problem here of course is that whoever invented target_session_attrs > > was unconcerned with following that precedent, so what we have is > > "target_session_attrs=(any | read-write)". > > Are we prepared to add some aliases in service of unifying these names? > > I think "yes". > Agreed. There's no downside to aliasing and I'd really like to see consistency. > > > 2. Whether or not you want to follow pgJDBC's naming, it seems like we > > ought to have both "require read only" and "prefer read only" behaviors > > in this patch, and maybe likewise "require read write" versus "prefer > > read write". > > Agreed, although I don't see a use case for "prefer read write". I don't > think there's an app like "I want to write, but I'm OK if I cannot." > > > 3. We ought to sync this up with whatever's going to happen in > > https://commitfest.postgresql.org/21/1090/ > > at least to the extent of agreeing on what GUCs we'd like to see > > the server start reporting. > > Yes. > > > 4. Given that other discussion, it's not quite clear what we should > > even be checking. The existing logic devolves to checking that > > transaction_read_only is true, but that's not really the same thing as > > "is a master server", eg you might have connected to a master server > > under a role that has SET ROLE default_transaction_read_only = false. > > (I wonder what pgJDBC is really checking, under the hood.) > > Do we want to have modes that are checking hot-standby state in some > > fashion, rather than the transaction_read_only state? > > PgJDBC uses transaction_read_only like this: > > [core/v3/ConnectionFactoryImpl.java] > private boolean isMaster(QueryExecutor queryExecutor) throws > SQLException, IOException { > byte[][] results = SetupQueryRunner.run(queryExecutor, "show > transaction_read_only", true); > String value = queryExecutor.getEncoding().decode(results[0]); > return value.equalsIgnoreCase("off"); > } > > But as some people said, I don't think this is the right way. I suppose > what's leading to the current somewhat complicated situation is that there > was no easy way for the client to know whether the server is the master. > That ended up in using "SHOW transaction_read_only" instead, and people > supported that compromise by saying "read only status is more useful than > whether the server is standby or not," I'm afraid. > > The original desire should have been the ability to connect to a primary > or a standby. So, I think we should go back to the original thinking (and > not complicate the feature), and create a read only GUC_REPORT variable, > say, server_role, that identifies whether the server is a primary or a > standby. > > I'm confused as to how this would work. Who or what determines if the server is a primary or standby? Dave Cramer da...@postgresintl.com www.postgresintl.com > >
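As a concrete reference point for the naming discussion: only "any" and "read-write" exist in released libpq today, and anything like "prefer-read" below is a value proposed in this thread, not something you can use yet.

# Works today (libpq 10+): try the hosts in order and require one that
# accepts writes.
psql "host=node1,node2,node3 port=5432 target_session_attrs=read-write"

# Proposed in this thread (illustrative only, not yet implemented):
# psql "host=node1,node2,node3 port=5432 target_session_attrs=prefer-read"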
Re: using expression syntax for partition bounds
On 15/01/2019 07:31, Amit Langote wrote: >> Is "partition bound" the right term? For list partitioning, it's not >> really a bound. Maybe it's OK. > > Hmm, maybe "partition bound value"? Just want to stress that the > expression specifies "bounding" value of a partition. I was more concerned about the term "bound", which it is not outside of range partitioning. But I can't think of a better term. I wouldn't change expr to value as you have done between v7 and v8, since the point of this patch is to make expressions valid where previously only values were. >> I don't see any treatment of non-immutable expressions. There is a test >> case with a non-immutable cast that was removed, but that's all. There >> was considerable discussion earlier in the thread on whether >> non-immutable expressions should be allowed. I'm not sure what the >> resolution was, but in any case there should be documentation and tests >> of the outcome. > > The partition bound expression is evaluated only once, that is, when the > partition creation command is executed, and what gets stored in the > catalog is a Const expression that contains the value that was computed, > not the original non-immutable expression. So, we don't need to do > anything special in terms of code to handle users possibly specifying a > non-immutable expression as partition bound. Tom said something in that > thread which seems to support such a position: > > https://www.postgresql.org/message-id/22534.1523374457%40sss.pgh.pa.us OK, if we are going with that approach, then it needs to be documented and there should be test cases. >> The collation handling might need another look. The following is >> allowed without any diagnostic: >> >> CREATE TABLE t2 ( >> a text COLLATE "POSIX" >> ) PARTITION BY RANGE (a); >> CREATE TABLE t2a PARTITION OF t2 FOR VALUES FROM ('foo' COLLATE "C") TO >> ('xyz'); >> >> I think the correct behavior would be that an explicit collation in the >> partition bound expression is an error. > But that's essentially ignoring the collation specified by the user for > the partition bound value without providing any information about that to > the user. I've fixed that by explicitly checking if the collate clause in > the partition bound value expression contains the same collation as the > parent partition key collation and give an error otherwise. I think that needs more refinement. In v8, the following errors CREATE TABLE t2 ( a name COLLATE "POSIX" ) PARTITION BY RANGE (a); CREATE TABLE t2a PARTITION OF t2 FOR VALUES FROM (name 'foo') TO (name 'xyz'); ERROR: collation of partition bound value for column "a" does not match partition key collation "POSIX" The problem here is that the "name" type has a collation that is neither the one of the column nor the default collation. We can allow that. What we don't want is someone writing an explicit COLLATE clause. I think we just need to check that there is no top-level COLLATE clause. This would then sort of match the logic in parse_collate.c for combining collations. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Proving IS NOT NULL inference for ScalarArrayOpExpr's
On Tue, Jan 15, 2019 at 12:47 AM David Rowley wrote: > 1. In: > > + if (IsA(clause, ScalarArrayOpExpr)) > + { > + ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) clause; > + Node *subexpr = (Node *) ((NullTest *) predicate)->arg; > + if (op_strict(saop->opno) && > + clause_is_strict_for((Node *) linitial(saop->args), subexpr)) > + return true; > + } > + > /* foo IS NOT NULL refutes foo IS NULL */ > if (clause && IsA(clause, NullTest) && > > Your IsA(clause, ScalarArrayOpExpr) condition should also be checking > that "clause" can't be NULL. The NullTest condition one below does > this. Fixed in my local copy. I also did the same in predicate_implied_by_simple_clause since it seems to have the same potential issue. After sleeping on it and looking at it again this morning, I also realized I need to exclude weak implication in predicate_refuted_by_simple_clause. > 2. I was also staring predicate_implied_by_simple_clause() a bit at > the use of clause_is_strict_for() to ensure that the IS NOT NULL's > operand matches the ScalarArrayOpExpr left operand. Since > clause_is_strict_for() = "Can we prove that "clause" returns NULL if > "subexpr" does?", in this case, your clause is the ScalarArrayOpExpr's > left operand and subexpr is the IS NOT NULL's operand. This means > that a partial index with "WHERE a IS NOT NULL" should also be fine to > use for WHERE strict_func(a) IN (1,2,..., 101); since strict_func(a) > must be NULL if a is NULL. Also also works for WHERE a+a > IN(1,2,...,101); I wonder if it's worth adding a test for that, or > even just modify one of the existing tests to ensure you get the same > result from it. Perhaps it's worth an additional test to ensure that x > IN(1,2,...,101) does not imply x+x IS NOT NULL and maybe that x+x IS > NULL does not refute x IN(1,2,...,101), as a strict function is free > to return NULL even if it's input are not NULL. Are you suggesting a different test than clause_is_strict_for to verify the saop LHS is the same as the null test's arg? I suppose we could use "equal()" instead? I've added several tests, and noticed a few things: 1. The current code doesn't result in "strongly_implied_by = t" for the "(x + x) is not null" case, but it does result in "s_i_holds = t". This doesn't change if I switch to using "equal()" as mentioned above. 2. The tests I have for refuting "x is null" show that w_r_holds as well as s_r_holds. I'd only expected the latter, since only "strongly_refuted_by = t". 3. The tests I have for refuting "(x + x) is null" show that both s_r_holds and w_r_holds. I'd expected these to be false. I'm attaching the current version of the patch with the new tests, but I'm not sure I understand the *_holds results mentioned above. Are they an artifact of the data under test? Or do they suggest a remaining bug? I'm a bit fuzzy on what to expect for *_holds though I understand the requirements for strongly/weakly_implied/refuted_by. James Coleman saop_is_not_null-v6.patch Description: Binary data
Re: current_logfiles not following group access and instead follows log_file_mode permissions
Haribabu Kommi writes: > Excluding the file in the backup can solve the problem of backup by an > unprivileged user. Is there any scenarios it can cause problems if it > doesn't follow the group access mode? The point of this file, as I understood it, was to allow someone who's allowed to read the log files to find out which one is the latest. It makes zero sense for it to have different permissions from the log files, because doing that would break its only use-case. I am wondering what is the use-case for a backup arrangement that's so fragile it can't cope with varying permissions in the data directory. regards, tom lane
Re: insensitive collations
On 14/01/2019 13:23, Daniel Verite wrote: > On a table with pre-existing contents, the creation of a unique index > does not seem to detect the duplicates that are equal per the > collation and different binary-wise. Fixed in the attached updated patch. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >From dcd081ba5fdc3a29adddcd52deac5fb67d18fcc1 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Tue, 15 Jan 2019 15:54:33 +0100 Subject: [PATCH v3] Collations with nondeterministic comparison This adds a flag "deterministic" to collations. If that is false, such a collation disables various optimizations that assume that strings are equal only if they are byte-wise equal. That then allows use cases such as case-insensitive or accent-insensitive comparisons or handling of strings with different Unicode normal forms. The term "deterministic comparison" in this context is from Unicode Technical Standard #10 (https://unicode.org/reports/tr10/#Deterministic_Comparison). --- contrib/bloom/bloom.h | 1 + contrib/bloom/blutils.c | 3 +- doc/src/sgml/catalogs.sgml| 7 + doc/src/sgml/charset.sgml | 11 +- doc/src/sgml/ref/create_collation.sgml| 22 ++ src/backend/access/hash/hashfunc.c| 86 + src/backend/catalog/pg_collation.c| 2 + src/backend/commands/collationcmds.c | 25 +- src/backend/executor/execExpr.c | 4 +- src/backend/executor/execGrouping.c | 12 +- src/backend/executor/execPartition.c | 1 + src/backend/executor/nodeAgg.c| 4 + src/backend/executor/nodeGroup.c | 1 + src/backend/executor/nodeHash.c | 14 +- src/backend/executor/nodeHashjoin.c | 5 + src/backend/executor/nodeRecursiveunion.c | 1 + src/backend/executor/nodeSetOp.c | 2 + src/backend/executor/nodeSubplan.c| 8 + src/backend/executor/nodeUnique.c | 1 + src/backend/executor/nodeWindowAgg.c | 2 + src/backend/nodes/copyfuncs.c | 7 + src/backend/nodes/outfuncs.c | 7 + src/backend/nodes/readfuncs.c | 7 + src/backend/optimizer/plan/createplan.c | 54 ++- src/backend/optimizer/util/tlist.c| 25 ++ src/backend/partitioning/partbounds.c | 4 +- src/backend/partitioning/partprune.c | 3 +- src/backend/utils/adt/arrayfuncs.c| 2 +- src/backend/utils/adt/name.c | 32 +- src/backend/utils/adt/orderedsetaggs.c| 3 +- src/backend/utils/adt/pg_locale.c | 1 + src/backend/utils/adt/ri_triggers.c | 20 + src/backend/utils/adt/varchar.c | 20 + src/backend/utils/adt/varlena.c | 51 ++- src/backend/utils/cache/catcache.c| 2 +- src/backend/utils/cache/lsyscache.c | 16 + src/bin/initdb/initdb.c | 4 +- src/bin/pg_dump/pg_dump.c | 39 +- src/bin/psql/describe.c | 17 +- src/include/catalog/pg_collation.h| 2 + src/include/executor/executor.h | 3 + src/include/executor/hashjoin.h | 1 + src/include/executor/nodeHash.h | 2 +- src/include/nodes/execnodes.h | 3 + src/include/nodes/plannodes.h | 7 + src/include/optimizer/planmain.h | 2 +- src/include/optimizer/tlist.h | 1 + src/include/partitioning/partbounds.h | 1 + src/include/utils/lsyscache.h | 1 + src/include/utils/pg_locale.h | 1 + .../regress/expected/collate.icu.utf8.out | 341 +- .../regress/expected/collate.linux.utf8.out | 25 +- src/test/regress/sql/collate.icu.utf8.sql | 115 ++ src/test/regress/sql/collate.linux.utf8.sql | 8 + 54 files changed, 928 insertions(+), 111 deletions(-) diff --git a/contrib/bloom/bloom.h b/contrib/bloom/bloom.h index 24200fb5fa..c4672f6853 100644 --- a/contrib/bloom/bloom.h +++ b/contrib/bloom/bloom.h @@ -137,6 +137,7 @@ typedef struct BloomMetaPageData typedef struct BloomState { FmgrInfohashFn[INDEX_MAX_KEYS]; + Oid 
collations[INDEX_MAX_KEYS]; BloomOptions opts; /* copy of options on index's metapage */ int32 nColumns; diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c index 6458376578..d078dfbd46 100644 --- a/contrib/bloom/blutils.c +++ b/contrib/bloom/blutils.c @@ -163,6 +163,7 @@ initBloomState(BloomState *state, Relation index) fmgr_info_copy(&(state->hashFn[i]), index_getprocinfo(index, i + 1, BLOOM_HASH_PROC),
Re: [HACKERS] COPY FREEZE and PD_ALL_VISIBLE
Hello, Today I bumped into the need to limit the time of the first VACUUM after a data import. I'm using a utility called osmium together with COPY FREEZE to import OpenStreetMap data into the database. osmium export -c osmium.config -f pg belarus-latest.osm.pbf -v --progress | psql -1 -c 'create table byosm(geom geometry, osm_type text, osm_id bigint, tags jsonb);copy byosm from stdin freeze;' However, the first pass of VACUUM rewrites the whole table. Here are two logs of VACUUM VERBOSE in a row: https://gist.github.com/Komzpa/e765c1c5e04623d83a6263d4833cf3a5 In the Russian Postgres Telegram group I was pointed to this thread. Can the patch be revived? What is needed to get it into 12? On Sun, Aug 14, 2016 at 10:37 PM Jeff Janes wrote: > On Tue, Nov 3, 2015 at 6:37 AM, Simon Riggs wrote: > > On 3 November 2015 at 15:23, Amit Kapila > wrote: > >> > >> On Fri, Oct 23, 2015 at 6:29 AM, Simon Riggs > >> wrote: > >>> > >>> On 21 October 2015 at 13:31, Jeff Janes wrote: > >>> > Index-only scans will visit the heap for each tuple until the first > VACUUM is done. > > The first vacuum will read the entire table, but not need to write it > anymore. And will create the _vm file. > > I think we really want to create _vm file as well as set > PD_ALL_VISIBLE, > but I don't know the best way to do that. Set a flag somewhere and > then > create it in bulk at the end of the transaction? Set it bit by bit > as the > pages are extended and initialized? > >>> > >>> > >>> Easy enough to do it at the end of the COPY FREEZE in one step. > >> > >> > >> Here, we might want to consider that setting bit in visibility map > >> will generate WAL log whereas Copy Freeze otherwise skip WAL > >> when wal_level is less than archive. This can lead to extra disk > >> writes which can slow down Copy Freeze, but OTOH that might > >> be acceptable. > > > > > > I'm building the map as I go, in the latest version of this patch I'm > > working on. > > Hi Simon, > > Is this still on your radar? If you would like someone else to pick > it up, can you post the WIP patch you have? > > Thanks, > > Jeff > > > -- > Sent via pgsql-hackers mailing list (pgsql-hack...@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > -- Darafei Praliaskouski Support me: http://patreon.com/komzpa
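A quick way to see the effect described above is the contrib pg_visibility extension (the table name below matches the osmium example; the exact numbers will of course vary):

CREATE EXTENSION IF NOT EXISTS pg_visibility;

SELECT count(*) AS pages,
       count(*) FILTER (WHERE all_visible) AS all_visible,
       count(*) FILTER (WHERE all_frozen)  AS all_frozen
FROM pg_visibility_map('byosm');

-- After COPY ... FREEZE, all_visible/all_frozen stay at 0 until the first
-- VACUUM, which therefore has to visit and dirty every page of the freshly
-- loaded table just to set PD_ALL_VISIBLE and the visibility map bits.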
Re: current_logfiles not following group access and instead follows log_file_mode permissions
I wrote: > Haribabu Kommi writes: >> Excluding the file in the backup can solve the problem of backup by an >> unprivileged user. Is there any scenarios it can cause problems if it >> doesn't follow the group access mode? > The point of this file, as I understood it, was to allow someone who's > allowed to read the log files to find out which one is the latest. It > makes zero sense for it to have different permissions from the log files, > because doing that would break its only use-case. On reflection, maybe the problem is not that we're giving the file the wrong permissions, but that we're putting it in the wrong place? That is, seems like it should be in the logfile directory not the data directory. That would certainly simplify the intended use-case, and it would fix this complaint too. regards, tom lane
Re: O_DIRECT for relations and SLRUs (Prototype)
On 1/15/19 11:28 AM, Michael Paquier wrote: On Tue, Jan 15, 2019 at 11:19:48AM +0500, Andrey Borodin wrote: Is it possible to avoid those memcpy's by aligning available buffers instead? I couldn't understand this from the patch and this thread. Sure, it had better do that. That's just a lazy implementation. Hi! Could you specify all the cases in which buffers will not be aligned with BLCKSZ? AFAICS shared and temp buffers are aligned. And which ones are not? -- Regards, Maksim Milyutin
Re: Pluggable Storage - Andres's take
Hi, On 2019-01-15 14:37:36 +0530, Amit Khandekar wrote: > Then for each of the calls, we would need to declare that structure > variable (with = {0}) and assign required fields in that structure > before passing it to ArchiveEntry(). But a major reason of > ArchiveEntry() is to avoid doing this and instead conveniently pass > those fields as parameters. This will cause unnecessary more lines of > code. I think better way is to have an ArchiveEntry() function with > limited number of parameters, and have an ArchiveEntryEx() with those > extra parameters which are not needed in usual cases. I don't think that'll really solve the problem. I think it might be more reasonable to rely on structs. Now that we can rely on designated initializers for structs we can do something like ArchiveEntry((ArchiveArgs){.tablespace = 3, .dumpFn = somefunc, ...}); and unused arguments will automatically initialized to zero. Or we could pass the struct as a pointer, might be more efficient (although I doubt it matters here): ArchiveEntry(&(ArchiveArgs){.tablespace = 3, .dumpFn = somefunc, ...}); What do others think? It'd probably be a good idea to start a new thread about this. Greetings, Andres Freund
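For completeness, here is a tiny self-contained illustration of the designated-initializer pattern; the ArchiveArgs struct and its fields are hypothetical stand-ins, not the real pg_dump definitions:

#include <stdio.h>

typedef struct ArchiveArgs
{
	const char *tag;
	const char *namespace;
	const char *tablespace;		/* rarely used; left NULL by default */
	const char *owner;
	int			nDeps;			/* left 0 by default */
} ArchiveArgs;

static void
ArchiveEntry(ArchiveArgs *args)
{
	printf("tag=%s tablespace=%s nDeps=%d\n",
		   args->tag,
		   args->tablespace ? args->tablespace : "(none)",
		   args->nDeps);
}

int
main(void)
{
	/* fields that are not named are implicitly zero/NULL */
	ArchiveEntry(&(ArchiveArgs) {.tag = "COMMENT", .owner = "postgres"});
	ArchiveEntry(&(ArchiveArgs) {.tag = "TABLE", .tablespace = "ts1", .nDeps = 2});
	return 0;
}

Call sites only spell out the arguments they actually need, which is what keeps future additions to the struct from touching every caller.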
Re: Log a sample of transactions
On 1/15/19 11:42 AM, Masahiko Sawada wrote: When you troubleshoot applicative issues with multi-statements transaction, you may have to log all queries to find all statements of one transaction. With high throughput, it could be hard to log all queries without causing troubles. Hm, can we use log_min_duration_statement to find slow queries of a transaction instead? Could you please elaborate on the use-case? Hello, The goal is not to find slow queries in a transaction, but to troubleshoot application issues when the queries are short. Sometimes you want to understand what happens in a transaction: either you know your application perfectly, or you have to log all queries and find the ones with the same transaction ID (%x). That can be problematic if you have heavy traffic with fast queries. Thanks,
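In postgresql.conf terms, the intended setup looks roughly like this (log_transaction_sample_rate is the parameter added by the patch in this thread, so it only exists with the patch applied):

log_transaction_sample_rate = 0.01      # log every statement of ~1% of transactions
log_line_prefix = '%m [%p] %x '         # %x = transaction ID, used to group the statements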
Re: Proving IS NOT NULL inference for ScalarArrayOpExpr's
James Coleman writes: > I'm attaching the current version of the patch with the new tests, but > I'm not sure I understand the *_holds results mentioned above. Are > they an artifact of the data under test? Or do they suggest a > remaining bug? I'm a bit fuzzy on what to expect for *_holds though I > understand the requirements for strongly/weakly_implied/refuted_by. IIRC, the "foo_holds" outputs mean "the foo relationship appears to hold for these expressions across all tested inputs", for instance s_i_holds means "there were no tested cases where A is true and B is not true". The implied_by/refuted_by outputs mean "predtest.c claims it can prove this". Obviously, a claimed proof for a relationship that fails experimentally would be big trouble. The other way around just means that either the proof rules are inadequate to prove this particular case, or the set of test values for the expressions don't expose the fact that the relationship doesn't hold. Now, if we *expected* that predtest.c should be able to prove something, failure to do so would be a bug, but it's not a bug if we know it's incapable of making that proof. The second explanation might represent a bug in the test case. I added the foo_holds columns just as a sort of cross-check on the test cases themselves, they don't test anything about the predtest code. regards, tom lane
Re: [PATCH] Allow UNLISTEN during recovery
Mitar, thanks for giving this your attention! So patch looks good to me, but documentation changes are missing from it. > UNLISTEN should be removed from the list of commands not allowed and moved > to the list of those which are allowed [1]. Moreover, > src/test/regress/sql/hs_standby_allowed.sql and > src/test/regress/sql/hs_standby_disallowed.sql tests should be updated > based on these changes in the patch. What is surprising to me is that make > check-world does not fail with this change, but there is an explicit check > for UNLISTEN *. So does this mean those tests are not run or does it mean > that this patch does not fix the problem? > I've made the requested changes to the docs and to the regression tests. I think that specifically the standby regression tests do not get executed by check-world - see section https://www.postgresql.org/docs/current/regress-run.html#id-1.6.20.5.8. I'm guessing this should be executed as part of the build verification pipeline for patches, but I don't know anything about the PostgreSQL build infrastructure. Unfortunately I'm extremely tight for time at the moment and don't have time to do the appropriate hot standby setup to test this... As the patch is pretty straightforward, and since I'm hoping you guys execute the tests on your end, I hope that's acceptable. If it's absolutely necessary for me to test the patch locally, let me know and I'll do so. From 8c816354e820bf3d0be69d55dbf0052b1d27feeb Mon Sep 17 00:00:00 2001 From: Shay Rojansky Date: Tue, 15 Jan 2019 18:49:40 +0100 Subject: [PATCH] Allow unlisten when in hot standby:wq --- doc/src/sgml/high-availability.sgml| 16 ++-- src/backend/tcop/utility.c | 2 +- src/test/regress/sql/hs_standby_allowed.sql| 4 src/test/regress/sql/hs_standby_disallowed.sql | 2 -- 4 files changed, 15 insertions(+), 9 deletions(-) diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml index 4882b20828..bb1c86f73e 100644 --- a/doc/src/sgml/high-availability.sgml +++ b/doc/src/sgml/high-availability.sgml @@ -1767,6 +1767,11 @@ if (!triggered) Plugins and extensions - LOAD + + + UNLISTEN + + @@ -1856,18 +1861,17 @@ if (!triggered) - LISTEN, UNLISTEN, NOTIFY + LISTEN, NOTIFY -In normal operation, read-only transactions are allowed to -use LISTEN, UNLISTEN, and -NOTIFY, so Hot Standby sessions operate under slightly tighter -restrictions than ordinary read-only sessions. It is possible that some -of these restrictions might be loosened in a future release. +In normal operation, read-only transactions are allowed to use +LISTEN, and NOTIFY, so Hot Standby sessions +operate under slightly tighter restrictions than ordinary read-only sessions. +It is possible that some of these restrictions might be loosened in a future release. 
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 27ae6be751..44136060d6 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -629,7 +629,7 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 			{
 				UnlistenStmt *stmt = (UnlistenStmt *) parsetree;
 
-				PreventCommandDuringRecovery("UNLISTEN");
+				/* allow UNLISTEN during recovery, which is a noop */
 				CheckRestrictedOperation("UNLISTEN");
 				if (stmt->conditionname)
 					Async_Unlisten(stmt->conditionname);
diff --git a/src/test/regress/sql/hs_standby_allowed.sql b/src/test/regress/sql/hs_standby_allowed.sql
index a33199dbbd..1f1cdf5a00 100644
--- a/src/test/regress/sql/hs_standby_allowed.sql
+++ b/src/test/regress/sql/hs_standby_allowed.sql
@@ -110,6 +110,10 @@ LOCK hs1 IN ROW SHARE MODE;
 LOCK hs1 IN ROW EXCLUSIVE MODE;
 COMMIT;
 
+-- UNLISTEN
+unlisten a;
+unlisten *;
+
 -- LOAD
 -- should work, easier if there is no test for that...
 
diff --git a/src/test/regress/sql/hs_standby_disallowed.sql b/src/test/regress/sql/hs_standby_disallowed.sql
index 21bbf526b7..a470600eec 100644
--- a/src/test/regress/sql/hs_standby_disallowed.sql
+++ b/src/test/regress/sql/hs_standby_disallowed.sql
@@ -88,8 +88,6 @@ COMMIT;
 -- Listen
 listen a;
 notify a;
-unlisten a;
-unlisten *;
 
 -- disallowed commands
-- 
2.19.1
Re: [PATCH] Allow UNLISTEN during recovery
Hi! On Tue, Jan 15, 2019 at 10:17 AM Shay Rojansky wrote: > Unfortunately I'm extremely tight for time at the moment and don't have time > to do the appropriate hot standby setup to test this... As the patch is > pretty straightforward, and since I'm hoping you guys execute the tests on > your end, I hope that's acceptable. If it's absolutely necessary for me to > test the patch locally, let me know and I'll do so. I would definitely prefer it if you could run the tests. You might discover that your patch does not sufficiently address the tests. Please do so and confirm here that it works for you, and then I can do another review. Mitar -- http://mitar.tnode.com/ https://twitter.com/mitar_m
Re: Protect syscache from bloating with negative cache entries
"Tsunakawa, Takayuki" writes: > But the syscache/relcache bloat still remains a problem, when there are many > live tables and application connections. Would you agree to solve this in > some way? I thought Horiguchi-san's latest patches would solve this and the > negative entries. Can we consider that his patch and yours are orthogonal, > i.e., we can pursue Horiguchi-san's patch after yours is committed? Certainly, what I've done here doesn't preclude adding some wider solution to the issue of extremely large catcaches. I think it takes the pressure off for one rather narrow problem case, and the mechanism could be used to fix other ones. But if you've got an application that just plain accesses a huge number of objects, this isn't going to make your life better. > (As you said, some parts of Horiguchi-san's patches may be made simpler. For > example, the ability to change another session's GUC variable can be > discussed in a separate thread.) Yeah, that idea seems just bad from here ... > I think we need some limit to the size of the relcache, syscache, and > plancache. Oracle and MySQL both have it, using LRU to evict less frequently > used entries. You seem to be concerned about the LRU management based on > your experience, but would it really cost so much as long as each postgres > process can change the LRU list without coordination with other backends now? > Could you share your experience? Well, we *had* an LRU mechanism for the catcaches way back when. We got rid of it --- see commit 8b9bc234a --- because (a) maintaining the LRU info was expensive and (b) performance fell off a cliff in scenarios where the cache size limit was exceeded. You could probably find some more info about that by scanning the mail list archives from around the time of that commit, but I'm too lazy to do so right now. That was a dozen years ago, and it's possible that machine performance has moved so much since then that the problems are gone or mitigated. In particular I'm sure that any limit we would want to impose today will be far more than the 5000-entries-across-all-caches limit that was in use back then. But I'm not convinced that a workload that would create 100K cache entries in the first place wouldn't have severe problems if you tried to constrain it to use only 80K entries. I fear it's just wishful thinking to imagine that the behavior of a larger cache won't be just like a smaller one. Also, IIRC some of the problem with the LRU code was that it resulted in lots of touches of unrelated data, leading to CPU cache miss problems. It's hard to see how that doesn't get even worse with a bigger cache. As far as the relcache goes, we've never had a limit on that, but there are enough routine causes of relcache flushes --- autovacuum for instance --- that I'm not really convinced relcache bloat can be a big problem in production. The plancache has never had a limit either, which is a design choice that was strongly influenced by our experience with catcaches. Again, I'm concerned about the costs of adding a management layer, and the likelihood that cache flushes will simply remove entries we'll soon have to rebuild. > FYI, Oracle provides one parameter, shared_pool_size, that determine the > size of a memory area that contains SQL plans and various dictionary > objects. Oracle decides how to divide the area among constituents. So > it could be possible that one component (e.g. table/index metadata) is > short of space, and another (e.g. SQL plans) has free space. 
Oracle > provides a system view to see the free space and hit/miss of each > component. If one component suffers from memory shortage, the user > increases shared_pool_size. This is similar to what Horiguchi-san is > proposing. Oracle seldom impresses me as having designs we ought to follow. They have a well-earned reputation for requiring a lot of expertise to operate, which is not the direction this project should be going in. In particular, I don't want to "solve" cache size issues by exposing a bunch of knobs that most users won't know how to twiddle. regards, tom lane
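Purely for illustration, here is a minimal sketch of the per-hit bookkeeping that an LRU'd cache implies, using the dlist primitives from lib/ilist.h; the MyCache structures and the removal helper are made up. The touch on every hit, and the pointer chasing it causes, is roughly where the "maintaining the LRU info was expensive" cost comes from:

    /* every cache hit must move the entry to the head of a recency list */
    static inline void
    cache_touch(MyCache *cache, MyCacheEntry *entry)
    {
        dlist_delete(&entry->lru_node);
        dlist_push_head(&cache->lru_list, &entry->lru_node);
    }

    /* over the limit?  evict from the tail, i.e. the least recently used entry */
    static void
    cache_enforce_limit(MyCache *cache)
    {
        while (cache->nentries > cache->size_limit)
        {
            MyCacheEntry *victim = dlist_tail_element(MyCacheEntry, lru_node,
                                                      &cache->lru_list);

            cache_remove_entry(cache, victim);      /* hypothetical removal helper */
        }
    }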
Re: Protect syscache from bloating with negative cache entries
Hi, On 2019-01-15 13:32:36 -0500, Tom Lane wrote: > Well, we *had* an LRU mechanism for the catcaches way back when. We got > rid of it --- see commit 8b9bc234a --- because (a) maintaining the LRU > info was expensive and (b) performance fell off a cliff in scenarios where > the cache size limit was exceeded. You could probably find some more info > about that by scanning the mail list archives from around the time of that > commit, but I'm too lazy to do so right now. > > That was a dozen years ago, and it's possible that machine performance > has moved so much since then that the problems are gone or mitigated. > In particular I'm sure that any limit we would want to impose today will > be far more than the 5000-entries-across-all-caches limit that was in use > back then. But I'm not convinced that a workload that would create 100K > cache entries in the first place wouldn't have severe problems if you > tried to constrain it to use only 80K entries. I think that'd be true if the accesses were truly randomly distributed - but that's not the case in the cases where I've seen huge caches. It's usually workloads that have tons of functions, partitions, ... and a lot of them are not that frequently accessed, but because we have no cache purging mechanism they stay around for a long time. This is often exacerbated by using a pooler to keep connections around for longer (which you have to, to cope with other limits of PG). > As far as the relcache goes, we've never had a limit on that, but there > are enough routine causes of relcache flushes --- autovacuum for instance > --- that I'm not really convinced relcache bloat can be a big problem in > production. It definitely is. > The plancache has never had a limit either, which is a design choice that > was strongly influenced by our experience with catcaches. This sounds a lot like having learned lessons from one bad implementation and applying them far outside of that situation. Greetings, Andres Freund
Re: Protect syscache from bloating with negative cache entries
On Tue, Jan 15, 2019 at 01:32:36PM -0500, Tom Lane wrote: > ... > > FYI, Oracle provides one parameter, shared_pool_size, that determine the > > size of a memory area that contains SQL plans and various dictionary > > objects. Oracle decides how to divide the area among constituents. So > > it could be possible that one component (e.g. table/index metadata) is > > short of space, and another (e.g. SQL plans) has free space. Oracle > > provides a system view to see the free space and hit/miss of each > > component. If one component suffers from memory shortage, the user > > increases shared_pool_size. This is similar to what Horiguchi-san is > > proposing. > > Oracle seldom impresses me as having designs we ought to follow. > They have a well-earned reputation for requiring a lot of expertise to > operate, which is not the direction this project should be going in. > In particular, I don't want to "solve" cache size issues by exposing > a bunch of knobs that most users won't know how to twiddle. > > regards, tom lane +1 Regards, Ken
Re: MERGE SQL statement for PG12
On Mon, Jan 14, 2019 at 1:37 AM Pavan Deolasee wrote: > Can you please help me understand what's fundamentally wrong with the > approach and more importantly, can you please explain what would the the > architecturally sound way to do this? The same also applies to the executor > side where the current approach is deemed wrong, but very little is said on > what's the correct way. I think the basic problem is that optimization should happen in the optimizer, not the parser. Deciding that MERGE looks like a RIGHT JOIN is not a parsing task and therefore shouldn't happen in the parser. The guys who were (are?) proposing support for ALIGN and NORMALIZE for temporal joins got the same complaint. Here's Tom talking about that: http://postgr.es/m/32265.1510681...@sss.pgh.pa.us I realize that you may have worked around some of the specific issues that Tom mentions in that email, but I believe the general point remains valid. What parse analysis should do is construct a representation of the query the user entered, not some other query derived from it. And this code clearly does the latter. You can see by the fact that it uses QSRC_PARSER, for example, a constant that is currently unused (and for good reason). And you can see that it results in hacks like the cross-check to make sure that resultRelRTE->relid == mergeRelRTE->relid. You need that because you are manipulating and transforming the query too early in the pipeline. If you do the transformation in the optimizer, you won't need stuff like this. There are other hints in this function that you're really doing planning tasks: + * The MERGE query's join can be tuned in some cases, see below for these + * special case tweaks. Tuning the query is otherwise known as optimization. Do it in the optimizer. + * Simplify the MERGE query as much as possible + * + * These seem like things that could go into Optimizer, but they are + * semantic simplifications rather than optimizations, per se. Actually, they are optimizations. The fact that an outer join may be more expensive than an inner join is a costing consideration, not something that the parser needs to know about. + /* + * This query should just provide the source relation columns. Later, in + * preprocess_targetlist(), we shall also add "ctid" attribute of the + * target relation to ensure that the target tuple can be fetched + * correctly. + */ This is not a problem unique to MERGE. It comes up for UPDATE and DELETE as well. The handling is not all one place, but it's all in the optimizer. The place which actually adds the CTID column is in preprocess_targetlist(), but note that it's much smarter than the code in the MERGE patch, because it can do different things depending on which type of rowmark is requested. The code as it exists in the MERGE patch can't possibly work for anything that doesn't have a CTID table, which means it won't work for FDWs and, in the future, any new heap types added via pluggable storage unless they happen to support CTID. Correctly written, MERGE will leverage what the optimizer already knows about rowmarks rather than reinventing them in the parse analysis phase. You should redesign this whole representation so that you just pass the whole query through to the optimizer without any structural change. 
Just as we do for other statements, you need to do the basic transformation stuff like looking up relation OIDs: that has to do with what the query MEANS and the external SQL objects on which it DEPENDS; those things have to be figured out before planning so that things like the plan-cache work correctly. But also just as we do for other statements, you should avoid having any code in this function that has to do with the way in which the MERGE should ultimately be executed, and you should not have any code here that builds a new query or does heavy restructuring of the parse tree. Since it looks like we will initially have only one strategy for executing MERGE, it might be OK to move the transformation that you're currently doing in transformMergeStmt() to some early phase of the optimizer. Here's me making approximately the same point about ALIGN and NORMALIZE: https://www.postgresql.org/message-id/CA+TgmoZ=UkRzpisqK5Qox_ekLG+SQP=xBeFiDkXTgLF_=1f...@mail.gmail.com However, I'm not sure that's actually the right way to go. Something, possibly this transformation, is causing us to end up with two copies of the target relation. This comment is a pretty good clue that this is nothing but an ugly kludge: + * While executing MERGE, the target relation is processed twice; once + * as a target relation and once to run a join between the target and the + * source. We generate two different RTEs for these two purposes, one with + * rte->inh set to false and other with rte->inh set to true. + * + * Since the plan re-evaluated by EvalPlanQual uses the join RTE, we must + * install the updated tuple in the scan corresponding to that RTE. The + * following me
Re: MERGE SQL statement for PG12
Hi, On 2019-01-15 14:05:25 -0500, Robert Haas wrote: > On Mon, Jan 14, 2019 at 1:37 AM Pavan Deolasee > wrote: > > Can you please help me understand what's fundamentally wrong with > > the approach and more importantly, can you please explain what would > > the the architecturally sound way to do this? The same also applies > > to the executor side where the current approach is deemed wrong, but > > very little is said on what's the correct way. > > [ Long and good explanation by Robert ] > I want to point out that it is not as if nobody has reviewed this > patch previously. Here is Andres making basically the same point > about parse analysis that I'm making here -- FWIW, I didn't find his > reply until after I'd written the above: > > https://www.postgresql.org/message-id/20180403021800.b5nsgiclzanobiup%40alap3.anarazel.de > > Here he is explaining these points some more: > > https://www.postgresql.org/message-id/2018040523.gar3j26gsk32gqpe%40alap3.anarazel.de > > And here's Peter Geoghegan not only explaining the same problem but > having a go at fixing it: > > https://www.postgresql.org/message-id/CAH2-Wz%3DZwNQvp11XjeHo-dBLHr9GDRi1vao8w1j7LQ8mOsUkzw%40mail.gmail.com > > Actually, I see that Peter Geoghegan not just the emails above but a > blizzard of other emails explaining the structural problems with the > patch, which I now see include not only the parse analysis concerns > but also the business of multiple RTEs which I mentioned above. + many. Pavan, I think the reason you're not getting much further feedback is that you and Simon have gotten a lot and only incorporated feedback only very grudgingly, if at all. You've not even attempted to sketch out a move of the main merge handling from parse-analysis to planner, as far as I can tell, despite this being one of the main criticisms for about a year. Given that not much besides rebasing has happened since v11, I don't find it surprising that people don't want to invest more energy in this patch. Greetings, Andres Freund
Re: Ryu floating point output patch
Andrew Gierth writes: > Funny thing: I've been devoting considerable effort to testing this, and > the one failure mode I've found is very interesting; it's not a problem > with strtod(), in fact it's a bug in our float4in caused by _misuse_ of > strtod(). > In particular, float4in thinks it's ok to do strtod() on the input, and > then round the result to a float. Hm. I'm not sure how much weight to put on this example, since we don't guarantee that float4 values will round-trip with less than 9 digits of printed precision. This example corresponds to extra_float_digits = 1 which doesn't meet that minimum. However, I agree that it's rounding the presented input in the less desirable direction. > strtof() exists as part of the relevant standards, so float4in should be > using that in place of strtod, and that seems to fix this case for me. To clarify that: strtof() is defined in C99 but not older standards (C89 and SUSv2 lack it). While we've agreed that we'll require C99 *compilers* going forward, I'm less enthused about supposing that the platform's libc is up to date, not least because replacing libc on an old box is orders of magnitude harder than installing a nondefault version of gcc. As a concrete example, gaur lacks strtof(), and I'd be pretty much forced to retire it if we start requiring that function. Putting a more modern gcc on that machine wasn't too painful, but I'm not messing with its libc. But I'd not object to adding a configure check and teaching float4in to use strtof() if available. regards, tom lane
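For what it's worth, a minimal sketch of that shape; the HAVE_STRTOF symbol and the fallback macro are assumptions about how a configure probe might be wired up, not an actual patch:

    #include <stdlib.h>

    /* assumed to be defined by a configure probe for strtof() */
    #ifndef HAVE_STRTOF
    /*
     * Pre-C99 libc: fall back to rounding through double, accepting the
     * double-rounding artifact discussed above.
     */
    #define strtof(str, endptr) ((float) strtod((str), (endptr)))
    #endif

    /* float4in could then parse in single precision directly, for example: */
    static float
    parse_float4(const char *num, char **endptr)
    {
        return strtof(num, endptr);
    }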
Re: Libpq support to connect to standby server as priority
On Mon, Jan 14, 2019 at 5:17 PM Tom Lane wrote: > The problem here of course is that whoever invented target_session_attrs > was unconcerned with following that precedent, so what we have is > "target_session_attrs=(any | read-write)". > Are we prepared to add some aliases in service of unifying these names? I wasn't unconcerned about the problem, but I wasn't prepared to to be the first person who added a connection parameter that used namesLikeThis instead of names_like_this, especially if the semantics weren't exactly the same. That seemed to be a recipe for somebody yelling at me, and I try to avoid that when I can. > 4. Given that other discussion, it's not quite clear what we should > even be checking. The existing logic devolves to checking that > transaction_read_only is true, but that's not really the same thing as > "is a master server", eg you might have connected to a master server > under a role that has SET ROLE default_transaction_read_only = false. > (I wonder what pgJDBC is really checking, under the hood.) > Do we want to have modes that are checking hot-standby state in some > fashion, rather than the transaction_read_only state? Well, this has been discussed before, too, I'm pretty sure, but I'm too lazy to go find the old discussion right now. The upshot is that default_transaction_read_only lets an administrator make a server look read-only even if it technically isn't, which somebody might find useful. Otherwise what do you do if, for example, you are using logical replication? None of your servers are in recovery, but you can make some of them report default_transaction_read_only = true if you like. To me, that kind of configurability is a feature, not a bug. That being said, I don't object to having even more values for target_session_attrs that check other things. You could have: read_only: default_transaction_read_only => true read_write: default_transaction_read_only => false master: pg_is_in_recovery => false standby: pg_is_in_recovery => true But what I think would be a Very Bad Plan is to use confused naming that looks for something different than what it purports to do. For example, if you were to change things so that read_write checks pg_is_in_recovery(), then you might ask for a "read-write" server and get one where only read-only transactions are permitted. We need not assume that "read-write master" and "read-only standby" are the only two kinds of things that can ever exist, as long as we're careful about the names we choose. Choosing the names carefully also helps to avoid POLA violations. Another point I'd like to mention is that target_session_attrs could be extended to care about other kinds of properties which someone might want a server to have, quite apart from master/standby/read-only/read-write. I don't know exactly what sort of thing somebody might care about, but the name is such that we can decide to care about other properties in the future without having to add a whole new parameter. You can imagine a day when someone can say target_session_attrs=read-write,v42+,ftl to get a server connection that is read-write on a server running PostgreSQL 42 or greater that also has a built-in hyperdrive. Or whatever. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
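For reference, the existing parameter already composes with libpq's multiple-host syntax; the additional values floated above (read_only, master, standby, or attribute lists such as "read-write,v42+,ftl") are hypothetical. A minimal example using today's syntax only:

    #include <stdio.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        /* try each listed host until one that allows read-write sessions is found */
        PGconn *conn = PQconnectdb("host=pg1,pg2 port=5432 dbname=app "
                                   "target_session_attrs=read-write");

        if (PQstatus(conn) != CONNECTION_OK)
            fprintf(stderr, "no read-write server available: %s", PQerrorMessage(conn));

        PQfinish(conn);
        return 0;
    }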
Re: Connection slots reserved for replication
On Wed, Jan 2, 2019 at 6:36 AM Alexander Kukushkin wrote: > I was also thinking about changing the value in PG_CONTROL_VERSION, > because we added the new field into the control file, but decided to > leave this change to committer. We typically omit catversion bumps from submitted patches to avoid unnecessary conflicts, but I think PG_CONTROL_VERSION doesn't change enough to cause a problem. Also, it's not date-dependent the way catversion is. So I would include the bump in the patch, if it were me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: New vacuum option to do only freezing
On Mon, Jan 14, 2019 at 9:24 PM Masahiko Sawada wrote: > I think that because the tuples that got dead after heap_page_prune() > looked are recorded but not removed without lazy_vacuum_page() we need > to process them in lazy_vacuum_page(). For decision about whether to > truncate we should not change it, so I will fix it. It should be an > another option to control whether to truncate if we want. I think that heap_page_prune() is going to truncate dead tuples to dead line pointers. At that point the tuple is gone. The only remaining step is to mark the dead line pointer as unused, which can't be done without scanning the indexes. So I don't understand what lazy_vacuum_page() would need to do in this case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
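To illustrate the states involved, a hypothetical sketch (PageGetItemId(), ItemIdIsDead() and ItemIdSetUnused() are the real macros from bufpage.h/itemid.h; the surrounding code is not the patch's code):

    /* after heap_page_prune(), a pruned tuple survives only as a dead line pointer */
    ItemId      itemid = PageGetItemId(page, offnum);

    if (ItemIdIsDead(itemid))
    {
        /*
         * The tuple storage is already gone.  The only remaining step is to
         * mark the pointer unused, which is safe only after the indexes have
         * been scanned, so that no index entries still reference it.
         */
        ItemIdSetUnused(itemid);
    }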
Re: MERGE SQL statement for PG12
Robert Haas writes: > You should redesign this whole representation so that you just pass > the whole query through to the optimizer without any structural > change. Just as we do for other statements, you need to do the basic > transformation stuff like looking up relation OIDs: that has to do > with what the query MEANS and the external SQL objects on which it > DEPENDS; those things have to be figured out before planning so that > things like the plan-cache work correctly. But also just as we do for > other statements, you should avoid having any code in this function > that has to do with the way in which the MERGE should ultimately be > executed, and you should not have any code here that builds a new > query or does heavy restructuring of the parse tree. Just to comment on that: IMV, the primary task of parse analysis is exactly to figure out which database objects the query references or depends on. We must know that much so that we can store proper dependency information if the query goes into a view or rule. But as soon as you add more processing than that, you are going to run into problems of at least two kinds: * The further the parsetree diverges from just representing what was entered, the harder the task of ruleutils.c to reconstruct something that looks like what was entered. * The more information you absorb about the referenced objects, the more likely it is that a stored view/rule will be broken by subsequent ALTER commands. This connects to Robert's complaint about the parse transformation depending on CTID, though I don't think it's quite the same thing. We want to minimize the dependency surface of stored rules -- for example, it's okay for a stored view to depend on the signature (argument and return data types) of a function, but not for it to depend on, say, the provolatile property. If we allowed that then we'd have to forbid ALTER FUNCTION from changing the volatility. I've not looked at this patch closely enough to know whether it moves the goalposts in terms of what a dependency-on-a- table means, but judging from Robert's and Andres' reviews, I'm quite worried that it does. Delaying transformations to the rewrite or plan phases avoids these problems because now you are past the point where the query could be stored to or fetched from a rule. regards, tom lane
Re: Why are we PageInit'ing buffers in RelationAddExtraBlocks()?
Hi, On 2018-12-20 15:04:11 -0800, Andres Freund wrote: > On 2018-12-19 16:56:36 -0800, Andres Freund wrote: > > On 2018-12-19 19:39:33 -0500, Tom Lane wrote: > > > Robert Haas writes: > > > > On Wed, Dec 19, 2018 at 5:37 PM Andres Freund > > > > wrote: > > > >> What's gained by the logic of emitting that warning in VACUUM after a > > > >> crash? I don't really see any robustness advantages in it. > > > > > > > I don't know. I am just normally reluctant to change things > > > > precipitously that are of long tenure. > > > > > > Me too, but I think Andres has a point here. Those warnings in VACUUM > > > are ancient, probably predating the introduction of WAL :-(. At the > > > time there was good reason to be suspicious of zeroed pages in tables. > > > Now, though, we have (what we think is) a bulletproof crash recovery > > > procedure in which possibly-zeroed pages are to be expected; so we're > > > just causing users unnecessary alarm by warning about them. I think > > > it's reasonable to, if not remove the messages entirely, at least > > > downgrade them to a low DEBUG level. > > > > Yea, I think that'd be reasonable. > > > > I think we ought to add them, as new/zero pages, to the FSM. If we did > > so both during vacuum and RelationAddExtraBlocks() we'd avoid the > > redundant writes, and avoid depending on heap layout in > > RelationAddExtraBlocks(). > > > > I wonder if it'd make sense to only log a DEBUG if there's no > > corresponding FSM entry. That'd obviously not be bulltproof, but it'd > > reduce the spammyness considerably. > > Here's a prototype patch implementing this change. I'm not sure the new > DEBUG message is really worth it? > > > Looking at the surrounding code made me wonder about the wisdom of > entering empty pages as all-visible and all-frozen into the VM. That'll > mean we'll never re-discover them on a primary, after promotion. There's > no mechanism to add such pages to the FSM on a standby (in contrast to > e.g. pages where tuples are modified), as vacuum will never visit that > page again. Now obviously it has the advantage of avoiding > re-processing the page in the next vacuum, but is that really an > important goal? If there's frequent vacuums, there got to be a fair > amount of modifications, which ought to lead to re-use of such pages at > a not too far away point? Any comments on the approach in this patch? Greetings, Andres Freund
Re: Ryu floating point output patch
> "Tom" == Tom Lane writes: On a slightly unrelated note, is the small-is-zero variant output file for the float tests still required? -- Andrew (irc:RhodiumToad)
Re: Proving IS NOT NULL inference for ScalarArrayOpExpr's
James Coleman writes: > [ saop_is_not_null-v6.patch ] I quite dislike taking the responsibility out of clause_is_strict_for and putting it in the callers. Aside from requiring duplicative code, it means that this fails to prove anything for recursive situations (i.e., where the ScalarArrayOp appears one or more levels down in a clause examined by clause_is_strict_for). If the behavior needs to vary depending on proof rule, which it looks like it does, the way to handle that is to add flag argument(s) to clause_is_strict_for. I'm also not happy about the lack of comments justifying the proof rules -- eg, it's far from obvious why you need to restrict one case to !weak but not the other. regards, tom lane
Re: Proving IS NOT NULL inference for ScalarArrayOpExpr's
On Wed, 16 Jan 2019 at 03:33, James Coleman wrote: > > 2. I was also staring predicate_implied_by_simple_clause() a bit at > > the use of clause_is_strict_for() to ensure that the IS NOT NULL's > > operand matches the ScalarArrayOpExpr left operand. Since > > clause_is_strict_for() = "Can we prove that "clause" returns NULL if > > "subexpr" does?", in this case, your clause is the ScalarArrayOpExpr's > > left operand and subexpr is the IS NOT NULL's operand. This means > > that a partial index with "WHERE a IS NOT NULL" should also be fine to > > use for WHERE strict_func(a) IN (1,2,..., 101); since strict_func(a) > > must be NULL if a is NULL. Also also works for WHERE a+a > > IN(1,2,...,101); I wonder if it's worth adding a test for that, or > > even just modify one of the existing tests to ensure you get the same > > result from it. Perhaps it's worth an additional test to ensure that x > > IN(1,2,...,101) does not imply x+x IS NOT NULL and maybe that x+x IS > > NULL does not refute x IN(1,2,...,101), as a strict function is free > > to return NULL even if it's input are not NULL. > > Are you suggesting a different test than clause_is_strict_for to > verify the saop LHS is the same as the null test's arg? I suppose we > could use "equal()" instead? I wasn't suggesting any code changes. I just thought the code was sufficiently hard to understand to warrant some additional tests that ensure we don't assume that if the int4 column x is not null that also x+x is not null. Only the reverse can be implied since int4pl is strict. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: pgsql: Update ssl test certificates and keys
On Fri, Jan 4, 2019 at 10:08 AM Thomas Munro wrote: > On Fri, Jan 4, 2019 at 3:36 AM Peter Eisentraut > wrote: > > On 23/12/2018 09:04, Michael Paquier wrote: > > > On Tue, Nov 27, 2018 at 02:21:39PM +, Peter Eisentraut wrote: > > >> Update ssl test certificates and keys > > >> > > >> Debian testing and newer now require that RSA and DHE keys are at > > >> least 2048 bit long and no longer allow SHA-1 for signatures in > > >> certificates. This is currently causing the ssl tests to fail there > > >> because the test certificates and keys have been created in violation > > >> of those conditions. > > >> > > >> Update the parameters to create the test files and create a new set of > > >> test files. > > > > > > Peter, would it make sense to back-patch this commit down to where the > > > SSL tests have been introduced? If /etc/ssl/ is not correctly > > > configured, this results in failures across branches on Debian if the > > > default is used. > > > > done > > Thanks. FWIW I've just updated eelpout (a Debian testing BF animal > that runs all the extra tests including SSL) to use libssl-dev > (instead of libssl1.0-dev), and cleared its accache files. Let's see > if that works... Since that upgrade (to libssl 1.1.1a-1), there are have been a few intermittent failures in the SSL tests on that animal (thanks Tom for pointing that out off-list). In a quick check, I was able to reproduce the failure after about 8 successful runs of "make check" under src/test/ssl on that machine. I couldn't immediately see what the problem was and I'm away from computers and work this week, so I'll have to investigate properly early next week. The main unusual thing about that animal is that it's an ARM CPU. FWIW I run that test by having this in build-farm.conf (I mention this in case someone wants to do the same on a Debian buster/testing x86 system to see if it has a similar problem, if there isn't one like that already): $ENV{PG_TEST_EXTRA} = "ssl ldap kerberos"; -- Thomas Munro http://www.enterprisedb.com
PSA: "tenk1" and other similar regression test tables are from the Wisconsin Benchmark
I recently purchased a copy of "The Benchmark Handbook", a book from the early 1990s that was edited by Jim Gray. It features analysis of the Wisconsin Benchmark in chapter 3 -- that's a single client benchmark that famously showed real limitations in the optimizers that were current in the early to mid 1980s. The book describes various limitations of Wisconsin as a general purpose benchmark, but it's still interesting in other ways, then and now. The book goes on to say that it is still often used in regression tests. I see that we even had a full copy of the benchmark until it was torn out by commit a05a4b47 in 2009. I don't think that anybody will be interested in the Benchmark itself, but the design of the benchmark may provide useful context. I could imagine somebody with an interest in the optimizer finding the book useful. I paid about $5 for a second hand copy of the first edition, so it isn't a hard purchase for me to justify. -- Peter Geoghegan
Re: Ryu floating point output patch
Andrew Gierth writes: > On a slightly unrelated note, is the small-is-zero variant output file > for the float tests still required? Hm, good question. I had been about to respond that it must be, but looking more closely at pg_regress.c I see that the default expected-file will be tried if the one fingered by resultmap isn't an exact match. So if everybody is matching float8.out these days, we wouldn't know it. The only one of the resultmap entries I'm in a position to try right now is x86 freebsd, and I can report that FreeBSD/x86 matches the default float8.out, and has done at least back to FreeBSD 8.4. It wouldn't be an unreasonable suspicion to guess that the other BSDen have likewise been brought up to speed to produce ERANGE on underflow, but I can't check it ATM. On the other hand, cygwin is probably pretty frozen-in-time, so it might still have the old behavior. Can anyone check that? regards, tom lane
Re: Why are we PageInit'ing buffers in RelationAddExtraBlocks()?
Andres Freund writes: >> Looking at the surrounding code made me wonder about the wisdom of >> entering empty pages as all-visible and all-frozen into the VM. That'll >> mean we'll never re-discover them on a primary, after promotion. There's >> no mechanism to add such pages to the FSM on a standby (in contrast to >> e.g. pages where tuples are modified), as vacuum will never visit that >> page again. Now obviously it has the advantage of avoiding >> re-processing the page in the next vacuum, but is that really an >> important goal? If there's frequent vacuums, there got to be a fair >> amount of modifications, which ought to lead to re-use of such pages at >> a not too far away point?

> Any comments on the approach in this patch?

I agree with the concept of postponing page init till we're actually going to do something with the page. However, the patch seems to need some polish:

* There's a comment in RelationAddExtraBlocks, just above where you changed, that

     * Extend by one page.  This should generally match the main-line
     * extension code in RelationGetBufferForTuple, except that we hold
     * the relation extension lock throughout.

This seems to be falsified by this patch, in that one of the two paths does PageInit and the other doesn't.

* s/unitialized pages/uninitialized pages/

* This bit in vacuumlazy seems unnecessarily confusing:

    +        Size        freespace = 0;
    ...
    +        if (GetRecordedFreeSpace(onerel, blkno) == 0)
    +            freespace = BufferGetPageSize(buf) - SizeOfPageHeaderData;
    +
    +        if (freespace > 0)
    +        {
    +            RecordPageWithFreeSpace(onerel, blkno, freespace);

I'd write that as just

    +        if (GetRecordedFreeSpace(onerel, blkno) == 0)
    +        {
    +            Size        freespace = BufferGetPageSize(buf) - SizeOfPageHeaderData;
    +
    +            RecordPageWithFreeSpace(onerel, blkno, freespace);

I tend to agree that the DEBUG message isn't very necessary, or at least could be lower than DEBUG1.

regards, tom lane
Re: [PROPOSAL] Shared Ispell dictionaries
Hello Arthur, I've looked at the patch today, and in general is seems quite solid to me. I do have a couple of minor points 1) I think the comments need more work. Instead of describing all the individual changes here, I've outlined those improvements in attached patches (see the attached "tweaks" patches). Some of it is formatting, minor rewording or larger changes. Some comments are rather redundant (e.g. the one before calls to release the DSM segment). 2) It's not quite clear to me why we need DictInitData, which simply combines DictPointerData and list of options. It seems as if the only point is to pass a single parameter to the init function, but is it worth it? Why not to get rid of DictInitData entirely and pass two parameters instead? 3) I find it a bit cumbersome that before each ts_dict_shmem_release call we construct a dummy DickPointerData value. Why not to pass individual parameters and construct the struct in the function? 4) The reference to max_shared_dictionaries_size is obsolete, because there's no such limit anymore. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >From e76d34bcb1a84127f9b4402f0147642d77505cc2 Mon Sep 17 00:00:00 2001 From: Tomas Vondra Date: Tue, 15 Jan 2019 22:16:35 +0100 Subject: [PATCH 1/7] Fix ispell memory handling --- src/backend/tsearch/spell.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/src/backend/tsearch/spell.c b/src/backend/tsearch/spell.c index eb39466b22..eb8416ce7f 100644 --- a/src/backend/tsearch/spell.c +++ b/src/backend/tsearch/spell.c @@ -78,6 +78,8 @@ #define tmpalloc(sz) MemoryContextAlloc(Conf->buildCxt, (sz)) #define tmpalloc0(sz) MemoryContextAllocZero(Conf->buildCxt, (sz)) +#define tmpstrdup(str) MemoryContextStrdup(Conf->buildCxt, (str)) + /* * Prepare for constructing an ISpell dictionary. * @@ -498,7 +500,7 @@ NIAddSpell(IspellDict *Conf, const char *word, const char *flag) Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1); strcpy(Conf->Spell[Conf->nspell]->word, word); Conf->Spell[Conf->nspell]->p.flag = (*flag != '\0') - ? cpstrdup(Conf, flag) : VoidString; + ? 
tmpstrdup(flag) : VoidString; Conf->nspell++; } @@ -1040,7 +1042,7 @@ setCompoundAffixFlagValue(IspellDict *Conf, CompoundAffixFlag *entry, entry->flag.i = i; } else - entry->flag.s = cpstrdup(Conf, s); + entry->flag.s = tmpstrdup(s); entry->flagMode = Conf->flagMode; entry->value = val; @@ -1541,6 +1543,9 @@ nextline: return; isnewformat: + pfree(recoded); + pfree(pstr); + if (oldformat) ereport(ERROR, (errcode(ERRCODE_CONFIG_FILE_ERROR), -- 2.17.2 >From dbb3cc3b7e7c560472cfa5efa77598f4992e0b70 Mon Sep 17 00:00:00 2001 From: Tomas Vondra Date: Tue, 15 Jan 2019 22:17:32 +0100 Subject: [PATCH 2/7] Change tmplinit argument --- contrib/dict_int/dict_int.c | 4 +- contrib/dict_xsyn/dict_xsyn.c| 4 +- contrib/unaccent/unaccent.c | 4 +- src/backend/commands/tsearchcmds.c | 10 +++- src/backend/snowball/dict_snowball.c | 4 +- src/backend/tsearch/dict_ispell.c| 4 +- src/backend/tsearch/dict_simple.c| 4 +- src/backend/tsearch/dict_synonym.c | 4 +- src/backend/tsearch/dict_thesaurus.c | 4 +- src/backend/utils/cache/ts_cache.c | 19 +++- src/include/tsearch/ts_cache.h | 4 ++ src/include/tsearch/ts_public.h | 69 ++-- 12 files changed, 113 insertions(+), 21 deletions(-) diff --git a/contrib/dict_int/dict_int.c b/contrib/dict_int/dict_int.c index 628b9769c3..ddde55eee4 100644 --- a/contrib/dict_int/dict_int.c +++ b/contrib/dict_int/dict_int.c @@ -30,7 +30,7 @@ PG_FUNCTION_INFO_V1(dintdict_lexize); Datum dintdict_init(PG_FUNCTION_ARGS) { - List *dictoptions = (List *) PG_GETARG_POINTER(0); + DictInitData *init_data = (DictInitData *) PG_GETARG_POINTER(0); DictInt*d; ListCell *l; @@ -38,7 +38,7 @@ dintdict_init(PG_FUNCTION_ARGS) d->maxlen = 6; d->rejectlong = false; - foreach(l, dictoptions) + foreach(l, init_data->dict_options) { DefElem*defel = (DefElem *) lfirst(l); diff --git a/contrib/dict_xsyn/dict_xsyn.c b/contrib/dict_xsyn/dict_xsyn.c index 509e14aee0..15b1a0033a 100644 --- a/contrib/dict_xsyn/dict_xsyn.c +++ b/contrib/dict_xsyn/dict_xsyn.c @@ -140,7 +140,7 @@ read_dictionary(DictSyn *d, const char *filename) Datum dxsyn_init(PG_FUNCTION_ARGS) { - List *dictoptions = (List *) PG_GETARG_POINTER(0); + DictInitData *init_data = (DictInitData *) PG_GETARG_POINTER(0); DictSyn*d; ListCell *l; char *filename = NULL; @@ -153,7 +153,7 @@ dxsyn_init(PG_FUNCTION_ARGS) d->matchsynonyms = false; d->keepsynonyms = true; - foreach(l, dictoptions) + foreach(l, init_data->dict_options) { DefElem*defel = (DefElem *) lfirst(l); diff --git a/contrib/unaccent/unaccent.c b/contrib/unaccent/unaccent.c index fc5176e338..f3663cef
RE: Log a sample of transactions
Dear Adrien, > Your messages looks normal to me, you will have same messages if you put > log_min_duration to 0. That was my lack of understanding. Sorry. By the way, I think the name of this parameter should be changed. In the Cambridge dictionary, the word "rate" means as follows: the speed at which something happens or changes, or the amount or number of times it happens or changes in a particular period. This parameter represents the "probability" that a transaction is logged, therefore this name is inappropriate. You should use "probability" or "ratio", I think. Next week I will give you some comments about your source code. Best Regards, Hayato Kuroda Fujitsu LIMITED
Re: Proving IS NOT NULL inference for ScalarArrayOpExpr's
On Tue, Jan 15, 2019 at 3:53 PM Tom Lane wrote: > > James Coleman writes: > > [ saop_is_not_null-v6.patch ] > > I quite dislike taking the responsibility out of clause_is_strict_for > and putting it in the callers. Aside from requiring duplicative code, > it means that this fails to prove anything for recursive situations > (i.e., where the ScalarArrayOp appears one or more levels down in a > clause examined by clause_is_strict_for). > > If the behavior needs to vary depending on proof rule, which it > looks like it does, the way to handle that is to add flag argument(s) > to clause_is_strict_for. The reason I moved it was that we're no longer just proving strictness, so it seemed odd to put it in a function specifically named to be about proving strictness. If you'd prefer an argument like "bool allow_false" or "bool check_saop_implies_is_null" I'm happy to make that change, but it seems generally unhelpful to change the function implementation to contradict the name (the empty array case returns false even when the lhs is null). > I'm also not happy about the lack of comments justifying the proof > rules -- eg, it's far from obvious why you need to restrict one > case to !weak but not the other. Both implication and refutation have the !weak check; it's just that the check in the implication function was already present. I'll add more comments inline for the proof rules, but from what I can tell only strong implication holds for implication of IS NOT NULL. A sample case disproving weak implication is "select null is not null, null::int = any(array[null]::int[])", which results in "false, null", so non-falsity of the ScalarArrayOpExpr does not imply non-falsity of the IS NOT NULL. However I'm currently unable to come up with a case for refutation such that truth of a ScalarArrayOp refutes IS NULL, so I'll update code and tests for that. That also takes care of my earlier point: James Coleman writes: > 2. The tests I have for refuting "x is null" show that w_r_holds as > well as s_r_holds. I'd only expected the latter, since only > "strongly_refuted_by = t". But it still leaves my questions (1) and (3). James Coleman
Re: Proving IS NOT NULL inference for ScalarArrayOpExpr's
On Tue, Jan 15, 2019 at 4:36 PM David Rowley wrote: > I wasn't suggesting any code changes. I just thought the code was > sufficiently hard to understand to warrant some additional tests that > ensure we don't assume that if the int4 column x is not null that also > x+x is not null. Only the reverse can be implied since int4pl is > strict. At the risk of missing something obvious, I'm not sure I see a case where "x is not null" does not imply "(x + x) is not null", at least for integers. Since an integer + an integer results in an integer, then it must imply the addition of itself is not null also? This is the root of the questions I had: James Coleman writes: > 1. The current code doesn't result in "strongly_implied_by = t" for > the "(x + x) is not null" case, but it does result in "s_i_holds = t". > This doesn't change if I switch to using "equal()" as mentioned above. > 3. The tests I have for refuting "(x + x) is null" show that both > s_r_holds and w_r_holds. I'd expected these to be false. James Coleman
Re: Proving IS NOT NULL inference for ScalarArrayOpExpr's
On Wed, 16 Jan 2019 at 14:05, James Coleman wrote: > At the risk of missing something obvious, I'm not sure I see a case > where "x is not null" does not imply "(x + x) is not null", at least > for integers. Since an integer + an integer results in an integer, > then it must imply the addition of itself is not null also? A strict function guarantees it will return NULL on any NULL input. This does not mean it can't return NULL on a non-NULL input. While int4pl might do what you want, some other strict function might not. A simple example would be a strict function that decided to return NULL when the two ints combined overflowed int. The docs [1] define STRICT as: "RETURNS NULL ON NULL INPUT or STRICT indicates that the function always returns null whenever any of its arguments are null. If this parameter is specified, the function is not executed when there are null arguments; instead a null result is assumed automatically." [1] https://www.postgresql.org/docs/devel/sql-createfunction.html -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
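For example, a hypothetical C-language function marked STRICT could legally look like this; pg_add_s32_overflow() is the real helper from common/int.h, while the function itself is made up for illustration. This is why "a IS NOT NULL" cannot be taken to imply "f(a) IS NOT NULL" for a strict f, even though the reverse direction is safe:

    #include "postgres.h"
    #include "fmgr.h"
    #include "common/int.h"

    PG_FUNCTION_INFO_V1(int4pl_or_null);

    /* declared STRICT at the SQL level, yet may return NULL for non-NULL inputs */
    Datum
    int4pl_or_null(PG_FUNCTION_ARGS)
    {
        int32       a = PG_GETARG_INT32(0);     /* never NULL here: fmgr short-circuits */
        int32       b = PG_GETARG_INT32(1);
        int32       result;

        if (pg_add_s32_overflow(a, b, &result))
            PG_RETURN_NULL();                   /* NULL despite non-NULL inputs */

        PG_RETURN_INT32(result);
    }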
Re: Proving IS NOT NULL inference for ScalarArrayOpExpr's
On Tue, Jan 15, 2019 at 8:14 PM David Rowley wrote: > > On Wed, 16 Jan 2019 at 14:05, James Coleman wrote: > > At the risk of missing something obvious, I'm not sure I see a case > > where "x is not null" does not imply "(x + x) is not null", at least > > for integers. Since an integer + an integer results in an integer, > > then it must imply the addition of itself is not null also? > > A strict function guarantees it will return NULL on any NULL input. > This does not mean it can't return NULL on a non-NULL input. > > While int4pl might do what you want, some other strict function might > not. A simple example would be a strict function that decided to > return NULL when the two ints combined overflowed int. Yes, the claim about not all strict functions guarantee this (like int4pl) made sense. Is your preference in this kind of case to comment the code and/or tests but stick with int4pl even though it doesn't demonstrate the "problem", or try to engineer a different test case such that the *_holds results in the tests don't seem to imply we could prove more things than we do? James Coleman
Re: Proving IS NOT NULL inference for ScalarArrayOpExpr's
On Wed, 16 Jan 2019 at 14:29, James Coleman wrote: > > On Tue, Jan 15, 2019 at 8:14 PM David Rowley > > While int4pl might do what you want, some other strict function might > > not. A simple example would be a strict function that decided to > > return NULL when the two ints combined overflowed int. > > Yes, the claim about not all strict functions guarantee this (like > int4pl) made sense. > > Is your preference in this kind of case to comment the code and/or > tests but stick with int4pl even though it doesn't demonstrate the > "problem", or try to engineer a different test case such that the > *_holds results in the tests don't seem to imply we could prove more > things than we do? I think using x+x or whatever is fine. I doubt there's a need to invent some new function that returns NULL on non-NULL input. The code you're adding has no idea about the difference between the two. It has no way to know that anyway. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Log a sample of transactions
On Tue, Jan 15, 2019 at 06:03:36PM +0100, Adrien NAYRAT wrote: > The goal is not to find slow queries in a transaction, but troubleshoot > applicative issue when you have short queries. > > Sometimes you want to understand what happens in a transaction, either you > perfectly know your application, either you have to log all queries and find > ones with the same transaction ID (%x). It could be problematic if you have > a huge traffic with fast queries. Looks like a sensible argument to me. High log throughput can cause Postgres performance to go down by a couple of percent, which defeats the purpose of tracking down performance problems, as the logging itself could directly impact the application. -- Michael signature.asc Description: PGP signature
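For context, the mechanism being discussed boils down to a single per-transaction coin flip; the GUC and flag names below are assumptions based on this thread, not necessarily what the final patch uses:

    /* decided once at transaction start: the whole transaction is either logged or not */
    xact_is_sampled = log_transaction_sample_rate != 0 &&
        (log_transaction_sample_rate == 1 ||
         random() <= log_transaction_sample_rate * MAX_RANDOM_VALUE);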
Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name
On Sat, Jan 12, 2019 at 08:10:07AM +0900, Michael Paquier wrote: > On Fri, Jan 11, 2019 at 12:52:08PM -0500, Robert Haas wrote: >> With the patch, the PrimaryConnInfo and PrimarySlotName arguments are >> removed from RequestXLogStreaming. That means that the new >> walreceiver could come to a different conclusion about the values of >> those arguments than the startup process. In particular, it could end >> up thinking that primary_conninfo is empty when, if the startup >> process had thought that, the walreceiver would never have been >> launched in the first place. But it's not obvious that you've added >> any logic in WALReceiverMain or elsewhere to compensate for this >> possibility -- what would happen in that case? Would we crash? >> Connect to the wrong server? > > If I contemplate the patch this morning there is this bit: > @@ -291,32 +295,40 @@ WalReceiverMain(void) > /* Unblock signals (they were blocked when the postmaster forked > us) */ > PG_SETMASK(&UnBlockSig); > > + /* > +* Fail immediately if primary_conninfo goes missing, better safe than > +* sorry. > +*/ > + if (PrimaryConnInfo == NULL || strcmp(PrimaryConnInfo, "") == 0) > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > +errmsg("cannot connect to the primary server as > \"primary_conninfo\" is not defined"))); > > So the answer to your question is: the WAL receiver fails to start. Robert, does this answer your question? I agree that depending on the timing, we could finish by having the startup process spawning a WAL receiver with a given connection string, and that the WAL receiver could use a different one, but as long as we fail the WAL receiver startup this does not seem like an issue to me, so I disagree with the upthread statement that the patch author has not thought about such cases :) It seems to me that making the WAL receiver relying more on the GUC context makes it more robust when it comes to reloading the parameters which would be an upcoming change as we need to rely on the WAL receiver check the GUC context itself and FATAL (or we would need to have the startup process check for a change in primary_conninfo/primary_slot_name, and then have the startup process kill the WAL receiver by itself). In short, handling all that in the WAL receiver would be more robust in the long term in my opinion as we reduce interactions between the WAL receiver and the startup process. And on top of it we get rid of ready_to_display, which is something I am unhappy with since we had to introduce it. -- Michael signature.asc Description: PGP signature
Re: O_DIRECT for relations and SLRUs (Prototype)
On Tue, Jan 15, 2019 at 07:40:12PM +0300, Maksim Milyutin wrote: > Could you specify all cases when buffers will not be aligned with BLCKSZ? > > AFAIC shared and temp buffers are aligned. And what ones are not? SLRU buffers are not aligned with the OS pages (aka alignment with 4096 at least). There are also a bunch of code paths where the callers of mdread() or mdwrite() don't do that, which makes a correct patch more invasive. -- Michael signature.asc Description: PGP signature
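To make the alignment issue concrete, a rough sketch of aligning a staging buffer to the OS page size by hand; TYPEALIGN is the real macro from c.h, while the constant and names are assumptions:

    #define DIRECT_IO_ALIGNMENT     4096    /* OS page size on typical Linux */

    /* over-allocate, then carve out an address aligned to the OS page size */
    static char rawbuf[BLCKSZ + DIRECT_IO_ALIGNMENT];

    static char *
    get_aligned_buffer(void)
    {
        return (char *) TYPEALIGN(DIRECT_IO_ALIGNMENT, rawbuf);
    }

    /*
     * O_DIRECT reads and writes must go through the aligned pointer; callers
     * that supply arbitrary, unaligned buffers force an extra copy into such
     * a staging area first.
     */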
Bump up PG_CONTROL_VERSION on HEAD
Hi all, f3db7f16 has proved that it can be a bad idea to run pg_resetwal on a data folder which does not match the version it has been compiled with. As of HEAD, PG_CONTROL_VERSION is still 1100: $ pg_controldata | grep "pg_control version" pg_control version number:1100 Wouldn't it be better to bump it up to 1200? Thanks, -- Michael signature.asc Description: PGP signature
Re: Bump up PG_CONTROL_VERSION on HEAD
Hi, On 2019-01-16 11:02:08 +0900, Michael Paquier wrote: > f3db7f16 has proved that it can be a bad idea to run pg_resetwal on a > data folder which does not match the version it has been compiled > with. > > As of HEAD, PG_CONTROL_VERSION is still 1100: > $ pg_controldata | grep "pg_control version" > pg_control version number:1100 > > Wouldn't it be better to bump it up to 1200? We don't commonly bump that without corresponding changes to the contents of the control file. I don't see what we'd gain by the bump? Greetings, Andres Freund
Re: current_logfiles not following group access and instead follows log_file_mode permissions
On Tue, Jan 15, 2019 at 10:53:30AM -0500, Tom Lane wrote: > On reflection, maybe the problem is not that we're giving the file > the wrong permissions, but that we're putting it in the wrong place? > That is, seems like it should be in the logfile directory not the > data directory. That would certainly simplify the intended use-case, > and it would fix this complaint too. Yeah, thinking more on this one, using different permissions for this file than for the log files makes little sense, so what you propose here seems like a sensible thing to do. Even if we exclude the file from native BASE_BACKUP, this would not solve the case of custom backup solutions doing their own copy of things when they rely on group-read permissions. This would not completely solve the problem anyway if the log files are in the data folder, but it would address the case where the log files are in an absolute path outside the data folder. I am adding in CC Gilles, who implemented current_logfiles, for his input. -- Michael signature.asc Description: PGP signature
RE: Libpq support to connect to standby server as priority
From: Tom Lane [mailto:t...@sss.pgh.pa.us] > Michael Paquier writes: > > On Tue, Jan 15, 2019 at 02:00:57AM +, Tsunakawa, Takayuki wrote: > >> The original desire should have been the ability to connect to a > >> primary or a standby. So, I think we should go back to the original > >> thinking (and not complicate the feature), and create a read only > >> GUC_REPORT variable, say, server_role, that identifies whether the > >> server is a primary or a standby. > > > From the point of view of making sure that a client is really > > connected to a primary or a standby, this is the best idea around. > > There are a couple of issues here: > 1. Are you sure there are no use-cases for testing transaction_read_only > as such? I don't find any practical use case, but I won't object to leaving the current target_session_attrs as-is. Aside from that, I think a parameter like PgJDBC's is necessary, e.g., target_server_type = {primary | standby | prefer_standby}, which acts based on just the server role (primary or standby). > 2. What will the fallback implementation be, when connecting to a server > too old to have the variable you want? One of the following: 1. An "unsupported" error. I'll take this. 2. libpq issues "SELECT pg_is_in_recovery()". 3. Blindly accept the first successful connection. Regards Takayuki Tsunakawa
Re: What to name the current heap after pluggable storage / what to rename?
On Tue, Jan 15, 2019 at 1:43 PM Andres Freund wrote: > Hi, > > On 2019-01-11 12:01:36 -0500, Robert Haas wrote: > > On Thu, Jan 10, 2019 at 7:05 PM Andres Freund > wrote: > > > ISTM that the first block would best belong into new files like > > > access/rel[ation].h and access/common/rel[ation].h. > > > > +1. > > > > > I think the second > > > set should be renamed to be table_open() (with backward compat > > > redirects, there's way way too many references) and should go into > > > access/table.h access/table/table.c alongside tableam.[ch], > > > > Sounds reasonable. > > > > > but I could > > > also see just putting them into relation.[ch]. > > > > I would view that as a less-preferred alternative, but not crazy. > > Here's a set of patches. Not commit quality, but enough to discuss. > > > The first patch, the only really interesting one, splits out > relation_(open|openrv|openrv_extended|close) into access/relation.h and > access/common/relation.c > and > heap_(open|openrv|openrv_extended|close) into access/table.h and > access/table/table.c > > It's worthwhile to note that there's another header named > nodes/relation.h. But there's also utils/rel.h, so I couldn't think of a > another good name. > The access/relation.[c|h] name is fine. Or how about access/rel.[c|h], since nodes/relation.h is related to the planner, and utils/rel.h is somehow related to the relation caches? > I'm basically thinking that table.h, even in the post pluggable storage > world, should not contain lower level functionality like dispatching > into table-am (that'll reside in tableam.h). But e.g. a > simple_table_(insert|update|delete) could live there, as well as > potentially some other heap_ functionality strewn around the backend. > > I made table.h not include relation.h, which means that a few files > might need both. I'm not sure that's the right choice, but it seems > easier to extend that later if shows to be painful, than to do the > reverse. > The need for both relation.h and table.h in a single file comes from the use of both the relation_open and table_open functions. When I checked one of the files where both headers are included, I found that it simply used the relation_open function even though the type of the relation is known. I didn't check whether it is possible or not, but if the kind of relation is known at every caller of relation_open, such calls could be replaced with table_open, index_open, or composite_type_open, so that none of the relation functions would need to be exposed. > I've left the following in table.h: > /* > * heap_ used to be the prefix for these routines, and a lot of code will just > * continue to work without adaptions after the introduction of pluggable > * storage, therefore just map these names. > */ > #define heap_open(r, l) table_open(r, l) > #define heap_openrv(r, l) table_openrv(r, l) > #define heap_openrv_extended(r, l, m) table_openrv_extended(r, l, m) > #define heap_close(r, l) table_close(r, l) > > and I think we should leave that in there for the forseeable future. > The above macros are good. They are useful to avoid breaking any third-party extensions. Regards, Haribabu Kommi Fujitsu Australia
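As a usage sketch of the renamed entry points (the function, OID and lock level are placeholders; table_open()/table_close() are the names being introduced by this patch set):

    #include "postgres.h"
    #include "access/table.h"
    #include "utils/rel.h"

    void
    do_something_with_table(Oid relid)
    {
        Relation    rel = table_open(relid, AccessShareLock);

        /* ... operate on the table ... */

        table_close(rel, AccessShareLock);
    }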
Re: WIP: Avoid creation of the free space map for small tables
On Fri, Jan 11, 2019 at 3:54 AM John Naylor wrote: > > On Wed, Jan 9, 2019 at 10:50 PM Amit Kapila wrote: > > Thanks, Mithun for performance testing, it really helps us to choose > > the right strategy here. Once John provides next version, it would be > > good to see the results of regular pgbench (read-write) runs (say at > > 50 and 300 scale factor) and the results of large copy. I don't think > > there will be any problem, but we should just double check that. > > Attached is v12 using the alternating-page strategy. I've updated the > comments and README as needed. In addition, I've

Below are my performance tests and numbers.

Machine: cthulhu

Tests and setups

Server settings: max_connections = 200, shared_buffers = 8GB, checkpoint_timeout = 15min, maintenance_work_mem = 1GB, checkpoint_completion_target = 0.9, min_wal_size = 15GB and max_wal_size = 20GB.

pgbench settings: read-write settings (TPCB-like tests)
./pgbench -c $threads -j $threads -T $time_for_reading -M prepared postgres

scale factor 50 -- median of 3 TPS
clients   v12-patch      base patch     % diff
1         826.081588     834.328238     -0.9884179421
16        10805.807081   10800.662805   0.0476292621
32        19722.277019   19641.546628   0.4110185034
64        30232.681889   30263.616073   -0.1022157561

scale factor 300 -- median of 3 TPS
clients   v12-patch      base patch     % diff
1         813.646062     822.18648      -1.038744641
16        11379.028702   11277.05586    0.9042505709
32        21688.084093   21613.044463   0.3471960192
64        36288.85711    36348.6178     -0.1644098005

Copy command Test: setup
./psql -d postgres -c "COPY pgbench_accounts TO '/mnt/data-mag/mithun.cy/fsmbin/bin/dump.out' WITH csv"
./psql -d postgres -c "CREATE UNLOGGED TABLE pgbench_accounts_ulg (LIKE pgbench_accounts) WITH (fillfactor = 100);"

Test run:
TRUNCATE TABLE pgbench_accounts_ulg;
\timing
COPY pgbench_accounts_ulg FROM '/mnt/data-mag/mithun.cy/fsmbin/bin/dump.out' WITH csv;
\timing

execution time in ms (scale factor indicates size of pgbench_accounts)
scale factor   v12-patch   base patch   % diff
300            77166.407   77862.041    -0.8934186557
50             13329.233   13284.583    0.3361038882

So the large-table tests do not show any considerable performance variance from the base code!

On Fri, Jan 11, 2019 at 3:54 AM John Naylor wrote: > On Wed, Jan 9, 2019 at 10:50 PM Amit Kapila > wrote: > > Thanks, Mithun for performance testing, it really helps us to choose > > the right strategy here. Once John provides next version, it would be > > good to see the results of regular pgbench (read-write) runs (say at > > 50 and 300 scale factor) and the results of large copy. I don't think > > there will be any problem, but we should just double check that. > > Attached is v12 using the alternating-page strategy. I've updated the > comments and README as needed. In addition, I've > > -handled a possible stat() call failure during pg_upgrade > -added one more assertion > -moved the new README material into a separate paragraph > -added a comment to FSMClearLocalMap() about transaction abort > -corrected an outdated comment that erroneously referred to extension > rather than creation > -fleshed out the draft commit messages > -- Thanks and Regards Mithun Chicklore Yogendra EnterpriseDB: http://www.enterprisedb.com
RE: Libpq support to connect to standby server as priority
From: Dave Cramer [mailto:p...@fastcrypt.com] > The original desire should have been the ability to connect to a > primary or a standby. So, I think we should go back to the original thinking > (and not complicate the feature), and create a read only GUC_REPORT variable, > say, server_role, that identifies whether the server is a primary or a > standby. > > > > I'm confused as to how this would work. Who or what determines if the server > is a primary or standby? Overall, the server determines the server role (primary or standby) using the same mechanism as pg_is_in_recovery(), and sets the server_role GUC parameter. As the parameter is GUC_REPORT, the change is reported to the clients using the ParameterStatus ('S') message. The clients also get the value at connection. Regards Takayuki Tsunakawa
Re: Proving IS NOT NULL inference for ScalarArrayOpExpr's
James Coleman writes: > On Tue, Jan 15, 2019 at 3:53 PM Tom Lane wrote: >> I quite dislike taking the responsibility out of clause_is_strict_for >> and putting it in the callers. > The reason I moved it was that we're no longer just proving > strictness, so it seemed odd to put it in a function specifically > named to be about proving strictness. Well, as I said upthread, it seems like we need to think a bit more carefully about what it is that clause_is_strict_for is testing --- and if that ends up finding that some other name is more apposite, I'd not have any hesitation about renaming it. But we're really missing a bet if the ScalarArrayOp-specific check isn't inside that recursive check of an "atomic" expression's properties. The routines above there only recurse through things that act like AND or OR, but clause_is_strict_for is capable of descending through other stuff as long as it's strict in some sense. What we need to be clear about here is exactly what that sense is. regards, tom lane
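A minimal SQL illustration of the "strict in some sense" question: a ScalarArrayOpExpr is not strict in the usual NULL-in/NULL-out sense, because an empty array yields false rather than NULL, but it can never yield true for a NULL input, and that weaker property is what the proposed proof relies on.

  SELECT NULL::int = ANY (ARRAY[1, 2])  AS nonempty_case,   -- NULL
         NULL::int = ANY ('{}'::int[])  AS empty_case;      -- false, not NULL
  -- Neither case can be true, so "x = ANY(...)" evaluating to true implies x IS NOT NULL.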
Re: Libpq support to connect to standby server as priority
>> I'm confused as to how this would work. Who or what determines if the server >> is a primary or standby? > > Overall, the server determines the server role (primary or standby) using the > same mechanism as pg_is_in_recovery(), and sets the server_role GUC parameter. > As the parameter is GUC_REPORT, the change is reported to the clients using > the ParameterStatus ('S') message. The clients also get the value at > connection. But pg_is_in_recovery() returns true even for a promoting standby. So you have to wait and retry to send pg_is_in_recovery() until it finishes the promotion to find out it is now a primary. I am not sure if the backend ought to be responsible for this process. If not, libpq would need to handle it, but I doubt it would be possible. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp
Re: Proving IS NOT NULL inference for ScalarArrayOpExpr's
David Rowley writes: > On Wed, 16 Jan 2019 at 14:29, James Coleman wrote: >> Is your preference in this kind of case to comment the code and/or >> tests but stick with int4pl even though it doesn't demonstrate the >> "problem", or try to engineer a different test case such that the >> *_holds results in the tests don't seem to imply we could prove more >> things than we do? > I think using x+x or whatever is fine. I doubt there's a need to > invent some new function that returns NULL on non-NULL input. The > code you're adding has no idea about the difference between the two. > It has no way to know that anyway. Yeah, I never intended that the *_holds results would be "exact" in every test case. They're mostly there to catch egregious errors in test construction. Since the code we're actually testing doesn't get to see the input or output values of the test cases, it's not really useful to insist that the test cases exercise every possible combination of input/output values for the given expressions. regards, tom lane
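For completeness, the strictness property the x + x test case leans on is easy to check directly (a quick sketch, not part of the regression tests under discussion): int4pl returns NULL for a NULL input, which is exactly what clause_is_strict_for needs to know about the expression.

  SELECT (1 + NULL::int) IS NULL AS int4pl_is_strict;   -- true: a NULL input yields a NULL result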
Re: Bump up PG_CONTROL_VERSION on HEAD
Andres Freund writes: > On 2019-01-16 11:02:08 +0900, Michael Paquier wrote: >> f3db7f16 has proved that it can be a bad idea to run pg_resetwal on a >> data folder which does not match the version it has been compiled >> with. >> >> As of HEAD, PG_CONTROL_VERSION is still 1100: >> $ pg_controldata | grep "pg_control version" >> pg_control version number:1100 >> >> Wouldn't it be better to bump it up to 1200? > We don't commonly bump that without corresponding control version > changes. I don't see what we'd gain by the bump? Yeah, it has not been our practice to bump PG_CONTROL_VERSION unless the contents of pg_control actually change. The whole point of f3db7f16 was to ensure that we didn't have to do that just because of a major version change. regards, tom lane
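As an aside, the same number can also be read from a running server via pg_control_system(), for anyone who wants to cross-check what their cluster reports against the pg_controldata output quoted above:

  SELECT pg_control_version FROM pg_control_system();
  -- 1100 on current HEAD, matching the pg_controldata output shown above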
Re: pgbench doc fix
> On 2018-Dec-03, Tatsuo Ishii wrote: > >> To: >> --- >> -M querymode >> --protocol=querymode >> >> Protocol to use for submitting queries to the server: >> >> simple: use simple query protocol. >> >> extended: use extended query protocol. >> >> prepared: use extended query protocol with prepared statements. >> >> Because in "prepared" mode pgbench reuses the parse analysis >> result for the second and subsequent query iteration, pgbench runs >> faster in the prepared mode than in other modes. > > Looks good to me. Thanks. I'm going to commit this if there's no objection. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp
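A rough SQL-level analogue of what -M prepared buys (the account_balance statement name here is just for illustration): parse analysis happens once at PREPARE time and each EXECUTE reuses it, which is why the prepared mode is the fastest of the three.

  PREPARE account_balance (int) AS
      SELECT abalance FROM pgbench_accounts WHERE aid = $1;
  EXECUTE account_balance(1);
  EXECUTE account_balance(2);   -- reuses the parse analysis done at PREPARE time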
RE: Libpq support to connect to standby server as priority
From: Tatsuo Ishii [mailto:is...@sraoss.co.jp] > But pg_is_in_recovery() returns true even for a promoting standby. So > you have to wait and retry to send pg_is_in_recovery() until it > finishes the promotion to find out it is now a primary. I am not sure > if the backend ought to be responsible for this process. If not, libpq would > need to handle it, but I doubt it would be possible. Yes, the application needs to retry connection attempts until success. That's no different from PgJDBC and other DBMSs. Regards Takayuki Tsunakawa
Re: Query with high planning time at version 11.1 compared versions 10.5 and 11.0
Amit-san, (2019/01/15 10:46), Amit Langote wrote: On 2019/01/11 20:04, Etsuro Fujita wrote: (2019/01/11 13:46), Amit Langote wrote: However, what you proposed here as-is would not keep that behavior, because rel->part_scheme is created for those tables as well (even though there would be no need IIUC). Partition pruning uses part_scheme and pruning can occur even if a partitioned table is under UNION ALL, so it *is* needed in that case. Ah, you are right. Thanks for pointing that out! I think enable_partitionwise_join should only be checked in relnode.c or joinrels.c. Sorry, I don't understand this. What I was trying to say is that we should check the GUC close to where partitionwise join is actually implemented even though there is no such hard and fast rule. Or maybe I'm just a bit uncomfortable with setting consider_partitionwise_join based on the GUC. I didn't think so. Consider the consider_parallel flag. I think the way of setting it deviates from that rule already; it is set essentially based on a GUC and is set in set_base_rel_sizes() (ie, before implementing parallel paths). When adding the consider_partitionwise_join flag, I thought it would be a good idea to set consider_partitionwise_join in a similar way to consider_parallel, keeping build_simple_rel() simple. I've attached a patch to show what I mean. Can you please take a look? Thanks for the patch! Maybe I'm missing something, but I don't have a strong opinion about that change. I'd rather think to modify build_simple_rel so that it doesn't create rel->part_scheme if unnecessary (ie, partitioned tables contained in inheritance trees where the top parent is a UNION ALL subquery). As I said above, partition pruning can occur even if a partitioned table happens to be under UNION ALL. However, we *can* avoid creating part_scheme and setting other partitioning properties if *all* of enable_partition_pruning, enable_partitionwise_join, and enable_partitionwise_aggregate are turned off. Yeah, I think so. If you think that this patch is a good idea, then you'll need to explicitly set consider_partitionwise_join to false for a dummy partition rel in set_append_rel_size(), because the assumption of your patch that such partition's rel's consider_partitionwise_join would be false (as initialized with the current code) would be broken by my patch. But that might be a good thing to do anyway as it will document the special case usage of consider_partitionwise_join variable more explicitly, assuming you'll be adding a comment describing why it's being set to false explicitly. I'm not sure we need this as part of a fix for the issue reported on this thread. I don't object to what you proposed here, but that would be rather an improvement, so I think we should leave that for another patch. Sure, no problem with committing it separately if at all. OK Thanks! Best regards, Etsuro Fujita
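A small sketch of the UNION ALL point above, using hypothetical tables: even with partitionwise join disabled, plan-time pruning still needs part_scheme for a partitioned table referenced under UNION ALL.

  SET enable_partitionwise_join = off;
  CREATE TABLE meas (a int) PARTITION BY RANGE (a);
  CREATE TABLE meas_lo PARTITION OF meas FOR VALUES FROM (0) TO (100);
  CREATE TABLE meas_hi PARTITION OF meas FOR VALUES FROM (100) TO (200);
  EXPLAIN (COSTS OFF)
  SELECT * FROM (SELECT a FROM meas UNION ALL SELECT a FROM meas) u WHERE a = 42;
  -- Only meas_lo should be scanned in each UNION ALL arm, i.e. pruning applies
  -- here too, which is why part_scheme cannot be skipped based on the
  -- enable_partitionwise_join setting alone.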
Re: Query with high planning time at version 11.1 compared versions 10.5 and 11.0
(2019/01/15 11:42), Amit Langote wrote: On 2019/01/11 21:50, Etsuro Fujita wrote: (2019/01/10 10:41), Amit Langote wrote: That's a loaded meaning and abusing it to mean something else can be challenged, but we can live with that if properly documented. Speaking of which: /* used by partitionwise joins: */ boolconsider_partitionwise_join;/* consider partitionwise join * paths? (if partitioned rel) */ Maybe, mention here how it will be abused in back-branches for non-partitioned relations? I know we don't yet reach a consensus on what to do in details to address this issue, but for the above, how about adding comments like this to set_append_rel_size(), instead of the header file: /* * If we consider partitionwise joins with the parent rel, do the same * for partitioned child rels. * * Note: here we abuse the consider_partitionwise_join flag for child * rels that are not partitioned, to tell try_partitionwise_join() * that their targetlists and EC entries have been generated. */ if (rel->consider_partitionwise_join) childrel->consider_partitionwise_join = true; ISTM that that would be more clearer than the header file. Thanks for updating the patch. I tend to agree that it might be better to add such details here than in the header as it's better to keep the latter more stable. About the comment you added, I think we could clarify the note further as: Note: here we abuse the consider_partitionwise_join flag by setting it *even* for child rels that are not partitioned. In that case, we set it to tell try_partitionwise_join() that it doesn't need to generate their targetlists and EC entries as they have already been generated here, as opposed to the dummy child rels for which the flag is left set to false so that it will generate them. Maybe it's a bit wordy, but it helps get the intention across more clearly. I think that is well-worded, so +1 from me. Thanks again! Best regards, Etsuro Fujita
Re: Query with high planning time at version 11.1 compared versions 10.5 and 11.0
Hi Ashutosh, (2019/01/15 13:29), Ashutosh Bapat wrote: I think, there's something better possible. Two partitioned relations won't use partition-wise join, if their partition schemes do not match. Partitioned relations with same partitioning scheme share PartitionScheme pointer. PartitionScheme structure should get an extra counter, maintaining a count of number of partitioned relations sharing that structure. When this counter is 1, that relation is certainly not going to participate in PWJ and thus need not have all the structure required by PWJ set up. If we use this counter coupled with enable_partitionwise_join flag, we can get rid of consider_partitionwise_join flag altogether, I think. Interesting! That flag was introduced to disable PWJs when whole-row Vars are involved, as you know, so I think we need to first eliminate that limitation, to remove that flag. Thanks! Best regards, Etsuro Fujita
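To make that limitation concrete (reusing the hypothetical meas table from the sketch earlier in this section): per the explanation above, a join whose target list needs whole-row Vars of the partitioned relations is currently planned without partitionwise join even when the GUC is on, which is what consider_partitionwise_join guards against.

  SET enable_partitionwise_join = on;
  EXPLAIN (COSTS OFF)
  SELECT m1, m2 FROM meas m1 JOIN meas m2 USING (a);
  -- Whole-row Vars of m1/m2: expect a single join above Append nodes rather
  -- than per-partition joins, reflecting the limitation described above.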
Re: de-deduplicate code in DML execution hooks in postgres_fdw
(2019/01/07 20:26), Etsuro Fujita wrote: (2018/11/30 19:55), Etsuro Fujita wrote: One thing we would need to discuss more about this is the name of a new function added by the patch. IIRC, we have three options so far [1]: 1) perform_one_foreign_dml proposed by Ashutosh 2) execute_dml_single_row proposed by Michael 3) execute_parameterized_dml_stmt proposed by me I'll withdraw my proposal; I think #1 and #2 would describe the functionality better than #3. My vote would go to #2 or execute_dml_stmt_single_row, which moves the function much more closer to execute_dml_stmt. I'd like to hear the opinions of others. On second thought I'd like to propose another option: execute_foreign_modify because I think this would match the existing helper functions for updating foreign tables in postgres_fdw.c better, such as create_foreign_modify, prepare_foreign_modify and finish_foreign_modify. I don't really think the function name should contain "one" or "single_row". Like the names of the calling APIs (ie, ExecForeignInsert, ExecForeignUpdate and ExecForeignDelete), I think it's OK to omit such words from the function name. Here is an updated version of the patch. In addition to the naming, I tweaked the function a little bit to match other functions (mainly variable names), moved it to the place where the helper functions are defined, fiddled with some comments, and removed an extra include file that the original patch added. If there are no objections, I'll commit that version of the patch. Best regards, Etsuro Fujita
Re: Libpq support to connect to standby server as priority
> From: Tatsuo Ishii [mailto:is...@sraoss.co.jp] >> But pg_is_in_recovery() returns true even for a promoting standby. So >> you have to wait and retry to send pg_is_in_recovery() until it >> finishes the promotion to find out it is now a primary. I am not sure >> if the backend ought to be responsible for this process. If not, libpq would >> need to handle it, but I doubt it would be possible. > > Yes, the application needs to retry connection attempts until success. > That's no different from PgJDBC and other DBMSs. I don't know what PgJDBC is doing; however, I think libpq needs to do more than just retrying. 1) Try to find a node on which pg_is_in_recovery() returns false. If found, then we assume that is the primary. We also assume that other nodes are standbys. Done. 2) If there's no node on which pg_is_in_recovery() returns false, then we need to retry until we find one. To not retry forever, there should be a timeout counter parameter. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp
RE: Protect syscache from bloating with negative cache entries
From: Tom Lane [mailto:t...@sss.pgh.pa.us] > Certainly, what I've done here doesn't preclude adding some wider solution > to > the issue of extremely large catcaches. I'm relieved to hear that. > I think it takes the pressure off > for one rather narrow problem case, and the mechanism could be used to fix > other ones. But if you've got an application that just plain accesses a > huge number of objects, this isn't going to make your life better. I understand you're trying to solve the problem caused by negative cache entries as soon as possible, because the user is really suffering from it. I feel sympathy with that attitude, because you seem to be always addressing issues that others are reluctant to take. That's one of the reasons I respect you. > Well, we *had* an LRU mechanism for the catcaches way back when. We got > rid of it --- see commit 8b9bc234a --- because (a) maintaining the LRU > info was expensive and (b) performance fell off a cliff in scenarios where > the cache size limit was exceeded. You could probably find some more info > about that by scanning the mail list archives from around the time of that > commit, but I'm too lazy to do so right now. Oh, in 2006... I'll examine the patch and the discussion to see how the LRU management was done. > That was a dozen years ago, and it's possible that machine performance > has moved so much since then that the problems are gone or mitigated. I really, really hope so. Even if we see some visible impact by the LRU management, I think that's the debt PostgreSQL had to pay for but doesn't now. Even the single-process MySQL, which doesn't suffer from cache bloat for many server processes, has the ability to limit the cache. And PostgreSQL has many parameters for various memory components such as shared_buffers, wal_buffers, work_mem, etc, so it would be reasonable to also have the limit for the catalog caches. That said, we can avoid the penalty and retain the current performance by disabling the limit (some_size_param = 0). I think we'll evaluate the impact of LRU management by adding prev and next members to catcache and relcache structures, and putting the entry at the front (or back) of the LRU chain every time the entry is obtained. I think pgbench's select-only mode is enough for evaluation. I'd like to hear if any other workload is more appropriate to see the CPU cache effect. > In particular I'm sure that any limit we would want to impose today will > be far more than the 5000-entries-across-all-caches limit that was in use > back then. But I'm not convinced that a workload that would create 100K > cache entries in the first place wouldn't have severe problems if you > tried to constrain it to use only 80K entries. I fear it's just wishful > thinking to imagine that the behavior of a larger cache won't be just > like a smaller one. Also, IIRC some of the problem with the LRU code > was that it resulted in lots of touches of unrelated data, leading to > CPU cache miss problems. It's hard to see how that doesn't get even > worse with a bigger cache. > > As far as the relcache goes, we've never had a limit on that, but there > are enough routine causes of relcache flushes --- autovacuum for instance > --- that I'm not really convinced relcache bloat can be a big problem in > production. As Andres and Robert mentioned, we want to free less frequently used cache entries. Otherwise, we're now suffering from the bloat to TBs of memory. This is a real, not hypothetical issue... 
> The plancache has never had a limit either, which is a design choice that > was strongly influenced by our experience with catcaches. Again, I'm > concerned about the costs of adding a management layer, and the likelihood > that cache flushes will simply remove entries we'll soon have to rebuild. Fortunately, we're not bothered with the plan cache. But I remember you said you were annoyed by PL/pgSQL's plan cache use at Salesforce. Were you able to overcome it somehow? > Oracle seldom impresses me as having designs we ought to follow. > They have a well-earned reputation for requiring a lot of expertise to > operate, which is not the direction this project should be going in. > In particular, I don't want to "solve" cache size issues by exposing > a bunch of knobs that most users won't know how to twiddle. Oracle certainly seems to be difficult to use. But they seem to be studying other DBMSs to make it simpler to use. I'm sure they also have a lot we should learn from, and the cache limit is one of them (although MySQL's per-cache tuning may be better). And having limits for various components would be the first step toward the autonomous database: tunable -> auto tuning -> autonomous. Regards Takayuki Tsunakawa
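For anyone who wants to reproduce the negative-entry side of this, a quick sketch (assuming name lookups of nonexistent relations leave negative syscache entries behind, as described earlier in this thread):

  -- Each distinct miss in the relation-name lookup leaves a negative syscache
  -- entry in this backend, and nothing evicts those entries today.
  SELECT count(to_regclass(format('no_such_table_%s', i)))
  FROM generate_series(1, 100000) AS i;
  -- The backend's memory footprint grows and stays grown afterwards; this is
  -- the bloat a size limit / LRU eviction is meant to bound.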
Re: de-deduplicate code in DML execution hooks in postgres_fdw
On Wed, Jan 16, 2019 at 02:59:15PM +0900, Etsuro Fujita wrote: > (2019/01/07 20:26), Etsuro Fujita wrote: >> On second thought I'd like to propose another option: >> execute_foreign_modify because I think this would match the existing >> helper functions for updating foreign tables in postgres_fdw.c better, >> such as create_foreign_modify, prepare_foreign_modify and >> finish_foreign_modify. I don't really think the function name should >> contain "one" or "single_row". Like the names of the calling APIs (ie, >> ExecForeignInsert, ExecForeignUpdate and ExecForeignDelete), I think >> it's OK to omit such words from the function name. Here is an updated >> version of the patch. In addition to the naming, I tweaked the function >> a little bit to match other functions (mainly variable names), moved it >> to the place where the helper functions are defined, fiddled with some >> comments, and removed an extra include file that the original patch added. > > If there are no objections, I'll commit that version of the patch. I think that you could use PgFdwModifyState for the second argument of execute_foreign_modify instead of ResultRelInfo. Except for that nit, it looks fine to me, thanks for taking care of it. -- Michael
Re: using expression syntax for partition bounds
Thanks for the review. On 2019/01/15 22:24, Peter Eisentraut wrote: > On 15/01/2019 07:31, Amit Langote wrote: >>> Is "partition bound" the right term? For list partitioning, it's not >>> really a bound. Maybe it's OK. >> >> Hmm, maybe "partition bound value"? Just want to stress that the >> expression specifies "bounding" value of a partition. > > I was more concerned about the term "bound", which it is not range > partitioning. But I can't think of a better term. > > I wouldn't change expr to value as you have done between v7 and v8, > since the point of this patch is to make expressions valid where > previously only values were. OK, will change it back to partition_bound_expr. Removing "bound" from it makes the term ambiguous? >>> I don't see any treatment of non-immutable expressions. There is a test >>> case with a non-immutable cast that was removed, but that's all. There >>> was considerable discussion earlier in the thread on whether >>> non-immutable expressions should be allowed. I'm not sure what the >>> resolution was, but in any case there should be documentation and tests >>> of the outcome. >> >> The partition bound expression is evaluated only once, that is, when the >> partition creation command is executed, and what gets stored in the >> catalog is a Const expression that contains the value that was computed, >> not the original non-immutable expression. So, we don't need to do >> anything special in terms of code to handle users possibly specifying a >> non-immutable expression as partition bound. Tom said something in that >> thread which seems to support such a position: >> >> https://www.postgresql.org/message-id/22534.1523374457%40sss.pgh.pa.us > > OK, if we are going with that approach, then it needs to be documented > and there should be test cases. How about the following note in the documentation: + Although volatile expressions such as + CURRENT_TIMESTAMP can be used + for this, be careful when using them, because + PostgreSQL will evaluate them only once + when adding the partition. Sorry but I'm not sure how or what I would test about this. Maybe, just add a test in create_table.sql/alter_table.sql that shows that using volatile expression doesn't cause an error? >>> The collation handling might need another look. The following is >>> allowed without any diagnostic: >>> >>> CREATE TABLE t2 ( >>> a text COLLATE "POSIX" >>> ) PARTITION BY RANGE (a); >>> CREATE TABLE t2a PARTITION OF t2 FOR VALUES FROM ('foo' COLLATE "C") TO >>> ('xyz'); >>> >>> I think the correct behavior would be that an explicit collation in the >>> partition bound expression is an error. > >> But that's essentially ignoring the collation specified by the user for >> the partition bound value without providing any information about that to >> the user. I've fixed that by explicitly checking if the collate clause in >> the partition bound value expression contains the same collation as the >> parent partition key collation and give an error otherwise. > > I think that needs more refinement. In v8, the following errors > > CREATE TABLE t2 ( a name COLLATE "POSIX" ) PARTITION BY RANGE (a); > CREATE TABLE t2a PARTITION OF t2 FOR VALUES FROM (name 'foo') TO (name > 'xyz'); > ERROR: collation of partition bound value for column "a" does not match > partition key collation "POSIX" > > The problem here is that the "name" type has a collation that is neither > the one of the column nor the default collation. We can allow that. 
So, should the "name" type's collation should simply be discarded in favor of "POSIX" that's being used for partitioning? > What we don't want is someone writing an explicit COLLATE clause. I > think we just need to check that there is no top-level COLLATE clause. > This would then sort of match the logic parse_collate.c for combining > collations. Maybe I'm missing something, but isn't it OK to allow the COLLATE clause as long as it specifies the matching collation as the parent? So: CREATE TABLE t2b PARTITION OF t2 FOR VALUES FROM (name 'bar' collate "C") TO (name 'foo'); ERROR: collation of partition bound value for column "a" does not match partition key collation "POSIX" -- either of the following is ok CREATE TABLE t2b PARTITION OF t2 FOR VALUES FROM (name 'bar' collate "POSIX") TO (name 'foo'); CREATE TABLE t2b PARTITION OF t2 FOR VALUES FROM (name 'bar') TO (name 'foo'); Thanks, Amit