Re: New GUC to sample log queries

2018-11-18 Thread Adrien Nayrat
On 11/18/18 12:47 AM, Dmitry Dolgov wrote:
> This patch went through the last few commitfests without any activity, but cfbot
> says it still applies and doesn't break any tests. From what I see there was
> some agreement about naming, so the patch indeed needs more review. Could
> anyone from the reviewers (Vik?) confirm that it's in good shape?

Thanks, Dmitry, for bringing this up.

For me, the question is how to name this GUC.
Is "log_sample_rate" enough? I am not sure, because it only applies to
queries logged by log_min_duration_statement.

Also, I wonder if we should use the same logic for other parameters, such as:
log_statement_stats
log_parser_stats
log_planner_stats
log_executor_stats

It was mentioned in this thread
https://www.postgresql.org/message-id/20180710183828.GB3890%40telsasoft.com
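
For illustration, this is how the two settings would combine, assuming the
name "log_sample_rate" is kept (a sketch of the proposed behavior, not a
released feature):

  -- log only ~10% of the statements exceeding the duration threshold
  ALTER SYSTEM SET log_min_duration_statement = '100ms';
  ALTER SYSTEM SET log_sample_rate = 0.1;  -- proposed GUC
  SELECT pg_reload_conf();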






Re: zheap: a new storage format for PostgreSQL

2018-11-18 Thread Komяpa
On Sat, Nov 17, 2018 at 8:51 AM Adam Brusselback 
wrote:

> >  I don't know how much what I write on this thread is read by others or
> how useful this is for others who are following this work
>
> I've been following this thread and many others like it, silently soaking
> it up, because I don't feel like i'd have anything useful to add in most
> cases. It is very interesting seeing the development take place though, so
> just know it's appreciated at least from my perspective.
>


I'm also following the development and have hopes about it going forward.
Not many low-level details I can comment on, though :)

In PostGIS workloads, UPDATE table SET geom = ST_CostyFunction(geom,
magicnumber); is one of the biggest time-eaters, happening upon initial load
and cleanup of your data. It is commonly followed by CLUSTER table USING
table_geom_idx; to make sure you're back at full speed and no VACUUM is
needed, and your table (usually static after that) is more-or-less
spatially ordered. I see that zheap can remove the need for VACUUM, which
is a big win already. If you can do something that allows reordering the
tuples according to an index during an UPDATE that rewrites most of the
table, that would be a game changer :)
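
Spelled out, the pattern is (ST_CostyFunction standing in for any expensive
geometry-rewriting function; names illustrative):

  UPDATE t SET geom = ST_CostyFunction(geom, magicnumber);
  CLUSTER t USING t_geom_idx;  -- rewrite in index order, no VACUUM needed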

Another story is the Visibility Map and Index-Only Scans. Right now there is a
huge gap between the insertion of rows and the moment they become available for
an index-only scan, since a VACUUM is required. Do I understand correctly that
for zheap this can all be inverted, and UNDO can become an "invisibility map"
that may be quite small and discarded quickly?




-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: [HACKERS] pgbench - allow to store select results into variables

2018-11-18 Thread Fabien COELHO


Hello Alvaro,


I think this patch's Command->lines would benefit from using PQExpBuffer
(or maybe StringInfo?) for the command string instead of open-coding
string manipulation and allocation.


[...]


Ok.


I'm not sure that Command->first_line is really all that useful.  It
seems we go to a lot of trouble to keep it up to date.  Isn't it easier
to chop Command->lines at the first newline when it is needed?


Hmmm, it is needed quite often (about 12 times) to report errors, that would
mean having to handle the truncation in many places, so I felt it was worth
the trouble.


Ok, as long as we don't repeat the work during script execution.


Sure, the point of first_line is that it is computed once at parse time.

Attached a v23 with PQExpBuffer for managing lines.

I've also added a function to compute the summary first line, which 
handles carriage returns.


--
Fabien.

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index b5e3a62a33..246944ea79 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -954,6 +954,87 @@ pgbench  options  d
   
 
   
+   
+
+ \cset [prefix]
+
+
+
+ 
+  This command may be used to end SQL queries, replacing an embedded
+  semicolon (\;) within a compound SQL command.
+ 
+
+ 
+  When this command is used, the preceding SQL query is expected to
+  return one row, the columns of which are stored into variables named after
+  column names, and prefixed with prefix if provided.
+ 
+
+ 
+  The following example sends three queries as one compound SQL command,
+  inducing one message sent at the protocol level.
+  The result of the second query is stored into variable two,
+  whereas the results of the other queries are discarded.
+
+-- compound of 3 queries
+SELECT 1 AS one \; SELECT 2 AS two \cset
+SELECT 2;
+
+ 
+
+ 
+  
+\cset does not work when empty SQL queries appear
+within a compound SQL command.
+  
+ 
+
+   
+
+   
+
+ \gset [prefix]
+
+
+
+ 
+  This command may be used to end SQL queries, replacing a final semicolon
+  (;). 
+ 
+
+ 
+  When this command is used, the preceding SQL query is expected to
+  return one row, the columns of which are stored into variables named after
+  column names, and prefixed with prefix if provided.
+ 
+
+ 
+  The following example puts the final account balance from the first query
+  into variable abalance, and fills variables
+  p_two and p_three 
+  with integers from the last query.
+  The result of the second query is discarded.
+
+UPDATE pgbench_accounts
+  SET abalance = abalance + :delta
+  WHERE aid = :aid
+  RETURNING abalance \gset
+-- compound of two queries
+SELECT 1 \;
+SELECT 2 AS two, 3 AS three \gset p_
+
+ 
+
+ 
+  
+\gset does not work when empty SQL queries appear
+within a compound SQL command.
+  
+ 
+
+   
+

 \if expression
 \elif expression
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 73d3de0677..60d4c95be7 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -490,12 +490,15 @@ static const char *QUERYMODE[] = {"simple", "extended", "prepared"};
 
 typedef struct
 {
-	char	   *line;			/* text of command line */
+	char	   *first_line;		/* first line for short display */
+	PQExpBufferData	lines;		/* full multi-line text of command */
 	int			command_num;	/* unique index of this Command struct */
 	int			type;			/* command type (SQL_COMMAND or META_COMMAND) */
 	MetaCommand meta;			/* meta command identifier, or META_NONE */
 	int			argc;			/* number of command words */
 	char	   *argv[MAX_ARGS]; /* command word list */
+	int			compound;   /* last compound command (number of \;) */
+	char	  **gset;   /* per-compound command prefix */
 	PgBenchExpr *expr;			/* parsed expression, if needed */
 	SimpleStats stats;			/* time spent in this command */
 } Command;
@@ -1718,6 +1721,107 @@ valueTruth(PgBenchValue *pval)
 	}
 }
 
+/* read all responses from backend, storing into variable or discarding */
+static bool
+read_response(CState *st, char **gset)
+{
+	PGresult   *res;
+	int			compound = 0;
+
+	while ((res = PQgetResult(st->con)) != NULL)
+	{
+		switch (PQresultStatus(res))
+		{
+			case PGRES_COMMAND_OK: /* non-SELECT commands */
+			case PGRES_EMPTY_QUERY: /* may be used for testing no-op overhead */
+				if (gset[compound] != NULL)
+				{
+					fprintf(stderr,
+							"client %d file %d command %d compound %d: "
+							"\\gset/cset expects a row\n",
+							st->id, st->use_file, st->command, compound);
+					st->ecnt++;
+					return false;
+				}
+				break;			/* OK */
+
+			case PGRES_TUPLES_OK:
+				if (gset[compound] != NULL)
+				{
+					/* store result into variables if required */
+					int			ntuples = PQntuples(res),
+								nfields = PQnfields(res),
+								f;
+

Re: New GUC to sample log queries

2018-11-18 Thread Dmitry Dolgov
> On Sun, 18 Nov 2018 at 10:52, Adrien Nayrat 
> wrote:
>
> For me, the question is how to name this GUC. Is "log_sample_rate" enough?
> I am not sure, because it only applies to queries logged by
> log_min_duration_statement.

Since it's hard to come up with a concise name that mentions the sampling rate
in the context of log_min_duration_statement, I think it's fine to name this
configuration "log_sample_rate", as long as its dependency on
log_min_duration_statement is clearly explained in the documentation.



[PATCH] Allow UNLISTEN during recovery

2018-11-18 Thread Shay Rojansky
Hi all,

Here is a tiny patch removing PreventCommandDuringRecovery() for UNLISTEN.
See previous discussion in
https://www.postgresql.org/message-id/CADT4RqBweu7QKRYAYzeRW77b%2BMhJdUikNe45m%2BfL4GJSq_u2Fg%40mail.gmail.com
.

In a nutshell, this prevents an error from being raised when UNLISTEN is issued
during recovery. The operation is a no-op (since LISTEN is still
disallowed). The logic here is that some clients (namely Npgsql) issue
UNLISTEN * to clear connection state (in the connection pool), but this
needlessly breaks when the backend is in recovery.
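
Concretely, the intended change in behavior on a hot standby is (a sketch;
the second half is what the patch proposes, not current behavior):

  UNLISTEN *;
  ERROR:  cannot execute UNLISTEN during recovery   -- before the patch
  -- after the patch the same command succeeds as a no-op: since LISTEN
  -- is still disallowed during recovery, there is nothing to unlisten.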

On a related note, there currently doesn't seem to be a good way for
clients to know whether the backend is in recovery. As a backend can come
out of recovery at any point, perhaps an asynchronous ParameterStatus
announcing this state change could be useful.

Hopefully this also qualifies for backporting to earlier version branches.

Shay
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 970c94ee80..3efd262cb8 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -629,7 +629,7 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 			{
 UnlistenStmt *stmt = (UnlistenStmt *) parsetree;
 
-PreventCommandDuringRecovery("UNLISTEN");
+/* allow UNLISTEN during recovery, which is a noop */
 CheckRestrictedOperation("UNLISTEN");
 if (stmt->conditionname)
 	Async_Unlisten(stmt->conditionname);


Re: _isnan() on Windows

2018-11-18 Thread Andrew Dunstan



On 11/17/18 2:46 PM, Andres Freund wrote:
> Hi,
>
> On 2018-07-12 11:28:46 -0400, Andrew Dunstan wrote:
>> On 07/12/2018 10:38 AM, Tom Lane wrote:
>>> Andrew Dunstan  writes:
>>>> On 07/12/2018 10:20 AM, Tom Lane wrote:
>>>>> bowerbird and hamerkop have some gripes like this:
>>>>>
>>>>> bowerbird | c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : macro
>>>>> redefinition (src/pl/plperl/SPI.c)
>>>>> [G:\prog\bf\root\HEAD\pgsql.build\plperl.vcxproj]
>>>>
>>>> We actually undef a bunch of things in plperl.h to keep the compiler
>>>> quiet. Maybe we need to add this to the list?
>>>
>>> Perhaps.  But how do we tell the platforms where we should do that
>>> from the ones where we shouldn't?
>>
>> In the _MSC_VER section:
>>
>> #ifdef isnan
>> #undef isnan
>> #endif
>>
>> By inspection the perl header is just defining it to _isnan, for every MSC
>> version.
>
> Let's try undefining it before plperl.h then?  It'd be nice to get rid
> of those warnings...




done.


cheers


andrew

--
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pg_dumpall --exclude-database option

2018-11-18 Thread Andrew Dunstan



On 11/17/18 9:55 AM, Alvaro Herrera wrote:
> The comment in expand_dbname_patterns is ungrammatical and mentions
> "OID" rather than "name", so I suggest
>
> /*
>  * The loop below might sometimes result in duplicate entries in the
>  * output name list, but we don't care.
>  */

Will fix.

> I'm not sure this is grammatical either:
>     exclude databases whose name matches PATTERN
> I would have written it like this:
>     exclude databases whose names match PATTERN
> but I'm not sure (each database has only one name, of course, but aren't
> you talking about multiple databases there?)

I think the original is grammatical.

> Other than that, the patch seems fine to me -- I tested and it works as
> intended.
>
> Personally I would say "See also expand_table_name_patterns" instead of
> "This is similar to code in pg_dump.c for handling matching table names.",
> as well as mention this function in expand_table_name_patterns' comment.
> (No need to mention expand_schema_name_patterns, since these are
> adjacent.)  But this is mostly stylistic and left to your own judgement.
>
> In the long run, I think we should add an option to processSQLNamePattern
> to use OR instead of AND, which would fix both this problem as well as
> pg_dump's.  I don't think that's important enough to stall this patch.

Thanks for the review.


cheers


andrew

--
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: doc - improve description of default privileges

2018-11-18 Thread Fabien COELHO


Hello Tom,

Thanks for this precise feedback.


> Progress on this patch seems to be blocked on the question of whether
> we want to keep enlarging the amount of psql-specific information
> in the GRANT reference page, or move that all somewhere else.

Yep.

> FWIW, I think I agree with Peter's position that moving it somewhere
> else is the better option.  Section 5.6 "Privileges" seems like a
> reasonable choice.

Ok.

> * Perhaps we could fix Peter's complaint about the "Owner" column by
> relabeling it "All Privileges".

Ok.

> I'd be inclined to label the last column "Default PUBLIC Privileges",
> too, if we can fit that in.

Ok.

> * The phrase "relation-like objects" seems way too vague, especially since
> one has to read it as excluding sequences, which surely are relations for
> most purposes.  Is there a good reason not to just leave that entry as
> "TABLE", full stop?  Or maybe it could be "TABLE, VIEW, etc" or some such.

Ok.

> * I don't think the use of "hardcoded" adds anything.

Hmmm. As "default privileges" can be altered, the point is to describe the
"default default privileges", but that looks absurd, hence the search for
some wording conveying the idea that there is another level of defaults.
ISTM that removing "hardcoded" without replacing it makes the thing
slightly ambiguous.

No big deal.

> * Is it worth adding another table matching privilege names ("INSERT")
> with their aclitem letters ("a"), rather than having the semi-formal
> format currently appearing in grant.sgml?

Indeed, I thought about that, because the description is not easy to read.

> There's also some related material in 9.25 with the aclitem functions;
> it'd be worth unifying that too maybe.

I've put a reference to it at least.
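
As a concrete illustration of the one-letter display the new table documents
(output written by hand, role name illustrative):

  CREATE TABLE t (x int);
  GRANT SELECT, INSERT ON t TO PUBLIC;
  SELECT relacl FROM pg_class WHERE relname = 't';
  -- {alice=arwdDxt/alice,=ar/alice}
  -- owner "alice" has all table privileges; PUBLIC has r (SELECT) and a (INSERT)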

Attached v4:
 - moves the table to the privileges section
 - updates the table column headers
 - adds a privilege/aclitem letter mapping table
 - adds some appropriate links towards psql & aclitem

--
Fabien.

diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index c8268222af..5fbaacc214 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -1509,6 +1509,192 @@ REVOKE ALL ON accounts FROM PUBLIC;
privilege.  For details see the  and
 reference pages.
   
+
+  
+shows how privileges are displayed
+   by aclitem functions described
+   in .
+summarizes the
+   default privileges granted to each object type, with
+   the associated psql backslash commands.
+  
+
+  
+   Privileges' one-letter display
+   
+
+ 
+  PRIVILEGE
+  Letter
+  Objects
+ 
+
+
+ 
+  SELECT
+  r (Read)
+  
+   LARGE OBJECT,
+   TABLE, VIEW...
+  
+ 
+ 
+  INSERT
+  a (Append)
+  TABLE, VIEW...
+ 
+ 
+  UPDATE
+  w (Write)
+  
+   LARGE OBJECT,
+   TABLE, VIEW...
+  
+ 
+ 
+  DELETE
+  d
+  TABLE, VIEW...
+ 
+ 
+  TRUNCATE
+  D (Delete)
+  TABLE, VIEW...
+ 
+ 
+  REFERENCES
+  x
+  TABLE
+ 
+ 
+  TRIGGER
+  t
+  TABLE, VIEW...
+ 
+ 
+  CREATE
+  C
+  
+   DATABASE, SCHEMA,
+   TABLESPACE
+  
+ 
+ 
+  CONNECT
+  c
+  DATABASE
+ 
+ 
+  TEMPORARY
+  T
+  DATABASE
+ 
+ 
+  EXECUTE
+  X (eXecute)
+  FUNCTION, PROCEDURE
+ 
+ 
+  USAGE
+  U
+  
+   DOMAIN, FOREIGN ...,
+   LANGUAGE, SCHEMA,
+   SEQUENCE, TYPE
+  
+ 
+ 
+   
+  
+
+  
+   Default access privileges per object type, as shown by psql
+   
+
+ 
+  Object type
+  psql \-command
+  All Privileges
+  Default PUBLIC Privileges
+  
+ 
+ 
+  
+   DATABASE
+   \l
+   CTc
+   Tc
+  
+  
+   DOMAIN
+   \dD+
+   U
+   U
+  
+  
+   FUNCTION or PROCEDURE
+   \df+
+   X
+   X
+  
+  
+   FOREIGN DATA WRAPPER
+   \dew+
+   U
+   
+  
+  
+   FOREIGN SERVER
+   \des+
+   U
+   
+  
+  
+   LANGUAGE
+   \dL+
+   U
+   U
+  
+  
+   LARGE OBJECT
+   
+   rw
+   
+  
+  
+   SCHEMA
+   \dn+
+   UC
+   
+  
+  
+  SEQUENCE
+  \dp
+  rwU
+  
+ 
+  
+   TABLE, VIEW...
+   \dp
+   arwdDxt
+   
+  
+  
+   TABLESPACE
+   \db+
+   C
+   
+  
+  
+   TYPE
+   \dT+
+   U
+   U
+  
+ 
+
+   
+
  
 
  
diff --git a/doc/src/sgml/ref/grant.sgml b/doc/src/sgml/ref/grant.sgml
index ff64c7a3ba..bf643bfe92 100644
--- a/doc/src/sgml/ref/grant.sgml
+++ b/doc/src/sgml/ref/grant.sgml
@@ -533,7 +533,8 @@ GRANT role_name [, ...] TO 
-The entries shown by \dp are interpreted thus:
+The entries shown by psql backslash-commands,
+like \dp, are interpreted thus:
 
 rolename= -- privileges granted to a role
= -- privileges granted to PUBLIC

Re: _isnan() on Windows

2018-11-18 Thread Andres Freund
Hi,

On 2018-11-18 12:46:14 -0500, Andrew Dunstan wrote:
> On 11/17/18 2:46 PM, Andres Freund wrote:
> > On 2018-07-12 11:28:46 -0400, Andrew Dunstan wrote:
> > > By inspection the perl header is just defining it to _isnan, for every MSC
> > > version.
> > Let's try undefining it before plperl.h then?  It'd be nice to get rid
> > of those warnings...

> done.

Thanks!

Greetings,

Andres Freund



docs should mention that max_wal_size default depends on WAL segment size

2018-11-18 Thread Tomas Vondra
Hi,

while investigating something on a cluster with a non-default WAL
segment (say 256MB), I've noticed a somewhat surprising behavior of
max_wal_size default. While the docs claim the default is 1GB, the
actual default depends on the WAL segment size.

For example with the 256MB WAL segments, you end up with this:

  test=# show max_wal_size ;
   max_wal_size
  --------------
   16GB
  (1 row)

This behavior is not entirely new - I've noticed it on 10, before the
WAL segment size was moved to initdb (which made it more likely to be
used). It's even more surprising there, because it leaves

#max_wal_size = 1GB

in the sample config, while fc49e24f at least emits the actual value.

But I'd say not mentioning this behavior in the docs is a bug.
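
For what it's worth, the numbers are consistent with the default being
defined as 64 WAL segments rather than as a byte size (my reading of the
code, to be double-checked):

  64 segments * 16 MB (default segment size) =  1 GB
  64 segments * 256 MB                       = 16 GB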


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Fixing AC_CHECK_DECLS to do the right thing with clang

2018-11-18 Thread Andres Freund
Hi,

On 2018-11-17 23:32:47 -0500, Tom Lane wrote:
> We've seen repeated complaints about bogus build warnings when using
> "clang": it complains that strlcpy and some related library functions
> haven't been declared.  Several of the buildfarm animals exhibit such
> warnings, for instance.  That's because Autoconf's AC_CHECK_DECLS macro
> fails to cope with the fact that clang only generates a warning, not
> an error, for the test case that that macro uses.  Noah fixed this
> in upstream autoconf several years ago:
> 
> http://git.savannah.gnu.org/cgit/autoconf.git/commit/?id=82ef7805faffa151e724aa76c245ec590d174580
> 
> However, I'm beginning to despair of the Autoconf crowd ever putting
> out an official new release.  Hence, I propose to apply and back-patch
> the attached, which essentially just imports Noah's fix into our
> configure script.  I've verified that this does the right thing with
> Fedora 28's version of clang (clang version 6.0.1).

Seems like a good plan. The problem doesn't reproduce for me on Debian
(using any version of clang), so all I can report is that a patched
build still works as it should.

- Andres



Re: docs should mention that max_wal_size default depends on WAL segment size

2018-11-18 Thread Andres Freund
Hi,

On 2018-11-18 22:16:12 +0100, Tomas Vondra wrote:
> while investigating something on a cluster with a non-default WAL
> segment (say 256MB), I've noticed a somewhat surprising behavior of
> max_wal_size default. While the docs claim the default is 1GB, the
> actual default depends on the WAL segment size.
> 
> For example with the 256MB WAL segments, you end up with this:
> 
>   test=# show max_wal_size ;
>    max_wal_size
>   --------------
>    16GB
>   (1 row)
> 
> This behavior is not entirely new - I've noticed it on 10, before the
> WAL segment size was moved to initdb (which made it more likely to be
> used). It's even more surprising there, because it leaves
> 
> #max_wal_size = 1GB
> 
> in the sample config, while fc49e24f at least emits the actual value.
> 
> But I'd say not mentioning this behavior in the docs is a bug.

Hm, you're not wrong there. Wonder if it'd be better to make it so that
the default actually has the effect of being 1GB - I think that ought to
be doable?

Greetings,

Andres Freund



Re: [HACKERS] WIP: Aggregation push-down

2018-11-18 Thread Tom Lane
Antonin Houska  writes:
> [ agg_pushdown_v8.tgz ]

I spent a few hours looking at this today.  It seems to me that at no
point has there been a clear explanation of what the patch is trying to
accomplish, so let me see whether I've got it straight or not.  (I suggest
that something like this ought to be included in optimizer/README; the
patch's lack of internal documentation is a serious deficiency.)

--

Conceptually, we'd like to be able to push aggregation down below joins
when that yields a win, which it could do by reducing the number of rows
that have to be processed at the join stage.  Thus consider

SELECT agg(a.x) FROM a, b WHERE a.y = b.z;

We can't simply aggregate during the scan of "a", because some values of
"x" might not appear at all in the input of the naively-computed aggregate
(if their rows have no join partner in "b"), or might appear multiple
times (if their rows have multiple join partners).  So at first glance,
aggregating before the join is impossible.  The key insight of the patch
is that we can make some progress by considering only grouped aggregation:
if we can group "a" into sets of rows that must all have the same join
partners, then we can do preliminary aggregation within each such group,
and take care of the number-of-repetitions problem when we join.  If the
groups are large then this reduces the number of rows processed by the
join, at the cost that we might spend time computing the aggregate for
some row sets that will just be discarded by the join.  So now we consider

SELECT agg(a.x) FROM a, b WHERE a.y = b.z GROUP BY ?;

and ask what properties the grouping column(s) "?" must have to make this
work.  I believe we can say that it's OK to push down through a join if
its join clauses are all of the form "a.y = b.z", where either a.y or b.z
is listed as a GROUP BY column *and* the join operator is equality for
the btree opfamily specified by the SortGroupClause.  (Note: actually,
SortGroupClause per se mentions an equality operator, although I think the
planner quickly reduces that to an opfamily specification.)  The concerns
Robert had about equality semantics are certainly vital, but I believe
that the logic goes through correctly as long as the grouping equality
operator and the join equality operator belong to the same opfamily.

Case 1: the GROUP BY column is a.y.  Two rows of "a" whose y values are
equal per the grouping operator must join to exactly the same set of "b"
rows, else transitivity is failing.

Case 2: the GROUP BY column is b.z.  It still works, because the set of
"a" rows that are equal to a given z value is well-defined, and those
rows must join to exactly the "b" rows whose z entries are equal to
the given value, else transitivity is failing.

Robert postulated cases like "citext = text", but that doesn't apply
here because no cross-type citext = text operator could be part of a
well-behaved opfamily.  What we'd actually be looking at is either
"citextvar::text texteq textvar" or "citextvar citexteq textvar::citext",
and the casts prevent these expressions from matching GROUP BY entries
that have the wrong semantics.

However: what we have proven here is only that we can aggregate across
a set of rows that must share the same join partners.  We still have
to be able to handle the case that the rowset has more than one join
partner, which AFAICS means that we must use partial aggregation and
then apply an "aggmultifn" (or else apply the aggcombinefn N times).
We can avoid that and use plain aggregation when we can prove the "b"
side of the join is unique, so that no sets of rows will have to be merged
post-join; but ISTM that that reduces the set of use cases to be too small
to be worth such a complex patch.  So I'm really doubtful that we should
proceed forward with only that case available.

Also, Tomas complained in the earlier thread that he didn't think
grouping on the join column was a very common use-case in the first
place.  I share that concern, but I think we could extend the logic
to the case that Tomas posited as being useful:

SELECT agg(a.x) FROM a, b WHERE a.y = b.id GROUP BY b.z;

where the join column b.id is unique.  If we group on a.y (using semantics
compatible with the join operator and the uniqueness constraint), then all
"a" rows in a given group will join to exactly one "b" row that
necessarily has exactly one grouping value, so this group can safely be
aggregated together.  We might need to combine it post-join with other "b"
rows that have equal "z" values, but we can do that as long as we're okay
with partial aggregation.  I think this example shows why the idea is far
more powerful with partial aggregation than without.
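
To make the two-phase shape concrete, here is a hand-written SQL sketch of
what the planner would effectively compute for that example (illustrative
only; partial_agg/combine_agg stand for the partial and final phases of
agg, not real functions):

  -- original: SELECT agg(a.x) FROM a, b WHERE a.y = b.id GROUP BY b.z;
  SELECT combine_agg(p.part)
    FROM (SELECT y, partial_agg(x) AS part    -- partial phase, grouped by a.y
            FROM a GROUP BY y) p
    JOIN b ON p.y = b.id                      -- b.id unique, one row per group
   GROUP BY b.z;                              -- combine phase, grouped by b.z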

--

In short, then, I don't have much use for the patch as presented in this
thread, without "aggmultifn".  That might be OK as the first phase in a
multi-step patch, but not as a production feature.  I also have no use
for the stuff depending on bitwise equality rather than the user-visible
operators; tha

logical decoding vs. VACUUM FULL / CLUSTER on table with TOAST-ed data

2018-11-18 Thread Tomas Vondra
Hi,

It seems we have pretty annoying problem with logical decoding when
performing VACUUM FULL / CLUSTER on a table with toast-ed data.

The trouble is that while the rewritten heap is WAL-logged using XLOG/FPI
records, the TOAST data is logged as regular INSERT records. XLOG/FPI is
ignored in logical decoding, and so reorderbuffer never gets those
records. But we do decode the TOAST data, and reorderbuffer stashes it
in the toast_hash hash table, which gets reset only when handling a row
from the main heap - and that row never arrives. So we end up stashing
all the TOAST data in memory :-(

So do VACUUM FULL (or CLUSTER) on a sufficiently large table, and you're
likely to break any logical replication connection. And it does not
matter if you replicate this particular table.
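
A minimal way to see the effect (a sketch, assuming test_decoding is
installed and "big_toastable" is a table with many TOASTed rows; untested
as written):

  SELECT pg_create_logical_replication_slot('s1', 'test_decoding');
  VACUUM FULL big_toastable;  -- rewrites the heap and its TOAST table
  -- decoding the rewrite transaction now accumulates every TOAST chunk in
  -- the reorderbuffer's toast_hash, whether or not the table is replicated:
  SELECT count(*) FROM pg_logical_slot_get_changes('s1', NULL, NULL);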

Luckily enough, this can leverage some of the pieces introduced by
e9edc1ba which was meant to deal with rewrites of system tables, and in
raw_heap_insert it added this:

/*
 * The new relfilenode's relcache entrye doesn't have the necessary
 * information to determine whether a relation should emit data for
 * logical decoding.  Force it to off if necessary.
 */
if (!RelationIsLogicallyLogged(state->rs_old_rel))
options |= HEAP_INSERT_NO_LOGICAL;

As raw_heap_insert is used only for heap rewrites, we can simply remove
the if condition and use the HEAP_INSERT_NO_LOGICAL flag for all TOAST
data logged from here.

This does fix the issue, because we still decode the TOAST changes but
there are no data and so

if (change->data.tp.newtuple != NULL)
{
dlist_delete(&change->node);
ReorderBufferToastAppendChunk(rb, txn, relation,
  change);
}

ends up not stashing the change in the hash table. It's imperfect,
because we still decode the changes (and stash them to disk), but ISTM
that can be fixed by tweaking DecodeInsert a bit to just ignore those
changes entirely.

Attached is a PoC patch with these two fixes.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index c5db75afa1..ce6f9ed117 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -658,13 +658,8 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
 		if (!state->rs_use_wal)
 			options |= HEAP_INSERT_SKIP_WAL;
 
-		/*
-		 * The new relfilenode's relcache entrye doesn't have the necessary
-		 * information to determine whether a relation should emit data for
-		 * logical decoding.  Force it to off if necessary.
-		 */
-		if (!RelationIsLogicallyLogged(state->rs_old_rel))
-			options |= HEAP_INSERT_NO_LOGICAL;
+		/* do not decode TOAST data for heap rewrites */
+		options |= HEAP_INSERT_NO_LOGICAL;
 
 		heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
 		 options);
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index afb497227e..f23cb120e8 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -665,6 +665,9 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 static void
 DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 {
+	Size		datalen;
+	char	   *tupledata;
+	Size		tuplelen;
 	XLogReaderState *r = buf->record;
 	xl_heap_insert *xlrec;
 	ReorderBufferChange *change;
@@ -672,6 +675,10 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	xlrec = (xl_heap_insert *) XLogRecGetData(r);
 
+	/* ignore insert records without new tuples */
+	if (!(xlrec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE))
+		return;
+
 	/* only interested in our database */
 	XLogRecGetBlockTag(r, 0, &target_node, NULL, NULL);
 	if (target_node.dbNode != ctx->slot->data.database)
@@ -690,17 +697,13 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 	memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
 
-	if (xlrec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE)
-	{
-		Size		datalen;
-		char	   *tupledata = XLogRecGetBlockData(r, 0, &datalen);
-		Size		tuplelen = datalen - SizeOfHeapHeader;
+	tupledata = XLogRecGetBlockData(r, 0, &datalen);
+	tuplelen = datalen - SizeOfHeapHeader;
 
-		change->data.tp.newtuple =
-			ReorderBufferGetTupleBuf(ctx->reorder, tuplelen);
+	change->data.tp.newtuple =
+		ReorderBufferGetTupleBuf(ctx->reorder, tuplelen);
 
-		DecodeXLogTuple(tupledata, datalen, change->data.tp.newtuple);
-	}
+	DecodeXLogTuple(tupledata, datalen, change->data.tp.newtuple);
 
 	change->data.tp.clear_toast_afterwards = true;
 


Re: Fixing AC_CHECK_DECLS to do the right thing with clang

2018-11-18 Thread Tom Lane
Andres Freund  writes:
> Seems like a good plan. The problem doesn't reproduce for me on Debian
> (using any version of clang), so all I can report is that a patched
> build still works as it should.

Interesting.  It's hardly surprising that the problem would occur only
on some platforms, since if  declares the function then the
problem isn't visible.  But I'm surprised that some Debian boxes would
show it and some not.  Still, a closer look at the buildfarm shows both
clang-on-Debian members with the warning (eg gull) and clang-on-Debian
members without (eg yours).

regards, tom lane



Re: docs should mention that max_wal_size default depends on WAL segment size

2018-11-18 Thread Tomas Vondra



On 11/18/18 10:22 PM, Andres Freund wrote:
> Hi,
> 
> On 2018-11-18 22:16:12 +0100, Tomas Vondra wrote:
>> while investigating something on a cluster with a non-default WAL
>> segment (say 256MB), I've noticed a somewhat surprising behavior of
>> max_wal_size default. While the docs claim the default is 1GB, the
>> actual default depends on the WAL segment size.
>>
>> For example with the 256MB WAL segments, you end up with this:
>>
>>   test=# show max_wal_size ;
>>    max_wal_size
>>   --------------
>>    16GB
>>   (1 row)
>>
>> This behavior is not entirely new - I've noticed it on 10, before the
>> WAL segment size was moved to initdb (which made it more likely to be
>> used). It's even more surprising there, because it leaves
>>
>> #max_wal_size = 1GB
>>
>> in the sample config, while fc49e24f at least emits the actual value.
>>
>> But I'd say not mentioning this behavior in the docs is a bug.
> 
> Hm, you're not wrong there. Wonder if it'd be better to make it so that
> the default actually has the effect of being 1GB - I think that ought to
> be doable?
> 

I've actually thought about that, initially, but I'm not sure that's a
good idea for a couple of reasons:

(1) WAL segments can go up to 1GB now, and we need at least two of
those, so there are at least some cases where we can't stick to the 1GB
max_wal_size default.

(2) If you're messing with the WAL segment size, chances are you have either
a very large/busy or a very small system, and so the changes to max_wal_size
probably make sense.

(3) If we decide to enforce the 1GB default, we should probably
backpatch it, otherwise you'll get surprisingly different behavior on
different versions. Not great. But backpatching may influence existing
systems quite a bit. Of course, there are probably few systems tweaking
WAL segment size, and I'd expect them to set max_wal_size explicitly
(rendering the default irrelevant).

So, not sure.

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: has_table_privilege for a table in unprivileged schema causes an error

2018-11-18 Thread Tom Lane
"David G. Johnston"  writes:
> On Sat, Aug 25, 2018 at 8:29 PM, Tom Lane  wrote:
>> There's a backwards-compatibility argument for not changing this behavior,
>> sure, but I don't buy the other arguments you made here.  And I don't
>> think there's all that much to the backwards-compatibility argument,
>> considering that the current behavior is to fail.

> +1; any code using these functions can reasonably be expected to handle
> true and false responses.  Converting a present error into a false under
> that presumption results in similar, and arguably improved, semantics.

I took a closer look at what's going on here, and realized that the
existing code is a bit confused, or confusing, about whose privileges
it's testing.  Consider this exchange (with HEAD, not the patch):

regression=# CREATE SCHEMA testns;
CREATE SCHEMA
regression=# CREATE TABLE testns.t1 (f1 int);
CREATE TABLE
regression=# CREATE USER regress_priv_user1;
CREATE ROLE
regression=# SELECT has_table_privilege('regress_priv_user1', 'testns.t1', 'SELECT');
 has_table_privilege 
---------------------
 f
(1 row)

regression=# \c - regress_priv_user1
You are now connected to database "regression" as user "regress_priv_user1".
regression=> SELECT has_table_privilege('regress_priv_user1', 'testns.t1', 'SELECT');
ERROR:  permission denied for schema testns

That is, whether you get an error or not depends on your *own*
privileges for the schema, not those of the user you're supposedly
testing.  This seems rather inconsistent.  If we apply the proposed
patch, then (I'd expect, but haven't tested) you should always get
the same results from has_table_privilege('u', 's.t', 'p') as from
doing has_table_privilege('s.t', 'p') as user u.
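
That is, with the patch one would expect (a sketch, not actually run):

  -- as an unprivileged caller:
  SELECT has_table_privilege('regress_priv_user1', 'testns.t1', 'SELECT');  -- f
  -- connected as regress_priv_user1:
  SELECT has_table_privilege('testns.t1', 'SELECT');  -- f, instead of the error above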

However ... if we do that, then Robert's previous concern about
information leakage comes to life, because an unprivileged user
could run has_table_privilege('postgres', 'testns.t1', 'SELECT')
and thereby find out whether t1 exists regardless of whether he
has any privilege for testns.

Mind you, I think that argument is mostly bunk, because that
same user can trivially find out whether t1 exists, and what
its privilege grants are, just by looking into pg_catalog.
So there's no real security improvement from disallowing this.

Anyway, it seems to me that the principle of least astonishment
suggests that the results of has_table_privilege should depend only
on the privileges of the user allegedly being tested, not those of
the calling user.  Or if we decide differently, somebody has to write
some doc text explaining that it's not so.

Getting all the way to that might be a bit difficult though.
For example, in

SELECT has_function_privilege('someuser', 'func(schema.type)', 'usage');

the lookup and permission check for "schema" are buried a long
way down from the has_function_privilege code.  It'd take a
lot of refactoring to keep it from throwing an error.  I guess
maybe you could argue that privileges on the type are a different
question from privileges on the function, but it still seems ugly.

A related thought is that it's not clear whether there should be
any behavioral difference between

SELECT has_function_privilege('someuser', 'func(schema.type)'::text, 'usage');

SELECT has_function_privilege('someuser', 'func(schema.type)'::regprocedure, 
'usage');

In the latter case, it's entirely unsurprising that the schema lookup
is done as the current user.  However, if we define the first case
as being equivalent to the second, then the error that Yugo-san is
complaining of is something we can't/shouldn't fix, because certainly
"SELECT 'testns.t1'::regclass" is going to throw an error if you lack
usage privilege for testns.

So at this point I'm not sure what to think, other than that things
are inconsistent (and underdocumented).

regards, tom lane



Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query

2018-11-18 Thread Haribabu Kommi
On Sun, Nov 18, 2018 at 1:11 AM Alvaro Herrera 
wrote:

> On 2018-Nov-17, Amit Kapila wrote:
>
> > On Sat, Nov 17, 2018 at 4:46 PM Alvaro Herrera 
> wrote:
>
> > > Uh, ouch!  Seems to me that this is a high-caliber foot-gun, and will
> > > cause nasty suprises when production stats data disappear
> inadvertently!
> >
> > What is the alternative in your mind?
>
> Well, as I see this interface, it is as if
> DELETE FROM ... WHERE queryid = 
> where passing a NULL value meant to delete *all* rows regardless of
> queryid, rather than only those where queryid is NULL.
>
> > AFAICS, we have below alternatives:
> >
> > a) Return an error for the NULL value of any argument?
> > b) Silently ignore if there is any NULL value argument and don't
> > remove any stats.
> > c) Just ignore the NULL value parameter and remove the stats based on
> > other parameters.
>
> > Currently, patch implements option (c), but we can do (a) or (b) as
> > well, however, that can make the API usage bit awkward.
>
> I think option d) is to have more functions (seven functions total), and
> have them all be strict (i.e. don't do anything if one argument is
> NULL).  One function takes three arguments, and removes all entries
> matching the three.  Other functions take two arguments, and remove
> everything matching those, ignoring the third (so behaving like the
> current NULL).  Others take one argument.
>
> Maybe it doesn't make sense to provide all combinations, so it's less
> than seven.  But having an interface that's inherently easy to misuse
> makes me a bit nervous.
>

Following are the possible combinations, with function names of the form
pg_stat_statements_reset_by_***:

userid,
dbid
queryid

userid, dbid
userid, queryid
dbid, queryid

userid, dbid, queryid -- Existing function name with arguments can work.

So 6 new functions need to be added to cover all the above cases.
IMO, we may need functions for all combinations, because some user may
require whichever combination is left out if we choose only some of
them.

Regards,
Haribabu Kommi
Fujitsu Australia


Re: spurious(?) warnings in archive recovery

2018-11-18 Thread Vik Fearing
On 13/11/2018 16:34, Andrew Gierth wrote:
> So while investigating a case of this warning (in
> UpdateMinRecoveryPoint):
> 
> "xlog min recovery request %X/%X is past current point %X/%X"
> 
> I noticed that it is issued even in cases where we know that
> minRecoveryPoint is not yet valid, for example because we're waiting to
> see XLOG_BACKUP_END before declaring consistency.
> 
> But, you'd think, you shouldn't get this error because any page we
> modify during recovery should have been restored from an FPI with a
> suitably early LSN? For data pages that is correct, but not for VM or
> (iff wal_log_hints or checksums are enabled) FSM pages.
> 
> When we replay an operation that, for example, clears a bit in the VM,
> the redo code will read in that VM page from disk, and because we're not
> yet consistent and because _clearing_ a VM bit is not in itself
> wal-logged and doesn't result in any FPI being generated for the VM
> page, it could well read a VM page that has a far-future LSN from the
> point of view of replay, and dirty it, causing a later eviction to try
> and do UpdateMinRecoveryPoint with that future LSN.
> 
> (I haven't investigated this aspect, but there also appears to be no
> protection against torn pages in the VM when checksums are enabled? am I
> missing something somewhere?)
> 
> I'm less clear on the exact mechanisms, but when wal_log_hints (or
> checksums) is on, FSM pages also get LSNs, sometimes, thanks to
> MarkBufferDirtyHint, and at least some code paths can also do
> MarkBufferDirty on FSM pages during recovery, which would cause their
> eviction with possible future LSNs as with VM pages.
> 
> This means that if you simply do an old-style base backup using
> pg_start_backup/rsync/pg_stop_backup (on a sufficiently active system
> and taking long enough) and then recover from it, you're likely to get a
> log spammed with these errors for no very good reason.
> 
> So it seems to me that issuing this error is a bug if the conditions
> described are actually harmless, while if they're not harmless, then
> obviously that is a bug. So _something_ needs fixing here, but I'm not
> yet sufficiently confident of my analysis to say what.
> 
> Opinions?
> 
> (as a further point, it seems to me that backupEndRequired is a rather
> misleadingly named variable, since what _actually_ determines whether an
> XLOG_BACKUP_END record is expected is whether backupStartPoint is set.
> backupEndRequired seems to change one error message and, questionably,
> one decision about whether to do crash recovery before entering archive
> recovery, but nothing else.)


Bump.

I was the originator of this report.  I work with Postgres every single
day and I was spooked by these warnings.  People with much less
involvement would probably be terrified.
-- 
Vik Fearing  +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support



Re: [PATCH] XLogReadRecord returns pointer to currently read page

2018-11-18 Thread Michael Paquier
On Fri, Nov 16, 2018 at 03:23:55PM +0900, Michael Paquier wrote:
> I was looking at this patch, and shouldn't we worry about compatibility
> with plugins or utilities which look directly at what's in readRecordBuf
> for the record contents?  Let's not forget that the contents of
> XLogReaderState are public.

And a couple of days later, committed.  I did not notice first that the
comments of xlogreader.h mention that a couple of areas are private, and
readRecordBuf is part of that, so objection withdrawn.

There is a comment in xlog.c (on top of ReadRecord) referring to
readRecordBuf which has not been removed as part of 7fcbf6a4 when WAL
reading has been moved to its own facility.  The original comment was
from 0ffe11a.  So I have removed the comment at the same time.
--
Michael




Re: [HACKERS] PoC: full merge join on comparison clause

2018-11-18 Thread Tom Lane
Alexander Kuzmenkov  writes:
> [ Inequality-merge-join-v10.patch ]

Just thinking about this patch a bit ... I wonder why you were so quick to
reject the UNION approach at the outset.  This patch is pretty messy, and
it complicates a lot of stuff that is quite fundamental to the planner,
and you still end up that the only functionality gain is now we can handle
full joins whose conditions include a single btree inequality clause.
Nor are we doing that remarkably efficiently ... it's pretty much
impossible to do it efficiently, in fact, since if the inputs have M and N
rows respectively then the output will have something like (M*N)/2 rows.

So it seems to me that if we're going to put sweat into this area at all,
our ambition ought to be "we'll successfully perform a FULL JOIN with any
join clause whatsoever, though it might take O(M*N) time".

Now as far as I can tell, the UNION substitution you proposed is
completely valid, although it'd be better to phrase the second step
as an antijoin.  That is, I believe

select * from t1 full join t2 on (anything)

is exactly equal to

select t1.*, t2.* from t1 left join t2 on (anything)
union all
select t1.*, t2.* from t2 anti join t1 on (anything)

There is one fly in the ointment, which is that we will have to run the
join clause twice, so it can't contain volatile functions --- but the
merge join approach wouldn't handle that case either.

Having to read the inputs twice is not good, but we could put them
into CTEs, which fixes any problems with volatility below the join
and at least alleviates the performance problem.  Since we can't
currently do any meaningful qual pushdown through full joins, the
optimization-fence aspect of a CTE doesn't seem like an issue either.
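
For concreteness, a sketch of that rewrite for single-column tables
t1(a int) and t2(b int) with a non-volatile clause (hand-written, untested):

  WITH ct1 AS (SELECT * FROM t1),
       ct2 AS (SELECT * FROM t2)
  SELECT ct1.a, ct2.b
    FROM ct1 LEFT JOIN ct2 ON ct1.a < ct2.b
  UNION ALL
  SELECT NULL::int, ct2.b  -- t2 rows with no join partner (the anti join)
    FROM ct2
   WHERE NOT EXISTS (SELECT 1 FROM ct1 WHERE ct1.a < ct2.b);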

In short, proceeding like the above when we can't find another plan
type for a full join seems like it fixes a far wider variety of cases.
The possibility that maybe we could do some of those cases a bit faster
isn't sufficiently attractive to me to justify also putting in a
mechanism like this patch proposes.  We only rarely see complaints at
all about can't-do-a-full-join problems, and I do not think this patch
would fix enough of those complaints to be worthwhile.

regards, tom lane



Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query

2018-11-18 Thread Michael Paquier
On Mon, Nov 19, 2018 at 10:41:22AM +1100, Haribabu Kommi wrote:
> So 6 new functions needs to be added to cover all the above cases,
> IMO, we may need functions for all combinations, because I feel some
> user may have the requirement of left out combination, in case if we choose
> only some combinations.

That's bloating the interface in my opinion.

Another choice e) would be to keep the function as not strict, but
defaulting to the current database if NULL is used for the database ID
and defaulting to the current user if NULL is used for the user ID.  For
the query ID, if NULL is given as input, process all queries matching
(database, user), removing nothing if there is no match.  If the
caller has set up a query ID, the rest of the stats should not be
touched.
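
Under that proposal, usage would look something like this (a sketch of the
suggested semantics, not an existing API; the query ID is made up):

  -- reset one query of the current user in the current database:
  SELECT pg_stat_statements_reset(NULL, NULL, 3972942513);
  -- reset everything for the current (database, user) pair:
  SELECT pg_stat_statements_reset(NULL, NULL, NULL);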
--
Michael




Re: New GUC to sample log queries

2018-11-18 Thread Michael Paquier
On Sun, Nov 18, 2018 at 12:18:33PM +0100, Dmitry Dolgov wrote:
> Since it's hard to come up with a concise name that mentions the sampling rate
> in the context of log_min_duration_statement, I think it's fine to name this
> configuration "log_sample_rate", as long as its dependency on
> log_min_duration_statement is clearly explained in the documentation.

log_sample_rate looks fine to me as a name.
--
Michael




Re: WIP: Avoid creation of the free space map for small tables

2018-11-18 Thread John Naylor
On 11/16/18, Amit Kapila  wrote:
> +/* Status codes for the local map. */
> +#define FSM_LOCAL_ZERO 0x00 /* Beyond the end of the relation */
> +#define FSM_LOCAL_AVAIL 0x01 /* Available to try */
> +#define FSM_LOCAL_TRIED 0x02 /* Already tried, not enough space */
>
> Instead of maintaining three states, can't we do with two states
> (Available and Not Available), basically combine 0 and 2 in your case.
> I think it will save some cycles in
> fsm_local_set, where each time you need to initialize all the entries
> in the map.  I think we can argue that it is not much overhead, but I
> think it is better code-wise also if we can make it happen with fewer
> states.

That'd work too, but let's consider this scenario: We have a 2-block
table that has no free space. After trying each block, the local cache
looks like

0123
TT00

Let's say we have to wait to acquire a relation extension lock,
because another backend had already started extending the heap by 1
block. We call GetPageWithFreeSpace() and now the local map looks like

0123
TTA0

By using bitwise OR to set availability, the already-tried blocks
remain as they are. With only 2 states, the map would look like this
instead:

0123
AAAN

If we assume that an insert into the newly-created block 2 will almost
always succeed, we don't have to worry about wasting time re-checking
the first 2 full blocks. Does that sound right to you?


> Some assorted comments:
> 1.
>  
> -Each heap and index relation, except for hash indexes, has a Free Space
> Map
> +Each heap relation, unless it is very small, and each index relation,
> +except for hash indexes, has a Free Space Map
>  (FSM) to keep track of available space in the relation. It's stored
>
> It appears that line has ended abruptly.

Not sure what you're referring to here.

> 2.
> page = BufferGetPage(buffer);
> + targetBlock = BufferGetBlockNumber(buffer);
>
>   if (!PageIsNew(page))
>   elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
> - BufferGetBlockNumber(buffer),
> + targetBlock,
>   RelationGetRelationName(relation));
>
>   PageInit(page, BufferGetPageSize(buffer), 0);
> @@ -623,7 +641,18 @@ loop:
>   * current backend to make more insertions or not, which is probably a
>   * good bet most of the time.  So for now, don't add it to FSM yet.
>   */
> - RelationSetTargetBlock(relation, BufferGetBlockNumber(buffer));
> + RelationSetTargetBlock(relation, targetBlock);
>
> Is this related to this patch? If not, I suggest let's do it
> separately if required.

I will separate this out.

> 3.
>  static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
> -uint8 newValue, uint8 minValue);
> + uint8 newValue, uint8 minValue);
>
> This appears to be a spurious change.

It was intentional, but I will include it separately as above.

> 4.
> @@ -378,24 +386,15 @@ RelationGetBufferForTuple(Relation relation, Size
> len,
>   * target.
>   */
>   targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
> +
> + /*
> + * In case we used an in-memory map of available blocks, reset
> + * it for next use.
> + */
> + if (targetBlock < HEAP_FSM_CREATION_THRESHOLD)
> + ClearLocalMap();
>
> How will you clear the local map during error?  I think you need to
> clear it in abort path and you can name the function as
> FSMClearLocalMap or something like that.

That sounds right, and I will rename the function that way. For the
abort path, were you referring to this or somewhere else?

if (!PageIsNew(page))
elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
 targetBlock,
 RelationGetRelationName(relation));

> 5.
> +/*#define TRACE_TARGETBLOCK */
>
> Debugging leftover, do you want to retain this and related stuff
> during the development of patch?

I modeled this after TRACE_VISIBILITYMAP in visibilitymap.c. It's
useful for development, but I don't particularly care whether it's in
the final version.

Also, I found an off-by-one error that caused an unnecessary
smgrexists() call in tables with threshold + 1 pages. This will be
fixed in the next version.

-John Naylor



Re: ToDo: show size of partitioned table

2018-11-18 Thread Alvaro Herrera
On 2018-Jul-23, Pavel Stehule wrote:

> p.s. Another patch can be replacement of relation type from "table" to
> "partitioned table"
> 
> postgres=# \dt+
>  List of relations
> +--------+------------+-------------------+-------+---------+-------------+
> | Schema |    Name    |       Type        | Owner |  Size   | Description |
> +--------+------------+-------------------+-------+---------+-------------+
> | public | data       | partitioned table | pavel | 0 bytes |             |
> | public | data_2016  | table             | pavel | 15 MB   |             |
> | public | data_2017  | table             | pavel | 15 MB   |             |
> | public | data_other | table             | pavel | 11 MB   |             |
> +--------+------------+-------------------+-------+---------+-------------+
> (4 rows)

I think this is a clear improvement.  The term "table" was introduced
for this case by f0e44751d7 ("Implement table partitioning.") and now
the author of that commit supports this change.  I used the term "index"
for partitioned indexes originally because I was copying the existing
term, but now I too think they should say "partitioned indexes" instead,
because they are different enough objects from plain indexes.

To be certain I'm not going against some old decision, I dug up
Amit's old patches.  Turns out he submitted psql's describe.c using the
term "partitioned table" on August 10th [1] and then based on a
discussion where Robert suggested calling these new objects "partition
roots" instead to avoid confusion, it was changed to "table" in the next
submission on August 26th [2].  It seems the right call to have used the
term "table" in many places (rather than "partition roots"), but at
least in psql's \dt it seems extremely useful to show the type as
"partitioned table" instead, because it is one place where the
distinction is clearly useful.

In this thread there have been no contrary votes, so I'm pushing this
part soon.

[1] https://postgr.es/m/ad16e2f5-fc7c-cc2d-333a-88d4aa446...@lab.ntt.co.jp
[2] https://postgr.es/m/169708f6-6e5a-18d1-707b-1b323e4a6...@lab.ntt.co.jp

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Psql patch to show access methods info

2018-11-18 Thread Michael Paquier
On Sat, Nov 17, 2018 at 11:20:50PM -0300, Alvaro Herrera wrote:
> Here's a rebased version, fixing the rejects, pgindenting, and fixing
> some "git show --check" whitespace issues.  Haven't reviewed any further
> than that.

Schema qualifications are missing in many places, and they are added
sometimes.  The character limit in documentation paragraph could be more
respected as well.

+opereator families associated with whose name matches the
pattern are shown.
s/opereator/operator/.

+List procedures ()
accociated with access method operator families.
s/accociated/associated/.
--
Michael




Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query

2018-11-18 Thread Alvaro Herrera
On 2018-Nov-19, Michael Paquier wrote:

> On Mon, Nov 19, 2018 at 10:41:22AM +1100, Haribabu Kommi wrote:
> > So 6 new functions needs to be added to cover all the above cases,
> > IMO, we may need functions for all combinations, because I feel some
> > user may have the requirement of left out combination, in case if we choose
> > only some combinations.
> 
> That's bloating the interface in my opinion.

I understand.

Let's call for a vote from a larger audience.  It's important to get
this interface right, ISTM.

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: ToDo: show size of partitioned table

2018-11-18 Thread Amit Langote
On 2018/11/19 11:17, Alvaro Herrera wrote:
> On 2018-Jul-23, Pavel Stehule wrote:
> 
>> p.s. Another patch can be replacement of relation type from "table" to
>> "partitioned table"
>>
>> postgres=# \dt+
>>  List of relations
>> +--------+------------+-------------------+-------+---------+-------------+
>> | Schema |    Name    |       Type        | Owner |  Size   | Description |
>> +--------+------------+-------------------+-------+---------+-------------+
>> | public | data       | partitioned table | pavel | 0 bytes |             |
>> | public | data_2016  | table             | pavel | 15 MB   |             |
>> | public | data_2017  | table             | pavel | 15 MB   |             |
>> | public | data_other | table             | pavel | 11 MB   |             |
>> +--------+------------+-------------------+-------+---------+-------------+
>> (4 rows)
> 
> I think this is a clear improvement.  The term "table" was introduced
> for this case by f0e44751d7 ("Implement table partitioning.") and now
> the author of that commit supports this change.  I used the term "index"
> for partitioned indexes originally because I was copying the existing
> term, but now I too think they should say "partitioned indexes" instead,
> because they are different enough objects from plain indexes.
> 
> To be certain I'm not going against some old decision, I dug up
> Amit's old patches.  Turns out he submitted psql's describe.c using the
> term "partitioned table" on August 10th [1] and then based on a
> discussion where Robert suggested calling these new objects "partition
> roots" instead to avoid confusion, it was changed to "table" in the next
> submission on August 26th [2].  It seems the right call to have used the
> term "table" in many places (rather than "partition roots"), but at
> least in psql's \dt it seems extremely useful to show the type as
> "partitioned table" instead, because it is one place where the
> distinction is clearly useful.
> 
> In this thread there have been no contrary votes, so I'm pushing this
> part soon.
> 
> [1] https://postgr.es/m/ad16e2f5-fc7c-cc2d-333a-88d4aa446...@lab.ntt.co.jp
> [2] https://postgr.es/m/169708f6-6e5a-18d1-707b-1b323e4a6...@lab.ntt.co.jp

Yeah, I agree that showing "partitioned table" for partitioned tables in
this case is helpful.

Earlier on this thread [1], I had expressed a slight concern about the
consistency of mentioning "partitioned" in various outputs, because many
error messages say "table" even if the table is partitioned.  But now I
think that it's orthogonal.  We should show "partitioned" where it is helpful.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/5474c8b6-04e7-1afc-97b6-adb7471c2c71%40lab.ntt.co.jp




Re: Psql patch to show access methods info

2018-11-18 Thread Alvaro Herrera
On 2018-Nov-19, Michael Paquier wrote:

> On Sat, Nov 17, 2018 at 11:20:50PM -0300, Alvaro Herrera wrote:
> > Here's a rebased version, fixing the rejects, pgindenting, and fixing
> > some "git show --check" whitespace issues.  Haven't reviewed any further
> > than that.
> 
> Schema qualifications are missing in many places, and are only added
> in some.  The character limit in documentation paragraphs could be
> better respected as well.

Sergey, are you available to fix these issues?  Nikita?

Thanks

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: ToDo: show size of partitioned table

2018-11-18 Thread Michael Paquier
On Sun, Nov 18, 2018 at 11:17:37PM -0300, Alvaro Herrera wrote:
> To be certain I'm not going against some old decision, I dug up
> Amit's old patches.  Turns out he submitted psql's describe.c using the
> term "partitioned table" on August 10th [1] and then based on a
> discussion where Robert suggested calling these new objects "partition
> roots" instead to avoid confusion, it was changed to "table" in the next
> submission on August 26th [2].  It seems the right call to have used the
> term "table" in many places (rather than "partition roots"), but at
> least in psql's \dt it seems extremely useful to show the type as
> "partitioned table" instead, because it is one place where the
> distinction is clearly useful.

+1.

> In this thread there have been no contrary votes, so I'm pushing this
> part soon.
> 
> [1] https://postgr.es/m/ad16e2f5-fc7c-cc2d-333a-88d4aa446...@lab.ntt.co.jp
> [2] https://postgr.es/m/169708f6-6e5a-18d1-707b-1b323e4a6...@lab.ntt.co.jp

Sorry for digressing, but could you also update \di at the same time so
that it shows "partitioned index"?  listTables() should be switched to use
partitioned tables and partitioned indexes, and permissionsList() has a
reference to partitioned tables.  While on it, this gives the attached..
--
Michael
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 4ca0db1d0c..8e06097442 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -955,7 +955,7 @@ permissionsList(const char *pattern)
 	  gettext_noop("materialized view"),
 	  gettext_noop("sequence"),
 	  gettext_noop("foreign table"),
-	  gettext_noop("table"),	/* partitioned table */
+	  gettext_noop("partitioned table"),
 	  gettext_noop("Type"));
 
 	printACLColumn(&buf, "c.relacl");
@@ -3524,8 +3524,8 @@ listTables(const char *tabtypes, const char *pattern, bool verbose, bool showSys
 	  gettext_noop("sequence"),
 	  gettext_noop("special"),
 	  gettext_noop("foreign table"),
-	  gettext_noop("table"),	/* partitioned table */
-	  gettext_noop("index"),	/* partitioned index */
+	  gettext_noop("partitioned table"),
+	  gettext_noop("partitioned index"),
 	  gettext_noop("Type"),
 	  gettext_noop("Owner"));
 




Re: ATTACH/DETACH PARTITION CONCURRENTLY

2018-11-18 Thread Amit Langote
On 2018/11/17 9:06, Michael Paquier wrote:
> On Fri, Nov 16, 2018 at 09:38:40AM -0500, Robert Haas wrote:
>> OK, but it seems to me that your version of my patch rearranges the
>> code more than necessary.
>>
>> How about the attached?
> 
> What you are proposing here looks good to me.  Thanks!

Me too, now that I look at the patch closely.  The errors I'd seen in the
regression tests were due to an uninitialized oids variable, which is fixed
in the later patches, not due to "confused memory context switching" as I'd
put it [1] and made the reason for additional changes.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/1be8055c-137b-5639-9bcf-8a2d5fef6e5a%40lab.ntt.co.jp




Re: BUG #15449: file_fdw using program cause exit code error when using LIMIT

2018-11-18 Thread Kyotaro HORIGUCHI
At Sat, 17 Nov 2018 11:14:49 -0500, Tom Lane  wrote in 
<23489.1542471...@sss.pgh.pa.us>
> I wrote:
> > After still further thought, it seems like "if (bytesread == 0)"
> > is the right way to proceed.  That protects us against incorrectly
> > setting reached_eof if the pipe is being run in unbuffered or
> > line-buffered mode.  (Which copy.c doesn't do, at the moment,
> > but I'd just as soon this code didn't need to assume that.
> > Also, I'm not 100% convinced that Windows or other platforms
> > might not act that way.)  In the case where we terminate early
> > because we saw an in-band EOF marker, we would also not set reached_eof,
> > and again that seems like a good outcome: if we see SIGPIPE it
> > must mean that the program kept sending data after the EOF marker,
> > and that seems like a good thing to complain about.  (It's only
> > guaranteed to fail if the program sends more than the current pipe
> > buffer-ful, but I don't feel a need to add extra code to check for
> > nonempty buffer contents as we exit.)
> 
> Oh, wait, that last bit is backwards: if we see an in-band EOF mark,
> and as a consequence exit without having set reached_eof, then the
> exit code will think that SIGPIPE is ignorable.  So transmitting
> more data after an EOF mark will not be complained of, whether
> it's within the same bufferload or not.
> 
> Still, I can't get very excited about that.  Potentially we could
> improve it by having the places that recognize EOF marks set 
> reached_eof, but I'm unconvinced that it's worth the trouble.
> I'm thinking that it's better to err in the direction of reporting
> SIGPIPE less often not more often, considering that up to now
> we've never reported it at all and there've been so few complaints.

My opinion here is that when we execute an external program on the
other end of a pipe, we should behave as tolerantly as ordinary
un*x programs are expected to. If we're connecting to
another PostgreSQL server, we should be as stringent as the current
behavior.

In other words, we don't need to change the behavior of anything other
than the COPY_FILE case, but ClosePipeToProgram shouldn't complain
about any kind of error, not just SIGPIPE.
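
In code terms, a minimal sketch of that loose behavior could be the
following (hypothetical; it reuses fd.c's ClosePipeStream(), but the
function name and placement are illustrative, not an actual patch):

#include "postgres.h"
#include "storage/fd.h"

/*
 * Close the pipe to an external COPY program without treating a
 * nonzero exit status of the program as an error.  Only a failure of
 * the close operation itself is reported.
 */
static void
close_pipe_loosely(FILE *pipe_file)
{
	int			pclose_rc = ClosePipeStream(pipe_file);

	if (pclose_rc == -1)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not close pipe to external command: %m")));
	/* tolerate any nonzero exit status of the called program here */
}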

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Pull up sublink of type 'NOT NOT (expr)'

2018-11-18 Thread Richard Guo
Hi Tom,

Thanks for your input.

On Fri, Nov 16, 2018 at 6:06 AM, Tom Lane  wrote:

> Richard Guo  writes:
> > On Tue, Nov 13, 2018 at 10:05 AM, Tom Lane  wrote:
> >> What is the argument that this occurs often enough to be worth expending
> >> extra cycles and code space on?
>
> > I am not using an ORM, but just considering maybe it would be better if
> > PostgreSQL can do such pull-up.
> > Tom, what's your suggestion? Is it worthwhile expending several lines of
> > codes to do this pull-up?
>
> Well, really, if we're just doing this on speculation and don't even have
> any concrete indication that anybody writes code like that, I can't get
> excited about expending even a few lines of code on it.
>
> The reason why we perform optimizations similar to this in places like
> eval_const_expressions is (IMO anyway) that transformations such as
> function inlining and subquery pullup can create parse trees that look
> like this, even when the original query was perfectly sane and without
> obvious redundancy.  However, because pull_up_sublinks runs before any
> of that, it would only ever see NOT NOT if someone had actually written
> such a thing.  So to justify expending any code space or cycles on
> optimizing this, you have to make the case that that actually happens
> in the wild, and does so often enough to justify wasting some (admittedly
> small) number of cycles for everybody else.  I'd kind of like to see some
> actual field occurrence of NOT NOT over an optimizable IN/EXISTS before
> deciding that it's worth doing.
>
> It's sort of annoying that we have to run pull_up_sublinks before we do
> scalar expression simplification.  If we could do that in the other order,
> NOT NOT elimination would fall out automatically, and we'd also be able
> to recognize some other cases where it initially seems that an IN or
> EXISTS is not at top level, but it becomes so after we const-fold, apply
> DeMorgan's laws, etc.  However, according to the point I made above,
> it makes more sense to apply expression simplification after we've
> completed pullup-like operations.  So we can't really get there from
> here unless we wanted to do (parts of?) expression simplification twice,
> which is unlikely to win often enough to justify the cost.
>

Agreed! If we do expression simplification in the other order, we have
to do it twice, which is not a good idea.


>
> So I'm inclined to reject this patch as probably being a net loss in
> almost all cases.  If you can show any real-world cases where it wins,
> we could reconsider.
>

I am convinced. I do not see this happening often enough in the real
world either. Feel free to reject this patch.

Thanks
Richard


Re: [PATCH] XLogReadRecord returns pointer to currently read page

2018-11-18 Thread Kyotaro HORIGUCHI
Thank you for taking this.

At Mon, 19 Nov 2018 10:27:29 +0900, Michael Paquier  wrote 
in <20181119012728.ga1...@paquier.xyz>
> On Fri, Nov 16, 2018 at 03:23:55PM +0900, Michael Paquier wrote:
> > I was looking at this patch, and shouldn't we worry about compatibility
> > with plugins or utilities which look directly at what's in readRecordBuf
> > for the record contents?  Let's not forget that the contents of
> > XLogReaderState are public.
> 
> And a couple of days later, committed.  I did not notice first that the
> comments of xlogreader.h mention that a couple of areas are private, and
> readRecordBuf is part of that, so objection withdrawn.

The way the function replaces the content of the buffer wasn't
changed. (Regardless of whether it is private or not, I believe
no one tries to write directly there outside the function.)  Also,
the memory pointed to by the returned pointer was regarded as
read-only in the previous code as well. The only point to notice
is the lifetime of the returned pointer, I suppose.
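
For illustration, a hypothetical caller that needs a record beyond that
lifetime has to copy it out; a sketch (not code from the patch):

#include "postgres.h"
#include "access/xlogreader.h"

/*
 * The pointer returned by XLogReadRecord() is only valid until the next
 * call on the same reader, so copy the record (xl_tot_len bytes) if it
 * must be kept around longer.
 */
static XLogRecord *
read_and_keep(XLogReaderState *reader, XLogRecPtr ptr, char **errormsg)
{
	XLogRecord *record = XLogReadRecord(reader, ptr, errormsg);
	XLogRecord *kept;

	if (record == NULL)
		return NULL;			/* read failure or end of WAL */

	kept = palloc(record->xl_tot_len);
	memcpy(kept, record, record->xl_tot_len);	/* survives the next read */
	return kept;
}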

> There is a comment in xlog.c (on top of ReadRecord) referring to
> readRecordBuf which has not been removed as part of 7fcbf6a4 when WAL
> reading has been moved to its own facility.  The original comment was
> from 0ffe11a.  So I have removed the comment at the same time.

- * The record is copied into readRecordBuf, so that on successful return,
- * the returned record pointer always points there.

Yeah, it is incorrect in some sense, but the comment was
suggesting the lifetime of the record pointed to. So I'd like to
have a comment like this instead.

+ * The record is copied into xlogreader, so that on successful return,
+ * the returned record pointer always points there.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Early WIP/PoC for inlining CTEs

2018-11-18 Thread Craig Ringer
On Sat, 17 Nov 2018 at 10:12, Stephen Frost  wrote:

> Greetings,
>
> * Andrew Gierth (and...@tao11.riddles.org.uk) wrote:
> > > "Tom" == Tom Lane  writes:
> >
> >  >> [ inlining-ctes-v5.patch ]
> >
> >  Tom> I took a little bit of a look through this.  Some thoughts:
> >
> >  Tom> * I think it'd be a good idea if we made OFFSET/LIMIT in a CTE be
> >  Tom> an alternate way of keeping it from being inlined. As the patch
> >  Tom> stands, if that's the behavior you want, you have no way to
> >  Tom> express it in a query that will also work in older servers. (I
> >  Tom> will manfully resist suggesting that then we don't need the
> >  Tom> nonstandard syntax at all ... oops, too late.)
> >
> > I think this is the wrong approach, because you may want the
> > optimization-barrier effects of OFFSET/LIMIT _without_ the actual
> > materialization - there is no need to force a query like
> >
> > with d as (select stuff from bigtable offset 1) select * from d;
> >
> > to push all the data through an (on-disk) tuplestore.
>
> Agreed, there's going to be cases where you want the CTE to be inlined
> even with OFFSET/LIMIT.  Let's please not cater to the crowd who
> happened to know that they could hack around with OFFSET/LIMIT to make
> something not be inlined when it comes to the question of if the CTE
> should be inlined or not.  That's the same issue we were arguing around
> when discussing if we should allow parallel array_agg, imv.
>
> Particularly since, with CTEs anyway, we never inlined them, so the
> whole OFFSET/LIMIT thing doesn't really make any sense- today, if you
> wrote a CTE, you wouldn't bother with OFFSET/LIMIT because you knew it
> wasn't going to be inlined, that entire line of thinking is for
> subqueries, not CTEs.  If you're going to force people to change their
> CTEs to require that they not be inlined, let's not pick a method which
> makes it ambiguous and makes us have to ask "do they really want this
> limit/offset, or did they just want to make the CTE not be inlined...?"
>
>
To satisfy Tom's understandable desire to let people write queries that
behave the same on old and new versions, can we get away with back-patching
the MATERIALIZED parser enhancement as a no-op in point releases?


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query

2018-11-18 Thread Amit Kapila
On Sat, Nov 17, 2018 at 7:41 PM Alvaro Herrera  wrote:
>
> On 2018-Nov-17, Amit Kapila wrote:
>
> > On Sat, Nov 17, 2018 at 4:46 PM Alvaro Herrera  
> > wrote:
>
> > > Uh, ouch!  Seems to me that this is a high-caliber foot-gun, and will
> > > cause nasty surprises when production stats data disappear inadvertently!
> >
> > What is the alternative in your mind?
>
> Well, as I see this interface, it is as if
> DELETE FROM ... WHERE queryid = <queryid>
> where passing a NULL value meant to delete *all* rows regardless of
> queryid, rather than only those where queryid is NULL.
>

If one wants to understand it in query form, it will be like:
DELETE FROM ... WHERE userid = <userid> AND dbid = <dbid> AND queryid = <queryid>;

Now, if any of the values is NULL, we ignore it, so the case in
question would be:
DELETE FROM ... WHERE userid = <userid> AND dbid = <dbid>;  -- the NULL queryid is ignored

Now, I think it is quite possible that one can interpret it as you
have written above and then this can lead to some unintended behavior.
I think one thing we can do to avoid such cases is to make this
function strict and make the default values as InvalidOid or -1.  So,
on NULL input, we don't do anything, but for -1 or an invalid value
for the parameter, we ignore only that parameter as we are doing now.
Do you think such an interface will address your concern?
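
For what it's worth, a minimal sketch of that strict variant
(hypothetical code, not the submitted patch; entry_reset() is an
assumed helper that scans the hash table and removes matching entries):

#include "postgres.h"
#include "fmgr.h"

PG_FUNCTION_INFO_V1(pg_stat_statements_reset_strict);

/*
 * Declared STRICT at the SQL level, so with a NULL argument the
 * function is never called and nothing is reset.  InvalidOid / 0 mean
 * "ignore this key and match everything".
 */
Datum
pg_stat_statements_reset_strict(PG_FUNCTION_ARGS)
{
	Oid			userid = PG_GETARG_OID(0);	/* InvalidOid => any user */
	Oid			dbid = PG_GETARG_OID(1);	/* InvalidOid => any database */
	uint64		queryid = (uint64) PG_GETARG_INT64(2);	/* 0 => any query */

	entry_reset(userid, dbid, queryid);		/* assumed helper */
	PG_RETURN_VOID();
}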

> > AFAICS, we have below alternatives:
> >
> > a) Return an error for the NULL value of any argument?
> > b) Silently ignore if there is any NULL value argument and don't
> > remove any stats.
> > c) Just ignore the NULL value parameter and remove the stats based on
> > other parameters.
>
> > Currently, patch implements option (c), but we can do (a) or (b) as
> > well, however, that can make the API usage bit awkward.
>
> I think option d) is to have more functions (seven functions total), and
> have them all be strict (i.e. don't do anything if one argument is
> NULL).  One function takes three arguments, and removes all entries
> matching the three.  Other functions take two arguments, and remove
> everything matching those, ignoring the third (so behaving like the
> current NULL).  Others take one argument.
>
> Maybe it doesn't make sense to provide all combinations, so it's less
> than seven.  But having an interface that's inherently easy to misuse
> makes me a bit nervous.
>

Yeah, we can go in that direction if the one-interface idea doesn't
work.  IIRC, we have discussed having multiple interfaces for this API,
and the majority of people seem to be in favor of a single interface.
Let us first see if there is some reasonable way to provide this
functionality with a single interface; if not, then surely we can
explore the multiple-interface idea.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: zheap: a new storage format for PostgreSQL

2018-11-18 Thread Amit Kapila
On Sun, Nov 18, 2018 at 3:42 PM Darafei "Komяpa" Praliaskouski
 wrote:
>
> On Sat, Nov 17, 2018 at 8:51 AM Adam Brusselback  
> wrote:
>>
>> >  I don't know how much what I write on this thread is read by others or
>> how useful this is for others who are following this work
>>
>> I've been following this thread and many others like it, silently soaking it 
>> up, because I don't feel like i'd have anything useful to add in most cases. 
>> It is very interesting seeing the development take place though, so just 
>> know it's appreciated at least from my perspective.
>
> I'm also following the development and have hopes about it going forward. Not 
> much low-level details I can comment on though :)
>
> In PostGIS workloads, UPDATE table SET geom = ST_CostyFunction(geom, 
> magicnumber); is one of biggest time-eaters that happen upon initial load and 
> clean up of your data. It is commonly followed by CLUSTER table using 
> table_geom_idx; to make sure you're back at full speed and no VACUUM is 
> needed, and your table (usually static after that) is more-or-less spatially 
> ordered. I see that zheap can remove the need for VACUUM, which is a big win 
> already. If you can do something that will allow reorder of tuples according 
> to index happen during an UPDATE that rewrites most of table, that would be a 
> game changer :)
>

If the tuples are already in the order of the index, then we would
retain that order; otherwise, we might not want to do anything special
for ordering w.r.t. the index.  I think this is important as we are not
sure of the user's intention, and I guess it won't be easy to do such a
rearrangement during an UPDATE statement.

> Another story is Visibility Map and Index-Only Scans. Right now there is a 
> huge gap between the insert of rows and the moment they are available for 
> index only scan, as VACUUM is required. Do I understand correctly that for 
> zheap this all can be inverted, and UNDO can become "invisibility map" that 
> may be quite small and discarded quickly?
>

Yeah, eventually that is our goal with the help of delete-marking in
indexes; however, for the first version, we still need to rely on
visibility maps for index-only scans.

Thank you for showing interest in this work.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: BUG #15449: file_fdw using program cause exit code error when using LIMIT

2018-11-18 Thread Etsuro Fujita

(2018/11/17 8:10), Tom Lane wrote:

I wrote:

I'm not quite sure whether the reached_eof test should be
"if (bytesread == 0)" or "if (bytesread<  maxread)".  The former
seems more certain to indicate real EOF; are there other cases where
the fread might return a short read?  On the other hand, if we
support in-band EOF indicators (\. for instance) then we might
stop without having made an extra call to CopyGetData that would
see bytesread == 0.  On the third hand, if we do do that it's
not quite clear to me whether SIGPIPE is ignorable or not.


After still further thought, it seems like "if (bytesread == 0)"
is the right way to proceed.  That protects us against incorrectly
setting reached_eof if the pipe is being run in unbuffered or
line-buffered mode.  (Which copy.c doesn't do, at the moment,
but I'd just as soon this code didn't need to assume that.
Also, I'm not 100% convinced that Windows or other platforms
might not act that way.)



So I think this version is probably good, although maybe it could
use an additional comment explaining the above reasoning.


I agree that it's better to keep the BeginCopyFrom API as-is.  Also, I 
think your version would handle SIGPIPE in COPY FROM PROGRAM more 
properly than what I proposed.  So, +1 from me.
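
For readers following along, the guard being discussed is tiny; a
sketch in the style of copy.c's read path (the helper name here is
illustrative, not the committed code):

#include "postgres.h"

/*
 * Only a zero-length fread() result is taken as real EOF; a short read
 * is not, so a pipe running in unbuffered or line-buffered mode cannot
 * set reached_eof prematurely.
 */
static size_t
read_from_program(FILE *fp, void *databuf, size_t maxread, bool *reached_eof)
{
	size_t		bytesread = fread(databuf, 1, maxread, fp);

	if (ferror(fp))
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not read from COPY program: %m")));
	if (bytesread == 0)
		*reached_eof = true;
	return bytesread;
}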


Thanks!

Best regards,
Etsuro Fujita



Re: Function to promote standby servers

2018-11-18 Thread Michael Paquier
On Fri, Oct 26, 2018 at 01:51:24PM +0900, Michael Paquier wrote:
> Or we could use "the function returns true immediately after initiating
> the promotion by sending the promotion signal to the postmaster"?  As a
> native speaker which one feels more natural to you?

Sorry for the time it took.  After pondering on it, I have committed the
improvements from your version.
--
Michael




Re: Accounting of zero-filled buffers in EXPLAIN (BUFFERS)

2018-11-18 Thread Kyotaro HORIGUCHI
At Sat, 17 Nov 2018 11:15:54 -0300, Alvaro Herrera  
wrote in <20181117141554.4dkx2u4j6md3bqdh@alvherre.pgsql>
> Is this patch committable now?

I don't think so. We should make a decision on one point.

I was a bit confused (sorry), but IIUIC Haribabu suggested that
the RBM_ZERO_ON_ERROR case should be counted as a read (not just
ignored). I'm on it.  I think that the RBM_ZERO_* cases other than
ON_ERROR could be counted as a kind of hit, but I don't insist on it.

So, I think we should decide on at least the ON_ERROR case before
this becomes committable.

The other counter could be a separate issue. I don't object to adding
the counter, but I'm not sure what the suitable name would be.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [PATCH] XLogReadRecord returns pointer to currently read page

2018-11-18 Thread Michael Paquier
On Mon, Nov 19, 2018 at 12:28:06PM +0900, Kyotaro HORIGUCHI wrote:
> Yeah, it is incorrect in some sense, but the comment was
> suggesting the lifetime of the pointed record. So I'd like to
> have a comment like this instead.

I think that we can still live without it, as it is not the business of
this routine to be careful about how the lifetime of a read record is
handled; that's part of the internals of XLogReader.
--
Michael




Re: [HACKERS] Restricting maximum keep segments by repslots

2018-11-18 Thread Michael Paquier
On Fri, Oct 26, 2018 at 11:26:36AM +0900, Kyotaro HORIGUCHI wrote:
> The reason for doing that in the function is that it can also happen
> for physical replication when a walsender is active but far
> behind. The removed(renamed)-but-still-open segment may be
> recycled and overwritten while being read, and that will be
> caught by page/record validation. It is substantially lost in
> that sense.  I don't think the strictness is useful for anything.

I was just coming by to look a bit at the patch series, and bumped
into that:

> + /*
> +  * checkpoint can remove the segment currently looking for.  make sure the
> +  * current segment still exists. We check this only once per record.
> +  */
> + XLByteToSeg(targetPagePtr, targetSegNo, state->wal_segment_size);
> + if (targetSegNo <= XLogGetLastRemovedSegno())
> + 	ereport(ERROR,
> + 			(errcode(ERRCODE_NO_DATA),
> + 			 errmsg("WAL record at %X/%X no longer available",
> + 					(uint32) (RecPtr >> 32), (uint32) RecPtr),
> + 			 errdetail("The segment for the record has been removed.")));
> + 

ereport should not be called within xlogreader.c as a base rule:
 *  This file is compiled as both front-end and backend code, so it
 *  may not use ereport, server-defined static variables, etc.
--
Michael




Re: Postgres, fsync, and OSs (specifically linux)

2018-11-18 Thread Thomas Munro
On Fri, Nov 9, 2018 at 9:03 AM Thomas Munro
 wrote:
> On Fri, Nov 9, 2018 at 7:07 AM Robert Haas  wrote:
> > On Wed, Nov 7, 2018 at 9:41 PM Thomas Munro
> >  wrote:
> > > My plan is do a round of testing and review of this stuff next week
> > > once the dust is settled on the current minor releases (including
> > > fixing a few typos I just spotted and some word-smithing).  All going
> > > well, I will then push the resulting patches to master and all
> > > supported stable branches, unless other reviews or objections appear.

...

On Sun, Nov 18, 2018 at 3:20 PM Thomas Munro
 wrote:
> I think these patches are looking good now.  If I don't spot any other
> problems or hear any objections, I will commit them tomorrow-ish.

Hearing no objections, pushed to all supported branches.

Thank you to Craig for all his work getting to the bottom of this, to
Andres for his open source diplomacy, and the Linux guys for their
change "errseq: Always report a writeback error once" which came out
of that.

Some more comments:

* The promotion of errors from close() to PANIC may or may not be
effective considering that it doesn't have interlocking with
concurrent checkpoints.  I'm not sure if it can really happen on local
file systems anyway... this may fall under the category of "making
PostgreSQL work reliably on NFS", a configuration that is not
recommended currently, and a separate project IMV.

* In 9.4 and 9.5 there is no checking of errors from
sync_file_range(), and I didn't add any for now.  It was claimed that
sync_file_range() without BEFORE/AFTER can't consume errors[1].
Errors are promoted in 9.6+ for consistency because we already looked
at the return code, so we won't rely on that knowledge in the long
term.

* I personally believe it is safe to run with data_sync_retry = on on
any file system on FreeBSD, and ZFS on any operating system... but I
see no need to make recommendations about that in the documentation,
other than that you should investigate the behaviour of your operating
system if you really want to turn it on.

* A PANIC (and possibly ensuing crash restart loop if the I/O error is
not transient) is of course a very unpleasant failure mode, but it is
one that we already had for the WAL and control file.  So I'm not sure
I'd personally bother to run with the non-default setting even on a
system where I believe it to be safe (considering the low likelihood
that I/O failure is isolated to certain files); at best it probably
gives you a better experience if the fs underneath a non-default
tablespace dies.

* The GUC is provided primarily because this patch is so drastic in
its effect that it seems like we owe our users a way to disable it on
principle, and that seems to outweigh a desire not to add GUCs in
back-branches.
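
Concretely, the escape hatch is a single postgresql.conf line (shown
here for illustration; the new, stricter behavior remains the default):

    data_sync_retry = on	# default is off: PANIC when fsync reports failure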

* If I/O errors happen, your system is probably toast and you need to
fail over or restore from backups, but at least we won't tell you any
lies about checkpoints succeeding.  In rare scenarios, perhaps
involving a transient failure of virtualised storage with thin
provisioning as originally described by Craig, the system may actually
be able to continue running, and with this change we should now be
able to avoid data loss by recovering from the WAL.

* As noted in the commit message, this isn't quite the end of the story.
See the fsync queue redesign thread[2], WIP for master only.

[1] 
https://www.postgresql.org/message-id/20180430160945.5s5qfoqryhtmugxl%40alap3.anarazel.de
[2] 
https://www.postgresql.org/message-id/flat/CAEepm=2gTANm=e3ARnJT=n0h8hf88wqmazxk0jykxw+b21f...@mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com



Re: WIP: Avoid creation of the free space map for small tables

2018-11-18 Thread Amit Kapila
On Mon, Nov 19, 2018 at 7:30 AM John Naylor  wrote:
>
> On 11/16/18, Amit Kapila  wrote:
> > +/* Status codes for the local map. */
> > +#define FSM_LOCAL_ZERO 0x00 /* Beyond the end of the relation */
> > +#define FSM_LOCAL_AVAIL 0x01 /* Available to try */
> > +#define FSM_LOCAL_TRIED 0x02 /* Already tried, not enough space */
> >
> > Instead of maintaining three states, can't we do with two states
> > (Available and Not Available), basically combine 0 and 2 in your case.
> > I think it will save some cycles in
> > fsm_local_set, where each time you need to initialize all the entries
> > in the map.  I think we can argue that it is not much overhead, but I
> > think it is better code-wise also if we can make it happen with fewer
> > states.
>
> That'd work too, but let's consider this scenario: We have a 2-block
> table that has no free space. After trying each block, the local cache
> looks like
>
> 0123
> TT00
>
> Let's say we have to wait to acquire a relation extension lock,
> because another backend had already started extending the heap by 1
> block. We call GetPageWithFreeSpace() and now the local map looks like
>
> 0123
> TTA0
>
> By using bitwise OR to set availability, the already-tried blocks
> remain as they are. With only 2 states, the map would look like this
> instead:
>
> 0123
> AAAN
>

I expect the below part of the code to go away.
+fsm_local_set(Relation rel, BlockNumber nblocks)
{
..
+ /*
+ * If the blkno is beyond the end of the relation, the status should
+ * be zero already, but make sure it is.  If the blkno is within the
+ * relation, mark it available unless it's already been tried.
+ */
+ for (blkno = 0; blkno < HEAP_FSM_CREATION_THRESHOLD; blkno++)
+ {
+ if (blkno < nblocks)
+ FSMLocalMap[blkno] |= FSM_LOCAL_AVAIL;
+ else
+ FSMLocalMap[blkno] = FSM_LOCAL_ZERO;
+ }
..
}

In my mind for such a case it should look like below:
0123
NNAN

> If we assume that an insert into the newly-created block 2 will almost
> always succeed, we don't have to worry about wasting time re-checking
> the first 2 full blocks. Does that sound right to you?
>

As explained above, such a situation won't exist.

>
> > Some assorted comments:
> > 1.
> >  
> > -Each heap and index relation, except for hash indexes, has a Free Space
> > Map
> > +Each heap relation, unless it is very small, and each index relation,
> > +except for hash indexes, has a Free Space Map
> >  (FSM) to keep track of available space in the relation. It's stored
> >
> > It appears that line has ended abruptly.
>
> Not sure what you're referring to here.
>

There is a space after "has a Free Space Map   ", so you can combine it with the next line.

> > 2.
> > page = BufferGetPage(buffer);
> > + targetBlock = BufferGetBlockNumber(buffer);
> >
> >   if (!PageIsNew(page))
> >   elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
> > - BufferGetBlockNumber(buffer),
> > + targetBlock,
> >   RelationGetRelationName(relation));
> >
> >   PageInit(page, BufferGetPageSize(buffer), 0);
> > @@ -623,7 +641,18 @@ loop:
> >   * current backend to make more insertions or not, which is probably a
> >   * good bet most of the time.  So for now, don't add it to FSM yet.
> >   */
> > - RelationSetTargetBlock(relation, BufferGetBlockNumber(buffer));
> > + RelationSetTargetBlock(relation, targetBlock);
> >
> > Is this related to this patch? If not, I suggest let's do it
> > separately if required.
>
> I will separate this out.
>
> > 3.
> >  static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
> > -uint8 newValue, uint8 minValue);
> > + uint8 newValue, uint8 minValue);
> >
> > This appears to be a spurious change.
>
> It was intentional, but I will include it separately as above.
>
> > 4.
> > @@ -378,24 +386,15 @@ RelationGetBufferForTuple(Relation relation, Size
> > len,
> >   * target.
> >   */
> >   targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
> > +
> > + /*
> > + * In case we used an in-memory map of available blocks, reset
> > + * it for next use.
> > + */
> > + if (targetBlock < HEAP_FSM_CREATION_THRESHOLD)
> > + ClearLocalMap();
> >
> > How will you clear the local map during error?  I think you need to
> > clear it in abort path and you can name the function as
> > FSMClearLocalMap or something like that.
>
> That sounds right, and I will rename the function that way. For the
> abort path, were you referring to this or somewhere else?
>
> if (!PageIsNew(page))
> elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
>  targetBlock,
>  RelationGetRelationName(relation));
>

I think it might come from any other place between when you set it and
before it gets cleared (like any intermediate buffer- and pin-related
APIs).

> > 5.
> > +/*#define TRACE_TARGETBLOCK */
> >
> > Debugging leftover, do you want to retain this and related stuff
> > during the development of patch?
>
> I modeled this after TRACE_VISIBILITYMAP in visibilitymap.c. It's
> useful for dev

Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query

2018-11-18 Thread Haribabu Kommi
On Mon, Nov 19, 2018 at 1:37 PM Alvaro Herrera 
wrote:

> On 2018-Nov-19, Michael Paquier wrote:
>
> > On Mon, Nov 19, 2018 at 10:41:22AM +1100, Haribabu Kommi wrote:
> > > So 6 new functions need to be added to cover all the above cases;
> > > IMO, we may need functions for all combinations, because I feel some
> > > user may have the requirement of a left-out combination, in case we
> > > choose only some combinations.
> >
> > That's bloating the interface in my opinion.
>
> I understand.
>
> Let's call for a vote from a larger audience.  It's important to get
> this interface right, ISTM.
>

Amit suggested another option in another mail, so the total viable
solutions discussed as of now are:

1. Single API where NULL input is treated as an invalid value
2. Multiple APIs to deal with NULL input for the other values
3. Single API where NULL values are treated as the current user, current
   database and a NULL queryid
4. Single API with -1 as the invalid value, treating NULL as "no match".
   (The only problem with this approach is that, until now, -1 has also
   been a valid queryid, so setting -1 as a queryid needs to be avoided.)

I prefer a single API. I can go with either 3 or 4.

Opinions from others?

Regards,
Haribabu Kommi
Fujitsu Australia


Re: [PATCH] XLogReadRecord returns pointer to currently read page

2018-11-18 Thread Andrey Lepikhov



On 16.11.2018 11:23, Michael Paquier wrote:

On Thu, Nov 15, 2018 at 06:12:38PM +0900, Kyotaro HORIGUCHI wrote:

This patch eliminates unnecessary copying that was done for
non-continued records. Now the return value of XLogReadRecord
directly points into the page buffer held in XLogReaderState. It is
safe because no caller site uses the returned pointer beyond the
replacement of buffer content at the next call to the same
function.


I was looking at this patch, and shouldn't we worry about compatibility
with plugins or utilities which look directly at what's in readRecordBuf
for the record contents?  Let's not forget that the contents of
XLogReaderState are public.


Based on my experience, I have clarified some comments to help avoid
such mistakes in the future (see attachment).


--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company
>From 7de252405ef5d5979fe2711515c0c6402abc0e06 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov" 
Date: Mon, 19 Nov 2018 07:08:28 +0500
Subject: [PATCH] Some clarifications on readRecordBuf comments

---
 src/backend/access/transam/xlogreader.c | 4 ++--
 src/include/access/xlogreader.h | 6 +-
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index c5e019bf77..88d0bcf48a 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -209,8 +209,8 @@ allocate_recordbuf(XLogReaderState *state, uint32 reclength)
  * If the reading fails for some other reason, NULL is also returned, and
  * *errormsg is set to a string with details of the failure.
  *
- * The returned pointer (or *errormsg) points to an internal buffer that's
- * valid until the next call to XLogReadRecord.
+ * The returned pointer (or *errormsg) points to an internal read-only buffer
+ * that's valid until the next call to XLogReadRecord.
  */
 XLogRecord *
 XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 40116f8ecb..0837625b95 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -185,7 +185,11 @@ struct XLogReaderState
 	 */
 	TimeLineID	nextTLI;
 
-	/* Buffer for current ReadRecord result (expandable) */
+	/*
+	 * Buffer for current ReadRecord result (expandable).
+	 * Used in the case that the current ReadRecord result doesn't fit on
+	 * the currently read WAL page.
+	 */
 	char	   *readRecordBuf;
 	uint32		readRecordBufSize;
 
-- 
2.17.1



Re: In-place updates and serializable transactions

2018-11-18 Thread Amit Kapila
On Fri, Nov 16, 2018 at 4:07 AM Kevin Grittner  wrote:
>
> On Thu, Nov 15, 2018 at 3:03 AM Kuntal Ghosh  
> wrote:
>
> > The test multiple-row-versions is failing because of the
> > above-discussed scenario. I've attached the regression diff file and
> > the result output file for the same. Here is a brief summary of the
> > test w.r.t. heap:
> >
> > Step 1: T1-> BEGIN; Read FROM t where id=100;
> > Step 2: T2-> BEGIN; UPDATE t where id=100; COMMIT; (creates T1->T2)
> > Step 3: T3-> BEGIN; UPDATE t where id=100; Read FROM t where id=50;
> > Step 4: T4-> BEGIN; UPDATE t where id= 50; Read FROM t where id=1;
> > COMMIT;  (creates T3->T4)
> > Step 5: T3-> COMMIT;
> > Step 6: T1-> UPDATE t where id=1; COMMIT;  (creates T4->T1,)
> >
> > At step 6, when the update statement is executed, T1 is rolled back
> > because of T3->T4->T1.
> >
> > But for zheap, step 3 also creates a dependency T1->T3 because of
> > in-place update. When T4 commits in step 4, it marks T3 as doomed
> > because of T1 --> T3 --> T4. Hence, in step 5, T3 is rolled back.
>
> If I understand this, no permutation (order of execution of the
> statements in a set of concurrent transactions vulnerable to
> serialization anomalies) which have succeeded with the old storage
> engine now fail with zheap; what we have with zheap is an earlier
> failure in one case.  More importantly, zheap doesn't create any false
> negatives (cases where a serialization anomaly is missed).
>

Your understanding is correct.  Thanks for sharing your feedback.

> I would say this should be considered a resounding success.  We should
> probably add an alternative result file to cover this case, but
> otherwise I don't see anything which requires action.
>
> Congratulations on making this work so well!
>

Thanks.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: In-place updates and serializable transactions

2018-11-18 Thread Kuntal Ghosh
On Fri, Nov 16, 2018 at 4:07 AM Kevin Grittner  wrote:
>
> I would say this should be considered a resounding success.  We should
> probably add an alternative result file to cover this case, but
> otherwise I don't see anything which requires action.
>
As of now, we've added an extra expected file for the same.

> Congratulations on making this work so well!
>
Thanks for developing the serializable module in a nice decoupled architecture.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: NOTIFY and pg_notify performance when deduplicating notifications

2018-11-18 Thread Julien Demoor

On 10/10/2018 19:42, Catalin Iacob wrote:

On Tue, Oct 9, 2018 at 2:17 PM  wrote:

I just caught an error in my patch, it's fixed in the attachment. The
'never' and 'maybe' collapse modes were mixed up in one location.


Here's a partial review of this version, did not read the doc part
very carefully.

First of all, I agree that this is a desirable feature as, for a large
number of notifications, the O(n^2) overhead quickly becomes very
noticeable.

I would expect the collapse mode to be an enum which is created from
the string early on during parsing and used for the rest of the code.
Instead the string is used all the way leading to string comparisons
in the notification dispatcher and to the need of hardcoding special
strings in various places, including the contrib module.

This comment in the beginning of async.c should also be updated:
*   Duplicate notifications from the same transaction are sent out as one
*   notification only. This is done to save work when for example a trigger

pg_notify_3args duplicates pg_notify, I would expect a helper function
to be extracted and called from both.

There are braces placed on the same line as the if, for example if
(strlen(collapse_mode) != 0) { which seems to not be the project's
style.


Thank you for the review. I've addressed all your points in the attached 
patch. The patch was made against release 11.1.


I couldn't find a way to make a good helper function for pg_notify_3args
and pg_notify; I hope my proposed solution is acceptable.
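
For illustration, C code that must not have its notifications collapsed
would presumably call the three-argument Async_Notify() added by the
patch; a sketch (the NOTIFY_COLLAPSE_MODE_NEVER constant name is assumed
from the patch's NOTIFY_COLLAPSE_MODE_MAYBE below):

#include "postgres.h"
#include "commands/async.h"

/* Sketch: emit a notification that must never be deduplicated. */
static void
notify_job_done(const char *payload)
{
	Async_Notify("jobs_done", payload, NOTIFY_COLLAPSE_MODE_NEVER);
}
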
diff --git a/contrib/tcn/tcn.c b/contrib/tcn/tcn.c
index 0c274322bd..1494a35a5a 100644
--- a/contrib/tcn/tcn.c
+++ b/contrib/tcn/tcn.c
@@ -161,7 +161,7 @@ triggered_change_notification(PG_FUNCTION_ARGS)
 				strcpy_quoted(payload, SPI_getvalue(trigtuple, tupdesc, colno), '\'');
}
 
-   Async_Notify(channel, payload->data);
+   Async_Notify(channel, payload->data, NOTIFY_COLLAPSE_MODE_MAYBE);
}
ReleaseSysCache(indexTuple);
break;
diff --git a/doc/src/sgml/ref/notify.sgml b/doc/src/sgml/ref/notify.sgml
index e0e125a2a2..96e0d7a990 100644
--- a/doc/src/sgml/ref/notify.sgml
+++ b/doc/src/sgml/ref/notify.sgml
@@ -21,7 +21,8 @@ PostgreSQL documentation
 
  
 
-NOTIFY channel [ , payload ]
+NOTIFY channel [ , payload [ , collapse_mode ] ]
+
 
  
 
@@ -93,20 +94,6 @@ NOTIFY channel 
[ , 
 
-  
-   If the same channel name is signaled multiple times from the same
-   transaction with identical payload strings, the
-   database server can decide to deliver a single notification only.
-   On the other hand, notifications with distinct payload strings will
-   always be delivered as distinct notifications. Similarly, notifications from
-   different transactions will never get folded into one notification.
-   Except for dropping later instances of duplicate notifications,
-   NOTIFY guarantees that notifications from the same
-   transaction get delivered in the order they were sent.  It is also
-   guaranteed that messages from different transactions are delivered in
-   the order in which the transactions committed.
-  
-
   
It is common for a client that executes NOTIFY
to be listening on the same notification channel itself.  In that case
@@ -121,6 +108,41 @@ NOTIFY channel [ , 
+
+
+
+
+
+  Ordering and collapsing of notifications
+
+  
+   If the same channel name is signaled multiple times from the same
+   transaction with identical payload strings, the
+   database server can decide to deliver a single notification only,
+   when the value of the collapse_mode parameter is 
+   'maybe' or '' (the empty string).
+
+   If the 'never' collapse mode is specified, the server will
+   deliver all notifications, including duplicates. Turning off deduplication 
+   in this way can considerably speed up transactions that emit large numbers 
+   of notifications.
+   
+   Removal of duplicate notifications takes place within transaction block,
+   finished with COMMIT, END or SAVEPOINT.
+  
+
+  
+   Notifications with distinct payload strings will
+   always be delivered as distinct notifications. Similarly, notifications from
+   different transactions will never get folded into one notification.
+   Except for dropping later instances of duplicate notifications,
+   NOTIFY guarantees that notifications from the same
+   transaction get delivered in the order they were sent.  It is also
+   guaranteed that messages from different transactions are delivered in
+   the order in which the transactions committed.
+  
+
+
  
 
  
@@ -147,6 +169,16 @@ NOTIFY channel [ , 
 

+   
+collapse_mode
+
+ 
+  The collapse mode to apply when identical notifications are issued within
+  a transaction. The acceptable values are 'maybe' (the 
+  default) and 'never'.
+ 
+
+   
   
  
 
@@ -190,6 +222