Re: Rework manipulation and structure of attribute mappings

2019-12-04 Thread Michael Paquier
On Mon, Nov 25, 2019 at 05:55:50PM +0900, Amit Langote wrote:
> Actually, I was just suggesting that we create a new function
> convert_tuples_by_position_map() and put the logic that compares the
> two TupleDescs to create the AttrMap in it, just like
> convert_tuples_by_name_map().  Now you could say that there would be
> no point in having such a function, because no caller currently wants
> to use such a map (that is, without the accompanying
> TupleConversionMap), but maybe there will be in the future.  Though
> irrespective of that consideration, I guess you'd agree to group
> similar code in a single source file.

Hmm.  I would rather keep the attribute map generation and the tuple
conversion part, the latter depending on the former, into two
different files.  That's what I did in the v2 attached.

> As it's mainly just moving around code, I gave it a shot; patch
> attached (applies on top of yours).  I haven't invented any new names
> yet, but let's keep discussing that as you say.

I see.  That saved me some time, thanks.  It is not really intuitive to
keep names referring to tuple conversion for the routines moved from
tupconvert.c to attmap.c, so I would rename those routines as well, say
build_attrmap_by_name/position instead.  That's more consistent with
the rest I added.

Another thing is that we have duplicated code at the end of
build_attrmap_by_name_if_req and build_attrmap_by_position, which I
think would be better refactored as a static function in
attmap.c.  This way the if_req() flavor gets much simpler.
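For illustration, the shared helper could boil down to something like the
rough sketch below (a simplified guess at the idea, not the patch's actual
code; the AttrMap fields and the helper name here are assumptions):

#include "postgres.h"

#include "access/attmap.h"      /* new header added by the attached patch */
#include "access/tupdesc.h"

/*
 * Hypothetical sketch: return true when the computed map is a no-op, so
 * callers such as the if_req() flavor can free the map and skip conversion.
 * A real version would also need to consider dropped and missing columns.
 */
static bool
attrmap_is_identity(TupleDesc indesc, TupleDesc outdesc, AttrMap *attrMap)
{
    if (indesc->natts != outdesc->natts)
        return false;

    for (int i = 0; i < attrMap->maplen; i++)
    {
        /* Any non-identity entry means a real conversion is needed. */
        if (attrMap->attnums[i] != (AttrNumber) (i + 1))
            return false;
    }
    return true;
}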

I have also fixed the issue with the FK mapping in
addFkRecurseReferencing() you reported previously.

What do you think about that?  I would like to think that we are
getting to something rather committable here, though I feel that I
need to review the comments around the new files some more and see if
we could document AttrMap and its properties better.
--
Michael
From 9b7414a8df8da37147370541d92c4c18f9eec7ac Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Wed, 4 Dec 2019 17:00:26 +0900
Subject: [PATCH v2] Rework attribute mappings

---
 src/backend/access/common/Makefile |   1 +
 src/backend/access/common/attmap.c | 320 +
 src/backend/access/common/tupconvert.c | 287 ++
 src/backend/catalog/index.c|  12 +-
 src/backend/catalog/partition.c|  11 +-
 src/backend/catalog/pg_constraint.c|   1 -
 src/backend/commands/indexcmds.c   |  16 +-
 src/backend/commands/tablecmds.c   | 100 ---
 src/backend/executor/execMain.c|  22 +-
 src/backend/executor/execPartition.c   |  57 ++--
 src/backend/jit/llvm/llvmjit_expr.c|   1 -
 src/backend/parser/parse_utilcmd.c |  18 +-
 src/backend/replication/logical/relation.c |  10 +-
 src/backend/replication/logical/worker.c   |   9 +-
 src/backend/rewrite/rewriteManip.c |  12 +-
 src/include/access/attmap.h|  51 
 src/include/access/tupconvert.h|  13 +-
 src/include/catalog/index.h|   2 +-
 src/include/parser/parse_utilcmd.h |   3 +-
 src/include/replication/logicalrelation.h  |   2 +-
 src/include/rewrite/rewriteManip.h |   3 +-
 21 files changed, 531 insertions(+), 420 deletions(-)
 create mode 100644 src/backend/access/common/attmap.c
 create mode 100644 src/include/access/attmap.h

diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index 6c9c6f3256..fd74e14024 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -13,6 +13,7 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = \
+	attmap.o \
 	bufmask.o \
 	detoast.o \
 	heaptuple.o \
diff --git a/src/backend/access/common/attmap.c b/src/backend/access/common/attmap.c
new file mode 100644
index 00..33ffef9039
--- /dev/null
+++ b/src/backend/access/common/attmap.c
@@ -0,0 +1,320 @@
+/*-
+ *
+ * attmap.c
+ *	  Attribute mapping support.
+ *
+ * This file provides utility routines to build and manage attribute mappings
+ * by comparing input and output TupleDescs.  Such mappings are typically used
+ * by DDL operating on inheritance and partition trees to convert fully
+ * transformed expression trees from parent rowtype to child rowtype or
+ * vice-versa.  They are also used by the tuple conversion routines in
+ * tupconvert.c.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/common/attmap.c
+ *
+ *-
+ */
+
+#include "postgres.h"
+
+#include "access/attmap.h"
+#include "access/htup_details.h"
+#include "utils/builtins.h"
+
+
+static bool check_attrmap_match(TupleDesc indesc,
+TupleDesc outdesc,
+

Re: [PATCH] Fix PostgreSQL 12.1 server build and install problems under MSYS2

2019-12-04 Thread Michael Paquier
On Wed, Dec 04, 2019 at 10:46:39AM +0300, Guram Duka wrote:
> I made a patch fixing build and install problems under MSYS2, including
> llvmjit.
> 
> I have tested this in my environment and it works, of course need more
> extensive testing.
> Attached is a patch that fixes it. Tag REL_12_1.

Do you have the same problems if you compile the code from the latest
state of the master branch?

Could you register this patch to the upcoming commit fest?  Here is a
link to it:
https://commitfest.postgresql.org/26/

Thanks,
--
Michael




Re: Update minimum SSL version

2019-12-04 Thread Peter Eisentraut

On 2019-12-03 12:44, Magnus Hagander wrote:
> On Tue, Dec 3, 2019 at 12:09 PM Michael Paquier wrote:
>> On Tue, Dec 03, 2019 at 10:10:57AM +0100, Magnus Hagander wrote:
>>> Is 1.0.1 considered a separate major from 1.0.0, in this reasoning?
>>> Because while retiring 1.0.0 should probably not be that terrible,
>>> 1.0.1 is still in very widespread use on most long term supported
>>> distributions.
>>
>> 1.0.1 and 1.0.0 are two different major releases in the OpenSSL world,
>> so my suggestion would be to cut support for everything which does not
>> have TLSv1.2, meaning that we keep compatibility with 1.0.1 for
>> a longer period.
>
> Good, that's what I thought you meant :) And that makes it sound like a
> working plan to me.


This would mean we'd stop support for RHEL 5, which is probably OK, 
seeing that even the super-extended support ends in November 2020.


Dropping RHEL 5 would also allow us to drop support for Python 2.4, 
which is something I've been itching to do. ;-)


In both of these cases, maintaining support for all these ancient 
versions is a significant burden IMO, so it would be good to clean up 
the tail end a bit.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pg_upgrade fails with non-standard ACL

2019-12-04 Thread Michael Paquier
On Wed, Dec 04, 2019 at 12:17:25PM +0900, Arthur Zakirov wrote:
> I updated the patch. It generates "revoke_objects.sql" (similar to the v3 patch)
> now and doesn't rely on the --check option. It also still logs a FATAL message,
> because it seems pg_upgrade should stop here since it fails later if there
> are objects with changed identities.

(I haven't looked at the patch yet, sorry!)

FWIW, I am not much a fan of that part, because the output generated by
the object description is most likely not compatible with the grammar
supported by REVOKE.

In order to make the review easier, and to test for all the patterns
we need to cover, I have an evil idea though.  Could you write a
dummy, still simple patch for HEAD which introduces
backward-incompatible changes for all the object types we want to
stress?  Then, by having ACLs on the source server which are artificially
broken on the target server, we can make sure that the queries we have
are right and that they report the objects we are looking for.
--
Michael




Re: Update minimum SSL version

2019-12-04 Thread Michael Paquier
On Wed, Dec 04, 2019 at 09:10:04AM +0100, Peter Eisentraut wrote:
> This would mean we'd stop support for RHEL 5, which is probably OK, seeing
> that even the super-extended support ends in November 2020.

Sounds like a plan.  I can work on the OpenSSL part, if you need help
of course.  And if others don't object to doing that.  Of course.

> Dropping RHEL 5 would also allow us to drop support for Python 2.4, which is
> something I've been itching to do. ;-)
> 
> In both of these cases, maintaining support for all these ancient versions
> is a significant burden IMO, so it would be good to clean up the tail end a
> bit.

Good to know.
--
Michael




Re: progress report for ANALYZE

2019-12-04 Thread Tatsuro Yamada

Hi Amit-san,

Thanks for your comments!


Attached is the revised patch. :)

I wonder about two things below. What do you think?

1)
For now, I'm not sure whether current_child_table_relid should be set to zero
when the current phase changes from "acquiring inherited sample rows" to
"computing stats". See below.


In the upthread discussion [1], Robert asked to *not* do such things,
that is, resetting some values due to phase change.  I'm not sure his
point applies to this case too though.


Yeah, understood.
I'll check the target relid of "computing stats" by re-reading the code of
the ANALYZE command later. :)

 

2)
In the case of partitioned tables, there are many "finalizing analyze"
phases, one per relid. Would it be better to fix the documentation, or
would it be better to reduce them to one?


-
   finalizing analyze
   
 The command is updating pg_class. When this phase is completed,
 ANALYZE will end.
-


When a partitioned table is analyzed, its partitions are analyzed too.
So, the ANALYZE command effectively runs N + 1 times if there are N
partitions -- first analyze partitioned table to collect "inherited"
statistics by collecting row samples using
acquire_inherited_sample_rows(), then each partition to collect its
own statistics.  Note that this recursive application of ANALYZE to
partitions (child tables) only occurs for partitioned tables, not for
legacy inheritance.


Thanks for your explanation.
I understand analyzing of partitioned tables a little better now.
Below is my understanding. Is it right?

==
In the case of a partitioned table (N = 3)

 - Partitioned table name: p (relid is 100)
 - Partition names: p1, p2, p3 (relids are 201, 202 and 203)

For now, we can get the following results by executing "analyze p;".

Num  Phase                            relid  current_child_table_relid
  1  acquiring inherited sample rows    100  201
  2  acquiring inherited sample rows    100  202
  3  acquiring inherited sample rows    100  203
  4  computing stats                    100  0
  5  finalizing analyze                 100  0

  6  acquiring sample rows              201  0
  7  computing stats                    201  0
  8  finalizing analyze                 201  0

  9  acquiring sample rows              202  0
 10  computing stats                    202  0
 11  finalizing analyze                 202  0

 12  acquiring sample rows              203  0
 13  computing stats                    203  0
 14  finalizing analyze                 203  0
==


Thanks,
Tatsuro Yamada





Re: Proposal: Add more compile-time asserts to expose inconsistencies.

2019-12-04 Thread Peter Eisentraut

On 2019-12-02 16:55, Andres Freund wrote:

>> +StaticAssertDecl(lengthof(LockTagTypeNames) == (LOCKTAG_ADVISORY + 1),
>> +                 "LockTagTypeNames array inconsistency");
>> +
>
> These error messages strike me as somewhat unhelpful. I'd probably just
> reword them as "array length mismatch" or something like that.


I'd prefer it if we could just get rid of the second argument and show 
the actual expression in the error message, the way run-time assertions work.
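(For illustration only, and not a concrete proposal for c.h: one C11 way to
get that effect is to stringify the condition itself, as in the sketch below;
the macro name and the shortened array are made up here.)

#define lengthof(array) (sizeof (array) / sizeof ((array)[0]))

/* Single-argument variant: the failure message is the expression itself. */
#define StaticAssertDeclExpr(condition) \
    _Static_assert(condition, #condition)

static const char *const LockTagTypeNames[] = {
    "relation", "extend", "page", "tuple"   /* shortened for illustration */
};

/* A compile-time failure would report the checked expression verbatim. */
StaticAssertDeclExpr(lengthof(LockTagTypeNames) == 4);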


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Yet another vectorized engine

2019-12-04 Thread Hubert Zhang
Thanks Konstantin for your detailed review!

On Tue, Dec 3, 2019 at 5:58 PM Konstantin Knizhnik <
k.knizh...@postgrespro.ru> wrote:

>
>
> On 02.12.2019 4:15, Hubert Zhang wrote:
>
>
> The prototype extension is at https://github.com/zhangh43/vectorize_engine
>
>
> I am very sorry, that I have no followed this link.
> Few questions concerning your design decisions:
>
> 1. Will it be more efficient to use native arrays in vtype instead of
> array of Datum? I think it will allow compiler to generate more efficient
> code for operations with float4 and int32 types.
> It is possible to use union to keep fixed size of vtype.


Yes, I'm also considering that.  When scanning a column store, the column
batch is loaded into a contiguous memory region.  For int32, the size of this
region is 4*BATCHSIZE, while for int16 the size is 2*BATCHSIZE.  So using a
native array would let us fill the vtype batch with a single memcpy.
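To make that concrete, a fixed-size vtype along those lines could look
roughly like the sketch below (only an illustration; BATCHSIZE and the field
names are assumptions, not the extension's current definitions):

#include "postgres.h"

#define BATCHSIZE 1024

/*
 * Sketch of a vectorized column batch: native element arrays in a union keep
 * sizeof(vtype) constant while letting the compiler generate tight loops for
 * the int32/float4 paths instead of going through Datum.
 */
typedef struct vtype
{
    int         dim;                    /* number of valid entries in the batch */
    bool        isnull[BATCHSIZE];      /* per-element null flags */
    union
    {
        int16       i16[BATCHSIZE];
        int32       i32[BATCHSIZE];
        int64       i64[BATCHSIZE];
        float4      f4[BATCHSIZE];
        float8      f8[BATCHSIZE];
        Datum       datums[BATCHSIZE];  /* fallback for by-reference types */
    }           values;
} vtype;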


> 2. Why VectorTupleSlot contains array (batch) of heap tuples rather than
> vectors (array of vtype)?
>

a. VectorTupleSlot stores the array of vtype in the tts_values field, which
reduces the code changes needed and lets us reuse functions like ExecProject.
Of course we could use a separate field to store vtypes.
b. VectorTupleSlot also contains an array of heap tuples, which is used for
heap tuple deforming.  In fact, the tuples in a batch may span many pages, so
we also need to pin an array of related pages instead of just one page.
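Pictured as a struct, the layout described in (a) and (b) would be roughly as
below (field names are guesses for illustration, not the extension's actual
definition):

#include "postgres.h"

#include "access/htup.h"
#include "executor/tuptable.h"
#include "storage/buf.h"

#define BATCHSIZE 1024

/*
 * Sketch: a slot whose tts_values entries each hold a whole vtype column
 * batch (so ExecProject and friends keep working), plus the raw heap tuples
 * and buffer pins backing the batch, since one batch may span several pages.
 */
typedef struct VectorTupleSlot
{
    TupleTableSlot  base;                   /* standard slot header */
    int             dim;                    /* tuples currently in the batch */
    HeapTupleData   tts_tuples[BATCHSIZE];  /* raw tuples, kept for deforming */
    Buffer          tts_buffers[BATCHSIZE]; /* pins on the pages they live on */
} VectorTupleSlot;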

> 3. Why you have to implement your own plan_tree_mutator and not using
> expression_tree_mutator?
>

I also want to replace plan nodes, e.g. Agg -> CustomScan (with a VectorAgg
implementation). expression_tree_mutator cannot be used to mutate plan nodes
such as Agg, am I right?


> 4. As far as I understand you now always try to replace SeqScan with your
> custom vectorized scan. But it makes sense only if there are quals for this
> scan or aggregation is performed.
> In other cases batch+unbatch just adds extra overhead, doesn't it?
>
Probably there is extra overhead for the heap format and a query like
'select i from t;' without quals, projection, or aggregation.
But with a column store, VectorScan could directly read a batch, with no
additional batching cost.  A column store is the better choice for OLAP queries.
Can we conclude that it would be better to use the vector engine for OLAP
queries and the row engine for OLTP queries?

> 5. Throwing and catching exception for queries which can not be vectorized
> seems to be not the safest and most efficient way of handling such cases.
> May be it is better to return error code in plan_tree_mutator and
> propagate this error upstairs?


Yes.  As for efficiency, another way is to enable some plan nodes to be
vectorized, leave other nodes non-vectorized, and add a batch/unbatch layer
between them (is this what you meant by "propagate this error upstairs"?).
As you mentioned, this could introduce additional overhead.  Are there any
other good approaches?
What do you mean by "not safest"?  PG_CATCH will receive the ERROR, and we
fall back to the original non-vectorized plan.


> 6. Have you experimented with different batch size? I have done similar
> experiments in VOPS and find out that tile size larger than 128 are not
> providing noticable increase of performance.
> You are currently using batch size 1024 which is significantly larger than
> typical amount of tuples on one page.
>

Good point, We will do some experiments on it.

> 7. How vectorized scan can be combined with parallel execution (it is
> already supported in 9.6, isn't it?)
>

We didn't implement it yet, but the idea is the same as in the non-parallel
case: copy the current parallel scan and implement a vectorized Gather,
keeping VectorTupleTableSlot as their interface.
Our basic idea is to reuse most of the current PG executor logic, make it
vectorized, and then tune performance gradually.

-- 
Thanks

Hubert Zhang


Re: pg_upgrade fails with non-standard ACL

2019-12-04 Thread Arthur Zakirov

On 2019/12/04 17:15, Michael Paquier wrote:
> On Wed, Dec 04, 2019 at 12:17:25PM +0900, Arthur Zakirov wrote:
>> I updated the patch. It generates "revoke_objects.sql" (similar to the v3 patch)
>> now and doesn't rely on the --check option. It also still logs a FATAL message,
>> because it seems pg_upgrade should stop here since it fails later if there
>> are objects with changed identities.
>
> (I haven't looked at the patch yet, sorry!)
>
> FWIW, I am not much a fan of that part, because the output generated by
> the object description is most likely not compatible with the grammar
> supported by REVOKE.


Ah, I thought that pg_identify_object() gives a properly quoted identity,
and that it could be used to build the SQL script.



> In order to make the review easier, and to test for all the patterns
> we need to cover, I have an evil idea though.  Could you write a
> dummy, still simple patch for HEAD which introduces
> backward-incompatible changes for all the object types we want to
> stress?  Then, by having ACLs on the source server which are artificially
> broken on the target server, we can make sure that the queries we have
> are right and that they report the objects we are looking for.


Sure! But I'm not sure that I understood the idea. Do you mean a patch
which adds a TAP test? It could initialize two clusters; in the first it
renames some system objects and changes their ACLs, and finally the test
runs pg_upgrade, which will fail.


--
Arthur




Unsigned 64 bit integer to numeric

2019-12-04 Thread Dmitry Dolgov
Hi,

Probably a simple question, but I don't see a simple answer so far. In
one extension I want to convert uint64 into a numeric to put it
eventually into a jsonb object. As far as I see in numeric.c there are
functions only for signed int64. Is there a way to achieve this with
uint64 (without duplicating significant part of numeric implementation
in the extension)?




Re: Simplify passing of configure arguments to pg_config

2019-12-04 Thread Peter Eisentraut

On 2019-12-03 06:03, Tom Lane wrote:

> Peter Eisentraut writes:
>> Currently, configure puts the configure args into the makefiles and
>> then have the makefiles pass them to the build of pg_config.  That looks
>> like an unnecessary redirection, and indeed that method was
>> put in place when pg_config was a shell script.  We can simplify that
>> by having configure put the value into pg_config.h directly.  This
>> also makes the standard build system match how the MSVC build system
>> already does it.
>
> I dunno, is this really an improvement?  It makes the handling of
> VAL_CONFIGURE different from every other one of the values passed
> into pg_config, and I don't see any countervailing addition of
> some other regularity.


The other values come from the makefiles, so we have to do it that way. 
The configure args come from configure, so why make them go through the 
makefile?  (PG_VERSION also comes in that way. ;-) )


There is also the weird difference with how the MSVC build system 
handles it.  It appends VAL_CONFIGURE to pg_config.h instead of passing 
it on the command line.



> I'm also a bit suspicious of the ad-hoc escaping step ...


Hmm, the current way doesn't handle embedded quotes at all, so perhaps 
this wouldn't be necessary.  But it would add some robustness.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Fix PostgreSQL 12.1 server build and install problems under MSYS2

2019-12-04 Thread Guram Duka
The master branch gets an error at the configure stage and then when
compiling, like the 12.1 branch.

checking how to link an embedded Python application... configure: error:
> could not find shared library for Python
> You might have to rebuild your Python installation.  Refer to the
> documentation for details.  Use --without-python to disable building
> PL/Python.
>
> I registered the patch.
>

On Wed, Dec 4, 2019 at 11:05, Michael Paquier wrote:

> On Wed, Dec 04, 2019 at 10:46:39AM +0300, Guram Duka wrote:
> > I made a patch fixing build and install problems under MSYS2, including
> > llvmjit.
> >
> > I have tested this in my environment and it works, of course need more
> > extensive testing.
> > Attached is a patch that fixes it. Tag REL_12_1.
>
> Do you have the same problems if you compile the code from the latest
> state of the master branch?
>
> Could you register this patch to the upcoming commit fest?  Here is a
> link to it:
> https://commitfest.postgresql.org/26/
>
> Thanks,
> --
> Michael
>


-- 
Best regards.
Guram Duka.


Re: could not stat promote trigger file leads to shutdown

2019-12-04 Thread Peter Eisentraut

On 2019-11-20 16:21, Tom Lane wrote:
>> AFAICT, a GUC check hook wouldn't actually be able to address the
>> specific scenario I described.  At the time the GUC is set, the
>> containing directory of the trigger file does not exist yet.  This
>> is currently not an error.  The problem only happens if after the GUC is
>> set the containing directory appears and is not readable.
>
> True, if the hook just consists of trying fopen() and checking the
> errno.  Would it be feasible to insist that the containing directory
> exist and be readable?  We have enough infrastructure that that
> should only take a few lines of code, so the question is whether
> or not that's a nicer behavior than we have now.


Is it possible to do this in a mostly bullet-proof way?  Just because 
the directory exists and looks pretty good otherwise, doesn't mean we 
can read a file created in it later in a way that doesn't fall afoul of 
the existing error checks.  There could be something like SELinux 
lurking, for example.


Maybe some initial checking would be useful, but I think we still need 
to downgrade the error check at use time a bit to not crash in the cases 
that we miss.
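(For concreteness, the kind of initial check discussed above might look
roughly like this sketch; the hook name is hypothetical and it only verifies
that the containing directory exists, so readability and the other corner
cases would still need the softer handling at use time.)

#include "postgres.h"

#include <sys/stat.h>

#include "utils/guc.h"

/* Hypothetical GUC check hook for promote_trigger_file. */
static bool
check_promote_trigger_file(char **newval, void **extra, GucSource source)
{
    char        parent[MAXPGPATH];
    struct stat st;

    if (*newval == NULL || (*newval)[0] == '\0')
        return true;            /* an unset value is always acceptable */

    /* Strip the file name to get the containing directory. */
    strlcpy(parent, *newval, MAXPGPATH);
    get_parent_directory(parent);

    if (stat(parent[0] ? parent : ".", &st) != 0 || !S_ISDIR(st.st_mode))
    {
        GUC_check_errdetail("Directory of \"%s\" does not exist or is not a directory.",
                            *newval);
        return false;
    }
    return true;
}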


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Does 'instead of delete' trigger support modification of OLD

2019-12-04 Thread Eugen Konkov
Hi again.

> Thinking some more on this, I now don't think a TODO makes sense, so I
> have removed it.

Please look into this example:
https://dbfiddle.uk/?rdbms=postgres_12&fiddle=95ed9fab6870d7c4b6266ea4d93def13
This is real-life code from our production.

You can see that it is important to get correct info about the deleted data:

-- EXPECTED app_period: ["2018-08-20", "2018-08-25")
-- ACTUAL   app_period: ["2018-08-14", )

> Triggers are designed to check and modify input data, and since DELETE
> has no input data, it makes no sense.

Please put this feature request back into the TODO list: allow
triggers to modify output data.

INSERT  -- receives data              OK    (behavior is expected)
UPDATE  -- receives and returns data  OK    (behavior is expected)
DELETE  -- returns data               FAIL  (behavior is not expected)

It is inconsistent to allow modifying output data for UPDATE while
restricting it for DELETE.


Thank you

-- 
Best regards,
Eugen Konkov





Re: Does 'instead of delete' trigger support modification of OLD

2019-12-04 Thread Eugen Konkov
Hello Eugen,

> https://dbfiddle.uk/?rdbms=postgres_12&fiddle=95ed9fab6870d7c4b6266ea4d93def13

sorry, I forgot to update the link to the latest example:
https://dbfiddle.uk/?rdbms=postgres_12&fiddle=8e114ccc9f15a30ca3115cdc6c70d247


-- 
Best regards,
Eugen Konkov





Re: Unsigned 64 bit integer to numeric

2019-12-04 Thread didier
Hi,
I don't think  so, but there's an (unmaintained?) uint extension at
https://github.com/petere/pguint.git


On Wed, Dec 4, 2019 at 11:24 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:

> Hi,
>
> Probably a simple question, but I don't see a simple answer so far. In
> one extension I want to convert uint64 into a numeric to put it
> eventually into a jsonb object. As far as I see in numeric.c there are
> functions only for signed int64. Is there a way to achieve this with
> uint64 (without duplicating significant part of numeric implementation
> in the extension)?
>
>
>


Re: Unsigned 64 bit integer to numeric

2019-12-04 Thread Andrew Gierth
> "Dmitry" == Dmitry Dolgov <9erthali...@gmail.com> writes:

 Dmitry> Hi,

 Dmitry> Probably a simple question, but I don't see a simple answer so
 Dmitry> far. In one extension I want to convert uint64 into a numeric
 Dmitry> to put it eventually into a jsonb object. As far as I see in
 Dmitry> numeric.c there are functions only for signed int64. Is there a
 Dmitry> way to achieve this with uint64 (without duplicating significant
 Dmitry> part of numeric implementation in the extension)?

Sure. Flip the top bit; convert the value as if signed; then subtract
-(2^63) from the result. (Easier to subtract -(2^63) than to add 2^63,
since the former can itself be represented in a signed int64 for easy
conversion to numeric.)
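Spelled out in extension C code, that recipe might look roughly like this
(just a sketch; uint64_to_numeric() is a name made up here, while
int8_numeric() and numeric_sub() are the existing in-core functions):

#include "postgres.h"

#include "fmgr.h"
#include "utils/fmgrprotos.h"
#include "utils/numeric.h"

/* Hypothetical helper: convert an unsigned 64-bit value to numeric. */
static Numeric
uint64_to_numeric(uint64 v)
{
    /* Flip the top bit so the value can be treated as a signed int64 ... */
    int64       as_signed = (int64) (v ^ (UINT64CONST(1) << 63));
    Datum       num = DirectFunctionCall1(int8_numeric,
                                          Int64GetDatum(as_signed));
    /* ... then subtract -(2^63), which itself fits in a signed int64. */
    Datum       bias = DirectFunctionCall1(int8_numeric,
                                           Int64GetDatum(PG_INT64_MIN));

    return DatumGetNumeric(DirectFunctionCall2(numeric_sub, num, bias));
}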

-- 
Andrew (irc:RhodiumToad)




Re: [HACKERS] Block level parallel vacuum

2019-12-04 Thread Amit Kapila
On Wed, Dec 4, 2019 at 2:01 AM Masahiko Sawada
 wrote:
>
> On Tue, 3 Dec 2019 at 11:57, Amit Kapila  wrote:
> >
> >
> > Forgot one minor point.  Please run pgindent on all the patches.
>
> Got it. I will run pgindent before sending patch from next time.
>

Today, I again read the patch and found a few more minor comments:

1.
void
-LaunchParallelWorkers(ParallelContext *pcxt)
+LaunchParallelWorkers(ParallelContext *pcxt, int nworkers)


I think we should add a comment for this API change which should
indicate why we need to pass nworkers as an additional parameter when
the context itself contains information about the number of workers.

2.
At the beginning of a lazy vacuum (at lazy_scan_heap) we
+ * prepare the parallel context and initialize the DSM segment that contains
+ * shared information as well as the memory space for storing dead tuples.
+ * When starting either index vacuuming or index cleanup, we launch parallel
+ * worker processes.  Once all indexes are processed the parallel worker
+ * processes exit.  And then the leader process re-initializes the parallel
+ * context so that it can use the same DSM for multiple passses of index
+ * vacuum and for performing index cleanup.

a. /And then the leader/After that, the leader ..  This will avoid
using 'and' two times in this sentence.
b. typo, /passses/passes

3.
+ * Macro to check if we are in a parallel lazy vacuum.  If true, we are
+ * in the parallel mode and prepared the DSM segment.

How about changing it slightly as /and prepared the DSM segment./ and
the DSM segment is initialized.?

4.
-
 /* non-export function prototypes */
 static void lazy_scan_heap(Relation onerel, VacuumParams *params,

   LVRelStats *vacrelstats, Relation *Irel, int nindexes,
 bool aggressive);

Spurious change, please remove.  I think this is done by me in one of
the versions.

5.
+ * function we exit from parallel mode.  Index bulk-deletion results are
+ * stored in the DSM segment and update index statistics as a whole after
+ * exited from parallel mode since all writes are not allowed during parallel
+ * mode.

Can we slightly change the above sentence as "Index bulk-deletion
results are stored in the DSM segment and we update index statistics
as a whole after exited from parallel mode since writes are not
allowed during the parallel mode."?

6.
/*
+ * Reset the local value so that we compute cost balance during
+ * parallel index vacuuming.
+ */

This comment is a bit unclear.  How about "Reset the local cost values
for leader backend as we have already accumulated the remaining
balance of heap."?

7.
+ /* Do vacuum or cleanup one index */

How about changing it as: Do vacuum or cleanup of the index?

8.
The copying the result normally
+ * happens only after the first time of index vacuuming.

/The copying the ../The copying of the

9.
+ /*
+ * no longer need the locally allocated result and now
+ * stats[idx] points to the DSM segment.
+ */

How about changing it as below:
"Now that the stats[idx] points to the DSM segment, we don't need the
locally allocated results."

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Implementing Incremental View Maintenance

2019-12-04 Thread nuko yokohama
Hi.

I found a problem after running "ALTER MATERIALIZED VIEW ... RENAME TO".
If a view created with "CREATE INCREMENTAL MATERIALIZED VIEW" is renamed,
subsequent INSERT operations on the base table will fail.

Error message.
```
ERROR:  could not open relation with OID 0
```

Execution log.
```
[ec2-user@ip-10-0-1-10 ivm]$ psql -U postgres test -e -f
~/test/ivm/alter_rename_bug.sql
DROP TABLE IF EXISTS table_x CASCADE;
psql:/home/ec2-user/test/ivm/alter_rename_bug.sql:1: NOTICE:  drop cascades
to materialized view group_imv
DROP TABLE
CREATE TABLE table_x AS
   SELECT generate_series(1, 10000) AS id,
   ROUND(random()::numeric * 100, 2) AS data,
   CASE (random() * 5)::integer
 WHEN 4 THEN 'group-a'
 WHEN 3 THEN 'group-b'
 ELSE 'group-c'
   END AS part_key
;
SELECT 10000
   Table "public.table_x"
  Column  |  Type   | Collation | Nullable | Default
--+-+---+--+-
 id   | integer |   |  |
 data | numeric |   |  |
 part_key | text|   |  |

DROP MATERIALIZED VIEW IF EXISTS group_imv;
psql:/home/ec2-user/test/ivm/alter_rename_bug.sql:15: NOTICE:  materialized
view "group_imv" does not exist, skipping
DROP MATERIALIZED VIEW
CREATE INCREMENTAL MATERIALIZED VIEW group_imv AS
SELECT part_key, COUNT(*), MAX(data), MIN(data), SUM(data), AVG(data)
FROM table_x
GROUP BY part_key;
SELECT 3
 List of relations
 Schema |   Name|   Type|  Owner
+---+---+--
 public | group_imv | materialized view | postgres
 public | table_x   | table | postgres
(2 rows)

 Materialized view "public.group_imv"
  Column   |  Type   | Collation | Nullable | Default
---+-+---+--+-
 part_key  | text|   |  |
 count | bigint  |   |  |
 max   | numeric |   |  |
 min   | numeric |   |  |
 sum   | numeric |   |  |
 avg   | numeric |   |  |
 __ivm_count_max__ | bigint  |   |  |
 __ivm_count_min__ | bigint  |   |  |
 __ivm_count_sum__ | bigint  |   |  |
 __ivm_count_avg__ | bigint  |   |  |
 __ivm_sum_avg__   | numeric |   |  |
 __ivm_count__ | bigint  |   |  |

SELECT * FROM group_imv ORDER BY part_key;
 part_key | count |  max  | min  |sum| avg
--+---+---+--+---+-
 group-a  |  1966 | 99.85 | 0.05 |  98634.93 | 50.1703611393692777
 group-b  |  2021 | 99.99 | 0.17 | 102614.02 | 50.7738842157347848
 group-c  |  6013 | 99.99 | 0.02 | 300968.43 | 50.0529569266589057
(3 rows)

ALTER MATERIALIZED VIEW group_imv RENAME TO group_imv2;
ALTER MATERIALIZED VIEW
 List of relations
 Schema |Name|   Type|  Owner
++---+--
 public | group_imv2 | materialized view | postgres
 public | table_x| table | postgres
(2 rows)

Materialized view "public.group_imv2"
  Column   |  Type   | Collation | Nullable | Default
---+-+---+--+-
 part_key  | text|   |  |
 count | bigint  |   |  |
 max   | numeric |   |  |
 min   | numeric |   |  |
 sum   | numeric |   |  |
 avg   | numeric |   |  |
 __ivm_count_max__ | bigint  |   |  |
 __ivm_count_min__ | bigint  |   |  |
 __ivm_count_sum__ | bigint  |   |  |
 __ivm_count_avg__ | bigint  |   |  |
 __ivm_sum_avg__   | numeric |   |  |
 __ivm_count__ | bigint  |   |  |

SET client_min_messages = debug5;
psql:/home/ec2-user/test/ivm/alter_rename_bug.sql:30: DEBUG:
 CommitTransaction(1) name: unnamed; blockState: STARTED; state:
INPROGRESS, xid/subid/cid: 0/1/0
SET
INSERT INTO table_x VALUES (1001, ROUND(random()::numeric * 100, 2),
'gruop_d');
psql:/home/ec2-user/test/ivm/alter_rename_bug.sql:33: DEBUG:
 StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS,
xid/subid/cid: 0/1/0
psql:/home/ec2-user/test/ivm/alter_rename_bug.sql:33: DEBUG:  relation
"public.group_imv" does not exist
psql:/home/ec2-user/test/ivm/alter_rename_bug.sql:33: DEBUG:  relation
"public.group_imv" does not exist
psql:/home/ec2-user/test/ivm/alter_rename_bug.sql:33: ERROR:  could not
open relation with OID 0
RESET client_min_messages;
psql:/home/ec2-user/test/ivm/alter_rename_bug.sql:34: DEBUG:
 StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS,
xid/subid/cid: 0/1/0
RESET
SELECT * FROM group_imv2 ORDER

Re: Implementing Incremental View Maintenance

2019-12-04 Thread nuko yokohama
On Tue, Dec 3, 2019 at 14:42, Yugo Nagata wrote:

> On Mon, 2 Dec 2019 13:48:40 -0300
> Alvaro Herrera  wrote:
>
> > On 2019-Dec-02, Yugo Nagata wrote:
> >
> > > On Mon, 02 Dec 2019 10:36:36 +0900 (JST)
> > > Tatsuo Ishii  wrote:
> > >
> > > > >> One thing pending in this development line is how to catalogue
> aggregate
> > > > >> functions that can be used in incrementally-maintainable views.
> > > > >> I saw a brief mention somewhere that the devels knew it needed to
> be
> > > > >> done, but I don't see in the thread that they got around to doing
> it.
> > > > >> Did you guys have any thoughts on how it can be represented in
> catalogs?
> > > > >> It seems sine-qua-non ...
> >
> > > > > In the first option, we support only built-in aggregates which we
> know able
> > > > > to handle correctly. Supported aggregates can be identified using
> their OIDs.
> > > > > User-defined aggregates are not supported. I think this is the
> simplest and
> > > > > easiest way.
> > > >
> > > > I think this is enough for the first cut of IVM. So +1.
> > >
> > > If there is no objection, I will add the check of aggregate functions
> > > by this way. Thanks.
> >
> > The way I imagine things is that there's (one or more) new column in
> > pg_aggregate that links to the operator(s) (or function(s)?) that
> > support incremental update of the MV for that aggregate function.  Is
> > that what you're proposing?
>
> The way I am proposing above is using OIDs to check if an aggregate can be
> used in IVM. This allows only a part of the built-in aggregate functions.
>
> This way you mentioned was proposed as one of options as following.
>
> On Fri, 29 Nov 2019 17:33:28 +0900
> Yugo Nagata  wrote:
> > Third, we can add a new attribute to pg_aggregate which shows if each
> > aggregate can be used in IVM. We don't need to use names or OIDs list of
> > supported aggregates although we need modification of the system
> catalogue.
> >
> > Regarding pg_aggregate, now we have aggcombinefn attribute for supporting
> > partial aggregation. Maybe we could use combine functions to calculate
> new
> > aggregate values in IVM when tuples are inserted into a table. However,
> in
> > the context of IVM, we also need other function used when tuples are
> deleted
> > from a table, so we can not use partial aggregation for IVM in the
> current
> > implementation. This might be another option to implement "inverse
> combine
> > function"(?) for IVM, but I am not sure it worth.
>
> If we add "inverse combine function" in pg_aggregate that takes two results
> of aggregating over tuples in a view and tuples in a delta, and produces a
> result of aggregating over tuples in the view after tuples in the delta are
> deleted from this, it would allow to calculate new aggregate values in IVM
> using aggcombinefn together when the aggregate function provides both
> functions.
>
> Another idea is to use support functions for moving-aggregate mode which
> are
> already provided in pg_aggregate. However, in this case, we have to apply
> tuples in the delta to the view one by one instead of applying after
> aggregating tuples in the delta.
>
> In both case, we can not use these support functions in SQL via SPI because
> the type of some aggregates is internal. We have to alter the current
> apply_delta implementation if we adopt a way using these support functions.
> Instead, we also can add support functions for IVM independent to partial
> aggregate or moving-aggregate. Maybe this is also one of options.
>
>
> Regards,
> Yugo Nagata
>
> --
> Yugo Nagata 
>
>
>


Re: Update minimum SSL version

2019-12-04 Thread Tom Lane
Peter Eisentraut  writes:
> On 2019-12-03 12:44, Magnus Hagander wrote:
>> On Tue, Dec 3, 2019 at 12:09 PM Michael Paquier wrote:
>> On Tue, Dec 03, 2019 at 10:10:57AM +0100, Magnus Hagander wrote:
>>> Is 1.0.1 considered a separate major from 1.0.0, in this reasoning? Because
>>> while retiring 1.0.0 should probably not be that terrible, 1.0.1
>>> is still in very widespread use on most long term supported distributions.

> This would mean we'd stop support for RHEL 5, which is probably OK, 
> seeing that even the super-extended support ends in November 2020.

> Dropping RHEL 5 would also allow us to drop support for Python 2.4, 
> which is something I've been itching to do. ;-)

> In both of these cases, maintaining support for all these ancient 
> versions is a significant burden IMO, so it would be good to clean up 
> the tail end a bit.

So, what exactly are we going to set as the new minimum version in
each case?  I'll have to go update my trailing-edge-Johnnie buildfarm
critters, and it'd make sense to have them continue to test the
oldest nominally-supported versions.

For OpenSSL it seems like 1.0.1a is the target, per the above
discussion.

For Python, I'll just observe that RHEL6 ships 2.6.6, so we can't
bump up to 2.7.

regards, tom lane




Re: log bind parameter values on error

2019-12-04 Thread Alvaro Herrera


> (Maybe do strnlen(maxlen), then count strnlen(1) starting at that point
> -- so if that returns >=1, print the "..."?)

So I found that I can make the code more reasonable with this simple
coding,

if (maxlen > 0)
{
s = pnstrdup(s, maxlen);
ellipsis = strnlen(s, maxlen + 1) > maxlen;
/* enlarge while we can do so cheaply */
enlargeStringInfo(str, maxlen);
}

... but the problem is that we now compile stringinfo.c for frontend
environments also, and there's no pnstrdup() in frontends.  And to
introduce it, we'd need a configure check (because GNU libc has it) and
a src/port naive implementation and a fe_memutils.c addition.

Sigh.

Still, it's not that much code, so I'll just go do that and open a
separate thread for it.
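(For reference, the naive src/port fallback would be roughly along the lines
of this sketch, compiled only when configure does not find a native
strndup(); this is an illustration of the usual pattern, not the patch
itself.)

#include "c.h"

/* Naive fallback for platforms without strndup(): copy at most n bytes. */
char *
strndup(const char *str, size_t n)
{
    size_t      len = strnlen(str, n);
    char       *copy = malloc(len + 1);

    if (copy == NULL)
        return NULL;
    memcpy(copy, str, len);
    copy[len] = '\0';
    return copy;
}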

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Session WAL activity

2019-12-04 Thread Konstantin Knizhnik




On 04.12.2019 8:33, Kyotaro Horiguchi wrote:

> It seems to be useful to me. We also might want statistics of other
> session IOs.  In that case the table name would be
> "pg_stat_session/process_activity". We are already collecting most
> kinds of the IO activity but it loses session information...


Well, actually monitoring disk activity for a particular backend/session can
easily be done using external tools (because right now in Postgres
session = backend = process).  So we can monitor the IO of processes, for
example using iotop on Unix or Performance Monitor on Windows.

Certainly it is more convenient to have such statistics inside Postgres,
but I am not sure it is really needed.
Concerning WAL activity, the situation is more obscure: records can be added
to the WAL by one process, but written by another.
This is why it is not possible to use external tools for it.




> Briefly looking at the patch, I have some comments on it.
>
> As mentioned above, if we are intending future expansion of the
> session-stats table, the name should be changed.
>
> Backend status is more appropriate than PGPROC. See pgstat.c.

Do you mean pgstat_fetch_stat_beentry?
But why is it better than storing this information directly in PGPROC?
As this information has to be updated from XLogInsertRecord, which seems to
be a very performance-critical function, my intention was to minimize the
overhead of maintaining this statistic.  It is hard to imagine something
more efficient than just MyProc->walWriten += write_len;

Also, pgstat_fetch_stat_beentry takes a backend id, which is not reported in
the pg_stat_activity view, and this is why it is more convenient to pass a
PID to pg_stat_get_wal_activity.  Certainly it is possible to map a PID to a
backend id, but... why would we actually need to perform such a mapping if a
simpler solution exists?


> Some kind of locking is needed to update the fields on shared segment.
> (LWLocks for PGPROC and PGSTAT_BEGIN/END_WRITE_ACTIVITY for
> PgBackendStatus)

This information is updated locally only by the backend itself.
Certainly an update of a 64-bit field is not atomic on 32-bit architectures,
but it is just a statistic.  I do not think that it will be fatal if for a
moment we can see some incorrect value of written WAL bytes (and on most
platforms this update will be atomic).

As I already wrote above, this information is updated in a
performance-critical place, and this is why I want to avoid any expensive
operations (such as locking or atomic updates) as much as possible.

> Nitpickings:
>
> The patch contains a trace of an older trial in
> pg_stat_get_activity. Proc OID should be >= 8000 in
> patches. src/include/catalog/unused_oids offers some OID for you.



Will fix it.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Yet another vectorized engine

2019-12-04 Thread Konstantin Knizhnik



On 04.12.2019 12:13, Hubert Zhang wrote:
>> 3. Why you have to implement your own plan_tree_mutator and not using
>> expression_tree_mutator?
>
> I also want to replace plan nodes, e.g. Agg -> CustomScan (with a VectorAgg
> implementation). expression_tree_mutator cannot be used to mutate plan
> nodes such as Agg, am I right?


Oh, sorry, I see.


>> 4. As far as I understand you now always try to replace SeqScan
>> with your custom vectorized scan. But it makes sense only if there
>> are quals for this scan or aggregation is performed.
>> In other cases batch+unbatch just adds extra overhead, doesn't it?
>
> Probably there is extra overhead for the heap format and a query like
> 'select i from t;' without quals, projection, or aggregation.
> But with a column store, VectorScan could directly read a batch, with no
> additional batching cost.  A column store is the better choice for OLAP
> queries.


Generally, yes.
But will it be true for a query with a lot of joins?

select * from T1 join T2 on (T1.pk=T2.fk) join T3 on (T2.pk=T3.fk) join T4 ...

How can batching improve performance in this case?
Also, if the query contains a LIMIT clause or cursors, then batching can
cause fetching of useless records (which will never be requested by the
client).


> Can we conclude that it would be better to use the vector engine for OLAP
> queries and the row engine for OLTP queries?

>> 5. Throwing and catching exception for queries which can not be
>> vectorized seems to be not the safest and most efficient way of
>> handling such cases.
>> May be it is better to return error code in plan_tree_mutator and
>> propagate this error upstairs?
>
> Yes.  As for efficiency, another way is to enable some plan nodes to be
> vectorized, leave other nodes non-vectorized, and add a batch/unbatch
> layer between them (is this what you meant by "propagate this error
> upstairs"?).  As you mentioned, this could introduce additional
> overhead.  Are there any other good approaches?
>
> What do you mean by "not safest"?  PG_CATCH will receive the ERROR, and
> we fall back to the original non-vectorized plan.


The problem with catching and ignoring exceptions was discussed many times
on -hackers.
Unfortunately the Postgres PG_TRY/PG_CATCH mechanism is not an analog of the
exception mechanisms in higher-level languages like C++ or Java: it doesn't
perform stack unwinding.  If some resources (files, locks, memory, ...) were
obtained before throwing the error, then they are not reclaimed.
Only a transaction rollback is guaranteed to release all resources, and that
is what actually happens in the case of normal error processing.
But if you catch and ignore the exception, trying to continue execution,
then it can cause many problems.

Maybe in your case it is not a problem, because you know for sure where the
error can happen: it is thrown by plan_tree_mutator, and it looks like there
are no resources obtained by this function.  But in any case the overhead of
setjmp is much higher than that of explicit checks of a return code.
So checking return codes will not actually add any noticeable overhead,
except for complicating the code with extra checks.

But that can be hidden in macros which are used in any case (like MUTATE).


>> 7. How vectorized scan can be combined with parallel execution (it
>> is already supported in 9.6, isn't it?)
>
> We didn't implement it yet, but the idea is the same as in the
> non-parallel case: copy the current parallel scan and implement a
> vectorized Gather, keeping VectorTupleTableSlot as their interface.
> Our basic idea is to reuse most of the current PG executor logic, make it
> vectorized, and then tune performance gradually.


A parallel scan scatters pages between parallel workers.
To fill a VectorTupleSlot with data you may need more than one page
(unless you decide that it can fetch tuples only from a single page).

So it should somehow take into account the specifics of parallel scans.
Also, there are special nodes for parallel execution, so if we want to
provide parallel execution for vectorized operations we also need to
substitute these nodes with custom nodes.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Dumping/restoring fails on inherited generated column

2019-12-04 Thread Tom Lane
Run the regression tests with "make installcheck", then:

$ pg_dump -Fc regression >r.dump
$ createdb r2
$ pg_restore -d r2 r.dump
pg_restore: while PROCESSING TOC:
pg_restore: from TOC entry 6005; 2604 24821 DEFAULT gtest1_1 b postgres
pg_restore: error: could not execute query: ERROR:  column "b" of relation 
"gtest1_1" is a generated column
Command was: ALTER TABLE ONLY public.gtest1_1 ALTER COLUMN b SET DEFAULT (a * 
2);


pg_restore: warning: errors ignored on restore: 1
$ 

It looks like gtest1_1 inherits column "b" from gtest1, so possibly
pg_dump is just confused about the combination of inheritance and
generated columns.

I see this in v12 as well as HEAD.  One interesting question is how come
the pg_upgrade test isn't failing --- maybe binary-upgrade mode handles
this case differently?

regards, tom lane




Re: log bind parameter values on error

2019-12-04 Thread Alvaro Herrera
On 2019-Dec-04, Alvaro Herrera wrote:

> 
> > (Maybe do strnlen(maxlen), then count strnlen(1) starting at that point
> > -- so if that returns >=1, print the "..."?)
> 
> So I found that I can make the code more reasonable with this simple
> coding,

With strndup.

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 9a92076e53a6664ec61c5f475310c5d6edb91d90 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 4 Dec 2019 11:02:34 -0300
Subject: [PATCH v3 1/3] Add strndup / pg_strndup

The former is now a POSIX function.  We already had pnstrdup, but it was
not offered to frontend code.
---
 configure| 23 +++
 configure.in |  3 ++-
 src/common/fe_memutils.c | 26 ++
 src/include/common/fe_memutils.h |  2 ++
 src/include/pg_config.h.in   |  7 +++
 src/include/pg_config.h.win32|  7 +++
 src/include/port.h   |  4 
 src/port/strndup.c   | 32 
 8 files changed, 103 insertions(+), 1 deletion(-)
 create mode 100644 src/port/strndup.c

diff --git a/configure b/configure
index 1d88983b34..fd8c759473 100755
--- a/configure
+++ b/configure
@@ -15482,6 +15482,16 @@ fi
 cat >>confdefs.h <<_ACEOF
 #define HAVE_DECL_STRLCPY $ac_have_decl
 _ACEOF
+ac_fn_c_check_decl "$LINENO" "strndup" "ac_cv_have_decl_strndup" "$ac_includes_default"
+if test "x$ac_cv_have_decl_strndup" = xyes; then :
+  ac_have_decl=1
+else
+  ac_have_decl=0
+fi
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE_DECL_STRNDUP $ac_have_decl
+_ACEOF
 ac_fn_c_check_decl "$LINENO" "strnlen" "ac_cv_have_decl_strnlen" "$ac_includes_default"
 if test "x$ac_cv_have_decl_strnlen" = xyes; then :
   ac_have_decl=1
@@ -15815,6 +15825,19 @@ esac
 
 fi
 
+ac_fn_c_check_func "$LINENO" "strndup" "ac_cv_func_strndup"
+if test "x$ac_cv_func_strndup" = xyes; then :
+  $as_echo "#define HAVE_STRNDUP 1" >>confdefs.h
+
+else
+  case " $LIBOBJS " in
+  *" strndup.$ac_objext "* ) ;;
+  *) LIBOBJS="$LIBOBJS strndup.$ac_objext"
+ ;;
+esac
+
+fi
+
 ac_fn_c_check_func "$LINENO" "strnlen" "ac_cv_func_strnlen"
 if test "x$ac_cv_func_strnlen" = xyes; then :
   $as_echo "#define HAVE_STRNLEN 1" >>confdefs.h
diff --git a/configure.in b/configure.in
index a2cb20b5e3..11f66d19c3 100644
--- a/configure.in
+++ b/configure.in
@@ -1679,7 +1679,7 @@ AC_CHECK_DECLS(posix_fadvise, [], [], [#include ])
 ]) # fi
 
 AC_CHECK_DECLS(fdatasync, [], [], [#include ])
-AC_CHECK_DECLS([strlcat, strlcpy, strnlen])
+AC_CHECK_DECLS([strlcat, strlcpy, strndup, strnlen])
 # This is probably only present on macOS, but may as well check always
 AC_CHECK_DECLS(F_FULLFSYNC, [], [], [#include ])
 
@@ -1738,6 +1738,7 @@ AC_REPLACE_FUNCS(m4_normalize([
 	srandom
 	strlcat
 	strlcpy
+	strndup
 	strnlen
 	strtof
 ]))
diff --git a/src/common/fe_memutils.c b/src/common/fe_memutils.c
index ce99b4f4da..896e5d46ca 100644
--- a/src/common/fe_memutils.c
+++ b/src/common/fe_memutils.c
@@ -101,6 +101,26 @@ pg_strdup(const char *in)
 	return tmp;
 }
 
+char *
+pg_strndup(const char *in, size_t size)
+{
+	char	   *tmp;
+
+	if (!in)
+	{
+		fprintf(stderr,
+_("cannot duplicate null pointer (internal error)\n"));
+		exit(EXIT_FAILURE);
+	}
+	tmp = strndup(in, size);
+	if (!tmp)
+	{
+		fprintf(stderr, _("out of memory\n"));
+		exit(EXIT_FAILURE);
+	}
+	return tmp;
+}
+
 void
 pg_free(void *ptr)
 {
@@ -142,6 +162,12 @@ pstrdup(const char *in)
 	return pg_strdup(in);
 }
 
+char *
+pnstrdup(const char *in, Size size)
+{
+	return pg_strndup(in, size);
+}
+
 void *
 repalloc(void *pointer, Size size)
 {
diff --git a/src/include/common/fe_memutils.h b/src/include/common/fe_memutils.h
index a1e5940d31..c8797ffe38 100644
--- a/src/include/common/fe_memutils.h
+++ b/src/include/common/fe_memutils.h
@@ -23,6 +23,7 @@
  * (except pg_malloc_extended with MCXT_ALLOC_NO_OOM)
  */
 extern char *pg_strdup(const char *in);
+extern char *pg_strndup(const char *in, size_t size);
 extern void *pg_malloc(size_t size);
 extern void *pg_malloc0(size_t size);
 extern void *pg_malloc_extended(size_t size, int flags);
@@ -31,6 +32,7 @@ extern void pg_free(void *pointer);
 
 /* Equivalent functions, deliberately named the same as backend functions */
 extern char *pstrdup(const char *in);
+extern char *pnstrdup(const char *in, Size size);
 extern void *palloc(Size size);
 extern void *palloc0(Size size);
 extern void *palloc_extended(Size size, int flags);
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index c208dcdfc7..1db3f1b1e8 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -170,6 +170,10 @@
don't. */
 #undef HAVE_DECL_STRLCPY
 
+/* Define to 1 if you have the declaration of `strndup', and to 0 if you
+   don't. */
+#undef HAVE_DECL_STRNDUP
+
 /* Define to 1 if you have the declaration of `strnlen', and to 0 if you
don't. */
 #undef HAVE_DECL_STRNLEN

adding strndup

2019-12-04 Thread Alvaro Herrera
I just proposed in
https://postgr.es/m/0191204143715.GA17312@alvherre.pgsql the addition of
strndup() to our src/port.

I think this should be pretty uncontroversial, but wanted to give a
heads-up outside that thread.  I attach the patch here for completeness.

-- 
Álvaro Herrera  http://www.twitter.com/alvherre
>From 9a92076e53a6664ec61c5f475310c5d6edb91d90 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 4 Dec 2019 11:02:34 -0300
Subject: [PATCH v3 1/3] Add strndup / pg_strndup

The former is now a POSIX function.  We already had pnstrdup, but it was
not offered to frontend code.
---
 configure| 23 +++
 configure.in |  3 ++-
 src/common/fe_memutils.c | 26 ++
 src/include/common/fe_memutils.h |  2 ++
 src/include/pg_config.h.in   |  7 +++
 src/include/pg_config.h.win32|  7 +++
 src/include/port.h   |  4 
 src/port/strndup.c   | 32 
 8 files changed, 103 insertions(+), 1 deletion(-)
 create mode 100644 src/port/strndup.c

diff --git a/configure b/configure
index 1d88983b34..fd8c759473 100755
--- a/configure
+++ b/configure
@@ -15482,6 +15482,16 @@ fi
 cat >>confdefs.h <<_ACEOF
 #define HAVE_DECL_STRLCPY $ac_have_decl
 _ACEOF
+ac_fn_c_check_decl "$LINENO" "strndup" "ac_cv_have_decl_strndup" "$ac_includes_default"
+if test "x$ac_cv_have_decl_strndup" = xyes; then :
+  ac_have_decl=1
+else
+  ac_have_decl=0
+fi
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE_DECL_STRNDUP $ac_have_decl
+_ACEOF
 ac_fn_c_check_decl "$LINENO" "strnlen" "ac_cv_have_decl_strnlen" "$ac_includes_default"
 if test "x$ac_cv_have_decl_strnlen" = xyes; then :
   ac_have_decl=1
@@ -15815,6 +15825,19 @@ esac
 
 fi
 
+ac_fn_c_check_func "$LINENO" "strndup" "ac_cv_func_strndup"
+if test "x$ac_cv_func_strndup" = xyes; then :
+  $as_echo "#define HAVE_STRNDUP 1" >>confdefs.h
+
+else
+  case " $LIBOBJS " in
+  *" strndup.$ac_objext "* ) ;;
+  *) LIBOBJS="$LIBOBJS strndup.$ac_objext"
+ ;;
+esac
+
+fi
+
 ac_fn_c_check_func "$LINENO" "strnlen" "ac_cv_func_strnlen"
 if test "x$ac_cv_func_strnlen" = xyes; then :
   $as_echo "#define HAVE_STRNLEN 1" >>confdefs.h
diff --git a/configure.in b/configure.in
index a2cb20b5e3..11f66d19c3 100644
--- a/configure.in
+++ b/configure.in
@@ -1679,7 +1679,7 @@ AC_CHECK_DECLS(posix_fadvise, [], [], [#include ])
 ]) # fi
 
 AC_CHECK_DECLS(fdatasync, [], [], [#include ])
-AC_CHECK_DECLS([strlcat, strlcpy, strnlen])
+AC_CHECK_DECLS([strlcat, strlcpy, strndup, strnlen])
 # This is probably only present on macOS, but may as well check always
 AC_CHECK_DECLS(F_FULLFSYNC, [], [], [#include ])
 
@@ -1738,6 +1738,7 @@ AC_REPLACE_FUNCS(m4_normalize([
 	srandom
 	strlcat
 	strlcpy
+	strndup
 	strnlen
 	strtof
 ]))
diff --git a/src/common/fe_memutils.c b/src/common/fe_memutils.c
index ce99b4f4da..896e5d46ca 100644
--- a/src/common/fe_memutils.c
+++ b/src/common/fe_memutils.c
@@ -101,6 +101,26 @@ pg_strdup(const char *in)
 	return tmp;
 }
 
+char *
+pg_strndup(const char *in, size_t size)
+{
+	char	   *tmp;
+
+	if (!in)
+	{
+		fprintf(stderr,
+_("cannot duplicate null pointer (internal error)\n"));
+		exit(EXIT_FAILURE);
+	}
+	tmp = strndup(in, size);
+	if (!tmp)
+	{
+		fprintf(stderr, _("out of memory\n"));
+		exit(EXIT_FAILURE);
+	}
+	return tmp;
+}
+
 void
 pg_free(void *ptr)
 {
@@ -142,6 +162,12 @@ pstrdup(const char *in)
 	return pg_strdup(in);
 }
 
+char *
+pnstrdup(const char *in, Size size)
+{
+	return pg_strndup(in, size);
+}
+
 void *
 repalloc(void *pointer, Size size)
 {
diff --git a/src/include/common/fe_memutils.h b/src/include/common/fe_memutils.h
index a1e5940d31..c8797ffe38 100644
--- a/src/include/common/fe_memutils.h
+++ b/src/include/common/fe_memutils.h
@@ -23,6 +23,7 @@
  * (except pg_malloc_extended with MCXT_ALLOC_NO_OOM)
  */
 extern char *pg_strdup(const char *in);
+extern char *pg_strndup(const char *in, size_t size);
 extern void *pg_malloc(size_t size);
 extern void *pg_malloc0(size_t size);
 extern void *pg_malloc_extended(size_t size, int flags);
@@ -31,6 +32,7 @@ extern void pg_free(void *pointer);
 
 /* Equivalent functions, deliberately named the same as backend functions */
 extern char *pstrdup(const char *in);
+extern char *pnstrdup(const char *in, Size size);
 extern void *palloc(Size size);
 extern void *palloc0(Size size);
 extern void *palloc_extended(Size size, int flags);
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index c208dcdfc7..1db3f1b1e8 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -170,6 +170,10 @@
don't. */
 #undef HAVE_DECL_STRLCPY
 
+/* Define to 1 if you have the declaration of `strndup', and to 0 if you
+   don't. */
+#undef HAVE_DECL_STRNDUP
+
 /* Define to 1 if you have the declaration of `strnlen', and to 0 if you
don't. */
 #undef HAVE_DECL_STRNLEN
@@ -545,6 +549,9 @@
 /* Define to 1 if you have the `

Re: Update minimum SSL version

2019-12-04 Thread Peter Eisentraut

On 2019-12-04 13:53, Tom Lane wrote:

So, what exactly are we going to set as the new minimum version in
each case?  I'll have to go update my trailing-edge-Johnnie buildfarm
critters, and it'd make sense to have them continue to test the
oldest nominally-supported versions.

For OpenSSL it seems like 1.0.1a is the target, per the above
discussion.

For Python, I'll just observe that RHEL6 ships 2.6.6, so we can't
bump up to 2.7.


Yes, it would be Python 2.6.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Update minimum SSL version

2019-12-04 Thread Peter Eisentraut

On 2019-12-04 09:20, Michael Paquier wrote:
> On Wed, Dec 04, 2019 at 09:10:04AM +0100, Peter Eisentraut wrote:
>> This would mean we'd stop support for RHEL 5, which is probably OK, seeing
>> that even the super-extended support ends in November 2020.
>
> Sounds like a plan.  I can work on the OpenSSL part, if you need help
> of course.  And if others don't object to doing that.  Of course.


Please go ahead and propose a patch.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Runtime pruning problem

2019-12-04 Thread Tom Lane
I wrote:
>> This may be arguing for a change in ruleutils' existing behavior,
>> not sure.  But when dealing with traditional-style inheritance,
>> I've always thought that Vars above the Append were referring to
>> the parent rel in its capacity as the parent, not in its capacity
>> as the first child.  With new-style partitioning drawing a clear
>> distinction between the parent and all its children, it's easier
>> to understand the difference.

> OK, so experimenting, I see that it is a change: HEAD does

> regression=# explain verbose select * from part order by a;
>                                  QUERY PLAN
> ------------------------------------------------------------------------------
>  Sort  (cost=362.21..373.51 rows=4520 width=8)
>    Output: part_p1.a, part_p1.b
>    Sort Key: part_p1.a
>    ->  Append  (cost=0.00..87.80 rows=4520 width=8)
>          ->  Seq Scan on public.part_p1  (cost=0.00..32.60 rows=2260 width=8)
>                Output: part_p1.a, part_p1.b
>          ->  Seq Scan on public.part_p2_p1  (cost=0.00..32.60 rows=2260 width=8)
>                Output: part_p2_p1.a, part_p2_p1.b
> (8 rows)

> The portion of this below the Append is fine, but I argue that
> the Vars above the Append should say "part", not "part_p1".
> In that way they'd look the same regardless of which partitions
> have been pruned or not.

So I've been thinking about how to make this actually happen.
I do not think it's possible without adding more information
to Plan trees.  Which is not a show-stopper in itself --- there's
already various fields there that have no use except to support
EXPLAIN --- but it'd behoove us to minimize the amount of work
the planner spends to generate such new info.

I think it can be made to work with a design along these lines:

* Add the planner's AppendRelInfo list to the finished PlannedStmt.
We would have no need for the translated_vars list, only for the
recently-added reverse-lookup array, so we could reduce the cost
of copying plans by having setrefs.c zero out the translated_vars
fields, much as it does for unnecessary fields of RTEs.

* In Append and MergeAppend plan nodes, add a bitmapset field that
contains the relids of any inheritance parent rels formed by this
append operation.  (It has to be a set, not a single relid, because
a partitioned join would form two appendrels at the same plan node.
In general, partitioned joins break a lot of the simpler ideas
I'd had before this one...)  I think this is probably just the relids
of the path's parent RelOptInfo, so it's little or no extra cost to
calculate.

* In ExplainPreScanNode, treat relids mentioned in such fields as
referenced by the query, so that they'll be assigned aliases by
select_rtable_names_for_explain.  (Note that this will generally mean
that a partition root table gets its unmodified alias, and all child
rels will have "_N" added, rather than the current situation where the
first unpruned child gets the parent's unmodified alias.  This seems
good to me from a consistency standpoint, although it'll mean another
round of churn in the regression test results.)

* When ruleutils has to resolve a Var, and it descends through an
Append or MergeAppend that has this field nonempty, remember the
bitmapset of relevant relids as we continue recursing.  Once we've
finally located a base Var, if the passed-down set of inheritance
relids isn't empty, then use the AppendRelInfo data to try to map
the base Var's varno/varattno back up to any one of these relids.
If successful, print the name of the mapped-to table and column
instead of the base Var's name.

This design will correctly print references to the "same" Var
differently depending on where they appear in the plan tree, ie above
or below the Append that forms the appendrel.  I don't see any way we
can make that happen reliably without new plantree decoration --- in
particular, I don't think ruleutils can reverse-engineer which Appends
form which appendrels without any help.

An interesting point is what to do if we see more than one such append
node as we descend.  We should union the sets of relevant appendrel
relids, for sure, but now there is a possibility that more than one
appendrel can be matched while chasing back up the AppendRelInfo data.
I think that can only happen for an inheritance appendrel nested
inside a UNION ALL appendrel, so the question becomes whether we'd
rather report the inheritance root or whatever alias we're going to
assign for UNION appendrels.  Perhaps that choice should wait until
we've got some code to test these ideas with.

I haven't tried to code this yet, but will go do so if there aren't
objections to this sketch.

regards, tom lane




Re: adding strndup

2019-12-04 Thread Tom Lane
Alvaro Herrera  writes:
> I just proposed in
> https://postgr.es/m/0191204143715.GA17312@alvherre.pgsql the addition of
> strndup() to our src/port.
> I think this should be pretty uncontroversial, but wanted to give a
> heads-up outside that thread.  I attach the patch here for completeness.

Grepping, I notice that ecpg has an ecpg_strndup; should that be
replaced with this?

regards, tom lane




Re: Windows buildfarm members vs. new async-notify isolation test

2019-12-04 Thread Andrew Dunstan
On Wed, Dec 4, 2019 at 12:12 AM Tom Lane  wrote:
>
> Amit Kapila  writes:
> > On Tue, Dec 3, 2019 at 10:10 PM Tom Lane  wrote:
> >> Hmm ... just looking at the code again, could it be that there's
> >> no well-placed CHECK_FOR_INTERRUPTS?  Andrew, could you see if
> >> injecting one in what 790026972 added to postgres.c helps?
>
> > I also tried to analyze this failure and it seems this is a good bet,
> > but I am also wondering why we have never seen such a timing issue in
> > other somewhat similar tests.  For ex.,  one with comment (#
> > Cross-backend notification delivery.).  Do they have a better way of
> > ensuring that the notification will be received or is it purely
> > coincidental that they haven't seen such a symptom?
>
> TBH, my bet is that this *won't* fix it, but it seemed like an easy
> thing to test.  For this to fix it, you'd have to suppose that we
> never do a CHECK_FOR_INTERRUPTS during a COMMIT command, which is
> improbable at best.
>


You win your bet. Tried this on frogmouth and it still failed.

cheers

andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Allow superuser to grant passwordless connection rights on postgres_fdw

2019-12-04 Thread Andrew Dunstan
On Tue, Dec 3, 2019 at 9:36 AM Stephen Frost  wrote:


>
> > A necessary prerequisite is that Pg be able to cope with passwordless
> > user-mappings though. Hence this patch.
>
> Sure, that part seems like it makes sense to me (and perhaps has now
> been done, just catching up on things after travel and holidays and such
> here in the US).
>


It hasn't been done, but I now propose to commit it shortly so other
work can proceed.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Proposal: Add more compile-time asserts to expose inconsistencies.

2019-12-04 Thread Andres Freund
On 2019-12-04 15:16:25 +0900, Michael Paquier wrote:
> On Mon, Dec 02, 2019 at 07:55:45AM -0800, Andres Freund wrote:
> > Now that I'm back from vacation, I'll try to take a stab at b). It
> > should definitely doable to use the same approach for StaticAssertStmt,
> > the problematic case might be StaticAssertExpr.
> 
> So you basically want to minimize the amount of code relying on
> external compiler expressions?  Sounds like a plan.  At quick glance,
> it seems that this should work.  I haven't tested though.  I'll wait
> for what you come up with then.

I don't know what you mean by "external compiler expressions"?


> >> +StaticAssertDecl(lengthof(LockTagTypeNames) == (LOCKTAG_ADVISORY + 1),
> >> +   "LockTagTypeNames array inconsistency");
> >> +
> > 
> > These error messages strike me as somewhat unhelpful. I'd probably just
> > reword them as "array length mismatch" or something like that.
> 
> That's indeed better.  Now I think that it is useful to have the
> structure name in the error message as well, no?

No. I think the cost of having the different error messages is much
higher than the cost of not having the struct name in there. Note that
you'll commonly get an error message including the actual source code
for the offending expression.


> > I think this patch ought to include at least one StaticAssertDecl in a
> > header, to make sure we get that part right across compilers. E.g. the
> > one in PageIsVerified()?
> 
> No objections to have one, but I don't think that your suggestion is a
> good choice.  This static assertion is based on size_t and BLCKSZ, and
> is located close to a code path where we have a specific logic based
> on both things.

Well, but it's a reliance that goes beyond that specific source code
location.


> If in the future this code gets removed, then we'd likely miss to
> remove the static assertion if they are separated across multiple
> files.

It'll never get removed. There's just plainly no way that we'd use a
block size that's not a multiple of size_t.
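
For illustration, moving that check into a header with the proposed
declaration-level macro might look something like this (a sketch only, not
part of the posted patch):

/* e.g. at file scope in bufpage.h */
StaticAssertDecl(BLCKSZ == (BLCKSZ / sizeof(size_t)) * sizeof(size_t),
				 "BLCKSZ has to be a multiple of sizeof(size_t)");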

Greetings,

Andres Freund




Re: Proposal: Add more compile-time asserts to expose inconsistencies.

2019-12-04 Thread Andres Freund
Hi,

On 2019-12-04 10:09:28 +0100, Peter Eisentraut wrote:
> On 2019-12-02 16:55, Andres Freund wrote:
> > > +StaticAssertDecl(lengthof(LockTagTypeNames) == (LOCKTAG_ADVISORY + 1),
> > > +  "LockTagTypeNames array inconsistency");
> > > +
> > These error messages strike me as somewhat unhelpful. I'd probably just
> > reword them as "array length mismatch" or something like that.
> 
> I'd prefer it if we could just get rid of the second argument and show the
> actual expression in the error message, like run-time assertions work.

Well, _Static_assert has an error message, so we've got to pass
something. And having something like "array length mismatch", without
referring again to the variable, doesn't strike me as that bad. We could
of course just again pass the expression, this time stringified, but
that doesn't seem an improvement.

Greetings,

Andres Freund




Re: backup manifests

2019-12-04 Thread Rushabh Lathia
As per the  discussion on the thread, here is the patch which

a) Make checksum for manifest file optional.
b) Allow user to choose a particular algorithm.

Currently the WIP patch supports the SHA256 and CRC checksum algorithms.
The patch also changes the manifest file format to prepend the algorithm
name to the checksum, so that the validator can easily tell which
algorithm was used.

Ex:
./db/bin/pg_basebackup -D bksha/ --manifest-with-checksums=SHA256

$ cat bksha/backup_manifest  | more
PostgreSQL-Backup-Manifest-Version 1
File backup_label 226 2019-12-04 17:46:46 GMT
SHA256:7cf53d1b9facca908678ab70d93a9e7460cd35cedf7891de948dcf858f8a281a
File pg_xact/ 8192 2019-12-04 17:46:46 GMT
SHA256:8d2b6cb1dc1a6e8cee763b52d75e73571fddce06eb573861d44082c7d8c03c26

./db/bin/pg_basebackup -D bkcrc/ --manifest-with-checksums=CRC
PostgreSQL-Backup-Manifest-Version 1
File backup_label 226 2019-12-04 17:58:40 GMT CRC:343138313931333134
File pg_xact/ 8192 2019-12-04 17:46:46 GMT CRC:36353834343133

Pending TODOs:
- Documentation update
- Code cleanup
- Testing.

I will continue to work on the patch; meanwhile, feel free to provide
thoughts/inputs.

Thanks,


On Mon, Nov 25, 2019 at 11:13 PM Robert Haas  wrote:

> On Fri, Nov 22, 2019 at 5:15 PM Tels 
> wrote:
> > It is related to the number of states...
>
> Thanks for this explanation. See my reply to David where I also
> discuss this point.
>
> > However, if you choose a hash, please do not go below SHA-256. Both MD5
> > and SHA-1 already had collision attacks, and these only got to be bound
> > to be worse.
> >
> >https://www.mscs.dal.ca/~selinger/md5collision/
> >https://shattered.io/
>
> Yikes, that second link, about SHA-1, is depressing. Now, it's not
> likely that an attacker has access to your backup repository and can
> spend 6500 years of CPU time to engineer a Trojan file there (maybe
> more, because the files are probably bigger than the PDFs they used in
> that case) and then induce you to restore and rely upon that backup.
> However, it's entirely likely that somebody is going to eventually ban
> SHA-1 as the attacks get better, which is going to be a problem for us
> whether the underlying exposures are problems or not.
>
> > It might even be a wise idea to encode the used Hash-Algorithm into the
> > manifest file, so it can be changed later. The hash length might be not
> > enough to decide which algorithm is the one used.
>
> I agree. Let's write
> SHA256:bc1c3a57369acd0d2183a927fb2e07acbbb1c97f317bbc3b39d93ec65b754af5
> or similar rather than just the hash. That way even if the entire SHA
> family gets cracked, we can easily substitute in something else that
> hasn't been cracked yet.
>
> (It is unclear to me why anyone supposes that *any* popular hash
> function won't eventually be cracked. For a K-bit hash function, there
> are 2^K possible outputs, where K is probably in the hundreds. But
> there are 2^{2^33} possible 1GB files. So for every possible output
> value, there are 2^{2^33-K} inputs that produce that value, which is a
> very very big number. The probability that any given input produces a
> certain output is very low, but the number of possible inputs that
> produce a given output is very high; so assuming that nobody's ever
> going to figure out how to construct them seems optimistic.)
>
> > To get a feeling one can use:
> >
> > openssl speed md5 sha1 sha256 sha512
> >
> > On my really-not-fast desktop CPU (i5-4690T CPU @ 2.50GHz) it says:
> >
> >   The 'numbers' are in 1000s of bytes per second processed.
> >   type        16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
> >   md5        122638.55k   277023.96k   487725.57k   630806.19k   683892.74k   688553.98k
> >   sha1       127226.45k   313891.52k   632510.55k   865753.43k   960995.33k   977215.19k
> >   sha256      77611.02k   173368.15k   325460.99k   412633.43k   447022.92k   448020.48k
> >   sha512      51164.77k   205189.87k   361345.79k   543883.26k   638372.52k   645933.74k
> >
> > Or in other words, it can hash nearly 931 MByte/s with SHA-1 and about
> > 427 MByte/s with SHA-256 (if I haven't miscalculated something). You'd
> > need a pretty fast disk (aka M.2 SSD) and network (aka > 1 Gbit) to top
> > these speeds, and then you'd use a real CPU for your server, not some
> > poor Intel powersaving surfing thingy-majingy :)
>
> I mean, how fast is in theory doesn't matter nearly as much as what
> happens when you benchmark the proposed implementation, and the
> results we have so far don't support the theory that this is so cheap
> as to be negligible.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


-- 
Rushabh Lathia
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 66aa0fc..c6bb74d 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ 

Re: adding strndup

2019-12-04 Thread Andres Freund
Hi,

On 2019-12-04 11:40:21 -0300, Alvaro Herrera wrote:
> I just proposed in
> https://postgr.es/m/0191204143715.GA17312@alvherre.pgsql the addition of
> strndup() to our src/port.
> 
> I think this should be pretty uncontroversial, but wanted to give a
> heads-up outside that thread.  I attach the patch here for completeness.

Well, I personally think it's a bad idea to add further implementations
for functions that are in standard libraries on some systems. Especially,
but not exclusively, when made available for frontend code, where it's
not unlikely that there might be other applications having their own
implementations of strndup/whatever.

Besides that reason, I think AC_REPLACE_FUNCS is just a bad mechanism
that yields fragmented source code and needs to be implemented differently
for Windows.  The resulting code is also often badly optimized, due to
tiny translation units without relevant functions having knowledge about
each other.

I'd just provide pnstrdup() in the frontend, without adding strndup().

I also see no point in adding both pnstrdup() and pg_strndup(). I'm fine
with moving towards pg_strndup(), but then we just ought to remove
pnstrdup().

- Andres




Re: backup manifests

2019-12-04 Thread Robert Haas
On Wed, Dec 4, 2019 at 1:01 PM Rushabh Lathia  wrote:
> As per the  discussion on the thread, here is the patch which
>
> a) Make checksum for manifest file optional.
> b) Allow user to choose a particular algorithm.
>
> Currently the WIP patch supports the SHA256 and CRC checksum algorithms.
> The patch also changes the manifest file format to prepend the algorithm
> name to the checksum, so that the validator can easily tell which
> algorithm was used.
>
> Ex:
> ./db/bin/pg_basebackup -D bksha/ --manifest-with-checksums=SHA256
>
> $ cat bksha/backup_manifest  | more
> PostgreSQL-Backup-Manifest-Version 1
> File backup_label 226 2019-12-04 17:46:46 GMT 
> SHA256:7cf53d1b9facca908678ab70d93a9e7460cd35cedf7891de948dcf858f8a281a
> File pg_xact/ 8192 2019-12-04 17:46:46 GMT 
> SHA256:8d2b6cb1dc1a6e8cee763b52d75e73571fddce06eb573861d44082c7d8c03c26
>
> ./db/bin/pg_basebackup -D bkcrc/ --manifest-with-checksums=CRC
> PostgreSQL-Backup-Manifest-Version 1
> File backup_label 226 2019-12-04 17:58:40 GMT CRC:343138313931333134
> File pg_xact/ 8192 2019-12-04 17:46:46 GMT CRC:36353834343133
>
> Pending TODOs:
> - Documentation update
> - Code cleanup
> - Testing.
>
> I will continue to work on the patch; meanwhile, feel free to provide
> thoughts/inputs.

+ initilize_manifest_checksum(&cCtx);

Spelling.

-

Spurious.

+ case MC_CRC:
+ INIT_CRC32C(cCtx->crc_ctx);

Suggest that we do CRC -> CRC32C throughout the patch. Someone might
conceivably want some other CRC variant, mostly likely 64-bit, in the
future.

+final_manifest_checksum(ChecksumCtx *cCtx, char *checksumbuf)

finalize

  printf(_("  --manifest-with-checksums\n"
- " do calculate checksums for manifest files\n"));
+ " calculate checksums for manifest files using provided algorithm\n"));

Switch name is wrong. Suggest --manifest-checksums.
Help usually shows that an argument is expected, e.g.
--manifest-checksums=ALGORITHM or
--manifest-checksums=sha256|crc32c|none

This seems to apply over some earlier version of the patch.  A
consolidated patch, or the whole stack, would be better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Block level parallel vacuum

2019-12-04 Thread Masahiko Sawada
On Wed, 4 Dec 2019 at 04:57, Dilip Kumar  wrote:
>
> On Wed, Dec 4, 2019 at 9:12 AM Amit Kapila  wrote:
> >
> > On Wed, Dec 4, 2019 at 1:58 AM Masahiko Sawada
> >  wrote:
> > >
> > > On Tue, 3 Dec 2019 at 11:55, Amit Kapila  wrote:
> > > >
> > > In your code, I think if two workers enter the compute_parallel_delay
> > > function at the same time, they add their local balance to
> > > VacuumSharedCostBalance and both workers sleep because both values
> > > reach the VacuumCostLimit.
> > >
> >
> > True, but isn't it more appropriate because the local cost of any
> > worker should be ideally added to shared cost as soon as it occurred?
> > I mean to say that we are not adding any cost in shared balance
> > without actually incurring it.   Then we also consider the individual
> > worker's local balance as well and sleep according to local balance.
>
> Even I think it is better to add the balance to the shared balance at
> the earliest opportunity.  Just consider the case that there are 5
> workers and all have an I/O balance of 20, and VacuumCostLimit is 50.  So
> actually, their combined balance is 100 (which is double the
> VacuumCostLimit), but if we don't add immediately then none of the
> workers will sleep and it may go to the next cycle which is not very
> good. OTOH, if we add 20 immediately then check the shared balance
> then all the workers might go for sleep if their local balances have
> reached the limit but they will only sleep in proportion to their
> local balance.  So IMHO, adding the current balance to shared balance
> early is more close to the model we are trying to implement i.e.
> shared cost accounting.

I agree with adding the balance as soon as it occurs. But the problem I'm
concerned about is this: let's suppose we have 4 workers, the cost limit
is 100, and the shared balance is now 95. Two workers, whose local balance
(VacuumCostBalanceLocal) is 40, consumed I/O, added 10 to their local
balance, and entered the compute_parallel_delay function at the same time.
One worker adds 10 to the shared balance (VacuumSharedCostBalance), and
the other worker also adds 10 to the shared balance. The first worker then
subtracts its local balance from the shared balance and sleeps, because
the shared cost is now 115 (> the cost limit) and its local balance is 50
(> 0.5*(100/4)). The other worker does the same, for the same reason. On
the other hand, if the two workers do that serially, only one worker
sleeps and the other doesn't, because the total shared cost will be 75
when the later worker enters the condition. At first glance it looks like
a concurrency problem, but is that the expected behaviour?

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: adding strndup

2019-12-04 Thread Tom Lane
Andres Freund  writes:
> On 2019-12-04 11:40:21 -0300, Alvaro Herrera wrote:
>> I think this should be pretty uncontroversial, but wanted to give a
>> heads-up outside that thread.  I attach the patch here for completeness.

> I'd just provide pnstrdup() in the frontend, without adding strndup().

+1 --- seems like a bunch more mechanism than is warranted.  Let's
just open-code it in pnstrdup.  We can rely on strnlen, since that's
already supported, and there's not much more there beyond that.

> I also see no point in adding both pnstrdup() and pg_strndup(). I'm fine
> with moving towards pg_strndup(), but then we just ought to remove
> pnstrdup().

There's a fair number of uses of pnstrdup in the backend.  While it
wouldn't be too painful to rename them, I'm not sure I see the point.
(What I'd really argue for, if we did rename, is "pstrndup".)

regards, tom lane




Re: adding strndup

2019-12-04 Thread Alvaro Herrera
On 2019-Dec-04, Tom Lane wrote:

> Andres Freund  writes:
> > On 2019-12-04 11:40:21 -0300, Alvaro Herrera wrote:
> >> I think this should be pretty uncontroversial, but wanted to give a
> >> heads-up outside that thread.  I attach the patch here for completeness.
> 
> > I'd just provide pnstrdup() in the frontend, without adding strndup().
> 
> +1 --- seems like a bunch more mechanism than is warranted.  Let's
> just open-code it in pnstrdup.  We can rely on strnlen, since that's
> already supported, and there's not much more there beyond that.

I can get behind that ... it makes the patch a lot smaller.  I'm gonna
send an updated version in a jiffy.

> > I also see no point in adding both pnstrdup() and pg_strndup(). I'm fine
> > with moving towards pg_strndup(), but then we just ought to remove
> > pnstrdup().
> 
> There's a fair number of uses of pnstrdup in the backend.  While it
> wouldn't be too painful to rename them, I'm not sure I see the point.
> (What I'd really argue for, if we did rename, is "pstrndup".)

*shrug*  I also looked for pstrndup() first.  And Peter E also in
https://postgr.es/m/1339713732.11971.79.ca...@vanquo.pezone.net
submitted an implementation of pstrndup().  I'm not opposed to renaming
it, but I hesitate to do it at the same time as putting it in pgport.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Unsigned 64 bit integer to numeric

2019-12-04 Thread Dmitry Dolgov
> On Wed, Dec 04, 2019 at 11:49:20AM +, Andrew Gierth wrote:
>
> > "Dmitry" == Dmitry Dolgov <9erthali...@gmail.com> writes:
>
>  Dmitry> Hi,
>
>  Dmitry> Probably a simple question, but I don't see a simple answer so
>  Dmitry> far. In one extension I want to convert uint64 into a numeric
>  Dmitry> to put it eventually into a jsonb object. As far as I see in
>  Dmitry> numeric.c there are functions only for signed int64. Is there a
>  Dmitry> way to achive this with uint64 (without duplicating significant
>  Dmitry> part of numeric implementation in the extension)?
>
> Sure. Flip the top bit; convert the value as if signed; then subtract
> -(2^63) from the result. (Easier to subtract -(2^63) than to add 2^63,
> since the former can itself be represented in a signed int64 for easy
> conversion to numeric.)

Indeed, looks like this does the trick, thank you!
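
For reference, a minimal C sketch of Andrew's recipe as it might be spelled
in an extension (the wrapper name is made up; int8_numeric() and
numeric_sub() are existing SQL-callable functions, invoked directly via the
fmgr):

#include "postgres.h"

#include "fmgr.h"
#include "utils/fmgrprotos.h"
#include "utils/numeric.h"

static Numeric
uint64_to_numeric(uint64 v)
{
	Datum		shifted;
	Datum		min_int64;

	/* flip the top bit and reinterpret the result as a signed int64 */
	shifted = DirectFunctionCall1(int8_numeric,
								  Int64GetDatum((int64) (v ^ (UINT64CONST(1) << 63))));

	/* -(2^63) is representable as a signed int64, so convert it directly */
	min_int64 = DirectFunctionCall1(int8_numeric,
									Int64GetDatum(PG_INT64_MIN));

	/* v = (v XOR 2^63, read as signed) - (-(2^63)) */
	return DatumGetNumeric(DirectFunctionCall2(numeric_sub,
											   shifted, min_int64));
}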




Re: [HACKERS] Block level parallel vacuum

2019-12-04 Thread Robert Haas
On Thu, Nov 21, 2019 at 12:32 AM Dilip Kumar  wrote:
> In v33-0001-Add-index-AM-field-and-callback-for-parallel-ind patch,  I
> am a bit doubtful about this kind of arrangement, where the code in
> the "if" is always unreachable with the current AMs.  I am not sure
> what is the best way to handle this, should we just drop the
> amestimateparallelvacuum altogether?  Because currently, we are just
> providing a size estimate function without a copy function.  Even if in
> the future some AM gives an estimate of the size of its stats, we cannot
> directly memcpy the stats from local memory to shared memory; we might
> then also need a copy function from the AM so that it can flatten the
> stats and store them in the proper format?

I agree that it's a crock to add an AM method that is never used for
anything. That's just asking for the design to prove buggy and
inadequate. One way to avoid this would be to require that every AM
that wants to support parallel vacuuming supply this method, and if it
wants to just return sizeof(IndexBulkDeleteResult), then it can. But I
also think someone should modify one of the AMs to use a
differently-sized object, and then see whether they can really make
parallel vacuum work with this patch. If, as you speculated here, it
needs another API, then we should add both of them or neither. A
half-baked solution is worse than nothing at all.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Why JIT speed improvement is so modest?

2019-12-04 Thread Andres Freund
Hi,

On 2019-11-25 18:09:29 +0300, Konstantin Knizhnik wrote:
> I wonder why even at this query, which seems to be ideal use case for JIT,
> we get such modest improvement?

I think there's a number of causes:

1) There's bottlenecks elsewhere:
   - The order of sequential scan memory accesses is bad
 
https://www.postgresql.org/message-id/20161030073655.rfa6nvbyk4w2kkpk%40alap3.anarazel.de

 In my experiments, fixing that yields larger JIT improvements,
 because less time is spent stalling due to cache misses during
 tuple deforming (needing the tuple's natts at the start prevents
 out-of-order from hiding the relevant latency).


   - The transition function for floating point aggregates is pretty
 expensive. In particular, we compute the full youngs-cramer stuff
 for sum/avg, even though they aren't actually needed there. This
 has become measurably worse with
 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=e954a727f0c8872bf5203186ad0f5312f6183746
 In this case it's complicated enough apparently that the transition
 functions are too expensive to inline.

   - float4/8_accum use arrays to store the transition state. That's
  noticeably more expensive than just accessing a struct, partially
  because more checks need to be done. We really should move most,
 if not all, aggregates that use array transition states to
 "internal" type transition states. Probably with some reusable
 helpers to make it easier to write serialization / deserialization
 functions so we can continue to allow parallelism.

   - The per-row overhead on lower levels of the query is
 significant. E.g. in your profile the
  HeapTupleSatisfiesVisibility() calls (you'd largely get rid of this
  by freezing), and the hashtable overhead is quite noticeable. JITing
 expression eval doesn't fix that.

   ...


2) The code generated for JIT isn't that good. In particular, the
   external memory references included in the generated code limit the
   optimization potential quite substantially. There's also quite some
   (not just JIT) improvement potential related to the aggregation code,
   simplifying the generated expressions.

   See 
https://www.postgresql.org/message-id/20191023163849.sosqbfs5yenocez3%40alap3.anarazel.de
   for my attempt at improving the situation. It does measurably
   improve the situation for Q1, while still leaving a lot of further
   improvements to be done.  You'd be more than welcome to review some
   of that!


3) Plenty of crucial code is not JITed, even when expression
   related. Most crucial for Q1 is the fact that the hash computation
   for aggregates isn't JITed as a whole - when looking at hierarchical
   profiles, we spend about 1/3 of the whole query time within
   TupleHashTable*.

4) The currently required forming / deforming of tuples into minimal
   tuples when storing them in the hashagg table is *expensive*.

   We can address that partially by computing NOT NULL information for
   the tupledesc used for the hashtable (which will make JITed tuple
   deforming considerably faster, because it'll just be a reference to
   an hardcoded offset).

   We can also simplify the minimal tuple representation - historically
   it looks the way it does now because we needed minimal tuples to be
   largely compatible with heap tuples - but we don't anymore. Even just
   removing the weird offset math we do for minimal tuples would be
   beneficial, but I think we can do more than that.



> Vitesse DB reports 8x speedup on Q1,
> ISP-RAS JIT version  provides 3x speedup of Q1:

I think those measurements were done before a lot of generic
improvements to aggregation speed were done. E.g. Q1 performance
improved significantly due to the new expression evaluation engine, even
without JIT. Because the previous tree-walking expression evaluation was
so slow for many things, JITing that away obviously yielded bigger
improvements than it does now.


> VOPS provides 10x improvement of Q1.

My understanding of VOPS is that it ferries around more than one tuple
at a time. And avoids a lot of generic code paths. So that just doesn't
seem a meaningful comparison.


> In theory by elimination of interpretation overhead JIT should provide
> performance comparable with vectorized executor.

I don't think that's true at all. Vectorized execution, which I assume
to mean dealing with more than one tuple at a time, is largely
orthogonal to the way expressions are evaluated. The reason that
vectorized execution is good is that it drastically increases cache
locality (by performing work that accesses related data, e.g. a buffer
page, in a tight loop, without a lot of other work happening inbetween),
that it increases the benefits of out of order execution (by removing
dependencies, as e.g. predicates for multiple tuples can be computed,
without a separate dependency on the result for each predicate
evaluation), etc.

JIT compiled expres

more backtraces

2019-12-04 Thread Peter Eisentraut
In the previous discussions on backtrace support, some people asked for 
backtraces in more situations.  Here is a patch that prints backtraces 
on SIGABRT, SIGBUS, and SIGSEGV signals.  SIGABRT includes assertions 
and elog(PANIC).


Do signals work like this on Windows?  Do we need special EXEC_BACKEND 
support?


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From f790d208dc85a26585a2f5fb3042c29f8b3bbecc Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 4 Dec 2019 16:34:41 +0100
Subject: [PATCH] Print backtrace on SIGABRT, SIGBUS, SIGSEGV

This removes the backtrace code from the assertion handling, since
that is now also handled by the SIGABRT signal handler.
---
 src/backend/postmaster/postmaster.c | 30 +
 src/backend/utils/error/assert.c| 13 -
 2 files changed, 30 insertions(+), 13 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 9ff2832c00..fa4dc6772b 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -77,6 +77,10 @@
 #include 
 #include 
 
+#ifdef HAVE_EXECINFO_H
+#include <execinfo.h>
+#endif
+
 #ifdef HAVE_SYS_SELECT_H
 #include <sys/select.h>
 #endif
@@ -391,6 +395,7 @@ static void pmdie(SIGNAL_ARGS);
 static void reaper(SIGNAL_ARGS);
 static void sigusr1_handler(SIGNAL_ARGS);
 static void startup_die(SIGNAL_ARGS);
+static void abort_handler(int signum);
 static void dummy_handler(SIGNAL_ARGS);
 static void StartupPacketTimeoutHandler(void);
 static void CleanupBackend(int pid, int exitstatus);
@@ -646,6 +651,9 @@ PostmasterMain(int argc, char *argv[])
pqsignal_pm(SIGUSR1, sigusr1_handler);  /* message from child process */
pqsignal_pm(SIGUSR2, dummy_handler);  /* unused, reserve for children */
pqsignal_pm(SIGCHLD, reaper);   /* handle child termination */
+   pqsignal_pm(SIGABRT, abort_handler);
+   pqsignal_pm(SIGBUS, abort_handler);
+   pqsignal_pm(SIGSEGV, abort_handler);
 
/*
 * No other place in Postgres should touch SIGTTIN/SIGTTOU handling.  We
@@ -5333,6 +5341,28 @@ startup_die(SIGNAL_ARGS)
proc_exit(1);
 }
 
+/*
+ * Signal handler for SIGABRT, SIGBUS, SIGSEGV
+ *
+ * If supported, print stack trace, then continue with normal signal handler.
+ */
+static void
+abort_handler(int signum)
+{
+#ifdef HAVE_BACKTRACE_SYMBOLS
+   {
+   void   *buf[100];
+   int nframes;
+
+   nframes = backtrace(buf, lengthof(buf));
+   backtrace_symbols_fd(buf, nframes, fileno(stderr));
+   }
+#endif
+
+   pqsignal_pm(signum, SIG_DFL);
+   raise(signum);
+}
+
 /*
  * Dummy signal handler
  *
diff --git a/src/backend/utils/error/assert.c b/src/backend/utils/error/assert.c
index 1069bbee81..2050b4355d 100644
--- a/src/backend/utils/error/assert.c
+++ b/src/backend/utils/error/assert.c
@@ -18,9 +18,6 @@
 #include "postgres.h"
 
 #include <unistd.h>
-#ifdef HAVE_EXECINFO_H
-#include <execinfo.h>
-#endif
 
 /*
  * ExceptionalCondition - Handles the failure of an Assert()
@@ -45,16 +42,6 @@ ExceptionalCondition(const char *conditionName,
/* Usually this shouldn't be needed, but make sure the msg went out */
fflush(stderr);
 
-#ifdef HAVE_BACKTRACE_SYMBOLS
-   {
-   void   *buf[100];
-   int nframes;
-
-   nframes = backtrace(buf, lengthof(buf));
-   backtrace_symbols_fd(buf, nframes, fileno(stderr));
-   }
-#endif
-
 #ifdef SLEEP_ON_ASSERT
 
/*
-- 
2.24.0



Re: more backtraces

2019-12-04 Thread Andres Freund
Hi,

On 2019-12-04 20:45:25 +0100, Peter Eisentraut wrote:
> In the previous discussions on backtrace support, some people asked for
> backtraces in more situations.  Here is a patch that prints backtraces on
> SIGABRT, SIGBUS, and SIGSEGV signals.  SIGABRT includes assertions and
> elog(PANIC).

Hm. Can we really do that somewhat reliably like this? I'd suspect that
there'll be some oddities e.g. for stack overflows if done this way. To
my knowledge it's not a good idea to intercept SIGBUS/SIGSEGV without
using a separate signal stack (cf. sigaltstack) - but using a separate
stack could also make it harder to determine a correct backtrace?
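
For reference, a rough sketch of what running such a handler on an alternate
stack could look like, using plain POSIX sigaltstack()/sigaction()
(illustrative only, not part of the posted patch):

#include <signal.h>
#include <stdlib.h>
#include <string.h>

static void
install_crash_handler(void (*handler) (int, siginfo_t *, void *))
{
	stack_t		ss;
	struct sigaction sa;

	/* give the handler its own stack, so it survives a stack overflow */
	ss.ss_sp = malloc(SIGSTKSZ);
	ss.ss_size = SIGSTKSZ;
	ss.ss_flags = 0;
	if (ss.ss_sp == NULL || sigaltstack(&ss, NULL) != 0)
		abort();

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = handler;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = SA_SIGINFO | SA_ONSTACK;	/* run on the alternate stack */
	sigaction(SIGSEGV, &sa, NULL);
	sigaction(SIGBUS, &sa, NULL);
}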

It'd be bad if the addition of backtraces for SEGV/BUS suddenly made it
harder to attach a debugger and getting useful results. Even
disregarding the previous concerns, we'll get less useful debugger
interactions due to this, e.g. for things like null pointer derefs,
right?

Doing this for SIGABRT seems like a more clearly good case - by that
point we've already removed a few frames from the triggering code
anyway. So debugging experience won't suffer much. And I don't think
there's a corresponding issue with the stack potentially being
corrupted / not large enough.

- Andres




Re: [HACKERS] Block level parallel vacuum

2019-12-04 Thread Robert Haas
On Mon, Dec 2, 2019 at 2:26 PM Masahiko Sawada
 wrote:
> It's just an example, I'm not saying your idea is bad. ISTM the idea
> is good on an assumption that all indexes take the same time or take a
> long time so I'd also like to consider if this is true even in
> production and which approaches is better if we don't have such
> assumption.

I think his idea is good. You're not wrong when you say that there are
cases where it could work out badly, but I think on the whole it's a
clear improvement. Generally, the indexes should be of relatively
similar size because index size is driven by table size; it's true
that different AMs could result in different-size indexes, but it
seems like a stretch to suppose that the indexes that don't support
parallelism are also going to be the little tiny ones that go fast
anyway, unless we have some evidence that this is really true. I also
wonder whether we really need the option to disable parallel vacuum in
the first place. Maybe I'm not looking in the right place, but I'm not
finding anything in the way of comments or documentation explaining
why some AMs don't support it. It's an important design point, and
should be documented.

I also think PARALLEL_VACUUM_DISABLE_LEADER_PARTICIPATION seems like a
waste of space. For parallel queries, there is a trade-off between
having the leader do work (which can speed up the query) and having it
remain idle so that it can immediately absorb tuples from workers and
keep them from having their tuple queues fill up (which can speed up
the query). But here, at least as I understand it, there's no such
trade-off. Having the leader fail to participate is just a loser.
Maybe it's useful to test while debugging the patch, but why should
the committed code support it?

To respond to another point from a different part of the email chain,
the reason why LaunchParallelWorkers() does not take an argument for
the number of workers is because I believed that the caller should
always know how many workers they're going to want at the time they
CreateParallelContext(). Increasing it later is not possible, because
the DSM has already sized based on the count provided. I grant that it
would be possible to allow the number to be reduced later, but why
would you want to do that? Why not get the number right when first
creating the DSM?

Is there any legitimate use case for parallel vacuum in combination
with vacuum cost delay? As I understand it, any serious vacuuming is
going to be I/O-bound, so do you really need multiple workers at the
same time as you are limiting the I/O rate? Perhaps it's possible if
the I/O limit is so high that a single worker can't hit the limit by
itself, but multiple workers can, but it seems like a bad idea to
spawn more workers and then throttle them rather than just starting
fewer workers. In any case, the algorithm suggested in vacuumlazy.c
around the definition of VacuumSharedCostBalance seems almost the
opposite of what you probably want. The idea there seems to be that
you shouldn't make a worker sleep if it hasn't actually got to do
anything. Apparently the idea is that if you have 3 workers and you
only have enough I/O rate for 1 worker, you want all 3 workers to run
at once, so that the I/O is random, rather than having them run 1 at a
time, so that the I/O is sequential. That seems counterintuitive. It
could be right if the indexes are in different tablespaces, but if
they are in the same tablespace it's probably wrong. I guess it could
still be right if there's just so much I/O that you aren't going to
run out ever, and the more important consideration is that you don't
know which index will take longer to vacuum and so want to start them
all at the same time so that you don't accidentally start the slow one
last, but that sounds like a stretch. I think this whole area needs
more thought. I feel like we're trying to jam a go-slower feature and
a go-faster feature into the same box.

+ * vacuum and for performing index cleanup.  Note that all parallel workers
+ * live during either index vacuuming or index cleanup but the leader process
+ * neither exits from the parallel mode nor destroys the parallel context.
+ * For updating the index statistics, since any updates are not allowed during
+ * parallel mode we update the index statistics after exited from the parallel

The first of these sentences ("Note that all...") is not very clear to
me, and seems like it may amount to a statement that the leader
doesn't try to destroy the parallel context too early, but since I
don't understand it, maybe that's not what it is saying. The second
sentence needs exited -> exiting, and maybe some more work on the
grammar, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Dumping/restoring fails on inherited generated column

2019-12-04 Thread Peter Eisentraut

On 2019-12-04 15:14, Tom Lane wrote:

Run the regression tests with "make installcheck", then:

$ pg_dump -Fc regression >r.dump
$ createdb r2
$ pg_restore -d r2 r.dump
pg_restore: while PROCESSING TOC:
pg_restore: from TOC entry 6005; 2604 24821 DEFAULT gtest1_1 b postgres
pg_restore: error: could not execute query: ERROR:  column "b" of relation "gtest1_1" is a generated column
Command was: ALTER TABLE ONLY public.gtest1_1 ALTER COLUMN b SET DEFAULT (a * 2);


pg_restore: warning: errors ignored on restore: 1
$

It looks like gtest1_1 inherits column "b" from gtest1, so possibly
pg_dump is just confused about the combination of inheritance and
generated columns.


Yeah, there was some stuff about the "separate" dumping of defaults that 
I apparently forgot to address.  The attached patch fixes it.  I'll see 
about adding a test case for it, too.



I see this in v12 as well as HEAD.  One interesting question is how come
the pg_upgrade test isn't failing --- maybe binary-upgrade mode handles
this case differently?


Binary upgrade dumps out even inherited columns, so it won't run into 
the "separate" case that's the issue here.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 0ef24c98edf468ebe906c5b04e3ddf9e53bda02f Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 4 Dec 2019 21:14:19 +0100
Subject: [PATCH v1] Fix dumping of inherited generated columns

---
 src/bin/pg_dump/pg_dump.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 33e58fa287..87b3c5b09c 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -8515,6 +8515,11 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
{
attrdefs[j].separate = true;
}
+   else if (tbinfo->attgenerated[adnum - 1])
+   {
+   /* generated columns only go with the root table */
+   attrdefs[j].separate = false;
+   }
else if (!shouldPrintColumn(dopt, tbinfo, adnum - 1))
{
/* column will be suppressed, print default separately */
-- 
2.24.0



Re: more backtraces

2019-12-04 Thread Peter Eisentraut

On 2019-12-04 20:59, Andres Freund wrote:

On 2019-12-04 20:45:25 +0100, Peter Eisentraut wrote:

In the previous discussions on backtrace support, some people asked for
backtraces in more situations.  Here is a patch that prints backtraces on
SIGABRT, SIGBUS, and SIGSEGV signals.  SIGABRT includes assertions and
elog(PANIC).


Hm. Can we really do that somewhat reliably like this?


I've seen reputable programs that do all kinds of things in SIGSEGV 
handlers, including running user-defined programs, without taking any 
special precautions.  So it seems possible in general.



I'd suspect that
there'll be some oddities e.g. for stack overflows if done this way. To
my knowledge it's not a good idea to intercept SIGBUS/SIGSEGV without
using a separate signal stack (cf. sigaltstack) - but using a separate
stack could also make it harder to determine a correct backtrace?


Didn't know about that, but seems useful.  I'll look into it.


It'd be bad if the addition of backtraces for SEGV/BUS suddenly made it
harder to attach a debugger and getting useful results. Even
disregarding the previous concerns, we'll get less useful debugger
interactions due to this, e.g. for things like null pointer derefs,
right?


The backtrace, and the level of detail I get when jumping around between 
frames in lldb, look the same as without this.  But it might depend.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Dumping/restoring fails on inherited generated column

2019-12-04 Thread Tom Lane
Peter Eisentraut  writes:
> On 2019-12-04 15:14, Tom Lane wrote:
>> It looks like gtest1_1 inherits column "b" from gtest1, so possibly
>> pg_dump is just confused about the combination of inheritance and
>> generated columns.

> Yeah, there was some stuff about the "separate" dumping of defaults that 
> I apparently forgot to address.  The attached patch fixes it.  I'll see 
> about adding a test case for it, too.

I don't think this is right.  It will probably misbehave if the
"generated" expression has any interesting dependencies:

1. You didn't duplicate the behavior of the existing separate=false
case, where it adds a dependency to try to force the default's
dependencies to exist before the table is created.

2. If that dependency turns out to create a dependency loop, the
later code will break the loop by setting separate=true anyway.
Then what?

I also find it improbable that overriding the !shouldPrintColumn
test is really the right thing.  That test is what distinguishes
the is-a-parent-table from the is-a-child-table cases, and the
core of the issue here seems to be that we need to treat those
differently.

I wonder if the right fix is to not generate a DO_ATTRDEF
object at all for generated columns in child tables.  Am
I right in guessing that we propagate generated-ness to
child tables automatically?

regards, tom lane




Re: more backtraces

2019-12-04 Thread Tom Lane
Andres Freund  writes:
> It'd be bad if the addition of backtraces for SEGV/BUS suddenly made it
> harder to attach a debugger and getting useful results.

Yeah.  TBH, I'm not sure I want this, at least not in debug builds.

regards, tom lane




Re: adding strndup

2019-12-04 Thread Alvaro Herrera
On 2019-Dec-04, Alvaro Herrera wrote:

> On 2019-Dec-04, Tom Lane wrote:
> 
> > Andres Freund  writes:
> > > On 2019-12-04 11:40:21 -0300, Alvaro Herrera wrote:
> > >> I think this should be pretty uncontroversial, but wanted to give a
> > >> heads-up outside that thread.  I attach the patch here for completeness.
> > 
> > > I'd just provide pnstrdup() in the frontend, without adding strndup().
> > 
> > +1 --- seems like a bunch more mechanism than is warranted.  Let's
> > just open-code it in pnstrdup.  We can rely on strnlen, since that's
> > already supported, and there's not much more there beyond that.
> 
> I can get behind that ... it makes the patch a lot smaller.

Here it is.

I noticed that ECPG's copy was setting errno.  I had forgotten to do that
in my previous patch, but on second look, malloc failure already sets
it, so doing it again is pointless.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From c4dc0fd76e86bf97cb612bafe30847c47b77813e Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 4 Dec 2019 11:02:34 -0300
Subject: [PATCH v4] Offer pnstrdup to frontend code

We already had it on the backend.  Frontend can also use it now.

Discussion: https://postgr.es/m/20191204144021.GA17976@alvherre.pgsql
---
 src/bin/pg_waldump/pg_waldump.c  |  3 +--
 src/bin/psql/prompt.c| 16 +-
 src/bin/scripts/common.c |  3 +--
 src/common/fe_memutils.c | 27 
 src/include/common/fe_memutils.h |  1 +
 src/interfaces/ecpg/compatlib/informix.c | 21 +-
 6 files changed, 36 insertions(+), 35 deletions(-)

diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 30a5851d87..a05fbe6938 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -114,8 +114,7 @@ split_path(const char *path, char **dir, char **fname)
 	/* directory path */
 	if (sep != NULL)
 	{
-		*dir = pg_strdup(path);
-		(*dir)[(sep - path) + 1] = '\0';	/* no strndup */
+		*dir = pnstrdup(path, sep - path);
 		*fname = pg_strdup(sep + 1);
 	}
 	/* local directory */
diff --git a/src/bin/psql/prompt.c b/src/bin/psql/prompt.c
index 41c6f21ecf..1199181521 100644
--- a/src/bin/psql/prompt.c
+++ b/src/bin/psql/prompt.c
@@ -270,13 +270,10 @@ get_prompt(promptStatus_t status, ConditionalStack cstack)
 	/* execute command */
 case '`':
 	{
-		FILE	   *fd;
-		char	   *file = pg_strdup(p + 1);
-		int			cmdend;
+		int			cmdend = strcspn(p + 1, "`");
+		char	   *file = pnstrdup(p + 1, cmdend);
+		FILE	   *fd = popen(file, "r");
 
-		cmdend = strcspn(file, "`");
-		file[cmdend] = '\0';
-		fd = popen(file, "r");
 		if (fd)
 		{
 			if (fgets(buf, sizeof(buf), fd) == NULL)
@@ -295,13 +292,10 @@ get_prompt(promptStatus_t status, ConditionalStack cstack)
 	/* interpolate variable */
 case ':':
 	{
-		char	   *name;
+		int			nameend = strcspn(p + 1, ":");
+		char	   *name = pnstrdup(p + 1, nameend);
 		const char *val;
-		int			nameend;
 
-		name = pg_strdup(p + 1);
-		nameend = strcspn(name, ":");
-		name[nameend] = '\0';
 		val = GetVariable(pset.vars, name);
 		if (val)
 			strlcpy(buf, val, sizeof(buf));
diff --git a/src/bin/scripts/common.c b/src/bin/scripts/common.c
index d2a7547441..d919c1cc0b 100644
--- a/src/bin/scripts/common.c
+++ b/src/bin/scripts/common.c
@@ -352,8 +352,7 @@ splitTableColumnsSpec(const char *spec, int encoding,
 		else
 			cp += PQmblen(cp, encoding);
 	}
-	*table = pg_strdup(spec);
-	(*table)[cp - spec] = '\0'; /* no strndup */
+	*table = pnstrdup(spec, cp - spec);
 	*columns = cp;
 }
 
diff --git a/src/common/fe_memutils.c b/src/common/fe_memutils.c
index ce99b4f4da..2bc6606b80 100644
--- a/src/common/fe_memutils.c
+++ b/src/common/fe_memutils.c
@@ -142,6 +142,33 @@ pstrdup(const char *in)
 	return pg_strdup(in);
 }
 
+char *
+pnstrdup(const char *in, Size size)
+{
+	char	   *tmp;
+	int			len;
+
+	if (!in)
+	{
+		fprintf(stderr,
+_("cannot duplicate null pointer (internal error)\n"));
+		exit(EXIT_FAILURE);
+	}
+
+	len = strnlen(in, size);
+	tmp = malloc(len + 1);
+	if (tmp == NULL)
+	{
+		fprintf(stderr, _("out of memory\n"));
+		exit(EXIT_FAILURE);
+	}
+
+	memcpy(tmp, in, len);
+	tmp[len] = '\0';
+
+	return tmp;
+}
+
 void *
 repalloc(void *pointer, Size size)
 {
diff --git a/src/include/common/fe_memutils.h b/src/include/common/fe_memutils.h
index a1e5940d31..3181ee17dd 100644
--- a/src/include/common/fe_memutils.h
+++ b/src/include/common/fe_memutils.h
@@ -31,6 +31,7 @@ extern void pg_free(void *pointer);
 
 /* Equivalent functions, deliberately named the same as backend functions */
 extern char *pstrdup(const char *in);
+extern char *pnstrdup(const char *in, Size size);
 extern void *palloc(Size size);
 extern void *palloc0(Size size);
 extern void *p

Re: adding strndup

2019-12-04 Thread Tom Lane
Alvaro Herrera  writes:
>> I can get behind that ... it makes the patch a lot smaller.

> Here it is.

LGTM.

> I noticed that ECPG's copy was setting errno.  I had forgot to do that
> in my previous patch, but on second look, malloc failure already sets
> it, so doing it again is pointless.

Right.

regards, tom lane




about allow_system_table_mods and SET STATISTICS

2019-12-04 Thread Peter Eisentraut
Until PostgreSQL 9.1, it was possible to run ALTER TABLE ... SET 
STATISTICS without allow_system_table_mods.  In PostgreSQL 9.2 and 
later, this no longer works.  This change was apparently accidental.  (I 
gave up after a while trying to bisect it exactly, but probably 
something related to 1489e2f26a4c0318938b3085f50976512f321d84.)


(It didn't work on mapped relations, so it wasn't all roses.)

A comment in ATPrepSetStatistics() still makes references to this being 
possible.  I propose to remove this comment.


There was some discussion about (re-)allowing this and some other 
commands like this, but after the recent changes to make 
allow_system_table_mods easier to use, I think this has less urgency, so 
I'd rather get the comment correct in the meantime.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 203571a0402face9feb92a058d8423317501545e Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 4 Dec 2019 23:46:00 +0100
Subject: [PATCH] Remove comment about SET STATISTICS on system tables

It was once possible to do ALTER TABLE ... SET STATISTICS on system
tables without allow_system_table_mods.  This was changed apparently by
accident between PostgreSQL 9.1 and 9.2, but a code comment still
claimed this was possible.  Remove that comment.  (Maybe someone wants
to revive that functionality, but not right now.)
---
 src/backend/commands/tablecmds.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 5440eb9015..55b9b63336 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -6707,10 +6707,8 @@ static void
ATPrepSetStatistics(Relation rel, const char *colName, int16 colNum, Node *newValue, LOCKMODE lockmode)
 {
/*
-* We do our own permission checking because (a) we want to allow SET
-* STATISTICS on indexes (for expressional index columns), and (b) we want
-* to allow SET STATISTICS on system catalogs without requiring
-* allowSystemTableMods to be turned on.
+* We do our own permission checking because we want to allow SET
+* STATISTICS on indexes (for expressional index columns).
 */
if (rel->rd_rel->relkind != RELKIND_RELATION &&
rel->rd_rel->relkind != RELKIND_MATVIEW &&
-- 
2.24.0



Re: Update minimum SSL version

2019-12-04 Thread Peter Eisentraut

On 2019-12-02 16:13, Tom Lane wrote:

Peter Eisentraut  writes:

On 2019-11-30 04:06, Tom Lane wrote:

I think the real question we have to answer is this: are we intent on
making people upgrade ancient openssl installations?



The trade-off is that this makes the defaults better for the vast
majority of users and gives users of really old systems a nudge that
they are no longer in compliance with industry best practices.  You need
manual steps to set up SSL anyway, so this doesn't introduce an entirely
new kind of requirement for the latter group of users.


True.  I'm okay with this as long as we adapt the ssl test suite as
per your other reply.


I have committed this with that change.  The discussion on which OpenSSL 
versions to support and how will continue.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: about allow_system_table_mods and SET STATISTICS

2019-12-04 Thread Tom Lane
Peter Eisentraut  writes:
> Until PostgreSQL 9.1, it was possible to run ALTER TABLE ... SET 
> STATISTICS without allow_system_table_mods.  In PostgreSQL 9.2 and 
> later, this no longer works.  This change was apparently accidental.  (I 
> gave up after a while trying to bisect it exactly, but probably 
> something related to 1489e2f26a4c0318938b3085f50976512f321d84.)
> (It didn't work on mapped relations, so it wasn't all roses.)

> A comment in ATPrepSetStatistics() still makes references to this being 
> possible.  I propose to remove this comment.
> There was some discussion about (re-)allowing this and some other 
> commands like this, but after the recent changes to make 
> allow_system_table_mods easier to use, I think this has less urgency, so 
> I'd rather get the comment correct in the meantime.

Seems reasonable.  The argument for making this an exception to
allow_system_table_mods was always more about expediency than logical
cleanliness.  After the recent changes I think it's okay to remove the
special case (especially since nobody has griped about it being broken).

However ... if we're not going to have that special case, couldn't
we get rid of the whole business of having a special permissions test?
What is it that ATSimplePermissions can't handle here?  The business
about requiring a colName certainly doesn't need to be done before the
ownership check (in fact, it could be left to execution, so we don't need
a prep function at all anymore).

regards, tom lane




RE: Proposal: Add more compile-time asserts to expose inconsistencies.

2019-12-04 Thread Smith, Peter
-Original Message-
From: Andres Freund  Sent: Tuesday, 3 December 2019 2:56 AM
> +StaticAssertDecl(lengthof(LockTagTypeNames) == (LOCKTAG_ADVISORY + 1),
> +  "LockTagTypeNames array inconsistency");
> +
>
>These error messages strike me as somewhat unhelpful. I'd probably just reword 
>them as "array length mismatch" or something like that.

OK.  I have no problem modifying all my current assertion messages to use your 
suggested text ("array length mismatch") if you think it is better.

Please correct me if I am wrong, but I didn't think the error message text is 
of very great significance here, because this is a compile-time issue: the 
*only* person who will ever see the message is the one developer who 
accidentally introduced a bug just moments beforehand. The compilation will 
fail with a source line number, and when the developer looks at the 
StaticAssertDecl at that line, the cause of the error is self-evident from the 
condition parameter.
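
For what it's worth, a self-contained sketch of that failure mode (the enum
and array here are made up purely for illustration; StaticAssertDecl and
lengthof are as used in the proposed patch):

```
#include "postgres.h"	/* lengthof; StaticAssertDecl per the proposed patch */

/* A made-up enum purely for illustration. */
typedef enum WidgetKind
{
	WIDGET_FOO,
	WIDGET_BAR,
	WIDGET_BAZ					/* keep this as the last value */
} WidgetKind;

/* One name per WidgetKind value. */
static const char *const WidgetKindNames[] = {
	"foo",
	"bar",
	"baz"
};

/*
 * If someone adds a WidgetKind value without extending WidgetKindNames,
 * the build fails right here, so the message text is only ever seen by
 * the developer who just introduced the mismatch.
 */
StaticAssertDecl(lengthof(WidgetKindNames) == (WIDGET_BAZ + 1),
				 "array length mismatch");
```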

Kind Regards
--
Peter Smith
Fujitsu Australia





Re: Implementing Incremental View Maintenance

2019-12-04 Thread Yugo Nagata
On Wed, 4 Dec 2019 21:18:02 +0900
nuko yokohama  wrote:

> Hi.
> 
> I found the problem after running "ALTER MATERIALIZED VIEW ... RENAME TO".
> If a view created with "CREATE INCREMENTAL MATERIALIZED VIEW" is renamed,
> subsequent INSERT operations to the base table will fail.
> 
> Error message.
> ```
> ERROR:  could not open relation with OID 0

Thank you for pointing out this issue! This error occurs
because the view's OID is retrieved using the view name.
Considering that the name can be changed, this is obviously
wrong. We'll fix it.
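
For illustration, a minimal sketch of the failing pattern (this is not the
IVM patch's actual code; saved_name and saved_namespace stand in for whatever
the patch keeps around):

```
#include "postgres.h"
#include "access/table.h"
#include "catalog/namespace.h"

/*
 * Illustration only: if the maintenance code re-resolves the matview by the
 * name remembered at CREATE time, ALTER MATERIALIZED VIEW ... RENAME TO makes
 * get_relname_relid() return InvalidOid (0), and table_open() then fails
 * with "could not open relation with OID 0".
 */
static Relation
open_matview_by_saved_name(const char *saved_name, Oid saved_namespace)
{
	Oid			matviewOid;

	matviewOid = get_relname_relid(saved_name, saved_namespace);
	return table_open(matviewOid, RowExclusiveLock);	/* errors after RENAME */
}
```

Remembering the view's OID itself (or resolving it through a catalog or
dependency entry) avoids the problem, since OIDs survive a rename.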

Regards,
Yugo Nagata

> ```
> 
> Execution log.
> ```
> [ec2-user@ip-10-0-1-10 ivm]$ psql -U postgres test -e -f
> ~/test/ivm/alter_rename_bug.sql
> DROP TABLE IF EXISTS table_x CASCADE;
> psql:/home/ec2-user/test/ivm/alter_rename_bug.sql:1: NOTICE:  drop cascades
> to materialized view group_imv
> DROP TABLE
> CREATE TABLE table_x AS
>SELECT generate_series(1, 1) AS id,
>ROUND(random()::numeric * 100, 2) AS data,
>CASE (random() * 5)::integer
>  WHEN 4 THEN 'group-a'
>  WHEN 3 THEN 'group-b'
>  ELSE 'group-c'
>END AS part_key
> ;
> SELECT 1
>Table "public.table_x"
>   Column  |  Type   | Collation | Nullable | Default
> --+-+---+--+-
>  id   | integer |   |  |
>  data | numeric |   |  |
>  part_key | text|   |  |
> 
> DROP MATERIALIZED VIEW IF EXISTS group_imv;
> psql:/home/ec2-user/test/ivm/alter_rename_bug.sql:15: NOTICE:  materialized
> view "group_imv" does not exist, skipping
> DROP MATERIALIZED VIEW
> CREATE INCREMENTAL MATERIALIZED VIEW group_imv AS
> SELECT part_key, COUNT(*), MAX(data), MIN(data), SUM(data), AVG(data)
> FROM table_x
> GROUP BY part_key;
> SELECT 3
>  List of relations
>  Schema |   Name|   Type|  Owner
> +---+---+--
>  public | group_imv | materialized view | postgres
>  public | table_x   | table | postgres
> (2 rows)
> 
>  Materialized view "public.group_imv"
>   Column   |  Type   | Collation | Nullable | Default
> ---+-+---+--+-
>  part_key  | text|   |  |
>  count | bigint  |   |  |
>  max   | numeric |   |  |
>  min   | numeric |   |  |
>  sum   | numeric |   |  |
>  avg   | numeric |   |  |
>  __ivm_count_max__ | bigint  |   |  |
>  __ivm_count_min__ | bigint  |   |  |
>  __ivm_count_sum__ | bigint  |   |  |
>  __ivm_count_avg__ | bigint  |   |  |
>  __ivm_sum_avg__   | numeric |   |  |
>  __ivm_count__ | bigint  |   |  |
> 
> SELECT * FROM group_imv ORDER BY part_key;
>  part_key | count |  max  | min  |sum| avg
> --+---+---+--+---+-
>  group-a  |  1966 | 99.85 | 0.05 |  98634.93 | 50.1703611393692777
>  group-b  |  2021 | 99.99 | 0.17 | 102614.02 | 50.7738842157347848
>  group-c  |  6013 | 99.99 | 0.02 | 300968.43 | 50.0529569266589057
> (3 rows)
> 
> ALTER MATERIALIZED VIEW group_imv RENAME TO group_imv2;
> ALTER MATERIALIZED VIEW
>  List of relations
>  Schema |Name|   Type|  Owner
> ++---+--
>  public | group_imv2 | materialized view | postgres
>  public | table_x| table | postgres
> (2 rows)
> 
> Materialized view "public.group_imv2"
>   Column   |  Type   | Collation | Nullable | Default
> ---+-+---+--+-
>  part_key  | text|   |  |
>  count | bigint  |   |  |
>  max   | numeric |   |  |
>  min   | numeric |   |  |
>  sum   | numeric |   |  |
>  avg   | numeric |   |  |
>  __ivm_count_max__ | bigint  |   |  |
>  __ivm_count_min__ | bigint  |   |  |
>  __ivm_count_sum__ | bigint  |   |  |
>  __ivm_count_avg__ | bigint  |   |  |
>  __ivm_sum_avg__   | numeric |   |  |
>  __ivm_count__ | bigint  |   |  |
> 
> SET client_min_messages = debug5;
> psql:/home/ec2-user/test/ivm/alter_rename_bug.sql:30: DEBUG:
>  CommitTransaction(1) name: unnamed; blockState: STARTED; state:
> INPROGRESS, xid/subid/cid: 0/1/0
> SET
> INSERT INTO table_x VALUES (1001, ROUND(random()::numeric * 100, 2),
> 'gruop_d');
> psql:/home/ec2-user/test/ivm/alter_rename_bug.sql:33: DEBUG:
>  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS,
> xid/subid/cid: 0/1/0
> psql:/home/ec2-user/tes

Re: Memory-Bounded Hash Aggregation

2019-12-04 Thread Melanie Plageman
On Thu, Nov 28, 2019 at 9:47 AM Tomas Vondra 
wrote:

> On Wed, Nov 27, 2019 at 02:58:04PM -0800, Jeff Davis wrote:
> >On Wed, 2019-08-28 at 12:52 -0700, Taylor Vesely wrote:
> >> Right now the patch always initializes 32 spill partitions. Have you
> >> given
> >> any thought into how to intelligently pick an optimal number of
> >> partitions yet?
> >
> >Attached a new patch that addresses this.
> >
> >1. Divide hash table memory used by the number of groups in the hash
> >table to get the average memory used per group.
> >2. Multiply by the number of groups spilled -- which I pessimistically
> >estimate as the number of tuples spilled -- to get the total amount of
> >memory that we'd like to have to process all spilled tuples at once.
>
> Isn't the "number of tuples = number of groups" estimate likely to be
> way too pessimistic? IIUC the consequence is that it pushes us to pick
> more partitions than necessary, correct?


> Could we instead track how many tuples we actually consumed for the
> in-memory groups, and then use this information to improve the estimate
> of number of groups? I mean, if we know we've consumed 1000 tuples which
> created 100 groups, then we know there's ~1:10 ratio.
>

What would the cost be of having many small partitions? Some of the
spill files created may not be used if the estimate was pessimistic,
but that seems better than the alternative of re-spilling, since every
spill writes every tuple again.
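
For reference, here is a rough sketch of the estimate being discussed (the
names and clamp values are illustrative, not the patch's exact code):

```
#include "postgres.h"
#include <math.h>

/*
 * Sketch of the partition-count estimate described upthread: derive the
 * average memory per group from the in-memory hash table, pessimistically
 * treat every spilled tuple as a new group, and pick enough partitions that
 * each one should fit within work_mem when reloaded.
 */
static int
choose_num_spill_partitions(Size hash_mem_used, double ngroups_in_table,
							double ntuples_spilled, int work_mem_kb)
{
	double		mem_per_group = hash_mem_used / Max(ngroups_in_table, 1.0);
	double		mem_wanted = mem_per_group * ntuples_spilled;
	int			npartitions;

	npartitions = (int) ceil(mem_wanted / (work_mem_kb * 1024.0));

	/* clamp, in the spirit of HASH_MIN_PARTITIONS / HASH_MAX_PARTITIONS */
	npartitions = Max(npartitions, 4);
	npartitions = Min(npartitions, 256);

	return npartitions;
}
```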

Also, number of groups = number of tuples is only for re-spilling.
This is a little bit unclear from the variable naming.

It looks like the parameter input_tuples passed to hash_spill_init()
in lookup_hash_entries() is the number of groups estimated by planner.
However, when reloading a spill file, if we run out of memory and
re-spill, hash_spill_init() is passed batch->input_groups (which is
actually set from input_ngroups which is the number of tuples in the
spill file). So, input_tuples is groups and input_groups is
input_tuples. It may be helpful to rename this.


>
> 4) I'm not sure I agree with this reasoning that HASH_PARTITION_FACTOR
> making the hash tables smaller is desirable - it may be, but if that was
> generally the case we'd just use small hash tables all the time. It's a
> bit annoying to give user the capability to set work_mem and then kinda
> override that.
>
>   * ... Another benefit of having more, smaller partitions is that small
>   * hash tables may perform better than large ones due to memory caching
>   * effects.
>
>
So, it looks like the HASH_PARTITION_FACTOR is only used when
re-spilling. The initial hashtable will use work_mem.
It seems like the reason for using it when re-spilling is to be very
conservative to avoid more than one re-spill and make sure each spill
file fits in a hashtable in memory.
The comment does seem to point to some other reason, though...


>
> 11) The hash_spill_npartitions naming seems a bit confusing, because it
> seems to imply it's about the "spill" while in practice it just chooses the
> number of spill partitions. Maybe hash_choose_num_spill_partitions would
> be better?
>
>
Agreed that a name with "choose" or "calculate" as the verb would be
more clear.


>
> 12) It's not clear to me why we need HASH_MAX_PARTITIONS? What's the
> reasoning behind the current value (256)? Not wanting to pick too many
> partitions? Comment?
>
>  if (npartitions > HASH_MAX_PARTITIONS)
>  npartitions = HASH_MAX_PARTITIONS;
>
>
256 actually seems very large. hash_spill_npartitions() will be called
for every re-spill, so HASH_MAX_PARTITIONS is not the total number of
spill files permitted; rather, it is the number of spill files created
by a given spill (a spill set). So if you made X partitions initially
and every partition re-spills, you would now have (at most) X * 256
partitions.
If HASH_MAX_PARTITIONS is 256, wouldn't the metadata from the spill
files take up a lot of memory at that point?

Melanie & Adam Lee


Re: could not stat promote trigger file leads to shutdown

2019-12-04 Thread Kyotaro Horiguchi
At Wed, 4 Dec 2019 11:52:33 +0100, Peter Eisentraut 
 wrote in 
> On 2019-11-20 16:21, Tom Lane wrote:
> >> AFAICT, a GUC check hook wouldn't actually be able to address the
> >> specific scenario I described.  At the time the GUC is set, the
> >> containing directory of the trigger file does not exist yet.  This
> >> is currently not an error.  The problem only happens if after the GUC
> >> is
> >> set the containing directory appears and is not readable.
> > True, if the hook just consists of trying fopen() and checking the
> > errno.  Would it be feasible to insist that the containing directory
> > exist and be readable?  We have enough infrastructure that that
> > should only take a few lines of code, so the question is whether
> > or not that's a nicer behavior than we have now.
> 
> Is it possible to do this in a mostly bullet-proof way?  Just because
> the directory exists and looks pretty good otherwise, doesn't mean we
> can read a file created in it later in a way that doesn't fall afoul
> of the existing error checks.  There could be something like SELinux
> lurking, for example.
> 
> Maybe some initial checking would be useful, but I think we still need
> to downgrade the error check at use time a bit to not crash in the
> cases that we miss.

+1. Any GUC variable that points to external or externally-modifiable
resources, including directories, files, and commands, can face that kind
of problem. For example, a bogus value for archive_command doesn't
prevent the server from starting. I understand that the reason is that we
don't have a reliable means to check the command before we actually
execute it, but the server can (or should) continue running even if it
fails. I think this issue falls into that category.
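
For concreteness, a rough sketch of the kind of check hook being debated
(illustrative only; even with such a check, failures at use time still need
to be handled gracefully):

```
#include "postgres.h"

#include <unistd.h>

#include "utils/guc.h"

/*
 * Sketch of a check hook for promote_trigger_file: require that the
 * containing directory exists and is accessible at assignment time.
 * This cannot catch everything (SELinux, directories created later, ...),
 * so later failures at use time must still be treated as soft errors.
 */
static bool
check_promote_trigger_file(char **newval, void **extra, GucSource source)
{
	char	   *dir;

	if (*newval == NULL || (*newval)[0] == '\0')
		return true;			/* unset is always acceptable */

	dir = pstrdup(*newval);
	get_parent_directory(dir);	/* strips the last path component */

	if (access(dir[0] ? dir : ".", R_OK | X_OK) != 0)
	{
		GUC_check_errdetail("Directory \"%s\" is not accessible.",
							dir[0] ? dir : ".");
		pfree(dir);
		return false;
	}

	pfree(dir);
	return true;
}
```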

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Update minimum SSL version

2019-12-04 Thread Michael Paquier
On Mon, Dec 02, 2019 at 02:09:51PM +0100, Daniel Gustafsson wrote:
> However, looking at the signatures detected by autoconf we can get an
> idea of which version is used.  SSL_clear_options and X509_get_signature_nid()
> first shipped in 1.0.2, while SSL_get_current_compression first shipped in
> 0.9.8.  There are also a set of functions which are new in 1.1.0 (BIO_get_data
> et.al).

I was just looking at this problem, and something does not match with
what you wrote here.  SSL_clear_options() is defined in OpenSSL from
0.9.8 to 1.0.2 as a macro (see ssl/ssl.h), and is defined as a
function since 1.1.0.  So it seems to me that we are able to correctly
detect the presence of this function in the configure checks if
building with 1.1.0~, but not other versions.

In LibreSSL, the code has visibly always used a macro, even on their
latest HEAD since the code has been forked from OpenSSL 1.0.1g:
https://github.com/libressl-portable/openbsd.  So we should be able
to compile our code, yet we fail to detect that we can use the
macro.
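
To illustrate the point (this is not PostgreSQL's actual configure test, just
a sketch): a plain symbol check such as AC_CHECK_FUNCS only sees the function
that exists in 1.1.0 and later, whereas a compile-and-link test that actually
calls the name works with both the pre-1.1.0 macro and the newer function:

```
/* Hypothetical conftest-style program, for illustration only. */
#include <openssl/ssl.h>

int
main(void)
{
	SSL_CTX	   *ctx = SSL_CTX_new(SSLv23_method());
	SSL		   *ssl = SSL_new(ctx);

	/*
	 * Compiles and links whether SSL_clear_options is a macro (0.9.8-1.0.2,
	 * LibreSSL) or a real function (OpenSSL 1.1.0 and later).
	 */
	SSL_clear_options(ssl, SSL_OP_NO_SSLv3);
	return 0;
}
```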

It seems to me that we have quite a few arguments in favor of
dropping this configure check altogether.  (I saw the business
around a364dfa as well regarding NetBSD 5.1).

We can do more cleanup, and the discussion is quite different than the
original intent of this thread, so I am going to create a new one on
the matter.
--
Michael




Re: Update minimum SSL version

2019-12-04 Thread Tom Lane
Peter Eisentraut  writes:
> On 2019-12-04 13:53, Tom Lane wrote:
>> So, what exactly are we going to set as the new minimum version in
>> each case?  I'll have to go update my trailing-edge-Johnnie buildfarm
>> critters, and it'd make sense to have them continue to test the
>> oldest nominally-supported versions.
>> 
>> For OpenSSL it seems like 1.0.1a is the target, per the above
>> discussion.
>> 
>> For Python, I'll just observe that RHEL6 ships 2.6.6, so we can't
>> bump up to 2.7.

> Yes, it would be Python 2.6.

So the upshot, after a fair amount of hair-pulling, is

* Somebody maybe should be testing openssl 1.0.1, but it won't be
me, because neither 1.0.1 nor 1.0.1a will even build on non-Intel
platforms.  After closer study of their release notes, I've settled
on 1.0.1e as being the best compromise between being old and not
having unreasonable teething pains.  (I wonder how coincidental
it is that that's also what Red Hat is now shipping in RHEL6.)
I've successfully installed 1.0.1e on prairiedog and gaur, so
I can flip them to start building HEAD with that whenever we
break compatibility with 0.9.8.

* Python 2.6.x also suffered from an unreasonable amount of
teething pains --- 2.6.2 is the oldest version that seems
to know how to build a shared library on Darwin.  I've now
got a reasonably functional 2.6 on gaur and 2.6.2 on prairiedog,
and again will adjust those buildfarm members to use those
installations when/if our support for their current versions
goes away.

regards, tom lane




Re: pg_upgrade fails with non-standard ACL

2019-12-04 Thread Michael Paquier
On Wed, Dec 04, 2019 at 06:15:52PM +0900, Arthur Zakirov wrote:
> On 2019/12/04 17:15, Michael Paquier wrote:
>> FWIW, I am not much a fan of that part because the output generated by
>> the description is most likely not compatible with the grammar
>> supported.
> 
> Ah, I thought that pg_identify_object() gives properly quoted identity, and
> it could be used to make SQL script.

It depends on the object type.  For columns I can see in your patch
that you are using a dedicated code path, but I don't think that we
should assume an exact one-to-one mapping between the object type and
the GRANT/REVOKE grammar supported for it, because the two are
completely independent facilities.  Take foreign data wrappers, for
example.  Of course, for this patch that depends on whether we have
system objects of a type which would be impacted.  That's not the case
for foreign data wrappers and likely never will be, but there is no way
to be sure that this won't get broken silently in the future.

> Sure! But I'm not sure that I understood the idea. Do you mean the patch
> which adds a TAP test? It can initialize two clusters; in the first it renames
> some system objects and changes their ACLs. And finally the test runs
> pg_upgrade which will fail.

A TAP test won't help here because the idea is to create a patch for
HEAD which deliberately introduces changes for system objects, where the
source binaries have ACLs on object types which are broken on the
target binaries with the custom patch.  That's to make sure that all
the logic which would get added to pg_upgrade is working correctly.
Other ideas are welcome.
--
Michael




Re: Increase footprint of %m and reduce strerror()

2019-12-04 Thread Michael Paquier
On Wed, Dec 04, 2019 at 03:32:11PM +0900, Kyotaro Horiguchi wrote:
> It sounds good to me.  Message unification (including printf) needs
> somehow treating trailing new lines, though.  About translation
> burden, I'm not sure how the message unification eases translators'
> work. Identical messages of different commands appear having different
> neighbours in different po files.

Newlines are a problem.  Still there are cases where we don't use
them.  See for example pg_waldump.c.  It seems like it would be first
interesting to fix the code paths where we know we can reduce the
duplicates.

> By the way aren't we going to have ereport on frontend?

Not sure that this will happen, there are quite a few things to
consider related to what error hints and such should be for frontends.
That's quite a different discussion..
--
Michael




Re: Session WAL activity

2019-12-04 Thread Kyotaro Horiguchi
Hi.

At Wed, 4 Dec 2019 16:40:27 +0300, Konstantin Knizhnik 
 wrote in 
> 
> 
> On 04.12.2019 8:33, Kyotaro Horiguchi wrote:
> > It seems to be useful to me. We also might want statistics of other
> > session IOs.  In that case the table name would be
> > "pg_stat_session/process_activity". We are aleady collecting most
> > kinds of the IO activity but it loses session information...
> 
> Well, actually monitoring disk activity for the particular
> backend/session can be easily done using some external tools
> (just because now in Postgres session=backend=process). So we can
> monitor IO of processes, for example using iotop at Unix
> or Performance Monitor at Windows.

Operations that complete entirely in shared buffers cannot be monitored that
way. The same is true of WAL writing.

> Certainly it is more convenient to have such statistics inside
> Postgres. But I am not sure if it is really needed.
> Concerning WAL activity situation is more obscure: records can be
> added to the WAL by one process, but written by another.
> This is why it is not possible to use some external tools.

For clarity, I didn't suggest that this patch should include general
session IO statistics. Just the view name looked a bit specific.

> > Briefly looking the patch, I have some comments on it.
> >
> > As mentioned above, if we are intending future expansion of the
> > session-stats table, the name should be changed.
> >
> > Backend status is more appropriate than PGPROC. See pgstat.c.
> Do you mean pgstat_fetch_stat_beentry?
> But why it is better than storing this information directly in PGPROC?

No, it cannot be used there, for the performance reasons you mention. I'm
not sure it's acceptable, but we can directly access backend status the
same way if we expose MyBEEntry (and update it through a macro or an
inline function).  If we don't need per-record resolution for the value,
we can update a process-local variable at WAL-write time and then write
it to backend status at commit time, or at the same timing as pgstat
reporting.
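
A loose sketch of that idea (it assumes MyBEEntry is exported as mentioned,
and st_walWritten is a hypothetical new field, so this is not the patch as
posted):

```
#include "postgres.h"
#include "pgstat.h"

/* hypothetical additions: an exported MyBEEntry and a st_walWritten field */
extern PgBackendStatus *MyBEEntry;

static uint64 local_wal_written = 0;

/* cheap path, called from WAL insertion */
static inline void
CountWalWrite(uint64 nbytes)
{
	local_wal_written += nbytes;
}

/* flushed at a coarser granularity, e.g. commit or pgstat report time */
static void
ReportWalWriteActivity(void)
{
	volatile PgBackendStatus *beentry = MyBEEntry;

	if (beentry == NULL)
		return;

	PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
	beentry->st_walWritten = local_wal_written;	/* hypothetical field */
	PGSTAT_END_WRITE_ACTIVITY(beentry);
}
```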

According to my faint memory, PGPROC is supposed to be kept as small as
possible for CPU cache reasons; that is the reason PgBackendStatus
exists.

> As far as this information has to be updated from XLogInsertRecord, which
> seems to be a very performance-critical function, my intention was to
> minimize
> the overhead of maintaining this statistic. It is hard to imagine
> something more efficient than just MyProc->walWriten += write_len;
> 
> Also, pgstat_fetch_stat_beentry takes a backend id, which is not
> reported in the pg_stat_activity view, and this is why it is more
> convenient to pass a PID to pg_stat_get_wal_activity. Certainly it is
> possible to map a PID to a backend id, but... why do we actually need to
> perform such a mapping if a simpler solution exists?
> 
> > Some kind of locking is needed to update the fields on shared segment.
> > (LWLocks for PGPROC and PGSTAT_BEGIN/END_WRITE_ACTIVITY for
> > PgBackendStatus)
> This information is updated locally only by the backend itself.
> Certainly, an update of a 64-bit field is not atomic on 32-bit
> architectures.
> But it is just a statistic. I do not think that it will be fatal if for
> a moment
> we can see some incorrect value of written WAL bytes (and on most
> platforms this
> update will be atomic).

At least the reader needs to take the procarray lock to keep the
PID-to-WAL-write mapping consistent, in order to prevent reading
WAL-write values for the wrong process.

> As I already wrote above, this information is updated in a
> performance-critical place, and this is why
> I want to avoid any expensive operations (such as locking or atomic
> updates) as much as possible.

I'm afraid that the reason doesn't justify expanding PGPROC..

> > Nitpickings:
> >
> > The patch contains a trace of an older trial in
> > pg_stat_get_activity. Proc OID should be >= 8000 in
> > patches. src/include/catalog/unused_oids offers some OIDs for you.
> >
> 
> Will fix it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Memory-Bounded Hash Aggregation

2019-12-04 Thread Jeff Davis

Thanks very much for a great review! I've attached a new patch.

There are some significant changes in the new version also:

In the non-spilling path, I removed the extra nullcheck branch in the
compiled evaltrans expression. When the first tuple is spilled, the
branch becomes necessary, so I recompile the expression using a new
opcode that includes that branch.

I also changed the read-from-spill path to use a slot with
TTSOpsMinimalTuple (avoiding the need to make it into a virtual slot
right away), which means I need to recompile the evaltrans expression
for that case, as well.

I also improved the way we initialize the hash tables to use a better
estimate for the number of groups. And I made it only initialize one
hash table in the read-from-spill path.

With all of the changes I made (thanks to some suggestions from Andres)
the performance is looking pretty good. It's pretty easy to beat
Sort+Group when the group size is 10+. Even for average group size of
~1, HashAgg is getting really close to Sort in some cases.

There are still a few things to do, most notably costing. I also need
to project before spilling to avoid wasting disk. And I'm sure my
changes have created some more problems, so I have some significant
work to do on quality.

My answers to your questions inline:

On Thu, 2019-11-28 at 18:46 +0100, Tomas Vondra wrote:
> Could we instead track how many tuples we actually consumed for the
> in-memory groups, and then use this information to improve the estimate
> of number of groups? I mean, if we know we've consumed 1000 tuples
> which created 100 groups, then we know there's ~1:10 ratio.

That would be a good estimate for an even distribution, but not
necessarily for a skewed distribution. I'm not opposed to it, but it's
generally my philosophy to overpartition as it seems there's not a big
downside.

> A couple of comments based on eye-balling the patch:
> 
> 
> 1) Shouldn't the hashagg_mem_overflow use the other GUC naming, i.e.
> maybe it should be enable_hashagg_mem_overflow or something similar?

The enable_* naming is for planner GUCs. hashagg_mem_overflow is an
execution-time GUC that disables spilling and overflows work_mem (that
is, it reverts to the old behavior).

> 
> assume the pointer really is NULL. Surely we'll get a segfault on the
> preceding line which does dereference it
> 
>  pergroup = &pergroup_allaggs[op->d.agg_init_trans.transno];
> 
> Or am I missing anything?

That's not actually dereferencing anything, it's just doing a pointer
calculation. You are probably right that it's not a good thing to rely
on, or at least not quite as readable, so I changed the order to put
the NULL check first.


> 
> 3) execGrouping.c
> 
> A couple of functions would deserve a comment, explaining what it
> does.
> 
>   - LookupTupleHashEntryHash
>   - prepare_hash_slot
>   - calculate_hash

Done, thank you.

> And it's not clear to me why we should remove part of the comment
> before
> TupleHashTableHash.

Trying to remember back to when I first did that, but IIRC the comment
was not updated from a previous change, and I was cleaning it up. I
will check over that again to be sure it's an improvement.

> 
> 4) I'm not sure I agree with this reasoning that
> HASH_PARTITION_FACTOR
> making the hash tables smaller is desirable - it may be, but if that
> was
> generally the case we'd just use small hash tables all the time. It's
> a
> bit annoying to give user the capability to set work_mem and then
> kinda
> override that.

I think adding some kind of headroom is reasonable to avoid recursively
spilling, but perhaps it's not critical. I see this as a tuning
question more than anything else. I don't see it as "overriding"
work_mem, but I can see where you're coming from.

> 5) Not sure what "directly" means in this context?
> 
>   * partitions at the time we need to spill, and because this
> algorithm
>   * shouldn't depend too directly on the internal memory needs of a
>   * BufFile.
> 
> #define HASH_PARTITION_MEM (HASH_MIN_PARTITIONS * BLCKSZ)
> 
> Does that mean we don't want to link to PGAlignedBlock, or what?

That's what I meant, yes, but I reworded the comment to not say that.

> 6) I think we should have some protection against underflows in this
> piece of code:
> 
> - this would probably deserve some protection against underflow if
> HASH_PARTITION_MEM gets too big
> 
>  if (hashagg_mem_overflow)
>  aggstate->hash_mem_limit = SIZE_MAX;
>  else
>  aggstate->hash_mem_limit = (work_mem * 1024L) -
> HASH_PARTITION_MEM;
> 
> At the moment it's safe because work_mem is 64kB at least, and
> HASH_PARTITION_MEM is 32kB (4 partitions, 8kB each). But if we happen
> to
> bump HASH_MIN_PARTITIONS up, this can underflow.

Thank you, done.
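
(For the archives, one possible shape for that guard; this mirrors the quoted
fragment and is illustrative rather than the exact code now in the patch:)

```
/* Sketch only: avoid wrapping around if HASH_PARTITION_MEM ever grows. */
if (hashagg_mem_overflow)
	aggstate->hash_mem_limit = SIZE_MAX;
else
{
	Size		limit = work_mem * 1024L;

	if (limit > HASH_PARTITION_MEM)
		aggstate->hash_mem_limit = limit - HASH_PARTITION_MEM;
	else
		aggstate->hash_mem_limit = limit / 2;	/* arbitrary fallback */
}
```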

> 7) Shouldn't lookup_hash_entry briefly explain why/how it handles the
> memory limit?

Improved.

> 
> 8) The comment before lookup_hash_entries says:
> 
>   ...
>   * Return false if hash table has exceeded its memo

Re: Increase footprint of %m and reduce strerror()

2019-12-04 Thread Kyotaro Horiguchi
At Thu, 5 Dec 2019 11:36:48 +0900, Michael Paquier  wrote 
in 
> On Wed, Dec 04, 2019 at 03:32:11PM +0900, Kyotaro Horiguchi wrote:
> > It sounds good to me.  Message unification (including printf) needs
> > somehow treating trailing new lines, though.  About translation
> > burden, I'm not sure how the message unification eases translators'
> > work. Identical messages of different commands appear having different
> > neighbours in different po files.
> 
> Newlines are a problem.  Still there are cases where we don't use
> them.  See for example pg_waldump.c.  It seems like it would be first
> interesting to fix the code paths where we know we can reduce the
> duplicates.

So (IIUC), do we also replace the fprintf()s used for error reporting as
part of this (maybe in a separate patch)?

> > By the way aren't we going to have ereport on frontend?
> 
> Not sure that this will happen, there are quite a few things to
> consider related to what error hints and such should be for frontends.
> That's quite a different discussion..

Agreed.

+1 for going that way after having above considerations.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Append with naive multiplexing of FDWs

2019-12-04 Thread Kyotaro Horiguchi
Hello.

At Sat, 30 Nov 2019 14:26:11 -0500, Bruce Momjian  wrote in 
> On Sun, Nov 17, 2019 at 09:54:55PM +1300, Thomas Munro wrote:
> > On Sat, Sep 28, 2019 at 4:20 AM Bruce Momjian  wrote:
> > > On Wed, Sep  4, 2019 at 06:18:31PM +1200, Thomas Munro wrote:
> > > > A few years back[1] I experimented with a simple readiness API that
> > > > would allow Append to start emitting tuples from whichever Foreign
> > > > Scan has data available, when working with FDW-based sharding.  I used
> > > > that primarily as a way to test Andres's new WaitEventSet stuff and my
> > > > kqueue implementation of that, but I didn't pursue it seriously
> > > > because I knew we wanted a more ambitious async executor rewrite and
> > > > many people had ideas about that, with schedulers capable of jumping
> > > > all over the tree etc.
> > > >
> > > > Anyway, Stephen Frost pinged me off-list to ask about that patch, and
> > > > asked why we don't just do this naive thing until we have something
> > > > better.  It's a very localised feature that works only between Append
> > > > and its immediate children.  The patch makes it work for postgres_fdw,
> > > > but it should work for any FDW that can get its hands on a socket.
> > > >
> > > > Here's a quick rebase of that old POC patch, along with a demo.  Since
> > > > 2016, Parallel Append landed, but I didn't have time to think about
> > > > how to integrate with that so I did a quick "sledgehammer" rebase that
> > > > disables itself if parallelism is in the picture.
> > >
> > > Yes, sharding has been waiting on parallel FDW scans.  Would this work
> > > for parallel partition scans if the partitions were FDWs?
> > 
> > Yeah, this works for partitions that are FDWs (as shown), but only for
> > Append, not for Parallel Append.  So you'd have parallelism in the
> > sense that your N remote shard servers are all doing stuff at the same
> > time, but it couldn't be in a parallel query on your 'home' server,
> > which is probably good for things that push down aggregation and bring
> > back just a few tuples from each shard, but bad for anything wanting
> > to ship back millions of tuples to chew on locally.  Do you think
> > that'd be useful enough on its own?
> 
> Yes, I think so.  There are many data warehouse queries that want to
> return only aggregate values, or filter for a small number of rows. 
> Even OLTP queries might return only a few rows from multiple partitions.
> This would allow for a proof-of-concept implementation so we can see how
> realistic this approach is.
> 
> > The problem is that parallel safe non-partial plans (like postgres_fdw
> > scans) are exclusively 'claimed' by one process under Parallel Append,
> > so with the patch as posted, if you modify it to allow parallelism
> > then it'll probably give correct answers but nothing prevents a single
> > process from claiming and starting all the scans and then waiting for
> > them to be ready, while the other processes miss out on doing any work
> > at all.  There's probably some kludgy solution involving not letting
> > any one worker start more than X, and some space cadet solution
> > involving passing sockets around and teaching libpq to hand over
> > connections at certain controlled phases of the protocol (due to lack
> > of threads), but nothing like that has jumped out as the right path so
> > far.
> 
> I am unclear how many queries can do any meaningful work until all
> shards have given their full results.

There's my pending (somewhat stale) patch, which allows to run local
scans while waiting for remote servers.

https://www.postgresql.org/message-id/20180515.202945.69332784.horiguchi.kyot...@lab.ntt.co.jp

I (or we) wanted to introduce the asynchronous node mechanism as the
basis of async-capable postgres_fdw. The reason it has stalled is that
I am waiting for the anticipated executor change to a push-up style,
on which the async-node mechanism would be constructed. If that won't
happen shortly, I'd like to continue that work.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Increase footprint of %m and reduce strerror()

2019-12-04 Thread Kyotaro Horiguchi
(Just to clarifying the last mail..)

At Thu, 05 Dec 2019 12:06:54 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Thu, 5 Dec 2019 11:36:48 +0900, Michael Paquier  
> wrote in 
> > On Wed, Dec 04, 2019 at 03:32:11PM +0900, Kyotaro Horiguchi wrote:
> > > It sounds good to me.  Message unification (including printf) needs
> > > somehow treating trailing new lines, though.  About translation
> > > burden, I'm not sure how the message unification eases translators'
> > > work. Identical messages of different commands appear having different
> > > neighbours in different po files.
> > 
> > Newlines are a problem.  Still there are cases where we don't use
> > them.  See for example pg_waldump.c.  It seems like it would be first
> > interesting to fix the code paths where we know we can reduce the
> > duplicates.
> 
> So, (IIUC) do we replace fprintf()s for error reporting together (but
> maybe in a separate patch)?
> 
> > > By the way aren't we going to have ereport on frontend?
> > 
> > Not sure that this will happen, there are quite a few things to
> > consider related to what error hints and such should be for frontends.
> > That's quite a different discussion..
> 
> Agreed.
> 
> +1 for going that way after having above considerations.

(This might be taken wrongly. The following would be clearer.)

Since I see the above considerations, I put +1 for this.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Memory-Bounded Hash Aggregation

2019-12-04 Thread Adam Lee
On Wed, Dec 04, 2019 at 06:55:43PM -0800, Jeff Davis wrote:
> 
> Thanks very much for a great review! I've attached a new patch.

Hi,

About the `TODO: project needed attributes only` in your patch, when
would the input tuple contain columns not needed? It seems like anything
you can project has to be in the group or aggregates.

-- 
Melanie Plageman & Adam




Re: Append with naive multiplexing of FDWs

2019-12-04 Thread Thomas Munro
On Thu, Dec 5, 2019 at 4:26 PM Kyotaro Horiguchi
 wrote:
> There's my pending (somewhat stale) patch, which allows to run local
> scans while waiting for remote servers.
>
> https://www.postgresql.org/message-id/20180515.202945.69332784.horiguchi.kyot...@lab.ntt.co.jp
>
> I (or we) wanted to introduce the asynchronous node mechanism as the
> basis of async-capable postgres_fdw. The reason it has stalled is that
> I am waiting for the anticipated executor change to a push-up style,
> on which the async-node mechanism would be constructed. If that won't
> happen shortly, I'd like to continue that work.

After rereading some threads to remind myself what happened here...
right, my little patch began life in March 2016[1] when I wanted a
test case to test Andres's work on WaitEventSets, and your patch set
started a couple of months later and is vastly more ambitious[2][3].
It wants to escape from the volcano give-me-one-tuple-or-give-me-EOF
model.  And I totally agree that there are lots of reason to want to
do that (including yielding to other parts of the plan instead of
waiting for I/O, locks and some parallelism primitives enabling new
kinds of parallelism), and I'm hoping to help with some small pieces
of that if I can.

My patch set (rebased upthread) was extremely primitive, with no new
planner concepts, and added only a very simple new executor node
method: ExecReady().  Append used that to try to ask its children if
they'd like some time to warm up.  By default, ExecReady() says "I
don't know what you're talking about, go away", but FDWs can provide
an implementation that says "yes, please call me again when this fd is
ready" or "yes, I am ready, please call ExecProc() now".  It doesn't
deal with anything more complicated than that, and in particular it
doesn't work if there are extra planner nodes in between Append and
the foreign scan.  (It also doesn't mix particularly well with
parallelism, as mentioned.)
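
For readers who haven't opened the patch, roughly the shape of that contract
(the enum, function name and helper below are illustrative; only the three
possible answers come from the description above):

```
#include "postgres.h"
#include "nodes/execnodes.h"
#include "libpq-fe.h"

/* Illustrative rendering of the three possible answers described above. */
typedef enum ExecReadyResult
{
	EXEC_READY_UNKNOWN,			/* node has no notion of readiness; just run it */
	EXEC_READY_WAIT_FD,			/* not ready; ask again when the fd is readable */
	EXEC_READY_NOW				/* ready; running the node will not block */
} ExecReadyResult;

/* made-up helper standing in for however the FDW finds its connection */
extern PGconn *GetScanConnection(ForeignScanState *node);

/* Hypothetical postgres_fdw-side implementation for one foreign scan. */
static ExecReadyResult
postgresForeignScanReady(ForeignScanState *node, int *fd)
{
	PGconn	   *conn = GetScanConnection(node);

	if (PQisBusy(conn))
	{
		*fd = PQsocket(conn);	/* Append can add this to a WaitEventSet */
		return EXEC_READY_WAIT_FD;
	}
	return EXEC_READY_NOW;
}
```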

The reason I reposted this unambitious work is because Stephen keeps
asking me why we don't consider the stupidly simple thing that would
help with simple foreign partition-based queries today, instead of
waiting for someone to redesign the entire executor, because that's
... really hard.

[1] 
https://www.postgresql.org/message-id/CAEepm%3D1CuAWfxDk%3D%3DjZ7pgCDCv52fiUnDSpUvmznmVmRKU5zpA%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/flat/CA%2BTgmobx8su_bYtAa3DgrqB%2BR7xZG6kHRj0ccMUUshKAQVftww%40mail.gmail.com
[3] 
https://www.postgresql.org/message-id/flat/CA%2BTgmoaXQEt4tZ03FtQhnzeDEMzBck%2BLrni0UWHVVgOTnA6C1w%40mail.gmail.com




Re: [HACKERS] Block level parallel vacuum

2019-12-04 Thread Dilip Kumar
On Thu, Dec 5, 2019 at 12:21 AM Masahiko Sawada
 wrote:
>
> On Wed, 4 Dec 2019 at 04:57, Dilip Kumar  wrote:
> >
> > On Wed, Dec 4, 2019 at 9:12 AM Amit Kapila  wrote:
> > >
> > > On Wed, Dec 4, 2019 at 1:58 AM Masahiko Sawada
> > >  wrote:
> > > >
> > > > On Tue, 3 Dec 2019 at 11:55, Amit Kapila  
> > > > wrote:
> > > > >
> > > > In your code, I think if two workers enter to compute_parallel_delay
> > > > function at the same time, they add their local balance to
> > > > VacuumSharedCostBalance and both workers sleep because both values
> > > > reach the VacuumCostLimit.
> > > >
> > >
> > > True, but isn't it more appropriate because the local cost of any
> > > worker should be ideally added to shared cost as soon as it occurred?
> > > I mean to say that we are not adding any cost in shared balance
> > > without actually incurring it.   Then we also consider the individual
> > > worker's local balance as well and sleep according to local balance.
> >
> > Even I think it is better to add the balance to the shared balance at
> > the earliest opportunity.  Just consider the case that there are 5
> > workers and all have I/O balance of 20, and VacuumCostLimit is 50.  So
> > Actually, there combined balance is 100 (which is double of the
> > VacuumCostLimit) but if we don't add immediately then none of the
> > workers will sleep and it may go to the next cycle which is not very
> > good. OTOH, if we add 20 immediately then check the shared balance
> > then all the workers might go for sleep if their local balances have
> > reached the limit but they will only sleep in proportion to their
> > local balance.  So IMHO, adding the current balance to shared balance
> > early is more close to the model we are trying to implement i.e.
> > shared cost accounting.
>
> I agree with adding the balance as soon as it is incurred. But the
> problem I'm concerned about is this: suppose we have 4 workers, the cost
> limit is 100, and the shared balance is now 95. Two workers, whose local
> balances (VacuumCostBalanceLocal) are 40, consume I/O, add 10 to their
> local balances, and enter the compute_parallel_delay function at the
> same time. One worker adds 10 to the shared balance
> (VacuumSharedCostBalance) and the other worker also adds 10 to the
> shared balance. The first worker then subtracts its local balance from
> the shared balance and sleeps, because the shared cost is now 115 (> the
> cost limit) and its local balance is 50 (> 0.5*(100/4)). The other
> worker does the same for the same reason. On the other hand, if the two
> workers do this serially, only one worker sleeps and the other doesn't,
> because the total shared cost will be 75 when the later worker enters
> the condition. At first glance it looks like a concurrency problem, but
> is that the expected behaviour?

If both workers sleep, then the remaining shared balance will be 15 and
their local balances will be 0. OTOH, if only one worker sleeps, the
remaining shared balance will be 75, so the second worker has missed
this sleep cycle; but on the next opportunity, when the shared value
again reaches 100, if the second worker has performed more I/O it will
sleep for a longer duration.

Even if we add it to the shared balance later (as you were doing
earlier), we can still reproduce similar behaviour. Suppose the shared
balance is 85 and both workers have a local balance of 40 each. Now
each worker does I/O worth 10.  If we don't add it to the shared
balance, both workers will see the balance as 85+10 = 95, so neither of
them will sleep. OTOH, if they do it serially, the first worker will
add 10 and make it 95, and then the second worker will locally check
95+10, which is more than 100, and it will sleep. Right?
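
To keep the arithmetic straight, here is a loose sketch of the shared-balance
scheme as I understand it from this thread (variable names follow the
discussion; the exact conditions in the patch may differ):

```
#include "postgres.h"
#include "miscadmin.h"			/* VacuumCostBalance, VacuumCostLimit, VacuumCostDelay */
#include "port/atomics.h"

/* shared via DSM between the leader and workers, as in the patch */
static pg_atomic_uint32 *VacuumSharedCostBalance;
static int	VacuumCostBalanceLocal = 0;
static int	nworkers = 4;		/* illustrative value */

static void
compute_parallel_delay(void)
{
	uint32		shared_balance;

	/* publish the cost incurred since the last check as soon as possible */
	shared_balance = pg_atomic_add_fetch_u32(VacuumSharedCostBalance,
											 VacuumCostBalance);
	VacuumCostBalanceLocal += VacuumCostBalance;
	VacuumCostBalance = 0;

	/* sleep only if the group is over the limit and we did our fair share */
	if (shared_balance >= VacuumCostLimit &&
		VacuumCostBalanceLocal > 0.5 * ((double) VacuumCostLimit / nworkers))
	{
		double		msec;

		msec = VacuumCostDelay *
			((double) VacuumCostBalanceLocal / VacuumCostLimit);

		pg_atomic_sub_fetch_u32(VacuumSharedCostBalance,
								(uint32) VacuumCostBalanceLocal);
		VacuumCostBalanceLocal = 0;

		pg_usleep((long) (msec * 1000));
	}
}
```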

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Block level parallel vacuum

2019-12-04 Thread Amit Kapila
On Thu, Dec 5, 2019 at 1:41 AM Robert Haas  wrote:
>
> On Mon, Dec 2, 2019 at 2:26 PM Masahiko Sawada
>  wrote:
> > It's just an example, I'm not saying your idea is bad. ISTM the idea
> > is good on an assumption that all indexes take the same time or take a
> > long time so I'd also like to consider if this is true even in
> > production and which approaches is better if we don't have such
> > assumption.
>
> I think his idea is good. You're not wrong when you say that there are
> cases where it could work out badly, but I think on the whole it's a
> clear improvement. Generally, the indexes should be of relatively
> similar size because index size is driven by table size; it's true
> that different AMs could result in different-size indexes, but it
> seems like a stretch to suppose that the indexes that don't support
> parallelism are also going to be the little tiny ones that go fast
> anyway, unless we have some evidence that this is really true. I also
> wonder whether we really need the option to disable parallel vacuum in
> the first place.
>

I think it could be required for cases where the AM doesn't have a
way (or it is difficult to come up with a way) to communicate the
stats allocated by the first ambulkdelete call to the subsequent ones
until amvacuumcleanup.  Currently, we have such a case for the Gist
index; see email thread [1]. Though we have come up with a way to
avoid that for Gist indexes, I am not sure we can assume it is
the case for any possible index AM, especially when there is a
provision for an index AM to have additional stats information.  In the
worst case, if we have to modify some existing index AM like we did
for the Gist index, we need such a provision so that it is possible.
In the ideal case, the index AM should provide a way to copy such
stats, but we can't assume that, so we came up with this option.

We have used this for dummy_index_am which also provides a way to test it.

> Maybe I'm not looking in the right place, but I'm not
> finding anything in the way of comments or documentation explaining
> why some AMs don't support it. It's an important design point, and
> should be documented.
>

Agreed.  We should do this.

> I also think PARALLEL_VACUUM_DISABLE_LEADER_PARTICIPATION seems like a
> waste of space. For parallel queries, there is a trade-off between
> having the leader do work (which can speed up the query) and having it
> remain idle so that it can immediately absorb tuples from workers and
> keep them from having their tuple queues fill up (which can speed up
> the query). But here, at least as I understand it, there's no such
> trade-off. Having the leader fail to participate is just a loser.
> Maybe it's useful to test while debugging the patch,
>

Yeah, it is primarily a debugging/testing aid patch, and it helped us
in discovering some issues.  During development, it is being used for
testing purposes as well.  This is the reason the code is under an #ifdef.

> but why should
> the committed code support it?
>

I am also not sure whether we should commit this part of the code, and
that is why I suggested in one of the above emails keeping it as a
separate patch.  We can later see whether to commit this code.  Now, the
point in its favor is that we already have a similar define
(DISABLE_LEADER_PARTICIPATION) for parallel CREATE INDEX, so having it
here is not a bad idea.  I think it might help us in debugging bugs
where we want to force the index to be vacuumed by some worker.
We might want to have something like force_parallel_mode for
testing/debugging purposes, but I am not sure which is better.  I think
having something as a debugging aid for such features is good.

> To respond to another point from a different part of the email chain,
> the reason why LaunchParallelWorkers() does not take an argument for
> the number of workers is because I believed that the caller should
> always know how many workers they're going to want at the time they
> CreateParallelContext(). Increasing it later is not possible, because
> the DSM has already sized based on the count provided. I grant that it
> would be possible to allow the number to be reduced later, but why
> would you want to do that? Why not get the number right when first
> creating the DSM?
>

Here, we have a need to reduce the number of workers.  Index vacuum
has two different phases (index vacuum and index cleanup) which use
the same parallel context/DSM, but the two can have different
requirements for workers.  The second phase (cleanup) would normally
need fewer workers: if the work was done in the first phase, the second
wouldn't need them. But we have exceptions like gin indexes, where we
need workers for the second phase as well because it takes a pass
over the index again even if we cleaned the index in the first phase.
Now, consider the case where we have 3 btree indexes and 2 gin
indexes: we would need 5 workers for the index vacuum phase and 2
workers for the index cleanup phase.  There are other cases too.
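
As a concrete illustration of choosing (and only ever reducing) the worker
count per phase, a sketch with a made-up capability check standing in for
whatever the patch derives from the index AM:

```
#include "postgres.h"
#include "utils/rel.h"

typedef enum IndexVacuumPhase
{
	PHASE_BULKDELETE,
	PHASE_CLEANUP
} IndexVacuumPhase;

/* made-up helper standing in for the per-AM capability check */
extern bool index_participates_in(Relation indrel, IndexVacuumPhase phase);

/*
 * Number of workers to use for one phase: the count of indexes that want to
 * participate in that phase, capped by the worker count the DSM was sized
 * for at CreateParallelContext() time (it can shrink, never grow).
 */
static int
choose_nworkers_for_phase(Relation *indrels, int nindexes,
						  IndexVacuumPhase phase, int nworkers_at_launch)
{
	int			nparticipants = 0;

	for (int i = 0; i < nindexes; i++)
	{
		if (index_participates_in(indrels[i], phase))
			nparticipants++;
	}

	return Min(nparticipants, nworkers_at_launch);
}
```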

We also

Re: [HACKERS] Block level parallel vacuum

2019-12-04 Thread Dilip Kumar
On Thu, Dec 5, 2019 at 1:41 AM Robert Haas  wrote:
>
> On Mon, Dec 2, 2019 at 2:26 PM Masahiko Sawada
>  wrote:
> > It's just an example, I'm not saying your idea is bad. ISTM the idea
> > is good on an assumption that all indexes take the same time or take a
> > long time so I'd also like to consider if this is true even in
> > production and which approaches is better if we don't have such
> > assumption.
>
> I think his idea is good. You're not wrong when you say that there are
> cases where it could work out badly, but I think on the whole it's a
> clear improvement. Generally, the indexes should be of relatively
> similar size because index size is driven by table size; it's true
> that different AMs could result in different-size indexes, but it
> seems like a stretch to suppose that the indexes that don't support
> parallelism are also going to be the little tiny ones that go fast
> anyway, unless we have some evidence that this is really true. I also
> wonder whether we really need the option to disable parallel vacuum in
> the first place. Maybe I'm not looking in the right place, but I'm not
> finding anything in the way of comments or documentation explaining
> why some AMs don't support it. It's an important design point, and
> should be documented.
>
> I also think PARALLEL_VACUUM_DISABLE_LEADER_PARTICIPATION seems like a
> waste of space. For parallel queries, there is a trade-off between
> having the leader do work (which can speed up the query) and having it
> remain idle so that it can immediately absorb tuples from workers and
> keep them from having their tuple queues fill up (which can speed up
> the query). But here, at least as I understand it, there's no such
> trade-off. Having the leader fail to participate is just a loser.
> Maybe it's useful to test while debugging the patch, but why should
> the committed code support it?
>
> To respond to another point from a different part of the email chain,
> the reason why LaunchParallelWorkers() does not take an argument for
> the number of workers is because I believed that the caller should
> always know how many workers they're going to want at the time they
> CreateParallelContext(). Increasing it later is not possible, because
> the DSM has already sized based on the count provided. I grant that it
> would be possible to allow the number to be reduced later, but why
> would you want to do that? Why not get the number right when first
> creating the DSM?
>
> Is there any legitimate use case for parallel vacuum in combination
> with vacuum cost delay? As I understand it, any serious vacuuming is
> going to be I/O-bound, so can you really need multiple workers at the
> same time that you are limiting the I/O rate? Perhaps it's possible if
> the I/O limit is so high that a single worker can't hit the limit by
> itself, but multiple workers can, but it seems like a bad idea to
> spawn more workers and then throttle them rather than just starting
> fewer workers.

I agree that there is no point in first spawning more workers to get
the work done faster and then throttling them.  Basically, that would
defeat the whole purpose of running it in parallel.  OTOH, we should
also consider cases where some vacuum may not hit the I/O limit at all,
because it may find all the pages in shared buffers and might not need
to dirty a lot of pages.  So I think for such cases it is advantageous
to run in parallel.  The problem is that there is no way to know in
advance whether the total I/O for the vacuum will hit the I/O limit,
so we cannot decide in advance whether to run it in parallel or not.

> In any case, the algorithm suggested in vacuumlazy.c
> around the definition of VacuumSharedCostBalance seems almost the
> opposite of what you probably want. The idea there seems to be that
> you shouldn't make a worker sleep if it hasn't actually got to do
> anything. Apparently the idea is that if you have 3 workers and you
> only have enough I/O rate for 1 worker, you want all 3 workers to run
> at once, so that the I/O is random, rather than having them run 1 at a
> time, so that the I/O is sequential. That seems counterintuitive. It
> could be right if the indexes are in different tablespaces, but if
> they are in the same tablespace it's probably wrong. I guess it could
> still be right if there's just so much I/O that you aren't going to
> run out ever, and the more important consideration is that you don't
> know which index will take longer to vacuum and so want to start them
> all at the same time so that you don't accidentally start the slow one
> last, but that sounds like a stretch. I think this whole area needs
> more thought. I feel like we're trying to jam a go-slower feature and
> a go-faster feature into the same box.
>
> + * vacuum and for performing index cleanup.  Note that all parallel workers
> + * live during either index vacuuming or index cleanup but the leader process
> + * neither exits from the parallel

Re: [HACKERS] Block level parallel vacuum

2019-12-04 Thread Amit Kapila
On Thu, Dec 5, 2019 at 10:52 AM Amit Kapila  wrote:
>
> On Thu, Dec 5, 2019 at 1:41 AM Robert Haas  wrote:
> >
> > On Mon, Dec 2, 2019 at 2:26 PM Masahiko Sawada
> >  wrote:
> > > It's just an example, I'm not saying your idea is bad. ISTM the idea
> > > is good on an assumption that all indexes take the same time or take a
> > > long time so I'd also like to consider if this is true even in
> > > production and which approaches is better if we don't have such
> > > assumption.
> >
> > I think his idea is good. You're not wrong when you say that there are
> > cases where it could work out badly, but I think on the whole it's a
> > clear improvement. Generally, the indexes should be of relatively
> > similar size because index size is driven by table size; it's true
> > that different AMs could result in different-size indexes, but it
> > seems like a stretch to suppose that the indexes that don't support
> > parallelism are also going to be the little tiny ones that go fast
> > anyway, unless we have some evidence that this is really true. I also
> > wonder whether we really need the option to disable parallel vacuum in
> > the first place.
> >
>
> I think it could be required for the cases where the AM doesn't have a
> way (or it is difficult to come up with a way) to communicate the
> stats allocated by the first ambulkdelete call to the subsequent ones
> until amvacuumcleanup.  Currently, we have such a case for the Gist
> index, see email thread [1].
>

oops, I had referred to a couple of other discussions in my reply but
forgot to mention the links, doing it now.

[1] - 
https://www.postgresql.org/message-id/CAA4eK1LGr%2BMN0xHZpJ2dfS8QNQ1a_aROKowZB%2BMPNep8FVtwAA%40mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/CAA4eK1J-VoR9gzS5E75pcD-OH0mEyCdp8RihcwKrcuw7J-Q0%2Bw%40mail.gmail.com
[3] - 
https://www.postgresql.org/message-id/20191106022550.zq7nai2ct2ashegq%40alap3.anarazel.de

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Memory-Bounded Hash Aggregation

2019-12-04 Thread Jeff Davis
On Wed, 2019-12-04 at 19:50 -0800, Adam Lee wrote:
> On Wed, Dec 04, 2019 at 06:55:43PM -0800, Jeff Davis wrote:
> > 
> > Thanks very much for a great review! I've attached a new patch.
> 
> Hi,
> 
> About the `TODO: project needed attributes only` in your patch, when
> would the input tuple contain columns not needed? It seems like
> anything
> you can project has to be in the group or aggregates.

If you have a table like:

   CREATE TABLE foo(i int, j int, x int, y int, z int);

And do:

   SELECT i, SUM(j) FROM foo GROUP BY i;

At least from a logical standpoint, you might expect that we project
only the attributes we need from foo before feeding them into the
HashAgg. But that's not quite how postgres works. Instead, it leaves
the tuples intact (which, in this case, means they have 5 attributes)
until after aggregation and lazily fetches whatever attributes are
referenced. Tuples are spilled from the input, at which time they still
have 5 attributes; so naively copying them is wasteful.

I'm not sure how often this laziness is really a win in practice,
especially after the expression evaluation has changed so much in
recent releases. So it might be better to just project all the
attributes eagerly, and then none of this would be a problem. If we
still wanted to be lazy about attribute fetching, that should still be
possible even if we did a kind of "logical" projection of the tuple so
that the useless attributes would not be relevant. Regardless, that's
outside the scope of the patch I'm currently working on.

What I'd like to do is copy just the attributes needed into a new
virtual slot, leave the unneeded ones NULL, and then write it out to
the tuplestore as a MinimalTuple. I just need to be sure to get the
right attributes.
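
For concreteness, a minimal sketch of that copy step (the Bitmapset of needed
attribute numbers and the function itself are assumptions for illustration,
not the patch's API):

```
#include "postgres.h"
#include "executor/tuptable.h"
#include "nodes/bitmapset.h"

/*
 * Copy only the needed attributes (1-based attribute numbers in
 * 'neededatts') from the input slot into a virtual slot, leaving the rest
 * NULL, so that what gets written out as a MinimalTuple carries no useless
 * attribute data.
 */
static void
fill_spill_slot(TupleTableSlot *inputslot, TupleTableSlot *spillslot,
				Bitmapset *neededatts)
{
	int			natts = spillslot->tts_tupleDescriptor->natts;
	int			attno = -1;

	ExecClearTuple(spillslot);
	memset(spillslot->tts_isnull, true, natts * sizeof(bool));

	while ((attno = bms_next_member(neededatts, attno)) >= 0)
	{
		spillslot->tts_values[attno - 1] =
			slot_getattr(inputslot, attno, &spillslot->tts_isnull[attno - 1]);
	}

	ExecStoreVirtualTuple(spillslot);
	/* caller then writes ExecFetchSlotMinimalTuple(spillslot, ...) out */
}
```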

Regards,
Jeff Davis






Re: Memory-Bounded Hash Aggregation

2019-12-04 Thread Jeff Davis
On Wed, 2019-12-04 at 17:24 -0800, Melanie Plageman wrote:
> 
> It looks like the parameter input_tuples passed to hash_spill_init() 
> in lookup_hash_entries() is the number of groups estimated by
> planner.
> However, when reloading a spill file, if we run out of memory and
> re-spill, hash_spill_init() is passed batch->input_groups (which is
> actually set from input_ngroups which is the number of tuples in the
> spill file). So, input_tuples is groups and input_groups is
> input_tuples. It may be helpful to rename this.

You're right; this is confusing. I will clarify this in the next patch.
 
> So, it looks like the HASH_PARTITION_FACTOR is only used when
> re-spilling. The initial hashtable will use work_mem.
> It seems like the reason for using it when re-spilling is to be very
> conservative to avoid more than one re-spill and make sure each spill
> file fits in a hashtable in memory.

It's used any time a spill happens, even the first spill. I'm flexible
on the use of HASH_PARTITION_FACTOR though... it seems not everyone
thinks it's a good idea. To me it's just a knob to tune and I tend to
think over-partitioning is the safer bet most of the time.

> The comment does seem to point to some other reason, though...

I have observed some anomalies where smaller work_mem values (for
already-low values of work_mem) result faster runtime. The only
explanation I have is caching effects.

> 256 actually seems very large. hash_spill_npartitions() will be called
> for every re-spill, so HASH_MAX_PARTITIONS is not the total number of
> spill files permitted; rather, it is the number of spill files created
> by a given spill (a spill set). So if you made X partitions initially
> and every partition re-spills, you would now have (at most) X * 256
> partitions.

Right. Though I'm not sure there's any theoretical max... given enough
input tuples and it will just keep getting deeper. If this is a serious
concern maybe I should make it depth-first recursion by prepending new
work items rather than appending. That would still not bound the
theoretical max, but it would slow the growth.

> If HASH_MAX_PARTITIONS is 256, wouldn't the metadata from the spill
> files take up a lot of memory at that point?

Yes. Each file keeps a BLCKSZ buffer, plus some other metadata. And it
does create a file, so it's offloading some work to the OS to manage
that new file.

It's annoying to properly account for these costs because the memory
needs to be reserved at the time we are building the hash table, but we
don't know how many partitions we want until it comes time to spill.
And for that matter, we don't even know whether we will need to spill
or not.

There are two alternative approaches which sidestep this problem:

1. Reserve a fixed fraction of work_mem, say, 1/8 to make space for
however many partitions that memory can handle. We would still have a
min and max, but the logic for reserving the space would be easy and so
would choosing the number of partitions to create.
  * Pro: simple
  * Con: lose the ability to choose the numer of partitions

2. Use logtape.c instead (suggestion from Heikki). Supporting more
logical tapes doesn't impose costs on the OS, and we can potentially
use a lot of logical tapes.
  * Pro: can use lots of partitions without making lots of files
  * Con: buffering still needs to happen somewhere, so we still need
memory for each logical tape. Also, we risk losing locality of read
access when reading the tapes, or perhaps confusing readahead.
Fundamentally, logtapes.c was designed for sequential write, random
read; but we are going to do random write and sequential read.

Regards,
Jeff Davis