Re: Avoiding deadlock errors in CREATE INDEX CONCURRENTLY
On Fri, Nov 08, 2019 at 10:30:39AM +0900, Michael Paquier wrote: > Per the arguments of upthread, storing a 64-bit XID would require a > catalog change and you cannot backpatch that. I would suggest to keep > this patch focused on HEAD, and keep it as an improvement of the > existing features. Concurrent deadlock risks caused by CCI exist > since the feature came to life. Marked as returned with feedback per lack of activity; the patch had been waiting on author for a bit more than two weeks. -- Michael signature.asc Description: PGP signature
Re: pglz performance
On Wed, Nov 06, 2019 at 09:04:25AM +0100, Peter Eisentraut wrote: > OK, waiting on some independent verification of benchmark numbers. Still waiting for these after 19 days, so the patch has been marked as returned with feedback. -- Michael signature.asc Description: PGP signature
Re: log bind parameter values on error
On Thu, Nov 07, 2019 at 03:41:04PM -0800, Andres Freund wrote: > The way you do it you need to do it in numerous places, and I'm pretty > sure you're missing some already. E.g. this will not work to log > parameters for parametrized statements generated on the server side, > e.g. for foreign key queries. I don't think that's the right direction > to go. You can maybe argue that we don't need support for logging server > side generated parameters in the initial version, but the approach > should be compatible with adding that support. This patch had a review from two committers, with no updates from the author, so marked as returned with feedback. -- Michael signature.asc Description: PGP signature
Re: Avoid full GIN index scan when possible
On Sat, Nov 23, 2019 at 02:35:50AM +0300, Nikita Glukhov wrote: > Attached 8th version of the patches. Please be careful here. The CF entry was still marked as waiting on author, but you sent a new patch series which has not been reviewed. I have moved this patch to the next CF instead. -- Michael signature.asc Description: PGP signature
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
On Mon, Nov 18, 2019 at 05:26:37PM -0800, Peter Geoghegan wrote: > Attached is v24. This revision doesn't fix the problem with > xl_btree_insert record bloat, but it does fix the bitrot against the > master branch that was caused by commit 50d22de9. (This patch has had > a surprisingly large number of conflicts against the master branch > recently.) Please note that I have moved this patch to the next CF per this last update. Anastasia, the ball is on your side of the field, as the CF entry has been marked as waiting on author for some time now. -- Michael signature.asc Description: PGP signature
Re: WIP: Data at rest encryption
On Wed, Sep 04, 2019 at 06:56:18AM +0200, Antonin Houska wrote: > This thread started later than our effort but important design questions are > being discussed there. So far there seems to be no consensus whether > full-instance encryption should be implemented first, so any effort spent on > this patch might get wasted. When/if there will be an agreement on the design, > we'll see how much of this patch can be used. I see. I have marked the patch as returned with feedback in CF 2019-11. Let's see how the other one finishes, before perhaps coming back to this one. -- Michael signature.asc Description: PGP signature
Re: pglz performance
> On 25 Nov 2019, at 13:03, Michael Paquier wrote: > > On Wed, Nov 06, 2019 at 09:04:25AM +0100, Peter Eisentraut wrote: >> OK, waiting on some independent verification of benchmark numbers. > > Still waiting for these after 19 days, so the patch has been marked as > returned with feedback. I think status Needs Review describes what is going on better. It's not like something is awaited from my side. Thanks. Best regards, Andrey Borodin.
Re: pglz performance
On Mon, Nov 25, 2019 at 01:21:27PM +0500, Andrey Borodin wrote: > I think status Needs Review describes what is going on better. It's > not like something is awaited from my side. Indeed. You are right so I have moved the patch instead, with "Needs review". The patch status was actually incorrect in the CF app, as it was marked as waiting on author. @Tomas: updated versions of the patches have been sent by Andrey. -- Michael signature.asc Description: PGP signature
Re: accounting for memory used for BufFile during hash joins
On Tue, Sep 10, 2019 at 03:47:51PM +0200, Tomas Vondra wrote: > My feeling is that we should get the BNLJ committed first, and then maybe > use some of those additional strategies as fallbacks (depending on which > issues are still unsolved by the BNLJ). The glacier is melting more. Tomas, what's your status here? The patch has been waiting on author for two months now. If you are not planning to work more on this one, then it should be marked as returned with feedback? -- Michael signature.asc Description: PGP signature
Re: [HACKERS] Incomplete startup packet errors
On Thu, Mar 7, 2019 at 1:26 AM Andrew Dunstan < andrew.duns...@2ndquadrant.com> wrote: > > On 3/6/19 12:12 PM, Robert Haas wrote: > > On Tue, Mar 5, 2019 at 5:35 PM Andrew Dunstan > > wrote: > >> OK, I think we have agreement on Tom's patch. Do we want to backpatch > OK, no back-patching it is. > However, Checking whether the port is open is resulting in error log like: 2019-11-25 14:03:44.414 IST [14475] LOG: invalid length of startup packet Yes, This is different from "Incomplete startup packet" discussed here. Steps to reproduce: $ telnet localhost 5432 > >
Re: Rework manipulation and structure of attribute mappings
On Fri, Nov 22, 2019 at 4:57 PM Michael Paquier wrote: > On Fri, Nov 22, 2019 at 02:21:41PM +0900, Amit Langote wrote: > > Actually, we should also refactor > > convert_tuples_by_position() to carve out the code that builds the > > AttrMap into a separate function and move it to attmap.c. > > Not sure how to name that. One logic uses a match based on the > attribute name, and the other uses the type. So the one based on the > attribute name should be something like build_attrmap_by_name() and > the second attrmap_build_by_position()? We could use a better > convention like AttrMapBuildByPosition for example. Any suggestions > of names are welcome. Actually, I was just suggesting that we create a new function convert_tuples_by_position_map() and put the logic that compares the two TupleDescs to create the AttrMap in it, just like convert_tuples_by_name_map(). Now you could say that there would be no point in having such a function, because no caller currently wants to use such a map (that is, without the accompanying TupleConversionMap), but maybe there will be in the future. Though irrespective of that consideration, I guess you'd agree to group similar code in a single source file. Regarding coming up with any new name, having a word in the name that distinguishes the aspect of mapping by attribute name vs. type (position) should suffice. We can always do the renaming in a separate patch. > Please note that I still have a commit fest to > run and finish, so I'll unlikely come back to that before December. > Let's continue with the tuning of the function names though. As it's mainly just moving around code, I gave it a shot; patch attached (applies on top of yours). I haven't invented any new names yet, but let's keep discussing that as you say. Thanks, Amit 0002-Move-more-code-to-attmap.c.patch Description: Binary data
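To make the naming side of this concrete, the two entry points in attmap.h could end up looking something like the following. These are purely hypothetical signatures to anchor the discussion; neither the names nor the exact argument lists are settled yet:

    /* build an AttrMap by matching attributes on name */
    extern AttrMap *build_attrmap_by_name(TupleDesc indesc, TupleDesc outdesc);

    /* build an AttrMap by matching attributes on position/type */
    extern AttrMap *build_attrmap_by_position(TupleDesc indesc, TupleDesc outdesc,
                                              const char *msg);

Whichever convention wins (build_attrmap_by_* vs. AttrMapBuildBy*), keeping both builders next to the AttrMap definition in attmap.c/attmap.h seems like the natural grouping.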
Re: Attempt to consolidate reading of XLOG page
Alvaro Herrera wrote: > On 2019-Nov-22, Antonin Houska wrote: > > > As I pointed out in > > > > https://www.postgresql.org/message-id/88183.1574261429%40antos > > > > seg.ws_off only replaced readOff in XLogReaderState. So we should only > > update > > ws_off where readOff was updated before commit 709d003. This does happen in > > ReadPageInternal (see HEAD) and I see no reason for the final patch to > > update > > ws_off anywhere else. > > Oh you're right. > > I see no reason to leave ws_off. We can move that to XLogReaderState; I > did that here. We also need the offset in WALReadError, though, so I > added it there too. Conceptually it seems clearer to me this way. > > What do you think of the attached? It looks good to me. Attached is just a fix of a minor problem in error reporting that Michael pointed out earlier. > BTW I'm not clear what errors can pread()/pg_pread() report that do not > set errno. I think lines 1083/1084 of WALRead are spurious now. All I can say is that the existing calls of pg_pread() do not clear errno, so you may be right. I'd appreciate more background about the "partial read" that Michael mentions here: https://www.postgresql.org/message-id/20191125033048.GG37821%40paquier.xyz -- Antonin Houska Web: https://www.cybertec-postgresql.com diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c index 04124bc254..eda81c1df1 100644 --- a/src/bin/pg_waldump/pg_waldump.c +++ b/src/bin/pg_waldump/pg_waldump.c @@ -354,9 +354,11 @@ WALDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen, state->segcxt.ws_segsize); if (errinfo.wre_errno != 0) - fatal_error("could not read in file %s, offset %u, length %zu: %s", - fname, errinfo.wre_off, (Size) errinfo.wre_req, - strerror(errinfo.wre_errno)); + { + errno = errinfo.wre_errno; + fatal_error("could not read in file %s, offset %u, length %zu: %m", + fname, errinfo.wre_off, (Size) errinfo.wre_req); + } else fatal_error("could not read in file %s, offset %u: length: %zu", fname, errinfo.wre_off, (Size) errinfo.wre_req);
Re: backup manifests
Hi Jeevan, I have incorporated all the comments in the attached patch. Please review and let me know your thoughts. On Thu, Nov 21, 2019 at 2:51 PM Jeevan Chalke < jeevan.cha...@enterprisedb.com> wrote: > > > On Wed, Nov 20, 2019 at 11:05 AM Suraj Kharage < > suraj.khar...@enterprisedb.com> wrote: > >> Hi, >> >> Since now we are generating the backup manifest file with each backup, it >> provides us an option to validate the given backup. >> Let's say, we have taken a backup and after a few days, we want to check >> whether that backup is validated or corruption-free without restarting the >> server. >> >> Please find attached POC patch for same which will be based on the latest >> backup manifest patch from Rushabh. With this functionality, we add new >> option to pg_basebackup, something like --verify-backup. >> So, the syntax would be: >> ./bin/pg_basebackup --verify-backup -D >> >> Basically, we read the backup_manifest file line by line from the given >> directory path and build the hash table, then scan the directory and >> compare each file with the hash entry. >> >> Thoughts/suggestions? >> > > > I like the idea of verifying the backup once we have backup_manifest with > us. > Periodically verifying the already taken backup with this simple tool > becomes > easy now. > > I have reviewed this patch and here are my comments: > > 1. > @@ -30,7 +30,9 @@ > #include "common/file_perm.h" > #include "common/file_utils.h" > #include "common/logging.h" > +#include "common/sha2.h" > #include "common/string.h" > +#include "fe_utils/simple_list.h" > #include "fe_utils/recovery_gen.h" > #include "fe_utils/string_utils.h" > #include "getopt_long.h" > @@ -38,12 +40,19 @@ > #include "pgtar.h" > #include "pgtime.h" > #include "pqexpbuffer.h" > +#include "pgrhash.h" > #include "receivelog.h" > #include "replication/basebackup.h" > #include "streamutil.h" > > Please add new files in order. > > 2. > Can hash related file names be renamed to backuphash.c and backuphash.h? > > 3. > Need indentation adjustments at various places. > > 4. > +charbuf[100]; // 1MB chunk > > It will be good if we have multiple of block /page size (or at-least power > of 2 > number). > > 5. > +typedef struct pgrhash_entry > +{ > +struct pgrhash_entry *next; /* link to next entry in same bucket */ > +DataDirectoryFileInfo *record; > +} pgrhash_entry; > + > +struct pgrhash > +{ > +unsignednbuckets;/* number of buckets */ > +pgrhash_entry **bucket;/* pointer to hash entries */ > +}; > + > +typedef struct pgrhash pgrhash; > > These two can be moved to .h file instead of redefining over there. > > 6. > +/* > + * TODO: this function is not necessary, can be removed. > + * Test whether the given row number is match for the supplied keys. > + */ > +static bool > +pgrhash_compare(char *bt_filename, char *filename) > > Yeah, it can be removed by doing strcmp() at the required places rather > than > doing it in a separate function. > > 7. > mdate is not compared anywhere. I understand that it can't be compared with > the file in the backup directory and its entry in the manifest as manifest > entry gives mtime from server file whereas the same file in the backup will > have different mtime. But adding a few comments there will be good. > > 8. > +charmdate[24]; > > should be mtime instead? 
> > > Thanks > > -- > Jeevan Chalke > Associate Database Architect & Team Lead, Product Development > EnterpriseDB Corporation > The Enterprise PostgreSQL Company > > -- -- Thanks & Regards, Suraj kharage, EnterpriseDB Corporation, The Postgres Database Company. backup_validator_POC.patch Description: Binary data
Re: adding partitioned tables to publications
On Fri, Nov 22, 2019 at 7:46 PM Peter Eisentraut wrote: > On 2019-11-22 07:28, Amit Langote wrote: > > Hmm, I thought it would be more desirable to not expose a published > > partitioned table's leaf partitions to a subscriber, because it allows > > the target table to be defined more flexibly. > > There are multiple different variants that we probably eventually want > to support. But I think there is value in exposing the partition > structure to the subscriber. Most notably, it allows the subscriber to > run the initial table sync per partition rather than in one big chunk -- > which ultimately reflects one of the reasons partitioning exists. I agree that replicating leaf-to-leaf has the least overhead. > The other way, exposing only the partitioned table, is also useful, > especially if you want to partition differently on the subscriber. But > without the ability to target a partitioned table on the subscriber, > this would right now only allow you to replicate a partitioned table > into a non-partitioned table. Which is valid but probably not often useful. Handling non-partitioned target tables was the main reason for me to make publishing using the root parent's schema the default behavior. But given that replicating from partitioned tables into non-partitioned ones would be rare, I agree to replicating using the leaf schema by default. > >> What happens when you add a leaf table directly to a publication? Is it > >> replicated under its own identity or under its ancestor partitioned > >> table? (What if both the leaf table and a partitioned table are > >> publication members?) > > > > If both a leaf partition and an ancestor belong to the same > > publication, then leaf partition changes are replicated using the > > ancestor's schema. For a leaf partition to be replicated using its > > own schema it must be published via a separate publication that > > doesn't contain the ancestor. At least that's what the current patch > > does. > > Hmm, that seems confusing. This would mean that if you add a > partitioned table to a publication that already contains leaf tables, > the publication behavior of the leaf tables would change. So again, I > think this alternative behavior of publishing partitions under the name > of their root table should be an explicit option on a publication, and > then it should be ensured somehow that individual partitions are not > added to the publication in confusing ways. > > So, it's up to you which aspect of this you want to tackle, but I > thought your original goal of being able to add partitioned tables to > publications and have that implicitly expand to all member partitions on > the publication side seemed quite useful, self-contained, and > uncontroversial. OK, let's make whether to publish with root or leaf schema an option, with the latter being the default. I will see about updating the patch that way. Thanks, Amit
Re: [HACKERS] Block level parallel vacuum
On Fri, Nov 22, 2019 at 2:49 PM Amit Kapila wrote: > > On Wed, Nov 20, 2019 at 11:01 AM Masahiko Sawada > wrote: > > > > I've attached the latest version patch set. The patch set includes all > > discussed points regarding index AM options as well as shared cost > > balance. Also I added some test cases used all types of index AM. > > > > I have reviewed the first patch and made a number of modifications > that include adding/modifying comments, made some corrections and > modifications in the documentation. You can find my changes in > v33-0001-delta-amit.patch. > I have continued my review for this patch series and reviewed/hacked the second patch. I have added/modified comments, changed function ordering in file to make them look consistent and a few other changes. You can find my changes in v33-0002-delta-amit.patch. Are you working on review comments given recently, if you have not started yet, then it might be better to prepare a patch atop of v33 version as I am also going to work on this patch series, that way it will be easy to merge changes. OTOH, if you are already working on those, then it is fine. I can merge any remaining changes with your new patch. Whatever be the case, please let me know. Few more comments on v33-0002-Add-parallel-option-to-VACUUM-command.patch: --- 1. + * leader process re-initializes the parallel context while keeping recorded + * dead tuples so that the leader can launch parallel workers again in the next + * time. In this sentence, it is not clear to me why we need to keep the recorded dead tuples while re-initialize parallel workers? The next time when workers are launched, they should process a new set of dead tuples, no? 2. lazy_parallel_vacuum_or_cleanup_indexes() { .. + /* + * Increment the active worker count. We cannot decrement until the + * all parallel workers finish. + */ + pg_atomic_add_fetch_u32(VacuumActiveNWorkers, 1); + + /* + * Join as parallel workers. The leader process alone does that in + * case where no workers launched. + */ + if (lps->leaderparticipates || lps->pcxt->nworkers_launched == 0) + vacuum_or_cleanup_indexes_worker (Irel, nindexes, stats, lps->lvshared, + vacrelstats->dead_tuples); + + /* + * Here, the indexes that had been skipped during parallel index vacuuming + * are remaining. If there are such indexes the leader process does vacuum + * or cleanup them one by one. + */ + nindexes_remains = nindexes - pg_atomic_read_u32(&(lps->lvshared->nprocessed)); + if (nindexes_remains > 0) + { + int i; +#ifdef USE_ASSERT_CHECKING + int nprocessed = 0; +#endif + + for (i = 0; i < nindexes; i++) + { + bool processed = !skip_parallel_index_vacuum(Irel[i], + lps->lvshared->for_cleanup, + lps->lvshared->first_time); + + /* Skip the already processed indexes */ + if (processed) + continue; + + if (lps->lvshared->for_cleanup) + lazy_cleanup_index(Irel[i], &stats[i], +vacrelstats->new_rel_tuples, + vacrelstats->tupcount_pages < vacrelstats->rel_pages); + else + lazy_vacuum_index(Irel[i], &stats[i], vacrelstats->dead_tuples, + vacrelstats- >old_live_tuples); +#ifdef USE_ASSERT_CHECKING + nprocessed++; +#endif + } +#ifdef USE_ASSERT_CHECKING + Assert (nprocessed == nindexes_remains); +#endif + } + + /* + * We have completed the index vacuum so decrement the active worker + * count. + */ + pg_atomic_sub_fetch_u32(VacuumActiveNWorkers, 1); .. } Here, it seems that we can increment/decrement the VacuumActiveNWorkers even when there is no work performed by the leader backend. 
How about moving the increment/decrement inside the function vacuum_or_cleanup_indexes_worker? In that case, we need to do it in this function when we are actually doing an index vacuum or cleanup. After doing that, the other usage of increment/decrement of VacuumActiveNWorkers in the other function, heap_parallel_vacuum_main, can be removed. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com v33-0002-Add-parallel-option-to-VACUUM-command.patch Description: Binary data v33-0002-delta-amit.patch Description: Binary data
Re: Ordering of header file inclusion
On Thu, Nov 21, 2019 at 2:10 PM Amit Kapila wrote: > > Thanks for finding the remaining places, the patch looks good to me. > I hope this covers the entire code. BTW, are you using some script to > find this or is this a result of manual inspection of code? I have > modified the commit message in the attached patch. I will commit this > early next week unless someone else wants to review it. > Pushed. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: dropdb --force
On Sun, Nov 24, 2019 at 3:55 PM vignesh C wrote: > > On Sat, Nov 23, 2019 at 4:42 PM Amit Kapila wrote: > > > > > 2. > > ok( TestLib::pump_until( > > + $killme, > > + $psql_timeout, > > + \$killme_stderr, > > + qr/FATAL: terminating connection due to administrator command/m > > + ), > > + "psql query died successfully after SIGTERM"); > > > > Extra space before TestLib. > > Ran perltidy, perltidy adds an extra space. I'm not sure which version > is right whether to include space or without space. I had noticed > similarly in 001_stream_rep.pl, in few places space is present and in > few places it is not present. If required I can update based on > suggestion. > You can try by running perltidy on other existing .pl files where you find the usage "without space" and see if it adds the extra space in all places. I think keeping the version after running perltidy would be a better choice. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: TestLib::command_fails_like enhancement
On 11/11/19 4:28 PM, Mark Dilger wrote: > > > On further consideration, I'm wondering why we don't just unconditionally use a closed input pty for all these functions (except run_log). None of them use any input, and making the client worry about whether or not to close it seems something of an abstraction break. There would be no API change at all involved in this case, just a bit of extra cleanliness. Would need testing on Windows, I'll go and do that now. Thoughts? >>> >>> That sounds a lot better than your previous patch. >>> >>> PostgresNode.pm and TestLib.pm both invoke IPC::Run::run. If you >>> change all the invocations in TestLib to close input pty, should you >>> do the same for PostgresNode? I don't have a strong argument for >>> doing so, but it seems cleaner to have both libraries invoking >>> commands under identical conditions, such that if commands were >>> borrowed from one library and called from the other they would behave >>> the same. >>> >>> PostgresNode already uses TestLib, so perhaps setting up the >>> environment can be abstracted into something, perhaps TestLib::run, >>> and then used everywhere that IPC::Run::run currently is used. >> >> >> >> I don't think we need to do that. In the case of the PostgresNode.pm >> uses we know what the executable is, unlike the the TestLib.pm cases. >> They are our own executables and we don't expect them to be doing >> anything funky with /dev/tty. > > Ok. I think your proposal sounds fine. Here's a patch for that. The pty stuff crashes and burns on my Windows test box, so there I just set stdin to an empty string via the usual pipe mechanism. cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm index a377cdb226..5ff138360c 100644 --- a/src/test/perl/TestLib.pm +++ b/src/test/perl/TestLib.pm @@ -87,6 +87,8 @@ our @EXPORT = qw( our ($windows_os, $tmp_check, $log_path, $test_logfile); +my @no_stdin; + BEGIN { @@ -178,6 +180,20 @@ INIT autoflush STDOUT 1; autoflush STDERR 1; autoflush $testlog 1; + + # Settings to close stdin for certain commands. + # On non-Windows, use a pseudo-terminal, so that libraries like openssl + # which open the tty if they think stdin isn't one for a password + # don't block. Windows doesn't have ptys, so just provide an empty + # string for stdin. 
+ if ($windows_os) + { + @no_stdin = ('<', \""); + } + else + { + @no_stdin = ('', \$stdout, '2>', \$stderr; + my $result = IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr, @no_stdin; chomp($stdout); chomp($stderr); return ($stdout, $stderr); @@ -576,7 +592,7 @@ sub check_pg_config my ($regexp) = @_; my ($stdout, $stderr); my $result = IPC::Run::run [ 'pg_config', '--includedir' ], '>', - \$stdout, '2>', \$stderr + \$stdout, '2>', \$stderr, @no_stdin or die "could not execute pg_config"; chomp($stdout); $stdout =~ s/\r$//; @@ -673,7 +689,7 @@ sub program_help_ok my ($stdout, $stderr); print("# Running: $cmd --help\n"); my $result = IPC::Run::run [ $cmd, '--help' ], '>', \$stdout, '2>', - \$stderr; + \$stderr, @no_stdin; ok($result, "$cmd --help exit code 0"); isnt($stdout, '', "$cmd --help goes to stdout"); is($stderr, '', "$cmd --help nothing to stderr"); @@ -695,7 +711,7 @@ sub program_version_ok my ($stdout, $stderr); print("# Running: $cmd --version\n"); my $result = IPC::Run::run [ $cmd, '--version' ], '>', \$stdout, '2>', - \$stderr; + \$stderr, @no_stdin; ok($result, "$cmd --version exit code 0"); isnt($stdout, '', "$cmd --version goes to stdout"); is($stderr, '', "$cmd --version nothing to stderr"); @@ -718,8 +734,7 @@ sub program_options_handling_ok my ($stdout, $stderr); print("# Running: $cmd --not-a-valid-option\n"); my $result = IPC::Run::run [ $cmd, '--not-a-valid-option' ], '>', - \$stdout, - '2>', \$stderr; + \$stdout, '2>', \$stderr, @no_stdin; ok(!$result, "$cmd with invalid option nonzero exit code"); isnt($stderr, '', "$cmd with invalid option prints error message"); return; @@ -740,7 +755,7 @@ sub command_like my ($cmd, $expected_stdout, $test_name) = @_; my ($stdout, $stderr); print("# Running: " . join(" ", @{$cmd}) . "\n"); - my $result = IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr; + my $result = IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr, @no_stdin; ok($result, "$test_name: exit code 0"); is($stderr, '', "$test_name: no stderr"); like($stdout, $expected_stdout, "$test_name: matches"); @@ -769,7 +784,8 @@ sub command_like_safe my $stdoutfile = File::Temp->new(); my $stderrfile = File::Temp->new(); print("# Running: " . join(" ", @{$cmd}) . "\n"); - my $result = IPC::Run::run $cmd, '>', $stdoutfile, '2>', $stderrfile; + my $result = IPC::Run::run $cmd, '>', $stdoutfile, '2>', $stderrfile, + @
Re: proposal: new polymorphic types - commontype and commontypearray
> On Mon, Jun 17, 2019 at 05:31:40AM +0200, Pavel Stehule wrote: > > > I like anycompatible and anycompatiblearray. > > > > I'll update the patch > > > > and here it is Thanks for the patch! I've reviewed it a bit, and have a few small commentaries: * There are few traces of copy paste in comments +static Oid +select_common_type_from_vector(int nargs, Oid *typeids, bool noerror) ... + /* +* Nope, so set up for the full algorithm. Note that at this point, lc +* points to the first list item with type different from pexpr's; we need +* not re-examine any items the previous loop advanced over. +*/ Seems like it was taken from select_common_type, but in select_common_type_from_vector there is no `lc`, since it doesn't accept a list. * I guess it's would be beneficial to update also commentaries for check_generic_type_consistency and enforce_generic_type_consistency * The argument consistency rules are: * * 1) All arguments declared ANYELEMENT must have the same datatype. * ... Since they do not reflect the current state of things in this patch. * I've noticed that there is a small difference in how anyelement and anycompatible behave, namely anycompatible do not handle unknowns: =# select 'aaa'::anyelement; anyelement aaa =# select 'aaa'::anycompatible; ERROR: 42846: cannot cast type unknown to anycompatible LINE 1: select 'aaa'::anycompatible; ^ LOCATION: transformTypeCast, parse_expr.c:2823 It happens due to unknowns being filtered out quite early in check_generic_type_consistency and similar. By itself this difference it not a problem, but it causes different error messages in functions: -- this function accepts anycompatible =# select test_anycompatible('aaa'); ERROR: 42883: function test_anycompatible(unknown) does not exist LINE 1: select test_anycompatible('aaa'); ^ HINT: No function matches the given name and argument types. You might need to add explicit type casts. LOCATION: ParseFuncOrColumn, parse_func.c:627 -- this function accepts anyelement =# select test_anyelement('aaa'); ERROR: 42804: could not determine polymorphic type because input has type unknown LOCATION: enforce_generic_type_consistency, parse_coerce.c:2177 Although of course it's not that serious. * I'm also curious about the following situation: =# create function test_both(a anycompatible) returns anycompatible as $$ begin return a; end$$ language plpgsql; CREATE FUNCTION =# create function test_both(a anyelement) returns anyelement as $$ begin return a; end$$ language plpgsql; CREATE FUNCTION =# select test_both('aaa'::text); ERROR: 42725: function test_both(text) is not unique LINE 1: select test_both('aaa'::text); ^ HINT: Could not choose a best candidate function. You might need to add explicit type casts. LOCATION: ParseFuncOrColumn, parse_func.c:568 =# select test_both('aaa'::anyelement); ERROR: 42804: could not determine polymorphic type because input has type unknown LOCATION: enforce_generic_type_consistency, parse_coerce.c:2177 Is it possible somehow to invoke any of these functions? Other than that the functionality looks pretty solid. It may look obvious, but I've also tested performance in different use cases for anycompatible, looks the same as for anyelement.
Re: Avoid full GIN index scan when possible
Michael Paquier writes: > On Sat, Nov 23, 2019 at 02:35:50AM +0300, Nikita Glukhov wrote: >> Attached 8th version of the patches. > Please be careful here. The CF entry was still marked as waiting on > author, but you sent a new patch series which has not been reviewed. > I have moved this patch to the next CF instead. That seems a bit premature --- the new patch is only a couple of days old. The CF entry should've been moved back to "Needs Review", sure. (Having said that, the end of the month isn't that far away, so it may well end up in the next CF anyway.) regards, tom lane
Re: segmentation fault when cassert enabled
On Wed, 6 Nov 2019 14:34:38 +0100 Peter Eisentraut wrote: > On 2019-11-05 17:29, Jehan-Guillaume de Rorthais wrote: > > My best bet so far is that logicalrep_relmap_invalidate_cb is not called > > after the DDL on the subscriber so the relmap cache is not invalidated. So > > we end up with slot->tts_tupleDescriptor->natts superior than > > rel->remoterel->natts in slot_store_cstrings, leading to the overflow on > > attrmap and the sigsev. > > It looks like something like that is happening. But it shouldn't. > Different table schemas on publisher and subscriber are well supported, > so this must be an edge case of some kind. Please continue investigating. I've been able to find the origin of the crash, but it was a long journey. I was unable to debug using gdb record because of this famous error: Process record does not support instruction 0xc5 at address 0x1482758a4b30. Program stopped. __memset_avx2_unaligned_erms () at ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:168 168 ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: No such file or directory. Trying with rr, I had constant "stack depth limit exceeded", even with an unlimited one. Is it worth opening a discussion or a wiki page about these tools? Peter, it looks like you have some experience with rr, any feedback? Finally, Julien Rouhaud spent some time with me after work hours, answering my questions about some parts of the code and pointing me to the excellent backtrace_functions GUC added a few days ago. This finally did the trick to find out what was happening. Many thanks Julien! Back to the bug itself. Consider a working logical replication with constant update/insert activity, eg. pgbench running against the provider. Now, on the subscriber side, a session issues an "ALTER TABLE ADD COLUMN" on a subscribed table, eg. pgbench_branches. A cache invalidation message is then pending for this table. In the meantime, the logical replication worker receives an UPDATE to apply. It opens the local relation using "logicalrep_rel_open". It finds the related entry in LogicalRepRelMap is valid, so it does not update its attrmap and directly opens and locks the local relation: /* Need to update the local cache? */ if (!OidIsValid(entry->localreloid)) { [...updates attrmap here...] } else entry->localrel = table_open(entry->localreloid, lockmode); However, when locking the table, the code in LockRelationOid() actually processes any pending invalidation messages: LockRelationOid(Oid relid, LOCKMODE lockmode) { [...] /* * Now that we have the lock, check for invalidation messages, so that we * will update or flush any stale relcache entry before we try to use it. * RangeVarGetRelid() specifically relies on us for this. We can skip * this in the not-uncommon case that we already had the same type of lock * being requested, since then no one else could have modified the * relcache entry in an undesirable way. (In the case where our own xact * modifies the rel, the relcache update happens via * CommandCounterIncrement, not here.) * * However, in corner cases where code acts on tables (usually catalogs) * recursively, we might get here while still processing invalidation * messages in some outer execution of this function or a sibling. The * "cleared" status of the lock tells us whether we really are done * absorbing relevant inval messages. */ if (res != LOCKACQUIRE_ALREADY_CLEAR) { AcceptInvalidationMessages(); MarkLockClear(locallock); } } We end up with attrmap referencing N columns and the relcache referencing N+1 columns.
Later, in apply_handle_update(), we build a TupleTableSlot based on the relcache representation and we crash by overflowing attrmap while trying to feed this larger slot in slot_store_cstrings(). Please find in attachment a bugfix proposal. It just moves the attrmap update after the table_open() call. Last, I was wondering if entry->attrmap should be pfree'd before palloc'ing it again during its rebuild to avoid some memory leak? Regards, >From 4295b277952a46378f01211bd37075f20223aadc Mon Sep 17 00:00:00 2001 From: Jehan-Guillaume de Rorthais Date: Mon, 25 Nov 2019 15:21:38 +0100 Subject: [PATCH] Fix subscriber invalid memory access on DDL Before this patch, the attrmap structure mapping the local fields to remote ones for a subscribed relation was rebuild before handling pending invalidation messages and potential relcache updates. This was leading to an invalid memory access by overflowing the attrmap when building the related tuple slot received from the provider. --- src/backend/replication/logical/relation.c | 57 -- 1 file changed, 42 insertions(+), 15 deletions(-) diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c index b386f8460d..cfc34d49e0 100644 --
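In case the flattened diff is hard to read, the essential reordering is sketched below. This is only an illustration of the race described above, not the patch itself; the normal "entry not cached yet" path is omitted:

    /*
     * Lock the relation first: LockRelationOid() may absorb pending
     * invalidation messages, which reset entry->localreloid through
     * logicalrep_relmap_invalidate_cb().
     */
    entry->localrel = table_open(entry->localreloid, lockmode);

    /* Only now decide whether the cached attrmap is still usable. */
    if (!OidIsValid(entry->localreloid))
    {
        /*
         * An invalidation fired while we were opening the relation:
         * rebuild entry->attrmap against the descriptor of the relation
         * we have just opened and locked, then mark the entry valid again.
         */
        entry->localreloid = RelationGetRelid(entry->localrel);
    }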
Re: [HACKERS] Incomplete startup packet errors
Jobin Augustine writes: > However, Checking whether the port is open is resulting in error log like: > 2019-11-25 14:03:44.414 IST [14475] LOG: invalid length of startup packet > Yes, This is different from "Incomplete startup packet" discussed here. > Steps to reproduce: > $ telnet localhost 5432 >> >> Well, the agreed-to behavior change was to not log anything if the connection is closed without any data having been sent. If the client *does* send something, and it doesn't look like a valid connection request, I think we absolutely should log that. regards, tom lane
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Fri, Nov 22, 2019 at 4:38 AM Amit Kapila wrote: > On Thu, Nov 21, 2019 at 8:32 PM Juan José Santamaría Flecha > wrote: > > > > [1] Win10 (1903) MSVC 19.22.27905 > > > > I have tested this on Windows7. I am not sure if it is due to a > different version of windows, but I think we can't rule out that > possibility. > > This seems to be the case. The unexpected behaviour is on my end, which is working as described in FILE_DISPOSITION_POSIX_SEMANTICS [1]. The expected behaviour is what you have already diagnosed. [1] https://docs.microsoft.com/es-es/windows-hardware/drivers/ddi/ntddk/ns-ntddk-_file_disposition_information_ex Regards, Juan José Santamaría Flecha
Why JIT speed improvement is so modest?
Right now JIT provides about 30% improvement of TPC-H Q1 query: https://www.citusdata.com/blog/2018/09/11/postgresql-11-just-in-time/ I wonder why even at this query, which seems to be ideal use case for JIT, we get such modest improvement? I have raised this question several years ago - but that time JIT was assumed to be in early development stage and performance aspects were less critical than required infrastructure changes. But right now JIT seems to be stable enough and is switch on by default. Vitesse DB reports 8x speedup on Q1, ISP-RAS JIT version provides 3x speedup of Q1: https://www.pgcon.org/2017/schedule/attachments/467_PGCon%202017-05-26%2015-00%20ISPRAS%20Dynamic%20Compilation%20of%20SQL%20Queries%20in%20PostgreSQL%20Using%20LLVM%20JIT.pdf According to this presentation Q1 spends 6% of time in ExecQual and 75% in ExecAgg. VOPS provides 10x improvement of Q1. I have a hypothesis that such difference was caused by the way of aggregates calculation. Postgres is using Youngs-Cramer algorithm while both ISPRAS JIT version and my VOPS are just accumulating results in variable of type double. I rewrite VOPS to use the same algorithm as Postgres, but VOPS is still about 10 times faster. Results of Q1 on scale factor=10 TPC-H data at my desktop with parallel execution enabled: no-JIT: 5640 msec JIT: 4590msec VOPS: 452 msec VOPS + Youngs-Cramer algorithm: 610 msec Below are tops of profiles (functions with more than 1% of time): JIT: 10.98% postgres postgres [.] float4_accum 8.40% postgres postgres [.] float8_accum 7.51% postgres postgres [.] HeapTupleSatisfiesVisibility 5.92% postgres postgres [.] ExecInterpExpr 5.63% postgres postgres [.] tts_minimal_getsomeattrs 4.35% postgres postgres [.] lookup_hash_entries 3.72% postgres postgres [.] TupleHashTableHash.isra.8 2.93% postgres postgres [.] tuplehash_insert 2.70% postgres postgres [.] heapgettup_pagemode 2.24% postgres postgres [.] check_float8_array 2.23% postgres postgres [.] hash_search_with_hash_value 2.10% postgres postgres [.] ExecScan 1.90% postgres postgres [.] hash_uint32 1.57% postgres postgres [.] tts_minimal_clear 1.53% postgres postgres [.] FunctionCall1Coll 1.47% postgres postgres [.] pg_detoast_datum 1.39% postgres postgres [.] heapgetpage 1.37% postgres postgres [.] TupleHashTableMatch.isra.9 1.35% postgres postgres [.] ExecStoreBufferHeapTuple 1.06% postgres postgres [.] LookupTupleHashEntry 1.06% postgres postgres [.] AggCheckCallContext no-JIT: 26.82% postgres postgres [.] ExecInterpExpr 15.26% postgres postgres [.] tts_buffer_heap_getsomeattrs 8.27% postgres postgres [.] float4_accum 7.51% postgres postgres [.] float8_accum 5.26% postgres postgres [.] HeapTupleSatisfiesVisibility 2.78% postgres postgres [.] TupleHashTableHash.isra.8 2.63% postgres postgres [.] tts_minimal_getsomeattrs 2.54% postgres postgres [.] lookup_hash_entries 2.05% postgres postgres [.] tuplehash_insert 1.97% postgres postgres [.] heapgettup_pagemode 1.72% postgres postgres [.] hash_search_with_hash_value 1.57% postgres postgres [.] float48mul 1.55% postgres postgres [.] check_float8_array 1.48% postgres postgres [.] ExecScan 1.26% postgres postgres [.] hash_uint32 1.04% postgres postgres [.] tts_minimal_clear 1.00% postgres postgres [.] FunctionCall1Coll VOPS: 44.25% postgres vops.so [.] vops_avg_state_accumulate 11.76% postgres vops.so [.] vops_float4_avg_accumulate 6.14% postgres postgres [.] ExecInterpExpr 5.89% postgres vops.so [.] vops_float4_sub_lconst 4.89% postgres vops.so [.] vops_float4_mul 4.30% postgres vops.so [.] 
vops_int4_le_rconst 2.57% postgres vops.so [.] vops_float4_add_lconst 2.31% postgres vops.so [.] vops_count_accumulate 2.24% postgres postgres [.] tts_buffer_heap_getsomeattrs 1.97% postgres postgres [.] heap_page_prune_opt 1.72% postgres postgres [.] HeapTupleSatisfiesVisibility 1.67% postgres postgres [.] AllocSetAlloc 1.47% postgres postgres [.] hash_search_with_hash_value In theory, by eliminating interpretation overhead, JIT should provide performance comparable with a vectorized executor. In most programming languages, using a JIT compiler instead of a byte-code interpreter provides about a 10x speed improvement.
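To make the comparison of the two accumulation strategies concrete, here is a small standalone sketch. It is not the actual float8_accum code, but the Youngs-Cramer update uses the same (N, Sx, Sxx) transition state that Postgres keeps; all names here are mine:

    #include <stdio.h>

    /* Naive accumulation: one addition and one increment per row. */
    typedef struct { double sum; long count; } naive_state;

    static void naive_accum(naive_state *st, double x)
    {
        st->sum += x;
        st->count++;
    }

    /*
     * Youngs-Cramer accumulation: tracks N, Sx and Sxx (sum of squared
     * deviations), which costs extra multiplications and a division per
     * row but is numerically stable for variance/stddev.
     */
    typedef struct { double N; double Sx; double Sxx; } yc_state;

    static void yc_accum(yc_state *st, double x)
    {
        st->N += 1.0;
        st->Sx += x;
        if (st->N > 1.0)
        {
            double tmp = x * st->N - st->Sx;

            st->Sxx += tmp * tmp / (st->N * (st->N - 1.0));
        }
    }

    int main(void)
    {
        naive_state ns = {0, 0};
        yc_state    ys = {0.0, 0.0, 0.0};

        for (int i = 1; i <= 1000000; i++)
        {
            naive_accum(&ns, (double) i);
            yc_accum(&ys, (double) i);
        }
        printf("avg naive = %g, avg YC = %g\n",
               ns.sum / ns.count, ys.Sx / ys.N);
        return 0;
    }

The per-row cost difference between the two update functions is small in absolute terms, which matches the observation above that switching VOPS to the Youngs-Cramer update only moved Q1 from 452 to 610 msec.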
Re: Why JIT speed improvement is so modest?
On Mon, Nov 25, 2019 at 9:09 AM Konstantin Knizhnik wrote: > JIT was not able to significantly (times) increase speed on Q1 query? > Experiment with VOPS shows that used aggregation algorithm itself is not > a bottleneck. > Profile also give no answer for this question. > Any ideas? Well, in the VOPS variant around 2/3 of the time is spent in routines that are obviously aggregation. In the JIT version, it's around 20%. So this suggests that the replacement execution engine is more invasive. I would also guess (!) that the VOPS engine optimizes fewer classes of query plan. ExecScan, for example, looks to be completely optimized out in VOPS but is still utilized in the JIT engine. I experimented with Vitesse a couple of years back and this was consistent with my recollection. merlin
Re: [HACKERS] Block level parallel vacuum
On Fri, 22 Nov 2019 at 10:19, Amit Kapila wrote: > > On Wed, Nov 20, 2019 at 11:01 AM Masahiko Sawada > wrote: > > > > I've attached the latest version patch set. The patch set includes all > > discussed points regarding index AM options as well as shared cost > > balance. Also I added some test cases used all types of index AM. > > > > I have reviewed the first patch and made a number of modifications > that include adding/modifying comments, made some corrections and > modifications in the documentation. You can find my changes in > v33-0001-delta-amit.patch. See, if those look okay to you, if so, > please include those in the next version of the patch. I am attaching > both your version of patch and delta changes by me. Thank you. All changes look good to me. But after the changes to the 0002 patch, the two macros for parallel vacuum options (VACUUM_OPTIONS_SUPPORT_XXX) are no longer necessary. So we can remove them and add them back if we need them again. > > One comment on v33-0002-Add-parallel-option-to-VACUUM-command: > > + /* Estimate size for shared information -- PARALLEL_VACUUM_KEY_SHARED */ > + est_shared = MAXALIGN(add_size(SizeOfLVShared, BITMAPLEN > (nindexes))); > .. > + shared->offset = add_size(SizeOfLVShared, BITMAPLEN(nindexes)); > > Here, don't you need to do MAXALIGN to set offset as we are computing > it that way while estimating shared memory? If not, then probably, > some comments are required to explain it. You're right. Will fix it. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: backup manifests
On 2019-11-24 15:38, David Steele wrote: On 11/23/19 4:34 PM, Andrew Dunstan wrote: On 11/23/19 3:13 AM, Tels wrote: Without the strong hashes it would be pointless to sign the manifest. I guess I must have missed where we are planning to add a cryptographic signature. I don't think we were planning to, but the user could do so if they wished. That was what I meant. Best regards, Tels
Re: backup manifests
On Fri, Nov 22, 2019 at 2:29 PM David Steele wrote: > See: > https://www.nist.gov/system/files/documents/2017/04/26/lrdc_systems_part2_032713.pdf > Search for "The maximum block size" Hmm, so it says: "The maximum block size that can be protected by a 32-bit CRC is 512MB." My problem is that (1) it doesn't back this up with a citation or any kind of logical explanation and (2) it's not very clear what "protected" means. Tels replies downthread to explain that the internal state of the 32-bit CRC calculation is also limited to 32 bits, and changes once per bit, so that after processing 512MB = 2^29 bytes = 2^32 bits of data, you're guaranteed to start repeating internal states. Perhaps this is also what the NIST folks had in mind, though it's hard to know. This link provides some more details: https://community.arm.com/developer/tools-software/tools/f/keil-forum/17467/crc-for-256-byte-data Not everyone on the thread agrees with everybody else, but it seems like there are size limits below which a CRC-n is guaranteed to detect all 1-bit and 2-bit errors, and above which this is no longer guaranteed. They put the limit *lower* than what NIST supposes, namely 2^(n-1)-1 bits, which would be 256MB, not 512MB, if I'm doing math correctly. However, they also say that above that value, you are still likely to detect most errors. Absent an intelligent adversary, the chance of a random collision when corruption is present is still about 1 in 4 billion (2^-32). To me, guaranteed detection of 1-bit and 2-bit errors (and the other kinds of specific things CRC is designed to catch) doesn't seem like a principle design consideration. It's nice if we can get it and I'm not against it, but these are algorithms that are designed to be used when data undergoes a digital-to-analog-to-digital conversion, where for example it's possible that that the conversion back to digital loses sync and reads 9 bits or 7 bits rather than 8 bits. And that's not really what we're doing here: we all know that bits get flipped sometimes, but nobody uses scp to copy a 1GB file and ends up with a file that is 1GB +/- a few bits. Some lower-level part of the communication stack is handling that part of the work; you're going to get exactly 1GB. So it seems to me that here, as with XLOG, we're not relying on the specific CRC properties that were intended to be used to catch and in some cases repair bit flips caused by wrinkles in an A-to-D conversion, but just on its general tendency to probably not match if any bits got flipped. And those properties hold regardless of input length. That being said, having done some reading on this, I am a little concerned that we're getting further and further from the design center of the CRC algorithm. Like relation segment files, XLOG records are not packets subject to bit insertions, but at least they're small, and relation files are not. Using a 40-year-old algorithm that was intended to be used for things like making sure the modem hadn't lost framing in the last second to verify 1GB files feels, in some nebulous way, like we might be stretching. That being said, I'm not sure what we think the reasonable alternatives are. Users aren't going to be better off if we say that, because CRC-32C might not do a great job detecting errors, we're not going to check for errors at all. If we go the other way and say we're going to use some variant of SHA, they will be better off, but at the price of what looks like a *significant* hit in terms of backup time. 
> > "Typically an n-bit CRC applied to a data block of arbitrary length > > will detect any single error burst not longer than n bits, and the > > fraction of all longer error bursts that it will detect is (1 − > > 2^−n)." > > I'm not sure how encouraging I find this -- a four-byte error not a lot > and 2^32 is only 4 billion. We have individual users who have backed up > more than 4 billion files over the last few years. I agree that people have a lot more than 4 billion files backed up, but I'm not sure it matters very much given the use case I'm trying to enable. There's a lot of difference between delta restore and backup integrity checking. For backup integrity checking, my goal is that, on those occasions when a file gets corrupted, the chances that we notice that it has been corrupted. For that purpose, a 32-bit checksum is probably sufficient. If a file gets corrupted, we have about a 1-in-4-billion chance of being unable to detect it. If 4 billion files get corrupted, we'll miss, on average, one of those corruption events. That's sad, but so is the fact that you had *4 billion corrupted files*. This is not the total number of files backed up; this is the number of those that got corrupted. I don't really know how common it is to copy a file and end up with a corrupt copy, but if you say it's one-in-a-million, which I suspect is far too high, then you'd have to back up something like 4 quadrillion files before you missed a corruption event, and th
Re: backup manifests
On Fri, Nov 22, 2019 at 2:02 PM David Steele wrote: > > Except - and this gets back to the previous point - I don't want to > > slow down backups by 40% by default. I wouldn't mind slowing them down > > 3% by default, but 40% is too much overhead. I think we've gotta > > either the overhead of using SHA way down or not use SHA by default. > > Maybe -- my take is that the measurements, an uncompressed backup to the > local filesystem, are not a very realistic use case. Well, compression is a feature we don't have yet, in core. So for people who are only using core tools, an uncompressed backup is a very realistic use case, because it's the only kind they can get. Granted the situation is different if you are using pgbackrest. I don't have enough experience to know how often people back up to local filesystems vs. remote filesystems mounted locally vs. overtly over-the-network. I sometimes get the impression that users choose their backup tools and procedures with, as Tom would say, the aid of a dart board, but that's probably the cynic in me talking. Or maybe a reflection of the fact that I usually end up talking to the users for whom things have gone really, really badly wrong, rather than the ones for whom things went as planned. > However, I'm still fine with leaving the user the option of checksums or > no. I just wanted to point out that CRCs have their limits so maybe > that's not a great option unless it is properly caveated and perhaps not > the default. I think the default is the sticking point here. To me, it looks like CRC is a better default than nothing at all because it should still catch a high percentage of issues that would otherwise be missed, and a better default than SHA because it's so cheap to compute. However, I'm certainly willing to consider other theories. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: backup manifests
On Fri, Nov 22, 2019 at 5:15 PM Tels wrote: > It is related to the number of states... Thanks for this explanation. See my reply to David where I also discuss this point. > However, if you choose a hash, please do not go below SHA-256. Both MD5 > and SHA-1 already had collision attacks, and these only got to be bound > to be worse. > >https://www.mscs.dal.ca/~selinger/md5collision/ >https://shattered.io/ Yikes, that second link, about SHA-1, is depressing. Now, it's not likely that an attacker has access to your backup repository and can spend 6500 years of CPU time to engineer a Trojan file there (maybe more, because the files are probably bigger than the PDFs they used in that case) and then induce you to restore and rely upon that backup. However, it's entirely likely that somebody is going to eventually ban SHA-1 as the attacks get better, which is going to be a problem for us whether the underlying exposures are problems or not. > It might even be a wise idea to encode the used Hash-Algorithm into the > manifest file, so it can be changed later. The hash length might be not > enough to decide which algorithm is the one used. I agree. Let's write SHA256:bc1c3a57369acd0d2183a927fb2e07acbbb1c97f317bbc3b39d93ec65b754af5 or similar rather than just the hash. That way even if the entire SHA family gets cracked, we can easily substitute in something else that hasn't been cracked yet. (It is unclear to me why anyone supposes that *any* popular hash function won't eventually be cracked. For a K-bit hash function, there are 2^K possible outputs, where K is probably in the hundreds. But there are 2^{2^33} possible 1GB files. So for every possible output value, there are 2^{2^33-K} inputs that produce that value, which is a very very big number. The probability that any given input produces a certain output is very low, but the number of possible inputs that produce a given output is very high; so assuming that nobody's ever going to figure out how to construct them seems optimistic.) > To get a feeling one can use: > > openssl speed md5 sha1 sha256 sha512 > > On my really-not-fast desktop CPU (i5-4690T CPU @ 2.50GHz) it says: > > The 'numbers' are in 1000s of bytes per second processed. >type 16 bytes 64 bytes256 bytes 1024 bytes 8192 > bytes 16384 bytes >md5 122638.55k 277023.96k 487725.57k 630806.19k > 683892.74k 688553.98k >sha1 127226.45k 313891.52k 632510.55k 865753.43k > 960995.33k 977215.19k >sha256 77611.02k 173368.15k 325460.99k 412633.43k > 447022.92k 448020.48k >sha512 51164.77k 205189.87k 361345.79k 543883.26k > 638372.52k 645933.74k > > Or in other words, it can hash nearly 931 MByte /s with SHA-1 and about > 427 MByte / s with SHA-256 (if I haven't miscalculated something). You'd > need a > pretty fast disk (aka M.2 SSD) and network (aka > 1 Gbit) to top these > speeds > and then you'd use a real CPU for your server, not some poor Intel > powersaving > surfing thingy-majingy :) I mean, how fast is in theory doesn't matter nearly as much as what happens when you benchmark the proposed implementation, and the results we have so far don't support the theory that this is so cheap as to be negligible. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
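Putting that parenthetical counting argument into symbols (a 1GB file is 2^30 bytes = 2^33 bits):

    \#\{\text{1GB inputs}\} = 2^{2^{33}}, \qquad
    \#\{\text{outputs of a } K\text{-bit hash}\} = 2^{K}, \qquad
    \text{preimages per output} \approx 2^{2^{33}-K},

so collisions certainly exist for any fixed-length digest; the open question is only whether anyone learns to construct them efficiently.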
Re: dropdb --force
On Sun, Nov 24, 2019 at 5:06 PM Pavel Stehule wrote: > > > > ne 24. 11. 2019 v 11:25 odesílatel vignesh C napsal: >> >> On Sat, Nov 23, 2019 at 4:42 PM Amit Kapila wrote: >> > >> > On Fri, Nov 22, 2019 at 3:16 PM vignesh C wrote: >> > > >> > > Thanks for fixing the comments. The changes looks fine to me. I have >> > > fixed the first comment, attached patch has the changes for the same. >> > > >> > >> > Few comments: >> > -- >> > 1. There is a lot of duplicate code in this test. Basically, almost >> > the same code is used once to test Drop Database command and dropdb. >> > I think we can create a few sub-functions to avoid duplicate code, but >> > do we really need to test the same thing once via command and then by >> > dropdb utility? I don't think it is required, but even if we want to >> > do so, then I am not sure if this is the right test script. I suggest >> > removing the command related test. >> > >> >> Pavel: What is your opinion on this? > > > I have not any problem with this reduction. > > We will see in future years what is benefit of this test. > Fixed, removed dropdb utility test. >> >> > 2. >> > ok( TestLib::pump_until( >> > + $killme, >> > + $psql_timeout, >> > + \$killme_stderr, >> > + qr/FATAL: terminating connection due to administrator command/m >> > + ), >> > + "psql query died successfully after SIGTERM"); >> > >> > Extra space before TestLib. >> >> Ran perltidy, perltidy adds an extra space. I'm not sure which version >> is right whether to include space or without space. I had noticed >> similarly in 001_stream_rep.pl, in few places space is present and in >> few places it is not present. If required I can update based on >> suggestion. >> >> > >> > 3. >> > +=item pump_until(proc, psql_timeout, stream, untl) >> > >> > I think moving pump_until to TestLib is okay, but if you are making it >> > a generic function to be used from multiple places, it is better to >> > name the variable as 'timeout' instead of 'psql_timeout' >> > >> >> Fixed. >> >> > 4. Have you ran perltidy, if not, can you please run it? See >> > src/test/perl/README for the recommendation. >> > >> >> I have verified by running perltidy. >> >> Please find the updated patch attached. 1st patch is same as previous, >> 2nd patch includes the fix for comments 2,3 & 4. >> Attached patch handles the fix for the above comments. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com From 0890c781b67595a51b04a1dfe972890fb6be18a4 Mon Sep 17 00:00:00 2001 From: vignesh Date: Mon, 25 Nov 2019 23:12:19 +0530 Subject: [PATCH 2/2] Made pump_until usable as a common subroutine. Patch includes movement of pump_until subroutine from 013_crash_restart to TestLib so that it can be used as a common sub from 013_crash_restart and 060_dropdb_force. 
--- src/test/perl/TestLib.pm | 37 +++ src/test/recovery/t/013_crash_restart.pl | 62 2 files changed, 59 insertions(+), 40 deletions(-) diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm index a377cdb..b58679a 100644 --- a/src/test/perl/TestLib.pm +++ b/src/test/perl/TestLib.pm @@ -860,6 +860,43 @@ sub command_checks_all =pod +=item pump_until(proc, timeout, stream, untl) + +# Pump until string is matched, or timeout occurs + +=cut + +sub pump_until +{ + my ($proc, $timeout, $stream, $untl) = @_; + $proc->pump_nb(); + while (1) + { + last if $$stream =~ /$untl/; + if ($timeout->is_expired) + { + diag("aborting wait: program timed out"); + diag("stream contents: >>", $$stream, "<<"); + diag("pattern searched for: ", $untl); + + return 0; + } + if (not $proc->pumpable()) + { + diag("aborting wait: program died"); + diag("stream contents: >>", $$stream, "<<"); + diag("pattern searched for: ", $untl); + + return 0; + } + $proc->pump(); + } + return 1; + +} + +=pod + =back =cut diff --git a/src/test/recovery/t/013_crash_restart.pl b/src/test/recovery/t/013_crash_restart.pl index 2c47797..dd08924 100644 --- a/src/test/recovery/t/013_crash_restart.pl +++ b/src/test/recovery/t/013_crash_restart.pl @@ -72,7 +72,8 @@ CREATE TABLE alive(status text); INSERT INTO alive VALUES($$committed-before-sigquit$$); SELECT pg_backend_pid(); ]; -ok(pump_until($killme, \$killme_stdout, qr/[[:digit:]]+[\r\n]$/m), +ok( TestLib::pump_until( + $killme, $psql_timeout, \$killme_stdout, qr/[[:digit:]]+[\r\n]$/m), 'acquired pid for SIGQUIT'); my $pid = $killme_stdout; chomp($pid); @@ -84,7 +85,9 @@ $killme_stdin .= q[ BEGIN; INSERT INTO alive VALUES($$in-progress-before-sigquit$$) RETURNING status; ]; -ok(pump_until($killme, \$killme_stdout, qr/in-progress-before-sigquit/m), +ok( TestLib::pump_until( + $killme, $psql_timeout, + \$killme_stdout, qr/in-progress-before-sigquit/m), 'inserted in-progress-before-sigquit'); $killme_stdout = ''; $killme_stderr = ''; @@ -97,7 +100,8 @@ $monitor_stdin .= q[ SELECT $$psql-connected$$; SELECT pg_sleep(3600
Re: accounting for memory used for BufFile during hash joins
On Mon, Nov 25, 2019 at 05:33:35PM +0900, Michael Paquier wrote:
> On Tue, Sep 10, 2019 at 03:47:51PM +0200, Tomas Vondra wrote:
>> My feeling is that we should get the BNLJ committed first, and then maybe
>> use some of those additional strategies as fallbacks (depending on which
>> issues are still unsolved by the BNLJ).
>
> The glacier is melting more. Tomas, what's your status here? The patch has
> been waiting on author for two months now. If you are not planning to work
> more on this one, then it should be marked as returned with feedback?

I'm not planning to do any immediate work on this, so I agree with marking
it as RWF. I think Melanie is working on the BNL patch, which seems like
the right solution.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Attempt to consolidate reading of XLOG page
On 2019-Nov-25, Antonin Houska wrote: > Alvaro Herrera wrote: > > I see no reason to leave ws_off. We can move that to XLogReaderState; I > > did that here. We also need the offset in WALReadError, though, so I > > added it there too. Conceptually it seems clearer to me this way. > > > > What do you think of the attached? > > It looks good to me. Attached is just a fix of a minor problem in error > reporting that Michael pointed out earlier. Excellent, I pushed it with this change included and some other cosmetic changes. Now there's only XLogPageRead() ... > > BTW I'm not clear what errors can pread()/pg_pread() report that do not > > set errno. I think lines 1083/1084 of WALRead are spurious now. > > All I can say is that the existing calls of pg_pread() do not clear errno, so > you may be right. Right ... in this interface, we only report an error if pg_pread() returns negative, which is documented to always set errno. > I'd appreciate more background about the "partial read" that > Michael mentions here: > > https://www.postgresql.org/message-id/20191125033048.GG37821%40paquier.xyz In the current implementation, if pg_pread() does a partial read, we just loop one more time. I considered changing the "if (readbytes <= 0)" with "if (readbytes < segbytes)", but that seemed pointless. However, writing this now makes me think that we should add a CHECK_FOR_INTERRUPTS in this loop. (I also wonder if we shouldn't limit the number of times we retry if pg_pread returns zero (i.e. no error, but no bytes read either). I don't know if this is a real-world consideration.) -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
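For illustration, here is a minimal standalone sketch of the retry pattern under discussion, written against plain pread() rather than the backend's pg_pread()/WALRead(); the cap on zero-byte retries and the interrupt check are only the ideas floated above, not something claimed to exist in the tree:

#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read exactly "count" bytes at "offset", tolerating partial reads.
 * Returns false on a hard error (errno set) or if the file keeps
 * returning zero bytes, e.g. because it is shorter than expected.
 */
static bool
read_fully(int fd, char *buf, size_t count, off_t offset)
{
	int			zero_reads = 0;

	while (count > 0)
	{
		ssize_t		nread = pread(fd, buf, count, offset);

		if (nread < 0)
		{
			if (errno == EINTR)
				continue;		/* interrupted by a signal, just retry */
			return false;		/* real error, errno is set */
		}
		if (nread == 0)
		{
			if (++zero_reads > 3)
				return false;	/* give up rather than loop forever */
			continue;
		}
		zero_reads = 0;

		/* in backend code, this is where a CHECK_FOR_INTERRUPTS() could go */

		buf += nread;
		offset += nread;
		count -= (size_t) nread;
	}
	return true;
}

Whether such a check belongs in the common read loop or in its callers is exactly the open question above.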
Re: TestLib::command_fails_like enhancement
On 11/25/19 5:08 AM, Andrew Dunstan wrote: On 11/11/19 4:28 PM, Mark Dilger wrote: On further consideration, I'm wondering why we don't just unconditionally use a closed input pty for all these functions (except run_log). None of them use any input, and making the client worry about whether or not to close it seems something of an abstraction break. There would be no API change at all involved in this case, just a bit of extra cleanliness. Would need testing on Windows, I'll go and do that now. Thoughts? That sounds a lot better than your previous patch. PostgresNode.pm and TestLib.pm both invoke IPC::Run::run. If you change all the invocations in TestLib to close input pty, should you do the same for PostgresNode? I don't have a strong argument for doing so, but it seems cleaner to have both libraries invoking commands under identical conditions, such that if commands were borrowed from one library and called from the other they would behave the same. PostgresNode already uses TestLib, so perhaps setting up the environment can be abstracted into something, perhaps TestLib::run, and then used everywhere that IPC::Run::run currently is used. I don't think we need to do that. In the case of the PostgresNode.pm uses we know what the executable is, unlike the the TestLib.pm cases. They are our own executables and we don't expect them to be doing anything funky with /dev/tty. Ok. I think your proposal sounds fine. Here's a patch for that. The pty stuff crashes and burns on my Windows test box, so there I just set stdin to an empty string via the usual pipe mechanism. Ok, I've reviewed and tested this. It works fine for me on Linux. I am not set up to test it on Windows. I think it is ready to commit. I have one remaining comment about the code, and this is just FYI. I won't quibble with you committing your patch as it currently stands. You might consider changing the '\x04' literal to use a named control character, both for readability and portability, as here: + use charnames ':full'; + @no_stdin = ('
GROUPING SETS and SQL standard
Hi, We are still on the process to migrate our applications from proprietary RDBMS to PostgreSQL. Here is a simple query executed on various systems (real query is different but this one does not need any data) : Connected to: Oracle Database 19c Standard Edition 2 Release 19.0.0.0.0 - Production Version 19.3.0.0.0 SQL> select count(*) from (select 1 from dual where 0=1 group by grouping sets(())) tmp; COUNT(*) -- 0 select @@version; GO --- --- -- Microsoft SQL Server 2017 (RTM-CU16) (KB4508218) - 14.0.3223.3 (X64) Jul 12 2019 17:43:08 Copyright (C) 2017 Microsoft Corporation Developer Edition (64-bit) on Linux (Ubuntu 16.04.6 LTS) select count(*) from (select 1 as c1 where 0=1 group by grouping sets(())) tmp; GO --- 0 (1 rows affected) select version(); version PostgreSQL 11.5 (Debian 11.5-1+deb10u1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit select count(*) from (select 1 from dual where 0=1 group by grouping sets(())) tmp; count --- 1 (1 ligne) 0 or 1, which behaviour conforms to the SQL standard ? We have a workaround and it's just informational. Regards, Phil
Re: TestLib::command_fails_like enhancement
On 11/25/19 1:56 PM, Mark Dilger wrote: > > > On 11/25/19 5:08 AM, Andrew Dunstan wrote: >> >> On 11/11/19 4:28 PM, Mark Dilger wrote: >>> >>> >>> >> >> On further consideration, I'm wondering why we don't just >> unconditionally use a closed input pty for all these functions >> (except >> run_log). None of them use any input, and making the client worry >> about >> whether or not to close it seems something of an abstraction break. >> There would be no API change at all involved in this case, just a >> bit of >> extra cleanliness. Would need testing on Windows, I'll go and do >> that >> now. >> >> >> Thoughts? > > That sounds a lot better than your previous patch. > > PostgresNode.pm and TestLib.pm both invoke IPC::Run::run. If you > change all the invocations in TestLib to close input pty, should you > do the same for PostgresNode? I don't have a strong argument for > doing so, but it seems cleaner to have both libraries invoking > commands under identical conditions, such that if commands were > borrowed from one library and called from the other they would behave > the same. > > PostgresNode already uses TestLib, so perhaps setting up the > environment can be abstracted into something, perhaps TestLib::run, > and then used everywhere that IPC::Run::run currently is used. I don't think we need to do that. In the case of the PostgresNode.pm uses we know what the executable is, unlike the the TestLib.pm cases. They are our own executables and we don't expect them to be doing anything funky with /dev/tty. >>> >>> Ok. I think your proposal sounds fine. >> >> >> >> Here's a patch for that. The pty stuff crashes and burns on my Windows >> test box, so there I just set stdin to an empty string via the usual >> pipe mechanism. > > Ok, I've reviewed and tested this. It works fine for me on Linux. I > am not set up to test it on Windows. I think it is ready to commit. > > I have one remaining comment about the code, and this is just FYI. I > won't quibble with you committing your patch as it currently stands. > > You might consider changing the '\x04' literal to use a named control > character, both for readability and portability, as here: > > + use charnames ':full'; > + @no_stdin = (' > The only character set I can find where this matters is EBCDIC, in > which the EOT character is 55 rather than 4. Since EBCDIC does not > occur in the list of supported character sets for postgres, per the > docs section 23.3.1, I don't suppose it matters too much. Nor can > I test how this works on EBCDIC, so I'm mostly guessing that perl > would do the right thing there. But, at least to my eyes, it is > more immediately clear what the code is doing when the control > character name is spelled out. > > Agreed, I'll do it that way. This is quite timely, as I just finished reworking the patch that relies on it. Thanks for the review. cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
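For concreteness, the two spellings under discussion denote the same character; a tiny illustration, assuming a Perl new enough (5.16 or later) for \N{} to accept the official name alias of U+0004 (the snippet is not taken from the patch itself):

# "\x04" and the named form are the same EOT control character;
# the named spelling documents the intent better.
use charnames ':full';

my $eot_escape = "\x04";
my $eot_named  = "\N{END OF TRANSMISSION}";
print "identical\n" if $eot_escape eq $eot_named;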
Re: GROUPING SETS and SQL standard
po 25. 11. 2019 v 20:32 odesílatel Phil Florent napsal: > Hi, > > We are still on the process to migrate our applications from proprietary > RDBMS to PostgreSQL. > > Here is a simple query executed on various systems (real query is > different but this one does not need any data) : > > Connected to: > > Oracle Database 19c Standard Edition 2 Release 19.0.0.0.0 - Production > > Version 19.3.0.0.0 > > > > SQL> select count(*) from (select 1 from dual where 0=1 group by grouping > sets(())) tmp; > > > > COUNT(*) > > -- > > 0 > > > > > > select @@version; > > GO > > > > > --- > --- > -- > > Microsoft SQL Server 2017 (RTM-CU16) (KB4508218) - 14.0.3223.3 (X64) > > Jul 12 2019 17:43:08 > > Copyright (C) 2017 Microsoft Corporation > > Developer Edition (64-bit) on Linux (Ubuntu 16.04.6 LTS) > > > > select count(*) from (select 1 as c1 where 0=1 group by grouping sets(())) > tmp; > > GO > > > > --- > > 0 > > > > (1 rows affected) > > > > > > select version(); > > version > > > > > PostgreSQL 11.5 (Debian 11.5-1+deb10u1) on x86_64-pc-linux-gnu, compiled > by gcc (Debian 8.3.0-6) 8.3.0, 64-bit > > > > > > > > select count(*) from (select 1 from dual where 0=1 group by grouping > sets(())) tmp; > > count > > --- > > 1 > > (1 ligne) > > > > > > 0 or 1, which behaviour conforms to the SQL standard ? We have a > workaround and it's just informational. > This example has not too much sense - I am not sure if these corner cases are described by ANSI SQL standards. If I add aggregate query to subquery - using grouping sets without aggregation function is strange, then Postgres result looks more correct postgres=# select 1, count(*) from dual group by grouping sets(()); ┌──┬───┐ │ ?column? │ count │ ╞══╪═══╡ │1 │ 1 │ └──┴───┘ (1 row) postgres=# select 1, count(*) from dual where false group by grouping sets(()); ┌──┬───┐ │ ?column? │ count │ ╞══╪═══╡ │1 │ 0 │ └──┴───┘ (1 row) SELECT count(*) from this should be one in both cases. I am not sure, if standard describe using grouping sets without any aggregation function Pavel > > Regards, > > > Phil > >
Re: FETCH FIRST clause WITH TIES option
On 2019-Nov-11, Alvaro Herrera wrote: > I'm not sure the proposed changes to gram.y are all that great, though. Here's a proposed simplification of the gram.y changes. There are two things here: 1. cosmetic: we don't need the LimitClause struct; we can use just SelectLimit, and return that from limit_clause; that can be complemented using the offset_clause if there's any at select_limit level. 2. there's a gratuituous palloc() in opt_select_limit when there's no clause, that seems to be there just so that NULLs can be returned. That's one extra palloc for SELECTs parsed using one the affected productions ... it's not every single select, but it seems bad enough it's worth fixing. I fixed #2 by just checking whether the return from opt_select_limit is NULL. ISTM it'd be better to pass the SelectLimit pointer to insertSelectOptions instead (which is a static function in gram.y anyway so there's no difficulty there). -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index e776215155..f4304a45b9 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -135,13 +135,6 @@ typedef struct SelectLimit LimitOption limitOption; } SelectLimit; -/* Private struct for the result of limit_clause production */ -typedef struct LimitClause -{ - Node *limitCount; - LimitOption limitOption; -} LimitClause; - /* ConstraintAttributeSpec yields an integer bitmask of these flags: */ #define CAS_NOT_DEFERRABLE 0x01 #define CAS_DEFERRABLE0x02 @@ -258,8 +251,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); PartitionSpec *partspec; PartitionBoundSpec *partboundspec; RoleSpec *rolespec; - struct SelectLimit *SelectLimit; - struct LimitClause *LimitClause; + struct SelectLimit *selectlimit; } %type stmt schema_stmt @@ -392,8 +384,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); %type import_qualification_type %type import_qualification %type vacuum_relation -%type opt_select_limit select_limit -%type limit_clause +%type opt_select_limit select_limit limit_clause %type stmtblock stmtmulti OptTableElementList TableElementList OptInherit definition @@ -11343,8 +11334,9 @@ select_no_parens: | select_clause opt_sort_clause for_locking_clause opt_select_limit { insertSelectOptions((SelectStmt *) $1, $2, $3, - ($4)->limitOffset, ($4)->limitCount, - ($4)->limitOption, + ($4) ? ($4)->limitOffset : NULL, + ($4) ? ($4)->limitCount : NULL, + ($4) ? ($4)->limitOption : LIMIT_OPTION_DEFAULT, NULL, yyscanner); $$ = $1; @@ -11362,7 +11354,7 @@ select_no_parens: { insertSelectOptions((SelectStmt *) $2, NULL, NIL, NULL, NULL, - LIMIT_OPTION_DEFAULT,$1, + LIMIT_OPTION_DEFAULT, $1, yyscanner); $$ = $2; } @@ -11370,15 +11362,16 @@ select_no_parens: { insertSelectOptions((SelectStmt *) $2, $3, NIL, NULL, NULL, - LIMIT_OPTION_DEFAULT,$1, + LIMIT_OPTION_DEFAULT, $1, yyscanner); $$ = $2; } | with_clause select_clause opt_sort_clause for_locking_clause opt_select_limit { insertSelectOptions((SelectStmt *) $2, $3, $4, - ($5)->limitOffset, ($5)->limitCount, - ($5)->limitOption, + ($5) ? ($5)->limitOffset : NULL, + ($5) ? ($5)->limitCount : NULL, + ($5) ? 
($5)->limitOption : LIMIT_OPTION_DEFAULT, $1, yyscanner); $$ = $2; @@ -11683,27 +11676,17 @@ sortby: a_expr USING qual_all_Op opt_nulls_order select_limit: limit_clause offset_clause { - SelectLimit *n = (SelectLimit *) palloc(sizeof(SelectLimit)); - n->limitOffset = $2; - n->limitCount = ($1)->limitCount; - n->limitOption = ($1)->limitOption; - $$ = n; + $$ = $1; + ($$)->limitOffset = $2; } | offset_clause limit_clause { - SelectLimit *n = (SelectLimit *) palloc(sizeof(SelectLimit)); - n->limitOffset = $1; - n->limitCount = ($2)->limitCount; - n->limitOption = ($2)->limitOption; - $$ = n; + $$ = $2; + ($$)->limitOffset = $1; } | limit_clause { - SelectLimit *n = (SelectLimit *) palloc(sizeof(SelectLimit)); - n->limitOffset = NULL; - n->limitCount = ($1)->limitCount; - n->limitOption = ($1)->limitOption; - $$ = n; + $$ = $1; } | offset_clause { @@ -11717,20 +11700,14 @@ select_limit: opt_select_limit: select_limit { $$ = $1; } - | /* EMPTY */ -{ - SelectLimit *n = (SelectLimit *) palloc(sizeof(SelectLimit)); - n->limitOffset = NULL; - n->limitCount = NULL; - n->limitOption = LIMIT_OPTION_DEFAULT; - $$ = n; -} + | /* EMPTY */ { $$ =
Re: FETCH FIRST clause WITH TIES option
(Prior to posting this delta patch, the CF bot appeared happy with this patch.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
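For readers following along, the user-visible syntax the patch series implements is the standard FETCH FIRST ... WITH TIES clause; a minimal illustration with made-up table and column names:

-- Returns the top 3 rows by amount, plus any further rows that tie
-- with the 3rd one on the ORDER BY key.
SELECT *
FROM sales
ORDER BY amount DESC
FETCH FIRST 3 ROWS WITH TIES;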
Re: [HACKERS] WAL logging problem in 9.4.3?
On Sat, Nov 23, 2019 at 4:21 PM Noah Misch wrote: > I noticed an additional defect: > > BEGIN; > CREATE TABLE t (c) AS SELECT 1; > CHECKPOINT; -- write and fsync the table's one page > TRUNCATE t; -- no WAL > COMMIT; -- no FPI, just the commit record > > If we crash after the COMMIT and before the next fsync or OS-elected sync of > the table's file, the table will stay on disk with its pre-TRUNCATE content. Shouldn't the TRUNCATE be triggering an fsync() to happen before COMMIT is permitted to complete? You'd have the same problem if the TRUNCATE were replaced by INSERT, unless fsync() happens in that case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: libpq sslpassword parameter and callback function
On 10/31/19 7:27 PM, Andrew Dunstan wrote: > On 10/31/19 6:34 PM, Andrew Dunstan wrote: >> This time with attachment. >> >> >> On 10/31/19 6:33 PM, Andrew Dunstan wrote: >>> This patch provides for an sslpassword parameter for libpq, and a hook >>> that a client can fill in for a callback function to set the password. >>> >>> >>> This provides similar facilities to those already available in the JDBC >>> driver. >>> >>> >>> There is also a function to fetch the sslpassword from the connection >>> parameters, in the same way that other settings can be fetched. >>> >>> >>> This is mostly the excellent work of my colleague Craig Ringer, with a >>> few embellishments from me. >>> >>> >>> Here are his notes: >>> >>> >>> Allow libpq to non-interactively decrypt client certificates that >>> are stored >>> encrypted by adding a new "sslpassword" connection option. >>> >>> The sslpassword option offers a middle ground between a cleartext >>> key and >>> setting up advanced key mangement via openssl engines, PKCS#11, USB >>> crypto >>> offload and key escrow, etc. >>> >>> Previously use of encrypted client certificate keys only worked if >>> the user >>> could enter the key's password interactively on stdin, in response >>> to openssl's >>> default prompt callback: >>> >>> Enter PEM passhprase: >>> >>> That's infesible in many situations, especially things like use from >>> postgres_fdw. >>> >>> This change also allows admins to prevent libpq from ever prompting >>> for a >>> password by calling: >>> >>> PQsetSSLKeyPassHook(PQdefaultSSLKeyPassHook); >>> >>> which is useful since OpenSSL likes to open /dev/tty to prompt for a >>> password, >>> so even closing stdin won't stop it blocking if there's no user >>> input available. >>> Applications may also override or extend SSL password fetching with >>> their own >>> callback. >>> >>> There is deliberately no environment variable equivalent for the >>> sslpassword >>> option. >>> >>> > I should also mention that this patch provides for support for DER > format certificates and keys. > > Here's an updated version of the patch, adjusted to the now committed changes to TestLib.pm. 
cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >From b3b2104ab0a17383bb1e55ef59eaa609a417ed40 Mon Sep 17 00:00:00 2001 From: Andrew Dunstan Date: Mon, 25 Nov 2019 14:59:05 -0500 Subject: [PATCH] libpq sslpassword and DER support --- contrib/dblink/expected/dblink.out| 2 +- doc/src/sgml/libpq.sgml | 116 ++ doc/src/sgml/postgres-fdw.sgml| 2 +- src/interfaces/libpq/exports.txt | 4 + src/interfaces/libpq/fe-connect.c | 14 +++ src/interfaces/libpq/fe-secure-openssl.c | 99 +- src/interfaces/libpq/libpq-fe.h | 14 +++ src/interfaces/libpq/libpq-int.h | 2 + src/test/ssl/Makefile | 22 +++- src/test/ssl/ssl/client-der.key | Bin 0 -> 1191 bytes src/test/ssl/ssl/client-encrypted-der.key | Bin 0 -> 1191 bytes src/test/ssl/ssl/client-encrypted-pem.key | 30 ++ src/test/ssl/t/001_ssltests.pl| 66 ++-- 13 files changed, 354 insertions(+), 17 deletions(-) create mode 100644 src/test/ssl/ssl/client-der.key create mode 100644 src/test/ssl/ssl/client-encrypted-der.key create mode 100644 src/test/ssl/ssl/client-encrypted-pem.key diff --git a/contrib/dblink/expected/dblink.out b/contrib/dblink/expected/dblink.out index 6ceabb453c..6516d4f131 100644 --- a/contrib/dblink/expected/dblink.out +++ b/contrib/dblink/expected/dblink.out @@ -879,7 +879,7 @@ $d$; CREATE USER MAPPING FOR public SERVER fdtest OPTIONS (server 'localhost'); -- fail, can't specify server here ERROR: invalid option "server" -HINT: Valid options in this context are: user, password +HINT: Valid options in this context are: user, password, sslpassword CREATE USER MAPPING FOR public SERVER fdtest OPTIONS (user :'USER'); GRANT USAGE ON FOREIGN SERVER fdtest TO regress_dblink_user; GRANT EXECUTE ON FUNCTION dblink_connect_u(text, text) TO regress_dblink_user; diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml index 258b09cf8e..6082b38e9e 100644 --- a/doc/src/sgml/libpq.sgml +++ b/doc/src/sgml/libpq.sgml @@ -776,6 +776,72 @@ PGPing PQping(const char *conninfo); + + PQsetSSLKeyPassHookPQsetSSLKeyPassHook + + + PQsetSSLKeyPassHook lets an application override + libpq's default + handling of encrypted client certificate key files using +or interactive prompting. + + +void PQsetSSLKeyPassHook(PQsslKeyPassHook_type hook); + + + The application passes a pointer to a callback function with signature: + + int callback_fn(char *buf, int size, PGc
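The quoted documentation is cut off above mid-signature, so the following sketch assumes the callback's third parameter is the PGconn pointer and that, as with OpenSSL's pem_password_cb convention, the hook copies the passphrase into buf and returns its length (0 on failure). PQsetSSLKeyPassHook exists only with this patch applied, and the function and environment-variable names below are made up:

#include <stdlib.h>
#include <string.h>
#include <libpq-fe.h>

/*
 * Hypothetical hook: fetch the client-certificate key passphrase from an
 * application-specific source instead of letting OpenSSL prompt on /dev/tty.
 */
static int
my_sslkey_passwd(char *buf, int size, PGconn *conn)
{
	const char *pw = getenv("MYAPP_SSLKEY_PASSPHRASE");

	if (pw == NULL || (int) strlen(pw) >= size)
		return 0;				/* nothing usable; treat as failure */
	strcpy(buf, pw);
	return (int) strlen(pw);
}

static void
init_libpq_hooks(void)
{
	/* register once at application startup */
	PQsetSSLKeyPassHook(my_sslkey_passwd);
}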
RE: GROUPING SETS and SQL standard
Hi, Thank you, as you mentionned it's not really an interesting real life case anyway. Regards, Phil De : Pavel Stehule Envoyé : lundi 25 novembre 2019 21:23 À : Phil Florent Cc : pgsql-hack...@postgresql.org Objet : Re: GROUPING SETS and SQL standard po 25. 11. 2019 v 20:32 odesílatel Phil Florent mailto:philflor...@hotmail.com>> napsal: Hi, We are still on the process to migrate our applications from proprietary RDBMS to PostgreSQL. Here is a simple query executed on various systems (real query is different but this one does not need any data) : Connected to: Oracle Database 19c Standard Edition 2 Release 19.0.0.0.0 - Production Version 19.3.0.0.0 SQL> select count(*) from (select 1 from dual where 0=1 group by grouping sets(())) tmp; COUNT(*) -- 0 select @@version; GO --- --- -- Microsoft SQL Server 2017 (RTM-CU16) (KB4508218) - 14.0.3223.3 (X64) Jul 12 2019 17:43:08 Copyright (C) 2017 Microsoft Corporation Developer Edition (64-bit) on Linux (Ubuntu 16.04.6 LTS) select count(*) from (select 1 as c1 where 0=1 group by grouping sets(())) tmp; GO --- 0 (1 rows affected) select version(); version PostgreSQL 11.5 (Debian 11.5-1+deb10u1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit select count(*) from (select 1 from dual where 0=1 group by grouping sets(())) tmp; count --- 1 (1 ligne) 0 or 1, which behaviour conforms to the SQL standard ? We have a workaround and it's just informational. This example has not too much sense - I am not sure if these corner cases are described by ANSI SQL standards. If I add aggregate query to subquery - using grouping sets without aggregation function is strange, then Postgres result looks more correct postgres=# select 1, count(*) from dual group by grouping sets(()); ┌──┬───┐ │ ?column? │ count │ ╞══╪═══╡ │1 │ 1 │ └──┴───┘ (1 row) postgres=# select 1, count(*) from dual where false group by grouping sets(()); ┌──┬───┐ │ ?column? │ count │ ╞══╪═══╡ │1 │ 0 │ └──┴───┘ (1 row) SELECT count(*) from this should be one in both cases. I am not sure, if standard describe using grouping sets without any aggregation function Pavel Regards, Phil
Re: [Doc] pg_restore documentation didn't explain how to use connection string
On Wed, 2019-11-13 at 16:48 +0100, Lætitia Avrot wrote: > So after some thoughts I did the minimal patch (for now). > I corrected documentation for the following tools so that now, using > connection string > for Postgres client applications is documented in Postgres: > - clusterdb > - pgbench > - pg_dump > - pg_restore > - reindexdb > - vacuumdb I think that this patch is a good idea. Even if it adds some redundancy, that can hardly be avoided because, as you said, the options to specify the database name are not the same everywhere. The patch applies and build fine. I see some room for improvement: - I think that "connection string" is better than "conninfo string". At least the chapter to which you link is headed "Connection Strings". This would also be consistent with the use of that term in the "pg_basebackup" , "pg_dumpall" and "pg_receivewal" documentation. You seem to have copied that wording from the "pg_isready", "psql", "reindexdb" and "vacuumdb" documentation, but I think it would be better to reword those too. - You begin your paragraph with "if this parameter contains ...". First, I think "argument" might be more appropriate here, as you are talking about a) the supplied value and b) a command line argument or the argument to an option Besides, it might be confusing to refer to "*this* parameter" if the text is not immediately after what you are referring to, like for example in "pgbench", where it might refer to the -h, -p or -U options. I think it would be better and less ambiguous to use "If dbname contains ..." In the cases where there is no ambiguity, it might be better to use a wording like in the "pg_recvlogical" documentation. - There are two places you forgot: createdb --maintenance-db=dbname dropdb --maintenance-db=dbname While looking at this patch, I noticed that "createuser" and "dropuser" explicitly connect to the "postgres" database rather than using "connectMaintenanceDatabase()" like the other scripts, which would try the database "postgres" first and fall back to "template1". This is unrelated to the patch, but low-hanging fruit for unified behavior. Yours, Laurenz Albe
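As a concrete illustration of what the patch documents, the dbname argument of these client programs can already be a connection string or URI; the host and database names below are made up:

pg_restore -d "host=db.example.com port=5432 dbname=appdb sslmode=require" app.dump
pg_dump "postgresql://app_user@db.example.com/appdb" -Fc -f app.dump
vacuumdb --maintenance-db="host=db.example.com dbname=postgres" --all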
Re: [HACKERS] WAL logging problem in 9.4.3?
On Mon, Nov 25, 2019 at 03:58:14PM -0500, Robert Haas wrote: > On Sat, Nov 23, 2019 at 4:21 PM Noah Misch wrote: > > I noticed an additional defect: > > > > BEGIN; > > CREATE TABLE t (c) AS SELECT 1; > > CHECKPOINT; -- write and fsync the table's one page > > TRUNCATE t; -- no WAL > > COMMIT; -- no FPI, just the commit record > > > > If we crash after the COMMIT and before the next fsync or OS-elected sync of > > the table's file, the table will stay on disk with its pre-TRUNCATE content. > > Shouldn't the TRUNCATE be triggering an fsync() to happen before > COMMIT is permitted to complete? With wal_skip_threshold=0, you do get an fsync(). The patch tries to avoid at-commit fsync of small files by WAL-logging file contents instead. However, the patch doesn't WAL-log enough to handle files that decreased in size. > You'd have the same problem if the > TRUNCATE were replaced by INSERT, unless fsync() happens in that case. I think an insert would be fine. You'd get an FPI record for the relation's one page, which fully reproduces the relation.
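To restate the two behaviours in executable form: this assumes wal_level = minimal and the patch under discussion applied, and wal_skip_threshold is the GUC that patch introduces, not an existing setting:

-- With wal_skip_threshold = 0 the small-file optimization is effectively
-- disabled, so the commit fsyncs the file instead of relying on WAL,
-- avoiding the stale-content problem described above.
SET wal_skip_threshold = 0;
BEGIN;
CREATE TABLE t (c) AS SELECT 1;
CHECKPOINT;
TRUNCATE t;
COMMIT;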
Re: Allow superuser to grant passwordless connection rights on postgres_fdw
On Sun, Nov 10, 2019 at 4:35 AM Craig Ringer wrote: > > On Mon, 4 Nov 2019 at 12:20, Stephen Frost wrote: >> >> Greetings, >> >> * Andrew Dunstan (andrew.duns...@2ndquadrant.com) wrote: >> > On 11/1/19 12:58 PM, Robert Haas wrote: >> > > On Thu, Oct 31, 2019 at 4:58 PM Andrew Dunstan >> > > wrote: >> > >> This patch allows the superuser to grant passwordless connection rights >> > >> in postgres_fdw user mappings. >> > > This is clearly something that we need, as the current code seems >> > > woefully ignorant of the fact that passwords are not the only >> > > authentication method supported by PostgreSQL, nor even the most >> > > secure. >> > > >> > > But, I do wonder a bit if we ought to think harder about the overall >> > > authentication model for FDW. Like, maybe we'd take a different view >> > > of how to solve this particular piece of the problem if we were >> > > thinking about how FDWs could do LDAP authentication, SSL >> > > authentication, credentials forwarding... >> > >> > I'm certainly open to alternatives. >> >> I've long felt that the way to handle this kind of requirement is to >> have a "trusted remote server" kind of option- where the local server >> authenticates to the remote server as a *server* and then says "this is >> the user on this server, and this is the user that this user wishes to >> be" and the remote server is then able to decide if they accept that, or >> not. > > > The original use case for the patch was to allow FDWs to use SSL/TLS client > certificates. Each user-mapping has its own certificate - there's a separate > patch to allow that. So there's no delegation of trust via Kerberos etc in > that particular case. > > I can see value in using Kerberos etc for that too though, as it separates > authorization and authentication in the same manner as most sensible systems. > You can say "user postgres@foo is trusted to vet users so you can safely hand > out tickets for any bar@foo that postgres@foo says is legit". > > I would strongly discourage allowing all users on host A to authenticate as > user postgres on host B. But with appropriate user-mappings support, we could > likely support that sort of model for both SSPI and Kerberos. > > A necessary prerequisite is that Pg be able to cope with passwordless > user-mappings though. Hence this patch. > > Yeah, I agree. Does anyone else want to weigh in on this? If nobody objects I'd like to tidy this up and get it committed so we can add support for client certs in postgres_fdw, which is the real business at hand, and which I know from various offline comments a number of people are keen to have. cheers andrew -- Andrew Dunstanhttps://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: global / super barriers (for checksums)
On Wed, Nov 13, 2019 at 12:26 PM Robert Haas wrote: > On the other hand, 0002 seems like it's pretty clearly a good idea. It > makes a whole bunch of auxiliary processes use > procsignal_sigusr1_handler() and those things all get called from > AuxiliaryProcessMain(), which does ProcSignalInit(), and it seems like > clearly the right idea that processes which register to participate in > the procsignal mechanism should also register to get notified if they > receive a procsignal. I think that the reason we haven't bothered with > this up until now is because I think that it's presently impossible > for any of the kind of procsignals that we have to get sent to any of > those processes. But, global barriers would require us to do so, so it > seems like it's time to tighten that up, and it doesn't really cost > anything. So I propose to commit this part soon, unless somebody > objects. Done. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: benchmarking Flex practices
[ My apologies for being so slow to get back to this ] John Naylor writes: > Now that I think of it, the regression in v7 was largely due to the > fact that the parser has to call the lexer 3 times per string in this > case, and that's going to be slower no matter what we do. Ah, of course. I'm not too fussed about the performance of queries with an explicit UESCAPE clause, as that seems like a very minority use-case. What we do want to pay attention to is not regressing for plain identifiers/strings, and to a lesser extent the U& cases without UESCAPE. > Inlining hexval() and friends seems to have helped somewhat for > unicode escapes, but I'd have to profile to improve that further. > However, v8 has regressed from v7 enough with both simple strings and > the information schema that it's a noticeable regression from HEAD. > I'm guessing getting rid of the "Uescape" production is to blame, but > I haven't tried reverting just that one piece. Since inlining the > rules didn't seem to help with the precedence hacks, it seems like the > separate production was a better way. Thoughts? I have duplicated your performance tests here, and get more or less the same results (see below). I agree that the performance of the v8 patch isn't really where we want to be --- and it also seems rather invasive to gram.y, and hence error-prone. (If we do it like that, I bet my bottom dollar that somebody would soon commit a patch that adds a production using IDENT not Ident, and it'd take a long time to notice.) It struck me though that there's another solution we haven't discussed, and that's to make the token lookahead filter in parser.c do the work of converting UIDENT [UESCAPE SCONST] to IDENT, and similarly for the string case. I pursued that to the extent of developing the attached incomplete patch ("v9"), which looks reasonable from a performance standpoint. I get these results with tests using the drive_parser function: information_schema HEAD3447.674 ms, 3433.498 ms, 3422.407 ms v6 3381.851 ms, 3442.478 ms, 3402.629 ms v7 3525.865 ms, 3441.038 ms, 3473.488 ms v8 3567.640 ms, 3488.417 ms, 3556.544 ms v9 3456.360 ms, 3403.635 ms, 3418.787 ms pgbench str HEAD4414.046 ms, 4376.222 ms, 4356.468 ms v6 4304.582 ms, 4245.534 ms, 4263.562 ms v7 4395.815 ms, 4398.381 ms, 4460.304 ms v8 4475.706 ms, 4466.665 ms, 4471.048 ms v9 4392.473 ms, 4316.549 ms, 4318.472 ms pgbench unicode HEAD4959.000 ms, 4921.751 ms, 4945.069 ms v6 4856.998 ms, 4802.996 ms, 4855.486 ms v7 5057.199 ms, 4948.342 ms, 4956.614 ms v8 5008.090 ms, 4963.641 ms, 4983.576 ms v9 4809.227 ms, 4767.355 ms, 4741.641 ms pgbench uesc HEAD5114.401 ms, 5235.764 ms, 5200.567 ms v6 5030.156 ms, 5083.398 ms, 4986.974 ms v7 5915.508 ms, 5953.135 ms, 5929.775 ms v8 5678.810 ms, 5665.239 ms, 5645.696 ms v9 5648.965 ms, 5601.592 ms, 5600.480 ms (A note about what we're looking at: on my machine, after using cpupower to lock down the CPU frequency, and taskset to bind everything to one CPU socket, I can get numbers that are very repeatable, to 0.1% or so ... until I restart the postmaster, and then I get different but equally repeatable numbers. The difference can be several percent, which is a lot of noise compared to what we're looking for. I believe the explanation is that kernel ASLR has loaded the backend executable at some different addresses and so there are different cache-line-boundary effects. While I could lock that down too by disabling ASLR, the result would be to overemphasize chance effects of a particular set of cache line boundaries. 
So I prefer to run all the tests over again after restarting the postmaster, a few times, and then look at the overall set of results to see what things look like. Each number quoted above is median-of-three tests within a single postmaster run.) Anyway, my conclusion is that the attached patch is at least as fast as today's HEAD; it's not as fast as v6, but on the other hand it's an even smaller postmaster executable, so there's something to be said for that: $ size postg* textdata bss dec hex filename 7478138 57928 203360 7739426 761822 postgres.head 7271218 57928 203360 7532506 72efda postgres.v6 7275810 57928 203360 7537098 7301ca postgres.v7 7276978 57928 203360 7538266 73065a postgres.v8 7266274 57928 203360 7527562 72dc8a postgres.v9 I based this on your v7 not v8; not sure if there's anything you want to salvage from v8. Generally, I'm pretty happy with this approach: it touches gram.y hardly at all, and it removes just about all of the complexity from scan.l. I'm happier about dropping the support code into parser.c than the other choices we've discussed. There's still undone work here, though: * I did not touch psql. Probably your patch is fine for that. * I did not do more with ecpg than get it to compile, using the same hacks as in your v7. It still fai
RE: GROUPING SETS and SQL standard
A of () (called grand total in the Standard) is equivalent to grouping the entire result Table; If I get it correctly: select max(dummy) from dual where 0 = 1 group by grouping sets(()); and select max(dummy) from dual where 0 = 1 ; should have the same output. It's the case with PostgreSQL, not with Oracle. Hence it means it's PostgreSQL which conforms to the standard in this case. Regards, Phil De : Phil Florent Envoyé : lundi 25 novembre 2019 22:18 À : Pavel Stehule Cc : pgsql-hack...@postgresql.org Objet : RE: GROUPING SETS and SQL standard Hi, Thank you, as you mentionned it's not really an interesting real life case anyway. Regards, Phil De : Pavel Stehule Envoyé : lundi 25 novembre 2019 21:23 À : Phil Florent Cc : pgsql-hack...@postgresql.org Objet : Re: GROUPING SETS and SQL standard po 25. 11. 2019 v 20:32 odesílatel Phil Florent mailto:philflor...@hotmail.com>> napsal: Hi, We are still on the process to migrate our applications from proprietary RDBMS to PostgreSQL. Here is a simple query executed on various systems (real query is different but this one does not need any data) : Connected to: Oracle Database 19c Standard Edition 2 Release 19.0.0.0.0 - Production Version 19.3.0.0.0 SQL> select count(*) from (select 1 from dual where 0=1 group by grouping sets(())) tmp; COUNT(*) -- 0 select @@version; GO --- --- -- Microsoft SQL Server 2017 (RTM-CU16) (KB4508218) - 14.0.3223.3 (X64) Jul 12 2019 17:43:08 Copyright (C) 2017 Microsoft Corporation Developer Edition (64-bit) on Linux (Ubuntu 16.04.6 LTS) select count(*) from (select 1 as c1 where 0=1 group by grouping sets(())) tmp; GO --- 0 (1 rows affected) select version(); version PostgreSQL 11.5 (Debian 11.5-1+deb10u1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit select count(*) from (select 1 from dual where 0=1 group by grouping sets(())) tmp; count --- 1 (1 ligne) 0 or 1, which behaviour conforms to the SQL standard ? We have a workaround and it's just informational. This example has not too much sense - I am not sure if these corner cases are described by ANSI SQL standards. If I add aggregate query to subquery - using grouping sets without aggregation function is strange, then Postgres result looks more correct postgres=# select 1, count(*) from dual group by grouping sets(()); ┌──┬───┐ │ ?column? │ count │ ╞══╪═══╡ │1 │ 1 │ └──┴───┘ (1 row) postgres=# select 1, count(*) from dual where false group by grouping sets(()); ┌──┬───┐ │ ?column? │ count │ ╞══╪═══╡ │1 │ 0 │ └──┴───┘ (1 row) SELECT count(*) from this should be one in both cases. I am not sure, if standard describe using grouping sets without any aggregation function Pavel Regards, Phil
Re: GROUPING SETS and SQL standard
Phil Florent writes: > A of () (called grand total in the Standard) is > equivalent to grouping the entire result Table; Yeah, I believe so. Grouping by no columns is similar to what happens if you compute an aggregate with no GROUP BY: the whole table is taken as one group. If the table is empty, the group is empty, but there's still a group --- that's why you get one aggregate output value, not none, from regression=# select count(*) from dual where 0 = 1; count --- 0 (1 row) Thus, in your example, the sub-query should give regression=# select 1 from dual where 0=1 group by grouping sets(()); ?column? -- 1 (1 row) and therefore it's correct that regression=# select count(*) from (select 1 from dual where 0=1 group by grouping sets(())) tmp; count --- 1 (1 row) AFAICS, Oracle and SQL Server are getting it wrong. regards, tom lane
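Put another way, using the same one-row dual table as above: the grand-total grouping set is just the empty GROUP BY list, and both forms behave like an aggregate over the whole (here empty) input:

SELECT count(*) FROM dual WHERE 0 = 1 GROUP BY ();                  -- one row: 0
SELECT count(*) FROM dual WHERE 0 = 1 GROUP BY GROUPING SETS (());  -- one row: 0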
Dynamic gathering the values for seq_page_cost/xxx_cost
The optimizer cost model usually needs two kinds of input: one describes the data distribution, and the other describes the capacity of the hardware (CPU, I/O); let's call the latter "system stats". In Oracle, system stats can be gathered with dbms_stats.gather_system_stats [1] on the running hardware. In PostgreSQL, the values are set based on experience (users can change them as well, but it is hard to decide which values to use). The PostgreSQL way is not perfect in theory (in practice it may or may not be good enough). For example, HDD and SSD have very different characteristics with respect to seq_page_cost/random_page_cost, and CPU costs may also differ across hardware.

I ran into a paper [2] which did some research on gathering the values for the xxx_cost parameters dynamically, which looks interesting. However, it doesn't provide code for others to do more research with. Before I dive into this, it would be great to hear suggestions from experts: what do you think about this method, has it been discussed before, and if so, what was the outcome?

[1] https://docs.oracle.com/database/121/ARPLS/d_stats.htm#ARPLS68580
[2] https://dsl.cds.iisc.ac.in/publications/thesis/pankhuri.pdf

Thanks
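For reference, today these capacity parameters are set by hand, either globally or per tablespace; the paper's proposal is essentially to derive such values from measurements instead (the tablespace name below is illustrative):

ALTER SYSTEM SET seq_page_cost = 1.0;
ALTER SYSTEM SET random_page_cost = 1.1;    -- a common hand-picked value for SSDs
SELECT pg_reload_conf();

-- or per tablespace, e.g. for an SSD-backed one:
ALTER TABLESPACE fast_ssd SET (random_page_cost = 1.1, seq_page_cost = 1.0);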
[PATCH] Fix possible string overflow with sscanf (xlog.c)
Hi, I know it's very hard, but is possible. Just someone with the knowledge to do. Here a proof of concept: #include #include #define MAXPGPATH 256 int main(int argc, char ** argv) { chartbsoid[MAXPGPATH]; charstr[MAXPGPATH]; int ch, prev_ch = -1, i = 0, n; FILE * lfp; lfp = fopen("c:\\tmp\\crash.dat", "rb"); while ((ch = fgetc(lfp)) != EOF) { if ((ch == '\n' || ch == '\r') && prev_ch != '\\') { str[i] = '\0'; if (sscanf(str, "%s %n", tbsoid, &n) != 1) { printf("tbsoid size=%u\n", strlen(tbsoid)); printf("tbsoid=%s\n", tbsoid); exit(1); } i = 0; continue; } else if ((ch == '\n' || ch == '\r') && prev_ch == '\\') str[i - 1] = ch; else str[i++] = ch; prev_ch = ch; } fclose(lfp); } Overflow with (MAXPGPATH=256) C:\usr\src\tests\scanf>sscanf3 tbsoid size=260 tbsoid=x xxx Now with patch: C:\usr\src\tests\scanf>sscanf3 tbsoid size=255 tbsoid=x xx The solution is simple, but clumsy. I hope that is enough. sscanf(str, "%1023s %n", tbsoid, &n) Best regards. Ranier Vileladiff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 5f0ee50092..790e68ccf0 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -11382,7 +11382,7 @@ read_tablespace_map(List **tablespaces) if ((ch == '\n' || ch == '\r') && prev_ch != '\\') { str[i] = '\0'; - if (sscanf(str, "%s %n", tbsoid, &n) != 1) + if (sscanf(str, FMTPGPATH, tbsoid, &n) != 1) ereport(FATAL, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("invalid data in file \"%s\"", TABLESPACE_MAP))); diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h index 743401cb96..03e0e3735e 100644 --- a/src/include/pg_config_manual.h +++ b/src/include/pg_config_manual.h @@ -84,6 +84,11 @@ */ #define MAXPGPATH 1024 +/* + * FMTPGPATH: standard format to scan pathname buffer, must follow MAXPGPATH + */ +#define FMTPGPATH "%1023s %n" + /* * PG_SOMAXCONN: maximum accept-queue length limit passed to * listen(2). You'd think we should use SOMAXCONN from
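The point of the fix in standalone form (the buffer size mirrors MAXPGPATH = 1024 and the input string is made up): an unbounded %s conversion can write past the destination, while a width-limited one cannot.

#include <stdio.h>

int
main(void)
{
	char		tbsoid[1024];	/* same size as MAXPGPATH */
	int			n = 0;
	const char *str = "pg_tblspc/16384 trailing text";

	/* "%1023s" leaves room for the NUL terminator; plain "%s" would not */
	if (sscanf(str, "%1023s %n", tbsoid, &n) == 1)
		printf("parsed \"%s\", %d bytes consumed\n", tbsoid, n);
	return 0;
}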
Re: progress report for ANALYZE
Hi Amit-san!

Thanks for your comments!

> I have looked at the patch and here are some comments.
>
> I think include_children and current_relid are not enough to understand
> the progress of analyzing inheritance trees, because even with
> current_relid being updated, I can't tell how many more there will be.
> I think it'd be better to show the total number of children and the
> number of children processed, just like pg_stat_progress_create_index
> does for partitions. So, instead of include_children and current_relid,
> I think it's better to have child_tables_total, child_tables_done, and
> current_child_relid, placed last in the set of columns.

Ah, I understood. I'll check what pg_stat_progress_create_index does for
partitions, and will create a new patch. :)

Related to the above, I wonder whether we need the total number of ext
stats on pg_stat_progress_analyze or not. As you might know, there is the
same counter on pg_stat_progress_vacuum and pg_stat_progress_cluster. For
example, index_vacuum_count and index_rebuild_count. Would it be added to
the total number column to these views? :)

> Also, inheritance tree stats are created *after* creating single table
> stats, so I think that it would be better to have a distinct phase name
> for that, say "acquiring inherited sample rows". In do_analyze_rel(),
> you can select which of two phases to set based on whether inh is true
> or not. For partitioned tables, the progress output will immediately
> switch to this phase, because the partitioned table itself is empty so
> there's nothing to do in the "acquiring sample rows" phase.
>
> That's all for now.

Okay! I'll also add the new phase "acquiring inherited sample rows" in the
next patch. :)

Thanks,
Tatsuro Yamada
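If the columns suggested above are adopted, monitoring could look roughly like the following; child_tables_total, child_tables_done and current_child_relid are only the names proposed in this thread, not a committed catalog definition:

SELECT pid,
       relid::regclass                  AS table_being_analyzed,
       phase,
       child_tables_total,
       child_tables_done,
       current_child_relid::regclass    AS current_child
FROM pg_stat_progress_analyze;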
Re: Problem while updating a foreign table pointing to a partitioned table on foreign server
Hi Michael-san, On Mon, Nov 25, 2019 at 4:13 PM Michael Paquier wrote: > On Tue, Sep 03, 2019 at 12:37:52AM +0900, Etsuro Fujita wrote: > > On Wed, Aug 14, 2019 at 11:51 AM Etsuro Fujita > > wrote: > >> This is my TODO item for PG13, but I'll give priority to other things > >> in the next commitfest. If anyone wants to work on it, feel free; > >> else I'll move this to the November commitfest when it opens. > > > > Moved. > > This has been waiting on author for two commit fests now, and the > situation has not changed. Fujita-san, Hiriguchi-san, do you have an > update to provide? There is no meaning to keep the current stale > situation for more CFs. I was planning to work on this in this commitfest, but sorry, I didn't have time due to other priorities. Probably, I won't have time for this in the development cycle for v13. So I'll mark this as RWF, unless anyone wants to work on it. Best regards, Etsuro Fujita
Re: progress report for ANALYZE
Hi Amit-san,

> Related to the above, I wonder whether we need the total number of ext
> stats on pg_stat_progress_analyze or not. As you might know, there is the
> same counter on pg_stat_progress_vacuum and pg_stat_progress_cluster. For
> example, index_vacuum_count and index_rebuild_count. Would it be added to
> the total number column to these views? :)

Oops, I made a mistake. :(
What I'd like to say was:

  Would it be better to add the total number column to these views? :)

Thanks,
Tatsuro Yamada
Re: checkpointer: PANIC: could not fsync file: No such file or directory
I looked and found a new "hint". On Tue, Nov 19, 2019 at 05:57:59AM -0600, Justin Pryzby wrote: > < 2019-11-15 22:16:07.098 EST >PANIC: could not fsync file > "base/16491/1731839470.2": No such file or directory > < 2019-11-15 22:16:08.751 EST >LOG: checkpointer process (PID 27388) was > terminated by signal 6: Aborted An earlier segment of that relation had been opened successfully and was *still* opened: $ sudo grep 1731839470 /var/spool/abrt/ccpp-2019-11-15-22:16:08-27388/open_fds 63:/var/lib/pgsql/12/data/base/16491/1731839470 For context: $ sudo grep / /var/spool/abrt/ccpp-2019-11-15-22:16:08-27388/open_fds |tail -3 61:/var/lib/pgsql/12/data/base/16491/1757077748 62:/var/lib/pgsql/12/data/base/16491/1756223121.2 63:/var/lib/pgsql/12/data/base/16491/1731839470 So this may be an issue only with relations>segment (but, that interpretation could also be very naive).
Re: Safeguards against incorrect fd flags for fsync()
On Mon, Nov 25, 2019 at 04:18:33PM +0900, Michael Paquier wrote: > Thanks for the review. I'll look at that pretty soon. Tweaked a bit the comment block added, and committed. Thanks Mark for the input! -- Michael signature.asc Description: PGP signature
Re: checkpointer: PANIC: could not fsync file: No such file or directory
On Tue, Nov 26, 2019 at 5:21 PM Justin Pryzby wrote: > I looked and found a new "hint". > > On Tue, Nov 19, 2019 at 05:57:59AM -0600, Justin Pryzby wrote: > > < 2019-11-15 22:16:07.098 EST >PANIC: could not fsync file > > "base/16491/1731839470.2": No such file or directory > > < 2019-11-15 22:16:08.751 EST >LOG: checkpointer process (PID 27388) was > > terminated by signal 6: Aborted > > An earlier segment of that relation had been opened successfully and was > *still* opened: > > $ sudo grep 1731839470 /var/spool/abrt/ccpp-2019-11-15-22:16:08-27388/open_fds > 63:/var/lib/pgsql/12/data/base/16491/1731839470 > > For context: > > $ sudo grep / /var/spool/abrt/ccpp-2019-11-15-22:16:08-27388/open_fds |tail -3 > 61:/var/lib/pgsql/12/data/base/16491/1757077748 > 62:/var/lib/pgsql/12/data/base/16491/1756223121.2 > 63:/var/lib/pgsql/12/data/base/16491/1731839470 > > So this may be an issue only with relations>segment (but, that interpretation > could also be very naive). FTR I have been trying to reproduce this but failing so far. I'm planning to dig some more in the next couple of days. Yeah, it's a .2 file, which means that it's one that would normally be unlinked after you commit your transaction (unlike a no-suffix file, which would normally be dropped at the next checkpoint after the commit, as our strategy to prevent the relfilenode from being reused before the next checkpoint cycle), but should normally have had a SYNC_FORGET_REQUEST enqueued for it first. So the question is, how did it come to pass that a .2 file was ENOENT but there was no forget request? Diificult, given the definition of mdunlinkfork(). I wondered if something was going wrong in queue compaction or something like that, but I don't see it. I need to dig into the exactly flow with the ALTER case to see if there is something I'm missing there, and perhaps try reproducing it with a tiny segment size to exercise some more multisegment-related code paths.
Re: accounting for memory used for BufFile during hash joins
On Mon, Nov 25, 2019 at 07:11:19PM +0100, Tomas Vondra wrote:
> I'm not planning to do any immediate work on this, so I agree with
> marking it as RWF.

Thanks, I have switched the patch to returned with feedback.
--
Michael

signature.asc
Description: PGP signature
Re: [HACKERS] Block level parallel vacuum
On Mon, Nov 25, 2019 at 9:42 PM Masahiko Sawada wrote: > > On Fri, 22 Nov 2019 at 10:19, Amit Kapila wrote: > > > > On Wed, Nov 20, 2019 at 11:01 AM Masahiko Sawada > > wrote: > > > > > > I've attached the latest version patch set. The patch set includes all > > > discussed points regarding index AM options as well as shared cost > > > balance. Also I added some test cases used all types of index AM. > > > > > > > I have reviewed the first patch and made a number of modifications > > that include adding/modifying comments, made some corrections and > > modifications in the documentation. You can find my changes in > > v33-0001-delta-amit.patch. See, if those look okay to you, if so, > > please include those in the next version of the patch. I am attaching > > both your version of patch and delta changes by me. > > Thank you. > > All changes look good to me. But after changed the 0002 patch the two > macros for parallel vacuum options (VACUUM_OPTIONS_SUPPORT_XXX) is no > longer necessary. So we can remove them and can add if we need them > again. > Sounds reasonable. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Fri, Nov 22, 2019 at 7:38 PM Amit Khandekar wrote: > > On Fri, 22 Nov 2019 at 4:26 PM, Amit Kapila wrote: >> >> I think this is exactly the reason for the problem. In my test [1], >> the error "permission denied" occurred when I second time executed >> pg_logical_slot_get_changes() which means on first execution the >> unlink would have been successful but the files are still not removed >> as they were not closed. Then on second execution, it gets an error >> "Permission denied" when it again tries to unlink files via >> ReorderBufferCleanupSerializedTXNs(). >> >> >> . >> > But what you are seeing is "Permission denied" errors. Not sure why >> > unlink() is failing. >> > >> >> In your test program, if you try to unlink the file second time, you >> should see the error "Permission denied". > > I tested using the sample program and indeed I got the error 5 (access > denied) when I called unlink the second time. > So, what is the next step here? How about if we somehow check whether the file exists before doing unlink, say by using stat? If that doesn't work, I think we might need to go in the direction of tracking file handles in some way, so that they can be closed during an abort. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Ought to use heap_multi_insert() for pg_attribute/depend insertions?
On Sun, Nov 17, 2019 at 12:01:08AM +0100, Daniel Gustafsson wrote: > Fixed by opting for the latter, mostly since it seems best convey what the > function does. - recordMultipleDependencies(depender, - context.addrs->refs, context.addrs->numrefs, - behavior); + recordMultipleDependencies(depender, context.addrs->refs, + context.addrs->numrefs, behavior); Some noise diffs. --- a/src/test/regress/expected/create_index.out +++ b/src/test/regress/expected/create_index.out index concur_reindex_ind4| column c1 of table - index concur_reindex_ind4| column c1 of table index concur_reindex_ind4| column c2 of table This removes a duplicated dependency with indexes using the same column multiple times. Guess that's good to get rid of, this was just unnecessary bloat in pg_depend. The regression tests of contrib/test_decoding are still failing here: +ERROR: could not resolve cmin/cmax of catalog tuple Getting rid the duplication between InsertPgAttributeTuples() and InsertPgAttributeTuple() would be nice. You would basically finish by just using a single slot when inserting one tuple.. -- Michael signature.asc Description: PGP signature
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Tue, 26 Nov 2019 at 10:49, Amit Kapila wrote: > > On Fri, Nov 22, 2019 at 7:38 PM Amit Khandekar wrote: > > > > On Fri, 22 Nov 2019 at 4:26 PM, Amit Kapila wrote: > >> > >> I think this is exactly the reason for the problem. In my test [1], > >> the error "permission denied" occurred when I second time executed > >> pg_logical_slot_get_changes() which means on first execution the > >> unlink would have been successful but the files are still not removed > >> as they were not closed. Then on second execution, it gets an error > >> "Permission denied" when it again tries to unlink files via > >> ReorderBufferCleanupSerializedTXNs(). > >> > >> > >> . > >> > But what you are seeing is "Permission denied" errors. Not sure why > >> > unlink() is failing. > >> > > >> > >> In your test program, if you try to unlink the file second time, you > >> should see the error "Permission denied". > > > > I tested using the sample program and indeed I got the error 5 (access > > denied) when I called unlink the second time. > > > > So, what is the next step here? How about if we somehow check whether > the file exists before doing unlink, say by using stat? But the thing is, the behaviour is so much in a grey area, that we cannot reliably say for instance that when stat() says there is no such file, there is indeed no such file, and if we re-create the same file when it is still open, it is always going to open a new file, etc. > If that doesn't work, I think we might need to go in the direction of tracking > file handles in some way, so that they can be closed during an abort. Yeah, that is one way. I am still working on different approaches. WIll get back with proposals. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: dropdb --force
On Mon, Nov 25, 2019 at 11:22 PM vignesh C wrote: > > On Sun, Nov 24, 2019 at 5:06 PM Pavel Stehule wrote: > > > > > > > > ne 24. 11. 2019 v 11:25 odesílatel vignesh C napsal: > >> > >> On Sat, Nov 23, 2019 at 4:42 PM Amit Kapila > >> wrote: > >> > > >> > On Fri, Nov 22, 2019 at 3:16 PM vignesh C wrote: > >> > > > >> > > Thanks for fixing the comments. The changes looks fine to me. I have > >> > > fixed the first comment, attached patch has the changes for the same. > >> > > > >> > > >> > Few comments: > >> > -- > >> > 1. There is a lot of duplicate code in this test. Basically, almost > >> > the same code is used once to test Drop Database command and dropdb. > >> > I think we can create a few sub-functions to avoid duplicate code, but > >> > do we really need to test the same thing once via command and then by > >> > dropdb utility? I don't think it is required, but even if we want to > >> > do so, then I am not sure if this is the right test script. I suggest > >> > removing the command related test. > >> > > >> > >> Pavel: What is your opinion on this? > > > > > > I have not any problem with this reduction. > > > > We will see in future years what is benefit of this test. > > > > Fixed, removed dropdb utility test. > Hmm, you have done the opposite of what I have asked. This test file is in rc/bin/scripts/t/ where we generally keep the tests for utilities. So, retaining the dropdb utility test in that file seems more sensible. +ok( TestLib::pump_until( + $killme, $psql_timeout, \$killme_stdout, qr/[[:digit:]]+[\r\n]$/m), + 'acquired pid'); How about changing 'acquired pid' to 'acquired pid for SIGTERM'? > >> > > >> > >> I have verified by running perltidy. > >> I think we don't need to run perltidy on the existing code. So, I am not sure if it is a good idea to run it for file 013_crash_restart.pl as it changes some existing code formatting. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Tue, Nov 26, 2019 at 11:19 AM Amit Khandekar wrote: > > On Tue, 26 Nov 2019 at 10:49, Amit Kapila wrote: > > > > > > So, what is the next step here? How about if we somehow check whether > > the file exists before doing unlink, say by using stat? > But the thing is, the behaviour is so much in a grey area, that we > cannot reliably say for instance that when stat() says there is no > such file, there is indeed no such file, > Why so? > and if we re-create the same > file when it is still open, it is always going to open a new file, > etc. > Yeah, or maybe even if we don't create with the same name, there is always be some dangling file which again doesn't sound like a good thing. > > If that doesn't work, I think we might need to go in the direction of > > tracking > > file handles in some way, so that they can be closed during an abort. > Yeah, that is one way. I am still working on different approaches. > WIll get back with proposals. > Fair enough. See, if you can also consider an approach that is local to ReorderBuffer module wherein we can track those handles in ReorderBufferTxn or some other place local to that module. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Implementing Incremental View Maintenance
Hi,

Attached is the latest patch (v8) to add support for Incremental View Maintenance (IVM). This patch adds OUTER join support on top of the patch (v7) submitted last week in the following post.

On Fri, 22 Nov 2019 15:29:45 +0900 (JST) Tatsuo Ishii wrote:
> Up to now, IVM supports materialized views using:
>
> - Inner joins
> - Some aggregate functions (count, sum, min, max, avg)
> - GROUP BY
> - Self joins
>
> With the latest patch now IVM supports subqueries in addition to the above.
>
> Known limitations are listed here:
> https://github.com/sraoss/pgsql-ivm/issues
>
> See more details at:
> https://wiki.postgresql.org/wiki/Incremental_View_Maintenance

* About outer join support:

In the outer-join case, when a table is modified, in addition to the deltas that occur in the inner-join case, we also need to handle deletion or insertion of dangling tuples, that is, null-extended tuples generated when a join condition isn't met.

[Example]
-
-- Create base tables and an outer join view
CREATE TABLE r(i int);
CREATE TABLE s(j int);
INSERT INTO r VALUES (1);
CREATE INCREMENTAL MATERIALIZED VIEW mv
  AS SELECT * FROM r LEFT JOIN s ON r.i=s.j;
SELECT * FROM mv;
 i | j
---+---
 1 |
(1 row)

-- After an insertion to a base table ...
INSERT INTO s VALUES (1);
-- (1,1) is inserted and (1,null) is deleted from the view.
SELECT * FROM mv;
 i | j
---+---
 1 | 1
(1 row)
-

Our implementation is based on the algorithm of Larson & Zhou (2007) [1]. Before view maintenance, the view definition query's jointree is analysed to build a "view maintenance graph". This graph represents which tuples in the view are affected when a base table is modified. Specifically, tuples which are not null-extended on the modified table (that is, tuples generated by joins with the modified table) are directly affected. The delta for such effects is calculated similarly to the inner-join case.

On the other hand, dangling tuples generated by anti-joins with directly affected tuples can be indirectly affected. This means that we may need to delete dangling tuples when tuples are inserted into a table, as well as insert dangling tuples when tuples are deleted from a table.

[1] Efficient Maintenance of Materialized Outer-Join Views (Larson & Zhou, 2007)
https://ieeexplore.ieee.org/document/4221654

Although the original paper assumes that every base table and view has a unique key and that tuple duplicates are disallowed, we allow duplicates. If a view has tuple duplicates, we have to determine how many of each dangling tuple to insert into the view when tuples in a table are deleted. For this purpose, we count the number of each tuple which constitutes a deleted tuple. These counts are stored as a JSONB object in the delta table, and we use this information to maintain outer-join views. Also, we support outer self-joins, which are not assumed in the original paper.

* Restrictions

Currently, we have the following restrictions:

- outer join view's targetlist must contain attributes used in join conditions
- outer join view's targetlist cannot contain non-strict functions
- outer join supports only simple equijoin
- outer join view's WHERE clause cannot contain non null-rejecting predicates
- aggregate is not supported with outer join
- subquery (including EXISTS) is not supported with outer join

Regression tests for all patterns of 3-way outer joins are added.

I also reordered the IVM-related functions in matview.c so that related functions are located close together, and added more comments.
Regards, Yugo Nagata -- Yugo Nagata IVM_v8.patch.gz Description: application/gzip
Re: Implementing Incremental View Maintenance
Note that this is the last patch in the series of IVM patches: we would now like to focus on brushing up the patches, rather than adding new SQL support to IVM, so that the patch can be merged into PostgreSQL 13 (hopefully). Reviews and comments on the patch are very welcome. BTW, the SGML docs in the patch are very poor at this point. I am going to add more descriptions to the docs. > Hi, > > Attached is the latest patch (v8) to add support for Incremental View Maintenance (IVM). This patch adds OUTER join support on top of the patch (v7) submitted last week in the following post. > > On Fri, 22 Nov 2019 15:29:45 +0900 (JST) > Tatsuo Ishii wrote: >> Up to now, IVM supports materialized views using: >> >> - Inner joins >> - Some aggregate functions (count, sum, min, max, avg) >> - GROUP BY >> - Self joins >> >> With the latest patch now IVM supports subqueries in addition to the above. >> >> Known limitations are listed here: >> >> https://github.com/sraoss/pgsql-ivm/issues >> >> See more details at: >> https://wiki.postgresql.org/wiki/Incremental_View_Maintenance > > * About outer join support: > > In the outer-join case, when a table is modified, in addition to the deltas that occur in the inner-join case, we also need to handle deletion or insertion of dangling tuples, that is, null-extended tuples generated when a join condition isn't met. > > [Example] > - > -- Create base tables and an outer join view > CREATE TABLE r(i int); > CREATE TABLE s(j int); > INSERT INTO r VALUES (1); > CREATE INCREMENTAL MATERIALIZED VIEW mv > AS SELECT * FROM r LEFT JOIN s ON r.i=s.j; > SELECT * FROM mv; > i | j > ---+--- > 1 | > (1 row) > > -- After an insertion to a base table ... > INSERT INTO s VALUES (1); > -- (1,1) is inserted and (1,null) is deleted from the view. > SELECT * FROM mv; > i | j > ---+--- > 1 | 1 > (1 row) > - > > Our implementation is based on the algorithm of Larson & Zhou (2007) [1]. Before view maintenance, the view definition query's jointree is analysed to build a "view maintenance graph". This graph represents which tuples in the view are affected when a base table is modified. Specifically, tuples which are not null-extended on the modified table (that is, tuples generated by joins with the modified table) are directly affected. The delta for such effects is calculated similarly to the inner-join case. > > On the other hand, dangling tuples generated by anti-joins with directly affected tuples can be indirectly affected. This means that we may need to delete dangling tuples when tuples are inserted into a table, as well as insert dangling tuples when tuples are deleted from a table. > > [1] Efficient Maintenance of Materialized Outer-Join Views (Larson & Zhou, 2007) > https://ieeexplore.ieee.org/document/4221654 > > Although the original paper assumes that every base table and view has a unique key and that tuple duplicates are disallowed, we allow duplicates. If a view has tuple duplicates, we have to determine how many of each dangling tuple to insert into the view when tuples in a table are deleted. For this purpose, we count the number of each tuple which constitutes a deleted tuple. These counts are stored as a JSONB object in the delta table, and we use this information to maintain outer-join views. Also, we support outer self-joins, which are not assumed in the original paper.
> > * Restrictions > > Currently, we have the following restrictions: > > - outer join view's targetlist must contain attributes used in join conditions > - outer join view's targetlist cannot contain non-strict functions > - outer join supports only simple equijoin > - outer join view's WHERE clause cannot contain non null-rejecting predicates > - aggregate is not supported with outer join > - subquery (including EXISTS) is not supported with outer join > > Regression tests for all patterns of 3-way outer joins are added. > > I also reordered the IVM-related functions in matview.c so that related functions are located close together, and added more comments. > > Regards, > Yugo Nagata > > -- > Yugo Nagata