Re: Record a Bitmapset of non-pruned partitions

2021-07-30 Thread Amit Langote
On Mon, Jul 12, 2021 at 11:47 AM David Rowley  wrote:
> v3 patches

0001 looks mostly fine, except I thought the following could be worded
to say that the bitmap members are offsets into the part_rels array.
To avoid someone confusing them with RT indexes, for example.

+   Bitmapset  *live_parts; /* Bitmap with members to indicate which
+* partitions survived partition pruning. */
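
For example (illustrative wording only, not the patch's actual text), something
along these lines:

    Bitmapset  *live_parts; /* Members are offsets into the part_rels array
                             * (not RT indexes), identifying partitions that
                             * survived partition pruning. */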

On 0002:

interleaved_parts idea looks clever.  I wonder if you decided that
it's maybe not worth setting that field in the joinrel's
PartitionBoundInfos?  For example, adding the code that you added in
create_list_bounds() also in merge_list_bounds().

...  The definition of interleaved
+ * is any partition that can contain multiple different values where exists at
+ * least one other partition which could contain a value which is between the
+ * multiple possible values in the other partition.

The sentence sounds a bit off starting at "...where exists".  How about:

"A partition is considered interleaved if it contains multiple values
such that there exists at least one other partition containing a value
that lies between those values [ in terms of partitioning-defined
ordering ]."

Looks fine otherwise.


--
Amit Langote
EDB: http://www.enterprisedb.com




Re: A qsort template

2021-07-30 Thread Peter Geoghegan
On Fri, Jul 30, 2021 at 3:34 AM John Naylor
 wrote:
> I'm also attaching your tuplesort patch so others can see what exactly I'm 
> comparing.

If you're going to specialize the sort routine for unsigned integer
style abbreviated keys then you might as well cover all relevant
opclasses/types. Almost all abbreviated key schemes produce
conditioned datums that are designed to use a simple 3-way unsigned
int comparator. It's not just text. (Actually, the only abbreviated key
scheme that doesn't do it that way is numeric.)

Offhand I know that UUID, macaddr, and inet all have abbreviated keys
that can use your new ssup_datum_binary_cmp() comparator instead of
their own duplicated comparator (which will make them use the
corresponding specialized sort routine inside tuplesort.c).
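
To illustrate the pattern being described, here is a minimal sketch (function
name invented here, not code from the patch) of the common shape of such a
3-way unsigned comparator over packed abbreviated-key bits:

    #include <stdint.h>

    /* Compare two abbreviated keys packed into unsigned 64-bit values. */
    static int
    abbrev_key_cmp(uint64_t x, uint64_t y)
    {
        if (x < y)
            return -1;
        if (x > y)
            return 1;
        return 0;
    }

Any opclass whose abbreviated keys follow this shape could share one such
routine and, with it, one specialized sort.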

-- 
Peter Geoghegan




Re: Problem about postponing gathering partial paths for topmost scan/join rel

2021-07-30 Thread Richard Guo
On Wed, Jul 28, 2021 at 3:42 PM Richard Guo  wrote:

> To fix this problem, I'm thinking we can leverage 'root->all_baserels'
> to tell if we are at the topmost scan/join rel, something like:
>
> --- a/src/backend/optimizer/path/allpaths.c
> +++ b/src/backend/optimizer/path/allpaths.c
> @@ -3041,7 +3041,7 @@ standard_join_search(PlannerInfo *root, int
> levels_needed, List *initial_rels)
>  * partial paths.  We'll do the same for the
> topmost scan/join rel
>  * once we know the final targetlist (see
> grouping_planner).
>  */
> -   if (lev < levels_needed)
> +   if (!bms_equal(rel->relids, root->all_baserels))
> generate_useful_gather_paths(root, rel,
> false);
>
>
> Any thoughts?
>

Attached is a patch with the fix described upthread. I would appreciate
any comments on this topic.

Thanks
Richard


v1-0001-Gather-partial-paths-for-subproblem-s-topmost-sca.patch
Description: Binary data


Re: alter table set TABLE ACCESS METHOD

2021-07-30 Thread Michael Paquier
On Thu, Jul 29, 2021 at 08:55:21AM -0700, Jeff Davis wrote:
> I see that ATExecSetTableSpace() also invokes the hook even for a no-
> op. Should we do the same thing for setting the AM?

Looking at the past, it was the intention of 05f3f9c7 to go through
the hook even if SET TABLESPACE does not move the relation, so you are
right that ALTER TABLE is inconsistent in not doing the same for LOGGED,
UNLOGGED and ACCESS METHOD if all of them do nothing to trigger a
relation rewrite.

Now, I am a bit biased about this change and if we actually need it
for the no-op path.  If we were to do that, I think that we would need to
add to AlteredTableInfo a way to track whether any of those
subcommands have been used, so that the rewrite == 0 case could invoke
the hook even if they are no-ops.  And I am not sure if that's worth
the code complication for an edge case.  We definitely should have a
hook call for the case of rewrite > 0, though.
--
Michael


signature.asc
Description: PGP signature


Division by zero error in to_char(num, '9.9EEEE')

2021-07-30 Thread Dean Rasheed
While testing the numeric_power() patch in [1], I found this problem
trying to use to_char() to format very small numbers:

SELECT to_char(1.2e-1001, '9.9EEEE'); -- OK
  to_char
-----------
  1.2e-1001

SELECT to_char(1.2e-1002, '9.9EEEE'); -- fails
ERROR:  division by zero

It turns out that the problem is in get_str_from_var_sci() which
attempts to divide the input by 1e-1002 to get the significand.
However, it is using power_var_int() to compute 1e-1002, which has a
maximum rscale of NUMERIC_MAX_DISPLAY_SCALE (1000), so it returns 0,
which is the correct answer to that scale, and then
get_str_from_var_sci() attempts to divide by that.

Rather than messing with power_var_int(), I think the simplest
solution is to just introduce a new local function, as in the attached
patch. This directly constructs 10^n, for integer n, which is pretty
trivial, and doesn't need any numeric multiplication or rounding.

Regards,
Dean

[1] 
https://www.postgresql.org/message-id/CAEZATCUWUV_BP41Ob7QY12oF%2BqDxjTWfDpkdkcOOuojrDvOLxw%40mail.gmail.com
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
new file mode 100644
index faff09f..e05e5d7
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -428,16 +428,6 @@ static const NumericDigit const_two_data
 static const NumericVar const_two =
 {1, 0, NUMERIC_POS, 0, NULL, (NumericDigit *) const_two_data};
 
-#if DEC_DIGITS == 4 || DEC_DIGITS == 2
-static const NumericDigit const_ten_data[1] = {10};
-static const NumericVar const_ten =
-{1, 0, NUMERIC_POS, 0, NULL, (NumericDigit *) const_ten_data};
-#elif DEC_DIGITS == 1
-static const NumericDigit const_ten_data[1] = {1};
-static const NumericVar const_ten =
-{1, 1, NUMERIC_POS, 0, NULL, (NumericDigit *) const_ten_data};
-#endif
-
 #if DEC_DIGITS == 4
 static const NumericDigit const_zero_point_nine_data[1] = {9000};
 #elif DEC_DIGITS == 2
@@ -582,6 +572,7 @@ static void power_var(const NumericVar *
 	  NumericVar *result);
 static void power_var_int(const NumericVar *base, int exp, NumericVar *result,
 		  int rscale);
+static void power_ten_int(int exp, NumericVar *result);
 
 static int	cmp_abs(const NumericVar *var1, const NumericVar *var2);
 static int	cmp_abs_common(const NumericDigit *var1digits, int var1ndigits,
@@ -7213,9 +7204,7 @@ static char *
 get_str_from_var_sci(const NumericVar *var, int rscale)
 {
 	int32		exponent;
-	NumericVar	denominator;
-	NumericVar	significand;
-	int			denom_scale;
+	NumericVar	tmp_var;
 	size_t		len;
 	char	   *str;
 	char	   *sig_out;
@@ -7252,25 +7241,16 @@ get_str_from_var_sci(const NumericVar *v
 	}
 
 	/*
-	 * The denominator is set to 10 raised to the power of the exponent.
-	 *
-	 * We then divide var by the denominator to get the significand, rounding
-	 * to rscale decimal digits in the process.
+	 * Divide var by 10^exponent to get the significand, rounding to rscale
+	 * decimal digits in the process.
 	 */
-	if (exponent < 0)
-		denom_scale = -exponent;
-	else
-		denom_scale = 0;
-
-	init_var(&denominator);
-	init_var(&significand);
+	init_var(&tmp_var);
 
-	power_var_int(&const_ten, exponent, &denominator, denom_scale);
-	div_var(var, &denominator, &significand, rscale, true);
-	sig_out = get_str_from_var(&significand);
+	power_ten_int(exponent, &tmp_var);
+	div_var(var, &tmp_var, &tmp_var, rscale, true);
+	sig_out = get_str_from_var(&tmp_var);
 
-	free_var(&denominator);
-	free_var(&significand);
+	free_var(&tmp_var);
 
 	/*
 	 * Allocate space for the result.
@@ -10468,6 +10448,34 @@ power_var_int(const NumericVar *base, in
 		round_var(result, rscale);
 }
 
+/*
+ * power_ten_int() -
+ *
+ *	Raise ten to the power of exp, where exp is an integer.  Note that unlike
+ *	power_var_int(), this does no overflow/underflow checking or rounding.
+ */
+static void
+power_ten_int(int exp, NumericVar *result)
+{
+	/* Construct the result directly, starting from 10^0 = 1 */
+	set_var_from_var(&const_one, result);
+
+	/* Scale needed to represent the result exactly */
+	result->dscale = exp < 0 ? -exp : 0;
+
+	/* Base-NBASE weight of result and remaining exponent */
+	if (exp >= 0)
+		result->weight = exp / DEC_DIGITS;
+	else
+		result->weight = (exp + 1) / DEC_DIGITS - 1;
+
+	exp -= result->weight * DEC_DIGITS;
+
+	/* Final adjustment of the result's single NBASE digit */
+	while (exp-- > 0)
+		result->digits[0] *= 10;
+}
+
 
 /* --
  *
diff --git a/src/test/regress/expected/numeric.out b/src/test/regress/expected/numeric.out
new file mode 100644
index cc11995..8f0b40a
--- a/src/test/regress/expected/numeric.out
+++ b/src/test/regress/expected/numeric.out
@@ -1794,6 +1794,38 @@ FROM v;
 NaN |  #.### |  #.### |  #.###
 (7 rows)
 
+WITH v(exp) AS
+  (VALUES(-16379),(-16378),(-1234),(-789),(-45),(-5),(-4),(-3),(-2),(-1),(0),
+ (1),(2),(3),(4),(5),(38),(275),(2345),(45678),(131070),(131071))
+SELECT exp,
+  to_char(('1.2345e'||exp)::nu

Re: Replace l337sp34k in comments.

2021-07-30 Thread Geoff Winkless
On Thu, 29 Jul 2021 at 22:46, Gavin Flower
 wrote:
> Though in code, possibly it would be better to just use 'up-to-date' in
> code for consistency and to make it easier to grep?

If it's causing an issue, perhaps using a less syntactically
problematic synonym like "current" might be better?

:)

Geoff




Re: Case expression pushdown

2021-07-30 Thread Alexander Pyhalov

Tom Lane wrote on 2021-07-29 23:54:

Alexander Pyhalov  writes:

[ 0001-Allow-pushing-CASE-expression-to-foreign-server-v7.patch ]


I looked this over.  It's better than before, but the collation
handling is still not at all correct.  We have to consider that a
CASE's arg expression supplies the collation for a contained
CaseTestExpr, otherwise we'll come to the wrong conclusions about
whether "CASE foreignvar WHEN ..." is shippable, if the foreignvar
is what's determining collation of the comparisons.

This means that the CaseExpr level of recursion has to pass data down
to the CaseTestExpr level.  In the attached, I did that by adding an
additional argument to foreign_expr_walker().  That's a bit invasive,
but it's not awful.  I thought about instead adding fields to the
foreign_loc_cxt struct.  But that seemed considerably messier in the
end, because we'd then have some fields that are information sourced
at one recursion level and some that are info sourced at another
level.

I also whacked the regression test cases around a lot.  They seemed
to spend a lot of time on irrelevant combinations, while failing to
check the things that matter, namely whether collation-based pushdown
decisions are made correctly.

regards, tom lane


Hi.

Overall looks good.
The only thing I'm confused about is the T_CaseTestExpr case -- how can it
be that the CaseTestExpr collation doesn't match case_arg_cxt->collation?
Do we need to inspect only case_arg_cxt->state? Can we assert that
collation == case_arg_cxt->collation?


--
Best regards,
Alexander Pyhalov,
Postgres Professional




RE: [HACKERS] logical decoding of two-phase transactions

2021-07-30 Thread tanghy.f...@fujitsu.com
On Friday, July 30, 2021 12:02 PM Peter Smith wrote:
> 
> Please find attached the latest patch set v100*
> 
> v99-0002 --> v100-0001
> 

Thanks for your patch. A few comments on the test file:

1. src/test/subscription/t/022_twophase_cascade.pl

1.1
I saw your test cases for "PREPARE / COMMIT PREPARED" and "PREPARE with a 
nested ROLLBACK TO SAVEPOINT", but didn't see cases for "PREPARE / ROLLBACK 
PREPARED". Is it needless or just missing?

1.2
+# check inserts are visible at subscriber(s).
+# All the streamed data (prior to the SAVEPOINT) should be rolled back.
+# (3, 'foobar') should be committed.

I think it should be (, 'foobar') here.

1.3
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where 
b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+

It seems the test is not finished yet. We didn't check the value of 'result'. 
Besides, maybe we should also check node_C, right?

1.4
+$node_B->append_conf('postgresql.conf',qq(max_prepared_transactions = 
10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));

You see, the first line uses a TAB but the second line uses a space.
Also, we could use only one statement to append these two settings to run tests 
a bit faster. Thoughts?
Something like:

$node_B->append_conf(
'postgresql.conf', qq(
max_prepared_transactions = 10
logical_decoding_work_mem = 64kB
));

Regards
Tang


Re: Background writer and checkpointer in crash recovery

2021-07-30 Thread Aleksander Alekseev
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:        tested, passed

v2-0001 and v2-0002 look fine, but I don't much like the idea of introducing a 
new GUC in v2-0003. It's for very specific needs, which most of the users, I 
believe, don't care about. I suggest dealing with v2-0001 and v2-0002 first and 
then maybe submit and discuss v2-0003 as a separate CF entry.

Tested on MacOS against master `1ec7fca8`.

The new status of this patch is: Ready for Committer


Segment fault when executing SPI function on PG with commit 41c6a5be

2021-07-30 Thread liuhuail...@fujitsu.com
Hi, all

When I used the SPI_execute_plan function on PG12 with commit 41c6a5be applied,
a segmentation fault occurred.

PS: If commit 41c6a5be is not applied, this problem does not happen.

Reproduce:
In a background process, the following steps are executed.
--
StartTransactionCommand();
SPI_connect();   
plan = SPI_prepare(query,0,NULL); ★the query is a SELECT SQL. 
SPI_keepplan(plan); 
SPI_finish();   
CommitTransactionCommand();   
StartTransactionCommand();
SPI_connect();
SPI_execute_plan(plan, NULL, NULL, true, 0); ★Segment fault
--

Core stack:
Stack trace of thread 43926:
#0  0x00772f19 EnsurePortalSnapshotExists (postgres)
#1  0x0064f85c _SPI_execute_plan (postgres)
#2  0x0064fbd1 SPI_execute_plan (postgres)
#3  0x7fbee784402e xxx (xxx.so)
#4  0x7fbee78424ae  (.so)
#5  0x006e91f5 StartBackgroundWorker (postgres)
#6  0x006f5e25 maybe_start_bgworkers (postgres)
#7  0x006f6121 reaper (postgres)
#8  0x7fbeedf48dc0 __restore_rt (libpthread.so.0)
#9  0x7fbeebadf75b __select (libc.so.6)
#10 0x006f6ea6 ServerLoop (postgres)
#11 0x006f8c30 PostmasterMain (postgres)
#12 0x00485d99 main (postgres)
#13 0x7fbeeba0f873 __libc_start_main (libc.so.6)
#14 0x00485e3e _start (postgres)

Is there a bug in the commit 41c6a5be?  Please confirm it.

Regards, Liu Huailing



Re: Slightly improve initdb --sync-only option's help message

2021-07-30 Thread Daniel Gustafsson
> On 30 Jul 2021, at 01:37, Bossart, Nathan  wrote:
> 
> On 7/29/21, 2:11 AM, "Daniel Gustafsson"  wrote:
>> I think removing the word "only" here is a net loss, since it IMO clearly
>> conveyed that this option isn't doing any of the other initdb duties.  The
>> documentation states it in more words, but the help output must be brief so I
>> think "only" is a good keyword.
> 
> I've attached a new version of the patch in which I've attempted to
> address both sets of feedback.

LGTM.  I took the liberty of rephrasing the "and exit" part of the initdb help
output to match the other exiting options in there.  Barring objections, I think
this is ready.

--
Daniel Gustafsson   https://vmware.com/



v4-0001-Clarify-initdb-sync-only-help-message-and-docs.patch
Description: Binary data


Re: amcheck verification for GiST and GIN

2021-07-30 Thread Heikki Linnakangas

On 29/07/2021 21:34, Nikolay Samokhvalov wrote:
> I was trying to check a bunch of GINs on some production after switching
> from Ubuntu 16.04 to 18.04 and got many errors. So decided to check for
> 16.04 first (that is still used on prod for that DB), without any
> OS/glibc changes.
>
> On 16.04, I still saw errors and it was not really expected because this
> should mean that production is corrupted too. So, REINDEX should fix it.
> But it didn't -- see output below. I cannot give data and am thinking how
> to create a synthetic demo of this. Any suggestions?
>
> And is this a sign that the tool is wrong rather than that we have real
> corruption cases? (I assume if we did, we would see no errors after
> REINDEXing -- of course, if GIN itself doesn't have bugs.)
>
> Env: Ubuntu 16.04 (so, glibc 2.27), Postgres 12.7, patch from Heikki
> slightly adjusted to work with PG12
> (https://gitlab.com/postgres/postgres/-/merge_requests/5); snippet used
> to run amcheck:
> https://gitlab.com/-/snippets/2001962 (see file #3)


Almost certainly the tool is wrong. We went back and forth a few times 
with Pawel, fixing various bugs in the amcheck patch at this thread: 
https://www.postgresql.org/message-id/9fdbb584-1e10-6a55-ecc2-9ba8b5dca1cf%40iki.fi. 
Can you try again with the latest patch version from that thread, 
please? That's v5-0001-Amcheck-for-GIN-13stable.patch.


- Heikki




Re: [Patch] ALTER SYSTEM READ ONLY

2021-07-30 Thread Prabhat Sahu
Hi,

On Thu, Jul 29, 2021 at 9:46 PM Robert Haas  wrote:

> On Wed, Jul 28, 2021 at 7:33 AM Amul Sul  wrote:
> > I was too worried about how I could miss that & after thinking more
> > about that, I realized that the operation for ArchiveRecoveryRequested
> > is never going to be skipped in the startup process and that never
> > left for the checkpoint process to do that later. That is the reason
> > that assert was added there.
> >
> > When ArchiveRecoveryRequested, the server will no longer be in
> > the wal prohibited mode, we implicitly change the state to
> > wal-permitted. Here is the snip from the 0003 patch:
>
> Ugh, OK. That makes sense, but I'm still not sure that I like it. I've
> kind of been wondering: why not have XLogAcceptWrites() be the
> responsibility of the checkpointer all the time, in every case? That
> would require fixing some more things, and this is one of them, but
> then it would be consistent, which means that any bugs would be likely
> to get found and fixed. If calling XLogAcceptWrites() from the
> checkpointer is some funny case that only happens when the system
> crashes while WAL is prohibited, then we might fail to notice that we
> have a bug.
>
> This is especially true given that we have very little test coverage
> in this area. Andres was ranting to me about this earlier this week,
> and I wasn't sure he was right, but then I noticed that we have
> exactly zero tests in the entire source tree that make use of
> recovery_end_command. We really need a TAP test for that, I think.
> It's too scary to do much reorganization of the code without having
> any tests at all for the stuff we're moving around. Likewise, we're
> going to need TAP tests for the stuff that is specific to this patch.
> For example, we should have a test that crashes the server while it's
> read only, brings it back up, checks that we still can't write WAL,
> then re-enables WAL, and checks that we now can write WAL. There are
> probably a bunch of other things that we should test, too.
>

Hi,

I have been testing “ALTER SYSTEM READ ONLY” and wrote a few TAP test cases
for this feature.
Please find the test cases (draft version) attached herewith, to be applied
on top of the v30 patch by Amul.
Kindly review them and let me know the required changes.
-- 

With Regards,
Prabhat Kumar Sahu
EnterpriseDB: http://www.enterprisedb.com


prohibitwal-tap-test.patch
Description: Binary data


Re: pg_upgrade does not upgrade pg_stat_statements properly

2021-07-30 Thread Dave Cramer
On Thu, 29 Jul 2021 at 22:03, Julien Rouhaud  wrote:

> On Thu, Jul 29, 2021 at 02:14:56PM -0400, Jan Wieck wrote:
> >
> > I presume that pg_upgrade on a database with that extension installed
> would
> > silently succeed and have the pg_catalog as well as public (or wherever)
> > version of that function present.
>
> I'll have to run a pg_upgrade with it to be 100% sure, but given that this
> is a
> plpgsql function and since the created function is part of the extension
> dependencies (and looking at pg_dump source code for binary-upgrade mode),
> I'm
> almost certain that the upgraded cluster would have the pg96- version of
> the
> function even if upgrading to pg9.6+.
>
> Note that in that case the extension would appear to work normally, but the
> only way to simulate missing_ok = true is to add a BEGIN/EXCEPTION block.
>
> Since this wrapper function is extensively used, it seems quite possible to
> lead to overflowing the snapshot subxip array, as the extension basically
> runs
> every x minutes many functions in a single transaction to retrieve many
> performance metrics.  This can ruin the performance.
>
> This was an acceptable trade off for people still using pg96- in 2021, but
> would be silly to have on more recent versions.
>
> Unfortunately I don't see any easy way to avoid that, as there isn't any
> guarantee that a new version will be available after the upgrade.  AFAICT
> the
> only way to ensure that the correct version of the function is present
> from an
> extension point of view would be to add a dedicated function to overwrite
> any
> object that depends on the servers version and document the need to call
> that
> after a pg_upgrade.
>

What would happen if subsequent to the upgrade "ALTER EXTENSION UPGRADE"
was executed ?


Re: Segment fault when executing SPI function on PG with commit 41c6a5be

2021-07-30 Thread Greg Nancarrow
On Fri, Jul 30, 2021 at 6:57 PM liuhuail...@fujitsu.com
 wrote:
>
> When I used SPI_execute_plan function on PG12 which commit 41c6a5be is used,
> Segment fault occurred.

I'm not seeing any such commit.
Can you provide a link to the commit in GitHub?

Regards,
Greg Nancarrow
Fujitsu Australia




Re: pg_upgrade does not upgrade pg_stat_statements properly

2021-07-30 Thread Julien Rouhaud
On Fri, Jul 30, 2021 at 06:03:50AM -0400, Dave Cramer wrote:
> 
> What would happen if subsequent to the upgrade "ALTER EXTENSION UPGRADE"
> was executed ?

If the extension was already up to date on the source cluster then obviously
nothing.

Otherwise, the extension would be updated.  But unless I'm willing (and
remember) to copy/paste until the end of time an anonymous block code that
checks the current server version to see if the wrapper function needs to be
overwritten then nothing will happen either as far as this function is
concerned.




Re: pg_upgrade does not upgrade pg_stat_statements properly

2021-07-30 Thread Dave Cramer
On Fri, 30 Jul 2021 at 06:39, Julien Rouhaud  wrote:

> On Fri, Jul 30, 2021 at 06:03:50AM -0400, Dave Cramer wrote:
> >
> > What would happen if subsequent to the upgrade "ALTER EXTENSION UPGRADE"
> > was executed ?
>
> If the extension was already up to date on the source cluster then
> obviously
> nothing.
>
> Otherwise, the extension would be updated.  But unless I'm willing (and
> remember) to copy/paste until the end of time an anonymous block code that
> checks the current server version to see if the wrapper function needs to
> be
> overwritten then nothing will happen either as far as this function is
> concerned.
>

Well I think that's on the extension author to fix. There's only so much
pg_upgrade can do here.
It seems reasonable that upgrading the extension should upgrade the
extension to the latest version.

Dave


Re: pg_dump new feature: exporting functions only. Bad or good idea ?

2021-07-30 Thread Lætitia Avrot
>
>
> > On Fri, Jul 9, 2021 at 4:43 PM Tomas Vondra <
> tomas.von...@enterprisedb.com> wrote:
> >
> > The main question I have is whether this should include procedures. I'd
> > probably argue procedures should be considered different from functions
> > (i.e. requiring a separate --procedures-only option), because it pretty
> > much is meant to be a separate object type. We don't allow calling DROP
> > FUNCTION on a procedure, etc. It'd be silly to introduce an unnecessary
> > ambiguity in pg_dump and have to deal with it sometime later.
>
>
I respectfully disagree. In psql, the `\ef` and `\df` metacommands will
also list procedures, not just functions. So at one point we agreed to
consider for this client that functions were close enough to procedures to
use a simple metacommand to list/display without distinction. Why should it
be different for `pg_dump` ?

Have a nice day,

Lætitia


Re: pg_upgrade does not upgrade pg_stat_statements properly

2021-07-30 Thread Julien Rouhaud
On Fri, Jul 30, 2021 at 06:48:34AM -0400, Dave Cramer wrote:
> 
> Well I think that's on the extension author to fix. There's only so much
> pg_upgrade can do here.
> It seems reasonable that upgrading the extension should upgrade the
> extension to the latest version.

That would only work iff the extension was *not* up to date on the original
instance.  Otherwise I fail to see how any extension script will be called at
all, and therefore the extension has no way to fix anything.




Re: pg_upgrade does not upgrade pg_stat_statements properly

2021-07-30 Thread Dave Cramer
On Fri, 30 Jul 2021 at 07:07, Julien Rouhaud  wrote:

> On Fri, Jul 30, 2021 at 06:48:34AM -0400, Dave Cramer wrote:
> >
> > Well I think that's on the extension author to fix. There's only so much
> > pg_upgrade can do here.
> > It seems reasonable that upgrading the extension should upgrade the
> > extension to the latest version.
>
> That would only work iff the extension was *not* up to date on the original
> instance.  Otherwise I fail to see how any extension script will be called
> at
> all, and therefore the extension was no way to fix anything.
>

So my understanding is that upgrade is going to run all of the SQL files
from whatever version the original instance was up to the current version.

I'm at a loss as to how this would not work ? How do you upgrade your
extension otherwise ?

Dave


Re: pg_upgrade does not upgrade pg_stat_statements properly

2021-07-30 Thread Julien Rouhaud
On Fri, Jul 30, 2021 at 07:18:56AM -0400, Dave Cramer wrote:
> 
> So my understanding is that upgrade is going to run all of the SQL files
> from whatever version the original instance was up to the current version.
> 
> I'm at a loss as to how this would not work ? How do you upgrade your
> extension otherwise ?

Yes, but as I said twice only if the currently installed version is different
from the default version.  Otherwise ALTER EXTENSION UPGRADE is a no-op.

Just to be clear: I'm not arguing against automatically doing an ALTER
EXTENSION UPGRADE for all extensions in all databases during pg_upgrade (I'm
all for it), just that this specific corner case can't be solved by that
approach.




Re: pg_upgrade does not upgrade pg_stat_statements properly

2021-07-30 Thread Dave Cramer
>
> Yes, but as I said twice only if the currently installed version is
> different
> from the default version.  Otherwise ALTER EXTENSION UPGRADE is a no-op.
>

Ah, ok, got it, thanks. Well I'm not sure how to deal with this.
The only thing I can think of is that we add a post_upgrade function to the
API.

All extensions would have to implement this.

Dave


Re: pg_upgrade does not upgrade pg_stat_statements properly

2021-07-30 Thread Julien Rouhaud
On Fri, Jul 30, 2021 at 07:33:55AM -0400, Dave Cramer wrote:
> 
> Ah, ok, got it, thanks. Well I'm not sure how to deal with this.
> The only thing I can think of is that we add a post_upgrade function to the
> API.
> 
> All extensions would have to implement this.

It seems like a really big hammer for a niche usage.  As far as I know I'm the
only one who wrote an extension that can create different objects depending on
the server version, so I'm entirely fine with dealing with that problem in my
extension rather than forcing everyone to implement an otherwise useless API.

Now if that API can be useful for other cases or if there are other extensions
with similar problems, that would be a different story.




Re: A qsort template

2021-07-30 Thread Ranier Vilela
On Thu, Jul 29, 2021 at 21:34, John Naylor <
john.nay...@enterprisedb.com> wrote:

>
> On Sun, Jul 4, 2021 at 12:27 AM Thomas Munro 
> wrote:
> >
> > Since you are experimenting with tuplesort and likely thinking similar
> > thoughts, here's a patch I've been using to explore that area.  I've
> > seen it get, for example, ~1.18x speedup for simple index builds in
> > favourable winds (YMMV, early hacking results only).  Currently, it
> > kicks in when the leading column is of type int4, int8, timestamp,
> > timestamptz, date or text + friends (when abbreviatable, currently
> > that means "C" and ICU collations only), while increasing the
> > executable by only 8.5kB (Clang, amd64, -O2, no debug).
> >
> > These types are handled with just three specialisations.  Their custom
> > "fast" comparators all boiled down to comparisons of datum bits,
> > varying only in signedness and width, so I tried throwing them away
> > and using 3 new common routines.  Then I extended
> > tuplesort_sort_memtuples()'s pre-existing specialisation dispatch to
> > recognise qualifying users of those and select 3 corresponding sort
> > specialisations.
>
> I got around to getting a benchmark together to serve as a starting point.
> I based it off something I got from the archives, but don't remember where
> (I seem to remember Tomas Vondra wrote the original, but not sure). To
> start I just used types that were there already -- int, text, numeric. The
> latter two won't be helped by this patch, but I wanted to keep something
> like that so we can see what kind of noise variation there is. I'll
> probably cut text out in the future and just keep numeric for that purpose.
>
> I've attached both the script and a crude spreadsheet. I'll try to figure
> out something nicer for future tests, and maybe some graphs. The
> "comparison" sheet has the results side by side (min of five). There are 6
> distributions of values:
> - random
> - sorted
> - "almost sorted"
> - reversed
> - organ pipe (first half ascending, second half descending)
> - rotated (sorted but then put the smallest at the end)
> - random 0s/1s
>
> I included both "select a" and "select *" to make sure we have the recent
> datum sort optimization represented. The results look pretty good for ints
> -- about the same speed up master gets going from tuple sorts to datum
> sorts, and those got faster in turn also.
>
> Next I think I'll run microbenchmarks on int64s with the test harness you
> attached earlier, and experiment with the qsort parameters a bit.
>
> I'm also attaching your tuplesort patch so others can see what exactly I'm
> comparing.
>
The attached patch does not apply cleanly;
can you fix it, please?

error: patch failed: src/backend/utils/sort/tuplesort.c:4776
error: src/backend/utils/sort/tuplesort.c: patch does not apply

regards,
Ranier Vilela


Re: Segment fault when executing SPI function on PG with commit 41c6a5be

2021-07-30 Thread Zhihong Yu
On Fri, Jul 30, 2021 at 3:30 AM Greg Nancarrow  wrote:

> On Fri, Jul 30, 2021 at 6:57 PM liuhuail...@fujitsu.com
>  wrote:
> >
> > When I used SPI_execute_plan function on PG12 which commit 41c6a5be is
> used,
> > Segment fault occurred.
>
> I'm not seeing any such commit.
> Can you provide a link to the commit in GitHub?
>
> Regards,
> Greg Nancarrow
> Fujitsu Australia
>
>
> Hi,
I think Huailing was referring to:

commit 41c6a5bec25e720d98bd60d77dd5c2939189ed3c
Author: Tom Lane 
Date:   Fri May 21 14:03:53 2021 -0400

Restore the portal-level snapshot after procedure COMMIT/ROLLBACK.

Cheers


Re: Segment fault when executing SPI function on PG with commit 41c6a5be

2021-07-30 Thread Ranier Vilela
On Fri, Jul 30, 2021 at 05:57, liuhuail...@fujitsu.com <
liuhuail...@fujitsu.com> wrote:

> Hi, all
>
> When I used SPI_execute_plan function on PG12 which commit 41c6a5be is
> used,
> Segment fault occurred.
>
> PS: If commit 41c6a5be is not used, this phenomenon will not happen.
>
> Reproduce:
> In a background process, the following steps are executed.
> --
> StartTransactionCommand();
> SPI_connect();
> plan = SPI_prepare(query,0,NULL); ★the query is a SELECT SQL.
> SPI_keepplan(plan);
> SPI_finish();
> CommitTransactionCommand();
> StartTransactionCommand();
> SPI_connect();
> SPI_execute_plan(plan, NULL, NULL, true, 0); ★Segment fault
> --
>
> Core stack:
> Stack trace of thread 43926:
> #0  0x00772f19 EnsurePortalSnapshotExists
> (postgres)
> #1  0x0064f85c _SPI_execute_plan (postgres)
> #2  0x0064fbd1 SPI_execute_plan (postgres)
> #3  0x7fbee784402e xxx (xxx.so)
> #4  0x7fbee78424ae  (.so)
> #5  0x006e91f5 StartBackgroundWorker (postgres)
> #6  0x006f5e25 maybe_start_bgworkers (postgres)
> #7  0x006f6121 reaper (postgres)
> #8  0x7fbeedf48dc0 __restore_rt (libpthread.so.0)
> #9  0x7fbeebadf75b __select (libc.so.6)
> #10 0x006f6ea6 ServerLoop (postgres)
> #11 0x006f8c30 PostmasterMain (postgres)
> #12 0x00485d99 main (postgres)
> #13 0x7fbeeba0f873 __libc_start_main (libc.so.6)
> #14 0x00485e3e _start (postgres)
>
> Is there a bug in the commit 41c6a5be?  Please confirm it.
>
Did you test it on the HEAD too?

regards,
Ranier Vilela


Re: Segment fault when executing SPI function on PG with commit 41c6a5be

2021-07-30 Thread Michael Paquier
On Fri, Jul 30, 2021 at 08:57:35AM +, liuhuail...@fujitsu.com wrote:
> When I used SPI_execute_plan function on PG12 which commit 41c6a5be is used,
> Segment fault occurred.
> 
> PS: If commit 41c6a5be is not used, this phenomenon will not happen. 

I see nothing wrong in this commit, FWIW.

> Reproduce:
> In a background process, the following steps are executed.
> --
> StartTransactionCommand();
> SPI_connect();   
> plan = SPI_prepare(query,0,NULL); ★the query is a SELECT SQL. 
> SPI_keepplan(plan); 
> SPI_finish();   
> CommitTransactionCommand();   
> StartTransactionCommand();
> SPI_connect();
> SPI_execute_plan(plan, NULL, NULL, true, 0); ★Segment fault
> --

But I see an issue with your code.  It seems to me that you should
push a snapshot before doing SPI_prepare() and SPI_execute_plan(),
as of:
PushActiveSnapshot(GetTransactionSnapshot()); 
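
For illustration, a hedged sketch of the reported second transaction with
that call added (same function names as in the report above; untested):

    StartTransactionCommand();
    SPI_connect();
    PushActiveSnapshot(GetTransactionSnapshot());   /* provide an outer snapshot */
    SPI_execute_plan(plan, NULL, NULL, true, 0);
    SPI_finish();
    PopActiveSnapshot();
    CommitTransactionCommand();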
--
Michael


signature.asc
Description: PGP signature


Re: needless complexity in StartupXLOG

2021-07-30 Thread Robert Haas
On Thu, Jul 29, 2021 at 3:16 PM Andres Freund  wrote:
> Not sure there's enough concensus on the idea for that. I personally
> think that's a good approach at reducing relevant complexity, but I
> don't know if anybody agrees...

There does seem to be agreement on the proposed patch, so I have committed it.

Thanks to all for the discussion and the links to other relevant threads.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




RE: Parallel Inserts (WAS: [bug?] Missed parallel safety checks..)

2021-07-30 Thread houzj.f...@fujitsu.com
On Friday, July 30, 2021 2:52 PM Greg Nancarrow  wrote:
> On Fri, Jul 30, 2021 at 4:02 PM Amit Kapila  wrote:
> >
> > > Besides, I think we need a new default value about parallel dml
> > > safety. Maybe 'auto' or 'null'(different from
> > > safe/restricted/unsafe). Because, user is likely to alter the safety
> > > to the default value to get the automatic safety check, an independent 
> > > default
> > > value can make it more clear.
> > >
> >
> > Hmm, but auto won't work for partitioned tables, right? If so, that
> > might appear like an inconsistency to the user and we need to document
> > the same. Let me summarize the discussion so far in this thread so
> > that it is helpful to others.
> >
> 
> To avoid that inconsistency, UNSAFE could be the default for partitioned 
> tables
> (and we would disallow setting AUTO for these).
> So then AUTO is the default for non-partitioned tables only.

I think this approach is reasonable, +1.

Best regards,
houzj 


param 'txn' not used in function maybe_send_schema()

2021-07-30 Thread houzj.f...@fujitsu.com
Hi,

When reviewing logical replication related patches, I noticed that the
parameter ReorderBufferTXN *txn is not used in the function maybe_send_schema().
Since this is not an external function, I think it might be better to remove
the unused parameter.
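
For illustration, a sketch of the kind of change being suggested (the
signature below is reconstructed from memory and may not match the actual
code exactly):

    /* current declaration (approximate) */
    static void maybe_send_schema(LogicalDecodingContext *ctx,
                                  ReorderBufferTXN *txn,
                                  ReorderBufferChange *change,
                                  Relation relation,
                                  RelationSyncEntry *relentry);

    /* suggested: drop the unused txn parameter and adjust the callers */
    static void maybe_send_schema(LogicalDecodingContext *ctx,
                                  ReorderBufferChange *change,
                                  Relation relation,
                                  RelationSyncEntry *relentry);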

Best regards,
Houzj



0001-remove-unused-paramter-in-maybe_send_schema.patch
Description: 0001-remove-unused-paramter-in-maybe_send_schema.patch


Re: Background writer and checkpointer in crash recovery

2021-07-30 Thread Robert Haas
On Fri, Jul 30, 2021 at 4:42 AM Aleksander Alekseev
 wrote:
> v2-0001 and v2-0002 look fine, but I don't like much the idea of introducing 
> a new GUC in v2-0003. It's for very specific needs, which most of the users, 
> I believe, don't care about. I suggest dealing with v2-0001 and v2-0002 first 
> and then maybe submit and discuss v2-0003 as a separate CF entry.

Hi!

Thanks for bumping this thread; I had forgotten all about this effort,
but having just spent a bunch of time struggling with the thicket of
cases in StartupXLOG(), I'm now feeling highly motivated to make some
more progress in simplifying things over there. I am still of the
opinion that 0001 is a good idea, and I don't have any suggestions for
how it could be improved, except perhaps that the call to
PublishStartupProcessInformation() could maybe have a one-line
comment. Thomas, are you planning to press forward with committing
this soon? If not, do you mind if I do?

Regarding Simon's 0002, I wonder why it's useful to print this
information out at the end of crash recovery but not at the end of
archive recovery. It seems to me that if the information is useful
enough to be worth printing, it's probably good to print it in both
cases. In fact, rather than adding a separate message for this
information, I think we should just change the existing "redo done at"
message to print the details Simon proposes rather than what it does
now. Currently, we get output like this:

2021-07-30 09:13:05.319 EDT [48380] LOG:  redo starts at 0/23A6E18
2021-07-30 09:13:05.612 EDT [48380] LOG:  redo done at 0/D0D9CE8
system usage: CPU: user: 0.13 s, system: 0.12 s, elapsed: 0.29 s

With Simon's patch, I get something like this:

2021-07-30 09:39:43.579 EDT [63702] LOG:  redo starts at 0/14A2F48
2021-07-30 09:39:44.129 EDT [63702] LOG:  redo done at 0/15F48230
system usage: CPU: user: 0.25 s, system: 0.25 s, elapsed: 0.55 s
2021-07-30 09:39:44.129 EDT [63702] LOG:  crash recovery complete:
wrote 36517 buffers (222.9%); dirtied 52985 buffers; read 7 buffers

Now I really think that information on the number of buffers touched
and how long it took is way more useful than user and system time.
Knowing how much user and system time were spent doesn't really tell
you anything, but a count of buffers touched gives you some meaningful
idea of how much work recovery did, and whether I/O was slow. Elapsed
time you can figure out yourself from the timestamp. However, I don't
really like printing the percentage here; unlike the checkpoint case,
it can very easily be way more than a hundred percent, and I think
that will confuse people. It could be tens of thousands of percent,
really, or even more.

So my proposal is:

redo done at %X/%X: wrote %ld buffers (%0.3f ms); dirtied %ld buffers;
read %ld buffers (%0.3f ms)

Regarding 0003, I agree with Alexander's comment that a GUC doesn't
seem particularly appropriate, but I also think that the approach may
not be quite right. In the promotion case, we emit an end-of-recovery
record and then later in the code we trigger a checkpoint. In your
patch, there's no end-of-recovery checkpoint -- you just trigger a
checkpoint instead of waiting for it. I think it's probably better to
make those two cases work the same. The end-of-recovery record isn't
needed to change the TLI as it is in the promotion case, but (1) it
seems better to have fewer code paths and (2) it might be good for
debuggability.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Case expression pushdown

2021-07-30 Thread Tom Lane
Alexander Pyhalov  writes:
> The only thing I'm confused about is in T_CaseTestExpr case - how can it 
> be that CaseTestExpr collation doesn't match case_arg_cxt->collation ?
> Do we we need to inspect only case_arg_cxt->state? Can we assert that 
> collation == case_arg_cxt->collation?

Perhaps, but:

(1) I'm disinclined to make this code look different from the otherwise-
identical coding elsewhere in foreign_expr_walker.

(2) That would create a hard assumption that foreign_expr_walker's
conclusions about the collation of a subexpression match those of
assign_query_collations.  I'm not quite sure I believe that (and if
it's true, why aren't we just relying on exprCollation?).  Anyway,
if we're to have an assertion that it's true, it should be in some
place that's a lot less out-of-the-way than CaseTestExpr, because
if the assumption gets violated it might be a long time till we
notice.

So I think we're best off to just write it the way I did, at least
so far as this patch is concerned.  If we want to rethink the way
collation gets calculated here, that would be material for a
separate patch.

regards, tom lane




Re: Use generation context to speed up tuplesorts

2021-07-30 Thread Robert Haas
On Fri, Jul 30, 2021 at 2:42 AM David Rowley  wrote:
> Master:
>Sort Method: quicksort  Memory: 5541kB
> Patched:
>Sort Method: quicksort  Memory: 3197kB

Whoa.

> work_mem = '4GB';
> Test     master    gen sort    compare
> Test1    317.2     665.6       210%
> Test2    228.6     388.9       170%
> Test3    207.4     330.7       159%
> Test4    185.5     279.4       151%
> Test5    292.2     563.9       193%

Very impressive.

An early version of what eventually became DSA worked with
backend-local memory and I saw very substantial memory usage
improvements on large sorts, similar to what you show here. I am not
sure I saw the same CPU improvements, and in any case I abandoned the
idea of using that infrastructure to manage backend-local memory at
some point, since the whole thing had lots of problems that I didn't
know how to solve. What you've done here looks like a much more
promising approach.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: A qsort template

2021-07-30 Thread John Naylor
On Fri, Jul 30, 2021 at 7:47 AM Ranier Vilela  wrote:

> The patch attached does not apply cleanly,
> please can fix it?

It applies just fine with "patch", for those wondering.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Segment fault when executing SPI function on PG with commit 41c6a5be

2021-07-30 Thread Tom Lane
Michael Paquier  writes:
> On Fri, Jul 30, 2021 at 08:57:35AM +, liuhuail...@fujitsu.com wrote:
>> When I used SPI_execute_plan function on PG12 which commit 41c6a5be is used,
>> Segment fault occurred.

> I see nothing wrong in this commit, FWIW.
> But I see an issue with your code.  It seems to me that you should
> push a snapshot before doing SPI_prepare() and SPI_execute_plan(),
> as of:
> PushActiveSnapshot(GetTransactionSnapshot()); 

Yes.  What that commit did was to formalize the requirement that
a snapshot exist *before* entering SPI.  Before that, you might
have gotten away without one, depending on what you were trying
to do (in particular, detoasting a toasted output Datum would
fail if you lack an external snapshot).

This isn't the first complaint we've seen from somebody whose
code used to work and now fails there, however.  I wonder if
we should convert the Assert into an actual test-and-elog, say

/* Otherwise, we'd better have an active Portal */
portal = ActivePortal;
-   Assert(portal != NULL);
+   if (unlikely(portal == NULL))
+   elog(ERROR, "must have an outer snapshot or portal");
Assert(portal->portalSnapshot == NULL);

Perhaps that would help people to realize that the bug is theirs
not EnsurePortalSnapshotExists's.

I gather BTW that the OP is trying to test C code in a non-assert
build.  Not a great approach.

regards, tom lane




Re: Segment fault when executing SPI function on PG with commit 41c6a5be

2021-07-30 Thread Daniel Gustafsson
> On 30 Jul 2021, at 17:06, Tom Lane  wrote:

> I wonder if we should convert the Assert into an actual test-and-elog, say
> 
>   /* Otherwise, we'd better have an active Portal */
>   portal = ActivePortal;
> - Assert(portal != NULL);
> + if (unlikely(portal == NULL))
> + elog(ERROR, "must have an outer snapshot or portal");
>   Assert(portal->portalSnapshot == NULL);
> 
> Perhaps that would help people to realize that the bug is theirs
> not EnsurePortalSnapshotExists's.

+1, that would probably be quite helpful.

--
Daniel Gustafsson   https://vmware.com/





Re: Segment fault when executing SPI function on PG with commit 41c6a5be

2021-07-30 Thread Tom Lane
Daniel Gustafsson  writes:
>> On 30 Jul 2021, at 17:06, Tom Lane  wrote:
>> I wonder if we should convert the Assert into an actual test-and-elog, say
>> 
>>  /* Otherwise, we'd better have an active Portal */
>>  portal = ActivePortal;
>> -Assert(portal != NULL);
>> +if (unlikely(portal == NULL))
>> +elog(ERROR, "must have an outer snapshot or portal");
>>  Assert(portal->portalSnapshot == NULL);
>> 
>> Perhaps that would help people to realize that the bug is theirs
>> not EnsurePortalSnapshotExists's.

> +1, that would probably be quite helpful.

Happy to make it so.  Anyone have suggestions about the wording of
the message?

regards, tom lane




Re: Unbounded %s in sscanf

2021-07-30 Thread Daniel Gustafsson
I took another look at this today, and propose to push the attached.  The
pg_dump fix goes all the way back to 9.6 whereas the pg_basebackup fix is from
11 and onwards.  The adjacent shadowed variable bug in pg_dump is also present
since 9.6.

Thoughts?
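
For context, a hedged illustration of the class of bug (not the actual
pg_dump or pg_basebackup code): an unbounded %s conversion can overrun its
destination buffer, while a width-limited one cannot:

    #include <stdio.h>

    #define BUFSZ 1024

    static void
    parse_token(const char *line)
    {
        char    name[BUFSZ];

        /* Unbounded: a token longer than name[] overruns the buffer. */
        sscanf(line, "%s", name);

        /* Bounded: the width must leave room for the terminating NUL. */
        sscanf(line, "%1023s", name);
    }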

--
Daniel Gustafsson   https://vmware.com/



v4-0002-Fix-bug-in-TOC-file-error-message-printing.patch
Description: Binary data


v4-0001-Fix-sscanf-limits-in-pg_basebackup-and-pg_dump.patch
Description: Binary data


Re: Case expression pushdown

2021-07-30 Thread Tom Lane
I wrote:
> Alexander Pyhalov  writes:
>> Do we we need to inspect only case_arg_cxt->state? Can we assert that 
>> collation == case_arg_cxt->collation?

> Perhaps, but:
> ...

Oh, actually there's a third point: the shakiest part of this logic
is the assumption that we've correctly matched a CaseTestExpr to
its source CaseExpr.  Seeing that inlining and constant-folding can
mash things to the point where a WHEN expression doesn't look like
"CaseTestExpr = RHS", it's a little nervous-making to assume there
couldn't be another CASE in between.  While there's not known problems
of this sort, if it did happen I'd prefer this code to react as
"don't push down", not as "assertion failure".

(There's been speculation in the past about whether we could find
a more bulletproof representation of this kind of CaseExpr.  We've
not succeeded at that yet though.)

regards, tom lane




Re: Unbounded %s in sscanf

2021-07-30 Thread Tom Lane
Daniel Gustafsson  writes:
> I took another look at this today, and propose to push the attached.  The
> pg_dump fix goes all the way back to 9.6 whereas the pg_basebackup fix is from
> 11 and onwards.  The adjacent shadowed variable bug in pg_dump is also present
> since 9.6.
> Thoughts?

Generally +1, though I wonder if it'd be prudent to deal with the
shadowed-variable bug by renaming *both* variables.  "fname" is
clearly too generic in a function that deals with multiple file
names.

Another thing that is nibbling at the back of my mind is that one
reason we started to use src/port/snprintf.c all the time is that
glibc's *printf functions behave in a very unfriendly fashion when
asked to print text that they think is invalidly encoded, but only
if the format involves an explicit field width spec.  I wonder if
we're opening ourselves to similar problems if we start to use
field widths with *scanf.  In principle, I think the input text
always ought to be ASCII in these cases, so that there's no hazard.
But is there an interesting security aspect here?  That is, if someone
can inject a maliciously-crafted file containing non-ASCII data, what
kind of misbehavior could ensue?  It might be that sscanf would just
report failure and we'd give up, which would be fine.  But if a
stack overrun could be triggered that way, it'd not be fine.

regards, tom lane




Re: Have I found an interval arithmetic bug?

2021-07-30 Thread Bruce Momjian
On Wed, Jul 28, 2021 at 11:19:16AM -0400, Bruce Momjian wrote:
> On Wed, Jul 28, 2021 at 08:42:31AM +0100, Dean Rasheed wrote:
> > So it's equal even without calling justify_interval() on the result.
> > 
> > FWIW, I remain of the opinion that the interval literal code should
> > just spill down to lower units in all cases, just like the
> > multiplication and division code, so that the results are consistent
> > (barring floating point rounding errors) and explainable.
> 
> Here is a more minimal patch that doesn't change the spill-down units at
> all, but merely documents it, and changes the spilldown to months to
> round instead of truncate.

Unless I hear more feedback, I plan to apply this doc patch to all
branches with the word "rounded" changed to "truncated" in the back
branches, and apply the rounded code changes to master.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Have I found an interval arithmetic bug?

2021-07-30 Thread Bruce Momjian
On Fri, Jul 30, 2021 at 12:04:39PM -0400, Bruce Momjian wrote:
> On Wed, Jul 28, 2021 at 11:19:16AM -0400, Bruce Momjian wrote:
> > On Wed, Jul 28, 2021 at 08:42:31AM +0100, Dean Rasheed wrote:
> > > So it's equal even without calling justify_interval() on the result.
> > > 
> > > FWIW, I remain of the opinion that the interval literal code should
> > > just spill down to lower units in all cases, just like the
> > > multiplication and division code, so that the results are consistent
> > > (barring floating point rounding errors) and explainable.
> > 
> > Here is a more minimal patch that doesn't change the spill-down units at
> > all, but merely documents it, and changes the spilldown to months to
> > round instead of truncate.
> 
> Unless I hear more feedback, I plan to apply this doc patch to all
> branches with the word "rounded" changed to "truncated" in the back
> branches, and apply the rounded code changes to master.

Now that I think of it, I will just remove the word "rounded" from the
back branch docs so we are technically breaking the documented API less
in PG 15.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Slightly improve initdb --sync-only option's help message

2021-07-30 Thread Bossart, Nathan
On 7/30/21, 2:22 AM, "Daniel Gustafsson"  wrote:
> LGTM.  I took the liberty of rephrasing the "and exit" part of the initdb help
> output to match the other exiting options in there.  Barring objections, I think
> this is ready.

LGTM.  Thanks!

Nathan



Re: pg_upgrade does not upgrade pg_stat_statements properly

2021-07-30 Thread Bruce Momjian
On Thu, Jul 29, 2021 at 06:19:56PM -0400, Tom Lane wrote:
> Bruce Momjian  writes:
> > I think we need to first give clear instructions on how to find out if
> > an extension update is available, and then how to update it.  I am
> > thinking we should supply a query which reports all extensions that can
> > be upgraded, at least for contrib.
> 
> I suggested awhile ago that pg_upgrade should look into
> pg_available_extensions in the new cluster, and prepare
> a script with ALTER EXTENSION UPDATE commands for
> anything that's installed but is not the (new cluster's)
> default version.

OK, done in this patch.  I am assuming that everything that shows an
update in pg_available_extensions can use ALTER EXTENSION UPDATE.  I
assume this would be backpatched to 9.6.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.

diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 0c47a6b8cc..ad5f391995 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -241,6 +241,8 @@ issue_warnings_and_set_wal_level(void)
 	if (GET_MAJOR_VERSION(old_cluster.major_version) <= 906)
 		old_9_6_invalidate_hash_indexes(&new_cluster, false);
 
+	report_extension_updates(&new_cluster);
+
 	stop_postmaster(false);
 }
 
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 7038ac12bf..ca0795f68f 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -455,6 +455,7 @@ void		old_9_6_invalidate_hash_indexes(ClusterInfo *cluster,
 			bool check_mode);
 
 void		old_11_check_for_sql_identifier_data_type_usage(ClusterInfo *cluster);
+void		report_extension_updates(ClusterInfo *cluster);
 
 /* parallel.c */
 void		parallel_exec_prog(const char *log_file, const char *opt_log_file,
diff --git a/src/bin/pg_upgrade/version.c b/src/bin/pg_upgrade/version.c
index a3c193316d..b82afafd22 100644
--- a/src/bin/pg_upgrade/version.c
+++ b/src/bin/pg_upgrade/version.c
@@ -468,3 +468,81 @@ old_11_check_for_sql_identifier_data_type_usage(ClusterInfo *cluster)
 	else
 		check_ok();
 }
+
+
+/*
+ * report_extension_updates()
+ *	Report extensions that should be updated.
+ */
+void
+report_extension_updates(ClusterInfo *cluster)
+{
+	int			dbnum;
+	FILE	   *script = NULL;
+	bool		found = false;
+	char	   *output_path = "update_extensions.sql";
+
+	prep_status("Checking for extension updates");
+
+	for (dbnum = 0; dbnum < cluster->dbarr.ndbs; dbnum++)
+	{
+		PGresult   *res;
+		bool		db_used = false;
+		int			ntups;
+		int			rowno;
+		int			i_name;
+		DbInfo	   *active_db = &cluster->dbarr.dbs[dbnum];
+		PGconn	   *conn = connectToServer(cluster, active_db->db_name);
+
+		/* find extensions needing updates */
+		res = executeQueryOrDie(conn,
+"SELECT name "
+"FROM pg_available_extensions "
+"WHERE installed_version != default_version"
+			);
+
+		ntups = PQntuples(res);
+		i_name = PQfnumber(res, "name");
+		for (rowno = 0; rowno < ntups; rowno++)
+		{
+			found = true;
+
+			if (script == NULL && (script = fopen_priv(output_path, "w")) == NULL)
+pg_fatal("could not open file \"%s\": %s\n", output_path,
+		 strerror(errno));
+			if (!db_used)
+			{
+PQExpBufferData connectbuf;
+
+initPQExpBuffer(&connectbuf);
+appendPsqlMetaConnect(&connectbuf, active_db->db_name);
+fputs(connectbuf.data, script);
+termPQExpBuffer(&connectbuf);
+db_used = true;
+			}
+			fprintf(script, "ALTER EXTENSION %s UPDATE;\n",
+	quote_identifier(PQgetvalue(res, rowno, i_name)));
+		}
+
+		PQclear(res);
+
+		PQfinish(conn);
+	}
+
+	if (script)
+		fclose(script);
+
+	if (found)
+	{
+		report_status(PG_REPORT, "notice");
+		pg_log(PG_REPORT, "\n"
+			   "Your installation contains extensions that should be updated\n"
+			   "with the ALTER EXTENSION command.  The file\n"
+			   "%s\n"
+			   "when executed by psql by the database superuser will update\n"
+			   "these extensions.\n\n",
+			   output_path);
+	}
+	else
+		check_ok();
+}


Re: pg_upgrade does not upgrade pg_stat_statements properly

2021-07-30 Thread Jan Wieck

On 7/30/21 7:33 AM, Dave Cramer wrote:





>> Yes, but as I said twice only if the currently installed version is
>> different
>> from the default version.  Otherwise ALTER EXTENSION UPGRADE is a no-op.
>
> Ah, ok, got it, thanks. Well I'm not sure how to deal with this.
> The only thing I can think of is that we add a post_upgrade function to
> the API.
>
> All extensions would have to implement this.


An alternative to this would be a recommended version number scheme for 
extensions. If extensions were numbered


   MAJOR_SERVER.MAJOR_EXTENSION.MINOR_EXTENSION

then pg_upgrade could check the new cluster for the existence of an SQL 
script that upgrades the extension from the old cluster's version to the 
new cluster's current one. And since an extension cannot have the same version number 
on two major server versions, there is no ambiguity here.


The downside is that all the extensions that don't need anything done 
for those upgrades (which I believe is the majority of them) would have 
to provide empty SQL files. Not necessarily a bad thing, as it actually 
documents "yes, the extension developer checked this and there is 
nothing to do here."



Regards, Jan

--
Jan Wieck




Re: pg_upgrade does not upgrade pg_stat_statements properly

2021-07-30 Thread Jan Wieck

On 7/30/21 7:40 AM, Julien Rouhaud wrote:

> On Fri, Jul 30, 2021 at 07:33:55AM -0400, Dave Cramer wrote:
>>
>> Ah, ok, got it, thanks. Well I'm not sure how to deal with this.
>> The only thing I can think of is that we add a post_upgrade function to the
>> API.
>>
>> All extensions would have to implement this.
>
> It seems like a really big hammer for a niche usage.  As far as I know I'm the
> only one who wrote an extension that can create different objects depending on
> the server version, so I'm entirely fine with dealing with that problem in my
> extension rather than forcing everyone to implement an otherwise useless API.
>
> Now if that API can be useful for other cases or if there are other extensions
> with similar problems, that would be a different story.



I haven't worked on it for a while, but I think pl_profiler does the 
same thing, so you are not alone.



Jan

--
Jan Wieck




Re: Have I found an interval arithmetic bug?

2021-07-30 Thread Tom Lane
Bruce Momjian  writes:
> On Fri, Jul 30, 2021 at 12:04:39PM -0400, Bruce Momjian wrote:
>> Unless I hear more feedback, I plan to apply this doc patch to all
>> branches with the word "rounded" changed to "truncated" in the back
>> branches, and apply the rounded code changes to master.

> Now that I think of it, I will just remove the word "rounded" from the
> back branch docs so we are technically breaking the documented API less
> in PG 15.

I think your first idea was better.  Not documenting the behavior
doesn't make this not an API change; it just makes it harder for
people to understand what changed.

The doc patch itself is not exactly fine:

+ Field values can have fractional parts;  for example, '1.5
+ weeks' or '01:02:03.45'.  However,

I think "some field values", as it was worded previously, was better.
If you try to write 01.5:02:03, that is not going to be interpreted
as 1.5 hours.  (Hmm, I get something that seems quite insane:

regression=# select '01.5:02:03'::interval;
interval

 1 day 14:03:00
(1 row)

I wonder what it thinks it's doing there.)

This is wrong:

+ because interval internally stores only three integer units (months,
+ days, seconds), fractional units must be spilled to smaller units.

s/seconds/microseconds/ is probably enough to fix that.

+ For example, because months are approximated to equal 30 days,
+ fractional values of units greater than months is rounded to be the
+ nearest integer number of months.  Fractional units of months or less
+ are computed to be an integer number of days and seconds, assuming
+ 24 hours per day.  For example, '1.5 months'
+ becomes 1 month 15 days.

This entire passage is vague, and grammatically shaky too.  Perhaps
more like

  Fractional parts of units larger than months are rounded to the
  nearest integer number of months; for example '1.5 years'
  becomes '1 year 6 mons'.  Fractional parts of months are rounded
  to the nearest integer number of days, using the assumption that
  one month equals 30 days; for example '1.5 months'
  becomes '1 mon 15 days'.  Fractional parts of days and weeks
  are converted to microseconds, using the assumption that one day
  equals 24 hours.

  On output, the months field is shown as an appropriate number of
  years and months; the days field is shown as-is; the microseconds
  field is converted to hours, minutes, and possibly-fractional
  seconds.

regards, tom lane




Re: pg_upgrade does not upgrade pg_stat_statements properly

2021-07-30 Thread Tom Lane
Jan Wieck  writes:
> An alternative to this would be a recommended version number scheme for 
> extensions. If extensions were numbered
> MAJOR_SERVER.MAJOR_EXTENSION.MINOR_EXTENSION
> then pg_upgrade could check the new cluster for the existence of an SQL 
> script that upgrades the extension from the old cluster's version to the 
> new current. And since an extension cannot have the same version number 
> on two major server versions, there is no ambiguity here.

That idea cannot get off the ground.  We've spent ten years telling
people they can use whatever version-numbering scheme they like for
their extensions; we can't suddenly go from that to "you must use
exactly this scheme".

I don't see the need for it anyway.  What is different from just
putting the update actions into an extension version upgrade
script created according to the current rules, and then setting
that new extension version as the default version in the extension
build you ship for the new server version?
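
As a sketch of that approach (names and version numbers are made up): the
build shipped for the new server version sets default_version = '1.8' in its
control file and includes an upgrade script such as

    -- myext--1.7--1.8.sql, run by ALTER EXTENSION myext UPDATE
    ALTER TABLE myext_config ADD COLUMN note text;

so a plain ALTER EXTENSION ... UPDATE after pg_upgrade applies whatever the
new server version requires.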

regards, tom lane




Re: pg_upgrade does not upgrade pg_stat_statements properly

2021-07-30 Thread Jan Wieck

On 7/30/21 1:05 PM, Tom Lane wrote:

I don't see the need for it anyway.  What is different from just
putting the update actions into an extension version upgrade
script created according to the current rules, and then setting
that new extension version as the default version in the extension
build you ship for the new server version?


You are right. The real fix should actually be that an extension, that 
creates different objects depending on the major server version it is 
installed on, should not use the same version number for itself on those 
two server versions. It is actually wrong to have DO blocks that execute 
server version dependent sections in the CREATE EXTENSION scripts. 
However similar the code may be, it is intended for different server 
versions, so it is not the same version of the extension.
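
For context, the kind of version-dependent DO block being criticized looks 
roughly like this (a made-up sketch, not taken from any particular 
extension's install script):

    DO $$
    BEGIN
        IF current_setting('server_version_num')::int >= 140000 THEN
            CREATE VIEW myext_stats AS SELECT 'new layout' AS note;
        ELSE
            CREATE VIEW myext_stats AS SELECT 'old layout' AS note;
        END IF;
    END
    $$;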



Regards, Jan

--
Jan Wieck




Re: [PATCH] Hooks at XactCommand level

2021-07-30 Thread Tom Lane
Gilles Darold  writes:
> [ 1-startcommand_xact_callback-v2.diff ]

I've not read this version of the patch, but I see from the cfbot's
results that it's broken postgres_fdw.  I recall that postgres_fdw
uses the XactCallback and SubXactCallback mechanisms, so I'm betting
this means that you've changed the semantics of those callbacks in
an incompatible way.  That's probably not a great idea.  We could
fix postgres_fdw, but there are more than likely some external
modules that would also get broken, and that is supposed to be a
reasonably stable API.

regards, tom lane




Re: logical decoding and replication of sequences

2021-07-30 Thread Tomas Vondra

Hi,

Here's an updated version of this patch - rebased to current master, 
and fixing some of the issues raised in Peter's review.


Mainly, this adds a TAP test to src/test/subscription, focusing on 
testing the various situations with transactional and non-transactional 
behavior (with subtransactions and various ROLLBACK versions).


This new TAP test however uncovered an issue with wait_for_catchup(), 
because that uses pg_current_wal_lsn() to wait for replication of all 
the changes. But that does not work when the sequence gets advanced in a 
transaction that is then aborted, e.g. like this:


BEGIN;
SELECT nextval('s') FROM generate_series(1,100);
ROLLBACK;

The root cause is that pg_current_wal_lsn() uses the LogwrtResult.Write, 
which is updated by XLogFlush() - but only in RecordTransactionCommit. 
Which makes sense, because only the committed stuff is "visible". But 
the non-transactional behavior changes this, because now some of the 
changes from aborted transactions may need to be replicated. Which means 
the wait_for_catchup() ends up not waiting for the sequence change.


One option would be adding XLogFlush() to RecordTransactionAbort(), but 
my guess is we don't do that intentionally (even though aborts should be 
fairly rare in most workloads).


I'm not entirely sure changing this (replicating changes from aborted 
xacts) is a good idea. Maybe it'd be better to replicate this "lazily" - 
instead of replicating the advances right away, we might remember which 
sequences were advanced in the transaction, and then replicate current 
state for those sequences at commit time.


The idea is that if an increment is "invisible" we probably don't need 
to replicate it, it's fine to replicate the next "visible" increment. So 
for example given


BEGIN;
SELECT nextval('s');
ROLLBACK;

BEGIN;
SELECT nextval('s');
COMMIT;

we don't need to replicate the first change, but we need to replicate 
the second one.


The trouble is we don't decode individual sequence advances, just those 
that update the sequence tuple (so every 32 values or so). So we'd need 
to remember the first increment, in a way that is (a) persistent across 
restarts and (b) shared by all backends.


The other challenge seems to be ordering of the changes - at the moment 
we have no issues with this, because increments on the same sequence are 
replicated immediately, in the WAL order. But postponing them to commit 
time would affect this order.



I've also briefly looked at the code duplication - there's a couple 
functions in the patch that I shamelessly copy-pasted and tweaked to 
handle sequences instead of tables:


publicationcmds.c
-----------------

1) OpenTableList/CloseTableList -> OpenSequenceList/CloseSequenceList

Trivial differences, trivial to get rid of - the only difference is 
pretty much just table_open() vs. relation_open().



2) AlterPublicationTables -> AlterPublicationSequences

This is a bit more complicated, because the tables also handle 
inheritance (which I think does not apply to sequences). Other than 
that, it's calling the functions from (1).



subscriptioncmds.c
------------------

1) fetch_table_list, fetch_sequence_list

Minimal differences, essentially just the catalog name.

2) AlterSubscription_refresh

A lot of duplication, but the code handling tables and sequences is 
almost exactly the same and can be reused fairly easily (moved to a 
separate function, called for tables and then sequences).



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a31e0b879..56ddc3abae 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -5,7 +5,8 @@ PGFILEDESC = "test_decoding - example of a logical decoding output plugin"
 
 REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
-	spill slot truncate stream stats twophase twophase_stream
+	spill slot truncate stream stats twophase twophase_stream \
+	sequence
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
 	twophase_snapshot
diff --git a/contrib/test_decoding/expected/sequence.out b/contrib/test_decoding/expected/sequence.out
new file mode 100644
index 00..07f3e3e92c
--- /dev/null
+++ b/contrib/test_decoding/expected/sequence.out
@@ -0,0 +1,327 @@
+-- predictability
+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
+ ?column? 
+--
+ init
+(1 row)
+
+CREATE SEQUENCE test_sequence;
+-- test the sequence changes by several nextval() calls
+SELECT nextval('test_sequence');
+ nextval 
+-
+   1
+(1 row)
+
+SELECT nextval('test_sequence');
+ nextval 
+-
+   2
+(1 row)
+
+SELECT nextval('test_sequence');
+ nextval 
+-
+ 

Re: archive status ".ready" files may be created too early

2021-07-30 Thread Alvaro Herrera
On 2021-Jul-28, Bossart, Nathan wrote:

> On 7/27/21, 6:05 PM, "Alvaro Herrera"  wrote:

> > I'm not sure I understand what's the reason not to store this value in
> > pg_control; I feel like I'm missing something.  Can you please explain?
> 
> Thanks for taking a look.
> 
> The only reason I can think of is that it could make back-patching
> difficult.  I don't mind working on a version of the patch that uses
> pg_control.  Back-patching this fix might be a stretch, anyway.

Hmm ... I'm not sure we're prepared to backpatch this kind of change.
It seems a bit too disruptive to how replay works.  I think we should
be focusing solely on patch 0001 to surgically fix the precise
bug you see.  Does patch 0002 exist because you think that a system with
only 0001 will not correctly deal with a crash at the right time?


Now, the reason I'm looking at this patch series is that we're seeing a
related problem with walsender/walreceiver, which apparently are capable
of creating a file in the replica that ends up not existing in the
primary after a crash, for a reason closely related to what you
describe for WAL archival.  I'm not sure what is going on just yet, so
I'm not going to try and explain because I'm likely to get it wrong.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: logical decoding and replication of sequences

2021-07-30 Thread Robert Haas
On Tue, Jul 20, 2021 at 5:41 PM Tomas Vondra
 wrote:
> I think the crucial aspect of this that needs discussion/feedback the
> most is the transactional vs. non-transactional behavior. All the other
> questions are less important / cosmetic.

Yeah, it seems really tricky to me to get this right. The hard part
is, I think, mostly figuring out what the right behavior really is.

DDL in PostgreSQL is transactional. Non-DDL operations on sequences
are non-transactional. If a given transaction does only one of those
things, it seems clear enough what to do, but when the same
(sub)transaction does both, it gets messy. I'd be tempted to think
about something like:

1. When a transaction performs only non-transactional operations on
sequences, they are emitted immediately.

2. If a transaction performs transactional operations on sequences,
the decoded operations acquire a dependency on the transaction and
cannot be emitted until that transaction is fully decoded. When commit
or abort of that XID is reached, emit the postponed non-transactional
operations at that point.

I think this is similar to what you've designed, but I'm not sure that
it's exactly equivalent. I think in particular that it may be better
to insist that all of these operations are non-transactional and that
the debate is only about when they can be sent, rather than trying to
sort of convert them into an equivalent series of transactional
operations. That approach seems confusing especially in the case where
some (sub)transactions abort.
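
As a hedged illustration of those two rules (the sequence name is invented;
this shows the shape of the problem, not output from any patch):

    CREATE SEQUENCE s;

    -- rule 1: only non-transactional operations touch the sequence, so the
    -- increment can be emitted as soon as it is decoded
    BEGIN;
    SELECT nextval('s');
    COMMIT;

    -- rule 2: the same transaction also changes the sequence definition,
    -- so the increment depends on that transactional change and must be
    -- held back until the transaction's commit (or abort) is decoded
    BEGIN;
    ALTER SEQUENCE s INCREMENT BY 10;
    SELECT nextval('s');
    COMMIT;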

But this is just my $0.02.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Have I found an interval arithmetic bug?

2021-07-30 Thread Bruce Momjian
On Fri, Jul 30, 2021 at 12:49:34PM -0400, Tom Lane wrote:
> Bruce Momjian  writes:
> > Now that I think of it, I will just remove the word "rounded" from the
> > back branch docs so we are technically breaking the documented API less
> > in PG 15.
> 
> I think your first idea was better.  Not documenting the behavior
> doesn't make this not an API change; it just makes it harder for
> people to understand what changed.

OK.  However, I thought we were more worried about changing documented
APIs than undocumented ones.  Anyway, I will do as you suggested.

> The doc patch itself is not exactly fine:
> 
> + Field values can have fractional parts;  for example, '1.5
> + weeks' or '01:02:03.45'.  However,
> 
> I think "some field values", as it was worded previously, was better.
> If you try to write 01.5:02:03, that is not going to be interpreted
> as 1.5 hours.  (Hmm, I get something that seems quite insane:
> 
> regression=# select '01.5:02:03'::interval;
> interval
> 
>  1 day 14:03:00
> (1 row)
> 
> I wonder what it thinks it's doing there.)

It thinks 01.5:02:03 is Days:Hours:Minutes, so I think all fields can use
fractions:

SELECT interval '1.5 minutes';
 interval
--
 00:01:30

> This is wrong:
> 
> + because interval internally stores only three integer units (months,
> + days, seconds), fractional units must be spilled to smaller units.
> 
> s/seconds/microseconds/ is probably enough to fix that.

OK, there were a few places that said "seconds" so I fixed those too.

> + For example, because months are approximated to equal 30 days,
> + fractional values of units greater than months is rounded to be the
> + nearest integer number of months.  Fractional units of months or less
> + are computed to be an integer number of days and seconds, assuming
> + 24 hours per day.  For example, '1.5 months'
> + becomes 1 month 15 days.
> 
> This entire passage is vague, and grammatically shaky too.  Perhaps
> more like
> 
>   Fractional parts of units larger than months are rounded to the
>   nearest integer number of months; for example '1.5 years'
>   becomes '1 year 6 mons'.  Fractional parts of months are rounded
>   to the nearest integer number of days, using the assumption that
>   one month equals 30 days; for example '1.5 months'

The newest patch actually doesn't work as explained above --- fractional
months now continue to spill to microseconds.  I think you are looking
at a previous version.

>   becomes '1 mon 15 days'.  Fractional parts of days and weeks
>   are converted to microseconds, using the assumption that one day
>   equals 24 hours.

Uh, fractional weeks can be integer days.

>   On output, the months field is shown as an appropriate number of
>   years and months; the days field is shown as-is; the microseconds
>   field is converted to hours, minutes, and possibly-fractional
>   seconds.

Here is an updated patch that includes some of your ideas.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.

diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index 453115f942..50a2c8e5f1 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -2840,15 +2840,18 @@ P  years-months-
 
 
 
- In the verbose input format, and in some fields of the more compact
- input formats, field values can have fractional parts; for example
- '1.5 week' or '01:02:03.45'.  Such input is
- converted to the appropriate number of months, days, and seconds
- for storage.  When this would result in a fractional number of
- months or days, the fraction is added to the lower-order fields
- using the conversion factors 1 month = 30 days and 1 day = 24 hours.
- For example, '1.5 month' becomes 1 month and 15 days.
- Only seconds will ever be shown as fractional on output.
+ Field values can have fractional parts:  for example, '1.5
+ weeks' or '01:02:03.45'.  However,
+ because interval internally stores only three integer units (months,
+ days, microseconds), fractional units must be spilled to smaller
+ units.  Fractional parts of units greater than months is rounded to
+ be an integer number of months, e.g. '1.5 years'
+ becomes '1 year 6 mons'.  Fractional parts of
+ weeks and days are computed to be an integer number of days and
+ microseconds, assuming 30 days per month and 24 hours per day, e.g.,
+ '1.75 months' becomes 1 mon 22 days
+ 12:00:00.  Only seconds will ever be shown as fractional
+ on output.
 
 
 
@@ -2892,10 +2895,10 @@ P  years-months-
 
 
  Internally interval values are stored as months, days,
- and seconds. This is done because the number of days in a month
+ and microseconds. T

Re: Have I found an interval arithmetic bug?

2021-07-30 Thread Robert Haas
On Fri, Jul 30, 2021 at 3:03 PM Bruce Momjian  wrote:
> Here is an updated patch that includes some of your ideas.

Just to be clear, I am against this patch. I don't think it's a
minimal change for the reported problem, and I think some people will
be unhappy about the behavior changes.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2021-07-30 Thread Robert Haas
On Thu, Jul 29, 2021 at 3:14 PM Andres Freund  wrote:
> I think those advantages are far outstripped by the big disadvantage of
> needing to either size the array accurately from the start, or to
> reallocate the whole array.  Our current pre-allocation behaviour is
> very wasteful for most vacuums but doesn't handle large work_mem at all,
> causing unnecessary index scans.

I agree that the current pre-allocation behavior is bad, but I don't
really see that as an issue with my idea. Fixing that would require
allocating the array in chunks, but that doesn't really affect the
core of the idea much, at least as I see it.

But I accept that Yura has a very good point about the memory usage of
what I was proposing.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Have I found an interval arithmetic bug?

2021-07-30 Thread Bruce Momjian
On Fri, Jul 30, 2021 at 03:08:56PM -0400, Robert Haas wrote:
> On Fri, Jul 30, 2021 at 3:03 PM Bruce Momjian  wrote:
> > Here is an updated patch that includes some of your ideas.
> 
> Just to be clear, I am against this patch. I don't think it's a
> minimal change for the reported problem, and I think some people will
> be unhappy about the behavior changes.

Uh, what do you suggest then?  You wanted the years/months fixed, and
rounding at spill stop time makes sense, and fixes the problem.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: [PoC] Improve dead tuple storage for lazy vacuum

2021-07-30 Thread Andres Freund
Hi,

On 2021-07-30 15:13:49 -0400, Robert Haas wrote:
> On Thu, Jul 29, 2021 at 3:14 PM Andres Freund  wrote:
> > I think those advantages are far outstripped by the big disadvantage of
> > needing to either size the array accurately from the start, or to
> > reallocate the whole array.  Our current pre-allocation behaviour is
> > very wasteful for most vacuums but doesn't handle large work_mem at all,
> > causing unnecessary index scans.
> 
> I agree that the current pre-allocation behavior is bad, but I don't
> really see that as an issue with my idea. Fixing that would require
> allocating the array in chunks, but that doesn't really affect the
> core of the idea much, at least as I see it.

Well, then it'd not really be the "simple array approach" anymore :)


> But I accept that Yura has a very good point about the memory usage of
> what I was proposing.

The lower memory usage also often will result in a better cache
utilization - which is a crucial factor for index vacuuming when the
index order isn't correlated with the heap order. Cache misses really
are a crucial performance factor there.

Greetings,

Andres Freund




Re: Asynchronous and "direct" IO support for PostgreSQL.

2021-07-30 Thread Melanie Plageman
On Wed, Jul 28, 2021 at 2:10 PM Andres Freund  wrote:
> On 2021-07-28 13:37:48 -0400, Melanie Plageman wrote:
>
> > Each IO will have its own TBMIterateResult allocated and returned by the
> > PgStreamingRead helper and freed later by
> > heapam_scan_bitmap_next_block() before requesting the next block.
> > Previously it was allocated once and saved in the TBMIterator in the
> > BitmapHeapScanState node and reused. Because of this, the table AM API
> > routine, table_scan_bitmap_next_block() now defines the TBMIterateResult
> > as an output parameter.
> >
> > I haven't made the GIN code reasonable yet either (it uses the TID
> > bitmap functions that I've changed).
>
> I don't quite understand the need to change the tidbitmap interface, or
> maybe rather I'm not convinced that pessimistically preallocating space
> is a good idea?
>

TBMIterator cannot contain a TBMIterateResult because it prefetches
blocks and calls tbm_iterate() for each one, which would overwrite the
relevant information in the TBMIterateResult before it has been returned
to heapam_scan_bitmap_next_block().*

Thus, we need at least as many TBMIterateResults as the size of the
prefetch window at its largest.

We could save some memory if we separated the data in TBMIterateResult
and made a new struct, let's call it BitmapBlockState, with just the
block number, buffer number, and recheck to be used and returned by
bitmapheapscan_pgsr_next_single().

We need both block and buffer because we need to distinguish between
hit_end, skip_fetch, and invalid block number conditions in the caller.
We need recheck before initiating IO to determine if we should
skip_fetch.

Then a separate struct which is much the same as the existing
TBMIterateResult could be maintained in the BitmapHeapScanState node and
passed into heapam_scan_bitmap_next_block() along with the bitmap (a new
parameter).

In heapam_scan_bitmap_next_block(), after getting the BitmapBlockState
from pg_streaming_read_get_next(), we could call tbm_find_pageentry()
with the block number and bitmap.
For a non-lossy page, we could then scrape the offsets and ntuples using
the PageTableEntry. If it is lossy, we would set recheck and ntuples
accordingly. (I do wonder if that allows us to distinguish between a
lossy page and a block number that is erroneous and isn't in the
bitmap--but maybe that can't happen.)

However, we would still have as many palloc() calls (one for every block)
to create the BitmapBlockState. We would have less outstanding memory by
limiting the number of offsets arrays created.
We would still need to pass the recheck flag, ntuples, and buffer back
up to BitmapHeapNext(), so, at that point we would still need a data
structure that is basically the same as the existing TBMIterateResult.

Alternatively, we could keep an array of TBMIterateResults the size of
the prefetch window and reuse them -- though I'm not sure where to keep
it and how to manage it when the window gets resized.

In my current patch, I allocate and free one TBMIterateResult for each
block. The amount of outstanding memory will be #ios in prefetch window
* sizeof(TBMIterateResult).

We don't want to always palloc() memory for the TBMIterateResult inside
of tbm_iterate(), since other users (like GIN) still only need one
TBMIterateResult

So, if the TBMIterateResult is not inside of the TBMIterator and
tbm_iterate() does not allocate the memory, we need to pass it in as an
output parameter, and, if we do that, it felt odd to also return it --
hence the function signature change.

One alternative I tried was having the TBMIterator have a pointer to the
TBMIterateResult and then users of it can allocate the TBMIterateResult
and set it in the TBMIterator before calling tbm_iterate(). But, then we
need to expose the TBMIterator outside of the TID bitmap API. Also, it
felt weird to have a member of the iterator which must not be NULL when
tbm_iterate() is called but which isn't set up in tbm_begin_iterate().

>
>
> >  static bool
> >  heapam_scan_bitmap_next_block(TableScanDesc scan,
> > -  TBMIterateResult *tbmres)
> > +  TBMIterateResult **tbmres)
>
> ISTM that we possibly shouldn't expose the TBMIterateResult outside of
> the AM after this change? It feels somewhat like an implementation
> detail now. It seems somewhat odd to expose a ** to set a pointer that
> nodeBitmapHeapscan.c then doesn't really deal with itself.
>

All the members of the TBMIterateResult are populated in
bitmapheapscan_pgsr_next_single() and then
most of it is used by heapam_scan_bitmap_next_block() to
  - detect error conditions and done-ness
  - fill in the HeapScanDesc with the information needed by
heapam_scan_bitmap_next_tuple() (rs_cbuf [which is basically
redundant with TBMIterateResult->buffer] and rs_vistuples)

However, some of the information is used up in BitmapHeapNext() and in
heapam_scan_bitmap_next_tuple() and doesn't go in the HeapScanDesc:
  - BitmapHeapNext() uses the state of the

Re: Have I found an interval arithmetic bug?

2021-07-30 Thread Robert Haas
On Fri, Jul 30, 2021 at 3:20 PM Bruce Momjian  wrote:
> Uh, what do you suggest then?  You wanted the years/months fixed, and
> rounding at spill stop time makes sense, and fixes the problem.

Hmm, maybe I misunderstood. Are you saying that you think the patch
will fix cases like interval '-1.7 years 29.4 months' and interval
'29.4 months -1.7 years' to produce the same answer without changing
any other cases? I had the impression that you were proposing a bigger
change to the rules for converting fractional units to units of lower
type, particularly because Tom called it an "API change".

For some reason I can't apply the patch locally.

[rhaas pgsql]$ patch -p1 < ~/Downloads/interval.diff
(Stripping trailing CRs from patch.)
patching file doc/src/sgml/datatype.sgml
(Stripping trailing CRs from patch.)
patching file src/backend/utils/adt/datetime.c
patch:  malformed patch at line 90: @@ -3601,7 +3597,7 @@
DecodeISO8601Interval(char *str,

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2021-07-30 Thread Robert Haas
On Fri, Jul 30, 2021 at 3:34 PM Andres Freund  wrote:
> The lower memory usage also often will result in a better cache
> utilization - which is a crucial factor for index vacuuming when the
> index order isn't correlated with the heap order. Cache misses really
> are a crucial performance factor there.

Fair enough.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Asynchronous and "direct" IO support for PostgreSQL.

2021-07-30 Thread Andres Freund
Hi,

On 2021-07-30 15:35:30 -0400, Melanie Plageman wrote:
> * I think that having TBMIterateResult inside of TBMIterator is not
>   well-defined C language behavior. In [1], it says
> 
>   "Structures with flexible array members (or unions who have a
>   recursive-possibly structure member with flexible array member) cannot
>   appear as array elements or as members of other structures."

> [1] https://en.cppreference.com/w/c/language/struct

I think it is ok as long as the struct with the flexible array member is
at the end of the struct it is embedded in. I think even by the letter
of the standard, but it's as always hard to parse...

Greetings,

Andres Freund




Re: Have I found an interval arithmetic bug?

2021-07-30 Thread Tom Lane
Bruce Momjian  writes:
> On Fri, Jul 30, 2021 at 03:08:56PM -0400, Robert Haas wrote:
>> Just to be clear, I am against this patch. I don't think it's a
>> minimal change for the reported problem, and I think some people will
>> be unhappy about the behavior changes.

> Uh, what do you suggest then?  You wanted the years/months fixed, and
> rounding at spill stop time makes sense, and fixes the problem.

The complained-of bug is that 'X years Y months' isn't always
identical to 'Y months X years'.  I do not believe that this patch
fixes that, though it may obscure the problem for some values of
X and Y.  After a quick look at the code, I am suspicious that
the actual problem is that AdjustFractDays is applied at the wrong
time, before we've collected all the input.  We probably need to
collect up all of the contributing input as floats and then do the
fractional spilling once at the end.

Having said that, I also agree that it's not great that '1.4 years'
is converted to '1 year 4 mons' (1.3... years) rather than
'1 year 5 mons' (1.41666... years).  The latter seems like a clearly
saner translation.  I would really rather that we reject such input
altogether, but if we're going to accept it, we should round not
truncate.

regards, tom lane




Re: Have I found an interval arithmetic bug?

2021-07-30 Thread Bruce Momjian
On Fri, Jul 30, 2021 at 03:47:53PM -0400, Robert Haas wrote:
> On Fri, Jul 30, 2021 at 3:20 PM Bruce Momjian  wrote:
> > Uh, what do you suggest then?  You wanted the years/months fixed, and
> > rounding at spill stop time makes sense, and fixes the problem.
> 
> Hmm, maybe I misunderstood. Are you saying that you think the patch
> will fix cases like interval '-1.7 years 29.4 months' and interval
> '29.4 months -1.7 years' to produce the same answer without changing
> any other cases? I had the impression that you were proposing a bigger

Yes, tests from the original email:

SELECT interval '-1.7 years 29.4 months';
interval

 9 mons 12 days
(1 row)

SELECT interval '29.4 months -1.7 years';
interval

 9 mons 12 days
(1 row)

SELECT interval '-1.7 years' + interval '29.4 months';
?column?

 9 mons 12 days
(1 row)

SELECT interval '29.4 months' + interval '-1.7 years';
?column?

 9 mons 12 days

> change to the rules for converting fractional units to units of lower
> type, particularly because Tom called it an "API change".

The API change is to _round_ units greater than months to integral
month values;  we currently truncate.  Changing the spill behavior has
been rejected.
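
To make the difference concrete (the first result is the current truncating
behavior Tom describes elsewhere in this thread, the second is what the
patch produces by rounding):

    -- current behavior: 0.4 year = 4.8 months, truncated to 4
    SELECT interval '1.4 years';   -- 1 year 4 mons

    -- with the patch: 4.8 months rounded to 5
    SELECT interval '1.4 years';   -- 1 year 5 mons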

> For some reason I can't apply the patch locally.
> 
> [rhaas pgsql]$ patch -p1 < ~/Downloads/interval.diff
> (Stripping trailing CRs from patch.)
> patching file doc/src/sgml/datatype.sgml
> (Stripping trailing CRs from patch.)
> patching file src/backend/utils/adt/datetime.c
> patch:  malformed patch at line 90: @@ -3601,7 +3597,7 @@
> DecodeISO8601Interval(char *str,

Uh, here is the patch again, in case that helps.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.

diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index 453115f942..50a2c8e5f1 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -2840,15 +2840,18 @@ P  years-months-
 
 
 
- In the verbose input format, and in some fields of the more compact
- input formats, field values can have fractional parts; for example
- '1.5 week' or '01:02:03.45'.  Such input is
- converted to the appropriate number of months, days, and seconds
- for storage.  When this would result in a fractional number of
- months or days, the fraction is added to the lower-order fields
- using the conversion factors 1 month = 30 days and 1 day = 24 hours.
- For example, '1.5 month' becomes 1 month and 15 days.
- Only seconds will ever be shown as fractional on output.
+ Field values can have fractional parts:  for example, '1.5
+ weeks' or '01:02:03.45'.  However,
+ because interval internally stores only three integer units (months,
+ days, microseconds), fractional units must be spilled to smaller
+ units.  Fractional parts of units greater than months is rounded to
+ be an integer number of months, e.g. '1.5 years'
+ becomes '1 year 6 mons'.  Fractional parts of
+ weeks and days are computed to be an integer number of days and
+ microseconds, assuming 30 days per month and 24 hours per day, e.g.,
+ '1.75 months' becomes 1 mon 22 days
+ 12:00:00.  Only seconds will ever be shown as fractional
+ on output.
 
 
 
@@ -2892,10 +2895,10 @@ P  years-months-
 
 
  Internally interval values are stored as months, days,
- and seconds. This is done because the number of days in a month
+ and microseconds. This is done because the number of days in a month
  varies, and a day can have 23 or 25 hours if a daylight savings
  time adjustment is involved.  The months and days fields are integers
- while the seconds field can store fractions.  Because intervals are
+ while the microseconds field can store fractional seconds.  Because intervals are
  usually created from constant strings or timestamp subtraction,
  this storage method works well in most cases, but can cause unexpected
  results:
diff --git a/src/backend/utils/adt/datetime.c b/src/backend/utils/adt/datetime.c
index 54ae632de2..cb3fa85892 100644
--- a/src/backend/utils/adt/datetime.c
+++ b/src/backend/utils/adt/datetime.c
@@ -3306,29 +3306,25 @@ DecodeInterval(char **field, int *ftype, int nf, int range,
 
 	case DTK_YEAR:
 		tm->tm_year += val;
-		if (fval != 0)
-			tm->tm_mon += fval * MONTHS_PER_YEAR;
+		tm->tm_mon += rint(fval * MONTHS_PER_YEAR);
 		tmask = DTK_M(YEAR);
 		break;
 
 	case DTK_DECADE:
 		tm->tm_year += val * 10;
-		if (fval != 0)
-			tm->tm_mon += fval * MONTHS_PER_YEAR * 10;
+		tm->tm_mon += rint(fval * MONTHS_PER_YEA

Re: Have I found an interval arithmetic bug?

2021-07-30 Thread Bruce Momjian
On Fri, Jul 30, 2021 at 03:54:42PM -0400, Tom Lane wrote:
> Having said that, I also agree that it's not great that '1.4 years'
> is converted to '1 year 4 mons' (1.3... years) rather than
> '1 year 5 mons' (1.41666... years).  The latter seems like a clearly
> saner translation.  I would really rather that we reject such input
> altogether, but if we're going to accept it, we should round not
> truncate.

My patch returns what you want:

SELECT interval '1.4 years';
   interval
---
 1 year 5 mons

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Background writer and checkpointer in crash recovery

2021-07-30 Thread Andres Freund
Hi,

On 2021-07-30 10:16:44 -0400, Robert Haas wrote:
> 2021-07-30 09:39:43.579 EDT [63702] LOG:  redo starts at 0/14A2F48
> 2021-07-30 09:39:44.129 EDT [63702] LOG:  redo done at 0/15F48230
> system usage: CPU: user: 0.25 s, system: 0.25 s, elapsed: 0.55 s
> 2021-07-30 09:39:44.129 EDT [63702] LOG:  crash recovery complete:
> wrote 36517 buffers (222.9%); dirtied 52985 buffers; read 7 buffers
> 
> Now I really think that information on the number of buffers touched
> and how long it took is way more useful than user and system time.
> Knowing how much user and system time were spent doesn't really tell
> you anything, but a count of buffers touched gives you some meaningful
> idea of how much work recovery did, and whether I/O was slow.

I don't agree with that? If (user+system) << wall then it is very likely
that recovery is IO bound. If system is a large percentage of wall, then
shared buffers is likely too small (or we're replacing the wrong
buffers) because you spend a lot of time copying data in/out of the
kernel page cache. If user is the majority, you're CPU bound.

Without user & system time it's much harder to figure that out - at
least for me.


> In your patch, there's no end-of-recovery checkpoint -- you just
> trigger a checkpoint instead of waiting for it. I think it's probably
> better to make those two cases work the same. The end-of-recovery
> record isn't needed to change the TLI as it is in the promotion case,
> but (1) it seems better to have fewer code paths and (2) it might be
> good for debuggability.

+1

In addition, the end-of-recovery record is also good for e.g. hot standby,
logical decoding, etc, because it's a point where no write transactions
can be in progress...

Greetings,

Andres Freund




Re: archive status ".ready" files may be created too early

2021-07-30 Thread Bossart, Nathan
On 7/30/21, 11:34 AM, "Alvaro Herrera"  wrote:
> Hmm ... I'm not sure we're prepared to backpatch this kind of change.
> It seems a bit too disruptive to how replay works.  I think we should
> be focusing solely on patch 0001 to surgically fix the precise
> bug you see.  Does patch 0002 exist because you think that a system with
> only 0001 will not correctly deal with a crash at the right time?

Yes, that was what I was worried about.  However, I just performed a
variety of tests with just 0001 applied, and I am beginning to suspect
my concerns were unfounded.  With wal_buffers set very high,
synchronous_commit set to off, and a long sleep at the end of
XLogWrite(), I can reliably cause the archive status files to lag far
behind the current open WAL segment.  However, even if I crash at this
time, the .ready files are created when the server restarts (albeit
out of order).  This appears to be due to the call to
XLogArchiveCheckDone() in RemoveOldXlogFiles().  Therefore, we can
likely abandon 0002.

> Now, the reason I'm looking at this patch series is that we're seeing a
> related problem with walsender/walreceiver, which apparently are capable
> of creating a file in the replica that ends up not existing in the
> primary after a crash, for a reason closely related to what you
> describe for WAL archival.  I'm not sure what is going on just yet, so
> I'm not going to try and explain because I'm likely to get it wrong.

I've suspected that this is due to the use of the flushed location for
the send pointer, which AFAICT needn't align with a WAL record
boundary.

/*
 * Streaming the current timeline on a primary.
 *
 * Attempt to send all data that's already been written out and
 * fsync'd to disk.  We cannot go further than what's been written out
 * given the current implementation of WALRead().  And in any case
 * it's unsafe to send WAL that is not securely down to disk on the
 * primary: if the primary subsequently crashes and restarts, standbys
 * must not have applied any WAL that got lost on the primary.
 */
 SendRqstPtr = GetFlushRecPtr();

Nathan



Re: Parallel Full Hash Join

2021-07-30 Thread Melanie Plageman
On Sat, Jul 10, 2021 at 9:13 AM vignesh C  wrote:
>
> On Mon, May 31, 2021 at 10:47 AM Greg Nancarrow  wrote:
> >
> > On Sat, Mar 6, 2021 at 12:31 PM Thomas Munro  wrote:
> > >
> > > On Tue, Mar 2, 2021 at 11:27 PM Thomas Munro  
> > > wrote:
> > > > On Fri, Feb 12, 2021 at 11:02 AM Melanie Plageman
> > > >  wrote:
> > > > > I just attached the diff.
> > > >
> > > > Squashed into one patch for the cfbot to chew on, with a few minor
> > > > adjustments to a few comments.
> > >
> > > I did some more minor tidying of comments and naming.  It's been on my
> > > to-do-list to update some phase names after commit 3048898e, and while
> > > doing that I couldn't resist the opportunity to change DONE to FREE,
> > > which somehow hurts my brain less, and makes much more obvious sense
> > > after the bugfix in CF #3031 that splits DONE into two separate
> > > phases.  It also pairs obviously with ALLOCATE.  I include a copy of
> > > that bugix here too as 0001, because I'll likely commit that first, so
> > > I rebased the stack of patches that way.  0002 includes the renaming I
> > > propose (master only).  Then 0003 is Melanie's patch, using the name
> > > SCAN for the new match bit scan phase.  I've attached an updated
> > > version of my "phase diagram" finger painting, to show how it looks
> > > with these three patches.  "scan*" is new.
> >
> > Patches 0002, 0003 no longer apply to the master branch, seemingly
> > because of subsequent changes to pgstat, so need rebasing.
>
> I am changing the status to "Waiting on Author" as the patch does not
> apply on Head.
>
> Regards,
> Vignesh
>
>

Rebased patches attached. I will change status back to "Ready for Committer"
From 954bc84ca79e217c216c0ed8160853c49c19b609 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Wed, 4 Nov 2020 14:25:33 -0800
Subject: [PATCH v7 3/3] Parallel Hash {Full,Right} Outer Join.

Previously, parallel full and right outer joins were not supported due
to a potential deadlock hazard (see discussion).

For now, sidestep the problem by terminating parallelism for the
unmatched inner tuple scan. The last process to arrive at the barrier
prepares for the unmatched inner tuple scan in HJ_NEED_NEW_OUTER and
transitions to HJ_FILL_INNER, scanning the hash table and emitting
unmatched inner tuples.  Other processes are free to go and work on
other batches, if there are any.

To make parallel and serial hash join more consistent, change the serial
version to scan match bits in tuple chunk order, instead of doing it in
hash table bucket order.

Author: Melanie Plageman 
Reviewed-by: Thomas Munro 
Discussion: https://postgr.es/m/CA%2BhUKG%2BA6ftXPz4oe92%2Bx8Er%2BxpGZqto70-Q_ERwRaSyA%3DafNg%40mail.gmail.com
---
 src/backend/executor/nodeHash.c | 205 ++--
 src/backend/executor/nodeHashjoin.c |  59 +++
 src/backend/optimizer/path/joinpath.c   |  14 +-
 src/include/executor/hashjoin.h |  15 +-
 src/include/executor/nodeHash.h |   3 +
 src/test/regress/expected/join_hash.out |  56 ++-
 src/test/regress/sql/join_hash.sql  |  23 ++-
 7 files changed, 283 insertions(+), 92 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 3ba688e8e0..e7d420ee12 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -517,6 +517,7 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 		hashtable->spaceAllowed * SKEW_HASH_MEM_PERCENT / 100;
 	hashtable->chunks = NULL;
 	hashtable->current_chunk = NULL;
+	hashtable->current_chunk_idx = 0;
 	hashtable->parallel_state = state->parallel_state;
 	hashtable->area = state->ps.state->es_query_dsa;
 	hashtable->batches = NULL;
@@ -2070,16 +2071,72 @@ void
 ExecPrepHashTableForUnmatched(HashJoinState *hjstate)
 {
 	/*--
-	 * During this scan we use the HashJoinState fields as follows:
+	 * During this scan we use the HashJoinTable fields as follows:
 	 *
-	 * hj_CurBucketNo: next regular bucket to scan
-	 * hj_CurSkewBucketNo: next skew bucket (an index into skewBucketNums)
-	 * hj_CurTuple: last tuple returned, or NULL to start next bucket
+	 * current_chunk: current HashMemoryChunk to scan
+	 * current_chunk_idx: index in current HashMemoryChunk
 	 *--
 	 */
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+
 	hjstate->hj_CurBucketNo = 0;
 	hjstate->hj_CurSkewBucketNo = 0;
 	hjstate->hj_CurTuple = NULL;
+	hashtable->current_chunk = hashtable->chunks;
+	hashtable->current_chunk_idx = 0;
+}
+
+/*
+ * ExecParallelPrepHashTableForUnmatched
+ *		set up for a series of ExecParallelScanHashTableForUnmatched calls
+ *		return true if this worker is elected to do the unmatched inner scan
+ */
+bool
+ExecParallelPrepHashTableForUnmatched(HashJoinState *hjstate)
+{
+	/*--
+	 * During this scan we use the ParallelHashJoinBatchAccessor fields for the
+	 * current batch as follows:
+	 *
+	 * current_chunk: current HashMemoryChunk to scan
+	 * current_chun

Re: Use generation context to speed up tuplesorts

2021-07-30 Thread Andres Freund
Hi,

On 2021-07-30 18:42:18 +1200, David Rowley wrote:
> While reviewing and benchmarking 91e9e89dc (Make nodeSort.c use Datum
> sorts for single column sorts), I noticed that we use a separate
> memory context to store tuple data and we just reset when we're done
> if the sort fits in memory, or we dump the tuples to disk in the same
> order they're added and reset the context when it does not.  There is
> a little pfree() work going on via writetup_heap() which I think
> possibly could just be removed to get some additional gains.
> 
> Anyway, this context that stores tuples uses the standard aset.c
> allocator which has the usual power of 2 wastage and additional
> overheads of freelists etc.  I wondered how much faster it would go if
> I set it to use a generation context instead of an aset.c one.
> 
> After running make installcheck to make the tenk1 table, running the
> attached tuplesortbench script, I get this:

> So it seems to save quite a bit of memory getting away from rounding
> up allocations to the next power of 2.
> 
> Performance-wise, there's some pretty good gains. (Results in TPS)

Very nice!


I wonder if there's cases where generation.c would regress performance
over aset.c due to not having an initial / "keeper" block?


> The patch is just a simple 1-liner at the moment.  I likely do need to
> adjust what I'm passing as the blockSize to GenerationContextCreate().
>   Maybe a better number would be something that's calculated from
> work_mem, e.g Min(ALLOCSET_DEFAULT_MAXSIZE, ((Size) work_mem) * 64))
> so that we just allocate at most a 16th of work_mem per chunk, but not
> bigger than 8MB. I don't think changing this will affect the
> performance of the above very much.

I think it's bad that both generation and slab don't have internal
handling of block sizes. Needing to err on the side of too big blocks to
handle large amounts of memory well, just so the contexts don't need to
deal with variably sized blocks isn't a sensible tradeoff.

I don't think it's acceptable to use ALLOCSET_DEFAULT_MAXSIZE or
Min(ALLOCSET_DEFAULT_MAXSIZE, ((Size) work_mem) * 64) for
tuplesort.c. There's plenty cases where we'll just sort a handful of
tuples, and increasing the memory usage of those by a factor of 1024
isn't good. The Min() won't do any good if a larger work_mem is used.

Nor will it be good to use thousands of small allocations for a large
in-memory tuplesort just because we're concerned about the initial
allocation size. Both because of the allocation overhead, but
importantly also because that will make context resets more expensive.

To me this says that we should transplant aset.c's block size growing
into generation.c.


There is at least one path using tuplecontext that reaches code that
could end up freeing memory to a significant enough degree to care about
generation.c effectively not using that memory:
tuplesort_putdatum()->datumCopy()->EOH_flatten_into()
On a quick look I didn't find any expanded record user that frees
nontrivial amounts of memory, but I didn't look all that carefully.


Greetings,

Andres Freund




Re: Have I found an interval arithmetic bug?

2021-07-30 Thread Tom Lane
Bruce Momjian  writes:
> On Fri, Jul 30, 2021 at 03:54:42PM -0400, Tom Lane wrote:
>> Having said that, I also agree that it's not great that '1.4 years'
>> is converted to '1 year 4 mons' (1.3... years) rather than
>> '1 year 5 mons' (1.41666... years).  The latter seems like a clearly
>> saner translation.  I would really rather that we reject such input
>> altogether, but if we're going to accept it, we should round not
>> truncate.

> My patch returns what you want:

Yeah, as far as that point goes, I was replying to Robert's
objection not your patch.

regards, tom lane




Re: Corrected documentation of data type for the logical replication message formats.

2021-07-30 Thread Tom Lane
vignesh C  writes:
[ v6-0001-Included-the-actual-datatype-used-in-logical-repl.patch ]

I see what you want to do here, but the way you did it seems quite
detrimental to the readability of the field descriptions.
Parenthesized interjections should be used sparingly.

I'm inclined to think that the equivalent data type is part of the
field data type specification, and thus that we ought to put it in
the data type part of each entry.  So we'd have something like

<varlistentry>
<term>
Int64 (XLogRecPtr)
</term>
<listitem>
<para>
The final LSN of the transaction.
</para>
</listitem>
</varlistentry>

instead of what you did here.  Parentheses might not be the best
punctuation to use, given the existing convention about parenthesized
specific values, but we could probably settle on some other markup.
Or just ignore the ambiguity.

Another idea is to add the data type info at the ends of items
instead of cramming it into the sentences, thus:

The final LSN of the transaction.  (XLogRecPtr)

I don't find that better personally, but maybe others will
think differently.

regards, tom lane




Re: Use generation context to speed up tuplesorts

2021-07-30 Thread Tomas Vondra

On 7/30/21 10:38 PM, Andres Freund wrote:

Hi,

On 2021-07-30 18:42:18 +1200, David Rowley wrote:

While reviewing and benchmarking 91e9e89dc (Make nodeSort.c use Datum
sorts for single column sorts), I noticed that we use a separate
memory context to store tuple data and we just reset when we're done
if the sort fits in memory, or we dump the tuples to disk in the same
order they're added and reset the context when it does not.  There is
a little pfree() work going on via writetup_heap() which I think
possibly could just be removed to get some additional gains.

Anyway, this context that stores tuples uses the standard aset.c
allocator which has the usual power of 2 wastage and additional
overheads of freelists etc.  I wondered how much faster it would go if
I set it to use a generation context instead of an aset.c one.

After running make installcheck to make the tenk1 table, running the
attached tuplesortbench script, I get this:



So it seems to save quite a bit of memory getting away from rounding
up allocations to the next power of 2.

Performance-wise, there's some pretty good gains. (Results in TPS)


Very nice!



Yes, very nice. I wouldn't have expected such a significant difference, 
particularly in CPU usage. It's pretty interesting that it both reduces 
memory and CPU usage; I'd have guessed it's either one or the other.




I wonder if there's cases where generation.c would regress performance
over aset.c due to not having an initial / "keeper" block?



Not sure. I guess such workload would need to allocate and free a single 
block (so very little memory) very often. I guess that's possible, but 
I'm not aware of a place doing that very often. Although, maybe decoding 
could do that for simple (serial) workload.


I'm not opposed to adding a keeper block to Generation, similarly to 
what was discussed for Slab not too long ago.





The patch is just a simple 1-liner at the moment.  I likely do need to
adjust what I'm passing as the blockSize to GenerationContextCreate().
   Maybe a better number would be something that's calculated from
work_mem, e.g Min(ALLOCSET_DEFAULT_MAXSIZE, ((Size) work_mem) * 64))
so that we just allocate at most a 16th of work_mem per chunk, but not
bigger than 8MB. I don't think changing this will affect the
performance of the above very much.


I think it's bad that both generation and slab don't have internal
handling of block sizes. Needing to err on the side of too big blocks to
handle large amounts of memory well, just so the contexts don't need to
deal with variably sized blocks isn't a sensible tradeoff.



Well, back then it seemed like a sensible trade off to me, but I agree 
it may have negative consequences. I'm not opposed to revisiting this.



I don't think it's acceptable to use ALLOCSET_DEFAULT_MAXSIZE or
Min(ALLOCSET_DEFAULT_MAXSIZE, ((Size) work_mem) * 64) for
tuplesort.c. There's plenty cases where we'll just sort a handful of
tuples, and increasing the memory usage of those by a factor of 1024
isn't good. The Min() won't do any good if a larger work_mem is used.

Nor will it be good to use thousands of small allocations for a large
in-memory tuplesort just because we're concerned about the initial
allocation size. Both because of the allocation overhead, but
importantly also because that will make context resets more expensive.



True.


To me this says that we should transplant aset.c's block size growing
into generation.c.



Yeah, maybe.



There is at least one path using tuplecontext that reaches code that
could end up freeing memory to a significant enough degree to care about
generation.c effectively not using that memory:
tuplesort_putdatum()->datumCopy()->EOH_flatten_into()
On a quick look I didn't find any expanded record user that frees
nontrivial amounts of memory, but I didn't look all that carefully.



Not sure, I'm not familiar with EOH_flatten_into or expanded records. 
But I wonder if there's some sort of metric that we could track in 
Generation and use it to identify "interesting" places.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PATCH] Hooks at XactCommand level

2021-07-30 Thread Andres Freund
Hi,

On 2021-07-30 13:58:51 -0400, Tom Lane wrote:
> Gilles Darold  writes:
> > [ 1-startcommand_xact_callback-v2.diff ]
>
> I've not read this version of the patch, but I see from the cfbot's
> results that it's broken postgres_fdw.  I recall that postgres_fdw
> uses the XactCallback and SubXactCallback mechanisms, so I'm betting
> this means that you've changed the semantics of those callbacks in
> an incompatible way.  That's probably not a great idea.  We could
> fix postgres_fdw, but there are more than likely some external
> modules that would also get broken, and that is supposed to be a
> reasonably stable API.

I think this may partially be an issue with the way that postgres_fdw
uses the callback, rather than with the patch. It disconnects from the server
*regardless* of the XactEvent passed to the callback. That makes it
really hard to extend the callback mechanism to further events...

Now, I'm also *quite* unconvinced that the placement of the
new CallXactStartCommand() in postgres.c is right.


On 2021-07-16 11:48:24 +0200, Gilles Darold wrote:
> Other calls of start_xact_command() are in exec_bind_message(),
> exec_execute_message(), exec_describe_statement_message(),
> exec_describe_portal_message and PostgresMain. In my test they are either
> not called or generates duplicates calls to the callback with
> exec_simple_query() and exec_parse_message().

That seems like an issue with your test then. Prepared statements can be
parsed in one transaction and bind+exec'ed in another. And you even can
execute transaction control statements this way.

IMO this'd need tests somewhere that allow us to verify the hook
placements do something sensible.


It does not seem great to add a bunch of external function calls
into all these routines. For simple queries postgres.c's exec_*
functions show up in profiles - doing yet another function call that
then also needs to look at various memory locations plausibly will show
up. Particularly when used with pipelined queries.


I'm *very* unconvinced it makes sense to implement a feature like this
in an extension / that we should expose API for that purpose. To me the
top-level transaction state is way too tied to our internals for it to
be reasonably dealt with in an extension. And I think an in-core version
would need to tackle the overhead and internal query execution issues
this feature has.

Greetings,

Andres Freund




Re: Replace l337sp34k in comments.

2021-07-30 Thread Gavin Flower

On 30/07/21 8:05 pm, Geoff Winkless wrote:

On Thu, 29 Jul 2021 at 22:46, Gavin Flower
 wrote:

Though in code, possibly it would be better to just use 'up-to-date' in
code for consistency and to make it easier to grep?

If it's causing an issue, perhaps using a less syntactically
problematic synonym like "current" might be better?

:)

Geoff


On thinking further...

The word 'current' means different things in different contexts. If I 
refer to my current O/S it means the one I'm using now, but it may not 
be current in the sense of up-to-date.  The second use of 'current' is 
the meaning you are thinking of, but the first is not. Since people 
reading documented code are focused on understanding technical aspects, 
they may miss this subtlety.


I'm aware that standardisation may meet with some resistance, but being 
consistent might reduce the conceptual impedance when reading the code.  
I'm just trying to reduce the potential for confusion.



Cheers,
Gavin





Re: alter table set TABLE ACCESS METHOD

2021-07-30 Thread Jeff Davis
On Fri, 2021-07-30 at 16:22 +0900, Michael Paquier wrote:
> Looking at the past, it was the intention of 05f3f9c7 to go through
> the hook even if SET TABLESPACE does not move the relation, so you
> are
> right that ALTER TABLE is inconsistent to not do the same for LOGGED,
> UNLOGGED and ACCESS METHOD if all of them do nothing to trigger a
> relation rewrite.
> 
> Now, I am a bit biased about this change and if we actually need it
> for the no-op path.  If we were to do that, I think that we need to
> add in AlteredTableInfo a way to track down if any of those
> subcommands have been used to allow the case of rewrite == 0 to
> launch
> the hook even if these are no-ops.  And I am not sure if that's worth
> the code complication for an edge case.  We definitely should have a
> hook call for the case of rewrite > 0, though.

It sounds like anything we do here should be part of a larger change to
make it consistent. So I'm fine with the patch you posted.

Regards,
Jeff Davis






Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

2021-07-30 Thread Tom Lane
Gilles Darold  writes:
> Le 26/07/2021 à 21:56, Tom Lane a écrit :
>> I'm inclined to just drop the regexp_replace additions.  I don't think
>> that the extra parameters Oracle provides here are especially useful.
>> They're definitely not useful enough to justify creating compatibility
>> hazards for.

> I would not say that being able to replace the Nth occurrence of a
> matching pattern is not useful, but I agree that this is not a common
> case for replacement. Both Oracle [1] and IBM DB2 [2] propose this form
> and I had thought that we would not have compatibility issues because of
> the different data type of the 4th parameter.

Well, here's an example of the potential issues:

regression=# create function rr(text,text,text,text) returns text
regression-# language sql as $$select 'text'$$;
CREATE FUNCTION
regression=# create function rr(text,text,text,int4) returns text
language sql as $$select 'int4'$$;
CREATE FUNCTION
regression=# select rr('a','b','c','d');
  rr  
--
 text
(1 row)

regression=# select rr('a','b','c',42);
  rr  
--
 int4
(1 row)

So far so good, but:

regression=# prepare rr as select rr('a','b','c',$1);
PREPARE
regression=# execute rr(12);  
  rr  
--
 text
(1 row)

So somebody trying to use the 4-parameter Oracle form from, say, JDBC
would get bit if they were sloppy about specifying parameter types.

The one saving grace is that digits aren't valid regexp flags,
so the outcome would be something like

regression=# select regexp_replace('a','b','c','12');
ERROR:  invalid regular expression option: "1"

which'd be less difficult to debug than silent misbehavior.
Conversely, if you thought you were passing flags but it somehow
got interpreted as a start position, that would fail too:

regression=# prepare rri as select rr('a','b','c', $1::int);
PREPARE
regression=# execute rri('gi');
ERROR:  invalid input syntax for type integer: "gi"
LINE 1: execute rri('gi');
^

Still, I bet a lot that we'd see periodic bug reports complaining
that it doesn't work.

> Anyway, maybe we can just
> rename the function even if I would prefer that regexp_replace() be
> extended. For example:
>     regexp_replace(source, pattern, replacement [, flags ]);
>     regexp_substitute(source, pattern, replacement [, position ] [,
> occurrence ] [, flags ]);

Hmm.  Of course the entire selling point of this patch seems to be
bug-compatibility with Oracle, so using different names is largely
defeating the point :-(

Maybe we should just hold our noses and do it.  The point that
you'd get a recognizable failure if the wrong function were chosen
reassures me a little bit.  We've seen a lot of cases where this
sort of ambiguity results in the system just silently doing something
different from what you expected, and I was afraid that that could
happen here.

regards, tom lane




Re: [PATCH] Hooks at XactCommand level

2021-07-30 Thread Tom Lane
Andres Freund  writes:
> On 2021-07-30 13:58:51 -0400, Tom Lane wrote:
>> I've not read this version of the patch, but I see from the cfbot's
>> results that it's broken postgres_fdw.

> I think this may be partly an issue with the way that postgres_fdw
> uses the callback rather than with the patch. It disconnects from the server
> *regardless* of the XactEvent passed to the callback. That makes it
> really hard to extend the callback mechanism to further events...

Perhaps.  Nonetheless, I thought upthread that adding these events
as Xact/SubXactCallback events was the wrong design, and I still
think that.  A new hook would be a more sensible way.

> I'm *very* unconvinced it makes sense to implement a feature like this
> in an extension / that we should expose API for that purpose. To me the
> top-level transaction state is way too tied to our internals for it to
> be reasonably dealt with in an extension.

Yeah, that's the other major problem --- the use-case doesn't seem
very convincing.  I'm not even sold on the goal, let alone on trying
to implement it by hooking into these particular places.  I think
that'll end up being buggy and fragile as well as not very performant.

regards, tom lane




Re: Replication protocol doc fix

2021-07-30 Thread Jeff Davis
On Fri, 2021-07-02 at 08:44 -0400, Robert Haas wrote:
> Right, that seems like a good goal. Thinking about this a little
> more,
> it's only holding off *cancel* interrupts, not *all* interrupts, so
> presumably you can still terminate the backend in this state. That's
> not so bad, and it's not clear how we could do any better. So I
> withdraw my previous complaint about this point.

Further thoughts on this? I don't feel comfortable making this change
without a stronger endorsement.

Regards,
Jeff Davis






Re: ECPG bug fix: DECALRE STATEMENT and DEALLOCATE, DESCRIBE

2021-07-30 Thread Peter Geoghegan
On Thu, Jul 29, 2021 at 12:22 PM Michael Meskes  wrote:
> I just wanted to let you know that I'm well aware of this thread but
> need to get through my backlog before I can comment. Sorry for the
> delay.

The RMT discussed this recently. We are concerned about this issue,
including how it has been handled so far.

If you're unable to resolve the issue (or at least commit time to
resolving the issue) then the RMT will call for the revert of the
original feature patch.
-- 
Peter Geoghegan




Re: archive status ".ready" files may be created too early

2021-07-30 Thread Alvaro Herrera
On 2021-Jul-30, Bossart, Nathan wrote:

> On 7/30/21, 11:34 AM, "Alvaro Herrera"  wrote:
> > Hmm ... I'm not sure we're prepared to backpatch this kind of change.
> > It seems a bit too disruptive to how replay works.  I think we should
> > be focusing solely on patch 0001 to surgically fix the precise
> > bug you see.  Does patch 0002 exist because you think that a system with
> > only 0001 will not correctly deal with a crash at the right time?
> 
> Yes, that was what I was worried about.  However, I just performed a
> variety of tests with just 0001 applied, and I am beginning to suspect
> my concerns were unfounded.  With wal_buffers set very high,
> synchronous_commit set to off, and a long sleep at the end of
> XLogWrite(), I can reliably cause the archive status files to lag far
> behind the current open WAL segment.  However, even if I crash at this
> time, the .ready files are created when the server restarts (albeit
> out of order).  This appears to be due to the call to
> XLogArchiveCheckDone() in RemoveOldXlogFiles().  Therefore, we can
> likely abandon 0002.

That's great to hear.  I'll give 0001 a look again.

> > Now, the reason I'm looking at this patch series is that we're seeing a
> > related problem with walsender/walreceiver, which apparently are capable
> > of creating a file in the replica that ends up not existing in the
> > primary after a crash, for a reason closely related to what you
> > describe for WAL archival.  I'm not sure what is going on just yet, so
> > I'm not going to try and explain because I'm likely to get it wrong.
> 
> I've suspected that this is due to the use of the flushed location for
> the send pointer, which AFAICT needn't align with a WAL record
> boundary.

Yeah, I had gotten as far as the GetFlushRecPtr but haven't tracked down
what happens with a contrecord.


-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: archive status ".ready" files may be created too early

2021-07-30 Thread Bossart, Nathan
On 7/30/21, 3:23 PM, "Alvaro Herrera"  wrote:
> That's great to hear.  I'll give 0001 a look again.

Much appreciated.

Nathan



Re: Split xlog.c

2021-07-30 Thread Andres Freund
Hi,

I think it'd make sense to apply the first few patches now, they seem
uncontroversial and simple enough.


On 2021-07-31 00:33:34 +0300, Heikki Linnakangas wrote:
> From 0cfb852e320bd8fe83c588d25306d5b4c57b9da6 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Mon, 21 Jun 2021 22:14:58 +0300
> Subject: [PATCH 1/7] Don't use O_SYNC or similar when opening signal file to
>  fsync it.

+1

> From 83f00e90bb818ed21bb14580f19f58c4ade87ef7 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Wed, 9 Jun 2021 12:05:53 +0300
> Subject: [PATCH 2/7] Remove unnecessary 'restoredFromArchive' global variable.
> 
> It might've been useful for debugging purposes, but meh. There's
> 'readSource' which does almost the same thing.

+1


> From ec53470c8d271c01b8d2e12b92863501c3a9b4cf Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Mon, 21 Jun 2021 16:12:50 +0300
> Subject: [PATCH 3/7] Extract code to get reason that recovery was stopped to a
>  function.

+1


> +/*
> + * Create a comment for the history file to explain why and where
> + * timeline changed.
> + */
> +static char *
> +getRecoveryStopReason(void)
> +{
> + charreason[200];
> +
> + if (recoveryTarget == RECOVERY_TARGET_XID)
> + snprintf(reason, sizeof(reason),
> +  "%s transaction %u",
> +  recoveryStopAfter ? "after" : "before",
> +  recoveryStopXid);
> + else if (recoveryTarget == RECOVERY_TARGET_TIME)
> + snprintf(reason, sizeof(reason),
> +  "%s %s\n",
> +  recoveryStopAfter ? "after" : "before",
> +  timestamptz_to_str(recoveryStopTime));
> + else if (recoveryTarget == RECOVERY_TARGET_LSN)
> + snprintf(reason, sizeof(reason),
> +  "%s LSN %X/%X\n",
> +  recoveryStopAfter ? "after" : "before",
> +  LSN_FORMAT_ARGS(recoveryStopLSN));
> + else if (recoveryTarget == RECOVERY_TARGET_NAME)
> + snprintf(reason, sizeof(reason),
> +  "at restore point \"%s\"",
> +  recoveryStopName);
> + else if (recoveryTarget == RECOVERY_TARGET_IMMEDIATE)
> + snprintf(reason, sizeof(reason), "reached consistency");
> + else
> + snprintf(reason, sizeof(reason), "no recovery target 
> specified");
> +
> + return pstrdup(reason);
> +}

I guess it would make sense to change this over to a switch at some
point, so we can get warnings if a new type of target is added...
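
For illustration, the switch form could look roughly like this (a sketch
only, reusing the variables from the function above; leaving out the
default case is what makes -Wswitch flag any newly added
RecoveryTargetType member):

switch (recoveryTarget)
{
	case RECOVERY_TARGET_XID:
		snprintf(reason, sizeof(reason),
				 "%s transaction %u",
				 recoveryStopAfter ? "after" : "before",
				 recoveryStopXid);
		break;
	case RECOVERY_TARGET_TIME:
	case RECOVERY_TARGET_LSN:
	case RECOVERY_TARGET_NAME:
	case RECOVERY_TARGET_IMMEDIATE:
		/* same snprintf calls as in the if/else chain above */
		break;
	case RECOVERY_TARGET_UNSET:
		snprintf(reason, sizeof(reason), "no recovery target specified");
		break;
		/* no default: the compiler then warns on unhandled values */
}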


> From 70f688f9576b7939d18321444fd59c51c402ce23 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Mon, 21 Jun 2021 21:25:37 +0300
> Subject: [PATCH 4/7] Move InRecovery and standbyState global vars to
>  xlogutils.c.
> 
> They are used in code that is sometimes called from a redo routine,
> so xlogutils.c seems more appropriate. That's where we have other helper
> functions used by redo routines.

FWIW, with some compilers on some linux distributions there is an efficiency
difference between accessing a variable (or calling a function) defined in
the current translation unit and one defined in a separate TU (the separate
TU goes through the GOT). I don't think it's a problem here, but it's worth
keeping in mind
while moving things around.   We should probably adjust our compiler settings
to address that at some point :(


> From da11050ca890ce0311d9e97d2832a6a61bc43e10 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Fri, 18 Jun 2021 12:15:04 +0300
> Subject: [PATCH 5/7] Move code around in StartupXLOG().
> 
> This is the order that things will happen with the next commit, this
> makes it more explicit. To aid review, I added "BEGIN/END function"
> comments to mark which blocks of code are moved to separate functions in
> in the next commit.

> ---
>  src/backend/access/transam/xlog.c | 605 --
>  1 file changed, 315 insertions(+), 290 deletions(-)
> 
> diff --git a/src/backend/access/transam/xlog.c 
> b/src/backend/access/transam/xlog.c
> index efb3ca273ed..b9d96d6de26 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -882,7 +882,6 @@ static MemoryContext walDebugCxt = NULL;
>  
>  static void readRecoverySignalFile(void);
>  static void validateRecoveryParameters(void);
> -static void exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog);
>  static bool recoveryStopsBefore(XLogReaderState *record);
>  static bool recoveryStopsAfter(XLogReaderState *record);
>  static char *getRecoveryStopReason(void);
> @@ -5592,111 +5591,6 @@ validateRecoveryParameters(void)
>   }
>  }
>  
> -/*
> - * Exit archive-recovery state
> - */
> -static void
> -exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
> -{

I don't really understand the motivation for this part of the change

Re: [PATCH] Hooks at XactCommand level

2021-07-30 Thread Andres Freund
Hi,

On 2021-07-30 17:49:09 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2021-07-30 13:58:51 -0400, Tom Lane wrote:
> >> I've not read this version of the patch, but I see from the cfbot's
> >> results that it's broken postgres_fdw.
> 
> > I think this may be partly an issue with the way that postgres_fdw
> > uses the callback rather than with the patch. It disconnects from the server
> > *regardless* of the XactEvent passed to the callback. That makes it
> > really hard to extend the callback mechanism to further events...
> 
> Perhaps.  Nonetheless, I thought upthread that adding these events
> as Xact/SubXactCallback events was the wrong design, and I still
> think that.  A new hook would be a more sensible way.

I know I've wanted additional events in XactEvent before that'd also be
problematic for pg_fdw, but not make sense as a separate event. E.g. an
event when an xid is assigned.


> > I'm *very* unconvinced it makes sense to implement a feature like this
> > in an extension / that we should expose API for that purpose. To me the
> > top-level transaction state is way too tied to our internals for it to
> > be reasonably dealt with in an extension.
> 
> Yeah, that's the other major problem --- the use-case doesn't seem
> very convincing.  I'm not even sold on the goal, let alone on trying
> to implement it by hooking into these particular places.  I think
> that'll end up being buggy and fragile as well as not very performant.

I'm more favorable than you on the overall goal. Migrations to PG are a
frequent and good thing and as discussed before, lots of PG forks ended
up implementing a version of this. Clearly there's demand.

However, I think a proper implementation would require a substantial
amount of effort. Including things like optimizing the subtransaction
logic so that switching the feature on doesn't lead to xid wraparound
issues. Adding odd hooks doesn't move us towards a real solution imo.

Greetings,

Andres Freund




Re: archive status ".ready" files may be created too early

2021-07-30 Thread Alvaro Herrera
On 2021-Jul-30, Bossart, Nathan wrote:

> Yes, that was what I was worried about.  However, I just performed a
> variety of tests with just 0001 applied, and I am beginning to suspect
> my concerns were unfounded.  With wal_buffers set very high,
> synchronous_commit set to off, and a long sleep at the end of
> XLogWrite(), I can reliably cause the archive status files to lag far
> behind the current open WAL segment.  However, even if I crash at this
> time, the .ready files are created when the server restarts (albeit
> out of order).

I think that creating files out of order might be problematic.  On the
archiver side, pgarch_readyXlog() expects to return the oldest
archivable file; but if we create a newer segment's .ready file first,
it is possible that a directory scan would return that newer file before
the older segment's .ready file appears.

However, the comments in pgarch_readyXlog() aren't super convincing that
processing the files in order is actually a correctness requirement, so
perhaps it doesn't matter all that much.


I noticed that XLogCtl->lastNotifiedSeg is protected by both the
info_lck and ArchNotifyLock.  I think if it's going to be protected by
the lwlock, then we should drop the use of the spinlock.


We set archiver's latch on each XLogArchiveNotify(), but if we're doing
it in a loop such as in NotifySegmentsReadyForArchive() perhaps it is
better to create all the .ready files first and do PgArchWakeup() at the
end.  I'm not convinced that this is useful but let's at least discard
the idea explicitly if not.
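
To be concrete, the shape I have in mind is roughly the following (a
sketch only: "last_notified" and "latest_boundary_seg" are placeholders
for whatever state the patch tracks, and I'm assuming the new boolean
argument of XLogArchiveNotify() is what decides whether the archiver's
latch gets set):

	XLogSegNo	seg;

	for (seg = last_notified + 1; seg <= latest_boundary_seg; seg++)
	{
		char		xlogfname[MAXFNAMELEN];

		XLogFileName(xlogfname, ThisTimeLineID, seg, wal_segment_size);
		XLogArchiveNotify(xlogfname, false);	/* write .ready, skip the latch */
	}
	PgArchWakeup();				/* a single wakeup once all status files exist */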

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
"Always assume the user will do much worse than the stupidest thing
you can imagine."(Julien PUYDT)
>From f9d94c614d6cf640fdaa09785d0817a1d86b9658 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 30 Jul 2021 18:35:52 -0400
Subject: [PATCH v2] Avoid creating archive status ".ready" files too early.

WAL records may span multiple segments, but XLogWrite() does not
wait for the entire record to be written out to disk before
creating archive status files.  Instead, as soon as the last WAL
page of the segment is written, the archive status file will be
created.  If PostgreSQL crashes before it is able to write the rest
of the record, it will end up reusing segments that have already
been marked as ready-for-archival.  However, the archiver process
may have already processed the old version of the segment, so the
wrong version of the segment may be backed-up.  This backed-up
segment will cause operations such as point-in-time restores to
fail.

To fix this, we keep track of records that span across segments and
ensure that segments are only marked ready-for-archival once such
records have been completely written to disk.
---
 src/backend/access/transam/timeline.c|   2 +-
 src/backend/access/transam/xlog.c| 279 ++-
 src/backend/access/transam/xlogarchive.c |  14 +-
 src/backend/replication/walreceiver.c|   6 +-
 src/backend/storage/lmgr/lwlocknames.txt |   1 +
 src/include/access/xlogarchive.h |   4 +-
 src/include/access/xlogdefs.h|   3 +
 7 files changed, 289 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 8d0903c175..acd5c2431d 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -452,7 +452,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	if (XLogArchivingActive())
 	{
 		TLHistoryFileName(histfname, newTLI);
-		XLogArchiveNotify(histfname);
+		XLogArchiveNotify(histfname, true);
 	}
 }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 26fa2b6c8f..700bbccff8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -531,6 +531,13 @@ typedef enum ExclusiveBackupState
  */
 static SessionBackupState sessionBackupState = SESSION_BACKUP_NONE;
 
+/* entries for RecordBoundaryMap, used to mark segments ready for archival */
+typedef struct RecordBoundaryEntry
+{
+	XLogSegNo seg;	/* must be first */
+	XLogRecPtr pos;
+} RecordBoundaryEntry;
+
 /*
  * Shared state data for WAL insertion.
  */
@@ -742,6 +749,12 @@ typedef struct XLogCtlData
 	 */
 	XLogRecPtr	lastFpwDisableRecPtr;
 
+	/*
+	 * The last segment we've marked ready for archival.  Protected by
+	 * ArchNotifyLock.
+	 */
+	XLogSegNo lastNotifiedSeg;
+
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -755,6 +768,12 @@ static WALInsertLockPadded *WALInsertLocks = NULL;
  */
 static ControlFileData *ControlFile = NULL;
 
+/*
+ * Record boundary map, used for marking segments as ready for archival.
+ * Protected by ArchNotifyLock.
+ */
+static HTAB *RecordBoundaryMap = NULL;
+
 /*
  * Calculate the amount of space left on the page after 'endptr'. Beware
  * multiple evaluation!
@@ -983,6 +1002,12 @@ stati

Re: Replication protocol doc fix

2021-07-30 Thread Andres Freund
Hi,

On 2021-06-17 16:37:51 -0700, Jeff Davis wrote:
> In theory, it could break a client that issues Parse+Bind+Execute for a
> CopyIn/CopyBoth command without a Sync, but I'm not sure there are any
> clients that do that, and it's arguable whether the documentation
> permitted that or not anyway.

I'm worried about that breaking things and us only noticing down the
road. This doesn't fix a problem that we are actively hitting, and as
you say it's arguably compliant to do it differently. Potential protocol
incompatibilities are a dangerous area. I think before doing something
like this we ought to at least verify that the most popular native
drivers won't have a problem with the change. Maybe pgjdbc, npgsql, the
popular go ones and rust-postgres?

Greetings,

Andres Freund




Re: Asynchronous and "direct" IO support for PostgreSQL.

2021-07-30 Thread Thomas Munro
On Sat, Jul 31, 2021 at 7:52 AM Andres Freund  wrote:
> On 2021-07-30 15:35:30 -0400, Melanie Plageman wrote:
> > * I think that having TBMIterateResult inside of TBMIterator is not
> >   well-defined C language behavior. In [1], it says
> >
> >   "Structures with flexible array members (or unions who have a
> >   recursive-possibly structure member with flexible array member) cannot
> >   appear as array elements or as members of other structures."
>
> > [1] https://en.cppreference.com/w/c/language/struct
>
> I think it is ok as long as the struct with the flexible array member is
> at the end of the struct it is embedded in. I think even by the letter
> of the standard, but it's as always hard to parse...

That's clearly the de facto situation (I think that was the case on
the most popular compilers long before flexible array members were
even standardised), but I think it might technically still be not
allowed since this change has not yet been accepted AFAICS:

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2083.htm

In any case, we already do it which is why wrasse (Sun Studio
compiler) warns about indkey in pg_index.h.  Curiously, indkey is not
always the final member of the containing struct, depending on
CATALOG_VARLEN...
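
For anyone following along, the pattern in question is roughly this
(hypothetical struct names, not the actual TIDBitmap ones):

#include <stddef.h>
#include <stdlib.h>

typedef struct Inner
{
	int			nitems;
	int			items[];		/* flexible array member; PostgreSQL spells
								 * this FLEXIBLE_ARRAY_MEMBER */
} Inner;

typedef struct Outer
{
	int			other_state;
	Inner		inner;			/* FAM struct embedded as the *last* member;
								 * common compilers accept this, but it is
								 * arguably a constraint violation per C11
								 * 6.7.2.1p3, which is what N2083 proposes
								 * to relax */
} Outer;

/* the caller has to size the trailing array explicitly: */
Outer *
make_outer(int nitems)
{
	Outer	   *o = malloc(offsetof(Outer, inner) +
						   offsetof(Inner, items) +
						   nitems * sizeof(int));

	o->inner.nitems = nitems;
	return o;
}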




Re: param 'txn' not used in function maybe_send_schema()

2021-07-30 Thread Fujii Masao




On 2021/07/30 22:27, houzj.f...@fujitsu.com wrote:

Hi,

When reviewing logical replication related patches, I noticed that the
function parameter ReorderBufferTXN *txn is not used in the function
maybe_send_schema(). Since this is not an external function, I think it
might be better to remove the unused parameter.


+1

It seems commit 464824323e accidentally added the argument even though
it's not used in maybe_send_schema(). I *guess* that it was used in an
early development stage of the patches, but the code changed during review
so that it's no longer used, and maybe we forgot to remove it.

Barring any objection, I will commit the patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Asynchronous and "direct" IO support for PostgreSQL.

2021-07-30 Thread Tom Lane
Thomas Munro  writes:
> In any case, we already do it which is why wrasse (Sun Studio
> compiler) warns about indkey in pg_index.h.  Curiously, indkey is not
> always the final member of the containing struct, depending on
> CATALOG_VARLEN...

Hm?  CATALOG_VARLEN is never to be defined, see genbki.h.

regards, tom lane




Re: Use generation context to speed up tuplesorts

2021-07-30 Thread Tomas Vondra

Hi,

I spent a bit of time hacking on the Generation context, adding the two 
improvements discussed in this thread:


1) internal handling of block sizes, similar to what AllocSet does (it 
pretty much just copies parts of it)


2) keeper block (we keep one empty block instead of freeing it)

3) I've also added allocChunkLimit, which makes it look a bit more like 
AllocSet (instead of using just blockSize/8, which does not work too 
well with dynamic blockSize)
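
Regarding 1), the only caller-visible change is that GenerationContextCreate()
now takes the same three size arguments as AllocSetContextCreate(), so callers
can use the usual macros (this matches what 0001 below does to the existing
callers), e.g.:

	MemoryContext	cxt;

	cxt = GenerationContextCreate(CurrentMemoryContext,
								  "Tuples",
								  ALLOCSET_DEFAULT_MINSIZE,
								  ALLOCSET_DEFAULT_INITSIZE,
								  ALLOCSET_DEFAULT_MAXSIZE);

	/* or, equivalently, with the convenience macro: */
	cxt = GenerationContextCreate(CurrentMemoryContext, "Tuples",
								  ALLOCSET_DEFAULT_SIZES);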


I haven't done any extensive tests on it, but it does pass check-world 
with asserts etc. I haven't touched the comments, those need updating.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>From 6ca0ae674964157c1f2d2ae446cb415c9be509b6 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Fri, 30 Jul 2021 23:53:52 +0200
Subject: [PATCH 1/3] Generation: grow blocks

---
 src/backend/access/gist/gistvacuum.c  |  2 +-
 .../replication/logical/reorderbuffer.c   |  2 +-
 src/backend/utils/mmgr/generation.c   | 53 ++-
 src/include/utils/memutils.h  |  4 +-
 4 files changed, 44 insertions(+), 17 deletions(-)

diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 0663193531..1818ed06fc 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -161,7 +161,7 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	vstate.page_set_context = GenerationContextCreate(CurrentMemoryContext,
 	  "GiST VACUUM page set context",
-	  16 * 1024);
+	  ALLOCSET_DEFAULT_SIZES);
 	oldctx = MemoryContextSwitchTo(vstate.page_set_context);
 	vstate.internal_page_set = intset_create();
 	vstate.empty_leaf_set = intset_create();
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index f99e45f22d..89018dc99d 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -336,7 +336,7 @@ ReorderBufferAllocate(void)
 
 	buffer->tup_context = GenerationContextCreate(new_ctx,
   "Tuples",
-  SLAB_LARGE_BLOCK_SIZE);
+  ALLOCSET_DEFAULT_SIZES);
 
 	hash_ctl.keysize = sizeof(TransactionId);
 	hash_ctl.entrysize = sizeof(ReorderBufferTXNByIdEnt);
diff --git a/src/backend/utils/mmgr/generation.c b/src/backend/utils/mmgr/generation.c
index 584cd614da..771a2525ca 100644
--- a/src/backend/utils/mmgr/generation.c
+++ b/src/backend/utils/mmgr/generation.c
@@ -60,7 +60,9 @@ typedef struct GenerationContext
 	MemoryContextData header;	/* Standard memory-context fields */
 
 	/* Generational context parameters */
-	Size		blockSize;		/* standard block size */
+	Size		initBlockSize;	/* initial block size */
+	Size		maxBlockSize;	/* maximum block size */
+	Size		nextBlockSize;	/* next block size to allocate */
 
 	GenerationBlock *block;		/* current (most recently allocated) block */
 	dlist_head	blocks;			/* list of blocks */
@@ -196,7 +198,9 @@ static const MemoryContextMethods GenerationMethods = {
 MemoryContext
 GenerationContextCreate(MemoryContext parent,
 		const char *name,
-		Size blockSize)
+		Size minContextSize,
+		Size initBlockSize,
+		Size maxBlockSize)
 {
 	GenerationContext *set;
 
@@ -208,16 +212,20 @@ GenerationContextCreate(MemoryContext parent,
 	 "padding calculation in GenerationChunk is wrong");
 
 	/*
-	 * First, validate allocation parameters.  (If we're going to throw an
-	 * error, we should do so before the context is created, not after.)  We
-	 * somewhat arbitrarily enforce a minimum 1K block size, mostly because
-	 * that's what AllocSet does.
+	 * First, validate allocation parameters.  Once these were regular runtime
+	 * test and elog's, but in practice Asserts seem sufficient because nobody
+	 * varies their parameters at runtime.  We somewhat arbitrarily enforce a
+	 * minimum 1K block size.
 	 */
-	if (blockSize != MAXALIGN(blockSize) ||
-		blockSize < 1024 ||
-		!AllocHugeSizeIsValid(blockSize))
-		elog(ERROR, "invalid blockSize for memory context: %zu",
-			 blockSize);
+	Assert(initBlockSize == MAXALIGN(initBlockSize) &&
+		   initBlockSize >= 1024);
+	Assert(maxBlockSize == MAXALIGN(maxBlockSize) &&
+		   maxBlockSize >= initBlockSize &&
+		   AllocHugeSizeIsValid(maxBlockSize)); /* must be safe to double */
+	Assert(minContextSize == 0 ||
+		   (minContextSize == MAXALIGN(minContextSize) &&
+			minContextSize >= 1024 &&
+			minContextSize <= maxBlockSize));
 
 	/*
 	 * Allocate the context header.  Unlike aset.c, we never try to combine
@@ -242,7 +250,9 @@ GenerationContextCreate(MemoryContext parent,
 	 */
 
 	/* Fill in GenerationContext-specific header fields */
-	set->blockSize = blockSize;
+	set->initBlockSize = initBlockSize;
+	set->maxBlockSize = maxBlockSize;
+	set->nextBlockSize = initBlockSize;
 	set->block = NULL;
 	dlist_init(&se

Re: archive status ".ready" files may be created too early

2021-07-30 Thread Alvaro Herrera
On 2021-Jul-30, Alvaro Herrera wrote:

> We set archiver's latch on each XLogArchiveNotify(), but if we're doing
> it in a loop such as in NotifySegmentsReadyForArchive() perhaps it is
> better to create all the .ready files first and do PgArchWakeup() at the
> end.  I'm not convinced that this is useful but let's at least discard
> the idea explicitly if not.

hm, this causes an ABI change so it's not backpatchable.


-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"This is what I like so much about PostgreSQL.  Most of the surprises
are of the "oh wow!  That's cool" Not the "oh shit!" kind.  :)"
Scott Marlowe, http://archives.postgresql.org/pgsql-admin/2008-10/msg00152.php




Re: archive status ".ready" files may be created too early

2021-07-30 Thread Bossart, Nathan
On 7/30/21, 4:52 PM, "Alvaro Herrera"  wrote:
> I think that creating files out of order might be problematic.  On the
> archiver side, pgarch_readyXlog() expects to return the oldest
> archivable file; but if we create a newer segment's .ready file first,
> it is possible that a directory scan would return that newer file before
> the older segment's .ready file appears.
>
> However, the comments in pgarch_readyXlog() aren't super convincing that
> processing the files in order is actually a correctness requirement, so
> perhaps it doesn't matter all that much.

I can't think of a reason it'd be needed from a correctness
perspective.  After a quick scan, I couldn't find any promises about
archival order in the documentation, either.  In any case, it doesn't
look like there's a risk that the archiver will skip files when the
.ready files are created out of order.

> I noticed that XLogCtl->lastNotifiedSeg is protected by both the
> info_lck and ArchNotifyLock.  I think if it's going to be protected by
> the lwlock, then we should drop the use of the spinlock.

That seems reasonable to me.  This means that the lock is acquired at
the end of every XLogWrite(), but the other places that acquire the
lock only do so once per WAL segment.  Plus, the call to
NotifySegmentsReadyForArchive() at the end of every XLogWrite() should
usually only need the lock for a short amount of time to retrieve a
value from shared memory.

> We set archiver's latch on each XLogArchiveNotify(), but if we're doing
> it in a loop such as in NotifySegmentsReadyForArchive() perhaps it is
> better to create all the .ready files first and do PgArchWakeup() at the
> end.  I'm not convinced that this is useful but let's at least discard
> the idea explicitly if not.

I don't have a terribly strong opinion, but I would lean towards
setting the latch for each call to XLogArchiveNotify() so that the
archiver process can get started as soon as a segment is ready.
However, I doubt that holding off until the end of the loop has any
discernible effect in most cases.

Nathan



Re: [HACKERS] logical decoding of two-phase transactions

2021-07-30 Thread Amit Kapila
On Wed, Jul 14, 2021 at 11:52 AM Amit Kapila  wrote:
>
> On Mon, Jul 12, 2021 at 9:14 AM Peter Smith  wrote:
>
> Pushed.
>

As reported by Michael [1], there is one test failure related to this
commit. The failure is as below:

#   Failed test 'transaction is prepared on subscriber'
#   at t/021_twophase.pl line 324.
#  got: '1'
# expected: '2'
# Looks like you failed 1 test of 24.
[12:14:02] t/021_twophase.pl ..
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/24 subtests
[12:14:12] t/022_twophase_cascade.pl .. ok10542 ms ( 0.00
usr  0.00 sys +  2.03 cusr  0.61 csys =  2.64 CPU)
[12:14:31] t/100_bugs.pl .. ok18550 ms ( 0.00
usr  0.00 sys +  3.85 cusr  1.36 csys =  5.21 CPU)
[12:14:31]

I think I know what's going wrong here. The corresponding test is:

# Now do a prepare on publisher and check that it IS replicated
$node_publisher->safe_psql('postgres', "
BEGIN;
INSERT INTO tab_copy VALUES (99);
PREPARE TRANSACTION 'mygid';");

$node_publisher->wait_for_catchup($appname_copy);

# Check that the transaction has been prepared on the subscriber,
there will be 2
# prepared transactions for the 2 subscriptions.
$result = $node_subscriber->safe_psql('postgres', "SELECT count(*)
FROM pg_prepared_xacts;");
is($result, qq(2), 'transaction is prepared on subscriber');

Here, the test is expecting 2 prepared transactions corresponding to
two subscriptions but it waits for just one subscription via
appname_copy. It should wait for the second subscription using
$appname as well.

What do you think?

[1] - https://www.postgresql.org/message-id/YQP02%2B5yLCIgmdJY%40paquier.xyz

-- 
With Regards,
Amit Kapila.




Re: [HACKERS] logical decoding of two-phase transactions

2021-07-30 Thread Ajin Cherian
On Sat, Jul 31, 2021 at 2:39 PM Amit Kapila  wrote:

> Here, the test is expecting 2 prepared transactions corresponding to
> two subscriptions but it waits for just one subscription via
> appname_copy. It should wait for the second subscription using
> $appname as well.
>
> What do you think?

I agree with this analysis. The test needs to wait for both
subscriptions to catch up.
Attached is a patch that addresses this issue.

regards,
Ajin Cherian
Fujitsu Australia


v1-0001-Fix-possible-failure-in-021_twophase-tap-test.patch
Description: Binary data