Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2022-04-02 Thread Julien Rouhaud
On Thu, Mar 31, 2022 at 01:06:10PM +0300, Andrei Zubkov wrote:
>
> On Wed, 2022-03-30 at 17:31 +0800, Julien Rouhaud wrote:
> > Feature wise, I'm happy with the patch.  I just have a few comments.
> >
> > Tests:
> >
> > - it's missing some test in sql/oldextversions.sql to validate that the
> > code
> >   works with the extension in version 1.9
>
> Yes, I've just added some tests there, but it seems they are not quite
> sufficient. Maybe we should try to do some queries against the views and
> functions in old versions? At least when a new C function version
> appears...

I'm not sure if that's really helpful.  If you have new C functions and an old
SQL version, you won't be able to reach them anyway.  Similarly, if you have
the new SQL but the old .so (which we can't test), it will just fail as the
symbol doesn't exist.  The real problem that has to be explicitly handled by
the C code is different signatures for C functions.
>
> During tests developing I've noted that current test of
> pg_stat_statements_info view actually tests only view access. However
> we can test at least functionality of stats_reset field like this:
>
> SELECT now() AS ref_ts \gset
> SELECT dealloc, stats_reset >= :'ref_ts' FROM pg_stat_statements_info;
> SELECT pg_stat_statements_reset();
> SELECT dealloc, stats_reset >= :'ref_ts' FROM pg_stat_statements_info;
>
> Does it seems reasonable?

It looks reasonable, especially if the patch adds a new mode for the reset
function.

> > - the last test removed the minmax_plan_zero field, why?
>
> My thought was as follows... Re-execution of the same query will
> definitely cause execution. However, most likely it wouldn't be
> planned, but if it were (maybe this is possible, or maybe it will be
> possible in the future in some cases) the test shouldn't fail. Checking
> only execution stats seems enough to me - in most cases we can't
> check planning stats with such a test anyway.
> What do you think about it?

Ah I see.  I guess we could set plan_cache_mode to force_generic_plan to make
sure we go through planning.  But otherwise just adding a comment saying that
the test has to be compatible with different plan caching approaches would be
fine with me.

Thanks for the work on merging the functions!  I will reply on the other parts
of the thread where some discussion started.




Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2022-04-02 Thread Julien Rouhaud
On Fri, Apr 01, 2022 at 10:47:02PM +0300, Andrei Zubkov wrote:
>
> On Fri, 2022-04-01 at 11:38 -0400, Greg Stark wrote:
> > [13:19:51.544] pg_stat_statements.c: In function ‘entry_reset’:
> > [13:19:51.544] pg_stat_statements.c:2598:32: error:
> > ‘minmax_stats_reset’ may be used uninitialized in this function
> > [-Werror=maybe-uninitialized]
> > [13:19:51.544] 2598 | entry->minmax_stats_since = minmax_stats_reset;
> > [13:19:51.544] | ~~^~~~
> >
>
> I was afraid that such a warning could appear..
>
> On Sat, 2022-04-02 at 00:13 +0800, Julien Rouhaud wrote:
> > I guess that pg_stat_statement_reset() is so
> > expensive that an extra gettimeofday() wouldn't change much. 
>
> Agreed
>
> > Otherwise
> > initializing to NULL should be enough.
>
> Julien, I would prefer an extra GetCurrentTimestamp(). So, I've opted
> to use the common unconditional
>
> stats_reset = GetCurrentTimestamp();
>
> for the entire entry_reset function, for the following reasons:
>
> 1. It will be uniform for stats_reset and minmax_stats_reset
> 2. As you mentioned, it wouldn't change much
> 3. The most common way to use this function is to reset all statements,
> meaning that GetCurrentTimestamp() will be called anyway to update the
> value of the stats_reset field in the pg_stat_statements_info view
> 4. Actually I would like the pg_stat_statements_reset function to be
> able to return the value of stats_reset as its result. This could give
> sampling solutions the ability to check whether the last reset (of
> any type) was performed by this solution or some other reset was
> performed by someone else. It seems valuable to me, but it changes the
> result type of the pg_stat_statements_reset() function, so I don't know
> if we can do that right now.

I'm fine with always getting the current timestamp when calling the function.

I'm not sure about returning the ts.  If you need it you could call SELECT
now() FROM pg_stat_statements_reset() (or clock_timestamp()).  It won't be
entirely accurate but since the function will have an exclusive lock during the
whole execution that shouldn't be a problem.  Now you're already adding a new
version of the C function so I guess that it wouldn't require any additional
effort so why not.
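
For illustration, here is a minimal psql sketch of that workaround; it assumes
the stats_since column added by this patch and is not taken from the patch's
own tests:

    -- take the timestamp in the same statement as the reset; the exclusive
    -- lock held during the reset keeps the two close enough in practice
    SELECT now() AS reset_ts FROM pg_stat_statements_reset() \gset
    -- later, check which entries were (re)created after that reset
    SELECT queryid, stats_since >= :'reset_ts' AS newer_than_reset
      FROM pg_stat_statements;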




Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2022-04-02 Thread Julien Rouhaud
On Fri, Apr 01, 2022 at 01:01:53PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2022-04-01 22:47:02 +0300, Andrei Zubkov wrote:
> > +	entry = (pgssEntry *) hash_search(pgss_hash, &key, HASH_FIND, NULL);
> > +
> > +	if (entry) {
> > +		/* Found */
> > +		if (minmax_only) {
> > +			/* When requested reset only min/max statistics of an entry */
> > +			entry_counters = &entry->counters;
> > +			for (int kind = 0; kind < PGSS_NUMKIND; kind++)
> > +			{
> > +				entry_counters->max_time[kind] = 0;
> > +				entry_counters->min_time[kind] = 0;
> > +			}
> > +			entry->minmax_stats_since = stats_reset;
> > +		}
> > +		else
> > +		{
> > +			/* Remove the key otherwise  */
> > +			hash_search(pgss_hash, &entry->key, HASH_REMOVE, NULL);
> > +			num_remove++;
> > +		}
> > +	}
> 
> It seems decidedly not great to have four copies of this code. It was already
> not great before, but this patch makes the duplicated section go from four
> lines to 20 or so.

+1




Re: Skipping logical replication transactions on subscriber side

2022-04-02 Thread Masahiko Sawada
On Sat, Apr 2, 2022 at 1:08 PM Amit Kapila  wrote:
>
> On Sat, Apr 2, 2022 at 7:29 AM Noah Misch  wrote:
> >
> > On Sat, Apr 02, 2022 at 06:49:20AM +0530, Amit Kapila wrote:
> >
> > After applying datum_to_lsn_skiplsn_1.patch, I get another failure.  Logs
> > attached.
> >
>
The failure is for the same reason. I noticed that even when the skiplsn
value should be 0/0, it is some invalid value, see: "LOG:  not started
skipping changes: my_skiplsn 0/B0706F72 finish_lsn 0/14EB7D8". Here,
my_skiplsn should be 0/0 instead of 0/B0706F72. Now, I am not sure why
4 bytes of the LSN are correct while the other 4 bytes have some random
value.

It seems that 0/B0706F72 is not a random value. Two subscriber logs
show the same value. Since 0x70 = 'p', 0x6F = 'o', and 0x72 = 'r', it
might show the next field in the pg_subscription catalog, i.e.,
subconninfo. The subscription is created by "CREATE SUBSCRIPTION sub
CONNECTION 'port=57851 host=/tmp/6u2vRwQYik dbname=postgres'
PUBLICATION pub WITH (disable_on_error = true, streaming = on,
two_phase = on)".
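
For reference, those byte values decode exactly that way (a quick check, not
part of the original report):

    SELECT chr(x'70'::int), chr(x'6f'::int), chr(x'72'::int);
    -- returns p, o, r: the start of the "port=..." connection string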

Given subscription.sql passes, something is wrong when we read the
subskiplsn value with code like "sub->skiplsn = subform->subskiplsn;".

Is it possible to run the test again with the attached patch?

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


add_logs_v2.patch
Description: Binary data


Re: Skipping logical replication transactions on subscriber side

2022-04-02 Thread Noah Misch
On Sat, Apr 02, 2022 at 04:33:44PM +0900, Masahiko Sawada wrote:
> It seems that 0/B0706F72 is not a random value. Two subscriber logs
> show the same value. Since 0x70 = 'p', 0x6F = 'o', and 0x72 = 'r', it
> might show the next field in the pg_subscription catalog, i.e.,
> subconninfo. The subscription is created by "CREATE SUBSCRIPTION sub
> CONNECTION 'port=57851 host=/tmp/6u2vRwQYik dbname=postgres'
> PUBLICATION pub WITH (disable_on_error = true, streaming = on,
> two_phase = on)".
> 
> Given subscription.sql passes, something is wrong when we read the
> subskiplsn value by like "sub->skiplsn = subform->subskiplsn;".

That's a good clue.  We've never made pg_type.typalign able to represent
alignment as it works on AIX.  A uint64 like pg_lsn has 8-byte alignment, so
the C struct follows from that.  At the typalign level, we have only these:

#define  TYPALIGN_CHAR  'c' /* char alignment (i.e. unaligned) */
#define  TYPALIGN_SHORT 's' /* short alignment (typically 2 bytes) */
#define  TYPALIGN_INT   'i' /* int alignment (typically 4 bytes) */
#define  TYPALIGN_DOUBLE 'd' /* double alignment (often 8 bytes) */

On AIX, they are:

#define ALIGNOF_DOUBLE 4 
#define ALIGNOF_INT 4
#define ALIGNOF_LONG 8   
/* #undef ALIGNOF_LONG_LONG_INT */
/* #undef ALIGNOF_PG_INT128_TYPE */
#define ALIGNOF_SHORT 2  

uint64 and pg_lsn use TYPALIGN_DOUBLE.  For AIX, they really need a typalign
corresponding to ALIGNOF_LONG.  Hence, the C struct layout doesn't match the
tuple layout.  Columns potentially affected:

[local] test=*# select attrelid::regclass, attname from pg_attribute a join 
pg_class c on c.oid = attrelid where attalign = 'd' and relkind = 'r' and 
attnotnull and attlen <> -1;
    attrelid     │   attname
─────────────────┼──────────────
 pg_sequence     │ seqstart
 pg_sequence     │ seqincrement
 pg_sequence     │ seqmax
 pg_sequence     │ seqmin
 pg_sequence     │ seqcache
 pg_subscription │ subskiplsn
(6 rows)

The pg_sequence fields evade trouble, because there's exactly eight bytes (two
oids) before them.
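
To make the mismatch concrete, here is a small standalone illustration (a
sketch with invented field names, not PostgreSQL source; only the alignment
arithmetic matters):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* A uint64 column following three 4-byte columns. */
    typedef struct ExampleRow
    {
        uint32_t col1;          /* stand-ins for leading oid columns */
        uint32_t col2;
        uint32_t col3;
        uint64_t big;           /* stand-in for an lsn-like column */
    } ExampleRow;

    int
    main(void)
    {
        /*
         * With ALIGNOF_LONG == 8 the compiler pads 'big' to offset 16, but a
         * tuple laid out with 4-byte "double" alignment (as on AIX) stores
         * the datum at offset 12, so reading it through the struct picks up
         * four bytes of the following column -- the 0/B0706F72 symptom.
         */
        printf("struct offset of big: %zu\n", offsetof(ExampleRow, big));
        return 0;
    }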


Some options:
- Move subskiplsn after subdbid, so it's always aligned anyway.  I've
  confirmed that this lets the test pass, in 44s.
- Move subskiplsn to the CATALOG_VARLEN section, despite its fixed length.
- Introduce a new typalign value suitable for uint64.  This is more intrusive,
  but it's more future-proof.  Looking beyond catalog columns, it might
  improve performance by avoiding unaligned reads.

> Is it possible to run the test again with the attached patch?

Logs attached.  The test "passed", though it printed "poll_query_until timed
out" three times and took awhile.


log-subscription-20220401c.tar.xz
Description: Binary data


Re: shared-memory based stats collector - v66

2022-04-02 Thread Andres Freund
Hi,

On 2022-03-25 17:24:18 +0900, Kyotaro Horiguchi wrote:
> > AFIXME: Isn't PGSTAT_MIN_INTERVAL way too long? What is the justification
> > for increasing it?
> 
> It is 1000ms in the comment just above but actually 1ms. The
> number came from a discussion that if we have 1000 clients and each
> backend writes stats once per 0.5 seconds, in total we flush pending
> data to the shared area 2000 times per second, which is too frequent. I
> raised it to 5000ms, then 1ms.  So the expected maximum flush
> frequency is reduced to 100 times per second.  Of course this is
> assuming the worst case, and the 1ms is apparently too long for the
> average cases.
> 
> The current implementation of pgstat postpones flushing if a lock
> collision happens, postponing by at most 60s.  This is a kind of
> auto-averaging mechanism.  It might be enough and we can reduce the
> PGSTAT_MIN_INTERVAL to 500ms or so.

I just noticed that the code doesn't appear to actually work like that right
now. Whenever the timeout is reached, pgstat_report_stat() is called with
force = true.

And even if the backend is busy running queries, once there's contention, the
next invocation of pgstat_report_stat() will return the timeout relative to
pending_since, which will then trigger a force report via a very short timeout
soon.

It might actually make sense to only ever return PGSTAT_RETRY_MIN_INTERVAL
(with a slightly different name) from pgstat_report_stat() when blocked
(limiting the max reporting delay for an idle connection) and to continue
calling pgstat_report_stat(force = true).  But to only trigger force
"internally" in pgstat_report_stat() when PGSTAT_MAX_INTERVAL is reached.

I think that'd mean we'd report after max PGSTAT_RETRY_MIN_INTERVAL in an idle
connection, and try reporting every PGSTAT_RETRY_MIN_INTERVAL (increasing up
to PGSTAT_MAX_INTERVAL when blocked) on busy connections.

Makes sense?


I think we need to do something with the pgstat_report_stat() calls outside of
postgres.c. Otherwise there's nothing limiting their reporting delay, because
they don't have the timeout logic postgres.c has.  None of them is ever hot
enough to be problematic, so I think we should just make them pass force=true?

Greetings,

Andres Freund




Re: shared-memory based stats collector - v66

2022-04-02 Thread Andres Freund
Hi,

On 2022-03-25 17:24:18 +0900, Kyotaro Horiguchi wrote:
> > * AFIXME: Should all the stats drop code be moved into pgstat_drop.c?
> 
> Or pgstat_xact.c?

I wasn't initially happy with that suggestion, but after running with it, it
looks pretty good.

I also moved a fair bit of code into pgstat_shm.c, which to me improved code
navigation a lot. I'm wondering about splitting it further even, into
pgstat_shm.c and pgstat_entry.c.

What do you think?

Greetings,

Andres Freund




Re: A qsort template

2022-04-02 Thread John Naylor
On Fri, Apr 1, 2022 at 4:43 AM Thomas Munro  wrote:
>
> On Thu, Mar 31, 2022 at 11:09 PM John Naylor
>  wrote:
> > In a couple days I'm going to commit the v3 patch "accelerate tuple
> > sorting for common types" as-is after giving it one more look, barring
> > objections.

Pushed.

> Hi John,
>
> Thanks so much for all the work you've done here!  I feel bad that I
> lobbed so many experimental patches in here and then ran away due to
> lack of cycles.  That particular patch (the one cfbot has been chewing
> on all this time) does indeed seem committable, despite the
> deficiencies/opportunities listed in comments.  It's nice to reduce
> code duplication, it gives the right answers, and it goes faster.

Thanks for chiming in! It gives me more confidence that there wasn't
anything amiss that may have gone unnoticed. And no worries -- my own
review efforts here have been sporadic. ;-)

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: A qsort template

2022-04-02 Thread John Naylor
I wrote:

> I started towards incorporating the change in insertion sort threshold
> (part of 0010), but that caused regression test failures, so that will
> have to wait for a bit of analysis and retesting. (My earlier tests
> were done in a separate module.)

The failures seem to be where the sort order is only partially specified,
e.g. ORDER BY col_a, where that column has duplicates and the other
columns differ. Insertion sort is stable IIRC, so moving the threshold
caused different orders in these cases. Some cases can be conveniently
fixed with additional columns in the ORDER BY clause. I'll go through
the failures and see how much can be cleaned up as a preparatory
refactoring.
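
For example (a sketch; the table and the second column are invented for
illustration):

    -- ambiguous: rows sharing a col_a value may come out in either order
    SELECT col_a, col_b FROM tab ORDER BY col_a;

    -- deterministic: the extra column breaks the ties
    SELECT col_a, col_b FROM tab ORDER BY col_a, col_b;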

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: Skipping logical replication transactions on subscriber side

2022-04-02 Thread Amit Kapila
On Sat, Apr 2, 2022 at 1:43 PM Noah Misch  wrote:
>
> On Sat, Apr 02, 2022 at 04:33:44PM +0900, Masahiko Sawada wrote:
> > It seems that 0/B0706F72 is not a random value. Two subscriber logs
> > show the same value. Since 0x70 = 'p', 0x6F = 'o', and 0x72 = 'r', it
> > might show the next field in the pg_subscription catalog, i.e.,
> > subconninfo. The subscription is created by "CREATE SUBSCRIPTION sub
> > CONNECTION 'port=57851 host=/tmp/6u2vRwQYik dbname=postgres'
> > PUBLICATION pub WITH (disable_on_error = true, streaming = on,
> > two_phase = on)".
> >
> > Given subscription.sql passes, something is wrong when we read the
> > subskiplsn value by like "sub->skiplsn = subform->subskiplsn;".
>
> That's a good clue.  We've never made pg_type.typalign able to represent
> alignment as it works on AIX.  A uint64 like pg_lsn has 8-byte alignment, so
> the C struct follows from that.  At the typalign level, we have only these:
>
> #define  TYPALIGN_CHAR  'c' /* char alignment (i.e. 
> unaligned) */
> #define  TYPALIGN_SHORT 's' /* short alignment (typically 2 
> bytes) */
> #define  TYPALIGN_INT   'i' /* int alignment (typically 4 
> bytes) */
> #define  TYPALIGN_DOUBLE'd' /* double alignment (often 8 
> bytes) */
>
> On AIX, they are:
>
> #define ALIGNOF_DOUBLE 4
> #define ALIGNOF_INT 4
> #define ALIGNOF_LONG 8
> /* #undef ALIGNOF_LONG_LONG_INT */
> /* #undef ALIGNOF_PG_INT128_TYPE */
> #define ALIGNOF_SHORT 2
>
> uint64 and pg_lsn use TYPALIGN_DOUBLE.  For AIX, they really need a typalign
> corresponding to ALIGNOF_LONG.  Hence, the C struct layout doesn't match the
> tuple layout.  Columns potentially affected:
>
> [local] test=*# select attrelid::regclass, attname from pg_attribute a join 
> pg_class c on c.oid = attrelid where attalign = 'd' and relkind = 'r' and 
> attnotnull and attlen <> -1;
> attrelid │   attname
> ─┼──
>  pg_sequence │ seqstart
>  pg_sequence │ seqincrement
>  pg_sequence │ seqmax
>  pg_sequence │ seqmin
>  pg_sequence │ seqcache
>  pg_subscription │ subskiplsn
> (6 rows)
>
> The pg_sequence fields evade trouble, because there's exactly eight bytes (two
> oids) before them.
>
>
> Some options:
> - Move subskiplsn after subdbid, so it's always aligned anyway.  I've
>   confirmed that this lets the test pass, in 44s.
> - Move subskiplsn to the CATALOG_VARLEN section, despite its fixed length.
>

+1 to any one of the above. I mildly prefer the first option as that
will allow us to access the value directly instead of going via
SysCacheGetAttr but I am fine either way.

> - Introduce a new typalign value suitable for uint64.  This is more intrusive,
>   but it's more future-proof.  Looking beyond catalog columns, it might
>   improve performance by avoiding unaligned reads.
>
> > Is it possible to run the test again with the attached patch?
>
> Logs attached.  The test "passed", though it printed "poll_query_until timed
> out" three times and took awhile.

Thanks for helping in figuring out the problem.

-- 
With Regards,
Amit Kapila.




Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2022-04-02 Thread Andrei Zubkov
Hi,

On Fri, 2022-04-01 at 13:01 -0700, Andres Freund wrote:
> It seems decidedly not great to have four copies of this code. It was
> already
> not great before, but this patch makes the duplicated section go from
> four
> lines to 20 or so.

Agreed. I've created the single_entry_reset() function wrapping this
code. I wonder if it should be declared inline to speed things up a little.

On Sat, 2022-04-02 at 15:10 +0800, Julien Rouhaud wrote:
> > However
> > we can test at least functionality of stats_reset field like this:
> > 
> > SELECT now() AS ref_ts \gset
> > SELECT dealloc, stats_reset >= :'ref_ts' FROM
> > pg_stat_statements_info;
> > SELECT pg_stat_statements_reset();
> > SELECT dealloc, stats_reset >= :'ref_ts' FROM
> > pg_stat_statements_info;
> > 
> > Does it seems reasonable?
> 
> It looks reasonable, especially if the patch adds a new mode for the
> reset
> function.

I've implemented this test.

> > Checking
> > of only execution stats seems enough to me - in most cases we can't
> > check planning stats with such test anyway.
> > What do you think about it?
> 
> Ah I see.  I guess we could set plan_cache_mode to force_generic_plan
> to make
> sure we go though planning.  But otherwise just adding a comment
> saying that
> the test has to be compatible with different plan caching approach
> would be
> fine with me.

Setting plan_cache_mode seems a little bit excessive to me. And maybe in the
future some other plan caching strategies will be implemented with
corresponding settings... So I've just left a comment there.

On Sat, 2022-04-02 at 15:21 +0800, Julien Rouhaud wrote:
> I'm not sure about returning the ts.  If you need it you could call
> SELECT
> now() FROM pg_stat_statements_reset() (or clock_timestamp()).  It
> won't be
> entirely accurate but since the function will have an exclusive lock
> during the
> whole execution that shouldn't be a problem.  Now you're already
> adding a new
> version of the C function so I guess that it wouldn't require any
> additional
> effort so why not.

I think that if we can do it in an accurate way and there are no obvious
side effects, why not try it...
Changing the result of the pg_stat_statements_reset function caused a
considerable tests update. Also, I'm not sure that my description of
this feature in the docs is blameless..

After all, I'm a little bit in doubt about this feature, so I'm ready
to roll it back.

v9 attached
--
regards, Andrei
From 78f3e2e7bda2683b3aeb30b28fbbe60ed781db1d Mon Sep 17 00:00:00 2001
From: Andrei Zubkov 
Date: Sat, 2 Apr 2022 12:30:11 +0300
Subject: [PATCH] pg_stat_statements: Track statement entry timestamp

This patch adds a stats_since column to the pg_stat_statements view. This
column is populated with the current timestamp when a new statement is added
to the pg_stat_statements hashtable. It provides clean information about the
statistics collection time interval for each statement. Besides that, it can
be used by sampling solutions to detect situations when a statement was
evicted and returned back between samples.
Such a sampling solution could derive any pg_stat_statements statistic value
for an interval between two samples, with the exception of all min/max
statistics. To address this issue, this patch adds the ability to reset
min/max statistics independently of a statement reset using the new
minmax_only parameter of the pg_stat_statements_reset(userid oid, dbid oid,
queryid bigint, minmax_only boolean) function.
The timestamp of such a reset is stored in the minmax_stats_since field for
each statement. The pg_stat_statements_reset() function now returns this
timestamp as a result.

Discussion:
https://www.postgresql.org/message-id/flat/72e80e7b160a6eb189df9ef6f068cce3765d37f8.camel%40moonset.ru
---
 contrib/pg_stat_statements/Makefile   |   3 +-
 .../expected/oldextversions.out   |  61 +++
 .../expected/pg_stat_statements.out   | 361 +-
 .../pg_stat_statements--1.9--1.10.sql | 108 ++
 .../pg_stat_statements/pg_stat_statements.c   | 164 ++--
 .../pg_stat_statements.control|   2 +-
 .../pg_stat_statements/sql/oldextversions.sql |   8 +
 .../sql/pg_stat_statements.sql| 149 +++-
 doc/src/sgml/pgstatstatements.sgml|  58 ++-
 9 files changed, 756 insertions(+), 158 deletions(-)
 create mode 100644 contrib/pg_stat_statements/pg_stat_statements--1.9--1.10.sql

diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index 7fabd96f38d..edc40c8bbfb 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -6,7 +6,8 @@ OBJS = \
 	pg_stat_statements.o
 
 EXTENSION = pg_stat_statements
-DATA = pg_stat_statements--1.4.sql pg_stat_statements--1.8--1.9.sql \
+DATA = pg_stat_statements--1.4.sql \
+	pg_stat_statements--1.9--1.10.sql pg_stat_statements--1.8--1.9.sql \
 	pg_stat_statements--1.7--1.8.sql pg_stat_statements--1.6--1.7.sql \
 	pg_stat_statements--1.5--1.6.sql pg_stat_statements--1.4--1.5.sql \
 	

Re: A qsort template

2022-04-02 Thread Thomas Munro
On Sat, Apr 2, 2022 at 9:38 PM John Naylor  wrote:
> On Fri, Apr 1, 2022 at 4:43 AM Thomas Munro  wrote:
> > On Thu, Mar 31, 2022 at 11:09 PM John Naylor
> >  wrote:
> > > In a couple days I'm going to commit the v3 patch "accelerate tuple
> > > sorting for common types" as-is after giving it one more look, barring
> > > objections.
>
> Pushed.

It looks like UBsan sees a problem, per BF animal kestrel:

/mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/utils/sort/tuplesort.c:722:51:
runtime error: load of value 96, which is not a valid value for type
'bool'

#5  0x00eb65d4 in qsort_tuple_int32_compare (a=0x4292ce0,
b=0x4292cf8, state=0x4280130) at
/mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/utils/sort/tuplesort.c:722
#6  qsort_tuple_int32 (data=, n=133,
arg=arg@entry=0x4280130) at
/mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/include/lib/sort_template.h:313
#7  0x00eaf747 in tuplesort_sort_memtuples
(state=state@entry=0x4280130) at
/mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/utils/sort/tuplesort.c:3613
#8  0x00eaedcb in tuplesort_performsort
(state=state@entry=0x4280130) at
/mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/utils/sort/tuplesort.c:2154
#9  0x00573d60 in heapam_relation_copy_for_cluster
(OldHeap=, NewHeap=, OldIndex=, use_sort=, OldestXmin=11681,
xid_cutoff=, multi_cutoff=0x7ffecb0cfa70,
num_tuples=0x7ffecb0cfa38, tups_vacuumed=0x7ffecb0cfa20,
tups_recently_dead=0x7ffecb0cfa28) at
/mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/access/heap/heapam_handler.c:955

Reproduced locally, using the same few lines from the cluster.sql
test.  I'll try to dig more tomorrow...




Re: logical decoding and replication of sequences

2022-04-02 Thread Amit Kapila
On Fri, Apr 1, 2022 at 8:32 PM Tomas Vondra
 wrote:
>
> On 3/28/22 07:29, Amit Kapila wrote:
> > I thought about changing snapshot dealing of
> > non-transactional sequence changes similar to transactional ones but
> > that also won't work because it is only at commit we decide whether we
> > can send the changes.
> >
> I wonder if there's some earlier LSN (similar to the consistent point)
> which might be useful for this.
>
> Or maybe we should queue even the non-transactional changes, not
> per-transaction but in a global list, and then at each commit either
> discard inspect them (at that point we know the lowest LSN for all
> transactions and the consistent point). Seems complex, though.
>

I couldn't follow '..discard inspect them ..'. Do you mean we inspect
them and discard whichever are not required? It seems here we are
talking about a new global ReorderBufferGlobal instead of
ReorderBufferTXN to collect these changes, but we don't need only the
consistent point LSN: we send the changes only if the commit of the
containing transaction is after the consistent point LSN, so we need
some transaction information as well. I think it could bring new challenges.

> > For the transactional case, as we are considering the create sequence
> > operation as transactional, we would unnecessarily queue them even
> > though that is not required. Basically, they don't need to be
> > considered transactional and we can simply ignore such messages like
> > other DDLs. But for that probably we need to distinguish Alter/Create
> > case which may or may not be straightforward. Now, queuing them is
> > probably harmless unless it causes the transaction to spill/stream.
> >
>
> I'm not sure I follow. Why would we queue them unnecessarily?
>
> Also, there's the bug with decoding changes in transactions that create
> the sequence and add it to a publication. I think the agreement was that
> this behavior was incorrect, we should not decode changes until the
> subscription is refreshed. Doesn't that mean can't be any CREATE case,
> just ALTER?
>

Yeah, but how will we distinguish them? Aren't they using the same
kind of WAL record?

-- 
With Regards,
Amit Kapila.




Re: logical decoding and replication of sequences

2022-04-02 Thread Amit Kapila
On Sat, Apr 2, 2022 at 5:47 AM Tomas Vondra
 wrote:
>
> On 4/1/22 17:02, Tomas Vondra wrote:
>
> The only option I see is reworking the decoding so that it does not need
> the snapshot at all. We'll need to stash the changes just like any other
> change, apply them at end of transaction, and the main difference
> between transactional and non-transactional case would be what happens
> at abort. Transactional changes would be discarded, non-transactional
> would be applied anyway.
>

In the above, I am not following how we can make it work without
considering the *snapshot at all*, because the initial sync (copy_sequence)
would have been done based on it, and if we don't follow it later that can
lead to inconsistency. I might be missing something here.

> The challenge is this reorders the sequence changes, so we'll need to
> reconcile them somehow. One option would be to simply (1) apply the
> change with the highest LSN in the transaction, and then walk all other
> in-progress transactions and changes for that sequence with a lower LSN.
> Not sure how complex/expensive that would be, though. Another problem is
> not all increments are WAL-logged, of course, not sure about that.
>
> Another option might be revisiting the approach proposed by Hannu in
> September [1], i.e. tracking sequences touched in a transaction, and
> then replicating the current state at that particular moment.
>

I'll think about that approach as well.


-- 
With Regards,
Amit Kapila.




Re: A qsort template

2022-04-02 Thread John Naylor
On Sat, Apr 2, 2022 at 5:27 PM Thomas Munro  wrote:
> It looks like UBsan sees a problem, per BF animal kestrel:
>
> /mnt/resource/bf/build/kestrel/HEAD/pgsql.build/../pgsql/src/backend/utils/sort/tuplesort.c:722:51:
> runtime error: load of value 96, which is not a valid value for type
> 'bool'

Yeah, same with tamandua. Then, skink (a Valgrind animal) shows:

==1940791== VALGRINDERROR-BEGIN
==1940791== Conditional jump or move depends on uninitialised value(s)
==1940791==at 0x73D394: ApplyInt32SortComparator (sortsupport.h:311)
==1940791==by 0x73D394: qsort_tuple_int32_compare (tuplesort.c:722)
==1940791==by 0x73D394: qsort_tuple_int32 (sort_template.h:313)
==1940791==by 0x7409BC: tuplesort_sort_memtuples (tuplesort.c:3613)
==1940791==by 0x742806: tuplesort_performsort (tuplesort.c:2154)
==1940791==by 0x23C109: heapam_relation_copy_for_cluster
(heapam_handler.c:955)
==1940791==by 0x35799A: table_relation_copy_for_cluster (tableam.h:1658)
==1940791==by 0x35799A: copy_table_data (cluster.c:913)
==1940791==by 0x359016: rebuild_relation (cluster.c:606)
==1940791==by 0x35914E: cluster_rel (cluster.c:427)
==1940791==by 0x3594EB: cluster (cluster.c:195)
==1940791==by 0x5C73FF: standard_ProcessUtility (utility.c:862)
==1940791==by 0x5C78D0: ProcessUtility (utility.c:530)
==1940791==by 0x5C4C7B: PortalRunUtility (pquery.c:1158)
==1940791==by 0x5C4F78: PortalRunMulti (pquery.c:1315)
==1940791==  Uninitialised value was created by a stack allocation
==1940791==at 0x74224E: tuplesort_putheaptuple (tuplesort.c:1800)

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2022-04-02 Thread Julien Rouhaud
On Sat, Apr 02, 2022 at 01:12:54PM +0300, Andrei Zubkov wrote:
> On Fri, 2022-04-01 at 13:01 -0700, Andres Freund wrote:
> > It seems decidedly not great to have four copies of this code. It was
> > already
> > not great before, but this patch makes the duplicated section go from
> > four
> > lines to 20 or so.
> 
> Agreed. I've created the single_entry_reset() function wrapping this
> code. I wonder if it should be declared as inline to speedup a little.

Maybe a macro would be better here?  I don't know if that's generally ok or
just old and not-that-great code, but there are other places relying on macros
when a plain function call isn't that convenient (like here returning 0 or 1 as
a hack for incrementing num_remove), for instance in hba.c.
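
To make that concrete, here is a rough sketch of such a macro, based on the
snippet quoted earlier in the thread (the actual SINGLE_ENTRY_RESET in the
patch may differ; minmax_only, stats_reset, pgss_hash and num_remove are
assumed to be in scope at the call sites):

    /* Reset one entry: min/max stats only, or remove the entry entirely. */
    #define SINGLE_ENTRY_RESET(e) \
        do { \
            if (e) \
            { \
                if (minmax_only) \
                { \
                    /* reset only the min/max statistics of the entry */ \
                    for (int kind = 0; kind < PGSS_NUMKIND; kind++) \
                    { \
                        (e)->counters.max_time[kind] = 0; \
                        (e)->counters.min_time[kind] = 0; \
                    } \
                    (e)->minmax_stats_since = stats_reset; \
                } \
                else \
                { \
                    /* otherwise remove the entry altogether */ \
                    hash_search(pgss_hash, &(e)->key, HASH_REMOVE, NULL); \
                    num_remove++; \
                } \
            } \
        } while (0)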

> On Sat, 2022-04-02 at 15:21 +0800, Julien Rouhaud wrote:
> > I'm not sure about returning the ts.  If you need it you could call
> > SELECT
> > now() FROM pg_stat_statements_reset() (or clock_timestamp()).  It
> > won't be
> > entirely accurate but since the function will have an exclusive lock
> > during the
> > whole execution that shouldn't be a problem.  Now you're already
> > adding a new
> > version of the C function so I guess that it wouldn't require any
> > additional
> > effort so why not.
> 
> I think that if we can do it in accurate way and there is no obvious
> side effects, why not to try it...
> Changing of pg_stat_statements_reset function result caused a
> confiderable tests update. Also, I'm not sure that my description of
> this feature in the docs is blameless..
> 
> After all, I'm a little bit in doubt about this feature, so I'm ready
> to rollback it.

Well, I personally don't think that I would need it for powa as it's designed
to have very frequent snapshots.  I know you have a different approach in
pg_profile, but I'm not sure it will be that useful for you either?




Re: Can we automatically add elapsed times to tap test log?

2022-04-02 Thread Andrew Dunstan

On 4/1/22 16:25, Andrew Dunstan wrote:
> On 4/1/22 15:16, Andrew Dunstan wrote:
>> On 4/1/22 13:44, Nathan Bossart wrote:
>>> On Fri, Apr 01, 2022 at 10:21:50AM -0700, Andres Freund wrote:
 right now I am looking at a test added in the shmstats patch that's slow on
 CI, on windows only. Unfortunately the regress_log_* output is useless 
 as-is
 to figure out where things hang.

 I've hit this several times before. Of course it's not too hard to hack up
 something printing elapsed time. But ISTM that it'd be better if we were 
 able
 to prefix the logging into regress_log_* with something like
 [timestamp + time since start of test]
 or
 [timestamp + time since start of test + time since last log message]


 This isn't just useful to figure out what parts of test are slow, but also
 helps correlate server logs with the regress_log_* output. Which right now 
 is
 hard and inaccurate, requiring manually correlating statements between 
 server
 log and the tap test (often there's no logging for statements in the
 regress_log_*).
>>> +1
>>>
>> Maybe one way would be to make a change in
>> src/test/perl/PostgreSQL/Test/SimpleTee.pm. The simplest thing would
>> just be to add a timestamp, the other things would involve a bit more
>> bookkeeping. It should also be checked to make sure it doesn't add too
>> much overhead, although I would be surprised if it did.
>>
>
> Along these lines. Untested, it clearly needs a bit of polish (e.g. a
> way to turn it on or off for a filehandle). We could use Time::Hires if
> you want higher resolution times.
>
>


Here's a version that actually works. It produces traces that look like
this:


andrew@emma:pg_upgrade $ grep '([0-9]*s)'
tmp_check/log/regress_log_002_pg_upgrade
[21:55:06](63s) ok 1 - dump before running pg_upgrade
[21:55:22](79s) ok 2 - run of pg_upgrade for new instance
[21:55:27](84s) ok 3 - old and new dumps match after pg_upgrade
[21:55:27](84s) 1..3


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
diff --git a/src/test/perl/PostgreSQL/Test/SimpleTee.pm b/src/test/perl/PostgreSQL/Test/SimpleTee.pm
index bb9d79a755..432a458d09 100644
--- a/src/test/perl/PostgreSQL/Test/SimpleTee.pm
+++ b/src/test/perl/PostgreSQL/Test/SimpleTee.pm
@@ -14,6 +14,18 @@ package PostgreSQL::Test::SimpleTee;
 use strict;
 use warnings;
 
+my $start_time;
+
+BEGIN { $start_time = time; }
+
+sub _time_str
+{
+my ($sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $isdst) =
+  localtime(time);
+return sprintf("[%.2d:%.2d:%.2d](%ds) ",
+   $hour, $min, $sec,  time - $start_time);
+}
+
 sub TIEHANDLE
 {
 	my $self = shift;
@@ -24,10 +36,13 @@ sub PRINT
 {
 	my $self = shift;
 	my $ok   = 1;
+	# skip the timestamp on the first file handle so PROVE doesn't get upset
+	my $skip = 1;
 	for my $fh (@$self)
 	{
-		print $fh @_ or $ok = 0;
+		print $fh ($skip ? "" :_time_str), @_ or $ok = 0;
 		$fh->flush   or $ok = 0;
+		$skip = 0;
 	}
 	return $ok;
 }


Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2022-04-02 Thread Andrei Zubkov
On Sat, 2022-04-02 at 18:56 +0800, Julien Rouhaud wrote:
> Maybe a macro would be better here?  I don't know if that's generally
> ok or
> just old and not-that-great code, but there are other places relying
> on macros
> when a plain function call isn't that convenient (like here returning
> 0 or 1 as
> a hack for incrementing num_remove), for instance in hba.c.

Yes, it is not very convenient and doesn't look pretty, so I'll try a
macro here soon.

> > I think that if we can do it in accurate way and there is no
> > obvious
> > side effects, why not to try it...
> > Changing of pg_stat_statements_reset function result caused a
> > confiderable tests update. Also, I'm not sure that my description
> > of
> > this feature in the docs is blameless..
> > 
> > After all, I'm a little bit in doubt about this feature, so I'm
> > ready
> > to rollback it.
> 
> Well, I personally don't think that I would need it for powa as it's
> designed
> to have very frequent snapshot.  I know you have a different approach
> in
> pg_profile, but I'm not sure it will be that useful for you either?

Of course I can do some workaround if the accurate reset time is
unavailable. I just want to do the whole thing if it doesn't hurt. If
we have plenty of timestamps saved now, I think it is a good idea to
have them bound to some milestones. At least it gives an exact join
condition between samples.
But if you think we should avoid returning the ts here I won't insist on
that.





Re: A qsort template

2022-04-02 Thread John Naylor
On Sat, Apr 2, 2022 at 5:27 PM Thomas Munro  wrote:
> Reproduced locally, using the same few lines from the cluster.sql
> test.  I'll try to dig more tomorrow...

Thanks! Unfortunately I can't reproduce locally with clang 13/gcc 11,
with -Og or -O2 with CFLAGS="-fsanitize=undefined,alignment" ...

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: Skipping logical replication transactions on subscriber side

2022-04-02 Thread Masahiko Sawada
On Sat, Apr 2, 2022 at 7:04 PM Amit Kapila  wrote:
>
> On Sat, Apr 2, 2022 at 1:43 PM Noah Misch  wrote:
> >
> > On Sat, Apr 02, 2022 at 04:33:44PM +0900, Masahiko Sawada wrote:
> > > It seems that 0/B0706F72 is not a random value. Two subscriber logs
> > > show the same value. Since 0x70 = 'p', 0x6F = 'o', and 0x72 = 'r', it
> > > might show the next field in the pg_subscription catalog, i.e.,
> > > subconninfo. The subscription is created by "CREATE SUBSCRIPTION sub
> > > CONNECTION 'port=57851 host=/tmp/6u2vRwQYik dbname=postgres'
> > > PUBLICATION pub WITH (disable_on_error = true, streaming = on,
> > > two_phase = on)".
> > >
> > > Given subscription.sql passes, something is wrong when we read the
> > > subskiplsn value by like "sub->skiplsn = subform->subskiplsn;".
> >
> > That's a good clue.  We've never made pg_type.typalign able to represent
> > alignment as it works on AIX.  A uint64 like pg_lsn has 8-byte alignment, so
> > the C struct follows from that.  At the typalign level, we have only these:
> >
> > #define  TYPALIGN_CHAR  'c' /* char alignment (i.e. 
> > unaligned) */
> > #define  TYPALIGN_SHORT 's' /* short alignment (typically 2 
> > bytes) */
> > #define  TYPALIGN_INT   'i' /* int alignment (typically 4 
> > bytes) */
> > #define  TYPALIGN_DOUBLE'd' /* double alignment (often 8 
> > bytes) */
> >
> > On AIX, they are:
> >
> > #define ALIGNOF_DOUBLE 4
> > #define ALIGNOF_INT 4
> > #define ALIGNOF_LONG 8
> > /* #undef ALIGNOF_LONG_LONG_INT */
> > /* #undef ALIGNOF_PG_INT128_TYPE */
> > #define ALIGNOF_SHORT 2
> >
> > uint64 and pg_lsn use TYPALIGN_DOUBLE.  For AIX, they really need a typalign
> > corresponding to ALIGNOF_LONG.  Hence, the C struct layout doesn't match the
> > tuple layout.  Columns potentially affected:
> >
> > [local] test=*# select attrelid::regclass, attname from pg_attribute a join 
> > pg_class c on c.oid = attrelid where attalign = 'd' and relkind = 'r' and 
> > attnotnull and attlen <> -1;
> > attrelid │   attname
> > ─┼──
> >  pg_sequence │ seqstart
> >  pg_sequence │ seqincrement
> >  pg_sequence │ seqmax
> >  pg_sequence │ seqmin
> >  pg_sequence │ seqcache
> >  pg_subscription │ subskiplsn
> > (6 rows)
> >
> > The pg_sequence fields evade trouble, because there's exactly eight bytes 
> > (two
> > oids) before them.

Thanks for helping with the investigation!

> >
> >
> > Some options:
> > - Move subskiplsn after subdbid, so it's always aligned anyway.  I've
> >   confirmed that this lets the test pass, in 44s.
> > - Move subskiplsn to the CATALOG_VARLEN section, despite its fixed length.
> >
>
> +1 to any one of the above. I mildly prefer the first option as that
> will allow us to access the value directly instead of going via
> SysCacheGetAttr but I am fine either way.

+1. I also prefer the first option.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: logical decoding and replication of sequences

2022-04-02 Thread Tomas Vondra



On 4/2/22 12:35, Amit Kapila wrote:
> On Fri, Apr 1, 2022 at 8:32 PM Tomas Vondra
>  wrote:
>>
>> On 3/28/22 07:29, Amit Kapila wrote:
>>> I thought about changing snapshot dealing of
>>> non-transactional sequence changes similar to transactional ones but
>>> that also won't work because it is only at commit we decide whether we
>>> can send the changes.
>>>
>> I wonder if there's some earlier LSN (similar to the consistent point)
>> which might be useful for this.
>>
>> Or maybe we should queue even the non-transactional changes, not
>> per-transaction but in a global list, and then at each commit either
>> discard inspect them (at that point we know the lowest LSN for all
>> transactions and the consistent point). Seems complex, though.
>>
> 
> I couldn't follow '..discard inspect them ..'. Do you mean we inspect
> them and discard whichever are not required? It seems here we are
> talking about a new global ReorderBufferGlobal instead of
> ReorderBufferTXN to collect these changes but we don't need only
> consistent point LSN because we do send if the commit of containing
> transaction is after consistent point LSN, so we need some transaction
> information as well. I think it could bring new challenges.
> 

Sorry for the gibberish. Yes, I meant to discard sequence changes that
are no longer needed, due to being "obsoleted" by the applied change. We
must not apply "older" changes (using LSN) because that would make the
sequence go backwards.

I'm not entirely sure whether the list of changes should be kept in TXN
or in the global reorderbuffer object - we need to track which TXN the
change belongs to (because of transactional changes) but we also need to
discard the unnecessary changes efficiently (and walking TXN might be
expensive).

But yes, I'm sure there will be challenges. One being that tracking just
the decoded WAL stuff is not enough, because nextval() may not generate
WAL. But we still need to make sure the increment is replicated.

What I think we might do is this:

- add a global list of decoded sequence increments to ReorderBuffer

- at each commit/abort, walk the list and consider all increments up to
the commit LSN that "match" (non-transactional changes match all TXNs,
transactional ones match only the current TXN)

- replicate the last "matching" status for each sequence, discard the
processed ones

We could probably optimize this by not tracking every single increment,
but merge them "per transaction", I think.

I'm sure this description is pretty rough and will need refining, handle
various corner-cases etc.

>>> For the transactional case, as we are considering the create sequence
>>> operation as transactional, we would unnecessarily queue them even
>>> though that is not required. Basically, they don't need to be
>>> considered transactional and we can simply ignore such messages like
>>> other DDLs. But for that probably we need to distinguish Alter/Create
>>> case which may or may not be straightforward. Now, queuing them is
>>> probably harmless unless it causes the transaction to spill/stream.
>>>
>>
>> I'm not sure I follow. Why would we queue them unnecessarily?
>>
>> Also, there's the bug with decoding changes in transactions that create
>> the sequence and add it to a publication. I think the agreement was that
>> this behavior was incorrect, we should not decode changes until the
>> subscription is refreshed. Doesn't that mean can't be any CREATE case,
>> just ALTER?
>>
> 
> Yeah, but how will we distinguish them. Aren't they using the same
> kind of WAL record?
> 

Same WAL record, but there's a "created" flag which should distinguish these
two cases, IIRC.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: logical decoding and replication of sequences

2022-04-02 Thread Tomas Vondra



On 4/2/22 12:43, Amit Kapila wrote:
> On Sat, Apr 2, 2022 at 5:47 AM Tomas Vondra
>  wrote:
>>
>> On 4/1/22 17:02, Tomas Vondra wrote:
>>
>> The only option I see is reworking the decoding so that it does not need
>> the snapshot at all. We'll need to stash the changes just like any other
>> change, apply them at end of transaction, and the main difference
>> between transactional and non-transactional case would be what happens
>> at abort. Transactional changes would be discarded, non-transactional
>> would be applied anyway.
>>
> 
> I think in the above I am not following how we can make it work
> without considering *snapshot at all* because based on that we would
> have done the initial sync (copy_sequence) and if we don't follow that
> later it can lead to inconsistency. I might be missing something here.
> 

Well, what I meant to say is that we can't consider the snapshot at this
phase of decoding. We'd still consider it later, at commit/abort time,
of course. I.e. it'd be fairly similar to what heap_decode/DecodeInsert
does, for example. AFAIK this does not build the snapshot anywhere.

>> The challenge is this reorders the sequence changes, so we'll need to
>> reconcile them somehow. One option would be to simply (1) apply the
>> change with the highest LSN in the transaction, and then walk all other
>> in-progress transactions and changes for that sequence with a lower LSN.
>> Not sure how complex/expensive that would be, though. Another problem is
>> not all increments are WAL-logged, of course, not sure about that.
>>
>> Another option might be revisiting the approach proposed by Hannu in
>> September [1], i.e. tracking sequences touched in a transaction, and
>> then replicating the current state at that particular moment.
>>
> 
> I'll think about that approach as well.
> 

Thanks!

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2022-04-02 Thread Andrei Zubkov
On Sat, 2022-04-02 at 14:11 +0300, Andrei Zubkov wrote:
> On Sat, 2022-04-02 at 18:56 +0800, Julien Rouhaud wrote:
> > Maybe a macro would be better here?  I don't know if that's
> > generally
> > ok or
> > just old and not-that-great code, but there are other places
> > relying
> > on macros
> > when a plain function call isn't that convenient (like here
> > returning
> > 0 or 1 as
> > a hack for incrementing num_remove), for instance in hba.c.
> 
> Yes, it is not very convenient and not looks pretty, so I'll try a
> macro here soon.

Implemented SINGLE_ENTRY_RESET as a macro.
v10 attached
--
regards, Andrei
From 83ef072fd7406509d4eb0e54d36900f5aeb16f1e Mon Sep 17 00:00:00 2001
From: Andrei Zubkov 
Date: Sat, 2 Apr 2022 14:58:04 +0300
Subject: [PATCH] pg_stat_statements: Track statement entry timestamp

This patch adds a stats_since column to the pg_stat_statements view. This
column is populated with the current timestamp when a new statement is added
to the pg_stat_statements hashtable. It provides clean information about the
statistics collection time interval for each statement. Besides that, it can
be used by sampling solutions to detect situations when a statement was
evicted and returned back between samples.
Such a sampling solution could derive any pg_stat_statements statistic value
for an interval between two samples, with the exception of all min/max
statistics. To address this issue, this patch adds the ability to reset
min/max statistics independently of a statement reset using the new
minmax_only parameter of the pg_stat_statements_reset(userid oid, dbid oid,
queryid bigint, minmax_only boolean) function.
The timestamp of such a reset is stored in the minmax_stats_since field for
each statement. The pg_stat_statements_reset() function now returns this
timestamp as a result.

Discussion:
https://www.postgresql.org/message-id/flat/72e80e7b160a6eb189df9ef6f068cce3765d37f8.camel%40moonset.ru
---
 contrib/pg_stat_statements/Makefile   |   3 +-
 .../expected/oldextversions.out   |  61 +++
 .../expected/pg_stat_statements.out   | 361 +-
 .../pg_stat_statements--1.9--1.10.sql | 108 ++
 .../pg_stat_statements/pg_stat_statements.c   | 153 ++--
 .../pg_stat_statements.control|   2 +-
 .../pg_stat_statements/sql/oldextversions.sql |   8 +
 .../sql/pg_stat_statements.sql| 149 +++-
 doc/src/sgml/pgstatstatements.sgml|  58 ++-
 9 files changed, 745 insertions(+), 158 deletions(-)
 create mode 100644 contrib/pg_stat_statements/pg_stat_statements--1.9--1.10.sql

diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index 7fabd96f38d..edc40c8bbfb 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -6,7 +6,8 @@ OBJS = \
 	pg_stat_statements.o
 
 EXTENSION = pg_stat_statements
-DATA = pg_stat_statements--1.4.sql pg_stat_statements--1.8--1.9.sql \
+DATA = pg_stat_statements--1.4.sql \
+	pg_stat_statements--1.9--1.10.sql pg_stat_statements--1.8--1.9.sql \
 	pg_stat_statements--1.7--1.8.sql pg_stat_statements--1.6--1.7.sql \
 	pg_stat_statements--1.5--1.6.sql pg_stat_statements--1.4--1.5.sql \
 	pg_stat_statements--1.3--1.4.sql pg_stat_statements--1.2--1.3.sql \
diff --git a/contrib/pg_stat_statements/expected/oldextversions.out b/contrib/pg_stat_statements/expected/oldextversions.out
index f18c08838f5..70877948491 100644
--- a/contrib/pg_stat_statements/expected/oldextversions.out
+++ b/contrib/pg_stat_statements/expected/oldextversions.out
@@ -136,4 +136,65 @@ SELECT pg_get_functiondef('pg_stat_statements_reset'::regproc);
  
 (1 row)
 
+ALTER EXTENSION pg_stat_statements UPDATE TO '1.9';
+\d pg_stat_statements
+View "public.pg_stat_statements"
+   Column|   Type   | Collation | Nullable | Default 
+-+--+---+--+-
+ userid  | oid  |   |  | 
+ dbid| oid  |   |  | 
+ toplevel| boolean  |   |  | 
+ queryid | bigint   |   |  | 
+ query   | text |   |  | 
+ plans   | bigint   |   |  | 
+ total_plan_time | double precision |   |  | 
+ min_plan_time   | double precision |   |  | 
+ max_plan_time   | double precision |   |  | 
+ mean_plan_time  | double precision |   |  | 
+ stddev_plan_time| double precision |   |  | 
+ calls   | bigint   |   |  | 
+ total_exec_time | double precision |   |  | 
+ min_exec_time   | double precision |   |  | 
+ max_exec_time   | double precision |   |  | 
+ mean_exec_time  | double precision |   |  | 
+ 

Re: [PATCH] Add extra statistics to explain for Nested Loop

2022-04-02 Thread Justin Pryzby
This message lost track of the email headers so CFBOT isn't processing the new
patches.  Which I'm attempting to remedy now.
https://www.postgresql.org/message-id/flat/ae576cac3f451d318374f2a2e494a...@postgrespro.ru

On Fri, Apr 01, 2022 at 11:46:47PM +0300, Ekaterina Sokolova wrote:
> Hi, hackers. Thank you for your attention to this topic.
> 
> Julien Rouhaud wrote:
> > +static void show_loop_info(Instrumentation *instrument, bool isworker,
> > +   ExplainState *es);
> > 
> > I think this should be done as a separate refactoring commit.
> Sure. I divided the patch. Now Justin's refactoring commit is separate. Also I
> brought it up to date a bit.
> 
> > Most of the comments I have are easy to fix.  But I think that the real
> > problem
> > is the significant overhead shown by Ekaterina that for now would apply
> > even if
> > you don't consume the new stats, for instance if you have
> > pg_stat_statements.
> > And I'm still not sure of what is the best way to avoid that.
> I took your advice about InstrumentOption. Now INSTRUMENT_EXTRA exists.
> So currently there is no overhead during basic load. Operations using
> INSTRUMENT_ALL incur overhead (because INSTRUMENT_EXTRA is a part of
> INSTRUMENT_ALL; see the enum sketch after this quoted message), but it is
> much less significant than before. I attach new overhead statistics
> collected by pgbench with auto_explain enabled.
> 
> > Why do you need to initialize min_t and min_tuples but not max_t and
> > max_tuples while both will initially be 0 and possibly updated
> > afterwards?
> We need this initialization for the min values, so a comment about it is
> located above the block of code with the initialization.
> 
> I am convinced that the latest changes have affected the patch in a positive
> way. I'll be pleased to hear your thoughts on this.
>From 0dec500a0ed934d5d2038cb087ba6a605cafcdef Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Thu, 15 Apr 2021 11:55:09 -0500
Subject: [PATCH 1/2] explain.c: refactor ExplainNode()

---
 src/backend/commands/explain.c | 110 ++---
 1 file changed, 47 insertions(+), 63 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index cb13227db1f..06e089a1220 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -118,6 +118,8 @@ static void show_instrumentation_count(const char *qlabel, int which,
 	   PlanState *planstate, ExplainState *es);
 static void show_foreignscan_info(ForeignScanState *fsstate, ExplainState *es);
 static void show_eval_params(Bitmapset *bms_params, ExplainState *es);
+static void show_loop_info(Instrumentation *instrument, bool isworker,
+		ExplainState *es);
 static const char *explain_get_index_name(Oid indexId);
 static void show_buffer_usage(ExplainState *es, const BufferUsage *usage,
 			  bool planning);
@@ -1615,36 +1617,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 
 	if (es->analyze &&
 		planstate->instrument && planstate->instrument->nloops > 0)
-	{
-		double		nloops = planstate->instrument->nloops;
-		double		startup_ms = 1000.0 * planstate->instrument->startup / nloops;
-		double		total_ms = 1000.0 * planstate->instrument->total / nloops;
-		double		rows = planstate->instrument->ntuples / nloops;
-
-		if (es->format == EXPLAIN_FORMAT_TEXT)
-		{
-			if (es->timing)
-appendStringInfo(es->str,
- " (actual time=%.3f..%.3f rows=%.0f loops=%.0f)",
- startup_ms, total_ms, rows, nloops);
-			else
-appendStringInfo(es->str,
- " (actual rows=%.0f loops=%.0f)",
- rows, nloops);
-		}
-		else
-		{
-			if (es->timing)
-			{
-ExplainPropertyFloat("Actual Startup Time", "ms", startup_ms,
-	 3, es);
-ExplainPropertyFloat("Actual Total Time", "ms", total_ms,
-	 3, es);
-			}
-			ExplainPropertyFloat("Actual Rows", NULL, rows, 0, es);
-			ExplainPropertyFloat("Actual Loops", NULL, nloops, 0, es);
-		}
-	}
+		show_loop_info(planstate->instrument, false, es);
 	else if (es->analyze)
 	{
 		if (es->format == EXPLAIN_FORMAT_TEXT)
@@ -1673,44 +1646,14 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		for (int n = 0; n < w->num_workers; n++)
 		{
 			Instrumentation *instrument = &w->instrument[n];
-			double		nloops = instrument->nloops;
-			double		startup_ms;
-			double		total_ms;
-			double		rows;
 
-			if (nloops <= 0)
+			if (instrument->nloops <= 0)
 continue;
-			startup_ms = 1000.0 * instrument->startup / nloops;
-			total_ms = 1000.0 * instrument->total / nloops;
-			rows = instrument->ntuples / nloops;
 
 			ExplainOpenWorker(n, es);
-
+			show_loop_info(instrument, true, es);
 			if (es->format == EXPLAIN_FORMAT_TEXT)
-			{
-ExplainIndentText(es);
-if (es->timing)
-	appendStringInfo(es->str,
-	 "actual time=%.3f..%.3f rows=%.0f loops=%.0f\n",
-	 startup_ms, total_ms, rows, nloops);
-else
-	appendStringInfo(es->str,
-	 "actual rows=%.0f loops=%.0f\n",
-	 rows, nloops);
-			}
-			else
-			{
-		

Re: psql - add SHOW_ALL_RESULTS option

2022-04-02 Thread Fabien COELHO




Again, after the SendQuery refactoring extraction.


I'm doing this locally, so don't feel obliged to send more of these. ;-)


Good for me :-)

--
Fabien.




Re: [PATCH] pgbench: add multiconnect option

2022-04-02 Thread Fabien COELHO



According to the cfbot this patch needs a rebase


Indeed. v4 attached.

--
Fabien.

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index ebdb4b3f46..d96d2d291d 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -29,7 +29,7 @@ PostgreSQL documentation
   
pgbench
option
-   dbname
+   dbname or conninfo
   
  
 
@@ -169,6 +169,9 @@ pgbench  options  d
 not specified, the environment variable
 PGDATABASE is used. If that is not set, the
 user name specified for the connection is used.
+Alternatively, the dbname can be
+a standard connection information string.
+Several connections can be provided.

   
  
@@ -918,6 +921,21 @@ pgbench  options  d
 
 
 
+ 
+  --connection-policy=policy
+  
+   
+Set the connection policy when multiple connections are available.
+Default is round-robin provided (ro).
+Possible values are:
+first (f), 
+random (ra), 
+round-robin (ro),
+working (w).
+   
+  
+ 
+
  
   -h hostname
   --host=hostname
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index acf3e56413..d99d40fbb9 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -305,13 +305,39 @@ uint32		max_tries = 1;
 bool		failures_detailed = false;	/* whether to group failures in reports
 		 * or logs by basic types */
 
+char	   *logfile_prefix = NULL;
+
+/* main connection definition */
 const char *pghost = NULL;
 const char *pgport = NULL;
 const char *username = NULL;
-const char *dbName = NULL;
-char	   *logfile_prefix = NULL;
 const char *progname;
 
+/* multi connections */
+typedef enum mc_policy_t
+{
+	MC_UNKNOWN = 0,
+	MC_FIRST,
+	MC_RANDOM,
+	MC_ROUND_ROBIN,
+	MC_WORKING
+} mc_policy_t;
+
+/* connection info list */
+typedef struct connection_t
+{
+	const char *connection;		/* conninfo or dbname */
+	int			errors;			/* number of connection errors */
+} connection_t;
+
+static int			n_connections = 0;
+static connection_t	   *connections = NULL;
+static mc_policy_t	mc_policy = MC_ROUND_ROBIN;
+
+/* last used connection */
+// FIXME per thread?
+static int current_connection = 0;
+
 #define WSEP '@'/* weight separator */
 
 volatile bool timer_exceeded = false;	/* flag from signal handler */
@@ -873,7 +899,7 @@ usage(void)
 {
 	printf("%s is a benchmarking tool for PostgreSQL.\n\n"
 		   "Usage:\n"
-		   "  %s [OPTION]... [DBNAME]\n"
+		   "  %s [OPTION]... [DBNAME or CONNINFO ...]\n"
 		   "\nInitialization options:\n"
 		   "  -i, --initialize invokes initialization mode\n"
 		   "  -I, --init-steps=[" ALL_INIT_STEPS "]+ (default \"" DEFAULT_INIT_STEPS "\")\n"
@@ -931,6 +957,7 @@ usage(void)
 		   "  -h, --host=HOSTNAME  database server host or socket directory\n"
 		   "  -p, --port=PORT  database server port number\n"
 		   "  -U, --username=USERNAME  connect as specified database user\n"
+		   "  --connection-policy=S    set multiple connection policy (\"first\", \"rand\", \"round-robin\", \"working\")\n"
 		   "  -V, --versionoutput version information, then exit\n"
 		   "  -?, --help   show this help, then exit\n"
 		   "\n"
@@ -1538,13 +1565,89 @@ tryExecuteStatement(PGconn *con, const char *sql)
 	PQclear(res);
 }
 
+/* store a new connection information string */
+static void
+push_connection(const char *c)
+{
+	connections = pg_realloc(connections, sizeof(connection_t) * (n_connections+1));
+	connections[n_connections].connection = pg_strdup(c);
+	connections[n_connections].errors = 0;
+	n_connections++;
+}
+
+/* switch connection */
+static int
+next_connection(int *pci)
+{
+	int ci;
+
+	ci = ((*pci) + 1) % n_connections;
+	*pci = ci;
+
+	return ci;
+}
+
+/* return the connection index to use for next attempt */
+static int
+choose_connection(int *pci)
+{
+	int ci;
+
+	switch (mc_policy)
+	{
+		case MC_FIRST:
+			ci = 0;
+			break;
+		case MC_RANDOM:
+			// FIXME should use a prng state ; not thread safe ;
+			ci = (int) getrand(&base_random_sequence, 0, n_connections-1);
+			*pci = ci;
+			break;
+		case MC_ROUND_ROBIN:
+			ci = next_connection(pci);
+			break;
+		case MC_WORKING:
+			ci = *pci;
+			break;
+		default:
+			pg_log_fatal("unexpected multi connection policy: %d", mc_policy);
+			exit(1);
+	}
+
+	return ci;
+}
+
+/* return multi-connection policy based on its name or shortest prefix */
+static mc_policy_t
+get_connection_policy(const char *s)
+{
+	if (s == NULL || *s == '\0' || strcmp(s, "first") == 0 || strcmp(s, "f") == 0)
+		return MC_FIRST;
+	else if (strcmp(s, "random") == 0 || strcmp(s, "ra") == 0)
+		return MC_RANDOM;
+	else if (strcmp(s, "round-robin") == 0 || strcmp(s, "ro") == 0)
+		return MC_ROUND_ROBIN;
+	else if (strcmp(s, "working") == 0 || strcmp(s, "w") == 0)
+		return MC_WORKING;
+	else
+		return MC_UNKNOWN;
+}
+
+/* get backend connection info */
+s

Re: PostgreSQL shutdown modes

2022-04-02 Thread chap

On 2022-04-01 13:22, Robert Haas wrote:

I attach herewith a modest patch to rename these shutdown modes to
more accurately correspond to their actual characteristics.


I've waited for April 2nd to submit this comment, but it seemed to me that
the suggestion about the first-pass checkpoint in 'slow' mode is a
no-foolin' good one.

Then I wondered whether there could be an option to accompany the 'dumb'
mode that would take a WHERE clause, to be implicitly applied to
pg_stat_activity, whose purpose would be to select those sessions that are
ok to evict without waiting for them to exit. It could recognize, say,
backend connections in no current transaction that are from your pesky app
or connection pooler that holds things open. It could also, for example,
select things in transaction state but where
current_timestamp - state_change > '5 minutes' (so it would be re-evaluated
every so often until ready to shut down).

For conciseness (and sanity), maybe the WHERE clause could be implicitly
applied, not to pg_stat_activity directly, but to a (virtual or actual)
view that has already been restricted to client backend sessions, and
already has a column for current_timestamp - state_change.
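
A purely hypothetical sketch of such a filter, written directly against
pg_stat_activity (the columns are real, the option is not, and the
application_name value is only illustrative):

-- sessions considered safe to evict at shutdown time
SELECT pid
FROM pg_stat_activity
WHERE backend_type = 'client backend'
  AND ((state = 'idle' AND application_name = 'pesky-pooler')
       OR current_timestamp - state_change > interval '5 minutes');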

Regards,
-Chap




Re: [PATCH] Add extra statistics to explain for Nested Loop

2022-04-02 Thread Julien Rouhaud
Hi,

On Fri, Apr 01, 2022 at 11:46:47PM +0300, Ekaterina Sokolova wrote:
>
> > Most of the comments I have are easy to fix.  But I think that the real
> > problem
> > is the significant overhead shown by Ekaterina that for now would apply
> > even if
> > you don't consume the new stats, for instance if you have
> > pg_stat_statements.
> > And I'm still not sure of what is the best way to avoid that.
> I took your advice about InstrumentOption. Now INSTRUMENT_EXTRA exists.
> So currently it's no overheads during basic load. Operations using
> INSTRUMENT_ALL contain overheads (because of INSTRUMENT_EXTRA is a part of
> INSTRUMENT_ALL), but they are much less significant than before. I apply new
> overhead statistics collected by pgbench with auto _explain enabled.

Can you give a bit more details on your bench scenario?  I see contradictory
results, where the patched version with more code is sometimes way faster,
sometimes way slower.  If you're using pgbench
default queries (including write queries) I don't think that any of them will
hit the loop code, so it's really a best case scenario.  Also write queries
will make tests less stable for no added value wrt. this code.

Ideally you would need a custom scenario with a single read-only query
involving a nested loop or something like that to check how much overhead you
really get when you cumulate those values.  I will try to
>
> > Why do you need to initialize min_t and min_tuples but not max_t and
> > max_tuples while both will initially be 0 and possibly updated
> > afterwards?
> We need this initialization for min values so comment about it located above
> the block of code with initialization.

Sure, but if we're going to have a branch for nloops == 0, I think it would be
better to avoid redundant / useless instructions, something like:

if (nloops == 0)
{
min_t = totaltime;
min_tuple = tuplecount;
}
else
{
if (min_t...)
...
}
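
Spelled out a bit more fully (only a sketch, using the field names
mentioned in this thread, not necessarily the exact ones in the patch):

if (nloops == 0)
{
    /* first loop: just seed the running extremes */
    min_t = totaltime;
    max_t = totaltime;
    min_tuple = tuplecount;
    max_tuple = tuplecount;
}
else
{
    if (totaltime < min_t)
        min_t = totaltime;
    if (totaltime > max_t)
        max_t = totaltime;
    if (tuplecount < min_tuple)
        min_tuple = tuplecount;
    if (tuplecount > max_tuple)
        max_tuple = tuplecount;
}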

While on that part of the patch, there's an extra new line between max_t and
min_tuple processing.
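
For what it's worth, a minimal read-only scenario of the kind suggested
earlier in this mail could be something like this (assuming a stock
scale-1 pgbench database; the file name is arbitrary):

-- nestloop.sql: one query that runs a parameterized nested loop
\set aid random(1, 99900)
SELECT count(*)
FROM generate_series(:aid, :aid + 99) g(i)
JOIN pgbench_accounts a ON a.aid = g.i;

run with e.g. "pgbench -n -f nestloop.sql -T 60" against patched and
unpatched builds.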




Re: PROXY protocol support

2022-04-02 Thread wilfried roset
Hi,

I've been able to test the patch. Here is a recap of the experimentation.

# Setup

All tests have been done with 3 VMs (PostgreSQL, HAproxy, psql client) on
Debian 11, communicating over a private network.
* PostgreSQL has been built with proxy_protocol_11.patch applied on the master 
branch (465ab24296).
* psql client is from postgresql-client-13 from Debian 11 repository.
* HAproxy version used is 2.5.5-1~bpo11+1 installed from 
https://haproxy.debian.net

# Configuration

PostgreSQL has been configured to listen only on its private IP. To enable
proxy protocol support `proxy_port` has been configured to `5431` and
`proxy_servers` to `10.0.0.0/24`. `log_connections` has been turned on to make
sure the correct IP address is logged. `log_min_duration_statement` has been
configured to 0 to log all queries. Finally `log_destination` has been
configured to `csvlog`.

pg_hba.conf is like this:

  local   all             all                                     trust
  host    all             all             127.0.0.1/32            trust
  host    all             all             ::1/128                 trust
  local   replication     all                                     trust
  host    replication     all             127.0.0.1/32            trust
  host    replication     all             ::1/128                 trust
  host    all             all             10.0.0.208/32           md5

Where 10.0.0.208 is the IP of the psql client's VM.

HAproxy has two frontends, one for proxy protocol (port 5431) and one for
regular TCP traffic. The configuration looks like this:

  listen postgresql
  bind 10.0.0.222:5432
  server pg 10.0.0.253:5432 check

  listen postgresql_proxy
  bind 10.0.0.222:5431
  server pg 10.0.0.253:5431 send-proxy-v2

Where 10.0.0.222 is the IP of HAproxy's VM and 10.0.0.253 is the IP of
PostgreSQL's VM.

# Tests

* from psql's vm to haproxy on port 5432 (no proxy protocol)
  --> connection denied by pg_hba.conf, as expected

* from psql's vm to postgresql's VM on port 5432 (no proxy protocol)
  --> connection success with psql's vm ip in logfile and pg_stat_activity

* from psql's vm to postgresql's VM on port 5431 (proxy protocol)
  --> unable to open a connection, as expected

* from psql's vm to haproxy on port 5431 (proxy protocol)
  --> connection success with psql's vm ip in logfile and pg_stat_activity

I've also tested without proxy protocol enabled (and pg_hba.conf updated
accordingly); PostgreSQL behaves as expected.

# Conclusion

From my point of view the documentation is clear enough and the feature works
as expected.

Re: support for MERGE

2022-04-02 Thread Alvaro Herrera
On 2022-Mar-31, Daniel Gustafsson wrote:

> > On 31 Mar 2022, at 19:38, Ranier Vilela  wrote:
> 
> > I think that there is an oversight at 7103ebb
> > There is no chance of Assert preventing this bug.
> 
> This seems reasonable from brief reading of the code, NULL is a legitimate
> value for the map and that should yield an empty list AFAICT.

There's no bug here and this is actually intentional: if the map is
NULL, this function should not be called.

In the code before this commit, there was an assert that this variable
was not null:

  static List *
  adjust_partition_colnos(List *colnos, ResultRelInfo *leaf_part_rri)
  {
-   List   *new_colnos = NIL;
TupleConversionMap *map = ExecGetChildToRootMap(leaf_part_rri);
!   AttrMap*attrMap;
ListCell   *lc;
  
!   Assert(map != NULL);/* else we shouldn't be here */
!   attrMap = map->attrMap;
  
foreach(lc, colnos)
{


We could add an Assert that map is not null in the new function, but
really there's no point: if the map is null, we'll crash just fine in
the following line.

I would argue that we should *remove* the Assert() that I left in
adjust_partition_colnos_with_map.

Even if we wanted to make the function handle the case of a NULL map,
then the right fix is not to return NIL, but rather we should return the
original list.
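
(For concreteness, a sketch of that alternative would be just

    if (map == NULL)
        return colnos;    /* nothing to translate */

at the top of the function.)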

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: merge documentation fix

2022-04-02 Thread Alvaro Herrera
On 2022-Apr-01, Euler Taveira wrote:

> Hi,
> 
> While inspecting the MERGE documentation, I noticed that there is an extra
> semicolon in one of the examples that shouldn't be there.

Thanks, pushed.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"The Gord often wonders why people threaten never to come back after they've
been told never to return" (www.actsofgord.com)




Re: support for MERGE

2022-04-02 Thread Andres Freund
Hi,

On 2022-04-02 17:02:01 +0200, Alvaro Herrera wrote:
> There's no bug here and this is actually intentional: if the map is
> NULL, this function should not be called.

This made me, again, wonder if we should add a pg_nonnull attribute to c.h. The
compiler can probably figure it out in this case, but there are plenty of cases
it can't, because the function definition is in a different translation unit.
And IMO it helps humans too.

Regards,

Andres




Re: CLUSTER on partitioned index

2022-04-02 Thread Alvaro Herrera
Thanks, pushed.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"La fuerza no está en los medios físicos
sino que reside en una voluntad indomable" (Gandhi)




Re: CLUSTER on partitioned index

2022-04-02 Thread Alvaro Herrera
Small things here.

1. In VACUUM FULL we only process partitions that are owned by the
invoking user.  We don't have this test in the new code.  I'm not sure
why we do that there; is it worth doing the same here?

2. We should silently skip a partition that's a foreign table, I
suppose.

3. We do mark the index on the partitions as indisclustered AFAICS (we
claim that the partitioned table's index is not marked, which is
accurate).  So users doing unadorned CLUSTER afterwards will get the
partitions clustered too, once they cluster the partitioned table.  If
they don't want this, they would have to ALTER TABLE to remove the
marking.  How likely is that this will be a problem?  Maybe documenting
this point is enough.
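
(For the record, un-marking a partition afterwards is just the usual

    ALTER TABLE partition_name SET WITHOUT CLUSTER;

with partition_name being whichever partition one does not want
reclustered.)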

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
Voy a acabar con todos los humanos / con los humanos yo acabaré
voy a acabar con todos (bis) / con todos los humanos acabaré ¡acabaré! (Bender)




Re: Fix overflow in DecodeInterval

2022-04-02 Thread Tom Lane
I wrote:
> ... I almost feel that this is
> committable, but there is one thing that is bothering me.  The
> part of DecodeInterval that does strange things with signs in the
> INTSTYLE_SQL_STANDARD case (starting about line 3400 in datetime.c
> before this patch, or line 3600 after) used to separately force the
> hour, minute, second, and microsecond fields to negative.
> Now it forces the merged tm_usec field to negative.  It seems to
> me that this could give a different answer than before, if the
> h/m/s/us values had been of different signs before they got merged.
> However, I don't think that that situation is possible in SQL-spec-
> compliant input, so it may not be a problem.  Again, a counterexample
> would be interesting.

As best I can tell, the case isn't reachable with spec-compliant input,
but it's easy to demonstrate an issue if you set intervalstyle to
sql_standard and then put in Postgres-format input.  Historically,
you got

regression=# show intervalstyle;
 IntervalStyle 
---
 postgres
(1 row)

regression=# select '-23 hours 45 min 12.34 sec'::interval;
   interval   
--
 -22:14:47.66
(1 row)

(because by default the field signs are taken as independent)

regression=# set intervalstyle = sql_standard ;
SET
regression=# select '-23 hours 45 min 12.34 sec'::interval;
   interval   
--
 -23:45:12.34
(1 row)

However, with this patch both cases produce "-22:14:47.66",
because we already merged the differently-signed fields and
DecodeInterval can't tease them apart again.  Perhaps we could
get away with changing this nonstandard corner case, but I'm
pretty uncomfortable with that thought --- I don't think
random semantics changes are within the charter of this patch.

I think the patch can be salvaged, though.  I like the concept
of converting all the sub-day fields to microseconds immediately,
because it avoids a host of issues, so I don't want to give that up.
What I'm going to look into is detecting the sign-adjustment-needed
case up front (which is easy enough, since it's looking at the
input data not the conversion results) and then forcing the
individual field values negative before we accumulate them into
the pg_itm_in struct.

Meanwhile, the fact that we didn't detect this issue immediately
shows that there's a gap in our regression tests.  So the *first*
thing I'm gonna do is push a patch to add test cases like what
I showed above.

regards, tom lane




Re: Fix overflow in DecodeInterval

2022-04-02 Thread Joseph Koshakow
On Fri, Apr 1, 2022 at 8:06 PM Tom Lane  wrote:
>
> Joseph Koshakow  writes:
> > * The existing code for rounding had a lot of int to double
> > casting and vice versa. I *think* that doubles are able to completely
> > represent the range of ints. However doubles are not able to represent
> > the full range of int64. After making the change I started noticing
> > a lot of lossy behavior. One thought I had was to change the doubles
> > to long doubles, but I wasn't able to figure out if long doubles could
> > completely represent the range of int64. Especially since their size
> > varies depending on the architecture. Does anyone know the answer to
> > this?
>
> I agree that relying on long double is not a great plan.  However,
> I'm not seeing where there's a problem.  AFAICS the revised code
> only uses doubles to represent fractions from the input, ie if you
> write "123.456 hours" then the ".456" is carried around for awhile
> as a float.  This does not seem likely to pose any real-world
> problem; do you have a counterexample?

Yeah, you're correct, I don't think there is any problem with just
using double. I don't exactly remember why I thought long double
was necessary in the revised code. I probably just confused
myself because it would have been necessary with the old
rounding code, but not the revised code.

> Anyway, I've spent today reviewing the code and cleaning up things
> I didn't like, and attached is a v10.

Thanks so much for the review and updates!

> I think the patch can be salvaged, though.  I like the concept
> of converting all the sub-day fields to microseconds immediately,
> because it avoids a host of issues, so I don't want to give that up.
> What I'm going to look into is detecting the sign-adjustment-needed
> case up front (which is easy enough, since it's looking at the
> input data not the conversion results) and then forcing the
> individual field values negative before we accumulate them into
> the pg_itm_in struct.

This sounds like a very reasonable and achievable approach
to me.

- Joe Koshakow




Re: [PATCH v2] use has_privs_for_role for predefined roles

2022-04-02 Thread Joe Conway

On 3/28/22 15:56, Robert Haas wrote:

On Mon, Mar 21, 2022 at 4:15 PM Joe Conway  wrote:

Robert -- any opinion on this? If I am not mistaken it is code that you
are actively working on.


Woops, I only just saw this. I don't mind if you want to change the
calls to is_member_of_role() in basebackup_server.c and
basebackup_to_shell.c to has_privs_of_role().


Done as originally posted.

On that last note, I did not find basebackup_to_shell.required_role 
documented at all, and did not attempt to fix that.


I saw that Robert added documentation and it already reads correctly I 
believe, except possibly an unrelated issue:


8<--
 
  A role which replication whose privileges users are required to 
possess

  in order to make use of the shell backup target.
  If this is not set, any replication user may make use of the
  shell backup target.
 
8<--

Robert, should that actually be:
 s/role which replication/role with replication/
?

Joe




Re: support for MERGE

2022-04-02 Thread Ranier Vilela
Em sáb., 2 de abr. de 2022 às 12:01, Alvaro Herrera 
escreveu:

> On 2022-Mar-31, Daniel Gustafsson wrote:
>
> > > On 31 Mar 2022, at 19:38, Ranier Vilela  wrote:
> >
> > > I think that there is an oversight at 7103ebb
> > > There is no chance of Assert preventing this bug.
> >
> > This seems reasonable from brief reading of the code, NULL is a
> legitimate
> > value for the map and that should yield an empty list AFAICT.
>
> There's no bug here and this is actually intentional: if the map is
> NULL, this function should not be called.
>
IMHO, there actually is a bug here.
ExecGetChildToRootMap clearly can return NULL.
To discover whether the map is NULL, ExecGetChildToRootMap needs to process
"ResultRelInfo *leaf_part_rri".
So the argument "if the map is NULL, this function should not be called"
is contradictory.

Also, the Assert in adjust_partition_colnos_using_map will never be
checked, because the crash happens before it, in both
production and debug modes.


> In the code before this commit, there was an assert that this variable
> was not null:
>
>   static List *
>   adjust_partition_colnos(List *colnos, ResultRelInfo *leaf_part_rri)
>   {
> -   List   *new_colnos = NIL;
> TupleConversionMap *map = ExecGetChildToRootMap(leaf_part_rri);
> !   AttrMap*attrMap;
> ListCell   *lc;
>
> !   Assert(map != NULL);/* else we shouldn't be here */
> !   attrMap = map->attrMap;
>
> foreach(lc, colnos)
> {
>
>
> We could add an Assert that map is not null in the new function, but
> really there's no point: if the map is null, we'll crash just fine in
> the following line.
>
> I would argue that we should *remove* the Assert() that I left in
> adjust_partition_colnos_with_map.
>
> Even if we wanted to make the function handle the case of a NULL map,
> then the right fix is not to return NIL, but rather we should return the
> original list.
>
If the right fix is to return the original list, here is the patch attached.

regards
Ranier Vilela


v2-0001-avoid-dereference-null-map.patch
Description: Binary data


logical decoding and replication of sequences

2022-04-02 Thread Ranier Vilela
On 2/10/22 19:17, Tomas Vondra wrote:
>> I've polished & pushed the first part adding sequence decoding
>> infrastructure etc. Attached are the two remaining parts.
>>
>> I plan to wait a day or two and then push the test_decoding part. The
>> last part (for built-in replication) will need more work and maybe
>> rethinking the grammar etc.
>>

>I've pushed the second part, adding sequences to test_decoding.

Hi,

Minor oversight with commit 0da92dc.

RelationIdGetRelation can return NULL, so it is necessary to check the
return value.
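
A minimal sketch of the kind of check I mean (the usual idiom in the
backend; variable names here are only illustrative):

Relation	rel = RelationIdGetRelation(reloid);

if (!RelationIsValid(rel))
	elog(ERROR, "could not open relation with OID %u", reloid);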

regards,
Ranier Vilela


0001-avoid-dereference-null-relation.patch
Description: Binary data


Re: Fix overflow in DecodeInterval

2022-04-02 Thread Joseph Koshakow
On Sat, Apr 2, 2022 at 1:29 PM Joseph Koshakow  wrote:
>
> On Fri, Apr 1, 2022 at 8:06 PM Tom Lane  wrote:
> >
> > Joseph Koshakow  writes:
> > > * The existing code for rounding had a lot of int to double
> > > casting and vice versa. I *think* that doubles are able to completely
> > > represent the range of ints. However doubles are not able to represent
> > > the full range of int64. After making the change I started noticing
> > > a lot of lossy behavior. One thought I had was to change the doubles
> > > to long doubles, but I wasn't able to figure out if long doubles could
> > > completely represent the range of int64. Especially since their size
> > > varies depending on the architecture. Does anyone know the answer to
> > > this?
> >
> > I agree that relying on long double is not a great plan.  However,
> > I'm not seeing where there's a problem.  AFAICS the revised code
> > only uses doubles to represent fractions from the input, ie if you
> > write "123.456 hours" then the ".456" is carried around for awhile
> > as a float.  This does not seem likely to pose any real-world
> > problem; do you have a counterexample?
>
> Yeah, you're correct, I don't think there is any problem with just
> using double. I don't exactly remember why I thought long double
> was necessary in the revised code. I probably just confused
> myself because it would have been necessary with the old
> rounding code, but not the revised code.

Ok I actually remember now, the issue is with the rounding
code in AdjustFractMicroseconds.

>frac *= scale;
>usec = (int64) frac;
>
>/* Round off any fractional microsecond */
>frac -= usec;
>if (frac > 0.5)
>   usec++;
>else if (frac < -0.5)
>   usec--;

I believe it's possible for `frac -= usec;` to result in a value greater
than 1 or less than -1 due to the lossiness of int64 to double
conversions. Then we'd incorrectly round in one direction. I don't
have a concrete counter example, but at worst we'd end up with a
result that's a couple of microseconds off, so it's probably not a huge
deal.

If I'm right about the above, and we care enough to fix it, then I think
it can be fixed with the following:

>frac *= scale;
>usec = (int64) frac;
>
>/* Remove non fractional part from frac */
>frac -= (double) usec;
>/* Adjust for lossy conversion from int64 to double */
>while (frac < 0 && frac < -1)
>   frac++;
>while (frac > 0 && frac > 1)
>   frac--;
>
>/* Round off any fractional microsecond */
>if (frac > 0.5)
>   usec++;
>else if (frac < -0.5)
>   usec--;

- Joe Koshakow




Re: Fix overflow in DecodeInterval

2022-04-02 Thread Joseph Koshakow
On Sat, Apr 2, 2022 at 2:22 PM Joseph Koshakow  wrote:
>
> On Sat, Apr 2, 2022 at 1:29 PM Joseph Koshakow  wrote:
> >
> > On Fri, Apr 1, 2022 at 8:06 PM Tom Lane  wrote:
> > >
> > > Joseph Koshakow  writes:
> > > > * The existing code for rounding had a lot of int to double
> > > > casting and vice versa. I *think* that doubles are able to completely
> > > > represent the range of ints. However doubles are not able to represent
> > > > the full range of int64. After making the change I started noticing
> > > > a lot of lossy behavior. One thought I had was to change the doubles
> > > > to long doubles, but I wasn't able to figure out if long doubles could
> > > > completely represent the range of int64. Especially since their size
> > > > varies depending on the architecture. Does anyone know the answer to
> > > > this?
> > >
> > > I agree that relying on long double is not a great plan.  However,
> > > I'm not seeing where there's a problem.  AFAICS the revised code
> > > only uses doubles to represent fractions from the input, ie if you
> > > write "123.456 hours" then the ".456" is carried around for awhile
> > > as a float.  This does not seem likely to pose any real-world
> > > problem; do you have a counterexample?
> >
> > Yeah, you're correct, I don't think there is any problem with just
> > using double. I don't exactly remember why I thought long double
> > was necessary in the revised code. I probably just confused
> > myself because it would have been necessary with the old
> > rounding code, but not the revised code.
>
> Ok I actually remember now, the issue is with the rounding
> code in AdjustFractMicroseconds.
>
> >frac *= scale;
> >usec = (int64) frac;
> >
> >/* Round off any fractional microsecond */
> >frac -= usec;
> >if (frac > 0.5)
> >   usec++;
> >else if (frac < -0.5)
> >   usec--;
>
> I believe it's possible for `frac -= usec;` to result in a value greater
> than 1 or less than -1 due to the lossiness of int64 to double
> conversions. Then we'd incorrectly round in one direction. I don't
> have a concrete counter example, but at worst we'd end up with a
> result that's a couple of microseconds off, so it's probably not a huge
> deal.
>
> If I'm right about the above, and we care enough to fix it, then I think
> it can be fixed with the following:
>
> >frac *= scale;
> >usec = (int64) frac;
> >
> >/* Remove non fractional part from frac */
> >frac -= (double) usec;
> >/* Adjust for lossy conversion from int64 to double */
> >while (frac < 0 && frac < -1)
> >   frac++;
> >while (frac > 0 && frac > 1)
> >   frac--;
> >
> >/* Round off any fractional microsecond */
> >if (frac > 0.5)
> >   usec++;
> >else if (frac < -0.5)
> >   usec--;


Sorry, those should be inclusive comparisons
>frac *= scale;
>usec = (int64) frac;
>
>/* Remove non fractional part from frac */
>frac -= (double) usec;
>/* Adjust for lossy conversion from int64 to double */
>while (frac < 0 && frac <= -1)
>   frac++;
>while (frac > 0 && frac >= 1)
>   frac--;
>
>/* Round off any fractional microsecond */
>if (frac > 0.5)
>   usec++;
>else if (frac < -0.5)
>   usec--;




Re: Fix overflow in DecodeInterval

2022-04-02 Thread Joseph Koshakow
On Fri, Apr 1, 2022 at 8:06 PM Tom Lane  wrote:
> I think the patch can be salvaged, though.  I like the concept
> of converting all the sub-day fields to microseconds immediately,
> because it avoids a host of issues, so I don't want to give that up.
> What I'm going to look into is detecting the sign-adjustment-needed
> case up front (which is easy enough, since it's looking at the
> input data not the conversion results) and then forcing the
> individual field values negative before we accumulate them into
> the pg_itm_in struct.

I took a stab at this issue and the attached patch (which would be
applied on top of your v10 patch) seems to fix the issue. Feel
free to ignore it if you're already working on a fix.

- Joe
From f43d27142a76fcbabf49e45b9457f8376744e759 Mon Sep 17 00:00:00 2001
From: Joseph Koshakow 
Date: Sat, 2 Apr 2022 14:42:18 -0400
Subject: [PATCH 2/2] Fix sql standard style negative semantics

---
 src/backend/utils/adt/datetime.c   | 107 ++---
 src/test/regress/expected/interval.out |  14 
 src/test/regress/sql/interval.sql  |   5 ++
 3 files changed, 79 insertions(+), 47 deletions(-)

diff --git a/src/backend/utils/adt/datetime.c b/src/backend/utils/adt/datetime.c
index dae90e4a9e..5842d249ab 100644
--- a/src/backend/utils/adt/datetime.c
+++ b/src/backend/utils/adt/datetime.c
@@ -50,6 +50,8 @@ static int	DecodeDate(char *str, int fmask, int *tmask, bool *is2digits,
 static char *AppendSeconds(char *cp, int sec, fsec_t fsec,
 		   int precision, bool fillzeros);
 static bool int64_multiply_add(int64 val, int64 multiplier, int64 *sum);
+static void AdjustForSqlStandardGlobalNegative(int64 *val, double *fval, 
+			   bool global_negative);
 static bool AdjustFractMicroseconds(double frac, int64 scale,
 	struct pg_itm_in *itm_in);
 static bool AdjustFractDays(double frac, int scale,
@@ -527,6 +529,19 @@ int64_multiply_add(int64 val, int64 multiplier, int64 *sum)
 	return true;
 }
 
+/*
+ * Adjust values sign if SQL Standard style is being used and there's a 
+ * single leading negative sign.
+ */
+static void AdjustForSqlStandardGlobalNegative(int64 *val, double *fval,
+			   bool global_negative)
+{
+	if (*val > 0 && global_negative) {
+		*val = -*val;
+		*fval = -*fval;
+	}
+}
+
 /*
  * Multiply frac by scale (to produce microseconds) and add to itm_in->tm_usec.
  * Returns true if successful, false if itm_in overflows.
@@ -3307,10 +3322,43 @@ DecodeInterval(char **field, int *ftype, int nf, int range,
 	int64		val;
 	double		fval;
 
+	bool		global_negative = false;
+
 	*dtype = DTK_DELTA;
 	type = IGNORE_DTF;
 	ClearPgItmIn(itm_in);
 
+	/*--
+	 * The SQL standard defines the interval literal
+	 *	 '-1 1:00:00'
+	 * to mean "negative 1 days and negative 1 hours", while Postgres
+	 * traditionally treats this as meaning "negative 1 days and positive
+	 * 1 hours".  In SQL_STANDARD intervalstyle, we apply the leading sign
+	 * to all fields if there are no other explicit signs.
+	 *
+	 * We leave the signs alone if there are additional explicit signs.
+	 * This protects us against misinterpreting postgres-style dump output,
+	 * since the postgres-style output code has always put an explicit sign on
+	 * all fields following a negative field.  But note that SQL-spec output
+	 * is ambiguous and can be misinterpreted on load!	(So it's best practice
+	 * to dump in postgres style, not SQL style.)
+	 *--
+	 */
+	if (IntervalStyle == INTSTYLE_SQL_STANDARD && *field[0] == '-')
+	{
+		/* Check for additional explicit signs */
+		boolmore_signs = false;
+		for (i = 1; i < nf; i++)
+		{
+			if (*field[i] == '-' || *field[i] == '+')
+			{
+more_signs = true;
+break;
+			}
+		}
+		global_negative = !more_signs;
+	}
+
 	/* read through list backwards to pick up units before values */
 	for (i = nf - 1; i >= 0; i--)
 	{
@@ -3447,18 +3495,21 @@ DecodeInterval(char **field, int *ftype, int nf, int range,
 switch (type)
 {
 	case DTK_MICROSEC:
+		AdjustForSqlStandardGlobalNegative(&val, &fval, global_negative);
 		if (!AdjustMicroseconds(val, fval, 1, itm_in))
 			return DTERR_FIELD_OVERFLOW;
 		tmask = DTK_M(MICROSECOND);
 		break;
 
 	case DTK_MILLISEC:
+		AdjustForSqlStandardGlobalNegative(&val, &fval, global_negative);
 		if (!AdjustMicroseconds(val, fval, 1000, itm_in))
 			return DTERR_FIELD_OVERFLOW;
 		tmask = DTK_M(MILLISECOND);
 		break;
 
 	case DTK_SECOND:
+		AdjustForSqlStandardGlobalNegative(&val, &fval, global_negative);
 		if (!AdjustMicroseconds(val, fval, USECS_PER_SEC, itm_in))
 			return DTERR_FIELD_OVERFLOW;
 
@@ -3473,12 +3524,14 @@ DecodeInterval(char **field, int *ftype, int nf, int range,
 		break;
 
 	case DTK_MINUTE:
+		AdjustForSqlStandardGlobalNegative(&val, &fval, global_negative);
 		if (!AdjustMicroseconds(val, fval, USECS_PER_MINUTE, itm_in))
 			return DTERR_FIELD_OVERFLOW;
 		tmask

Re: Fix overflow in DecodeInterval

2022-04-02 Thread Tom Lane
Joseph Koshakow  writes:
> Ok I actually remember now, the issue is with the rounding
> code in AdjustFractMicroseconds.
> ...
> I believe it's possible for `frac -= usec;` to result in a value greater
> than 1 or less than -1 due to the lossiness of int64 to double
> conversions.

I think it's not, at least not for the interesting range of possible
values in this code.  Given that abs(frac) < 1 to start with, the
abs value of usec can't exceed the value of scale, which is at most 
USECS_PER_DAY so it's at most 37 or so bits, which is well within
the exact range for any sane implementation of double.  It would
take a very poor floating-point implementation to not get the right
answer here.  (And we're largely assuming IEEE-compliant floats these
days.)
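
(Concretely, assuming IEEE doubles: USECS_PER_DAY is 86400000000, which
is below 2^37, while a double represents every integer up to 2^53
exactly, so nothing is lost in that subtraction.)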

Anyway, the other issue indeed turns out to be easy to fix.
Attached is a v11 that deals with that.  If the cfbot doesn't
complain about it, I'll push this later today.

regards, tom lane

diff --git a/src/backend/utils/adt/datetime.c b/src/backend/utils/adt/datetime.c
index ba0ec35ac5..462f2ed7a8 100644
--- a/src/backend/utils/adt/datetime.c
+++ b/src/backend/utils/adt/datetime.c
@@ -21,6 +21,7 @@
 #include "access/htup_details.h"
 #include "access/xact.h"
 #include "catalog/pg_type.h"
+#include "common/int.h"
 #include "common/string.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -37,17 +38,31 @@ static int	DecodeNumber(int flen, char *field, bool haveTextMonth,
 static int	DecodeNumberField(int len, char *str,
 			  int fmask, int *tmask,
 			  struct pg_tm *tm, fsec_t *fsec, bool *is2digits);
+static int	DecodeTimeCommon(char *str, int fmask, int range,
+			 int *tmask, struct pg_itm *itm);
 static int	DecodeTime(char *str, int fmask, int range,
 	   int *tmask, struct pg_tm *tm, fsec_t *fsec);
+static int	DecodeTimeForInterval(char *str, int fmask, int range,
+  int *tmask, struct pg_itm_in *itm_in);
 static const datetkn *datebsearch(const char *key, const datetkn *base, int nel);
 static int	DecodeDate(char *str, int fmask, int *tmask, bool *is2digits,
 	   struct pg_tm *tm);
 static char *AppendSeconds(char *cp, int sec, fsec_t fsec,
 		   int precision, bool fillzeros);
-static void AdjustFractSeconds(double frac, struct pg_tm *tm, fsec_t *fsec,
-			   int scale);
-static void AdjustFractDays(double frac, struct pg_tm *tm, fsec_t *fsec,
-			int scale);
+static bool int64_multiply_add(int64 val, int64 multiplier, int64 *sum);
+static bool AdjustFractMicroseconds(double frac, int64 scale,
+	struct pg_itm_in *itm_in);
+static bool AdjustFractDays(double frac, int scale,
+			struct pg_itm_in *itm_in);
+static bool AdjustFractYears(double frac, int scale,
+			 struct pg_itm_in *itm_in);
+static bool AdjustMicroseconds(int64 val, double fval, int64 scale,
+			   struct pg_itm_in *itm_in);
+static bool AdjustDays(int64 val, int scale,
+	   struct pg_itm_in *itm_in);
+static bool AdjustMonths(int64 val, struct pg_itm_in *itm_in);
+static bool AdjustYears(int64 val, int scale,
+		struct pg_itm_in *itm_in);
 static int	DetermineTimeZoneOffsetInternal(struct pg_tm *tm, pg_tz *tzp,
 			pg_time_t *tp);
 static bool DetermineTimeZoneAbbrevOffsetInternal(pg_time_t t,
@@ -425,7 +440,7 @@ GetCurrentTimeUsec(struct pg_tm *tm, fsec_t *fsec, int *tzp)
  * Returns a pointer to the new end of string.  No NUL terminator is put
  * there; callers are responsible for NUL terminating str themselves.
  *
- * Note that any sign is stripped from the input seconds values.
+ * Note that any sign is stripped from the input sec and fsec values.
  */
 static char *
 AppendSeconds(char *cp, int sec, fsec_t fsec, int precision, bool fillzeros)
@@ -471,7 +486,7 @@ AppendSeconds(char *cp, int sec, fsec_t fsec, int precision, bool fillzeros)
 
 		/*
 		 * If we still have a non-zero value then precision must have not been
-		 * enough to print the number.  We punt the problem to pg_ltostr(),
+		 * enough to print the number.  We punt the problem to pg_ultostr(),
 		 * which will generate a correct answer in the minimum valid width.
 		 */
 		if (value)
@@ -496,39 +511,163 @@ AppendTimestampSeconds(char *cp, struct pg_tm *tm, fsec_t fsec)
 	return AppendSeconds(cp, tm->tm_sec, fsec, MAX_TIMESTAMP_PRECISION, true);
 }
 
+
+/*
+ * Add val * multiplier to *sum.
+ * Returns true if successful, false on overflow.
+ */
+static bool
+int64_multiply_add(int64 val, int64 multiplier, int64 *sum)
+{
+	int64		product;
+
+	if (pg_mul_s64_overflow(val, multiplier, &product) ||
+		pg_add_s64_overflow(*sum, product, sum))
+		return false;
+	return true;
+}
+
 /*
- * Multiply frac by scale (to produce seconds) and add to *tm & *fsec.
- * We assume the input frac is less than 1 so overflow is not an issue.
+ * Multiply frac by scale (to produce microseconds) and add to itm_in->tm_usec.
+ * Returns true if successful, false if itm_in overflows.
  */
-static void
-AdjustFractSeconds(double frac, struct p

Re: Fix overflow in DecodeInterval

2022-04-02 Thread Tom Lane
Joseph Koshakow  writes:
> I took a stab at this issue and the attached patch (which would be
> applied on top of your v10 patch) seems to fix the issue. Feel
> free to ignore it if you're already working on a fix.

You really only need to flip val/fval in one place.  More to the
point, there's also the hh:mm:ss paths to deal with; see my v11.

regards, tom lane




Re: JSON constructors and window functions

2022-04-02 Thread Andrew Dunstan


On 4/2/22 01:25, Jaime Casanova wrote:
> I got a crash running the below query on the regression database:
>
> """
> select pg_catalog.json_object_agg_unique(10,
> cast(ref_0.level2_no as int4)) 
>   over (partition by ref_0.parent_no 
>   order by ref_0.level2_no)
> from public.transition_table_level2 as ref_0;
> """
>
> Attached the backtrace.
>
> PS: I'm cc'ing Andrew and Nikita because my feeling is that this is 
> f4fb45d15c59d7add2e1b81a9d477d0119a9691a responsability.



Hmm. Thanks for the report. The code in json_unique_check_key() looks
sane enough, so the issue is probably elsewhere. I'll keep digging.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Fix overflow in DecodeInterval

2022-04-02 Thread Joseph Koshakow
On Sat, Apr 2, 2022 at 3:08 PM Tom Lane  wrote:
>
> Joseph Koshakow  writes:
> > Ok I actually remember now, the issue is with the rounding
> > code in AdjustFractMicroseconds.
> > ...
> > I believe it's possible for `frac -= usec;` to result in a value greater
> > than 1 or less than -1 due to the lossiness of int64 to double
> > conversions.
>
> I think it's not, at least not for the interesting range of possible
> values in this code.  Given that abs(frac) < 1 to start with, the
> abs value of usec can't exceed the value of scale, which is at most
> USECS_PER_DAY so it's at most 37 or so bits, which is well within
> the exact range for any sane implementation of double.  It would
> take a very poor floating-point implementation to not get the right
> answer here.  (And we're largely assuming IEEE-compliant floats these
> days.)

Ah, I see. That makes sense to me.

On Sat, Apr 2, 2022 at 3:10 PM Tom Lane  wrote:
>
> Joseph Koshakow  writes:
> > I took a stab at this issue and the attached patch (which would be
> > applied on top of your v10 patch) seems to fix the issue. Feel
> > free to ignore it if you're already working on a fix.
>
> You really only need to flip val/fval in one place.  More to the
> point, there's also the hh:mm:ss paths to deal with; see my v11.

Good point. Thanks again for all the help!

- Joe Koshakow




Re: A qsort template

2022-04-02 Thread Thomas Munro
On Sun, Apr 3, 2022 at 12:41 AM John Naylor
 wrote:
> On Sat, Apr 2, 2022 at 5:27 PM Thomas Munro  wrote:
> > Reproduced locally, using the same few lines from the cluster.sql
> > test.  I'll try to dig more tomorrow...
>
> Thanks! Unfortunately I can't reproduce locally with clang 13/gcc 11,
> with -Og or -O2 with CFLAGS="-fsanitize=undefined,alignment" ...

Maybe you need to add -fno-sanitize-recover=all to make it crash,
otherwise it just prints the warning and keeps going.




Re: Fix overflow in DecodeInterval

2022-04-02 Thread Tom Lane
Joseph Koshakow  writes:
> On Sat, Apr 2, 2022 at 3:08 PM Tom Lane  wrote:
>> I think it's not, at least not for the interesting range of possible
>> values in this code.  Given that abs(frac) < 1 to start with, the
>> abs value of usec can't exceed the value of scale, which is at most
>> USECS_PER_DAY so it's at most 37 or so bits, which is well within
>> the exact range for any sane implementation of double.  It would
>> take a very poor floating-point implementation to not get the right
>> answer here.  (And we're largely assuming IEEE-compliant floats these
>> days.)

> Ah, I see. That makes sense to me.

Cool.  I've pushed the patch.

regards, tom lane




Re: A qsort template

2022-04-02 Thread Justin Pryzby
On Sat, Apr 02, 2022 at 06:41:30PM +0700, John Naylor wrote:
> On Sat, Apr 2, 2022 at 5:27 PM Thomas Munro  wrote:
> > Reproduced locally, using the same few lines from the cluster.sql
> > test.  I'll try to dig more tomorrow...
> 
> Thanks! Unfortunately I can't reproduce locally with clang 13/gcc 11,
> with -Og or -O2 with CFLAGS="-fsanitize=undefined,alignment" ...

Like Thomas just said, I had to use:
CFLAGS="-Og -fsanitize=undefined,alignment -fno-sanitize-recover=all"

I'm a couple of steps out of my league here, but it may be an issue with:

commit 4ea51cdfe85ceef8afabceb03c446574daa0ac23
Author: Robert Haas 
Date:   Mon Jan 19 15:20:31 2015 -0500

Use abbreviated keys for faster sorting of text datums.

This is enough to avoid the crash, which might be a useful hint..

@@ -4126,22 +4126,23 @@ copytup_cluster(Tuplesortstate *state, SortTuple *stup, void *tup)
 	/*
 	 * set up first-column key value, and potentially abbreviate, if it's a
 	 * simple column
 	 */
+	stup->isnull1 = false;
 	if (state->indexInfo->ii_IndexAttrNumbers[0] == 0)
 		return;
 
 	original = heap_getattr(tuple,
 							state->indexInfo->ii_IndexAttrNumbers[0],
 							state->tupDesc,
 							&stup->isnull1);




Re: A qsort template

2022-04-02 Thread Andres Freund
Hi,

On 2022-04-03 08:07:58 +1200, Thomas Munro wrote:
> On Sun, Apr 3, 2022 at 12:41 AM John Naylor
>  wrote:
> > On Sat, Apr 2, 2022 at 5:27 PM Thomas Munro  wrote:
> > > Reproduced locally, using the same few lines from the cluster.sql
> > > test.  I'll try to dig more tomorrow...
> >
> > Thanks! Unfortunately I can't reproduce locally with clang 13/gcc 11,
> > with -Og or -O2 with CFLAGS="-fsanitize=undefined,alignment" ...
> 
> Maybe you need to add -fno-sanitize-recover=all to make it crash,
> otherwise it just prints the warning and keeps going.

I commented with a few more details on 
https://postgr.es/m/20220402201557.thanbsxcql5lk6pc%40alap3.anarazel.de
and an preliminary analysis in
https://www.postgresql.org/message-id/20220402203344.ahup2u5n73cdbbcv%40alap3.anarazel.de

Greetings,

Andres Freund




Re: A qsort template

2022-04-02 Thread Thomas Munro
On Sun, Apr 3, 2022 at 8:20 AM Justin Pryzby  wrote:
> @@ -4126,22 +4126,23 @@ copytup_cluster(Tuplesortstate *state, SortTuple 
> *stup, void *tup)

> +   stup->isnull1 = false;

Looks like I might have failed to grok the scheme for encoding null
into SortTuple objects.  It's clearly uninitialised in some paths,
with a special 0 value in datum1.  Will need to look more closely with
more coffee...




Re: A qsort template

2022-04-02 Thread Andres Freund
Hi,

On 2022-04-02 15:20:27 -0500, Justin Pryzby wrote:
> On Sat, Apr 02, 2022 at 06:41:30PM +0700, John Naylor wrote:
> > On Sat, Apr 2, 2022 at 5:27 PM Thomas Munro  wrote:
> > > Reproduced locally, using the same few lines from the cluster.sql
> > > test.  I'll try to dig more tomorrow...
> > 
> > Thanks! Unfortunately I can't reproduce locally with clang 13/gcc 11,
> > with -Og or -O2 with CFLAGS="-fsanitize=undefined,alignment" ...
> 
> Like Thomas just said, I had to use:
> CFLAGS="-Og -fsanitize=undefined,alignment -fno-sanitize-recover=all
> 
> I'm a couple few steps out of my league here, but it may be an issue with:
> 
> commit 4ea51cdfe85ceef8afabceb03c446574daa0ac23
> Author: Robert Haas 
> Date:   Mon Jan 19 15:20:31 2015 -0500
> 
> Use abbreviated keys for faster sorting of text datums.
> 
> This is enough to avoid the crash, which might be a useful hint..
>
> @@ -4126,22 +4126,23 @@ copytup_cluster(Tuplesortstate *state, SortTuple 
> *stup, void *tup)
> /*
>  * set up first-column key value, and potentially abbreviate, if it's 
> a
>  * simple column
>  */
> +   stup->isnull1 = false;
> if (state->indexInfo->ii_IndexAttrNumbers[0] == 0)
> return;
>  
> original = heap_getattr(tuple,
> 
> state->indexInfo->ii_IndexAttrNumbers[0],
> state->tupDesc,
> &stup->isnull1);

I don't think that can be correct - the column can be NULL afaics. And I don't
think in that patch it's needed, because it always goes through ->comparetup()
when state->onlyKey isn't explicitly set. Which tuplesort_begin_cluster() as
well as several others don't.  And you'd just sort an uninitialized datum
immediately after.

It's certainly not pretty that copytup_cluster() can use SortTuples without
actually using SortTuples. Afaics it basically only computes isnull1/datum1 if
state->indexInfo->ii_IndexAttrNumbers[0] == 0.

Greetings,

Andres Freund




Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2022-04-02 Thread Greg Stark
The tests for this seem to need adjustments.

[12:41:09.403] test pg_stat_statements ... FAILED 180 ms

diff -U3 
/tmp/cirrus-ci-build/contrib/pg_stat_statements/expected/pg_stat_statements.out
/tmp/cirrus-ci-build/contrib/pg_stat_statements/results/pg_stat_statements.out
--- 
/tmp/cirrus-ci-build/contrib/pg_stat_statements/expected/pg_stat_statements.out
2022-04-02 12:37:42.201823383 +
+++ 
/tmp/cirrus-ci-build/contrib/pg_stat_statements/results/pg_stat_statements.out
2022-04-02 12:41:09.219563204 +
@@ -1174,8 +1174,8 @@
 ORDER BY query;
query   | reset_ts_match
 ---+
- SELECT $1,$2 AS "STMTTS2" | f
  SELECT $1 AS "STMTTS1"| t
+ SELECT $1,$2 AS "STMTTS2" | f
 (2 rows)

 -- check that minmax reset does not set stats_reset


Hm. Is this a collation problem?




Re: A qsort template

2022-04-02 Thread Thomas Munro
On Sun, Apr 3, 2022 at 9:03 AM Andres Freund  wrote:
> It's certainly not pretty that copytup_cluster() can use SortTuples without
> actually using SortTuples. Afaics it basically only computes isnull1/datum1 if
> state->indexInfo->ii_IndexAttrNumbers[0] == 0.

I think we just need to decide up front if we're in a situation that
can't provide datum1/isnull1 (in this case because it's an expression
index), and skip the optimised paths.  Here's an experimental patch...
still looking into whether there are more cases like this...

(There's also room to recognise when you don't even need to look at
isnull1 for a less branchy optimised sort, but that was already
discussed and put off for later.)
From a8a7c9ec4f0b96ed9d889d008731864f3d4e87d1 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sun, 3 Apr 2022 09:41:04 +1200
Subject: [PATCH] WIP: Fix tuplesort optimizations for expression-based
 CLUSTER.

---
 src/backend/utils/sort/tuplesort.c | 20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 361527098f..34714d9baf 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -268,6 +268,8 @@ struct Tuplesortstate
 	MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
 	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */
 
+	bool		disable_datum1;	/* disable leading value-based optimizations */
+
 	/*
 	 * These function pointers decouple the routines that must know what kind
 	 * of tuple we are sorting from the routines that don't need to know it.
@@ -1095,6 +1097,13 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	state->indexInfo = BuildIndexInfo(indexRel);
 
+	/*
+	 * If we don't have a simple attribute, disable the use of datum1/isnull1.
+	 * copytup_cluster() doesn't know how to compute expressions.
+	 */
+	if (state->indexInfo->ii_IndexAttrNumbers[0] == 0)
+		state->disable_datum1 = true;
+
 	state->tupDesc = tupDesc;	/* assume we need not copy tupDesc */
 
 	indexScanKey = _bt_mkscankey(indexRel, NULL);
@@ -3593,20 +3602,27 @@ tuplesort_sort_memtuples(Tuplesortstate *state)
 
 	if (state->memtupcount > 1)
 	{
-		/* Do we have a specialization for the leading column's comparator? */
+		/*
+		 * Do we have a specialization for the leading column's comparator,
+		 * and has the leading column's value or abbreviation been stored in
+		 * datum1/isnull1?
+		 */
 		if (state->sortKeys &&
+			!state->disable_datum1 &&
 			state->sortKeys[0].comparator == ssup_datum_unsigned_cmp)
 		{
 			elog(DEBUG1, "qsort_tuple_unsigned");
 			qsort_tuple_unsigned(state->memtuples, state->memtupcount, state);
 		}
 		else if (state->sortKeys &&
+ !state->disable_datum1 &&
  state->sortKeys[0].comparator == ssup_datum_signed_cmp)
 		{
 			elog(DEBUG1, "qsort_tuple_signed");
 			qsort_tuple_signed(state->memtuples, state->memtupcount, state);
 		}
 		else if (state->sortKeys &&
+ !state->disable_datum1 &&
  state->sortKeys[0].comparator == ssup_datum_int32_cmp)
 		{
 			elog(DEBUG1, "qsort_tuple_int32");
@@ -4134,7 +4150,7 @@ copytup_cluster(Tuplesortstate *state, SortTuple *stup, void *tup)
 	 * set up first-column key value, and potentially abbreviate, if it's a
 	 * simple column
 	 */
-	if (state->indexInfo->ii_IndexAttrNumbers[0] == 0)
+	if (state->disable_datum1)
 		return;
 
 	original = heap_getattr(tuple,
-- 
2.35.1



Re: A qsort template

2022-04-02 Thread Andres Freund
Hi,

On 2022-04-03 09:45:13 +1200, Thomas Munro wrote:
> On Sun, Apr 3, 2022 at 9:03 AM Andres Freund  wrote:
> > It's certainly not pretty that copytup_cluster() can use SortTuples without
> > actually using SortTuples. Afaics it basically only computes isnull1/datum1 
> > if
> > state->indexInfo->ii_IndexAttrNumbers[0] == 0.
> 
> I think we just need to decide up front if we're in a situation that
> can't provide datum1/isnull1 (in this case because it's an expression
> index), and skip the optimised paths.  Here's an experimental patch...
> still looking into whether there are more cases like this...

That's a lot of redundant checks. How about putting all the checks for
optimized paths into one if (state->sortKeys && !state->disable_datum1)?

I'm a bit worried that none of the !ubsan tests failed on this...

Greetings,

Andres Freund




Re: [PATCH] Improve amcheck to also check UNIQUE constraint in btree index.

2022-04-02 Thread Greg Stark
This patch was broken by d16773cdc86210493a2874cb0cf93f3883fcda73 "Add
macros in hash and btree AMs to get the special area of their pages"

If it's really just a few macros it should be easy enough to merge but
it would be good to do a rebase given the number of other commits
since February anyways.

On Mon, 21 Feb 2022 at 09:14, Maxim Orlov  wrote:
>
> I've updated the patch due to recent changes by Daniel Gustafsson 
> (549ec201d6132b7).
>
> --
> Best regards,
> Maxim Orlov.



-- 
greg




Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2022-04-02 Thread Tom Lane
Greg Stark  writes:
> The tests for this seem to need adjustments.
> [12:41:09.403] test pg_stat_statements ... FAILED 180 ms

> diff -U3 
> /tmp/cirrus-ci-build/contrib/pg_stat_statements/expected/pg_stat_statements.out
> /tmp/cirrus-ci-build/contrib/pg_stat_statements/results/pg_stat_statements.out
> --- 
> /tmp/cirrus-ci-build/contrib/pg_stat_statements/expected/pg_stat_statements.out
> 2022-04-02 12:37:42.201823383 +
> +++ 
> /tmp/cirrus-ci-build/contrib/pg_stat_statements/results/pg_stat_statements.out
> 2022-04-02 12:41:09.219563204 +
> @@ -1174,8 +1174,8 @@
>  ORDER BY query;
> query   | reset_ts_match
>  ---+
> - SELECT $1,$2 AS "STMTTS2" | f
>   SELECT $1 AS "STMTTS1"| t
> + SELECT $1,$2 AS "STMTTS2" | f
>  (2 rows)

>  -- check that minmax reset does not set stats_reset

> Hm. Is this a collation problem?

Yeah, looks like it.  ORDER BY query COLLATE "C" might work better.

regards, tom lane




Re: Skipping logical replication transactions on subscriber side

2022-04-02 Thread Noah Misch
On Sat, Apr 02, 2022 at 08:44:45PM +0900, Masahiko Sawada wrote:
> On Sat, Apr 2, 2022 at 7:04 PM Amit Kapila  wrote:
> > On Sat, Apr 2, 2022 at 1:43 PM Noah Misch  wrote:
> > > Some options:
> > > - Move subskiplsn after subdbid, so it's always aligned anyway.  I've
> > >   confirmed that this lets the test pass, in 44s.
> > > - Move subskiplsn to the CATALOG_VARLEN section, despite its fixed length.
> >
> > +1 to any one of the above. I mildly prefer the first option as that
> > will allow us to access the value directly instead of going via
> > SysCacheGetAttr but I am fine either way.
> 
> +1. I also prefer the first option.

Sounds good to me.




CFBot failing with "Failed to start an instance"

2022-04-02 Thread Greg Stark
One patch is failing with what looks like a generic Cirrus issue:

https://cirrus-ci.com/task/5389918250729472

Failed to start an instance: INVALID_ARGUMENT: Operation with name
"operation-1648936682461-5dbb2fd37177b-5095285b-b153ee83" failed with
status = HttpJsonStatusCode{statusCode=INVALID_ARGUMENT} and message =
BAD REQUEST



-- 
greg




Re: Pluggable toaster

2022-04-02 Thread Greg Stark
Hm. It compiles but it's failing regression tests:

diff -U3 /tmp/cirrus-ci-build/contrib/dummy_toaster/expected/dummy_toaster.out
/tmp/cirrus-ci-build/contrib/dummy_toaster/results/dummy_toaster.out
--- /tmp/cirrus-ci-build/contrib/dummy_toaster/expected/dummy_toaster.out
2022-04-02 16:02:47.874360253 +
+++ /tmp/cirrus-ci-build/contrib/dummy_toaster/results/dummy_toaster.out
2022-04-02 16:07:20.878047769 +
@@ -20,186 +20,7 @@

+server closed the connection unexpectedly
+ This probably means the server terminated abnormally
+ before or while processing the request.
+connection to server was lost
I think this will require some real debugging, so I'm marking this
Waiting on Author.




Re: Pluggable toaster

2022-04-02 Thread Andres Freund
Hi, 

On April 2, 2022 6:20:36 PM PDT, Greg Stark  wrote:
>Hm. It compiles but it's failing regression tests:
>
>diff -U3 /tmp/cirrus-ci-build/contrib/dummy_toaster/expected/dummy_toaster.out
>/tmp/cirrus-ci-build/contrib/dummy_toaster/results/dummy_toaster.out
>--- /tmp/cirrus-ci-build/contrib/dummy_toaster/expected/dummy_toaster.out
>2022-04-02 16:02:47.874360253 +
>+++ /tmp/cirrus-ci-build/contrib/dummy_toaster/results/dummy_toaster.out
>2022-04-02 16:07:20.878047769 +
>@@ -20,186 +20,7 @@
>
>+server closed the connection unexpectedly
>+ This probably means the server terminated abnormally
>+ before or while processing the request.
>+connection to server was lost
>I think this will require some real debugging, so I'm marking this
>Waiting on Author.

Yes, dumps core (just like in several previous runs):

https://cirrus-ci.com/task/4710272324599808?logs=cores#L44

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: Higher level questions around shared memory stats

2022-04-02 Thread Andres Freund
Hi,

Alvaro, added you because you were the original author for a lot of that
code. Fujii, you touched it last...


6) Should any part of the "reuse_stats" logic in table_recheck_autovac() be
kept?

With the shared memory stats patch, autovacuum can cheaply access individual
stats, so the whole scheme for avoiding stats accesses is moot.

So far the patchset had touched autovacuum.c a bit too lightly, removing the
autovac_refresh_stats() call and rephrasing a few comments, but not removing
e.g. the reuse_stats variable / branches in table_recheck_autovac. Which
doesn't seem great.  I've just tried to go through and update the autovacuum.c
code and comments in light of the shared memory stats patch..

I don't really see a point in keeping any of it - but I was curious whether
anybody else does?

I'm still polishing, so I didn't want to send a whole new version with these
adjustments to the list yet, but you can see the state as of the time of
sending this email at [1].

Greetings,

Andres Freund

[1] 
https://github.com/anarazel/postgres/commit/276c053110cfe71bf134519e8e4ab053e6d2a7f0#diff-3035fb5dace7bcd77f0eeafe32458cd808c5adb83d62ebdf54f0170cf7db93e7




Re: CFBot failing with "Failed to start an instance"

2022-04-02 Thread Thomas Munro
On Sun, Apr 3, 2022 at 1:07 PM Greg Stark  wrote:
> https://cirrus-ci.com/task/5389918250729472
>
> Failed to start an instance: INVALID_ARGUMENT: Operation with name
> "operation-1648936682461-5dbb2fd37177b-5095285b-b153ee83" failed with
> status = HttpJsonStatusCode{statusCode=INVALID_ARGUMENT} and message =
> BAD REQUEST

I guess I should teach it to retry if it fails like that, but it does
try again in ~24 hours...

Here's the test history for that branch in case it's helpful (I really
should probably put these links on the page somewhere...):

https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/37/3234




CLUSTER sort on abbreviated expressions is broken

2022-04-02 Thread Thomas Munro
Hi,

Independently of a problem with a recent commit, it seems that
$SUBJECT in all releases (well, I only tested as far back as 11).  I
attach an addition to the tests to show this, but here's a stand-alone
repro:

DROP TABLE IF EXISTS clstr_expression;

CREATE TABLE clstr_expression(id serial primary key, a int, b text COLLATE "C");
INSERT INTO clstr_expression(a, b) SELECT g.i % 42, 'prefix'||g.i FROM
generate_series(1, 133) g(i);
CREATE INDEX clstr_expression_minus_a ON clstr_expression ((-a), b);
CREATE INDEX clstr_expression_upper_b ON clstr_expression ((upper(b)));

CLUSTER clstr_expression USING clstr_expression_minus_a;
WITH rows AS
  (SELECT ctid, lag(a) OVER (ORDER BY ctid) AS la, a FROM clstr_expression)
SELECT * FROM rows WHERE la < a;

All good, and now for the part that I think is misbehaving:

CLUSTER clstr_expression USING clstr_expression_upper_b;
WITH rows AS
  (SELECT ctid, lag(b) OVER (ORDER BY ctid) AS lb, b FROM clstr_expression)
SELECT * FROM rows WHERE upper(lb) > upper(b);

That should produce no rows.  It works as expected if you SET
enable_seqscan = off and re-run CLUSTER, revealing that it's the
seq-scan-and-sort strategy that is broken.  It also works as expected
for not-yet-abbreviatable collations.
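
(For anyone reproducing this, a minimal sketch of that check, reusing the
table from the repro above; it only restates the enable_seqscan workaround
described in the previous paragraph and is not part of the attached patch:)

SET enable_seqscan = off;
CLUSTER clstr_expression USING clstr_expression_upper_b;
-- With the index-scan strategy the heap is rewritten in upper(b) order,
-- so this should now return zero rows.
WITH rows AS
  (SELECT ctid, lag(b) OVER (ORDER BY ctid) AS lb, b FROM clstr_expression)
SELECT * FROM rows WHERE upper(lb) > upper(b);
RESET enable_seqscan;
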
From 580de16e30506eace4cdd962c1ab31a471da5dff Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sun, 3 Apr 2022 14:13:00 +1200
Subject: [PATCH] Add simple test for CLUSTER on expression indexes.

Assert that the CLUSTER command rewrites the heap in the expected order,
when using an expression index.

XXX CLUSTER on upper(b) for a COLLATE "C" column is currently broken, so this doesn't yet pass unless you disable seqscans
---
 src/test/regress/expected/cluster.out | 14 ++
 src/test/regress/sql/cluster.sql  |  6 ++
 2 files changed, 20 insertions(+)

diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index e46a66952f..0da85faff8 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -511,6 +511,13 @@ SELECT * FROM clstr_expression WHERE -a = -3 ORDER BY -a, b;
 COMMIT;
 -- and after clustering on clstr_expression_minus_a
 CLUSTER clstr_expression USING clstr_expression_minus_a;
+WITH rows AS
+  (SELECT ctid, lag(a) OVER (ORDER BY ctid) AS la, a FROM clstr_expression)
+SELECT * FROM rows WHERE la < a;
+ ctid | la | a 
+------+----+---
+(0 rows)
+
 BEGIN;
 SET LOCAL enable_seqscan = false;
 EXPLAIN (COSTS OFF) SELECT * FROM clstr_expression WHERE upper(b) = 'PREFIX3';
@@ -545,6 +552,13 @@ SELECT * FROM clstr_expression WHERE -a = -3 ORDER BY -a, b;
 COMMIT;
 -- and after clustering on clstr_expression_upper_b
 CLUSTER clstr_expression USING clstr_expression_upper_b;
+WITH rows AS
+  (SELECT ctid, lag(b) OVER (ORDER BY ctid) AS lb, b FROM clstr_expression)
+SELECT * FROM rows WHERE upper(lb) > upper(b);
+ ctid | lb | b 
+------+----+---
+(0 rows)
+
 BEGIN;
 SET LOCAL enable_seqscan = false;
 EXPLAIN (COSTS OFF) SELECT * FROM clstr_expression WHERE upper(b) = 'PREFIX3';
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index aee9cf83e0..99ee533c8d 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -245,6 +245,9 @@ COMMIT;
 
 -- and after clustering on clstr_expression_minus_a
 CLUSTER clstr_expression USING clstr_expression_minus_a;
+WITH rows AS
+  (SELECT ctid, lag(a) OVER (ORDER BY ctid) AS la, a FROM clstr_expression)
+SELECT * FROM rows WHERE la < a;
 BEGIN;
 SET LOCAL enable_seqscan = false;
 EXPLAIN (COSTS OFF) SELECT * FROM clstr_expression WHERE upper(b) = 'PREFIX3';
@@ -255,6 +258,9 @@ COMMIT;
 
 -- and after clustering on clstr_expression_upper_b
 CLUSTER clstr_expression USING clstr_expression_upper_b;
+WITH rows AS
+  (SELECT ctid, lag(b) OVER (ORDER BY ctid) AS lb, b FROM clstr_expression)
+SELECT * FROM rows WHERE upper(lb) > upper(b);
 BEGIN;
 SET LOCAL enable_seqscan = false;
 EXPLAIN (COSTS OFF) SELECT * FROM clstr_expression WHERE upper(b) = 'PREFIX3';
-- 
2.35.1



Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2022-04-02 Thread Andrei Zubkov
Hi Greg,

On Sat, 2022-04-02 at 17:38 -0400, Greg Stark wrote:
> The tests for this seem to need adjustments.
> 
> [12:41:09.403] test pg_stat_statements ... FAILED 180 ms
>             query           | reset_ts_match
>  ---------------------------+----------------
> - SELECT $1,$2 AS "STMTTS2" | f
>   SELECT $1 AS "STMTTS1"    | t
> + SELECT $1,$2 AS "STMTTS2" | f
>  (2 rows)
> 
>  -- check that minmax reset does not set stats_reset
> 
> 
> Hm. Is this a collation problem?

Of course, thank you! I forgot to set the collation here.
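
(Concretely, as a hedged sketch only, the real fix being in the attached v11:
the test output has to be ordered with an explicit COLLATE "C" so that the
expected file does not depend on the server's locale, e.g.

SELECT query, stats_since
  FROM pg_stat_statements
 ORDER BY query COLLATE "C";

where stats_since is the column the patch adds.)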

v11 attached
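
(A minimal usage sketch of the minmax-only reset described in the commit
message below; the function signature and the stats_since / minmax_stats_since
columns are the ones the patch introduces, the rest is illustrative.)

-- Reset only the min/max statistics for all entries; other counters keep
-- accumulating.
SELECT pg_stat_statements_reset(0, 0, 0, minmax_only => true);

-- minmax_stats_since now records the time of that partial reset, while
-- stats_since still shows when each entry was first seen.
SELECT query, stats_since, minmax_stats_since FROM pg_stat_statements;
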
--
regards, Andrei
From c5900f1c689b2a74edbc30b66c9a73e25b85484a Mon Sep 17 00:00:00 2001
From: Andrei Zubkov 
Date: Sun, 3 Apr 2022 07:28:59 +0300
Subject: [PATCH] pg_stat_statements: Track statement entry timestamp

This patch adds a stats_since column to the pg_stat_statements view. The
column is populated with the current timestamp when a new statement is added
to the pg_stat_statements hashtable. It gives a clear picture of the
statistics collection interval for each statement, and it also lets sampling
solutions detect a statement that was evicted and re-entered between two
samples.
Such a sampling solution can derive any pg_stat_statements statistic for the
interval between two samples, except for the min/max statistics. To address
this, the patch adds the ability to reset min/max statistics independently of
a full statement reset, using the new minmax_only parameter of the
pg_stat_statements_reset(userid oid, dbid oid, queryid bigint,
minmax_only boolean) function.
The timestamp of such a reset is stored in the minmax_stats_since field of
each statement, and pg_stat_statements_reset() now returns this timestamp as
its result.

Discussion:
https://www.postgresql.org/message-id/flat/72e80e7b160a6eb189df9ef6f068cce3765d37f8.camel%40moonset.ru
---
 contrib/pg_stat_statements/Makefile   |   3 +-
 .../expected/oldextversions.out   |  61 +++
 .../expected/pg_stat_statements.out   | 361 +-
 .../pg_stat_statements--1.9--1.10.sql | 108 ++
 .../pg_stat_statements/pg_stat_statements.c   | 153 ++--
 .../pg_stat_statements.control|   2 +-
 .../pg_stat_statements/sql/oldextversions.sql |   8 +
 .../sql/pg_stat_statements.sql| 149 +++-
 doc/src/sgml/pgstatstatements.sgml|  58 ++-
 9 files changed, 745 insertions(+), 158 deletions(-)
 create mode 100644 contrib/pg_stat_statements/pg_stat_statements--1.9--1.10.sql

diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index 7fabd96f38d..edc40c8bbfb 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -6,7 +6,8 @@ OBJS = \
 	pg_stat_statements.o
 
 EXTENSION = pg_stat_statements
-DATA = pg_stat_statements--1.4.sql pg_stat_statements--1.8--1.9.sql \
+DATA = pg_stat_statements--1.4.sql \
+	pg_stat_statements--1.9--1.10.sql pg_stat_statements--1.8--1.9.sql \
 	pg_stat_statements--1.7--1.8.sql pg_stat_statements--1.6--1.7.sql \
 	pg_stat_statements--1.5--1.6.sql pg_stat_statements--1.4--1.5.sql \
 	pg_stat_statements--1.3--1.4.sql pg_stat_statements--1.2--1.3.sql \
diff --git a/contrib/pg_stat_statements/expected/oldextversions.out b/contrib/pg_stat_statements/expected/oldextversions.out
index f18c08838f5..70877948491 100644
--- a/contrib/pg_stat_statements/expected/oldextversions.out
+++ b/contrib/pg_stat_statements/expected/oldextversions.out
@@ -136,4 +136,65 @@ SELECT pg_get_functiondef('pg_stat_statements_reset'::regproc);
  
 (1 row)
 
+ALTER EXTENSION pg_stat_statements UPDATE TO '1.9';
+\d pg_stat_statements
+View "public.pg_stat_statements"
+   Column|   Type   | Collation | Nullable | Default 
+-+--+---+--+-
+ userid  | oid  |   |  | 
+ dbid| oid  |   |  | 
+ toplevel| boolean  |   |  | 
+ queryid | bigint   |   |  | 
+ query   | text |   |  | 
+ plans   | bigint   |   |  | 
+ total_plan_time | double precision |   |  | 
+ min_plan_time   | double precision |   |  | 
+ max_plan_time   | double precision |   |  | 
+ mean_plan_time  | double precision |   |  | 
+ stddev_plan_time| double precision |   |  | 
+ calls   | bigint   |   |  | 
+ total_exec_time | double precision |   |  | 
+ min_exec_time   | double precision |   |  | 
+ max_exec_time   | double precision |   |  | 
+ mean_exec_time  | double precision |   |  | 
+ stddev_exec_time| double precision |  

Re: A qsort template

2022-04-02 Thread Thomas Munro
On Sun, Apr 3, 2022 at 11:11 AM Andres Freund  wrote:
> On 2022-04-03 09:45:13 +1200, Thomas Munro wrote:
> > I think we just need to decide up front if we're in a situation that
> > can't provide datum1/isnull1 (in this case because it's an expression
> > index), and skip the optimised paths.  Here's an experimental patch...
> > still looking into whether there are more cases like this...

I didn't find anything else.

Maybe it'd be better if we explicitly declared whether datum1 is used
in each tuplesort mode's 'begin' function, right next to the code that
installs the set of routines that are in control of that?  Trying that
in this version.  Is it clearer what's going on like this?

> That's a lot of redundant checks. How about putting all the checks for
> optimized paths into one if (state->sortKeys && !state->disable_datum1)?

OK, sure.

> I'm a bit worried that none of the !ubsan tests failed on this...

In accordance with whoever-it-was-that-said-that's law about things
that aren't tested, this area turned out to be broken already[1].  Once
we fix that we should have a new test in the tree that might also
eventually have failed under this UB, given enough chaos.

[1] 
https://www.postgresql.org/message-id/CA%2BhUKG%2BbA%2BbmwD36_oDxAoLrCwZjVtST2fqe%3Db4%3DqZcmU7u89A%40mail.gmail.com
From b0a79f91edc6ce305f77373b92f31100fcad07d5 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sun, 3 Apr 2022 09:41:04 +1200
Subject: [PATCH v2] Fix tuplesort optimizations for expression-based CLUSTER.

When redirecting sorts to specialized variants, commit 69749243 failed
to handle the case where CLUSTER-sort decides not to initialize datum1
and isnull1.  Fix by hoisting that decision up a level and advertising
whether datum1 can be used, in the Tuplesortstate object.

Per reports from UBsan and Valgrind while running the cluster.sql test.

Reviewed-by: Andres Freund 
Discussion: https://postgr.es/m/CAFBsxsF1TeK5Fic0M%2BTSJXzbKsY6aBqJGNj6ptURuB09ZF6k_w%40mail.gmail.com
---
 src/backend/utils/sort/tuplesort.c | 73 ++
 1 file changed, 53 insertions(+), 20 deletions(-)

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 361527098f..58441ed638 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -306,6 +306,12 @@ struct Tuplesortstate
 	void		(*readtup) (Tuplesortstate *state, SortTuple *stup,
 			LogicalTape *tape, unsigned int len);
 
+	/*
+	 * Whether SortTuple's datum1 and isnull1 members are maintained by the
+	 * above routines.  If not, some sort specializations are disabled.
+	 */
+	bool		haveDatum1;
+
 	/*
 	 * This array holds the tuples now in sort memory.  If we are in state
 	 * INITIAL, the tuples are in no particular order; if we are in state
@@ -1016,6 +1022,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
 	state->readtup = readtup_heap;
+	state->haveDatum1 = true;
 
 	state->tupDesc = tupDesc;	/* assume we need not copy tupDesc */
 	state->abbrevNext = 10;
@@ -1095,6 +1102,15 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	state->indexInfo = BuildIndexInfo(indexRel);
 
+	/*
+	 * If we don't have a simple leading attribute, we can't easily initialize
+	 * datum1, so disable optimizations based on datum1.
+	 */
+	if (state->indexInfo->ii_IndexAttrNumbers[0] == 0)
+		state->haveDatum1 = false;
+	else
+		state->haveDatum1 = true;
+
 	state->tupDesc = tupDesc;	/* assume we need not copy tupDesc */
 
 	indexScanKey = _bt_mkscankey(indexRel, NULL);
@@ -1188,6 +1204,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	state->writetup = writetup_index;
 	state->readtup = readtup_index;
 	state->abbrevNext = 10;
+	state->haveDatum1 = true;
 
 	state->heapRel = heapRel;
 	state->indexRel = indexRel;
@@ -1262,6 +1279,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 	state->copytup = copytup_index;
 	state->writetup = writetup_index;
 	state->readtup = readtup_index;
+	state->haveDatum1 = true;
 
 	state->heapRel = heapRel;
 	state->indexRel = indexRel;
@@ -1302,6 +1320,7 @@ tuplesort_begin_index_gist(Relation heapRel,
 	state->copytup = copytup_index;
 	state->writetup = writetup_index;
 	state->readtup = readtup_index;
+	state->haveDatum1 = true;
 
 	state->heapRel = heapRel;
 	state->indexRel = indexRel;
@@ -1366,6 +1385,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	state->writetup = writetup_datum;
 	state->readtup = readtup_datum;
 	state->abbrevNext = 10;
+	state->haveDatum1 = true;
 
 	state->datumType = datumType;
 
@@ -3593,27 +3613,40 @@ tuplesort_sort_memtuples(Tuplesortstate *state)
 
 	if (state->memtupcount > 1)
 	{
-		/* Do we have a specialization for the leading column's comparator? */
-		if (state->sortKeys &&
-			state->sortKeys[0].comparator == ssup_datum_unsigned_cmp)
-		{
-			elog(DEBUG1, "qsort_tuple_unsigned");
-			qsort_tuple_unsigned(state->memtuples, state->memtupcount, state)

Re: [HACKERS] WIP aPatch: Pgbench Serialization and deadlock errors

2022-04-02 Thread Tatsuo Ishii
>> I think the problem is not merely one of documentation, but one of
>> bad design.  Up to now it was possible to tell what was what from
>> counting the number of columns in the output; but with this design,
>> that is impossible.  That should be fixed.  The first thing you have
>> got to do is drop the alternation { failures | serialization_failures
>> deadlock_failures }.  That doesn't make any sense in the first place:
>> counting serialization and deadlock failures doesn't make it impossible
>> for other errors to occur.  It'd probably make the most sense to have
>> three columns always, serialization, deadlock and total.
> 
> +1.
> 
>> Now maybe
>> that change alone is sufficient, but I'm not convinced, because the
>> multiple options at the end of the line mean we will never again be
>> able to add any more columns without reintroducing ambiguity.  I
>> would be happier if the syntax diagram were such that columns could
>> only be dropped from right to left.
> 
> Or those three columns always, sum_lag sum_lag_2, min_lag max_lag, skipped, 
> retried retries?

What about this? (a log line is not actually folded)
interval_start num_transactions sum_latency sum_latency_2 min_latency 
max_latency
failures serialization_failures deadlock_failures retried retries [ sum_lag 
sum_lag_2 min_lag max_lag [ skipped ] ]

failures:
always 0 (if --max-tries is 1, the default)
sum of serialization_failures and deadlock_failures (if --max-tries is not 1)

serialization_failures and deadlock_failures:
always 0 (if --max-tries is 1, the default)
0 or more (if --max-tries is not 1)

retried and retries:
always 0 (if --max-tries is 1, the default)
0 or more (if --max-tries is not 1)
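
For concreteness, an illustrative line in that layout (all numbers made up;
--max-tries > 1, --rate and --latency-limit assumed so that every optional
column is present):

1649000000 512 123456 789012345 100 2500 3 1 2 5 7 456789 12345678 10 500 2

i.e. interval_start, num_transactions, sum_latency, sum_latency_2,
min_latency, max_latency, failures (= 1 + 2), serialization_failures,
deadlock_failures, retried, retries, sum_lag, sum_lag_2, min_lag, max_lag,
skipped.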

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp