Re: recovery test failure on morepork with timestamp mystery

2022-05-13 Thread Mikael Kjellström



On 2022-05-13 04:14, Andres Freund wrote:


One interesting tidbit is that the log timestamps are computed differently
(with elog.c:get_formatted_log_time()) than the reset timestamp (with
GetCurrentTimestamp()). Both use gettimeofday() internally.

I wonder if there's a chance that somehow openbsd ends up with more usecs than
"fit" in a second in the result of gettimeofday()? The elog.c case would
truncate everything above a second away afaics:
/* 'paste' milliseconds into place... */
sprintf(msbuf, ".%03d", (int) (saved_timeval.tv_usec / 1000));
memcpy(formatted_log_time + 19, msbuf, 4);

whereas GetCurrentTimestamp() would add them to the timestamp:
result = (TimestampTz) tp.tv_sec -
((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) * SECS_PER_DAY);
result = (result * USECS_PER_SEC) + tp.tv_usec;
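A minimal standalone sketch (illustrative only, not PostgreSQL code; the out-of-range tv_usec is invented) of how the two paths would diverge if gettimeofday() ever returned more microseconds than fit in a second:

#include <stdio.h>
#include <string.h>

int
main(void)
{
    long    tv_sec = 100;       /* 1970-01-01 00:01:40 UTC */
    long    tv_usec = 1200000;  /* bogus: 1.2 s worth of microseconds */
    char    msbuf[8];
    char    logtime[] = "1970-01-01 00:01:40.000";
    long long stamp;

    /* elog.c-style: paste milliseconds into place; the extra second is lost */
    snprintf(msbuf, sizeof(msbuf), ".%03d", (int) (tv_usec / 1000));
    memcpy(logtime + 19, msbuf, 4);
    printf("log-style time  : %s\n", logtime);

    /* GetCurrentTimestamp()-style: the extra microseconds roll into the
     * next second of the resulting timestamp value */
    stamp = (long long) tv_sec * 1000000 + tv_usec;
    printf("timestamp value : %lld.%06lld s\n",
           stamp / 1000000, stamp % 1000000);
    return 0;
}

With the invented input above, the log-style string stays within second 100 while the timestamp value lands in second 101, i.e. the two representations disagree by more than a second.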



Well, I don't know if you remember, but there was a thread a while back 
with a test program (monotime.c) to test whether the clock could go 
backwards, and OpenBSD showed the following result when running the 
attached test program:


openbsd 5.9:

$ ./monotime
1021006 Starting
1017367 Starting
1003415 Starting
1007598 Starting
1021006 Stopped
1007598 Stopped
1017367 Stopped
1003415 Stopped

openbsd 6.9:

$ ./monotime
410310 Starting
547727 Starting
410310 Back 262032.372314102 => 262032.242045208
410310 Stopped
465180 Starting
255646 Starting
547727 Stopped
465180 Stopped
255646 Stopped

could that have something to do with it?

/Mikael
#include <sys/types.h>
#include <sys/time.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define NTHREAD 4
#define NTRY    5

/*
 * Each thread samples CLOCK_MONOTONIC repeatedly and reports if a later
 * reading is ever behind an earlier one.
 */
void *
start(void *dummy)
{
    int i;
    struct timespec ts0, ts1;

    printf("%d Starting\n", (int)getthrid());
    clock_gettime(CLOCK_MONOTONIC, &ts0);

    for (i = 0; i < NTRY; i++) {
        clock_gettime(CLOCK_MONOTONIC, &ts1);
        if (timespeccmp(&ts0, &ts1, <=)) {
            ts0 = ts1;
            continue;
        }
        printf("%d Back %lld.%09lu => %lld.%09lu\n",
            (int)getthrid(), ts0.tv_sec, ts0.tv_nsec, ts1.tv_sec,
            ts1.tv_nsec);
        break;
    }
    printf("%d Stopped\n", (int)getthrid());

    return (NULL);
}

int
main(int argc, char *argv[])
{
    int i, n = NTHREAD;
    pthread_t *threads;

    threads = calloc(n, sizeof(pthread_t));

    for (i = 0; i < n; i++)
        pthread_create(&threads[i], NULL, start, NULL);
    for (i = 0; i < n; i++)
        pthread_join(threads[i], NULL);

    return (0);
}



Re: Correct comment in ProcedureCreate() for pgstat_create_function() call.

2022-05-13 Thread Michael Paquier
On Fri, May 13, 2022 at 10:22:57AM +0530, Amul Sul wrote:
> Sorry, hit the send button too early :|

-   /* ensure that stats are dropped if transaction commits */
+   /* ensure that stats are dropped if transaction aborts */
if (!is_update)
pgstat_create_function(retval);

Given what pgstat_create_function() does to create the stats of a new
function in a transactional way, it looks like you are right.  Will
fix if there are no objections.
--
Michael


signature.asc
Description: PGP signature


Re: [PATCH] Compression dictionaries for JSONB

2022-05-13 Thread Aleksander Alekseev
Hi hackers,

> Here it the 2nd version of the patch:
>
> - Includes changes named above
> - Fixes a warning reported by cfbot
> - Fixes some FIXME's
> - The patch includes some simple tests now
> - A proper commit message was added

Here is the rebased version of the patch. Changes compared to v2 are minimal.

> Open questions:
> - Dictionary entries are currently stored as NameData, the same type that is
>   used for enums. Are we OK with the accompanying limitations? Any alternative
>   suggestions?
> - All in all, am I moving the right direction?

I would like to receive a little bit more feedback before investing more time
into this effort. This will allow me, if necessary, to alter the overall design
more easily.

-- 
Best regards,
Aleksander Alekseev


v3-0001-Compression-dictionaries-for-JSONB.patch
Description: Binary data


RE: Perform streaming logical transactions by background workers and parallel apply

2022-05-13 Thread houzj.f...@fujitsu.com
On Thursday, May 5, 2022 1:46 PM Peter Smith  wrote:

> Here are my review comments for v5-0001.
> I will take a look at the v5-0002 (TAP) patch another time.

Thanks for the comments !

> 4. Commit message
> 
> User can set the streaming option to 'on/off', 'apply'. For now,
> 'apply' means the streaming will be applied via a apply background if
> available. 'on' means the streaming transaction will be spilled to
> disk.
> 
> 
> I think "apply" might not be the best choice of values for this
> meaning, but I think Hou-san already said [1] that this was being
> reconsidered.

Yes, I am thinking over this along with some other related stuff [1] posted by
Amit and Sawada-san. Will change this in the next version.

[1] 
https://www.postgresql.org/message-id/flat/CAA4eK1%2B7D4qAQUQEE8zzQ0fGCqeBWd3rzTaY5N0jVs-VXFc_Xw%40mail.gmail.com

> 7. src/backend/commands/subscriptioncmds.c - defGetStreamingMode
> 
> +static char
> +defGetStreamingMode(DefElem *def)
> +{
> + /*
> + * If no parameter given, assume "true" is meant.
> + */
> + if (def->arg == NULL)
> + return SUBSTREAM_ON;
> 
> But is that right? IIUC all the docs said that the default is OFF.

I think it's right. "arg == NULL" means the user specified the streaming option
without a value, like CREATE SUBSCRIPTION xxx WITH (streaming). The value should
be 'on' in this case.
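For what it's worth, this follows the same convention as defGetBoolean(), where an option written without a value means "true". A small self-contained sketch of that convention (the names and enum here are illustrative only, not the patch's code):

#include <stdio.h>
#include <string.h>

typedef enum { STREAM_OFF, STREAM_ON, STREAM_APPLY } StreamingMode;

/* value == NULL represents "WITH (streaming)" given with no value */
static StreamingMode
streaming_mode(const char *value)
{
    if (value == NULL || strcmp(value, "on") == 0)
        return STREAM_ON;       /* bare option defaults to on, like a boolean */
    if (strcmp(value, "apply") == 0)
        return STREAM_APPLY;
    return STREAM_OFF;
}

int
main(void)
{
    printf("WITH (streaming)         -> %d\n", streaming_mode(NULL));
    printf("WITH (streaming = apply) -> %d\n", streaming_mode("apply"));
    printf("WITH (streaming = off)   -> %d\n", streaming_mode("off"));
    return 0;
}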


> 12. src/backend/replication/logical/origin.c - replorigin_session_setup
> 
> @@ -1110,7 +1110,11 @@ replorigin_session_setup(RepOriginId node)
>   if (curstate->roident != node)
>   continue;
> 
> - else if (curstate->acquired_by != 0)
> + /*
> + * We allow the apply worker to get the slot which is acquired by its
> + * leader process.
> + */
> + else if (curstate->acquired_by != 0 && acquire)
> 
> I still feel this is overly-confusing. Shouldn't the comment say "Allow the
> apply bgworker to get the slot...".
> 
> Also the parameter name 'acquire' is hard to reconcile with the
> comment. E.g. I feel all this would be easier to understand if the
> param was refactored with a name like 'bgworker' and the code was
> changed to:
> else if (curstate->acquired_by != 0 && !bgworker)
> 
> Of course, the value true/false would need to be flipped on calls too.
> This is the same as my previous comment [PSv4] #26.

I feel it's not a good idea to mention bgworker in origin.c. I have removed this
comment and added some other comments in worker.c.

> 26. src/backend/replication/logical/worker.c - apply_handle_stream_abort
> 
> + if (found)
> + {
> + elog(LOG, "rolled back to savepoint %s", spname);
> + RollbackToSavepoint(spname);
> + CommitTransactionCommand();
> + subxactlist = list_truncate(subxactlist, i + 1);
> + }
> 
> Should that elog use the "[Apply BGW #%u]" format like the others for BGW?

I feel the "[Apply BGW #%u]" prefix is a bit hacky, and some of these logs come
from the old patchset. I will recheck these logs, adjust them, and change some
log levels in the next version.

> 27. src/backend/replication/logical/worker.c - apply_handle_stream_abort
> 
> Should this function be setting stream_apply_worker = NULL somewhere
> when all is done?
> 29. src/backend/replication/logical/worker.c - apply_handle_stream_commit
> 
> I am unsure, but should something be setting the stream_apply_worker =
> NULL somewhere when all is done?

I think the worker is already set to NULL in apply_handle_stream_stop.


> 32. src/backend/replication/logical/worker.c - ApplyBgwShutdown
> 
> +/*
> + * Set the failed flag so that the main apply worker can realize we have
> + * shutdown.
> + */
> +static void
> +ApplyBgwShutdown(int code, Datum arg)
> 
> If the 'code' param is deliberately unused it might be better to say
> so in the comment...

Not sure about this. After searching the code, I think most of the callback
functions don't use the 'code' param or add comments for it.


> 45. src/backend/utils/activity/wait_event.c
> 
> @@ -388,6 +388,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
>   case WAIT_EVENT_HASH_GROW_BUCKETS_REINSERT:
>   event_name = "HashGrowBucketsReinsert";
>   break;
> + case WAIT_EVENT_LOGICAL_APPLY_WORKER_READY:
> + event_name = "LogicalApplyWorkerReady";
> + break;
> 
> I am not sure this is the best name for this event since the only
> place it is used (in apply_bgworker_wait_for) is not only waiting for
> READY state. Maybe a name like WAIT_EVENT_LOGICAL_APPLY_BGWORKER or
> WAIT_EVENT_LOGICAL_APPLY_WORKER_SYNC would be more appropriate? Need
> to change the wait_event.h also.

I noticed a similarly named "WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE", so I changed
this to WAIT_EVENT_LOGICAL_APPLY_WORKER_STATE_CHANGE.

> 47. src/test/regress/expected/subscription.out - missing test
> 
> Missing some test cases for all new option values? E.g. Where is the
> test using streaming value is set to 'apply'. Same comment as [PSv4]
> #81

The new option is tested in the second patch posted by Shi yu.

I addressed the other comments from Peter and the 2PC-related comment from Shi.
Here is the new version of the patch.

Best regards,
Hou zj


v6-0001-Pe

RE: Perform streaming logical transactions by background workers and parallel apply

2022-05-13 Thread houzj.f...@fujitsu.com
On Wednesday, May 11, 2022 1:10 PM Amit Kapila  wrote:
> 
> On Wed, May 11, 2022 at 9:35 AM Masahiko Sawada
>  wrote:
> >
> > On Tue, May 10, 2022 at 5:59 PM Amit Kapila 
> wrote:
> > >
> > > On Tue, May 10, 2022 at 10:39 AM Masahiko Sawada
>  wrote:
> > > >
> > > > Having it optional seems a good idea. BTW can the user configure
> > > > how many apply bgworkers can be used per subscription or in the
> > > > whole system? Like max_sync_workers_per_subscription, is it better
> > > > to have a configuration parameter or a subscription option for
> > > > that? If so, setting it to 0 probably means to disable the parallel 
> > > > apply
> feature.
> > > >
> > >
> > > Yeah, that might be useful but we are already giving an option while
> > > creating a subscription whether to allow parallelism, so will it be
> > > useful to give one more way to disable this feature? OTOH, having
> > > something like max_parallel_apply_workers/max_bg_apply_workers at
> > > the system level can give better control for how much parallelism
> > > the user wishes to allow for apply work.
> >
> > Or we can have something like
> > max_parallel_apply_workers_per_subscription that controls how many
> > parallel apply workers can launch per subscription. That also gives
> > better control for the number of parallel apply workers.
> >
> 
> I think we can go either way in this matter as both have their pros and cons. 
> I
> feel limiting the parallel workers per subscription gives better control but
> OTOH, it may not allow max usage of parallelism because some quota from
> other subscriptions might remain unused. Let us see what Hou-San or others
> think on this matter?

Thanks for Amit and Sawada-san's comments !
I will think over these approaches and reply soon.

Best regards,
Hou zj


Re: postgres_fdw "parallel_commit" docs

2022-05-13 Thread Etsuro Fujita
Hi Jonathan,

On Thu, May 12, 2022 at 10:32 PM Jonathan S. Katz  wrote:
> On 5/12/22 7:26 AM, Etsuro Fujita wrote:
> > Attached is an updated patch.  I'll commit the patch if no objections.
>
> I think this is much easier to read.

Cool!

> I made a few minor copy edits. Please see attached.

LGTM, so I pushed the patch.

Thanks for the patch and taking the time to improve this!

Best regards,
Etsuro Fujita




RE: Data is copied twice when specifying both child and parent table in publication

2022-05-13 Thread wangw.f...@fujitsu.com
On Fri, May 13, 2022 1:59 PM Amit Kapila  wrote:
> On Fri, May 13, 2022 at 7:32 AM wangw.f...@fujitsu.com
>  wrote:
> >
> > Attach the patches.(Only changed the patch for HEAD.).
> >
> 
> Few comments:
> =
Thanks for your comments.

> 1.
> @@ -1135,6 +1172,15 @@ pg_get_publication_tables(PG_FUNCTION_ARGS)
>   if (publication->pubviaroot)
>   tables = filter_partitions(tables);
>   }
> + pfree(elems);
> +
> + /*
> + * We need an additional filter for this case : A partition table is
> + * published in a publication with viaroot, and its parent or child
> + * table is published in another publication without viaroot. In this
> + * case, we should publish only parent table.
> + */
> + tables = filter_partitions(tables);
> 
> Do we need to filter partitions twice? Can't we check if any of the 
> publications
> has 'pubviaroot' option set, if so, call filter_partitions at the end?
Improved it according to your suggestion.

> 2.  " FROM pg_class c JOIN pg_namespace n"
> + "   ON n.oid = c.relnamespace,"
> + " LATERAL pg_get_publication_tables(array[ %s ]) gst"
> 
> Here, it is better to have an alias name as gpt.
Improved it according to your suggestion.

> 3.
>   }
> + pfree(elems);
> +
> 
> An extra line between these two lines makes it looks slightly better.
Improved it according to your suggestion.

> 4. Not able to apply patch cleanly.
> patching file src/test/subscription/t/013_partition.pl
> Hunk #1 FAILED at 477.
> Hunk #2 FAILED at 556.
> Hunk #3 FAILED at 584.
> 3 out of 3 hunks FAILED -- saving rejects to file
> src/test/subscription/t/013_partition.pl.rej
> patching file src/test/subscription/t/028_row_filter.pl
> Hunk #1 succeeded at 394 (offset 1 line).
> Hunk #2 FAILED at 722.
> 1 out of 2 hunks FAILED -- saving rejects to file
> src/test/subscription/t/028_row_filter.pl.rej
> patching file src/test/subscription/t/031_column_list.pl
> Hunk #1 succeeded at 948 (offset -92 lines).
> Hunk #2 FAILED at 1050.
> 1 out of 2 hunks FAILED -- saving rejects to file
> src/test/subscription/t/031_column_list.pl.rej
The new patch applies cleanly now.

Attached the patches. (Only the patch for HEAD changed.)
1. Optimized the code: reduced calls to the function filter_partitions.
[suggestion by Amit-san]
2. Improved the alias name in SQL. [suggestion by Amit-san]
3. Improved coding alignment. [suggestion by Amit-san]
4. Did some optimizations for list concatenation.

Regards,
Wang wei


HEAD_v4-0001-Fix-data-replicated-twice-when-specifying-PUBLISH.patch
Description:  HEAD_v4-0001-Fix-data-replicated-twice-when-specifying-PUBLISH.patch


REL14_v4-0001-Fix-data-replicated-twice-when-specifying-PUBLISH.patch
Description:  REL14_v4-0001-Fix-data-replicated-twice-when-specifying-PUBLISH.patch


Backends stunk in wait event IPC/MessageQueueInternal

2022-05-13 Thread Japin Li


Hi, hackers

I had an incident on my Postgres 14 where queries hung in wait event
IPC / MessageQueueInternal, MessageQueueReceive.  It looks like [1];
however, that thread doesn't have any discussion.

The process cannot be terminated by pg_terminate_backend(), although
it returns true.

Here is the call stack comes from pstack:

485073: /opt/local/pgsql/14/bin/postgres
 fb7fef216f4a ioctl (d, d001, fb7fffdfa0e0)
 008b8ec2 WaitEventSetWait  () + 112
 008b920f WaitLatch () + 6f
 008bf434 shm_mq_wait_internal () + 64
 008bff74 shm_mq_receive () + 2b4
 0079fdc8 TupleQueueReaderNext () + 28
 0077d8ca gather_merge_readnext () + 13a
 0077db25 ExecGatherMerge () + 215
 00790675 ExecNextLoop () + 175
 00790675 ExecNextLoop () + 175
 0076267d standard_ExecutorRun () + fd
 fb7fe3965fbd pgss_executorRun () + fd
 008df99b PortalRunSelect () + 1cb
 008e0dcf PortalRun () + 17f
 008ddacd PostgresMain () + 100d
 00857f62 ServerLoop () + cd2
 00858cee main () + 453
 005ab777 _start_crt () + 87
 005ab6d8 _start () + 18


Any suggestions?  Thanks in advance!

[1] 
https://www.postgresql.org/message-id/flat/E9FA92C2921F31408041863B74EE4C2001A479E590%40CCPMAILDAG03.cantab.local

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: Backends stunk in wait event IPC/MessageQueueInternal

2022-05-13 Thread Justin Pryzby
On Fri, May 13, 2022 at 06:16:23PM +0800, Japin Li wrote:
> I had an incident on my Postgres 14 that queries hung in wait event
> IPC / MessageQueueInternal, MessageQueueReceive.  It likes [1],
> however, it doesn't have any discussions.

If the process is still running, or if the problem recurs, I suggest to create
a corefile with gcore, aka gdb generate-core-file.  Then, we can look at the
backtrace at our leisure, even if the cluster needed to be restarted right
away.

What minor version of postgres is this, and what OS ?

-- 
Justin




Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication

2022-05-13 Thread Bharath Rupireddy
On Tue, May 10, 2022 at 5:55 PM Andrey Borodin  wrote:
>
> > On 10 May 2022, at 12:59, Bharath Rupireddy 
> >  wrote:
> >
> > If okay, I can make the GUC behave this way - value 0 existing
> > behaviour i.e. no wait for sync repl ack, just process query cancels
> > and proc die interrupts immediately; value -1 wait unboundedly for the
> > ack; value > 0 wait for specified milliseconds for the ack.
> +1 if we make -1 and 0 only valid values.
>
> > query cancels or proc die interrupts
>
> Please note, that typical HA tool would need to handle query cancels and proc 
> die interrupts differently.

Agree.

> When the network is partitioned and somewhere standby is promoted you 
> definitely want infinite wait for cancels.

When the standby is promoted, there is no point in the old primary waiting
forever for the ack, assuming we are going to discard it.

> Yet once upon a time you want to shutdown postgres without coredump - thus 
> proc die needs to be processed.

I think users can still have the flexibility to set the right amounts
of time to process cancel and proc die interrupts.

IMHO, the GUC can still have 0, -1 and values > 0 in milliseconds; let
the users decide what they want. Do you see any problems with this?
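A tiny illustrative sketch (names invented here, not the actual patch) of how those three settings could be interpreted while waiting for the sync replication ack:

#include <stdbool.h>
#include <stdio.h>

/*
 * wait_ms == 0 : honour cancel/proc-die interrupts immediately (today's
 *                behaviour)
 * wait_ms < 0  : wait for the ack unboundedly
 * wait_ms > 0  : honour interrupts only after that many milliseconds
 */
static bool
may_process_interrupts(int wait_ms, long waited_ms)
{
    if (wait_ms == 0)
        return true;
    if (wait_ms < 0)
        return false;
    return waited_ms >= wait_ms;
}

int
main(void)
{
    printf("%d %d %d\n",
           may_process_interrupts(0, 0),        /* 1: react at once */
           may_process_interrupts(-1, 60000),   /* 0: keep waiting   */
           may_process_interrupts(5000, 6000)); /* 1: deadline passed */
    return 0;
}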

Thoughts?

Regards,
Bharath Rupireddy.




Re: Comments on Custom RMGRs

2022-05-13 Thread Simon Riggs
On Fri, 13 May 2022 at 05:13, Jeff Davis  wrote:
>
> On Thu, 2022-05-12 at 22:26 +0100, Simon Riggs wrote:
> > I see multiple uses for the rm_checkpoint() point proposed and I've
> > been asked multiple times for a checkpoint hook.
>
> Can you elaborate and/or link to a discussion?

Those were conversations away from Hackers, but I'm happy to share.

The first was a discussion about a data structure needed by BDR about
4 years ago. In the absence of a pluggable checkpoint, the solution
was forced to use a normal table, which wasn't very satisfactory.

The second was a more recent conversation with Mike Stonebraker, at
the end of 2021. He was very keen to remove the buffer manager
entirely, which requires that we have a new smgr, which then needs new
code to allow it to be written to disk at checkpoint time, which then
requires some kind of pluggable code at checkpoint time. (Mike was
also keen to remove WAL, but that's another story entirely!).

The last use case was unlogged indexes, which need to be read from
disk at startup or rebuilt after crash, which requires RmgrStartup to
work both with and without InRedo=true.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Can postgres ever delete the recycled future WAL files to free-up disk space if max_wal_size is reduced or wal_recycle is set to off?

2022-05-13 Thread Bharath Rupireddy
On Mon, May 9, 2022 at 6:47 PM Bharath Rupireddy
 wrote:
>
> On Fri, May 6, 2022 at 10:20 PM Tom Lane  wrote:
> >
> > Bharath Rupireddy  writes:
> > > Can postgres delete the recycled future WAL files once max_wal_size is
> > > reduced and/or wal_recycle is set to off?
> >
> > A checkpoint should do that, see RemoveOldXlogFiles.
> >
> > Maybe you have a broken WAL archiving setup, or something else preventing
> > removal of old WAL files?
>
> Thanks Tom. My test case is simple [1], no archiving, no replication
> slots - just plain initdb-ed cluster. My expectation is that whenever
> max_wal_size/wal_recycle is changed from the last checkpoint value,
> postgres must be able to delete "optionally" "all or some of" the
> future WAL files to free-up some disk space (which is about to get
> full) so that I can avoid server crashes and I will have some time to
> go scale the disk.
>
> [1]
> show min_wal_size;
> show max_wal_size;
> show wal_recycle;
>
> drop table foo;
> create table foo(col int);
>
> -- run below pg_switch_wal and insert statements 9 times.
> select pg_switch_wal();
> insert into foo select * from generate_series(1, 1000);
>
> select redo_wal_file from pg_control_checkpoint();
>
> checkpoint;
> --there will be around 10 recycled WAL future WAL files.
>
> alter system set max_wal_size to '240MB';
> select pg_reload_conf();
> show max_wal_size;
>
> checkpoint;
> --future WAL files will not be deleted.
>
> alter system set min_wal_size to '24MB';
> select pg_reload_conf();
> show min_wal_size;
>
> checkpoint;
> --future WAL files will not be deleted.
>
> alter system set wal_recycle to off;
> select pg_reload_conf();
> show wal_recycle;
>
> checkpoint;
> --future WAL files will not be deleted.

Hi, I'm thinking out loud - can we add all the recycled WAL files to a
sorted list (oldest recycled WAL file to newest) and then, during
checkpoint, if max_wal_size is reduced or wal_recycle is set to off,
start deleting the future WAL files from the end of the sorted list?
Upon restart of the server, if required, the sorted list of future WAL
files can be rebuilt. A rough sketch of the idea is below.
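A self-contained sketch of the idea (the segment numbers and the cap are made up; this is not WAL code):

#include <stdio.h>
#include <stdlib.h>

static int
cmp_seg(const void *a, const void *b)
{
    unsigned long x = *(const unsigned long *) a;
    unsigned long y = *(const unsigned long *) b;

    return (x > y) - (x < y);
}

int
main(void)
{
    /* recycled (pre-allocated future) segment numbers, in arrival order */
    unsigned long recycled[] = {105, 102, 104, 103, 106};
    size_t      n = sizeof(recycled) / sizeof(recycled[0]);
    size_t      cap = 2;        /* pretend the reduced max_wal_size allows 2 */

    qsort(recycled, n, sizeof(recycled[0]), cmp_seg);

    /* trim from the "futuremost" end of the sorted list down to the cap */
    while (n > cap)
        printf("would unlink recycled segment %lu\n", recycled[--n]);

    return 0;
}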

Thoughts?

Regards,
Bharath Rupireddy.




Re: Unfiltered server logs routing via a new elog hook or existing emit_log_hook bypassing log_min_message check

2022-05-13 Thread Bharath Rupireddy
On Mon, May 2, 2022 at 7:37 PM Tom Lane  wrote:
>
> Julien Rouhaud  writes:
> > On Mon, May 02, 2022 at 07:24:04PM +0530, Bharath Rupireddy wrote:
> >> I basically want to avoid normal users/developers setting any
> >> parameter (especially the superuser-only log_min_message GUC, all
> >> users might not have superuser access in production environments) or
> >> making any changes to the running server except just LOADing the
> >> server log routing/intercepting extension.
>
> > The kind of scenario you mentioned didn't seem "normal users" oriented.  
> > Note
> > that LOAD is restricted to superuser anyway.
>
> It seems completely silly to be worrying that setting a GUC in a
> particular way is too hard for somebody who's going to be installing
> a loadable extension.  In any case, if you wanted to force the issue
> you could set log_min_messages in the extension's _PG_init function.

Thanks Tom and Julien. I developed a simple external module called
pg_intercept_logs [1].

[1] https://github.com/BRupireddy/pg_intercept_server_logs

Regards,
Bharath Rupireddy.




Re: Can postgres ever delete the recycled future WAL files to free-up disk space if max_wal_size is reduced or wal_recycle is set to off?

2022-05-13 Thread Justin Pryzby
On Fri, May 06, 2022 at 09:50:26PM +0530, Bharath Rupireddy wrote:
> It seems like that's not the case because postgres will not remove future WAL
> files even after max_wal_size is reduced,

In the past, I've had to generate synthetic write traffic and checkpoints to
get WAL to shrink.  +1 to make it respect max_wal_size on its own.

-- 
Justin




Re: Comments on Custom RMGRs

2022-05-13 Thread Simon Riggs
On Fri, 13 May 2022 at 00:42, Andres Freund  wrote:
>
> On 2022-05-12 22:26:51 +0100, Simon Riggs wrote:
> > On Thu, 12 May 2022 at 04:40, Andres Freund  wrote:
> > > I'm not happy with the idea of random code being executed in the middle of
> > > CheckPointGuts(), without any documentation of what is legal to do at that
> > > point.
> >
> > The "I'm not happy.." ship has already sailed with pluggable rmgrs.
>
> I don't agree. The ordering within a checkpoint is a lot more fragile than
> what you do in an individual redo routine.

Example?


> > Checkpoints exist for one purpose - as the starting place for recovery.
> >
> > Why would we allow pluggable recovery without *also* allowing
> > pluggable checkpoints?
>
> Because one can do a lot of stuff with just pluggable WAL records, without
> integrating into checkpoints?
>
> Note that I'm *not* against making checkpoint extensible - I just think it
> needs a good bit of design work around when the hook is called etc.

When was any such work done previously for any other hook?? That isn't needed.

Checkpoints aren't complete until all data structures have been
checkpointed, so there are no problems from a partial checkpoint being
written.

As a result, the actions in CheckPointGuts() are mostly independent of
each other. The SLRUs are all independent of each other, as is
CheckPointBuffers().

The use cases I'm trying to support aren't tricksy modifications of
existing code, they are just entirely new data structures which are
completely independent of other Postgres objects.
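As a purely hypothetical sketch of the shape being discussed (the rm_checkpoint name and everything else below are assumptions for illustration, not the posted patch): once the built-in SLRUs and shared buffers have been flushed, the remaining per-rmgr work is independent and could be driven by a simple loop.

#include <stddef.h>
#include <stdio.h>

/* All names here are illustrative assumptions, not PostgreSQL code. */
typedef struct RmgrSketch
{
    const char *name;
    void        (*rm_checkpoint) (int flags);   /* NULL if not needed */
} RmgrSketch;

static void
my_storage_checkpoint(int flags)
{
    printf("flushing extension-managed durable state (flags=%d)\n", flags);
}

static RmgrSketch rmgrs[] =
{
    {"builtin-heap", NULL},                     /* built-ins handled earlier */
    {"my_custom_storage", my_storage_checkpoint},
};

static void
CheckPointCustomRmgrs(int flags)
{
    for (size_t i = 0; i < sizeof(rmgrs) / sizeof(rmgrs[0]); i++)
        if (rmgrs[i].rm_checkpoint != NULL)
            rmgrs[i].rm_checkpoint(flags);
}

int
main(void)
{
    CheckPointCustomRmgrs(0);
    return 0;
}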


> I definitely think it's too late in the cycle to add checkpoint extensibility
> now.
>
>
> > > To actually be useful we'd likely need multiple calls to such an rmgr
> > > callback, with a parameter where in CheckPointGuts() we are. Right now the
> > > sequencing is explicit in CheckPointGuts(), but with the proposed 
> > > callback,
> > > that'd not be the case anymore.
> >
> > It is useful without the extra complexity you mention.
>
> Shrug. The documentation work definitely is needed. Perhaps we can get away
> without multiple callbacks within a checkpoint, I think it'll become more
> apparent when writing information about the precise point in time the
> checkpoint callback is called.

You seem to be thinking in terms of modifying the existing actions in
CheckPointGuts(). I don't care about that. Anybody who wishes to do
that can work out the details of their actions.

There is nothing to document, other than "don't do things that won't
work". How can anyone enumerate all the things that wouldn't work??

There is no list of caveats for any other hook. Why is it needed here?

> > I see multiple uses for the rm_checkpoint() point proposed and I've
> > been asked multiple times for a checkpoint hook. Any rmgr that
> > services crash recovery for a non-smgr based storage system would need
> > this because the current checkpoint code only handles flushing to disk
> > for smgr-based approaches. That is orthogonal to other code during
> > checkpoint, so it stands alone quite well.
>
> FWIW, for that there are much bigger problems than checkpoint
> extensibility. Most importantly there's currently no good way to integrate
> relation creation / drop with the commit / abort infrastructure...

One at a time...

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)

2022-05-13 Thread Bharath Rupireddy
On Thu, May 5, 2022 at 2:07 PM Alvaro Herrera  wrote:
>
> On 2022-May-05, Bharath Rupireddy wrote:
>
> > On Fri, Apr 29, 2022 at 4:11 PM Alvaro Herrera  
> > wrote:
> > >
> > > Did we ever consider the idea of using a new pg_stat_wal_activity_progress
> > > view or something like that, using the backend_progress.c functionality?
> > > I don't see it mentioned in the thread.
> >
> > IMO, progress reporting works well on a running server and at the
> > moment. The WAL recovery/replay can happen even before the server
> > opens up for connections
>
> It's definitely true that you wouldn't be able to use it for when the
> server is not accepting connections.
>
> > and the progress report view can't be used
> > for later analysis like how much time the restoring WAL files from
> > archive location took
>
> This is true too -- progress report doesn't store historical data, only
> current status.
>
> > and also the WAL file names can't be reported in progress reporting
> > mechanism
>
> Also true.
>
> > (only integers columns, of course if required we can add text columns
> > to pg_stat_get_progress_info).
>
> Yeah, I don't think adding text columns is terribly easy, because the
> whole point of the progress reporting infrastructure is that it can be
> updated very cheaply as atomic operations, and if you want to transmit
> text columns, that's no longer possible.
>
> > Having the recovery info in server logs might help.
>
> I suppose it might.
>
> > I think reporting a long-running file processing operation (removing
> > or syncing) within postgres is a generic problem (for snapshot,
> > mapping, temporary (pgsql_tmp), temp relation files, old WAL file
> > processing, WAL file processing during recovery etc.) and needs to be
> > solved
>
> I agree up to here.
>
> > in two ways: 1) logging progress into server logs (which helps
> > for analysis and report when the server isn't available for
> > connections, crash recovery), a generic GUC
> > log_file_processing_traffic = {none, medium, high} might help here
> > (also proposed in [1]) and 2) pg_stat_file_processing_progress
> > (extending progress reporting pg_stat_get_progress_info to have few
> > text columns for current file name and directory path).
>
> I think using the server log to store telemetry data is not a great fit.
> It bloats the log files and can be so slow as to block other operations
> in the server.  Server logs should normally be considered critical info
> that's not okay to lose; telemetry tends to be of secondary importance
> and in a pinch you can drop a few messages without hurting too much.
>
> We've done moderately okay so far with having some system views where
> some telemetry readings can be obtained, but there several drawbacks to
> that approach that we should at some point solve.  My opinion on this is
> that we need to bite the bullet and develop separate infrastructure for
> reporting server metrics.
>
> I just
> think that we should look into a new mechanism going forward.

I completely agree that we must have new ways to report important
server metrics in an easily consumable fashion. It could be the server
computing metrics (time taken, the file being processed, etc.) for
important or time-consuming operations such as execution of the archive
command, restore command, recovery, checkpoint, old WAL file
removal/recycling, and processing of various files (snapshot, mapping,
pgsql_tmp, temp relation files, etc.), and writing these metrics to a
file or a stats table in the database itself. This table could have an
automatic cleanup mechanism, or a limit on the number of rows or the
size of the table, so that clients can read those metrics and use them
in whichever way they want. To start with, a simple structure for the
metrics could be {OPERATION text, START_TIME timestamp, END_TIME
timestamp, and a few text, double and/or integer flexible columns that
each operation can use to store metrics}.

I can start a discussion in a separate thread to seek more thoughts on
the server metrics part.

> That said, I'm not opposed to having a patch somewhat as posted.

Thanks. I'm wondering if we should have a generic GUC
log_file_processing_traffic = {none, medium, high} that might help
report other time-consuming file processing operations such as [1].

[1] 
https://www.postgresql.org/message-id/CALj2ACW-ELOF5JT2zPavs95wbZ0BrLPrqvSZ7Ac%2BpjxCkmXtEQ%40mail.gmail.com

Regards,
Bharath Rupireddy.




Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)

2022-05-13 Thread Robert Haas
On Fri, Apr 29, 2022 at 5:11 AM Bharath Rupireddy
 wrote:
> Here's the rebased v9 patch.

This seems like it has enormous overlap with the existing
functionality that we have from log_startup_progress_interval.

I think that facility is also better-designed than this one. It prints
out a message based on elapsed time, whereas this patch prints out a
message based on progress through the WAL. That means that if WAL replay
isn't actually advancing for some reason, you just won't get any log
messages and you don't know whether it's advancing slowly or not at
all or the server is just hung. With that facility you can distinguish
those cases.

Also, if for some reason we do think that amount of WAL replayed is
the right metric, rather than time, why would we only allow high=1
segment and low=128 segments, rather than say any number of MB or GB
that the user would like to configure?

I suggest that if log_startup_progress_interval doesn't meet your
needs here, we should try to understand why not and maybe enhance it,
instead of adding a separate facility.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Backends stunk in wait event IPC/MessageQueueInternal

2022-05-13 Thread Japin Li


On Fri, 13 May 2022 at 19:41, Justin Pryzby  wrote:
> On Fri, May 13, 2022 at 06:16:23PM +0800, Japin Li wrote:
>> I had an incident on my Postgres 14 that queries hung in wait event
>> IPC / MessageQueueInternal, MessageQueueReceive.  It likes [1],
>> however, it doesn't have any discussions.
>
> If the process is still running, or if the problem recurs, I suggest to create
> a corefile with gcore, aka gdb generate-core-file.  Then, we can look at the
> backtrace at our leisure, even if the cluster needed to be restarted right
> away.
>

Thanks for your advice, I will try it later.

> What minor version of postgres is this, and what OS ?

PostgreSQL 14.2 and Solaris.

--
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)

2022-05-13 Thread Maxim Orlov
> here is the rebased v32 version of the patch.
>
> The patchset rotted a bit. Here is a rebased version.
>

We have posted an updated version, v34, of the whole patchset in [1].
The changes in patches 0001-0003 there are identical to v33, so no
update is needed in this thread.

[1]
https://www.postgresql.org/message-id/flat/CACG%3DezZe1NQSCnfHOr78AtAZxJZeCvxrts0ygrxYwe%3DpyyjVWA%40mail.gmail.com

-- 
Best regards,
Maxim Orlov.


Re: make MaxBackends available in _PG_init

2022-05-13 Thread Robert Haas
On Thu, May 12, 2022 at 8:15 PM Michael Paquier  wrote:
> On Thu, May 12, 2022 at 06:51:59PM +0300, Anton A. Melnikov wrote:
> > Maybe remove the first part from the patchset?
> > Because now the Patch Tester is giving apply error for the first part and
> > can't test the other.
> > http://cfbot.cputube.org/patch_38_3614.log
>
> Yep.  As simple as the attached.

Committed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Backends stunk in wait event IPC/MessageQueueInternal

2022-05-13 Thread Robert Haas
On Fri, May 13, 2022 at 6:16 AM Japin Li  wrote:
> The process cannot be terminated by pg_terminate_backend(), although
> it returns true.

pg_terminate_backend() just sends SIGTERM. What I'm wondering is what
happens when the stuck process receives SIGTERM. It would be useful, I
think, to check the value of the global variable InterruptHoldoffCount
in the stuck process by attaching to it with gdb. I would also try
running "strace -p $PID" on the stuck process and then try terminating
it again with pg_terminate_backend(). Either the system call in which
it's currently stuck returns and then it makes the same system call
again and hangs again ... or the signal doesn't dislodge it from the
system call in which it's stuck in the first place. It would be useful
to know which of those two things is happening.

One thing I find a bit curious is that the top of the stack in your
case is ioctl(). And there are no calls to ioctl() anywhere in
latch.c, nor have there ever been. What operating system is this? We
have 4 different versions of WaitEventSetWaitBlock() that call
epoll_wait(), kevent(), poll(), and WaitForMultipleObjects()
respectively. I wonder which of those we're using, and whether one of
those calls is showing up as ioctl() in the stacktrace, or whether
there's some other function being called in here that is somehow
resulting in ioctl() getting called.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




RE: Data is copied twice when specifying both child and parent table in publication

2022-05-13 Thread osumi.takami...@fujitsu.com
On Friday, May 13, 2022 6:42 PM Wang, Wei/王 威  wrote:
> Attach the patches.(Only changed the patch for HEAD.).
> 1. Optimize the code. Reduce calls to function filter_partitions. 
> [suggestions by
> Amit-san] 2. Improve the alias name in SQL. [suggestions by Amit-san] 3.
> Improve coding alignments. [suggestions by Amit-san] 4. Do some
> optimizations for list Concatenate.
Hi, thank you for updating the patch.


I have one minor comment on fetch_table_list() in HEAD v4.


@@ -1759,17 +1759,22 @@ static List *
 fetch_table_list(WalReceiverConn *wrconn, List *publications)
 {
WalRcvExecResult *res;
-   StringInfoData cmd;
+   StringInfoData cmd,
+   pub_names;
TupleTableSlot *slot;
Oid tableRow[2] = {TEXTOID, TEXTOID};
List   *tablelist = NIL;

+   initStringInfo(&pub_names);
+   get_publications_str(publications, &pub_names, true);
+

Kindly free the pub_names's data along with the cmd.data.


Best Regards,
Takamichi Osumi



Re: First draft of the PG 15 release notes

2022-05-13 Thread 'Bruce Momjian'
On Fri, May 13, 2022 at 01:36:04AM +, osumi.takami...@fujitsu.com wrote:
> > 
> > This is enabled with the subscriber option "disable_on_error"
> > and avoids possible infinite loops during stream application.
> > 
> > 
> Thank you !
> 
> In this last paragraph, how about replacing "infinite loops"
> with "infinite error loops" ? I think it makes the situation somewhat
> clear for readers.

Agreed, done.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

2022-05-13 Thread Robert Haas
On Thu, May 12, 2022 at 10:20 PM Thomas Munro  wrote:
> As for skink failing, the timeout was hard coded 300s for the whole
> test, but apparently that wasn't enough under valgrind.  Let's use the
> standard PostgreSQL::Test::Utils::timeout_default (180s usually), but
> reset it for each query we send.

@@ -202,6 +198,9 @@ sub send_query_and_wait
  my ($psql, $query, $untl) = @_;
  my $ret;

+ $psql_timeout->reset();
+ $psql_timeout->start();
+
  # send query
  $$psql{stdin} .= $query;
  $$psql{stdin} .= "\n";

This seems fine, but I think you should add a non-trivial comment about it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: make MaxBackends available in _PG_init

2022-05-13 Thread Nathan Bossart
On Fri, May 13, 2022 at 09:49:54AM -0400, Robert Haas wrote:
> Committed.

Thanks!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: First draft of the PG 15 release notes

2022-05-13 Thread Bruce Momjian
On Fri, May 13, 2022 at 10:48:41AM +0900, Amit Langote wrote:
> On Thu, May 12, 2022 at 10:52 PM Bruce Momjian  wrote:
> > Okay, I went with:
> >
> > Previously, such updates ran delete actions on the source
> > partition and insert actions on the target partition.  PostgreSQL 
> > will
> > now run update actions on the referenced partition root.
> 
> WFM, thanks.
> 
> Btw, perhaps the following should be listed under E.1.3.2.1. Logical
> Replication, not E.1.3.1.1. Partitioning?
> 
> 
> 
> 
> 
> Remove incorrect duplicate partitions in system view
> pg_publication_tables (Hou Zhijie)
> 
> 
> 
> Attached a patch to do so.

Agreed, done.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: Comments on Custom RMGRs

2022-05-13 Thread Jeff Davis
On Fri, 2022-05-13 at 13:31 +0100, Simon Riggs wrote:
> The first was a discussion about a data structure needed by BDR about
> 4 years ago. In the absence of a pluggable checkpoint, the solution
> was forced to use a normal table, which wasn't very satisfactory.

I'm interested to hear more about this case. Are you developing it into
a full table AM? In my experience with columnar, there's still a long
tail of things I wish I had in the backend to better support complete
table AMs.

> The second was a more recent conversation with Mike Stonebraker, at
> the end of 2021.. He was very keen to remove the buffer manager
> entirely, which requires that we have a new smgr, which then needs
> new
> code to allow it to be written to disk at checkpoint time, which then
> requires some kind of pluggable code at checkpoint time. (Mike was
> also keen to remove WAL, but that's another story entirely!).

I'm guessing that would be more of an experimental/ambitious project,
and based on modified postgres anyway.

> The last use case was unlogged indexes, which need to be read from
> disk at startup or rebuilt after crash, which requires RmgrStartup to
> work both with and without InRedo=true.

That sounds like a core feature, in which case we can just refactor
that for v16. It might be a nice cleanup for unlogged tables, too. I
don't think your 002-v2 patch is particularly risky, but any reluctance
at all probably pushes it to v16 given that it's so late in the cycle.

Regards,
Jeff Davis






Re: First draft of the PG 15 release notes

2022-05-13 Thread Bruce Momjian
On Fri, May 13, 2022 at 08:24:53AM +0530, Amit Kapila wrote:
> On Fri, May 13, 2022 at 6:02 AM Euler Taveira  wrote:
> >
> > On Thu, May 12, 2022, at 11:22 AM, Bruce Momjian wrote:
> >
> > On Thu, May 12, 2022 at 11:12:54AM -0300, Euler Taveira wrote:
> > OB> On Thu, May 12, 2022, at 11:03 AM, Bruce Momjian wrote:
> > >
> > > I looked at that but thought that everyone would already assume we
> > > skipped replication of empty transactions, and I didn't see much 
> > > impact
> > > for the user, so I didn't include it.
> > >
> > > It certainly has an impact on heavy workloads that replicate tables with 
> > > few
> > > modifications. It receives a high traffic of 'begin' and 'commit' 
> > > messages that
> > > the previous Postgres versions have to handle (discard). I would classify 
> > > it as
> > > a performance improvement for logical replication. Don't have a strong 
> > > opinion
> > > if it should be mentioned or not.
> >
> > Oh, so your point is that a transaction that only has SELECT would
> > previously send an empty transaction?  I thought this was only for apps
> > that create literal empty transactions, which seem rare.
> >
> > No. It should be a write transaction. If you have a replication setup that
> > publish only table foo (that isn't modified often) and most of your
> > workload does not contain table foo, Postgres sends 'begin' and 'commit'
> > messages to subscriber even if there is no change to replicate.
> >
> 
> It reduces network traffic and improves performance by 3-14% on simple
> tests [1] like the one shown by Euler. I see a value in adding this as
> for the workloads where it hits, it seems more than 99% of network
> traffic [2] is due to these empty messages.

I see the point now --- new item:





Prevent logical replication of empty transactions (Ajin Cherian,
Hou Zhijie, Euler Taveira)



Previously, write transactions would send empty transactions to
subscribers if subscribed tables were not modified.



-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: Should use MERGE use BulkInsertState ?

2022-05-13 Thread Robert Haas
On Wed, May 11, 2022 at 12:48 PM Justin Pryzby  wrote:
> The point of the ring/strategy buffer is to avoid a seq scan/vacuum/COPY
> clobbering the entire buffer cache when processing a table larger than
> shared_buffers.
>
> It also makes backends doing bulk operations responsible for their own writes,
> rather than leaving other backends to clean up all their dirty buffers.

Right. It's not just about how fast individual operations go: dirtying
all of shared_buffers is faster *for you* but you've imposed a
distributed cost on everyone else on the system, which isn't great.

That said, I'm not really sure what the proper handling is in this
specific case. I guess we should try our best to have it be symmetric
with other similar cases, but questions have been raised repeatedly
about how well the ring buffer stuff actually works.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Comments on Custom RMGRs

2022-05-13 Thread Robert Haas
On Fri, May 13, 2022 at 8:47 AM Simon Riggs
 wrote:
> > Note that I'm *not* against making checkpoint extensible - I just think it
> > needs a good bit of design work around when the hook is called etc.
>
> When was any such work done previously for any other hook?? That isn't needed.

I think almost every proposal to add a hook results in some discussion
about how usable the hook will be and whether it's being put in the
correct place and called with the correct arguments.

I think that's a good thing, too. Otherwise the code would be
cluttered with a bunch of hooks that seemed to someone like a good
idea at the time but are actually just a maintenance headache.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: recovery test failure on morepork with timestamp mystery

2022-05-13 Thread Andres Freund
Hi,

On 2022-05-13 09:00:20 +0200, Mikael Kjellström wrote:
> Well, I don't know if you remember but there was a thread a while back and a
> test program (monotime.c) to test the clock if it could go backwards and
> openbsd showed the following result when running the attached testprogram:

Nope, didn't remember...


> $ ./monotime
> 410310 Starting
> 547727 Starting
> 410310 Back 262032.372314102 => 262032.242045208
> 410310 Stopped
> 465180 Starting
> 255646 Starting
> 547727 Stopped
> 465180 Stopped
> 255646 Stopped
> 
> could that have something to do with it?

Yes!


> printf("%d Back %lld.%09lu => %lld.%09lu\n",
> (int)getthrid(), ts0.tv_sec, ts0.tv_nsec, ts1.tv_sec,
> ts1.tv_nsec);
> break;

I wonder whether the %09lu potentially is truncating ts1.tv_nsec.


I can't reproduce the problem trivially in an openbsd VM I had around. But
it's 7.1, so maybe that's the reason?

Greetings,

Andres Freund




Re: Multi-Master Logical Replication

2022-05-13 Thread Bruce Momjian
On Fri, Apr 29, 2022 at 07:05:11PM +1000, Peter Smith wrote:
> This MMLR proposal was mostly just to create an interface making it
> easier to use PostgreSQL core logical replication CREATE
> PUBLICATION/SUBSCRIPTION for table sharing among a set of nodes.
> Otherwise, this is difficult for a user to do manually. (e.g.
> difficulties as mentioned in section 2.2 of the original post [1] -
> dealing with initial table data, coordinating the timing/locking to
> avoid concurrent updates, getting the SUBSCRIPTION options for
> copy_data exactly right etc)
> 
> At this time we have no provision for HA, nor for transaction
> consistency awareness, conflict resolutions, node failure detections,
> DDL replication etc. Some of the features like DDL replication are
> currently being implemented [2], so when committed it will become
> available in the core, and can then be integrated into this module.

Uh, without these features, what workload would this help with?  I think
you made the mistake of jumping too far into implementation without
explaining the problem you are trying to solve.  The TODO list has this
ordering:

https://wiki.postgresql.org/wiki/Todo
Desirability -> Design -> Implement -> Test -> Review -> Commit

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: Privileges on PUBLICATION

2022-05-13 Thread Euler Taveira
On Fri, May 13, 2022, at 3:36 AM, Antonin Houska wrote:
> Attached is my proposal. It tries to be more specific and does not mention the
> absence of the privileges explicitly.
You explained the current issue but said nothing about the limitation. This
will possibly trigger a question in one of the MLs. IMO if you say
something like the sentence above at the end, it will make it clear why that
setup exposes all data (there is no access control to publications) and will
explicitly say there is a TODO here.

Additional privileges might be added to control access to table data in a
future version of PostgreSQL.

I also wouldn't use the warning tag because it fits in the same category as the
other restrictions listed on the page.


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: Item compression in the Gist index

2022-05-13 Thread Bruce Momjian
On Mon, May  2, 2022 at 02:48:27PM +0200, Baca Radim wrote:
> Hi,
> we are developing an extension for multidimensional data. We have created a
> GiST index that is heavily inspired by the cube extension. Right now we would
> like to add some item compression, since data items in a node share a
> significant portion of a tuple prefix. However, I have no idea how to handle
> information stored at the node level in GiST's compression method. Is there
> any example? Unfortunately, the cube extension does not implement item
> compression.
> To be more specific, we would like to store in the node a common prefix for
> all tuples in the node.

Uh, SP-GiST does prefix compression.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





Re: recovery test failure on morepork with timestamp mystery

2022-05-13 Thread Andres Freund
Hi,

On 2022-05-13 10:22:32 -0700, Andres Freund wrote:
> On 2022-05-13 09:00:20 +0200, Mikael Kjellström wrote:
> > Well, I don't know if you remember but there was a thread a while back and a
> > test program (monotime.c) to test the clock if it could go backwards and
> > openbsd showed the following result when running the attached testprogram:
> 
> Nope, didn't remember...
> 
> 
> > $ ./monotime
> > 410310 Starting
> > 547727 Starting
> > 410310 Back 262032.372314102 => 262032.242045208
> > 410310 Stopped
> > 465180 Starting
> > 255646 Starting
> > 547727 Stopped
> > 465180 Stopped
> > 255646 Stopped
> > 
> > could that have something to do with it?
> 
> Yes!
> 
> 
> > printf("%d Back %lld.%09lu => %lld.%09lu\n",
> > (int)getthrid(), ts0.tv_sec, ts0.tv_nsec, ts1.tv_sec,
> > ts1.tv_nsec);
> > break;
> 
> I wonder whether the %09lu potentially is truncating ts1.tv_nsec.
> 
> 
> I can't reproduce the problem trivially in an openbsd VM I had around. But
> it's 7.1, so maybe that's the reason?

What does
sysctl kern.timecounter
return? Does the problem continue if you switch kern.timecounter.hardware to
something else?

In https://postgr.es/m/32aaeb66-71b2-4af0-91ef-1a992ac4d58b%40mksoft.nu you
said it was using acpitimer0 and that it's a vmware VM. It might also be a
vmware bug, not an openbsd one...

Greetings,

Andres Freund




list of TransactionIds

2022-05-13 Thread Alvaro Herrera
We didn't have any use of TransactionId as members of List, until
RelationSyncEntry->streamed_txns was introduced (464824323e57, pg14).
It's currently implemented as a list of int.  This is not wrong at
present, but it may soon be, and I'm sure it rubs some people the wrong
way.

But is the rubbing way wrong enough to add support for TransactionId in
pg_list.h, including, say, T_XidList?

The minimal patch (attached) is quite small AFAICS.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
>From 19f165f556a3cee7226a7b22b9bdf7ad9bce9fb0 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 29 Apr 2022 17:42:17 +0200
Subject: [PATCH] Implement List support for TransactionId

Use it for RelationSyncEntry->streamed_txns, which is currently using an
integer list.
---
 src/backend/nodes/list.c| 42 -
 src/backend/nodes/outfuncs.c|  4 ++
 src/backend/replication/pgoutput/pgoutput.c | 13 ++-
 src/include/nodes/nodes.h   |  1 +
 src/include/nodes/pg_list.h |  6 +++
 5 files changed, 56 insertions(+), 10 deletions(-)

diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c
index f843f861ef..9d8f4fd5c7 100644
--- a/src/backend/nodes/list.c
+++ b/src/backend/nodes/list.c
@@ -54,6 +54,7 @@
 #define IsPointerList(l)		((l) == NIL || IsA((l), List))
 #define IsIntegerList(l)		((l) == NIL || IsA((l), IntList))
 #define IsOidList(l)			((l) == NIL || IsA((l), OidList))
+#define IsXidList(l)			((l) == NIL || IsA((l), XidList))
 
 #ifdef USE_ASSERT_CHECKING
 /*
@@ -71,7 +72,8 @@ check_list_invariants(const List *list)
 
 	Assert(list->type == T_List ||
 		   list->type == T_IntList ||
-		   list->type == T_OidList);
+		   list->type == T_OidList ||
+		   list->type == T_XidList);
 }
 #else
 #define check_list_invariants(l)  ((void) 0)
@@ -383,6 +385,24 @@ lappend_oid(List *list, Oid datum)
 	return list;
 }
 
+/*
+ * Append a TransactionId to the specified list. See lappend()
+ */
+List *
+lappend_xid(List *list, TransactionId datum)
+{
+	Assert(IsXidList(list));
+
+	if (list == NIL)
+		list = new_list(T_XidList, 1);
+	else
+		new_tail_cell(list);
+
+	llast_xid(list) = datum;
+	check_list_invariants(list);
+	return list;
+}
+
 /*
  * Make room for a new cell at position 'pos' (measured from 0).
  * The data in the cell is left undefined, and must be filled in by the
@@ -714,6 +734,26 @@ list_member_oid(const List *list, Oid datum)
 	return false;
 }
 
+/*
+ * Return true iff the TransactionId 'datum' is a member of the list.
+ */
+bool
+list_member_xid(const List *list, TransactionId datum)
+{
+	const ListCell *cell;
+
+	Assert(IsXidList(list));
+	check_list_invariants(list);
+
+	foreach(cell, list)
+	{
		if (lfirst_xid(cell) == datum)
+			return true;
+	}
+
+	return false;
+}
+
 /*
  * Delete the n'th cell (counting from 0) in list.
  *
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 0271ea9d78..0f257ed0ce 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -221,6 +221,8 @@ _outList(StringInfo str, const List *node)
 		appendStringInfoChar(str, 'i');
 	else if (IsA(node, OidList))
 		appendStringInfoChar(str, 'o');
+	else if (IsA(node, XidList))
+		appendStringInfoChar(str, 'x');
 
 	foreach(lc, node)
 	{
@@ -239,6 +241,8 @@ _outList(StringInfo str, const List *node)
 			appendStringInfo(str, " %d", lfirst_int(lc));
 		else if (IsA(node, OidList))
 			appendStringInfo(str, " %u", lfirst_oid(lc));
+		else if (IsA(node, XidList))
+			appendStringInfo(str, " %u", lfirst_xid(lc));
 		else
 			elog(ERROR, "unrecognized list node type: %d",
  (int) node->type);
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 42c06af239..d4a84631c7 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -1923,13 +1923,8 @@ init_rel_sync_cache(MemoryContext cachectx)
 static bool
 get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
 {
-	ListCell   *lc;
-
-	foreach(lc, entry->streamed_txns)
-	{
-		if (xid == (uint32) lfirst_int(lc))
-			return true;
-	}
+	if (list_member_xid(entry->streamed_txns, xid))
+		return true;
 
 	return false;
 }
@@ -1945,7 +1940,7 @@ set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
 
 	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
 
-	entry->streamed_txns = lappend_int(entry->streamed_txns, xid);
+	entry->streamed_txns = lappend_xid(entry->streamed_txns, xid);
 
 	MemoryContextSwitchTo(oldctx);
 }
@@ -2248,7 +2243,7 @@ cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
 		 */
 		foreach(lc, entry->streamed_txns)
 		{
-			if (xid == (uint32) lfirst_int(lc))
+			if (xid == lfirst_xid(lc))
 			{
 if (is_commit)
 	entry->schema_sent = true;
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index b3b407579b..7ce1fc4deb 100644
--- a

Re: recovery test failure on morepork with timestamp mystery

2022-05-13 Thread Mikael Kjellström




On 2022-05-13 22:09, Andres Freund wrote:


What does
sysctl kern.timecounter
return? Does the problem continue if you switch kern.timecounter.hardware to
something else?



openbsd 5.9:

$ sysctl kern.timecounter
kern.timecounter.tick=1
kern.timecounter.timestepwarnings=0
kern.timecounter.hardware=acpihpet0
kern.timecounter.choice=i8254(0) acpihpet0(1000) acpitimer0(1000) 
dummy(-100)


$ ./monotime 


1024736 Starting
1013674 Starting
1028468 Starting
1014641 Starting
1013674 Stopped
1024736 Stopped
1014641 Stopped
1028468 Stopped

no problem


openbsd 6.9:

$ sysctl kern.timecounter
kern.timecounter.tick=1
kern.timecounter.timestepwarnings=0
kern.timecounter.hardware=tsc
kern.timecounter.choice=i8254(0) acpihpet0(1000) tsc(2000) acpitimer0(1000)

Hm, here it's using the tsc timer.  So that is a difference from 5.9

$ ./monotime
133998 Starting
408137 Starting
578042 Starting
326139 Starting
133998 Back 310670.000931851668 => 310670.000801582864
133998 Stopped
326139 Stopped
578042 Stopped
408137 Stopped

it's only in 6.9 the problem with the timer going backwards shows up.

If I switch timer to acpihpet0 this is the result:

$ ./monotime
101622 Starting
480782 Starting
219318 Starting
316650 Starting
101622 Stopped
480782 Stopped
219318 Stopped
316650 Stopped

so that seems to solve the problem with the timer going backwards.



In https://postgr.es/m/32aaeb66-71b2-4af0-91ef-1a992ac4d58b%40mksoft.nu you
said it was using acpitimer0 and that it's a vmware VM. It might also be a
vmware bug, not an openbsd one...


Might be a bug in vmware, but it's running the latest patchlevel for 6.7 
and no other VM has this problem, so it seems to only happen in openbsd 6.9 
for some reason.  Maybe that is the only VM that is using tsc as a timer 
source, though?


/Mikael




Re: Backends stunk in wait event IPC/MessageQueueInternal

2022-05-13 Thread Thomas Munro
On Sat, May 14, 2022 at 2:09 AM Robert Haas  wrote:
> On Fri, May 13, 2022 at 6:16 AM Japin Li  wrote:
> > The process cannot be terminated by pg_terminate_backend(), although
> > it returns true.

> One thing I find a bit curious is that the top of the stack in your
> case is ioctl(). And there are no calls to ioctl() anywhere in
> latch.c, nor have there ever been. What operating system is this? We
> have 4 different versions of WaitEventSetWaitBlock() that call
> epoll_wait(), kevent(), poll(), and WaitForMultipleObjects()
> respectively. I wonder which of those we're using, and whether one of
> those calls is showing up as ioctl() in the stacktrace, or whether
> there's some other function being called in here that is somehow
> resulting in ioctl() getting called.

I guess this is really illumos (née OpenSolaris), not Solaris, using
our epoll build mode, with illumos's emulation of epoll, which maps
epoll onto Sun's /dev/poll driver:

https://github.com/illumos/illumos-gate/blob/master/usr/src/lib/libc/port/sys/epoll.c#L230

That'd explain:

  fb7fef216f4a ioctl (d, d001, fb7fffdfa0e0)

That matches the value DP_POLL from:

https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/sys/devpoll.h#L44
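
For anyone who hasn't used /dev/poll directly, here is a rough, hand-written
sketch of what that interface looks like (an illustration only, assuming the
emulation follows the classic Solaris /dev/poll protocol; this is not code
from the illumos tree or from PostgreSQL): interest is registered by
write()ing pollfd structs to the device, and the actual wait is the DP_POLL
ioctl, which is why the stuck backend's stack ends in ioctl() rather than
epoll_wait().

#include <sys/devpoll.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

int
devpoll_wait_for_readable(int watched_fd, int timeout_ms)
{
	struct pollfd pfd;
	struct dvpoll dvp;
	int			dpfd;
	int			nready;

	dpfd = open("/dev/poll", O_RDWR);
	if (dpfd < 0)
		return -1;

	/* register interest by writing a pollfd to the driver */
	pfd.fd = watched_fd;
	pfd.events = POLLIN;
	pfd.revents = 0;
	if (write(dpfd, &pfd, sizeof(pfd)) != sizeof(pfd))
	{
		close(dpfd);
		return -1;
	}

	/* block until ready or timeout -- the DP_POLL ioctl seen in the stack */
	dvp.dp_fds = &pfd;
	dvp.dp_nfds = 1;
	dvp.dp_timeout = timeout_ms;
	nready = ioctl(dpfd, DP_POLL, &dvp);

	close(dpfd);
	return nready;
}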

Or if it's really Solaris, huh, are people moving illumos code back
into closed Solaris these days?

As for why it's hanging, I don't know, but one thing that we changed
in 14 was that we started using signalfd() to receive latch signals on
systems that have it, and illumos also has an emulation of signalfd()
that our configure script finds:

https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/io/signalfd.c

There were in fact a couple of unexplained hangs on the illumos build
farm animals, and then they were changed to use -DWAIT_USE_POLL so
that they wouldn't automatically choose epoll()/signalfd().   That is
not very satisfactory, but as far as I know there is a bug in either
epoll() or signalfd(), or at least some difference compared to the
Linux implementation they are emulating.   spent quite a bit of time
ping ponging emails back and forth with the owner of a hanging BF
animal trying to get a minimal repro for a bug report, without
success.  I mean, it's possible that the bug is in PostgreSQL (though
no complaint has ever reached me about this stuff on Linux), but while
trying to investigate it a kernel panic happened[1], which I think
counts as a point against that theory...

(For what it's worth, WSL1 also emulates these two Linux interfaces
and also apparently doesn't do so well enough for our purposes, also
for reasons not understood by us.)

In short, I'd recommend -DWAIT_USE_POLL for now.  It's possible that
we could do something to prevent the selection of WAIT_USE_EPOLL on
that platform, or that we should have a halfway option epoll() but not
signalfd() (= go back to using the self-pipe trick), patches welcome,
but that feels kinda strange and would be a very niche combination that
isn't fun to maintain... the real solution is to fix the bug.
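
For readers not familiar with the "self-pipe trick" mentioned above, it works
roughly like this (a generic, minimal sketch for illustration only, not the
actual latch.c code): the wait loop includes the read end of a process-local
pipe in its poll set, and the signal handler writes a byte to the write end,
so a signal arriving at any point makes poll() return instead of being lost.

#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <unistd.h>

static int selfpipe[2];

static void
wakeup_handler(int signo)
{
	int			save_errno = errno;

	/* if the pipe is already full, a wakeup is pending anyway */
	(void) write(selfpipe[1], "x", 1);
	errno = save_errno;
}

void
setup_selfpipe(void)
{
	pipe(selfpipe);
	fcntl(selfpipe[0], F_SETFL, O_NONBLOCK);
	fcntl(selfpipe[1], F_SETFL, O_NONBLOCK);
	signal(SIGURG, wakeup_handler);
}

void
wait_for_socket_or_signal(int sock)
{
	struct pollfd fds[2];
	char		buf[64];

	fds[0].fd = sock;
	fds[0].events = POLLIN;
	fds[1].fd = selfpipe[0];
	fds[1].events = POLLIN;

	(void) poll(fds, 2, -1 /* no timeout */);

	/* drain the pipe so the next wait doesn't wake up spuriously */
	if (fds[1].revents & POLLIN)
		while (read(selfpipe[0], buf, sizeof(buf)) > 0)
			;
}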

[1] https://www.illumos.org/issues/13700




Re: Backends stunk in wait event IPC/MessageQueueInternal

2022-05-13 Thread Thomas Munro
On Sat, May 14, 2022 at 9:25 AM Thomas Munro  wrote:
> In short, I'd recommend -DWAIT_USE_POLL for now.  It's possible that
> we could do something to prevent the selection of WAIT_USE_EPOLL on
> that platform, or that we should have a halfway option epoll() but not
> signalfd() (= go back to using the self-pipe trick), patches welcome,
> but that feels kinda strange and would be very niche combination that
> isn't fun to maintain... the real solution is to fix the bug.

I felt a bit sad about writing that, so I took a crack at trying to
write a patch that separates the signalfd/self-pipe choice from the
epoll/poll choice.  Maybe it's not too bad.

Japin, are you able to reproduce the problem reliably?  Did I guess
right, that you're on illumos?  Does this help?  I used
defined(__sun__) to select the option, but I don't remember if that's
the right way to detect that OS family, could you confirm that, or
adjust as required?
From 316ec0895c7ab65102c8c100774b51fc0e86c859 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 14 May 2022 10:15:30 +1200
Subject: [PATCH] Don't trust signalfd() on illumos.

---
 src/backend/storage/ipc/latch.c | 40 +
 1 file changed, 26 insertions(+), 14 deletions(-)

diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 78c6a89271..8b0b31d100 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -84,6 +84,18 @@
 #error "no wait set implementation available"
 #endif
 
+/*
+ * We need the self-pipe trick for race-free wakeups when using poll().
+ *
+ * XXX We also need it when using epoll() on illumos, because its signalfd()
+ * seems to be broken.
+ */
+#if defined(WAIT_USE_POLL) || (defined(WAIT_USE_EPOLL) && defined(__sun__))
+#define WAIT_USE_SELF_PIPE
+#elif defined(WAIT_USE_EPOLL)
+#define WAIT_USE_SIGNALFD
+#endif
+
 /* typedef in latch.h */
 struct WaitEventSet
 {
@@ -146,12 +158,12 @@ static WaitEventSet *LatchWaitSet;
 static volatile sig_atomic_t waiting = false;
 #endif
 
-#ifdef WAIT_USE_EPOLL
+#ifdef WAIT_USE_SIGNALFD
 /* On Linux, we'll receive SIGURG via a signalfd file descriptor. */
 static int	signal_fd = -1;
 #endif
 
-#if defined(WAIT_USE_POLL)
+#ifdef WAIT_USE_SELF_PIPE
 /* Read and write ends of the self-pipe */
 static int	selfpipe_readfd = -1;
 static int	selfpipe_writefd = -1;
@@ -164,7 +176,7 @@ static void latch_sigurg_handler(SIGNAL_ARGS);
 static void sendSelfPipeByte(void);
 #endif
 
-#if defined(WAIT_USE_POLL) || defined(WAIT_USE_EPOLL)
+#if defined(WAIT_USE_SELF_PIPE) || defined(WAIT_USE_SIGNALFD)
 static void drain(void);
 #endif
 
@@ -190,7 +202,7 @@ static inline int WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 void
 InitializeLatchSupport(void)
 {
-#if defined(WAIT_USE_POLL)
+#if defined(WAIT_USE_SELF_PIPE)
 	int			pipefd[2];
 
 	if (IsUnderPostmaster)
@@ -264,7 +276,7 @@ InitializeLatchSupport(void)
 	pqsignal(SIGURG, latch_sigurg_handler);
 #endif
 
-#ifdef WAIT_USE_EPOLL
+#ifdef WAIT_USE_SIGNALFD
 	sigset_t	signalfd_mask;
 
 	/* Block SIGURG, because we'll receive it through a signalfd. */
@@ -316,7 +328,7 @@ ShutdownLatchSupport(void)
 		LatchWaitSet = NULL;
 	}
 
-#if defined(WAIT_USE_POLL)
+#if defined(WAIT_USE_SELF_PIPE)
 	close(selfpipe_readfd);
 	close(selfpipe_writefd);
 	selfpipe_readfd = -1;
@@ -324,7 +336,7 @@ ShutdownLatchSupport(void)
 	selfpipe_owner_pid = InvalidPid;
 #endif
 
-#if defined(WAIT_USE_EPOLL)
+#if defined(WAIT_USE_SIGNALFD)
 	close(signal_fd);
 	signal_fd = -1;
 #endif
@@ -904,9 +916,9 @@ AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd, Latch *latch,
 	{
 		set->latch = latch;
 		set->latch_pos = event->pos;
-#if defined(WAIT_USE_POLL)
+#if defined(WAIT_USE_SELF_PIPE)
 		event->fd = selfpipe_readfd;
-#elif defined(WAIT_USE_EPOLL)
+#elif defined(WAIT_USE_SIGNALFD)
 		event->fd = signal_fd;
 #else
 		event->fd = PGINVALID_SOCKET;
@@ -2083,7 +2095,7 @@ GetNumRegisteredWaitEvents(WaitEventSet *set)
 	return set->nevents;
 }
 
-#if defined(WAIT_USE_POLL)
+#if defined(WAIT_USE_SELF_PIPE)
 
 /*
  * SetLatch uses SIGURG to wake up the process waiting on the latch.
@@ -2134,7 +2146,7 @@ retry:
 
 #endif
 
-#if defined(WAIT_USE_POLL) || defined(WAIT_USE_EPOLL)
+#if defined(WAIT_USE_SELF_PIPE) || defined(WAIT_USE_SIGNALFD)
 
 /*
  * Read all available data from self-pipe or signalfd.
@@ -2150,7 +2162,7 @@ drain(void)
 	int			rc;
 	int			fd;
 
-#ifdef WAIT_USE_POLL
+#ifdef WAIT_USE_SELF_PIPE
 	fd = selfpipe_readfd;
 #else
 	fd = signal_fd;
@@ -2168,7 +2180,7 @@ drain(void)
 			else
 			{
 waiting = false;
-#ifdef WAIT_USE_POLL
+#ifdef WAIT_USE_SELF_PIPE
 elog(ERROR, "read() on self-pipe failed: %m");
 #else
 elog(ERROR, "read() on signalfd failed: %m");
@@ -2178,7 +2190,7 @@ drain(void)
 		else if (rc == 0)
 		{
 			waiting = false;
-#ifdef WAIT_USE_POLL
+#ifdef WAIT_USE_SELF_PIPE
 			elog(ERROR, "unexpected EOF on self-pipe");
 #else
 			elog(ERROR, "unexpected EOF on signalfd");
-- 
2.30.

Re: Removing unneeded self joins

2022-05-13 Thread Zhihong Yu
Hi,
w.r.t. v33-0001-Remove-self-joins.patch :

removes inner join of plane table -> removes inner join of plain table
in an query plan -> in a query plan

+ * Used as the relation_has_unique_index_for,

Since relation_has_unique_index_for() becomes a wrapper of
relation_has_unique_index_ext, the above sentence doesn't make much sense.
I think you can drop this part.

but if extra_clauses doesn't NULL -> If extra_clauses isn't NULL

+   is_req_equal =
+   (rinfo->required_relids == rinfo->clause_relids) ? true : false;

The above can be simplified to:
  is_req_equal = rinfo->required_relids == rinfo->clause_relids;

+   ListCell *otherCell;
otherCell should be initialized to NULL.

+   if (bms_is_member(k, info->syn_lefthand) &&
+   !bms_is_member(r, info->syn_lefthand))
+   jinfo_check = false;
+   else if (bms_is_member(k, info->syn_righthand) &&
+   !bms_is_member(r, info->syn_righthand))
+   jinfo_check = false;
+   else if (bms_is_member(r, info->syn_lefthand) &&
+   !bms_is_member(k, info->syn_lefthand))
+   jinfo_check = false;

I think the above code can be simplified:

If bms_is_member(k, info->syn_lefthand) ^ bms_is_member(r,
info->syn_lefthand) is true, jinfo_check is false.
If bms_is_member(k, info->syn_righthand) ^ bms_is_member(r,
info->syn_righthand) is true, jinfo_check is false.
Otherwise jinfo_check is true.
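
Rendered as code, that might look something like the following (a sketch
using the variable names from the quoted hunk, not the actual patch):

jinfo_check = !(bms_is_member(k, info->syn_lefthand) ^
				bms_is_member(r, info->syn_lefthand)) &&
			  !(bms_is_member(k, info->syn_righthand) ^
				bms_is_member(r, info->syn_righthand));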

Cheers


Re: [RFC] building postgres with meson -v8

2022-05-13 Thread Andres Freund
Hi,

On 2022-05-11 12:18:58 +0200, Peter Eisentraut wrote:
> This currently only works on macOS.  The dtrace -G calls needed on
> other platforms are not implemented yet.

I looked into that part. The make rule passes all the backend object files as
an option, but it's not clear to me where / why that's needed. On linux it
certainly works to not pass in the object files...

Maybe CI will show problems on freebsd or such...


> Therefore, the option is not auto by default.

It probably shouldn't be auto either way, there's some overhead associated
with the probes...

Greetings,

Andres Freund




Re: Correct comment in ProcedureCreate() for pgstat_create_function() call.

2022-05-13 Thread Michael Paquier
On Fri, May 13, 2022 at 04:09:00PM +0900, Michael Paquier wrote:
> As of what pgstat_create_function() does to create the stats of a new
> function in a transactional way, it looks like you are right.  Will
> fix if there are no objections.

And done with fcab82a.  Thanks, Amul.
--
Michael


signature.asc
Description: PGP signature


Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

2022-05-13 Thread Thomas Munro
On Sat, May 14, 2022 at 3:33 AM Robert Haas  wrote:
> This seems fine, but I think you should add a non-trivial comment about it.

Thanks for looking.  Done, and pushed.  Let's see if 180s per query is enough...




Re: Making JIT more granular

2022-05-13 Thread Andy Fan
Hi David:


> Does anyone have any thoughts about this JIT costing?  Is this an
> improvement?  Is there a better way?
>
>
I think this is an improvement.  However I'm not sure how much improvement
& effort we want to pay for it.  I just shared my thoughts to start this
discussion.

1. Ideally no GUC would be needed at all.  For a given operation, like
expression execution or tuple deforming, if we could know the extra cost
of JIT compilation and the cost saved by JIT at execution, we
could choose JIT automatically. But for now, it is hard to
know either of them, and we don't have GUCs for DBAs like jit_compile_cost
/ jit_compile_tuple_deform_cost either.  It looks like we have a
long way to go for this, and cost is always a headache.

2. You calculate the cost to compare with jit_above_cost as:

plan->total_cost * plan->est_loops.

An alternative way might be to consider the rescan cost, as in
cost_rescan. This should be closer to the final execution cost.
However, since it is hard to set a reasonable jit_above_cost anyway,
I feel the current way is OK as well.


3. At the implementation level, I think it would be terrible to add
another parameter like est_loops to every create_xxx_plan
in the future. An alternative way may be:

typedef struct
{
	int		est_calls;		/* estimated number of times the node is called */
} ExtPlanInfo;

void
copyExtPlanInfo(Plan *targetPlan, ExtPlanInfo ext)
{
	targetPlan->est_calls = ext.est_calls;
}

/* pseudocode: every create_xxx_plan just forwards the whole struct */
create_xxx_plan(..., ExtPlanInfo extinfo)
{
	copyExtPlanInfo(plan, extinfo);
}

This way, it would be easier to add another parameter
like est_calls later. Not sure if this is over-engineered.

I have gone through the patches for a while; in general they look
good to me. Once we have finalized the design, I can do a final
double check.

In any case, I think the patched way should be better than
the current way.

-- 
Best Regards
Andy Fan


Re: Backends stunk in wait event IPC/MessageQueueInternal

2022-05-13 Thread Thomas Munro
On Sat, May 14, 2022 at 10:25 AM Thomas Munro  wrote:
> Japin, are you able to reproduce the problem reliably?  Did I guess
> right, that you're on illumos?  Does this help?  I used
> defined(__sun__) to select the option, but I don't remember if that's
> the right way to detect that OS family, could you confirm that, or
> adjust as required?

Better version.  Now you can independently set -DWAIT_USE_{POLL,EPOLL}
and -DWAIT_USE_{SELF_PIPE,SIGNALFD} for testing, and it picks a
sensible default.
From 4f3027b35d5f25da37836b68d81fbcfb077e41d5 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 14 May 2022 10:15:30 +1200
Subject: [PATCH v2] Don't trust signalfd() on illumos.

Allow the choice between signalfd vs self-pipe to be controlled
independently of the choice between epoll() and poll().  Make the
default choice the same as before (signalfd + epoll() for Linux systems,
self-pipe + poll() for other Unixes that don't have kqueue), except on
illumos where it's self-pipe + epoll().  We don't have a very good
understanding of why, yet, but its signalfd doesn't seem to behave the
same as Linux in some edge case that leads to lost wakeups.  This way,
illumos users get a working default that doesn't give up the benefits of
epoll() over poll().

Back-patch to 14, where use of signalfd() appeared.

Discussion: https://postgr.es/m/MEYP282MB1669C8D88F0997354C2313C1B6CA9%40MEYP282MB1669.AUSP282.PROD.OUTLOOK.COM
---
 src/backend/storage/ipc/latch.c | 58 +++--
 1 file changed, 40 insertions(+), 18 deletions(-)

diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 78c6a89271..c99dbb9f46 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -72,7 +72,7 @@
 #if defined(WAIT_USE_EPOLL) || defined(WAIT_USE_POLL) || \
 	defined(WAIT_USE_KQUEUE) || defined(WAIT_USE_WIN32)
 /* don't overwrite manual choice */
-#elif defined(HAVE_SYS_EPOLL_H) && defined(HAVE_SYS_SIGNALFD_H)
+#elif defined(HAVE_SYS_EPOLL_H)
 #define WAIT_USE_EPOLL
 #elif defined(HAVE_KQUEUE)
 #define WAIT_USE_KQUEUE
@@ -84,6 +84,22 @@
 #error "no wait set implementation available"
 #endif
 
+/*
+ * By default, we use a self-pipe with poll() and a signalfd with epoll(), if
+ * available.  We avoid signalfd on illumos because it doesn't seem to work
+ * reliably.  For testing the choice can also be manually specified.
+ */
+#if defined(WAIT_USE_POLL) || defined(WAIT_USE_EPOLL)
+#if defined(WAIT_USE_SELF_PIPE) || defined(WAIT_USE_SIGNALFD)
+/* don't overwrite manual choice */
+#elif defined(WAIT_USE_EPOLL) && defined(HAVE_SYS_SIGNALFD_H) && \
+	!defined(__sun__)
+#define WAIT_USE_SIGNALFD
+#else
+#define WAIT_USE_SELF_PIPE
+#endif
+#endif
+
 /* typedef in latch.h */
 struct WaitEventSet
 {
@@ -146,12 +162,12 @@ static WaitEventSet *LatchWaitSet;
 static volatile sig_atomic_t waiting = false;
 #endif
 
-#ifdef WAIT_USE_EPOLL
+#ifdef WAIT_USE_SIGNALFD
 /* On Linux, we'll receive SIGURG via a signalfd file descriptor. */
 static int	signal_fd = -1;
 #endif
 
-#if defined(WAIT_USE_POLL)
+#ifdef WAIT_USE_SELF_PIPE
 /* Read and write ends of the self-pipe */
 static int	selfpipe_readfd = -1;
 static int	selfpipe_writefd = -1;
@@ -164,7 +180,7 @@ static void latch_sigurg_handler(SIGNAL_ARGS);
 static void sendSelfPipeByte(void);
 #endif
 
-#if defined(WAIT_USE_POLL) || defined(WAIT_USE_EPOLL)
+#if defined(WAIT_USE_SELF_PIPE) || defined(WAIT_USE_SIGNALFD)
 static void drain(void);
 #endif
 
@@ -190,7 +206,7 @@ static inline int WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 void
 InitializeLatchSupport(void)
 {
-#if defined(WAIT_USE_POLL)
+#if defined(WAIT_USE_SELF_PIPE)
 	int			pipefd[2];
 
 	if (IsUnderPostmaster)
@@ -264,7 +280,7 @@ InitializeLatchSupport(void)
 	pqsignal(SIGURG, latch_sigurg_handler);
 #endif
 
-#ifdef WAIT_USE_EPOLL
+#ifdef WAIT_USE_SIGNALFD
 	sigset_t	signalfd_mask;
 
 	/* Block SIGURG, because we'll receive it through a signalfd. */
@@ -316,7 +332,7 @@ ShutdownLatchSupport(void)
 		LatchWaitSet = NULL;
 	}
 
-#if defined(WAIT_USE_POLL)
+#if defined(WAIT_USE_SELF_PIPE)
 	close(selfpipe_readfd);
 	close(selfpipe_writefd);
 	selfpipe_readfd = -1;
@@ -324,7 +340,7 @@ ShutdownLatchSupport(void)
 	selfpipe_owner_pid = InvalidPid;
 #endif
 
-#if defined(WAIT_USE_EPOLL)
+#if defined(WAIT_USE_SIGNALFD)
 	close(signal_fd);
 	signal_fd = -1;
 #endif
@@ -341,9 +357,12 @@ InitLatch(Latch *latch)
 	latch->owner_pid = MyProcPid;
 	latch->is_shared = false;
 
-#if defined(WAIT_USE_POLL)
+#if defined(WAIT_USE_SELF_PIPE)
 	/* Assert InitializeLatchSupport has been called in this process */
 	Assert(selfpipe_readfd >= 0 && selfpipe_owner_pid == MyProcPid);
+#elif defined(WAIT_USE_SIGNALFD)
+	/* Assert InitializeLatchSupport has been called in this process */
+	Assert(signal_fd >= 0);
 #elif defined(WAIT_USE_WIN32)
 	latch->event = CreateEvent(NULL, TRUE, FALSE, NULL);
 	if (latch->event == NULL)
@@ -405,9 +424,12 @@ OwnLatch(Latch *latch)
 	/* Sanity checks */
 	

Re: First draft of the PG 15 release notes

2022-05-13 Thread Amit Kapila
On Fri, May 13, 2022 at 9:18 PM Bruce Momjian  wrote:
>
> On Fri, May 13, 2022 at 08:24:53AM +0530, Amit Kapila wrote:
> > On Fri, May 13, 2022 at 6:02 AM Euler Taveira  wrote:
> > >
> > > On Thu, May 12, 2022, at 11:22 AM, Bruce Momjian wrote:
> > >
> > > On Thu, May 12, 2022 at 11:12:54AM -0300, Euler Taveira wrote:
> > > > On Thu, May 12, 2022, at 11:03 AM, Bruce Momjian wrote:
> > > >
> > > > I looked at that but thought that everyone would already assume we
> > > > skipped replication of empty transactions, and I didn't see much 
> > > > impact
> > > > for the user, so I didn't include it.
> > > >
> > > > It certainly has an impact on heavy workloads that replicate tables 
> > > > with few
> > > > modifications. It receives a high traffic of 'begin' and 'commit' 
> > > > messages that
> > > > the previous Postgres versions have to handle (discard). I would 
> > > > classify it as
> > > > a performance improvement for logical replication. Don't have a strong 
> > > > opinion
> > > > if it should be mentioned or not.
> > >
> > > Oh, so your point is that a transaction that only has SELECT would
> > > previously send an empty transaction?  I thought this was only for apps
> > > that create literal empty transactions, which seem rare.
> > >
> > > No. It should be a write transaction. If you have a replication setup that
> > > publish only table foo (that isn't modified often) and most of your
> > > workload does not contain table foo, Postgres sends 'begin' and 'commit'
> > > messages to subscriber even if there is no change to replicate.
> > >
> >
> > It reduces network traffic and improves performance by 3-14% on simple
> > tests [1] like the one shown by Euler. I see a value in adding this as
> > for the workloads where it hits, it seems more than 99% of network
> > traffic [2] is due to these empty messages.
>
> I see the point now --- new item:
>
> Prevent logical replication of empty transactions (Ajin Cherian,
> Hou Zhijie, Euler Taveira)
>
> Previously, write transactions would send empty transactions to
> subscribers if subscribed tables were not modified.
>

Thanks!

-- 
With Regards,
Amit Kapila.




Re: JSON Functions and Operators Docs for v15

2022-05-13 Thread Erik Rijkers


Not done yet but here's where I'm at. If I'm on the wrong track or
missing things that should be done please let me know.


[sqljson-dox-rework.patch] 



Here are a few errors/typos/improvements.

I've added (=copied from the old docs) the CREATE TABLE for the my_films 
table so that the more complicated json_table examples can be run easily.



Erik Rijkers



--
Andrew Dunstan
EDB: https://www.enterprisedb.com

--- doc/src/sgml/func.sgml.orig	2022-05-14 06:32:28.564537299 +0200
+++ doc/src/sgml/func.sgml	2022-05-14 08:10:05.313313154 +0200
@@ -16287,7 +16287,7 @@


 jsonb_set_lax('[{"f1":1,"f2":null},2,null,3]', '{0,f1}', null)
-[{"f1":null,"f2":null},2,null,3]
+[{"f1": null, "f2": null}, 2, null, 3]


 jsonb_set_lax('[{"f1":99,"f2":null},2]', '{0,f3}', null, true, 'return_target')
@@ -17682,7 +17682,7 @@
 object keys.


-json('{"a" 123, "b":[true,"foo"], "a":"bar"}')
+json('{"a":123, "b":[true,"foo"], "a":"bar"}')
 {"a":123, "b":[true,"foo"], "a":"bar"}


@@ -17959,7 +17959,7 @@


 json_exists(jsonb '{"a": [1,2,3]}', 'strict $.a[5]' ERROR ON ERROR)
-ERROR: Invalid SQL/JSON subscript
+ERROR:  jsonpath array subscript is out of bounds
   
  
  
@@ -17990,11 +17990,11 @@
 of the ON EMPTY clause.


-json_value('"123.45"', '$' RETURNING float)
+json_value('"123.45"'::jsonb, '$' RETURNING float)
 123.45


-json_value('"03:04 2015-02-01"', '$.datetime("HH24:MI -MM-DD")' RETURNING date)
+json_value('"03:04 2015-02-01"'::jsonb, '$.datetime("HH24:MI -MM-DD")' RETURNING date)
 2015-02-01


@@ -18111,7 +18111,7 @@
 
 
  The input data to query, the JSON path expression defining the query,
- and an optional PASSING clause, which can privide data
+ and an optional PASSING clause, which can provide data
  values to the path_expression.
  The result of the input data
  evaluation is called the row pattern. The row
@@ -18409,6 +18409,31 @@
   Examples
 
  
+ In these examples the following small table storing some JSON data will be used:
+
+CREATE TABLE my_films ( js jsonb );
+
+INSERT INTO my_films VALUES (
+'{ "favorites" : [
+   { "kind" : "comedy", "films" : [
+ { "title" : "Bananas",
+   "director" : "Woody Allen"},
+ { "title" : "The Dinner Game",
+   "director" : "Francis Veber" } ] },
+   { "kind" : "horror", "films" : [
+ { "title" : "Psycho",
+   "director" : "Alfred Hitchcock" } ] },
+   { "kind" : "thriller", "films" : [
+ { "title" : "Vertigo",
+   "director" : "Alfred Hitchcock" } ] },
+   { "kind" : "drama", "films" : [
+ { "title" : "Yojimbo",
+   "director" : "Akira Kurosawa" } ] }
+  ] }');
+
+ 
+
+ 
   Query the my_films table holding
   some JSON data about the films and create a view that
   distributes the film genre, title, and director between separate columns:
@@ -18427,7 +18452,7 @@
  1  | comedy   | Bananas  | Woody Allen
  1  | comedy   | The Dinner Game  | Francis Veber
  2  | horror   | Psycho   | Alfred Hitchcock
- 3  | thriller | Vertigo  | Hitchcock
+ 3  | thriller | Vertigo  | Alfred Hitchcock
  4  | drama| Yojimbo  | Akira Kurosawa
  (5 rows)
 


Re: Multi-Master Logical Replication

2022-05-13 Thread Amit Kapila
On Sat, May 14, 2022 at 12:33 AM Bruce Momjian  wrote:
>
> Uh, without these features, what workload would this help with?
>

To allow replication among multiple nodes when some of the nodes may
have pre-existing data. This work plans to provide simple APIs to
achieve that. Now, let me try to explain the difficulties users can
face with the existing interface. It is simple to set up replication
among various nodes when they don't have any pre-existing data but
even in that case if the user operates on the same table at multiple
nodes, the replication will lead to an infinite loop and won't
proceed. The example in email [1] demonstrates that and the patch in
that thread attempts to solve it. I have mentioned that problem
because this work will need that patch.

Now, let's take a simple case where two nodes have the same table
which has some pre-existing data:

Node-1:
Table t1 (c1 int) has data
1, 2, 3, 4

Node-2:
Table t1 (c1 int) has data
5, 6, 7, 8

If we have to set up replication among the above two nodes using
existing interfaces, it could be very tricky. Say the user performs
operations like the ones below:

Node-1
#Publication for t1
Create Publication pub1 For Table t1;

Node-2
#Publication for t1,
Create Publication pub1_2 For Table t1;

Node-1:
Create Subscription sub1 Connection '' Publication pub1_2;

Node-2:
Create Subscription sub1_2 Connection '' Publication pub1;

After this the data will be something like this:
Node-1:
1, 2, 3, 4, 5, 6, 7, 8

Node-2:
1, 2, 3, 4, 5, 6, 7, 8, 5, 6, 7, 8

So, you can see that the data on Node-2 (5, 6, 7, 8) is duplicated. In
case table t1 has a unique key, this will lead to a unique key
violation and replication won't proceed. Here, I have assumed that we
already have the functionality of the patch in email [1]; otherwise,
replication would loop forever, replicating the above data again
and again. Now, one way to achieve this could be to ask users
to stop all operations on both nodes before starting replication
between them, take data dumps of the tables from each node they want
to replicate, and restore them to the other nodes. Then use the above
commands to set up replication and allow operations to resume on those
nodes. The other possibility for users could be as below. Assume we
have already created publications as in the above example, and then:

Node-2:
Create Subscription sub1_2 Connection '' Publication pub1;

#Wait for the initial sync of table t1 to finish. Users can ensure
that by checking 'srsubstate' in pg_subscription_rel.

Node-1:
Begin;
# Disallow truncates to be published and then truncate the table
Alter Publication pub1 Set (publish = 'insert, update, delete');
Truncate t1;
Create Subscription sub1 Connection '' Publication pub1_2;
Alter Publication pub1 Set (publish = 'insert, update, delete, truncate');
Commit;

This will become more complicated when more than two nodes are
involved; see the example provided for the three-node case [2]. Can
you think of some other, simpler way to achieve the same? If not, I
don't think the current way is ideal, and users won't prefer it either.
I am not saying that the APIs proposed in this thread are the only or
best way to achieve the desired purpose, but I think we should do
something to allow users to easily set up replication among multiple
nodes.

[1] - 
https://www.postgresql.org/message-id/CALDaNm0gwjY_4HFxvvty01BOT01q_fJLKQ3pWP9%3D9orqubhjcQ%40mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/CALDaNm3aD3nZ0HWXA8V435AGMvORyR5-mq2FzqQdKQ8CPomB5Q%40mail.gmail.com

-- 
With Regards,
Amit Kapila.