Re: recovery test failure on morepork with timestamp mystery
On 2022-05-13 04:14, Andres Freund wrote: One interesting tidbit is that the log timestamps are computed differently (with elog.c:get_formatted_log_time()) than the reset timestamp (with GetCurrentTimestamp()). Both use gettimeofday() internally. I wonder if there's a chance that somehow openbsd ends up with more usecs than "fit" in a second in the result of gettimeofday()? The elog.c case would truncate everything above a second away afaics:

    /* 'paste' milliseconds into place... */
    sprintf(msbuf, ".%03d", (int) (saved_timeval.tv_usec / 1000));
    memcpy(formatted_log_time + 19, msbuf, 4);

whereas GetCurrentTimestamp() would add them to the timestamp:

    result = (TimestampTz) tp.tv_sec -
        ((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) * SECS_PER_DAY);
    result = (result * USECS_PER_SEC) + tp.tv_usec;

Well, I don't know if you remember, but there was a thread a while back with a test program (monotime.c) to check whether the clock could go backwards, and openbsd showed the following result when running the attached test program:

openbsd 5.9:
$ ./monotime
1021006 Starting
1017367 Starting
1003415 Starting
1007598 Starting
1021006 Stopped
1007598 Stopped
1017367 Stopped
1003415 Stopped

openbsd 6.9:
$ ./monotime
410310 Starting
547727 Starting
410310 Back 262032.372314102 => 262032.242045208
410310 Stopped
465180 Starting
255646 Starting
547727 Stopped
465180 Stopped
255646 Stopped

could that have something to do with it?

/Mikael

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <time.h>
#include <sys/types.h>
#include <sys/time.h>

#define NTHREAD 4
#define NTRY 5

void *
start(void *dummy)
{
	int i;
	struct timespec ts0, ts1;

	printf("%d Starting\n", (int)getthrid());
	clock_gettime(CLOCK_MONOTONIC, &ts0);
	for (i = 0; i < NTRY; i++) {
		clock_gettime(CLOCK_MONOTONIC, &ts1);
		if (timespeccmp(&ts0, &ts1, <=)) {
			ts0 = ts1;
			continue;
		}
		printf("%d Back %lld.%09lu => %lld.%09lu\n",
		    (int)getthrid(), ts0.tv_sec, ts0.tv_nsec,
		    ts1.tv_sec, ts1.tv_nsec);
		break;
	}
	printf("%d Stopped\n", (int)getthrid());
	return (NULL);
}

int
main(int argc, char *argv[])
{
	int i, n = NTHREAD;
	pthread_t *threads;

	threads = calloc(n, sizeof(pthread_t));
	for (i = 0; i < n; i++)
		pthread_create(&threads[i], NULL, start, NULL);
	for (i = 0; i < n; i++)
		pthread_join(threads[i], NULL);
}
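To make the difference Andres describes concrete, here is a tiny standalone sketch (not PostgreSQL code; the timeval values are invented) of what would happen if gettimeofday() ever returned tv_usec >= 1000000: the elog.c-style formatting keeps only four characters of the millisecond string, so the overflow is silently dropped, while the GetCurrentTimestamp()-style arithmetic carries it into the seconds, and the two timestamps end up a second apart (the Postgres epoch offset is left out, as it doesn't change the picture).

#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define USECS_PER_SEC 1000000

int
main(void)
{
	/* invented out-of-range result: a whole second hiding in tv_usec */
	long		tv_sec = 100;
	long		tv_usec = 1000500;
	char		msbuf[8];
	char		pasted[5] = {0};
	int64_t		result;

	/* elog.c style: format ".mmm" and memcpy 4 bytes into the log line */
	snprintf(msbuf, sizeof(msbuf), ".%03d", (int) (tv_usec / 1000));
	memcpy(pasted, msbuf, 4);
	printf("formatted_log_time style: second %ld, fraction \"%s\"\n",
		   tv_sec, pasted);		/* prints: second 100, fraction ".100" */

	/* GetCurrentTimestamp() style: usecs are added, so the overflow carries */
	result = (int64_t) tv_sec * USECS_PER_SEC + tv_usec;
	printf("GetCurrentTimestamp style: second %lld, %lld usec\n",
		   (long long) (result / USECS_PER_SEC),
		   (long long) (result % USECS_PER_SEC));	/* second 101, 500 usec */
	return 0;
}

So a reset timestamp computed the GetCurrentTimestamp() way could appear to be ahead of a log line formatted the elog.c way, even though both came from the same gettimeofday() reading.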
Re: Correct comment in ProcedureCreate() for pgstat_create_function() call.
On Fri, May 13, 2022 at 10:22:57AM +0530, Amul Sul wrote: > Sorry, hit the send button too early :| - /* ensure that stats are dropped if transaction commits */ + /* ensure that stats are dropped if transaction aborts */ if (!is_update) pgstat_create_function(retval); As of what pgstat_create_function() does to create the stats of a new function in a transactional way, it looks like you are right. Will fix if there are no objections. -- Michael signature.asc Description: PGP signature
Re: [PATCH] Compression dictionaries for JSONB
Hi hackers, > Here is the 2nd version of the patch: > > - Includes changes named above > - Fixes a warning reported by cfbot > - Fixes some FIXME's > - The patch includes some simple tests now > - A proper commit message was added Here is the rebased version of the patch. Changes compared to v2 are minimal. > Open questions: > - Dictionary entries are currently stored as NameData, the same type that is > used for enums. Are we OK with the accompanying limitations? Any alternative > suggestions? > - All in all, am I moving in the right direction? I would like to receive a little bit more feedback before investing more time into this effort. This will allow me, if necessary, to alter the overall design more easily. -- Best regards, Aleksander Alekseev v3-0001-Compression-dictionaries-for-JSONB.patch Description: Binary data
RE: Perform streaming logical transactions by background workers and parallel apply
On Thursday, May 5, 2022 1:46 PM Peter Smith wrote: > Here are my review comments for v5-0001. > I will take a look at the v5-0002 (TAP) patch another time. Thanks for the comments! > 4. Commit message > > User can set the streaming option to 'on/off', 'apply'. For now, > 'apply' means the streaming will be applied via an apply background worker if > available. 'on' means the streaming transaction will be spilled to > disk. > > > I think "apply" might not be the best choice of values for this > meaning, but I think Hou-san already said [1] that this was being > reconsidered. Yes, I am thinking over this along with some other related stuff[1] posted by Amit and Sawada-san. Will change this in the next version. [1] https://www.postgresql.org/message-id/flat/CAA4eK1%2B7D4qAQUQEE8zzQ0fGCqeBWd3rzTaY5N0jVs-VXFc_Xw%40mail.gmail.com > 7. src/backend/commands/subscriptioncmds.c - defGetStreamingMode > > +static char > +defGetStreamingMode(DefElem *def) > +{ > + /* > + * If no parameter given, assume "true" is meant. > + */ > + if (def->arg == NULL) > + return SUBSTREAM_ON; > > But is that right? IIUC all the docs said that the default is OFF. I think it's right. "arg == NULL" means the user specified the streaming option without a value, like CREATE SUBSCRIPTION xxx WITH(streaming). The value should be 'on' in this case. > 12. src/backend/replication/logical/origin.c - replorigin_session_setup > > @@ -1110,7 +1110,11 @@ replorigin_session_setup(RepOriginId node) > if (curstate->roident != node) > continue; > > - else if (curstate->acquired_by != 0) > + /* > + * We allow the apply worker to get the slot which is acquired by its > + * leader process. > + */ > + else if (curstate->acquired_by != 0 && acquire) > > I still feel this is overly-confusing. Shouldn't the comment say "Allow the > apply bgworker to get the slot...". > > Also the parameter name 'acquire' is hard to reconcile with the > comment. E.g. I feel all this would be easier to understand if the > param was refactored with a name like 'bgworker' and the code was > changed to: > else if (curstate->acquired_by != 0 && !bgworker) > > Of course, the value true/false would need to be flipped on calls too. > This is the same as my previous comment [PSv4] #26. I feel it's not a good idea to mention bgworker in origin.c. I have removed this comment and added some other comments in worker.c. > 26. src/backend/replication/logical/worker.c - apply_handle_stream_abort > > + if (found) > + { > + elog(LOG, "rolled back to savepoint %s", spname); > + RollbackToSavepoint(spname); > + CommitTransactionCommand(); > + subxactlist = list_truncate(subxactlist, i + 1); > + } > > Should that elog use the "[Apply BGW #%u]" format like the others for BGW? I feel the "[Apply BGW #%u]" format is a bit hacky and some of them come from the old patchset. I will recheck these logs, adjust them, and change some log levels in the next version. > 27. src/backend/replication/logical/worker.c - apply_handle_stream_abort > > Should this function be setting stream_apply_worker = NULL somewhere > when all is done? > 29. src/backend/replication/logical/worker.c - apply_handle_stream_commit > > I am unsure, but should something be setting the stream_apply_worker = > NULL somewhere when all is done? I think the worker is already set to NULL in apply_handle_stream_stop. > 32. src/backend/replication/logical/worker.c - ApplyBgwShutdown > > +/* > + * Set the failed flag so that the main apply worker can realize we have > + * shutdown.
> + */ > +static void > +ApplyBgwShutdown(int code, Datum arg) > > If the 'code' param is deliberately unused it might be better to say > so in the comment... Not sure about this. After searching the code, I think most of the callback functions don't use the 'code' param or add comments for it. > 45. src/backend/utils/activity/wait_event.c > > @@ -388,6 +388,9 @@ pgstat_get_wait_ipc(WaitEventIPC w) > case WAIT_EVENT_HASH_GROW_BUCKETS_REINSERT: > event_name = "HashGrowBucketsReinsert"; > break; > + case WAIT_EVENT_LOGICAL_APPLY_WORKER_READY: > + event_name = "LogicalApplyWorkerReady"; > + break; > > I am not sure this is the best name for this event since the only > place it is used (in apply_bgworker_wait_for) is not only waiting for > READY state. Maybe a name like WAIT_EVENT_LOGICAL_APPLY_BGWORKER or > WAIT_EVENT_LOGICAL_APPLY_WORKER_SYNC would be more appropriate? Need > to change the wait_event.h also. I noticed a similarly named "WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE", so I changed this to WAIT_EVENT_LOGICAL_APPLY_WORKER_STATE_CHANGE. > 47. src/test/regress/expected/subscription.out - missing test > > Missing some test cases for all new option values? E.g. Where is the > test where the streaming value is set to 'apply'? Same comment as [PSv4] > #81 The new option is tested in the second patch posted by Shi yu. I addressed other comments from Peter and the 2PC-related comment from Shi. Here is the new version of the patch. Best regards, Hou zj v6-0001-Pe
RE: Perform streaming logical transactions by background workers and parallel apply
On Wednesday, May 11, 2022 1:10 PM Amit Kapila wrote: > > On Wed, May 11, 2022 at 9:35 AM Masahiko Sawada > wrote: > > > > On Tue, May 10, 2022 at 5:59 PM Amit Kapila > wrote: > > > > > > On Tue, May 10, 2022 at 10:39 AM Masahiko Sawada > wrote: > > > > > > > > Having it optional seems a good idea. BTW can the user configure > > > > how many apply bgworkers can be used per subscription or in the > > > > whole system? Like max_sync_workers_per_subscription, is it better > > > > to have a configuration parameter or a subscription option for > > > > that? If so, setting it to 0 probably means to disable the parallel > > > > apply > feature. > > > > > > > > > > Yeah, that might be useful but we are already giving an option while > > > creating a subscription whether to allow parallelism, so will it be > > > useful to give one more way to disable this feature? OTOH, having > > > something like max_parallel_apply_workers/max_bg_apply_workers at > > > the system level can give better control for how much parallelism > > > the user wishes to allow for apply work. > > > > Or we can have something like > > max_parallel_apply_workers_per_subscription that controls how many > > parallel apply workers can launch per subscription. That also gives > > better control for the number of parallel apply workers. > > > > I think we can go either way in this matter as both have their pros and cons. > I > feel limiting the parallel workers per subscription gives better control but > OTOH, it may not allow max usage of parallelism because some quota from > other subscriptions might remain unused. Let us see what Hou-San or others > think on this matter? Thanks for Amit and Sawada-san's comments ! I will think over these approaches and reply soon. Best regards, Hou zj
Re: postgres_fdw "parallel_commit" docs
Hi Jonathan, On Thu, May 12, 2022 at 10:32 PM Jonathan S. Katz wrote: > On 5/12/22 7:26 AM, Etsuro Fujita wrote: > > Attached is an updated patch. I'll commit the patch if no objections. > > I think this is much easier to read. Cool! > I made a few minor copy edits. Please see attached. LGTM, so I pushed the patch. Thanks for the patch and taking the time to improve this! Best regards, Etsuro Fujita
RE: Data is copied twice when specifying both child and parent table in publication
On Fri, May 13, 2022 1:59 PM Amit Kapila wrote: > On Fri, May 13, 2022 at 7:32 AM wangw.f...@fujitsu.com > wrote: > > > > Attach the patches.(Only changed the patch for HEAD.). > > > > Few comments: > = Thanks for your comments. > 1. > @@ -1135,6 +1172,15 @@ pg_get_publication_tables(PG_FUNCTION_ARGS) > if (publication->pubviaroot) > tables = filter_partitions(tables); > } > + pfree(elems); > + > + /* > + * We need an additional filter for this case : A partition table is > + * published in a publication with viaroot, and its parent or child > + * table is published in another publication without viaroot. In this > + * case, we should publish only parent table. > + */ > + tables = filter_partitions(tables); > > Do we need to filter partitions twice? Can't we check if any of the > publications > has 'pubviaroot' option set, if so, call filter_partitions at the end? Improve it according to your suggestion. > 2. " FROM pg_class c JOIN pg_namespace n" > + " ON n.oid = c.relnamespace," > + " LATERAL pg_get_publication_tables(array[ %s ]) gst" > > Here, it is better to have an alias name as gpt. Improve it according to your suggestion. > 3. > } > + pfree(elems); > + > > An extra line between these two lines makes it looks slightly better. Improve it according to your suggestion. > 4. Not able to apply patch cleanly. > patching file src/test/subscription/t/013_partition.pl > Hunk #1 FAILED at 477. > Hunk #2 FAILED at 556. > Hunk #3 FAILED at 584. > 3 out of 3 hunks FAILED -- saving rejects to file > src/test/subscription/t/013_partition.pl.rej > patching file src/test/subscription/t/028_row_filter.pl > Hunk #1 succeeded at 394 (offset 1 line). > Hunk #2 FAILED at 722. > 1 out of 2 hunks FAILED -- saving rejects to file > src/test/subscription/t/028_row_filter.pl.rej > patching file src/test/subscription/t/031_column_list.pl > Hunk #1 succeeded at 948 (offset -92 lines). > Hunk #2 FAILED at 1050. > 1 out of 2 hunks FAILED -- saving rejects to file > src/test/subscription/t/031_column_list.pl.rej New patch could apply patch cleanly now. Attach the patches.(Only changed the patch for HEAD.). 1. Optimize the code. Reduce calls to function filter_partitions. [suggestions by Amit-san] 2. Improve the alias name in SQL. [suggestions by Amit-san] 3. Improve coding alignments. [suggestions by Amit-san] 4. Do some optimizations for list Concatenate. Regards, Wang wei HEAD_v4-0001-Fix-data-replicated-twice-when-specifying-PUBLISH.patch Description: HEAD_v4-0001-Fix-data-replicated-twice-when-specifying-PUBLISH.patch REL14_v4-0001-Fix-data-replicated-twice-when-specifying-PUBLISH.patch Description: REL14_v4-0001-Fix-data-replicated-twice-when-specifying-PUBLISH.patch
Backends stunk in wait event IPC/MessageQueueInternal
Hi, hackers. I had an incident on my Postgres 14 where queries hung in wait event IPC / MessageQueueInternal, MessageQueueReceive. It is similar to [1]; however, that thread doesn't have any discussion. The process cannot be terminated by pg_terminate_backend(), although it returns true. Here is the call stack, which comes from pstack:

485073: /opt/local/pgsql/14/bin/postgres
 fb7fef216f4a ioctl (d, d001, fb7fffdfa0e0)
 008b8ec2 WaitEventSetWait () + 112
 008b920f WaitLatch () + 6f
 008bf434 shm_mq_wait_internal () + 64
 008bff74 shm_mq_receive () + 2b4
 0079fdc8 TupleQueueReaderNext () + 28
 0077d8ca gather_merge_readnext () + 13a
 0077db25 ExecGatherMerge () + 215
 00790675 ExecNextLoop () + 175
 00790675 ExecNextLoop () + 175
 0076267d standard_ExecutorRun () + fd
 fb7fe3965fbd pgss_executorRun () + fd
 008df99b PortalRunSelect () + 1cb
 008e0dcf PortalRun () + 17f
 008ddacd PostgresMain () + 100d
 00857f62 ServerLoop () + cd2
 00858cee main () + 453
 005ab777 _start_crt () + 87
 005ab6d8 _start () + 18

Any suggestions? Thanks in advance!

[1] https://www.postgresql.org/message-id/flat/E9FA92C2921F31408041863B74EE4C2001A479E590%40CCPMAILDAG03.cantab.local

-- Regards, Japin Li. ChengDu WenWu Information Technology Co.,Ltd.
Re: Backends stunk in wait event IPC/MessageQueueInternal
On Fri, May 13, 2022 at 06:16:23PM +0800, Japin Li wrote: > I had an incident on my Postgres 14 that queries hung in wait event > IPC / MessageQueueInternal, MessageQueueReceive. It likes [1], > however, it doesn't have any discussions. If the process is still running, or if the problem recurs, I suggest to create a corefile with gcore, aka gdb generate-core-file. Then, we can look at the backtrace at our leisure, even if the cluster needed to be restarted right away. What minor version of postgres is this, and what OS ? -- Justin
Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication
On Tue, May 10, 2022 at 5:55 PM Andrey Borodin wrote: > > > 10 мая 2022 г., в 12:59, Bharath Rupireddy > > написал(а): > > > > If okay, I can make the GUC behave this way - value 0 existing > > behaviour i.e. no wait for sync repl ack, just process query cancels > > and proc die interrupts immediately; value -1 wait unboundedly for the > > ack; value > 0 wait for specified milliseconds for the ack. > +1 if we make -1 and 0 only valid values. > > > query cancels or proc die interrupts > > Please note, that typical HA tool would need to handle query cancels and proc > die interrupts differently. Agree. > When the network is partitioned and somewhere standby is promoted you > definitely want infinite wait for cancels. When standby is promoted, no point the old primary waiting forever for ack assuming we are going to discard it. > Yet once upon a time you want to shutdown postgres without coredump - thus > proc die needs to be processed. I think users can still have the flexibility to set the right amounts of time to process cancel and proc die interrupts. IMHO, the GUC can still have 0, -1 and value > 0 in milliseconds, let the users decide on what they want. Do you see any problems with this? Thoughts? Regards, Bharath Rupireddy.
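To make the proposed semantics concrete, here is a minimal self-contained C sketch, not actual PostgreSQL code; the GUC name and the helper function are invented for illustration: 0 keeps today's behaviour (react to query cancel / proc die interrupts immediately), -1 waits unboundedly for the ack, and a value > 0 starts honouring the interrupts only after that many milliseconds.

#include <stdio.h>
#include <stdbool.h>

/*
 * Sketch of how the proposed GUC could gate interrupt processing while
 * waiting for the synchronous replication ack.  The name
 * sync_rep_ack_wait_timeout_ms is made up for this example.
 */
static bool
interrupts_ok_now(int sync_rep_ack_wait_timeout_ms, long waited_ms)
{
	if (sync_rep_ack_wait_timeout_ms == 0)
		return true;			/* existing behaviour: react immediately */
	if (sync_rep_ack_wait_timeout_ms < 0)
		return false;			/* -1: wait unboundedly for the ack */
	return waited_ms >= sync_rep_ack_wait_timeout_ms;	/* N ms grace period */
}

int
main(void)
{
	long	waited_ms = 1500;	/* how long we have already waited */

	printf("timeout=0    -> %d\n", interrupts_ok_now(0, waited_ms));
	printf("timeout=-1   -> %d\n", interrupts_ok_now(-1, waited_ms));
	printf("timeout=1000 -> %d\n", interrupts_ok_now(1000, waited_ms));
	printf("timeout=2000 -> %d\n", interrupts_ok_now(2000, waited_ms));
	return 0;
}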
Re: Comments on Custom RMGRs
On Fri, 13 May 2022 at 05:13, Jeff Davis wrote: > > On Thu, 2022-05-12 at 22:26 +0100, Simon Riggs wrote: > > I see multiple uses for the rm_checkpoint() point proposed and I've > > been asked multiple times for a checkpoint hook. > > Can you elaborate and/or link to a discussion? Those were conversations away from Hackers, but I'm happy to share. The first was a discussion about a data structure needed by BDR about 4 years ago. In the absence of a pluggable checkpoint, the solution was forced to use a normal table, which wasn't very satisfactory. The second was a more recent conversation with Mike Stonebraker, at the end of 2021.. He was very keen to remove the buffer manager entirely, which requires that we have a new smgr, which then needs new code to allow it to be written to disk at checkpoint time, which then requires some kind of pluggable code at checkpoint time. (Mike was also keen to remove WAL, but that's another story entirely!). The last use case was unlogged indexes, which need to be read from disk at startup or rebuilt after crash, which requires RmgrStartup to work both with and without InRedo=true. -- Simon Riggshttp://www.EnterpriseDB.com/
Re: Can postgres ever delete the recycled future WAL files to free-up disk space if max_wal_size is reduced or wal_recycle is set to off?
On Mon, May 9, 2022 at 6:47 PM Bharath Rupireddy wrote: > > On Fri, May 6, 2022 at 10:20 PM Tom Lane wrote: > > > > Bharath Rupireddy writes: > > > Can postgres delete the recycled future WAL files once max_wal_size is > > > reduced and/or wal_recycle is set to off? > > > > A checkpoint should do that, see RemoveOldXlogFiles. > > > > Maybe you have a broken WAL archiving setup, or something else preventing > > removal of old WAL files? > > Thanks Tom. My test case is simple [1], no archiving, no replication > slots - just plain initdb-ed cluster. My expectation is that whenever > max_wal_size/wal_recycle is changed from the last checkpoint value, > postgres must be able to delete "optionally" "all or some of" the > future WAL files to free-up some disk space (which is about to get > full) so that I can avoid server crashes and I will have some time to > go scale the disk. > > [1] > show min_wal_size; > show max_wal_size; > show wal_recycle; > > drop table foo; > create table foo(col int); > > -- run below pg_switch_wal and insert statements 9 times. > select pg_switch_wal(); > insert into foo select * from generate_series(1, 1000); > > select redo_wal_file from pg_control_checkpoint(); > > checkpoint; > --there will be around 10 recycled WAL future WAL files. > > alter system set max_wal_size to '240MB'; > select pg_reload_conf(); > show max_wal_size; > > checkpoint; > --future WAL files will not be deleted. > > alter system set min_wal_size to '24MB'; > select pg_reload_conf(); > show min_wal_size; > > checkpoint; > --future WAL files will not be deleted. > > alter system set wal_recycle to off; > select pg_reload_conf(); > show wal_recycle; > > checkpoint; > --future WAL files will not be deleted. Hi, I'm thinking out loud - can we add all the recycled WAL files to a sorted list (oldest recycled WAL file to new recycled WAL file) and then during checkpoint, if the max_wal_size is reduced or wal_recycle is set to off, then start deleting the future WAL files from the end of the sorted list. Upon restart of the server, if required, the sorted list of future WAL files can be rebuilt. Thoughts? Regards, Bharath Rupireddy.
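As an illustration of the idea, here is a small self-contained C sketch (not xlog.c code; the segment numbers and the allowed count are invented) of keeping the recycled segments in a list sorted from oldest to newest and trimming from the newest end once the limit implied by max_wal_size shrinks:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Sketch only: recycled "future" WAL segments tracked by segment number. */
static int
cmp_seg(const void *a, const void *b)
{
	uint64_t	x = *(const uint64_t *) a;
	uint64_t	y = *(const uint64_t *) b;

	return (x > y) - (x < y);
}

int
main(void)
{
	uint64_t	recycled[] = {105, 101, 104, 102, 103};	/* invented */
	size_t		n = sizeof(recycled) / sizeof(recycled[0]);
	size_t		allowed = 2;	/* e.g. what the reduced max_wal_size permits */
	size_t		i;

	/* keep the list sorted oldest-to-newest */
	qsort(recycled, n, sizeof(uint64_t), cmp_seg);

	/* drop segments from the newest end until we are within the limit */
	while (n > allowed)
	{
		n--;
		printf("would remove recycled segment %llu\n",
			   (unsigned long long) recycled[n]);
	}

	for (i = 0; i < n; i++)
		printf("keeping recycled segment %llu\n",
			   (unsigned long long) recycled[i]);
	return 0;
}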
Re: Unfiltered server logs routing via a new elog hook or existing emit_log_hook bypassing log_min_message check
On Mon, May 2, 2022 at 7:37 PM Tom Lane wrote: > > Julien Rouhaud writes: > > On Mon, May 02, 2022 at 07:24:04PM +0530, Bharath Rupireddy wrote: > >> I basically want to avoid normal users/developers setting any > >> parameter (especially the superuser-only log_min_message GUC, all > >> users might not have superuser access in production environments) or > >> making any changes to the running server except just LOADing the > >> server log routing/intercepting extension. > > > The kind of scenario you mentioned didn't seem "normal users" oriented. > > Note > > that LOAD is restricted to superuser anyway. > > It seems completely silly to be worrying that setting a GUC in a > particular way is too hard for somebody who's going to be installing > a loadable extension. In any case, if you wanted to force the issue > you could set log_min_messages in the extension's _PG_init function. Thanks Tom and Julien. I developed a simple external module called pg_intercept_logs [1]. [1] https://github.com/BRupireddy/pg_intercept_server_logs Regards, Bharath Rupireddy.
Re: Can postgres ever delete the recycled future WAL files to free-up disk space if max_wal_size is reduced or wal_recycle is set to off?
On Fri, May 06, 2022 at 09:50:26PM +0530, Bharath Rupireddy wrote: > It seems like that's not the case because postgres will not remove future WAL > files even after max_wal_size is reduced, In the past, I've had to generate synthetic write traffic and checkpoints to get WAL to shrink. +1 to make it respect max_wal_size on its own. -- Justin
Re: Comments on Custom RMGRs
On Fri, 13 May 2022 at 00:42, Andres Freund wrote: > > On 2022-05-12 22:26:51 +0100, Simon Riggs wrote: > > On Thu, 12 May 2022 at 04:40, Andres Freund wrote: > > > I'm not happy with the idea of random code being executed in the middle of > > > CheckPointGuts(), without any documentation of what is legal to do at that > > > point. > > > > The "I'm not happy.." ship has already sailed with pluggable rmgrs. > > I don't agree. The ordering within a checkpoint is a lot more fragile than > what you do in an individual redo routine. Example? > > Checkpoints exist for one purpose - as the starting place for recovery. > > > > Why would we allow pluggable recovery without *also* allowing > > pluggable checkpoints? > > Because one can do a lot of stuff with just pluggable WAL records, without > integrating into checkpoints? > > Note that I'm *not* against making checkpoint extensible - I just think it > needs a good bit of design work around when the hook is called etc. When was any such work done previously for any other hook?? That isn't needed. Checkpoints aren't complete until all data structures have checkpointed, so there are no problems from a partial checkpoint being written. As a result, the order of actions in CheckpointGuts() is mostly independent of each other. The SLRUs are all independent of each other, as is CheckPointBuffers(). The use cases I'm trying to support aren't tricksy modifications of existing code, they are just entirely new data structures which are completely independent of other Postgres objects. > I definitely think it's too late in the cycle to add checkpoint extensibility > now. > > > > > To actually be useful we'd likely need multiple calls to such an rmgr > > > callback, with a parameter where in CheckPointGuts() we are. Right now the > > > sequencing is explicit in CheckPointGuts(), but with the proposed > > > callback, > > > that'd not be the case anymore. > > > > It is useful without the extra complexity you mention. > > Shrug. The documentation work definitely is needed. Perhaps we can get away > without multiple callbacks within a checkpoint, I think it'll become more > apparent when writing information about the precise point in time the > checkpoint callback is called. You seem to be thinking in terms of modifying the existing actions in CheckpointGuts(). I don't care about that. Anybody that wishes to do that can work out the details of their actions. There is nothing to document, other than "don't do things that won't work". How can anyone enumerate all the things that wouldn't work?? There is no list of caveats for any other hook. Why is it needed here? > > I see multiple uses for the rm_checkpoint() point proposed and I've > > been asked multiple times for a checkpoint hook. Any rmgr that > > services crash recovery for a non-smgr based storage system would need > > this because the current checkpoint code only handles flushing to disk > > for smgr-based approaches. That is orthogonal to other code during > > checkpoint, so it stands alone quite well. > > FWIW, for that there are much bigger problems than checkpoint > extensibility. Most importantly there's currently no good way to integrate > relation creation / drop with the commit / abort infrastructure... One at a time... -- Simon Riggshttp://www.EnterpriseDB.com/
Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)
On Thu, May 5, 2022 at 2:07 PM Alvaro Herrera wrote: > > On 2022-May-05, Bharath Rupireddy wrote: > > > On Fri, Apr 29, 2022 at 4:11 PM Alvaro Herrera > > wrote: > > > > > > Did we ever consider the idea of using a new pg_stat_wal_activity_progress > > > view or something like that, using the backend_progress.c functionality? > > > I don't see it mentioned in the thread. > > > > IMO, progress reporting works well on a running server and at the > > moment. The WAL recovery/replay can happen even before the server > > opens up for connections > > It's definitely true that you wouldn't be able to use it for when the > server is not accepting connections. > > > and the progress report view can't be used > > for later analysis like how much time the restoring WAL files from > > archive location took > > This is true too -- progress report doesn't store historical data, only > current status. > > > and also the WAL file names can't be reported in progress reporting > > mechanism > > Also true. > > > (only integers columns, of course if required we can add text columns > > to pg_stat_get_progress_info). > > Yeah, I don't think adding text columns is terribly easy, because the > whole point of the progress reporting infrastructure is that it can be > updated very cheaply as atomic operations, and if you want to transmit > text columns, that's no longer possible. > > > Having the recovery info in server logs might help. > > I suppose it might. > > > I think reporting a long-running file processing operation (removing > > or syncing) within postgres is a generic problem (for snapshot, > > mapping, temporary (pgsql_tmp), temp relation files, old WAL file > > processing, WAL file processing during recovery etc.) and needs to be > > solved > > I agree up to here. > > > in two ways: 1) logging progress into server logs (which helps > > for analysis and report when the server isn't available for > > connections, crash recovery), a generic GUC > > log_file_processing_traffic = {none, medium, high} might help here > > (also proposed in [1]) and 2) pg_stat_file_processing_progress > > (extending progress reporting pg_stat_get_progress_info to have few > > text columns for current file name and directory path). > > I think using the server log to store telemetry data is not a great fit. > It bloats the log files and can be so slow as to block other operations > in the server. Server logs should normally be considered critical info > that's not okay to lose; telemetry tends to be of secondary importance > and in a pinch you can drop a few messages without hurting too much. > > We've done moderately okay so far with having some system views where > some telemetry readings can be obtained, but there several drawbacks to > that approach that we should at some point solve. My opinion on this is > that we need to bite the bullet and develop separate infrastructure for > reporting server metrics. > > I just > think that we should look into a new mechanism going forward. I completely agree that we must have new ways to report important server metrics in an easily consumable fashion - it could be something like server computing metrics (time taken, the file being processed etc.) important/time-taking operations such as execution of archive command, restore command, recovery, checkpoint, old WAL files removal/recycling, processing of various files - snapshot, mapping, pgsql_tmp files, temp relation files etc. 
and writing these metrics to a file or a stats table in the database itself (this table could have an automatic cleanup mechanism, or a limit on the number of rows or on the size of the table), so that clients can read those metrics and use them in whichever way they want. To start with, a simple structure for the metrics could be {OPERATION text, START_TIME timestamp, END_TIME timestamp, plus a few flexible text, double and/or integer columns that each operation can use to store metrics}. I can start a discussion in a separate thread to seek more thoughts on the server metrics part. > That said, I'm not opposed to having a patch somewhat as posted. Thanks. I'm wondering if we should have a generic GUC log_file_processing_traffic = {none, medium, high} that might help report other time-taking file processing operations such as [1]. [1] https://www.postgresql.org/message-id/CALj2ACW-ELOF5JT2zPavs95wbZ0BrLPrqvSZ7Ac%2BpjxCkmXtEQ%40mail.gmail.com Regards, Bharath Rupireddy.
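Purely as an illustration of the record layout sketched above (the struct, field names, and sample values are invented and not from any posted patch), a C rendering of one metrics entry could look like this:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Illustrative only: one metrics record per tracked server operation. */
typedef struct ServerMetricRecord
{
	char		operation[64];	/* e.g. "restore_command", "checkpoint" */
	time_t		start_time;		/* START_TIME timestamp */
	time_t		end_time;		/* END_TIME timestamp */
	char		text_col[128];	/* flexible text column, e.g. file name */
	double		double_col;		/* flexible numeric column, e.g. seconds */
	int64_t		int_col;		/* flexible integer column, e.g. bytes */
} ServerMetricRecord;

int
main(void)
{
	ServerMetricRecord rec;

	memset(&rec, 0, sizeof(rec));
	snprintf(rec.operation, sizeof(rec.operation), "restore_command");
	rec.start_time = time(NULL);
	rec.end_time = rec.start_time + 3;	/* invented duration */
	snprintf(rec.text_col, sizeof(rec.text_col), "000000010000000000000042");
	rec.double_col = 3.0;
	rec.int_col = 16 * 1024 * 1024;

	printf("%s: %s took %.1f s (%lld bytes)\n",
		   rec.operation, rec.text_col, rec.double_col,
		   (long long) rec.int_col);
	return 0;
}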
Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)
On Fri, Apr 29, 2022 at 5:11 AM Bharath Rupireddy wrote: > Here's the rebased v9 patch. This seems like it has enormous overlap with the existing functionality that we have from log_startup_progress_interval. I think that facility is also better-designed than this one. It prints out a message based on elapsed time, whereas this patch prints out a message based on progress through the WAL. That means that if WAL replay isn't actually advancing for some reason, you just won't get any log messages and you don't know whether it's advancing slowly or not at all or the server is just hung. With that facility you can distinguish those cases. Also, if for some reason we do think that the amount of WAL replayed is the right metric, rather than time, why would we only allow high=1 segment and low=128 segments, rather than say any number of MB or GB that the user would like to configure? I suggest that if log_startup_progress_interval doesn't meet your needs here, we should try to understand why not and maybe enhance it, instead of adding a separate facility. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Backends stunk in wait event IPC/MessageQueueInternal
On Fri, 13 May 2022 at 19:41, Justin Pryzby wrote: > On Fri, May 13, 2022 at 06:16:23PM +0800, Japin Li wrote: >> I had an incident on my Postgres 14 that queries hung in wait event >> IPC / MessageQueueInternal, MessageQueueReceive. It likes [1], >> however, it doesn't have any discussions. > > If the process is still running, or if the problem recurs, I suggest to create > a corefile with gcore, aka gdb generate-core-file. Then, we can look at the > backtrace at our leisure, even if the cluster needed to be restarted right > away. > Thanks for your advice, I will try it later. > What minor version of postgres is this, and what OS ? PostgreSQL 14.2 and Solaris. -- Regrads, Japin Li. ChengDu WenWu Information Technology Co.,Ltd.
Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)
> here is the rebased v32 version of the patch. > > The patchset rotted a bit. Here is a rebased version. > We have posted an updated version v34 of the whole patchset in [1]. Changes of patches 0001-0003 there are identical to v33. So, no update is needed in this thread. [1] https://www.postgresql.org/message-id/flat/CACG%3DezZe1NQSCnfHOr78AtAZxJZeCvxrts0ygrxYwe%3DpyyjVWA%40mail.gmail.com -- Best regards, Maxim Orlov.
Re: make MaxBackends available in _PG_init
On Thu, May 12, 2022 at 8:15 PM Michael Paquier wrote: > On Thu, May 12, 2022 at 06:51:59PM +0300, Anton A. Melnikov wrote: > > Maybe remove the first part from the patchset? > > Because now the Patch Tester is giving apply error for the first part and > > can't test the other. > > http://cfbot.cputube.org/patch_38_3614.log > > Yep. As simple as the attached. Committed. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Backends stunk in wait event IPC/MessageQueueInternal
On Fri, May 13, 2022 at 6:16 AM Japin Li wrote: > The process cannot be terminated by pg_terminate_backend(), although > it returns true. pg_terminate_backend() just sends SIGINT. What I'm wondering is what happens when the stuck process receives SIGINT. It would be useful, I think, to check the value of the global variable InterruptHoldoffCount in the stuck process by attaching to it with gdb. I would also try running "strace -p $PID" on the stuck process and then try terminating it again with pg_terminate_backend(). Either the system call in which it's currently stuck returns and then it makes the same system call again and hangs again ... or the signal doesn't dislodge it from the system call in which it's stuck in the first place. It would be useful to know which of those two things is happening. One thing I find a bit curious is that the top of the stack in your case is ioctl(). And there are no calls to ioctl() anywhere in latch.c, nor have there ever been. What operating system is this? We have 4 different versions of WaitEventSetWaitBlock() that call epoll_wait(), kevent(), poll(), and WaitForMultipleObjects() respectively. I wonder which of those we're using, and whether one of those calls is showing up as ioctl() in the stacktrace, or whether there's some other function being called in here that is somehow resulting in ioctl() getting called. -- Robert Haas EDB: http://www.enterprisedb.com
RE: Data is copied twice when specifying both child and parent table in publication
On Friday, May 13, 2022 6:42 PM Wang, Wei/王 威 wrote: > Attach the patches.(Only changed the patch for HEAD.). > 1. Optimize the code. Reduce calls to function filter_partitions. > [suggestions by > Amit-san] 2. Improve the alias name in SQL. [suggestions by Amit-san] 3. > Improve coding alignments. [suggestions by Amit-san] 4. Do some > optimizations for list Concatenate. Hi, thank you for updating the patch. I have one minor comment on fetch_table_list() in HEAD v4. @@ -1759,17 +1759,22 @@ static List * fetch_table_list(WalReceiverConn *wrconn, List *publications) { WalRcvExecResult *res; - StringInfoData cmd; + StringInfoData cmd, + pub_names; TupleTableSlot *slot; Oid tableRow[2] = {TEXTOID, TEXTOID}; List *tablelist = NIL; + initStringInfo(&pub_names); + get_publications_str(publications, &pub_names, true); + Kindly free the pub_names's data along with the cmd.data. Best Regards, Takamichi Osumi
Re: First draft of the PG 15 release notes
On Fri, May 13, 2022 at 01:36:04AM +, osumi.takami...@fujitsu.com wrote: > > > > This is enabled with the subscriber option "disable_on_error" > > and avoids possible infinite loops during stream application. > > > > > Thank you ! > > In this last paragraph, how about replacing "infinite loops" > with "infinite error loops" ? I think it makes the situation somewhat > clear for readers. Agreed, done. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
On Thu, May 12, 2022 at 10:20 PM Thomas Munro wrote: > As for skink failing, the timeout was hard coded 300s for the whole > test, but apparently that wasn't enough under valgrind. Let's use the > standard PostgreSQL::Test::Utils::timeout_default (180s usually), but > reset it for each query we send. @@ -202,6 +198,9 @@ sub send_query_and_wait my ($psql, $query, $untl) = @_; my $ret; + $psql_timeout->reset(); + $psql_timeout->start(); + # send query $$psql{stdin} .= $query; $$psql{stdin} .= "\n"; This seems fine, but I think you should add a non-trivial comment about it. -- Robert Haas EDB: http://www.enterprisedb.com
Re: make MaxBackends available in _PG_init
On Fri, May 13, 2022 at 09:49:54AM -0400, Robert Haas wrote: > Committed. Thanks! -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: First draft of the PG 15 release notes
On Fri, May 13, 2022 at 10:48:41AM +0900, Amit Langote wrote: > On Thu, May 12, 2022 at 10:52 PM Bruce Momjian wrote: > > Okay, I went with: > > > > Previously, such updates ran delete actions on the source > > partition and insert actions on the target partition. PostgreSQL > > will > > now run update actions on the referenced partition root. > > WFM, thanks. > > Btw, perhaps the following should be listed under E.1.3.2.1. Logical > Replication, not E.1.3.1.1. Partitioning? > > > > > > Remove incorrect duplicate partitions in system view > pg_publication_tables (Hou Zhijie) > > > > Attached a patch to do so. Agreed, done. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Re: Comments on Custom RMGRs
On Fri, 2022-05-13 at 13:31 +0100, Simon Riggs wrote: > The first was a discussion about a data structure needed by BDR about > 4 years ago. In the absence of a pluggable checkpoint, the solution > was forced to use a normal table, which wasn't very satisfactory. I'm interested to hear more about this case. Are you developing it into a full table AM? In my experience with columnar, there's still a long tail of things I wish I had in the backend to better support complete table AMs. > The second was a more recent conversation with Mike Stonebraker, at > the end of 2021.. He was very keen to remove the buffer manager > entirely, which requires that we have a new smgr, which then needs > new > code to allow it to be written to disk at checkpoint time, which then > requires some kind of pluggable code at checkpoint time. (Mike was > also keen to remove WAL, but that's another story entirely!). I'm guessing that would be more of an experimental/ambitious project, and based on modified postgres anyway. > The last use case was unlogged indexes, which need to be read from > disk at startup or rebuilt after crash, which requires RmgrStartup to > work both with and without InRedo=true. That sounds like a core feature, in which case we can just refactor that for v16. It might be a nice cleanup for unlogged tables, too. I don't think your 002-v2 patch is particularly risky, but any reluctance at all probably pushes it to v16 given that it's so late in the cycle. Regards, Jeff Davis
Re: First draft of the PG 15 release notes
On Fri, May 13, 2022 at 08:24:53AM +0530, Amit Kapila wrote: > On Fri, May 13, 2022 at 6:02 AM Euler Taveira wrote: > > > > On Thu, May 12, 2022, at 11:22 AM, Bruce Momjian wrote: > > > > On Thu, May 12, 2022 at 11:12:54AM -0300, Euler Taveira wrote: > > OB> On Thu, May 12, 2022, at 11:03 AM, Bruce Momjian wrote: > > > > > > I looked at that but thought that everyone would already assume we > > > skipped replication of empty transactions, and I didn't see much > > > impact > > > for the user, so I didn't include it. > > > > > > It certainly has an impact on heavy workloads that replicate tables with > > > few > > > modifications. It receives a high traffic of 'begin' and 'commit' > > > messages that > > > the previous Postgres versions have to handle (discard). I would classify > > > it as > > > a performance improvement for logical replication. Don't have a strong > > > opinion > > > if it should be mentioned or not. > > > > Oh, so your point is that a transaction that only has SELECT would > > previously send an empty transaction? I thought this was only for apps > > that create literal empty transactions, which seem rare. > > > > No. It should be a write transaction. If you have a replication setup that > > publish only table foo (that isn't modified often) and most of your > > workload does not contain table foo, Postgres sends 'begin' and 'commit' > > messages to subscriber even if there is no change to replicate. > > > > It reduces network traffic and improves performance by 3-14% on simple > tests [1] like the one shown by Euler. I see a value in adding this as > for the workloads where it hits, it seems more than 99% of network > traffic [2] is due to these empty messages. I see the point now --- new item: Prevent logical replication of empty transactions (Ajin Cherian, Hou Zhijie, Euler Taveira) Previously, write transactions would send empty transactions to subscribers if subscribed tables were not modified. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Re: Should use MERGE use BulkInsertState ?
On Wed, May 11, 2022 at 12:48 PM Justin Pryzby wrote: > The point of the ring/strategy buffer is to avoid a seq scan/vacuum/COPY > clobbering the entire buffer cache when processing a table larger than > shared_buffers. > > It also makes backends doing bulk operations responsible for their own writes, > rather than leaving other backends to clean up all their dirty buffers. Right. It's not just about how fast individual operations go: dirtying all of shared_buffers is faster *for you* but you've imposed a distributed cost on everyone else on the system, which isn't great. That said, I'm not really sure what the proper handling is in this specific case. I guess we should try our best to have it be symmetric with other similar cases, but questions have been raised repeatedly about how well the ring buffer stuff actually works. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Comments on Custom RMGRs
On Fri, May 13, 2022 at 8:47 AM Simon Riggs wrote: > > Note that I'm *not* against making checkpoint extensible - I just think it > > needs a good bit of design work around when the hook is called etc. > > When was any such work done previously for any other hook?? That isn't needed. I think almost every proposal to add a hook results in some discussion about how usable the hook will be and whether it's being put in the correct place and called with the correct arguments. I think that's a good thing, too. Otherwise the code would be cluttered with a bunch of hooks that seemed to someone like a good idea at the time but are actually just a maintenance headache. -- Robert Haas EDB: http://www.enterprisedb.com
Re: recovery test failure on morepork with timestamp mystery
Hi, On 2022-05-13 09:00:20 +0200, Mikael Kjellström wrote: > Well, I don't know if you remember but there was a thread a while back and a > test program (monotime.c) to test the clock if it could go backwards and > openbsd showed the following result when running the attached testprogram: Nope, didn't remember... > $ ./monotime > 410310 Starting > 547727 Starting > 410310 Back 262032.372314102 => 262032.242045208 > 410310 Stopped > 465180 Starting > 255646 Starting > 547727 Stopped > 465180 Stopped > 255646 Stopped > > could that have something to do with it? Yes! > printf("%d Back %lld.%09lu => %lld.%09lu\n", > (int)getthrid(), ts0.tv_sec, ts0.tv_nsec, ts1.tv_sec, > ts1.tv_nsec); > break; I wonder whether the %09lu potentially is truncating ts1.tv_nsec. I can't reproduce the problem trivially in an openbsd VM I had around. But it's 7.1, so maybe that's the reason? Greetings, Andres Freund
Re: Multi-Master Logical Replication
On Fri, Apr 29, 2022 at 07:05:11PM +1000, Peter Smith wrote: > This MMLR proposal was mostly just to create an interface making it > easier to use PostgreSQL core logical replication CREATE > PUBLICATION/SUBSCRIPTION for table sharing among a set of nodes. > Otherwise, this is difficult for a user to do manually. (e.g. > difficulties as mentioned in section 2.2 of the original post [1] - > dealing with initial table data, coordinating the timing/locking to > avoid concurrent updates, getting the SUBSCRIPTION options for > copy_data exactly right etc) > > At this time we have no provision for HA, nor for transaction > consistency awareness, conflict resolutions, node failure detections, > DDL replication etc. Some of the features like DDL replication are > currently being implemented [2], so when committed it will become > available in the core, and can then be integrated into this module. Uh, without these features, what workload would this help with? I think you made the mistake of jumping too far into implementation without explaining the problem you are trying to solve. The TODO list has this ordering: https://wiki.postgresql.org/wiki/Todo Desirability -> Design -> Implement -> Test -> Review -> Commit -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Re: Privileges on PUBLICATION
On Fri, May 13, 2022, at 3:36 AM, Antonin Houska wrote: > Attached is my proposal. It tries to be more specific and does not mention the > absence of the privileges explicitly. You explained the current issue but say nothing about the limitation. This information will trigger a question possibly in one of the MLs. IMO if you say something like the sentence above at the end, it will make it clear why that setup expose all data (there is no access control to publications) and explicitly say there is a TODO here. Additional privileges might be added to control access to table data in a future version of PostgreSQL. I also wouldn't use the warning tag because it fits in the same category as the other restrictions listed in the page. -- Euler Taveira EDB https://www.enterprisedb.com/
Re: Item compression in the Gist index
On Mon, May 2, 2022 at 02:48:27PM +0200, Baca Radim wrote: > Hi, > we are developing an extension for multidimensional data. We have created a > Gist index that is heavily inspired by a cube extension. Right now we would > like to add some item compression since data items in a node share a > significant portion of a tuple prefix. However, I have no idea how to handle > information stored on a node level in Gists' compression method. Is there > any example? Unfortunately, the cube extension does not implement item > compression. > To be more specific we would like to store in the node a common prefix for > all tuples in the node. Uh, SP-GiST does prefix compression. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Re: recovery test failure on morepork with timestamp mystery
Hi, On 2022-05-13 10:22:32 -0700, Andres Freund wrote: > On 2022-05-13 09:00:20 +0200, Mikael Kjellström wrote: > > Well, I don't know if you remember but there was a thread a while back and a > > test program (monotime.c) to test the clock if it could go backwards and > > openbsd showed the following result when running the attached testprogram: > > Nope, didn't remember... > > > > $ ./monotime > > 410310 Starting > > 547727 Starting > > 410310 Back 262032.372314102 => 262032.242045208 > > 410310 Stopped > > 465180 Starting > > 255646 Starting > > 547727 Stopped > > 465180 Stopped > > 255646 Stopped > > > > could that have something to do with it? > > Yes! > > > > printf("%d Back %lld.%09lu => %lld.%09lu\n", > > (int)getthrid(), ts0.tv_sec, ts0.tv_nsec, ts1.tv_sec, > > ts1.tv_nsec); > > break; > > I wonder whether the %09lu potentially is truncating ts1.tv_nsec. > > > I can't reproduce the problem trivially in an openbsd VM I had around. But > it's 7.1, so maybe that's the reason? What does sysctl kern.timecounter return? Does the problem continue if you switch kern.timecounter.hardware to something else? In https://postgr.es/m/32aaeb66-71b2-4af0-91ef-1a992ac4d58b%40mksoft.nu you said it was using acpitimer0 and that it's a vmware VM. It might also be a vmware bug, not an openbsd one... Greetings, Andres Freund
list of TransactionIds
We didn't have any use of TransactionId as members of List, until RelationSyncEntry->streamed_txns was introduced (464824323e57, pg14). It's currently implemented as a list of int. This is not wrong at present, but it may soon be, and I'm sure it rubs some people the wrong way. But is the rubbing way wrong enough to add support for TransactionId in pg_list.h, including, say, T_XidList? The minimal patch (attached) is quite small AFAICS. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ >From 19f165f556a3cee7226a7b22b9bdf7ad9bce9fb0 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Fri, 29 Apr 2022 17:42:17 +0200 Subject: [PATCH] Implement List support for TransactionId Use it for RelationSyncEntry->streamed_txns, which is currently using an integer list. --- src/backend/nodes/list.c| 42 - src/backend/nodes/outfuncs.c| 4 ++ src/backend/replication/pgoutput/pgoutput.c | 13 ++- src/include/nodes/nodes.h | 1 + src/include/nodes/pg_list.h | 6 +++ 5 files changed, 56 insertions(+), 10 deletions(-) diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c index f843f861ef..9d8f4fd5c7 100644 --- a/src/backend/nodes/list.c +++ b/src/backend/nodes/list.c @@ -54,6 +54,7 @@ #define IsPointerList(l) ((l) == NIL || IsA((l), List)) #define IsIntegerList(l) ((l) == NIL || IsA((l), IntList)) #define IsOidList(l) ((l) == NIL || IsA((l), OidList)) +#define IsXidList(l) ((l) == NIL || IsA((l), XidList)) #ifdef USE_ASSERT_CHECKING /* @@ -71,7 +72,8 @@ check_list_invariants(const List *list) Assert(list->type == T_List || list->type == T_IntList || - list->type == T_OidList); + list->type == T_OidList || + list->type == T_XidList); } #else #define check_list_invariants(l) ((void) 0) @@ -383,6 +385,24 @@ lappend_oid(List *list, Oid datum) return list; } +/* + * Append a TransactionId to the specified list. See lappend() + */ +List * +lappend_xid(List *list, TransactionId datum) +{ + Assert(IsXidList(list)); + + if (list == NIL) + list = new_list(T_XidList, 1); + else + new_tail_cell(list); + + llast_xid(list) = datum; + check_list_invariants(list); + return list; +} + /* * Make room for a new cell at position 'pos' (measured from 0). * The data in the cell is left undefined, and must be filled in by the @@ -714,6 +734,26 @@ list_member_oid(const List *list, Oid datum) return false; } +/* + * Return true iff the TransactionId 'datum' is a member of the list. + */ +bool +list_member_xid(const List *list, TransactionId datum) +{ + const ListCell *cell; + + Assert(IsXidList(list)); + check_list_invariants(list); + + foreach(cell, list) + { + if (lfirst_oid(cell) == datum) + return true; + } + + return false; +} + /* * Delete the n'th cell (counting from 0) in list. 
* diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c index 0271ea9d78..0f257ed0ce 100644 --- a/src/backend/nodes/outfuncs.c +++ b/src/backend/nodes/outfuncs.c @@ -221,6 +221,8 @@ _outList(StringInfo str, const List *node) appendStringInfoChar(str, 'i'); else if (IsA(node, OidList)) appendStringInfoChar(str, 'o'); + else if (IsA(node, XidList)) + appendStringInfoChar(str, 'x'); foreach(lc, node) { @@ -239,6 +241,8 @@ _outList(StringInfo str, const List *node) appendStringInfo(str, " %d", lfirst_int(lc)); else if (IsA(node, OidList)) appendStringInfo(str, " %u", lfirst_oid(lc)); + else if (IsA(node, XidList)) + appendStringInfo(str, " %u", lfirst_xid(lc)); else elog(ERROR, "unrecognized list node type: %d", (int) node->type); diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c index 42c06af239..d4a84631c7 100644 --- a/src/backend/replication/pgoutput/pgoutput.c +++ b/src/backend/replication/pgoutput/pgoutput.c @@ -1923,13 +1923,8 @@ init_rel_sync_cache(MemoryContext cachectx) static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid) { - ListCell *lc; - - foreach(lc, entry->streamed_txns) - { - if (xid == (uint32) lfirst_int(lc)) - return true; - } + if (list_member_xid(entry->streamed_txns, xid)) + return true; return false; } @@ -1945,7 +1940,7 @@ set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid) oldctx = MemoryContextSwitchTo(CacheMemoryContext); - entry->streamed_txns = lappend_int(entry->streamed_txns, xid); + entry->streamed_txns = lappend_xid(entry->streamed_txns, xid); MemoryContextSwitchTo(oldctx); } @@ -2248,7 +2243,7 @@ cleanup_rel_sync_cache(TransactionId xid, bool is_commit) */ foreach(lc, entry->streamed_txns) { - if (xid == (uint32) lfirst_int(lc)) + if (xid == lfirst_xid(lc)) { if (is_commit) entry->schema_sent = true; diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h index b3b407579b..7ce1fc4deb 100644 --- a
Re: recovery test failure on morepork with timestamp mystery
On 2022-05-13 22:09, Andres Freund wrote: What does sysctl kern.timecounter return? Does the problem continue if you switch kern.timecounter.hardware to something else?

openbsd 5.9:
$ sysctl kern.timecounter
kern.timecounter.tick=1
kern.timecounter.timestepwarnings=0
kern.timecounter.hardware=acpihpet0
kern.timecounter.choice=i8254(0) acpihpet0(1000) acpitimer0(1000)
$ ./monotime
1024736 Starting
1013674 Starting
1028468 Starting
1014641 Starting
1013674 Stopped
1024736 Stopped
1014641 Stopped
1028468 Stopped

no problem

openbsd 6.9:
$ sysctl kern.timecounter
kern.timecounter.tick=1
kern.timecounter.timestepwarnings=0
kern.timecounter.hardware=tsc
kern.timecounter.choice=i8254(0) acpihpet0(1000) tsc(2000) acpitimer0(1000)

Hm, here it's using the tsc timer. So that is a difference from 5.9.

$ ./monotime
133998 Starting
408137 Starting
578042 Starting
326139 Starting
133998 Back 310670.000931851668 => 310670.000801582864
133998 Stopped
326139 Stopped
578042 Stopped
408137 Stopped

it's only in 6.9 that the problem with the timer going backwards shows up. If I switch the timer to acpihpet0 this is the result:

$ ./monotime
101622 Starting
480782 Starting
219318 Starting
316650 Starting
101622 Stopped
480782 Stopped
219318 Stopped
316650 Stopped

so that seems to solve the problem with the timer going backwards.

In https://postgr.es/m/32aaeb66-71b2-4af0-91ef-1a992ac4d58b%40mksoft.nu you said it was using acpitimer0 and that it's a vmware VM. It might also be a vmware bug, not an openbsd one... Might be a bug in vmware, but it's running the latest patchlevel for 6.7 and no other VM has this problem, so it seems to only happen in openbsd 6.9 for some reason. Maybe that is the only VM that is using tsc as a timer source though?

/Mikael
Re: Backends stunk in wait event IPC/MessageQueueInternal
On Sat, May 14, 2022 at 2:09 AM Robert Haas wrote: > On Fri, May 13, 2022 at 6:16 AM Japin Li wrote: > > The process cannot be terminated by pg_terminate_backend(), although > > it returns true. > One thing I find a bit curious is that the top of the stack in your > case is ioctl(). And there are no calls to ioctl() anywhere in > latch.c, nor have there ever been. What operating system is this? We > have 4 different versions of WaitEventSetWaitBlock() that call > epoll_wait(), kevent(), poll(), and WaitForMultipleObjects() > respectively. I wonder which of those we're using, and whether one of > those calls is showing up as ioctl() in the stacktrace, or whether > there's some other function being called in here that is somehow > resulting in ioctl() getting called. I guess this is really illumos (née OpenSolaris), not Solaris, using our epoll build mode, with illumos's emulation of epoll, which maps epoll onto Sun's /dev/poll driver: https://github.com/illumos/illumos-gate/blob/master/usr/src/lib/libc/port/sys/epoll.c#L230 That'd explain: fb7fef216f4a ioctl (d, d001, fb7fffdfa0e0) That matches the value DP_POLL from: https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/sys/devpoll.h#L44 Or if it's really Solaris, huh, are people moving illumos code back into closed Solaris these days? As for why it's hanging, I don't know, but one thing that we changed in 14 was that we started using signalfd() to receive latch signals on systems that have it, and illumos also has an emulation of signalfd() that our configure script finds: https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/io/signalfd.c There were in fact a couple of unexplained hangs on the illumos build farm animals, and then they were changed to use -DWAIT_USE_POLL so that they wouldn't automatically choose epoll()/signalfd(). That is not very satisfactory, but as far as I know there is a bug in either epoll() or signalfd(), or at least some difference compared to the Linux implementation they are emulating. spent quite a bit of time ping ponging emails back and forth with the owner of a hanging BF animal trying to get a minimal repro for a bug report, without success. I mean, it's possible that the bug is in PostgreSQL (though no complaint has ever reached me about this stuff on Linux), but while trying to investigate it a kernel panic happened[1], which I think counts as a point against that theory... (For what it's worth, WSL1 also emulates these two Linux interfaces and also apparently doesn't do so well enough for our purposes, also for reasons not understood by us.) In short, I'd recommend -DWAIT_USE_POLL for now. It's possible that we could do something to prevent the selection of WAIT_USE_EPOLL on that platform, or that we should have a halfway option epoll() but not signalfd() (= go back to using the self-pipe trick), patches welcome, but that feels kinda strange and would be very niche combination that isn't fun to maintain... the real solution is to fix the bug. [1] https://www.illumos.org/issues/13700
Re: Backends stunk in wait event IPC/MessageQueueInternal
On Sat, May 14, 2022 at 9:25 AM Thomas Munro wrote: > In short, I'd recommend -DWAIT_USE_POLL for now. It's possible that > we could do something to prevent the selection of WAIT_USE_EPOLL on > that platform, or that we should have a halfway option epoll() but not > signalfd() (= go back to using the self-pipe trick), patches welcome, > but that feels kinda strange and would be very niche combination that > isn't fun to maintain... the real solution is to fix the bug. I felt a bit sad about writing that, so I took a crack at trying to write a patch that separates the signalfd/self-pipe choice from the epoll/poll choice. Maybe it's not too bad. Japin, are you able to reproduce the problem reliably? Did I guess right, that you're on illumos? Does this help? I used defined(__sun__) to select the option, but I don't remember if that's the right way to detect that OS family, could you confirm that, or adjust as required? From 316ec0895c7ab65102c8c100774b51fc0e86c859 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Sat, 14 May 2022 10:15:30 +1200 Subject: [PATCH] Don't trust signalfd() on illumos. --- src/backend/storage/ipc/latch.c | 40 + 1 file changed, 26 insertions(+), 14 deletions(-) diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c index 78c6a89271..8b0b31d100 100644 --- a/src/backend/storage/ipc/latch.c +++ b/src/backend/storage/ipc/latch.c @@ -84,6 +84,18 @@ #error "no wait set implementation available" #endif +/* + * We need the self-pipe trick for race-free wakeups when using poll(). + * + * XXX We also need it when using epoll() on illumos, because its signalfd() + * seems to be broken. + */ +#if defined(WAIT_USE_POLL) || (defined(WAIT_USE_EPOLL) && defined(__sun__)) +#define WAIT_USE_SELF_PIPE +#elif defined(WAIT_USE_EPOLL) +#define WAIT_USE_SIGNALFD +#endif + /* typedef in latch.h */ struct WaitEventSet { @@ -146,12 +158,12 @@ static WaitEventSet *LatchWaitSet; static volatile sig_atomic_t waiting = false; #endif -#ifdef WAIT_USE_EPOLL +#ifdef WAIT_USE_SIGNALFD /* On Linux, we'll receive SIGURG via a signalfd file descriptor. */ static int signal_fd = -1; #endif -#if defined(WAIT_USE_POLL) +#ifdef WAIT_USE_SELF_PIPE /* Read and write ends of the self-pipe */ static int selfpipe_readfd = -1; static int selfpipe_writefd = -1; @@ -164,7 +176,7 @@ static void latch_sigurg_handler(SIGNAL_ARGS); static void sendSelfPipeByte(void); #endif -#if defined(WAIT_USE_POLL) || defined(WAIT_USE_EPOLL) +#if defined(WAIT_USE_SELF_PIPE) || defined(WAIT_USE_SIGNALFD) static void drain(void); #endif @@ -190,7 +202,7 @@ static inline int WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout, void InitializeLatchSupport(void) { -#if defined(WAIT_USE_POLL) +#if defined(WAIT_USE_SELF_PIPE) int pipefd[2]; if (IsUnderPostmaster) @@ -264,7 +276,7 @@ InitializeLatchSupport(void) pqsignal(SIGURG, latch_sigurg_handler); #endif -#ifdef WAIT_USE_EPOLL +#ifdef WAIT_USE_SIGNALFD sigset_t signalfd_mask; /* Block SIGURG, because we'll receive it through a signalfd. 
*/ @@ -316,7 +328,7 @@ ShutdownLatchSupport(void) LatchWaitSet = NULL; } -#if defined(WAIT_USE_POLL) +#if defined(WAIT_USE_SELF_PIPE) close(selfpipe_readfd); close(selfpipe_writefd); selfpipe_readfd = -1; @@ -324,7 +336,7 @@ ShutdownLatchSupport(void) selfpipe_owner_pid = InvalidPid; #endif -#if defined(WAIT_USE_EPOLL) +#if defined(WAIT_USE_SIGNALFD) close(signal_fd); signal_fd = -1; #endif @@ -904,9 +916,9 @@ AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd, Latch *latch, { set->latch = latch; set->latch_pos = event->pos; -#if defined(WAIT_USE_POLL) +#if defined(WAIT_USE_SELF_PIPE) event->fd = selfpipe_readfd; -#elif defined(WAIT_USE_EPOLL) +#elif defined(WAIT_USE_SIGNALFD) event->fd = signal_fd; #else event->fd = PGINVALID_SOCKET; @@ -2083,7 +2095,7 @@ GetNumRegisteredWaitEvents(WaitEventSet *set) return set->nevents; } -#if defined(WAIT_USE_POLL) +#if defined(WAIT_USE_SELF_PIPE) /* * SetLatch uses SIGURG to wake up the process waiting on the latch. @@ -2134,7 +2146,7 @@ retry: #endif -#if defined(WAIT_USE_POLL) || defined(WAIT_USE_EPOLL) +#if defined(WAIT_USE_SELF_PIPE) || defined(WAIT_USE_SIGNALFD) /* * Read all available data from self-pipe or signalfd. @@ -2150,7 +2162,7 @@ drain(void) int rc; int fd; -#ifdef WAIT_USE_POLL +#ifdef WAIT_USE_SELF_PIPE fd = selfpipe_readfd; #else fd = signal_fd; @@ -2168,7 +2180,7 @@ drain(void) else { waiting = false; -#ifdef WAIT_USE_POLL +#ifdef WAIT_USE_SELF_PIPE elog(ERROR, "read() on self-pipe failed: %m"); #else elog(ERROR, "read() on signalfd failed: %m"); @@ -2178,7 +2190,7 @@ drain(void) else if (rc == 0) { waiting = false; -#ifdef WAIT_USE_POLL +#ifdef WAIT_USE_SELF_PIPE elog(ERROR, "unexpected EOF on self-pipe"); #else elog(ERROR, "unexpected EOF on signalfd"); -- 2.30.
Re: Removing unneeded self joins
Hi, w.r.t. v33-0001-Remove-self-joins.patch : removes inner join of plane table -> removes inner join of plain table in an query plan -> in a query plan + * Used as the relation_has_unique_index_for, Since relation_has_unique_index_for() becomes a wrapper of relation_has_unique_index_ext, the above sentence doesn't make much sense. I think you can drop this part. but if extra_clauses doesn't NULL -> If extra_clauses isn't NULL + is_req_equal = + (rinfo->required_relids == rinfo->clause_relids) ? true : false; The above can be simplified to: is_req_equal = rinfo->required_relids == rinfo->clause_relids; + ListCell *otherCell; otherCell should be initialized to NULL. + if (bms_is_member(k, info->syn_lefthand) && + !bms_is_member(r, info->syn_lefthand)) + jinfo_check = false; + else if (bms_is_member(k, info->syn_righthand) && + !bms_is_member(r, info->syn_righthand)) + jinfo_check = false; + else if (bms_is_member(r, info->syn_lefthand) && + !bms_is_member(k, info->syn_lefthand)) + jinfo_check = false; I think the above code can be simplified: If bms_is_member(k, info->syn_lefthand) ^ bms_is_member(r, info->syn_lefthand) is true, jinfo_check is false. If bms_is_member(k, info->syn_righthand) ^ bms_is_member(r, info->syn_righthand) is true, jinfo_check is false. Otherwise jinfo_check is true. Cheers
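To make the suggested simplification concrete, here is a hedged sketch of how those branches could collapse. Variable names follow the quoted hunk; the enclosing loop, declarations, and the fourth symmetric branch are assumed, and boolean equality is used rather than bitwise ^:

/*
 * jinfo_check is true only if k and r sit on the same side(s) of the join:
 * membership in syn_lefthand and syn_righthand must match for both relids.
 */
jinfo_check =
    (bms_is_member(k, info->syn_lefthand) ==
     bms_is_member(r, info->syn_lefthand)) &&
    (bms_is_member(k, info->syn_righthand) ==
     bms_is_member(r, info->syn_righthand));

/* ... and, as noted above, the ternary can simply become: */
is_req_equal = (rinfo->required_relids == rinfo->clause_relids);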
Re: [RFC] building postgres with meson -v8
Hi, On 2022-05-11 12:18:58 +0200, Peter Eisentraut wrote: > This currently only works on macOS. The dtrace -G calls needed on > other platforms are not implemented yet. I looked into that part. The make rule passes all the backend object files as an option, but it's not clear to me where / why that's needed. On linux it certainly works to not pass in the object files... Maybe CI will show problems on freebsd or such... > Therefore, the option is not auto by default. It probably shouldn't be auto either way, there's some overhead associated with the probes... Greetings, Andres Freund
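For readers following along, the step under discussion looks roughly like the sketch below; this is a hedged approximation of the command shapes, not the literal make rule or a meson target:

# Solaris/FreeBSD-style dtrace: probes.o is generated by post-processing the
# probe definitions together with the backend object files containing the
# probe call sites
dtrace -C -G -s src/backend/utils/probes.d -o probes.o <all backend .o files>

# Linux (SystemTap's dtrace compatibility script): generating probes.o from
# the .d file alone appears to be sufficient, matching the observation above
dtrace -G -s src/backend/utils/probes.d -o probes.o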
Re: Correct comment in ProcedureCreate() for pgstat_create_function() call.
On Fri, May 13, 2022 at 04:09:00PM +0900, Michael Paquier wrote: > As of what pgstat_create_function() does to create the stats of a new > function in a transactional way, it looks like you are right. Will > fix if there are no objections. And done with fcab82a. Thanks, Amul. -- Michael signature.asc Description: PGP signature
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
On Sat, May 14, 2022 at 3:33 AM Robert Haas wrote: > This seems fine, but I think you should add a non-trivial comment about it. Thanks for looking. Done, and pushed. Let's see if 180s per query is enough...
Re: Making JIT more granular
Hi David:

> Does anyone have any thoughts about this JIT costing? Is this an
> improvement? Is there a better way?

I think this is an improvement. However, I'm not sure how much improvement and effort we want to pay for it; I just shared my thoughts to start this discussion.

1. Ideally there would be no GUC needed at all. For a given operation, like expression execution or tuple deforming, if we could know both the extra cost JIT adds at compile time and the cost it saves at execution time, we could choose JIT automatically. But for now it is hard to estimate either, and we don't have GUCs for the DBA like jit_compile_cost / jit_compile_tuple_deform_cost either. It looks like we have a long way to go for this, and cost is always a headache.

2. You calculate the cost to compare with jit_above_cost as plan->total_cost * plan->est_loops. An alternative way might be to consider the rescan cost, like cost_rescan does; that should be closer to the final execution cost. However, since it is hard to set a reasonable jit_above_cost anyway, I feel the current way is OK as well.

3. At the implementation level, I think it would be terrible to add another parameter like est_loops to every create_xxx_plan in the future. An alternative way may be:

typedef struct
{
    int est_calls;
} ExtPlanInfo;

void
copyExtPlanInfo(Plan *targetPlan, ExtPlanInfo ext)
{
    targetPlan->est_calls = ext.est_calls;
}

create_xxx_plan(..., ExtPlanInfo extinfo)
{
    copyExtPlanInfo(plan, extinfo);
}

This way, it would be easier to add another parameter like est_calls later. Not sure if this is over-engineered.

I have gone through the patches for a while and in general they look good to me. If we have finalized the design, I can do a final double check. All in all, I think the patched way should be better than the current way.

--
Best Regards
Andy Fan
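In case it helps the discussion, here is a minimal, self-contained sketch of the ExtPlanInfo idea from point 3 combined with the threshold test from point 2. It is only an illustration under assumed names: Plan, est_calls, jit_worthwhile and the jit_above_cost comparison are toy stand-ins, not the actual planner structures or logic touched by the patch.

#include <stdbool.h>
#include <stdio.h>

/* Toy stand-ins; these are NOT PostgreSQL's real planner structures. */
typedef struct Plan
{
    double      total_cost;
    int         est_calls;      /* how often this node is expected to run */
} Plan;

typedef struct ExtPlanInfo
{
    int         est_calls;
} ExtPlanInfo;

/* Copy the creation-time extras into the finished plan node, as proposed. */
static void
copyExtPlanInfo(Plan *targetPlan, ExtPlanInfo ext)
{
    targetPlan->est_calls = ext.est_calls;
}

/* Hypothetical JIT decision: total cost scaled by expected calls vs. a GUC. */
static bool
jit_worthwhile(const Plan *plan, double jit_above_cost)
{
    return plan->total_cost * plan->est_calls >= jit_above_cost;
}

int
main(void)
{
    Plan        inner = {.total_cost = 500.0, .est_calls = 0};
    ExtPlanInfo info = {.est_calls = 1000};  /* e.g. inner side of a nestloop */

    copyExtPlanInfo(&inner, info);
    printf("use JIT? %s\n", jit_worthwhile(&inner, 100000.0) ? "yes" : "no");
    return 0;
}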
Re: Backends stunk in wait event IPC/MessageQueueInternal
On Sat, May 14, 2022 at 10:25 AM Thomas Munro wrote: > Japin, are you able to reproduce the problem reliably? Did I guess > right, that you're on illumos? Does this help? I used > defined(__sun__) to select the option, but I don't remember if that's > the right way to detect that OS family, could you confirm that, or > adjust as required? Better version. Now you can independently set -DWAIT_USE_{POLL,EPOLL} and -DWAIT_USE_{SELF_PIPE,SIGNALFD} for testing, and it picks a sensible default. From 4f3027b35d5f25da37836b68d81fbcfb077e41d5 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Sat, 14 May 2022 10:15:30 +1200 Subject: [PATCH v2] Don't trust signalfd() on illumos. Allow the choice between signalfd vs self-pipe to be controlled independently of the choice between epoll() and poll(). Make the default choice the same as before (signalfd + epoll() for Linux systems, self-pipe + poll() for other Unixes that don't have kqueue), except on illumos where it's self-pipe + epoll(). We don't have a very good understanding of why, yet, but its signalfd doesn't seem to behave the same as Linux in some edge case that leads to lost wakeups. This way, illumos users get a working default that doesn't give up the benefits of epoll() over poll(). Back-patch to 14, where use of signalfd() appeared. Discussion: https://postgr.es/m/MEYP282MB1669C8D88F0997354C2313C1B6CA9%40MEYP282MB1669.AUSP282.PROD.OUTLOOK.COM --- src/backend/storage/ipc/latch.c | 58 +++-- 1 file changed, 40 insertions(+), 18 deletions(-) diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c index 78c6a89271..c99dbb9f46 100644 --- a/src/backend/storage/ipc/latch.c +++ b/src/backend/storage/ipc/latch.c @@ -72,7 +72,7 @@ #if defined(WAIT_USE_EPOLL) || defined(WAIT_USE_POLL) || \ defined(WAIT_USE_KQUEUE) || defined(WAIT_USE_WIN32) /* don't overwrite manual choice */ -#elif defined(HAVE_SYS_EPOLL_H) && defined(HAVE_SYS_SIGNALFD_H) +#elif defined(HAVE_SYS_EPOLL_H) #define WAIT_USE_EPOLL #elif defined(HAVE_KQUEUE) #define WAIT_USE_KQUEUE @@ -84,6 +84,22 @@ #error "no wait set implementation available" #endif +/* + * By default, we use a self-pipe with poll() and a signalfd with epoll(), if + * available. We avoid signalfd on illumos because it doesn't seem to work + * reliably. For testing the choice can also be manually specified. + */ +#if defined(WAIT_USE_POLL) || defined(WAIT_USE_EPOLL) +#if defined(WAIT_USE_SELF_PIPE) || defined(WAIT_USE_SIGNALFD) +/* don't overwrite manual choice */ +#elif defined(WAIT_USE_EPOLL) && defined(HAVE_SYS_SIGNALFD_H) && \ + !defined(__sun__) +#define WAIT_USE_SIGNALFD +#else +#define WAIT_USE_SELF_PIPE +#endif +#endif + /* typedef in latch.h */ struct WaitEventSet { @@ -146,12 +162,12 @@ static WaitEventSet *LatchWaitSet; static volatile sig_atomic_t waiting = false; #endif -#ifdef WAIT_USE_EPOLL +#ifdef WAIT_USE_SIGNALFD /* On Linux, we'll receive SIGURG via a signalfd file descriptor. 
*/ static int signal_fd = -1; #endif -#if defined(WAIT_USE_POLL) +#ifdef WAIT_USE_SELF_PIPE /* Read and write ends of the self-pipe */ static int selfpipe_readfd = -1; static int selfpipe_writefd = -1; @@ -164,7 +180,7 @@ static void latch_sigurg_handler(SIGNAL_ARGS); static void sendSelfPipeByte(void); #endif -#if defined(WAIT_USE_POLL) || defined(WAIT_USE_EPOLL) +#if defined(WAIT_USE_SELF_PIPE) || defined(WAIT_USE_SIGNALFD) static void drain(void); #endif @@ -190,7 +206,7 @@ static inline int WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout, void InitializeLatchSupport(void) { -#if defined(WAIT_USE_POLL) +#if defined(WAIT_USE_SELF_PIPE) int pipefd[2]; if (IsUnderPostmaster) @@ -264,7 +280,7 @@ InitializeLatchSupport(void) pqsignal(SIGURG, latch_sigurg_handler); #endif -#ifdef WAIT_USE_EPOLL +#ifdef WAIT_USE_SIGNALFD sigset_t signalfd_mask; /* Block SIGURG, because we'll receive it through a signalfd. */ @@ -316,7 +332,7 @@ ShutdownLatchSupport(void) LatchWaitSet = NULL; } -#if defined(WAIT_USE_POLL) +#if defined(WAIT_USE_SELF_PIPE) close(selfpipe_readfd); close(selfpipe_writefd); selfpipe_readfd = -1; @@ -324,7 +340,7 @@ ShutdownLatchSupport(void) selfpipe_owner_pid = InvalidPid; #endif -#if defined(WAIT_USE_EPOLL) +#if defined(WAIT_USE_SIGNALFD) close(signal_fd); signal_fd = -1; #endif @@ -341,9 +357,12 @@ InitLatch(Latch *latch) latch->owner_pid = MyProcPid; latch->is_shared = false; -#if defined(WAIT_USE_POLL) +#if defined(WAIT_USE_SELF_PIPE) /* Assert InitializeLatchSupport has been called in this process */ Assert(selfpipe_readfd >= 0 && selfpipe_owner_pid == MyProcPid); +#elif defined(WAIT_USE_SIGNALFD) + /* Assert InitializeLatchSupport has been called in this process */ + Assert(signal_fd >= 0); #elif defined(WAIT_USE_WIN32) latch->event = CreateEvent(NULL, TRUE, FALSE, NULL); if (latch->event == NULL) @@ -405,9 +424,12 @@ OwnLatch(Latch *latch) /* Sanity checks */
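Since the v2 patch lets the two choices be mixed independently for testing, here is a hedged sketch of what building the four combinations could look like; the configure invocation is illustrative, while the macro names come from the patch itself:

# epoll + signalfd (the default on Linux)
./configure CPPFLAGS='-DWAIT_USE_EPOLL -DWAIT_USE_SIGNALFD'

# epoll + self-pipe (what the patch now selects by default on illumos)
./configure CPPFLAGS='-DWAIT_USE_EPOLL -DWAIT_USE_SELF_PIPE'

# poll + self-pipe (the pre-signalfd fallback)
./configure CPPFLAGS='-DWAIT_USE_POLL -DWAIT_USE_SELF_PIPE'

# poll + signalfd
./configure CPPFLAGS='-DWAIT_USE_POLL -DWAIT_USE_SIGNALFD'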
Re: First draft of the PG 15 release notes
On Fri, May 13, 2022 at 9:18 PM Bruce Momjian wrote: > > On Fri, May 13, 2022 at 08:24:53AM +0530, Amit Kapila wrote: > > On Fri, May 13, 2022 at 6:02 AM Euler Taveira wrote: > > > > > > On Thu, May 12, 2022, at 11:22 AM, Bruce Momjian wrote: > > > > > > On Thu, May 12, 2022 at 11:12:54AM -0300, Euler Taveira wrote: > > > OB> On Thu, May 12, 2022, at 11:03 AM, Bruce Momjian wrote: > > > > > > > > I looked at that but thought that everyone would already assume we > > > > skipped replication of empty transactions, and I didn't see much > > > > impact > > > > for the user, so I didn't include it. > > > > > > > > It certainly has an impact on heavy workloads that replicate tables > > > > with few > > > > modifications. It receives a high traffic of 'begin' and 'commit' > > > > messages that > > > > the previous Postgres versions have to handle (discard). I would > > > > classify it as > > > > a performance improvement for logical replication. Don't have a strong > > > > opinion > > > > if it should be mentioned or not. > > > > > > Oh, so your point is that a transaction that only has SELECT would > > > previously send an empty transaction? I thought this was only for apps > > > that create literal empty transactions, which seem rare. > > > > > > No. It should be a write transaction. If you have a replication setup that > > > publish only table foo (that isn't modified often) and most of your > > > workload does not contain table foo, Postgres sends 'begin' and 'commit' > > > messages to subscriber even if there is no change to replicate. > > > > > > > It reduces network traffic and improves performance by 3-14% on simple > > tests [1] like the one shown by Euler. I see a value in adding this as > > for the workloads where it hits, it seems more than 99% of network > > traffic [2] is due to these empty messages. > > I see the point now --- new item: > > > > > > Prevent logical replication of empty transactions (Ajin Cherian, > Hou Zhijie, Euler Taveira) > > > > Previously, write transactions would send empty transactions to > subscribers if subscribed tables were not modified. > > > Thanks! -- With Regards, Amit Kapila.
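A hedged illustration of the scenario described above, with made-up object names: the publication covers only foo, so a write transaction that touches only bar carries no changes for the subscriber, yet before this change it still produced a BEGIN/COMMIT pair on the wire.

-- publisher
CREATE TABLE foo (c1 int);
CREATE TABLE bar (c1 int);
CREATE PUBLICATION pub_foo FOR TABLE foo;

-- subscriber (connection string elided)
-- CREATE SUBSCRIPTION sub_foo CONNECTION '...' PUBLICATION pub_foo;

-- publisher: a write transaction that never touches the published table
BEGIN;
INSERT INTO bar VALUES (1);
COMMIT;
-- pre-15: an empty transaction is still streamed to the subscriber;
-- with this item it is skipped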
Re: JSON Functions and Operators Docs for v15
Not done yet but here's where I'm at. If I'm on the wrong track or missing things that should be done please let me know. [sqljson-dox-rework.patch] Here are a few errors/typos/improvements. I've added (=copied from the old docs) the CREATE TABLE for the my_films table so that the more complicated json_table examples can be run easily. Erik Rijkers -- Andrew Dunstan EDB: https://www.enterprisedb.com--- doc/src/sgml/func.sgml.orig 2022-05-14 06:32:28.564537299 +0200 +++ doc/src/sgml/func.sgml 2022-05-14 08:10:05.313313154 +0200 @@ -16287,7 +16287,7 @@ jsonb_set_lax('[{"f1":1,"f2":null},2,null,3]', '{0,f1}', null) -[{"f1":null,"f2":null},2,null,3] +[{"f1": null, "f2": null}, 2, null, 3] jsonb_set_lax('[{"f1":99,"f2":null},2]', '{0,f3}', null, true, 'return_target') @@ -17682,7 +17682,7 @@ object keys. -json('{"a" 123, "b":[true,"foo"], "a":"bar"}') +json('{"a":123, "b":[true,"foo"], "a":"bar"}') {"a":123, "b":[true,"foo"], "a":"bar"} @@ -17959,7 +17959,7 @@ json_exists(jsonb '{"a": [1,2,3]}', 'strict $.a[5]' ERROR ON ERROR) -ERROR: Invalid SQL/JSON subscript +ERROR: jsonpath array subscript is out of bounds @@ -17990,11 +17990,11 @@ of the ON EMPTY clause. -json_value('"123.45"', '$' RETURNING float) +json_value('"123.45"'::jsonb, '$' RETURNING float) 123.45 -json_value('"03:04 2015-02-01"', '$.datetime("HH24:MI -MM-DD")' RETURNING date) +json_value('"03:04 2015-02-01"'::jsonb, '$.datetime("HH24:MI -MM-DD")' RETURNING date) 2015-02-01 @@ -18111,7 +18111,7 @@ The input data to query, the JSON path expression defining the query, - and an optional PASSING clause, which can privide data + and an optional PASSING clause, which can provide data values to the path_expression. The result of the input data evaluation is called the row pattern. The row @@ -18409,6 +18409,31 @@ Examples + In these examples the following small table storing some JSON data will be used: + +CREATE TABLE my_films ( js jsonb ); + +INSERT INTO my_films VALUES ( +'{ "favorites" : [ + { "kind" : "comedy", "films" : [ + { "title" : "Bananas", + "director" : "Woody Allen"}, + { "title" : "The Dinner Game", + "director" : "Francis Veber" } ] }, + { "kind" : "horror", "films" : [ + { "title" : "Psycho", + "director" : "Alfred Hitchcock" } ] }, + { "kind" : "thriller", "films" : [ + { "title" : "Vertigo", + "director" : "Alfred Hitchcock" } ] }, + { "kind" : "drama", "films" : [ + { "title" : "Yojimbo", + "director" : "Akira Kurosawa" } ] } + ] }'); + + + + Query the my_films table holding some JSON data about the films and create a view that distributes the film genre, title, and director between separate columns: @@ -18427,7 +18452,7 @@ 1 | comedy | Bananas | Woody Allen 1 | comedy | The Dinner Game | Francis Veber 2 | horror | Psycho | Alfred Hitchcock - 3 | thriller | Vertigo | Hitchcock + 3 | thriller | Vertigo | Alfred Hitchcock 4 | drama| Yojimbo | Akira Kurosawa (5 rows)
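The view definition that produces the result set in the hunk above appears to sit in unchanged context lines of the diff, so it is not visible here. For orientation only, a sketch of what such a JSON_TABLE query against the my_films table added by the patch would presumably look like (view and column names are illustrative, inferred from the shown output):

CREATE VIEW films AS
SELECT jt.* FROM
  my_films,
  JSON_TABLE (js, '$.favorites[*]' COLUMNS (
    id FOR ORDINALITY,
    kind text PATH '$.kind',
    NESTED PATH '$.films[*]' COLUMNS (
      title text PATH '$.title',
      director text PATH '$.director'))) AS jt;

SELECT * FROM films;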
Re: Multi-Master Logical Replication
On Sat, May 14, 2022 at 12:33 AM Bruce Momjian wrote:
> Uh, without these features, what workload would this help with?

To allow replication among multiple nodes when some of the nodes may have pre-existing data. This work plans to provide simple APIs to achieve that. Now, let me try to explain the difficulties users can face with the existing interface.

It is simple to set up replication among various nodes when they don't have any pre-existing data, but even in that case, if the user operates on the same table at multiple nodes, the replication will lead to an infinite loop and won't proceed. The example in email [1] demonstrates that, and the patch in that thread attempts to solve it. I mention that problem because this work will need that patch.

Now, let's take a simple case where two nodes have the same table with some pre-existing data:

Node-1: Table t1 (c1 int) has data 1, 2, 3, 4
Node-2: Table t1 (c1 int) has data 5, 6, 7, 8

If we have to set up replication among the above two nodes using the existing interfaces, it can be very tricky. Say the user performs operations like the below:

Node-1:
#Publication for t1
Create Publication pub1 For Table t1;

Node-2:
#Publication for t1
Create Publication pub1_2 For Table t1;

Node-1:
Create Subscription sub1 Connection '' Publication pub1_2;

Node-2:
Create Subscription sub1_2 Connection '' Publication pub1;

After this the data will be something like this:

Node-1: 1, 2, 3, 4, 5, 6, 7, 8
Node-2: 1, 2, 3, 4, 5, 6, 7, 8, 5, 6, 7, 8

So, you can see that the data on Node-2 (5, 6, 7, 8) is duplicated. If table t1 has a unique key, this will lead to a unique key violation and replication won't proceed. Here, I have assumed that we already have the functionality of the patch in email [1]; otherwise, replication would go into an infinite loop, replicating the above data again and again.

Now, one way to achieve this could be to ask users to stop all operations on both nodes before starting replication between them, take data dumps of the tables they want to replicate from each node and restore them to the other nodes, and then use the above commands to set up replication and allow operations to start again on those nodes.

The other possibility for users could be as below. Assume we have already created the publications as in the above example, and then:

Node-2:
Create Subscription sub1_2 Connection '' Publication pub1;
#Wait for the initial sync of table t1 to finish. Users can ensure that by
#checking 'srsubstate' in pg_subscription_rel.

Node-1:
Begin;
# Disallow truncates from being published and then truncate the table
Alter Publication pub1 Set (publish = 'insert, update, delete');
Truncate t1;
Create Subscription sub1 Connection '' Publication pub1_2;
Alter Publication pub1 Set (publish = 'insert, update, delete, truncate');
Commit;

This becomes more complicated when more than two nodes are involved; see the example provided for the three-node case [2].

Can you think of some other, simpler way to achieve the same? If not, I don't think the current way is ideal, and users won't prefer it either. I am not saying that the APIs proposed in this thread are the only or the best way to achieve the desired purpose, but I think we should do something to allow users to easily set up replication among multiple nodes.

[1] - https://www.postgresql.org/message-id/CALDaNm0gwjY_4HFxvvty01BOT01q_fJLKQ3pWP9%3D9orqubhjcQ%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CALDaNm3aD3nZ0HWXA8V435AGMvORyR5-mq2FzqQdKQ8CPomB5Q%40mail.gmail.com

--
With Regards,
Amit Kapila.
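As a side note, the "wait for the initial sync" step above can be scripted; a hedged sketch of the kind of check meant by looking at 'srsubstate' in pg_subscription_rel, run on the subscribing node:

-- table sync is complete once every relation reports state 'r' (ready)
SELECT srrelid::regclass AS relation, srsubstate
FROM pg_subscription_rel
WHERE srsubstate <> 'r';
-- an empty result means the initial sync of t1 (and any other tables) is done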