Re: Add client connection check during the execution of the query
Hi Thomas! Thanks for working on this patch. I have attached a new version with some typo corrections in the doc entry and removal of redundant `include` entries and trailing whitespace. I also documented the case where a single-query transaction from a disconnected client might eventually be committed upon completion in autocommit mode if it doesn't return any rows (doesn't communicate with the user) before sending the final commit message. This behavior might be unexpected for clients, and hence IMO it's worth noting.

> diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
> index 4c7b1e7bfd..8cf95d09a4 100644
> --- a/src/backend/libpq/pqcomm.c
> +++ b/src/backend/libpq/pqcomm.c
>
> @@ -919,13 +923,13 @@ socket_set_nonblocking(bool nonblocking)
> }
>
> /*
> - * pq_recvbuf - load some bytes into the input buffer
> + * pq_recvbuf_ext - load some bytes into the input buffer
> *
> * returns 0 if OK, EOF if trouble
> *
> */
> static int
> -pq_recvbuf(void)
> +pq_recvbuf_ext(bool nonblocking)
> {
> if (PqRecvPointer > 0)
> {
> @@ -941,8 +945,7 @@ pq_recvbuf(void)
> PqRecvLength = PqRecvPointer = 0;
> }
>
> - /* Ensure that we're in blocking mode */
> - socket_set_nonblocking(false);
> + socket_set_nonblocking(nonblocking);
>
> /* Can fill buffer from PqRecvLength and upwards */
> for (;;)
> @@ -954,6 +957,9 @@ pq_recvbuf(void)
>
> if (r < 0)
> {
> + if (nonblocking && (errno == EAGAIN || errno == EWOULDBLOCK))
> + return 0;
> +
> if (errno == EINTR)
> continue; /* Ok if interrupted */
>
> @@ -981,6 +987,13 @@ pq_recvbuf(void)
> }
> }
>
> +static int
> +pq_recvbuf(void)
> +{
> + return pq_recvbuf_ext(false);
> +}
> +
> +

AFAICS, the above fragment is not directly related to the primary fix.

AFAICS, the following open items keep the current patch from being treated as complete:

* Absence of TAP tests emulating some scenarios of client disconnection. IIRC, you wanted to rewrite the test case posted by Sergey.
* Concerns about the portability of the `pq_check_connection()` implementation. BTW, on Windows, postgres with this patch has failed to build [1].
* Absence of benchmark results showing the lack of a noticeable performance regression after applying a non-zero timeout for checking client liveness.

1. https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.131820

--
Regards, Maksim Milyutin

>From 3ec788ac5e7c47fe135a3618849db179942f4b27 Mon Sep 17 00:00:00 2001
From: Maksim Milyutin
Date: Fri, 26 Mar 2021 10:18:30 +0300
Subject: [PATCH v9] Detect POLLHUP while running queries

Provide a new optional GUC that can be used to check whether the client connection has gone away periodically while running very long queries.
Author: Sergey Cherkashin
Author: Thomas Munro
Reviewed-by: Thomas Munro
Reviewed-by: Tatsuo Ishii
Reviewed-by: Konstantin Knizhnik
Reviewed-by: Zhihong Yu
Reviewed-by: Andres Freund
Reviewed-by: Maksim Milyutin
Reviewed-by: Tom Lane (much earlier version)
Discussion: https://postgr.es/m/77def86b27e41f0efcba411460e929ae%40postgrespro.ru
---
 doc/src/sgml/config.sgml                      | 36 +++
 src/backend/libpq/pqcomm.c                    | 63 +--
 src/backend/tcop/postgres.c                   | 27
 src/backend/utils/init/globals.c              |  1 +
 src/backend/utils/init/postinit.c             | 10 +++
 src/backend/utils/misc/guc.c                  | 10 +++
 src/backend/utils/misc/postgresql.conf.sample |  3 +
 src/include/libpq/libpq.h                     |  2 +
 src/include/miscadmin.h                       |  1 +
 src/include/utils/timeout.h                   |  1 +
 10 files changed, 150 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ddc6d789d8..abec47ada7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -998,6 +998,42 @@ include_dir 'conf.d'
+
+ client_connection_check_interval (integer)
+
+ client_connection_check_interval configuration parameter
+
+
+
+
+Sets the time interval between checks that the client is still
+connected, while running queries. The check is performed by polling the
+socket, which allows long running queries to be aborted immediately.
+
+If this value is specified without units, it is taken as milliseconds.
+The default value is 0, which disables connection
+checks. Without connection checks, the server will detect the loss of
+the conne
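As an aside for readers following the thread: a minimal sketch of how the periodic check is presumably wired up, inferred only from the files listed in the diffstat above. The flag and timeout identifiers here are assumptions, not verbatim patch code.

/* In start_xact_command() (sketch): arm a timeout while a query runs;
 * CLIENT_CONNECTION_CHECK_TIMEOUT is assumed to be the new entry in
 * src/include/utils/timeout.h. */
if (client_connection_check_interval > 0)
	enable_timeout_after(CLIENT_CONNECTION_CHECK_TIMEOUT,
						 client_connection_check_interval);

/* In ProcessInterrupts() (sketch): the timeout handler only sets a
 * flag (CheckClientConnectionPending is an assumed name); the next
 * CHECK_FOR_INTERRUPTS() polls the socket and aborts the query if the
 * client is gone, otherwise it re-arms the timeout. */
if (CheckClientConnectionPending)
{
	CheckClientConnectionPending = false;
	if (!pq_check_connection())
		ClientConnectionLost = true;
	else
		enable_timeout_after(CLIENT_CONNECTION_CHECK_TIMEOUT,
							 client_connection_check_interval);
}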
Re: Add client connection check during the execution of the query
On 12.06.2021 07:24, Thomas Munro wrote:
On Fri, Apr 30, 2021 at 2:23 PM Thomas Munro wrote:
Here's something I wanted to park here to look into for the next cycle: it turns out that kqueue's EV_EOF flag also has the right semantics for this. That leads to the idea of exposing the event via the WaitEventSet API, and would then bring the client_connection_check_interval feature to 6/10 of our OSes, up from 2/10. Maybe Windows' FD_CLOSE event could get us up to 7/10, not sure.

Rebased. Added documentation tweak and a check to reject the GUC on unsupported OSes.

Good work. I have tested your patch on Linux and FreeBSD in three basic cases: killing the client, breaking the cable (via manipulations with the firewall), and silently closing the client connection asynchronously before completion of a previously started query. All cases work fine.

Some comments from me:

 bool
 pq_check_connection(void)
 {
-#if defined(POLLRDHUP)
+	WaitEvent	events[3];

3 looks like a magic constant. We might want to specify a constant for all event groups in FeBeWaitSet.

+	ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetLatchPos, WL_LATCH_SET, NULL);
+	ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetSocketPos, WL_SOCKET_CLOSED, NULL);
+	rc = WaitEventSetWait(FeBeWaitSet, 0, events, lengthof(events), 0);
+	ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetLatchPos, WL_LATCH_SET, MyLatch);

AFAICS, the side effect on (FeBeWaitSet->events[FeBeWaitSetSocketPos]).events from setting the WL_SOCKET_CLOSED value while calling pq_check_connection() doesn't have a negative impact later, does it? That is, all WaitEventSetWait() calls have to set up socket events on their own from scratch.

--
Regards, Maksim Milyutin
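To make the first comment concrete, here is a rough sketch of what I mean. The constant name FeBeWaitSetNEvents is hypothetical, and the result-checking loop abbreviates what the patch actually does; this mirrors the quoted fragment, it is not the patch itself.

/* Hypothetical constant naming the number of event slots in
 * FeBeWaitSet: socket, latch, and postmaster death. */
#define FeBeWaitSetNEvents 3

bool
pq_check_connection(void)
{
	WaitEvent	events[FeBeWaitSetNEvents];
	int			rc;

	ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetLatchPos, WL_LATCH_SET, NULL);
	ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetSocketPos, WL_SOCKET_CLOSED, NULL);
	rc = WaitEventSetWait(FeBeWaitSet, 0, events, lengthof(events), 0);
	ModifyWaitEvent(FeBeWaitSet, FeBeWaitSetLatchPos, WL_LATCH_SET, MyLatch);

	/* Sketch: report the connection as gone if the socket was closed. */
	for (int i = 0; i < rc; i++)
		if (events[i].events & WL_SOCKET_CLOSED)
			return false;

	return true;
}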
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
On 3/16/19 5:32 PM, Robert Haas wrote:
There's only one query ID field available, and you can't use two extensions that care about query ID unless they compute it the same way, and replicating all the code that computes the query ID into each new extension that wants one sucks. I think we should actually bite the bullet and move all of that code into core, and then just let extensions say whether they care about it getting set.

+1. But I think it's enough to integrate the query normalization routine into core and to store generalized query strings (from which the queryId is produced) in shared memory (for example, a hashtable that maps a queryId to the text representation of the generalized query), activating the normalization routine and the filling of the generalized-query table via a dedicated GUC. This would unbind extensions that require a queryId from using pg_stat_statements and make such a way of computing the queryId canonical.

--
Regards, Maksim Milyutin
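A rough sketch of the shared structure this implies, loosely following how pg_stat_statements stores query texts. All names and sizes here are hypothetical, and locking is omitted; this is not a proposed implementation, just an illustration of the hashtable idea.

#include "postgres.h"
#include "storage/shmem.h"
#include "utils/hsearch.h"

/* Hypothetical entry: maps a queryId to the normalized query text,
 * which lives in a separately managed shared text area. */
typedef struct NormalizedQueryEntry
{
	uint64		queryid;		/* hash key: computed query ID */
	Size		query_offset;	/* offset of the text in the text area */
	int			query_len;		/* length of the normalized text */
} NormalizedQueryEntry;

static HTAB *
normalized_queries_init(void)
{
	HASHCTL		info;

	memset(&info, 0, sizeof(info));
	info.keysize = sizeof(uint64);
	info.entrysize = sizeof(NormalizedQueryEntry);

	/* sizes are placeholders; a real patch would derive them from GUCs */
	return ShmemInitHash("normalized query texts",
						 1024, 1024,
						 &info,
						 HASH_ELEM | HASH_BLOBS);
}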
Re: [HACKERS] wrong t_bits alignment in pageinspect
Hi!

02.01.2018 12:33, Andrey Borodin wrote:
On 15 Dec 2017, at 18:53, Maksim Milyutin wrote:
I found a problem in exposing the values of the t_bits field from the heap_page_items function.
Probably, this [0] place contains a similar bug too?
[0] https://github.com/postgres/postgres/blob/master/contrib/pageinspect/heapfuncs.c#L439

Yes, it does. Thanks a lot for the notice. Attached is a new version of the patch with a regression test.

--
Regards, Maksim Milyutin

diff --git a/contrib/pageinspect/expected/page.out b/contrib/pageinspect/expected/page.out
index 8e15947a81..97af6f5bd2 100644
--- a/contrib/pageinspect/expected/page.out
+++ b/contrib/pageinspect/expected/page.out
@@ -92,3 +92,20 @@ create table test_part1 partition of test_partitioned for values from ( 1 ) to (
 select get_raw_page('test_part1', 0); -- get farther and error about empty table
 ERROR: block number 0 is out of range for relation "test_part1"
 drop table test_partitioned;
+-- check null bitmap alignment for table whose the number of attributes is multiple of 8
+create table test8 (f1 int, f2 int, f3 int, f4 int, f5 int, f6 int, f7 int, f8 int);
+insert into test8(f1, f8) values (x'f1'::int, 0);
+select t_bits, t_data from heap_page_items(get_raw_page('test8', 0));
+ t_bits | t_data
+--+
+ 1001 | \xf100
+(1 row)
+
+select tuple_data_split('test8'::regclass, t_data, t_infomask, t_infomask2, t_bits)
+from heap_page_items(get_raw_page('test8', 0));
+ tuple_data_split
+-
+ {"\\xf100",NULL,NULL,NULL,NULL,NULL,NULL,"\\x"}
+(1 row)
+
+drop table test8;
diff --git a/contrib/pageinspect/heapfuncs.c b/contrib/pageinspect/heapfuncs.c
index 088254453e..99f409846c 100644
--- a/contrib/pageinspect/heapfuncs.c
+++ b/contrib/pageinspect/heapfuncs.c
@@ -234,7 +234,7 @@ heap_page_items(PG_FUNCTION_ARGS)
 			int			bits_len;
 
 			bits_len =
-				((tuphdr->t_infomask2 & HEAP_NATTS_MASK) / 8 + 1) * 8;
+				BITMAPLEN(HeapTupleHeaderGetNatts(tuphdr)) * BITS_PER_BYTE;
 			values[11] = CStringGetTextDatum(
 				bits_to_text(tuphdr->t_bits, bits_len));
 		}
@@ -436,12 +436,12 @@ tuple_data_split(PG_FUNCTION_ARGS)
 		int			bits_str_len;
 		int			bits_len;
 
-		bits_len = (t_infomask2 & HEAP_NATTS_MASK) / 8 + 1;
+		bits_len = BITMAPLEN(t_infomask2 & HEAP_NATTS_MASK) * BITS_PER_BYTE;
 		if (!t_bits_str)
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg("argument of t_bits is null, but it is expected to be null and %d character long",
-							bits_len * 8)));
+							bits_len)));
 
 		bits_str_len = strlen(t_bits_str);
 		if ((bits_str_len % 8) != 0)
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg("length of t_bits is not a multiple of eight")));
 
-		if (bits_len * 8 != bits_str_len)
+		if (bits_len != bits_str_len)
 			ereport(ERROR,
 					(errcode(ERRCODE_DATA_CORRUPTED),
 					 errmsg("unexpected length of t_bits %u, expected %d",
-							bits_str_len, bits_len * 8)));
+							bits_str_len, bits_len)));
 
 		/* do the conversion */
 		t_bits = text_to_bits(t_bits_str, bits_str_len);
diff --git a/contrib/pageinspect/sql/page.sql b/contrib/pageinspect/sql/page.sql
index 493ca9b211..16b4571fbf 100644
--- a/contrib/pageinspect/sql/page.sql
+++ b/contrib/pageinspect/sql/page.sql
@@ -41,3 +41,11 @@ select get_raw_page('test_partitioned', 0); -- error about partitioned table
 create table test_part1 partition of test_partitioned for values from ( 1 ) to (100);
 select get_raw_page('test_part1', 0); -- get farther and error about empty table
 drop table test_partitioned;
+
+-- check null bitmap alignment for table whose the number of attributes is multiple of 8
+create table test8 (f1 int, f2 int, f3 i
Re: [HACKERS] PoC: custom signal handler for extensions
On 12.01.2018 18:51, Teodor Sigaev wrote:
In perspective, this mechanism provides the low-level instrument to define remote procedure call on extension side. The simple RPC that defines effective userid on remote backend (remote_effective_user function) is attached for example.
Suppose, it's useful patch. Some notes:

1) AssignCustomProcSignalHandler is unneeded API function, easy replaced by GetCustomProcSignalHandler/ReleaseCustomProcSignal/RegisterCustomProcSignalHandler functions, but in other hand, it isn't looked as widely used feature to reassign custom signal handler.

Agreed. AssignCustomProcSignalHandler is unnecessary. I have also removed GetCustomProcSignalHandler, as I don't see any application for this function.

2) Naming RegisterCustomProcSignalHandler and ReleaseCustomProcSignal is not self-consistent. Other parts suggest pair Register/Unregister or Aquire/Release. Seems, Register/Unregister pair is preferred here.

Thanks for the notice. Fixed.

3) Agree that Assert(reason >= PROCSIG_CUSTOM_1 && reason <= PROCSIG_CUSTOM_N) should be replaced with ereport. By the way, ReleaseCustomProcSignal() is a single place where there isn't a range check of reason arg

Oops, I missed that assert check. I considered the assert checks in user-available routines accepting a procsignal as preconditions for the routine's clients, i.e. a violation of such a check leads to undefined behavior. These checks are not expensive in themselves, therefore it makes sense to replace them with runtime checks via ereport calls.

4) CustomSignalInterrupt() - play with errno and SetLatch() seems unneeded here - there isn't call of handler of custom signal, SetLatch will be called several lines below

I have noticed that even the simple HandleNotifyInterrupt() and HandleParallelMessageInterrupt() routines do at least SetLatch(MyLatch). I think it makes sense to leave this call in our case. Perhaps I'm wrong; I don't entirely understand the intention of SetLatch() in those cases.

5) Using a special memory context for handler call looks questionably, I think that complicated handlers could manage memory itself (and will do, if they need to store something between calls). So, I prefer to remove context.

Fixed. But in this case we have to explicitly reflect in the documentation the ambiguity of the memory context in effect while running a signal handler and the intention to leave memory management to the extension's developer.

6) As I can see, all possible (not only custom) signal handlers could perfectly use this API, or, at least, be store in CustomHandlers array, which could be renamed to InterruptHandlers, for example. Next, custom handler type is possible to make closier to built-in handlers, let its signature extends to void (*ProcSignalHandler_type) (ProcSignalReason reason) as others. It will allow to use single handler to support several reasons.

OK, fixed.

7) Suppose, API allows to use different handlers in different processes for the same reason, it's could be reason of confusion. I suggest to restrict Register/Unregister call only for shared_preload_library, ie only during startup.

Yeah, I added checks on the process_shared_preload_libraries_in_progress flag.

8) I'd like to see an example of usage this API somewhere in contrib in exsting modules. Ideas?

Besides its usage in the pg_query_state extension [1], which provides a way to extract the query state from a running backend, I see the possibility with this patch to implement a statistical sampling-based profiler for plpgsql functions on the extension side.
If we could get a call stack of plpgsql functions on an interruptible backend, then there would be no obstacles to capturing stack frames at some interval over a fixed period of time defined by arguments of an extension's function.

I have attached a new version of the patch and an updated version of the remote_effective_user function implementation that demonstrates the usage of the custom signals API.

1. https://github.com/postgrespro/pg_query_state

--
Regards, Maksim Milyutin

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index b0dd7d1..ae7d007 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -60,12 +60,20 @@ typedef struct
  */
 #define NumProcSignalSlots	(MaxBackends + NUM_AUXPROCTYPES)
 
+#define IsCustomProcSignalReason(reason) \
+	((reason) >= PROCSIG_CUSTOM_1 && (reason) <= PROCSIG_CUSTOM_N)
+
+static bool CustomSignalPendings[NUM_CUSTOM_PROCSIGNALS];
+static ProcSignalHandler_type CustomInterruptHandlers[NUM_CUSTOM_PROCSIGNALS];
+
 static ProcSignalSlot *ProcSignalSlots = NULL;
 static volatile ProcSignalSlot *MyProcSignalSlot = NULL;
 
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
+static void CheckAndSetCustomSignalInterrupts(void);
+
 /*
 * Pro
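For illustration, here is a hedged sketch of how an extension might use the revised API: the handler signature follows note 6, and registration is restricted to preload per note 7. The handler body and error wording are mine, not from the patch.

#include "postgres.h"
#include "fmgr.h"
#include "miscadmin.h"
#include "storage/procsignal.h"

PG_MODULE_MAGIC;

static ProcSignalReason MyReason = INVALID_PROCSIGNAL;

/* Revised signature: a single handler may serve several reasons. */
static void
my_handler(ProcSignalReason reason)
{
	elog(LOG, "custom signal %d received by backend %d",
		 (int) reason, MyProcPid);
}

void
_PG_init(void)
{
	/* Registration is only allowed from shared_preload_libraries. */
	if (!process_shared_preload_libraries_in_progress)
		return;

	MyReason = RegisterCustomProcSignalHandler(my_handler);
	if (MyReason == INVALID_PROCSIGNAL)
		elog(ERROR, "no free custom signal slots");
}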
Re: [HACKERS] PoC: custom signal handler for extensions
Hello!

On 11.01.2018 18:53, Arthur Zakirov wrote:
The relationship between custom signals and assigned handlers (function addresses) is replicated from postmaster to child processes including working backends.
I think this happens only if a custom signal registered during initializing shared_preload_libraries. But from RegisterCustomProcSignalHandler()'s implementation I see that you can register the signal anytime. Am I wrong? If so then custom signal handlers may not work as expected.

Yeah, thanks. I added checks on the process_shared_preload_libraries_in_progress flag.

What is purpose of AssignCustomProcSignalHandler() function? This function can erase a handler set by another extension. For example, extension 1 set handler passing reason PROCSIG_CUSTOM_1, and extension 2 set another handler passing reason PROCSIG_CUSTOM_1 too. Here the handler of the extension 2 will be purged.
+
+ Assert(reason >= PROCSIG_CUSTOM_1 && reason <= PROCSIG_CUSTOM_N);
+
I think it is not good to use asserts within AssignCustomProcSignalHandler() and GetCustomProcSignalHandler(). Because this functions can be executed by an external extension, and it may pass a reason outside this boundary. It will be better if the reason will be checked in runtime.

I was guided by the consideration that assertions define preconditions on a function's input parameter (in our case, the procsignal). Meeting these preconditions has to be ensured by the function's callers. Since these checks are not expensive, it will be good to replace them with ereport calls.

The updated patch is attached in [1].

1. https://www.postgresql.org/message-id/37a48ac6-aa14-4e36-5f08-cf8581fe1382%40gmail.com

--
Regards, Maksim Milyutin
Re: Disallow cancellation of waiting for synchronous replication
On 21.12.2019 00:19, Tom Lane wrote:
There is still a problem when the backend is not canceled, but terminated [2].
Exactly. If you don't have a fix that handles that case, you don't have anything. In fact, you've arguably made things worse, by increasing the temptation to terminate or "kill -9" the nonresponsive session.

I assume that the termination of a backend, which causes termination of the PostgreSQL instance in Andrey's patch proposal, has to be resolved by external HA agents that could intercept such terminations (as the parent process of the postmaster) and make appropriate decisions, e.g., restart the PostgreSQL node in a state closed to external users (via pg_hba.conf manipulation) until all sync replicas have synchronized the changes from the master. The Stolon HA tool implements this strategy [1].

This logic (waiting until all replicas declared in synchronous_standby_names have replicated all WAL from the master) could be implemented inside the PostgreSQL kernel, after the recovery process starts and before the database is opened to users, and this can be done separately later.

Another approach is to implement two-phase commit over the master and sync replicas (as Oracle did in old versions [2]), where the risk of getting locally committed data under instance restarts and query canceling is minimal (after the start of the final commitment phase). But this approach has a latency penalty and the complexity of resolving partial (prepared but not committed) transactions in automatic mode under coordinator (in this case, master node) failure. It would be nice if this approach were later implemented as an option of synchronous commit.

1. https://github.com/sorintlab/stolon/blob/master/doc/syncrepl.md#handling-postgresql-sync-repl-limits-under-such-circumstances
2. https://docs.oracle.com/cd/B28359_01/server.111/b28326/repmaster.htm#i33607

--
Best regards, Maksim Milyutin
Re: Disallow cancellation of waiting for synchronous replication
On 21.12.2019 13:34, Marco Slot wrote:
I do agree with the general sentiment that terminating the connection is preferable over sending a response to the client (except when synchronous replication was already disabled).

But in this case locally committed data becomes visible to new incoming transactions, which is a bad side effect of this issue. Under failover those changes are potentially undone.

Synchronous replication does not guarantee that a committed write is actually on any replica, but it does in general guarantee that a commit has been replicated before sending a response to the client. That's arguably more important because the rest of what the application does might depend on the transaction completing and replicating successfully. I don't know of cases other than cancellation in which a response is sent to the client without replication when synchronous replication is enabled.

Yes, on query cancellation (e.g., by a timeout in the client driver) the client receives a response about the completed transaction (though with a warning, which not all client drivers can handle properly), and the guarantee about a successfully replicated transaction is *violated*.

--
Best regards, Maksim Milyutin
Re: Disallow cancellation of waiting for synchronous replication
On 25.12.2019 14:27, Marco Slot wrote:
On Wed, Dec 25, 2019, 11:28 Maksim Milyutin wrote:
But in this case locally committed data becomes visible to new incoming transactions, which is a bad side effect of this issue.
Your application should be prepared for that in any case. At the point where synchronous replication waits, the commit has already been written to disk on the primary. If postgres restarts while waiting for replication then the write becomes immediately visible regardless of whether it was replicated.

Yes, this write is recovered after the instance starts. For now, I want to delegate this case to an external HA tool around PostgreSQL. It can handle instance stops and perform a switchover to a sync replica, or start the current instance with connections from external users closed until all writes are replicated to sync replicas. Later, arguably, the closing of connections after the recovery process could be implemented inside the kernel.

I don't think throwing a PANIC actually prevents that and if it does it's coincidental.

PANIC brings the instance down and doesn't let clients read locally committed data. The HA tool takes further steps to close access to these data as described above.

That's far from ideal, but fixing it requires a much bigger change to streaming replication. The write should be replicated prior to commit on the primary, but applied after in a way where unapplied writes on the secondary can be overwritten/discarded if it turns out they did not commit on the primary.

Thanks for sharing your opinion about enhancing the synchronous commit protocol. My position is laid out in [1]. I would like to see the positions of other members of the community.

1. https://www.postgresql.org/message-id/f3ffc220-e601-cc43-3784-f9bba66dc382%40gmail.com

--
Best regards, Maksim Milyutin
Re: Disallow cancellation of waiting for synchronous replication
On 25.12.2019 13:45, Andrey Borodin wrote:
On 25 Dec 2019, at 15:28, Maksim Milyutin wrote:
Synchronous replication does not guarantee that a committed write is actually on any replica, but it does in general guarantee that a commit has been replicated before sending a response to the client. That's arguably more important because the rest of what the application does might depend on the transaction completing and replicating successfully. I don't know of cases other than cancellation in which a response is sent to the client without replication when synchronous replication is enabled.
Yes, on query cancellation (e.g., by a timeout in the client driver) the client receives a response about the completed transaction (though with a warning, which not all client drivers can handle properly), and the guarantee about a successfully replicated transaction is *violated*.
We obviously need a design discussion here to address the issue. But the immediate question is should we add this topic to January CF items?

+1 on posting this topic to the January CF.

Andrey, some fixes from me:

1) pulled out the canceling of QueryCancelPending from the internal branch where synchronous_commit_cancelation is set, to avoid dummy iterations with printing the message "canceling the wait for ..."

2) rewrote the errdetail message under query canceling: I hold that in this case we cannot assert that the transaction committed locally, because its changes are not visible as yet, so I propose to write about the locally flushed COMMIT WAL record.

The updated patch is attached.

--
Best regards, Maksim Milyutin

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 5353b6ab0b..844ce36d79 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -80,6 +80,7 @@ bool		DefaultXactDeferrable = false;
 bool		XactDeferrable;
 
 int			synchronous_commit = SYNCHRONOUS_COMMIT_ON;
+bool		synchronous_commit_cancelation = true;
 
 /*
  * When running as a parallel worker, we place only a single
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 7bf7a1b7f8..0a533d2d79 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -250,13 +250,21 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 		 */
 		if (ProcDiePending)
 		{
-			ereport(WARNING,
+			if (synchronous_commit_cancelation)
+			{
+				ereport(WARNING,
+						(errcode(ERRCODE_ADMIN_SHUTDOWN),
+						 errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
+						 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
+				whereToSendOutput = DestNone;
+				SyncRepCancelWait();
+				break;
+			}
+
+			ereport(PANIC,
 					(errcode(ERRCODE_ADMIN_SHUTDOWN),
-					 errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
-					 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
-			whereToSendOutput = DestNone;
-			SyncRepCancelWait();
-			break;
+					 errmsg("canceling the wait for synchronous replication and terminating connection due to administrator command"),
+					 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
 		}
 
 		/*
@@ -268,11 +276,18 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
 		if (QueryCancelPending)
 		{
 			QueryCancelPending = false;
+			if (synchronous_commit_cancelation)
+			{
+				ereport(WARNING,
+						(errmsg("canceling wait for synchronous replication due to user request"),
+						 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
+				SyncRepCancelWait();
+				break;
+			}
+
 			ereport(WARNING,
-					(errmsg("canceling wait for synchronous replication due to user request"),
-					 errdetail("The transaction has already committed locally, but might not have been replicated to the standby.")));
-			SyncRepCancelWait();
-			break;
+					(errmsg("canceling wait for synchronous replication was requested, but cancelation is not allowed"),
+					 errdetail("The COMMIT record has already been flushed to WAL locally and might not have been replicated to the standby. We must wait here.")));
 		}
 
 		/*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 4b4911d5ec..a279faaeb5 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1222,6 +1222,15 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"synchronous_commit_cancelation", PGC_USERSET, WAL_SETTINGS,
+			gettext_noop("Allow canceling the wait for replication of a transaction committed locally."),
+			NULL
+		},
Re: Disallow cancellation of waiting for synchronous replication
On 29.12.2019 00:55, Robert Haas wrote:
On Fri, Dec 20, 2019 at 12:04 AM Andrey Borodin wrote:
Currently, we can have split brain with the combination of following steps:
0. Setup cluster with synchronous replication. Isolate primary from standbys.
1. Issue upsert query INSERT .. ON CONFLICT DO NOTHING
2. CANCEL 1 during wait for synchronous replication
3. Retry 1. Idempotent query will succeed and client have confirmation of written data, while it is not present on any standby.
All that being said, like Tom and Michael, I don't think teaching the backend to ignore cancels is the right approach. We have had innumerable problems over the years that were caused by the backend failing to respond to cancels when we would really have liked it to do so, and users REALLY hate it when you tell them that they have to shut down and restart (with crash recovery) the entire database because of a single stuck backend.

The backend is not stuck in a deadlock here. To smoothly cancel the backend's waiting, it is enough for the client to turn off synchronous replication (change synchronous_standby_names through a server reload) or to switch the synchronous replica to another live one (again by changing synchronous_standby_names). In the first case the client explicitly accepts the existence of local (not replicated) commits on the master.

--
Best regards, Maksim Milyutin
Re: Disallow cancellation of waiting for synchronous replication
On 15.01.2020 01:53, Andres Freund wrote:
On 2020-01-12 16:18:38 +0500, Andrey Borodin wrote:
On 11 Jan 2020, at 7:34, Bruce Momjian wrote:
Actually, it might be worse than that. In my reading of RecordTransactionCommit(), we do this:

write to WAL
flush WAL (durable)
make visible to other backends
replicate
communicate to the client

I think this means we make the transaction commit visible to all backends _before_ we replicate it, and potentially wait until we get a replication reply to return SUCCESS to the client.
No. Data is not visible to other backend when we await sync rep.
Yea, as the relevant comment in RecordTransactionCommit() says:

 * Note that at this stage we have marked clog, but still show as running
 * in the procarray and continue to hold locks.
 */
if (wrote_xlog && markXidCommitted)
	SyncRepWaitForLSN(XactLastRecEnd, true);

But it's worthwhile to emphasize that data at that stage actually *can* be visible on standbys. The fact that the transaction still shows as running via procarray, on the primary, does not influence visibility determinations on the standby.

In the common case, consistent reading in a cluster (even if remote_apply is on) is available (and has to be) only on the master node. For example, if a random load balancer for read-only queries is established above the master and sync replicas (with remote_apply on), it's possible to catch the case where preceding reads return data that is absent in subsequent ones.

Moreover, such visible commits on a sync standby are not durable from the cluster's point of view. For example, if we have two sync standbys, then under failover we can switch the master to a sync standby to which the waiting commit was not replicated, even though it was applied (and visible) on the other standby. This switchover is completely safe because the client hasn't received an acknowledgment of the commit request and the transaction is in an indeterminate state: it can end up either committed or aborted depending on which standby is promoted.

--
Best regards, Maksim Milyutin
Re: [HACKERS] PoC: custom signal handler for extensions
Hello David,

On 05.03.2018 18:50, David Steele wrote:
Hello Maksim,
On 1/27/18 2:19 PM, Arthur Zakirov wrote:
Is there actual need in UnregisterCustomProcSignal() within _PG_init()? An extension registers a handler and then unregister it doing nothing. It seems useless. Also process_shared_preload_libraries_in_progress within _PG_fini() will be false I think. _PG_fini() won't be called though, because implementation of internal_unload_library() is disabled.
Actually, is there need in UnregisterCustomProcSignal() at all? It unregisters a handler only in current backend, for actual unergistering we need to call it everywhere, if I'm not mistaken.
This patch has been in Waiting on Author state for almost three weeks. Have you had a chance to consider Arthur's suggestions?

Yes, I want to rework my patch to enable the setup of custom signals on a working backend without preload initialization.

Do you know when you'll have an updated patch available?

I want to resume the work on this patch for the next release, 12. Sorry, for now I cannot keep up with the current release.

--
Regards, Maksim Milyutin
Re: Synchronous commit behavior during network outage
Hi!

This is a known issue with synchronous replication [1]. You might inject some dummy modification into an otherwise unmodifying operation to overcome the negative sides of such partial committing, without patching the source code.

On 20.04.2021 19:23, Aleksander Alekseev wrote:
Although it's unlikely that someone implemented an application which deals with important data and "pressed Ctr+C" as it's done in psql.

Some client libraries have a feature to cancel a session after the expiration of a client-specified deadline, which has an effect similar to "Ctrl+C" in psql [2]. Hence, this case might arise quite often when an application interacts with a database.

On Mon, Apr 19, 2021 at 10:13 PM Ondřej Žižka wrote:
From the synchronous_commit=remote_write level and "higher", I would expect, that when the remote application (doesn't matter if flush, write or apply) would not be applied I would not receive a confirmation about the commit (even with a warning). Something like, if there is no commit from sync replica, there is no commit on primary and if someone performs the steps above, the whole transaction will not send a confirmation.

The warning has to be taken into account here, and the performed commit must not be treated as *successful*.

1. https://www.postgresql.org/message-id/C1F7905E-5DB2-497D-ABCC-E14D4DEE506C%40yandex-team.ru
2. https://www.postgresql.org/message-id/CANtu0ogbu%2By6Py963p-zKJ535b8zm5AOq7zkX7wW-tryPYi1DA%40mail.gmail.com

--
Regards, Maksim Milyutin
Re: Synchronous commit behavior during network outage
On 20.04.2021 19:38, Tomas Vondra wrote:
On 4/20/21 6:23 PM, Aleksander Alekseev wrote:
Hi Ondřej,
Thanks for the report. It seems to be a clear violation of what is promised in the docs. Although it's unlikely that someone implemented an application which deals with important data and "pressed Ctr+C" as it's done in psql. So this might be not such a critical issue after all. BTW what version of PostgreSQL are you using?
Which part of the docs does this contradict?

I think Aleksander refers to the following phrase in the docs: "The guarantee we offer is that the application will not receive explicit acknowledgment of the successful commit of a transaction until the WAL data is known to be safely received by all the synchronous standbys." [1]

And IMO the confusion here concerns the notion of a `successful commit`. Does a warning attached to the received commit message make it not *successful*? I think we have to explicitly mention the cases of query cancellation and session termination in the docs to avoid ambiguity in understanding the phrase above.

1. https://www.postgresql.org/docs/current/warm-standby.html#SYNCHRONOUS-REPLICATION-HA

--
Regards, Maksim Milyutin
Re: Extension Enhancement: Buffer Invalidation in pg_buffercache
On 07.04.2024 02:07, Thomas Munro wrote:
So this is the version I plan to commit.

+bool
+EvictUnpinnedBuffer(Buffer buf)
+{
...
+	/* This will return false if it becomes dirty or someone else pins it. */
+	result = InvalidateVictimBuffer(desc);
+
+	UnpinBuffer(desc);
+
+	return result;
+}

Hi, Thomas!

Shouldn't we call the StrategyFreeBuffer() function at the end to add the target buffer to the freelist and not lose it after invalidation?

--
Best regards, Maksim Milyutin
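Concretely, the suggestion amounts to something like the following sketch of the tail of EvictUnpinnedBuffer() (untested; the StrategyFreeBuffer() call is the only proposed addition):

	/* This will return false if it becomes dirty or someone else pins it. */
	result = InvalidateVictimBuffer(desc);

	UnpinBuffer(desc);

	if (result)
		StrategyFreeBuffer(desc);	/* proposed: put the invalidated buffer
									 * back on the freelist for quick reuse */

	return result;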
Re: Extension Enhancement: Buffer Invalidation in pg_buffercache
On 14.04.2024 21:16, Maksim Milyutin wrote:
On 07.04.2024 02:07, Thomas Munro wrote:
So this is the version I plan to commit.

+bool
+EvictUnpinnedBuffer(Buffer buf)
+{
...
+	/* This will return false if it becomes dirty or someone else pins it. */
+	result = InvalidateVictimBuffer(desc);
+
+	UnpinBuffer(desc);
+
+	return result;
+}

Hi, Thomas!
Shouldn't we call the StrategyFreeBuffer() function at the end to add the target buffer to the freelist and not lose it after invalidation?

Hello everyone!

Please take a look at this issue: the current implementation of EvictUnpinnedBuffer() IMO is erroneous - evicted buffers are lost permanently and will not be reused again.

--
Best regards, Maksim Milyutin
Re: Extension Enhancement: Buffer Invalidation in pg_buffercache
On 29.04.2024 23:59, Thomas Munro wrote:
On Tue, Apr 30, 2024 at 7:17 AM Thomas Munro wrote:
On Tue, Apr 30, 2024 at 6:47 AM Maksim Milyutin wrote:
Shouldn't we call the StrategyFreeBuffer() function at the end to add the target buffer to the freelist and not lose it after invalidation?
Please take a look at this issue: the current implementation of EvictUnpinnedBuffer() IMO is erroneous - evicted buffers are lost permanently and will not be reused again.
I don't think that's true: it is not lost permanently, it'll be found by the regular clock hand. Perhaps it should be put on the freelist so it can be found again quickly, but I'm not sure that's a bug, is it?

Yeah, you are right. Thanks for the clarification. The CLOCK algorithm will reuse it eventually, but having the evicted cleared buffer on the freelist might greatly reduce buffer allocation time when all other buffers are in use.

--
Best regards, Maksim Milyutin
Re: O_DIRECT for relations and SLRUs (Prototype)
On 1/15/19 11:28 AM, Michael Paquier wrote:
On Tue, Jan 15, 2019 at 11:19:48AM +0500, Andrey Borodin wrote:
Is it possible to avoid those memcpy's by aligning available buffers instead? I couldn't understand this from the patch and this thread.
Sure, it had better do that. That's just a lazy implementation.

Hi! Could you specify all the cases where buffers are not aligned to BLCKSZ? AFAICS, shared and temp buffers are aligned. Which ones are not?

--
Regards, Maksim Milyutin
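For reference, a minimal standalone illustration of allocating a BLCKSZ-aligned buffer, which is what O_DIRECT typically requires and what avoids the extra memcpy; this is generic POSIX code, not taken from the prototype patch:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define BLCKSZ 8192				/* PostgreSQL's default block size */

int
main(void)
{
	void	   *buf;

	/* posix_memalign returns memory whose address is a multiple of
	 * the requested alignment. */
	if (posix_memalign(&buf, BLCKSZ, BLCKSZ) != 0)
	{
		perror("posix_memalign");
		return 1;
	}
	printf("buffer at %p, aligned to BLCKSZ: %s\n", buf,
		   ((uintptr_t) buf % BLCKSZ) == 0 ? "yes" : "no");
	free(buf);
	return 0;
}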
Re: Slow planning time for simple query
On 09.06.2018 22:49, Tom Lane wrote:
Maksim Milyutin writes:
On hot standby I faced a similar problem. ... is planned 4.940 ms on master and *254.741* ms on standby.
(I wonder though why, if you executed the same query on the master, its setting of the index-entry-is-dead bits didn't propagate to the standby.)

I have verified the number of dead item pointers (through the pageinspect extension) in the first leaf page of the index participating in the query ('main.message_instance_pkey') on the master and standby nodes and have noticed a big difference.

SELECT * FROM monitoring.bt_page_stats('main.message_instance_pkey', 3705);

On master:

 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
  3705 | l    |          1 |         58 |            24 |      8192 |      6496 |         0 |      3719 |    0 |         65

On standby:

 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
  3705 | l    |         59 |          0 |            24 |      8192 |      6496 |         0 |      3719 |    0 |          1

The vacuum routine improves the situation. Can it be that I have incorrectly configured WAL logging or replication?

I wonder if we should extend the "SnapshotNonVacuumable" logic introduced in commit 3ca930fc3 so that in hot standby, *all* index entries are deemed non vacuumable. This would essentially get rid of long standby planning times in this sort of scenario by instead accepting worse (possibly much worse) planner range estimates. I'm unsure if that's a good tradeoff or not.

I applied the patch introduced in this commit to the test standby (not the master; I don't know if this is correct) and haven't noticed any differences.

--
Regards, Maksim Milyutin
Re: Slow planning time for simple query
13.06.2018 12:40, Maksim Milyutin wrote:
On 09.06.2018 22:49, Tom Lane wrote:
Maksim Milyutin writes:
On hot standby I faced a similar problem. ... is planned 4.940 ms on master and *254.741* ms on standby.
(I wonder though why, if you executed the same query on the master, its setting of the index-entry-is-dead bits didn't propagate to the standby.)

I have verified the number of dead item pointers (through the pageinspect extension) in the first leaf page of the index participating in the query ('main.message_instance_pkey') on the master and standby nodes and have noticed a big difference.

SELECT * FROM monitoring.bt_page_stats('main.message_instance_pkey', 3705);

On master:

 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
  3705 | l    |          1 |         58 |            24 |      8192 |      6496 |         0 |      3719 |    0 |         65

On standby:

 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
  3705 | l    |         59 |          0 |            24 |      8192 |      6496 |         0 |      3719 |    0 |          1

At this point I want to highlight the issue that the changes in *lp_flags* bits (namely, marking items as dead) for index item pointers don't propagate from master to replica in my case. As a consequence, on the standby I have live index items, most of which are marked as dead on the master. And at the planning stage my queries are forced to descend to heap pages during *get_actual_variable_range* execution, which considerably slows down planning.

Is this a bug, a restriction of the implementation, or a misconfiguration of WAL/replication?

--
Regards, Maksim Milyutin
Re: [HACKERS] Add support for tuple routing to foreign partitions
Hi, Fujita-san!

On 24.11.2017 16:03, Etsuro Fujita wrote:
(2017/10/27 20:00), Etsuro Fujita wrote:
Please find attached an updated version of the patch.
Amit rebased this patch and sent me the rebased version off list. Thanks for that, Amit! One thing I noticed I overlooked is about this change I added to make_modifytable to make a valid-looking plan node to pass to PlanForeignModify to plan remote insert to each foreign partition:

+ /*
+  * The column list of the child might have a different column
+  * order and/or a different set of dropped columns than that
+  * of its parent, so adjust the subplan's tlist as well.
+  */
+ tlist = preprocess_targetlist(root,
+                               child_parse->targetList);

This would be needed because the FDW might reference the tlist. Since preprocess_targetlist references root->parse, it's needed to replace that with the child query before calling that function, but I forgot to do that. So I fixed that. Attached is an updated version of the patch.

Your patch no longer applies on master. Please rebase it.

--
Regards, Maksim Milyutin
[HACKERS] wrong t_bits alignment in pageinspect
Hi!

I found a problem in exposing the values of the t_bits field from the heap_page_items function. When the number of attributes in a table is a multiple of eight, the t_bits column shows double the number of bits, with data fields included in it. For example:

# create table tbl(f1 int, f2 int, f3 int, f4 int, f5 int, f6 int, f7 int, f8 int);
# insert into tbl(f1, f8) values (x'f1'::int, 0);
# select t_bits, t_data from heap_page_items(get_raw_page('tbl', 0));
 t_bits   | t_data
----------+--------
 10011000 | \xf100

I suppose the prefix 1001 corresponds to the real value of t_bits, and the rest, 1000, to the lower byte of the f1 field of tbl.

The attached patch fixes this issue.

--
Regards, Maksim Milyutin

diff --git a/contrib/pageinspect/heapfuncs.c b/contrib/pageinspect/heapfuncs.c
index ca4d3f5..fe53a1a 100644
--- a/contrib/pageinspect/heapfuncs.c
+++ b/contrib/pageinspect/heapfuncs.c
@@ -234,7 +234,7 @@ heap_page_items(PG_FUNCTION_ARGS)
 			int			bits_len;
 
 			bits_len =
-				((tuphdr->t_infomask2 & HEAP_NATTS_MASK) / 8 + 1) * 8;
+				TYPEALIGN(8, (tuphdr->t_infomask2 & HEAP_NATTS_MASK));
 			values[11] = CStringGetTextDatum(
 				bits_to_text(tuphdr->t_bits, bits_len));
 		}
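The arithmetic behind the fix can be checked with a small standalone program; TYPEALIGN below is copied from src/include/c.h, and the "old" formula is the one the patch removes:

#include <stdint.h>
#include <stdio.h>

/* same definition as PostgreSQL's TYPEALIGN in src/include/c.h */
#define TYPEALIGN(ALIGNVAL,LEN) \
	(((uintptr_t) (LEN) + ((ALIGNVAL) - 1)) & ~((uintptr_t) ((ALIGNVAL) - 1)))

int
main(void)
{
	for (int natts = 1; natts <= 17; natts++)
	{
		int		old_bits = (natts / 8 + 1) * 8;			/* removed formula */
		int		new_bits = (int) TYPEALIGN(8, natts);	/* patched formula */

		printf("natts=%2d old=%2d new=%2d%s\n", natts, old_bits, new_bits,
			   old_bits != new_bits ? "  <- old formula overshoots" : "");
	}
	return 0;
}

For natts = 8 the old formula yields 16 bits while the correct length is 8, which is exactly the doubled t_bits observed above.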
[HACKERS] PoC: custom signal handler for extensions
Hi, hackers!

I want to propose a patch that allows defining custom signals and their handlers on the extension side. It is based on the ProcSignal module; namely, it defines a fixed set (the number is specified by a constant) of custom signals that can be reserved at the postgres initialization stage (in the _PG_init function of a shared preloaded module) through specific interface functions. The functions that act as custom signal handlers are defined in the extension. The relationship between custom signals and assigned handlers (function addresses) is replicated from postmaster to child processes, including working backends. Using these signals, we are able to call a specific handler (via the SendProcSignal function) on a remote backend; the handler actually comes into action at a CHECK_FOR_INTERRUPTS point.

Looking ahead, this mechanism provides a low-level instrument for defining remote procedure calls on the extension side. A simple RPC that determines the effective userid on a remote backend (the remote_effective_user function) is attached as an example.

Comments and criticism welcome!

--
Regards, Maksim Milyutin

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index b9302ac630..75d4ea72b7 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -27,6 +27,7 @@
 #include "storage/shmem.h"
 #include "storage/sinval.h"
 #include "tcop/tcopprot.h"
+#include "utils/memutils.h"
 
 /*
@@ -60,12 +61,17 @@ typedef struct
  */
 #define NumProcSignalSlots	(MaxBackends + NUM_AUXPROCTYPES)
 
+static bool CustomSignalPendings[NUM_CUSTOM_PROCSIGNALS];
+static ProcSignalHandler_type CustomHandlers[NUM_CUSTOM_PROCSIGNALS];
+
 static ProcSignalSlot *ProcSignalSlots = NULL;
 static volatile ProcSignalSlot *MyProcSignalSlot = NULL;
 
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
+static void CustomSignalInterrupt(ProcSignalReason reason);
+
 /*
  * ProcSignalShmemSize
  *		Compute space needed for procsignal's shared memory
@@ -166,6 +172,70 @@ CleanupProcSignalState(int status, Datum arg)
 }
 
 /*
+ * RegisterCustomProcSignalHandler
+ *		Assign specific handler of custom process signal with new ProcSignalReason key
+ *
+ * Return INVALID_PROCSIGNAL if all custom signals have been assigned.
+ */
+ProcSignalReason
+RegisterCustomProcSignalHandler(ProcSignalHandler_type handler)
+{
+	ProcSignalReason reason;
+
+	/* iterate through custom signal keys to find free spot */
+	for (reason = PROCSIG_CUSTOM_1; reason <= PROCSIG_CUSTOM_N; reason++)
+		if (!CustomHandlers[reason - PROCSIG_CUSTOM_1])
+		{
+			CustomHandlers[reason - PROCSIG_CUSTOM_1] = handler;
+			return reason;
+		}
+	return INVALID_PROCSIGNAL;
+}
+
+/*
+ * ReleaseCustomProcSignal
+ *		Release slot of specific custom signal
+ */
+void
+ReleaseCustomProcSignal(ProcSignalReason reason)
+{
+	CustomHandlers[reason - PROCSIG_CUSTOM_1] = NULL;
+}
+
+/*
+ * AssignCustomProcSignalHandler
+ *		Assign handler of custom process signal with specific ProcSignalReason key
+ *
+ * Return old ProcSignal handler.
+ * Assume incoming reason is one of custom ProcSignals.
+ */
+ProcSignalHandler_type
+AssignCustomProcSignalHandler(ProcSignalReason reason, ProcSignalHandler_type handler)
+{
+	ProcSignalHandler_type old;
+
+	Assert(reason >= PROCSIG_CUSTOM_1 && reason <= PROCSIG_CUSTOM_N);
+
+	old = CustomHandlers[reason - PROCSIG_CUSTOM_1];
+	CustomHandlers[reason - PROCSIG_CUSTOM_1] = handler;
+	return old;
+}
+
+/*
+ * GetCustomProcSignalHandler
+ *		Get handler of custom process signal
+ *
+ * Assume incoming reason is one of custom ProcSignals.
+ */
+ProcSignalHandler_type
+GetCustomProcSignalHandler(ProcSignalReason reason)
+{
+	Assert(reason >= PROCSIG_CUSTOM_1 && reason <= PROCSIG_CUSTOM_N);
+
+	return CustomHandlers[reason - PROCSIG_CUSTOM_1];
+}
+
+/*
  * SendProcSignal
  *		Send a signal to a Postgres process
  *
@@ -260,7 +330,8 @@ CheckProcSignal(ProcSignalReason reason)
 void
 procsignal_sigusr1_handler(SIGNAL_ARGS)
 {
-	int			save_errno = errno;
+	int			save_errno = errno;
+	ProcSignalReason reason;
 
 	if (CheckProcSignal(PROCSIG_CATCHUP_INTERRUPT))
 		HandleCatchupInterrupt();
@@ -292,9 +363,84 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
 		RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
 
+	for (reason = PROCSIG_CUSTOM_1; reason <= PROCSIG_CUSTOM_N; reason++)
+		if (CheckProcSignal(reason))
+			CustomSignalInterrupt(reason);
+
 	SetLatch(MyLatch);
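As a usage illustration of the send side (a sketch, not part of the patch): a SQL-callable wrapper that signals another backend, whose registered handler then runs at the target's next CHECK_FOR_INTERRUPTS() point. Here my_reason is assumed to be the value obtained from RegisterCustomProcSignalHandler() in _PG_init.

#include "postgres.h"
#include "fmgr.h"
#include "storage/procsignal.h"

extern ProcSignalReason my_reason;	/* assumed: set during _PG_init */

PG_FUNCTION_INFO_V1(ping_backend);

Datum
ping_backend(PG_FUNCTION_ARGS)
{
	pid_t		pid = (pid_t) PG_GETARG_INT32(0);

	/* InvalidBackendId makes SendProcSignal look up the slot by PID */
	if (SendProcSignal(pid, my_reason, InvalidBackendId) != 0)
		ereport(ERROR,
				(errmsg("could not send custom signal to backend with PID %d",
						(int) pid)));

	PG_RETURN_VOID();
}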
Re: Using ProcSignal to get memory context stats from a running backend
On 19.12.2017 16:54, Pavel Stehule wrote:
Hi

2017-12-19 14:44 GMT+01:00 Craig Ringer:
On 18 December 2017 at 10:05, Robert Haas wrote:
On Thu, Dec 14, 2017 at 9:34 PM, Craig Ringer wrote:
> On 15 December 2017 at 09:24, Greg Stark wrote:
>> Another simpler option would be to open up a new file in the log
>> directory
>
> ... if we have one.
>
> We might be logging to syslog, or whatever else.
>
> I'd rather keep it simple(ish).

+1. I still think just printing it to the log is fine.

Here we go. Implemented pretty much as outlined above. A new pg_diag_backend(pid) function sends a new ProcSignalReason PROCSIG_DIAG_REQUEST. It's handled by CHECK_FOR_INTERRUPTS() and logs MemoryContextStats() output to elog(...).

I didn't want to mess with the MemoryContextMethods and expose a printf-wrapper style typedef in memnodes.h, so I went with a hook global. It's a diagnostic routine so I don't think that's going to be a great bother. By default it's set to write_stderr. That just writes to vfprintf on unix so the outcome's the same as our direct use of fprintf now. On Windows, though, using write_stderr will make Pg attempt to write memory context dumps to the eventlog with ERROR level if running as a service with no console. That means we vsnprintf the string into a static buffer first. I'm not 100% convinced of the wisdom of that because it could flood the event log, which is usually size limited by # of events and recycled. It'd be better to write one event than write one per memory context line, but it's not safe to do that when OOM. I lean toward this being a win: at least Windows users should have some hope of finding out why Pg OOM'd, which currently they don't when it runs as a service. If we don't, we should look at using SetStdHandle to write stderr to a secondary logfile instead. I'm happy to just add a trivial vfprintf wrapper so we preserve exactly the same behaviour instead, but I figured I'd start with reusing write_stderr.

I'd really like to preserve the writing to elog(...) not fprintf, because on some systems simply finding where stderr is written can be a challenge, if it's not going to /dev/null via some detached session. Is it in journald? In some separate log? Being captured by the logging collector (and probably silently eaten)? Who knows! Using elog(...) provides a log_line_prefix and other useful info to associate the dump with the process of interest and what it's doing at the time. Also, an astonishing number of deployed systems I've worked with actually don't put the pid or anything useful in log_line_prefix to make grouping up multi-line output practical. Which is insane. But it's only PGC_SIGHUP so fixing it is easy enough.

sorry for small offtopic. Can be used this mechanism for log of executed plan or full query?

This idea (but without logging) is implemented in the work on the pg_progress command proposed by Remi Colinet [1] and in the pg_query_state extension [2].

1. https://www.postgresql.org/message-id/CADdR5nxQUSh5kCm9MKmNga8%2Bc1JLxLHDzLhAyXpfo9-Wmc6s5g%40mail.gmail.com
2. https://github.com/postgrespro/pg_query_state

--
Regards, Maksim Milyutin
Re: Using ProcSignal to get memory context stats from a running backend
On 22.12.2017 16:56, Craig Ringer wrote:
On 22 December 2017 at 20:50, Maksim Milyutin wrote:
On 19.12.2017 16:54, Pavel Stehule wrote:
sorry for small offtopic. Can be used this mechanism for log of executed plan or full query?

That's a really good idea. I'd love to be able to pg_explain_backend(...)

I left the mechanism as a generic diagnostic signal exactly so that we could add other things we wanted to be able to ask backends. I think a follow-on patch that adds support for dumping explain-format plans would be great, if it's practical to do that while a query's already running.

Noticing the interest in calling some routines on a remote backend through signals, in a parallel thread [1] I have proposed the possibility of defining user-defined signal handlers in extensions. The patch there is taken from the pg_query_state module.

1. https://www.postgresql.org/message-id/3f905f10-cf7a-d4e0-64a1-7fd9b8351592%40gmail.com

--
Regards, Maksim Milyutin
Re: PoC: custom signal handler for extensions
23.12.17 12:58, legrand legrand wrote:
+1 if this permits to use extension pg_query_state <https://github.com/postgrespro/pg_query_state>, that would be great !

Yes, the attached patch is the single critical point. It frees pg_query_state from the need to patch the postgres core.

--
Regards, Maksim Milyutin
Re: Using ProcSignal to get memory context stats from a running backend
On 27.12.2017 10:44, Craig Ringer wrote:
On 22 December 2017 at 23:19, Maksim Milyutin wrote:
Noticing the interest in calling some routines on a remote backend through signals, in a parallel thread [1] I have proposed the possibility of defining user-defined signal handlers in extensions. The patch there is taken from the pg_query_state module.
1. https://www.postgresql.org/message-id/3f905f10-cf7a-d4e0-64a1-7fd9b8351592%40gmail.com

Haven't read the patch, but the idea seems useful if a bit hairy in practice. It'd be done on a "use at your own risk, and if it breaks you get to keep the pieces" basis, where no guarantees are made by Pg about how and when the function is called so it must be utterly defensive. The challenge there being that you can't always learn enough about the state of the system without doing things that might break it, so lots of things you could do would be fragile.

Yes, I agree with your thesis that the state of the system has to be checked/rechecked before starting to manipulate it in a signal handler. And the responsibility is on the extension developer to provide the necessary level of defensiveness. Perhaps it makes sense to add a try-catch block around the signal handler to intercept potential errors from it.

--
Regards, Maksim Milyutin
Hint to set owner for tablespace directory
Hi!

I have noticed that novice users get stuck trying to create a tablespace over a directory whose owner is not the postgres system user. They observe the message "could not set permissions on directory ...: permission denied".

I want to add a patch that prints a hint to set the required owner for the tablespace directory if this is the cause of the problem (*errno == EPERM* after calling *chmod*).

--
Regards, Maksim Milyutin

diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index f7e9160..a8733d4 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -585,10 +585,16 @@ create_tablespace_directories(const char *location, const Oid tablespaceoid)
 				 InRecovery ? errhint("Create this directory for the tablespace before "
 									  "restarting the server.") : 0));
 		else
+		{
+			bool		wrong_owner = (errno == EPERM);
+
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not set permissions on directory \"%s\": %m",
-							location)));
+							location),
+					 wrong_owner ? errhint("Install the PostgreSQL system user as "
+										   "the owner of this directory.") : 0));
+		}
 	}
 
 	if (InRecovery)
Re: Hint to set owner for tablespace directory
On 08/24/2018 05:18 AM, Michael Paquier wrote:
On Thu, Aug 23, 2018 at 02:24:25PM +0300, Maksim Milyutin wrote:
I want to add a patch that prints a hint to set the required owner for the tablespace directory if this is the cause of the problem (*errno == EPERM* after calling *chmod*).
Please do not forget to add this patch to the next commit fest.

Thanks, done.

--
Regards, Maksim Milyutin
Re: Hint to set owner for tablespace directory
On 08/31/2018 11:55 AM, Rafia Sabih wrote:
Adding to that this patch needs a rebase. And, please don't forget to run 'git diff --check', as it has some white-space issues.

Hmm, that's odd. The provided patch applies cleanly on the current master HEAD (2e39f69), and *git diff --check* doesn't print any warnings.

--
Regards, Maksim Milyutin
Re: Hint to set owner for tablespace directory
On 08/31/2018 01:12 PM, Rafia Sabih wrote:
On Fri, Aug 31, 2018 at 3:30 PM, Maksim Milyutin wrote:
On 08/31/2018 11:55 AM, Rafia Sabih wrote:
Adding to that this patch needs a rebase. And, please don't forget to run 'git diff --check', as it has some white-space issues.

git apply /home/edb/Desktop/patches/others/owner_tbsp_hint.patch
/home/edb/Desktop/patches/others/owner_tbsp_hint.patch:9: trailing whitespace.
	{
/home/edb/Desktop/patches/others/owner_tbsp_hint.patch:10: trailing whitespace.
	bool wrong_owner = (errno == EPERM);

This is the hex+ASCII display (from *hexdump -C owner_tbsp_hint.patch* output) of the code fragment above:

01a0 0a 20 09 09 65 6c 73 65 0a 2b 09 09 7b 0a 2b 09 |. ..else.+..{.+.|
01b0 09 09 62 6f 6f 6c 20 77 72 6f 6e 67 5f 6f 77 6e |..bool wrong_own|
01c0 65 72 20 3d 20 28 65 72 72 6e 6f 20 3d 3d 20 45 |er = (errno == E|
01d0 50 45 52 4d 29 3b 0a 2b 0a 20 09 09 09 65 72 65 |PERM);.+. ...ere|

The problem lines don't have trailing whitespace, only line feed (0x0a) symbols.

This is what I got. Please correct me if I am missing anything. I am using centos 7, just an FYI. My git version is 2.7.4.

--
Regards, Maksim Milyutin
Re: Hint to set owner for tablespace directory
30.08.2018 19:52, Peter Eisentraut wrote:
On 23/08/2018 13:24, Maksim Milyutin wrote:
I have noticed that novice users get stuck trying to create a tablespace over a directory whose owner is not the postgres system user. They observe the message "could not set permissions on directory ...: permission denied". I want to add a patch that prints a hint to set the required owner for the tablespace directory if this is the cause of the problem (*errno == EPERM* after calling *chmod*).
I think the hint is backwards. When you don't have permission to chmod the tablespace directory, you probably want to fix the permissions on the tablespace directory or its parent.

According to the man page of chmod(2), when permissions on parent directories are insufficient, errno takes the EACCES value. However, the Linux man page of chmod(2) mentions that errno is EPERM when the process is not privileged (does not have the CAP_FOWNER capability).

But the hint says to run the database server as a different user. That's a bit unusual.

In this case I propose to:

- replace my initial hint message with a guess at what to do if errno == EPERM, something like "You might need to install the PostgreSQL system user as the owner of this directory";
- add another hint if errno == EACCES: "Fix permissions on the parent directories".

Any thoughts?

--
Regards, Maksim Milyutin
Re: Hint to set owner for tablespace directory
On 08/31/2018 04:59 PM, Tom Lane wrote:
> Maksim Milyutin writes:
>> 30.08.2018 19:52, Peter Eisentraut wrote:
>>> I think the hint is backwards. When you don't have permission to
>>> chmod the tablespace directory, you probably want to fix the
>>> permissions on the tablespace directory or its parent.
>> In this case I propose to:
>> - replace my initial hint message with a guess at what to do when
>> errno == EPERM, something like "You might need to install the
>> PostgreSQL system user as the owner of this directory";
>> - add another hint for errno == EACCES: "Fix permissions on the
>> parent directories".
>
> I agree with what I think Peter is saying: the hint should just
> recommend fixing permissions on the directory, for either errno code.
> The more detail you try to give, the more likely the hint will be
> wrong ... especially when you're dealing with errno codes that aren't
> all that well standardized, as these aren't.

OK. However, I think it would be helpful to keep the mention of setting the necessary owner for the tablespace directory. My final version of the hint is "You might need to fix permissions on this directory or its parents or install the PostgreSQL system user as the owner of this directory." An updated version of the patch is attached.

-- 
Regards, Maksim Milyutin

diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index f7e9160..1478462 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -585,10 +585,15 @@ create_tablespace_directories(const char *location, const Oid tablespaceoid)
 						 InRecovery ? errhint("Create this directory for the tablespace before "
 											  "restarting the server.") : 0));
 		else
+		{
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not set permissions on directory \"%s\": %m",
-							location)));
+							location),
+					 errhint("You might need to fix permissions on this "
+							 "directory or its parents or install the PostgreSQL "
+							 "system user as the owner of this directory.")));
+		}
 	}
 
 	if (InRecovery)
Re: Hint to set owner for tablespace directory
On 09/07/2018 06:05 PM, Peter Eisentraut wrote:
> On 07/09/2018 11:59, Maksim Milyutin wrote:
>> OK. However, I think it would be helpful to keep the mention of
>> setting the necessary owner for the tablespace directory. My final
>> version of the hint is "You might need to fix permissions on this
>> directory or its parents or install the PostgreSQL system user as
>> the owner of this directory." An updated version of the patch is
>> attached.
>
> You are still proposing to hint that the permissions on the tablespace
> directory might be correct but that the main database instance is
> running under the wrong user.

Not exactly. The case that motivated me to write this patch was a novice user who, after installing postgres from a package (which sets up the postgres user and initializes PGDATA with the right permissions), tried to create the necessary tablespace directories himself, so that he was the owner of those directories. The error message "could not set permissions on directory ..." disoriented that user; the need to change the owner of those directories became clear only after careful reading of the documentation. I think it would be helpful to show the proposed hint so that the problem can be resolved more quickly.

-- 
Regards, Maksim Milyutin
Re: Hint to set owner for tablespace directory
01.10.2018 15:15, Peter Eisentraut wrote:
> On 24/09/2018 14:50, Peter Eisentraut wrote:
>> On 11/09/2018 17:10, Peter Eisentraut wrote:
>>> On 07/09/2018 17:59, Maksim Milyutin wrote:
>>>> ... so that he was the owner of those directories. The error
>>>> message "could not set permissions on directory ..." disoriented
>>>> that user; the need to change the owner of those directories became
>>>> clear only after careful reading of the documentation. I think it
>>>> would be helpful to show the proposed hint so that the problem can
>>>> be resolved more quickly.
>>>
>>> I think it might be worth clarifying the documentation instead. I'm
>>> looking at the CREATE TABLESPACE reference page and it's not super
>>> clear on first reading.
>>
>> How about the attached patch?
>
> I have committed this patch and will close this commitfest item. I
> don't think the idea with the hints in the code was going anywhere.

Yes, thanks, and sorry for the delayed answer, I was busy. You have clearly described the workflow for creating a tablespace (including the *chown* command), and in general that is enough to resolve the issue with the wrong owner of the tablespace directory.

-- 
Regards, Maksim Milyutin
COALESCE with single argument looks like identity function
Hello everyone!

I've noticed that the expression simplification routine in the planner doesn't reduce COALESCE to its argument expression when that argument is alone in the COALESCE argument list. This can suppress further useful transformations that require non-strict behavior from an expression, such as converting an OUTER JOIN into an INNER one when the WHERE qual contains a COALESCE over a single column from the inner side. A patch with the transformation in question for COALESCE is attached.

-- 
Best regards, Maksim Milyutin

From 1287610efa3895a0ababfc66f346a6a7c7edf9b9 Mon Sep 17 00:00:00 2001
From: Maksim Milyutin
Date: Fri, 11 Apr 2025 15:43:42 +0300
Subject: [PATCH v1] Simplify COALESCE with single argument

---
 src/backend/optimizer/util/clauses.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/backend/optimizer/util/clauses.c b/src/backend/optimizer/util/clauses.c
index 26a3e050086..60f33839214 100644
--- a/src/backend/optimizer/util/clauses.c
+++ b/src/backend/optimizer/util/clauses.c
@@ -3332,6 +3332,8 @@ eval_const_expressions_mutator(Node *node,
 					return (Node *) makeNullConst(coalesceexpr->coalescetype,
 												  -1,
 												  coalesceexpr->coalescecollid);
+				if (list_length(newargs) == 1)
+					return (Node *) linitial(newargs);
 
 				newcoalesce = makeNode(CoalesceExpr);
 				newcoalesce->coalescetype = coalesceexpr->coalescetype;
-- 
2.43.0
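[Editorial note: for illustration, here is the new check in context inside the CoalesceExpr arm of eval_const_expressions_mutator(); the surrounding lines follow the patch context above, but the comment wording is added here and is not from the patch.]

				/* After constant-folding, newargs holds the surviving
				 * COALESCE arguments; with none left, the result is NULL. */
				if (newargs == NIL)
					return (Node *) makeNullConst(coalesceexpr->coalescetype,
												  -1,
												  coalesceexpr->coalescecollid);

				/*
				 * With exactly one argument left, COALESCE(x) yields x
				 * whether or not x is NULL, so the CoalesceExpr wrapper is
				 * an identity and can be discarded.  Dropping it also stops
				 * the wrapper from hiding the argument's strictness from
				 * later transformations such as outer-join reduction.
				 */
				if (list_length(newargs) == 1)
					return (Node *) linitial(newargs);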
Re: COALESCE with single argument looks like identity function
On 4/11/25 17:00, Tom Lane wrote:
> Maksim Milyutin writes:
>> I've noticed that the expression simplification routine in the
>> planner doesn't reduce COALESCE to its argument expression when that
>> argument is alone in the COALESCE argument list. This can suppress
>> further useful transformations that require non-strict behavior from
>> an expression, such as converting an OUTER JOIN into an INNER one
>> when the WHERE qual contains a COALESCE over a single column from the
>> inner side.
>
> Seems like a reasonable idea --- it's probably a rare case, but the
> check is cheap enough. I'd add some comments though.

Thanks for your comments.

> Please add this to the open commitfest so we don't lose track of it.

Done.

In the regression tests I've replaced all single-argument COALESCEs with ones taking a dummy second argument, to preserve the coalesce calls: AFAICS their usage there is intentional, to wrap attributes so that PHVs are generated above them.

Also I've noticed an issue with this query (from the join.sql test suite):

SELECT 1 FROM group_tbl t1 LEFT JOIN
  (SELECT a c1, COALESCE(a) c2 FROM group_tbl t2) s ON TRUE
GROUP BY s.c1, s.c2;

The repeated t2.a in the GROUP BY clause is not collapsed into a single appearance:

            QUERY PLAN
------------------------------------------
 Group
   Group Key: t2.a, t2.a
   ->  Sort
         Sort Key: t2.a, t2.a
         ->  Nested Loop Left Join
               ->  Seq Scan on group_tbl t1
               ->  Seq Scan on group_tbl t2

IMO the cause is the PHV surrounding s.c2, which makes its internal expression differ from the otherwise-identical first grouping key.

-- 
Best regards, Maksim Milyutin
Re: COALESCE with single argument looks like identity function
Updated patchset is attached.

On 4/14/25 17:25, Maksim Milyutin wrote:
> On 4/11/25 17:00, Tom Lane wrote:
>> Maksim Milyutin writes:
>>> I've noticed that the expression simplification routine in the
>>> planner doesn't reduce COALESCE to its argument expression when that
>>> argument is alone in the COALESCE argument list. This can suppress
>>> further useful transformations that require non-strict behavior from
>>> an expression, such as converting an OUTER JOIN into an INNER one
>>> when the WHERE qual contains a COALESCE over a single column from
>>> the inner side.
>>
>> Seems like a reasonable idea --- it's probably a rare case, but the
>> check is cheap enough. I'd add some comments though.
>
> Thanks for your comments.
>
>> Please add this to the open commitfest so we don't lose track of it.
>
> Done.
>
> In the regression tests I've replaced all single-argument COALESCEs
> with ones taking a dummy second argument, to preserve the coalesce
> calls: AFAICS their usage there is intentional, to wrap attributes so
> that PHVs are generated above them.
>
> Also I've noticed an issue with this query (from the join.sql test
> suite):
>
> SELECT 1 FROM group_tbl t1 LEFT JOIN
>   (SELECT a c1, COALESCE(a) c2 FROM group_tbl t2) s ON TRUE
> GROUP BY s.c1, s.c2;
>
> The repeated t2.a in the GROUP BY clause is not collapsed into a
> single appearance:
>
>             QUERY PLAN
> ------------------------------------------
>  Group
>    Group Key: t2.a, t2.a
>    ->  Sort
>          Sort Key: t2.a, t2.a
>          ->  Nested Loop Left Join
>                ->  Seq Scan on group_tbl t1
>                ->  Seq Scan on group_tbl t2
>
> IMO the cause is the PHV surrounding s.c2, which makes its internal
> expression differ from the otherwise-identical first grouping key.

-- 
Best regards, Maksim Milyutin

From 294d7fdf105dac6a17a6ac94a832f4cdf02e6060 Mon Sep 17 00:00:00 2001
From: Maksim Milyutin
Date: Mon, 14 Apr 2025 17:06:50 +0300
Subject: [PATCH v1 2/2] Adjust regression tests

---
 src/test/regress/expected/join.out      | 32 ++++++++---------
 src/test/regress/expected/subselect.out | 46 +++++++++++------------
 src/test/regress/sql/join.sql           | 10 +++---
 src/test/regress/sql/subselect.sql      |  8 ++---
 4 files changed, 48 insertions(+), 48 deletions(-)

diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 14da5708451..6c55eeda6be 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -5384,14 +5384,14 @@ select * from
   (select 1 as id) as xx
   left join
     (tenk1 as a1 full join (select 1 as id) as yy on (a1.unique1 = yy.id))
-  on (xx.id = coalesce(yy.id));
-                 QUERY PLAN
+  on (xx.id = coalesce(yy.id, 0));
+                   QUERY PLAN
+---------------------------------------------
 Nested Loop Left Join
   ->  Result
   ->  Hash Full Join
         Hash Cond: (a1.unique1 = (1))
-        Filter: (1 = COALESCE((1)))
+        Filter: (1 = COALESCE((1), 0))
         ->  Seq Scan on tenk1 a1
         ->  Hash
               ->  Result
@@ -5401,7 +5401,7 @@ select * from
   (select 1 as id) as xx
   left join
     (tenk1 as a1 full join (select 1 as id) as yy on (a1.unique1 = yy.id))
-  on (xx.id = coalesce(yy.id));
+  on (xx.id = coalesce(yy.id, 0));
 id | unique1 | unique2 | two | four | ten | twenty | hundred | thousand | twothousand | fivethous | tenthous | odd | even | stringu1 | stringu2 | string4 | id
----+---------+---------+-----+------+-----+--------+---------+----------+-------------+-----------+----------+-----+------+----------+----------+---------+----
  1 |       1 |    2838 |   1 |    1 |   1 |      1 |       1 |        1 |           1 |         1 |        1 |   2 |    3 | BA       | EFEAAA   | xx      |  1
@@ -8123,20 +8123,20 @@ select * from int4_tbl i left join
 explain (verbose, costs off)
 select * from int4_tbl i left join
-  lateral (select coalesce(i) from int2_tbl j where i.f1 = j.f1) k on true;
-                 QUERY PLAN
---------------------------------------------
+  lateral (select coalesce(i, row(0)::int4_tbl) from int2_tbl j where i.f1 = j.f1) k on true;
+                      QUERY PLAN
+--------------------------------------------------------
 Nested Loop Left Join
-   Output: i.f1, (COALESCE(i.*))
+   Output: i.f1, (COALESCE(i.*, '(0)'::int4_tbl))
   ->  Seq Scan on public.int4_tbl i
         Output: i.f1, i.*
   ->  Seq Scan on public.int2_tbl j
-        Output: j.f1, COALESCE(i.*)
+        Output: j.f1, COALESCE(i.*, '(0)'::int4_tbl)
         Filter: (i.f1 = j.f1)
 (7 rows)
 
 select * from int4_tbl i left join
-  lateral (select coalesce(i) from int2_tbl j where i.f1 = j.f1) k on true;
+  lateral (select coalesce(i, row(0)::int4_tbl) from int2_tbl j where i.f1 = j.f1) k on true;
  f1 | coalesce
----+----------
  0 | (0)
@@ -9305,14 +9305,14 @@ CREATE STATISTICS group_tbl_stat (ndistinct) O