Code comment fix
When table oids were removed by commit 578b229718e, the function CatalogTupleInsert() was modified to return void but its comment was left intact. Here is a trivial patch to fix that. -- Vik Fearing

diff --git a/src/backend/catalog/indexing.c b/src/backend/catalog/indexing.c
index 284ceaa6b9..4d1440cd3a 100644
--- a/src/backend/catalog/indexing.c
+++ b/src/backend/catalog/indexing.c
@@ -210,7 +210,6 @@ CatalogTupleCheckConstraints(Relation heapRel, HeapTuple tup)
  * CatalogTupleInsert - do heap and indexing work for a new catalog tuple
  *
  * Insert the tuple data in "tup" into the specified catalog relation.
- * The Oid of the inserted tuple is returned.
  *
  * This is a convenience routine for the common case of inserting a single
  * tuple in a system catalog; it inserts a new heap tuple, keeping indexes
RE: [PATCH] pgbench: improve \sleep meta command
Dear Fujii-san, Thank you for updating the patch. I understand that you don't want to change the current specification.

```diff
+	if (usec == 0)
+	{
+		char	   *c = var;
+
+		/* Skip sign */
+		if (*c == '+' || *c == '-')
+			c++;
```

In my understanding the skip is not necessary, because the plus sign has already been removed in executeMetaCommand() and a leading minus sign is handled by atoi(). Best Regards, Hayato Kuroda FUJITSU LIMITED
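As a quick standalone check of the point being made here (this is not pgbench code, just a throwaway demonstration), atoi() itself consumes a leading '+' or '-', so nothing needs to skip the sign before calling it:

```c
/*
 * Standalone check (not pgbench code): atoi() consumes a leading '+' or
 * '-' by itself, so no manual sign-skipping is needed before conversion.
 */
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	const char *values[] = {"3", "+3", "-3"};

	for (int i = 0; i < 3; i++)
		printf("atoi(\"%s\") = %d\n", values[i], atoi(values[i]));
	return 0;
}
```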
Re: PITR promote bug: Checkpointer writes to older timeline
On Mon, Mar 15, 2021 at 04:38:08PM +0900, Michael Paquier wrote:
> On Mon, Mar 15, 2021 at 03:01:09PM +0900, Kyotaro Horiguchi wrote:
>> Logical decoding stuff is (I think) designed to turn any backend into a walsender, which may need to maintain ThisTimeLineID. It seems to me that logical decoding stuff intends to maintain ThisTimeLineID of such backends when reading a WAL record. logical_read_xlog_page also updates ThisTimeLineID, and pg_logical_slot_get_changes_guts(), pg_replication_slot_advance() (and maybe other functions) update ThisTimeLineID. So it is natural that read_local_xlog_page() updates it, since it is intended to be used in logical decoding plugins.
>
> Logical decoding contexts cannot be created while in recovery as per CheckLogicalDecodingRequirements(), and as mentioned not everything is in place to allow that. FWIW, I think that it is just confusing for pg_replication_slot_advance() and pg_logical_slot_get_changes_guts() to update it, and we could just look for the latest value each time it is necessary when reading a new WAL page.

Studying some history today, having read_local_xlog_page() directly update ThisTimeLineID was extensively discussed here back in 2017, in the attempt to introduce logical decoding on standbys (1148e22a): https://www.postgresql.org/message-id/CAMsr%2BYEVmBJ%3DdyLw%3D%2BkTihmUnGy5_EW4Mig5T0maieg_Zu%3DXCg%40mail.gmail.com

Currently, with HEAD and the back branches, nothing would be broken, as logical contexts cannot exist in recovery. Still, it would be easy for anybody attempting to work more on this feature in the future to miss the new behavior if we change read_local_xlog_page() to not update ThisTimeLineID anymore. Resetting the value of ThisTimeLineID in read_local_xlog_page() does not seem completely right either with this argument, as there could be some custom code relying on the existing behavior of read_local_xlog_page() to maintain ThisTimeLineID.

Hmmm. I am wondering whether the best answer for the moment would not be to save and reset ThisTimeLineID just in XlogReadTwoPhaseData(), as that's the code path involved here that uses read_local_xlog_page(). The state of the code is really confusing on HEAD, and I'd like to think that the best thing we could do in the long term is to make the logical decoding path not rely on ThisTimeLineID at all and decouple all that, putting the code in a state sane enough that we don't finish with similar errors to what is discussed on this thread. That would be work for a different patch though, not for the stable branches. And seeing some slot and advancing functions directly update ThisTimeLineID is no good either... Any thoughts? -- Michael
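For illustration, a rough, untested sketch of the save-and-restore idea floated above (not a proposed patch; the existing reader setup in twophase.c is elided):

```c
/*
 * Rough sketch (untested): save and restore ThisTimeLineID around the WAL
 * read, so that the timeline lookup done by read_local_xlog_page() cannot
 * leak into the checkpointer's later WAL writes.
 */
static void
XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
{
	TimeLineID	save_tli = ThisTimeLineID;

	/* ... existing XLogReaderAllocate() / XLogReadRecord() logic ... */

	/* read_local_xlog_page() may have changed it; put it back */
	ThisTimeLineID = save_tli;
}
```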
Re: Code comment fix
On Wed, Mar 17, 2021 at 08:31:16AM +0100, Vik Fearing wrote: > When table oids were removed by commit 578b229718e, the function > CatalogTupleInsert() was modified to return void but its comment was > left intact. Here is a trivial patch to fix that. Thanks, Vik. Good catch. I'll fix that in a bit. -- Michael
Re: pl/pgsql feature request: shorthand for argument and local variable references
On Wed, Mar 17, 2021 at 02:06:57PM +0800, Julien Rouhaud wrote: > I also think that there should be a single usable top label, otherwise it will > lead to confusing code and it can be a source of bugs. Okay, that's fine by me. Could it be possible to come up with an approach that does not hijack the namespace list construction and the lookup logic as much as it does now? I get it that the patch is done this way because of the ordering of operations done for the initial ns list creation and the grammar parsing that adds the routine label on top of the root one, but I'd like to believe that there are more solid ways to achieve that, like for example something that decouples the root label and its alias, or something that associates an alias directly with its PLpgSQL_nsitem entry? -- Michael
Re: Transactions involving multiple postgres foreign servers, take 2
Hi, For v35-0007-Prepare-foreign-transactions-at-commit-time.patch : With this commit, the foreign server modified within the transaction marked as 'modified'. transaction marked -> transaction is marked +#define IsForeignTwophaseCommitRequested() \ +(foreign_twophase_commit > FOREIGN_TWOPHASE_COMMIT_DISABLED) Since the other enum is FOREIGN_TWOPHASE_COMMIT_REQUIRED, I think the macro should be named: IsForeignTwophaseCommitRequired. +static bool +checkForeignTwophaseCommitRequired(bool local_modified) + if (!ServerSupportTwophaseCommit(fdw_part)) + have_no_twophase = true; ... + if (have_no_twophase) + ereport(ERROR, It seems the error case should be reported within the loop. This way, we don't need to iterate the other participant(s). Accordingly, nserverswritten should be incremented for local server prior to the loop. The condition in the loop would become if (!ServerSupportTwophaseCommit(fdw_part) && nserverswritten > 1). have_no_twophase is no longer needed. Cheers On Tue, Mar 16, 2021 at 8:04 PM Masahiko Sawada wrote: > On Mon, Mar 15, 2021 at 3:55 AM Ibrar Ahmed wrote: > > > > > > > > On Thu, Feb 11, 2021 at 6:25 PM Masahiko Sawada > wrote: > >> > >> On Fri, Feb 5, 2021 at 2:45 PM Masahiko Sawada > wrote: > >> > > >> > On Tue, Feb 2, 2021 at 5:18 PM Fujii Masao < > masao.fu...@oss.nttdata.com> wrote: > >> > > > >> > > > >> > > > >> > > On 2021/01/27 14:08, Masahiko Sawada wrote: > >> > > > On Wed, Jan 27, 2021 at 10:29 AM Fujii Masao > >> > > > wrote: > >> > > >> > >> > > >> > >> > > >> You fixed some issues. But maybe you forgot to attach the latest > patches? > >> > > > > >> > > > Yes, I've attached the updated patches. > >> > > > >> > > Thanks for updating the patch! I tried to review 0001 and 0002 as > the self-contained change. > >> > > > >> > > + * An FDW that implements both commit and rollback APIs can > request to register > >> > > + * the foreign transaction by FdwXactRegisterXact() to participate > it to a > >> > > + * group of distributed tranasction. The registered foreign > transactions are > >> > > + * identified by OIDs of server and user. > >> > > > >> > > I'm afraid that the combination of OIDs of server and user is not > unique. IOW, more than one foreign transactions can have the same > combination of OIDs of server and user. For example, the following two > SELECT queries start the different foreign transactions but their user OID > is the same. OID of user mapping should be used instead of OID of user? > >> > > > >> > > CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw; > >> > > CREATE USER MAPPING FOR postgres SERVER loopback OPTIONS (user > 'postgres'); > >> > > CREATE USER MAPPING FOR public SERVER loopback OPTIONS (user > 'postgres'); > >> > > CREATE TABLE t(i int); > >> > > CREATE FOREIGN TABLE ft(i int) SERVER loopback OPTIONS > (table_name 't'); > >> > > BEGIN; > >> > > SELECT * FROM ft; > >> > > DROP USER MAPPING FOR postgres SERVER loopback ; > >> > > SELECT * FROM ft; > >> > > COMMIT; > >> > > >> > Good catch. I've considered using user mapping OID or a pair of user > >> > mapping OID and server OID as a key of foreign transactions but I > >> > think it also has a problem if an FDW caches the connection by pair of > >> > server OID and user OID whereas the core identifies them by user > >> > mapping OID. For instance, mysql_fdw manages connections by pair of > >> > server OID and user OID. 
> >> > > >> > For example, let's consider the following execution: > >> > > >> > BEGIN; > >> > SET ROLE user_A; > >> > INSERT INTO ft1 VALUES (1); > >> > SET ROLE user_B; > >> > INSERT INTO ft1 VALUES (1); > >> > COMMIT; > >> > > >> > Suppose that an FDW identifies the connections by {server OID, user > >> > OID} and the core GTM identifies the transactions by user mapping OID, > >> > and user_A and user_B use the public user mapping to connect server_X. > >> > In the FDW, there are two connections identified by {user_A, sever_X} > >> > and {user_B, server_X} respectively, and therefore opens two > >> > transactions on each connection, while GTM has only one FdwXact entry > >> > because the two connections refer to the same user mapping OID. As a > >> > result, at the end of the transaction, GTM ends only one foreign > >> > transaction, leaving another one. > >> > > >> > Using user mapping OID seems natural to me but I'm concerned that > >> > changing role in the middle of transaction is likely to happen than > >> > dropping the public user mapping but not sure. We would need to find > >> > more better way. > >> > >> After more thought, I'm inclined to think it's better to identify > >> foreign transactions by user mapping OID. The main reason is, I think > >> FDWs that manages connection caches by pair of user OID and server OID > >> potentially has a problem with the scenario Fujii-san mentioned. If an > >> FDW has to use another user mapping (i.g., connection information) due > >
Re: A new function to wait for the backend exit after termination
On Wed, Mar 17, 2021 at 8:28 AM Kyotaro Horiguchi wrote: > > At Wed, 17 Mar 2021 07:01:39 +0530, Bharath Rupireddy > wrote in > > Attaching v10 patch for further review. > > The time-out mechanism doesn't count remainingtime as expected, > concretely it does the following. > > do { > kill(); > WaitLatch(WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, waittime); > ResetLatch(MyLatch); > remainingtime -= waittime; > } while (remainingtime > 0); > > So, the WaitLatch doesn't consume as much time as the set waittime in > case of latch set. remainingtime reduces faster than the real at the > iteration. WaitLatch can exit without waiting for the waittime duration whenever the MyLatch is set (SetLatch). Now the question is how frequently SetLatch can get called in a backend? For instance, if we keep calling pg_reload_conf in any of the backends in the cluster, then the SetLatch will be called and the timeout in pg_wait_until_termination will be reached fastly. I see that this problem can also exist in case of pg_promote function. Similarly it may exist in other places where we have WaitLatch for timeouts. IMO, the frequency of SetLatch calls may not be that much in real time scenarios. If at all, the latch gets set too frequently, then the terminate and wait functions might timeout earlier. But is it a critical problem to worry about? (IMHO, it's not that critical) If yes, we might as well need to fix it (I don't know how?) in other critical areas like pg_promote? > It wouldn't happen actually but I concern about PID recycling. We can > make sure to get rid of the fear by checking for our BEENTRY instead > of PID. However, it seems to me that some additional function is > needed in pgstat.c so that we can check the realtime value of > PgBackendStatus, which might be too much. The aim of the wait logic is to ensure that the process is gone from the system processes that is why using kill(), not it's entries are gone from the shared memory. > + /* If asked to wait, check whether the timeout value is valid or not. > */ > + if (wait && pid != MyProcPid) > + { > + timeout = PG_GETARG_INT64(2); > + > + if (timeout <= 0) > + ereport(ERROR, > + > (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE), > +errmsg("\"timeout\" must not be > negative or zero"))); > > This means that pg_terminate_backend accepts negative timeouts when > terminating myself, which looks odd. I will change this. > Is there any reason to reject 0 as timeout? Actually, timeout 0 should mean that "don't wait" and we can error out on negative values. Thoughts? > +* Wait only if requested and the termination is successful. Self > +* termination is allowed but waiting is not. > +*/ > + if (wait && pid != MyProcPid && result) > + result = pg_wait_until_termination(pid, timeout); > > Why don't we wait for myself to be terminated? There's no guarantee > that myself will be terminated without failure. (I agree that that is > not so useful, but I think there's no reason not to do so.) We could programmatically allow it to wait in case of self termination and it doesn't make any difference to the user, they would see "Terminating connection due to administrator command" FATAL error. I can remove pid != MyProcPid. > The first suggested signature for pg_terminate_backend() with timeout > was pg_terminate_backend(pid, timeout). The current signature (pid, > wait?, timeout) looks redundant. Maybe the reason for rejecting 0 > astimeout is pg_terminate_backend(pid, true, 0) looks odd but it we > can wait forever in that case (as other features does). 
On the other > hand pg_terminate_backend(pid, false, 100) is apparently odd but this > patch doesn't seem to reject it. If there's no considerable reason > for the current signature, I would suggest that: > > pg_terminate_backend(pid, timeout), where it waits forever if timeout > is zero and waits for the timeout if positive. Negative values are not > accepted.

So, as stated above, how about timeout 0 (the default) meaning "don't wait", negative values raising an error, and a positive value (in milliseconds) meaning that we should wait that long after termination? And for pg_wait_for_backend_termination, should we error out for a timeout of 0 or a negative value? IMO, the above semantics are better than timeout 0 meaning "wait forever". Thoughts?

> + ereport(WARNING, > + (errmsg("could not check the > existence of the backend with PID %d: %m", > + pid))); > + return false; > > I think this is worth ERROR. We can avoid this handling if we look > into PgBackendEntry instead.

I will change it to ERROR. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
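As an illustration of the timing concern Horiguchi-san raised, here is a sketch (not the patch's actual code; the helper name and the 100ms step are illustrative, and the usual backend headers for latches, instr_time and wait events are assumed) of a wait loop that recomputes the remaining time from the clock, so a latch set by an unrelated event such as pg_reload_conf() no longer shortens the effective timeout:

```c
/*
 * Sketch only: measure the time actually spent waiting instead of
 * assuming each WaitLatch() call consumed the full interval.
 */
static bool
wait_until_pid_gone(int pid, int64 timeout_ms)
{
	instr_time	start,
				now;

	INSTR_TIME_SET_CURRENT(start);

	for (;;)
	{
		int64		elapsed_ms;

		/* kill(pid, 0) only probes for the existence of the process */
		if (kill(pid, 0) == -1 && errno == ESRCH)
			return true;		/* process is gone */

		INSTR_TIME_SET_CURRENT(now);
		INSTR_TIME_SUBTRACT(now, start);
		elapsed_ms = (int64) INSTR_TIME_GET_MILLISEC(now);
		if (elapsed_ms >= timeout_ms)
			return false;		/* timed out */

		/* sleep in short steps; an early latch set no longer skews the math */
		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
						 Min(100, timeout_ms - elapsed_ms),
						 WAIT_EVENT_PG_SLEEP);
		ResetLatch(MyLatch);
		CHECK_FOR_INTERRUPTS();
	}
}
```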
Re: Code comment fix
On 3/17/21 9:11 AM, Michael Paquier wrote: > On Wed, Mar 17, 2021 at 08:31:16AM +0100, Vik Fearing wrote: >> When table oids were removed by commit 578b229718e, the function >> CatalogTupleInsert() was modified to return void but its comment was >> left intact. Here is a trivial patch to fix that. > > Thanks, Vik. Good catch. I'll fix that in a bit. Cheers. -- Vik Fearing
Re: Should we improve "PID XXXX is not a PostgreSQL server process" warning for pg_terminate_backend(<>)?
On Wed, Mar 17, 2021 at 8:05 AM torikoshia wrote: > > I have not gone through that thread though. Is there any way we can > > detect those child processes(stats collector, sys logger) that are > > forked by the postmaster from a backend process? Thoughts? > > I couldn't find good ways to do that, and thus I'm now wondering > just changing the message. Changed the message. > >> I'm now wondering if changing the message to something like > >> "PID is not a PostgreSQL backend process". > >> > >> "backend process' is now defined as "Process of an instance > >> which acts on behalf of a client session and handles its > >> requests." in Appendix. > > > > Yeah, that looks good to me. IIUC, we can just change the message from > > "PID is not a PostgreSQL server process" to "PID is not a > > PostgreSQL backend process" and we don't need look for AuxiliaryProcs > > or PostmasterPid. > > Changing log messages can affect operations, especially when people > monitor the log message strings, but improving "PID is not a > PostgreSQL server process" does not seem to cause such problems. Now the error message clearly says that the given pid is not a backend process. Attaching v2 patch. Please have a look. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Re: A new function to wait for the backend exit after termination
Hi, w.r.t. WaitLatch(), if its return value is WL_TIMEOUT, we know the specified timeout has elapsed. It seems WaitLatch() can be enhanced to also return the actual duration of the wait. This way, the caller can utilize the duration directly. As for other places where WaitLatch() is called, similar change can be applied on a per-case basis (with separate patches, not under this topic). Cheers On Wed, Mar 17, 2021 at 2:04 AM Bharath Rupireddy < bharath.rupireddyforpostg...@gmail.com> wrote: > On Wed, Mar 17, 2021 at 8:28 AM Kyotaro Horiguchi > wrote: > > > > At Wed, 17 Mar 2021 07:01:39 +0530, Bharath Rupireddy < > bharath.rupireddyforpostg...@gmail.com> wrote in > > > Attaching v10 patch for further review. > > > > The time-out mechanism doesn't count remainingtime as expected, > > concretely it does the following. > > > > do { > > kill(); > > WaitLatch(WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, waittime); > > ResetLatch(MyLatch); > > remainingtime -= waittime; > > } while (remainingtime > 0); > > > > So, the WaitLatch doesn't consume as much time as the set waittime in > > case of latch set. remainingtime reduces faster than the real at the > > iteration. > > WaitLatch can exit without waiting for the waittime duration whenever > the MyLatch is set (SetLatch). Now the question is how frequently > SetLatch can get called in a backend? For instance, if we keep calling > pg_reload_conf in any of the backends in the cluster, then the > SetLatch will be called and the timeout in pg_wait_until_termination > will be reached fastly. I see that this problem can also exist in > case of pg_promote function. Similarly it may exist in other places > where we have WaitLatch for timeouts. > > IMO, the frequency of SetLatch calls may not be that much in real time > scenarios. If at all, the latch gets set too frequently, then the > terminate and wait functions might timeout earlier. But is it a > critical problem to worry about? (IMHO, it's not that critical) If > yes, we might as well need to fix it (I don't know how?) in other > critical areas like pg_promote? > > > It wouldn't happen actually but I concern about PID recycling. We can > > make sure to get rid of the fear by checking for our BEENTRY instead > > of PID. However, it seems to me that some additional function is > > needed in pgstat.c so that we can check the realtime value of > > PgBackendStatus, which might be too much. > > The aim of the wait logic is to ensure that the process is gone from > the system processes that is why using kill(), not it's entries are > gone from the shared memory. > > > + /* If asked to wait, check whether the timeout value is valid or > not. */ > > + if (wait && pid != MyProcPid) > > + { > > + timeout = PG_GETARG_INT64(2); > > + > > + if (timeout <= 0) > > + ereport(ERROR, > > + > (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE), > > +errmsg("\"timeout\" must not be > negative or zero"))); > > > > This means that pg_terminate_backend accepts negative timeouts when > > terminating myself, which looks odd. > > I will change this. > > > Is there any reason to reject 0 as timeout? > > Actually, timeout 0 should mean that "don't wait" and we can error out > on negative values. Thoughts? > > > +* Wait only if requested and the termination is successful. Self > > +* termination is allowed but waiting is not. > > +*/ > > + if (wait && pid != MyProcPid && result) > > + result = pg_wait_until_termination(pid, timeout); > > > > Why don't we wait for myself to be terminated? 
There's no guarantee > > that myself will be terminated without failure. (I agree that that is > > not so useful, but I think there's no reason not to do so.) > > We could programmatically allow it to wait in case of self termination > and it doesn't make any difference to the user, they would see > "Terminating connection due to administrator command" FATAL error. I > can remove pid != MyProcPid. > > > The first suggested signature for pg_terminate_backend() with timeout > > was pg_terminate_backend(pid, timeout). The current signature (pid, > > wait?, timeout) looks redundant. Maybe the reason for rejecting 0 > > astimeout is pg_terminate_backend(pid, true, 0) looks odd but it we > > can wait forever in that case (as other features does). On the other > > hand pg_terminate_backend(pid, false, 100) is apparently odd but this > > patch doesn't seem to reject it. If there's no considerable reason > > for the current signature, I would suggest that: > > > > pg_terminate_backend(pid, timeout), where it waits forever if timeout > > is zero and waits for the timeout if positive. Negative values are not > > accepted. > > So, as stated above, how about a timeout 0 (which is default) telling > "don't wait", negative error out, a positive milliseconds value > indicating that we should wait after terminati
Re: pgsql: Add libpq pipeline mode support to pgbench
Bonjour Daniel, Ola Alvaro,

Add libpq pipeline mode support to pgbench
New metacommands \startpipeline and \endpipeline allow the user to run queries in libpq pipeline mode.
Author: Daniel Vérité
Reviewed-by: Álvaro Herrera
Discussion: https://postgr.es/m/b4e34135-2bd9-4b8a-94ca-27d760da2...@manitou-mail.org

I did not notice that the libpq pipeline mode thread had a pgbench patch attached, otherwise I would have looked at it. Some minor post-commit comments:

For consistency with the existing \if … \endif, ISTM that it could have been named \batch … \endbatch or \pipeline … \endpipeline?

ISTM that the constraint checks (nesting, no \[ag]set) could be added to CheckConditional, which could be renamed to CheckScript. That would allow the checks in the command execution to be simplified to mere Asserts. -- Fabien.
RE: Parallel Inserts in CREATE TABLE AS
> > Seems like v22 patch was failing in cfbot for one of the unstable test > > cases. > > Attaching v23 patch set with modification in 0003 and 0004 patches. No > > changes to 0001 and 0002 patches. Hopefully cfbot will be happy with v23. > > > > Please consider v23 for further review. > Hi, > > I was looking into the latest patch, here are some comments:

A few comments.

1) Executing the following SQL causes an assertion failure.

---sql---
create table data(a int);
insert into data select 1 from generate_series(1,100,1) t;
explain (verbose) create table tt as select a,2 from data;
--

The stack message:

---stack---
TRAP: FailedAssertion("!allow && gather_exists && tuple_cost_opts && !(*tuple_cost_opts & PARALLEL_INSERT_TUP_COST_IGNORED)", File: "execParallel.c", Line: 1872, PID: 1618247)
postgres: houzj postgres [local] EXPLAIN(ExceptionalCondition+0x8b)[0x940f0b]
postgres: houzj postgres [local] EXPLAIN[0x67ba1c]
postgres: houzj postgres [local] EXPLAIN(ExplainOnePlan+0x1c2)[0x605997]
postgres: houzj postgres [local] EXPLAIN[0x605d11]
postgres: houzj postgres [local] EXPLAIN(ExplainOneUtility+0x162)[0x605eb0]
--

In this case, the Gather node has a projection, in which case parallel CTAS is not supported, but we still ignore the cost in the planner. If we plan to detect the projection, we may need to check tlist_same_exprs.

+	if (tlist_same_exprs)
+	{
		ignore_parallel_tuple_cost(root);
+	}
	generate_useful_gather_paths(root, rel, false);

2)

+	 * Parallelize inserts only when the upper Gather node has no projections.
	 */
-	gstate->dest = dest;
+	if (!gstate->ps.ps_ProjInfo)

IMO, it's better to add some comments about why we do not support projection for now. Because not all projections are parallel-unsafe (such as the case in 1)), it would be desirable to support these later.

3)

+	if (IsParallelInsertionAllowed(PARALLEL_INSERT_CMD_CREATE_TABLE_AS,
+								   &parallel_ins_info))
...
	/* plan the query */
	plan = pg_plan_query(query, pstate->p_sourcetext, CURSOR_OPT_PARALLEL_OK, params);
...
+	if (IsParallelInsertionAllowed(PARALLEL_INSERT_CMD_CREATE_TABLE_AS,
+								   &parallel_ins_info))

Currently, the patch calls IsParallelInsertionAllowed() before and after pg_plan_query(). This might lead to the misunderstanding that parallel_ins_info gets changed during pg_plan_query(). Since parallel_ins_info does not get changed in pg_plan_query(), is it possible to add a bool flag (allowed) in parallel_ins_info to avoid the second call of IsParallelInsertionAllowed()?

Best regards, houzj
Re: fdatasync performance problem with large number of DB files
On Wed, Mar 17, 2021 at 11:45 AM Thomas Munro wrote: > > On Tue, Mar 16, 2021 at 9:29 PM Fujii Masao > wrote: > > On 2021/03/16 8:15, Thomas Munro wrote: > > > I don't want to add a hypothetical sync_after_crash=none, because it > > > seems like generally a bad idea. We already have a > > > running-with-scissors mode you could use for that: fsync=off. > > > > I heard that some backup tools sync the database directory when restoring > > it. > > I guess that those who use such tools might want the option to disable such > > startup sync (i.e., sync_after_crash=none) because it's not necessary. > > Hopefully syncfs() will return quickly in that case, without doing any work?

I just quickly reviewed the patch (the code part). It looks good. My only concern or question is about do_syncfs() being handed an fd opened on a symlink - I'm not 100% sure about that.

I think we could consider reviewing and then pushing the syncfs patch now, since the fsync issue really hurts; although there seems to be a better plan for this in the future, the patch may make the sync step in startup much faster. Today I encountered a real case in a production environment for the first time: startup spends more than an hour on the fsync step. I'm pretty sure that the pgdata directory is already synced, and from a pstack of the startup process I saw it slowly open(), fsync() (surely doing nothing) and close() each file one by one; pre_sync_fname() is also a burden in such a scenario. So this issue is really annoying.

Even with the patch, we could discuss further optimizing the special crash recovery scenario where users explicitly know the sync step can be skipped (this scenario is surely not unusual). The syncfs patch could be used for this scenario as well, but the filesystem might be shared by other applications (this is not unusual and happens in my customer's environment, for example), so syncfs() is (probably much) slower than skipping the sync step. -- Paul Guo (Vmware)
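For readers following along, a minimal standalone sketch of what the syncfs() approach boils down to on Linux (the path and error handling below are placeholders, not taken from the patch): one system call flushes every dirty file of the filesystem containing the data directory, instead of an open()/fsync()/close() cycle per file.

```c
/* Minimal Linux-only illustration; syncfs() requires _GNU_SOURCE. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int
sync_filesystem_of(const char *path)
{
	int			fd = open(path, O_RDONLY);

	if (fd < 0)
	{
		perror(path);
		return -1;
	}
	if (syncfs(fd) < 0)			/* flush the whole containing filesystem */
		perror("syncfs");
	close(fd);
	return 0;
}

int
main(void)
{
	/* placeholder data directory path */
	return sync_filesystem_of("/var/lib/postgresql/data") != 0;
}
```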
Re: logical replication worker accesses catalogs in error context callback
On Tue, Mar 16, 2021 at 2:21 AM Tom Lane wrote: > > Thanks for pointing to the changes in the commit message. I corrected > > them. Attaching v4 patch set, consider it for further review. > > I took a quick look at this. I'm quite worried about the potential > performance cost of the v4-0001 patch (the one for fixing > slot_store_error_callback). Previously, we didn't pay any noticeable > cost for having the callback unless there actually was an error. > As patched, we perform several catalog lookups per column per row, > even in the non-error code path. That seems like it'd be a noticeable > performance hit. Just to add insult to injury, it leaks memory. > > I propose a more radical but simpler solution: let's just not bother > with including the type names in the context message. How much are > they really buying?

Thanks. In that case, the message can only report the local and remote column names and ignore the types (something like below), and the user will have to figure out the types of those columns locally and remotely separately in case of an error. The function logicalrep_typmap_gettypname can then also be removed. I'm not sure if this is okay. Thoughts?

    TupleDesc localtupdesc = RelationGetDescr(rel->localrel);
    Form_pg_attribute localatt = TupleDescAttr(localtupdesc, errarg->local_attnum - 1);

    errcontext("processing remote data for replication target relation \"%s.%s\", "
               "remote column %s, local column %s",
               rel->remoterel.nspname, rel->remoterel.relname,
               rel->remoterel.attnames[errarg->remote_attnum],
               NameStr(localatt->attname));

> v4-0002 (for postgres_fdw's conversion_error_callback) has the same > problems, although mitigated a bit by not needing to do any catalog > lookups in the non-join case. For the join case, I wonder whether > we could push the lookups back to plan setup time, so they don't > need to be done over again for each row. (Not doing lookups at all > doesn't seem attractive there; it'd render the context message nearly > useless.) A different idea is to try to get the column names out > of the query's rangetable instead of going to the catalogs.

I will think more on this point. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
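To make the rangetable suggestion a bit more concrete, here is a rough sketch (not the eventual fix; the argument list and message wording are hypothetical, and the real callback would carry this information in its ConversionLocation) of deriving the column name from the query's range table rather than from per-row catalog lookups:

```c
/*
 * Illustration only: meant to be called from postgres_fdw's
 * conversion_error_callback(), so errcontext() is legal here.
 * Assumes postgres_fdw.c's existing headers.
 */
static void
report_conversion_column(List *fdw_scan_tlist, AttrNumber cur_attno,
						 EState *estate)
{
	TargetEntry *tle = list_nth_node(TargetEntry, fdw_scan_tlist,
									 cur_attno - 1);

	if (IsA(tle->expr, Var))
	{
		Var		   *var = (Var *) tle->expr;
		RangeTblEntry *rte = exec_rt_fetch(var->varno, estate);
		char	   *attname = strVal(list_nth(rte->eref->colnames,
											  var->varattno - 1));

		/* names come from the range table, not from syscache lookups */
		errcontext("processing column \"%s\" of relation \"%s\"",
				   attname, rte->eref->aliasname);
	}
	else
		errcontext("processing output column %d of the foreign scan",
				   cur_attno);
}
```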
Re: a verbose option for autovacuum
On Wed, Mar 10, 2021, at 12:46 AM, Masahiko Sawada wrote: > Attached a patch. I've slightly modified the format for consistency > with heap statistics. Since commit 5f8727f5a6, this patch doesn't apply anymore. Fortunately, it is just a small hunk. I reviewed this patch and it looks good to me. There is just a small issue (double space after 'if') that I fixed in the attached version. -- Euler Taveira EDB https://www.enterprisedb.com/ diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c index 8341879d89..35be34ea06 100644 --- a/src/backend/access/heap/vacuumlazy.c +++ b/src/backend/access/heap/vacuumlazy.c @@ -314,6 +314,7 @@ typedef struct LVRelStats int num_index_scans; TransactionId latestRemovedXid; bool lock_waiter_detected; + IndexBulkDeleteResult **indstats; /* used for autovacuum logs */ /* Used for error callback */ char *indname; @@ -531,9 +532,6 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params, /* Do the vacuuming */ lazy_scan_heap(onerel, params, vacrelstats, Irel, nindexes, aggressive); - /* Done with indexes */ - vac_close_indexes(nindexes, Irel, NoLock); - /* * Compute whether we actually scanned the all unfrozen pages. If we did, * we can adjust relfrozenxid and relminmxid. @@ -614,13 +612,18 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params, pgstat_progress_end_command(); /* and log the action if appropriate */ - if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0) + if (IsAutoVacuumWorkerProcess()) { - TimestampTz endtime = GetCurrentTimestamp(); + TimestampTz endtime = 0; + int i; - if (params->log_min_duration == 0 || - TimestampDifferenceExceeds(starttime, endtime, - params->log_min_duration)) + if (params->log_min_duration >= 0) + endtime = GetCurrentTimestamp(); + + if (endtime > 0 && + (params->log_min_duration == 0 || + TimestampDifferenceExceeds(starttime, endtime, + params->log_min_duration))) { StringInfoData buf; char *msgfmt; @@ -680,6 +683,21 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params, (long long) VacuumPageHit, (long long) VacuumPageMiss, (long long) VacuumPageDirty); + for (i = 0; i < nindexes; i++) + { +IndexBulkDeleteResult *stats = vacrelstats->indstats[i]; + +if (!stats) + continue; + +appendStringInfo(&buf, + _("index \"%s\": pages: %u remain, %u newly deleted, %u currently deleted, %u reusable\n"), + RelationGetRelationName(Irel[i]), + stats->num_pages, + stats->pages_newly_deleted, + stats->pages_deleted, + stats->pages_free); + } appendStringInfo(&buf, _("avg read rate: %.3f MB/s, avg write rate: %.3f MB/s\n"), read_rate, write_rate); if (track_io_timing) @@ -704,7 +722,17 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params, (errmsg_internal("%s", buf.data))); pfree(buf.data); } + + /* Cleanup index statistics */ + for (i = 0; i < nindexes; i++) + { + if (vacrelstats->indstats[i]) +pfree(vacrelstats->indstats[i]); + } } + + /* Done with indexes */ + vac_close_indexes(nindexes, Irel, NoLock); } /* @@ -1758,7 +1786,10 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats, /* Update index statistics */ if (vacrelstats->useindex) + { update_index_statistics(Irel, indstats, nindexes); + vacrelstats->indstats = indstats; + } /* If no indexes, make log report that lazy_vacuum_heap would've made */ if (vacuumed_pages) @@ -3243,7 +3274,13 @@ update_index_statistics(Relation *Irel, IndexBulkDeleteResult **stats, InvalidTransactionId, InvalidMultiXactId, false); - pfree(stats[i]); + + /* + * Autovacuum worker keeps the index statistics until the end + 
* of lazy vacuum for autovacuum logs. + */ + if (!IsAutoVacuumWorkerProcess()) + pfree(stats[i]); } }
Re: proposal: schema variables - doc
> On 2021.03.13. 07:01 Pavel Stehule wrote: > Hi > fresh rebase > [schema-variables-20210313.patch.gz]

Hi Pavel,

I notice that the phrase 'schema variable' is not in the index at the end ('bookindex.html'). Not good. It is also not in the index at the front of the manual - also not good. Maybe these two (front and back index) can be added? If a user searches the pdf, the first occurrence he finds is at: 43.13.2.4. Global variables and constants (in itself that occurrence/mention is all right, but it should not be the first find, I think) (I think there was in earlier versions of the patch an entry in the 'contents', i.e., at the front of the manual).

I think it would be good to have it in the front index, pointing to either LET or CREATE VARIABLE, or maybe even to a small introductory paragraph somewhere else (again, I seem to remember that there was one in an earlier patch version). Of the new commands that this patch brings, 'LET' is the most immediately illuminating for a user (even when a CREATE VARIABLE has to be done first). There is an entry 'LET' in the index (good), but it would be better if the phrase 'schema variable' also occurred with that LET entry. (I don't know if that's possible)

Then, in the CREATE VARIABLE paragraphs it says 'Changing a schema variable is non-transactional by default.' I think that, unless there exists a mode where schema vars can be made transactional, 'by default' should be deleted (and there is no such 'transactional mode' for schema variables, is there?). The 'Description' also has such a 'By default' which is better removed for the same reason.

In the CREATE VARIABLE page the example is:

CREATE VARIABLE var1 AS integer;
SELECT var1;

I suggest making that:

CREATE VARIABLE var1 AS date;
LET var1 = (select current_date);
SELECT var1;

so that the example immediately shows an application of the functionality.

Thanks, Erik Rijkers

> > Pavel
Re: Assertion failure with barriers in parallel hash join
On Wed, Mar 17, 2021 at 6:58 PM Thomas Munro wrote: > According to BF animal elver there is something wrong with this > commit. Looking into it. Assertion failure reproduced here and understood, but unfortunately it'll take some more time to fix this. I've reverted the commit for now to unbreak the ~5 machines that are currently showing red in the build farm.
Re: Get memory contexts of an arbitrary backend process
On 2021-03-05 14:22, Fujii Masao wrote: On 2021/03/04 18:32, torikoshia wrote: On 2021-01-14 19:11, torikoshia wrote: Since pg_get_target_backend_memory_contexts() waits to dump memory and it could lead dead lock as below. - session1 BEGIN; TRUNCATE t; - session2 BEGIN; TRUNCATE t; -- wait - session1 SELECT * FROM pg_get_target_backend_memory_contexts(); --wait Thanks for notifying me, Fujii-san. Attached v8 patch that prohibited calling the function inside transactions. Regrettably, this modification could not cope with the advisory lock and I haven't come up with a good way to deal with it. It seems to me that the architecture of the requestor waiting for the dumper leads to this problem and complicates things. Considering the discussion printing backtrace discussion[1], it seems reasonable that the requestor just sends a signal and dumper dumps to the log file. +1 Thanks! I remade the patch and introduced a function pg_print_backend_memory_contexts(PID) which prints the memory contexts of the specified PID to elog. =# SELECT pg_print_backend_memory_contexts(450855); ** log output ** 2021-03-17 15:21:01.942 JST [450855] LOG: Printing memory contexts of PID 450855 2021-03-17 15:21:01.942 JST [450855] LOG: level: 0 TopMemoryContext: 68720 total in 5 blocks; 16312 free (15 chunks); 52408 used 2021-03-17 15:21:01.942 JST [450855] LOG: level: 1 Prepared Queries: 65536 total in 4 blocks; 35088 free (14 chunks); 30448 used 2021-03-17 15:21:01.942 JST [450855] LOG: level: 1 pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 1408 free (0 chunks); 6784 used ..(snip).. 2021-03-17 15:21:01.942 JST [450855] LOG: level: 2 CachedPlanSource: 4096 total in 3 blocks; 680 free (0 chunks); 3416 used: PREPARE hoge_200 AS SELECT * FROM pgbench_accounts WHERE aid = 1... 2021-03-17 15:21:01.942 JST [450855] LOG: level: 3 CachedPlanQuery: 4096 total in 3 blocks; 464 free (0 chunks); 3632 used ..(snip).. 2021-03-17 15:21:01.945 JST [450855] LOG: level: 1 Timezones: 104128 total in 2 blocks; 2584 free (0 chunks); 101544 used 2021-03-17 15:21:01.945 JST [450855] LOG: level: 1 ErrorContext: 8192 total in 1 blocks; 7928 free (5 chunks); 264 used 2021-03-17 15:21:01.945 JST [450855] LOG: Grand total: 2802080 bytes in 1399 blocks; 480568 free (178 chunks); 2321512 used As above, the output is almost the same as MemoryContextStatsPrint() except for the way of expression of the level. MemoryContextStatsPrint() uses indents, but pg_print_backend_memory_contexts() writes it as "level: %d". Since there was discussion about enlarging StringInfo may cause errors on OOM[1], this patch calls elog for each context. As with MemoryContextStatsPrint(), each context shows 100 children at most. I once thought it should be configurable, but something like pg_print_backend_memory_contexts(PID, num_children) needs to send the 'num_children' from requestor to dumper and it seems to require another infrastructure. Creating a new GUC for this seems overkill. If MemoryContextStatsPrint(), i.e. showing 100 children at most is enough, this hard limit may be acceptable. Only superusers can call pg_print_backend_memory_contexts(). I'm going to add documentation and regression tests. Any thoughts? 
[1] https://www.postgresql.org/message-id/CAMsr%2BYGh%2Bsso5N6Q%2BFmYHLWC%3DBPCzA%2B5GbhYZSGruj2d0c7Vvg%40mail.gmail.com "r_d/strengthen_perf/print_memcon.md" 110L, 5642C written Regards, -- Atsushi Torikoshi NTT DATA CORPORATIONdiff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c index c6a8d4611e..e116f4a1be 100644 --- a/src/backend/storage/ipc/procsignal.c +++ b/src/backend/storage/ipc/procsignal.c @@ -30,6 +30,7 @@ #include "storage/shmem.h" #include "storage/sinval.h" #include "tcop/tcopprot.h" +#include "utils/memutils.h" /* * The SIGUSR1 signal is multiplexed to support signaling multiple event @@ -440,6 +441,20 @@ HandleProcSignalBarrierInterrupt(void) /* latch will be set by procsignal_sigusr1_handler */ } +/* + * HandleProcSignalPrintMemoryContext + * + * Handle receipt of an interrupt indicating print memory context. + * Signal handler portion of interrupt handling. + */ +static void +HandleProcSignalPrintMemoryContext(void) +{ + InterruptPending = true; + PrintMemoryContextPending = true; + /* latch will be set by procsignal_sigusr1_handler */ +} + /* * Perform global barrier related interrupt checking. * @@ -580,6 +595,25 @@ ProcessProcSignalBarrier(void) ConditionVariableBroadcast(&MyProcSignalSlot->pss_barrierCV); } +/* + * ProcessPrintMemoryContextInterrupt + * The portion of print memory context interrupt handling that runs + * outside of the signal handler. + */ +void +ProcessPrintMemoryContextInterrupt(void) +{ + PrintMemoryContextPending = false; + + ereport(LOG, + (errmsg("Printing memory contexts of PID %d", MyProcPid))); + + /* A har
Re: pgsql: Add libpq pipeline mode support to pgbench
Fabien COELHO wrote: > For consistency with the existing \if … \endif, ISTM that it could have > been named \batch … \endbatch or \pipeline … \endpipeline? "start" mirrors "end". To me, the analogy with \if-\endif is not obvious. Grammatically \if is meant to introduce the expression after it, whereas \startpipeline takes no argument. Functionally \startpipeline can be thought of as "open the valve" and \endpipeline "close the valve". They're "call-to-action" kinds of commands, and in that sense quite different from the \if-\endif pair. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: https://www.manitou-mail.org Twitter: @DanielVerite
Re: Calendar support in localization
On Tue, Mar 16, 2021 at 12:20 PM Thomas Munro wrote: > On Wed, Mar 17, 2021 at 6:31 AM Surafel Temesgen > wrote: > > Ethiopice calendar have 13 months so it can not be stored as date and > timestamp type and you approach seems more complicated and i suggest to > have this feature on the purpose of PostgreSQL popularities too not only > for my need > > I know, but the DATE and TIMESTAMPTZ datatypes don't intrinsically > know anything about months or other calendar concepts. Internally, > they are just a single number that counts the number of days or > seconds since an arbitrary epoch time. We are all in agreement about > how many times the Earth has rotated since then*. The calendar > concepts such as "day", "month", "year", whether Gregorian, Ethiopic, > Islamic, ... are all derivable from those numbers, if you know the > rules. > > Okay > > > > I think you suggesting this by expecting the implementation is difficult > but it's not that much difficult once you fully understand Gregorian > calendar and the calendar you work on > > Yeah, I am sure it's all just a bunch of simple integer maths. But > I'm talking about things like software architecture, maintainability, > cohesion, and getting maximum impact for the work we do. > > I may be missing some key detail though: why do you think it should be > a different type? The two reasons I can think of are: (1) the > slightly tricky detail that the date apparently changes at 1:00am > (which I don't think is a show stopper for this approach, I could > elaborate), (2) you may want dates to be formatted on the screen with > the Ethiopic calendar in common software like psql and GUI clients, > which may be easier to arrange with different types, but that seems to > be a cosmetic thing that could eventually be handled with tighter > locale integration with ICU. In the early stages you'd access > calendar logic through special functions with names like > icu_format_date(), or whatever.

As you mention above, whatever the calendar type is, we end up storing an integer that represents the date. So rather than re-implementing every function and operation for every calendar, we can use the existing Gregorian implementation as a base: if a new calendar wants to perform the same function or operation, it translates to the Gregorian one, uses the existing function or operation, and translates back to the working calendar. In this approach, the only functions we need for supporting a new calendar are a translator from the new calendar to the Gregorian one and from the Gregorian calendar to the new one, plus maybe input/output functions. What do you think of this implementation? regards Surafel
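To make the "translator" idea concrete, a sketch in C of such a conversion pair, reusing PostgreSQL's existing Julian-day helpers (date2j()/j2date()). The Ethiopic arithmetic follows the usual published 30-day-month/Pagume formulas, and the epoch constant is an assumption to be verified, not reviewed code:

```c
#include "postgres.h"

#include "utils/datetime.h"		/* date2j() and j2date() */

/*
 * Julian day number of 1 Mäskäräm 1, Amete Mihret era (assumed constant,
 * to be verified against a calendrical reference).
 */
#define ETHIOPIC_EPOCH_JDN	1724221

/* Ethiopic (y, m, d) -> Julian day number; valid for dates after the epoch */
static int
ethiopic2j(int y, int m, int d)
{
	/* 12 months of 30 days plus Pagume; leap day every 4th year */
	return ETHIOPIC_EPOCH_JDN - 1
		+ 365 * (y - 1) + y / 4
		+ 30 * (m - 1) + d;
}

/* Julian day number -> Ethiopic (y, m, d) */
static void
j2ethiopic(int jd, int *y, int *m, int *d)
{
	int			days = jd - ETHIOPIC_EPOCH_JDN;

	*y = (4 * days + 1463) / 1461;
	days -= 365 * (*y - 1) + *y / 4;
	*m = days / 30 + 1;
	*d = days % 30 + 1;
}

/* Round-trips through the existing Gregorian <-> Julian-day helpers */
static void
ethiopic_to_gregorian(int ey, int em, int ed, int *gy, int *gm, int *gd)
{
	j2date(ethiopic2j(ey, em, ed), gy, gm, gd);
}

static void
gregorian_to_ethiopic(int gy, int gm, int gd, int *ey, int *em, int *ed)
{
	j2ethiopic(date2j(gy, gm, gd), ey, em, ed);
}
```

With such a pair, every existing operator keeps working on the stored day number; only parsing, formatting and calendar-aware arithmetic (months, years) would need calendar-specific code.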
Re: Calendar support in localization
On Wed, Mar 17, 2021 at 9:54 AM Surafel Temesgen wrote: > As you mention above whatever the calendar type is we ended up storing an > integer that represent the date so rather than re-implementing every > function and operation for every calendar we can use existing Gerigorian > implementation as a base implementation and if new calendar want to perform > same function or operation it translate to Gregorian one and use the existing > function and operation and translate to back to working calendar. In this > approach the only function we want for supporting a new calendar is a > translator from the new calendar to Gregorian one and from Gerigorian > calendar to the new calendar and may be input/ output function. What do you > think of this implementation? I'm not sure what the best way of tackling this problem is, but I wanted to mention another possible approach: instead of actually using the timestamptz data type, have another data type that is binary-compatible with timestamptz - that is, it's the same thing on disk, so you can cast between the two data types for free. Then have separate input/output functions for it, separate operators and functions and so forth. It's not very obvious how to scale this kind of approach to a wide variety of calendar types, and as Thomas says, it would much cooler to be able to handle all of the ones that ICU knows how to support rather than just one. But, the problem I see with using timestamptz is that it's not so obvious how to get a different output format ... unless, I guess, we can cram it into DateStyle. And it's also much less obvious how you get the other functions and operators to do what you want, if it's different. -- Robert Haas EDB: http://www.enterprisedb.com
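As a concrete illustration of the binary-compatible-type idea, a sketch of an input function for a hypothetical ethiopic_date type that parses the alternative notation but stores an ordinary DateADT (the type name, the deliberately naive parsing, and the ethiopic2j() helper from the earlier sketch are all assumptions, not an existing API):

```c
#include "postgres.h"

#include "datatype/timestamp.h"	/* POSTGRES_EPOCH_JDATE */
#include "fmgr.h"
#include "utils/date.h"

/* hypothetical helper from the earlier conversion sketch */
extern int	ethiopic2j(int y, int m, int d);

/*
 * Input function for a hypothetical "ethiopic_date" type sharing date's
 * on-disk format (days since 2000-01-01); the output function, operators,
 * PG_MODULE_MAGIC and SQL-level declarations are omitted.
 */
PG_FUNCTION_INFO_V1(ethiopic_date_in);

Datum
ethiopic_date_in(PG_FUNCTION_ARGS)
{
	char	   *str = PG_GETARG_CSTRING(0);
	int			y,
				m,
				d;

	/* deliberately naive parsing; real code needs range validation */
	if (sscanf(str, "%d-%d-%d", &y, &m, &d) != 3)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_DATETIME_FORMAT),
				 errmsg("invalid input syntax for type ethiopic_date: \"%s\"",
						str)));

	/* store the same day number an ordinary date would store */
	PG_RETURN_DATEADT(ethiopic2j(y, m, d) - POSTGRES_EPOCH_JDATE);
}
```

At the SQL level, such a type could presumably be created with CREATE TYPE ... (LIKE = date) plus binary-coercible casts, so the stored value can also be read through the ordinary date operators wherever the Gregorian semantics are acceptable.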
Re: Getting better results from valgrind leak tracking
Andres Freund writes: > On 2021-03-16 20:50:17 -0700, Andres Freund wrote: >> What I meant was that I didn't understand how there's not a leak >> danger when compilation fails halfway through, given that the context >> in question is below TopMemoryContext and that I didn't see a relevant >> TRY block. But that probably there is something cleaning it up that I >> didn't see. > Looks like it's an actual leak: Yeah, I believe that. On the other hand, I'm not sure that such cases represent any real problem for production usage. I'm inclined to focus on non-error scenarios first. (Having said that, we probably have the ability to fix such things relatively painlessly now, by reparenting an initially-temporary context once we're done parsing.) Meanwhile, I'm still trying to understand why valgrind is whining about the rd_indexcxt identifier strings. AFAICS it shouldn't. regards, tom lane
Re: Calendar support in localization
Robert Haas writes: > It's not very obvious how to scale this kind of approach to a wide > variety of calendar types, and as Thomas says, it would much cooler to > be able to handle all of the ones that ICU knows how to support rather > than just one. But, the problem I see with using timestamptz is that > it's not so obvious how to get a different output format ... unless, I > guess, we can cram it into DateStyle. And it's also much less obvious > how you get the other functions and operators to do what you want, if > it's different. Yeah, I'm afraid that it probably is different. The most obvious example is in operations involving type interval: select now() + '1 month'::interval; That should almost certainly give a different answer when using a different calendar --- indeed the units of interest might not even be the same. (Do all human calendars use the concept of months?) I don't feel like DateStyle is chartered to affect the behavior of datetime operators; it's understood as tweaking the I/O behavior only. There might be more of a case for letting LC_TIME choose this behavior, but I bet the relevant standards only contemplate Gregorian calendars. Also, the SQL spec says in so many words that the SQL-defined datetime types follow the Gregorian calendar. So on the whole, new datatypes and operators seem like the way to go. I concur that if ICU has solved this problem, piggybacking on it seems more promising than writing our own code. regards, tom lane
RE: libpq debug log
Hi Tsunakawa-san, Alvaro-san,

Thank you very much for your review. It helped me a lot to make the patch better. I updated the patch to v26. This patch has been updated according to Tsunakawa-san's and Alvaro-san's review comments. The output is as follows:

```
2021-03-17 14:43:16.411238 > Terminate 4
2021-03-17 14:43:16.441775 > Query 155 "CREATE TABLESPACE regress_tblspacewith LOCATION '/home/iwata/pgsql/postgres/src/test/regress/testtablespace' WITH (some_nonexistent_parameter = true);"
2021-03-17 14:43:16.442935 < ErrorResponse 124 S "ERROR" V "ERROR" C "22023" M "unrecognized parameter "some_nonexistent_parameter"" F "reloptions.c" L "1447" R "parseRelOptionsInternal" \x00 "Z"
2021-03-17 14:43:16.442977 < ReadyForQuery 5 I
2021-03-17 14:43:16.443042 > Query 144 "CREATE TABLESPACE regress_tblspacewith LOCATION '/home/iwata/pgsql/postgres/src/test/regress/testtablespace' WITH (random_page_cost = 3.0);"
2021-03-17 14:43:16.444774 < CommandComplete 22 "CREATE TABLESPACE"
2021-03-17 14:43:16.444807 < ReadyForQuery 5 I
2021-03-17 14:43:16.444878 > Query 81 "SELECT spcoptions FROM pg_tablespace WHERE spcname = 'regress_tblspacewith';"
2021-03-17 14:43:16.448117 < RowDescription 35 1 "spcoptions" 1213 5 1009 65535 -1 0
2021-03-17 14:43:16.448157 < DataRow 32 1 22 '{random_page_cost=3.0}'
2021-03-17 14:43:16.448171 < CommandComplete 13 "SELECT 1"
```

> -Original Message-
> From: Tsunakawa, Takayuki/綱川 貴之
> Sent: Tuesday, March 16, 2021 12:03 PM

> (1)
> - Enables tracing of the client/server communication to a debugging
> file stream.
> + Enables tracing of the client/server communication to a debugging file
> + stream.
> + Only available when protocol version 3 or higher is used.
>
> The last line is unnecessary now that v2 is not supported.

I removed that sentence from the documentation file.

> (2)
> @@ -113,6 +114,7 @@ install: all installdirs install-lib
> $(INSTALL_DATA) $(srcdir)/libpq-fe.h '$(DESTDIR)$(includedir)'
> $(INSTALL_DATA) $(srcdir)/libpq-events.h '$(DESTDIR)$(includedir)'
> $(INSTALL_DATA) $(srcdir)/libpq-int.h '$(DESTDIR)$(includedir_internal)'
> + $(INSTALL_DATA) $(srcdir)/libpq-trace.h '$(DESTDIR)$(includedir_internal)'
> $(INSTALL_DATA) $(srcdir)/pqexpbuffer.h '$(DESTDIR)$(includedir_internal)'
> $(INSTALL_DATA) $(srcdir)/pg_service.conf.sample '$(DESTDIR)$(datadir)/pg_service.conf.sample'
>
> Why is this necessary?

I think it is not necessary. I removed it.

> (3) libpq-trace.h
> +#ifdef __cplusplus
> +extern "C"
> +{
>
> This is unnecessary because pqTraceOutputMessage() is for libpq's internal
> use. This extern is for the user's C++ application to call public libpq
> functions.
>
> +#include "libpq-fe.h"
> +#include "libpq-int.h"
>
> I think these can be removed by declaring the struct and function as follows:
>
> struct pg_conn;
> extern void pqTraceOutputMessage(struct pg_conn *conn, const char *message, bool toServer);
>
> But... after doing the above, only this declaration is left in libpq-trace.h.
> Why don't you put your original declaration using PGconn in libpq-int.h and remove
> libpq-trace.h?

I removed this file. I had been wondering whether to delete it during the development of the v25 patch. I had kept it for extensibility: if the tracing code is updated later and gains a lot of extern functions, keeping this header file would be meaningful. However, I don't think that will happen, so I removed it.
> (4) > + /* Trace message only when there is first 1 byte */ > + if (conn->Pfdebug && (conn->outCount < conn->outMsgStart)) > + pqTraceOutputMessage(conn, conn->outBuffer + > conn->outCount, true); > > () surrounding the condition after && can be removed. I removed this (). And This if () statement has been modified according to the review comment (14). > (5) > +static const char *pqGetProtocolMsgType(unsigned char c, > + > bool toServer); > > This is unnecessary because the function definition precedes its only one call > site. Yes, I removed this definition. > (6) > + * We accumulate frontend message pieces in an array as the libpq code > + writes > + * them, and log the complete message when pqTraceOutputFeMsg is called. > + * For backend, we print the pieces as soon as we receive them from the > server. > > The first paragraph is no longer true. I think this entire comment is > unnecessary. I removed this explanation. > (7) > +static const char * > +pqGetProtocolMsgType(unsigned char c, bool toServer) { > + if (toServer == true && > + c < lengthof(protocol_message_type_f) && > + protocol_message_type_f[c] != NULL) > + return protocol_message_type_f[c]; > + > + if (toServer == false && > + c < lengthof(protoco
Re: Getting better results from valgrind leak tracking
Hi, On Wed, Mar 17, 2021, at 07:16, Tom Lane wrote: > Andres Freund writes: > > On 2021-03-16 20:50:17 -0700, Andres Freund wrote: > Meanwhile, I'm still trying to understand why valgrind is whining > about the rd_indexcxt identifier strings. AFAICS it shouldn't. I found a way around that late last night. Need to mark the context itself as an allocation. But I made a mess on the way to that and need to clean the patch up before sending it (and need to drop my girlfriend off first). Andres
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
On Sun, Mar 14, 2021 at 04:06:45PM +0800, Julien Rouhaud wrote: > Recent conflict, thanks to cfbot. v18 attached. We are reaching the two-year mark on this feature, that everyone seems to agree is needed. Is any committer going to work on this to get it into PG 14? Should I take it? I just read the thread and I didn't see any open issues. Are there any? -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
Bruce Momjian writes: > We are reaching the two-year mark on this feature, that everyone seems > to agree is needed. Is any committer going to work on this to get it > into PG 14? Should I take it? I still say that it's a serious mistake to sanctify a query ID calculation method that was designed only for pg_stat_statement's needs as the one true way to do it. But that's what exposing it in a core view would do. regards, tom lane
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
On Wed, Mar 17, 2021 at 11:28:38AM -0400, Tom Lane wrote: > Bruce Momjian writes: > > We are reaching the two-year mark on this feature, that everyone seems > > to agree is needed. Is any committer going to work on this to get it > > into PG 14? Should I take it? > > I still say that it's a serious mistake to sanctify a query ID calculation > method that was designed only for pg_stat_statement's needs as the one > true way to do it. But that's what exposing it in a core view would do. OK, I am fine with creating a new method, and maybe having pg_stat_statements use it. Is that the direction we should be going in? I do think we need _some_ method in core if we are going to be exposing this value in pg_stat_activity and log_line_prefix. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: PoC/WIP: Extended statistics on expressions
On Sun, 7 Mar 2021 at 21:10, Tomas Vondra wrote: > > 2) ndistinct > > There's one thing that's bugging me, in how we handle "partial" matches. > For each expression we track both the original expression and the Vars > we extract from it. If we can't find a statistics matching the whole > expression, we try to match those individual Vars, and we remove the > matching ones from the list. And in the end we multiply the estimates > for the remaining Vars. > > This works fine with one matching ndistinct statistics. Consider for example > > GROUP BY (a+b), (c+d) > > with statistics on [(a+b),c] - that is, expression and one column. I've just been going over this example, and I think it actually works slightly differently from how you described, though I haven't worked out the full general implications of that. > We parse the expressions into two GroupExprInfo > > {expr: (a+b), vars: [a, b]} > {expr: (c+d), vars: [c, d]} > Here, I think what you actually get, in the presence of stats on [(a+b),c] is actually the following two GroupExprInfos: {expr: (a+b), vars: []} {expr: (c+d), vars: [c, d]} because of the code immediately after this comment in estimate_num_groups(): /* * If examine_variable is able to deduce anything about the GROUP BY * expression, treat it as a single variable even if it's really more * complicated. */ As it happens, that makes no difference in this case, where there is just this one stats object, but it does change things when there are two stats objects. > and the statistics matches the first item exactly (the expression). The > second expression is not in the statistics, but we match "c". So we end > up with an estimate for "(a+b), c" and have one remaining GroupExprInfo: > > {expr: (c+d), vars: [d]} Right. > Without any other statistics we estimate that as ndistinct for "d", so > we end up with > > ndistinct((a+b), c) * ndistinct(d) > > which mostly makes sense. It assumes ndistinct(c+d) is product of the > ndistinct estimates, but that's kinda what we've been always doing. Yes, that appears to be what happens, and it's probably the best that can be done with the available stats. > But now consider we have another statistics on just (c+d). In the second > loop we end up matching this expression exactly, so we end up with > > ndistinct((a+b), c) * ndistinct((c+d)) In this case, with stats on (c+d) as well, the two GroupExprInfos built at the start change to: {expr: (a+b), vars: []} {expr: (c+d), vars: []} so it actually ends up not using any multivariate stats, but instead uses the two univariate expression stats, giving ndistinct((a+b)) * ndistinct((c+d)) which actually seems pretty good, and gave very good estimates in the simple test case I tried. > i.e. we kinda use the "c" twice. Which is a bit unfortunate. I think > what we should do after the first loop is just discarding the whole > expression and "expand" into per-variable GroupExprInfo, so in the > second step we would not match the (c+d) statistics. Not using the (c+d) stats would give either ndistinct((a+b)) * ndistinct(c) * ndistinct(d) or ndistinct((a+b), c) * ndistinct(d) depending on exactly how the algorithm was changed. In my test, these both gave worse estimates, but there are probably other datasets for which they might do better. It all depends on how much correlation there is between (a+b) and c. I suspect that there is no optimal strategy for handling overlapping stats that works best for all datasets, but the current algorithm seems to do a pretty decent job. 
> Of course, maybe there's a better way to pick the statistics, but I > think our conclusion so far was that people should just create > statistics covering all the columns in the query, to not have to match > multiple statistics like this. Yes, I think that's always likely to work better, especially for ndistinct stats, where all possible permutations of subsets of the columns are included, so a single ndistinct stat can work well for a range of different queries. Regards, Dean
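For readers following the thread, here is a minimal SQL sketch of the scenario under discussion, assuming the expression-statistics syntax from this patch series; the table, data, and statistics names are invented purely for illustration:

CREATE TABLE t (a int, b int, c int, d int);
INSERT INTO t SELECT i % 10, i % 7, i % 11, i % 13
  FROM generate_series(1, 100000) AS i;

-- Multivariate ndistinct statistics on an expression plus a plain column,
-- matching the first case discussed above ...
CREATE STATISTICS s1 (ndistinct) ON (a + b), c FROM t;

-- ... and univariate expression statistics on the second expression alone.
CREATE STATISTICS s2 ON (c + d) FROM t;

ANALYZE t;

-- Compare the estimated and actual number of groups.
EXPLAIN ANALYZE SELECT (a + b), (c + d) FROM t GROUP BY (a + b), (c + d);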
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
Bruce Momjian writes: > On Wed, Mar 17, 2021 at 11:28:38AM -0400, Tom Lane wrote: >> I still say that it's a serious mistake to sanctify a query ID calculation >> method that was designed only for pg_stat_statement's needs as the one >> true way to do it. But that's what exposing it in a core view would do. > OK, I am fine with creating a new method, and maybe having > pg_stat_statements use it. Is that the direction we should be going in? The point is that we've understood Query.queryId as something that different extensions might calculate differently for their own needs. In particular it's easy to imagine extensions that want an ID that is less fuzzy than what pg_stat_statements wants. We never had a plan for how two such extensions could co-exist, but at least it was possible to use one if you didn't use another. If this gets moved into core then there will basically be only one way that anyone can do it. Maybe what we need is a design for allowing more than one query ID. > I do think we need _some_ method in core if we are going to be exposing > this value in pg_stat_activity and log_line_prefix. I'm basically objecting to the conclusion that we should do either of those. There is no way around the fact that it will break every user of Query.queryId other than pg_stat_statements, unless they are okay with whatever definition pg_stat_statements is using (which is a moving target BTW). regards, tom lane
Re: pl/pgsql feature request: shorthand for argument and local variable references
st 17. 3. 2021 v 9:20 odesílatel Michael Paquier napsal: > On Wed, Mar 17, 2021 at 02:06:57PM +0800, Julien Rouhaud wrote: > > I also think that there should be a single usable top label, otherwise > it will > > lead to confusing code and it can be a source of bug. > > Okay, that's fine by me. Could it be possible to come up with an > approach that does not hijack the namespace list contruction and the > lookup logic as much as it does now? I get it that the patch is done > this way because of the ordering of operations done for the initial ns > list creation and the grammar parsing that adds the routine label on > top of the root one, but I'd like to believe that there are more solid > ways to achieve that, like for example something that decouples the > root label and its alias, or something that associates an alias > directly with its PLpgSQL_nsitem entry? > I am checking it now, and I don't see any easy solution. The namespace is a one direction tree - only backward iteration from leafs to roots is supported. At the moment, when I try to replace the name of root ns_item, this root ns_item is referenced by the function's arguments (NS_TYPE_VAR type). So anytime I need some helper ns_item node, that can be a persistent target for these nodes. In this case is a memory overhead of just one ns_item. orig_label <- argument1 <- argument2 The patch has to save the original orig_label because it can be referenced from argument1 or by "FOUND" variable. New graphs looks like new_label <- invalidated orig_label <- argument1 <- ... This tree has a different direction than is usual, and then replacing the root node is not simple. Second solution (significantly more simple) is an additional pointer in ns_item structure. In almost all cases this pointer will not be used. Because ns_item uses a flexible array, then union cannot be used. I implemented this in a patch marked as "alias-implementation". What do you think about it? Pavel > -- > Michael > diff --git a/doc/src/sgml/plpgsql.sgml b/doc/src/sgml/plpgsql.sgml index 9242c54329..ed8e774899 100644 --- a/doc/src/sgml/plpgsql.sgml +++ b/doc/src/sgml/plpgsql.sgml @@ -292,7 +292,9 @@ $$ LANGUAGE plpgsql; special variables such as FOUND (see ). The outer block is labeled with the function's name, meaning that parameters and special - variables can be qualified with the function's name. + variables can be qualified with the function's name. The name of this outer + block can be changed by inserting special command + #routine_label new_name at the start of the function. @@ -435,6 +437,31 @@ $$ LANGUAGE plpgsql; + + The function's argument can be qualified with the function name: + +CREATE FUNCTION sales_tax(subtotal real) RETURNS real AS $$ +BEGIN +RETURN sales_tax.subtotal * 0.06; +END; +$$ LANGUAGE plpgsql; + + + + + Sometimes the function name is too long and it can be more practical to use + some shorter label. 
The top namespace label can be changed (along with + the functions' arguments) using the option routine_label: + +CREATE FUNCTION sales_tax(subtotal real) RETURNS real AS $$ +#routine_label s +BEGIN +RETURN s.subtotal * 0.06; +END; +$$ LANGUAGE plpgsql; + + + Some more examples: diff --git a/src/pl/plpgsql/src/pl_comp.c b/src/pl/plpgsql/src/pl_comp.c index ce8d97447d..d32e050c32 100644 --- a/src/pl/plpgsql/src/pl_comp.c +++ b/src/pl/plpgsql/src/pl_comp.c @@ -378,6 +378,10 @@ do_compile(FunctionCallInfo fcinfo, */ plpgsql_ns_init(); plpgsql_ns_push(NameStr(procStruct->proname), PLPGSQL_LABEL_BLOCK); + + /* save top ns for possibility to alter top label */ + function->root_ns = plpgsql_ns_top(); + plpgsql_DumpExecTree = false; plpgsql_start_datums(); diff --git a/src/pl/plpgsql/src/pl_funcs.c b/src/pl/plpgsql/src/pl_funcs.c index 919b968826..7132da35d1 100644 --- a/src/pl/plpgsql/src/pl_funcs.c +++ b/src/pl/plpgsql/src/pl_funcs.c @@ -101,6 +101,7 @@ plpgsql_ns_additem(PLpgSQL_nsitem_type itemtype, int itemno, const char *name) nse->itemtype = itemtype; nse->itemno = itemno; nse->prev = ns_top; + nse->alias = NULL; strcpy(nse->name, name); ns_top = nse; } @@ -141,7 +142,7 @@ plpgsql_ns_lookup(PLpgSQL_nsitem *ns_cur, bool localmode, nsitem->itemtype != PLPGSQL_NSTYPE_LABEL; nsitem = nsitem->prev) { - if (strcmp(nsitem->name, name1) == 0) + if (strcmp(nsitem->alias ? nsitem->alias : nsitem->name, name1) == 0) { if (name2 == NULL || nsitem->itemtype != PLPGSQL_NSTYPE_VAR) @@ -155,13 +156,13 @@ plpgsql_ns_lookup(PLpgSQL_nsitem *ns_cur, bool localmode, /* Check this level for qualified match to variable name */ if (name2 != NULL && - strcmp(nsitem->name, name1) == 0) + strcmp(nsitem->alias ? nsitem->alias : nsitem->name, name1) == 0) { for (nsitem = ns_cur; nsitem->itemtype != PLPGSQL_NSTYPE_LABEL; nsitem = nsitem->prev) { -if (st
Re: WIP: WAL prefetch (another approach)
On 2/15/21 12:18 AM, Stephen Frost wrote: > Greetings, > > ... > I think there's potential for some significant optimization going forward, but I think it's basically optimization over what we're doing today. As this is already a nontrivial patch, I'd argue for doing so separately. >>> >>> This seems like a great optimization, albeit a fair bit of code, for a >>> relatively uncommon use-case, specifically where full page writes are >>> disabled or very large checkpoints. As that's the case though, I would >>> think it's reasonable to ask that it go out of its way to avoid slowing >>> down the more common configurations, particularly since it's proposed to >>> have it on by default (which I agree with, provided it ends up improving >>> the common cases, which I think the suggestions above would certainly >>> make it more likely to do). >> >> I'm OK to do some benchmarking, but it's not quite clear to me why does it >> matter if the checkpoints are smaller than shared buffers? IMO what matters >> is how "localized" the updates are, i.e. how likely it is to hit the same >> page repeatedly (in a short amount of time). Regular pgbench is not very >> suitable for that, but some non-uniform distribution should do the trick, I >> think. > > I suppose strictly speaking it'd be > Min(wal_decode_buffer_size,checkpoint_size), but yes, you're right that > it's more about the wal_decode_buffer_size than the checkpoint's size. > Apologies for the confusion. As suggested above, one way to benchmark > this to really see if there's any issue would be to increase > wal_decode_buffer_size to some pretty big size and then compare the > performance vs. unpatched. I'd think that could even be done with > pgbench, so you're not having to arrange for the same pages to get > updated over and over. > What exactly would be the point of such benchmark? I don't think the patch does prefetching based on wal_decode_buffer_size, that just says how far ahead we decode - the prefetch distance I is defined by maintenance_io_concurrency. But it's not clear to me what exactly would the result say about the necessity of the optimization at hand (skipping prefetches for blocks with recent FPI). If the the maintenance_io_concurrency is very high, the probability that a block is evicted prematurely grows, making the prefetch useless in general. How does this say anything about the problem at hand? Sure, we'll do unnecessary I/O, causing issues, but that's a bit like complaining the engine gets very hot when driving on a highway in reverse. AFAICS to measure the worst case, you'd need a workload with a lot of FPIs, and very little actual I/O. That means, data set that fits into memory (either shared buffers or RAM), and short checkpoints. But that's exactly the case where you don't need prefetching ... >>> Perhaps this already improves the common cases and is worth the extra >>> code on that basis, but I don't recall seeing much in the way of >>> benchmarking in this thread for that case- that is, where FPIs are >>> enabled and checkpoints are smaller than shared buffers. Jakub's >>> testing was done with FPWs disabled and Tomas's testing used checkpoints >>> which were much larger than the size of shared buffers on the system >>> doing the replay. While it's certainly good that this patch improves >>> those cases, we should also be looking out for the worst case and make >>> sure that the patch doesn't degrade performance in that case. >> >> I'm with Andres on this. 
It's fine to leave some possible optimizations on >> the table for the future. And even if some workloads are affected >> negatively, it's still possible to disable the prefetching. > > While I'm generally in favor of this argument, that a feature is > particularly important and that it's worth slowing down the common cases > to enable it, I dislike that it's applied inconsistently. I'd certainly If you have a workload where this happens to cause issues, you can just disable that. IMHO that's a perfectly reasonable engineering approach, where we get something that significantly improves 80% of the cases, allow disabling it for cases where it might cause issues, and then improve it in the next version. > feel better about it if we had actual performance numbers to consider. > I don't doubt the possibility that the extra prefetch's just don't > amount to enough to matter but I have a hard time seeing them as not > having some cost and without actually measuring it, it's hard to say > what that cost is. > > Without looking farther back than the last record, we could end up > repeatedly asking for the same blocks to be prefetched too- > > FPI for block 1 > FPI for block 2 > WAL record for block 1 > WAL record for block 2 > WAL record for block 1 > WAL record for block 2 > WAL record for block 1 > WAL record for block 2 > > ... etc. > > Entirely possible my math is off, but seems like the worst case > situation right now might end up with some 4500 u
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
On Wed, Mar 17, 2021 at 12:01:38PM -0400, Tom Lane wrote: > Bruce Momjian writes: > > On Wed, Mar 17, 2021 at 11:28:38AM -0400, Tom Lane wrote: > >> I still say that it's a serious mistake to sanctify a query ID calculation > >> method that was designed only for pg_stat_statement's needs as the one > >> true way to do it. But that's what exposing it in a core view would do. > > > OK, I am fine with creating a new method, and maybe having > > pg_stat_statements use it. Is that the direction we should be going in? > > The point is that we've understood Query.queryId as something that > different extensions might calculate differently for their own needs. > In particular it's easy to imagine extensions that want an ID that is > less fuzzy than what pg_stat_statements wants. We never had a plan for > how two such extensions could co-exist, but at least it was possible > to use one if you didn't use another. If this gets moved into core > then there will basically be only one way that anyone can do it. Well, the patch docs say: Enables or disables in core query identifier computation. The extension requires a query --> identifier to be computed. Note that an external module can --> alternatively be used if the in core query identifier computation specification doesn't suit your need. In this case, in core computation must be disabled. The default is off. > Maybe what we need is a design for allowing more than one query ID. > > > I do think we need _some_ method in core if we are going to be exposing > > this value in pg_stat_activity and log_line_prefix. > > I'm basically objecting to the conclusion that we should do either > of those. There is no way around the fact that it will break every > user of Query.queryId other than pg_stat_statements, unless they > are okay with whatever definition pg_stat_statements is using (which > is a moving target BTW). I thought the above doc patch feature avoided this problem because an extension can override the built-in query id. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
st 17. 3. 2021 v 17:03 odesílatel Tom Lane napsal: > Bruce Momjian writes: > > On Wed, Mar 17, 2021 at 11:28:38AM -0400, Tom Lane wrote: > >> I still say that it's a serious mistake to sanctify a query ID > calculation > >> method that was designed only for pg_stat_statement's needs as the one > >> true way to do it. But that's what exposing it in a core view would do. > > > OK, I am fine with creating a new method, and maybe having > > pg_stat_statements use it. Is that the direction we should be going in? > > The point is that we've understood Query.queryId as something that > different extensions might calculate differently for their own needs. > In particular it's easy to imagine extensions that want an ID that is > less fuzzy than what pg_stat_statements wants. We never had a plan for > how two such extensions could co-exist, but at least it was possible > to use one if you didn't use another. If this gets moved into core > then there will basically be only one way that anyone can do it. > > Maybe what we need is a design for allowing more than one query ID. > Theoretically there can be a hook for calculation of queryid, that can be by used extension. Default can be assigned with a method that is used by pg_stat_statements. I don't think it is possible to use more different query id for pg_stat_statements so this solution can be simple. regards Pavel > > > I do think we need _some_ method in core if we are going to be exposing > > this value in pg_stat_activity and log_line_prefix. > > I'm basically objecting to the conclusion that we should do either > of those. There is no way around the fact that it will break every > user of Query.queryId other than pg_stat_statements, unless they > are okay with whatever definition pg_stat_statements is using (which > is a moving target BTW). > > regards, tom lane > > >
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
On Wed, Mar 17, 2021 at 05:16:50PM +0100, Pavel Stehule wrote: > > > st 17. 3. 2021 v 17:03 odesílatel Tom Lane napsal: > > Bruce Momjian writes: > > On Wed, Mar 17, 2021 at 11:28:38AM -0400, Tom Lane wrote: > >> I still say that it's a serious mistake to sanctify a query ID > calculation > >> method that was designed only for pg_stat_statement's needs as the one > >> true way to do it. But that's what exposing it in a core view would > do. > > > OK, I am fine with creating a new method, and maybe having > > pg_stat_statements use it. Is that the direction we should be going in? > > The point is that we've understood Query.queryId as something that > different extensions might calculate differently for their own needs. > In particular it's easy to imagine extensions that want an ID that is > less fuzzy than what pg_stat_statements wants. We never had a plan for > how two such extensions could co-exist, but at least it was possible > to use one if you didn't use another. If this gets moved into core > then there will basically be only one way that anyone can do it. > > Maybe what we need is a design for allowing more than one query ID. > > > Theoretically there can be a hook for calculation of queryid, that can be by > used extension. Default can be assigned with a method that is used by > pg_stat_statements. Yes, that is what the code patch says it does. > I don't think it is possible to use more different query id for > pg_stat_statements so this solution can be simple. Agreed. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: libpq debug log
Hello In pqTraceOutputString(), you can use the return value from fprintf to move the cursor -- no need to count chars. I still think that the message-type specific functions should print the message type, rather than having the string arrays. -- Álvaro Herrera Valdivia, Chile "La gente vulgar sólo piensa en pasar el tiempo; el que tiene talento, en aprovecharlo"
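For context, the suggestion here is simply to rely on fprintf() returning the number of characters it wrote. A rough, hypothetical sketch (not the actual pqTraceOutputString() signature, just the shape of the idea):

#include <stdio.h>

/*
 * Hypothetical helper: print a quoted string and report how far the
 * logical cursor advanced, using fprintf's return value instead of
 * counting characters by hand.
 */
static int
trace_output_string(FILE *f, const char *s)
{
	int written = fprintf(f, "\"%s\"", s);

	/* fprintf returns the number of characters written, or a negative value on error */
	return (written < 0) ? 0 : written;
}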
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
On Wed, Mar 17, 2021 at 12:24:44PM -0400, Bruce Momjian wrote: > On Wed, Mar 17, 2021 at 05:16:50PM +0100, Pavel Stehule wrote: > > > > > > st 17. 3. 2021 v 17:03 odesílatel Tom Lane napsal: > > > > Bruce Momjian writes: > > > On Wed, Mar 17, 2021 at 11:28:38AM -0400, Tom Lane wrote: > > >> I still say that it's a serious mistake to sanctify a query ID > > calculation > > >> method that was designed only for pg_stat_statement's needs as the > > one > > >> true way to do it. But that's what exposing it in a core view would > > do. > > > > > OK, I am fine with creating a new method, and maybe having > > > pg_stat_statements use it. Is that the direction we should be going > > in? > > > > The point is that we've understood Query.queryId as something that > > different extensions might calculate differently for their own needs. > > In particular it's easy to imagine extensions that want an ID that is > > less fuzzy than what pg_stat_statements wants. We never had a plan for > > how two such extensions could co-exist, but at least it was possible > > to use one if you didn't use another. If this gets moved into core > > then there will basically be only one way that anyone can do it. > > > > Maybe what we need is a design for allowing more than one query ID. > > > > > > Theoretically there can be a hook for calculation of queryid, that can be by > > used extension. Default can be assigned with a method that is used by > > pg_stat_statements. > > Yes, that is what the code patch says it does. > > > I don't think it is possible to use more different query id for > > pg_stat_statements so this solution can be simple. > > Agreed. Actually, putting the query identifier computation in the core makes it way more tunable, even if it's counterintuitive. What it means is that you can now choose to use the usual pgss algorithm or a different one for log_line_prefix and pg_stat_activity.queryid, but also that you can now use pgss with a different query id algorithm. That's another thing users have been requesting for a long time. I originally suggested making it clearer by having an enum GUC rather than a boolean, say compute_queryid = [ none | core | external ], and if set to external then a hook would be explicitly called. Right now, "none" and "external" are both tied to compute_queryid = off, and which one you get depends on whether an extension is computing a queryid during post_parse_analyse_hook. It could later be extended to suit other needs if we ever come to some agreement (for instance "legacy", "logical_replication_stable" or whatever better name we can find for something that doesn't depend on Oid).
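As a rough illustration of how such a hook-based override can work today, here is a minimal, hypothetical extension sketch. It assumes the existing two-argument post_parse_analyze_hook, and the text-hash "fingerprint" is a toy stand-in for a real jumbling scheme such as the one pg_stat_statements uses:

#include "postgres.h"
#include "fmgr.h"
#include "parser/analyze.h"
#include "common/hashfn.h"

PG_MODULE_MAGIC;

static post_parse_analyze_hook_type prev_post_parse_analyze_hook = NULL;

/*
 * Assign a custom query identifier after parse analysis.  Hashing the query
 * text is purely illustrative; a serious implementation would walk (jumble)
 * the Query tree instead.
 */
static void
my_post_parse_analyze(ParseState *pstate, Query *query)
{
	if (prev_post_parse_analyze_hook)
		prev_post_parse_analyze_hook(pstate, query);

	if (query->queryId == UINT64CONST(0) && pstate->p_sourcetext != NULL)
		query->queryId = hash_bytes_extended((const unsigned char *) pstate->p_sourcetext,
											 strlen(pstate->p_sourcetext), 0);
}

void
_PG_init(void)
{
	prev_post_parse_analyze_hook = post_parse_analyze_hook;
	post_parse_analyze_hook = my_post_parse_analyze;
}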
Re: Index Skip Scan (new UniqueKeys)
> On Wed, Mar 17, 2021 at 03:28:00AM +0100, Tomas Vondra wrote: > Hi, > > I took a look at the new patch series, focusing mostly on the uniquekeys > part. It'd be a bit tedious to explain all the review comments here, so > attached is a patch series with a "review" patch for some of the parts. Great, thanks. > Most of it is fairly small (corrections to comments etc.), I'll go over > the more serious part so that we can discuss it here. I'll keep it split > per parts of the original patch series. > I suggest looking for XXX and FIXME comments in all the review patches. > > > 0001 > > > > > 0002 > > In fact both 0001 & 0002 belong to another thread, which these days span [1], [2]. I've included them only because they happened to be a dependency for index skip scan following David suggestions, sorry if it's confusing. At the same time the author behind 0001 & 0002 is present in this thread as well, maybe Andy can answer these comments right here and better than me. > 0003 > > > Just some comments/whitespace. > > > 0004 > > > I wonder why we don't include this in explain TEXT format? Seems it > might make it harder to write regression tests for this? It's easier to > just check that we deduced the right unique key(s) than having to > construct an example where it actually changes the plan. Yeah, good point. I believe originally it was like that to not make explain too verbose for skip scans, but displaying prefix definitely could be helpful for testing, so will do this (and address other comments as well). [1]: https://www.postgresql.org/message-id/flat/caku4awpqjaqjwq2x-ar9g3+zhrzu1k8hnp7a+_mluov-n5a...@mail.gmail.com [2]: https://www.postgresql.org/message-id/flat/caku4awru35c9g3ce15jmvwh6b2hzf4hf7czukrsiktv7akr...@mail.gmail.com
pg_dump new feature: exporting functions only. Bad or good idea ?
Hey hackers, I had this idea, which I raised and cherished like my baby, to add a switch in `pg_dump` to allow exporting stored functions (and procedures) only. However, when I finally got the time to look at it in detail, I found out there was no way to resolve the dependencies of the functions and procedures, so the exported file, when replayed, could lead to invalid objects. So, I decided this would not make Postgres better and decided to walk away from this patch. Anyhow, when sharing my thoughts, several people told me to ask the community about adding this feature because it could be useful in some use cases. Another argument is that, should you have all your functions in one schema and your tables in another, exporting only the function schema already leads to the same kind of file, one that could create invalid objects when it is replayed against a database that does not have the tables. Of course, the documentation would add a warning about object invalidity should only the stored functions/procedures be exported. So, my question is: what do you think about such a feature? Is it worth it? Have a nice day, Lætitia
Re: pg_dump new feature: exporting functions only. Bad or good idea ?
On 3/17/21 6:00 PM, Lætitia Avrot wrote: > Hey hackers, > > I had this idea, that I raised and cherished like my baby to add a switch > in `pg_dump` to allow exporting stored functions (and procedures) only. > > However, when I finally got the time to look at it in detail, I found out > there was no way to solve the dependencies in the functions and procedures, > so that the exported file, when re-played could lead to invalid objects. > > So, I decided this would not make Postgres better and decide to walk off > this patch. > > Anyhow, when sharing my thoughts, several people told me to ask the > community about adding this feature because this could be useful in some > use cases. Another argument is that should you have all your functions in > one schema and your tables in another, exporting only the function schema > will lead to the same kind of file that could lead to invalid objects > created when the file is re-played against a database that does not have > the tables. > > Of course, the documentation would add a warning against object invalidity > should only the stored functions/procedures be exported. > > So, my question is: what do you think about such a feature? is it worth it? Yes, it is absolutely worth it to be able to extract just the functions from a database. I have wanted this several times. -- Vik Fearing
Re: [patch] [doc] Minor variable related cleanup and rewording of plpgsql docs
I wrote: > I took a stab at doing that, just to see what it might look like. > I thought it comes out pretty well, really -- see what you think. Hearing nothing further, I pushed that after another round of copy-editing. There's still plenty of time to revise it if anybody has further comments. regards, tom lane
Re: pg_dump new feature: exporting functions only. Bad or good idea ?
On Wed, Mar 17, 2021 at 2:00 PM Lætitia Avrot wrote: > > Hey hackers, > > I had this idea, that I raised and cherished like my baby to add a switch in `pg_dump` to allow exporting stored functions (and procedures) only. > > However, when I finally got the time to look at it in detail, I found out there was no way to solve the dependencies in the functions and procedures, so that the exported file, when re-played could lead to invalid objects. > > So, I decided this would not make Postgres better and decide to walk off this patch. > > Anyhow, when sharing my thoughts, several people told me to ask the community about adding this feature because this could be useful in some use cases. Another argument is that should you have all your functions in one schema and your tables in another, exporting only the function schema will lead to the same kind of file that could lead to invalid objects created when the file is re-played against a database that does not have the tables. > > Of course, the documentation would add a warning against object invalidity should only the stored functions/procedures be exported. > > So, my question is: what do you think about such a feature? is it worth it? > Make total sense since we already have --function=NAME(args) on pg_restore and it doesn't solve dependencies... so we can add it to also export function/procedure contents. +1 for general idea. -- Fabrízio de Royes Mello Timbira - http://www.timbira.com.br/ PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
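For reference, a rough sketch of the existing workaround being alluded to, using pg_restore's --function switch against a custom-format dump (the archive name and function signature below are invented for illustration):

# Dump the whole database in custom format.
pg_dump -Fc -f mydb.dump mydb

# List the archive's contents and look at the function entries.
pg_restore -l mydb.dump | grep -i ' FUNCTION '

# Emit the SQL for one named function from the archive.
pg_restore --function='sales_tax(real)' -f sales_tax.sql mydb.dump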
Re: pg_dump new feature: exporting functions only. Bad or good idea ?
Vik Fearing writes: > On 3/17/21 6:00 PM, Lætitia Avrot wrote: >> However, when I finally got the time to look at it in detail, I found out >> there was no way to solve the dependencies in the functions and procedures, >> so that the exported file, when re-played could lead to invalid objects. >> ... >> So, my question is: what do you think about such a feature? is it worth it? > Yes, it is absolutely worth it to be able to extract just the functions > from a database. I have wanted this several times. Selective dumps always have a risk of not being restorable on their own; I don't see that "functions only" is noticeably less safe than "just these tables", or other cases that we support already. What I'm wondering about is how this might interact with the discussion at [1]. regards, tom lane [1] https://www.postgresql.org/message-id/flat/CAFj8pRB10wvW0CC9Xq=1XDs=zcqxer3cblcnza+qix4cuh-...@mail.gmail.com
Re: PoC/WIP: Extended statistics on expressions
Hi, On 3/17/21 4:55 PM, Dean Rasheed wrote: > On Sun, 7 Mar 2021 at 21:10, Tomas Vondra > wrote: >> >> 2) ndistinct >> >> There's one thing that's bugging me, in how we handle "partial" matches. >> For each expression we track both the original expression and the Vars >> we extract from it. If we can't find a statistics matching the whole >> expression, we try to match those individual Vars, and we remove the >> matching ones from the list. And in the end we multiply the estimates >> for the remaining Vars. >> >> This works fine with one matching ndistinct statistics. Consider for example >> >> GROUP BY (a+b), (c+d) >> >> with statistics on [(a+b),c] - that is, expression and one column. > > I've just been going over this example, and I think it actually works > slightly differently from how you described, though I haven't worked > out the full general implications of that. > >> We parse the expressions into two GroupExprInfo >> >> {expr: (a+b), vars: [a, b]} >> {expr: (c+d), vars: [c, d]} >> > > Here, I think what you actually get, in the presence of stats on > [(a+b),c] is actually the following two GroupExprInfos: > > {expr: (a+b), vars: []} > {expr: (c+d), vars: [c, d]} > Yeah, right. To be precise, we get {expr: (a+b), vars: [(a+b)]} because in the first case we pass NIL, so add_unique_group_expr treats the whole expression as a var (a bit strange, but OK). > because of the code immediately after this comment in estimate_num_groups(): > > /* > * If examine_variable is able to deduce anything about the GROUP BY > * expression, treat it as a single variable even if it's really more > * complicated. > */ > > As it happens, that makes no difference in this case, where there is > just this one stats object, but it does change things when there are > two stats objects. > >> and the statistics matches the first item exactly (the expression). The >> second expression is not in the statistics, but we match "c". So we end >> up with an estimate for "(a+b), c" and have one remaining GroupExprInfo: >> >> {expr: (c+d), vars: [d]} > > Right. > >> Without any other statistics we estimate that as ndistinct for "d", so >> we end up with >> >> ndistinct((a+b), c) * ndistinct(d) >> >> which mostly makes sense. It assumes ndistinct(c+d) is product of the >> ndistinct estimates, but that's kinda what we've been always doing. > > Yes, that appears to be what happens, and it's probably the best that > can be done with the available stats. > >> But now consider we have another statistics on just (c+d). In the second >> loop we end up matching this expression exactly, so we end up with >> >> ndistinct((a+b), c) * ndistinct((c+d)) > > In this case, with stats on (c+d) as well, the two GroupExprInfos > built at the start change to: > > {expr: (a+b), vars: []} > {expr: (c+d), vars: []} > > so it actually ends up not using any multivariate stats, but instead > uses the two univariate expression stats, giving > > ndistinct((a+b)) * ndistinct((c+d)) > > which actually seems pretty good, and gave very good estimates in the > simple test case I tried. > Yeah, that works pretty well in this case. I wonder if we'd be better off extracting the Vars and doing what I mistakenly described as the current behavior. That's essentially mean extracting the Vars even in the case where we now pass NIL. 
My concern is that the current behavior (where we prefer expression stats over multi-column stats to some extent) works fine as long as the parts are independent, but once there's dependency it's probably more likely to produce underestimates. I think underestimates for grouping estimates were a risk in the past, so let's not make that worse. >> i.e. we kinda use the "c" twice. Which is a bit unfortunate. I think >> what we should do after the first loop is just discarding the whole >> expression and "expand" into per-variable GroupExprInfo, so in the >> second step we would not match the (c+d) statistics. > > Not using the (c+d) stats would give either > > ndistinct((a+b)) * ndistinct(c) * ndistinct(d) > > or > > ndistinct((a+b), c) * ndistinct(d) > > depending on exactly how the algorithm was changed. In my test, these > both gave worse estimates, but there are probably other datasets for > which they might do better. It all depends on how much correlation > there is between (a+b) and c. > > I suspect that there is no optimal strategy for handling overlapping > stats that works best for all datasets, but the current algorithm > seems to do a pretty decent job. > >> Of course, maybe there's a better way to pick the statistics, but I >> think our conclusion so far was that people should just create >> statistics covering all the columns in the query, to not have to match >> multiple statistics like this. > > Yes, I think that's always likely to work better, especially for > ndistinct stats, where
Re: [HACKERS] Custom compression methods
On Wed, Mar 17, 2021 at 7:41 AM Dilip Kumar wrote: > 0002: > - Wrapper over heap_form_tuple and used in ExecEvalRow() and > ExecEvalFieldStoreForm() Instead of having heap_form_flattened_tuple(), how about heap_flatten_values(tupleDesc, values, isnull) that is documented to modify the values array? Then instead of replacing the heap_form_tuple() calls with a call to heap_form_flattened_tuple(), you just insert a call to heap_flatten_values() before the call to heap_form_tuple(). I think that might be easier for people looking at this code in the future to understand what's happening. -- Robert Haas EDB: http://www.enterprisedb.com
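To make the API shape being discussed concrete, here is a hypothetical sketch of the helper Robert describes. The name, and the use of plain detoasting as the flattening step, are illustrative only; the exact condition for what needs flattening is what the patch itself defines:

#include "postgres.h"
#include "access/detoast.h"
#include "access/htup_details.h"

/*
 * Hypothetical helper that scribbles on the caller's values[] array,
 * replacing any varlena datum that still carries a TOAST reference with a
 * detoasted copy, so that a subsequent heap_form_tuple() stores plain values.
 */
static void
heap_flatten_values(TupleDesc tupleDesc, Datum *values, bool *isnull)
{
	for (int i = 0; i < tupleDesc->natts; i++)
	{
		Form_pg_attribute att = TupleDescAttr(tupleDesc, i);

		if (isnull[i] || att->attlen != -1)
			continue;			/* only varlena attributes can be toasted */

		if (VARATT_IS_EXTERNAL(DatumGetPointer(values[i])))
			values[i] = PointerGetDatum(detoast_attr((struct varlena *) DatumGetPointer(values[i])));
	}
}

/*
 * Call pattern under this proposal:
 *     heap_flatten_values(tupdesc, values, isnull);
 *     tuple = heap_form_tuple(tupdesc, values, isnull);
 * versus the patch's single heap_form_flattened_tuple(tupdesc, values, isnull).
 */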
Re: Index Skip Scan (new UniqueKeys)
On 3/17/21 6:02 PM, Dmitry Dolgov wrote: >> On Wed, Mar 17, 2021 at 03:28:00AM +0100, Tomas Vondra wrote: >> Hi, >> >> I took a look at the new patch series, focusing mostly on the uniquekeys >> part. It'd be a bit tedious to explain all the review comments here, so >> attached is a patch series with a "review" patch for some of the parts. > > Great, thanks. > >> Most of it is fairly small (corrections to comments etc.), I'll go over >> the more serious part so that we can discuss it here. I'll keep it split >> per parts of the original patch series. >> I suggest looking for XXX and FIXME comments in all the review patches. >> >> >> 0001 >> >> >> >> >> 0002 >> >> > > In fact both 0001 & 0002 belong to another thread, which these days > span [1], [2]. I've included them only because they happened to be a > dependency for index skip scan following David suggestions, sorry if > it's confusing. > > At the same time the author behind 0001 & 0002 is present in this thread > as well, maybe Andy can answer these comments right here and better than me. > Ah, sorry for the confusion. In that case the review comments probably belong to the other threads, so we should move the discussion there. It's not clear to me which of the threads is the right one. >> 0003 >> >> >> Just some comments/whitespace. >> >> >> 0004 >> >> >> I wonder why we don't include this in explain TEXT format? Seems it >> might make it harder to write regression tests for this? It's easier to >> just check that we deduced the right unique key(s) than having to >> construct an example where it actually changes the plan. > > Yeah, good point. I believe originally it was like that to not make > explain too verbose for skip scans, but displaying prefix definitely > could be helpful for testing, so will do this (and address other > comments as well). > Cool. Thanks. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY
On 2021-Mar-15, Alvaro Herrera wrote: > Here's a fixup patch to do it that way. I tried running the commands > you showed and one of them immediately dies with the new error message; > I can't cause the bogus constraint to show up anymore. Actually, that was a silly fix that didn't actually work correctly, as I discovered immediately after sending it. The right fix is to forbid all commands other than DETACH PARTITION FINALIZE in a partition that's in the process of being detached. In the attached v8, I did that; I also added a ton more tests that hopefully show how the feature should work in concurrent cases, including one case in which the transaction doing the detach is cancelled. I also renamed "inhdetached" to "inhdetachpending", per previous discussion, including changing how to looks in psql. I am not aware of any other loose end in this patch; I consider this version final. Barring further problem reports, I'll get this pushed tomorrow morning. psql completion is missing. If somebody would like to contribute that, I'm grateful. -- Álvaro Herrera39°49'30"S 73°17'W "If you have nothing to say, maybe you need just the right tool to help you not say it." (New York Times, about Microsoft PowerPoint) >From 101a20d1ccf86a255af166e198c7ac691af58c99 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Mon, 13 Jul 2020 20:15:30 -0400 Subject: [PATCH v8 1/2] Let ALTER TABLE exec routines deal with the relation This means that ATExecCmd relies on AlteredRelationInfo->rel instead of keeping the relation as a local variable; this is useful if the subcommand needs to modify the relation internally. For example, a subcommand that internally commits its transaction needs this. --- src/backend/commands/tablecmds.c | 19 +-- 1 file changed, 13 insertions(+), 6 deletions(-) diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index ffb1308a0c..8e753f1efd 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -156,6 +156,8 @@ typedef struct AlteredTableInfo Oid relid; /* Relation to work on */ char relkind; /* Its relkind */ TupleDesc oldDesc; /* Pre-modification tuple descriptor */ + /* Transiently set during Phase 2, normally set to NULL */ + Relation rel; /* Information saved by Phase 1 for Phase 2: */ List *subcmds[AT_NUM_PASSES]; /* Lists of AlterTableCmd */ /* Information saved by Phases 1/2 for Phase 3: */ @@ -353,7 +355,7 @@ static void ATPrepCmd(List **wqueue, Relation rel, AlterTableCmd *cmd, AlterTableUtilityContext *context); static void ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode, AlterTableUtilityContext *context); -static void ATExecCmd(List **wqueue, AlteredTableInfo *tab, Relation rel, +static void ATExecCmd(List **wqueue, AlteredTableInfo *tab, AlterTableCmd *cmd, LOCKMODE lockmode, int cur_pass, AlterTableUtilityContext *context); static AlterTableCmd *ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, @@ -4512,7 +4514,6 @@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode, { AlteredTableInfo *tab = (AlteredTableInfo *) lfirst(ltab); List *subcmds = tab->subcmds[pass]; - Relation rel; ListCell *lcmd; if (subcmds == NIL) @@ -4521,10 +4522,10 @@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode, /* * Appropriate lock was obtained by phase 1, needn't get it again */ - rel = relation_open(tab->relid, NoLock); + tab->rel = relation_open(tab->relid, NoLock); foreach(lcmd, subcmds) -ATExecCmd(wqueue, tab, rel, +ATExecCmd(wqueue, tab, castNode(AlterTableCmd, lfirst(lcmd)), lockmode, pass, context); @@ -4536,7 +4537,11 
@@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode, if (pass == AT_PASS_ALTER_TYPE) ATPostAlterTypeCleanup(wqueue, tab, lockmode); - relation_close(rel, NoLock); + if (tab->rel) + { +relation_close(tab->rel, NoLock); +tab->rel = NULL; + } } } @@ -4562,11 +4567,12 @@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode, * ATExecCmd: dispatch a subcommand to appropriate execution routine */ static void -ATExecCmd(List **wqueue, AlteredTableInfo *tab, Relation rel, +ATExecCmd(List **wqueue, AlteredTableInfo *tab, AlterTableCmd *cmd, LOCKMODE lockmode, int cur_pass, AlterTableUtilityContext *context) { ObjectAddress address = InvalidObjectAddress; + Relation rel = tab->rel; switch (cmd->subtype) { @@ -5669,6 +5675,7 @@ ATGetQueueEntry(List **wqueue, Relation rel) */ tab = (AlteredTableInfo *) palloc0(sizeof(AlteredTableInfo)); tab->relid = relid; + tab->rel = NULL; /* set later */ tab->relkind = rel->rd_rel->relkind; tab->oldDesc = CreateTupleDescCopyConstr(RelationGetDescr(rel)); tab->newrelpersistence = RELPERSISTENCE_PERMANENT; -- 2.20.1 >From 6e893b5f2e7abae9e55ba0d1487d4621272d2cdf Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date
Re: Getting better results from valgrind leak tracking
Hi, (really need to fix my mobile phone mail program to keep the CC list...) On 2021-03-17 08:15:43 -0700, Andres Freund wrote: > On Wed, Mar 17, 2021, at 07:16, Tom Lane wrote: > > Andres Freund writes: > > > On 2021-03-16 20:50:17 -0700, Andres Freund wrote: > > Meanwhile, I'm still trying to understand why valgrind is whining > > about the rd_indexcxt identifier strings. AFAICS it shouldn't. > > I found a way around that late last night. Need to mark the context > itself as an allocation. But I made a mess on the way to that and need > to clean the patch up before sending it (and need to drop my > girlfriend off first). Unfortunately I didn't immediately find a way to do this while keeping the MEMPOOL_CREATE/DESTROY in mcxt.c. The attached patch moves the pool creation into the memory context implementations, "allocates" the context itself as part of that pool, and changes the reset implementation from MEMPOOL_DESTROY + MEMPOOL_CREATE to instead do MEMPOOL_TRIM. That leaves the memory context itself valid (and thus tracking ->ident etc), but marks all the other memory as freed. This is just a first version, it probably needs more work, and definitely a few comments... After this, your changes, and the previously mentioned fixes, I get far fewer false positives. Also found a crash / memory leak in pgstat.c due to the new replication slot stats, but I'll start a separate thread. There are a few leak warnings around guc.c that look like they might be real, not false positives, and thus a bit concerning. Looks like several guc check hooks don't bother to free the old *extra before allocating a new one. I suspect we might get better results from valgrind, not just for leaks but also undefined value tracking, if we changed the way we represent pools to utilize VALGRIND_MEMPOOL_METAPOOL | VALGRIND_MEMPOOL_AUTO_FREE. E.g. aset.c would associate AllocBlock using VALGRIND_MEMPOOL_ALLOC and then mcxt.c would use VALGRIND_MALLOCLIKE_BLOCK for the individual chunk allocation. https://www.valgrind.org/docs/manual/mc-manual.html#mc-manual.mempools I played with naming the allocations underlying aset.c using VALGRIND_CREATE_BLOCK(block, strlen(context->name), context->name). That does produce better undefined-value warnings, but it seems that e.g. the leak detector doen't have that information around. Nor does it seem to be usable for use-afte-free. At least the latter likely because I had to VALGRIND_DISCARD by that point... Greetings, Andres Freund >From c7e69e4bccfc4da97b0b999399732e3dc806b67e Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Wed, 17 Mar 2021 10:46:39 -0700 Subject: [PATCH] Make memory contexts themselves more visible to valgrind. 
--- src/include/utils/memdebug.h| 1 + src/backend/utils/mmgr/aset.c | 22 ++ src/backend/utils/mmgr/generation.c | 8 src/backend/utils/mmgr/mcxt.c | 5 - src/backend/utils/mmgr/slab.c | 8 5 files changed, 39 insertions(+), 5 deletions(-) diff --git a/src/include/utils/memdebug.h b/src/include/utils/memdebug.h index e88b4c6e8ef..5988bff8839 100644 --- a/src/include/utils/memdebug.h +++ b/src/include/utils/memdebug.h @@ -23,6 +23,7 @@ #define VALGRIND_CHECK_MEM_IS_DEFINED(addr, size) do {} while (0) #define VALGRIND_CREATE_MEMPOOL(context, redzones, zeroed) do {} while (0) #define VALGRIND_DESTROY_MEMPOOL(context) do {} while (0) +#define VALGRIND_MEMPOOL_TRIM(context, ptr, size) do {} while (0) #define VALGRIND_MAKE_MEM_DEFINED(addr, size)do {} while (0) #define VALGRIND_MAKE_MEM_NOACCESS(addr, size)do {} while (0) #define VALGRIND_MAKE_MEM_UNDEFINED(addr, size)do {} while (0) diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c index ec6c130d0fb..9793ddf4a3d 100644 --- a/src/backend/utils/mmgr/aset.c +++ b/src/backend/utils/mmgr/aset.c @@ -433,6 +433,12 @@ AllocSetContextCreateInternal(MemoryContext parent, { /* Remove entry from freelist */ set = freelist->first_free; + + VALGRIND_CREATE_MEMPOOL(set, 0, false); + VALGRIND_MEMPOOL_ALLOC(set, set, sizeof(AllocSetContext)); + /* the contents are still valid, but valgrind can't know that */ + VALGRIND_MAKE_MEM_DEFINED(set, sizeof(AllocSetContext)); + freelist->first_free = (AllocSet) set->header.nextchild; freelist->num_free--; @@ -477,6 +483,8 @@ AllocSetContextCreateInternal(MemoryContext parent, name))); } + VALGRIND_CREATE_MEMPOOL(set, 0, false); + VALGRIND_MEMPOOL_ALLOC(set, set, sizeof(AllocSetContext)); /* * Avoid writing code that can fail between here and MemoryContextCreate; * we'd leak the header/initial block if we ereport in this stretch. @@ -611,6 +619,8 @@ AllocSetReset(MemoryContext context) Assert(context->mem_allocated == keepersize); + VALGRIND_MEMPOOL_TRIM(set, set, sizeof(AllocSetAlloc)); + /* Reset block size allocation sequence, too */ set->nextBlockSize = set->initBlockSize; } @@ -652,6 +662,16 @@ A
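As an aside for readers (not part of the attached patch): the two-level scheme mentioned at the end, using VALGRIND_MEMPOOL_METAPOOL plus VALGRIND_MEMPOOL_AUTO_FREE, would look roughly like the sketch below. It assumes a memcheck.h recent enough to provide VALGRIND_CREATE_MEMPOOL_EXT, and the function names are invented for illustration:

#include <stddef.h>
#include <valgrind/memcheck.h>

/* The context itself becomes a "metapool" ... */
static void
example_context_create(void *context)
{
	VALGRIND_CREATE_MEMPOOL_EXT(context, 0, 0,
								VALGRIND_MEMPOOL_METAPOOL | VALGRIND_MEMPOOL_AUTO_FREE);
}

/* ... aset.c would register each malloc'ed AllocBlock with that pool ... */
static void
example_block_alloc(void *context, void *block, size_t blksize)
{
	VALGRIND_MEMPOOL_ALLOC(context, block, blksize);
}

/* ... and mcxt.c would describe each chunk handed out to callers as a malloc-like block. */
static void
example_chunk_alloc(void *chunk, size_t size)
{
	VALGRIND_MALLOCLIKE_BLOCK(chunk, size, 0, 0);
}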
Re: [HACKERS] Custom compression methods
Hi, On 2021-03-17 13:31:14 -0400, Robert Haas wrote: > On Wed, Mar 17, 2021 at 7:41 AM Dilip Kumar wrote: > > 0002: > > - Wrapper over heap_form_tuple and used in ExecEvalRow() and > > ExecEvalFieldStoreForm() > > Instead of having heap_form_flattened_tuple(), how about > heap_flatten_values(tupleDesc, values, isnull) that is documented to > modify the values array? Then instead of replacing the > heap_form_tuple() calls with a call to heap_form_flattened_tuple(), > you just insert a call to heap_flatten_values() before the call to > heap_form_tuple(). I think that might be easier for people looking at > this code in the future to understand what's happening. OTOH heap_form_flattened_tuple() has the advantage that we can optimize it further (e.g. to do the conversion to flattened values in fill_val()) without changing the outside API. Greetings, Andres Freund
Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY
The v8 patch has the "broken constraint" problem. Also, it "fails to avoid" adding duplicate constraints: Check constraints: "c" CHECK (i IS NOT NULL AND i > 1 AND i < 2) "cc" CHECK (i IS NOT NULL AND i >= 1 AND i < 2) "p1_check" CHECK (true) "p1_i_check" CHECK (i IS NOT NULL AND i >= 1 AND i < 2) > diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml > index 5c9f4af1d5..0cb846f408 100644 > --- a/doc/src/sgml/catalogs.sgml > +++ b/doc/src/sgml/catalogs.sgml > @@ -4485,6 +4485,16 @@ SCRAM-SHA-256$count>:&l > when using declarative partitioning. > > > + > + > + > + inhdetachpending bool > + > + > + Set to true for a partition that is in the process of being detached; > + false otherwise. > + > + Remove "Set to" ? And say true and false Probably you'll hate the suggestion, but maybe it should be "pendingdetach". We already have pg_settings.pending_restart. > + If CONCURRENTLY is specified, this process runs in > two > + transactions in order to avoid blocking other sessions that might be > accessing > + the partitioned table. During the first transaction, a > + SHARE UPDATE EXCLUSIVE lock is taken on both parent > table and > + partition, and its partition is marked detached; at that point, the > transaction > + is committed and all transactions using the partitioned table are > waited for. > + Once all those transactions are gone, the second stage acquires Instead of "gone", say "have completed" ? > +/* > + * MarkInheritDetached > + * > + * When a partition is detached from its parent concurrently, we don't > + * remove the pg_inherits row until a second transaction; as a preparatory > + * step, this function marks the entry as 'detached', so that other *pending detached > + * The strategy for concurrency is to first modify the partition catalog > + * rows to make it visible to everyone that the partition is detached, the inherits catalog? > + /* > + * In concurrent mode, the partition is locked with > share-update-exclusive > + * in the first transaction. This allows concurrent transactions to be > + * doing DML to the partition. > + /* > + * Check inheritance conditions and either delete the pg_inherits row > + * (in non-concurrent mode) or just set the inhisdetached flag. detachpending
Re: PoC/WIP: Extended statistics on expressions
On Wed, 17 Mar 2021 at 17:26, Tomas Vondra wrote: > > My concern is that the current behavior (where we prefer expression > stats over multi-column stats to some extent) works fine as long as the > parts are independent, but once there's dependency it's probably more > likely to produce underestimates. I think underestimates for grouping > estimates were a risk in the past, so let's not make that worse. > I'm not sure the current behaviour really is preferring expression stats over multi-column stats. In this example, where we're grouping by (a+b), (c+d) and have stats on [(a+b),c] and (c+d), neither of those multi-column stats actually match more than one column/expression. If anything, I'd go the other way and say that it was wrong to use the [(a+b),c] stats in the first case, where they were the only stats available, since those stats aren't really applicable to (c+d), which probably ought to be treated as independent. IOW, it might have been better to estimate the first case as ndistinct((a+b)) * ndistinct(c) * ndistinct(d) and the second case as ndistinct((a+b)) * ndistinct((c+d)) Regards, Dean
Re: WIP: BRIN multi-range indexes
On Thu, Mar 11, 2021 at 12:26 PM Tomas Vondra wrote: > > Hi, > > Here is an updated version of the patch series. > > It fixes the assert failure (just remove the multiplication from it) and > adds a simple regression test that exercises this. > > Based on the discussion so far, I've decided to keep just the new > signature of the consistent function. That's a bit simpler than having > to support both 3 and 4 parameters, and it would not deal with the NULL > changes anyway (mostly harmless code duplication, but still). I've also > realized the API documentation in SGML needs updating. > > At this point, I think 0001-0006 parts are mostly committable. I think so. I've marked it RFC for this six. > As for the remaining two parts, the one dealing with correlation may not > be strictly necessary, but not having it (or something like it) may > result in not picking the BRIN index in some cases. > > But maybe it's not a major problem. I tried the example from [1] but it > no longer triggers the issue for me - I'm not entirely sure why, but the > costing changed for some reason. It used to look like this: > ... > The index scan cost is about the same, but the heap scan is about half > the cost. The row estimates are a bit different, perhaps that's related. > The seqscan cost (169248) and duration (~500ms) is still about the same, > but so now we still pick the bitmap heap scan. With only 0001-0006, I get a parallel bitmap scan in both cases: QUERY PLAN --- Gather (cost=6542.42..52779.35 rows=10 width=4) (actual time=3.283..22.308 rows=10 loops=1) Workers Planned: 2 Workers Launched: 2 -> Parallel Bitmap Heap Scan on t0 (cost=5542.42..51778.35 rows=4 width=4) (actual time=3.434..17.257 rows=3 loops=3) Recheck Cond: (a = 1) Rows Removed by Index Recheck: 83165 Heap Blocks: lossy=421 -> Bitmap Index Scan on t0_a_idx (cost=0.00..5542.42 rows=381682 width=0) (actual time=2.996..2.996 rows=11040 loops=1) Index Cond: (a = 1) Planning Time: 0.088 ms Execution Time: 22.341 ms (11 rows) > Not sure we can rely on > this, though. Would be quite sad if we had new improved opclasses but > the planner often decided not to use them. Yeah, I'm not sure what to do here. It might be okay to leave it out now and study it further as a PG14 open item or PG15 improvement. > I had an idea of tweaking choose_bitmap_and to consider both the cost > and selectivity (similarly to how add_path considers statup/total cost), > and that did indeed resolve this particular case. This is what the 0008 > part does. > > But it may also have negative consequence, as demonstrated by the > reproducer2.sql script. So maybe the logic would need to be more > complicated. Or maybe there's no reliable solution, considering how > tricky/unreliable BRIN estimates are. 
Ok, so with 0008 in reproducer2, it chooses the more selective path, even though it has a higher total cost: 0001-0007: QUERY PLAN - Bitmap Heap Scan on t2 (cost=29.03..24032.28 rows=1 width=8) (actual time=0.498..1.755 rows=1 loops=1) Recheck Cond: (a = 1000) Rows Removed by Index Recheck: 7167 Heap Blocks: lossy=128 -> Bitmap Index Scan on idx_2 (cost=0.00..29.03 rows=7163 width=0) (actual time=0.278..0.278 rows=1280 loops=1) Index Cond: (a = 1000) Planning Time: 0.148 ms Execution Time: 1.774 ms (8 rows) DROP INDEX QUERY PLAN --- Bitmap Heap Scan on t2 (cost=656.00..1531.00 rows=1 width=8) (actual time=9.695..9.708 rows=1 loops=1) Recheck Cond: (a = 1000) Rows Removed by Index Recheck: 223 Heap Blocks: lossy=4 -> Bitmap Index Scan on idx_1 (cost=0.00..656.00 rows=224 width=0) (actual time=9.675..9.675 rows=40 loops=1) Index Cond: (a = 1000) Planning Time: 0.110 ms Execution Time: 9.727 ms (8 rows) and with 0008: QUERY PLAN --- Bitmap Heap Scan on t2 (cost=656.00..1531.00 rows=1 width=8) (actual time=8.540..8.577 rows=1 loops=1) Recheck Cond: (a = 1000) Rows Removed by Index Recheck: 223 Heap Blocks: lossy=4 -> Bitmap Index Scan on idx_1 (cost=0.00..656.00 rows=224 width=0) (actual time=8.507..8.508 rows=40 loops=1) Index Cond: (a = 1000) Planning Time: 0.175 ms Execution Time: 8.601 ms (8 rows) DROP INDEX
Re: PoC/WIP: Extended statistics on expressions
On 3/17/21 7:54 PM, Dean Rasheed wrote: > On Wed, 17 Mar 2021 at 17:26, Tomas Vondra > wrote: >> >> My concern is that the current behavior (where we prefer expression >> stats over multi-column stats to some extent) works fine as long as the >> parts are independent, but once there's dependency it's probably more >> likely to produce underestimates. I think underestimates for grouping >> estimates were a risk in the past, so let's not make that worse. >> > > I'm not sure the current behaviour really is preferring expression > stats over multi-column stats. In this example, where we're grouping > by (a+b), (c+d) and have stats on [(a+b),c] and (c+d), neither of > those multi-column stats actually match more than one > column/expression. If anything, I'd go the other way and say that it > was wrong to use the [(a+b),c] stats in the first case, where they > were the only stats available, since those stats aren't really > applicable to (c+d), which probably ought to be treated as > independent. IOW, it might have been better to estimate the first case > as > > ndistinct((a+b)) * ndistinct(c) * ndistinct(d) > > and the second case as > > ndistinct((a+b)) * ndistinct((c+d)) > OK. I might be confused, but isn't that what the algorithm currently does? Or am I just confused about what the first/second case refers to? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: WIP: BRIN multi-range indexes
On 3/17/21 7:59 PM, John Naylor wrote: > On Thu, Mar 11, 2021 at 12:26 PM Tomas Vondra > mailto:tomas.von...@enterprisedb.com>> > wrote: >> >> Hi, >> >> Here is an updated version of the patch series. >> >> It fixes the assert failure (just remove the multiplication from it) and >> adds a simple regression test that exercises this. >> >> Based on the discussion so far, I've decided to keep just the new >> signature of the consistent function. That's a bit simpler than having >> to support both 3 and 4 parameters, and it would not deal with the NULL >> changes anyway (mostly harmless code duplication, but still). I've also >> realized the API documentation in SGML needs updating. >> >> At this point, I think 0001-0006 parts are mostly committable. > > I think so. I've marked it RFC for this six. > >> As for the remaining two parts, the one dealing with correlation may not >> be strictly necessary, but not having it (or something like it) may >> result in not picking the BRIN index in some cases. >> >> But maybe it's not a major problem. I tried the example from [1] but it >> no longer triggers the issue for me - I'm not entirely sure why, but the >> costing changed for some reason. It used to look like this: > >> ... > >> The index scan cost is about the same, but the heap scan is about half >> the cost. The row estimates are a bit different, perhaps that's related. >> The seqscan cost (169248) and duration (~500ms) is still about the same, >> but so now we still pick the bitmap heap scan. > > With only 0001-0006, I get a parallel bitmap scan in both cases: > > QUERY PLAN > --- > Gather (cost=6542.42..52779.35 rows=10 width=4) (actual > time=3.283..22.308 rows=10 loops=1) > Workers Planned: 2 > Workers Launched: 2 > -> Parallel Bitmap Heap Scan on t0 (cost=5542.42..51778.35 rows=4 > width=4) (actual time=3.434..17.257 rows=3 loops=3) > Recheck Cond: (a = 1) > Rows Removed by Index Recheck: 83165 > Heap Blocks: lossy=421 > -> Bitmap Index Scan on t0_a_idx (cost=0.00..5542.42 > rows=381682 width=0) (actual time=2.996..2.996 rows=11040 loops=1) > Index Cond: (a = 1) > Planning Time: 0.088 ms > Execution Time: 22.341 ms > (11 rows) > >> Not sure we can rely on >> this, though. Would be quite sad if we had new improved opclasses but >> the planner often decided not to use them. > > Yeah, I'm not sure what to do here. It might be okay to leave it out now > and study it further as a PG14 open item or PG15 improvement. > Yeah, that's definitely an option. >> I had an idea of tweaking choose_bitmap_and to consider both the cost >> and selectivity (similarly to how add_path considers statup/total cost), >> and that did indeed resolve this particular case. This is what the 0008 >> part does. >> >> But it may also have negative consequence, as demonstrated by the >> reproducer2.sql script. So maybe the logic would need to be more >> complicated. Or maybe there's no reliable solution, considering how >> tricky/unreliable BRIN estimates are. 
> > Ok, so with 0008 in reproducer2, it chooses the more selective path, > even though it has a higher total cost: > > 0001-0007: > > QUERY PLAN > - > Bitmap Heap Scan on t2 (cost=29.03..24032.28 rows=1 width=8) (actual > time=0.498..1.755 rows=1 loops=1) > Recheck Cond: (a = 1000) > Rows Removed by Index Recheck: 7167 > Heap Blocks: lossy=128 > -> Bitmap Index Scan on idx_2 (cost=0.00..29.03 rows=7163 width=0) > (actual time=0.278..0.278 rows=1280 loops=1) > Index Cond: (a = 1000) > Planning Time: 0.148 ms > Execution Time: 1.774 ms > (8 rows) > > DROP INDEX > QUERY PLAN > --- > Bitmap Heap Scan on t2 (cost=656.00..1531.00 rows=1 width=8) (actual > time=9.695..9.708 rows=1 loops=1) > Recheck Cond: (a = 1000) > Rows Removed by Index Recheck: 223 > Heap Blocks: lossy=4 > -> Bitmap Index Scan on idx_1 (cost=0.00..656.00 rows=224 width=0) > (actual time=9.675..9.675 rows=40 loops=1) > Index Cond: (a = 1000) > Planning Time: 0.110 ms > Execution Time: 9.727 ms > (8 rows) > > and with 0008: > > QUERY PLAN > --- > Bitmap Heap Scan on t2 (cost=656.00..1531.00 rows=1 width=8) (actual > time=8.540..8.577 rows=1 loops=1) > Recheck Cond: (a = 1000) > Rows Removed by I
Re: [HACKERS] Custom compression methods
On Wed, Mar 17, 2021 at 2:17 PM Andres Freund wrote: > OTOH heap_form_flattened_tuple() has the advantage that we can optimize > it further (e.g. to do the conversion to flattened values in fill_val()) > without changing the outside API. Well, in my view, that does change the outside API, because either the input values[] array is going to get scribbled on, or it's not. We should either decide we're not OK with it and just do the fill_val() thing now, or we should decide that we are and not worry about doing the fill_val() thing later. IMHO, anyway. -- Robert Haas EDB: http://www.enterprisedb.com
Re: [HACKERS] Custom compression methods
On Mon, Mar 15, 2021 at 6:58 PM Andres Freund wrote: > - Adding all these indirect function calls via toast_compression[] just > for all of two builtin methods isn't fun either. Yeah, it feels like this has too many layers of indirection now. Like, toast_decompress_datum() first gets TOAST_COMPRESS_METHOD(attr). Then it calls CompressionIdToMethod to convert one constant (like TOAST_PGLZ_COMPRESSION_ID) to another constant with a slightly different name (like TOAST_PGLZ_COMPRESSION). Then it calls GetCompressionRoutines() to get hold of the function pointers. Then it does an indirect function call. That seemed like a pretty reasonable idea when we were trying to support arbitrary compression AMs without overly privileging the stuff that was built into core, but if we're just doing stuff that's built into core, then we could just switch (TOAST_COMPRESS_METHOD(attr)) and call the correct function. In fact, we could even move the stuff from toast_compression.c into detoast.c, which would allow the compiler to optimize better (e.g. by inlining, if it wants). The same applies to toast_decompress_datum_slice(). There's a similar issue in toast_get_compression_method() and the only caller, pg_column_compression(). Here the multiple mapping layers and the indirect function call are split across those two functions rather than all in the same one, but here again one could presumably find a place to just switch on TOAST_COMPRESS_METHOD(attr) or VARATT_EXTERNAL_GET_COMPRESSION(attr) and return "pglz" or "lz4" directly. In toast_compress_datum(), I think we could have a switch that invokes the appropriate compressor based on cmethod and sets a variable to the value to be passed as the final argument of TOAST_COMPRESS_SET_SIZE_AND_METHOD(). Likewise, I suppose CompressionNameToMethod could at least be simplified to use constant strings rather than stuff like toast_compression[TOAST_PGLZ_COMPRESSION_ID].cmname. > - why is HIDE_TOAST_COMPRESSION useful? Doesn't quite seem to be > comparable to HIDE_TABLEAM? Andres, what do you mean by this exactly? It's exactly the same issue: without this, if you change the default compression method, every test that uses \d+ breaks. If you want to be able to run the whole test suite with either compression method and get the same results, you need this. Now, maybe you don't, because perhaps that doesn't seem so important with compression methods as with table AMs. I think that's a defensible position. But, it is at the underlying level, the same thing. -- Robert Haas EDB: http://www.enterprisedb.com
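To make the suggestion concrete, a direct-dispatch toast_decompress_datum() could look roughly like the sketch below. This is only an illustration of the switch idea, not code from the patch; the two per-method helpers (pglz_decompress_datum, lz4_decompress_datum) are assumed names for whatever the built-in decompression routines end up being called.

static struct varlena *
toast_decompress_datum(struct varlena *attr)
{
    /* dispatch directly on the stored method id, no lookup table needed */
    switch (TOAST_COMPRESS_METHOD(attr))
    {
        case TOAST_PGLZ_COMPRESSION_ID:
            return pglz_decompress_datum(attr);
        case TOAST_LZ4_COMPRESSION_ID:
            return lz4_decompress_datum(attr);
        default:
            elog(ERROR, "invalid compression method id %d",
                 TOAST_COMPRESS_METHOD(attr));
            return NULL;        /* keep compiler quiet */
    }
}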
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
On Wed, Mar 17, 2021 at 12:48 PM Julien Rouhaud wrote: > I originally suggested to make it clearer by having an enum GUC rather than a > boolean, say compute_queryid = [ none | core | external ], and if set to > external then a hook would be explicitely called. Right now, "none" and > "external" are binded with compute_queryid = off, and depends on whether an > extension is computing a queryid during post_parse_analyse_hook. I would just make it a Boolean and have a hook. The Boolean controls whether it gets computed at all, and the hook lets an external module override the way it gets computed. -- Robert Haas EDB: http://www.enterprisedb.com
Re: PoC/WIP: Extended statistics on expressions
On Wed, 17 Mar 2021 at 19:07, Tomas Vondra wrote: > > On 3/17/21 7:54 PM, Dean Rasheed wrote: > > > > it might have been better to estimate the first case as > > > > ndistinct((a+b)) * ndistinct(c) * ndistinct(d) > > > > and the second case as > > > > ndistinct((a+b)) * ndistinct((c+d)) > > OK. I might be confused, but isn't that what the algorithm currently > does? Or am I just confused about what the first/second case refers to? > No, it currently estimates the first case as ndistinct((a+b),c) * ndistinct(d). Having said that, maybe that's OK after all. It at least makes an effort to account for any correlation between (a+b) and (c+d), using the known correlation between (a+b) and c. For reference, here is the test case I was using (which isn't really very good for catching dependence between columns): DROP TABLE IF EXISTS foo; CREATE TABLE foo (a int, b int, c int, d int); INSERT INTO foo SELECT x%10, x%11, x%12, x%13 FROM generate_series(1,10) x; SELECT COUNT(DISTINCT a) FROM foo; -- 10 SELECT COUNT(DISTINCT b) FROM foo; -- 11 SELECT COUNT(DISTINCT c) FROM foo; -- 12 SELECT COUNT(DISTINCT d) FROM foo; -- 13 SELECT COUNT(DISTINCT (a+b)) FROM foo; -- 20 SELECT COUNT(DISTINCT (c+d)) FROM foo; -- 24 SELECT COUNT(DISTINCT ((a+b),c)) FROM foo; -- 228 SELECT COUNT(DISTINCT ((a+b),(c+d))) FROM foo; -- 478 -- First case: stats on [(a+b),c] CREATE STATISTICS s1(ndistinct) ON (a+b),c FROM foo; ANALYSE foo; EXPLAIN ANALYSE SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d); -- Estimate = 2964, Actual = 478 -- This estimate is ndistinct((a+b),c) * ndistinct(d) = 228*13 -- Second case: stats on (c+d) as well CREATE STATISTICS s2 ON (c+d) FROM foo; ANALYSE foo; EXPLAIN ANALYSE SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d); -- Estimate = 480, Actual = 478 -- This estimate is ndistinct((a+b)) * ndistinct((c+d)) = 20*24 I think that's probably pretty reasonable behaviour, given incomplete stats (the estimate with no extended stats is capped at 1). Regards, Dean
Re: VACUUM (DISABLE_PAGE_SKIPPING on)
On Fri, 12 Mar 2021 at 22:16, Tomas Vondra wrote: > > On 1/28/21 2:33 PM, Simon Riggs wrote: > > On Thu, 28 Jan 2021 at 12:53, Masahiko Sawada wrote: > > > >> This entry has been in "Waiting on Author" status and the patch has not > >> been updated since Nov 30. Are you still planning to work on this? > > > > Yes, new patch version tomorrow. Thanks for the nudge and the review. > > > > So, is it tomorrow already? ;-) Been a long day. ;-) -- Simon Riggs http://www.EnterpriseDB.com/
Re: non-HOT update not looking at FSM for large tuple update
On Fri, Mar 12, 2021 at 8:45 AM Matthias van de Meent < boekewurm+postg...@gmail.com> wrote: > > If this case isn't added, the lower else branch will fail to find > fitting pages for len > maxPaddedFsmRequest tuples; potentially > extending the relation when there is actually still enough space > available. Okay, with that it looks better to go back to using Max(). Also in v4: - With the separate constant you suggested, I split up the comment into two parts. - I've added a regression test to insert.sql similar to your earlier test, but we cannot use vacuum, since in parallel tests there could still be tuples visible to other transactions. It's still possible to test almost-all-free by inserting a small tuple. - Draft commit message -- John Naylor EDB: http://www.enterprisedb.com v4-0001-Fix-bug-in-heap-space-management-that-was-overly-.patch Description: Binary data
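For readers following along, the shape of the fix under discussion is roughly the following (a sketch only; maxPaddedFsmRequest is the separate constant mentioned above, and its exact definition and placement are the patch's, not shown here):

    /*
     * For tuples too large to ever fit on a page together with the usual
     * fillfactor slack, don't ask the FSM for len + saveFreeSpace (no page
     * will ever advertise that much); ask for the most a page can offer.
     */
    if (len + saveFreeSpace > MaxHeapTupleSize)
        targetFreeSpace = Max(len, maxPaddedFsmRequest);
    else
        targetFreeSpace = len + saveFreeSpace;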
Re: WIP: BRIN multi-range indexes
On Wed, Mar 17, 2021 at 3:16 PM Tomas Vondra wrote: > Ummm, no attachment ;-) Oops, here it is. -- John Naylor EDB: http://www.enterprisedb.com jcn-costing-test.sql Description: Binary data
Re: PoC/WIP: Extended statistics on expressions
On Wed, 17 Mar 2021 at 20:48, Dean Rasheed wrote: > > For reference, here is the test case I was using (which isn't really very > good for > catching dependence between columns): > And here's a test case with much more dependence between the columns: DROP TABLE IF EXISTS foo; CREATE TABLE foo (a int, b int, c int, d int); INSERT INTO foo SELECT x%2, x%5, x%10, x%15 FROM generate_series(1,10) x; SELECT COUNT(DISTINCT a) FROM foo; -- 2 SELECT COUNT(DISTINCT b) FROM foo; -- 5 SELECT COUNT(DISTINCT c) FROM foo; -- 10 SELECT COUNT(DISTINCT d) FROM foo; -- 15 SELECT COUNT(DISTINCT (a+b)) FROM foo; -- 6 SELECT COUNT(DISTINCT (c+d)) FROM foo; -- 20 SELECT COUNT(DISTINCT ((a+b),c)) FROM foo; -- 10 SELECT COUNT(DISTINCT ((a+b),(c+d))) FROM foo; -- 30 -- First case: stats on [(a+b),c] CREATE STATISTICS s1(ndistinct) ON (a+b),c FROM foo; ANALYSE foo; EXPLAIN ANALYSE SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d); -- Estimate = 150, Actual = 30 -- This estimate is ndistinct((a+b),c) * ndistinct(d) = 10*15, -- which is much better than ndistinct((a+b)) * ndistinct(c) * ndistinct(d) = 6*10*15 = 900 -- Estimate with no stats = 1500 -- Second case: stats on (c+d) as well CREATE STATISTICS s2 ON (c+d) FROM foo; ANALYSE foo; EXPLAIN ANALYSE SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d); -- Estimate = 120, Actual = 30 -- This estimate is ndistinct((a+b)) * ndistinct((c+d)) = 6*20 Again, I'd say the current behaviour is pretty good. Regards, Dean
Re: PoC/WIP: Extended statistics on expressions
On 3/17/21 9:58 PM, Dean Rasheed wrote: > On Wed, 17 Mar 2021 at 20:48, Dean Rasheed wrote: >> >> For reference, here is the test case I was using (which isn't really very >> good for >> catching dependence between columns): >> > > And here's a test case with much more dependence between the columns: > > DROP TABLE IF EXISTS foo; > CREATE TABLE foo (a int, b int, c int, d int); > INSERT INTO foo SELECT x%2, x%5, x%10, x%15 FROM generate_series(1,10) x; > SELECT COUNT(DISTINCT a) FROM foo; -- 2 > SELECT COUNT(DISTINCT b) FROM foo; -- 5 > SELECT COUNT(DISTINCT c) FROM foo; -- 10 > SELECT COUNT(DISTINCT d) FROM foo; -- 15 > SELECT COUNT(DISTINCT (a+b)) FROM foo; -- 6 > SELECT COUNT(DISTINCT (c+d)) FROM foo; -- 20 > SELECT COUNT(DISTINCT ((a+b),c)) FROM foo; -- 10 > SELECT COUNT(DISTINCT ((a+b),(c+d))) FROM foo; -- 30 > > -- First case: stats on [(a+b),c] > CREATE STATISTICS s1(ndistinct) ON (a+b),c FROM foo; > ANALYSE foo; > EXPLAIN ANALYSE > SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d); > -- Estimate = 150, Actual = 30 > -- This estimate is ndistinct((a+b),c) * ndistinct(d) = 10*15, > -- which is much better than ndistinct((a+b)) * ndistinct(c) * > ndistinct(d) = 6*10*15 = 900 > -- Estimate with no stats = 1500 > > -- Second case: stats on (c+d) as well > CREATE STATISTICS s2 ON (c+d) FROM foo; > ANALYSE foo; > EXPLAIN ANALYSE > SELECT (a+b), (c+d) FROM foo GROUP BY (a+b), (c+d); > -- Estimate = 120, Actual = 30 > -- This estimate is ndistinct((a+b)) * ndistinct((c+d)) = 6*20 > > Again, I'd say the current behaviour is pretty good. > Thanks! I agree applying at least the [(a+b),c] stats is probably the right approach, as it means we're considering at least the available information about dependence between the columns. I think to improve this, we'll need to teach the code to use overlapping statistics, a bit like conditional probability. In this case we might do something like this: ndistinct((a+b),c) * (ndistinct((c+d)) / ndistinct(c)) Which in this case would be either, for the "less correlated" case 228 * 24 / 12 = 456 (actual = 478, current estimate = 480) or, for the "more correlated" case 10 * 20 / 10 = 20 (actual = 30, current estimate = 120) But that's clearly a matter for a future patch, and I'm sure there are cases where this will produce worse estimates. Anyway, I plan to go over the patches one more time, and start pushing them sometime early next week. I don't want to leave it until the very last moment in the CF. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: WIP: WAL prefetch (another approach)
Greetings, * Tomas Vondra (tomas.von...@enterprisedb.com) wrote: > Right, I was just going to point out the FPIs are not necessary - what > matters is the presence of long streaks of WAL records touching the same > set of blocks. But people with workloads where this is common likely > don't need the WAL prefetching at all - the replica can keep up just > fine, because it doesn't need to do much I/O anyway (and if it can't > then prefetching won't help much anyway). So just don't enable the > prefetching, and there'll be no overhead. Isn't this exactly the common case though..? Checkpoints happening every 5 minutes, the replay of the FPI happens first and then the record is updated and everything's in SB for the later changes? You mentioned elsewhere that this would improve 80% of cases but that doesn't seem to be backed up by anything and certainly doesn't seem likely to be the case if we're talking about across all PG deployments. I also disagree that asking the kernel to go do random I/O for us, even as a prefetch, is entirely free simply because we won't actually need those pages. At the least, it potentially pushes out pages that we might need shortly from the filesystem cache, no? > If it was up to me, I'd just get the patch committed as is. Delaying the > feature because of concerns that it might have some negative effect in > some cases, when that can be simply mitigated by disabling the feature, > is not really beneficial for our users. I don't know that we actually know how many cases it might have a negative effect on or what the actual amount of such negative case there might be- that's really why we should probably try to actually benchmark it and get real numbers behind it, particularly when the chances of running into such a negative effect with the default configuration (that is, FPWs enabled) on the more typical platforms (as in, not ZFS) is more likely to occur in the field than the cases where FPWs are disabled and someone's running on ZFS. Perhaps more to the point, it'd be nice to see how this change actually improves the caes where PG is running with more-or-less the defaults on the more commonly deployed filesystems. If it doesn't then maybe it shouldn't be the default..? Surely the folks running on ZFS and running with FPWs disabled would be able to manage to enable it if they wished to and we could avoid entirely the question of if this has a negative impact on the more common cases. Guess I'm just not a fan of pushing out a change that will impact everyone by default, in a possibly negative way (or positive, though that doesn't seem terribly likely, but who knows), without actually measuring what that impact will look like in those more common cases. Showing that it's a great win when you're on ZFS or running with FPWs disabled is good and the expected best case, but we should be considering the worst case too when it comes to performance improvements. Anyhow, ultimately I don't know that there's much more to discuss on this thread with regard to this particular topic, at least. As I said before, if everyone else is on board and not worried about it then so be it; I feel that at least the concern that I raised has been heard. Thanks, Stephen signature.asc Description: PGP signature
Re: [HACKERS] Custom compression methods
Hi, On 2021-03-17 16:01:58 -0400, Robert Haas wrote: > > - why is HIDE_TOAST_COMPRESSION useful? Doesn't quite seem to be > > comparable to HIDE_TABLEAM? > > Andres, what do you mean by this exactly? It's exactly the same issue: > without this, if you change the default compression method, every test > that uses \d+ breaks. If you want to be able to run the whole test > suite with either compression method and get the same results, you > need this. Now, maybe you don't, because perhaps that doesn't seem so > important with compression methods as with table AMs. I think that latter part is why I wasn't sure such an option is warranted. Given it's a builtin feature, I didn't really foresee a need to be able to run all the tests with a different compression method. And it looked like it could just have been copied from the tableam logic, without a clear need. But if it's useful, then ... Greetings, Andres Freund
Re: Getting better results from valgrind leak tracking
Andres Freund writes: >> I found a way around that late last night. Need to mark the context >> itself as an allocation. But I made a mess on the way to that and need >> to clean the patch up before sending it (and need to drop my >> girlfriend off first). > Unfortunately I didn't immediately find a way to do this while keeping > the MEMPOOL_CREATE/DESTROY in mcxt.c. The attached patch moves the pool > creation into the memory context implementations, "allocates" the > context itself as part of that pool, and changes the reset > implementation from MEMPOOL_DESTROY + MEMPOOL_CREATE to instead do > MEMPOOL_TRIM. That leaves the memory context itself valid (and thus > tracking ->ident etc), but marks all the other memory as freed. Huh, interesting. I wonder why that makes the ident problem go away? I'd supposed that valgrind would see the context headers as ordinary memory belonging to the global "malloc" pool, so that any pointers inside them ought to be considered valid. Anyway, I don't have a problem with rearranging the responsibility like this. It gives the individual allocators more freedom to do odd stuff, at the cost of very minor duplication of valgrind calls. I agree we need more comments -- would you like me to have a go at writing them? One thing I was stewing over last night is that a MemoryContextReset will mess up any context identifier assigned with MemoryContextCopyAndSetIdentifier. I'd left that as a problem to fix later, because we don't currently have a need to reset contexts that use copied identifiers. But that assumption obviously will bite us someday, so maybe now is a good time to think about it. The very simplest fix would be to allocate non-constant idents with malloc; which'd require adding a flag to track whether context->ident needs to be free()d. We have room for another bool near the top of struct MemoryContextData (and at some point we could turn those bool fields into a flags word). The only real cost here is one more free() while destroying a labeled context, which is probably negligible. Other ideas are possible but they seem to require getting the individual mcxt methods involved, and I doubt it's worth the complexity. > There are a few leak warnings around guc.c that look like they might be > real, not false positives, and thus a bit concerning. Looks like several > guc check hooks don't bother to free the old *extra before allocating a > new one. I'll take a look, but I'm pretty certain that guc.c, not the hooks, is responsible for freeing those. Might be another case of valgrind not understanding what's happening. > I suspect we might get better results from valgrind, not just for leaks > but also undefined value tracking, if we changed the way we represent > pools to utilize VALGRIND_MEMPOOL_METAPOOL | > VALGRIND_MEMPOOL_AUTO_FREE. Don't really see why that'd help? I mean, it could conceivably help catch bugs in the allocators themselves, but I don't follow the argument that it'd improve anything else. Defined is defined, as far as I can tell from the valgrind manual. regards, tom lane
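For readers not following the patch itself, the rearrangement being described boils down to something like the following inside aset.c (a sketch only, under the assumption that the context header lives in the first malloc'd block; the real patch's details may differ):

    /*
     * At AllocSetContextCreate() time: the context is its own valgrind pool,
     * and the header is registered as a chunk of that pool, so valgrind keeps
     * treating set->header.ident etc. as referenced memory.
     */
    VALGRIND_CREATE_MEMPOOL(set, 0, false);
    VALGRIND_MEMPOOL_ALLOC(set, set, MAXALIGN(sizeof(AllocSetContext)));

    /*
     * At AllocSetReset() time: instead of MEMPOOL_DESTROY + MEMPOOL_CREATE,
     * keep the pool and the header chunk, and mark everything else as freed.
     */
    VALGRIND_MEMPOOL_TRIM(set, set, MAXALIGN(sizeof(AllocSetContext)));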
Re: pl/pgsql feature request: shorthand for argument and local variable references
why are you using yet another special syntax for this ? would it not be better to do something like this: CREATE FUNCTION a_reall_long_and_winding_function_name(i int, out o int) LANGUAGE plpgsql AS $plpgsql$ DECLARE args function_name_alias BEGIN args.o = 2 * args.i END; $plpgsql$; or at least do something based on block labels (see end of https://www.postgresql.org/docs/13/plpgsql-structure.html ) CREATE FUNCTION somefunc() RETURNS integer AS $$ << outerblock >> DECLARE ... maybe extend this to CREATE FUNCTION somefunc() RETURNS integer AS $$ << functionlabel:args >> << outerblock >> DECLARE ... or replace 'functionlabel' with something less ugly :) maybe <<< args >>> Cheers Hannu On Wed, Mar 17, 2021 at 5:05 PM Pavel Stehule wrote: > > > > st 17. 3. 2021 v 9:20 odesílatel Michael Paquier napsal: >> >> On Wed, Mar 17, 2021 at 02:06:57PM +0800, Julien Rouhaud wrote: >> > I also think that there should be a single usable top label, otherwise it >> > will >> > lead to confusing code and it can be a source of bug. >> >> Okay, that's fine by me. Could it be possible to come up with an >> approach that does not hijack the namespace list contruction and the >> lookup logic as much as it does now? I get it that the patch is done >> this way because of the ordering of operations done for the initial ns >> list creation and the grammar parsing that adds the routine label on >> top of the root one, but I'd like to believe that there are more solid >> ways to achieve that, like for example something that decouples the >> root label and its alias, or something that associates an alias >> directly with its PLpgSQL_nsitem entry? > > > I am checking it now, and I don't see any easy solution. The namespace is a > one direction tree - only backward iteration from leafs to roots is > supported. At the moment, when I try to replace the name of root ns_item, > this root ns_item is referenced by the function's arguments (NS_TYPE_VAR > type). So anytime I need some helper ns_item node, that can be a persistent > target for these nodes. In this case is a memory overhead of just one ns_item. > > orig_label <- argument1 <- argument2 > > The patch has to save the original orig_label because it can be referenced > from argument1 or by "FOUND" variable. New graphs looks like > > new_label <- invalidated orig_label <- argument1 <- ... > > This tree has a different direction than is usual, and then replacing the > root node is not simple. > > Second solution (significantly more simple) is an additional pointer in > ns_item structure. In almost all cases this pointer will not be used. Because > ns_item uses a flexible array, then union cannot be used. I implemented this > in a patch marked as "alias-implementation". > > What do you think about it? > > Pavel > > > > > > >> >> -- >> Michael
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
On Wed, Mar 17, 2021 at 04:04:44PM -0400, Robert Haas wrote: > On Wed, Mar 17, 2021 at 12:48 PM Julien Rouhaud wrote: > > I originally suggested to make it clearer by having an enum GUC rather than > > a > > boolean, say compute_queryid = [ none | core | external ], and if set to > > external then a hook would be explicitely called. Right now, "none" and > > "external" are binded with compute_queryid = off, and depends on whether an > > extension is computing a queryid during post_parse_analyse_hook. > > I would just make it a Boolean and have a hook. The Boolean controls > whether it gets computed at all, and the hook lets an external module > override the way it gets computed. OK, is that what everyone wants? I think that is what the patch already does. I think having multiple queryids used in a single cluster is much too confusing to support. You would have to label and control which queryid is displayed by pg_stat_activity and log_line_prefix, and that seems too confusing and not useful. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: pl/pgsql feature request: shorthand for argument and local variable references
On Wed, Mar 17, 2021 at 05:04:48PM +0100, Pavel Stehule wrote: > This tree has a different direction than is usual, and then replacing the > root node is not simple. Yeah, it is not like we should redesign this whole part just for the feature discussed here, and that may impact performance as the current list handling is cheap now. > Second solution (significantly more simple) is an additional pointer in > ns_item structure. In almost all cases this pointer will not be used. > Because ns_item uses a flexible array, then union cannot be used. I > implemented this in a patch marked as "alias-implementation". > > What do you think about it? I am not sure that it is necessary nor good to replace entirely the root label. So associating a new field directly into it sounds like a promising approach? -- Michael signature.asc Description: PGP signature
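For concreteness, the "additional pointer" variant would presumably look something like this in plpgsql.h (the alias field name is invented here for illustration; the flexible array at the end is why a union can't be used):

typedef struct PLpgSQL_nsitem
{
    PLpgSQL_nsitem_type itemtype;
    int         itemno;
    struct PLpgSQL_nsitem *prev;
    /*
     * hypothetical: alternative label node for the routine's root entry,
     * NULL for all other namespace items
     */
    struct PLpgSQL_nsitem *alias;
    char        name[FLEXIBLE_ARRAY_MEMBER];
} PLpgSQL_nsitem;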
Re: Calendar support in localization
On Thu, Mar 18, 2021 at 3:48 AM Tom Lane wrote: > Robert Haas writes: > > It's not very obvious how to scale this kind of approach to a wide > > variety of calendar types, and as Thomas says, it would much cooler to > > be able to handle all of the ones that ICU knows how to support rather > > than just one. But, the problem I see with using timestamptz is that > > it's not so obvious how to get a different output format ... unless, I > > guess, we can cram it into DateStyle. And it's also much less obvious > > how you get the other functions and operators to do what you want, if > > it's different. > > Yeah, I'm afraid that it probably is different. The most obvious > example is in operations involving type interval: > select now() + '1 month'::interval; > That should almost certainly give a different answer when using a > different calendar --- indeed the units of interest might not even > be the same. (Do all human calendars use the concept of months?) Right, so if this is done by trying to extend Daniel Verite's icu_ext extension (link given earlier) and Robert's idea of a fast-castable type, I suppose you might want now()::icu_date + '1 month'::internal to advance you by one Ethiopic month if you have done SET icu_ext.ICU_LC_TIME = 'am_ET@calendar=traditional'. Or if using my first idea of just sticking with the core types, perhaps you'd have to replace stuff via search path... I admit that sounds rather error prone and fragile (I was thinking mainly of different functions, not operators). Either way, I suppose there'd also be more explicit functions for various operations including ones that take an extra argument if you want to use an explicit locale instead of relying on the ICU_LC_TIME setting. I dunno. As for whether all calendars have months, it looks like ICU's model has just the familiar looking standardised fields; whether some of them make no sense in some calendars, I don't know, but it has stuff like x.get(field, &error), x.set(field, &error), x.add(field, amount, &error) and if it fails for some field on your particular calendar, or for some value (you can't set a Gregorian date's month to 13 (apparently we call this month "undecember", hah), but you can for a Hebrew or Ethiopic one) I suppose we'd just report the error? > I don't feel like DateStyle is chartered to affect the behavior > of datetime operators; it's understood as tweaking the I/O behavior > only. There might be more of a case for letting LC_TIME choose > this behavior, but I bet the relevant standards only contemplate About LC_TIME... I suppose in one possible future we eventually use ICU for more core stuff, and someone proposes to merge hypothetical icu_date etc types into the core date etc types, and then LC_TIME controls that. But then you might have a version of the problem that Peter E ran into in attempts so far to use ICU collations as the default: if you put ICU's funky extensible locale names into the LC_XXX environment variables, then your libc will see it too, and might get upset, since PostgreSQL uses the en. I suspect that ICU will understand typical libc locale names, but common libcs won't understand ICU's highly customisable syntax, but I haven't looked into it. If that's generally true, then perhaps the solution to both problems is a kind of partial separation: regular LC_XXX, and then also ICU_LC_XXX which defaults to the same value but can be changed to access more advanced stuff, and is used only for interacting with ICU. > Gregorian calendars. 
Also, the SQL spec says in so many words > that the SQL-defined datetime types follow the Gregorian calendar. :-(
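As a concrete illustration of the ICU model referred to above, here is what "+ 1 month" in the traditional Ethiopic calendar looks like with ICU's plain C API (purely illustrative, not a proposal for any PostgreSQL-level interface; now_millis stands for some UDate value obtained elsewhere):

#include <unicode/ucal.h>

UErrorCode  status = U_ZERO_ERROR;
/* default time zone, Amharic locale with the traditional (Ethiopic) calendar */
UCalendar  *cal = ucal_open(NULL, -1, "am_ET@calendar=traditional",
                            UCAL_DEFAULT, &status);

ucal_setMillis(cal, now_millis, &status);   /* now_millis: assumed input */
ucal_add(cal, UCAL_MONTH, 1, &status);      /* advance by one calendar month */
if (U_FAILURE(status))
    ;   /* report the error, e.g. field/value not valid for this calendar */
ucal_close(cal);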
Re: WIP: WAL prefetch (another approach)
Hi, On 3/17/21 10:43 PM, Stephen Frost wrote: > Greetings, > > * Tomas Vondra (tomas.von...@enterprisedb.com) wrote: >> Right, I was just going to point out the FPIs are not necessary - what >> matters is the presence of long streaks of WAL records touching the same >> set of blocks. But people with workloads where this is common likely >> don't need the WAL prefetching at all - the replica can keep up just >> fine, because it doesn't need to do much I/O anyway (and if it can't >> then prefetching won't help much anyway). So just don't enable the >> prefetching, and there'll be no overhead. > > Isn't this exactly the common case though..? Checkpoints happening > every 5 minutes, the replay of the FPI happens first and then the record > is updated and everything's in SB for the later changes? Well, as I said before, the FPIs are not very significant - you'll have mostly the same issue with any repeated changes to the same block. It does not matter much if you do FPI for block 1 WAL record for block 2 WAL record for block 1 WAL record for block 2 WAL record for block 1 or just WAL record for block 1 WAL record for block 2 WAL record for block 1 WAL record for block 2 WAL record for block 1 In both cases some of the prefetches are probably unnecessary. But the frequency of checkpoints does not really matter, the important bit is repeated changes to the same block(s). If you have active set much larger than RAM, this is quite unlikely. And we know from the pgbench tests that prefetching has a huge positive effect in this case. On smaller active sets, with frequent updates to the same block, we may issue unnecessary prefetches - that's true. But (a) you have not shown any numbers suggesting this is actually an issue, and (b) those cases don't really need prefetching because all the data is already either in shared buffers or in page cache. So if it happens to be an issue, the user can simply disable it. So how exactly would a problematic workload look like? > You mentioned elsewhere that this would improve 80% of cases but that > doesn't seem to be backed up by anything and certainly doesn't seem > likely to be the case if we're talking about across all PG > deployments. Obviously, the 80% was just a figure of speech, illustrating my belief that the proposed patch is beneficial for most users who currently have issues with replication lag. That is based on my experience with support customers who have such issues - it's almost invariably an OLTP workload with large active set, and we know (from the benchmarks) that in these cases it helps. Users who don't have issues with replication lag can disable (or not enable) the prefetching, and won't get any negative effects. Perhaps there are users with weird workloads that have replication lag issues but this patch won't help them - bummer, we can't solve everything in one go. Also, no one actually demonstrated such workload in this thread so far. But as you're suggesting we don't have data to support the claim that this actually helps many users (with no risk to others), I'd point out you have not actually provided any numbers showing that it actually is an issue in practice. > I also disagree that asking the kernel to go do random I/O for us, > even as a prefetch, is entirely free simply because we won't > actually need those pages. At the least, it potentially pushes out > pages that we might need shortly from the filesystem cache, no? Where exactly did I say it's free? 
I said that workloads where this happens a lot most likely don't need the prefetching at all, so it can be simply disabled, eliminating all negative effects. Moreover, looking at a limited number of recently prefetched blocks won't eliminate this problem anyway - imagine a random OLTP on large data set that however fits into RAM. After a while no read I/O needs to be done, but you'd need pretty much infinite list of prefetched blocks to eliminate that, and with smaller lists you'll still do 99% of the prefetches. Just disabling prefetching on such instances seems quite reasonable. >> If it was up to me, I'd just get the patch committed as is. Delaying the >> feature because of concerns that it might have some negative effect in >> some cases, when that can be simply mitigated by disabling the feature, >> is not really beneficial for our users. > > I don't know that we actually know how many cases it might have a > negative effect on or what the actual amount of such negative case there > might be- that's really why we should probably try to actually benchmark > it and get real numbers behind it, particularly when the chances of > running into such a negative effect with the default configuration (that > is, FPWs enabled) on the more typical platforms (as in, not ZFS) is more > likely to occur in the field than the cases where FPWs are disabled and > someone's running on ZFS. > > Perhaps more to the point, it'd be nice to see how this
replication slot stats memory bug
Hi, in the course of https://postgr.es/m/3471359.1615937770%40sss.pgh.pa.us I saw a leak in pgstat_read_statsfiles(), more precisely: /* Allocate the space for replication slot statistics */ replSlotStats = palloc0(max_replication_slots * sizeof(PgStat_ReplSlotStats)); the issue is that the current memory context is not set by pgstat_read_statsfiles(). In some cases CurrentMemoryContext is going to be a long-lived context, accumulating those allocations over time. In other contexts it will be a too short lived context, e.g. an ExprContext from the pg_stat_* invocation in the query. A reproducer for the latter: postgres[2252294][1]=# SELECT pg_create_logical_replication_slot('test', 'test_decoding'); ┌┐ │ pg_create_logical_replication_slot │ ├┤ │ (test,0/456C1878) │ └┘ (1 row) postgres[2252294][1]=# BEGIN ; BEGIN postgres[2252294][1]*=# SELECT * FROM pg_stat_replication_slots ; ┌───┬┬─┬─┬─┬──┬──┬─┐ │ slot_name │ spill_txns │ spill_count │ spill_bytes │ stream_txns │ stream_count │ stream_bytes │ stats_reset │ ├───┼┼─┼─┼─┼──┼──┼─┤ │ test │ 0 │ 0 │ 0 │ 0 │ 0 │0 │ (null) │ └───┴┴─┴─┴─┴──┴──┴─┘ (1 row) postgres[2252294][1]*=# SELECT * FROM pg_stat_replication_slots ; ┌─> │ > ├─> │ \x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F\x7F> └─> (1 row) I'll push the minimal fix of forcing the allocation to happen in pgStatLocalContext and setting it to NULL in pgstat_clear_snapshot(). But it seems like we just shouldn't allocate it dynamically at all? max_replication_slots doesn't change during postmaster lifetime, so it seems like it should just be allocated once? Greetings, Andres Freund
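The minimal fix mentioned above amounts to something like this (a sketch; the committed change may differ in detail):

    /*
     * in pgstat_read_statsfiles(): allocate in the stats snapshot context,
     * not in whatever CurrentMemoryContext happens to be
     */
    replSlotStats = MemoryContextAllocZero(pgStatLocalContext,
                                           max_replication_slots *
                                           sizeof(PgStat_ReplSlotStats));

    /* in pgstat_clear_snapshot(): drop it with the rest of the snapshot state */
    replSlotStats = NULL;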
Re: replication slot stats memory bug
Hi, On 2021-03-17 16:04:47 -0700, Andres Freund wrote: > I'll push the minimal fix of forcing the allocation to happen in > pgStatLocalContext and setting it to NULL in pgstat_clear_snapshot(). Done: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5f79580ad69f6e696365bdc63bc265f45bd77211 Greetings, Andres Freund
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
On Wed, Mar 17, 2021 at 06:32:16PM -0400, Bruce Momjian wrote: > On Wed, Mar 17, 2021 at 04:04:44PM -0400, Robert Haas wrote: > > On Wed, Mar 17, 2021 at 12:48 PM Julien Rouhaud wrote: > > > I originally suggested to make it clearer by having an enum GUC rather > > > than a > > > boolean, say compute_queryid = [ none | core | external ], and if set to > > > external then a hook would be explicitely called. Right now, "none" and > > > "external" are binded with compute_queryid = off, and depends on whether > > > an > > > extension is computing a queryid during post_parse_analyse_hook. > > > > I would just make it a Boolean and have a hook. The Boolean controls > > whether it gets computed at all, and the hook lets an external module > > override the way it gets computed. > > OK, is that what everyone wants? I think that is what the patch already > does. Not exactly. Right now a custom queryid can be computed even if compute_queryid is off, if some extension does that in post_parse_analyze_hook. I'm assuming that what Robert was thinking was more like:

if (compute_queryid)
{
    if (queryid_hook)
        queryId = queryid_hook(...);
    else
        queryId = JumbleQuery(...);
}
else
    queryId = 0;

And that should be done *after* post_parse_analyse_hook so that it's clear that this hook is no longer the place to compute queryid. Is that what should be done? > I think having multiple queryids used in a single cluster is much too > confusing to support. You would have to label and control which queryid > is displayed by pg_stat_activity and log_line_prefix, and that seems too > confusing and not useful. I agree.
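Under that scheme, an extension that wants its own queryid would simply install the hook rather than piggy-back on post_parse_analyse_hook. A hypothetical sketch follows; the hook variable name and signature are illustrative only, not settled API, and the body is a toy fingerprint rather than a serious jumbling scheme:

#include "postgres.h"
#include "common/hashfn.h"
#include "parser/parse_node.h"

PG_MODULE_MAGIC;

/* hypothetical core-provided hook variable */
extern uint64 (*queryid_hook) (ParseState *pstate, Query *query);

static uint64
my_compute_queryid(ParseState *pstate, Query *query)
{
    /* toy example: hash the raw query text instead of jumbling the tree */
    return hash_bytes_extended((const unsigned char *) pstate->p_sourcetext,
                               (int) strlen(pstate->p_sourcetext), 0);
}

void
_PG_init(void)
{
    queryid_hook = my_compute_queryid;
}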
Re: replication slot stats memory bug
Andres Freund writes: > I saw a leak in pgstat_read_statsfiles(), more precisely: > /* Allocate the space for replication slot statistics */ > replSlotStats = palloc0(max_replication_slots * > sizeof(PgStat_ReplSlotStats)); Yeah, I just found that myself. I think your fix is good. > But it seems like we just shouldn't allocate it dynamically at all? > max_replication_slots doesn't change during postmaster lifetime, so it > seems like it should just be allocated once? Meh. I don't see a need to wire in such an assumption here. regards, tom lane
Re: Getting better results from valgrind leak tracking
Hi, On 2021-03-17 18:07:36 -0400, Tom Lane wrote: > Andres Freund writes: > >> I found a way around that late last night. Need to mark the context > >> itself as an allocation. But I made a mess on the way to that and need > >> to clean the patch up before sending it (and need to drop my > >> girlfriend off first). > > > Unfortunately I didn't immediately find a way to do this while keeping > > the MEMPOOL_CREATE/DESTROY in mcxt.c. The attached patch moves the pool > > creation into the memory context implementations, "allocates" the > > context itself as part of that pool, and changes the reset > > implementation from MEMPOOL_DESTROY + MEMPOOL_CREATE to instead do > > MEMPOOL_TRIM. That leaves the memory context itself valid (and thus > > tracking ->ident etc), but marks all the other memory as freed. > > Huh, interesting. I wonder why that makes the ident problem go away? > I'd supposed that valgrind would see the context headers as ordinary > memory belonging to the global "malloc" pool, so that any pointers > inside them ought to be considered valid. I'm didn't quite understand either at the time of writing the change. It just seemed the only explanation for some the behaviour I was seeing, so I tried it. Just to be initially be rebuffed due to errors when accessing the recycled sets... I spent a bit of time looking at valgrind, and it looks to be explicit behaviour: memcheck/mc_leakcheck.c static MC_Chunk** find_active_chunks(Int* pn_chunks) { // Our goal is to construct a set of chunks that includes every // mempool chunk, and every malloc region that *doesn't* contain a // mempool chunk. ... // Then we loop over the mempool tables. For each chunk in each // pool, we set the entry in the Bool array corresponding to the // malloc chunk containing the mempool chunk. VG_(HT_ResetIter)(MC_(mempool_list)); while ( (mp = VG_(HT_Next)(MC_(mempool_list))) ) { VG_(HT_ResetIter)(mp->chunks); while ( (mc = VG_(HT_Next)(mp->chunks)) ) { // We'll need to record this chunk. n_chunks++; // Possibly invalidate the malloc holding the beginning of this chunk. m = find_chunk_for(mc->data, mallocs, n_mallocs); if (m != -1 && malloc_chunk_holds_a_pool_chunk[m] == False) { tl_assert(n_chunks > 0); n_chunks--; malloc_chunk_holds_a_pool_chunk[m] = True; } I think that means it explicitly ignores the entire malloced allocation whenever there's *any* mempool chunk in it, instead considering only the mempool chunks. So once aset allocats something in the init block, the context itself is ignored for leak checking purposes. But that wouldn't be the case if we didn't have the init block... I guess that makes sense, but would definitely be nice to have had documented... > Anyway, I don't have a problem with rearranging the responsibility > like this. It gives the individual allocators more freedom to do > odd stuff, at the cost of very minor duplication of valgrind calls. Yea, similar. > I agree we need more comments -- would you like me to have a go at > writing them? Gladly! > One thing I was stewing over last night is that a MemoryContextReset > will mess up any context identifier assigned with > MemoryContextCopyAndSetIdentifier. Valgrind should catch such problems. Not having the danger would be better, of course. We could also add an assertion at reset time that the identifier isn't in the current context. > The very simplest fix would be to allocate non-constant idents with > malloc; which'd require adding a flag to track whether context->ident > needs to be free()d. 
We have room for another bool near the top of > struct MemoryContextData (and at some point we could turn those > bool fields into a flags word). The only real cost here is one > more free() while destroying a labeled context, which is probably > negligible. Hm. A separate malloc()/free() could conceivably actually show up in profiles at some point. What if we instead used that flag to indicate that MemoryContextReset() needs to save the identifier? Then any cost would only be paid if the context is actually reset. > > There are a few leak warnings around guc.c that look like they might be > > real, not false positives, and thus a bit concerning. Looks like several > > guc check hooks don't bother to free the old *extra before allocating a > > new one. > > I'll take a look, but I'm pretty certain that guc.c, not the hooks, is > responsible for freeing those. Yea, I had misattributed the leak to the callbacks. One of the things I saw definitely is a leak: if call_string_check_hook() ereport(ERRORs) the guc_strdup() of newval->stringval is lost. There's another set of them that seems to be related to paralellism. But I've not hunted it down yet. I think it might be worth adding a VALGRIND_DO_CHANGED_LEAK_CHECK() at the end of a transaction or such? That way it'd not be quite as hard to pinpoint the so
Re: Permission failures with WAL files in 13~ on Windows
On Tue, Mar 16, 2021 at 11:40:12AM +0100, Magnus Hagander wrote: > If we can provide a new .EXE built with exactly the same flags as the > EDB downloads that they can just drop into a directory, I think it's a > lot easier to get that done. Yeah, multiple people have been complaining about that bug, so I have just produced two builds that people with those sensitive environments can use, and sent some private links to get the builds. Let's see how it goes from this point, but, FWIW, I have not been able to reproduce again my similar problem with the archive command :/ -- Michael signature.asc Description: PGP signature
Re: Calendar support in localization
On 3/17/21 3:48 PM, Tom Lane wrote: > Also, the SQL spec says in so many words > that the SQL-defined datetime types follow the Gregorian calendar. We already don't follow the SQL spec for timestamps (and I, for one, think we are better for it) so I don't think we should worry about that. -- Vik Fearing
Re: Getting better results from valgrind leak tracking
Andres Freund writes: > On 2021-03-17 18:07:36 -0400, Tom Lane wrote: >> Huh, interesting. I wonder why that makes the ident problem go away? > I spent a bit of time looking at valgrind, and it looks to be explicit > behaviour: > >// Our goal is to construct a set of chunks that includes every >// mempool chunk, and every malloc region that *doesn't* contain a >// mempool chunk. Ugh. > I guess that makes sense, but would definitely be nice to have had > documented... Indeed. So this started happening to us when we merged the aset header with its first allocation block. >>> There are a few leak warnings around guc.c that look like they might be >>> real, not false positives, and thus a bit concerning. Looks like several >>> guc check hooks don't bother to free the old *extra before allocating a >>> new one. >> I'll take a look, but I'm pretty certain that guc.c, not the hooks, is >> responsible for freeing those. I believe I've just tracked down the cause of that. Those errors are (as far as I've seen) only happening in parallel workers, and the reason is this gem in RestoreGUCState: /* See comment at can_skip_gucvar(). */ for (i = 0; i < num_guc_variables; i++) if (!can_skip_gucvar(guc_variables[i])) InitializeOneGUCOption(guc_variables[i]); where InitializeOneGUCOption zeroes out that GUC variable, causing it to lose track of any allocations it was responsible for. At minimum, somebody didn't understand the difference between "initialize" and "reset". But I suspect we could just nuke this loop altogether. I've not yet tried to grok the comment that purports to justify it, but I fail to see why it'd ever be useful to drop values inherited from the postmaster. regards, tom lane
Re: replication slot stats memory bug
On Thu, Mar 18, 2021 at 4:55 AM Andres Freund wrote: > > Hi, > > On 2021-03-17 16:04:47 -0700, Andres Freund wrote: > > I'll push the minimal fix of forcing the allocation to happen in > > pgStatLocalContext and setting it to NULL in pgstat_clear_snapshot(). > > Done: > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5f79580ad69f6e696365bdc63bc265f45bd77211 > Thank you! -- With Regards, Amit Kapila.
Re: [HACKERS] Custom compression methods
On Wed, Mar 17, 2021 at 02:50:34PM -0700, Andres Freund wrote: > On 2021-03-17 16:01:58 -0400, Robert Haas wrote: > > > - why is HIDE_TOAST_COMPRESSION useful? Doesn't quite seem to be > > > comparable to HIDE_TABLEAM? > > > > Andres, what do you mean by this exactly? It's exactly the same issue: > > without this, if you change the default compression method, every test > > that uses \d+ breaks. If you want to be able to run the whole test > > suite with either compression method and get the same results, you > > need this. Now, maybe you don't, because perhaps that doesn't seem so > > important with compression methods as with table AMs. Arguably, it's more important, since it affects every column in \d+, not just a "footer" line. > I think that latter part is why I wasn't sure such an option is > warranted. Given it's a builtin feature, I didn't really forsee a need > to be able to run all the tests with a different compression method. And > it looked a like it could just have been copied from the tableam logic, > without a clear need. But if it's useful, then ... This was one of my suggestions and contributions. I copied it from tableam specifically, not incidentally. https://www.postgresql.org/message-id/20210214184940.GL1793%40telsasoft.com -- Justin
Re: Permission failures with WAL files in 13~ on Windows
Hi, On 2021-03-16 16:20:37 +0900, Michael Paquier wrote: > Fujii-san has mentioned that on twitter, but one area that has changed > during the v13 cycle is aaa3aed, where the code recycling segments has > been switched from a pgrename() (with a retry loop) to a > CreateHardLinkA()+pgunlink() (with a retry loop for the second). One > theory that I got in mind here is the case where we create the hard > link, but fail to finish do the pgunlink() on the xlogtemp.N file, > though after some testing it did not seem to have any impact. A related question: What on earth is the point of using the unlink approach on any operating system. We use the durable_rename_excl() (and its predecessor, durable_link_or_rename(), and in turn its open coded predecessors) for things like recycling WAL files at check points. Now imagine that durable_rename_excl() fails to unlink the old file. We'll still have the old file, but there's a second link to a new WAL file, which will be used. No error will be thrown, because we don't check unlink()'s return code (but if we did, we'd still have similar issues). And then imagine that that happens again, during the next checkpoint, because the permission or whatever issue is not yet resolved. We now will have the same physical file in two location in the future WAL stream. Welcome impossible to debug issues. And all of this with the sole justification of "paranoidly trying to avoid overwriting an existing file (there shouldn't be one).". A few lines after we either unlinked the target filename, or used stat() to find an unused filename. Isn't the whole idea of durable_rename_excl() bad? There's not the same danger of using it when we start from a temp filename, e.g. during creation of new segments, or timeline files or whatnot. But I still don't see what the whole thing is supposed to protect us against realistically. Greetings, Andres Freund
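To spell out the failure mode, the link-based branch of durable_rename_excl() is essentially the following shape (heavily simplified; the real function also fsyncs and handles platforms without link()), and the danger is entirely in the second step:

    if (link(oldfile, newfile) < 0)
        elog(ERROR, "could not link file \"%s\" to \"%s\": %m",
             oldfile, newfile);

    /*
     * This is the problem: if the unlink fails (and its result is not
     * checked), oldfile and newfile are now two directory entries for the
     * same physical WAL segment, and a later recycle pass can link that same
     * file into yet another position in the future WAL stream.
     */
    unlink(oldfile);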
Re: replication slot stats memory bug
Hi, On 2021-03-17 19:36:46 -0400, Tom Lane wrote: > > But it seems like we just shouldn't allocate it dynamically at all? > > max_replication_slots doesn't change during postmaster lifetime, so it > > seems like it should just be allocated once? > > Meh. I don't see a need to wire in such an assumption here. It does make it easier for the shared memory stats patch, because if there's a fixed number + location, the relevant stats reporting doesn't need to go through a hashtable with the associated locking. I guess that may have colored my perception that it's better to just have a statically sized memory allocation for this. Noteworthy that SLRU stats are done in a fixed size allocation as well... Greetings, Andres Freund
Re: EXPLAIN/EXPLAIN ANALYZE REFRESH MATERIALIZED VIEW
On Tue, 16 Mar 2021 at 20:13, Bharath Rupireddy wrote: > On Tue, Mar 16, 2021 at 1:15 AM Tom Lane wrote: >> >> [ Sorry for not looking at this thread sooner ] >> >> Bharath Rupireddy writes: >> > Currently, $subject is not allowed. We do plan the mat view query >> > before every refresh. I propose to show the explain/explain analyze of >> > the select part of the mat view in case of Refresh Mat View(RMV). >> >> TBH, I think we should reject this. The problem with it is that it >> binds us to the assumption that REFRESH MATERIALIZED VIEW has an >> explainable plan. There are various people poking at ideas like >> incremental matview updates, which might rely on some implementation >> that doesn't exactly equate to a SQL query. Incremental updates are >> hard enough already; they'll be even harder if they also have to >> maintain compatibility with a pre-existing EXPLAIN behavior. >> >> I don't really see that this feature buys us anything you can't >> get by explaining the view's query, so I think we're better advised >> to keep our options open about how REFRESH might be implemented >> in future. > > That makes sense to me. Thanks for the comments. I'm fine to withdraw the > patch. > > I would like to see if the 0001 patch(attaching here) will be useful > at all. It just splits up the existing ExecRefreshMatView into a few > functions to make the code readable. +1. -- Regards, Japin Li. ChengDu WenWu Information Technology Co.,Ltd.
Re: Getting better results from valgrind leak tracking
Hi, On 2021-03-17 21:30:48 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2021-03-17 18:07:36 -0400, Tom Lane wrote: > >> Huh, interesting. I wonder why that makes the ident problem go away? > > > I spent a bit of time looking at valgrind, and it looks to be explicit > > behaviour: > > > >// Our goal is to construct a set of chunks that includes every > >// mempool chunk, and every malloc region that *doesn't* contain a > >// mempool chunk. > > Ugh. > > > I guess that makes sense, but would definitely be nice to have had > > documented... > > Indeed. So this started happening to us when we merged the aset > header with its first allocation block. Yea. I have the feeling that valgrinds error detection capability in our codebase used to be higher too. I wonder if that could be related somehow. Or maybe it's just a figment of my imagination. > I believe I've just tracked down the cause of that. Those errors > are (as far as I've seen) only happening in parallel workers, and > the reason is this gem in RestoreGUCState: > > /* See comment at can_skip_gucvar(). */ > for (i = 0; i < num_guc_variables; i++) > if (!can_skip_gucvar(guc_variables[i])) > InitializeOneGUCOption(guc_variables[i]); > > where InitializeOneGUCOption zeroes out that GUC variable, causing > it to lose track of any allocations it was responsible for. Ouch. At least it's a short lived issue rather than something permanently leaking memory on every SIGHUP... > At minimum, somebody didn't understand the difference between "initialize" > and "reset". But I suspect we could just nuke this loop altogether. > I've not yet tried to grok the comment that purports to justify it, > but I fail to see why it'd ever be useful to drop values inherited > from the postmaster. I can't really make sense of it either. I think it may be trying to restore the GUC state to what it would have been in postmaster, disregarding all the settings that were set as part of PostgresInit() etc? Greetings, Andres Freund
Re: New IndexAM API controlling index vacuum strategies
On Wed, Mar 17, 2021 at 7:21 AM Peter Geoghegan wrote: > > On Tue, Mar 16, 2021 at 6:40 AM Masahiko Sawada wrote: > > > Note that I've merged multiple existing functions in vacuumlazy.c into > > > one: the patch merges lazy_vacuum_all_indexes() and lazy_vacuum_heap() > > > into a single function named vacuum_indexes_mark_unused() > > > I agree to create a function like vacuum_indexes_mark_unused() that > > makes a decision and does index and heap vacumming accordingly. But > > what is the point of removing both lazy_vacuum_all_indexes() and > > lazy_vacuum_heap()? I think we can simply have > > vacuum_indexes_mark_unused() call those functions. I'm concerned that > > if we add additional criteria to vacuum_indexes_mark_unused() in the > > future the function will become very large. > > I agree now. I became overly excited about advertising the fact that > these two functions are logically one thing. This is important, but it > isn't necessary to go as far as actually making everything into one > function. Adding some comments would also make that point clear, but > without making vacuumlazy.c even more spaghetti-like. I'll fix it. > > > > I wonder if we can add some kind of emergency anti-wraparound vacuum > > > logic to what I have here, for Postgres 14. > > > +1 > > > > I think we can set VACOPT_TERNARY_DISABLED to > > tab->at_params.index_cleanup in table_recheck_autovac() or increase > > the thresholds used to not skipping index vacuuming. > > I was worried about the "tupgone = true" special case causing problems > when we decide to do some index vacuuming and some heap > vacuuming/LP_UNUSED-marking but then later decide to end the VACUUM. > But I now think that getting rid of "tupgone = true" gives us total > freedom to > choose what to do, including the freedom to start with index vacuuming > and then give up on it later -- even after we do some amount of > LP_UNUSED-marking (during a VACUUM with multiple index passes, perhaps > due to a low maintenance_work_mem setting). That isn't actually > special at all, because everything will be 100% decoupled. > > Whether or not it's a good idea to either not start index vacuuming or > to end it early (e.g. due to XID wraparound pressure) is a complicated > question, and the right approach will be debatable in each case/when > thinking about each issue. However, deciding on the best behavior to > address these problems should have nothing to do with implementation > details and everything to do with the costs and benefits to users. > Which now seems possible. > > A sophisticated version of the "XID wraparound pressure" > implementation could increase reliability without ever being > conservative, just by evaluating the situation regularly and being > prepared to change course when necessary -- until it is truly a matter > of shutting down new XID allocations/the server. It should be possible > to decide to end VACUUM early and advance relfrozenxid (provided we > have reached the point of finishing all required pruning and freezing, > of course). Highly agile behavior seems quite possible, even if it > takes a while to agree on a good design. Since I was thinking that always skipping index vacuuming on anti-wraparound autovacuum is legitimate, skipping index vacuuming only when we're really close to the point of going into read-only mode seems a bit conservative, but maybe a good start. 
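The PoC's check is conceptually along these lines (a sketch, not the attached patch; the variable names, the 1.5 multiplier, and where exactly the check is placed are as described above, with params/onerel standing in for whatever is in scope at that point):

    TransactionId skip_limit;

    /*
     * If relfrozenxid is already more than 1.5x autovacuum_freeze_max_age in
     * the past, skip index vacuuming so the anti-wraparound vacuum can
     * advance relfrozenxid as quickly as possible.
     */
    skip_limit = ReadNextTransactionId() -
        (TransactionId) (1.5 * autovacuum_freeze_max_age);
    if (!TransactionIdIsNormal(skip_limit))
        skip_limit = FirstNormalTransactionId;

    if (TransactionIdPrecedes(onerel->rd_rel->relfrozenxid, skip_limit))
        params->index_cleanup = VACOPT_TERNARY_DISABLED;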
I've attached a PoC patch to disable index vacuuming if the table's relfrozenxid is too older than autovacuum_freeze_max_age (older than 1.5x of autovacuum_freeze_max_age). > > > Aside from whether it's good or bad, strictly speaking, it could > > change the index AM API contract. The documentation of > > amvacuumcleanup() says: > > > > --- > > stats is whatever the last ambulkdelete call returned, or NULL if > > ambulkdelete was not called because no tuples needed to be deleted. > > --- > > > > With this change, we could pass NULL to amvacuumcleanup even though > > the index might have tuples needed to be deleted. > > I think that we should add a "Note" box to the documentation that > notes the difference here. Though FWIW my interpretation of the words > "no tuples needed to be deleted" was always "no tuples needed to be > deleted because vacuumlazy.c didn't call ambulkdelete()". After all, > VACUUM can add to tups_vacuumed through pruning inside > heap_prune_chain(). It is already possible (though only barely) to not > call ambulkdelete() for indexes (to only call amvacuumcleanup() during > cleanup) despite the fact that heap vacuuming did "delete tuples". Agreed. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/ poc_skip_index_cleanup_at_anti_wraparound.patch Description: Binary data
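For readers who want to see the shape of the threshold, here is a small, self-contained sketch of that kind of check (this is not the attached PoC patch; the names are invented, and a real implementation would also have to stay clear of the special XIDs below FirstNormalTransactionId):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* circular XID comparison, mirroring the backend's TransactionIdPrecedes() */
static bool
xid_precedes(TransactionId xid1, TransactionId xid2)
{
	int32_t		diff = (int32_t) (xid1 - xid2);

	return diff < 0;
}

/*
 * Decide whether relfrozenxid is more than 1.5x autovacuum_freeze_max_age
 * behind the next XID to be assigned.  Integer arithmetic avoids float
 * rounding: 1.5x == x + x/2.
 */
static bool
skip_index_vacuum_for_wraparound(TransactionId relfrozenxid,
								 TransactionId next_xid,
								 uint32_t freeze_max_age)
{
	TransactionId limit = next_xid - (freeze_max_age + freeze_max_age / 2);

	return xid_precedes(relfrozenxid, limit);
}

Whether 1.5x is the right multiplier, and where such a check should live, is exactly what the rest of the thread goes on to debate.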
Re: Permission failures with WAL files in 13~ on Windows
Hi,

On 2021-03-18 09:55:46 +0900, Michael Paquier wrote:
> Let's see how it goes from this point, but, FWIW, I have not been able
> to reproduce again my similar problem with the archive command :/
> --

I suspect it might be easier to reproduce the issue with smaller WAL segments, a short checkpoint_timeout, and multiple jobs generating WAL and then sleeping for random amounts of time. Not sure if that's the sole ingredient, but consider what happens when there are processes that XLogWrite() some WAL and then sleep. Typically such a process' openLogFile will still point to the WAL segment. And they may still do that when the next checkpoint finishes and we recycle the WAL file.

I wonder if we actually fail to unlink() the file in durable_link_or_rename(), and then end up recycling the same old file into multiple "future" positions in the WAL stream.

There are also these interesting notes at
https://docs.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-createhardlinka

1)
> The security descriptor belongs to the file to which a hard link
> points. The link itself is only a directory entry, and does not have a
> security descriptor. Therefore, when you change the security
> descriptor of a hard link, you change the security descriptor of the
> underlying file, and all hard links that point to the file allow the
> newly specified access. You cannot give a file different security
> descriptors on a per-hard-link basis.

2)
> Flags, attributes, access, and sharing that are specified in
> CreateFile operate on a per-file basis. That is, if you open a file
> that does not allow sharing, another application cannot share the file
> by creating a new hard link to the file.

3)
> The maximum number of hard links that can be created with this
> function is 1023 per file. If more than 1023 links are created for a
> file, an error results.

1) and 2) seem problematic for restore_command use. I wonder if there's a chance that some of the reports ended up hitting 3), and that windows doesn't handle that well.

If you manage to reproduce, could you check what the link count of all the segments is? Apparently sysinternal's findlinks can do that.

Or perhaps even better, add an error check that the number of links of WAL segments is 1 in a bunch of places (recycling, opening them, closing them, maybe?).

Plus error reporting for unlink failures, of course.

Greetings,

Andres Freund
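As a concrete illustration of the link-count check suggested above, here is a small, self-contained sketch using the Win32 API (GetFileInformationByHandle() reports nNumberOfLinks). This is only a sketch of the idea, not code from the PostgreSQL tree; a backend version would report failures via ereport() and would likely reuse an already-open handle rather than opening the file again.

#include <windows.h>

/*
 * Return the number of hard links pointing at "path", or -1 on error.
 */
static int
get_hard_link_count(const char *path)
{
	HANDLE		h;
	BY_HANDLE_FILE_INFORMATION info;
	int			nlinks = -1;

	h = CreateFileA(path, GENERIC_READ,
					FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
					NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
	if (h == INVALID_HANDLE_VALUE)
		return -1;

	if (GetFileInformationByHandle(h, &info))
		nlinks = (int) info.nNumberOfLinks;

	CloseHandle(h);
	return nlinks;
}

The idea would then be to assert or log whenever a WAL segment's count is anything other than 1 at recycle, open, or close time, as suggested above.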
Re: Getting better results from valgrind leak tracking
Andres Freund writes: > On 2021-03-17 21:30:48 -0400, Tom Lane wrote: >> I believe I've just tracked down the cause of that. Those errors >> are (as far as I've seen) only happening in parallel workers, and >> the reason is this gem in RestoreGUCState: ... > Ouch. At least it's a short lived issue rather than something permanently > leaking memory on every SIGHUP... Yeah. This leak is really insignificant in practice, although I'm suspicious that randomly init'ing GUCs like this might have semantic issues that we've not detected yet. >> I've not yet tried to grok the comment that purports to justify it, >> but I fail to see why it'd ever be useful to drop values inherited >> from the postmaster. > I can't really make sense of it either. I think it may be trying to > restore the GUC state to what it would have been in postmaster, > disregarding all the settings that were set as part of PostgresInit() > etc? At least in a non-EXEC_BACKEND build, the most reliable way to reproduce the postmaster's settings is to do nothing whatsoever. And I think the same is true for EXEC_BACKEND, really, because the guc.c subsystem is responsible for restoring what would have been the inherited-via-fork settings. So I'm really not sure what this is on about, and I'm too tired to try to figure it out tonight. In other news, I found that there's a genuine leak in RelationBuildLocalRelation: RelationInitTableAccessMethod was added in a spot where CurrentMemoryContext is CacheMemoryContext, which is bad because it does a syscache lookup that can lead to a table access which can leak some memory. Seems easy to fix though. The plpgsql situation looks like a mess. As a short-term answer, I'm inclined to recommend adding an exclusion that will ignore anything allocated within plpgsql_compile(). Doing better will require a fair amount of rewriting. (Although I suppose we could also consider adding an on_proc_exit callback that runs through and frees all the function cache entries.) regards, tom lane
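For anyone who wants to try the short-term exclusion suggested above, a Memcheck suppression along these lines (passed via --suppressions=FILE; syntax as of valgrind 3.9 or later) should hide any leak whose allocation stack passes through plpgsql_compile(); adjust match-leak-kinds to taste, since which leak kinds to ignore is a judgment call:

{
   ignore_plpgsql_compile_allocations
   Memcheck:Leak
   match-leak-kinds: all
   ...
   fun:plpgsql_compile
   ...
}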
Re: Parallel INSERT (INTO ... SELECT ...)
On Fri, Mar 12, 2021 at 04:05:09PM +0900, Amit Langote wrote:
> On Fri, Mar 12, 2021 at 6:10 AM Justin Pryzby wrote:
> > Note also this CF entry
> > https://commitfest.postgresql.org/32/2987/
> > | Allow setting parallel_workers on partitioned tables

I'm rebasing that other patch on top of master (with this patch), and I noticed that it adds this bit, and this patch should have done something similar:

--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1340,4 +1340,5 @@ WITH ( MODULUS numeric_literal, REM
 will use the table's parameter value.
-Specifying these parameters for partitioned tables is not supported,
-but you may specify them for individual leaf partitions.
+These parameters, with the exception of parallel_workers,
+are not supported on partitioned tables, but you may specify them for
+individual leaf partitions.

--
Justin
RE: libpq debug log
I'll look at the comments from Alvaro-san and Horiguchi-san. Here are my review comments:

(23)
+	/* Trace message only when there is first 1 byte */
+	if (conn->Pfdebug && conn->outCount < conn->outMsgStart)
+	{
+		if (conn->outCount < conn->outMsgStart)
+			pqTraceOutputMessage(conn, conn->outBuffer + conn->outCount, true);
+		else
+			pqTraceOutputNoTypeByteMessage(conn,
+										   conn->outBuffer + conn->outMsgStart);
+	}

The inner else path doesn't seem to be reached because both the outer and inner if contain the same condition. I think you want to remove the second condition from the outer if.

(24) pqTraceOutputNoTypeByteMessage
+		case 16:/* CancelRequest */
+			fprintf(conn->Pfdebug, "%s\t>\tCancelRequest\t%d", timestr, length);
+			pqTraceOutputInt32(message, &LogCursor, conn->Pfdebug);
+			pqTraceOutputInt32(message, &LogCursor, conn->Pfdebug);
+			break;

Another Int32 field needs to be output, as follows:

--
Int32(80877102)
The cancel request code. The value is chosen to contain 1234 in the most significant 16 bits, and 5678 in the least significant 16 bits. (To avoid confusion, this code must not be the same as any protocol version number.)

Int32
The process ID of the target backend.

Int32
The secret key for the target backend.
--

(25)
+		case 8 :/* GSSENRequest or SSLRequest */

GSSENRequest -> GSSENCRequest

(26)
+static void
+pqTraceOutputByte1(const char *data, int *cursor, FILE *pfdebug)
+{
+	const char *v = data + *cursor;
+	/*

+static void
+pqTraceOutputf(const char *message, int end, FILE *f)
+{
+	int			cursor = 0;
+	pqTraceOutputString(message, &cursor, end, f);
+}

Put an empty line to separate the declaration part and the execution part.

(27)
+	const char *message_type = "UnkownMessage";
+
+	id = message[LogCursor++];
+
+	memcpy(&length, message + LogCursor , 4);
+	length = (int) pg_ntoh32(length);
+	LogCursor += 4;
+	LogEnd = length - 4;
+	/* Get a protocol type from first byte identifier */
+	if (toServer &&
+		id < lengthof(protocol_message_type_f) &&
+		protocol_message_type_f[(unsigned char)id] != NULL)
+		message_type = protocol_message_type_f[(unsigned char)id];
+	else if (!toServer &&
+			 id < lengthof(protocol_message_type_b) &&
+			 protocol_message_type_b[(unsigned char)id] != NULL)
+		message_type = protocol_message_type_b[(unsigned char)id];
+	else
+	{
+		fprintf(conn->Pfdebug, "Unknown message: %02x\n", id);
+		return;
+	}
+

The initial value "UnkownMessage" is not used. So, you can initialize message_type with NULL and do something like:

+	if (...)
+		...
+	else if (...)
+		...
+
+	if (message_type == NULL)
+	{
+		fprintf(conn->Pfdebug, "Unknown message: %02x\n", id);
+		return;
+	}

Plus, I think this should be done before looking at the message length.

(28)
pqTraceOutputBinary() is only used in pqTraceOutputNchar(). Do they have to be separated?

Regards
Takayuki Tsunakawa
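To spell out what comment (24) is asking for, the corrected case would presumably look something like this (a sketch only, reusing the helpers already visible in the quoted hunk; the actual patch may structure it differently):

		case 16:			/* CancelRequest */
			fprintf(conn->Pfdebug, "%s\t>\tCancelRequest\t%d", timestr, length);
			/* Int32: cancel request code (80877102) */
			pqTraceOutputInt32(message, &LogCursor, conn->Pfdebug);
			/* Int32: process ID of the target backend */
			pqTraceOutputInt32(message, &LogCursor, conn->Pfdebug);
			/* Int32: secret key for the target backend */
			pqTraceOutputInt32(message, &LogCursor, conn->Pfdebug);
			break;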
Re: New IndexAM API controlling index vacuum strategies
On Mon, Mar 15, 2021 at 4:11 PM Andres Freund wrote:
> I kinda wonder whether this case should just be handled by just gotoing
> back to the start of the blkno loop, and redoing the pruning. The only
> thing that makes that a bit more complicated is that we've already
> incremented vacrelstats->{scanned_pages,vacrelstats->tupcount_pages}.
>
> We really should put the per-page work (i.e. the blkno loop body) of
> lazy_scan_heap() into a separate function, same with the
> too-many-dead-tuples branch.

Attached patch series splits everything up. There is now a large patch that removes the tupgone special case, and a second patch that actually adds code that dynamically decides not to do index vacuuming in cases where (for whatever reason) it doesn't seem useful.

Here are some key points about the first patch:

* Eliminates the "tupgone = true" special case by putting pruning, the HTSV() call, as well as tuple freezing into a new, dedicated function -- the function is prepared to restart pruning in those rare cases where the vacuumlazy.c HTSV() call indicates that a tuple is dead. Restarting pruning prunes the page again, rendering the DEAD tuple with storage an LP_DEAD line pointer stub. The restart thing is based on Andres' suggestion.

This patch enables incremental VACUUM (the second patch, and likely other variations) by allowing us to make a uniform assumption that it is never strictly necessary to reach lazy_vacuum_all_indexes() or lazy_vacuum_heap(). It is now possible to "end VACUUM early" while still advancing relfrozenxid. Provided we've finished the first scan of the heap, that should be safe.

* In principle we could visit and revisit the question of whether or not vacuuming should continue or end early many times, as new information comes to light. For example, maybe Masahiko's patch from today could be taught to monitor how old relfrozenxid is again and again, before finally giving up early when the risk of wraparound becomes very severe -- but not before then.

* I've added three structs that replace a blizzard of local variables used in lazy_scan_heap() with just three (one variable for each of the three structs). I've also moved several chunks of logic to other new functions (in addition to one that does pruning and freezing). I think that I have the data structures roughly right here -- but I would appreciate some feedback on that.

Does this seem like the right direction?

--
Peter Geoghegan

v3-0002-Skip-index-vacuuming-dynamically.patch
Description: Binary data

v3-0001-Remove-tupgone-special-case-from-vacuumlazy.c.patch
Description: Binary data
RE: make the stats collector shutdown without writing the statsfiles if the immediate shutdown is requested.
Dear Ikeda-san,

I confirmed the new patch and found no problems. Thanks.
(I'm not a native English speaker, so I cannot check your comments properly, sorry.)

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Re: Parallel INSERT (INTO ... SELECT ...)
From 55c7326e56f9b49710a564e91c46212a17f12b24 Mon Sep 17 00:00:00 2001
From: Justin Pryzby
Date: Wed, 17 Mar 2021 21:47:27 -0500
Subject: [PATCH] Fix for parallel_insert_enabled

---
 doc/src/sgml/ref/create_table.sgml | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index ff1b642722..d5d356f2de 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1338,8 +1338,10 @@ WITH ( MODULUS numeric_literal, REM
 If a table parameter value is set and the
 equivalent toast. parameter is not, the TOAST table
 will use the table's parameter value.
-Specifying these parameters for partitioned tables is not supported,
-but you may specify them for individual leaf partitions.
+These parameters, with the exception of
+parallel_insert_enabled,
+are not supported on partitioned tables, but may be specified for
+individual leaf partitions.
--
2.17.0
Re: Permission failures with WAL files in 13~ on Windows
On Wed, Mar 17, 2021 at 07:30:04PM -0700, Andres Freund wrote:
> I suspect it might be easier to reproduce the issue with smaller WAL
> segments, a short checkpoint_timeout, and multiple jobs generating WAL
> and then sleeping for random amounts of time. Not sure if that's the
> sole ingredient, but consider what happens when there are processes that
> XLogWrite() some WAL and then sleep. Typically such a process'
> openLogFile will still point to the WAL segment. And they may still do
> that when the next checkpoint finishes and we recycle the WAL file.

Yep. That's basically the kind of scenario I have been testing to stress the recycling/removing, with pgbench putting some load into the server. This has worked for me. Once. But I have little idea why it gets easier to reproduce in the environments of others, so there may be an OS-version dependency in the equation here.

> I wonder if we actually fail to unlink() the file in
> durable_link_or_rename(), and then end up recycling the same old file
> into multiple "future" positions in the WAL stream.

You actually mean durable_rename_excl() as of 13~, right? Yeah, this matches my impression that it is a two-step failure:
- Failure in one of the steps of durable_rename_excl().
- Fallback to segment removal, where we get the complaint about renaming.

> 1) and 2) seem problematic for restore_command use. I wonder if there's
> a chance that some of the reports ended up hitting 3), and that windows
> doesn't handle that well.

Yeap. I was thinking about 3) being the actual problem while going through those docs two days ago.

> If you manage to reproduce, could you check what the link count of
> all the segments is? Apparently sysinternal's findlinks can do that.
>
> Or perhaps even better, add an error check that the number of links of
> WAL segments is 1 in a bunch of places (recycling, opening them, closing
> them, maybe?).
>
> Plus error reporting for unlink failures, of course.

Yep, that's actually something I wrote for my own setups, with log_checkpoints enabled to catch all concurrent checkpoint activity and some LOGs. Still no luck unfortunately :(

--
Michael

signature.asc
Description: PGP signature
Re: Getting better results from valgrind leak tracking
Hi,

On 2021-03-17 00:01:55 -0400, Tom Lane wrote:
> As for the particular point about ParallelBlockTableScanWorkerData,
> I agree with your question to David about why that's in TableScanDesc
> not HeapScanDesc, but I can't get excited about it not being freed in
> heap_endscan. That's mainly because I do not believe that anything as
> complex as a heap or indexscan should be counted on to be zero-leakage.
> The right answer is to not do such operations in long-lived contexts.
> So if we're running such a thing while switched into CacheContext,
> *that* is the bug, not that heap_endscan didn't free this particular
> allocation.

I agree that it's a bad idea to do scans in non-transient contexts. It does however seem like there are a number of places that do...

I added the following hacky definition of "permanent" contexts:

/*
 * NB: Only for assertion use.
 *
 * TopMemoryContext itself obviously is permanent. Treat CacheMemoryContext
 * and all its children as permanent too.
 *
 * XXX: Might be worth adding this as an explicit flag on the context?
 */
bool
MemoryContextIsPermanent(MemoryContext c)
{
	if (c == TopMemoryContext)
		return true;

	while (c)
	{
		if (c == CacheMemoryContext)
			return true;
		c = c->parent;
	}

	return false;
}

and checked that the CurrentMemoryContext is not permanent in SearchCatCacheInternal() and systable_beginscan(). Hit a number of times.

The most glaring case is the RelationInitTableAccessMethod() call in RelationBuildLocalRelation(). Seems like the best fix is to just move the MemoryContextSwitchTo() to just before the RelationInitTableAccessMethod(). Although I wonder if we shouldn't go further, and move it to much earlier, somewhere after the rd_rel allocation.

There's plenty other hits, but I think I should get back to working on making the shared memory stats patch committable. I really wouldn't want it to slip yet another year.

But I think it might make sense to add a flag indicating contexts that shouldn't be used for non-transient data. Seems like we fairly regularly have "bugs" around this?

Greetings,

Andres Freund
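Presumably the checks mentioned above boil down to a one-line assertion near the top of SearchCatCacheInternal() and systable_beginscan(), something like the following (the exact placement is an assumption):

	Assert(!MemoryContextIsPermanent(CurrentMemoryContext));

With assertions enabled, any code path that reaches a syscache lookup or catalog scan while switched into TopMemoryContext, CacheMemoryContext, or one of its children then trips immediately, which is how cases like the RelationInitTableAccessMethod() call in RelationBuildLocalRelation() show up.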
Re: Getting better results from valgrind leak tracking
On 2021-03-17 22:33:59 -0400, Tom Lane wrote:
> >> I've not yet tried to grok the comment that purports to justify it,
> >> but I fail to see why it'd ever be useful to drop values inherited
> >> from the postmaster.
>
> > I can't really make sense of it either. I think it may be trying to
> > restore the GUC state to what it would have been in postmaster,
> > disregarding all the settings that were set as part of PostgresInit()
> > etc?
>
> At least in a non-EXEC_BACKEND build, the most reliable way to reproduce
> the postmaster's settings is to do nothing whatsoever. And I think the
> same is true for EXEC_BACKEND, really, because the guc.c subsystem is
> responsible for restoring what would have been the inherited-via-fork
> settings. So I'm really not sure what this is on about, and I'm too
> tired to try to figure it out tonight.

The restore thing runs after we've already set and initialized GUCs, including things like user/database default GUCs. I see things like

==2251779== 4,560 bytes in 1 blocks are definitely lost in loss record 383 of 406
==2251779==    at 0x483877F: malloc (vg_replace_malloc.c:307)
==2251779==    by 0x714A45: ConvertTimeZoneAbbrevs (datetime.c:4556)
==2251779==    by 0x88DE95: load_tzoffsets (tzparser.c:465)
==2251779==    by 0x88511D: check_timezone_abbreviations (guc.c:11565)
==2251779==    by 0x884633: call_string_check_hook (guc.c:11232)
==2251779==    by 0x87CB57: parse_and_validate_value (guc.c:7012)
==2251779==    by 0x87DD5F: set_config_option (guc.c:7630)
==2251779==    by 0x88397F: ProcessGUCArray (guc.c:10784)
==2251779==    by 0x32BCCF: ApplySetting (pg_db_role_setting.c:256)
==2251779==    by 0x874CA2: process_settings (postinit.c:1163)
==2251779==    by 0x874A0B: InitPostgres (postinit.c:1048)
==2251779==    by 0x60129A: BackgroundWorkerInitializeConnectionByOid (postmaster.c:5681)
==2251779==    by 0x2BB283: ParallelWorkerMain (parallel.c:1374)
==2251779==    by 0x5EBA6A: StartBackgroundWorker (bgworker.c:864)
==2251779==    by 0x6014FE: do_start_bgworker (postmaster.c:5802)
==2251779==    by 0x6018D2: maybe_start_bgworkers (postmaster.c:6027)
==2251779==    by 0x600811: sigusr1_handler (postmaster.c:5190)
==2251779==    by 0x4DD513F: ??? (in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so)
==2251779==    by 0x556E865: select (select.c:41)
==2251779==    by 0x5FC0CB: ServerLoop (postmaster.c:1700)
==2251779==    by 0x5FBA68: PostmasterMain (postmaster.c:1408)
==2251779==    by 0x4F8BFD: main (main.c:209)

The BackgroundWorkerInitializeConnectionByOid() in that trace is before the RestoreGUCState() later in ParallelWorkerMain(). But that doesn't really explain the approach.

> In other news, I found that there's a genuine leak in
> RelationBuildLocalRelation: RelationInitTableAccessMethod
> was added in a spot where CurrentMemoryContext is CacheMemoryContext,
> which is bad because it does a syscache lookup that can lead to
> a table access which can leak some memory. Seems easy to fix though.

Yea, that's the one I was talking about re ParallelBlockTableScanWorkerData not being freed. While I think we should also add that pfree, I think you're right, and we shouldn't do RelationInitTableAccessMethod() while in CacheMemoryContext.

> The plpgsql situation looks like a mess. As a short-term answer,
> I'm inclined to recommend adding an exclusion that will ignore anything
> allocated within plpgsql_compile(). Doing better will require a fair
> amount of rewriting. (Although I suppose we could also consider adding
> an on_proc_exit callback that runs through and frees all the function
> cache entries.)
The error variant of this one seems like it might actually be a practically relevant leak? As well as increasing memory usage for compiled plpgsql functions unnecessarily, of course. The latter would be good to fix, but the former seems like it might be a practical issue for poolers and the like? So I think we should do at least the reparenting thing to address that? Greetings, Andres Freund
Re: Getting better results from valgrind leak tracking
Andres Freund writes: > The most glaring case is the RelationInitTableAccessMethod() call in > RelationBuildLocalRelation(). Seems like the best fix is to just move > the MemoryContextSwitchTo() to just before the > RelationInitTableAccessMethod(). Although I wonder if we shouldn't go > further, and move it to much earlier, somewhere after the rd_rel > allocation. Yeah, same thing I did locally. Not sure if it's worth working harder. > There's plenty other hits, but I think I should get back to working on > making the shared memory stats patch committable. I really wouldn't want > it to slip yet another year. +1, so far there's little indication that we're finding any serious leaks here. Might be best to set it all aside till there's more time. regards, tom lane
Re: New IndexAM API controlling index vacuum strategies
On Wed, Mar 17, 2021 at 7:16 PM Masahiko Sawada wrote:
> Since I was thinking that always skipping index vacuuming on
> anti-wraparound autovacuum is legitimate, skipping index vacuuming
> only when we're really close to the point of going into read-only mode
> seems a bit conservative, but maybe a good start. I've attached a PoC
> patch to disable index vacuuming if the table's relfrozenxid is much
> older than autovacuum_freeze_max_age (older than 1.5x of
> autovacuum_freeze_max_age).

Most anti-wraparound VACUUMs are really not emergencies, though. So treating them as special simply because they're anti-wraparound vacuums doesn't seem like the right thing to do.

I think that we should dynamically decide to do this when (antiwraparound) VACUUM has already been running for some time. We need to delay the decision until it is almost certainly true that we really have an emergency. Can you take what you have here, and make the decision dynamic? Delay it until we're done with the first heap scan? This will require rebasing on top of the patch I posted. And then adding a third patch, a little like the second patch -- but not too much like it.

In the second/SKIP_VACUUM_PAGES_RATIO patch I posted today, the function two_pass_strategy() (my new name for the main entry point for calling lazy_vacuum_all_indexes() and lazy_vacuum_heap()) is only willing to perform the "skip index vacuuming" optimization when the call to two_pass_strategy() is the first call and the last call for that entire VACUUM (plus we test the number of heap blocks with LP_DEAD items using SKIP_VACUUM_PAGES_RATIO, of course). It works this way purely because I don't think that we should be aggressive when we've already run out of maintenance_work_mem. That's a bad time to apply a performance optimization.

But what you're talking about now isn't a performance optimization (the mechanism is similar or the same, but the underlying reasons are totally different) -- it's a safety/availability thing. I don't think that you need to be concerned about running out of maintenance_work_mem in two_pass_strategy() when applying logic that is concerned about keeping the database online by avoiding XID wraparound. You just need to have high confidence that it is a true emergency. I think that we can be ~99% sure that we're in a real emergency by using dynamic information about how old relfrozenxid is *now*, and by rechecking a few times during VACUUM. Probably by rechecking every time we call two_pass_strategy().

I now believe that there is no fundamental correctness issue with teaching two_pass_strategy() to skip index vacuuming when we're low on memory -- it is 100% a matter of costs and benefits. The core skip-index-vacuuming mechanism is very flexible. If we can be sure that it's a real emergency, I think that we can justify behaving very aggressively (letting indexes get bloated is after all very aggressive). We just need to be 99%+ sure that continuing with vacuuming will be worse than ending vacuuming. Which seems possible by making the decision dynamic (and revisiting it at least a few times during a very long VACUUM, in case things change).

--
Peter Geoghegan
Re: Parallel INSERT (INTO ... SELECT ...)
On Thu, Mar 18, 2021 at 8:30 AM Justin Pryzby wrote:
>
> diff --git a/doc/src/sgml/ref/create_table.sgml
> b/doc/src/sgml/ref/create_table.sgml
> index ff1b642722..d5d356f2de 100644
> --- a/doc/src/sgml/ref/create_table.sgml
> +++ b/doc/src/sgml/ref/create_table.sgml
> @@ -1338,8 +1338,10 @@ WITH ( MODULUS numeric_literal, REM
>  If a table parameter value is set and the
>  equivalent toast. parameter is not, the TOAST table
>  will use the table's parameter value.
> -Specifying these parameters for partitioned tables is not supported,
> -but you may specify them for individual leaf partitions.
> +These parameters, with the exception of
> +parallel_insert_enabled,
> +are not supported on partitioned tables, but may be specified for
> +individual leaf partitions.
>

Your patch looks good to me. While checking this, I noticed a typo in the previous patch:

- planner parameter parallel_workers.
+ planner parameter parallel_workers and
+ parallel_insert_enabled.

Here, it should be /planner parameter/planner parameters.

--
With Regards,
Amit Kapila.
RE: Parallel INSERT (INTO ... SELECT ...)
> > If a table parameter value is set and the
> > equivalent toast. parameter is not, the TOAST table
> > will use the table's parameter value.
> > -Specifying these parameters for partitioned tables is not supported,
> > -but you may specify them for individual leaf partitions.
> > +These parameters, with the exception of
> > +parallel_insert_enabled,
> > +are not supported on partitioned tables, but may be specified for
> > +individual leaf partitions.
>
> Your patch looks good to me. While checking this, I noticed a typo in the
> previous patch:
> - planner parameter parallel_workers.
> + planner parameter parallel_workers and
> + parallel_insert_enabled.
>
> Here, it should be /planner parameter/planner parameters.

Thanks Amit and Justin for pointing this out! The changes look good to me.

Best regards,
houzj