Re: Dynamic gathering the values for seq_page_cost/xxx_cost
On Sat, Sep 26, 2020 at 1:51 PM Julien Rouhaud wrote:
> On Sat, Sep 26, 2020 at 8:17 AM Andy Fan wrote:
> >
> > As for the testing with caching considered, I found that estimating the
> > cache hit ratio is hard, and controlling the hit ratio for a test is hard
> > as well. Recently I have been thinking about a method where we get
> > page_reads and shared_buffer_hit from the PostgreSQL side, and the real
> > I/O (without the file system cache hits) at the OS level (just as
> > iotop/pidstat do). Then we would know the shared_buffers hit ratio and
> > the file system cache hit ratio (assuming they become stable after a long
> > run), and then do the testing. However this would be another piece of
> > manual work, and I still have not got it done.
>
> FWIW the pg_stat_kcache [1] extension accumulates per (database, user,
> queryid) physical reads and writes, so you can easily compute a
> shared_buffers / IO cache / disk hit ratio.
>
> [1] https://github.com/powa-team/pg_stat_kcache

WOW, this would be a good extension for this purpose. Thanks for sharing it.

--
Best Regards
Andy Fan
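[Editorial note, not part of either message: a rough SQL sketch of the kind of
hit-ratio computation being referred to, joining pg_stat_kcache with
pg_stat_statements. The pg_stat_kcache() column names (notably reads, in
bytes) follow that extension's documentation and may differ between versions;
adjust to the installed release.]

    -- Sketch only: per-query shared-buffer and OS-cache hit ratios,
    -- assuming pg_stat_kcache() reports physical reads in bytes.
    SELECT s.queryid,
           100.0 * s.shared_blks_hit
               / NULLIF(s.shared_blks_hit + s.shared_blks_read, 0)
               AS shared_buffer_hit_pct,
           100.0 * (s.shared_blks_read * current_setting('block_size')::bigint - k.reads)
               / NULLIF(s.shared_blks_read * current_setting('block_size')::bigint, 0)
               AS os_cache_hit_pct
    FROM pg_stat_statements s
    JOIN pg_stat_kcache() k USING (userid, dbid, queryid);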
Re: New statistics for tuning WAL buffer size
On Fri, Sep 25, 2020 at 11:06 PM Fujii Masao wrote:
>
> On 2020/09/25 12:06, Masahiro Ikeda wrote:
> > On 2020-09-18 11:11, Kyotaro Horiguchi wrote:
> >> At Fri, 18 Sep 2020 09:40:11 +0900, Masahiro Ikeda wrote in
> >>> Thanks. I confirmed that it causes HOT pruning or killing of
> >>> dead index tuples if DecodeCommit() is called.
> >>>
> >>> As you said, DecodeCommit() may access the system table.
> >> ...
> >>> The WAL records are generated only when logical replication is performed.
> >>> So, I added pgstat_send_wal() in XLogSendLogical().
> >>>
> >>> But I am concerned that it causes poor performance
> >>> since pgstat_send_wal() is called per WAL record,
> >>
> >> I think that's too frequent. If we want to send any stats to the
> >> collector, it is usually done at commit time using
> >> pgstat_report_stat(), and the function avoids sending stats too
> >> frequently. For the logrep worker, apply_handle_commit() is calling it. It
> >> seems to be the place if we want to send the WAL stats. Or it may be
> >> better to call pgstat_send_wal() via pgstat_report_stat(), like
> >> pg_stat_slru().
> >
> > Thanks for your comments.
> > Since I changed to use pgstat_report_stat(), and DecodeCommit() is calling it,
> > the frequency of sending statistics is not so high.
>
> On second thought, it's strange to include this change in the pg_stat_wal
> patch, because pgstat_report_stat() sends various stats and that change would
> affect not only pg_stat_wal but also other stats views. That is, if we really
> want to make some processes call pgstat_report_stat() newly, that should be
> implemented as a separate patch. But I'm not sure how useful this change is
> because the stats are probably almost negligibly small in those processes.
>
> This thought seems valid for pgstat_send_wal(). I changed my mind and am
> inclined to be OK with not calling pgstat_send_wal() in some background
> processes that are very unlikely to generate WAL.

This makes sense to me. I think even if such background processes have to
write WAL because wal_buffers fills up, it will be accounted for the next time
the backend sends the stats.

One minor point: don't we need to reset the counter WalStats.m_wal_buffers_full
once we have sent the stats? Otherwise the same stats will be accounted for
multiple times.

--
With Regards,
Amit Kapila.
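[Editorial note, not part of the message: a minimal sketch of the reset being
suggested, assuming the patch's WalStats accumulator, its PGSTAT_MTYPE_WAL
message type, and the usual pgstat_setheader()/pgstat_send() helpers in
pgstat.c; the function name is a placeholder, not the submitted patch.]

    /*
     * Hypothetical sketch: send the WAL stats accumulated so far, then clear
     * the local counters so the same values are not reported to the stats
     * collector twice.
     */
    static void
    send_and_reset_wal_stats(void)
    {
        pgstat_setheader(&WalStats.m_hdr, PGSTAT_MTYPE_WAL);
        pgstat_send(&WalStats, sizeof(WalStats));

        /* Reset after sending, addressing the double-counting concern above. */
        MemSet(&WalStats, 0, sizeof(WalStats));
    }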
Re: Asynchronous Append on postgres_fdw nodes.
On Thu, Aug 20, 2020 at 4:36 PM Kyotaro Horiguchi wrote:
> This is rebased version.

Thanks for the rebased version!

> Come to think of "complex", ExecAsync stuff in this patch might be
> too-much for a short-term solution until executor overhaul, if it
> comes shortly. (the patch of mine here as a whole is like that,
> though..). The queueing stuff in postgres_fdw is, too.

Here are some review comments on the "ExecAsync stuff" (the 0002 patch):

@@ -192,10 +196,20 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
     i = -1;
     while ((i = bms_next_member(validsubplans, i)) >= 0)
     {
         Plan       *initNode = (Plan *) list_nth(node->appendplans, i);
+        int         sub_eflags = eflags;
+
+        /* Let async-capable subplans run asynchronously */
+        if (i < node->nasyncplans)
+        {
+            sub_eflags |= EXEC_FLAG_ASYNC;
+            nasyncplans++;
+        }

This would be more ambitious than Thomas' patch: his patch only allows foreign
scan nodes beneath an Append node to be executed asynchronously, but your
patch allows any plan nodes beneath it (e.g., local child joins between
foreign tables). Right? I think that would be great, but I'm not sure how we
execute such plan nodes asynchronously, as other parts of your patch seem to
assume that only foreign scan nodes beneath an Append are considered
async-capable. Maybe I'm missing something, though. Could you elaborate on
that?

Your patch (and the original patch by Robert [1]) modified ExecAppend() so
that it can process local plan nodes while waiting for the results from remote
queries, which would also be a feature that's not supported by Thomas' patch,
but I'd like to know the performance results. Did you do performance testing
on that? I couldn't find that in the archive.

+bool
+is_async_capable_path(Path *path)
+{
+    switch (nodeTag(path))
+    {
+        case T_ForeignPath:
+            {
+                FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+                Assert(fdwroutine != NULL);
+                if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+                    fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+                    return true;
+            }

Do we really need to introduce the FDW API IsForeignPathAsyncCapable()? I
think we could determine whether a foreign path is async-capable by checking
whether the FDW provides the postgresForeignAsyncConfigureWait() API.
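[Editorial note, not part of the review email: a sketch of what that
alternative check could look like, assuming the FdwRoutine member backing
postgresForeignAsyncConfigureWait() is named ForeignAsyncConfigureWait, as in
the patch series under review.]

    /* Hypothetical alternative, not part of the posted patch. */
    bool
    is_async_capable_path(Path *path)
    {
        switch (nodeTag(path))
        {
            case T_ForeignPath:
                {
                    FdwRoutine *fdwroutine = path->parent->fdwroutine;

                    Assert(fdwroutine != NULL);
                    /* Presence of the configure-wait callback implies async capability. */
                    if (fdwroutine->ForeignAsyncConfigureWait != NULL)
                        return true;
                }
                break;
            default:
                break;
        }
        return false;
    }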
In relation to the first comment, I noticed this change in the postgres_fdw
regression tests:

HEAD:

EXPLAIN (VERBOSE, COSTS OFF)
SELECT a, count(t1) FROM pagg_tab t1 GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
                           QUERY PLAN
-----------------------------------------------------------------
 Sort
   Output: t1.a, (count(((t1.*)::pagg_tab)))
   Sort Key: t1.a
   ->  Append
         ->  HashAggregate
               Output: t1.a, count(((t1.*)::pagg_tab))
               Group Key: t1.a
               Filter: (avg(t1.b) < '22'::numeric)
               ->  Foreign Scan on public.fpagg_tab_p1 t1
                     Output: t1.a, t1.*, t1.b
                     Remote SQL: SELECT a, b, c FROM public.pagg_tab_p1
         ->  HashAggregate
               Output: t1_1.a, count(((t1_1.*)::pagg_tab))
               Group Key: t1_1.a
               Filter: (avg(t1_1.b) < '22'::numeric)
               ->  Foreign Scan on public.fpagg_tab_p2 t1_1
                     Output: t1_1.a, t1_1.*, t1_1.b
                     Remote SQL: SELECT a, b, c FROM public.pagg_tab_p2
         ->  HashAggregate
               Output: t1_2.a, count(((t1_2.*)::pagg_tab))
               Group Key: t1_2.a
               Filter: (avg(t1_2.b) < '22'::numeric)
               ->  Foreign Scan on public.fpagg_tab_p3 t1_2
                     Output: t1_2.a, t1_2.*, t1_2.b
                     Remote SQL: SELECT a, b, c FROM public.pagg_tab_p3
(25 rows)

Patched:

EXPLAIN (VERBOSE, COSTS OFF)
SELECT a, count(t1) FROM pagg_tab t1 GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
                           QUERY PLAN
-----------------------------------------------------------------
 Sort
   Output: t1.a, (count(((t1.*)::pagg_tab)))
   Sort Key: t1.a
   ->  HashAggregate
         Output: t1.a, count(((t1.*)::pagg_tab))
         Group Key: t1.a
         Filter: (avg(t1.b) < '22'::numeric)
         ->  Append
               Async subplans: 3
               ->  Async Foreign Scan on public.fpagg_tab_p1 t1_1
                     Output: t1_1.a, t1_1.*, t1_1.b
                     Remote SQL: SELECT a, b, c FROM public.pagg_tab_p1
               ->  Async Foreign Scan on public.fpagg_tab_p2 t1_2
                     Output: t1_2.a, t1_2.*, t1_2.b
                     Remote SQL: SELECT a, b, c FROM public.pagg_tab_p2
               ->  Async Foreign Scan on public.fpagg_tab_p3 t1_3
                     Output: t1_3.a, t1_3.*, t1_3.b
                     Remote SQL: SELECT a, b, c FROM public.pagg_tab_p3
(18 rows)

So, yo
Re: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables
On Wed, Sep 23, 2020 at 1:09 PM Keisuke Kuroda wrote:
>
> Hi hackers,
>
> I found a problem in logical replication.
> It seems to have the same cause as the following problems.
>
> Creating many tables gets logical replication stuck
> https://www.postgresql.org/message-id/flat/20f3de7675f83176253f607b5e199b228406c21c.camel%40cybertec.at
>
> Logical decoding CPU-bound w/ large number of tables
> https://www.postgresql.org/message-id/flat/CAHoiPjzea6N0zuCi%3D%2Bf9v_j94nfsy6y8SU7-%3Dbp4%3D7qw6_i%3DRg%40mail.gmail.com
>
> # problem
>
> * logical replication enabled
> * the walsender process has a RelfilenodeMap cache (2000 relations in this case)
> * TRUNCATE or DROP or CREATE many tables in the same transaction
>
> At this time, the walsender process continues to use 100% of one CPU core.
> ...
>
> ./src/backend/replication/logical/reorderbuffer.c
> 1746            case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
> 1747                Assert(change->data.command_id != InvalidCommandId);
> 1748
> 1749                if (command_id < change->data.command_id)
> 1750                {
> 1751                    command_id = change->data.command_id;
> 1752
> 1753                    if (!snapshot_now->copied)
> 1754                    {
> 1755                        /* we don't use the global one anymore */
> 1756                        snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
> 1757                                                             txn, command_id);
> 1758                    }
> 1759
> 1760                    snapshot_now->curcid = command_id;
> 1761
> 1762                    TeardownHistoricSnapshot(false);
> 1763                    SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
> 1764
> 1765                    /*
> 1766                     * Every time the CommandId is incremented, we could
> 1767                     * see new catalog contents, so execute all
> 1768                     * invalidations.
> 1769                     */
> 1770                    ReorderBufferExecuteInvalidations(rb, txn);
> 1771                }
> 1772
> 1773                break;
>
> Do you have any solutions?

Yeah, I have an idea on how to solve this problem. This problem is primarily
due to the fact that we used to receive invalidations only at commit time, and
then we need to execute them after each command id change. However, after
commit c55040ccd0 ("When wal_level=logical, write invalidations at command end
into WAL so that decoding can use this information.") we actually know exactly
when we need to execute each invalidation.

The idea is that instead of collecting invalidations only in ReorderBufferTxn,
we need to collect them in the form of a ReorderBufferChange as well, similar
to what we do for other changes (for example,
REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID). In this case, we additionally need
to collect them in ReorderBufferTxn because if the transaction is aborted or
some exception occurred while executing the changes, we need to perform all
the invalidations.

--
With Regards,
Amit Kapila.
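[Editorial note, not part of the message: a rough sketch of the shape this
idea could take. The change-type name and struct fields are placeholders, and
it assumes ReorderBufferExecuteInvalidations() is refactored to take an
explicit message array rather than (rb, txn); this is not a working patch.]

    /* hypothetical new member of enum ReorderBufferChangeType */
    REORDER_BUFFER_CHANGE_INVALIDATION,

    /* hypothetical new arm of the per-change union in ReorderBufferChange */
    struct
    {
        uint32      ninvalidations;
        SharedInvalidationMessage *invalidations;
    }           inval;

    /* in the ReorderBufferProcessTXN() switch, alongside the case quoted above */
    case REORDER_BUFFER_CHANGE_INVALIDATION:
        ReorderBufferExecuteInvalidations(change->data.inval.ninvalidations,
                                          change->data.inval.invalidations);
        break;

ReorderBufferTxn would still keep the full list so that an abort or error can
replay all the invalidations, as described above.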
Avoid suspects casts VARHDRSZ (c.h)
Hi,

In all the static analysis tools I've used, there are literally *hundreds* of
alerts about one suspect cast:

64-bit size_t -> 32-bit int -> 64-bit size_t

-#define VARHDRSZ		((int32) sizeof(int32))
+#define VARHDRSZ		(sizeof(int32))

Is there any special reason for not simplifying this and avoiding these
alerts? It passed 100% with vcregress check and is in use in local tests.

regards,
Ranier Vilela

avoid_suspects_cast_VARHDRSZ.patch
Description: Binary data
Re: Get rid of runtime handling of AlternativeSubPlan?
On Tue, 1 Sep 2020 at 05:22, Tom Lane wrote:
>
> I wrote:
> > One inefficiency I see that we could probably get rid of is
> > where make_subplan() is doing
> >     /* Now we can check if it'll fit in hash_mem */
> >     /* XXX can we check this at the Path stage? */
>
> I went ahead and fixed that, and I also realized there's another small
> improvement to be made: we can remove the unused SubPlan from the
> subplans list of the finished PlannedStmt, by setting that list cell
> to NULL. (This is already specifically allowed by the comments for
> PlannedStmt.subplans.) Initially I supposed that this'd only save the
> costs of copying that subtree when we copy the whole plan. On looking
> closer, though, InitPlan actually runs ExecInitNode on every list
> entry, even unused ones, so this will make some difference in executor
> startup time.
>
> Hence, an updated 0001 patch. 0002 hasn't changed.

I had a look over these two. A handful of very small things:

0001:

1. I think we should be moving away from using linitial() and lsecond() when
we know there are two items in the list. Using list_nth() has less overhead.

    subplan1 = (SubPlan *) linitial(asplan->subplans);
    subplan2 = (SubPlan *) lsecond(asplan->subplans);

2. I did have slight concerns that fix_alternative_subplan() always assumes
the list of subplans will always be 2, though on looking at the definition of
AlternativeSubPlan, I see that always having two in the list is mentioned. It
feels like fix_alternative_subplan() wouldn't become much more complex to
allow any non-zero number of subplans, but maybe making that happen should
wait until there is some need for more than two. It just feels a bit icky to
have to document the special case when not having the special case is not
that hard to implement.

3. Wouldn't it be better to say NULLify rather than delete?

+ * node or higher-level nodes.  However, we do delete the rejected subplan
+ * from root->glob->subplans, to minimize cycles expended on it later.

0002:

I don't have much to say about this. Leaving the code in get_rule_expr() for
the reasons you mentioned in the new comment does make sense.

On a side note, I was playing around with the following case:

create table t (a int, b int, c int);
insert into t select x,1,2 from generate_Series(1,1)x;
create index on t (b);
vacuum freeze analyze t;

and ran:

select * from t where exists (select 1 from t t2 where t.a=t2.b) or a < 0;

EXPLAIN ANALYZE shows:

                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..360.00 rows=5000 width=12) (actual time=0.020..7468.452 rows=1 loops=1)
   Filter: ((SubPlan 1) OR (a < 0))
   Rows Removed by Filter:
   SubPlan 1
     ->  Seq Scan on t t2  (cost=0.00..180.00 rows=1 width=0) (actual time=0.746..0.746 rows=0 loops=1)
           Filter: (t.a = b)
           Rows Removed by Filter:
 Planning Time: 0.552 ms
 Execution Time: 7468.481 ms
(9 rows)

Notice that the SubPlan's estimated rows are 1. This is due to the ndistinct
for "b" being 1, and since t.a is a parameter, the selectivity is estimated to
be 1.0 by var_eq_non_const(). Unfortunately, for this reason, the index on
t(b) is not used either. The planner thinks all rows are being selected, in
which case an index is not much help.

Both master and patched seem not to choose the hashed subplan, which results
in a pretty slow execution time. This seems to be down to cost_subplan()
doing:

    /* we only need to fetch 1 tuple; clamp to avoid zero divide */
    sp_cost.per_tuple += plan_run_cost / clamp_row_est(plan->plan_rows);

I imagine / 2 might be more realistic to account for the early abort, which is
pretty much what the ALL_SUBLINK and ANY_SUBLINK cases do just below.

Changing that makes the run-time of that query go from 7.4 seconds for me down
to 3.7 ms, about 2000 times faster. I understand there will be other cases
where that's not so ideal, but this slowness is not ideal either. Of course,
that is not the fault of this patch.

David
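[Editorial note, not part of the message: as I read the suggestion, the
adjustment for EXISTS_SUBLINK in cost_subplan() would look roughly like this
(a sketch only, mirroring the 50% assumption already used for ALL/ANY
sublinks; not a tested or committed change, and Tom raises a counter-argument
in his reply further down).]

    	/* we only need to fetch 1 tuple; clamp to avoid zero divide */
    -	sp_cost.per_tuple += plan_run_cost / clamp_row_est(plan->plan_rows);
    +	/* sketch: assume we fetch half the rows on average, as for ALL/ANY below */
    +	sp_cost.per_tuple += 0.5 * plan_run_cost;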
Re: Parallel INSERT (INTO ... SELECT ...)
On Wed, Sep 23, 2020 at 2:21 PM Greg Nancarrow wrote:
>
> > - When INSERTs are made parallel, currently the reported row-count in
> > the "INSERT 0 <count>" status only reflects the rows that the
> > leader has processed (not the workers) - so it is obviously less than
> > the actual number of rows inserted.
>
> Attached an updated patch which fixes this issue (for parallel
> INSERTs, each worker's processed tuple count is communicated in shared
> memory back to the leader, where it is added to the global
> "es_processed" count).

I noticed that we do not have any check for skipping temporary table
insertion.

    /*
     * Check if the target relation has foreign keys; if so, avoid
     * creating a parallel Insert plan (because inserting into
     * such tables would result in creation of new CommandIds, and
     * this isn't supported by parallel workers).
     * Similarly, avoid creating a parallel Insert plan if ON
     * CONFLICT ... DO UPDATE ... has been specified, because
     * parallel UPDATE is not supported.
     * However, do allow any underlying query to be run by parallel
     * workers in these cases.
     */

You should also include a temporary-table check here, as parallel workers
might not have access to temporary tables.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
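[Editorial note, not part of the message: for illustration, the extra check
could be along these lines, assuming the surrounding code has the target
Relation open as rel (as the existing FK check presumably does); how the
result is signalled back is a placeholder.]

    /*
     * Hypothetical addition: temporary tables live in backend-local buffers,
     * which parallel workers cannot access, so don't generate a parallel
     * Insert plan for them either.
     */
    if (RelationUsesLocalBuffers(rel))
        return false;           /* or however the surrounding code marks
                                 * "parallel insert not safe" */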
Re: Libpq support to connect to standby server as priority
I wrote:
> BTW, I think it would be worth splitting this into separate server-side
> and libpq patches.  It looked to me like the server side is pretty
> nearly committable, modulo bikeshedding about the new GUC name.

Actually ... I looked that over again and got a bit more queasy about all the
new signaling logic it is adding. Signals are inherently bug-prone stuff, plus
it's not very clear what sort of guarantees we'd have about either the
reliability or the timeliness of client notifications about exiting
hot-standby mode. I also wonder what consideration has been given to the
performance implications of marking transaction_read_only as GUC_REPORT, thus
causing client traffic to occur every time it's changed. Most of the current
GUC_REPORT variables don't change too often in typical sessions, but I'm less
convinced about that for transaction_read_only.

So I thought about alternative ways to implement this, and realized that it
would not be hard to make guc.c handle it all by itself, if we use a custom
show-hook for the in_hot_standby GUC that calls RecoveryInProgress() instead
of examining purely static state.

Now, by itself that idea only takes care of the session-start-time report,
because there'd never be any GUC action causing a new report to occur. But we
can improve the situation if we get rid of the current design whereby
ReportGUCOption() is called immediately when any GUC value changes, and
instead batch up the reports to occur when we're about to go idle waiting for
a new client query.

Not incidentally, this responds to a concern Robert mentioned awhile back
about the performance of GUC reporting [1]. You can already get the server to
spam the client excessively if any GUC_REPORT variables are changed by, for
example, functions' SET clauses, because that could lead to the active value
changing many times within a query. We've gotten away with that so far, but
it'd be a problem if any more-often-changed variables get marked GUC_REPORT.
(I actually have a vague memory of other complaints about that, but I couldn't
find any in a desultory search of the archives.)

So I present 0001 attached, which changes the GUC_REPORT code to work that
way, and then 0002 is a pretty small hack to add a reportable in_hot_standby
GUC by having the end-of-query function check (when needed) to see if the
active value changed.

As it stands, 0001 reduces the ParameterStatus message traffic to at most one
per GUC per query, but it doesn't attempt to eliminate duplicate
ParameterStatus messages altogether. We could do that as a pretty simple
adjustment if we're willing to expend the storage to remember the last value
sent to the client. It might be worth doing, since for example the
function-SET-clause case would typically lead to no net change in the GUC's
value by the end of the query.

An objection that could be raised to this approach for in_hot_standby is that
it will only report in_hot_standby becoming false at the end of a query,
whereas the v19 patch at least attempts to deliver an async ParameterStatus
message when the client is idle (and, I think, indeed may fail to provide
*any* message if the transition occurs when it isn't idle). I don't find that
too compelling though; libpq-based clients, at least, don't have any very
practical way to deal with async ParameterStatus messages anyway.

(Note that I did not touch the docs here, so that while 0001 might be
committable as-is, 0002 is certainly just WIP.)
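[Editorial note, not part of the message: a rough sketch of the batching idea,
not the actual 0001 patch. It uses the report_needed flag that the attached
diff adds, plus an assumed per-variable GUC_NEEDS_REPORT status bit; the
function body is illustrative only.]

    /*
     * Hypothetical sketch: instead of emitting a ParameterStatus message from
     * inside SET processing, just remember that a report is pending and flush
     * all pending reports once per query, right before ReadyForQuery.
     */
    void
    ReportChangedGUCOptions(void)
    {
        /* quick exit if no GUC_REPORT variable changed since the last flush */
        if (!report_needed || !reporting_enabled)
            return;

        for (int i = 0; i < num_guc_variables; i++)
        {
            struct config_generic *conf = guc_variables[i];

            if ((conf->flags & GUC_REPORT) && (conf->status & GUC_NEEDS_REPORT))
            {
                ReportGUCOption(conf);              /* sends ParameterStatus */
                conf->status &= ~GUC_NEEDS_REPORT;  /* assumed per-variable flag */
            }
        }
        report_needed = false;
    }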
BTW, as far as the transaction_read_only side of things goes, IMO it would
make a lot more sense to mark default_transaction_read_only as GUC_REPORT,
since that changes a lot less frequently. We'd then have to expend some work
to report that value honestly, since right now the hot-standby code cheats by
ignoring the GUC's value during hot standby. But I think a technique much like
0002's would work for that.

Thoughts?

			regards, tom lane

[1] https://www.postgresql.org/message-id/CA%2BTgmoaDoVtMnfKNFm-iyyCSp%3DFPiHkfU1AXuEHJqmcLTAX6kQ%40mail.gmail.com

diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 411cfadbff..b67cc2f375 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -4233,6 +4233,9 @@ PostgresMain(int argc, char *argv[],
 				pgstat_report_activity(STATE_IDLE, NULL);
 			}
 
+			/* Report any recently-changed GUC options */
+			ReportChangedGUCOptions();
+
 			ReadyForQuery(whereToSendOutput);
 			send_ready_for_query = false;
 		}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 596bcb7b84..ddfc7ea05d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -4822,6 +4822,8 @@
 static bool guc_dirty;			/* true if need to do commit/abort work */
 
 static bool reporting_enabled;	/* true to enable GUC_REPORT */
 
+static bool report_needed;		/* true if any GUC_REPORT reports are needed */
+
 static int	GUCNestLevel = 0;	/* 1 when in main transaction */
@
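[Editorial note, not part of the message: the custom show-hook mentioned above
could be as small as the following sketch; the function name is a placeholder
and this is not the 0002 patch itself.]

    /*
     * Hypothetical show-hook for in_hot_standby: report the live recovery
     * state rather than a statically assigned value, so the GUC is correct
     * even for the session-start report.
     */
    static const char *
    show_in_hot_standby(void)
    {
        return RecoveryInProgress() ? "on" : "off";
    }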
enable_incremental_sort changes query behavior
Hi,

With sqlsmith I found a query that gives this error:

ERROR:  ORDER/GROUP BY expression not found in targetlist

I noted the query (SQL below; sorry, it uses custom tables I couldn't
replicate with regression tables) because it doesn't include an ORDER/GROUP BY
clause.

--- 0
select distinct
  subq_0.c1 as c0,
  ref_0.radi_usua_radi as c1,
  ref_0.radi_nume_asoc as c2,
  subq_0.c1 as c3,
  case when (cast(null as pg_lsn) >= pg_catalog.pg_last_wal_receive_lsn())
        and (true = pg_catalog.pg_rotate_logfile_old())
       then ref_0.radi_usua_rem
       else ref_0.radi_usua_rem
       end as c4,
  cast(nullif((select hist_codi from public.hist_eventos_2 limit 1 offset 4),
              pg_catalog.pg_stat_get_buf_alloc()) as int8) as c5
from
  public.radicado_2 as ref_0,
  lateral (select
             ref_0.radi_text_temp as c0,
             ref_0.radi_usua_actu as c1
           from public.hist_eventos_1 as ref_1
           where cast(nullif(cast(null as float4), cast(null as float4)) as float4)
                 >= pg_catalog.pi()) as subq_0
where ref_0.radi_usua_dest is not NULL;
--- 0

Attached is the stack trace produced up to the elog that produces the message.

But if I set enable_incremental_sort to off, the query gets executed without
problems (attached is the explain produced for that case).

--
Jaime Casanova
www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=585169544090.96..675589722275.63 rows=202007 width=65)
   Group Key: ref_0.radi_usua_actu, ref_0.radi_usua_radi, ref_0.radi_nume_asoc, ref_0.radi_usua_actu, CASE WHEN (pg_rotate_logfile_old() AND NULL::boolean) THEN ref_0.radi_usua_rem ELSE ref_0.radi_usua_rem END, NULLIF($0, pg_stat_get_buf_alloc())
   Planned Partitions: 16
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.10..0.12 rows=1 width=8)
           ->  Seq Scan on hist_eventos_2  (cost=0.00..118555.22 rows=4844222 width=8)
   ->  Result  (cost=0.01..88461384682.66 rows=3857927451714 width=65)
         One-Time Filter: (NULLIF(NULL::real, NULL::real) >= '3.141592653589793'::double precision)
         ->  Nested Loop  (cost=0.01..59526928794.81 rows=3857927451714 width=30)
               ->  Seq Scan on radicado_2 ref_0  (cost=0.00..53449.74 rows=798774 width=30)
                     Filter: (radi_usua_dest IS NOT NULL)
               ->  Materialize  (cost=0.00..156323.17 rows=4829811 width=0)
                     ->  Seq Scan on hist_eventos_1 ref_1  (cost=0.00..118024.11 rows=4829811 width=0)
(13 rows)

#0  get_sortgroupref_tle (sortref=1, targetList=0x563fd6160f80) at tlist.c:371
        l__state = {l = 0x0, i = 0}
        l = 0x563fd616e948
        __func__ = "get_sortgroupref_tle"
#1  0x563fd49cac41 in get_sortgroupclause_tle (sgClause=0x563fd617c928, targetList=0x563fd6160f80) at tlist.c:392
No locals.
#2  0x563fd49cac66 in get_sortgroupclause_expr (sgClause=0x563fd617c928, targetList=0x563fd6160f80) at tlist.c:403
        tle = 0x1d613d940
#3  0x563fd496ee90 in make_pathkeys_for_sortclauses (root=0x563fd617cc40, sortclauses=0x563fd617c890, tlist=0x563fd6160f80) at pathkeys.c:1138
        sortcl = 0x563fd617c928
        sortkey = 0x563fd617c890
        pathkey = 0x563fd617c890
        l__state = {l = 0x563fd617c890, i = 0}
        pathkeys = 0x0
        l = 0x563fd617c8a8
#4  0x563fd498be89 in standard_qp_callback (root=0x563fd617cc40, extra=0x7ffc8bbff920) at planner.c:3629
        parse = 0x563fd61a01c8
        qp_extra = 0x7ffc8bbff920
        tlist = 0x563fd6160f80
        activeWindows = 0x0
#5  0x563fd4984f20 in query_planner (root=0x563fd617cc40, qp_callback=0x563fd498bd4d <standard_qp_callback>, qp_extra=0x7ffc8bbff920) at planmain.c:205
        parse = 0x563fd61a01c8
        joinlist = 0x563fd616e870
        final_rel = 0x563fd6160f80
        __func__ = "query_planner"
#6  0x563fd4988c00 in grouping_planner (root=0x563fd617cc40, inheritance_update=false, tuple_fraction=0) at planner.c:2058
        sort_input_targets = 0x563fd61662e8
        sort_input_target_parallel_safe = false
        grouping_target = 0x563fd61663b0
        scanjoin_target = 0x7ffc8bbffa80
        activeWindows = 0x0
        qp_extra = {activeWindows = 0x0, groupClause = 0x0}
        sort_input_targets_contain_srfs = 0x563fd613d940
        have_grouping = false
        wflists = 0x0
        gset_data = 0x0
        sort_input_target = 0x7ffc8bbffab0
        grouping_targets = 0x40
        grouping
Re: Reply: how to create index concurrently on partitioned table
On Thu, Sep 24, 2020 at 05:11:03PM +0900, Michael Paquier wrote:
> start.  So, the goal of the patch, as I would define it, is to give a
> way for CLUSTER to work on a partitioned table using a given
> partitioned index.  Hence, we would perform CLUSTER automatically from
> the top-most parent for any partitions that have an index on the same
> partition tree as the partitioned index defined in USING, switching
> indisclustered automatically depending on the index used.

I think that's right, except there's no need to look for a compatible
partitioned index: we just use the child index.

Also, if a partitioned index is clustered, when we clear indisclustered for
other indexes, should we also propagate that to their parent indexes, if any?

> From what I can see, attempting to use a CLUSTER on a top-most
> partitioned table fails to work on child tables,

Oops - it looks like this patch never worked right, due to the RECHECK on
indisclustered. I think maybe I returned to the CIC patch and never finished
with cluster.

> It would be good also to check if
> we have a partition index tree that maps partially with a partition
> table tree (aka not all table partitions have a partition index), where
> these don't get clustered because there is no index to work on.

This should not happen, since an incomplete partitioned index is "invalid".

> Isn't NoLock unsafe here, even knowing that an exclusive lock is taken on
> the parent?  It seems to me that at least schema changes should be
> prevented with a ShareLock in the first transaction building the list
> of elements to cluster.

Thanks for noticing. I chose ShareUpdateExclusiveLock since it's
self-exclusive.

--
Justin

>From e4fe1e422dc2c0ee21ce9df15d5baf24e825bf9d Mon Sep 17 00:00:00 2001
From: Justin Pryzby
Date: Sat, 6 Jun 2020 17:42:23 -0500
Subject: [PATCH v8 1/3] Allow CREATE INDEX CONCURRENTLY on partitioned table

Note, this effectively reverts 050098b14, so take care to not reintroduce the
bug it fixed.

XXX: does pgstat_progress_update_param() break other commands' progress ?
---
 doc/src/sgml/ref/create_index.sgml     |   9 --
 src/backend/commands/indexcmds.c       | 135 +
 src/test/regress/expected/indexing.out |  60 ++-
 src/test/regress/sql/indexing.sql      |  18 +++-
 4 files changed, 166 insertions(+), 56 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 33aa64e81d..c780dc9547 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -657,15 +657,6 @@ Indexes:
     cannot.
    </para>
 
-   <para>
-    Concurrent builds for indexes on partitioned tables are currently not
-    supported.  However, you may concurrently build the index on each
-    partition individually and then finally create the partitioned index
-    non-concurrently in order to reduce the time where writes to the
-    partitioned table will be locked out.  In this case, building the
-    partitioned index is a metadata only operation.
-   </para>
-
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index f1b5f87e6a..d417404211 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -91,6 +91,7 @@ static bool ReindexRelationConcurrently(Oid relationOid, int options);
 static void ReindexPartitions(Oid relid, int options, bool isTopLevel);
 static void ReindexMultipleInternal(List *relids, int options);
+static void reindex_invalid_child_indexes(Oid indexRelationId);
 static void reindex_error_callback(void *args);
 static void update_relispartition(Oid relationId, bool newval);
 static bool CompareOpclassOptions(Datum *opts1, Datum *opts2, int natts);
@@ -665,17 +666,6 @@ DefineIndex(Oid relationId,
 	partitioned = rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE;
 	if (partitioned)
 	{
-		/*
-		 * Note: we check 'stmt->concurrent' rather than 'concurrent', so that
-		 * the error is thrown also for temporary tables.  Seems better to be
-		 * consistent, even though we could do it on temporary table because
-		 * we're not actually doing it concurrently.
-		 */
-		if (stmt->concurrent)
-			ereport(ERROR,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("cannot create index on partitioned table \"%s\" concurrently",
-							RelationGetRelationName(rel))));
 		if (stmt->excludeOpNames)
 			ereport(ERROR,
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
@@ -1110,6 +1100,11 @@ DefineIndex(Oid relationId,
 		if (pd->nparts != 0)
 			flags |= INDEX_CREATE_INVALID;
 	}
+	else if (concurrent && OidIsValid(parentIndexId))
+	{
+		/* If concurrent, initially build index partitions as "invalid" */
+		flags |= INDEX_CREATE_INVALID;
+	}
 
 	if (stmt->deferrable)
 		constr_flags |= INDEX_CONSTR_CREATE_DEFERRABLE;
@@ -1162,6 +1157,14 @@ DefineIndex(Oid relationId,
 	 */
 	if (!stmt->relation || stmt->relation->inh)
 	{
+		/*
+		 * Need to close the relation before recursing into children, so
+		 * copy
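[Editorial note, not part of the message: the user-visible behaviour being
discussed would look roughly like the following, assuming both the CIC patch
above and the CLUSTER-on-partitioned-table part of the series are applied;
table and index names are made up for illustration.]

    -- Sketch of intended usage under the proposed patches.
    CREATE TABLE p (i int) PARTITION BY RANGE (i);
    CREATE TABLE p1 PARTITION OF p FOR VALUES FROM (1) TO (1000);
    CREATE TABLE p2 PARTITION OF p FOR VALUES FROM (1000) TO (2000);

    -- With the CIC patch: build the partitioned index concurrently; child
    -- indexes start out "invalid" and are validated as the build progresses.
    CREATE INDEX CONCURRENTLY p_idx ON p (i);

    -- With the CLUSTER patch: clustering the top-most parent recurses to the
    -- partitions, using each partition's own child of p_idx.
    CLUSTER p USING p_idx;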
Re: Get rid of runtime handling of AlternativeSubPlan?
Thanks for reviewing!

David Rowley writes:
> 1. I think we should be moving away from using linitial() and lsecond()
> when we know there are two items in the list. Using list_nth() has
> less overhead.

Uh, really?  And if it's true, why would we change all the call sites rather
than improving the pg_list.h macros?

> 2. I did have slight concerns that fix_alternative_subplan() always
> assumes the list of subplans will always be 2, though on looking at
> the definition of AlternativeSubPlan, I see always having two in the
> list is mentioned. It feels like fix_alternative_subplan() wouldn't
> become much more complex to allow any non-zero number of subplans, but
> maybe making that happen should wait until there is some need for more
> than two. It just feels a bit icky to have to document the special
> case when not having the special case is not that hard to implement.

It seemed to me that dealing with the general case would make
fix_alternative_subplan() noticeably more complex and less obviously correct.
I might be wrong though; what specific coding did you have in mind?

> 3. Wouldn't it be better to say NULLify rather than delete?
> + * node or higher-level nodes.  However, we do delete the rejected subplan
> + * from root->glob->subplans, to minimize cycles expended on it later.

Fair enough, that comment could be improved.

> On a side note, I was playing around with the following case:
> ...
> Both master and patched seem not to choose the hashed subplan, which results
> in a pretty slow execution time. This seems to be down to cost_subplan()
> doing:
>
> /* we only need to fetch 1 tuple; clamp to avoid zero divide */
> sp_cost.per_tuple += plan_run_cost / clamp_row_est(plan->plan_rows);
>
> I imagine / 2 might be more realistic to account for the early abort,
> which is pretty much what the ALL_SUBLINK and ANY_SUBLINK cases do just
> below.

Hm, actually isn't it the other way around?  *If* there are any matching rows,
then what's being done here is an accurate estimate.  But if there are not,
we're going to have to scan the entire subquery output to verify that.  I
wonder if we should just be taking the subquery cost at face value, ie be
pessimistic not optimistic.  If the user is bothering to test EXISTS, we
should expect that the no-match case does happen.

However, I think that's a distinct concern from this patch; this patch is only
meant to improve the processing of alternative subplans, not to change the
costing rules around them.  If we fool with it I'd rather do so as a separate
patch.

			regards, tom lane
Re: Avoid suspects casts VARHDRSZ (c.h)
Ranier Vilela writes:
> In all the static analysis tools I've used, there are literally *hundreds*
> of alerts about one suspect cast:
> 64-bit size_t -> 32-bit int -> 64-bit size_t

> -#define VARHDRSZ ((int32) sizeof(int32))
> +#define VARHDRSZ (sizeof(int32))

Given that the compiler can very easily see that there is no actual overflow
there, I question whether these warnings are of any value.

Also, the proposed patch changes the type of that macro from signed to
unsigned, meaning that it's considerably riskier than you seem to think.  We'd
have to look at every usage to see if that would affect the interpretation of
any comparisons, for example.

On the whole I see little value here.  Suggest finding a tool with less
nanny-ish tendencies.

			regards, tom lane
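[Editorial note, not part of the message: a standalone C example of the
comparison hazard being pointed at. It uses int32_t/size_t rather than the
PostgreSQL typedefs so it compiles on its own; with the macro made unsigned, a
signed length that can legitimately be negative no longer compares the way the
surrounding code expects.]

    #include <stdio.h>
    #include <stdint.h>

    #define VARHDRSZ_SIGNED   ((int32_t) sizeof(int32_t))   /* current definition */
    #define VARHDRSZ_UNSIGNED (sizeof(int32_t))             /* proposed definition */

    int
    main(void)
    {
        int32_t len = -1;       /* e.g. an "unknown length" sentinel */

        /* Signed comparison: behaves as the author of the test expected. */
        if (len < VARHDRSZ_SIGNED)
            printf("signed macro: len is smaller than the header size\n");

        /*
         * Unsigned comparison: len is converted to size_t, so -1 becomes a
         * huge value and the test silently changes meaning.
         */
        if (len < VARHDRSZ_UNSIGNED)
            printf("unsigned macro: this line is never reached\n");
        else
            printf("unsigned macro: len compares as larger than the header size\n");

        return 0;
    }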
Re: More efficient RI checks - take 2
On Fri, Jun 05, 2020 at 05:16:43PM +0200, Antonin Houska wrote:
> Antonin Houska wrote:
>
> > In general, the checks are significantly faster if there are many rows to
> > process, and a bit slower when we only need to check a single row.
>
> Attached is a new version that uses the existing simple queries if there's
> only one row to check.  SPI is used for both single-row and bulk checks - as
> discussed in this thread, it can perhaps be replaced with a different
> approach if that appears to be beneficial, at least for the single-row
> checks.
>
> I think using a separate query for the single-row check is more practicable
> than convincing the planner that the bulk-check query should only check a
> single row.  So this patch version tries to show what it'd look like.

I'm interested in testing this patch, however there's a lot of internals to
digest. Are there any documentation updates or regression tests to add?

If FKs support "bulk" validation, users should know when that applies, and be
able to check that it's working as intended. Even if the test cases are overly
verbose or not stable, and not intended for commit, they would be a useful
temporary addition.

I think that calls=4 indicates this is using bulk validation:

postgres=# begin; explain(analyze, timing off, costs off, summary off, verbose) DELETE FROM t WHERE i<999; rollback;
BEGIN
                             QUERY PLAN
---------------------------------------------------------------------
 Delete on public.t (actual rows=0 loops=1)
   ->  Index Scan using t_pkey on public.t (actual rows=998 loops=1)
         Output: ctid
         Index Cond: (t.i < 999)
 Trigger RI_ConstraintTrigger_a_16399 for constraint t_i_fkey: calls=4

I started thinking about this 1+ years ago, wondering if a BRIN index could be
used for (bulk) FK validation. So I would like to be able to see the *plan*
for the query. I was able to show the plan and see that BRIN can be used like
so:

|SET auto_explain.log_nested_statements=on; SET client_min_messages=debug; SET auto_explain.log_min_duration=0;

Should the plan be visible in explain (not auto_explain)?

BTW, did you see this older thread?
https://www.postgresql.org/message-id/flat/CA%2BU5nMLM1DaHBC6JXtUMfcG6f7FgV5mPSpufO7GRnbFKkF2f7g%40mail.gmail.com

--
Justin