Re: Dynamic gathering the values for seq_page_cost/xxx_cost

2020-09-26 Thread Andy Fan
On Sat, Sep 26, 2020 at 1:51 PM Julien Rouhaud  wrote:

> On Sat, Sep 26, 2020 at 8:17 AM Andy Fan  wrote:
> >
> > As for the testing with cache considered, I found it hard to estimate the
> > cache hit ratio, and hard to control a given hit ratio for a test. Recently
> > I have been thinking about a method where we get page_reads and
> > shared_buffer_hit from the PG kernel and the real IO (without the file
> > system cache hits) at the OS level (just as iotop/pidstat do). Then we
> > would know the shared_buffers hit ratio and the file system cache hit
> > ratio (assuming they are stable after a long run), and could then run a
> > test. However, this would be another piece of manual work and I still
> > have not got it done.
>
> FWIW pg_stat_kcache [1] extension accumulates per (database, user,
> queryid) physical reads and writes, so you can easily compute a
> shared_buffers / IO cache / disk hit ratio.
>
> [1] https://github.com/powa-team/pg_stat_kcache
>

WOW, this would be a good extension for this purpose.  Thanks for sharing
it.

-- 
Best Regards
Andy Fan


Re: New statistics for tuning WAL buffer size

2020-09-26 Thread Amit Kapila
On Fri, Sep 25, 2020 at 11:06 PM Fujii Masao
 wrote:
>
> On 2020/09/25 12:06, Masahiro Ikeda wrote:
> > On 2020-09-18 11:11, Kyotaro Horiguchi wrote:
> >> At Fri, 18 Sep 2020 09:40:11 +0900, Masahiro Ikeda
> >>  wrote in
> >>> Thanks. I confirmed that it causes HOT pruning or killing of
> >>> dead index tuples if DecodeCommit() is called.
> >>>
> >>> As you said, DecodeCommit() may access the system tables.
> >> ...
> >>> These WAL records are generated only when logical replication is
> >>> performed, so I added pgstat_send_wal() in XLogSendLogical().
> >>>
> >>> But I was concerned that this would cause poor performance,
> >>> since pgstat_send_wal() is called per WAL record,
> >>
> >> I think that's too frequent.  If we want to send any stats to the
> >> collector, it is usually done at commit time using
> >> pgstat_report_stat(), and the function avoids sending stats too
> >> frequently. For logrep-worker, apply_handle_commit() is calling it. It
> >> seems to be the place if we want to send the wal stats.  Or it may be
> >> better to call pgstat_send_wal() via pgstat_report_stat(), like
> >> pg_stat_slru().
> >
> > Thanks for your comments.
> > Since I changed it to use pgstat_report_stat(), and DecodeCommit() is
> > calling it, the frequency of sending statistics is not so high.
>
> On second thought, it's strange to include this change in the pg_stat_wal
> patch, because pgstat_report_stat() sends various stats and that change would
> affect not only pg_stat_wal but also other stats views. That is, if we really
> want to make some processes newly call pgstat_report_stat(), that should
> be implemented as a separate patch. But I'm not sure how useful this
> change is, because the stats are probably almost negligibly small in
> those processes.
>
> This thought seems valid for pgstat_send_wal(). I have changed my mind
> and am now inclined to think it's OK not to call pgstat_send_wal() in
> background processes that are very unlikely to generate WAL.
>

This makes sense to me. I think even if such background processes have
to write WAL because wal_buffers is full, it will be accounted for the
next time the backend sends the stats.

One minor point: don't we need to reset the counter
WalStats.m_wal_buffers_full once we have sent the stats? Otherwise the
same stats will be accounted for multiple times.
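
Something like the sketch below is what I have in mind. pgstat_setheader()
and pgstat_send() are the existing helpers in pgstat.c; the message type
and the WalStats fields are the ones I understand the patch to add, so
treat those names as assumptions:

void
pgstat_send_wal(void)
{
    /* Quick exit if there is nothing to report */
    if (WalStats.m_wal_buffers_full == 0)
        return;

    pgstat_setheader(&WalStats.m_hdr, PGSTAT_MTYPE_WAL);
    pgstat_send(&WalStats, sizeof(WalStats));

    /*
     * Clear out the process-local counters so the same values are not
     * accounted for again on the next send.
     */
    MemSet(&WalStats, 0, sizeof(WalStats));
}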


-- 
With Regards,
Amit Kapila.




Re: Asynchronous Append on postgres_fdw nodes.

2020-09-26 Thread Etsuro Fujita
On Thu, Aug 20, 2020 at 4:36 PM Kyotaro Horiguchi
 wrote:
> This is rebased version.

Thanks for the rebased version!

> Come to think of "complex", the ExecAsync stuff in this patch might be
> too much for a short-term solution until the executor overhaul, if that
> comes shortly. (The patch of mine here as a whole is like that,
> though...) The queueing stuff in postgres_fdw is, too.

Here are some review comments on “ExecAsync stuff” (the 0002 patch):

@@ -192,10 +196,20 @@ ExecInitAppend(Append *node, EState *estate, int eflags)

i = -1;
while ((i = bms_next_member(validsubplans, i)) >= 0)
{
Plan   *initNode = (Plan *) list_nth(node->appendplans, i);
+   int sub_eflags = eflags;
+
+   /* Let async-capable subplans run asynchronously */
+   if (i < node->nasyncplans)
+   {
+   sub_eflags |= EXEC_FLAG_ASYNC;
+   nasyncplans++;
+   }

This would be more ambitious than Thomas’ patch: his patch only allows
foreign scan nodes beneath an Append node to be executed
asynchronously, but your patch allows any plan nodes beneath it (e.g.,
local child joins between foreign tables).  Right?  I think that would
be great, but I’m not sure how we execute such plan nodes
asynchronously as other parts of your patch seem to assume that only
foreign scan nodes beneath an Append are considered as async-capable.
Maybe I’m missing something, though.  Could you elaborate on that?

Your patch (and the original patch by Robert [1]) modified
ExecAppend() so that it can process local plan nodes while waiting for
the results from remote queries, which would be also a feature that’s
not supported by Thomas’ patch, but I’d like to know performance
results.  Did you do performance testing on that?  I couldn’t find
that from the archive.

+bool
+is_async_capable_path(Path *path)
+{
+   switch (nodeTag(path))
+   {
+   case T_ForeignPath:
+   {
+   FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+   Assert(fdwroutine != NULL);
+   if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+   fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+   return true;
+   }

Do we really need to introduce the FDW API
IsForeignPathAsyncCapable()?  I think we could determine whether a
foreign path is async-capable, by checking whether the FDW has the
postgresForeignAsyncConfigureWait() API.
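
IOW, something like the sketch below; the ForeignAsyncConfigureWait
callback field is the one I understand your patch adds to FdwRoutine, so
treat the exact name as an assumption on my part:

case T_ForeignPath:
    {
        FdwRoutine *fdwroutine = path->parent->fdwroutine;

        Assert(fdwroutine != NULL);
        /* async-capable iff the FDW provides the configure-wait callback */
        return fdwroutine->ForeignAsyncConfigureWait != NULL;
    }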

In relation to the first comment, I noticed this change in the
postgres_fdw regression tests:

HEAD:

EXPLAIN (VERBOSE, COSTS OFF)
SELECT a, count(t1) FROM pagg_tab t1 GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
   QUERY PLAN

 Sort
   Output: t1.a, (count(((t1.*)::pagg_tab)))
   Sort Key: t1.a
   ->  Append
 ->  HashAggregate
   Output: t1.a, count(((t1.*)::pagg_tab))
   Group Key: t1.a
   Filter: (avg(t1.b) < '22'::numeric)
   ->  Foreign Scan on public.fpagg_tab_p1 t1
 Output: t1.a, t1.*, t1.b
 Remote SQL: SELECT a, b, c FROM public.pagg_tab_p1
 ->  HashAggregate
   Output: t1_1.a, count(((t1_1.*)::pagg_tab))
   Group Key: t1_1.a
   Filter: (avg(t1_1.b) < '22'::numeric)
   ->  Foreign Scan on public.fpagg_tab_p2 t1_1
 Output: t1_1.a, t1_1.*, t1_1.b
 Remote SQL: SELECT a, b, c FROM public.pagg_tab_p2
 ->  HashAggregate
   Output: t1_2.a, count(((t1_2.*)::pagg_tab))
   Group Key: t1_2.a
   Filter: (avg(t1_2.b) < '22'::numeric)
   ->  Foreign Scan on public.fpagg_tab_p3 t1_2
 Output: t1_2.a, t1_2.*, t1_2.b
 Remote SQL: SELECT a, b, c FROM public.pagg_tab_p3
(25 rows)

Patched:

EXPLAIN (VERBOSE, COSTS OFF)
SELECT a, count(t1) FROM pagg_tab t1 GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
   QUERY PLAN

 Sort
   Output: t1.a, (count(((t1.*)::pagg_tab)))
   Sort Key: t1.a
   ->  HashAggregate
 Output: t1.a, count(((t1.*)::pagg_tab))
 Group Key: t1.a
 Filter: (avg(t1.b) < '22'::numeric)
 ->  Append
   Async subplans: 3
   ->  Async Foreign Scan on public.fpagg_tab_p1 t1_1
 Output: t1_1.a, t1_1.*, t1_1.b
 Remote SQL: SELECT a, b, c FROM public.pagg_tab_p1
   ->  Async Foreign Scan on public.fpagg_tab_p2 t1_2
 Output: t1_2.a, t1_2.*, t1_2.b
 Remote SQL: SELECT a, b, c FROM public.pagg_tab_p2
   ->  Async Foreign Scan on public.fpagg_tab_p3 t1_3
 Output: t1_3.a, t1_3.*, t1_3.b
 Remote SQL: SELECT a, b, c FROM public.pagg_tab_p3
(18 rows)

So, yo

Re: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables

2020-09-26 Thread Amit Kapila
On Wed, Sep 23, 2020 at 1:09 PM Keisuke Kuroda
 wrote:
>
> Hi hackers,
>
> I found a problem in logical replication.
> It seems to have the same cause as the following problem.
>
>   Creating many tables gets logical replication stuck
>   
> https://www.postgresql.org/message-id/flat/20f3de7675f83176253f607b5e199b228406c21c.camel%40cybertec.at
>
>   Logical decoding CPU-bound w/ large number of tables
>   
> https://www.postgresql.org/message-id/flat/CAHoiPjzea6N0zuCi%3D%2Bf9v_j94nfsy6y8SU7-%3Dbp4%3D7qw6_i%3DRg%40mail.gmail.com
>
> # problem
>
> * logical replication enabled
> * walsender process has a RelfilenodeMap cache (2000 relations in this case)
> * TRUNCATE or DROP or CREATE many tables in the same transaction
>
> At this time, the walsender process continues to use 100% of one CPU core.
>
...
...
>
> ./src/backend/replication/logical/reorderbuffer.c
> 1746 case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
> 1747   Assert(change->data.command_id != InvalidCommandId);
> 1748
> 1749   if (command_id < change->data.command_id)
> 1750   {
> 1751 command_id = change->data.command_id;
> 1752
> 1753 if (!snapshot_now->copied)
> 1754 {
> 1755   /* we don't use the global one anymore */
> 1756   snapshot_now = ReorderBufferCopySnap(rb, snapshot_now,
> 1757  txn, command_id);
> 1758 }
> 1759
> 1760 snapshot_now->curcid = command_id;
> 1761
> 1762 TeardownHistoricSnapshot(false);
> 1763 SetupHistoricSnapshot(snapshot_now, txn->tuplecid_hash);
> 1764
> 1765 /*
> 1766  * Every time the CommandId is incremented, we could
> 1767  * see new catalog contents, so execute all
> 1768  * invalidations.
> 1769  */
> 1770 ReorderBufferExecuteInvalidations(rb, txn);
> 1771   }
> 1772
> 1773   break;
>
> Do you have any solutions?
>

Yeah, I have an idea on how to solve this problem. This problem is
primarily due to the fact that we receive invalidations only at commit
time and then need to execute all of them after each command id change.
However, after commit c55040ccd0 (When wal_level=logical, write
invalidations at command end into WAL so that decoding can use this
information.) we actually know exactly when we need to execute each
invalidation. The idea is that instead of collecting invalidations only
in ReorderBufferTxn, we need to collect them in the form of a
ReorderBufferChange as well, similar to what we do for other changes
(for example, REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID). In this case,
we still need to collect them additionally in ReorderBufferTxn, because
if the transaction is aborted or some exception occurs while executing
the changes, we need to perform all the invalidations.
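
Roughly speaking, the replay loop would then get a new case next to the
INTERNAL_COMMAND_ID one quoted above, something like the sketch below
(the change kind and the layout under change->data are illustrative
names, not from any committed patch):

case REORDER_BUFFER_CHANGE_INVALIDATION:    /* hypothetical change kind */
    {
        int     i;

        /*
         * Execute only the invalidation messages logged at this command
         * end, instead of re-running the whole commit-time set.
         */
        for (i = 0; i < change->data.inval.ninvalidations; i++)
            LocalExecuteInvalidationMessage(&change->data.inval.invalidations[i]);
    }
    break;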

-- 
With Regards,
Amit Kapila.




Avoid suspects casts VARHDRSZ (c.h)

2020-09-26 Thread Ranier Vilela
Hi,

In all the static analysis tools I’ve used, there are literally *hundreds*
of alerts about one suspect cast:

64-bit size_t -> 32-bit int -> 64-bit size_t
-#define VARHDRSZ ((int32) sizeof(int32))
+#define VARHDRSZ (sizeof(int32))

Is there any special reason for not simplifying this and avoiding these
alerts?
It passes 100% with vcregress check and is in use in local tests.

regards,
Ranier Vilela


avoid_suspects_cast_VARHDRSZ.patch
Description: Binary data


Re: Get rid of runtime handling of AlternativeSubPlan?

2020-09-26 Thread David Rowley
On Tue, 1 Sep 2020 at 05:22, Tom Lane  wrote:
>
> I wrote:
> > One inefficiency I see that we could probably get rid of is
> > where make_subplan() is doing
> > /* Now we can check if it'll fit in hash_mem */
> > /* XXX can we check this at the Path stage? */
>
> I went ahead and fixed that, and I also realized there's another small
> improvement to be made: we can remove the unused SubPlan from the
> subplans list of the finished PlannedStmt, by setting that list cell
> to NULL.  (This is already specifically allowed by the comments for
> PlannedStmt.subplans.)  Initially I supposed that this'd only save the
> costs of copying that subtree when we copy the whole plan.  On looking
> closer, though, InitPlan actually runs ExecInitNode on every list
> entry, even unused ones, so this will make some difference in executor
> startup time.
>
> Hence, an updated 0001 patch.  0002 hasn't changed.

I had a look over these two. A handful of very small things:

0001:

1. I think we should be moving away from using linitial() and lsecond()
when we know there are two items in the list. Using list_nth() has
less overhead.

subplan1 = (SubPlan *) linitial(asplan->subplans);
subplan2 = (SubPlan *) lsecond(asplan->subplans);

2. I did have slight concerns that fix_alternative_subplan() always
assumes the list of subplans will always be 2, though on looking at
the definition of AlternativeSubPlan, I see that always having two in
the list is mentioned. It feels like fix_alternative_subplan() wouldn't
become much more complex to allow any non-zero number of subplans, but
maybe making that happen should wait until there is some need for more
than two. It just feels a bit icky to have to document the special
case when not having the special case is not that hard to implement.

3. Wouldn't it be better to say NULLify rather than delete?

+ * node or higher-level nodes.  However, we do delete the rejected subplan
+ * from root->glob->subplans, to minimize cycles expended on it later.

0002:

I don't have much to say about this.  Leaving the code in
get_rule_expr() for the reasons you mentioned in the new comment does
make sense.


On a side note, I was playing around with the following case:

create table t (a int, b int, c int);
insert into t select x,1,2 from generate_Series(1,1)x;
create index on t (b);
vacuum freeze analyze t;

and ran:

select * from t where exists (select 1 from t t2 where t.a=t2.b) or a < 0;

EXPLAIN ANALYZE shows:

   QUERY PLAN

 Seq Scan on t  (cost=0.00..360.00 rows=5000 width=12) (actual
time=0.020..7468.452 rows=1 loops=1)
   Filter: ((SubPlan 1) OR (a < 0))
   Rows Removed by Filter: 
   SubPlan 1
 ->  Seq Scan on t t2  (cost=0.00..180.00 rows=1 width=0)
(actual time=0.746..0.746 rows=0 loops=1)
   Filter: (t.a = b)
   Rows Removed by Filter: 
 Planning Time: 0.552 ms
 Execution Time: 7468.481 ms
(9 rows)


Notice that the SubPlan's estimated rows are 1. This is due to the
ndistinct for "b" being 1 and since t.a is a parameter, the
selectivity is estimated to be 1.0 by var_eq_non_const().
Unfortunately, for this reason, the index on t(b) is not used either.
The planner thinks all rows are being selected, in which case, an
index is not much help.

Both master and patched seem not to choose the hashed subplan,
which results in a pretty slow execution time. This seems to be down
to cost_subplan() doing:

/* we only need to fetch 1 tuple; clamp to avoid zero divide */
sp_cost.per_tuple += plan_run_cost / clamp_row_est(plan->plan_rows);

I imagine / 2 might be more realistic to account for the early abort,
which is pretty much what the ALL_SUBLINK and ANY_SUBLINK cases do just
below.
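
Something like this, i.e. just a sketch of the adjustment I mean, untested:

    /* we only need to fetch 1 tuple; clamp to avoid zero divide */
    sp_cost.per_tuple += plan_run_cost / clamp_row_est(plan->plan_rows) / 2;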

Changing that makes the run-time of that query go from 7.4 seconds for
me down to 3.7 ms, about 2000 times faster.

I understand there will be other cases where that's not so ideal, but
this slowness is not ideal either. Of course, not the fault of this
patch.

David




Re: Parallel INSERT (INTO ... SELECT ...)

2020-09-26 Thread vignesh C
On Wed, Sep 23, 2020 at 2:21 PM Greg Nancarrow  wrote:
>
> > - When INSERTs are made parallel, currently the reported row-count in
> > the "INSERT 0 " status only reflects the rows that the
> > leader has processed (not the workers) - so it is obviously less than
> > the actual number of rows inserted.
>
> Attached an updated patch which fixes this issue (for parallel
> INSERTs, each worker's processed tuple count is communicated in shared
> memory back to the leader, where it is added to the global
> "es_processed" count).

I noticed that we do not have any check for skipping temporary
table insertion.

/* Check if the target relation has foreign keys; if so, avoid
* creating a parallel Insert plan (because inserting into
* such tables would result in creation of new CommandIds, and
* this isn't supported by parallel workers).
* Similarly, avoid creating a parallel Insert plan if ON
* CONFLICT ... DO UPDATE ... has been specified, because
* parallel UPDATE is not supported.
* However, do allow any underlying query to be run by parallel
* workers in these cases.
*/

You should also include a check for temporary tables here, as parallel
workers might not have access to temporary tables.
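
A minimal sketch of the kind of check I mean is below; exactly where it
sits in your patch and the surrounding variable names are assumptions:

    /*
     * Parallel workers cannot access the leader's backend-local buffers,
     * so do not generate a parallel Insert plan if the target relation
     * is a temporary table.
     */
    if (RelationUsesLocalBuffers(rel))
        return;             /* fall back to a non-parallel Insert plan */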

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Libpq support to connect to standby server as priority

2020-09-26 Thread Tom Lane
I wrote:
> BTW, I think it would be worth splitting this into separate server-side
> and libpq patches.  It looked to me like the server side is pretty
> nearly committable, modulo bikeshedding about the new GUC name.

Actually ... I looked that over again and got a bit more queasy about
all the new signaling logic it is adding.  Signals are inherently
bug-prone stuff, plus it's not very clear what sort of guarantees
we'd have about either the reliability or the timeliness of client
notifications about exiting hot-standby mode.

I also wonder what consideration has been given to the performance
implications of marking transaction_read_only as GUC_REPORT, thus
causing client traffic to occur every time it's changed.  Most of
the current GUC_REPORT variables don't change too often in typical
sessions, but I'm less convinced about that for transaction_read_only.

So I thought about alternative ways to implement this, and realized
that it would not be hard to make guc.c handle it all by itself, if
we use a custom show-hook for the in_hot_standby GUC that calls
RecoveryInProgress() instead of examining purely static state.
Now, by itself that idea only takes care of the session-start-time
report, because there'd never be any GUC action causing a new
report to occur.  But we can improve the situation if we get rid
of the current design whereby ReportGUCOption() is called immediately
when any GUC value changes, and instead batch up the reports to
occur when we're about to go idle waiting for a new client query.
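
For the show-hook half of that, a minimal sketch of what I have in mind
(the show-hook signature is the standard GUC one; the function body here
is just illustrative):

/* GUC show-hook: report live recovery state rather than a static variable */
static const char *
show_in_hot_standby(void)
{
    return RecoveryInProgress() ? "on" : "off";
}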

Not incidentally, this responds to a concern Robert mentioned awhile
back about the performance of GUC reporting [1].  You can already get
the server to spam the client excessively if any GUC_REPORT variables
are changed by, for example, functions' SET clauses, because that could
lead to the active value changing many times within a query.  We've
gotten away with that so far, but it'd be a problem if any more-often-
changed variables get marked GUC_REPORT.  (I actually have a vague
memory of other complaints about that, but I couldn't find any in a
desultory search of the archives.)

So I present 0001 attached which changes the GUC_REPORT code to work
that way, and then 0002 is a pretty small hack to add a reportable
in_hot_standby GUC by having the end-of-query function check (when
needed) to see if the active value changed.

As it stands, 0001 reduces the ParameterStatus message traffic to
at most one per GUC per query, but it doesn't attempt to eliminate
duplicate ParameterStatus messages altogether.  We could do that
as a pretty simple adjustment if we're willing to expend the storage
to remember the last value sent to the client.  It might be worth
doing, since for example the function-SET-clause case would typically
lead to no net change in the GUC's value by the end of the query.

An objection that could be raised to this approach for in_hot_standby
is that it will only report in_hot_standby becoming false at the end
of a query, whereas the v19 patch at least attempts to deliver an
async ParameterStatus message when the client is idle (and, I think,
indeed may fail to provide *any* message if the transition occurs
when it isn't idle).  I don't find that too compelling though;
libpq-based clients, at least, don't have any very practical way to
deal with async ParameterStatus messages anyway.

(Note that I did not touch the docs here, so that while 0001 might
be committable as-is, 0002 is certainly just WIP.)

BTW, as far as the transaction_read_only side of things goes, IMO
it would make a lot more sense to mark default_transaction_read_only
as GUC_REPORT, since that changes a lot less frequently.  We'd then
have to expend some work to report that value honestly, since right
now the hot-standby code cheats by ignoring the GUC's value during
hot standby.  But I think a technique much like 0002's would work
for that.

Thoughts?

regards, tom lane

[1] 
https://www.postgresql.org/message-id/CA%2BTgmoaDoVtMnfKNFm-iyyCSp%3DFPiHkfU1AXuEHJqmcLTAX6kQ%40mail.gmail.com

diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 411cfadbff..b67cc2f375 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -4233,6 +4233,9 @@ PostgresMain(int argc, char *argv[],
 pgstat_report_activity(STATE_IDLE, NULL);
 			}
 
+			/* Report any recently-changed GUC options */
+			ReportChangedGUCOptions();
+
 			ReadyForQuery(whereToSendOutput);
 			send_ready_for_query = false;
 		}
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 596bcb7b84..ddfc7ea05d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -4822,6 +4822,8 @@ static bool guc_dirty;			/* true if need to do commit/abort work */
 
 static bool reporting_enabled;	/* true to enable GUC_REPORT */
 
+static bool report_needed;		/* true if any GUC_REPORT reports are needed */
+
 static int	GUCNestLevel = 0;	/* 1 when in main transaction */
 
 
@

enable_incremental_sort changes query behavior

2020-09-26 Thread Jaime Casanova
Hi,

With sqlsmith I found a query that gives this error:
ERROR:  ORDER/GROUP BY expression not found in targetlist

I noted the query (SQL query below; sorry, it uses custom tables I
couldn't replicate with regression tables) because it doesn't include
an ORDER/GROUP BY clause.

--- 0 
select distinct
  subq_0.c1 as c0,
  ref_0.radi_usua_radi as c1,
  ref_0.radi_nume_asoc as c2,
  subq_0.c1 as c3,
  case when (cast(null as pg_lsn) >=
pg_catalog.pg_last_wal_receive_lsn())
  and (true = pg_catalog.pg_rotate_logfile_old()) then
ref_0.radi_usua_rem else ref_0.radi_usua_rem end
 as c4,
  cast(nullif((select hist_codi from public.hist_eventos_2
limit 1 offset 4)
  ,
pg_catalog.pg_stat_get_buf_alloc()) as int8) as c5
from
  public.radicado_2 as ref_0,
  lateral (select
ref_0.radi_text_temp as c0,
ref_0.radi_usua_actu as c1
  from
public.hist_eventos_1 as ref_1
  where cast(nullif(cast(null as float4),
  cast(null as float4)) as float4) >= pg_catalog.pi()) as subq_0
where ref_0.radi_usua_dest is not NULL;
--- 0 

Attached is the stack trace produced up to the elog() that produces the message.

But if I set enable_incremental_sort to off, the query executes
without problems (attached is the EXPLAIN output produced for that case).

-- 
Jaime Casanova  www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

                                  QUERY PLAN
--------------------------------------------------------------------------------
 HashAggregate  (cost=585169544090.96..675589722275.63 rows=202007 width=65)
   Group Key: ref_0.radi_usua_actu, ref_0.radi_usua_radi, ref_0.radi_nume_asoc, ref_0.radi_usua_actu, CASE WHEN (pg_rotate_logfile_old() AND NULL::boolean) THEN ref_0.radi_usua_rem ELSE ref_0.radi_usua_rem END, NULLIF($0, pg_stat_get_buf_alloc())
   Planned Partitions: 16
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.10..0.12 rows=1 width=8)
           ->  Seq Scan on hist_eventos_2  (cost=0.00..118555.22 rows=4844222 width=8)
   ->  Result  (cost=0.01..88461384682.66 rows=3857927451714 width=65)
         One-Time Filter: (NULLIF(NULL::real, NULL::real) >= '3.141592653589793'::double precision)
         ->  Nested Loop  (cost=0.01..59526928794.81 rows=3857927451714 width=30)
               ->  Seq Scan on radicado_2 ref_0  (cost=0.00..53449.74 rows=798774 width=30)
                     Filter: (radi_usua_dest IS NOT NULL)
               ->  Materialize  (cost=0.00..156323.17 rows=4829811 width=0)
                     ->  Seq Scan on hist_eventos_1 ref_1  (cost=0.00..118024.11 rows=4829811 width=0)
(13 rows)

#0  get_sortgroupref_tle (sortref=1, targetList=0x563fd6160f80) at tlist.c:371
l__state = {l = 0x0, i = 0}
l = 0x563fd616e948
__func__ = "get_sortgroupref_tle"
#1  0x563fd49cac41 in get_sortgroupclause_tle (sgClause=0x563fd617c928, 
targetList=0x563fd6160f80) at tlist.c:392
No locals.
#2  0x563fd49cac66 in get_sortgroupclause_expr (sgClause=0x563fd617c928, 
targetList=0x563fd6160f80) at tlist.c:403
tle = 0x1d613d940
#3  0x563fd496ee90 in make_pathkeys_for_sortclauses (root=0x563fd617cc40, 
sortclauses=0x563fd617c890, tlist=0x563fd6160f80) at pathkeys.c:1138
sortcl = 0x563fd617c928
sortkey = 0x563fd617c890
pathkey = 0x563fd617c890
l__state = {l = 0x563fd617c890, i = 0}
pathkeys = 0x0
l = 0x563fd617c8a8
#4  0x563fd498be89 in standard_qp_callback (root=0x563fd617cc40, 
extra=0x7ffc8bbff920) at planner.c:3629
parse = 0x563fd61a01c8
qp_extra = 0x7ffc8bbff920
tlist = 0x563fd6160f80
activeWindows = 0x0
#5  0x563fd4984f20 in query_planner (root=0x563fd617cc40, 
qp_callback=0x563fd498bd4d , qp_extra=0x7ffc8bbff920) at 
planmain.c:205
parse = 0x563fd61a01c8
joinlist = 0x563fd616e870
final_rel = 0x563fd6160f80
__func__ = "query_planner"
#6  0x563fd4988c00 in grouping_planner (root=0x563fd617cc40, 
inheritance_update=false, tuple_fraction=0) at planner.c:2058
sort_input_targets = 0x563fd61662e8
sort_input_target_parallel_safe = false
grouping_target = 0x563fd61663b0
scanjoin_target = 0x7ffc8bbffa80
activeWindows = 0x0
qp_extra = {activeWindows = 0x0, groupClause = 0x0}
sort_input_targets_contain_srfs = 0x563fd613d940
have_grouping = false
wflists = 0x0
gset_data = 0x0
sort_input_target = 0x7ffc8bbffab0
grouping_targets = 0x40
grouping

Re: 回复:how to create index concurrently on partitioned table

2020-09-26 Thread Justin Pryzby
On Thu, Sep 24, 2020 at 05:11:03PM +0900, Michael Paquier wrote:
> start.  So, the goal of the patch, as I would define it, is to give a
> way to CLUSTER to work on a partitioned table using a given
> partitioned index.  Hence, we would perform CLUSTER automatically from
> the top-most parent for any partitions that have an index on the same
> partition tree as the partitioned index defined in USING, switching
> indisclustered automatically depending on the index used.

I think that's right, except there's no need to look for a compatible
partitioned index: we just use the child index.

Also, if a partitioned index is clustered, when we clear indisclustered for
other indexes, should we also propagate that to their parent indexes, if any ?

> From what I can see, attempting to use a CLUSTER on a top-most
> partitioned table fails to work on child tables,

Oops - it looks like this patch never worked right, due to the RECHECK on
indisclustered.  I think maybe I returned to the CIC patch and never finished
with cluster.

> It would be good also to check if
> we have a partition index tree that maps partially with a partition
> table tree (aka not all table partitions have a partition index), where
> these don't get clustered because there is no index to work on.

This should not happen, since an incomplete partitioned index is "invalid".

> Isn't NoLock unsafe here, even knowing that an exclusive lock is taken on
> the parent?  It seems to me that at least schema changes should be
> prevented with a ShareLock in the first transaction building the list
> of elements to cluster.

Thanks for noticing.  I chose ShareUpdateExclusiveLock since it's
self-exclusive.

-- 
Justin
>From e4fe1e422dc2c0ee21ce9df15d5baf24e825bf9d Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sat, 6 Jun 2020 17:42:23 -0500
Subject: [PATCH v8 1/3] Allow CREATE INDEX CONCURRENTLY on partitioned table

Note, this effectively reverts 050098b14, so take care to not reintroduce the
bug it fixed.

XXX: does pgstat_progress_update_param() break other commands progress ?
---
 doc/src/sgml/ref/create_index.sgml |   9 --
 src/backend/commands/indexcmds.c   | 135 +
 src/test/regress/expected/indexing.out |  60 ++-
 src/test/regress/sql/indexing.sql  |  18 +++-
 4 files changed, 166 insertions(+), 56 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 33aa64e81d..c780dc9547 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -657,15 +657,6 @@ Indexes:
 cannot.

 
-   
-Concurrent builds for indexes on partitioned tables are currently not
-supported.  However, you may concurrently build the index on each
-partition individually and then finally create the partitioned index
-non-concurrently in order to reduce the time where writes to the
-partitioned table will be locked out.  In this case, building the
-partitioned index is a metadata only operation.
-   
-
   
  
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index f1b5f87e6a..d417404211 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -91,6 +91,7 @@ static bool ReindexRelationConcurrently(Oid relationOid, int options);
 
 static void ReindexPartitions(Oid relid, int options, bool isTopLevel);
 static void ReindexMultipleInternal(List *relids, int options);
+static void reindex_invalid_child_indexes(Oid indexRelationId);
 static void reindex_error_callback(void *args);
 static void update_relispartition(Oid relationId, bool newval);
 static bool CompareOpclassOptions(Datum *opts1, Datum *opts2, int natts);
@@ -665,17 +666,6 @@ DefineIndex(Oid relationId,
 	partitioned = rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE;
 	if (partitioned)
 	{
-		/*
-		 * Note: we check 'stmt->concurrent' rather than 'concurrent', so that
-		 * the error is thrown also for temporary tables.  Seems better to be
-		 * consistent, even though we could do it on temporary table because
-		 * we're not actually doing it concurrently.
-		 */
-		if (stmt->concurrent)
-			ereport(ERROR,
-	(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-	 errmsg("cannot create index on partitioned table \"%s\" concurrently",
-			RelationGetRelationName(rel))));
 		if (stmt->excludeOpNames)
 			ereport(ERROR,
 	(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
@@ -1110,6 +1100,11 @@ DefineIndex(Oid relationId,
 		if (pd->nparts != 0)
 			flags |= INDEX_CREATE_INVALID;
 	}
+	else if (concurrent && OidIsValid(parentIndexId))
+	{
+		/* If concurrent, initially build index partitions as "invalid" */
+		flags |= INDEX_CREATE_INVALID;
+	}
 
 	if (stmt->deferrable)
 		constr_flags |= INDEX_CONSTR_CREATE_DEFERRABLE;
@@ -1162,6 +1157,14 @@ DefineIndex(Oid relationId,
 		 */
 		if (!stmt->relation || stmt->relation->inh)
 		{
+			/*
+			 * Need to close the relation before recursing into children, so
+			 * copy 

Re: Get rid of runtime handling of AlternativeSubPlan?

2020-09-26 Thread Tom Lane
Thanks for reviewing!

David Rowley  writes:
> 1. I think we should be moving away from using linitial() and lsecond()
> when we know there are two items in the list. Using list_nth() has
> less overhead.

Uh, really?  And if it's true, why would we change all the call sites
rather than improving the pg_list.h macros?

> 2. I did have slight concerns that fix_alternative_subplan() always
> assumes the list of subplans will always be 2, though on looking at
> the definition of AlternativeSubPlan, I see always having two in the
> list is mentioned. It feels like fix_alternative_subplan() wouldn't
> become much more complex to allow any non-zero number of subplans, but
> maybe making that happen should wait until there is some need for more
> than two. It just feels a bit icky to have to document the special
> case when not having the special case is not that hard to implement.

It seemed to me that dealing with the general case would make
fix_alternative_subplan() noticeably more complex and less obviously
correct.  I might be wrong though; what specific coding did you have in
mind?

> 3. Wouldn't it be better to say NULLify rather than delete?

> + * node or higher-level nodes.  However, we do delete the rejected subplan
> + * from root->glob->subplans, to minimize cycles expended on it later.

Fair enough, that comment could be improved.

> On a side note, I was playing around with the following case:
> ...
> both master and patched seem to not choose to use the hashed subplan
> which results in a pretty slow execution time. This seems to be down
> to cost_subplan() doing:
>   /* we only need to fetch 1 tuple; clamp to avoid zero divide */
>   sp_cost.per_tuple += plan_run_cost / clamp_row_est(plan->plan_rows);
> I imagine / 2 might be more realistic to account for the early abort,
> which is pretty much what the ALL_SUBLINK and ANY_SUBLINK cases do just
> below.

Hm, actually isn't it the other way around?  *If* there are any matching
rows, then what's being done here is an accurate estimate.  But if there
are not, we're going to have to scan the entire subquery output to verify
that.  I wonder if we should just be taking the subquery cost at face
value, ie be pessimistic not optimistic.  If the user is bothering to
test EXISTS, we should expect that the no-match case does happen.

However, I think that's a distinct concern from this patch; this patch
is only meant to improve the processing of alternative subplans, not
to change the costing rules around them.  If we fool with it I'd rather
do so as a separate patch.

regards, tom lane




Re: Avoid suspects casts VARHDRSZ (c.h)

2020-09-26 Thread Tom Lane
Ranier Vilela  writes:
> In all the static analysis tools I’ve used, there are literally *hundreds*
> of alerts about one suspect cast:

> 64-bit size_t -> 32-bit int -> 64-bit size_t
> -#define VARHDRSZ ((int32) sizeof(int32))
> +#define VARHDRSZ (sizeof(int32))

Given that the compiler can very easily see that there is no actual
overflow there, I question whether these warnings are of any value.

Also, the proposed patch changes the type of that macro from signed
to unsigned, meaning that it's considerably riskier than you seem
to think.  We'd have to look at every usage to see if that would
affect the interpretation of any comparisons, for example.
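
For instance, here is a standalone illustration of the hazard (not a
claim about any particular existing call site):

#include <stdio.h>

#define VARHDRSZ_OLD ((int) sizeof(int))    /* signed, like the current macro */
#define VARHDRSZ_NEW (sizeof(int))          /* unsigned size_t, as proposed */

int
main(void)
{
    int     len = -1;           /* e.g. a not-yet-computed length */

    if (len < VARHDRSZ_OLD)
        printf("signed macro: comparison behaves as expected\n");

    /*
     * Here the usual arithmetic conversions turn "len" into a huge
     * unsigned value, so this test is false and nothing is printed.
     */
    if (len < VARHDRSZ_NEW)
        printf("unsigned macro: this line is never reached\n");

    return 0;
}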

On the whole I see little value here.
Suggest finding a tool with less nanny-ish tendencies.

regards, tom lane




Re: More efficient RI checks - take 2

2020-09-26 Thread Justin Pryzby
On Fri, Jun 05, 2020 at 05:16:43PM +0200, Antonin Houska wrote:
> Antonin Houska  wrote:
> 
> > In general, the checks are significantly faster if there are many rows to
> > process, and a bit slower when we only need to check a single row.
> 
> Attached is a new version that uses the existing simple queries if there's
> only one row to check. SPI is used for both single-row and bulk checks - as
> discussed in this thread, it can perhaps be replaced with a different approach
> if it appears to be beneficial, at least for the single-row checks.
> 
> I think using a separate query for the single-row check is more practicable
> than convincing the planner that the bulk-check query should only check a
> single row. So this patch version tries to show what it'd look like.

I'm interested in testing this patch, however there's a lot of internals to
digest.

Are there any documentation updates or regression tests to add ?  If FKs
support "bulk" validation, users should know when that applies, and be able to
check that it's working as intended.  Even if the test cases are overly verbose
or not stable, and not intended for commit, that would be a useful temporary
addition.

I think that calls=4 indicates this is using bulk validation.

postgres=# begin; explain(analyze, timing off, costs off, summary off, verbose) 
DELETE FROM t WHERE i<999; rollback;
BEGIN
  QUERY PLAN   
---
 Delete on public.t (actual rows=0 loops=1)
   ->  Index Scan using t_pkey on public.t (actual rows=998 loops=1)
 Output: ctid
 Index Cond: (t.i < 999)
 Trigger RI_ConstraintTrigger_a_16399 for constraint t_i_fkey: calls=4

I started thinking about this 1+ years ago wondering if a BRIN index could be
used for (bulk) FK validation.

So I would like to be able to see the *plan* for the query.

I was able to show the plan and see that BRIN can be used like so:
|SET auto_explain.log_nested_statements=on; SET client_min_messages=debug; SET 
auto_explain.log_min_duration=0;
Should the plan be visible in explain (not auto-explain) ?

BTW did you see this older thread ?
https://www.postgresql.org/message-id/flat/CA%2BU5nMLM1DaHBC6JXtUMfcG6f7FgV5mPSpufO7GRnbFKkF2f7g%40mail.gmail.com

-- 
Justin