Re: Re: parallel distinct union and aggregate support patch

2020-10-29 Thread bu...@sohu.com
> 1) It's better to always include the whole patch series - including the
> parts that have not changed. Otherwise people have to scavenge the
> thread and search for all the pieces, which may be a source of issues.
> Also, it confuses the patch tester [1] which tries to apply patches from
> a single message, so it will fail for this one.
 Patches 3 and 4 do not rely on 1 and 2 in the code.
 However, applying patches 3 and 4 directly will fail, because they were
 written on top of 1 and 2.
 I can generate a single combined patch if you need it.

> 2) I suggest you try to describe the goal of these patches, using some
> example queries, explain output etc. Right now the reviewers have to
> reverse engineer the patches and deduce what the intention was, which
> may be causing unnecessary confusion etc. If this was my patch, I'd try
> to create a couple examples (CREATE TABLE + SELECT + EXPLAIN) showing
> how the patch changes the query plan, showing speedup etc.
 I wrote some example queries into the regression tests, covering "unique",
 "union", "group by" and "group by grouping sets".
 Here are my tests; they are not in the regression suite:
```sql
begin;
create table gtest(id integer, txt text);
insert into gtest select t1.id,'txt'||t1.id from (select 
generate_series(1,1000*1000) id) t1,(select generate_series(1,10) id) t2;
analyze gtest;
commit;
set jit = off;
\timing on
```
normal aggregate times
```
set enable_batch_hashagg = off;
explain (costs off,analyze,verbose)
select sum(id),txt from gtest group by txt;
 QUERY PLAN
-
 Finalize GroupAggregate (actual time=6469.279..8947.024 rows=100 loops=1)
   Output: sum(id), txt
   Group Key: gtest.txt
   ->  Gather Merge (actual time=6469.245..8165.930 rows=158 loops=1)
 Output: txt, (PARTIAL sum(id))
 Workers Planned: 2
 Workers Launched: 2
 ->  Sort (actual time=6356.471..7133.832 rows=53 loops=3)
   Output: txt, (PARTIAL sum(id))
   Sort Key: gtest.txt
   Sort Method: external merge  Disk: 11608kB
   Worker 0:  actual time=6447.665..7349.431 rows=317512 loops=1
 Sort Method: external merge  Disk: 10576kB
   Worker 1:  actual time=6302.882..7061.157 rows=01 loops=1
 Sort Method: external merge  Disk: 2kB
 ->  Partial HashAggregate (actual time=2591.487..4430.437 rows=53 loops=3)
   Output: txt, PARTIAL sum(id)
   Group Key: gtest.txt
   Batches: 17  Memory Usage: 4241kB  Disk Usage: 113152kB
   Worker 0:  actual time=2584.345..4486.407 rows=317512 loops=1
     Batches: 17  Memory Usage: 4241kB  Disk Usage: 101392kB
   Worker 1:  actual time=2584.369..4393.244 rows=01 loops=1
     Batches: 17  Memory Usage: 4241kB  Disk Usage: 112832kB
   ->  Parallel Seq Scan on public.gtest (actual time=0.691..603.990 rows=333 loops=3)
     Output: id, txt
     Worker 0:  actual time=0.104..607.146 rows=3174970 loops=1
     Worker 1:  actual time=0.100..603.951 rows=3332785 loops=1
 Planning Time: 0.226 ms
 Execution Time: 9021.058 ms
(29 rows)

Time: 9022.251 ms (00:09.022)

set enable_batch_hashagg = on;
explain (costs off,analyze,verbose)
select sum(id),txt from gtest group by txt;
   QUERY PLAN
-
 Gather (actual time=3116.666..5740.826 rows=100 loops=1)
   Output: (sum(id)), txt
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel BatchHashAggregate (actual time=3103.181..5464.948 rows=33 loops=3)
 Output: sum(id), txt
 Group Key: gtest.txt
 Worker 0:  actual time=3094.550..5486.992 rows=326082 loops=1
 Worker 1:  actual time=3099.562..5480.111 rows=324729 loops=1
 ->  Parallel Seq Scan on public.gtest (actual time=0.791..656.601 rows=333 loops=3)
   Output: id, txt
   Worker 0:  actual time=0.080..646.053 rows=3057680 loops=1
   Worker 1:  actual time=0.070..662.754 rows=3034370 loops=1
 Planning Time: 0.243 ms
 Execution Time: 5788.981 ms
(15 rows)

Time: 5790.143 ms (00:05.790)
```

grouping sets times
```
set enable_batch_hashagg = off;
explain (costs off,analyze,verbose)
select sum(id),txt from gtest group by grouping sets(id,txt,());
QUERY PLAN
--
 GroupAggregate (actual time=9454.707..38921.885 rows=201 loops=1)
   Output: sum(id), txt, id
   Group Key: gtest.id
   Group Key: ()
   Sort Key: gtest.txt
 

Re: Add statistics to pg_stat_wal view for wal related parameter tuning

2020-10-29 Thread Masahiro Ikeda

Hi,

Thanks for your comments and advice. I updated the patch.

On 2020-10-21 18:03, Kyotaro Horiguchi wrote:

At Tue, 20 Oct 2020 16:11:29 +0900, Masahiro Ikeda
 wrote in

On 2020-10-20 12:46, Amit Kapila wrote:
> I see that we also need to add extra code to capture these stats (some
> of which is in performance-critical path especially in
> XLogInsertRecord) which again makes me a bit uncomfortable. It might
> be that it is all fine as it is very important to collect these stats
> at cluster-level in spite that the same information can be gathered at
> statement-level to help customers but I don't see a very strong case
> for that in your proposal.


We should avoid that duplication as much as possible, even if both numbers
are important.


Also, about performance, I thought the impact would be small because the
stats are incremented in memory. If I can reuse pgWalUsage's values, which
already collect these stats, there is no impact in XLogInsertRecord.
For example, how about having pg_stat_wal() calculate the accumulated
values of wal_records, wal_fpi, and wal_bytes from pgWalUsage?


I don't think that works, but it would work for pgstat_send_wal() to take
the difference of those values between two successive calls.

WalUsage prevWalUsage;
...
pgstat_send_wal()
{
..
   /* fill in some values using pgWalUsage */
   WalStats.m_wal_bytes   = pgWalUsage.wal_bytes   - prevWalUsage.wal_bytes;
   WalStats.m_wal_records = pgWalUsage.wal_records - prevWalUsage.wal_records;
   WalStats.m_wal_wal_fpi = pgWalUsage.wal_fpi - prevWalUsage.wal_fpi;

...
   pgstat_send(&WalStats, sizeof(WalStats));

   /* remember the current numbers */
   prevWalUsage = pgWalUsage;


Thanks for Horiguchi-san's advice, I changed to reuse pgWalUsage
which is already defined and eliminates the extra overhead.

5. I notice some code in the issue_xlog_fsync() function to collect sync info;

a standby may call issue_xlog_fsync() too, and we do not want to collect
this info there. I think this needs some change to avoid running on the
standby side.


IIUC, issue_xlog_fsync is called by the wal receiver on the standby side,
but it doesn't send the collected statistics to the stats collector.
So I think it's not necessary to change the code to avoid collecting
the stats on the standby side.


Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 98e19954..587acd10 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3447,12 +3447,106 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
 
 
 
+ 
+  
+   wal_records bigint
+  
+  
+   Total number of WAL records generated
+  
+ 
+
+ 
+  
+   wal_fpi bigint
+  
+  
+   Total number of WAL full page images generated
+  
+ 
+
+ 
+  
+   wal_bytes bigint
+  
+  
+   Total amount of WAL bytes generated
+  
+ 
+
  
   
wal_buffers_full bigint
   
   
-   Number of times WAL data was written to the disk because WAL buffers got full
+   Total number of times WAL data written to the disk because WAL buffers got full
+  
+ 
+
+ 
+  
+   wal_init_file bigint
+  
+  
+   Total number of WAL file segment created
+  
+ 
+
+ 
+  
+   wal_write_backend bigint
+  
+  
+   Total number of times backends write WAL data to the disk
+  
+ 
+
+ 
+  
+   wal_write_walwriter bigint
+  
+  
+   Total number of times walwriter writes WAL data to the disk
+  
+ 
+
+ 
+  
+   wal_write_time bigint
+  
+  
+   Total amount of time that has been spent in the portion of
+   WAL data was written to disk by backend and walwriter, in milliseconds
+   (if  is enabled, otherwise zero)
+  
+ 
+
+ 
+  
+   wal_sync_backend bigint
+  
+  
+   Total number of times backends sync WAL data to the disk
+  
+ 
+
+ 
+  
+   wal_sync_walwriter bigint
+  
+  
+   Total number of times walwriter syncs WAL data synced to the disk
+  
+ 
+
+ 
+  
+   wal_sync_time bigint
+  
+  
+   Total amount of time that has been spent in the portion of
+   WAL data was synced to disk by backend and walwriter, in milliseconds
+   (if  is enabled, otherwise zero)
   
  
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 52a67b11..77fe977c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2527,6 +2527,8 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			int			written;
+			instr_time	start;
+			instr_time	duration;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2535,9 +2537,28 @@ XLogWrite(XLogwrtRqst Writ

Re: Deduplicate aggregates and transition functions in planner

2020-10-29 Thread Heikki Linnakangas

On 28/10/2020 21:59, Andres Freund wrote:

On 2020-10-28 21:10:41 +0200, Heikki Linnakangas wrote:

Currently, ExecInitAgg() performs quite a lot of work, to deduplicate
identical Aggrefs, as well as Aggrefs that can share the same transition
state. That doesn't really belong in the executor, we should perform that
work in the planner. It doesn't change from one invocation of the plan to
another, and it would be nice to reflect the state-sharing in the plan
costs.


Woo! Very glad to see this tackled.

It wouldn't surprise me to see a small execution time speedup here -
I've seen the load of the aggno show up in profiles.


I think you'd be hard-pressed to find a real-life query where it 
matters. But if you don't care about real life:


regression=# do $$
begin
  for i in 1..10 loop
perform sum(g), sum(g+0), sum(g+1), sum(g+2), sum(g+3), sum(g+4), 
sum(g+5), sum(g+6), sum(g+7), sum(g+8), sum(g+9), sum(g+10) from 
generate_series(1,1) g;

  end loop;
end;
$$;
DO
Time: 1282.701 ms (00:01.283)

vs.

Time: 860.323 ms

with the patch.


@@ -783,14 +783,13 @@ ExecInitExprRec(Expr *node, ExprState *state,
  
  scratch.opcode = EEOP_AGGREF;

scratch.d.aggref.astate = astate;
-   astate->aggref = aggref;
+   astate->aggno = aggref->aggno;
  
  if (state->parent && IsA(state->parent, AggState))

{
AggState   *aggstate = (AggState *) 
state->parent;
  
-	aggstate->aggs = lappend(aggstate->aggs, astate);

-   aggstate->numaggs++;
+   aggstate->aggs = 
lappend(aggstate->aggs, aggref);


Hm. Why is aggstate->aggs still built during expression initialization?
Imo that's a pretty huge wart that also introduces more
order-of-operation brittleness to executor startup.


The Agg node itself doesn't include any information about the aggregates 
and transition functions. Because of that, ExecInitAgg needs a 
"representive" Aggref for each transition state and agg, to initialize 
the per-trans and per-agg structs. The expression initialization makes 
those Aggrefs available for ExecInitAgg.


Instead of collecting all the Aggrefs in a list, ExecInitExprRec() could 
set each representative Aggref directly in the right per-trans and 
per-agg struct, based on the 'aggno' and 'aggtransno' fields. That 
requires initializing the per-trans and per-agg arrays earlier, and for 
that, we would need to store the # of aggs and transition states in the 
Agg node, like you also suggested. Certainly doable, but on the whole, 
it didn't really seem better to me. Attached is a patch, to demonstrate 
what that looks like, on top of the main patch. It's not complete, 
there's at least one case with hash-DISTINCT for queries like "SELECT 
DISTINCT aggregate(x) ..." where the planner creates an Agg for the 
DISTINCT without aggregates, but the code currently passes numAggs=1 to 
the executor. Some further changes would be needed in the planner, to 
mark the AggPath generated for deduplication differently from the 
AggPaths created for aggregation. Again that's doable, but on the whole 
I prefer the approach to scan the Aggrefs in executor startup, for now.


I'd like to get rid of the "representative Aggrefs" altogether, and pass 
information about the transition and final functions from planner to 
executor in some other form. But that's exactly what got me into the 
refactoring that was ballooning out of hand that I mentioned.


- Heikki
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 2a4dea2b052..6a03fa730e5 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -785,11 +785,23 @@ ExecInitExprRec(Expr *node, ExprState *state,
 scratch.d.aggref.astate = astate;
 astate->aggno = aggref->aggno;
 
+/*
+ * Remember this Aggref as the representative for the
+ * aggregate. ExecInitAgg needs a representative for
+ * initializing the states, and it can also be accessed
+ * by the user-defined functions by AggGetAggref().
+ */
 if (state->parent && IsA(state->parent, AggState))
 {
 	AggState   *aggstate = (AggState *) state->parent;
+	AggStatePerAgg peragg;
+
+	if (aggref->aggno >= aggstate->numaggs)
+		elog(ERROR, "invalid aggno %d", aggref->aggno);
+	peragg = &aggstate->peragg[aggref->aggno];
 
-	aggstate->aggs = lappend(aggstate->aggs, aggref);
+	if (peragg->aggref == NULL)
+		peragg->aggref = aggref;
 }
 else
 {
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 7585689b94d..1949729a494 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -3236,13 +3236,11 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
 	Plan	   *outerPlan;
 	ExprContext *econtext;
 	TupleDesc	scanDesc;
-	int			max_a

Re: Parallel copy

2020-10-29 Thread Heikki Linnakangas

On 27/10/2020 15:36, vignesh C wrote:

Attached v9 patches have the fixes for the above comments.


I did some testing:

/tmp/longdata.pl:

#!/usr/bin/perl
#
# Generate three rows:
# foo
# longdatalongdatalongdata...
# bar
#
# The length of the middle row is given as command line arg.
#

my $bytes = $ARGV[0];

print "foo\n";
for(my $i = 0; $i < $bytes; $i+=8){
print "longdata";
}
print "\n";
print "bar\n";


postgres=# copy longdata from program 'perl /tmp/longdata.pl 1' 
with (parallel 2);


This gets stuck forever (or at least I didn't have the patience to wait for 
it to finish). Both worker processes are consuming 100% of CPU.


- Heikki




Re: Parallel copy

2020-10-29 Thread Daniel Westermann (DWE)
On 27/10/2020 15:36, vignesh C wrote:
>> Attached v9 patches have the fixes for the above comments.

>I did some testing:

I did some testing as well and have a cosmetic remark:

postgres=# copy t1 from '/var/tmp/aa.txt' with (parallel 10);
ERROR:  value 10 out of bounds for option "parallel"
DETAIL:  Valid values are between "1" and "1024".
postgres=# copy t1 from '/var/tmp/aa.txt' with (parallel 1000);
ERROR:  parallel requires an integer value
postgres=# 

Wouldn't it make more sense to only have one error message? The first one seems 
to be the better message.

Regards
Daniel



RE: extension patch of CREATE OR REPLACE TRIGGER

2020-10-29 Thread osumi.takami...@fujitsu.com
Hi,


From: Tsunakawa, Takayuki < tsunakawa.ta...@fujitsu.com>
> From: osumi.takami...@fujitsu.com 
> > > > * I don't think that you've fully thought through the implications
> > > > of replacing a trigger for a table that the current transaction
> > > > has already modified.  Is it really sufficient, or even useful, to
> > > > do
> > > > this:
> > > >
> > > > +/*
> > > > + * If this trigger has pending events, throw an error.
> > > > + */
> > > > +if (trigger_deferrable &&
> > > > + AfterTriggerPendingOnRel(RelationGetRelid(rel)))
> > > >
> > > > As an example, if we change a BEFORE trigger to an AFTER trigger,
> > > > that's not going to affect the fact that we *already* fired that
> > > > trigger.  Maybe this is okay and we just need to document it, but
> > > > I'm not
> > > convinced.
> > > >
> > > > * BTW, I don't think a trigger necessarily has to be deferrable in
> > > > order to have pending AFTER events.  The existing use of
> > > > AfterTriggerPendingOnRel certainly doesn't assume that.  But
> > > > really, I think we probably ought to be applying
> > > > CheckTableNotInUse which'd include that test.  (Another way in
> > > > which there's fuzzy thinking here is that AfterTriggerPendingOnRel
> > > > isn't specific to *this*
> > > > trigger.)
> > > Hmm, actually, when I just put a call to CheckTableNotInUse() in
> > > CreateTrigger(), it throws an error like "cannot CREATE OR REPLACE
> > > TRIGGER because it is being used by active queries in this session".
> > > This breaks the protection for the internal cases above and
> > > contradicts already-passing test cases.
> > > I thought of adding a condition of in_partition==false for calling
> > CheckTableNotInUse().
> > > But this doesn't work in a corner case where an internally generated
> > > child trigger is pending and we don't want to allow the replacement of
> > > that kind
> > of trigger.
> > > Did you have any good idea to achieve both points at the same time?
> > Still, in terms of this point, I'm waiting for a comment!
> 
> I understand this patch is intended for helping users to migrate from other
> DBMSs (mainly Oracle?) because they can easily alter some trigger attributes
> (probably the trigger action and WHEN condition in practice.)  OTOH, the
> above issue seems to be associated with the Postgres-specific constraint
> trigger that is created with the CONSTRAINT clause.  (Oracle and the SQL
> standard don't have an equivalent feature.)
> 
> So, how about just disallowing the combination of REPLACE and
> CONSTRAINT?  I think nobody would be crippled with that.  If someone
> wants the combination by all means, that can be a separate enhancement.
I hadn't considered this perspective, and you are right.
For the purpose of helping database migration from Oracle to Postgres,
prohibiting the use of OR REPLACE for constraint triggers is no problem.

Thanks for your great advice. I fixed it and created a new version.
The size of the patch has also become much smaller.


Best,
Takamichi Osumi


CREATE_OR_REPLACE_TRIGGER_v14.patch
Description: CREATE_OR_REPLACE_TRIGGER_v14.patch


Re: Parallel copy

2020-10-29 Thread Amit Kapila
On Thu, Oct 29, 2020 at 11:45 AM Amit Kapila  wrote:
>
> On Tue, Oct 27, 2020 at 7:06 PM vignesh C  wrote:
> >
> [latest version]
>
> I think the parallel-safety checks in this patch
> (v9-0002-Allow-copy-from-command-to-process-data-from-file) are
> incomplete and wrong.
>

One more point: I noticed that some time back [1] I gave a suggestion
related to the way workers process the set of lines (aka chunk). I think
you can try increasing the chunk size to, say, 100, 500, or 1000, and use
some shared counter to remember the number of chunks processed (see the
sketch below).

[1] - 
https://www.postgresql.org/message-id/CAA4eK1L-Xgw1zZEbGePmhBBWmEmLFL6rCaiOMDPnq2GNMVz-sg%40mail.gmail.com
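
(To illustrate the shared-counter idea; the struct and field names here are
hypothetical, not taken from the patch:

    /* in the shared state that the leader sets up */
    pg_atomic_uint64    chunks_processed;
    ...
    pg_atomic_init_u64(&pcshared_info->chunks_processed, 0);

    /* in each worker, after finishing one chunk of, say, 1000 lines */
    pg_atomic_fetch_add_u64(&pcshared_info->chunks_processed, 1);

so that after the run we can see how evenly the chunks were distributed
across workers.)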

-- 
With Regards,
Amit Kapila.




Re: Track statistics for streaming of in-progress transactions

2020-10-29 Thread Amit Kapila
On Thu, Oct 29, 2020 at 5:16 AM Tomas Vondra
 wrote:
>
> On Wed, Oct 28, 2020 at 08:54:53AM +0530, Amit Kapila wrote:
> >On Fri, Oct 23, 2020 at 10:24 AM Amit Kapila  wrote:
> >>
> >> On Thu, Oct 22, 2020 at 2:09 PM Amit Kapila  
> >> wrote:
> >> >
> >>
> >> I have fixed the above comment and rebased the patch. I have changed
> >> the docs a bit to add more explanation about the counters. Let me know
> >> if you have any more comments. Thanks Dilip and Sawada-San for
> >> reviewing this patch.
> >>
> >
> >Attached is an updated patch with minor changes in docs and cosmetic
> >changes. I am planning to push this patch tomorrow unless there are
> >any more comments/suggestions.
> >
>
> +1 and thanks for working on this
>

Pushed!

-- 
With Regards,
Amit Kapila.




Re: libpq compression

2020-10-29 Thread Daniil Zakhlystov
Hi,

> On Oct 29, 2020, at 12:27 AM, Andres Freund  wrote:
> 
> The protocol sounds to me like there's no way to enable/disable
> compression in an existing connection. To me it seems better to have an
> explicit, client initiated, request to use a specific method of
> compression (including none). That allows to enable compression for bulk
> work, and disable it in other cases (e.g. for security sensitive
> content, or for unlikely to compress well content).
> 
> I think that would also make cross-version handling easier, because a
> newer client driver can send the compression request and handle the
> error, without needing to reconnect or such.
> 
> Most importantly, I think such a design is basically a necessity to make
> connection poolers to work in a sensible way.

Can you please clarify your opinion about connection poolers? By "a
sensible way", do you mean that in some cases they can just forward the
compressed stream without parsing it?

>> +/*
>> + * Index of used compression algorithm in zpq_algorithms array.
>> + */
>> +static int zpq_algorithm_impl;
> 
> This is just odd API design imo. Why doesn't the dispatch work based on
> an argument for zpq_create() and the ZpqStream * for the rest?
> 
> What if there's two libpq connections in one process? To servers
> supporting different compression algorithms? This isn't going to fly.


I agree, I think that moving the zpq_algorithm_impl to the ZpqStream struct 
seems like an easy fix for this issue.

>> +/* initialize compression */
>> +if (zpq_set_algorithm(compression_algorithm))
>> +PqStream = zpq_create((zpq_tx_func)secure_write, 
>> (zpq_rx_func)secure_read, MyProcPort);
>> +}
>> +return 0;
>> +}
> 
> Why is zpq a wrapper around secure_write/read? I'm a bit worried this
> will reduce the other places we could use zpq.

Maybe we can just split the PqStream into PqCompressStream and 
PqDecompressStream? It looks like they can work independently.

—
Daniil Zakhlystov




Re: Disable WAL logging to speed up data loading

2020-10-29 Thread Laurenz Albe
On Thu, 2020-10-29 at 11:42 +0900, Fujii Masao wrote:
> > But what if someone sets wal_level=none, performs some data modifications,
> > sets wal_level=archive and after some more processing decides to restore 
> > from
> > a backup that was taken before the cluster was set to wal_level=none?
> > Then they would end up with a corrupted database, right?
> 
> I think that the guard to prevent the server from starting up from
> the corrupted database in that scenario is necessary.

That wouldn't apply, I think, because the backup from which you start
was taken with wal_level = replica, so the guard wouldn't alert.

But as I said in the other thread, changing wal_level emits a WAL
record, and I am sure that recovery will refuse to proceed if
wal_level < replica.

> I'm still not sure if it's worth supporting this feature in core.
> Because it can really, really easily cause users to corrupt the whole database.

You mean, if they take no backup before setting wal_level = none
and then crash the database, so that they are stuck with an
unrecoverable database?

Yes, that feels somewhat too fast and loose...
It would at least require some fat warnings in the documentation
and in postgresql.conf.

Yours,
Laurenz Albe





Re: Disable WAL logging to speed up data loading

2020-10-29 Thread Laurenz Albe
On Thu, 2020-10-29 at 00:07 +, osumi.takami...@fujitsu.com wrote:
> > Sorry for the noise, and I am beginning to think that this is actually a 
> > useful
> > feature.
> 
> No problem at all.
> Probably, for some developers, the name "none" was confusing?

No, I think that "none" is quite accurate.

The worry is that it makes it quite easy for people to end up with
a crashed database that cannot start - at the very least there should be
warnings in the documentation and postgresql.conf that are as dire as
for "fsync = off".

Yours,
Laurenz Albe





Re: [HACKERS] logical decoding of two-phase transactions

2020-10-29 Thread Amit Kapila
On Tue, Oct 27, 2020 at 3:25 PM Ajin Cherian  wrote:
>
[v13 patch set]
Few comments on v13-0001-Support-2PC-txn-base. I haven't checked v14
version of patches so if you have fixed anything then ignore it.

1.
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H

 #include "access/htup_details.h"
+#include "access/twophase.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -174,6 +175,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_STREAMED 0x0010
 #define RBTXN_HAS_TOAST_INSERT0x0020
 #define RBTXN_HAS_SPEC_INSERT 0x0040
+#define RBTXN_PREPARE 0x0080
+#define RBTXN_COMMIT_PREPARED 0x0100
+#define RBTXN_ROLLBACK_PREPARED   0x0200

 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -233,6 +237,24 @@ typedef struct ReorderBufferChange
  ((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )

+/* Has this transaction been prepared? */
+#define rbtxn_prepared(txn) \
+( \
+ ((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+)
+
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+ ((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rollbacked? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+ ((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+

I think the above changes should be moved to the second patch. There
is no use of these macros in this patch and moreover they appear to be
out-of-place.

2.
@@ -127,6 +152,7 @@ pg_decode_startup(LogicalDecodingContext *ctx,
OutputPluginOptions *opt,
  ListCell   *option;
  TestDecodingData *data;
  bool enable_streaming = false;
+ bool enable_2pc = false;

I think it is better to name this variable as enable_two_pc or enable_twopc.

3.
+ xid = strtoul(strVal(elem->arg), NULL, 0);
+ if (xid == 0 || errno != 0)
+ data->check_xid_aborted = InvalidTransactionId;
+ else
+ data->check_xid_aborted = (TransactionId)xid;
+
+ if (!TransactionIdIsValid(data->check_xid_aborted))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+ strVal(elem->arg;

Can't we write this as below and get rid of xid variable:
data->check_xid_aborted= (TransactionId) strtoul(strVal(elem->arg), NULL, 0);
if (!TransactionIdIsValid(data->check_xid_aborted) || errno)
ereport..

4.
+ /* if check_xid_aborted is a valid xid, then it was passed in
+ * as an option to check if the transaction having this xid would be aborted.
+ * This is to test concurrent aborts.
+ */

Multi-line comments should have an empty first line.

5.
+ 
+  The required prepare_cb callback is called whenever
+  a transaction which is prepared for two-phase commit has been
+  decoded. The change_cb callbacks for all modified
+  rows will have been called before this, if there have been any modified
+  rows.
+
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+ReorderBufferTXN *txn,
+XLogRecPtr prepare_lsn);
+
+ 
+
+
+
+ Transaction Commit Prepared Callback
+
+ 
+  The required commit_prepared_cb callback
is called whenever
+  a transaction commit prepared has been decoded. The
gid field,
+  which is part of the txn parameter can
be used in this
+  callback.

I think the last line "The gid field, which is
part of the txn parameter can be used in this
callback." in 'Transaction Commit Prepared Callback' should also be
present in 'Transaction Prepare Callback' as we using the same in
prepare API as well.

6.
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+ return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ if (data->include_xids)
+ appendStringInfo(ctx->out, "preparing streamed transaction TXN %u", txn->xid);
+ else
+ appendStringInfo(ctx->out, "preparing streamed transaction");

I think we should include 'gid' as well in the above messages.
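For example (assuming the txn parameter's gid field mentioned above for the
other callbacks), something like:

appendStringInfo(ctx->out, "preparing streamed transaction TXN %s", txn->gid);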

7.
@@ -221,12 +235,26 @@ StartupDecodingContext(List *output_plugin_options,
  ctx->streaming = (ctx->callbacks.stream_start_cb != NULL) ||
  (ctx->callbacks.stream_stop_cb != NULL) ||
  (ctx->callbacks.stream_abort_cb != NULL) ||
+ (ctx->callbacks.stream_prepare_cb != NULL) ||
  (ctx->callbacks.stream_commit_cb != NULL) ||
  (ctx->callbacks.stream_change_cb != NULL) ||
  (ctx->callbacks.stream_message_cb != NULL) ||
  (ctx->callbacks.stream_truncate_cb != NULL);

  /*
+ * To support two-phase logical decoding, we require
prepare/commit-prepare/abort-prepare
+ * callbacks. The filter-prepare callback is optional. We however
enable two-phase logical
+ * decoding when at least one of the methods is enabled so that we

Re: document pg_settings view doesn't display custom options

2020-10-29 Thread John Naylor
On Wed, Oct 28, 2020 at 11:38 PM Fujii Masao 
wrote:

>
>
> On 2020/10/29 3:45, John Naylor wrote:
> > On Wed, Oct 28, 2020 at 2:15 PM John Naylor <
> john.nay...@enterprisedb.com > wrote:
> >
> > Starting separate threads to keep from cluttering the TODO list
> thread.
> >
> > Here's a patch for the subject, as mentioned in
> >
> https://www.postgresql.org/message-id/20201027220555.GS4951%40momjian.us
> >
> >
> > I just realized I introduced a typo, so here's v2.
>
> +   The pg_settings view does not display
> +   customized options.
>
> This is true until the module that defines the customized options is
> loaded,
> but not after that. No? For example, pg_settings displays
> pg_stat_statements.max after pg_stat_statements is loaded.
>

True, how about this:

   The pg_settings view does not display
   customized options
   that have been set before the relevant extension module has been loaded.

-- 
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: partition routing layering in nodeModifyTable.c

2020-10-29 Thread Amit Langote
On Wed, Oct 28, 2020 at 12:02 PM Amit Langote  wrote:
> On Tue, Oct 27, 2020 at 10:23 PM Heikki Linnakangas  wrote:
> > But since this applies on top of the "overhaul update/delete processing"
> > patch, let's tackle that patch set next. Could you rebase that, please?
>
>
> Anyway, I will post the rebased patch on the "overhaul update/delete
> processing" thread.

Done.

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: libpq compression

2020-10-29 Thread Konstantin Knizhnik

Hi,
Thanks for the review.

On 28.10.2020 22:27, Andres Freund wrote:

I don't see a corresponding configure.ac change?
Shame on me - I completely forgot that configure is actually generated 
from configure.ac.

Fixed.



+ 
+  compression
+  
+  
+Request compression of libpq traffic. Client sends to the server list 
of compression algorithms, supported by client library.
+If server supports one of this algorithms, then it acknowledges use of 
this algorithm and then all libpq messages send both from client to server and
+visa versa will be compressed. If server is not supporting any of the 
suggested algorithms, then it replies with 'n' (no compression)
+message and it is up to the client whether to continue work without 
compression or report error.
+Supported compression algorithms are chosen at configure time. Right 
now two libraries are supported: zlib (default) and zstd (if Postgres was
+configured with --with-zstd option). In both cases streaming mode is 
used.
+  
+  
+ 
+


- there should be a reference to potential security impacts when used in
   combination with encrypted connections

Done


- What does " and it is up to the client whether to continue work
   without compression or report error" actually mean for a libpq parameter?

It cannot happen.
The client requests the use of the compressed protocol from the server only 
if "compression=XXX" was specified in the connection string.
But XXX has to be supported by the client, otherwise the request is 
rejected.

So the list of supported algorithms sent by the client can never be empty.


- What is the point of the "streaming mode" reference?


There are two ways of performing compression:
- block mode, where each block is compressed individually (the compressor 
stores a dictionary in each compressed block)

- stream mode
Block mode allows each page to be decompressed independently. It is good for 
implementing page or field compression (as pglz is used to compress 
TOAST values).

But it is not efficient for compressing client-server protocol messages.
It seems important to me to explain that libpq uses stream mode, and 
why there is no pglz compressor.
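
To illustrate the difference, here is a minimal sketch using plain zlib (not 
the actual zpq code; error handling omitted): in stream mode a single 
z_stream is reused for every protocol message, so each message is compressed 
against the history of everything sent before, while block mode would reset 
the compressor state for every message.

    #include <string.h>
    #include <zlib.h>

    /* One compressor state shared by the whole connection (stream mode). */
    static z_stream tx_stream;

    static void
    stream_compress_init(void)
    {
        memset(&tx_stream, 0, sizeof(tx_stream));
        deflateInit(&tx_stream, Z_DEFAULT_COMPRESSION);
    }

    /*
     * Compress one protocol message.  Z_SYNC_FLUSH emits all output for this
     * message but keeps the dictionary, so later messages keep benefiting
     * from the shared history, which is why stream mode wins for many small
     * protocol messages.
     */
    static size_t
    stream_compress_message(char *msg, size_t len, char *out, size_t outsize)
    {
        tx_stream.next_in = (Bytef *) msg;
        tx_stream.avail_in = len;
        tx_stream.next_out = (Bytef *) out;
        tx_stream.avail_out = outsize;
        deflate(&tx_stream, Z_SYNC_FLUSH);
        return outsize - tx_stream.avail_out;
    }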

Why are compression methods identified by one byte identifiers? That
seems unnecessarily small, given this is commonly a once-per-connection
action?


It is mostly for simplicity of implementation: it is always simpler to 
work with fixed-size messages (or with an array of chars rather than an 
array of strings).
And I do not think that it decreases flexibility: these 
one-letter algorithm codes are not visible to the user. And I do not think 
that we will ever support more than 127 (or even 64) different 
compression algorithms.

The protocol sounds to me like there's no way to enable/disable
compression in an existing connection. To me it seems better to have an
explicit, client initiated, request to use a specific method of
compression (including none). That allows to enable compression for bulk
work, and disable it in other cases (e.g. for security sensitive
content, or for unlikely to compress well content).


It would significantly complicate the implementation (because of buffering at 
different levels).
Also, it is not clear to me who would control enabling/disabling 
compression in this case, and how.
I can imagine that "\copy" should trigger compression. But what about the 
(server-side) "copy" command?
Or just a SELECT returning huge results? I do not think that making the user 
or application enable/disable compression on the fly is really a good idea.

The overhead of compressing small commands is not that large.
And concerning security risks... in most cases this problem is not 
relevant at all, because both client and server are located within a single 
reliable network.
If the security of the communication really matters, you should not switch 
on compression at all (including for COPY and other bulk data transfers).
It is a very strange idea to let the client decide which data is "security 
sensitive" and which is not.


I think that would also make cross-version handling easier, because a
newer client driver can send the compression request and handle the
error, without needing to reconnect or such.

Most importantly, I think such a design is basically a necessity to make
connection poolers to work in a sensible way.


I do not completely understand the problem with connection poolers.
Right now the developers of Yandex Odyssey are trying to support libpq 
compression in their pooler.

If they run into problems, I will definitely address them.

And lastly, wouldn't it be reasonable to allow to specify things like
compression levels? All that doesn't have to be supported now, but I
think the protocol should take that into account.
Well, if we want to provide maximal flexibility, then we should 
allow specifying the compression level.
Practically, when I implemented our CFS compressed storage for 
pgpro-ee and libpq_compression, I performed
a lot ben

Re: MultiXact\SLRU buffers configuration

2020-10-29 Thread Tomas Vondra

On Thu, Oct 29, 2020 at 12:08:21PM +0500, Andrey Borodin wrote:




On 29 Oct 2020, at 04:32, Tomas Vondra  
wrote:

It's not my intention to be mean or anything like that, but to me this
means we don't really understand the problem we're trying to solve. Had
we understood it, we should be able to construct a workload reproducing
the issue ...

I understand what the individual patches are doing, and maybe those
changes are desirable in general. But without any benchmarks from a
plausible workload I find it hard to convince myself that:

(a) it actually will help with the issue you're observing on production

and
(b) it's actually worth the extra complexity (e.g. the lwlock changes)


I'm willing to invest some of my time into reviewing/testing this, but I
think we badly need better insight into the issue, so that we can build
a workload reproducing it. Perhaps collecting some perf profiles and a
sample of the queries might help, but I assume you already tried that.


Thanks, Tomas! This totally makes sense.

Indeed, collecting queries has not helped yet. We have a load-test environment 
equivalent to production (but with 10x fewer shards) and a copy of the 
production workload queries, but the problem does not manifest there.
Why do I think the problem is in MultiXacts?
Here is a chart with the number of wait events on each host:


During the problem, MultiXactOffsetControlLock and SLRURead dominate all other 
lock types. After primary switchover to another node, SLRURead continued there 
for a bit, then disappeared.


OK, so most of this seems to be due to SLRURead and
MultiXactOffsetControlLock. Could it be that there were too many
multixact members, triggering autovacuum to prevent multixact
wraparound? That might generate a lot of IO on the SLRU. Are you
monitoring the size of the pg_multixact directory?


Backtraces on standbys during the problem show that most of the backends are 
sleeping in pg_sleep(1000L) and are not included in the wait stats on these 
charts.

Currently I'm considering writing a test that directly calls MultiXactIdExpand(), 
MultiXactIdCreate(), and GetMultiXactIdMembers() from an extension. Do you 
think benchmarks in such tests would be meaningful?



I don't know. I'd much rather have a SQL-level benchmark than an
extension doing this kind of stuff.
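
For what it's worth, one SQL-level way to stress multixacts (just a sketch;
it assumes a table t with an integer id column) is to have many sessions
share-lock the same rows, since a row locked by more than one transaction at
a time gets a multixact.  E.g. a pgbench script:

    \set id random(1, 1000000)
    BEGIN;
    SELECT * FROM t WHERE id = :id FOR SHARE;
    SELECT pg_sleep(0.001);   -- hold the lock briefly so sessions overlap
    COMMIT;

run with something like "pgbench -n -f lock.sql -c 64 -j 8 -T 300".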


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Log message for GSS connection is missing once connection authorization is successful.

2020-10-29 Thread Stephen Frost
Greetings,

* vignesh C (vignes...@gmail.com) wrote:
> I have made a v2 patch based on the changes you have suggested. The
> patch for the same is attached.

> From b067cf823750f200102be0a0cad9a26a08e29a92 Mon Sep 17 00:00:00 2001
> From: Vignesh C 
> Date: Wed, 28 Oct 2020 08:19:06 +0530
> Subject: [PATCH v2] Log message for GSS connection is missing once connection
>  authorization is successful.
> 
> Log message for GSS connection is missing once connection authorization is
> successful. We have similar log message for SSL connections once the 
> connection
> authorization is successful. This message will help the user to identify the
> connection that was selected from the logfile.

Just to be clear- it's not that the message is 'missing', it's just not
providing the (certainly useful) information about how the connection
was authorized.  The commit message should make it clear that what we're
doing here is improving the connection authorization message for GSS
authenticated or encrypted connections.

> diff --git a/src/backend/utils/init/postinit.c 
> b/src/backend/utils/init/postinit.c
> index d4ab4c7..7980e92 100644
> --- a/src/backend/utils/init/postinit.c
> +++ b/src/backend/utils/init/postinit.c
> @@ -267,6 +267,21 @@ PerformAuthentication(Port *port)
> 
> be_tls_get_compression(port) ? _("on") : _("off";
>   else
>  #endif
> +#ifdef ENABLE_GSS
> + if (be_gssapi_get_auth(port) || 
> be_gssapi_get_princ(port))
> + ereport(LOG,
> + (port->application_name != NULL
> +  ? errmsg("replication 
> connection authorized: user=%s application_name=%s GSS %s (principal=%s)",
> +   
> port->user_name,
> +   
> port->application_name,
> +   
> be_gssapi_get_auth(port) || be_gssapi_get_enc(port) ? _("on") : _("off"),
> +   
> be_gssapi_get_princ(port))
> +  : errmsg("replication 
> connection authorized: user=%s GSS %s (principal=%s)",
> +   
> port->user_name,
> +   
> be_gssapi_get_auth(port) || be_gssapi_get_enc(port) ? _("on") : _("off"),
> +   
> be_gssapi_get_princ(port;
> + else
> +#endif

No, this isn't what I was suggesting.  "on" and "off" really isn't
communicating the details about the GSS-using connection.  What I
suggested before was something like:

errmsg("replication connection authorized: user=%s application_name=%s GSS %s 
(principal=%s)",
port->user_name,
port->application_name,
(be_gssapi_get_auth(port) && be_gssapi_get_enc(port)) ?  "authenticated 
and encrypted" : be_gssapi_get_auth(port) ?  "authenticated" : "encrypted",
be_gssapi_get_princ(port))

Though I'll admit that perhaps there's something better which could be
done here- but just 'on/off' certainly isn't that.  Another option might
be:

errmsg("replication connection authorized: user=%s application_name=%s GSS 
authenticated: %s, encrypted: %s, principal: %s",
port->user_name,
port->application_name,
be_gssapi_get_auth(port) ? "yes" : "no",
be_gssapi_get_enc(port) ? "yes" : "no",
be_gssapi_get_princ(port))

Also, it would be good to see if there's a way to add to the tests we
have for GSSAPI authentication/encryption to show that we hit each of
the possible cases and check that we get the correct messages in the log
as a result.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: New default role- 'pg_read_all_data'

2020-10-29 Thread Stephen Frost
Greetings,

* Georgios Kokolatos (gkokola...@protonmail.com) wrote:
> this patch is in "Ready for committer" state and the Cfbot is happy.

Glad that's still the case. :)

> Is there any committer that is available for taking a look at it?

If there aren't any objections or further comments, I'll take another
look through it and will commit it during the upcoming CF.

Thanks!

Stephen


signature.asc
Description: PGP signature


-O switch

2020-10-29 Thread Magnus Hagander
postgres --help:
  -o OPTIONS pass "OPTIONS" to each server process (obsolete)

This was marked obsolete in 2006 (86c23a6eb28).

Is it perhaps time to get rid of it?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: libpq compression

2020-10-29 Thread Konstantin Knizhnik

New version of the patch with fixes is attached.


On 28.10.2020 22:27, Andres Freund wrote:

Hi,

On 2020-10-26 19:20:46 +0300, Konstantin Knizhnik wrote:

diff --git a/configure b/configure
index ace4ed5..deba608 100755
--- a/configure
+++ b/configure
@@ -700,6 +700,7 @@ LD
  LDFLAGS_SL
  LDFLAGS_EX
  with_zlib
+with_zstd
  with_system_tzdata
  with_libxslt
  XML2_LIBS


I don't see a corresponding configure.ac change?



+ 
+  compression
+  
+  
+Request compression of libpq traffic. Client sends to the server list 
of compression algorithms, supported by client library.
+If server supports one of this algorithms, then it acknowledges use of 
this algorithm and then all libpq messages send both from client to server and
+visa versa will be compressed. If server is not supporting any of the 
suggested algorithms, then it replies with 'n' (no compression)
+message and it is up to the client whether to continue work without 
compression or report error.
+Supported compression algorithms are chosen at configure time. Right 
now two libraries are supported: zlib (default) and zstd (if Postgres was
+configured with --with-zstd option). In both cases streaming mode is 
used.
+  
+  
+ 
+


- there should be a reference to potential security impacts when used in
   combination with encrypted connections
- What does " and it is up to the client whether to continue work
   without compression or report error" actually mean for a libpq parameter?
- What is the point of the "streaming mode" reference?




@@ -263,6 +272,21 @@
   
  
   

+  CompressionAck
+  
+   
+ Server acknowledges using compression for client-server communication 
protocol.
+ Compression can be requested by client by including "compression" 
option in connection string.
+ Client sends to the server list of compression algorithms, supported 
by client library
+ (compression algorithm is identified by one letter: 'f' - 
Facebook zstd, 'z' - zlib,...).
+ If server supports one of this algorithms, then it acknowledges use 
of this algorithm and all subsequent libpq messages send both from client to 
server and
+ visa versa will be compressed. If server is not supporting any of the 
suggested algorithms, then it replies with 'n' (no compression)
+ algorithm identifier and it is up to the client whether to continue 
work without compression or report error.
+   
+  
+ 

Why are compression methods identified by one byte identifiers? That
seems unnecessarily small, given this is commonly a once-per-connection
action?


The protocol sounds to me like there's no way to enable/disable
compression in an existing connection. To me it seems better to have an
explicit, client initiated, request to use a specific method of
compression (including none). That allows to enable compression for bulk
work, and disable it in other cases (e.g. for security sensitive
content, or for unlikely to compress well content).

I think that would also make cross-version handling easier, because a
newer client driver can send the compression request and handle the
error, without needing to reconnect or such.

Most importantly, I think such a design is basically a necessity to make
connection poolers to work in a sensible way.

And lastly, wouldn't it be reasonable to allow to specify things like
compression levels? All that doesn't have to be supported now, but I
think the protocol should take that into account.



+
+Used compression algorithm. Right now the following streaming 
compression algorithms are supported: 'f' - Facebook zstd, 'z' - zlib, 'n' - no 
compression.
+

I would prefer this just be referenced as zstd or zstandard, not
facebook zstd. There's an RFC (albeit only "informational"), and it
doesn't name facebook, except as an employer:
https://tools.ietf.org/html/rfc8478



+int
+pq_configure(Port* port)
+{
+   char* client_compression_algorithms = port->compression_algorithms;
+   /*
+* If client request compression, it sends list of supported 
compression algorithms.
+* Each compression algorirthm is idetified by one letter ('f' - 
Facebook zsts, 'z' - xlib)
+*/

s/algorirthm/algorithm/
s/idetified/identified/
s/zsts/zstd/
s/xlib/zlib/

That's, uh, quite the typo density.



+   if (client_compression_algorithms)
+   {
+   char server_compression_algorithms[ZPQ_MAX_ALGORITHMS];
+   char compression_algorithm = ZPQ_NO_COMPRESSION;
+   char compression[6] = {'z',0,0,0,5,0}; /* message length = 5 */
+   int rc;

Why is this hand-rolling protocol messages?



+   /* Intersect lists */
+   while (*client_compression_algorithms != '\0')
+   {
+   if (strchr(server_compression_algorithms, 
*client_compression_algorithms))
+   {
+  

Re: POC: GROUP BY optimization

2020-10-29 Thread Dmitry Dolgov
> On Tue, Oct 27, 2020 at 09:19:51PM +0100, Tomas Vondra wrote:
> On Mon, Oct 26, 2020 at 11:40:40AM +0100, Dmitry Dolgov wrote:
> > > On Mon, Oct 26, 2020 at 01:28:59PM +0400, Pavel Borisov wrote:
> > > > Thanks for your interest! FYI there is a new thread about this topic [1]
> > > > with the next version of the patch and more commentaries (I've created
> > > > it for visibility purposes, but probably it also created some confusion,
> > > > sorry for that).
> > > >
> > > > Thanks!
> > >
> > > I made a very quick look at your updates and noticed that it is intended 
> > > to
> > > be simple and some parts of the code are removed as they have little test
> > > coverage. I'd propose vice versa to increase test coverage to enjoy more
> > > precise cost calculation and probably partial grouping.
> > >
> > > Or maybe it's worth to benchmark both patches and then re-decide what we
> > > want more to have a more complicated or a simpler version.
> > >
> > > Good to know that this feature is not stuck anymore and we have more than
> > > one proposal.
> > > Thanks!
> >
> > Just to clarify, the patch that I've posted in another thread mentioned
> > above is not an alternative proposal, but a development of the same
> > patch I had posted in this thread. As mentioned in [1], reduce of
> > functionality is an attempt to reduce the scope, and as soon as the base
> > functionality looks good enough it will be returned back.
> >
>
> I find it hard to follow two similar threads trying to do the same (or
> very similar) things in different ways. Is there any chance to join
> forces and produce a single patch series merging the changes? With the
> "basic" functionality at the beginning, then patches with the more
> complex stuff. That's the usual way, I think.
>
> As I said in my response on the other thread [1], I think constructing
> additional paths with alternative orderings of pathkeys is the right
> approach. Otherwise we can't really deal with optimizations above the
> place where we consider this optimization.
>
> That's essentially what I was trying in explain May 16 response [2]
> when I actually said this:
>
>So I don't think there will be a single "interesting" grouping
>pathkeys (i.e. root->group_pathkeys), but a collection of pathkeys.
>And we'll need to build grouping paths for all of those, and leave
>the planner to eventually pick the one giving us the cheapest plan.
>
> I wouldn't go as far as saying the approach in this patch (i.e. picking
> one particular ordering) is doomed, but it's going to be very hard to
> make it work reliably. Even if we get the costing *at this node* right,
> who knows how it'll affect costing of the nodes above us?
>
> So if I can suggest something, I'd merge the two patches, adopting the
> path-based approach. With the very basic functionality/costing in the
> first patch, and the more advanced stuff in additional patches.
>
> Does that make sense?

Yes, and from what I understand that is already what has happened in the
newer thread [1]. To avoid any confusion, there are no "two patches", at
least from my side; what I've posted in [1] is the continuation of this
work, but with the path-based approach adopted and a bit less functionality
(essentially I've dropped everything that was not covered by tests in the
original patch).

In case I'm missing something and Pavel's proposal is significantly
different from the original patch (if I understand correctly, the latest
patch posted here is a rebase, adjusting the old patch to work with the
latest changes in master, right?), then indeed they could be merged, but
please in the newer thread [1].

[1]: 
https://www.postgresql.org/message-id/flat/CA%2Bq6zcW_4o2NC0zutLkOJPsFt80megSpX_dVRo6GK9PC-Jx_Ag%40mail.gmail.com




Re: libpq compression

2020-10-29 Thread Konstantin Knizhnik




On 28.10.2020 22:58, Alvaro Herrera wrote:

On 2020-Oct-26, Konstantin Knizhnik wrote:


+   while (bufptr < bufend || zpq_buffered(PqStream) != 0) /* has more data 
to flush or unsent data in internal compression buffer */
{
-   int r;
-
-   r = secure_write(MyProcPort, bufptr, bufend - bufptr);
-
-   if (r <= 0)
+   int r;
+   size_t  processed = 0;
+   size_t  available = bufend - bufptr;
+   r = PqStream
+   ? zpq_write(PqStream, bufptr, available, &processed)
+   : secure_write(MyProcPort, bufptr, available);
+   bufptr += processed;
+   PqSendStart += processed;

This bit is surprising to me.  I thought the whole zpq_write() thing
should be hidden inside secure_write, so internal_flush would continue
to call just secure_write; and it is that routine's responsibility to
call zpq_write or be_tls_write or secure_raw_write etc according to
compile-time options and socket state.

Sorry, maybe it is not the nicest way of coding.
Ideally, we should use the "decorator design pattern" here, where each layer 
(compression, TLS, ...) is implemented as a separate decorator class.

This is how I/O streams are implemented in Java and many other SDKs.
But it requires too much redesign of the Postgres code.

Also, from my point of view the core of the problem is that in Postgres 
there are two almost independent implementations of the networking code for 
backend and frontend.
IMHO it would be better to have some SAL (system abstraction layer) which 
hides OS-specific stuff from the rest of the system and which can be shared 
by backend and frontend.


In any case, I failed to implement it in a different way.
The basic requirements were:
1. The zpq code should be usable both by backend and frontend.
2. The decompressor may need to perform multiple reads from the underlying 
layer to fill its buffer and be able to produce some output.

3. Minimize changes in the Postgres code.
4. Be able to use the zpq library in some other tools (like a pooler).

This is why the zpq_create function is given tx/rx functions to perform I/O 
operations.
secure_write is such a tx function for the backend (and pqsecure_write for 
the frontend).
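
A minimal sketch of that layering (illustrative only; the names and types
here are not the actual zpq API):

    #include <stdlib.h>
    #include <sys/types.h>

    /*
     * The layer below is abstracted as a pair of callbacks, so the same
     * compression code can sit on top of secure_read/secure_write in the
     * backend, pqsecure_read/pqsecure_write in libpq, or a pooler's own
     * socket routines.
     */
    typedef ssize_t (*io_tx_fn) (void *arg, const void *buf, size_t size);
    typedef ssize_t (*io_rx_fn) (void *arg, void *buf, size_t size);

    typedef struct ZStream
    {
        io_tx_fn    tx;     /* write to the layer below (TLS or raw socket) */
        io_rx_fn    rx;     /* read from the layer below */
        void       *arg;    /* e.g. the Port / PGconn passed back to tx/rx */
        /* compressor and decompressor state would live here */
    } ZStream;

    static ZStream *
    zstream_create(io_tx_fn tx, io_rx_fn rx, void *arg)
    {
        ZStream    *zs = malloc(sizeof(ZStream));

        zs->tx = tx;
        zs->rx = rx;
        zs->arg = arg;
        return zs;
    }

    /*
     * A write would compress into a scratch buffer and then hand the
     * compressed bytes to the layer below; the caller never needs to know
     * whether that layer is TLS, a plain socket, or something else.
     */
    static ssize_t
    zstream_write(ZStream *zs, const void *buf, size_t size)
    {
        return zs->tx(zs->arg, buf, size);  /* compressed data, in reality */
    }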


Actually, the name of the function secure_write suggests that it deals 
only with TLS, not with compression.
Certainly it is possible to rename it, or better, introduce some other 
functions, e.g. stream_read, which would perform these checks.
But please notice that it is not enough to perform all the checks in one 
function as you suggested.  It should really be a pipe in which each 
component does its own job:

encryption, compression

As for me, I would prefer to have indirect function calls here, but there 
are several reasons (at least the different prototypes of the functions)

that made me choose the current way.

In any case, there are many different ways of doing the same task,
and different people have their own opinion about the best/right way of 
doing it.
Definitely there are some objective criteria: encapsulation, lack of 
code duplication,

readability of the code, ...

I tried to find the best approach based on my own preferences and the 
requirements described above.
Maybe I am wrong, but then I want to be convinced that the suggested 
alternative is better.
From my point of view, calling the compressor from a function named 
secure_read is not the right solution...




--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: POC: GROUP BY optimization

2020-10-29 Thread Pavel Borisov
>
> In case if I'm missing something and Pavel's proposal is significantly
> different from the original patch (if I understand correctly, at the
> moment the latest patch posted here is a rebase and adjusting the old
> patch to work with the latest changes in master, right?), then indeed
> they could be merged, but please in the newer thread [1].
>

Sure, the only difference between my patch and Theodor's previous code is
compatibility with v13, though it is not a small one, and I appreciate the
changes in path processing. The only thing that caught my attention is that
some useful changes which I mentioned before are discarded now. But as
long as they are planned to be put back in later, it is completely fine. I
agree to discuss the thing in any thread, though I don't quite understand
the reason for the switch.

Still, I don't see a problem.

-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


Re: New default role- 'pg_read_all_data'

2020-10-29 Thread Georgios Kokolatos
Hi,

this patch is in "Ready for committer" state and the Cfbot is happy.

Is there any committer that is available for taking a look at it?

Cheers,
//Georgios - CFM 2020-11

Re: -O switch

2020-10-29 Thread Tom Lane
Magnus Hagander  writes:
> postgres --help:
>   -o OPTIONS pass "OPTIONS" to each server process (obsolete)

> This was marked obsolete in 2006 (86c23a6eb28).

I don't think it's really obsolete ... don't we use that to pass
PGOPTIONS through from the client?

regards, tom lane




Re: -O switch

2020-10-29 Thread Magnus Hagander
On Thu, Oct 29, 2020 at 4:45 PM Tom Lane  wrote:
>
> Magnus Hagander  writes:
> > postgres --help:
> >   -o OPTIONS pass "OPTIONS" to each server process (obsolete)
>
> > This was marked obsolete in 2006 (86c23a6eb28).
>
> I don't think it's really obsolete ... don't we use that to pass
> PGOPTIONS through from the client?

Then it probably shouldn't be labeled as obsolete :)

That said, I don't think we do, or I'm misunderstanding what you mean.
The startup packet which holds the client options is not read until
we're already in the child process, so there is no further exec to be
done?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Autovacuum worker doesn't immediately exit on postmaster death

2020-10-29 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Victor Yegorov  writes:
> > ср, 28 окт. 2020 г. в 19:44, Alexander Kukushkin :
> >> I know, nobody in their mind should do that, but, if the postmaster
> >> process is killed with SIGKILL signal, most backend processes
> >> correctly notice the fact of the postmaster process absence and exit.
> >> There is one exception though, when there are autovacuum worker
> >> processes they are continuing to run until eventually finish and exit.
> 
> > Do you get the same behaviour also on master?
> > As there was some work in this area for 14, see
> > https://git.postgresql.org/pg/commitdiff/44fc6e259b
> 
> That was about SIGQUIT response, which isn't really related to this
> scenario.  But I do not think Alexander has accurately characterized
> the situation.  *No* server processes will react instantly to postmaster
> death.  Typically they'll only detect it while waiting for some other
> condition, such as client input, or in some cases while iterating their
> outermost loop.  So if they're busy with calculations they might not
> notice for a long time.  I don't think autovacuum is any worse than
> a busy client backend on this score.

Considering how long an autovacuum can run, it seems like it'd be
worthwhile to find a useful place to check for postmaster-death.
Typical well-running systems are going to be waiting for the client
pretty frequently and therefore this does make autovacuum stick out in
this case.

> It's hard to do better than that, because on most platforms there's
> no way to get a signal on parent-process death, so the only way to
> notice would be to poll the postmaster-death pipe constantly; which
> would be hugely expensive in comparison to the value.

I agree that 'constantly' wouldn't be great, but with some periodicity
that's more frequent than 'not until a few hours later when we finally
finish vacuuming this relation' would be nice.  At least with autovauum
we may be periodically sleeping anyway so it doesn't seem like polling
at that point would really be terrible, though it'd be nice to check
every once in a while even if we aren't sleeping.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: -O switch

2020-10-29 Thread Tom Lane
Magnus Hagander  writes:
> On Thu, Oct 29, 2020 at 4:45 PM Tom Lane  wrote:
>> I don't think it's really obsolete ... don't we use that to pass
>> PGOPTIONS through from the client?

> That said, I don't think we do, or I'm misunderstanding what you mean.
> The startup packet which holds the client options is not read until
> we're already in the child process, so there is no further exec to be
> done?

[ pokes around... ]  Ah, you're right, that stuff goes through
port->cmdline_options now.  It looks like the mechanism for -o
is the postmaster's ExtraOptions variable, which we could get
rid of this way.  Seems like a reasonable thing, especially since
we unified all the other postmaster/postgres options already.

regards, tom lane




Re: Autovacuum worker doesn't immediately exit on postmaster death

2020-10-29 Thread Tom Lane
Stephen Frost  writes:
> I agree that 'constantly' wouldn't be great, but with some periodicity
> that's more frequent than 'not until a few hours later when we finally
> finish vacuuming this relation' would be nice.  At least with autovauum
> we may be periodically sleeping anyway so it doesn't seem like polling
> at that point would really be terrible, though it'd be nice to check
> every once in a while even if we aren't sleeping.

Maybe put a check into vacuum_delay_point, and poll the pipe when we're
about to sleep anyway?  That wouldn't fix anything except autovacuum,
but if you're right that that's a primary pain point then it'd help.
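
For illustration only, a minimal sketch of that idea, assuming it lives inside
vacuum_delay_point() where "msec" is the already-computed cost delay; this is
not from any posted patch:

```c
if (msec > 0)
{
    /* we are about to sleep anyway, so the extra check is cheap */
    if (!PostmasterIsAlive())
        proc_exit(1);

    pg_usleep((long) (msec * 1000));
}
```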

regards, tom lane




Re: Autovacuum worker doesn't immediately exit on postmaster death

2020-10-29 Thread Alvaro Herrera
On 2020-Oct-29, Stephen Frost wrote:

> > It's hard to do better than that, because on most platforms there's
> > no way to get a signal on parent-process death, so the only way to
> > notice would be to poll the postmaster-death pipe constantly; which
> > would be hugely expensive in comparison to the value.
> 
> I agree that 'constantly' wouldn't be great, but with some periodicity
> that's more frequent than 'not until a few hours later when we finally
> finish vacuuming this relation' would be nice.  At least with autovauum
> we may be periodically sleeping anyway so it doesn't seem like polling
> at that point would really be terrible, though it'd be nice to check
> every once in a while even if we aren't sleeping.

vacuum_delay_point seems an obvious candidate, as soon as we've
determined that the sleep interval is > 0; since we're going to sleep,
the cost of a syscall seems negligible.  I'm not sure what to suggest
for vacuums that don't have vacuum costing active, though.




Re: Autovacuum worker doesn't immediately exit on postmaster death

2020-10-29 Thread Alvaro Herrera
On 2020-Oct-28, Alexander Kukushkin wrote:

> Hello,
> 
> I know, nobody in their mind should do that, but, if the postmaster
> process is killed with SIGKILL signal, most backend processes
> correctly notice the fact of the postmaster process absence and exit.
> There is one exception though, when there are autovacuum worker
> processes they are continuing to run until eventually finish and exit.

So, if you have a manual vacuum running on the table (with
vacuum_cost_delay=0) and kill -KILL the postmaster, that one also
lingers arbitrarily long afterwards?

(I suppose the problem is not as obvious just because the vacuum
wouldn't run as long, because of no vacuum cost delay; but it'd still be
a problem if you made the table bigger.)




Re: pg_dump, ATTACH, and independently restorable child partitions

2020-10-29 Thread Justin Pryzby
On Sat, Oct 24, 2020 at 02:59:49PM -0500, Justin Pryzby wrote:
> On Fri, Oct 23, 2020 at 12:29:40AM -0500, Justin Pryzby wrote:
> > Since this commit, pg_dump CREATEs tables and then ATTACHes them:
> > 
> > |commit 33a53130a89447e171a8268ae0b221bb48af6468
> > |Author: Alvaro Herrera 
> > |Date:   Mon Jun 10 18:56:23 2019 -0400
> > |
> > |Make pg_dump emit ATTACH PARTITION instead of PARTITION OF (reprise)
> > |...
> > |This change also has the advantage that the partition is restorable 
> > from
> > |the dump (as a standalone table) even if its parent table isn't
> > |restored.
> > 
> > I like the idea of child tables being independently restorable, but it 
> > doesn't
> > seem to work.
> ...
> > Now that I look, it seems like this is calling PQexec(), which sends a 
> > single,
> > "simple" libpq message with:
> > |CREATE TABLE ..; ALTER TABLE .. ATTACH PARTITION;
> > ..which is transactional, so when the 2nd command fails, the CREATE is 
> > rolled back.
> > https://www.postgresql.org/docs/9.5/libpq-exec.html#LIBPQ-EXEC-MAIN
> 
> The easy fix is to add an explicit begin/commit.

Now with updated test script.

-- 
Justin
>From 8e34426bf7d863e3e8878a2e31c64ca999ec4464 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sat, 24 Oct 2020 14:51:18 -0500
Subject: [PATCH v2] pg_dump: Allow child partitions to be independently
 restored

..even if the parent doesn't exist, or has missing/incompatible columns

This seems to have been intended by commit 33a53130a
---
 src/bin/pg_dump/pg_dump.c| 8 
 src/bin/pg_dump/t/002_pg_dump.pl | 2 ++
 2 files changed, 10 insertions(+)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 03f9d4d9e8..4767418849 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -15651,6 +15651,13 @@ dumpTableSchema(Archive *fout, TableInfo *tbinfo)
 			binary_upgrade_set_pg_class_oids(fout, q,
 			 tbinfo->dobj.catId.oid, false);
 
+		/*
+		 * Use explicit transaction commands so that failure in a
+		 * following command in the same txn doesn't cause ROLLBACK of
+		 * the "CREATE TABLE".
+		 */
+		appendPQExpBufferStr(q, "begin;\n");
+
 		appendPQExpBuffer(q, "CREATE %s%s %s",
 		  tbinfo->relpersistence == RELPERSISTENCE_UNLOGGED ?
 		  "UNLOGGED " : "",
@@ -15871,6 +15878,7 @@ dumpTableSchema(Archive *fout, TableInfo *tbinfo)
 		}
 		else
 			appendPQExpBufferStr(q, ";\n");
+		appendPQExpBufferStr(q, "commit;\n");
 
 		/* Materialized views can depend on extensions */
 		if (tbinfo->relkind == RELKIND_MATVIEW)
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index ec63662060..90dc2c5ae9 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -2343,6 +2343,7 @@ my %tests = (
 		regexp => qr/^
 			\Q-- Name: measurement;\E.*\n
 			\Q--\E\n\n
+			\Qbegin;\E\n
 			\QCREATE TABLE dump_test.measurement (\E\n
 			\s+\Qcity_id integer NOT NULL,\E\n
 			\s+\Qlogdate date NOT NULL,\E\n
@@ -2351,6 +2352,7 @@ my %tests = (
 			\s+\QCONSTRAINT measurement_peaktemp_check CHECK ((peaktemp >= '-460'::integer))\E\n
 			\)\n
 			\QPARTITION BY RANGE (logdate);\E\n
+			\Qcommit;\E\n
 			/xm,
 		like =>
 		  { %full_runs, %dump_test_schema_runs, section_pre_data => 1, },
-- 
2.17.0



Re: duplicate function oid symbols

2020-10-29 Thread Tom Lane
John Naylor  writes:
> Here is a quick patch implementing this much.

Pushed with a couple cosmetic tweaks.

regards, tom lane




Re: Deduplicate aggregates and transition functions in planner

2020-10-29 Thread Andres Freund
Hi,

On 2020-10-29 10:17:20 +0200, Heikki Linnakangas wrote:
> On 28/10/2020 21:59, Andres Freund wrote:
> > It wouldn't surprise me to see a small execution time speedup here -
> > I've seen the load of the aggno show up in profiles.
> 
> I think you'd be hard-pressed to find a real-life query where it
> matters. But if you don't care about real life:

I was actually thinking about a different angle - that the evaluation of
an Aggref can be faster, because we need less indirection to find the
aggno. As you have already implemented for the JITed code, but removing
it for the expression code looks easy enough too. You'd need a lot of
groups and presumably a fair number of Aggrefs to see it.

Attached is a quick version of what I am thinking wrt AggrefExprState.


> > > @@ -783,14 +783,13 @@ ExecInitExprRec(Expr *node, ExprState *state,
> > >   scratch.opcode = EEOP_AGGREF;
> > >   scratch.d.aggref.astate = astate;
> > > - astate->aggref = aggref;
> > > + astate->aggno = aggref->aggno;
> > >   if (state->parent && IsA(state->parent, AggState))
> > >   {
> > >   AggState   *aggstate = (AggState *) state->parent;
> > > - aggstate->aggs = lappend(aggstate->aggs, astate);
> > > - aggstate->numaggs++;
> > > + aggstate->aggs = lappend(aggstate->aggs, aggref);
> > 
> > Hm. Why is aggstate->aggs still built during expression initialization?
> > Imo that's a pretty huge wart that also introduces more
> > order-of-operation brittleness to executor startup.
> 
> The Agg node itself doesn't include any information about the aggregates and
> transition functions. Because of that, ExecInitAgg needs a "representive"
> Aggref for each transition state and agg, to initialize the per-trans and
> per-agg structs. The expression initialization makes those Aggrefs available
> for ExecInitAgg.

> Instead of collecting all the Aggrefs in a list, ExecInitExprRec() could set
> each representative Aggref directly in the right per-trans and per-agg
> struct, based on the 'aggno' and 'aggtransno' fields.

Hold on a second: To me the question is why is it the right design that
the Agg node doesn't have the information about "aggregates and
transition functions"? Agg e.g. already does directly contains the group
keys...

My concern wouldn't really be addressed if we replace the lappend() in
ExecInitExprRec() with setting something "directly in the right
per-trans...". I think it's structurally wrong to have to discover
Aggrefs at execution time at all.

Perhaps the easiest incremental step would be to have something like a
CookedAggref { int aggno; } and then just store the Aggref nodes in
Agg->aggs, with aggno referencing that...
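
Purely to illustrate the shape of that idea -- a hypothetical node, not
something from a posted patch:

```c
/* Hypothetical: a planner-produced stand-in for an already-resolved Aggref. */
typedef struct CookedAggref
{
    NodeTag     type;
    int         aggno;          /* index into the owning Agg node's aggs list */
} CookedAggref;
```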


> I'd like to get rid of the "representative Aggrefs" altogether, and pass
> information about the transition and final functions from planner to
> executor in some other form. But that's exactly what got me into the
> refactoring that was ballooning out of hand that I mentioned.

Fair.


Greetings,

Andres Freund
diff --git i/src/include/executor/execExpr.h w/src/include/executor/execExpr.h
index b792de1bc95..abb489e2062 100644
--- i/src/include/executor/execExpr.h
+++ w/src/include/executor/execExpr.h
@@ -564,8 +564,7 @@ typedef struct ExprEvalStep
 		/* for EEOP_AGGREF */
 		struct
 		{
-			/* out-of-line state, modified by nodeAgg.c */
-			AggrefExprState *astate;
+			int			aggno;
 		}			aggref;
 
 		/* for EEOP_GROUPING_FUNC */
diff --git i/src/include/nodes/execnodes.h w/src/include/nodes/execnodes.h
index fc5698cff20..0ff19256e13 100644
--- i/src/include/nodes/execnodes.h
+++ w/src/include/nodes/execnodes.h
@@ -746,16 +746,6 @@ typedef tuplehash_iterator TupleHashIterator;
  * 
  */
 
-/* 
- *		AggrefExprState node
- * 
- */
-typedef struct AggrefExprState
-{
-	NodeTag		type;
-	int			aggno;			/* ID number for agg within its plan node */
-} AggrefExprState;
-
 /* 
  *		WindowFuncExprState node
  * 
diff --git i/src/include/nodes/nodes.h w/src/include/nodes/nodes.h
index 7ddd8c011bf..3684f87a883 100644
--- i/src/include/nodes/nodes.h
+++ w/src/include/nodes/nodes.h
@@ -206,10 +206,9 @@ typedef enum NodeTag
 	 * Most Expr-based plan nodes do not have a corresponding expression state
 	 * node, they're fully handled within execExpr* - but sometimes the state
 	 * needs to be shared with other parts of the executor, as for example
-	 * with AggrefExprState, which nodeAgg.c has to modify.
+	 * with SubPlanState, which nodeSubplan.c has to modify.
 	 */
 	T_ExprState,
-	T_AggrefExprState,
 	T_WindowFuncExprState,
 	T_SetExprState,
 	T_SubPlanState,

Re: Autovacuum worker doesn't immediately exit on postmaster death

2020-10-29 Thread Andres Freund
Hi,

On 2020-10-29 12:27:53 -0400, Tom Lane wrote:
> Maybe put a check into vacuum_delay_point, and poll the pipe when we're
> about to sleep anyway?

Perhaps we should just replace the pg_usleep() with a latch wait?
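
A rough sketch of that, assuming the same "msec" delay that
vacuum_delay_point() computes today (the wait event is a placeholder; this is
not from any posted patch):

```c
(void) WaitLatch(MyLatch,
                 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                 (long) msec,
                 WAIT_EVENT_PG_SLEEP);
ResetLatch(MyLatch);
```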

Greetings,

Andres Freund




Re: Autovacuum worker doesn't immediately exit on postmaster death

2020-10-29 Thread Stephen Frost
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2020-10-29 12:27:53 -0400, Tom Lane wrote:
> > Maybe put a check into vacuum_delay_point, and poll the pipe when we're
> > about to sleep anyway?
> 
> Perhaps we should just replace the pg_usleep() with a latch wait?

I'm not sure why, but I had the thought that we already had done that,
and was a bit surprised that it wasn't that way, so +1 on my part.

I do think it'd be good to find a way to check every once in a while
even when we aren't going to delay though.  Not sure what the best
answer there is.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Online checksums verification in the backend

2020-10-29 Thread Andres Freund
Hi,

On 2020-10-28 14:08:52 +0900, Michael Paquier wrote:
> Thanks for confirming.  I have gone through the whole set today,
> splitted the thing into two commits and applied them.  We had
> buildfarm member florican complain about a mistake in one of the
> GetDatum() calls that I took care of already, and there is nothing
> else on my radar.

The code does IO while holding the buffer mapping lock. That seems
*entirely* unacceptable to me. That basically locks 1/128 of shared
buffers against concurrent mapping changes, while reading data that is
likely not to be on disk?  Seriously?


    /* see if the block is in the buffer pool or not */
    LWLockAcquire(partLock, LW_SHARED);
    buf_id = BufTableLookup(&buf_tag, buf_hash);
    if (buf_id >= 0)
    {
        uint32      buf_state;

        /*
         * Found it.  Now, retrieve its state to know what to do with it, and
         * release the pin immediately.  We do so to limit overhead as much as
         * possible.  We keep the shared LWLock on the target buffer mapping
         * partition for now, so this buffer cannot be evicted, and we acquire
         * an I/O Lock on the buffer as we may need to read its contents from
         * disk.
         */
        bufdesc = GetBufferDescriptor(buf_id);

        LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED);
        buf_state = LockBufHdr(bufdesc);
        UnlockBufHdr(bufdesc, buf_state);

        /* If the page is dirty or invalid, skip it */
        if ((buf_state & BM_DIRTY) != 0 || (buf_state & BM_TAG_VALID) == 0)
        {
            LWLockRelease(BufferDescriptorGetIOLock(bufdesc));
            LWLockRelease(partLock);
            return true;
        }

        /* Read the buffer from disk, with the I/O lock still held */
        smgrread(smgr, forknum, blkno, buffer);
        LWLockRelease(BufferDescriptorGetIOLock(bufdesc));
    }
    else
    {
        /*
         * Simply read the buffer.  There's no risk of modification on it as
         * we are holding the buffer pool partition mapping lock.
         */
        smgrread(smgr, forknum, blkno, buffer);
    }


The justification in the in-shared-buffers case seems to completely
mis-judge costs too:
         * Found it.  Now, retrieve its state to know what to do with it, and
         * release the pin immediately.  We do so to limit overhead as much as
         * possible.  We keep the shared LWLock on the target buffer mapping
         * partition for now, so this buffer cannot be evicted, and we acquire
         * an I/O Lock on the buffer as we may need to read its contents from
         * disk.
a pin is cheap. Holding the partition lock is not.


Also, using char[BLCKSZ] as a buffer isn't ok. This should use
PGAlignedBlock:
/*
 * Use this, not "char buf[BLCKSZ]", to declare a field or local variable
 * holding a page buffer, if that page might be accessed as a page and not
 * just a string of bytes.  Otherwise the variable might be under-aligned,
 * causing problems on alignment-picky hardware.  (In some places, we use
 * this to declare buffers even though we only pass them to read() and
 * write(), because copying to/from aligned buffers is usually faster than
 * using unaligned buffers.)  We include both "double" and "int64" in the
 * union to ensure that the compiler knows the value must be MAXALIGN'ed
 * (cf. configure's computation of MAXIMUM_ALIGNOF).
 */
typedef union PGAlignedBlock
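
For illustration, the fix amounts to something like the following sketch
(assuming the buffer stays local to CheckBuffer(); not from a posted patch):

```c
/* Sketch: an aligned page buffer in place of "char buffer[BLCKSZ]". */
PGAlignedBlock buf;

smgrread(smgr, forknum, blkno, buf.data);
```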


I think this needs to be quickly reworked or reverted.


Greetings,

Andres Freund




Re: Online checksums verification in the backend

2020-10-29 Thread Andres Freund
Hi,

On 2020-10-29 11:17:29 -0700, Andres Freund wrote:
>   LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED);
>   buf_state = LockBufHdr(bufdesc);
>   UnlockBufHdr(bufdesc, buf_state);
> 
>   /* If the page is dirty or invalid, skip it */
>   if ((buf_state & BM_DIRTY) != 0 || (buf_state & BM_TAG_VALID) == 0)

This is weird as well. What is this supposed to do? Just locking and
unlocking a buffer header doesn't do squat? There's no guarantee that
the flags haven't changed by this point, so you could just as well not
acquire the buffer header lock.

Also, why are pages without a valid tag ignored? I can follow the
argument for skipping it in the DIRTY case, but that doesn't apply for
BM_TAG_VALID?

Greetings,

Andres Freund




Re: Online checksums verification in the backend

2020-10-29 Thread Andres Freund
Hi,

On 2020-10-29 11:17:29 -0700, Andres Freund wrote:
> The code does IO while holding the buffer mapping lock. That seems
> *entirely* unacceptable to me. That basically locks 1/128 of shared
> buffers against concurrent mapping changes, while reading data that is
> likely not to be on disk?  Seriously?

Also, uh, I don't think the locking of the buffer table provides you
with the full guarantees CheckBuffer() seems to assume:

 * Check the state of a buffer without loading it into the shared buffers. To
 * avoid torn pages and possible false positives when reading data, a shared
 * LWLock is taken on the target buffer pool partition mapping, and we check
 * if the page is in shared buffers or not.  An I/O lock is taken on the block
 * to prevent any concurrent activity from happening.

this doesn't actually prevent all concurrent write IO, unless you hold
an appropriate lock on the relation. There's a few places that use
smgrwrite()/smgrextend() to write out data bypassing shared buffers.

Maybe that isn't a problem for the uses of CheckBuffer() is envisioned
for, but that'd need a pretty detailed explanation as to when it's safe
to use CheckBuffer() for which blocks.

Greetings,

Andres Freund




Re: [PATCH] remove pg_archivecleanup and pg_standby

2020-10-29 Thread Michael Banck
Hi,

Am Mittwoch, den 28.10.2020, 21:44 -0500 schrieb Justin Pryzby:
> Forking this thread:
> https://www.postgresql.org/message-id/fd93f1c5-7818-a02c-01e5-1075ac0d4...@iki.fi

Glancing over this in the context of pg_standby/pg_archivecleanup, I am
not sure Heikki's "Ditto" is about "remove pg_archivecleanup just like
pg_standby" or rather "keep the note until we get around doing something
with it". Probably the former, but see below.

> I think these are old-fashioned since 9.6 (?), so remove them for v14.

Why 9.6?

> I found it confusing when re-familiarizing myself with modern streaming
> replication that there are extensions which only help do things the "old way".

I guess not many will complain about pg_standby going away, but I am
under the impression that pg_archivecleanup is still used a lot in PITR
backup environments as a handy tool to expire WAL related to expired
base backups. I certainly saw hand-assembled shell code fail with "too
many files" and things when it tried to act on large amount of WAL.

So I think the part about it being used in archive_cleanup_command can
probably be removed, but the part about it being useful as a stand-alone
tool should stay, in particular this part:

|In this mode, if you specify a .partial or .backup file name, then only
|the file prefix will be used as the oldestkeptwalfile. This treatment
|of .backup file name allows you to remove all WAL files archived prior
|to a specific base backup without error.

At the very least, the commit message should give a rationale on why
pg_archivecleanup is retired, and what it should be replaced with, in
case valid use-cases for it are still present.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz





Re: Disable WAL logging to speed up data loading

2020-10-29 Thread Fujii Masao




On 2020/10/29 19:21, Laurenz Albe wrote:

On Thu, 2020-10-29 at 11:42 +0900, Fujii Masao wrote:

But what if someone sets wal_level=none, performs some data modifications,
sets wal_level=archive and after some more processing decides to restore from
a backup that was taken before the cluster was set to wal_level=none?
Then they would end up with a corrupted database, right?


I think that the guard to prevent the server from starting up from
the corrupted database in that scenario is necessary.


That wouldn't apply, I think, because the backup from which you start
was taken with wal_level = replica, so the guard wouldn't alert.

But as I said in the other thread, changing wal_level emits a WAL
record, and I am sure that recovery will refuse to proceed if
wal_level < replica.


Yes. What I meant was such a safe guard needs to be implemented.

This may mean that if we want to recover the database from that backup,
we need to specify the recovery target so that the archive recovery stops
just before the WAL record indicating wal_level change.




I'm still not sure if it's worth supporting this feature in core.
Because it can really easily cause users to corrupt the whole database.


You mean, if they take no backup before setting wal_level = none
and then crash the database, so that they are stuck with an
unrecoverable database?


Yes. Also if the safe guard that we discussed the above is missing,
even when a backup is taken before wal_level=none, recovery from
the backup can make the database corrupted.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Consistent error reporting for encryption/decryption in pgcrypto

2020-10-29 Thread Daniel Gustafsson
Commit b918bf86c65 added the errorcode PXE_DECRYPT_FAILED to the existing set
of PXE_ error codes.  When pgcrypto was changed to the EVP APIs in 5ff4a67f63,
no new error codes were added in favour of existing ones.  This results in
encryption failures returning PXE_ERR_GENERIC, which seems a bit inconsistent.

The attached introduces PXE_ENCRYPT_FAILED and uses that for EVP_EncryptUpdate to
ideally be slightly clearer in case of errors.  Any reason not to do that
instead of using ERR_GENERIC?
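
For reference, a sketch of the kind of change meant here (the surrounding
function is abbreviated and the variable names are placeholders, not the ones
in the actual patch):

```c
/* Sketch: report encryption failures with a dedicated error code. */
if (!EVP_EncryptUpdate(ctx, out, &outlen, in, inlen))
    return PXE_ENCRYPT_FAILED;   /* previously PXE_ERR_GENERIC */
```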

cheers ./daniel



0001-Use-a-more-descriptive-error-for-failed-encryption.patch
Description: Binary data






Re: Autovacuum worker doesn't immediately exit on postmaster death

2020-10-29 Thread Alvaro Herrera
On 2020-Oct-29, Stephen Frost wrote:

> I do think it'd be good to find a way to check every once in a while
> even when we aren't going to delay though.  Not sure what the best
> answer there is.

Maybe instead of thinking specifically in terms of vacuum, we could
count buffer accesses (read from kernel) and check the latch once every
1000th such, or something like that.  Then a very long query doesn't
have to wait until it's run to completion.  The cost is one integer
addition per syscall, which should be bearable.

(This doesn't help with a query that's running arbitrarily outside of
Postgres, or doing something that doesn't access disk -- but it'd help
with a majority of problem cases.)
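
A bare-bones sketch of the counting idea; the names, threshold, and call site
are invented here purely for illustration:

```c
/* Hypothetical: called once per buffer read from the kernel. */
#define PM_CHECK_EVERY_N_READS 1000

static uint64 reads_since_pm_check = 0;

static inline void
MaybeCheckPostmasterAlive(void)
{
    if (++reads_since_pm_check >= PM_CHECK_EVERY_N_READS)
    {
        reads_since_pm_check = 0;
        if (!PostmasterIsAlive())
            proc_exit(1);
    }
}
```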




-Wformat-signedness

2020-10-29 Thread Thomas Munro
Hi hackers,

They're probably mostly harmless, being mostly error and debug
messages and the like, and considering that eg OID parsing tolerates
negative numbers when reading them back in, but for what it's worth:
GCC complains about many %d vs %u type mixups if you build with
$SUBJECT.




Re: Autovacuum worker doesn't immediately exit on postmaster death

2020-10-29 Thread Tom Lane
Alvaro Herrera  writes:
> On 2020-Oct-29, Stephen Frost wrote:
>> I do think it'd be good to find a way to check every once in a while
>> even when we aren't going to delay though.  Not sure what the best
>> answer there is.

> Maybe instead of thinking specifically in terms of vacuum, we could
> count buffer accesses (read from kernel) and check the latch once every
> 1000th such, or something like that.  Then a very long query doesn't
> have to wait until it's run to completion.  The cost is one integer
> addition per syscall, which should be bearable.

I'm kind of unwilling to add any syscalls at all to normal execution
code paths for this purpose.  People shouldn't be sig-kill'ing the
postmaster, or if they do, cleaning up the mess is their responsibility.
I'd also suggest that adding nearly-untestable code paths for this
purpose is a fine way to add bugs we'll never catch.

The if-we're-going-to-delay-anyway path in vacuum_delay_point seems
OK to add a touch more overhead to, though.

regards, tom lane




Re: Deleting older versions in unique indexes to avoid page splits

2020-10-29 Thread Victor Yegorov
пн, 26 окт. 2020 г. в 22:15, Peter Geoghegan :

> Attached is v5, which has changes that are focused on two important
> high level goals:
>

And some more comments after another round of reading the patch.

1. Looks like UNIQUE_CHECK_NO_WITH_UNCHANGED is used for HOT updates,
   should we use UNIQUE_CHECK_NO_HOT here? It is better understood like
this.

2. You're modifying the table_tuple_update() function on 1311 line of
include/access/tableam.h,
   adding modified_attrs_hint. There's a large comment right before it
describing parameters,
   I think there should be a note about modified_attrs_hint parameter in
that comment, 'cos
   it is referenced from other places in tableam.h and also from
backend/access/heap/heapam.c

3. Can you elaborate on the scoring model you're using?
   Why do we expect a score of 25, what's the rationale behind this number?
   And should it be #define-d ?

4. heap_compute_xid_horizon_for_tuples contains duplicate logic. Is it
possible to avoid this?

5. In this comment

+ * heap_index_batch_check() helper function.  Sorts deltids array in the
+ * order needed for useful processing.

   perhaps it is better to replace "useful" with more details? Or point to
the place
   where "useful processing" is described.

6. In this snippet in _bt_dedup_delete_pass()

+   else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+_bt_dedup_save_htid(state, itup))
+   {
+
+   }

   I would rather add a comment, explaining that the empty body of the
clause is actually expected.

7. In the _bt_dedup_delete_finish_pending() you're setting ispromising to
false for both
   posting and non-posting tuples. This contradicts comments before
function.




-- 
Victor Yegorov


Re: enable_incremental_sort changes query behavior

2020-10-29 Thread Tomas Vondra

On Tue, Oct 06, 2020 at 09:37:31AM -0400, James Coleman wrote:

On Tue, Oct 6, 2020 at 9:28 AM Jaime Casanova
 wrote:

Can you please create an entry in the commitfest for this one so we
don't lose track of it?




We're not too far from the next minor release, so I've been looking at
this fix again and I intend to get it committed shortly (on Monday or
so). Attached is a slightly modified version of the v4 patch that:

(a) Removes comments about projections added to code that is not
directly related to the fix. I'm not against adding such comments
separately.

(b) Fixes comment in expected output of incremental_sort test.

(c) Removes else branch from find_em_expr_usable_for_sorting_rel. It's
not quite needed thanks to the "return" in the "if" branch. IMO this
makes it more elegant.


I do have two questions about find_em_expr_usable_for_sorting_rel.

(a) Where does the em_is_const check come from? The comment claims we
should not be trying to sort by equivalence class members, but I can't
convince myself it's actually true and I don't see it discussed in this
thread.

(b) In find_em_expr_for_rel, which was what was used before, the
condition on relids was this:

if (bms_is_subset(em->em_relids, rel->relids) &&
!bms_is_empty(em->em_relids))
{
return em->em_expr;
}

but here we're using only

if (!bms_is_subset(em->em_relids, rel->relids))
continue;

Isn't this missing the second bms_is_empty condition?


Of course, an alternative to this fix would be reverting ba3e76cc571
(completely or just the part introducing generate_useful_gather_paths).
If anyone thinks that's what we should do, please speak now.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index b399592ff8..3393e1d37a 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2760,7 +2760,8 @@ get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
 	/*
 	 * Considering query_pathkeys is always worth it, because it might allow
 	 * us to avoid a total sort when we have a partially presorted path
-	 * available.
+	 * available or to push the total sort into the parallel portion of the
+	 * query.
 	 */
 	if (root->query_pathkeys)
 	{
@@ -2773,17 +2774,17 @@ get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
 			EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
 
 			/*
-			 * We can only build an Incremental Sort for pathkeys which
-			 * contain an EC member in the current relation, so ignore any
-			 * suffix of the list as soon as we find a pathkey without an EC
-			 * member the relation.
+			 * We can only build a sort for pathkeys which contain an EC
+			 * member in the current relation's target, so ignore any suffix
+			 * of the list as soon as we find a pathkey without an EC member
+			 * in the relation.
 			 *
 			 * By still returning the prefix of the pathkeys list that does
 			 * meet criteria of EC membership in the current relation, we
 			 * enable not just an incremental sort on the entirety of
 			 * query_pathkeys but also incremental sort below a JOIN.
 			 */
-			if (!find_em_expr_for_rel(pathkey_ec, rel))
+			if (!find_em_expr_usable_for_sorting_rel(pathkey_ec, rel))
 				break;
 
 			npathkeys++;
diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 823422edad..9662dd3262 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -794,6 +794,66 @@ find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
 	return NULL;
 }
 
+/*
+ * Find an equivalence class member expression that can be safely used by a
+ * sort node on top of the provided relation. The rules here must match those
+ * applied in prepare_sort_from_pathkeys.
+ */
+Expr *
+find_em_expr_usable_for_sorting_rel(EquivalenceClass *ec, RelOptInfo *rel)
+{
+	ListCell   *lc_em;
+
+	/*
+	 * If there is more than one equivalence member matching these
+	 * requirements we'll be content to choose any one of them.
+	 */
+	foreach(lc_em, ec->ec_members)
+	{
+		EquivalenceMember *em = lfirst(lc_em);
+		PathTarget *target = rel->reltarget;
+		ListCell   *lc_expr;
+
+		/*
+		 * We shouldn't be trying to sort by an

contrib/sslinfo cleanup and OpenSSL errorhandling

2020-10-29 Thread Daniel Gustafsson
While hacking on the NSS patch I realized that sslinfo was passing the ->ssl
Port member directly to OpenSSL in order to extract information regarding the
connection.  This breaks the API provided by the backend, as well as duplicates
code for no real benefit.  The attached 0001 patch rewrites sslinfo to use the
be_tls_* API where possible to reduce duplication and keep the codebase TLS
dependency (mostly) tucked away behind a nice API.  0001 also contains a small
sslinfo doc update to cover that TLSv1.3 is a supported protocol.

0002 ports OpenSSL errorhandling introduced in d94c36a45ab which was performed
for sslinfo but not the backend.  I agree with the commit message that the risk
is small (but not non-existing), but if the checks were important enough for
sslinfo I'd argue they make sense for the backend too.

This patchset was pulled from the NSS patch, but it is entirely independent
from NSS.

cheers ./daniel



0002-Improve-error-handling-in-backend-OpenSSL-implementa.patch
Description: Binary data


0001-Use-be_tls_-API-for-SSL-information-in-sslinfo.patch
Description: Binary data



EXPLAIN vs track_io_timing=on vs tests

2020-10-29 Thread Andres Freund
Hi

I run my development instances with track_io_timing=on, as I've found
that to be really useful. Unfortunately that causes tests to fail
whenever I forget to turn that off to run installcheck.

The diffs are caused by the additional data shown in the explain tests:
...
-   "Temp Written Blocks": N+
+   "Temp Written Blocks": N,   +
+   "I/O Read Time": N.N,   +
+   "I/O Write Time": N.N   +
...


First, why is the output of these fields conditional when using a
non-text format?  Seems we instead should output -1 or null. The latter
seems a bit clearer, but is a bit JSON-specific. I guess we could add an
ExplainPropertyNull() or such?
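
Sketched only to illustrate the idea -- this assumes explain.c's internal
ExplainProperty() helper, and a real version would still need a look at each
output format:

```c
/* Hypothetical helper: emit a property whose value is null. */
static void
ExplainPropertyNull(const char *qlabel, ExplainState *es)
{
    /* numeric = true keeps the value unquoted, so JSON gets a bare null */
    ExplainProperty(qlabel, NULL, "null", true, es);
}
```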

Second, as long as it is conditional, would anybody mind if I put a
track_io_timing=false into explain.sql? We don't try to make the tests
pass with every possible option set, but track_io_timing seems common
enough?

Greetings,

Andres Freund




Re: Deleting older versions in unique indexes to avoid page splits

2020-10-29 Thread Peter Geoghegan
On Wed, Oct 28, 2020 at 4:05 PM Victor Yegorov  wrote:
> I've reviewed v5 of the patch and did some testing.

Thanks!

> I now see what you mean by saying that this patch is a natural and logical
> extension of the deduplication v13 work. I agree with this.

I tried the patch out with a long running transaction yesterday. I
think that the synergy with the v13 deduplication work really helped.
It took a really long time for an old snapshot to lead to pgbench page
splits (relative to the master branch, running a benchmark like the
one I talked about recently -- the fiver, tenner, score, etc index
benchmark). When the page splits finally started, they seemed much
more gradual -- I don't think that I saw the familiar pattern of
distinct waves of page splits that are clearly all correlated. I think
that the indexes grew at a low steady rate, which looked like the rate
that heap relations usually grow at.

We see a kind of "tick tock" pattern with this new mechanism + v13
deduplication: even when we don't delete very many TIDs, we still free
a few, and then merge the remaining TIDs to buy more time. Very
possibly enough time that a long running transaction goes away by the
time the question of splitting the page comes up again. Maybe there is
another long running transaction by then, but deleting just a few of
the TIDs the last time around is enough to not waste time on that
block this time around, and therefore to actually succeed despite the
second, newer long running transaction (we can delete older TIDs, just
not the now-oldest TIDs that the newer long running xact might still
need).

If this scenario sounds unlikely, bear in mind that "unnecessary" page
splits (which are all we really care about here) are usually only
barely necessary today, if you think about it in a localized/page
level way. What the master branch shows is that most individual
"unnecessary" page splits are in a sense *barely* unnecessary (which
of course doesn't make the consequences any better). We could buy many
hours until the next time the question of splitting a page comes up by
just freeing a small number of tuples -- even on a very busy database.

I found that the "fiver" and "tenner" indexes in particular took a
very long time to have even one page split with a long running
transaction.  Another interesting effect was that all page splits
suddenly stopped when my one hour 30 minute long transaction/snapshot
finally went away -- the indexes stopped growing instantly when I
killed the psql session. But on the master branch the cascading
version driven page splits took at least several minutes to stop when
I killed the psql session/snapshot at that same point of the benchmark
(maybe longer). With the master branch, we can get behind on LP_DEAD
index tuple bit setting, and then have no chance of catching up.
Whereas the patch gives us a second chance for each page.

(I really have only started to think about long running transactions
this week, so my understanding is still very incomplete, and based on
guesses and intuitions.)

> I don't quite like the last sentence. Given that this code is committed,
> I would rather make it:
>
>   … cannot be applied. Therefore we opportunistically check for dead tuples
> and reuse the space, delaying leaf page splits.
>
> I understand that "we" shouldn't be used here, but I fail to think of a
> proper way to express this.

Makes sense.

> 2. in _bt_dedup_delete_pass() and heap_index_batch_check() you're using some
> constants, like:
> - expected score of 25
> - nblocksaccessed checks for 1, 2 and 3 blocks
> - maybe more, but the ones above caught my attention.
>
> Perhaps, it'd be better to use #define-s here instead?

Yeah. It's still evolving, which is why it's still rough.

It's not easy to come up with a good interface here. Not because it's
very important and subtle. It's actually very *unimportant*, in a way.
nbtree cannot really expect too much from heapam here (though it might
get much more than expected too, when it happens to be easy for
heapam). The important thing is always what happens to be possible at
the local/page level -- the exact preferences of nbtree are not so
important. Beggars cannot be choosers.

It only makes sense to have a "score" like this because sometimes the
situation is so favorable (i.e. there are so many TIDs that can be
killed) that we want to avoid vastly exceeding what is likely to be
useful to nbtree. Actually, this situation isn't that rare (which
maybe means I was wrong to say the score thing was unimportant, but
hopefully you get the idea).

Easily hitting our target score of 25 on the first heap page probably
happens almost all the time when certain kinds of unique indexes use
the mechanism, for example. And when that happens it is nice to only
have to visit one heap block. We're almost sure that it isn't worth
visiting a second, regardless of how many TIDs we're likely to find
there.

> 3. Do we really need to touch another heap page, if all condition

Re: contrib/sslinfo cleanup and OpenSSL errorhandling

2020-10-29 Thread Andres Freund
Hi,

Thanks for extracting these.

On 2020-10-29 23:48:57 +0100, Daniel Gustafsson wrote:>  
>  /*
> @@ -54,9 +53,16 @@ PG_FUNCTION_INFO_V1(ssl_version);
>  Datum
>  ssl_version(PG_FUNCTION_ARGS)
>  {
> - if (MyProcPort->ssl == NULL)
> + const char *version;
> +
> + if (!MyProcPort->ssl_in_use)
> + PG_RETURN_NULL();
> +
> + version = be_tls_get_version(MyProcPort);
> + if (version == NULL)
>   PG_RETURN_NULL();
> - PG_RETURN_TEXT_P(cstring_to_text(SSL_get_version(MyProcPort->ssl)));
> +
> + PG_RETURN_TEXT_P(cstring_to_text(version));
>  }

There's quite a few copies of this code that look exactly the same,
except for the be_tls_get_* call. Do you see a way to have fewer copies
of the same code?
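
One hypothetical way to fold the repetition, sketched here only as an
illustration (the macro name is invented; it can only be used inside a V1
function body, since PG_RETURN_* needs fcinfo in scope):

```c
#define SSL_RETURN_TEXT_OR_NULL(expr) \
    do { \
        const char *_val; \
        if (!MyProcPort->ssl_in_use) \
            PG_RETURN_NULL(); \
        _val = (expr); \
        if (_val == NULL) \
            PG_RETURN_NULL(); \
        PG_RETURN_TEXT_P(cstring_to_text(_val)); \
    } while (0)
```

With something like that, ssl_version() would shrink to a single
SSL_RETURN_TEXT_OR_NULL(be_tls_get_version(MyProcPort)) call.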

Greetings,

Andres Freund




Re: Deleting older versions in unique indexes to avoid page splits

2020-10-29 Thread Peter Geoghegan
On Thu, Oct 29, 2020 at 3:05 PM Victor Yegorov  wrote:
> And some more comments after another round of reading the patch.
>
> 1. Looks like UNIQUE_CHECK_NO_WITH_UNCHANGED is used for HOT updates,
>should we use UNIQUE_CHECK_NO_HOT here? It is better understood like this.

This would probably get me arrested by the tableam police, though.

FWIW the way that that works is still kind of a hack. I think that I
actually need a new boolean flag, rather than overloading the enum
like this.

> 2. You're modifying the table_tuple_update() function on 1311 line of 
> include/access/tableam.h,
>adding modified_attrs_hint. There's a large comment right before it 
> describing parameters,
>I think there should be a note about modified_attrs_hint parameter in that 
> comment, 'cos
>it is referenced from other places in tableam.h and also from 
> backend/access/heap/heapam.c

Okay, makes sense.

> 3. Can you elaborate on the scoring model you're using?
>Why do we expect a score of 25, what's the rationale behind this number?
>And should it be #define-d ?

See my remarks on this from the earlier e-mail.

> 4. heap_compute_xid_horizon_for_tuples contains duplicate logic. Is it 
> possible to avoid this?

Maybe? I think that duplicating code is sometimes the lesser evil.
Like in tuplesort.c, for example. I'm not sure if that's true here,
but it certainly can be true. This is the kind of thing that I usually
only make my mind up about at the last minute. It's a matter of taste.

> 5. In this comment
>
> + * heap_index_batch_check() helper function.  Sorts deltids array in the
> + * order needed for useful processing.
>
>perhaps it is better to replace "useful" with more details? Or point to 
> the place
>where "useful processing" is described.

Okay.

> +   else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
> +_bt_dedup_save_htid(state, itup))
> +   {
> +
> +   }
>
>I would rather add a comment, explaining that the empty body of the clause 
> is actually expected.

Okay. Makes sense.

> 7. In the _bt_dedup_delete_finish_pending() you're setting ispromising to 
> false for both
>posting and non-posting tuples. This contradicts comments before function.

The idea is that we can have plain tuples (non-posting list tuples)
that are non-promising when they're duplicates. Because why not?
Somebody might have deleted them (rather than updating them). It is
fine to have an open mind about this possibility despite the fact that
it is close to zero (in the general case). Including these TIDs
doesn't increase the amount of work we do in heapam. Even when we
don't succeed in finding any of the non-dup TIDs as dead (which is
very much the common case), telling heapam about their existence could
help indirectly (which is somewhat more common). This factor alone
could influence which heap pages heapam visits when there is no
concentration of promising tuples on heap pages (since the total
number of TIDs on each block is the tie-breaker condition when
comparing heap blocks with an equal number of promising tuples during
the block group sort in heapam.c). I believe that this approach tends
to result in heapam going after older TIDs when it wouldn't otherwise,
at least in some cases.

You're right, though -- this is still unclear. Actually, I think that
I should move the handling of promising/duplicate tuples into
_bt_dedup_delete_finish_pending(), too (move it from
_bt_dedup_delete_pass()). That would allow me to talk about all of the
TIDs that get added to the deltids array (promising and non-promising)
in one central function. I'll do it that way soon.

-- 
Peter Geoghegan




Re: Consistent error reporting for encryption/decryption in pgcrypto

2020-10-29 Thread Michael Paquier
On Thu, Oct 29, 2020 at 10:26:54PM +0100, Daniel Gustafsson wrote:
> The attached introduce PXE_ENCRYPT_FAILED and use that for EVP_EncryptUpdate 
> to
> ideally be slightly clearer in case of errors.  Any reason not to do that
> instead of using ERR_GENERIC?

+1.  While looking at that, I was wondering of the potential need of
this error code for other encryption code paths, but it happens that
this is only specific to OpenSSL.  Rijndael or Blowfish don't need
it.
--
Michael


signature.asc
Description: PGP signature


Re: Deleting older versions in unique indexes to avoid page splits

2020-10-29 Thread Peter Geoghegan
On Thu, Oct 29, 2020 at 4:30 PM Peter Geoghegan  wrote:
> I found that the "fiver" and "tenner" indexes in particular took a
> very long time to have even one page split with a long running
> transaction.  Another interesting effect was that all page splits
> suddenly stopped when my one hour 30 minute long transaction/snapshot
> finally went away -- the indexes stopped growing instantly when I
> killed the psql session. But on the master branch the cascading
> version driven page splits took at least several minutes to stop when
> I killed the psql session/snapshot at that same point of the benchmark
> (maybe longer). With the master branch, we can get behind on LP_DEAD
> index tuple bit setting, and then have no chance of catching up.
> Whereas the patch gives us a second chance for each page.

I forgot to say that this long running xact/snapshot test I ran
yesterday was standard pgbench (more or less) -- no custom indexes.
Unlike my other testing, the only possible source of non-HOT updates
here was not being able to fit a heap tuple on the same heap page
(typically because we couldn't truncate HOT chains in time due to a
long running xact holding back cleanup).

The normal state (without a long running xact/snapshot) is no page
splits, both with the patch and with master. But when you introduce a
long running xact, both master and patch will get page splits. The
difference with the patch is that it'll take much longer to start up
compared to master, the page splits are more gradual and smoother with
the patch, and the patch will stop having page splits just as soon as
the xact goes away -- the same second. With the master branch we're
reliant on LP_DEAD bit setting, and if that gets temporarily held back
by a long snapshot then we have little chance of catching up after the
snapshot goes away but before some pages have unnecessary
version-driven page splits.

-- 
Peter Geoghegan




Re: psql \df choose functions by their arguments

2020-10-29 Thread Greg Sabino Mullane
Thank you for looking this over.


> This isn't working for arrays:
> ...
> postgres=# \df aa aa int[]
>

Arrays should work as expected, I think you have one too many "aa" in there?


> I think it should use the same syntax as \sf and \ef, which require
> parenthesis
> and commas, not spaces.
>

Hmm, that will not allow partial matches if we require a closing parens.
Right now both commas and parens are accepted, but optional.


> I think x is just used as "initial", so I think you should make it boolean
> and
> then set is_initial = false, or similar.
>

Good suggestion, it is done.


> +
>  pg_strcasecmp(functoken, "bool") == 0 ? "'boolean'"
>
> I think writing this all within a call to appendPQExpBuffer() is excessive.
> You can make an array or structure to search through and then append the
> result
> to the buffer.
>

Hmm, like a custom struct we loop through? I will look into implementing
that and submit a new patch.
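
For what it's worth, a rough sketch of such a structure (names and entries are
only illustrative, not taken from the patch):

```c
/* Hypothetical alias table mapping abbreviations to canonical type names. */
typedef struct
{
    const char *abbrev;
    const char *canonical;
} TypeNameAlias;

static const TypeNameAlias type_aliases[] = {
    {"bool", "boolean"},
    {"int", "integer"},
    {"int4", "integer"},
    {"int8", "bigint"},
    {"float", "double precision"},
};
```

A small loop over the array, comparing each entry with
pg_strcasecmp(functoken, ...), could then append the canonical name to the
buffer.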

Cheers,
Greg


Re: Parallel INSERT (INTO ... SELECT ...)

2020-10-29 Thread Greg Nancarrow
On Tue, Oct 27, 2020 at 8:56 PM Amit Kapila  wrote:
>
> IIUC, below is code for this workaround:
>
> +MaxRelParallelHazardForModify(Oid relid,
> + CmdType commandType,
> + max_parallel_hazard_context *context)
> +{
> + Relationrel;
> + TupleDesc tupdesc;
> + int attnum;
> +
> + LOCKMODE lockmode = AccessShareLock;
> +
> + /*
> + * It's possible that this relation is locked for exclusive access
> + * in another concurrent transaction (e.g. as a result of a
> + * ALTER TABLE ... operation) until that transaction completes.
> + * If a share-lock can't be acquired on it now, we have to assume this
> + * could be the worst-case, so to avoid blocking here until that
> + * transaction completes, conditionally try to acquire the lock and
> + * assume and return UNSAFE on failure.
> + */
> + if (ConditionalLockRelationOid(relid, lockmode))
> + {
> + rel = table_open(relid, NoLock);
> + }
> + else
> + {
> + context->max_hazard = PROPARALLEL_UNSAFE;
> + return context->max_hazard;
> + }
>
> Do we need this workaround if we lock just the parent table instead of
> locking all the tables? Basically, can we safely identify the
> parallel-safety of partitioned relation if we just have a lock on
> parent relation?

I believe the workaround is still needed in this case, because the
workaround was added because of a test in which the parent table was
exclusively locked in another concurrent transaction (as a result of
ALTER TABLE ... ATTACH PARTITION ...) so we could not even get a
ShareLock on the parent table without hanging (and then ending up
failing the test because of it).
So at the moment the workaround is needed, even if just trying to lock
the parent table.
I'll do some more testing to determine the secondary issue of whether
locks on the partition tables are needed, but at the moment I believe
they are.

>One more thing I have noticed is that for scan
> relations (Select query), we do such checks much later based on
> RelOptInfo (see set_rel_consider_parallel) which seems to have most of
> the information required to perform parallel-safety checks but I guess
> for ModifyTable (aka the Insert table) the equivalent doesn't seem
> feasible but have you thought of doing at the later stage in planner?
>

Yes, and in fact I tried putting the checks in a later stage of the
planner, and it's almost successful, except it then makes setting
"parallelModeNeeded" very tricky indeed, because that is expected to
be set based on whether the SQL is safe to run in parallel mode
(parallelModeOK == true) and whether force_parallel_mode is not off.
With parallel safety checks delayed to a later stage in the planner,
it's then not known whether there are certain types of parallel-unsafe
INSERTs (such as INSERT INTO ... VALUES ... ON CONFLICT DO UPDATE
...), because processing for those doesn't reach those later stages of
the planner where parallelism is being considered. So then to avoid
errors from when parallel-mode is forced on and such unsafe INSERTs
are run, the only real choice is to only allow parallelModeNeeded to
be true for SELECT only (not INSERT), and this is kind of cheating and
also not picking up cases where parallel-safe INSERT is run but
invokes parallel-mode-unsafe features.
My conclusion, at least for the moment, is to leave the check where it is.


> Few other comments on latest patch:
> ===
> 1.
> MaxRelParallelHazardForModify()
> {
> ..
> + if (commandType == CMD_INSERT || commandType == CMD_UPDATE)
> + {
> + /*
> ..
>
> Why to check CMD_UPDATE here?
>

That was a bit of forward-thinking, for when/if UPDATE/DELETE is
supported in parallel-mode.
Column default expressions and check-constraints are only applicable
to INSERT and UPDATE.
Note however that currently this function can only ever be called with
commandType == CMD_INSERT.

> 2.
> +void PrepareParallelModeForModify(CmdType commandType, bool
> isParallelModifyLeader)
> +{
> + Assert(!IsInParallelMode());
> +
> + if (isParallelModifyLeader)
> + (void)GetCurrentCommandId(true);
> +
> + (void)GetCurrentFullTransactionId();
>
> Here, we should use GetCurrentTransactionId() similar to heap_insert
> or other heap operations. I am not sure why you have used
> GetCurrentFullTransactionId?
>

GetCurrentTransactionId() and GetCurrentFullTransactionId() actually
have the same functionality, just a different return value (which is
not being used here).
But anyway I've changed it to use GetCurrentTransactionId().


> 3. Can we have a test to show why we need to check all the partitions
> for parallel-safety? I think it would be possible when there is a
> trigger on only one of the partitions and that trigger has
> corresponding parallel_unsafe function. But it is good to verify that
> once.
>

I can't imagine how you could check parallel-safety properly without
checking all of the partitions.
We don't know which partition the data will get inserted into until
runtime (e.g. range/list partitioning).
Each partition can have its own colu

Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

2020-10-29 Thread Tomas Vondra

Hi,

I might be somewhat late to the party, but I've done a bit of
benchmarking too ;-) I used TPC-H data from a 100GB test, and tried
different combinations of COPY [FREEZE] and VACUUM [FREEZE], both on
current master and with the patch.

The results look like this (the columns say what combination of COPY and
VACUUM was used - e.g. -/FREEZE means plain COPY and VACUUM FREEZE)

master:

             - / -    FREEZE / -    - / FREEZE    FREEZE / FREEZE
   ----------------------------------------------------------------
   COPY        2471       2477          2486            2484
   VACUUM       228        209           453             206

patched:

             - / -    FREEZE / -    - / FREEZE    FREEZE / FREEZE
   ----------------------------------------------------------------
   COPY        2459       2445          2458            2446
   VACUUM       227          0           467               0

So I don't really observe any measurable slowdowns in the COPY part (in
fact there seems to be a tiny speedup, but it might be just noise). In
the VACUUM part, there's clear speedup when the data was already frozen
by COPY (Yes, those are zeroes, because it took less than 1 second.)

So that looks pretty awesome, I guess.

For the record, these tests were run on a server with NVMe SSD, so
hopefully reliable and representative numbers.


A couple minor comments about the code:

1) Maybe add a comment before the block setting xlrec->flags in
heap_multi_insert. It's not very complex, but it used to be a bit
simpler, and all the other pieces around have comments, so it won't
hurt.


2) I find the "if (all_frozen_set)" block a bit strange. It's a matter
of personal preference, but I'd just use a single level of nesting, i.e.
something like this:

/* if everything frozen, the whole page has to be visible */
Assert(!(all_frozen_set && !PageIsAllVisible(page)));

/*
 * If we've frozen everything on the page, and if we're already
 * holding pin on the vmbuffer, record that in the visibilitymap.
 * If we're not holding the pin, it's OK to skip this - it's just
 * an optimization.
 *
 * It's fine to use InvalidTransactionId here - this is only used
 * when HEAP_INSERT_FROZEN is specified, which intentionally
 * violates visibility rules.
 */
if (all_frozen_set &&
visibilitymap_pin_ok(BufferGetBlockNumber(buffer), vmbuffer))
visibilitymap_set(...);

IMHO it's easier to read, but YMMV. I've also reworded the comment a bit
to say what we're doing etc. And I've moved the comment from inside the
if block into the main comment - that was making it harder to read too.


3) I see RelationGetBufferForTuple does this:

/*
 * This is for COPY FREEZE needs. If page is empty,
 * pin vmbuffer to set all_frozen bit
 */
if ((options & HEAP_INSERT_FROZEN) &&
(PageGetMaxOffsetNumber(BufferGetPage(buffer)) == 0))
visibilitymap_pin(relation, BufferGetBlockNumber(buffer), vmbuffer);

so is it actually possible to get into the (all_frozen_set) without
holding a pin on the visibilitymap? I haven't investigated this so
maybe there are other ways to get into that code block. But the new
additions to hio.c get the pin too.


4) In heap_xlog_multi_insert we now have this:

if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
PageClearAllVisible(page);
if (xlrec->flags & XLH_INSERT_ALL_FROZEN_SET)
PageSetAllVisible(page);

IIUC it makes no sense to have both flags at the same time, right? So
what about adding

Assert(!((xlrec->flags & XLH_INSERT_ALL_FROZEN_SET) &&
         (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)));

to check that?


5) Not sure we need to explicitly say this is for COPY FREEZE in all the
blocks added to hio.c. IMO it's sufficient to use HEAP_INSERT_FROZEN in
the condition, at this level of abstraction.


I wonder what to do about the heap_insert - I know there are concerns it
would negatively impact regular insert, but is it really true? I suppose
this optimization would be valuable even for cases where multi-insert is
not possible.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8

2020-10-29 Thread Ashutosh Sharma
Hi All,

Today while working on some other task related to database encoding, I
noticed that the MINUS SIGN (with byte sequence a1-dd) in EUC-JP is
mapped to FULLWIDTH HYPHEN-MINUS (with byte sequence ef-bc-8d) in
UTF-8. See below:

postgres=# select convert('\xa1dd', 'euc_jp', 'utf8');
 convert
--
 \xefbc8d
(1 row)

Isn't this a bug? Shouldn't this have been converted to the MINUS SIGN
(with byte sequence e2-88-92) in UTF-8 instead of FULLWIDTH
HYPHEN-MINUS SIGN.

When the MINUS SIGN (with byte sequence e2-88-92) in UTF-8 is
converted to EUC-JP, the convert functions fails with an error saying:
"character with byte sequence 0xe2 0x88 0x92 in encoding UTF8 has no
equivalent in encoding EUC_JP". See below:

postgres=# select convert('\xe28892', 'utf-8', 'euc_jp');
ERROR:  character with byte sequence 0xe2 0x88 0x92 in encoding "UTF8"
has no equivalent in encoding "EUC_JP"

However, when the same MINUS SIGN in UTF-8 is converted to SJIS
encoding, the convert function returns the correct result. See below:

postgres=# select convert('\xe28892', 'utf-8', 'sjis');
 convert
-
 \x817c
(1 row)

Please note that the byte sequence (81-7c) in SJIS represents MINUS
SIGN in SJIS which means the MINUS SIGN in UTF8 got converted to the
MINUS SIGN in SJIS and that is what we expect. Isn't it?

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com




Re: Add Information during standby recovery conflicts

2020-10-29 Thread Masahiko Sawada
,

On Thu, 29 Oct 2020 at 00:16, Fujii Masao  wrote:
>
>
>
> On 2020/10/27 9:41, Masahiko Sawada wrote:
> > On Tue, 20 Oct 2020 at 22:02, Drouvot, Bertrand  wrote:
> >>
> >> Hi,
> >>
> >> On 10/15/20 9:15 AM, Masahiko Sawada wrote:
> >>>
> >>> On Thu, 15 Oct 2020 at 14:52, Kyotaro Horiguchi  
> >>> wrote:
>  At Thu, 15 Oct 2020 14:28:57 +0900, Masahiko Sawada 
>   wrote in
> >> ereport(..(errmsg("%s", _("hogehoge" results in
> >> fprintf((translated("%s")), translate("hogehoge")).
> >>
> >> So your change (errmsg("%s", gettext_noop("hogehoge")) results in
> >>
> >> fprintf((translated("%s")), DONT_translate("hogehoge")).
> >>
> >> which leads to a translation problem.
> >>
> >> (errmsg(gettext_noop("hogehoge"))
> > This seems equivalent to (errmsg("hogehoge")), right?
>  Yes and no.  However eventually the two works the same way,
>  "(errmsg(gettext_noop("hogehoge"))" is a shorthand of
> 
>  1: char *msg = gettext_noop("hogehoge");
>  ...
>  2: .. (errmsg(msg));
> 
>  That is, the line 1 only registers a message id "hogehoge" and doesn't
>  translate. The line 2 tries to translate the content of msg and it
>  finds the translation for the message id "hogehoge".
> >>> Understood.
> >>>
> > I think I could understand translation stuff. Given we only report the
> > const string returned from get_recovery_conflict_desc() without
> > placeholders, the patch needs to use errmsg_internal() instead while
> > not changing _() part. (errmsg(get_recovery_conflict_desc())) is not
> > good (warned by -Wformat-security).
>  Ah, right. We get a complaint if no value parameters are added. We can
>  silence it by adding a dummy parameter to errmsg, but I'm not sure
>  which is preferable.
> >>> Okay, I'm going to use errmsg_internal() for now until a better idea 
> >>> comes.
> >>>
> >>> I've attached the updated patch that fixed the translation part.
> >>
> >> Thanks for reviewing and helping on this patch!
> >>
> >> The patch tester bot is currently failing due to:
> >>
> >> "proc.c:1290:5: error: ‘standbyWaitStart’ may be used uninitialized in
> >> this function [-Werror=maybe-uninitialized]"
> >>
> >> I've attached a new version with the minor change to fix it.
> >>
> >
> > Thank you for updating the patch!
> >
> > I've looked at the patch and revised a bit the formatting etc.
> >
> > After some thoughts, I think it might be better to report the waiting
> > time as well. it would help users and there is no particular reason
> > for logging the report only once. It also helps make the patch clean
> > by reducing the variables such as recovery_conflict_logged. I’ve
> > implemented it in the v8 patch.
>
> I read v8 patch. Here are review comments.

Thank you for your review.

>
> When recovery conflict with buffer pin happens, log message is output
> every deadlock_timeout. Is this intentional behavior? If yes, IMO that's
> not good because lots of messages can be output.

Agreed.

If we were to log the recovery conflict only once in the buffer pin
conflict case, we would end up logging it multiple times only in the lock
conflict case. So I guess it's better to log it only once in all conflict cases.

>
> +   if (log_recovery_conflict_waits)
> +   waitStart = GetCurrentTimestamp();
>
> What happens if log_recovery_conflict_waits is off at the beginning and
> then it's changed during waiting for the conflict? In this case, waitStart is
> zero, but log_recovery_conflict_waits is on. This may cause some problems?

Hmm, I didn't find a path that reloads the config file while waiting for
a buffer cleanup lock. Even if the startup process receives SIGHUP during
that wait, it calls HandleStartupProcInterrupts() only at the next
convenient time, such as at the beginning of the main apply loop or
inside WaitForWALToBecomeAvailable(), but I didn't see it called in
the wait loop for taking a buffer cleanup lock. However, I think it's
better, for safety, to check (waitStart > 0) rather than
log_recovery_conflict_waits when deciding whether to log the recovery conflict.
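
For illustration, here is a minimal sketch of that guard in backend style. The GUC and variable names follow the patch discussion above; the log message text is only a placeholder, not the patch's actual wording:

```c
TimestampTz waitStart = 0;

/* remember when the wait started, but only if logging was enabled then */
if (log_recovery_conflict_waits)
    waitStart = GetCurrentTimestamp();

/* ... later, inside the wait loop ... */

/*
 * Check waitStart rather than log_recovery_conflict_waits, so that turning
 * the GUC on in mid-wait cannot make us log with an uninitialized timestamp.
 */
if (waitStart > 0 &&
    TimestampDifferenceExceeds(waitStart, GetCurrentTimestamp(),
                               DeadlockTimeout))
    ereport(LOG,
            (errmsg_internal("placeholder: recovery still waiting for conflict resolution")));
```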

>
> +   if (report_waiting)
> +   ts = GetCurrentTimestamp();
>
> GetCurrentTimestamp() doesn't need to be called every cycle
> in the loop after "logged" is true and "new_status" is not NULL.

Agreed

>
> +extern const char *get_procsignal_reason_desc(ProcSignalReason reason);
>
> Is this garbage?

Yes.

>
> When log_lock_waits is enabled, both "still waiting for ..." and "acquired 
> ..."
> messages are output. Like this, when log_recovery_conflict_waits is enabled,
> not only "still waiting ..." but also "resolved ..." message should be output?
> The latter message might not need to be output if the conflict is canceled
> due to ma

Re: Boundary value check in lazy_tid_reaped()

2020-10-29 Thread Masahiko Sawada
On Tue, 1 Sep 2020 at 05:56, Peter Geoghegan  wrote:
>
> On Mon, Aug 31, 2020 at 12:22 PM Thomas Munro  wrote:
> > On Sun, Aug 30, 2020 at 11:08 PM Masahiko Sawada
> >  wrote:
> > > So my proposal is to add boundary value check in lazy_tid_reaped()
> > > before executing bsearch(3). This will help when index vacuum happens
> > > multiple times or when garbage tuples are concentrated to a narrow
> > > range.
> >
> > Makes sense if it's often out of range.
>
> I also think this is a good idea. But I also wonder if it goes far
> enough. Only one or two dead TIDs in inconvenient heap pages can
> completely defeat the optimization.
>
> A related problem with the TID list is that it is very space
> inefficient. It would be nice to fix that problem at some point. We
> could make multiple index scans by VACUUM operations much rarer if we
> tried. But how can we do all of this together?
>
> I wonder if Roaring bitmaps would work well for this:
>
> https://arxiv.org/abs/1709.07821
>
> This data structure looks like it might work well in lazy_tid_reaped()
> (for the TID array), because it offers effective bitmap compression
> without compromising on binary search speed. Of course, you'd have to
> encode TIDs as bitmaps to make use of the data structure, much like
> tidbitmap.c. I imagine that the Roaring bitmap indirection would be
> very CPU cache efficient in cases that are similar to the cases
> Sawada-san is worried about, but a bit more complicated.
>
> VACUUM would be able to make the summarizing information for the TID
> bitmap resident in CPU cache. Only this well-cached summarizing
> information (the top-level bitmap range indirection) will need to be
> accessed by most individual calls to a Roaring-bitmap-aware
> lazy_tid_reaped() that return false (i.e. calls that say "don't kill
> this TID, it's alive"). These performance characteristics seem likely
> when a certain amount of clustering of dead tuples takes place in the
> heap. I bet having completely random positions for dead TIDs almost
> never happens -- *some* clustering is practically certain to take
> place, even without natural locality in the data (which is itself very
> common). Think about how opportunistic pruning accumulates many
> LP_DEAD items in heap pages.
>
> There is a reference Roaring bitmap implementation in C, so it's
> probably not that hard to experimentally determine how well it would
> work:
>
> https://github.com/RoaringBitmap/CRoaring
>
> Lots of open source projects already use it, so there are no problems
> with patents. Seems worth investigating. (I also wonder if we could
> use it for clog.)

Very interesting.

Before getting into CRoaring bitmap, I've done some experiments with
three different methods of recording dead tuple TIDs: array,
array-minmax, and integer set.

'array' is the current method lazy vacuum uses. It just adds dead
tuple TIDs to a simple array of ItemPointerData.
'array-minmax' is the same as the 'array' method except that it checks the
min and max boundaries before bsearch'ing when deleting dead index tuples
(i.e., in lazy_tid_reaped()).
'intset' uses the integer set (src/backend/lib/integerset.c) to record
dead tuple TIDs. It's an in-memory data structure holding 64-bit
integers.
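
To make the 'array-minmax' idea concrete, here is a small standalone sketch (not PostgreSQL code). TIDs are modeled as (block, offset) pairs packed into a uint64, roughly the encoding one would also feed into integerset.c; the helper names are purely illustrative:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* pack a TID into a single comparable key; 16 bits is plenty for the
 * line pointer offset on an 8kB heap page */
static uint64_t
encode_tid(uint32_t block, uint16_t offset)
{
    return ((uint64_t) block << 16) | offset;
}

static int
cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *) a;
    uint64_t y = *(const uint64_t *) b;

    return (x > y) - (x < y);
}

/* lazy_tid_reaped()-style lookup: compare against the smallest and
 * largest dead TIDs first, and only then fall back to bsearch(3) */
static int
tid_reaped(uint64_t tid, const uint64_t *dead, size_t ndead)
{
    if (ndead == 0 || tid < dead[0] || tid > dead[ndead - 1])
        return 0;
    return bsearch(&tid, dead, ndead, sizeof(uint64_t), cmp_u64) != NULL;
}

int
main(void)
{
    /* dead TIDs must be kept in sorted order, as lazy vacuum does */
    uint64_t dead[] = {
        encode_tid(100, 1), encode_tid(100, 7), encode_tid(101, 3)
    };

    printf("%d\n", tid_reaped(encode_tid(100, 7), dead, 3));   /* 1 */
    printf("%d\n", tid_reaped(encode_tid(999, 1), dead, 3));   /* 0, no bsearch */
    return 0;
}
```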

In the experiments, I created a 4GB table and updated 20% of the total
tuples in all test cases, while changing the distribution of
where the dead tuples exist within the table. The table has only one
index, therefore I didn't use parallel vacuum. I set enough
maintenance_work_mem so that the index scan runs only once. Here are
the results, showing the "total execution time in seconds (heap scan
time/index vacuum time/table vacuum time/memory usage in MB)". The
numbers are rounded down to integers.

1. Updated evenly the table (every block has at least one dead tuple).
array : 22 (8/12/2/114)
array-minmax : 20 (8/11/2/114)
intset : 17 (7/8/1/19)

2. Updated the middle of the table.
array : 25 (6/19/0/114)
array-minmax : 17 (6/11/0/114)
intset : 17 (6/11/0/7)

3. Updated the tail of the table.
array : 25 (6/19/0/114)
array-minmax : 18 (6/11/0/114)
intset : 18 (6/11/0/5)

4. Updated both the beginning and the tail of the table.
array : 25 (6/19/0/114)
array-minmax : 23 (6/17/0/114)
intset : 21 (6/14/0/6)

The memory usage is calculated by (# of dead tuples *
sizeof(ItemPointerData)) in both ‘array’ and ‘array-minmax’ case,
although the actual amount of palloc'd memory is different, and by
intset_memory_usage() in ‘intset’ case.
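
For reference, a backend-style sketch of how the 'intset' case would record and probe dead TIDs through the integerset.c API; encode_tid() is the illustrative packing helper from the standalone sketch above, not the experiment's actual code:

```c
/* integerset.c requires members to be added in ascending order, which
 * matches the order in which lazy vacuum collects dead TIDs */
IntegerSet *deadtids = intset_create();

intset_add_member(deadtids, encode_tid(100, 1));
intset_add_member(deadtids, encode_tid(100, 7));
intset_add_member(deadtids, encode_tid(101, 3));

if (intset_is_member(deadtids, encode_tid(100, 7)))
    elog(DEBUG1, "TID is dead");

elog(DEBUG1, "intset memory: " UINT64_FORMAT " bytes",
     intset_memory_usage(deadtids));
```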

Using the integer set is very memory efficient (5MB vs. 114MB in the
base case) and there is no 1GB limitation. Looking at the execution
time, I had expected that using the integer set is slower on recording
TIDs than using the simple array but in this experiment, there is no
big difference among methods. Perhaps the result will vary if vacuum
needs to record much more dead tuple TIDs. From the results, I can see
a good improvement in the integer set case and probably a good start
but if we want to use it also for lazy vac

Re: Online checksums verification in the backend

2020-10-29 Thread Julien Rouhaud
Hi,

On Fri, Oct 30, 2020 at 2:17 AM Andres Freund  wrote:
> The code does IO while holding the buffer mapping lock. That seems
> *entirely* unacceptable to me. That basically locks 1/128 of shared
> buffers against concurrent mapping changes, while reading data that is
> likely not to be on disk?  Seriously?

The initial implementation had a different approach, reading the buffer once
without holding the buffer mapping lock (which could lead to some false
positive in some unlikely scenario), and only if a corruption is detected the
read is done once again *while holding the buffer mapping lock* to ensure it's
not a false positive.  Some benchmarking showed that the performance was worse,
so we dropped that optimisation.  Should we go back to something like that or
do you have a better way to ensure a consistent read of a buffer which isn't in
shared buffers?

> a pin is cheap. Holding the partition lock is not.

> The justification in the in-shared-buffers case seems to completely
> mis-judge costs too:
>  * Found it.  Now, retrieve its state to know what to do with 
> it, and
>  * release the pin immediately.  We do so to limit overhead 
> as much as
>  * possible.  We keep the shared LWLock on the target buffer 
> mapping
>  * partition for now, so this buffer cannot be evicted, and 
> we acquire
>  * an I/O Lock on the buffer as we may need to read its 
> contents from
>  * disk.
> a pin is cheap. Holding the partition lock is not.

I clearly did a poor job in that case.  Will fix.

> Also, using char[BLCKSZ] as a buffer isn't ok. This should use
> PGAlignedBlock:

I wasn't aware of it, I will fix.

> >   LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED);
> >   buf_state = LockBufHdr(bufdesc);
> >   UnlockBufHdr(bufdesc, buf_state);
> >
> >   /* If the page is dirty or invalid, skip it */
> >   if ((buf_state & BM_DIRTY) != 0 || (buf_state & BM_TAG_VALID) 
> > == 0)
>
> This is weird as well. What is this supposed to do? Just locking and
> unlocking a buffer header doesn't do squat? There's no guarantee that
> the flags haven't changed by this point, so you could just as well not
> acquire the buffer header lock.

This is using the same approach as e.g. WaitIO() to get the state.  I agree
that the state can change after the buffer header lock has been released, but
I think that's something out of scope.  The only guarantee that we can give is
that the database (or subset of relations checked) was healthy at the time the
check was started, provided that your cluster survives the checkpoint happening
after the check ended.  I don't see how we can do better than that.

> Also, why are pages without a valid tag ignored? I can follow the
> argument for skipping it in the DIRTY case, but that doesn't apply for
> BM_TAG_VALID?

AFAICT pages that aren't BM_TAG_VALID are newly allocated pages. Those
aren't fully initialized yet, and they'll eventually be written and flushed.

> Also, uh, I don't think the locking of the buffer table provides you
> with the full guarantees CheckBuffer() seems to assume:
>
>  * Check the state of a buffer without loading it into the shared buffers. To
>  * avoid torn pages and possible false positives when reading data, a shared
>  * LWLock is taken on the target buffer pool partition mapping, and we check
>  * if the page is in shared buffers or not.  An I/O lock is taken on the block
>  * to prevent any concurrent activity from happening.
>
> this doesn't actually prevent all concurrent write IO, unless you hold
> an appropriate lock on the relation. There's a few places that use
> smgrwrite()/smgrextend() to write out data bypassing shared buffers.
>
> Maybe that isn't a problem for the uses of CheckBuffer() is envisioned
> for, but that'd need a pretty detailed explanation as to when it's safe
> to use CheckBuffer() for which blocks.

AFAICT, concurrent smgrwrite() can only happen for init forks of unlogged
relation, during creation.  Those relations shouldn't be visible to the caller
snapshot, so it should be safe.  I can add a comment for that if I'm not
mistaken.

For concurrent smgrextend(), we read the relation size at the beginning of the
function, so we shouldn't read newly allocated blocks.  But you're right that
it's still possible to get the size that includes a newly allocated block
that can be concurrently written.  We can avoid that by holding a
LOCKTAG_RELATION_EXTEND lock when reading the relation size.  Would that be ok?
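
For what it's worth, a minimal sketch of that idea; these are the regular lmgr calls behind LOCKTAG_RELATION_EXTEND, and whether this gives CheckBuffer() a strong enough guarantee is exactly the open question:

```c
BlockNumber nblocks;

/*
 * Read the relation size while holding the extension lock, so the size
 * we see cannot include a block that a concurrent smgrextend() is still
 * in the middle of writing.
 */
LockRelationForExtension(relation, ExclusiveLock);
nblocks = RelationGetNumberOfBlocks(relation);
UnlockRelationForExtension(relation, ExclusiveLock);
```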




Re: Online verification of checksums

2020-10-29 Thread Michael Paquier
On Thu, Oct 22, 2020 at 10:41:53AM +0900, Michael Paquier wrote:
> We cannot trust the fields of the page header because these may
> have been messed up with some random corruption, so what really
> matters is that the checksums don't match, and that we can just rely
> on that.  The zero-only case of a page is different because these
> don't have a checksum set, so I would finish with something like the
> attached to make the detection more robust.  This does not make the
> detection perfect as there is no locking insurance (we really need to
> do that but v13 has been released already), but with a sufficient
> number of retries this can make things much more reliable than what's
> present.
> 
> Are there any comments?  Anybody?

So, hearing nothing, attached is a set of patches that I would like to
apply to 11~ to address the set of issues of this thread.  This comes
with two parts:
- Some refactoring of PageIsVerified(), similar to d401c57 on HEAD
except that this keeps ABI compatibility.
- The actual patch, with tweaks for each stable branch.

Playing with dd and generating random pages, this detects random
corruptions, making use of a wait/retry loop if a failure is detected.
As mentioned upthread, this is a double-edged sword: increasing the
number of retries reduces the chances of false positives, at the cost
of making regression tests longer.  This stuff uses up to 5 retries
with 100ms of sleep for each page.  (I am aware of the fact that the
commit message of the main patch is not written yet).
--
Michael
From 5413343de74a77f0257619d8599523dc7f0a01be Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Fri, 30 Oct 2020 10:18:34 +0900
Subject: [PATCH v8] Fix page verifications in base backups

---
 src/backend/replication/basebackup.c | 148 ++-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl |  20 ++-
 2 files changed, 96 insertions(+), 72 deletions(-)

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index b89df01fa7..f8d0fc041d 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1569,21 +1569,24 @@ sendFile(const char *readfilename, const char *tarfilename,
 {
 	int			fd;
 	BlockNumber blkno = 0;
-	bool		block_retry = false;
+	int			block_attempts = 0;
 	char		buf[TAR_SEND_SIZE];
-	uint16		checksum;
 	int			checksum_failures = 0;
 	off_t		cnt;
 	int			i;
 	pgoff_t		len = 0;
 	char	   *page;
 	size_t		pad;
-	PageHeader	phdr;
 	int			segmentno = 0;
 	char	   *segmentpath;
 	bool		verify_checksum = false;
 	pg_checksum_context checksum_ctx;
 
+	/* Maximum number of checksum failures reported for this file */
+#define CHECKSUM_REPORT_THRESHOLD		5
+	/* maximum number of retries done for a single page */
+#define PAGE_RETRY_THRESHOLD			5
+
 	pg_checksum_init(&checksum_ctx, manifest->checksum_type);
 
 	fd = OpenTransientFile(readfilename, O_RDONLY | PG_BINARY);
@@ -1661,9 +1664,7 @@ sendFile(const char *readfilename, const char *tarfilename,
 		if (verify_checksum && (cnt % BLCKSZ != 0))
 		{
 			ereport(WARNING,
-	(errmsg("could not verify checksum in file \"%s\", block "
-			"%d: read buffer size %d and page size %d "
-			"differ",
+	(errmsg("could not verify checksum in file \"%s\", block %u: read buffer size %d and page size %d differ",
 			readfilename, blkno, (int) cnt, BLCKSZ)));
 			verify_checksum = false;
 		}
@@ -1674,79 +1675,84 @@ sendFile(const char *readfilename, const char *tarfilename,
 			{
 page = buf + BLCKSZ * i;
 
+elog(DEBUG1, "checking block %u of file %s, attempt %d",
+	 blkno, readfilename, block_attempts);
+
 /*
- * Only check pages which have not been modified since the
- * start of the base backup. Otherwise, they might have been
- * written only halfway and the checksum would not be valid.
- * However, replaying WAL would reinstate the correct page in
- * this case. We also skip completely new pages, since they
- * don't have a checksum yet.
+ * Check on-disk pages with the same set of verification
+ * conditions used before loading pages into shared buffers.
+ * Note that PageIsVerifiedExtended() considers pages filled
+ * only with zeros as valid, since they don't have a checksum
+ * yet.  Failures are not reported to pgstat yet, as these are
+ * reported in a single batch once a file is completed.  It
+ * could be possible, that a page is written halfway while
+ * doing this check, causing a false positive.  If that
+ * happens, a page is retried multiple times, with an error
+ * reported if the second attempt also fails.
  */
-if (!PageIsNew(page) && PageGetLSN(page) < startptr)
+if (PageIsVerifiedExtended(page, blkno, 0))
 {
-	checksum = pg_checksum_page((char *) page, blkno + segmentno * RELSEG_SIZE);
-	phdr = (PageHeader) page;
-	if (phdr->pd_checksum != checksum)
+	/* success, move on to the next block */
+	blkno++;
+		

Re: [patch] Fix checksum verification in base backups for zero page headers

2020-10-29 Thread Michael Paquier
On Wed, Oct 28, 2020 at 04:43:44PM +0900, Michael Paquier wrote:
> On Wed, Oct 28, 2020 at 04:11:56PM +0900, Michael Paquier wrote:
>> Thanks.  The patch for v13 cannot use a macro, but one of the versions
>> of upthread would do just fine.

For the record, I have posted a set of patches to address all issues on
the original thread where the problem applies:
https://www.postgresql.org/message-id/20201030023028.GC1693%40paquier.xyz
--
Michael


signature.asc
Description: PGP signature


Re: contrib/sslinfo cleanup and OpenSSL errorhandling

2020-10-29 Thread Michael Paquier
On Thu, Oct 29, 2020 at 04:40:32PM -0700, Andres Freund wrote:
> There's quite a few copies of this code that look exactly the same,
> except for the be_tls_get_* call. Do you see a way to have fewer copies
> of the same code?

Each one of those code paths is working on a different sub-API aiming
at fetching a specific piece of TLS information, and the way each
sub-API does its lookup at MyProcPort is different.  One possible way
would be to move the checks on ssl_in_use into a macro for the
beginning part.  The end part could be improved by making
X509_NAME_field_to_text() and such return a text and not a Datum, and
move the null check + text-to-datum conversion into a separate macro.
I am not sure if this would be an improvement in terms of readability
though.
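
For concreteness, a hedged sketch of the first part only; the macro name is hypothetical, and whether this actually reads better than the repeated code is the open question above:

```c
/* hypothetical helper: every sslinfo function starts by returning NULL
 * when the current connection does not use SSL */
#define SSLINFO_RETURN_NULL_IF_NO_SSL() \
    do { \
        if (MyProcPort == NULL || !MyProcPort->ssl_in_use) \
            PG_RETURN_NULL(); \
    } while (0)
```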
--
Michael


signature.asc
Description: PGP signature


Re: Add statistics to pg_stat_wal view for wal related parameter tuning

2020-10-29 Thread Fujii Masao




On 2020/10/29 17:03, Masahiro Ikeda wrote:

Hi,

Thanks for your comments and advice. I updated the patch.

On 2020-10-21 18:03, Kyotaro Horiguchi wrote:

At Tue, 20 Oct 2020 16:11:29 +0900, Masahiro Ikeda
 wrote in

On 2020-10-20 12:46, Amit Kapila wrote:
> I see that we also need to add extra code to capture these stats (some
> of which is in performance-critical path especially in
> XLogInsertRecord) which again makes me a bit uncomfortable. It might
> be that it is all fine as it is very important to collect these stats
> at cluster-level in spite that the same information can be gathered at
> statement-level to help customers but I don't see a very strong case
> for that in your proposal.


We should avoid that duplication as much as possible even if both numbers
are important.


Also, about performance, I think there is little impact because it
increments stats in memory. If I can implement this to reuse pgWalUsage's
values, which already collect these stats, there is no impact in
XLogInsertRecord.
For example, how about pg_stat_wal() calculates the accumulated
value of wal_records, wal_fpi, and wal_bytes to use pgWalUsage's
value?


I don't think that works, but it would work that pgstat_send_wal()
takes the difference of that values between two successive calls.

WalUsage prevWalUsage;
...
pgstat_send_wal()
{
..
   /* fill in some values using pgWalUsage */
   WalStats.m_wal_bytes   = pgWalUsage.wal_bytes   - prevWalUsage.wal_bytes;
   WalStats.m_wal_records = pgWalUsage.wal_records - prevWalUsage.wal_records;
   WalStats.m_wal_wal_fpi = pgWalUsage.wal_fpi - prevWalUsage.wal_fpi;
...
   pgstat_send(&WalStats, sizeof(WalStats));

   /* remember the current numbers */
   prevWalUsage = pgWalUsage;


Thanks to Horiguchi-san's advice, I changed the patch to reuse pgWalUsage,
which is already defined, eliminating the extra overhead.


+   /* fill in some values using pgWalUsage */
+   WalStats.m_wal_bytes = pgWalUsage.wal_bytes - prevWalUsage.wal_bytes;
+   WalStats.m_wal_records = pgWalUsage.wal_records - 
prevWalUsage.wal_records;
+   WalStats.m_wal_fpi = pgWalUsage.wal_fpi - prevWalUsage.wal_fpi;

It's better to use WalUsageAccumDiff() here?

prevWalUsage needs to be initialized with pgWalUsage?
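
A rough sketch of what that could look like, assuming the m_wal_* fields from the patch excerpt above and that prevWalUsage gets initialized from pgWalUsage when the process starts up:

```c
static WalUsage prevWalUsage;   /* initialized from pgWalUsage at process start */

static void
send_wal_stats_sketch(void)
{
    WalUsage    walusage;

    /* walusage := pgWalUsage - prevWalUsage */
    memset(&walusage, 0, sizeof(walusage));
    WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);

    WalStats.m_wal_records = walusage.wal_records;
    WalStats.m_wal_fpi = walusage.wal_fpi;
    WalStats.m_wal_bytes = walusage.wal_bytes;

    /* remember the current totals for the next call */
    prevWalUsage = pgWalUsage;
}
```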

+   if (AmWalWriterProcess()){
+   WalStats.m_wal_write_walwriter++;
+   }
+   else
+   {
+   WalStats.m_wal_write_backend++;
+   }

I think that it's better not to separate m_wal_write_xxx into two for
walwriter and other processes. Instead, we can use one m_wal_write_xxx
counter and make pgstat_send_wal() send also the process type to
the stats collector. Then the stats collector can accumulate the counters
per process type if necessary. If we adopt this approach, we can easily
extend pg_stat_wal so that any fields can be reported per process type.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Online checksums verification in the backend

2020-10-29 Thread Andres Freund
Hi,

On 2020-10-30 10:01:08 +0800, Julien Rouhaud wrote:
> On Fri, Oct 30, 2020 at 2:17 AM Andres Freund  wrote:
> > The code does IO while holding the buffer mapping lock. That seems
> > *entirely* unacceptable to me. That basically locks 1/128 of shared
> > buffers against concurrent mapping changes, while reading data that is
> > likely not to be on disk?  Seriously?
>
> The initial implementation had a different approach, reading the buffer once
> without holding the buffer mapping lock (which could lead to some false
> positive in some unlikely scenario), and only if a corruption is detected the
> read is done once again *while holding the buffer mapping lock* to ensure it's
> not a false positive.  Some benchmarking showed that the performance was 
> worse,
> so we dropped that optimisation.  Should we go back to something like that or
> do you have a better way to ensure a consistent read of a buffer which isn't 
> in
> shared buffers?

I suspect that you're gonna need something quite different than what the
function is doing right now. Not because such a method will be faster in
isolation, but because there's a chance to have it correct and not have
a significant performance impact on the rest of the system.

I've not thought about it in detail yet. I suspect you'll need to
ensure there is a valid entry in the buffer mapping table for the buffer
you're processing. By virtue of setting BM_IO_IN_PROGRESS on that entry
you're going to prevent concurrent IO from starting until your part is
done.


> > Also, why are pages without a valid tag ignored? I can follow the
> > argument for skipping it in the DIRTY case, but that doesn't apply for
> > BM_TAG_VALID?
>
> AFAICT pages that aren't BM_TAG_VALID are pages newly allocated.
> Those shouldn't
> be entirely initialized yet, and they'll be eventually written and flushed.

When a page is being read there's a period when the buffer is without
BM_TAG_VALID. It's quite possible that the locking prevents this case
from being reachable - but in that case you shouldn't just accept it as
something to be skipped...


> > Also, uh, I don't think the locking of the buffer table provides you
> > with the full guarantees CheckBuffer() seems to assume:
> >
> >  * Check the state of a buffer without loading it into the shared buffers. 
> > To
> >  * avoid torn pages and possible false positives when reading data, a shared
> >  * LWLock is taken on the target buffer pool partition mapping, and we check
> >  * if the page is in shared buffers or not.  An I/O lock is taken on the 
> > block
> >  * to prevent any concurrent activity from happening.
> >
> > this doesn't actually prevent all concurrent write IO, unless you hold
> > an appropriate lock on the relation. There's a few places that use
> > smgrwrite()/smgrextend() to write out data bypassing shared buffers.
> >
> > Maybe that isn't a problem for the uses of CheckBuffer() is envisioned
> > for, but that'd need a pretty detailed explanation as to when it's safe
> > to use CheckBuffer() for which blocks.
>
> AFAICT, concurrent smgrwrite() can only happen for init forks of unlogged
> relation, during creation.

That may be the case right in core right now, but for one, there
definitely are extensions going through smgrwrite() without using the
buffer pool. Essentially, what you are saying is that the introduction
of CheckBuffer() altered what smgrwrite() is allowed to be used for,
without having discussed or documented that.

Before this an AM/extension could just use smgrwrite() to write data not
in shared buffers, as long as a locking scheme is used that prevents
multiple backends from doing that at the same time (trivially:
AccessExclusiveLock).


> Those relations shouldn't be visible to the caller
> snapshot, so it should be safe.  I can add a comment for that if I'm not
> mistaken.

There's no comment warning that you shouldn't use CheckBuffer() to check
every buffer in shared buffers, or every relfilenode on disk. The latter
would be quite a reasonable thing, given it'd avoid needing to connect
to every database etc.


> For concurrent smgrextend(), we read the relation size at the beginning of the
> function, so we shouldn't read newly allocated blocks.  But you're right that
> it's still possible to get the size that includes a newly allocated block
> that can be concurrently written.  We can avoid that be holding a
> LOCKTAG_RELATION_EXTEND lock when reading the relation size.  Would that be 
> ok?

That could possibly work - but currently CheckBuffer() doesn't get a
relation, nor are the comments explaining that it has to be a relation
in the current database or anything.

I hadn't yet looked at the caller - I just started looking at
CheckBuffer() this because it caused compilation failures after rebasing
my aio branch onto master (there's no IO locks anymore).



Looking at the caller:
- This is not a concurrency safe pattern:

/* Check if relation exists. leaving if there is no such relation *

Re: Add statistics to pg_stat_wal view for wal related parameter tuning

2020-10-29 Thread Fujii Masao




On 2020/10/20 11:31, Masahiro Ikeda wrote:

Hi,

I think we need to add some statistics to pg_stat_wal view.

Although there are some parameter related WAL,
there are few statistics for tuning them.

I think it's better to provide the following statistics.
Please let me know your comments.

```
postgres=# SELECT * from pg_stat_wal;
-[ RECORD 1 ]---+--
wal_records | 2000224
wal_fpi | 47
wal_bytes   | 248216337
wal_buffers_full    | 20954
wal_init_file   | 8
wal_write_backend   | 20960
wal_write_walwriter | 46
wal_write_time  | 51
wal_sync_backend    | 7
wal_sync_walwriter  | 8
wal_sync_time   | 0
stats_reset | 2020-10-20 11:04:51.307771+09
```

1. Basic statistics of WAL activity

- wal_records: Total number of WAL records generated
- wal_fpi: Total number of WAL full page images generated
- wal_bytes: Total amount of WAL bytes generated

To understand DB's performance, first, we will check the performance
trends for the entire database instance.
For example, if the number of wal_fpi becomes higher, users may tune
"wal_compression", "checkpoint_timeout" and so on.

Although users can check the above statistics via EXPLAIN, auto_explain,
autovacuum and pg_stat_statements now,
if users want to see the performance trends  for the entire database,
they must recalculate the statistics.

I think it is useful to add the sum of the basic statistics.


2.  WAL segment file creation

- wal_init_file: Total number of WAL segment files created.

To create a new WAL file may have an impact on the performance of
a write-heavy workload generating lots of WAL. If this number is reported high,
to reduce the number of this initialization, we can tune WAL-related parameters
so that more "recycled" WAL files can be held.



3. Number of when WAL is flushed

- wal_write_backend : Total number of WAL data written to the disk by backends
- wal_write_walwriter : Total number of WAL data written to the disk by 
walwriter
- wal_sync_backend : Total number of WAL data synced to the disk by backends
- wal_sync_walwriter : Total number of WAL data synced to the disk by walwrite

I think it's useful for tuning "synchronous_commit" and "commit_delay" for 
query executions.
If the number of WAL flushes is high, users can know that "synchronous_commit" is
useful for the workload.


I just wonder how useful these counters are. Even without these counters,
we already know synchronous_commit=off is likely to cause better
performance (but has the risk of data loss). So ISTM that these counters are
not so useful when tuning synchronous_commit.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8

2020-10-29 Thread Amit Langote
On Fri, Oct 30, 2020 at 9:44 AM Ashutosh Sharma  wrote:
>
> Hi All,
>
> Today while working on some other task related to database encoding, I
> noticed that the MINUS SIGN (with byte sequence a1-dd) in EUC-JP is
> mapped to FULLWIDTH HYPHEN-MINUS (with byte sequence ef-bc-8d) in
> UTF-8. See below:
>
> postgres=# select convert('\xa1dd', 'euc_jp', 'utf8');
>  convert
> --
>  \xefbc8d
> (1 row)
>
> Isn't this a bug? Shouldn't this have been converted to the MINUS SIGN
> (with byte sequence e2-88-92) in UTF-8 instead of FULLWIDTH
> HYPHEN-MINUS SIGN.
>
> When the MINUS SIGN (with byte sequence e2-88-92) in UTF-8 is
> converted to EUC-JP, the convert functions fails with an error saying:
> "character with byte sequence 0xe2 0x88 0x92 in encoding UTF8 has no
> equivalent in encoding EUC_JP". See below:
>
> postgres=# select convert('\xe28892', 'utf-8', 'euc_jp');
> ERROR:  character with byte sequence 0xe2 0x88 0x92 in encoding "UTF8"
> has no equivalent in encoding "EUC_JP"
>
> However, when the same MINUS SIGN in UTF-8 is converted to SJIS
> encoding, the convert function returns the correct result. See below:
>
> postgres=# select convert('\xe28892', 'utf-8', 'sjis');
>  convert
> -
>  \x817c
> (1 row)
>
> Please note that the byte sequence (81-7c) in SJIS represents MINUS
> SIGN in SJIS which means the MINUS SIGN in UTF8 got converted to the
> MINUS SIGN in SJIS and that is what we expect. Isn't it?

So we have

a1dd in euc_jp,
817c in sjis,
efbc8d in utf-8

that convert between each other just fine.

But when it comes to

e28892 in utf-8

it currently only converts to sjis and that too just one way:

select convert('\xe28892', 'utf-8', 'sjis');
 convert
-
 \x817c
(1 row)

select convert('\x817c', 'sjis', 'utf-8');
 convert
--
 \xefbc8d
(1 row)

I noticed that the commit a8bd7e1c6e02 from ages ago removed
conversions from and to utf-8's e28892, in favor of efbc8d, and that
change has stuck.  (Note though that these maps looked pretty
different back then.)

--- a/src/backend/utils/mb/Unicode/euc_jp_to_utf8.map
+++ b/src/backend/utils/mb/Unicode/euc_jp_to_utf8.map
-  {0xa1dd, 0xe28892},
+  {0xa1dd, 0xefbc8d},

--- a/src/backend/utils/mb/Unicode/utf8_to_euc_jp.map
+++ b/src/backend/utils/mb/Unicode/utf8_to_euc_jp.map
-  {0xe28892, 0xa1dd},
+  {0xefbc8d, 0xa1dd},

Can't tell what reason there was to do that, but there must have been
some.  Maybe the Japanese character sets prefer full-width hyphen
minus (unicode U+FF0D) over mathematical minus sign (U+2212)?

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8

2020-10-29 Thread Kyotaro Horiguchi
Hello.

At Fri, 30 Oct 2020 06:13:53 +0530, Ashutosh Sharma  
wrote in 
> Hi All,
> 
> Today while working on some other task related to database encoding, I
> noticed that the MINUS SIGN (with byte sequence a1-dd) in EUC-JP is
> mapped to FULLWIDTH HYPHEN-MINUS (with byte sequence ef-bc-8d) in
> UTF-8. See below:
> 
> postgres=# select convert('\xa1dd', 'euc_jp', 'utf8');
>  convert
> --
>  \xefbc8d
> (1 row)
> 
> Isn't this a bug? Shouldn't this have been converted to the MINUS SIGN
> (with byte sequence e2-88-92) in UTF-8 instead of FULLWIDTH
> HYPHEN-MINUS SIGN.

No it's not a bug, but a well-known "design":(

The mapping is generated from CP932.TXT and JIS0212.TXT by
UCS_to_EUC_JP.pl.

The CP932.TXT used here is available at:

https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT

CP932.TXT maps 0x817C(SJIS) = 0xA1DD(EUC-JP) as follows.

0x817C  0xFF0D  #FULLWIDTH HYPHEN-MINUS

> When the MINUS SIGN (with byte sequence e2-88-92) in UTF-8 is
> converted to EUC-JP, the convert functions fails with an error saying:
> "character with byte sequence 0xe2 0x88 0x92 in encoding UTF8 has no
> equivalent in encoding EUC_JP". See below:
>
> postgres=# select convert('\xe28892', 'utf-8', 'euc_jp');
> ERROR:  character with byte sequence 0xe2 0x88 0x92 in encoding "UTF8"
> has no equivalent in encoding "EUC_JP"

U+FF0D (ef bc 8d) is mapped to 0xa1dd@euc-jp.
U+2212 (e2 88 92) doesn't have any mapping to or from euc-jp.

> However, when the same MINUS SIGN in UTF-8 is converted to SJIS
> encoding, the convert function returns the correct result. See below:
> 
> postgres=# select convert('\xe28892', 'utf-8', 'sjis');
>  convert
> -
>  \x817c
> (1 row)

It was added manually in UCS_to_SJIS.pl. I'm not sure about the reason,
but maybe because it was widely used.

So ping-pong between Unicode and SJIS behaves like this:

U+2212 => 0x817c@sjis => U+ff0d => 0x817c@sjis ...

> Please note that the byte sequence (81-7c) in SJIS represents MINUS
> SIGN in SJIS which means the MINUS SIGN in UTF8 got converted to the
> MINUS SIGN in SJIS and that is what we expect. Isn't it?

I don't think we should change the authoritative mappings, but maybe we can
add some one-way conversions for convenience.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8

2020-10-29 Thread Tom Lane
Amit Langote  writes:
> On Fri, Oct 30, 2020 at 9:44 AM Ashutosh Sharma  wrote:
>> Today while working on some other task related to database encoding, I
>> noticed that the MINUS SIGN (with byte sequence a1-dd) in EUC-JP is
>> mapped to FULLWIDTH HYPHEN-MINUS (with byte sequence ef-bc-8d) in
>> UTF-8. See below:
>> ...
>> Isn't this a bug?

> Can't tell what reason there was to do that, but there must have been
> some.  Maybe the Japanese character sets prefer full-width hyphen
> minus (unicode U+FF0D) over mathematical minus sign (U+2212)?

The way it's been explained to me in the past is that the conversion
between Unicode and the various Japanese encodings is not as well
defined as one could wish, because there are multiple quasi-standard
versions of the Japanese encodings.  So we shouldn't move too hastily
on changing this.  Maybe it's really a bug, but maybe there are good
reasons.

regards, tom lane




Re: Add Information during standby recovery conflicts

2020-10-29 Thread Fujii Masao




On 2020/10/30 10:29, Masahiko Sawada wrote:

,

On Thu, 29 Oct 2020 at 00:16, Fujii Masao  wrote:




On 2020/10/27 9:41, Masahiko Sawada wrote:

On Tue, 20 Oct 2020 at 22:02, Drouvot, Bertrand  wrote:


Hi,

On 10/15/20 9:15 AM, Masahiko Sawada wrote:


On Thu, 15 Oct 2020 at 14:52, Kyotaro Horiguchi  wrote:

At Thu, 15 Oct 2020 14:28:57 +0900, Masahiko Sawada 
 wrote in

ereport(..(errmsg("%s", _("hogehoge" results in
fprintf((translated("%s")), translate("hogehoge")).

So your change (errmsg("%s", gettext_noop("hogehoge")) results in

fprintf((translated("%s")), DONT_translate("hogehoge")).

which leads to a translation problem.

(errmsg(gettext_noop("hogehoge"))

This seems equivalent to (errmsg("hogehoge")), right?

Yes and no.  However eventually the two works the same way,
"(errmsg(gettext_noop("hogehoge"))" is a shorthand of

1: char *msg = gettext_noop("hogehoge");
...
2: .. (errmsg(msg));

That is, the line 1 only registers a message id "hogehoge" and doesn't
translate. The line 2 tries to translate the content of msg and it
finds the translation for the message id "hogehoge".

Understood.


I think I could understand translation stuff. Given we only report the
const string returned from get_recovery_conflict_desc() without
placeholders, the patch needs to use errmsg_internal() instead while
not changing _() part. (errmsg(get_recovery_conflict_desc())) is not
good (warned by -Wformat-security).

Ah, right. We get a complaint if no value parameters are added. We can
silence it by adding a dummy parameter to errmsg, but I'm not sure
which is preferable.

Okay, I'm going to use errmsg_internal() for now until a better idea comes.

I've attached the updated patch that fixed the translation part.


Thanks for reviewing and helping on this patch!

The patch tester bot is currently failing due to:

"proc.c:1290:5: error: ‘standbyWaitStart’ may be used uninitialized in
this function [-Werror=maybe-uninitialized]"

I've attached a new version with the minor change to fix it.



Thank you for updating the patch!

I've looked at the patch and revised a bit the formatting etc.

After some thoughts, I think it might be better to report the waiting
time as well. it would help users and there is no particular reason
for logging the report only once. It also helps make the patch clean
by reducing the variables such as recovery_conflict_logged. I’ve
implemented it in the v8 patch.


I read v8 patch. Here are review comments.


Thank you for your review.



When recovery conflict with buffer pin happens, log message is output
every deadlock_timeout. Is this intentional behavior? If yes, IMO that's
not good because lots of messages can be output.


Agreed.

if we were to log the recovery conflict only once in bufferpin
conflict case, we would log it multiple times only in lock conflict
case. So I guess it's better to do that in all conflict cases.


Yes, I agree that this behavior basically should be consistent between all 
cases.





+   if (log_recovery_conflict_waits)
+   waitStart = GetCurrentTimestamp();

What happens if log_recovery_conflict_waits is off at the beginning and
then it's changed during waiting for the conflict? In this case, waitStart is
zero, but log_recovery_conflict_waits is on. This may cause some problems?


Hmm, I didn't see a path that happens to reload the config file during
waiting for buffer cleanup lock. Even if the startup process receives
SIGHUP during that, it calls HandleStartupProcInterrupts() at the next
convenient time. It could be the beginning of main apply loop or
inside WaitForWALToBecomeAvailable() and so on but I didn’t see it in
the wait loop for taking a buffer cleanup.


Yes, you're right. I seem to have read the code wrongly.


However, I think it’s
better to use (waitStart > 0) for safety when checking if we log the
recovery conflict instead of log_recovery_conflict_waits.



+   if (report_waiting)
+   ts = GetCurrentTimestamp();

GetCurrentTimestamp() doesn't need to be called every cycle
in the loop after "logged" is true and "new_status" is not NULL.


Agreed



+extern const char *get_procsignal_reason_desc(ProcSignalReason reason);

Is this garbage?


Yes.



When log_lock_waits is enabled, both "still waiting for ..." and "acquired ..."
messages are output. Like this, when log_recovery_conflict_waits is enabled,
not only "still waiting ..." but also "resolved ..." message should be output?
The latter message might not need to be output if the conflict is canceled
due to max_standby_xxx_delay parameter. The latter message would be
useful to know how long the recovery was waiting for the conflict. Thought?
It's ok to implement this as a separate patch later, though.


There was a discussion that the latter message without wait

Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8

2020-10-29 Thread Kyotaro Horiguchi
At Fri, 30 Oct 2020 12:08:51 +0900, Amit Langote  
wrote in 
> I noticed that the commit a8bd7e1c6e02 from ages ago removed
> conversions from and to utf-8's e28892, in favor of efbc8d, and that
> change has stuck.  (Note though that these maps looked pretty
> different back then.)
> 
> --- a/src/backend/utils/mb/Unicode/euc_jp_to_utf8.map
> +++ b/src/backend/utils/mb/Unicode/euc_jp_to_utf8.map
> -  {0xa1dd, 0xe28892},
> +  {0xa1dd, 0xefbc8d},
> 
> --- a/src/backend/utils/mb/Unicode/utf8_to_euc_jp.map
> +++ b/src/backend/utils/mb/Unicode/utf8_to_euc_jp.map
> -  {0xe28892, 0xa1dd},
> +  {0xefbc8d, 0xa1dd},
> 
> Can't tell what reason there was to do that, but there must have been
> some.  Maybe the Japanese character sets prefer full-width hyphen
> minus (unicode U+FF0D) over mathematical minus sign (U+2212)?

It's a decision made by Microsoft.  Several other characters have
similar issues. I remember many people complained, but in the end that
wasn't "fixed", which led to the well-known messes of Japanese
character conversion involving Unicode in Java.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Enumize logical replication message actions

2020-10-29 Thread Amit Kapila
On Fri, Oct 23, 2020 at 6:26 PM Ashutosh Bapat
 wrote:
>
>
>
> On Fri, 23 Oct 2020 at 18:23, Amit Kapila  wrote:
>>
>> On Fri, Oct 23, 2020 at 11:50 AM Kyotaro Horiguchi
>>  wrote:
>> >
>> > At Thu, 22 Oct 2020 22:31:41 -0300, Alvaro Herrera 
>> >  wrote in
>> > > On 2020-Oct-22, Ashutosh Bapat wrote:
>> > >
>> > > > On Thu, 22 Oct 2020 at 14:46, Kyotaro Horiguchi 
>> > > > 
>> > > > wrote:
>> > >
>> > > > > pg_send_logicalrep_msg_type() looks somewhat too-much.  If we need
>> > > > > something like that we shouldn't do this refactoring, I think.
>> > > >
>> > > > Enum is an integer, and we want to send byte. The function asserts 
>> > > > that the
>> > > > enum fits a byte. If there's a way to declare byte long enums I would 
>> > > > use
>> > > > that. But I didn't find a way to do that.
>> > >
>> > > I didn't look at the code, but maybe it's sufficient to add a
>> > > StaticAssert?
>> >
>> > That check needs to visit all symbols in a enum and confirm that each
>> > of them is in a certain range.
>> >
>>
>> Can we define something like LOGICAL_REP_MSG_LAST (also add a comment
>> indicating this is a fake message and must be the last one) as the
>> last and just check that?
>>
>
> I don't think that's required once I applied suggestions from Kyotaro and 
> Peter. Please check the latest patch.
> Usually LAST is added to an enum when we need to cap the number of symbols or 
> want to find the number of symbols. None of that is necessary here. Do you 
> see any other use?
>

You mentioned in the beginning that you prefer to use an enum instead of
defines so that switch statements can detect any unhandled items. However,
I tried adding an extra enum value at the end without handling it in the
switch, and I didn't get any compilation warning or error. Do we
need something else to detect that at compile time?
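
For what it's worth, a standalone illustration (not PostgreSQL code) of when the compiler can help: with -Wall, which enables -Wswitch, gcc and clang warn about an unhandled enum value only if the switch's controlling expression has the enum type and there is no default label. If the dispatch code still switches on the raw char read from the message, that would explain why no warning appears:

```c
typedef enum DemoMsgType        /* demo enum, not the real one */
{
    DEMO_MSG_BEGIN = 'B',
    DEMO_MSG_COMMIT = 'C',
    DEMO_MSG_INSERT = 'I'
} DemoMsgType;

static int
dispatch_enum(DemoMsgType t)
{
    switch (t)                  /* -Wswitch: DEMO_MSG_INSERT not handled */
    {
        case DEMO_MSG_BEGIN:
            return 0;
        case DEMO_MSG_COMMIT:
            return 1;
    }
    return -1;
}

static int
dispatch_char(char c)
{
    switch (c)                  /* plain char: no exhaustiveness warning */
    {
        case DEMO_MSG_BEGIN:
            return 0;
    }
    return -1;
}

int
main(void)
{
    return dispatch_enum(DEMO_MSG_BEGIN) + dispatch_char('C');
}
```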

Some comments assuming we want to use enum either because I am missing
something or due to some other reason we have not discussed yet.

1.
+ LOGICAL_REP_MSG_STREAM_ABORT = 'A',
+} LogicalRepMsgType;

There is no need for a comma after the last message.

2.
+/*
+ * Logical message types
+ *
+ * Used by logical replication wire protocol.
+ *
+ * Note: though this is an enum it should fit a single byte and should be a
+ * printable character.
+ */
+typedef enum
+{

I think we can expand the comments to probably say why we need these
to fit in a single byte or what problem it can cause if that rule is
disobeyed. This is to make it clear to the next person why we are imposing
such a rule.

3.
+typedef enum
+{
..
+} LogicalRepMsgType;

There are places in the code where we use the enum name
(LogicalRepMsgType) both at the start and at the end; see TypeCat,
CoercionMethod, CoercionCodes, etc. I also see places that use the style
you have in the code, but I would prefer the style used at places
like TypeCat as that makes it easier to read.
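
Something along these lines, repeating the tag at both ends; only the two message values that actually appear in this thread are shown, the rest are elided:

```c
typedef enum LogicalRepMsgType
{
    LOGICAL_REP_MSG_BEGIN = 'B',
    /* ... other message types elided ... */
    LOGICAL_REP_MSG_STREAM_ABORT = 'A'
} LogicalRepMsgType;
```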

4.
  switch (action)
  {
- /* BEGIN */
- case 'B':
+ case LOGICAL_REP_MSG_BEGIN:
  apply_handle_begin(s);
- break;
- /* COMMIT */
- case 'C':
+ return;

I think we can simply use 'return apply_handle_begin(s);' instead of
adding the return on a separate line. Again, I think we changed this handling
in apply_dispatch() to improve the case where we can detect any missing enum
value at compile time, but at this stage it is not clear to me
if that is true.

-- 
With Regards,
Amit Kapila.




Re: document pg_settings view doesn't display custom options

2020-10-29 Thread Fujii Masao




On 2020/10/29 21:54, John Naylor wrote:



On Wed, Oct 28, 2020 at 11:38 PM Fujii Masao  wrote:



On 2020/10/29 3:45, John Naylor wrote:
 > On Wed, Oct 28, 2020 at 2:15 PM John Naylor  wrote:
 >
 >     Starting separate threads to keep from cluttering the TODO list 
thread.
 >
 >     Here's a patch for the subject, as mentioned in
 > https://www.postgresql.org/message-id/20201027220555.GS4951%40momjian.us
 >
 >
 > I just realized I introduced a typo, so here's v2.

+   The pg_settings view does not display
+   customized options.

This is true until the module that defines the customized options is loaded,
but not after that. No? For example, pg_settings displays
pg_stat_statements.max after pg_stat_statements is loaded.


True, how about this:

    The pg_settings view does not display customized options
    that have been set before the relevant extension module has been loaded.


I guess that someone can misread this as

customized options that have been set before the relevant extension
module has been loaded are not displayed even after the module is loaded.

So what about the following, instead?

The pg_settings view does not display customized options until the extension
module that defines them has been loaded.

Also I think this note should be in a different paragraph from the paragraph
about "The pg_settings view cannot be inserted into or deleted from" because
they are different topics. Thought?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Online checksums verification in the backend

2020-10-29 Thread Julien Rouhaud
On Fri, Oct 30, 2020 at 10:58 AM Andres Freund  wrote:
>
> Hi,
>
> On 2020-10-30 10:01:08 +0800, Julien Rouhaud wrote:
> > On Fri, Oct 30, 2020 at 2:17 AM Andres Freund  wrote:
> > > The code does IO while holding the buffer mapping lock. That seems
> > > *entirely* unacceptable to me. That basically locks 1/128 of shared
> > > buffers against concurrent mapping changes, while reading data that is
> > > likely not to be on disk?  Seriously?
> >
> > The initial implementation had a different approach, reading the buffer once
> > without holding the buffer mapping lock (which could lead to some false
> > positive in some unlikely scenario), and only if a corruption is detected 
> > the
> > read is done once again *while holding the buffer mapping lock* to ensure 
> > it's
> > not a false positive.  Some benchmarking showed that the performance was 
> > worse,
> > so we dropped that optimisation.  Should we go back to something like that 
> > or
> > do you have a better way to ensure a consistent read of a buffer which 
> > isn't in
> > shared buffers?
>
> I suspect that you're gonna need something quite different than what the
> function is doing right now. Not because such a method will be faster in
> isolation, but because there's a chance to have it correct and not have
> a significant performance impact onto the rest of the system.
>
> I've not thought about it in detail yet. Is suspect you'll need to
> ensure there is a valid entry in the buffer mapping table for the buffer
> you're processing. By virtue of setting BM_IO_IN_PROGRESS on that entry
> you're going to prevent concurrent IO from starting until your part is
> done.

So I'm assuming that the previous optimization, which avoided doing IO
while holding a buffer mapping lock almost every time, isn't an option?
In that case, I don't see any other option than reverting the patch
and discussing a new approach.




Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8

2020-10-29 Thread Tatsuo Ishii
> Hi All,
> 
> Today while working on some other task related to database encoding, I
> noticed that the MINUS SIGN (with byte sequence a1-dd) in EUC-JP is
> mapped to FULLWIDTH HYPHEN-MINUS (with byte sequence ef-bc-8d) in
> UTF-8. See below:
> 
> postgres=# select convert('\xa1dd', 'euc_jp', 'utf8');
>  convert
> --
>  \xefbc8d
> (1 row)
> 
> Isn't this a bug? Shouldn't this have been converted to the MINUS SIGN
> (with byte sequence e2-88-92) in UTF-8 instead of FULLWIDTH
> HYPHEN-MINUS SIGN.

Yeah. Originally EUC_JP 0xa1dd was converted to UTF8 0xe28892. At some
point, someone changed the mapping and now you see it.

> When the MINUS SIGN (with byte sequence e2-88-92) in UTF-8 is
> converted to EUC-JP, the convert functions fails with an error saying:
> "character with byte sequence 0xe2 0x88 0x92 in encoding UTF8 has no
> equivalent in encoding EUC_JP". See below:
> 
> postgres=# select convert('\xe28892', 'utf-8', 'euc_jp');
> ERROR:  character with byte sequence 0xe2 0x88 0x92 in encoding "UTF8"
> has no equivalent in encoding "EUC_JP"

Again, originally UTF8 0xe28892 was converted to EUC_JP 0xa1dd. At
some point, someone changed the mapping.

> However, when the same MINUS SIGN in UTF-8 is converted to SJIS
> encoding, the convert function returns the correct result. See below:
> 
> postgres=# select convert('\xe28892', 'utf-8', 'sjis');
>  convert
> -
>  \x817c
> (1 row)
> 
> Please note that the byte sequence (81-7c) in SJIS represents MINUS
> SIGN in SJIS which means the MINUS SIGN in UTF8 got converted to the
> MINUS SIGN in SJIS and that is what we expect. Isn't it?

Agreed.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8

2020-10-29 Thread Tatsuo Ishii
> The mapping is generated from CP932.TXT and JIS0212.TXT by
> UCS_to_EUC_JP.pl.

I still don't understand why this change has been made. Originally the
conversion was based on JIS0208.txt, JIS0212.txt and JIS0201.txt,
which together are the exact definition of EUC-JP. CP932.txt is defined by
Microsoft for their products.

Probably we should call our "EUC-JP" something like "EUC-JP-MS" or
whatever to differentiate from true EUC-JP.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




empty table blocks hash antijoin

2020-10-29 Thread Pavel Stehule
Hi

I am checking a customer query where there are some empty tables in a
nontrivial query. The fixed estimation for empty tables in Postgres works
very well, but I found another issue.

create table test_a(id int);
create table test_b(id int);
insert into test_a select generate_series(1,10);
analyze test_a, test_b;

With zero rows in test_b, the Postgres optimizer uses a nested loop:

postgres=# explain analyze select * from test_a where not exists(select *
from test_b where test_a.id=test_b.id);
┌───┐
│QUERY PLAN
│
╞═══╡
│ Nested Loop Anti Join  (cost=0.00..2693.00 rows=9 width=4) (actual
time=0.024..90.530 rows=10 loops=1)│
│   Join Filter: (test_a.id = test_b.id)
 │
│   ->  Seq Scan on test_a  (cost=0.00..1443.00 rows=10 width=4)
(actual time=0.017..8.171 rows=10 loops=1) │
│   ->  Seq Scan on test_b  (cost=0.00..0.00 rows=1 width=4) (actual
time=0.000..0.000 rows=0 loops=10) │
│ Planning Time: 0.153 ms
│
│ Execution Time: 94.331 ms
│
└───┘
(6 rows)

But if I add one fake row to test_b, I get a hash anti join:

insert into test_b values(-1);
analyze test_b;

postgres=# explain analyze select * from test_a where not exists(select *
from test_b where test_a.id=test_b.id);
┌───┐
│QUERY PLAN
│
╞═══╡
│ Hash Anti Join  (cost=1.02..2706.51 rows=9 width=4) (actual
time=0.026..24.474 rows=10 loops=1)   │
│   Hash Cond: (test_a.id = test_b.id)
 │
│   ->  Seq Scan on test_a  (cost=0.00..1443.00 rows=10 width=4)
(actual time=0.010..8.522 rows=10 loops=1) │
│   ->  Hash  (cost=1.01..1.01 rows=1 width=4) (actual time=0.008..0.010
rows=1 loops=1)│
│ Buckets: 1024  Batches: 1  Memory Usage: 9kB
 │
│ ->  Seq Scan on test_b  (cost=0.00..1.01 rows=1 width=4) (actual
time=0.003..0.004 rows=1 loops=1)│
│ Planning Time: 0.186 ms
│
│ Execution Time: 28.334 ms
│
└───┘
(8 rows)

Now the query is almost 3 times faster. This is probably a costing issue,
because the costs are very similar. With the fake row I get a better plan.
But when I disable hashjoin I get a more expensive but still better plan too:
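
(Presumably the plan below was obtained after something like the following;
the SET itself is not shown in the original message.)

```sql
set enable_hashjoin = off;
```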

postgres=# explain analyze select * from test_a where not exists(select *
from test_b where test_a.id=test_b.id);
┌───┐
│QUERY PLAN
│
╞═══╡
│ Nested Loop Anti Join  (cost=0.00..2944.01 rows=9 width=4) (actual
time=0.100..47.360 rows=10 loops=1)│
│   Join Filter: (test_a.id = test_b.id)
 │
│   Rows Removed by Join Filter: 10
│
│   ->  Seq Scan on test_a  (cost=0.00..1443.00 rows=10 width=4)
(actual time=0.019..8.586 rows=10 loops=1) │
│   ->  Materialize  (cost=0.00..1.01 rows=1 width=4) (actual
time=0.000..0.000 rows=1 loops=10)│
│ ->  Seq Scan on test_b  (cost=0.00..1.01 rows=1 width=4) (actual
time=0.006..0.008 rows=1 loops=1)│
│ Planning Time: 0.176 ms
│
│ Execution Time: 51.248 ms
│
└───┘
(8 rows)

With the empty table, the Materialize node alone helps by about 50%.


Re: Disable WAL logging to speed up data loading

2020-10-29 Thread Masahiko Sawada
On Fri, 30 Oct 2020 at 05:00, Fujii Masao  wrote:
>
>
>
> On 2020/10/29 19:21, Laurenz Albe wrote:
> > On Thu, 2020-10-29 at 11:42 +0900, Fujii Masao wrote:
> >>> But what if someone sets wal_level=none, performs some data modifications,
> >>> sets wal_level=archive and after some more processing decides to restore 
> >>> from
> >>> a backup that was taken before the cluster was set to wal_level=none?
> >>> Then they would end up with a corrupted database, right?
> >>
> >> I think that the guard to prevent the server from starting up from
> >> the corrupted database in that scenario is necessary.
> >
> > That wouldn't apply, I think, because the backup from which you start
> > was taken with wal_level = replica, so the guard wouldn't alert.
> >
> > But as I said in the other thread, changing wal_level emits a WAL
> > record, and I am sure that recovery will refuse to proceed if
> > wal_level < replica.
>
> Yes. What I meant was such a safe guard needs to be implemented.
>
> This may mean that if we want to recover the database from that backup,
> we need to specify the recovery target so that the archive recovery stops
> just before the WAL record indicating wal_level change.

Yeah, it also means that setting wal_level to none makes the previous
backups useless, even if the user has several generations of them.

Would it make things simpler if the use of wal_level = 'none' were
limited to, for example, initial data loading? I mean, we could add a
special flag to initdb that sets wal_level to 'none' after
initialization; the user then does the initial data loading and sets
wal_level to >= minimal. That is, we allow users to change from none to
>= minimal but not the reverse. Since we prevent the database cluster
from being backed up while wal_level is none, recovery never crosses a
wal_level = none period. I'm not sure this idea addresses the case
Osumi-san was concerned about, though.
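
A hedged sketch of the proposed workflow (wal_level = 'none' and the one-way
restriction are part of the proposal, not existing behavior; the table is
made up):

```sql
-- cluster assumed to have been initialized with the proposed wal_level = none
create table big_table (id int, payload text);
insert into big_table
  select g, repeat('x', 100) from generate_series(1, 1000000) g;  -- no WAL written
alter system set wal_level = replica;  -- allowed direction: none -> >= minimal
-- under the proposal, setting wal_level back to none would be rejected
```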

Regards,

-- 
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8

2020-10-29 Thread Ashutosh Sharma
On Fri, Oct 30, 2020 at 8:49 AM Kyotaro Horiguchi
 wrote:
>
> Hello.
>
> At Fri, 30 Oct 2020 06:13:53 +0530, Ashutosh Sharma  
> wrote in
> > Hi All,
> >
> > Today while working on some other task related to database encoding, I
> > noticed that the MINUS SIGN (with byte sequence a1-dd) in EUC-JP is
> > mapped to FULLWIDTH HYPHEN-MINUS (with byte sequence ef-bc-8d) in
> > UTF-8. See below:
> >
> > postgres=# select convert('\xa1dd', 'euc_jp', 'utf8');
> >  convert
> > --
> >  \xefbc8d
> > (1 row)
> >
> > Isn't this a bug? Shouldn't this have been converted to the MINUS SIGN
> > (with byte sequence e2-88-92) in UTF-8 instead of FULLWIDTH
> > HYPHEN-MINUS SIGN.
>
> No it's not a bug, but a well-known "design":(
>
> The mapping is generated from CP932.TXT and JIS0212.TXT by
> UCS_to_EUC_JP.pl.
>
> CP932.TXT used here is here.
>
> https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
>
> CP932.TXT maps 0x817C(SJIS) = 0xA1DD(EUC-JP) as follows.
>
> 0x817C  0xFF0D  #FULLWIDTH HYPHEN-MINUS
>

We do have MINUS SIGN (U+2212) defined in both the UTF-8 and EUC-JP
encodings. So I am not sure why converting MINUS SIGN from UTF-8 to
EUC-JP should throw an error saying: "... in encoding UTF8 has *no*
equivalent in EUC_JP". I mean, this information looks misleading, and
that's the reason I feel it's a bug.
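
A hedged illustration of the asymmetry, based on the convert() calls and the
mappings mentioned in this thread:

```sql
-- EUC-JP 0xa1dd round-trips through UTF-8, but only via U+FF0D:
select convert(convert('\xa1dd', 'euc_jp', 'utf8'), 'utf8', 'euc_jp');  -- \xa1dd
-- starting from U+2212 in UTF-8 there is no EUC-JP mapping, so this errors:
select convert('\xe28892', 'utf8', 'euc_jp');
```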

> > When the MINUS SIGN (with byte sequence e2-88-92) in UTF-8 is
> > converted to EUC-JP, the convert functions fails with an error saying:
> > "character with byte sequence 0xe2 0x88 0x92 in encoding UTF8 has no
> > equivalent in encoding EUC_JP". See below:
> >
> > postgres=# select convert('\xe28892', 'utf-8', 'euc_jp');
> > ERROR:  character with byte sequence 0xe2 0x88 0x92 in encoding "UTF8"
> > has no equivalent in encoding "EUC_JP"
>
> U+FF0D(ef bc 8d) is mapped to 0xa1dd@euc-jp
> U+2212(e2 88 92) doesn't have a mapping to euc-jp.
>
> > However, when the same MINUS SIGN in UTF-8 is converted to SJIS
> > encoding, the convert function returns the correct result. See below:
> >
> > postgres=# select convert('\xe28892', 'utf-8', 'sjis');
> >  convert
> > -
> >  \x817c
> > (1 row)
>
> It is manually added by UCS_to_SJIS.pl. I'm not sure about the reason
> but maybe because it was used widely.
>
> So ping-pong between Unicode and SJIS behaves like this:
>
> U+2212 => 0x817c@sjis => U+ff0d => 0x817c@sjis ...
>
> > Please note that the byte sequence (81-7c) in SJIS represents MINUS
> > SIGN in SJIS which means the MINUS SIGN in UTF8 got converted to the
> > MINUS SIGN in SJIS and that is what we expect. Isn't it?
>
> I think we don't change authoritative mappings, but maybe can add some
> one-way conversions for the convenience.
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center




Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8

2020-10-29 Thread Kyotaro Horiguchi
At Fri, 30 Oct 2020 13:17:08 +0900 (JST), Tatsuo Ishii  
wrote in 
> > The mapping is generated from CP932.TXT and JIS0212.TXT by
> > UCS_to_EUC_JP.pl.
> 
> I still don't understand why this change has been made. Originally the
> conversion was based on JIS0208.txt, JIS0212.txt and JIS0201.txt,
> which is the exact definition of EUC-JP. CP932.txt is defined by
> Microsoft for their products.
> 
> Probably we should call our "EUC-JP" something like "EUC-JP-MS" or
> whatever to differentiate from true EUC-JP.

Seems valid.  Things were already that way at the time aeed17d was
introduced (I believe it didn't make any difference to the conversions),
and the change was made by a8bd7e1c6e in 2002.

I'm not sure of the point of the change, though.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: should INSERT SELECT use a BulkInsertState?

2020-10-29 Thread Justin Pryzby
On Thu, Oct 22, 2020 at 01:29:53PM +0100, Simon Riggs wrote:
> On Fri, 16 Oct 2020 at 22:05, Justin Pryzby  wrote:
> 
> > > > I made this conditional on BEGIN BULK/SET bulk, so I'll solicit 
> > > > comments on that.
> 
> I think it would be better if this was self-tuning. So that we don't
> allocate a bulkinsert state until we've done say 100 (?) rows
> inserted.

I made it an optional, non-default behavior in response to the legitimate
concern about a performance regression in the cases where a loader needs to
be as fast as possible - as opposed to our case, where we would rather
optimize for our reports by making the loaders responsible for their own
writes, instead of leaving behind many dirty pages and clobbering the cache.

Also, INSERT SELECT doesn't immediately help us (telsasoft), since we use
INSERT .. VALUES () .. ON CONFLICT.  This would handle that case, which is
great, even though that wasn't a design goal.  It could also be an integer GUC
to allow configuring the size of the ring buffer.
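
A hedged usage sketch (the GUC name follows the variable declared in the
attached patch, insert_in_bulk; the tables are made up):

```sql
create table src (id int, txt text);
create table dst (id int, txt text);
insert into src select g, 'txt' || g from generate_series(1, 1000000) g;
set insert_in_bulk = on;            -- use a BulkInsertState ring buffer
insert into dst select * from src;  -- INSERT SELECT now avoids clobbering the cache
```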

> You should also use table_multi_insert() since that will give further
> performance gains by reducing block access overheads. Switching from
> single row to multi-row should also only happen once we've loaded a
> few rows, so we don't introduce overheads for smaller SQL statements.

Good idea...multi_insert (which reduces the overhead of individual inserts) is
mostly independent of BulkInsertState (which uses a ring buffer to avoid
dirtying the cache).  I made this 0002.

This makes INSERT SELECT several times faster, and it no longer clobbers the cache either.

Time: 4700.606 ms (00:04.701)
   123 |  1
37 |  2
20 |  3
11 |  4
  4537 |  5
 11656 |   

Time: 1125.302 ms (00:01.125)
  2171 |  1
37 |  2
20 |  3
11 |  4
   111 |  5
 14034 |   
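
The two-column output above looks like a shared-buffer usagecount
distribution; presumably it was produced with something like the following
(a guess, since the query is not shown in the original message):

```sql
-- requires: create extension pg_buffercache;
select count(*), usagecount
  from pg_buffercache
 group by usagecount
 order by usagecount;
```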

When enabled, this passes nearly all regression tests, and all but 2 of the
changes are easily understood.  The 2nd patch still needs work.

-- 
Justin
>From 16057608bd58f54a5e365433ded18757aca8ec48 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Fri, 8 May 2020 02:17:32 -0500
Subject: [PATCH v5 1/2] Allow INSERT SELECT to use a BulkInsertState

---
 src/backend/executor/nodeModifyTable.c | 22 --
 src/backend/parser/gram.y  |  7 ++-
 src/backend/tcop/utility.c |  4 
 src/backend/utils/misc/guc.c   | 11 +++
 src/include/executor/nodeModifyTable.h |  2 ++
 src/include/nodes/execnodes.h  |  3 +++
 src/include/parser/kwlist.h|  1 +
 src/test/regress/expected/insert.out   | 23 +++
 src/test/regress/sql/insert.sql| 13 +
 9 files changed, 83 insertions(+), 3 deletions(-)

diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 29e07b7228..26ff964105 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -72,6 +72,8 @@ static TupleTableSlot *ExecPrepareTupleRouting(ModifyTableState *mtstate,
 			   ResultRelInfo *targetRelInfo,
 			   TupleTableSlot *slot,
 			   ResultRelInfo **partRelInfo);
+/* guc */
+bool insert_in_bulk = false;
 
 /*
  * Verify that the tuples to be produced by INSERT or UPDATE match the
@@ -594,7 +596,7 @@ ExecInsert(ModifyTableState *mtstate,
 			table_tuple_insert_speculative(resultRelationDesc, slot,
 		   estate->es_output_cid,
 		   0,
-		   NULL,
+		   NULL, /* Bulk insert not supported */
 		   specToken);
 
 			/* insert index entries for tuple */
@@ -631,10 +633,17 @@ ExecInsert(ModifyTableState *mtstate,
 		}
 		else
 		{
+			if (proute && mtstate->prevResultRelInfo != resultRelInfo)
+			{
+				if (mtstate->bistate)
+					ReleaseBulkInsertStatePin(mtstate->bistate);
+				mtstate->prevResultRelInfo = resultRelInfo;
+			}
+
 			/* insert the tuple normally */
 			table_tuple_insert(resultRelationDesc, slot,
 			   estate->es_output_cid,
-			   0, NULL);
+			   0, mtstate->bistate);
 
 			/* insert index entries for tuple */
 			if (resultRelInfo->ri_NumIndices > 0)
@@ -2232,6 +2241,9 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 
 	mtstate->mt_arowmarks = (List **) palloc0(sizeof(List *) * nplans);
 	mtstate->mt_nplans = nplans;
+	mtstate->bistate = NULL;
+	if (operation == CMD_INSERT && insert_in_bulk)
+		mtstate->bistate = GetBulkInsertState();
 
 	/* set up epqstate with dummy subplan data for the moment */
 	EvalPlanQualInit(&mtstate->mt_epqstate, estate, NULL, NIL, node->epqParam);
@@ -2698,6 +2710,12 @@ ExecEndModifyTable(ModifyTableState *node)
 		   resultRelInfo);
 	}
 
+	if (node->bistate)
+	{
+		FreeBulkInsertState(node->bistate);
+		table_finish_bulk_insert(node->rootResultRelInfo->ri_RelationDesc, 0);
+	}
+
 	/*
 	 * Close all the partitioned tables, leaf partitions, and their indices
 	 *

Re: Enumize logical replication message actions

2020-10-29 Thread Peter Smith
On Fri, Oct 30, 2020 at 2:46 PM Amit Kapila  wrote:

Hi Amit

> You mentioned in the beginning that you prefer to use Enum instead of
> define so that switch cases can detect any remaining items but I have
> tried adding extra enum value at the end and didn't handle that in
> switch case but I didn't get any compilation warning or error. Do we
> need something else to detect that at compile time?

See [1] for some GCC compiler options that can expose missing cases like those.

e.g. use -Wswitch or -Wswitch-enum
Detection depends on whether the switch has a default case or not.

> 4.
>   switch (action)
>   {
> - /* BEGIN */
> - case 'B':
> + case LOGICAL_REP_MSG_BEGIN:
>   apply_handle_begin(s);
> - break;
> - /* COMMIT */
> - case 'C':
> + return;
>
> I think we can simply use 'return apply_handle_begin;' instead of
> adding return in another line. Again, I think we changed this handling
> in apply_dispatch() to improve the case where we can detect at the
> compile time any missing enum but at this stage it is not clear to me
> if that is true.

IIUC getting rid of the default from the switch can make the missing
enum detection easier because then you can use -Wswitch option to
expose the problem (instead of -Wswitch-enum which may give lots of
false positives as well)

===

[1]  https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Wswitch

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Online checksums verification in the backend

2020-10-29 Thread Andres Freund
On 2020-10-30 11:58:13 +0800, Julien Rouhaud wrote:
> So I'm assuming that the previous optimization to avoid almost every
> time doing an IO while holding a buffer mapping lock isn't an option?
> In that case, I don't see any other option than reverting the patch
> and discussing a new approach.

I think it's pretty much *never* OK to do IO while holding a buffer
mapping lock. You're locking a significant fraction of shared buffers
over IO. That's just not OK.  I don't think there's any place doing so
currently either.




Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8

2020-10-29 Thread Amit Langote
On Fri, Oct 30, 2020 at 12:20 PM Kyotaro Horiguchi
 wrote:
> At Fri, 30 Oct 2020 06:13:53 +0530, Ashutosh Sharma  
> wrote in
> > However, when the same MINUS SIGN in UTF-8 is converted to SJIS
> > encoding, the convert function returns the correct result. See below:
> >
> > postgres=# select convert('\xe28892', 'utf-8', 'sjis');
> >  convert
> > -
> >  \x817c
> > (1 row)
>
> It is manually added by UCS_to_SJIS.pl. I'm not sure about the reason
> but maybe because it was used widely.
>
> So ping-pong between Unicode and SJIS behaves like this:
>
> U+2212 => 0x817c@sjis => U+ff0d => 0x817c@sjis ...

Is it the following piece of code in UCS_to_SJIS.pl that manually adds
the mapping?

# Add these UTF8->SJIS pairs to the table.
push @$mapping,
...
{
direction => FROM_UNICODE,
ucs   => 0x2212,
code  => 0x817c,
comment   => '# MINUS SIGN',
f => $this_script,
l => __LINE__
},

Given that U+2212 is encoded by e28892 in utf8, I assume that's how
utf8_to_sjis.map ends up with the following mapping into sjis for that
byte sequence:

  /*** Three byte table, leaf: e288xx - offset 0x004ee ***/

  /* 80 */  0x81cd, 0x, 0x81dd, 0x81ce, 0x, 0x, 0x, 0x81de,
  /* 88 */  0x81b8, 0x, 0x, 0x81b9, 0x, 0x, 0x, 0x,
  /* 90 */  0x, 0x8794, "0x817c", ...

> > Please note that the byte sequence (81-7c) in SJIS represents MINUS
> > SIGN in SJIS which means the MINUS SIGN in UTF8 got converted to the
> > MINUS SIGN in SJIS and that is what we expect. Isn't it?
>
> I think we don't change authoritative mappings, but maybe can add some
> one-way conversions for the convenience.

Maybe UCS_to_EUC_JP.pl could do something like the above.

Are there other cases that were fixed like this in the past, either
for euc_jp or sjis?

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: [PoC] Non-volatile WAL buffer

2020-10-29 Thread Takashi Menjo
Hi Heikki,

> I had a new look at this thread today, trying to figure out where we are.
> I'm a bit confused.
>
> One thing we have established: mmap()ing WAL files performs worse than
> the current method, if pg_wal is not on
> a persistent memory device. This is because the kernel faults in existing
> content of each page, even though we're
> overwriting everything.
Yes. In addition, after a certain page (in the sense of OS page) is
msync()ed, another page fault will occur again when something is stored
into that page.

> That's unfortunate. I was hoping that mmap() would be a good option even
> without persistent memory hardware.
> I wish we could tell the kernel to zero the pages instead of reading them
> from the file. Maybe clear the file with
> ftruncate() before mmapping it?
The area extended by ftruncate() appears as if it were zero-filled [1].
Please note that it merely "appears as if": it might not actually be
zero-filled as data blocks on the device, so pre-allocating files should
improve transaction performance. At least on Linux 5.7 and ext4, it takes
more time to store into a mapped file that was just open(O_CREAT)ed and
ftruncate()d than into one whose blocks have actually been written already.

> That should not be problem with a real persistent memory device, however
> (or when emulating it with DRAM). With
> DAX, the storage is memory-mapped directly and there is no page cache,
> and no pre-faulting.
Yes, with filesystem DAX there is no page cache for file data. A page
fault still occurs, but only once per 2MiB DAX hugepage, so its overhead
is lower than with 4KiB page faults. Such a DAX hugepage fault applies
only to DAX-mapped files and is different from a general transparent
hugepage fault.

> Because of that, I'm baffled by what the
> v4-0002-Non-volatile-WAL-buffer.patch does. If I understand it
> correctly, it puts the WAL buffers in a separate file, which is stored on
> the NVRAM. Why? I realize that this is just
> a Proof of Concept, but I'm very much not interested in anything that
> requires the DBA to manage a second WAL
> location. Did you test the mmap() patches with persistent memory
> hardware? Did you compare that with the pmem
> patchset, on the same hardware? If there's a meaningful performance
> difference between the two, what's causing
> it?
Yes, this patchset puts the WAL buffers into the file specified by
"nvwal_path" in postgresql.conf.

The reason this patchset puts the buffers into a separate file, rather than
the existing segment files in PGDATA/pg_wal, is that doing so reduces the
overhead of system calls such as open(), mmap(), munmap(), and close(). It
open()s and mmap()s the "nvwal_path" file once and keeps it mapped while
running. By contrast, with the patchset that mmap()s the segment files, a
backend process has to munmap() and close() the currently mapped file and
open() and mmap() the next one every time its insert location crosses a
segment boundary. This causes the performance difference between the two.

Best regards,
Takashi

[1]
https://pubs.opengroup.org/onlinepubs/9699919799/functions/ftruncate.html

-- 
Takashi Menjo 


Re: Parallel INSERT (INTO ... SELECT ...)

2020-10-29 Thread Amit Kapila
On Fri, Oct 30, 2020 at 6:09 AM Greg Nancarrow  wrote:
>
> On Tue, Oct 27, 2020 at 8:56 PM Amit Kapila  wrote:
> >
> > IIUC, below is code for this workaround:
> >
> > +MaxRelParallelHazardForModify(Oid relid,
> > + CmdType commandType,
> > + max_parallel_hazard_context *context)
> > +{
> > + Relationrel;
> > + TupleDesc tupdesc;
> > + int attnum;
> > +
> > + LOCKMODE lockmode = AccessShareLock;
> > +
> > + /*
> > + * It's possible that this relation is locked for exclusive access
> > + * in another concurrent transaction (e.g. as a result of a
> > + * ALTER TABLE ... operation) until that transaction completes.
> > + * If a share-lock can't be acquired on it now, we have to assume this
> > + * could be the worst-case, so to avoid blocking here until that
> > + * transaction completes, conditionally try to acquire the lock and
> > + * assume and return UNSAFE on failure.
> > + */
> > + if (ConditionalLockRelationOid(relid, lockmode))
> > + {
> > + rel = table_open(relid, NoLock);
> > + }
> > + else
> > + {
> > + context->max_hazard = PROPARALLEL_UNSAFE;
> > + return context->max_hazard;
> > + }
> >
> > Do we need this workaround if we lock just the parent table instead of
> > locking all the tables? Basically, can we safely identify the
> > parallel-safety of partitioned relation if we just have a lock on
> > parent relation?
>
> I believe the workaround is still needed in this case, because the
> workaround was added because of a test in which the parent table was
> exclusively locked in another concurrent transaction (as a result of
> ALTER TABLE ... ATTACH PARTITION ...) so we could not even get a
> ShareLock on the parent table without hanging (and then ending up
> failing the test because of it).
>

Don't you think the test case design is flawed in that case? Even a
simple "select * from tpart;" will hang in the planner while taking a
share lock (the code flow is:
add_other_rels_to_query->expand_inherited_rtentry->expand_partitioned_rtentry)
once you take an exclusive lock on the table in a concurrent session;
see the sketch below. Currently we never need to acquire any lock for
INSERTs in the planner, but I am not sure we can design a test case
based on that assumption, as we can see it fail in this basic case.
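
A hedged sketch of the blocking described above (object names are made up,
and an explicit LOCK TABLE stands in for the ALTER TABLE ... ATTACH PARTITION
of the original test, purely to make the lock conflict explicit):

```sql
create table tpart (a int) partition by range (a);
create table tpart_1 partition of tpart for values from (0) to (100);

-- session 1
begin;
lock table tpart in access exclusive mode;

-- session 2: blocks in the planner while acquiring AccessShareLock on tpart
select * from tpart;
```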


> So at the moment the workaround is needed, even if just trying to lock
> the parent table.
>

I am not convinced; rather, I think that the test case is not well
designed, unless there is some other way (without taking a share lock on
the relation) to determine the parallel-safety of INSERTs that neither of
us has thought of. I understand that you don't want to change that test
case as part of this patch, so you are using this workaround.

> I'll do some more testing to determine the secondary issue of whether
> locks on the partition tables are needed, but at the moment I believe
> they are.
>

Fair enough, but let's determine that by some testing and analysis. I
feel we should even add a comment if we need to lock all partition
tables. I see that we are already doing it for SELECT in the
above-mentioned code path, so maybe it is okay to do so for INSERTs as well.

> >One more thing I have noticed is that for scan
> > relations (Select query), we do such checks much later based on
> > RelOptInfo (see set_rel_consider_parallel) which seems to have most of
> > the information required to perform parallel-safety checks but I guess
> > for ModifyTable (aka the Insert table) the equivalent doesn't seem
> > feasible but have you thought of doing at the later stage in planner?
> >
>
> Yes, and in fact I tried putting the checks in a later stage of the
> planner, and it's almost successful, except it then makes setting
> "parallelModeNeeded" very tricky indeed, because that is expected to
> be set based on whether the SQL is safe to run in parallel mode
> (parallelModeOK == true) and whether force_parallel_mode is not off.
> With parallel safety checks delayed to a later stage in the planner,
> it's then not known whether there are certain types of parallel-unsafe
> INSERTs (such as INSERT INTO ... VALUES ... ON CONFLICT DO UPDATE
> ...), because processing for those doesn't reach those later stages of
> the planner where parallelism is being considered.
>

I guess if that is the only case then you can keep that check in the
earlier stage of the planner (we should be able to do that as the
information is present in the Query) and do the other checks in the later
stage. However, I guess that is not the only case; we also need to
determine the parallel-safety of index expressions, trigger functions if
any, any CHECK expressions on each attribute, etc. (an illustration
follows below).
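
A hedged illustration of the kinds of objects whose parallel-safety such
checks would have to examine (object names are made up):

```sql
create function f_unsafe(text) returns text
  immutable parallel unsafe language sql as $$ select upper($1) $$;

create table t (a int check (a > 0),         -- CHECK expression to examine
                b text);
create index t_expr_idx on t (f_unsafe(b));  -- index expression calling a
                                             -- PARALLEL UNSAFE function

-- an INSERT ... SELECT into t could not use parallel workers because of the
-- parallel-unsafe index expression above
insert into t select g, g::text from generate_series(1, 10) g;
```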

> So then to avoid
> errors from when parallel-mode is forced on and such unsafe INSERTs
> are run, the only real choice is to only allow parallelModeNeeded to
> be true for SELECT only (not INSERT), and this is kind of cheating and
> also not picking up cases where parallel-safe INSERT is run but
> invokes parallel-mode-unsafe features.
> My conclusion, at least for the moment, is to leave the che

Re: partition routing layering in nodeModifyTable.c

2020-10-29 Thread Amit Langote
On Wed, Oct 28, 2020 at 4:46 PM Amit Langote  wrote:
>
> On Tue, Oct 27, 2020 at 10:23 PM Heikki Linnakangas  wrote:
> > This patch looks reasonable to me at a quick glance. I'm a bit worried
> > or unhappy about the impact on FDWs, though. It doesn't seem nice that
> > the ResultRelInfo is not available in the BeginDirectModify call. It's
> > not too bad, the FDW can call ExecGetResultRelation() if it needs it,
> > but still. Perhaps it would be better to delay calling
> > BeginDirectModify() until the first modification is performed, to avoid
> > any initialization overhead there, like establishing the connection in
> > postgres_fdw.
>
> Ah, calling BeginDirectModify() itself lazily sounds like a good idea;
> see attached updated 0001 to see how that looks.  While updating that
> patch, I realized that the ForeignScan.resultRelation that we
> introduced in 178f2d560d will now be totally useless. :-(

Given that we've closed the CF entry for this thread and given that
there seems to be enough context to these patches, I will move these
patches back to their original thread, that is:

* ModifyTable overheads in generic plans *
https://www.postgresql.org/message-id/flat/CA%2BHiwqE4k1Q2TLmCAvekw%2B8_NXepbnfUOamOeX%3DKpHRDTfSKxA%40mail.gmail.com

That will also make the CF-bot happy.

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: ModifyTable overheads in generic plans

2020-10-29 Thread Amit Langote
Attached updated patches based on recent the discussion at:

* Re: partition routing layering in nodeModifyTable.c *
https://www.postgresql.org/message-id/CA%2BHiwqHpmMjenQqNpMHrhg3DRhqqQfby2RCT1HWVwMin3_5vMA%40mail.gmail.com

0001 adjusts how ForeignScanState.resultRelInfo is initialized for use
by direct modify operations.

0002 refactors ResultRelInfo initialization to be done lazily on first use.

I call these v6, because the last version posted on this thread was
v5, even though it went through a couple of iterations on the above
thread. Sorry about the confusion.

-- 
Amit Langote
EDB: http://www.enterprisedb.com


v6-0001-Call-BeginDirectModify-from-ExecInitModifyTable.patch
Description: Binary data


v6-0002-Initialize-result-relation-information-lazily.patch
Description: Binary data


Re: Enumize logical replication message actions

2020-10-29 Thread Amit Kapila
On Fri, Oct 30, 2020 at 10:37 AM Peter Smith  wrote:
>
> On Fri, Oct 30, 2020 at 2:46 PM Amit Kapila  wrote:
>
> Hi Amit
>
> > You mentioned in the beginning that you prefer to use Enum instead of
> > define so that switch cases can detect any remaining items but I have
> > tried adding extra enum value at the end and didn't handle that in
> > switch case but I didn't get any compilation warning or error. Do we
> > need something else to detect that at compile time?
>
> See [1] some GCC compiler options that can expose missing cases like those.
>

Thanks, I am able to see the warnings now.

> e.g. use -Wswitch or -Wswitch-enum
> Detection depends if the switch has a default case or not.
>
> > 4.
> >   switch (action)
> >   {
> > - /* BEGIN */
> > - case 'B':
> > + case LOGICAL_REP_MSG_BEGIN:
> >   apply_handle_begin(s);
> > - break;
> > - /* COMMIT */
> > - case 'C':
> > + return;
> >
> > I think we can simply use 'return apply_handle_begin;' instead of
> > adding return in another line. Again, I think we changed this handling
> > in apply_dispatch() to improve the case where we can detect at the
> > compile time any missing enum but at this stage it is not clear to me
> > if that is true.
>
> IIUC getting rid of the default from the switch can make the missing
> enum detection easier because then you can use -Wswitch option to
> expose the problem (instead of -Wswitch-enum which may give lots of
> false positives as well)
>

Fair enough. So, it makes sense to move the default out of the switch case.

Ashutosh, see if we can add a comment (or maybe a note in the commit
message) explaining why we preferred to use an enum for these messages.

-- 
With Regards,
Amit Kapila.




Re: Add important info about ANALYZE after create Functional Index

2020-10-29 Thread Michael Paquier
On Thu, Oct 29, 2020 at 10:59:52AM +0900, Michael Paquier wrote:
> REINDEX CONCURRENTLY is by design wanted to provide an experience
> transparent to the user similar to what a plain REINDEX would do, at
> least that's the idea behind it, so..  This qualifies as a bug to me,
> in spirit.

And in spirit, it is possible to address this issue with the attached
patch, which copies the set of stats from the old index to the new one.
For a non-concurrent REINDEX this does not happen, because we keep the
same base relation, whereas a concurrent operation completely replaces
the relation.  We have RemoveStatistics() in heap.c, but I did not
really see the point of inventing a copy flavor for this particular
case.  Perhaps others have an opinion on that?
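
For reference, a minimal way to observe the statistics loss this patch
addresses might look like this (a hedged sketch, close to the regression test
below; object names are made up):

```sql
create table t (a text);
insert into t select md5(g::text) from generate_series(1, 10000) g;
create index t_expr_idx on t (lower(a));
analyze t;
select count(*) from pg_statistic
  where starelid = 't_expr_idx'::regclass;  -- 1, stats on the index expression
reindex index concurrently t_expr_idx;
select count(*) from pg_statistic
  where starelid = 't_expr_idx'::regclass;  -- 0 without this patch
```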
--
Michael
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 0974f3e23a..3820a03fb6 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -49,6 +49,7 @@
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_opclass.h"
 #include "catalog/pg_operator.h"
+#include "catalog/pg_statistic.h"
 #include "catalog/pg_tablespace.h"
 #include "catalog/pg_trigger.h"
 #include "catalog/pg_type.h"
@@ -1711,6 +1712,44 @@ index_concurrently_swap(Oid newIndexId, Oid oldIndexId, const char *oldName)
 		}
 	}
 
+	/*
+	 * Copy over any data of pg_statistic from the old index to the new one.
+	 */
+	{
+		HeapTuple   tup;
+		SysScanDesc scan;
+		ScanKeyData key[1];
+		Relation	statrel;
+
+		statrel = table_open(StatisticRelationId, RowExclusiveLock);
+		ScanKeyInit(&key[0],
+	Anum_pg_statistic_starelid,
+	BTEqualStrategyNumber, F_OIDEQ,
+	ObjectIdGetDatum(oldIndexId));
+
+		scan = systable_beginscan(statrel, StatisticRelidAttnumInhIndexId,
+  true, NULL, 1, key);
+
+		while (HeapTupleIsValid((tup = systable_getnext(scan))))
+		{
+			Form_pg_statistic statform;
+
+			/* make a modifiable copy */
+			tup = heap_copytuple(tup);
+			statform = (Form_pg_statistic) GETSTRUCT(tup);
+
+			/* update the copy of the tuple and insert it */
+			statform->starelid = newIndexId;
+			CatalogTupleInsert(statrel, tup);
+
+			heap_freetuple(tup);
+		}
+
+		systable_endscan(scan);
+
+		table_close(statrel, RowExclusiveLock);
+	}
+
 	/* Close relations */
 	table_close(pg_class, RowExclusiveLock);
 	table_close(pg_index, RowExclusiveLock);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 6ace7662ee..012c1eb067 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -2551,6 +2551,17 @@ CREATE UNIQUE INDEX concur_exprs_index_pred ON concur_exprs_tab (c1)
 CREATE UNIQUE INDEX concur_exprs_index_pred_2
   ON concur_exprs_tab ((1 / c1))
   WHERE ('-H') >= (c2::TEXT) COLLATE "C";
+ANALYZE concur_exprs_tab;
+SELECT starelid::regclass, count(*) FROM pg_statistic WHERE starelid IN (
+  'concur_exprs_index_expr'::regclass,
+  'concur_exprs_index_pred'::regclass,
+  'concur_exprs_index_pred_2'::regclass)
+  GROUP BY starelid ORDER BY starelid::regclass::text;
+starelid | count 
+-+---
+ concur_exprs_index_expr | 1
+(1 row)
+
 SELECT pg_get_indexdef('concur_exprs_index_expr'::regclass);
 pg_get_indexdef
 ---
@@ -2608,6 +2619,17 @@ SELECT pg_get_indexdef('concur_exprs_index_pred_2'::regclass);
  CREATE UNIQUE INDEX concur_exprs_index_pred_2 ON public.concur_exprs_tab USING btree (((1 / c1))) WHERE ('-H'::text >= (c2 COLLATE "C"))
 (1 row)
 
+-- Statistics should remain intact.
+SELECT starelid::regclass, count(*) FROM pg_statistic WHERE starelid IN (
+  'concur_exprs_index_expr'::regclass,
+  'concur_exprs_index_pred'::regclass,
+  'concur_exprs_index_pred_2'::regclass)
+  GROUP BY starelid ORDER BY starelid::regclass::text;
+starelid | count 
+-+---
+ concur_exprs_index_expr | 1
+(1 row)
+
 DROP TABLE concur_exprs_tab;
 -- Temporary tables and on-commit actions, where CONCURRENTLY is ignored.
 -- ON COMMIT PRESERVE ROWS, the default.
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 37f7259da9..4e45b18613 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1079,6 +1079,12 @@ CREATE UNIQUE INDEX concur_exprs_index_pred ON concur_exprs_tab (c1)
 CREATE UNIQUE INDEX concur_exprs_index_pred_2
   ON concur_exprs_tab ((1 / c1))
   WHERE ('-H') >= (c2::TEXT) COLLATE "C";
+ANALYZE concur_exprs_tab;
+SELECT starelid::regclass, count(*) FROM pg_statistic WHERE starelid IN (
+  'concur_exprs_index_expr'::regclass,
+  'concur_exprs_index_pred'::regclass,
+  'concur_exprs_index_pred_2'::regclass)
+  GROUP BY starelid ORDER BY starelid::regclass::text;
 SELECT pg_get_