Re: Support ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax

2021-01-27 Thread Dilip Kumar
On Tue, Jan 26, 2021 at 9:18 AM japin  wrote:
>
>
> Hi, hackers
>
> When I read the discussion in [1], I found that updating a subscription's
> publications is complicated.
>
> For example, I have 5 publications in subscription.
>
> CREATE SUBSCRIPTION mysub1 CONNECTION 'host=localhost port=5432 
> dbname=postgres'
> PUBLICATION mypub1, mypub2, mypub3, mypub4, mypub5;
>
> If I want to drop "mypub4", we should use the following command:
>
> ALTER SUBSCRIPTION mysub1 SET PUBLICATION mypub1, mypub2, mypub3, mypub5;
>
> Also, if I want to add "mypub7" and "mypub8", it will use:
>
> ALTER SUBSCRIPTION mysub1 SET PUBLICATION mypub1, mypub2, mypub3, mypub5, 
> mypub7, mypub8;
>
> Attached is a patch implementing ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax;
> for the above two cases, we can use the following:
>
> ALTER SUBSCRIPTION mysub1 DROP PUBLICATION mypub4;
>
> ALTER SUBSCRIPTION mysub1 ADD PUBLICATION mypub7, mypub8;
>
> I think it's more convenient. Any thoughts?

+1 for the idea

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH] remove pg_standby

2021-01-27 Thread Fujii Masao




On 2021/01/27 14:32, Thomas Munro wrote:

On Wed, Jan 27, 2021 at 6:06 PM Michael Paquier  wrote:

On Wed, Jan 27, 2021 at 04:13:24PM +1300, Thomas Munro wrote:

I would like to commit this, because "waiting restore commands" have
confusing interactions with my proposed prefetching-during-recovery
patch[1].  Here's a version that fixes an error when building the docs
(there was a stray remaining tag), and adds a
commit message.  Any objections?


I agree with this direction (i.e., remove pg_standby). BTW, last month when I
gave a talk about the possible retirement of pg_standby at PostgreSQL Unconference
Tokyo, no one in the audience complained about it.

But one question: shouldn't we follow the "usual" way of retiring a feature
instead of dropping it immediately? That is, mark pg_standby as obsolete, announce
that it will be dropped after several releases, and then drop it. This seems safer
because there might still be some users. While it's marked as obsolete, the WAL
prefetch feature may not work with pg_standby, but we can live with that because
it's obsolete.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Parallel Inserts in CREATE TABLE AS

2021-01-27 Thread Bharath Rupireddy
On Wed, Jan 27, 2021 at 1:25 PM Tang, Haiying
 wrote:
> I chose 5 cases which pick a parallel insert plan in CTAS to measure the
> patched performance. Each case was run 30 times.
>
> Most of the test executions become faster with this patch.
>
> However, Test NO 4 (create table xxx as table xxx.) shows a performance
> degradation. I tested various table sizes (2/10/20 million rows), and they
> all show a 6%-10% decline. I think this problem needs some investigation.
>
>
>
> Below are my test results. 'Test NO' corresponds to 'Test NO' in the attached
> test_ctas.sql file.
>
> reg% = (patched-master)/master
>
> Test NO | Test Case                      | reg% | patched(ms) | master(ms)
> --------|--------------------------------|------|-------------|------------
> 1       | CTAS select from table         | -9%  | 16709.50477 | 18370.76660
> 2       | Append plan                    | -14% | 16542.97807 | 19305.86600
> 3       | initial plan under Gather node | -5%  | 13374.27187 | 14120.02633
> 4       | CTAS table                     | 10%  | 20835.48800 | 18986.40350
> 5       | CTAS select from execute       | -6%  | 16973.73890 | 18008.59789
>
>
>
> About Test NO 4:
>
> In master (HEAD), this test case picks a serial seq scan.
>
> The query plan looks like:
>
> --
> Seq Scan on public.tenk1  (cost=0.00..444828.12 rows=1012 width=244) (actual time=0.005..1675.268 rows=1000 loops=1)
>   Output: unique1, unique2, two, four, ten, twenty, hundred, thousand, twothousand, fivethous, tenthous, odd, even, stringu1, stringu2, string4
> Planning Time: 0.053 ms
> Execution Time: 20165.023 ms
>
> With this patch, it will choose parallel seq scan and parallel insert.
>
> The query plan looks like:
>
> --
> Gather  (cost=1000.00..370828.03 rows=1012 width=244) (actual time=20428.823..20437.143 rows=0 loops=1)
>   Output: unique1, unique2, two, four, ten, twenty, hundred, thousand, twothousand, fivethous, tenthous, odd, even, stringu1, stringu2, string4
>   Workers Planned: 4
>   Workers Launched: 4
>   ->  Create test
>     ->  Parallel Seq Scan on public.tenk1  (cost=0.00..369828.03 rows=253 width=244) (actual time=0.021..411.094 rows=200 loops=5)
>           Output: unique1, unique2, two, four, ten, twenty, hundred, thousand, twothousand, fivethous, tenthous, odd, even, stringu1, stringu2, string4
>           Worker 0:  actual time=0.023..390.856 rows=1858407 loops=1
>           Worker 1:  actual time=0.024..468.587 rows=2264494 loops=1
>           Worker 2:  actual time=0.023..473.170 rows=2286580 loops=1
>           Worker 3:  actual time=0.027..373.727 rows=1853216 loops=1
> Planning Time: 0.053 ms
> Execution Time: 20437.643 ms
>
>
>
> test machine spec:
>
> CentOS 8.2, 128G RAM, 40 processors, disk SAS

Thanks a lot for the performance tests and test cases. I will analyze
why the performance is degrading in that one case and respond soon.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: [PoC] Non-volatile WAL buffer

2021-01-27 Thread Takashi Menjo
Hi,

Now I have caught up with this thread. I see that many of you are
interested in performance profiling.

I share my slides from SNIA SDC 2020 [1]. In the slides, I have profiles
focused on XLogInsert and XLogFlush (mainly the latter) for my non-volatile
WAL buffer patchset. I found that the time for XLogWrite and for
locking/unlocking WALWriteLock was eliminated by the patchset. Instead,
XLogInsert and WaitXLogInsertionsToFinish took more (or a little more) time
than before, because memcpy-ing to PMEM (Optane PMem) is slower than to DRAM.
For details, please see the slides.

Best regards,
Takashi

[1]
https://www.snia.org/educational-library/how-can-persistent-memory-make-databases-faster-and-how-could-we-go-ahead-2020


On Tue, Jan 26, 2021 at 18:50, Takashi Menjo wrote:

>  Dear everyone, Tomas,
>
> First of all, the "v4" patchset for non-volatile WAL buffer attached to
> the previous mail is actually v5... Please read "v4" as "v5."
>
> Then, to Tomas:
> Thank you for the crash report you gave on Nov 27, 2020, regarding the msync
> patchset. I applied the latest msync patchset v3 attached to the previous mail
> onto master 411ae64 (on Jan 18, 2021), then tested it, and I got no error when
> running pgbench -i -s 500. Please try it if necessary.
>
> Best regards,
> Takashi
>
>
> On Tue, Jan 26, 2021 at 17:52, Takashi Menjo wrote:
>
>> Dear everyone,
>>
>> Sorry but I forgot to attach my patchsets... Please see the files
>> attached to this mail. Please also note that they contain some fixes.
>>
>> Best regards,
>> Takashi
>>
>>
>> On Tue, Jan 26, 2021 at 17:46, Takashi Menjo wrote:
>>
>>> Dear everyone,
>>>
>>> I'm sorry for the late reply. I rebased my two patchsets onto the latest
>>> master 411ae64. The patchset prefixed with v4 is for non-volatile WAL
>>> buffer; the other, prefixed with v3, is for msync.
>>>
>>> I will reply to your feedback one by one within a few days. Please
>>> wait for a moment.
>>>
>>> Best regards,
>>> Takashi
>>>
>>>
>>> 01/25/2021(Mon) 11:56 Masahiko Sawada :
>>>
 On Fri, Jan 22, 2021 at 11:32 AM Tomas Vondra
  wrote:
 >
 > On 1/21/21 3:17 AM, Masahiko Sawada wrote:
 > > On Thu, Jan 7, 2021 at 2:16 AM Tomas Vondra
 > >  wrote:
 > >>
 > >> Hi,
 > >>
 > >> I think I've managed to get the 0002 patch [1] rebased to master and
 > >> working (with help from Masahiko Sawada). It's not clear to me how it
 > >> could have worked as submitted - my theory is that an incomplete patch
 > >> was submitted by mistake, or something like that.
 > >>
 > >> Unfortunately, the benchmark results were kinda disappointing. For a
 > >> pgbench on scale 500 (fits into shared buffers), an average of three
 > >> 5-minute runs looks like this:
 > >>
 > >> branch                  1      16      32      64      96
 > >> ---------------------------------------------------------
 > >> master               7291   87704  165310  150437  224186
 > >> ntt                  7912  106095  213206  212410  237819
 > >> simple-no-buffers    7654   96544  115416   95828  103065
 > >>
 > >> NTT refers to the patch from September 10, pre-allocating a large WAL
 > >> file on PMEM, and simple-no-buffers is the simpler patch simply removing
 > >> the WAL buffers and writing directly to a mmap-ed WAL segment on PMEM.
 > >>
 > >> Note: The patch is just replacing the old implementation with mmap.
 > >> That's good enough for experiments like this, but we probably want to
 > >> keep the old one for setups without PMEM. But it's good enough for
 > >> testing, benchmarking etc.
 > >>
 > >> Unfortunately, the results for this simple approach are pretty bad. Not
 > >> only compared to the "ntt" patch, but even to master. I'm not entirely
 > >> sure what's the root cause, but I have a couple hypotheses:
 > >>
 > >> 1) bug in the patch - That's clearly a possibility, although I've
 > >> tried to eliminate this possibility.
 > >>
 > >> 2) PMEM is slower than DRAM - From what I know, PMEM is much faster than
 > >> NVMe storage, but still much slower than DRAM (both in terms of latency
 > >> and bandwidth, see [2] for some data). It's not terrible, but the
 > >> latency is maybe 2-3x higher - not a huge difference, but may matter for
 > >> WAL buffers?
 > >>
 > >> 3) PMEM does not handle parallel writes well - If you look at [2],
 > >> Figure 4(b), you'll see that the throughput actually *drops* as the
 > >> number of threads increases. That's pretty strange / annoying, because
 > >> that's how we write into WAL buffers - each thread writes its own data,
 > >> so parallelism is not something we can get rid of.
 > >>
 > >> I've added some simple profiling, to measure number of calls / time for
 > >> each operation (use -DX

Re: simplifying foreign key/RI checks

2021-01-27 Thread Kyotaro Horiguchi
At Sun, 24 Jan 2021 20:51:39 +0900, Amit Langote  
wrote in 
> Here's v5.

At Mon, 25 Jan 2021 18:19:56 +0900, Amit Langote  
wrote in 
> > Anybody else want to look this patch over before I mark it Ready For 
> > Committer?
> 
> Would be nice to have others look it over.  Thanks.

This is a nice improvement.

0001 just looks fine.

0002:

 /* RI query type codes */
-/* these queries are executed against the PK (referenced) table: */
+/*
+ * 1 and 2  are no longer used, because PK (referenced) table is looked up
+ * directly using ri_ReferencedKeyExists().
 #define RI_PLAN_CHECK_LOOKUPPK 1
 #define RI_PLAN_CHECK_LOOKUPPK_FROM_PK 2
 #define RI_PLAN_LAST_ON_PK 
RI_PLAN_CHECK_LOOKUPPK_FROM_PK

However, this patch still does:

+   if (!ri_ReferencedKeyExists(pk_rel, fk_rel, newslot, riinfo))
+   ri_ReportViolation(riinfo,
+  pk_rel, fk_rel,
+  newslot,
+  NULL,
+  RI_PLAN_CHECK_LOOKUPPK, 
false);

It seems to me that 1 (RI_PLAN_CHECK_LOOKUPPK) is still alive. (Yeah, I
know the point is not the usefulness of the macro itself but the mechanism
the macro suggests, but it is confusing.) On the other hand,
RI_PLAN_CHECK_LOOKUPPK_FROM_PK and RI_PLAN_LAST_ON_PK seem to be no
longer used.  (Couldn't we remove them?)

(About the latter, we can rewrite its only use, "if
(qkey->constr_queryno <= RI_PLAN_LAST_ON_PK)", so as not to use the macro.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Is Recovery actually paused?

2021-01-27 Thread Yugo NAGATA
On Wed, 27 Jan 2021 13:29:23 +0530
Dilip Kumar  wrote:

> On Wed, Jan 27, 2021 at 12:50 PM Masahiko Sawada  
> wrote:
> >
> > On Tue, Jan 26, 2021 at 2:00 AM Robert Haas  wrote:
> > >
> > > On Sat, Jan 23, 2021 at 6:10 AM Bharath Rupireddy
> > >  wrote:
> > > > +1 to just show the recovery pause state in the output of
> > > > pg_is_wal_replay_paused. But, should the function name
> > > > "pg_is_wal_replay_paused" be something like
> > > > "pg_get_wal_replay_pause_state" or some other? To me, when "is" exists
> > > > in a function, I expect a boolean output. Others may have better
> > > > thoughts.
> > >
> > > Maybe we should leave the existing function pg_is_wal_replay_paused()
> > > alone and add a new one with the name you suggest that returns text.
> > > That would create less burden for tool authors.
> >
> > +1
> >
> 
> Yeah, we can do that, I will send an updated patch soon.

This means pg_is_wal_replay_paused is left unchanged and still returns
whether a pause has been requested or not? If so, it seems good to modify
the documentation of this function to note that it may not reflect the
actual pause state.

Regards,
Yugo Nagata


-- 
Yugo NAGATA 




Re: standby recovery fails (tablespace related) (tentative patch and discussion)

2021-01-27 Thread Paul Guo
Thanks for the review, please see the replies below.

> On Jan 5, 2021, at 9:07 AM, Kyotaro Horiguchi  wrote:
> 
> At Wed, 8 Jul 2020 12:56:44 +, Paul Guo  wrote in 
>> On 2020/01/15 19:18, Paul Guo wrote:
>> I further fixed the last test failure (due to a small bug in the test, not 
>> in code). Attached are the new patch series. Let's see the CI pipeline 
>> result.
>> 
>> Thanks for updating the patches!
>> 
>> I started reading the 0003 patch.
>> 
>> The approach that the 0003 patch uses is not the perfect solution.
>> If the standby crashes after tblspc_redo() removes the directory and before
>> its subsequent COMMIT record is replayed, PANIC error would occur since
>> there can be some unresolved missing directory entries when we reach the
>> consistent state. The problem would very rarely happen, though...
>> Just idea; calling XLogFlush() to update the minimum recovery point just
>> before tblspc_redo() performs destroy_tablespace_directories() may be
>> safe and helpful for the problem?
> 
> It seems to me that what the current patch does is too complex.  What
> we need to do here is to remember every invalid operation then forget
> it when the prerequisite object is dropped.
> 
> When a table space is dropped before consistency is established, we
> don't need to care what has been performed inside the tablespace.  In
> this perspective, it is enough to remember tablespace ids when failed
> to do something inside it due to the absence of the tablespace and
> then forget it when we remove it.  We could remember individual
> database id to show them in error messages, but I'm not sure it's
> useful.  The reason log_invalid_page records block numbers is to allow
> the machinery handle partial table truncations, but this is not the
> case since dropping tablespace cannot leave some of containing
> databases.
> 
> As the result, we won't see an unresolved invalid operations in a
> dropped tablespace.
> 
> Am I missing something?

Yes, removing the database id from the hash key in the log/forget code should
usually be fine, but the previous code does stricter/safer checking.

Consider the scenario:

CREATE DATABASE newdb1 TEMPLATE template_db1;
CREATE DATABASE newdb2 TEMPLATE template_db2; <- in case the template_db2 
database directory is missing abnormally somehow.
DROP DATABASE template_db1;

The previous code could detect this, but if we remove the database id from the
code, this bad scenario goes undetected.

> 
> 
> dbase_redo:
> +  if (!(stat(parent_path, &st) == 0 && S_ISDIR(st.st_mode)))
> +  {
> +XLogRecordMissingDir(xlrec->tablespace_id, InvalidOid, parent_path);
> 
> This means "record the belonging table space directory if it is not
> found OR it is not a directory". The former can be valid but the
> latter is unconditionally can not (I don't think we bother considering
> symlinks there).

Again, this is a safer check: in case the parent_path is somehow a file, for
example, we should eventually panic, letting the user check and then do
recovery again.

> 
> +/*
> + * Source directory may be missing.  E.g. the template database used
> + * for creating this database may have been dropped, due to reasons
> + * noted above.  Moving a database from one tablespace may also be a
> + * partner in the crime.
> + */
> +if (!(stat(src_path, &st) == 0 && S_ISDIR(st.st_mode)))
> +{
> +  XLogLogMissingDir(xlrec->src_tablespace_id, xlrec->src_db_id, 
> src_path);
> 
> This is a part of *creation* of the target directory. Lack of the
> source directory cannot be valid even if the source directory is
> dropped afterwards in the WAL stream and we can allow that if the
> *target* tablespace is dropped afterwards. As the result, as I
> mentioned above, we don't need to record about the database directory.
> 
> By the way the name XLogLogMiss.. is somewhat confusing. How about
> XLogReportMissingDir (named after report_invalid_page).

Agree with you.

Also, your words remind me that we should skip the check once the consistency
point is reached.
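
To make the discussion concrete, here is a rough, hypothetical sketch of the
remember/forget/check idea (the function names follow the diff below, but the
bodies and the CheckMissingDirs() helper are illustrative assumptions, not the
actual patch):

/*
 * Illustrative sketch only: redo routines report a directory they could not
 * find before consistency is reached, DROP TABLESPACE/DATABASE replay forgets
 * matching entries, and any entry still present at the consistency point is
 * treated as a real problem.
 */
#include "postgres.h"
#include "utils/hsearch.h"

typedef struct xl_missing_dir_key
{
	Oid			spcNode;		/* tablespace whose directory was missing */
	Oid			dbNode;			/* database OID, or InvalidOid */
} xl_missing_dir_key;

typedef struct xl_missing_dir
{
	xl_missing_dir_key key;
	char		path[MAXPGPATH];	/* remembered for the error message */
} xl_missing_dir;

static HTAB *missing_dir_tab = NULL;

/* Called by a redo routine that hit a missing directory before consistency. */
void
XLogReportMissingDir(Oid spcNode, Oid dbNode, const char *path)
{
	xl_missing_dir_key key = {spcNode, dbNode};
	xl_missing_dir *entry;
	bool		found;

	if (missing_dir_tab == NULL)
	{
		HASHCTL		ctl;

		memset(&ctl, 0, sizeof(ctl));
		ctl.keysize = sizeof(xl_missing_dir_key);
		ctl.entrysize = sizeof(xl_missing_dir);
		missing_dir_tab = hash_create("missing directory hash", 16, &ctl,
									  HASH_ELEM | HASH_BLOBS);
	}

	entry = hash_search(missing_dir_tab, &key, HASH_ENTER, &found);
	if (!found)
		strlcpy(entry->path, path, MAXPGPATH);
}

/* Called when replay removes the object the missing directory belonged to. */
void
XLogForgetMissingDir(Oid spcNode, Oid dbNode)
{
	xl_missing_dir_key key = {spcNode, dbNode};

	if (missing_dir_tab != NULL)
		(void) hash_search(missing_dir_tab, &key, HASH_REMOVE, NULL);
}

/* Called once the consistency point is reached. */
void
CheckMissingDirs(void)
{
	HASH_SEQ_STATUS status;
	xl_missing_dir *entry;

	if (missing_dir_tab == NULL)
		return;

	hash_seq_init(&status, missing_dir_tab);
	while ((entry = hash_seq_search(&status)) != NULL)
		elog(PANIC, "cannot find directory %s (tablespace %u, database %u)",
			 entry->path, entry->key.spcNode, entry->key.dbNode);
}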

Here is a git diff against the previous patch. I'll send out the new rebased
patches after consensus is reached.

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 7ade385965..c8fe3fe228 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -90,7 +90,7 @@ typedef struct xl_missing_dir
 static HTAB *missing_dir_tab = NULL;

 void
-XLogLogMissingDir(Oid spcNode, Oid dbNode, char *path)
+XLogReportMissingDir(Oid spcNode, Oid dbNode, char *path)
 {
xl_missing_dir_key key;
bool found;
@@ -103,16 +103,6 @@ XLogLogMissingDir(Oid spcNode, Oid dbNode, char *path)
 */
Assert(OidIsValid(spcNode));

-   if (reachedConsistency)
-   {
-   if (dbNode == InvalidOid)
-   elog(PANIC, "cannot find directory %s (tablespace %d)",
-

Re: Is Recovery actually paused?

2021-01-27 Thread Dilip Kumar
On Wed, Jan 27, 2021 at 2:06 PM Yugo NAGATA  wrote:
>
> On Wed, 27 Jan 2021 13:29:23 +0530
> Dilip Kumar  wrote:
>
> > On Wed, Jan 27, 2021 at 12:50 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Tue, Jan 26, 2021 at 2:00 AM Robert Haas  wrote:
> > > >
> > > > On Sat, Jan 23, 2021 at 6:10 AM Bharath Rupireddy
> > > >  wrote:
> > > > > +1 to just show the recovery pause state in the output of
> > > > > pg_is_wal_replay_paused. But, should the function name
> > > > > "pg_is_wal_replay_paused" be something like
> > > > > "pg_get_wal_replay_pause_state" or some other? To me, when "is" exists
> > > > > in a function, I expect a boolean output. Others may have better
> > > > > thoughts.
> > > >
> > > > Maybe we should leave the existing function pg_is_wal_replay_paused()
> > > > alone and add a new one with the name you suggest that returns text.
> > > > That would create less burden for tool authors.
> > >
> > > +1
> > >
> >
> > Yeah, we can do that, I will send an updated patch soon.
>
> This means pg_is_wal_replay_paused is left without any change and this
> returns whether pause is requested or not? If so, it seems good to modify
> the documentation of this function in order to note that this could not
> return the actual pause state.

Yes, we can say that it will return true if the replay pause is
requested.  I am changing that in my new patch.
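
For reference, a minimal sketch of what such a text-returning companion
function could look like (the function name, the enum, and the
GetRecoveryPauseState() accessor here are assumptions for illustration only,
not the actual patch):

/*
 * Hypothetical sketch: a text-returning companion to the boolean
 * pg_is_wal_replay_paused().  The enum values and GetRecoveryPauseState()
 * are assumed here for illustration.
 */
#include "postgres.h"
#include "fmgr.h"
#include "utils/builtins.h"

typedef enum RecoveryPauseState
{
	RECOVERY_NOT_PAUSED,		/* pause not requested */
	RECOVERY_PAUSE_REQUESTED,	/* pause requested, but not yet paused */
	RECOVERY_PAUSED				/* recovery is actually paused */
} RecoveryPauseState;

extern RecoveryPauseState GetRecoveryPauseState(void);	/* assumed accessor */

PG_FUNCTION_INFO_V1(pg_get_wal_replay_pause_state);

Datum
pg_get_wal_replay_pause_state(PG_FUNCTION_ARGS)
{
	const char *state;

	switch (GetRecoveryPauseState())
	{
		case RECOVERY_PAUSE_REQUESTED:
			state = "pause requested";
			break;
		case RECOVERY_PAUSED:
			state = "paused";
			break;
		default:
			state = "not paused";
			break;
	}

	PG_RETURN_TEXT_P(cstring_to_text(state));
}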

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Support ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax

2021-01-27 Thread Amit Kapila
On Tue, Jan 26, 2021 at 9:18 AM japin  wrote:
>
>
> When I read the discussion in [1], I found that update subscription's 
> publications
> is complicated.
>
> For example, I have 5 publications in subscription.
>
> CREATE SUBSCRIPTION mysub1 CONNECTION 'host=localhost port=5432 
> dbname=postgres'
> PUBLICATION mypub1, mypub2, mypub3, mypub4, mypub5;
>
> If I want to drop "mypub4", we should use the following command:
>
> ALTER SUBSCRIPTION mysub1 SET PUBLICATION mypub1, mypub2, mypub3, mypub5;
>
> Also, if I want to add "mypub7" and "mypub8", it will use:
>
> ALTER SUBSCRIPTION mysub1 SET PUBLICATION mypub1, mypub2, mypub3, mypub5, 
> mypub7, mypub8;
>
> Attached implement ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax, 
> for the above
> two cases, we can use the following:
>
> ALTER SUBSCRIPTION mysub1 DROP PUBLICATION mypub4;
>
> ALTER SUBSCRIPTION mysub1 ADD PUBLICATION mypub7, mypub8;
>
> I think it's more convenient. Any thoughts?
>

While the new proposed syntax does seem to provide some ease for users,
it has nothing which we can't do with the current syntax. Also, in the
current syntax, there is an additional provision for refreshing the
existing publications as well. So, if the user has to change the
existing subscription such that it has to (a) add new publication(s),
(b) remove some publication(s), and (c) refresh existing publication(s),
then all can be done in one command, whereas with your new proposed
syntax the user has to write three separate commands.

Having said that, I don't deny the appeal of having separate commands
for each of (a), (b), and (c).

-- 
With Regards,
Amit Kapila.




RE: ECPG: proposal for new DECLARE STATEMENT

2021-01-27 Thread kuroda.hay...@fujitsu.com
Dear Hackers,

I know I'm asking a big favor, but could you review (and commit) the patch?
The status became RFC last November, but no one has checked it since then.
Maybe Meskes is quite busy and has no time to review it.

The main part of the patch is only about 200 lines (i.e., not that long), and I
have probably reviewed other patches larger than that.

I will keep reviewing, and I would be happy if this could be committed by the
end of the next CF.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Two patches to speed up pg_rewind.

2021-01-27 Thread Paul Guo

While reading the pg_rewind code I found two things that could speed up
pg_rewind. Attached are the patches.

First one: pg_rewind fsyncs the whole pgdata directory on the target by default,
but that is a waste since usually just part of the files/directories on
the target are modified. Other files on the target should already have been
flushed, since pg_rewind requires a clean shutdown before doing the real work.
This helps the scenario where the target postgres instance includes millions of
files, which has been seen in a real environment.

There are several things that may need further discussions:

1. PG_FLUSH_DATA_WORKS was introduced as "Define PG_FLUSH_DATA_WORKS if we have
an implementation for pg_flush_data", but now the code guarded by it is only
relevant to pre_sync_fname(), so we might want to rename it to something like
HAVE_PRE_SYNC?

2. Pre_sync_fname() implementation

The code looks like this:
  #if defined(HAVE_SYNC_FILE_RANGE)
  (void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
  #elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
  (void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

I’m a bit suspicious about calling posix_fadvise() with POSIX_FADV_DONTNEED.
I did not check the Linux kernel code, but according to the man page I suspect
that this option might make the kernel evict the related pages from the page
cache, which might not be something we expect. This is not a big issue since
sync_file_range() should exist on most widely used Linux systems.

Also, I’m not sure how much we could benefit from the pre_sync code. Note that
if the directory has a lot of files or the IO is fast, pre_sync_fname() might
slow down the process instead. The reasons are: if there are a lot of files, it
is possible that we need to read the already-synced-and-evicted inodes from
disk (by open()-ing) after rewinding, since the inode cache in the Linux kernel
is limited; also, if the IO is fast and the kernel does background dirty page
flushing quickly, pre_sync_fname() might just waste CPU cycles.

A better solution might be to launch a separate pthread and fsync files one by
one as pg_rewind finishes handling each file (see the sketch below).
pg_basebackup could use that solution as well.

Anyway this is independent of this patch.
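
A rough sketch of that background-fsync idea, just to illustrate the shape
(this is an assumption about how it could look, with a fixed-size queue and no
error handling, and is not part of the attached patches):

#include <fcntl.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define FSYNC_QUEUE_LEN 1024	/* fixed size; a real queue must handle overflow */

static char *fsync_queue[FSYNC_QUEUE_LEN];
static int	fq_head = 0, fq_tail = 0;
static bool fq_done = false;
static pthread_mutex_t fq_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t fq_cond = PTHREAD_COND_INITIALIZER;

/* Helper thread: fsync each path handed over by the main thread. */
static void *
fsync_worker(void *arg)
{
	for (;;)
	{
		char	   *path;
		int			fd;

		pthread_mutex_lock(&fq_lock);
		while (fq_head == fq_tail && !fq_done)
			pthread_cond_wait(&fq_cond, &fq_lock);
		if (fq_head == fq_tail && fq_done)
		{
			pthread_mutex_unlock(&fq_lock);
			return NULL;
		}
		path = fsync_queue[fq_head++ % FSYNC_QUEUE_LEN];
		pthread_mutex_unlock(&fq_lock);

		fd = open(path, O_RDONLY);
		if (fd >= 0)
		{
			(void) fsync(fd);
			close(fd);
		}
		free(path);
	}
}

/* Main thread: call this right after finishing the rewind of one file. */
static void
queue_fsync(const char *path)
{
	pthread_mutex_lock(&fq_lock);
	fsync_queue[fq_tail++ % FSYNC_QUEUE_LEN] = strdup(path);
	pthread_cond_signal(&fq_cond);
	pthread_mutex_unlock(&fq_lock);
}

/* Main thread, at the end: signal completion and wait for the worker. */
static void
finish_fsync_worker(pthread_t worker)
{
	pthread_mutex_lock(&fq_lock);
	fq_done = true;
	pthread_cond_broadcast(&fq_cond);
	pthread_mutex_unlock(&fq_lock);
	pthread_join(worker, NULL);
}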

The second one is to use copy_file_range() for the local rewind case to replace
read()+write(). This introduces a copy_file_range() configure check and
HAVE_COPY_FILE_RANGE so other code could use copy_file_range() if needed.
copy_file_range() was introduced in newer Linux kernels; on older Linux or other
Unix-like OSes mmap() might be better than read()+write(), but copy_file_range()
is more interesting given that it could skip the data copying in some file
systems - this could benefit even more on Linux filesystems on network-based
block storage.
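
For illustration, a hedged sketch of how the local copy could try
copy_file_range() and fall back to a plain read()/write() loop (assuming the
HAVE_COPY_FILE_RANGE configure check mentioned above; this is not the attached
patch, and it skips cleanup on error paths for brevity):

#define _GNU_SOURCE				/* for copy_file_range() on glibc */
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

static int
copy_file_local(const char *src, const char *dst)
{
	int			srcfd = open(src, O_RDONLY);
	int			dstfd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	char		buf[64 * 1024];
	ssize_t		nbytes;
	bool		copied = false;

	if (srcfd < 0 || dstfd < 0)
		return -1;

#ifdef HAVE_COPY_FILE_RANGE
	/* Let the kernel (and possibly the filesystem) short-circuit the copy. */
	for (;;)
	{
		nbytes = copy_file_range(srcfd, NULL, dstfd, NULL, 1 << 30, 0);
		if (nbytes == 0)
		{
			copied = true;		/* reached EOF on the source */
			break;
		}
		if (nbytes < 0)
		{
			/* fall back for cross-fs copies or kernels without support */
			if (errno == EXDEV || errno == ENOSYS || errno == EOPNOTSUPP)
				break;
			return -1;
		}
	}
#endif

	if (!copied)
	{
		/* Generic fallback: plain read()/write() loop. */
		while ((nbytes = read(srcfd, buf, sizeof(buf))) > 0)
		{
			if (write(dstfd, buf, nbytes) != nbytes)
				return -1;
		}
	}

	close(srcfd);
	close(dstfd);
	return 0;
}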

Regards,
Paul

0001-Fsync-the-affected-files-directories-only-in-pg_rewi.patch
Description:  0001-Fsync-the-affected-files-directories-only-in-pg_rewi.patch


0002-Use-copy_file_range-for-file-copying-in-pg_rewind.patch
Description:  0002-Use-copy_file_range-for-file-copying-in-pg_rewind.patch


Re: Support ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax

2021-01-27 Thread Bharath Rupireddy
On Wed, Jan 27, 2021 at 2:30 PM Amit Kapila  wrote:
>
> On Tue, Jan 26, 2021 at 9:18 AM japin  wrote:
> >
> >
> > When I read the discussion in [1], I found that update subscription's 
> > publications
> > is complicated.
> >
> > For example, I have 5 publications in subscription.
> >
> > CREATE SUBSCRIPTION mysub1 CONNECTION 'host=localhost port=5432 
> > dbname=postgres'
> > PUBLICATION mypub1, mypub2, mypub3, mypub4, mypub5;
> >
> > If I want to drop "mypub4", we should use the following command:
> >
> > ALTER SUBSCRIPTION mysub1 SET PUBLICATION mypub1, mypub2, mypub3, 
> > mypub5;
> >
> > Also, if I want to add "mypub7" and "mypub8", it will use:
> >
> > ALTER SUBSCRIPTION mysub1 SET PUBLICATION mypub1, mypub2, mypub3, 
> > mypub5, mypub7, mypub8;
> >
> > Attached implement ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax, 
> > for the above
> > two cases, we can use the following:
> >
> > ALTER SUBSCRIPTION mysub1 DROP PUBLICATION mypub4;
> >
> > ALTER SUBSCRIPTION mysub1 ADD PUBLICATION mypub7, mypub8;
> >
> > I think it's more convenient. Any thoughts?
> >
>
> While the new proposed syntax does seem to provide some ease for users
> but it has nothing which we can't do with current syntax. Also, in the
> current syntax, there is an additional provision for refreshing the
> existing publications as well. So, if the user has to change the
> existing subscription such that it has to (a) add new publication(s),
> (b) remove some publication(s), (c) refresh existing publication(s)
> then all can be done in one command whereas with your new proposed
> syntax user has to write three separate commands.

IIUC the initial patch proposed here, it does allow ALTER SUBSCRIPTION
mysub1 ADD/DROP PUBLICATION mypub4 WITH (refresh = true);. Isn't this
option enough to achieve what we can with ALTER SUBSCRIPTION mysub1
SET PUBLICATION mypub1, mypub2  WITH (refresh = true);? Am I missing
something here?

> Having said that, I don't deny the appeal of having separate commands
> for each of (a), (b), and (c).

For (c), i.e., refreshing existing publications, do we need something like
ALTER SUBSCRIPTION mysub1 REFRESH SUBSCRIPTION or some other syntax
that only refreshes the subscription, similar to ALTER SUBSCRIPTION
mysub1 REFRESH PUBLICATION?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Freeze the inserted tuples during CTAS?

2021-01-27 Thread Paul Guo
Here is a simple patch:

diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index dce882012e..0391699423 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -552,7 +552,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
myState->rel = intoRelationDesc;
myState->reladdr = intoRelationAddr;
myState->output_cid = GetCurrentCommandId(true);
-   myState->ti_options = TABLE_INSERT_SKIP_FSM;
+   myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;

The MatView code already does this, and COPY does this if specified. I’m not
sure what the community thinks about this. Personally, what I care about most
is the all-visible setting that comes with TABLE_INSERT_FROZEN, since it lets
me use an index-only scan as soon as I create an index on a table built with
CTAS; otherwise people have to vacuum before index-only scans can be used. If
people do not expect freezing, could we at least introduce an option to specify
the visibility during the insert?

Regards,
Paul

Re: Support ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax

2021-01-27 Thread Amit Kapila
On Wed, Jan 27, 2021 at 2:57 PM Bharath Rupireddy
 wrote:
>
> On Wed, Jan 27, 2021 at 2:30 PM Amit Kapila  wrote:
> >
> > On Tue, Jan 26, 2021 at 9:18 AM japin  wrote:
> > >
> > >
> > > When I read the discussion in [1], I found that update subscription's 
> > > publications
> > > is complicated.
> > >
> > > For example, I have 5 publications in subscription.
> > >
> > > CREATE SUBSCRIPTION mysub1 CONNECTION 'host=localhost port=5432 
> > > dbname=postgres'
> > > PUBLICATION mypub1, mypub2, mypub3, mypub4, mypub5;
> > >
> > > If I want to drop "mypub4", we should use the following command:
> > >
> > > ALTER SUBSCRIPTION mysub1 SET PUBLICATION mypub1, mypub2, mypub3, 
> > > mypub5;
> > >
> > > Also, if I want to add "mypub7" and "mypub8", it will use:
> > >
> > > ALTER SUBSCRIPTION mysub1 SET PUBLICATION mypub1, mypub2, mypub3, 
> > > mypub5, mypub7, mypub8;
> > >
> > > Attached implement ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... 
> > > syntax, for the above
> > > two cases, we can use the following:
> > >
> > > ALTER SUBSCRIPTION mysub1 DROP PUBLICATION mypub4;
> > >
> > > ALTER SUBSCRIPTION mysub1 ADD PUBLICATION mypub7, mypub8;
> > >
> > > I think it's more convenient. Any thoughts?
> > >
> >
> > While the new proposed syntax does seem to provide some ease for users
> > but it has nothing which we can't do with current syntax. Also, in the
> > current syntax, there is an additional provision for refreshing the
> > existing publications as well. So, if the user has to change the
> > existing subscription such that it has to (a) add new publication(s),
> > (b) remove some publication(s), (c) refresh existing publication(s)
> > then all can be done in one command whereas with your new proposed
> > syntax user has to write three separate commands.
>
> IIUC the initial patch proposed here, it does allow ALTER SUBSCRIPTION
> mysub1 ADD/DROP PUBLICATION mypub4 WITH (refresh = true);. Isn't this
> option enough to achieve what we can with ALTER SUBSCRIPTION mysub1
> SET PUBLICATION mypub1, mypub2  WITH (refresh = true);? Am I missing
> something here?
>

I feel the SET syntax would allow refreshing existing publications as
well, whereas with ADD it will be only for new publications.

-- 
With Regards,
Amit Kapila.




Re: Freeze the inserted tuples during CTAS?

2021-01-27 Thread Komяpa
Hi,

I confirm that my analytic workflows often do the CTAS and VACUUM of the
relation right after, before the index creation, to mark stuff as
all-visible for IOS to work. Freezing and marking as visible will help.

On Wed, Jan 27, 2021 at 12:29 PM Paul Guo  wrote:

> Here is the simple patch,
>
> diff --git a/src/backend/commands/createas.c
> b/src/backend/commands/createas.c
> index dce882012e..0391699423 100644
> --- a/src/backend/commands/createas.c
> +++ b/src/backend/commands/createas.c
> @@ -552,7 +552,7 @@ intorel_startup(DestReceiver *self, int operation,
> TupleDesc typeinfo)
> myState->rel = intoRelationDesc;
> myState->reladdr = intoRelationAddr;
> myState->output_cid = GetCurrentCommandId(true);
> -   myState->ti_options = TABLE_INSERT_SKIP_FSM;
> +   myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
>
> MatView code already does this and COPY does this if specified. I’m not
> sure how
> does the community think about this. Actually personally I expect more
> about the
> all-visible setting due to TABLE_INSERT_FROZEN since I could easier use
> index only scan
> if we create an index and table use CTAS, else people have to use index
> only scan
> after vacuum. If people do not expect freeze could we at least introduce a
> option to
> specify the visibility during inserting?
>
> Regards,
> Paul



-- 
Darafei "Komяpa" Praliaskouski
OSM BY Team - http://openstreetmap.by/


Re: Support ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax

2021-01-27 Thread Bharath Rupireddy
On Wed, Jan 27, 2021 at 3:01 PM Amit Kapila  wrote:
> > > While the new proposed syntax does seem to provide some ease for users
> > > but it has nothing which we can't do with current syntax. Also, in the
> > > current syntax, there is an additional provision for refreshing the
> > > existing publications as well. So, if the user has to change the
> > > existing subscription such that it has to (a) add new publication(s),
> > > (b) remove some publication(s), (c) refresh existing publication(s)
> > > then all can be done in one command whereas with your new proposed
> > > syntax user has to write three separate commands.
> >
> > IIUC the initial patch proposed here, it does allow ALTER SUBSCRIPTION
> > mysub1 ADD/DROP PUBLICATION mypub4 WITH (refresh = true);. Isn't this
> > option enough to achieve what we can with ALTER SUBSCRIPTION mysub1
> > SET PUBLICATION mypub1, mypub2  WITH (refresh = true);? Am I missing
> > something here?
> >
>
> I feel the SET syntax would allow refreshing existing publications as
> well whereas, in Add, it will be only for new Publications.

I think the patch v2-0001 will refresh all the publications, I mean
both existing and newly added publications. IIUC, the patch first
fetches all the publications of the subscription and sees whether that
list has the given publication [1]; if not, it adds it to the existing
publications list and returns that list [2]. If the refresh option is
specified as true with ALTER SUBSCRIPTION ... ADD PUBLICATION, then it
refreshes all the returned publications [3]. I believe this is also
true with ALTER SUBSCRIPTION ... DROP PUBLICATION.

So, I think the new syntax, ALTER SUBSCRIPTION .. ADD/DROP PUBLICATION
will refresh the new and existing publications.

[1]
+
+/*
+ * merge current subpublications and user specified by add/drop publications.
+ *
+ * If addpub == true, we will add the list of publications into current
+ * subpublications.  Otherwise, we will delete the list of publications from
+ * current subpublications.
+ */
+static List *
+merge_subpublications(HeapTuple tuple, TupleDesc tupledesc,
+  List *publications, bool addpub)

[2]
+    publications = merge_subpublications(tup, RelationGetDescr(rel),

[3]
+    /* Refresh if user asked us to. */
+    if (refresh)
+    {
+        if (!sub->enabled)
+            ereport(ERROR,
+                    (errcode(ERRCODE_SYNTAX_ERROR),
+                     errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
+                     errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
+
+        /* Make sure refresh sees the new list of publications. */
+        sub->publications = publications;
+
+        AlterSubscription_refresh(sub, copy_data);

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Support ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax

2021-01-27 Thread japin


On Wed, 27 Jan 2021 at 17:46, Bharath Rupireddy 
 wrote:
> On Wed, Jan 27, 2021 at 3:01 PM Amit Kapila  wrote:
>> > > While the new proposed syntax does seem to provide some ease for users
>> > > but it has nothing which we can't do with current syntax. Also, in the
>> > > current syntax, there is an additional provision for refreshing the
>> > > existing publications as well. So, if the user has to change the
>> > > existing subscription such that it has to (a) add new publication(s),
>> > > (b) remove some publication(s), (c) refresh existing publication(s)
>> > > then all can be done in one command whereas with your new proposed
>> > > syntax user has to write three separate commands.
>> >
>> > IIUC the initial patch proposed here, it does allow ALTER SUBSCRIPTION
>> > mysub1 ADD/DROP PUBLICATION mypub4 WITH (refresh = true);. Isn't this
>> > option enough to achieve what we can with ALTER SUBSCRIPTION mysub1
>> > SET PUBLICATION mypub1, mypub2  WITH (refresh = true);? Am I missing
>> > something here?
>> >
>>
>> I feel the SET syntax would allow refreshing existing publications as
>> well whereas, in Add, it will be only for new Publications.
>
> I think the patch v2-0001 will refresh all the publications, I mean
> existing and newly added publications. IIUC the patch, it first
> fetches all the available publications with the subscriptions and it
> sees if that list has the given publication [1], if not, then adds it
> to the existing publications list and returns that list [2]. If the
> refresh option is specified as true with ALTER SUBSCRIPTION ... ADD
> PUBLICATION, then it refreshes all the returned publications [3]. I
> believe this is also true with ALTER SUBSCRIPTION ... DROP
> PUBLICATION.
>
> So, I think the new syntax, ALTER SUBSCRIPTION .. ADD/DROP PUBLICATION
> will refresh the new and existing publications.
>

Yes! It will refresh the new and existing publications.

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: Support ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax

2021-01-27 Thread japin


On Wed, 27 Jan 2021 at 16:59, Amit Kapila  wrote:
> On Tue, Jan 26, 2021 at 9:18 AM japin  wrote:
>>
>>
>> When I read the discussion in [1], I found that update subscription's 
>> publications
>> is complicated.
>>
>> For example, I have 5 publications in subscription.
>>
>> CREATE SUBSCRIPTION mysub1 CONNECTION 'host=localhost port=5432 
>> dbname=postgres'
>> PUBLICATION mypub1, mypub2, mypub3, mypub4, mypub5;
>>
>> If I want to drop "mypub4", we should use the following command:
>>
>> ALTER SUBSCRIPTION mysub1 SET PUBLICATION mypub1, mypub2, mypub3, mypub5;
>>
>> Also, if I want to add "mypub7" and "mypub8", it will use:
>>
>> ALTER SUBSCRIPTION mysub1 SET PUBLICATION mypub1, mypub2, mypub3, 
>> mypub5, mypub7, mypub8;
>>
>> Attached implement ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax, 
>> for the above
>> two cases, we can use the following:
>>
>> ALTER SUBSCRIPTION mysub1 DROP PUBLICATION mypub4;
>>
>> ALTER SUBSCRIPTION mysub1 ADD PUBLICATION mypub7, mypub8;
>>
>> I think it's more convenient. Any thoughts?
>>
>
> While the new proposed syntax does seem to provide some ease for users
> but it has nothing which we can't do with current syntax. Also, in the
> current syntax, there is an additional provision for refreshing the
> existing publications as well. So, if the user has to change the
> existing subscription such that it has to (a) add new publication(s),
> (b) remove some publication(s), (c) refresh existing publication(s)
> then all can be done in one command whereas with your new proposed
> syntax user has to write three separate commands.
>

If we want to add and drop some publications at once, we can use SET PUBLICATION,
which is more convenient than ADD and DROP PUBLICATION; however, if we just want
to add (or drop) a publication into (or from) a subscription which has many
publications, then the new syntax is more convenient IMO.

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: [HACKERS] GSoC 2017: Foreign Key Arrays

2021-01-27 Thread Mark Rofail
Hi Joel,

As always, great catch!
Kindly find the updated patch (v15) below

Changelog:
- v15 (compatible with current master 2021-01-27, commit
e42b3c3bd6a9c6233ac4c8a2e9b040367ba2f97c)
* remove "EACH ELEMENT OF" from violation messages
* accommodate tests accordingly

Keep up the good work testing this patch.

/Mark


On Wed, Jan 27, 2021 at 5:11 AM Joel Jacobson  wrote:

> On Tue, Jan 26, 2021, at 12:59, Mark Rofail wrote:
> > Please don't hesitate to give your feedback.
>
> The error message for insert or update violations looks fine:
>
> UPDATE catalog_clone.pg_extension SET extconfig = ARRAY[12345] WHERE oid =
> 10;
> ERROR:  insert or update on table "pg_extension" violates foreign key
> constraint "pg_extension_extconfig_fkey"
> DETAIL:  Key (EACH ELEMENT OF extconfig)=(12345) is not present in table
> "pg_class".
>
> But when trying to delete a still referenced row,
> the column mentioned in the "EACH ELEMENT OF" sentence
> is not the array column, but the column in the referenced table:
>
> DELETE FROM catalog_clone.pg_class WHERE oid = 10;
> ERROR:  update or delete on table "pg_class" violates foreign key
> constraint "pg_extension_extconfig_fkey" on table "pg_extension"
> DETAIL:  Key (EACH ELEMENT OF oid)=(10) is still referenced from table
> "pg_extension".
>
> I think either the "EACH ELEMENT OF" part of the latter error message
> should be dropped,
> or the column name for the array column should be used.
>
> /Joel
>


Re: poc - possibility to write window function in PL languages

2021-01-27 Thread Pavel Stehule
On Wed, Jan 20, 2021 at 21:14, Pavel Stehule wrote:

>
>
>> On Wed, Jan 20, 2021 at 21:07, Tom Lane wrote:
>
>> Pavel Stehule  writes:
>> > The second question is work with partition context value. This should be
>> > only one value, and of only one but of any type per function. In this
>> case
>> > we cannot use GET statements. I had an idea of enhancing declaration.
>> Some
>> > like
>>
>> > DECLARE
>> >   pcx PARTITION CONTEXT (int); -- read partition context
>> > BEGIN
>> >   pcx := 10; -- set partition context
>>
>> > What do you think about it?
>>
>> Uh, what?  I don't understand what this "partition context" is.
>>
>
> It was my name for an access to window partition local memory -
> WinGetPartitionLocalMemory
>
> We need some interface for this cache
>

I have to think more about declarative syntax. When I try to transform our
WindowObject API directly, then it looks like Cobol. It needs a different
concept to be user friendly.

Regards

Pavel


> Regards
>
> Pavel
>
>
>
>
>
>
>
>>
>> regards, tom lane
>>
>


Re: [PATCH] Disable bgworkers during servers start in pg_upgrade

2021-01-27 Thread Denis Laxalde
Hi,

Andres Freund wrote:
> On 2021-01-21 16:23:58 +0100, Denis Laxalde wrote:
> > We found an issue in pg_upgrade on a cluster with a third-party
> > background worker. The upgrade goes fine, but the new cluster is then in
> > an inconsistent state. The background worker comes from the PoWA
> > extension but the issue does not appear to related to this particular
> > code.
> 
> Well, it does imply that that backgrounder did something, as the pure
> existence of a bgworker shouldn't affect anything. Presumably the issue
> is that the bgworker actually does
> transactional writes, which causes problems because the xids /
> multixacts from early during pg_upgrade won't actually be valid after we
> do pg_resetxlog etc.
> 
> 
> > As a solution, it seems that, for similar reasons that we restrict
> > socket access to prevent accidental connections (from commit
> > f763b77193), we should also prevent background workers to start at this
> > step.
> 
> I think that'd have quite the potential for negative impact - imagine
> extensions that refuse to be loaded outside of shared_preload_libraries
> (e.g. because they need to allocate shared memory) but that are required
> during the course of pg_upgrade (e.g. because it's a tableam, a PL or
> such). Those libraries will then tried to be loaded during the upgrade
> (due to the _PG_init() hook being called when functions from the
> extension are needed, e.g. the tableam or PL handler).
> 
> Nor is it clear to me that the only way this would be problematic is
> with shared_preload_libraries. A library in local_preload_libraries, or
> a demand loaded library can trigger bgworkers (or database writes in
> some other form) as well.

Thank you for those insights!

> I wonder if we could
> 
> a) set default_transaction_read_only to true, and explicitly change it
>in the sessions that need that.
> b) when in binary upgrade mode / -b, error out on all wal writes in
>sessions that don't explicitly set a session-level GUC to allow
>writes.

Solution "a" appears to be enough to solve the problem described in my
initial email. See attached patch implementing this.

Cheers,
Denis
>From d8b74f9220775736917b7fc08bbe397d7e2eedcd Mon Sep 17 00:00:00 2001
From: Denis Laxalde 
Date: Wed, 20 Jan 2021 17:25:58 +0100
Subject: [PATCH] Set default transactions to read-only at servers start in
 pg_upgrade

In essence, this is for a similar reason that we use a restricted socket
access from f763b77193b04eba03a1f4ce46df34dc0348419e because background
workers may produce undesired activities during the upgrade.

Author: Denis Laxalde 
Co-authored-by: Jehan-Guillaume de Rorthais 
---
 src/bin/pg_upgrade/server.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/bin/pg_upgrade/server.c b/src/bin/pg_upgrade/server.c
index 31b1425202..b17f348a5b 100644
--- a/src/bin/pg_upgrade/server.c
+++ b/src/bin/pg_upgrade/server.c
@@ -244,7 +244,7 @@ start_postmaster(ClusterInfo *cluster, bool report_and_exit_on_error)
 	 * vacuumdb --freeze actually freezes the tuples.
 	 */
 	snprintf(cmd, sizeof(cmd),
-			 "\"%s/pg_ctl\" -w -l \"%s\" -D \"%s\" -o \"-p %d%s%s %s%s\" start",
+			 "\"%s/pg_ctl\" -w -l \"%s\" -D \"%s\" -o \"-p %d%s%s %s%s -c default_transaction_read_only=on\" start",
 			 cluster->bindir, SERVER_LOG_FILE, cluster->pgconfig, cluster->port,
 			 (cluster->controldata.cat_ver >=
 			  BINARY_UPGRADE_SERVER_FLAG_CAT_VER) ? " -b" :
-- 
2.20.1



Re: Support ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax

2021-01-27 Thread Dilip Kumar
On Wed, Jan 27, 2021 at 3:26 PM japin  wrote:
>
>
> On Wed, 27 Jan 2021 at 16:59, Amit Kapila  wrote:
> > On Tue, Jan 26, 2021 at 9:18 AM japin  wrote:
> >>
> >>
> >> When I read the discussion in [1], I found that update subscription's 
> >> publications
> >> is complicated.
> >>
> >> For example, I have 5 publications in subscription.
> >>
> >> CREATE SUBSCRIPTION mysub1 CONNECTION 'host=localhost port=5432 
> >> dbname=postgres'
> >> PUBLICATION mypub1, mypub2, mypub3, mypub4, mypub5;
> >>
> >> If I want to drop "mypub4", we should use the following command:
> >>
> >> ALTER SUBSCRIPTION mysub1 SET PUBLICATION mypub1, mypub2, mypub3, 
> >> mypub5;
> >>
> >> Also, if I want to add "mypub7" and "mypub8", it will use:
> >>
> >> ALTER SUBSCRIPTION mysub1 SET PUBLICATION mypub1, mypub2, mypub3, 
> >> mypub5, mypub7, mypub8;
> >>
> >> Attached implement ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax, 
> >> for the above
> >> two cases, we can use the following:
> >>
> >> ALTER SUBSCRIPTION mysub1 DROP PUBLICATION mypub4;
> >>
> >> ALTER SUBSCRIPTION mysub1 ADD PUBLICATION mypub7, mypub8;
> >>
> >> I think it's more convenient. Any thoughts?
> >>
> >
> > While the new proposed syntax does seem to provide some ease for users
> > but it has nothing which we can't do with current syntax. Also, in the
> > current syntax, there is an additional provision for refreshing the
> > existing publications as well. So, if the user has to change the
> > existing subscription such that it has to (a) add new publication(s),
> > (b) remove some publication(s), (c) refresh existing publication(s)
> > then all can be done in one command whereas with your new proposed
> > syntax user has to write three separate commands.
> >
>
> If we want to add and drop some publications at once, we can use SET PUBLICATION,
> which is more convenient than ADD and DROP PUBLICATION; however, if we just want
> to add (or drop) a publication into (or from) a subscription which has many
> publications, then the new syntax is more convenient IMO.
>

I agree with you that if we just want to add or remove a few
publications in an existing subscription, then providing the complete
list is not convenient.  The new syntax is way better, although I am
not sure how frequently users need to add/remove publications in a
subscription.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Wrong usage of RelationNeedsWAL

2021-01-27 Thread Noah Misch
On Thu, Jan 21, 2021 at 01:23:36AM -0800, Noah Misch wrote:
> On Thu, Jan 21, 2021 at 06:02:11PM +0900, Kyotaro Horiguchi wrote:
> > At Thu, 21 Jan 2021 00:19:58 -0800, Noah Misch  wrote in 
> > > On Thu, Jan 21, 2021 at 12:28:44AM +0900, Kyotaro Horiguchi wrote:
> > > > However, with the previous patch, two existing tests sto_using_cursor
> > > > and sto_using_select behaves differently from the master.  That change
> > > > is coming from the omission of actual LSN check in TestForOldSnapshot
> > > > while wal_level=minimal. So we have no choice other than actually
> > > > updating page LSN.
> > > > 
> > > > In the scenario under discussion there are two places we need to do
> > > > that. one is heap_page_prune, and the other is RelationCopyStorge. As
> > > > a PoC, I used gistXLogAssignLSN() as is for thie purpose. See the
> > > > attached third file.
> > > 
> > > Fake LSNs make the system harder to understand, so I prefer not to spread 
> > > fake
> > > LSNs to more access methods.  What I had in mind is to simply suppress 
> > > early
> > > pruning when the relation is skipping WAL.  Attached.  Is this 
> > > reasonable?  It
> > > passes the older tests.  While it changes the sto_wal_optimized.spec 
> > > output, I
> > > think it preserves the old_snapshot_threshold behavior contract.
> > 
> > Perhaps I'm missing something, but the patch doesn't pass the v5-0001
> > test with wal_level=minimal?
> 
> Correct.  The case we must avoid is letting an old snapshot read an
> early-pruned page without error.  v5-0001 expects "ERROR:  snapshot too old".
> The patch suspends early pruning, so that error is not applicable.

I think the attached version is ready.  The changes since v6nm are cosmetic:

- Wrote log messages
- Split into two patches, since the user-visible bugs are materially different
- Fixed typos
- Ran perltidy

Is it okay if I push these on Saturday, or would you like more time to
investigate?
Author: Noah Misch 
Commit: Noah Misch 

Fix error with CREATE PUBLICATION, wal_level=minimal, and new tables.

CREATE PUBLICATION has failed spuriously when applied to a permanent
relation created or rewritten in the current transaction.  Make the same
change to another site having the same semantic intent; the second
instance has no user-visible consequences.  Back-patch to v13, where
commit c6b92041d38512a4176ed76ad06f713d2e6c01a8 broke this.

Kyotaro Horiguchi

Discussion: 
https://postgr.es/m/20210113.160705.2225256954956139776.horikyota@gmail.com

diff --git a/src/backend/catalog/pg_publication.c 
b/src/backend/catalog/pg_publication.c
index 5f8e1c6..84d2efc 100644
--- a/src/backend/catalog/pg_publication.c
+++ b/src/backend/catalog/pg_publication.c
@@ -67,7 +67,7 @@ check_publication_add_relation(Relation targetrel)
 errdetail("System tables cannot be added to 
publications.")));
 
/* UNLOGGED and TEMP relations cannot be part of publication. */
-   if (!RelationNeedsWAL(targetrel))
+   if (targetrel->rd_rel->relpersistence != RELPERSISTENCE_PERMANENT)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 errmsg("table \"%s\" cannot be replicated",
diff --git a/src/backend/optimizer/util/plancat.c 
b/src/backend/optimizer/util/plancat.c
index da322b4..177e6e3 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -126,7 +126,8 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, 
bool inhparent,
relation = table_open(relationObjectId, NoLock);
 
/* Temporary and unlogged relations are inaccessible during recovery. */
-   if (!RelationNeedsWAL(relation) && RecoveryInProgress())
+   if (relation->rd_rel->relpersistence != RELPERSISTENCE_PERMANENT &&
+   RecoveryInProgress())
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 errmsg("cannot access temporary or unlogged 
relations during recovery")));
diff --git a/src/test/subscription/t/001_rep_changes.pl 
b/src/test/subscription/t/001_rep_changes.pl
index 0680f44..3d90f81 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 23;
+use Test::More tests => 24;
 
 # Initialize publisher node
 my $node_publisher = get_new_node('publisher');
@@ -358,3 +358,21 @@ is($result, qq(0), 'check replication origin was dropped 
on subscriber');
 
 $node_subscriber->stop('fast');
 $node_publisher->stop('fast');
+
+# CREATE PUBLICATION while wal_level=minimal should succeed, with a WARNING
+$node_publisher->append_conf(
+   'postgresql.conf', qq(
+wal_level=minimal
+max_wal_senders=0
+));
+$node_publisher->start;
+($result, my $retout, my $reterr) = $node_publis

Re: Protect syscache from bloating with negative cache entries

2021-01-27 Thread Heikki Linnakangas

On 27/01/2021 03:13, Kyotaro Horiguchi wrote:

At Thu, 14 Jan 2021 17:32:27 +0900 (JST), Kyotaro Horiguchi 
 wrote in

The commit 4656e3d668 (debug_invalidate_system_caches_always)
conflicted with this patch. Rebased.


At Wed, 27 Jan 2021 10:07:47 +0900 (JST), Kyotaro Horiguchi 
 wrote in

(I found a bug in a benchmark-aid function
(CatalogCacheFlushCatalog2), I repost an updated version soon.)


I noticed that a catcachebench-aid function,
CatalogCacheFlushCatalog2(), allocates the bucket array wrongly in the
current memory context, which leads to a crash.

This is a fixed and rebased version.


Thanks, with the scripts you provided, I was able to run the performance 
tests on my laptop, and got very similar results as you did.


The impact of v7-0002-Remove-dead-flag-from-catcache-tuple.patch is very 
small. I think I could see it in the tests, but only barely. And the 
tests did nothing else than do syscache lookups; in any real world 
scenario, it would be lost in noise. I think we can put that aside for 
now, and focus on v6-0001-CatCache-expiration-feature.patch:


The pruning is still pretty lethargic:

- Entries created in the same transactions are never pruned away

- The size of the hash table is never shrunk. So even though the patch 
puts a backstop to the hash table growing indefinitely, if you run one 
transaction that bloats the cache, it's bloated for the rest of the session.


I think that's OK. We might want to be more aggressive in the future, 
but for now it seems reasonable to lean towards the current behavior 
where nothing is pruned. Although I wonder if we should try to set 
'catcacheclock' more aggressively. I think we could set it whenever 
GetCurrentTimestamp() is called, for example.


Given how unaggressive this mechanism is, I think it should be safe to 
enable it by default. What would be a suitable default for 
catalog_cache_prune_min_age? 30 seconds?
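
As a rough illustration of the core check (a sketch built on the names in the
attached patch - ct->lastaccess, catcacheclock, catalog_cache_prune_min_age -
rather than the patch's actual CatCacheCleanupOldEntries()):

/*
 * Sketch only (assumes the includes and types of catcache.c): prune one hash
 * bucket by dropping unreferenced entries whose last access is older than
 * catalog_cache_prune_min_age seconds.  The real code also has to deal with
 * entries that belong to a CatCList, negative entries, and resizing.
 */
static int
prune_catcache_bucket(CatCache *cache, dlist_head *bucket)
{
	dlist_mutable_iter iter;
	int			nremoved = 0;

	if (catalog_cache_prune_min_age < 0)
		return 0;				/* feature disabled */

	dlist_foreach_modify(iter, bucket)
	{
		CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);

		/* never evict an entry that is still referenced */
		if (ct->refcount != 0)
			continue;

		/* prune if it has not been looked up for prune_min_age seconds */
		if (TimestampDifferenceExceeds(ct->lastaccess, catcacheclock,
									   catalog_cache_prune_min_age * 1000))
		{
			CatCacheRemoveCTup(cache, ct);
			nremoved++;
		}
	}

	return nremoved;
}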


Documentation needs to be updated for the new GUC.

Attached is a version with a few little cleanups:
- use TimestampTz instead of uint64 for the timestamps
- remove assign_catalog_cache_prune_min_age(). All it did was convert 
the GUC's value from seconds to microseconds, and stored it in a 
separate variable. Multiplication is cheap, so we can just do it when we 
use the GUC's value instead.


- Heikki
>From 73583c2c9cd26bad7c3942eff1ddccb2ebb43222 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 27 Jan 2021 13:08:08 +0200
Subject: [PATCH v8 1/1] CatCache expiration feature

Author: Kyotaro Horiguchi
---
 src/backend/access/transam/xact.c  |  3 ++
 src/backend/utils/cache/catcache.c | 82 +-
 src/backend/utils/misc/guc.c   | 12 +
 src/include/utils/catcache.h   | 17 +++
 4 files changed, 112 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index a2068e3fd45..86888d24091 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1086,6 +1086,9 @@ static void
 AtStart_Cache(void)
 {
 	AcceptInvalidationMessages();
+
+	if (xactStartTimestamp != 0)
+		SetCatCacheClock(xactStartTimestamp);
 }
 
 /*
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index fa2b49c676e..e29f825687a 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -38,6 +38,7 @@
 #include "utils/rel.h"
 #include "utils/resowner_private.h"
 #include "utils/syscache.h"
+#include "utils/timestamp.h"
 
 
  /* #define CACHEDEBUG */	/* turns DEBUG elogs on */
@@ -60,9 +61,18 @@
 #define CACHE_elog(...)
 #endif
 
+/*
+ * GUC variable to define the minimum age of entries that will be considered
+ * to be evicted in seconds. -1 to disable the feature.
+ */
+int catalog_cache_prune_min_age = -1;
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Clock for the last accessed time of a catcache entry. */
+TimestampTz	catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
 			   int nkeys,
 			   Datum v1, Datum v2,
@@ -74,6 +84,7 @@ static pg_noinline HeapTuple SearchCatCacheMiss(CatCache *cache,
 Index hashIndex,
 Datum v1, Datum v2,
 Datum v3, Datum v4);
+static bool CatCacheCleanupOldEntries(CatCache *cp);
 
 static uint32 CatalogCacheComputeHashValue(CatCache *cache, int nkeys,
 		   Datum v1, Datum v2, Datum v3, Datum v4);
@@ -99,7 +110,6 @@ static void CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos,
 static void CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos,
 			 Datum *srckeys, Datum *dstkeys);
 
-
 /*
  *	internal support functions
  */
@@ -1264,6 +1274,9 @@ SearchCatCacheInternal(CatCache *cache,
 		 */
 		dlist_move_head(bucket, &ct->cache_elem);
 
+		/* Record the last access timestamp */
+		ct->lastaccess = catcacheclock;
+
 		/*
 		 * If it's a positive ent

Re: Support ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax

2021-01-27 Thread Amit Kapila
On Wed, Jan 27, 2021 at 3:16 PM Bharath Rupireddy
 wrote:
>
> On Wed, Jan 27, 2021 at 3:01 PM Amit Kapila  wrote:
>
> So, I think the new syntax, ALTER SUBSCRIPTION .. ADD/DROP PUBLICATION
> will refresh the new and existing publications.
>

That sounds a bit unusual to me because when the user has specifically
asked to just ADD Publication, we might refresh some existing
Publication along with it?

-- 
With Regards,
Amit Kapila.




Re: pg_upgrade fails with non-standard ACL

2021-01-27 Thread Noah Misch
On Mon, Jan 25, 2021 at 10:14:43PM +0300, Anastasia Lubennikova wrote:
> On 24.01.2021 11:39, Noah Misch wrote:
> >On Thu, Jan 21, 2021 at 01:03:58AM +0300, Anastasia Lubennikova wrote:
> >>On 03.01.2021 14:29, Noah Misch wrote:
> >>>Overall, this patch predicts a subset of cases where pg_dump will emit a
> >>>failing GRANT or REVOKE that targets a pg_catalog object.  Can you write a
> >>>code comment stating what it does and does not detect?  I think it's okay 
> >>>to
> >>>not predict every failure, but let's record the intended coverage.  Here 
> >>>are a
> >>>few examples I know; there may be more.  The above query only finds GRANTs 
> >>>to
> >>>non-pinned roles.  pg_dump dumps the following, but the above query doesn't
> >>>find them:
> >>>
> >>>   REVOKE ALL ON FUNCTION pg_last_wal_replay_lsn FROM public;
> >>>   GRANT EXECUTE ON FUNCTION pg_reload_conf() TO pg_signal_backend;
> >I see a new comment listing object types.  Please also explain the lack of
> >preventing REVOKE failures (first example query above) and the limitation
> >around non-pinned roles (second example).
> 
> 1) Could you please clarify, what do you mean by REVOKE failures?
> 
> I tried following example:
> 
> Start 9.6 cluster.
> 
> REVOKE ALL ON function pg_switch_xlog() FROM public;
> REVOKE ALL ON function pg_switch_xlog() FROM backup;
> 
> The upgrade to the current master passes with and without patch.
> It seems that current implementation of pg_upgrade doesn't take into account
> revoke ACLs.

I think you can observe it by adding "revoke all on function
pg_stat_get_subscription from public;" to test_add_acl_to_catalog_objects.sql
and then rerunning your test script.  pg_dump will reproduce that REVOKE,
which would fail if postgresql.org had removed that function.  That's fine, so
long as a comment mentions the limitation.

> 2) As for pinned roles, I think we should fix this behavior, rather than
> adding a comment. Because such roles can have grants on system objects.
> 
> In earlier versions of the patch, I gathered ACLs directly from system
> catalogs: pg_proc.proacl, pg_class.relacl pg_attribute.attacl and
> pg_type.typacl.
> 
> The only downside of this approach is that it cannot be easily extended to
> other object types, as we need to handle every object type separately.
> I don't think it is a big deal, as we already do it in
> check_for_changed_signatures()
> 
> And also the query to gather non-standard ACLs won't look as nice as now,
> because of the need to parse arrays of aclitems.
> 
> What do you think?

Hard to say for certain without seeing the code both ways.  I'm not generally
enthusiastic about adding pg_upgrade code to predict whether the dump will
fail to restore, because such code will never be as good as just trying the
restore.  The patch has 413 lines of code, which is substantial.  I would balk
if, for example, the patch tripled in size to catch more cases.




Re: Fix ALTER SUBSCRIPTION ... SET PUBLICATION documentation

2021-01-27 Thread Amit Kapila
On Tue, Jan 26, 2021 at 4:56 PM japin  wrote:
>
>
> Hi,
>
> When I read the documentation of ALTER SUBSCRIPTION ... SET PUBLICATION ... 
> WITH (...),
> it says "set_publication_option" only support "refresh" in documentation [1].
> However, we can also supply the "copy_data" option, and the code is:
>

I think there is a reference to the 'copy_data' option as well. There
is a sentence saying: "Additionally, refresh options as described
under REFRESH PUBLICATION may be specified." and then if you Refresh
option, there we do mention about 'copy_data', isn't that sufficient?

-- 
With Regards,
Amit Kapila.




Re: Parallel INSERT (INTO ... SELECT ...)

2021-01-27 Thread Greg Nancarrow
On Wed, Jan 27, 2021 at 2:13 PM Hou, Zhijie  wrote:
>
> Hi,
>
> When testing the patch with the following kind of sql.
>
> ---
> Insert into part_table select 1;
> Insert into part_table select generate_series(1,1,1);
> Insert into part_table select * from testfunc();
> ---
>
> we usually use these sqls to initialize the table or for testing purpose.
>
> Personally I think we do not need to do the parallel safety-check for these 
> cases,
> because there seems no chance for the select part to consider parallel.
>
> I thought we aim to not check the safety unless parallel is possible.
> , So I was thinking is it possible to avoid the check it these cases ?
>
> I did some quick check on the code, An Immature ideal is to check if there is 
> RTE_RELATION in query.
> If no we do not check the safety-check.
>
> I am not sure is it worth to do that, any thoughts ?
>

Yes, I think it's worth it. It's surprising that there aren't really
any optimizations for these with just the current Postgres parallel
SELECT functionality (as there's currently no way to divide the work
for these amongst the workers, even if the function/expression is
parallel-safe).
For the additional parallel-safety checks for INSERT, currently we
check that RTE_SUBQUERY is in the range-table. So I think we can
additionally check that RTE_RELATION is in the subquery range-table
(otherwise treat it as unsafe).
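
To make that concrete, a rough sketch of such a helper (my own wording and 
names, not taken from any posted patch) could look like this:

/*
 * Sketch only: return true if the source query's range table contains a
 * plain relation RTE.  If it does not (e.g. "INSERT ... SELECT 1" or a
 * SELECT over a function call only), the additional INSERT parallel-safety
 * checks could be skipped, since the SELECT part has nothing to divide
 * among workers anyway.
 */
static bool
query_has_relation_rte(Query *subquery)
{
	ListCell   *lc;

	foreach(lc, subquery->rtable)
	{
		RangeTblEntry *rte = lfirst_node(RangeTblEntry, lc);

		if (rte->rtekind == RTE_RELATION)
			return true;
	}
	return false;
}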

Regards,
Greg Nancarrow
Fujitsu Australia




Re: Support ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax

2021-01-27 Thread Bharath Rupireddy
On Wed, Jan 27, 2021 at 4:42 PM Amit Kapila  wrote:
>
> On Wed, Jan 27, 2021 at 3:16 PM Bharath Rupireddy
>  wrote:
> >
> > On Wed, Jan 27, 2021 at 3:01 PM Amit Kapila  wrote:
> >
> > So, I think the new syntax, ALTER SUBSCRIPTION .. ADD/DROP PUBLICATION
> > will refresh the new and existing publications.
> >
>
> That sounds a bit unusual to me because when the user has specifically
> asked to just ADD Publication, we might refresh some existing
> Publication along with it?

Hmm. That's correct. I also feel we should not touch the existing
publications, only the ones that are added/dropped should be
refreshed. Because there will be the overhead of a SQL query with more
publications (in fetch_table_list) when AlterSubscription_refresh() is
called with all the existing publications. We could just pass in the
newly added/dropped publications to AlterSubscription_refresh().

I don't see any problem if ALTER SUBSCRIPTION ... ADD PUBLICATION with
refresh true refreshes only the newly added publications, because what
we do in AlterSubscription_refresh() is that we fetch the tables
associated with the publications from the publisher, compare them with
the previously fetched tables from that publication and add the new
tables or remove the tables that don't exist in that publication
anymore.

For ALTER SUBSCRIPTION ... DROP PUBLICATION, we can do the same thing,
i.e. refresh only the dropped publications.
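
As a rough illustration of what "just the delta" could mean at the code level
(this is my own sketch, not from any posted patch, and assumes the publication
lists are the usual lists of name strings so list_difference() applies):

/*
 * Sketch: compute the publications whose tables actually need refreshing
 * for ADD/DROP PUBLICATION, so AlterSubscription_refresh() could be handed
 * only the delta instead of the subscription's full publication list.
 */
static List *
publications_to_refresh(List *oldpublist, List *newpublist, bool isadd)
{
	if (isadd)
		return list_difference(newpublist, oldpublist);	/* newly added */
	else
		return list_difference(oldpublist, newpublist);	/* dropped */
}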

Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Fix ALTER SUBSCRIPTION ... SET PUBLICATION documentation

2021-01-27 Thread Bharath Rupireddy
On Wed, Jan 27, 2021 at 4:57 PM Amit Kapila  wrote:
>
> On Tue, Jan 26, 2021 at 4:56 PM japin  wrote:
> >
> >
> > Hi,
> >
> > When I read the documentation of ALTER SUBSCRIPTION ... SET PUBLICATION ... 
> > WITH (...),
> > it says "set_publication_option" only support "refresh" in documentation 
> > [1].
> > However, we can also supply the "copy_data" option, and the code is:
> >
>
> I think there is a reference to the 'copy_data' option as well. There
> is a sentence saying: "Additionally, refresh options as described
> under REFRESH PUBLICATION may be specified." and then if you Refresh
> option, there we do mention about 'copy_data', isn't that sufficient?

Right. It looks like the copy_option is indirectly mentioned with the
statement "Additionally, refresh options as described under REFRESH
PUBLICATION may be specified." under "set_publication_option". IMHO,
we can keep it that way.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: simplifying foreign key/RI checks

2021-01-27 Thread Amit Langote
On Wed, Jan 27, 2021 at 5:32 PM Kyotaro Horiguchi
 wrote:
> At Sun, 24 Jan 2021 20:51:39 +0900, Amit Langote  
> wrote in
> > Here's v5.
>
> At Mon, 25 Jan 2021 18:19:56 +0900, Amit Langote  
> wrote in
> > > Anybody else want to look this patch over before I mark it Ready For 
> > > Committer?
> >
> > Would be nice to have others look it over.  Thanks.
>
> This is a nice improvement.
>
> 0001 just looks fine.
>
> 0002:
>
>  /* RI query type codes */
> -/* these queries are executed against the PK (referenced) table: */
> +/*
> + * 1 and 2  are no longer used, because PK (referenced) table is looked up
> + * directly using ri_ReferencedKeyExists().
>  #define RI_PLAN_CHECK_LOOKUPPK 1
>  #define RI_PLAN_CHECK_LOOKUPPK_FROM_PK 2
>  #define RI_PLAN_LAST_ON_PK 
> RI_PLAN_CHECK_LOOKUPPK_FROM_PK
>
> However, this patch does.
>
> +   if (!ri_ReferencedKeyExists(pk_rel, fk_rel, newslot, riinfo))
> +   ri_ReportViolation(riinfo,
> +  pk_rel, fk_rel,
> +  newslot,
> +  NULL,
> +  RI_PLAN_CHECK_LOOKUPPK, 
> false);
>
> It seems to me 1 (RI_PLAN_CHECK_LOOKUPPK) is still alive. (Yeah, I
> know that doesn't mean the usefulness of the macro but the mechanism
> the macro suggests, but it is confusing.) On the other hand,
> RI_PLAN_CHECK_LOOKUPPK_FROM_PK and RI_PLAN_LAST_ON_PK seem to be no
> longer used.  (Couldn't we remove them?)

Yeah, better to just remove those _PK macros and say this module no
longer runs any queries on the PK table.

How about the attached?

-- 
Amit Langote
EDB: http://www.enterprisedb.com


v6-0002-Avoid-using-SPI-for-some-RI-checks.patch
Description: Binary data


v6-0001-Export-get_partition_for_tuple.patch
Description: Binary data


Re: Fix ALTER SUBSCRIPTION ... SET PUBLICATION documentation

2021-01-27 Thread japin


On Wed, 27 Jan 2021 at 19:47, Bharath Rupireddy 
 wrote:
> On Wed, Jan 27, 2021 at 4:57 PM Amit Kapila  wrote:
>>
>> On Tue, Jan 26, 2021 at 4:56 PM japin  wrote:
>> >
>> >
>> > Hi,
>> >
>> > When I read the documentation of ALTER SUBSCRIPTION ... SET PUBLICATION 
>> > ... WITH (...),
>> > it says "set_publication_option" only support "refresh" in documentation 
>> > [1].
>> > However, we can also supply the "copy_data" option, and the code is:
>> >
>>
>> I think there is a reference to the 'copy_data' option as well. There
>> is a sentence saying: "Additionally, refresh options as described
>> under REFRESH PUBLICATION may be specified." and then if you Refresh
>> option, there we do mention about 'copy_data', isn't that sufficient?
>
> Right. It looks like the copy_option is indirectly mentioned with the
> statement "Additionally, refresh options as described under REFRESH
> PUBLICATION may be specified." under "set_publication_option". IMHO,
> we can keep it that way.
>

My bad. It may be sufficient. Sorry for noise.

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: Printing backtrace of postgres processes

2021-01-27 Thread vignesh C
On Thu, Jan 21, 2021 at 7:26 AM Tom Lane  wrote:
>
> Craig Ringer  writes:
> > On Wed, 20 Jan 2021 at 05:23, Tom Lane  wrote:
> >> BTW, it also looks like the patch is doing nothing to prevent the
> >> backtrace from being sent to the connected client.
>
> > I don't see a good reason to send a bt to a client. Even though these
> > backtraces won't be analysing debuginfo and populating args, locals, etc,
> > it should still just go to the server log.
>
> Yeah.  That's easier than I was thinking, we just need to
> s/LOG/LOG_SERVER_ONLY/.
>
> >> Maybe, given all of these things, we should forget using elog at
> >> all and just emit the trace with fprintf(stderr).
>
> > That causes quite a lot of pain with MemoryContextStats() already
>
> True.  Given the changes discussed in the last couple messages, I don't
> see any really killer reasons why we can't ship the trace through elog.
> We can always try that first, and back off to fprintf if we do find
> reasons why it's too unstable.
>

Thanks everyone for the suggestions. Attached is the v3 patch, which has
fixes based on them. It includes the following changes: 1) Removed
support for getting the callstack of all postgres processes; the user can
get only one process's callstack. 2) Updated the documentation. 3) Added
the necessary checks for pg_print_callstack, similar to
pg_terminate_backend. 4) Changed LOG to LOG_SERVER_ONLY.
Thoughts?

Regards,
Vignesh
From dd51938225881be14cce2ba0e80ae9019a3f2bfc Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Wed, 27 Jan 2021 18:20:13 +0530
Subject: [PATCH v3] Print backtrace of postgres process that are part of this
 instance.

The idea here is to implement & expose pg_print_callstack function, internally
what this function does is, the connected backend will send SIGUSR1 signal by
setting PMSIGNAL_BACKTRACE_EMIT to the postmaster process. Once the process
receives this signal it will log the backtrace of the process.
---
 doc/src/sgml/func.sgml| 75 +++
 src/backend/postmaster/autovacuum.c   |  4 ++
 src/backend/postmaster/checkpointer.c |  5 +++
 src/backend/postmaster/interrupt.c|  5 +++
 src/backend/storage/ipc/procsignal.c  | 33 +++
 src/backend/storage/ipc/signalfuncs.c | 59 ++-
 src/backend/tcop/postgres.c   | 38 ++
 src/backend/utils/init/globals.c  |  1 +
 src/bin/pg_ctl/t/005_backtrace.pl | 73 ++
 src/include/catalog/pg_proc.dat   |  5 +++
 src/include/miscadmin.h   |  2 +
 src/include/storage/pmsignal.h|  2 +
 src/include/storage/procsignal.h  |  3 ++
 src/include/tcop/tcopprot.h   |  1 +
 14 files changed, 305 insertions(+), 1 deletion(-)
 create mode 100644 src/bin/pg_ctl/t/005_backtrace.pl

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index aa99665..4ff6e7f 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -24709,6 +24709,25 @@ SELECT collation for ('foo' COLLATE "de_DE");
 however only superusers can terminate superuser backends.

   
+
+  
+   
+
+ pg_print_callstack
+
+pg_print_callstack ( pid integer )
+boolean
+   
+   
+Prints the callstack whose backend process has the specified process ID.
+Callstack will be printed based on the log configuration set. See
+ for more information.  This is
+allowed if the calling role is a member of the role whose backend
+callstack is being printed or the calling role has been granted
+pg_print_callstack, however only superusers can
+print callstack of superuser backends.
+   
+  
  
 

@@ -24728,6 +24747,62 @@ SELECT collation for ('foo' COLLATE "de_DE");
 pg_stat_activity view.

 
+   
+pg_print_callstack can be used to print callstack of
+a backend process. For example:
+
+postgres=# select pg_print_callstack(pg_backend_pid());
+ pg_print_callstack
+
+ t
+(1 row)
+
+The callstack will be logged in the log file. For example:
+2021-01-27 11:33:50.247 IST [111735] LOG:  current backtrace:
+postgres: postgresdba postgres [local] SELECT(LogBackTrace+0x33) [0x9501cd]
+postgres: postgresdba postgres [local] SELECT(ProcessInterrupts+0x774) [0x950bac]
+postgres: postgresdba postgres [local] SELECT() [0x761ee8]
+postgres: postgresdba postgres [local] SELECT() [0x71bc39]
+postgres: postgresdba postgres [local] SELECT() [0x71e3df]
+postgres: postgresdba postgres [local] SELECT(standard_ExecutorRun+0x1d6) [0x71c25d]
+postgres: postgresdba postgres [local] SELECT(ExecutorRun+0x55) [0x71c085]
+postgres: postgresdba postgres [local] SELECT() [0x953f3d]
+postgres: postgresdba postgres [local] SELECT(PortalRun+0x262) [0x953bf6]
+postgres: postgresdba postgres [local] SELECT() [0x94dafa]
+postgres: postgresdba 

Re: [PATCH] Disable bgworkers during servers start in pg_upgrade

2021-01-27 Thread Jehan-Guillaume de Rorthais
Hi, 

On Wed, 27 Jan 2021 11:25:11 +0100
Denis Laxalde  wrote:

> Andres Freund a écrit :
> > On 2021-01-21 16:23:58 +0100, Denis Laxalde wrote:  
> > > We found an issue in pg_upgrade on a cluster with a third-party
> > > background worker. The upgrade goes fine, but the new cluster is then in
> > > an inconsistent state. The background worker comes from the PoWA
> > > extension but the issue does not appear to related to this particular
> > > code.  
> > 
> > Well, it does imply that that backgrounder did something, as the pure
> > existence of a bgworker shouldn't affect anything. Presumably the issue is
> > that the bgworker actually does transactional writes, which causes problems
> > because the xids / multixacts from early during pg_upgrade won't actually
> > be valid after we do pg_resetxlog etc.

Indeed, it does some writes. As soon as the powa bgworker starts, it takes
"snapshots" of pg_stat_statements state and record them in a local table. To
avoid concurrent run, it takes a lock on some of its local rows using SELECT FOR
UPDATE, hence the mxid consumption.

The inconsistency occurs in at least two places:

* the datminmxid and relminmxid fields pg_dump(all)'ed and restored in the new
  cluster.
* the multixid fields in the controlfile read during the check phase and
  restored later using pg_resetxlog.

> > > As a solution, it seems that, for similar reasons that we restrict
> > > socket access to prevent accidental connections (from commit
> > > f763b77193), we should also prevent background workers to start at this
> > > step.  
> > 
> > I think that'd have quite the potential for negative impact - [...]
> 
> Thank you for those insights!

+1

> > I wonder if we could
> > 
> > a) set default_transaction_read_only to true, and explicitly change it
> >in the sessions that need that.

According to Denis' tests discussed off-list, it works fine with regard to the
powa bgworker, albeit with some complaints in the logs. However, I wonder how
fragile it could be, as a bgworker could easily manipulate either the GUC or
"BEGIN READ WRITE". I realize these are really uncommon practices, but bgworker
code from third parties might be surprising.

> > b) when in binary upgrade mode / -b, error out on all wal writes in
> >sessions that don't explicitly set a session-level GUC to allow
> >writes.

It feels safer because it is more specific to the subject. But I wonder if the
benefit is worth adding some (limited?) complexity and a small/quick conditional
block in a very hot code path for a very limited use case.

What about c), where bgworkers are not loaded by default during binary upgrade
mode / -b unless they explicitly set a bgw_flags bit (BGWORKER_BINARY_UPGRADE?)
when they are required during pg_upgrade?
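
For illustration, the check could be as small as the sketch below.
BGWORKER_BINARY_UPGRADE is a made-up flag value here, and where exactly the
test would live in the postmaster is left open:

#include "miscadmin.h"				/* IsBinaryUpgrade */
#include "postmaster/bgworker.h"	/* BackgroundWorker, bgw_flags */

/* Hypothetical flag, not part of any existing API */
#define BGWORKER_BINARY_UPGRADE		0x0008

static bool
bgworker_should_start_now(const BackgroundWorker *worker)
{
	/* During pg_upgrade's -b mode, only start workers that opt in. */
	if (IsBinaryUpgrade &&
		(worker->bgw_flags & BGWORKER_BINARY_UPGRADE) == 0)
		return false;

	return true;
}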

Regards,




Re: Support ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax

2021-01-27 Thread Li Japin


> On Jan 27, 2021, at 19:41, Bharath Rupireddy 
>  wrote:
> 
> On Wed, Jan 27, 2021 at 4:42 PM Amit Kapila  wrote:
>>> On Wed, Jan 27, 2021 at 3:16 PM Bharath Rupireddy
>>>  wrote:
 On Wed, Jan 27, 2021 at 3:01 PM Amit Kapila  
 wrote:
>>> So, I think the new syntax, ALTER SUBSCRIPTION .. ADD/DROP PUBLICATION
>>> will refresh the new and existing publications.
>> That sounds a bit unusual to me because when the user has specifically
>> asked to just ADD Publication, we might refresh some existing
>> Publication along with it?
> 
> Hmm. That's correct. I also feel we should not touch the existing
> publications, only the ones that are added/dropped should be
> refreshed. Because there will be an overhead of a SQL with more
> publications(in fetch_table_list) when AlterSubscription_refresh() is
> called with all the existing publications. We could just pass in the
> newly added/dropped publications to AlterSubscription_refresh().
> 
> I don't see any problem if ALTER SUBSCRIPTION ... ADD PUBLICATION with
> refresh true refreshes only the newly added publications, because what
> we do in AlterSubscription_refresh() is that we fetch the tables
> associated with the publications from the publisher, compare them with
> the previously fetched tables from that publication and add the new
> tables or remove the table that don't exist in that publication
> anymore.
> 
> For ALTER SUBSCRIPTION ... DROP PUBLICATION, also we can do the same
> thing i.e. refreshes only the dropped publications.
> 
> Thoughts?

Agreed. We only need to refresh the added/dropped publications.
Furthermore, for publications that will be dropped, we do not need the 
“copy_data” option, right?

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: [PATCH] Disable bgworkers during servers start in pg_upgrade

2021-01-27 Thread Jehan-Guillaume de Rorthais
Oh, I forgot another point before sending my previous email.

Maybe it would be worth adding some final safety checks in pg_upgrade itself?
E.g. checking controldata and mxid file coherency between the old and new
cluster would have caught the inconsistency here.




Re: FailedAssertion in heap_index_delete_tuples at heapam.c:7220

2021-01-27 Thread Jaime Casanova
On Wed, Jan 27, 2021 at 2:09 AM Peter Geoghegan  wrote:
>
> On Tue, Jan 26, 2021 at 10:52 PM Jaime Casanova
>  wrote:
> > ${subject} happened while executing ${attached query} at regresssion
> > database, using 14dev (commit
> > d5a83d79c9f9b660a6a5a77afafe146d3c8c6f46) and produced ${attached
> > stack trace}.
>
> I see the bug: gistprunepage() calls
> index_compute_xid_horizon_for_tuples() (which ultimately calls the
> heapam.c callback for heap_index_delete_tuples()) with an empty array,
> which we don't expect. The similar code within _hash_vacuum_one_page()
> already only calls index_compute_xid_horizon_for_tuples() when
> ndeletable > 0.
>
> The fix is obvious: Bring gistprunepage() in line with
> _hash_vacuum_one_page(). I'll go push a fix for that now.
>
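
(For readers following along: the guard being borrowed from
_hash_vacuum_one_page() is roughly of the following shape. This is schematic
only; the argument names are illustrative and not copied from the tree.)

	/* Only compute an xid horizon when there is actually something to delete */
	if (ndeletable > 0)
		latestRemovedXid =
			index_compute_xid_horizon_for_tuples(rel, heapRel, buffer,
												 deletable, ndeletable);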

Thanks

-- 
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL




Re: Key management with tests

2021-01-27 Thread Bruce Momjian
On Tue, Jan 26, 2021 at 05:53:01PM -0500, Bruce Momjian wrote:
> On Tue, Jan 26, 2021 at 03:24:30PM -0500, Robert Haas wrote:
> > I'm wondering whether you've considered storing all the keys in one
> > file instead of a file per key. The reason I ask is that it seems to
> > me that the key rotation procedure would be a lot simpler if it were
> > all in one file. You could just create a temporary file and atomically
> > rename it over the existing file. If you see a temporary file you're
> > always free to remove it. This is a lot simpler than what you have
> > right now. The "repair" concept pretty much goes away completely,
> > which seems nice. Granted I don't know exactly how to store multiple
> > keys in one file, but I bet there's some way to do it.
> 
> We envisioned allowing heap/index key rotation by having a standby with
> the same WAL key as the primary but different heap/index keys so that we
> can failover to the standby to change the heap/index key and then change
> the WAL key.  This separation allows that.  We also might need some
> additional keys later and this allows that.  I do like simplicity, but
> the complexity here seems to serve a need.

Just to close this issue, several scripts, e.g., PIV, AWS, need to store
data to indicate the cluster encryption key used, and those need to be
kept synchronized with the wrapped data keys.  Having separate
directories for each cluster key version allows that to work cleanly.

> > The README in src/backend/crypto does not explain how the scripts in
> > that directory are intended to be used. If I want to use AWS Secrets
> > Manager with this feature, I can see that I should use
> > ckey_aws.sh.sample as a basis for that integration, but I don't know
> > what I should do with the script because the README says nothing about
> > it. I am frankly pretty doubtful about the idea of shipping a bunch of
> > /bin/sh scripts as a best practice; for one thing, that's totally
> > unusable on Windows, and it also means that this is dependent on
> > /bin/sh existing and having the behavior we expect and on all the
> > other executables in these scripts as well. But, at the very least,
> > there needs to be a clearer explanation of how the scripts are
> > intended to be used, which parts people are supposed to modify, what
> > arguments they're going to get called with, and things like that.
> 
> I added comments to most of the scripts.  I don't know what more I can
> do, or what other language would be appropriate.

I think someone would need to write Windows versions of these scripts.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Speeding up GIST index creation for tsvectors

2021-01-27 Thread Amit Khandekar
On Tue, 15 Dec 2020 at 20:34, Amit Khandekar  wrote:
>
> On Sun, 13 Dec 2020 at 9:28 PM, Andrey Borodin  wrote:
> > +1
> > This will make all INSERTs and UPDATES for tsvector's GiSTs.
>
> Oh, I didn't realize that this code is getting used in GIST index
> insertion and creation too. Will check there.

I ran some insert and update tests; they show only marginal
improvement. So it looks like the patch is mainly improving index builds.

> > Meanwhile there are at least 4 incarnation of hemdistsign() functions that 
> > are quite similar. I'd propose to refactor them somehow...
>
> Yes, I hope we get the benefit there also. Before that, I thought I
> should post the first use-case to get some early comments. Thanks for
> your encouraging comments :)

The attached v2 version of 0001 patch extends the hemdistsign()
changes to the other use cases like  intarray, ltree and hstore. I see
the same index build improvement for all these types.

Since for the gist index creation of some of these types the default
value for siglen is small (8-20), I tested with small siglens. For
siglens <= 20, particularly for values that are not multiples of 8
(e.g. 10, 13, etc.), I see a 1-7% reduction in index creation speed. This is
probably because of the extra function call for pg_xorcount(), and might also
be due to the extra logic in pg_xorcount(), which becomes prominent for
shorter traversals. So for siglen less than 32, I kept the existing method
using byte-by-byte traversal.
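
The shape of the change, as I understand it, is roughly the following. This is
an illustrative sketch only, not the patch's actual hemdistsign()/pg_xorcount()
code; the 32-byte cutoff just mirrors the threshold mentioned above:

#include "port/pg_bitutils.h"	/* pg_popcount64, pg_number_of_ones */

/* Count differing bits between two signatures of 'len' bytes. */
static int
xor_popcount(const unsigned char *a, const unsigned char *b, int len)
{
	int			count = 0;
	int			i = 0;

	if (len >= 32)
	{
		/* word-at-a-time XOR + popcount for longer signatures */
		for (; i + 8 <= len; i += 8)
		{
			uint64		wa,
						wb;

			memcpy(&wa, a + i, 8);
			memcpy(&wb, b + i, 8);
			count += pg_popcount64(wa ^ wb);
		}
	}

	/* byte-at-a-time tail, and the whole loop for short signatures */
	for (; i < len; i++)
		count += pg_number_of_ones[a[i] ^ b[i]];

	return count;
}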

--
Thanks,
-Amit Khandekar
Huawei Technologies
From b8905b47f7e0af8d4558fdac73cc2283c263cbcc Mon Sep 17 00:00:00 2001
From: Amit Khandekar 
Date: Thu, 10 Dec 2020 21:45:49 +0800
Subject: [PATCH 2/2] Avoid function pointer dereferencing for
 pg_popcount32/64() call.

With USE_POPCNT_ASM set (which is true only on x86), we decide with
the help of __get_cpuid() at runtime whether the CPU supports popcnt
instruction, and hence we dynamically choose the corresponding
function; thus pg_popcount32/64 is a function pointer. But for
platforms with USE_POPCNT_ASM not set, we don't have to use dynamic
assignment of functions, but we were still declaring pg_popcount64 as a
function pointer. So arrange for direct function call so as to get rid
of function pointer dereferencing each time pg_popcount32/64 is
called.

To do this, define pg_popcount64 to another function name
(pg_popcount64_nonasm) rather than a function pointer, whenever
USE_POPCNT_ASM is not defined.  And let pg_popcount64_nonasm() be a
static inline function so that whenever pg_popcount64() is called,
the __builtin_popcount() gets called directly.  For platforms not
supporting __builtin_popcount(), continue using the slow version as is
the current behaviour. pg_popcount64_slow() was actually a misnomer,
because it was actually calling __builtin_popcount() which is not a slow
function. Now with this commit, pg_popcount64_slow() indeed has only
slow code.

Tested this on ARM64, and the gist index creation for tsvectors speeds
up by ~ 6% for default siglen (=124), and by 12% with siglen=700
---
 src/include/port/pg_bitutils.h | 59 ++
 src/port/pg_bitutils.c | 47 +--
 2 files changed, 67 insertions(+), 39 deletions(-)

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 174df28e66..b039f94ee5 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -23,6 +23,19 @@ extern const uint8 pg_rightmost_one_pos[256];
 extern const uint8 pg_number_of_ones[256];
 #endif
 
+/*
+ * On x86_64, we can use the hardware popcount instruction, but only if
+ * we can verify that the CPU supports it via the cpuid instruction.
+ *
+ * Otherwise, we fall back to __builtin_popcount if the compiler has that,
+ * or a hand-rolled implementation if not.
+ */
+#ifdef HAVE_X86_64_POPCNTQ
+#if defined(HAVE__GET_CPUID) || defined(HAVE__CPUID)
+#define USE_POPCNT_ASM 1
+#endif
+#endif
+
 /*
  * pg_leftmost_one_pos32
  *		Returns the position of the most significant set bit in "word",
@@ -208,8 +221,54 @@ pg_ceil_log2_64(uint64 num)
 }
 
 /* Count the number of one-bits in a uint32 or uint64 */
+
+/*
+ * With USE_POPCNT_ASM set (which is true only on x86), we decide at runtime
+ * whether the CPU supports popcnt instruction, and hence we dynamically choose
+ * the corresponding function; thus pg_popcount32/64 is a function pointer. But
+ * for platforms with USE_POPCNT_ASM not set, we don't have to use dynamic
+ * assignment of functions, so we arrange for direct function call so as to get
+ * rid of function pointer dereferencing each time pg_popcount32/64 is called.
+ */
+#ifdef USE_POPCNT_ASM
 extern int	(*pg_popcount32) (uint32 word);
 extern int	(*pg_popcount64) (uint64 word);
+#else
+#define pg_popcount32 pg_popcount32_nonasm
+#define pg_popcount64 pg_popcount64_nonasm
+#endif
+
+/* Slow versions of pg_popcount */
+#ifndef HAVE__BUILTIN_POPCOUNT
+extern int pg_popcount32_slow(uint32 word);
+extern int pg_popcount

Re: Online checksums patch - once again

2021-01-27 Thread Heikki Linnakangas

Revisiting an issue we discussed earlier:

On 25/11/2020 15:20, Daniel Gustafsson wrote:

On 23 Nov 2020, at 18:36, Heikki Linnakangas  wrote:

On 17/11/2020 10:56, Daniel Gustafsson wrote:

I've reworked this in the attached such that the enable_ and
disable_ functions merely call into the launcher with the desired
outcome, and the launcher is responsible for figuring out the
rest.  The datachecksumworker is now the sole place which
initiates a state transfer.


Well, you still fill the DatachecksumsWorkerShmem->operations array
in the backend process that launches the datachecksumworker, not in
the worker process. I find that still a bit surprising, but I
believe it works.


I'm open to changing it in case there are strong opinions, it just
seemed the most natural to me.


This kept bothering me, so I spent a while hacking this to my liking. 
The attached patch moves the code to fill in 'operations' from the 
backend to the launcher, so that the pg_enable/disable_checksums() call 
now truly just stores the desired state, checksum on or off, in shared 
memory, and launches the launcher process. The launcher process figures 
out how to get to the desired state. This removes the couple of corner 
cases that previously emitted a NOTICE about the processing being 
concurrently disabled or aborted. What do you think? I haven't done much 
testing, so if you adopt this approach, please check if I broke 
something in the process.


This changes the way the abort works. If the 
pg_enable/disable_checksums() is called, while the launcher is already 
busy changing the state, pg_enable/disable_checksums() will just set the 
new desired state in shared memory anyway. The launcher process will 
notice that the target state changed some time later, and restart from 
scratch.


A couple of other issues came up while doing that:

- AbortProcessing() has two callers, one in code that runs in the 
launcher process, and another one in code that runs in the worker 
process. Is it really safe to use from the worker process? Calling 
ProcessAllDatabases() in the worker seems sketchy. (This is moot in this 
new patch version, as I removed AbortProcessing() altogether)


- Is it possible that the worker keeps running after the launcher has 
already exited, e.g. because of an ERROR or SIGTERM? If you then quickly 
call pg_enable_checksums() again, can you end up with two workers 
running at the same time? Is that bad?


On 26/01/2021 23:00, Daniel Gustafsson wrote:

On 22 Jan 2021, at 12:55, Heikki Linnakangas  wrote:

@@ -3567,6 +3571,27 @@ RelationBuildLocalRelation(const char *relname,
relkind == RELKIND_MATVIEW)
RelationInitTableAccessMethod(rel);
+   /*
+* Set the data checksum state. Since the data checksum state can 
change at
+* any time, the fetched value might be out of date by the time the
+* relation is built.  DataChecksumsNeedWrite returns true when data
+* checksums are: enabled; are in the process of being enabled (state:
+* "inprogress-on"); are in the process of being disabled (state:
+* "inprogress-off"). Since relhaschecksums is only used to track 
progress
+* when data checksums are being enabled, and going from disabled to
+* enabled will clear relhaschecksums before starting, it is safe to use
+* this value for a concurrent state transition to off.
+*
+* If DataChecksumsNeedWrite returns false, and is concurrently changed 
to
+* true then that implies that checksums are being enabled. Worst case,
+* this will lead to the relation being processed for checksums even 
though
+* each page written will have them already.  Performing this last 
shortens
+* the window, but doesn't avoid it.
+*/
+   HOLD_INTERRUPTS();
+   rel->rd_rel->relhaschecksums = DataChecksumsNeedWrite();
+   RESUME_INTERRUPTS();
+
/*
 * Okay to insert into the relcache hash table.
 *


I grepped for relhaschecksums, and concluded that the value in the
relcache isn't actually used for anything. Not so! In
heap_create_with_catalog(), the actual pg_class row is constructed
from the relcache entry, so the value set in
RelationBuildLocalRelation() finds its way to pg_class. Perhaps it
would be more clear to pass relhachecksums directly as an argument
to AddNewRelationTuple(). That way, the value in the relcache would
be truly never used.


I might be thick (or undercaffeinated) but I'm not sure I follow. 
AddNewRelationTuple calls InsertPgClassTuple which in turn avoids the

relcache entry.


Ah, you're right, I misread AddNewRelationTuple. That means that the 
relhaschecksums field in the relcache is never used? That's a clear 
rule. The changes to formrdesc() and RelationBuildLocalRelation() seem 
unnecessary then, we can always initialize relhaschecksums to false in 
the relcache.


- Heikki
>From 3ccb06a1e39b456d26a5e2f89c9b634f04b34307 Mon Sep 17

Inconsistent function definitions in ECPG's informix.c

2021-01-27 Thread Tom Lane
I noticed that some of the newer compilers in the buildfarm
(e.g., caiman, with gcc 11.0) whine about the definitions of
rjulmdy() and rmdyjul() not quite matching their external
declarations:

informix.c:516:23: warning: argument 2 of type `short int[3]' with mismatched 
bound [-Warray-parameter=]
  516 | rjulmdy(date d, short mdy[3])
  | ~~^~
In file included from informix.c:10:
../include/ecpg_informix.h:38:31: note: previously declared as `short int *'
   38 | extern int  rjulmdy(date, short *);
  |   ^~~
informix.c:567:15: warning: argument 1 of type `short int[3]' with mismatched 
bound [-Warray-parameter=]
  567 | rmdyjul(short mdy[3], date * d)
  | ~~^~
In file included from informix.c:10:
../include/ecpg_informix.h:41:25: note: previously declared as `short int *'
   41 | extern int  rmdyjul(short *, date *);
  | ^~~

This isn't a bug really, since per the C spec these declarations
are equivalent.  But it'd be good to silence the warning before
it gets any more common.

The most conservative thing to do would be to take the user-visible
extern declarations as being authoritative, and change informix.c
to match.  I'm slightly tempted to do the opposite though, on the
grounds that showing the expected lengths of the arrays is useful.
But I wonder if anyone's compatibility checker tools would
(mistakenly) classify that as an ABI break.

Thoughts?

regards, tom lane




Re: Columns correlation and adaptive query optimization

2021-01-27 Thread Konstantin Knizhnik



On 27.01.2021 8:45, Yugo NAGATA wrote:

On Mon, 25 Jan 2021 16:27:25 +0300
Konstantin Knizhnik  wrote:


Hello,

Thank you for review.
My answers are inside.

Thank you for updating the patch and answering my questions.


(2)
If I understand correctly, your proposal consists of the following two features.

1. Add a feature to auto_explain that creates an extended statistic 
automatically
if an error on estimated rows number is large.

2. Improve rows number estimation of join results by considering functional
dependencies between vars in the join qual if the qual has more than one 
clauses,
and also functional dependencies between a var in the join qual and vars in 
quals
of the inner/outer relation.

As you said, these two parts are independent each other, so one feature will 
work
even if we don't assume the other.  I wonder it would be better to split the 
patch
again, and register them to commitfest separately.

I agree with you that these are two almost unrelated changes, although
without the clausesel patch the additional statistics cannot improve query planning.

I think extended statistics created by the auto_explain patch can improve query
planning even without the clausesel patch. For example, suppose the following 
case:

postgres=# create table t ( i int, j int);
CREATE TABLE
postgres=# insert into t select i/10, i/100 from  generate_series(1,100) i;
INSERT 0 100
postgres=# analyze t;
ANALYZE
postgres=# explain analyze select * from t where i = 100 and j = 10;
QUERY PLAN
-
  Seq Scan on t  (cost=0.00..19425.00 rows=1 width=8) (actual 
time=0.254..97.293 rows=10 loops=1)
Filter: ((i = 100) AND (j = 10))
Rows Removed by Filter: 90
  Planning Time: 0.199 ms
  Execution Time: 97.327 ms
(5 rows)

After applying the auto_explain patch (without clausesel patch) and issuing the 
query,
additional statistics were created.

postgres=# select * from t where i = 100 and j = 10;
LOG:  Add statistics t_i_j

Then, after analyze, the row estimation was improved.

postgres=# analyze t;
ANALYZE
postgres=# explain analyze select * from t where i = 100 and j = 10;
postgres=# explain analyze select * from t where i = 100 and j = 10;
 QUERY PLAN
--
  Seq Scan on t  (cost=0.00..19425.00 rows=10 width=8) (actual 
time=0.255..95.347 rows=10 loops=1)
Filter: ((i = 100) AND (j = 10))
Rows Removed by Filter: 90
  Planning Time: 0.124 ms
  Execution Time: 95.383 ms
(5 rows)

So, I think the auto_explain patch is useful with just that as a tool
to detect a gap between estimate and real and adjust the plan. Also,
the clausesel patch would be useful without the auto_explain patch
if an appropriate statistics are created.


But I already have too many patches at commitfest.
Maybe it will be enough to split this patch into two?

Although we can discuss both of these patches in this thread,
I wonder we don't have to commit them together.


(3)
+   DefineCustomBoolVariable("auto_explain.suggest_only",
+"Do not create statistic 
but just record in WAL suggested create statistics statement.",
+NULL,
+
&auto_explain_suggest_on

To imply that this parameter is involving to add_statistics_threshold, it seems
better for me to use more related name like add_statistics_suggest_only.

Also, additional documentations for new parameters are required.

Done.

+
+   
+
+ auto_explain.auto_explain.add_statistics_threshold 
(real)
+ 
+  auto_explain.add_statistics_threshold 
configuration parameter
+ 
+
+
+ 
+   auto_explain.add_statistics_threshold sets the 
threshold for
+   actual/estimated #rows ratio triggering creation of multicolumn 
statistic
+   for the related columns. It can be used for adaptive query optimization.
+   If there is large gap between real and estimated number of tuples for 
the
+   concrete plan node, then multicolumn statistic is created for involved
+   attributes. Zero value (default) disables implicit creation of 
multicolumn statistic.
+ 
+

I wonder we need to say that this parameter has no effect unless log_analyze
is enabled and that statistics are created only when the excution time exceeds
log_min_duration, if these behaviors are intentional.

In addition, additional statistics are created only if #rows is over-estimated
and not if it is under-estimated. Although it seems good as a criterion for 
creating
multicolumn statistics, since extended statistics are usually useful to fix
over-estimation,
I am not sure if we don't have to consider under-estimation case at all.



(9)
Curr

Re: [PoC] Non-volatile WAL buffer

2021-01-27 Thread Tomas Vondra
On 1/25/21 3:56 AM, Masahiko Sawada wrote:
>>
>> ...
>>
>> On 1/21/21 3:17 AM, Masahiko Sawada wrote:
>>> ...
>>>
>>> While looking at the two methods: NTT and simple-no-buffer, I realized
>>> that in XLogFlush(), NTT patch flushes (by pmem_flush() and
>>> pmem_drain()) WAL without acquiring WALWriteLock whereas
>>> simple-no-buffer patch acquires WALWriteLock to do that
>>> (pmem_persist()). I wonder if this also affected the performance
>>> differences between those two methods since WALWriteLock serializes
>>> the operations. With PMEM, multiple backends can concurrently flush
>>> the records if the memory region is not overlapped? If so, flushing
>>> WAL without WALWriteLock would be a big benefit.
>>>
>>
>> That's a very good question - it's quite possible the WALWriteLock is
>> not really needed, because the processes are actually "writing" the WAL
>> directly to PMEM. So it's a bit confusing, because it's only really
>> concerned about making sure it's flushed.
>>
>> And yes, multiple processes certainly can write to PMEM at the same
>> time, in fact it's a requirement to get good throughput I believe. My
>> understanding is we need ~8 processes, at least that's what I heard from
>> people with more PMEM experience.
> 
> Thanks, that's good to know.
> 
>>
>> TBH I'm not convinced the code in the "simple-no-buffer" code (coming
>> from the 0002 patch) is actually correct. Essentially, consider the
>> backend needs to do a flush, but does not have a segment mapped. So it
>> maps it and calls pmem_drain() on it.
>>
>> But does that actually flush anything? Does it properly flush changes
>> done by other processes that may not have called pmem_drain() yet? I
>> find this somewhat suspicious and I'd bet all processes that did write
>> something have to call pmem_drain().
>
For the record, from what I learned / been told by engineers with PMEM
experience, calling pmem_drain() should properly flush changes done by
other processes. So it should be sufficient to do that in XLogFlush(),
from a single process.

My understanding is that we have about three challenges here:

(a) we still need to track how far we flushed, so this needs to be
protected by some lock anyway (although perhaps a much smaller section
of the function)

(b) pmem_drain() flushes all the changes, so it flushes even the "future"
part of the WAL after the requested LSN, which may negatively affect
performance, I guess. So I wonder if pmem_persist would be a better fit,
as it allows specifying a range that should be persisted (see the sketch
after point (c) below).

(c) As mentioned before, PMEM behaves differently with concurrent
access, i.e. it reaches peak throughput with a relatively low number of
threads writing data, and then the throughput drops quite quickly. I'm
not sure if the same thing applies to pmem_drain() too - if it does, we
may need something like we have for insertions, i.e. a handful of locks
allowing limited number of concurrent inserts.
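
As a concrete illustration of (b), assuming libpmem's documented API (where
pmem_persist() flushes and drains only the given range), the per-flush call
could be limited to the requested portion of the mapped segment:

#include <libpmem.h>

/*
 * Sketch: persist only the WAL bytes between the 'start' and 'end' offsets
 * of the mapped region, instead of draining every outstanding store with
 * pmem_drain().  Tracking how far we have flushed (point (a)) still needs
 * its own locking around this call.
 */
static void
flush_wal_range(char *mapped_wal, size_t start, size_t end)
{
	pmem_persist(mapped_wal + start, end - start);
}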


> Yeah, in terms of experiments at least it's good to find out that the
> approach mmapping each WAL segment is not good at performance.
> 
Right. The problem with small WAL segments seems to be that each mmap
causes the TLB to be thrown away, which means a lot of expensive cache
misses. As the mmap needs to be done by each backend writing WAL, this
is particularly bad with small WAL segments. The NTT patch works around
that by doing just a single mmap.

I wonder if we could pre-allocate and mmap small segments, and keep them
mapped and just rename the underlying files when recycling them. That'd
keep the regular segment files, as expected by various tools, etc. The
question is what would happen when we temporarily need more WAL, etc.

>>>
>>> ...
>>>
>>> I think the performance improvement by NTT patch with the 16MB WAL
>>> segment, the most common WAL segment size, is very good (150437 vs.
>>> 212410 with 64 clients). But maybe evaluating writing WAL segment
>>> files on PMEM DAX filesystem is also worth, as you mentioned, if we
>>> don't do that yet.
>>>
>>
>> Well, not sure. I think the question is still open whether it's actually
>> safe to run on DAX, which does not have atomic writes of 512B sectors,
>> and I think we rely on that e.g. for pg_control. But maybe for WAL that's
>> not an issue.
> 
> I think we can use the Block Translation Table (BTT) driver that
> provides atomic sector updates.
> 

But we have benchmarked that, see my message from 2020/11/26, which
shows this table:

 clients  master/btt  master/dax      ntt    simple
 ---------------------------------------------------
       1        5469        7402      7977      6746
      16       48222       80869    107025     82343
      32       73974      158189    214718    158348
      64       85921      154540    225715    164248
      96      150602      221159    237008    217253

Clearly, BTT is quite expensive. Maybe there's a way to tune that at
filesystem/kernel level, I haven't tried that.

Improve join selectivity estimation using extended statistics

2021-01-27 Thread Konstantin Knizhnik

Original thread is:
https://www.postgresql.org/message-id/flat/196f1e1a-5464-ed07-ab3c-0c9920564af7%40postgrespro.ru

Following Yugo's advice, I have split this patch into two:
1. Extending auto_explain extension to generate extended statistics in 
case of bad selectivity estimation.

2. Taking extended statistics into account when computing join selectivity.


Now this thread will contain only patches for join selectivity estimation.


> However, IIUC, the clausesel patch uses only functional dependencies
> statistics for improving joins, so my question was about the possibility of
> considering MCV in the clausesel patch.

Sorry, I do not have an idea right now how to use MCV for better estimation
of join selectivity.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index d263ecf..5446ad5 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -25,6 +25,8 @@
 #include "utils/fmgroids.h"
 #include "utils/lsyscache.h"
 #include "utils/selfuncs.h"
+#include "statistics/statistics.h"
+#include "catalog/pg_statistic_ext.h"
 
 /*
  * Data structure for accumulating info about possible range-query
@@ -56,6 +58,47 @@ static Selectivity clauselist_selectivity_or(PlannerInfo *root,
  /
 
 /*
+ * Find functional dependency between attributes using multicolumn statistic.
+ * relid:   index of relation to which all considered attributes belong
+ * var: variable which dependencies are inspected
+ * attnums: set of considered attributes included specified variables
+ * This function return degree of strongest dependency between some subset of this attributes
+ * and specified variable or 0.0 if on dependency is found.
+ */
+double
+find_var_dependency(PlannerInfo *root, Index relid, Var *var, Bitmapset *attnums)
+{
+	RelOptInfo* rel = find_base_rel(root, relid);
+	if (rel->rtekind == RTE_RELATION && rel->statlist != NIL)
+	{
+		StatisticExtInfo *stat = choose_best_statistics(rel->statlist, STATS_EXT_DEPENDENCIES,
+		&attnums, 1);
+		if (stat != NULL)
+		{
+			MVDependencies *dependencies = statext_dependencies_load(stat->statOid);
+			MVDependency *strongest = NULL;
+			int i;
+			for (i = 0; i < dependencies->ndeps; i++)
+			{
+MVDependency *dependency = dependencies->deps[i];
+int n_dep_vars = dependency->nattributes - 1;
+/* Dependency implies attribute */
+if (var->varattno == dependency->attributes[n_dep_vars])
+{
+	while (--n_dep_vars >= 0
+		   && bms_is_member(dependency->attributes[n_dep_vars], attnums));
+	if (n_dep_vars < 0 && (!strongest || strongest->degree < dependency->degree))
+		strongest = dependency;
+}
+			}
+			if (strongest)
+return strongest->degree;
+		}
+	}
+	return 0.0;
+}
+
+/*
  * clauselist_selectivity -
  *	  Compute the selectivity of an implicitly-ANDed list of boolean
  *	  expression clauses.  The list can be empty, in which case 1.0
@@ -129,6 +172,9 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	RangeQueryClause *rqlist = NULL;
 	ListCell   *l;
 	int			listidx;
+	Bitmapset  *clauses_attnums = NULL;
+	int			n_clauses_attnums = 0;
+	int innerRelid = varRelid;
 
 	/*
 	 * If there's exactly one clause, just go directly to
@@ -139,6 +185,9 @@ clauselist_selectivity_ext(PlannerInfo *root,
 	  varRelid, jointype, sjinfo,
 	  use_extended_stats);
 
+	if (innerRelid == 0 && sjinfo)
+		bms_get_singleton_member(sjinfo->min_righthand, &innerRelid);
+
 	/*
 	 * Determine if these clauses reference a single relation.  If so, and if
 	 * it has extended statistics, try to apply those.
@@ -171,7 +220,6 @@ clauselist_selectivity_ext(PlannerInfo *root,
 		Node	   *clause = (Node *) lfirst(l);
 		RestrictInfo *rinfo;
 		Selectivity s2;
-
 		listidx++;
 
 		/*
@@ -204,6 +252,7 @@ clauselist_selectivity_ext(PlannerInfo *root,
 		else
 			rinfo = NULL;
 
+
 		/*
 		 * See if it looks like a restriction clause with a pseudoconstant on
 		 * one side.  (Anything more complicated than that might not behave in
@@ -215,6 +264,42 @@ clauselist_selectivity_ext(PlannerInfo *root,
 			OpExpr	   *expr = (OpExpr *) clause;
 			bool		varonleft = true;
 			bool		ok;
+			int oprrest = get_oprrest(expr->opno);
+
+			/* Try to take in account functional dependencies between attributes */
+			if (oprrest == F_EQSEL && rinfo != NULL && innerRelid != 0)
+			{
+Var* var = (Var*)linitial(expr->args);
+if (!IsA(var, Var) || var->varno != innerRelid)
+{
+	var = (Var*)lsecond(expr->args);
+}
+if (IsA(var, Var) && var->varattno >= 0 && var->varno == innerRelid)
+{
+	clauses_attnums = bms_add_member(clauses_attnums, var->varattno);
+	if (n_clauses_attnums++ != 0)
+	{
+		double dep = find_var_dependency(root, innerRelid, var, clauses_attnums);
+		

Re: Printing backtrace of postgres processes

2021-01-27 Thread Andres Freund
Hi,

On 2021-01-27 19:05:16 +0530, vignesh C wrote:

>  /*
> + * LogBackTrace
> + *
> + * Get the backtrace and log the backtrace to log file.
> + */
> +void
> +LogBackTrace(void)
> +{
> + int save_errno = errno;
> +
> + void   *buf[100];
> + int nframes;
> + char  **strfrms;
> + StringInfoData errtrace;
> +
> + /* OK to process messages.  Reset the flag saying there are more to do. 
> */
> + PrintBacktracePending = false;

ISTM that it'd be better to do this in the caller, allowing this
function to be used outside of signal triggered backtraces.

>  
> +extern void LogBackTrace(void); /* Called from EmitProcSignalPrintCallStack 
> */

I don't think this comment is correct anymore?

Greetings,

Andres Freund




Re: cleaning up a few CLOG-related things

2021-01-27 Thread Robert Haas
On Mon, Jan 25, 2021 at 2:11 PM Heikki Linnakangas  wrote:
> > [patches 0001 - 0003]
>
> Makes sense.

Great. I committed the first one and will proceed with those as well.

> > So, the CLOG page gets created when nextXid advances from the first
> > value on the page to the second value on the page, not when it
> > advances from the last value on the previous page to the first value
> > on the current page.
> Yes. It took me a moment to understand that explanation, though. I'd
> phrase it something like "nextXid is the next XID that will be used, but
> we want to set latest_page_number according to the last XID that's
> already been used. So retreat by one."

OK, updated the patch to use that language for the comment.

> Having a separate FullTransactionIdToLatestPageNumber() function for
> this seems like overkill to me.

I initially thought so too, but it turned out to be pretty useful for
writing debugging cross-checks and things, so I'm inclined to keep it
even though I'm not at present proposing to commit any such debugging
cross-checks. For example I tried making the main redo loop check
whether XactCtl->shared->latest_page_number ==
FullTransactionIdToLatestPageNumber(nextXid) which turned out to be
super-helpful in understanding things.
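
Purely as illustration of the kind of cross-check meant here (not something
included in the attached patches), assuming it is placed where both XactCtl
and the next full transaction id are in scope:

	/* Hypothetical debugging cross-check, not part of the posted patches */
	Assert(XactCtl->shared->latest_page_number ==
		   FullTransactionIdToLatestPageNumber(nextXid));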

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v2-0001-In-clog_redo-don-t-set-XactCtl-shared-latest_page.patch
Description: Binary data


v2-0002-In-TrimCLOG-don-t-reset-XactCtl-shared-latest_pag.patch
Description: Binary data


v2-0003-In-StartupCLOG-correct-an-off-by-one-error.patch
Description: Binary data


Re: Support for NSS as a libpq TLS backend

2021-01-27 Thread Jacob Champion
On Wed, 2021-01-27 at 16:39 +0900, Michael Paquier wrote:
> My apologies for chiming in.  I was looking at your patch set here,
> and while reviewing the strong random and cryptohash parts I have
> found a couple of mistakes in the ./configure part.  I think that the
> switch from --with-openssl to --with-ssl={openssl} could just be done
> independently as a building piece of the rest, then the first portion
> based on NSS could just add the minimum set in configure.ac.
> 
> Please note that the patch set has been using autoconf from Debian, or
> something forked from upstream.  There were also missing updates in
> several parts of the code base, and a lack of docs for the new
> switch.  I have spent time checking that with --with-openssl to make
> sure that the obsolete grammar is still compatible, --with-ssl=openssl
> and also without it.
> 
> Thoughts?

Seems good to me on Ubuntu; builds with both flavors.

From peering at the Windows side:

> --- a/src/tools/msvc/config_default.pl
> +++ b/src/tools/msvc/config_default.pl
> @@ -16,7 +16,7 @@ our $config = {
>   tcl   => undef,# --with-tcl=
>   perl  => undef,# --with-perl=
>   python=> undef,# --with-python=
> - openssl   => undef,# --with-openssl=
> + openssl   => undef,# --with-ssl=openssl with 
>   uuid  => undef,# --with-uuid=
>   xml   => undef,# --with-libxml=
>   xslt  => undef,# --with-libxslt=

So to check understanding: the `openssl` config variable is still alive
for MSVC builds; it just turns that into `--with-ssl=openssl` in the
fake CONFIGURE_ARGS?



Since SSL is an obsolete term, and the choice of OpenSSL vs NSS vs
[nothing] affects server operation (such as cryptohash) regardless of
whether or not connection-level TLS is actually used, what would you
all think about naming this option --with-crypto? I.e.

--with-crypto=openssl
--with-crypto=nss



--Jacob


Re: Add MAIN_RELATION_CLEANUP and SECONDARY_RELATION_CLEANUP options to VACUUM

2021-01-27 Thread Justin Pryzby
On Fri, Jan 24, 2020 at 09:24:45PM +, Bossart, Nathan wrote:
> On 1/21/20, 1:39 PM, "Vik Fearing"  wrote:
> > On 21/01/2020 22:21, Bossart, Nathan wrote:
> >> I've attached a patch for a couple of new options for VACUUM:
> >> MAIN_RELATION_CLEANUP and SECONDARY_RELATION_CLEANUP.  The motive
> >> behind these options is to allow table owners to easily vacuum only
> >> the TOAST table or only the main relation.  This is especially useful
> >> for TOAST tables since roles do not have access to the pg_toast schema
> >> by default and some users may find it difficult to discover the name
> >> of a relation's TOAST table.
> >
> >
> > Could you explain why one would want to do this?  Autovacuum will
> > already deal with the tables separately as needed, but I don't see when
> > a manual vacuum would want to make this distinction.
> 
> The main use case I'm targeting is when the level of bloat or
> transaction ages of a relation and its TOAST table have significantly
> diverged.  In these scenarios, it could be beneficial to be able to
> vacuum just one or the other, especially if the tables are large.

This just came up for me:

I have a daily maintenance script which pro-actively vacuums tables: freezing
historic partitions, vacuuming current tables if the table's relfrozenxid is
old, and to encourage indexonly scan.

I'm checking the greatest(age(toast,main)) and vacuum the table (and implicitly
its toast) whenever either is getting old.
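
Roughly, the check looks like this (a simplified sketch, not the actual script):

-- order tables by the older of main/TOAST relfrozenxid age
SELECT c.oid::regclass AS table_name,
       greatest(age(c.relfrozenxid), age(t.relfrozenxid)) AS xid_age
FROM pg_class c
LEFT JOIN pg_class t ON t.oid = c.reltoastrelid
WHERE c.relkind IN ('r', 'm')
ORDER BY xid_age DESC;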

But it'd be more ideal if I could independently vacuum the main table if it's
old, but not the toast table.

-- 
Justin




Re: [PATCH] Full support for index LP_DEAD hint bits on standby

2021-01-27 Thread Michail Nikolaev
Hello, hackers.

I think I was able to fix the issue related to minRecoveryPoint and crash
recovery. To make sure standby will be consistent after crash recovery, we
need to take the current value of minRecoveryPoint into account while
setting LP_DEAD hints (almost the same way as it is done for *heap* hint
bits already).

I have introduced a new structure, IndexHintBitsData:
---
typedef struct IndexHintBitsData
{
    /* guaranteed not visible for all backends */
    bool            all_dead;

    /* latest removed xid, if known */
    TransactionId   latest_removed_xid;

    /* lsn of the page where the dead tuple is located */
    XLogRecPtr      page_lsn;
} IndexHintBitsData;
---

This structure is filled in by the `heap_hot_search_buffer` function. Afterwards,
we decide whether or not to set `kill_prior_tuple` depending on its content
(by calling `IsMarkBufferDirtyIndexHintAllowed`).

For primary - it is always safe to set LP_DEAD in index if `all_dead` ==
true.

In the case of standby, we need to check `latest_removed_xid` (if
available) first. If commit LSN of the latest removed xid is already lower
than minRecoveryPoint (`XLogNeedsFlush`) - it is safe to set
`kill_prior_tuple`.

Sometimes we are not sure about the latest removed xid - heap record could
be marked dead by the XLOG_HEAP2_CLEAN record, for example. In such a case
we check the LSN of the *heap* page containing the tuple (LSN could be
updated by other transactions already - but it does not matter in that
situation). If page LSN is lower than minRecoveryPoint - it is safe to set
LP_DEAD in the index too. Otherwise - just leave the index tuple alive.
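
In rough C form, the rule above is something like this (illustrative sketch only,
with a made-up function name, not the patch code):

/*
 * Sketch: assumes removed_xid_commit_lsn was already looked up for
 * latest_removed_xid when that xid is known.
 */
static bool
index_hint_bit_is_safe(IndexHintBitsData *hint, XLogRecPtr removed_xid_commit_lsn)
{
    if (!hint->all_dead)
        return false;

    if (!RecoveryInProgress())
        return true;            /* primary: always safe when dead to all */

    /* standby: only if the relevant LSN is already below minRecoveryPoint */
    if (TransactionIdIsValid(hint->latest_removed_xid))
        return !XLogNeedsFlush(removed_xid_commit_lsn);

    return !XLogNeedsFlush(hint->page_lsn);
}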


So, to bring it all together:

* Normal operation, proc->indexIgnoreKilledTuples is true:
  It is safe for standby to use hint bits from the primary FPI because
of XLOG_INDEX_HINT_BITS_HORIZON conflict resolution.
  It is safe for standby to set its index hint bits because
`ComputeXidHorizons` honors other read-only procs xmin and lowest xid on
primary (`KnownAssignedXidsGetOldestXmin`).

* Normal operation, proc->indexIgnoreKilledTuples is false:
  Index hint bits are never set or taken into account.

* Crash recovery, proc->indexIgnoreKilledTuples is true:
  It is safe for standby to use hint bits from the primary FPW because
XLOG_INDEX_HINT_BITS_HORIZON is always logged before FPI, and commit record
of transaction removed the tuple is logged before
XLOG_INDEX_HINT_BITS_HORIZON. So, if FPI with hints was flushed (and taken
into account by minRecoveryPoint) - both transaction-remover and horizon
records are replayed before reading queries.
  It is safe for standby to use its hint bits because they can be set
only if the commit record of transaction-remover is lower than
minRecoveryPoint or LSN of heap page with removed tuples is lower than
minRecoveryPoint.

* Crash recovery, proc->indexIgnoreKilledTuples is false:
  Index hint bits are never set or taken into account.

So, now it seems correct to me.

Another interesting point here: the position of minRecoveryPoint now affects
performance a lot. That is already happening (because of *heap* hint bits),
but after the patch it is even more noticeable. Does it make sense to keep
minRecoveryPoint at a low value?

Rebased and updated patch in attachment.

Will be happy if someone could recheck my ideas or even the code :)

Thanks a lot,
Michail.
diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index 96442ceb4e..6399184a8c 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -10,6 +10,7 @@
 #-
 
 EXTRA_INSTALL=contrib/test_decoding
+EXTRA_INSTALL+=contrib/pageinspect
 
 subdir = src/test/recovery
 top_builddir = ../../..
diff --git a/src/test/recovery/t/022_index_hint_bits.pl b/src/test/recovery/t/022_index_hint_bits.pl
new file mode 100644
index 00..737dca0185
--- /dev/null
+++ b/src/test/recovery/t/022_index_hint_bits.pl
@@ -0,0 +1,282 @@
+# Checks that index hints on standby work as expected.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 15;
+use Config;
+
+# Initialize primary node
+my $node_primary = get_new_node('primary');
+$node_primary->init(allows_streaming => 1);
+$node_primary->append_conf('postgresql.conf', qq{
+autovacuum = off
+enable_seqscan = off
+enable_indexonlyscan = off
+});
+$node_primary->start;
+
+$node_primary->safe_psql('postgres', 'CREATE EXTENSION pageinspect');
+# Create test table with primary index
+$node_primary->safe_psql(
+'postgres', 'CREATE TABLE test_index_hint (id int, value int)');
+$node_primary->safe_psql(
+'postgres', 'CREATE INDEX test_index ON test_index_hint (value, id)');
+# Fill some data to it, note to not put a lot of records to avoid
+# heap_page_prune_opt call which cause conflict on recovery hiding conflict
+# caused due index hint bits
+$node_primary->safe_psql('postgres',
+'INSERT INTO test_index_hint VALUES (generate_series(1, 30), 0)');
+# And vacuum to allow index hint bits to be set

Re: Thoughts on "killed tuples" index hint bits support on standby

2021-01-27 Thread Michail Nikolaev
Hello, hackers.

Sorry for necroposting, but if someone is interested - I hope the patch is
ready now and available in the other thread (1).

Thanks,
Michail.

[1]
https://www.postgresql.org/message-id/flat/CANtu0oiP18H31dSaEzn0B0rW6tA_q1G7%3D9Y92%2BUS_WHGOoQevg%40mail.gmail.com


Re: [HACKERS] GSoC 2017: Foreign Key Arrays

2021-01-27 Thread Joel Jacobson
Hi Mark,

On Wed, Jan 27, 2021, at 10:58, Mark Rofail wrote:
>Kindly find the updated patch (v15) below

I think you forgot to attach the patch.

/Joel

protect pg_stat_statements_info() for being used without the library loaded

2021-01-27 Thread Jaime Casanova
Hi,

Attached is a small patch for ${subject}

--
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 72a117fc19..62cccbfa44 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1898,6 +1898,11 @@ pg_stat_statements_info(PG_FUNCTION_ARGS)
 	Datum		values[PG_STAT_STATEMENTS_INFO_COLS];
 	bool		nulls[PG_STAT_STATEMENTS_INFO_COLS];
 
+	if (!pgss || !pgss_hash)
+		ereport(ERROR,
+(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("pg_stat_statements must be loaded via shared_preload_libraries")));
+
 	/* Build a tuple descriptor for our result type */
 	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
 		elog(ERROR, "return type must be a row type");


Re: Allow matching whole DN from a client certificate

2021-01-27 Thread Jacob Champion
On Tue, 2021-01-26 at 18:43 +, Jacob Champion wrote:
> On Tue, 2021-01-26 at 13:49 +0100, Daniel Gustafsson wrote:
> > The OpenSSL X509_NAME_cmp function use RFC 5280 section 7.1 and RFC 4517
> > section 4.2.15 (which in turn reference RFC 4514 for the DN string format).
> > libnss has CERT_AsciiToName which is referencing RFCs 1485, 1779 and 2253 in
> > lib/certdb/alg1485.c.  Comparing the two would be interesting.
> 
> Yeah. I'll poke around a bit.

Here's some output from a test program I threw together, which parses
identical DER using OpenSSL and NSS and writes the corresponding string
representation.

> input/basic.conf:
> ssl: CN=pchampion,OU=VMware
> nss: CN=pchampion,OU=VMware
> 
> input/escape.conf:
> ssl: CN=\,\+\\\;\"\<\>
> nss: CN=",+\\;\"<>"
> 
> input/multivalue.conf:
> ssl: CN=pchampion+SN=Champion+GN=Jacob,OU=VMware
> nss: givenName=Jacob+SN=Champion+CN=pchampion,OU=VMware
> 
> input/unicode.conf:
> ssl: CN=οδυσσέας,OU=VMware
> nss: CN=οδυσσέας,OU=VMware
> 
> input/unprintable.conf:
> ssl: CN=\01\,\02\,\03,OU=\01\02\03
> nss: CN="\01,\02,\03",OU=\01\02\03

basic.conf is exactly what it looks like: CN=pchampion,OU=VMware. Both
implementations agree.

escape.conf contains a CN with the literal value

,+\;"<>

and you can see that NSS doesn't follow RFC 4514 here; it uses the
older double-quoting form of escaping. There's a 15-year-old bug on
this in NSS [1].

multivalue.conf contains a multivalued AVA with commonName "pchampion",
givenName "Jacob", and surname "Champion". They aren't sorted in the
same order, and the implementations even disagree on how to represent
the givenName attribute. (I'm not convinced that either choice is RFC-
4514-compliant; it doesn't look like GN is registered with IANA as a
short name for givenName.)

unicode.conf contains a commonName of "οδυσσέας". Both implementations
agree, but the only way I was able to get OpenSSL to produce this
(rather than a string of escaped hex) was by using the flags

XN_FLAG_RFC2253 & ~ASN1_STRFLGS_ESC_MSB

in the call to X509_NAME_print_ex(). This should work fine for a
database encoding of UTF-8, but would we need to convert for other
encodings? Also, I'm not sure how this would handle certificates that
aren't UTF-8 encoded. It looks like some UCS variants are legal?
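
For reference, the OpenSSL side of the test renders the DN with roughly the
following (a sketch; assume cert is the already-parsed X509 *, error handling
omitted):

BIO        *membuf = BIO_new(BIO_s_mem());
X509_NAME  *subject = X509_get_subject_name(cert);
char       *dn;
long        len;

X509_NAME_print_ex(membuf, subject, 0, XN_FLAG_RFC2253 & ~ASN1_STRFLGS_ESC_MSB);
len = BIO_get_mem_data(membuf, &dn);    /* dn/len now hold the rendered DN */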

unprintable.conf contains the bytes 0x01, 0x02, and 0x03 in the
commonName and organizationalUnit. They're backslash-escaped by both
implementations, but if you add any other printable escaped characters
(such as comma, in the CN example here) NSS will still double-quote the
whole thing.

--Jacob

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=355096


Re: archive status ".ready" files may be created too early

2021-01-27 Thread Bossart, Nathan
On 1/26/21, 6:36 PM, "Kyotaro Horiguchi"  wrote:
> At Tue, 26 Jan 2021 19:13:57 +, "Bossart, Nathan"  
> wrote in
>> On 12/17/20, 9:15 PM, "Kyotaro Horiguchi"  wrote:
>> > At Thu, 17 Dec 2020 22:20:35 +, "Bossart, Nathan" 
>> >  wrote in
>> >> On 12/15/20, 2:33 AM, "Kyotaro Horiguchi"  wrote:
>> >> > You're right in that regard. There's a window where partial record is
>> >> > written when write location passes F0 after insertion location passes
>> >> > F1. However, remembering all spanning records seems overkilling to me.
>> >>
>> >> I'm curious why you feel that recording all cross-segment records is
>> >> overkill.  IMO it seems far simpler to just do that rather than try to
>> >
>> > Sorry, my words are not enough. Remembering all spanning records in
>> > *shared memory* seems to be overkilling.  Much more if it is stored in
>> > shared hash table.  Even though it rarely the case, it can fail hard
>> > way when reaching the limit. If we could do well by remembering just
>> > two locations, we wouldn't need to worry about such a limitation.
>>
>> I don't think it will fail if we reach max_size for the hash table.
>> The comment above ShmemInitHash() has this note:
>>
>> * max_size is the estimated maximum number of hashtable entries.  This is
>> * not a hard limit, but the access efficiency will degrade if it is
>> * exceeded substantially (since it's used to compute directory size and
>> * the hash table buckets will get overfull).
>
> That description means that a shared hash has a directory with fixed
> size thus there may be synonyms, which causes degradation. Even though
> buckets are preallocated with the specified number, since the minimum
> directory size is 256, buckets are allocated at least 256 in a long
> run. Minimum on-the-fly allocation size is 32. I haven't calculated
> further precisely, but I'm worried about the amount of spare shared
> memory the hash can allocate.

On my machine, hash_estimate_size() for the table returns 5,968 bytes.
That estimate is for a max_size of 16.  In my testing, I've been able
to need up to 6 elements in this table, but that required turning off
synchronous_commit, adding a long sleep at the end of XLogWrite(), and
increasing wal_buffers substantially.  This leads me to think that a
max_size of 16 elements is typically sufficient.  (I may have also
accidentally demonstrated that only storing two record boundaries
could be insufficient.)

>> > Another concern about the concrete patch:
>> >
>> > NotifySegmentsReadyForArchive() searches the shared hash, acquiring a
>> > LWLock every time XLogWrite is called while segment archive is being
>> > held off.  I don't think it is acceptable and I think it could be a
>> > problem when many backends are competing on WAL.
>>
>> This is a fair point.  I did some benchmarking with a few hundred
>> connections all doing writes, and I was not able to discern any
>> noticeable performance impact.  My guess is that contention on this
>> new lock is unlikely because callers of XLogWrite() must already hold
>> WALWriteLock.  Otherwise, I believe we only acquire ArchNotifyLock no
>> more than once per segment to record new record boundaries.
>
> Thanks. I agree that the reader-reader contention is not a problem due
> to existing serialization by WALWriteLock. Adding an entry happens
> only at segment boundary so the ArchNotifyLock doesn't seem to be a
> problem.
>
> However the function prolongs the WALWriteLock section. Couldn't we
> somehow move the call to NotifySegmentsReadyForArchive in XLogWrite
> out of the WALWriteLock section?

I don't see a clean way to do that.  XLogWrite() assumes that
WALWriteLock is held when it is called, and it doesn't release it at
any point.  I think we'd have to move NotifySegmentsReadyForArchive()
to the callers of XLogWrite() if we wanted to avoid holding onto
WALWriteLock for longer.  Unless we can point to a measurable
performance penalty, I'm not sure this is worth it.

Nathan



Re: [HACKERS] GSoC 2017: Foreign Key Arrays

2021-01-27 Thread Joel Jacobson
Hi Mark,

On Wed, Jan 27, 2021, at 20:34, Mark Rofail wrote:
>Here it is. 
>Array-ELEMENT-foreign-key-v15.patch

Thanks.

I've tested it and notice there still seems to be a problem with int2vector?

Reading your previous comment a few messages ago,
it sounds like this was fixed, but perhaps not?

This is the comment that made me think it was fixed:

On Sun, Jan 24, 2021, at 21:46, Mark Rofail wrote:
>Hello again Joel,
>>UPDATE catalog_clone.pg_index SET indkey = '1 2 12345'::int2vector WHERE 
>>indexrelid = 2837;
>>ERROR:  operator does not exist: int2vector pg_catalog.@> smallint[]
>>LINE 1: ...WHERE "attrelid" OPERATOR(pg_catalog.=) $1 AND $2 OPERATOR(p...
>I apologize for my rash response, I did not quite understand your example at 
>first, it appears the UPDATE statement is >neither over the referencing nor 
>the referenced columns, I understand the problem now, please disregard the 
>previous >email. Thank you for this find, please find the fix below
>
>Changelog:
>- v14 (compatible with current master 2021-01-24, commit 
>0c1e8845f28bd07ad381c8b0d6701575d967b88e)

Here is a small test causing that still fails on v15:

CREATE TABLE a (
  a_id smallint NOT NULL,
  PRIMARY KEY (a_id)
);

CREATE TABLE b (
b_id integer NOT NULL,
a_ids int2vector NOT NULL,
PRIMARY KEY (b_id)
);

ALTER TABLE b ADD FOREIGN KEY (EACH ELEMENT OF a_ids) REFERENCES a(a_id);

INSERT INTO a (a_id) VALUES (1);
INSERT INTO a (a_id) VALUES (2);
INSERT INTO b (b_id, a_ids) VALUES (3, '1 2'::int2vector);


ERROR:  operator does not exist: smallint[] pg_catalog.<@ int2vector
LINE 1: ..."."a" x WHERE ARRAY ["a_id"]::pg_catalog.anyarray OPERATOR(p...
 ^
HINT:  No operator matches the given name and argument types. You might need to 
add explicit type casts.
QUERY:  SELECT 1 WHERE (SELECT pg_catalog.count(DISTINCT y) FROM 
pg_catalog.unnest($1) y) OPERATOR(pg_catalog.=) (SELECT pg_catalog.count(*) 
FROM (SELECT 1 FROM ONLY "public"."a" x WHERE ARRAY 
["a_id"]::pg_catalog.anyarray OPERATOR(pg_catalog.<@) $1::pg_catalog.anyarray 
FOR KEY SHARE OF x) z)

/Joel

Re: Allow matching whole DN from a client certificate

2021-01-27 Thread Andrew Dunstan


On 1/27/21 3:14 PM, Jacob Champion wrote:
> On Tue, 2021-01-26 at 18:43 +, Jacob Champion wrote:
>> On Tue, 2021-01-26 at 13:49 +0100, Daniel Gustafsson wrote:
>>> The OpenSSL X509_NAME_cmp function use RFC 5280 section 7.1 and RFC 4517
>>> section 4.2.15 (which in turn reference RFC 4514 for the DN string format).
>>> libnss has CERT_AsciiToName which is referencing RFCs 1485, 1779 and 2253 in
>>> lib/certdb/alg1485.c.  Comparing the two would be interesting.
>> Yeah. I'll poke around a bit.
> Here's some output from a test program I threw together, which parses
> identical DER using OpenSSL and NSS and writes the corresponding string
> representation.
>
>> input/basic.conf:
>> ssl: CN=pchampion,OU=VMware
>> nss: CN=pchampion,OU=VMware
>>
>> input/escape.conf:
>> ssl: CN=\,\+\\\;\"\<\>
>> nss: CN=",+\\;\"<>"
>>
>> input/multivalue.conf:
>> ssl: CN=pchampion+SN=Champion+GN=Jacob,OU=VMware
>> nss: givenName=Jacob+SN=Champion+CN=pchampion,OU=VMware
>>
>> input/unicode.conf:
>> ssl: CN=οδυσσέας,OU=VMware
>> nss: CN=οδυσσέας,OU=VMware
>>
>> input/unprintable.conf:
>> ssl: CN=\01\,\02\,\03,OU=\01\02\03
>> nss: CN="\01,\02,\03",OU=\01\02\03
> basic.conf is exactly what it looks like: CN=pchampion,OU=VMware. Both
> implementations agree.
>
> escape.conf contains a CN with the literal value
>
> ,+\;"<>
>
> and you can see that NSS doesn't follow RFC 4514 here; it uses the
> older double-quoting form of escaping. There's a 15-year-old bug on
> this in NSS [1].
>
> multivalue.conf contains a multivalued AVA with commonName "pchampion",
> givenName "Jacob", and surname "Champion". They aren't sorted in the
> same order, and the implementations even disagree on how to represent
> the givenName attribute. (I'm not convinced that either choice is RFC-
> 4514-compliant; it doesn't look like GN is registered with IANA as a
> short name for givenName.)
>
> unicode.conf contains a commonName of "οδυσσέας". Both implementations
> agree, but the only way I was able to get OpenSSL to produce this
> (rather than a string of escaped hex) was by using the flags
>
> XN_FLAG_RFC2253 & ~ASN1_STRFLGS_ESC_MSB
>
> in the call to X509_NAME_print_ex(). This should work fine for a
> database encoding of UTF-8, but would we need to convert for other
> encodings? Also, I'm not sure how this would handle certificates that
> aren't UTF-8 encoded. It looks like some UCS variants are legal?
>
> unprintable.conf contains the bytes 0x01, 0x02, and 0x03 in the
> commonName and organizationalUnit. They're backslash-escaped by both
> implementations, but if you add any other printable escaped characters
> (such as comma, in the CN example here) NSS will still double-quote the
> whole thing.
>
> --Jacob
>
> [1] https://bugzilla.mozilla.org/show_bug.cgi?id=355096



OK, that bug is a bit ugly.


I'm not sure where we want to go with the present patch. Do we want to
go with what we have and document the limitations more, or try to do
something more elaborate? If so, exactly what?


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2021-01-27 Thread Alexey Kondratov

On 2021-01-27 06:14, Michael Paquier wrote:

> On Wed, Jan 27, 2021 at 01:00:50AM +0300, Alexey Kondratov wrote:
>> In the new 0002 I moved ACL check to the upper level, i.e. ExecReindex(),
>> and removed expensive text generation in test. Not touched yet some of
>> your previously raised concerns. Also, you made SetRelationTableSpace()
>> to accept Relation instead of Oid, so now we have to open/close indexes
>> in the ReindexPartitions(), I am not sure that I use proper locking
>> there, but it works.
>
> Passing down Relation to the new routines makes the most sense to me
> because we force the callers to think about the level of locking
> that's required when doing any tablespace moves.
>
> +   Relation iRel = index_open(partoid, ShareLock);
> +
> +   if (CheckRelationTableSpaceMove(iRel, params->tablespaceOid))
> +   SetRelationTableSpace(iRel,
> + params->tablespaceOid,
> + InvalidOid);
> Speaking of which, this breaks the locking assumptions of
> SetRelationTableSpace().  I feel that we should think harder about
> this part for partitioned indexes and tables because this looks rather
> unsafe in terms of locking assumptions with partition trees.  If we
> cannot come up with a safe solution, I would be fine with disallowing
> TABLESPACE in this case, as a first step.  Not all problems have to be
> solved at once, and even without this part the feature is still
> useful.

I have read more about lock levels and ShareLock should prevent any kind of
physical modification of indexes. We already hold ShareLock doing
find_all_inheritors(), which is higher than ShareUpdateExclusiveLock, so
using ShareLock seems to be safe here, but I will look on it closer.

> +   /* It's not a shared catalog, so refuse to move it to shared tablespace */
> +   if (params->tablespaceOid == GLOBALTABLESPACE_OID)
> +   ereport(ERROR,
> +   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +errmsg("cannot move non-shared relation to tablespace \"%s\"",
> +get_tablespace_name(params->tablespaceOid;
> Why is that needed if CheckRelationTableSpaceMove() is used?

This is from ReindexRelationConcurrently() where we do not use
CheckRelationTableSpaceMove(). For me it makes sense to add only this
GLOBALTABLESPACE_OID check there, since before we already check for
system catalogs and after for temp relations, so adding
CheckRelationTableSpaceMove() will be a double-check.

> - indexRelation->rd_rel->reltablespace,
> + OidIsValid(tablespaceOid) ?
> +   tablespaceOid : indexRelation->rd_rel->reltablespace,
> Let's remove this logic from index_concurrently_create_copy() and let
> the caller directly decide the tablespace to use, without a dependency
> on InvalidOid in the inner routine.  A share update exclusive lock is
> already hold on the old index when creating the concurrent copy, so
> there won't be concurrent schema changes.

Changed.

Also added tests for ACL checks, relfilenode changes. Added ACL recheck
for multi-transactional case. Added info about TOAST index reindexing.
Changed some comments.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres CompanyFrom f176a6e5a81ab133fee849f72e4edb8b287d6062 Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Wed, 27 Jan 2021 00:46:17 +0300
Subject: [PATCH v8] Allow REINDEX to change tablespace

REINDEX already does full relation rewrite, this patch adds a
possibility to specify a new tablespace where new relfilenode
will be created.
---
 doc/src/sgml/ref/reindex.sgml |  31 +++-
 src/backend/catalog/index.c   |  50 +-
 src/backend/commands/indexcmds.c  | 112 -
 src/bin/psql/tab-complete.c   |   4 +-
 src/include/catalog/index.h   |   9 +-
 src/test/regress/input/tablespace.source  | 106 +
 src/test/regress/output/tablespace.source | 181 ++
 7 files changed, 481 insertions(+), 12 deletions(-)

diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 627b36300c..e610a0f52c 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -27,6 +27,7 @@ REINDEX [ ( option [, ...] ) ] { IN
 
 CONCURRENTLY [ boolean ]
 VERBOSE [ boolean ]
+TABLESPACE new_tablespace
 
  
 
@@ -187,6 +188,21 @@ REINDEX [ ( option [, ...] ) ] { IN
 

 
+   
+TABLESPACE
+
+ 
+  Specifies that indexes will be rebuilt on a new tablespace.
+  Cannot be used with "mapped" and system (unless allow_system_table_mods
+  is set to TRUE) relations. If SCHEMA,
+  DATABASE or SYSTEM are specified,
+  then all "mapped" and system relations will be skipped and a single
+  WARNING will be generated. Indexes on TOAST tables
+  are reindexed, but not moved the new tablespace.
+ 
+
+   
+

 

Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2021-01-27 Thread Justin Pryzby
Thanks for updating the patch.  I have just a couple comments on the new (and
old) language.

On Thu, Jan 28, 2021 at 12:19:06AM +0300, Alexey Kondratov wrote:
> Also added tests for ACL checks, relfilenode changes. Added ACL recheck for
> multi-transactional case. Added info about TOAST index reindexing. Changed
> some comments.

> +  Specifies that indexes will be rebuilt on a new tablespace.
> +  Cannot be used with "mapped" and system (unless 
> allow_system_table_mods

say mapped *or* system relations
Or maybe:
mapped or (unless >allow_system_table_mods<) system relations.

> +  is set to TRUE) relations. If 
> SCHEMA,
> +  DATABASE or SYSTEM are specified,
> +  then all "mapped" and system relations will be skipped and a single
> +  WARNING will be generated. Indexes on TOAST tables
> +  are reindexed, but not moved the new tablespace.

moved *to* the new tablespace.
I don't know if that needs to be said at all.  We talked about it a lot to
arrive at the current behavior, but I think that's only due to the difficulty
of correcting the initial mistake.

> + /*
> +  * Set the new tablespace for the relation.  Do that only in the
> +  * case where the reindex caller wishes to enforce a new tablespace.

I'd say just "/* Set new tablespace, if requested */
You wrote something similar in an earlier revision of your refactoring patch.

> +  * Mark the relation as ready to be dropped at transaction 
> commit,
> +  * before making visible the new tablespace change so as this 
> won't
> +  * miss things.

This comment is vague.  I think Michael first wrote this comment about a year
ago.  Does it mean "so the new tablespace won't be missed" ?  Missed by what ?

-- 
Justin




Re: [HACKERS] GSoC 2017: Foreign Key Arrays

2021-01-27 Thread Mark Rofail
>
>
> Hey Joel,

> I apologize for my rash response, I did not quite understand your example
> at first, it appears the UPDATE statement is neither over the
> referencing nor the referenced columns
>
The v14 patch fixed the issue where an error would occur for an UPDATE
statement that didn't target the referencing or referenced columns.

Vectors as referencing columns are not supported and are out of scope for
this patch. Try to use arrays instead.

/Mark


Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2021-01-27 Thread Alvaro Herrera
On 2021-Jan-28, Alexey Kondratov wrote:

> I have read more about lock levels and ShareLock should prevent any kind of
> physical modification of indexes. We already hold ShareLock doing
> find_all_inheritors(), which is higher than ShareUpdateExclusiveLock, so
> using ShareLock seems to be safe here, but I will look on it closer.

You can look at lock.c where LockConflicts[] is; that would tell you
that ShareLock indeed conflicts with ShareUpdateExclusiveLock ... but it
does not conflict with itself!  So it would be possible to have more
than one process doing this thing at the same time, which surely makes
no sense.

I didn't look at the patch closely enough to understand why you're
trying to do something like CLUSTER, VACUUM FULL or REINDEX without
holding full AccessExclusiveLock on the relation.  But do keep in mind
that once you hold a lock on a relation, trying to grab a weaker lock
afterwards is pretty pointless.

-- 
Álvaro Herrera39°49'30"S 73°17'W
"E pur si muove" (Galileo Galilei)




Index predicate locking and serializability contention

2021-01-27 Thread Marcelo Zabani
Hi!

We're currently having issues with serializable contention at our shop, and
after tracking it down very carefully, we found that there are two main
reasons for one such conflict:
1. Page-level predicate locks on primary key indexes, whose associated
column gets their Id from a sequence.
2. An empty table which gets inserted to but has those inserted rows
deleted before committing.


We're confident they are the only remaining impediments to allowing
transactions not to conflict with each other, because we have changed the
code just in the right places to make sure that no conflicts arise when we
do both of:
- In the first case, the sequence's nextval and increment are set so that
the first transaction gets an Id that is on a different index page than the
Id the second transaction will get.
- Not writing to the table that once got inserted to and emptied. Before
this, we also tried setting enable_seqscan to off and inspecting the query
plans and SIReadLocks carefully before committing to make sure sequential
scans were avoided, but it wasn't sufficient.

I believe in the first case the problem is one of granularity which has
been mentioned before at
https://www.postgresql.org/message-id/flat/20110503064807.GB85173%40csail.mit.edu#836599e3c18caf54052114d46f929cbb
).
In the second case, I believe part of the problem could be due to how empty
tables are predicately locked - according to
https://dba.stackexchange.com/questions/246179/postgresql-serialisation-failure-on-different-ids
.

In our case, we use empty tables to keep complex invariants checked at the
DB level by inserting into them with triggers and making sure deferrable
constraints will fail if the rows are still there (thus forcing the
committer to run a "consistency-enforcing" job before committing).
I'm not sure if our use-case is too particular, but we have found in
general that having little data - which some of our tables do, and will
still have for the foreseeable future - is sometimes worse than having lots
of it due to index locking granularity being at least at page-level.
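
For the empty-table case, here is a minimal sketch of the kind of schedule that
bites us (table and column names are made up for illustration):

-- pending_checks stands in for one of our trigger-maintained tables
CREATE TABLE pending_checks (id int PRIMARY KEY);   -- starts (and stays) empty

-- session 1:
BEGIN ISOLATION LEVEL SERIALIZABLE;
INSERT INTO pending_checks VALUES (1);
SELECT count(*) FROM pending_checks;        -- the "consistency check" read
DELETE FROM pending_checks WHERE id = 1;

-- session 2, concurrently:
BEGIN ISOLATION LEVEL SERIALIZABLE;
INSERT INTO pending_checks VALUES (2);
SELECT count(*) FROM pending_checks;        -- the "consistency check" read
DELETE FROM pending_checks WHERE id = 2;

-- session 1: COMMIT;   -- succeeds
-- session 2: COMMIT;   -- typically fails with
--   ERROR: could not serialize access due to read/write dependencies among transactions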

So I have a few questions:
- Would index-key / index-gap locking avoid creating serialization
anomalies for inserts of consecutive Ids that currently fall in the same
index page? Is it in the roadmap?
- A colleague made a suggestion which I found no mention of anywhere: would
it be possible not to predicate-lock on indices for insertion into
GENERATED AS IDENTITY columns, unless of course in the case of UPDATE,
INSERT INTO .. OVERRIDING, ALTER TABLE .. RESTART WITH or other similarly
conflicting statements?
- Is there something that can be done for the problem with empty tables?

We currently use Postgres 11, and anything that could help us change how we
approach the problem on our side is very much welcome too!

Thanks in advance,
Marcelo.


Re: Is it useful to record whether plans are generic or custom?

2021-01-27 Thread Tatsuro Yamada

On 2020/12/04 14:29, Fujii Masao wrote:


> On 2020/11/30 15:24, Tatsuro Yamada wrote:
>> Hi Torikoshi-san,
>>
>>> In this patch, exposing new columns is mandatory, but I think
>>> it's better to make it optional by adding a GUC something
>>> like 'pgss.track_general_custom_plans.
>>>
>>> I also feel it makes the number of columns too many.
>>> Just adding the total time may be sufficient.
>>
>> I think this feature is useful for DBA. So I hope that it gets
>> committed to PG14. IMHO, many columns are Okay because DBA can
>> select specific columns by their query.
>> Therefore, it would be better to go with the current design.
>
> But that design may waste lots of memory. No? For example, when
> plan_cache_mode=force_custom_plan, the memory used for the columns
> for generic plans is not used.
>
> Regards,



Sorry for the super delayed reply.
I don't think so, because I suppose a DBA uses plan_cache_mode only when
facing an inefficient execution plan. And the parameter will typically be
used as a session-level GUC parameter, not a database-level one.


Regards,
Tatsuro Yamada







Re: Is it useful to record whether plans are generic or custom?

2021-01-27 Thread Tatsuro Yamada

Torikoshi-san,

On 2020/12/04 15:03, torikoshia wrote:


> ISTM now that creating pg_stat_statements_xxx views
> both for generic and custom plans is better than my PoC patch.
>
> And I'm also struggling with the following.
>
> | However, I also began to wonder how effective it would be to just
> | distinguish between generic and custom plans.  Custom plans can
> | include all sorts of plans. and thinking cache validation, generic
> | plans can also include various plans.
>
> | Considering this, I'm starting to feel that it would be better to
> | not just keeping whether generic or custom but the plan itself as
> | discussed in the below thread.
>
> Yamada-san,
>
> Do you think it's effective to just distinguish between generic
> and custom plans?


Sorry for the super delayed reply.

Ah, it's okay.
From a usability perspective, it would be better to expose both kinds of
information through a single view. However, it is also fine to split the
information into two views in order to use memory more effectively.

Regards,
Tatsuro Yamada





Re: Is it useful to record whether plans are generic or custom?

2021-01-27 Thread Tatsuro Yamada

Horiguchi-san,

On 2020/12/04 15:37, Kyotaro Horiguchi wrote:

>> And I'm also struggling with the following.
>>
>> | However, I also began to wonder how effective it would be to just
>> | distinguish between generic and custom plans.  Custom plans can
>> | include all sorts of plans. and thinking cache validation, generic
>> | plans can also include various plans.
>>
>> | Considering this, I'm starting to feel that it would be better to
>> | not just keeping whether generic or custom but the plan itself as
>> | discussed in the below thread.
>
> FWIW, that seems to me to be like some existing extension modules,
> pg_stat_plans or pg_store_plans.  The former is faster but may lose
> plans, the latter doesn't lose plans but is slower.  I feel that we'd
> better consider a simpler feature if we are intending it to be part of
> a contrib module.


There is also pg_show_plans.
Ideally, it would be better to be able to track all of the plan changes by
checking some view, since plan stability is important for DBAs when they
use PostgreSQL in mission-critical systems.
I would prefer that the feature be released as a contrib module. :-D

Regards,
Tatsuro Yamada
 






Re: Is it useful to record whether plans are generic or custom?

2021-01-27 Thread Tatsuro Yamada

Hi Torikoshi-san,

On 2021/01/12 20:36, torikoshia wrote:

> I suppose it would be normal practice to store past results of
> pg_stat_statements for future comparisons.
> If this is the case, I think that if we only add the number of
> generic plan execution, it will give us a hint to notice the cause
> of performance degradation due to changes in the plan between
> generic and custom.
>
> For example, if there is a clear difference in the number of times
> the generic plan is executed between before and after performance
> degradation as below, it would be natural to check if there is a
> problem with the generic plan.
>
> ...
>
> Attached a patch that just adds a generic call counter to
> pg_stat_statements.



I think I'd like to use the view when we face a performance problem and
need to find the reason. If we do fixed-point observation (should I say
time-series analysis?) of generic_calls, it lets us notice changes in the
counter, and we can tell whether the generic plan is the suspect or not.
So the patch helps DBAs, I believe.
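
For example, capturing snapshots like the following on a schedule and diffing
them would make such a change visible (a sketch; the snapshot table name is
made up, and generic_calls stands for the counter the patch would add):

CREATE TABLE pgss_snapshot AS
SELECT now() AS captured_at, queryid, query, calls, generic_calls
FROM pg_stat_statements
WITH NO DATA;

-- run on a fixed schedule (cron etc.), then diff consecutive captured_at batches
INSERT INTO pgss_snapshot
SELECT now(), queryid, query, calls, generic_calls
FROM pg_stat_statements;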

Regards,
Tatsuro Yamada





Re: vacuum_cost_page_miss default value and modern hardware

2021-01-27 Thread Peter Geoghegan
On Thu, Jan 21, 2021 at 5:12 PM Peter Geoghegan  wrote:
> I'm going to go ahead with committing my patch to lower the default
> next week. If anybody has any objections to that plan, please speak
> up.

Just pushed a commit that reduced the default for vacuum_cost_page_miss to 2.

One more thing on this: I noticed that the docs for these settings say
"There are many situations where it is not important that maintenance
commands like VACUUM and ANALYZE finish quickly". That now seems a
little dubious to me -- I wonder if we should refresh it.

There are not that many problems that can be solved by making VACUUM
slower. This is due to the visibility map, and to a lesser degree
certain key improvements in index AMs. The text that I quoted was
written in 2005, a time when the delay stuff was still very new, and
even the earliest visibility map design was still a few years away.

Thanks
--
Peter Geoghegan




Re: Perform COPY FROM encoding conversions in larger chunks

2021-01-27 Thread John Naylor
Hi Heikki,

0001 through 0003 are straightforward, and I think they can be committed
now if you like.

0004 is also pretty straightforward. The check you proposed upthread for
pg_upgrade seems like the best solution to make that workable. I'll take a
look at 0005 soon.

I measured the conversions that were rewritten in 0003, and there is indeed
a noticeable speedup:

Big5 to EUC-TW:

head196ms
0001-3  152ms

EUC-TW to Big5:

head190ms
0001-3  144ms

I've attached the driver function for reference. Example use:

select drive_conversion(
  1000, 'euc_tw'::name, 'big5'::name,
  convert('a few kB of utf8 text here', 'utf8', 'euc_tw')
);

I took a look at the test suite also, and the only thing to note is a
couple places where the comment doesn't match the code:

+  -- JIS X 0201: 2-byte encoded chars starting with 0x8e (SS2)
+  byte1 = hex('0e');
+  for byte2 in hex('a1')..hex('df') loop
+return next b(byte1, byte2);
+  end loop;
+
+  -- JIS X 0212: 3-byte encoded chars, starting with 0x8f (SS3)
+  byte1 = hex('0f');
+  for byte2 in hex('a1')..hex('fe') loop
+for byte3 in hex('a1')..hex('fe') loop
+  return next b(byte1, byte2, byte3);
+end loop;
+  end loop;

Not sure if it matters, but thought I'd mention it anyway.

--
John Naylor
EDB: http://www.enterprisedb.com


drive_conversion.c
Description: Binary data


Re: [HACKERS] GSoC 2017: Foreign Key Arrays

2021-01-27 Thread Zhihong Yu
Hi, Mark:

+   if (ARR_NDIM(arr) != 1 ||
+   ARR_HASNULL(arr) ||
+   ARR_ELEMTYPE(arr) != CHAROID)
+   elog(ERROR, "confreftype is not a 1-D char array");

I think the ARR_HASNULL(arr) condition is not reflected in the error
message.

+* Array foreign keys support only UPDATE/DELETE NO ACTION,
UPDATE/DELETE
+* RESTRICT amd DELETE CASCADE actions

I don't see CASCADE in the if condition that follows the above comment.

+   charreftype;/* FKCONSTR_REF_xxx code */

The code would be FKCONSTR_REF_EACH_ELEMENT and FKCONSTR_REF_PLAIN. I think
you can mention them in the comment.

Cheers

On Wed, Jan 27, 2021 at 11:34 AM Mark Rofail  wrote:

> Hello Joel,
>
>
>> I think you forgot to attach the patch.
>>
> Appears so, sorry about that.
>
> Here it is.
>
> /Mark
>


RE: Enhance traceability of wal_level changes for backup management

2021-01-27 Thread osumi.takami...@fujitsu.com
Hello


On Thursday, January 21, 2021 11:19 PM I wrote:
> If no one disagree with it, I'll proceed with (1) below.
> 
> (1) writing the time or LSN in the control file to indicate when/where 
> wal_level
> is changed to 'minimal'
> from upper level to invalidate the old backups or make alerts to users.
I have attached the first patch, which implements this idea.
It was aligned by pgindent and shows no regressions.

Best Regards,
Takamichi Osumi



trace_wal_level_drop_v01.patch
Description: trace_wal_level_drop_v01.patch


Re: protect pg_stat_statements_info() for being used without the library loaded

2021-01-27 Thread Julien Rouhaud
On Thu, Jan 28, 2021 at 3:53 AM Jaime Casanova
 wrote:
>
> Hi,
>
> Attached is a small patch for ${subject}

Good catch, and patch looks good to me.




Re: Add MAIN_RELATION_CLEANUP and SECONDARY_RELATION_CLEANUP options to VACUUM

2021-01-27 Thread Justin Pryzby
On Wed, Jan 27, 2021 at 11:16:26PM +, Bossart, Nathan wrote:
> On 1/27/21, 11:07 AM, "Justin Pryzby"  wrote:
> > This just came up for me:
> >
> > I have a daily maintenance script which pro-actively vacuums tables: 
> > freezing
> > historic partitions, vacuuming current tables if the table's relfrozenxid is
> > old, and to encourage indexonly scan.
> >
> > I'm checking the greatest(age(toast,main)) and vacuum the table (and 
> > implicitly
> > its toast) whenever either is getting old.
> >
> > But it'd be more ideal if I could independently vacuum the main table if 
> > it's
> > old, but not the toast table.
> 
> Thanks for chiming in.
> 
> It looks like we were leaning towards only adding the
> TOAST_TABLE_CLEANUP option, which is already implemented internally
> with VACOPT_SKIPTOAST.  It's already possible to vacuum a TOAST table
> directly, so we can probably do without the MAIN_RELATION_CLEANUP
> option.
> 
> I've attached a new patch that only adds TOAST_TABLE_CLEANUP.

Thanks, I wrote my message after running into the issue and remembered this
thread.  I didn't necessarily mean to send another patch :)

My only comment is on the name: TOAST_TABLE_CLEANUP.  "Cleanup" suggests that
the (main or only) purpose is to "clean" dead tuples to avoid bloat.  But in my
use case, the main purpose is to avoid XID wraparound (or its warnings).

Okay, my second only comment is that this:

| This option cannot be disabled when the FULL option is
| used.

Should it instead be ignored if FULL is also specified?  Currently only
PARALLEL and DISABLE_PAGE_SKIPPING cause an error when used with FULL.  That's
documented for PARALLEL, but I think it should also be documented for
DISABLE_PAGE_SKIPPING (which is however an advanced option).

-- 
Justin




Re: Thoughts on "killed tuples" index hint bits support on standby

2021-01-27 Thread Peter Geoghegan
On Wed, Jan 27, 2021 at 11:30 AM Michail Nikolaev
 wrote:
> Sorry for necroposting, but if someone is interested - I hope the patch is 
> ready now and available in the other thread (1).

I wonder if it would help to not actually use the LP_DEAD bit for
this. Instead, you could use the currently-unused-in-indexes
LP_REDIRECT bit. The general idea here is that this status bit is
interpreted as a "recently dead" bit in nbtree. It is always something
that is true *relative* to some *ephemeral* coarse-grained definition
of recently dead. Whether or not "recently dead" means "dead to my
particular MVCC snapshot" can be determined using some kind of
in-memory state that won't survive a crash (or a per-index-page
epoch?). We still have LP_DEAD bits in this world, which work in the
same way as now (and so unambiguously indicate that the index tuple is
dead-to-all, at least on the primary). I think that this idea of a
"recently dead" bit might be able to accomplish a few goals all at
once, including your specific goal of setting "hint bits" in indexes.

The issue here is that it would also be nice to use a "recently dead"
bit on the primary, but if you do that then maybe you're back to the
original problem. OTOH, I also think that we could repurpose the
LP_NORMAL bit in index AMs, so we could potentially have 3 different
definitions of dead-ness without great difficulty!

Perhaps this doesn't seem all that helpful, since I am expanding scope
here for a project that is already very difficult. And maybe it just
isn't helpful -- I don't know. But it is at least possible that
expanding scope could actually *help* your case a lot, even if you
only ever care about your original goal. My personal experience with
nbtree patches is just that: sometimes *expanding* scope actually
makes things *easier*, not harder. This is sometimes called "The
Inventor's Paradox":

https://en.wikipedia.org/wiki/Inventor%27s_paradox

Consider the example of my work on nbtree in PostgreSQL 12. It was
actually about 6 or 7 enhancements, not just 1 big enhancement -- it
is easy to forget that now. I think that it was *necessary* to add at
least 5 of these enhancements at the same time (maybe 2 or so really
were optional). This is deeply counter-intuitive, but still seems to
be true in my case. The specific reasons why I believe that this is
true of the PostgreSQL 12 work are complicated, but it boils down to
this: the ~5 related-though-distinct enhancements were individually
not all that compelling (certainly not compelling enough to make
on-disk changes for), but were much greater than the sum of their
parts when considered together. Maybe I got lucky there.

More generally, the nbtree stuff that I worked on in 12, 13, and now
14 actually feels like one big project to me. I will refrain from
explaining exactly why that is right now, but you might be very
surprised at how closely related it all is. I didn't exactly plan it
that way, but trying to see things in the most general terms turned
out to be a huge asset to me. If there are several independent reasons
to move in one particular direction all at once, you can generally
afford to be wrong about a lot of things without it truly harming
anybody. Plus it's easy to see that you will not frustrate future work
that is closely related but distinct when that future work is
*directly enabled* by what you're doing.

What's more, the extra stuff I'm talking about probably has a bunch of
other benefits on the primary, if done well. Consider how the deletion
stuff with LP_DEAD bits now considers "extra" index tuples to delete
when they're close by. We could even do something like that with these
LP_REDIRECT/recently dead bits on the primary.

I understand that it's hard to have a really long term outlook. I
think that that's almost necessary when working on a project like
this, though.

Don't forget that this works both ways. Maybe a person that wanted to
do this "recently dead" stuff (which sounds really interesting to me
right now) would similarly be compelled to consider the bigger
picture, including the question of setting hint bits on standbys --
this other person had better not make that harder in the future, for
the same reasons (obviously what you want to do here makes sense, it
just isn't everything to everybody). This is yet another way in which
expanding scope can help.

--
Peter Geoghegan




Re: Thoughts on "killed tuples" index hint bits support on standby

2021-01-27 Thread Peter Geoghegan
On Wed, Jan 27, 2021 at 5:24 PM Peter Geoghegan  wrote:
> The issue here is that it would also be nice to use a "recently dead"
> bit on the primary, but if you do that then maybe you're back to the
> original problem. OTOH, I also think that we could repurpose the
> LP_NORMAL bit in index AMs, so we could potentially have 3 different
> definitions of dead-ness without great difficulty!

To be clear, what I mean is that you currently have two bits in line
pointers. In an nbtree leaf page we only currently use one -- the
LP_DEAD bit. But bringing the second bit into it allows us to have a
representation for two additional states (not just one), since of
course the meaning of the second bit can be interpreted using the
LP_DEAD bit. You could for example having encodings for each of the
following distinct per-LP states, while still preserving on-disk
compatibility:

"Not known to be dead in any sense" (0), "Unambiguously dead to all"
(what we now simply call LP_DEAD), "recently dead on standby"
(currently-spare bit is set), and "recently dead on primary" (both
'lp_flags' bits set).

Applying FPIs on the standby would have to be careful to preserve a
standby-only bit. I'm probably not thinking of every subtlety, but
"putting all of the pieces of the puzzle together for further
consideration" is likely to be a useful exercise.

-- 
Peter Geoghegan




Re: row filtering for logical replication

2021-01-27 Thread Andres Freund
Hi,

On 2020-12-17 09:43:30 +0300, Önder Kalacı wrote:
> The above part can be considered the core of the logic, executed per tuple.
> As far as I can see, it has two downsides.
> 
> First, calling `expression_planner()` for every tuple can be quite
> expensive. I created a sample table, loaded data and ran a quick benchmark
> to see its effect. I attached the very simple script that I used to
> reproduce the issue on my laptop. I'm pretty sure you can find nicer ways
> of doing similar perf tests, just sharing as a reference.
> 
> The idea of the test is to add a WHERE clause to a table, but none of the
> tuples are filtered out. They just go through this code-path and send it to
> the remote node.
> 
> #rows   Patched| Master
> 1M  00:00:25.067536| 00:00:16.633988
> 10M  00:04:50.770791| 00:02:40.945358
> 
> 
> So, it seems a significant overhead to me. What do you think?

That seems almost prohibitively expensive. I think at the very least
some of this work would need to be done in a cached manner, e.g. via
get_rel_sync_entry().


> Secondly, probably more importantly, allowing any operator is as dangerous
> as allowing any function as users can create/overload operator(s).

That's not safe, indeed. It's not even just create/overloading
operators, as far as I can tell the expression can contain just plain
function calls.

The issue also isn't primarily that the user can overload functions,
it's that logical decoding is a limited environment, and not everything
is safe to do within it. E.g. only catalog tables can be
accessed. Therefore I don't think we can allow arbitrary expressions.


> The other problematic area was the performance, as calling
> `expression_planner()` for every tuple can be very expensive. To avoid
> that, it might be considered to ask users to provide a function instead of
> a free form WHERE clause, such that if the function returns true, the tuple
> is sent. The allowed functions need to be immutable SQL functions with bool
> return type. As we can parse the  SQL functions, we should be able to allow
> only functions that rely on the above mentioned procs. We can apply as many
> restrictions (such as no modification query) as possible. For example, see
> below:
> ```

I don't think that would get us very far.

From a safety aspect: A function's body can be changed by the user at
any time, therefore we cannot rely on analyses of the function's body.

From a performance POV: SQL functions are planned at every invocation,
so that'd not buy us much either.


I think what you would have to do instead is to ensure that the
expression is "simple enough", and then process it into a cheaply
executable format in get_rel_sync_entry(). I'd suggest that in the first
version you just allow a simple ANDed list of 'foo.bar op constant'
expressions.
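
For illustration, a first-cut check along those lines might look roughly like
this (just a sketch with invented names, not code from any posted patch; a real
version would also have to verify that the operator resolves to a safe,
immutable built-in function):

#include "postgres.h"
#include "nodes/pg_list.h"
#include "nodes/primnodes.h"

/*
 * Sketch: accept only "Var op Const" expressions, optionally ANDed together,
 * as the row filter.  Anything else is rejected up front.
 */
static bool
rowfilter_expr_is_simple(Node *node)
{
    if (node == NULL)
        return false;

    if (IsA(node, BoolExpr))
    {
        BoolExpr   *b = (BoolExpr *) node;
        ListCell   *lc;

        if (b->boolop != AND_EXPR)
            return false;
        foreach(lc, b->args)
        {
            if (!rowfilter_expr_is_simple((Node *) lfirst(lc)))
                return false;
        }
        return true;
    }

    if (IsA(node, OpExpr))
    {
        OpExpr     *op = (OpExpr *) node;
        Node       *l,
                   *r;

        if (list_length(op->args) != 2)
            return false;
        l = (Node *) linitial(op->args);
        r = (Node *) lsecond(op->args);
        return (IsA(l, Var) && IsA(r, Const)) ||
               (IsA(l, Const) && IsA(r, Var));
    }

    return false;
}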

Does that make sense?

Greetings,

Andres Freund




RE: [POC] Fast COPY FROM command for the table with foreign partitions

2021-01-27 Thread Tang, Haiying
Hi Andrey,

> Sometimes before i suggested additional optimization [1] which can 
> additionally speed up COPY by 2-4 times. Maybe you can perform the 
> benchmark for this solution too?

Sorry for the late reply, I only just found time to run this test.
But the patch no longer applies; I tried to apply it on e42b3c3bd6 (2021/1/26)
but it failed.

Can you send a rebased version?

Regards,
Tang





RE: [PoC] Non-volatile WAL buffer

2021-01-27 Thread tsunakawa.ta...@fujitsu.com
From: Tomas Vondra 
> (c) As mentioned before, PMEM behaves differently with concurrent
> access, i.e. it reaches peak throughput with relatively low number of
> threads wroting data, and then the throughput drops quite quickly. I'm
> not sure if the same thing applies to pmem_drain() too - if it does, we
> may need something like we have for insertions, i.e. a handful of locks
> allowing limited number of concurrent inserts.

> I think WALWriteLock itself (i.e. acquiring/releasing it) is not an
> issue - the problem is that writing the WAL to persistent storage itself
> is expensive, and we're waiting to that.
> 
> So it's not clear to me if removing the lock (and allowing multiple
> processes to do pmem_drain concurrently) can actually help, considering
> pmem_drain() should flush writes from other processes anyway.

I may be off track here, but HPE's benchmark using Oracle 18c, placing the
REDO log file on Intel PMEM in App Direct mode, showed only a 27% performance
increase even compared to a "SAS" SSD.

https://h20195.www2.hpe.com/v2/getdocument.aspx?docname=a00074230enw


The just-released Oracle 21c has started support for placing data files on 
PMEM, eliminating the overhead of buffer cache.  It's interesting that this new 
feature is categorized in "Manageability", not "Performance and scalability."

https://docs.oracle.com/en/database/oracle/oracle-database/21/nfcon/persistent-memory-database-258797846.html


They recommend placing REDO logs on DAX-aware file systems.  I wonder what's
behind this.

https://docs.oracle.com/en/database/oracle/oracle-database/21/admin/using-PMEM-db-support.html#GUID-D230B9CF-1845-4833-9BF7-43E9F15B7113

"You can use PMEM Filestore for database datafiles and control files. For 
performance reasons, Oracle recommends that you store redo log files as 
independent files in a DAX-aware filesystem such as EXT4/XFS."


Regards
Takayuki Tsunakawa



Re: On login trigger: take three

2021-01-27 Thread Amit Kapila
On Mon, Dec 28, 2020 at 5:46 PM Masahiko Sawada  wrote:
>
> On Sat, Dec 26, 2020 at 4:04 PM Pavel Stehule  wrote:
> >
> >
> >
> > On Sat, 26 Dec 2020 at 8:00, Pavel Stehule  wrote:
> >>
> >> Hi
> >>
> >>
> >>> Thank you.
> >>> I have applied all your fixes in on_connect_event_trigger-12.patch.
> >>>
> >>> Concerning enable_client_connection_trigger GUC, I think that it is 
> >>> really useful: it is the fastest and simplest way to disable login 
> >>> triggers in case
> >>> of some problems with them (not only for superuser itself, but for all 
> >>> users). Yes, it can be also done using "ALTER EVENT TRIGGER DISABLE".
> >>> But assume that you have a lot of databases with different login policies 
> >>> enforced by on-login event triggers. And you want temporary disable them 
> >>> all, for example for testing purposes.
> >>> In this case GUC is most convenient way to do it.
> >>>
> >>
> >> There was typo in patch
> >>
> >> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
> >> index f810789..8861f1b 100644
> >> --- a/doc/src/sgml/config.sgml
> >> +++ b/doc/src/sgml/config.sgml
> >> @@ -1,4 +1,4 @@
> >> -
> >> +\
> >>
> >> I have not any objections against functionality or design. I tested the 
> >> performance, and there are no negative impacts when this feature is not 
> >> used. There is significant overhead related to plpgsql runtime 
> >> initialization, but when this trigger will be used, then probably some 
> >> other PLpgSQL procedures and functions will be used too, and then this 
> >> overhead can be ignored.
> >>
> >> * make without warnings
> >> * make check-world passed
> >> * doc build passed
> >>
> >> Possible ToDo:
> >>
> >> The documentation can contain a note so usage connect triggers in 
> >> environments with short life sessions and very short fast queries without 
> >> usage PLpgSQL functions or procedures can have negative impact on 
> >> performance due overhead of initialization of PLpgSQL engine.
> >>
> >> I'll mark this patch as ready for committers
> >
> >
> > looks so this patch has not entry in commitfestapp 2021-01
> >
>
> Yeah, please register this patch before the next CommitFest[1] starts,
> 2021-01-01 AoE[2].
>

Konstantin, did you register this patch in any CF? Even though the
reviewer seems to be happy with the patch, I am afraid that we might
lose track of this unless we register it.

-- 
With Regards,
Amit Kapila.




Re: Wrong usage of RelationNeedsWAL

2021-01-27 Thread Kyotaro Horiguchi
At Wed, 27 Jan 2021 02:48:48 -0800, Noah Misch  wrote in 
> On Thu, Jan 21, 2021 at 01:23:36AM -0800, Noah Misch wrote:
> > On Thu, Jan 21, 2021 at 06:02:11PM +0900, Kyotaro Horiguchi wrote:
> > > Perhaps I'm missing something, but the patch doesn't pass the v5-0001
> > > test with wal_level=minimal?
> > 
> > Correct.  The case we must avoid is letting an old snapshot read an
> > early-pruned page without error.  v5-0001 expects "ERROR:  snapshot too 
> > old".
> > The patch suspends early pruning, so that error is not applicable.
> 
> I think the attached version is ready.  The changes since v6nm are cosmetic:
> 
> - Wrote log messages
> - Split into two patches, since the user-visible bugs are materially different
> - Fixed typos
> - Ran perltidy
> 
> Is it okay if I push these on Saturday, or would you like more time to
> investigate?

Thank you for the new version.

I studied the sto feature further and concluded that the checker side
is fine in that it always follows the changes of the page LSN.

So what we can do about the issue is either set a seemingly correct page LSN
at pruning time, or refrain from early pruning while we are skipping
WAL. The reason I took the former is that I thought the latter might
be a problem, since early pruning would be postponed by a long-running
WAL-skipping transaction.

So the patch looks fine to me. The commit message makes sense.

However, is it OK that the existing tests (modules/snapshot_too_old)
fail when wal_level=minimal?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Support ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax

2021-01-27 Thread Bharath Rupireddy
On Wed, Jan 27, 2021 at 7:35 PM Li Japin  wrote:
> > I don't see any problem if ALTER SUBSCRIPTION ... ADD PUBLICATION with
> > refresh true refreshes only the newly added publications, because what
> > we do in AlterSubscription_refresh() is that we fetch the tables
> > associated with the publications from the publisher, compare them with
> > the previously fetched tables from that publication and add the new
> > tables or remove the table that don't exist in that publication
> > anymore.
> >
> > For ALTER SUBSCRIPTION ... DROP PUBLICATION, also we can do the same
> > thing i.e. refreshes only the dropped publications.
> >
> > Thoughts?
>
> Agreed. We just only need to refresh the added/dropped publications.  
> Furthermore, for publications that will be dropped, we do not need the 
> “copy_data” option, right?

I think you are right. The copy_data option doesn't make sense for
ALTER SUBSCRIPTION ... DROP PUBLICATION; maybe we should throw an
error if the user specifies it. Of course, we need that option for
ALTER SUBSCRIPTION ... ADD PUBLICATION.
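To make that concrete, here is a rough usage sketch of how I would expect
the commands to behave (syntax per the proposed patch; the exact behaviour
on error is only my assumption, not something the patch already emits):

ALTER SUBSCRIPTION mysub1 ADD PUBLICATION mypub7, mypub8
    WITH (refresh = true, copy_data = true);

-- copy_data has nothing to copy for a publication being removed,
-- so this should raise an error rather than being silently accepted
ALTER SUBSCRIPTION mysub1 DROP PUBLICATION mypub4 WITH (copy_data = true);

-- refresh alone is fine; only the dropped publication is refreshed
ALTER SUBSCRIPTION mysub1 DROP PUBLICATION mypub4 WITH (refresh = true);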

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Is Recovery actually paused?

2021-01-27 Thread Dilip Kumar
On Wed, Jan 27, 2021 at 2:28 PM Dilip Kumar  wrote:
>
> On Wed, Jan 27, 2021 at 2:06 PM Yugo NAGATA  wrote:
> >
> > On Wed, 27 Jan 2021 13:29:23 +0530
> > Dilip Kumar  wrote:
> >
> > > On Wed, Jan 27, 2021 at 12:50 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Tue, Jan 26, 2021 at 2:00 AM Robert Haas  
> > > > wrote:
> > > > >
> > > > > On Sat, Jan 23, 2021 at 6:10 AM Bharath Rupireddy
> > > > >  wrote:
> > > > > > +1 to just show the recovery pause state in the output of
> > > > > > pg_is_wal_replay_paused. But, should the function name
> > > > > > "pg_is_wal_replay_paused" be something like
> > > > > > "pg_get_wal_replay_pause_state" or some other? To me, when "is" 
> > > > > > exists
> > > > > > in a function, I expect a boolean output. Others may have better
> > > > > > thoughts.
> > > > >
> > > > > Maybe we should leave the existing function pg_is_wal_replay_paused()
> > > > > alone and add a new one with the name you suggest that returns text.
> > > > > That would create less burden for tool authors.
> > > >
> > > > +1
> > > >
> > >
> > > Yeah, we can do that, I will send an updated patch soon.
> >
> > This means pg_is_wal_replay_paused is left without any change and this
> > returns whether pause is requested or not? If so, it seems good to modify
> > the documentation of this function in order to note that this could not
> > return the actual pause state.
>
> Yes, we can say that it will return true if the replay pause is
> requested.  I am changing that in my new patch.

I have modified the patch; the changes are:

- I have added a new interface, pg_get_wal_replay_pause_state, to get
the pause request state.
- Now that we are not waiting for recovery to actually pause, I don't
think it makes sense to put a lot of checks for a pause request, so I
have removed that check from recoveryApplyDelay.  I think it is better
to still keep the check in WaitForWalToBecomeAvailable, though, because
it can wait forever before the next WAL becomes available.
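
As a quick usage sketch of the two interfaces (the function name follows
this patch; the exact state strings returned are my assumption and may
differ in the final version):

-- existing function: true means a pause has been *requested*,
-- not necessarily that replay has actually stopped
SELECT pg_is_wal_replay_paused();

-- new function: reports the actual state, e.g. 'not paused',
-- 'pause requested' or 'paused'
SELECT pg_get_wal_replay_pause_state();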

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


v8-0001-Provide-a-new-interface-to-get-the-recovery-pause.patch
Description: Binary data


Re: Support ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax

2021-01-27 Thread japin

On Thu, 28 Jan 2021 at 12:22, Bharath Rupireddy 
 wrote:
> On Wed, Jan 27, 2021 at 7:35 PM Li Japin  wrote:
>> > I don't see any problem if ALTER SUBSCRIPTION ... ADD PUBLICATION with
>> > refresh true refreshes only the newly added publications, because what
>> > we do in AlterSubscription_refresh() is that we fetch the tables
>> > associated with the publications from the publisher, compare them with
>> > the previously fetched tables from that publication and add the new
>> > tables or remove the table that don't exist in that publication
>> > anymore.
>> >
>> > For ALTER SUBSCRIPTION ... DROP PUBLICATION, also we can do the same
>> > thing i.e. refreshes only the dropped publications.
>> >
>> > Thoughts?
>>
>> Agreed. We just only need to refresh the added/dropped publications.  
>> Furthermore, for publications that will be dropped, we do not need the 
>> “copy_data” option, right?
>
> I think you are right. The copy_data option doesn't make sense for
> ALTER SUBSCRIPTION ... DROP PUBLICATION; maybe we should throw an
> error if the user specifies it. Of course, we need that option for
> ALTER SUBSCRIPTION ... ADD PUBLICATION.
>

Thanks for your confirmation.  The attached v3 patches fix it.

* v3-0001
Only refresh the publications that will be added/dropped, also remove 
"copy_data"
option from DROP PUBLICATION.

* v3-0002
Add a new testcase for DROP PUBLICATION WITH (copy_data).

* v3-0003
Remove the reference of REFRESH PUBLICATION in DROP PUBLICATION.

* v3-0004
Unchanged.

Attaching v3 patches, please consider these for further review.

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.

>From bff45768b066ccf94de6dae9f687cf54212900b1 Mon Sep 17 00:00:00 2001
From: Japin Li 
Date: Tue, 26 Jan 2021 15:43:52 +0800
Subject: [PATCH v3 1/4] Introduce a new syntax to add/drop publications

At present, if we want to update the publications of a subscription, we
can use SET PUBLICATION; however, that requires supplying all existing
publications as well as the new ones when we only want to add publications,
which is inconvenient.  The new syntax only needs the added/dropped
publications.  When refresh is true, only those publications are refreshed.
---
 src/backend/commands/subscriptioncmds.c | 121 
 src/backend/parser/gram.y   |  20 
 src/include/nodes/parsenodes.h  |   2 +
 3 files changed, 143 insertions(+)

diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 082f7855b8..88fa7f1b3f 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -46,6 +46,8 @@
 #include "utils/syscache.h"
 
 static List *fetch_table_list(WalReceiverConn *wrconn, List *publications);
+static List *merge_subpublications(HeapTuple tuple, TupleDesc tupledesc,
+   List *publications, bool addpub);
 
 /*
  * Common option parsing function for CREATE and ALTER SUBSCRIPTION commands.
@@ -857,6 +859,51 @@ AlterSubscription(AlterSubscriptionStmt *stmt)
 break;
 			}
 
+		case ALTER_SUBSCRIPTION_ADD_PUBLICATION:
+		case ALTER_SUBSCRIPTION_DROP_PUBLICATION:
+			{
+bool	copy_data = false;
+bool	isadd = stmt->kind == ALTER_SUBSCRIPTION_ADD_PUBLICATION;
+bool	refresh;
+List   *publist = NIL;
+
+publist = merge_subpublications(tup, RelationGetDescr(rel),
+stmt->publication, isadd);
+
+parse_subscription_options(stmt->options,
+		   NULL,	/* no "connect" */
+		   NULL, NULL,	/* no "enabled" */
+		   NULL,	/* no "create_slot" */
+		   NULL, NULL,	/* no "slot_name" */
+		   isadd ? &copy_data : NULL,
+		   NULL,	/* no "synchronous_commit" */
+		   &refresh,
+		   NULL, NULL,	/* no "binary" */
+		   NULL, NULL); /* no "streaming" */
+values[Anum_pg_subscription_subpublications - 1] =
+	publicationListToArray(publist);
+replaces[Anum_pg_subscription_subpublications - 1] = true;
+
+update_tuple = true;
+
+/* Refresh if user asked us to. */
+if (refresh)
+{
+	if (!sub->enabled)
+		ereport(ERROR,
+(errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions"),
+ errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).")));
+
+	/* Only refresh the added/dropped list of publications. */
+	sub->publications = stmt->publication;
+
+	AlterSubscription_refresh(sub, copy_data);
+}
+
+break;
+			}
+
 		case ALTER_SUBSCRIPTION_REFRESH:
 			{
 bool		copy_data;
@@ -1278,3 +1325,77 @@ fetch_table_list(WalReceiverConn *wrconn, List *publications)
 
 	return tablelist;
 }
+
+/*
+ * Merge current subscription's publications and user specified publications
+ * by ADD/DROP PUBLICATIONS.
+ *
+ * If isadd == true, we will add the list of publications into current
+ * subscription's publications.  Otherwise, we will delete the list of
+ * publications from current su

Re: On login trigger: take three

2021-01-27 Thread Amit Kapila
On Thu, Jan 28, 2021 at 8:17 AM Amit Kapila  wrote:
>
> On Mon, Dec 28, 2020 at 5:46 PM Masahiko Sawada  wrote:
> >
> > On Sat, Dec 26, 2020 at 4:04 PM Pavel Stehule  
> > wrote:
> > >
> > >
> > >
> > > so 26. 12. 2020 v 8:00 odesílatel Pavel Stehule  
> > > napsal:
> > >>
> > >> Hi
> > >>
> > >>
> > >>> Thank you.
> > >>> I have applied all your fixes in on_connect_event_trigger-12.patch.
> > >>>
> > >>> Concerning enable_client_connection_trigger GUC, I think that it is 
> > >>> really useful: it is the fastest and simplest way to disable login 
> > >>> triggers in case
> > >>> of some problems with them (not only for superuser itself, but for all 
> > >>> users). Yes, it can be also done using "ALTER EVENT TRIGGER DISABLE".
> > >>> But assume that you have a lot of databases with different login 
> > >>> policies enforced by on-login event triggers. And you want temporary 
> > >>> disable them all, for example for testing purposes.
> > >>> In this case GUC is most convenient way to do it.
> > >>>
> > >>
> > >> There was typo in patch
> > >>
> > >> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
> > >> index f810789..8861f1b 100644
> > >> --- a/doc/src/sgml/config.sgml
> > >> +++ b/doc/src/sgml/config.sgml
> > >> @@ -1,4 +1,4 @@
> > >> -
> > >> +\
> > >>
> > >> I have not any objections against functionality or design. I tested the 
> > >> performance, and there are no negative impacts when this feature is not 
> > >> used. There is significant overhead related to plpgsql runtime 
> > >> initialization, but when this trigger will be used, then probably some 
> > >> other PLpgSQL procedures and functions will be used too, and then this 
> > >> overhead can be ignored.
> > >>
> > >> * make without warnings
> > >> * make check-world passed
> > >> * doc build passed
> > >>
> > >> Possible ToDo:
> > >>
> > >> The documentation can contain a note so usage connect triggers in 
> > >> environments with short life sessions and very short fast queries 
> > >> without usage PLpgSQL functions or procedures can have negative impact 
> > >> on performance due overhead of initialization of PLpgSQL engine.
> > >>
> > >> I'll mark this patch as ready for committers
> > >
> > >
> > > looks so this patch has not entry in commitfestapp 2021-01
> > >
> >
> > Yeah, please register this patch before the next CommitFest[1] starts,
> > 2021-01-01 AoE[2].
> >
>
> Konstantin, did you register this patch in any CF? Even though the
> reviewer seems to be happy with the patch, I am afraid that we might
> lose track of this unless we register it.
>

I see the CF entry (https://commitfest.postgresql.org/31/2900/) for
this work. Sorry for the noise.

-- 
With Regards,
Amit Kapila.




Re: logical replication worker accesses catalogs in error context callback

2021-01-27 Thread Amit Kapila
On Wed, Jan 27, 2021 at 9:38 AM Bharath Rupireddy
 wrote:
>
> On Wed, Jan 27, 2021 at 7:48 AM Zhihong Yu  wrote:
>
> Thanks for pointing to the changes in the commit message. I corrected
> them. Attaching v4 patch set, consider it for further review.
>

Regarding 0001, have we tried to reproduce the actual bug here, meaning
that when the error_callback is called we should face some problem? I feel
that with the right test case we should hit the Assert
(Assert(IsTransactionState());) in SearchCatCacheInternal, because there
we expect the transaction to be in a valid state. I understand that the
transaction is in a broken state at that time, but having a test case that
hits the actual bug makes it easier to test the fix.

-- 
With Regards,
Amit Kapila.




Re: [HACKERS] GSoC 2017: Foreign Key Arrays

2021-01-27 Thread Joel Jacobson
Hi Mark,

On Wed, Jan 27, 2021, at 22:36, Mark Rofail wrote:
> Vectors as refrencing columns are not supported and out of scope of this 
> patch. Try to use arrays.

OK, not a biggie, but I think the user at least should get an error message
immediately when trying to add the foreign key on an incompatible column,
just like we would get if e.g. trying to add a fk on numeric[] -> smallint
or some other incompatible combination.

The first error message looks good:

CREATE TABLE a (
  a_id smallint NOT NULL,
  PRIMARY KEY (a_id)
);

CREATE TABLE b (
b_id integer NOT NULL,
a_ids numeric[] NOT NULL,
PRIMARY KEY (b_id)
);

joel=# ALTER TABLE b ADD FOREIGN KEY (EACH ELEMENT OF a_ids) REFERENCES a(a_id);
ERROR:  foreign key constraint "b_a_ids_fkey" cannot be implemented
DETAIL:  Key column "a_ids" has element type numeric which does not have a 
default btree operator class that's compatible with class "int2_ops".

But you don't get any error message if a_ids is instead int2vector:

CREATE TABLE b (
b_id integer NOT NULL,
a_ids int2vector NOT NULL,
PRIMARY KEY (b_id)
);

ALTER TABLE b ADD FOREIGN KEY (EACH ELEMENT OF a_ids) REFERENCES a(a_id);

It's silently added without any error.

The error first comes when you try to insert data:

INSERT INTO a (a_id) VALUES (1);
INSERT INTO a (a_id) VALUES (2);
INSERT INTO b (b_id, a_ids) VALUES (3, '1 2'::int2vector);

ERROR:  operator does not exist: smallint[] pg_catalog.<@ int2vector
LINE 1: ..."."a" x WHERE ARRAY ["a_id"]::pg_catalog.anyarray OPERATOR(p...
 ^
HINT:  No operator matches the given name and argument types. You might need to 
add explicit type casts.
QUERY:  SELECT 1 WHERE (SELECT pg_catalog.count(DISTINCT y) FROM 
pg_catalog.unnest($1) y) OPERATOR(pg_catalog.=) (SELECT pg_catalog.count(*) 
FROM (SELECT 1 FROM ONLY "public"."a" x WHERE ARRAY 
["a_id"]::pg_catalog.anyarray OPERATOR(pg_catalog.<@) $1::pg_catalog.anyarray 
FOR KEY SHARE OF x) z)

This means, as a user, I might not be aware of the vector restriction when 
adding the foreign keys
for my existing schema, and think everything is fine, and not realize there is 
a problem until
new data arrives.

/Joel

Re: Support for NSS as a libpq TLS backend

2021-01-27 Thread Michael Paquier
On Wed, Jan 27, 2021 at 06:47:17PM +, Jacob Champion wrote:
> So to check understanding: the `openssl` config variable is still alive
> for MSVC builds; it just turns that into `--with-ssl=openssl` in the
> fake CONFIGURE_ARGS?

Yeah, I think that keeping both variables separated in the MSVC
scripts is the most straight-forward option, as this passes down a
path.  Once there is a value for nss, we'd need to properly issue an
error if both OpenSSL and NSS are specified.

> Since SSL is an obsolete term, and the choice of OpenSSL vs NSS vs
> [nothing] affects server operation (such as cryptohash) regardless of
> whether or not connection-level TLS is actually used, what would you
> all think about naming this option --with-crypto? I.e.
> 
> --with-crypto=openssl
> --with-crypto=nss

Looking around, curl has multiple switches for each lib with one named
--with-ssl for OpenSSL, but it needs to be able to use multiple
libraries at run time.  I can spot that libssh2 uses what you are
proposing.  It seems to me that --with-ssl is a bit more popular but
not by that much: wget, wayland, some apache stuff (it uses a path as
option value).  Anyway, what you are suggesting sounds like a good idea in
the context of Postgres.  Daniel?
--
Michael


signature.asc
Description: PGP signature


Re: Single transaction in the tablesync worker?

2021-01-27 Thread Peter Smith
Hi Amit.

PSA the v21 patch for the Tablesync Solution1.

Main differences from v20:
+ Rebased to latest OSS HEAD @ 27/Jan
+ v21 is a merge of patches [v17] and [v20], which became necessary
when it was found [ak0127] that the v20 usage of TEMPORARY tablesync
slots did not work correctly. v21 reverts to using PERMANENT tablesync
slots, the same as implemented in v17, while retaining the other
review comment fixes made for v18, v19, and v20.


[v17] 
https://www.postgresql.org/message-id/CAHut%2BPt9%2Bg8qQR0kMC85nY-O4uDQxXboamZAYhHbvkebzC9fAQ%40mail.gmail.com
[v20] 
https://www.postgresql.org/message-id/CAHut%2BPuNwSujoL_dwa%3DTtozJ_vF%3DCnJxjgQTCmNBkazd8J1m-A%40mail.gmail.com
[ak0127] 
https://www.postgresql.org/message-id/CAA4eK1LDsj9kw4FbWAw3CMHyVsjafgDum03cYy-wpGmor%3D8-1w%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia


v21-0002-Tablesync-extra-logging.patch
Description: Binary data


v21-0001-Tablesync-Solution1.patch
Description: Binary data


RE: libpq debug log

2021-01-27 Thread k.jami...@fujitsu.com
On Mon, Jan 25, 2021 10:11 PM (JST), Alvaro Herrera wrote:
> On 2021-Jan-25, tsunakawa.ta...@fujitsu.com wrote:
> 
> > Iwata-san,
> 
> [...]
> 
> > Considering these and the compilation error Kirk-san found, I'd like
> > you to do more self-review before I resume this review.
> 
> Kindly note that these errors are all mine.
> 
> Thanks for the review
> 
Hello,
To make the CFBot happy, I fixed the compiler errors.
I have not included the proposed change to move the tracing functions
to a new file (fe-trace.c or something), so I retained the current layout.
Maybe Iwata-san can consider that proposal.
Currently, both pqTraceInit() and pqTraceUninit() are called only by one 
function.

Summary:
1. I changed the bits32 flags to int flags
2. I assumed the INT_MAX value corresponds to the former MAX_FRONTEND_MSGS.
   I defined it to resolve the compile error; please tell me if that value is wrong.
3. Fixed the doc errors
4. Added freeing of be_msg in pqTraceUninit()
5. Addressed the following comments: 

> (25)
> + conn->fe_msg =
> malloc(sizeof(offsetof(pqFrontendMessage, fields)) +
> +
> DEF_FE_MSGFIELDS * sizeof(pqFrontendMessageField));
> 
> It's incorrect that offsetof() is nested in sizeof().  See my original 
> review comment.

I removed the initial sizeof().


> (26)
> +bool
> +pqTraceInit(PGconn *conn, bits32 flags) {
> ...
> + conn->be_msg->state = LOG_FIRST_BYTE;
> + conn->be_msg->length = 0;
> + }
> ...
> + conn->be_msg->state = LOG_FIRST_BYTE;
> + conn->be_msg->length = 0;
> 
> The former is redundant?

Right. I've removed the former.


> (27)
> +/*
> + * Deallocate frontend-message tracking memory.  We only do this 
> +because
> + * it can grow arbitrarily large, and skip it in the initial state, 
> +because
> + * it's likely pointless.
> + */
> +void
> +pqTraceUninit(PGconn *conn)
> +{
> + if (conn->fe_msg &&
> + conn->fe_msg->num_fields != DEF_FE_MSGFIELDS)
> + {
> + pfree(conn->fe_msg);
> + conn->fe_msg = NULL;
> + }
> +}
> 
> I thought this function plays the reverse role of pqTraceInit(), but 
> it only frees memory for frontend messages.  Plus, use free() instead 
> of pfree(), because
> malloc() is used to allocate memory.

Fixed to use free(). Also added the freeing of be_msg.

> (28)
> +void PQtrace(PGconn *conn, FILE *stream, bits32 flags);
> 
> bits32 can't be used because it's a data type defined in c.h, which 
> should not be exposed to applications.  I think you can just int or something.

I used int. Suppressing the timestamp when flags is true still works.

In my environment, when flags is false (0):
2021-01-28 06:26:11.368 >   Query   27 "COPY country TO STDOUT"
2021-01-28 06:26:11.368 >   Query   -1
2021-01-28 06:26:11.369 <   CopyOutResponse 13 \x00 #3 #0 #0 #0
2021-01-28 06:26:11.369 <   CopyDone4

2021-01-28 06:26:11.369 <   CopyDone4
2021-01-28 06:26:11.369 <   CopyDone4
2021-01-28 06:26:11.369 <   CommandComplete 11 "COPY 0"
2021-01-28 06:26:11.369 <   ReadyForQuery   5
2021-01-28 06:26:11.369 >   Terminate   4
2021-01-28 06:26:11.369 >   Terminate   -1

When flags is true (1), running the same query:
>   Query   27 "COPY country TO STDOUT"
>   Query   -1
<   CopyOutResponse 13 \x00 #3 #0 #0 #0
<   CopyDone4

<   CopyDone4
<   CopyDone4
<   CommandComplete 11 "COPY 0"
<   ReadyForQuery   5
>   Terminate   4
>   Terminate   -1


Regards,
Kirk Jamison


v14-0001-libpq-trace.patch
Description: v14-0001-libpq-trace.patch


Re: protect pg_stat_statements_info() for being used without the library loaded

2021-01-27 Thread Michael Paquier
On Thu, Jan 28, 2021 at 08:49:54AM +0800, Julien Rouhaud wrote:
> Good catch, and patch looks good to me.

This crashes the server.  Looking at all the other modules in
the tree, I am not seeing any other hole.  This is new as of 9fbc3f3,
and I will apply it on HEAD.
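
For the archives, the hole is easy to hit (a minimal sketch, assuming a
database where the extension is installed but pg_stat_statements is not
listed in shared_preload_libraries):

CREATE EXTENSION pg_stat_statements;
-- before the fix, calling this without the library loaded crashed the
-- backend; with the fix it should raise a clean error instead
SELECT * FROM pg_stat_statements_info();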
--
Michael


signature.asc
Description: PGP signature


Re: protect pg_stat_statements_info() for being used without the library loaded

2021-01-27 Thread Fujii Masao




On 2021/01/28 15:42, Michael Paquier wrote:

On Thu, Jan 28, 2021 at 08:49:54AM +0800, Julien Rouhaud wrote:

Good catch, and patch looks good to me.


This crashes the server.  Looking at all the other modules in
the tree, I am not seeing any other hole.  This is new as of 9fbc3f3,
and I will apply it on HEAD.


Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: pg_upgrade fails with non-standard ACL

2021-01-27 Thread Noah Misch
On Wed, Jan 27, 2021 at 07:32:42PM +0300, Anastasia Lubennikova wrote:
> On 27.01.2021 14:21, Noah Misch wrote:
> >On Mon, Jan 25, 2021 at 10:14:43PM +0300, Anastasia Lubennikova wrote:
> >
> >>1) Could you please clarify, what do you mean by REVOKE failures?
> >>
> >>I tried following example:
> >>
> >>Start 9.6 cluster.
> >>
> >>REVOKE ALL ON function pg_switch_xlog() FROM public;
> >>REVOKE ALL ON function pg_switch_xlog() FROM backup;
> >>
> >>The upgrade to the current master passes with and without patch.
> >>It seems that current implementation of pg_upgrade doesn't take into account
> >>revoke ACLs.
> >I think you can observe it by adding "revoke all on function
> >pg_stat_get_subscription from public;" to test_add_acl_to_catalog_objects.sql
> >and then rerunning your test script.  pg_dump will reproduce that REVOKE,
> >which would fail if postgresql.org had removed that function.  That's fine, 
> >so
> >long as a comment mentions the limitation.
> 
> In the updated patch, I implemented generation of both GRANT ALL and REVOKE
> ALL for problematic objects. If I understand it correctly, these calls will
> clean object's ACL completely. And I see no harm in doing this, because the
> objects don't exist in the new cluster anyway.
> 
> To test it I attempted to reproduce the problem, using attached
> test_revoke.sql in the test. Still, pg_upgrade works fine without any
> adjustments. I'd be grateful if you test it some more.

test_revoke.sql has "revoke all on function pg_stat_get_subscription() from
test", which does nothing.  You need something that causes a REVOKE in pg_dump
output.  Worked example:

=== shell script
set -x
createuser test
pg_dump | grep -E 'GRANT|REVOKE'
psql -Xc 'revoke all on function pg_stat_get_subscription() from test;'
psql -Xc 'revoke all on function pg_stat_get_subscription from test;'
pg_dump | grep -E 'GRANT|REVOKE'
psql -Xc 'revoke all on function pg_stat_get_subscription from public;'
pg_dump | grep -E 'GRANT|REVOKE'

=== output
+ createuser test
+ pg_dump
+ grep -E 'GRANT|REVOKE'
+ psql -Xc 'revoke all on function pg_stat_get_subscription() from test;'
ERROR:  function pg_stat_get_subscription() does not exist
+ psql -Xc 'revoke all on function pg_stat_get_subscription from test;'
REVOKE
+ pg_dump
+ grep -E 'GRANT|REVOKE'
+ psql -Xc 'revoke all on function pg_stat_get_subscription from public;'
REVOKE
+ pg_dump
+ grep -E 'GRANT|REVOKE'
REVOKE ALL ON FUNCTION pg_catalog.pg_stat_get_subscription(subid oid, OUT subid 
oid, OUT relid oid, OUT pid integer, OUT received_lsn pg_lsn, OUT 
last_msg_send_time timestamp with time zone, OUT last_msg_receipt_time 
timestamp with time zone, OUT latest_end_lsn pg_lsn, OUT latest_end_time 
timestamp with time zone) FROM PUBLIC;

That REVOKE is going to fail if the upgrade target cluster lacks the function
in question, and your test_rename_catalog_objects_v13 does simulate
pg_stat_get_subscription not existing.




Re: Single transaction in the tablesync worker?

2021-01-27 Thread Peter Smith
On Wed, Jan 27, 2021 at 2:53 PM Amit Kapila  wrote:
>
> On Sat, Jan 23, 2021 at 5:56 PM Amit Kapila  wrote:
> >
> > On Sat, Jan 23, 2021 at 4:55 AM Peter Smith  wrote:
> > >
> > > PSA the v18 patch for the Tablesync Solution1.
> >
> > 7. Have you tested with the new patch the scenario where we crash
> > after FINISHEDCOPY and before SYNCDONE, is it able to pick up the
> > replication using the new temporary slot? Here, we need to test the
> > case where during the catchup phase we have received few commits and
> > then the tablesync worker is crashed/errored out? Basically, check if
> > the replication is continued from the same point?
> >
>
> I have tested this and it didn't work, see the below example.
>
> Publisher-side
> 
> CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
>
> BEGIN;
> INSERT INTO mytbl1(somedata, text) VALUES (1, 1);
> INSERT INTO mytbl1(somedata, text) VALUES (1, 2);
> COMMIT;
>
> CREATE PUBLICATION mypublication FOR TABLE mytbl1;
>
> Subscriber-side
> 
> - Have a while(1) loop in LogicalRepSyncTableStart so that tablesync
> worker stops.
>
> CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
>
>
> CREATE SUBSCRIPTION mysub
>  CONNECTION 'host=localhost port=5432 dbname=postgres'
> PUBLICATION mypublication;
>
> During debug, stop after we mark FINISHEDCOPY state.
>
> Publisher-side
> 
> INSERT INTO mytbl1(somedata, text) VALUES (1, 3);
> INSERT INTO mytbl1(somedata, text) VALUES (1, 4);
>
>
> Subscriber-side
> 
> - Have a breakpoint in apply_dispatch
> - continue in debugger;
> - After we replay first commit (which will be for values(1,3), note
> down the origin position in apply_handle_commit_internal and somehow
> error out. I have forced the debugger to set to the last line in
> apply_dispatch where the error is raised.
> - After the error, again the tablesync worker is restarted and it
> starts from the position noted in the previous step
> - It exits without replaying the WAL for (1,4)
>
> So, on the subscriber-side, you will see 3 records. Fourth is missing.
> Now, if you insert more records on the publisher, it will anyway
> replay those but the fourth one got missing.
>
> The temporary slots didn't seem to work because we created again the
> new temporary slot after the crash and ask it to start decoding from
> the point we noted in origin_lsn. The publisher didn’t hold the
> required WAL as our slot was temporary so it started sending from some
> later point. We retain WAL based on the slots restart_lsn position and
> wal_keep_size. For our case, the positions of the slots will matter
> and as we have created temporary slots, there is no way for a
> publisher to save that WAL.
>
> In this particular case, even if the WAL would have been there we only
> pass the start_decoding_at position but didn’t pass restart_lsn, so it
> picked a random location (current insert position in WAL) which is
> ahead of start_decoding_at point so it never sent the required fourth
> record. Now, I don’t think it will work even if somehow sent the
> correct restart_lsn because of what I wrote earlier that there is no
> guarantee that the earlier WAL would have been saved.
>
> At this point, I can't think of any way to fix this problem except for
> going back to the previous approach of permanent slots but let me know
> if you have any ideas to salvage this approach?
>

OK. The latest patch [v21] now restores the permanent slot (and slot
cleanup) approach as it was implemented in an earlier version [v17].
Please note that this change also re-introduces some potential slot
cleanup problems for some race scenarios. These will be addressed by
future patches.
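
For anyone following along, the slot situation can be observed on the
publisher while testing this (a sketch using the existing
pg_replication_slots view; the tablesync slot names themselves depend on
the patch version in use):

-- run on the publisher: permanent tablesync slots show temporary = false
-- and retain WAL according to their restart_lsn
SELECT slot_name, slot_type, temporary, restart_lsn
FROM pg_replication_slots;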


[v17] 
https://www.postgresql.org/message-id/CAHut%2BPt9%2Bg8qQR0kMC85nY-O4uDQxXboamZAYhHbvkebzC9fAQ%40mail.gmail.com
[v21] 
https://www.postgresql.org/message-id/CAHut%2BPvzHRRA_5O0R8KZCb1tVe1mBVPxFtmttXJnmuOmAegoWA%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Wrong usage of RelationNeedsWAL

2021-01-27 Thread Noah Misch
On Thu, Jan 28, 2021 at 12:06:27PM +0900, Kyotaro Horiguchi wrote:
> At Wed, 27 Jan 2021 02:48:48 -0800, Noah Misch  wrote in 
> > On Thu, Jan 21, 2021 at 01:23:36AM -0800, Noah Misch wrote:
> > > On Thu, Jan 21, 2021 at 06:02:11PM +0900, Kyotaro Horiguchi wrote:
> > > > Perhaps I'm missing something, but the patch doesn't pass the v5-0001
> > > > test with wal_level=minimal?
> > > 
> > > Correct.  The case we must avoid is letting an old snapshot read an
> > > early-pruned page without error.  v5-0001 expects "ERROR:  snapshot too 
> > > old".
> > > The patch suspends early pruning, so that error is not applicable.

> I studied the sto feature further and concluded that the checker side
> is fine in that it always follows the changes of the page LSN.
> 
> So what we can do about the issue is either set a seemingly correct page LSN
> at pruning time, or refrain from early pruning while we are skipping
> WAL. The reason I took the former is that I thought the latter might
> be a problem, since early pruning would be postponed by a long-running
> WAL-skipping transaction.

Yes, that's an accurate summary.

> So the patch looks fine to me. The commit message makes sense.
> 
> However, is it OK that the existing tests (modules/snapshot_too_old)
> fail when wal_level=minimal?

That would not be okay, but I'm not seeing that.  How did your setup differ
from the following?

[nm@rfd 6:1 2021-01-27T23:06:33 postgresql 0]$ cat /nmscratch/minimal.conf
log_statement = all
wal_level = minimal
max_wal_senders = 0
log_line_prefix = '%m [%p %l %x] %q%a '
[nm@rfd 6:1 2021-01-27T23:06:38 postgresql 0]$ make -C 
src/test/modules/snapshot_too_old check TEMP_CONFIG=/nmscratch/minimal.conf
== creating temporary instance==
== initializing database system   ==
== starting postmaster==
running on port 58080 with PID 2603099
== creating database "isolation_regression" ==
CREATE DATABASE
ALTER DATABASE
== running regression test queries==
test sto_using_cursor ... ok30168 ms
test sto_using_select ... ok24197 ms
test sto_using_hash_index ... ok 6089 ms
== shutting down postmaster   ==
== removing temporary instance==

=
 All 3 tests passed.
=




Re: protect pg_stat_statements_info() for being used without the library loaded

2021-01-27 Thread Michael Paquier
On Thu, Jan 28, 2021 at 03:53:54PM +0900, Fujii Masao wrote:
> Thanks!

No problem.  Applied as of bca96dd.
--
Michael


signature.asc
Description: PGP signature


Re: Add SQL function for SHA1

2021-01-27 Thread Michael Paquier
On Tue, Jan 26, 2021 at 09:53:52PM -0500, Sehrope Sarkuni wrote:
> The refactor patch looks good. It builds and passes make check.

Thanks for double-checking!  The refactoring has just been done as of
f854c69.
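
For context, the SQL-level entry points touched here are called like this
(a sketch; sha1() is the function proposed on this thread and is not
available until the follow-up patch lands):

SELECT sha256('abc'::bytea);  -- existing function, now refactored
SELECT sha1('abc'::bytea);    -- proposed addition under discussion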
--
Michael


signature.asc
Description: PGP signature


Re: Two patches to speed up pg_rewind.

2021-01-27 Thread Michael Paquier
On Wed, Jan 27, 2021 at 09:18:48AM +, Paul Guo wrote:
> Second one is use copy_file_range() for the local rewind case to replace 
> read()+write().
> This introduces copy_file_range() check and HAVE_COPY_FILE_RANGE so other
> code could use copy_file_range() if needed. copy_file_range() was introduced
> In high-version Linux Kernel, in low-version Linux or other Unix-like OS 
> mmap()
> might be better than read()+write() but copy_file_range() is more interesting
> given that it could skip the data copying in some file systems - this could 
> benefit more
> on Linux fs on network-based block storage.

Have you done some measurements?
--
Michael


signature.asc
Description: PGP signature


Re: [PATCH] remove pg_standby

2021-01-27 Thread Michael Paquier
On Wed, Jan 27, 2021 at 05:08:56PM +0900, Fujii Masao wrote:
> But one question is; shouldn't we follow "usual" way to retire the
> feature instead of dropping that immediately? That is, mark
> pg_standby as obsolete, announce that pg_standby will be dropped
> after several releases, and then drop pg_standby. This seems safe
> because there might be some users. While it's been marked as
> obsolete, maybe WAL prefetch feature doesn't work with pg_standby,
> but we can live with that because it's obsolete.

Thanks.  FWIW, at this stage, my take is just to move on and remove
it.  If we mark it as obsolete, it will stay around forever and keep
annoying future development.
--
Michael


signature.asc
Description: PGP signature


Re: Protect syscache from bloating with negative cache entries

2021-01-27 Thread Kyotaro Horiguchi
At Wed, 27 Jan 2021 13:11:55 +0200, Heikki Linnakangas  wrote 
in 
> On 27/01/2021 03:13, Kyotaro Horiguchi wrote:
> > At Thu, 14 Jan 2021 17:32:27 +0900 (JST), Kyotaro Horiguchi
> >  wrote in
> >> The commit 4656e3d668 (debug_invalidate_system_caches_always)
> >> conflicted with this patch. Rebased.
> > At Wed, 27 Jan 2021 10:07:47 +0900 (JST), Kyotaro Horiguchi
> >  wrote in
> >> (I found a bug in a benchmark-aid function
> >> (CatalogCacheFlushCatalog2), I repost an updated version soon.)
> > I noticed that a catcachebench-aid function
> > CatalogCacheFlushCatalog2() allocates bucked array wrongly in the
> > current memory context, which leads to a crash.
> > This is a fixed it then rebased version.
> 
> Thanks, with the scripts you provided, I was able to run the
> performance tests on my laptop, and got very similar results as you
> did.
> 
> The impact of v7-0002-Remove-dead-flag-from-catcache-tuple.patch is
> very small. I think I could see it in the tests, but only barely. And
> the tests did nothing else than do syscache lookups; in any real world
> scenario, it would be lost in noise. I think we can put that aside for
> now, and focus on v6-0001-CatCache-expiration-feature.patch:

I agree with that opinion. But it is a bit disappointing that the long
struggle ended up in vain :p

> The pruning is still pretty lethargic:
> 
> - Entries created in the same transactions are never pruned away
> 
> - The size of the hash table is never shrunk. So even though the patch
>   puts a backstop to the hash table growing indefinitely, if you run one
>   transaction that bloats the cache, it's bloated for the rest of the
>   session.

Right. But more frequent checks impact performance. We could do more
aggressive pruning at idle time.

> I think that's OK. We might want to be more aggressive in the future,
> but for now it seems reasonable to lean towards the current behavior
> where nothing is pruned. Although I wonder if we should try to set
> 'catcacheclock' more aggressively. I think we could set it whenever
> GetCurrentTimestamp() is called, for example.

Ah. I hadn't thought of that direction. global_last_acquired_timestamp
or some such?

> Given how unaggressive this mechanism is, I think it should be safe to
> enable it by default. What would be a suitable default for
> catalog_cache_prune_min_age? 30 seconds?

Without having thought about it in detail, it seems a bit too short. The
value that has been suggested repeatedly for this variable is 300-600s. That
is, intermittent queries with intervals of about 5-10 minutes don't lose
cache entries.

In a bad case, two queries alternately remove each other's cache
entries.

Q1: adds 100 entries
<1 minute passed>

Q2: adds 100 entries but rehash is going to happen at 150 entries and
the existing 100 entries added by Q1 are removed.
<1 minute passed>

Q1: adds 100 entries but rehash is going to happen at 150 entries and
the existing 100 entries added by Q2 are removed.



Or a transaction sequence that persists longer than that many seconds may
lose some of the catcache entries.
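
For reference, the knob being discussed would be driven like any other
setting (a sketch; it assumes the patch registers
catalog_cache_prune_min_age as a user-settable time value in seconds):

-- postgresql.conf
-- catalog_cache_prune_min_age = 600s

-- or per session
SET catalog_cache_prune_min_age = '600s';
SHOW catalog_cache_prune_min_age;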

> Documentation needs to be updated for the new GUC.
> 
> Attached is a version with a few little cleanups:
> - use TimestampTz instead of uint64 for the timestamps
> - remove assign_catalog_cache_prune_min_age(). All it did was convert
>   the GUC's value from seconds to microseconds, and stored it in a
>   separate variable. Multiplication is cheap, so we can just do it when
>   we use the GUC's value instead.

Yeah, the latter is a trace of the struggle to cut down CPU cycles in
the normal paths. I don't object to doing so.

I found that some comments are apparently stale. cp->cc_oldest_ts is
not used anywhere, although it was added for deciding whether to scan
or not.

I fixed the following points in the attached.

- Removed some comments that are obvious. ("Timestamp in us")
- Added cp->cc_oldest_ts check in CatCacheCleanupOldEntries.
- Set the default value for catalog_cache_prune_min_age to 600s.
- Added a doc entry for the new GUC in the resource/memory section.
- Fix some code comments.
- Adjust pruning criteria from (ct->lastaccess < prune_threshold) to <=.

I was going to write in the doc something like "you can inspect memory
consumption by the catalog caches using pg_backend_memory_contexts", but
all the memory used by the catalog caches is in CacheMemoryContext.  Would
it make sense for each catalog cache to have its own context?
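
For what it's worth, with the current single-context layout the closest a
user can get from SQL is the aggregate number (a sketch against the
existing pg_backend_memory_contexts view):

SELECT name, used_bytes, total_bytes
FROM pg_backend_memory_contexts
WHERE name = 'CacheMemoryContext';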

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 01d32ffa499ee9185e222fe6fa3d39ad6ac5ff37 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 27 Jan 2021 13:08:08 +0200
Subject: [PATCH v9] CatCache expiration feature

Author: Kyotaro Horiguchi
---
 doc/src/sgml/config.sgml   | 20 +++
 src/backend/access/transam/xact.c  |  3 ++
 src/backend/utils/cache/catcache.c | 85 +-
 src/backend/utils/misc/guc.c   | 12 +
 src/include/utils/catcache.h   | 17 ++
 5 files changed, 136 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgm