Re: LSM tree for Postgres

2020-08-09 Thread Konstantin Knizhnik




On 09.08.2020 04:53, Alexander Korotkov wrote:


>> I realize that it is not a true LSM.
>> But I still want to note that it is able to provide a ~10x increase in
>> insert speed when the size of the index is comparable to the RAM size.
>> And the "true LSM" from RocksDB shows similar results.

> It's very far from being shown.  All you have shown is a naive
> benchmark.  I don't object that your design can work out in some
> cases.  And it's great that we have the lsm3 extension now.  But I
> think for the PostgreSQL core we should think about a better design.


Sorry, I meant that on this particular benchmark and hardware, Lsm3 and
RocksDB show similar performance.

That definitely doesn't mean it will be true in all other cases.
This is one of the reasons why I have published the Lsm3 and RocksDB FDW
extensions: anybody can test them on their own workload.
I would be very interested in those results, because I certainly
understand that measuring random insert performance on a dummy table is
not enough to draw conclusions.
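
For anyone who wants to try, here is a minimal sketch of such a dummy-table
random-insert test.  It assumes the lsm3 extension is installed and exposes
an access method named "lsm3"; adjust the row count so the index outgrows
RAM:

    CREATE EXTENSION lsm3;
    CREATE TABLE dummy(k bigint, v bigint);
    CREATE INDEX ON dummy USING lsm3 (k);
    -- Random keys defeat locality; compare against a plain btree index.
    INSERT INTO dummy
        SELECT (random() * 1e12)::bigint, i
        FROM generate_series(1, 100000000) AS i;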

And I certainly do not want to say that we do not need a "right" LSM
implementation inside the Postgres core.

It just requires an order of magnitude more effort.
And there are many questions and challenges.  For example, the Postgres
buffer size (8kB) seems too small for an LSM.
Should an LSM implementation bypass the Postgres buffer cache?  There
are pros and cons...


Another issue is logging.  Should we just log all LSM operations in the
WAL in the usual way (as is done for nbtree and Lsm3)?
It seems to me that alternative, more efficient solutions can be
proposed for an LSM.
For example, we may not log inserts into the top index at all and just
replay them during recovery, assuming that this operation on a small
index is fast enough.  And the merge of the top index into the base
index can be done atomically, so it also doesn't require WAL.


As far as I know, Anastasia Lubennikova implemented an LSM for Postgres
several years ago.

There were some performance issues (with concurrent access?).
This is why the first thing I want to clarify for myself is what the
bottlenecks of the LSM architecture are, and whether they are caused by
the LSM itself or by its integration into the Postgres infrastructure.


In any case, before thinking about the details of an in-core LSM
implementation for Postgres, I think it is necessary to demonstrate
workloads on which RocksDB (or any other existing DBMS with an LSM)
shows significant performance advantages compared with Postgres with
nbtree/Lsm3.



>> Maybe if the size of the index is 100 times larger than the size of
>> RAM, RocksDB will be significantly faster than Lsm3.  But modern
>> servers have 0.5-1TB of RAM.
>> I can't believe that there are databases with 100TB indexes.

> Comparing the whole RAM size to the size of a single index looks plain
> wrong to me.  I think we can roughly compare the whole RAM size to the
> whole database size.  But not all of the RAM is always available for
> caching data, either.  Let's assume half of the RAM is used for caching
> data.  So, a modern server with 0.5-1TB of RAM, which suffers from
> random B-tree insertions and badly needs an LSM-like data structure,
> runs a database of 25-50TB.  Frankly speaking, there is nothing
> counterintuitive in that for me.


Indeed, there is nothing counterintuitive here.
I just mean that there are not that many 25-50TB OLTP databases.





Re: Unnecessary delay in streaming replication due to replay lag

2020-08-09 Thread Michael Paquier
On Sun, Aug 09, 2020 at 05:54:32AM +, Asim Praveen wrote:
> I would like to revive this thread by submitting a rebased patch to
> start streaming replication without waiting for the startup process to
> finish replaying all WAL.  The start LSN for streaming is determined
> to be the LSN that points to the beginning of the most recently
> flushed WAL segment.
> 
> The patch passes tests under src/test/recovery and a top-level "make check".

I have not really looked at the proposed patch, but it would be good
to have some documentation.
--
Michael




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-08-09 Thread Michael Paquier
On Sun, Apr 26, 2020 at 12:56:14PM -0500, Justin Pryzby wrote:
> Rebased onto d12bdba77b0fce9df818bc84ad8b1d8e7a96614b
> 
> Restored two tests from Alexey's original patch which exposed issue with
> REINDEX DATABASE when allow_system_table_mods=off.

I have been looking at 0001 as a start, and your patch is incorrect in
a couple of respects for the completion of REINDEX:
- "(" is not proposed as a completion option after the initial
REINDEX, and I think that it should be.
- completion becomes incorrect for all the commands once a parenthesized
list of options is present, as CONCURRENTLY goes missing.

The presence of CONCURRENTLY makes the completion a bit more complex
than for the other commands, as we need to offer this keyword, when not
already specified, along with the objects of the wanted type to reindex,
but it can be done as in the attached.  What do you think?
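
For context, the command forms this completion has to cover include, for
example (object names are placeholders):

    REINDEX TABLE CONCURRENTLY my_table;
    REINDEX (VERBOSE) INDEX my_index;
    REINDEX (VERBOSE) TABLE CONCURRENTLY my_table;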
--
Michael
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index c4af40bfa9..26a9b2a722 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -3431,27 +3431,48 @@ psql_completion(const char *text, int start, int end)
 
 /* REINDEX */
 	else if (Matches("REINDEX"))
+		COMPLETE_WITH("TABLE", "INDEX", "SYSTEM", "SCHEMA", "DATABASE", "(");
+	else if (Matches("REINDEX", "(*)"))
 		COMPLETE_WITH("TABLE", "INDEX", "SYSTEM", "SCHEMA", "DATABASE");
-	else if (Matches("REINDEX", "TABLE"))
+	else if (Matches("REINDEX", "TABLE") ||
+			 Matches("REINDEX", "(*)", "TABLE"))
 		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_indexables,
    " UNION SELECT 'CONCURRENTLY'");
-	else if (Matches("REINDEX", "INDEX"))
+	else if (Matches("REINDEX", "INDEX") ||
+			 Matches("REINDEX", "(*)", "INDEX"))
 		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_indexes,
    " UNION SELECT 'CONCURRENTLY'");
-	else if (Matches("REINDEX", "SCHEMA"))
+	else if (Matches("REINDEX", "SCHEMA") ||
+			 Matches("REINDEX", "(*)", "SCHEMA"))
 		COMPLETE_WITH_QUERY(Query_for_list_of_schemas
 			" UNION SELECT 'CONCURRENTLY'");
-	else if (Matches("REINDEX", "SYSTEM|DATABASE"))
+	else if (Matches("REINDEX", "SYSTEM|DATABASE") ||
+			 Matches("REINDEX", "(*)", "DATABASE"))
 		COMPLETE_WITH_QUERY(Query_for_list_of_databases
 			" UNION SELECT 'CONCURRENTLY'");
-	else if (Matches("REINDEX", "TABLE", "CONCURRENTLY"))
+	else if (Matches("REINDEX", "TABLE", "CONCURRENTLY") ||
+			 Matches("REINDEX", "(*)", "TABLE", "CONCURRENTLY"))
 		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_indexables, NULL);
-	else if (Matches("REINDEX", "INDEX", "CONCURRENTLY"))
+	else if (Matches("REINDEX", "INDEX", "CONCURRENTLY") ||
+			 Matches("REINDEX", "(*)", "INDEX", "CONCURRENTLY"))
 		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_indexes, NULL);
-	else if (Matches("REINDEX", "SCHEMA", "CONCURRENTLY"))
+	else if (Matches("REINDEX", "SCHEMA", "CONCURRENTLY") ||
+			 Matches("REINDEX", "(*)", "SCHEMA", "CONCURRENTLY"))
 		COMPLETE_WITH_QUERY(Query_for_list_of_schemas);
-	else if (Matches("REINDEX", "SYSTEM|DATABASE", "CONCURRENTLY"))
+	else if (Matches("REINDEX", "SYSTEM|DATABASE", "CONCURRENTLY") ||
+			 Matches("REINDEX", "(*)", "SYSTEM|DATABASE", "CONCURRENTLY"))
 		COMPLETE_WITH_QUERY(Query_for_list_of_databases);
+	else if (HeadMatches("REINDEX", "(*") &&
+			 !HeadMatches("REINDEX", "(*)"))
+	{
+		/*
+		 * This fires if we're in an unfinished parenthesized option list.
+		 * get_previous_words treats a completed parenthesized option list as
+		 * one word, so the above test is correct.
+		 */
+		if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
+			COMPLETE_WITH("VERBOSE");
+	}
 
 /* SECURITY LABEL */
 	else if (Matches("SECURITY"))




Optimising latch signals

2020-08-09 Thread Thomas Munro
Hi hackers,

Here are some more experimental patches to reduce system calls.

0001 skips sending signals when the recipient definitely isn't
waiting, using a new flag-and-memory-barrier dance.  This seems to
skip around 12% of the kill() calls for "make check", and probably
helps with some replication configurations that do a lot of
signalling.  Patch by Andres (with a small memory barrier adjustment
by me).

0002 gets rid of the latch self-pipe on Linux systems.

0003 does the same on *BSD/macOS systems.

The idea for 0002 and 0003 is to use a new dedicated signal just for
latch wakeups, and keep it blocked (Linux) or ignored (BSD), except
while waiting.  There may be other ways to achieve this without
bringing in a new signal, but it seemed important to leave SIGUSR1
unblocked for procsignals, and hard to figure out how to multiplex
with existing SIGUSR2 users, so for the first attempt at prototyping
this I arbitrarily chose SIGURG.
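
To make the blocked-except-while-waiting trick concrete, here is a minimal
standalone sketch of the race-free wait on Linux.  This is not the patch's
code, just the underlying idea (the *BSD/macOS variant in 0003 takes a
different, kqueue-based route):

/* Latch wake-up without a self-pipe: keep SIGURG blocked except while
 * the kernel is actually waiting in epoll_pwait(). */
#include <errno.h>
#include <signal.h>
#include <sys/epoll.h>

static void
sigurg_handler(int signo)
{
	(void) signo;	/* nothing to do: delivery alone interrupts the wait */
}

static int
sketch_wait(int epfd)
{
	struct epoll_event event;
	sigset_t	block_sigurg,
				wait_mask;
	int			rc;

	signal(SIGURG, sigurg_handler);

	/* Block SIGURG during normal execution... */
	sigemptyset(&block_sigurg);
	sigaddset(&block_sigurg, SIGURG);
	sigprocmask(SIG_BLOCK, &block_sigurg, &wait_mask);
	/* ...and compute a wait mask that has it unblocked. */
	sigdelset(&wait_mask, SIGURG);

	/*
	 * A SIGURG sent before this point stays pending; epoll_pwait()
	 * installs wait_mask atomically, so the pending signal interrupts
	 * the wait with EINTR instead of being lost.  That is the race the
	 * self-pipe trick used to close.
	 */
	rc = epoll_pwait(epfd, &event, 1, -1 /* no timeout */ , &wait_mask);
	if (rc < 0 && errno == EINTR)
		return 1;				/* woken by the latch signal */
	return rc;					/* socket events, or -1 on other errors */
}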
From a45e9edaa315e62a0f823dbd55b42950fc8d5df3 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 8 Aug 2020 15:08:09 +1200
Subject: [PATCH 1/3] Optimize latches to send fewer signals.

Author: Andres Freund 
---
 src/backend/storage/ipc/latch.c | 26 ++
 src/include/storage/latch.h |  1 +
 2 files changed, 27 insertions(+)

diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 4153cc8557..37528f29b5 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -274,6 +274,7 @@ void
 InitLatch(Latch *latch)
 {
 	latch->is_set = false;
+	latch->maybe_sleeping = false;
 	latch->owner_pid = MyProcPid;
 	latch->is_shared = false;
 
@@ -321,6 +322,7 @@ InitSharedLatch(Latch *latch)
 #endif
 
 	latch->is_set = false;
+	latch->maybe_sleeping = false;
 	latch->owner_pid = 0;
 	latch->is_shared = true;
 }
@@ -523,6 +525,10 @@ SetLatch(Latch *latch)
 
 	latch->is_set = true;
 
+	pg_memory_barrier();
+	if (!latch->maybe_sleeping)
+		return;
+
 #ifndef WIN32
 
 	/*
@@ -589,6 +595,7 @@ ResetLatch(Latch *latch)
 {
 	/* Only the owner should reset the latch */
 	Assert(latch->owner_pid == MyProcPid);
+	Assert(latch->maybe_sleeping == false);
 
 	latch->is_set = false;
 
@@ -1273,6 +1280,14 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
 		 * ordering, so that we cannot miss seeing is_set if a notification
 		 * has already been queued.
 		 */
+		if (set->latch && !set->latch->is_set)
+		{
+			/* about to sleep on a latch */
+			set->latch->maybe_sleeping = true;
+			pg_memory_barrier();
+			/* and recheck */
+		}
+
 		if (set->latch && set->latch->is_set)
 		{
 			occurred_events->fd = PGINVALID_SOCKET;
@@ -1283,6 +1298,9 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
 			occurred_events++;
 			returned_events++;
 
+			/* could have been set above */
+			set->latch->maybe_sleeping = false;
+
 			break;
 		}
 
@@ -1294,6 +1312,12 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
 		rc = WaitEventSetWaitBlock(set, cur_timeout,
    occurred_events, nevents);
 
+		if (set->latch)
+		{
+			Assert(set->latch->maybe_sleeping);
+			set->latch->maybe_sleeping = false;
+		}
+
 		if (rc == -1)
 			break;/* timeout occurred */
 		else
@@ -1383,6 +1407,8 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 		if (cur_event->events == WL_LATCH_SET &&
 			cur_epoll_event->events & (EPOLLIN | EPOLLERR | EPOLLHUP))
 		{
+			Assert(set->latch);
+
 			/* There's data in the self-pipe, clear it. */
 			drainSelfPipe();
 
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 7c742021fb..d0da7e5d31 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -110,6 +110,7 @@
 typedef struct Latch
 {
 	sig_atomic_t is_set;
+	sig_atomic_t maybe_sleeping;
 	bool		is_shared;
 	int			owner_pid;
 #ifdef WIN32
-- 
2.20.1

From 0cac26c1499fffa2a6f4e71f59e7a4e9a31be833 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 8 Aug 2020 16:32:22 +1200
Subject: [PATCH 2/3] Remove self-pipe in WAIT_USE_EPOLL builds.

Use SIGURG instead of SIGUSR1 for latch wake-ups, and unblock it only
while in epoll_pwait() to avoid the race traditionally solved by the
self-pipe trick.
---
 src/backend/libpq/pqsignal.c| 13 +++-
 src/backend/postmaster/postmaster.c |  4 +-
 src/backend/storage/ipc/latch.c | 97 +++--
 src/backend/tcop/postgres.c |  2 +-
 src/include/libpq/pqsignal.h|  3 +-
 src/include/storage/latch.h |  2 +
 6 files changed, 96 insertions(+), 25 deletions(-)

diff --git a/src/backend/libpq/pqsignal.c b/src/backend/libpq/pqsignal.c
index 9289493118..13ec3cd533 100644
--- a/src/backend/libpq/pqsignal.c
+++ b/src/backend/libpq/pqsignal.c
@@ -20,6 +20,7 @@
 
 /* Global variables */
 sigset_t	UnBlockSig,
+			UnBlockAllSig,
 			BlockSig,
 			StartupBlockSig;
 
@@ -35,13 +36,21 @@ sigset_t	UnBlockSig,
  * collection; it's essentially BlockSig minus SIGTERM, SIGQUIT, SIGALRM.
  *
  * UnBlockSig is the set of signals

tab completion of IMPORT FOREIGN SCHEMA

2020-08-09 Thread Jeff Janes
I use IMPORT FOREIGN SCHEMA a bit to set up systems for testing.  But not
enough that I can ever remember whether INTO or FROM SERVER comes first in
the syntax.

Here is an improvement to the tab completion, so I don't have to keep
looking it up in the docs.
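
For the record, FROM SERVER comes before INTO.  A typical invocation (all
names here are placeholders):

    IMPORT FOREIGN SCHEMA remote_schema
        LIMIT TO (t1, t2)
        FROM SERVER my_server INTO local_schema;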

It accidentally (even before this patch) completes "IMPORT FOREIGN SCHEMA"
with a list of local schemas.  This is probably wrong, but I find it
convenient, as I often do this in a loop-back setup where the list of
foreign schemas would be the same as the local ones.  So I don't
countermand that behavior here.

Cheers,

Jeff


foreign_schema_tab_complete.patch
Description: Binary data


Re: tab completion of IMPORT FOREIGN SCHEMA

2020-08-09 Thread Tom Lane
Jeff Janes  writes:
> It accidentally (even before this patch) completes "IMPORT FOREIGN SCHEMA"
> with a list of local schemas.  This is probably wrong, but I find it
> convenient, as I often do this in a loop-back setup where the list of
> foreign schemas would be the same as the local ones.  So I don't
> countermand that behavior here.

I don't see how psql could obtain a "real" list of foreign schemas
from an arbitrary FDW, even if it magically knew which server the
user would specify later in the command.  So this behavior seems fine.
It has some usefulness, while not completing at all would have none.

It might be a good idea to figure out where that completion is
happening and add a comment there about this point.

regards, tom lane




Re: how to create index concurrently on partitioned table

2020-08-09 Thread Justin Pryzby
Thanks for looking.

On Sun, Aug 09, 2020 at 02:00:09PM +0900, Michael Paquier wrote:
> > exactly what's needed.
> 
> For now, I would recommend to focus first on 0001 to add support for
> partitioned tables and indexes to REINDEX.  CIC is much more
> complicated btw, but I am not entering in the details now.
> 
> +   /* Avoid erroring out */
> if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
> {
> This comment does not help, and actually this becomes incorrect as
> reindex for this relkind becomes supported once 0001 is done.

I made a minimal change to avoid forgetting to eventually change that part.

> ReindexPartitionedRel() fails to consider the concurrent case here for
> partition indexes and tables, as reindex_index()/reindex_relation()
> are the APIs used in the non-concurrent case.  Once you consider the
> concurrent case correctly, we also need to be careful with partitions
> that have a temporary persistency (note that we don't allow partition
> trees to mix persistency types, all partitions have to be temporary or
> permanent).

Fixed.

> I think that you are right to make the entry point to handle
> partitioned index in ReindexIndex() and partitioned table in
> ReindexTable(), but the structure of the patch should be different:
> - The second portion of ReindexMultipleTables() should be moved into a
> separate routine, taking in input a list of relation OIDs.  This needs
> to be extended a bit so as reindex_index() gets called for an index
> relkind if the relpersistence is temporary or if we have a
> non-concurrent reindex.  The idea is that we finish with a single code
> path able to work on a list of relations.  And your patch adds more of
> that as of ReindexPartitionedRel().

It's a good idea.

> - We should *not* handle directly partitioned index and/or table in
> ReindexRelationConcurrently() to not complicate the logic where we
> gather all the indexes of a table/matview.  So I think that the list
> of partition indexes/tables to work on should be built directly in
> ReindexIndex() and ReindexTable(), and then this should call the
> second part of ReindexMultipleTables() refactored in the previous
> point.

I think I addressed these mostly as you intended.

> This way, each partition index gets done individually in its
> own transaction.  For a partition table, all indexes of this partition
> are rebuilt in the same set of transactions.  For the concurrent case,
> we already have reindex_concurrently_swap, which is able to switch the
> dependencies of two indexes within a partition tree, so we can rely on
> that so that a failure in the middle of the operation never leaves a
> partition structure in an inconsistent state.
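
For illustration, with 0001 applied both of the following become valid (the
table and index names are hypothetical):

    REINDEX TABLE measurement;          -- rebuilds the indexes of all partitions
    REINDEX INDEX measurement_city_idx; -- recurses through the partitioned index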

-- 
Justin
>From 25486d56303de512368f322c87528c2be29af0de Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sun, 7 Jun 2020 15:59:54 -0500
Subject: [PATCH v5 1/3] Implement REINDEX of partitioned tables/indexes

---
 doc/src/sgml/ref/reindex.sgml  |   5 -
 src/backend/catalog/index.c|   8 +-
 src/backend/commands/indexcmds.c   | 137 ++---
 src/backend/tcop/utility.c |   6 +-
 src/include/commands/defrem.h  |   6 +-
 src/test/regress/expected/create_index.out |  18 +--
 src/test/regress/sql/create_index.sql  |  13 +-
 7 files changed, 122 insertions(+), 71 deletions(-)

diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index aac5d5be23..e2e32b3ba0 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -258,11 +258,6 @@ REINDEX [ ( option [, ...] ) ] { IN
can always reindex anything.
   
 
-  
-   Reindexing partitioned tables or partitioned indexes is not supported.
-   Each individual partition can be reindexed separately instead.
-  
-
   
Rebuilding Indexes Concurrently
 
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 1be27eec52..f69f027890 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3660,12 +3660,8 @@ reindex_relation(Oid relid, int flags, int options)
 	 */
 	rel = table_open(relid, ShareLock);
 
-	/*
-	 * This may be useful when implemented someday; but that day is not today.
-	 * For now, avoid erroring out when called in a multi-table context
-	 * (REINDEX SCHEMA) and happen to come across a partitioned table.  The
-	 * partitions may be reindexed on their own anyway.
-	 */
+	/* Skip partitioned tables if called in a multi-table context.  The
+	 * partitions are reindexed on their own. */
 	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		ereport(WARNING,
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 2baca12c5f..68d75916ae 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -88,7 +88,10 @@ static List *ChooseIndexColumnNames(List *indexElems);
 static void RangeVarCallbackForReindexIndex(const RangeVar *relation,
 			Oid relId, Oid oldRelId, v

Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-08-09 Thread Justin Pryzby
On Sun, Aug 09, 2020 at 08:02:52PM +0900, Michael Paquier wrote:
> I have been looking at 0001 as a start, and your patch is incorrect in
> a couple of respects for the completion of REINDEX:
> - "(" is not proposed as a completion option after the initial
> REINDEX, and I think that it should be.

That part of your patch handles REINDEX and REINDEX(*) differently than
mine.  Yours is technically more correct/complete.  But I recall Tom
objected to a different patch because it completed to a single
character.  I think the case is arguable either way: if only some
completions are shown, then it hides the others...
https://www.postgresql.org/message-id/14255.1536781...@sss.pgh.pa.us

-   else if (Matches("REINDEX") || Matches("REINDEX", "(*)"))
+   else if (Matches("REINDEX"))
+       COMPLETE_WITH("TABLE", "INDEX", "SYSTEM", "SCHEMA", "DATABASE", "(");
+   else if (Matches("REINDEX", "(*)"))
        COMPLETE_WITH("TABLE", "INDEX", "SYSTEM", "SCHEMA", "DATABASE");

> - completion becomes incorrect for all the commands once a parenthesized
> list of options is present, as CONCURRENTLY goes missing.

The rest of your patch looks fine.  In my mind, REINDEX(CONCURRENTLY) was
the "new way" to write things, and it's what's easy to support, so I
didn't put special effort into making the tab completion itself complete.

-- 
Justin




Re: Unnecessary delay in streaming replication due to replay lag

2020-08-09 Thread Asim Praveen


> On 09-Aug-2020, at 2:11 PM, Michael Paquier  wrote:
> 
> I have not really looked at the proposed patch, but it would be good
> to have some documentation.
> 

Ah, right.  The basic idea is to reuse the logic that allows read-only
connections in order to also start WAL streaming early.  The patch borrows a
new GUC, "wal_receiver_start_condition", introduced by another patch alluded
to upthread.  It affects when the WAL receiver process is started on a
standby.  By default, the GUC is set to "replay", which means no change in
current behavior - the WAL receiver is started only after replaying all WAL
already available in pg_wal.  When set to "consistency", the WAL receiver
process is started earlier, as soon as a consistent state is reached during
WAL replay.
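
On a standby that would look like the following.  Note that
wal_receiver_start_condition is the GUC proposed by this patch series, not an
existing core setting, and depending on the GUC's context a restart may be
needed instead of a reload:

    ALTER SYSTEM SET wal_receiver_start_condition = 'consistency';
    SELECT pg_reload_conf();  -- or restart the standby if required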

The LSN to start streaming from is determined to be the one that points at
the beginning of the WAL segment file that was most recently flushed to
pg_wal.  To find the most recently flushed WAL segment, the first blocks of
all WAL segment files in pg_wal, starting from the segment that contains the
currently replayed record, are inspected.  The search stops when the first
page with no valid header is found.

The benefits of starting the WAL receiver early are mentioned upthread, but
allow me to reiterate: as soon as WAL streaming starts, any commits that are
waiting for synchronous replication on the master are unblocked.  The
benefit is apparent in situations where a significant replay lag has built
up and replication is configured to be synchronous.

Asim

Re: display offset along with block number in vacuum errors

2020-08-09 Thread Masahiko Sawada
On Fri, 7 Aug 2020 at 13:04, Amit Kapila  wrote:
>
> On Fri, Aug 7, 2020 at 8:10 AM Masahiko Sawada
>  wrote:
> >
> > On Fri, 7 Aug 2020 at 10:49, Amit Kapila  wrote:
> > >
> > > On Thu, Aug 6, 2020 at 7:51 PM Justin Pryzby  wrote:
> > > >
> > > > On Thu, Aug 06, 2020 at 07:39:21PM +0530, Amit Kapila wrote:
> > > > > On Wed, Jul 29, 2020 at 1:09 AM Justin Pryzby  
> > > > > wrote:
> > > > > >
> > > > > >
> > > > > > lazy_check_needs_freeze iterates over blocks and this patch changes
> > > > > > it to update vacrelstats.  I think it should do what
> > > > > > lazy_{vacuum/cleanup}_heap/page/index do and call
> > > > > > update_vacuum_error_info() at its beginning (even though only the
> > > > > > offset is changed), and then restore_vacuum_error_info() at its end
> > > > > > (to "revert back" to the item number it started with).
> > > > > >
> > > > >
> > > > > I see that Mahendra has changed the patch as per this suggestion, but
> > > > > I am not convinced that it is a good idea to sprinkle
> > > > > update_vacuum_error_info()/restore_vacuum_error_info() in more places
> > > > > than required. I see that it might look a bit cleaner from the
> > > > > perspective that if tomorrow we use the function
> > > > > lazy_check_needs_freeze() for a different purpose then we don't need
> > > > > to worry about the wrong phase information. If we are worried about
> > > > > that then we should have an assert in that function to ensure that the
> > > > > current phase is VACUUM_ERRCB_PHASE_SCAN_HEAP.
> > > >
> > > > The motivation was to restore the offnum, which is set to Invalid at
> > > > the start of lazy_scan_heap(), and then set valid within
> > > > lazy_check_needs_freeze(), but should be restored or re-set to Invalid
> > > > when it returns to lazy_scan_heap().  If you think it's important, we
> > > > could just set vacrelstats->offnum = Invalid before returning,
> > > >
> > >
> > > Yeah, I would prefer that and probably a comment to indicate why we
> > > are doing that.
> >
> > +1
> >
> > I'm concerned that we call the update and restore in
> > heap_page_is_all_visible(). Unlike lazy_check_needs_freeze(), this
> > function is called for every vacuumed page. I commented that if we
> > want to update the offset number during iterating tuples in the
> > function we should change the phase to SCAN_HEAP at the beginning of
> > the function because it's actually not vacuuming.
> >
>
> AFAICS, heap_page_is_all_visible() is called from only one place and
> that is lazy_vacuum_page(), so I think if there is any error in
> heap_page_is_all_visible(), it should be considered a VACUUM_HEAP
> phase error.

It's true that heap_page_is_all_visible() is called only from
lazy_vacuum_page(), but I'm concerned it could be misleading, since it's
not actually removing tuples but just checking after vacuuming.  I guess
the errcontext should show what the process is actually doing and
thereby help the investigation, so I thought VACUUM_HEAP might not be
appropriate for this case.  But I see your point as well.  The other
vacuum error context phases match the vacuum progress information
phases.  So in heap_page_is_all_visible(), I agree it's better to update
the offset number and keep the phase VACUUM_HEAP rather than do nothing.
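
For concreteness, here is a sketch of the lazy_check_needs_freeze() shape
agreed on upthread: advertise the offset while scanning and reset it to
Invalid before returning.  The vacrelstats parameter and the offnum field
are the patch's additions; the rest is abridged from vacuumlazy.c:

static bool
lazy_check_needs_freeze(Buffer buf, bool *hastup, LVRelStats *vacrelstats)
{
	Page		page = BufferGetPage(buf);
	OffsetNumber offnum,
				maxoff;

	*hastup = false;

	if (PageIsNew(page) || PageIsEmpty(page))
		return false;

	maxoff = PageGetMaxOffsetNumber(page);
	for (offnum = FirstOffsetNumber; offnum <= maxoff;
		 offnum = OffsetNumberNext(offnum))
	{
		ItemId		itemid = PageGetItemId(page, offnum);

		/* Let the error context callback report the offset being checked. */
		vacrelstats->offnum = offnum;

		if (ItemIdIsUsed(itemid))
			*hastup = true;
		if (!ItemIdIsNormal(itemid))
			continue;			/* dead/redirect items never need freezing */

		if (heap_tuple_needs_freeze((HeapTupleHeader) PageGetItem(page, itemid),
									FreezeLimit, MultiXactCutoff, buf))
		{
			/* Reset: the offset is meaningless once we return. */
			vacrelstats->offnum = InvalidOffsetNumber;
			return true;
		}
	}

	vacrelstats->offnum = InvalidOffsetNumber;
	return false;
}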

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Add information to rm_redo_error_callback()

2020-08-09 Thread Masahiko Sawada
On Wed, 5 Aug 2020 at 00:37, Drouvot, Bertrand  wrote:
>
> Hi hackers,
>
> I've attached a small patch to add information to rm_redo_error_callback().
>
> The changes attached in this patch came while working on the "Add
> information during standby recovery conflicts" patch (See [1]).
>
> The goal is to add more information during the callback (if doable), so
> that something like:
>
> 2020-08-04 14:42:57.545 UTC [15459] CONTEXT:  WAL redo at 0/4A3B0DE0 for
> Heap2/CLEAN: remxid 1168
>
> would get extra information that way:
>
> 2020-08-04 14:42:57.545 UTC [15459] CONTEXT:  WAL redo at 0/4A3B0DE0 for
> Heap2/CLEAN: remxid 1168, blkref #0: rel 1663/13586/16850 fork main blk 0
>
> As this could be useful outside of [1], a dedicated "sub" patch has been
> created (thanks Sawada for the suggestion).
>
> I will add this patch to the next commitfest. I look forward to your
> feedback about the idea and/or implementation.
>

Thank you for starting the new thread for this patch!

I think this patch is simple enough and improves information shown in
errcontext.

I have two comments on the patch:

diff --git a/src/backend/access/transam/xlog.c
b/src/backend/access/transam/xlog.c
index 756b838e6a..8b2024e9e9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -11749,10 +11749,22 @@ rm_redo_error_callback(void *arg)
 {
XLogReaderState *record = (XLogReaderState *) arg;
StringInfoData buf;
+   int block_id;
+   RelFileNode rnode;
+   ForkNumber forknum;
+   BlockNumber blknum;

initStringInfo(&buf);
xlog_outdesc(&buf, record);

+   for (block_id = 0; block_id <= record->max_block_id; block_id++)
+   {
+   if (XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blknum))
+   appendStringInfo(&buf,", blkref #%d: rel %u/%u/%u fork %s blk %u",
+   block_id, rnode.spcNode, rnode.dbNode,
+   rnode.relNode, forkNames[forknum],
+   blknum);
+   }
/* translator: %s is a WAL record description */
errcontext("WAL redo at %X/%X for %s",
   (uint32) (record->ReadRecPtr >> 32),

rnode, forknum and blknum can be declared within the for loop.

I think it's better to put a newline just before the comment starting
with "translator:".

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Change a constraint's index - ALTER TABLE ... ALTER CONSTRAINT ... USING INDEX ...

2020-08-09 Thread Anna Akenteva

On 2020-07-07 01:08, Tom Lane wrote:

> Alvaro Herrera  writes:
>
>> On 2020-Jul-05, Anna Akenteva wrote:
>>
>>> -- Swapping primary key's index for an equivalent index,
>>> -- but with INCLUDE-d attributes.
>>> CREATE UNIQUE INDEX new_idx ON target_tbl (id) INCLUDE (info);
>>> ALTER TABLE target_tbl ALTER CONSTRAINT target_tbl_pkey
>>>     USING INDEX new_idx;
>>> ALTER TABLE referencing_tbl ALTER CONSTRAINT referencing_tbl_id_ref_fkey
>>>     USING INDEX new_idx;
>
>> How is this state represented by pg_dump?
>
> Even if it's possible to represent, I think we should flat out reject
> this "feature".  Primary keys that aren't primary keys don't seem like
> a good idea.  For one thing, it won't be possible to describe the
> constraint accurately in the information_schema.


Do you think it could still be a good idea if we only swapped the
relfilenodes of the indexes, as was suggested in [1]?  The original use
case was getting rid of index bloat, which is now solved by REINDEX
CONCURRENTLY, but this feature still has its own use case: adding
INCLUDE-d columns to constraint indexes.


[1]
https://www.postgresql.org/message-id/flat/CABwTF4UxTg%2BkERo1Nd4dt%2BH2miJoLPcASMFecS1-XHijABOpPg%40mail.gmail.com

--
Anna Akenteva
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Unnecessary delay in streaming replication due to replay lag

2020-08-09 Thread Masahiko Sawada
On Sun, 9 Aug 2020 at 14:54, Asim Praveen  wrote:
>
> I would like to revive this thread by submitting a rebased patch to start
> streaming replication without waiting for the startup process to finish
> replaying all WAL.  The start LSN for streaming is determined to be the LSN
> that points to the beginning of the most recently flushed WAL segment.
>
> The patch passes tests under src/test/recovery and a top-level "make check".
>

The patch applies cleanly to the current HEAD, but I got the following
error when building the code with this patch:

xlog.c: In function 'StartupXLOG':
xlog.c:7315:6: error: too few arguments to function 'RequestXLogStreaming'
 7315 |  RequestXLogStreaming(ThisTimeLineID,
  |  ^~~~
In file included from xlog.c:59:
../../../../src/include/replication/walreceiver.h:463:13: note: declared here
  463 | extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
  | ^~~~

cfbot also complains about this.

Could you please update the patch?

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services