Re: pgsql: Enable parallel SELECT for "INSERT INTO ... SELECT ...".
On Wed, Mar 10, 2021 at 9:07 AM Tom Lane wrote: > > Amit Kapila writes: > > Enable parallel SELECT for "INSERT INTO ... SELECT ...". > > skink (valgrind) is unhappy: > > creating configuration files ... ok > running bootstrap script ... ok > performing post-bootstrap initialization ... ==4085668== VALGRINDERROR-BEGIN > ==4085668== Conditional jump or move depends on uninitialised value(s) > ==4085668==at 0x4AEB77: max_parallel_hazard_walker (clauses.c:700) > ==4085668==by 0x445287: expression_tree_walker (nodeFuncs.c:2188) > ==4085668==by 0x4AEBB8: max_parallel_hazard_walker (clauses.c:860) > ==4085668==by 0x4B045E: is_parallel_safe (clauses.c:637) > ==4085668==by 0x4985D0: grouping_planner (planner.c:2070) > ==4085668==by 0x49AE4F: subquery_planner (planner.c:1024) > ==4085668==by 0x49B4F5: standard_planner (planner.c:404) > ==4085668==by 0x49BAD2: planner (planner.c:273) > ==4085668==by 0x5818BE: pg_plan_query (postgres.c:809) > ==4085668==by 0x581977: pg_plan_queries (postgres.c:900) > ==4085668==by 0x581E70: exec_simple_query (postgres.c:1092) > ==4085668==by 0x583F7A: PostgresMain (postgres.c:4327) > ==4085668== Uninitialised value was created by a stack allocation > ==4085668==at 0x4B0363: is_parallel_safe (clauses.c:599) > ==4085668== > ==4085668== VALGRINDERROR-END > > There are a few other commits that skink hasn't seen before, but given > the apparent connection to parallel planning, none of the others look > like plausible candidates to explain this. > Right, the patch forgot to initialize a new variable in max_parallel_hazard_context via is_parallel_safe. I think we need to initialize all the new variables as NULL because is_parallel_safe is used to check parallel-safety of expressions. These new variables are only required for checking parallel-safety of target relation which is already done at the time of initial checks in standard_planner. -- With Regards, Amit Kapila.
Re: Add some tests for pg_stat_statements compatibility verification under contrib
Hi Julien, Thanks a lot for the quick review. Please see my answers inline below. Attached is the new patch. v2_add_test_for_pg_stat_statements.patch Description: Binary data
Re: About to add WAL write/fsync statistics to pg_stat_wal view
On 2021/03/10 14:11, Masahiro Ikeda wrote: On 2021-03-09 17:51, Fujii Masao wrote: On 2021/03/05 8:38, Masahiro Ikeda wrote: On 2021-03-05 01:02, Fujii Masao wrote: On 2021/03/04 16:14, Masahiro Ikeda wrote: On 2021-03-03 20:27, Masahiro Ikeda wrote: On 2021-03-03 16:30, Fujii Masao wrote: On 2021/03/03 14:33, Masahiro Ikeda wrote: On 2021-02-24 16:14, Fujii Masao wrote: On 2021/02/15 11:59, Masahiro Ikeda wrote: On 2021-02-10 00:51, David G. Johnston wrote: On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda wrote: I pgindented the patches. ... XLogWrite, which is invoked during an XLogFlush request (see ...). This is also incremented by the WAL receiver during replication. ("which normally called" should be "which is normally called" or "which normally is called" if you want to keep true to the original) You missed the adding the space before an opening parenthesis here and elsewhere (probably copy-paste) is ether -> is either "This parameter is off by default as it will repeatedly query the operating system..." ", because" -> "as" Thanks, I fixed them. wal_write_time and the sync items also need the note: "This is also incremented by the WAL receiver during replication." I skipped changing it since I separated the stats for the WAL receiver in pg_stat_wal_receiver. "The number of times it happened..." -> " (the tally of this event is reported in wal_buffers_full in) This is undesirable because ..." Thanks, I fixed it. I notice that the patch for WAL receiver doesn't require explicitly computing the sync statistics but does require computing the write statistics. This is because of the presence of issue_xlog_fsync but absence of an equivalent pg_xlog_pwrite. Additionally, I observe that the XLogWrite code path calls pgstat_report_wait_*() while the WAL receiver path does not. It seems technically straight-forward to refactor here to avoid the almost-duplicated logic in the two places, though I suspect there may be a trade-off for not adding another function call to the stack given the importance of WAL processing (though that seems marginalized compared to the cost of actually writing the WAL). Or, as Fujii noted, go the other way and don't have any shared code between the two but instead implement the WAL receiver one to use pg_stat_wal_receiver instead. In either case, this half-and-half implementation seems undesirable. OK, as Fujii-san mentioned, I separated the WAL receiver stats. (v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch) Thanks for updating the patches! I added the infrastructure code to communicate the WAL receiver stats messages between the WAL receiver and the stats collector, and the stats for WAL receiver is counted in pg_stat_wal_receiver. What do you think? On second thought, this idea seems not good. Because those stats are collected between multiple walreceivers, but other values in pg_stat_wal_receiver is only related to the walreceiver process running at that moment. IOW, it seems strange that some values show dynamic stats and the others show collected stats, even though they are in the same view pg_stat_wal_receiver. Thought? OK, I fixed it. The stats collected in the WAL receiver is exposed in pg_stat_wal view in v11 patch. Thanks for updating the patches! I'm now reading 001 patch. 
+ /* Check whether the WAL file was synced to disk right now */ + if (enableFsync && + (sync_method == SYNC_METHOD_FSYNC || + sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH || + sync_method == SYNC_METHOD_FDATASYNC)) + { Isn't it better to make issue_xlog_fsync() return immediately if enableFsync is off or sync_method is open_sync or open_datasync, to simplify the code further? Thanks for the comments. I added the above code in the v12 patch. + /* + * Send WAL statistics only if WalWriterDelay has elapsed to minimize + * the overhead in WAL-writing. + */ + if (rc & WL_TIMEOUT) + pgstat_send_wal(); On second thought, this change means that it always takes wal_writer_delay before the walwriter's WAL stats are sent after XLogBackgroundFlush() is called. For example, if wal_writer_delay is set to several seconds, some values in pg_stat_wal would needlessly be out of date for those seconds. So I'm thinking of withdrawing my previous comment; it's OK to send the stats every time XLogBackgroundFlush() is called. Thoughts? Thanks, I didn't notice that. Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's default value is 200msec and it may be set to a shorter time. Yeah, if wal_writer_delay is set to a very small value, there is a risk that the WAL stats are sent too frequently. I agree that's a problem. Why don't we check the timestamp instead? + /* + * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL + * msec since we last sent one + */ + now = GetC
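(For readers following along: the counters discussed above end up in the pg_stat_wal view. The sketch below assumes the column names used in the v11/v12 patches and a track-timing parameter named as in the quoted documentation text; the timing columns stay at zero unless that parameter is enabled.)

    SET track_wal_io_timing = on;   -- parameter name assumed from the patch's docs
    SELECT wal_write, wal_sync, wal_write_time, wal_sync_time
      FROM pg_stat_wal;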
Re: Release SPI plans for referential integrity with DISCARD ALL
Hello, I thought about this suggestion again. Amit's patch suggested in the thread [1] can eliminate SPI plans from INSERT/UPDATE triggers, so our memory pressure issue would be solved. But as far as I can see from that thread, Amit's patch doesn't cover the DELETE case. It is not a common case, but there is still a risk of extreme memory pressure. Considering that, this suggestion might still have good value, as Corey said before. I improved the patch according to Peter's following comment: > but I think the > solution of dropping all cached plans as part of DISCARD ALL seems a bit > too extreme of a solution. In the context of connection pooling, > getting a new session with pre-cached plans seems like a good thing, and > this change could potentially have a performance impact without a > practical way to opt out. In the new patch, I separated discarding the SPI plan caches for RI from DISCARD ALL and added a new option "RIPLANS" to the DISCARD command to do that. Also, a new function, ResetRIPlanCache(), which clears the SPI plan caches, is called by DISCARD ALL or DISCARD RIPLANS. The amount of modification is small, and this option can be removed easily once it is no longer needed, so I think we can use this patch as a provisional solution. Any thoughts? [1] https://www.postgresql.org/message-id/flat/CA%2BHiwqGkfJfYdeq5vHPh6eqPKjSbfpDDY%2Bj-kXYFePQedtSLeg%40mail.gmail.com -- Best regards, Yuzuko Hosoya NTT Open Source Software Center v2_discard_ri_spi_plans.patch Description: Binary data
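To make the proposed interface concrete, usage with the v2 patch applied would look like the sketch below; the RIPLANS keyword exists only in this patch and is not part of any released server.

    DISCARD RIPLANS;   -- release only the SPI plans cached by RI triggers
    DISCARD ALL;       -- as before, but now also calls ResetRIPlanCache()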
Re: cryptohash: missing locking functions for OpenSSL <= 1.0.2?
On Mon, Mar 08, 2021 at 06:06:32PM +, Jacob Champion wrote: > I had to convince myself that this logic is correct -- we set > destroy_needed even if crypto is not enabled, but then check later to > make sure that crypto_loaded is true before doing anything. What would > you think about moving the conn->crypto_loaded check to the else > branch, so that destroy_needed is only set if we actually need it? Do you mean something like the attached? If I recall my mood from the moment, I think that I did that to be more careful with the case where the client has its own set of callbacks set (pq_init_crypto_lib as false) but that does not matter as this is double-checked in destroy_ssl_system(). I have adjusted some comments after more review. -- Michael diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c index 29054bad7b..e4a85c37c4 100644 --- a/src/interfaces/libpq/fe-connect.c +++ b/src/interfaces/libpq/fe-connect.c @@ -2887,6 +2887,16 @@ keep_going: /* We will come back to here until there is #ifdef USE_SSL +/* + * Enable the libcrypto callbacks before checking if SSL needs + * to be done. This is done before sending the startup packet + * as depending on the type of authentication done, like MD5 + * or SCRAM, the callbacks would be required even without a + * SSL connection + */ +if (pqsecure_initialize(conn, false, true) < 0) + goto error_return; + /* * If SSL is enabled and we haven't already got encryption of * some sort running, request SSL instead of sending the @@ -2998,8 +3008,14 @@ keep_going: /* We will come back to here until there is { /* mark byte consumed */ conn->inStart = conn->inCursor; - /* Set up global SSL state if required */ - if (pqsecure_initialize(conn) != 0) + + /* + * Set up global SSL state if required. The crypto + * state has already been set if libpq took care of + * doing that, so there is no need to make that happen + * again. + */ + if (pqsecure_initialize(conn, true, false) != 0) goto error_return; } else if (SSLok == 'N') diff --git a/src/interfaces/libpq/fe-secure-openssl.c b/src/interfaces/libpq/fe-secure-openssl.c index 0fa10a23b4..3de8f89264 100644 --- a/src/interfaces/libpq/fe-secure-openssl.c +++ b/src/interfaces/libpq/fe-secure-openssl.c @@ -85,7 +85,7 @@ static bool pq_init_crypto_lib = true; static bool ssl_lib_initialized = false; #ifdef ENABLE_THREAD_SAFETY -static long ssl_open_connections = 0; +static long crypto_open_connections = 0; #ifndef WIN32 static pthread_mutex_t ssl_config_mutex = PTHREAD_MUTEX_INITIALIZER; @@ -111,7 +111,7 @@ pgtls_init_library(bool do_ssl, int do_crypto) * Disallow changing the flags while we have open connections, else we'd * get completely confused. */ - if (ssl_open_connections != 0) + if (crypto_open_connections != 0) return; #endif @@ -635,7 +635,7 @@ pq_lockingcallback(int mode, int n, const char *file, int line) * override it. */ int -pgtls_init(PGconn *conn) +pgtls_init(PGconn *conn, bool do_ssl, bool do_crypto) { #ifdef ENABLE_THREAD_SAFETY #ifdef WIN32 @@ -684,22 +684,28 @@ pgtls_init(PGconn *conn) } } - if (ssl_open_connections++ == 0) + if (do_crypto && !conn->crypto_loaded) { - /* - * These are only required for threaded libcrypto applications, - * but make sure we don't stomp on them if they're already set. 
- */ - if (CRYPTO_get_id_callback() == NULL) -CRYPTO_set_id_callback(pq_threadidcallback); - if (CRYPTO_get_locking_callback() == NULL) -CRYPTO_set_locking_callback(pq_lockingcallback); + if (crypto_open_connections++ == 0) + { +/* + * These are only required for threaded libcrypto + * applications, but make sure we don't stomp on them if + * they're already set. + */ +if (CRYPTO_get_id_callback() == NULL) + CRYPTO_set_id_callback(pq_threadidcallback); +if (CRYPTO_get_locking_callback() == NULL) + CRYPTO_set_locking_callback(pq_lockingcallback); + } + + conn->crypto_loaded = true; } } #endif /* HAVE_CRYPTO_LOCK */ #endif /* ENABLE_THREAD_SAFETY */ - if (!ssl_lib_initialized) + if (!ssl_lib_initialized && do_ssl) { if (pq_init_ssl_lib) { @@ -740,10 +746,10 @@ destroy_ssl_system(void) if (pthread_mutex_lock(&ssl_config_mutex)) return; - if (pq_init_crypto_lib && ssl_open_connections > 0) - --ssl_open_connections; + if (pq_init_crypto_lib && crypto_open_connections > 0) + --crypto_open_connections; - if (pq_init_crypto_lib && ssl_open_connections == 0) + if (pq_init_crypto_lib && crypto_open_connections == 0) { /* * No connections left, unregister libcrypto callbacks, if no one @@ -1402,46 +1408,63 @@ pgtls_close(PGconn *conn) { bool destroy_needed = false; - if (conn->ssl) + if (conn->ssl_in_use) { - /* - * We can't destroy everything SSL-related here due to
Re: OpenSSL 3.0.0 compatibility
> On 3 Mar 2021, at 14:55, Peter Eisentraut > wrote: > > This thread is still in the commit fest, but I don't see any actual proposed > patch still pending. Most of the activity has moved into other threads. The doc changes in the patch proposed on 29/9 still stand, although I see that it had an off-by-one in mentioning MD5 when it should be MD4 et al.; so something more like the below. diff --git a/doc/src/sgml/pgcrypto.sgml b/doc/src/sgml/pgcrypto.sgml index b6bb23de0f..d45464c7ea 100644 --- a/doc/src/sgml/pgcrypto.sgml +++ b/doc/src/sgml/pgcrypto.sgml @@ -1234,6 +1234,12 @@ gen_random_uuid() returns uuid + +When compiled against OpenSSL 3.0.0, the legacy +provider must be activated in the system openssl.cnf +configuration file in order to use older ciphers like DES and Blowfish. + + > Could you update the status of this CF entry, and perhaps also on the status > of OpenSSL compatibility in general? Let's just wait for 3.0.0 to ship before we do anything. -- Daniel Gustafsson https://vmware.com/
RE: Avoid CommandCounterIncrement in RI trigger when INSERT INTO referencing table
From: Hou, Zhijie/侯 志杰 > After some more thought on how to support parallel insert into an FK relation: > It seems we do not have a cheap way to implement this feature. > > In RI_FKey_check, currently, postgres executes "select xx for key share" to > check that the foreign key value exists in the PK table. > However, "select for update/share" is considered parallel unsafe. It may be > dangerous to do this in parallel mode; we may want to change this. Hmm, I guess the parallel leader and workers can execute SELECT FOR KEY SHARE, if the parallelism infrastructure allows execution of SPI calls. The lock manager supports tuple locking in the parallel leader and workers via group locking. Also, tuple locking doesn't require combo CIDs, which would be necessary for parallel UPDATE and DELETE. Perhaps the reason why SELECT ... FOR UPDATE/SHARE is treated as parallel-unsafe is that tuple locking modifies data pages to store lock information in the tuple header. But now, page modification is possible in parallel processes, so I think we can consider SELECT ... FOR UPDATE/SHARE as parallel-safe. (I may be too optimistic.) > And also, "select for update/share" is considered as "not read only", which > will > force readonly = false in _SPI_execute_plan. read_only is used to do CCI. Can we arrange to skip CCI? > At the same time, the "simplifying foreign key/RI checks" thread is trying to > replace "select xx for key share" with index_beginscan()+table_tuple_lock() (I > think it’s parallel safe). > Maybe we can try to support parallel insert into an FK relation after the "simplifying > foreign > key/RI checks" patch is applied? Why do you think it's parallel safe? Can you try running parallel INSERT SELECT on a target table with an FK and see if any problems happen? If some problem occurs due to the tuple locking, I think we can work around it by avoiding tuple locking. That is, we make parallel INSERT SELECT lock the parent tables in exclusive mode so that the check tuples won't be deleted. Some people may not like this, but it's worth considering because parallel INSERT SELECT would not have to be run concurrently with short OLTP transactions. Anyway, tuple locking partly disturbs the parallel INSERT speedup because it modifies pages in the parent tables and emits WAL. Surprisingly, Oracle doesn't support parallel INSERT SELECT on a table with an FK, as shown below. SQL Server's documentation doesn't mention anything, so I guess it's supported there. This is a good chance for PostgreSQL to exceed Oracle. https://docs.oracle.com/en/database/oracle/oracle-database/21/vldbg/types-parallelism.html#GUID-D4CFC1F2-44D3-4BE3-B5ED-6A309EB8BF06 Table 8-1 (Referential Integrity Restrictions), for the DML statement INSERT: Issued on Parent - (Not applicable); Issued on Child - Not parallelized; Self-Referential - Not parallelized. Regards Takayuki Tsunakawa
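For readers unfamiliar with the "select xx for key share" mentioned above: the check query is built dynamically in ri_triggers.c, but its shape is roughly the following. This is an illustrative approximation with made-up table and column names, written as a prepared statement so it can be run standalone.

    CREATE TABLE pk_table (a int PRIMARY KEY);

    -- Approximate shape of the query RI_FKey_check runs via SPI for each
    -- inserted or updated referencing row:
    PREPARE ri_check (int) AS
        SELECT 1 FROM ONLY pk_table x WHERE a = $1 FOR KEY SHARE OF x;

    EXECUTE ri_check (1);   -- zero rows here would mean an FK violation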
Re: Improvements and additions to COPY progress reporting
On Tue, 9 Mar 2021 at 06:34, Michael Paquier wrote: > > On Mon, Mar 08, 2021 at 05:33:40PM +0100, Matthias van de Meent wrote: > > Seems reasonable. PFA updated patches. I've renamed the previous 0003 > > to 0002 to keep git-format-patch easy. > > Thanks for updating the patch. 0001 has been applied, after tweaking > the comments, indentation and docs a bit. Thanks! > > This is keeping current behaviour of the implementation as committed > > with 8a4f618e, with the rationale of that patch being that this number > > should mirror the number returned by the copy command. > > > > I am not opposed to adding another column for `tuples_inserted` and > > changing the logic accordingly (see prototype 0003), but that was not > > in the intended scope of this patchset. Unless you think that this > > should be included in this current patchset, I'll spin that patch out > > into a different thread, but I'm not sure that would make it into > > pg14. > > Okay, point taken. If there is demand for it in the future, we could > extend the existing set of columns. After thinking more about it, the > use case is not completely clear to me from a monitoring point of > view. > > I have not looked at 0002 in detail yet, but I am wondering first if > the size estimations in the expected output are actually portable. > Second, I doubt a bit that the extra cycles spent on that are actually > worth the coverage, even if the trick with an AFTER INSERT trigger is > interesting. There are examples in which pg_stat_progress_* views report inaccurate data. I think it is fairly reasonable to at least validate some part of the progress reporting, as it is one of the few methods for administrators to look at the state of currently running administrative tasks, and as such, this user-visible API should be validated. With regards, Matthias van de Meent
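For context, the view being exercised here is pg_stat_progress_copy. A monitoring session can watch a running COPY with a query along the lines of the sketch below; the column names follow the patches discussed in this thread (the prototype-only tuples_inserted column is deliberately left out).

    SELECT relid::regclass, command, type,
           bytes_processed, bytes_total,
           tuples_processed, tuples_excluded
      FROM pg_stat_progress_copy;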
Re: Procedures versus the "fastpath" API
On Tue, 2021-03-09 at 14:15 -0500, Tom Lane wrote: > The security team received a report from Theodor-Arsenij > Larionov-Trichkin of PostgresPro that it's possible to crash the > backend with an assertion or null-pointer dereference by trying to > call a window function via the "fast path function call" protocol > message. > > So the question on the table is what to do about this. > > As for procedures, I'm of the opinion that we should just reject those > too, but some other security team members were not happy with that > idea. Conceivably we could attempt to make the case actually work, > but is it worth the trouble? Given the lack of field complaints > about the "invalid transaction termination" failure, it seems unlikely > that it's occurred to anyone to try to call procedures this way. > We'd need special infrastructure to test the case, too, since psql > offers no access to fastpath calls. > > A compromise suggestion was to prohibit calling procedures via > fastpath as of HEAD, but leave existing releases as they are, > in case anyone is using a case that happens to work. The "invalid transaction termination" failure alone doesn't worry or surprise me - transaction handling in procedures only works under rather narrow conditions anyway (no SELECT on the call stack, no explicit transaction was started outside the procedure). If that is the only problem, I'd just document it. The hard work is of course verifying that there is no other problem with calling procedures that way. If anybody wants to do that work, and transaction handling is the only thing that doesn't work with the fastpath API, we can call it supported and document the exception. In case of doubt, I would agree with you and forbid it in HEAD as a corner case with little real-world use. Yours, Laurenz Albe
Re: Parallel INSERT (INTO ... SELECT ...)
On Mon, Mar 8, 2021 at 7:19 PM Amit Langote wrote: > > Hi Amit > > On Mon, Mar 8, 2021 at 10:18 PM Amit Kapila wrote: > > On Mon, Mar 8, 2021 at 3:54 PM Greg Nancarrow wrote: > > > I've attached an updated set of patches with the suggested locking > > > changes. > > (Thanks Greg.) > > > Amit L, others, do let me know if you have still more comments on > > 0001* patch or if you want to review it further? > > I just read through v25 and didn't find anything to complain about. > Thanks a lot, pushed now! Amit L., your inputs are valuable for this work. Now, coming back to Hou-San's patch to introduce a GUC and reloption for this feature, I think both of those make sense to me because when the feature is enabled via GUC, one might want to disable it for partitioned tables? Do we agree on that part or someone thinks otherwise? The other points to bikeshed could be: 1. The name of GUC and reloption. The two proposals at hand are enable_parallel_dml and enable_parallel_insert. I would prefer the second (enable_parallel_insert) because updates/deletes might not have a similar overhead. 2. Should we keep the default value of GUC to on or off? It is currently off. I am fine keeping it off for this release and we can always turn it on in the later releases if required. Having said that, I see the value in keeping it on because in many cases Insert ... Select will be used for large data and there we will see a benefit of parallelism and users facing trouble (who have a very large number of partitions with less data to query) can still disable the parallel insert for that particular table. Also, the other benefit of keeping it on till at least the beta period is that this functionality will get tested and if we found reports of regression then we can turn it off for this release as well. Thoughts? -- With Regards, Amit Kapila.
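To make the bikeshedding concrete, here is a sketch of how the knobs under discussion would be used, assuming the enable_parallel_insert GUC name and a matching table-level reloption (both names, including the parallel_insert_enabled spelling, are still proposals at this point, and the table is a placeholder):

    CREATE TABLE measurements (ts timestamptz, v numeric) PARTITION BY RANGE (ts);

    SET enable_parallel_insert = on;   -- proposed GUC; default still under discussion

    -- Opt a heavily partitioned target table out while leaving the GUC on
    -- (reloption name is an assumption):
    ALTER TABLE measurements SET (parallel_insert_enabled = off);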
Re: Confusing behavior of psql's \e
On Wed, Mar 10, 2021 at 6:14 AM Tom Lane wrote: > > PS: I seem to recall that some Microsoft filesystems have 2-second > resolution on file mod times, so maybe it needs to be "time(NULL) - 2"? > You are thinking about FAT: https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-getfiletime#remarks Regards, Juan José Santamaría Flecha
Re: [PATCH] Provide more information to filter_prepare
On Tue, Mar 9, 2021 at 2:14 PM Markus Wanner wrote: > > Hi, > > currently, only the gid is passed on to the filter_prepare callback. > While we probably should not pass a full ReorderBufferTXN (as we do for > most other output plugin callbacks), a bit more information would be > nice, I think. > How can the proposed 'xid' parameter be useful? What exactly would plugins want to do with it? -- With Regards, Amit Kapila.
Re: Confusing behavior of psql's \e
On Wed, 2021-03-10 at 00:13 -0500, Tom Lane wrote: > I agree > that trying to get rid of the race condition inherent in the existing > file mtime test would be a good idea. However, I've got some > portability-related gripes about how you are doing the latter: > > 1. There is no principled reason to assume that the epoch date is in the > past. IIRC, Postgres' timestamp epoch of 2000-01-01 was in the future > at the time we set it. More relevant to the immediate issue, I clearly > recall a discussion at Red Hat in which one of the principal glibc > maintainers (likely Ulrich Drepper, though I'm not quite sure) argued > that 32-bit time_t could be used indefinitely by redefining the epoch > forward by 2^32 seconds every often; which would require intervals of > circa 68 years in which time_t was seen as a negative offset from a > future epoch date, rather than an unsigned offset from a past date. > Now, I thought he was nuts then and I still think that 32-bit hardware > will be ancient history by 2038 ... but there may be systems that do it > like that. glibc hates ABI breakage. > > 2. Putting an enormously old date on a file that was just created will > greatly confuse onlookers, some of whom (such as backup or antivirus > daemons) might not react pleasantly. > > Between #1 and #2, it's clearly worth the extra one or two lines of > code to set the file dates to, say, "time(NULL) - 1", rather than > assuming that zero is good enough. > > 3. I wonder about the portability of utime(2). I see that we are using > it to cause updates of socket and lock file times, but our expectations > for it there are rock-bottom low. I think that failing the edit command > if utime() fails is an overreaction even with optimistic assumptions about > its reliability. Doing that makes things strictly worse than simply not > doing anything, because 99% of the time this refinement is unnecessary. > > In short, I think the relevant code ought to be more like > > else > { > struct utimbuf ut; > > ut.modtime = ut.actime = time(NULL) - 1; > (void) utime(fname, &ut); > } > > (plus some comments of course) > > regards, tom lane > > PS: I seem to recall that some Microsoft filesystems have 2-second > resolution on file mod times, so maybe it needs to be "time(NULL) - 2"? Thanks for taking a look! Taken together, these arguments are convincing. Done like that in the attached patch version 4. Yours, Laurenz Albe From 31257c2e697421bbcc08bdd546ccc8729654ccb3 Mon Sep 17 00:00:00 2001 From: Laurenz Albe Date: Wed, 10 Mar 2021 11:28:20 +0100 Subject: [PATCH] Improve \e, \ef and \ev if the editor is quit Before, if you edited a function or view definition or a script file, quitting the editor without saving would cause the *previous query* to be executed, which is certainly not what anyone would expect. It is better to execute nothing and clear the query buffer in this case. The behavior when editing the query buffer is unchanged: quitting without saving will retain the query buffer. This patch also fixes an old race condition: due to the low granularity that stat(2) can measure for the modification time, an edit that took less than a second might go unnoticed. In passing, fix the misplaced function comment for do_edit(). 
Author: Laurenz Albe Reviewed-by: Chapman Flack, Jacob Champion Discussion: https://postgr.es/m/0ba3f2a658bac6546d9934ab6ba63a805d46a49b.camel%40cybertec.at --- doc/src/sgml/ref/psql-ref.sgml | 10 -- src/bin/psql/command.c | 64 ++ 2 files changed, 57 insertions(+), 17 deletions(-) diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml index 13c1edfa4d..0bf69174ff 100644 --- a/doc/src/sgml/ref/psql-ref.sgml +++ b/doc/src/sgml/ref/psql-ref.sgml @@ -1970,7 +1970,9 @@ testdb=> -The new contents of the query buffer are then re-parsed according to +If you edit a file or the previous query, and you quit the editor without +modifying the file, the query buffer is cleared. +Otherwise, the new contents of the query buffer are re-parsed according to the normal rules of psql, treating the whole buffer as a single line. Any complete queries are immediately executed; that is, if the query buffer contains or ends with a @@ -2039,7 +2041,8 @@ Tue Oct 26 21:40:57 CEST 1999 in the form of a CREATE OR REPLACE FUNCTION or CREATE OR REPLACE PROCEDURE command. Editing is done in the same way as for \edit. - After the editor exits, the updated command is executed immediately + If you quit the editor without saving, the statement is discarded. + If you save and exit the editor, the updated command is executed immediately if you added a semicolon to it. Otherwise it is redisplayed; type semicolon or \g to send it, or \r to
Do we support upgrade of logical replication?
Hi, I was reviewing logical decoding of two-phase transactions feature, while reviewing the feature I was checking if there is any impact on publisher/subscriber upgrade. I checked the existing pg_upgrade behaviour with logical replication. I made a logical replication data instance with publisher and subscriber with subscriber subscribed to a table. Then I tried upgrading publisher and subscriber individually. After upgrade I noticed the following: 1) Pg_logical/mappings files were not copied in the upgraded data folder: -- Pg_logical contents in old data folder: publisher/pg_logical/replorigin_checkpoint publisher/pg_logical/mappings: map-32cb-4df-0_1767088-225-225 publisher/pg_logical/snapshots: 0-1643650.snap New upgraded data folder: publisher1/pg_logical/replorigin_checkpoint publisher1/pg_logical/mappings: publisher1/pg_logical/snapshots: 2) Replication slots were not copied: select * from pg_replication_slots; slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | t wo_phase ---++---++--+---+++--+--+-+-++---+-- - (0 rows) 3) The subscriber is in disabled mode in the upgraded data: select * from pg_subscription; oid | subdbid | subname | subowner | subenabled | subbinary | substream | subtwophase | subconninfo| subslotname | subsynccommit | subpublicati ons ---+-+-+--++---+---+-+--+-+---+- 16404 | 16401 | mysub | 10 | f | f | f | f | host=localhost port=5432 dbname=postgres | mysub | off | {mypub} (1 row) 4) The pg_subscription_rel contents also were not copied: select * from pg_subscription_rel; srsubid | srrelid | srsubstate | srsublsn -+-++-- (0 rows) 5) While logical decoding of transactions, the decoded changes will be serialized based on logical_decoding_work_mem configuration. Even these files were not copied during upgrade. Do we support upgrading of logical replication, if it is supported could someone point me to the document link on how to upgrade logical replication? Regards, Vignesh
Re: [PATCH] Provide more information to filter_prepare
On 10.03.21 11:18, Amit Kapila wrote: On Tue, Mar 9, 2021 at 2:14 PM Markus Wanner wrote: currently, only the gid is passed on to the filter_prepare callback. While we probably should not pass a full ReorderBufferTXN (as we do for most other output plugin callbacks), a bit more information would be nice, I think. How can the proposed 'xid' parameter be useful? What exactly would plugins want to do with it? The xid is the most basic identifier for transactions in Postgres. Any output plugin that interacts with Postgres in any way slightly more interesting than "filter by gid prefix" is very likely to come across a TransactionId. It allows for basics like checking whether the transaction to decode is still in progress, for example. Or, in a much more complex scenario, deciding whether or not to filter based on properties the extension stored while processing the transaction. Regards Markus
Re: Logical Replication - improve error message while adding tables to the publication in check_publication_add_relation
On Wed, Mar 10, 2021 at 1:27 PM Jeevan Ladhe wrote: > > On Wed, Mar 10, 2021 at 10:44 AM Bharath Rupireddy > wrote: >> >> Hi, >> >> While providing thoughts on [1], I observed that the error messages >> that are emitted while adding foreign, temporary and unlogged tables >> can be improved a bit from the existing [2] to [3]. > > +1 for improving the error messages here. Thanks for taking a look at the patch. >> Attaching a small patch. Thoughts? > > I had a look at the patch and it looks good to me. However, I think after > you have added the specific kind of table type in the error message itself, > now the error details seem to be giving redundant information, but others > might > have different thoughts. The error detail is meant to give a bit of information about which relation types are unsupported in the CREATE PUBLICATION statement. But with the error message now showing the type of relation, the detail message looks redundant to me as well. If agreed, I can remove that. Thoughts? > The patch itself looks good otherwise. Also the make check and postgres_fdw > check looking good. Thanks. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
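For reference, the three cases whose messages the patch adjusts can be reproduced as sketched below; the exact wording of the resulting errors depends on whether the patch is applied, and the dummy FDW is only there so that a foreign table can be created.

    CREATE TEMPORARY TABLE t_temp (a int);
    CREATE UNLOGGED TABLE t_unlogged (a int);
    CREATE FOREIGN DATA WRAPPER dummy_fdw;
    CREATE SERVER dummy_server FOREIGN DATA WRAPPER dummy_fdw;
    CREATE FOREIGN TABLE t_foreign (a int) SERVER dummy_server;

    CREATE PUBLICATION pub_temp FOR TABLE t_temp;           -- rejected
    CREATE PUBLICATION pub_unlogged FOR TABLE t_unlogged;   -- rejected
    CREATE PUBLICATION pub_foreign FOR TABLE t_foreign;     -- rejected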
RE: Parallel Inserts in CREATE TABLE AS
> Seems like v22 patch was failing in cfbot for one of the unstable test cases. > Attaching v23 patch set with modification in 0003 and 0004 patches. No > changes to 0001 and 0002 patches. Hopefully cfbot will be happy with v23. > > Please consider v23 for further review. Hi, I was looking into the latest patch; here are some comments: 1) * (Note that we do allow CREATE TABLE AS, SELECT INTO, and CREATE * MATERIALIZED VIEW to use parallel plans, but as of now, only the leader * backend writes into a completely new table. In the future, we can Since we will support parallel insert with CTAS, these existing comments need to be changed. 2) In parallel.sgml The query writes any data or locks any database rows. If a query contains a data-modifying operation either at the top level or within a CTE, no parallel plans for that query will be generated. As an exception, the commands CREATE TABLE ... AS, Same as 1): we'd better mention that we now support parallel insert with CTAS. 3) ExecInitParallelPlan(PlanState *planstate, EState *estate, Bitmapset *sendParams, int nworkers, -int64 tuples_needed) +int64 tuples_needed, +ParallelInsertCmdKind parallel_ins_cmd, +void *parallel_ins_info) Would it be better to place ParallelInsertCmdKind in struct ParallelInsertCTASInfo? We could pass fewer parameters in some places. Like: typedef struct ParallelInsertCTASInfo { + ParallelInsertCmdKind parallel_ins_cmd; IntoClause *intoclause; Oid objectid; } ParallelInsertCTASInfo; Best regards, houzj
Re: Logical Replication - improve error message while adding tables to the publication in check_publication_add_relation
On Wed, Mar 10, 2021, at 2:14 AM, Bharath Rupireddy wrote: > While providing thoughts on [1], I observed that the error messages > that are emitted while adding foreign, temporary and unlogged tables > can be improved a bit from the existing [2] to [3]. For instance, the > existing message when a foreign table is added to a > publication, ""f1" is not a table", looks odd. Because it says that the > foreign table is not a table at all. I wouldn't mix [regular|partitioned|temporary|unlogged] tables with foreign tables. Although they have a pg_class entry in common, foreign tables aren't "real" tables (external storage); they even have different DDLs to handle them (CREATE TABLE vs. CREATE FOREIGN TABLE). postgres=# CREATE PUBLICATION testpub FOR TABLE f1; ERROR: "f1" is not a table DETAIL: Only tables can be added to publications. I agree that "f1 is not a table" is a confusing message at first, because a foreign table has "table" in its description. Maybe if we apply the negation in both messages it would be clearer (using the same wording as for system tables). ERROR: "f1" is a foreign table DETAIL: Foreign tables cannot be added to publications. -- Euler Taveira EDB https://www.enterprisedb.com/
Re: Allow batched insert during cross-partition updates
Hi, ‐‐‐ Original Message ‐‐‐ On Thursday, February 18, 2021 10:54 AM, Amit Langote wrote: > On Thu, Feb 18, 2021 at 6:52 PM Amit Langote amitlangot...@gmail.com wrote: > > > Based on the discussion at: > > https://www.postgresql.org/message-id/6929d485-2d2a-da46-3681-4a400a3d794f%40enterprisedb.com > > I'm posting the patch for $subject here in this new thread and I'll > > add it to the next CF per Tomas' advice. > > Done:https://commitfest.postgresql.org/32/2992/ > > -- apparently I did not receive the review comment I sent via the commitfest app. Apologies for the chatter. Find the message-id here: > > Amit Langote > EDB: http://www.enterprisedb.com
Re: Allow batched insert during cross-partition updates
‐‐‐ Original Message ‐‐‐ On Wednesday, March 10, 2021 1:23 PM, Georgios wrote: > > > Hi, > > ‐‐‐ Original Message ‐‐‐ > On Thursday, February 18, 2021 10:54 AM, Amit Langote amitlangot...@gmail.com > wrote: > > > On Thu, Feb 18, 2021 at 6:52 PM Amit Langote amitlangot...@gmail.com wrote: > > > > > Based on the discussion at: > > > https://www.postgresql.org/message-id/6929d485-2d2a-da46-3681-4a400a3d794f%40enterprisedb.com > > > I'm posting the patch for $subject here in this new thread and I'll > > > add it to the next CF per Tomas' advice. > > > > Done:https://commitfest.postgresql.org/32/2992/ > > apparently I did not receive the review comment I sent via the commitfest app. > Apologies for the chatter. Find the message-id here: https://www.postgresql.org/message-id/161530518971.29967.9368488207318158252.pgcf%40coridan.postgresql.org I continued looking a bit at the patch, yet I am either failing to see the fix or I am looking at the wrong thing. Please find attached a small repro of what my expectations were. As you can see in the repro, I would expect the UPDATE local_root_remote_partitions SET a = 2; to move the tuples to remote_partition_2 in the same transaction. However, this is not the case, with or without the patch. Is my expectation of this patch wrong? Cheers, //Georgios > > > Amit Langote > > EDB: http://www.enterprisedb.com repro.sql Description: application/sql
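Since the attached repro.sql does not appear in the archived text, here is a hypothetical reconstruction of the kind of setup being described. The names local_root_remote_partitions and remote_partition_2 come from the message above; the loopback server, the single integer column, and the list partition bounds are guesses.

    CREATE EXTENSION postgres_fdw;
    CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (dbname 'postgres', port '5432');
    CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;

    -- Plain tables on the "remote" side that back the foreign partitions.
    CREATE TABLE remote_partition_1_data (a int);
    CREATE TABLE remote_partition_2_data (a int);

    CREATE TABLE local_root_remote_partitions (a int) PARTITION BY LIST (a);
    CREATE FOREIGN TABLE remote_partition_1
        PARTITION OF local_root_remote_partitions FOR VALUES IN (1)
        SERVER loopback OPTIONS (table_name 'remote_partition_1_data');
    CREATE FOREIGN TABLE remote_partition_2
        PARTITION OF local_root_remote_partitions FOR VALUES IN (2)
        SERVER loopback OPTIONS (table_name 'remote_partition_2_data');

    INSERT INTO local_root_remote_partitions VALUES (1), (1);

    -- The UPDATE in question; the expectation stated above is that the rows
    -- end up in remote_partition_2 within the same transaction.
    BEGIN;
    UPDATE local_root_remote_partitions SET a = 2;
    SELECT tableoid::regclass, a FROM local_root_remote_partitions;
    COMMIT;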
Re: shared-memory based stats collector
On 2021/03/10 17:51, Kyotaro Horiguchi wrote: At Wed, 10 Mar 2021 15:20:43 +0900, Fujii Masao wrote in On 2021/03/10 12:10, Kyotaro Horiguchi wrote: Agreed. The code moved to the original place and added the crash handling code. And I added a phrase to the comment. +* Was it the archiver? If exit status is zero (normal) or one (FATAL +* exit), we assume everything is all right just like normal backends +* and just try to restart a new one so that we immediately retry +* archiving of remaining files. (If fail, we'll try again in future "of" of "archiving of remaining" should be replaced with "the", or removed? Either will do. I don't mind turning the gerund (archiving) into a gerund phrase (archiving remaining files). Just for record. Previously LogChildExit() was called and the following LOG message was output when the archiver reported FATAL error. OTOH the patch prevents that and the following LOG message is not output at FATAL exit of archiver. But I don't think that the following message is required in that case because FATAL message indicating the similar thing is already output. Therefore, I'm ok with the patch. LOG: archiver process (PID 46418) exited with exit code 1 Yeah, that's the same behavor with wal receiver. I read v50_003 patch. When archiver dies, ProcGlobal->archiverLatch should be reset to NULL, like walreceiver does the similar thing in WalRcvDie()? Differently from walwriter and checkpointer, archiver as well as walreceiver may die while server is running. Leaving the latch pointer alone may lead to nudging a wrong process that took over the same procarray slot. Added pgarch_die() to do that. Thanks! + if (IsUnderPostmaster && ProcGlobal->archiverLatch) + SetLatch(ProcGlobal->archiverLatch); The latch can be reset to NULL in pgarch_die() between the if-condition and SetLatch(), and which would be problematic. Probably we should protect the access to the latch by using spinlock, like we do for walreceiver's latch? Ugg. Right. I remember about that bug. I moved the archiverLatch out of ProcGlobal to a dedicate local struct PgArch and placed a spinlock together. Thanks for the review! v52 is attached. Thanks! I applied minor and cosmetic changes to the 0003 patch as follows. Attached is the updated version of the 0003 patch. Barring any objection, I will commit this patch. -#include "storage/latch.h" -#include "storage/proc.h" I removed these because they are no longer necessary. logical replication worker, parallel worker, background writer, client backend, checkpointer, + archiver, startup, walreceiver, walsender and walwriter. In the document about pg_stat_activity, possible values in backend_type column are all listed. I added "archiver" into the list. BTW, those values were originally listed in alphabetical order, but ISTM that they were gotten out of order at a certain moment. So they should be listed in alphabetical order again. This should be implemented as a separate patch. -PgArchData *PgArch = NULL; +static PgArchData *PgArch = NULL; I marked PgArchData as static because it's used only in pgarch.c. - ShmemInitStruct("Archiver ", PgArchShmemSize(), &found); + ShmemInitStruct("Archiver Data", PgArchShmemSize(), &found); I found that the original shmem name ended with unnecessary space character. I replaced it with "Archiver Data". In reaper(), I moved the code block for archiver to the original location. I ran pgindent for the files that the patch modifies. 
Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 3335d71eba..b82b2f6d66 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -935,6 +935,7 @@ postgres 27093 0.0 0.0 30096 2752 ?Ss 11:34 0:00 postgres: ser logical replication worker, parallel worker, background writer, client backend, checkpointer, + archiver, startup, walreceiver, walsender and walwriter. In addition, background workers registered by extensions may have diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c index 1c5a4f8b5a..26b023e754 100644 --- a/src/backend/access/transam/xlogarchive.c +++ b/src/backend/access/transam/xlogarchive.c @@ -25,11 +25,11 @@ #include "common/archive.h" #include "miscadmin.h" #include "postmaster/startup.h" +#include "postmaster/pgarch.h" #include "replication/walsender.h" #include "storage/fd.h" #include "storage/ipc.h" #include "storage/lwlock.h" -#include "storage/pmsignal.h" /* * Attempt to retrie
Re: Change JOIN tutorial to focus more on explicit joins
On 12/1/20 3:38 AM, Jürgen Purtz wrote: On 30.11.20 21:25, David G. Johnston wrote: Sorry, I managed to overlook the most recent patch. I admitted my use of parentheses was incorrect and I don't see anyone else defending them. Please remove them. Minor typos: "the database compare" -> needs an "s" (compares) "In this case, the definition how to compare their rows." -> remove, redundant with the first sentence "The results from the older implicit syntax, and the newer explicit JOIN/ON syntax, are identical" -> move the commas around to what is shown here OK. Patch attached. Peter, you committed some of this patch originally. Do you think the rest of the patch is now in shape to be committed? Regards, -- -David da...@pgmasters.net
Re: error_severity of brin work item
On 12/1/20 5:25 PM, Justin Pryzby wrote: On Tue, Dec 01, 2020 at 03:57:24PM -0300, Alvaro Herrera wrote: Another idea is if perform_work_item() were responsible for discarding relations which disappear. Currently it does this, which is racy since it holds no lock. That has the property that it remains contained in autovacuum.c, but no other advantages I think. It has the advantage that it moves all the try_open stuff out of brin. I started implementing this, and then realized that the try_open stuff *has* to be in the brin_summarize function, to handle the case that someone passes a non-index, since it's SQL exposed. So maybe we should use your LockOid patch now, and refactor in the future if we add additional work-item types. Thoughts on this, Álvaro? I can see that the first version of this patch was not ideal but the rework seems to have stalled. Since it is a bug perhaps it would be better to get something in as Justin suggests? Regards, -- -David da...@pgmasters.net
Re: libpq debug log
On 2021-Mar-10, iwata@fujitsu.com wrote: > Hi all, > > Following all reviewer's advice, I have created a new patch. > > In this patch, I add only two tracing entry points; I call > pqTraceOutputMsg(PGconn conn, int msgCursor, PGCommSource commsource) in > pqParseInput3 () and pqPutMsgEnd () to output log. > The argument contains message first byte offset called msgCursor because it > is simpler to specify the buffer pointer in the caller. > > In pqTraceOutputMsg(), the content common to all protocol messages (first > timestamp, < or >, first 1 byte, and length) are output first and after that > each protocol message contents is output. I referred to pqParseInput3 () , > fe-exec.c and documentation page for that part. > > This fix almost eliminates if(conn->Pfdebug) that was built into the code for > other features. This certainly looks much better! Thanks for getting it done so quickly. I didn't review the code super closely. I do see a compiler warning: /pgsql/source/pipeline/src/interfaces/libpq/libpq-trace.c: In function 'pqTraceOutputNchar': /pgsql/source/pipeline/src/interfaces/libpq/libpq-trace.c:373:2: warning: argument 1 null where non-null expected [-Wnonnull] memcpy(v, buf + *cursor, len); ^ and then when I try to run the "libpq_pipeline" program from the other thread, I get a crash with this backtrace: Program received signal SIGSEGV, Segmentation fault. __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:288 288 ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No existe el fichero o el directorio. (gdb) bt #0 __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:288 #1 0x77f9b50b in pqTraceOutputNchar (buf=buf@entry=0x55564660 "P", LogEnd=LogEnd@entry=42, cursor=cursor@entry=0x7fffdeb4, pfdebug=0x55569320, len=1) at /pgsql/source/pipeline/src/interfaces/libpq/libpq-trace.c:373 #2 0x77f9bc25 in pqTraceOutputMsg (conn=conn@entry=0x55560260, msgCursor=, commsource=commsource@entry=MSGDIR_FROM_FRONTEND) at /pgsql/source/pipeline/src/interfaces/libpq/libpq-trace.c:476 #3 0x77f96280 in pqPutMsgEnd (conn=conn@entry=0x55560260) at /pgsql/source/pipeline/src/interfaces/libpq/fe-misc.c:533 ... The attached patch fixes it, though I'm not sure that that's the final form we want it to have (since we'd alloc and free repeatedly, making it slow -- perhaps better to use a static, or ...? ). -- Álvaro Herrera Valdivia, Chile Essentially, you're proposing Kevlar shoes as a solution for the problem that you want to walk around carrying a loaded gun aimed at your foot. (Tom Lane) diff --git a/src/interfaces/libpq/libpq-trace.c b/src/interfaces/libpq/libpq-trace.c index ef294fa556..5d3b40a1d0 100644 --- a/src/interfaces/libpq/libpq-trace.c +++ b/src/interfaces/libpq/libpq-trace.c @@ -368,7 +368,15 @@ pqTraceOutputBinary(char *v, int length, FILE *pfdebug) static void pqTraceOutputNchar(char *buf, int LogEnd, int *cursor, FILE *pfdebug, int len) { - char *v = '\0'; + char *v; + + v = malloc(len); + if (v == NULL) + { + fprintf(pfdebug, "'..%d chars..'", len); + *cursor += len; + return; + } memcpy(v, buf + *cursor, len); *cursor += len; @@ -377,6 +385,8 @@ pqTraceOutputNchar(char *buf, int LogEnd, int *cursor, FILE *pfdebug, int len) pqTraceOutputBinary(v, len, pfdebug); fprintf(pfdebug, "\'"); + free(v); + if (*cursor < LogEnd) fprintf(pfdebug, " "); else
Re: get rid of tags in the docs?
On Thu, Feb 4, 2021 at 11:31 AM Tom Lane wrote: > > John Naylor writes: > > While looking at the proposed removal of the v2 protocol, I noticed that we > > italicize some, but not all, instances of 'per se', 'pro forma', and 'ad > > hoc'. I'd say these are widespread enough in formal registers of English > > that they hardly need to be called out as foreign, so I propose removing > > the tags for those words. > > +1, nobody italicizes those in normal usage. Now that protocol v2 is gone, here's a patch to remove those tags. > > The other case is 'voilà', found in rules.sgml. The case for italics here > > is stronger, but looking at that file, I actually think a more > > generic-sounding phrase here would be preferable. > > Yeah, seeing that we only use that in one place, I think we could do > without it. Looks like something as pedestrian as "The results are:" > would do fine. Done that way. -- John Naylor EDB: http://www.enterprisedb.com v1-0001-Get-rid-of-foreignphrase-tags-in-the-docs.patch Description: Binary data
Re: proposal: unescape_text function
On 12/2/20 1:30 PM, Pavel Stehule wrote: On Wed, Dec 2, 2020 at 11:37, Pavel Stehule wrote: On Wed, Dec 2, 2020 at 9:23, Peter Eisentraut wrote: Heh. The fact that there is a table of two dozen possible representations kind of proves my point that we should be deliberate in picking one. I do see Oracle unistr() on that list, which appears to be very similar to what you are trying to do here. Maybe look into aligning with that. unistr is a primitive form of the proposed function. But it can be used as a base. The format is compatible with our "4.1.2.3. String Constants with Unicode Escapes". What do you think about the following proposal? 1. unistr(text) .. compatible with Postgres unicode escapes - it is enhanced compared to Oracle, because Oracle's unistr doesn't support 6-digit Unicode escapes. 2. There can be an optional parameter "prefix" with default "\". But with "\u" it can be compatible with Java or Python. What do you think about it? I thought about it a little bit more, and the prefix specification doesn't make much sense (even more so if we implement this functionality as a function named "unistr"). I removed the optional argument and renamed the function to "unistr". The functionality is the same. Now it supports the Oracle convention, Java and Python (for Python, U) and \+XX. These formats were already supported. The compatibility with Oracle is nice. Peter, it looks like Pavel has aligned this function with unistr() as you suggested. Thoughts? Regards, -- -David da...@pgmasters.net
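To illustrate the proposal, usage of the function would look roughly like the sketch below (it only works with the patch applied; the escape forms follow the description above).

    -- PostgreSQL-style \XXXX and \+XXXXXX escapes:
    SELECT unistr('d\0061t\+000061');       -- 'data'

    -- Java/Python-style \uXXXX and \UXXXXXXXX escapes:
    SELECT unistr('d\u0061t\U00000061');    -- 'data'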
Re: Lowering the ever-growing heap->pd_lower
On Tue, 9 Mar 2021 at 22:35, Tom Lane wrote: > > Matthias van de Meent writes: > > The only two existing mechanisms that I could find (in the access/heap > > directory) that possibly could fail on shrunken line pointer arrays; > > being xlog recovery (I do not have enough knowledge on recovery to > > determine if that may touch pages that have shrunken line pointer > > arrays, or if those situations won't exist due to never using dirtied > > pages in recovery) and backwards table scans on non-MVCC snapshots > > (which would be fixed in the attached patch). > > I think you're not visualizing the problem properly. > > The case I was concerned about back when is that there are various bits of > code that may visit a page with a predetermined TID in mind to look at. > An index lookup is an obvious example, and another one is chasing an > update chain's t_ctid link. You might argue that if the tuple was dead > enough to be removed, there should be no such in-flight references to > worry about, but I think such an assumption is unsafe. There is not, for > example, any interlock that ensures that a process that has obtained a TID > from an index will have completed its heap visit before a VACUUM that > subsequently removed that index entry also removes the heap tuple. I am aware of this problem. I will admit that I did not detected all potentially problematic accesses, so I'll show you my work. > So, to accept a patch that shortens the line pointer array, what we need > to do is verify that every such code path checks for an out-of-range > offset before trying to fetch the target line pointer. I believed > back in 2007 that there were, or once had been, code paths that omitted > such a range check, assuming that they could trust the TID they had > gotten from $wherever to point at an extant line pointer array entry. > Maybe things are all good now, but I think you should run around and > examine every place that checks for tuple deadness to see if the offset > it used is known to be within the current page bounds. In my search for problematic accesses, I make the following assumptions: * PageRepairFragmentation as found in bufpage is only applicable to heap pages; this function is not applied to other pages in core postgres. So, any problems that occur are with due to access with an OffsetNumber > PageGetMaxOffsetNumber. * Items [on heap pages] are only extracted after using PageGetItemId for that item on the page, whilst holding a lock. Under those assumptions, I ran "grep PageGetItemId" over the src directory. For all 227 results (as of 68b34b23) I checked if the page accessed (or item accessed thereafter) was a heap page or heap tuple. After analysis of the relevant references, I had the following results (attached full report, containing a line with the file & line number, and code line of the call, followed by a line containing the usage type): Count of usage type - usage type 4 - Not a call (comment) 7 - Callsite guarantees bounds 8 - Has assertion ItemIdIsNormal (asserts item is not removed; i.e. 
concurrent vacuum should not have been able to remove this item) 39 - Has bound checks 6 - Not a heap page (brin) 1 - Not a heap page (generic index) 24 - Not a heap page (gin) 21 - Not a heap page (gist) 14 - Not a heap page (hash) 60 - Not a heap page (nbtree) 1 - Not a heap page (sequence) 36 - Not a heap page (spgist) 2 - OffsetNumber is generated by PageAddItem 2 - problem case 1 (table scan) 1 - Problem case 2 (xlog) 1 - Problem case 3 (invalid redirect pointers) The 3 problem cases were classified based on the origin of the potentially invalid pointer. Problem case 1: table scan; heapam.c lines 678 and 811, in heapgettup The table scan maintains a state which contains a page-bound OffsetNumber, which it uses as a cursor whilst working through the pages of the relation. In forward scans, the bounds of the page are re-validated at the start of the 'while (linesleft > 0)' loop at 681, but for backwards scans this check is invalid, because it incorrectly assumes that the last OffsetNumber is guaranteed to still exist. For MVCC snapshots, this is true (the previously returned value must have been visible in its snapshot, therefore cannot have been vacuumed because the snapshot is still alive), but non-mvcc snapshots may have returned a dead tuple, which is now vacuumed and truncated away. The first occurrance of this issue is easily fixed with the changes as submitted in patch v2. The second problem case (on line 811) is for forward scans, where the line pointer array could have been truncated to 0 length. As the code uses a hardcoded offset of FirstOffsetNumber (=1), that might point into arbitrary data. The reading of this arbitrary data is saved by the 'while (linesleft > 0) check', because at that point linesleft will be PageGetMaxOffsetNumber, which would then equal 0. Problem case 2: xlog; heapam.c line 8796, in heap_xlog_freeze_page This is in the replay of transaction lo
Re: Parallel INSERT (INTO ... SELECT ...)
On Wed, Mar 10, 2021 at 2:48 PM Amit Kapila wrote: > > On Mon, Mar 8, 2021 at 7:19 PM Amit Langote wrote: > > > > Hi Amit > > > > On Mon, Mar 8, 2021 at 10:18 PM Amit Kapila wrote: > > > On Mon, Mar 8, 2021 at 3:54 PM Greg Nancarrow wrote: > > > > I've attached an updated set of patches with the suggested locking > > > > changes. > > > > (Thanks Greg.) > > > > > Amit L, others, do let me know if you have still more comments on > > > 0001* patch or if you want to review it further? > > > > I just read through v25 and didn't find anything to complain about. > > > > Thanks a lot, pushed now! Amit L., your inputs are valuable for this work. > > Now, coming back to Hou-San's patch to introduce a GUC and reloption > for this feature, I think both of those make sense to me because when > the feature is enabled via GUC, one might want to disable it for > partitioned tables? Do we agree on that part or someone thinks > otherwise? I think it makes sense to have both. > > The other points to bikeshed could be: > 1. The name of GUC and reloption. The two proposals at hand are > enable_parallel_dml and enable_parallel_insert. I would prefer the > second (enable_parallel_insert) because updates/deletes might not have > a similar overhead. I agree enable_parallel_insert makes more sense. > 2. Should we keep the default value of GUC to on or off? It is > currently off. I am fine keeping it off for this release and we can > always turn it on in the later releases if required. Having said that, > I see the value in keeping it on because in many cases Insert ... > Select will be used for large data and there we will see a benefit of > parallelism and users facing trouble (who have a very large number of > partitions with less data to query) can still disable the parallel > insert for that particular table. Also, the other benefit of keeping > it on till at least the beta period is that this functionality will > get tested and if we found reports of regression then we can turn it > off for this release as well. > > Thoughts? IMHO, we should keep it on because in most of the cases it is going the give benefit to the user, and if for some specific scenario where a table has a lot of partition then it can be turned off using reloption. And, if for some users most of the tables are like that where they always have a lot of partition then the user can turn it off using guc. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [PATCH] SET search_path += octopus
On 12/1/20 9:49 AM, Abhijit Menon-Sen wrote: Therefore, for lack of any plausible way to proceed, I do not plan to work on this feature any more. It sounds like this CF entry should be closed. I'll do that on March 12 unless somebody beats me to it. Regards, -- -David da...@pgmasters.net
Re: libpq debug log
I also found that DataRow messages are not being printed. This seems to fix that in the normal case and singlerow, at least in pipeline mode. Didn't verify the non-pipeline mode. -- Álvaro Herrera39°49'30"S 73°17'W "Nunca se desea ardientemente lo que solo se desea por razón" (F. Alexandre) diff --git a/src/interfaces/libpq/fe-protocol3.c b/src/interfaces/libpq/fe-protocol3.c index 68f0f6a081..e8db5edb71 100644 --- a/src/interfaces/libpq/fe-protocol3.c +++ b/src/interfaces/libpq/fe-protocol3.c @@ -869,6 +869,9 @@ getAnotherTuple(PGconn *conn, int msgLength) goto advance_and_error; } + if (conn->Pfdebug) + pqTraceOutputMsg(conn, conn->inStart, MSGDIR_FROM_BACKEND); + /* Advance inStart to show that the "D" message has been processed. */ conn->inStart = conn->inCursor;
Re: WIP: document the hook system
On 3/9/21 12:20 PM, Bruce Momjian wrote: On Sat, Mar 6, 2021 at 08:32:43PM -0500, Tom Lane wrote: I think that the best you should hope for here is that people are willing to add a short, not-too-detailed para to a markup-free plain-text README file that lists all the hooks. As soon as it gets any more complex than that, either the doco aspect will be ignored, or there simply won't be any more hooks. (I'm afraid I likewise don't believe in the idea of carrying a test module for each hook. Again, requiring that is a good way to ensure that new hooks just won't happen.) Agreed. If you document the hooks too much, it allows them to drift away from matching the code, which makes the hook documentation actually worse than having no hook documentation at all. There doesn't seem to be agreement on how to proceed here, so closing. David, if you do decide to proceed with a README then it would probably be best to create a new thread/entry. Regards, -- -David da...@pgmasters.net
Re: Boundary value check in lazy_tid_reaped()
On 10.03.21 02:29, Masahiko Sawada wrote: There is no noticeable effect of inlining lazy_tid_reaped(). So it would be better to not do that. Attached the patch that doesn't inline lazy_tid_reaped(). committed
Re: Huge memory consumption on partitioned table with FKs
On Wed, Mar 10, 2021 at 8:37 AM Tom Lane wrote: > Amit Langote writes: > > On Fri, Mar 5, 2021 at 6:00 AM Tom Lane wrote: > >> This claim seems false on its face: > >>> All child constraints of a given foreign key constraint can use the > >>> same RI query and the resulting plan, that is, no need to create as > >>> many copies of the query and the plan as there are partitions, as > >>> happens now due to the child constraint OID being used in the key > >>> for ri_query_cache. > > > The quoted comment could have been written to be clearer about this, > > but it's not talking about the table that is to be queried, but the > > table whose RI trigger is being executed. In all the cases except one > > (mentioned below), the table that is queried is the same irrespective > > of which partition's trigger is being executed, so we're basically > > creating the same plan separately for each partition. > > Hmm. So, the key point is that the values coming from the partitioned > child table are injected into the test query as parameters, not as > column references, thus it doesn't matter *to the test query* what > numbers the referencing columns have in that child. We just have to > be sure we pass the right parameter values. Right. > But ... doesn't the code > use riinfo->fk_attnums[] to pull out the values to be passed? Yes, from a slot that belongs to the child table. > IOW, I now get the point about being able to share the SPI plans, > but I'm still dubious about sharing the RI_ConstraintInfo cache entries. There was actually a proposal upthread about sharing the RI_ConstraintInfo too, but we decided to not pursue that for now. > It looks to me like the v4 patch is now actually not sharing the > cache entries, ie their hash key is just the child constraint OID > same as before; Yeah, you may see that we're only changing ri_BuildQueryKey() in the patch affecting only ri_query_cache, but not ri_LoadConstraintInfo() which leaves ri_constraint_cache unaffected. > but the comments are pretty confused about this. I've tried improving the comment in ri_BuildQueryKey() a bit to make clear what is and what is not being shared between partitions. > It might be simpler if you add just one new field which is the OID of > the constraint that we're building the SPI query from, which might be > either equal to constraint_id, or the OID of some parent constraint. > In particular it's not clear to me why we need both constraint_parent > and constraint_root_id. Yeah, I think constraint_parent is a leftover from some earlier hacking. I have removed it. Attached updated patch. -- Amit Langote EDB: http://www.enterprisedb.com v5-0001-Share-SPI-plan-between-partitions-in-some-RI-trig.patch Description: Binary data
Re: [PATCH] Covering SPGiST index
On 12/4/20 12:31 PM, Pavel Borisov wrote: The cfbot's still unhappy --- looks like you omitted a file from the patch? You are right, thank you. Corrected this. PFA v13 Tom, do the changes as enumerated in [1] look like they are going in the right direction? Regards, -- -David da...@pgmasters.net [1] https://www.postgresql.org/message-id/CALT9ZEEszJUwsXMWknXQ3k_YbGtQaQwiYxxEGZ-pFGRUDSXdtQ%40mail.gmail.com
Re: WIP: document the hook system
On Wed, Mar 10, 2021 at 09:38:39AM -0500, David Steele wrote: > On 3/9/21 12:20 PM, Bruce Momjian wrote: > > On Sat, Mar 6, 2021 at 08:32:43PM -0500, Tom Lane wrote: > > > I think that the best you should hope for here is that people are > > > willing to add a short, not-too-detailed para to a markup-free > > > plain-text README file that lists all the hooks. As soon as it > > > gets any more complex than that, either the doco aspect will be > > > ignored, or there simply won't be any more hooks. > > > > > > (I'm afraid I likewise don't believe in the idea of carrying a test > > > module for each hook. Again, requiring that is a good way to > > > ensure that new hooks just won't happen.) > > > > Agreed. If you document the hooks too much, it allows them to drift > > away from matching the code, which makes the hook documentation actually > > worse than having no hook documentation at all. > > There's doesn't seem to be agreement on how to proceed here, so closing. > > David, if you do decide to proceed with a README then it would probably be > best to create a new thread/entry. Thanks for the work on this and the helpful feedback! Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: WIP: System Versioned Temporal Table
Hi Ibrar, thank you for rebasing. On Mon, Mar 8, 2021 at 9:34 AM Ibrar Ahmed wrote: > >> Since the get_row_start_time_col_name() and get_row_end_time_col_name() >> are similar, IMO we can pass a flag to get StartTime/EndTime column name, >> thought? >> >> For me your option is better. I will change to it in my next patch if there is no objection. regards Surafel
Re: Evaluate expression at planning time for two more cases
Hi Ibrar, On Mon, Mar 8, 2021 at 8:13 AM Ibrar Ahmed wrote: > > It was a minor change therefore I rebased the patch, please take a look. > It is perfect, thank you. regards Surafel
Re: WIP: System Versioned Temporal Table
On 3/10/21 5:49 PM, Surafel Temesgen wrote: > hi Ibrar, > thank you for rebasing > > On Mon, Mar 8, 2021 at 9:34 AM Ibrar Ahmed wrote: > >> >>> Since the get_row_start_time_col_name() and get_row_end_time_col_name() >>> are similar, IMO we can pass a flag to get StartTime/EndTime column name, >>> thought? >>> >>> > For me your option is better. i will change to it in my next > patch if no objection I have plenty of objection. I'm sorry that I am taking so long with my review. I am still working on it and it is coming soon, I promise. -- Vik Fearing
Re: cryptohash: missing locking functions for OpenSSL <= 1.0.2?
On Wed, 2021-03-10 at 17:21 +0900, Michael Paquier wrote: > Do you mean something like the attached? Yes! Patch LGTM. --Jacob
Re: Add header support to text format and matching feature
On 12/7/20 6:40 PM, Rémi Lapeyre wrote: Hi, here’s a rebased version of the patch. Michael, since the issue of duplicated options has been fixed do either of these patches look like they are ready for commit? Regards, -- -David da...@pgmasters.net
Re: Consider parallel for lateral subqueries with limit
On 12/7/20 6:45 PM, James Coleman wrote: On Sun, Dec 6, 2020 at 7:34 PM Brian Davis wrote: Played around with this a bit, here's a non-correlated subquery that gets us to that if statement While I haven't actually tracked down to guarantee this is handled elsewhere, a thought experiment -- I think -- shows it must be so. Here's why: suppose we don't have a limit here, but the query return order is different in different backends. Then we would have the same problem you bring up. In that case this code is already setting consider_parallel=true on the rel. So I don't think we're changing any behavior here. So it looks like you and Brian are satisfied that this change is not allowing bad behavior. Seems like an obvious win. Hopefully we can get some other concurring opinions. Regards, -- -David da...@pgmasters.net
Re: get rid of tags in the docs?
John Naylor writes: > On Thu, Feb 4, 2021 at 11:31 AM Tom Lane wrote: >> +1, nobody italicizes those in normal usage. > Now that protocol v2 is gone, here's a patch to remove those tags. Pushed. regards, tom lane
Re: Speeding up GIST index creation for tsvectors
On Mon, Mar 8, 2021 at 8:43 AM Amit Khandekar wrote: > > On Wed, 3 Mar 2021 at 23:32, John Naylor wrote: > > 0001: > > > > + /* > > + * We can process 64-bit chunks only if both are mis-aligned by the same > > + * number of bytes. > > + */ > > + if (b_aligned - b == a_aligned - a) > > > > The obvious question here is: how often are they identically misaligned? You > > don't indicate that your measurements differ in a bimodal fashion, so does > > that mean they happen to be always (mis)aligned? > > I ran CREATE INDEX on tsvector columns using the tsearch.data that I > had attached upthread, with some instrumentation; here are the > proportions : > 1. In 15% of the cases, only one among a and b was aligned. The other > was offset from the 8-byte boundary by 4 bytes. > 2. 6% of the cases, both were offset by 4 bytes, i.e. identically misaligned. > 3. Rest of the cases, both were aligned. That's somewhat strange. I would have assumed always aligned, or usually not. It sounds like we could require them both to be aligned and not bother with the byte-by-byte prologue. I also wonder if it's worth it to memcpy to a new buffer if the passed pointer is not aligned. > > + /* For smaller lengths, do simple byte-by-byte traversal */ > > + if (bytes <= 32) > > > > You noted upthread: > > > > > Since for the gist index creation of some of these types the default > > > value for siglen is small (8-20), I tested with small siglens. For > > > siglens <= 20, particularly for values that are not multiples of 8 > > > (e.g. 10, 13, etc), I see a 1-7 % reduction in speed of index > > > creation. It's probably because of > > > an extra function call for pg_xorcount(); and also might be due to the > > > extra logic in pg_xorcount() which becomes prominent for shorter > > > traversals. So for siglen less than 32, I kept the existing method > > > using byte-by-byte traversal. > > > > I wonder if that can be addressed directly, while cleaning up the loops and > > alignment checks in pg_xorcount_long() a little. For example, have a look at > > pg_crc32c_armv8.c -- it might be worth testing a similar approach. > > Yeah, we can put the bytes <= 32 condition inside pg_xorcount_long(). > I avoided that to not hamper the <= 32 scenarios. Details explained > below for "why inline pg_xorcount is calling global function" > > > Also, pardon my ignorance, but where can I find the default siglen for various types? > Check SIGLEN_DEFAULT. Okay, so it's hard-coded in various functions in contrib modules. In that case, let's just keep the existing coding for those. In fact, the comments that got removed by your patch specifically say: /* Using the popcount functions here isn't likely to win */ ...which your testing confirmed. The new function should be used only for Gist and possibly Intarray, whose default is 124. That means we can't get rid of hemdistsign(), but that's fine. Alternatively, we could say that a small regression is justifiable reason to refactor all call sites, but I'm not proposing that. > > (I did make sure to remove indirect calls from the retail functions > > in [1], in case we want to go that route). > > Yeah, I quickly had a look at that. I am still going over that thread. > Thanks for the exhaustive analysis there. I'll post a patch soon that builds on that, so you can see what I mean. -- John Naylor EDB: http://www.enterprisedb.com
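For readers following along, a minimal sketch of the kind of helper being discussed -- popcount of the XOR of two signatures, with a plain byte loop for short inputs and 64-bit chunks otherwise -- might look like the following. This is illustrative only; it is not the submitted patch, and it sidesteps the alignment question by memcpy'ing into local variables, one of the options mentioned above.

#include "postgres.h"
#include "port/pg_bitutils.h"

/* Hypothetical hamming-distance helper; not the patch's actual pg_xorcount(). */
static uint64
xorcount_sketch(const unsigned char *a, const unsigned char *b, Size bytes)
{
	uint64		popcnt = 0;

	/* short signatures: byte-by-byte, as the existing hemdistsign() does */
	if (bytes < 32)
	{
		while (bytes--)
			popcnt += pg_number_of_ones[*a++ ^ *b++];
		return popcnt;
	}

	/* longer signatures: 8 bytes at a time; memcpy avoids alignment issues */
	for (; bytes >= 8; bytes -= 8, a += 8, b += 8)
	{
		uint64		xa,
					xb;

		memcpy(&xa, a, 8);
		memcpy(&xb, b, 8);
		popcnt += pg_popcount64(xa ^ xb);
	}

	/* remaining tail bytes */
	while (bytes--)
		popcnt += pg_number_of_ones[*a++ ^ *b++];

	return popcnt;
}

Whether the byte-by-byte cutoff lives inside or outside such a helper is exactly the trade-off discussed above for the small default siglens.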
Re: Huge memory consumption on partitioned table with FKs
Amit Langote writes: > On Wed, Mar 10, 2021 at 8:37 AM Tom Lane wrote: >> Hmm. So, the key point is that the values coming from the partitioned >> child table are injected into the test query as parameters, not as >> column references, thus it doesn't matter *to the test query* what >> numbers the referencing columns have in that child. We just have to >> be sure we pass the right parameter values. > Right. I did some cosmetic fooling with this (mostly, rewriting the comments YA time) and pushed it. regards, tom lane
Re: libpq debug log
On 2021-Mar-10, 'alvhe...@alvh.no-ip.org' wrote: > I also found that DataRow messages are not being printed. This seems to > fix that in the normal case and singlerow, at least in pipeline mode. > Didn't verify the non-pipeline mode. Hm, and RowDescription ('T') messages are also not being printed ... so I think there's some larger problem here, and perhaps it needs to be fixed using a different approach. After staring at it a couple of times, I think that the places in pqParseInput3() where there's a comment "... moves inStart itself" and then "continue;" should have a call to pqTraceOutputMsg(), since AFAIU those are the places where the message in question would not reach the "Successfully consumed this message" block that prints each message. -- Álvaro Herrera39°49'30"S 73°17'W Maybe there's lots of data loss but the records of data loss are also lost. (Lincoln Yeoh)
Re: libpq debug log
"'alvhe...@alvh.no-ip.org'" writes: > After staring at it a couple of times, I think that the places in > pqParseInput3() where there's a comment "... moves inStart itself" and > then "continue;" should have a call to pqTraceOutputMsg(), since AFAIU > those are the places where the message in question would not reach the > "Successfully consumed this message" block that prints each message. Yeah, the whole business of just when a message has been "consumed" is a stumbling block for libpq tracing. It was a real mess with the existing code and v2 protocol, because we could actually try to parse a message more than once, with the first few tries deciding that the message wasn't all here yet (but nonetheless emitting partial trace output). Now that v2 is dead, the logic to abort because of the message not being all arrived yet is basically useless: only the little bit of code concerned with the message length word really needs to cope with that. It's tempting to go through and get rid of all the now-unreachable "return"s and such, but it seemed like it would be a lot of code churn for not really that much gain. I didn't look at the new version of the patch yet, so I'm not sure whether the issues it still has are related to this. regards, tom lane
Re: Occasional tablespace.sql failures in check-world -jnn
Hi, On 2021-03-10 15:40:38 +0900, Michael Paquier wrote: > On Mon, Mar 08, 2021 at 11:53:57AM +0100, Peter Eisentraut wrote: > > On 09.12.20 08:55, Michael Paquier wrote: > >> ... Because we may still introduce this problem again if some new > >> stuff uses src/test/pg_regress in a way similar to pg_upgrade, > >> triggering again tablespace-setup. Something like the attached may be > >> enough, though I have not spent much time checking the surroundings, > >> Windows included. > > > > This patch looks alright to me. > > So, I have spent more time checking the surroundings of this patch, > and finally applied it. Thanks for the review, Peter. Thanks! Greetings, Andres Freund
Re: libpq debug log
"'alvhe...@alvh.no-ip.org'" writes: > After staring at it a couple of times, I think that the places in > pqParseInput3() where there's a comment "... moves inStart itself" and > then "continue;" should have a call to pqTraceOutputMsg(), since AFAIU > those are the places where the message in question would not reach the > "Successfully consumed this message" block that prints each message. After digging around a little, I remember that there are a *bunch* of places in fe-protocol3.c that advance inStart. The cleanest way to shove this stuff in without rethinking that logic would be to call pqTraceOutputMsg immediately before each such advance (using the old inStart value as the message start address). A possible advantage of doing it like that is that we'd be aware by that point of whether we think the message is good or bad (or whether it was good but we failed to process it, perhaps because of OOM). Seems like that could be a useful thing to include in the log. Or we could rethink the logic. It's not quite clear to me, after all this time, why getRowDescriptions() et al are allowed to move inStart themselves rather than letting the main loop in pqParseInput3 do it. It might well be an artifact of having not rewritten the v2 logic too much. regards, tom lane
Re: pg_amcheck contrib application
On Wed, Mar 10, 2021 at 11:10 AM Mark Dilger wrote: > Once again, I think you are right and have removed the objectionable > behavior, but > > The --startblock and --endblock options make the most sense when the user is > only checking one table, like > > pg_amcheck --startblock=17 --endblock=19 --table=my_schema.my_corrupt_table > > because the user likely has some knowledge about that table, perhaps from a > prior run of pg_amcheck. The --startblock and --endblock arguments are a bit > strange when used globally, as relations don't all have the same number of > blocks, so > > pg_amcheck --startblock=17 --endblock=19 mydb > > will very likely emit lots of error messages for tables which don't have > blocks in that range. That's not entirely pg_amcheck's fault, as it just did > what the user asked, but it also doesn't seem super helpful. I'm not going > to do anything about it in this release. +1 to all that. I tend toward the opinion that trying to make --startblock and --endblock do anything useful in the context of checking multiple relations is not really possible, and therefore we just shouldn't put any effort into it. But if user feedback shows otherwise, we can always do something about it later. > After running 'make installcheck', if I delete all entries from pg_class > where relnamespace = 'pg_toast'::regclass, by running 'pg_amcheck > regression', I get lines that look like this: > > heap relation "regression"."public"."quad_poly_tbl": > ERROR: could not open relation with OID 17177 In this here example, the first line ends in a colon. > relation "regression"."public"."functional_dependencies", block 28, offset > 54, attribute 0 > attribute 0 with length 4294967295 ends at offset 50 beyond total tuple > length 43 But this here one does not. Seems like it should be consistent. The QUALIFIED_NAME_FIELDS macro doesn't seem to be used anywhere, which is good, because macros with unbalanced parentheses are usually not a good plan; and a macro that expands to a comma-separated list of things is suspect too. "invalid skip options\n" seems too plural. With regard to your use of strtol() for --{start,end}block, telling the user that their input is garbage seems pejorative, even though it may be accurate. Compare: [rhaas EDBAS]$ pg_dump -jdsgdsgd pg_dump: error: invalid number of parallel jobs In the message "relation end block argument precedes start block argument\n", I think you could lose both instances of the word "argument" and probably the word "relation" as well. I actually don't know why all of these messages about start and end block mention "relation". It's not like there is some other kind of non-relation-related start block with which it could be confused. The comment for run_command() explains some things about the cparams argument, but those things are false. In fact the argument is unused. Usual PostgreSQL practice when freeing memory in e.g. verify_heap_slot_handler is to set the pointers to NULL as well. The performance cost of this is trivial, and it makes debugging a lot easier should somebody accidentally write code to access one of those things after it's been freed. The documentation says that -D "does exclude any database that was listed explicitly as dbname on the command line, nor does it exclude the database chosen in the absence of any dbname argument." The first part of this makes complete sense to me, but I'm not sure about the second part. If I type pg_amcheck --all -D 'r*', I think I'm expecting that "rhaas" won't be checked. 
Likewise, if I say pg_amcheck -d 'bob*', I think I only want to check the bob-related databases and not rhaas. I suggest documenting --endblock as "Check table blocks up to and including the specified ending block number. An error will occur if a relation being checked has fewer than this number of blocks." And similarly for --startblock: "Check table blocks beginning with the specified block number. An error will occur, etc." Perhaps even mention something like "This option is probably only useful when checking a single table." Also, the documentation here isn't clear that this affects only table checking, not index checking. It appears that pg_amcheck sometimes makes dummy connections to the database that don't do anything, e.g. pg_amcheck -t 'q*' resulted in: 2021-03-10 15:00:14.273 EST [95473] LOG: connection received: host=[local] 2021-03-10 15:00:14.274 EST [95473] LOG: connection authorized: user=rhaas database=rhaas application_name=pg_amcheck 2021-03-10 15:00:14.286 EST [95473] LOG: statement: SELECT pg_catalog.set_config('search_path', '', false); 2021-03-10 15:00:14.290 EST [95464] DEBUG: forked new backend, pid=95474 socket=11 2021-03-10 15:00:14.291 EST [95464] DEBUG: server process (PID 95473) exited with exit code 0 2021-03-10 15:00:14.291 EST [95474] LOG: connection received: host=[local] 2021-03-10 15:00:14.293 EST [95474] LOG: connection authorized: user=rhaas database=rhaas appl
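As an aside on the strtol() point above, a friendlier validation routine in the style of other frontend tools could look roughly like this. It is a hypothetical sketch, not the actual pg_amcheck code; the function name and message wording are illustrative.

#include "postgres_fe.h"

#include <errno.h>

#include "common/logging.h"

/* Hypothetical option parser for --startblock/--endblock style arguments. */
static int64
parse_block_number(const char *optname, const char *value)
{
	char	   *endptr;
	long long	blkno;

	errno = 0;
	blkno = strtoll(value, &endptr, 10);
	if (errno != 0 || endptr == value || *endptr != '\0' || blkno < 0)
	{
		pg_log_error("invalid %s block number: \"%s\"", optname, value);
		exit(1);
	}
	return (int64) blkno;
}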
Re: [HACKERS] Custom compression methods
On Wed, Mar 10, 2021 at 6:52 AM Dilip Kumar wrote: > The pending comment is providing a way to rewrite a table and > re-compress the data with the current compression method. I spent some time poking at this yesterday and couldn't figure out what was going on here. There are two places where we rewrite tables. One is the stuff in cluster.c, which handles VACUUM FULL and CLUSTER. That eventually calls reform_and_rewrite_tuple(), which deforms the old tuple and creates a new one, but it doesn't seem like there's anything in there that would expand toasted values, whether external or inline compressed. But I think that can't be right, because it seems like then you'd end up with toast pointers into the old TOAST relation, not the new one, which would cause failures later. So I must be missing something here. The other place where we rewrite tables is in ATRewriteTable() as part of the ALTER TABLE machinery. I don't see anything there to force detoasting either. That said, I think that using the word REWRITE may not really capture what we're on about. Leaving aside the question of exactly what the CLUSTER code does today, you could in theory rewrite the main table by just taking all the tuples and putting them into a new relfilenode. And then you could do the same thing with the TOAST table. And despite having fully rewritten both tables, you wouldn't have done anything that helps with this problem because you haven't deformed the tuples at any point. Now as it happens we do have code -- in reform_and_rewrite_tuple() -- that does deform and reform the tuples, but it doesn't take care of this problem either. We might need to distinguish between rewriting the table, which is mostly about getting a new relfilenode, and some other word that means doing this. But, I am not really convinced that we need to solve this problem by adding new ALTER TABLE syntax. I'd be happy enough if CLUSTER, VACUUM FULL, and versions of ALTER TABLE that already force a rewrite would cause the compression to be redone also. Honestly, even if the user had to fall back on creating a new table and doing INSERT INTO newtab SELECT * FROM oldtab I would consider that to be not a total showstopper for this .. assuming of course that it actually works. If it doesn't, we have big problems. Even without the pg_am stuff, we still need to make sure that we don't just blindly let compressed values wander around everywhere. When we insert into a table column with a compression method, we should recompress any data that is compressed using some other method. -- Robert Haas EDB: http://www.enterprisedb.com
Re: libpq debug log
On 2021-Mar-10, Tom Lane wrote: > "'alvhe...@alvh.no-ip.org'" writes: > > After staring at it a couple of times, I think that the places in > > pqParseInput3() where there's a comment "... moves inStart itself" and > > then "continue;" should have a call to pqTraceOutputMsg(), since AFAIU > > those are the places where the message in question would not reach the > > "Successfully consumed this message" block that prints each message. > > Yeah, the whole business of just when a message has been "consumed" > is a stumbling block for libpq tracing. It was a real mess with the > existing code and v2 protocol, because we could actually try to parse > a message more than once, with the first few tries deciding that the > message wasn't all here yet (but nonetheless emitting partial trace > output). Hmm, that makes sense, but the issue I'm reporting is not the same, unless I misunderstand you. > Now that v2 is dead, the logic to abort because of the message not > being all arrived yet is basically useless: only the little bit of > code concerned with the message length word really needs to cope with > that. It's tempting to go through and get rid of all the now-unreachable > "return"s and such, but it seemed like it would be a lot of code churn for > not really that much gain. That sounds like an interesting exercise, and I bet it'd bring a lot of code readability improvements. > I didn't look at the new version of the patch yet, so I'm not > sure whether the issues it still has are related to this. The issues I noticed are related to separate messages rather than one message split in pieces -- for example several DataRow messages are processed internally in a loop, rather than each individually. The number of places that need to be adjusted for things to AFAICT work correctly are few enough; ISTM that the attached patch is sufficient. (The business with the "logged" boolean is necessary so that we print to the trace file any of those messages even if they are deviating from the protocol.) -- Álvaro Herrera Valdivia, Chile "La experiencia nos dice que el hombre peló millones de veces las patatas, pero era forzoso admitir la posibilidad de que en un caso entre millones, las patatas pelarían al hombre" (Ijon Tichy)
Re: libpq debug log
On 2021-Mar-10, Tom Lane wrote: > Or we could rethink the logic. It's not quite clear to me, after > all this time, why getRowDescriptions() et al are allowed to > move inStart themselves rather than letting the main loop in > pqParseInput3 do it. It might well be an artifact of having not > rewritten the v2 logic too much. I would certainly prefer that the logic stays put for the time being, while I finalize the pipelining stuff. -- Álvaro Herrera Valdivia, Chile
Re: libpq debug log
I wrote: > Or we could rethink the logic. It's not quite clear to me, after > all this time, why getRowDescriptions() et al are allowed to > move inStart themselves rather than letting the main loop in > pqParseInput3 do it. It might well be an artifact of having not > rewritten the v2 logic too much. After studying this further, I think we should apply the attached patch to remove that responsibility from pqParseInput3's subroutines. This will allow a single trace call near the bottom of pqParseInput3 to handle all cases that that function processes. There are still some cowboy advancements of inStart in the functions concerned with COPY processing, but it doesn't look like there's much to be done about that. Those spots will need their own trace calls. BTW, while looking at this I concluded that getParamDescriptions is actually buggy: if it's given a malformed message that seems to need more data than the message length specifies, it just returns EOF, resulting in an infinite loop. This function apparently got missed while converting the logic from v2 (where that was the right thing) to v3 (where it ain't). So that bit needs to be back-patched. I'm tempted to back-patch the whole thing though. regards, tom lane diff --git a/src/interfaces/libpq/fe-protocol3.c b/src/interfaces/libpq/fe-protocol3.c index 2ca8c057b9..233456fd90 100644 --- a/src/interfaces/libpq/fe-protocol3.c +++ b/src/interfaces/libpq/fe-protocol3.c @@ -290,8 +290,6 @@ pqParseInput3(PGconn *conn) /* First 'T' in a query sequence */ if (getRowDescriptions(conn, msgLength)) return; - /* getRowDescriptions() moves inStart itself */ - continue; } else { @@ -337,8 +335,7 @@ pqParseInput3(PGconn *conn) case 't': /* Parameter Description */ if (getParamDescriptions(conn, msgLength)) return; - /* getParamDescriptions() moves inStart itself */ - continue; + break; case 'D': /* Data Row */ if (conn->result != NULL && conn->result->resultStatus == PGRES_TUPLES_OK) @@ -346,8 +343,6 @@ pqParseInput3(PGconn *conn) /* Read another tuple of a normal query response */ if (getAnotherTuple(conn, msgLength)) return; - /* getAnotherTuple() moves inStart itself */ - continue; } else if (conn->result != NULL && conn->result->resultStatus == PGRES_FATAL_ERROR) @@ -462,7 +457,6 @@ handleSyncLoss(PGconn *conn, char id, int msgLength) * command for a prepared statement) containing the attribute data. * Returns: 0 if processed message successfully, EOF to suspend parsing * (the latter case is not actually used currently). - * In the former case, conn->inStart has been advanced past the message. */ static int getRowDescriptions(PGconn *conn, int msgLength) @@ -577,9 +571,6 @@ getRowDescriptions(PGconn *conn, int msgLength) /* Success! */ conn->result = result; - /* Advance inStart to show that the "T" message has been processed. */ - conn->inStart = conn->inCursor; - /* * If we're doing a Describe, we're done, and ready to pass the result * back to the client. @@ -603,9 +594,6 @@ advance_and_error: if (result && result != conn->result) PQclear(result); - /* Discard the failed message by pretending we read it */ - conn->inStart += 5 + msgLength; - /* * Replace partially constructed result with an error result. First * discard the old result to try to win back some memory. @@ -624,6 +612,12 @@ advance_and_error: appendPQExpBuffer(&conn->errorMessage, "%s\n", errmsg); pqSaveErrorResult(conn); + /* + * Show the message as fully consumed, else pqParseInput3 will overwrite + * our error with a complaint about that. 
+ */ + conn->inCursor = conn->inStart + 5 + msgLength; + /* * Return zero to allow input parsing to continue. Subsequent "D" * messages will be ignored until we get to end of data, since an error @@ -635,12 +629,8 @@ advance_and_error: /* * parseInput subroutine to read a 't' (ParameterDescription) message. * We'll build a new PGresult structure containing the parameter data. - * Returns: 0 if completed message, EOF if not enough data yet. - * In the former case, conn->inStart has been advanced past the message. - * - * Note that if we run out of data, we have to release the partially - * constructed PGresult, and rebuild it again next time. Fortunately, - * that shouldn't happen often, since 't' messages usually fit in a packet. + * Returns: 0 if processed message successfully, EOF to suspend parsing + * (the latter case is not actually used currently). */ static int getParamDescriptions(PGconn *conn, int msgLength) @@ -690,23 +680,16 @@ getParamDescriptions(PGconn *conn, int msgLength) /* Success! */ conn->result = result; - /* Advance inStart to show that the "t" message has been processed. */ - conn->inStart = conn->inCursor; - return 0; not_enough_data: - PQclear(result); - return EOF; + errmsg =
Re: Replace buffer I/O locks with condition variables (reviving an old patch)
On Tue, Mar 9, 2021 at 6:24 PM Julien Rouhaud wrote: > > The old I/O lock array was the only user of struct > > LWLockMinimallyPadded, added in commit 6150a1b08a9, and it seems kinda > > strange to leave it in the tree with no user. Of course it's remotely > > possible there are extensions using it (know of any?). In the > > attached, I've ripped that + associated commentary out, because it's > > fun to delete dead code. Objections? > > None from me. I don't know of any extension relying on it, and neither does > codesearch.debian.net. I would be surprised to see any extension actually > relying on that anyway. Thanks for checking! > > Since the whole reason for that out-of-line array in the first place > > was to keep BufferDesc inside one cache line, and since it is in fact > > possible to put a new condition variable into BufferDesc without > > exceeding 64 bytes on a 64 bit x86 box, perhaps we should just do that > > instead? I haven't yet considered other architectures or potential > > member orders. > > +1 for adding the cv into BufferDesc. That brings the struct size to exactly > 64 bytes on x86 64 bits architecture. This won't add any extra overhead to > LOCK_DEBUG cases, as it was already exceeding the 64B threshold, if that even > was a concern. I also checked that it's 64B on an Arm box. Not sure about POWER. But... despite the fact that it looks like a good change in isolation, I decided to go back to the separate array in this initial commit, because the AIO branch also wants to add a new BufferDesc member[1]. I may come back to that change, if we can make some more space (seems entirely doable, but I'd like to look into that separately). > > I wonder if we should try to preserve user experience a little harder, > > for the benefit of people who have monitoring queries that look for > > this condition. Instead of inventing a new wait_event value, let's > > just keep showing "BufferIO" in that column. In other words, the > > change is that wait_event_type changes from "LWLock" to "IPC", which > > is a pretty good summary of this patch. Done in the attached. Does > > this make sense? > > I think it does make sense, and it's good to preserve this value. > > Looking at the patch itself, I don't have much to add it all looks sensible > and > I agree with the arguments in the first mail. All regression tests pass and > documentation builds. I found one more thing to tweak: a reference in the README. > I'm marking this patch as RFC. Thanks for the review. And of course to Robert for writing the patch. Pushed. [1] https://github.com/anarazel/postgres/blob/aio/src/include/storage/buf_internals.h#L190
fdatasync performance problem with large number of DB files
I initially posted this on the pgsql-general mailing list [5] but was advised to also post this to the -hackers list as it deals with internals. We've encountered a production performance problem with pg13 related to how it fsyncs the whole data directory in certain scenarios, related to what Paul (bcc'ed) described in a post to pgsql-hackers [1]. Background: We've observed the full recursive fsync is triggered when * pg_basebackup receives a streaming backup (via [2] fsync_dir_recurse or fsync_pgdata) unless --no-sync is specified * postgres starts up unclean (via [3] SyncDataDirectory) We run multiple postgres clusters and some of those clusters have many (~450) databases (one database-per-customer) meaning that the postgres data directory has around 700,000 files. On one of our less loaded servers this takes ~7 minutes to complete, but on another [4] this takes ~90 minutes. Obviously this is untenable risk. We've modified our process that bootstraps a replica via pg_basebackup to instead do "pg_basebackup --no-sync…" followed by a "sync", but we don't have any way to do the equivalent for the postgres startup. I presume the reason postgres doesn't blindly run a sync() is that we don't know what other I/O is on the system and it'd be rude to affect other services. That makes sense, except for our environment the work done by the recursive fsync is orders of magnitude more disruptive than a sync(). My questions are: * is there a knob missing we can configure? * can we get an opt-in knob to use a single sync() call instead of a recursive fsync()? * would you be open to merging a patch providing said knob? * is there something else we missed? Thanks! [1]: https://www.postgresql.org/message-id/flat/caeet0zhgnbxmi8yf3ywsdzvb3m9cbdsgzgftxscq6agcbzc...@mail.gmail.com [2]: https://github.com/postgres/postgres/blob/master/src/bin/pg_basebackup/pg_basebackup.c#L2181 [3]: https://github.com/postgres/postgres/blob/master/src/backend/access/transam/xlog.c#L6495 [4]: It should be identical config-wise. It isn't starved for IO but does have other regular write workloads [5]: https://www.postgresql-archive.org/fdatasync-performance-problem-with-large-number-of-DB-files-td6184094.html -- Michael Brown Civilized Discourse Construction Kit, Inc. https://www.discourse.org/
Re: GiST comment improvement
On Tue, Mar 2, 2021 at 11:40:21AM -0500, Bruce Momjian wrote: > In talking to Teodor this week, I have written the attached C comment > patch which improves our description of GiST's NSN and GistBuildLSN > values. Patch applied. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: libpq debug log
On 2021-Mar-10, Tom Lane wrote: > After studying this further, I think we should apply the attached > patch to remove that responsibility from pqParseInput3's subroutines. > This will allow a single trace call near the bottom of pqParseInput3 > to handle all cases that that function processes. Works for me. > BTW, while looking at this I concluded that getParamDescriptions > is actually buggy: if it's given a malformed message that seems > to need more data than the message length specifies, it just > returns EOF, resulting in an infinite loop. This function apparently > got missed while converting the logic from v2 (where that was the > right thing) to v3 (where it ain't). So that bit needs to be > back-patched. Ah, that makes sense. > I'm tempted to back-patch the whole thing though. +0.5 on that. I think we may be happy that all branches are alike (though it doesn't look like this will cause any subtle bugs -- breakage will be fairly obvious). -- Álvaro Herrera39°49'30"S 73°17'W "Most hackers will be perfectly comfortable conceptualizing users as entropy sources, so let's move on." (Nathaniel Smith)
Re: automatic analyze: readahead - add "IO read time" log message
On 3/8/21 8:42 PM, Stephen Frost wrote: > Greetings, > > * Tomas Vondra (tomas.von...@enterprisedb.com) wrote: >> On 2/10/21 11:10 PM, Stephen Frost wrote: >>> * Heikki Linnakangas (hlinn...@iki.fi) wrote: On 05/02/2021 23:22, Stephen Frost wrote: > Unless there's anything else on this, I'll commit these sometime next > week. One more thing: Instead of using 'effective_io_concurrency' GUC directly, should call get_tablespace_maintenance_io_concurrency(). >>> >>> Ah, yeah, of course. >>> >>> Updated patch attached. >> >> A couple minor comments: >> >> 1) I think the patch should be split into two parts, one adding the >> track_io_timing, one adding the prefetching. > > This was already done.. > Not sure what you mean by "done"? I see the patch still does both changes related to track_io_timing and prefetching. >> 2) I haven't tried but I'm pretty sure there'll be a compiler warning >> about 'prefetch_maximum' being unused without USE_PREFETCH defined. > > Ah, that part is likely true, moved down into the #ifdef block to > address that, which also is good since it should avoid mistakenly using > it outside of the #ifdef's later on by mistake too. > OK >> 3) Is there a way to reduce the amount of #ifdef in acquire_sample_rows? > > Perhaps.. > >> This makes the code rather hard to read, IMHO. It seems to me we can >> move the code around a bit and merge some of the #ifdef blocks - see the >> attached patch. Most of this is fairly trivial, with the exception of >> moving PrefetchBuffer before table_scan_analyze_next_block - AFAIK this >> does not materially change the behavior, but perhaps I'm wrong. > > but I don't particularly like doing the prefetch right before we run > vacuum_delay_point() and potentially sleep. > Why? Is that just a matter of personal preference (fair enough) or is there a reason why that would be wrong/harmful? I think e.g. prefetch_targblock could be moved to the next #ifdef, which will eliminate the one-line ifdef. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: fdatasync performance problem with large number of DB files
On Thu, Mar 11, 2021 at 11:01 AM Michael Brown wrote: > * pg_basebackup receives a streaming backup (via [2] fsync_dir_recurse > or fsync_pgdata) unless --no-sync is specified > * postgres starts up unclean (via [3] SyncDataDirectory) > > We run multiple postgres clusters and some of those clusters have many > (~450) databases (one database-per-customer) meaning that the postgres > data directory has around 700,000 files. > > On one of our less loaded servers this takes ~7 minutes to complete, but > on another [4] this takes ~90 minutes. Ouch. > My questions are: > > * is there a knob missing we can configure? > * can we get an opt-in knob to use a single sync() call instead of a > recursive fsync()? > * would you be open to merging a patch providing said knob? > * is there something else we missed? As discussed on that other thread, I don't think sync() is an option (it doesn't wait on all OSes or in the standard and it doesn't report errors). syncfs() on Linux 5.8+ looks like a good candidate though, and I think we'd consider a patch like that. I mean, I even posted one[1] in that other thread. There will of course be cases where that's slower (small database sharing filesystem with other software that has a lot of dirty data to write back). I also wrote a WAL-and-checkpoint based prototype[2], which, among other advantages such as being faster, not ignoring errors and not triggering collateral write-back storms, happens to work on all operating systems. On the other hand it requires a somewhat dogmatic switch in thinking about the meaning of checkpoints (I mean, it requires humans to promise not to falsify checkpoints by copying databases around underneath us), which may be hard to sell (I didn't try very hard), and there may be subtleties I have missed... [1] https://www.postgresql.org/message-id/CA%2BhUKGKT6XiPiEJrqeOFGi7RYCGzbBysF9pyWwv0-jm-oNajxg%40mail.gmail.com [2] https://www.postgresql.org/message-id/CA%2BhUKGKHhDNnN6fxf6qrAx9h%2BmjdNU2Zmx7ztJzFQ0C5%3Du3QPg%40mail.gmail.com
Re: Speeding up GIST index creation for tsvectors
I wrote: > I'll post a patch soon that builds on that, so you can see what I mean. I've attached where I was imagining this heading, as a text file to avoid distracting the cfbot. Here are the numbers I got with your test on the attached, as well as your 0001, on x86-64 Clang 10, default siglen: master: 739ms v3-0001 692ms attached POC 665ms The small additional speed up is not worth it, given the code churn and complexity, so I don't want to go this route after all. I think the way to go is a simplified version of your 0001 (not 0002), with only a single function, for gist and intarray only, and a style that better matches the surrounding code. If you look at my xor functions in the attached text file, you'll get an idea of what it should look like. Note that it got the above performance without ever trying to massage the pointer alignment. I'm a bit uncomfortable with the fact that we can't rely on alignment, but maybe there's a simple fix somewhere in the gist code. -- John Naylor EDB: http://www.enterprisedb.com src/backend/access/heap/visibilitymap.c | 13 +- src/backend/nodes/bitmapset.c | 23 +-- src/backend/utils/adt/tsgistidx.c | 14 +- src/include/port/pg_bitutils.h | 12 +- src/port/pg_bitutils.c | 240 ++-- 5 files changed, 214 insertions(+), 88 deletions(-) diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c index e198df65d8..d910eb2875 100644 --- a/src/backend/access/heap/visibilitymap.c +++ b/src/backend/access/heap/visibilitymap.c @@ -388,7 +388,6 @@ visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_fro { Buffer mapBuffer; uint64 *map; - int i; /* * Read till we fall off the end of the map. We assume that any extra @@ -409,17 +408,11 @@ visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_fro StaticAssertStmt(MAPSIZE % sizeof(uint64) == 0, "unsupported MAPSIZE"); if (all_frozen == NULL) - { - for (i = 0; i < MAPSIZE / sizeof(uint64); i++) - nvisible += pg_popcount64(map[i] & VISIBLE_MASK64); - } + nvisible = pg_popcount_mask64(map, MAPSIZE, VISIBLE_MASK64); else { - for (i = 0; i < MAPSIZE / sizeof(uint64); i++) - { - nvisible += pg_popcount64(map[i] & VISIBLE_MASK64); - nfrozen += pg_popcount64(map[i] & FROZEN_MASK64); - } + nvisible = pg_popcount_mask64(map, MAPSIZE, VISIBLE_MASK64); + nfrozen = pg_popcount_mask64(map, MAPSIZE, FROZEN_MASK64); } ReleaseBuffer(mapBuffer); diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c index 649478b0d4..5368c72255 100644 --- a/src/backend/nodes/bitmapset.c +++ b/src/backend/nodes/bitmapset.c @@ -452,7 +452,6 @@ bms_is_member(int x, const Bitmapset *a) int bms_member_index(Bitmapset *a, int x) { - int i; int bitnum; int wordnum; int result = 0; @@ -466,14 +465,8 @@ bms_member_index(Bitmapset *a, int x) bitnum = BITNUM(x); /* count bits in preceding words */ - for (i = 0; i < wordnum; i++) - { - bitmapword w = a->words[i]; - - /* No need to count the bits in a zero word */ - if (w != 0) - result += bmw_popcount(w); - } + result = pg_popcount((const char *) a->words, +wordnum * BITS_PER_BITMAPWORD / BITS_PER_BYTE); /* * Now add bits of the last word, but only those before the item. 
We can @@ -645,22 +638,14 @@ bms_get_singleton_member(const Bitmapset *a, int *member) int bms_num_members(const Bitmapset *a) { - int result = 0; int nwords; - int wordnum; if (a == NULL) return 0; nwords = a->nwords; - for (wordnum = 0; wordnum < nwords; wordnum++) - { - bitmapword w = a->words[wordnum]; - /* No need to count the bits in a zero word */ - if (w != 0) - result += bmw_popcount(w); - } - return result; + return pg_popcount((const char *) a->words, + nwords * BITS_PER_BITMAPWORD / BITS_PER_BYTE); } /* diff --git a/src/backend/utils/adt/tsgistidx.c b/src/backend/utils/adt/tsgistidx.
Re: Columns correlation and adaptive query optimization
On 3/10/21 3:00 AM, Tomas Vondra wrote: > Hello Konstantin, > > > Sorry for not responding to this thread earlier. I definitely agree the > features proposed here are very interesting and useful, and I appreciate > you kept rebasing the patch. > > I think the patch improving join estimates can be treated as separate, > and I see it already has a separate CF entry - it however still points > to this thread, which will be confusing. I suggest we start a different > thread for it, to keep the discussions separate. > D'oh! I must have been confused yesterday, because now I see there already is a separate thread [1] for the join selectivity patch. So you can ignore this. regards [1] https://www.postgresql.org/message-id/flat/71d67391-16a9-3e5e-b5e4-8f7fd32cc...@postgrespro.ru -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: PROXY protocol support
On Tue, 2021-03-09 at 11:25 +0100, Magnus Hagander wrote: > I've also added some trivial tests (man that took an ungodly amount of > fighting perl -- it's clearly been a long time since I used perl > properly). Yeah. The tests I'm writing for this and NSS have been the same way; it's a real problem. I'm basically writing supplemental tests in Python as the "daily driver", then trying to port whatever is easiest (not much) into Perl, when I get time. == More Notes == Some additional spec-compliance stuff: > /* Lower 4 bits hold type of connection */ > if (proxyheader.fam == 0) > { > /* LOCAL connection, so we ignore the address included */ > } (fam == 0) is the UNSPEC case, which isn't necessarily LOCAL. We have to do something different for the LOCAL case: > - \x0 : LOCAL : [...] The receiver must accept this connection as > valid and must use the real connection endpoints and discard the > protocol block including the family which is ignored. So we should ignore the entire "protocol block" (by which I believe they mean the protocol-and-address-family byte) in the case of LOCAL, and just accept it with the original address info intact. That seems to match the sample code in the back of the spec. The current behavior in the patch will apply the PROXY behavior incorrectly if the sender sends a LOCAL header with something other than UNSPEC -- which is strange behavior but not explicitly prohibited as far as I can see. We also need to reject all connections that aren't either LOCAL or PROXY commands: > - other values are unassigned and must not be emitted by senders. > Receivers must drop connections presenting unexpected values here. ...and naturally it'd be Nice (tm) if the tests covered those corner cases. Over on the struct side: > + struct > + { /* for TCP/UDP > over IPv4, len = 12 */ > + uint32 src_addr; > + uint32 dst_addr; > + uint16 src_port; > + uint16 dst_port; > + } ip4; > ... snip ... > + /* TCPv4 */ > + if (proxyaddrlen < 12) > + { Given the importance of these hardcoded lengths matching reality, is it possible to add some static assertions to make sure that sizeof() == 12 and so on? That would also save any poor souls who are using compilers with nonstandard struct-packing behavior. --Jacob
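On the static-assertion suggestion, something along these lines would catch any struct-packing surprises at compile time. This is a sketch only: proxy_v2_ip4 below is an illustrative redeclaration of the ip4 block quoted from the patch, whereas the real assertion would of course go next to the patch's own struct.

#include "c.h"

/* Illustrative mirror of the patch's ip4 address block. */
typedef struct
{
	uint32		src_addr;
	uint32		dst_addr;
	uint16		src_port;
	uint16		dst_port;
} proxy_v2_ip4;

/* PROXY v2 requires this block to be exactly 12 bytes on the wire. */
StaticAssertDecl(sizeof(proxy_v2_ip4) == 12,
				 "PROXY v2 IPv4 address block must be 12 bytes");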
Re: libpq debug log
Alvaro Herrera writes: > On 2021-Mar-10, Tom Lane wrote: >> After studying this further, I think we should apply the attached >> patch to remove that responsibility from pqParseInput3's subroutines. >> This will allow a single trace call near the bottom of pqParseInput3 >> to handle all cases that that function processes. > Works for me. I dug into the git history a little and concluded that the reason for doing that in the subroutines is that 92785dac2 made it so for this reason: + /* +* Advance inStart to show that the "D" message has been processed. We +* must do this before calling the row processor, in case it longjmps. +*/ + conn->inStart = conn->inCursor; We later gave up on allowing user-defined row processor callbacks, but we failed to undo this messiness. Looking at what 92785dac2 did confirms something else I'd been eyeing, which is why these subroutines have their own checks for having consumed the right amount of data instead of letting the master check in pqParseInput3 take care of it. They didn't use to do that, but that check was hoisted up to before the row processor call, so we wouldn't expose data from a corrupt message to user code. So I think we can undo that too. 92785dac2 only directly hacked getRowDescriptions and getAnotherTuple, but it looks like similar error handling was stuck into getParamDescriptions later. regards, tom lane
Re: fdatasync performance problem with large number of DB files
On Thu, Mar 11, 2021 at 11:38 AM Thomas Munro wrote: > On Thu, Mar 11, 2021 at 11:01 AM Michael Brown > wrote: > > * is there a knob missing we can configure? > > * can we get an opt-in knob to use a single sync() call instead of a > > recursive fsync()? > > * would you be open to merging a patch providing said knob? > > * is there something else we missed? > > As discussed on that other thread, I don't think sync() is an option > (it doesn't wait on all OSes or in the standard and it doesn't report > errors). syncfs() on Linux 5.8+ looks like a good candidate though, > and I think we'd consider a patch like that. I mean, I even posted > one[1] in that other thread. There will of course be cases where > that's slower (small database sharing filesystem with other software > that has a lot of dirty data to write back). Thinking about this some more, if you were to propose a patch like that syncfs() one but make it a configurable option, I'd personally be in favour of trying to squeeze it into v14. Others might object on commitfest procedural grounds, I dunno, but I think this is a real operational issue and that's a fairly simple and localised change. I've run into a couple of users who have just commented that recursive fsync() code out! I'd probably make it an enum-style GUC, because I intend to do some more work on my "precise" alternative, though not in time for this release, and it could just as well be an option too.
Re: fdatasync performance problem with large number of DB files
Thomas Munro writes: > Thinking about this some more, if you were to propose a patch like > that syncfs() one but make it a configurable option, I'd personally be > in favour of trying to squeeze it into v14. Others might object on > commitfest procedural grounds, I dunno, but I think this is a real > operational issue and that's a fairly simple and localised change. > I've run into a couple of users who have just commented that recursive > fsync() code out! I'm a little skeptical about the "simple" part. At minimum, you'd have to syncfs() each tablespace, since we have no easy way to tell which of them are on different filesystems. (Although, if we're presuming this is Linux-only, we might be able to tell with some unportable check or other.) regards, tom lane
Re: fdatasync performance problem with large number of DB files
On Thu, Mar 11, 2021 at 1:16 PM Tom Lane wrote: > Thomas Munro writes: > > Thinking about this some more, if you were to propose a patch like > > that syncfs() one but make it a configurable option, I'd personally be > > in favour of trying to squeeze it into v14. Others might object on > > commitfest procedural grounds, I dunno, but I think this is a real > > operational issue and that's a fairly simple and localised change. > > I've run into a couple of users who have just commented that recursive > > fsync() code out! > > I'm a little skeptical about the "simple" part. At minimum, you'd > have to syncfs() each tablespace, since we have no easy way to tell > which of them are on different filesystems. (Although, if we're > presuming this is Linux-only, we might be able to tell with some > unportable check or other.) Right, the patch knows about that: +/* + * On Linux, we don't have to open every single file one by one. We can + * use syncfs() to sync whole filesystems. We only expect filesystem + * boundaries to exist where we tolerate symlinks, namely pg_wal and the + * tablespaces, so we call syncfs() for each of those directories. + */
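For reference, the core of a syncfs()-based approach is small -- roughly the following, assuming a configure test defines something like HAVE_SYNCFS on Linux 5.8+. The helper name and error handling here are illustrative, not the exact patch.

#include "postgres.h"

#include <fcntl.h>
#include <unistd.h>

#ifdef HAVE_SYNCFS
/* Hypothetical helper: flush the whole filesystem containing "path". */
static void
do_syncfs(const char *path)
{
	int			fd = open(path, O_RDONLY);

	if (fd < 0)
	{
		elog(LOG, "could not open file \"%s\": %m", path);
		return;
	}
	if (syncfs(fd) < 0)
		elog(LOG, "could not synchronize file system for file \"%s\": %m", path);
	close(fd);
}
#endif

The caller would invoke this once for the data directory, once for pg_wal, and once per tablespace link, as the quoted comment above describes, instead of fsync'ing hundreds of thousands of files individually.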
Re: About to add WAL write/fsync statistics to pg_stat_wal view
On 2021-03-10 17:08, Fujii Masao wrote: On 2021/03/10 14:11, Masahiro Ikeda wrote: On 2021-03-09 17:51, Fujii Masao wrote: On 2021/03/05 8:38, Masahiro Ikeda wrote: On 2021-03-05 01:02, Fujii Masao wrote: On 2021/03/04 16:14, Masahiro Ikeda wrote: On 2021-03-03 20:27, Masahiro Ikeda wrote: On 2021-03-03 16:30, Fujii Masao wrote: On 2021/03/03 14:33, Masahiro Ikeda wrote: On 2021-02-24 16:14, Fujii Masao wrote: On 2021/02/15 11:59, Masahiro Ikeda wrote: On 2021-02-10 00:51, David G. Johnston wrote: On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda wrote: I pgindented the patches. ... XLogWrite, which is invoked during an XLogFlush request (see ...). This is also incremented by the WAL receiver during replication. ("which normally called" should be "which is normally called" or "which normally is called" if you want to keep true to the original) You missed the adding the space before an opening parenthesis here and elsewhere (probably copy-paste) is ether -> is either "This parameter is off by default as it will repeatedly query the operating system..." ", because" -> "as" Thanks, I fixed them. wal_write_time and the sync items also need the note: "This is also incremented by the WAL receiver during replication." I skipped changing it since I separated the stats for the WAL receiver in pg_stat_wal_receiver. "The number of times it happened..." -> " (the tally of this event is reported in wal_buffers_full in) This is undesirable because ..." Thanks, I fixed it. I notice that the patch for WAL receiver doesn't require explicitly computing the sync statistics but does require computing the write statistics. This is because of the presence of issue_xlog_fsync but absence of an equivalent pg_xlog_pwrite. Additionally, I observe that the XLogWrite code path calls pgstat_report_wait_*() while the WAL receiver path does not. It seems technically straight-forward to refactor here to avoid the almost-duplicated logic in the two places, though I suspect there may be a trade-off for not adding another function call to the stack given the importance of WAL processing (though that seems marginalized compared to the cost of actually writing the WAL). Or, as Fujii noted, go the other way and don't have any shared code between the two but instead implement the WAL receiver one to use pg_stat_wal_receiver instead. In either case, this half-and-half implementation seems undesirable. OK, as Fujii-san mentioned, I separated the WAL receiver stats. (v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch) Thanks for updating the patches! I added the infrastructure code to communicate the WAL receiver stats messages between the WAL receiver and the stats collector, and the stats for WAL receiver is counted in pg_stat_wal_receiver. What do you think? On second thought, this idea seems not good. Because those stats are collected between multiple walreceivers, but other values in pg_stat_wal_receiver is only related to the walreceiver process running at that moment. IOW, it seems strange that some values show dynamic stats and the others show collected stats, even though they are in the same view pg_stat_wal_receiver. Thought? OK, I fixed it. The stats collected in the WAL receiver is exposed in pg_stat_wal view in v11 patch. Thanks for updating the patches! I'm now reading 001 patch. 
+ /* Check whether the WAL file was synced to disk right now */ + if (enableFsync && + (sync_method == SYNC_METHOD_FSYNC || + sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH || + sync_method == SYNC_METHOD_FDATASYNC)) + { Isn't it better to make issue_xlog_fsync() return immediately if enableFsync is off, sync_method is open_sync or open_data_sync, to simplify the code more? Thanks for the comments. I added the above code in v12 patch. + /* + * Send WAL statistics only if WalWriterDelay has elapsed to minimize + * the overhead in WAL-writing. + */ + if (rc & WL_TIMEOUT) + pgstat_send_wal(); On second thought, this change means that it always takes wal_writer_delay before walwriter's WAL stats is sent after XLogBackgroundFlush() is called. For example, if wal_writer_delay is set to several seconds, some values in pg_stat_wal would be not up-to-date meaninglessly for those seconds. So I'm thinking to withdraw my previous comment and it's ok to send the stats every after XLogBackgroundFlush() is called. Thought? Thanks, I didn't notice that. Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's default value is 200msec and it may be set shorter time. Yeah, if wal_writer_delay is set to very small value, there is a risk that the WAL stats are sent too frequently. I agree that's a problem. Why don't to make another way to check the timestamp? + /* + * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL +
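For what it's worth, a rough sketch of the early-return shape suggested above (only the guard is shown; the existing sync_method constants and the rest of issue_xlog_fsync() are assumed from xlog.c):

void
issue_xlog_fsync(int fd, XLogSegNo segno)
{
	/*
	 * Sketch of the suggested structure: return immediately when no
	 * explicit flush is needed, instead of wrapping the statistics code
	 * in a compound condition.
	 */
	if (!enableFsync ||
		sync_method == SYNC_METHOD_OPEN ||
		sync_method == SYNC_METHOD_OPEN_DSYNC)
		return;

	/* ... existing fsync/fdatasync calls and WAL sync statistics ... */
}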
Re: Huge memory consumption on partitioned table with FKs
On Thu, Mar 11, 2021 at 4:25 AM Tom Lane wrote: > Amit Langote writes: > > On Wed, Mar 10, 2021 at 8:37 AM Tom Lane wrote: > >> Hmm. So, the key point is that the values coming from the partitioned > >> child table are injected into the test query as parameters, not as > >> column references, thus it doesn't matter *to the test query* what > >> numbers the referencing columns have in that child. We just have to > >> be sure we pass the right parameter values. > > > Right. > > I did some cosmetic fooling with this (mostly, rewriting the comments > YA time) and pushed it. Perfect. Thanks for your time on this. -- Amit Langote EDB: http://www.enterprisedb.com
Re: [HACKERS] Custom compression methods
On Wed, Mar 10, 2021 at 03:50:48PM -0500, Robert Haas wrote: > On Wed, Mar 10, 2021 at 6:52 AM Dilip Kumar wrote: > > The pending comment is providing a way to rewrite a table and > > re-compress the data with the current compression method. > > I spent some time poking at this yesterday and couldn't figure out > what was going on here. There are two places where we rewrite tables. > One is the stuff in cluster.c, which handles VACUUM FULL and CLUSTER. > That eventually calls reform_and_rewrite_tuple(), which deforms the > old tuple and creates a new one, but it doesn't seem like there's > anything in there that would expand toasted values, whether external > or inline compressed. But I think that can't be right, because it > seems like then you'd end up with toast pointers into the old TOAST > relation, not the new one, which would cause failures later. So I must > be missing something here. I did this the empirical way. postgres=# CREATE TABLE t (a text compression lz4); postgres=# INSERT INTO t SELECT repeat('a',9); postgres=# ALTER TABLE t ALTER a SET COMPRESSION pglz; postgres=# VACUUM FULL t; postgres=# SELECT pg_column_compression(a) FROM t; pg_column_compression | lz4 Actually, a --without-lz4 build can *also* VACUUM FULL the lz4 table. It looks like VACUUM FULL t processes t but not its toast table (which is strange to me, since VACUUM processes both, but then autovacuum also processes them independently). -- Justin
Re: Improve join selectivity estimation using extended statistics
Hi Konstantin, Thanks for working on this! Using extended statistics to improve join cardinality estimates was definitely on my radar, and this patch seems like a good start. I had two basic ideas about how we might improve join estimates: (a) use per-table extended statistics to estimate join conditions (b) invent multi-table extended statistics (requires inventing how to sample the tables in a coordinated way, etc.) This patch aims to do (a) which is perfectly reasonable - I think we can achieve significant improvements this way. I have some ideas about (b), but it seems harder and for a separate thread/patch. The patch includes some *very* interesting ideas, but I think it's does them too late and at the wrong level of abstraction. I mean that: 1) I don't think the code in clausesel.c should deal with extended statistics directly - it requires far too much knowledge about different types of extended stats, what clauses are supported by them, etc. Allowing stats on expressions will make this even worse. Better do that in extended_stats.c, like statext_clauselist_selectivity. 2) in clauselist_selectivity_ext, functional dependencies are applied in the part that processes remaining clauses, not estimated using extended statistics. That seems a bit confusing, and I suspect it may lead to issues - for example, it only processes the clauses incrementally, in a particular order. That probably affects the result, because it affects which functional dependencies we can apply. In the example query that's not an issue, because it only has two Vars, so it either can't apply anything (with one Var) or it can apply everything (with two Vars). But with 3 or more Vars the order would certainly matter, so it's problematic. Moreover, it seems a bit strange that this considers dependencies only on the inner relation. Can't that lead to issues with different join orders producing different cardinality estimates? I think a better approach would be to either modify the existing block dealing with extended stats for a single relation to also handle join conditions. Or perhaps we should invent a separate block, dealing with *pairs* of relations? And it should deal with *all* join clauses for that pair of relations at once, not one by one. As for the exact implementation, I'd imagine we call overall logic to be something like (for clauses on two joined relations): - pick a subset of clauses with the same type of extended statistics on both sides (MCV, ndistinct, ...), repeat until we can't apply more statistics - estimate remaining clauses either using functional dependencies or in the regular (old) way As for how to use other types of extended statistics, I think eqjoinsel could serve as an inspiration. We should look for an MCV list and ndistinct stats on both sides of the join (possibly on some subset of clauses), and then do the same thing eqjoinsel does, just with multiple columns. Note: I'm not sure what to do when we find the stats only on one side. Perhaps we should assume the other side does not have correlations and use per-column statistics (seems reasonable), or maybe just not apply anything (seems a bit too harsh). Anyway, if there are some non-estimated clauses, we could try applying functional dependencies similarly to what this patch does. It's also consistent with statext_clauselist_selectivity - that also tries to apply MCV lists first, and only then we try functional dependencies. BTW, should this still rely on oprrest (e.g. F_EQSEL). 
That's the selectivity function for restriction (non-join) clauses, so maybe we should be looking at oprjoin when dealing with joins? Not sure. One bit that I find *very* interesting is the calc_joinrel_size_estimate part, with this comment: /* * Try to take in account functional dependencies between attributes * of clauses pushed-down to joined relations and retstrictlist * clause. Right now we consider only case of restrictlist consists of * one clause. */ If I understand the comment and the code after it, it essentially tries to apply extended statistics from both the join clauses and filters at the relation level. That is, with a query like SELECT * FROM t1 JOIN t2 ON (t1.a = t2.a) WHERE t1.b = 10 we would be looking at statistics on t1(a,b), because we're interested in estimating conditional probability distribution P(t1.a = a? | t1.b = 10) I think that's extremely interesting and powerful, because it allows us to "restrict" the multi-column MCV lists, we could probably estimate number of distinct "a" values in rows with "b=10" like: ndistinct(a,b) / ndistinct(b) and do various interesting stuff like this. That will require some improvements to the extended statistics code (to allow passing a list of conditions), but that's quite doable. I think the code actually did something like that originally ;-) Obviously, none of this is achievable for PG14, as we're in the middle of the last CF. But if you're interest
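For the example query above, the per-table extended statistics that approach (a) would consume could be declared as below (the statistics name is arbitrary); the planner could then derive things like ndistinct(a,b) / ndistinct(b) from them:

-- Illustrative only: statistics on t1(a,b) for the example join above.
CREATE STATISTICS t1_a_b (ndistinct, dependencies, mcv) ON a, b FROM t1;
ANALYZE t1;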
Re: shared-memory based stats collector
At Wed, 10 Mar 2021 21:47:51 +0900, Fujii Masao wrote in > Attached is the updated version of the 0003 patch. Barring any > objection, > I will commit this patch. > > > -#include "storage/latch.h" > -#include "storage/proc.h" > > I removed these because they are no longer necessary. Mmm. Sorry for the garbage. > logical replication worker, > parallel worker, background > writer, > client backend, checkpointer, > + archiver, > startup, walreceiver, > walsender and walwriter. > > In the document about pg_stat_activity, possible values in > backend_type > column are all listed. I added "archiver" into the list. > > BTW, those values were originally listed in alphabetical order, > but ISTM that they were gotten out of order at a certain moment. > So they should be listed in alphabetical order again. This should > be implemented as a separate patch. Thanks for adding it. They are also mildly sorted by function or characteristics. I'm not sure which is better, but anyway they should be the order based on a clear criteria. > -PgArchData *PgArch = NULL; > +static PgArchData *PgArch = NULL; > > I marked PgArchData as static because it's used only in pgarch.c. Right. > - ShmemInitStruct("Archiver ", PgArchShmemSize(), &found); > + ShmemInitStruct("Archiver Data", PgArchShmemSize(), &found); > > I found that the original shmem name ended with unnecessary space > character. > I replaced it with "Archiver Data". Oops. The trailing space is where I stopped writing the string and try to find a better word, then in the meanwhile, my mind have been attracted to something else and left. I don't object to "Archiver Data". Thanks for completing it. > In reaper(), I moved the code block for archiver to the original > location. Agreed. > I ran pgindent for the files that the patch modifies. Yeah, I forgot to add the new struct into typedefs.list. I intentionally omitted clearing newly-acquired shared memory but it doesn't matter to do that. So, I'm fine with it. Thanks for taking this. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pgbench - add pseudo-random permutation function
On Mon, Mar 8, 2021 at 11:50:43AM +0100, Fabien COELHO wrote: > > > > What are your current thoughts? > > > > Thanks for prodding. I still think it's a useful feature. However I > > don't think I'll have time to get it done on the current commitfest. > > I suggest to let it sit in the commitfest to see if somebody else will > > pick it up -- and if not, we move it to the next one, with apologies to > > author and reviewers. > > > > I may have time to become familiar or at least semi-comfortable with all > > that weird math in it by then. > > Yep. > > Generating parametric good-quality low-cost (but not > cryptographically-secure) pseudo-random permutations on arbitrary sizes (not > just power-of-two sizes) is not a trivial task; I had to be quite creative > to achieve it, hence the "weird" maths. I had a lot of bad > not-really-working ideas before the current status of the patch. > > The code could be simplified if we assume that PG_INT128_TYPE will be > available on all relevant architectures, and accept that the feature will not be > available if not. Maybe Dean Rasheed can help because of his math background --- CC'ing him. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: fdatasync performance problem with large number of DB files
On 2021/03/11 8:30, Thomas Munro wrote: On Thu, Mar 11, 2021 at 11:38 AM Thomas Munro wrote: On Thu, Mar 11, 2021 at 11:01 AM Michael Brown wrote: * is there a knob missing we can configure? * can we get an opt-in knob to use a single sync() call instead of a recursive fsync()? * would you be open to merging a patch providing said knob? * is there something else we missed? As discussed on that other thread, I don't think sync() is an option (it doesn't wait on all OSes or in the standard and it doesn't report errors). syncfs() on Linux 5.8+ looks like a good candidate though, and I think we'd consider a patch like that. I mean, I even posted one[1] in that other thread. There will of course be cases where that's slower (small database sharing filesystem with other software that has a lot of dirty data to write back). Thinking about this some more, if you were to propose a patch like that syncfs() one but make it a configurable option, I'd personally be in favour of trying to squeeze it into v14. Others might object on commitfest procedural grounds, I dunno, but I think this is a real operational issue and that's a fairly simple and localised change. +1 to push this kind of change into v14!! I've run into a couple of users who have just commented that recursive fsync() code out! BTW, we can skip that recursive fsync() by disabling fsync GUC even without commenting out the code? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: Boundary value check in lazy_tid_reaped()
On Wed, Mar 10, 2021 at 11:53 PM Peter Eisentraut wrote: > > On 10.03.21 02:29, Masahiko Sawada wrote: > >> There is no noticeable effect of inlining lazy_tid_reaped(). So it > >> would be better to not do that. > > > > Attached the patch that doesn't inline lazy_tid_reaped(). > > committed Thank you! Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
RE: Enhance traceability of wal_level changes for backup management
On Tuesday, March 9, 2021 11:13 PM David Steele > On 3/7/21 9:45 PM, osumi.takami...@fujitsu.com wrote: > > On Sun, Mar 7, 2021 3:48 AM Peter Eisentraut > wrote: > >> On 28.01.21 01:44, osumi.takami...@fujitsu.com wrote: > (1) writing the time or LSN in the control file to indicate > when/where wal_level is changed to 'minimal' > from upper level to invalidate the old backups or make alerts to users. > >>> I attached the first patch which implementes this idea. > >>> It was aligned by pgindent and shows no regression. > >> > >> It's not clear to me what this is supposed to accomplish. I read the > >> thread, but it's still not clear. > >> What is one supposed to do with this information? > > OK. The basic idea is to enable backup management tools to recognize > > wal_level drop between *snapshots*. > > When you have a snapshot of the cluster at one time and another one at > > different time, with this new parameter, you can see if anything that > > causes discontinuity from the drop happens in the middle of the two > > snapshots without efforts to have a look at the WALs in between. > > As a backup software author, I don't see this feature as very useful. > > The problem is that there are lots of ways for WAL to go missing so > monitoring the WAL archive for gaps is essential and this feature would not > replace that requirement. The only extra information you'd get is the ability > to > classify the most recent gap as "intentional", maybe. > > So, -1 from me. Thanks for giving me a meaningful viewpoint. Let me sleep on it, about whether the new param doesn't help in all cases or not. Best Regards, Takamichi Osumi
Re: Huge memory consumption on partitioned table with FKs
Hi hackers, > > > > I did some cosmetic fooling with this (mostly, rewriting the comments > > YA time) and pushed it. > > Perfect. Thanks for your time on this. > Thank you for your help! I'm glad it's solved. -- Keisuke Kuroda NTT Software Innovation Center keisuke.kuroda.3...@gmail.com
Re: Removing vacuum_cleanup_index_scale_factor
On Tue, Mar 9, 2021 at 7:42 PM Peter Geoghegan wrote: > My current plan is to commit everything within the next day or two. > This includes backpatching to Postgres 13 only. Pushed, thanks. -- Peter Geoghegan
Re: libpq debug log
At Wed, 10 Mar 2021 10:31:27 -0300, "'alvhe...@alvh.no-ip.org'" wrote in > On 2021-Mar-10, iwata@fujitsu.com wrote: > > > Hi all, > > > > Following all reviewer's advice, I have created a new patch. > > > > In this patch, I add only two tracing entry points; I call > > pqTraceOutputMsg(PGconn conn, int msgCursor, PGCommSource commsource) in > > pqParseInput3 () and pqPutMsgEnd () to output log. > > The argument contains message first byte offset called msgCursor because it > > is simpler to specify the buffer pointer in the caller. > > > > In pqTraceOutputMsg(), the content common to all protocol messages (first > > timestamp, < or >, first 1 byte, and length) are output first and after > > that each protocol message contents is output. I referred to pqParseInput3 > > () , fe-exec.c and documentation page for that part. > > > > This fix almost eliminates if(conn->Pfdebug) that was built into the code > > for other features. > > This certainly looks much better! Thanks for getting it done so > quickly. > > I didn't review the code super closely. I do see a compiler warning: +1 to the thanks for the quick work. I have some random comments after a quick look at it. The individual per-type output functions are designed to match the corresponding pqGet*() functions. On the other hand, now that we read the message bytes knowing the exact format in advance, as you did in pqTraceOutputMsg(), the complexity that exists in those functions can be eliminated by separating them into more type-specific output functions. For example, pqTraceOutputInt() becomes far simpler if split into pqTraceOutputInt16() and -32(). The output functions copy message bytes into a local variable, but the same effect can be obtained by just casting via a typed pointer. uint32 tmp4; .. memcpy(&tmp4, buf + *cursor, 4); result = (int) pg_ntoh32(tmp4); can be written as result = pg_ntoh32(* (uint32 *) (buf + *cursor)); I think we can get rid of copying in the output functions for String and Bytes in different ways. Now that we can manage our own reading pointer within pqTraceOutputMsg(), the per-type output functions can work only on the reading pointer instead of the buffer pointer, cursor, and length. *I*'d want to make the output functions move the reading pointer by themselves. pqTraceOutputMsg() can be simplified as follows by doing this. An end-of-message pointer may be useful to check for read overruns on the message buffer. switch (id) { case 'R': pqTraceOutputInt32(&p, endptr, conn->Pfdebug); fputc('\n', conn->Pfdebug); break; case 'K': pqTraceOutputInt32(&p, endptr, conn->Pfdebug); pqTraceOutputInt32(&p, endptr, conn->Pfdebug); ... + char *prefix = commsource == MSGDIR_FROM_BACKEND ? "<" : ">"; + int LogEnd = commsource == MSGDIR_FROM_BACKEND ? conn->inCursor : conn->outMsgEnd; + char*logBuffer = commsource == MSGDIR_FROM_BACKEND ? conn->inBuffer .. + if (commsource == MSGDIR_FROM_BACKEND) + id = logBuffer[LogCursor++]; Repeated use of the ternary operator on the same condition makes the code hard to read. It is simpler to consolidate them into one if-else statement. + result = (int) pg_ntoh32(result32); + if (result == NEGOTIATE_SSL_CODE) Maybe I'm missing something, but the above doesn't seem to be working. I didn't see such a message when making an SSL connection. + /* Protocol 2.0 does not support tracing. */ + if (PG_PROTOCOL_MAJOR(conn->pversion) < 3) + return; We could write that message out to the tracing file. +void +PQtraceSetFlags(PGconn *conn, int flags) +{ + if (conn == NULL) + return; + /* Protocol 2.0 is not supported. 
*/ + if (PG_PROTOCOL_MAJOR(conn->pversion) < 3) + return; + /* If PQtrace() failed, do nothing. */ + if (conn->Pfdebug == NULL) + return; + conn->traceFlags = flags; Pfdebug is always NULL for V2 protocol there, so it can be an assertion? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
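To illustrate, a rough sketch of such a pointer-advancing per-type output function (the name and the overrun check are only suggestions; memcpy() is used to avoid unaligned reads of the network-order integer):

static void
pqTraceOutputInt32(const char **p, const char *endptr, FILE *f)
{
	uint32		net;

	/* Refuse to read past the end of the message. */
	if (*p + 4 > endptr)
	{
		fprintf(f, " <message overrun>");
		return;
	}
	memcpy(&net, *p, 4);
	*p += 4;
	fprintf(f, " %d", (int) pg_ntoh32(net));
}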
Re: fdatasync performance problem with large number of DB files
On Thu, Mar 11, 2021 at 2:00 PM Fujii Masao wrote: > On 2021/03/11 8:30, Thomas Munro wrote: > > I've run into a couple of users who have just commented that recursive > > fsync() code out! > > BTW, we can skip that recursive fsync() by disabling fsync GUC even without > commenting out the code? Those users wanted fsync=on because they wanted to recover to a normal online system after a crash, but they believed that the preceding fsync of the data directory was useless, because replaying the WAL should be enough. IMHO they were nearly on the right track, and the prototype patch I linked earlier as [2] was my attempt to find the specific reasons why that doesn't work and fix them. So far, I figured out that you still have to remember to fsync the WAL files (otherwise you're replaying WAL that potentially hasn't reached the disk), and data files holding blocks that recovery decided to skip due to BLK_DONE (otherwise you might decide to skip replay because of a higher LSN that is on a page that is in the kernel's cache but not yet on disk).
Re: fdatasync performance problem with large number of DB files
Thomas Munro writes: > On Thu, Mar 11, 2021 at 1:16 PM Tom Lane wrote: >> I'm a little skeptical about the "simple" part. At minimum, you'd >> have to syncfs() each tablespace, since we have no easy way to tell >> which of them are on different filesystems. (Although, if we're >> presuming this is Linux-only, we might be able to tell with some >> unportable check or other.) > Right, the patch knows about that: I noticed that the syncfs man page present in RHEL8 seemed a little squishy on the critical question of error reporting. It promises that syncfs will wait for I/O completion, but it doesn't say in so many words that I/O errors will be reported (and the list of applicable errno codes is only EBADF, not very reassuring). Trolling the net, I found a newer-looking version of the man page, and behold it says In mainline kernel versions prior to 5.8, syncfs() will fail only when passed a bad file descriptor (EBADF). Since Linux 5.8, syncfs() will also report an error if one or more inodes failed to be written back since the last syncfs() call. So this means that in less-than-bleeding-edge kernels, syncfs can only be regarded as a dangerous toy. If we expose an option to use it, there had better be large blinking warnings in the docs. regards, tom lane
Re: fdatasync performance problem with large number of DB files
On Thu, Mar 11, 2021 at 2:25 PM Tom Lane wrote: > Trolling the net, I found a newer-looking version of the man page, > and behold it says > >In mainline kernel versions prior to 5.8, syncfs() will fail only >when passed a bad file descriptor (EBADF). Since Linux 5.8, >syncfs() will also report an error if one or more inodes failed >to be written back since the last syncfs() call. > > So this means that in less-than-bleeding-edge kernels, syncfs can > only be regarded as a dangerous toy. If we expose an option to use > it, there had better be large blinking warnings in the docs. Agreed. Perhaps we could also try to do something programmatic about that. Its fsync() was also pretty rough for the first 28 years.
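One possible programmatic guard, purely as an illustration (assumes Linux and a "major.minor" uname release string; the function name is made up):

#include <stdbool.h>
#include <stdio.h>
#include <sys/utsname.h>

/* Sketch: true if the running kernel's syncfs() reports write-back errors. */
static bool
syncfs_reports_errors(void)
{
	struct utsname u;
	int			major,
				minor;

	if (uname(&u) != 0 || sscanf(u.release, "%d.%d", &major, &minor) != 2)
		return false;			/* be conservative if we cannot tell */
	return major > 5 || (major == 5 && minor >= 8);
}

int
main(void)
{
	if (!syncfs_reports_errors())
		fprintf(stderr, "warning: syncfs() may not report I/O errors on this kernel\n");
	return 0;
}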
Re: 64-bit XIDs in deleted nbtree pages
On Mon, Mar 1, 2021 at 7:25 PM Peter Geoghegan wrote: > Attached is v8 of the patch series, which has new patches. No real > changes compared to v7 for the first patch, though. Here is another bitrot-fix-only revision, v9. Just the recycling patch again. I'll commit this when we get your patch committed. Still haven't decided on exactly how more aggressive we should be. For example the use of the heap relation within _bt_newly_deleted_pages_recycle() might have unintended consequences for recycling efficiency with some workloads, since it doesn't agree with _bt_getbuf() (it is still "more ambitious" than _bt_getbuf(), at least for now). -- Peter Geoghegan v9-0001-Recycle-pages-deleted-during-same-VACUUM.patch Description: Binary data
Re: libpq debug log
Kyotaro Horiguchi writes: > Maybe I'm missing something, but the above doesn't seem working. I > didn't see such message when making a SSL connection. As far as that goes, I don't see any point in trying to trace connection-related messages, because there is no way to enable tracing on a connection before it's created. regards, tom lane
RE: libpq debug log
Alvaro-san, Tom-san, Horiguchi-san, From: Kyotaro Horiguchi > +1 to the thanks for the quick work. I have some random comments > after a quick look at it. Thank you very much for the many comments. And we're sorry to have caused you trouble. I told Iwata-san yesterday to modify the patch to use the logging function interface that Tom-san, Alvaro-san and I agreed on, planning to review it after the new revision was posted and to let others know she is modifying the patch again so that they won't start reviewing. But I should have made that request on -hackers. We're catching up with your feedback and will post the updated patch. Then I'll review it. We appreciate your help so much. Regards Takayuki Tsunakawa
Re: Removing vacuum_cleanup_index_scale_factor
On Thu, Mar 11, 2021 at 10:12 AM Peter Geoghegan wrote: > > On Tue, Mar 9, 2021 at 7:42 PM Peter Geoghegan wrote: > > My current plan is to commit everything within the next day or two. > > This includes backpatching to Postgres 13 only. > > Pushed, thanks. Great! Thank you! Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: Replace buffer I/O locks with condition variables (reviving an old patch)
On Thu, Mar 11, 2021 at 10:48:40AM +1300, Thomas Munro wrote: > On Tue, Mar 9, 2021 at 6:24 PM Julien Rouhaud wrote: > > > > +1 for adding the cv into BufferDesc. That brings the struct size to > > exactly > > 64 bytes on x86 64 bits architecture. This won't add any extra overhead to > > LOCK_DEBUG cases, as it was already exceeding the 64B threshold, if that > > even > > was a concern. > > I also checked that it's 64B on an Arm box. Not sure about POWER. > But... despite the fact that it looks like a good change in isolation, > I decided to go back to the separate array in this initial commit, > because the AIO branch also wants to add a new BufferDesc member[1]. Ok! > I may come back to that change, if we can make some more space (seems > entirely doable, but I'd like to look into that separately). - /* -* It would be nice to include the I/O locks in the BufferDesc, but that -* would increase the size of a BufferDesc to more than one cache line, -* and benchmarking has shown that keeping every BufferDesc aligned on a -* cache line boundary is important for performance. So, instead, the -* array of I/O locks is allocated in a separate tranche. Because those -* locks are not highly contended, we lay out the array with minimal -* padding. -*/ - size = add_size(size, mul_size(NBuffers, sizeof(LWLockMinimallyPadded))); + /* size of I/O condition variables */ + size = add_size(size, mul_size(NBuffers, + sizeof(ConditionVariableMinimallyPadded))); Should we keep for now some similar comment mentionning why we don't put the cv in the BufferDesc even though it would currently fit the 64B target size? > Thanks for the review. And of course to Robert for writing the patch. > Pushed. Great!
Re: [HACKERS] Custom compression methods
This includes a patch to use pkgconfig, in an attempt to build on mac, which currently fails like: https://cirrus-ci.com/task/5993712963551232?command=build#L126 checking for LZ4_compress in -llz4... no configure: error: library 'lz4' is required for LZ4 support -- Justin >From 601ed9e2966b29e1603fd07f69b8068404753810 Mon Sep 17 00:00:00 2001 From: Dilip Kumar Date: Fri, 5 Mar 2021 09:33:08 +0530 Subject: [PATCH 1/5] Built-in compression method Add syntax allowing a compression method to be specified. As of now there is only 2 option for build-in compression method (pglz, lz4) which can be set while creating a table or adding a new column. No option for altering the compression method for an existing column. Dilip Kumar based on the patches from Ildus Kurbangaliev. Design input from Robert Haas and Tomas Vondra. Reviewed by Robert Haas, Tomas Vondra, Alexander Korotkov and Justin Pryzby Discussions: https://www.postgresql.org/message-id/20171213151818.75a20...@postgrespro.ru https://www.postgresql.org/message-id/CA%2BTgmoaKDW1Oi9V%3Djc9hOGyf77NbkNEABuqgHD1Cq%3D%3D1QsOcxg%40mail.gmail.com https://www.postgresql.org/message-id/CA%2BTgmobSDVgUage9qQ5P_%3DF_9jaMkCgyKxUQGtFQU7oN4kX-AA%40mail.gmail.com https://www.postgresql.org/message-id/20201005160355.byp74sh3ejsv7wrj%40development https://www.postgresql.org/message-id/CAFiTN-tzTTT2oqWdRGLv1dvvS5MC1W%2BLE%2B3bqWPJUZj4GnHOJg%40mail.gmail.com --- configure | 118 +++ configure.ac | 18 + doc/src/sgml/catalogs.sgml| 10 + doc/src/sgml/ref/create_table.sgml| 33 +- doc/src/sgml/ref/psql-ref.sgml| 11 + src/backend/access/brin/brin_tuple.c | 5 +- src/backend/access/common/Makefile| 1 + src/backend/access/common/detoast.c | 71 ++-- src/backend/access/common/indextuple.c| 3 +- src/backend/access/common/toast_compression.c | 308 ++ src/backend/access/common/toast_internals.c | 54 ++- src/backend/access/common/tupdesc.c | 8 + src/backend/access/table/toast_helper.c | 5 +- src/backend/bootstrap/bootstrap.c | 5 + src/backend/catalog/genbki.pl | 3 + src/backend/catalog/heap.c| 4 + src/backend/catalog/index.c | 1 + src/backend/catalog/toasting.c| 6 + src/backend/commands/tablecmds.c | 110 +++ src/backend/nodes/copyfuncs.c | 1 + src/backend/nodes/equalfuncs.c| 1 + src/backend/nodes/nodeFuncs.c | 2 + src/backend/nodes/outfuncs.c | 1 + src/backend/parser/gram.y | 26 +- src/backend/parser/parse_utilcmd.c| 9 + src/backend/utils/adt/varlena.c | 41 +++ src/bin/pg_dump/pg_backup.h | 1 + src/bin/pg_dump/pg_dump.c | 39 +++ src/bin/pg_dump/pg_dump.h | 1 + src/bin/pg_dump/t/002_pg_dump.pl | 12 +- src/bin/psql/describe.c | 31 +- src/bin/psql/help.c | 2 + src/bin/psql/settings.h | 1 + src/bin/psql/startup.c| 10 + src/include/access/detoast.h | 8 + src/include/access/toast_compression.h| 119 +++ src/include/access/toast_helper.h | 1 + src/include/access/toast_internals.h | 20 +- src/include/catalog/pg_attribute.h| 8 +- src/include/catalog/pg_proc.dat | 4 + src/include/nodes/parsenodes.h| 2 + src/include/parser/kwlist.h | 1 + src/include/pg_config.h.in| 3 + src/include/postgres.h| 14 +- src/test/regress/expected/compression.out | 247 ++ src/test/regress/expected/compression_1.out | 240 ++ src/test/regress/parallel_schedule| 2 +- src/test/regress/pg_regress_main.c| 4 +- src/test/regress/serial_schedule | 1 + src/test/regress/sql/compression.sql | 102 ++ src/tools/msvc/Solution.pm| 1 + src/tools/pgindent/typedefs.list | 1 + 52 files changed, 1645 insertions(+), 85 deletions(-) create mode 100644 src/backend/access/common/toast_compression.c create mode 100644 
src/include/access/toast_compression.h create mode 100644 src/test/regress/expected/compression.out create mode 100644 src/test/regress/expected/compression_1.out create mode 100644 src/test/regress/sql/compression.sql diff --git a/configure b/configure index fad817bb38..761a27965d 100755 --- a/configure +++ b/configure @@ -699,6 +699,7 @@ with_gnu_ld LD LDFLAGS_SL LDFLAGS_EX +with_lz4 with_zlib with_system_tzdata with_libxslt @@ -864,6 +865,7 @@ with_libxml with_libxslt
RE: Parallel INSERT (INTO ... SELECT ...)
> > 2. Should we keep the default value of GUC to on or off? It is > > currently off. I am fine keeping it off for this release and we can > > always turn it on in the later releases if required. Having said that, > > I see the value in keeping it on because in many cases Insert ... > > Select will be used for large data and there we will see a benefit of > > parallelism and users facing trouble (who have a very large number of > > partitions with less data to query) can still disable the parallel > > insert for that particular table. Also, the other benefit of keeping > > it on till at least the beta period is that this functionality will > > get tested and if we found reports of regression then we can turn it > > off for this release as well. > > > > Thoughts? > > IMHO, we should keep it on because in most of the cases it is going to give a > benefit to the user, and in some specific scenario where a table has a > lot of > partitions it can be turned off using the reloption. And, if for some users > most > of the tables are like that where they always have a lot of partitions then > the user > can turn it off using the GUC. > I think your suggestion makes sense. If no one has other opinions, I will post a new version that sets the default enable_parallel_insert=on soon. Best regards, houzj
Re: [HACKERS] Custom compression methods
On Thu, Mar 11, 2021 at 2:21 AM Robert Haas wrote: > > On Wed, Mar 10, 2021 at 6:52 AM Dilip Kumar wrote: > > > The pending comment is providing a way to rewrite a table and > > > re-compress the data with the current compression method. > > > > I spent some time poking at this yesterday and couldn't figure out > > what was going on here. There are two places where we rewrite tables. > > One is the stuff in cluster.c, which handles VACUUM FULL and CLUSTER. > > That eventually calls reform_and_rewrite_tuple(), which deforms the > > old tuple and creates a new one, but it doesn't seem like there's > > anything in there that would expand toasted values, whether external > > or inline compressed. But I think that can't be right, because it > > seems like then you'd end up with toast pointers into the old TOAST > > relation, not the new one, which would cause failures later. So I must > > be missing something here. The other place where we rewrite tables is > > in ATRewriteTable() as part of the ALTER TABLE machinery. I don't see > > anything there to force detoasting either. We do expand the external values; see the call stack below. #0 detoast_external_attr (attr=0x2d61a80) at detoast.c:50 #1 0x0055bd53 in toast_tuple_init #2 0x0050554d in heap_toast_insert_or_update #3 0x0050ad5b in raw_heap_insert #4 0x0050a9a1 in rewrite_heap_tuple #5 0x00502325 in reform_and_rewrite_tuple #6 0x004ff47c in heapam_relation_copy_for_cluster But that happens only if there are external attributes; we do nothing for the inline compressed values. raw_heap_insert only does that if HeapTupleHasExternal(tup) is true. So if we want to do something about inline compressed data, we will have to do it in reform_and_rewrite_tuple, because there we deform and re-form the tuple, so we have an opportunity to decompress any compressed data. > But, I am not really convinced that we need to solve this problem by > adding new ALTER TABLE syntax. I'd be happy enough if CLUSTER, VACUUM > FULL, and versions of ALTER TABLE that already force a rewrite would > cause the compression to be redone also. Okay, maybe for directly compressed data we can do that without affecting the performance of unrelated paths but I am again worried about the composite type. Basically, if we have some row type then we have to deform the tuple stored in the row type and process it fully. That said, I think in the older version we already had patches at [1]; basically the first 3 patches will ensure that we will never have any compressed data in the composite type so we will not have to worry about those as well. Honestly, even if the user > had to fall back on creating a new table and doing INSERT INTO newtab > SELECT * FROM oldtab I would consider that to be not a total > showstopper for this .. assuming of course that it actually works. If > it doesn't, we have big problems. Even without the pg_am stuff, we > still need to make sure that we don't just blindly let compressed > values wander around everywhere. When we insert into a table column > with a compression method, we should recompress any data that is > compressed using some other method. Well, it used to work in the older version [1], so we can make it work here as well without much effort. [1] https://www.postgresql.org/message-id/CAFiTN-u%3D2-qaLTod3isQmXuSU0s0_bR%2BRcUQL-vSvH%3DMJbEd7Q%40mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: About to add WAL write/fsync statistics to pg_stat_wal view
On 2021/03/11 9:38, Masahiro Ikeda wrote: On 2021-03-10 17:08, Fujii Masao wrote: On 2021/03/10 14:11, Masahiro Ikeda wrote: On 2021-03-09 17:51, Fujii Masao wrote: On 2021/03/05 8:38, Masahiro Ikeda wrote: On 2021-03-05 01:02, Fujii Masao wrote: On 2021/03/04 16:14, Masahiro Ikeda wrote: On 2021-03-03 20:27, Masahiro Ikeda wrote: On 2021-03-03 16:30, Fujii Masao wrote: On 2021/03/03 14:33, Masahiro Ikeda wrote: On 2021-02-24 16:14, Fujii Masao wrote: On 2021/02/15 11:59, Masahiro Ikeda wrote: On 2021-02-10 00:51, David G. Johnston wrote: On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda wrote: I pgindented the patches. ... XLogWrite, which is invoked during an XLogFlush request (see ...). This is also incremented by the WAL receiver during replication. ("which normally called" should be "which is normally called" or "which normally is called" if you want to keep true to the original) You missed the adding the space before an opening parenthesis here and elsewhere (probably copy-paste) is ether -> is either "This parameter is off by default as it will repeatedly query the operating system..." ", because" -> "as" Thanks, I fixed them. wal_write_time and the sync items also need the note: "This is also incremented by the WAL receiver during replication." I skipped changing it since I separated the stats for the WAL receiver in pg_stat_wal_receiver. "The number of times it happened..." -> " (the tally of this event is reported in wal_buffers_full in) This is undesirable because ..." Thanks, I fixed it. I notice that the patch for WAL receiver doesn't require explicitly computing the sync statistics but does require computing the write statistics. This is because of the presence of issue_xlog_fsync but absence of an equivalent pg_xlog_pwrite. Additionally, I observe that the XLogWrite code path calls pgstat_report_wait_*() while the WAL receiver path does not. It seems technically straight-forward to refactor here to avoid the almost-duplicated logic in the two places, though I suspect there may be a trade-off for not adding another function call to the stack given the importance of WAL processing (though that seems marginalized compared to the cost of actually writing the WAL). Or, as Fujii noted, go the other way and don't have any shared code between the two but instead implement the WAL receiver one to use pg_stat_wal_receiver instead. In either case, this half-and-half implementation seems undesirable. OK, as Fujii-san mentioned, I separated the WAL receiver stats. (v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch) Thanks for updating the patches! I added the infrastructure code to communicate the WAL receiver stats messages between the WAL receiver and the stats collector, and the stats for WAL receiver is counted in pg_stat_wal_receiver. What do you think? On second thought, this idea seems not good. Because those stats are collected between multiple walreceivers, but other values in pg_stat_wal_receiver is only related to the walreceiver process running at that moment. IOW, it seems strange that some values show dynamic stats and the others show collected stats, even though they are in the same view pg_stat_wal_receiver. Thought? OK, I fixed it. The stats collected in the WAL receiver is exposed in pg_stat_wal view in v11 patch. Thanks for updating the patches! I'm now reading 001 patch. 
+ /* Check whether the WAL file was synced to disk right now */ + if (enableFsync && + (sync_method == SYNC_METHOD_FSYNC || + sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH || + sync_method == SYNC_METHOD_FDATASYNC)) + { Isn't it better to make issue_xlog_fsync() return immediately if enableFsync is off, sync_method is open_sync or open_data_sync, to simplify the code more? Thanks for the comments. I added the above code in v12 patch. + /* + * Send WAL statistics only if WalWriterDelay has elapsed to minimize + * the overhead in WAL-writing. + */ + if (rc & WL_TIMEOUT) + pgstat_send_wal(); On second thought, this change means that it always takes wal_writer_delay before walwriter's WAL stats is sent after XLogBackgroundFlush() is called. For example, if wal_writer_delay is set to several seconds, some values in pg_stat_wal would be not up-to-date meaninglessly for those seconds. So I'm thinking to withdraw my previous comment and it's ok to send the stats every after XLogBackgroundFlush() is called. Thought? Thanks, I didn't notice that. Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's default value is 200msec and it may be set shorter time. Yeah, if wal_writer_delay is set to very small value, there is a risk that the WAL stats are sent too frequently. I agree that's a problem. Why don't to make another way to check the timestamp? + /* + * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL +
Re: Replace buffer I/O locks with condition variables (reviving an old patch)
On Thu, Mar 11, 2021 at 3:27 PM Julien Rouhaud wrote: > - /* > -* It would be nice to include the I/O locks in the BufferDesc, but that > -* would increase the size of a BufferDesc to more than one cache line, > -* and benchmarking has shown that keeping every BufferDesc aligned on a > -* cache line boundary is important for performance. So, instead, the > -* array of I/O locks is allocated in a separate tranche. Because those > -* locks are not highly contended, we lay out the array with minimal > -* padding. > -*/ > - size = add_size(size, mul_size(NBuffers, sizeof(LWLockMinimallyPadded))); > + /* size of I/O condition variables */ > + size = add_size(size, mul_size(NBuffers, > + sizeof(ConditionVariableMinimallyPadded))); > > Should we keep for now some similar comment mentionning why we don't put the > cv > in the BufferDesc even though it would currently fit the 64B target size? I tried to write some words along those lines, but it seemed hard to come up with a replacement message about a thing we're not doing because of other currently proposed patches. The situation could change, and it seemed to be a strange place to put this comment anyway, far away from the relevant struct. Ok, let me try that again... what do you think of this, as a new comment for BufferDesc, next to the existing discussion of the 64 byte rule? --- a/src/include/storage/buf_internals.h +++ b/src/include/storage/buf_internals.h @@ -174,6 +174,10 @@ typedef struct buftag * Be careful to avoid increasing the size of the struct when adding or * reordering members. Keeping it below 64 bytes (the most common CPU * cache line size) is fairly important for performance. + * + * Per-buffer I/O condition variables are kept outside this struct in a + * separate array. They could be moved in here and still fit under that + * limit on common systems, but for now that is not done. */ typedef struct BufferDesc {
RE: libpq debug log
From: Kyotaro Horiguchi > The output functions copy message bytes into local variable but the > same effect can be obtained by just casing via typed pointer type. > > uint32 tmp4; > .. > memcpy(&tmp4, buf + *cursor, 4); > result = (int) pg_ntoh32(tmp4); > > can be written as > > result = pg_ntoh32(* (uint32 *) (buf + *cursor)); I'm afraid we need to memcpy() because of memory alignment. > I think we can get rid of copying in the output functions for String > and Bytes in different ways. I haven't looked at this code, but you sound right. Regards Takayuki Tsunakawa
Re: libpq debug log
At Wed, 10 Mar 2021 20:38:20 -0500, Tom Lane wrote in > Kyotaro Horiguchi writes: > > Maybe I'm missing something, but the above doesn't seem to be working. I > > didn't see such a message when making an SSL connection. > > As far as that goes, I don't see any point in trying to trace > connection-related messages, because there is no way to enable > tracing on a connection before it's created. Yeah, agreed. In the previous version the tracing functions were called during protocol negotiation, but that no longer happens. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: libpq debug log
At Thu, 11 Mar 2021 01:51:55 +, "tsunakawa.ta...@fujitsu.com" wrote in > Alvaro-san, Tom-san, Horiguchi-san, > > From: Kyotaro Horiguchi > > +1 for the thanks for the quick work. I have some random comments > > after a quick look on it. > > Thank you very much for giving many comments. And We're sorry to > have caused you trouble. I told Iwata-san yesterday to modify the > patch to use the logging function interface that Tom-san, Alvaro-san > and I agreed on, that I'd review after the new revision has been Yeah, I agreed at the time. Sorry for not responding it but had no room in my mind to do that:p > posted, and let others know she is modifying the patch again so that > they won't start reviewing. But I should have done the request on > hackers. > > We're catching up with your feedback and post the updated patch. > Then I'll review it. > > We appreciate your help so much. My pleasure. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Replace buffer I/O locks with condition variables (reviving an old patch)
On Thu, Mar 11, 2021 at 03:54:06PM +1300, Thomas Munro wrote: > On Thu, Mar 11, 2021 at 3:27 PM Julien Rouhaud wrote: > > - /* > > -* It would be nice to include the I/O locks in the BufferDesc, but that > > -* would increase the size of a BufferDesc to more than one cache line, > > -* and benchmarking has shown that keeping every BufferDesc aligned on a > > -* cache line boundary is important for performance. So, instead, the > > -* array of I/O locks is allocated in a separate tranche. Because those > > -* locks are not highly contended, we lay out the array with minimal > > -* padding. > > -*/ > > - size = add_size(size, mul_size(NBuffers, > > sizeof(LWLockMinimallyPadded))); > > + /* size of I/O condition variables */ > > + size = add_size(size, mul_size(NBuffers, > > + > > sizeof(ConditionVariableMinimallyPadded))); > > > > Should we keep for now some similar comment mentionning why we don't put > > the cv > > in the BufferDesc even though it would currently fit the 64B target size? > > I tried to write some words along those lines, but it seemed hard to > come up with a replacement message about a thing we're not doing > because of other currently proposed patches. The situation could > change, and it seemed to be a strange place to put this comment > anyway, far away from the relevant struct. Yeah, I agree that it's not the best place to document the size consideration. > Ok, let me try that > again... what do you think of this, as a new comment for BufferDesc, > next to the existing discussion of the 64 byte rule? > > --- a/src/include/storage/buf_internals.h > +++ b/src/include/storage/buf_internals.h > @@ -174,6 +174,10 @@ typedef struct buftag > * Be careful to avoid increasing the size of the struct when adding or > * reordering members. Keeping it below 64 bytes (the most common CPU > * cache line size) is fairly important for performance. > + * > + * Per-buffer I/O condition variables are kept outside this struct in a > + * separate array. They could be moved in here and still fit under that > + * limit on common systems, but for now that is not done. > */ > typedef struct BufferDesc > { I was mostly thinking about something like "leave room for now as other feature could make a better use of that space", but I'm definitely fine with this comment!
Re: libpq debug log
At Thu, 11 Mar 2021 03:01:16 +, "tsunakawa.ta...@fujitsu.com" wrote in > From: Kyotaro Horiguchi > > The output functions copy message bytes into a local variable but the > > same effect can be obtained by just casting via a typed pointer. > > > > uint32 tmp4; > > .. > > memcpy(&tmp4, buf + *cursor, 4); > > result = (int) pg_ntoh32(tmp4); > > > > can be written as > > > > result = pg_ntoh32(* (uint32 *) (buf + *cursor)); > > I'm afraid we need to memcpy() because of memory alignment. Right. So something like this? const unsigned char *p = (const unsigned char *) (buf + *cursor); result = ((uint32) p[0] << 24) | ((uint32) p[1] << 16) | ...; > > I think we can get rid of copying in the output functions for String > > and Bytes in different ways. > > I haven't looked at this code, but you sound right. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [HACKERS] Custom compression methods
On Thu, Mar 11, 2021 at 08:17:46AM +0530, Dilip Kumar wrote: > On Thu, Mar 11, 2021 at 2:21 AM Robert Haas wrote: > > > > On Wed, Mar 10, 2021 at 6:52 AM Dilip Kumar wrote: > > > The pending comment is providing a way to rewrite a table and > > > re-compress the data with the current compression method. > > > > I spent some time poking at this yesterday and ran couldn't figure out > > what was going on here. There are two places where we rewrite tables. > > One is the stuff in cluter.c, which handles VACUUM FULL and CLUSTER. > > That eventually calls reform_and_rewrite_tuple(), which deforms the > > old tuple and creates a new one, but it doesn't seem like there's > > anything in there that would expand toasted values, whether external > > or inline compressed. But I think that can't be right, because it > > seems like then you'd end up with toast pointers into the old TOAST > > relation, not the new one, which would cause failures later. So I must > > be missing something here. The other place where we rewrite tables is > > in ATRewriteTable() as part of the ALTER TABLE machinery. I don't see > > anything there to force detoasting either. > > We do expand the external values, see below call stack. See below call stack. > > #0 detoast_external_attr (attr=0x2d61a80) at detoast.c:50 > #1 0x0055bd53 in toast_tuple_init > #2 0x0050554d in heap_toast_insert_or_update > #3 0x0050ad5b in raw_heap_insert > #4 0x0050a9a1 in rewrite_heap_tuple > #5 0x00502325 in reform_and_rewrite_tuple > #6 0x004ff47c in heapam_relation_copy_for_cluster > > But that is only if there are external attributes, we do nothing for > the inline compressed values. In raw_heap_insert only if > HeapTupleHasExternal(tup) is true then we call raw_heap_insert. So if > we want to do something about inline compressed data then we will have > to do something in reform_and_rewrite_tuple because there we deform > and form the tuple again so we have an opportunity to decompress any > compressed data. > > > But, I am not really convinced that we need to solve this problem by > > adding new ALTER TABLE syntax. I'd be happy enough if CLUSTER, VACUUM > > FULL, and versions of ALTER TABLE that already force a rewrite would > > cause the compression to be redone also. > > Okay, Maybe for directly compressed data we can do that without > affecting the performance of unrelated paths but I am again worried > about the composite type. Basically, if we have some row type then we > have to deform the tuple stored in the row type and process it fully. > Said that I think in the older version we already had a pacthes at > [1], basically the first 3 patches will ensure that we will never have > any compressed data in the composite type so we will not have to worry > about those as well. > > Honestly, even if the user > > had to fall back on creating a new table and doing INSERT INTO newtab > > SELECT * FROM oldtab I would consider that to be not a total > > showstopper for this .. assuming of course that it actually works. If > > it doesn't, we have big problems. Even without the pg_am stuff, we > > still need to make sure that we don't just blindly let compressed > > values wander around everywhere. When we insert into a table column > > with a compression method, we should recompress any data that is > > compressed using some other method. > > Well, it used to work in the older version[1] so we can make it work > here as well without much effort. 
Looking at v23-0002-alter-table-set-compression, ATRewriteTable() was calling CompareCompressionMethodAndDecompress(). I think what's wanted here is that decompression should only happen when the tuple uses a different compression than the column's currently set compression. So there's no overhead in the usual case. I guess CLUSTER and INSERT SELECT should do the same. This is important to allow someone to get rid of LZ4 compression, if they want to get rid of that dependency. But it's also really desirable for admins to be able to "migrate" existing data. People will want to test this, and I guess INSERT SELECT and CLUSTER are the obvious ways. -- Justin
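Assuming the rewrite paths do get taught to recompress as discussed, the admin-facing migration would presumably look like this (desired behavior under discussion, not what the current patch does for inline-compressed values):

ALTER TABLE t ALTER COLUMN a SET COMPRESSION pglz;
VACUUM FULL t;   -- or CLUSTER / INSERT INTO newtab SELECT * FROM oldtab
SELECT DISTINCT pg_column_compression(a) FROM t;   -- ideally only 'pglz'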
Re: shared-memory based stats collector
Hi, Two minor nits: On 2021-03-10 21:47:51 +0900, Fujii Masao wrote: > +/* Shared memory area for archiver process */ > +typedef struct PgArchData > +{ > + Latch *latch; /* latch to wake the archiver > up */ > + slock_t mutex; /* locks this struct */ > +} PgArchData; > + It doesn't really matter, but it'd be pretty trivial to avoid needing a spinlock for this kind of thing. Just store the pgprocno of the archiver in PgArchData. While getting rid of the spinlock doesn't seem like a huge win, it does seem nicer that we'd automatically have a way to find data about the archiver (e.g. pid). >* checkpointer to exit as well, otherwise not. The archiver, > stats, >* and syslogger processes are disregarded since they are not >* connected to shared memory; we also disregard dead_end > children >* here. Walsenders are also disregarded, they will be > terminated >* later after writing the checkpoint record, like the archiver >* process. >*/ This comment in PostmasterStateMachine() is outdated now. Greetings, Andres Freund
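A rough sketch of that pgprocno variant, with made-up names, just to illustrate the suggestion:

typedef struct PgArchData
{
	int			pgprocno;		/* pgprocno of archiver, or INVALID_PGPROCNO */
} PgArchData;

/* Wake the archiver, if one is running. */
static void
PgArchWakeup(void)
{
	int			pgprocno = PgArch->pgprocno;

	/*
	 * Reading an int is atomic, so no lock is needed.  A stale value only
	 * means we set the latch of a process that no longer cares, which is
	 * harmless.
	 */
	if (pgprocno != INVALID_PGPROCNO)
		SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
}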
Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)
On Tue, Mar 2, 2021 at 2:10 PM Thomas Munro wrote: > ... One question I haven't > got to the bottom of: is it a problem for the startup process that CVs > use CHECK_FOR_INTERRUPTS()? This was a red herring. The startup process already reaches CFI() via various paths, as I figured out pretty quickly with a debugger. So I'd like to go ahead and commit these patches. Michael, when you said "That's pretty hack-ish, still efficient" in reference to this code: > - if (IsUnderPostmaster && !PostmasterIsAlive()) > + if (IsUnderPostmaster && > +#ifndef USE_POSTMASTER_DEATH_SIGNAL > + count++ % 1024 == 0 && > +#endif > + !PostmasterIsAlive()) Is that an objection, and do you see a specific better way? I know that someone just needs to write a Windows patch to get us a postmaster death signal when the postmaster's event fires, and then the problem will go away on Windows. I still want this change, because we don't have such a patch yet, and even when someone writes that, there are still a couple of Unixes that could benefit.
Re: [PATCH] Provide more information to filter_prepare
On Wed, Mar 10, 2021 at 4:26 PM Markus Wanner wrote: > > On 10.03.21 11:18, Amit Kapila wrote: > > On Tue, Mar 9, 2021 at 2:14 PM Markus Wanner > > wrote: > >> currently, only the gid is passed on to the filter_prepare callback. > >> While we probably should not pass a full ReorderBufferTXN (as we do for > >> most other output plugin callbacks), a bit more information would be > >> nice, I think. > > > > How can the proposed 'xid' parameter be useful? What exactly do plugins > > want to do with it? > > The xid is the very basic identifier for transactions in Postgres. Any > output plugin that interacts with Postgres in any way slightly more > interesting than "filter by gid prefix" is very likely to come across a > TransactionId. > > It allows for basics like checking if the transaction to decode is still > in progress, for example. > But this happens when we are decoding the prepare, so it is clear that the transaction is prepared; why is any additional check needed? > Or in a much more complex scenario, decide on > whether or not to filter based on properties the extension stored during > processing the transaction. > What part of this can't be done with the GID, and how would the XID achieve it? -- With Regards, Amit Kapila.