Re: pgsql: Enable parallel SELECT for "INSERT INTO ... SELECT ...".
On Wed, Mar 10, 2021 at 9:07 AM Tom Lane wrote: > > Amit Kapila writes: > > Enable parallel SELECT for "INSERT INTO ... SELECT ...". > > skink (valgrind) is unhappy: > > creating configuration files ... ok > running bootstrap script ... ok > performing post-bootstrap initialization ... ==4085668== VALGRINDERROR-BEGIN > ==4085668== Conditional jump or move depends on uninitialised value(s) > ==4085668==at 0x4AEB77: max_parallel_hazard_walker (clauses.c:700) > ==4085668==by 0x445287: expression_tree_walker (nodeFuncs.c:2188) > ==4085668==by 0x4AEBB8: max_parallel_hazard_walker (clauses.c:860) > ==4085668==by 0x4B045E: is_parallel_safe (clauses.c:637) > ==4085668==by 0x4985D0: grouping_planner (planner.c:2070) > ==4085668==by 0x49AE4F: subquery_planner (planner.c:1024) > ==4085668==by 0x49B4F5: standard_planner (planner.c:404) > ==4085668==by 0x49BAD2: planner (planner.c:273) > ==4085668==by 0x5818BE: pg_plan_query (postgres.c:809) > ==4085668==by 0x581977: pg_plan_queries (postgres.c:900) > ==4085668==by 0x581E70: exec_simple_query (postgres.c:1092) > ==4085668==by 0x583F7A: PostgresMain (postgres.c:4327) > ==4085668== Uninitialised value was created by a stack allocation > ==4085668==at 0x4B0363: is_parallel_safe (clauses.c:599) > ==4085668== > ==4085668== VALGRINDERROR-END > > There are a few other commits that skink hasn't seen before, but given > the apparent connection to parallel planning, none of the others look > like plausible candidates to explain this. > Right, the patch forgot to initialize a new variable in max_parallel_hazard_context via is_parallel_safe. I think we need to initialize all the new variables as NULL because is_parallel_safe is used to check parallel-safety of expressions. These new variables are only required for checking parallel-safety of target relation which is already done at the time of initial checks in standard_planner. -- With Regards, Amit Kapila.
Re: Add some tests for pg_stat_statements compatibility verification under contrib
Hi Julien, Thanks a lot for the quick review. Please see my answers inline below. Attached is the new patch. v2_add_test_for_pg_stat_statements.patch Description: Binary data
Re: About to add WAL write/fsync statistics to pg_stat_wal view
On 2021/03/10 14:11, Masahiro Ikeda wrote: On 2021-03-09 17:51, Fujii Masao wrote: On 2021/03/05 8:38, Masahiro Ikeda wrote: On 2021-03-05 01:02, Fujii Masao wrote: On 2021/03/04 16:14, Masahiro Ikeda wrote: On 2021-03-03 20:27, Masahiro Ikeda wrote: On 2021-03-03 16:30, Fujii Masao wrote: On 2021/03/03 14:33, Masahiro Ikeda wrote: On 2021-02-24 16:14, Fujii Masao wrote: On 2021/02/15 11:59, Masahiro Ikeda wrote: On 2021-02-10 00:51, David G. Johnston wrote: On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda wrote: I pgindented the patches. ... XLogWrite, which is invoked during an XLogFlush request (see ...). This is also incremented by the WAL receiver during replication. ("which normally called" should be "which is normally called" or "which normally is called" if you want to keep true to the original) You missed the adding the space before an opening parenthesis here and elsewhere (probably copy-paste) is ether -> is either "This parameter is off by default as it will repeatedly query the operating system..." ", because" -> "as" Thanks, I fixed them. wal_write_time and the sync items also need the note: "This is also incremented by the WAL receiver during replication." I skipped changing it since I separated the stats for the WAL receiver in pg_stat_wal_receiver. "The number of times it happened..." -> " (the tally of this event is reported in wal_buffers_full in) This is undesirable because ..." Thanks, I fixed it. I notice that the patch for WAL receiver doesn't require explicitly computing the sync statistics but does require computing the write statistics. This is because of the presence of issue_xlog_fsync but absence of an equivalent pg_xlog_pwrite. Additionally, I observe that the XLogWrite code path calls pgstat_report_wait_*() while the WAL receiver path does not. It seems technically straight-forward to refactor here to avoid the almost-duplicated logic in the two places, though I suspect there may be a trade-off for not adding another function call to the stack given the importance of WAL processing (though that seems marginalized compared to the cost of actually writing the WAL). Or, as Fujii noted, go the other way and don't have any shared code between the two but instead implement the WAL receiver one to use pg_stat_wal_receiver instead. In either case, this half-and-half implementation seems undesirable. OK, as Fujii-san mentioned, I separated the WAL receiver stats. (v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch) Thanks for updating the patches! I added the infrastructure code to communicate the WAL receiver stats messages between the WAL receiver and the stats collector, and the stats for WAL receiver is counted in pg_stat_wal_receiver. What do you think? On second thought, this idea seems not good. Because those stats are collected between multiple walreceivers, but other values in pg_stat_wal_receiver is only related to the walreceiver process running at that moment. IOW, it seems strange that some values show dynamic stats and the others show collected stats, even though they are in the same view pg_stat_wal_receiver. Thought? OK, I fixed it. The stats collected in the WAL receiver is exposed in pg_stat_wal view in v11 patch. Thanks for updating the patches! I'm now reading 001 patch. 
+ /* Check whether the WAL file was synced to disk right now */ + if (enableFsync && + (sync_method == SYNC_METHOD_FSYNC || + sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH || + sync_method == SYNC_METHOD_FDATASYNC)) + { Isn't it better to make issue_xlog_fsync() return immediately if enableFsync is off or sync_method is open_sync or open_datasync, to simplify the code further? Thanks for the comments. I added the above code in the v12 patch. + /* + * Send WAL statistics only if WalWriterDelay has elapsed to minimize + * the overhead in WAL-writing. + */ + if (rc & WL_TIMEOUT) + pgstat_send_wal(); On second thought, this change means that it always takes wal_writer_delay before the walwriter's WAL stats are sent after XLogBackgroundFlush() is called. For example, if wal_writer_delay is set to several seconds, some values in pg_stat_wal would needlessly be out of date for those seconds. So I'm thinking of withdrawing my previous comment; it's OK to send the stats every time XLogBackgroundFlush() is called. Thoughts? Thanks, I didn't notice that. Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's default value is 200msec and it may be set to a shorter time. Yeah, if wal_writer_delay is set to a very small value, there is a risk that the WAL stats are sent too frequently. I agree that's a problem. Why don't we check the timestamp instead? + /* + * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL + * msec since we last sent one + */ + now = GetC
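(For readers following along: the counters discussed above end up in the pg_stat_wal view. The sketch below assumes the column names used in the v11/v12 patches and a track-timing parameter named as in the quoted documentation text; the timing columns stay at zero unless that parameter is enabled.)

    SET track_wal_io_timing = on;   -- parameter name assumed from the patch's docs
    SELECT wal_write, wal_sync, wal_write_time, wal_sync_time
      FROM pg_stat_wal;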
Re: Release SPI plans for referential integrity with DISCARD ALL
Hello, I thought about this suggestion again. Amit's patch suggested in the thread [1] can eliminate SPI plans from INSERT/UPDATE triggers, so our memory pressure issue would be solved. But as far as I can see from that thread, Amit's patch doesn't cover the DELETE case. It is not a common case, but there is still a risk of extreme memory pressure. Considering that, this suggestion might still have good value, as Corey said before. I improved the patch according to Peter's following comment: > but I think the > solution of dropping all cached plans as part of DISCARD ALL seems a bit > too extreme of a solution. In the context of connection pooling, > getting a new session with pre-cached plans seems like a good thing, and > this change could potentially have a performance impact without a > practical way to opt out. In the new patch, I separated discarding the SPI plan caches for RI from DISCARD ALL and added a new option "RIPLANS" to the DISCARD command to do that. Also, a new function, ResetRIPlanCache(), which clears the SPI plan caches, is called by DISCARD ALL or DISCARD RIPLANS. The amount of modification is small, and this option can be removed easily once it is no longer needed, so I think we can use this patch as a provisional solution. Any thoughts? [1] https://www.postgresql.org/message-id/flat/CA%2BHiwqGkfJfYdeq5vHPh6eqPKjSbfpDDY%2Bj-kXYFePQedtSLeg%40mail.gmail.com -- Best regards, Yuzuko Hosoya NTT Open Source Software Center v2_discard_ri_spi_plans.patch Description: Binary data
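To make the proposed interface concrete, usage with the v2 patch applied would look like the sketch below; the RIPLANS keyword exists only in this patch and is not part of any released server.

    DISCARD RIPLANS;   -- release only the SPI plans cached by RI triggers
    DISCARD ALL;       -- as before, but now also calls ResetRIPlanCache()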
Re: cryptohash: missing locking functions for OpenSSL <= 1.0.2?
On Mon, Mar 08, 2021 at 06:06:32PM +, Jacob Champion wrote: > I had to convince myself that this logic is correct -- we set > destroy_needed even if crypto is not enabled, but then check later to > make sure that crypto_loaded is true before doing anything. What would > you think about moving the conn->crypto_loaded check to the else > branch, so that destroy_needed is only set if we actually need it? Do you mean something like the attached? If I recall my mood from the moment, I think that I did that to be more careful with the case where the client has its own set of callbacks set (pq_init_crypto_lib as false) but that does not matter as this is double-checked in destroy_ssl_system(). I have adjusted some comments after more review. -- Michael diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c index 29054bad7b..e4a85c37c4 100644 --- a/src/interfaces/libpq/fe-connect.c +++ b/src/interfaces/libpq/fe-connect.c @@ -2887,6 +2887,16 @@ keep_going: /* We will come back to here until there is #ifdef USE_SSL +/* + * Enable the libcrypto callbacks before checking if SSL needs + * to be done. This is done before sending the startup packet + * as depending on the type of authentication done, like MD5 + * or SCRAM, the callbacks would be required even without a + * SSL connection + */ +if (pqsecure_initialize(conn, false, true) < 0) + goto error_return; + /* * If SSL is enabled and we haven't already got encryption of * some sort running, request SSL instead of sending the @@ -2998,8 +3008,14 @@ keep_going: /* We will come back to here until there is { /* mark byte consumed */ conn->inStart = conn->inCursor; - /* Set up global SSL state if required */ - if (pqsecure_initialize(conn) != 0) + + /* + * Set up global SSL state if required. The crypto + * state has already been set if libpq took care of + * doing that, so there is no need to make that happen + * again. + */ + if (pqsecure_initialize(conn, true, false) != 0) goto error_return; } else if (SSLok == 'N') diff --git a/src/interfaces/libpq/fe-secure-openssl.c b/src/interfaces/libpq/fe-secure-openssl.c index 0fa10a23b4..3de8f89264 100644 --- a/src/interfaces/libpq/fe-secure-openssl.c +++ b/src/interfaces/libpq/fe-secure-openssl.c @@ -85,7 +85,7 @@ static bool pq_init_crypto_lib = true; static bool ssl_lib_initialized = false; #ifdef ENABLE_THREAD_SAFETY -static long ssl_open_connections = 0; +static long crypto_open_connections = 0; #ifndef WIN32 static pthread_mutex_t ssl_config_mutex = PTHREAD_MUTEX_INITIALIZER; @@ -111,7 +111,7 @@ pgtls_init_library(bool do_ssl, int do_crypto) * Disallow changing the flags while we have open connections, else we'd * get completely confused. */ - if (ssl_open_connections != 0) + if (crypto_open_connections != 0) return; #endif @@ -635,7 +635,7 @@ pq_lockingcallback(int mode, int n, const char *file, int line) * override it. */ int -pgtls_init(PGconn *conn) +pgtls_init(PGconn *conn, bool do_ssl, bool do_crypto) { #ifdef ENABLE_THREAD_SAFETY #ifdef WIN32 @@ -684,22 +684,28 @@ pgtls_init(PGconn *conn) } } - if (ssl_open_connections++ == 0) + if (do_crypto && !conn->crypto_loaded) { - /* - * These are only required for threaded libcrypto applications, - * but make sure we don't stomp on them if they're already set. 
- */ - if (CRYPTO_get_id_callback() == NULL) -CRYPTO_set_id_callback(pq_threadidcallback); - if (CRYPTO_get_locking_callback() == NULL) -CRYPTO_set_locking_callback(pq_lockingcallback); + if (crypto_open_connections++ == 0) + { +/* + * These are only required for threaded libcrypto + * applications, but make sure we don't stomp on them if + * they're already set. + */ +if (CRYPTO_get_id_callback() == NULL) + CRYPTO_set_id_callback(pq_threadidcallback); +if (CRYPTO_get_locking_callback() == NULL) + CRYPTO_set_locking_callback(pq_lockingcallback); + } + + conn->crypto_loaded = true; } } #endif /* HAVE_CRYPTO_LOCK */ #endif /* ENABLE_THREAD_SAFETY */ - if (!ssl_lib_initialized) + if (!ssl_lib_initialized && do_ssl) { if (pq_init_ssl_lib) { @@ -740,10 +746,10 @@ destroy_ssl_system(void) if (pthread_mutex_lock(&ssl_config_mutex)) return; - if (pq_init_crypto_lib && ssl_open_connections > 0) - --ssl_open_connections; + if (pq_init_crypto_lib && crypto_open_connections > 0) + --crypto_open_connections; - if (pq_init_crypto_lib && ssl_open_connections == 0) + if (pq_init_crypto_lib && crypto_open_connections == 0) { /* * No connections left, unregister libcrypto callbacks, if no one @@ -1402,46 +1408,63 @@ pgtls_close(PGconn *conn) { bool destroy_needed = false; - if (conn->ssl) + if (conn->ssl_in_use) { - /* - * We can't destroy everything SSL-related here due to
Re: OpenSSL 3.0.0 compatibility
> On 3 Mar 2021, at 14:55, Peter Eisentraut > wrote: > > This thread is still in the commit fest, but I don't see any actual proposed > patch still pending. Most of the activity has moved into other threads. The doc changes in the patch proposed on 29/9 still stand, although I see that it had an off-by-one in mentioning MD5 when it should be MD4 et al.; so something more like the below. diff --git a/doc/src/sgml/pgcrypto.sgml b/doc/src/sgml/pgcrypto.sgml index b6bb23de0f..d45464c7ea 100644 --- a/doc/src/sgml/pgcrypto.sgml +++ b/doc/src/sgml/pgcrypto.sgml @@ -1234,6 +1234,12 @@ gen_random_uuid() returns uuid + +When compiled against OpenSSL 3.0.0, the legacy +provider must be activated in the system openssl.cnf +configuration file in order to use older ciphers like DES and Blowfish. + + > Could you update the status of this CF entry, and perhaps also on the status > of OpenSSL compatibility in general? Let's just wait for 3.0.0 to ship before we do anything. -- Daniel Gustafsson https://vmware.com/
RE: Avoid CommandCounterIncrement in RI trigger when INSERT INTO referencing table
From: Hou, Zhijie/侯 志杰 > After some more thought on how to support parallel insert into an FK relation: > It seems we do not have a cheap way to implement this feature. > > In RI_FKey_check, currently, postgres executes "select xx for key share" to > check that the foreign key value exists in the PK table. > However, "select for update/share" is considered parallel unsafe. It may be > dangerous to do this in parallel mode; we may want to change this. Hmm, I guess the parallel leader and workers can execute SELECT FOR KEY SHARE, if the parallelism infrastructure allows execution of SPI calls. The lock manager supports tuple locking in the parallel leader and workers via group locking. Also, tuple locking doesn't require combo CIDs, which would be necessary for parallel UPDATE and DELETE. Perhaps the reason why SELECT ... FOR UPDATE/SHARE is treated as parallel-unsafe is that tuple locking modifies data pages to store lock information in the tuple header. But now, page modification is possible in parallel processes, so I think we can consider SELECT ... FOR UPDATE/SHARE as parallel-safe. (I may be too optimistic.) > And also, "select for update/share" is considered as "not read only", which > will > force readonly = false in _SPI_execute_plan. read_only is used to do CCI. Can we arrange to skip CCI? > At the same time, the "simplifying foreign key/RI checks" thread is trying to > replace "select xx for key share" with index_beginscan()+table_tuple_lock() (I > think it’s parallel safe). > Maybe we can try to support parallel insert into an FK relation after the "simplifying > foreign > key/RI checks" patch is applied? Why do you think it's parallel safe? Can you try running parallel INSERT SELECT on a target table with an FK and see if any problems happen? If some problem occurs due to the tuple locking, I think we can work around it by avoiding tuple locking. That is, we make parallel INSERT SELECT lock the parent tables in exclusive mode so that the check tuples won't be deleted. Some people may not like this, but it's worth considering because parallel INSERT SELECT would not have to be run concurrently with short OLTP transactions. Anyway, tuple locking partly disturbs the parallel INSERT speedup because it modifies pages in the parent tables and emits WAL. Surprisingly, Oracle doesn't support parallel INSERT SELECT on a table with an FK, as shown below. SQL Server's documentation doesn't mention anything, so I guess it's supported there. This is a good chance for PostgreSQL to exceed Oracle. https://docs.oracle.com/en/database/oracle/oracle-database/21/vldbg/types-parallelism.html#GUID-D4CFC1F2-44D3-4BE3-B5ED-6A309EB8BF06 Table 8-1 (Referential Integrity Restrictions), for the DML statement INSERT: Issued on Parent - (Not applicable); Issued on Child - Not parallelized; Self-Referential - Not parallelized. Regards Takayuki Tsunakawa
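For readers unfamiliar with the "select xx for key share" mentioned above: the check query is built dynamically in ri_triggers.c, but its shape is roughly the following. This is an illustrative approximation with made-up table and column names, written as a prepared statement so it can be run standalone.

    CREATE TABLE pk_table (a int PRIMARY KEY);

    -- Approximate shape of the query RI_FKey_check runs via SPI for each
    -- inserted or updated referencing row:
    PREPARE ri_check (int) AS
        SELECT 1 FROM ONLY pk_table x WHERE a = $1 FOR KEY SHARE OF x;

    EXECUTE ri_check (1);   -- zero rows here would mean an FK violation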
Re: Improvements and additions to COPY progress reporting
On Tue, 9 Mar 2021 at 06:34, Michael Paquier wrote: > > On Mon, Mar 08, 2021 at 05:33:40PM +0100, Matthias van de Meent wrote: > > Seems reasonable. PFA updated patches. I've renamed the previous 0003 > > to 0002 to keep git-format-patch easy. > > Thanks for updating the patch. 0001 has been applied, after tweaking > the comments, indentation and docs a bit. Thanks! > > This is keeping current behaviour of the implementation as committed > > with 8a4f618e, with the rationale of that patch being that this number > > should mirror the number returned by the copy command. > > > > I am not opposed to adding another column for `tuples_inserted` and > > changing the logic accordingly (see prototype 0003), but that was not > > in the intended scope of this patchset. Unless you think that this > > should be included in this current patchset, I'll spin that patch out > > into a different thread, but I'm not sure that would make it into > > pg14. > > Okay, point taken. If there is demand for it in the future, we could > extend the existing set of columns. After thinking more about it, the > use case is not completely clear to me from a monitoring point of > view. > > I have not looked at 0002 in detail yet, but I am wondering first if > the size estimations in the expected output are actually portable. > Second, I doubt a bit that the extra cycles spent on that are actually > worth the coverage, even if the trick with an AFTER INSERT trigger is > interesting. There are examples in which pg_stat_progress_* views report inaccurate data. I think it is fairly reasonable to at least validate some part of the progress reporting, as it is one of the few methods for administrators to look at the state of currently running administrative tasks, and as such, this user-visible API should be validated. With regards, Matthias van de Meent
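For context, the view being exercised here is pg_stat_progress_copy. A monitoring session can watch a running COPY with a query along the lines of the sketch below; the column names follow the patches discussed in this thread (the prototype-only tuples_inserted column is deliberately left out).

    SELECT relid::regclass, command, type,
           bytes_processed, bytes_total,
           tuples_processed, tuples_excluded
      FROM pg_stat_progress_copy;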
Re: Procedures versus the "fastpath" API
On Tue, 2021-03-09 at 14:15 -0500, Tom Lane wrote: > The security team received a report from Theodor-Arsenij > Larionov-Trichkin of PostgresPro that it's possible to crash the > backend with an assertion or null-pointer dereference by trying to > call a window function via the "fast path function call" protocol > message. > > So the question on the table is what to do about this. > > As for procedures, I'm of the opinion that we should just reject those > too, but some other security team members were not happy with that > idea. Conceivably we could attempt to make the case actually work, > but is it worth the trouble? Given the lack of field complaints > about the "invalid transaction termination" failure, it seems unlikely > that it's occurred to anyone to try to call procedures this way. > We'd need special infrastructure to test the case, too, since psql > offers no access to fastpath calls. > > A compromise suggestion was to prohibit calling procedures via > fastpath as of HEAD, but leave existing releases as they are, > in case anyone is using a case that happens to work. The "invalid transaction termination" failure alone doesn't worry or surprise me - transaction handling in procedures only works under rather narrow conditions anyway (no SELECT on the call stack, no explicit transaction was started outside the procedure). If that is the only problem, I'd just document it. The hard work is of course verifying that there is no other problem with calling procedures that way. If anybody wants to do that work, and transaction handling is the only thing that doesn't work with the fastpath API, we can call it supported and document the exception. In case of doubt, I would agree with you and forbid it in HEAD as a corner case with little real-world use. Yours, Laurenz Albe
Re: Parallel INSERT (INTO ... SELECT ...)
On Mon, Mar 8, 2021 at 7:19 PM Amit Langote wrote: > > Hi Amit > > On Mon, Mar 8, 2021 at 10:18 PM Amit Kapila wrote: > > On Mon, Mar 8, 2021 at 3:54 PM Greg Nancarrow wrote: > > > I've attached an updated set of patches with the suggested locking > > > changes. > > (Thanks Greg.) > > > Amit L, others, do let me know if you have still more comments on > > 0001* patch or if you want to review it further? > > I just read through v25 and didn't find anything to complain about. > Thanks a lot, pushed now! Amit L., your inputs are valuable for this work. Now, coming back to Hou-San's patch to introduce a GUC and reloption for this feature, I think both of those make sense to me because when the feature is enabled via GUC, one might want to disable it for partitioned tables? Do we agree on that part or someone thinks otherwise? The other points to bikeshed could be: 1. The name of GUC and reloption. The two proposals at hand are enable_parallel_dml and enable_parallel_insert. I would prefer the second (enable_parallel_insert) because updates/deletes might not have a similar overhead. 2. Should we keep the default value of GUC to on or off? It is currently off. I am fine keeping it off for this release and we can always turn it on in the later releases if required. Having said that, I see the value in keeping it on because in many cases Insert ... Select will be used for large data and there we will see a benefit of parallelism and users facing trouble (who have a very large number of partitions with less data to query) can still disable the parallel insert for that particular table. Also, the other benefit of keeping it on till at least the beta period is that this functionality will get tested and if we found reports of regression then we can turn it off for this release as well. Thoughts? -- With Regards, Amit Kapila.
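To make the bikeshedding concrete, here is a sketch of how the knobs under discussion would be used, assuming the enable_parallel_insert GUC name and a matching table-level reloption (both names, including the parallel_insert_enabled spelling, are still proposals at this point, and the table is a placeholder):

    CREATE TABLE measurements (ts timestamptz, v numeric) PARTITION BY RANGE (ts);

    SET enable_parallel_insert = on;   -- proposed GUC; default still under discussion

    -- Opt a heavily partitioned target table out while leaving the GUC on
    -- (reloption name is an assumption):
    ALTER TABLE measurements SET (parallel_insert_enabled = off);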
Re: Confusing behavior of psql's \e
On Wed, Mar 10, 2021 at 6:14 AM Tom Lane wrote: > > PS: I seem to recall that some Microsoft filesystems have 2-second > resolution on file mod times, so maybe it needs to be "time(NULL) - 2"? > You are thinking about FAT: https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-getfiletime#remarks Regards, Juan José Santamaría Flecha
Re: [PATCH] Provide more information to filter_prepare
On Tue, Mar 9, 2021 at 2:14 PM Markus Wanner wrote: > > Hi, > > currently, only the gid is passed on to the filter_prepare callback. > While we probably should not pass a full ReorderBufferTXN (as we do for > most other output plugin callbacks), a bit more information would be > nice, I think. > How can the proposed 'xid' parameter be useful? What exactly would plugins want to do with it? -- With Regards, Amit Kapila.
Re: Confusing behavior of psql's \e
On Wed, 2021-03-10 at 00:13 -0500, Tom Lane wrote: > I agree > that trying to get rid of the race condition inherent in the existing > file mtime test would be a good idea. However, I've got some > portability-related gripes about how you are doing the latter: > > 1. There is no principled reason to assume that the epoch date is in the > past. IIRC, Postgres' timestamp epoch of 2000-01-01 was in the future > at the time we set it. More relevant to the immediate issue, I clearly > recall a discussion at Red Hat in which one of the principal glibc > maintainers (likely Ulrich Drepper, though I'm not quite sure) argued > that 32-bit time_t could be used indefinitely by redefining the epoch > forward by 2^32 seconds every often; which would require intervals of > circa 68 years in which time_t was seen as a negative offset from a > future epoch date, rather than an unsigned offset from a past date. > Now, I thought he was nuts then and I still think that 32-bit hardware > will be ancient history by 2038 ... but there may be systems that do it > like that. glibc hates ABI breakage. > > 2. Putting an enormously old date on a file that was just created will > greatly confuse onlookers, some of whom (such as backup or antivirus > daemons) might not react pleasantly. > > Between #1 and #2, it's clearly worth the extra one or two lines of > code to set the file dates to, say, "time(NULL) - 1", rather than > assuming that zero is good enough. > > 3. I wonder about the portability of utime(2). I see that we are using > it to cause updates of socket and lock file times, but our expectations > for it there are rock-bottom low. I think that failing the edit command > if utime() fails is an overreaction even with optimistic assumptions about > its reliability. Doing that makes things strictly worse than simply not > doing anything, because 99% of the time this refinement is unnecessary. > > In short, I think the relevant code ought to be more like > > else > { > struct utimbuf ut; > > ut.modtime = ut.actime = time(NULL) - 1; > (void) utime(fname, &ut); > } > > (plus some comments of course) > > regards, tom lane > > PS: I seem to recall that some Microsoft filesystems have 2-second > resolution on file mod times, so maybe it needs to be "time(NULL) - 2"? Thanks for taking a look! Taken together, these arguments are convincing. Done like that in the attached patch version 4. Yours, Laurenz Albe From 31257c2e697421bbcc08bdd546ccc8729654ccb3 Mon Sep 17 00:00:00 2001 From: Laurenz Albe Date: Wed, 10 Mar 2021 11:28:20 +0100 Subject: [PATCH] Improve \e, \ef and \ev if the editor is quit Before, if you edited a function or view definition or a script file, quitting the editor without saving would cause the *previous query* to be executed, which is certainly not what anyone would expect. It is better to execute nothing and clear the query buffer in this case. The behavior when editing the query buffer is unchanged: quitting without saving will retain the query buffer. This patch also fixes an old race condition: due to the low granularity that stat(2) can measure for the modification time, an edit that took less than a second might go unnoticed. In passing, fix the misplaced function comment for do_edit(). 
Author: Laurenz Albe Reviewed-by: Chapman Flack, Jacob Champion Discussion: https://postgr.es/m/0ba3f2a658bac6546d9934ab6ba63a805d46a49b.camel%40cybertec.at --- doc/src/sgml/ref/psql-ref.sgml | 10 -- src/bin/psql/command.c | 64 ++ 2 files changed, 57 insertions(+), 17 deletions(-) diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml index 13c1edfa4d..0bf69174ff 100644 --- a/doc/src/sgml/ref/psql-ref.sgml +++ b/doc/src/sgml/ref/psql-ref.sgml @@ -1970,7 +1970,9 @@ testdb=> -The new contents of the query buffer are then re-parsed according to +If you edit a file or the previous query, and you quit the editor without +modifying the file, the query buffer is cleared. +Otherwise, the new contents of the query buffer are re-parsed according to the normal rules of psql, treating the whole buffer as a single line. Any complete queries are immediately executed; that is, if the query buffer contains or ends with a @@ -2039,7 +2041,8 @@ Tue Oct 26 21:40:57 CEST 1999 in the form of a CREATE OR REPLACE FUNCTION or CREATE OR REPLACE PROCEDURE command. Editing is done in the same way as for \edit. - After the editor exits, the updated command is executed immediately + If you quit the editor without saving, the statement is discarded. + If you save and exit the editor, the updated command is executed immediately if you added a semicolon to it. Otherwise it is redisplayed; type semicolon or \g to send it, or \r to
Do we support upgrade of logical replication?
Hi, I was reviewing logical decoding of two-phase transactions feature, while reviewing the feature I was checking if there is any impact on publisher/subscriber upgrade. I checked the existing pg_upgrade behaviour with logical replication. I made a logical replication data instance with publisher and subscriber with subscriber subscribed to a table. Then I tried upgrading publisher and subscriber individually. After upgrade I noticed the following: 1) Pg_logical/mappings files were not copied in the upgraded data folder: -- Pg_logical contents in old data folder: publisher/pg_logical/replorigin_checkpoint publisher/pg_logical/mappings: map-32cb-4df-0_1767088-225-225 publisher/pg_logical/snapshots: 0-1643650.snap New upgraded data folder: publisher1/pg_logical/replorigin_checkpoint publisher1/pg_logical/mappings: publisher1/pg_logical/snapshots: 2) Replication slots were not copied: select * from pg_replication_slots; slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size | t wo_phase ---++---++--+---+++--+--+-+-++---+-- - (0 rows) 3) The subscriber is in disabled mode in the upgraded data: select * from pg_subscription; oid | subdbid | subname | subowner | subenabled | subbinary | substream | subtwophase | subconninfo| subslotname | subsynccommit | subpublicati ons ---+-+-+--++---+---+-+--+-+---+- 16404 | 16401 | mysub | 10 | f | f | f | f | host=localhost port=5432 dbname=postgres | mysub | off | {mypub} (1 row) 4) The pg_subscription_rel contents also were not copied: select * from pg_subscription_rel; srsubid | srrelid | srsubstate | srsublsn -+-++-- (0 rows) 5) While logical decoding of transactions, the decoded changes will be serialized based on logical_decoding_work_mem configuration. Even these files were not copied during upgrade. Do we support upgrading of logical replication, if it is supported could someone point me to the document link on how to upgrade logical replication? Regards, Vignesh
Re: [PATCH] Provide more information to filter_prepare
On 10.03.21 11:18, Amit Kapila wrote: On Tue, Mar 9, 2021 at 2:14 PM Markus Wanner wrote: currently, only the gid is passed on to the filter_prepare callback. While we probably should not pass a full ReorderBufferTXN (as we do for most other output plugin callbacks), a bit more information would be nice, I think. How can the proposed 'xid' parameter be useful? What exactly would plugins want to do with it? The xid is the most basic identifier for transactions in Postgres. Any output plugin that interacts with Postgres in any way slightly more interesting than "filter by gid prefix" is very likely to come across a TransactionId. It allows for basics like checking whether the transaction to decode is still in progress, for example. Or, in a much more complex scenario, deciding whether or not to filter based on properties the extension stored while processing the transaction. Regards Markus
Re: Logical Replication - improve error message while adding tables to the publication in check_publication_add_relation
On Wed, Mar 10, 2021 at 1:27 PM Jeevan Ladhe wrote: > > On Wed, Mar 10, 2021 at 10:44 AM Bharath Rupireddy > wrote: >> >> Hi, >> >> While providing thoughts on [1], I observed that the error messages >> that are emitted while adding foreign, temporary and unlogged tables >> can be improved a bit from the existing [2] to [3]. > > +1 for improving the error messages here. Thanks for taking a look at the patch. >> Attaching a small patch. Thoughts? > > I had a look at the patch and it looks good to me. However, I think after > you have added the specific kind of table type in the error message itself, > now the error details seem to be giving redundant information, but others > might > have different thoughts. The error detail is meant to give a bit of information about which relation types are unsupported in the CREATE PUBLICATION statement. But with the error message now showing the type of relation, the detail message looks redundant to me as well. If agreed, I can remove that. Thoughts? > The patch itself looks good otherwise. Also the make check and postgres_fdw > check looking good. Thanks. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
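For reference, the three cases whose messages the patch adjusts can be reproduced as sketched below; the exact wording of the resulting errors depends on whether the patch is applied, and the dummy FDW is only there so that a foreign table can be created.

    CREATE TEMPORARY TABLE t_temp (a int);
    CREATE UNLOGGED TABLE t_unlogged (a int);
    CREATE FOREIGN DATA WRAPPER dummy_fdw;
    CREATE SERVER dummy_server FOREIGN DATA WRAPPER dummy_fdw;
    CREATE FOREIGN TABLE t_foreign (a int) SERVER dummy_server;

    CREATE PUBLICATION pub_temp FOR TABLE t_temp;           -- rejected
    CREATE PUBLICATION pub_unlogged FOR TABLE t_unlogged;   -- rejected
    CREATE PUBLICATION pub_foreign FOR TABLE t_foreign;     -- rejected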
RE: Parallel Inserts in CREATE TABLE AS
> Seems like v22 patch was failing in cfbot for one of the unstable test cases. > Attaching v23 patch set with modification in 0003 and 0004 patches. No > changes to 0001 and 0002 patches. Hopefully cfbot will be happy with v23. > > Please consider v23 for further review. Hi, I was looking into the latest patch; here are some comments: 1) * (Note that we do allow CREATE TABLE AS, SELECT INTO, and CREATE * MATERIALIZED VIEW to use parallel plans, but as of now, only the leader * backend writes into a completely new table. In the future, we can Since we will support parallel insert with CTAS, these existing comments need to be changed. 2) In parallel.sgml The query writes any data or locks any database rows. If a query contains a data-modifying operation either at the top level or within a CTE, no parallel plans for that query will be generated. As an exception, the commands CREATE TABLE ... AS, Same as 1): we'd better mention that we now support parallel insert with CTAS. 3) ExecInitParallelPlan(PlanState *planstate, EState *estate, Bitmapset *sendParams, int nworkers, -int64 tuples_needed) +int64 tuples_needed, +ParallelInsertCmdKind parallel_ins_cmd, +void *parallel_ins_info) Would it be better to place ParallelInsertCmdKind in struct ParallelInsertCTASInfo? We could pass fewer parameters in some places. Like: typedef struct ParallelInsertCTASInfo { + ParallelInsertCmdKind parallel_ins_cmd; IntoClause *intoclause; Oid objectid; } ParallelInsertCTASInfo; Best regards, houzj
Re: Logical Replication - improve error message while adding tables to the publication in check_publication_add_relation
On Wed, Mar 10, 2021, at 2:14 AM, Bharath Rupireddy wrote: > While providing thoughts on [1], I observed that the error messages > that are emitted while adding foreign, temporary and unlogged tables > can be improved a bit from the existing [2] to [3]. For instance, the > existing message when a foreign table is added to a > publication, ""f1" is not a table", looks odd. Because it says that the > foreign table is not a table at all. I wouldn't mix [regular|partitioned|temporary|unlogged] tables with foreign tables. Although they have a pg_class entry in common, foreign tables aren't "real" tables (external storage); they even have different DDLs to handle them (CREATE TABLE vs. CREATE FOREIGN TABLE). postgres=# CREATE PUBLICATION testpub FOR TABLE f1; ERROR: "f1" is not a table DETAIL: Only tables can be added to publications. I agree that "f1 is not a table" is a confusing message at first, because a foreign table has "table" in its description. Maybe if we apply the negation in both messages it would be clearer (using the same wording as for system tables). ERROR: "f1" is a foreign table DETAIL: Foreign tables cannot be added to publications. -- Euler Taveira EDB https://www.enterprisedb.com/
Re: Allow batched insert during cross-partition updates
Hi, ‐‐‐ Original Message ‐‐‐ On Thursday, February 18, 2021 10:54 AM, Amit Langote wrote: > On Thu, Feb 18, 2021 at 6:52 PM Amit Langote amitlangot...@gmail.com wrote: > > > Based on the discussion at: > > https://www.postgresql.org/message-id/6929d485-2d2a-da46-3681-4a400a3d794f%40enterprisedb.com > > I'm posting the patch for $subject here in this new thread and I'll > > add it to the next CF per Tomas' advice. > > Done:https://commitfest.postgresql.org/32/2992/ > > -- apparently I did not receive the review comment I sent via the commitfest app. Apologies for the chatter. Find the message-id here: > > Amit Langote > EDB: http://www.enterprisedb.com
Re: Allow batched insert during cross-partition updates
‐‐‐ Original Message ‐‐‐ On Wednesday, March 10, 2021 1:23 PM, Georgios wrote: > > > Hi, > > ‐‐‐ Original Message ‐‐‐ > On Thursday, February 18, 2021 10:54 AM, Amit Langote amitlangot...@gmail.com > wrote: > > > On Thu, Feb 18, 2021 at 6:52 PM Amit Langote amitlangot...@gmail.com wrote: > > > > > Based on the discussion at: > > > https://www.postgresql.org/message-id/6929d485-2d2a-da46-3681-4a400a3d794f%40enterprisedb.com > > > I'm posting the patch for $subject here in this new thread and I'll > > > add it to the next CF per Tomas' advice. > > > > Done:https://commitfest.postgresql.org/32/2992/ > > apparently I did not receive the review comment I sent via the commitfest app. > Apologies for the chatter. Find the message-id here: https://www.postgresql.org/message-id/161530518971.29967.9368488207318158252.pgcf%40coridan.postgresql.org I continued looking a bit at the patch, yet I am either failing to see the fix or I am looking at the wrong thing. Please find attached a small repro of what my expectations were. As you can see in the repro, I would expect the UPDATE local_root_remote_partitions SET a = 2; to move the tuples to remote_partition_2 in the same transaction. However, this is not the case, with or without the patch. Is my expectation of this patch wrong? Cheers, //Georgios > > > Amit Langote > > EDB: http://www.enterprisedb.com repro.sql Description: application/sql
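Since the attached repro.sql does not appear in the archived text, here is a hypothetical reconstruction of the kind of setup being described. The names local_root_remote_partitions and remote_partition_2 come from the message above; the loopback server, the single integer column, and the list partition bounds are guesses.

    CREATE EXTENSION postgres_fdw;
    CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (dbname 'postgres', port '5432');
    CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;

    -- Plain tables on the "remote" side that back the foreign partitions.
    CREATE TABLE remote_partition_1_data (a int);
    CREATE TABLE remote_partition_2_data (a int);

    CREATE TABLE local_root_remote_partitions (a int) PARTITION BY LIST (a);
    CREATE FOREIGN TABLE remote_partition_1
        PARTITION OF local_root_remote_partitions FOR VALUES IN (1)
        SERVER loopback OPTIONS (table_name 'remote_partition_1_data');
    CREATE FOREIGN TABLE remote_partition_2
        PARTITION OF local_root_remote_partitions FOR VALUES IN (2)
        SERVER loopback OPTIONS (table_name 'remote_partition_2_data');

    INSERT INTO local_root_remote_partitions VALUES (1), (1);

    -- The UPDATE in question; the expectation stated above is that the rows
    -- end up in remote_partition_2 within the same transaction.
    BEGIN;
    UPDATE local_root_remote_partitions SET a = 2;
    SELECT tableoid::regclass, a FROM local_root_remote_partitions;
    COMMIT;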
Re: shared-memory based stats collector
On 2021/03/10 17:51, Kyotaro Horiguchi wrote: At Wed, 10 Mar 2021 15:20:43 +0900, Fujii Masao wrote in On 2021/03/10 12:10, Kyotaro Horiguchi wrote: Agreed. The code moved to the original place and added the crash handling code. And I added a phrase to the comment. +* Was it the archiver? If exit status is zero (normal) or one (FATAL +* exit), we assume everything is all right just like normal backends +* and just try to restart a new one so that we immediately retry +* archiving of remaining files. (If fail, we'll try again in future "of" of "archiving of remaining" should be replaced with "the", or removed? Either will do. I don't mind turning the gerund (archiving) into a gerund phrase (archiving remaining files). Just for record. Previously LogChildExit() was called and the following LOG message was output when the archiver reported FATAL error. OTOH the patch prevents that and the following LOG message is not output at FATAL exit of archiver. But I don't think that the following message is required in that case because FATAL message indicating the similar thing is already output. Therefore, I'm ok with the patch. LOG: archiver process (PID 46418) exited with exit code 1 Yeah, that's the same behavor with wal receiver. I read v50_003 patch. When archiver dies, ProcGlobal->archiverLatch should be reset to NULL, like walreceiver does the similar thing in WalRcvDie()? Differently from walwriter and checkpointer, archiver as well as walreceiver may die while server is running. Leaving the latch pointer alone may lead to nudging a wrong process that took over the same procarray slot. Added pgarch_die() to do that. Thanks! + if (IsUnderPostmaster && ProcGlobal->archiverLatch) + SetLatch(ProcGlobal->archiverLatch); The latch can be reset to NULL in pgarch_die() between the if-condition and SetLatch(), and which would be problematic. Probably we should protect the access to the latch by using spinlock, like we do for walreceiver's latch? Ugg. Right. I remember about that bug. I moved the archiverLatch out of ProcGlobal to a dedicate local struct PgArch and placed a spinlock together. Thanks for the review! v52 is attached. Thanks! I applied minor and cosmetic changes to the 0003 patch as follows. Attached is the updated version of the 0003 patch. Barring any objection, I will commit this patch. -#include "storage/latch.h" -#include "storage/proc.h" I removed these because they are no longer necessary. logical replication worker, parallel worker, background writer, client backend, checkpointer, + archiver, startup, walreceiver, walsender and walwriter. In the document about pg_stat_activity, possible values in backend_type column are all listed. I added "archiver" into the list. BTW, those values were originally listed in alphabetical order, but ISTM that they were gotten out of order at a certain moment. So they should be listed in alphabetical order again. This should be implemented as a separate patch. -PgArchData *PgArch = NULL; +static PgArchData *PgArch = NULL; I marked PgArchData as static because it's used only in pgarch.c. - ShmemInitStruct("Archiver ", PgArchShmemSize(), &found); + ShmemInitStruct("Archiver Data", PgArchShmemSize(), &found); I found that the original shmem name ended with unnecessary space character. I replaced it with "Archiver Data". In reaper(), I moved the code block for archiver to the original location. I ran pgindent for the files that the patch modifies. 
Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 3335d71eba..b82b2f6d66 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -935,6 +935,7 @@ postgres 27093 0.0 0.0 30096 2752 ?Ss 11:34 0:00 postgres: ser logical replication worker, parallel worker, background writer, client backend, checkpointer, + archiver, startup, walreceiver, walsender and walwriter. In addition, background workers registered by extensions may have diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c index 1c5a4f8b5a..26b023e754 100644 --- a/src/backend/access/transam/xlogarchive.c +++ b/src/backend/access/transam/xlogarchive.c @@ -25,11 +25,11 @@ #include "common/archive.h" #include "miscadmin.h" #include "postmaster/startup.h" +#include "postmaster/pgarch.h" #include "replication/walsender.h" #include "storage/fd.h" #include "storage/ipc.h" #include "storage/lwlock.h" -#include "storage/pmsignal.h" /* * Attempt to retrie
Re: Change JOIN tutorial to focus more on explicit joins
On 12/1/20 3:38 AM, Jürgen Purtz wrote: On 30.11.20 21:25, David G. Johnston wrote: Sorry, I managed to overlook the most recent patch. I admitted my use of parentheses was incorrect and I don't see anyone else defending them. Please remove them. Minor typos: "the database compare" -> needs an "s" (compares) "In this case, the definition how to compare their rows." -> remove, redundant with the first sentence "The results from the older implicit syntax, and the newer explicit JOIN/ON syntax, are identical" -> move the commas around to what is shown here OK. Patch attached. Peter, you committed some of this patch originally. Do you think the rest of the patch is now in shape to be committed? Regards, -- -David da...@pgmasters.net
Re: error_severity of brin work item
On 12/1/20 5:25 PM, Justin Pryzby wrote: On Tue, Dec 01, 2020 at 03:57:24PM -0300, Alvaro Herrera wrote: Another idea is if perform_work_item() were responsible for discarding relations which disappear. Currently it does this, which is racy since it holds no lock. That has the property that it remains contained in autovacuum.c, but no other advantages I think. It has the advantage that it moves all the try_open stuff out of brin. I started implementing this, and then realized that the try_open stuff *has* to be in the brin_summarize function, to handle the case that someone passes a non-index, since it's SQL exposed. So maybe we should use your LockOid patch now, and refactor in the future if we add additional work-item types. Thoughts on this, Álvaro? I can see that the first version of this patch was not ideal but the rework seems to have stalled. Since it is a bug perhaps it would be better to get something in as Justin suggests? Regards, -- -David da...@pgmasters.net
Re: libpq debug log
On 2021-Mar-10, iwata@fujitsu.com wrote: > Hi all, > > Following all reviewer's advice, I have created a new patch. > > In this patch, I add only two tracing entry points; I call > pqTraceOutputMsg(PGconn conn, int msgCursor, PGCommSource commsource) in > pqParseInput3 () and pqPutMsgEnd () to output log. > The argument contains message first byte offset called msgCursor because it > is simpler to specify the buffer pointer in the caller. > > In pqTraceOutputMsg(), the content common to all protocol messages (first > timestamp, < or >, first 1 byte, and length) are output first and after that > each protocol message contents is output. I referred to pqParseInput3 () , > fe-exec.c and documentation page for that part. > > This fix almost eliminates if(conn->Pfdebug) that was built into the code for > other features. This certainly looks much better! Thanks for getting it done so quickly. I didn't review the code super closely. I do see a compiler warning: /pgsql/source/pipeline/src/interfaces/libpq/libpq-trace.c: In function 'pqTraceOutputNchar': /pgsql/source/pipeline/src/interfaces/libpq/libpq-trace.c:373:2: warning: argument 1 null where non-null expected [-Wnonnull] memcpy(v, buf + *cursor, len); ^ and then when I try to run the "libpq_pipeline" program from the other thread, I get a crash with this backtrace: Program received signal SIGSEGV, Segmentation fault. __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:288 288 ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No existe el fichero o el directorio. (gdb) bt #0 __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:288 #1 0x77f9b50b in pqTraceOutputNchar (buf=buf@entry=0x55564660 "P", LogEnd=LogEnd@entry=42, cursor=cursor@entry=0x7fffdeb4, pfdebug=0x55569320, len=1) at /pgsql/source/pipeline/src/interfaces/libpq/libpq-trace.c:373 #2 0x77f9bc25 in pqTraceOutputMsg (conn=conn@entry=0x55560260, msgCursor=, commsource=commsource@entry=MSGDIR_FROM_FRONTEND) at /pgsql/source/pipeline/src/interfaces/libpq/libpq-trace.c:476 #3 0x77f96280 in pqPutMsgEnd (conn=conn@entry=0x55560260) at /pgsql/source/pipeline/src/interfaces/libpq/fe-misc.c:533 ... The attached patch fixes it, though I'm not sure that that's the final form we want it to have (since we'd alloc and free repeatedly, making it slow -- perhaps better to use a static, or ...? ). -- Álvaro Herrera Valdivia, Chile Essentially, you're proposing Kevlar shoes as a solution for the problem that you want to walk around carrying a loaded gun aimed at your foot. (Tom Lane) diff --git a/src/interfaces/libpq/libpq-trace.c b/src/interfaces/libpq/libpq-trace.c index ef294fa556..5d3b40a1d0 100644 --- a/src/interfaces/libpq/libpq-trace.c +++ b/src/interfaces/libpq/libpq-trace.c @@ -368,7 +368,15 @@ pqTraceOutputBinary(char *v, int length, FILE *pfdebug) static void pqTraceOutputNchar(char *buf, int LogEnd, int *cursor, FILE *pfdebug, int len) { - char *v = '\0'; + char *v; + + v = malloc(len); + if (v == NULL) + { + fprintf(pfdebug, "'..%d chars..'", len); + *cursor += len; + return; + } memcpy(v, buf + *cursor, len); *cursor += len; @@ -377,6 +385,8 @@ pqTraceOutputNchar(char *buf, int LogEnd, int *cursor, FILE *pfdebug, int len) pqTraceOutputBinary(v, len, pfdebug); fprintf(pfdebug, "\'"); + free(v); + if (*cursor < LogEnd) fprintf(pfdebug, " "); else
Re: get rid of tags in the docs?
On Thu, Feb 4, 2021 at 11:31 AM Tom Lane wrote: > > John Naylor writes: > > While looking at the proposed removal of the v2 protocol, I noticed that we > > italicize some, but not all, instances of 'per se', 'pro forma', and 'ad > > hoc'. I'd say these are widespread enough in formal registers of English > > that they hardly need to be called out as foreign, so I propose removing > > the tags for those words. > > +1, nobody italicizes those in normal usage. Now that protocol v2 is gone, here's a patch to remove those tags. > > The other case is 'voilà', found in rules.sgml. The case for italics here > > is stronger, but looking at that file, I actually think a more > > generic-sounding phrase here would be preferable. > > Yeah, seeing that we only use that in one place, I think we could do > without it. Looks like something as pedestrian as "The results are:" > would do fine. Done that way. -- John Naylor EDB: http://www.enterprisedb.com v1-0001-Get-rid-of-foreignphrase-tags-in-the-docs.patch Description: Binary data
Re: proposal: unescape_text function
On 12/2/20 1:30 PM, Pavel Stehule wrote: On Wed, Dec 2, 2020 at 11:37, Pavel Stehule wrote: On Wed, Dec 2, 2020 at 9:23, Peter Eisentraut wrote: Heh. The fact that there is a table of two dozen possible representations kind of proves my point that we should be deliberate in picking one. I do see Oracle unistr() on that list, which appears to be very similar to what you are trying to do here. Maybe look into aligning with that. unistr is a primitive form of the proposed function. But it can be used as a base. The format is compatible with our "4.1.2.3. String Constants with Unicode Escapes". What do you think about the following proposal? 1. unistr(text) .. compatible with Postgres unicode escapes - it is enhanced compared to Oracle, because Oracle's unistr doesn't support 6-digit Unicode escapes. 2. There can be an optional parameter "prefix" with default "\". But with "\u" it can be compatible with Java or Python. What do you think about it? I thought about it a little bit more, and the prefix specification doesn't make much sense (even more so if we implement this functionality as a function named "unistr"). I removed the optional argument and renamed the function to "unistr". The functionality is the same. Now it supports the Oracle convention, Java and Python (for Python, U) and \+XX. These formats were already supported. The compatibility with Oracle is nice. Peter, it looks like Pavel has aligned this function with unistr() as you suggested. Thoughts? Regards, -- -David da...@pgmasters.net
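To illustrate the proposal, usage of the function would look roughly like the sketch below (it only works with the patch applied; the escape forms follow the description above).

    -- PostgreSQL-style \XXXX and \+XXXXXX escapes:
    SELECT unistr('d\0061t\+000061');       -- 'data'

    -- Java/Python-style \uXXXX and \UXXXXXXXX escapes:
    SELECT unistr('d\u0061t\U00000061');    -- 'data'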
Re: Lowering the ever-growing heap->pd_lower
On Tue, 9 Mar 2021 at 22:35, Tom Lane wrote: > > Matthias van de Meent writes: > > The only two existing mechanisms that I could find (in the access/heap > > directory) that possibly could fail on shrunken line pointer arrays; > > being xlog recovery (I do not have enough knowledge on recovery to > > determine if that may touch pages that have shrunken line pointer > > arrays, or if those situations won't exist due to never using dirtied > > pages in recovery) and backwards table scans on non-MVCC snapshots > > (which would be fixed in the attached patch). > > I think you're not visualizing the problem properly. > > The case I was concerned about back when is that there are various bits of > code that may visit a page with a predetermined TID in mind to look at. > An index lookup is an obvious example, and another one is chasing an > update chain's t_ctid link. You might argue that if the tuple was dead > enough to be removed, there should be no such in-flight references to > worry about, but I think such an assumption is unsafe. There is not, for > example, any interlock that ensures that a process that has obtained a TID > from an index will have completed its heap visit before a VACUUM that > subsequently removed that index entry also removes the heap tuple. I am aware of this problem. I will admit that I did not detected all potentially problematic accesses, so I'll show you my work. > So, to accept a patch that shortens the line pointer array, what we need > to do is verify that every such code path checks for an out-of-range > offset before trying to fetch the target line pointer. I believed > back in 2007 that there were, or once had been, code paths that omitted > such a range check, assuming that they could trust the TID they had > gotten from $wherever to point at an extant line pointer array entry. > Maybe things are all good now, but I think you should run around and > examine every place that checks for tuple deadness to see if the offset > it used is known to be within the current page bounds. In my search for problematic accesses, I make the following assumptions: * PageRepairFragmentation as found in bufpage is only applicable to heap pages; this function is not applied to other pages in core postgres. So, any problems that occur are with due to access with an OffsetNumber > PageGetMaxOffsetNumber. * Items [on heap pages] are only extracted after using PageGetItemId for that item on the page, whilst holding a lock. Under those assumptions, I ran "grep PageGetItemId" over the src directory. For all 227 results (as of 68b34b23) I checked if the page accessed (or item accessed thereafter) was a heap page or heap tuple. After analysis of the relevant references, I had the following results (attached full report, containing a line with the file & line number, and code line of the call, followed by a line containing the usage type): Count of usage type - usage type 4 - Not a call (comment) 7 - Callsite guarantees bounds 8 - Has assertion ItemIdIsNormal (asserts item is not removed; i.e. 
concurrent vacuum should not have been able to remove this item) 39 - Has bound checks 6 - Not a heap page (brin) 1 - Not a heap page (generic index) 24 - Not a heap page (gin) 21 - Not a heap page (gist) 14 - Not a heap page (hash) 60 - Not a heap page (nbtree) 1 - Not a heap page (sequence) 36 - Not a heap page (spgist) 2 - OffsetNumber is generated by PageAddItem 2 - problem case 1 (table scan) 1 - Problem case 2 (xlog) 1 - Problem case 3 (invalid redirect pointers) The 3 problem cases were classified based on the origin of the potentially invalid pointer. Problem case 1: table scan; heapam.c lines 678 and 811, in heapgettup The table scan maintains a state which contains a page-bound OffsetNumber, which it uses as a cursor whilst working through the pages of the relation. In forward scans, the bounds of the page are re-validated at the start of the 'while (linesleft > 0)' loop at 681, but for backwards scans this check is invalid, because it incorrectly assumes that the last OffsetNumber is guaranteed to still exist. For MVCC snapshots, this is true (the previously returned value must have been visible in its snapshot, therefore cannot have been vacuumed because the snapshot is still alive), but non-mvcc snapshots may have returned a dead tuple, which is now vacuumed and truncated away. The first occurrance of this issue is easily fixed with the changes as submitted in patch v2. The second problem case (on line 811) is for forward scans, where the line pointer array could have been truncated to 0 length. As the code uses a hardcoded offset of FirstOffsetNumber (=1), that might point into arbitrary data. The reading of this arbitrary data is saved by the 'while (linesleft > 0) check', because at that point linesleft will be PageGetMaxOffsetNumber, which would then equal 0. Problem case 2: xlog; heapam.c line 8796, in heap_xlog_freeze_page This is in the replay of transaction lo
Re: Parallel INSERT (INTO ... SELECT ...)
On Wed, Mar 10, 2021 at 2:48 PM Amit Kapila wrote: > > On Mon, Mar 8, 2021 at 7:19 PM Amit Langote wrote: > > > > Hi Amit > > > > On Mon, Mar 8, 2021 at 10:18 PM Amit Kapila wrote: > > > On Mon, Mar 8, 2021 at 3:54 PM Greg Nancarrow wrote: > > > > I've attached an updated set of patches with the suggested locking > > > > changes. > > > > (Thanks Greg.) > > > > > Amit L, others, do let me know if you have still more comments on > > > 0001* patch or if you want to review it further? > > > > I just read through v25 and didn't find anything to complain about. > > > > Thanks a lot, pushed now! Amit L., your inputs are valuable for this work. > > Now, coming back to Hou-San's patch to introduce a GUC and reloption > for this feature, I think both of those make sense to me because when > the feature is enabled via GUC, one might want to disable it for > partitioned tables? Do we agree on that part or someone thinks > otherwise? I think it makes sense to have both. > > The other points to bikeshed could be: > 1. The name of GUC and reloption. The two proposals at hand are > enable_parallel_dml and enable_parallel_insert. I would prefer the > second (enable_parallel_insert) because updates/deletes might not have > a similar overhead. I agree enable_parallel_insert makes more sense. > 2. Should we keep the default value of GUC to on or off? It is > currently off. I am fine keeping it off for this release and we can > always turn it on in the later releases if required. Having said that, > I see the value in keeping it on because in many cases Insert ... > Select will be used for large data and there we will see a benefit of > parallelism and users facing trouble (who have a very large number of > partitions with less data to query) can still disable the parallel > insert for that particular table. Also, the other benefit of keeping > it on till at least the beta period is that this functionality will > get tested and if we found reports of regression then we can turn it > off for this release as well. > > Thoughts? IMHO, we should keep it on because in most of the cases it is going the give benefit to the user, and if for some specific scenario where a table has a lot of partition then it can be turned off using reloption. And, if for some users most of the tables are like that where they always have a lot of partition then the user can turn it off using guc. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: [PATCH] SET search_path += octopus
On 12/1/20 9:49 AM, Abhijit Menon-Sen wrote: Therefore, for lack of any plausible way to proceed, I do not plan to work on this feature any more. It sounds like this CF entry should be closed. I'll do that on March 12 unless somebody beats me to it. Regards, -- -David da...@pgmasters.net
Re: libpq debug log
I also found that DataRow messages are not being printed. This seems to fix that in the normal case and singlerow, at least in pipeline mode. Didn't verify the non-pipeline mode. -- Álvaro Herrera39°49'30"S 73°17'W "Nunca se desea ardientemente lo que solo se desea por razón" (F. Alexandre) diff --git a/src/interfaces/libpq/fe-protocol3.c b/src/interfaces/libpq/fe-protocol3.c index 68f0f6a081..e8db5edb71 100644 --- a/src/interfaces/libpq/fe-protocol3.c +++ b/src/interfaces/libpq/fe-protocol3.c @@ -869,6 +869,9 @@ getAnotherTuple(PGconn *conn, int msgLength) goto advance_and_error; } + if (conn->Pfdebug) + pqTraceOutputMsg(conn, conn->inStart, MSGDIR_FROM_BACKEND); + /* Advance inStart to show that the "D" message has been processed. */ conn->inStart = conn->inCursor;
Re: WIP: document the hook system
On 3/9/21 12:20 PM, Bruce Momjian wrote: On Sat, Mar 6, 2021 at 08:32:43PM -0500, Tom Lane wrote: I think that the best you should hope for here is that people are willing to add a short, not-too-detailed para to a markup-free plain-text README file that lists all the hooks. As soon as it gets any more complex than that, either the doco aspect will be ignored, or there simply won't be any more hooks. (I'm afraid I likewise don't believe in the idea of carrying a test module for each hook. Again, requiring that is a good way to ensure that new hooks just won't happen.) Agreed. If you document the hooks too much, it allows them to drift away from matching the code, which makes the hook documentation actually worse than having no hook documentation at all. There doesn't seem to be agreement on how to proceed here, so closing. David, if you do decide to proceed with a README then it would probably be best to create a new thread/entry. Regards, -- -David da...@pgmasters.net
Re: Boundary value check in lazy_tid_reaped()
On 10.03.21 02:29, Masahiko Sawada wrote: There is no noticeable effect of inlining lazy_tid_reaped(). So it would be better to not do that. Attached the patch that doesn't inline lazy_tid_reaped(). committed
Re: Huge memory consumption on partitioned table with FKs
On Wed, Mar 10, 2021 at 8:37 AM Tom Lane wrote: > Amit Langote writes: > > On Fri, Mar 5, 2021 at 6:00 AM Tom Lane wrote: > >> This claim seems false on its face: > >>> All child constraints of a given foreign key constraint can use the > >>> same RI query and the resulting plan, that is, no need to create as > >>> many copies of the query and the plan as there are partitions, as > >>> happens now due to the child constraint OID being used in the key > >>> for ri_query_cache. > > > The quoted comment could have been written to be clearer about this, > > but it's not talking about the table that is to be queried, but the > > table whose RI trigger is being executed. In all the cases except one > > (mentioned below), the table that is queried is the same irrespective > > of which partition's trigger is being executed, so we're basically > > creating the same plan separately for each partition. > > Hmm. So, the key point is that the values coming from the partitioned > child table are injected into the test query as parameters, not as > column references, thus it doesn't matter *to the test query* what > numbers the referencing columns have in that child. We just have to > be sure we pass the right parameter values. Right. > But ... doesn't the code > use riinfo->fk_attnums[] to pull out the values to be passed? Yes, from a slot that belongs to the child table. > IOW, I now get the point about being able to share the SPI plans, > but I'm still dubious about sharing the RI_ConstraintInfo cache entries. There was actually a proposal upthread about sharing the RI_ConstraintInfo too, but we decided to not pursue that for now. > It looks to me like the v4 patch is now actually not sharing the > cache entries, ie their hash key is just the child constraint OID > same as before; Yeah, you may see that we're only changing ri_BuildQueryKey() in the patch affecting only ri_query_cache, but not ri_LoadConstraintInfo() which leaves ri_constraint_cache unaffected. > but the comments are pretty confused about this. I've tried improving the comment in ri_BuildQueryKey() a bit to make clear what is and what is not being shared between partitions. > It might be simpler if you add just one new field which is the OID of > the constraint that we're building the SPI query from, which might be > either equal to constraint_id, or the OID of some parent constraint. > In particular it's not clear to me why we need both constraint_parent > and constraint_root_id. Yeah, I think constraint_parent is a leftover from some earlier hacking. I have removed it. Attached updated patch. -- Amit Langote EDB: http://www.enterprisedb.com v5-0001-Share-SPI-plan-between-partitions-in-some-RI-trig.patch Description: Binary data
Re: [PATCH] Covering SPGiST index
On 12/4/20 12:31 PM, Pavel Borisov wrote: The cfbot's still unhappy --- looks like you omitted a file from the patch? You are right, thank you. Corrected this. PFA v13 Tom, do the changes as enumerated in [1] look like they are going in the right direction? Regards, -- -David da...@pgmasters.net [1] https://www.postgresql.org/message-id/CALT9ZEEszJUwsXMWknXQ3k_YbGtQaQwiYxxEGZ-pFGRUDSXdtQ%40mail.gmail.com
Re: WIP: document the hook system
On Wed, Mar 10, 2021 at 09:38:39AM -0500, David Steele wrote: > On 3/9/21 12:20 PM, Bruce Momjian wrote: > > On Sat, Mar 6, 2021 at 08:32:43PM -0500, Tom Lane wrote: > > > I think that the best you should hope for here is that people are > > > willing to add a short, not-too-detailed para to a markup-free > > > plain-text README file that lists all the hooks. As soon as it > > > gets any more complex than that, either the doco aspect will be > > > ignored, or there simply won't be any more hooks. > > > > > > (I'm afraid I likewise don't believe in the idea of carrying a test > > > module for each hook. Again, requiring that is a good way to > > > ensure that new hooks just won't happen.) > > > > Agreed. If you document the hooks too much, it allows them to drift > > away from matching the code, which makes the hook documentation actually > > worse than having no hook documentation at all. > > There's doesn't seem to be agreement on how to proceed here, so closing. > > David, if you do decide to proceed with a README then it would probably be > best to create a new thread/entry. Thanks for the work on this and the helpful feedback! Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: WIP: System Versioned Temporal Table
Hi Ibrar, thank you for rebasing. On Mon, Mar 8, 2021 at 9:34 AM Ibrar Ahmed wrote: > >> Since the get_row_start_time_col_name() and get_row_end_time_col_name() >> are similar, IMO we can pass a flag to get StartTime/EndTime column name, >> thought? >> >> For me your option is better. I will change to it in my next patch if there is no objection. regards Surafel
Re: Evaluate expression at planning time for two more cases
Hi Ibrar, On Mon, Mar 8, 2021 at 8:13 AM Ibrar Ahmed wrote: > > It was a minor change therefore I rebased the patch, please take a look. > It is perfect, thank you. regards Surafel
Re: WIP: System Versioned Temporal Table
On 3/10/21 5:49 PM, Surafel Temesgen wrote: > hi Ibrar, > thank you for rebasing > > On Mon, Mar 8, 2021 at 9:34 AM Ibrar Ahmed wrote: > >> >>> Since the get_row_start_time_col_name() and get_row_end_time_col_name() >>> are similar, IMO we can pass a flag to get StartTime/EndTime column name, >>> thought? >>> >>> > For me your option is better. i will change to it in my next > patch if no objection I have plenty of objection. I'm sorry that I am taking so long with my review. I am still working on it and it is coming soon, I promise. -- Vik Fearing
Re: cryptohash: missing locking functions for OpenSSL <= 1.0.2?
On Wed, 2021-03-10 at 17:21 +0900, Michael Paquier wrote: > Do you mean something like the attached? Yes! Patch LGTM. --Jacob
Re: Add header support to text format and matching feature
On 12/7/20 6:40 PM, Rémi Lapeyre wrote: Hi, here’s a rebased version of the patch. Michael, since the issue of duplicated options has been fixed do either of these patches look like they are ready for commit? Regards, -- -David da...@pgmasters.net
Re: Consider parallel for lateral subqueries with limit
On 12/7/20 6:45 PM, James Coleman wrote: On Sun, Dec 6, 2020 at 7:34 PM Brian Davis wrote: Played around with this a bit, here's a non-correlated subquery that gets us to that if statement While I haven't actually tracked down to guarantee this is handled elsewhere, a thought experiment -- I think -- shows it must be so. Here's why: suppose we don't have a limit here, but the query return order is different in different backends. Then we would have the same problem you bring up. In that case this code is already setting consider_parallel=true on the rel. So I don't think we're changing any behavior here. So it looks like you and Brian are satisfied that this change is not allowing bad behavior. Seems like an obvious win. Hopefully we can get some other concurring opinions. Regards, -- -David da...@pgmasters.net
Re: get rid of tags in the docs?
John Naylor writes: > On Thu, Feb 4, 2021 at 11:31 AM Tom Lane wrote: >> +1, nobody italicizes those in normal usage. > Now that protocol v2 is gone, here's a patch to remove those tags. Pushed. regards, tom lane
Re: Speeding up GIST index creation for tsvectors
On Mon, Mar 8, 2021 at 8:43 AM Amit Khandekar wrote: > > On Wed, 3 Mar 2021 at 23:32, John Naylor wrote: > > 0001: > > > > + /* > > + * We can process 64-bit chunks only if both are mis-aligned by the same > > + * number of bytes. > > + */ > > + if (b_aligned - b == a_aligned - a) > > > > The obvious question here is: how often are they identically misaligned? You > > don't indicate that your measurements differ in a bimodal fashion, so does > > that mean they happen to be always (mis)aligned? > > I ran CREATE INDEX on tsvector columns using the tsearch.data that I > had attached upthread, with some instrumentation; here are the > proportions : > 1. In 15% of the cases, only one among a and b was aligned. The other > was offset from the 8-byte boundary by 4 bytes. > 2. 6% of the cases, both were offset by 4 bytes, i.e. identically misaligned. > 3. Rest of the cases, both were aligned. That's somewhat strange. I would have assumed always aligned, or usually not. It sounds like we could require them both to be aligned and not bother with the byte-by-byte prologue. I also wonder if it's worth it to memcpy to a new buffer if the passed pointer is not aligned. > > + /* For smaller lengths, do simple byte-by-byte traversal */ > > + if (bytes <= 32) > > > > You noted upthread: > > > > > Since for the gist index creation of some of these types the default > > > value for siglen is small (8-20), I tested with small siglens. For > > > siglens <= 20, particularly for values that are not multiples of 8 > > > (e.g. 10, 13, etc), I see a 1-7 % reduction in speed of index > > > creation. It's probably because of > > > an extra function call for pg_xorcount(); and also might be due to the > > > extra logic in pg_xorcount() which becomes prominent for shorter > > > traversals. So for siglen less than 32, I kept the existing method > > > using byte-by-byte traversal. > > > > I wonder if that can be addressed directly, while cleaning up the loops and > > alignment checks in pg_xorcount_long() a little. For example, have a look at > > pg_crc32c_armv8.c -- it might be worth testing a similar approach. > > Yeah, we can put the bytes <= 32 condition inside pg_xorcount_long(). > I avoided that to not hamper the <= 32 scenarios. Details explained > below for "why inline pg_xorcount is calling global function" > > > Also, pardon my ignorance, but where can I find the default siglen for various types? > Check SIGLEN_DEFAULT. Okay, so it's hard-coded in various functions in contrib modules. In that case, let's just keep the existing coding for those. In fact, the comments that got removed by your patch specifically say: /* Using the popcount functions here isn't likely to win */ ...which your testing confirmed. The new function should be used only for Gist and possibly Intarray, whose default is 124. That means we can't get rid of hemdistsign(), but that's fine. Alternatively, we could say that a small regression is justifiable reason to refactor all call sites, but I'm not proposing that. > > (I did make sure to remove indirect calls from the retail functions > > in [1], in case we want to go that route). > > Yeah, I quickly had a look at that. I am still going over that thread. > Thanks for the exhaustive analysis there. I'll post a patch soon that builds on that, so you can see what I mean. -- John Naylor EDB: http://www.enterprisedb.com
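For readers following along, a minimal sketch of the kind of helper being discussed -- popcount of the XOR of two signatures, with a plain byte loop for short inputs and 64-bit chunks otherwise -- might look like the following. This is illustrative only; it is not the submitted patch, and it sidesteps the alignment question by memcpy'ing into local variables, one of the options mentioned above.

#include "postgres.h"
#include "port/pg_bitutils.h"

/* Hypothetical hamming-distance helper; not the patch's actual pg_xorcount(). */
static uint64
xorcount_sketch(const unsigned char *a, const unsigned char *b, Size bytes)
{
	uint64		popcnt = 0;

	/* short signatures: byte-by-byte, as the existing hemdistsign() does */
	if (bytes < 32)
	{
		while (bytes--)
			popcnt += pg_number_of_ones[*a++ ^ *b++];
		return popcnt;
	}

	/* longer signatures: 8 bytes at a time; memcpy avoids alignment issues */
	for (; bytes >= 8; bytes -= 8, a += 8, b += 8)
	{
		uint64		xa,
					xb;

		memcpy(&xa, a, 8);
		memcpy(&xb, b, 8);
		popcnt += pg_popcount64(xa ^ xb);
	}

	/* remaining tail bytes */
	while (bytes--)
		popcnt += pg_number_of_ones[*a++ ^ *b++];

	return popcnt;
}

Whether the byte-by-byte cutoff lives inside or outside such a helper is exactly the trade-off discussed above for the small default siglens.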
Re: Huge memory consumption on partitioned table with FKs
Amit Langote writes: > On Wed, Mar 10, 2021 at 8:37 AM Tom Lane wrote: >> Hmm. So, the key point is that the values coming from the partitioned >> child table are injected into the test query as parameters, not as >> column references, thus it doesn't matter *to the test query* what >> numbers the referencing columns have in that child. We just have to >> be sure we pass the right parameter values. > Right. I did some cosmetic fooling with this (mostly, rewriting the comments YA time) and pushed it. regards, tom lane
Re: libpq debug log
On 2021-Mar-10, 'alvhe...@alvh.no-ip.org' wrote: > I also found that DataRow messages are not being printed. This seems to > fix that in the normal case and singlerow, at least in pipeline mode. > Didn't verify the non-pipeline mode. Hm, and RowDescription ('T') messages are also not being printed ... so I think there's some larger problem here, and perhaps it needs to be fixed using a different approach. After staring at it a couple of times, I think that the places in pqParseInput3() where there's a comment "... moves inStart itself" and then "continue;" should have a call to pqTraceOutputMsg(), since AFAIU those are the places where the message in question would not reach the "Successfully consumed this message" block that prints each message. -- Álvaro Herrera39°49'30"S 73°17'W Maybe there's lots of data loss but the records of data loss are also lost. (Lincoln Yeoh)
Re: libpq debug log
"'alvhe...@alvh.no-ip.org'" writes: > After staring at it a couple of times, I think that the places in > pqParseInput3() where there's a comment "... moves inStart itself" and > then "continue;" should have a call to pqTraceOutputMsg(), since AFAIU > those are the places where the message in question would not reach the > "Successfully consumed this message" block that prints each message. Yeah, the whole business of just when a message has been "consumed" is a stumbling block for libpq tracing. It was a real mess with the existing code and v2 protocol, because we could actually try to parse a message more than once, with the first few tries deciding that the message wasn't all here yet (but nonetheless emitting partial trace output). Now that v2 is dead, the logic to abort because of the message not being all arrived yet is basically useless: only the little bit of code concerned with the message length word really needs to cope with that. It's tempting to go through and get rid of all the now-unreachable "return"s and such, but it seemed like it would be a lot of code churn for not really that much gain. I didn't look at the new version of the patch yet, so I'm not sure whether the issues it still has are related to this. regards, tom lane
Re: Occasional tablespace.sql failures in check-world -jnn
Hi, On 2021-03-10 15:40:38 +0900, Michael Paquier wrote: > On Mon, Mar 08, 2021 at 11:53:57AM +0100, Peter Eisentraut wrote: > > On 09.12.20 08:55, Michael Paquier wrote: > >> ... Because we may still introduce this problem again if some new > >> stuff uses src/test/pg_regress in a way similar to pg_upgrade, > >> triggering again tablespace-setup. Something like the attached may be > >> enough, though I have not spent much time checking the surroundings, > >> Windows included. > > > > This patch looks alright to me. > > So, I have spent more time checking the surroundings of this patch, > and finally applied it. Thanks for the review, Peter. Thanks! Greetings, Andres Freund
Re: libpq debug log
"'alvhe...@alvh.no-ip.org'" writes: > After staring at it a couple of times, I think that the places in > pqParseInput3() where there's a comment "... moves inStart itself" and > then "continue;" should have a call to pqTraceOutputMsg(), since AFAIU > those are the places where the message in question would not reach the > "Successfully consumed this message" block that prints each message. After digging around a little, I remember that there are a *bunch* of places in fe-protocol3.c that advance inStart. The cleanest way to shove this stuff in without rethinking that logic would be to call pqTraceOutputMsg immediately before each such advance (using the old inStart value as the message start address). A possible advantage of doing it like that is that we'd be aware by that point of whether we think the message is good or bad (or whether it was good but we failed to process it, perhaps because of OOM). Seems like that could be a useful thing to include in the log. Or we could rethink the logic. It's not quite clear to me, after all this time, why getRowDescriptions() et al are allowed to move inStart themselves rather than letting the main loop in pqParseInput3 do it. It might well be an artifact of having not rewritten the v2 logic too much. regards, tom lane
Re: pg_amcheck contrib application
On Wed, Mar 10, 2021 at 11:10 AM Mark Dilger wrote: > Once again, I think you are right and have removed the objectionable > behavior, but > > The --startblock and --endblock options make the most sense when the user is > only checking one table, like > > pg_amcheck --startblock=17 --endblock=19 --table=my_schema.my_corrupt_table > > because the user likely has some knowledge about that table, perhaps from a > prior run of pg_amcheck. The --startblock and --endblock arguments are a bit > strange when used globally, as relations don't all have the same number of > blocks, so > > pg_amcheck --startblock=17 --endblock=19 mydb > > will very likely emit lots of error messages for tables which don't have > blocks in that range. That's not entirely pg_amcheck's fault, as it just did > what the user asked, but it also doesn't seem super helpful. I'm not going > to do anything about it in this release. +1 to all that. I tend toward the opinion that trying to make --startblock and --endblock do anything useful in the context of checking multiple relations is not really possible, and therefore we just shouldn't put any effort into it. But if user feedback shows otherwise, we can always do something about it later. > After running 'make installcheck', if I delete all entries from pg_class > where relnamespace = 'pg_toast'::regclass, by running 'pg_amcheck > regression', I get lines that look like this: > > heap relation "regression"."public"."quad_poly_tbl": > ERROR: could not open relation with OID 17177 In this here example, the first line ends in a colon. > relation "regression"."public"."functional_dependencies", block 28, offset > 54, attribute 0 > attribute 0 with length 4294967295 ends at offset 50 beyond total tuple > length 43 But this here one does not. Seems like it should be consistent. The QUALIFIED_NAME_FIELDS macro doesn't seem to be used anywhere, which is good, because macros with unbalanced parentheses are usually not a good plan; and a macro that expands to a comma-separated list of things is suspect too. "invalid skip options\n" seems too plural. With regard to your use of strtol() for --{start,end}block, telling the user that their input is garbage seems pejorative, even though it may be accurate. Compare: [rhaas EDBAS]$ pg_dump -jdsgdsgd pg_dump: error: invalid number of parallel jobs In the message "relation end block argument precedes start block argument\n", I think you could lose both instances of the word "argument" and probably the word "relation" as well. I actually don't know why all of these messages about start and end block mention "relation". It's not like there is some other kind of non-relation-related start block with which it could be confused. The comment for run_command() explains some things about the cparams argument, but those things are false. In fact the argument is unused. Usual PostgreSQL practice when freeing memory in e.g. verify_heap_slot_handler is to set the pointers to NULL as well. The performance cost of this is trivial, and it makes debugging a lot easier should somebody accidentally write code to access one of those things after it's been freed. The documentation says that -D "does exclude any database that was listed explicitly as dbname on the command line, nor does it exclude the database chosen in the absence of any dbname argument." The first part of this makes complete sense to me, but I'm not sure about the second part. If I type pg_amcheck --all -D 'r*', I think I'm expecting that "rhaas" won't be checked. 
Likewise, if I say pg_amcheck -d 'bob*', I think I only want to check the bob-related databases and not rhaas. I suggest documenting --endblock as "Check table blocks up to and including the specified ending block number. An error will occur if a relation being checked has fewer than this number of blocks." And similarly for --startblock: "Check table blocks beginning with the specified block number. An error will occur, etc." Perhaps even mention something like "This option is probably only useful when checking a single table." Also, the documentation here isn't clear that this affects only table checking, not index checking. It appears that pg_amcheck sometimes makes dummy connections to the database that don't do anything, e.g. pg_amcheck -t 'q*' resulted in: 2021-03-10 15:00:14.273 EST [95473] LOG: connection received: host=[local] 2021-03-10 15:00:14.274 EST [95473] LOG: connection authorized: user=rhaas database=rhaas application_name=pg_amcheck 2021-03-10 15:00:14.286 EST [95473] LOG: statement: SELECT pg_catalog.set_config('search_path', '', false); 2021-03-10 15:00:14.290 EST [95464] DEBUG: forked new backend, pid=95474 socket=11 2021-03-10 15:00:14.291 EST [95464] DEBUG: server process (PID 95473) exited with exit code 0 2021-03-10 15:00:14.291 EST [95474] LOG: connection received: host=[local] 2021-03-10 15:00:14.293 EST [95474] LOG: connection authorized: user=rhaas database=rhaas appl
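As an aside on the strtol() point above, a friendlier validation routine in the style of other frontend tools could look roughly like this. It is a hypothetical sketch, not the actual pg_amcheck code; the function name and message wording are illustrative.

#include "postgres_fe.h"

#include <errno.h>

#include "common/logging.h"

/* Hypothetical option parser for --startblock/--endblock style arguments. */
static int64
parse_block_number(const char *optname, const char *value)
{
	char	   *endptr;
	long long	blkno;

	errno = 0;
	blkno = strtoll(value, &endptr, 10);
	if (errno != 0 || endptr == value || *endptr != '\0' || blkno < 0)
	{
		pg_log_error("invalid %s block number: \"%s\"", optname, value);
		exit(1);
	}
	return (int64) blkno;
}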
Re: [HACKERS] Custom compression methods
On Wed, Mar 10, 2021 at 6:52 AM Dilip Kumar wrote: > The pending comment is providing a way to rewrite a table and > re-compress the data with the current compression method. I spent some time poking at this yesterday and couldn't figure out what was going on here. There are two places where we rewrite tables. One is the stuff in cluster.c, which handles VACUUM FULL and CLUSTER. That eventually calls reform_and_rewrite_tuple(), which deforms the old tuple and creates a new one, but it doesn't seem like there's anything in there that would expand toasted values, whether external or inline compressed. But I think that can't be right, because it seems like then you'd end up with toast pointers into the old TOAST relation, not the new one, which would cause failures later. So I must be missing something here. The other place where we rewrite tables is in ATRewriteTable() as part of the ALTER TABLE machinery. I don't see anything there to force detoasting either. That said, I think that using the word REWRITE may not really capture what we're on about. Leaving aside the question of exactly what the CLUSTER code does today, you could in theory rewrite the main table by just taking all the tuples and putting them into a new relfilenode. And then you could do the same thing with the TOAST table. And despite having fully rewritten both tables, you wouldn't have done anything that helps with this problem because you haven't deformed the tuples at any point. Now as it happens we do have code -- in reform_and_rewrite_tuple() -- that does deform and reform the tuples, but it doesn't take care of this problem either. We might need to distinguish between rewriting the table, which is mostly about getting a new relfilenode, and some other word that means doing this. But, I am not really convinced that we need to solve this problem by adding new ALTER TABLE syntax. I'd be happy enough if CLUSTER, VACUUM FULL, and versions of ALTER TABLE that already force a rewrite would cause the compression to be redone also. Honestly, even if the user had to fall back on creating a new table and doing INSERT INTO newtab SELECT * FROM oldtab I would consider that to be not a total showstopper for this .. assuming of course that it actually works. If it doesn't, we have big problems. Even without the pg_am stuff, we still need to make sure that we don't just blindly let compressed values wander around everywhere. When we insert into a table column with a compression method, we should recompress any data that is compressed using some other method. -- Robert Haas EDB: http://www.enterprisedb.com
Re: libpq debug log
On 2021-Mar-10, Tom Lane wrote: > "'alvhe...@alvh.no-ip.org'" writes: > > After staring at it a couple of times, I think that the places in > > pqParseInput3() where there's a comment "... moves inStart itself" and > > then "continue;" should have a call to pqTraceOutputMsg(), since AFAIU > > those are the places where the message in question would not reach the > > "Successfully consumed this message" block that prints each message. > > Yeah, the whole business of just when a message has been "consumed" > is a stumbling block for libpq tracing. It was a real mess with the > existing code and v2 protocol, because we could actually try to parse > a message more than once, with the first few tries deciding that the > message wasn't all here yet (but nonetheless emitting partial trace > output). Hmm, that makes sense, but the issue I'm reporting is not the same, unless I misunderstand you. > Now that v2 is dead, the logic to abort because of the message not > being all arrived yet is basically useless: only the little bit of > code concerned with the message length word really needs to cope with > that. It's tempting to go through and get rid of all the now-unreachable > "return"s and such, but it seemed like it would be a lot of code churn for > not really that much gain. That sounds like an interesting exercise, and I bet it'd bring a lot of code readability improvements. > I didn't look at the new version of the patch yet, so I'm not > sure whether the issues it still has are related to this. The issues I noticed are related to separate messages rather than one message split in pieces -- for example several DataRow messages are processed internally in a loop, rather than each individually. The number of places that need to be adjusted for things to AFAICT work correctly are few enough; ISTM that the attached patch is sufficient. (The business with the "logged" boolean is necessary so that we print to the trace file any of those messages even if they are deviating from the protocol.) -- Álvaro Herrera Valdivia, Chile "La experiencia nos dice que el hombre peló millones de veces las patatas, pero era forzoso admitir la posibilidad de que en un caso entre millones, las patatas pelarían al hombre" (Ijon Tichy)
Re: libpq debug log
On 2021-Mar-10, Tom Lane wrote: > Or we could rethink the logic. It's not quite clear to me, after > all this time, why getRowDescriptions() et al are allowed to > move inStart themselves rather than letting the main loop in > pqParseInput3 do it. It might well be an artifact of having not > rewritten the v2 logic too much. I would certainly prefer that the logic stays put for the time being, while I finalize the pipelining stuff. -- Álvaro Herrera Valdivia, Chile
Re: libpq debug log
I wrote: > Or we could rethink the logic. It's not quite clear to me, after > all this time, why getRowDescriptions() et al are allowed to > move inStart themselves rather than letting the main loop in > pqParseInput3 do it. It might well be an artifact of having not > rewritten the v2 logic too much. After studying this further, I think we should apply the attached patch to remove that responsibility from pqParseInput3's subroutines. This will allow a single trace call near the bottom of pqParseInput3 to handle all cases that that function processes. There are still some cowboy advancements of inStart in the functions concerned with COPY processing, but it doesn't look like there's much to be done about that. Those spots will need their own trace calls. BTW, while looking at this I concluded that getParamDescriptions is actually buggy: if it's given a malformed message that seems to need more data than the message length specifies, it just returns EOF, resulting in an infinite loop. This function apparently got missed while converting the logic from v2 (where that was the right thing) to v3 (where it ain't). So that bit needs to be back-patched. I'm tempted to back-patch the whole thing though. regards, tom lane diff --git a/src/interfaces/libpq/fe-protocol3.c b/src/interfaces/libpq/fe-protocol3.c index 2ca8c057b9..233456fd90 100644 --- a/src/interfaces/libpq/fe-protocol3.c +++ b/src/interfaces/libpq/fe-protocol3.c @@ -290,8 +290,6 @@ pqParseInput3(PGconn *conn) /* First 'T' in a query sequence */ if (getRowDescriptions(conn, msgLength)) return; - /* getRowDescriptions() moves inStart itself */ - continue; } else { @@ -337,8 +335,7 @@ pqParseInput3(PGconn *conn) case 't': /* Parameter Description */ if (getParamDescriptions(conn, msgLength)) return; - /* getParamDescriptions() moves inStart itself */ - continue; + break; case 'D': /* Data Row */ if (conn->result != NULL && conn->result->resultStatus == PGRES_TUPLES_OK) @@ -346,8 +343,6 @@ pqParseInput3(PGconn *conn) /* Read another tuple of a normal query response */ if (getAnotherTuple(conn, msgLength)) return; - /* getAnotherTuple() moves inStart itself */ - continue; } else if (conn->result != NULL && conn->result->resultStatus == PGRES_FATAL_ERROR) @@ -462,7 +457,6 @@ handleSyncLoss(PGconn *conn, char id, int msgLength) * command for a prepared statement) containing the attribute data. * Returns: 0 if processed message successfully, EOF to suspend parsing * (the latter case is not actually used currently). - * In the former case, conn->inStart has been advanced past the message. */ static int getRowDescriptions(PGconn *conn, int msgLength) @@ -577,9 +571,6 @@ getRowDescriptions(PGconn *conn, int msgLength) /* Success! */ conn->result = result; - /* Advance inStart to show that the "T" message has been processed. */ - conn->inStart = conn->inCursor; - /* * If we're doing a Describe, we're done, and ready to pass the result * back to the client. @@ -603,9 +594,6 @@ advance_and_error: if (result && result != conn->result) PQclear(result); - /* Discard the failed message by pretending we read it */ - conn->inStart += 5 + msgLength; - /* * Replace partially constructed result with an error result. First * discard the old result to try to win back some memory. @@ -624,6 +612,12 @@ advance_and_error: appendPQExpBuffer(&conn->errorMessage, "%s\n", errmsg); pqSaveErrorResult(conn); + /* + * Show the message as fully consumed, else pqParseInput3 will overwrite + * our error with a complaint about that. 
+ */ + conn->inCursor = conn->inStart + 5 + msgLength; + /* * Return zero to allow input parsing to continue. Subsequent "D" * messages will be ignored until we get to end of data, since an error @@ -635,12 +629,8 @@ advance_and_error: /* * parseInput subroutine to read a 't' (ParameterDescription) message. * We'll build a new PGresult structure containing the parameter data. - * Returns: 0 if completed message, EOF if not enough data yet. - * In the former case, conn->inStart has been advanced past the message. - * - * Note that if we run out of data, we have to release the partially - * constructed PGresult, and rebuild it again next time. Fortunately, - * that shouldn't happen often, since 't' messages usually fit in a packet. + * Returns: 0 if processed message successfully, EOF to suspend parsing + * (the latter case is not actually used currently). */ static int getParamDescriptions(PGconn *conn, int msgLength) @@ -690,23 +680,16 @@ getParamDescriptions(PGconn *conn, int msgLength) /* Success! */ conn->result = result; - /* Advance inStart to show that the "t" message has been processed. */ - conn->inStart = conn->inCursor; - return 0; not_enough_data: - PQclear(result); - return EOF; + errmsg =
Re: Replace buffer I/O locks with condition variables (reviving an old patch)
On Tue, Mar 9, 2021 at 6:24 PM Julien Rouhaud wrote: > > The old I/O lock array was the only user of struct > > LWLockMinimallyPadded, added in commit 6150a1b08a9, and it seems kinda > > strange to leave it in the tree with no user. Of course it's remotely > > possible there are extensions using it (know of any?). In the > > attached, I've ripped that + associated commentary out, because it's > > fun to delete dead code. Objections? > > None from me. I don't know of any extension relying on it, and neither does > codesearch.debian.net. I would be surprised to see any extension actually > relying on that anyway. Thanks for checking! > > Since the whole reason for that out-of-line array in the first place > > was to keep BufferDesc inside one cache line, and since it is in fact > > possible to put a new condition variable into BufferDesc without > > exceeding 64 bytes on a 64 bit x86 box, perhaps we should just do that > > instead? I haven't yet considered other architectures or potential > > member orders. > > +1 for adding the cv into BufferDesc. That brings the struct size to exactly > 64 bytes on x86 64 bits architecture. This won't add any extra overhead to > LOCK_DEBUG cases, as it was already exceeding the 64B threshold, if that even > was a concern. I also checked that it's 64B on an Arm box. Not sure about POWER. But... despite the fact that it looks like a good change in isolation, I decided to go back to the separate array in this initial commit, because the AIO branch also wants to add a new BufferDesc member[1]. I may come back to that change, if we can make some more space (seems entirely doable, but I'd like to look into that separately). > > I wonder if we should try to preserve user experience a little harder, > > for the benefit of people who have monitoring queries that look for > > this condition. Instead of inventing a new wait_event value, let's > > just keep showing "BufferIO" in that column. In other words, the > > change is that wait_event_type changes from "LWLock" to "IPC", which > > is a pretty good summary of this patch. Done in the attached. Does > > this make sense? > > I think it does make sense, and it's good to preserve this value. > > Looking at the patch itself, I don't have much to add it all looks sensible > and > I agree with the arguments in the first mail. All regression tests pass and > documentation builds. I found one more thing to tweak: a reference in the README. > I'm marking this patch as RFC. Thanks for the review. And of course to Robert for writing the patch. Pushed. [1] https://github.com/anarazel/postgres/blob/aio/src/include/storage/buf_internals.h#L190
fdatasync performance problem with large number of DB files
I initially posted this on the pgsql-general mailing list [5] but was advised to also post this to the -hackers list as it deals with internals. We've encountered a production performance problem with pg13 related to how it fsyncs the whole data directory in certain scenarios, related to what Paul (bcc'ed) described in a post to pgsql-hackers [1]. Background: We've observed the full recursive fsync is triggered when * pg_basebackup receives a streaming backup (via [2] fsync_dir_recurse or fsync_pgdata) unless --no-sync is specified * postgres starts up unclean (via [3] SyncDataDirectory) We run multiple postgres clusters and some of those clusters have many (~450) databases (one database-per-customer) meaning that the postgres data directory has around 700,000 files. On one of our less loaded servers this takes ~7 minutes to complete, but on another [4] this takes ~90 minutes. Obviously this is untenable risk. We've modified our process that bootstraps a replica via pg_basebackup to instead do "pg_basebackup --no-sync…" followed by a "sync", but we don't have any way to do the equivalent for the postgres startup. I presume the reason postgres doesn't blindly run a sync() is that we don't know what other I/O is on the system and it'd be rude to affect other services. That makes sense, except for our environment the work done by the recursive fsync is orders of magnitude more disruptive than a sync(). My questions are: * is there a knob missing we can configure? * can we get an opt-in knob to use a single sync() call instead of a recursive fsync()? * would you be open to merging a patch providing said knob? * is there something else we missed? Thanks! [1]: https://www.postgresql.org/message-id/flat/caeet0zhgnbxmi8yf3ywsdzvb3m9cbdsgzgftxscq6agcbzc...@mail.gmail.com [2]: https://github.com/postgres/postgres/blob/master/src/bin/pg_basebackup/pg_basebackup.c#L2181 [3]: https://github.com/postgres/postgres/blob/master/src/backend/access/transam/xlog.c#L6495 [4]: It should be identical config-wise. It isn't starved for IO but does have other regular write workloads [5]: https://www.postgresql-archive.org/fdatasync-performance-problem-with-large-number-of-DB-files-td6184094.html -- Michael Brown Civilized Discourse Construction Kit, Inc. https://www.discourse.org/
Re: GiST comment improvement
On Tue, Mar 2, 2021 at 11:40:21AM -0500, Bruce Momjian wrote: > In talking to Teodor this week, I have written the attached C comment > patch which improves our description of GiST's NSN and GistBuildLSN > values. Patch applied. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: libpq debug log
On 2021-Mar-10, Tom Lane wrote: > After studying this further, I think we should apply the attached > patch to remove that responsibility from pqParseInput3's subroutines. > This will allow a single trace call near the bottom of pqParseInput3 > to handle all cases that that function processes. Works for me. > BTW, while looking at this I concluded that getParamDescriptions > is actually buggy: if it's given a malformed message that seems > to need more data than the message length specifies, it just > returns EOF, resulting in an infinite loop. This function apparently > got missed while converting the logic from v2 (where that was the > right thing) to v3 (where it ain't). So that bit needs to be > back-patched. Ah, that makes sense. > I'm tempted to back-patch the whole thing though. +0.5 on that. I think we may be happy that all branches are alike (though it doesn't look like this will cause any subtle bugs -- breakage will be fairly obvious). -- Álvaro Herrera39°49'30"S 73°17'W "Most hackers will be perfectly comfortable conceptualizing users as entropy sources, so let's move on." (Nathaniel Smith)
Re: automatic analyze: readahead - add "IO read time" log message
On 3/8/21 8:42 PM, Stephen Frost wrote: > Greetings, > > * Tomas Vondra (tomas.von...@enterprisedb.com) wrote: >> On 2/10/21 11:10 PM, Stephen Frost wrote: >>> * Heikki Linnakangas (hlinn...@iki.fi) wrote: On 05/02/2021 23:22, Stephen Frost wrote: > Unless there's anything else on this, I'll commit these sometime next > week. One more thing: Instead of using 'effective_io_concurrency' GUC directly, should call get_tablespace_maintenance_io_concurrency(). >>> >>> Ah, yeah, of course. >>> >>> Updated patch attached. >> >> A couple minor comments: >> >> 1) I think the patch should be split into two parts, one adding the >> track_io_timing, one adding the prefetching. > > This was already done.. > Not sure what you mean by "done"? I see the patch still does both changes related to track_io_timing and prefetching. >> 2) I haven't tried but I'm pretty sure there'll be a compiler warning >> about 'prefetch_maximum' being unused without USE_PREFETCH defined. > > Ah, that part is likely true, moved down into the #ifdef block to > address that, which also is good since it should avoid mistakenly using > it outside of the #ifdef's later on by mistake too. > OK >> 3) Is there a way to reduce the amount of #ifdef in acquire_sample_rows? > > Perhaps.. > >> This makes the code rather hard to read, IMHO. It seems to me we can >> move the code around a bit and merge some of the #ifdef blocks - see the >> attached patch. Most of this is fairly trivial, with the exception of >> moving PrefetchBuffer before table_scan_analyze_next_block - AFAIK this >> does not materially change the behavior, but perhaps I'm wrong. > > but I don't particularly like doing the prefetch right before we run > vacuum_delay_point() and potentially sleep. > Why? Is that just a matter of personal preference (fair enough) or is there a reason why that would be wrong/harmful? I think e.g. prefetch_targblock could be moved to the next #ifdef, which will eliminate the one-line ifdef. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: fdatasync performance problem with large number of DB files
On Thu, Mar 11, 2021 at 11:01 AM Michael Brown wrote: > * pg_basebackup receives a streaming backup (via [2] fsync_dir_recurse > or fsync_pgdata) unless --no-sync is specified > * postgres starts up unclean (via [3] SyncDataDirectory) > > We run multiple postgres clusters and some of those clusters have many > (~450) databases (one database-per-customer) meaning that the postgres > data directory has around 700,000 files. > > On one of our less loaded servers this takes ~7 minutes to complete, but > on another [4] this takes ~90 minutes. Ouch. > My questions are: > > * is there a knob missing we can configure? > * can we get an opt-in knob to use a single sync() call instead of a > recursive fsync()? > * would you be open to merging a patch providing said knob? > * is there something else we missed? As discussed on that other thread, I don't think sync() is an option (it doesn't wait on all OSes or in the standard and it doesn't report errors). syncfs() on Linux 5.8+ looks like a good candidate though, and I think we'd consider a patch like that. I mean, I even posted one[1] in that other thread. There will of course be cases where that's slower (small database sharing filesystem with other software that has a lot of dirty data to write back). I also wrote a WAL-and-checkpoint based prototype[2], which, among other advantages such as being faster, not ignoring errors and not triggering collateral write-back storms, happens to work on all operating systems. On the other hand it requires a somewhat dogmatic switch in thinking about the meaning of checkpoints (I mean, it requires humans to promise not to falsify checkpoints by copying databases around underneath us), which may be hard to sell (I didn't try very hard), and there may be subtleties I have missed... [1] https://www.postgresql.org/message-id/CA%2BhUKGKT6XiPiEJrqeOFGi7RYCGzbBysF9pyWwv0-jm-oNajxg%40mail.gmail.com [2] https://www.postgresql.org/message-id/CA%2BhUKGKHhDNnN6fxf6qrAx9h%2BmjdNU2Zmx7ztJzFQ0C5%3Du3QPg%40mail.gmail.com
Re: Speeding up GIST index creation for tsvectors
I wrote: > I'll post a patch soon that builds on that, so you can see what I mean. I've attached where I was imagining this heading, as a text file to avoid distracting the cfbot. Here are the numbers I got with your test on the attached, as well as your 0001, on x86-64 Clang 10, default siglen: master: 739ms v3-0001 692ms attached POC 665ms The small additional speed up is not worth it, given the code churn and complexity, so I don't want to go this route after all. I think the way to go is a simplified version of your 0001 (not 0002), with only a single function, for gist and intarray only, and a style that better matches the surrounding code. If you look at my xor functions in the attached text file, you'll get an idea of what it should look like. Note that it got the above performance without ever trying to massage the pointer alignment. I'm a bit uncomfortable with the fact that we can't rely on alignment, but maybe there's a simple fix somewhere in the gist code. -- John Naylor EDB: http://www.enterprisedb.com src/backend/access/heap/visibilitymap.c | 13 +- src/backend/nodes/bitmapset.c | 23 +-- src/backend/utils/adt/tsgistidx.c | 14 +- src/include/port/pg_bitutils.h | 12 +- src/port/pg_bitutils.c | 240 ++-- 5 files changed, 214 insertions(+), 88 deletions(-) diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c index e198df65d8..d910eb2875 100644 --- a/src/backend/access/heap/visibilitymap.c +++ b/src/backend/access/heap/visibilitymap.c @@ -388,7 +388,6 @@ visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_fro { Buffer mapBuffer; uint64 *map; - int i; /* * Read till we fall off the end of the map. We assume that any extra @@ -409,17 +408,11 @@ visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_fro StaticAssertStmt(MAPSIZE % sizeof(uint64) == 0, "unsupported MAPSIZE"); if (all_frozen == NULL) - { - for (i = 0; i < MAPSIZE / sizeof(uint64); i++) - nvisible += pg_popcount64(map[i] & VISIBLE_MASK64); - } + nvisible = pg_popcount_mask64(map, MAPSIZE, VISIBLE_MASK64); else { - for (i = 0; i < MAPSIZE / sizeof(uint64); i++) - { - nvisible += pg_popcount64(map[i] & VISIBLE_MASK64); - nfrozen += pg_popcount64(map[i] & FROZEN_MASK64); - } + nvisible = pg_popcount_mask64(map, MAPSIZE, VISIBLE_MASK64); + nfrozen = pg_popcount_mask64(map, MAPSIZE, FROZEN_MASK64); } ReleaseBuffer(mapBuffer); diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c index 649478b0d4..5368c72255 100644 --- a/src/backend/nodes/bitmapset.c +++ b/src/backend/nodes/bitmapset.c @@ -452,7 +452,6 @@ bms_is_member(int x, const Bitmapset *a) int bms_member_index(Bitmapset *a, int x) { - int i; int bitnum; int wordnum; int result = 0; @@ -466,14 +465,8 @@ bms_member_index(Bitmapset *a, int x) bitnum = BITNUM(x); /* count bits in preceding words */ - for (i = 0; i < wordnum; i++) - { - bitmapword w = a->words[i]; - - /* No need to count the bits in a zero word */ - if (w != 0) - result += bmw_popcount(w); - } + result = pg_popcount((const char *) a->words, +wordnum * BITS_PER_BITMAPWORD / BITS_PER_BYTE); /* * Now add bits of the last word, but only those before the item. 
We can @@ -645,22 +638,14 @@ bms_get_singleton_member(const Bitmapset *a, int *member) int bms_num_members(const Bitmapset *a) { - int result = 0; int nwords; - int wordnum; if (a == NULL) return 0; nwords = a->nwords; - for (wordnum = 0; wordnum < nwords; wordnum++) - { - bitmapword w = a->words[wordnum]; - /* No need to count the bits in a zero word */ - if (w != 0) - result += bmw_popcount(w); - } - return result; + return pg_popcount((const char *) a->words, + nwords * BITS_PER_BITMAPWORD / BITS_PER_BYTE); } /* diff --git a/src/backend/utils/adt/tsgistidx.c b/src/backend/utils/adt/tsgistidx.
Re: Columns correlation and adaptive query optimization
On 3/10/21 3:00 AM, Tomas Vondra wrote: > Hello Konstantin, > > > Sorry for not responding to this thread earlier. I definitely agree the > features proposed here are very interesting and useful, and I appreciate > you kept rebasing the patch. > > I think the patch improving join estimates can be treated as separate, > and I see it already has a separate CF entry - it however still points > to this thread, which will be confusing. I suggest we start a different > thread for it, to keep the discussions separate. > D'oh! I must have been confused yesterday, because now I see there already is a separate thread [1] for the join selectivity patch. So you can ignore this. regards [1] https://www.postgresql.org/message-id/flat/71d67391-16a9-3e5e-b5e4-8f7fd32cc...@postgrespro.ru -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: PROXY protocol support
On Tue, 2021-03-09 at 11:25 +0100, Magnus Hagander wrote: > I've also added some trivial tests (man that took an ungodly amount of > fighting perl -- it's clearly been a long time since I used perl > properly). Yeah. The tests I'm writing for this and NSS have been the same way; it's a real problem. I'm basically writing supplemental tests in Python as the "daily driver", then trying to port whatever is easiest (not much) into Perl, when I get time. == More Notes == Some additional spec-compliance stuff: > /* Lower 4 bits hold type of connection */ > if (proxyheader.fam == 0) > { > /* LOCAL connection, so we ignore the address included */ > } (fam == 0) is the UNSPEC case, which isn't necessarily LOCAL. We have to do something different for the LOCAL case: > - \x0 : LOCAL : [...] The receiver must accept this connection as > valid and must use the real connection endpoints and discard the > protocol block including the family which is ignored. So we should ignore the entire "protocol block" (by which I believe they mean the protocol-and-address-family byte) in the case of LOCAL, and just accept it with the original address info intact. That seems to match the sample code in the back of the spec. The current behavior in the patch will apply the PROXY behavior incorrectly if the sender sends a LOCAL header with something other than UNSPEC -- which is strange behavior but not explicitly prohibited as far as I can see. We also need to reject all connections that aren't either LOCAL or PROXY commands: > - other values are unassigned and must not be emitted by senders. > Receivers must drop connections presenting unexpected values here. ...and naturally it'd be Nice (tm) if the tests covered those corner cases. Over on the struct side: > + struct > + { /* for TCP/UDP > over IPv4, len = 12 */ > + uint32 src_addr; > + uint32 dst_addr; > + uint16 src_port; > + uint16 dst_port; > + } ip4; > ... snip ... > + /* TCPv4 */ > + if (proxyaddrlen < 12) > + { Given the importance of these hardcoded lengths matching reality, is it possible to add some static assertions to make sure that sizeof() == 12 and so on? That would also save any poor souls who are using compilers with nonstandard struct-packing behavior. --Jacob
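On the static-assertion suggestion, something along these lines would catch any struct-packing surprises at compile time. This is a sketch only: proxy_v2_ip4 below is an illustrative redeclaration of the ip4 block quoted from the patch, whereas the real assertion would of course go next to the patch's own struct.

#include "c.h"

/* Illustrative mirror of the patch's ip4 address block. */
typedef struct
{
	uint32		src_addr;
	uint32		dst_addr;
	uint16		src_port;
	uint16		dst_port;
} proxy_v2_ip4;

/* PROXY v2 requires this block to be exactly 12 bytes on the wire. */
StaticAssertDecl(sizeof(proxy_v2_ip4) == 12,
				 "PROXY v2 IPv4 address block must be 12 bytes");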
Re: libpq debug log
Alvaro Herrera writes: > On 2021-Mar-10, Tom Lane wrote: >> After studying this further, I think we should apply the attached >> patch to remove that responsibility from pqParseInput3's subroutines. >> This will allow a single trace call near the bottom of pqParseInput3 >> to handle all cases that that function processes. > Works for me. I dug into the git history a little and concluded that the reason for doing that in the subroutines is that 92785dac2 made it so for this reason: + /* +* Advance inStart to show that the "D" message has been processed. We +* must do this before calling the row processor, in case it longjmps. +*/ + conn->inStart = conn->inCursor; We later gave up on allowing user-defined row processor callbacks, but we failed to undo this messiness. Looking at what 92785dac2 did confirms something else I'd been eyeing, which is why these subroutines have their own checks for having consumed the right amount of data instead of letting the master check in pqParseInput3 take care of it. They didn't use to do that, but that check was hoisted up to before the row processor call, so we wouldn't expose data from a corrupt message to user code. So I think we can undo that too. 92785dac2 only directly hacked getRowDescriptions and getAnotherTuple, but it looks like similar error handling was stuck into getParamDescriptions later. regards, tom lane
Re: fdatasync performance problem with large number of DB files
On Thu, Mar 11, 2021 at 11:38 AM Thomas Munro wrote: > On Thu, Mar 11, 2021 at 11:01 AM Michael Brown > wrote: > > * is there a knob missing we can configure? > > * can we get an opt-in knob to use a single sync() call instead of a > > recursive fsync()? > > * would you be open to merging a patch providing said knob? > > * is there something else we missed? > > As discussed on that other thread, I don't think sync() is an option > (it doesn't wait on all OSes or in the standard and it doesn't report > errors). syncfs() on Linux 5.8+ looks like a good candidate though, > and I think we'd consider a patch like that. I mean, I even posted > one[1] in that other thread. There will of course be cases where > that's slower (small database sharing filesystem with other software > that has a lot of dirty data to write back). Thinking about this some more, if you were to propose a patch like that syncfs() one but make it a configurable option, I'd personally be in favour of trying to squeeze it into v14. Others might object on commitfest procedural grounds, I dunno, but I think this is a real operational issue and that's a fairly simple and localised change. I've run into a couple of users who have just commented that recursive fsync() code out! I'd probably make it an enum-style GUC, because I intend to do some more work on my "precise" alternative, though not in time for this release, and it could just as well be an option too.
Re: fdatasync performance problem with large number of DB files
Thomas Munro writes: > Thinking about this some more, if you were to propose a patch like > that syncfs() one but make it a configurable option, I'd personally be > in favour of trying to squeeze it into v14. Others might object on > commitfest procedural grounds, I dunno, but I think this is a real > operational issue and that's a fairly simple and localised change. > I've run into a couple of users who have just commented that recursive > fsync() code out! I'm a little skeptical about the "simple" part. At minimum, you'd have to syncfs() each tablespace, since we have no easy way to tell which of them are on different filesystems. (Although, if we're presuming this is Linux-only, we might be able to tell with some unportable check or other.) regards, tom lane
Re: fdatasync performance problem with large number of DB files
On Thu, Mar 11, 2021 at 1:16 PM Tom Lane wrote: > Thomas Munro writes: > > Thinking about this some more, if you were to propose a patch like > > that syncfs() one but make it a configurable option, I'd personally be > > in favour of trying to squeeze it into v14. Others might object on > > commitfest procedural grounds, I dunno, but I think this is a real > > operational issue and that's a fairly simple and localised change. > > I've run into a couple of users who have just commented that recursive > > fsync() code out! > > I'm a little skeptical about the "simple" part. At minimum, you'd > have to syncfs() each tablespace, since we have no easy way to tell > which of them are on different filesystems. (Although, if we're > presuming this is Linux-only, we might be able to tell with some > unportable check or other.) Right, the patch knows about that: +/* + * On Linux, we don't have to open every single file one by one. We can + * use syncfs() to sync whole filesystems. We only expect filesystem + * boundaries to exist where we tolerate symlinks, namely pg_wal and the + * tablespaces, so we call syncfs() for each of those directories. + */
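For reference, the core of a syncfs()-based approach is small -- roughly the following, assuming a configure test defines something like HAVE_SYNCFS on Linux 5.8+. The helper name and error handling here are illustrative, not the exact patch.

#include "postgres.h"

#include <fcntl.h>
#include <unistd.h>

#ifdef HAVE_SYNCFS
/* Hypothetical helper: flush the whole filesystem containing "path". */
static void
do_syncfs(const char *path)
{
	int			fd = open(path, O_RDONLY);

	if (fd < 0)
	{
		elog(LOG, "could not open file \"%s\": %m", path);
		return;
	}
	if (syncfs(fd) < 0)
		elog(LOG, "could not synchronize file system for file \"%s\": %m", path);
	close(fd);
}
#endif

The caller would invoke this once for the data directory, once for pg_wal, and once per tablespace link, as the quoted comment above describes, instead of fsync'ing hundreds of thousands of files individually.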
Re: About to add WAL write/fsync statistics to pg_stat_wal view
On 2021-03-10 17:08, Fujii Masao wrote: On 2021/03/10 14:11, Masahiro Ikeda wrote: On 2021-03-09 17:51, Fujii Masao wrote: On 2021/03/05 8:38, Masahiro Ikeda wrote: On 2021-03-05 01:02, Fujii Masao wrote: On 2021/03/04 16:14, Masahiro Ikeda wrote: On 2021-03-03 20:27, Masahiro Ikeda wrote: On 2021-03-03 16:30, Fujii Masao wrote: On 2021/03/03 14:33, Masahiro Ikeda wrote: On 2021-02-24 16:14, Fujii Masao wrote: On 2021/02/15 11:59, Masahiro Ikeda wrote: On 2021-02-10 00:51, David G. Johnston wrote: On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda wrote: I pgindented the patches. ... XLogWrite, which is invoked during an XLogFlush request (see ...). This is also incremented by the WAL receiver during replication. ("which normally called" should be "which is normally called" or "which normally is called" if you want to keep true to the original) You missed the adding the space before an opening parenthesis here and elsewhere (probably copy-paste) is ether -> is either "This parameter is off by default as it will repeatedly query the operating system..." ", because" -> "as" Thanks, I fixed them. wal_write_time and the sync items also need the note: "This is also incremented by the WAL receiver during replication." I skipped changing it since I separated the stats for the WAL receiver in pg_stat_wal_receiver. "The number of times it happened..." -> " (the tally of this event is reported in wal_buffers_full in) This is undesirable because ..." Thanks, I fixed it. I notice that the patch for WAL receiver doesn't require explicitly computing the sync statistics but does require computing the write statistics. This is because of the presence of issue_xlog_fsync but absence of an equivalent pg_xlog_pwrite. Additionally, I observe that the XLogWrite code path calls pgstat_report_wait_*() while the WAL receiver path does not. It seems technically straight-forward to refactor here to avoid the almost-duplicated logic in the two places, though I suspect there may be a trade-off for not adding another function call to the stack given the importance of WAL processing (though that seems marginalized compared to the cost of actually writing the WAL). Or, as Fujii noted, go the other way and don't have any shared code between the two but instead implement the WAL receiver one to use pg_stat_wal_receiver instead. In either case, this half-and-half implementation seems undesirable. OK, as Fujii-san mentioned, I separated the WAL receiver stats. (v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch) Thanks for updating the patches! I added the infrastructure code to communicate the WAL receiver stats messages between the WAL receiver and the stats collector, and the stats for WAL receiver is counted in pg_stat_wal_receiver. What do you think? On second thought, this idea seems not good. Because those stats are collected between multiple walreceivers, but other values in pg_stat_wal_receiver is only related to the walreceiver process running at that moment. IOW, it seems strange that some values show dynamic stats and the others show collected stats, even though they are in the same view pg_stat_wal_receiver. Thought? OK, I fixed it. The stats collected in the WAL receiver is exposed in pg_stat_wal view in v11 patch. Thanks for updating the patches! I'm now reading 001 patch. 
+ /* Check whether the WAL file was synced to disk right now */ + if (enableFsync && + (sync_method == SYNC_METHOD_FSYNC || + sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH || + sync_method == SYNC_METHOD_FDATASYNC)) + { Isn't it better to make issue_xlog_fsync() return immediately if enableFsync is off, sync_method is open_sync or open_data_sync, to simplify the code more? Thanks for the comments. I added the above code in v12 patch. + /* + * Send WAL statistics only if WalWriterDelay has elapsed to minimize + * the overhead in WAL-writing. + */ + if (rc & WL_TIMEOUT) + pgstat_send_wal(); On second thought, this change means that it always takes wal_writer_delay before walwriter's WAL stats is sent after XLogBackgroundFlush() is called. For example, if wal_writer_delay is set to several seconds, some values in pg_stat_wal would be not up-to-date meaninglessly for those seconds. So I'm thinking to withdraw my previous comment and it's ok to send the stats every after XLogBackgroundFlush() is called. Thought? Thanks, I didn't notice that. Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's default value is 200msec and it may be set shorter time. Yeah, if wal_writer_delay is set to very small value, there is a risk that the WAL stats are sent too frequently. I agree that's a problem. Why don't to make another way to check the timestamp? + /* + * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL +
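For what it's worth, a rough sketch of the early-return shape suggested above (only the guard is shown; the existing sync_method constants and the rest of issue_xlog_fsync() are assumed from xlog.c):

void
issue_xlog_fsync(int fd, XLogSegNo segno)
{
	/*
	 * Sketch of the suggested structure: return immediately when no
	 * explicit flush is needed, instead of wrapping the statistics code
	 * in a compound condition.
	 */
	if (!enableFsync ||
		sync_method == SYNC_METHOD_OPEN ||
		sync_method == SYNC_METHOD_OPEN_DSYNC)
		return;

	/* ... existing fsync/fdatasync calls and WAL sync statistics ... */
}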
Re: Huge memory consumption on partitioned table with FKs
On Thu, Mar 11, 2021 at 4:25 AM Tom Lane wrote: > Amit Langote writes: > > On Wed, Mar 10, 2021 at 8:37 AM Tom Lane wrote: > >> Hmm. So, the key point is that the values coming from the partitioned > >> child table are injected into the test query as parameters, not as > >> column references, thus it doesn't matter *to the test query* what > >> numbers the referencing columns have in that child. We just have to > >> be sure we pass the right parameter values. > > > Right. > > I did some cosmetic fooling with this (mostly, rewriting the comments > YA time) and pushed it. Perfect. Thanks for your time on this. -- Amit Langote EDB: http://www.enterprisedb.com
Re: [HACKERS] Custom compression methods
On Wed, Mar 10, 2021 at 03:50:48PM -0500, Robert Haas wrote: > On Wed, Mar 10, 2021 at 6:52 AM Dilip Kumar wrote: > > The pending comment is providing a way to rewrite a table and > > re-compress the data with the current compression method. > > I spent some time poking at this yesterday and couldn't figure out > what was going on here. There are two places where we rewrite tables. > One is the stuff in cluster.c, which handles VACUUM FULL and CLUSTER. > That eventually calls reform_and_rewrite_tuple(), which deforms the > old tuple and creates a new one, but it doesn't seem like there's > anything in there that would expand toasted values, whether external > or inline compressed. But I think that can't be right, because it > seems like then you'd end up with toast pointers into the old TOAST > relation, not the new one, which would cause failures later. So I must > be missing something here. I did this the empirical way. postgres=# CREATE TABLE t (a text compression lz4); postgres=# INSERT INTO t SELECT repeat('a',9); postgres=# ALTER TABLE t ALTER a SET COMPRESSION pglz; postgres=# VACUUM FULL t; postgres=# SELECT pg_column_compression(a) FROM t; pg_column_compression | lz4 Actually, a --without-lz4 build can *also* VACUUM FULL the lz4 table. It looks like VACUUM FULL t processes t but not its toast table (which is strange to me, since VACUUM processes both, but then autovacuum also processes them independently). -- Justin
Re: Improve join selectivity estimation using extended statistics
Hi Konstantin, Thanks for working on this! Using extended statistics to improve join cardinality estimates was definitely on my radar, and this patch seems like a good start. I had two basic ideas about how we might improve join estimates: (a) use per-table extended statistics to estimate join conditions (b) invent multi-table extended statistics (requires inventing how to sample the tables in a coordinated way, etc.) This patch aims to do (a) which is perfectly reasonable - I think we can achieve significant improvements this way. I have some ideas about (b), but it seems harder and for a separate thread/patch. The patch includes some *very* interesting ideas, but I think it's does them too late and at the wrong level of abstraction. I mean that: 1) I don't think the code in clausesel.c should deal with extended statistics directly - it requires far too much knowledge about different types of extended stats, what clauses are supported by them, etc. Allowing stats on expressions will make this even worse. Better do that in extended_stats.c, like statext_clauselist_selectivity. 2) in clauselist_selectivity_ext, functional dependencies are applied in the part that processes remaining clauses, not estimated using extended statistics. That seems a bit confusing, and I suspect it may lead to issues - for example, it only processes the clauses incrementally, in a particular order. That probably affects the result, because it affects which functional dependencies we can apply. In the example query that's not an issue, because it only has two Vars, so it either can't apply anything (with one Var) or it can apply everything (with two Vars). But with 3 or more Vars the order would certainly matter, so it's problematic. Moreover, it seems a bit strange that this considers dependencies only on the inner relation. Can't that lead to issues with different join orders producing different cardinality estimates? I think a better approach would be to either modify the existing block dealing with extended stats for a single relation to also handle join conditions. Or perhaps we should invent a separate block, dealing with *pairs* of relations? And it should deal with *all* join clauses for that pair of relations at once, not one by one. As for the exact implementation, I'd imagine we call overall logic to be something like (for clauses on two joined relations): - pick a subset of clauses with the same type of extended statistics on both sides (MCV, ndistinct, ...), repeat until we can't apply more statistics - estimate remaining clauses either using functional dependencies or in the regular (old) way As for how to use other types of extended statistics, I think eqjoinsel could serve as an inspiration. We should look for an MCV list and ndistinct stats on both sides of the join (possibly on some subset of clauses), and then do the same thing eqjoinsel does, just with multiple columns. Note: I'm not sure what to do when we find the stats only on one side. Perhaps we should assume the other side does not have correlations and use per-column statistics (seems reasonable), or maybe just not apply anything (seems a bit too harsh). Anyway, if there are some non-estimated clauses, we could try applying functional dependencies similarly to what this patch does. It's also consistent with statext_clauselist_selectivity - that also tries to apply MCV lists first, and only then we try functional dependencies. BTW, should this still rely on oprrest (e.g. F_EQSEL). 
That's the selectivity function for restriction (non-join) clauses, so maybe we should be looking at oprjoin when dealing with joins? Not sure. One bit that I find *very* interesting is the calc_joinrel_size_estimate part, with this comment: /* * Try to take in account functional dependencies between attributes * of clauses pushed-down to joined relations and retstrictlist * clause. Right now we consider only case of restrictlist consists of * one clause. */ If I understand the comment and the code after it, it essentially tries to apply extended statistics from both the join clauses and filters at the relation level. That is, with a query like SELECT * FROM t1 JOIN t2 ON (t1.a = t2.a) WHERE t1.b = 10 we would be looking at statistics on t1(a,b), because we're interested in estimating conditional probability distribution P(t1.a = a? | t1.b = 10) I think that's extremely interesting and powerful, because it allows us to "restrict" the multi-column MCV lists, we could probably estimate number of distinct "a" values in rows with "b=10" like: ndistinct(a,b) / ndistinct(b) and do various interesting stuff like this. That will require some improvements to the extended statistics code (to allow passing a list of conditions), but that's quite doable. I think the code actually did something like that originally ;-) Obviously, none of this is achievable for PG14, as we're in the middle of the last CF. But if you're interest
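For the example query above, the per-table extended statistics that approach (a) would consume could be declared as below (the statistics name is arbitrary); the planner could then derive things like ndistinct(a,b) / ndistinct(b) from them:

-- Illustrative only: statistics on t1(a,b) for the example join above.
CREATE STATISTICS t1_a_b (ndistinct, dependencies, mcv) ON a, b FROM t1;
ANALYZE t1;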
Re: shared-memory based stats collector
At Wed, 10 Mar 2021 21:47:51 +0900, Fujii Masao wrote in > Attached is the updated version of the 0003 patch. Barring any > objection, > I will commit this patch. > > > -#include "storage/latch.h" > -#include "storage/proc.h" > > I removed these because they are no longer necessary. Mmm. Sorry for the garbage. > logical replication worker, > parallel worker, background > writer, > client backend, checkpointer, > + archiver, > startup, walreceiver, > walsender and walwriter. > > In the document about pg_stat_activity, possible values in > backend_type > column are all listed. I added "archiver" into the list. > > BTW, those values were originally listed in alphabetical order, > but ISTM that they were gotten out of order at a certain moment. > So they should be listed in alphabetical order again. This should > be implemented as a separate patch. Thanks for adding it. They are also mildly sorted by function or characteristics. I'm not sure which is better, but anyway they should be the order based on a clear criteria. > -PgArchData *PgArch = NULL; > +static PgArchData *PgArch = NULL; > > I marked PgArchData as static because it's used only in pgarch.c. Right. > - ShmemInitStruct("Archiver ", PgArchShmemSize(), &found); > + ShmemInitStruct("Archiver Data", PgArchShmemSize(), &found); > > I found that the original shmem name ended with unnecessary space > character. > I replaced it with "Archiver Data". Oops. The trailing space is where I stopped writing the string and try to find a better word, then in the meanwhile, my mind have been attracted to something else and left. I don't object to "Archiver Data". Thanks for completing it. > In reaper(), I moved the code block for archiver to the original > location. Agreed. > I ran pgindent for the files that the patch modifies. Yeah, I forgot to add the new struct into typedefs.list. I intentionally omitted clearing newly-acquired shared memory but it doesn't matter to do that. So, I'm fine with it. Thanks for taking this. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pgbench - add pseudo-random permutation function
On Mon, Mar 8, 2021 at 11:50:43AM +0100, Fabien COELHO wrote: > > > > What are your current thoughts? > > > > Thanks for prodding. I still think it's a useful feature. However I > > don't think I'll have time to get it done on the current commitfest. > > I suggest to let it sit in the commitfest to see if somebody else will > > pick it up -- and if not, we move it to the next one, with apologies to > > author and reviewers. > > > > I may have time to become familiar or at least semi-comfortable with all > > that weird math in it by then. > > Yep. > > Generating parametric good-quality low-cost (but not > cryptographically-secure) pseudo-random permutations on arbitrary sizes (not > just power-of-two sizes) is not a trivial task; I had to be quite creative > to achieve it, hence the "weird" maths. I had a lot of bad > not-really-working ideas before the current status of the patch. > > The code could be simplified if we assume that PG_INT128_TYPE will be > available on all relevant architectures, and accept that the feature will not be > available if not. Maybe Dean Rasheed can help because of his math background --- CC'ing him. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: fdatasync performance problem with large number of DB files
On 2021/03/11 8:30, Thomas Munro wrote: On Thu, Mar 11, 2021 at 11:38 AM Thomas Munro wrote: On Thu, Mar 11, 2021 at 11:01 AM Michael Brown wrote: * is there a knob missing we can configure? * can we get an opt-in knob to use a single sync() call instead of a recursive fsync()? * would you be open to merging a patch providing said knob? * is there something else we missed? As discussed on that other thread, I don't think sync() is an option (it doesn't wait on all OSes or in the standard and it doesn't report errors). syncfs() on Linux 5.8+ looks like a good candidate though, and I think we'd consider a patch like that. I mean, I even posted one[1] in that other thread. There will of course be cases where that's slower (small database sharing filesystem with other software that has a lot of dirty data to write back). Thinking about this some more, if you were to propose a patch like that syncfs() one but make it a configurable option, I'd personally be in favour of trying to squeeze it into v14. Others might object on commitfest procedural grounds, I dunno, but I think this is a real operational issue and that's a fairly simple and localised change. +1 to push this kind of change into v14!! I've run into a couple of users who have just commented that recursive fsync() code out! BTW, we can skip that recursive fsync() by disabling fsync GUC even without commenting out the code? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: Boundary value check in lazy_tid_reaped()
On Wed, Mar 10, 2021 at 11:53 PM Peter Eisentraut wrote: > > On 10.03.21 02:29, Masahiko Sawada wrote: > >> There is no noticeable effect of inlining lazy_tid_reaped(). So it > >> would be better to not do that. > > > > Attached the patch that doesn't inline lazy_tid_reaped(). > > committed Thank you! Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
RE: Enhance traceability of wal_level changes for backup management
On Tuesday, March 9, 2021 11:13 PM David Steele > On 3/7/21 9:45 PM, osumi.takami...@fujitsu.com wrote: > > On Sun, Mar 7, 2021 3:48 AM Peter Eisentraut > wrote: > >> On 28.01.21 01:44, osumi.takami...@fujitsu.com wrote: > (1) writing the time or LSN in the control file to indicate > when/where wal_level is changed to 'minimal' > from upper level to invalidate the old backups or make alerts to users. > >>> I attached the first patch which implementes this idea. > >>> It was aligned by pgindent and shows no regression. > >> > >> It's not clear to me what this is supposed to accomplish. I read the > >> thread, but it's still not clear. > >> What is one supposed to do with this information? > > OK. The basic idea is to enable backup management tools to recognize > > wal_level drop between *snapshots*. > > When you have a snapshot of the cluster at one time and another one at > > different time, with this new parameter, you can see if anything that > > causes discontinuity from the drop happens in the middle of the two > > snapshots without efforts to have a look at the WALs in between. > > As a backup software author, I don't see this feature as very useful. > > The problem is that there are lots of ways for WAL to go missing so > monitoring the WAL archive for gaps is essential and this feature would not > replace that requirement. The only extra information you'd get is the ability > to > classify the most recent gap as "intentional", maybe. > > So, -1 from me. Thanks for giving me a meaningful viewpoint. Let me sleep on it, about whether the new param doesn't help in all cases or not. Best Regards, Takamichi Osumi
Re: Huge memory consumption on partitioned table with FKs
Hi hackers, > > > > I did some cosmetic fooling with this (mostly, rewriting the comments > > YA time) and pushed it. > > Perfect. Thanks for your time on this. > Thank you for your help! I'm glad it's solved. -- Keisuke Kuroda NTT Software Innovation Center keisuke.kuroda.3...@gmail.com
Re: Removing vacuum_cleanup_index_scale_factor
On Tue, Mar 9, 2021 at 7:42 PM Peter Geoghegan wrote: > My current plan is to commit everything within the next day or two. > This includes backpatching to Postgres 13 only. Pushed, thanks. -- Peter Geoghegan
Re: libpq debug log
At Wed, 10 Mar 2021 10:31:27 -0300, "'alvhe...@alvh.no-ip.org'" wrote in > On 2021-Mar-10, iwata@fujitsu.com wrote: > > > Hi all, > > > > Following all reviewer's advice, I have created a new patch. > > > > In this patch, I add only two tracing entry points; I call > > pqTraceOutputMsg(PGconn conn, int msgCursor, PGCommSource commsource) in > > pqParseInput3 () and pqPutMsgEnd () to output log. > > The argument contains message first byte offset called msgCursor because it > > is simpler to specify the buffer pointer in the caller. > > > > In pqTraceOutputMsg(), the content common to all protocol messages (first > > timestamp, < or >, first 1 byte, and length) are output first and after > > that each protocol message contents is output. I referred to pqParseInput3 > > () , fe-exec.c and documentation page for that part. > > > > This fix almost eliminates if(conn->Pfdebug) that was built into the code > > for other features. > > This certainly looks much better! Thanks for getting it done so > quickly. > > I didn't review the code super closely. I do see a compiler warning: +1 to the thanks for the quick work. I have some random comments after a quick look at it. The individual per-type output functions are designed to match the corresponding pqGet*() functions. On the other hand, now that we read the message bytes knowing the exact format in advance, as you did in pqTraceOutputMsg(), the complexity that exists in those functions can be eliminated by separating them into more type-specific output functions. For example, pqTraceOutputInt() becomes far simpler if split into pqTraceOutputInt16() and -32(). The output functions copy message bytes into a local variable, but the same effect can be obtained by just casting via a typed pointer. uint32 tmp4; .. memcpy(&tmp4, buf + *cursor, 4); result = (int) pg_ntoh32(tmp4); can be written as result = pg_ntoh32(* (uint32 *) (buf + *cursor)); I think we can get rid of copying in the output functions for String and Bytes in different ways. Now that we can manage our own reading pointer within pqTraceOutputMsg(), the per-type output functions can work only on the reading pointer instead of the buffer pointer, cursor, and length. *I*'d want to make the output functions move the reading pointer by themselves. pqTraceOutputMsg() can be simplified as follows by doing this. An end-of-message pointer may be useful to check for read overruns on the message buffer. switch (id) { case 'R': pqTraceOutputInt32(&p, endptr, conn->Pfdebug); fputc('\n', conn->Pfdebug); break; case 'K': pqTraceOutputInt32(&p, endptr, conn->Pfdebug); pqTraceOutputInt32(&p, endptr, conn->Pfdebug); ... + char *prefix = commsource == MSGDIR_FROM_BACKEND ? "<" : ">"; + int LogEnd = commsource == MSGDIR_FROM_BACKEND ? conn->inCursor : conn->outMsgEnd; + char*logBuffer = commsource == MSGDIR_FROM_BACKEND ? conn->inBuffer .. + if (commsource == MSGDIR_FROM_BACKEND) + id = logBuffer[LogCursor++]; Repeated use of the ternary operator on the same condition makes the code hard to read. It is simpler to consolidate them into one if-else statement. + result = (int) pg_ntoh32(result32); + if (result == NEGOTIATE_SSL_CODE) Maybe I'm missing something, but the above doesn't seem to be working. I didn't see such a message when making an SSL connection. + /* Protocol 2.0 does not support tracing. */ + if (PG_PROTOCOL_MAJOR(conn->pversion) < 3) + return; We could write that message out to the tracing file. +void +PQtraceSetFlags(PGconn *conn, int flags) +{ + if (conn == NULL) + return; + /* Protocol 2.0 is not supported. 
*/ + if (PG_PROTOCOL_MAJOR(conn->pversion) < 3) + return; + /* If PQtrace() failed, do nothing. */ + if (conn->Pfdebug == NULL) + return; + conn->traceFlags = flags; Pfdebug is always NULL for V2 protocol there, so it can be an assertion? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
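To illustrate, a rough sketch of such a pointer-advancing per-type output function (the name and the overrun check are only suggestions; memcpy() is used to avoid unaligned reads of the network-order integer):

static void
pqTraceOutputInt32(const char **p, const char *endptr, FILE *f)
{
	uint32		net;

	/* Refuse to read past the end of the message. */
	if (*p + 4 > endptr)
	{
		fprintf(f, " <message overrun>");
		return;
	}
	memcpy(&net, *p, 4);
	*p += 4;
	fprintf(f, " %d", (int) pg_ntoh32(net));
}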
Re: fdatasync performance problem with large number of DB files
On Thu, Mar 11, 2021 at 2:00 PM Fujii Masao wrote: > On 2021/03/11 8:30, Thomas Munro wrote: > > I've run into a couple of users who have just commented that recursive > > fsync() code out! > > BTW, we can skip that recursive fsync() by disabling fsync GUC even without > commenting out the code? Those users wanted fsync=on because they wanted to recover to a normal online system after a crash, but they believed that the preceding fsync of the data directory was useless, because replaying the WAL should be enough. IMHO they were nearly on the right track, and the prototype patch I linked earlier as [2] was my attempt to find the specific reasons why that doesn't work and fix them. So far, I figured out that you still have to remember to fsync the WAL files (otherwise you're replaying WAL that potentially hasn't reached the disk), and data files holding blocks that recovery decided to skip due to BLK_DONE (otherwise you might decide to skip replay because of a higher LSN that is on a page that is in the kernel's cache but not yet on disk).
Re: fdatasync performance problem with large number of DB files
Thomas Munro writes: > On Thu, Mar 11, 2021 at 1:16 PM Tom Lane wrote: >> I'm a little skeptical about the "simple" part. At minimum, you'd >> have to syncfs() each tablespace, since we have no easy way to tell >> which of them are on different filesystems. (Although, if we're >> presuming this is Linux-only, we might be able to tell with some >> unportable check or other.) > Right, the patch knows about that: I noticed that the syncfs man page present in RHEL8 seemed a little squishy on the critical question of error reporting. It promises that syncfs will wait for I/O completion, but it doesn't say in so many words that I/O errors will be reported (and the list of applicable errno codes is only EBADF, not very reassuring). Trolling the net, I found a newer-looking version of the man page, and behold it says In mainline kernel versions prior to 5.8, syncfs() will fail only when passed a bad file descriptor (EBADF). Since Linux 5.8, syncfs() will also report an error if one or more inodes failed to be written back since the last syncfs() call. So this means that in less-than-bleeding-edge kernels, syncfs can only be regarded as a dangerous toy. If we expose an option to use it, there had better be large blinking warnings in the docs. regards, tom lane
Re: fdatasync performance problem with large number of DB files
On Thu, Mar 11, 2021 at 2:25 PM Tom Lane wrote: > Trolling the net, I found a newer-looking version of the man page, > and behold it says > >In mainline kernel versions prior to 5.8, syncfs() will fail only >when passed a bad file descriptor (EBADF). Since Linux 5.8, >syncfs() will also report an error if one or more inodes failed >to be written back since the last syncfs() call. > > So this means that in less-than-bleeding-edge kernels, syncfs can > only be regarded as a dangerous toy. If we expose an option to use > it, there had better be large blinking warnings in the docs. Agreed. Perhaps we could also try to do something programmatic about that. Its fsync() was also pretty rough for the first 28 years.
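One possible programmatic guard, purely as an illustration (assumes Linux and a "major.minor" uname release string; the function name is made up):

#include <stdbool.h>
#include <stdio.h>
#include <sys/utsname.h>

/* Sketch: true if the running kernel's syncfs() reports write-back errors. */
static bool
syncfs_reports_errors(void)
{
	struct utsname u;
	int			major,
				minor;

	if (uname(&u) != 0 || sscanf(u.release, "%d.%d", &major, &minor) != 2)
		return false;			/* be conservative if we cannot tell */
	return major > 5 || (major == 5 && minor >= 8);
}

int
main(void)
{
	if (!syncfs_reports_errors())
		fprintf(stderr, "warning: syncfs() may not report I/O errors on this kernel\n");
	return 0;
}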
Re: 64-bit XIDs in deleted nbtree pages
On Mon, Mar 1, 2021 at 7:25 PM Peter Geoghegan wrote: > Attached is v8 of the patch series, which has new patches. No real > changes compared to v7 for the first patch, though. Here is another bitrot-fix-only revision, v9. Just the recycling patch again. I'll commit this when we get your patch committed. Still haven't decided on exactly how more aggressive we should be. For example the use of the heap relation within _bt_newly_deleted_pages_recycle() might have unintended consequences for recycling efficiency with some workloads, since it doesn't agree with _bt_getbuf() (it is still "more ambitious" than _bt_getbuf(), at least for now). -- Peter Geoghegan v9-0001-Recycle-pages-deleted-during-same-VACUUM.patch Description: Binary data
Re: libpq debug log
Kyotaro Horiguchi writes: > Maybe I'm missing something, but the above doesn't seem working. I > didn't see such message when making a SSL connection. As far as that goes, I don't see any point in trying to trace connection-related messages, because there is no way to enable tracing on a connection before it's created. regards, tom lane
RE: libpq debug log
Alvaro-san, Tom-san, Horiguchi-san, From: Kyotaro Horiguchi > +1 to the thanks for the quick work. I have some random comments > after a quick look at it. Thank you very much for the many comments. And we're sorry to have caused you trouble. I told Iwata-san yesterday to modify the patch to use the logging function interface that Tom-san, Alvaro-san and I agreed on, planning to review it after the new revision was posted and to let others know she is modifying the patch again so that they won't start reviewing. But I should have made that request on -hackers. We're catching up with your feedback and will post the updated patch. Then I'll review it. We appreciate your help so much. Regards Takayuki Tsunakawa
Re: Removing vacuum_cleanup_index_scale_factor
On Thu, Mar 11, 2021 at 10:12 AM Peter Geoghegan wrote: > > On Tue, Mar 9, 2021 at 7:42 PM Peter Geoghegan wrote: > > My current plan is to commit everything within the next day or two. > > This includes backpatching to Postgres 13 only. > > Pushed, thanks. Great! Thank you! Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: Replace buffer I/O locks with condition variables (reviving an old patch)
On Thu, Mar 11, 2021 at 10:48:40AM +1300, Thomas Munro wrote: > On Tue, Mar 9, 2021 at 6:24 PM Julien Rouhaud wrote: > > > > +1 for adding the cv into BufferDesc. That brings the struct size to > > exactly > > 64 bytes on x86 64 bits architecture. This won't add any extra overhead to > > LOCK_DEBUG cases, as it was already exceeding the 64B threshold, if that > > even > > was a concern. > > I also checked that it's 64B on an Arm box. Not sure about POWER. > But... despite the fact that it looks like a good change in isolation, > I decided to go back to the separate array in this initial commit, > because the AIO branch also wants to add a new BufferDesc member[1]. Ok! > I may come back to that change, if we can make some more space (seems > entirely doable, but I'd like to look into that separately). - /* -* It would be nice to include the I/O locks in the BufferDesc, but that -* would increase the size of a BufferDesc to more than one cache line, -* and benchmarking has shown that keeping every BufferDesc aligned on a -* cache line boundary is important for performance. So, instead, the -* array of I/O locks is allocated in a separate tranche. Because those -* locks are not highly contended, we lay out the array with minimal -* padding. -*/ - size = add_size(size, mul_size(NBuffers, sizeof(LWLockMinimallyPadded))); + /* size of I/O condition variables */ + size = add_size(size, mul_size(NBuffers, + sizeof(ConditionVariableMinimallyPadded))); Should we keep for now some similar comment mentionning why we don't put the cv in the BufferDesc even though it would currently fit the 64B target size? > Thanks for the review. And of course to Robert for writing the patch. > Pushed. Great!
Re: [HACKERS] Custom compression methods
This includes a patch to use pkgconfig, in an attempt to build on mac, which currently fails like: https://cirrus-ci.com/task/5993712963551232?command=build#L126 checking for LZ4_compress in -llz4... no configure: error: library 'lz4' is required for LZ4 support -- Justin >From 601ed9e2966b29e1603fd07f69b8068404753810 Mon Sep 17 00:00:00 2001 From: Dilip Kumar Date: Fri, 5 Mar 2021 09:33:08 +0530 Subject: [PATCH 1/5] Built-in compression method Add syntax allowing a compression method to be specified. As of now there is only 2 option for build-in compression method (pglz, lz4) which can be set while creating a table or adding a new column. No option for altering the compression method for an existing column. Dilip Kumar based on the patches from Ildus Kurbangaliev. Design input from Robert Haas and Tomas Vondra. Reviewed by Robert Haas, Tomas Vondra, Alexander Korotkov and Justin Pryzby Discussions: https://www.postgresql.org/message-id/20171213151818.75a20...@postgrespro.ru https://www.postgresql.org/message-id/CA%2BTgmoaKDW1Oi9V%3Djc9hOGyf77NbkNEABuqgHD1Cq%3D%3D1QsOcxg%40mail.gmail.com https://www.postgresql.org/message-id/CA%2BTgmobSDVgUage9qQ5P_%3DF_9jaMkCgyKxUQGtFQU7oN4kX-AA%40mail.gmail.com https://www.postgresql.org/message-id/20201005160355.byp74sh3ejsv7wrj%40development https://www.postgresql.org/message-id/CAFiTN-tzTTT2oqWdRGLv1dvvS5MC1W%2BLE%2B3bqWPJUZj4GnHOJg%40mail.gmail.com --- configure | 118 +++ configure.ac | 18 + doc/src/sgml/catalogs.sgml| 10 + doc/src/sgml/ref/create_table.sgml| 33 +- doc/src/sgml/ref/psql-ref.sgml| 11 + src/backend/access/brin/brin_tuple.c | 5 +- src/backend/access/common/Makefile| 1 + src/backend/access/common/detoast.c | 71 ++-- src/backend/access/common/indextuple.c| 3 +- src/backend/access/common/toast_compression.c | 308 ++ src/backend/access/common/toast_internals.c | 54 ++- src/backend/access/common/tupdesc.c | 8 + src/backend/access/table/toast_helper.c | 5 +- src/backend/bootstrap/bootstrap.c | 5 + src/backend/catalog/genbki.pl | 3 + src/backend/catalog/heap.c| 4 + src/backend/catalog/index.c | 1 + src/backend/catalog/toasting.c| 6 + src/backend/commands/tablecmds.c | 110 +++ src/backend/nodes/copyfuncs.c | 1 + src/backend/nodes/equalfuncs.c| 1 + src/backend/nodes/nodeFuncs.c | 2 + src/backend/nodes/outfuncs.c | 1 + src/backend/parser/gram.y | 26 +- src/backend/parser/parse_utilcmd.c| 9 + src/backend/utils/adt/varlena.c | 41 +++ src/bin/pg_dump/pg_backup.h | 1 + src/bin/pg_dump/pg_dump.c | 39 +++ src/bin/pg_dump/pg_dump.h | 1 + src/bin/pg_dump/t/002_pg_dump.pl | 12 +- src/bin/psql/describe.c | 31 +- src/bin/psql/help.c | 2 + src/bin/psql/settings.h | 1 + src/bin/psql/startup.c| 10 + src/include/access/detoast.h | 8 + src/include/access/toast_compression.h| 119 +++ src/include/access/toast_helper.h | 1 + src/include/access/toast_internals.h | 20 +- src/include/catalog/pg_attribute.h| 8 +- src/include/catalog/pg_proc.dat | 4 + src/include/nodes/parsenodes.h| 2 + src/include/parser/kwlist.h | 1 + src/include/pg_config.h.in| 3 + src/include/postgres.h| 14 +- src/test/regress/expected/compression.out | 247 ++ src/test/regress/expected/compression_1.out | 240 ++ src/test/regress/parallel_schedule| 2 +- src/test/regress/pg_regress_main.c| 4 +- src/test/regress/serial_schedule | 1 + src/test/regress/sql/compression.sql | 102 ++ src/tools/msvc/Solution.pm| 1 + src/tools/pgindent/typedefs.list | 1 + 52 files changed, 1645 insertions(+), 85 deletions(-) create mode 100644 src/backend/access/common/toast_compression.c create mode 100644 
src/include/access/toast_compression.h create mode 100644 src/test/regress/expected/compression.out create mode 100644 src/test/regress/expected/compression_1.out create mode 100644 src/test/regress/sql/compression.sql diff --git a/configure b/configure index fad817bb38..761a27965d 100755 --- a/configure +++ b/configure @@ -699,6 +699,7 @@ with_gnu_ld LD LDFLAGS_SL LDFLAGS_EX +with_lz4 with_zlib with_system_tzdata with_libxslt @@ -864,6 +865,7 @@ with_libxml with_libxslt
RE: Parallel INSERT (INTO ... SELECT ...)
> > 2. Should we keep the default value of GUC to on or off? It is > > currently off. I am fine keeping it off for this release and we can > > always turn it on in the later releases if required. Having said that, > > I see the value in keeping it on because in many cases Insert ... > > Select will be used for large data and there we will see a benefit of > > parallelism and users facing trouble (who have a very large number of > > partitions with less data to query) can still disable the parallel > > insert for that particular table. Also, the other benefit of keeping > > it on till at least the beta period is that this functionality will > > get tested and if we found reports of regression then we can turn it > > off for this release as well. > > > > Thoughts? > > IMHO, we should keep it on because in most of the cases it is going to give a > benefit to the user, and in some specific scenario where a table has a > lot of > partitions it can be turned off using the reloption. And, if for some users > most > of the tables are like that where they always have a lot of partitions then > the user > can turn it off using the GUC. > I think your suggestion makes sense. If no one has other opinions, I will post a new version that sets the default enable_parallel_insert=on soon. Best regards, houzj
Re: [HACKERS] Custom compression methods
On Thu, Mar 11, 2021 at 2:21 AM Robert Haas wrote: > > On Wed, Mar 10, 2021 at 6:52 AM Dilip Kumar wrote: > > > The pending comment is providing a way to rewrite a table and > > > re-compress the data with the current compression method. > > > > I spent some time poking at this yesterday and couldn't figure out > > what was going on here. There are two places where we rewrite tables. > > One is the stuff in cluster.c, which handles VACUUM FULL and CLUSTER. > > That eventually calls reform_and_rewrite_tuple(), which deforms the > > old tuple and creates a new one, but it doesn't seem like there's > > anything in there that would expand toasted values, whether external > > or inline compressed. But I think that can't be right, because it > > seems like then you'd end up with toast pointers into the old TOAST > > relation, not the new one, which would cause failures later. So I must > > be missing something here. The other place where we rewrite tables is > > in ATRewriteTable() as part of the ALTER TABLE machinery. I don't see > > anything there to force detoasting either. We do expand the external values; see the call stack below. #0 detoast_external_attr (attr=0x2d61a80) at detoast.c:50 #1 0x0055bd53 in toast_tuple_init #2 0x0050554d in heap_toast_insert_or_update #3 0x0050ad5b in raw_heap_insert #4 0x0050a9a1 in rewrite_heap_tuple #5 0x00502325 in reform_and_rewrite_tuple #6 0x004ff47c in heapam_relation_copy_for_cluster But that happens only if there are external attributes; we do nothing for the inline compressed values. raw_heap_insert only does that if HeapTupleHasExternal(tup) is true. So if we want to do something about inline compressed data, we will have to do it in reform_and_rewrite_tuple, because there we deform and re-form the tuple, so we have an opportunity to decompress any compressed data. > But, I am not really convinced that we need to solve this problem by > adding new ALTER TABLE syntax. I'd be happy enough if CLUSTER, VACUUM > FULL, and versions of ALTER TABLE that already force a rewrite would > cause the compression to be redone also. Okay, maybe for directly compressed data we can do that without affecting the performance of unrelated paths but I am again worried about the composite type. Basically, if we have some row type then we have to deform the tuple stored in the row type and process it fully. That said, I think in the older version we already had patches at [1]; basically the first 3 patches will ensure that we will never have any compressed data in the composite type so we will not have to worry about those as well. Honestly, even if the user > had to fall back on creating a new table and doing INSERT INTO newtab > SELECT * FROM oldtab I would consider that to be not a total > showstopper for this .. assuming of course that it actually works. If > it doesn't, we have big problems. Even without the pg_am stuff, we > still need to make sure that we don't just blindly let compressed > values wander around everywhere. When we insert into a table column > with a compression method, we should recompress any data that is > compressed using some other method. Well, it used to work in the older version [1], so we can make it work here as well without much effort. [1] https://www.postgresql.org/message-id/CAFiTN-u%3D2-qaLTod3isQmXuSU0s0_bR%2BRcUQL-vSvH%3DMJbEd7Q%40mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: About to add WAL write/fsync statistics to pg_stat_wal view
On 2021/03/11 9:38, Masahiro Ikeda wrote: On 2021-03-10 17:08, Fujii Masao wrote: On 2021/03/10 14:11, Masahiro Ikeda wrote: On 2021-03-09 17:51, Fujii Masao wrote: On 2021/03/05 8:38, Masahiro Ikeda wrote: On 2021-03-05 01:02, Fujii Masao wrote: On 2021/03/04 16:14, Masahiro Ikeda wrote: On 2021-03-03 20:27, Masahiro Ikeda wrote: On 2021-03-03 16:30, Fujii Masao wrote: On 2021/03/03 14:33, Masahiro Ikeda wrote: On 2021-02-24 16:14, Fujii Masao wrote: On 2021/02/15 11:59, Masahiro Ikeda wrote: On 2021-02-10 00:51, David G. Johnston wrote: On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda wrote: I pgindented the patches. ... XLogWrite, which is invoked during an XLogFlush request (see ...). This is also incremented by the WAL receiver during replication. ("which normally called" should be "which is normally called" or "which normally is called" if you want to keep true to the original) You missed the adding the space before an opening parenthesis here and elsewhere (probably copy-paste) is ether -> is either "This parameter is off by default as it will repeatedly query the operating system..." ", because" -> "as" Thanks, I fixed them. wal_write_time and the sync items also need the note: "This is also incremented by the WAL receiver during replication." I skipped changing it since I separated the stats for the WAL receiver in pg_stat_wal_receiver. "The number of times it happened..." -> " (the tally of this event is reported in wal_buffers_full in) This is undesirable because ..." Thanks, I fixed it. I notice that the patch for WAL receiver doesn't require explicitly computing the sync statistics but does require computing the write statistics. This is because of the presence of issue_xlog_fsync but absence of an equivalent pg_xlog_pwrite. Additionally, I observe that the XLogWrite code path calls pgstat_report_wait_*() while the WAL receiver path does not. It seems technically straight-forward to refactor here to avoid the almost-duplicated logic in the two places, though I suspect there may be a trade-off for not adding another function call to the stack given the importance of WAL processing (though that seems marginalized compared to the cost of actually writing the WAL). Or, as Fujii noted, go the other way and don't have any shared code between the two but instead implement the WAL receiver one to use pg_stat_wal_receiver instead. In either case, this half-and-half implementation seems undesirable. OK, as Fujii-san mentioned, I separated the WAL receiver stats. (v10-0002-Makes-the-wal-receiver-report-WAL-statistics.patch) Thanks for updating the patches! I added the infrastructure code to communicate the WAL receiver stats messages between the WAL receiver and the stats collector, and the stats for WAL receiver is counted in pg_stat_wal_receiver. What do you think? On second thought, this idea seems not good. Because those stats are collected between multiple walreceivers, but other values in pg_stat_wal_receiver is only related to the walreceiver process running at that moment. IOW, it seems strange that some values show dynamic stats and the others show collected stats, even though they are in the same view pg_stat_wal_receiver. Thought? OK, I fixed it. The stats collected in the WAL receiver is exposed in pg_stat_wal view in v11 patch. Thanks for updating the patches! I'm now reading 001 patch. 
+ /* Check whether the WAL file was synced to disk right now */ + if (enableFsync && + (sync_method == SYNC_METHOD_FSYNC || + sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH || + sync_method == SYNC_METHOD_FDATASYNC)) + { Isn't it better to make issue_xlog_fsync() return immediately if enableFsync is off, sync_method is open_sync or open_data_sync, to simplify the code more? Thanks for the comments. I added the above code in v12 patch. + /* + * Send WAL statistics only if WalWriterDelay has elapsed to minimize + * the overhead in WAL-writing. + */ + if (rc & WL_TIMEOUT) + pgstat_send_wal(); On second thought, this change means that it always takes wal_writer_delay before walwriter's WAL stats is sent after XLogBackgroundFlush() is called. For example, if wal_writer_delay is set to several seconds, some values in pg_stat_wal would be not up-to-date meaninglessly for those seconds. So I'm thinking to withdraw my previous comment and it's ok to send the stats every after XLogBackgroundFlush() is called. Thought? Thanks, I didn't notice that. Although PGSTAT_STAT_INTERVAL is 500msec, wal_writer_delay's default value is 200msec and it may be set shorter time. Yeah, if wal_writer_delay is set to very small value, there is a risk that the WAL stats are sent too frequently. I agree that's a problem. Why don't to make another way to check the timestamp? + /* + * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL +
Re: Replace buffer I/O locks with condition variables (reviving an old patch)
On Thu, Mar 11, 2021 at 3:27 PM Julien Rouhaud wrote: > - /* > -* It would be nice to include the I/O locks in the BufferDesc, but that > -* would increase the size of a BufferDesc to more than one cache line, > -* and benchmarking has shown that keeping every BufferDesc aligned on a > -* cache line boundary is important for performance. So, instead, the > -* array of I/O locks is allocated in a separate tranche. Because those > -* locks are not highly contended, we lay out the array with minimal > -* padding. > -*/ > - size = add_size(size, mul_size(NBuffers, sizeof(LWLockMinimallyPadded))); > + /* size of I/O condition variables */ > + size = add_size(size, mul_size(NBuffers, > + sizeof(ConditionVariableMinimallyPadded))); > > Should we keep for now some similar comment mentionning why we don't put the > cv > in the BufferDesc even though it would currently fit the 64B target size? I tried to write some words along those lines, but it seemed hard to come up with a replacement message about a thing we're not doing because of other currently proposed patches. The situation could change, and it seemed to be a strange place to put this comment anyway, far away from the relevant struct. Ok, let me try that again... what do you think of this, as a new comment for BufferDesc, next to the existing discussion of the 64 byte rule? --- a/src/include/storage/buf_internals.h +++ b/src/include/storage/buf_internals.h @@ -174,6 +174,10 @@ typedef struct buftag * Be careful to avoid increasing the size of the struct when adding or * reordering members. Keeping it below 64 bytes (the most common CPU * cache line size) is fairly important for performance. + * + * Per-buffer I/O condition variables are kept outside this struct in a + * separate array. They could be moved in here and still fit under that + * limit on common systems, but for now that is not done. */ typedef struct BufferDesc {
RE: libpq debug log
From: Kyotaro Horiguchi > The output functions copy message bytes into local variable but the > same effect can be obtained by just casing via typed pointer type. > > uint32 tmp4; > .. > memcpy(&tmp4, buf + *cursor, 4); > result = (int) pg_ntoh32(tmp4); > > can be written as > > result = pg_ntoh32(* (uint32 *) (buf + *cursor)); I'm afraid we need to memcpy() because of memory alignment. > I think we can get rid of copying in the output functions for String > and Bytes in different ways. I haven't looked at this code, but you sound right. Regards Takayuki Tsunakawa
Re: libpq debug log
At Wed, 10 Mar 2021 20:38:20 -0500, Tom Lane wrote in > Kyotaro Horiguchi writes: > > Maybe I'm missing something, but the above doesn't seem to be working. I > > didn't see such a message when making an SSL connection. > > As far as that goes, I don't see any point in trying to trace > connection-related messages, because there is no way to enable > tracing on a connection before it's created. Yeah, agreed. In the previous version the tracing functions were called during protocol negotiation, but that no longer happens. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: libpq debug log
At Thu, 11 Mar 2021 01:51:55 +, "tsunakawa.ta...@fujitsu.com" wrote in > Alvaro-san, Tom-san, Horiguchi-san, > > From: Kyotaro Horiguchi > > +1 for the thanks for the quick work. I have some random comments > > after a quick look on it. > > Thank you very much for giving many comments. And We're sorry to > have caused you trouble. I told Iwata-san yesterday to modify the > patch to use the logging function interface that Tom-san, Alvaro-san > and I agreed on, that I'd review after the new revision has been Yeah, I agreed at the time. Sorry for not responding it but had no room in my mind to do that:p > posted, and let others know she is modifying the patch again so that > they won't start reviewing. But I should have done the request on > hackers. > > We're catching up with your feedback and post the updated patch. > Then I'll review it. > > We appreciate your help so much. My pleasure. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Replace buffer I/O locks with condition variables (reviving an old patch)
On Thu, Mar 11, 2021 at 03:54:06PM +1300, Thomas Munro wrote: > On Thu, Mar 11, 2021 at 3:27 PM Julien Rouhaud wrote: > > - /* > > -* It would be nice to include the I/O locks in the BufferDesc, but that > > -* would increase the size of a BufferDesc to more than one cache line, > > -* and benchmarking has shown that keeping every BufferDesc aligned on a > > -* cache line boundary is important for performance. So, instead, the > > -* array of I/O locks is allocated in a separate tranche. Because those > > -* locks are not highly contended, we lay out the array with minimal > > -* padding. > > -*/ > > - size = add_size(size, mul_size(NBuffers, > > sizeof(LWLockMinimallyPadded))); > > + /* size of I/O condition variables */ > > + size = add_size(size, mul_size(NBuffers, > > + > > sizeof(ConditionVariableMinimallyPadded))); > > > > Should we keep for now some similar comment mentionning why we don't put > > the cv > > in the BufferDesc even though it would currently fit the 64B target size? > > I tried to write some words along those lines, but it seemed hard to > come up with a replacement message about a thing we're not doing > because of other currently proposed patches. The situation could > change, and it seemed to be a strange place to put this comment > anyway, far away from the relevant struct. Yeah, I agree that it's not the best place to document the size consideration. > Ok, let me try that > again... what do you think of this, as a new comment for BufferDesc, > next to the existing discussion of the 64 byte rule? > > --- a/src/include/storage/buf_internals.h > +++ b/src/include/storage/buf_internals.h > @@ -174,6 +174,10 @@ typedef struct buftag > * Be careful to avoid increasing the size of the struct when adding or > * reordering members. Keeping it below 64 bytes (the most common CPU > * cache line size) is fairly important for performance. > + * > + * Per-buffer I/O condition variables are kept outside this struct in a > + * separate array. They could be moved in here and still fit under that > + * limit on common systems, but for now that is not done. > */ > typedef struct BufferDesc > { I was mostly thinking about something like "leave room for now as other feature could make a better use of that space", but I'm definitely fine with this comment!
Re: libpq debug log
At Thu, 11 Mar 2021 03:01:16 +, "tsunakawa.ta...@fujitsu.com" wrote in > From: Kyotaro Horiguchi > > The output functions copy message bytes into a local variable but the > > same effect can be obtained by just casting via a typed pointer. > > > > uint32 tmp4; > > .. > > memcpy(&tmp4, buf + *cursor, 4); > > result = (int) pg_ntoh32(tmp4); > > > > can be written as > > > > result = pg_ntoh32(* (uint32 *) (buf + *cursor)); > > I'm afraid we need to memcpy() because of memory alignment. Right. So something like this? const unsigned char *p = (const unsigned char *) (buf + *cursor); result = ((uint32) p[0] << 24) | ((uint32) p[1] << 16) | ...; > > I think we can get rid of copying in the output functions for String > > and Bytes in different ways. > > I haven't looked at this code, but you sound right. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [HACKERS] Custom compression methods
On Thu, Mar 11, 2021 at 08:17:46AM +0530, Dilip Kumar wrote: > On Thu, Mar 11, 2021 at 2:21 AM Robert Haas wrote: > > > > On Wed, Mar 10, 2021 at 6:52 AM Dilip Kumar wrote: > > > The pending comment is providing a way to rewrite a table and > > > re-compress the data with the current compression method. > > > > I spent some time poking at this yesterday and ran couldn't figure out > > what was going on here. There are two places where we rewrite tables. > > One is the stuff in cluter.c, which handles VACUUM FULL and CLUSTER. > > That eventually calls reform_and_rewrite_tuple(), which deforms the > > old tuple and creates a new one, but it doesn't seem like there's > > anything in there that would expand toasted values, whether external > > or inline compressed. But I think that can't be right, because it > > seems like then you'd end up with toast pointers into the old TOAST > > relation, not the new one, which would cause failures later. So I must > > be missing something here. The other place where we rewrite tables is > > in ATRewriteTable() as part of the ALTER TABLE machinery. I don't see > > anything there to force detoasting either. > > We do expand the external values, see below call stack. See below call stack. > > #0 detoast_external_attr (attr=0x2d61a80) at detoast.c:50 > #1 0x0055bd53 in toast_tuple_init > #2 0x0050554d in heap_toast_insert_or_update > #3 0x0050ad5b in raw_heap_insert > #4 0x0050a9a1 in rewrite_heap_tuple > #5 0x00502325 in reform_and_rewrite_tuple > #6 0x004ff47c in heapam_relation_copy_for_cluster > > But that is only if there are external attributes, we do nothing for > the inline compressed values. In raw_heap_insert only if > HeapTupleHasExternal(tup) is true then we call raw_heap_insert. So if > we want to do something about inline compressed data then we will have > to do something in reform_and_rewrite_tuple because there we deform > and form the tuple again so we have an opportunity to decompress any > compressed data. > > > But, I am not really convinced that we need to solve this problem by > > adding new ALTER TABLE syntax. I'd be happy enough if CLUSTER, VACUUM > > FULL, and versions of ALTER TABLE that already force a rewrite would > > cause the compression to be redone also. > > Okay, Maybe for directly compressed data we can do that without > affecting the performance of unrelated paths but I am again worried > about the composite type. Basically, if we have some row type then we > have to deform the tuple stored in the row type and process it fully. > Said that I think in the older version we already had a pacthes at > [1], basically the first 3 patches will ensure that we will never have > any compressed data in the composite type so we will not have to worry > about those as well. > > Honestly, even if the user > > had to fall back on creating a new table and doing INSERT INTO newtab > > SELECT * FROM oldtab I would consider that to be not a total > > showstopper for this .. assuming of course that it actually works. If > > it doesn't, we have big problems. Even without the pg_am stuff, we > > still need to make sure that we don't just blindly let compressed > > values wander around everywhere. When we insert into a table column > > with a compression method, we should recompress any data that is > > compressed using some other method. > > Well, it used to work in the older version[1] so we can make it work > here as well without much effort. 
Looking at v23-0002-alter-table-set-compression, ATRewriteTable() was calling CompareCompressionMethodAndDecompress(). I think what's wanted here is that decompression should only happen when the tuple uses a different compression than the column's currently set compression. So there's no overhead in the usual case. I guess CLUSTER and INSERT SELECT should do the same. This is important to allow someone to get rid of LZ4 compression, if they want to get rid of that dependency. But it's also really desirable for admins to be able to "migrate" existing data. People will want to test this, and I guess INSERT SELECT and CLUSTER are the obvious ways. -- Justin
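Assuming the rewrite paths do get taught to recompress as discussed, the admin-facing migration would presumably look like this (desired behavior under discussion, not what the current patch does for inline-compressed values):

ALTER TABLE t ALTER COLUMN a SET COMPRESSION pglz;
VACUUM FULL t;   -- or CLUSTER / INSERT INTO newtab SELECT * FROM oldtab
SELECT DISTINCT pg_column_compression(a) FROM t;   -- ideally only 'pglz'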
Re: shared-memory based stats collector
Hi, Two minor nits: On 2021-03-10 21:47:51 +0900, Fujii Masao wrote: > +/* Shared memory area for archiver process */ > +typedef struct PgArchData > +{ > + Latch *latch; /* latch to wake the archiver > up */ > + slock_t mutex; /* locks this struct */ > +} PgArchData; > + It doesn't really matter, but it'd be pretty trivial to avoid needing a spinlock for this kind of thing. Just store the pgprocno of the archiver in PgArchData. While getting rid of the spinlock doesn't seem like a huge win, it does seem nicer that we'd automatically have a way to find data about the archiver (e.g. pid). >* checkpointer to exit as well, otherwise not. The archiver, > stats, >* and syslogger processes are disregarded since they are not >* connected to shared memory; we also disregard dead_end > children >* here. Walsenders are also disregarded, they will be > terminated >* later after writing the checkpoint record, like the archiver >* process. >*/ This comment in PostmasterStateMachine() is outdated now. Greetings, Andres Freund
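A rough sketch of that pgprocno variant, with made-up names, just to illustrate the suggestion:

typedef struct PgArchData
{
	int			pgprocno;		/* pgprocno of archiver, or INVALID_PGPROCNO */
} PgArchData;

/* Wake the archiver, if one is running. */
static void
PgArchWakeup(void)
{
	int			pgprocno = PgArch->pgprocno;

	/*
	 * Reading an int is atomic, so no lock is needed.  A stale value only
	 * means we set the latch of a process that no longer cares, which is
	 * harmless.
	 */
	if (pgprocno != INVALID_PGPROCNO)
		SetLatch(&ProcGlobal->allProcs[pgprocno].procLatch);
}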
Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)
On Tue, Mar 2, 2021 at 2:10 PM Thomas Munro wrote: > ... One question I haven't > got to the bottom of: is it a problem for the startup process that CVs > use CHECK_FOR_INTERRUPTS()? This was a red herring. The startup process already reaches CFI() via various paths, as I figured out pretty quickly with a debugger. So I'd like to go ahead and commit these patches. Michael, when you said "That's pretty hack-ish, still efficient" in reference to this code: > - if (IsUnderPostmaster && !PostmasterIsAlive()) > + if (IsUnderPostmaster && > +#ifndef USE_POSTMASTER_DEATH_SIGNAL > + count++ % 1024 == 0 && > +#endif > + !PostmasterIsAlive()) Is that an objection, and do you see a specific better way? I know that someone just needs to write a Windows patch to get us a postmaster death signal when the postmaster's event fires, and then the problem will go away on Windows. I still want this change, because we don't have such a patch yet, and even when someone writes that, there are still a couple of Unixes that could benefit.
Re: [PATCH] Provide more information to filter_prepare
On Wed, Mar 10, 2021 at 4:26 PM Markus Wanner wrote: > > On 10.03.21 11:18, Amit Kapila wrote: > > On Tue, Mar 9, 2021 at 2:14 PM Markus Wanner > > wrote: > >> currently, only the gid is passed on to the filter_prepare callback. > >> While we probably should not pass a full ReorderBufferTXN (as we do for > >> most other output plugin callbacks), a bit more information would be > >> nice, I think. > > > > How can the proposed 'xid' parameter be useful? What exactly do plugins > > want to do with it? > > The xid is the very basic identifier for transactions in Postgres. Any > output plugin that interacts with Postgres in any way slightly more > interesting than "filter by gid prefix" is very likely to come across a > TransactionId. > > It allows for basics like checking if the transaction to decode is still > in progress, for example. > But this happens when we are decoding the prepare, so it is clear that the transaction is prepared; why is any additional check needed? > Or in a much more complex scenario, decide on > whether or not to filter based on properties the extension stored during > processing the transaction. > What part of this can't be done with the GID, and how would the XID achieve it? -- With Regards, Amit Kapila.