Re: EXPLAIN: Non-parallel ancestor plan nodes exclude parallel worker instrumentation
On Tue, Jun 23, 2020 at 7:55 PM Amit Kapila wrote: > > I don't see any other reason for > > looping over the NL node itself in this plan. The Gather itself > > doesn't do any real looping, right? > > It is right that Gather doesn't do looping but Parallel Seq Scan node does so. Sorry, I still don't follow. How does a Parallel Seq Scan do looping? I looked at the parallel plan docs but I don't see looping mentioned anywhere[1]. Also, is looping not normally indicated on children, rather than on the node doing the looping? E.g., with a standard Nested Loop, the outer child will have loops=1 and the inner child will have loops equal to the row count produced by the outer child (and the Nested Loop itself will have loops=1 unless it also is being looped over by a parent node), right? But even aside from that, why do I need to divide by the number of loops here, when normally Postgres presents a per-loop average? [1]: https://www.postgresql.org/docs/current/parallel-plans.html
RE: [PoC] Non-volatile WAL buffer
Dear hackers,

I update my non-volatile WAL buffer's patchset to v3. Now we can use it in streaming replication mode.

Updates from v2:

- walreceiver supports non-volatile WAL buffer
  Now walreceiver stores received records directly to non-volatile WAL buffer if applicable.

- pg_basebackup supports non-volatile WAL buffer
  Now pg_basebackup copies received WAL segments onto non-volatile WAL buffer if you run it with "nvwal" mode (-Fn). You should specify a new NVWAL path with --nvwal-path option. The path will be written to postgresql.auto.conf or recovery.conf. The size of the new NVWAL is same as the master's one.

Best regards,
Takashi

--
Takashi Menjo
NTT Software Innovation Center

> -----Original Message-----
> From: Takashi Menjo
> Sent: Wednesday, March 18, 2020 5:59 PM
> To: 'PostgreSQL-development'
> Cc: 'Robert Haas' ; 'Heikki Linnakangas' ; 'Amit Langote'
> Subject: RE: [PoC] Non-volatile WAL buffer
>
> Dear hackers,
>
> I rebased my non-volatile WAL buffer's patchset onto master. A new v2 patchset is attached to this mail.
>
> I also measured performance before and after patchset, varying -c/--client and -j/--jobs options of pgbench, for each scaling factor s = 50 or 1000. The results are presented in the following tables and the attached charts. Conditions, steps, and other details will be shown later.
>
> Results (s=50)
> ==============
>            Throughput [10^3 TPS]     Average latency [ms]
>  ( c, j)   before  after             before  after
>  -------   ------  -------------     ------  -------------
>  ( 8, 8)   35.7    37.1 (+3.9%)      0.224   0.216 (-3.6%)
>  (18,18)   70.9    74.7 (+5.3%)      0.254   0.241 (-5.1%)
>  (36,18)   76.0    80.8 (+6.3%)      0.473   0.446 (-5.7%)
>  (54,18)   75.5    81.8 (+8.3%)      0.715   0.660 (-7.7%)
>
> Results (s=1000)
> ================
>            Throughput [10^3 TPS]     Average latency [ms]
>  ( c, j)   before  after             before  after
>  -------   ------  -------------     ------  -------------
>  ( 8, 8)   37.4    40.1 (+7.3%)      0.214   0.199 (-7.0%)
>  (18,18)   79.3    86.7 (+9.3%)      0.227   0.208 (-8.4%)
>  (36,18)   87.2    95.5 (+9.5%)      0.413   0.377 (-8.7%)
>  (54,18)   86.8    94.8 (+9.3%)      0.622   0.569 (-8.5%)
>
> Both throughput and average latency are improved for each scaling factor. Throughput seemed to almost reach the upper limit when (c,j)=(36,18).
>
> The percentage in s=1000 case looks larger than in s=50 case. I think larger scaling factor leads to less contentions on the same tables and/or indexes, that is, less lock and unlock operations. In such a situation, write-ahead logging appears to be more significant for performance.
>
> Conditions
> ==========
> - Use one physical server having 2 NUMA nodes (node 0 and 1)
> - Pin postgres (server processes) to node 0 and pgbench to node 1
> - 18 cores and 192GiB DRAM per node
> - Use an NVMe SSD for PGDATA and an interleaved 6-in-1 NVDIMM-N set for pg_wal
> - Both are installed on the server-side node, that is, node 0
> - Both are formatted with ext4
> - NVDIMM-N is mounted with "-o dax" option to enable Direct Access (DAX)
> - Use the attached postgresql.conf
> - Two new items nvwal_path and nvwal_size are used only after patch
>
> Steps
> =====
> For each (c,j) pair, I did the following steps three times then I found the median of the three as a final result shown in the tables above.
>
> (1) Run initdb with proper -D and -X options; and also give --nvwal-path and --nvwal-size options after patch
> (2) Start postgres and create a database for pgbench tables
> (3) Run "pgbench -i -s ___" to create tables (s = 50 or 1000)
> (4) Stop postgres, remount filesystems, and start postgres again
> (5) Execute pg_prewarm extension for all the four pgbench tables
> (6) Run pgbench during 30 minutes
>
> pgbench command line
> ====================
> $ pgbench -h /tmp -p 5432 -U username -r -M prepared -T 1800 -c ___ -j ___ dbname
>
> I gave no -b option to use the built-in "TPC-B (sort-of)" query.
>
> Software
> ========
> - Distro: Ubuntu 18.04
> - Kernel: Linux 5.4 (vanilla kernel)
> - C Compiler: gcc 7.4.0
> - PMDK: 1.7
> - PostgreSQL: d677550 (master on Mar 3, 2020)
>
> Hardware
> ========
> - System: HPE ProLiant DL380 Gen10
> - CPU: Intel Xeon Gold 6154 (Skylake) x 2sockets
> - DRAM: DDR4 2666MHz {32GiB/ch x 6ch}/socket x 2sockets
> - NVDIMM-N: DDR4 2666MHz {16GiB/ch x 6ch}/socket x 2sockets
> - NVMe SSD: Intel Optane DC P4800X Series SSDPED1K750GA
>
> Best regards,
> Takashi
>
> --
> Takashi Menjo
> NTT Software Innovation Center
>
> > -----Original Message-----
> > From: Takashi Menjo
> > Sent: Thursday, February 20, 2020 6:30 PM
> > To: 'Amit Langote'
> > Cc: 'Robert Haas' ; 'Heikki Linnakangas' ; 'PostgreSQL-development'
> > Subject: RE: [PoC] Non-volatile WAL buffer
> >
> > Dear Amit,
> >
> > Thank you for your advice. Exactly, it's so to speak "do as the hackers do when in pgsql"...
Re: Improve handling of parameter differences in physical replication
Here is another stab at this subject. This is a much simplified variant: When encountering a parameter change in the WAL that is higher than the standby's current setting, we log a warning (instead of an error until now) and pause recovery. If you resume (unpause) recovery, the instance shuts down as before. This allows you to keep your standbys running for a bit (depending on lag requirements) and schedule the required restart more deliberately. I had previously suggested making this new behavior configurable, but there didn't seem to be much interest in that, so I have not included that there. The documentation changes are mostly carried over from previous patch versions (but adjusted for the actual behavior of the patch). -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services From 2d3ffb221ae9418ebd7e1fc7fb771213014ca216 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Wed, 24 Jun 2020 08:49:42 +0200 Subject: [PATCH v3 1/2] Replace a macro by a function Using a macro is ugly and not justified here. --- src/backend/access/transam/xlog.c | 21 +++-- 1 file changed, 11 insertions(+), 10 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index a1256a103b..dcec2111b3 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -6232,16 +6232,17 @@ GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream) * Note that text field supplied is a parameter name and does not require * translation */ -#define RecoveryRequiresIntParameter(param_name, currValue, minValue) \ -do { \ - if ((currValue) < (minValue)) \ - ereport(ERROR, \ - (errcode(ERRCODE_INVALID_PARAMETER_VALUE), \ -errmsg("hot standby is not possible because %s = %d is a lower setting than on the master server (its value was %d)", \ - param_name, \ - currValue, \ - minValue))); \ -} while(0) +static void +RecoveryRequiresIntParameter(const char *param_name, int currValue, int minValue) +{ + if (currValue < minValue) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), +errmsg("hot standby is not possible because %s = %d is a lower setting than on the master server (its value was %d)", + param_name, + currValue, + minValue))); +} /* * Check to see if required parameters are set high enough on this server -- 2.27.0 From 18740a021a90207423032cb7201f54650b81c15a Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Wed, 24 Jun 2020 09:18:21 +0200 Subject: [PATCH v3 2/2] Pause recovery for insufficient parameter settings When certain parameters are changed on a physical replication primary, this is communicated to standbys using the XLOG_PARAMETER_CHANGE WAL record. The standby then checks whether its own settings are at least as big as the ones on the primary. If not, the standby shuts down with a fatal error. This patch changes this behavior to pause recovery at that point instead. That allows read traffic on the standby to continue while database administrators figure out next steps. When recovery is unpaused, the server shuts down (as before). The idea is to fix the parameters while recovery is paused and then restart when there is a maintenance window. 
--- doc/src/sgml/high-availability.sgml | 47 + src/backend/access/transam/xlog.c | 17 ++- 2 files changed, 51 insertions(+), 13 deletions(-) diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml index 65c3fc62a9..d32ec6d9b3 100644 --- a/doc/src/sgml/high-availability.sgml +++ b/doc/src/sgml/high-availability.sgml @@ -2159,18 +2159,14 @@ Administrator's Overview -The setting of some parameters on the standby will need reconfiguration -if they have been changed on the primary. For these parameters, -the value on the standby must -be equal to or greater than the value on the primary. -Therefore, if you want to increase these values, you should do so on all -standby servers first, before applying the changes to the primary server. -Conversely, if you want to decrease these values, you should do so on the -primary server first, before applying the changes to all standby servers. -If these parameters -are not set high enough then the standby will refuse to start. -Higher values can then be supplied and the server -restarted to begin recovery again. These parameters are: +The settings of some parameters determine the s
Re: Allow CURRENT_ROLE in GRANTED BY
On 6/24/20 8:35 AM, Peter Eisentraut wrote: > I was checking some loose ends in SQL conformance, when I noticed: We > support GRANT role ... GRANTED BY CURRENT_USER, but we don't support > CURRENT_ROLE in that place, even though in PostgreSQL they are > equivalent. Here is a trivial patch to add that. The only thing that isn't dead-obvious about this patch is the commit message says "[PATCH 1/2]". What is in the other part? Assuming that's just a remnant of development, this LGTM. -- Vik Fearing
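For reference, the syntax in question is tiny; the patch just accepts CURRENT_ROLE where CURRENT_USER is already allowed (role names below are hypothetical):

    -- already supported:
    GRANT admin TO alice GRANTED BY CURRENT_USER;
    -- newly accepted, and equivalent in PostgreSQL:
    GRANT admin TO alice GRANTED BY CURRENT_ROLE;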
Re: Removal of currtid()/currtid2() and some table AM cleanup
Hi Michael, Where do you test, on Windows or on *nix? How do you test there? regards, Hiroshi Inoue On 2020/06/24 11:11, Michael Paquier wrote: On Tue, Jun 23, 2020 at 02:02:33PM +0900, Michael Paquier wrote: Actually, while reviewing the code, the only code path where we use currtid2() involves positioned_load() and LATEST_TUPLE_LOAD. And the only location where this happens is in SC_pos_reload_with_key(), where I don't actually see how it would be possible to not have a keyset and still use a CTID, which would led to LATEST_TUPLE_LOAD being used. So could it be possible that the code paths of currtid2() are actually just dead code? I have dug more into this one, and we actually stressed this code path quite a lot up to commit d9cb23f in the ODBC driver, with tests cursor-block-delete, positioned-update and bulkoperations particularly when calling SQLSetPos(). However, 86e2e7a has reworked the code in such a way that we visibly don't use anymore CTIDs if we don't have a keyset, and that combinations of various options like UseDeclareFetch or UpdatableCursors don't trigger this code path anymore. In short, currtid2() does not get used. Inoue-san, Saito-san, what do you think? I am adding also Tsunakawa-san in CC who has some experience in this area. -- Michael
Re: [Patch] ALTER SYSTEM READ ONLY
On 6/22/20 11:59 AM, Amul Sul wrote:

  2. Now skipping the startup checkpoint if the system is read-only mode, as discussed [2].

I am not able to perform pg_checksums o/p after shutting down my server in read only mode.

Steps -
1. initdb (./initdb -k -D data)
2. start the server (./pg_ctl -D data start)
3. connect to psql (./psql postgres)
4. Fire query (alter system read only;)
5. shutdown the server (./pg_ctl -D data stop)
6. pg_checksums

[edb@tushar-ldap-docker bin]$ ./pg_checksums -D data
pg_checksums: error: cluster must be shut down
[edb@tushar-ldap-docker bin]$

Result - (when server is not in read only)

[edb@tushar-ldap-docker bin]$ ./pg_checksums -D data
Checksum operation completed
Files scanned: 916
Blocks scanned: 2976
Bad checksums: 0
Data checksum version: 1

--
regards, tushar
EnterpriseDB https://www.enterprisedb.com/
The Enterprise PostgreSQL Company
Re: should libpq also require TLSv1.2 by default?
> On 24 Jun 2020, at 08:39, Peter Eisentraut > wrote: > > In PG13, we raised the server-side default of ssl_min_protocol_version to > TLSv1.2. We also added a connection setting named ssl_min_protocol_version > to libpq. But AFAICT, the default value of the libpq setting is empty, so > any protocol version will be accepted. Is this what we wanted? Should we > raise the default in libpq as well? This was discussed [0] when the connection settings were introduced, and the concensus was to leave them alone [1] to allow for example a new pg_dump to work against an old server. Re-reading the thread I think the argument still holds, but I was about to respond "yes, let's do this" before refreshing my memory. Perhaps we should add a comment explaining this along the lines of the attached? cheers ./daniel [0] https://www.postgresql.org/message-id/157800160408.1198.1714906047977693148.pgcf%40coridan.postgresql.org [1] https://www.postgresql.org/message-id/31993.1578321474%40sss.pgh.pa.us libpq_minmaxproto_doc.diff Description: Binary data
Re: Parallel copy
Hi, It looks like the parsing of newly introduced "PARALLEL" option for COPY FROM command has an issue(in the 0002-Framework-for-leader-worker-in-parallel-copy.patch), Mentioning PARALLEL '4ar2eteid'); would pass with 4 workers since atoi() is being used for converting string to integer which just returns 4, ignoring other strings. I used strtol(), added error checks and introduced the error " improper use of argument to option "parallel"" for the above cases. parallel '4ar2eteid'); ERROR: improper use of argument to option "parallel" LINE 5: parallel '1\'); Along with the updated patch 0002-Framework-for-leader-worker-in-parallel-copy.patch, also attaching all the latest patches from [1]. [1] - https://www.postgresql.org/message-id/CALj2ACW94icER3WrWapon7JkcX8j0TGRue5ycWMTEvgA3X7fOg%40mail.gmail.com With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com On Tue, Jun 23, 2020 at 12:22 PM vignesh C wrote: > > On Tue, Jun 23, 2020 at 8:07 AM vignesh C wrote: > > I have attached the patch for the same with the fixes. > > The patches were not applying on the head, attached the patches that can be > applied on head. > I have added a commitfest entry[1] for this feature. > > [1] - https://commitfest.postgresql.org/28/2610/ > > > On Tue, Jun 23, 2020 at 8:07 AM vignesh C wrote: >> >> Thanks Ashutosh For your review, my comments are inline. >> On Fri, Jun 19, 2020 at 5:41 PM Ashutosh Sharma >> wrote: >> > >> > Hi, >> > >> > I just got some time to review the first patch in the list i.e. >> > 0001-Copy-code-readjustment-to-support-parallel-copy.patch. As the patch >> > name suggests, it is just trying to reshuffle the existing code for COPY >> > command here and there. There is no extra changes added in the patch as >> > such, but still I do have some review comments, please have a look: >> > >> > 1) Can you please add some comments atop the new function >> > PopulateAttributes() describing its functionality in detail. Further, this >> > new function contains the code from BeginCopy() to set attribute level >> > options used with COPY FROM such as FORCE_QUOTE, FORCE_NOT_NULL, >> > FORCE_NULL etc. in cstate and along with that it also copies the code from >> > BeginCopy() to set other infos such as client encoding type, encoding >> > conversion etc. Hence, I think it would be good to give it some better >> > name, basically something that matches with what actually it is doing. >> > >> >> There is no new code added in this function, some part of code from >> BeginCopy was made in to a new function as this part of code will also >> be required for the parallel copy workers before the workers start the >> actual copy operation. This code was made into a function to avoid >> duplication. Changed the function name to PopulateGlobalsForCopyFrom & >> added few comments. >> >> > 2) Again, the name for the new function CheckCopyFromValidity() doesn't >> > look good to me. From the function name it appears as if it does the >> > sanity check of the entire COPY FROM command, but actually it is just >> > doing the sanity check for the target relation specified with COPY FROM. >> > So, probably something like CheckTargetRelValidity would look more >> > sensible, I think? TBH, I am not good at naming the functions so you can >> > always ignore my suggestions about function and variable names :) >> > >> >> Changed as suggested. >> > 3) Any reason for not making CheckCopyFromValidity as a macro instead of a >> > new function. It is just doing the sanity check for the target relation. 
>> > >> >> I felt there is reasonable number of lines in the function & it is not >> in performance intensive path, so I preferred function over macro. >> Your thoughts? >> >> > 4) Earlier in CopyReadLine() function while trying to clear the EOL marker >> > from cstate->line_buf.data (copied data), we were not checking if the line >> > read by CopyReadLineText() function is a header line or not, but I can see >> > that your patch checks that before clearing the EOL marker. Any reason for >> > this extra check? >> > >> >> If you see the caller of CopyReadLine, i.e. NextCopyFromRawFields does >> nothing for the header line, server basically calls CopyReadLine >> again, it is a kind of small optimization. Anyway server is not going >> to do anything with header line, I felt no need to clear EOL marker >> for header lines. >> /* on input just throw the header line away */ >> if (cstate->cur_lineno == 0 && cstate->header_line) >> { >> cstate->cur_lineno++; >> if (CopyReadLine(cstate)) >> return false; /* done */ >> } >> >> cstate->cur_lineno++; >> >> /* Actually read the line into memory here */ >> done = CopyReadLine(cstate); >> I think no need to make a fix for this. Your thoughts? >> >> > 5) I noticed the below spurious line removal in the patch. >> > >> > @@ -3839,7 +3953,6 @@ static bool >> > CopyReadLine(CopyState cstate) >> > { >> > boolresult; >> > - >> > >> >
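Coming back to the strtol() change described at the top of this mail, a minimal sketch of that style of validation is below; the function name, error code and message are illustrative, not necessarily what the patch uses:

    /*
     * Hypothetical helper: parse the PARALLEL option strictly, rejecting
     * trailing garbage such as '4ar2eteid' that atoi() silently accepts.
     */
    static int
    ParseParallelOption(const char *str)
    {
        char   *endptr;
        long    val;

        errno = 0;
        val = strtol(str, &endptr, 10);
        if (errno != 0 || endptr == str || *endptr != '\0' ||
            val <= 0 || val > INT_MAX)
            ereport(ERROR,
                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                     errmsg("improper use of argument to option \"parallel\"")));

        return (int) val;
    }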
Re: should libpq also require TLSv1.2 by default?
On Wed, Jun 24, 2020 at 10:33 AM Daniel Gustafsson wrote: > > On 24 Jun 2020, at 08:39, Peter Eisentraut < > peter.eisentr...@2ndquadrant.com> wrote: > > > > In PG13, we raised the server-side default of ssl_min_protocol_version > to TLSv1.2. We also added a connection setting named > ssl_min_protocol_version to libpq. But AFAICT, the default value of the > libpq setting is empty, so any protocol version will be accepted. Is this > what we wanted? Should we raise the default in libpq as well? > > This was discussed [0] when the connection settings were introduced, and > the > concensus was to leave them alone [1] to allow for example a new pg_dump to > work against an old server. Re-reading the thread I think the argument > still > holds, but I was about to respond "yes, let's do this" before refreshing my > memory. Perhaps we should add a comment explaining this along the lines > of the > attached? > > Another argument for not changing the default is that if you want to use SSL in any meaningful way you have to *already* change the connection string (with sslmode=require or verify-*), so it's not unreasonable to make that consideration at the same time. It might also be worth noting that it's not really "any protocol version", it means it will be "whatever the openssl configuration says", I think? For example, debian buster sets: [system_default_sect] MinProtocol = TLSv1.2 Which I believe means that if your libpq app is running on debian buster, it will be min v1.2 already (and it would likely be more useful to use ssl_min_protocol_version to *lower* that when connecting to older servers). //Magnus
Re: pg_resetwal --next-transaction-id may cause database failed to restart.
>Yeah, the normal workaround is to create the necessary file manually in
>order to let the system start after such an operation; they are
>sometimes necessary to enable testing weird cases with wraparound and
>such. So a total rejection to work for these cases would be unhelpful
>precisely for the scenario that those switches were intended to serve.

I think these words should appear in the pg_resetwal documentation if we decide to do nothing for this issue.

>Maybe a better answer is to have a new switch in postmaster that creates
>any needed files (incl. producing associated WAL etc); so you'd run
>pg_resetwal -x some-value
>postmaster --create-special-stuff
>then start your server and off you go.

As shown in the document, the option looks intended to take a safe input, so I think it's better to enforce that, and add a separate option to force writing an unsafe value if necessary.

>Now maybe this is too much complication for a mechanism that really
>isn't for general consumption anyway. I mean, if you're using
>pg_resetwal, you're already playing with fire.

Yes, that's true; I have always heard the advice "You'd better not use pg_resetwal". But since the tool ships with PostgreSQL, it's better to improve it than to do nothing.

Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
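For context, the "create the necessary file manually" workaround mentioned above looks roughly like this; the XID and file name are hypothetical, and rely on each pg_xact segment being 32 pages of 8kB covering 1048576 (0x100000) transactions, so the segment name is the XID divided by 1048576, in hex:

    # example only: jump the next transaction ID far ahead
    pg_resetwal -x 10485760 "$PGDATA"
    # create the zero-filled pg_xact segment the server will expect at startup
    dd if=/dev/zero of="$PGDATA/pg_xact/000A" bs=8192 count=32
    pg_ctl -D "$PGDATA" start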
Re: Removal of currtid()/currtid2() and some table AM cleanup
Hi Inoue-san, On Wed, Jun 24, 2020 at 05:20:42PM +0900, Inoue, Hiroshi wrote: > Where do you test, on Windows or on *nix? > How do you test there? I have been testing the driver on macos only, with various backend versions, from 11 to 14. Thanks, -- Michael signature.asc Description: PGP signature
Re: should libpq also require TLSv1.2 by default?
> On 24 Jun 2020, at 10:46, Magnus Hagander wrote: > It might also be worth noting that it's not really "any protocol version", it > means it will be "whatever the openssl configuration says", I think? For > example, debian buster sets: > > [system_default_sect] > MinProtocol = TLSv1.2 > > Which I believe means that if your libpq app is running on debian buster, it > will be min v1.2 already Correct, that being said I'm not sure how common it is for distributions to set a default protocol version. The macOS versions I have handy doesn't enforce a default version, nor does Ubuntu 20.04, FreeBSD 12 or OpenBSD 6.5 AFAICT. > (and it would likely be more useful to use ssl_min_protocol_version to > *lower* that when connecting to older servers). That is indeed one use-case for the connection parameter. cheers ./daniel
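For anyone following along, the per-connection override already exists in 13, so a client that cares can pin the floor (or lower it) itself; host names below are made up:

    # insist on at least TLSv1.2 regardless of the libpq default:
    psql "host=db.example.com sslmode=require ssl_min_protocol_version=TLSv1.2"
    # or relax it when talking to an old server that only speaks TLSv1:
    psql "host=legacy.example.com sslmode=require ssl_min_protocol_version=TLSv1"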
Re: Default setting for enable_hashagg_disk
On Wed, Jun 24, 2020 at 02:11:57PM +1200, David Rowley wrote: > On Tue, 23 Jun 2020 at 08:24, Jeff Davis wrote: > > Another way of looking at it is that the weird behavior is already > > there in v12, so there are already users relying on this weird behavior > > as a crutch for some other planner mistake. The question is whether we > > want to: > > > > (a) take the weird behavior away now as a consequence of implementing > > disk-based HashAgg; or > > (b) support the weird behavior forever; or > > (c) introduce a GUC now to help transition away from the weird behavior > > > > The danger with (c) is that it gives users more time to become more > > reliant on the weird behavior; and worse, a GUC could be seen as an > > endorsement of the weird behavior rather than a path to eliminating it. > > So we could intend to do (c) and end up with (b). We can mitigate this > > with documentation warnings, perhaps. > > So, I have a few thoughts on this subject. I understand both problem > cases have been mentioned before on this thread, but just to reiterate > the two problem cases that we really would rather people didn't hit. I appreciated this summary since I wasn't fully following the issues. > As for GUCs to try to help the group of users who, *I'm certain*, will > have problems with PG13's plan choice. I think the overloaded > enable_hashagg option is a really nice compromise. We don't really > have any other executor node type that has multiple GUCs controlling > its behaviour, so I believe it would be nice to keep it that way. So, in trying to anticipate how users will be affected by an API change, I try to look at similar cases where we already have this behavior, and how users react to this. Looking at the available join methods, I think we have one. We currently support: * nested loop with sequential scan * nested loop with index scan * hash join * merge join It would seem merge join has almost the same complexities as the new hash join code, since it can spill to disk doing sorts for merge joins, and adjusting work_mem is the only way to control that spill to disk. I don't remember anyone complaining about spills to disk during merge join, so I am unclear why we would need a such control for hash join. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
PostgreSQL and big data - FDW
Hi I would like to use a Foreign Data Wrapper (FDW) to connect to a HADOOP cluster which uses KERBEROS authentication. is it possible to achieve this ? which FDW should be used ? Thanks in advance Best Regards Didier ROS EDF
Re: EXPLAIN: Non-parallel ancestor plan nodes exclude parallel worker instrumentation
On Wed, Jun 24, 2020 at 12:41 PM Maciek Sakrejda wrote:
>
> On Tue, Jun 23, 2020 at 7:55 PM Amit Kapila wrote:
> > > I don't see any other reason for
> > > looping over the NL node itself in this plan. The Gather itself
> > > doesn't do any real looping, right?
> >
> > It is right that Gather doesn't do looping but Parallel Seq Scan node does
> > so.
>
> Sorry, I still don't follow. How does a Parallel Seq Scan do looping?

Sorry, I intend to say that Parallel Seq Scan is involved in looping. Let me try by example:

Gather (actual time=6.444..722.642 rows=1 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Nested Loop (actual time=0.046..705.936 rows=5000 loops=2)
        ->  Parallel Seq Scan on t1 (actual time=0.010..45.423 rows=25 loops=2)
        ->  Index Scan using idx_t2 on t2 (actual time=0.002..0.002 rows=0 loops=50)
              Index Cond: (c1 = t1.c1)

In the above plan, each of the workers runs
  NestLoop
    -> Parallel Seq Scan on t1
    -> Index Scan using idx_t2 on t2
So, that leads to loops as 2 on the "Parallel Seq Scan" and "Nested Loop" nodes. Does this make sense now?

> I looked at the parallel plan docs but I don't see looping mentioned
> anywhere[1]. Also, is looping not normally indicated on children,
> rather than on the node doing the looping? E.g., with a standard
> Nested Loop, the outer child will have loops=1 and the inner child
> will have loops equal to the row count produced by the outer child
> (and the Nested Loop itself will have loops=1 unless it also is being
> looped over by a parent node), right?
>

Yeah, I hope the above has clarified it.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: min_safe_lsn column in pg_replication_slots view
On 2020/06/23 15:27, Amit Kapila wrote: On Tue, Jun 23, 2020 at 7:47 AM Fujii Masao wrote: On 2020/06/23 10:10, Kyotaro Horiguchi wrote: At Mon, 22 Jun 2020 22:02:51 +0900, Fujii Masao wrote in I feel such a function is good to have but I am not sure if there is a need to tie it with the removal of min_safe_lsn column. We should expose the LSN calculated from "the current WAL LSN - max(wal_keep_segments * 16MB, max_slot_wal_keep_size)"? This indicates the minimum LSN of WAL files that are guaraneed to be currently retained by wal_keep_segments and max_slot_wal_keep_size. That is, if checkpoint occurs when restart_lsn of replication slot is smaller than that minimum LSN, some required WAL files may be removed. So DBAs can periodically monitor and compare restart_lsn and that minimum LSN. If they see frequently that difference of those LSN is very small, they can decide to increase wal_keep_segments or max_slot_wal_keep_size, to prevent required WAL files from being removed. Thought? I'm not sure about the consensus here about showing that number in the view. It is similar to "remain" in the earlier versions of this patch but a bit simpler. It would be usable in a similar way. I can live with either numbers. It's useless to display this value in each replication slot in the view. I'm thinking to expose it as a function. Having a separate function for this seems like a good idea but can we consider displaying it in a view like pg_stat_replication_slots as we are discussing a nearby thread to have such a view for other things. I think ultimately this information is required to check whether some slot can be invalidated or not, so having it displayed along with other slot information might not be a bad idea. "the current WAL LSN - max(wal_keep_segments * 16MB, max_slot_wal_keep_size)" is the same value between all the replication slots. But you think it's better to display that same value for every slots in the view? Or you're thinking to display the difference of that LSN value and restart_lsn as Horiguchi-san suggested? That diff varies each replication slot, so it seems ok to display it for every rows. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
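As a rough illustration of the kind of monitoring discussed above, something similar can already be approximated from SQL; this sketch only accounts for max_slot_wal_keep_size (it ignores wal_keep_segments and segment rounding, and assumes the setting is not -1):

    SELECT slot_name,
           pg_size_pretty(
               pg_size_bytes(current_setting('max_slot_wal_keep_size'))
               - pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS approx_margin
    FROM pg_replication_slots;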
Re: min_safe_lsn column in pg_replication_slots view
On 2020/06/24 8:39, Alvaro Herrera wrote: On 2020-Jun-23, Kyotaro Horiguchi wrote: At Tue, 23 Jun 2020 11:50:34 +0530, Amit Kapila wrote in On Mon, Jun 22, 2020 at 6:32 PM Fujii Masao wrote: We should expose the LSN calculated from "the current WAL LSN - max(wal_keep_segments * 16MB, max_slot_wal_keep_size)"? This indicates the minimum LSN of WAL files that are guaraneed to be currently retained by wal_keep_segments and max_slot_wal_keep_size. That is, if checkpoint occurs when restart_lsn of replication slot is smaller than that minimum LSN, some required WAL files may be removed. So DBAs can periodically monitor and compare restart_lsn and that minimum LSN. If they see frequently that difference of those LSN is very small, they can decide to increase wal_keep_segments or max_slot_wal_keep_size, to prevent required WAL files from being removed. Thought? +1. This sounds like a good and useful stat for users. +1 for showing a number that is not involving lastRemovedSegNo. It is like returning to the initial version of this patch. It showed a number like ((the suggested above) minus restart_lsn). The number is different for each slot so they fit in the view. The number is usable for the same purpose so I'm ok with it. I think we should publish the value from wal_keep_segments separately from max_slot_wal_keep_size. ISTM that the user might decide to change or remove wal_keep_segments and be suddenly at risk of losing slots because of overlooking that it was wal_keep_segments, not max_slot_wal_keep_size, that was protecting them. You mean to have two functions that returns 1. "current WAL LSN - wal_keep_segments * 16MB" 2. "current WAL LSN - max_slot_wal_keep_size" Right? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: PostgreSQL and big data - FDW
On Wed, Jun 24, 2020 at 09:05:30AM +, ROS Didier wrote: > Hi > > I would like to use a Foreign Data Wrapper (FDW) to connect to a HADOOP > cluster > which uses KERBEROS authentication. > > is it possible to achieve this ? which FDW should be used ? Well, I would use the Hadoop FDW: https://github.com/EnterpriseDB/hdfs_fdw and it only supports these authentication methods: Authentication Support The FDW supports NOSASL and LDAP authentication modes. In order to use NOSASL do not specify any OPTIONS while creating user mapping. For LDAP username and password must be specified in OPTIONS while creating user mapping. Not every FDW supports every Postgres server authentication method. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
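For what it's worth, the LDAP mode quoted above is wired up roughly as below. The username/password user-mapping options come from the README text quoted above; the server-level option names (host, port, client_type, auth_type) are assumptions to be checked against the hdfs_fdw documentation:

    CREATE EXTENSION hdfs_fdw;
    CREATE SERVER hdfs_server FOREIGN DATA WRAPPER hdfs_fdw
        OPTIONS (host 'hiveserver.example.com', port '10000',   -- assumed option names
                 client_type 'hiveserver2', auth_type 'LDAP');
    CREATE USER MAPPING FOR CURRENT_USER SERVER hdfs_server
        OPTIONS (username 'ldap_user', password 'secret');
    -- Kerberos/GSSAPI is not among the authentication modes listed above.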
Re: Assertion failure in pg_copy_logical_replication_slot()
On 2020/06/24 9:38, Alvaro Herrera wrote: On 2020-Jun-23, Fujii Masao wrote: If restart_lsn of logical replication slot gets behind more than max_slot_wal_keep_size from the current LSN, the logical replication slot would be invalidated and its restart_lsn is reset to an invalid LSN. If this logical replication slot with an invalid restart_lsn is specified as the source slot in pg_copy_logical_replication_slot(), the function causes the following assertion failure. TRAP: FailedAssertion("!logical_slot", File: "slotfuncs.c", Line: 727) Oops. This assertion failure is caused by /* Copying non-reserved slot doesn't make sense */ if (XLogRecPtrIsInvalid(src_restart_lsn)) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("cannot copy a replication slot that doesn't reserve WAL"))); Heh, you pasted the code after your patch rather than the original. oh sorry. I think the errcode is a bit bogus considering the new case. IMO ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE is more appropriate. Agreed. So I updated the patch so this errcode is used instead. Patch attached. One could argue that the error message could also be different for the case of a logical slot (or even a physical slot that has the upcoming "invalidated_at" LSN set), maybe "cannot copy a replication slot that has been invalidated" but maybe that's a pointless distinction. I don't object to the patch as presented. I have no strong opinion about this, but for now I kept the message as it is. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index 06e4955de7..ce17d44227 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -723,12 +723,9 @@ copy_replication_slot(FunctionCallInfo fcinfo, bool logical_slot) /* Copying non-reserved slot doesn't make sense */ if (XLogRecPtrIsInvalid(src_restart_lsn)) - { - Assert(!logical_slot); ereport(ERROR, - (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("cannot copy a replication slot that doesn't reserve WAL"))); - } /* Overwrite params from optional arguments */ if (PG_NARGS() >= 3)
Re: EXPLAIN: Non-parallel ancestor plan nodes exclude parallel worker instrumentation
On Tue, Jun 23, 2020 at 12:55 AM Maciek Sakrejda wrote:
>
> Hello,
>
> I had some questions about the behavior of some accounting in parallel
> EXPLAIN plans. Take the following plan:
>
> ..
>
> The Nested Loop here aggregates data for metrics like `buffers read`
> from its workers, and to calculate a metric like `buffers read` for
> the parallel leader, we can subtract the values recorded in each
> individual worker. This happens in the Seq Scan and Index Scan
> children, as well. However, the Gather node appears to only include
> values from its direct parallel leader child (excluding that child's
> workers).
>

I have tried to check a similar plan and for me, the values at the Gather node seem to be considering the values from all workers and the leader (aka whatever is displayed at the "Nested Loop" node), see below. I have tried the test on HEAD. Which version of PostgreSQL are you using? If you are also using the latest version then it is possible that in some cases it is not displaying correct data. If that turns out to be the case, then feel free to share the test case. Sorry for the confusion caused by my previous reply.

Gather (actual time=2.083..550.093 rows=1 loops=1)
  Output: t1.c1, t1.c2, t2.c1, t2.c2
  Workers Planned: 2
  Workers Launched: 2
  Buffers: shared hit=1012621 read=84
  I/O Timings: read=0.819
  ->  Nested Loop (actual time=0.084..541.882 rows= loops=3)
        Output: t1.c1, t1.c2, t2.c1, t2.c2
        Buffers: shared hit=1012621 read=84
        I/O Timings: read=0.819
        Worker 0: actual time=0.069..541.249 rows=3155 loops=1
          Buffers: shared hit=326529 read=29
          I/O Timings: read=0.325
        Worker 1: actual time=0.063..541.376 rows=3330 loops=1
          Buffers: shared hit=352045 read=26
          I/O Timings: read=0.179
        ->  Parallel Seq Scan on public.t1 (actual time=0.011..34.250 rows=17 loops=3)
              Output: t1.c1, t1.c2
              Buffers: shared hit=2703
              Worker 0: actual time=0.011..33.785 rows=161265 loops=1
                Buffers: shared hit=872
              Worker 1: actual time=0.009..34.582 rows=173900 loops=1
                Buffers: shared hit=940
        ->  Index Scan using idx_t2 on public.t2 (actual time=0.003..0.003 rows=0 loops=50)
              Output: t2.c1, t2.c2
              Index Cond: (t2.c1 = t1.c1)
              Buffers: shared hit=1009918 read=84
              I/O Timings: read=0.819
              Worker 0: actual time=0.003..0.003 rows=0 loops=161265
                Buffers: shared hit=325657 read=29
                I/O Timings: read=0.325
              Worker 1: actual time=0.002..0.002 rows=0 loops=173900
                Buffers: shared hit=351105 read=26
                I/O Timings: read=0.179

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
On 2020/06/24 11:56, Kyotaro Horiguchi wrote: At Tue, 23 Jun 2020 10:51:40 +0900, Michael Paquier wrote in On Sun, Jun 21, 2020 at 01:02:34PM -0700, Andres Freund wrote: I still maintain that adding restrictions here is a bad idea. Even disregarding the discussion of running normal queries interspersed, it's useful to be able to both request WAL and receive logical changes over the same connection. E.g. for creating a logical replica by first doing a physical base backup (vastly faster), or fetching WAL for decoding large transactions onto a standby. And I just don't see any reasons to disallow it. There's basically no reduction in complexity by doing so. Yeah, I still stand by the same opinion here to do nothing. I suspect that we have good chances to annoy people and some cases we are overlooking here, that used to work. In logical replication, a replication role is intended to be accessible only to the GRANTed databases. On the other hand the same role can create a dead copy of the whole cluster, including non-granted databases. It seems like a sieve missing a mesh screen. Personally I'd like to disallow physical replication commands when I explicitly reject physical replication connection (i.e., set "host replication user x.x.x.x/x reject") in pg_hba.conf, whether on physical or logical replication connection. I agree that that doesn't harm as far as roles are strictly managed so I don't insist so strongly on inhibiting the behavior. However, the documentation at least needs amendment. +1 Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
[PATCH] COPY command's data format option allows only lowercase csv, text or binary
Hi,

COPY command's FORMAT option currently allows only all-lowercase csv, text or binary; this is because strcmp is being used while parsing these values. It would be nice if uppercase or mixed-case spellings such as CSV, TEXT, BINARY, Csv, Text, Binary, and so on were also allowed. To achieve this, pg_strcasecmp() is used instead of strcmp.

Attached is a patch with the above changes. Request the community to review the patch, if it makes sense.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

v1-0001-COPY-command-s-data-format-option-allows-only-low.patch
Description: Binary data
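For reviewers, the spot being changed in ProcessCopyOptions() is essentially of this shape; this is a sketch of the case-insensitive version, not the patch itself:

    char   *fmt = defGetString(defel);

    if (pg_strcasecmp(fmt, "text") == 0)
        /* default format, nothing to set */ ;
    else if (pg_strcasecmp(fmt, "csv") == 0)
        cstate->csv_mode = true;
    else if (pg_strcasecmp(fmt, "binary") == 0)
        cstate->binary = true;
    else
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("COPY format \"%s\" not recognized", fmt)));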
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Tue, Jun 23, 2020 at 7:00 PM Dilip Kumar wrote: > > Here is the POC patch to discuss the idea of a cleanup of shared > fileset on proc exit. As discussed offlist, here I am maintaining > the list of shared fileset. First time when the list is NULL I am > registering the cleanup function with on_proc_exit routine. After > that for subsequent fileset, I am just appending it to filesetlist. > There is also an interface to unregister the shared file set from the > cleanup list and that is done by the caller whenever we are deleting > the shared fileset manually. While explaining it here, I think there > could be one issue if we delete all the element from the list will > become NULL and on next SharedFileSetInit we will again register the > function. Maybe that is not a problem but we can avoid registering > multiple times by using some flag in the file > I don't understand what you mean by "using some flag in the file". Review comments on various patches. poc_shared_fileset_cleanup_on_procexit = 1. - ent->subxact_fileset = - MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet)); + MemoryContext oldctx; + /* Shared fileset handle must be allocated in the persistent context */ + oldctx = MemoryContextSwitchTo(ApplyContext); + ent->subxact_fileset = palloc(sizeof(SharedFileSet)); SharedFileSetInit(ent->subxact_fileset, NULL); + MemoryContextSwitchTo(oldctx); fd = BufFileCreateShared(ent->subxact_fileset, path); Why is this change required for this patch and why we only cover SharedFileSetInit in the Apply context and not BufFileCreateShared? The comment is also not very clear on this point. 2. +void +SharedFileSetUnregister(SharedFileSet *input_fileset) +{ + bool found = false; + ListCell *l; + + Assert(filesetlist != NULL); + + /* Loop over all the pending shared fileset entry */ + foreach (l, filesetlist) + { + SharedFileSet *fileset = (SharedFileSet *) lfirst(l); + + /* remove the entry from the list and delete the underlying files */ + if (input_fileset->number == fileset->number) + { + SharedFileSetDeleteAll(fileset); + filesetlist = list_delete_cell(filesetlist, l); Why are we calling SharedFileSetDeleteAll here when in the caller we have already deleted the fileset as per below code? BufFileDeleteShared(ent->stream_fileset, path); + SharedFileSetUnregister(ent->stream_fileset); I think it will be good if somehow we can remove the fileset from filesetlist during BufFileDeleteShared. If that is possible, then we don't need a separate API for SharedFileSetUnregister. 3. +static List * filesetlist = NULL; + static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum); +static void SharedFileSetOnProcExit(int status, Datum arg); static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace); static void SharedFilePath(char *path, SharedFileSet *fileset, const char *name); static Oid ChooseTablespace(const SharedFileSet *fileset, const char *name); @@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg) /* Register our cleanup callback. */ if (seg) on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset)); + else + { + if (filesetlist == NULL) + on_proc_exit(SharedFileSetOnProcExit, 0); We use NIL for list initialization and comparison. See lock_files usage. 4. 
+SharedFileSetOnProcExit(int status, Datum arg) +{ + ListCell *l; + + /* Loop over all the pending shared fileset entry */ + foreach (l, filesetlist) + { + SharedFileSet *fileset = (SharedFileSet *) lfirst(l); + SharedFileSetDeleteAll(fileset); + } We can initialize filesetlist as NIL after the for loop as it will make the code look clean. Comments on other patches: = 5. > 3. On concurrent abort we are truncating all the changes including > some incomplete changes, so later when we get the complete changes we > don't have the previous changes, e.g, if we had specinsert in the > last stream and due to concurrent abort detection if we delete that > changes later we will get spec_confirm without spec insert. We could > have simply avoided deleting all the changes, but I think the better > fix is once we detect the concurrent abort for any transaction, then > why do we need to collect the changes for that, we can simply avoid > that. So I have put that fix. (0006) > On similar lines, I think we need to skip processing message, see else part of code in ReorderBufferQueueMessage. 6. In v29-0002-Issue-individual-invalidations-with-wal_level-lo, xact_desc_invalidations seems to be a subset of standby_desc_invalidations, can we have a common code for them? 7. I think we can avoid sending v29-0007-Track-statistics-for-streaming this each time. We can do this after the main patch is complete. Also, we might need to change how and where these stats will be tracked. See the related discussion [1]. 8. In v29-0005-Implement-streaming-mode-in-ReorderBuffer, * Return oldest transaction in reorderbuffer @
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Mon, Jun 22, 2020 at 11:56 AM Dilip Kumar wrote: > > On Tue, Jun 16, 2020 at 2:37 PM Amit Kapila wrote: > > > > > 8. > > + /* > > + * Start a transaction on stream start, this transaction will be committed > > + * on the stream stop. We need the transaction for handling the buffile, > > + * used for serializing the streaming data and subxact info. > > + */ > > + ensure_transaction(); > > > > I think we need this for PrepareTempTablespaces to set the > > temptablespaces. Also, isn't it required for a cleanup of buffile > > resources at the transaction end? Are there any other reasons for it > > as well? The comment should be a bit more clear for why we need a > > transaction here. > > I am not sure that will it make sense to add a comment here that why > buffile and sharedfileset need a transaction? > You can say usage of BufFile interface expects us to be in the transaction for so and so reason Do you think that we > should add comment in buffile/shared fileset API that it should be > called under a transaction? > I am fine with that as well. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
iOn Wed, Jun 24, 2020 at 4:04 PM Amit Kapila wrote: > > On Tue, Jun 23, 2020 at 7:00 PM Dilip Kumar wrote: > > > > Here is the POC patch to discuss the idea of a cleanup of shared > > fileset on proc exit. As discussed offlist, here I am maintaining > > the list of shared fileset. First time when the list is NULL I am > > registering the cleanup function with on_proc_exit routine. After > > that for subsequent fileset, I am just appending it to filesetlist. > > There is also an interface to unregister the shared file set from the > > cleanup list and that is done by the caller whenever we are deleting > > the shared fileset manually. While explaining it here, I think there > > could be one issue if we delete all the element from the list will > > become NULL and on next SharedFileSetInit we will again register the > > function. Maybe that is not a problem but we can avoid registering > > multiple times by using some flag in the file > > > > I don't understand what you mean by "using some flag in the file". Basically, in POC as shown in below code snippet, We are checking that if the "filesetlist" is NULL then only register the on_proc_exit function. But, as described above if all the items are deleted the list will be NULL. So I told that instead of checking the filesetlist is NULL, we can have just a boolean variable that if we have registered the callback then don't do it again. @@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg) /* Register our cleanup callback. */ if (seg) on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset)); + else + { + if (filesetlist == NULL) + on_proc_exit(SharedFileSetOnProcExit, 0); + + filesetlist = lcons((void *)fileset, filesetlist); + } } > > Review comments on various patches. > > poc_shared_fileset_cleanup_on_procexit > = > 1. > - ent->subxact_fileset = > - MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet)); > + MemoryContext oldctx; > > + /* Shared fileset handle must be allocated in the persistent context */ > + oldctx = MemoryContextSwitchTo(ApplyContext); > + ent->subxact_fileset = palloc(sizeof(SharedFileSet)); > SharedFileSetInit(ent->subxact_fileset, NULL); > + MemoryContextSwitchTo(oldctx); > fd = BufFileCreateShared(ent->subxact_fileset, path); > > Why is this change required for this patch and why we only cover > SharedFileSetInit in the Apply context and not BufFileCreateShared? > The comment is also not very clear on this point. Because only the sharedfileset and the filesetlist which is allocated under SharedFileSetInit, are required in the permanent context. BufFileCreateShared, only creates the Buffile and VFD which will be required only within the current stream so transaction context is enough. > 2. > +void > +SharedFileSetUnregister(SharedFileSet *input_fileset) > +{ > + bool found = false; > + ListCell *l; > + > + Assert(filesetlist != NULL); > + > + /* Loop over all the pending shared fileset entry */ > + foreach (l, filesetlist) > + { > + SharedFileSet *fileset = (SharedFileSet *) lfirst(l); > + > + /* remove the entry from the list and delete the underlying files */ > + if (input_fileset->number == fileset->number) > + { > + SharedFileSetDeleteAll(fileset); > + filesetlist = list_delete_cell(filesetlist, l); > > Why are we calling SharedFileSetDeleteAll here when in the caller we > have already deleted the fileset as per below code? 
> BufFileDeleteShared(ent->stream_fileset, path); > + SharedFileSetUnregister(ent->stream_fileset); > > I think it will be good if somehow we can remove the fileset from > filesetlist during BufFileDeleteShared. If that is possible, then we > don't need a separate API for SharedFileSetUnregister. But the filesetlist is maintained at the sharedfileset level, so even if we delete from BufFileDeleteShared, we need to call an API from the sharedfileset layer to unregister the fileset. Am I missing something? > 3. > +static List * filesetlist = NULL; > + > static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum); > +static void SharedFileSetOnProcExit(int status, Datum arg); > static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid > tablespace); > static void SharedFilePath(char *path, SharedFileSet *fileset, const > char *name); > static Oid ChooseTablespace(const SharedFileSet *fileset, const char *name); > @@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg) > /* Register our cleanup callback. */ > if (seg) > on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset)); > + else > + { > + if (filesetlist == NULL) > + on_proc_exit(SharedFileSetOnProcExit, 0); > > We use NIL for list initialization and comparison. See lock_files usage. Right. > 4. > +SharedFileSetOnProcExit(int status, Datum arg) > +{ > + ListCell *l; > + > + /* Loop over all the pending shared fileset entry */ > + foreach (l, filesetlist) > + { > + SharedFileSet *fileset = (Sh
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Wed, Jun 24, 2020 at 4:27 PM Dilip Kumar wrote: > > iOn Wed, Jun 24, 2020 at 4:04 PM Amit Kapila wrote: > > > > On Tue, Jun 23, 2020 at 7:00 PM Dilip Kumar wrote: > > > > > > Here is the POC patch to discuss the idea of a cleanup of shared > > > fileset on proc exit. As discussed offlist, here I am maintaining > > > the list of shared fileset. First time when the list is NULL I am > > > registering the cleanup function with on_proc_exit routine. After > > > that for subsequent fileset, I am just appending it to filesetlist. > > > There is also an interface to unregister the shared file set from the > > > cleanup list and that is done by the caller whenever we are deleting > > > the shared fileset manually. While explaining it here, I think there > > > could be one issue if we delete all the element from the list will > > > become NULL and on next SharedFileSetInit we will again register the > > > function. Maybe that is not a problem but we can avoid registering > > > multiple times by using some flag in the file > > > > > > > I don't understand what you mean by "using some flag in the file". > > Basically, in POC as shown in below code snippet, We are checking > that if the "filesetlist" is NULL then only register the on_proc_exit > function. But, as described above if all the items are deleted the > list will be NULL. So I told that instead of checking the filesetlist > is NULL, we can have just a boolean variable that if we have > registered the callback then don't do it again. > Check if there is any precedent of the same in the code? > > > > > Review comments on various patches. > > > > poc_shared_fileset_cleanup_on_procexit > > = > > 1. > > - ent->subxact_fileset = > > - MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet)); > > + MemoryContext oldctx; > > > > + /* Shared fileset handle must be allocated in the persistent context */ > > + oldctx = MemoryContextSwitchTo(ApplyContext); > > + ent->subxact_fileset = palloc(sizeof(SharedFileSet)); > > SharedFileSetInit(ent->subxact_fileset, NULL); > > + MemoryContextSwitchTo(oldctx); > > fd = BufFileCreateShared(ent->subxact_fileset, path); > > > > Why is this change required for this patch and why we only cover > > SharedFileSetInit in the Apply context and not BufFileCreateShared? > > The comment is also not very clear on this point. > > Because only the sharedfileset and the filesetlist which is allocated > under SharedFileSetInit, are required in the permanent context. > BufFileCreateShared, only creates the Buffile and VFD which will be > required only within the current stream so transaction context is > enough. > Okay, then add some more comments to explain it or if you have explained it elsewhere, then add a reference for the same. > > 2. > > +void > > +SharedFileSetUnregister(SharedFileSet *input_fileset) > > +{ > > + bool found = false; > > + ListCell *l; > > + > > + Assert(filesetlist != NULL); > > + > > + /* Loop over all the pending shared fileset entry */ > > + foreach (l, filesetlist) > > + { > > + SharedFileSet *fileset = (SharedFileSet *) lfirst(l); > > + > > + /* remove the entry from the list and delete the underlying files */ > > + if (input_fileset->number == fileset->number) > > + { > > + SharedFileSetDeleteAll(fileset); > > + filesetlist = list_delete_cell(filesetlist, l); > > > > Why are we calling SharedFileSetDeleteAll here when in the caller we > > have already deleted the fileset as per below code? 
> > BufFileDeleteShared(ent->stream_fileset, path); > > + SharedFileSetUnregister(ent->stream_fileset); > > > > I think it will be good if somehow we can remove the fileset from > > filesetlist during BufFileDeleteShared. If that is possible, then we > > don't need a separate API for SharedFileSetUnregister. > > But the filesetlist is maintained at the sharedfileset level, so even > if we delete from BufFileDeleteShared, we need to call an API from the > sharedfileset layer to unregister the fileset. > Sure, but isn't it better if we can call such an API from BufFileDeleteShared? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Custom compression methods
On Tue, Jun 23, 2020 at 4:00 PM Andres Freund wrote: > https://postgr.es/m/20130621000900.GA12425%40alap2.anarazel.de is a > thread with more information / patches further along. > > I confused this patch with the approach in > https://www.postgresql.org/message-id/d8576096-76ba-487d-515b-44fdedba8bb5%402ndquadrant.com > sorry for that. It obviously still differs by not having lower space > overhead (by virtue of not having a 4 byte 'va_cmid', but no additional > space for two methods, and then 1 byte overhead for 256 more), but > that's not that fundamental a difference. Wait a minute. Are we saying there are three (3) dueling patches for adding an alternate TOAST algorithm? It seems like there are:
This "custom compression methods" thread - vintage 2017 - Original code by Nikita Glukhov, later work by Ildus Kurbangaliev
The "pluggable compression support" thread - vintage 2013 - Andres Freund
The "pglz performance" thread - vintage 2019 - Petr Jelinek
Anyone want to point to a FOURTH implementation of this feature? I guess the next thing to do is figure out which one is the best basis for further work. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Why forbid "INSERT INTO t () VALUES ();"
Hello devs, I would like to create an "all defaults" row, i.e. a row composed of the default values for all attributes, so I wrote: INSERT INTO t() VALUES (); This is forbidden by postgres, and also sqlite. Is there any good reason why this should be the case? -- Fabien.
Re: Why forbid "INSERT INTO t () VALUES ();"
Fabien COELHO wrote on 24.06.2020 at 14:18: > I would like to create an "all defaults" row, i.e. a row composed of the > default values for all attributes, so I wrote: > > INSERT INTO t() VALUES (); > > This is forbidden by postgres, and also sqlite. > > Is there any good reason why this should be the case? >
Maybe because

    insert into t default values;

exists (and is standard SQL if I'm not mistaken)

Thomas
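A short sketch of the working spellings (assuming, for illustration only, that t has two columns):

    -- Standard SQL: insert one row consisting entirely of default values
    INSERT INTO t DEFAULT VALUES;
    -- Equivalent workaround: name the defaults explicitly, one per column
    INSERT INTO t VALUES (DEFAULT, DEFAULT);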
Re: Default setting for enable_hashagg_disk
On Wed, 24 Jun 2020 at 21:06, Bruce Momjian wrote: > I > don't remember anyone complaining about spills to disk during merge > join, so I am unclear why we would need such a control for hash join. Hash aggregate, you mean? The reason is that upgrading to PG13 can cause a performance regression for an underestimated ndistinct on the GROUP BY clause and cause hash aggregate to spill to disk where it previously did everything in RAM. Sure, that behaviour was never what we wanted to happen, Jeff has fixed that now, but the fact remains that this does happen in the real world quite often and people often get away with it, likely because work_mem is generally set to some very conservative value. Of course, there's also a bunch of people that have been bitten by OOM due to this too. The "neverspill" wouldn't be for those people. Certainly, it's possible that we just tell these people to increase work_mem for this query, that way they can set it to something reasonable and still get spilling if it's really needed to save them from OOM, but the problem there is that it's not very easy to go and set work_mem for a certain query. FWIW, I wish that I wasn't suggesting we do this, but I am because it seems simple enough to implement and it removes a pretty big roadblock that might exist for a small subset of people wishing to upgrade to PG13. It seems lightweight enough to maintain, at least until we invent some better management of how many executor nodes we can have allocating work_mem at once. The suggestion I made was just based on asking myself the following set of questions: Since Hash Aggregate has been able to overflow work_mem since day 1, and now that we've completely changed that fact in PG13, is that likely to upset anyone? If so, should we go to the trouble of giving those people a way of getting the old behaviour back? If we do want to help those people, what's the best way to make those options available to them in a way that we can remove the special options with the least pain in some future version of PostgreSQL? I'd certainly be interested in hearing how other people would answer those questions. David
Re: Default setting for enable_hashagg_disk
On Wed, Jun 24, 2020 at 05:06:28AM -0400, Bruce Momjian wrote: > On Wed, Jun 24, 2020 at 02:11:57PM +1200, David Rowley wrote: > > On Tue, 23 Jun 2020 at 08:24, Jeff Davis wrote: > > > Another way of looking at it is that the weird behavior is already > > > there in v12, so there are already users relying on this weird behavior > > > as a crutch for some other planner mistake. The question is whether we > > > want to: > > > > > > (a) take the weird behavior away now as a consequence of implementing > > > disk-based HashAgg; or > > > (b) support the weird behavior forever; or > > > (c) introduce a GUC now to help transition away from the weird behavior > > > > > > The danger with (c) is that it gives users more time to become more > > > reliant on the weird behavior; and worse, a GUC could be seen as an > > > endorsement of the weird behavior rather than a path to eliminating it. > > > So we could intend to do (c) and end up with (b). We can mitigate this > > > with documentation warnings, perhaps. > > > > So, I have a few thoughts on this subject. I understand both problem > > cases have been mentioned before on this thread, but just to reiterate > > the two problem cases that we really would rather people didn't hit. > > I appreciated this summary since I wasn't fully following the issues. > > > As for GUCs to try to help the group of users who, *I'm certain*, will > > have problems with PG13's plan choice. I think the overloaded > > enable_hashagg option is a really nice compromise. We don't really > > have any other executor node type that has multiple GUCs controlling > > its behaviour, so I believe it would be nice to keep it that way. ... > It would seem merge join has almost the same complexities as the new > hash join code, since it can spill to disk doing sorts for merge joins, > and adjusting work_mem is the only way to control that spill to disk. I > don't remember anyone complaining about spills to disk during merge > join, so I am unclear why we would need such a control for hash join. It looks like merge join was new in 8.3. I don't think that's a good analogy, since the old behavior was still available with enable_mergejoin=off. I think a better analogy would be if we now changed sort nodes beneath merge join to use at most 0.5*work_mem, with no way of going back to using 1.0*work_mem. -- Justin
RE: PostgreSQL and big data - FDW
Hi Bruce In the following link: https://www.enterprisedb.com/blog/connecting-hadoop-and-edb-postgres-shrink-big-data-challenges We can see: "Support for various authentication methods (i.e. Kerberos, NOSASL, etc.)" So HDFS_FDW supports Kerberos authentication. How can we be sure of that? Could EDB make a clear statement on this point? If so, how do we implement this method? Is there any documentation on this subject? Thanks in advance. Best Regards Didier ROS didier@edf.fr Tel.: +33 6 49 51 11 88 -Original Message- From: br...@momjian.us [mailto:br...@momjian.us] Sent: Wednesday, June 24, 2020 11:13 To: ROS Didier Cc: pgsql-hackers@lists.postgresql.org Subject: Re: PostgreSQL and big data - FDW On Wed, Jun 24, 2020 at 09:05:30AM +, ROS Didier wrote: > Hi > > I would like to use a Foreign Data Wrapper (FDW) to connect to a > HADOOP cluster which uses KERBEROS authentication. > > is it possible to achieve this ? which FDW should be used ? Well, I would use the Hadoop FDW: https://github.com/EnterpriseDB/hdfs_fdw and it only supports these authentication methods: Authentication Support The FDW supports NOSASL and LDAP authentication modes. In order to use NOSASL do not specify any OPTIONS while creating user mapping. For LDAP username and password must be specified in OPTIONS while creating user mapping. Not every FDW supports every Postgres server authentication method. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
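For illustration, an LDAP-mode user mapping as described in the README text quoted above might look like this (server name, role name, and credentials are placeholders; the option names follow the hdfs_fdw documentation):

    CREATE USER MAPPING FOR app_user SERVER hdfs_server
        OPTIONS (username 'ldap_user', password 'secret');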
Re: min_safe_lsn column in pg_replication_slots view
On Wed, Jun 24, 2020 at 2:37 PM Fujii Masao wrote: > > On 2020/06/23 15:27, Amit Kapila wrote: > > > > Having a separate function for this seems like a good idea but can we > > consider displaying it in a view like pg_stat_replication_slots as we > > are discussing a nearby thread to have such a view for other things. > > I think ultimately this information is required to check whether some > > slot can be invalidated or not, so having it displayed along with > > other slot information might not be a bad idea. > > "the current WAL LSN - max(wal_keep_segments * 16MB, max_slot_wal_keep_size)" > is the same value between all the replication slots. But you think it's better > to display that same value for every slots in the view? > > Or you're thinking to display the difference of that LSN value and > restart_lsn as Horiguchi-san suggested? > I see value in Horiguchi-San's proposal. IIUC, it will help DBAs/users know whether any particular slot will get invalidated soon. > That diff varies each replication slot, > so it seems ok to display it for every rows. > Yes. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [PATCH] COPY command's data format option allows only lowercase csv, text or binary
Bharath Rupireddy writes: > COPY command's FORMAT option allows only all lowercase csv, text or > binary, this is true because strcmp is being used while parsing these > values. This is nonsense, actually: regression=# create table foo (f1 int); CREATE TABLE regression=# copy foo from stdin (format CSV); Enter data to be copied followed by a newline. End with a backslash and a period on a line by itself, or an EOF signal. As that shows, there's already a round of lowercasing done by the parser. The only way that strcasecmp in copy.c would be useful is if you wanted to accept things like copy foo from stdin (format "CSV"); I don't find that to be a terribly good idea. The normal implication of quoting is that it *prevents* case folding, so why should that happen anyway? More generally, though, why would we want to change this policy only here? I believe we're reasonably consistent about letting the parser do any required down-casing and then just checking keyword matches with strcmp. regards, tom lane
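Spelling out the behaviour described above with the same table foo (a quick sketch; the second statement is expected to be rejected because quoting suppresses the parser's case folding, so the strcmp comparison sees "CSV" rather than "csv"):

    COPY foo FROM STDIN (FORMAT Csv);    -- accepted: the unquoted value is down-cased by the parser
    COPY foo FROM STDIN (FORMAT "CSV");  -- rejected: quoting prevents case folding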
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
On 2020-Jun-24, Kyotaro Horiguchi wrote: > In logical replication, a replication role is intended to be > accessible only to the GRANTed databases. On the other hand the same > role can create a dead copy of the whole cluster, including > non-granted databases. In other words -- essentially, if you grant replication access to a role only to a specific database, they can steal the whole cluster. I don't see what's so great about that, but apparently people like it. -- Álvaro Herrera  https://www.2ndQuadrant.com/  PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Assertion failure in pg_copy_logical_replication_slot()
On 2020-Jun-24, Fujii Masao wrote: > > I think the errcode is a bit bogus considering the new case. > > IMO ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE is more appropriate. > > Agreed. So I updated the patch so this errcode is used instead. > Patch attached. LGTM. -- Álvaro Herrera  https://www.2ndQuadrant.com/  PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pg_resetwal --next-transaction-id may cause database failed to restart.
On 2020-Jun-24, movead...@highgo.ca wrote: > >Maybe a better answer is to have a new switch in postmaster that creates > >any needed files (incl. producing associated WAL etc); so you'd run > >pg_resetwal -x some-value > >postmaster --create-special-stuff > >then start your server and off you go. > > As shown in the document, it looks like to rule a safe input, so I think it's better > to rule it and add an option to focus write an unsafe value if necessary. ISTM that a reasonable compromise is that if you use -x (or -c, -m, -O) and the input value is outside the range supported by existing files, then it's a fatal error; unless you use --force, which turns it into just a warning. > >Now maybe this is too much complication for a mechanism that really > >isn't for general consumption anyway. I mean, if you're using > >pg_resetwal, you're already playing with fire. > Yes, that's true, I always heard the word "You'd better not use pg_walreset". > But the tool appear in PG code, it's better to improve it than do nothing. Sure. -- Álvaro Herrera  https://www.2ndQuadrant.com/  PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: min_safe_lsn column in pg_replication_slots view
On 2020-Jun-24, Fujii Masao wrote: > On 2020/06/24 8:39, Alvaro Herrera wrote: > > I think we should publish the value from wal_keep_segments separately > > from max_slot_wal_keep_size. ISTM that the user might decide to change > > or remove wal_keep_segments and be suddenly at risk of losing slots > > because of overlooking that it was wal_keep_segments, not > > max_slot_wal_keep_size, that was protecting them. > > You mean to have two functions that returns > > 1. "current WAL LSN - wal_keep_segments * 16MB" > 2. "current WAL LSN - max_slot_wal_keep_size" Hmm, but all the values there are easily findable. What would be the point in repeating it? Maybe we should disregard this line of thinking and go back to Horiguchi-san's original proposal, to wit use the "distance to breakage", as also supported now by Amit Kapila[1] (unless I misunderstand him). [1] https://postgr.es/m/CAA4eK1L2oJ7T1cESdc5w4J9L3Q_hhvWqTigdAXKfnsJy4=v...@mail.gmail.com -- Álvaro Herrera  https://www.2ndQuadrant.com/  PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
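As a rough illustration of the per-slot quantity under discussion, the amount of WAL each slot currently retains can already be computed with existing functions (this only shows the raw distance from the current write position back to each slot's restart_lsn, not the proposed "distance to breakage" column itself):

    SELECT slot_name,
           pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
    FROM pg_replication_slots;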
Re: PostgreSQL and big data - FDW
On Wed, Jun 24, 2020 at 6:09 PM ROS Didier wrote: > Hi Bruce > > In the following link: > https://www.enterprisedb.com/blog/connecting-hadoop-and-edb-postgres-shrink-big-data-challenges > We can see: > "Support for various authentication methods (i.e. Kerberos, NOSASL, etc.)" > > So HDFS_FDW supports Kerberos authentication. How can we be sure of that? > Could EDB make a clear statement on this point? > HDFS_FDW does not support Kerberos authentication. The sentence you pasted above is from the wish list (i.e., the TODO list); here is what it says: "Currently the HDFS_FDW only provides READ capabilities but EDB is planning the following additional functionality:" That functionality was not implemented. I think part of the confusion might be due to the formatting of the list in the blog. You can follow the README[1] of HDFS_FDW to get an idea of how to use it. [1] https://github.com/EnterpriseDB/hdfs_fdw/blob/master/README.md Regards, Jeevan
Re: extensible options syntax for replication parser?
On Sun, Jun 14, 2020 at 3:15 AM Fabien COELHO wrote: > > so instead I'd like to have a better way to do it. > > > Attached is v1 of a patch to refactor things so that parts of the > > BASE_BACKUP and CREATE_REPLICATION_SLOT are replaced with a flexible > > options syntax. > > Patch applies cleanly, however compilation fails on: > >repl_gram.y:271:106: error: expected ‘;’ before ‘}’ Oops. I'm surprised my compiler didn't complain. > Getting rid of "ident_or_keyword", some day, would be a relief. Actually, I think that particular thing is a sign that things are being done correctly. You need something like that if you have contexts where you want to treat keywords and non-keywords the same way, and it's generally good to have such places. In fact, this could probably be profitably used in more places in the replication grammar. > For boolean options, you are generating (EXPORT_SNAPSHOT TRUE) where > (EXPORT_SNAPSHOT) would do. True, but it doesn't seem to matter much one way or the other. I thought this way looked a little clearer. > Maybe allowing (!EXPORT_SNAPSHOT) for (FOO FALSE) would be nice, if it is > not allowed yet. That would also allow to get rid of FOO/NOFOO variants if > any for FOO & !FOO, so one keyword is enough for a concept. Well, the goal for this is not to need ANY new keywords for a new concept, but !FOO would be inconsistent with other places where we do similar things (e.g. EXPLAIN, VACUUM, COPY) so I don't think that's the way to go. > > There are some debatable decisions here, so I'd be happy to get some > > feedback on whether to go further with this, or less far, or maybe even > > just abandon the idea altogether. I doubt the last one is the right > > course, though: ISTM something's got to be done about the BASE_BACKUP > > case, at least. > > ISTM that it would be better to generalize the approach to all commands > which accept options, so that the syntax is homogeneous. As a general principle, sure, but it's always debatable how far to take things in particular cases. For instance, in the cases of EXPLAIN, VACUUM, and COPY, the relation name is given as a dedicated piece of syntax, not an option. It could be given as an option, but since it's mandatory and important, we didn't. In the case of COPY, the source file is also specified via dedicated syntax, rather than an option. So we have to make the same kinds of decisions here. For example, for CREATE_REPLICATION_SLOT, one could argue that PHYSICAL and LOGICAL should be moved to the extensible options list instead of being kept as separate syntax. However, that seems somewhat inconsistent with the long-standing syntax for START_REPLICATION, which already does use extensible options: START_REPLICATION SLOT slot_name LOGICAL XXX/XXX [ ( option_name [option_value] [, ... ] ) ] On balance I am comfortable with what the patch does, but other people might have a different take. > Just wondering: ISTM that the patch implies that dumping a v14 db > generates the new syntax, which makes sense. Now I see 4 use cases wrt to > version. > > # sourcetarget comment > 1 v < 14v < 14 probably the dump would use one of the older version > 2 v < 14v >= 14 upgrade > 3 v >= 14 v < 14 downgrade: oops, the output uses the new syntax > 4 v >= 14 v >= 14 ok > > Both cross version usages may be legitimate. In particular, 3 (oops, > hardware issue, I have to move the db to a server where pg has not been > upgraded) seems not possible because the generated syntax uses the new > approach. 
> Should/could there be some option to tell "please generate vXXX > syntax" to allow that? I don't think dumping a DB is really affected by any of this. AFAIK, replication commands aren't used in pg_dump output. It just affects pg_basebackup and the server, and you'll notice that I have taken pains to allow the server to continue to accept the current format, and to allow pg_basebackup to generate that format when talking to an older server. Thanks for the review. v2 attached, hopefully fixing the compilation issue you mentioned. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company v2-0001-Flexible-options-for-BASE_BACKUP-and-CREATE_REPLI.patch Description: Binary data
Re: Avoiding hash join batch explosions with extreme skew and weird stats
Hi Tomas, On Tue, Jun 23, 2020 at 3:24 PM Tomas Vondra wrote: > > Now, a couple comments / questions about the code. > > > nodeHash.c > -- > > > 1) MultiExecPrivateHash says this > >/* > * Not subject to skew optimization, so either insert normally > * or save to batch file if it belongs to another stripe > */ > > I wonder what it means to "belong to another stripe". I understand what > that means for batches, which are identified by batchno computed from > the hash value. But I thought "stripes" are just work_mem-sized pieces > of a batch, so I don't quite understand this. Especially when the code > does not actually check "which stripe" the row belongs to. I have to concur that "stripe" did inspire a RAID vibe when I heard it, but it seemed to be a better name than what it replaces > 3) I'm a bit puzzled about this formula in ExecHashIncreaseNumBatches > >childbatch = (1U << (my_log2(hashtable->nbatch) - 1)) | > hashtable->curbatch; > > and also about this comment > >/* > * TODO: what to do about tuples that don't go to the child > * batch or stay in the current batch? (this is why we are > * counting tuples to child and curbatch with two diff > * variables in case the tuples go to a batch that isn't the > * child) > */ >if (batchno == childbatch) > childbatch_outgoing_tuples++; > > I thought each old batch is split into two new ones, and the tuples > either stay in the current one, or are moved to the new one - which I > presume is the childbatch, although I haven't tried to decode that > formula. So where else could the tuple go, as the comment tried to > suggest? True, every old batch is split into two new ones, if you only consider tuples coming from the batch file that _still belong in there_. i.e. there are tuples in the old batch file that belong to a future batch. As an example, if the current nbatch = 8, and we want to expand to nbatch = 16, (old) batch 1 will split into (new) batch 1 and batch 9, but it can already contain tuples that need to go into (current) batches 3, 5, and 7 (soon-to-be batches 11, 13, and 15). Cheers, Jesse
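A simplified sketch of that arithmetic, treating the batch number as hash % nbatch (the real code takes low-order bits of the hash value, which is equivalent when nbatch is a power of two):

    SELECT h AS hash,
           h % 8  AS batch_at_nbatch_8,
           h % 16 AS batch_at_nbatch_16
    FROM generate_series(0, 31) AS h
    WHERE h % 8 = 1;   -- tuples currently filed under batch 1
    -- batch_at_nbatch_16 is always 1 or 9: old batch 1 splits into new batches 1 and 9.
    -- Likewise, a batch file written back when nbatch was still 2 can hold tuples whose
    -- batch_at_nbatch_8 is 3, 5 or 7 -- the "future batch" tuples described above.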
Re: [PATCH] COPY command's data format option allows only lowercase csv, text or binary
On Wed, Jun 24, 2020 at 10:27 AM Tom Lane wrote: > More generally, though, why would we want to change this policy only > here? I believe we're reasonably consistent about letting the parser > do any required down-casing and then just checking keyword matches > with strcmp. I've had the feeling in the past that our use of pg_strcasecmp() was a bit random. Looking through the output of 'git grep pg_strcasecmp', it seems like we typically don't use it on option names, but sometimes we use it on option values. For instance, DefineCollation() uses pg_strcasecmp() on the collproviderstr, and DefineType() uses it on storageEl; and also, not to be overlooked, defGetBoolean() uses it when matching true/false/on/off, which probably affects a lot of places. On the other hand, ExplainQuery() matches the format using plain old strcmp(), despite indirectly using pg_strcasecmp() for the Boolean parameters. So I guess I'm not really convinced that there is all that much consistency here. Mind you, I'm not sure whether or not anything really needs to be changed, or exactly what ought to be done. I'm just making the observation that it might not be as consistent as you may think. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
Greetings, * Alvaro Herrera (alvhe...@2ndquadrant.com) wrote: > On 2020-Jun-24, Kyotaro Horiguchi wrote: > > > In logical replication, a replication role is intended to be > > accessible only to the GRANTed databases. On the other hand the same > > role can create a dead copy of the whole cluster, including > > non-granted databases. > > In other words -- essentially, if you grant replication access to a role > only to a specific database, they can steal the whole cluster. > > I don't see what's so great about that, but apparently people like it. Sure, people who aren't in charge of security I'm sure like the ease of use. Doesn't mean it makes sense or that we should be supporting that. What we should have is a way to allow administrators to configure a system for exactly what they want to allow, and it doesn't seem like we're doing that today and therefore we should fix it. This isn't the only area we have that issue in. Thanks, Stephen signature.asc Description: PGP signature
Re: PostgreSQL and big data - FDW
Greetings, * Bruce Momjian (br...@momjian.us) wrote: > On Wed, Jun 24, 2020 at 09:05:30AM +, ROS Didier wrote: > > I would like to use a Foreign Data Wrapper (FDW) to connect to a HADOOP > > cluster > > which uses KERBEROS authentication. Sadly, not really. > > is it possible to achieve this ? which FDW should be used ? > > Well, I would use the Hadoop FDW: > > https://github.com/EnterpriseDB/hdfs_fdw > > and it only supports these authentication methods: > > Authentication Support > > The FDW supports NOSASL and LDAP authentication modes. In order to use > NOSASL do not specify any OPTIONS while creating user mapping. For LDAP > username and password must be specified in OPTIONS while creating user > mapping. > > Not every FDW supports every Postgres server authentication method. That isn't really the issue here, the problem is really that the GSSAPI support in PG today doesn't support credential delegation- if it did, then the HDFS FDW (and the postgres FDW) could be easily extended to leverage those delegated credentials to connect. That's been something that's been on my personal todo list of things to work on but unfortunately I've not, as yet, had time to go implement. I don't actually think it would be very hard- if someone writes it, I'd definitely review it. Thanks, Stephen signature.asc Description: PGP signature
Re: [PATCH] COPY command's data format option allows only lowercase csv, text or binary
Robert Haas writes: > On Wed, Jun 24, 2020 at 10:27 AM Tom Lane wrote: >> More generally, though, why would we want to change this policy only >> here? I believe we're reasonably consistent about letting the parser >> do any required down-casing and then just checking keyword matches >> with strcmp. > ... Mind you, I'm not sure whether or not anything really needs to be > changed, or exactly what ought to be done. I'm just making the > observation that it might not be as consistent as you may think. Yeah, I'm sure there are a few inconsistencies. We previously made a pass to get rid of pg_strcasecmp for anything that had been through the parser's downcasing (commit fb8697b31) but I wouldn't be surprised if that missed a few cases, or if new ones have snuck in. Anyway, "don't use pg_strcasecmp unnecessarily" was definitely the agreed-to policy as of Jan 2018. My vague recollection is that there are a few exceptions (defGetBoolean may well be one of them) where pg_strcasecmp still seemed necessary because the input might not have come through the parser in some usages. regards, tom lane
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
On 2020-Jun-24, Stephen Frost wrote: > Doesn't mean it makes sense or that we should be supporting that. What > we should have is a way to allow administrators to configure a system > for exactly what they want to allow, and it doesn't seem like we're > doing that today and therefore we should fix it. This isn't the only > area we have that issue in. The way to do that, for the case under discussion, is to reject using a logical replication connection for physical replication commands. -- Álvaro Herrera  https://www.2ndQuadrant.com/  PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Default setting for enable_hashagg_disk
On Wed, Jun 24, 2020 at 07:38:43AM -0500, Justin Pryzby wrote: > On Wed, Jun 24, 2020 at 05:06:28AM -0400, Bruce Momjian wrote: > > It would seem merge join has almost the same complexities as the new > > hash join code, since it can spill to disk doing sorts for merge joins, > > and adjusting work_mem is the only way to control that spill to disk. I > > don't remember anyone complaining about spills to disk during merge > > join, so I am unclear why we would need such a control for hash join. > > It looks like merge join was new in 8.3. I don't think that's a good > analogy, > since the old behavior was still available with enable_mergejoin=off. Uh, we don't guarantee backward compatibility in the optimizer. You can turn off hashagg if you want. That doesn't get you to PG 13 behavior, but we don't guarantee that. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
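For reference, the escape hatch Bruce refers to is the existing planner GUC, e.g. at session level:

    SET enable_hashagg = off;   -- planner stops considering hash aggregation
    -- ... run the affected query ...
    RESET enable_hashagg;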
Re: Default setting for enable_hashagg_disk
On Thu, Jun 25, 2020 at 12:24:29AM +1200, David Rowley wrote: > On Wed, 24 Jun 2020 at 21:06, Bruce Momjian wrote: > > I > > don't remember anyone complaining about spills to disk during merge > > join, so I am unclear why we would need a such control for hash join. > > Hash aggregate, you mean? The reason is that upgrading to PG13 can Yes, sorry. > cause a performance regression for an underestimated ndistinct on the > GROUP BY clause and cause hash aggregate to spill to disk where it > previously did everything in RAM. Sure, that behaviour was never > what we wanted to happen, Jeff has fixed that now, but the fact > remains that this does happen in the real world quite often and people > often get away with it, likey because work_mem is generally set to > some very conservative value. Of course, there's also a bunch of > people that have been bitten by OOM due to this too. The "neverspill" > wouldn't be for those people. Certainly, it's possible that we just > tell these people to increase work_mem for this query, that way they > can set it to something reasonable and still get spilling if it's > really needed to save them from OOM, but the problem there is that > it's not very easy to go and set work_mem for a certain query. Well, my point is that merge join works that way, and no one has needed a knob to avoid mergejoin if it is going to spill to disk. If they are adjusting work_mem to prevent spill of merge join, they can do the same for hash agg. We just need to document this in the release notes. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
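A sketch of the work_mem approach for a single query (the value and the query below are purely illustrative):

    BEGIN;
    SET LOCAL work_mem = '256MB';          -- scoped to this transaction only
    SELECT customer_id, count(*)           -- hypothetical aggregate query
    FROM orders
    GROUP BY customer_id;
    COMMIT;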
Re: xid wraparound danger due to INDEX_CLEANUP false
On Fri, May 22, 2020 at 4:40 PM Peter Geoghegan wrote: > On Mon, May 18, 2020 at 7:32 PM Masahiko Sawada > wrote: > > I've attached WIP patch for HEAD. With this patch, the core pass > > index_cleanup to bulkdelete and vacuumcleanup callbacks so that they > > can make decision whether run vacuum or not. > > > > If the direction of this patch seems good, I'll change the patch so > > that we pass something information to these callbacks indicating > > whether this vacuum is anti-wraparound vacuum. This is necessary > > because it's enough to invoke index cleanup before XID wraparound as > > per discussion so far. > > It. seems like the right direction to me. Robert? Sorry, I'm so far behind on my email. Argh. I think, especially on the blog post you linked, that we should aim to have INDEX_CLEANUP OFF mode do the minimum possible amount of work while still keeping us safe against transaction ID wraparound. So if, for example, it's desirable but not imperative for BRIN to resummarize, then it's OK normally but should be skipped with INDEX_CLEANUP OFF. I find the patch itself confusing and the comments inadequate, especially the changes to lazy_scan_heap(). Before the INDEX_CLEANUP patch went into the tree, LVRelStats had a member hasindex indicating whether or not the table had any indexes. The patch changed that member to useindex, indicating whether or not we were going to do index vacuuming; thus, it would be false if either there were no indexes or if we were going to ignore them. This patch redefines useindex to mean whether or not the table has any indexes, but without renaming the variable. There's also really no comments anywhere in the vacuumlazy.c explaining the overall scheme here; what are we actually doing? Apparently, what we're really doing here is that even when INDEX_CLEANUP is OFF, we're still going to keep all the dead tuples. That seems sad, but if it's what we have to do then it at least needs comments explaining it. As for the btree portion of the change, I expect you understand this better than I do, so I'm reluctant to stick my neck out, but it seems that what the patch does is force index cleanup to happen even when INDEX_CLEANUP is OFF provided that the vacuum is for wraparound and the btree index has at least 1 recyclable page. My first reaction is to wonder whether that doesn't nerf this feature into oblivion. Isn't it likely that an index that is being vacuumed for wraparound will have a recyclable page someplace? If the presence of that 1 recyclable page causes the "help, I'm about to run out of XIDs, please do the least possible work" flag to become a no-op, I don't think users are going to be too happy with that. Maybe I am misunderstanding. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Default setting for enable_hashagg_disk
Justin Pryzby writes: > On Wed, Jun 24, 2020 at 05:06:28AM -0400, Bruce Momjian wrote: >> It would seem merge join has almost the same complexities as the new >> hash join code, since it can spill to disk doing sorts for merge joins, >> and adjusting work_mem is the only way to control that spill to disk. I >> don't remember anyone complaining about spills to disk during merge >> join, so I am unclear why we would need such a control for hash join. > It looks like merge join was new in 8.3. I don't think that's a good > analogy, > since the old behavior was still available with enable_mergejoin=off. Uh, what? A look into our git history shows immediately that nodeMergejoin.c has been there since the initial code import in 1996. I tend to agree with Bruce that it's not very obvious that we need another GUC knob here ... especially not one as ugly as this. I'm especially against the "neverspill" option, because that makes a single GUC that affects both the planner and executor independently. If we feel we need something to let people have the v12 behavior back, let's have
(1) enable_hashagg on/off --- controls planner, same as it ever was
(2) enable_hashagg_spill on/off --- controls executor by disabling spill
But I'm not really convinced that we need (2). regards, tom lane
Re: [PATCH] Incremental sort (was: PoC: Partial sort)
On Fri, Jun 19, 2020 at 12:04 AM Justin Pryzby wrote: > > On Tue, Apr 07, 2020 at 08:40:30AM -0400, James Coleman wrote: > > On Tue, Apr 7, 2020 at 12:25 AM Justin Pryzby wrote: > > > On Mon, Apr 06, 2020 at 09:57:22PM +0200, Tomas Vondra wrote: > > > > I've pushed the fist part of this patch series - I've reorganized it a > > > > > > I scanned through this again post-commit. Find attached some suggestions. > > > > > > Shouldn't non-text explain output always show both disk *and* mem, > > > including > > > zeros ? > > > > Could you give more context on this? Is there a standard to follow? > > Regular sort nodes only ever report one type, so there's not a good > > parallel there. > > The change I proposed was like: > > @@ -2829,7 +2829,6 @@ > show_incremental_sort_group_info(IncrementalSortGroupInfo *groupInfo, > > ExplainPropertyList("Sort Methods Used", methodNames, es); > > - if (groupInfo->maxMemorySpaceUsed > 0) > { > longavgSpace = > groupInfo->totalMemorySpaceUsed / groupInfo->groupCount; > const char *spaceTypeName; > ... > - if (groupInfo->maxDiskSpaceUsed > 0) > { > ... > > To show in non-text format *both* disk and memory space used, even if zero. > > I still think that's what's desirable. I'm of the opposite opinion. I believe showing both unnecessarily is confusing. > If it's important to show *whether* a sort space was used, then I think there > should be a boolean, or an int 0/1. But I don't think it's actually needed, > since someone parsing the explain output could just check > |if _dict['Peak Sort Space Used'] > 0: ... > the same as you're doing, without having to write some variation on: > |if 'Peak Sort Space Used' in _dict and _dict['Peak Sort Space Used'] > 0: ... I think it's desirable for code to be explicitly about the type having been used rather than implicitly assuming it based on 0/non-zero values. James
Re: should libpq also require TLSv1.2 by default?
On 2020-06-24 10:33, Daniel Gustafsson wrote: In PG13, we raised the server-side default of ssl_min_protocol_version to TLSv1.2. We also added a connection setting named ssl_min_protocol_version to libpq. But AFAICT, the default value of the libpq setting is empty, so any protocol version will be accepted. Is this what we wanted? Should we raise the default in libpq as well? This was discussed [0] when the connection settings were introduced, and the concensus was to leave them alone [1] to allow for example a new pg_dump to work against an old server. Re-reading the thread I think the argument still holds, but I was about to respond "yes, let's do this" before refreshing my memory. Perhaps we should add a comment explaining this along the lines of the attached? [0] https://www.postgresql.org/message-id/157800160408.1198.1714906047977693148.pgcf%40coridan.postgresql.org [1] https://www.postgresql.org/message-id/31993.1578321474%40sss.pgh.pa.us ISTM that these discussions went through the same questions and arguments that were made regarding the server-side change but arrived at a different conclusion. So I suggest to reconsider this so that we don't ship with contradictory results. That doesn't necessarily mean that we have to make a change, but we should make sure our rationale is sound. Note that all OpenSSL versions that do not support TLSv1.2 also do not support TLSv1.1. So by saying, in effect, that TLSv1.2 is too new to require, we are saying that we need to keep supporting TLSv1.0 -- which is heavily deprecated. Also note that the first OpenSSL version with support for TLSv1.2 shipped on March 14, 2012. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
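For anyone who wants the stricter client-side behaviour today regardless of what the default ends up being, the libpq option discussed above can be set per connection (host and database names below are placeholders):

    psql "host=db.example.com dbname=app sslmode=require ssl_min_protocol_version=TLSv1.2"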
Re: Allow CURRENT_ROLE in GRANTED BY
On 2020-06-24 10:12, Vik Fearing wrote: On 6/24/20 8:35 AM, Peter Eisentraut wrote: I was checking some loose ends in SQL conformance, when I noticed: We support GRANT role ... GRANTED BY CURRENT_USER, but we don't support CURRENT_ROLE in that place, even though in PostgreSQL they are equivalent. Here is a trivial patch to add that. The only thing that isn't dead-obvious about this patch is the commit message says "[PATCH 1/2]". What is in the other part? Hehe. The second patch is some in-progress work to add the GRANTED BY clause to the regular GRANT command. More on that perhaps at a later date. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: should libpq also require TLSv1.2 by default?
On Wed, Jun 24, 2020 at 07:57:31PM +0200, Peter Eisentraut wrote: > On 2020-06-24 10:33, Daniel Gustafsson wrote: > > > In PG13, we raised the server-side default of ssl_min_protocol_version to > > > TLSv1.2. We also added a connection setting named > > > ssl_min_protocol_version to libpq. But AFAICT, the default value of the > > > libpq setting is empty, so any protocol version will be accepted. Is > > > this what we wanted? Should we raise the default in libpq as well? > > > > This was discussed [0] when the connection settings were introduced, and the > > concensus was to leave them alone [1] to allow for example a new pg_dump to > > work against an old server. Re-reading the thread I think the argument > > still > > holds, but I was about to respond "yes, let's do this" before refreshing my > > memory. Perhaps we should add a comment explaining this along the lines of > > the > > attached? > > > > [0] > > https://www.postgresql.org/message-id/157800160408.1198.1714906047977693148.pgcf%40coridan.postgresql.org > > [1] https://www.postgresql.org/message-id/31993.1578321474%40sss.pgh.pa.us > > ISTM that these discussions went through the same questions and arguments > that were made regarding the server-side change but arrived at a different > conclusion. So I suggest to reconsider this so that we don't ship with > contradictory results. > > That doesn't necessarily mean that we have to make a change, but we should > make sure our rationale is sound. > > Note that all OpenSSL versions that do not support TLSv1.2 also do not > support TLSv1.1. So by saying, in effect, that TLSv1.2 is too new to > require, we are saying that we need to keep supporting TLSv1.0 -- which is > heavily deprecated. Also note that the first OpenSSL version with support > for TLSv1.2 shipped on March 14, 2012. I do think mismatched SSL requirements between client and server is confusing, though I can see the back-version pg_dump being an issue. Maybe a clear error message would help here. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: Review for GetWALAvailability()
Thanks for those corrections. I have pushed this. I think all problems Masao-san reported have been dealt with, so we're done here. Thanks! -- Álvaro Herrera  https://www.2ndQuadrant.com/  PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Default setting for enable_hashagg_disk
On Wed, Jun 24, 2020 at 01:29:56PM -0400, Tom Lane wrote: Justin Pryzby writes: On Wed, Jun 24, 2020 at 05:06:28AM -0400, Bruce Momjian wrote: It would seem merge join has almost the same complexities as the new hash join code, since it can spill to disk doing sorts for merge joins, and adjusting work_mem is the only way to control that spill to disk. I don't remember anyone complaining about spills to disk during merge join, so I am unclear why we would need a such control for hash join. It loooks like merge join was new in 8.3. I don't think that's a good analogy, since the old behavior was still available with enable_mergejoin=off. Uh, what? A look into our git history shows immediately that nodeMergejoin.c has been there since the initial code import in 1996. I tend to agree with Bruce that it's not very obvious that we need another GUC knob here ... especially not one as ugly as this. I'm especially against the "neverspill" option, because that makes a single GUC that affects both the planner and executor independently. If we feel we need something to let people have the v12 behavior back, let's have (1) enable_hashagg on/off --- controls planner, same as it ever was (2) enable_hashagg_spill on/off --- controls executor by disabling spill What if a user specifies enable_hashagg = on enable_hashagg_spill = off and the estimates say the hashagg would need to spill to disk. Should that disable the query (in which case the second GUC affects both executor and planner) or run it (in which case we knowingly ignore work_mem, which seems wrong). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Default setting for enable_hashagg_disk
Tomas Vondra writes: > On Wed, Jun 24, 2020 at 01:29:56PM -0400, Tom Lane wrote: >> If we feel we need something to let people have the v12 behavior >> back, let's have >> (1) enable_hashagg on/off --- controls planner, same as it ever was >> (2) enable_hashagg_spill on/off --- controls executor by disabling spill > What if a user specifies >enable_hashagg = on >enable_hashagg_spill = off It would probably be reasonable for the planner to behave as it did pre-v13, that is not choose hashagg if it estimates that work_mem would be exceeded. (So, okay, that means enable_hashagg_spill affects both planner and executor ... but ISTM it's just one behavior not two.) regards, tom lane
Re: Default setting for enable_hashagg_disk
Hi, On 2020-06-24 14:11:57 +1200, David Rowley wrote: > 1. Statistics underestimation can cause hashagg to be selected. The > executor will spill to disk in PG13. Users may find performance > suffers as previously the query may have just overshot work_mem > without causing any OOM issues. Their I/O performance might be > terrible. > 2. We might now choose to hash aggregate where pre PG13, we didn't > choose that because the hash table was estimated to be bigger than > work_mem. Hash agg might not be the best plan for the job. > For #1. We know users are forced to run smaller work_mems than they > might like as they need to allow for that random moment where all > backends happen to be doing that 5-way hash join all at the same time. > It seems reasonable that someone might want the old behaviour. They > may well be sitting on a timebomb that's about to OOM, but it would be > sad if someone's upgrade to PG13 was blocked on this, especially if > it's just due to some query that runs once per month but needs to > perform quickly. I'm quite concerned about this one. I think this isn't just going to hit when the planner mis-estimates ndistinct, but also when transition values use a bit more space. We'll now start spilling in cases the < v13 planner did everything right. That's great for cases where we'd otherwise OOM, but for a lot of other cases where there actually is more than sufficient RAM to overrun work_mem by a single-digit factor, it can cause a pretty massive increase of IO over < v13. FWIW, my gut feeling is that we'll end up have to separate the "execution time" spilling from using plain work mem, because it'll trigger spilling too often. E.g. if the plan isn't expected to spill, only spill at 10 x work_mem or something like that. Or we'll need better management of temp file data when there's plenty memory available. > For #2. This seems like a very legitimate requirement to me. If a > user is unhappy that PG13 now hashaggs where before it sorted and > group aggregated, but they're unhappy, not because there's some issue > with hashagg spilling, but because that causes the node above the agg > to becomes a Hash Join rather than a Merge Join and that's bad for > some existing reason. Our planner doing the wrong thing based on > either; lack of, inaccurate or out-of-date statistics is not Jeff's > fault. Having the ability to switch off a certain planner feature is > just following along with what we do today for many other node types. This one concerns me a bit less, fwiw. There's a lot more "pressure" in the planner to choose hash agg or sorted agg, compared to e.g. a bunch of aggregate states taking up a bit more space (can't estimate that at all for ma. > As for GUCs to try to help the group of users who, *I'm certain*, will > have problems with PG13's plan choice. I think the overloaded > enable_hashagg option is a really nice compromise. We don't really > have any other executor node type that has multiple GUCs controlling > its behaviour, so I believe it would be nice to keep it that way. > > How about: > > enable_hashagg = "on" -- enables hashagg allowing it to freely spill > to disk as it pleases. > enable_hashagg = "trynospill" -- Planner will only choose hash_agg if > it thinks it won't spill (pre PG13 planner behaviour) > enable_hashagg = "neverspill" -- executor will *never* spill to disk > and can still OOM (NOT RECOMMENDED, but does give pre PG13 planner and > executor behaviour) > enable_hashagg = "off" -- planner does not consider hash agg, ever. 
> Same as what PG12 did for this setting. > > Now, it's a bit weird to have "neverspill" as this is controlling > what's done in the executor from a planner GUC. Likely we can just > work around that by having a new "allowhashspill" bool field in the > "Agg" struct that's set by the planner, say during createplan that > controls if nodeAgg.c is allowed to spill or not. That'll also allow > PREPAREd plans to continue to do what they had planned to do already. > > The thing I like about doing it this way is that: > > a) it does not add any new GUCs > b) it semi hides the weird values that we really wish nobody would > ever have to set in a GUC that people have become used it just > allowing the values "on" and "off". > > The thing I don't quite like about this idea is: > a) I wish the planner was perfect and we didn't need to do this. > b) It's a bit weird to overload a GUC that has a very booleanish name > to not be bool. > > However, I also think it's pretty lightweight to support this. I > imagine a dozen lines of docs and likely about half a dozen lines per > GUC option in the planner. That'd work for me, but I honestly don't particularly care about the specific naming, as long as we provide users an escape hatch from the increased amount of IO. Greetings, Andres Freund
Re: Default setting for enable_hashagg_disk
Hi, On 2020-06-24 13:12:03 -0400, Bruce Momjian wrote: > Well, my point is that merge join works that way, and no one has needed > a knob to avoid mergejoin if it is going to spill to disk. If they are > adjusting work_mem to prevent spill of merge join, they can do the same > for hash agg. We just need to document this in the release notes. I don't think this is comparable. For starters, the IO indirectly triggered by mergejoin actually leads to plenty of people just straight out disabling it. For lots of workloads there's never a valid reason to use a mergejoin (and often the planner will never choose one). Secondly, the planner has better information about estimating the memory usage for the to-be-sorted data than it has about the size of the transition values. And lastly, there's a difference between a long-existing cause of bad IO behaviour and one that suddenly kicks in after a major version upgrade, to which there's no escape hatch (it's rarely realistic to disable hash aggs, in contrast to merge joins). Greetings, Andres Freund
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
On Wed, Jun 24, 2020 at 1:06 PM Alvaro Herrera wrote: > On 2020-Jun-24, Stephen Frost wrote: > > Doesn't mean it makes sense or that we should be supporting that. What > > we should have is a way to allow administrators to configure a system > > for exactly what they want to allow, and it doesn't seem like we're > > doing that today and therefore we should fix it. This isn't the only > > area we have that issue in. > > The way to do that, for the case under discussion, is to reject using a > logical replication connection for physical replication commands. Reading over this discussion, I see basically three arguments: 1. Andres argues that being able to execute physical replication commands from the same connection as SQL queries is useful, and that people may be relying on it, and that we shouldn't break it without need. 2. Fujii Masao argues that the current situation makes it impossible to write a pg_hba.conf rule that disallows all physical replication connections, because people could get around it by using a logical replication connection for physical replication. 3. Various people argue that it's only accidental that physical replication on a replication=database connection ever worked at all, and therefore we ought to block it. I find argument #1 most convincing, #2 less convincing, and #3 least convincing. In my view, the problem with argument #3 is that just because some feature combination was unintentional doesn't mean it's unuseful or unused. As for #2, suppose someone were to propose a design for logical replication that allowed it to take place without a database connection, so that it could be done with just a regular replication connection. Such a feature would create the same problem Fujii Masao mentions here, but it seems inconceivable that we would for that reason reject it; we make decisions about features based on their usefulness, not their knock-on effects on pg_hba.conf rules. We can always add new kinds of access control restrictions if they are needed; that is a better approach than removing features so that the existing pg_hba.conf facilities can be used to accomplish some particular goal. So really I think this turns on #1: is it plausible that people are using this feature, however inadvertent it may be, and is it potentially useful? I don't see that anybody's made an argument against either of those things. Unless someone can do so, I think we shouldn't disable this. That having been said, I think that the fact that you can execute SQL queries in replication=database connections is horrifying. I really hate that feature. I think it's a bad design, and a bad implementation, and a recipe for tons of bugs. But, blocking physical replication commands on such connections isn't going to solve any of that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Default setting for enable_hashagg_disk
On Wed, Jun 24, 2020 at 3:14 PM Andres Freund wrote: > FWIW, my gut feeling is that we'll end up have to separate the > "execution time" spilling from using plain work mem, because it'll > trigger spilling too often. E.g. if the plan isn't expected to spill, > only spill at 10 x work_mem or something like that. Or we'll need > better management of temp file data when there's plenty memory > available. So, I don't think we can wire in a constant like 10x. That's really unprincipled and I think it's a bad idea. What we could do, though, is replace the existing Boolean-valued GUC with a new GUC that controls the size at which the aggregate spills. The default could be -1, meaning work_mem, but a user could configure a larger value if desired (presumably, we would just treat a value smaller than work_mem as work_mem, and document the same). I think that's actually pretty appealing. Separating the memory we plan to use from the memory we're willing to use before spilling seems like a good idea in general, and I think we should probably also do it in other places - like sorts. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Default setting for enable_hashagg_disk
Hi, On 2020-06-24 14:40:50 -0400, Tom Lane wrote: > Tomas Vondra writes: > > On Wed, Jun 24, 2020 at 01:29:56PM -0400, Tom Lane wrote: > >> If we feel we need something to let people have the v12 behavior > >> back, let's have > >> (1) enable_hashagg on/off --- controls planner, same as it ever was > >> (2) enable_hashagg_spill on/off --- controls executor by disabling spill > > > What if a user specifies > >enable_hashagg = on > >enable_hashagg_spill = off > > It would probably be reasonable for the planner to behave as it did > pre-v13, that is not choose hashagg if it estimates that work_mem > would be exceeded. (So, okay, that means enable_hashagg_spill > affects both planner and executor ... but ISTM it's just one > behavior not two.) There's two different reasons for spilling in the executor right now: 1) The planner estimated that we'd need to spill, and that turns out to be true. There seems no reason to not spill in that case (as long as it's enabled/chosen in the planner). 2) The planner didn't think we'd need to spill, but we end up using more than work_mem memory. nodeAgg.c already treats those separately:

void
hash_agg_set_limits(double hashentrysize, uint64 input_groups, int used_bits,
                    Size *mem_limit, uint64 *ngroups_limit,
                    int *num_partitions)
{
    int         npartitions;
    Size        partition_mem;

    /* if not expected to spill, use all of work_mem */
    if (input_groups * hashentrysize < work_mem * 1024L)
    {
        if (num_partitions != NULL)
            *num_partitions = 0;
        *mem_limit = work_mem * 1024L;
        *ngroups_limit = *mem_limit / hashentrysize;
        return;
    }

We can't sensibly disable spilling when chosen at plan time, because that'd lead to *more* OOMS than in v12. ISTM that we should have one option that controls whether 1) is done, and one that controls whether 2) is done. Even if the option for 2 is off, we still should spill when the option for 1) chooses a spilling plan. I don't think it makes sense for one of those options to influence the other implicitly. So maybe enable_hashagg_spilling_plan for 1) and hashagg_spill_on_exhaust for 2). Greetings, Andres Freund
Re: Why forbid "INSERT INTO t () VALUES ();"
Hallo Thomas,

>> INSERT INTO t() VALUES ();
>>
>> This is forbidden by postgres, and also sqlite. Is there any good
>> reason why this should be the case?
>
> Maybe because
>   insert into t default values;
> exists (and is standard SQL if I'm not mistaken)

That's a nice alternative I did not notice. Well, not an alternative as
the other one does not work.

I'm still unclear why it would be forbidden though, it seems logical to
try that, whereas the working one is quite away from the usual syntax.

-- Fabien.
Re: Default setting for enable_hashagg_disk
Hi, On 2020-06-24 15:28:47 -0400, Robert Haas wrote: > On Wed, Jun 24, 2020 at 3:14 PM Andres Freund wrote: > > FWIW, my gut feeling is that we'll end up have to separate the > > "execution time" spilling from using plain work mem, because it'll > > trigger spilling too often. E.g. if the plan isn't expected to spill, > > only spill at 10 x work_mem or something like that. Or we'll need > > better management of temp file data when there's plenty memory > > available. > > So, I don't think we can wire in a constant like 10x. That's really > unprincipled and I think it's a bad idea. What we could do, though, is > replace the existing Boolean-valued GUC with a new GUC that controls > the size at which the aggregate spills. The default could be -1, > meaning work_mem, but a user could configure a larger value if desired > (presumably, we would just treat a value smaller than work_mem as > work_mem, and document the same). To be clear, I wasn't actually thinking of hard-coding 10x, but having a config option that specifies a factor of work_mem. A factor seems better because it'll work reasonably for different values of work_mem, whereas a concrete size wouldn't. > I think that's actually pretty appealing. Separating the memory we > plan to use from the memory we're willing to use before spilling seems > like a good idea in general, and I think we should probably also do it > in other places - like sorts. Indeed. And then perhaps we could eventually add some reporting / monitoring infrastructure for the cases where plan time and execution time memory estimate/usage widely differs. Greetings, Andres Freund
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
On 2020-Jun-24, Robert Haas wrote:

> So really I think this turns on #1: is it plausible
> that people are using this feature, however inadvertent it may be, and
> is it potentially useful? I don't see that anybody's made an argument
> against either of those things. Unless someone can do so, I think we
> shouldn't disable this.

People (specifically the jdbc driver) *are* using this feature in this
way, but they didn't realize they were doing it. It was an accident and
they didn't notice.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
On Wed, 24 Jun 2020 at 15:41, Alvaro Herrera wrote:

> On 2020-Jun-24, Robert Haas wrote:
>
> > So really I think this turns on #1: is it plausible
> > that people are using this feature, however inadvertent it may be, and
> > is it potentially useful? I don't see that anybody's made an argument
> > against either of those things. Unless someone can do so, I think we
> > shouldn't disable this.
>
> People (specifically the jdbc driver) *are* using this feature in this
> way, but they didn't realize they were doing it. It was an accident and
> they didn't notice.

I'm not sure we are using it so much as we accidentally did it that way.
It would be trivial to fix. That said, I think we should fix the
security hole this opens and leave the functionality.

Dave
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
Hi, On 2020-06-24 15:41:14 -0400, Alvaro Herrera wrote: > On 2020-Jun-24, Robert Haas wrote: > > > So really I think this turns on #1: is it plausible > > that people are using this feature, however inadvertent it may be, and > > is it potentially useful? I don't see that anybody's made an argument > > against either of those things. Unless someone can do so, I think we > > shouldn't disable this. > > People (specifically the jdbc driver) *are* using this feature in this > way, but they didn't realize they were doing it. It was an accident and > they didn't notice. As I said before, I've utilized being able to do both over a single connection (among others to initialize a logical replica using a base backup). And I've seen at least one other codebase (developed without my input) doing so. I really don't understand how you just dismiss this without any sort of actual argument. Yes, those uses can be fixed to reconnect with a different replication parameter, but that's code that needs to be adjusted and it requires adjustments to pg_hba.conf etc. And obviously you'd lock out older versions of jdbc, and possibly other drivers. Obviously we should allow more granular permissions here, I don't think anybody is arguing against that. Greetings, Andres Freund
Re: Why forbid "INSERT INTO t () VALUES ();"
Fabien COELHO writes: >>> INSERT INTO t() VALUES (); > I'm still unclear why it would be forbidden though, it seems logical to > try that, whereas the working one is quite away from the usual syntax. It's forbidden because the SQL standard forbids it. We allow zero-column syntaxes in some other places where SQL forbids them, but that's only because there is no reasonable alternative. In this case, there's a perfectly good, standards-compliant alternative. So why encourage people to write unportable code? regards, tom lane
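For concreteness, assuming a table t whose columns all have usable defaults, the two forms under discussion are:

    INSERT INTO t () VALUES ();     -- rejected by PostgreSQL (and by the SQL standard)
    INSERT INTO t DEFAULT VALUES;   -- standard-compliant equivalent that PostgreSQL accepts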
Re: xid wraparound danger due to INDEX_CLEANUP false
On Wed, Jun 24, 2020 at 10:21 AM Robert Haas wrote: > Sorry, I'm so far behind on my email. Argh. That's okay. > I think, especially on the blog post you linked, that we should aim to > have INDEX_CLEANUP OFF mode do the minimum possible amount of work > while still keeping us safe against transaction ID wraparound. So if, > for example, it's desirable but not imperative for BRIN to > resummarize, then it's OK normally but should be skipped with > INDEX_CLEANUP OFF. I agree that that's very important. > Apparently, what we're really doing here is that even when > INDEX_CLEANUP is OFF, we're still going to keep all the dead tuples. > That seems sad, but if it's what we have to do then it at least needs > comments explaining it. +1. Though I think that AMs should technically have the right to consider it advisory. > As for the btree portion of the change, I expect you understand this > better than I do, so I'm reluctant to stick my neck out, but it seems > that what the patch does is force index cleanup to happen even when > INDEX_CLEANUP is OFF provided that the vacuum is for wraparound and > the btree index has at least 1 recyclable page. My first reaction is > to wonder whether that doesn't nerf this feature into oblivion. Isn't > it likely that an index that is being vacuumed for wraparound will > have a recyclable page someplace? If the presence of that 1 recyclable > page causes the "help, I'm about to run out of XIDs, please do the > least possible work" flag to become a no-op, I don't think users are > going to be too happy with that. Maybe I am misunderstanding. I have mixed feelings about it myself. These are valid concerns. This is a problem to the extent that the user has a table that they'd like to use INDEX_CLEANUP with, that has indexes that regularly require cleanup due to page deletion. ISTM, then, that the really relevant high level design questions for this patch are: 1. How often is that likely to happen in The Real World™? 2. If we fail to do cleanup and leak already-deleted pages, how bad is that? ( Both in general, and in the worst case.) I'll hazard a guess for 1: I think that it might not come up that often. Page deletion is often something that we hardly ever need. And, unlike some DB systems, we only do it when pages are fully empty (which, as it turns out, isn't necessarily better than our simple approach [1]). I tend to think it's unlikely to happen in cases where INDEX_CLEANUP is used, because those are cases that also must not have that much index churn to begin with. Then there's question 2. The intuition behind the approach from Sawada-san's patch was that allowing wraparound here feels wrong -- it should be in the AM's hands. However, it's not like I can point to some ironclad guarantee about not leaking deleted pages that existed before the INDEX_CLEANUP feature. We know that the FSM is not crash safe, and that's that. Is this really all that different? Maybe it is, but it seems like a quantitative difference to me. I'm kind of arguing against myself even as I try to advance my original argument. If workloads that use INDEX_CLEANUP don't need to delete and recycle pages in any case, then why should we care that those same workloads might leak pages on account of the wraparound hazard? There's nothing to leak! Maybe some compromise is possible, at least for the backbranches. Perhaps nbtree is told about the requirements in a bit more detail than we'd imagined. It's not just an INDEX_CLEANUP boolean. It could be an enum or something. Not sure, though. 
The real reason that I want to push the mechanism down into index access methods is because that design is clearly better overall; it just so happens that the specific way in which we currently defer recycling in nbtree makes very little sense, so it's harder to see the big picture. The xid-cleanup design that we have was approximately the easiest way to do it, so that's what we got. We should figure out a way to recycle the pages at something close to the earliest possible opportunity, without having to perform a full scan on the index relation within btvacuumscan(). Maybe we can use the autovacuum work item mechanism for that. For indexes that get VACUUMed once a week on average, it makes zero sense to wait another week to recycle the pages that get deleted, in a staggered fashion. It should be possible to recycle the pages a minute or two after VACUUM proper finishes, with extra work that's proportionate to the number of deleted pages. This is still conservative. I am currently very busy with an unrelated B-Tree prototype, so I might not get around to it this year. Maybe Sawada-san can think about this? Also, handling of the vacuum_cleanup_index_scale_factor stuff (added to Postgres 11 by commit 857f9c36) should live next to code for the confusingly-similar INDEX_CLEANUP stuff (added to Postgres 12 by commit a96c41feec6), on general principle. I think that that organization is a lot easier to follow. [1] http
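The switch being discussed is exposed as a VACUUM option (and a matching reloption) since PostgreSQL 12; a minimal example, with my_table standing in for any real table:

    -- ask for the least possible index work during an (anti-wraparound) vacuum
    VACUUM (FREEZE, INDEX_CLEANUP OFF) my_table;

    -- or make it the table's default
    ALTER TABLE my_table SET (vacuum_index_cleanup = off);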
Re: Default setting for enable_hashagg_disk
On Wed, Jun 24, 2020 at 12:36:24PM -0700, Andres Freund wrote:
> Hi,
>
> On 2020-06-24 15:28:47 -0400, Robert Haas wrote:
> > On Wed, Jun 24, 2020 at 3:14 PM Andres Freund wrote:
> > > FWIW, my gut feeling is that we'll end up have to separate the
> > > "execution time" spilling from using plain work mem, because it'll
> > > trigger spilling too often. E.g. if the plan isn't expected to spill,
> > > only spill at 10 x work_mem or something like that. Or we'll need
> > > better management of temp file data when there's plenty memory
> > > available.
> >
> > So, I don't think we can wire in a constant like 10x. That's really
> > unprincipled and I think it's a bad idea. What we could do, though, is
> > replace the existing Boolean-valued GUC with a new GUC that controls
> > the size at which the aggregate spills. The default could be -1,
> > meaning work_mem, but a user could configure a larger value if desired
> > (presumably, we would just treat a value smaller than work_mem as
> > work_mem, and document the same).
>
> To be clear, I wasn't actually thinking of hard-coding 10x, but having
> a config option that specifies a factor of work_mem. A factor seems
> better because it'll work reasonably for different values of work_mem,
> whereas a concrete size wouldn't.

I'm not quite convinced we need/should introduce a new memory limit.
It's true keeping it equal to work_mem by default makes this less of an
issue, but it's still another moving part the users will need to learn
how to use.

But if we do introduce a new limit, I very much think it should be a
plain limit, not a factor. That just makes it even more complicated,
and we don't have any such limit yet.

> > I think that's actually pretty appealing. Separating the memory we
> > plan to use from the memory we're willing to use before spilling seems
> > like a good idea in general, and I think we should probably also do it
> > in other places - like sorts.
>
> Indeed. And then perhaps we could eventually add some reporting /
> monitoring infrastructure for the cases where plan time and execution
> time memory estimate/usage widely differs.

I wouldn't mind something like that in general - not just for hashagg,
but for various other nodes.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Allow CURRENT_ROLE in GRANTED BY
On 2020-Jun-24, Peter Eisentraut wrote:

> I was checking some loose ends in SQL conformance, when I noticed: We
> support GRANT role ... GRANTED BY CURRENT_USER, but we don't support
> CURRENT_ROLE in that place, even though in PostgreSQL they are
> equivalent. Here is a trivial patch to add that.

Hmm, since this adds to RoleSpec, this change makes every place that
uses that production also take CURRENT_ROLE, so we'd need to document
in all those places. For example, alter_role.sgml, create_schema.sgml,
etc.

This also affects role_list (but maybe the docs for those are already
vague enough -- eg. ALTER INDEX .. OWNED BY only says "role_name" with
no further explanation, even though it does take "current_user".)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
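The syntax the patch adds, with placeholder role names:

    GRANT some_role TO other_role GRANTED BY CURRENT_USER;  -- already accepted
    GRANT some_role TO other_role GRANTED BY CURRENT_ROLE;  -- accepted with the proposed patch;
                                                            -- equivalent in PostgreSQL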
Strange behavior with polygon and NaN
Hi hackers,

While working with Chris Hajas on merging Postgres 12 with Greenplum
Database we stumbled upon the following strange behavior in the geometry
type polygon:

-- >8
CREATE TEMP TABLE foo (p point);
CREATE INDEX ON foo USING gist(p);
INSERT INTO foo VALUES ('0,0'), ('1,1'), ('NaN,NaN');
SELECT $q$ SELECT * FROM foo WHERE p <@ polygon '(0,0), (0, 100), (100, 100), (100, 0)' $q$ AS qry \gset
BEGIN;
SAVEPOINT yolo;
SET LOCAL enable_seqscan TO off;
:qry;
ROLLBACK TO SAVEPOINT yolo;
SET LOCAL enable_indexscan TO off;
SET LOCAL enable_bitmapscan TO off;
:qry;
-- 8<

If you run the above repro SQL in HEAD (and 12, and likely all older
versions), you get the following output:

CREATE TABLE
CREATE INDEX
INSERT 0 3
BEGIN
SAVEPOINT
SET
     p
-----------
 (0,0)
 (1,1)
(2 rows)

ROLLBACK
SET
SET
     p
-----------
 (0,0)
 (1,1)
 (NaN,NaN)
(3 rows)

At first glance, you'd think this is the gist AM's bad, but on second
thought, something else is strange here. The following query returns
true:

SELECT point '(NaN, NaN)' <@ polygon '(0,0), (0, 100), (100, 100), (100, 0)'

The above behavior of the "contained in" operator is surprising, and
it's probably not what the GiST AM is expecting. I took a look at
point_inside() in geo_ops.c, and it doesn't seem well equipped to handle
NaN. Similarly ill-equipped is dist_ppoly_internal() which underlies the
distance operator for polygon. It gives the following interesting
output:

SELECT *, c <-> polygon '(0,0),(0,100),(100,100),(100,0)' as distance
FROM (
  SELECT circle(point(100 * i, 'NaN'), 50) AS c
  FROM generate_series(-2, 4) i
) t(c)
ORDER BY 2;

        c        | distance
-----------------+----------
 <(-200,NaN),50> |        0
 <(-100,NaN),50> |        0
 <(0,NaN),50>    |        0
 <(100,NaN),50>  |        0
 <(200,NaN),50>  |      NaN
 <(300,NaN),50>  |      NaN
 <(400,NaN),50>  |      NaN
(7 rows)

Should they all be NaN? Am I alone in thinking the index is right but
the operators are wrong? Or should we call the indexes wrong here?

Cheers,
Jesse and Chris
Re: should libpq also require TLSv1.2 by default?
> On 24 Jun 2020, at 19:57, Peter Eisentraut wrote:
>
> On 2020-06-24 10:33, Daniel Gustafsson wrote:
>>> In PG13, we raised the server-side default of ssl_min_protocol_version to
>>> TLSv1.2. We also added a connection setting named ssl_min_protocol_version
>>> to libpq. But AFAICT, the default value of the libpq setting is empty, so
>>> any protocol version will be accepted. Is this what we wanted? Should we
>>> raise the default in libpq as well?
>>
>> This was discussed [0] when the connection settings were introduced, and the
>> concensus was to leave them alone [1] to allow for example a new pg_dump to
>> work against an old server. Re-reading the thread I think the argument still
>> holds, but I was about to respond "yes, let's do this" before refreshing my
>> memory. Perhaps we should add a comment explaining this along the lines of
>> the attached?
>>
>> [0] https://www.postgresql.org/message-id/157800160408.1198.1714906047977693148.pgcf%40coridan.postgresql.org
>> [1] https://www.postgresql.org/message-id/31993.1578321474%40sss.pgh.pa.us
>
> ISTM that these discussions went through the same questions and arguments
> that were made regarding the server-side change but arrived at a different
> conclusion. So I suggest to reconsider this so that we don't ship with
> contradictory results.

I don't think anyone argues against safe defaults for communication
between upgraded clients and upgraded servers. That being said; out of
the box, an upgraded client *will* use TLSv1.2 when connecting to an
upgraded server due to the server's default requirements (assuming the
server hasn't been reconfigured with a lower TLS version, but since
we're talking defaults we have to assume that).

The problem comes when an updated client needs to talk to an old server
which hasn't been upgraded and which uses an older OpenSSL version. If
we set TLSv1.2 as the minimum client version, the user will have to
amend a connection string which used to work when talking to a server
which hasn't changed. If we don't raise the default, a user who wants
nothing lower than TLSv1.2 will have to amend the connection string.

> That doesn't necessarily mean that we have to make a change, but we should
> make sure our rationale is sound.

Totally agree. FWIW, I still vote for keeping it at 1.0 to not break
connections to old servers, since upgraded/new servers will enforce 1.2
anyway.

As mentioned elsewhere in the thread, maybe this is also something which
can be done more easily if we improve the error reporting? Right now
it's fairly cryptic IMO.

> Note that all OpenSSL versions that do not support TLSv1.2 also do not
> support TLSv1.1. So by saying, in effect, that TLSv1.2 is too new to
> require, we are saying that we need to keep supporting TLSv1.0 -- which is
> heavily deprecated. Also note that the first OpenSSL version with support
> for TLSv1.2 shipped on March 14, 2012.

Correct, this being the 1.0.1 release which is referred to.

cheers ./daniel
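To illustrate the trade-off in connection-string terms (host and database names are placeholders; ssl_min_protocol_version is the libpq parameter added in PostgreSQL 13):

    # a client that insists on TLSv1.2 today has to say so explicitly
    psql "host=db.example.com dbname=postgres sslmode=require ssl_min_protocol_version=TLSv1.2"

    # if the libpq default were raised instead, reaching an old TLSv1-only server
    # would require relaxing it explicitly
    psql "host=legacy.example.com dbname=postgres sslmode=require ssl_min_protocol_version=TLSv1"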
Re: Why forbid "INSERT INTO t () VALUES ();"
Tom Lane writes: > Fabien COELHO writes: INSERT INTO t() VALUES (); > >> I'm still unclear why it would be forbidden though, it seems logical to >> try that, whereas the working one is quite away from the usual syntax. > > It's forbidden because the SQL standard forbids it. > > We allow zero-column syntaxes in some other places where SQL forbids > them, but that's only because there is no reasonable alternative. > In this case, there's a perfectly good, standards-compliant alternative. > So why encourage people to write unportable code? FWIW, MySQL (and MariaDB) only support INSERT INTO t () VALUES (), not DEFAULT VALUES. We have added syntax for MySQL compatibility in the past, e.g. the CONCAT() function. - ilmari -- "A disappointingly low fraction of the human race is, at any given time, on fire." - Stig Sandbeck Mathisen
Re: Why forbid "INSERT INTO t () VALUES ();"
On Wed, Jun 24, 2020 at 3:31 PM Dagfinn Ilmari Mannsåker wrote: > FWIW, MySQL (and MariaDB) only support INSERT INTO t () VALUES (), not > DEFAULT VALUES. We have added syntax for MySQL compatibility in the > past, e.g. the CONCAT() function. > I don't see the similarities. IIUC there isn't a standard mandated function that provides the behavior that the concat function does. There is an operator but the treatment of NULL is different. So for concat we decided to add a custom function modelled on another DB's custom function. Adding custom syntax here when an identically behaving standard syntax already exists has considerably less going for it. I would say that accepting the compatibility hit while being the ones that are standard-compliant is in line with project values. David J.
Re: Default setting for enable_hashagg_disk
On Wed, Jun 24, 2020 at 11:02:10PM +0200, Tomas Vondra wrote: > > Indeed. And then perhaps we could eventually add some reporting / > > monitoring infrastructure for the cases where plan time and execution > > time memory estimate/usage widely differs. > > > > I wouldn't mind something like that in general - not just for hashagg, > but for various other nodes. Well, other than worrying about problems with pre-13 queries, how is this different from any other spill to disk when we exceed work_mem, like sorts for merge join. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: Default setting for enable_hashagg_disk
On Wed, Jun 24, 2020 at 12:19:00PM -0700, Andres Freund wrote: > Hi, > > On 2020-06-24 13:12:03 -0400, Bruce Momjian wrote: > > Well, my point is that merge join works that way, and no one has needed > > a knob to avoid mergejoin if it is going to spill to disk. If they are > > adjusting work_mem to prevent spill of merge join, they can do the same > > for hash agg. We just need to document this in the release notes. > > I don't think this is comparable. For starters, the IO indirectly > triggered by mergejoin actually leads to plenty people just straight out > disabling it. For lots of workloads there's never a valid reason to use > a mergejoin (and often the planner will never choose one). Secondly, the > planner has better information about estimating the memory usage for the > to-be-sorted data than it has about the size of the transition > values. And lastly, there's a difference between a long existing cause > for bad IO behaviour and one that's suddenly kicks in after a major > version upgrade, to which there's no escape hatch (it's rarely realistic > to disable hash aggs, in contrast to merge joins). Well, this sounds like an issue of degree, rather than kind. It sure sounds like "ignore work_mem for this join type, but not the other". -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: Default setting for enable_hashagg_disk
On Wed, Jun 24, 2020 at 07:18:10PM -0400, Bruce Momjian wrote: > On Wed, Jun 24, 2020 at 12:19:00PM -0700, Andres Freund wrote: > > Hi, > > > > On 2020-06-24 13:12:03 -0400, Bruce Momjian wrote: > > > Well, my point is that merge join works that way, and no one has needed > > > a knob to avoid mergejoin if it is going to spill to disk. If they are > > > adjusting work_mem to prevent spill of merge join, they can do the same > > > for hash agg. We just need to document this in the release notes. > > > > I don't think this is comparable. For starters, the IO indirectly > > triggered by mergejoin actually leads to plenty people just straight out > > disabling it. For lots of workloads there's never a valid reason to use > > a mergejoin (and often the planner will never choose one). Secondly, the > > planner has better information about estimating the memory usage for the > > to-be-sorted data than it has about the size of the transition > > values. And lastly, there's a difference between a long existing cause > > for bad IO behaviour and one that's suddenly kicks in after a major > > version upgrade, to which there's no escape hatch (it's rarely realistic > > to disable hash aggs, in contrast to merge joins). > > Well, this sounds like an issue of degree, rather than kind. It sure > sounds like "ignore work_mem for this join type, but not the other". I think my main point is that work_mem was not being honored for hash-agg before, but now that PG 13 can do it, we are again allowing work_mem not to apply in certain cases. I am wondering if our hard limit for work_mem is the issue, and we should make that more flexible for all uses. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
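For reference, work_mem is already adjustable per session, per role, or per function, though within a query it remains a hard per-node budget rather than a per-query one; for example (role name is a placeholder):

    SET work_mem = '256MB';
    ALTER ROLE reporting_user SET work_mem = '512MB';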
Re: hashagg slowdown due to spill changes
On Tue, Jun 23, 2020 at 10:06 AM Andres Freund wrote: > Hi, > > On 2020-06-23 09:23:57 -0700, Melanie Plageman wrote: > > On Mon, Jun 22, 2020 at 9:02 PM Andres Freund > wrote: > > > It's not this patch's fault, but none, really none, of this stuff > should > > > be in the executor. > > > > > > > > Were you thinking it could be done in grouping_planner() and then the > > bitmaps could be saved in the PlannedStmt? > > Or would you have to wait until query_planner()? Or are you imagining > > somewhere else entirely? > > I haven't thought about it in too much detail, but I would say > create_agg_plan() et al. I guess there's some argument to be made to do > it in setrefs.c, because we already do convert_combining_aggrefs() there > (but I don't like that much). > > There's no reason to do it before we actually decided on one specific > path, so doing it earlier than create_plan() seems unnecessary. And > having it in agg specific code seems better than putting it into global > routines. > > There's probably an argument for having a bit more shared code between > create_agg_plan(), create_group_plan() and > create_groupingsets_plan(). But even just adding a new extract_*_cols() > call to each of those would probably be ok. > > So, my summary of this point in the context of the other discussion upthread is: Planner should extract the columns that hashagg will need later during planning. Planner should not have HashAgg/MixedAgg nodes request smaller targetlists from their children with CP_SMALL_TLIST to avoid unneeded projection overhead. Also, even this extraction should only be done when the number of groups is large enough to suspect a spill. So, I wrote a patch that extracts the columns the same way as in ExecInitAgg but in create_agg_plan() and it doesn't work because we haven't called set_plan_references(). Then, I wrote a patch that does this in set_upper_references(), and it seems to work. I've attached that one. It is basically Jeff's patch (based somewhat on my patch) which extracts the columns in ExecInitAgg but I moved the functions over to setrefs.c and gave them a different name. It's not very elegant. I shoved it in at the end of set_upper_references(), but I think there should be a nice way to do it while setting the references for each var instead of walking over the nodes again. Also, I think that the bitmapsets for the colnos should maybe be put somewhere less prominent (than in the main Agg plan node?), since they are only used in one small place. I tried putting both bitmaps in an array of two bitmaps in the Agg node (since there will always be two) to make it look a bit neater, but it was pretty confusing and error prone to remember which one was aggregated and which one was unaggregated. Note that I didn't do anything with costing like only extracting the columns if there are a lot of groups. Also, I didn't revert the CP_SMALL_TLIST change in create_agg_plan() or create_groupingsets_plan(). Not to stir the pot, but I did notice that hashjoin uses CP_SMALL_TLIST in create_hashjoin_plan() for the inner side sub-tree and the outer side one if there are multiple batches. I wondered what was different about that vs hashagg (i.e. why it is okay to do that there). 
-- 
Melanie Plageman

From 0b40ad205cf8fc8a83b8c65f5f18502a51c47370 Mon Sep 17 00:00:00 2001
From: Melanie Plageman
Date: Wed, 24 Jun 2020 17:07:14 -0700
Subject: [PATCH v1] Move extracting columns for hashagg to planner

Find the columns the executor will need to know about for hashagg and
bifurcate them into aggregated and unaggregated column bitmapsets. Save
them in the plan tree for use during execution. This is done after
set_plan_refs() so all the references are correct.
---
 src/backend/executor/nodeAgg.c       | 117 +--
 src/backend/nodes/copyfuncs.c        |   2 +
 src/backend/nodes/outfuncs.c         |   2 +
 src/backend/nodes/readfuncs.c        |   2 +
 src/backend/optimizer/plan/setrefs.c |  69
 src/include/nodes/execnodes.h        |   8 +-
 src/include/nodes/plannodes.h        |   2 +
 7 files changed, 140 insertions(+), 62 deletions(-)

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index a20554ae65..2f340e4031 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -359,6 +359,14 @@ typedef struct HashAggBatch
 	int64		input_tuples;	/* number of tuples in this batch */
 } HashAggBatch;
 
+/* used to find referenced colnos */
+typedef struct FindColsContext
+{
+	bool		is_aggref;		/* is under an aggref */
+	Bitmapset  *aggregated;		/* column references under an aggref */
+	Bitmapset  *unaggregated;	/* other column references */
+} FindColsContext;
+
 static void select_current_set(AggState *aggstate, int setno, bool is_hash);
 static void initialize_phase(AggState *aggstate, int newphase);
 static TupleTableSlot *fetch_input_tuple(AggState *aggstate);
@@ -391,8 +399,6 @@ static void finalize_aggregates(AggSta
Re: [PATCH] Initial progress reporting for COPY command
On Tue, Jun 23, 2020 at 4:45 PM Tomas Vondra wrote:
>
> >> Anyway if you would like to make this view more user-friendly, I can add
> >> that. Just ping me.
> >
> > I felt we could add pg_size_pretty to make the view more user friendly.
>
> Please no. That'd make processing of the data (say, computing progress
> as processed/total) impossible. It's easy to add pg_size_pretty if you
> want it, it's impossible to undo it. I don't see a single pg_size_pretty
> call in system_views.sql.

I thought of including pg_size_pretty as there was no total_bytes to
compare with, but I'm ok without it too, as there is an option for the
user to always include it on the client side like "SELECT
pg_size_pretty(file_bytes_processed) from pg_stat_progress_copy;" if
required.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Re: should libpq also require TLSv1.2 by default?
On Thu, Jun 25, 2020 at 12:30:03AM +0200, Daniel Gustafsson wrote: > I don't think anyone argues against safe defaults for communication between > upgraded clients and upgraded servers. That being said; out of the box, an > upgraded client *will* use TLSv1.2 when connecting to a upgraded server due to > the server defaults requirements (assuming the server hasn't been reconfigured > with a lower TLS version, but since we're talking defaults we have to assume > that). My take here is to let things as they are for libpq. pg_dump is a very good argument, because we allow backward compatibility with a newer version of the tool, not upward compatibility. > The problem comes when an updated client needs to talk to an old server which > hasn't been upgraded and which use an older OpenSSL version. If we set > TLSv1.2 > as the minimum client version, the user will have to amend a connection string > which used to work when talking to a server which hasn't changed. If we don't > raise the default, a user to wants nothing lower than TLSv1.2 will have to > amend the connection string. Yeah, and I would not be surprised to see cases where people complain to us about that when attempting to reach one of their old boxes, breaking some stuff they have been relying on for years by forcing the addition of a tls_min_server_protocol in the connection string. It is a more important step that we enforce TLSv1.2 on the server side actually, and libpq just follows up automatically with that. > As mentioned elsewhere in the thread, maybe this is also something which can > be > done more easily if we improve the error reporting? Right now it's fairly > cryptic IMO. This part may be tricky to get right I think, because the error comes directly from OpenSSL when negotiating the protocol used between the client and the server, like "no protocols available" or such. -- Michael signature.asc Description: PGP signature
Keep elog(ERROR) and ereport(ERROR) calls in the cold path
Back in [1] I experimented with a patch to coax compilers to build all
elog/ereport calls that were >= ERROR into a cold path away from the
function raising the error. At the time, I really just wanted to test
how much of a speedup we could get by doing this and ended up just
writing up a patch that basically changed all elog(ERROR) calls from:

if (<condition>)
{
    <...>;
    elog(ERROR, "...");
}

to add an unlikely() and become:

if (unlikely(<condition>))
{
    <...>;
    elog(ERROR, "...");
}

Per the report in [1] I did see some pretty good gains in performance
from doing this. The problem was that, at the time, I couldn't figure
out a way to do this without an invasive patch that changed the code in
the thousands of elog/ereport calls.

In the attached, I've come up with a way that works. Basically I've just
added a function named errstart_cold() that is attributed with
__attribute__((cold)), which will hint to the compiler to keep branches
which call this function in a cold path. To make the compiler properly
mark just >= ERROR calls as cold, and only when the elevel is a
constant, I modified the ereport_domain macro to do:

    if (__builtin_constant_p(elevel) && (elevel) >= ERROR ? \
        errstart_cold(elevel, domain) : \
        errstart(elevel, domain)) \

I see no reason why the compiler shouldn't always fold that constant
expression at compile-time and select the correct version of the
function for the job. (I also tried using __builtin_choose_expr() but
GCC complained when the elevel was not constant, even with the
__builtin_constant_p() test in the condition)

I sampled a few .s files to inspect what code had changed. Looking at
mcxt.s, fmgr.s and xlog.s -- the first two because I found in [1] that
elogs were being done from those files quite often, and xlog.s because
it's pretty big. As far as I could see, GCC correctly moved all the
error reporting stuff where the elevel was a constant and >= ERROR into
the cold path and left the lower-level or non-const elevel calls alone.

For clang, I didn't see any changes in the .s files. I suspect that they
might have a few smarts in there and see the __builtin_unreachable()
call and assume the path is cold already based on that. That was with
clang 10. Perhaps older versions are not as smart.

Benchmarking:

For benchmarking, I've not done a huge amount to test the impacts of
this change. However, I can say that I am seeing some fairly good
improvements. There seem to be some small improvements to execution
speed using TPCH-Q1 and also some good results from a pgbench -S test.

For TPCH-Q1:

Master:
$ pgbench -n -f pg-tpch/queries/q01.sql -T 120 tpch
latency average = 5272.630 ms
latency average = 5258.610 ms
latency average = 5250.871 ms

Master + elog_ereport_attribute_cold.patch
$ pgbench -n -f pg-tpch/queries/q01.sql -T 120 tpch
latency average = 5182.761 ms
latency average = 5194.851 ms
latency average = 5183.128 ms

Which is about a 1.42% increase in performance. That's not exactly
groundbreaking, but pretty useful to have if that happens to apply
across the board for execution performance.

For pgbench -S:

My results were a bit noisier than the TPCH test, but the results I
obtained did show about a 10% increase in performance:

Master:
drowley@amd3990x:~$ pgbench -S -T 120 postgres
tps = 25245.903255 (excluding connections establishing)
tps = 26144.454208 (excluding connections establishing)
tps = 25931.850518 (excluding connections establishing)

Master + elog_ereport_attribute_cold.patch
drowley@amd3990x:~$ pgbench -S -T 120 postgres
tps = 28351.480631 (excluding connections establishing)
tps = 27763.656557 (excluding connections establishing)
tps = 28896.427929 (excluding connections establishing)

It would be useful if someone with some server-grade Intel hardware
could run a few tests on this. The above results are all from AMD
hardware.

I've attached the patch for this. I'll add it to the July 'fest.

David

[1] https://www.postgresql.org/message-id/CAKJS1f8yqRW3qx2CO9r4bqqvA2Vx68=3awbh8cjwtp9zxeo...@mail.gmail.com

elog_ereport_attribute_cold.patch
Description: Binary data
Re: [PATCH] COPY command's data format option allows only lowercase csv, text or binary
On Wed, Jun 24, 2020 at 12:55:22PM -0400, Tom Lane wrote: > Yeah, I'm sure there are a few inconsistencies. We previously made a > pass to get rid of pg_strcasecmp for anything that had been through > the parser's downcasing (commit fb8697b31) but I wouldn't be surprised > if that missed a few cases, or if new ones have snuck in. Anyway, > "don't use pg_strcasecmp unnecessarily" was definitely the agreed-to > policy as of Jan 2018. 0d8c9c1 has introduced some in parse_basebackup_options() for the new manifest option, and fe30e7e for AlterType(), no? > My vague recollection is that there are a few exceptions (defGetBoolean > may well be one of them) where pg_strcasecmp still seemed necessary > because the input might not have come through the parser in some usages. Yep, there were a couple of exceptions. What was done at this time was a case-by-case lookup to check what came only from the parser. -- Michael signature.asc Description: PGP signature
Re: Assertion failure in pg_copy_logical_replication_slot()
On 2020/06/24 23:58, Alvaro Herrera wrote: On 2020-Jun-24, Fujii Masao wrote: I think the errcode is a bit bogus considering the new case. IMO ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE is more appropriate. Agreed. So I updated the patch so this errcode is used instead. Patch attached. LGTM. Thanks! Pushed. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: should libpq also require TLSv1.2 by default?
Michael Paquier writes: > On Thu, Jun 25, 2020 at 12:30:03AM +0200, Daniel Gustafsson wrote: >> As mentioned elsewhere in the thread, maybe this is also something which can >> be >> done more easily if we improve the error reporting? Right now it's fairly >> cryptic IMO. > This part may be tricky to get right I think, because the error comes > directly from OpenSSL when negotiating the protocol used between the > client and the server, like "no protocols available" or such. Can we do something comparable to the backend's HINT protocol, where we add on a comment that's only mostly-likely to be right? regards, tom lane
Re: Review for GetWALAvailability()
On 2020/06/25 3:27, Alvaro Herrera wrote:
> Thanks for those corrections. I have pushed this. I think all problems
> Masao-san reported have been dealt with, so we're done here.

Sorry for my late reply here...
Thanks for committing the patch and improving the feature!

	/*
	 * Find the oldest extant segment file. We get 1 until checkpoint removes
	 * the first WAL segment file since startup, which causes the status being
	 * wrong under certain abnormal conditions but that doesn't actually harm.
	 */
	oldestSeg = XLogGetLastRemovedSegno() + 1;

I see the point of the above comment, but this can cause wal_status to be
changed from "lost" to "unreserved" after the server restart. Isn't this
really confusing? At least it seems better to document that behavior.

Or if we *can ensure* that the slot with invalidated_at set always means
"lost" slot, we can judge that wal_status is "lost" without using fragile
XLogGetLastRemovedSegno(). Thought?

Or XLogGetLastRemovedSegno() should be fixed so that it returns valid
value even after the restart?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
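The wal_status column under discussion is visible in the pg_replication_slots view (PostgreSQL 13), so the transition described above can be observed directly:

    SELECT slot_name, restart_lsn, wal_status
      FROM pg_replication_slots;
    -- wal_status is one of: reserved, extended, unreserved, lost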
Re: Review for GetWALAvailability()
On 2020-Jun-25, Fujii Masao wrote:

> 	/*
> 	 * Find the oldest extant segment file. We get 1 until checkpoint removes
> 	 * the first WAL segment file since startup, which causes the status being
> 	 * wrong under certain abnormal conditions but that doesn't actually harm.
> 	 */
> 	oldestSeg = XLogGetLastRemovedSegno() + 1;
>
> I see the point of the above comment, but this can cause wal_status to be
> changed from "lost" to "unreserved" after the server restart. Isn't this
> really confusing? At least it seems better to document that behavior.

Hmm.

> Or if we *can ensure* that the slot with invalidated_at set always means
> "lost" slot, we can judge that wal_status is "lost" without using fragile
> XLogGetLastRemovedSegno(). Thought?

Hmm, this sounds compelling -- I think it just means we need to ensure
we reset invalidated_at to zero if the slot's restart_lsn is set to a
correct position afterwards. I don't think we have any operation that
does that, so it should be safe -- hopefully I didn't overlook anything?
Neither copy nor advance seem to work with a slot that has invalid
restart_lsn.

> Or XLogGetLastRemovedSegno() should be fixed so that it returns valid
> value even after the restart?

This seems more work to implement.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: should libpq also require TLSv1.2 by default?
On Wed, Jun 24, 2020 at 10:50:39PM -0400, Tom Lane wrote: > Can we do something comparable to the backend's HINT protocol, where > we add on a comment that's only mostly-likely to be right? OpenSSL publishes its error codes as of openssl/sslerr.h, and it looks like the two error codes we would need to worry about are SSL_R_UNSUPPORTED_PROTOCOL and SSL_R_NO_PROTOCOLS_AVAILABLE. So we could for example amend open_client_SSL() when negotiating the SSL connection in libpq with error messages or hints that help better than the current state of things, but that also means an extra maintenance on our side to make sure that we keep in sync with new error codes coming from the OpenSSL world. -- Michael signature.asc Description: PGP signature
Re: Why forbid "INSERT INTO t () VALUES ();"
Hello Tom,

>>>> INSERT INTO t() VALUES ();

>> I'm still unclear why it would be forbidden though, it seems logical to
>> try that, whereas the working one is quite away from the usual syntax.
>
> It's forbidden because the SQL standard forbids it.

Ok, that is definitely a reason. I'm not sure it is a good reason, though.

Why would the standard forbid it? From the language design point of view,
it is basically having a syntax for lists which would not work for empty
lists, or a syntax for strings which would not work for empty strings.

It also means that if for some reason someone wants to insert several such
all-default rows, they have to repeat the insert, as "VALUES (), ();"
would not work, so it is also losing a corner-corner case capability
without obvious reason.

> We allow zero-column syntaxes in some other places where SQL forbids
> them,

Then forbidding it there just adds awkwardness: the same thing works in
one place but not in another. That does not help users.

> but that's only because there is no reasonable alternative.
> In this case, there's a perfectly good, standards-compliant alternative.
> So why encourage people to write unportable code?

I doubt that people look at the (costly) standard when writing corner case
queries, they just try something logical, as I did. As some other
databases accept it, and if it is already allowed elsewhere in pg,
encouraging portability is not the main issue here. I'd rather have logic
and uniformity across commands.

If I'm annoyed enough to send a patch some day, would you veto it because
it departs from the standard?

Anyway, thanks for the answer!

-- Fabien.
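For the multi-row case mentioned above, one workaround PostgreSQL does accept is writing DEFAULT for each column inside VALUES, as long as the table has at least one column (the column name c below is a placeholder):

    -- insert three all-default rows in one statement
    INSERT INTO t (c) VALUES (DEFAULT), (DEFAULT), (DEFAULT);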
Re: Review for GetWALAvailability()
On 2020/06/25 12:57, Alvaro Herrera wrote:
> On 2020-Jun-25, Fujii Masao wrote:
>
>> 	/*
>> 	 * Find the oldest extant segment file. We get 1 until checkpoint removes
>> 	 * the first WAL segment file since startup, which causes the status being
>> 	 * wrong under certain abnormal conditions but that doesn't actually harm.
>> 	 */
>> 	oldestSeg = XLogGetLastRemovedSegno() + 1;
>>
>> I see the point of the above comment, but this can cause wal_status to be
>> changed from "lost" to "unreserved" after the server restart. Isn't this
>> really confusing? At least it seems better to document that behavior.
>
> Hmm.
>
>> Or if we *can ensure* that the slot with invalidated_at set always means
>> "lost" slot, we can judge that wal_status is "lost" without using fragile
>> XLogGetLastRemovedSegno(). Thought?
>
> Hmm, this sounds compelling -- I think it just means we need to ensure
> we reset invalidated_at to zero if the slot's restart_lsn is set to a
> correct position afterwards.

Yes.

> I don't think we have any operation that does that, so it should be
> safe -- hopefully I didn't overlook anything?

We need to call ReplicationSlotMarkDirty() and ReplicationSlotSave()
just after setting invalidated_at and restart_lsn in
InvalidateObsoleteReplicationSlots()? Otherwise, restart_lsn can go back
to the previous value after the restart.

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index e8761f3a18..5584e5dd2c 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1229,6 +1229,13 @@ restart:
 		s->data.invalidated_at = s->data.restart_lsn;
 		s->data.restart_lsn = InvalidXLogRecPtr;
 		SpinLockRelease(&s->mutex);
+
+		/*
+		 * Save this invalidated slot to disk, to ensure that the slot
+		 * is still invalid even after the server restart.
+		 */
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
 		ReplicationSlotRelease();
 
 		/* if we did anything, start from scratch */

Maybe we don't need to do this if the slot is temporary?

> Neither copy nor advance seem to work with a slot that has invalid
> restart_lsn.
>
>> Or XLogGetLastRemovedSegno() should be fixed so that it returns valid
>> value even after the restart?
>
> This seems more work to implement.

Yes.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Re: archive status ".ready" files may be created too early
Hello, Matsumura-san.

I agree that the WAL writer is not the place to notify segments. And the
direction you suggested would work.

At Fri, 19 Jun 2020 10:18:34 +, "matsumura@fujitsu.com" wrote in
> 1. Description in primary side
>
> [Basic problem]
> A process flushing WAL record doesn't know whether the flushed RecPtr is
> EndRecPtr of cross-segment-boundary WAL record or not because only process
> inserting the WAL record knows it and it never memorizes the information to
> anywhere.
>
> [Basic concept of the patch in primary]
> A process inserting a cross-segment-boundary WAL record memorizes its EndRecPtr
> (I call it CrossBoundaryEndRecPtr) to a new structure in XLogCtl.
> A flushing process creates .ready (Later, I call it just 'notify'.) against
> a segment that is previous one including CrossBoundaryEndRecPtr only when its
> flushed RecPtr is equal or greater than the CrossBoundaryEndRecPtr.
...
> See also the definition of XLogCtl, XLOGShmemSize(), and XLOGShmemInit() in
> my patch.

I think we don't need most of that shmem stuff. XLogWrite is called
after the WAL buffer is filled up to the requested position. So when it
crosses a segment boundary we know that all past cross-segment-boundary
records are stable. That means all we need to remember is only the
position of the latest cross-boundary record.

> * Action of inserting process
> A inserting process memorie its CrossBoundaryEndRecPtr to CrossBoundaryEndRecPtr
> array element calculated by XLogRecPtrToBufIdx with its CrossBoundaryEndRecPtr.
> If the WAL record crosses many segments, only element against last segment
> including the EndRecPtr is set and others are not set.
> See also CopyXLogRecordToWAL() in my patch.

If we call XLogMarkEndRecPtrIfNeeded() there, the function is called
every time a record is written, most of which is wasteful.
XLogInsertRecord already has a code block executed only at every page
boundary.

> * Action of flushing process
> Overview has been already written as the follwing.
> A flushing process creates .ready (Later, I call it just 'notify'.) against
> a segment that is previous one including CrossBoundaryEndRecPtr only when its
> flushed RecPtr is equal or greater than the CrossBoundaryEndRecPtr.
>
> An additional detail is as the following. The flushing process may notify
> many segments if the record crosses many segments, so the process memorizes
> latest notified segment number to latestArchiveNotifiedSegNo in XLogCtl.
> The process notifies from latestArchiveNotifiedSegNo + 1 to
> flushing segment number - 1.
>
> And latestArchiveNotifiedSegNo is set to EndOfLog after Startup process exits
> replay-loop. Standby also set same timing (= before promoting).
>
> Mutual exlusion about latestArchiveNotifiedSegNo is not required because
> the timing of accessing has been already included in WALWriteLock critical
> section.

Looks reasonable.

> 2. Description in standby side
>
> * Who notifies?
> walreceiver also doesn't know whether the flushed RecPtr is EndRecPtr of
> cross-segment-boundary WAL record or not. In standby, only Startup process
> knows the information because it is hidden in WAL record itself and only
> Startup process reads and builds WAL record.

A standby doesn't write its own WAL records. Even if the primary sent an
incomplete record at a segment boundary, the standby would just promote
to a new TLI and start its own history. Nothing breaks. However, it
could be a problem if a standby that crashed in the problematic way were
started as-is as a primary, but such a scenario is out of our scope.
Now we can identify the stable portion of the WAL stream. It's enough to
prevent walsender from sending data that can be overwritten afterwards.
GetReplicationTargetRecPtr() in the attached does that.

> * Action of Statup process
> Therefore, I implemented that walreceiver never notify and Startup process
> does it.
> In detail, when Startup process reads one full-length WAL record, it notifies
> from a segment that includes head(ReadRecPtr) of the record to a previous
> segment that includes EndRecPtr of the record.

I don't like that archiving on standby relies on replay progress. We
should avoid that, and fortunately I think we don't need it.

> Now, we must pay attention about switching time line.
> The last segment of previous TimeLineID must be notified before switching.
> This case is considered when RM_XLOG_ID is detected.

That segment is archived later, after being renamed with a ".partial"
suffix. We don't archive the last incomplete segment of the previous
timeline as-is.

> 3. About other notifying for segment
> Two notifyings for segment are remain. They are not needed to fix.
>
> (1) Notifying for partial segment
> It is not needed to be completed, so it's OK to notify without special
> consideration.
>
> (2) Re-notifying
> Currently, Checkpointer has notified through XLogArchiveCheckDone().
> It is a sa
Re: some more pg_dump refactoring
Hallo Peter,

My 0.02 €:

Patch applies cleanly, compiles, make check and pg_dump tap tests ok.

The refactoring is a definite improvement.

You changed the query strings to use "\n" instead of " ". I would not
have changed that, because it departs from the style around, and I do
not think it improves readability at the C code level.

I tried to check manually and randomly that the same query is built for
the same version, although my check may have been partial, especially on
the aggregate query which does not comment about what is changed between
versions, and my eyes are not very good at diffing.

I've noticed that some attributes are given systematic replacements
(eg proparallel), removing the need to test for presence afterwards.
This looks fine to me. However, on version < 8.4, ISTM that funcargs and
funciargs are always added: is this intentional?

Would it make sense to accumulate in the other direction, older to
newer, so that new attributes are added at the end of the select?

Should array_to_string be pg_catalog.array_to_string? All other calls
seem to have an explicit schema.

I'm fine with inlining most PQfnumber calls.

I do not have old versions at hand for testing.

> Here is a patch to reorganize dumpFunc() and dumpAgg() in pg_dump,
> similar to daa9fe8a5264a3f192efa5ddee8fb011ad9da365. Instead of
> repeating the almost same large query in each version branch, use one
> query and add a few columns to the SELECT list depending on the
> version. This saves a lot of duplication.
>
> I have tested this with various old versions of PostgreSQL I had
> available, but a bit more random testing with old versions would be
> welcome.

-- Fabien.