RE: PostgreSQL and big data - FDW
Hi Stephen,

My company, EDF, is very interested in this feature (the Kerberos authentication method with hdfs_fdw). Is it possible to know how many days of development this represents? Who could develop this implementation? At what cost?

Best Regards
Didier ROS
EDF

-----Original Message-----
From: sfr...@snowman.net [mailto:sfr...@snowman.net]
Sent: Wednesday, June 24, 2020 18:53
To: Bruce Momjian
Cc: ROS Didier ; pgsql-hackers@lists.postgresql.org
Subject: Re: PostgreSQL and big data - FDW

Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> On Wed, Jun 24, 2020 at 09:05:30AM +, ROS Didier wrote:
> > I would like to use a Foreign Data Wrapper (FDW) to connect to a
> > HADOOP cluster which uses KERBEROS authentication.

Sadly, not really.

> > is it possible to achieve this ? which FDW should be used ?
>
> Well, I would use the Hadoop FDW:
>
> https://github.com/EnterpriseDB/hdfs_fdw
>
> and it only supports these authentication methods:
>
>     Authentication Support
>
>     The FDW supports NOSASL and LDAP authentication modes. In order to use
>     NOSASL do not specify any OPTIONS while creating user mapping. For LDAP
>     username and password must be specified in OPTIONS while creating user
>     mapping.
>
> Not every FDW supports every Postgres server authentication method.

That isn't really the issue here; the problem is that the GSSAPI support in PG today doesn't support credential delegation. If it did, then the HDFS FDW (and the postgres FDW) could easily be extended to leverage those delegated credentials to connect. That's been on my personal todo list of things to work on, but unfortunately I've not yet had time to implement it. I don't actually think it would be very hard; if someone writes it, I'd definitely review it.

Thanks,

Stephen
Re: Ought to use heap_multi_insert() for pg_attribute/depend insertions?
> On 4 Mar 2020, at 23:16, Daniel Gustafsson wrote:
>
> The v9 patch attached addresses, I believe, all comments made to date. I tried
> to read through earlier reviews too to ensure I hadn't missed anything, so I
> hope I've covered all findings so far.

Attached is a rebased version, updated to handle the changes for operator class parameters introduced in 911e70207703799605.

cheers ./daniel

catalog_multi_insert-v10.patch
Description: Binary data
Missing some ifndef FRONTEND at the top of logging.c and file_utils.c
Hi all,

As the subject says, we have in src/common/ four files that are only compiled as part of the frontend: fe_memutils.c, file_utils.c, logging.c and restricted_token.c. Two of them are missing the following, to make sure that we never try to compile them with the backend:

+#ifndef FRONTEND
+#error "This file is not expected to be compiled for backend code"
+#endif

So, shouldn't that stuff be added as per the attached?

Thanks.
--
Michael

diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 7584c1f2fb..a2faafdf13 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -12,6 +12,11 @@
  *
  *-
  */
+
+#ifndef FRONTEND
+#error "This file is not expected to be compiled for backend code"
+#endif
+
 #include "postgres_fe.h"

 #include

diff --git a/src/common/logging.c b/src/common/logging.c
index f3fc0b8262..6a3a437a34 100644
--- a/src/common/logging.c
+++ b/src/common/logging.c
@@ -7,6 +7,11 @@
  *
  *-
  */
+
+#ifndef FRONTEND
+#error "This file is not expected to be compiled for backend code"
+#endif
+
 #include "postgres_fe.h"

 #include

signature.asc
Description: PGP signature
Re: Review for GetWALAvailability()
At Thu, 25 Jun 2020 14:35:34 +0900, Fujii Masao wrote in
> On 2020/06/25 12:57, Alvaro Herrera wrote:
> > On 2020-Jun-25, Fujii Masao wrote:
> >
> >> /*
> >>  * Find the oldest extant segment file. We get 1 until checkpoint removes
> >>  * the first WAL segment file since startup, which causes the status being
> >>  * wrong under certain abnormal conditions but that doesn't actually harm.
> >>  */
> >> oldestSeg = XLogGetLastRemovedSegno() + 1;
> >>
> >> I see the point of the above comment, but this can cause wal_status to be
> >> changed from "lost" to "unreserved" after the server restart. Isn't this
> >> really confusing? At least it seems better to document that behavior.
> >
> > Hmm.
> >
> >> Or if we *can ensure* that the slot with invalidated_at set always means
> >> "lost" slot, we can judge that wal_status is "lost" without using fragile
> >> XLogGetLastRemovedSegno(). Thought?
> >
> > Hmm, this sounds compelling -- I think it just means we need to ensure
> > we reset invalidated_at to zero if the slot's restart_lsn is set to a
> > correct position afterwards.
>
> Yes.

It is an error-prone restriction, as discussed before. Without changing the updater side, invalid restart_lsn AND valid invalidated_at can be regarded as the lost state. With the following change suggested by Fujii-san we can avoid the confusing status. With the attached first patch on top of the slot-dirtying fix below, we get "lost" for invalidated slots after a restart.

> > I don't think we have any operation that does that, so it should be
> > safe -- hopefully I didn't overlook anything?
>
> We need to call ReplicationSlotMarkDirty() and ReplicationSlotSave()
> just after setting invalidated_at and restart_lsn in
> InvalidateObsoleteReplicationSlots()?
> Otherwise, restart_lsn can go back to the previous value after the
> restart.
> diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
> index e8761f3a18..5584e5dd2c 100644
> --- a/src/backend/replication/slot.c
> +++ b/src/backend/replication/slot.c
> @@ -1229,6 +1229,13 @@ restart:
>  s->data.invalidated_at = s->data.restart_lsn;
>  s->data.restart_lsn = InvalidXLogRecPtr;
>  SpinLockRelease(&s->mutex);
> +
> + /*
> +  * Save this invalidated slot to disk, to ensure that the slot
> +  * is still invalid even after the server restart.
> +  */
> + ReplicationSlotMarkDirty();
> + ReplicationSlotSave();
>  ReplicationSlotRelease();
>  /* if we did anything, start from scratch */
>
> Maybe we don't need to do this if the slot is temporary?

The only difference of temporary slots from persistent ones seems to be the "persistency" attribute. Actually, create_physical_replication_slot() does the above for temporary slots.

> > Neither copy nor advance seem to work with a slot that has invalid
> > restart_lsn.
> >
> >> Or XLogGetLastRemovedSegno() should be fixed so that it returns valid
> >> value even after the restart?
> >
> > This seems more work to implement.
>
> Yes.

The confusing status can be avoided without fixing it, but I prefer to fix it. As Fujii-san suggested upthread, couldn't we remember lastRemovedSegNo in the control file? (Yeah, it causes a bump of PG_CONTROL_VERSION and CATALOG_VERSION_NO?)
--
Kyotaro Horiguchi
NTT Open Source Software Center

>From 7c8803f2bd7267d166f8a6f1a4ca5b129aa5ae96 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi
Date: Thu, 25 Jun 2020 17:03:24 +0900
Subject: [PATCH 1/2] Make slot invalidation persistent

---
 src/backend/replication/slot.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index e8761f3a18..26f874e3cb 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1229,7 +1229,15 @@ restart:
 	s->data.invalidated_at = s->data.restart_lsn;
 	s->data.restart_lsn = InvalidXLogRecPtr;
 	SpinLockRelease(&s->mutex);
+
+	/*
+	 * Save this invalidated slot to disk, to ensure that the slot
+	 * is still invalid even after the server restart.
+	 */
+	ReplicationSlotMarkDirty();
+	ReplicationSlotSave();
 	ReplicationSlotRelease();
+
 	/* if we did anything, start from scratch */
 	CHECK_FOR_INTERRUPTS();
--
2.18.4

>From 0792c41c915b87474958689a2acd5c214b393015 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi
Date: Thu, 25 Jun 2020 17:04:01 +0900
Subject: [PATCH 2/2] Show correct value in pg_replication_slots.wal_status
 after restart

The column may show a bogus status until a checkpoint removes at least
one WAL segment. This patch changes the logic to detect the state so
that the column shows the correct status after restart.

---
 src/backend/replication/slotfuncs.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/src/backend/replication/slotfuncs.c b/src/backend/
Re: Why forbid "INSERT INTO t () VALUES ();"
On 6/25/20 6:56 AM, Fabien COELHO wrote:
>
> Hello Tom,
>
>>>> INSERT INTO t() VALUES ();
>>
>>> I'm still unclear why it would be forbidden though, it seems logical to
>>> try that, whereas the working one is quite away from the usual syntax.
>>
>> It's forbidden because the SQL standard forbids it.
>
> Ok, that is definitely a reason. I'm not sure it is a good reason, though.

It's a very good reason. It might not be good *enough*, but it is a good reason.

> Why would the standard forbid it? From the language design point of
> view[...]

Don't go there. There is nothing but pain there.
--
Vik Fearing
Re: Backpatch b61d161c14 (Introduce vacuum errcontext ...)
On Tue, Jun 23, 2020 at 11:49 PM Andres Freund wrote:
>
> Hi,
>
> On 2020-06-23 00:14:40 -0400, Tom Lane wrote:
> > Andres Freund writes:
> > > I am only suggesting that where you save the old location, as currently
> > > done with LVRelStats olderrinfo, you instead use a more specific
> > > type. Not that you should pass that anywhere (except for
> > > update_vacuum_error_info).
> >
> > As things currently stand, I don't think we need another struct
> > type at all. ISTM we should hard-wire the handling of indname
> > as I suggested above. Then there are only two fields to be dealt
> > with, and we could just as well save them in simple local variables.
>
> That's fine with me too.

I have looked at both the patches (using separate variables, by Justin, and using a struct, by Andres) and found the second one a bit better.

> > If there's a clear future path to needing to save/restore more
> > fields, then maybe another struct type would be useful ... but
> > right now the struct type declaration itself would take more
> > lines of code than it would save.
>
> FWIW, I started to be annoyed by this code when I was adding
> prefetching support to vacuum, and wanted to change what's tracked
> where. The number of places that needed to be touched was
> disproportionate.
>
> Here's a *draft* for how I thought this roughly could look like. I think
> it's nicer to not specify the exact saved state in multiple places, and
> I think it's much clearer to use a separate function to restore the
> state than to set a "fresh" state.

I think this is a good idea and makes the code look better. I think it is better to name the new struct LVSavedErrInfo instead of LVSavedPos.

> I've only applied a hacky fix for the way the indname is tracked, I
> thought that'd best be discussed separately.

I think it is better to use Tom's idea here to save and restore the index information in place.

I have used Justin's patch with some improvements, like adding Asserts and initializing indname with NULL while restoring, to make things unambiguous. I have improved some comments in the code and, for now, kept this as two patches: (a) one to improve the error info for indexes (mostly Justin's patch based on Tom's idea) and (b) another to generally improve the code in this area (mostly Andres's patch). I have done some testing with both the patches and would like to do more unless there are objections to these.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

improve_vac_err_cb_v1.patch
Description: Binary data

improve_vac_err_cb_v2.patch
Description: Binary data
Re: hashagg slowdown due to spill changes
On Wed, Jun 24, 2020 at 05:26:07PM -0700, Melanie Plageman wrote:
> On Tue, Jun 23, 2020 at 10:06 AM Andres Freund wrote:
> >
> > Hi,
> >
> > On 2020-06-23 09:23:57 -0700, Melanie Plageman wrote:
> > > On Mon, Jun 22, 2020 at 9:02 PM Andres Freund wrote:
> > > > It's not this patch's fault, but none, really none, of this stuff
> > > > should be in the executor.
> > >
> > > Were you thinking it could be done in grouping_planner() and then the
> > > bitmaps could be saved in the PlannedStmt?
> > > Or would you have to wait until query_planner()? Or are you imagining
> > > somewhere else entirely?
> >
> > I haven't thought about it in too much detail, but I would say
> > create_agg_plan() et al. I guess there's some argument to be made to do
> > it in setrefs.c, because we already do convert_combining_aggrefs() there
> > (but I don't like that much).
> >
> > There's no reason to do it before we actually decided on one specific
> > path, so doing it earlier than create_plan() seems unnecessary. And
> > having it in agg specific code seems better than putting it into global
> > routines.
> >
> > There's probably an argument for having a bit more shared code between
> > create_agg_plan(), create_group_plan() and create_groupingsets_plan().
> > But even just adding a new extract_*_cols() call to each of those would
> > probably be ok.
>
> So, my summary of this point in the context of the other discussion
> upthread is:
>
> The planner should extract the columns that hashagg will need later
> during planning. The planner should not have HashAgg/MixedAgg nodes
> request smaller targetlists from their children with CP_SMALL_TLIST to
> avoid unneeded projection overhead. Also, even this extraction should
> only be done when the number of groups is large enough to suspect a
> spill.

IMO we should extract the columns irrespective of the estimates, otherwise we won't be able to handle underestimates efficiently.

> Not to stir the pot, but I did notice that hashjoin uses CP_SMALL_TLIST
> in create_hashjoin_plan() for the inner side sub-tree and the outer side
> one if there are multiple batches. I wondered what was different about
> that vs hashagg (i.e. why it is okay to do that there).

Yeah. That means that if we have to start batching during execution, we may need to spill much more data. I'd say that's a hashjoin issue that we should fix too (in v14).

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Open Item: Should non-text EXPLAIN always show properties?
Over on [1] Justin mentions that the non-text EXPLAIN ANALYZE should always show the "Disk Usage" and "HashAgg Batches" properties. I agree with this. show_wal_usage() is a good example of how we normally do things. We try to keep the text format as humanly readable as possible but don't really expect humans to be commonly reading the other supported formats, so we care less about including additional details there. There's also an open item regarding this for Incremental Sort, so I've CC'd James and Tomas here. This seems like a good place to discuss both. I've attached a small patch that changes the Hash Aggregate behaviour to always show these properties for non-text formats. Does anyone object to this? David [1] https://www.postgresql.org/message-id/20200619040624.GA17995%40telsasoft.com more_hash_agg_explain_fixes.patch Description: Binary data
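The rule being proposed can be modeled outside explain.c. Here is a purely illustrative sketch (hypothetical function, not the patch's C code): text format only emits the spill-related properties when a spill actually happened, while structured formats always emit them so machine consumers see a stable set of keys.

```python
# Illustrative model (not PostgreSQL source) of the proposed behaviour for
# Hash Aggregate's spill properties: terse TEXT output for humans, complete
# property sets for the structured formats.

def explain_properties(fmt, disk_usage_kb, hashagg_batches):
    """Return the properties to emit for a Hash Aggregate node."""
    props = {}
    if fmt == "text":
        # Text format: show the properties only when spilling happened.
        if disk_usage_kb > 0:
            props["Disk Usage"] = disk_usage_kb
            props["HashAgg Batches"] = hashagg_batches
    else:
        # Non-text formats (json/xml/yaml): always show them, even when zero.
        props["Disk Usage"] = disk_usage_kb
        props["HashAgg Batches"] = hashagg_batches
    return props
```

The point of the always-emit branch is that programs parsing JSON output can rely on the keys existing instead of special-casing their absence.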
Re: Missing some ifndef FRONTEND at the top of logging.c and file_utils.c
> On 25 Jun 2020, at 10:07, Michael Paquier wrote:
>
> So, shouldn't that stuff be added as per the attached?

That makes sense; logging.c and file_utils.c are indeed only part of libpgcommon.a and should only be compiled for frontend use.

cheers ./daniel
Re: Why forbid "INSERT INTO t () VALUES ();"
On Thu, 25 Jun 2020 at 16:56, Fabien COELHO wrote:
> It also means that if for some reason someone wants to insert several such
> all-default rows, they have to repeat the insert, as "VALUES (), ();"
> would not work, so it is also losing a corner-corner case capability
> without obvious reason.

This is not a vote in either direction, but I just wanted to mention that during 7e413a0f8, where multi-row inserts were added to pg_dump, a special case had to be added to support tables with no columns. We cannot do multi-inserts for those, so we are forced to fall back on one row per INSERT.

However, even if we had this syntax, I imagine it would be unlikely we'd change pg_dump to use it, since we want to be as standard-compliant as possible when dumping INSERTs: the only genuine use case for that appears to be importing the data into some other relational database.

David
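The pg_dump dilemma David describes can be sketched in miniature (a hypothetical helper, not pg_dump's actual C code, and assuming the standard `INSERT ... DEFAULT VALUES` form as the per-row fallback): a multi-row INSERT needs a VALUES list per row, which cannot be written for a zero-column table.

```python
# Hypothetical sketch of the special case described above: tables with no
# columns cannot use multi-row "VALUES (...), (...)" syntax, so each row is
# dumped as its own standard-conforming DEFAULT VALUES statement.

def dump_inserts(table, columns, nrows, rows_per_insert=100):
    stmts = []
    if not columns:
        # No columns: "INSERT INTO t () VALUES (), ()" is not legal SQL,
        # so emit one DEFAULT VALUES statement per row.
        stmts = [f"INSERT INTO {table} DEFAULT VALUES;"] * nrows
    else:
        collist = ", ".join(columns)
        for start in range(0, nrows, rows_per_insert):
            n = min(rows_per_insert, nrows - start)
            values = ", ".join(["(...)"] * n)  # placeholder row data
            stmts.append(f"INSERT INTO {table} ({collist}) VALUES {values};")
    return stmts
```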
Re: Online checksums patch - once again
> On 22 Jun 2020, at 18:29, Robert Haas wrote:
>
> On Mon, Jun 22, 2020 at 8:27 AM Daniel Gustafsson wrote:
> > Restartability is implemented by keeping state in pg_class. I opted for a
> > bool which is cleared as the first step of checksum enable, since it
> > offers fewer synchronization cornercases, I think.
>
> Unless you take AccessExclusiveLock on the table, this probably needs
> to be three-valued. Or maybe I am misunderstanding the design...

Sorry for being a bit thick; can you elaborate on which case you're thinking about? CREATE TABLE sets the attribute according to the value of data_checksums, and before enabling checksums (and before changing data_checksums to inprogress) the bgworker will update all relhaschecksums from true (if any) to false. Once the state is set to inprogress, all new relations will set relhaschecksums to true.

The attached v19 fixes a few doc issues I had missed.

cheers ./daniel

online_checksums19.patch
Description: Binary data
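To reason about the sequence Daniel describes, here is a toy model (purely illustrative; the names mirror the description in this thread, not the patch's code) of the transitions: clear leftover relhaschecksums flags, flip data_checksums to "inprogress", checksum each relation, then flip to "on"; relations created while "inprogress" or "on" start with the flag set.

```python
# Toy model of the restartable checksum-enable sequence described above.
# Not the patch's code; just the state transitions as described.

class Cluster:
    def __init__(self):
        self.data_checksums = "off"
        self.relhaschecksums = {}      # relname -> bool

    def create_table(self, name):
        # CREATE TABLE sets the flag according to data_checksums.
        self.relhaschecksums[name] = self.data_checksums in ("inprogress", "on")

    def enable_checksums(self):
        # Step 1: clear any leftover flags before starting.
        for rel in self.relhaschecksums:
            self.relhaschecksums[rel] = False
        # Step 2: announce the in-progress state.
        self.data_checksums = "inprogress"
        # Step 3: checksum each relation; a restart could resume here by
        # skipping relations whose flag is already true.
        for rel in self.relhaschecksums:
            self.relhaschecksums[rel] = True
        self.data_checksums = "on"
```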
CUBE_MAX_DIM
Hi,

Someone contacted me about increasing CUBE_MAX_DIM in contrib/cube/cubedata.h (in the community RPMs). The current value is 100, with the following comment:

 * This limit is pretty arbitrary, but don't make it so large that you
 * risk overflow in sizing calculations.

They said they use 500 and never had a problem. I never added such patches to the RPMs, and will not, but wanted to ask: can we safely increase it upstream?

Regards,
--
Devrim Gündüz
Open Source Solution Architect, Red Hat Certified Engineer
Twitter: @DevrimGunduz , @DevrimGunduzTR

signature.asc
Description: This is a digitally signed message part
Re: min_safe_lsn column in pg_replication_slots view
On Wed, Jun 24, 2020 at 8:45 PM Alvaro Herrera wrote:
>
> On 2020-Jun-24, Fujii Masao wrote:
>
> > On 2020/06/24 8:39, Alvaro Herrera wrote:
> >
> > > I think we should publish the value from wal_keep_segments separately
> > > from max_slot_wal_keep_size. ISTM that the user might decide to change
> > > or remove wal_keep_segments and be suddenly at risk of losing slots
> > > because of overlooking that it was wal_keep_segments, not
> > > max_slot_wal_keep_size, that was protecting them.
> >
> > You mean to have two functions that return
> >
> > 1. "current WAL LSN - wal_keep_segments * 16MB"
> > 2. "current WAL LSN - max_slot_wal_keep_size"
>
> Hmm, but all the values there are easily findable. What would be the
> point in repeating it?
>
> Maybe we should disregard this line of thinking and go back to
> Horiguchi-san's original proposal, to wit use the "distance to
> breakage", as also supported now by Amit Kapila[1] (unless I
> misunderstand him).

+1. I also think let's drop the idea of exposing a function for this value, and revert the min_safe_lsn part of the work as proposed by Michael above [1], excluding the function pg_wal_oldest_lsn() in that patch. Then we can expose this as a new stat for PG14. I feel it would be better to display this stat in a new view (something like pg_stat_replication_slots) as discussed in another thread [2]. Does that make sense?

[1] - https://www.postgresql.org/message-id/20200622054950.GC50978%40paquier.xyz
[2] - https://www.postgresql.org/message-id/CA%2Bfd4k5_pPAYRTDrO2PbtTOe0eHQpBvuqmCr8ic39uTNmR49Eg%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
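For illustration, the two candidate values Fujii-san lists differ only in which setting is subtracted from the current WAL position. A quick sketch of the arithmetic (assuming the default 16MB WAL segment size; LSNs are treated as plain byte offsets here, whereas the real server works in XLogRecPtr segments):

```python
# Sketch of the two candidate "oldest safe LSN" formulas quoted above,
# assuming the default 16MB WAL segment size.

WAL_SEG_SIZE = 16 * 1024 * 1024

def keep_segments_horizon(current_lsn, wal_keep_segments):
    # 1. "current WAL LSN - wal_keep_segments * 16MB"
    return max(0, current_lsn - wal_keep_segments * WAL_SEG_SIZE)

def slot_keep_horizon(current_lsn, max_slot_wal_keep_size):
    # 2. "current WAL LSN - max_slot_wal_keep_size"
    return max(0, current_lsn - max_slot_wal_keep_size)
```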
Re: Resetting spilled txn statistics in pg_stat_replication
On Tue, Jun 23, 2020 at 6:39 PM Amit Kapila wrote:
>
> On Tue, Jun 23, 2020 at 3:48 PM Tomas Vondra wrote:
> >
> > On Tue, Jun 23, 2020 at 10:58:18AM +0530, Amit Kapila wrote:
> > > On Tue, Jun 23, 2020 at 9:32 AM Masahiko Sawada wrote:
> > > >
> > > > On Sun, 21 Jun 2020 at 06:57, Tomas Vondra wrote:
> > > > >
> > > > > > What if the decoding has been performed by multiple backends
> > > > > > using the same slot? In that case, it will be difficult to make
> > > > > > the judgment for the value of logical_decoding_work_mem based on
> > > > > > stats. It would make sense if we provide a way to set
> > > > > > logical_decoding_work_mem for a slot, but not sure if that is
> > > > > > better than what we have now.
> > > >
> > > > I thought that the stats are relevant to what the
> > > > logical_decoding_work_mem value was, but not to who performed the
> > > > logical decoding. So even if multiple backends perform logical
> > > > decoding using the same slot, the user can directly use the stats as
> > > > long as the logical_decoding_work_mem value doesn't change.

Today, I thought about it again, and if we consider the point that the logical_decoding_work_mem value doesn't change much, then having the stats at slot level would also allow computing logical_decoding_work_mem based on stats. Do you think it is a reasonable assumption that users won't change logical_decoding_work_mem for different processes (WALSender, etc.)?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
On Wed, Jun 24, 2020 at 3:41 PM Alvaro Herrera wrote: > People (specifically the jdbc driver) *are* using this feature in this > way, but they didn't realize they were doing it. It was an accident and > they didn't notice. But you don't know that that's true of everyone using this feature, and even if it were, so what? Breaking a feature that someone didn't know they were using is just as much of a break as breaking a feature someone DID know they were using. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Open Item: Should non-text EXPLAIN always show properties?
On Thu, Jun 25, 2020 at 5:15 AM David Rowley wrote: > > Over on [1] Justin mentions that the non-text EXPLAIN ANALYZE should > always show the "Disk Usage" and "HashAgg Batches" properties. I > agree with this. show_wal_usage() is a good example of how we normally > do things. We try to keep the text format as humanly readable as > possible but don't really expect humans to be commonly reading the > other supported formats, so we care less about including additional > details there. > > There's also an open item regarding this for Incremental Sort, so I've > CC'd James and Tomas here. This seems like a good place to discuss > both. Yesterday I'd replied [1] to Justin's proposal for this WRT incremental sort and expressed my opinion that including both unnecessarily (i.e., including disk when an in-memory sort was used) is undesirable and confusing and leads to shortcuts I believe to be bad habits when using the data programmatically. On a somewhat related note, memory can be 0 but that doesn't mean no memory was used: it's a result of how tuplesort.c doesn't properly track memory usage when it switches to disk sort. The same isn't true in reverse (we don't have 0 disk when disk was used), but I do think it does show the idea that showing "empty" data isn't an inherent good. If there's a clear established pattern and/or most others seem to prefer Justin's proposed approach, then I'm not going to fight it hard. I just don't think it's the best approach. James [1] https://www.postgresql.org/message-id/CAAaqYe-LswZFUL4k5Dr6%3DEN6MJG1HurggcH4QzUs6UFqBbnQzQ%40mail.gmail.com
Re: More tzdb fun: POSIXRULES is being deprecated upstream
On 2020-06-19 21:55, Tom Lane wrote: Yeah, we can do nothing in the back branches and hope that that doesn't happen for the remaining lifespan of v12. But I wonder whether that doesn't amount to sticking our heads in the sand. I suppose it'd be possible to have a release-note entry in the back branches that isn't tied to any actual code change on our part, but just warns that such a tzdata change might happen at some unpredictable future time. That feels weird and squishy though; and people would likely have forgotten it by the time the change actually hits them. In my mind, this isn't really that different from other external libraries making API changes. But we are not going to forcibly remove Python 2 support in PostgreSQL 9.6 just because it's no longer supported upstream. If Debian or RHEL $veryold want to keep maintaining Python 2, they are free to do so, and users thereof are free to continue using it. Similarly, Debian or RHEL $veryold are surely not going to drop a whole class of time zone codes from their stable distribution just because upstream is phasing it out. What you are saying is, instead of the OS dropping POSIXRULES support, it would be better if we dropped it first and release-noted that. However, I don't agree with the premise of that. OSes with long-term support aren't going to drop it. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Removal of currtid()/currtid2() and some table AM cleanup
Hi,

I seem to have invalidated the KEYSET-DRIVEN cursors used in the positioned-update test. It was introduced by commit 4a272fd but was invalidated by commit 2be35a6. I don't object to the removal of currtid(2), because keyset-driven cursors in psqlodbc are changed into static cursors in many cases, and I've hardly ever heard a complaint about it.

regards,
Hiroshi Inoue

On 2020/06/24 11:11, Michael Paquier wrote:
> On Tue, Jun 23, 2020 at 02:02:33PM +0900, Michael Paquier wrote:
> > Actually, while reviewing the code, the only code path where we use
> > currtid2() involves positioned_load() and LATEST_TUPLE_LOAD. And the
> > only location where this happens is in SC_pos_reload_with_key(), where
> > I don't actually see how it would be possible to not have a keyset and
> > still use a CTID, which would lead to LATEST_TUPLE_LOAD being used. So
> > could it be possible that the code paths of currtid2() are actually
> > just dead code?
>
> I have dug more into this one, and we actually stressed this code path
> quite a lot up to commit d9cb23f in the ODBC driver, with the tests
> cursor-block-delete, positioned-update and bulkoperations, particularly
> when calling SQLSetPos(). However, 86e2e7a has reworked the code in
> such a way that we visibly don't use CTIDs anymore if we don't have a
> keyset, and that combinations of various options like UseDeclareFetch
> or UpdatableCursors don't trigger this code path anymore. In short,
> currtid2() does not get used. Inoue-san, Saito-san, what do you think?
> I am adding also Tsunakawa-san in CC, who has some experience in this
> area.
--
Michael
Re: CUBE_MAX_DIM
Hello,

The problem with higher-dimension cubes is that, starting at a dimensionality of ~52, the "distance" metrics in 64-bit float have less than a single bit per dimension of mantissa, making cubes indistinguishable.

Developers of facial recognition software had a chat about that on the Russian Postgres Telegram group, https://t.me/pgsql. Their problem was that they had 128-dimensional points and recompiled Postgres, but the distances weren't helpful, and GiST KNN severely degraded to almost full scans. They had to reduce the number of facial features to make KNN search work.

Floating-point overflow isn't that much of a risk per se; in the worst case the result becomes Infinity or 0, which are usually acceptable in those contexts. While higher-dimension cubes are mathematically possible, there are implementation issues with them. I'm OK with raising the limit if such nuances get a mention in the docs.

On Thu, Jun 25, 2020 at 1:01 PM Devrim Gündüz wrote:
>
> Hi,
>
> Someone contacted me about increasing CUBE_MAX_DIM
> in contrib/cube/cubedata.h (in the community RPMs). The current value
> is 100 with the following comment:
>
> * This limit is pretty arbitrary, but don't make it so large that you
> * risk overflow in sizing calculations.
>
> They said they use 500, and never had a problem. I never added such
> patches to the RPMS, and will not -- but wanted to ask if we can safely
> increase it in upstream?
>
> Regards,

--
Darafei Praliaskouski
Support me: http://patreon.com/komzpa
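The mantissa-exhaustion point is easy to reproduce: an IEEE 754 double carries a 52-bit mantissa, so once squared distances are accumulated over many dimensions, a genuine per-coordinate difference can fall below the ULP of the running sum and two distinct points become exactly equidistant from a query point. A small sketch (plain Python doubles standing in for cube's float8 arithmetic; the coordinate values are contrived to make the rounding deterministic):

```python
# Illustration of precision loss in high-dimensional distances: the same
# pair of distinct points is distinguishable at 2 dimensions but becomes
# exactly "equidistant" from the origin at 128 dimensions, because the
# perturbation drops below the ULP of the accumulated squared distance.
import math

def distance(p, q):
    # Accumulate squared differences in order, as a naive implementation
    # (and cube's float8 arithmetic) effectively does.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def demo(dims):
    origin = [0.0] * dims
    a = [2.0 ** 27] * dims
    b = list(a)
    b[-1] = 2.0 ** 27 + 2.0 ** -21   # a genuinely different coordinate
    return distance(origin, a), distance(origin, b)
```

At `dims=2` the two distances differ; at `dims=128` the perturbation of roughly 128 units in a squared distance of 2^61 (whose ULP is 512) is rounded away, so the distances compare equal even though the points are not.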
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila wrote: > > On Tue, Jun 23, 2020 at 7:00 PM Dilip Kumar wrote: > > > > Here is the POC patch to discuss the idea of a cleanup of shared > > fileset on proc exit. As discussed offlist, here I am maintaining > > the list of shared fileset. First time when the list is NULL I am > > registering the cleanup function with on_proc_exit routine. After > > that for subsequent fileset, I am just appending it to filesetlist. > > There is also an interface to unregister the shared file set from the > > cleanup list and that is done by the caller whenever we are deleting > > the shared fileset manually. While explaining it here, I think there > > could be one issue if we delete all the element from the list will > > become NULL and on next SharedFileSetInit we will again register the > > function. Maybe that is not a problem but we can avoid registering > > multiple times by using some flag in the file > > > > I don't understand what you mean by "using some flag in the file". > > Review comments on various patches. > > poc_shared_fileset_cleanup_on_procexit > = > 1. > - ent->subxact_fileset = > - MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet)); > + MemoryContext oldctx; > > + /* Shared fileset handle must be allocated in the persistent context */ > + oldctx = MemoryContextSwitchTo(ApplyContext); > + ent->subxact_fileset = palloc(sizeof(SharedFileSet)); > SharedFileSetInit(ent->subxact_fileset, NULL); > + MemoryContextSwitchTo(oldctx); > fd = BufFileCreateShared(ent->subxact_fileset, path); > > Why is this change required for this patch and why we only cover > SharedFileSetInit in the Apply context and not BufFileCreateShared? > The comment is also not very clear on this point. Added the comments for the same. > 2. 
> +void > +SharedFileSetUnregister(SharedFileSet *input_fileset) > +{ > + bool found = false; > + ListCell *l; > + > + Assert(filesetlist != NULL); > + > + /* Loop over all the pending shared fileset entry */ > + foreach (l, filesetlist) > + { > + SharedFileSet *fileset = (SharedFileSet *) lfirst(l); > + > + /* remove the entry from the list and delete the underlying files */ > + if (input_fileset->number == fileset->number) > + { > + SharedFileSetDeleteAll(fileset); > + filesetlist = list_delete_cell(filesetlist, l); > > Why are we calling SharedFileSetDeleteAll here when in the caller we > have already deleted the fileset as per below code? > BufFileDeleteShared(ent->stream_fileset, path); > + SharedFileSetUnregister(ent->stream_fileset); That's wrong I have removed this. > I think it will be good if somehow we can remove the fileset from > filesetlist during BufFileDeleteShared. If that is possible, then we > don't need a separate API for SharedFileSetUnregister. I have done as discussed on later replies, basically called SharedFileSetUnregister from BufFileDeleteShared. > 3. > +static List * filesetlist = NULL; > + > static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum); > +static void SharedFileSetOnProcExit(int status, Datum arg); > static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid > tablespace); > static void SharedFilePath(char *path, SharedFileSet *fileset, const > char *name); > static Oid ChooseTablespace(const SharedFileSet *fileset, const char *name); > @@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg) > /* Register our cleanup callback. */ > if (seg) > on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset)); > + else > + { > + if (filesetlist == NULL) > + on_proc_exit(SharedFileSetOnProcExit, 0); > > We use NIL for list initialization and comparison. See lock_files usage. Done > 4. 
> +SharedFileSetOnProcExit(int status, Datum arg) > +{ > + ListCell *l; > + > + /* Loop over all the pending shared fileset entry */ > + foreach (l, filesetlist) > + { > + SharedFileSet *fileset = (SharedFileSet *) lfirst(l); > + SharedFileSetDeleteAll(fileset); > + } > > We can initialize filesetlist as NIL after the for loop as it will > make the code look clean. Right. > Comments on other patches: > = > 5. > > 3. On concurrent abort we are truncating all the changes including > > some incomplete changes, so later when we get the complete changes we > > don't have the previous changes, e.g, if we had specinsert in the > > last stream and due to concurrent abort detection if we delete that > > changes later we will get spec_confirm without spec insert. We could > > have simply avoided deleting all the changes, but I think the better > > fix is once we detect the concurrent abort for any transaction, then > > why do we need to collect the changes for that, we can simply avoid > > that. So I have put that fix. (0006) > > > > On similar lines, I think we need to skip processing message, see else > part of code in ReorderBufferQueueMessage. Basically, ReorderBufferQueueMessage also calls the ReorderBu
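The cleanup-list pattern under review (register an exit callback exactly once, keep a list of live filesets, unlink on explicit delete so the callback skips them) can be modeled in a standalone sketch. This is not PostgreSQL code: `atexit()` stands in for `on_proc_exit()`, and names like `fileset_register` are illustrative, not the actual SharedFileSet API.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Standalone model of the SharedFileSet cleanup list -- not PostgreSQL
 * code; atexit() stands in for on_proc_exit(). */
typedef struct FileSet
{
    int number;
    struct FileSet *next;
} FileSet;

static FileSet *filesetlist = NULL;  /* analogue of the NIL list */
static int callback_registered = 0;  /* avoids registering twice */

/* Exit callback: delete whatever is still on the list. */
static void
fileset_cleanup_atexit(void)
{
    for (FileSet *fs = filesetlist; fs != NULL; fs = fs->next)
        printf("cleaning up fileset %d\n", fs->number);
    filesetlist = NULL;
}

/* First call installs the exit callback; later calls just link in. */
static void
fileset_register(FileSet *fs)
{
    if (!callback_registered)
    {
        atexit(fileset_cleanup_atexit);
        callback_registered = 1;  /* flag survives even if list empties */
    }
    fs->next = filesetlist;
    filesetlist = fs;
}

/* Explicit delete: unlink so the exit callback won't touch it again. */
static void
fileset_unregister(FileSet *target)
{
    for (FileSet **p = &filesetlist; *p != NULL; p = &(*p)->next)
    {
        if ((*p)->number == target->number)
        {
            *p = (*p)->next;
            return;
        }
    }
}
```

The `callback_registered` flag mirrors the point raised in the review: once the callback is installed, emptying the list must not cause a second registration when the next fileset arrives.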
Re: Why forbid "INSERT INTO t () VALUES ();"
Bonjour Vik, It's forbidden because the SQL standard forbids it. Ok, that is definitely a reason. I'm not sure it is a good reason, though. It's a very good reason. It might not be good *enough*, but it is a good reason. Ok for good, although paradoxically not "good enough":-) Why would the standard forbid it? From the language design point of view[...] Don't go there. There is nothing but pain there. Hmmm. I like to understand. Basically it is my job. Otherwise, yes and no. Postgres could decide (has sometimes decided) to extend the syntax or semantics wrt the standard if it makes sense, so that when a syntax is allowed by the standard it does what the standard says, which I would call positive compliance and I would support that, but keep some freedom elsewhere. -- Fabien.
Re: xid wraparound danger due to INDEX_CLEANUP false
On Thu, 25 Jun 2020 at 05:02, Peter Geoghegan wrote: > > On Wed, Jun 24, 2020 at 10:21 AM Robert Haas wrote: > > Sorry, I'm so far behind on my email. Argh. > > That's okay. > > > I think, especially on the blog post you linked, that we should aim to > > have INDEX_CLEANUP OFF mode do the minimum possible amount of work > > while still keeping us safe against transaction ID wraparound. So if, > > for example, it's desirable but not imperative for BRIN to > > resummarize, then it's OK normally but should be skipped with > > INDEX_CLEANUP OFF. > > I agree that that's very important. > > > Apparently, what we're really doing here is that even when > > INDEX_CLEANUP is OFF, we're still going to keep all the dead tuples. > > That seems sad, but if it's what we have to do then it at least needs > > comments explaining it. > > +1. Though I think that AMs should technically have the right to > consider it advisory. > > > As for the btree portion of the change, I expect you understand this > > better than I do, so I'm reluctant to stick my neck out, but it seems > > that what the patch does is force index cleanup to happen even when > > INDEX_CLEANUP is OFF provided that the vacuum is for wraparound and > > the btree index has at least 1 recyclable page. My first reaction is > > to wonder whether that doesn't nerf this feature into oblivion. Isn't > > it likely that an index that is being vacuumed for wraparound will > > have a recyclable page someplace? If the presence of that 1 recyclable > > page causes the "help, I'm about to run out of XIDs, please do the > > least possible work" flag to become a no-op, I don't think users are > > going to be too happy with that. Maybe I am misunderstanding. > > I have mixed feelings about it myself. These are valid concerns. > > This is a problem to the extent that the user has a table that they'd > like to use INDEX_CLEANUP with, that has indexes that regularly > require cleanup due to page deletion. 
ISTM, then, that the really > relevant high level design questions for this patch are: > > 1. How often is that likely to happen in The Real World™? > > 2. If we fail to do cleanup and leak already-deleted pages, how bad is > that? ( Both in general, and in the worst case.) > > I'll hazard a guess for 1: I think that it might not come up that > often. Page deletion is often something that we hardly ever need. And, > unlike some DB systems, we only do it when pages are fully empty > (which, as it turns out, isn't necessarily better than our simple > approach [1]). I tend to think it's unlikely to happen in cases where > INDEX_CLEANUP is used, because those are cases that also must not have > that much index churn to begin with. > > Then there's question 2. The intuition behind the approach from > Sawada-san's patch was that allowing wraparound here feels wrong -- it > should be in the AM's hands. However, it's not like I can point to > some ironclad guarantee about not leaking deleted pages that existed > before the INDEX_CLEANUP feature. We know that the FSM is not crash > safe, and that's that. Is this really all that different? Maybe it is, > but it seems like a quantitative difference to me. With the approach implemented in my patch, the problem is that users cannot easily know in advance whether a vacuum with INDEX_CLEANUP false will perform index cleanup, even if page deletion doesn't happen in most cases. They can check whether the vacuum will be a wraparound vacuum, but it's relatively hard to check in advance whether there are recyclable pages in the btree index. This will be a restriction for users, especially those who want to use the INDEX_CLEANUP feature to speed up an impending XID wraparound vacuum as described in the blog post that Peter shared. I haven't come up with a good solution to keep us safe against XID wraparound yet, but it seems to me that it's better to have an option that forces index cleanup not to happen.
> > I'm kind of arguing against myself even as I try to advance my > original argument. If workloads that use INDEX_CLEANUP don't need to > delete and recycle pages in any case, then why should we care that > those same workloads might leak pages on account of the wraparound > hazard? I had the same impression at first. > The real reason that I want to push the mechanism down into index > access methods is because that design is clearly better overall; it > just so happens that the specific way in which we currently defer > recycling in nbtree makes very little sense, so it's harder to see the > big picture. The xid-cleanup design that we have was approximately the > easiest way to do it, so that's what we got. We should figure out a > way to recycle the pages at something close to the earliest possible > opportunity, without having to perform a full scan on the index > relation within btvacuumscan(). +1 > Maybe we can use the autovacuum work > item mechanism for that. For indexes that get VACUUMed once a week on > average
Re: More tzdb fun: POSIXRULES is being deprecated upstream
Peter Eisentraut writes: > What you are saying is, instead of the OS dropping POSIXRULES support, > it would be better if we dropped it first and release-noted that. > However, I don't agree with the premise of that. OSes with long-term > support aren't going to drop it. You might be right, or you might not. I think the tzdata distribution is in a weird gray area so far as long-term-support platforms are concerned: they have to keep updating it, no matter how far it diverges from what they originally shipped with. Maybe they will figure out that they're not required to drop POSIXRULES just because upstream did. Or maybe they will go with the flow on that, figuring that it's not any worse than any politically-driven time zone change. I wouldn't be a bit surprised if it ends up depending on whether the particular distro is using IANA's makefile more or less verbatim. In Red Hat's case I found that they'd have to take positive action to drop POSIXRULES, so I'd agree that it won't happen there for a long time, and not in any existing RHEL release. In some other distros, it might take explicit addition of a patch to keep from dropping POSIXRULES, in which case I think there'd be quite good odds that that won't happen and the changeover occurs with the next IANA zone updates. The nasty thing about that scenario from our perspective is that it means that the same timezone spec means different things on different platforms, even ones nominally using the same tzdata release. Do we want to deal with that, or take pre-emptive action to prevent it? (You could argue that that hazard already exists for people who are intentionally using nonstandard posixrules files. But I think the set of such people can be counted without running out of fingers. If there's some evidence to the contrary I'd like to see it.) I'm also worried about what the endgame looks like. It seems clear that at some point IANA is going to remove their code's support for reading a posixrules file. 
Eggert hasn't tipped his hand as to when he thinks that might happen, but I wouldn't care to bet that it's more than five years away. I don't want to find ourselves in a situation where we have to maintain code that upstream has nuked. If they only do something comparable to the patch I posted, it wouldn't be so bad; but if they then undertake any significant follow-on cleanup we'd be in a very bad place for tracking them. regards, tom lane
Re: improving basebackup.c's file-reading code
> On 17 Jun 2020, at 17:45, Robert Haas wrote: > > On Wed, Apr 29, 2020 at 5:52 AM Hamid Akhtar wrote: >> The idea and the patch looks good to me. >> >> It makes sense to change fread to basebackup_read_file which internally >> calls pg_pread which is a portable function as opposed to read. As you've >> mentioned, this is much better for error handling. > > Thanks for the review. I have now committed the patch, after rebasing > and adjusting one comment slightly. As this went in, can we close the 2020-07 CF entry or is there anything left in the patchseries? cheers ./daniel
Re: Why forbid "INSERT INTO t () VALUES ();"
On Thu, Jun 25, 2020 at 12:56 AM Fabien COELHO wrote: > It also means that if for some reason someone wants to insert several such > all-default rows, they have to repeat the insert, as "VALUES (), ();" > would not work, so it is also losing a corner-corner case capability > without obvious reason. That, and a desire to make things work in PostgreSQL that work in MySQL, seems like a good-enough reason to me, but YMMV. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: improving basebackup.c's file-reading code
On Thu, Jun 25, 2020 at 10:29 AM Daniel Gustafsson wrote: > As this went in, can we close the 2020-07 CF entry or is there anything left > in > the patchseries? Done. Thanks for the reminder. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: CUBE_MAX_DIM
Devrim Gündüz writes: > Someone contacted me about increasing CUBE_MAX_DIM > in contrib/cube/cubedata.h (in the community RPMs). The current value > is 100 with the following comment: > * This limit is pretty arbitrary, but don't make it so large that you > * risk overflow in sizing calculations. > They said they use 500, and never had a problem. I guess I'm wondering what's the use-case. 100 already seems an order of magnitude more than anyone could want. Or, if it's not enough, why does raising the limit just 5x enable any large set of new applications? The practical issue here is that, since the data requires 16 bytes per dimension (plus a little bit of overhead), we'd be talking about increasing the maximum size of a cube field from ~ 1600 bytes to ~ 8000 bytes. And cube is not toastable, so that couldn't be compressed or shoved out-of-line. Maybe your OP never had a problem with it, but plenty of use-cases would have "tuple too large" failures due to not having room on a heap page for whatever other data they want in the row. Even a non-toastable 2KB field is going to give the tuple toaster algorithm problems, as it'll end up shoving every other toastable field out-of-line in an ultimately vain attempt to bring the tuple size below 2KB. So I'm really quite hesitant to raise CUBE_MAX_DIM much past where it is now without any other changes. A more credible proposal would be to make cube toast-aware and then raise the limit to ~1GB ... but that would take a significant amount of work, and we still haven't got a use-case justifying it. I think I'd counsel storing such data as plain float8 arrays, which do have the necessary storage infrastructure. Is there something about the cube operators that's particularly missing? regards, tom lane
Re: min_safe_lsn column in pg_replication_slots view
On 2020-Jun-25, Amit Kapila wrote: > +1. I also think let's drop the idea of exposing a function for this > value and revert the min_safe_lsn part of the work as proposed by > Michael above [1] excluding the function pg_wal_oldest_lsn() in that > patch. Then, we can expose this as a new stat for PG14. I feel it > would be better to display this stat in a new view (something like > pg_stat_replication_slots) as discussed in another thread [2]. Does > that make sense? I don't understand the proposal. Michael posted a patch that adds pg_wal_oldest_lsn(), and you say we should apply the patch except the part that adds that function -- so what part would we be applying? If the proposal is to apply just the hunk in pg_get_replication_slots that removes min_safe_lsn, and do nothing else in pg13, then I don't like it. The feature exposes a way to monitor slots w.r.t. the maximum slot size; I'm okay if you prefer to express that in a different way, but I don't like the idea of shipping pg13 without any way to monitor it. As reported by Masao-san, the current min_safe_lsn has a definitional problem when used with pg_walfile_name(), but we've established that that's because pg_walfile_name() has a special-case definition, not because min_safe_lsn itself is bogus. If we're looking for a minimal change that can fix this problem, let's increment one byte, which should fix that issue, no? I also see that some people complain that all slots return the same value and therefore this column is redundant. To that argument I say that it's not unreasonable that we'll add a slot-specific size limit; and if we do, we'll be happy we had slot-specific min safe LSN; see e.g. https://postgr.es/m/20170301160610.wc7ez3vihmial...@alap3.anarazel.de -- Álvaro Herrera    https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: xid wraparound danger due to INDEX_CLEANUP false
On Wed, Jun 24, 2020 at 4:02 PM Peter Geoghegan wrote: > > Apparently, what we're really doing here is that even when > > INDEX_CLEANUP is OFF, we're still going to keep all the dead tuples. > > That seems sad, but if it's what we have to do then it at least needs > > comments explaining it. > > +1. Though I think that AMs should technically have the right to > consider it advisory. I'm not really convinced. I agree that from a theoretical point of view an index can have arbitrary needs and is the arbiter of its own needs, but when I pull the emergency brake, I want the vehicle to stop, not think about stopping. There's a fine argument for the idea that depressing the regular brake pedal entitles the vehicle to exercise some discretion, and on modern cars it does (think ABS, if nothing else). But pulling the emergency brake is a statement that I wish to override any contrary judgement about whether stopping is a good idea. I think this option is rightly viewed as an emergency brake, and giving AMs the right to decide that we'll instead pull off at the next exit doesn't sit well with me. At the end of the day, the human being should be in charge, not the program. (Great, now Skynet will be gunning for me...) > 1. How often is that likely to happen in The Real World™? > > 2. If we fail to do cleanup and leak already-deleted pages, how bad is > that? ( Both in general, and in the worst case.) > > I'll hazard a guess for 1: I think that it might not come up that > often. Page deletion is often something that we hardly ever need. And, > unlike some DB systems, we only do it when pages are fully empty > (which, as it turns out, isn't necessarily better than our simple > approach [1]). I tend to think it's unlikely to happen in cases where > INDEX_CLEANUP is used, because those are cases that also must not have > that much index churn to begin with. I don't think I believe this. All you need is one small range-deletion, right? > Then there's question 2.
The intuition behind the approach from > Sawada-san's patch was that allowing wraparound here feels wrong -- it > should be in the AM's hands. However, it's not like I can point to > some ironclad guarantee about not leaking deleted pages that existed > before the INDEX_CLEANUP feature. We know that the FSM is not crash > safe, and that's that. Is this really all that different? Maybe it is, > but it seems like a quantitative difference to me. I don't think I believe this, either. In the real-world example to which you linked, the user ran REINDEX afterward to recover from index bloat, and we could advise other people who use this option that it may leak space that a subsequent VACUUM may fail to recover, and therefore they too should consider REINDEX. Bloat sucks and I hate it, but in the vehicle analogy from up above, it's the equivalent of getting lost while driving someplace. It is inconvenient and may cause you many problems, but you will not be dead. Running out of XIDs is a brick wall. Either the car stops or you hit the wall. Ideally you can manage to both not get lost and also not hit a brick wall, but in an emergency situation where you have to choose either to get lost or to hit a brick wall, there's only one right answer. As bad as bloat is, and it's really bad, there are users who manage to run incredibly bloated databases for long periods of time just because the stuff that gets slow is either stuff that they're not doing at all, or only doing in batch jobs where it's OK if it runs super-slow and where it may even be possible to disable the batch job altogether, at least for a while. The set of users who can survive running out of XIDs is limited to those who can get by with just read-only queries, and that's practically nobody. I have yet to encounter a customer who didn't consider running out of XIDs to be an emergency. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Parallel Seq Scan vs kernel read ahead
On Tue, Jun 23, 2020 at 11:53 PM David Rowley wrote: > In summary, based on these tests, I don't think we're making anything > worse in regards to synchronize_seqscans if we cap the maximum number > of blocks to allocate to each worker at once to 8192. Perhaps there's > some argument for using something smaller than that for servers with > very little RAM, but I don't personally think so as it still depends > on the table size and It's hard to imagine tables in the hundreds of > GBs on servers that struggle with chunk allocations of 16MB. The > table needs to be at least ~70GB to get a 8192 chunk size with the > current v2 patch settings. Nice research. That makes me happy. I had a feeling the maximum useful chunk size ought to be more in this range than the larger values we were discussing before, but I didn't even think about the effect on synchronized scans. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Open Item: Should non-text EXPLAIN always show properties?
On Thu, Jun 25, 2020 at 8:42 AM James Coleman wrote: > Yesterday I'd replied [1] to Justin's proposal for this WRT > incremental sort and expressed my opinion that including both > unnecessarily (i.e., including disk when an in-memory sort was used) > is undesirable and confusing and leads to shortcuts I believe to be > bad habits when using the data programmatically. +1. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: PostgreSQL and big data - FDW
On Thu, Jun 25, 2020 at 07:02:37AM +, ROS Didier wrote: > Hi Stephen > > My EDF company is very interested in this feature (KERBEROS authentication > method and hdfs_fdw ). > Is it possible to know how many days of development does this represent ? who > can develop this implementation ? what cost ? Uh, the only thing I can suggest is to contact one of the larger Postgres support companies (ones that have developers who understand the server code, or at least the FDW code), and ask them for estimates. The community really can't supply any of that, unless you want to do the work and want source code tips. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: Default setting for enable_hashagg_disk
On Wed, Jun 24, 2020 at 7:38 PM Bruce Momjian wrote: > I think my main point is that work_mem was not being honored for > hash-agg before, but now that PG 13 can do it, we are again allowing > work_mem not to apply in certain cases. I am wondering if our hard > limit for work_mem is the issue, and we should make that more flexible > for all uses. I mean, that's pretty much what we're talking about here, isn't it? It seems like in your previous two replies you were opposed to separating the plan-type limit from the execution-time limit, but that idea is precisely a way of being more flexible (and extending it to other plan nodes is a way of making it more flexible for more use cases). As I think you know, if you have a system where the workload varies a lot, you may sometimes be using 0 copies of work_mem and at other times 1000 or more copies, so the value has to be chosen conservatively as a percentage of system memory, else you start swapping or the OOM killer gets involved. On the other hand, some plan nodes get a lot less efficient when the amount of memory available falls below some threshold, so you can't just set this to a tiny value and forget about it. Because the first problem is so bad, most people set the value relatively conservatively and just live with the performance consequences. But this also means that they have memory left over most of the time, so the idea of letting a node burst above its work_mem allocation when something unexpected happens isn't crazy: as long as only a few nodes do that here and there, rather than, say, all the nodes doing it all at the same time, it's actually fine. If we had a smarter system that could dole out more work_mem to nodes that would really benefit from it and less to nodes where it isn't likely to make much difference, that would be similar in spirit but even better. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path
On Wed, Jun 24, 2020 at 9:51 PM David Rowley wrote: > $ pgbench -n -f pg-tpch/queries/q01.sql -T 120 tpch > > Which is about a 1.42% increase in performance. That's not exactly > groundbreaking, but pretty useful to have if that happens to apply > across the board for execution performance. > > For pgbench -S: > > My results were a bit noisier than the TPCH test, but the results I > obtained did show about a 10% increase in performance: This is pretty cool, particularly because it affects single-client performance. It seems like a lot of ideas people have had about speeding up pgbench performance - including me - have improved performance under concurrency at the cost of very slightly degrading single-client performance. It would be nice to claw some of that back. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Why forbid "INSERT INTO t () VALUES ();"
Robert Haas writes: > On Thu, Jun 25, 2020 at 12:56 AM Fabien COELHO wrote: >> It also means that if for some reason someone wants to insert several such >> all-default rows, they have to repeat the insert, as "VALUES (), ();" >> would not work, so it is also losing a corner-corner case capability >> without obvious reason. > That, and a desire to make things work in PostgreSQL that work in > MySQL, seems like a good-enough reason to me, but YMMV. Yeah, the multi-insert case is a plausible reason that hadn't been mentioned before. On the other hand, you can already do that pretty painlessly:

regression=# create table foo(x float8 default random());
CREATE TABLE
regression=# insert into foo select from generate_series(1,10);
INSERT 0 10
regression=# table foo;
          x
---------------------
  0.08414037203059621
   0.2921176461398325
   0.8760821189460586
   0.6266325419285828
   0.9946880079739273
   0.4547070342142696
  0.09683985675118834
   0.3172576600666268
   0.5122428845812195
   0.8823697407826394
(10 rows)

So I'm still not convinced we should do this. "MySQL is incapable of conforming to the standard" is a really lousy reason for us to do something. Anyway, to answer Fabien's question about why things are like this: the standard doesn't allow zero-column tables, so most of these syntactic edge cases are forbidden on that ground. We decided we didn't like that restriction (because, for example, it creates a painful special case for DROP COLUMN). So we've adjusted a minimal set of syntactic edge cases to go along with that semantic change. There's room to argue that INSERT's edge case should be included, but there's also room to argue that it doesn't need to be. regards, tom lane
Re: Default setting for enable_hashagg_disk
On Wed, 2020-06-24 at 15:28 -0400, Robert Haas wrote: > On Wed, Jun 24, 2020 at 3:14 PM Andres Freund > wrote: > > FWIW, my gut feeling is that we'll end up have to separate the > > "execution time" spilling from using plain work mem, because it'll > > trigger spilling too often. E.g. if the plan isn't expected to > > spill, > > only spill at 10 x work_mem or something like that. Or we'll need > > better management of temp file data when there's plenty memory > > available. ... > I think that's actually pretty appealing. Separating the memory we > plan to use from the memory we're willing to use before spilling > seems > like a good idea in general, and I think we should probably also do > it > in other places - like sorts. I'm trying to make sense of this. Let's say there are two GUCs: planner_work_mem=16MB and executor_work_mem=32MB. And let's say a query comes along and generates a HashAgg path, and the planner (correctly) thinks if you put all the groups in memory at once, it would be 24MB. Then the planner, using planner_work_mem, would think spilling was necessary, and generate a cost that involves spilling. Then it's going to generate a Sort+Group path, as well. And perhaps it estimates that sorting all of the tuples in memory would also take 24MB, so it generates a cost that involves spilling to disk. But it has to choose one of them. We've penalized plans at risk of spilling to disk, but what's the point? The planner needs to choose one of them, and both are at risk of spilling to disk. Regards, Jeff Davis
Re: Default setting for enable_hashagg_disk
On Wed, 2020-06-24 at 12:14 -0700, Andres Freund wrote: > E.g. if the plan isn't expected to spill, > only spill at 10 x work_mem or something like that. Let's say you have work_mem=32MB and a query that's expected to use 16MB of memory. In reality, it uses 64MB of memory. So you are saying this query would get to use all 64MB of memory, right? But then you run ANALYZE. Now the query is (correctly) expected to use 64MB of memory. Are you saying this query, executed again with better stats, would only get to use 32MB of memory, and therefore run slower? Regards, Jeff Davis
Re: CUBE_MAX_DIM
> > Devrim Gündüz writes: > > Someone contacted me about increasing CUBE_MAX_DIM > > in contrib/cube/cubedata.h (in the community RPMs). The current value > > is 100 with the following comment: > > > * This limit is pretty arbitrary, but don't make it so large that you > > * risk overflow in sizing calculations. > > > They said they use 500, and never had a problem. > > I guess I'm wondering what's the use-case. 100 already seems an order of > magnitude more than anyone could want. Or, if it's not enough, why does > raising the limit just 5x enable any large set of new applications? The dimensionality of embeddings generated by deep neural networks can be high. Google BERT has 768 dimensions, for example. I know that Cube in its current form isn't suitable for nearest-neighbour searching these vectors in their raw form (I have tried recompilation with higher CUBE_MAX_DIM myself), but conceptually kNN GiST searches using Cubes can be useful for these applications. There are other pre-processing techniques that can be used to improve the speed of the search, but it still ends up with a kNN search in a high-ish dimensional space. > The practical issue here is that, since the data requires 16 bytes per > dimension (plus a little bit of overhead), we'd be talking about > increasing the maximum size of a cube field from ~ 1600 bytes to ~ 8000 > bytes. And cube is not toastable, so that couldn't be compressed or > shoved out-of-line. Maybe your OP never had a problem with it, but > plenty of use-cases would have "tuple too large" failures due to not > having room on a heap page for whatever other data they want in the row. > > Even a non-toastable 2KB field is going to give the tuple toaster > algorithm problems, as it'll end up shoving every other toastable field > out-of-line in an ultimately vain attempt to bring the tuple size below > 2KB. So I'm really quite hesitant to raise CUBE_MAX_DIM much past where > it is now without any other changes.
> > A more credible proposal would be to make cube toast-aware and then > raise the limit to ~1GB ... but that would take a significant amount > of work, and we still haven't got a use-case justifying it. > > I think I'd counsel storing such data as plain float8 arrays, which > do have the necessary storage infrastructure. Is there something > about the cube operators that's particularly missing? > The indexable nearest-neighbour searches are one of the great cube features not available with float8 arrays. > regards, tom lane Best regards, Alastair
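To make the float8-array alternative concrete: the ranking computation that cube's `<->` kNN operator provides can be written over plain double arrays, and a sequential scan with such a function is what you give up index support for. This is an illustrative sketch, not contrib/cube code; the squared distance is used so no math library is needed, and for nearest-neighbour ranking it orders results identically to the true Euclidean distance.

```c
/* Squared Euclidean distance between two embeddings stored as plain
 * double ("float8") arrays of the same dimensionality.  A linear scan
 * using this is the unindexed equivalent of cube's <-> kNN operator.
 * (Illustrative sketch, not contrib/cube code.) */
static double
embedding_distance_sq(const double *a, const double *b, int dim)
{
    double acc = 0.0;

    for (int i = 0; i < dim; i++)
    {
        double d = a[i] - b[i];
        acc += d * d;        /* accumulate per-dimension squared diff */
    }
    return acc;
}
```

A 768-dimensional BERT embedding would need 768 iterations per comparison here, which is exactly the work a GiST kNN search avoids repeating for most of the table.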
Re: Open Item: Should non-text EXPLAIN always show properties?
Robert Haas writes: > On Thu, Jun 25, 2020 at 8:42 AM James Coleman wrote: >> Yesterday I'd replied [1] to Justin's proposal for this WRT >> incremental sort and expressed my opinion that including both >> unnecessarily (i.e., including disk when an in-memory sort was used) >> is undesirable and confusing and leads to shortcuts I believe to be >> bad habits when using the data programmatically. > +1. I think the policy about non-text output formats is "all applicable fields should be included automatically". But the key word there is "applicable". Are disk-sort numbers applicable when no disk sort happened? I think the right way to think about this is that we are building an output data structure according to a schema that should be fixed for any particular plan shape. If event X happened zero times in a given execution, but it could have happened in a different execution of the same plan, then we should print X with a zero count. If X could not happen period in this plan, we should omit X's entry. So the real question here is whether the disk vs memory decision is plan time vs run time. AFAIK it's run time, which leads me to think we ought to print the zeroes. regards, tom lane
Re: Why forbid "INSERT INTO t () VALUES ();"
On Wed, 24 Jun 2020 at 08:18, Fabien COELHO wrote: > I would like to create an "all defaults" row, i.e. a row composed of the > default values for all attributes, so I wrote: > >INSERT INTO t() VALUES (); > > This is forbidden by postgres, and also sqlite. > This is not the only area where empty tuples are not supported. Consider: PRIMARY KEY () This should mean the table may only contain a single row, but is not supported. Also, GROUP BY supports grouping by no columns, but not in a systematic way: Using aggregate functions with no explicit GROUP BY clause will result in grouping by no columns (i.e., entire result set is one group); I also found that I could GROUP BY NULL::integer, abusing the column number syntax. But things like GROUP BY ROLLUP () are not supported. On the plus side, empty rows are supported, although the explicit ROW keyword is required.
Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path
Hi, Thanks for picking this up again! On 2020-06-25 13:50:37 +1200, David Rowley wrote: > In the attached, I've come up with a way that works. Basically I've > just added a function named errstart_cold() that is attributed with > __attribute__((cold)), which will hint to the compiler to keep > branches which call this function in a cold path. I recall you trying this before? Has that gotten easier because we evolved ereport()/elog(), or has gcc become smarter, or ...? > To make the compiler properly mark just >= ERROR calls as cold, and > only when the elevel is a constant, I modified the ereport_domain > macro to do: > > if (__builtin_constant_p(elevel) && (elevel) >= ERROR ? \ > errstart_cold(elevel, domain) : \ > errstart(elevel, domain)) \ I think it'd be good to not just do this for ERROR, but also for <= DEBUG1. I recall seeing quite a few debug elogs that made the code worse just by "being there". I suspect that doing this for DEBUG* could also improve the code for clang, because we obviously don't have __builtin_unreachable after those. > I see no reason why the compiler shouldn't always fold that constant > expression at compile-time and correctly select the correct version of > the function for the job. (I also tried using __builtin_choose_expr() > but GCC complained when the elevel was not constant, even with the > __builtin_constant_p() test in the condition) I think it has to be constant in all paths for that... > Master: > drowley@amd3990x:~$ pgbench -S -T 120 postgres > tps = 25245.903255 (excluding connections establishing) > tps = 26144.454208 (excluding connections establishing) > tps = 25931.850518 (excluding connections establishing) > > Master + elog_ereport_attribute_cold.patch > drowley@amd3990x:~$ pgbench -S -T 120 postgres > tps = 28351.480631 (excluding connections establishing) > tps = 27763.656557 (excluding connections establishing) > tps = 28896.427929 (excluding connections establishing) That is pretty damn cool.
> diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c > index e976201030..8076e8af24 100644 > --- a/src/backend/utils/error/elog.c > +++ b/src/backend/utils/error/elog.c > @@ -219,6 +219,19 @@ err_gettext(const char *str) > #endif > } > > +#if defined(HAVE_PG_ATTRIBUTE_HOT_AND_COLD) && > defined(HAVE__BUILTIN_CONSTANT_P) > +/* > + * errstart_cold > + * A simple wrapper around errstart, but hinted to be cold so that > the > + * compiler is more likely to put error code in a cold area away > from the > + * main function body. > + */ > +bool > +pg_attribute_cold errstart_cold(int elevel, const char *domain) > +{ > + return errstart(elevel, domain); > +} > +#endif Hm. Would it make sense to have this be a static inline? > /* > * errstart --- begin an error-reporting cycle > diff --git a/src/include/c.h b/src/include/c.h > index d72b23afe4..087b8af6cb 100644 > --- a/src/include/c.h > +++ b/src/include/c.h > @@ -178,6 +178,21 @@ > #define pg_noinline > #endif > > +/* > + * Marking certain functions as "hot" or "cold" can be useful to assist the > + * compiler in arranging the assembly code in a more efficient way. > + * These are supported from GCC >= 4.3 and clang >= 3.2 > + */ > +#if (defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ > >= 3))) || \ > + (defined(__clang__) && (__clang_major__ > 3 || (__clang_major__ == 3 && > __clang_minor__ >= 2))) > +#define HAVE_PG_ATTRIBUTE_HOT_AND_COLD 1 > +#define pg_attribute_hot __attribute__((hot)) > +#define pg_attribute_cold __attribute__((cold)) > +#else > +#define pg_attribute_hot > +#define pg_attribute_cold > +#endif Wonder if we should start using __has_attribute() for things like this. https://gcc.gnu.org/onlinedocs/cpp/_005f_005fhas_005fattribute.html I.e. 
we could do something like #ifndef __has_attribute #define __has_attribute(attribute) 0 #endif #if __has_attribute(hot) #define pg_attribute_hot __attribute__((hot)) #else #define pg_attribute_hot #endif clang added __has_attribute in 2.9 (2010), gcc added it in 5 (2014), so I don't think we'd lose too much. > #ifdef HAVE__BUILTIN_CONSTANT_P > +#ifdef HAVE_PG_ATTRIBUTE_HOT_AND_COLD > +#define ereport_domain(elevel, domain, ...) \ > + do { \ > + pg_prevent_errno_in_scope(); \ > + if (__builtin_constant_p(elevel) && (elevel) >= ERROR ? \ > + errstart_cold(elevel, domain) : \ > + errstart(elevel, domain)) \ > + __VA_ARGS__, errfinish(__FILE__, __LINE__, > PG_FUNCNAME_MACRO); \ > + if (__builtin_constant_p(elevel) && (elevel) >= ERROR) \ > + pg_unreachable(); \ > + } while(0) > +#else/* > !HAVE_PG_ATTRIBUTE_HOT_AND_COLD */ > #define ereport_domain(elevel, domain, ...) \ > do { \ > pg_prevent_errno_in_scope(); \ > @@ -129,6 +141,7 @@ >
Re: Default setting for enable_hashagg_disk
On 2020-06-25 09:24:52 -0700, Jeff Davis wrote: > On Wed, 2020-06-24 at 12:14 -0700, Andres Freund wrote: > > E.g. if the plan isn't expected to spill, > > only spill at 10 x work_mem or something like that. > > Let's say you have work_mem=32MB and a query that's expected to use > 16MB of memory. In reality, it uses 64MB of memory. So you are saying > this query would get to use all 64MB of memory, right? > > But then you run ANALYZE. Now the query is (correctly) expected to use > 64MB of memory. Are you saying this query, executed again with better > stats, would only get to use 32MB of memory, and therefore run slower? Yes. I think that's ok, because it was taken into account from a costing perspective in the second case.
Re: Default setting for enable_hashagg_disk
On Wed, 2020-06-24 at 12:31 -0700, Andres Freund wrote: > nodeAgg.c already treats those separately: > > void > hash_agg_set_limits(double hashentrysize, uint64 input_groups, int > used_bits, > Size *mem_limit, uint64 > *ngroups_limit, > int *num_partitions) > { > int npartitions; > Size partition_mem; > > /* if not expected to spill, use all of work_mem */ > if (input_groups * hashentrysize < work_mem * 1024L) > { > if (num_partitions != NULL) > *num_partitions = 0; > *mem_limit = work_mem * 1024L; > *ngroups_limit = *mem_limit / hashentrysize; > return; > } The reason this code exists is to decide how much of work_mem to set aside for spilling (each spill partition needs an IO buffer). The alternative would be to fix the number of partitions before processing a batch, which didn't seem ideal. Or, we could just ignore the memory required for IO buffers, like HashJoin. Granted, this is an example where an underestimate can give an advantage, but I don't think we want to extend the concept into other areas. Regards, Jeff Davis
Re: CUBE_MAX_DIM
Alastair McKinley writes: > I know that Cube in its current form isn't suitable for nearest-neighbour > searching these vectors in their raw form (I have tried recompilation with > higher CUBE_MAX_DIM myself), but conceptually kNN GiST searches using Cubes > can be useful for these applications. There are other pre-processing > techniques that can be used to improve the speed of the search, but it still > ends up with a kNN search in a high-ish dimensional space. Is there a way to fix the numerical instability involved? If we could do that, then we'd definitely have a use-case justifying the work to make cube toastable. regards, tom lane
Re: Weird failures on lorikeet
Is there something going on with lorikeet again? I see this: 2020-06-25 01:55:13.380 EDT [5ef43c40.21e0:85] pg_regress/typed_table LOG: statement: SELECT c.oid::pg_catalog.regclass FROM pg_catalog.pg_class c, pg_catalog.pg_inherits i WHERE c.oid = i.inhparent AND i.inhrelid = '21420' AND c.relkind != 'p' AND c.relkind != 'I' ORDER BY inhseqno; *** starting debugger for pid 4456, tid 2644 2020-06-25 01:55:13.393 EDT [5ef43c40.21e0:86] pg_regress/typed_table LOG: statement: SELECT c.oid::pg_catalog.regclass, c.relkind, pg_catalog.pg_get_expr(c.relpartbound, c.oid) FROM pg_catalog.pg_class c, pg_catalog.pg_inherits i WHERE c.oid = i.inhrelid AND i.inhparent = '21420' ORDER BY pg_catalog.pg_get_expr(c.relpartbound, c.oid) = 'DEFAULT', c.oid::pg_catalog.regclass::pg_catalog.text; 1 [main] postgres 4456 try_to_debug: Failed to start debugger, Win32 error 2 2020-06-25 01:55:13.455 EDT [5ef43c40.21e0:87] pg_regress/typed_table LOG: statement: CREATE TABLE persons3 OF person_type ( PRIMARY KEY (id), name NOT NULL DEFAULT '' ); *** continuing pid 4456 from debugger call (0) *** starting debugger for pid 4456, tid 2644 48849 [main] postgres 4456 try_to_debug: Failed to start debugger, Win32 error 2 *** continuing pid 4456 from debugger call (0) 2020-06-25 01:55:18.181 EDT [5ef43bd6.2824:4] LOG: server process (PID 4456) was terminated by signal 11: Segmentation fault 2020-06-25 01:55:18.181 EDT [5ef43bd6.2824:5] DETAIL: Failed process was running: drop table some_tab cascade; On 2018-Mar-06, Thomas Munro wrote: > On Thu, Feb 22, 2018 at 7:06 AM, Andres Freund wrote: > > I noticed your animal lorikeet failed in the last two runs: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2018-02-21%2009%3A47%3A17 > > TRAP: FailedAssertion("!(((PageHeader) (page))->pd_special >= > > (__builtin_offsetof (PageHeaderData, pd_linp)))", File: > > "/home/andrew/bf64/root/HEAD/pgsql/src/include/storage/bufpage.h", Line: > > 313) > > > > 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2018-02-20%2012%3A46%3A17 > > TRAP: FailedAssertion("!(PMSignalState->PMChildFlags[slot] == 1)", File: > > "/home/andrew/bf64/root/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/pmsignal.c", > > Line: 229) > > 2018-02-20 08:07:14.054 EST [5a8c1c3b.21d0:3] LOG: select() failed in > > postmaster: Bad address > > 2018-02-20 08:07:14.073 EST [5a8c1c3b.21d0:4] LOG: database system is shut > > down > Bad memory? Assorted output from recent runs: > > + ERROR: index "pg_toast_28546_index" contains corrupted page at block 1 > > TRAP: FailedAssertion("!(PMSignalState->PMChildFlags[slot] == 1)", > File: > "/home/andrew/bf64/root/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/pmsignal.c", > Line: 229) > > TRAP: FailedAssertion("!(entry->trans == ((void *)0))", File: > "/home/andrew/bf64/root/HEAD/pgsql.build/../pgsql/src/backend/postmaster/pgstat.c", > Line: 871) -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Default setting for enable_hashagg_disk
On Thu, 2020-06-25 at 11:46 -0400, Robert Haas wrote: > Because the first problem is so bad, most people > set the value relatively conservatively and just live with the > performance consequences. But this also means that they have memory > left over most of the time, so the idea of letting a node burst above > its work_mem allocation when something unexpected happens isn't > crazy: > as long as only a few nodes do that here and there, rather than, say, > all the nodes doing it all at the same time, it's actually fine. Unexpected things (meaning underestimates) are not independent. All the queries are based on the same stats, so if you have a lot of similar queries, they will all get the same underestimate at once, all be surprised when they need to spill at once, and then all decide they are entitled to ignore work_mem at once. > If we > had a smarter system that could dole out more work_mem to nodes that > would really benefit from it and less to nodes where it isn't likely > to make much difference, that would be similar in spirit but even > better. That sounds more useful and probably not too hard to implement in a crude form. Just have a shared counter in memory representing GB. If a node is about to spill, it could try to decrement the counter by N, and if it succeeds, it gets to exceed work_mem by N more GB. Regards, Jeff Davis
Re: Default setting for enable_hashagg_disk
On Thu, Jun 25, 2020 at 11:46:54AM -0400, Robert Haas wrote: > On Wed, Jun 24, 2020 at 7:38 PM Bruce Momjian wrote: > > I think my main point is that work_mem was not being honored for > > hash-agg before, but now that PG 13 can do it, we are again allowing > > work_mem not to apply in certain cases. I am wondering if our hard > > limit for work_mem is the issue, and we should make that more flexible > > for all uses. > > I mean, that's pretty much what we're talking about here, isn't it? It > seems like in your previous two replies you were opposed to separating > the plan-type limit from the execution-time limit, but that idea is > precisely a way of being more flexible (and extending it to other plan > nodes is a way of making it more flexible for more use cases). I think it was Tom who was complaining about plan vs. execution time control. > As I think you know, if you have a system where the workload varies a > lot, you may sometimes be using 0 copies of work_mem and at other > times 1000 or more copies, so the value has to be chosen > conservatively as a percentage of system memory, else you start > swapping or the OOM killer gets involved. On the other hand, some plan > nodes get a lot less efficient when the amount of memory available > falls below some threshold, so you can't just set this to a tiny value > and forget about it. Because the first problem is so bad, most people > set the value relatively conservatively and just live with the > performance consequences. But this also means that they have memory > left over most of the time, so the idea of letting a node burst above > its work_mem allocation when something unexpected happens isn't crazy: > as long as only a few nodes do that here and there, rather than, say, > all the nodes doing it all at the same time, it's actually fine.
If we > had a smarter system that could dole out more work_mem to nodes that > would really benefit from it and less to nodes where it isn't likely > to make much difference, that would be similar in spirit but even > better. I think the issue is that in PG 13 work_mem controls sorts and hashes with a new hard limit for hash aggregation: https://www.postgresql.org/docs/12/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY Sort operations are used for ORDER BY, DISTINCT, and merge joins. Hash tables are used in hash joins, hash-based aggregation, and hash-based processing of IN subqueries. In pre-PG 13, we "controlled" it by avoiding hash-based aggregation if it was expected to exceed work_mem, but if we assumed it would be less than work_mem and it was more, we exceeded the work_mem allocation for that node. In PG 13, we "limit" memory to work_mem and spill to disk if we exceed it. We should really have always documented that hash agg could exceed work_mem on misestimation, and if we add a hash_agg work_mem misestimation bypass setting we should document this setting under work_mem as well. But then the question is why do we allow this bypass only for hash agg? Should work_mem have settings for ORDER BY, merge join, hash join, and hash agg, e.g.: work_mem = 'order_by=10MB, hash_join=20MB, hash_agg=100MB' Yeah, crazy syntax, but you get the idea. I understand some nodes are more sensitive to disk spill than others, so shouldn't we be controlling this at the work_mem level, rather than for a specific node type like hash agg?
We could allow for misestimation over-allocation of hash agg work_mem by splitting up the hash agg values: work_mem = 'order_by=10MB, hash_join=20MB, hash_agg=100MB hash_agg_max=200MB' but _avoiding_ hash agg if it is estimated to exceed work_mem and spill to disk is not something to logically control at the work_mem level, which leads to something like David Rowley suggested, but with different names: enable_hashagg = on | soft | avoid | off where 'on' and 'off' are the current PG 13 behavior, 'soft' means to treat work_mem as a soft limit and allow it to exceed work_mem on misestimation, and 'avoid' means to avoid hash agg if it is estimated to exceed work_mem. Neither 'soft' nor 'avoid' spills to disk. David's original terms of "trynospill" and "neverspill" were focused on spilling, not on its interaction with work_mem, and I found that confusing. Frankly, if it took me this long to get my head around this, I am unclear how many people will understand this tuning feature enough to actually use it. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: [patch] demote
Hi, Here is a summary of my work during the last few days on this demote approach. Please find attached v2-0001-Demote-PoC.patch, with comments in the commit message and as FIXMEs in the code. The patch is not finished or bug-free yet, I'm still not very happy with the coding style, and it probably lacks some more code documentation, but a lot has changed since v1. It's still a PoC to push the discussion a bit further after I have been silent for some days. The patch currently relies on a demote checkpoint. I understand a forced checkpoint's overhead can be massive and cause major wait/downtime. But I keep this for a later step. Maybe we should be able to cancel a running checkpoint? Or leave it to its syncing work but discard the result without writing it to XLog? I haven't had time to investigate Robert's concern about shared memory for snapshots during recovery. The patch doesn't deal with prepared xacts yet. Testing "start->demote->promote" raises an assert if some prepared xacts exist. I suppose I will roll them back during demote in the next patch version. I'm not sure how to divide this patch into multiple small independent steps. I suppose I can split it like: 1. add demote checkpoint 2. support demote: mostly postmaster, startup/xlog and checkpointer related code 3. cli using pg_ctl demote ...But I'm not sure it's worth it.
Regards, >From 03c41dd706648cd20df90a128db64eee6b6dad97 Mon Sep 17 00:00:00 2001 From: Jehan-Guillaume de Rorthais Date: Fri, 10 Apr 2020 18:01:45 +0200 Subject: [PATCH] Demote PoC Changes: * creates a demote checkpoint * use DB_DEMOTING state in controlfile * try to handle subsystems init correctly during demote * keep some sub-processes alive: stat collector, checkpointer, bgwriter and optionally archiver or wal senders * add signal PMSIGNAL_DEMOTING to start the startup process after the demote checkpoint * ShutdownXLOG takes a boolean arg to handle demote differently Trivial manual tests: * make check : OK * make check-world : OK * start in production -> demote -> demote: OK * start in production -> demote -> stop : OK * start in production -> demote -> promote : NOK (2PC, see TODO) but OK with no prepared xact. Discuss/Todo: * rollback prepared xact * cancel/kill active/idle in xact R/W backends * pg_demote() function? * some more code reviewing around StartupXlog * investigate snapshots shmem needs/init during recovery compare to production * add tap tests * add doc * how to handle checkpoint? 
--- src/backend/access/rmgrdesc/xlogdesc.c | 9 +- src/backend/access/transam/xlog.c | 287 +++- src/backend/postmaster/checkpointer.c | 22 ++ src/backend/postmaster/postmaster.c | 250 - src/backend/storage/ipc/procsignal.c| 4 + src/bin/pg_controldata/pg_controldata.c | 2 + src/bin/pg_ctl/pg_ctl.c | 111 + src/include/access/xlog.h | 18 +- src/include/catalog/pg_control.h| 2 + src/include/libpq/libpq-be.h| 7 +- src/include/postmaster/bgwriter.h | 1 + src/include/storage/pmsignal.h | 1 + src/include/storage/procsignal.h| 1 + src/include/utils/pidfile.h | 1 + 14 files changed, 537 insertions(+), 179 deletions(-) diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c index 1cd97852e8..5aeaff18f8 100644 --- a/src/backend/access/rmgrdesc/xlogdesc.c +++ b/src/backend/access/rmgrdesc/xlogdesc.c @@ -40,7 +40,8 @@ xlog_desc(StringInfo buf, XLogReaderState *record) uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK; if (info == XLOG_CHECKPOINT_SHUTDOWN || - info == XLOG_CHECKPOINT_ONLINE) + info == XLOG_CHECKPOINT_ONLINE || + info == XLOG_CHECKPOINT_DEMOTE) { CheckPoint *checkpoint = (CheckPoint *) rec; @@ -65,7 +66,8 @@ xlog_desc(StringInfo buf, XLogReaderState *record) checkpoint->oldestCommitTsXid, checkpoint->newestCommitTsXid, checkpoint->oldestActiveXid, - (info == XLOG_CHECKPOINT_SHUTDOWN) ? "shutdown" : "online"); + (info == XLOG_CHECKPOINT_SHUTDOWN) ? "shutdown" : + (info == XLOG_CHECKPOINT_DEMOTE)? 
"demote" : "online"); } else if (info == XLOG_NEXTOID) { @@ -185,6 +187,9 @@ xlog_identify(uint8 info) case XLOG_FPI_FOR_HINT: id = "FPI_FOR_HINT"; break; + case XLOG_CHECKPOINT_DEMOTE: + id = "CHECKPOINT_DEMOTE"; + break; } return id; diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index e455384b5b..0e18e546ba 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -6301,6 +6301,13 @@ CheckRequiredParameterValues(void) /* * This must be called ONCE during postmaster or standalone-backend startup */ +/* + * FIXME demote: part of the code here assume there's no other active + * processes before signal PMSIGNAL_RECOVERY_STARTED is sent. + * + * FIXME demote: rollback prepared xact during demote? +
Re: Default setting for enable_hashagg_disk
On Thu, 2020-06-25 at 09:37 -0700, Andres Freund wrote: > > Let's say you have work_mem=32MB and a query that's expected to use > > 16MB of memory. In reality, it uses 64MB of memory. So you are > > saying > > this query would get to use all 64MB of memory, right? > > > > But then you run ANALYZE. Now the query is (correctly) expected to > > use > > 64MB of memory. Are you saying this query, executed again with > > better > > stats, would only get to use 32MB of memory, and therefore run > > slower? > > Yes. I think that's ok, because it was taken into account from a > costing > perspective in the second case. What do you mean by "taken into account"? There are only two possible paths: HashAgg and Sort+Group, and we need to pick one. If the planner expects one to spill, it is likely to expect the other to spill. If one spills in the executor, then the other is likely to spill, too. (I'm ignoring the case with a lot of tuples and few groups because that doesn't seem relevant.) Imagine that there was only one path available to choose. Would you suggest the same thing, that unexpected spills can exceed work_mem but expected spills can't? Regards, Jeff Davis
Re: Default setting for enable_hashagg_disk
On Thu, Jun 25, 2020 at 1:15 PM Jeff Davis wrote: > Unexpected things (meaning underestimates) are not independent. All the > queries are based on the same stats, so if you have a lot of similar > queries, they will all get the same underestimate at once, and all be > surprised when they need to spill at once, and then all decide they are > entitled to ignore work_mem at once. Yeah, that's a risk. But what is proposed is a configuration setting, so people can adjust it depending on what they think is likely to happen in their environment. > That sounds more useful and probably not too hard to implement in a > crude form. Just have a shared counter in memory representing GB. If a > node is about to spill, it could try to decrement the counter by N, and > if it succeeds, it gets to exceed work_mem by N more GB. That's a neat idea, although GB seems too coarse. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Remove a redundant condition check
Hello, A one-line change to remove a duplicate check. The duplicate was detected while testing my contribution to a static code analysis tool. There is no functional change; no new tests are needed. Regards, Ádám Balogh CodeChecker Team Ericsson Hungary Attachment: v1-0001-Remove-redundant-condition.patch
Re: Default setting for enable_hashagg_disk
On Thu, 2020-06-25 at 13:17 -0400, Bruce Momjian wrote: > Frankly, if it took me this long to get my head around this, I am > unclear how many people will understand this tuning feature enough to > actually use it. The way I think about it is that v13 HashAgg is much more consistent with the way we do everything else: the planner costs it (including any spilling that is expected), and the executor executes it (including any spilling that is required to obey work_mem). In earlier versions, HashAgg was weird. If we add GUCs to get that weird behavior back, then the GUCs will necessarily be weird, and therefore hard to document. I would feel more comfortable with some kind of GUC escape hatch (or two). GROUP BY is just too common, and I don't think we can ignore the potential for users experiencing a regression of some type (even if, in principle, the v13 version is better). If we have the GUCs there, then at least if someone comes to the mailing list with a problem, we can offer them a temporary solution, and have time to try to avoid the problem in a future release (tweaking estimates, cost model, defaults, etc.). One idea is to have undocumented GUCs. That way we don't have to support them forever, and we are more likely to hear problem reports. Regards, Jeff Davis
Re: Default setting for enable_hashagg_disk
On Wed, 2020-06-24 at 13:29 -0400, Tom Lane wrote: > If we feel we need something to let people have the v12 behavior > back, let's have > (1) enable_hashagg on/off --- controls planner, same as it ever was > (2) enable_hashagg_spill on/off --- controls executor by disabling > spill > > But I'm not really convinced that we need (2). If we're not going to have a planner GUC, one alternative is to just penalize the disk costs of HashAgg for a release or two. It would only affect the cost of HashAgg paths that are expected to spill, which weren't even generated in previous releases. In other words, multiply the disk costs by enough that the planner will usually not choose HashAgg if expected to spill unless the average group size is quite large (i.e. there are a lot more tuples than groups, but still enough groups to spill). As we learn more and optimize more, we can reduce or eliminate the penalty in a future release. I'm not sure exactly what the penalty would be, though. Regards, Jeff Davis
Re: Default setting for enable_hashagg_disk
Hi, On 2020-06-25 10:44:42 -0700, Jeff Davis wrote: > There are only two possible paths: HashAgg and Sort+Group, and we need > to pick one. If the planner expects one to spill, it is likely to > expect the other to spill. If one spills in the executor, then the > other is likely to spill, too. (I'm ignoring the case with a lot of > tuples and few groups because that doesn't seem relevant.) There's also ordered index scan + Group. Which will often be vastly better than Sort+Group, but still slower than HashAgg. > Imagine that there was only one path available to choose. Would you > suggest the same thing, that unexpected spills can exceed work_mem but > expected spills can't? I'm not saying what I propose is perfect, but I've yet to hear a better proposal. Given that there *are* different ways to implement aggregation, and that we use expected costs to choose, I think the assumed costs are relevant. Greetings, Andres Freund
Re: Default setting for enable_hashagg_disk
On Thu, Jun 25, 2020 at 11:02:30AM -0700, Jeff Davis wrote: > If we have the GUCs there, then at least if someone comes to the > mailing list with a problem, we can offer them a temporary solution, > and have time to try to avoid the problem in a future release (tweaking > estimates, cost model, defaults, etc.). > > One idea is to have undocumented GUCs. That way we don't have to > support them forever, and we are more likely to hear problem reports. Uh, our track record of adding GUCs just in case is not good, and removing them is even harder. Undocumented sounds interesting, but then how do we even document when we remove it? I don't think we want to go there. Oracle has done that, and I don't think the user experience is good. Maybe we should just continue through beta, add an incompatibility item to the PG 13 release notes, and see what feedback we get. We know increasing work_mem gets us the exceed-work_mem behavior, but that affects other nodes too, and I can't think of a way to avoid it when spill is predicted except to disable hash agg for that query. I am still trying to get my head around why the spill is going to be so much more work to adjust for in hash agg than in our other spillable nodes. What are people doing for those cases already? Do we have any real-world queries that are a problem in PG 13 for this? -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: Weird failures on lorikeet
On 6/25/20 12:52 PM, Alvaro Herrera wrote: > Is there something going on with lorikeet again? I see this: > > 2020-06-25 01:55:13.380 EDT [5ef43c40.21e0:85] pg_regress/typed_table LOG: > statement: SELECT c.oid::pg_catalog.regclass > FROM pg_catalog.pg_class c, pg_catalog.pg_inherits i > WHERE c.oid = i.inhparent AND i.inhrelid = '21420' > AND c.relkind != 'p' AND c.relkind != 'I' > ORDER BY inhseqno; > *** starting debugger for pid 4456, tid 2644 > 2020-06-25 01:55:13.393 EDT [5ef43c40.21e0:86] pg_regress/typed_table LOG: > statement: SELECT c.oid::pg_catalog.regclass, c.relkind, > pg_catalog.pg_get_expr(c.relpartbound, c.oid) > FROM pg_catalog.pg_class c, pg_catalog.pg_inherits i > WHERE c.oid = i.inhrelid AND i.inhparent = '21420' > ORDER BY pg_catalog.pg_get_expr(c.relpartbound, c.oid) = 'DEFAULT', > c.oid::pg_catalog.regclass::pg_catalog.text; > 1 [main] postgres 4456 try_to_debug: Failed to start debugger, Win32 > error 2 > 2020-06-25 01:55:13.455 EDT [5ef43c40.21e0:87] pg_regress/typed_table LOG: > statement: CREATE TABLE persons3 OF person_type ( > PRIMARY KEY (id), > name NOT NULL DEFAULT '' > ); > *** continuing pid 4456 from debugger call (0) > *** starting debugger for pid 4456, tid 2644 > 48849 [main] postgres 4456 try_to_debug: Failed to start debugger, Win32 > error 2 > *** continuing pid 4456 from debugger call (0) > 2020-06-25 01:55:18.181 EDT [5ef43bd6.2824:4] LOG: server process (PID 4456) > was terminated by signal 11: Segmentation fault > 2020-06-25 01:55:18.181 EDT [5ef43bd6.2824:5] DETAIL: Failed process was > running: drop table some_tab cascade; > I don't know what that's about. I'll reboot the machine presently and see if it persists. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Why forbid "INSERT INTO t () VALUES ();"
On Thu, Jun 25, 2020 at 12:07 PM Tom Lane wrote: > Yeah, the multi-insert case is a plausible reason that hadn't been > mentioned before. On the other hand, you can already do that pretty > painlessly: Sure, but it means if you're writing code to generate queries programmatically, then you have to handle the 0-column case completely differently from all the others. Seems like unnecessary pain for no real reason. I mean, I generally agree that if the standard says that syntax X means Y, we should either make X mean Y, or not support X. But if the standard says that X has no meaning at all, I don't think it's a problem for us to make it mean something logical. If we thought otherwise, we'd have to rip out support for indexes, which would probably not be a winning move. Now, various people, including you and me, have made the point that it's bad to give a meaning to some piece of syntax that is not currently part of the standard but might become part of the standard in the future, because then we might end up with the standard saying that X means one thing and PostgreSQL thinking that it means something else. However, that can quickly turn into an argument against doing anything that we happen not to like, even if the reason we don't like it has more to do with needing a Snickers bar than any underlying engineering reality. In a case like this, it's hard to imagine that () can reasonably mean anything other than a 0-column tuple. It's not impossible that someone could invent another interpretation, and there's been much discussion on this list about how the SQL standards committee is more likely than you'd think to come up with unusual ideas, but I still don't think it's a bad gamble, especially given the MySQL/MariaDB precedent. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Failures with installcheck and low work_mem value in 13~
On Fri, Jun 19, 2020 at 7:48 PM Michael Paquier wrote: > We cared about such plan stability in the past FWIW, see for > example c588df9 as work_mem is a setting that people like to change. > Why should this be different? work_mem is a popular configuration > setting. The RMT met today. We determined that it wasn't worth adjusting this test to pass with non-standard work_mem values. "make installcheck" also fails with lower random_page_cost settings. There doesn't seem to be any reason to tolerate a non-standard setting causing installcheck to fail elsewhere, while not tolerating the same issue here, with work_mem. Thanks -- Peter Geoghegan
Re: Default setting for enable_hashagg_disk
On Thu, Jun 25, 2020 at 11:10:58AM -0700, Jeff Davis wrote: > On Wed, 2020-06-24 at 13:29 -0400, Tom Lane wrote: > > If we feel we need something to let people have the v12 behavior > > back, let's have > > (1) enable_hashagg on/off --- controls planner, same as it ever was > > (2) enable_hashagg_spill on/off --- controls executor by disabling > > spill > > > > But I'm not really convinced that we need (2). > > If we're not going to have a planner GUC, one alternative is to just > penalize the disk costs of HashAgg for a release or two. It would only > affect the cost of HashAgg paths that are expected to spill, which > weren't even generated in previous releases. > > In other words, multiply the disk costs by enough that the planner will > usually not choose HashAgg if expected to spill unless the average > group size is quite large (i.e. there are a lot more tuples than > groups, but still enough groups to spill). Well, the big question is whether this costing is actually more accurate than what we have now. What I am hearing is that spilling hash agg is expensive, so whatever we can do to reflect the actual costs seems like a win. If it can be done, it certainly seems better than a cost setting few people will use. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: Open Item: Should non-text EXPLAIN always show properties?
On Thu, Jun 25, 2020 at 12:33 PM Tom Lane wrote: > > Robert Haas writes: > > On Thu, Jun 25, 2020 at 8:42 AM James Coleman wrote: > >> Yesterday I'd replied [1] to Justin's proposal for this WRT > >> incremental sort and expressed my opinion that including both > >> unnecessarily (i.e., including disk when an in-memory sort was used) > >> is undesirable and confusing and leads to shortcuts I believe to be > >> bad habits when using the data programmatically. > > > +1. > > I think the policy about non-text output formats is "all applicable > fields should be included automatically". But the key word there is > "applicable". Are disk-sort numbers applicable when no disk sort > happened? > > I think the right way to think about this is that we are building > an output data structure according to a schema that should be fixed > for any particular plan shape. If event X happened zero times in > a given execution, but it could have happened in a different execution > of the same plan, then we should print X with a zero count. If X > could not happen period in this plan, we should omit X's entry. > > So the real question here is whether the disk vs memory decision is > plan time vs run time. AFAIK it's run time, which leads me to think > we ought to print the zeroes. Do we print zeroes for memory usage when all sorts ended up spilling to disk then? That might be the current behavior; I'd have to check. Because that's a lie, but we don't have any better information currently (which is unfortunate, but hardly in scope for fixing here.) James
Re: should libpq also require TLSv1.2 by default?
On Wed, Jun 24, 2020 at 9:50 PM Michael Paquier wrote: > Yeah, and I would not be surprised to see cases where people complain > to us about that when attempting to reach one of their old boxes, > breaking some stuff they have been relying on for years by forcing the > addition of a tls_min_server_protocol in the connection string. It is > a more important step that we enforce TLSv1.2 on the server side > actually, and libpq just follows up automatically with that. I wonder how much of a problem this really is. A few quick Google searches suggests that support for TLSv1.2 was added to OpenSSL in v1.0.1, which was released in March 2012. If packagers adopted that version for the following PostgreSQL release, they would have had TLSv1.2 support from PostgreSQL 9.2 onward. Some people may have taken longer to adopt it, but even if they waited a year or two, all versions that they built with older OpenSSL versions would now be out of support. It doesn't seem that likely that there are going to be that many people using pg_dump to upgrade directly from PostgreSQL 9.2 -- or even 9.4 -- to PostgreSQL 13. Skipping six or eight major versions in a single upgrade is a little unusual, in my experience. And even if someone does want to do that, we haven't broken it; it'll still work fine if they are connecting through a UNIX socket, and if not, they can disable SSL or just specify that they're OK with an older protocol version. That doesn't seem like a big deal, especially if we can adopt Tom's suggestion of giving them a warning about what went wrong. Let's also consider the other side of this argument, which is that a decent number of PostgreSQL users are operating in environments where they are required for regulatory compliance to prohibit the use of TLSv1.0 and TLSv1.1. Those people will be happy if that is the default on both the client and the server side by default. 
They will probably be somewhat happy anyway, because now we have an option for it, which we didn't before. But they'll be more happy if it's the default. Now, we can't please everybody here. Is it more important to please people who would like insecure TLS versions disabled by default, or to please people who want to use insecure TLS versions to back up old servers? Seems debatable. Based on my own experience, I'd guess there are more users who want to avoid insecure TLS versions than there are users who want to use them to back up very old servers, so I'd tentatively favor changing the default. However, I don't know whether my experiences are representative. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
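For comparison, Python's `ssl` module exposes the same kind of client-side floor that an `ssl_min_protocol_version=TLSv1.2` libpq default would impose; a minimal sketch of the concept, not libpq code:

```python
import ssl

# Instead of accepting whatever the local OpenSSL build/config allows,
# pin a client-side floor of TLSv1.2 - analogous to libpq defaulting
# ssl_min_protocol_version to TLSv1.2.
ctx = ssl.create_default_context()
ctx.minimum_version = ssl.TLSVersion.TLSv1_2

# A client configured this way refuses TLSv1.0/1.1 servers up front;
# the TLSVersion enum orders protocol versions, so floors are comparable.
assert ctx.minimum_version > ssl.TLSVersion.TLSv1_1
```

This also mirrors the distribution behavior discussed later in the thread: Debian buster and RHEL 8 already impose `MinProtocol = TLSv1.2` at the OpenSSL level, so the effective libpq floor is platform-dependent today.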
Re: Open Item: Should non-text EXPLAIN always show properties?
James Coleman writes: > On Thu, Jun 25, 2020 at 12:33 PM Tom Lane wrote: >> I think the right way to think about this is that we are building >> an output data structure according to a schema that should be fixed >> for any particular plan shape. If event X happened zero times in >> a given execution, but it could have happened in a different execution >> of the same plan, then we should print X with a zero count. If X >> could not happen period in this plan, we should omit X's entry. > Do we print zeroes for memory usage when all sorts ended up spilling > to disk then? I did not claim that the pre-existing code adheres to this model completely faithfully ;-). But we ought to have a clear mental picture of what it is we're trying to achieve. If you don't like the above design, propose a different one. regards, tom lane
Re: Default setting for enable_hashagg_disk
On Thu, Jun 25, 2020 at 03:24:42PM -0400, Bruce Momjian wrote: > On Thu, Jun 25, 2020 at 11:10:58AM -0700, Jeff Davis wrote: > > On Wed, 2020-06-24 at 13:29 -0400, Tom Lane wrote: > > > If we feel we need something to let people have the v12 behavior > > > back, let's have > > > (1) enable_hashagg on/off --- controls planner, same as it ever was > > > (2) enable_hashagg_spill on/off --- controls executor by disabling > > > spill > > > > > > But I'm not really convinced that we need (2). > > > > If we're not going to have a planner GUC, one alternative is to just > > penalize the disk costs of HashAgg for a release or two. It would only > > affect the cost of HashAgg paths that are expected to spill, which > > weren't even generated in previous releases. > > > > In other words, multiply the disk costs by enough that the planner will > > usually not choose HashAgg if expected to spill unless the average > > group size is quite large (i.e. there are a lot more tuples than > > groups, but still enough groups to spill). > > Well, the big question is whether this costing is actually more accurate > than what we have now. What I am hearing is that spilling hash agg is > expensive, so whatever we can do to reflect the actual costs seems like > a win. If it can be done, it certainly seems better than a cost setting > few people will use. It is my understanding that spill of sorts is mostly read sequentially, while hash reads are random. Is that right? Is that not being costed properly? That doesn't fix the misestimation case, but increasing work mem does allow pre-PG 13 behavior there. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: should libpq also require TLSv1.2 by default?
Robert Haas writes: > I wonder how much of a problem this really is. Yeah. I find Robert's points about that pretty persuasive: by now needing to connect to a server without TLSv1.2 support, *and* needing to do so with SSL on, ought to be a tiny niche use case (much smaller than the number of people who would like a more secure default). If we can make the error message about this be reasonably clear then I don't have an objection to changing libpq's default. regards, tom lane
Re: Default setting for enable_hashagg_disk
Hi, On 2020-06-25 14:25:12 -0400, Bruce Momjian wrote: > I am still trying to get my head around why the spill is going to be so > much more work to adjust for hash agg than our other spillable nodes. Aggregates are the classical case used to process large amounts of data. For larger amounts of data sorted input (be it via explicit sort or ordered index scan) isn't an attractive option. IOW hash-agg is the common case. There are also fewer stats for halfway accurately estimating the number of groups and the size of the transition state - a sort / hash join doesn't have an equivalent to the variably sized transition value. > What are people doing for those cases already? Do we have any > real-world queries that are a problem in PG 13 for this? I don't know about real world, but it's pretty easy to come up with examples. query: SELECT a, array_agg(b) FROM (SELECT generate_series(1, 1)) a(a), (SELECT generate_series(1, 1)) b(b) GROUP BY a HAVING array_length(array_agg(b), 1) = 0; work_mem = 4MB: 12: 18470.012 ms, HEAD: 44635.210 ms. HEAD causes ~2.8GB of file IO, 12 doesn't cause any. If you're IO bandwidth constrained, this could be quite bad. Obviously this is contrived, and a pretty extreme case. But imagine this happening on a system where disk IO isn't super fast (e.g. just about any cloud provider). An even more extreme version of the above is this: query: SELECT a, array_agg(b) FROM (SELECT generate_series(1, 5)) a(a), (SELECT generate_series(1, 1)) b(b) GROUP BY a HAVING array_length(array_agg(b), 1) = 0; work_mem = 16MB: 12: 81598.965 ms, HEAD: 210772.360 ms; temporary tablespace on magnetic disk (raid 0 of two 7.2k server spinning disks): 12: 81136.530 ms, HEAD: 225182.560 ms. The disks are busy in some periods, but still keep up. 
If, however, I make the transition state a bit bigger: query: SELECT a, array_agg(b), count(c), max(d),max(e) FROM (SELECT generate_series(1, 1)) a(a), (SELECT generate_series(1, 5000)::text, repeat(random()::text, 10), repeat(random()::text, 10), repeat(random()::text, 10)) b(b,c,d,e) GROUP BY a HAVING array_length(array_agg(b), 1) = 0; 12: 28164.865 ms; fast SSD: HEAD 92520.680 ms; magnetic: HEAD 183968.538 ms (no reads, there's plenty of memory. Just writes because the age / amount thresholds for dirty data are reached) In the magnetic case we're IO bottlenecked nearly the whole time. Just to be clear: I think this is a completely over-the-top example. But I do think it shows the problem to some degree at least. Greetings, Andres Freund
Re: CUBE_MAX_DIM
> From: Tom Lane > Sent: 25 June 2020 17:43 > > Alastair McKinley writes: > > I know that Cube in its current form isn't suitable for nearest-neighbour > > searching these vectors in their raw form (I have tried recompilation with > > higher CUBE_MAX_DIM myself), but conceptually kNN GiST searches using Cubes > > can be useful for these applications. There are other pre-processing > > techniques that can be used to improve the speed of the search, but it > > still ends up with a kNN search in a high-ish dimensional space. > > Is there a way to fix the numerical instability involved? If we could do > that, then we'd definitely have a use-case justifying the work to make > cube toastable. I am not that familiar with the nature of the numerical instability, but it might be worth noting for additional context that for the NN use case:
- The value of each dimension is likely to be between 0 and 1
- The L1 distance is meaningful for high numbers of dimensions, which *possibly* suffers less from the numeric issues than euclidean distance.
The numerical stability isn't the only issue for high-dimensional kNN; the GiST search performance currently degrades with increasing N towards sequential scan performance, although maybe they are related? > regards, tom lane Best regards, Alastair
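Alastair's L1-distance point can be sketched outside of cube; this is a plain-Python illustration, not the contrib module's arithmetic. L1 avoids the square/square-root round trip of Euclidean distance, which is one reason it *might* behave better numerically in high dimensions:

```python
import math
import random

def l1(a, b):
    # Taxicab distance: a plain sum of absolute differences,
    # no squaring and no square root.
    return sum(abs(x - y) for x, y in zip(a, b))

def l2(a, b):
    # Euclidean distance: each term is squared before summing,
    # then the sum is rooted - two extra rounding steps per term.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Vectors like those described: many dimensions, each value in [0, 1].
random.seed(0)
dims = 512
a = [random.random() for _ in range(dims)]
b = [random.random() for _ in range(dims)]
```

For any pair of points, the L1 distance is at least the L2 distance; whether it actually sidesteps cube's instability would need testing against the extension itself.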
Re: Default setting for enable_hashagg_disk
On Thu, 2020-06-25 at 15:56 -0400, Bruce Momjian wrote: > It is my understanding that spill of sorts is mostly read > sequentially, > while hash reads are random. Is that right? Is that not being > costed > properly? I don't think there's a major problem with the cost model, but it could probably use some tweaking. Hash writes are random. The hash reads should be mostly sequential (for large partitions it will be 128-block extents, or 1MB). The cost model assumes 50% sequential and 50% random. Sorts are written sequentially and read randomly, but there's prefetching to keep the reads from being too random. The cost model assumes 75% sequential and 25% random. Overall, the IO pattern is better for Sort, but not dramatically so. Tomas Vondra did some nice analysis here: https://www.postgresql.org/message-id/20200525021045.dilgcsmgiu4l5jpa@development That resulted in getting the prealloc and projection patches in. Regards, Jeff Davis
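As a back-of-envelope check on "better for Sort, but not dramatically so", the sequential/random mixes Jeff describes can be blended with PostgreSQL's default page costs. This illustrates the stated assumptions only; it is not the planner's actual costing code:

```python
# PostgreSQL's default per-page IO costs.
seq_page_cost = 1.0
random_page_cost = 4.0

def blended(seq_fraction):
    # Weighted per-page cost for a read pattern that is part
    # sequential, part random.
    return (seq_fraction * seq_page_cost
            + (1 - seq_fraction) * random_page_cost)

hashagg_read = blended(0.50)  # assumed 50% sequential / 50% random
sort_read = blended(0.75)     # assumed 75% sequential / 25% random
```

Under these assumptions a spilled HashAgg read costs about 2.5 per page versus 1.75 for Sort, roughly a 1.4x difference per page read, consistent with "better, but not dramatically so".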
Re: create database with template doesn't copy database ACL
On Tue, Jun 16, 2020 at 06:10:54AM -0400, Bruce Momjian wrote: > On Mon, Jun 15, 2020 at 10:10:32AM -0400, Bruce Momjian wrote: > > On Mon, Jun 15, 2020 at 12:14:55AM -0400, Tom Lane wrote: > > > Bruce Momjian writes: > > > > Well, I thought we copied everything except things that can be specified > > > > as different in CREATE DATABASE, though I can see why we would not copy > > > > them. Should we document this or issue a notice about not copying > > > > non-default database attributes? > > > > > > We do not need a notice for behavior that (a) has stood for twenty years > > > or so, and (b) is considerably less broken than any alternative would be. > > > If you feel the docs need improvement, have at that. > > > > Well, I realize it has been this way for a long time, and that no one > > else has complained, but there should be a way for people to know what > > is being copied from the template and what is not. Do we have a clear > > description of what is copied and skipped? > > We already mentioned that ALTER DATABASE settings are not copied, so the > attached patch adds a mention that GRANT-level permissions are not > copied either. Patch applied to all supported versions. Thanks for the discussion. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
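A minimal sketch of the documented behavior (database and role names here are hypothetical):

```sql
-- Database-level privileges granted on a template database are NOT
-- copied by CREATE DATABASE ... TEMPLATE, just like ALTER DATABASE
-- settings.
CREATE DATABASE templ_db;
GRANT CONNECT, TEMPORARY ON DATABASE templ_db TO app_user;

-- new_db starts with the default database ACL, not templ_db's;
-- any GRANTs must be repeated on new_db if they are wanted there.
CREATE DATABASE new_db TEMPLATE templ_db;
```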
Re: should libpq also require TLSv1.2 by default?
Daniel Gustafsson writes: >> On 24 Jun 2020, at 10:46, Magnus Hagander wrote: >> It might also be worth noting that it's not really "any protocol version", >> it means it will be "whatever the openssl configuration says", I think? For >> example, debian buster sets: >> >> [system_default_sect] >> MinProtocol = TLSv1.2 >> >> Which I believe means that if your libpq app is running on debian buster, it >> will be min v1.2 already > Correct, that being said I'm not sure how common it is for distributions to > set > a default protocol version. The macOS versions I have handy doesn't enforce a > default version, nor does Ubuntu 20.04, FreeBSD 12 or OpenBSD 6.5 AFAICT. Yeah, this. I experimented with connecting current libpq to a 9.2-vintage server I'd built with openssl 0.9.8x, and was surprised to find I couldn't do so unless I explicitly set "ssl_min_protocol_version=tlsv1". After some digging I verified that that's because RHEL8's openssl.cnf sets MinProtocol = TLSv1.2 MaxProtocol = TLSv1.3 Interestingly, Fedora 32 is laxer: MinProtocol = TLSv1 MaxProtocol = TLSv1.3 I concur with Daniel's finding that current macOS and FreeBSD don't enforce anything in this area. Nonetheless, for a pretty significant fraction of the world, our claim that the default is to not enforce any minimum protocol version is a lie. My feeling now is that we'd be better off defaulting ssl_min_protocol_version to something nonempty, just to make this behavior platform-independent. We certainly can't leave the docs as they are. Also, I confirm that the failure looks like $ psql -h ... -d "dbname=postgres sslmode=require" psql: error: could not connect to server: SSL error: unsupported protocol While that's not *that* awful, if you realize that "protocol" means TLS version, many people probably won't without a hint. It does not help any that the message doesn't mention either the offered TLS version or the version limits being enforced. 
I'm not sure we can do anything about the former, but reducing the number of variables affecting the latter seems like a smart idea. BTW, the server-side report of the problem looks like LOG: could not accept SSL connection: wrong version number regards, tom lane
Re: Default setting for enable_hashagg_disk
On 2020-Jun-25, Andres Freund wrote: > > What are people doing for those cases already? Do we have any > > real-world queries that are a problem in PG 13 for this? > > I don't know about real world, but it's pretty easy to come up with > examples. > > query: > SELECT a, array_agg(b) FROM (SELECT generate_series(1, 1)) a(a), (SELECT > generate_series(1, 1)) b(b) GROUP BY a HAVING array_length(array_agg(b), > 1) = 0; > > work_mem = 4MB > > 12: 18470.012 ms > HEAD: 44635.210 ms > > HEAD causes ~2.8GB of file IO, 12 doesn't cause any. If you're IO > bandwidth constrained, this could be quite bad. ... however, you can pretty much get the previous performance back by increasing work_mem. I just tried your example here, and I get 32 seconds of runtime for work_mem 4MB, and 13.5 seconds for work_mem 1GB (this one spills about 800 MB); if I increase that again to 1.7GB I get no spilling and 9 seconds of runtime. (For comparison, 12 takes 15.7 seconds regardless of work_mem). My point here is that maybe we don't need to offer a GUC to explicitly turn spilling off; it seems sufficient to let users change work_mem so that spilling will naturally not occur. Why do we need more? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Default setting for enable_hashagg_disk
Hi, On June 25, 2020 3:44:22 PM PDT, Alvaro Herrera wrote: >On 2020-Jun-25, Andres Freund wrote: > >> > What are people doing for those cases already? Do we have any >> > real-world queries that are a problem in PG 13 for this? >> >> I don't know about real world, but it's pretty easy to come up with >> examples. >> >> query: >> SELECT a, array_agg(b) FROM (SELECT generate_series(1, 1)) a(a), >(SELECT generate_series(1, 1)) b(b) GROUP BY a HAVING >array_length(array_agg(b), 1) = 0; >> >> work_mem = 4MB >> >> 12: 18470.012 ms >> HEAD: 44635.210 ms >> >> HEAD causes ~2.8GB of file IO, 12 doesn't cause any. If you're IO >> bandwidth constrained, this could be quite bad. > >... however, you can pretty much get the previous performance back by >increasing work_mem. I just tried your example here, and I get 32 >seconds of runtime for work_mem 4MB, and 13.5 seconds for work_mem 1GB >(this one spills about 800 MB); if I increase that again to 1.7GB I get >no spilling and 9 seconds of runtime. (For comparison, 12 takes 15.7 >seconds regardless of work_mem). > >My point here is that maybe we don't need to offer a GUC to explicitly >turn spilling off; it seems sufficient to let users change work_mem so >that spilling will naturally not occur. Why do we need more? That's not really a useful escape hatch, because it'll often lead to other nodes using more memory. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: min_safe_lsn column in pg_replication_slots view
On Thu, Jun 25, 2020 at 11:24:27AM -0400, Alvaro Herrera wrote: > I don't understand the proposal. Michael posted a patch that adds > pg_wal_oldest_lsn(), and you say we should apply the patch except the > part that adds that function -- so what part would we be applying? Last week I sent a patch covering only the removal of min_safe_lsn: https://www.postgresql.org/message-id/20200619121552.gh453...@paquier.xyz So this applies to that part. > If the proposal is to apply just the hunk in pg_get_replication_slots > that removes min_safe_lsn, and do nothing else in pg13, then I don't like > it. The feature exposes a way to monitor slots w.r.t. the maximum slot > size; I'm okay if you prefer to express that in a different way, but I > don't like the idea of shipping pg13 without any way to monitor it. From what I can see, it seems to me that we have a lot of views on how to tackle the matter. That suggests that we are not really happy with the current state of things, which is usually a sign that we may want to redesign it, going back to this issue for v14. My 2c. -- Michael
Re: should libpq also require TLSv1.2 by default?
On Thu, Jun 25, 2020 at 06:44:05PM -0400, Tom Lane wrote: > BTW, the server-side report of the problem looks like > > LOG: could not accept SSL connection: wrong version number I like that one. ;-) -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: Default setting for enable_hashagg_disk
On 2020-Jun-25, Andres Freund wrote: > >My point here is that maybe we don't need to offer a GUC to explicitly > >turn spilling off; it seems sufficient to let users change work_mem so > >that spilling will naturally not occur. Why do we need more? > > That's not really a useful escape hatch, because it'll often lead to > other nodes using more memory. Ah -- other nodes in the same query -- you're right, that's not good. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Default setting for enable_hashagg_disk
On Thu, Jun 25, 2020 at 02:28:02PM -0700, Jeff Davis wrote: On Thu, 2020-06-25 at 15:56 -0400, Bruce Momjian wrote: It is my understanding that spill of sorts is mostly read sequentially, while hash reads are random. Is that right? Is that not being costed properly? I don't think there's a major problem with the cost model, but it could probably use some tweaking. Hash writes are random. The hash reads should be mostly sequential (for large partitions it will be 128-block extents, or 1MB). The cost model assumes 50% sequential and 50% random. The important bit here is that while the logical writes are random, those are effectively combined in page cache and the physical writes are pretty sequential. So I think the cost model is fairly reasonable. Note: Judging by iosnoop stats shared in the thread linked by Jeff. Sorts are written sequentially and read randomly, but there's prefetching to keep the reads from being too random. The cost model assumes 75% sequential and 25% random. Overall, the IO pattern is better for Sort, but not dramatically so. Tomas Vondra did some nice analysis here: https://www.postgresql.org/message-id/20200525021045.dilgcsmgiu4l5jpa@development That resulted in getting the prealloc and projection patches in. Regards, Jeff Davis regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Default setting for enable_hashagg_disk
On Thu, Jun 25, 2020 at 11:16:23AM -0700, Andres Freund wrote: Hi, On 2020-06-25 10:44:42 -0700, Jeff Davis wrote: There are only two possible paths: HashAgg and Sort+Group, and we need to pick one. If the planner expects one to spill, it is likely to expect the other to spill. If one spills in the executor, then the other is likely to spill, too. (I'm ignoring the case with a lot of tuples and few groups because that doesn't seem relevant.) There's also ordered index scan + Group. Which will often be vastly better than Sort+Group, but still slower than HashAgg. Imagine that there was only one path available to choose. Would you suggest the same thing, that unexpected spills can exceed work_mem but expected spills can't? I'm not saying what I propose is perfect, but I've yet to hear a better proposal. Given that there *are* different ways to implement aggregation, and that we use expected costs to choose, I think the assumed costs are relevant. I share Jeff's opinion that this is quite counter-intuitive and we'll have a hard time explaining it to users. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Default setting for enable_hashagg_disk
On Thu, Jun 25, 2020 at 09:42:33AM -0700, Jeff Davis wrote: On Wed, 2020-06-24 at 12:31 -0700, Andres Freund wrote: nodeAgg.c already treats those separately: void hash_agg_set_limits(double hashentrysize, uint64 input_groups, int used_bits, Size *mem_limit, uint64 *ngroups_limit, int *num_partitions) { int npartitions; Size partition_mem; /* if not expected to spill, use all of work_mem */ if (input_groups * hashentrysize < work_mem * 1024L) { if (num_partitions != NULL) *num_partitions = 0; *mem_limit = work_mem * 1024L; *ngroups_limit = *mem_limit / hashentrysize; return; } The reason this code exists is to decide how much of work_mem to set aside for spilling (each spill partition needs an IO buffer). The alternative would be to fix the number of partitions before processing a batch, which didn't seem ideal. Or, we could just ignore the memory required for IO buffers, like HashJoin. I think the conclusion from the recent HashJoin discussions is that not accounting for BufFiles is an issue, and we want to fix it. So repeating that for HashAgg would be a mistake, IMHO. Granted, this is an example where an underestimate can give an advantage, but I don't think we want to extend the concept into other areas. I agree. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: min_safe_lsn column in pg_replication_slots view
On 2020-Jun-26, Michael Paquier wrote: > On Thu, Jun 25, 2020 at 11:24:27AM -0400, Alvaro Herrera wrote: > > I don't understand the proposal. Michael posted a patch that adds > > pg_wal_oldest_lsn(), and you say we should apply the patch except the > > part that adds that function -- so what part would we be applying? > > I have sent last week a patch about only the removal of min_safe_lsn: > https://www.postgresql.org/message-id/20200619121552.gh453...@paquier.xyz > So this applies to this part. Well, I oppose that because it leaves us with no way to monitor slot limits. In his opening email, Masao-san proposed to simply change the value by adding 1. How you go from adding 1 to a column to removing the column completely, with no recourse, is beyond me. Let me summarize the situation and possible ways forward as I see them. If I'm mistaken, please correct me.

Problems:
i) pg_replication_slots.min_safe_lsn has a weird definition in that all replication slots show the same value
ii) min_safe_lsn cannot be used with pg_walfile_name, because it returns the name of the previous segment.

Proposed solutions:
a) Do nothing -- keep the min_safe_lsn column as is. Warn users that pg_walfile_name should not be used with this column.
b) Redefine min_safe_lsn to be lsn+1, so that pg_walfile_name can be used and return a useful value.
c) Remove min_safe_lsn; add functions that expose the same value
d) Remove min_safe_lsn; add a new view that exposes the same value and possibly others
e) Replace min_safe_lsn with a "distance" column, which reports restart_lsn - oldest valid LSN (Note that you no longer have an LSN in this scenario, so you can't call pg_walfile_name.)

The original patch implemented (e); it was changed to its current definition because of this[1] comment. My proposal is to put it back. [1] https://postgr.es/m/20171106132050.6apzynxrqrzgh...@alap3.anarazel.de -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Enabling B-Tree deduplication by default
On Thu, Jan 30, 2020 at 11:40 AM Peter Geoghegan wrote: > I think that I should commit the patch without the GUC tentatively. > Just have the storage parameter, so that everyone gets the > optimization without asking for it. We can then review the decision to > enable deduplication generally after the feature has been in the tree > for several months. This is how things work in the committed patch (commit 0d861bbb): There is a B-Tree storage parameter named deduplicate_items, which is enabled by default. In general, users will get deduplication unless they opt out, including in unique indexes (though note that we're more selective about triggering a deduplication pass in unique indexes [1]). > There is no need to make a final decision about whether or not the > optimization gets enabled before committing the patch. It's now time to make a final decision on this. Does anyone have any reason to believe that leaving deduplication enabled by default is the wrong way to go? Note that using deduplication isn't strictly better than not using deduplication for all indexes in all workloads; that's why it's possible to disable the optimization. This thread has lots of information about the reasons why enabling deduplication by default seems appropriate, all of which still apply. Note that there have been no bug reports involving deduplication since it was committed on February 26th, with the exception of some minor issues that I reported and fixed. The view of the RMT is that the feature should remain enabled by default (i.e. no changes are required). Of course, I am a member of the RMT this year, as well as one of the authors of the patch. I am hardly an impartial voice here. I believe that that did not sway the decision-making process of the RMT in this instance. If there are no objections in the next week or so, then I'll close out the relevant open item. 
[1] https://www.postgresql.org/docs/devel/btree-implementation.html#BTREE-DEDUPLICATION -- See "Tip" -- Peter Geoghegan
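A quick sketch of the opt-out Peter mentions (index and table names here are illustrative):

```sql
-- Deduplication is on by default; opting out is per index via the
-- deduplicate_items storage parameter.
CREATE INDEX tab_val_idx ON tab (val) WITH (deduplicate_items = off);

-- It can also be toggled later; the new setting affects how future
-- leaf pages are filled rather than rewriting existing ones.
ALTER INDEX tab_val_idx SET (deduplicate_items = on);
```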
Re: Default setting for enable_hashagg_disk
On Thu, Jun 25, 2020 at 01:17:56PM -0400, Bruce Momjian wrote: On Thu, Jun 25, 2020 at 11:46:54AM -0400, Robert Haas wrote: On Wed, Jun 24, 2020 at 7:38 PM Bruce Momjian wrote: > I think my main point is that work_mem was not being honored for > hash-agg before, but now that PG 13 can do it, we are again allowing > work_mem not to apply in certain cases. I am wondering if our hard > limit for work_mem is the issue, and we should make that more flexible > for all uses. I mean, that's pretty much what we're talking about here, isn't it? It seems like in your previous two replies you were opposed to separating the plan-type limit from the execution-time limit, but that idea is precisely a way of being more flexible (and extending it to other plan nodes is a way of making it more flexible for more use cases). I think it was Tom who was complaining about plan vs. execution time control. As I think you know, if you have a system where the workload varies a lot, you may sometimes be using 0 copies of work_mem and at other times 1000 or more copies, so the value has to be chosen conservatively as a percentage of system memory, else you start swapping or the OOM killer gets involved. On the other hand, some plan nodes get a lot less efficient when the amount of memory available falls below some threshold, so you can't just set this to a tiny value and forget about it. Because the first problem is so bad, most people set the value relatively conservatively and just live with the performance consequences. But this also means that they have memory left over most of the time, so the idea of letting a node burst above its work_mem allocation when something unexpected happens isn't crazy: as long as only a few nodes do that here and there, rather than, say, all the nodes doing it all at the same time, it's actually fine. 
If we had a smarter system that could dole out more work_mem to nodes that would really benefit from it and less to nodes where it isn't likely to make much difference, that would be similar in spirit but even better.

I think the issue is that in PG 13 work_mem controls sorts and hashes with a new hard limit for hash aggregation:

https://www.postgresql.org/docs/12/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY

    Sort operations are used for ORDER BY, DISTINCT, and merge joins. Hash
    tables are used in hash joins, hash-based aggregation, and hash-based
    processing of IN subqueries.

In pre-PG 13, we "controlled" it by avoiding hash-based aggregation if it was expected to exceed work_mem, but if we assumed it would be less than work_mem and it was more, we exceeded the work_mem allocation for that node. In PG 13, we "limit" memory to work_mem and spill to disk if we exceed it. We should really have always documented that hash agg could exceed work_mem on misestimation, and if we add a hash_agg work_mem misestimation bypass setting we should document this setting in work_mem as well.

I don't think that would change anything, really. For the users the consequences would still be exactly the same, and they wouldn't even be in a position to check if they are affected. So just documenting that hashagg does not respect work_mem at runtime would be nice, but it would not make any difference for v13, just like documenting a bug is not really the same thing as fixing it.

But then the question is why do we allow this bypass only for hash agg? Should work_mem have settings for ORDER BY, merge join, hash join, and hash agg, e.g.:

    work_mem = 'order_by=10MB, hash_join=20MB, hash_agg=100MB'

Yeah, crazy syntax, but you get the idea. I understand some nodes are more sensitive to disk spill than others, so shouldn't we be controlling this at the work_mem level, rather than for a specific node type like hash agg?
We could allow for misestimation over-allocation of hash agg work_mem by splitting up the hash agg values:

    work_mem = 'order_by=10MB, hash_join=20MB, hash_agg=100MB hash_agg_max=200MB'

but _avoiding_ hash agg if it is estimated to exceed work_mem and spill to disk is not something to logically control at the work_mem level, which leads to something like David Rowley suggested, but with different names:

    enable_hashagg = on | soft | avoid | off

where 'on' and 'off' are the current PG 13 behavior, 'soft' means to treat work_mem as a soft limit and allow it to be exceeded on misestimation, and 'avoid' means to avoid hash agg if it is estimated to exceed work_mem. Neither 'soft' nor 'avoid' spills to disk.

David's original terms of "trynospill" and "neverspill" were focused on spilling, not on its interaction with work_mem, and I found that confusing. Frankly, if it took me this long to get my head around this, I am unclear how many people will understand this tuning feature enough to actually use it.

Yeah. I agree with Andres that this may be a real issue, and that adding some sort of "escape hatch" for v1
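For readers following along, the v13 spill behavior being debated is easy to observe directly. A minimal sketch (the query and sizes here are invented for illustration; the exact plan shape and disk usage will vary by system):

```sql
-- With a deliberately small work_mem, a hash aggregate that no longer
-- fits in memory spills to disk in v13 instead of overrunning the limit.
SET work_mem = '1MB';

EXPLAIN (ANALYZE, COSTS OFF)
SELECT g % 100000 AS k, count(*)
FROM generate_series(1, 1000000) AS g
GROUP BY 1;
-- In v13 the HashAggregate node reports "Peak Memory Usage" and, when it
-- spills, "Disk Usage"; in v12 and earlier the same node would simply
-- exceed work_mem when the planner's group-count estimate was too low.
```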
Re: xid wraparound danger due to INDEX_CLEANUP false
On Thu, Jun 25, 2020 at 6:59 AM Masahiko Sawada wrote:
> I think that with the approach implemented in my patch, it could be a
> problem for the user that the user cannot easily know in advance
> whether vacuum with INDEX_CLEANUP false will perform index cleanup,
> even if page deletion doesn't happen in most cases.

I was unclear. I agree that the VACUUM command with "INDEX_CLEANUP = off" is an emergency mechanism that should be fully respected, even when that means that we'll leak deleted pages.

Perhaps it would make sense to behave differently when the index is on a table that has "vacuum_index_cleanup = off" set, and the vacuum is started by autovacuum, and is not an anti-wraparound vacuum. That doesn't seem all that appealing now that I write it down, though, because it's a non-obvious behavioral difference among cases that users probably expect to behave similarly. On the other hand, what user knows that there is something called an aggressive vacuum, which isn't exactly the same thing as an anti-wraparound vacuum? I find it hard to decide what the least-worst thing is for the back branches. What do you think?

> I haven't come up with a good solution to keep us safe against XID
> wraparound yet, but it seems to me that it's better to have an option
> that forces index cleanup not to happen.

I don't think that there is a good solution that is suitable for backpatching. The real solution is to redesign the recycling along the lines I described. I don't think that it's terrible that we can leak deleted pages, especially considering the way that users are expected to use the INDEX_CLEANUP feature. I would like to be sure that the problem is well understood, though -- we should at least have a plan for Postgres v14.

> I thought that btbulkdelete and/or btvacuumcleanup can register an
> autovacuum work item to recycle the page that gets deleted, but it
> might not be able to recycle those pages enough because the autovacuum
> work items could be taken just after vacuum.
And if page deletion is
> relatively rare in practice, we might be able to take an
> optimistic approach where vacuum registers deleted pages in the FSM on
> deletion, and a process that takes a free page checks whether the page
> is really recyclable. Anyway, I'll try to think more about this.

Right -- just putting the pages in the FSM immediately, and making it a problem that we deal with within _bt_getbuf(), is an alternative approach that might be better.
-- Peter Geoghegan
Re: Avoiding hash join batch explosions with extreme skew and weird stats
On Thu, Jun 25, 2020 at 03:09:44PM -0700, Melanie Plageman wrote: On Tue, Jun 23, 2020 at 3:24 PM Tomas Vondra wrote:

I started looking at the patch to refresh my knowledge both of this patch and parallel hash join, but I think it needs a rebase. The changes in 7897e3bb90 apparently touched some of the code.

Thanks so much for the review, Tomas! I've attached a rebased patch which also contains updates discussed below.

Thanks. I assume you're working on a patch addressing the remaining TODOs, right?

I wanted to get some feedback on the patch before working through the TODOs to make sure I was on the right track. Now that you are reviewing this, I will focus all my attention on addressing your feedback. If there are any TODOs that you feel are most important, let me know, so I can start with those. Otherwise, I will prioritize parallel batch 0 spilling.

Feel free to work on the batch 0 spilling, please. I still need to get familiar with various parts of the parallel hash join etc. so I don't have any immediate feedback on which TODOs to work on first.

David Kimura plans to do a bit of work on parallel hash join batch 0 spilling tomorrow. Whatever is left after that, I will pick up next week. Parallel hash join batch 0 spilling is the last large TODO that I had. My plan was to then focus on the feedback (either about which TODOs are most important or outside of the TODOs I've identified) I get from you and anyone else who reviews this.

OK. I see you've switched to "stripe" naming - I find that a bit confusing, because when I hear stripe I think about RAID, where it means pieces of data interleaved and stored on different devices. But maybe that's just me and it's a good name. Maybe it'd be better to keep the naming and only tweak it at the end, not to disrupt reviews unnecessarily.

I hear you about "stripe".
I still quite like it, especially as compared to its predecessor (originally, I called them chunks -- which is impossible given that SharedTuplestoreChunks are a thing).

I don't think using chunks in one place means we can't use it elsewhere in a different context. I'm sure we have "chunks" in other places. But let's not bikeshed on this too much. For ease of review, as you mentioned, I will keep the name for now. I am open to changing it later, though. I've been soliciting ideas for alternatives and, so far, folks have suggested "stride", "step", "flock", "herd", "cohort", and "school". I'm still on team "stripe" though, as it stands. ;-)

nodeHash.c
--
1) MultiExecPrivateHash says this

/*
 * Not subject to skew optimization, so either insert normally
 * or save to batch file if it belongs to another stripe
 */

I wonder what it means to "belong to another stripe". I understand what that means for batches, which are identified by batchno computed from the hash value. But I thought "stripes" are just work_mem-sized pieces of a batch, so I don't quite understand this. Especially when the code does not actually check "which stripe" the row belongs to.

I agree this was confusing. "belongs to another stripe" meant here that if batch 0 falls back and we are still loading it, once we've filled up work_mem, we need to start saving those tuples to a spill file for batch 0. I've changed the comment to this:

-* or save to batch file if it belongs to another stripe
+ * or save to batch file if batch 0 falls back and we have
+ * already filled the hashtable up to space_allowed.

OK. Silly question - what does "batch 0 falls back" mean? Does it mean that we realized the hash table for batch 0 would not fit into work_mem, so we switched to the "hashloop" strategy?

2) I find the fields hashloop_fallback rather confusing. We have one in HashJoinTable (and its array of BufFile items) and another one in ParallelHashJoinBatch (this time just bool).
I think the one in HashJoinTable should be renamed to hashloopBatchFile (similarly to the other BufFile arrays).

I think you are right about the name. I've changed the name in HashJoinTableData to hashloopBatchFile. The array of BufFiles hashloop_fallback was only used by serial hashjoin. The boolean hashloop_fallback variable is used only by parallel hashjoin. The reason I had them named the same thing is that I thought it would be nice to have a variable with the same name to indicate if a batch "fell back" for both parallel and serial hashjoin--especially since we check it in the main hashjoin state machine used by parallel and serial hashjoin. In serial hashjoin, the BufFiles aren't identified by name, so I kept them in that array. In parallel hashjoin, each ParallelHashJoinBatch has the status saved (in the struct). So, both represented the fall back status of a batch. However, I agree with you, so I've renamed the serial one to hashloopBatchFile.

OK

Although I'm not sure why we even need this file, when we have innerBatchFile? BufFile(s) are not exactly free, in fact i
Re: xid wraparound danger due to INDEX_CLEANUP false
On Thu, Jun 25, 2020 at 8:28 AM Robert Haas wrote:
> I'm not really convinced. I agree that from a theoretical point of
> view an index can have arbitrary needs and is the arbiter of its own
> needs, but when I pull the emergency brake, I want the vehicle to
> stop, not think about stopping.

Making this theoretical argument in the context of this discussion was probably a mistake. I agree that this is the emergency brake, and it needs to work like one.

It might be worth considering some compromise in the event of using the "vacuum_index_cleanup" reloption (i.e. when the user has set it to 'off'), provided there is good reason to believe that we're not in an emergency -- I mentioned this to Masahiko just now. I admit that that isn't very appealing for other reasons, but it's worth considering a way of ameliorating the problem in back branches.

We really ought to change how recycling works, so that it happens at the tail end of the same VACUUM operation that deleted the pages -- but that cannot be backpatched. It might be that the most appropriate mitigation in the back branches is a log message that reports on the fact that we've probably leaked pages due to this issue. Plus some documentation. Though even that would require calling nbtree to check if that is actually true (by checking the metapage), so it still requires backpatching something close to Masahiko's draft patch.

> I don't think I believe this. All you need is one small range-deletion, right?

Right.

> > Then there's question 2. The intuition behind the approach from
> > Sawada-san's patch was that allowing wraparound here feels wrong -- it
> > should be in the AM's hands. However, it's not like I can point to
> > some ironclad guarantee about not leaking deleted pages that existed
> > before the INDEX_CLEANUP feature. We know that the FSM is not crash
> > safe, and that's that. Is this really all that different? Maybe it is,
> > but it seems like a quantitative difference to me.
> > I don't think I believe this, either. In the real-world example to > which you linked, the user ran REINDEX afterward to recover from index > bloat, and we could advise other people who use this option that it > may leak space that a subsequent VACUUM may fail to recover, and > therefore they too should consider REINDEX. I was talking about the intuition behind the design. I did not intend to suggest that nbtree should ignore "INDEX_CLEANUP = off" regardless of the consequences. I am sure about this much: The design embodied by Masahiko's patch is clearly a better one overall, even if it doesn't fix the problem on its own. I agree that we cannot allow nbtree to ignore "INDEX_CLEANUP = off", even if that means leaking pages that could otherwise be recycled. I'm not sure what we should do about any of this in the back branches, though. I wish I had a simple idea about what to do there. -- Peter Geoghegan
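For reference, the user-facing knobs discussed in this thread look as follows (the table name is hypothetical; the INDEX_CLEANUP option and the vacuum_index_cleanup reloption exist as of PostgreSQL 12):

```sql
-- Emergency use: skip index vacuuming for one run.
VACUUM (INDEX_CLEANUP off, VERBOSE) my_table;

-- Persistent form: autovacuum also skips index cleanup for this table.
ALTER TABLE my_table SET (vacuum_index_cleanup = off);

-- If deleted pages were leaked and bloat accumulates, the recovery
-- path mentioned upthread is a rebuild:
REINDEX TABLE my_table;
```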
Re: Missing some ifndef FRONTEND at the top of logging.c and file_utils.c
On Thu, Jun 25, 2020 at 11:15:03AM +0200, Daniel Gustafsson wrote: > That makes sense, logging.c and file_utils.c are indeed only part of > libpgcommon.a and should only be compiled for frontend. Thanks. This list is provided by OBJS_FRONTEND in src/common/Makefile, and pgcommonfrontendfiles in Mkvcbuild.pm. Let's see if others have comments, as it just looks like something that was forgotten in bf5bb2e and fc9a62a when this code was moved to src/common/. If there are no objections, I'll revisit that some time next week and fix it on HEAD. -- Michael signature.asc Description: PGP signature
Re: Failures with installcheck and low work_mem value in 13~
On Thu, Jun 25, 2020 at 12:15:54PM -0700, Peter Geoghegan wrote: > The RMT met today. We determined that it wasn't worth adjusting this > test to pass with non-standard work_mem values. Okay, thanks for the feedback. We'll see how it works out. -- Michael signature.asc Description: PGP signature
Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path
On Fri, 26 Jun 2020 at 04:35, Andres Freund wrote:
> On 2020-06-25 13:50:37 +1200, David Rowley wrote:
> > In the attached, I've come up with a way that works. Basically I've
> > just added a function named errstart_cold() that is attributed with
> > __attribute__((cold)), which will hint to the compiler to keep
> > branches which call this function in a cold path.
>
> I recall you trying this before? Has that gotten easier because we
> evolved ereport()/elog(), or has gcc become smarter, or ...?

Yeah, I appear to have tried it and found it not to work in [1]. I can only assume GCC got smarter in regards to how it marks a path as cold. Previously it seemed not to, due to the do/while(0). I'm pretty sure back when I last tested that ditching the do while made it work, just that we can't really get rid of it.

> > To make the compiler properly mark just >= ERROR calls as cold, and
> > only when the elevel is a constant, I modified the ereport_domain
> > macro to do:
> >
> > if (__builtin_constant_p(elevel) && (elevel) >= ERROR ? \
> >     errstart_cold(elevel, domain) : \
> >     errstart(elevel, domain)) \
>
> I think it'd be good to not just do this for ERROR, but also for <=
> DEBUG1. I recall seeing quite a few debug elogs that made the code worse
> just by "being there".

I think that case is different. We don't want to move the entire elog path into the cold path for that.
We'd only want to hint that errstart is unlikely to return true if elevel <= DEBUG1.

> > diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
> > index e976201030..8076e8af24 100644
> > --- a/src/backend/utils/error/elog.c
> > +++ b/src/backend/utils/error/elog.c
> > @@ -219,6 +219,19 @@ err_gettext(const char *str)
> >  #endif
> >  }
> >
> > +#if defined(HAVE_PG_ATTRIBUTE_HOT_AND_COLD) && defined(HAVE__BUILTIN_CONSTANT_P)
> > +/*
> > + * errstart_cold
> > + *     A simple wrapper around errstart, but hinted to be cold so that the
> > + *     compiler is more likely to put error code in a cold area away from the
> > + *     main function body.
> > + */
> > +bool
> > +pg_attribute_cold errstart_cold(int elevel, const char *domain)
> > +{
> > +     return errstart(elevel, domain);
> > +}
> > +#endif
>
> Hm. Would it make sense to have this be a static inline?

I thought about that but didn't try it to ensure it still worked ok. I didn't think it was that important to make sure we don't get the extra function hop for ERRORs. It seemed like a case we'd not want to really optimise for.

> > /*
> >  * errstart --- begin an error-reporting cycle
> > diff --git a/src/include/c.h b/src/include/c.h
> > index d72b23afe4..087b8af6cb 100644
> > --- a/src/include/c.h
> > +++ b/src/include/c.h
> > @@ -178,6 +178,21 @@
> >  #define pg_noinline
> >  #endif
> >
> > +/*
> > + * Marking certain functions as "hot" or "cold" can be useful to assist the
> > + * compiler in arranging the assembly code in a more efficient way.
> > + * These are supported from GCC >= 4.3 and clang >= 3.2
> > + */
> > +#if (defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 3))) || \
> > +    (defined(__clang__) && (__clang_major__ > 3 || (__clang_major__ == 3 && __clang_minor__ >= 2)))
> > +#define HAVE_PG_ATTRIBUTE_HOT_AND_COLD 1
> > +#define pg_attribute_hot __attribute__((hot))
> > +#define pg_attribute_cold __attribute__((cold))
> > +#else
> > +#define pg_attribute_hot
> > +#define pg_attribute_cold
> > +#endif
>
> Wonder if we should start using __has_attribute() for things like this.
>
> https://gcc.gnu.org/onlinedocs/cpp/_005f_005fhas_005fattribute.html
>
> I.e. we could do something like
> #ifndef __has_attribute
> #define __has_attribute(attribute) 0
> #endif
>
> #if __has_attribute(hot)
> #define pg_attribute_hot __attribute__((hot))
> #else
> #define pg_attribute_hot
> #endif
>
> clang added __has_attribute in 2.9 (2010), gcc added it in 5 (2014), so
> I don't think we'd lose too much.

Thanks for pointing that out. Seems like a good idea to me. I don't think we'll upset too many people running GCC 4.4 to 5.0. I can't imagine many people serious about performance will be using a PostgreSQL version that'll be released in 2021 with a pre-2014 compiler.

> > #ifdef HAVE__BUILTIN_CONSTANT_P
> > +#ifdef HAVE_PG_ATTRIBUTE_HOT_AND_COLD
> > +#define ereport_domain(elevel, domain, ...) \
> > +     do { \
> > +             pg_prevent_errno_in_scope(); \
> > +             if (__builtin_constant_p(elevel) && (elevel) >= ERROR ? \
> > +                     errstart_cold(elevel, domain) : \
> > +                     errstart(elevel, domain)) \
> > +                     __VA_ARGS__, errfinish(__FILE__, __LINE__, PG_FUNCNAME_MACRO); \
> > +             if (__builtin_constant_p(elevel) && (elevel) >= ERROR) \
> > +                     pg_unreachable(); \
> > +     } while(0)
> > +#else /* !HAVE_PG_ATTRIBUTE_HOT_AND_COLD */
> > #defin
Re: Backpatch b61d161c14 (Introduce vacuum errcontext ...)
On Thu, Jun 25, 2020 at 02:31:51PM +0530, Amit Kapila wrote: > I have looked at both the patches (using separate variables (by > Justin) and using a struct (by Andres)) and found the second one bit > better. Thanks for looking. > I have improved some comments in the code and for now, kept as two > patches (a) one for improving the error info for index (mostly > Justin's patch based on Tom's idea) and (b) the other to generally > improve the code in this area (mostly Andres's patch). And thanks for separate patchen :) > I have done some testing with both the patches and would like to do > more unless there are objections with these. Comments: > * The index name is saved only during this phase and restored > immediately => I wouldn't say "only" since it's saved during lazy_vacuum: index AND cleanup. >update_vacuum_error_info(LVRelStats *errinfo, LVSavedErrInfo *oldpos, int >phase, => You called your struct "LVSavedErrInfo" but the variables are still called "pos". I would call it olderrinfo or just old. Also, this doesn't (re)rename the "cbarg" stuff that Alvaro didn't like, which was my 0001 patch. -- Justin
Re: Backpatch b61d161c14 (Introduce vacuum errcontext ...)
On Fri, Jun 26, 2020 at 7:25 AM Justin Pryzby wrote: > > > > I have done some testing with both the patches and would like to do > > more unless there are objections with these. > > Comments: > > > * The index name is saved only during this phase and restored > > immediately > > => I wouldn't say "only" since it's saved during lazy_vacuum: index AND > cleanup. > > >update_vacuum_error_info(LVRelStats *errinfo, LVSavedErrInfo *oldpos, int > >phase, > > => You called your struct "LVSavedErrInfo" but the variables are still called > "pos". I would call it olderrinfo or just old. > Fixed both of the above comments. I used the variable name as saved_err_info. > Also, this doesn't (re)rename the "cbarg" stuff that Alvaro didn't like, which > was my 0001 patch. > If I am not missing anything then that change was in lazy_cleanup_index and after this patch, it won't be required because we are using a different variable name. I have combined both the patches now. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com 0001-Improve-vacuum-error-context-handling.v1.patch Description: Binary data
Re: Default setting for enable_hashagg_disk
On Fri, Jun 26, 2020 at 01:53:57AM +0200, Tomas Vondra wrote: > I'm not saying it's not beneficial to use different limits for different > nodes. Some nodes are less sensitive to the size (e.g. sorting often > gets faster with smaller work_mem). But I think we should instead have a > per-session limit, and the planner should "distribute" the memory to > different nodes. It's a hard problem, of course. Yeah, I am actually confused why we haven't developed a global memory allocation strategy and continue to use per-session work_mem. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: Removal of currtid()/currtid2() and some table AM cleanup
On Thu, Jun 25, 2020 at 10:14:00PM +0900, Inoue, Hiroshi wrote:
> I seem to have invalidated KEYSET-DRIVEN cursors used in positioned-update
> test. It was introduced by the commit 4a272fd but was invalidated by the
> commit 2be35a6.
>
> I don't object to the removal of currtid(2) because keyset-driven cursors in
> psqlodbc are changed into static cursors in many cases and I've hardly ever
> heard a complaint about it.

Hmm. I am not sure that this completely answers my original concern though. In short, don't we still have corner cases where keyset-driven cursors are not changed into static cursors, meaning that currtid2() could get used? The removal of the in-core functions would hurt applications using that, meaning that we should at least provide an equivalent of currtid2() in the worst case as a contrib module, no? If the code paths of currtid2() are reachable, shouldn't we also make sure that they are still reached in the regression tests of the driver, meaning that the driver code needs more coverage?

I have been looking at the tests and tried to tweak them using SQLSetPos() so that the code paths involving currtid2() get reached, but I am not really able to do so. It does not mean that currtid2() never gets reached, it just means that I am not able to be sure that this part can be safely removed from the Postgres backend code :(

From what I can see on this thread, we could just remove currtid() per the arguments of the RETURNING ctid clause supported since PG 8.2, but it would make more sense to me to just remove both currtid()/currtid2() at once.
-- Michael
signature.asc Description: PGP signature
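For readers unfamiliar with the RETURNING alternative mentioned here, a sketch (the table is hypothetical; RETURNING has been available since PostgreSQL 8.2):

```sql
-- A driver can capture a row's physical location directly, instead of
-- calling currtid()/currtid2() afterwards:
CREATE TEMP TABLE t (id int PRIMARY KEY, val text);
INSERT INTO t VALUES (1, 'old') RETURNING ctid;

-- After an update, the returned ctid is the row's new location -- the
-- value that keyset-driven cursors used currtid2() to rediscover.
UPDATE t SET val = 'new' WHERE id = 1 RETURNING ctid;
```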
Re: [Patch] ALTER SYSTEM READ ONLY
On Wed, Jun 24, 2020 at 1:54 PM tushar wrote:
>
> On 6/22/20 11:59 AM, Amul Sul wrote:
> > 2. Now skipping the startup checkpoint if the system is in read-only mode, as
> > discussed [2].
>
> I am not able to get pg_checksums output after shutting down my server
> in read-only mode.
>
> Steps -
>
> 1. initdb (./initdb -k -D data)
> 2. start the server (./pg_ctl -D data start)
> 3. connect to psql (./psql postgres)
> 4. fire the query (alter system read only;)
> 5. shut down the server (./pg_ctl -D data stop)
> 6. pg_checksums
>
> [edb@tushar-ldap-docker bin]$ ./pg_checksums -D data
> pg_checksums: error: cluster must be shut down
> [edb@tushar-ldap-docker bin]$
>
> Result - (when the server is not in read-only mode)
>
> [edb@tushar-ldap-docker bin]$ ./pg_checksums -D data
> Checksum operation completed
> Files scanned: 916
> Blocks scanned: 2976
> Bad checksums: 0
> Data checksum version: 1

I think that's expected, since the server wasn't cleanly shut down; a similar error can be seen with any server which has been shut down in immediate mode (pg_ctl -D data_dir -m i).

Regards, Amul
Re: min_safe_lsn column in pg_replication_slots view
On Fri, Jun 26, 2020 at 4:54 AM Alvaro Herrera wrote:
>
> On 2020-Jun-26, Michael Paquier wrote:
>
> > On Thu, Jun 25, 2020 at 11:24:27AM -0400, Alvaro Herrera wrote:
> > > I don't understand the proposal. Michael posted a patch that adds
> > > pg_wal_oldest_lsn(), and you say we should apply the patch except the
> > > part that adds that function -- so what part would be applying?
> >
> > I have sent last week a patch about only the removal of min_safe_lsn:
> > https://www.postgresql.org/message-id/20200619121552.gh453...@paquier.xyz
> > So this applies to this part.
>
> Well, I oppose that because it leaves us with no way to monitor slot
> limits. In his opening email, Masao-san proposed to simply change the
> value by adding 1. How you go from adding 1 to a column to removing
> the column completely with no recourse is beyond me.
>
> Let me summarize the situation and possible ways forward as I see them.
> If I'm mistaken, please correct me.
>
> Problems:
> i) pg_replication_slots.min_safe_lsn has a weird definition in that all
>    replication slots show the same value. It is also not clear how the
>    user can make use of that value.
> ii) min_safe_lsn cannot be used with pg_walfile_name, because it returns
>    the name of the previous segment.
>
> Proposed solutions:
>
> a) Do nothing -- keep the min_safe_lsn column as is. Warn users that
>    pg_walfile_name should not be used with this column.
> b) Redefine min_safe_lsn to be lsn+1, so that pg_walfile_name can be used
>    and return a useful value.
> c) Remove min_safe_lsn; add functions that expose the same value
> d) Remove min_safe_lsn; add a new view that exposes the same value and
>    possibly others
> e) Replace min_safe_lsn with a "distance" column, which reports
>    restart_lsn - oldest valid LSN
>    (Note that you no longer have an LSN in this scenario, so you can't
>    call pg_walfile_name.)

Can we consider an option to "Remove min_safe_lsn; document how a user can monitor the distance"?
We have a function to get the current WAL insert location, and the other things required are available either via views or as GUC values. The reason I am thinking of this option is that it might be better to get some more feedback on what is the most appropriate value to display. However, I am okay if we can reach a consensus on one of the above options.
-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
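A sketch of the "document how a user can monitor the distance" option, using only functions and columns that already exist (run on a primary; on a standby, pg_last_wal_replay_lsn() would be the reference point instead):

```sql
-- Distance between each slot's restart_lsn and the current insert
-- position, to be compared against max_slot_wal_keep_size (-1 = no limit):
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes,
       current_setting('max_slot_wal_keep_size') AS keep_limit
FROM pg_replication_slots;
```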
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Thu, Jun 25, 2020 at 7:10 PM Dilip Kumar wrote:
> On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila wrote:
> > On Tue, Jun 23, 2020 at 7:00 PM Dilip Kumar wrote:
> > >
> > > Here is the POC patch to discuss the idea of a cleanup of shared
> > > filesets on proc exit. As discussed offlist, here I am maintaining
> > > the list of shared filesets. The first time, when the list is NULL, I am
> > > registering the cleanup function with the on_proc_exit routine. After
> > > that, for subsequent filesets, I am just appending to filesetlist.
> > > There is also an interface to unregister the shared fileset from the
> > > cleanup list, and that is done by the caller whenever we are deleting
> > > the shared fileset manually. While explaining it here, I think there
> > > could be one issue: if we delete all the elements, the list will
> > > become NULL, and on the next SharedFileSetInit we will again register the
> > > function. Maybe that is not a problem, but we can avoid registering
> > > multiple times by using some flag in the file
> >
> > I don't understand what you mean by "using some flag in the file".
> >
> > Review comments on various patches.
> >
> > poc_shared_fileset_cleanup_on_procexit
> > =
> > 1.
> > - ent->subxact_fileset =
> > -     MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
> > + MemoryContext oldctx;
> >
> > + /* Shared fileset handle must be allocated in the persistent context */
> > + oldctx = MemoryContextSwitchTo(ApplyContext);
> > + ent->subxact_fileset = palloc(sizeof(SharedFileSet));
> >   SharedFileSetInit(ent->subxact_fileset, NULL);
> > + MemoryContextSwitchTo(oldctx);
> >   fd = BufFileCreateShared(ent->subxact_fileset, path);
> >
> > Why is this change required for this patch, and why do we only cover
> > SharedFileSetInit in the Apply context and not BufFileCreateShared?
> > The comment is also not very clear on this point.
>
> Added the comments for the same.
>
> > 2.
> > +void
> > +SharedFileSetUnregister(SharedFileSet *input_fileset)
> > +{
> > + bool found = false;
> > + ListCell *l;
> > +
> > + Assert(filesetlist != NULL);
> > +
> > + /* Loop over all the pending shared fileset entry */
> > + foreach (l, filesetlist)
> > + {
> > + SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
> > +
> > + /* remove the entry from the list and delete the underlying files */
> > + if (input_fileset->number == fileset->number)
> > + {
> > + SharedFileSetDeleteAll(fileset);
> > + filesetlist = list_delete_cell(filesetlist, l);
> >
> > Why are we calling SharedFileSetDeleteAll here when in the caller we
> > have already deleted the fileset as per the below code?
> > BufFileDeleteShared(ent->stream_fileset, path);
> > + SharedFileSetUnregister(ent->stream_fileset);
>
> That's wrong; I have removed it.
>
> > I think it will be good if somehow we can remove the fileset from
> > filesetlist during BufFileDeleteShared. If that is possible, then we
> > don't need a separate API for SharedFileSetUnregister.
>
> I have done as discussed in later replies, basically called
> SharedFileSetUnregister from BufFileDeleteShared.
>
> > 3.
> > +static List * filesetlist = NULL;
> > +
> >  static void SharedFileSetOnDetach(dsm_segment *segment, Datum datum);
> > +static void SharedFileSetOnProcExit(int status, Datum arg);
> >  static void SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace);
> >  static void SharedFilePath(char *path, SharedFileSet *fileset, const char *name);
> >  static Oid ChooseTablespace(const SharedFileSet *fileset, const char *name);
> > @@ -76,6 +80,13 @@ SharedFileSetInit(SharedFileSet *fileset, dsm_segment *seg)
> >  /* Register our cleanup callback. */
> >  if (seg)
> >      on_dsm_detach(seg, SharedFileSetOnDetach, PointerGetDatum(fileset));
> > + else
> > + {
> > +     if (filesetlist == NULL)
> > +         on_proc_exit(SharedFileSetOnProcExit, 0);
> >
> > We use NIL for list initialization and comparison.
See lock_files usage.
>
> Done
>
> > 4.
> > +SharedFileSetOnProcExit(int status, Datum arg)
> > +{
> > + ListCell *l;
> > +
> > + /* Loop over all the pending shared fileset entry */
> > + foreach (l, filesetlist)
> > + {
> > + SharedFileSet *fileset = (SharedFileSet *) lfirst(l);
> > + SharedFileSetDeleteAll(fileset);
> > + }
> >
> > We can initialize filesetlist as NIL after the for loop as it will
> > make the code look clean.
>
> Right.
>
> > Comments on other patches:
> > =
> > 5.
> > > 3. On concurrent abort we are truncating all the changes, including
> > > some incomplete changes, so later when we get the complete changes we
> > > don't have the previous changes, e.g., if we had a specinsert in the
> > > last stream and, due to concurrent abort detection, we delete those
> > > changes, later we will get spec_confirm without spec_insert. We could
> > > have simply avoided deleting all the changes, but I think the better
> > > fix is once we detect the concurrent abort for any transaction
Re: Transactions involving multiple postgres foreign servers, take 2
On Tue, 23 Jun 2020 at 13:26, Amit Kapila wrote:
> On Tue, Jun 23, 2020 at 9:03 AM Masahiko Sawada wrote:
> >
> > I've attached the latest version patches. I've incorporated the review
> > comments I got so far and improved the locking strategy.
>
> Thanks for updating the patch.
>
> > Please review it.
>
> I think at this stage it is important that we do some study of the various
> approaches to achieve this work and come up with a comparison of the
> pros and cons of each approach: (a) what this patch provides, (b) what
> is implemented in the Global Snapshots patch [1], (c) if possible, what is
> implemented in Postgres-XL. I fear that if we go too far in spending
> effort on this and later discover that it can be better done via some
> other available patch/work (maybe because that approach can easily be
> extended to provide atomic visibility, or the design is more robust,
> etc.) then it can lead to a lot of rework.

Yeah, I have no objection to that plan, but I think we also need to keep in mind that (b), (c), and whatever we are thinking about for global consistency only cover PostgreSQL (and postgres_fdw). On the other hand, this patch needs to implement the feature in a way that can resolve the atomic commit problem more generically, because a foreign server might be accessed via oracle_fdw, mysql_fdw, or other FDWs connecting to database systems that support 2PC.

Regards,

--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: EXPLAIN: Non-parallel ancestor plan nodes exclude parallel worker instrumentation
On Wed, Jun 24, 2020 at 2:44 AM Amit Kapila wrote:
> So, that leads to loops as 2 on "Parallel Seq Scan" and "Nested Loop" nodes.
> Does this make sense now?

Yes, I think we're on the same page. Thanks for the additional details.

It turns out that the plan I sent at the top of the thread is actually an older plan we had saved, all the way from April 2018. We're fairly certain this was Postgres 10, but not sure which point release. I tried to reproduce this on 10, 11, 12, and 13 beta, but I am now seeing results similar to yours: Buffers and I/O Timings are rolled up into the parallel leader, and that is propagated as expected to the Gather. Sorry for the confusion.

On Wed, Jun 24, 2020 at 3:18 AM Maciek Sakrejda wrote:
> So we should be seeing an average, not a sum, right?

And here I missed that the documentation specifies rows and actual time as per-loop, but other metrics are not--they're just cumulative. So actual time and rows are still per-"loop" values, but while rows values are additive (the Gather combines rows from the parallel leader and the workers), the actual time is not, because the whole point is that this work happens in parallel.

I'll report back if I can reproduce the weird numbers we saw in that original plan or find out exactly which Postgres version it was from.

Thanks,
Maciek
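The per-loop arithmetic described above can be sketched numerically. This is a toy model (illustrative only, not PostgreSQL code; the row counts and timings are invented) of how cumulative counters divided by loops behave for a parallel node versus its Gather:

```python
# Toy model of EXPLAIN ANALYZE per-loop reporting for a parallel scan:
# rows are cumulative counts divided by loops, and therefore additive
# across leader and workers, while actual time is per-loop wall clock
# and does NOT add up, because the loops ran concurrently.

loops = 2  # parallel leader + one worker, say

# Cumulative rows produced by each participant of the Parallel Seq Scan:
rows_leader, rows_worker = 400, 600

# The node's reported rows is (total rows) / loops, i.e. an average:
reported_rows = (rows_leader + rows_worker) / loops

# The Gather node (loops=1) receives every row, so it reports the sum:
gather_rows = rows_leader + rows_worker

# Each participant ran roughly 80 ms *concurrently*; dividing the summed
# per-participant times by loops yields ~80 ms, not 160 ms of wall clock.
time_leader_ms, time_worker_ms = 82.0, 78.0
reported_time_ms = (time_leader_ms + time_worker_ms) / loops
```

Under this model the scan node shows 500 rows per loop while the Gather shows 1000, which matches the "rows are additive, time is not" observation in the message above.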
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Fri, Jun 26, 2020 at 10:39 AM Dilip Kumar wrote:
> On Thu, Jun 25, 2020 at 7:10 PM Dilip Kumar wrote:
> > On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila wrote:
> > >
> > > Comments on other patches:
> > > =
> > > 5.
> > > > 3. On concurrent abort we are truncating all the changes including
> > > > some incomplete changes, so later when we get the complete changes we
> > > > don't have the previous changes, e.g., if we had specinsert in the
> > > > last stream and due to concurrent abort detection if we delete those
> > > > changes, later we will get spec_confirm without spec_insert. We could
> > > > have simply avoided deleting all the changes, but I think the better
> > > > fix is once we detect the concurrent abort for any transaction, then
> > > > why do we need to collect the changes for that? We can simply avoid
> > > > that. So I have put that fix. (0006)
> > >
> > > On similar lines, I think we need to skip processing the message; see the
> > > else part of the code in ReorderBufferQueueMessage.
> >
> > Basically, ReorderBufferQueueMessage also calls
> > ReorderBufferQueueChange internally for transactional changes.

Yes, that is correct, but I was thinking about the non-transactional part due to the below code there.

else
{
    ReorderBufferTXN *txn = NULL;
    volatile Snapshot snapshot_now = snapshot;

    if (xid != InvalidTransactionId)
        txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);

Even though we are using txn here, I think we don't need to skip it for aborted xacts because, without the patch as well, such messages get decoded irrespective of transaction status. What do you think?

> > But, having said that, I realize the idea of skipping the changes in
> > ReorderBufferQueueChange is not good, because by then we have already
> > allocated the memory for the change and the tuple, and it's not
> > correct to return the changes because it will update the memory accounting.
> > So I think we can do it at a more centralized place, and before we
> > process the change, maybe in LogicalDecodingProcessRecord, before
> > going to the switch we can call a function from the reorderbuffer.c
> > layer to see whether this transaction is detected as aborted or not.
> > But I have to think more on this line, whether we can skip all the
> > processing of that record or not.
> >
> > Your other comments look fine to me, so I will send the next patch
> > set and reply to them individually.
>
> I think we cannot put this check in higher-level functions like
> LogicalDecodingProcessRecord or DecodeXXXOp because we need to process
> that xid at least for abort, so I think it is good to keep the check
> inside ReorderBufferQueueChange only, and we can free the memory of the
> change if the abort is detected. Also, if we just skip those changes in
> ReorderBufferQueueChange then the effect will be localized to that
> particular transaction which is already aborted.

Fair enough, and for cases like the non-transactional part of ReorderBufferQueueMessage, I think we anyway need to process the message irrespective of transaction status.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
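To illustrate the behavior being agreed on here, this is a toy Python model (illustrative only, not PostgreSQL code; names loosely mirror the functions discussed): once a transaction is marked as concurrently aborted, queueing a transactional change discards it instead of accumulating it, while non-transactional messages bypass the check and are processed regardless of transaction status.

```python
# Toy model of skipping changes for a concurrently aborted transaction
# inside a ReorderBufferQueueChange-like function, while letting
# non-transactional messages through unconditionally.

class Txn:
    def __init__(self, xid):
        self.xid = xid
        self.concurrent_abort = False
        self.changes = []

txns = {}          # xid -> Txn
messages = []      # non-transactional messages, decoded regardless of txn

def queue_change(xid, change):
    txn = txns.setdefault(xid, Txn(xid))
    if txn.concurrent_abort:
        return False   # "free" the change here; nothing is queued
    txn.changes.append(change)
    return True

def queue_message(xid, msg, transactional):
    if transactional:
        # Transactional messages go through the same queueing path.
        return queue_change(xid, ("message", msg))
    # Non-transactional messages are decoded irrespective of txn status.
    messages.append(msg)
    return True

queue_change(1, "insert")
txns[1].concurrent_abort = True                    # abort detected mid-stream
assert queue_change(1, "spec_insert") is False     # skipped, not queued
assert queue_message(1, "m1", transactional=True) is False
assert queue_message(1, "m2", transactional=False) is True
```

The skip lives in the queueing function itself, so the effect stays localized to the aborted transaction, which is the design choice the thread converges on.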
Re: Creating a function for exposing memory usage of backend process
Hi!

On Thu, Jun 18, 2020 at 12:56 PM Fujii Masao wrote:
> Agreed. The feature to view how local memory contexts are used in
> each process is very useful!

+1

> > =# SELECT * FROM pg_stat_get_backend_memory_context(PID);
>
> I'm afraid that this interface is not convenient when we want to monitor
> the usage of local memory contexts for all the processes. For example,
> I'd like to monitor how much memory is used in total to store prepared
> statement information. For that purpose, I wonder if it's more convenient
> to provide a view displaying the memory context usage of
> all the processes.

How about separating this into a function that examines memory consumption trends for all processes, and a function that examines memory consumption during a particular phase of a particular process?

For the former, as Fujii said, the function would show limited output for each context type, with all processes calculating and outputting their information while at idle status.

I think the latter is useful for debugging and other purposes. For example, imagine preparing a function for registration like the following:

=# SELECT pg_stat_get_backend_memory_context_regist(pid, context, max_children, calc_point)

pid: the target process
context: the top level of the context of interest
max_children: the maximum number of children to output for the target context (similar to MemoryContextStatsInternal()'s max_children)
calc_point: single or multiple position(s) at which to calculate and output context information (existing hooks such as planner_hook, executor_start, etc. could be used)

This function tells the target PID to output the information of the specified context at the specified calc_point. When the target process reaches the calc_point, it calculates and outputs the context information once, to a file or DSM. (Currently PostgreSQL has no formal way of externally modifying the parameters of a particular process, so this may need to be implemented...)
Sometimes I want to know the memory usage during the planning phase, or at other points where a query_string and/or plan_tree is available, before the target process moves to idle status. So it would be nice to retrieve memory usage at some arbitrary point in time!

Regards,
--
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com
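As a sketch of the aggregation such a monitoring function might perform, here is a toy Python model (illustrative only, not PostgreSQL code; context names and sizes are invented): walk a memory-context tree and sum usage per context name, the way a per-type summary view could roll up, e.g., all CachedPlan contexts holding prepared statements.

```python
# Toy model of rolling up a memory-context tree into per-name totals.

class MemoryContext:
    def __init__(self, name, used, children=()):
        self.name = name          # context name, e.g. "CachedPlan"
        self.used = used          # bytes used by this context alone
        self.children = list(children)

def summarize(ctx, totals=None):
    """Recursively sum 'used' bytes per context name over the whole tree."""
    totals = {} if totals is None else totals
    totals[ctx.name] = totals.get(ctx.name, 0) + ctx.used
    for child in ctx.children:
        summarize(child, totals)
    return totals

# An invented tree, loosely shaped like a backend's context hierarchy:
top = MemoryContext("TopMemoryContext", 8192, [
    MemoryContext("CacheMemoryContext", 65536, [
        MemoryContext("CachedPlan", 4096),   # one prepared statement
        MemoryContext("CachedPlan", 2048),   # another prepared statement
    ]),
    MemoryContext("MessageContext", 1024),
])

totals = summarize(top)
```

With a per-name rollup like this, the "how much memory is used in total for prepared statements" question from the quoted message becomes a single lookup of the CachedPlan total.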
Re: Resetting spilled txn statistics in pg_stat_replication
On Thu, 25 Jun 2020 at 19:35, Amit Kapila wrote:
> On Tue, Jun 23, 2020 at 6:39 PM Amit Kapila wrote:
> > On Tue, Jun 23, 2020 at 3:48 PM Tomas Vondra wrote:
> > > On Tue, Jun 23, 2020 at 10:58:18AM +0530, Amit Kapila wrote:
> > > > On Tue, Jun 23, 2020 at 9:32 AM Masahiko Sawada wrote:
> > > >> On Sun, 21 Jun 2020 at 06:57, Tomas Vondra wrote:
> > > >> >
> > > >> > > What if the decoding has been performed by multiple backends using
> > > >> > > the same slot? In that case, it will be difficult to make the
> > > >> > > judgment for the value of logical_decoding_work_mem based on stats.
> > > >> > > It would make sense if we provide a way to set
> > > >> > > logical_decoding_work_mem for a slot, but not sure if that is
> > > >> > > better than what we have now.
> > > >>
> > > >> I thought that the stats are relevant to what the
> > > >> logical_decoding_work_mem value was, but not to who performed logical
> > > >> decoding. So even if multiple backends perform logical decoding using
> > > >> the same slot, the user can directly use the stats as long as the
> > > >> logical_decoding_work_mem value doesn't change.
>
> Today, I thought about it again, and if we consider the point that the
> logical_decoding_work_mem value doesn't change much, then having the
> stats at slot level would also allow computing
> logical_decoding_work_mem based on stats. Do you think it is a
> reasonable assumption that users won't change
> logical_decoding_work_mem for different processes (WALSender, etc.)?

FWIW, if we use logical_decoding_work_mem as the threshold for starting to send changes to a subscriber, I think there might be use cases where the user wants to set different logical_decoding_work_mem values for different WAL senders. For example, setting a lower value to minimize the latency of synchronous logical replication to a near site, while setting a larger value to minimize the amount of data sent to a far site.
Regards,
--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Thu, Jun 25, 2020 at 7:11 PM Dilip Kumar wrote:
> On Wed, Jun 24, 2020 at 4:04 PM Amit Kapila wrote:
> >
> > Review comments on various patches.
> >
> > poc_shared_fileset_cleanup_on_procexit
> > =
> > 1.
> > - ent->subxact_fileset =
> > -     MemoryContextAlloc(ApplyContext, sizeof(SharedFileSet));
> > + MemoryContext oldctx;
> >
> > + /* Shared fileset handle must be allocated in the persistent context */
> > + oldctx = MemoryContextSwitchTo(ApplyContext);
> > + ent->subxact_fileset = palloc(sizeof(SharedFileSet));
> >   SharedFileSetInit(ent->subxact_fileset, NULL);
> > + MemoryContextSwitchTo(oldctx);
> >   fd = BufFileCreateShared(ent->subxact_fileset, path);
> >
> > Why is this change required for this patch, and why do we only cover
> > SharedFileSetInit in the Apply context and not BufFileCreateShared?
> > The comment is also not very clear on this point.
>
> Added the comments for the same.

1.
+ /*
+  * Shared fileset handle must be allocated in the persistent context.
+  * Also, SharedFileSetInit allocates the memory for the shared fileset
+  * list, so we need to allocate that in the long-term memory context.
+  */

How about "We need to maintain the shared fileset across multiple stream open/close calls. So, we allocate it in a persistent context."?

2.
+ /*
+  * If the caller is following the dsm-based cleanup then we don't
+  * maintain the filesetlist, so return.
+  */
+ if (filesetlist == NULL)
+     return;

The check here should use 'NIL' instead of 'NULL'.

Other than that, the changes in this particular patch look good to me.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: Creating a function for exposing memory usage of backend process
Hi,

While going through the mail chain on relation, plan, and catalogue caching [1], I have been wondering: is there a way to know the current relation, plan, and catalogue cache sizes? If there is already such a way, please ignore this, and I would be grateful if someone could point me to it. I'm posting this here as I felt it's relevant.

If there is no such way to know the cache sizes and other information such as statistics, number of entries, cache misses, hits, etc., can the approach discussed here be applied? If the user knows the cache statistics and other information, perhaps we can allow the user to take appropriate actions, such as deleting a few entries through a command or some other way.

I'm sorry if I'm diverting the topic being discussed in this mail thread; please ignore this if it is irrelevant.

[1] - https://www.postgresql.org/message-id/flat/20161219.201505.11562604.horiguchi.kyotaro%40lab.ntt.co.jp

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com