Re: MSVC Build support with visual studio 2019
On Wed, May 29, 2019 at 10:30 AM Haribabu Kommi wrote: > > Updated patches are attached. > > All patches apply, build and pass tests. The patch '0001-support-building-with-visual-studio-2019_v10_to_v9.6_v3.patch' applies on version 9.5. Not sure if more review is needed before moving to 'ready for committer', but I have no more comments to make on the current patches. Regards, Juan José Santamaría Flecha
Re: compiling PL/pgSQL plugin with C++
Tom, thanks for the prompt response. Based on this topic and some nearby ones, the problem turned out to be deeper than expected... as always. p.s. Sorry for the Cyrillic in the mailing list. At the beginning I wrote from my corporate email and could not change the sender name. If you can, please replace it. Regards, George
Re: UCT (Re: pgsql: Update time zone data files to tzdata release 2019a.)
Re: Tom Lane 2019-06-04 <65800.1559662...@sss.pgh.pa.us> > > There is something wrong here. On Debian Buster/unstable, using > > system tzdata (2019a-1), if /etc/timezone is "Etc/UTC": > > Is your build using --with-system-tzdata? If so, which tzdb > release is the system on, and is it a completely stock copy > of that release? It's using system tzdata (2019a-1). There's one single patch on top of that: https://sources.debian.org/src/tzdata/2019a-1/debian/patches/ > BTW, does Debian set up /etc/timezone as a symlink, by any chance, > rather than a copy or hard link? If it's a symlink, we could improve > matters by teaching identify_system_timezone() to inspect it. In the meantime I realized that I was only testing /etc/timezone (which is a plain file with just the zone name), while not touching /etc/localtime at all. In this environment, it's a symlink: lrwxrwxrwx 1 root root 27 Mär 28 14:49 /etc/localtime -> /usr/share/zoneinfo/Etc/UTC ... but the name still gets canonicalized to Etc/UCT or UCT. Christoph
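For what it's worth, the symlink-inspection idea could look roughly like the sketch below. This is not the actual identify_system_timezone() code; the zoneinfo prefix and the way the zone name is derived from the link target are assumptions:

    #include <limits.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /*
     * Hedged sketch: if /etc/localtime is a symlink into the tzdata tree,
     * derive the zone name from the link target, e.g.
     * /usr/share/zoneinfo/Etc/UTC -> "Etc/UTC".  Returns NULL if it is not
     * a symlink or does not point under the expected prefix.
     */
    static const char *
    zone_name_from_localtime_link(char *buf, size_t buflen)
    {
        const char *prefix = "/usr/share/zoneinfo/";   /* assumed location */
        char        target[PATH_MAX];
        ssize_t     len;

        len = readlink("/etc/localtime", target, sizeof(target) - 1);
        if (len < 0)
            return NULL;            /* not a symlink (or unreadable) */
        target[len] = '\0';

        if (strncmp(target, prefix, strlen(prefix)) != 0)
            return NULL;            /* points somewhere unexpected */

        snprintf(buf, buflen, "%s", target + strlen(prefix));
        return buf;
    }

On the Debian setup above this would yield "Etc/UTC" directly from the link target, independent of how the zone name gets canonicalized from the binary zone file contents.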
Re: Confusing error message for REINDEX TABLE CONCURRENTLY
On Wed, 5 Jun 2019 at 18:11, Michael Paquier wrote: > > On Tue, Jun 04, 2019 at 11:26:44AM -0700, Ashwin Agrawal wrote: > > The variable is used in the else scope, hence I moved it there. But yes it's > > removed completely for this scope. > > Thanks for updating the patch. It does its job by having one separate > message for the concurrent and the non-concurrent cases as discussed. > David, what do you think? Perhaps you would like to commit it > yourself? Thanks. I've just pushed this with some additional comment changes. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: hyrax vs. RelationBuildPartitionDesc
On Tue, Jun 4, 2019 at 9:25 PM Robert Haas wrote: > On Fri, May 17, 2019 at 4:36 PM Tom Lane wrote: > > Yeah, I did some additional testing that showed that it's pretty darn > > hard to get the leak to amount to anything. The test case that I > > originally posted did many DDLs in a single transaction, and it > > seems that that's actually essential to get a meaningful leak; as > > soon as you exit the transaction the leaked contexts will be recovered > > during sinval cleanup. > > My colleague Amul Sul rediscovered this same leak when he tried to > attach lots of partitions to an existing partitioned table, all in the > course of a single transaction. This seems a little less artificial > than Tom's original reproducer, which involved attaching and detaching > the same partition repeatedly. > > Here is a patch that tries to fix this, along the lines I previously > suggested; Amul reports that it does work for him.

Thanks for the patch. I noticed a crash with one of the scenarios that I think the patch is meant to address. Let me describe the steps:

Attach gdb (or whatever) to session 1, in which we'll run a query that will scan at least two partitions, and set a breakpoint in expand_partitioned_rtentry(). Run the query, so the breakpoint will be hit. Step through up to the start of the following loop in this function:

    i = -1;
    while ((i = bms_next_member(live_parts, i)) >= 0)
    {
        Oid         childOID = partdesc->oids[i];
        Relation    childrel;
        RangeTblEntry *childrte;
        Index       childRTindex;
        RelOptInfo *childrelinfo;

        /* Open rel, acquiring required locks */
        childrel = table_open(childOID, lockmode);

Note that 'partdesc' in the above code is from the partition directory.

Before stepping into the loop, start another session and attach a new partition. On into the loop. When the 1st partition is opened, its locking will result in RelationClearRelation() being called on the parent relation due to the partition being attached concurrently, which, I observed, is actually invoked a couple of times due to recursion. The parent relation's rd_oldpdcxt will be set in this process, which contains the aforementioned partition descriptor.

Before the loop moves on to process the 2nd partition, attach another partition in session 2. When the 2nd partition is opened, RelationClearRelation() will run again and, in one of its recursive invocations, it destroys the rd_oldpdcxt that was set previously, taking the partition directory's partition descriptor with it. Anything that tries to access it later crashes.

I couldn't figure out what to blame here though -- the design of rd_pdoldcxt or the fact that RelationClearRelation() is invoked multiple times. I will try to study it more closely tomorrow.

Thanks, Amit
Re: Confusing comment for function ExecParallelEstimate
On Wed, Jun 5, 2019 at 11:27 AM Wu, Fei wrote: > > Thanks for your reply. > From the code below: > (https://github.com/postgres/postgres/blob/REL_10_7/src/backend/executor/execParallel.c)
> ###
>  443         /*
>  444          * Give parallel-aware nodes a chance to add to the estimates, and get a
>  445          * count of how many PlanState nodes there are.
>  446          */
>  447         e.pcxt = pcxt;
>  448         e.nnodes = 0;
>  449         ExecParallelEstimate(planstate, &e);
>  450
>  451         /* Estimate space for instrumentation, if required. */
>  452         if (estate->es_instrument)
>  453         {
>  454                 instrumentation_len =
>  455                         offsetof(SharedExecutorInstrumentation, plan_node_id) +
>  456                         sizeof(int) * e.nnodes;
>  457                 instrumentation_len = MAXALIGN(instrumentation_len);
>  458                 instrument_offset = instrumentation_len;
>  459                 instrumentation_len +=
>  460                         mul_size(sizeof(Instrumentation),
>  461                                  mul_size(e.nnodes, nworkers));
>  462                 shm_toc_estimate_chunk(&pcxt->estimator, instrumentation_len);
>  463                 shm_toc_estimate_keys(&pcxt->estimator, 1);
> ###
>
> It seems that e.nnodes, which is returned from ExecParallelEstimate(planstate, &e), determines how many instrumentation structures are reserved in the DSM (line459~line461). And e.nnodes also determines the length of SharedExecutorInstrumentation->plan_node_id (line454~line456).
>
> So, I think here it refers to instrumentation.
>
Right. I think the way it is mentioned (SharedPlanStateInstrumentation structures ..) in the comment can confuse readers. We can replace SharedPlanStateInstrumentation with Instrumentation in the comment. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
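To make the sizing concrete, here is a small stand-alone illustration of the arithmetic in lines 454~461. The node and worker counts are made up, and the struct below is only a stand-in for the real SharedExecutorInstrumentation, so the absolute numbers are not meaningful:

    #include <stddef.h>
    #include <stdio.h>

    /* Assume 8-byte maximum alignment, as on a typical 64-bit build. */
    #define MAXALIGN(x)  (((x) + 7) & ~((size_t) 7))

    /* Stand-in for SharedExecutorInstrumentation, for illustration only. */
    typedef struct SharedExecutorInstrumentation_stub
    {
        int     instrument_options;
        size_t  instrument_offset;
        int     num_workers;
        int     num_plan_nodes;
        int     plan_node_id[];     /* e.nnodes entries */
    } SharedExecutorInstrumentation_stub;

    int
    main(void)
    {
        int     nnodes = 10;        /* e.nnodes from ExecParallelEstimate() */
        int     nworkers = 4;
        size_t  sizeof_instrumentation = 128;   /* stand-in for sizeof(Instrumentation) */
        size_t  len;

        /* fixed header plus one plan_node_id per node (lines 454~457) */
        len = MAXALIGN(offsetof(SharedExecutorInstrumentation_stub, plan_node_id)
                       + sizeof(int) * nnodes);
        /* one Instrumentation per (plan node, worker) pair (lines 459~461) */
        len += sizeof_instrumentation * nnodes * nworkers;

        printf("estimated instrumentation chunk: %zu bytes\n", len);
        return 0;
    }

Either way, e.nnodes drives both the length of the plan_node_id array and the number of per-worker Instrumentation slots, which is why describing them as Instrumentation structures in the comment seems clearer.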
Re: Binary support for pgoutput plugin
Hi, On 05/06/2019 00:08, Andres Freund wrote: > Hi, > > On 2019-06-05 00:05:02 +0200, David Fetter wrote: >> Would it make sense to work toward a binary format that's not >> architecture-specific? I recall from COPY that our binary format is >> not standardized across, for example, big- and little-endian machines. > > I think you recall wrongly. It's obviously possible that we have bugs > around this, but output/input routines are supposed to handle a > endianess independent format. That usually means that you have to do > endianess conversions, but that doesn't make it non-standardized. > Yeah, there are really 3 formats of data we have, text protocol, binary network protocol and internal on disk format. The internal on disk format will not work across big/little-endian but network binary protocol will. FWIW I don't think the code for binary format was included in original logical replication patch (I really tried to keep it as minimal as possible), but the code and protocol is pretty much ready for adding that. That said, pglogical has code which handles this (I guess Andres means that by predecessor of pgoutput) so if you look for example at the write_tuple/read_tuple/decide_datum_transfer in https://github.com/2ndQuadrant/pglogical/blob/REL2_x_STABLE/pglogical_proto_native.c that can help you give some ideas on how to approach this. -- Petr Jelinek 2ndQuadrant - PostgreSQL Solutions https://www.2ndQuadrant.com/
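As a tiny illustration of what "endianness independent" means at the protocol level, sending and reading a 4-byte column value with the existing pqformat helpers looks roughly like this. pq_sendint32() and pq_getmsgint() are the existing helpers; everything else here is made up for the sketch:

    #include "postgres.h"
    #include "libpq/pqformat.h"

    /*
     * Hedged sketch: pq_sendint32() always writes network byte order
     * (big-endian) regardless of the host, and pq_getmsgint() converts
     * back to host order, so big- and little-endian peers interoperate.
     */
    static void
    send_int4_value(StringInfo out, int32 value)
    {
        pq_sendint32(out, value);       /* host -> network byte order */
    }

    static int32
    read_int4_value(StringInfo in)
    {
        return pq_getmsgint(in, 4);     /* network -> host byte order */
    }

The same idea applies per datatype in the pglogical code referenced above: whatever is put on the wire is converted to a fixed byte order, which is what makes a binary transfer format portable even though the on-disk representation is not.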
Re: Binary support for pgoutput plugin
Hi, On Wed, 5 Jun 2019 at 07:18, Petr Jelinek wrote: > Hi, > > On 05/06/2019 00:08, Andres Freund wrote: > > Hi, > > > > On 2019-06-05 00:05:02 +0200, David Fetter wrote: > >> Would it make sense to work toward a binary format that's not > >> architecture-specific? I recall from COPY that our binary format is > >> not standardized across, for example, big- and little-endian machines. > > > > I think you recall wrongly. It's obviously possible that we have bugs > > around this, but output/input routines are supposed to handle a > > endianess independent format. That usually means that you have to do > > endianess conversions, but that doesn't make it non-standardized. > > > > Yeah, there are really 3 formats of data we have, text protocol, binary > network protocol and internal on disk format. The internal on disk > format will not work across big/little-endian but network binary > protocol will. > > FWIW I don't think the code for binary format was included in original > logical replication patch (I really tried to keep it as minimal as > possible), but the code and protocol is pretty much ready for adding that. > Yes, I looked through the public history and could not find it. Thanks for confirming. > > That said, pglogical has code which handles this (I guess Andres means > that by predecessor of pgoutput) so if you look for example at the > write_tuple/read_tuple/decide_datum_transfer in > > https://github.com/2ndQuadrant/pglogical/blob/REL2_x_STABLE/pglogical_proto_native.c > that can help you give some ideas on how to approach this. > Thanks for the tip! Dave Cramer > > >
Re: "WIP: Data at rest encryption" patch and, PostgreSQL 11-beta3
Robert Haas wrote: > On Fri, May 31, 2019 at 2:59 AM Antonin Houska wrote: > > > Sounds good. I'm not quite sure how this is going to work. Ideally > > > you'd like the encryption key command to fetch the key from something > > > like ssh-agent, or maybe pop up a window on the user's terminal with a > > > key prompt. Just reading from stdin and writing to stdout is not > > > going to be very convenient... > > > > When I mentioned writing to stdout in my previous email, I viewed it from > > the > > perspective of postmaster: whichever way the command gets the password from > > the DBA, it'll always write it to stdout and postmaster will read it from > > there. This was to address your objection that the executables do not > > actually > > "return" strings. > > So the part about returning strings is really just a wording issue: > the documentation needs to talk about data sent to stdout or wherever, > because that's what really happens, and somebody writing such a > command from scratch needs to know what it must do. > > What I'm talking about here is that it also has to be reasonably > possible to write an encryption key command that does something > useful. I don't have a really clear vision for how that's going to > work. Nobody wants the server, which is probably being launched by > pg_ctl or systemd or both, to prompt using its own stdin/stderr, but > the we need to think about how the prompting is actually going to > work. Since you mentioned ssh-agent above, I think that postmaster can, instead of running a command, create an unix socket and read the key from there. (I refer only to the socket as a communication channel, not to the protocol - ssh-agent does not seem to actually send key over the socket.) Unlike the socket for backend connections, this one would only be readable by the user that runs postmaster, and would only exist during the encryption initialization. The simplest approach from the perspective of the DBA is that pg_ctl can write the key to the socket. Besides that we can also implement a separate utility to send the key, to be used in other special cases such as starting postgres via systemd. (If the unix socket is a problem on windows, we might need to use named pipe instead.) > > The header comment in pg_upgrade.c indicates that this is because of TOAST. > > So > > I think that dbnode and relfilenode are not preserved just because there was > > no strong reason to do so by now. > > Maybe you want to propose an independent patch making that change? I think of a separate diff in the encryption patch series. As the encryption is the only reason for this enhancement, I'm not sure if it deserves a separate CF entry. (Although it might be considered refactoring because eventually pg_upgrade won't need to handle the different relnodes, and thus it might become a little bit simpler.) -- Antonin Houska Web: https://www.cybertec-postgresql.com
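To make the socket idea slightly more concrete, the postmaster side could look roughly like the sketch below. This is entirely hypothetical and not code from any patch: the socket path handling, permissions and error handling are all simplified.

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/stat.h>
    #include <sys/un.h>
    #include <unistd.h>

    /*
     * Hedged sketch: create a Unix-domain socket readable only by the
     * postmaster's user, accept exactly one connection, and read the
     * encryption key from it.  Returns the number of key bytes read, or -1
     * on error.  pg_ctl (or a small helper utility) would connect to the
     * same path and write the key.
     */
    static ssize_t
    read_key_from_socket(const char *sockpath, char *keybuf, size_t keybuflen)
    {
        struct sockaddr_un addr;
        int         listenfd;
        int         connfd;
        ssize_t     nread;

        listenfd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (listenfd < 0)
            return -1;

        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, sockpath, sizeof(addr.sun_path) - 1);

        unlink(sockpath);               /* remove any stale socket */
        if (bind(listenfd, (struct sockaddr *) &addr, sizeof(addr)) < 0 ||
            chmod(sockpath, 0600) < 0 ||    /* only this user may connect */
            listen(listenfd, 1) < 0)
        {
            close(listenfd);
            return -1;
        }

        connfd = accept(listenfd, NULL, NULL);
        nread = (connfd < 0) ? -1 : read(connfd, keybuf, keybuflen);

        if (connfd >= 0)
            close(connfd);
        close(listenfd);
        unlink(sockpath);       /* socket only exists during initialization */
        return nread;
    }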
pg_basebackup failure after setting default_table_access_method option
Hi,

I noticed a pg_basebackup failure when the default_table_access_method option is set.

Test steps:

Step 1: Init database
./initdb -D data

Step 2: Start server
./postgres -D data &

Step 3: Set GUC option
export PGOPTIONS='-c default_table_access_method=zheap'

Step 4: Perform backup
./pg_basebackup -D backup -p 5432 --no-sync
2019-06-05 20:35:04.088 IST [11601] FATAL: cannot read pg_class without having selected a database
pg_basebackup: error: could not connect to server: FATAL: cannot read pg_class without having selected a database

Reason why it is failing:

pg_basebackup does not use any database to connect to the server, as it backs up the whole data instance. Because the option default_table_access_method is set, the server tries to validate it, but validating the option reaches this check in the ScanPgRelation function:

    if (!OidIsValid(MyDatabaseId))
        elog(FATAL, "cannot read pg_class without having selected a database");

Since pg_basebackup uses no database, the command fails.

Fix:

The attached patch has the fix for the above issue. Let me know your opinion on this.

--
Regards,
vignesh
EnterpriseDB: http://www.enterprisedb.com

0001-pg-backup-guc-option-validation-failure.patch
Description: Binary data
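The attached patch is not shown inline, so purely as an illustration of the idea, a guard along these lines could be added to the GUC's check hook. This is a sketch under assumptions; the function name and details are not necessarily what the attached patch does:

    #include "postgres.h"

    #include "access/xact.h"
    #include "miscadmin.h"
    #include "utils/guc.h"
    #include "utils/syscache.h"

    /*
     * Hedged sketch only, not the attached patch: when the backend has no
     * database selected (as in a walsender serving pg_basebackup), skip the
     * catalog lookup and accept the value on faith; it can be validated
     * again where it is actually used.
     */
    static bool
    check_default_table_access_method_sketch(char **newval, void **extra,
                                             GucSource source)
    {
        if (**newval == '\0')
        {
            GUC_check_errdetail("default_table_access_method cannot be empty.");
            return false;
        }

        /* Cannot read pg_am without being in a transaction and a database. */
        if (!IsTransactionState() || !OidIsValid(MyDatabaseId))
            return true;

        if (!SearchSysCacheExists1(AMNAME, CStringGetDatum(*newval)))
        {
            GUC_check_errdetail("Table access method \"%s\" does not exist.",
                                *newval);
            return false;
        }

        return true;
    }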
Re: Binary support for pgoutput plugin
On Wed, 5 Jun 2019 at 07:21, Dave Cramer wrote: > Hi, > > > On Wed, 5 Jun 2019 at 07:18, Petr Jelinek > wrote: > >> Hi, >> >> On 05/06/2019 00:08, Andres Freund wrote: >> > Hi, >> > >> > On 2019-06-05 00:05:02 +0200, David Fetter wrote: >> >> Would it make sense to work toward a binary format that's not >> >> architecture-specific? I recall from COPY that our binary format is >> >> not standardized across, for example, big- and little-endian machines. >> > >> > I think you recall wrongly. It's obviously possible that we have bugs >> > around this, but output/input routines are supposed to handle a >> > endianess independent format. That usually means that you have to do >> > endianess conversions, but that doesn't make it non-standardized. >> > >> >> Yeah, there are really 3 formats of data we have, text protocol, binary >> network protocol and internal on disk format. The internal on disk >> format will not work across big/little-endian but network binary >> protocol will. >> >> FWIW I don't think the code for binary format was included in original >> logical replication patch (I really tried to keep it as minimal as >> possible), but the code and protocol is pretty much ready for adding that. >> > Yes, I looked through the public history and could not find it. Thanks for > confirming. > >> >> That said, pglogical has code which handles this (I guess Andres means >> that by predecessor of pgoutput) so if you look for example at the >> write_tuple/read_tuple/decide_datum_transfer in >> >> https://github.com/2ndQuadrant/pglogical/blob/REL2_x_STABLE/pglogical_proto_native.c >> that can help you give some ideas on how to approach this. >> > > Thanks for the tip! > Looking at: https://github.com/postgres/postgres/blob/8255c7a5eeba8f1a38b7a431c04909bde4f5e67d/src/backend/replication/pgoutput/pgoutput.c#L163 this seems completely ignored. What was the intent? Dave
Re: WITH NOT MATERIALIZED and DML CTEs
On 2019-Jun-03, Andres Freund wrote: > On 2019-06-03 11:45:51 -0400, Elvis Pranskevichus wrote: > > It is understandably late in the 12 cycle, so maybe prohibit NOT > > MATERIALIZED with DML altogether and revisit this in 13? > > I could see us adding an error, or just continuing to silently ignore > it. Hmm, shouldn't we be throwing an error for that case? I'm not sure it's defensible that we don't. -- Álvaro Herrera  https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Binary support for pgoutput plugin
Hi On June 5, 2019 8:51:10 AM PDT, Dave Cramer wrote: >On Wed, 5 Jun 2019 at 07:21, Dave Cramer wrote: > >> Hi, >> >> >> On Wed, 5 Jun 2019 at 07:18, Petr Jelinek > >> wrote: >> >>> Hi, >>> >>> On 05/06/2019 00:08, Andres Freund wrote: >>> > Hi, >>> > >>> > On 2019-06-05 00:05:02 +0200, David Fetter wrote: >>> >> Would it make sense to work toward a binary format that's not >>> >> architecture-specific? I recall from COPY that our binary format >is >>> >> not standardized across, for example, big- and little-endian >machines. >>> > >>> > I think you recall wrongly. It's obviously possible that we have >bugs >>> > around this, but output/input routines are supposed to handle a >>> > endianess independent format. That usually means that you have to >do >>> > endianess conversions, but that doesn't make it non-standardized. >>> > >>> >>> Yeah, there are really 3 formats of data we have, text protocol, >binary >>> network protocol and internal on disk format. The internal on disk >>> format will not work across big/little-endian but network binary >>> protocol will. >>> >>> FWIW I don't think the code for binary format was included in >original >>> logical replication patch (I really tried to keep it as minimal as >>> possible), but the code and protocol is pretty much ready for adding >that. >>> >> Yes, I looked through the public history and could not find it. >Thanks for >> confirming. >> >>> >>> That said, pglogical has code which handles this (I guess Andres >means >>> that by predecessor of pgoutput) so if you look for example at the >>> write_tuple/read_tuple/decide_datum_transfer in >>> >>> >https://github.com/2ndQuadrant/pglogical/blob/REL2_x_STABLE/pglogical_proto_native.c >>> that can help you give some ideas on how to approach this. >>> >> >> Thanks for the tip! >> > >Looking at: >https://github.com/postgres/postgres/blob/8255c7a5eeba8f1a38b7a431c04909bde4f5e67d/src/backend/replication/pgoutput/pgoutput.c#L163 > >this seems completely ignored. What was the intent? That's about the output of the plugin, not the datatypes. And independent of text/binary output, the protocol contains non-printable chars. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [PATCH] Improve performance of NOTIFY over many databases (issue blocking on AccessExclusiveLock on object 0 of class 1262 of database 0)
Hoi hackers, Please find attached updated versions of the patches, I've now tested them. Also attached is a reproduction script to verify that they actually work. To test you need to create 150 databases as described in the script, then simply execute it. Before patching you get the following results (last figure is the CPU usage of Postgres): 1559749330 Sent: 500, Recv: 1000, Delays: Min: 0.01, Max: 0.01, Avg: 0.01 [0.01/0.01/0.01/0.01/0.01], 269.07% 1559749335 Sent: 500, Recv: 1000, Delays: Min: 0.01, Max: 0.01, Avg: 0.01 [0.01/0.01/0.01/0.01/0.01], 268.07% 1559749340 Sent: 500, Recv: 1000, Delays: Min: 0.01, Max: 0.01, Avg: 0.01 [0.01/0.01/0.01/0.01/0.01], 270.94% After patching you get the following: 1559749840 Sent: 500, Recv: 1000, Delays: Min: 0.01, Max: 0.02, Avg: 0.01 [0.01/0.01/0.01/0.01/0.01], 5.09% 1559749845 Sent: 500, Recv: 1000, Delays: Min: 0.01, Max: 0.01, Avg: 0.01 [0.01/0.01/0.01/0.01/0.01], 5.06% 1559749850 Sent: 500, Recv: 1000, Delays: Min: 0.01, Max: 0.01, Avg: 0.01 [0.01/0.01/0.01/0.01/0.01], 4.75% The async queue functions in postgres also no longer appear in the perf output (below measuring threshold). As for general method, it seems like the actual optimisation here is that the async queue tail pointer is only updated once per SLRU page instead of every message. This would require a significantly larger patch, but shouldn't be too difficult. Thoughts? Have a nice day, Martijn On Tue, 4 Jun 2019 at 09:08, Martijn van Oosterhout wrote: > > Hoi hackers, > > We've been having issues with NOTIFYs blocking over multiple databases > (see [1] for more details). That was 9.4 but we've updated the > database to 11.3 and still have the same issue. Now however we could > use perf to do profiling and got the following profile (useless > details elided): > > --32.83%--ProcessClientReadInterrupt >--32.68%--ProcessNotifyInterrupt > --32.16%--asyncQueueReadAllNotifications > --23.37%--asyncQueueAdvanceTail > --20.49%--LWLockAcquire >--18.93%--LWLockQueueSelf > --12.99%--LWLockWaitListLock > > (from: perf record -F 99 -ag -- sleep 600) > > That shows that more than 20% of the time is spent in that single > function, waiting for an exclusive lock on the AsyncQueueLock. This > will block any concurrent session doing a NOTIFY in any database on > the system. This would certainly explain the symptoms we're seeing > (process xxx still waiting for AccessExclusiveLock on object 0 of > class 1262 of database 0). > > Analysis of the code leads me to the following hypothesis (and hence > to the attached patches): > > We have ~150 databases, each of which has 2 active backends with an > active LISTEN. When a NOTIFY happens anywhere on any database it > (under an exclusive lock) makes a list of 300 backends to send a > signal to. It then wakes up all of those backends. Each backend then > examines the message and all but one discards it as being for the > wrong database. Each backend then calls asyncQueueAdvanceTail (because > the current position of the each backend was the tail) which then > takes an exclusive lock and checks all the other backends to see if > the tail can be advanced. All of these will conclude 'no', except the > very last one which concludes the tail can be advanced by about 50 > bytes or so. > > So the inner loop of asyncQueueAdvanceTail will, while holding a > global exclusive lock, execute 2*150*4000 (max backends) = 1.2 million > times for basically no benefit. During this time, no other transaction > anywhere in the system that does a NOTIFY will be able to commit. 
> > The attached patches attempt reduce the overhead in two ways: > > Patch 1: Changes asyncQueueAdvanceTail to do nothing unless the > QUEUE_HEAD is on a different page than the QUEUE_TAIL. The idea is > that there's no point trying to advance the tail unless we can > actually usefully truncate the SLRU. This does however mean that > asyncQueueReadAllNotifications always has to call > asyncQueueAdvanceTail since it can no longer be guaranteed that any > backend is still at the tail, which is one of the assumptions of the > current code. Not sure if this is a problem or if it can be improved > without tracking much more state. > > Patch 2: Changes SignalBackends to only notify other backends when (a) > they're the same database as me or (b) the notify queue has advanced > to a new SLRU page. This avoids backends being woken up for messages > which they are not interested in. > > As a consequence of these changes, we can reduce the number of > exclusive locks and backend wake ups in our case by a factor of 300. > You still however get a thundering herd at the end of each SLRU page. > > Note: these patches have not yet been extensively tested, and so > should only be used as basis for discussion. > > Comments? Suggestions? > > [1] > https://www.postgresql.org/message-id/cadwg95t0j9zf0uwdcmh81kmndsitavhxmbvgyqrrjcd-ilw...@mail.gmail.com > > -- > Marti
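For readers skimming the thread, the core of both ideas can be sketched roughly like this. This is simplified pseudo-code written against async.c's existing QUEUE_POS_PAGE()/QUEUE_HEAD/QUEUE_TAIL macros and QueuePosition type, not the attached patches themselves, and the exact conditions in the patches may differ:

    /*
     * Sketch of patch 1's idea: only try to advance the tail (and thus only
     * take AsyncQueueLock exclusively and scan all backends) when doing so
     * could actually allow an SLRU page to be truncated.
     */
    static void
    asyncQueueAdvanceTail_sketch(void)
    {
        if (QUEUE_POS_PAGE(QUEUE_HEAD) == QUEUE_POS_PAGE(QUEUE_TAIL))
            return;             /* nothing truncatable yet */

        /* ... existing logic: scan backends, compute min position, truncate ... */
    }

    /*
     * Sketch of patch 2's idea: when signalling listening backends, skip
     * those in other databases unless the queue head has moved onto a new
     * page since they last read, so they are still woken up often enough to
     * keep the tail advancing.
     */
    static bool
    should_signal_backend_sketch(Oid listener_db, QueuePosition listener_pos)
    {
        if (listener_db == MyDatabaseId)
            return true;
        return QUEUE_POS_PAGE(QUEUE_HEAD) != QUEUE_POS_PAGE(listener_pos);
    }

The effect is that the exclusive-lock work happens once per SLRU page rather than once per notification, which matches the measured drop in CPU usage above.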
Re: [PATCH] Incremental sort (was: PoC: Partial sort)
On Mon, 3 Jun 2019 at 15:39, James Coleman wrote: > > On Sun, Jun 2, 2019 at 5:18 PM Tomas Vondra wrote: > > Currently, the costing logic (cost_incremental_sort) assumes all groups > > have the same size, which is fine for uniform distributions. But for > > skewed distributions, that may be an issue. > > > > Consider for example a table with 1M rows, two columns, 100 groups in each > > column, and index on the first one. > > > > CREATE table t (a INT, b INT); > > > > INSERT INTO t SELECT 100*random(), 100*random() > > FROM generate_series(1,1000000); > > > > Now, let's do a simple limit query to find the first row: > > > > SELECT * FROM t ORDER BY a, b LIMIT 1; > > > > In this case the current costing logic is fine - the groups are close to > > average size, and we only need to sort the first group, i.e. 1% of data. > > > > Now, let's say the first group is much larger: > > > > > > INSERT INTO t SELECT 0, 100*random() > > FROM generate_series(1,900000) s(i); > > > > INSERT INTO t SELECT 100*random(), 100*random() > > FROM generate_series(1,100000); > > > > That is, the first group is roughly 90% of data, but the number of groups > > is still the same. But we need to scan 90% of data. But the average group > > size is still the same, so the current cost model is oblivious to this. > > Thinking out loud here: the current implementation doesn't guarantee > that sort groups always have the same prefix column values because > (from code comments) "Sorting many small groups with tuplesort is > inefficient"). While this seems a reasonable optimization, I think it's > possible that thinking steered away from an optimization in the > opposite direction. Perhaps we should always track whether or not all > prefix tuples are the same (currently we only do that after reaching > DEFAULT_MIN_GROUP_SIZE tuples) and use that information to be able to > have tuplesort only care about the non-prefix columns (where currently > it has to sort on all pathkey columns even though for a large group > the prefix columns are guaranteed to be equal). > +1 for passing only the non-prefix columns to tuplesort. > Essentially I'm trying to think of ways that would get us to > comparable performance with regular sort in the case of large batch > sizes. > > One other thing about the DEFAULT_MIN_GROUP_SIZE logic is that in the > case where you have a very small group and then a very large batch, > we'd lose the ability to optimize in the above way. That makes me > wonder if we shouldn't intentionally optimize for the possibility of > large batch sizes, since a little extra expense per group/tuple is > more likely to be a non-concern with small groups anyway when there > are large numbers of input tuples but a relatively small limit. > What about using the knowledge of MCV here, if we know the next value is in MCV list then take the overhead of sorting this small group alone and then leverage the optimization for the larger group, by passing only the non-prefix columns. > Thoughts? > > > So I think we should look at the MCV list, and use that information when > > computing the startup/total cost. I think using the first/largest group to > > compute the startup cost, and the average group size for total cost would > > do the trick. > +1 > I think this sounds very reasonable. > > > I don't think we can do much better than this during planning. There will > > inevitably be cases where the costing model will push us to do the wrong > > thing, in either direction.
The question is how serious issue that is, and > > whether we could remedy that during execution somehow. > > > > When we "incorrectly" pick full sort (when the incremental sort would be > > faster), that's obviously sad but I think it's mostly fine because it's > > not a regression. > > > > The opposite direction (picking incremental sort, while full sort being > > faster) is probably more serious, because it's a regression between > > releases. > > > > I don't think we can fully fix that by refining the cost model. We have > > two basic options: > > > > 1) Argue/show that this is not an issue in practice, because (a) such > > cases are very rare, and/or (b) the regression is limited. In short, the > > benefits of the patch outweight the losses. > > My comments above go in this direction. If we can improve performance > in the worst case, I think it's plausible this concern becomes a > non-issue. > > > 2) Provide some fallback at execution time. For example, we might watch > > the size of the group, and if we run into an unexpectedly large one we > > might abandon the incremental sort and switch to a "full sort" mode. > > Are there good examples of our doing this in other types of nodes > (whether the fallback is an entirely different algorithm/node type)? I > like this idea in theory, but I also think it's likely it would add a > very significant amount of complexity. The other problem is knowing > where to draw the line: you end up creating th
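Spelling the MCV-based costing suggestion from earlier in this message out a bit (a sketch of the idea only, not the formulas in any patch): with N input rows, G estimated prefix groups, n_1 the size of the first (or largest, per the MCV list) group, and C_sort(n) the cost of sorting one n-row group on the remaining keys, the proposal amounts to roughly

    \[
      \mathrm{startup\_cost} \approx \mathrm{input\_startup\_cost} + C_{\mathrm{sort}}(n_1)
    \]
    \[
      \mathrm{total\_cost} \approx \mathrm{input\_total\_cost} + G \cdot C_{\mathrm{sort}}(N / G)
    \]

i.e. the startup cost reflects having to read and sort the whole first group before returning anything, while the total cost still uses the average group size.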
Re: Sort support for macaddr8
On Tue, Jun 4, 2019 at 3:50 PM Peter Geoghegan wrote: > On Tue, Jun 4, 2019 at 3:23 PM Andres Freund wrote: > > > This is half the reason why I ended up implementing itemptr_encode() > > > to accelerate the TID sort used by CREATE INDEX CONCURRENTLY some > > > years back -- TID is 6 bytes wide, which doesn't have the necessary > > > macro support within postgres.h. There is no reason why that couldn't > > > be added for the benefit of both TID and macaddr types, though it > > > probably wouldn't be worth it. > > > > I think we should definitely do that. It seems not unlikely that other > > people want to write new fixed width types, and we shouldn't have them > > deal with all this complexity unnecessarily. > > On second thought, maybe there is something to be said for being > exhaustive here. > > It seems like there is a preference for making macaddr8 pass-by-value > instead of adding abbreviated keys support to macaddr8, and possibly > doing the same with the original macaddr type. > > Do you think that you'll be able to work on the project with this > expanded scope, Melanie? > > I can take on making macaddr8 pass-by-value I tinkered a bit last night and got in/out mostly working (I think). I'm not sure about macaddr, TID, and user-defined types. -- Melanie Plageman
Re: Sort support for macaddr8
On 2019-Jun-05, Melanie Plageman wrote: > I can take on making macaddr8 pass-by-value > I tinkered a bit last night and got in/out mostly working (I think). > I'm not sure about macaddr, TID, and user-defined types. Yeah, let's see what macaddr8 looks like, and we can move from there -- I suppose adapting for macaddr would not be terribly different, but we don't have to do both in a single commit. I don't expect that TID would necessarily be similar since we have lots of bespoke code for that in lots of places; it might not affect anything (it should not!) but then it might. No reason not to move forward incrementally. -- Álvaro Herrera  https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Sort support for macaddr8
Hi, On 2019-06-05 09:18:34 -0700, Melanie Plageman wrote: > On Tue, Jun 4, 2019 at 3:50 PM Peter Geoghegan wrote: > > > On Tue, Jun 4, 2019 at 3:23 PM Andres Freund wrote: > > > > This is half the reason why I ended up implementing itemptr_encode() > > > > to accelerate the TID sort used by CREATE INDEX CONCURRENTLY some > > > > years back -- TID is 6 bytes wide, which doesn't have the necessary > > > > macro support within postgres.h. There is no reason why that couldn't > > > > be added for the benefit of both TID and macaddr types, though it > > > > probably wouldn't be worth it. > > > > > > I think we should definitely do that. It seems not unlikely that other > > > people want to write new fixed width types, and we shouldn't have them > > > deal with all this complexity unnecessarily. > > > > On second thought, maybe there is something to be said for being > > exhaustive here. > > > > It seems like there is a preference for making macaddr8 pass-by-value > > instead of adding abbreviated keys support to macaddr8, and possibly > > doing the same with the original macaddr type. > > > > Do you think that you'll be able to work on the project with this > > expanded scope, Melanie? > > > > > I can take on making macaddr8 pass-by-value > I tinkered a bit last night and got in/out mostly working (I think). > I'm not sure about macaddr, TID, and user-defined types.

I'd much rather see this tackled in a general way than fiddling with individual datatypes. I think we should:

1) Make fetch_att(), store_att_byval() etc. support datums of any length between 1 and <= sizeof(Datum). Probably also convert them to inline functions. There's a few more functions to be adjusted, but not many, I think.

2) Remove the ability to pass PASSEDBYVALUE to CREATE TYPE, but instead compute whether byval is possible, solely based on INTERNALLENGTH (when INTERNALLENGTH > 0, obviously).

3) Fix the fallout, by fixing a few of the Datum<->type conversion functions affected by this change. That'll require a bit of work, but not too much. We should write those conversion routines in a way that'll keep them working for the scenarios where the type is actually passable by value, and not (required for > 4 byte datums).

Greetings, Andres Freund
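As an illustration of point 1), a generalized fetch_att() could have roughly the following shape. This is a sketch only, under the assumption that copying the low-order bytes is acceptable; a real implementation would have to pin down the endianness and alignment rules so that store_att_byval() round-trips the same bytes, which the current cast-based macros for 1/2/4/8-byte widths sidestep:

    #include "postgres.h"

    /*
     * Hedged sketch (not the real fetch_att): accept any by-value width
     * from 1 to sizeof(Datum), rather than only the power-of-two sizes.
     */
    static inline Datum
    fetch_att_any(const void *T, bool attbyval, int attlen)
    {
        Datum       datum = 0;

        if (!attbyval)
            return PointerGetDatum(T);

        Assert(attlen > 0 && attlen <= (int) sizeof(Datum));

        /* copy attlen bytes into a zeroed Datum */
        memcpy(&datum, T, attlen);
        return datum;
    }

With something like this, a 6-byte type such as TID or macaddr, or a user-defined fixed-width type, could be marked by-value without per-type macro support in postgres.h.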
Re: Sort support for macaddr8
On 2019-Jun-05, Andres Freund wrote: > I'd much rather see this tackled in a general way than fiddling with > individual datatypes. I think we should: > > 1) make fetch_att(), store_att_byval() etc support datums of any length > between 1 and <= sizeof(Datum). Probably also convert them to inline > functions. There's a few more functions to be adjusted, but not many, > I think. Does this mean that datatypes that are >4 and <=8 bytes need to handle both cases? Is it possible for them to detect the current environment? -- Álvaro Herrera  https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Sort support for macaddr8
Hi, On June 5, 2019 12:14:42 PM PDT, Alvaro Herrera wrote: >On 2019-Jun-05, Andres Freund wrote: > >> I'd much rather see this tackled in a general way than fiddling with >> individual datatypes. I think we should: >> >> 1) make fetch_att(), store_att_byval() etc support datums of any >length >>between 1 and <= sizeof(Datum). Probably also convert them to >inline >>functions. There's a few more functions to be adjusted, but not >many, >>I think. > >Does this mean that datatypes that are >4 and <=8 bytes need to handle >both cases? Is it possible for them to detect the current environment? Well, the conversion macros need to know. You can look at float8 for an example of the difference - it's pretty centralized. We should provide a few helper macros to abstract that away. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
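For reference, the float8 pattern being referred to looks roughly like this. This is simplified; the exact definitions in postgres.h differ a bit between versions:

    #ifdef USE_FLOAT8_BYVAL
    /* value lives in the Datum itself: reinterpret the 8 bytes */
    static inline Datum
    Float8GetDatum(double X)
    {
        union { double f; int64 i; } u;

        u.f = X;
        return Int64GetDatum(u.i);
    }

    static inline double
    DatumGetFloat8(Datum X)
    {
        union { int64 i; double f; } u;

        u.i = DatumGetInt64(X);
        return u.f;
    }
    #else
    /* value lives in palloc'd memory: the Datum is a pointer to it */
    extern Datum Float8GetDatum(double X);      /* palloc + copy */
    #define DatumGetFloat8(X)  (*((float8 *) DatumGetPointer(X)))
    #endif

A new fixed-width type would want an equivalent pair of conversion helpers, so the rest of its code doesn't need to care which build it is in.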
Re: Question about some changes in 11.3
Thanks for taking a look at this Tom. On Mon, Jun 3, 2019 at 4:07 PM Tom Lane wrote: > Mat Arye writes: > > On Fri, May 24, 2019 at 5:05 PM Mat Arye wrote: > >> 11.3 included some change to partition table planning. Namely > >> commit 925f46f ("Fix handling of targetlist SRFs when scan/join > relation is > >> known empty.") seems to redo all paths for partitioned tables > >> in apply_scanjoin_target_to_paths. It clears the paths in: > >> > >> ``` > >> if (rel_is_partitioned) > >> rel->pathlist = NIL > >> ``` > >> > >> Then the code rebuild the paths. However, the rebuilt append path never > >> gets the > >> set_rel_pathlist_hook called. Thus, the work that hook did previously > gets > >> thrown away and the rebuilt append path can never be influenced by this > >> hook. Is this intended behavior? Am I missing something? > > Hm. I'd say this was already broken by the invention of > apply_scanjoin_target_to_paths; perhaps 11-before-11.3 managed to > still work for you, but it's not hard to envision applications of > set_rel_pathlist_hook for which it would not have worked. The contract > for set_rel_pathlist_hook is supposed to be that it gets to editorialize > on all normal (non-Gather) paths created by the core code, and that's > no longer the case now that apply_scanjoin_target_to_paths can add more. > Yeah it worked for our cases because (I guess) out paths turned out to be lower cost, but I see your point. > > > I've attached a small patch to address this discrepancy for when the > > set_rel_pathlist_hook is called so that's it is called for actual paths > > used for partitioned rels. Please let me know if I am misunderstanding > how > > this should be handled. > > I'm not very happy with this patch either, as it makes the situation > even more confused, not less so. The best-case scenario is that the > set_rel_pathlist_hook runs twice and does useless work; the worst case > is that it gets confused completely by being called twice for the same > rel. I think we need to maintain the invariant that that hook is > called exactly once per baserel. > Yeah getting called once per baserel is a nice invariant to have. > I wonder whether we could fix matters by postponing the > set_rel_pathlist_hook call till later in the same cases where > we postpone generate_gather_paths, ie, it's the only baserel. > > That would make its name pretty misleading, though. How would simply delaying the hook make the name misleading? I am also wondering if using the condition `rel->reloptkind == RELOPT_BASEREL && bms_membership(root->all_baserels) != BMS_SINGLETON` is sufficient. Is it really guaranteed that `apply_scanjoin_target_to_paths` will not be called in other cases? > Maybe we > should leave it alone and invent a separate hook to be called > by/after apply_scanjoin_target_to_paths? Although I don't > know if it'd be useful to add a new hook to v11 at this point. > Extensions would have a hard time knowing if they could use it. > I think for us, either approach would work. We just need a place to add/modify some paths. FWIW, I think delaying the hook is easier to deal with on our end if it could work since we don't have to deal with two different code paths but either is workable. Certainly if we go with the new hook approach I think it should be added to v11 as well. That way extensions that need the functionality can hook into it and deal with patch level differences instead of having no way at all to get at this functionality. I am more than happy to work on a new patch once we settle on an approach. 
> > regards, tom lane >
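For context, a typical extension installs the hook roughly like the sketch below (names are illustrative, not from any real extension). Since such hooks tend to assume they see the final path list exactly once per baserel, both the "hook runs twice" and the "hook never sees the rebuilt append path" behaviors discussed above are easy ways to break them:

    #include "postgres.h"
    #include "fmgr.h"
    #include "optimizer/paths.h"

    PG_MODULE_MAGIC;

    static set_rel_pathlist_hook_type prev_set_rel_pathlist_hook = NULL;

    static void
    my_set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
                        Index rti, RangeTblEntry *rte)
    {
        if (prev_set_rel_pathlist_hook)
            prev_set_rel_pathlist_hook(root, rel, rti, rte);

        /*
         * Inspect or replace rel->pathlist here, e.g. add a custom scan
         * path.  Extensions generally assume this runs once per baserel and
         * that the paths seen here are the ones the planner will consider.
         */
    }

    void
    _PG_init(void)
    {
        prev_set_rel_pathlist_hook = set_rel_pathlist_hook;
        set_rel_pathlist_hook = my_set_rel_pathlist;
    }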
Re: Fix runtime errors from -fsanitize=undefined
On Mon, Jun 3, 2019 at 3:22 PM Peter Eisentraut wrote: > After many years of trying, it seems the -fsanitize=undefined checking > in gcc is now working somewhat reliably. Attached is a patch that fixes > all errors of the kind Is this as of some particular gcc version? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Index Skip Scan
> To address this, probably we can do something like in the attached patch. > Altogether with distinct_pathkeys uniq_distinct_pathkeys are stored, which is > the same, but without the constants elimination. It's being used then for > getting the real number of distinct keys, and to check the order of the > columns > to not consider index skip scan if it's different. Hope it doesn't > look too hacky. > Thanks! I've verified that it works now. I was wondering if we're not too strict in some cases now though. Consider the following queries: postgres=# explain(analyze) select distinct on (m,f) m,f from t where m='M2'; QUERY PLAN --- Index Only Scan using t_m_f_t_idx on t (cost=0.29..11.60 rows=40 width=5) (actual time=0.056..0.469 rows=10 loops=1) Scan mode: Skip scan Index Cond: (m = 'M2'::text) Heap Fetches: 10 Planning Time: 0.095 ms Execution Time: 0.490 ms (6 rows) postgres=# explain(analyze) select distinct on (f) m,f from t where m='M2'; QUERY PLAN Unique (cost=0.29..849.83 rows=10 width=5) (actual time=0.088..10.920 rows=10 loops=1) -> Index Only Scan using t_m_f_t_idx on t (cost=0.29..824.70 rows=10052 width=5) (actual time=0.087..8.524 rows=1 loops=1) Index Cond: (m = 'M2'::text) Heap Fetches: 1 Planning Time: 0.078 ms Execution Time: 10.944 ms (6 rows) This is basically the opposite case - when distinct_pathkeys matches the filtered list of index keys, an index skip scan could be considered. Currently, the user needs to write 'distinct m,f' explicitly, even though he specifies in the WHERE-clause that 'm' can only have one value anyway. Perhaps it's fine like this, but it could be a small improvement for consistency. -Floris?
Add CREATE DATABASE LOCALE option
I propose this patch to add a LOCALE option to CREATE DATABASE. This sets both LC_COLLATE and LC_CTYPE with one option. Similar behavior is already supported in initdb, CREATE COLLATION, and createdb. With collation providers other than libc, having separate lc_collate and lc_ctype settings is not necessarily applicable, so this is also preparation for such future functionality. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services From 5f64d10944b272a5b636d7b0033a0a6a3d28998b Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Tue, 4 Jun 2019 14:45:00 +0200 Subject: [PATCH] Add CREATE DATABASE LOCALE option This sets both LC_COLLATE and LC_CTYPE with one option. Similar behavior is already supported in initdb, CREATE COLLATION, and createdb. --- doc/src/sgml/ref/create_database.sgml | 15 +-- src/backend/commands/dbcommands.c | 20 src/bin/pg_dump/pg_dump.c | 23 +-- 3 files changed, 50 insertions(+), 8 deletions(-) diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml index b2c9e241c2..c5b7af5cf7 100644 --- a/doc/src/sgml/ref/create_database.sgml +++ b/doc/src/sgml/ref/create_database.sgml @@ -25,6 +25,7 @@ [ [ WITH ] [ OWNER [=] user_name ] [ TEMPLATE [=] template ] [ ENCODING [=] encoding ] + [ LOCALE [=] locale ] [ LC_COLLATE [=] lc_collate ] [ LC_CTYPE [=] lc_ctype ] [ TABLESPACE [=] tablespace_name ] @@ -111,6 +112,16 @@ Parameters + + locale + + +This is a shortcut for setting LC_COLLATE +and LC_CTYPE at once. If you specify this, +you cannot specify either of those parameters. + + + lc_collate @@ -287,7 +298,7 @@ Examples To create a database music with a different locale: CREATE DATABASE music -LC_COLLATE 'sv_SE.utf8' LC_CTYPE 'sv_SE.utf8' +LOCALE 'sv_SE.utf8' TEMPLATE template0; In this example, the TEMPLATE template0 clause is required if @@ -300,7 +311,7 @@ Examples different character set encoding: CREATE DATABASE music2 -LC_COLLATE 'sv_SE.iso885915' LC_CTYPE 'sv_SE.iso885915' +LOCALE 'sv_SE.iso885915' ENCODING LATIN9 TEMPLATE template0; diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c index 15207bf75a..ac275f7e64 100644 --- a/src/backend/commands/dbcommands.c +++ b/src/backend/commands/dbcommands.c @@ -124,6 +124,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) DefElem*downer = NULL; DefElem*dtemplate = NULL; DefElem*dencoding = NULL; + DefElem*dlocale = NULL; DefElem*dcollate = NULL; DefElem*dctype = NULL; DefElem*distemplate = NULL; @@ -184,6 +185,15 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) parser_errposition(pstate, defel->location))); dencoding = defel; } + else if (strcmp(defel->defname, "locale") == 0) + { + if (dlocale) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), +errmsg("conflicting or redundant options"), +parser_errposition(pstate, defel->location))); + dlocale = defel; + } else if (strcmp(defel->defname, "lc_collate") == 0) { if (dcollate) @@ -244,6 +254,11 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) parser_errposition(pstate, defel->location))); } + if (dlocale && (dcollate || dctype)) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), +errmsg("conflicting or redundant options"))); + if (downer && downer->arg) dbowner = defGetString(downer); if (dtemplate && dtemplate->arg) @@ -276,6 +291,11 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) parser_errposition(pstate, dencoding->location))); } } + if (dlocale && dlocale->arg) + { + dbcollate = defGetString(dlocale); + dbctype 
= defGetString(dlocale); + } if (dcollate && dcollate->arg) dbcollate = defGetString(dcollate); if (dctype && dctype->arg) diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c index 9f59cc74ee..ba34677787 100644 --- a/src/bin/pg_dump/pg_dump.c +++ b/src/bin/pg_dump/pg_dump.c @@ -2812,1
Re: Add CREATE DATABASE LOCALE option
On Wed, Jun 5, 2019 at 5:17 PM Peter Eisentraut < peter.eisentr...@2ndquadrant.com> wrote: > > I propose this patch to add a LOCALE option to CREATE DATABASE. This > sets both LC_COLLATE and LC_CTYPE with one option. Similar behavior is > already supported in initdb, CREATE COLLATION, and createdb. > > With collation providers other than libc, having separate lc_collate and > lc_ctype settings is not necessarily applicable, so this is also > preparation for such future functionality. > Cool... would be nice also add some test cases. Regards, -- Fabrízio de Royes Mello Timbira - http://www.timbira.com.br/ PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
Re: Why does pg_checksums -r not have a long option?
On 2019-05-28 04:56, Michael Paquier wrote: > You could also use a long option for that without a one-letter option, > like --file-path or such, so reserving a one-letter option for a > future, hypothetical use is not really a stopper in my opinion. In > consequence, I think that that it is fine to just use -f/--filenode. > Any objections or better suggestions from other folks here? I think -r/--relfilenode was actually a good suggestion. Because it doesn't actually check a *file* but potentially several files (forks, segments). The -f naming makes it sound like it operates on a specific file. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Cleaning up and speeding up string functions
On 2019-May-26, David Rowley wrote: > On Sun, 26 May 2019 at 04:50, Tom Lane wrote: > > Here the cost is code space rather than programmer-visible complexity, > > but I still doubt that it's worth it. > > I see on today's master the postgres binary did grow from 8633960 > bytes to 8642504 on my machine using GCC 8.3, so you might be right. > pg_receivewal grew from 96376 to 96424 bytes. I suppose one place that could be affected visibly is JSON object construction (json.c, jsonfuncs.c) that could potentially deal with millions of stringinfo manipulations, but most of those calls don't actually use appendStringInfoString with constant values, so it's probably not worth bothering with. -- Álvaro Herrera  https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: crash testing suggestions for 12 beta 1
On 2019-May-23, Jeff Janes wrote: > Now that beta is out, I wanted to do some crash-recovery testing where I > inject PANIC-inducing faults and see if it recovers correctly. A long-lived > Perl process keeps track of what it should find after the crash, and > verifies that it finds it. You will probably be familiar with the general > theme from examples like the threads below. Would anyone like to nominate > some areas to focus on? Thanks for the offer! Your work has shown its value in previous cycles. REINDEX CONCURRENTLY would be one good area to focus on, I think, as well as ALTER TABLE ATTACH PARTITION. Maybe also INCLUDE columns in GiST, and the stuff in commits 9155580fd, fe280694d and 7df159a62. Thanks, -- Álvaro Herrera  https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: crash testing suggestions for 12 beta 1
On Wed, Jun 5, 2019 at 2:11 PM Alvaro Herrera wrote: > REINDEX CONCURRENTLY would be one good area to focus on, I think, as > well as ALTER TABLE ATTACH PARTITION. Maybe also INCLUDE columns in > GiST, and the stuff in commits 9155580fd, fe280694d and 7df159a62. Those all seem like good things to target. Forgive me for droning on about amcheck once more, but maybe it'll help: amcheck has the capability to detect at least two historic bugs in CREATE INDEX CONCURRENTLY that made it into stable releases. The "heapallindexed" verification option's bt_tuple_present_callback() function has a header comment that talks about this. In short, any "unhandled" broken hot chain (i.e. broken hot chains that are somehow not correctly detected and handled) should be reported as corrupt by amcheck with the "heapallindexed" check, provided the tuple is visible to verification's heap scan. The CREATE INDEX CONCURRENTLY bug that Pavan found a couple of years back while testing the WARM patch is one example. A bug that was fallout from the DROP INDEX CONCURRENTLY work is another historic example. Alvaro will recall that this same check had a role in the "freeze the dead" business. -- Peter Geoghegan
Re: Remove useless associativity/precedence from parsers
On 2019-May-28, Andrew Gierth wrote: > The main cases I know of are: > > 1. RANGE UNBOUNDED PRECEDING - this one is actually a defect in the > standard SQL grammar, since UNBOUNDED is a non-reserved keyword and so > it can also appear as a legal , and the construct > RANGE PRECEDING allows to > appear as a . Should we report this to the SQL committee? -- Álvaro Herrera  https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Binary support for pgoutput plugin
Hi, On Wed, 5 Jun 2019 at 12:01, Andres Freund wrote: > Hi > > On June 5, 2019 8:51:10 AM PDT, Dave Cramer wrote: > >On Wed, 5 Jun 2019 at 07:21, Dave Cramer wrote: > > > >> Hi, > >> > >> > >> On Wed, 5 Jun 2019 at 07:18, Petr Jelinek > > > >> wrote: > >> > >>> Hi, > >>> > >>> On 05/06/2019 00:08, Andres Freund wrote: > >>> > Hi, > >>> > > >>> > On 2019-06-05 00:05:02 +0200, David Fetter wrote: > >>> >> Would it make sense to work toward a binary format that's not > >>> >> architecture-specific? I recall from COPY that our binary format > >is > >>> >> not standardized across, for example, big- and little-endian > >machines. > >>> > > >>> > I think you recall wrongly. It's obviously possible that we have > >bugs > >>> > around this, but output/input routines are supposed to handle a > >>> > endianess independent format. That usually means that you have to > >do > >>> > endianess conversions, but that doesn't make it non-standardized. > >>> > > >>> > >>> Yeah, there are really 3 formats of data we have, text protocol, > >binary > >>> network protocol and internal on disk format. The internal on disk > >>> format will not work across big/little-endian but network binary > >>> protocol will. > >>> > >>> FWIW I don't think the code for binary format was included in > >original > >>> logical replication patch (I really tried to keep it as minimal as > >>> possible), but the code and protocol is pretty much ready for adding > >that. > >>> > >> Yes, I looked through the public history and could not find it. > >Thanks for > >> confirming. > >> > >>> > >>> That said, pglogical has code which handles this (I guess Andres > >means > >>> that by predecessor of pgoutput) so if you look for example at the > >>> write_tuple/read_tuple/decide_datum_transfer in > >>> > >>> > > > https://github.com/2ndQuadrant/pglogical/blob/REL2_x_STABLE/pglogical_proto_native.c > >>> that can help you give some ideas on how to approach this. > >>> > >> > >> Thanks for the tip! > >> > > > >Looking at: > > > https://github.com/postgres/postgres/blob/8255c7a5eeba8f1a38b7a431c04909bde4f5e67d/src/backend/replication/pgoutput/pgoutput.c#L163 > > > >this seems completely ignored. What was the intent? > > That's about the output of the plugin, not the datatypes. And independent > of text/binary output, the protocol contains non-printable chars. > > Andres > -- > Sent from my Android device with K-9 Mail. Please excuse my brevity. > So one of the things they would like added is to get not null information in the schema record. This is so they can mark the field Optional in Java. I presume this would also have some uses in other languages. As I understand it this would require a protocol bump. If this were to be accepted are there any outstanding asks that would useful to add if we were going to bump the protocol? Dave
Re: Binary support for pgoutput plugin
Hi, On 2019-06-05 18:47:57 -0400, Dave Cramer wrote: > So one of the things they would like added is to get not null information > in the schema record. This is so they can mark the field Optional in Java. > I presume this would also have some uses in other languages. As I > understand it this would require a protocol bump. If this were to be > accepted are there any outstanding asks that would useful to add if we were > going to bump the protocol? I'm pretty strongly opposed to this. What's the limiting factor when adding such information? I think clients that want something like this ought to query the database for catalog information when getting schema information. Greetings, Andres Freund
Re: Adding a test for speculative insert abort case
On Thu, May 16, 2019 at 8:46 PM Melanie Plageman wrote: > > Good idea. > I squashed the changes I suggested in previous emails, Ashwin's patch, my > suggested updates to that patch, and the index order check all into one > updated > patch attached. > > I've updated this patch to make it apply on master cleanly. Thanks to Alvaro for format-patch suggestion. The first patch in the set adds the speculative wait case discussed above from Ashwin's patch. The second patch in the set is another suggestion I have. I noticed that the insert-conflict-toast test mentions that it is "not guaranteed to lead to a failed speculative insertion" and, since it seems to be testing the speculative abort but with TOAST tables, I thought it might work to kill that spec file and move that test case into insert-conflict-specconflict so the test can utilize the existing advisory locks being used for the other tests in that file to make it deterministic which session succeeds in inserting the tuple. -- Melanie Plageman From e8652d872c953ca9bc077027cc34b26b870e5b73 Mon Sep 17 00:00:00 2001 From: csteam Date: Wed, 5 Jun 2019 15:24:19 -0700 Subject: [PATCH v2 2/2] Add TOAST case to spec conflict tests The spec insert-conflict-toast seems to be testing the speculative abort case -- given two concurrent inserts inserting duplicate values to a unique index, only one should succeed and, if the other has had a chance to insert the tuple into the table, this tuple should be killed. The comment in that test said that it was not deterministic, so, if making it deterministic required adding additional pg_advisory_locks similar to what is in the spec insert-conflict-specconflict, combining them made sense. --- .../expected/insert-conflict-specconflict.out | 56 +++ .../specs/insert-conflict-specconflict.spec | 30 ++ 2 files changed, 86 insertions(+) diff --git a/src/test/isolation/expected/insert-conflict-specconflict.out b/src/test/isolation/expected/insert-conflict-specconflict.out index 2f003ca1bf..a21ed0b165 100644 --- a/src/test/isolation/expected/insert-conflict-specconflict.out +++ b/src/test/isolation/expected/insert-conflict-specconflict.out @@ -112,6 +112,62 @@ keydata k1 inserted s1 with conflict update s2 +starting permutation: controller_locks controller_show s1_insert_toast s2_insert_toast controller_show controller_unlock_1_1 controller_unlock_2_1 controller_unlock_1_3 controller_unlock_2_3 controller_show controller_unlock_1_2 controller_show_count controller_unlock_2_2 controller_show_count +step controller_locks: SELECT pg_advisory_lock(sess, lock), sess, lock FROM generate_series(1, 2) a(sess), generate_series(1,3) b(lock); +pg_advisory_locksess lock + + 1 1 + 1 2 + 1 3 + 2 1 + 2 2 + 2 3 +step controller_show: SELECT * FROM upserttest; +keydata + +step s1_insert_toast: INSERT INTO upserttest VALUES('k2', ctoast_large_val()) ON CONFLICT DO NOTHING; +step s2_insert_toast: INSERT INTO upserttest VALUES('k2', ctoast_large_val()) ON CONFLICT DO NOTHING; +step controller_show: SELECT * FROM upserttest; +keydata + +step controller_unlock_1_1: SELECT pg_advisory_unlock(1, 1); +pg_advisory_unlock + +t +step controller_unlock_2_1: SELECT pg_advisory_unlock(2, 1); +pg_advisory_unlock + +t +step controller_unlock_1_3: SELECT pg_advisory_unlock(1, 3); +pg_advisory_unlock + +t +step controller_unlock_2_3: SELECT pg_advisory_unlock(2, 3); +pg_advisory_unlock + +t +step controller_show: SELECT * FROM upserttest; +keydata + +step controller_unlock_1_2: SELECT pg_advisory_unlock(1, 2); +pg_advisory_unlock + +t +step 
s1_insert_toast: <... completed> +step controller_show_count: SELECT COUNT(*) FROM upserttest; +count + +1 +step controller_unlock_2_2: SELECT pg_advisory_unlock(2, 2); +pg_advisory_unlock + +t +step s2_insert_toast: <... completed> +step controller_show_count: SELECT COUNT(*) FROM upserttest; +count + +1 + starting permutation: controller_locks controller_show s1_begin s2_begin s1_upsert s2_upsert controller_show controller_unlock_1_1 controller_unlock_2_1 controller_unlock_1_3 controller_unlock_2_3 controller_show controller_unlock_1_2 controller_show controller_unlock_2_2 controller_show s1_commit controller_show s2_commit controller_show step controller_locks: SELECT pg_advisory_lock(sess, lock), sess, lock FROM generate_series(1, 2) a(sess), generate_series(1,3) b(lock); pg_advisory_locksess lock diff --git a/src/test/isolation/specs/insert-conflict-specconflict.spec b/src/test
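For readers unfamiliar with how these specs are kept deterministic: a separate controller session pre-acquires advisory locks that the inserting sessions block on partway through their INSERTs (in insert-conflict-specconflict the lock acquisition happens inside a function used in the unique index's expression), so the controller dictates which session completes its speculative insertion first. A bare-bones sketch of that advisory-lock handshake outside the isolation tester (lock keys are arbitrary):

-- Controller session: take the lock before the workers start.
SELECT pg_advisory_lock(1, 1);

-- Worker session: this call blocks until the controller releases (1, 1),
-- so the controller decides exactly when the worker may proceed.
SELECT pg_advisory_lock(1, 1);
SELECT pg_advisory_unlock(1, 1);

-- Controller session: release the lock, letting the worker continue.
SELECT pg_advisory_unlock(1, 1);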
Re: Binary support for pgoutput plugin
Hi, On Wed, 5 Jun 2019 at 18:50, Andres Freund wrote: > Hi, > > On 2019-06-05 18:47:57 -0400, Dave Cramer wrote: > > So one of the things they would like added is to get not null information > > in the schema record. This is so they can mark the field Optional in > Java. > > I presume this would also have some uses in other languages. As I > > understand it this would require a protocol bump. If this were to be > > accepted, are there any outstanding asks that would be useful to add if we > were > > going to bump the protocol? > > I'm pretty strongly opposed to this. What's the limiting factor when > adding such information? I think clients that want something like this > ought to query the database for catalog information when getting schema > information. > I'm not intimately familiar with their code. I will query them more about the ask. I am curious why you are "strongly" opposed, however. We already have the information, and adding it doesn't seem onerous. Dave
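For reference, the kind of catalog query Andres is suggesting clients run is short. A sketch (the table name is a placeholder; a real client would more likely filter by the relation OID it already received in the pgoutput Relation message):

-- Per-column NOT NULL information for one table; 'public.mytable' is a
-- placeholder for whatever relation the client is replicating.
SELECT a.attname, a.attnotnull
FROM pg_attribute a
WHERE a.attrelid = 'public.mytable'::regclass
  AND a.attnum > 0
  AND NOT a.attisdropped
ORDER BY a.attnum;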
Re: No mention of no CIC support for partitioned index in docs
On Wed, 5 Jun 2019 at 08:10, Alvaro Herrera wrote: > > On 2019-May-24, David Rowley wrote: > > > It appears there is no mention of lack of support for CREATE INDEX > > CONCURRENTLY on partitioned index in the documents. > > I'll leave this one for you to handle, thanks. Thanks. I've just pushed something. I ended up deciding that we owe the user a bit more of an explanation of how they might work around the problem. Of course, a partitioned table index build is likely to take much longer than an index build on a normal table, since most likely a partitioned table is larger. So I went on to explain how they might minimise the time where writers will be blocked by creating indexes concurrently on each partition first. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
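The workaround David alludes to is, roughly, to build each partition's index concurrently and then attach it to an initially-invalid parent index. A sketch of that pattern, with invented table and index names:

-- Create the parent index without recursing to partitions; it starts out invalid.
CREATE INDEX measurement_logdate_idx ON ONLY measurement (logdate);

-- Build each partition's index without taking long-lived locks that block writes...
CREATE INDEX CONCURRENTLY measurement_y2008m01_logdate_idx
    ON measurement_y2008m01 (logdate);

-- ...and attach it to the parent index.
ALTER INDEX measurement_logdate_idx
    ATTACH PARTITION measurement_y2008m01_logdate_idx;

-- Repeat for the remaining partitions; once every partition's index has been
-- attached, the parent index becomes valid.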
Re: Top-N sorts in EXPLAIN, row count estimates, and parallelism
On Thu, May 23, 2019 at 7:48 PM David Rowley wrote: > > cost_sort() costs sorts as top-N heapsorts. While we cannot make an > > iron-clad guarantee that it will work out that way from within > > tuplesort.c, that doesn't seem like it closes off the possibility of > > more informative EXPLAIN output. For example, can't we at least report that > > the tuplesort will be "bounded" within EXPLAIN, indicating that we > > intend to attempt to sort using a top-N heap sort (i.e. we'll > > definitely do it that way if there is sufficient work_mem)? > > I think this really needs more of a concrete proposal. Remember > LIMIT/OFFSET don't need to be constants; they could be a Param or some > return value from a subquery, so the bound might not be known until > after executor startup, which EXPLAIN is not going to know > about. I was merely pointing out that it is clear when a sort *could* be a top-n sort, which could be exposed by EXPLAIN without anyone feeling misled. > After that, what would we do with it in EXPLAIN? Always show "Bound: > ", if it's not -1? I'm not sure. The distinction between a top-n sort and any other sort is an important one (it's certainly more important than the distinction between an internal and external sort), so it's worth being flexible in order to expose more information in EXPLAIN output. I would be willing to accept some kind of qualified or hedged description in the EXPLAIN output for a bounded sort node, even though that approach doesn't seem desirable in general. -- Peter Geoghegan
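For what it's worth, today the top-N behaviour is only visible after the fact, via the Sort Method line in EXPLAIN ANALYZE; plain EXPLAIN shows an ordinary Sort node. An illustrative example (table name and all numbers are invented):

EXPLAIN (ANALYZE, COSTS OFF)
SELECT * FROM events ORDER BY created_at DESC LIMIT 10;

--  Limit (actual time=... rows=10 loops=1)
--    ->  Sort (actual time=... rows=10 loops=1)
--          Sort Key: created_at DESC
--          Sort Method: top-N heapsort  Memory: 25kB
--          ->  Seq Scan on events (actual time=... rows=... loops=1)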
Re: pg_basebackup failure after setting default_table_access_method option
On Thu, Jun 6, 2019 at 1:46 AM vignesh C wrote: > > Hi, > > *I noticed pg_basebackup failure when default_table_access_method option > is set.* > > *Test steps:* > > Step 1: Init database > ./initdb -D data > > Step 2: Start Server > ./postgres -D data & > > Step 3: Set GUC option > export PGOPTIONS='-c default_table_access_method=zheap' > > Step 4: Perform backup > /pg_basebackup -D backup -p 5432 --no-sync > 2019-06-05 20:35:04.088 IST [11601] FATAL: cannot read pg_class without > having selected a database > pg_basebackup: error: could not connect to server: FATAL: cannot read > pg_class without having selected a database > > *Reason why it is failing:* > pg_basebackup does not use any database to connect to the server as it backs > up the whole data instance. > As the option default_table_access_method is set, > it tries to validate this option, but while validating the option in > ScanPgRelation function: > if (!OidIsValid(MyDatabaseId)) > elog(FATAL, "cannot read pg_class without having selected a database"); > > Here, as pg_basebackup uses no database, the command fails. > Thanks for the detailed steps to reproduce the bug; I am also able to reproduce the problem. > Fix: > The patch has the fix for the above issue: > > Let me know your opinion on this. > Thanks for the patch; it fixes the problem. Regards, Haribabu Kommi Fujitsu Australia
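As I understand it, the failure boils down to the fact that validating default_table_access_method requires looking the name up in pg_am, which is only possible while connected to a database; the replication connection pg_basebackup opens has none. In a regular session the candidate values can be seen with a query like:

-- Table access methods known to the server ('t' = table AM, 'i' = index AM)
SELECT amname FROM pg_am WHERE amtype = 't';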
RE: Confusing comment for function ExecParallelEstimate
Sorry, my last mail forgot to CC the mailing list. The comment is confusing as it stands; maybe someone should correct it. Here is a simple patch; what do you think? With Regards, Wu Fei -Original Message- From: Amit Kapila [mailto:amit.kapil...@gmail.com] Sent: Wednesday, June 05, 2019 7:18 PM To: Wu, Fei/吴 非 Cc: pgsql-hack...@postgresql.org Subject: Re: Confusing comment for function ExecParallelEstimate On Wed, Jun 5, 2019 at 11:27 AM Wu, Fei wrote: > > Thanks for your reply. > From the code below: > (https://github.com/postgres/postgres/blob/REL_10_7/src/backend/execut > or/execParallel.c) > ### > 443 /* > 444 * Give parallel-aware nodes a chance to add to the > estimates, and get a > 445 * count of how many PlanState nodes there are. > 446 */ > 447 e.pcxt = pcxt; > 448 e.nnodes = 0; > 449 ExecParallelEstimate(planstate, &e); > 450 > 451 /* Estimate space for instrumentation, if required. */ > 452 if (estate->es_instrument) > 453 { > 454 instrumentation_len = > 455 offsetof(SharedExecutorInstrumentation, > plan_node_id) + > 456 sizeof(int) * e.nnodes; > 457 instrumentation_len = MAXALIGN(instrumentation_len); > 458 instrument_offset = instrumentation_len; > 459 instrumentation_len += > 460 mul_size(sizeof(Instrumentation), > 461 mul_size(e.nnodes, > nworkers)); > 462 shm_toc_estimate_chunk(&pcxt->estimator, > instrumentation_len); > 463 shm_toc_estimate_keys(&pcxt->estimator, 1); > > ## > # It seems that e.nnodes, which is returned from > ExecParallelEstimate(planstate, &e), determines how many instrumentation > structures are allocated in the DSM (line459~line461). > And e.nnodes also determines the length of SharedExecutorInstrumentation-> > plan_node_id (line454~line456). > > So, I think here it refers to instrumentation. > Right. I think the way it is mentioned (SharedPlanStateInstrumentation structures ..) in the comment can confuse readers. We can replace SharedPlanStateInstrumentation with Instrumentation in the comment. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com fix_confusing_comment_for_ExecParallelEstimate.patch Description: fix_confusing_comment_for_ExecParallelEstimate.patch
Re: Confusing comment for function ExecParallelEstimate
On Thu, Jun 6, 2019 at 7:37 AM Wu, Fei wrote: > > Sorry, Last mail forget to CC the mailing list. > > Now the comment is confusing, Maybe someone should correct it. > Sure, for the sake of clarity, when this code was initially introduced in commit d1b7c1ff, the structure used was SharedPlanStateInstrumentation, but later when it got changed to Instrumentation structure in commit b287df70, we forgot to update the comment. So, we should backpatch this till 9.6 where it got introduced. I will commit this change by tomorrow or so. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Should we warn against using too many partitions?
On Fri, 24 May 2019 at 22:00, David Rowley wrote: > I've attached the pg10 and pg11 patches with that updated... and also > the master one (unchanged) with the hopes that the CF bot picks that > one. I got talking to Andres about this at PGCon after a use case of 250k partitions was brought to our attention. I was thinking about the best way to handle this on the long flight home and after studying the current docs I really feel that they fairly well describe what we've done so far implementing table partitioning, but they offer next to nothing on best practices on how to make the most of the feature. I've done some work on this today and what I've ended up with is an entirely new section to the partitioning docs about best practices which provides a bit of detail on how you might go about choosing the partition key. It gives an example of why LIST partitioning on a set of values that may grow significantly over time might be a bad idea. It talks about memory growth with more partitions and mentions that rel cache might become a problem even if queries are touching a small number of partitions per query, but a large number per session. The attached patch is aimed at master. PG11 will need the planner memory and performance part tweaked and for PG10 I'll do that plus remove the mention of PRIMARY KEY and UNIQUE constraints on the partitioned table. Does anyone see anything wrong with doing this? I don't think there should be an issue adding a section to the docs right at the end as it's not causing any resequencing. Or does anyone have any better ideas or better examples to give? or any comments? If it looks okay I can post version for PG11 and PG10 for review, but I'd like to get this in fairly soon. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services add_docs_about_best_partition_practices.patch Description: Binary data
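To make the "many small customers" example concrete, hash partitioning caps the partition count up front no matter how the customer base grows. A sketch (table, column, and partition names, and the modulus, are all invented):

CREATE TABLE orders (
    customer_id bigint NOT NULL,
    order_date  date NOT NULL,
    amount      numeric
) PARTITION BY HASH (customer_id);

-- A fixed, predictable number of partitions, independent of how many
-- customers eventually exist:
CREATE TABLE orders_p0 PARTITION OF orders FOR VALUES WITH (MODULUS 8, REMAINDER 0);
CREATE TABLE orders_p1 PARTITION OF orders FOR VALUES WITH (MODULUS 8, REMAINDER 1);
-- ... and so on up to REMAINDER 7.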
Re: Should we warn against using too many partitions?
I suggest just minor variations on language. On Thu, Jun 06, 2019 at 04:43:48PM +1200, David Rowley wrote: >diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml >index cce1618fc1..ab26630199 100644 >--- a/doc/src/sgml/ddl.sgml >+++ b/doc/src/sgml/ddl.sgml >@@ -4674,6 +4675,76 @@ EXPLAIN SELECT count(*) FROM measurement WHERE logdate >>= DATE '2008-01-01'; > > > >+ >+ >+ Declarative Partitioning Best Practices >+ >+ >+The choice of how to partition a table should be considered carefully as Either say "How to partition consider should be .." or "The choice should MADE carefully" ? >+ >+One of the most critical design decisions will be the column or columns >+which you partition your data by. Often the best choice will be to by which ? >+ >+Choosing the number of partitions to divide the table into is also a the TARGET number of partitions BY WHICH to divide the table ? >+critical decision to make. Not having enough partitions may mean that >+indexes remain too large and that data locality remains poor which could >+result in poor cache hit ratios. However, dividing the table into too >+many partitions can also cause issues. Too many partitions can mean >+slower query planning times and higher memory consumption during both >+query planning and execution. It's also important to consider what >+changes may occur in the future when choosing how to partition your table. >+For example, if you choose to have one partition per customer and you >+currently have a small number of large customers, what will the have ONLY ? >+implications be if in several years you obtain a large number of small >+customers. In this case, it may be better to choose to partition by >+HASH and choose a reasonable amount of partitions reasonable NUMBER ? >+ >+It is also important to consider the overhead of partitioning during >+query planning and execution. The query planner is generally able to >+handle partition hierarchies up a few thousand partitions fairly well, >+providing that the vast majority of them can be pruned during query provided ? I would say: "provided that typical queries prune all but a small number of partitions during planning time". >+DELETE commands. Also, even if most queries are >+able to prune a high number of partitions during query planning, it still LARGE number? >+may be undesirable to have a large number of partitions as each partition may still ? >+also will obtain a relation cache entry in each session which uses the will require ? Or occupy ? >+ >+With data warehouse type workloads it can make sense to use a larger >+number of partitions than with an OLTP type workload. Generally, in data >+warehouses, query planning time is less of a concern as the majority of >+processing time is generally spent during query execution. With either of remove the 2nd "generally" >+these two types of workload, it is important to make the right decisions >+early as re-partitioning large quantities of data can be painstakingly early COMMA ? PAINFULLY slow >+When performance is critical, performing workload simulations to >+assist in making the correct decisions can be beneficial. I would say: Simulations of the intended workload are beneficial for optimizing partitioning strategy. Thanks, Justin
Re: Should we warn against using too many partitions?
Hi, On Thu, Jun 6, 2019 at 1:44 PM David Rowley wrote: > > On Fri, 24 May 2019 at 22:00, David Rowley > wrote: > > I've attached the pg10 and pg11 patches with that updated... and also > > the master one (unchanged) with the hopes that the CF bot picks that > > one. > > I got talking to Andres about this at PGCon after a use case of 250k > partitions was brought to our attention. I was thinking about the best > way to handle this on the long flight home and after studying the > current docs I really feel that they fairly well describe what we've > done so far implementing table partitioning, but they offer next to > nothing on best practices on how to make the most of the feature. Agreed that some "best practices" text is overdue, so thanks for taking that up. > I've done some work on this today and what I've ended up with is an > entirely new section to the partitioning docs about best practices > which provides a bit of detail on how you might go about choosing the > partition key. It gives an example of why LIST partitioning on a set > of values that may grow significantly over time might be a bad idea. Design advice like this is good. > It talks about memory growth with more partitions and mentions that > rel cache might become a problem even if queries are touching a small > number of partitions per query, but a large number per session. I wasn't sure at first if stuff like this should be mentioned in the user-facing documentation, but your wording seems fine in general. Thanks, Amit
Re: hyrax vs. RelationBuildPartitionDesc
On Wed, Jun 5, 2019 at 8:03 PM Amit Langote wrote: > I noticed a crash with one of the scenarios that I think the patch is > meant to address. Let me describe the steps: > > Attach gdb (or whatever) to session 1 in which we'll run a query that > will scan at least two partitions and set a breakpoint in > expand_partitioned_rtentry(). Run the query, so the breakpoint will > be hit. Step through up to the start of the following loop in this > function: > > i = -1; > while ((i = bms_next_member(live_parts, i)) >= 0) > { > Oid childOID = partdesc->oids[i]; > Relationchildrel; > RangeTblEntry *childrte; > Index childRTindex; > RelOptInfo *childrelinfo; > > /* Open rel, acquiring required locks */ > childrel = table_open(childOID, lockmode); > > Note that 'partdesc' in the above code is from the partition > directory. Before stepping through into the loop, start another > session and attach a new partition. On into the loop. When the 1st > partition is opened, its locking will result in > RelationClearRelation() being called on the parent relation due to > partition being attached concurrently, which I observed, is actually > invoked a couple of times due to recursion. Parent relation's > rd_oldpdcxt will be set in this process, which contains the > aforementioned partition descriptor. > > Before moving the loop to process the 2nd partition, attach another > partition in session 2. When the 2nd partition is opened, > RelationClearRelation() will run again and in one of its recursive > invocations, it destroys the rd_oldpdcxt that was set previously, > taking the partition directory's partition descriptor with it. > Anything that tries to access it later crashes. > > I couldn't figure out what to blame here though -- the design of > rd_pdoldcxt or the fact that RelationClearRelation() is invoked > multiple times. I will try to study it more closely tomorrow. On further study, I concluded that it's indeed the multiple invocations of RelationClearRelation() on the parent relation that cause rd_pdoldcxt to be destroyed prematurely. While it's problematic that the session in which a new partition is attached to the parent relation broadcasts 2 SI messages to invalidate the parent relation [1], it's clearly better to fix how RelationClearRelation() manipulates rd_pdoldcxt so that duplicate SI messages are not harmful, since fixing the former is hardly an option. Attached is a patch that applies on top of Robert's pdoldcxt-v1.patch, which seems to fix this issue for me. In the rare case where inval messages due to multiple concurrent attachments get processed in a session holding a reference on a partitioned table, there will be multiple "old" partition descriptors and corresponding "old" contexts. All of the old contexts are chained together via context re-parenting, with the newest "old" context being accessible from the table's rd_pdoldcxt. Thanks, Amit [1]: inval.c: AddRelcacheInvalidationMessage() does try to prevent duplicate messages from being emitted, but the logic to detect duplicates doesn't work across CommandCounterIncrement(). There are at least two relcache inval requests in the ATTACH PARTITION code path, emitted by SetRelationHasSubclass and StorePartitionBound, resp., which are separated by at least one CCI, so the 2nd request isn't detected as the duplicate of the 1st. pdoldcxt-v1-amit-delta.patch Description: Binary data
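For anyone trying to reproduce this, the SQL side of the scenario is roughly the following (all names invented; the pauses are the debugger breakpoints described above, and each ATTACH in session 2 is issued while session 1 sits stopped inside the loop):

-- Session 1 (later paused with a breakpoint in expand_partitioned_rtentry()):
CREATE TABLE p (a int) PARTITION BY RANGE (a);
CREATE TABLE p1 PARTITION OF p FOR VALUES FROM (0) TO (10);
CREATE TABLE p2 PARTITION OF p FOR VALUES FROM (10) TO (20);
SELECT * FROM p;   -- must expand at least two partitions

-- Session 2: run the first pair before session 1 opens the 1st partition,
-- and the second pair before it moves on to the 2nd partition.
CREATE TABLE p3 (a int);
ALTER TABLE p ATTACH PARTITION p3 FOR VALUES FROM (20) TO (30);
CREATE TABLE p4 (a int);
ALTER TABLE p ATTACH PARTITION p4 FOR VALUES FROM (30) TO (40);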