Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com
On Tue, Dec 5, 2017 at 8:49 PM, Thomas Munro wrote:
> On Tue, Dec 5, 2017 at 6:39 PM, Amit Kapila wrote:
>> I have another observation in a somewhat related area. From the code, it looks like we might have some problem with displaying sort info for workers for rescans. I think the problem with the sort info is that it initializes shared info with local memory in ExecSortRetrieveInstrumentation, after which it won't be able to access the values in shared memory changed by workers in rescans. We might be able to fix it by having some local_info, same as shared_info, in the sort node. But the main problem is how we accumulate stats for workers across rescans. The type of sort method can change across rescans. We might be able to accumulate the memory size, though I am not sure if that is right. Although this appears to be somewhat related to the problem being discussed in this thread, it can be dealt with separately if we want to fix it.
>
> Yeah, that's broken. ExecSortRetrieveInstrumentation() is run for each loop, and after the first loop we've lost track of the pointer into shared memory because we replaced it with a palloc'd copy. We could do what you said, or we could reinstate the pointer into the DSM in ExecSortReInitializeDSM() by looking it up in the TOC.

Or would it be an option to change the time ExecXXXRetrieveInstrumentation() is called so it is run only once?

-- Thomas Munro http://www.enterprisedb.com
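For concreteness, a rough sketch of the second option (reinstating the pointer in ExecSortReInitializeDSM() by looking it up in the TOC) might look like the following. It assumes the shared instrumentation area was inserted into the TOC under the node's plan_node_id, the way ExecSortInitializeDSM() sets it up; this is only an illustration, not a tested patch:

    void
    ExecSortReInitializeDSM(SortState *node, ParallelContext *pcxt)
    {
        /*
         * After the first loop the leader's shared_info points at a palloc'd
         * copy, so re-fetch the pointer into the DSM segment from the TOC.
         */
        node->shared_info = shm_toc_lookup(pcxt->toc,
                                           node->ss.ps.plan->plan_node_id,
                                           true);

        /* Reset the per-worker counters for the next loop. */
        if (node->shared_info != NULL)
            memset(node->shared_info->sinstrument, 0,
                   node->shared_info->num_workers *
                   sizeof(TuplesortInstrumentation));
    }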
Re: [HACKERS] logical decoding of two-phase transactions
>> - Modified HeapTupleSatisfiesVacuum to return HEAPTUPLE_RECENTLY_DEAD if the transaction id is newer than OldestXmin. Doing this only for CATALOG tables (htup->t_tableOid < (Oid) FirstNormalObjectId).
>
> Because logical decoding supports user-catalog relations, we need to use the same sort of logic that GetOldestXmin uses instead of a simple oid-range check. See RelationIsAccessibleInLogicalDecoding() and the user_catalog_table reloption.

Unfortunately, HeapTupleSatisfiesVacuum does not have the Relation structure handily available to allow for these checks.

> Otherwise pseudo-catalogs used by logical decoding output plugins could still suffer issues with needed tuples getting vacuumed, though only if the txn being decoded made changes to those tables and then ROLLBACKed. It's a pretty tiny corner case for decoding of 2pc but a bigger one when we're addressing streaming decoding.

We disallow rewrites on user_catalog_tables, so they cannot change underneath. Yes, DML can be carried out on them inside a 2PC transaction which then gets ROLLBACK'ed. But if it's getting aborted, then we are not interested in that data anyway. Also, now that we have the "filter_decode_txn_cb_wrapper()" function, we will stop decoding by the next change record cycle because of the abort. So, I am not sure if we need to track user_catalog_tables in HeapTupleSatisfiesVacuum explicitly.

> Otherwise I'm really, really happy with how this is progressing and want to find time to play with it.

Yeah, I will do some more testing and add a few more test cases in the test_decoding plugin. It might be handy to have a DELAY of a few seconds after every change record processing, for example. That way, we can have a TAP test that does a few WAL activities and then introduces a concurrent rollback from another session in the middle of that delayed processing. I have done debugger-based testing of this concurrent rollback functionality as of now.

Another test (actually, functionality) that might come in handy is to have a way for DDL to be actually carried out on the subscriber. We will need something like pglogical.replicate_ddl_command to be added to the core for this to work. We can add this functionality as a follow-on separate patch after discussing how we want to implement that in core.

Regards, Nikhils -- Nikhil Sontakke http://www.2ndQuadrant.com/ PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Re: Usage of epoch in txid_current
Hi!

On Tue, Dec 5, 2017 at 6:19 AM, Amit Kapila wrote:
> Currently, txid_current and friends export a 64-bit format of transaction id that is extended with an “epoch” counter so that it will not wrap around during the life of an installation. The epoch value it uses is based on the epoch that is maintained by checkpoint (aka only checkpoint increments it).
>
> Now if epoch changes multiple times between two checkpoints (practically the chances of this are bleak, but there is a theoretical possibility), then won't the computation of xids go wrong? Basically, it can give the same value of txid after wraparound if the checkpoint doesn't occur between the two calls to txid_current.

AFAICS, yes, if epoch changes multiple times between two checkpoints, then the computation will go wrong. And it doesn't look like a purely theoretical possibility to me, because I think I know of a couple of instances on the edge of this...

> Am I missing something which ensures that epoch gets incremented at or after wraparound?

I've checked the code, and it doesn't look to me like there is anything like this.

-- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] logical decoding of two-phase transactions
On 5 December 2017 at 16:00, Nikhil Sontakke wrote:
> We disallow rewrites on user_catalog_tables, so they cannot change underneath. Yes, DML can be carried out on them inside a 2PC transaction which then gets ROLLBACK'ed. But if it's getting aborted, then we are not interested in that data anyway. Also, now that we have the "filter_decode_txn_cb_wrapper()" function, we will stop decoding by the next change record cycle because of the abort.
>
> So, I am not sure if we need to track user_catalog_tables in HeapTupleSatisfiesVacuum explicitly.

I guess it's down to whether, when we're decoding a txn that just got concurrently aborted, the output plugin might do anything with its user catalogs that could cause a crash. Output plugins are most likely to be using the genam (or even SPI, I guess?) to read user catalogs during logical decoding. Logical decoding itself does not rely on the correctness of user catalogs in any way; it's only a concern for output plugin callbacks. It may make sense to kick this one down the road at this point, as I can't conclusively see where it'd cause an actual problem.

>> Otherwise I'm really, really happy with how this is progressing and want to find time to play with it.
>
> Yeah, I will do some more testing and add a few more test cases in the test_decoding plugin. It might be handy to have a DELAY of a few seconds after every change record processing, for example. That way, we can have a TAP test that does a few WAL activities and then introduces a concurrent rollback from another session in the middle of that delayed processing. I have done debugger-based testing of this concurrent rollback functionality as of now.

Sounds good.

> Another test (actually, functionality) that might come in handy is to have a way for DDL to be actually carried out on the subscriber. We will need something like pglogical.replicate_ddl_command to be added to the core for this to work. We can add this functionality as a follow-on separate patch after discussing how we want to implement that in core.

Yeah, definitely a different patch, but assuredly valuable.

-- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] pow support for pgbench
Hello Tom,

1. A patch that adds an integer version of pow() but not a double version
2. A patch that adds a double version of pow() but not an integer version
3. A patch that adds both an integer version of pow() and a double version of pow(), with the two versions having different names

It seems to me that 1 and 2 have value on their own for the workloads being emulated, so what you are suggesting in 3 looks good to me. Now why are two different function names necessary?

ISTM one key issue here is whether pgbench's expression language is meant to model SQL (where we have function overloading) or C (where there is no overloading). I don't think we've really settled on a fixed policy on that, but maybe now is the time.

Function overloading is implemented for ABS (as noted by Michaël), but also LEAST and GREATEST. Typing is dynamic, based on guessing (think "pgbench -D i=1 -D f=1.0"). These decisions have already been taken, they were reasonable, and the behavior is consistent and useful. I do not see the point of going backward and breaking compatibility. The idea is to model after SQL (eg I've also been asked to do that for the operators & functions extensions, lost in the CF queue), when possible. When possible is not always.

If we do think that function overloading is OK, there remains the question of when the typing is resolved.

Dynamic guessing is the only pragmatic option with pgbench.

I think Robert is objecting to resolving at runtime, and I tend to agree that that's something we'd regret in the long run.

Too late. I do not think that we would come to regret it. I'm rather regretting that pgbench capabilities are improving so slowly.

It doesn't match either SQL or C.

Sure. A dynamically typed language cannot match a statically typed one. I do not see this as a significant issue for writing benchmarking scripts.

Now, about POW: As for approximating NUMERIC, there is a first difficulty, as the type can either be approximated as an INT or a DOUBLE. Ah ah. Also, POW raises another difficulty, which is that depending on the sign of the second argument the result is more INT or more DOUBLE. Fun. So even if we do separate functions (IPOW & DPOW, or POW & DPOW), the result cannot be typed statically. Note that the type can be enforced if necessary thanks to the int() and double() functions, which is enough of a workaround for pgbench's simple purpose.

In this context, ISTM that Raul's patch is a reasonable, even if imperfect, compromise. I can understand that this compromise does not suit Robert. Too bad, because both POW versions (int & double) have a reasonable use case for writing benchmarks. Using two names does not resolve the typing issue anyway.

None of the 3 options offered by Robert are really satisfactory, and the compromise is also rejected. So basically do whatever you want with the patch (accept, reject, 1, 2, 3). Only "Returned with feedback" with the feedback being "please implement a statically typed pgbench" does not seem like a useful option. I can only say that these functions would be useful, so for me it is better in than out, but that is just my silly opinion.

-- Fabien.
Re: compress method for spgist - 2
The following review has been posted through the commitfest application:
make installcheck-world: not tested
Implements feature: not tested
Spec compliant: not tested
Documentation: tested, passed

I've read the updated patch and see my concerns addressed. I'm looking forward to SP-GiST compress method support, as it will allow usage of SP-GiST index infrastructure for PostGIS.

The new status of this patch is: Ready for Committer
Re: Usage of epoch in txid_current
On Tue, Dec 5, 2017 at 2:49 PM, Alexander Korotkov wrote:
> On Tue, Dec 5, 2017 at 6:19 AM, Amit Kapila wrote:
>> Currently, txid_current and friends export a 64-bit format of transaction id that is extended with an “epoch” counter so that it will not wrap around during the life of an installation. The epoch value it uses is based on the epoch that is maintained by checkpoint (aka only checkpoint increments it).
>>
>> Now if epoch changes multiple times between two checkpoints (practically the chances of this are bleak, but there is a theoretical possibility), then won't the computation of xids go wrong? Basically, it can give the same value of txid after wraparound if the checkpoint doesn't occur between the two calls to txid_current.
>
> AFAICS, yes, if epoch changes multiple times between two checkpoints, then the computation will go wrong. And it doesn't look like a purely theoretical possibility to me, because I think I know of a couple of instances on the edge of this...

Okay, it is quite strange that we haven't discovered this problem till now. I think we should do something to fix it. One idea is that we track the epoch change in shared memory (probably in the same data structure (VariableCacheData) where we track nextXid). We need to increment it when the xid wraps around during xid allocation (in GetNewTransactionId). Also, we need to make it persistent, which means we need to log it in the checkpoint xlog record and write a separate xlog record for the epoch change. Thoughts?

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
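To make that idea concrete, here is a hypothetical sketch of the wraparound detection in GetNewTransactionId(). The xidEpoch field and the exact placement are assumptions made up for illustration; only nextXid, TransactionIdAdvance() and XidGenLock are existing pieces:

    TransactionId xid;

    LWLockAcquire(XidGenLock, LW_EXCLUSIVE);

    xid = ShmemVariableCache->nextXid;

    /* advance nextXid, skipping the special xids on wraparound */
    TransactionIdAdvance(ShmemVariableCache->nextXid);

    /* if the counter wrapped, remember that immediately and durably */
    if (ShmemVariableCache->nextXid < xid)
    {
        ShmemVariableCache->xidEpoch++;     /* assumed new field */
        /* ... WAL-log the new epoch here so it survives a crash ... */
    }

    LWLockRelease(XidGenLock);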
Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com
On Tue, Dec 5, 2017 at 1:29 PM, Thomas Munro wrote:
> On Tue, Dec 5, 2017 at 8:49 PM, Thomas Munro wrote:
>> On Tue, Dec 5, 2017 at 6:39 PM, Amit Kapila wrote:
>>> I have another observation in a somewhat related area. From the code, it looks like we might have some problem with displaying sort info for workers for rescans. I think the problem with the sort info is that it initializes shared info with local memory in ExecSortRetrieveInstrumentation, after which it won't be able to access the values in shared memory changed by workers in rescans. We might be able to fix it by having some local_info, same as shared_info, in the sort node. But the main problem is how we accumulate stats for workers across rescans. The type of sort method can change across rescans. We might be able to accumulate the memory size, though I am not sure if that is right. Although this appears to be somewhat related to the problem being discussed in this thread, it can be dealt with separately if we want to fix it.
>>
>> Yeah, that's broken. ExecSortRetrieveInstrumentation() is run for each loop, and after the first loop we've lost track of the pointer into shared memory because we replaced it with a palloc'd copy. We could do what you said, or we could reinstate the pointer into the DSM in ExecSortReInitializeDSM() by looking it up in the TOC.
>
> Or would it be an option to change the time ExecXXXRetrieveInstrumentation() is called so it is run only once?

To me, that doesn't sound like a bad option. I think if we do so, then we don't even need to reinitialize the shared sort stats. I think something like the attached should work if we want to go this route. We can add a regression test if this is what we think is a good idea.

Having said that, one problem I see with doing things this way is that in general we will display the accumulated stats of each worker, but for sort or some other special nodes (like hash), we will show the information of only the last loop. I am not sure if that is a matter of concern, but if we want to do it this way, then probably we should explain this in the documentation as well.

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com

fix_accum_instr_parallel_workers_v4.patch
Re: [HACKERS] kqueue
On Thu, Jun 22, 2017 at 7:19 PM, Thomas Munro wrote:
> I don't plan to resubmit this patch myself, but I was doing some spring cleaning and rebasing today and I figured it might be worth quietly leaving a working patch here just in case anyone from the various BSD communities is interested in taking the idea further.

Since there was a mention of kqueue on -hackers today, here's another rebase. I got curious just now and ran a very quick test on an AWS 64 vCPU m4.16xlarge instance running image "FreeBSD 11.1-STABLE-amd64-2017-08-08 - ami-00608178". I set shared_buffers = 10GB and ran pgbench approximately the same way Heikki and Keith did upthread:

  pgbench -i -s 200 postgres
  pgbench -M prepared -j 6 -c 6 -S postgres -T60 -P1
  pgbench -M prepared -j 12 -c 12 -S postgres -T60 -P1
  pgbench -M prepared -j 24 -c 24 -S postgres -T60 -P1
  pgbench -M prepared -j 36 -c 36 -S postgres -T60 -P1
  pgbench -M prepared -j 48 -c 48 -S postgres -T60 -P1

The TPS numbers I got (including connections establishing) were:

  clients   master    patched
  6         146,215   147,535  (+0.9%)
  12        273,056   280,505  (+2.7%)
  24        360,751   369,965  (+2.5%)
  36        413,147   420,769  (+1.8%)
  48        416,189   444,537  (+6.8%)

The patch appears to be doing something positive on this particular system and that effect was stable over a few runs.

-- Thomas Munro http://www.enterprisedb.com

kqueue-v8.patch
Re: es_query_dsa is broken
On Tue, Dec 5, 2017 at 6:45 AM, Thomas Munro wrote:
> On Thu, Nov 30, 2017 at 11:19 PM, Thomas Munro wrote:
>> On Thu, Nov 30, 2017 at 4:01 AM, Robert Haas wrote:
>>> In v10 we might need to go with a solution like what you've sketched here, because Tom will complain about breaking binary compatibility with EState (and maybe other PlanState nodes) in a released branch.
>
> Here's a pair of versions of that patch tidied up a bit, for REL_10_STABLE and master. Unfortunately I've been unable to come up with a reproducer that shows misbehaviour (something like what was reported on -bugs[1] which appears to be a case of calling dsa_get_address(wrong_area, some_dsa_pointer)).

+   EState     *estate = gatherstate->ps.state;
+
+   /* Install our DSA area while executing the plan. */
+   estate->es_query_dsa = gatherstate->pei->area;
    outerTupleSlot = ExecProcNode(outerPlan);
+   estate->es_query_dsa = NULL;

Won't the above coding pattern create a problem if ExecProcNode throws an error and an outer block catches it and continues execution (consider the case of execution inside PL blocks)?

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
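To illustrate the concern, one way the assignment could be made error-safe is with the usual PG_TRY() machinery, as in the hedged sketch below (reusing the variables from the hunk above). This is just one possible shape, not the fix that was actually adopted:

    estate->es_query_dsa = gatherstate->pei->area;
    PG_TRY();
    {
        outerTupleSlot = ExecProcNode(outerPlan);
    }
    PG_CATCH();
    {
        /* don't leave a dangling pointer behind if the plan errors out */
        estate->es_query_dsa = NULL;
        PG_RE_THROW();
    }
    PG_END_TRY();
    estate->es_query_dsa = NULL;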
Re: [HACKERS] pow support for pgbench
Hi all,

I've been giving this some thought and I think we could reach the compromise of having a single function with 2 overloads:

* pow(double, double) -> double: Uses C pow().
* pow(int, int) -> double: Uses ipow() for positive exponents, and pow() for negative exponents.

In both cases we'd return a double, but we use the fast ipow() when possible (which can be 20x faster), so at the cost of an extra cast if you need an int, we'd have a consistent API. Would this be acceptable?
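For reference, the integer path being discussed is ordinary exponentiation by squaring. A minimal sketch of such an ipow() is shown below; it illustrates the idea rather than the exact code in the proposed patch, and it ignores overflow of the int64 result:

    /* integer power by repeated squaring; exp must be non-negative */
    static int64
    ipow(int64 base, int64 exp)
    {
        int64   result = 1;

        while (exp > 0)
        {
            if (exp & 1)
                result *= base;
            base *= base;
            exp >>= 1;
        }
        return result;
    }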
Re: [HACKERS] pow support for pgbench
> I've been giving this some thought and I think we could reach the compromise of having a single function with 2 overloads:
>
> * pow(double, double) -> double: Uses C pow().
> * pow(int, int) -> double: Uses ipow() for positive exponents, and pow() for negative exponents.
>
> In both cases we'd return a double, but we use the fast ipow() when possible (which can be 20x faster), so at the cost of an extra cast if you need an int, we'd have a consistent API. Would this be acceptable?

This is for Robert to say whether it is more acceptable to him. My 0.02€: ISTM that it is closely equivalent to having just the double version and using an explicit cast to get an int if needed, which does not conform to strict SQL behavior any more than the previous compromise did. Also, probably having something (anything) is better than nothing.

-- Fabien.
Re: [HACKERS] [POC] Faster processing at Gather node
On Mon, Dec 4, 2017 at 9:20 PM, Robert Haas wrote:
> On Sun, Dec 3, 2017 at 10:30 PM, Amit Kapila wrote:
>> I thought there are some cases (though less) where we want to Shutdown the nodes (ExecShutdownNode) earlier and release the resources sooner. However, if you are not completely sure about this change, then we can leave it as it is. Thanks for sharing your thoughts.
>
> OK, thanks. I committed that patch, after first running 100 million tuples through a Gather over and over again to test for leaks. Hopefully I haven't missed anything here, but it looks like it's solid. Here once again are the remaining patches. While the already-committed patches are nice, these two are the ones that

Hi, I spent some time verifying this memory-leak patch for the gather-merge case and I too found it good. In the query I tried, around 10 million tuples were passed through gather-merge. On analysing the output of top, the memory usage looks acceptable and it gets freed once the query is completed. Since I was trying on my local system only, I tested for up to 8 workers and didn't find any memory leaks for the queries I tried. The test case is in the attached file.

-- Regards, Rafia Sabih EnterpriseDB: http://www.enterprisedb.com/

ml_test_query.sql
Speeding up pg_upgrade
As part of PGConf.Asia 2017 in Tokyo, we had an unconference topic about zero-downtime upgrades. After the usual discussion of using logical replication, Slony, and perhaps having the server be able to read old and new system catalogs, we discussed speeding up pg_upgrade.

There are clusters that take a long time to dump the schema from the old cluster and recreate it in the new cluster. One idea of speeding up pg_upgrade would be to allow pg_upgrade to be run in two stages:

1. prevent system catalog changes while the old cluster is running, and dump the old cluster's schema and restore it in the new cluster
2. shut down the old cluster and copy/link the data files

My question is whether the schema dump/restore is time-consuming enough to warrant this optional more complex API, and whether people would support adding a server setting that prevented all system table changes?

-- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: Speeding up pg_upgrade
Bruce, * Bruce Momjian (br...@momjian.us) wrote: > As part of PGConf.Asia 2017 in Tokyo, we had an unconference topic about > zero-downtime upgrades. After the usual discussion of using logical > replication, Slony, and perhaps having the server be able to read old > and new system catalogs, we discussed speeding up pg_upgrade. Sounds familiar. > There are clusters that take a long time to dump the schema from the old > cluster and recreate it in the new cluster. One idea of speeding up > pg_upgrade would be to allow pg_upgrade to be run in two stages: > > 1. prevent system catalog changes while the old cluster is running, and > dump the old cluster's schema and restore it in the new cluster > > 2. shut down the old cluster and copy/link the data files Perhaps a bit more complicated, but couldn't we copy/link while the old cluster is online and in backup mode, finish backup mode, shut down the old cluster, and then play forward the WAL to catch any relation extents being added or similar, and then flip to the new PG version? > My question is whether the schema dump/restore is time-consuming enough > to warrant this optional more complex API, and whether people would > support adding a server setting that prevented all system table changes? When you say 'system table changes', you're referring to basically all DDL, right? Just wish to clarify as there might be some confusion between the terminology you're using here and allow_system_table_mods. Would we need to have autovacuum shut down too..? The other concern is if there's changes made to the catalogs by non-DDL activity that needs to be addressed too (logical replication?); nothing definite springs to mind off-hand for me, but perhaps others will think of things. Thanks! Stephen
Re: Speeding up pg_upgrade
Hi On Tue, Dec 5, 2017 at 11:01 PM, Bruce Momjian wrote: > As part of PGConf.Asia 2017 in Tokyo, we had an unconference topic about > zero-downtime upgrades. After the usual discussion of using logical > replication, Slony, and perhaps having the server be able to read old > and new system catalogs, we discussed speeding up pg_upgrade. > > There are clusters that take a long time to dump the schema from the old > cluster and recreate it in the new cluster. One idea of speeding up > pg_upgrade would be to allow pg_upgrade to be run in two stages: > > 1. prevent system catalog changes while the old cluster is running, and > dump the old cluster's schema and restore it in the new cluster > > 2. shut down the old cluster and copy/link the data files > When we were discussing this, I was thinking that the linking could be done in phase 1 too, as that's potentially slow on a very large schema. > > My question is whether the schema dump/restore is time-consuming enough > to warrant this optional more complex API, and whether people would > support adding a server setting that prevented all system table changes? I've certainly heard of cases where pg_upgrade takes significant amounts of time to run on very complex databases. -- Dave Page Blog: http://pgsnake.blogspot.com Twitter: @pgsnake EnterpriseDB UK: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Speeding up pg_upgrade
On Tue, Dec 5, 2017 at 09:16:02AM -0500, Stephen Frost wrote: > Bruce, > > * Bruce Momjian (br...@momjian.us) wrote: > > As part of PGConf.Asia 2017 in Tokyo, we had an unconference topic about > > zero-downtime upgrades. After the usual discussion of using logical > > replication, Slony, and perhaps having the server be able to read old > > and new system catalogs, we discussed speeding up pg_upgrade. > > Sounds familiar. Yeah. :-| > > There are clusters that take a long time to dump the schema from the old > > cluster and recreate it in the new cluster. One idea of speeding up > > pg_upgrade would be to allow pg_upgrade to be run in two stages: > > > > 1. prevent system catalog changes while the old cluster is running, and > > dump the old cluster's schema and restore it in the new cluster > > > > 2. shut down the old cluster and copy/link the data files > > Perhaps a bit more complicated, but couldn't we copy/link while the > old cluster is online and in backup mode, finish backup mode, shut down > the old cluster, and then play forward the WAL to catch any relation > extents being added or similar, and then flip to the new PG version? Well, that would require reading the old WAL, which would add an additional compatibility requirement that seems unwise. > > My question is whether the schema dump/restore is time-consuming enough > > to warrant this optional more complex API, and whether people would > > support adding a server setting that prevented all system table changes? > > When you say 'system table changes', you're referring to basically all > DDL, right? Just wish to clarify as there might be some confusion > between the terminology you're using here and allow_system_table_mods. Not only all DDL, but even updating them for the internal stuff, like pg_class.relfrozenxid. > Would we need to have autovacuum shut down too..? Yes. > The other concern is if there's changes made to the catalogs by non-DDL > activity that needs to be addressed too (logical replication?); nothing > definite springs to mind off-hand for me, but perhaps others will think > of things. Yes, it could extend to many parts of the system, which is why I am asking if it is worth it. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: Speeding up pg_upgrade
On Tue, Dec 5, 2017 at 11:16:26PM +0900, Dave Page wrote: > Hi > > On Tue, Dec 5, 2017 at 11:01 PM, Bruce Momjian wrote: > > As part of PGConf.Asia 2017 in Tokyo, we had an unconference topic about > zero-downtime upgrades. After the usual discussion of using logical > replication, Slony, and perhaps having the server be able to read old > and new system catalogs, we discussed speeding up pg_upgrade. > > There are clusters that take a long time to dump the schema from the old > cluster and recreate it in the new cluster. One idea of speeding up > pg_upgrade would be to allow pg_upgrade to be run in two stages: > > 1. prevent system catalog changes while the old cluster is running, and > dump the old cluster's schema and restore it in the new cluster > > 2. shut down the old cluster and copy/link the data files > > > When we were discussing this, I was thinking that the linking could be done in > phase 1 too, as that's potentially slow on a very large schema. Uh, good point! You can create the hard links while the system is running, no problem! It would only be copy that can't be done while the system is running. Of course a big question is whether hard linking takes any measurable time. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: Speeding up pg_upgrade
Dave, * Dave Page (dp...@pgadmin.org) wrote: > On Tue, Dec 5, 2017 at 11:01 PM, Bruce Momjian wrote: > > As part of PGConf.Asia 2017 in Tokyo, we had an unconference topic about > > zero-downtime upgrades. After the usual discussion of using logical > > replication, Slony, and perhaps having the server be able to read old > > and new system catalogs, we discussed speeding up pg_upgrade. > > > > There are clusters that take a long time to dump the schema from the old > > cluster and recreate it in the new cluster. One idea of speeding up > > pg_upgrade would be to allow pg_upgrade to be run in two stages: > > > > 1. prevent system catalog changes while the old cluster is running, and > > dump the old cluster's schema and restore it in the new cluster > > > > 2. shut down the old cluster and copy/link the data files > > When we were discussing this, I was thinking that the linking could be done > in phase 1 too, as that's potentially slow on a very large schema. Right, I had that thought too when first reading this, but the problem there is that new files can show up due to a relation being extended (at least, and perhaps in other cases too..). > > My question is whether the schema dump/restore is time-consuming enough > > to warrant this optional more complex API, and whether people would > > support adding a server setting that prevented all system table changes? > > I've certainly heard of cases where pg_upgrade takes significant amounts of > time to run on very complex databases. Right, but that doesn't really answer the question as to which part of the pg_upgrade process is taking up the time. In any case, of course, if we're able to move part of what pg_upgrade does to be while the old server is online then that takes whatever the cost of that is out of the downtime window. The question is if that's a 5% improvement in the overall performance of pg_upgrade or a 70% one. This will be case-by-case, of course, but if, in the best-case, we only get a 5% improvement then this might not be worth the risk. Thanks! Stephen
Re: Speeding up pg_upgrade
On Tue, Dec 5, 2017 at 09:23:49AM -0500, Stephen Frost wrote: > Dave, > > * Dave Page (dp...@pgadmin.org) wrote: > > On Tue, Dec 5, 2017 at 11:01 PM, Bruce Momjian wrote: > > > As part of PGConf.Asia 2017 in Tokyo, we had an unconference topic about > > > zero-downtime upgrades. After the usual discussion of using logical > > > replication, Slony, and perhaps having the server be able to read old > > > and new system catalogs, we discussed speeding up pg_upgrade. > > > > > > There are clusters that take a long time to dump the schema from the old > > > cluster and recreate it in the new cluster. One idea of speeding up > > > pg_upgrade would be to allow pg_upgrade to be run in two stages: > > > > > > 1. prevent system catalog changes while the old cluster is running, and > > > dump the old cluster's schema and restore it in the new cluster > > > > > > 2. shut down the old cluster and copy/link the data files > > > > When we were discussing this, I was thinking that the linking could be done > > in phase 1 too, as that's potentially slow on a very large schema. > > Right, I had that thought too when first reading this, but the problem > there is that new files can show up due to a relation being extended (at > least, and perhaps in other cases too..). Oh, yikes, yes. > > > My question is whether the schema dump/restore is time-consuming enough > > > to warrant this optional more complex API, and whether people would > > > support adding a server setting that prevented all system table changes? > > > > I've certainly heard of cases where pg_upgrade takes significant amounts of > > time to run on very complex databases. > > Right, but that doesn't really answer the question as to which part of > the pg_upgrade process is taking up the time. > > In any case, of course, if we're able to move part of what pg_upgrade > does to be while the old server is online then that takes whatever the > cost of that is out of the downtime window. The question is if that's a > 5% improvement in the overall performance of pg_upgrade or a 70% one. > This will be case-by-case, of course, but if, in the best-case, we only > get a 5% improvement then this might not be worth the risk. Yes, and who is going to know if they have a setup where the more complex API is worth it? pg_upgrade is already complex enough to use. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: Speeding up pg_upgrade
Bruce, * Bruce Momjian (br...@momjian.us) wrote: > On Tue, Dec 5, 2017 at 09:16:02AM -0500, Stephen Frost wrote: > > > There are clusters that take a long time to dump the schema from the old > > > cluster and recreate it in the new cluster. One idea of speeding up > > > pg_upgrade would be to allow pg_upgrade to be run in two stages: > > > > > > 1. prevent system catalog changes while the old cluster is running, and > > > dump the old cluster's schema and restore it in the new cluster > > > > > > 2. shut down the old cluster and copy/link the data files > > > > Perhaps a bit more complicated, but couldn't we copy/link while the > > old cluster is online and in backup mode, finish backup mode, shut down > > the old cluster, and then play forward the WAL to catch any relation > > extents being added or similar, and then flip to the new PG version? > > Well, that would require reading the old WAL, which would add an > additional compatibility requirement that seems unwise. In my proposal, this would be the old version of PG reading the old WAL. Thinking about it a bit further though, I'm not sure it'd end up working in link mode anyway, due to post-backup-finish changes that could be made by the old server on the data files before it's shut down. We have to have a way of dealing with the delta between the hard link trees after the old server is shut down though because there could be new relation extents, at least. > > > My question is whether the schema dump/restore is time-consuming enough > > > to warrant this optional more complex API, and whether people would > > > support adding a server setting that prevented all system table changes? > > > > When you say 'system table changes', you're referring to basically all > > DDL, right? Just wish to clarify as there might be some confusion > > between the terminology you're using here and allow_system_table_mods. > > Not only all DDL, but even updating them for the internal stuff, like > pg_class.relfrozenxid. Good point. We'd really need a pretty bullet-proof way to ensure that the catalog isn't changed during this time period and that seems like it might be difficult without a lot of work. > > Would we need to have autovacuum shut down too..? > > Yes. Ok, makes sense. > > The other concern is if there's changes made to the catalogs by non-DDL > > activity that needs to be addressed too (logical replication?); nothing > > definite springs to mind off-hand for me, but perhaps others will think > > of things. > > Yes, it could extend to many parts of the system, which is why I am > asking if it is worth it. My initial reaction is that it's worth it, but then I also wonder about other issues (having to get an ANALYZE done on the new cluster before opening it up, for example..) and it makes me wonder if perhaps it'll end up being too much risk for too little gain. Thanks! Stephen
Re: Speeding up pg_upgrade
Bruce, * Bruce Momjian (br...@momjian.us) wrote: > > In any case, of course, if we're able to move part of what pg_upgrade > > does to be while the old server is online then that takes whatever the > > cost of that is out of the downtime window. The question is if that's a > > 5% improvement in the overall performance of pg_upgrade or a 70% one. > > This will be case-by-case, of course, but if, in the best-case, we only > > get a 5% improvement then this might not be worth the risk. > > Yes, and who is going to know if they have a setup where the more > complex API is worth it? pg_upgrade is already complex enough to use. Sure, but the solution there is really to make pg_upgrade simpler to use, even as we add these more complicated APIs to it. What that likely means in practical terms is that we have another utility, which uses pg_upgrade underneath, that you're able to configure to know about your existing cluster and the version of PG you want to upgrade to and where you want it and if you want a copy or if hard-links are ok, etc. Having such a tool is actually what I'd been hoping would come out of the documented process for doing a "pg_upgrade" on replicas that's currently in our documentation. That's not happened yet, but it's something that David Steele and I have been chatting about because the procedure in the documentation is terribly difficult and dangerous for those who aren't as familiar with the system. Perhaps we could have one tool that handles both the more complicated pg_upgrade API and deals with upgrading replicas. Alternatively, we could have a config file for pg_upgrade instead which might be a simpler way for people to describe exactly their current configuration and what they'd like to go to. Upgrading replicas involves using something like SSH though.. Thanks! Stephen
Re: simplehash: tb->sizemask = 0
On 11/29/17 13:49, Andres Freund wrote:
> On 2017-11-27 22:53:41 -0500, Tom Lane wrote:
>> Tomas Vondra writes:
>>> I'm a bit puzzled by this code in SH_COMPUTE_PARAMETERS:
>>>
>>>     if (tb->size == SH_MAX_SIZE)
>>>         tb->sizemask = 0;
>>>     else
>>>         tb->sizemask = tb->size - 1;
>>>
>>> Doesn't that mean that with SH_MAX_SIZE we end up with sizemask being 0 (i.e. no bits set)?
>>
>> Yeah, which is very obviously broken: for one thing, the Asserts in SH_NEXT/SH_PREV would surely go off.
>
> That's obviously wrong. Not sure how that happened. I might have had it as a shift at first?
>
>> (Why are those assertions, anyway, and not test-and-elog? I do not think an assertion failure is a suitable way to report "hash table full".)
>
> There's a test and elog during insert. Adding actual branches into SH_NEXT/SH_PREV seems like a bad idea. Will test a fix.

I'll be happy to help test this fix when it's ready.

-- todd
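For what it's worth, one plausible shape for such a fix (an illustration only, not necessarily what gets committed) is to drop the special case and rely on truncation, assuming SH_MAX_SIZE is one past PG_UINT32_MAX so that the cast yields an all-bits-set mask:

    /* size is a power of two no larger than SH_MAX_SIZE; at that maximum,
     * (uint32) (size - 1) is 0xFFFFFFFF, i.e. every bit set */
    tb->size = size;
    tb->sizemask = (uint32) (size - 1);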
Re: Add PGDLLIMPORT lines to some variables
On Mon, Dec 4, 2017 at 8:48 PM, Peter Geoghegan wrote:
> On Mon, Dec 4, 2017 at 5:41 PM, Noah Misch wrote:
>>> I don't think we quite have an established protocol for this. My personal opinion, though I'm biased in this specific case, is that we should adopt a position that PGDLLIMPORTs should basically be backpatched whenever a credible extension even halfway reasonably requires it. There's no easy way to get this done by default, and we're so far unwilling to just slap this onto every variable. So to not further disadvantage people forced to live in the MS environment, that seems the sanest solution. It's not like these are high risk.
>>
>> +1
>
> If that's going to be the policy, then I have some requests of my own. I would like to add some PGDLLIMPORTs to suit the external version of amcheck (the version that targets earlier versions of Postgres). These are:
>
> RecentGlobalXmin -- This is only PGDLLIMPORT on Postgres 10+, following commit 56018bf2. I'd like to get that back to 9.4, although there is no reason to not include 9.3.
>
> TransactionXmin -- This is needed for the newer heap-matches-index verification check. Again, I would like this on 9.4+, but 9.3+ works too.
>
> Note that somebody asked about running it on Windows recently, and on one other occasion in the past. It does come up.

Committed with these additions. Please check that I haven't messed anything up.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Proposal: Local indexes for partitioned table
On Mon, Dec 4, 2017 at 6:57 PM, David Rowley wrote: > On 2 December 2017 at 03:39, Robert Haas wrote: >> On Thu, Nov 30, 2017 at 11:39 PM, David Rowley >> wrote: >>> I feel like we could do better here with little extra effort. The >>> DETACH index feature does not really seem required for this patch. >> >> Because of the dump/restore considerations mentioned in >> http://postgr.es/m/ca+tgmobuhghg9v8saswkhbbfywg5a0qb+jgt0uovq5ycbdu...@mail.gmail.com >> I believe we need a way to create the index on the parent without >> immediately triggering index builds on the children, plus a way to >> create an index on a child after-the-fact and attach it to the parent. >> Detach isn't strictly required, but why support attach and not detach? > > I proposed that this worked a different way in [1]. I think the way I > mention is cleaner as it means there's no extra reason for a > partitioned index to be indisvalid=false than there is for any other > normal index. How does that proposal keep pg_dump from latching onto the wrong index, if there's more than one index on the same columns? > My primary reason for not liking this way is around leaving indexes > around as indisvalid=false, however, I suppose that an index could be > replaced atomically with a DETACH and ATTACH within a single > transaction. I had previously not really liked the idea of > invalidating an index by DETACHing a leaf table's index from it. Of > course, this patch does nothing special with partitioned indexes, but > I believe one day we will want to do something with these and that the > DETACH/ATTACH is not the best design for replacing part of the index. What do you propose? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] postgres_fdw super user checks
On Mon, Dec 4, 2017 at 5:57 PM, Ashutosh Bapat wrote: > I think the real behaviour can be described as something like this: > > "Only superusers may connect to foreign servers without password > authentication, so always specify the password > option for user mappings that may be used by non-superusers." But > which user mappings may be used by non-superusers can not be defined > without explaining views owned by superusers. I don't think we should > be talking about views in that part of documentation. Well, if we don't, then I'm not sure we can really make this clear. Anyhow, I've committed the patch to master for now; we can keep arguing about what, if anything, to do for back-branch documentation. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] postgres_fdw super user checks
Robert, Ashutosh, * Robert Haas (robertmh...@gmail.com) wrote: > On Mon, Dec 4, 2017 at 5:57 PM, Ashutosh Bapat > wrote: > > I think the real behaviour can be described as something like this: > > > > "Only superusers may connect to foreign servers without password > > authentication, so always specify the password > > option for user mappings that may be used by non-superusers." But > > which user mappings may be used by non-superusers can not be defined > > without explaining views owned by superusers. I don't think we should > > be talking about views in that part of documentation. > > Well, if we don't, then I'm not sure we can really make this clear. Yeah, I'm pretty sure we need to spell out the situation around views here because it's different from how views normally work as discussed in Rules and Privileges. I'll note that the Rules and Privileges section could use a bit of love too - the v10 docs have: "Due to rewriting of queries by the PostgreSQL rule system, other tables/views than those used in the original query get accessed. When update rules are used, this can include write access to tables." Which isn't really accurate since simple updatable views were added. Looking at it more though, really, I think that whole page needs to be re-cast to be about *views* and stop talking about rules. That's really a separate discussion to have though. > Anyhow, I've committed the patch to master for now; we can keep > arguing about what, if anything, to do for back-branch documentation. Thanks! Stephen
Re: pg_dumpall -r -c try to drop user postgres
2017-12-03 23:48 GMT+01:00 Michael Paquier:
> On Sun, Dec 3, 2017 at 3:21 PM, Pavel Stehule wrote:
>> I am not sure if user postgres should be removed, so it is probably a bug:
>>
>> pg_dumpall -r -c | grep postgres
>>
>> DROP ROLE postgres;
>> CREATE ROLE postgres;
>
> You are looking for this bit of code:
>
>     /*
>      * If asked to --clean, do that first. We can avoid detailed
>      * dependency analysis because databases never depend on each other,
>      * and tablespaces never depend on each other. Roles could have
>      * grants to each other, but DROP ROLE will clean those up silently.
>      */
>     if (output_clean)
>     {
>         if (!globals_only && !roles_only && !tablespaces_only)
>             dropDBs(conn);
>
>         if (!roles_only && !no_tablespaces)
>             dropTablespaces(conn);
>
>         if (!tablespaces_only)
>             dropRoles(conn);
>     }
>
> Could you clarify what you think is wrong here?

I am not sure if the issue is in this code. But I am sure that DROP ROLE postgres is just nonsense. This command should fail every time, and so it should not be generated.

Regards

Pavel

> --
> Michael
Re: Add PGDLLIMPORT lines to some variables
On Tue, Dec 5, 2017 at 6:49 AM, Robert Haas wrote: >> RecentGlobalXmin -- This is only PGDLLIMPORT on Postgres 10+, >> following commit 56018bf2. I'd like to get that back to 9.4, although >> there is no reason to not include 9.3. >> >> TransactionXmin -- This is needed for the newer heap-matches-index >> verification check. Again, I would like this on 9.4+, but 9.3+ works >> too. >> >> Note that somebody asked about running it on Windows recently, and on >> one other occasion in the past. It does come up. > > Committed with these additions. Please check that I haven't messed anything > up. Thanks, but you modified RecentGlobalDataXmin, not RecentGlobalXmin. Can you fix it, please? -- Peter Geoghegan
Re: Add PGDLLIMPORT lines to some variables
On Tue, Dec 5, 2017 at 9:09 AM, Peter Geoghegan wrote: >> Committed with these additions. Please check that I haven't messed anything >> up. > > Thanks, but you modified RecentGlobalDataXmin, not RecentGlobalXmin. > Can you fix it, please? I was looking at the wrong branch. Forget it. Thanks -- Peter Geoghegan
dsa_allocate could not find 4 free pages
Hackers,

I'm getting the error in $subject and am only posting here because (a) the comments in src/backend/utils/mmgr/dsa.c circa line 720 suggest that this is a bug, and (b) the DSA code appears to be fairly new to the postgresql project, and perhaps not fully debugged. I am running against a build from 291a31c42c893574e9676e00121e6c6915a59de5 REL_10_STABLE, on centos linux.

I'm not asking for help configuring my system. If this is a run of the mill out of memory type error, I'll deal with it myself. But if this indicates a bug, I'll be happy to try to distill what I'm doing to a test case that I can share. (I've only hit this once, and have little idea whether it would be reproducible.) The gist of the query was an array_agg over the results of a parallel query over a partitioned table, executed from within a plpgsql function, like:

SELECT ARRAY_AGG((a, latest, timelogged, b, cnt)::custom_composite_type) AS trans
FROM (SELECT a, b, timelogged,
             MAX(timelogged) OVER (PARTITION BY a) AS latest,
             SUM(cnt) AS cnt
      FROM (SELECT a, b, timelogged, COUNT(*) AS cnt
            FROM mytable
            WHERE a = ANY('{65537,65538,65539,65540,65541,65542,65543,65544}'::OID[])
              AND timelogged >= '2017-12-04 16:12:50-08'::TIMESTAMP
              AND timelogged < '2017-12-04 17:12:50-08'::TIMESTAMP
            GROUP BY a, b, timelogged
           ) AS ss
      GROUP BY a, b, timelogged
     ) AS s

If this is not a bug, but rather something that users can hit through normal resource exhaustion, then perhaps the comments in dsa.c can be changed to make that clear.

Thanks!

mark
Re: Usage of epoch in txid_current
On 2017-12-05 16:21:27 +0530, Amit Kapila wrote:
> On Tue, Dec 5, 2017 at 2:49 PM, Alexander Korotkov wrote:
>> On Tue, Dec 5, 2017 at 6:19 AM, Amit Kapila wrote:
>>> Currently, txid_current and friends export a 64-bit format of transaction id that is extended with an “epoch” counter so that it will not wrap around during the life of an installation. The epoch value it uses is based on the epoch that is maintained by checkpoint (aka only checkpoint increments it).
>>>
>>> Now if epoch changes multiple times between two checkpoints (practically the chances of this are bleak, but there is a theoretical possibility), then won't the computation of xids go wrong? Basically, it can give the same value of txid after wraparound if the checkpoint doesn't occur between the two calls to txid_current.
>>
>> AFAICS, yes, if epoch changes multiple times between two checkpoints, then the computation will go wrong. And it doesn't look like a purely theoretical possibility to me, because I think I know of a couple of instances on the edge of this...

I think it's not terribly likely in practice, due to the required WAL size. You need at least a commit record for each of 4 billion transactions. Each commit record is at least 24 bytes long, and in a non-artificial scenario you would additionally have a few hundred bytes of actual content of WAL. So we're talking about a distance of at least 0.5-2TB within a single checkpoint here. Not impossible, but not likely either.

> Okay, it is quite strange that we haven't discovered this problem till now. I think we should do something to fix it. One idea is that we track the epoch change in shared memory (probably in the same data structure (VariableCacheData) where we track nextXid). We need to increment it when the xid wraps around during xid allocation (in GetNewTransactionId). Also, we need to make it persistent, which means we need to log it in the checkpoint xlog record and write a separate xlog record for the epoch change.

I think it makes a fair bit of sense to not do the current crufty tracking of xid epochs. I don't really know how we got there, but it doesn't make terribly much sense. Don't think we need additional WAL logging though - we should be able to piggyback this onto the already existing clog logging.

I kinda wonder if we shouldn't just track nextXid as a 64bit integer internally, instead of bothering with tracking the epoch separately. Then we can "just" truncate it in the cases where it's stored in space-constrained places etc.

Greetings, Andres Freund
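For illustration, the last idea could look roughly like the sketch below. The FullTransactionId typedef and the helper names are assumptions made up for this example; the point is only the shift-and-truncate scheme:

    typedef uint64 FullTransactionId;   /* assumed internal representation */

    /* low 32 bits are the xid stored on disk and on the wire */
    static inline TransactionId
    XidFromFullXid(FullTransactionId fxid)
    {
        return (TransactionId) fxid;
    }

    /* high 32 bits play the role of today's epoch */
    static inline uint32
    EpochFromFullXid(FullTransactionId fxid)
    {
        return (uint32) (fxid >> 32);
    }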
Re: Speeding up pg_upgrade
On 12/5/17 09:23, Stephen Frost wrote: > Right, but that doesn't really answer the question as to which part of > the pg_upgrade process is taking up the time. Yeah, that should be measured before we do anything. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Usage of epoch in txid_current
Andres,

* Andres Freund (and...@anarazel.de) wrote:
> On 2017-12-05 16:21:27 +0530, Amit Kapila wrote:
>> On Tue, Dec 5, 2017 at 2:49 PM, Alexander Korotkov wrote:
>>> On Tue, Dec 5, 2017 at 6:19 AM, Amit Kapila wrote:
>>>> Currently, txid_current and friends export a 64-bit format of transaction id that is extended with an “epoch” counter so that it will not wrap around during the life of an installation. The epoch value it uses is based on the epoch that is maintained by checkpoint (aka only checkpoint increments it).
>>>>
>>>> Now if epoch changes multiple times between two checkpoints (practically the chances of this are bleak, but there is a theoretical possibility), then won't the computation of xids go wrong? Basically, it can give the same value of txid after wraparound if the checkpoint doesn't occur between the two calls to txid_current.
>>>
>>> AFAICS, yes, if epoch changes multiple times between two checkpoints, then the computation will go wrong. And it doesn't look like a purely theoretical possibility to me, because I think I know of a couple of instances on the edge of this...
>
> I think it's not terribly likely in practice, due to the required WAL size. You need at least a commit record for each of 4 billion transactions. Each commit record is at least 24 bytes long, and in a non-artificial scenario you would additionally have a few hundred bytes of actual content of WAL. So we're talking about a distance of at least 0.5-2TB within a single checkpoint here. Not impossible, but not likely either.

At the bottom end, with a 30-minute checkpoint, that's about 300MB/s. Certainly quite a bit and we might have trouble getting there for other reasons, but definitely something that can be accomplished with even a single SSD these days.

>> Okay, it is quite strange that we haven't discovered this problem till now. I think we should do something to fix it. One idea is that we track the epoch change in shared memory (probably in the same data structure (VariableCacheData) where we track nextXid). We need to increment it when the xid wraps around during xid allocation (in GetNewTransactionId). Also, we need to make it persistent, which means we need to log it in the checkpoint xlog record and write a separate xlog record for the epoch change.
>
> I think it makes a fair bit of sense to not do the current crufty tracking of xid epochs. I don't really know how we got there, but it doesn't make terribly much sense. Don't think we need additional WAL logging though - we should be able to piggyback this onto the already existing clog logging.

Don't you mean xact logging? ;)

> I kinda wonder if we shouldn't just track nextXid as a 64bit integer internally, instead of bothering with tracking the epoch separately. Then we can "just" truncate it in the cases where it's stored in space constrained places etc.

This sounds reasonable to me, at least, but I've not been in these depths much.

Thanks!

Stephen
Re: [HACKERS] parallel.c oblivion of worker-startup failures
On Wed, Nov 29, 2017 at 12:11 AM, Michael Paquier wrote:
> On Fri, Oct 27, 2017 at 9:54 PM, Amit Kapila wrote:
>> The patch still applies (with some hunks). I have added it in CF [1] to avoid losing track.
>>
>> [1] - https://commitfest.postgresql.org/15/1341/
>
> This did not get reviews and the patch still applies. I am moving it to next CF.

I'd like to break this into two patches, one to fix the general background worker problems and another to fix the problems specific to parallel query. The attached version extracts the two fixes for general background worker problems from Amit's patch and adjusts the comments a bit more extensively than he did. Barring objections, I'll commit and back-patch this; then we can deal with the other part of this afterward.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

bgworker-startup.patch
Re: Usage of epoch in txid_current
On December 5, 2017 10:01:43 AM PST, Stephen Frost wrote:
> Andres,
>
> * Andres Freund (and...@anarazel.de) wrote:
>> I think it makes a fair bit of sense to not do the current crufty tracking of xid epochs. I don't really know how we got there, but it doesn't make terribly much sense. Don't think we need additional WAL logging though - we should be able to piggyback this onto the already existing clog logging.
>
> Don't you mean xact logging? ;)

No. We log a WAL record at clog boundaries. Wraparounds have to be at one. We could just include the 64 bit xid there and would have reliable tracking.

Andres
-- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [HACKERS] Transaction control in procedures
On 12/1/17 15:28, Robert Haas wrote:
> This feature doesn't have many tests. I think it should have a lot more of them. It's tinkering with the transaction control machinery of the system in a fairly fundamental way, and that could break things.

Thank you, these are great ideas.

> I suggest, in particular, testing how it interacts with resources such as cursors and prepared statements. For example, what happens if you commit or roll back inside a cursor-for loop (given that the cursor is not holdable)?

This was discussed briefly earlier in the thread. The mid-term fix is to convert pinned cursors to holdable ones before a COMMIT in PL/pgSQL and then clean them up separately later. I have that mostly working, but I'd like to hold it for a separate patch submission. The short-term fix is to prohibit COMMIT and ROLLBACK while a portal is pinned.

I think ROLLBACK in a cursor loop might not make sense, because the cursor query itself could have side effects, so a rollback would have to roll back the entire loop. That might need more refined analysis before it could be allowed.

> - COMMIT or ROLLBACK inside a PLpgsql block with an attached EXCEPTION block, or when an SQL SAVEPOINT has been established previously.

I think that needs to be prohibited because if you end transactions in an exception-handled block, you can no longer actually roll back that block when an exception occurs, which was the entire point.

> - COMMIT or ROLLBACK inside a procedure with a SET clause attached,

That also needs to be prohibited because of the way the GUC nesting currently works. It's probably possible to fix it, but it would be a separate effort.

> and/or while SET LOCAL is in effect either at the inner or outer level.

That seems to work fine.

> - COMMIT or ROLLBACK with open large objects.

I haven't been able to reproduce any problems with that, but maybe I haven't tried hard enough.

> - COMMIT inside a procedure fails because of a serialization failure, deferred constraint, etc.

That works fine. The COMMIT fails and control exits the procedure using the normal exception propagation.

I'll submit an updated patch with some fixes for the above and more documentation.

-- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Transaction control in procedures
On Tue, Dec 5, 2017 at 1:25 PM, Peter Eisentraut wrote: > I think ROLLBACK in a cursor loop might not make sense, because the > cursor query itself could have side effects, so a rollback would have to > roll back the entire loop. That might need more refined analysis before > it could be allowed. COMMIT really has the same problem; if the cursor query has side effects, you can't commit those side effects piecemeal as the loop executes and have things behave sanely. >> - COMMIT or ROLLBACK inside a procedure with a SET clause attached, > > That also needs to be prohibited because of the way the GUC nesting > currently works. It's probably possible to fix it, but it would be a > separate effort. > >> and/or while SET LOCAL is in effect either at the inner or outer >> level. > > That seems to work fine. These two are related -- if you don't permit anything that makes temporary changes to GUCs at all, like SET clauses attached to functions, then SET LOCAL won't cause any problems. The problem is if you do a transaction operation when something set locally is in the stack of values, but not at the top. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
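A sketch of the GUC-stack case Robert describes, again assuming the patch's CREATE PROCEDURE / CALL syntax (names are hypothetical, and how this should behave is exactly what is being worked out here); the transaction operation fires while the outer locally-set value is in the stack of values but not at the top:

    CREATE PROCEDURE guc_stack_case()
    LANGUAGE plpgsql
    AS $$
    BEGIN
        SET LOCAL work_mem = '8MB';  -- inner SET LOCAL, now at the top of the stack
        COMMIT;                      -- transaction operation with both levels in effect
    END;
    $$;

    BEGIN;
    SET LOCAL work_mem = '4MB';      -- outer SET LOCAL, below the top of the stack
    CALL guc_stack_case();
    COMMIT;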
Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com
On Tue, Dec 5, 2017 at 2:49 AM, Thomas Munro wrote: > As for how to aggregate the information, isn't it reasonable to show > data from the last loop on the basis that it's representative? > Summing wouldn't make too much sense, because you didn't use that much > memory all at once. Sorts can be rescanned even without parallel query, so I guess we should try to make the parallel case kinda like the non-parallel case. If I'm not wrong, that will just use the stats from the most recent execution (i.e. the last loop) -- see show_sort_info() in explain.c. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Usage of epoch in txid_current
Stephen Frost writes: > * Andres Freund (and...@anarazel.de) wrote: >> I kinda wonder if we shouldn't just track nextXid as a 64bit integer >> internally, instead of bothering with tracking the epoch >> separately. Then we can "just" truncate it in the cases where it's >> stored in space constrained places etc. > This sounds reasonable to me, at least, but I've not been in these > depths much. +1 ... I think the reason it's like that is simply that nobody's revisited the XID generator since we decided to require 64-bit integer support. We'd need this for support of true 64-bit XIDs, too, though I'm unsure whether that project is going anywhere anytime soon. In any case it seems like a separable subset of that work. regards, tom lane
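For context, the epoch/xid split being discussed is already visible at the SQL level, since txid_current() returns the epoch-extended 64-bit value:

    WITH x AS (SELECT txid_current() AS full_txid)
    SELECT full_txid,
           full_txid >> 32                      AS epoch,
           full_txid & ((1::bigint << 32) - 1)  AS xid
    FROM x;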
Re: dsa_allocate could not find 4 free pages
On Wed, Dec 6, 2017 at 6:17 AM, Mark Dilger wrote: > I'm not asking for help configuring my system. If this is a run of the mill > out of memory > type error, I'll deal with it myself. But if this indicates a bug, I'll be > happy to try to distill > what I'm doing to a test case that I can share. (I've only hit this once, > and have little > idea whether it would be reproducible.) The gist of the query was an > array_agg over > the results of a parallel query over a partitioned table, executed from > within a plpgsql > function, like: > > SELECT ARRAY_AGG((a, latest, timelogged, b, cnt)::custom_composite_type) AS > trans FROM > (SELECT a, b, timelogged, MAX(timelogged) OVER (PARTITION BY a) AS > latest, SUM(cnt) AS cnt FROM > (SELECT a, b, timelogged, COUNT(*) AS cnt > FROM mytable > WHERE a= > ANY('{65537,65538,65539,65540,65541,65542,65543,65544}'::OID[]) > AND timelogged >= '2017-12-04 16:12:50-08'::TIMESTAMP > AND timelogged < '2017-12-04 17:12:50-08'::TIMESTAMP > GROUP BY a, b, timelogged > ) AS ss > GROUP BY a, b, timelogged > ) AS s Hi Mark, Does the plan have multiple Gather nodes with Parallel Bitmap Heap Scan? https://www.postgresql.org/message-id/flat/CAEepm%3D0Mv9BigJPpribGQhnHqVGYo2%2BkmzekGUVJJc9Y_ZVaYA%40mail.gmail.com -- Thomas Munro http://www.enterprisedb.com
Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com
On Tue, Dec 5, 2017 at 12:39 AM, Amit Kapila wrote: > Yeah, that sounds better, so modified the patch accordingly. I committed this to master and REL_10_STABLE, but it conflicts all over the place on 9.6. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: es_query_dsa is broken
On Tue, Dec 5, 2017 at 7:24 AM, Amit Kapila wrote:
> +     EState     *estate = gatherstate->ps.state;
> +
> +     /* Install our DSA area while executing the plan. */
> +     estate->es_query_dsa = gatherstate->pei->area;
>       outerTupleSlot = ExecProcNode(outerPlan);
> +     estate->es_query_dsa = NULL;
>
> Won't the above coding pattern create a problem, if ExecProcNode throws an error and outer block catches it and continues execution (consider the case of execution inside PL blocks)?

I don't see what the problem is. The query that got aborted by the error wouldn't be sharing an EState with one that didn't. Control would have to return to someplace inside the same ExecProcNode and it would have to return normally. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] pow support for pgbench
On Tue, Dec 5, 2017 at 7:44 AM, Raúl Marín Rodríguez wrote: > I've been giving a thought about this and I think we could reach the > compromise > of having a single function with 2 overloads: > * pow(double, double) -> double: Uses C pow(). > * pow(int, int) -> double: Uses ipow() for positive exponents, and pow() > for negative exponents. > > In both cases we'd return a double but we use the fast ipow if it's possible > (which can be 20x faster), so at the cost of an extra cast if you need an > int, > we'd have a consistent API. Would this be acceptable? It seems OK to me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Walsender timeouts and large transactions
On Mon, Dec 4, 2017 at 10:59 PM, Craig Ringer wrote:
> To me it looks like it's time to get this committed, marking as such.

This version has noticeably more code rearrangement than before, and I'm not sure that is actually buying us anything. Why not keep the changes minimal?

Also, TBH, this doesn't seem to have been carefully reviewed for style:

-    if (!pq_is_send_pending())
-        return;
+    /* Try taking fast path unless we get too close to walsender timeout. */
+    if (now < TimestampTzPlusMilliseconds(last_reply_timestamp,
+                                          wal_sender_timeout / 2))
+    {
+        if (!pq_is_send_pending())
+            return;
+    }

Generally we write if (a && b) { ... }, not if (a) { if (b) ... }. And here a stray semicolon gets added after a closing brace:

-}
+};

It's hard to understand how this got through review. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: dsa_allocate could not find 4 free pages
> On Dec 5, 2017, at 11:25 AM, Thomas Munro wrote:
>
> On Wed, Dec 6, 2017 at 6:17 AM, Mark Dilger wrote:
>> I'm not asking for help configuring my system. If this is a run of the mill out of memory
>> type error, I'll deal with it myself. But if this indicates a bug, I'll be happy to try to distill
>> what I'm doing to a test case that I can share. (I've only hit this once, and have little
>> idea whether it would be reproducible.) The gist of the query was an array_agg over
>> the results of a parallel query over a partitioned table, executed from within a plpgsql
>> function, like:
>>
>> SELECT ARRAY_AGG((a, latest, timelogged, b, cnt)::custom_composite_type) AS trans FROM
>>     (SELECT a, b, timelogged, MAX(timelogged) OVER (PARTITION BY a) AS latest, SUM(cnt) AS cnt FROM
>>         (SELECT a, b, timelogged, COUNT(*) AS cnt
>>          FROM mytable
>>          WHERE a = ANY('{65537,65538,65539,65540,65541,65542,65543,65544}'::OID[])
>>            AND timelogged >= '2017-12-04 16:12:50-08'::TIMESTAMP
>>            AND timelogged < '2017-12-04 17:12:50-08'::TIMESTAMP
>>          GROUP BY a, b, timelogged
>>         ) AS ss
>>      GROUP BY a, b, timelogged
>>     ) AS s
>
> Hi Mark,
>
> Does the plan have multiple Gather nodes with Parallel Bitmap Heap Scan?

This was encountered and logged by a java client. The only data I got was:

    org.postgresql.util.PSQLException: ERROR: dsa_allocate could not find 4 free pages
      Where: parallel worker

I know what the query was, because it is the only query this client issues, but there is no record of what the plan might have been. The java client runs queries spanning a long time period followed by multiple queries spanning short time periods. The sql query is the same in both cases; only the bind parameters for the time frame differ. The short time period queries do not generate parallel plans (in my experience), but the long time period ones do, with plans that look something like this:

 Aggregate  (cost=31552146.47..31552146.48 rows=1 width=32)
   -> WindowAgg  (cost=14319575.96..31538146.47 rows=80 width=43)
      -> GroupAggregate  (cost=14319575.96..31526146.47 rows=80 width=39)
           Group Key: acct_r.mta, acct_r.dtype, (moment(acct_r.timelogged))
         -> Finalize GroupAggregate  (cost=14319575.96..31356146.47 rows=800 width=15)
              Group Key: acct_r.mta, acct_r.dtype, (moment(acct_r.timelogged))
            -> Gather Merge  (cost=14319575.96..30056146.47 rows=12000 width=15)
                 Workers Planned: 15
               -> Partial GroupAggregate  (cost=14318575.64..14755146.15 rows=800 width=15)
                    Group Key: acct_r.mta, acct_r.dtype, (moment(acct_r.timelogged))
                  -> Sort  (cost=14318575.64..14385889.74 rows=26925641 width=7)
                       Sort Key: acct_r.mta, acct_r.dtype, (moment(acct_r.timelogged))
                     -> Result  (cost=0.00..10627491.52 rows=26925641 width=7)
                        -> Append  (cost=0.00..10290921.01 rows=26925641 width=7)
                           -> Parallel Seq Scan on acct_r  (cost=0.00..784882.50 rows=1805625 width=7)
                                Filter: (mta = ANY ('{65537,65538,65539,65540,65541,65542,65543,65544}'::smallid[]))
                           -> Parallel Seq Scan on acct_d  (cost=0.00..542032.80 rows=1757640 width=7)
                                Filter: (mta = ANY ('{65537,65538,65539,65540,65541,65542,65543,65544}'::smallid[]))
                           -> Parallel Seq Scan on acct_b  (cost=0.00..111584.91 rows=206096 width=7)
                                Filter: (mta = ANY ('{65537,65538,65539,65540,65541,65542,65543,65544}'::smallid[]))
                           -> Parallel Seq Scan on acct_x  (cost=0.00..20661.29 rows=48165 width=7)
                                Filter: (mta = ANY ('{65537,65538,65539,65540,65541,65542,65543,65544}'::smallid[]))
                           -> Parallel Seq Scan on acct_r acct_r_1  (cost=0.00..795917.43 rows=1831322 width=7)
                                Filter: (mta = ANY ('{65537,65538,65539,65540,65541,65542,65543,65544}'::smallid[]))
                           -> Parallel Seq Scan on acct_d acct_d_1  (cost=0.00..549630.82 rows=1782141 width=7)
                                Filter: (mta = ANY ('{65537,65538,65539,65540,65541,65542,65543,65544}'::smallid[]))
                           -> Parallel Seq Scan on acct_b acct_b_1
Re: [HACKERS] Proposal: Local indexes for partitioned table
On 6 December 2017 at 04:29, Robert Haas wrote:
> On Mon, Dec 4, 2017 at 6:57 PM, David Rowley
>> I proposed that this worked a different way in [1]. I think the way I mention is cleaner as it means there's no extra reason for a partitioned index to be indisvalid=false than there is for any other normal index.
>
> How does that proposal keep pg_dump from latching onto the wrong index, if there's more than one index on the same columns?

I'm not hugely concerned about that. It's not a new problem and it's not a problem that I recall seeing anyone complain about, at least not to the extent that we've ever bothered to fix it. The existing problem is with FOREIGN KEY constraints just choosing the first matching index in transformFkeyCheckAttrs(). We can see the issue today with:

    create table t1 (id int not null);
    create unique index t1_idx_b on t1 (id);
    create table t2 (id int references t1 (id));
    create unique index t1_idx_a on t1 (id);

    -- pg_dump this database and restore it, then in the restored copy:
    # drop index t1_idx_a;
    ERROR: cannot drop index t1_idx_a because other objects depend on it
    DETAIL: constraint t2_id_fkey on table t2 depends on index t1_idx_a
    HINT: Use DROP ... CASCADE to drop the dependent objects too.

>> My primary reason for not liking this way is around leaving indexes around as indisvalid=false, however, I suppose that an index could be replaced atomically with a DETACH and ATTACH within a single transaction. I had previously not really liked the idea of invalidating an index by DETACHing a leaf table's index from it. Of course, this patch does nothing special with partitioned indexes, but I believe one day we will want to do something with these and that the DETACH/ATTACH is not the best design for replacing part of the index.
>
> What do you propose?

For this patch, I propose we do nothing about that, but for the future, I already mentioned an idea to do this above:

> I'd have thought some sort of:
> ALTER INDEX name_of_partitioned_index REPLACE INDEX FOR partitioned_tablename WITH name_of_new_matching_bloat_free_index;

This way there is no period where we have to mark the index as indisvalid=false. So, for when we invent something that uses this feature, there's no window of time that concurrently running queries are in danger of generating a poor plan. The index swap operation is atomic. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Proposal: Local indexes for partitioned table
David Rowley wrote: > On 6 December 2017 at 04:29, Robert Haas wrote: > > On Mon, Dec 4, 2017 at 6:57 PM, David Rowley > >> I proposed that this worked a different way in [1]. I think the way I > >> mention is cleaner as it means there's no extra reason for a > >> partitioned index to be indisvalid=false than there is for any other > >> normal index. > > > > How does that proposal keep pg_dump from latching onto the wrong > > index, if there's more than one index on the same columns? > > I'm not hugely concerned about that. It's not a new problem and it's > not a problem that I recall seeing anyone complain about, at least not > to the extent that we've ever bothered to fix it. Another reason to do things the proposed way is to let parallel restore create the indexes in parallel. If we just have a single CREATE INDEX that cascades to the partitions, that can be run by a single backend only, which is a loser. (There's also the fact that you'd lose any COMMENTs etc). -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: compress method for spgist - 2
On Tue, Dec 5, 2017 at 1:14 PM, Darafei Praliaskouski wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world: not tested
> Implements feature: not tested
> Spec compliant: not tested
> Documentation: tested, passed
>
> I've read the updated patch and see my concerns addressed.
>
> I'm looking forward to SP-GiST compress method support, as it will allow usage of SP-GiST index infrastructure for PostGIS.
>
> The new status of this patch is: Ready for Committer

I went through this patch. I found the documentation changes to be insufficient, and I've made some improvements. In particular, I didn't understand why reconstructedValue is claimed to be of type spgConfigOut.leafType when, from both the general logic and the code, it should be of type spgConfigIn.attType. I've fixed that. Nikita, correct me if I'm wrong. Also, I wonder whether we should check for the existence of a compress method in the spgvalidate() function when attType and leafType are not the same? We don't do this for now. 0002-spgist-polygon-8.patch is OK for me so far. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company 0001-spgist-compress-method-9.patch Description: Binary data
Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com
On Wed, Dec 6, 2017 at 12:02 AM, Amit Kapila wrote: > On Tue, Dec 5, 2017 at 1:29 PM, Thomas Munro > wrote: >> Or would it be an option to change the time >> ExecXXXRetrieveInstrumentation() is called so it is run only once? > > To me, that doesn't sound like a bad option. I think if do so, then > we don't even need to reinitialize the shared sort stats. I think > something, like attached, should work if we want to go this route. We > can add regression test if this is what we think is a good idea. > Having said that, one problem I see doing thing this way is that in > general, we will display the accumulated stats of each worker, but for > sort or some other special nodes (like hash), we will show the > information of only last loop. I am not sure if that is a matter of > concern, but if we want to do this way, then probably we should > explain this in documentation as well. The hash version of this code is now committed as 5bcf389e. Here is a patch for discussion that adds some extra tests to join.sql to exercise rescans of a hash join under a Gather node. It fails on head, because it loses track of the instrumentation pointer after the first loop as you described (since the Hash coding is the same as the Sort coding), so it finishes up with no instrumentation data. If you move ExecParallelRetrieveInstrumentation() to ExecParallelCleanup() as you showed in your patch, then it passes. The way I'm asserting that instrumentation data is making its way back to the leader is by turning off leader participation and then checking if it knows how many batches there were. -- Thomas Munro http://www.enterprisedb.com test-hash-join-rescan-instr-v1.patch Description: Binary data
Re: [HACKERS] Parallel Append implementation
On Tue, Nov 28, 2017 at 6:02 AM, amul sul wrote:
> Here are the changes I did on v21 patch to handle crash reported by Rajkumar[1]:
>
> diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
> index e3b17cf0e2..e0ee918808 100644
> --- a/src/backend/executor/nodeAppend.c
> +++ b/src/backend/executor/nodeAppend.c
> @@ -479,9 +479,12 @@ choose_next_subplan_for_worker(AppendState *node)
>          pstate->pa_next_plan = append->first_partial_plan;
>      else
>          pstate->pa_next_plan++;
> -    if (pstate->pa_next_plan == node->as_whichplan)
> +
> +    if (pstate->pa_next_plan == node->as_whichplan ||
> +        (pstate->pa_next_plan == append->first_partial_plan &&
> +         append->first_partial_plan >= node->as_nplans))
>      {
> -        /* We've tried everything! */
> +        /* We've tried everything or there were no partial plans */
>          pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
>          LWLockRelease(&pstate->pa_lock);
>          return false;

I changed this around a little, added a test case, and committed this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Proposal: Local indexes for partitioned table
On Tue, Dec 5, 2017 at 3:53 PM, David Rowley wrote: > I'm not hugely concerned about that. It's not a new problem and it's > not a problem that I recall seeing anyone complain about, at least not > to the extent that we've ever bothered to fix it. > > The existing problem is with FOREIGN KEY constraints just choosing the > first matching index in transformFkeyCheckAttrs() > > We can see the issue today with: > > create table t1 (id int not null); > create unique index t1_idx_b on t1 (id); > create table t2 (id int references t1 (id)); > create unique index t1_idx_a on t1 (id); > > > > > # drop index t1_idx_a; > ERROR: cannot drop index t1_idx_a because other objects depend on it > DETAIL: constraint t2_id_fkey on table t2 depends on index t1_idx_a > HINT: Use DROP ... CASCADE to drop the dependent objects too. Ugh, that sucks. I don't think it's a good argument for making the problem worse, though. What are we giving up by explicitly attaching the correct index? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Proposal: Local indexes for partitioned table
On 6 December 2017 at 09:58, Alvaro Herrera wrote: > David Rowley wrote: >> On 6 December 2017 at 04:29, Robert Haas wrote: >> > How does that proposal keep pg_dump from latching onto the wrong >> > index, if there's more than one index on the same columns? >> >> I'm not hugely concerned about that. It's not a new problem and it's >> not a problem that I recall seeing anyone complain about, at least not >> to the extent that we've ever bothered to fix it. > > Another reason to do things the proposed way is to let parallel restore > create the indexes in parallel. If we just have a single CREATE INDEX > that cascades to the partitions, that can be run by a single backend > only, which is a loser. I'm not all that sure why parallel restore could not work with this. The leaf partition indexes would all be created first, then the CREATE INDEX for the partitioned index would just be a metadata change with some checks to ensure all the leaf partition indexes exist. > (There's also the fact that you'd lose any COMMENTs etc). I can't see why that would be a problem with my proposal. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Add RANGE with values and exclusions clauses to the Window Functions
On 2017-11-27 17:55, Oliver Ford wrote: > Cheers here's v4 with the correct docs. This email is just to link the related thread "Add GROUPS option to the Window Functions": https://www.postgresql.org/message-id/CAGMVOdtWkb9X7dUh7vjaCaiH34UGFg88unXYTEOub0Rk0swSXw%40mail.gmail.com
Re: dsa_allocate could not find 4 free pages
On Wed, Dec 6, 2017 at 9:35 AM, Mark Dilger wrote: >> On Dec 5, 2017, at 11:25 AM, Thomas Munro >> wrote: >> Does the plan have multiple Gather nodes with Parallel Bitmap Heap Scan? > > This was encountered and logged by a java client. The only data I got was: > > org.postgresql.util.PSQLException: ERROR: dsa_allocate could not find 4 free > pages > Where: parallel worker This means that the DSA area is corrupted. Presumably get_best_segment(area, 4) returned a segment that wasn't actually good for 4 pages, either because it was incorrectly binned or because its free space btree was corrupted. Another path would be that make_new_segment(area, 4) returned a segment that couldn't find 4 pages, but that seems unlikely. > [query plan with one Gather and no Parallel Bitmap Heap Scan] I'm not sure why this plan would ever call dsa_allocate(). > [query plan with no Gather but plenty of Bitmap Heap Scans] And this one certainly can't. I guess you must sometimes get a different variation that has Gather nodes and uses Parallel Bitmap Heap Scan. Then the question is whether the es_query_dsa multiple Gather bug can explain this: for example, if dsa_free(wrong_dsa_area, p) was called, perhaps it could produce this type of corruption. Otherwise we have a different bug. Any clues on how to reproduce the problem would be very welcome. -- Thomas Munro http://www.enterprisedb.com
Re: [HACKERS] Walsender timeouts and large transactions
On 6 December 2017 at 04:07, Robert Haas wrote: > On Mon, Dec 4, 2017 at 10:59 PM, Craig Ringer > wrote: > > To me it looks like it's time to get this committed, marking as such. > > This version has noticeably more code rearrangement than before, and > I'm not sure that is actually buying us anything. Why not keep the > changes minimal? > > Also, TBH, this doesn't seem to have been carefully reviewed for style: > > -if (!pq_is_send_pending()) > -return; > +/* Try taking fast path unless we get too close to walsender timeout. > */ > +if (now < TimestampTzPlusMilliseconds(last_reply_timestamp, > + wal_sender_timeout / 2)) > +{ > +if (!pq_is_send_pending()) > +return; > +} > > Generally we write if (a && b) { ... } not if (a) { if (b) .. } > > -} > +}; > > It's hard to understand how this got through review. > Entirely my fault - I tend to forget to look at code style when I'm focused on functionality. My apologies, and thanks for the reminder. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Proposal: Local indexes for partitioned table
On 6 December 2017 at 11:35, Robert Haas wrote: > What are we giving up by explicitly attaching > the correct index? The part I don't like about the ATTACH and DETACH of partitioned index is that it seems to be trying to just follow the syntax we use to remove a partition from a partitioned table, however, there's a huge difference between the two, as DETACHing a partition from a partitioned table leaves the partitioned table in a valid state, it simply just no longer contains the detached partition. With the partitioned index, we leave the index in an invalid state after a DETACH. It can only be made valid again once another leaf index has been ATTACHED again and that we've verified that all other indexes on every leaf partition is also there and are valid. If we're going to use these indexes to answer queries, then it seems like we should try to keep them valid so that queries can actually use them for something. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: dsa_allocate could not find 4 free pages
> On Dec 5, 2017, at 4:07 PM, Thomas Munro > wrote: > > On Wed, Dec 6, 2017 at 9:35 AM, Mark Dilger wrote: >>> On Dec 5, 2017, at 11:25 AM, Thomas Munro >>> wrote: >>> Does the plan have multiple Gather nodes with Parallel Bitmap Heap Scan? >> >> This was encountered and logged by a java client. The only data I got was: >> >> org.postgresql.util.PSQLException: ERROR: dsa_allocate could not find 4 free >> pages >> Where: parallel worker > > This means that the DSA area is corrupted. Presumably > get_best_segment(area, 4) returned a segment that wasn't actually good > for 4 pages, either because it was incorrectly binned or because its > free space btree was corrupted. Another path would be that > make_new_segment(area, 4) returned a segment that couldn't find 4 > pages, but that seems unlikely. > >> [query plan with one Gather and no Parallel Bitmap Heap Scan] > > I'm not sure why this plan would ever call dsa_allocate(). > >> [query plan with no Gather but plenty of Btimap Heap Scans] > > And this one certainly can't. I guess you must sometimes get a > different variation that has Gather nodes and uses Parallel Bitmap > Heap Scan. Yes, I can believe that the plan is sometimes different. This error has occurred several times now, but it is still rather infrequent, so either the plan that triggers it is rare, or the bug is intermittent even with the same plan being chosen, or perhaps both. > Then the question is whether the es_query_dsa multiple > Gather bug can explain this: for example, if dsa_free(wrong_dsa_area, > p) was called, perhaps it could produce this type of corruption. > Otherwise we have a different bug. Any clues on how to reproduce the > problem would be very welcome. I have written (and rewritten, and rewritten) a tap test in the hopes of getting a test case that reproduces this reliably (or even once), but without luck so far. I will keep trying. mark
Re: [HACKERS] Proposal: Local indexes for partitioned table
On Wed, Dec 6, 2017 at 9:42 AM, David Rowley wrote: > On 6 December 2017 at 11:35, Robert Haas wrote: >> What are we giving up by explicitly attaching >> the correct index? > > The part I don't like about the ATTACH and DETACH of partitioned index > is that it seems to be trying to just follow the syntax we use to > remove a partition from a partitioned table, however, there's a huge > difference between the two, as DETACHing a partition from a > partitioned table leaves the partitioned table in a valid state, it > simply just no longer contains the detached partition. With the > partitioned index, we leave the index in an invalid state after a > DETACH. It can only be made valid again once another leaf index has > been ATTACHED again and that we've verified that all other indexes on > every leaf partition is also there and are valid. If we're going to > use these indexes to answer queries, then it seems like we should try > to keep them valid so that queries can actually use them for > something. > +1 for all that. Exactly my thoughts. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] Proposal: Local indexes for partitioned table
On 6 December 2017 at 13:42, David Rowley wrote: > On 6 December 2017 at 11:35, Robert Haas wrote: >> What are we giving up by explicitly attaching >> the correct index? > > The part I don't like about the ATTACH and DETACH of partitioned index > is that it seems to be trying to just follow the syntax we use to > remove a partition from a partitioned table, however, there's a huge > difference between the two, as DETACHing a partition from a > partitioned table leaves the partitioned table in a valid state, it > simply just no longer contains the detached partition. With the > partitioned index, we leave the index in an invalid state after a > DETACH. It can only be made valid again once another leaf index has > been ATTACHED again and that we've verified that all other indexes on > every leaf partition is also there and are valid. If we're going to > use these indexes to answer queries, then it seems like we should try > to keep them valid so that queries can actually use them for > something. Also, ATTACH and DETACH are especially useless when it comes to UNIQUE indexes. If we simply want to replace out a bloated index using a DETACH quickly followed by an ATTACH then it leaves a non-zero window of time that we can't be certain that the uniqueness is enforced. This would still work on an individual partition level, but if we ever want to reference a UNIQUE partitioned index in a foreign key constraint then what happens to the foreign key when the index is in the invalid state? Should we just disallow DETACH when a foreign key exists? or just invalidate the foreign key constraint too? Both seem like a nightmare from a DBA point-of-view. You might argue that concurrently recreating an index used by a foreign key is just as difficult today, but there's no reason to make this as problematic, is there? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] pow support for pgbench
>> In both cases we'd return a double but we use the fast ipow if it's possible (which can be 20x faster), so at the cost of an extra cast if you need an int, we'd have a consistent API. Would this be acceptable?
>
> It seems OK to me.

Computing as an int, casting to double and back to int8 can generate a loss of precision. However for powers of 2 it works exactly, so e.g. computing a mask would be ok. This proposal does not exactly match SQL behavior, but I do not see this as a problem, which is why I was happy with the previous proposal. -- Fabien.
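Fabien's precision point is easy to see with ordinary server-side SQL (not pgbench itself): an exact result needing more than a double's 53 mantissa bits loses its low bits on the way back to int8, while a power of two survives the round trip:

    -- 7^22 = 3909821048582988049 exactly, but that needs 62 significant bits,
    -- so going through double precision and back to bigint loses the low bits:
    SELECT (7::double precision ^ 22)::bigint AS pow_via_double;

    -- 2^62 is exactly representable as a double, so the round trip is exact:
    SELECT (2::double precision ^ 62)::bigint AS pow_via_double;  -- 4611686018427387904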
Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com
On Wed, Dec 6, 2017 at 12:18 AM, Robert Haas wrote: > On Tue, Dec 5, 2017 at 2:49 AM, Thomas Munro > wrote: >> As for how to aggregate the information, isn't it reasonable to show >> data from the last loop on the basis that it's representative? >> Summing wouldn't make too much sense, because you didn't use that much >> memory all at once. > > Sorts can be rescanned even without parallel query, so I guess we > should try to make the parallel case kinda like the non-parallel case. > If I'm not wrong, that will just use the stats from the most recent > execution (i.e. the last loop) -- see show_sort_info() in explain.c. > Right and seeing that I have prepared the patch (posted above [1]) which fixes it such that it will resemble the non-parallel case. Ideally, it would have obviated the need for my previous patch which got committed as 778e78ae. However, now that is committed, I could think of below options: 1. I shall rebase it atop what is committed and actually, I have done that in the attached patch. I have also prepared a regression test case patch just to show the output with and without the patch. 2. For sort node, we can fix it by having some local_info same as shared_info in sort node and copy the shared_info in that or we could reinstate the pointer to the DSM in ExecSortReInitializeDSM() by looking it up in the TOC as suggested by Thomas. If we go this way, then we need a similar fix for hash node as well. [1] - https://www.postgresql.org/message-id/CAA4eK1JBj4YCEQKeTua5%3DBMXy7zW7zNOvoXomzBP%3Dkb_aqMF7w%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com fix_accum_instr_parallel_workers_v5.patch Description: Binary data test_sort_stats_v1.1.patch Description: Binary data
Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com
On Wed, Dec 6, 2017 at 1:10 AM, Robert Haas wrote: > On Tue, Dec 5, 2017 at 12:39 AM, Amit Kapila wrote: >> Yeah, that sounds better, so modified the patch accordingly. > > I committed this to master and REL_10_STABLE, but it conflicts all > over the place on 9.6. > I will try to prepare the patch for 9.6, but I think it might be better if we first decide what to do about the open issue for sort and hash node as there can be some overlap based on what approach we choose to fix it. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Error bundling script file with extension when event trigger commands are used
Hi, I am trying to bundle a script file with my extension. The script has event triggers, and the trigger function does a select query on pg_event_trigger_ddl_commands. While executing "create extension", the following error is thrown:

    ERROR: pg_event_trigger_ddl_commands() can only be called in an event trigger function
    CONTEXT: SQL statement "SELECT * FROM pg_event_trigger_ddl_commands() WHERE object_type = 'table'

I found that there is already a bug report related to this: https://www.postgresql.org/message-id/flat/20170913075559.25630.41587%40wrigleys.postgresql.org#20170913075559.25630.41...@wrigleys.postgresql.org

I couldn't understand much from the above thread, but I would like to solve this problem. Thanks, Sanyam Jain
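For anyone trying to reproduce this, a cut-down sketch of the kind of thing such an extension install script contains (all object names here are made up):

    CREATE FUNCTION my_ext_ddl_hook() RETURNS event_trigger
    LANGUAGE plpgsql AS $$
    DECLARE
        r record;
    BEGIN
        -- pg_event_trigger_ddl_commands() is only callable from a
        -- ddl_command_end event trigger function.
        FOR r IN SELECT * FROM pg_event_trigger_ddl_commands()
                 WHERE object_type = 'table'
        LOOP
            RAISE NOTICE 'DDL on %', r.object_identity;
        END LOOP;
    END;
    $$;

    CREATE EVENT TRIGGER my_ext_on_ddl
        ON ddl_command_end
        EXECUTE PROCEDURE my_ext_ddl_hook();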
Re: [HACKERS] parallel.c oblivion of worker-startup failures
On Tue, Dec 5, 2017 at 11:34 PM, Robert Haas wrote: > On Wed, Nov 29, 2017 at 12:11 AM, Michael Paquier > wrote: >> On Fri, Oct 27, 2017 at 9:54 PM, Amit Kapila wrote: >>> The patch still applies (with some hunks). I have added it in CF [1] >>> to avoid losing track. >>> >>> [1] - https://commitfest.postgresql.org/15/1341/ >> >> This did not get reviews and the patch still applies. I am moving it to next >> CF. > > I'd like to break this into two patches, one to fix the general > background worker problems and another to fix the problems specific to > parallel query. > > The attached version extracts the two fixes for general background > worker problems from Amit's patch and adjusts the comments a bit more > extensively than he did. > Looks good to me. > Barring objections, I'll commit and > back-patch this; then we can deal with the other part of this > afterward. > Sure, I will rebase on top of the commit unless you have some comments on the remaining part. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Error bundling script file with extension when event trigger commands are used
On Wed, Dec 6, 2017 at 3:36 PM, sanyam jain wrote: > I couldn't understand much from the above thread, but I would like to solve this problem. Not sure that people are much motivated in solving this problem either, as the use-case is very narrow. Based on my memories of the moment where I looked at this problem, I was thinking that you would need additional catalog functions able to switch a session-level flag which completely disables event triggers during the execution of those commands... So it would mean that such a fix cannot be back-patched. If you would like to design or implement a patch, of course feel free! One hacky way to solve this problem would be to create the event triggers outside of the extension creation, meaning that the object dependencies would need to be tracked by yourself. -- Michael
Re: [HACKERS] Runtime Partition Pruning
On 2 December 2017 at 08:04, Robert Haas wrote:
> On Fri, Dec 1, 2017 at 6:20 AM, Beena Emerson wrote:
>> David Q1:
>> postgres=# explain analyse execute ab_q1 (3,3); --const
>>                                        QUERY PLAN
>> ------------------------------------------------------------------------------------------
>>  Append  (cost=0.00..43.90 rows=1 width=8) (actual time=0.006..0.006 rows=0 loops=1)
>>    ->  Seq Scan on ab_a3_b3  (cost=0.00..43.90 rows=1 width=8) (actual time=0.005..0.005 rows=0 loops=1)
>>          Filter: ((a = 3) AND (b = 3))
>>  Planning time: 0.588 ms
>>  Execution time: 0.043 ms
>> (5 rows)
>
> I think the EXPLAIN ANALYZE output should show something attached to the Append node so that we can tell that partition pruning is in use. I'm not sure if that is as simple as "Run-Time Partition Pruning: Yes" or if we can give a few more useful details.

It already does. Any subnode with "(never executed)" was pruned at runtime. Do we really need anything else to tell us that? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
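To make that concrete with Beena's prepared statement from the quoted text (a sketch; by default the plancache only switches to a generic plan after the first five executions, which is when run-time pruning comes into play):

    -- once a generic plan is in use:
    EXPLAIN ANALYZE EXECUTE ab_q1 (3, 3);
    -- partitions pruned at execution time keep their subnodes in the output,
    -- but those subnodes are reported as "(never executed)"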