Re: Question on corruption (PostgreSQL 9.6.1)
On Thu, Mar 15, 2018 at 8:16 AM, Andy Halsall wrote: > Thanks for the advice. I re-indexed and reloaded a pg_dumpall into a spare > server - no errors. Will run pg_catcheck asap. You can also run amcheck. Get the version targeting earlier Postgres releases off Github (there are packages for most Linux systems). This can verify that the heap is consistent with indexes. -- Peter Geoghegan
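For reference, a minimal invocation against a single index looks something like the following ('my_index' is only a placeholder, and the extension may install under the name amcheck_next with the Github packaging):

CREATE EXTENSION amcheck;
SELECT bt_index_check('my_index'::regclass);
-- newer amcheck versions also accept a heapallindexed argument, which
-- additionally verifies that every indexed heap tuple has a matching index entry:
-- SELECT bt_index_check('my_index'::regclass, heapallindexed => true);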
Re: found xmin from before relfrozenxid on pg_catalog.pg_authid
On Mon, Mar 19, 2018 at 1:01 PM, Jeremy Finzel wrote: > SELECT heap_page_items(get_raw_page('pg_authid', 7)); Can you post this? SELECT * FROM page_header(get_raw_page('pg_authid', 7)); -- Peter Geoghegan
Re: found xmin from before relfrozenxid on pg_catalog.pg_authid
On Mon, Mar 19, 2018 at 1:55 PM, Jeremy Finzel wrote:
> @Peter :
>
> staging=# SELECT * FROM page_header(get_raw_page('pg_authid', 7));
>       lsn       | checksum | flags | lower | upper | special | pagesize | version | prune_xid
> ----------------+----------+-------+-------+-------+---------+----------+---------+-----------
>  262B4/10FDC478 |        0 |     1 |   304 |  2224 |    8192 |     8192 |       4 |         0
> (1 row)
Thanks. That looks normal. I wonder if the contents of that page look consistent with the rest of the table following manual inspection, though. I recently saw system catalog corruption on a 9.5 instance where an entirely different relation's page ended up in pg_attribute and pg_depend. They were actually pristine index pages from an application index. I still have no idea why this happened. This is very much a guess, but it can't hurt to check whether the contents of the tuples themselves are actually sane by inspecting them with "SELECT * FROM pg_authid". heap_page_items() doesn't actually care about the shape of the tuples in the page, so this might have been missed. -- Peter Geoghegan
Re: found xmin from before relfrozenxid on pg_catalog.pg_authid
On Wed, Mar 21, 2018 at 1:38 PM, Jeremy Finzel wrote: > A server restart and upgrade to 9.5.12 (at the same time), as expected, made > the issue go away. Still doesn't give us any answers as to what happened or > if it would happen again! Thanks for the feedback. You may still want to use amcheck to look for problems. The version on Github works with 9.5, and there are Redhat and Debian pgdg packages. See: https://github.com/petergeoghegan/amcheck The "heapallindexed" option will be of particular interest to you - that option verifies that the table has matching rows for a target index (in addition to testing the structure of a target B-Tree index itself). This is probably the best general test for corruption that is available. There is a fair chance that this will reveal new information. -- Peter Geoghegan
Re: found xmin from before relfrozenxid on pg_catalog.pg_authid
On Thu, Mar 22, 2018 at 12:27 PM, Jeremy Finzel wrote: > Thank you for the recommendation. I ran both amcheck functions on all 4 > indexes of those 2 tables with heapallindexed = true, but no issues were > found. Probably wouldn't hurt to run it against all indexes, if you can make time for that. You can generalize from the example query that calls the bt_index_check() function: set "heapallindexed=>i.indisprimary" and remove "n.nspname = 'pg_catalog'", as well as "LIMIT 10". This will test tables and indexes from all schemas, which might be interesting. -- Peter Geoghegan
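A rough sketch of the modified query, adapted from the example in the amcheck documentation (worth double-checking against the docs for the version you have installed):

SELECT bt_index_check(index => c.oid, heapallindexed => i.indisprimary),
       c.relname, c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree' AND c.relpersistence != 't'
AND c.relkind = 'i' AND i.indisready AND i.indisvalid
ORDER BY c.relpages DESC;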
Re: found xmin from before relfrozenxid on pg_catalog.pg_authid
On Thu, Mar 22, 2018 at 2:24 PM, Jeremy Finzel wrote: > I am running this on a san snapshot of our production system. I assume that > this will give me a valid check for file-system-level corruption. I am > going to kick it off and see if I find anything interesting. It might. Note that SAN snapshots might have corruption hazards, though. Postgres expects crash consistency across all filesystems, so you might run into trouble if you had a separate filesystem for WAL, for example. I know that LVM snapshots only provide a consistent view of a single logical volume, even though many LVM + Postgres setups will involve multiple logical volumes. This makes it possible for a small inconsistency across logical volumes to corrupt data. I don't know anything about your SAN snapshotting, but this is at least something to consider. -- Peter Geoghegan
Re: Troubleshooting a segfault and instance crash
On Thu, Mar 8, 2018 at 9:40 AM, Blair Boadway wrote: > Mar 7 14:46:35 pgprod2 kernel:postgres[29351]: segfault at 0 ip > 00302f32868a sp 7ffcf1547498 error 4 in > libc-2.12.so[302f20+18a000] > > Mar 7 14:46:35 pgprod2 POSTGRES[21262]: [5] user=,db=,app=client= LOG: > server process (PID 29351) was terminated by signal 11: Segmentation fault > It crashes the database, though it starts again on its own without any > apparent issues. This has happened 3 times in 2 months and each time the > segfault error and memory address is the same. We had a recent report of a segfault on a Redhat-compatible system that seemed like it might originate from within its glibc [1]. Although the versions there didn't exactly match what you have, it's worth considering as a possibility. Maybe you can't install debuginfo packages because you don't yet have the necessary debuginfo repos set up. Just a guess. That is sometimes a required extra step. [1] https://postgr.es/m/7369.1520528...@sss.pgh.pa.us -- Peter Geoghegan
Re: Autovacuum behavior with rapid insert/delete 9.6
On Thu, Mar 29, 2018 at 4:01 PM, Ted Filmore wrote: > What I am really asking to confirm is after describing the situation is it > reasonable to focus on (in the short term) tuning autovacuum to increase > performance or does this not make sense given the workload and I should look > elsewhere? I would look into this suspected 9.5 regression, if that's possible: https://postgr.es/m/CAH2-Wz=sfakvmv1x9jh19ej8am8tzn9f-yecips9hrrrqss...@mail.gmail.com -- Peter Geoghegan
Re: ERROR: found multixact from before relminmxid
On Mon, Apr 9, 2018 at 7:01 AM, Tomas Vondra wrote: > The bigger question is whether this can actually detect the issue. If > it's due to a storage issue, then perhaps yes. But if you only see > multixact issues consistently and nothing else, it might easily be a > PostgreSQL bug (in which case the checksum will be correct). You can also run amcheck. Get the version targeting earlier Postgres releases off Github (there are packages for most Linux systems). This can verify that the heap is consistent with indexes. -- Peter Geoghegan
Re: ERROR: found multixact from before relminmxid
On Mon, Apr 9, 2018 at 5:55 PM, Alexandre Arruda wrote: > I ran amcheck in all index of a table and I only get empty returns. Did you try doing so with the "heapallindexed" option? That's what's really interesting here. -- Peter Geoghegan
Re: ERROR: found multixact from before relminmxid
On Mon, Apr 9, 2018 at 6:56 PM, Alexandre Arruda wrote: > (... and all other indexes returns null too) > > I tried with bt_index_check too. Same results. That's interesting, because it tells me that you have a table that appears to not be corrupt, despite the CLUSTER error. Also, the error itself comes from sanity checking added to MultiXact freezing fairly recently, in commit 699bf7d0. You didn't say anything about regular VACUUM being broken. Do you find that it works without any apparent issue? I have a suspicion that this could be a subtle bug in CLUSTER/freezing. The only heap_freeze_tuple() caller is code used by CLUSTER, so it's not that hard to imagine a MultiXact freezing bug that is peculiar to CLUSTER. Though I haven't thought about it in much detail. -- Peter Geoghegan
Re: ERROR: found multixact from before relminmxid
On Mon, Apr 9, 2018 at 7:53 PM, Andres Freund wrote: > I've not followed this thread. Possible it's the overeager check for pg > upgraded tuples from before 9.3 that Alvaro fixed recently? I was aware of commit 477ad05e, which must be what you're referring to. I don't think that that's what this is, since this error occurs within heap_freeze_tuple() -- it's not the over-enforced HEAP_XMAX_IS_LOCKED_ONLY() error within heap_prepare_freeze_tuple(). And, because this database wasn't pg_upgraded. I should wait until tomorrow before doing any further analysis, though. -- Peter Geoghegan
Re: ERROR: found multixact from before relminmxid
On Tue, Apr 10, 2018 at 4:31 AM, Alexandre Arruda wrote: > Actualy, I first notice the problem in logs by autovacuum: > > 2018-04-10 08:22:15.385 -03 [55477] CONTEXT: automatic vacuum of > table "production.public.fn06t" > 2018-04-10 08:22:16.815 -03 [55477] ERROR: found multixact 68834765 > from before relminmxid 73262006 > > production=# vacuum analyze verbose fn06t; > INFO: vacuuming "public.fn06t" > ERROR: found multixact 76440919 from before relminmxid 122128619 Do you think that CLUSTER was run before regular VACUUM/autovacuum showed this error, though? Have you noticed any data loss? Things look okay when you do your dump + restore, right? The problem, as far as you know, is strictly that CLUSTER + VACUUM refuse to finish/raise these multixactid errors? -- Peter Geoghegan
Re: ERROR: found multixact from before relminmxid
On Tue, Apr 10, 2018 at 7:54 PM, Alexandre Arruda wrote: > pg_control version number:1002 Andres was also asking about his check_rel() function, from https://www.postgresql.org/message-id/20180319181723.ugaf7hfkluqyo...@alap3.anarazel.de. Can you check that out as well, please? You'll need to be able to install the pageinspect contrib module. -- Peter Geoghegan
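For reference, assuming the contrib package for your Postgres version is present on the server, installing it is just a matter of running the following as a superuser in the affected database:

CREATE EXTENSION IF NOT EXISTS pageinspect;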
Re: New website
On Wed, Apr 18, 2018 at 3:03 PM, Adrian Klaver wrote: > I would contact the Webmaster but Contact goes to a big image of an elephant > head. That is also where Downloads, Support and Donate lands. Might have > been a good idea to roll out a demo site for testing first. Will reserve > judgment on the site design until it is functioning. I really don't think it's practical to give everyone a veto on a website design. It took years to launch this website redesign. Honestly, I was beginning to think that it would never happen. Anyway, I did notice something myself, which is that the git logo links to https://git.postgresql.org/gitweb/, which lists many not-very-important git repos. Instead, it should point to the main PostgreSQL repository's gitweb page, which is at https://git.postgresql.org/gitweb/?p=postgresql.git. -- Peter Geoghegan
Re: New website
On Wed, Apr 18, 2018 at 3:40 PM, Adrian Klaver wrote: > And to get to the mailing list archives (arguably one of the most important > links) you have to: > > Click on Community, click on Mailing Lists on the left sidebar, then scroll > to the bottom of the page to find the search box. I look at the mailing list archives as much as anyone else does, and I don't think that it's very important for them to have prominent placement. Even so, the only difference here is the scrolling. It has fundamentally the same structure as before. > I am one of the No votes in the survey. This sounds pretty far from constructive to me, which automatically detracts from what you're saying. -- Peter Geoghegan
Re: found xmin from before relfrozenxid on pg_catalog.pg_authid
On Fri, May 25, 2018 at 1:38 PM, Andres Freund wrote: >> BTW I think the definition of HeapTupleHeaderXminFrozen is seriously >> confusing, by failing to return true if the xmin is numerically >> FrozenXid (which it'll be if the database was pg_upgraded). I wonder >> about this one in HeapTupleSatisfiesMVCC: > > I suggest raising this on -hackers. I agree that it's unfortunate. I wonder if BootstrapTransactionId also needs to be considered here. -- Peter Geoghegan
Re: Syndicating PostgreSQL mailing list to Discourse
On Sun, May 27, 2018 at 6:13 AM, Stephen Frost wrote: > I'm curious what would be different here from what our archives provide. > We could certainly have a single "all lists" archive page but that seems > more likely to be just completely confusing than actually useful at all. Any replacement for our own archives will need to provide access to mail from over 20 years ago to be in any way usable. It's not uncommon to have to go back that far. Personally, I don't buy the idea that the need to use a mailing list rather than a web forum is a notable obstacle for new contributors. PGLister seems pretty slick to me. It has eliminated all of the frustrations that I had. Maybe we need to do a better job when it comes to communicating what the benefits of a mailing list are, though. There are real, practical reasons to prefer a mailing list; that preference isn't just due to ingrained habit. I'm pleased that there has been a drive to modernize some of the community's infrastructure in recent years, but only because those changes turned out to be unalloyed improvements (at least in my view). Besides, while mailing lists may seem antiquated to a lot of people, aren't web forums almost as antiquated? Sites like Stack Overflow are very clearly not designed to work as discussion forums. They do not allow subjective questions, and it's common for moderators to swiftly delete new threads. Stack Overflow is popular because it provides the fastest possible access to a crowdsourced answer, without requiring or even encouraging participation. -- Peter Geoghegan
Re: Code of Conduct plan
On Mon, Jun 4, 2018 at 11:41 AM, Jason Petersen wrote: > Ultimately, the important thing this CoC provides is some concrete language > to point at when a party is aggrieved and explicit avenues of redress > available when one refuses to address one’s own behavior. We’re adults here, > the strawmen of people being harangued out of the community because they > said a bad word are unlikely to materialize. > > +1 This seems like a good summary of the purpose of the CoC. It is of course possible that a member of the committee could act in bad faith for any number of reasons. You can say the same thing about any position of leadership or authority within the community, though. That hasn't really been much of a problem in my experience, and I see no reason for particular concern about it here. -- Peter Geoghegan
Re: Code of Conduct plan
On Tue, Jun 5, 2018 at 8:49 AM, Benjamin Scherrey wrote: > I thought the same thing as a member of the Django community. It adopted a > CoC that I vocally warned was dangerous and far more likely to be abused > than provide any benefit. I was shocked when the very first time it was ever > invoked it was by one of the founders of the project (whom I previously > personally respected) and it was absolutely used in the manner that I had > feared which was to shut someone up whose opinion he did not like rather > than any legitimate concern. Unfortunately this is not such an unusual > circumstance as one might hope in these projects or conferences. It is > impossible to separate the concept of political correctness from these CoCs > I find and they are much more dangerous things than they appear. We should > tread with extreme cautious about adopting such a thing. It's impossible for me to know what really happened in that situation, but it doesn't seem like the CoC was likely to have been much of a factor in any telling. If this individual was in a position of influence and decided to act maliciously, they would no doubt have found another way to do so in the absence of a CoC. On the other hand, it's easy to imagine a newer non-influential community member finding no recourse against abusive behavior because that isn't explicitly provided for; they might simply not know where to start, and become totally discouraged. Nobody is claiming that the CoC is perfect, or that it can anticipate every situation; it's just a framework for handling disputes about abusive and/or antisocial behavior. The core team have had exclusive responsibility for "Handling disciplinary issues" as part of their charter, at least until now. You can make exactly the same slippery slope argument against that. -- Peter Geoghegan
Re: Code of Conduct plan
On Tue, Jun 5, 2018 at 9:51 AM, James Keener wrote: > To be honest, this is a bigger problem. Why would someone not feel > comfortable contacting the core team? Why would they feel better contacting > the CoC board who is probably mostly core team or otherwise self-selected > community members who have a strong belief in the CoC (and I don't mean that > kindly)? The CoC states that the committee's members cannot come from the core team. -- Peter Geoghegan
Re: Code of Conduct plan
On Tue, Jun 5, 2018 at 10:08 AM, Tom Lane wrote: > So publishing a formal CoC at all is mainly meant to deal with weak > points 1 and 2, and then the details of the process are there to try > to fix point 3. > > Yeah, managing the committee is a lot of overhead that in an ideal > world we wouldn't need, but I think we have to accept it to have a > process people will have confidence in. It's worth pointing out that the community has grown considerably in the last ten years. I assume that adding a bit of process to deal with these kinds of disputes is related to that. We have a pretty good track record through totally informal standards for behavior. Setting a good example is absolutely essential. While that's still the most important thing, it doesn't seem particularly scalable on its own. -- Peter Geoghegan
Re: Code of Conduct plan
On Tue, Jun 5, 2018 at 12:20 PM, Benjamin Scherrey wrote: > I keep hearing this claim. I've followed up and tried to verify them. Sorry > but "trust me" doesn't cut it here any more than "trust me this will make > Postgres go faster" would on a code change. What's the context for this? > What evidence do we have that indicates this CoC would have likely resulted > in a different outcome? Without that then your claim does not even rise up > to the standard of theoretical. Frankly this claim does not seem very > plausible to me at all. Let's try to keep our standards here. Whose standards are these? By my count, the majority of e-mails you've ever sent to a PostgreSQL mailing list have been sent in the last 2 days, to this code of conduct thread. -- Peter Geoghegan
Re: Code of Conduct plan
On Tue, Jun 5, 2018 at 2:06 PM, Sven R. Kunze wrote: > 1) CoC might result in developers leaving projects > http://lists.llvm.org/pipermail/llvm-dev/2018-May/122922.html This guy left LLVM for several reasons. The pertinent reason for us was that he had to agree to a code of conduct in order to attend conferences, which he found to be unacceptable. He did not have to agree that the idea of a code of conduct was a good one, though. It would have been perfectly possible for him to be opposed in principle to the idea of a CoC, while also formally agreeing to it and attending those conferences. I gather that his objections were around questions of unintended consequences, the role of a certain authority to assess violations of the CoC, and so on (I surmise that he was not actually opposed to or constrained by any of the specific rules around content in technical presentations and so on). I for one accept that these may have been reasonable concerns, even though I don't really agree, since the LLVM CoC seems quite reasonable. Anybody that participates in an open source community soon learns that their opinion on almost any matter may not be the one that prevails. There are often differences of opinion on -hackers that seem to fundamentally be down to a difference in values. We still manage to make it work, somehow. > 2) CoC might result in not so equal peers and friends, might result in a > committee which feels above their peers, and might promote conceit and > denunciation. I think that having a code of conduct is better than not having one, and I think that the one that we came up with is appropriate and proportionate. We could speculate all day about specific unintended consequences that may or may not follow. That doesn't seem very constructive, though. Besides, the time for that has passed. > In related discussions, people recurringly ask not to establish a secondary > judicial system but to use the already existing ones. I don't follow. Practically any organized group has rules around conduct, with varying degrees of formality, means of enforcement, etc. Naturally, the rules across disparate groups vary widely for all kinds of reasons. Formalizing and being more transparent about how this works seems like the opposite of paternalism to me. -- Peter Geoghegan
Re: Replica string comparsion issue
On Wed, Jun 13, 2018 at 1:10 PM, Andrey Lizenko wrote: > I'm observing strange behaviour on comparing ::text field with string while > querying replica. The first thing to do here is to install amcheck on the replica, and verify that the indexes are consistent with the replica's own notion of how text sorts for the collation that is implicitly in use (the database default): https://www.postgresql.org/docs/current/static/amcheck.html You can modify the example query to check the indexes that you're interested in. I think that there is a very strong chance that the replica has incompatible collation rules, given that it uses a totally different OS. -- Peter Geoghegan
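A minimal sketch of that check on the replica ('idx_on_text_col' is only a placeholder for whichever index backs these comparisons):

CREATE EXTENSION amcheck;
SELECT bt_index_check('idx_on_text_col'::regclass);

If the replica's collation rules really are incompatible with the ones the index was built under, this will likely report an index ordering violation.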
Re: What to do when dynamic shared memory control segment is corrupt
On Mon, Jun 18, 2018 at 1:03 PM, Tom Lane wrote: > Hm, I supposed that Sherrylyn would've noticed any PANIC entries in > the log. The TRAP message from an assertion failure could've escaped > notice though, even assuming that her logging setup captured it. Unhandled C++ exceptions end up calling std::abort(). I've seen bugs in modules like PL/V8 that were caused by this. The symptom was a mysterious message in the logs about SIGABRT. Perhaps that's what happened here? What extensions are installed, if any? -- Peter Geoghegan
Re: could not read block 0 in file : read only 0 of 8192 bytes when doing nasty on immutable index function
On Wed, Jun 27, 2018 at 11:35 AM, Luca Ferrari wrote: > If I then disconnect and reconnect I'm able to issue the select and > get back the results. But if I issue a reindex I got the same error > and the table "becames unreadable" for the whole session. > On 10.3 the table is never locked for the session, that is I can > create the index, I can query the table and get the results, but I > cannot reindex. However, even after a reindex, it does allow me to > select data from the table. > > So my question is: why this behavior in later PostgreSQL? It might have something to do with the changes to parallel CREATE INDEX. It changed how we tracked whether or not an index could be used because it was currently undergoing reindexing. This is supposed to make no difference at all, but there was one bug that could cause us to consider an index irrevocably unusable. Do you find that the issue goes away if you set max_parallel_maintenance_workers=0 on v11/master? -- Peter Geoghegan
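A minimal way to check that, in the same session/script that reproduces the problem:

SET max_parallel_maintenance_workers = 0;
-- then re-run the CREATE INDEX / REINDEX sequence that triggered the error

If the table no longer becomes unreadable with parallelism disabled, that points at the parallel CREATE INDEX changes.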
Re: could not read block 0 in file : read only 0 of 8192 bytes when doing nasty on immutable index function
On Thu, Jun 28, 2018 at 8:02 AM, Andres Freund wrote: > Peter, looks like you might be involved specifically. Seems that way. > This however seems wrong. Clearly the relation's index list is out of > date. > > I believe this happens because there's currently no relcache > invalidation registered for the main relation, until *after* the index > is built. Normally it'd be the CacheInvalidateRelcacheByTuple(tuple) in > index_update_stats(), which is called at the bottom of index_build(). > But we never get there, because of the earlier error. That's bad, because > any relcache entry built *after* the CommandCounterIncrement() in > index_create() will now be outdated. > > In the olden days we most of the time didn't build a relcache entry > until after the index was built - but plan_create_index_workers() now > does. I suspect there are other ways to trigger that earlier, too. Note that there is a kludge within plan_create_index_workers() that has us treat the heap relation as an inheritance parent, just to get a RelOptInfo for the heap relation without running into similar trouble with the index in get_relation_info(). Perhaps there's an argument to be made for refactoring plan_create_index_workers() as a fix for this. > Putting in a CacheInvalidateRelcache(heapRelation); before the CCI in > index_create() indeed makes the "borked relcache" problem go away. > > > I wonder why we don't just generally trigger invalidations to an > index's "owning" relation in CacheInvalidateHeapTuple()? I don't know, but that seems like a good question. -- Peter Geoghegan
Re: could not read block 0 in file : read only 0 of 8192 bytes when doing nasty on immutable index function
On Mon, Jul 9, 2018 at 11:32 AM, Andres Freund wrote: >> Note that there is a kludge within plan_create_index_workers() that >> has us treat the heap relation as an inheritance parent, just to get a >> RelOptInfo for the heap relation without running into similar trouble >> with the index in get_relation_info(). Perhaps there's an argument to >> be made for refactoring plan_create_index_workers() as a fix for this. > > Maybe I'm missing something, but what has this got to do with the issue > at hand? Nothing. It might be worthwhile to find a way to not do that as part of fixing this issue, though. Just a suggestion. > I assume we'll have to backpatch this issue, so I think it'd probably a > good idea to put a specific CacheInvalidateHeapTuple() in there > explicitly in the back branches, and do the larger fix in 12. ISTM > there's some risks that it'd cause issues. Will you tackle this? Okay. -- Peter Geoghegan
Re: sorting/comparing column values in non-alphanumeric sorting ways ?
On Wed, Jul 11, 2018 at 2:44 PM, David Gauthier wrote: > I want to load the 'highestver' column with the highest version of tcfg1-3. > > This won't work... > update tv set greatest = greatest(tcfg1,tcfg2,tcfg3) > ...because it thinks 1.0.9 is greater than 1.0.10 > > Is there a way to get this to work right ? If you're using v10 with ICU, then you can create a custom ICU collation for this, with "natural" sort order. Something like this should work: CREATE COLLATION numeric (provider = icu, locale = 'en-u-kn-true'); See the docs -- "23.2.2.3.2. ICU collations". -- Peter Geoghegan
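A minimal sketch of how it might be used, assuming tcfg1-3 are text/varchar columns (untested):

UPDATE tv SET highestver = greatest(tcfg1 COLLATE "numeric",
                                    tcfg2 COLLATE "numeric",
                                    tcfg3 COLLATE "numeric");

With the numeric-ordering ('kn') keyword in the collation, '1.0.10' compares higher than '1.0.9'.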
Re:
On Thu, Jul 19, 2018 at 11:43 AM, Torsten Förtsch wrote: > is there a way to find if a certain page in a data file is referenced by a > btree index? > > I found a few completely empty pages in one of my tables. I am not sure if > that's corruption or just bloat. Now I am thinking I could use an index, the > PK for instance, and see if it references these pages. It's probably not serious, but you may want to try amcheck's heapallindexed check. You'll have to use the non-contrib packages for that right now, but those are available from the PGDG repos. -- Peter Geoghegan
Re: could not read block 0 in file : read only 0 of 8192 bytes when doing nasty on immutable index function
On Wed, Jul 25, 2018 at 4:03 PM, Andres Freund wrote: > Peter, given that your patch made this more likely, and that you're a > committer these days, I'm opening an open items entry, and assign it to > you. Does that sound ok? I intend to follow through on this soon. I have been distracted by project work. I accept responsibility for the open item, though. -- Peter Geoghegan
Re: Restore relhaspkey in PostgreSQL Version 11 Beta
On Tue, Jul 31, 2018 at 7:47 AM, Melvin Davidson wrote: > I was hoping that at least one other person would see my point of view, but > by the > harsh replies I've been getting, I feel more like a whistle blower that > insiders > think I also should be made to "go away". You were bellicose from almost the beginning of this thread. And, yes, that does detract from your argument. Just as it would in almost any other sphere or arena. > Well, you are right. This old Viet Vet shall now end this conversation and > his career. > I just need a way to do so quietly and painlessly. > The truth is absolute and cannot be changed. > Perception is not the truth. > Flerp! I cannot imagine what reaction you were expecting to this. In all sincerity, I suggest reflecting on your words. You don't seem to have realistic expectations about how the community works, or could ever work. -- Peter Geoghegan
Re: could not read block 0 in file : read only 0 of 8192 bytes when doing nasty on immutable index function
On Mon, Jul 9, 2018 at 11:32 AM, Andres Freund wrote: > I assume we'll have to backpatch this issue, so I think it'd probably be a > good idea to put a specific CacheInvalidateHeapTuple() in there > explicitly in the back branches, and do the larger fix in 12. ISTM > there's some risks that it'd cause issues. What do you think of the attached? There is a new CacheInvalidateRelcache() call, rather than a new call to CacheInvalidateRelcacheByTuple(), but those two things are equivalent (I assume that you actually meant to say CacheInvalidateRelcacheByTuple(), not CacheInvalidateHeapTuple()). Since nobody seems to be that excited about the CacheInvalidateHeapTuple() idea, I haven't pursued it. -- Peter Geoghegan 0001-Add-table-relcache-invalidation-to-index-builds.patch Description: Binary data
Re: could not read block 0 in file : read only 0 of 8192 bytes when doing nasty on immutable index function
On Tue, Jul 31, 2018 at 7:02 PM, Andres Freund wrote: > Maybe expand a bit on this by saying that it's more likely "because > plan_create_index_workers() triggers a relcache entry to be (re-)built, > which previously did only happen in edge cases" or such? Okay. > Not a fan of this comment. It doesn't really explain that well why it's > needed here, but then goes on to a relatively general explanation of why > cache invalidation is necessary. Why not just go for something like > "register relcache invalidation on the indexes' heap relation, to > maintain consistency of its index list"? That seems much more generic to me! The comment is supposed to convey that the stuff within index_update_stats() isn't enough because of xact abort specifically. SI invalidation is very much part of the index_update_stats() contract already. > I wonder if it wouldn't be more appropriately placed closer to the > UpdateIndexRelation(), given that that's essentially what necessitates > the relcache flush? That makes sense. I'll do it that way. -- Peter Geoghegan
Re: could not read block 0 in file : read only 0 of 8192 bytes when doing nasty on immutable index function
On Tue, Jul 31, 2018 at 9:00 PM, Andres Freund wrote: > I don't think that's particularly relevant. We should always register an > invalidation before the relevant CommandCounterIncrement(), because that > is what makes catalog changes visible, and therefore requires > registering invalidations for coherency. Fair enough. How about the attached revision? -- Peter Geoghegan v2-0001-Add-table-relcache-invalidation-to-index-builds.patch Description: Binary data
Re: Code of Conduct plan
On Fri, Sep 14, 2018 at 7:19 AM, Joshua D. Drake wrote: > Sure and that is unfortunate but isn't it up to the individual to deal with > it through appropriate channels for whatever platform they are on? All of > these platforms are: > > 1. Voluntary to use > 2. Have their own Terms of Use and complaint departments > 3. If it is abuse there are laws > > I agree that within Postgresql.org we must have a professional code of > conduct but the idea that an arbitrary committee appointed by an unelected > board can decide the fate of a community member based on actions outside of > the community is a bit authoritarian don't you think? The choice of the committee members is hardly arbitrary. Having committee members be appointed by core is more or less consistent with how the community has always dealt with disciplinary issues. The criteria used by core were discussed quite openly. While the risk that the committee will wield their power in an "authoritarian" way seems very small, it cannot be ruled out entirely. In fact, it hasn't been ruled out by the draft CoC itself. No CoC can possibly provide for every conceivable situation. Somebody has to interpret the rules, and it has to be possible to impose sanctions when the CoC is violated -- otherwise, what's the point? There are several checks and balances in place, and I for one have confidence in the process as outlined. It's imperfect, but quite a lot better than either the status quo, or a platitude about inclusivity. -- Peter Geoghegan
Re: Code of Conduct plan
On Fri, Sep 14, 2018 at 10:31 AM, Dimitri Maziuk wrote: > So let me get this straight: you want to have a "sanctioned" way to deny > people access to postgresql community support channel? Yes. > "Because > somebody who may or may not be the same person, allegedly said something > somewhere that some other tweet disagreed with on faceplant"? > > Great plan if you do for-pay postgresql support for the living. You can make your own conclusions about my motivations, just as I'll make my own conclusions about yours. I'm not going to engage with you on either, though. -- Peter Geoghegan
Re: Code of Conduct plan
On Fri, Sep 14, 2018 at 11:06 AM, Dimitri Maziuk wrote: > Personally I would like that. Others might prefer an invitation to > unsubscribe or forever hold their peace, I could live with that too, but > I believe explicit opt-ins are preferable to opt-outs. I think that it's a legitimate position to be opposed to a CoC like this. I also think it's legitimate to feel so strongly about it, on philosophical or political grounds, that you are compelled to avoid participating while subject to the CoC. FWIW, the latter position seems rather extreme to me personally, but I still respect it. In all sincerity, if you're compelled to walk away from participating in mailing list discussions on a point of principle, then I wish you well. That is your right. -- Peter Geoghegan
Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR
On Mon, Jul 8, 2019 at 9:23 AM John Lumby wrote: > Overall, pg-12beta2 yielded a 6.7% reduction in sizes (total pages) of > indexes, which was most noticable with the 6 non-unique ones. > In fact the primary-key index was larger with pg-12. The avg_leaf_density was actually higher for the primary key index, so it looks like it really came out slightly ahead on v12. Perhaps you didn't take deleted_pages into account -- there must be free space that is reusable by the index that has yet to be reused. It would probably make sense to subtract that across the board. > Would you have expected better than 6.7%? I don't think that a test case that runs VACUUM when there are only 4300 deletions and 4300 insertions is particularly realistic, in general. You might see a larger difference if there was more churn between each VACUUM run. > Although a welcome improvement, I think it is not enough to justify stopping > use of setting a lower explicit FILLFACTOR. Which then brings me back to > thinking there is a case for the subject of this thread, an automatic way to > preserve density. I don't think that such an option would make much sense. The "waves of misery" paper is about smoothing out the frequency of page splits following bulk loading and a CREATE INDEX. It is not about making splits occur less often. It's well understood that a certain amount of free space is the overhead of B-Tree indexes, albeit an overhead that can be avoided in certain specific instances. > And one question : > I notice that in some pg-11 release, a new config parameter appeared : > vacuum_cleanup_index_scale_factor > I have not researched this at all and nor did I set it to anything for my > pg-12beta2 run, but it sounds as though maybe it could be relevant to > this kind of workload - Is that so? You seem to be worried about keeping indexes as small as possible. vacuum_cleanup_index_scale_factor won't help with that. -- Peter Geoghegan
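Both of those fields are visible through the pgstattuple extension, e.g. ('my_pkey' is just a placeholder for the index in question):

CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT avg_leaf_density, leaf_pages, deleted_pages, empty_pages
FROM pgstatindex('my_pkey');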
Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR
On Mon, Jul 8, 2019 at 12:10 PM John Lumby wrote: > Actually the test workload does not run any explicit VACUUM command, > it relies on autovacuum with these settings > (same settings for 9.4 and 12beta2) > To correspond to your " more churn between each VACUUM" > Would you then suggest increasing > autovacuum_vacuum_cost_delay and/or autovacuum_vacuum_scale_factor? Well, you're still running autovacuum very aggressively here. It'll easily keep up when run on a relatively small table such as this. BTW, you should definitely run the latest point release of 9.4 -- not 9.4.6. You're missing years of bug fixes by sticking to such an old point release, including some rather nasty ones -- 9.4.23 is the current 9.4 point release. Actually, 9.4 is going to lose support this year, as the oldest stable version that's currently supported by the community. -- Peter Geoghegan
Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR
On Mon, Jul 8, 2019 at 12:19 PM Peter Geoghegan wrote: > Well, you're still running autovacuum very aggressively here. It'll > easily keep up when run on a relatively small table such as this. Also, an exactly equal number of insertions and deletions is rather likely to result in bloated indexes in a way that probably isn't representative of many workloads. Even if the number of insertions was only slightly greater than the number of deletions, then the overall pattern would be one of continual growth, which is generally considered much more interesting. For far far more information on the topic than you want, see the paper "B-Trees with Inserts and Deletes: Why Free-at-Empty Is Better Than Merge-at-Half": https://www.sciencedirect.com/science/article/pii/00229390020W The salient point made by the paper is that good space utilization rests on the assumption that there are fewer deletes than inserts, though maybe only slightly fewer: "The tendency of the utilization to remain near 69% can be explained by the following arguments: If there are even just a few more inserts than deletes, the B-tree will grow at the net insert rate (the rate of inserts minus the rate of deletes)." If the volume of data never grows past a certain point, then it's unlikely that the space utilization is very important. This may even be premature optimization. -- Peter Geoghegan
Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR
On Tue, Jul 9, 2019 at 10:31 AM John Lumby wrote: > Yes, I see that. But surely "making splits occur less often" is a > desirable > objective in itself, is it not? And I believe that a parameter to > preserve the "steady-state" > density in high-traffic indexes would help achieve that goal, wouldn't you > agree? Anything that reliably reduces page splits without hurting space utilization is well worthwhile. I can't see how what you describe could have that effect, though. If you expect the leaf density to be the same after a REINDEX, then why bother at all? There is no reason to think that that will be more effective than simple vacuuming. -- Peter Geoghegan
Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR
On Tue, Jul 9, 2019 at 11:27 AM John Lumby wrote: > And the point of the REINDEX at that point (below) is to remove dead tuple > keys-tids > and reorganize those split pages back into physical order without losing the > freespace. VACUUM already removes the tuples, accounting for all overhead. You are right that it would be possible for us to "defragment" the pages, so that they'd be in sequential order on disk from the point of view of a whole index scan -- this is what the "leaf_fragmentation" statistic from pgstatindex() reports on. We could in principle come up with a way of moving pages around, which would have some modest benefit for certain types of queries (it wouldn't improve the heap/index correlation, though, which is far more important). That would either necessitate that the command acquire a disruptive lock on the index (i.e. no writes, just like regular REINDEX), or that we drastically rearchitect the B-Tree code to make it support this. Neither of which seem particularly appealing. I believe that this is a lot more important in systems that generally use clustered indexes, such as MS SQL Server. This kind of "fragmentation" isn't usually much of a problem when using Postgres. -- Peter Geoghegan
Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR
On Tue, Jul 9, 2019 at 12:29 PM John Lumby wrote: > I was not thinking of a new command, just an extension of the existing > REINDEX > which would apply a fillfactor equal to current average page density, > by adding a preliminary step to sample that first. That would be a very different thing to REINDEX no matter how you spelt it, though. REINDEX creates a new index, from scratch, whereas you're talking about restructuring what's already there. > > I believe that this is a lot more important in systems that generally > > use clustered indexes, such as MS SQL Server. This kind of > > "fragmentation" isn't usually much of a problem when using Postgres. > > > We have found that, for an index which has both experienced large number of > page splits > and whose table has a large number of dead tuples (despite autovacuum), > REINDEX with FILLFACTOR set to current page_density does produce a > performance improvement, > and also does reduce future growth in number of pages.I don't have > numbers to > hand, and in fact not sure if any catalog view or pgstattuple tells me about > the proportion > of dead key-tids in the index itself (do you know of any source?) as opposed > to the table, > but based on that recollection, yes, REINDEX can reduce fragmentation. This could help the old "getting tired" behavior with many duplicates, by making the free space available in earlier leaf pages (those further to the left) that are full of duplicates -- the original average space utilization may reflect a very uneven distribution of free space overall. Or, it could be that range scan performance benefitted from reduced fragmentation, because your workload happened to be bottlenecked on large range scans. Though that seems unlikely. I believe that the effect that you identified is real, but at a minimum it's not clear why a REINDEX with a fillfactor to match the original leaf space utilization helped. It would be fairly difficult to figure it out for sure. If it was a problem with duplicates/"getting tired", then I'd expect the new v12 code will help a lot. > However we did not run a VACUUM command first. Maybe if we had run VACUUM > instead of > the REINDEX commands, we might have obtained the same degree of > improvement, I don't know. > I think this was Tom's point earlier on in this thread. It was. Tom's intuition about that matches my own, though I acknowledge that the old behavior with duplicates muddies the waters. > Correct me if I'm wrong but I believe whether an index is "clustered" or not > is not relevant for > this discussion because the clustering in that context is referring to > ordering of the > table pages, not the index pages. Right. > I believe it is quite possible to have a perfectly > "clustered" table whose clustering index is itself badly disorganized. Technically the two things are separate metrics, so that is theoretically possible, but it doesn't seem all that likely. It could happen with lots of non-HOT updates, where all new index tuples relate to the same logical row as some existing index tuple, causing many page splits despite there being no real change in the logical contents of the index. Even then, the table will itself lose much of its original order, so the index will become "unclustered" as it becomes fragmented. -- Peter Geoghegan
Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR
On Tue, Jul 9, 2019 at 3:18 PM John Lumby wrote: > > Or, it could be that range scan performance benefitted from reduced > > fragmentation, > > > > Yes, I think so. ISTM that the simplest explanation here is that index fragmentation (and even index size) is a red herring, and the real issue is that you're suffering from problems similar to those that are described in these old threads: https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com https://www.postgresql.org/message-id/flat/520D6610.8040907%40emulex.com There have been numerous reports from users with problems involving low cardinality indexes that gradually became less correlated with the underlying table over time. At least a couple of these users found that a periodic REINDEX temporarily fixed the problem -- see the first thread for an example. Postgres 12 maintains the heap/table sort order among duplicates by treating heap TID as a tiebreaker column, which may make REINDEXing totally unnecessary for you. It's harder to model this issue because the problem with heap TID order will only be seen when there is at least a moderate amount of churn. -- Peter Geoghegan
Re: REINDEX : new parameter to preserve current average leaf density as new implicit FILLFACTOR
On Tue, Jul 9, 2019 at 9:04 PM Peter Geoghegan wrote: > ISTM that the simplest explanation here is that index fragmentation > (and even index size) is a red herring, and the real issue is that > you're suffering from problems similar to those that are described in > these old threads: > > https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com > https://www.postgresql.org/message-id/flat/520D6610.8040907%40emulex.com I can imagine why you found you needed to reduce fillfactor to get much of any benefit from a REINDEX. Page splits are inherently expensive, for one thing. Also, in this specific scenario a succession of page splits might hasten the index returning to having little correlation with the underlying table within each large group of duplicates. Splits on fillfactor 90 pages would make new space available for future insertions on earlier duplicate pages, mixing old and new rows together before long. -- Peter Geoghegan
Re: after restore the size of the database is increased
On Mon, Jul 15, 2019 at 6:22 AM Luca Ferrari wrote: > What am I missing here? Sometimes B-Tree indexes can be *larger* after a REINDEX (or after they're recreated with a CREATE INDEX). It's not that common, but it does happen. There isn't actually a very large size difference here, so it seems worth comparing index size in detail. -- Peter Geoghegan
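One way to make that comparison, run against both the original database and the restored copy (nothing here is specific to your schema):

SELECT schemaname, relname, indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 20;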
Re: Corrupt index stopping autovacuum system wide
On Wed, Jul 17, 2019 at 8:54 AM Aaron Pelz wrote: > To me it looks like a *single* corrupt index held up autovacuums across our > entire server, even other in other databases on the same server. Am I > interpreting this correctly? Yes -- that is correct. What PostgreSQL version are you on? Was this an INCLUDE index on PostgreSQL 11? > Would love guidance on diagnosing this type of thing and strategies for > preventing it. It's possible that amcheck would have given you an accurate diagnosis of the problem -- especially if you used bt_index_parent_check(): https://www.postgresql.org/docs/current/amcheck.html -- Peter Geoghegan
Re: Corrupt index stopping autovacuum system wide
On Wed, Jul 17, 2019 at 9:57 AM Tom Lane wrote: > It looks that way, but how would a broken non-shared index have held up > autovacuuming in other databases? Maybe, as this one's xmin horizon > got further and further behind, the launcher eventually stopped > considering launching workers into any other databases? That seems > like a bad thing; it's postponing work that will need to be done > eventually. I don't know exactly how the launcher would behave offhand, but it's clear that not being able to VACUUM one table in one database (because it has a corrupt index) ultimately risks the availability of every database in the cluster. Many installations receive little to no supervision, so it may just be a matter of time there. That is certainly a bad thing. -- Peter Geoghegan
Re: Corrupt index stopping autovacuum system wide
On Wed, Jul 17, 2019 at 9:21 AM Peter Geoghegan wrote: > It's possible that amcheck would have given you an accurate diagnosis > of the problem -- especially if you used bt_index_parent_check(): > > https://www.postgresql.org/docs/current/amcheck.html BTW, be sure to use the 'heapallindexed' option with bt_index_parent_check() to detect missing downlinks, which is exactly the problem that VACUUM complained about. Hopefully the problem will be limited to the single index that you've already REINDEXed. In theory the same problem could be hiding in other indexes, though I don't consider that particularly likely. Note that bt_index_parent_check() requires a lock on tables that effectively blocks writes, but not reads, so verification may require planning or coordination. bt_index_check() doesn't have any of these problems, but also won't detect missing downlinks specifically. -- Peter Geoghegan
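A sketch of the kind of invocation meant here ('my_index' is just a placeholder):

SELECT bt_index_parent_check('my_index'::regclass, heapallindexed => true);

The same call can be repeated against any other index you want to rule out, subject to the locking caveat above.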
Re: Corrupt index stopping autovacuum system wide
On Wed, Jul 17, 2019 at 10:27 AM Tom Lane wrote: > Right, you're eventually going to get to a forced shutdown if vacuum never > succeeds on one table; no question that that's bad. It occurs to me that we use operator class/insertion scankey comparisons within page deletion, to relocate a leaf page that looks like a candidate for deletion. Despite this, README.hot claims: "Standard vacuuming scans the indexes to ensure all such index entries are removed, amortizing the index scan cost across as many dead tuples as possible; this approach does not scale down well to the case of reclaiming just a few tuples. In principle one could recompute the index keys and do standard index searches to find the index entries, but this is risky in the presence of possibly-buggy user-defined functions in functional indexes. An allegedly immutable function that in fact is not immutable might prevent us from re-finding an index entry" That probably wasn't the problem in Aaron's case, but it is worth considering as a possibility. > My concern here is > that if we have blinders on to the extent of only processing that one > table or DB, we're unnecessarily allowing bloat to occur in other tables, > and causing that missed vacuuming work to pile up so that there's more of > it to be done once the breakage is cleared. If the DBA doesn't notice the > problem until getting into a forced shutdown, that is going to extend his > outage time --- and, in a really bad worst case, maybe make the difference > between being able to recover at all and not. The comment about "...any db at risk of Xid wraparound..." within do_start_worker() hints at such a problem. Maybe nbtree VACUUM should do something more aggressive than give up when there is a "failed to re-find parent key" or similar condition. Perhaps it would make more sense to make the index inactive (for some value of "inactive") instead of just complaining. That might be the least worst option, all things considered. -- Peter Geoghegan
Re: Corrupt index stopping autovacuum system wide
On Wed, Jul 17, 2019 at 11:43 AM Alvaro Herrera wrote: > This might make things worse operationally, though. If searches aren't > failing but vacuum is, we'd break a production system that currently > works. If searches aren't failing and VACUUM works, then that's probably down to dumb luck. The user's luck could change at any time (actually, it's quite possible that the index is already giving wrong answers without anybody realizing). That's not always true, of course -- you could have an OOM condition in VACUUM, where it really does make sense to retry. But it should be true for the category of errors where we behave more aggressively than just giving up, such as "failed to re-find parent key" error Aaron noticed. > Well, vacuum knows what index is being processed. Maybe you're thinking > that autovac can get an out-of-memory condition or something like that; > perhaps we can limit the above only when an ERRCODE_DATA_CORRUPTED > condition is reported (and make sure all such conditions do that. As > far as I remember we have a patch for this particular error to be > reported as such.) I don't think that it would be that hard to identify errors that nbtree VACUUM could throw that clearly indicate corruption, without any hope of the problem self-correcting without the DBA running a REINDEX. There will be a small amount of gray area, perhaps, but probably not enough to matter. > > (c) automatically disabling constraint indexes seems less than desirable. > > Disabling them for writes, yeah. I think that it's fair to say that all bets are off once you see the "failed to re-find parent key" error, or any other such error that indicates corruption. Admittedly it isn't 100% clear that disabling constraint enforcement to unblock autovacuum for the whole cluster is better than any available alternative; it's really hard to reason about things when we already know that the database has corruption. I think that it's okay that almost anything can break when somebody creates an index on a non-immutable expression (including VACUUM), provided that all the problems only affect the table with the broken index. OTOH, taking down the entire Postgres cluster as an indirect consequence of one person's ill-considered CREATE INDEX really sucks. That distinction seems important to me. -- Peter Geoghegan
Re: Corrupt index stopping autovacuum system wide
On Wed, Jul 17, 2019 at 10:27 AM Peter Geoghegan wrote: > > It's possible that amcheck would have given you an accurate diagnosis > > of the problem -- especially if you used bt_index_parent_check(): > > > > https://www.postgresql.org/docs/current/amcheck.html > > BTW, be sure to use the 'heapallindexed' option with > bt_index_parent_check() to detect missing downlinks, which is exactly > the problem that VACUUM complained about. Can you tell us more about this index? Can you share its definition (i.e. what does \d show in psql)? Is it an expression index, or a partial index? A composite? What datatypes are indexed? Thanks -- Peter Geoghegan
Re: Corrupt index stopping autovacuum system wide
On Thu, Jul 18, 2019 at 9:06 AM Aaron Pelz wrote: > It's a simple btree expression on a geometry(Point,4326) , no expression no > partial no composite. The cause of the corruption may be a bug in a Postgis B-Tree operator class. I reported a bug in the Geography type that could lead to corrupt B-Tree indexes (not Geometry): https://trac.osgeo.org/postgis/ticket/3841 Though I also see what could be a comparable bug in Geometry: https://trac.osgeo.org/postgis/ticket/3777 These bugs are from about 3 years ago. If I'm right you should be able to isolate the bug using amcheck. -- Peter Geoghegan
Re: Question from someone who is not trained in computer sciences
On Thu, Sep 5, 2019 at 2:00 PM Judith Lacoste wrote: > I think PostgreSQL is the solution for my needs, but I am not a > programmer/coder. I don't think that it's restricted to people that are computer scientists. At least, I certainly hope it isn't. SQL was originally supposed to be something that is usable by domain experts/analysts, rather than by computer people (that was at a time when the divide was far larger than it is today). > I plan to install the database on a server in the office. Me and my four > colleagues will occasionally connect to this database when we are working in > other locations (usually hospitals or universities). In such remote > locations, we often do not have internet/network, yet we still need to access > the database. Currently, we use a system where a copy of the database lives > on each of our laptops. We can access all the information in the database > despite being offline. This local copy of the database is synchronized with > the server once network becomes available again. > > My question is whether or not such set up is possible with PostgreSQL? Since you're a biologist, you may like to play around with the Mouse Genome database using PostgreSQL: http://www.informatics.jax.org/downloads/database_backups/ Any supported version of PostgreSQL will work. You'll need to use pg_restore to restore the databases. Something like this will do it: pg_restore -d mgd /path/to/mgd.postgres.dump (I'm not sure what operating system you'll use -- something similar to this invocation ought to work on Windows through cmd.exe, though.) From there, you can play around with the database using a GUI tool such as pgAdmin. I sometimes use this database to test certain things, since it's the only example of a living, breathing PostgreSQL database that you can just download that I am aware of. Its schema is probably not an exemplar of good design, but it does seem reasonably well thought out. I'm not a domain expert, though, so I can't really be sure how good it is. The nice thing about this approach is that you can figure it out using a "top down" approach, by first understanding how the database is used in practical terms, and then filling in the details of how the application that it backs uses the database. Last I checked, restoring this database will take about 30GB of disk space on top of the dump file itself. -- Peter Geoghegan
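A minimal sketch of the full sequence from a command prompt, assuming the PostgreSQL command line tools are on your PATH (the target database has to exist before pg_restore can load into it):

createdb mgd
pg_restore -d mgd /path/to/mgd.postgres.dump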
Re: Whan is it safe to mark a function PARALLEL SAFE?
On Sun, Sep 8, 2019 at 12:27 PM Tom Lane wrote: > > If an error is raised in one parallel worker, does this > > cause the other parallel workers to be immediately terminated? > > I think not, though I didn't work on that code. The error will > be reported to the parent backend, which will cause it to fail > the query ... but I think it just waits for the other worker > children to exit first. That's not something to rely on of course. > Even if we don't make an attempt to cancel the other workers today > we probably will in future. But the cancel attempt would certainly > be asynchronous, so I'm not sure how "immediate" you are worried > about it being. If workers call CHECK_FOR_INTERRUPTS() frequently, which they should, then it should appear to users as if raising an error in one worker kills everything almost immediately, or immediately. For example, if a parallel CREATE INDEX has a worker that raises a unique violation error, that must work in a way that at least *appears* to be very similar to what the user would get with a serial CREATE INDEX. (The worst that can happen is that they'll very occasionally get two unique violation errors instead of one, or something like that.) That said, there are theoretical failures where it could take rather a long time for the parent/leader to get the memo -- see WaitForParallelWorkersToAttach() and its caller (note that anything using a Gather node is subject to the same kind of failure that WaitForParallelWorkersToAttach() handles, even though it won't call the function itself). These failures (e.g. fork() failure) are generally assumed to be rare to non-existent, though. Users will surely be upset if parallel queries cannot be cancelled almost immediately. If it happened with any regularity, somebody would have complained by now. As Tom said, it's hard to give a useful answer without more context -- how do you define "immediate"? -- Peter Geoghegan
Re: PG11 Parallel Thanks!!
On Wed, Oct 2, 2019 at 8:41 AM Jason Ralph wrote: > Since pg11 on both the target and source, the run time has decreased a lot, I > chalk it up to the parallel index creations in pg11 which was a very time > consuming process on pg9.3. > The process has finished almost 10 hours earlier than pg93. So thank you for > your hard work and dedication to this awesome piece of software. How long did it take on 9.3? I am the author of the parallel CREATE INDEX feature. It's good to get feedback like this. -- Peter Geoghegan
Re: Clarification on the release notes of postgresql 12 regarding pg_upgrade
On Fri, Oct 4, 2019 at 9:09 AM Tom Lane wrote: > > You can't REINDEX safely regarding that note. > > Actually running into that problem is quite unlikely; and if you did > hit it, it'd just mean that the REINDEX fails, not that you have any > urgent problem to fix. I'd encourage you to just go ahead and REINDEX, > if you have indexes that could benefit from the other changes. Right. It is hard to imagine an application that evolved to fully rely on the previous slightly higher limit, and cannot tolerate a reduction in the limit by only 8 bytes. The limit applies to a tuple *after* TOAST compression has been applied. -- Peter Geoghegan
Re: PG11 Parallel Thanks!!
On Thu, Oct 3, 2019 at 10:31 AM Jason Ralph wrote: > The end of month process that we run at my company was a pg_dump and > pg_restore of 3 tables, these tables are around ~(400GB) each. The entire > process on pg93 took 29 hours. > > The index creation portion of the restore on the target pg9.3 database took: > 5) time: -15 hours -4 minute ((-54264 % 60)) seconds > > The index creation of the restore on the target db after pg11 upgrade on > source and target took: > 5) time: -5 hours -7 minute ((-18434 % 60)) seconds > > We saved 10 hours!! The sort code received many improvements over the years, really starting in 9.5, and continuing in 9.6, 10 and 11. FWIW, I think that that was probably the biggest factor here. Though parallel CREATE INDEX will have helped as well. -- Peter Geoghegan
Re: Rows violating Foreign key constraint exists
On Fri, Nov 29, 2019 at 7:23 AM Tom Lane wrote: > The most likely "corruption" explanation is something wrong with the > indexes on the referenced and/or referencing column, causing rows to > not be found when referential actions should have found them. Random > querying of the tables wouldn't necessarily expose that --- you'd need > to be sure that your queries use the questionable indexes, and maybe > even search for some of the specific rows that seem mis-indexed. Or try using contrib/amcheck, which is available in Postgres 10. Perhaps try the query here, modified to verify all B-Tree indexes (not just those indexes in the pg_catalog schema): https://www.postgresql.org/docs/10/amcheck.html -- Peter Geoghegan
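Concretely, that means something like the documentation's example query with the pg_catalog filter and the LIMIT removed (a sketch; on 10, contrib's bt_index_check only takes the index argument): SELECT bt_index_check(c.oid), c.relname, c.relpages FROM pg_index i JOIN pg_opclass op ON i.indclass[0] = op.oid JOIN pg_am am ON op.opcmethod = am.oid JOIN pg_class c ON i.indexrelid = c.oid JOIN pg_namespace n ON c.relnamespace = n.oid WHERE am.amname = 'btree' AND c.relpersistence != 't' AND c.relkind = 'i' AND i.indisready AND i.indisvalid ORDER BY c.relpages DESC; That checks every B-Tree index in the database, largest first.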
Re: Query returns no results until REINDEX
On Fri, Feb 7, 2020 at 3:52 PM Colin Adler wrote: > Seems to work now. My question is, is this something I should report to the > maintainers? I am one of the people that maintains the B-Tree code. You didn't mention what version of Postgres you're using here. That could be important. Please let us know. Mention the minor component of the release version, too (i.e. say 12.2, not just 12). > I took a snapshot of the data folder before the reindex in case it > would be helpful. Is index corruption something that should be actively looked > out for? Yes -- look for corruption. If I had to guess, I'd say that this has something to do with upgrading the operating system to use a different, incompatible glibc. Or perhaps it has something to do with streaming replication between machines with different glibc versions. You should try running contrib/amcheck, which should be able to isolate index corruption, and give you a specific complaint. You may then be able to inspect the exact index page with the problem using contrib/pageinspect. Something like this ought to do it on Postgres 11 or 12: CREATE EXTENSION IF NOT EXISTS amcheck; SELECT bt_index_check('my_index', true); If that doesn't show any errors, then perhaps try this: SELECT bt_index_parent_check('my_index', true); If you're on Postgres 10, then you should leave out the second argument, "true", since that version doesn't have the extra heapallindexed check. Let us know what you see. -- Peter Geoghegan
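If amcheck does report an error naming a particular block, something along these lines lets you look at the raw items on that page (a sketch; the index name and block number are placeholders): CREATE EXTENSION IF NOT EXISTS pageinspect; SELECT * FROM bt_page_items('my_index', 123);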
Re: Query returns no results until REINDEX
On Sun, Feb 9, 2020 at 12:50 PM Colin Adler wrote: > Looks like it found something. I checked out the contrib/pageinspect docs but > wasn't too sure what to run. Are incompatible libc versions causing btree > corruption something you consider a bug? If it's something you'd like to look > into further I can gladly send over the database files. No, this is not considered a bug. It's unfortunate that there is no built in collation versioning mechanism, or something like that -- that might have alerted you to the problem before any real damage occurred. We have that for the ICU collations, but it currently isn't possible to use ICU as the default collation provider. You really have to go out of your way to use ICU collations. -- Peter Geoghegan
Re: Querying an index's btree version
On Wed, Mar 11, 2020 at 1:26 PM Darren Lafreniere wrote: > We've read that PG 12 has improved btree index support, and that the latest > internal btree version was bumped from 3 to 4. Is it possible to query the > btree version that a particular index is using? We'd like to automatically > start a concurrent re-index if we detect any btree indexes are still on > version 3. It's possible, but you have to install the superuser-only pageinspect extension. Here is how you'd determine that an index called 'pg_aggregate_fnoid_index' is on version 4: regression=# create extension pageinspect; CREATE EXTENSION regression=# select version from bt_metap('pg_aggregate_fnoid_index'); version - 4 (1 row) -- Peter Geoghegan
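Something along these lines should list any B-Tree indexes still on version 3 (a sketch; pageinspect is superuser-only, and the column alias is mine): SELECT * FROM ( SELECT c.oid::regclass AS index_name, (bt_metap(c.oid::regclass::text)).version AS btree_version FROM pg_class c JOIN pg_index i ON i.indexrelid = c.oid JOIN pg_am am ON am.oid = c.relam WHERE c.relkind = 'i' AND am.amname = 'btree' AND c.relpersistence != 't' AND i.indisvalid ) AS v WHERE btree_version < 4;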
Re: Querying an index's btree version
On Wed, Mar 11, 2020 at 2:13 PM Darren Lafreniere wrote: > when you restore a DB from a backup, does the restored index use the old > format or the latest one? If you use pg_restore, it uses the latest index format. If you're using pg_upgrade, the version won't change unless and until you REINDEX. This includes cases where you're running pg_upgrade against a restored physical backup. -- Peter Geoghegan
Re: create index insist on 2 workers only
On Fri, Apr 24, 2020 at 7:32 AM Radoslav Nedyalkov wrote: > We 're rebuilding a big table which has set parallel_workers = 6 > system has > max_parallel_maintenance_workers | 6 | > /var/lib/pgsql/11/data/postgresql.sumup.conf > max_parallel_workers | 16 | > /var/lib/pgsql/11/data/postgresql.sumup.conf > max_parallel_workers_per_gather | 4 | > /var/lib/pgsql/11/data/postgresql.sumup.conf > > Also session level on index restore there is > set max_parallel_maintenance_workers = 6; > > Still we get only 2 parallel processes in a free of any other load system. > It is postgres 11.7 Try increasing maintenance_work_mem from the default of 64MB. MWM constrains the number of parallel workers used. -- Peter Geoghegan
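For example, at the session level before the index build (the values and names here are only illustrative): SET maintenance_work_mem = '2GB'; SET max_parallel_maintenance_workers = 6; CREATE INDEX my_idx ON my_big_table (some_col); With the default 64MB of maintenance_work_mem, the worker count is held down regardless of the other parallel settings, since each participant needs a minimum share of that memory.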
Re: 12.2: Why do my Redo Logs disappear??
On Mon, Jun 8, 2020 at 5:17 PM Peter wrote: > Loosing a RedoLog is very bad, because there is no redundancy, > loosing a single one of them makes the timeline disappear and it > will only reappear after another Base Backup. Very very bad. >In this case, it seems, Postgres will delete the current log >without archiving it. :( I strongly suspect that you were hit by the bug fixed in commit 4e87c483. You should upgrade to Postgres 12.3 ASAP, to get that fix: "Avoid premature recycling of WAL segments during crash recovery (Jehan-Guillaume de Rorthais) WAL segments that become ready to be archived during crash recovery were potentially recycled without being archived." Sorry that you were affected by this bug -- it really sucks. -- Peter Geoghegan
Re: 12.3 replicas falling over during WAL redo
On Sun, Aug 2, 2020 at 9:39 PM Kyotaro Horiguchi wrote: > All of the cited log lines seem suggesting relation with deleted btree > page items. As a possibility I can guess, that can happen if the pages > were flushed out during a vacuum after the last checkpoint and > full-page-writes didn't restored the page to the state before the > index-item deletion happened(that is, if full_page_writes were set to > off.). (If it found to be the cause, I'm not sure why that didn't > happen on 9.5.) There is also a Heap/HOT_UPDATE log line with similar errors. -- Peter Geoghegan
Re: 12.3 replicas falling over during WAL redo
On Mon, Aug 3, 2020 at 2:35 PM Alvaro Herrera wrote: > You can use pageinspect's page_header() function to obtain the page's > LSN. You can use dd to obtain the page from the file, > > dd if=16605/16613/60529051 bs=8192 count=1 seek=6501 of=/tmp/page.6501 Ben might find this approach to dumping out a single page image easier, since it doesn't involve relfilenodes or filesystem files: https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#contrib.2Fpageinspect_page_dump -- Peter Geoghegan
Re: Index tuple deduplication limitations in pg13
On Mon, Aug 17, 2020 at 11:44 PM Matthias van de Meent wrote: > But, if the ordering of operator-class equal tuples is already > system-defined, could the physical ordering of index tuples in a btree > (with deduplication enabled for "unsafe" opclasses) be updated from > [index_columns, tid] to [index_columns, > image_compare(non_datum_equal_columns), tid], giving a stable sorting > of opclass-equal and image-equal values and enabling safe consistent > deduplication? The issue isn't the physical ordering. The issue is that we cannot allow the implementation to destroy semantic differences among equal datums. We avoid deduplication with cases where two equal datums are visibly different. For example, it would not be okay if we forgot that your numeric datum was originally input as '5.000', and output '5' later on. If we wanted to fix this for numeric, we'd have to invent a new numeric datatype (called numeric2, say). That probably isn't as hard as it sounds, since it could be part of the same B-Tree operator family as numeric. It could also be implicitly cast to numeric. -- Peter Geoghegan
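To make the display scale point concrete: SELECT '5.000'::numeric = '5'::numeric; -- true, equal per the opclass SELECT '5.000'::numeric; -- still prints 5.000, not 5 The two inputs are equal to the B-Tree operator class, but they are visibly different values, which is exactly what deduplication must not throw away.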
Re: Index tuple deduplication limitations in pg13
On Tue, Aug 18, 2020 at 9:44 AM Peter Geoghegan wrote: > If we wanted to fix this for numeric, we'd have to invent a new > numeric datatype (called numeric2, say). That probably isn't as hard > as it sounds, since it could be part of the same B-Tree operator > family as numeric. It could also be implicitly cast to numeric. I forgot to say: numeric2 would be just like numeric, except in one specific way: it wouldn't care about display scale. The user would be giving up on display scale by choosing numeric2 over numeric. The "5 vs 5.000" information would always be lost by design, so there'd be nothing for deduplication to break. Deduplication could then be enabled. -- Peter Geoghegan
Re: Index tuple deduplication limitations in pg13
On Tue, Aug 18, 2020 at 11:52 AM Matthias van de Meent wrote: > Deduplication does not need to destroy semantic differences? 'equal' > can (in my book) mean: > - 'opclass-equal', that is the opclass returns true for an equality check > - 'binary equal' or 'datum-equal' (? maybe incorrect term), that is > the unprocessed on-disk representations (datum image is the right term > I believe?) of the compared values are indistinguishable. > > Runs of 'binary equal' datums can be freely deduplicated [0] when found. > [0] > Inserting a row in a deduplicated index with in, with TID ntid, can > encounter a posting list of a opclass-equal but not datum image-equal > tuples where the lowest TID of the posting list is less than ntid, and > ntid is less than the highest TID of the posting list. This would > require a posting list split to accomodate the new tuples' index entry > in order to not lose data. But you can't do that easily, because it breaks subtle assumptions about posting list splits and space utilization. In particular, it means that you can no longer think of a posting list split as rewriting an incoming new item such that you can more or less pretend that there was no overlap in the first place -- code like _bt_split and nbtsplitloc.c relies on this. Introducing new special cases to nbtsplitloc.c is very unappealing. More concretely, if you introduce a posting list split like this then you need three copies of the key -- the original, the new, and a second copy of the original. That's much more complicated. -- Peter Geoghegan
Re: Index tuple deduplication limitations in pg13
On Tue, Aug 18, 2020 at 11:52 AM Matthias van de Meent wrote: > Given that the above could work, the current btree tuple ordering is > not optimized for opclass-equal but datum image-distinct values: > ordering of opclass-equal values is currently determined only by tid, > with as an example current ordering ['0.0', '0', '0.00', '0', '0.0', > '0']. It would be more optimized for deduplication if that was stored > as e.g. ['0', '0', '0', '0.0', '0.0', '0.00'], which is why I > suggested to add an ordering by the datum image before the tid > ordering. Additionally, this extra ordering also prevents the problem > of [0] by never attempting an insertion of non-equal image datums in a > posting list of otherwise equal values, as it would be ordered either > before or after the posting list, never inside the list. Yeah, that would work, but at the cost of making numeric totally unusable. Now you cannot rely on unique enforcement detecting that '0' is a duplicate of '0.0'. In fact, even the most trivial use of the = operator will be broken in the presence of different display scales. It's a non-starter. The numeric2 design that I sketched is a bit ugly, but I can see no better way. A three-way posting list split (i.e. the other design that you sketched) is a special case that is very hard to test, very complicated, and of little value in the grand scheme of things. -- Peter Geoghegan
Re: Index tuple deduplication limitations in pg13
On Tue, Aug 18, 2020 at 1:31 PM Matthias van de Meent wrote: > Would this extra ordering not effectively be an extra tiebreaker in > the ordering, applied before the TID? I do not know the full > implications of that, but I believe that would not result in the > limitations that you are mentioning. You could probably do it that way, but again you end up with a lot of new complexity. Not to mention overhead that would have to be paid by everyone. It would require code that supported the old way (even if it was added to Postgres 13) for pg_upgrade, that would also be hard to test. And it might defeat certain future optimizations based on heap TID being the only tiebreaker. Having two types of equality might have to bleed into the optimizer. It's a question of engineering trade-offs. I don't think that it's worth it. -- Peter Geoghegan
Re: Need explanation on index size
On Thu, Sep 24, 2020 at 6:55 AM Guillaume Luchet wrote: > I don’t understand why after the update where I only update a non indexed > column the indexes size is growing. Is it something someone can explain ? If you reduce the table fillfactor then these updates will all be HOT updates. That will make the table larger initially, but leaving enough space behind on the same heap pages for successor tuples makes it possible to use HOT updates. -- Peter Geoghegan
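For example (the table name is a placeholder; the setting only affects newly written pages, so existing full pages are only repacked by something like VACUUM FULL or CLUSTER): ALTER TABLE my_table SET (fillfactor = 90);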
Re: autovacuum verbose?
On Thu, Jan 21, 2021 at 12:55 PM Tommy Li wrote: > Is there any way to configure autovacuum to log the same information as > VACUUM VERBOSE? No. Though there really should be. -- Peter Geoghegan
Re: oldest xmin is far in the past :: BUT xmin is not available in system
On Mon, Apr 18, 2022 at 11:37 PM bhargav kamineni wrote: > executing the vacuum on the entire cluster is also giving the same HINTS and > WARNING's You're using Aurora, not PostgreSQL. Perhaps this is actually a bug, but there is no way for anybody here to know. -- Peter Geoghegan
Re: autovacuum_freeze_max_age on append-only tables
On Wed, Apr 20, 2022 at 4:06 PM senor wrote: > I'm attempting to mimic a new feature in version 13 where INSERTS will > trigger vacuum for an append-only table. The problem with that idea is that you need to express the idea that the table needs to be vacuumed now in terms of its "age", denominated in XIDs -- but XIDs consumed by the entire system, not just those XIDs that happen to modify your append-only table. It will likely be very hard for you to figure out a way to relate these logical units (XIDs) to some kind of physical cost that captures how far behind you are on freezing (like blocks, or even tuples). Maybe you'll find something that works through trial and error, but I wouldn't count on it. > I'm apparently needing an education on how this "to avoid wraparound" vacuum > differs from any other. I've seen it referenced as "more aggressive" but I'd > like details. An upgrade to 13 is "right around the corner". It's complicated -- more complicated than it really should be. Technically an anti-wraparound autovacuum and an aggressive vacuum are two different things. In practice anti-wraparound autovacuums are virtually guaranteed to be aggressive, though an aggressive autovacuum may not be an antiwraparound VACUUM (sometimes we do aggressive vacuuming because autovacuum launched a worker before age(relfrozenxid) reached autovacuum_freeze_max_age, but after age(relfrozenxid) reached vacuum_freeze_table_age). See my recent response to a similar question here: https://postgr.es/m/CAH2-WzkFQ-okvVXizpy4dCEVq75N-Qykh=crhzao-eajflv...@mail.gmail.com -- Peter Geoghegan
Re: autovacuum_freeze_max_age on append-only tables
On Thu, Apr 21, 2022 at 8:14 PM Senor wrote: > Are the autovacuum_vacuum_cost_* settings handled any differently for > 'to avoid wraparound' vacuums? I understand that it won't give up a lock > but I was expecting it to still back off due to cost and allow the query > with conflicting lock to proceed. In general, no. For the most part an antiwraparound autovacuum does exactly the same work as any other autovacuum. Or any other aggressive VACUUM, at least. But even the extra work that it does beyond what a non-aggressive VACUUM would do is still work that any VACUUM might do, if the circumstances were right. We still freeze in regular VACUUMs, provided we scan pages with XIDs that are sufficiently old. The most important difference between a non-aggressive VACUUM and an aggressive VACUUM is that the former can skip all-visible pages that have unfrozen XIDs, putting that work off. This can sometimes lead to a big balloon payment later on, when you finally have an aggressive VACUUM. I think that that's a design flaw that ought to be fixed. Currently non-aggressive VACUUMs always skip all-visible pages. They should probably freeze some older all-visible pages eagerly, rather than skipping them, so that the system never gets too far behind on freezing. > Is there any benefit to manually running a vacuum every so many inserts > as opposed to using autovacuum_freeze_max_age. And in this case should > it be a vacuum freeze. Given your restrictions, this is probably the best option available. But maybe you should just set vacuum_freeze_min_age to 0 at the table level, instead of using vacuum freeze (so you freeze more without doing aggressive vacuuming all the time, which FREEZE also forces). Users understandably think that there are several different flavors of vacuum, but that's not really true (apart from VACUUM FULL, which really is quite different). The difference between aggressive and non-aggressive can be big in practice due to an accumulation of unfrozen pages over multiple non-aggressive vacuums. -- Peter Geoghegan
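For example, something along these lines (the table name is a placeholder): ALTER TABLE append_only_table SET (autovacuum_freeze_min_age = 0); -- have autovacuum freeze eagerly for this table. For a manually scheduled VACUUM, the session GUC has the same effect: SET vacuum_freeze_min_age = 0; VACUUM append_only_table;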
Re: Automatic autovacuum to prevent wraparound - PG13.5
On Wed, Jun 15, 2022 at 4:13 AM Mauro Farracha wrote: > The scenario: > - Out of nowhere (during the weekend), without database activity load or > batches running, with previous nightly run of vacuum freeze, in the middle of > the day, with xids and mxids below 20M we are seeing autovacuum being > triggered to prevent wraparound. > > My question is why this is occurring, which condition might be responsible > for this behaviour? There is a behavior that seems like it might be relevant: VACUUM interprets autovacuum_multixact_freeze_max_age in a way that accounts for both MultiXactId consumption and the consumption of "member space" by MultiXacts. Technically there are 2 SLRUs for MultiXacts, either of which can wraparound. This behavior was established by commit 53bb309d2d. It is documented. Admittedly this whole area of the documentation is in dire need of an overhaul. :-( -- Peter Geoghegan
Re: Monitoring multixact members growth
On Fri, Aug 19, 2022 at 8:40 AM Vido Vlahinic wrote: > However it is not quite clear to me how I can interpret results from above > and e.g. conclude: my_table accumulates x more multixact members since I > measured last. You can't. And not just because nobody got around to implementing it yet -- it's quite a fundamental restriction. VACUUM must always make sure of that, and must always scan all unfrozen pages to safely determine that much (it may or may not have to freeze *any* MultiXacts as part of that process, but it must always be sure that no Multis < its final relminmxid remain). > My goal here is to predict where multixact members are growing the fastest so > I can perform manual VACUUM FREEZE only on those tables The problem with that strategy is that you still have to do anti-wraparound autovacuums when the mxid_age(relminmxid) of a table crosses the usual threshold, even when in reality there are *zero* MultiXacts in the table (often the case with the largest tables). That's just how it works, unfortunately. There is one piece of good news, though: work in Postgres 15 taught VACUUM to track the oldest extant XID and MXID in the table, and set relfrozenxid and relminmxid to those oldest values (rather than using the cutoffs for freezing, which in general might be much older than the oldest remaining unfrozen XID/MXID). I expect that this will make it much less likely that anti-wraparound autovacuums affecting many tables will all stampede, hurting performance. This will be possible because VACUUM will now be able to set relminmxid to a value that actually tells us something about what's really going on in each table, MultiXact-wise (not just what you set vacuum_multixact_freeze_min_age and autovacuum_multixact_freeze_max_age to in postgresql.conf, which is pretty far removed from what matters most of the time). Simply by noticing that there are no remaining MultiXacts (and probably never were any in the first place) with the larger tables. The timeline for anti-wraparound autovacuums will tend to make a lot more sense for *your* workload, where huge differences in the rate of MultiXact consumption among tables is likely the norm. This still isn't perfect (far from it), but it has the potential to make things far better here. -- Peter Geoghegan
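What you can usefully monitor today is MultiXactId age per table, not per-table member space consumption -- something like this sketch (the alias and LIMIT are arbitrary): SELECT c.oid::regclass AS table_name, mxid_age(c.relminmxid) AS relminmxid_age FROM pg_class c WHERE c.relkind IN ('r', 'm', 't') ORDER BY relminmxid_age DESC LIMIT 20;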
Re: Weird planner issue on a standby
On Tue, Oct 11, 2022 at 9:27 AM Alvaro Herrera wrote: > I remember having an hypothesis, upon getting a report of this exact > problem on a customer system once, that it could be due to killtuple not > propagating to standbys except by FPIs. I do not remember if we proved > that true or not. I do not remember observing that tables were being > read, however. That's true, but it doesn't matter whether or not there are LP_DEAD bits set on the standby, since in any case they cannot be trusted when in Hot Standby mode. IndexScanDescData.ignore_killed_tuples will be set to false on the standby. -- Peter Geoghegan
Re: Weird planner issue on a standby
On Tue, Oct 11, 2022 at 10:04 PM Tom Lane wrote: > Do we propagate visibility-map bits to standbys? Yes. -- Peter Geoghegan
Re: Weird planner issue on a standby
On Wed, Oct 12, 2022 at 6:47 AM Tom Lane wrote: > However, that doesn't explain the downthread report that a > VACUUM on the primary fixed it. What I suspect is that that > caused some in-fact-dead index entries to get cleaned out. Seems likely. > But ... if the primary is allowed to vacuum away an index > entry that it thinks is dead, exactly what is the point of > making standbys ignore LP_DEAD bits? There's no additional > interlock that guarantees the tuple will be there at all. The interlock doesn't really protect the leaf page or its index tuples so much as the referenced TIDs themselves. In other words it's a TID recycling interlock. That's why we don't need a cleanup lock to perform index tuple deletions, even though the WAL records for those are almost identical to the WAL records used by index vacuuming (in the case of nbtree the only difference is the extra latestRemovedXid field in the deletion variant WAL record). We know that there is no VACUUM process involved, and no question of heap vacuuming going ahead for the same TIDs once index vacuuming is allowed to complete. We can get away with not having the interlock at all in the case of nbtree index scans with MVCC snapshots -- but *not* with index-only scans. See "Making concurrent TID recycling safe" in the nbtree README. I only got around to documenting all of the details here quite recently. The index-only scan thing dates back to 9.5. -- Peter Geoghegan
Re: ON CONFLICT and WHERE
On Sun, Nov 13, 2022 at 1:07 PM Tom Lane wrote: > A WHERE placed there is an index_predicate attachment to the ON CONFLICT > clause. It doesn't have any run-time effect other than to allow partial > indexes to be chosen as arbiter indexes. TFM explains > > index_predicate > > Used to allow inference of partial unique indexes. Any indexes > that satisfy the predicate (which need not actually be partial > indexes) can be inferred. > > This strikes me as a bit of a foot-gun. I wonder if we should make > it safer by insisting that the resolved index be partial when there's > a WHERE clause here. I don't think that it would be safer. Adrian has asked why it's possible to attach an arbitrary index_predicate type WHERE clause to an ON CONFLICT query, without that really changing the behavior of the statement. That *is* a little odd, so it's certainly a fair question (I can recall perhaps as many as 5 similar questions over the years). But it's not the end of the world, either -- there are far worse things. I think that it would be a lot worse (just for example) to have your ON CONFLICT query suddenly start throwing an ERROR in production, just because you replaced a partial unique index with a unique constraint. If we have a suitable unique index or constraint, why wouldn't we use it in ON CONFLICT? Maybe it won't work out that way (maybe there won't be any suitable unique index or constraint), but why not do our utmost to insulate the user from what might be a serious production issue? That was the guiding principle. Overall I'm quite happy with the amount of foot-guns ON CONFLICT has, especially compared to other comparable features in other DB systems (which had plenty). There are one or two ostensibly odd things about the syntax that are downstream consequences of trying to make the constraint/unique index inference process maximally forgiving. I'm pretty happy with that trade-off. > (This documentation text is about as clear as > mud, too. What does "inferred" mean here? I think it means "chosen as > arbiter index", but maybe I misunderstand.) Unique index/constraint inference is the process by which we choose an arbiter index. See the second paragraph of the "ON CONFLICT Clause" section of the INSERT docs. -- Peter Geoghegan
Re: Lots of read activity on index only scan
On Fri, Nov 18, 2022 at 12:46 PM Peter J. Holzer wrote: > Both do a parallel index only scan. Both perform 0 heap fetches. > But one reads 27336 buffers (or about 22 bytes per index entry, which > sounds reasonable) while the other reads 9995216 buffers (or almost one > full buffer per row). Why? The entries should be dense in the index in > both cases and since it's an index only scan (and explain says there > were 0 heap fetches) I would not expect extra accesses. Where do these > buffer reads come from? The index-only scan processes an index leaf page at a time. When there is naturally a high correlation (or some kind of clustering) in how we access VM pages, we'll naturally be able to do more visibility checks covering more index tuples per VM page accessed. This is a less severe problem here than it would be with an equivalent pair of plain index scans, just because there are so few VM pages relative to heap pages. But it's more or less an analogous problem. You're really noticing it here because these index scans have very low selectivity -- which is kinda unusual in most environments. Also worth bearing in mind that it's unusual to have a perfectly random and uniformly distributed clustering of index tuples, which is what the index built via hashing exhibits. Even a v4 UUID index could easily have plenty of duplicates, which would probably do significantly better on the metric you've focussed on. -- Peter Geoghegan
Re: Lots of read activity on index only scan
On Fri, Nov 18, 2022 at 1:50 PM Peter J. Holzer wrote: > There should be about 27000 of them, same as for the othe index, right? There aren't that many. The point I'm making is that you can access each VM page approximately once (and check relatively many index tuples' TIDs all in one go), or many times. The total number of VM pages may be constant, but the access patterns are quite different owing to differences in how the data is clustered in each index. > > When there is naturally a high correlation (or some kind of > > clustering) in how we access VM pages, we'll naturally be able to do > > more visibility checks covering more index tuples per VM page > > accessed. > > So you are saying that these are accesses to the visibility map, not the > base table? Yes. I see "Heap Fetches: 0" for both plans, which each query the same table and scan approximately the same number of index pages. So VM accesses are the only explanation that makes any sense. > > Also worth bearing in mind that it's unusual to have a perfectly > > random and uniformly distributed clustering of index tuples, > > Sure. This is a highly contrived example. FWIW I think that it could be a lot less bad, even with indexes that you'd think would be almost as bad as the bad one from your test case. Even things that appear to be random aren't usually nearly as random as what you've shown. -- Peter Geoghegan
Re: Question regarding UTF-8 data and "C" collation on definition of field of table
On Sun, Feb 5, 2023 at 4:19 PM Tom Lane wrote: > If there's a predominant language in the data, selecting a collation > matching that seems like your best bet. Otherwise, maybe you should > just shrug your shoulders and stick with C collation. It's likely > to be faster than any alternative. FWIW there are certain "compromise locales" supported by ICU/CLDR. These include "English (Europe)", and, most notably, EOR (European Ordering Rules): https://en.wikipedia.org/wiki/European_ordering_rules I'm not sure how widely used those are; EOR seems to have been standardized by the EU or by an adjacent institution. It's also possible to use a custom collation with ICU, which is almost infinitely flexible: http://www.unicode.org/reports/tr10/#Customization As an example, the rules about the relative ordering of each script can be changed this way. There is also something called merged tailorings. The OP should see the Postgres ICU docs for hints on how to use these facilities to make a custom collation that matches whatever their requirements are: https://www.postgresql.org/docs/current/collation.html#COLLATION-MANAGING -- Peter Geoghegan
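For example, this customization (taken from the kind of ICU tailoring described in the Postgres docs; the collation name is arbitrary) sorts Greek letters before Latin ones: CREATE COLLATION latinlast (provider = icu, locale = 'en-u-kr-grek-latn');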
Re: ERROR: posting list tuple with 2 items cannot be split at offset 17
On Wed, Feb 8, 2023 at 11:54 PM Paul McGarry wrote: > But if it is the problem, why did the update start working after I recreated > the other index? There is no reason why reindexing another index ought to have had that effect. The likely explanation is that subsequent updates used a successor version heap TID that didn't overlap with some existing posting list in whatever way. If you repeat the update again and again, and get an error each time, the incoming TID will differ each time. Eventually you won't get an error, because at some point there won't be a posting-list-TID range overlap for some new successor TID that leads to the insert/posting list split code detecting a problem. It's also possible that a concurrent autovacuum "fixed" the issue. The amcheck error shows a problem in an internal page, which cannot have posting list tuples -- which suggests broad corruption. An issue with collation instability due to an OS update does seem likely. Note that the hardening/defensive checks in this area have increased. I added an additional defensive check to 13.4, and followed up with another similar check in 13.5. It looks like the error you've seen ("ERROR: posting list tuple with 2 items cannot be split at offset 17") comes from the initial 13.4 hardening, since I'd expect the additional 13.5 hardening to catch the same issue sooner, with a different error message (something like "table tid from new index tuple (%u,%u) cannot find insert offset between offsets %u and %u of block %u in index \"%s\""). > I think I should now: > - recreate the widget_name_idx on the problem servers > - run bt_index_check across all other indexes > > Any suggestions on what else I should look into, in particular anything I > should check before upgrading the remaining 13.8 DB to 13.9? I recommend running amcheck on all indexes, or at least all possibly-affected text indexes. -- Peter Geoghegan
Re: ERROR: posting list tuple with 2 items cannot be split at offset 17
On Thu, Feb 9, 2023 at 3:55 PM Paul McGarry wrote: > Will the amcheck reliably identify all issues that may arise from a collation > change? Theoretically it might not. In practice I'd be very surprised if it ever failed to detect such an inconsistency. If you want to be extra careful, and can afford to block concurrent writes, then I suggest using bt_index_parent_check instead of bt_index_check. > and therefore I can just use that to identify bad indexes and recreate them, > or should I recreate all btree indexes involving text fields? It might be easier to just reindex them all. Hard to say. > We planned to do a dump/restore upgrade to PG14 in a month or so (already > done in dev). > I am wondering whether it will be less work overall to bring that forward > than rebuild these indexes... pg_amcheck is available on 14. It offers a much simpler interface for running the amcheck routines, so maybe look into that once you upgrade. -- Peter Geoghegan
Re: Automatic aggressive vacuum on almost frozen table takes too long
On Thu, Feb 16, 2023 at 7:44 AM Adrian Klaver wrote: > > That is, if I understand it correctly, it says that there were (and > > actually are) 2013128 pages of which 2008230 were skipped, which leaves > > 4898 blocks to be scanned. I.e. it seems that the allocated 1GB > > (autovacuum_work_mem) should be enough to handle that amount of blocks > > and to avoid multiple scans of the indexes. It's quite clear that there is only one pass over the indexes, since "index scans: 1" says exactly that. > > But, based on buffer usage, one can see that a huge amount of data is > > read, greatly exceeding not only the number of remaining unfrozen > > blocks, but also the size of the table and indexes taken together: 2 > > billion blocks, more than 15TB. > > > > Is this a bug in Postgresql or am I interpreting the log data wrong? I think that it might be a bug in Postgres. I addressed a similar issue in the same "buffer" instrumentation in commit d3609dd2, but that wasn't backpatched because I imagined that it only applied to the new VACUUM VERBOSE case (VACUUM VERBOSE only started sharing the same instrumentation code as log_autovacuum in Postgres 15). It's not immediately obvious how you could see a problem like the one you've shown in the autovacuum log output. Even still, the information about buffers that you've shown does indeed appear to be total nonsense (while everything else we can see looks plausible). There has to be some explanation for that. The only other explanation I can think of is a pathological case where an index scan by some particular ambulkdelete routine scans a number of buffers that vastly exceeds the total size of the index. That does seem just about possible with an access method like GIN. Do you have any non-btree indexes on the table? Can you show us the details of the table, including all of its indexes? In other words, can you show "\d applications" output from psql? -- Peter Geoghegan
Re: Automatic aggressive vacuum on almost frozen table takes too long
On Thu, Feb 16, 2023 at 5:40 PM Mikhail Balayan wrote: >> >> Do you have any non-btree indexes on the table? Can you show us the >> >> details of the >> >> table, including all of its indexes? In other words, can you show "\d >> >> applications" output from psql? > > Only btree indexes. Please find the full table schema below: It's possible that VACUUM had to wait a long time for a cleanup lock on one individual heap page here, which could have added a long delay. But...that doesn't seem particularly likely. Can you run amcheck's bt_index_check() routine against some of the indexes you've shown? There is perhaps some chance that index corruption exists and causes VACUUM to take a very long time to delete index pages. This is pretty much a wild guess, though. -- Peter Geoghegan
Re: Automatic aggressive vacuum on almost frozen table takes too long
On Mon, Feb 20, 2023 at 9:43 PM Mikhail Balayan wrote: > What catches my eye: scanning indexes smaller than 3.1GB is fast, larger ones > are slow. For example: > idx_applications2_policy_id is 3131 MB took just 5 seconds (DETAIL: CPU: > user: 2.99 s, system: 1.65 s, elapsed: 5.32 s) > but idx_applications2_deleted_at with 3264 MB took 1 minute 22 seconds > (DETAIL: CPU: user: 67.93 s, system: 3.41 s, elapsed: 82.75 s) I think that I know what this is. If you delete many index pages during VACUUM, and those pages are all full of duplicate values, the deletion operation can sometimes be slower due to the need to relocate a downlink to each to-be-deleted leaf page. When there are thousands of matches, you'll start to notice O(n^2) behavior due to the way in which the B-Tree VACUUM code must grovel through the parent level, which is full of duplicate keys. If you were on Postgres 12+, then this wouldn't happen, because the heap TID is treated as a part of the key space there, affecting sort order. The implementation would immediately relocate the matching parent downlink using a unique key (unique because heap TID would act as a unique-ifier on that version). And if you were on 14+, things in this area would be much better still. -- Peter Geoghegan
Re: PG16devel - vacuum_freeze_table_age seems not being taken into account
On Fri, Mar 3, 2023 at 2:43 AM Simon Elbaz wrote: > I expected it not to be processed by vacuum freeze. > However it has been entirely frozen. > Moreover, among the 51 rows, only 1 was eligible for freeze because its XID > was older than vacuum_freeze_min_age. The effect that you noticed is a consequence of page-level freezing, which is new to Postgres 16. VACUUM will now freeze all of the tuples on a page whenever it needs to freeze any tuples at all (barring any tuples that are fundamentally ineligible due to being after the removable/freezable cutoff). This is justified by the cost profile. Once we decide to freeze at least one tuple of a page, the added cost in WAL is low enough that it really doesn't make sense to not just freeze everything. The page that gets frozen by your test case is also set all-frozen in the visibility map. Without the optimization, we'd have frozen that one tuple and then set the page all-visible. The page would likely be frozen again by the next aggressive VACUUM, which is usually much more expensive. -- Peter Geoghegan
Re: UPSERT in Postgres
On Thu, Apr 6, 2023 at 1:21 PM Louis Tian wrote: > An implicit assumption behind this definition is that table must have a > primary key for the upsert operation to make sense since it's the primary key > that uniquely identifies a row. It could just be a unique index or a unique constraint. So you can upsert on any individual unique constraint/index, or the primary key. Of course there might be several on a given table, but you can only use one as the "conflict arbiter" per statement. > But can we achieve the same effect with "INSERT ... ON CONFLICT DO UPDATE" > like a lot of references on the internet seems to suggest. > > insert into person (id, name) values (0, 'foo') on conflict ("id") do update > set id=excluded.id, name=excluded.name > insert into person (id, is_active) values (0, true) on conflict ("id") do > update set id=excluded.id, is_active=excluded.is_active > > Unfortunately. the second statement will fail due to violation of the not > null constraint on the "name" column. > PostgreSQL will always try to insert the row into the table first. and only > fallback to update when the uniqueness constraint is violated. > Is this behavior wrong? maybe not, I think it is doing what it reads quite > literally. It sort of has to work that way, though. In general your example might *not* fail, due to a row-level before trigger in the insert path. Why doesn't your proposed upsert syntax have the same problem? I mean, how could it not? I guess it doesn't if you assume that it'll never take the insert path with your not NULL constraint example? But if you know that for sure, why not just use a regular update statement? On the other hand, if you're not sure if the insert path can be taken, then why is it actually helpful to not just throw an error at the earliest opportunity? Surely upsert means "update or insert", so why wouldn't the user expect to see an error like this, independent of the specifics of the row in question? Isn't the user tacitly saying "I don't specifically know if the update or insert path will be taken in respect of any given row" by using ON CONFLICT DO UPDATE in the first place? > That being said, I have never had a need for the ON CONFLICT DO UPDATE > statement other than where I need upsert. > But using it as "upsert" is only valid when the table is absent of any NOT > NULL constraint on it's non primary key columns. I don't know what you mean by that. "Valid"? > The MERGE command introduced in PG15 in theory can be used to do UPSERT > properly that is void of the aforementioned limitation. > The downside is it is rather verbose. The MERGE command has various race conditions that are particularly relevant to UPSERT type use cases. See the wiki page you referenced for a huge amount of information on this. > *Feature Request* > Given that UPSERT is an *idempotent* operator it is extremely useful. In general UPSERT (or any definition of it that I can think of) does not imply idempotency. -- Peter Geoghegan
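To illustrate the earlier point about arbiter indexes: any unique constraint or unique index can be the conflict arbiter, not just the primary key (hypothetical table and columns): CREATE TABLE person (id int PRIMARY KEY, email text UNIQUE, name text); INSERT INTO person (id, email, name) VALUES (1, 'foo@example.com', 'foo') ON CONFLICT (email) DO UPDATE SET name = EXCLUDED.name;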
Re: CREATE COLLATION to match pg_collation data
On Sat, Jan 12, 2019 at 7:13 AM Tom Lane wrote: > That's an ICU collation, so you're out of luck: there is no ICU > support in 9.6. FWIW, there is some form of FreeBSD Postgres support for ICU that predates the officially supported ICU feature. The FreeBSD Postgres packages have been built with ICU support for many years -- they modify the source code minimally to make this work. It may well still be impossible to use "az-x-icu" on a FreeBSD installation of 9.6, though. -- Peter Geoghegan
Re: Pg analog to DBCC CCHECKDB?
On Fri, Apr 5, 2019 at 7:33 AM Ron wrote: > In 9.6, does such a thing exist? (We just restored a VM from snapshot and I > want to verify the cluster sanity.) amcheck is available for versions 9.4+, though it only appears in contrib in Postgres 10. There are both yum and deb packages available. See: https://github.com/petergeoghegan/amcheck -- Peter Geoghegan
Re: Upgrading locale issues
On Mon, Apr 29, 2019 at 7:45 AM rihad wrote: > Hi. Today we run pg_ctl promote on a slave server (10.7) and started > using it as a master. The OS also got upgraded FreeBSD 10.4 -> FreeBSD > 11.2. And you guessed it, most varchar indexes got corrupted because > system local changed in subtle ways. So I created the extension amcheck > and reindexed all bad indexes one by one. Is there any way to prevent > such things in the future? Will switching to ICU fix all such issues? Not necessarily, but it will detect the incompatibility more or less automatically, making it far more likely that the problem will be caught before it does any harm. ICU versions collations, giving Postgres a way to reason about their compatibility over time. The libc collations are not versioned, though (at least not in any standard way that Postgres can take advantage of). > The problem with it is that ICU collations are absent in pg_collation, > initdb should be run to create them, but pg_basebackup only runs on an > empty base directory, so I couldn't run initdb + pg_basebackup to > prepare the replica server. I believe I can run the create collation > command manually, but what would it look like for en-x-icu? It is safe to call pg_import_system_collations() directly, which is all that initdb does. This is documented, so you wouldn't be relying on a hack. -- Peter Geoghegan
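For example, run in the database that's missing the ICU collations (a sketch; requires superuser): SELECT pg_import_system_collations('pg_catalog'); That imports whatever collations the operating system and ICU report into the pg_catalog schema, just as initdb would have done, and returns the number of collations added.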
Re: Upgrading locale issues
On Wed, May 1, 2019 at 3:09 PM Thomas Munro wrote: > As discussed over on -hackers[1], I think it's worth pursuing that > though. FWIW I've proposed locale versioning for FreeBSD's libc[2]. > The reason I haven't gone further with that yet even though the code > change has been accepted in principle by FreeBSD reviewers is because > I got stuck on the question of how exactly to model the versions. If, > say, just Turkish changes, I don't want to be rebuilding my French > indexes, which means that I don't think you can use the CLDR version > string. The ICU versions can handle that, though. Importantly, ICU decouples implementation details from actual versioning. I must say that I am not enthused about the idea of trying to get libc people of any variety on board. I don't have an objection to it if it can work for FreeBSD, but I don't think it can scale. ICU is built around a culture that takes our concerns seriously already, which is what it boils down to. Also, we can imagine a package manager taking it upon themselves to vendor their own ICU, with platform-specific guarantees around stability. That seems like a nice option to have, at least. > There is also the question of how PostgreSQL should model versions, > and as I've argued in [1], I think we should track them at the level > of database object dependencies. I think you're probably right about that. > I'm hoping to reopen this can of worms for PostgreSQL 13 (and the > corresponding support could in theory be in FreeBSD 13... coincidence, > or a sign!?) Maybe we should do what Oracle did, and call it PostgreSQL 18c instead. Actually, any number that isn't of interest to numerologists will do. > Unfortunately you can't use ICU collations as a database default yet > (though there was some WIP code[3]), so ICU only saves you from > versioning problems if you explicitly set collations for columns or > expressions, and even then the version tracking is currently just a > warning that you clear manually with a command, not a mechanism that > really tracks which database objects were last rebuilt/validated with > a given version. Yes, that does seem like a big remaining weakness. -- Peter Geoghegan
Re: Questions about btree_gin vs btree_gist for low cardinality columns
On Sun, Jun 2, 2019 at 4:07 PM Tom Lane wrote: > Anyway, I said to Jeremy in the hallway that it might not be that > hard to bolt IOS support onto GIN for cases where the opclass is > a non-subdividing one, but after looking at the code I'm less sure > about that. GIN hasn't even got an "amgettuple" code path, just > "amgetbitmap", and a big part of the reason why is the need to merge > results from the fastupdate pending list with results from the main > index area. Not sure how we could deal with that. I suspect that GIN also avoids several other problems by only offering an "amgetbitmap", and not an "amgettuple". For example, it doesn't have to worry about things like numeric display scale, where a datum is substantively different to another datum, while still being equal according to opclass semantics (this is an example that I end up citing in many discussions about indexing). I bet that there are a few more of those beyond those two, that I haven't thought about. > Anyway, the larger point here is that right now btree_gin is just a quick > hack, and it seems like it might be worth putting some more effort into > it, because the addition of duplicate-compression changes the calculus > for whether it's useful. There was also discussion about making nbtree support deduplication during the hallway track. Jim Finnerty reminded me that there is a patch from Anastasia that did deduplication in nbtree that didn't go anywhere. Heikki independently talked about the possibility that he would work on this project in the next release, without being prompted by me. I think that the fact that nbtree sorts duplicate entries in heap TID order these days makes that worth looking into again. We can use something like GIN's varbyte encoding process to compress duplicates effectively. A lot of the problems (the numeric display scale problem, pg_upgrade) can be avoided by defining deduplication as something that happens on a best-effort basis. This is not the case within GIN, where it's impossible for the main entry tree to have duplicates without at least storing them in a posting list. -- Peter Geoghegan