Re: Questions about btree_gin vs btree_gist for low cardinality columns
On 2019-06-02 09:10:25 +1000, Morris de Oryx wrote: > Peter, thanks a lot for picking up on what I started, improving it, and > reporting back. I thought I was providing timing estimates from the EXPLAIN > cost dumps. Seems not. Well, there's another thing that I've learned. The cost is how long the optimizer thinks it will take (in arbitrary units). But it's just an estimate, and estimates can be off - sometimes quite dramatically. To get the real timings with explain, use explain (analyze). I often combine this with buffers to get I/O stats as well: wdsah=> explain (analyze, buffers) select min(date) from facttable_stat_fta4 where partnerregion = 'USA' and sitcr4 = '7522'; ╔══╗ ║ QUERY PLAN ║ ╟──╢ ║ Aggregate (cost=694.23..694.24 rows=1 width=4) (actual time=7.568..7.568 rows=1 loops=1) ║ ║ Buffers: shared hit=3 read=148 dirtied=114 ║ ║ -> Index Scan using facttable_stat_fta4_sitcr4_partnerregion_idx on facttable_stat_fta4 (cost=0.57..693.09 rows=455 width=4) (actual time=0.515..7.493 rows=624 loops=1) ║ ║ Index Cond: (((sitcr4)::text = '7522'::text) AND ((partnerregion)::text = 'USA'::text)) ║ ║ Buffers: shared hit=3 read=148 dirtied=114 ║ ║ Planning time: 0.744 ms ║ ║ Execution time: 7.613 ms ║ ╚══╝ (7 rows) And when you don't need the costs, you can turn them off: wdsah=> explain (analyze, buffers, costs off) select min(date) from facttable_stat_fta4 where partnerregion = 'USA' and sitcr4 = '7522'; ╔╗ ║ QUERY PLAN ║ ╟╢ ║ Aggregate (actual time=0.598..0.598 rows=1 loops=1) ║ ║ Buffers: shared hit=140 ║ ║ -> Index Scan using facttable_stat_fta4_sitcr4_partnerregion_idx on facttable_stat_fta4 (actual time=0.054..0.444 rows=624 loops=1) ║ ║ Index Cond: (((sitcr4)::text = '7522'::text) AND ((partnerregion)::text = 'USA'::text))║ ║ Buffers: shared hit=140 ║ ║ Planning time: 0.749 ms ║ ║ Execution time: 0.647 ms ║ ╚╝ (7 rows) See https://www.postgresql.org/docs/current/sql-explain.html for details. > Can you tell me how you get timing results into state_test_times? In this case I just entered them manually (cut and paste from psql \timing output). If I wanted to repeat that test on another database, I would write a Python script (I'm sure you can do that in pgsql, too, but I feel more comfortable in Python). I don't think there is a way to get time timings in plain SQL. hp -- _ | Peter J. Holzer| we build much bigger, better disasters now |_|_) || because we have muc
Re: Questions about btree_gin vs btree_gist for low cardinality columns
Peter, Thanks a lot for the remedial help on EXPLAIN and timing results.
Re: psql: FATAL: the database system is starting up
On 6/1/19 8:07 PM, Tom K wrote: https://www.postgresql.org/docs/10/app-postgres.html Single-User Mode ... and see if that at least gets the server started. This is a highly restricted so do not expect much usability. These servers did crash before however didn't' notice anything particularly wrong at the time. Perhaps a better check was in order. Here is the result from attempting to start the DB in single user mode. -bash-4.2$ /usr/pgsql-10/bin/postgres --single -D /data/patroni --config-file=/data/patroni/postgresql.conf 2019-06-01 23:06:28.679 EDT [31969] LOG: database system was interrupted; last known up at 2019-04-28 06:06:24 EDT 2019-06-01 23:06:28.953 EDT [31969] LOG: invalid record length at 0/4C35CDF8: wanted 24, got 0 2019-06-01 23:06:28.953 EDT [31969] LOG: invalid primary checkpoint record 2019-06-01 23:06:28.953 EDT [31969] LOG: using previous checkpoint record at 0/4C34EDA8 2019-06-01 23:06:28.953 EDT [31969] PANIC: replication checkpoint has wrong magic 0 instead of 307747550 That comes from origin.c in the logical replication section of the code. Did you have logical replication set up? What was the role of this cluster in the original setup? Aborted -bash-4.2$ -- Adrian Klaver adrian.kla...@aklaver.com
Re: Questions about btree_gin vs btree_gist for low cardinality columns
On Fri, May 24, 2019 at 11:26 AM Jeremy Finzel wrote: > I have been hoping for clearer direction from the community about > specifically btree_gin indexes for low cardinality columns (as well as low > cardinality multi-column indexes). In general there is very little > discussion about this both online and in the docs. Rather, the emphasis > for GIN indexes discussed is always on full text search of JSON indexing, > not btree_gin indexes. > > However, I have never been happy with the options open to me for indexing > low cardinality columns and was hoping this could be a big win. Often I > use partial indexes as a solution, but I really want to know how many use > cases btree_gin could solve better than either a normal btree or a partial > index. > What does "low cardinality" mean here? For example, I have a million different items with an item_id, but on average about 30 copies of each item. The inventory table has a row for each copy (copies are not fully interchangeable, so can't be just be a count in some other table). A million is not usually considered a low number, but I find a gin index (over btree_gin on the item_id) useful here as it produces a much smaller index due to not repeating the item_id each time and due to compressing the tid list, even though there are only about 30 tids to compress. (You mention basically the reciprocal of this, 30 distinct values with 3,000,000 copies each, but I don't if that was your actual use case, or just the case you were reading about in the documents you were reading) > > Here are my main questions: > > 1. > > "The docs say regarding *index only scans*: The index type must support > index-only scans. B-tree indexes always do. GiST and SP-GiST indexes > support index-only scans for some operator classes but not others. Other > index types have no support. The underlying requirement is that the index > must physically store, or else be able to reconstruct, the original data > value for each index entry. As a counterexample, GIN indexes cannot support > index-only scans because each index entry typically holds only part of the > original data value." > > This is confusing to say "B-tree indexes always do" and "GIN indexes > cannot support index-only scans", when we have a btree_gin index type. > Explanation please ??? > B-tree is the name of a specific index implementation in PostgreSQL. That is what is referred to here. btree_gin offers operators to be used in a GIN index to mimic/implement b-tree data structures/algorithms, but that doesn't make it the same thing. It is just a GIN index doing btree things. Perhaps PostgreSQL should use a "brand name" for their specific implementation of the default index type, to distinguish it from the generic algorithm description. > > Is it true that for a btree_gin index on a regular column, "each index > entry typically holds only part of the original data value"? Do these > still not support index only scans? Could they? I can't see why they > shouldn't be able to for a single indexed non-expression field? > For single column using a btree_gin operator, each index entry holds the entire data value. But the system doesn't know about that in any useful way, so it doesn't implement index only scans, other than in the special case where the value does not matter, like a 'count(*)'. Conceptually perhaps this could be fixed, but I don't see it happening. Since an index-only scan is usually not much good with only a single-column index, I don't see much excitement to improve things here. If if there is more than one column in the index, then for GIN the entire value from both columns would not be stored in the same index entry, so in this case it can't use an index-only scan even conceptually to efficiently fetch the value of one column based on the supplied value of another one. > > 2. > > Lack of index only scans is definitely a downside. However, I see > basically identical performance, but way less memory and space usage, for > gin indexes. In terms of read-only performance, if index only scans are > not a factor, why not always recommend btree_gin indexes instead of regular > btree for low cardinality fields, which will yield similar performance but > use far, far less space and resources? > GIN indexes over btree_gin operators do not support inequality or BETWEEN queries efficiently. True read-onlyness is often talked about but rarely achieved, I would be reluctant to design a system around management promises that something won't ever change. Btree indexes are way more thoroughly tested than GIN. When I became interested in using them in more-or-less the way you describe, I started torture testing on them and quickly found some bugs (hopefully all fixed now, but I wouldn't bet my life on it). btree_gin covers the most frequently used data types, but far from all of them. And as described above, multi-column GIN indexes are just entirely different from multi-column B-Tree indexes, and gen
Re: psql: FATAL: the database system is starting up
On Sun, Jun 2, 2019 at 11:47 AM Adrian Klaver wrote: > On 6/1/19 8:07 PM, Tom K wrote: > > > > > https://www.postgresql.org/docs/10/app-postgres.html > > Single-User Mode > > ... > > > > and see if that at least gets the server started. This is a highly > > restricted so do not expect much usability. > > > > > > These servers did crash before however didn't' notice anything > > particularly wrong at the time. Perhaps a better check was in order. > > > > Here is the result from attempting to start the DB in single user mode. > > > -bash-4.2$ /usr/pgsql-10/bin/postgres --single -D /data/patroni > > --config-file=/data/patroni/postgresql.conf > > 2019-06-01 23:06:28.679 EDT [31969] LOG: database system was > > interrupted; last known up at 2019-04-28 06:06:24 EDT > > 2019-06-01 23:06:28.953 EDT [31969] LOG: invalid record length at > > 0/4C35CDF8: wanted 24, got 0 > > 2019-06-01 23:06:28.953 EDT [31969] LOG: invalid primary checkpoint > record > > 2019-06-01 23:06:28.953 EDT [31969] LOG: using previous checkpoint > > record at 0/4C34EDA8 > > 2019-06-01 23:06:28.953 EDT [31969] PANIC: replication checkpoint has > > wrong magic 0 instead of 307747550 > > That comes from origin.c in the logical replication section of the code. > > Did you have logical replication set up? > Nope. wal_level was set to replica, not logical. Unless you mean > What was the role of this cluster in the original setup? > The cluster was the backend database for a number of applications. The aim was to point applications to a single large cluster instead of a number of small instances of postgres running all over the LAB. If I can get through the above error and get to the DB's and tables, I could just dump them out and reinit the entire DB from the backup. > > > Aborted > > -bash-4.2$ > > > > -- > Adrian Klaver > adrian.kla...@aklaver.com >
Re: psql: FATAL: the database system is starting up
On 6/2/19 11:14 AM, Tom K wrote: Nope. wal_level was set to replica, not logical. Unless you mean What was the role of this cluster in the original setup? The cluster was the backend database for a number of applications. The aim was to point applications to a single large cluster instead of a number of small instances of postgres running all over the LAB. If I can get through the above error and get to the DB's and tables, I could just dump them out and reinit the entire DB from the backup. The only thing I could find is the below: http://www.jebriggs.com/blog/2018/05/postgresql-and-panic-replication-checkpoint-has-wrong-magic-error/ If that does not work I would suggest sending a new post(new subject) to the mailing list based on: replication checkpoint has wrong magic 0 instead of 307747550 -- Adrian Klaver adrian.kla...@aklaver.com
Re: Questions about btree_gin vs btree_gist for low cardinality columns
Jeff Janes writes: > On Fri, May 24, 2019 at 11:26 AM Jeremy Finzel wrote: >> I have been hoping for clearer direction from the community about >> specifically btree_gin indexes for low cardinality columns (as well as low >> cardinality multi-column indexes). In general there is very little >> discussion about this both online and in the docs. Rather, the emphasis >> for GIN indexes discussed is always on full text search of JSON indexing, >> not btree_gin indexes. I just wanted to mention that Jeremy and I had a bit of hallway-track discussion about this at PGCon. The core thing to note is that the GIN index type was designed to deal with data types that are subdividable and you want to search for individual component values (array elements, lexemes in a text document, etc). The btree_gin extension abuses this by just storing the whole values as if they were components. AFAIR, the original idea for writing both btree_gin and btree_gist was to allow creating a single multicolumn index that covers both subdividable and non-subdividable columns. The idea that btree_gin might be used on its own wasn't really on the radar, I don't think. However, now that GIN can compress multiple index entries for the same component value (which has only been true since 9.4, whereas btree_gin is very much older than that) it seems like it does make sense to use btree_gin on its own for low-cardinality non-subdividable columns. And that means that we ought to consider non-subdividable columns as fully legitimate, not just a weird corner usage. So in particular I wonder whether it would be worth adding the scaffolding necessary to support index-only scan when the GIN opclass is one that doesn't subdivide the data values. That leaves me quibbling with some points in Jeff's otherwise excellent reply: > For single column using a btree_gin operator, each index entry holds the > entire data value. But the system doesn't know about that in any useful > way, so it doesn't implement index only scans, other than in the special > case where the value does not matter, like a 'count(*)'. Conceptually > perhaps this could be fixed, but I don't see it happening. Since an > index-only scan is usually not much good with only a single-column index, I > don't see much excitement to improve things here. I'm confused by this; surely IOS is useful even with a single-column index? Avoiding trips to the heap is always helpful. > If if there is more than > one column in the index, then for GIN the entire value from both columns > would not be stored in the same index entry, so in this case it can't use > an index-only scan even conceptually to efficiently fetch the value of one > column based on the supplied value of another one. Yeah, that is a nasty issue. You could do it, I think, by treating the additional column as being queried even though there is no WHERE constraint on it --- but that would lead to scanning all the index entries which would be very slow. Maybe still faster than a seqscan though? But I suspect we'd end up wanting to extend the notion of "IOS" to say that GIN can only return columns that have an indexable constraint, which is a little weird. At the very least we'd have to teach gincostestimate about this, or we could make very bad plan choices. Anyway, I said to Jeremy in the hallway that it might not be that hard to bolt IOS support onto GIN for cases where the opclass is a non-subdividing one, but after looking at the code I'm less sure about that. GIN hasn't even got an "amgettuple" code path, just "amgetbitmap", and a big part of the reason why is the need to merge results from the fastupdate pending list with results from the main index area. Not sure how we could deal with that. > GIN indexes over btree_gin operators do not support inequality or BETWEEN > queries efficiently. Are you sure about that? It's set up to use the "partial match" logic, which is certainly pretty weird, but it does have the potential for handling inequalities efficiently. [ pokes at it ... ] Hm, looks like it does handle individual inequality conditions reasonably well, but it's not smart about the combination of a greater-than and a less-than constraint on the same column. It ends up scanning all the entries satisfying the one condition, plus all the entries satisfying the other, then intersecting those sets --- which of course comprise the whole table :-(. I think maybe this could be made better without a huge amount of work though. > ... I do recall that > the replay of GIN WAL onto a standby was annoying slow, but I haven't > tested it with your particular set up and not in the most recent versions. A quick look at the commit log says that some work was done in this area for 9.4, and more in v12, but I've not tried to measure the results. Anyway, the larger point here is that right now btree_gin is just a quick hack, and it seems like it might be worth putting some more effort into it, because the addition of
PG10 upgrade issue
Hello We have done pg10 upgrades from 9.6 on all our environments successfully. We have an automation script that does this for us. One environment is a near replica of our production setup and all works fine. We have now made 2 attempts at doing this on production - both ending up with rollback The key issue we have is that all queries were incredibly slow. The PG 10 postmaster workers would all show 100% cpu, especially on the replicas. We even increased CPUs but that made no difference. I can try and provide logs info etc, but hoping someone has a clue about this Thx Zahir
Re: PG10 upgrade issue
Zahir Lalani writes: > We have done pg10 upgrades from 9.6 on all our environments successfully. We > have an automation script that does this for us. One environment is a near > replica of our production setup and all works fine. > We have now made 2 attempts at doing this on production - both ending up with > rollback > The key issue we have is that all queries were incredibly slow. The PG 10 > postmaster workers would all show 100% cpu, especially on the replicas. We > even increased CPUs but that made no difference. Did you make sure to rebuild statistics (re-ANALYZE) after the upgrade? pg_upgrade makes a script for that, but I don't believe it runs it for you. regards, tom lane
Re: PG10 upgrade issue
Ah no we did not! Will give that a go next tim Tom Thank you Z On 3 Jun 2019 00:18, Tom Lane wrote: Zahir Lalani writes: > We have done pg10 upgrades from 9.6 on all our environments successfully. We > have an automation script that does this for us. One environment is a near > replica of our production setup and all works fine. > We have now made 2 attempts at doing this on production - both ending up with > rollback > The key issue we have is that all queries were incredibly slow. The PG 10 > postmaster workers would all show 100% cpu, especially on the replicas. We > even increased CPUs but that made no difference. Did you make sure to rebuild statistics (re-ANALYZE) after the upgrade? pg_upgrade makes a script for that, but I don't believe it runs it for you. regards, tom lane
Re: Questions about btree_gin vs btree_gist for low cardinality columns
Thanks to Tom Lane and Jeff Janes for chiming in with the level of detail they're able to provide. As an outsider-who-now-loves-Postgres, I don't know the history or deep details of all of the various index types. (Obviously.) As a long-time database programmer, I can say that low-cardinality fields are *very* common cases. So whatever Postgres can offer to make for optimal searches and aggregates on such columns would be of immediate, ongoing, and widespread value. As an example, we're dealing with millions of rows where we often want to find or summarize by a category value. So, maybe 6-10 categories that are used in various queries. It's not realistic for us to anticipate every field combination the category field is going to be involved in to lay down multi-column indexes everywhere. I've used a system that handled this situation with a B-tree for the distinct values, and a subordinate data structure for the associated key (TIDs in PG, I guess.) They either stored a packed list of addresses, or a compressed bitmap on the whole table, depending on the number of associated entries. Seemed to work pretty well for queries. That also sounds very like a btree_gin index in Postgres. (Without the compressed, on-disk bitmap option.) Getting back to the day-to-day, what would you recommend using for a single-column index on a low-cardinality column (really low)? And, yes, we'll happily use blocking queries up front to reduce the number of rows under inspection, but that's 1) not always possible and 2) definitely not always predictable in advance. So I'm looking for the best case for stupid searches ;-)
Re: psql: FATAL: the database system is starting up
Hey Adrian, Fixed it. I saw the post from jebriggs but that didn't work for me so posted here. Anyway, here's how I resolved it: When I ran an strace on the postgres startup line, I got this: open("pg_logical/replorigin_checkpoint", O_RDONLY) = 6 write(2, "2019-06-02 14:50:34.777 EDT [283"..., 1062019-06-02 14:50:34.777 EDT [28362] PANIC: replication checkpoint has wrong magic 0 instead of 307747550 -bash-4.2$ Notice how it's reading *pg_logical/replorigin_checkpoint* even though wal_level = 'replicate' . It sure looks like it's doing logical replication. Anyway, I checked the file and saw this: -bash-4.2$ ls -altri pg_logical/ total 20 67894871 -rw---. 1 postgres postgres 16384 Oct 29 2018 replorigin_checkpoint 136946383 drwx--. 2 postgres postgres 6 Oct 29 2018 snapshots 204367784 drwx--. 2 postgres postgres 6 Oct 29 2018 mappings 67894870 drwx--. 4 postgres postgres65 Apr 28 06:06 . 135326272 drwx--. 21 postgres postgres 4096 Jun 2 14:50 .. -bash-4.2$ Notice the dates and size. All way off. Now I checked the same file on the other nodes: [root@psql03 pg_logical]# ls -altri total 8 68994432 drwx--. 2 postgres postgres6 Oct 29 2018 snapshots 134984156 drwx--. 2 postgres postgres6 Oct 29 2018 mappings 566745 -rw---. 1 postgres postgres8 May 22 01:55 replorigin_checkpoint 566731 drwx--. 4 postgres postgres 65 May 22 01:55 . 89714 drwxr-xr-x. 20 root root 4096 May 22 22:43 .. [root@psql03 pg_logical]# Dates of the other hosts replorigin_checkpoint are much more recent and smaller. So I take the replorigin_checkpoint and copy it over to the former primary node ( From psql03 to psql02(primary) ) and try a restart. Everything started up. Replication started to the other postgres nodes without issues. Verified base/ folder on all nodes to ensure replication is working. Every table is back to the way it was before this entire outage. Adrian, thank you very much for all the help here. It was much appreciated. I've written this up and posted it at the following two links so others can benefit and I don't forget. https://microdevsys.com/wp/panic-replication-checkpoint-has-wrong-magic-0-instead-of-307747550/ And of course a handy way to do backups, which I should have implemented before this happened: https://microdevsys.com/wp/postgresql-pull-backup/ Thx, TK On Sun, Jun 2, 2019 at 4:48 PM Adrian Klaver wrote: > On 6/2/19 11:14 AM, Tom K wrote: > > > > > Nope. wal_level was set to replica, not logical. Unless you mean > > > > > > What was the role of this cluster in the original setup? > > > > > > The cluster was the backend database for a number of applications. The > > aim was to point applications to a single large cluster instead of a > > number of small instances of postgres running all over the LAB. > > > > If I can get through the above error and get to the DB's and tables, I > > could just dump them out and reinit the entire DB from the backup. > > The only thing I could find is the below: > > > http://www.jebriggs.com/blog/2018/05/postgresql-and-panic-replication-checkpoint-has-wrong-magic-error/ > > If that does not work I would suggest sending a new post(new subject) to > the mailing list based on: > > replication checkpoint has wrong magic 0 instead of 307747550 > > > -- > Adrian Klaver > adrian.kla...@aklaver.com > > >