Re: Questions about btree_gin vs btree_gist for low cardinality columns

2019-06-02 Thread Peter J. Holzer
On 2019-06-02 09:10:25 +1000, Morris de Oryx wrote:
> Peter, thanks a lot for picking up on what I started, improving it, and
> reporting back. I thought I was providing timing estimates from the EXPLAIN
> cost dumps. Seems not. Well, there's another thing that I've learned.

The cost is how long the optimizer thinks it will take (in arbitrary
units). But it's just an estimate, and estimates can be off - sometimes
quite dramatically.

To get the real timings with explain, use explain (analyze). I often
combine this with buffers to get I/O stats as well:

wdsah=> explain (analyze, buffers) select min(date) from facttable_stat_fta4 
where partnerregion = 'USA' and sitcr4 = '7522'; 
╔══╗
║   
   QUERY PLAN   
   ║
╟──╢
║ Aggregate  (cost=694.23..694.24 rows=1 width=4) (actual time=7.568..7.568 
rows=1 loops=1) 
   ║
║   Buffers: shared hit=3 read=148 dirtied=114  

   ║
║   ->  Index Scan using facttable_stat_fta4_sitcr4_partnerregion_idx on 
facttable_stat_fta4  (cost=0.57..693.09 rows=455 width=4) (actual 
time=0.515..7.493 rows=624 loops=1) ║
║ Index Cond: (((sitcr4)::text = '7522'::text) AND 
((partnerregion)::text = 'USA'::text))  
║
║ Buffers: shared hit=3 read=148 dirtied=114

   ║
║ Planning time: 0.744 ms   

   ║
║ Execution time: 7.613 ms  

   ║
╚══╝
(7 rows)

And when you don't need the costs, you can turn them off:

wdsah=> explain (analyze, buffers, costs off) select min(date) from 
facttable_stat_fta4 where partnerregion = 'USA' and sitcr4 = '7522'; 
╔╗
║   QUERY PLAN  
 ║
╟╢
║ Aggregate (actual time=0.598..0.598 rows=1 loops=1)   
 ║
║   Buffers: shared hit=140 
 ║
║   ->  Index Scan using facttable_stat_fta4_sitcr4_partnerregion_idx on 
facttable_stat_fta4 (actual time=0.054..0.444 rows=624 loops=1) ║
║ Index Cond: (((sitcr4)::text = '7522'::text) AND 
((partnerregion)::text = 'USA'::text))║
║ Buffers: shared hit=140   
 ║
║ Planning time: 0.749 ms   
 ║
║ Execution time: 0.647 ms  
 ║
╚╝
(7 rows)

See https://www.postgresql.org/docs/current/sql-explain.html for details.

> Can you tell me how you get timing results into state_test_times?

In this case I just entered them manually (cut and paste from psql
\timing output). If I wanted to repeat that test on another database, I
would write a Python script (I'm sure you can do that in pgsql, too, but
I feel more comfortable in Python). I don't think there is a way to get
time timings in plain SQL.

hp

-- 
   _  | Peter J. Holzer| we build much bigger, better disasters now
|_|_) || because we have muc

Re: Questions about btree_gin vs btree_gist for low cardinality columns

2019-06-02 Thread Morris de Oryx
Peter,

Thanks a lot for the remedial help on EXPLAIN and timing results.


Re: psql: FATAL: the database system is starting up

2019-06-02 Thread Adrian Klaver

On 6/1/19 8:07 PM, Tom K wrote:



https://www.postgresql.org/docs/10/app-postgres.html
Single-User Mode
...

and see if that at least gets the server started. This is a highly
restricted so do not expect much usability.


These servers did crash before however didn't' notice anything 
particularly wrong at the time. Perhaps a better check was in order.


Here is the result from attempting to start the DB in single user mode.


-bash-4.2$ /usr/pgsql-10/bin/postgres --single -D /data/patroni 
--config-file=/data/patroni/postgresql.conf
2019-06-01 23:06:28.679 EDT [31969] LOG:  database system was 
interrupted; last known up at 2019-04-28 06:06:24 EDT
2019-06-01 23:06:28.953 EDT [31969] LOG:  invalid record length at 
0/4C35CDF8: wanted 24, got 0

2019-06-01 23:06:28.953 EDT [31969] LOG:  invalid primary checkpoint record
2019-06-01 23:06:28.953 EDT [31969] LOG:  using previous checkpoint 
record at 0/4C34EDA8
2019-06-01 23:06:28.953 EDT [31969] PANIC:  replication checkpoint has 
wrong magic 0 instead of 307747550


That comes from origin.c in the logical replication section of the code.

Did you have logical replication set up?

What was the role of this cluster in the original setup?


Aborted
-bash-4.2$



--
Adrian Klaver
adrian.kla...@aklaver.com




Re: Questions about btree_gin vs btree_gist for low cardinality columns

2019-06-02 Thread Jeff Janes
On Fri, May 24, 2019 at 11:26 AM Jeremy Finzel  wrote:

> I have been hoping for clearer direction from the community about
> specifically btree_gin indexes for low cardinality columns (as well as low
> cardinality multi-column indexes).  In general there is very little
> discussion about this both online and in the docs.  Rather, the emphasis
> for GIN indexes discussed is always on full text search of JSON indexing,
> not btree_gin indexes.
>
> However, I have never been happy with the options open to me for indexing
> low cardinality columns and was hoping this could be a big win.  Often I
> use partial indexes as a solution, but I really want to know how many use
> cases btree_gin could solve better than either a normal btree or a partial
> index.
>

What does "low cardinality" mean here?  For example, I have a million
different items with an item_id, but on average about 30 copies of each
item.  The inventory table has a row for each copy (copies are not fully
interchangeable, so can't be just be a count in some other table).  A
million is not usually considered a low number, but I find a gin index
(over btree_gin on the item_id) useful here as it produces a much smaller
index due to not repeating the item_id each time and due to compressing the
tid list, even though there are only about 30 tids to compress.

(You mention basically the reciprocal of this, 30 distinct values with
3,000,000 copies each, but I don't if that was your actual use case, or
just the case you were reading about in the documents you were reading)


>
> Here are my main questions:
>
> 1.
>
> "The docs say regarding *index only scans*: The index type must support
> index-only scans. B-tree indexes always do. GiST and SP-GiST indexes
> support index-only scans for some operator classes but not others. Other
> index types have no support. The underlying requirement is that the index
> must physically store, or else be able to reconstruct, the original data
> value for each index entry. As a counterexample, GIN indexes cannot support
> index-only scans because each index entry typically holds only part of the
> original data value."
>
> This is confusing to say "B-tree indexes always do" and "GIN indexes
> cannot support index-only scans", when we have a btree_gin index type.
> Explanation please ???
>

B-tree is the name of a specific index implementation in PostgreSQL.  That
is what is referred to here.   btree_gin offers operators to be used in a
GIN index to mimic/implement b-tree data structures/algorithms, but that
doesn't make it the same thing. It is just a GIN index doing btree things.
Perhaps PostgreSQL should use a "brand name" for their specific
implementation of the default index type, to distinguish it from the
generic algorithm description.


>
> Is it true that for a btree_gin index on a regular column, "each index
> entry typically holds only part of the original data value"?  Do these
> still not support index only scans?  Could they?  I can't see why they
> shouldn't be able to for a single indexed non-expression field?
>

For single column using a btree_gin operator, each index entry holds the
entire data value.  But the system doesn't know about that in any useful
way, so it doesn't implement index only scans, other than in the special
case where the value does not matter, like a 'count(*)'.  Conceptually
perhaps this could be fixed, but I don't see it happening. Since an
index-only scan is usually not much good with only a single-column index, I
don't see much excitement to improve things here. If if there is more than
one column in the index, then for GIN the entire value from both columns
would not be stored in the same index entry, so in this case it can't use
an index-only scan even conceptually to efficiently fetch the value of one
column based on the supplied value of another one.


>
> 2.
>
> Lack of index only scans is definitely a downside.  However, I see
> basically identical performance, but way less memory and space usage, for
> gin indexes.  In terms of read-only performance, if index only scans are
> not a factor, why not always recommend btree_gin indexes instead of regular
> btree for low cardinality fields, which will yield similar performance but
> use far, far less space and resources?
>

GIN indexes over btree_gin operators do not support inequality or BETWEEN
queries efficiently.  True read-onlyness is often talked about but rarely
achieved, I would be reluctant to design a system around management
promises that something won't ever change.  Btree indexes are way more
thoroughly tested than GIN.  When I became interested in using them in
more-or-less the way you describe, I started torture testing on them and
quickly found some bugs (hopefully all fixed now, but I wouldn't bet my
life on it).  btree_gin covers the most frequently used data types, but far
from all of them.  And as described above, multi-column GIN indexes are
just entirely different from multi-column B-Tree indexes, and gen

Re: psql: FATAL: the database system is starting up

2019-06-02 Thread Tom K
On Sun, Jun 2, 2019 at 11:47 AM Adrian Klaver 
wrote:

> On 6/1/19 8:07 PM, Tom K wrote:
>
> >
> > https://www.postgresql.org/docs/10/app-postgres.html
> > Single-User Mode
> > ...
> >
> > and see if that at least gets the server started. This is a highly
> > restricted so do not expect much usability.
> >
> >
> > These servers did crash before however didn't' notice anything
> > particularly wrong at the time. Perhaps a better check was in order.
> >
> > Here is the result from attempting to start the DB in single user mode.
>
> > -bash-4.2$ /usr/pgsql-10/bin/postgres --single -D /data/patroni
> > --config-file=/data/patroni/postgresql.conf
> > 2019-06-01 23:06:28.679 EDT [31969] LOG:  database system was
> > interrupted; last known up at 2019-04-28 06:06:24 EDT
> > 2019-06-01 23:06:28.953 EDT [31969] LOG:  invalid record length at
> > 0/4C35CDF8: wanted 24, got 0
> > 2019-06-01 23:06:28.953 EDT [31969] LOG:  invalid primary checkpoint
> record
> > 2019-06-01 23:06:28.953 EDT [31969] LOG:  using previous checkpoint
> > record at 0/4C34EDA8
> > 2019-06-01 23:06:28.953 EDT [31969] PANIC:  replication checkpoint has
> > wrong magic 0 instead of 307747550
>
> That comes from origin.c in the logical replication section of the code.
>
> Did you have logical replication set up?
>

Nope. wal_level was set to replica, not logical.  Unless you mean


> What was the role of this cluster in the original setup?
>

The cluster was the backend database for a number of applications.  The aim
was to point applications to a single large cluster instead of a number of
small instances of postgres running all over the LAB.

If I can get through the above error and get to the DB's and tables, I
could just dump them out and reinit the entire DB from the backup.


>
> > Aborted
> > -bash-4.2$
> >
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>


Re: psql: FATAL: the database system is starting up

2019-06-02 Thread Adrian Klaver

On 6/2/19 11:14 AM, Tom K wrote:



Nope. wal_level was set to replica, not logical.  Unless you mean


What was the role of this cluster in the original setup?


The cluster was the backend database for a number of applications.  The 
aim was to point applications to a single large cluster instead of a 
number of small instances of postgres running all over the LAB.


If I can get through the above error and get to the DB's and tables, I 
could just dump them out and reinit the entire DB from the backup.


The only thing I could find is the below:

http://www.jebriggs.com/blog/2018/05/postgresql-and-panic-replication-checkpoint-has-wrong-magic-error/

If that does not work I would suggest sending a new post(new subject) to 
the mailing list based on:


replication checkpoint has wrong magic 0 instead of 307747550


--
Adrian Klaver
adrian.kla...@aklaver.com




Re: Questions about btree_gin vs btree_gist for low cardinality columns

2019-06-02 Thread Tom Lane
Jeff Janes  writes:
> On Fri, May 24, 2019 at 11:26 AM Jeremy Finzel  wrote:
>> I have been hoping for clearer direction from the community about
>> specifically btree_gin indexes for low cardinality columns (as well as low
>> cardinality multi-column indexes).  In general there is very little
>> discussion about this both online and in the docs.  Rather, the emphasis
>> for GIN indexes discussed is always on full text search of JSON indexing,
>> not btree_gin indexes.

I just wanted to mention that Jeremy and I had a bit of hallway-track
discussion about this at PGCon.  The core thing to note is that the GIN
index type was designed to deal with data types that are subdividable
and you want to search for individual component values (array elements,
lexemes in a text document, etc).  The btree_gin extension abuses this
by just storing the whole values as if they were components.  AFAIR,
the original idea for writing both btree_gin and btree_gist was to allow
creating a single multicolumn index that covers both subdividable and
non-subdividable columns.  The idea that btree_gin might be used on its
own wasn't really on the radar, I don't think.

However, now that GIN can compress multiple index entries for the same
component value (which has only been true since 9.4, whereas btree_gin
is very much older than that) it seems like it does make sense to use
btree_gin on its own for low-cardinality non-subdividable columns.
And that means that we ought to consider non-subdividable columns as
fully legitimate, not just a weird corner usage.  So in particular
I wonder whether it would be worth adding the scaffolding necessary
to support index-only scan when the GIN opclass is one that doesn't
subdivide the data values.

That leaves me quibbling with some points in Jeff's otherwise excellent
reply:

> For single column using a btree_gin operator, each index entry holds the
> entire data value.  But the system doesn't know about that in any useful
> way, so it doesn't implement index only scans, other than in the special
> case where the value does not matter, like a 'count(*)'.  Conceptually
> perhaps this could be fixed, but I don't see it happening. Since an
> index-only scan is usually not much good with only a single-column index, I
> don't see much excitement to improve things here.

I'm confused by this; surely IOS is useful even with a single-column
index?  Avoiding trips to the heap is always helpful.

> If if there is more than
> one column in the index, then for GIN the entire value from both columns
> would not be stored in the same index entry, so in this case it can't use
> an index-only scan even conceptually to efficiently fetch the value of one
> column based on the supplied value of another one.

Yeah, that is a nasty issue.  You could do it, I think, by treating the
additional column as being queried even though there is no WHERE
constraint on it --- but that would lead to scanning all the index entries
which would be very slow.  Maybe still faster than a seqscan though?
But I suspect we'd end up wanting to extend the notion of "IOS" to say
that GIN can only return columns that have an indexable constraint,
which is a little weird.  At the very least we'd have to teach
gincostestimate about this, or we could make very bad plan choices.

Anyway, I said to Jeremy in the hallway that it might not be that
hard to bolt IOS support onto GIN for cases where the opclass is
a non-subdividing one, but after looking at the code I'm less sure
about that.  GIN hasn't even got an "amgettuple" code path, just
"amgetbitmap", and a big part of the reason why is the need to merge
results from the fastupdate pending list with results from the main
index area.  Not sure how we could deal with that.

> GIN indexes over btree_gin operators do not support inequality or BETWEEN
> queries efficiently.

Are you sure about that?  It's set up to use the "partial match" logic,
which is certainly pretty weird, but it does have the potential for
handling inequalities efficiently.  [ pokes at it ... ]  Hm, looks like
it does handle individual inequality conditions reasonably well, but
it's not smart about the combination of a greater-than and a less-than
constraint on the same column.  It ends up scanning all the entries
satisfying the one condition, plus all the entries satisfying the other,
then intersecting those sets --- which of course comprise the whole table
:-(.  I think maybe this could be made better without a huge amount of
work though.

> ... I do recall that
> the replay of GIN WAL  onto a standby was annoying slow, but I haven't
> tested it with your particular set up and not in the most recent versions.

A quick look at the commit log says that some work was done in this area
for 9.4, and more in v12, but I've not tried to measure the results.

Anyway, the larger point here is that right now btree_gin is just a quick
hack, and it seems like it might be worth putting some more effort into
it, because the addition of

PG10 upgrade issue

2019-06-02 Thread Zahir Lalani
Hello

We have done pg10 upgrades from 9.6 on all our environments successfully. We 
have an automation script that does this for us. One environment is a near 
replica of our production setup and all works fine.

We have now made 2 attempts at doing this on production - both ending up with 
rollback

The key issue we have is that all queries were incredibly slow. The PG 10 
postmaster workers would all show 100% cpu, especially on the replicas. We even 
increased CPUs but that made no difference.

I can try and provide logs info etc, but hoping someone has a clue about this

Thx

Zahir


Re: PG10 upgrade issue

2019-06-02 Thread Tom Lane
Zahir Lalani  writes:
> We have done pg10 upgrades from 9.6 on all our environments successfully. We 
> have an automation script that does this for us. One environment is a near 
> replica of our production setup and all works fine.

> We have now made 2 attempts at doing this on production - both ending up with 
> rollback

> The key issue we have is that all queries were incredibly slow. The PG 10 
> postmaster workers would all show 100% cpu, especially on the replicas. We 
> even increased CPUs but that made no difference.

Did you make sure to rebuild statistics (re-ANALYZE) after the upgrade?
pg_upgrade makes a script for that, but I don't believe it runs it for
you.

regards, tom lane




Re: PG10 upgrade issue

2019-06-02 Thread Zahir Lalani
Ah no we did not!

Will give that a go next tim Tom

Thank you

Z

On 3 Jun 2019 00:18, Tom Lane  wrote:
Zahir Lalani  writes:
> We have done pg10 upgrades from 9.6 on all our environments successfully. We 
> have an automation script that does this for us. One environment is a near 
> replica of our production setup and all works fine.

> We have now made 2 attempts at doing this on production - both ending up with 
> rollback

> The key issue we have is that all queries were incredibly slow. The PG 10 
> postmaster workers would all show 100% cpu, especially on the replicas. We 
> even increased CPUs but that made no difference.

Did you make sure to rebuild statistics (re-ANALYZE) after the upgrade?
pg_upgrade makes a script for that, but I don't believe it runs it for
you.

regards, tom lane


Re: Questions about btree_gin vs btree_gist for low cardinality columns

2019-06-02 Thread Morris de Oryx
Thanks to Tom Lane and Jeff Janes for chiming in with the level of detail
they're able to provide.

As an outsider-who-now-loves-Postgres, I don't know the history or deep
details of all of the various index types. (Obviously.) As a long-time
database programmer, I can say that low-cardinality fields are *very*
common cases. So whatever Postgres can offer to make for optimal searches
and aggregates on such columns would be of immediate, ongoing, and
widespread value.

As an example, we're dealing with millions of rows where we often want to
find or summarize by a category value. So, maybe 6-10 categories that are
used in various queries. It's not realistic for us to anticipate every
field combination the category field is going to be involved in to lay down
multi-column indexes everywhere.

I've used a system that handled this situation with a B-tree for the
distinct values, and a subordinate data structure for the associated key
(TIDs in PG, I guess.) They either stored a packed list of addresses, or a
compressed bitmap on the whole table, depending on the number of associated
entries.  Seemed to work pretty well for queries. That also sounds very
like a btree_gin index in Postgres. (Without the compressed, on-disk bitmap
option.)

Getting back to the day-to-day, what would you recommend using for a
single-column index on a low-cardinality column (really low)? And, yes,
we'll happily use blocking queries up front to reduce the number of rows
under inspection, but that's 1) not always possible and 2) definitely not
always predictable in advance. So I'm looking for the best case for stupid
searches ;-)


Re: psql: FATAL: the database system is starting up

2019-06-02 Thread Tom K
Hey Adrian,

Fixed it.

I saw the post from jebriggs but that didn't work for me so posted here.
Anyway, here's how I resolved it:

When I ran an strace on the postgres startup line, I got this:

open("pg_logical/replorigin_checkpoint", O_RDONLY) = 6
write(2, "2019-06-02 14:50:34.777 EDT [283"..., 1062019-06-02 14:50:34.777
EDT [28362] PANIC:  replication checkpoint has wrong magic 0 instead of
307747550
-bash-4.2$

Notice how it's reading *pg_logical/replorigin_checkpoint* even though
wal_level = 'replicate' .  It sure looks like it's doing logical
replication.  Anyway, I checked the file and saw this:

-bash-4.2$ ls -altri pg_logical/
total 20
 67894871 -rw---.  1 postgres postgres 16384 Oct 29  2018
replorigin_checkpoint
136946383 drwx--.  2 postgres postgres 6 Oct 29  2018 snapshots
204367784 drwx--.  2 postgres postgres 6 Oct 29  2018 mappings
 67894870 drwx--.  4 postgres postgres65 Apr 28 06:06 .
135326272 drwx--. 21 postgres postgres  4096 Jun  2 14:50 ..
-bash-4.2$

Notice the dates and size.  All way off.  Now I checked the same file on
the other nodes:

[root@psql03 pg_logical]# ls -altri
total 8
 68994432 drwx--.  2 postgres postgres6 Oct 29  2018 snapshots
134984156 drwx--.  2 postgres postgres6 Oct 29  2018 mappings
   566745 -rw---.  1 postgres postgres8 May 22 01:55
replorigin_checkpoint
   566731 drwx--.  4 postgres postgres   65 May 22 01:55 .
89714 drwxr-xr-x. 20 root root 4096 May 22 22:43 ..
[root@psql03 pg_logical]#

Dates of the other hosts replorigin_checkpoint are much more recent and
smaller.

So I take the replorigin_checkpoint and copy it over to the former primary
node  (  From psql03 to psql02(primary)  )  and try a restart.

Everything started up.  Replication started to the other postgres nodes
without issues.  Verified base/ folder on all nodes to ensure replication
is working. Every table is back to the way it was before this entire
outage.

Adrian, thank you very much for all the help here.  It was much
appreciated.  I've written this up and posted it at the following two links
so others can benefit and I don't forget.

https://microdevsys.com/wp/panic-replication-checkpoint-has-wrong-magic-0-instead-of-307747550/


And of course a handy way to do backups, which I should have implemented
before this happened:

https://microdevsys.com/wp/postgresql-pull-backup/

Thx,
TK

On Sun, Jun 2, 2019 at 4:48 PM Adrian Klaver 
wrote:

> On 6/2/19 11:14 AM, Tom K wrote:
>
> >
> > Nope. wal_level was set to replica, not logical.  Unless you mean
> >
> >
> > What was the role of this cluster in the original setup?
> >
> >
> > The cluster was the backend database for a number of applications.  The
> > aim was to point applications to a single large cluster instead of a
> > number of small instances of postgres running all over the LAB.
> >
> > If I can get through the above error and get to the DB's and tables, I
> > could just dump them out and reinit the entire DB from the backup.
>
> The only thing I could find is the below:
>
>
> http://www.jebriggs.com/blog/2018/05/postgresql-and-panic-replication-checkpoint-has-wrong-magic-error/
>
> If that does not work I would suggest sending a new post(new subject) to
> the mailing list based on:
>
> replication checkpoint has wrong magic 0 instead of 307747550
>
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>
>
>