Re: How to perform PITR when all of the logs won't fit on the drive

2018-03-02 Thread Jeff Janes
On Thu, Mar 1, 2018 at 2:28 PM, Tony Sullivan 
wrote:

> Hello,
>
> I have a situation where something was deleted from a database that
> shouldn't have been so I am having to take a base backup and perform a
> point-in-time-recovery. The problem I have is that the decompressed WAL
> files will not fit on the drive of the machine I am trying to do the
> restore on.
>
> I am wondering if I can arrange the WAL files by date and copy them to the
> directory where they belong and then copy another batch when those are
> restored or if I will need to find some other way of performing the
> recovery.
>
>
You could do that if your timing is right.  You need to either make sure
the next batch shows up before the first file in that batch is requested,
or have your restore command wait and retry rather than throw an error when
it asks for a file that does not exist.  But your restore command can copy
them from a network drive, or remotely with scp or rsync, and also
decompress them on the fly.  That seems simpler.
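For example, a minimal sketch of a restore_command that pulls gzip-compressed
WAL from a remote archive host over ssh and decompresses it on the fly (host
name and paths are hypothetical):

# in recovery.conf on the machine doing the restore
restore_command = 'ssh archive-host "gunzip -c /wal_archive/%f.gz" > "%p"'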

Another problem you might run into is that the restored WAL records are
retained in pg_xlog for two restart points before being removed, and until
fairly recent versions of PostgreSQL restart points were governed only by
checkpoint_timeout, and not by WAL volume.  So if your system restores far
faster than it took to generate the WAL in the first place, this could lead
to massive amounts of WAL kept in pg_xlog running you out of disk space.
So you should lower checkpoint_timeout for recovery to be much less than it
was in production.
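For example (the value is illustrative; set it only on the machine doing the
recovery, and put it back afterwards):

# postgresql.conf during the recovery run
checkpoint_timeout = 2min     # much lower than a typical production setting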

Cheers,

Jeff


Re: Sum of written buffers bigger than allocation?

2018-04-08 Thread Jeff Janes
On Sun, Apr 8, 2018 at 11:28 AM, pinker  wrote:

>
>
> It's kinda mysterious... Any explanations?
> Does it mean that the same buffers were written over and over again?
>
>
Yeah, checkpoints will write all dirty buffers, but don't evict them.
Next time the page is needed, it doesn't need to be re-read as it is still
there.
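If you want to see that at the system level, pg_stat_bgwriter breaks buffer
writes down by who performed them, e.g.:

select checkpoints_timed, checkpoints_req,
       buffers_checkpoint, buffers_clean, buffers_backend, buffers_alloc
from pg_stat_bgwriter;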

Cheers,

Jeff


Re: Fast logical replication jump start with PG 10

2018-05-26 Thread Jeff Janes
On Fri, May 25, 2018 at 5:12 PM, Olivier Gautherot 
wrote:

Is there a way to speed up the replication or should I rather stick to
> streaming replication? As I have only 1 database on the server, it would
> not be a show-stopper.
>

You have a method that works, and a client that is already twitchy about
downtime and only upgrades their database once every 5 years.

I would not even consider the idea of combining a major-version upgrade
with a complete change-over in replication technology in a single step in
this situation.

If you will get some kind of benefit from switching to logical replication,
you could first upgrade production and get a new physical replica going,
then once that is returned to production you can create a new logical
replica and get it all synced over at your leisure, then get it all tested
and then cut the clients over from the physical replica to the logical
replica.
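Once the upgraded primary is in production, the logical replica could then be
set up roughly like this (names and connection string are illustrative):

-- on the primary (PG 10+):
CREATE PUBLICATION all_tables_pub FOR ALL TABLES;

-- on the new logical replica, after loading the schema:
CREATE SUBSCRIPTION all_tables_sub
    CONNECTION 'host=primary.example.com dbname=mydb user=repuser'
    PUBLICATION all_tables_pub;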

 Cheers,

Jeff


Re: Login with LDAP authentication takes 5 seconds

2018-05-30 Thread Jeff Janes
On Mon, May 28, 2018 at 10:26 AM, Andreas Schmid 
wrote:

> Hi,
>
> I configured my PostgreSQL 10 DB on Debian 9.2 with LDAP authentication
> (simple bind mode). While this basically works, it has the strange effect
> that the first login with psql takes around 5 seconds. When I reconnect
> within 60 seconds, the login completes immediately.
>
> The LDAP server is behind a firewall. So for a test, in pg_hba.conf I put
> the LDAP servers IP address instead of its DNS name (for parameter
> ldapserver). Like that, all logins complete immediately. But in general I
> prefer specifying the DNS name rather than the IP.
>
> When I checked on the DB machine with the following commands
> host my.ldap.server.org
> dig my.ldap.server.org
> both always returned the host name and IP address of the LDAP server
> immediately.
>

Out of curiosity, what if you use "ping" rather than "dig" or "host"?

Cheers,

Jeff


Re: LDAP authentication slow

2018-06-03 Thread Jeff Janes
On Thu, May 31, 2018 at 8:23 AM, C GG  wrote:

In the meantime, I did what I promised Adrian Klaver I would do and I added
> the AD servers to the /etc/hosts file. That had an immediate and dramatic
> effect on the performance. That confirms (at least to me) that DNS
> resolution was playing a large part in the slowdown. I'm going to
> experiment with running a local DNS caching server to see if that will give
> the same effect.
>

I had this problem at one site, and with the same solution.  As best as I
could tell, Windows was not using DNS as the main way of resolving
hostnames.  (People assure me that NetBIOS and WINS are almost completely
dead, but Wireshark tells a different tale--I don't recall the exact name,
but it was certainly something other than DNS).  So the fact that AD's
built in DNS sucks was not a problem for Windows users, which means there
was no impetus on the Windows admin to fix it.  And the delay on resolution
was always 5 seconds plus a very small handful of milliseconds.  So it was
clearly some kind of designed throttling or timeout, there is no way random
congestion could get you so close to 5.00 every time.

If we had greater depth of talent on the Windows side, surely we could have
fixed the DNS issue.  But with that greater depth of talent, we would have been using
Kerberos in the first place, like Stephen wants us to.

Cheers,

Jeff


Re: Long running DDL statements blocking all queries

2018-06-03 Thread Jeff Janes
On Thu, May 31, 2018 at 9:19 AM, Ashu Pachauri  wrote:

> There was too much noise in the pg_stat_activity output, so I did not post
> it. I'll collect the output again and post.
>
> But, when I checked in pg_stat_activity, PID 18317 is the session that's
> running the ALTER statement and it was showing up as "active". So, it's not
> blocked by anything, but the fact that the ALTER statement is long running
> and it's blocking the operations that are not even on the same table for
> the entire duration it's running is troubling.
>
>
I think what you really need to know here is what lock it was holding which
was blocking everyone.  That information won't be found in
pg_stat_activity.  It will be found in pg_locks, but the specific query you
ran on that view did not display the columns with that information.  You
need to include in the output all the columns which you used to join
pg_locks to itself.  Yes, it will be voluminous, and most of the rows will
not be relevant, but you don't know which ones are relevant until after you
see the output.
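Something along these lines, for example (just a sketch; it only joins on the
lock-identity columns most commonly involved):

SELECT w.pid AS waiting_pid, w.mode AS wanted_mode,
       h.pid AS holding_pid, h.mode AS held_mode,
       w.locktype, w.relation::regclass, w.transactionid
FROM pg_locks w
JOIN pg_locks h
  ON h.granted AND NOT w.granted
 AND w.locktype = h.locktype
 AND w.database IS NOT DISTINCT FROM h.database
 AND w.relation IS NOT DISTINCT FROM h.relation
 AND w.transactionid IS NOT DISTINCT FROM h.transactionid;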

Cheers,

Jeff


Re: DB size growing exponentially when materialized view refreshed concurrently (postgres 9.6)

2018-06-25 Thread Jeff Janes
On Mon, Jun 25, 2018 at 1:21 PM, Vikas Sharma  wrote:

> Hi All,
>
> I am looking for advice in a issue where two materialized views are being
> refreshed concurrently and dbsize has grown to 150gb from 4gb in two days.
>
> We use two materialized views to keep processed data for faster query
> results for a search function. Earlier materialized views were refreshed
> not concurrently and all was good on DB.
>

Where is the space going?  Does it show up in psql with \l+ ?  Does it show
up in \dm+ ?  Does it only show up using OS tools in the data directory?
Which subdirectory of the data directory?
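For example (the matview name is illustrative):

\l+                -- per-database sizes
\dm+               -- per-materialized-view sizes
select pg_size_pretty(pg_total_relation_size('my_matview'));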

Cheers,

Jeff


Re: About "Cost-based Vacuum Delay"

2018-06-29 Thread Jeff Janes
On Wed, Jun 27, 2018 at 3:19 AM, Laurenz Albe 
wrote:

> Ilyeop Yi wrote:
> > Currently, I am working with a workload that is mostly insert and
> update, and its performance suffers from autovacuum.
>

Do you know what about the autovacuum causes the performance drop?  Is it
the reading, the writing, or the steady stream of fsync calls?  Or the CPU
load, or something else?


> >
> > I've adjusted parameters such as vacuum_cost_delay and
> vacuum_cost_limit, but they have no significant effect.
>

vacuum_cost_delay has no effect on autovacuum,
unless autovacuum_vacuum_cost_delay is set to -1 (which is not the default
setting for it).

I think that any adjustment you make there will not take effect in the
middle of an existing table vacuuming, anyway, as the autovacuum worker
only checks for SIGHUP between tables.


> >
> > So, I would like to find a way to pause a running vacuum during bursty
> insert/update period and resume the vacuum after that period.
> >
> > Is there such a way?
>

You can use the OS tools.  For example, on Linux you could use "kill
-SIGSTOP <pid>", and then "kill -SIGCONT <pid>".  This is not a
recommendation for use in production systems, as there is a small chance
this could cause a stuck spinlock and thus crash the db server.  Or a stuck
LWLock, which would cause other processes to block unexpectedly and
indefinitely.  And if neither of those happens but you forget to do the
SIGCONT, lots of havoc would be created.  It might be safer to use SIGTSTP?

The best solution for preventing the problem from recurring might be just
to manually vacuum the largest tables at a time of your choosing, so that
they will not be likely to become due for autovacuum at the "wrong" time.
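For example, something like this in cron on the database host (schedule,
database and table names are all illustrative):

# vacuum the big table nightly at 02:30, before the bursty period starts
30 2 * * *  psql -d mydb -c 'VACUUM ANALYZE big_table'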


> Please keep the list copied.
>
> You can do
>
>ALTER TABLE mytab SET (autovacuum_enabled = off);
>

But you would have to kill the autovacuum or wait for it to finish the
table naturally before it would take effect.  And the problem might not be
with one particular table being vacuumed.

Cheers,

Jeff


Re: Slow WAL recovery for DROP TABLE

2018-07-17 Thread Jeff Janes
There was a recent commit for a similar performance problem, which will
appear in 9.6.10.  But that was specifically for cases where there were
multiple dropped tables per transaction, and large shared_buffers.

I can't reproduce your single-drop-per-transaction problem.  The replica
has no problem keeping up with the master.

Can you share the reproduction scripts, and any non-default config
settings?  Especially the setting of shared_buffers (on both master and
replica, if different)

Cheers,

Jeff


Re: Dealing with latency to replication slave; what to do?

2018-07-24 Thread Jeff Janes
Please don't top-post, it is not the custom on this list.

On Tue, Jul 24, 2018 at 4:08 PM, Rory Falloon  wrote:

> On Tue, Jul 24, 2018 at 4:02 PM Andres Freund  wrote:
>
Hi,
>>
>> On 2018-07-24 15:39:32 -0400, Rory Falloon wrote:
>> > Looking for any tips here on how to best maintain a replication slave
>> which
>> > is operating under some latency between networks - around 230ms. On a
>> good
>> > day/week, replication will keep up for a number of days, but however,
>> when
>> > the link is under higher than average usage, keeping replication active
>> can
>> > last merely minutes before falling behind again.
>> >
>> > 2018-07-24 18:46:14 GMTLOG:  database system is ready to accept read
>> only
>> > connections
>> > 2018-07-24 18:46:15 GMTLOG:  started streaming WAL from primary at
>> > 2B/9300 on timeline 1
>> > 2018-07-24 18:59:28 GMTLOG:  incomplete startup packet
>> > 2018-07-24 19:15:36 GMTLOG:  incomplete startup packet
>> > 2018-07-24 19:15:36 GMTLOG:  incomplete startup packet
>> > 2018-07-24 19:15:37 GMTLOG:  incomplete startup packet
>> >
>> > As you can see above, it lasted about half an hour before falling out of
>> > sync.
>>
>> How can we see that from the above? The "incomplete startup messages"
>> are independent of streaming rep? I think you need to show us more logs.
>>
>>
>>
> regarding your first reply, I was inferring that from the fact I saw those
> messages at the same time the replication stream fell behind. What other
> logs would be more pertinent to this situation?
>

This is circular.  You think it lost sync because you saw some message you
didn't recognize, and then you think the error message was related to it
losing sync because they occurred at the same time.  What evidence do you
have that it has lost sync at all? From the log file you posted, it seems
the server is running fine and is just getting probed by a port scanner, or
perhaps by a monitoring tool.

If it had lost sync, you would be getting log messages about "requested WAL
segment has already been removed".

Cheers,

Jeff

On Tue, Jul 24, 2018 at 4:08 PM, Rory Falloon  wrote:

> Hi Andres,
>
> regarding your first reply, I was inferring that from the fact I saw those
> messages at the same time the replication stream fell behind. What other
> logs would be more pertinent to this situation?
>
>
>
> On Tue, Jul 24, 2018 at 4:02 PM Andres Freund  wrote:
>
>> Hi,
>>
>> On 2018-07-24 15:39:32 -0400, Rory Falloon wrote:
>> > Looking for any tips here on how to best maintain a replication slave
>> which
>> > is operating under some latency between networks - around 230ms. On a
>> good
>> > day/week, replication will keep up for a number of days, but however,
>> when
>> > the link is under higher than average usage, keeping replication active
>> can
>> > last merely minutes before falling behind again.
>> >
>> > 2018-07-24 18:46:14 GMTLOG:  database system is ready to accept read
>> only
>> > connections
>> > 2018-07-24 18:46:15 GMTLOG:  started streaming WAL from primary at
>> > 2B/9300 on timeline 1
>> > 2018-07-24 18:59:28 GMTLOG:  incomplete startup packet
>> > 2018-07-24 19:15:36 GMTLOG:  incomplete startup packet
>> > 2018-07-24 19:15:36 GMTLOG:  incomplete startup packet
>> > 2018-07-24 19:15:37 GMTLOG:  incomplete startup packet
>> >
>> > As you can see above, it lasted about half an hour before falling out of
>> > sync.
>>
>> How can we see that from the above? The "incomplete startup messages"
>> are independent of streaming rep? I think you need to show us more logs.
>>
>>
>> > On the master, I have wal_keep_segments=128. What is happening when I
>> see
>> > "incomplete startup packet" - is it simply the slave has fallen behind,
>> > and  cannot 'catch up' using the wal segments quick enough? I assume the
>> > slave is using the wal segments to replay changes and assuming there are
>> > enough wal segments to cover the period it cannot stream properly, it
>> will
>> > eventually recover?
>>
>> You might want to look into replication slots to ensure the primary
>> keeps the necessary segments, but not more, around.  You might also want
>> to look at wal_compression, to reduce the bandwidth usage.
>>
>> Greetings,
>>
>> Andres Freund
>>
>


Re: Advice on logging strategy

2018-10-11 Thread Jeff Janes
On Thu, Oct 11, 2018 at 6:27 AM Mike Martin  wrote:

> I have a question on logging strategy
>
> I have logging set to
> log_statement = 'all' on a network database with logging set to csv so I
> can import it to a logging table
>
> However the database is populated via a nightly routine downloading data
> via REST API using prepared statements
>
> This results in enormous log files which take ages to import using copy
> because each execute statement is logged with the parameters chosen
>
> Is there any way around this?
>

One option is to convert to using COPY...FROM STDIN rather than prepared
INSERTs.

Another is to create a user specifically for bulk population, and do a
'ALTER USER bulk_load SET log_statement = none' to override the global
log_statement setting.
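A sketch of the second option (role name is illustrative; the role also needs
whatever privileges your loader uses, and changing log_statement, even
per-role, requires a superuser to run the ALTER):

CREATE ROLE bulk_load LOGIN;
ALTER ROLE bulk_load SET log_statement = 'none';
-- the nightly REST-API routine then connects as bulk_load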

Cheers,

Jeff


Re: pg_dump backup utility is taking more time around 24hrs to take the backup of 28GB

2018-10-19 Thread Jeff Janes
On Thu, Oct 18, 2018 at 8:26 AM Raghavendra Rao J S V <
raghavendra...@gmail.com> wrote:

> Hi All,
>
> We are using the pg_dump backup utility in order to take the backup of the
> database. Unfortunately, it is taking around 24hrs to take the backup of a
> 28GB database. Please guide me how to reduce the time and whether there is
> any parameter that needs to be modified which will help us reduce the
> backup time. We are using Postgres 9.2 version.
>

What minor version?  How long does it take to do just a pg_dump -s?  Can
you use system tools (like top in Linux) to identify the bottleneck as CPU,
disk, etc.?

Cheers,

Jeff


Re: Databases and servers

2019-08-20 Thread Jeff Janes
On Tue, Aug 20, 2019 at 6:33 AM Karl Martin Skoldebrand <
ks0c77...@techmahindra.com> wrote:

> Hi,
>
> I just discovered that a client has done this:
>
> They have two web applications A1 and A2. They have separate
> hostnames/URLs. Both have a production and a test database A1p and A1t/ A2p
> and A2t.
>
> What they've done is have both A1p and A2p on the same actual database
> server and A1t and A2t on the same server.
>

Are these two PostgreSQL instances running on the same hardware, or two
databases within a single PostgreSQL instance?

> So, I'm thinking - if a bug in application A1 crashes the application and
> database badly it will risk bringing down both services A1 and A2.
>
Is this a common occurrence?  Of all the occurrences of downtime in recent
memory (or better yet, from incident documentation), what were the causes
of them?  Is this near the top of the list?

Also, are the two apps completely independent, or are they used together
such that one being down makes the other one not very useful?

> The same risk would be evident on a successful security breach.
>
On the other hand, more servers means more moving parts, means more
opportunities for mistakes in configuration or maintenance that let
breaches happen.

> I would prefer to have A1p and A2p on separate servers, maybe keeping A1t and
> A2t on the same. (This is what seems to be happening when the database
> servers are being replaced).
>
I don't know what that last part means.

> What is the general thought on the current setup?
>
In my experience, people acting on mere conjectures about what might cause
downtime in the future and how to prevent it have caused more downtime than
they have prevented.

Cheers,

Jeff

>


Re: A question aboout postgresql-server-dev versions

2019-08-25 Thread Jeff Janes
On Sun, Aug 25, 2019 at 8:34 AM stan  wrote:

> I am using Postgresql version 11 on Ubuntu 18.04. I am considering using an
> extension called libphonenumber. It needs Postgresql-server-dev to build. I
> found postgresql-server-dev-all in the list of available packages, BUT when
> I ask apt-get to load it, a dependency is postgresql-server-dev-10.
>
> Seems wrong, is this OK?
>

If it is offering 10 but not 11, then that is probably because 11 is
already installed and doesn't need to be installed again.

What happens if you do "sudo apt install postgresql-server-dev-11"?

Cheers,

Jeff


Re: Having difficulties partitionning with jsonb

2019-08-25 Thread Jeff Janes
On Thu, Aug 22, 2019 at 5:41 PM ouellet marc-andre <
ouellet_marcan...@hotmail.com> wrote:

> CREATE TABLE test ( id integer, data jsonb ) Partition by range (( data
> #>> '{info,time}' ));
>
> CREATE TABLE test_part1 PARTITION OF test
> FOR VALUES FROM ('3') TO ('4');
>
> INSERT INTO test VALUES (1,'{"info":[{"time":39814.0,"value":2},
> {"time":39815.0,"value":3}]}');
>
>
The partitioning expression yields NULL on the given input.

select '{"info":[{"time":39814.0,"value":2},
{"time":39815.0,"value":3}]}'::jsonb #>> '{info,time}';
 ?column?
----------
 (null)
(1 row)

Maybe you meant '{info,0,time}', or maybe you meant '{info,1,time}'.  Or
maybe you meant something else. You will have to explain yourself.
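For instance, pointing at the first array element should yield a value rather
than NULL:

select '{"info":[{"time":39814.0,"value":2},
{"time":39815.0,"value":3}]}'::jsonb #>> '{info,0,time}';   -- returns 39814.0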

Cheers,

Jeff

>


Re: [ext] Re: Pointers towards identifying bulk import bottleneck (walwriter tuning?)

2019-08-27 Thread Jeff Janes
On Tue, Aug 27, 2019 at 10:43 AM Holtgrewe, Manuel <
manuel.holtgr...@bihealth.de> wrote:

> Hi,
>
> I also tried creating the table as "UNLOGGED" which led to walwriter I/O
> to drop drastically and I now get no wall-clock time increase with two
> import processes but it gets slower with four.
>
> Switching off fsync leads to a drastic time improvement but still higher
> wall-clock time for four threads.
>

Does switching fsync off make it faster even when the tables are unlogged
(that would be surprising), or were the two changes made one at a time?
When you say still higher for four threads, do you mean the four threads
for fsync=off are much faster than 4 threads for fsync=on but still don't
scale linearly within the fsync=off set?  Or is the nonlinearity so bad
that fsync=off doesn't even improve the 4-thread situation?

PostgreSQL fsyncs each wal segment once it is full.  Under heavy load, this
is effectively done in the foreground (even when done by WALwriter),
because other processes inserting WAL records will soon be blocked by locks
taken out by the fsyncing process.  So if you can't run your production
database with fsync=off, one thing you can try is setting up a new database
with a larger wal segment size (--wal-segsize argument to initdb).
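For example (the data directory path is illustrative; --wal-segsize is
available in v11 and later):

initdb --wal-segsize=64 -D /path/to/new/cluster   # 64MB segments instead of the default 16MB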

 Cheers,

Jeff


Re: Matching pgp_sym_encrypt() and gpg2 output

2019-08-27 Thread Jeff Janes
On Tue, Aug 27, 2019 at 1:33 PM Bruce Momjian  wrote:

> I am trying to generate output from the command-line program gpg2 that
> matches the output of pgp_sym_encrypt().  gpg2 outputs:
>
> $ echo 'my access password' | tr -d '\n' | gpg2 --symmetric --batch
> > --cipher-algo AES256 --passphrase 'abc' | xxd -p | tr -d '\n'
>

I don't even get the same output on repeated execution of this same
command, so I think you are pursuing a lost cause.

Cheers,

Jeff


Re: Postgres Point in time Recovery (PITR),

2019-10-19 Thread Jeff Janes
On Fri, Oct 18, 2019 at 1:59 AM Daulat Ram 
wrote:

> Hello All,
>
> Can you please share some ideas and scenarios how we can do the PITR in
> case of disaster.
>

It depends on what you mean by "disaster".  Usually I think that would mean
your server (or entire data center) was destroyed.  In this case, you would
want to restore to the latest time available.  I would say that this is not
PITR at all, that is just regular recovery.

If someone truncated a table 3 weeks ago, and you didn't realize it until
today, that is a scenario for PITR.  Are you using "disaster" to cover this
scenario?
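For that kind of PITR, the recovery settings would look roughly like this
(paths and timestamp are illustrative; before v12 they go in recovery.conf,
from v12 on in postgresql.conf plus an empty recovery.signal file):

restore_command = 'cp /mnt/wal_archive/%f "%p"'
recovery_target_time = '2019-09-27 14:30:00'
recovery_target_action = 'pause'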

Cheers,

Jeff

>


Re: Execute a function through fdw

2019-10-19 Thread Jeff Janes
On Fri, Oct 18, 2019 at 7:55 AM Tom Lane  wrote:

> Guillaume Lelarge  writes:
> > Le ven. 18 oct. 2019 à 11:51, Patrick FICHE 
> a
> > écrit :
> >> Is it possible to execute a function located on a server accessed
> through
> >> Postgres fdw.
>
> > It's probably easier to create a view on the remote server, and access it
> > as a foreign table on the local server.
>

Yes, that would probably work here, but if the function takes user-supplied
arguments, that won't work.

>
> Yeah.  Or if you really want to call a remote function by name, see
> dblink.  postgres_fdw actively avoids doing that sort of thing.
>

And importantly, you can specify the name of the existing postgres_fdw
server to the dblink functions in place of the connection string.  This
removes quite a bit of the drudgery of using dblink, if you are already
using postgres_fdw.
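For example (server, function and argument are illustrative), assuming a
postgres_fdw server named app_srv with a user mapping already defined:

SELECT *
FROM dblink('app_srv', 'SELECT my_remote_func(42)') AS t(result integer);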

Cheers,

Jeff


Re: Too many SET TimeZone and Application_name queries

2019-10-20 Thread Jeff Janes
On Fri, Oct 11, 2019 at 7:49 AM Amarendra Konda 
wrote:

> Hi,
>
> In our test environment, it was observed that there are too many queries
> were getting fired to the database server,
>

What does "too many" mean here?  Is it just more than you like to see in
your log file, or is there some objective problem?

These look like housekeeping queries which are executed by a connection
pooler each time a connection is checked out of the pool (or perhaps
checked back in).  However, they don't seem to be the housekeeping queries
which pgbouncer itself uses.  I don't think that JDBC automatically issues
them either, although that might depend on your configuration.  So I think
that leaves Tomcat as the most likely culprit.  Tomcat does offer a
connection pool.  Are you using it?

Cheers,

Jeff


Re: Regarding db dump with Fc taking very long time to completion

2019-10-22 Thread Jeff Janes
On Fri, Aug 30, 2019 at 5:51 AM Durgamahesh Manne 
wrote:

> Hi
> To respected international postgresql team
>
> I am using postgresql 11.4 version
> I have scheduled logical dump job which runs daily one time at db level
> There was one table that has write intensive activity for every 40 seconds
> in db
> The size of the table is about 88GB
>  Logical dump of that table is taking more than 7 hours to be completed
>

That seems very slow.  I get about 2 GB per minute on my low-end laptop.
Does your schema use obscure data types which might be hard to process?

Using your system tools (like "top" for linux), what is going on?  Is time
spent in pg_dump itself, or in postgres doing the COPY?  Is it CPU bound or
IO bound?  Can you use "perf top" to see where it is spending its time?

How long does it take if you turn off compression, and stream the output
into the void rather than saving it?

time pg_dump -p  -Fc -Z0 | wc -c

Alternatively, can you use physical rather than logical backups?

Cheers,

Jeff


Re: Recovering disk space

2019-10-22 Thread Jeff Janes
On Thu, Oct 10, 2019 at 3:57 AM stan  wrote:

> However, my 50G disk is still 96% full. How can I recover the disk space I
> seem to have used u doing this?
>

The bloated storage is likely not under PostgreSQL's control.  Use the
tools provided by your OS to figure out what is using the space.
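For example, on Linux something like this will usually show which directories
are actually holding the space (run as root; options may vary by distro):

du -xh --max-depth=2 / | sort -h | tail -20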

Cheers,

Jeff


Re: PostGreSQL Replication and question on maintenance

2019-11-16 Thread Jeff Janes
On Thu, Nov 14, 2019 at 12:23 PM github kran  wrote:

>
>>
>> *Problem what we have right now. *
>>
>> When the migration activity runs(weekly) from past 2 times , we saw the
>> cluster read replica instance has restarted as it fallen behind the
>> master(writer instance).
>>
>
I can't figure out what your setup is here.  You must be using logical
replication (not physical) or you wouldn't be able to write to the replica
at all.  But if you are using logical replication, why do you also need
these weekly jobs?  Why isn't logical replication taking care of it?




> Everything
>>
>> after that worked seamlessly but we want to avoid the replica getting
>> restarted. To avoid from restart we started doing smaller binary files and
>> copy those files to the cluster-2
>>
>
Who restarted it?  I am not aware of any case where the replica responds to
falling behind by restarting itself.  With physical replication, it can
start cancelling queries, but you don't seem to be using physical
replication.

Cheers,

Jeff

>


Re: Weird ranking results with ts_rank

2019-11-16 Thread Jeff Janes
On Fri, Nov 15, 2019 at 1:31 AM Javier Ayres  wrote:

> Hi everybody.
>
> I'm implementing a solution that uses PostgreSQL's full text search
> capabilities and I have come across a particular set of results for ts_rank
> that don't seem to make sense according to the documentation.
>

While the documentation doesn't come out and say, my interpretation is that
ts_rank assumes there is a match in the first place, and by implication is
undefined/unspecified if there is no match.

select to_tsvector('search for a text box') @@ to_tsquery('circle | lot <->
box');
 ?column?
--
 f
(1 row)

Cheers,

Jeff


Re: How to drop all tokens that a snowball dictionary cannot stem?

2019-11-23 Thread Jeff Janes
On Fri, Nov 22, 2019 at 8:02 AM Christoph Gößmann  wrote:

> Hi everybody,
>
> I am trying to get all the lexemes for a text using to_tsvector(). But I
> want only words that english_stem -- the integrated snowball dictionary --
> is able to handle to show up in the final tsvector. Since snowball
> dictionaries only remove stop words, but keep the words that they cannot
> stem, I don't see an easy option to do this. Do you have any ideas?
>
> I went ahead with creating a new configuration:
>
> -- add new configuration english_led
> CREATE TEXT SEARCH CONFIGURATION public.english_led (COPY =
> pg_catalog.english);
>
> -- dropping any words that contain numbers already in the parser
> ALTER TEXT SEARCH CONFIGURATION english_led
> DROP MAPPING FOR numword;
>
> EXAMPLE:
>
> SELECT * from to_tsvector('english_led','A test sentence with ui44 \tt
> somejnk words');
>to_tsvector
> --
>  'sentenc':3 'somejnk':6 'test':2 'tt':5 'word':7
>
> In this tsvector, I would like 'somejnk' and 'tt' not to be included.
>

I don't think the question is well defined.  It will happily stem
'somejnking' to 'somejnk', doesn't that mean that it **can** handle it?
The fact that 'somejnk' itself wasn't altered during stemming doesn't mean
it wasn't handled, just like 'test' wasn't altered during stemming.
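A quick way to see how the stemmer treats individual tokens:

select ts_lexize('english_stem', 'somejnking');   -- {somejnk}
select ts_lexize('english_stem', 'testing');      -- {test}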

Cheers,

Jeff


Re: How to drop all tokens that a snowball dictionary cannot stem?

2019-11-23 Thread Jeff Janes
On Sat, Nov 23, 2019 at 10:42 AM Christoph Gößmann 
wrote:

> Hi Jeff,
>
> You're right about that point. Let me redefine. I would like to drop all
> tokens which neither are the stemmed or unstemmed version of a known word.
> Would there be the possibility of putting a wordlist as a filter ahead of
> the stemming? Or do you know about a good English lexeme list that could be
> used to filter after stemming?
>

I think what you describe is the opposite of what snowball was designed to
do.  You want an ispell-based dictionary instead.

PostgreSQL doesn't ship with real ispell dictionaries, so you have to
retrieve the files yourself and install them into $SHAREDIR/tsearch_data as
described in the docs for
https://www.postgresql.org/docs/12/textsearch-dictionaries.html#TEXTSEARCH-ISPELL-DICTIONARY
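The setup would then look roughly like this (the dictionary file names follow
the usual hunspell naming and are illustrative).  Note that if english_ispell
is the only dictionary mapped for a token type, tokens it does not recognize
are simply dropped, which is the behavior you are after:

CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE  = ispell,
    DictFile  = en_us,
    AffFile   = en_us,
    StopWords = english
);

ALTER TEXT SEARCH CONFIGURATION english_led
    ALTER MAPPING FOR asciiword, word WITH english_ispell;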

Cheers,

Jeff


Re: Trouble incrementing a column

2019-11-24 Thread Jeff Janes
On Sat, Nov 23, 2019 at 4:47 PM Tom Lane  wrote:


> Note that you pay a fairly substantial performance penalty for deferring
> the check, which is why it isn't the default, even though the SQL spec
> says it ought to be.
>

Do you know what the worst case scenario is for the performance of
deferring the check to the end of the statement (with deferrable initially
immediate)?  Upon testing, I get a penalty of 2 to 5%, which seems pretty
small, but I might not be testing the most adverse situation.  See attached.

The main "cost" that prevents from using DII routinely is that they can't
receive foreign key constraints.

Cheers,

Jeff

psql <<'END'
create table jj_dii (id bigserial primary key, jj bigint unique deferrable initially immediate);
create table jj_did (id bigserial primary key, jj bigint unique deferrable initially deferred);
create table jj_nd (id bigserial primary key, jj bigint unique not deferrable);
insert into jj_nd select x,x::bigint*100 from generate_series(1,10) f(x);
insert into jj_did select x,x::bigint*100 from generate_series(1,10) f(x);
insert into jj_dii select x,x::bigint*100 from generate_series(1,10) f(x);
END

for f in `seq 1 200`; do 
  for mode in `shuf -e nd dii`; do
	echo "JJ $f $mode"
pgbench -T30 -f <(echo "update jj_$mode set jj=jj+1") -n -M prepared
  done
done



Re: slow insert speeds with bytea

2019-12-03 Thread Jeff Janes
On Mon, Dec 2, 2019 at 4:42 AM Alex O'Ree  wrote:

> Is there anything I can to increase insert speeds for bytea? Currently
> running postgres 9.6.15
>
> I have a few tables without a bytea and a few with bytea. There is a large
> performance difference with inserts between the two. I'm inserting a byte[]
> that's usually less than 1MB on content. The content itself is actually
> just utf8 string data.
>
> For the non-bytea table, inserts can be as high as 40k rows/sec, whereas
> the bytea table is closer to 4k/sec or less.
>
> If this is just a limitation of postgres, then that's fine but the
> performance delta is so significant that i feel like i'm missing something
>

That does seem pretty drastic.  But I don't think we will get
anywhere unless you post your actual benchmarking program so that we can
try it for ourselves.  I certainly don't see anything that drastic in my
own testing.  Maybe the bottleneck is on the client side.

Cheers,

Jeff


Re: How to prevent POSTGRES killing linux system from accepting too much inserts?

2019-12-18 Thread Jeff Janes
On Wed, Dec 18, 2019 at 4:53 AM James(王旭)  wrote:

> Hello,
>>
>> I encountered into this kernel message, and I cannot login into the Linux
>> system anymore:
>
>
>>
>> Dec 17 23:01:50 hq-pg kernel: sh (6563): drop_caches: 1
>>
>> Dec 17 23:02:30 hq-pg kernel: INFO: task sync:6573 blocked for more than
>>> 120 seconds.
>>
>> Dec 17 23:02:30 hq-pg kernel: "echo 0 >
>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>
>> Dec 17 23:02:30 hq-pg kernel: syncD 965ebabd1040 0
>>> 6573   6572 0x0080
>>
>> Dec 17 23:02:30 hq-pg kernel: Call Trace:
>>
>> Dec 17 23:02:30 hq-pg kernel: [] ?
>>> generic_write_sync+0x70/0x70
>>
>>
>> After some google I guess it's the problem that IO speed is low, while
>> the insert requests are coming too much quickly.So PG put these into cache
>> first then kernel called sync
>
>
Could you expand on what you found in the googling, with links?  I've never
seen these in my kernel log, and I don't know what they mean other than the
obvious that it is something to do with IO.  Also, what kernel and file
system are you using?


> .
>
> I know I can queue the requests, so that POSTGRES will not accept these
>> requests which will result in an increase in system cache.
>
> But is there any way I can tell POSTGRES, that you can only handle 2
>> records per second, or 4M per second, please don't accept inserts more than
>> that speed.
>
> For me, POSTGRES just waiting is much better than current behavior.
>
>
I don't believe there is a setting from within PostgreSQL to do this.

There was a proposal for a throttle on WAL generation back in February, but
with no recent discussion or (visible) progress:

https://www.postgresql.org/message-id/flat/2B42AB02-03FC-406B-B92B-18DED2D8D491%40anarazel.de#b63131617e84d3a0ac29da956e6b8c5f


I think the real answer here is to get a better IO system, or maybe a better
kernel.  Otherwise, once you find a painful workaround for one symptom you
will just smack into another one.

Cheers,

Jeff

>


unanalyze a foreign table

2019-12-22 Thread Jeff Janes
I did a manual ANALYZE of a foreign table, to see if it would make a
troublesome query better.  It did, but it also made some other queries that
were previously fine become far worse.  Is there a way to undo the
analyze?  I can muck around in the catalogs like below, but seems really
grotty.

delete from pg_statistic where starelid=418911;

The other option seems to be doing a `drop foreign table ... cascade`, but
then recreating all the cascaded drops is quite annoying and error prone.

I currently solved it by re-cloning my performance testing server from
upstream, but that also is not very convenient.  Is directly manipulating
the catalogs really the best way?

Cheers,

Jeff


Re: pg_basebackup

2019-12-29 Thread Jeff Janes
On Mon, Dec 23, 2019 at 4:13 AM Kyotaro Horiguchi 
wrote:

> Hello.
>
> At Mon, 23 Dec 2019 03:38:12 +, Daulat Ram 
> wrote in
> > thanks Adrian, what about the
> > postmaster.opts file, this file was also skipped in backup.
>
> The file is overwritten at startup so there's no point in having it in
> a backup. Thus, it is deliberately excluded from a backup.
>

That is true and sensible, but it does seem to be contradicted by the
documentation for 9.5.  In later versions, they added "except certain
temporary files managed by PostgreSQL".   That addition seems to just be a
correction of the docs, not something to reflect a change in actual
behavior, at least in regards to postmaster.opts.

Cheers,

Jeff


Re: Are my autovacuum settings too aggressive for this table?

2019-12-29 Thread Jeff Janes
>
>
>
> Live tuples = 19,766,480
> Analyze scale factor = 0.001
> Analyze thresh = 5000
> Thresh + live_tuples * factor = 24,766
>
> So an autovacuum analyze should trigger at around 24K tuples modified; is
> this too little or too much?


This seems too much to me.  Was there a specific problem occurring that
spurred this change?  If many of the tuple modifications are occurring on a
certain subset of the data which changes the distribution in an important
way, then maybe this would be justified.  (But maybe
partitioning between active and largely inactive would be a better solution)


> Same goes for autovacuum vacuum, is it too aggressive?
>
>
> #--
> # AUTOVACUUM
>
> #--
> autovacuum_naptime = 1h
> autovacuum_vacuum_threshold = 1
> autovacuum_analyze_threshold = 5000
> autovacuum_vacuum_scale_factor = 0.002
> autovacuum_analyze_scale_factor = 0.001
>
>
The analyze settings don't seem defensible to me, but maybe you can make a
case for them by referring to problems that were showing up in particular
queries.

The naptime seems particularly indefensible.  If you think you overdid it
with some changes, then you should back off those changes.  Not randomly
whack around other settings in an attempt to compensate for the first ones,
without having some identifiable theory which supports this.  If you do
have such a theory, you haven't told us what it might be.

If index-only-scans (and hence rel_allvisible) are particularly important
to you, then the autovac settings might make sense.  However, this is
unlikely to be true at the global level, but rather on a table by table
basis.  But, I don't see the point in setting
autovacuum_vacuum_scale_factor = 0.002.  If it needs a drastic decrease (to
pair with a well-considered increase in  autovacuum_vacuum_threshold) why
would 99% be the correct decrease, rather than 100%?  A nice thing about
exact 0 is it stands out as being intentional, as opposed to a typo or a
mistranslation between percentage and fraction.
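If one table really does justify aggressive settings, it can get them without
touching the globals, for example (name and values are illustrative):

ALTER TABLE hot_table SET (
    autovacuum_vacuum_threshold = 10000,
    autovacuum_vacuum_scale_factor = 0,
    autovacuum_analyze_threshold = 5000,
    autovacuum_analyze_scale_factor = 0
);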

Cheers,

Jeff


Re: Changing default ../data/ directory

2020-01-04 Thread Jeff Janes
On Sat, Jan 4, 2020 at 10:20 AM Rich Shepard 
wrote:

>
> The PGDATA environment variable used to be used for this (but I never did
> get it satisfactorily working).


PGDATA should work fine if you always start the server directly.  But if
you sudo or su to another user, they likely won't inherit your environment
variables.  And if you use some kind of start-up script, they will likely
override it.  How do you start and stop PostgreSQL?
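For example, starting the server directly as the postgres OS user (path is
illustrative) does honor PGDATA:

export PGDATA=/srv/pgsql/11/data
pg_ctl start -l logfile

whereas a packaged systemd unit or init script typically supplies its own -D
or PGDATA and ignores whatever is in your shell environment.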


> I'm sure it's in the manual yet I'm unsure just where to start reading.
>

If you are using start-up scripts that come with some installation package,
then the details are up to the packager.  The PostgreSQL manual is of
little use in knowing what the packager decided to do.  It will describe
what the pieces are, but not how they got assembled together.

Cheers,

Jeff


Re: How can I set a timeout for a locked table in Function ?

2020-01-04 Thread Jeff Janes
On Fri, Jan 3, 2020 at 1:05 PM Thomas Kellerer  wrote:

> Michael Lewis schrieb am 03.01.2020 um 18:00:
>
> > Why take an exclusive lock on an entire table to update a single row?
>
> That's what I was asking myself as well.
>


Note that the code takes "row exclusive", not "exclusive".  It is several
notches less restrictive, and is the mode an UPDATE statement is going to
take anyway.  It still isn't clear why he is doing it, however.

Cheers,

Jeff


Re: When to use PARTITION BY HASH?

2020-06-04 Thread Jeff Janes
On Wed, Jun 3, 2020 at 7:55 AM Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

With hash partitioning you are not expected, in general, to end up with a
> small number of partitions being accessed more heavily than the rest.  So
> your indexes will also not fit into memory.
>
> I have the feeling that using a hash function to distribute rows simply
> contradicts the basic assumption of when you would think of partitioning
> your table at all: that is to make sure the most active part of the table
> and indexes is small enough to be cached in memory.
>

While hash partitioning doesn't appeal to me, I think this may be overly
pessimistic.  It would not be all that unusual for your customers to take
turns being highly active and less active.  Especially if you do occasional
bulk loads all with the same customer_id for any given load, for example.
So while you might not have a permanently hot partition, you could have
partitions which are hot in turn.  Of course you could get the same benefit
(and probably better) with list or range partitioning rather than hash, but
then you have to maintain those lists or ranges when you add new customers.

Cheers,

Jeff


Re: When to use PARTITION BY HASH?

2020-06-05 Thread Jeff Janes
On Fri, Jun 5, 2020 at 6:12 AM Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

> On Thu, Jun 4, 2020 at 4:32 PM Jeff Janes  wrote:
>
>> On Wed, Jun 3, 2020 at 7:55 AM Oleksandr Shulgin <
>> oleksandr.shul...@zalando.de> wrote:
>>
>> With hash partitioning you are not expected, in general, to end up with a
>>> small number of partitions being accessed more heavily than the rest.  So
>>> your indexes will also not fit into memory.
>>>
>>> I have the feeling that using a hash function to distribute rows simply
>>> contradicts the basic assumption of when you would think of partitioning
>>> your table at all: that is to make sure the most active part of the table
>>> and indexes is small enough to be cached in memory.
>>>
>>
>> While hash partitioning doesn't appeal to me, I think this may be overly
>> pessimistic.  It would not be all that unusual for your customers to take
>> turns being highly active and less active.  Especially if you do occasional
>> bulk loads all with the same customer_id for any given load, for example.
>>
>
> For a bulk load you'd likely want to go with an empty partition w/o
> indexes and build them later, after loading the tuples.
>

That only works if the bulk load is starting from zero.  If you are adding
a million rows to something that already has 100 million, you would
probably spend more time rebuilding the indexes than you saved by dropping
them.  And of course to go with an empty partition, you have to be using
partitioning of some kind to start with; and then you need to be futzing
around creating/detaching and indexing and attaching.  With hash
partitioning, you might get much of the benefit with none of the futzing.


> So while you might not have a permanently hot partition, you could have
>> partitions which are hot in turn.  Of course you could get the same benefit
>> (and probably better) with list or range partitioning rather than hash, but
>> then you have to maintain those lists or ranges when you add new customers.
>>
>
> Why are LRU eviction from the shared buffers and OS disk cache not good
> enough to handle this?
>

Data density.  If the rows are spread out randomly throughout the table,
the density of currently relevant tuples per MB of RAM is much lower than
if they are in partitions which align with current relevance.  Of course
you could CLUSTER the table on what would otherwise be the partition key,
but clustered tables don't stay clustered, while partitioned ones stay
partitioned.  Also, clustering the table wouldn't help with the relevant
data density in the indexes (other than the index being clustered on, or
other ones highly correlated with that one).  This can be particularly
important for index maintenance and with HDD, as the OS disk cache is in my
experince pretty bad at deciding when to write dirty blocks which have been
handed to it, versus retain them in the hopes they will be re-dirtied soon,
or have adjacent blocks dirtied and then combined into one write.


>
> This actually applies to any partitioning scheme: the hot dataset could be
> recognized by these caching layers.  Does it not happen in practice?
>

Caching only happens at the page level, not the tuple level.  So if your
hot tuples are interspersed with cold ones, you can get poor caching
effectiveness.

Cheers,

Jeff


Re: Performance of "distinct with limit"

2020-08-29 Thread Jeff Janes
On Fri, Aug 28, 2020 at 8:34 AM Klaudie Willis <
klaudie.wil...@protonmail.com> wrote:

> No index on n, no. Index might solve it yes, but it seems to me such a
> trivial optimization even without.  Obviously it is not.
>
> QUERY PLAN
> --------------------------------------------------------------------------------
>  Limit  (cost=1911272.10..1911272.12 rows=2 width=7)
>    ->  HashAggregate  (cost=1911272.10..1911282.45 rows=1035 width=7)
>          Group Key: cfi
>          ->  Seq Scan on bigtable  (cost=0.00..1817446.08 rows=37530408 width=7)
>
>
I think it would be nice if the LIMIT functionality could be pushed down
into the HashAgg so it could stop early, I've run into this a few times.
But it just isn't implemented.  It wouldn't be the hardest feature to ever
add to PostgreSQL, but it also wouldn't be trivial.  It would require
coordinated changes both to the planner and to the executor.

Also, the use of LIMIT without an ORDER BY makes the query
non-deterministic, which makes it kind of a second-class citizen.  There
might be more enthusiasm among experienced developers for implementing this
if it weren't for that.  (Although there may be related deterministic cases
in which a similar limited hash agg could be useful.)

In the meantime, an index on "n" would probably cause it to switch to a
Unique plan which reads in index order.  This plan does get to stop early.

Cheers,

Jeff


Re: Autovacuum not functioning for large tables but it is working for few other small tables.

2020-12-16 Thread Jeff Janes
On Wed, Dec 16, 2020 at 6:55 AM M Tarkeshwar Rao <
m.tarkeshwar@ericsson.com> wrote:

> ...
>

>
> All the threshold level requirements for autovacuum were met and there are
> millions of dead tuples, but autovacuum was unable to clear them, which
> causes a performance issue on the production server.
>

It might be helpful for us to see what data you are looking at to reach
this conclusion.


>
>
> Is autovacuum not working against large sized tables, or is there any
> parameter which needs to be set to make autovacuum function?
>

Autovacuum is not inherently broken for large tables.  But vacuuming them
takes longer than for small tables.  If it is frequently interrupted by
things like CREATE INDEX, ALTER TABLE, or database shutdown and restart,
then it might never get through the entire table without interruption. If
it is getting interrupted, you should see messages in the log file about
it.  You can also check pg_stat_user_tables to see when it was last
successfully (to completion) auto vacuumed, and on new enough versions you
can look in pg_stat_progress_vacuum to monitor the vacuuming while it
occurs.
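A sketch of both checks:

select relname, n_dead_tup, last_vacuum, last_autovacuum, autovacuum_count
from pg_stat_user_tables
order by n_dead_tup desc
limit 10;

select * from pg_stat_progress_vacuum;  -- 9.6+, only has rows while a vacuum is running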

Cheers,

Jeff

>


Re: JSONB index not in use, but is TOAST the real cause of slow query?

2022-05-30 Thread Jeff Janes
On Sat, May 28, 2022 at 1:54 PM Shaheed Haque 
wrote:


>
And how can I understand the dreadful amount of
> time (of course, this is just on my dev machine, but still...)? Is
> there a way to see/tweak what TOAST costs or indeed to confirm if it
> is even in use?
>

Turn on track_io_timing, and then do explain (analyse, buffers).  This
won't distinguish TOAST from main table IO, but will at least confirm if a
large amount of IO is happening and how long it is taking.

You can look at pg_statio_user_tables before the after the query to
distinguish TOAST, but this won't give you the timing, just the block
counts.  And the counts might also get incremented by concurrent users, so
on a busy system it is hard to interpret.
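A sketch of both approaches (the table name is hypothetical):

set track_io_timing = on;
-- then run: explain (analyze, buffers) <the slow query>

select relname, heap_blks_read, heap_blks_hit, toast_blks_read, toast_blks_hit
from pg_statio_user_tables
where relname = 'mytable';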

Are these queries fast upon repeat execution?  A few meg times less than
1000 is just not that much data with modern hardware; you should be able to
just cache it all in RAM.

Cheers,

Jeff


Re: autovacuum on primary blocking queries on replica?

2022-05-30 Thread Jeff Janes
On Fri, May 27, 2022 at 3:01 PM Don Seiler  wrote:

I've been reading tales of autovacuum taking an AccessExclusiveLock when
> truncating empty pages at the end of a table. I'm imagining that updating
> every row of a table and then rolling back would leave all of those rows
> empty at the end and qualify for truncation and lead to the scenario I saw
> this morning.
>

That is likely.


>
> I'm still not entirely satisfied since that table in question was so small
> (only 8252 rows) so I wouldn't imagine it would hold things up as long as
> it did. Although the blocking session on the replica was an application
> session, not any background/recovery process.
>

Yeah, so to me the interesting question is, what was that application
session doing for so long?

Once the recovery process acquires the lock it needs, it would presumably
only hold it for a short amount of time.  But it has to get the lock first,
which it can't do due to that other process camping on the access share
lock.  And once it places itself in the queue for the lock, any newcomers
can't jump over it.  Even if the new entrants only want an access share
lock, they are stuck behind the access exclusive lock request, which is in
turn stuck behind the already granted share lock.
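On 9.6 and later you can see who is blocking whom directly from
pg_stat_activity, for example:

select pid, pg_blocking_pids(pid) as blocked_by,
       wait_event_type, wait_event, state, query
from pg_stat_activity
where cardinality(pg_blocking_pids(pid)) > 0;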

Cheers,

Jeff

>


Re: unoptimized nested loops

2022-06-01 Thread Jeff Janes
On Tue, May 31, 2022 at 4:04 PM Tim Kelly  wrote:

>
> c_db=>select count(id) from metadata where author like '%Kelly%';
>
>   count
> ---
>3558
> (1 row)
>

It would be interesting to see the explain of this.  We know how many rows
it found, but not how many it thinks it will find.


> I do not see evidence that the nested loop is trying to reduce overhead
> by using the smaller set.  It seems to want to scan on data first either
> way.
>

The planner probably doesn't know which one is smaller.



> I have run vacuum analyze, in the hopes that the optimizer is
> miscalculating, to no avail.


What version are you using?

Cheers,

Jeff


Re: unoptimized nested loops

2022-06-02 Thread Jeff Janes
On Thu, Jun 2, 2022 at 12:32 AM Tom Lane  wrote:

> Jeff Janes  writes:
> > On Tue, May 31, 2022 at 4:04 PM Tim Kelly 
> wrote:
> >> I do not see evidence that the nested loop is trying to reduce overhead
> >> by using the smaller set.  It seems to want to scan on data first either
> >> way.
>
> > The planner probably doesn't know which one is smaller.
>
> There is not a lot of daylight between the cost estimates for
> "a nestloop-join b" and "b nestloop-join a", if we're considering
> plain seqscans on both tables and all else is equal.  It tends to
> come down to factors like which one is more densely populated.
>
> As best I can tell, the issue Tim's unhappy about is not so
> much the use of a nestloop as the lack of use of any index.
>

But it is using an index on one of the tables, on "id".  There is no reason
it would not be able to reverse that, doing the seq scan on the smaller (in
assumed bytes) table and using the id index on the larger (in bytes) table,
to great effect.

Based on the timing of the simple counts on "data" with and without the
WHERE, I'm pretty sure that what is going on here is that data.content is
large and resides mostly in TOAST.  When TOAST needs to be accessed it is
much slower than when it doesn't.  And that is also the cause of the
estimation problem, the oversized values are just assumed to be distinct,
and no histogram is generated. Without histogram boundaries to serve as a
randomish sample, the selectivity estimate falls back to something even
worse (pattern length), and gets it massively wrong.


> But "string like '%foo%'" is not at all optimizable with a
> btree index.  You might be able to get somewhere with a
> pg_trgm GIN or GIST index.
>

I agree with the recommendation, but not really the reasoning.  Having the
pg_trgm index on metadata.author might be even better than just getting
the planner to do the right thing without the index, but getting the
planner to do the right thing even without the index would also be a big
improvement over the current plan, if there were just a way to do it.  If
the planner explicitly accounted for TOAST cost, that would probably do
it.  Or if the selectivity estimates on data.content were better, that
would too.  But Tim can't reasonably do anything about those things, while
he can build the index.
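The index itself would be something like (assuming the pg_trgm extension can
be installed):

create extension if not exists pg_trgm;
create index on metadata using gin (author gin_trgm_ops);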

Another thing he could try would be to force the correct index use by using
the inner join, but writing the join condition as "on data.id = metadata.id
||''"

Cheers,

Jeff


Re: Why password authentication failed for user "postgres"?

2022-06-04 Thread Jeff Janes
On Fri, Jun 3, 2022 at 6:32 PM BeginnerC  wrote:

> Hello everyone,
> I am a newbie to Postgres. When I use psql to connect to Postgres, an
> error message is printed:
> These command list like this:
>
> psql -U postgres
> Password for user postgres:postgres
> *postgreSQL: password authentication failed for user "postgres"*
>
> How to solve this problem?
> Thanks in advance!
>

The password you type should not be showing up as you type it, which is
what you appear to be showing above.  Also, 'postgres' is a horrible
password, you might as well configure 'trust' authentication if you are
going to use that as a password.

Look in the db server's log file for a (possibly) more detailed error
message.

Cheers,

Jeff


Re: Index creation

2022-06-20 Thread Jeff Janes
On Mon, Jun 20, 2022 at 1:17 AM Дмитрий Иванов 
wrote:

> Your statement seems obvious to me. But what I see doesn't seem like a
> conscious choice. It turns out that it is better to have a lighter
> general-purpose index than to strive to create a target covering index for
> a certain kind of operation.
>

If both indexes are expected to be hit only once in the query and return
only one row, their expected costs will be the same.  In this case, the tie
is broken arbitrarily, and that often means the most-recently created index
will get chosen.

As the expected number of leaf page accesses in a given query goes up, the
smaller index will start to look less expensive.

Cheers,

Jeff

>


Re: Tuning a query with ORDER BY and LIMIT

2022-06-22 Thread Jeff Janes
On Wed, Jun 22, 2022 at 3:39 PM Dirschel, Steve <
steve.dirsc...@thomsonreuters.com> wrote:


> explain (analyze, verbose, costs, buffers, timing, summary, hashes)
>

'hashes', what's that?  Are you using community PostgreSQL, or someone's
fork?


> With Oracle for a query like this, since the index is on the 3 columns
> matching the WHERE clause and the ORDER BY column is in the 4th position,
> Oracle would be able to scan that index and stop as soon as it finds the
> first 50 matching rows.
>

Can you show the plan from Oracle?  I'm not an expert at reading their
plans, but certainly they are easier to attempt to read if you can at least
see them.


> I was questioning if the result IN clause was causing the issue so I ran
> the query with result = and see the same results:
>
>
I can't reproduce this at all.  With simple equality it absolutely uses the
index for ordering in that case, even in v10.

Cheers,

Jeff

>


Re: Tuning a query with ORDER BY and LIMIT

2022-06-22 Thread Jeff Janes
On Wed, Jun 22, 2022 at 6:19 PM Peter J. Holzer  wrote:

>
> >That's just how btree indexes work and Oracle will have the same
> >limitation. What would be possible is to use an index only scan
> >(returning 2,634,718 matching results), sort that to find the 50 newest
> >entries and retrieve only those from the table. That should be faster
> >since the index contains only 4 of 28 (if I counted correctly) columns
> >and should be quite a bit smaller.
>
> Another - better - optimization would be to fetch the first 50 results
> for each of the 6 possible values of result, then choose the 50 largest
> of those. That sounds tricky to generalize, though.
>

You don't even need to read 50 from each of the 6 branches.  If you use a
merge append operation, you would need to read at most 55 rows: 50 to be
returned, and one non-returned from each branch other than the one
returning the last row.  I suspect this may be what Oracle is doing.  With
some trickery, you can get PostgreSQL to do the same thing.

(select * from foo where a=4 order by b)
union all
(select * from foo where a=7 order by b)
order by b
limit 50

                                  QUERY PLAN
------------------------------------------------------------------------------------------
 Limit  (cost=0.86..131.25 rows=50 width=8)
   ->  Merge Append  (cost=0.86..26079.86 rows=1 width=8)
         Sort Key: foo.b
         ->  Index Only Scan using foo_a_b_idx on foo  (cost=0.42..12939.92 rows=5000 width=8)
               Index Cond: (a = 4)
         ->  Index Only Scan using foo_a_b_idx on foo foo_1  (cost=0.42..12939.92 rows=5000 width=8)
               Index Cond: (a = 7)

Cheers,

Jeff


Re: Can't connect to server

2023-03-18 Thread Jeff Janes
On Sat, Mar 18, 2023 at 2:11 AM Wu, Abigaile  wrote:

> I am writing to seek your assistance with a problem I am facing while
> using pgAdmin4. After downloading the PostgreSQL 15, I downloaded the
> latest version of Pgadmin 4 and then I click the server and choose
> PostgreSQL 15. Whenever I try to log in to the server after entering my
> password, I receive the following error message:
>
> "Connection to server at "localhost"(::1), port 5432 failed: could not
> initiate GSSAPI security context: The operation or option is not available:
> credential handle
>

This is probably just a nuisance message.  It thought it might be able to
authenticate via Kerberos and so tried that one first, but discovered it
couldn't. Unless you know that that is what you were trying to do, you can
ignore it.


> connection to server at "localhost" (::1), port 5432 failed: FATAL:
> password authentication failed for user "postgres"."
>

Look in the database server's log files to see if there are more details
about this authentication error.  When you set up the system, is "postgres"
the name you picked for the admin user?  That is the default and most
common, but you could have changed it if you wanted to.

Cheers,

Jeff


Re: Replication question

2018-10-22 Thread Jeff Janes
On Mon, Oct 22, 2018, 9:54 AM Scot Kreienkamp 
wrote:

> Hi everyone,
>
>
>
> We just moved to PG9.6 from 9.1 (yeah, not my choice to wait this long).
> In 9.1 I had to make the archive location (NFS in my case) available to all
> the mirrors running PG so that they could catch up whenever they fell
> behind.  I thought I read somewhere that in 9.6, as long as the WAL log is
> available on disk or in the archive the replication server will provide
> that to the replication client, and my archive NFS mount didn’t have to be
> available to all replication clients.  It doesn’t seem to be operating that
> way though.  Did I completely remember that wrong or did I misunderstand
> something?
>

The master won't read from the archives for you in order to send to a
replica.  But using replication slots, you can keep the needed log files
right in pg_xlog/pg_wal until all replicas get what they need (assuming the
disk is large enough).  Then you don't need an archive at all for
replication purposes, though you might still want one for PITR purposes.

Perhaps this is what you heard about.
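
For reference, a minimal sketch of the slot-based setup (the slot name is
made up).  On the 9.6 master:

SELECT pg_create_physical_replication_slot('standby1');

and in the replica's recovery.conf:

primary_slot_name = 'standby1'

Just keep an eye on pg_replication_slots afterwards, because a replica that
stays disconnected will make the master retain WAL indefinitely.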

Cheers,

Jeff

>
>


Re: Replication question

2018-10-23 Thread Jeff Janes
On Mon, Oct 22, 2018 at 1:20 PM Scot Kreienkamp <
scot.kreienk...@la-z-boy.com> wrote:

> I remember thinking it was pulling from archive with the restore command
> if necessary to augment what it had on disk.  If that was the case I wanted
> to configure it.  I don’t care for the replication slots due to the
> possible disk space issue as we don’t run shifts around the clock.  So I’ll
> have to mount the archive via NFS like I had before, not a big deal.
>

If the archive partition fills up, then your archive command will start
failing, which means your pg_xlog will also start filling up.  If you can
dedicate the same amount of space to the new pg_xlog as you currently have
dedicated to pg_wal + archive partition, then there should be no increased
risk of running out of disk space.  Except that you can have
archive_command compress the WAL files upon archival, which can save a lot
of space.  Of course there are a variety of reasons that that might not
work, like you use small fast disk for pg_xlog and big slow ones for
archive, or pg_xlog is your problem while archive is some other guy's
problem.  But it is something to consider.
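
If you do go the compression route, a minimal sketch of what that might look
like (gzip and the archive path are just placeholders):

archive_command = 'gzip -c %p > /mnt/archive/%f.gz'
restore_command = 'gunzip -c /mnt/archive/%f.gz > %p'

The first line goes in postgresql.conf on the master, the second in
recovery.conf on each replica that reads from the archive.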


> As an alternative to NFS I was thinking about making the archives
> available via HTTPD and using wget or curl in my script instead of a copy
> from NFS.  That seems like it would work better from the remote sites.
>
>
>
Yeah, or rsync or scp.

Cheers,

Jeff


Re: Different memory allocation strategy in Postgres 11?

2018-10-26 Thread Jeff Janes
On Fri, Oct 26, 2018 at 9:12 AM Thomas Kellerer  wrote:

> I have a Postgres instance running on my Windows laptop for testing
> purposes.
>
> I typically configure "shared_buffers = 4096MB" on my 16GB system as
> sometimes when testing, it pays off to have a bigger cache.
>
> With Postgres 10 and earlier, the Postgres process(es) would only allocate
> that memory from the operating system when needed.
> So right after startup, it would only consume several hundred MB, not the
> entire 4GB
>
> However with Postgres 11 I noticed that it immediately grabs the complete
> memory configured for shared_buffers during startup.
>
> It's not really a big deal, but I wonder if that is an intentional change
> or a result from something else?
>

Do you have pg_prewarm in shared_preload_libraries?
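
That is easy to check from psql:

SHOW shared_preload_libraries;

If pg_prewarm is listed, the autoprewarm background worker it gained in v11
will reload previously-cached blocks into shared_buffers at startup, which
touches (and therefore allocates) the whole region right away.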

Cheers,

Jeff


Re: GIN Index for low cardinality

2018-10-26 Thread Jeff Janes
On Wed, Oct 17, 2018 at 6:47 AM Ravi Krishna  wrote:

> In
> https://www.cybertec-postgresql.com/en/ideas-for-scaling-postgresql-to-multi-terabyte-and-beyond/
>
> it is mentioned:
>
> "GIN, the most know non-default index type perhaps, has been actually
> around for ages (full-text search) and in short is perfect for indexing
> columns where there are lot of repeating values – think all kinds of
> statuses or good old Mr/Mrs/Miss. GIN only stores every unique column value
> only once as for the default B-tree you’ll have e.g. 1 millon leaf nodes
> with the integer “1” in it."
>
>
> Does it mean that GIN is a very good choice for low cardinality columns.
>

For extremely low cardinality, like Mr. Mrs. Miss., I doubt any index would
be very useful.  For less extreme cases, like say one million different
values present around 50 times each, yes, it can be useful to keep the
index size down.  It will not support needing to deliver rows in sorted
order, for example to fulfill an ORDER BY or a merge join.  Think carefully
about what setting you want for fast_update, and, if set to on, then what
value to use for gin_pending_list_limit.
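
As a concrete sketch (the table and column names are made up), such an index
looks like:

CREATE EXTENSION IF NOT EXISTS btree_gin;
CREATE INDEX items_status_gin ON items USING gin (status) WITH (fastupdate = off);

With fastupdate = off you give up the batched-insert optimization in exchange
for never having to think about the pending list at all.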

Cheers,

Jeff


Re: GIN Index for low cardinality

2018-10-26 Thread Jeff Janes
On Thu, Oct 25, 2018 at 9:36 AM Олег Самойлов  wrote:

>
> 17 окт. 2018 г., в 13:46, Ravi Krishna  написал(а):
>
> In
> https://www.cybertec-postgresql.com/en/ideas-for-scaling-postgresql-to-multi-terabyte-and-beyond/
>
> it is mentioned:
>
> "GIN, the most know non-default index type perhaps, has been actually
> around for ages (full-text search) and in short is perfect for indexing
> columns where there are lot of repeating values – think all kinds of
> statuses or good old Mr/Mrs/Miss. GIN only stores every unique column value
> only once as for the default B-tree you’ll have e.g. 1 millon leaf nodes
> with the integer “1” in it."
>
>
> Does it mean that GIN is a very good choice for low cardinality columns.
>
>
> Not necessary. There is other index which also don’t keep column value in
> an every leaf. Hash, for instance.
>

For smallish values (which low cardinality columns tend to be) the
per-tuple overhead and the pointer itself are probably much larger than the
value, so hash won't save you much if any space.  The GIN index removes not
just the repeated values, but the per-tuple overhead.  And it also compresses
the posting (TID) list to further save space.

Here is a real-world example from one of my databases where each value is
about 17 characters long, and is present about 20 times:

gin: 411 MB
btree: 2167 MB
hash: 2159 MB
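
If you want to make the same comparison on your own data, a minimal sketch
(the table name is made up):

SELECT indexrelname, pg_size_pretty(pg_relation_size(indexrelid))
FROM pg_stat_user_indexes
WHERE relname = 'mytable';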

Cheers,

Jeff


Re: GIN Index for low cardinality

2018-10-26 Thread Jeff Janes
On Fri, Oct 26, 2018 at 12:27 PM Jeff Janes  wrote:

Here is a real-world example from one of my databases where each value is
> about 17 characters long, and is present about 20 times:
>
> gin: 411 MB
> btree: 2167 MB
> hash: 2159 MB
>

For what it is worth, that was 9.6 with freshly rebuilt indexes.

On 11.0 with freshly rebuilt indices, the hash index does get a bit smaller:

gin: 411 MB
btree: 2167 MB
hash: 1864 MB

Cheers,

Jeff
>
>


Re: Is there a way to speed up WAL replay?

2018-10-31 Thread Jeff Janes
On Wed, Oct 31, 2018 at 1:38 AM Torsten Förtsch 
wrote:

> Hi,
>
> I am working on restoring a database from a base backup + WAL. With the
> default settings the database replays about 3-4 WAL files per second. The
> startup process takes about 65% of a CPU and writes data with something
> between 50 and 100 MB/sec.
>
> Is there a way to speed that up? The disk can easily sustain 400-500
> MB/sec.
>

WAL replay is single-threaded, so the most you would be able to speed it up
is 50%, to where it would be taking 100% CPU.

Is the time spent not on the CPU being spent waiting for WAL files to
arrive from the restore_command, or waiting for the blocks it needs to
replay into to get read from disk, or waiting for dirty pages to get
written to disk so they can be evicted to make way for new ones?

One way I found to speed up restore_command is to have another program run
a few WAL files ahead of it, copying the WAL from the real archive into a
scratch space which is on the same filesystem as pg_xlog/pg_wal.  Then have
restore_command simply move (not copy) the requested files from the scratch
space onto %p.  The intra-filesystem move completes much faster than a copy.
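
A minimal sketch of that arrangement (paths are made up): a separate script
keeps copying WAL a few files ahead of the startup process into a scratch
directory on the same filesystem as pg_xlog, and the restore_command is then
just a rename:

restore_command = 'mv /pgdata/wal_scratch/%f %p'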

If it spends time waiting for blocks that need to be recovered into to get
read from disk, and you have enough RAM, you could speed it up by
pre-warming the file system cache.  Something like:

tar -cf - $PGDATA | wc -c

Cheers,

Jeff


pg_stat_replication view

2018-12-10 Thread Jeff Janes
I'm not sure if this is a documentation issue, or something else.

The description of the pg_stat_replication.state column gives:

* catchup: This WAL sender's connected standby is catching up with the
primary.

* streaming: This WAL sender is streaming changes after its connected
standby server has caught up with the primary.

What does this mean?  Is the standby "caught up" when it replays the LSN
which was current on the master as-of the time that the standby initiated
this connection?  Or is it caught up when the master receives at least one
notification that a certain LSN was replayed on the replica, and verifies
that no new WAL has been generated after that certain LSN was generated?
Neither of those things?

If a replica has caught up and then fallen behind again, is that different
from a user/dba perspective than if it never caught up in the first place?

Also, the docs say "Lag times work automatically for physical replication.
Logical decoding plugins may optionally emit tracking messages; if they do
not, the tracking mechanism will simply display NULL lag."  Does the
logical decoding plugin associated with built-in PUBLICATION/SUBSCRIPTION
mechanism introduced in v10 emit tracking messages?

Cheers,

Jeff


Re: Query planner / Analyse statistics bad estimate rows=1 with maximum statistics 10000 on PostgreSQL 10.2

2018-12-22 Thread Jeff Janes
>
>
> - Does the analyse output below mean that it only scanned 51538 of 65463
> rows in the table? Is school_id 36 just being missed in the sample? (This
> happens when the analyse is repeated )
>

Is there a transaction which had deleted all of school_id=36, and then was
just left open indefinitely without either committing or rolling back?

That would explain it, and I don't know of anything else that could.  The
deleted but not committed tuples are still live, but don't get sampled.
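
A minimal sketch of how to check for that (run it as a superuser to see other
users' sessions):

SELECT pid, state, xact_start, now() - xact_start AS xact_age, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;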

Cheers,

Jeff


Re: getting pg_basebackup to use remote destination

2018-12-30 Thread Jeff Janes
On Sat, Dec 29, 2018 at 2:05 PM Chuck Martin 
wrote:

> I thought I knew how to do this, but I apparently don't. I have to set up
> a new server as a standby for a PG 11.1 server. The main server has a lot
> more resources than the standby. What I want to do is run pg_basebackup on
> the main server with the output going to the data directory on the new
> server.
>

pg_basebackup consumes few resources on the standby anyway in the mode you
are running it, other than network and disk.  And those are inevitable
given your end goal, so if you could do what you want, I think it still
wouldn't do what you want.

If you really want to spare the network, you can run compression on the
server side then decompress on the standby.  Currently you can't compress
on the server when invoking it on the standby, so:

pg_basebackup -D - -Ft -X none | pxz | ssh 10.0.1.16 "tar -xJf - -C /somewhere/data_test"

Unfortunately you can't use this along with -X stream or -X fetch.

Really I would probably compress to a file and then use scp/rsync, rather
the streaming into ssh.  That way if ssh gets interrupted, you don't lose
all the work.

Cheers,

Jeff

>


Re: getting pg_basebackup to use remote destination

2018-12-31 Thread Jeff Janes
On Sun, Dec 30, 2018 at 6:17 PM Chuck Martin 
wrote:

> Maybe I need to rethink ths and take Jeff's advice. I executed this:
>
> pg_basebackup -h [main server's URL] -U postgres -P -v -X s -D
> /mnt/dbraid/data
>
> 8 hours ago, and it is now still at 1%. Should it be that slow? The
> database in question is about 750 GB, and both servers are on the same GB
> ethernet network.
>

Over gigabit ethernet, it should not be that slow.  Unless the network is
saturated with other traffic or something.  Might be time to call in the
network engineers.  Can you transfer static files at high speeds between
those two hosts using scp or rsync?  (Or use some other technique to take
PostgreSQL out of the loop and see if your network is performing as it
should)

Are you seeing transfers at a constant slow rate, or are there long freezes
or something?  Maybe the initial checkpoint was extremely slow?
Unfortunately -P option (even with -v) doesn't make this easy to figure
out.  So alas it's back to old school stopwatch and a pen and paper (or
spreadsheet).

Cheers,

Jeff

>


Re: (multiplatform) replication problem

2019-01-09 Thread Jeff Janes
On Tue, Jan 8, 2019 at 10:49 AM W.P.  wrote:

> Hi there,
> I have following setup:
> - master database, 9.5.7, on I386 (Pentium M),
> now i want to replicate database to:
> - slave database. 9.5.7 on armhf (OrangePiPC+).
> Is in possible?
>

I think the error message is telling you that physical replication is not
possible across platforms as different as this.  You could go through the
output of pg_controldata line by line to see what the differences are.
Although perhaps someone can just tell you off the top of their head.

Perhaps you can use logical replication instead, but to do that you should
probably upgrade your PostgreSQL version to at least v10.

Cheers,

Jeff


Re: FDW, too long to run explain

2019-02-17 Thread Jeff Janes
On Mon, Feb 4, 2019 at 2:15 AM Vijaykumar Jain  wrote:

>
> now we have some long running queries via FDW that take minutes and get
> killed explain runs as idle in transaction on remote servers.
>

Are you saying the EXPLAIN itself gets killed, or execution of the plan
generated based on the EXPLAIN (issued under use_remote_estimate = true)
gets killed?  Who is doing the killing, the local side or the foreign
side?  Can you include verbatim log entries for this?


> now the explain for
> select * from sh01.view1  keeps running for minutes sometimes,
>
> then fetch too keeps running for minutes, although the total rows are <
> 1 maybe.
> idle in transaction | FETCH 1 FROM c1
>

What is this?  Is it from some monitoring tool, or pg_stat_activity, or
what?  And is it on the local side or the foreign side?


> other questions:
> also, what is the cost of fetch_size?
>

It will always fetch rows from the foreign server in chunks of this size.  A
larger fetch_size will have less network latency and computational overhead
if many rows are going to be consumed, but will also consume more memory on
the local server, as all rows of a chunk are stored in memory at once.  Also, in the
case of a LIMIT, it reads a large number of rows even if most of them may
be unneeded.  Conceptually, the LIMIT could be used to modify the FETCH
downward to match the LIMIT, but that is not implemented.  In the case of a
view over UNION ALL, I don't think the individual subqueries even know what
the global LIMIT is.
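
fetch_size can be changed per server or per foreign table if you want to
experiment with it; a minimal sketch (the object names are made up):

ALTER SERVER remote_srv OPTIONS (ADD fetch_size '1000');
ALTER FOREIGN TABLE remote_tbl OPTIONS (ADD fetch_size '100');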


>
> I mean given we have a query
>
> select * from foobar limit 1; via FDW
> limit 1 does not get pushed.
> so it seems all rows some to FDW node and then limit is applied?
>

It should not read all rows.  It should read as many multiples of
fetch_size as needed, which should just be 1 multiple in this case.

Cheers,

Jeff

>


Re: FDW, too long to run explain

2019-02-17 Thread Jeff Janes
On Sun, Feb 17, 2019 at 12:41 PM Jeff Janes  wrote:

> On Mon, Feb 4, 2019 at 2:15 AM Vijaykumar Jain 
> wrote:
>
>>
>> now we have some long running queries via FDW that take minutes and get
>> killed explain runs as idle in transaction on remote servers.
>>
>
> Are you saying the EXPLAIN itself gets killed, or execution of the plan
> generated based on the EXPLAIN (issued under use_remote_estimate = true)
> gets killed?  Who is doing the killing, the local side or the foreign
> side?  Can you include verbatim log entries for this?
>

After thinking about it a bit more, I think I see the issue here.  The
EXPLAIN pursuant to use_remote_estimate is issued in the same remote
transaction as the following DECLARE and FETCHes are.  But after the
EXPLAIN is issued, the local server executes the query for a different FDW
to satisfy some other branch of the UNION ALL, giving the first FDW
connection time to hit an idle-in-transaction timeout.  This happens even if
no rows need to be fetched from that FDW, because another branch of the UNION
ALL satisfied the LIMIT.

A question for the PostgreSQL hackers would be, Is it necessary and
desirable that the EXPLAIN be issued in the same transaction as the
eventual DECLARE and FETCHes?  I don't think it is.  I guess if the foreign
side table definition got changed between EXPLAIN and DECLARE it would
cause problems, but changing the foreign side definition out of sync with
the local side can cause problems anyway, so is that important to preserve?

Changing that might narrow but not completely fix the problem, as there
might still be delays between the DECLARE and the FETCH or between
successive FETCHes.

So a question for you would be, why do you have such an aggressive setting
for idle_in_transaction_session_timeout that it causes this to happen?
Couldn't you relax it, perhaps just for the role used for the FDW
connections?
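
If the timeout has to stay aggressive globally, it can at least be relaxed
for just that role; a minimal sketch (the role name is made up):

ALTER ROLE fdw_reader SET idle_in_transaction_session_timeout = '30min';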

Cheers,

Jeff

>


Re: [External] Re: FDW, too long to run explain

2019-02-17 Thread Jeff Janes
On Sun, Feb 17, 2019 at 1:52 PM Vijaykumar Jain  wrote:

> Assuming your questions as 1,2,3, please find my answers below.
>
> 1)"explain" on foreign servers run as "idle in transactions". coz they
> were running very long (in the order of some minutes) , pgbouncer (in tx
> level pooling) setting kill them (as idle in tx time limit exceeded of 5
> mins) or else results in too many connections piling up.
> 2)yes, i get those from pg_stat_activity, it truncates the full statement,
> but  it shows up as * EXPLAIN select col1, col2  *  00:00:44 | idle in
> transaction (this is just one of the screenshots i have). (on the foreign
> side)
>

You are misinterpreting that data.  The EXPLAIN is not currently running.
It is the last statement that was running prior to the connection going
idle-in-transaction.  See my just previous email--I think the reason it is
idle is that the local is servicing some other part of the query (probably
on a different FDW), and that is taking a long time.

Are all the connections piling up from postgres_fdw, or are many of them
from other applications?  I think your timeout is just shifting symptoms
around without fixing the underlying problem, while also making that
underlying problem hard to diagnose.



> 3)yes, i think we kind of understood that part (fetch and memory), but i
> am not sure if that is used as any hint in plan generation too.
>

The query is planned as part of a cursor.  As such, it will
use cursor_tuple_fraction as the "hint".  Perhaps you could tweak this
parameter on the foreign side.  I think that a low setting for this
parameter should give similar plans as a small LIMIT would give you, while
large settings would give the same plans as a large (or no) LIMIT would.

I think postgres_fdw should pass down the LIMIT when it can do so, but it
doesn't currently.

Cheers,

Jeff

>


Re: [External] Re: FDW, too long to run explain

2019-02-17 Thread Jeff Janes
On Sun, Feb 17, 2019 at 2:37 PM Vijaykumar Jain  wrote:

>
> Ok, i raked this from the logs where enabled log_min_duration_statement =
> 10s
>
> 2019-01-31 12:48:18 UTC LOG:  duration: 29863.311 ms  statement: EXPLAIN
> SELECT blah, FROM public.view WHERE ((scheduled_bdt >= '2019-01-20'::date))
> AND ((scheduled_bdt <= '2019-01-26'::date)) AND ((somekey = ANY
> ('{269029,123399,263164,261487}'::bigint[])))   (both the columns are
> indexed)
>

That is interesting.  Was that in the logs for the local or the foreign
side?  And is it common, or rare?

If on the local side, could it be that the EXPLAINs sent to the foreign
side are being made to wait by the connection pooler, leading to long
delays?  If that is from the foreign side, then it should be conceptually
unrelated to FDW.  Any chance you could reproduce the slowness in your test
environment?  Slowness in the planner is probably related to the schema
structure, not the data itself.

I don't think this would be related to the idle-in-transaction, except that
one FDW connection maybe idle-in-transaction after its EXPLAIN is done
because it is waiting for another FDW connection to slowly run its EXPLAIN.

Cheers,

Jeff

>


Re: [External] Re: FDW, too long to run explain

2019-02-17 Thread Jeff Janes
On Sun, Feb 17, 2019 at 6:32 AM Vijaykumar Jain  wrote:

> I am yet to figure out the reason, what we have done is implement fake
> columns to represent samples and giving them random numbers and keeping
> other bulls to fake limit.
>
> Most of the queries that were impacted were the ones that did not push
> order by and limit to foreign servers.
> I am also trying to upgrade pg11 to make use of parallelisation.
>

postgres_fdw operates through declared cursors, and declared cursors
inhibit parallel query.  This doesn't change in v11, see
https://www.postgresql.org/docs/11/when-can-parallel-query-be-used.html

I'm not aware of any other changes in v11 that are likely to help you out.

Cheers,

Jeff

>


Re: Future Non-server Windows support???

2019-02-24 Thread Jeff Janes
On Sun, Feb 24, 2019 at 7:06 AM Bill Haught  wrote:

> I noticed that for 11.2, non-server versions of Windows are not listed
> as supported platforms.


Listed where?  The only thing I see is "Windows (Win2000 SP4 and later)",
and that isn't a list and surely there are non-server versions of Windows
which are later than that.

https://www.postgresql.org/docs/11/supported-platforms.html

I've run 11.2 on "Windows 10 home" without apparent problem.

Cheers,

Jeff


Re: Future Non-server Windows support???

2019-02-24 Thread Jeff Janes
On Sun, Feb 24, 2019 at 3:10 PM Adrian Klaver 
wrote:

> On 2/24/19 10:38 AM, Jeff Janes wrote:
> > On Sun, Feb 24, 2019 at 7:06 AM Bill Haught  > <mailto:wlhaught4754...@att.net>> wrote:
> >
> > I noticed that for 11.2, non-server versions of Windows are not
> listed
> > as supported platforms.
> >
> >
> > Listed where?  The only thing I see is "Windows (Win2000 SP4 and
>
> I believe the OP is referring to:
>
> https://www.postgresql.org/download/windows/
>
> Interactive installer by EnterpriseDB
>

OK, thanks.  I don't know how that list is maintained, but on EnterpriseDB
itself, they list those two versions of Windows for one of their paid
products.  For the free community version, they list: "Windows x86-64:
Windows 2012 R2 & R1, Windows 2016, Windows 7, 8, 10".

https://www.enterprisedb.com/services-support/edb-supported-products-and-platforms

But, this EnterpriseDB page itself is out of date, as it lists 11.1 but not
11.2.  I wouldn't think they would have desupported between 11.1 and 11.2
(and I can install 11.2 using their installer), so I think this is a bug on
https://www.postgresql.org/download/windows/ and a stale page on edb.

Cheers,

Jeff


Re: Optimizing Database High CPU

2019-02-27 Thread Jeff Janes
On Wed, Feb 27, 2019 at 2:07 PM Scottix  wrote:

> Hi we are running a Postgresql Database 9.4.18 and we are noticing a
> high CPU usage. Nothing is critical at the moment but if we were to
> scale up more of what we are doing, I feel we are going to run into
> issues.
>

9.4 is old.  A lot of improvements have been made since then, including
around contention issues, such as replacing some uses of spinlocks with
atomic operations instead.

You would be better off putting time into upgrading, rather than putting
time into worrying about performance issues on something that will soon be
end-of-life anyway.


It is a 2 x 6 core machine, 128GB ram, Raid 10 HDD
>
> The iostat metrics for the HDD look minimal < 10% util
> Available memory seems to be good.
>
> The CPU utilization is what's bothering me
> user  5-7%
> sys   50-70% - seems high
> wa    <0.5%
>
> So trying to troubleshoot possible high cpu:
> Number of concurrent connections averages 50 to 100 - seems high
> although we max at 200.
>

If those 50-100 connections are all active at once, yes, that is high.
They can easily spend more time fighting each other over LWLocks,
spinlocks, or cachelines rather than doing useful work.  This can be
exacerbated when you have multiple sockets rather than all cores in a
single socket.  And these problems are likely to present as high Sys times.

Perhaps you can put up a connection pooler which will allow 100 connections
to all think they are connected at once, but forces only 12 or so to
actually be active at one time, making the others transparently queue.



> No long running queries
> Streaming replication to backup server
> High update tables - we have about 4 tables that have a high volume of
> updates
>

Is it a few transactions updating a lot of rows each, or many transactions
updating a few rows each?


> High update rate is what I am thinking is causing the issue and I
> found possibly setting fillfactor to a lower default
>

I don't think that that is promising.  I wouldn't expect high Sys time if
this was the problem.  And with a high rate of updates (unless each update
is essentially to every row in the table), I would expect the table to
reach a steady state of tuple density.  Basically a too-high fillfactor
will fix itself naturally over time, it might be just take a while to do
it.  If your system has been running for a while, it has probably already
arrived at a steady state.  You can use the extension pg_freespacemap to so
how the freespace is spread around in your table blocks.


> Are there any statistics I could run to see if a setting change would help.
>

I'd probably start with pg_stat_activity table's "state" column to see how
many of your connections are active at once, and its columns
"wait_event_type" and "wait_event" to see what they think they are waiting
on (but those last columns aren't present until 9.6).
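
A minimal sketch of that first check:

SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY count(*) DESC;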

Cheers,

Jeff


Re: FDW and remote cursor_tuple_fraction

2019-03-03 Thread Jeff Janes
On Sun, Mar 3, 2019 at 12:38 PM auxsvr  wrote:

> Hi,
>
> One reason the remote plans via FDW are different than the plans of
> queries running directly on the remote server is that a cursor is used,
> which is optimized for low number of rows according to
> cursor_tuple_fraction. Is there any way to set parameters that affect the
> queries on the remote via FDW? If not, is there any plan to introduce such
> setting?
>

It is a bit ugly, but there is this:

alter server foo options (options '-c cursor_tuple_fraction=1.0');

Cheers,

Jeff


Re: Optimizing Database High CPU

2019-03-04 Thread Jeff Janes
On Wed, Feb 27, 2019 at 5:01 PM Michael Lewis  wrote:

> If those 50-100 connections are all active at once, yes, that is high.
>> They can easily spend more time fighting each other over LWLocks,
>> spinlocks, or cachelines rather than doing useful work.  This can be
>> exacerbated when you have multiple sockets rather than all cores in a
>> single socket.  And these problems are likely to present as high Sys times.
>>
>> Perhaps you can put up a connection pooler which will allow 100
>> connections to all think they are connected at once, but forces only 12 or
>> so to actually be active at one time, making the others transparently queue.
>>
>
> Can you expound on this or refer me to someplace to read up on this?
>

Just based on my own experimentation.  This is not a blanket
recommendation, but specific to the situation where we already suspect
there is contention, and the server is too old to have the
pg_stat_activity.wait_event column.


> Context, I don't want to thread jack though: I think I am seeing similar
> behavior in our environment at times with queries that normally take
> seconds taking 5+ minutes at times of high load. I see many queries showing
> buffer_mapping as the LwLock type in snapshots but don't know if that may
> be expected.
>

It sounds like your processes are fighting to reserve buffers in
shared_buffers in which to read data pages.  But those data pages are
probably already in the OS page cache, otherwise reading it from disk would
be slow enough that you would be seeing some type of IO wait, or buffer_io,
rather than buffer_mapping as the dominant wait type.  So I think that
means you have most of your data in RAM, but not enough of it in
shared_buffers.  You might be in a rare situation where setting
shared_buffers to a high fraction of RAM, rather than the usual low
fraction, is called for.  Increasing NUM_BUFFER_PARTITIONS might also be
useful, but that requires a recompilation of the server.  But do these
spikes correlate with anything known at the application level?  A change in
the mix of queries, or a long report or maintenance operation?  Maybe the
query plans briefly toggle over to using seq scans rather than index scans
or vice versa, which drastically changes the block access patterns?


> In our environment PgBouncer will accept several hundred connections and
> allow up to 100 at a time to be active on the database which are VMs with
> ~16 CPUs allocated (some more, some less, multi-tenant and manually
> sharded). It sounds like you are advocating for connection max very close
> to the number of cores. I'd like to better understand the pros/cons of that
> decision.
>

There are good reasons to allow more than that.  For example, your
application holds some transactions open briefly while it does some
cogitation on the application-side, rather than immediately committing and
so returning the connection to the connection pool.  Or your server has a
very high IO capacity and benefits from lots of read requests in the queue
at the same time, so it can keep every spindle busy and every rotation
productive.  But, if you have no reason to believe that any of those
situations apply to you, but do have evidence that you have lock contention
between processes, then I think that limiting the number active processes
to the number of cores is a good starting point.

Cheers,

Jeff


Re: WAL Archive Cleanup?

2019-03-22 Thread Jeff Janes
On Fri, Mar 22, 2019 at 6:28 AM Foo Bar  wrote:

>
> Fast forward two weeks, this cluster has been running but not seeing any
> traffic.  And my master server has filled its archive directory.
>

Are you sure it is the archive directory (/hab/svc/postgresql/data/archive)
which is filling up, and not the live directory (pg_wal or pg_xlog)?  This
is often a point of confusion.



> I found an older thread
> 
> that seemed to indicate that the wal_keep_segments is what was causing
> psql to keep so many WAL files...  However, mine is set to 8, but there
> were thousands of log files...
>

wal_keep_segments applies to the live directory (pg_xlog or pg_wal), not
the archive directory.


> This SO thread
> 
> seems to indicate that archive_cleanup_command can be run every "restart
> point", but googling "psql restart point" brings me to this page
>  which a
> ^F indicates does not mention a "restart point"...  So this *feels* like
> the configuration setting I want, but I can't find the documentation that
> confirms it for me...
>

"psql" is the name of a specific command line tool used to connect to a
PostgreSQL database server, it is not the name of the database itself. The
database is usually abbreviated "pgsql".  And "restartpoint" is usually
spelled as one work in technical discussions of it.  Or at least, searching
for it that way avoids finding things which mention each word separately in
different senses.

A restartpoint is just a checkpoint which is run on the replica (as opposed
to a checkpoint proper, which runs on the master).  And
archive_cleanup_command is executed on the replica, not on the master, so
to use it the replica has to have write access to the archive directory.

archive_cleanup_command is pretty much obsolete.  The modern way to do this
is with streaming replication, using either replication slots or
wal_keep_segments.  If the only reason you want an archive is for
replication, then use streaming replication and do away with the archive
completely.  There are reasons other than replication that one might want
to keep a WAL archive, but those reasons don't seem to apply to you.  And
if they did you almost certainly wouldn't want to run
archive_cleanup_command on it.
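
For the record, in the setups where archive_cleanup_command is still
appropriate, it usually amounts to running pg_archivecleanup from the
replica's recovery.conf; a minimal sketch using the archive path from your
message:

archive_cleanup_command = 'pg_archivecleanup /hab/svc/postgresql/data/archive %r'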

Cheers,

Jeff

>


Re: Forks of pgadmin3?

2019-03-22 Thread Jeff Janes
On Fri, Mar 22, 2019 at 8:04 AM Steve Atkins  wrote:

>
>
> > On Mar 22, 2019, at 10:56 AM, Christian Henz 
> wrote:
> >
>
> There's the BigSQL fork, which had at least some minimal support
> for 10. I've no idea whether it's had / needs anything for 11


I just installed BigSQL's v11 of the database to get the pgAdmin3 that
comes with it (I couldn't get the Windows installer to install just
pgAdmin, I had to take the entire server installation along with it) .
Even though it comes with v11, when you start it says it only supports up
to v10, and then gives a series of warnings about catalogs and system admin
functions not being as expected.  Once you are past the warnings, it does
work at least on the surface, but I have to think some features aren't
going to work.

Cheers,

Jeff


Re: CVE-2019-9193 about COPY FROM/TO PROGRAM

2019-04-05 Thread Jeff Janes
On Tue, Apr 2, 2019 at 11:31 AM Andres Freund  wrote:

> Hi,
>
> On 2019-04-02 07:35:02 -0500, Brad Nicholson wrote:
>
> > A blog post would be nice, but it seems to me having something about this
> > clearly in the manual would be best, assuming it's not there already.  I
> > took a quick look, and couldn't find anything.
>
> https://www.postgresql.org/docs/devel/sql-copy.html
>
> "Note that the command is invoked by the shell, so if you need to pass
> any arguments to shell command that come from an untrusted source, you
> must be careful to strip or escape any special characters that might
> have a special meaning for the shell. For security reasons, it is best
> to use a fixed command string, or at least avoid passing any user input
> in it."
>
> "Similarly, the command specified with PROGRAM is executed directly by
> the server, not by the client application, must be executable by the
> PostgreSQL user. COPY naming a file or command is only allowed to
> database superusers or users who are granted one of the default roles
> pg_read_server_files, pg_write_server_files, or
> pg_execute_server_program, since it allows reading or writing any file
> or running a program that the server has privileges to access."
>
> Those seem reasonable to me?
>

Yes, but I think that the use of the phrase "default roles" here is
unfortunate.  I know it means that the role exists by default, but it is
easy to read that to mean they are granted by default.  They should
probably be called something like 'built-in roles' or 'system roles'.

And even with the understanding that we are referring to existence, not
grant status, "default roles" is still not really correct. If it exists by
default, that means I can make it not exist by taking action.  But these
roles cannot be dropped.

We don't have 'default functions' or 'default types' in the user-facing
documentation.  We shouldn't call these 'default roles'.

Cheers,

Jeff


Re: When do vacuumed pages/tuples become available for reuse?

2019-04-11 Thread Jeff Janes
On Thu, Apr 11, 2019 at 10:28 AM rihad  wrote:

>
> Yup, it's just that n_dead_tuples grows by several hundred thousand (the
> table sees much much more updates than inserts) and disk usage grows
> constantly between several hour long vacuum runs. Running vacuum full
> isn't an option.
>

The disk usage doesn't reach a steady state after one or two autovacs?  Or
it does, but you are just unhappy about the ratio between the steady state
size and the theoretical fully packed size?

Cheers,

Jeff


Re: When do vacuumed pages/tuples become available for reuse?

2019-04-11 Thread Jeff Janes
On Thu, Apr 11, 2019 at 11:14 AM rihad  wrote:

> autovacuum_vacuum_scale_factor = 0.01
> autovacuum_vacuum_threshold = 50

This seems counterproductive.  You need to make the vacuum more efficient,
not more frantic.


> autovacuum_vacuum_cost_delay = 10ms
> autovacuum_vacuum_cost_limit = 400
>


> Anything more than that and we risk impacting the performance of user
> queries.
>

Well, unbounded bloat will also impact the user queries--eventually.  Was
this an experimental determination?  Can you tell what about the autovac
most impacts the user queries, the reading or the writing?

You might just have more workload than your hardware can handle.  There is
always going to be some fundamental limit, and while you can tune your way
up to that limit, you can't tune your way past it.

Cheers,

Jeff


Re: When do vacuumed pages/tuples become available for reuse?

2019-04-11 Thread Jeff Janes
On Thu, Apr 11, 2019 at 11:44 AM rihad  wrote:

> On 04/11/2019 07:40 PM, Jeff Janes wrote:
>
>
> The disk usage doesn't reach a steady state after one or two autovacs?  Or
> it does, but you are just unhappy about the ratio between the steady state
> size and the theoretical fully packed size?
>
> Cheers,
>
> Jeff
>
>
> Since we dump&restore production DB daily into staging environment, the
> difference in size (as reported by psql's \l+) is 11GB in a freshly
> restored DB as opposed to 70GB in production.
>

Yeah, that seems like a problem.  Do you have long lived
transactions/snapshots that are preventing vacuuming from removing dead
tuples?  You can run a manual "vacuum verbose" and see how many dead but
nonremovable tuples there were, or set log_autovacuum_min_duration to some
non-negative value less than the time the autovac takes, and do the same.

(Indeed, those dumps you take daily might be the source of those long-lived
snapshots.  How long does a dump take?)

Also, what does pg_freespace (
https://www.postgresql.org/docs/current/pgfreespacemap.html) show about the
availability of space in the table?  How about pgstattuple (
https://www.postgresql.org/docs/current/pgstattuple.html)
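
A minimal sketch of using those two extensions (substitute your own table
name):

CREATE EXTENSION IF NOT EXISTS pg_freespacemap;
CREATE EXTENSION IF NOT EXISTS pgstattuple;

SELECT count(*) AS pages, pg_size_pretty(sum(avail)) AS listed_free
FROM pg_freespace('your_table');

SELECT * FROM pgstattuple('your_table');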

Cheers,

Jeff


Re: When do vacuumed pages/tuples become available for reuse?

2019-04-11 Thread Jeff Janes
On Thu, Apr 11, 2019 at 1:30 PM Alvaro Herrera 
wrote:

> On 2019-Apr-11, Tom Lane wrote:
>
> > Alvaro Herrera  writes:
> > > On 2019-Apr-11, rihad wrote:
> > >> 2019-04-11 19:39:44.450844500   tuples: 19150 removed, 2725811
> remain, 465 are dead but not yet removable
> >
> > > What Jeff said.  This vacuum spent a lot of time, only to remove
> miserly
> > > 19k tuples, but 2.7M dead tuples remained ... probably because you have
> > > long-running transactions preventing vacuum from removing them.
> >
> > I think you misread it --- I'm pretty sure "N remain" is referring
> > to live tuples.  Maybe we should adjust the wording to make that
> > clearer?
>
> Oh, I've been confused by that many times.  Not good trying to decode
> confusing messages while simultaneously trying to figure out logical
> decoding bugs that have already been fixed :-(
>
> Yeah, let's reword that.  I've had to read the source half a dozen times
> because I always forget what each number means.
>

The fact that the output of "vacuum verbose" and
"log_autovacuum_min_duration" look so little like each other certainly
doesn't help us learn what they mean here.  If we are re-wording things, we
might want to take a stab at unifying those to some extent.  If we just
want to do a slight re-wording, I don't know what it would need to look
like.  "remain" includes live, recently dead, and uncommitted new, and
uncommitted old (I think) so we can't just change "recent" to "live".

Cheers,

Jeff


Re: When do vacuumed pages/tuples become available for reuse?

2019-04-11 Thread Jeff Janes
On Thu, Apr 11, 2019 at 12:18 PM rihad  wrote:

> On 04/11/2019 08:09 PM, Jeff Janes wrote:
>
> On Thu, Apr 11, 2019 at 11:44 AM rihad  wrote:
>
>>
>> Since we dump&restore production DB daily into staging environment, the
>> difference in size (as reported by psql's \l+) is 11GB in a freshly
>> restored DB as opposed to 70GB in production.
>>
>
> Yeah, that seems like a problem.  Do you have long lived
> transactions/snapshots that are preventing vacuuming from removing dead
> tuples?  You can run a manual "vacuum verbose" and see how many dead but
> nonremovable tuples there were, or set log_autovacuum_min_duration to some
> non-negative value less than the autovac takes, and do the same.
>
> vacuum frees tuples just fine. It's just that by the time each run
> finishes many more accumulate due to table update activity, ad nauseum. So
> this unused space constantly grows. Here's a sample autovacuum run:
>
> 2019-04-11 19:39:44.450841500 [] LOG:  automatic vacuum of table
> "foo.public.bar": index scans: 1
> 2019-04-11 19:39:44.450843500   pages: 0 removed, 472095 remain, 4 skipped
> due to pins, 39075 skipped frozen
> 2019-04-11 19:39:44.450844500   tuples: 19150 removed, 2725811 remain, 465
> are dead but not yet removable
> 2019-04-11 19:39:44.450845500   buffer usage: 62407557 hits, 6984769
> misses, 116409 dirtied
> 2019-04-11 19:39:44.450846500   avg read rate: 16.263 MB/s, avg write
> rate: 0.271 MB/s
> 2019-04-11 19:39:44.450847500   system usage: CPU 59.05s/115.26u sec
> elapsed 3355.28 sec
>

This data doesn't seem to support either one of our theories.  "Dead but
not yet removable" is low.  But "removed" also seems pretty low.  Is 19,150
really the number of updates you think occur over the course of an hour
which causes the problem you are seeing?  Updates that happened during one
vacuum should be cleanly caught by the next one, so you should only see a
steady state of bloat, not unbounded increase.

But your buffer usage being 132 times the number of pages in the table
suggests it is your indexes, not your table, which are bloated.

How many indexes do you have, and of what type?  Index pages can only get
reused when they become completely empty, or when a new indexed value fits
into (or near) the key-space that that page already covers.  So if the key
space for new tuples is constantly migrating around and your pages never
become absolutely empty, you can get unbounded bloat in the indexes.

Can you compare the sizes object by object between the live and the stage,
taking care not to include index (or toast) size into the size of their
parent table?
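
A minimal sketch of such a comparison, which keeps index and TOAST sizes out
of the parent table's number (run it against both databases and diff the
output):

SELECT c.relname, c.relkind, pg_size_pretty(pg_relation_size(c.oid)) AS main_fork
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'public' AND c.relkind IN ('r', 'i')
ORDER BY pg_relation_size(c.oid) DESC
LIMIT 25;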


> Also, what does pg_freespace (
> https://www.postgresql.org/docs/current/pgfreespacemap.html) show about
> the availability of space in the table?  How about pgstattuple (
> https://www.postgresql.org/docs/current/pgstattuple.html)
>
> Thanks, I'll try those. But as I said freshly restored DB is only 11GB in
> size, not 70 (only public schema is used).
>

Yeah, but we need to know **why** that extra 59GB  is not being reused, not
simply the fact that it isn't being reused.  If it isn't listed as free in
the freespace map, then PostgreSQL might not know how to find it in order
to reuse it, for example. But now that I think it is the indexes, not the
table, that is bloated I would chase that part down first.  No point
checking the freespace of the table proper if the problem is with the
indexes.

Cheers,

Jeff


Re: When do vacuumed pages/tuples become available for reuse?

2019-04-11 Thread Jeff Janes
On Thu, Apr 11, 2019 at 1:48 PM Jeff Janes  wrote:


> If we just want to do a slight re-wording, I don't know what it would need
> to look like.  "remain" includes live, recently dead, and uncommitted new,
> and uncommitted old (I think) so we can't just change "recent" to "live".
>

Of course I meant we can't just change "remain" to "live".

Cheers,

Jeff

>


Re: shared_buffers on Big RAM systems

2019-04-11 Thread Jeff Janes
On Thu, Dec 13, 2018 at 11:51 PM Thomas Munro 
wrote:

> On Fri, Dec 14, 2018 at 2:17 AM Ron  wrote:
> > https://www.postgresql.org/docs/9.6/runtime-config-resource.html
> >
> > The docs say, "If you have a dedicated database server with 1GB or more
> of
> > RAM, a reasonable starting value for shared_buffers is 25%".
> >
> > But that's pretty archaic in 2018.  What if the dedicated database server
> > has 128GB RAM?
>
> I agree, we might as well drop the words "with 1GB of more of RAM".
> That's the size of the very smallest cloud instances available these
> days, available for free or up to a few bucks a month, and for
> physical servers I wonder if you can still get DIMMs that small.
>

AWS still has some with 512MB.  Although it can be a challenge to get
anything to compile in that amount of memory if there is anything else
running.

But I don't think I would recommend starting at 25% of RAM on a larger
server.  Is that really good advice?  I would usually start out at 1GB even if the
server has 128GB, and increase it only if there was evidence it needed to
be increased.  Due to double buffering between shared_buffers and OS cache,
25% seems like a lot of wasted space.  You need shared_buffers as a cooling
off tank where dirty data can wait for their corresponding WAL to get
flushed in the background before they get written out themselves.  I think
1GB is enough for this, even if you have 128GB of RAM.

If your entire database (or the active portion of it) fits in RAM, then it
probably makes sense to set shared_buffers high enough to hold your entire
database.  But if it doesn't fit in RAM, then I don't see a reason to
devote even 25% of a large server to shared_buffers.


Re: Forcing index usage

2019-04-24 Thread Jeff Janes
On Wed, Apr 3, 2019 at 12:13 PM Jonathan Marks 
wrote:

Is there a way to tell Postgres “please don’t use index X when queries that
> could use index Y instead occur?”
>

Late to the party here, but...

Not directly.  I've had luck in changing the procost of functions (or the
functions which back the operators) which are frequently executed in the
slower plan, but not frequently executed in the faster plan.  For example,
walking the time index executing @@ on each row until it finds enough is
probably going to involve a lot more @@ than using the full text index and
invoking @@ only on the recheck rows.  The cost of @@ (via "ts_match_vq")
is probably way too low, especially if it has to fish an argument out of
TOAST.
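
A minimal sketch of that kind of adjustment (the cost value is just a number
to experiment with, and changing a built-in function requires superuser):

ALTER FUNCTION ts_match_vq(tsvector, tsquery) COST 100;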

If that doesn't work, you can just change the query to prevent the bad
index from being used.

For example, PostgreSQL is not currently smart enough to use an index on
"mtime" to support the ordering of a query that is written as:

ORDER BY mtime + interval '0 seconds'

Cheers,

Jeff


Re: Starting Postgres when there is no disk space

2019-05-03 Thread Jeff Janes
On Wed, May 1, 2019 at 10:25 PM Igal Sapir  wrote:

>
> I have a scheduled process that runs daily to delete old data and do full
> vacuum.  Not sure why this happened (again).
>

If you are doing a regularly scheduled "vacuum full", you are almost
certainly doing something wrong.  Are these "vacuum full" completing, or
are they failing (probably due to transient out of space errors)?

An ordinary non-full vacuum will make the space available for internal
reuse. It will not return the space to filesystem (usually), so won't get
you out of the problem.  But it should prevent you from getting into the
problem in the first place.  If it is failing to reuse the space
adequately, you should figure out why, rather than just blindly jumping to
regularly scheduled "vacuum full".  For example, what is it that is
bloating, the tables themselves, their indexes, or their TOAST tables?  Or
is there any bloat in the first place? Are you sure your deletions are
equal to your insertions, over the long term average?  If you are doing
"vacuum full" and you are certain it is completing successfully, but it
doesn't free up much space, then that is strong evidence that you don't
actually have bloat, you just have more live data than you think you do.
(It could also mean you have done something silly with your "fillfactor"
settings.)

If you don't want the space to be reused, to keep a high correlation
between insert time and physical order of the rows for example, then you
should look into partitioning, as you have already noted.

Now that you have the system up again and some space freed up, I'd create a
"ballast" file with a few gig of random (to avoid filesystem-level
compression, should you have such a filesystem) data on the same device
that holds your main data, that can be deleted in an emergency to give you
enough free space to at least start the system.  Of course, monitoring is
also nice, but the ballast file is more robust and there is no reason you
can't have both.
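
A minimal sketch of creating such a ballast file (the path and size are
placeholders); random data defeats any filesystem-level compression:

dd if=/dev/urandom of=/var/lib/pgsql/ballast bs=1M count=4096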

Cheers,

Jeff


Re: Connecting to NOTIFY with telnet

2019-05-04 Thread Jeff Janes
On Sat, May 4, 2019 at 1:49 PM Igal Sapir  wrote:

> Christoph,
>
> On Sat, May 4, 2019 at 10:44 AM Christoph Moench-Tegeder <
> c...@burggraben.net> wrote:
>
>> ## Igal Sapir (i...@lucee.org):
>>
>> > My main "issue" is that the official pgjdbc driver does not support the
>> > notifications with listen and I was trying to figure out why.
>>
>> https://jdbc.postgresql.org/documentation/head/listennotify.html
>>
>>
> I should have been more explicit.  My main issue is with the following
> statement from the link that you posted:
>
> > A key limitation of the JDBC driver is that it cannot receive
> asynchronous notifications and must poll the backend to check if any
> notifications were issued
>
> Polling is much less efficient than event handling and I'm sure that
> there's a major performance hit with that.
>

Isn't that addressed here?:

// If this thread is the only one that uses the connection, a timeout can
// be used to receive notifications immediately:
// org.postgresql.PGNotification notifications[] = pgconn.getNotifications(1);

Cheers,

Jeff

>


Re: Connecting to NOTIFY with telnet

2019-05-04 Thread Jeff Janes
On Sat, May 4, 2019 at 3:04 PM Igal Sapir  wrote:

> Jeff,
>
> On Sat, May 4, 2019 at 11:34 AM Jeff Janes  wrote:
>
>> On Sat, May 4, 2019 at 1:49 PM Igal Sapir  wrote:
>>
>>> Christoph,
>>>
>>> On Sat, May 4, 2019 at 10:44 AM Christoph Moench-Tegeder <
>>> c...@burggraben.net> wrote:
>>>
>>>> ## Igal Sapir (i...@lucee.org):
>>>>
>>>> > My main "issue" is that the official pgjdbc driver does not support
>>>> the
>>>> > notifications with listen and I was trying to figure out why.
>>>>
>>>> https://jdbc.postgresql.org/documentation/head/listennotify.html
>>>>
>>>>
>>> I should have been more explicit.  My main issue is with the following
>>> statement from the link that you posted:
>>>
>>> > A key limitation of the JDBC driver is that it cannot receive
>>> asynchronous notifications and must poll the backend to check if any
>>> notifications were issued
>>>
>>> Polling is much less efficient than event handling and I'm sure that
>>> there's a major performance hit with that.
>>>
>>
>> Isn't that addressed here?:
>>
>> // If this thread is the only one that uses the connection, a timeout can
>> // be used to receive notifications immediately:
>> // org.postgresql.PGNotification notifications[] = pgconn.getNotifications(1);
>>
>>
> It "helps", but it's still not the same as keeping the connection open and
> receiving messages in real time.
>

But it is the same as that.  It keeps the connection open, and receives the
messages in real time.  It blocks either until it gets a NOTIFY, or for 10
seconds, whichever occurs first.  You can use 0 to block forever until a
NOTIFY arrives.  Maybe you didn't remove the sleep a few lines further
down?  Or is there some other problem here?

Cheers,

Jeff

>


Re: Questions about btree_gin vs btree_gist for low cardinality columns

2019-06-02 Thread Jeff Janes
On Fri, May 24, 2019 at 11:26 AM Jeremy Finzel  wrote:

> I have been hoping for clearer direction from the community about
> specifically btree_gin indexes for low cardinality columns (as well as low
> cardinality multi-column indexes).  In general there is very little
> discussion about this both online and in the docs.  Rather, the emphasis
> for GIN indexes discussed is always on full text search of JSON indexing,
> not btree_gin indexes.
>
> However, I have never been happy with the options open to me for indexing
> low cardinality columns and was hoping this could be a big win.  Often I
> use partial indexes as a solution, but I really want to know how many use
> cases btree_gin could solve better than either a normal btree or a partial
> index.
>

What does "low cardinality" mean here?  For example, I have a million
different items with an item_id, but on average about 30 copies of each
item.  The inventory table has a row for each copy (copies are not fully
interchangeable, so can't be just be a count in some other table).  A
million is not usually considered a low number, but I find a gin index
(over btree_gin on the item_id) useful here as it produces a much smaller
index due to not repeating the item_id each time and due to compressing the
tid list, even though there are only about 30 tids to compress.

(You mention basically the reciprocal of this, 30 distinct values with
3,000,000 copies each, but I don't if that was your actual use case, or
just the case you were reading about in the documents you were reading)


>
> Here are my main questions:
>
> 1.
>
> "The docs say regarding *index only scans*: The index type must support
> index-only scans. B-tree indexes always do. GiST and SP-GiST indexes
> support index-only scans for some operator classes but not others. Other
> index types have no support. The underlying requirement is that the index
> must physically store, or else be able to reconstruct, the original data
> value for each index entry. As a counterexample, GIN indexes cannot support
> index-only scans because each index entry typically holds only part of the
> original data value."
>
> This is confusing to say "B-tree indexes always do" and "GIN indexes
> cannot support index-only scans", when we have a btree_gin index type.
> Explanation please ???
>

B-tree is the name of a specific index implementation in PostgreSQL.  That
is what is referred to here.  btree_gin offers operator classes to be used in
a GIN index to mimic/implement b-tree data structures/algorithms, but that
doesn't make it the same thing.  It is just a GIN index doing btree things.
Perhaps PostgreSQL should use a "brand name" for its specific
implementation of the default index type, to distinguish it from the
generic algorithm description.


>
> Is it true that for a btree_gin index on a regular column, "each index
> entry typically holds only part of the original data value"?  Do these
> still not support index only scans?  Could they?  I can't see why they
> shouldn't be able to for a single indexed non-expression field?
>

For single column using a btree_gin operator, each index entry holds the
entire data value.  But the system doesn't know about that in any useful
way, so it doesn't implement index only scans, other than in the special
case where the value does not matter, like a 'count(*)'.  Conceptually
perhaps this could be fixed, but I don't see it happening. Since an
index-only scan is usually not much good with only a single-column index, I
don't see much excitement to improve things here.  If there is more than
one column in the index, then for GIN the entire value from both columns
would not be stored in the same index entry, so in this case it can't use
an index-only scan even conceptually to efficiently fetch the value of one
column based on the supplied value of another one.


>
> 2.
>
> Lack of index only scans is definitely a downside.  However, I see
> basically identical performance, but way less memory and space usage, for
> gin indexes.  In terms of read-only performance, if index only scans are
> not a factor, why not always recommend btree_gin indexes instead of regular
> btree for low cardinality fields, which will yield similar performance but
> use far, far less space and resources?
>

GIN indexes over btree_gin operators do not support inequality or BETWEEN
queries efficiently.  True read-onlyness is often talked about but rarely
achieved; I would be reluctant to design a system around management
promises that something won't ever change.  Btree indexes are way more
thoroughly tested than GIN.  When I became interested in using them in
more-or-less the way you describe, I started torture testing on them and
quickly found some bugs (hopefully all fixed now, but I wouldn't bet my
life on it).  btree_gin covers the most frequently used data types, but far
from all of them.  And as described above, multi-column GIN indexes are
just entirely different from multi-column B-Tree indexes, and gen

Re: Questions about btree_gin vs btree_gist for low cardinality columns

2019-06-03 Thread Jeff Janes
On Sun, Jun 2, 2019 at 7:07 PM Tom Lane  wrote:

> Jeff Janes  writes:
> > On Fri, May 24, 2019 at 11:26 AM Jeremy Finzel 
> wrote:
> >> I have been hoping for clearer direction from the community about
> >> specifically btree_gin indexes for low cardinality columns (as well as
> low
> >> cardinality multi-column indexes).  In general there is very little
> >> discussion about this both online and in the docs.  Rather, the emphasis
> >> for GIN indexes discussed is always on full text search of JSON
> indexing,
> >> not btree_gin indexes.
>
> I just wanted to mention that Jeremy and I had a bit of hallway-track
> discussion about this at PGCon.  The core thing to note is that the GIN
> index type was designed to deal with data types that are subdividable
> and you want to search for individual component values (array elements,
> lexemes in a text document, etc).  The btree_gin extension abuses this
> by just storing the whole values as if they were components.  AFAIR,
> the original idea for writing both btree_gin and btree_gist was to allow
> creating a single multicolumn index that covers both subdividable and
> non-subdividable columns.  The idea that btree_gin might be used on its
> own wasn't really on the radar, I don't think.


Even before 9.4 btree_gin indexes with many duplicates were still much more
compact than B-Tree, because the value and the index tuple header are
not repeated for each TID the way they are in B-Tree.  Of course TID list
compression has made it better yet.   I don't know what the original
motivation was for btree_gin, but multi-column GIN indexes never made much
sense to me anyway.  What do they do that can't be done just as well by
separate single-column indexes combined through BitmapAnd and BitmapOr?

Multi-column GiST indexes can be much more useful, and so btree_gist is
useful to enable things like an index over (int, int8range).

Another possible use for btree_gin is to enable use of the fastupdate
mechanism for indexes on scalars, to speed up bulk insertion but without
having to drop the index.  I've never demonstrated a realistic benefit
there, but I haven't tried very hard recently (last time I really tried was
before gin_clean_pending_list and gin_pending_list_limit were added).  The
"real" solution here is something like log-structured merge trees or
fractal indexes, but fastupdate is already here.
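
If you want to experiment with that, a rough sketch (table and index names
are made up; the storage parameters shown are the real knobs involved):

CREATE INDEX events_customer_gin ON events USING gin (customer_id)
  WITH (fastupdate = on, gin_pending_list_limit = 4096);  -- limit is in kB
-- flush the pending list explicitly, e.g. after a bulk load:
SELECT gin_clean_pending_list('events_customer_gin');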


> However, now that GIN can compress multiple index entries for the same
> component value (which has only been true since 9.4, whereas btree_gin
> is very much older than that) it seems like it does make sense to use
> btree_gin on its own for low-cardinality non-subdividable columns.
> And that means that we ought to consider non-subdividable columns as
> fully legitimate, not just a weird corner usage.  So in particular
> I wonder whether it would be worth adding the scaffolding necessary
> to support index-only scan when the GIN opclass is one that doesn't
> subdivide the data values.
>

I wouldn't object to that, it just doesn't seem all that exciting.  But
isn't there some aspiration towards making a next generation of B-Tree
index which will also use TID list compression, making them more compact
without resorting to GIN?


> That leaves me quibbling with some points in Jeff's otherwise excellent
> reply:
>
> > For single column using a btree_gin operator, each index entry holds the
> > entire data value.  But the system doesn't know about that in any useful
> > way, so it doesn't implement index only scans, other than in the special
> > case where the value does not matter, like a 'count(*)'.  Conceptually
> > perhaps this could be fixed, but I don't see it happening. Since an
> > index-only scan is usually not much good with only a single-column
> index, I
> > don't see much excitement to improve things here.
>
> I'm confused by this; surely IOS is useful even with a single-column
> index?  Avoiding trips to the heap is always helpful.
>

But how realistic are the use cases?  My thinking was that an IOS for:

select bid from pgbench_accounts where bid=5;

would be nice if you needed to run that query, but we already know bid is 5
for each row where it is 5, so we could just do the count instead of looking
at a long column of identical values.  Maybe it would be useful in joins or
something where we can't rewrite them ourselves, and the planner
can't/won't use the transitive law either.

It could be useful for disjunction in the same column, or inequality.  (Or
BETWEEN if we fix the issue you mentioned below).

select bid, count(*) from pgbench_accounts where bid = 5 or bid = 7 group
by bid;

If it can be made to support IOS, perhaps it could a

Re: checkpoints taking much longer than expected

2019-06-16 Thread Jeff Janes
On Sat, Jun 15, 2019 at 4:50 AM Tiemen Ruiten  wrote:

>
> On Fri, Jun 14, 2019 at 5:43 PM Stephen Frost  wrote:
>
>>
>> The time information is all there and it tells you what it's doing and
>> how much had to be done... If you're unhappy with how long it takes to
>> write out gigabytes of data and fsync hundreds of files, talk to your
>> storage people...
>>
>
Right, but true only because they were "checkpoint starting: immediate".
Otherwise the reported write time includes intentional sleeps added to
honor the checkpoint_completion_target.  A bit confusing to report it that
way, I think.

> I am the storage people too :)
>

So, what did you say to yourself?  Have you done fundamental benchmarking
at the OS level, to see how fast you can write out 1-2 GB of data changes
spread out randomly over a few hundred GB of data, when it is not coming
from PostgreSQL?
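
For instance, something along these lines with fio (the path and the exact
parameters are only a rough sketch, not a tuned benchmark):

fio --name=pgsim --filename=/path/on/same/volume/testfile --size=10G \
    --rw=randwrite --bs=8k --fsync=1 --runtime=60 --time_based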

Cheers,

Jeff


Re: perf tuning for 28 cores and 252GB RAM

2019-06-17 Thread Jeff Janes
On Mon, Jun 17, 2019 at 4:51 PM Michael Curry  wrote:

> I am using a Postgres instance in an HPC cluster, where they have
> generously given me an entire node. This means I have 28 cores and 252GB
> RAM. I have to assume that the very conservative default settings for
> things like buffers and max working memory are too small here.
>
> We have about 20 billion rows in a single large table.
>

What is that in bytes?  Do you only have that one table?


> The database is not intended to run an application but rather to allow a
> few individuals to do data analysis, so we can guarantee the number of
> concurrent queries will be small, and that nothing else will need to use
> the server. Creating multiple different indices on a few subsets of the
> columns will be needed to support the kinds of queries we want.
>
> What settings should be changed to maximize performance?
>

With 28 cores for only a few users, parallelization will probably be
important.  That feature is fairly new to PostgreSQL and rapidly improving
from version to version, so you will want to use the latest version you can
(v11).  And then increase the values for max_worker_processes,
max_parallel_maintenance_workers, max_parallel_workers_per_gather, and
max_parallel_workers.  With the potential for so many parallel workers
running at once, you wouldn't want to go overboard on work_mem, maybe 2GB.
If you don't think all allowed users will be running large queries at the
same time (because they are mostly thinking what query to run, or thinking
about the results of the last one they ran, rather than actually running
queries), then maybe higher than that.
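
For illustration only, a starting point for those settings on a 28-core box
might look like this in postgresql.conf (the exact numbers are assumptions
to be tuned, not recommendations):

max_worker_processes = 28
max_parallel_workers = 28
max_parallel_workers_per_gather = 14
max_parallel_maintenance_workers = 8
work_mem = 2GB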

If your entire database can comfortably fit in RAM, I would make
shared_buffers large enough to hold the entire database.  If not, I would
set the value small (say, 8GB) and let the OS do the heavy lifting of
deciding what to keep in cache.  If you go with the first option, you
probably want to use pg_prewarm after each restart to get the data into
cache as fast as you can, rather than let it get loaded in naturally as you
run queries.  Also, you would probably want to set random_page_cost and
seq_page_cost quite low, like maybe 0.1 and 0.05.
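
If you do go the big-shared_buffers route, the warm-up could look something
like this (pg_prewarm and the cost settings are real; the table name is a
placeholder):

CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('big_table');   -- repeat for each important index as well
-- and in postgresql.conf (or via ALTER SYSTEM):
--   random_page_cost = 0.1
--   seq_page_cost = 0.05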

You haven't described what kind of IO capacity and setup you have, knowing
that could suggest other changes to make.  Also, seeing the results of
`explain (analyze, buffers)`, especially with track_io_timing turned on,
for some actual queries could provide good insight for what else might need
changing.

Cheers,

Jeff


Re: psql UPDATE field [tab] expands to DEFAULT?

2019-06-17 Thread Jeff Janes
On Mon, Jun 17, 2019 at 6:03 PM Ken Tanzer  wrote:

> Hi.  If I'm using psql, and type for example:
>
> UPDATE my_table SET my_field
> (with a trailing space)
>
> and then hit Tab, it will expand that to an =, and then another tab will
> expand to DEFAULT, so that I then have:
>
> UPDATE my_table SET my_field = DEFAULT
>
> If I'm tabbing out in this situation, it's going to be after the =, and I
> will have typed "myreal"[tab] in the vain hope that psql will complete that
> to "myreallylongfieldname," but instead it gets replaced with DEFAULT.
>

Yeah, it is especially annoying to have it delete what I actually typed and
replace it with something else.  I've been irked by that before.  I think the
general behavior of replacing something already typed with (what it
believes to be) the only proper completion is part of the underlying
readline/libedit library, not something psql goes out of its way to do.


> So I'm curious if this is intended behavior, if it's considered useful,
> and/or if it's a placeholder for something in the future that will be
> useful.  Also, is this new, as I've never noticed it before?
>

The tab completion doesn't have a SQL parser/analyzer; it is just driven off
general rules that look at the preceding N words.  In this case, it is
hitting the rule for "SET anything TO", which is intended to catch the
setting of parameters, it is only accidentally hitting on the SET part of
UPDATE statements.

This goes back at least to 9.3.

We could improve it by making a higher priority rule which looks back a few
more words to:

UPDATE <table> SET <column> TO

But what would we complete with?  Any expression can go there, and we can't
make it tab complete any arbitrary expression, like function names or
literals.  If we tab complete, but only with a restricted set of choices,
that could be interpreted as misleadingly suggesting no other things are
possible.  (Of course the current accidental behavior is also misleading,
then)

If we are willing to offer an incomplete list of suggestions, what would
they be?  NULL, DEFAULT, '(' and all the column names present in
<table>, with appropriate quotes where necessary?  But what to do if
<table> doesn't actually exist as the name of a table?

Or, we could have it implement the more precise higher priority rule, and
have it just refuse to offer any suggestions, but at least not delete what
is already there.

Cheers,

Jeff

>


Re: psql UPDATE field [tab] expands to DEFAULT?

2019-06-17 Thread Jeff Janes
On Mon, Jun 17, 2019 at 8:23 PM Adrian Klaver 
wrote:

> On 6/17/19 4:33 PM, Ken Tanzer wrote:
> >
> > Thanks Adrian, though I wasn't really seeking tips for column names.  I
> > was instead trying to understand whether this particular tab expansion
> > was intentional and considered useful, and if so what that usefulness
>
> If I am following the below correctly it is intentional:
>
>
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/bin/psql/tab-complete.c;h=68a2ba27aec22302625c5481a8f74cf866f4dc23;hb=d22ca701a39dfd03cdfa1ca238370d34f4bc4ac4
>
> Line 2888
>

But that portion doesn't offer the DEFAULT completion.  It stops at
offering '=', and goes no further.

It is the code at line 2859 that accidentally offers to complete DEFAULT,
and that is not part of the UPDATE-specific code.

Cheers,

Jeff


Re: unknown postgres ssl error "could not accept SSL connection: Success" and timeout

2023-05-03 Thread Jeff Janes
On Wed, May 3, 2023 at 9:54 AM Sergey Cherevko 
wrote:

> Ubuntu 18.04.6 LTS (GNU/Linux 4.15.0-167-generic x86_64)
>
> OpenSSL 1.1.1 11 Sep 2018
>
> Sometimes i see this in postgres logs
>
So, your system is working normally most of the time?  Or is it working
normally all of the time, and you just get occasional extra log messages of
unknown origin?

It could be informative to add %r to the log_line_prefix, so you would know
where these connections are coming from.
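
For example, something along these lines (adapt it to whatever prefix you
already use; %r adds the remote host and port):

log_line_prefix = '%m [%p] %r %q%u@%d '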



> CRITICAL
> PDOException
> SQLSTATE[08006] [7] timeout expired
> PDOException
> /var/log/web/hub2.log
>
>
I don't recognize the above (client?) log format.  What is it from?  Does
it correspond exactly to the entries in the PostgreSQL log?  The time
stamps seem off, the wording doesn't suggest to me that they are just
different ends of the same error, and the count of errors in one log
doesn't agree with the count in the other (6 vs 3).


> I should to update my openssl to see real error? How can i do that if i use 
> ubuntu 18.04?
>
I wouldn't expect that to help.  But how did you install it in the first
place?  Ubuntu 18.04 generally provides PostgreSQL 10, not 12, so you must
have done something non-default to get where you are.

In my experience, this (poorly worded) error comes from the client
initiating an ssl connection, but then not finishing the ssl negotiations.
For example, PGSSLMODE is set to verify-ca or above, but the client has no
root.crt file present.
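
As an illustration of that failure mode (the host name is a placeholder), a
client run like this will start the SSL handshake and then abort it on its
own side if the expected root certificate is missing:

PGSSLMODE=verify-ca psql "host=db.example.com dbname=postgres"
# fails unless ~/.postgresql/root.crt (or sslrootcert=...) exists on the client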

Cheers,

Jeff


Re: Why using a partial index is doing slightly more logical I/O than a normal index

2023-05-03 Thread Jeff Janes
Because both the actual times and the expected costs are so similar to each
other, I am assuming you are asking this as more of an academic question
than a practical one.  If it is actually a practical matter, you should
find a better example to present to us.

On Wed, May 3, 2023 at 9:17 AM Dirschel, Steve <
steve.dirsc...@thomsonreuters.com> wrote:



>  Index Cond: ((workflow_id = 1070) AND ((status)::text = ANY
> ('{NOT_STARTED,PAUSED,PENDING,RUNNING}'::text[])))
>
>    Buffers: shared hit=33
>

For this usage, the =ANY is a "boundary condition".  It re-descends the
index for each value in the array, and each of those re-descents incurs
buffer accesses.  They cost very little, as the planner thinks they will
mostly be cached already (and indeed, they are), but the difference still
shows up in the Buffers tabulation.

>    Index Cond: ((workflow_id = 1070) AND ((status)::text = ANY
> ('{NOT_STARTED,PAUSED,PENDING,RUNNING}'::text[])))
>
>    Filter: (deleted_millis <= 0)
>
>    Buffers: shared hit=24
>

For this usage, the =ANY is applied as an "in-index filter".  It only
descends the index once, to where workflow_id=1070, and then scans forward
applying the =ANY to each index-tuple until it exhausts the =1070
condition.  As long as all the =1070 entries fit into just a few buffers,
the count of buffers accessed by doing this is fewer than doing the
re-descents.  (Stepping from tuple to tuple in the same index page doesn't
count as a new access.  While a re-descent releases and reacquires the
buffer)

This is a bit of speculation on my part, as nothing in the plan output
distinguishes boundary condition usages from in-index-filter usages.  This
speculation is based on the fact that I was recently investigating a
similar situation and did extensive work on it with a debugger and by
adding new experimental log messages.  Also, it isn't clear to me why it
chooses one usage for one plan and the other usage for the other one in
your case, as it seems that both would be eligible for the "boundary
condition" treatment.  But presumably for some reason invisible to us it
just thinks one approach is faster for one index and the other approach for
the other index.


> One other thing to note-  when using the partial index the cost is .43 ..
> 15824.82.  When using the other index the cost is .56 .. 15820.19.  So the
> lower end cost (I believe the cost to find the first row) is slightly lower
> for the partial index but the higher end cost (I believe to find the last
> row) is higher for the partial index.  Since there is no LIMIT clause why
> wouldn’t the optimizer use the lowest cost to find all rows (which in this
> case would be to use the non-partial index)?
>

There is some code in the planner which deems plans to be tied if their
costs are within a small difference (1%, I think it is).  This allows some
branches of the tree of all possible plans to be pruned off early, which
can save a lot of time in planning.

Cheers,

Jeff

>


Re: [EXT] Re: Why using a partial index is doing slightly more logical I/O than a normal index

2023-05-03 Thread Jeff Janes
On Wed, May 3, 2023 at 2:00 PM Dirschel, Steve <
steve.dirsc...@thomsonreuters.com> wrote:

> Thanks for the reply Jeff.  Yes-  more of an academic question.  Regarding
> this part:
>
>
>
>    Index Cond: ((workflow_id = 1070) AND ((status)::text = ANY
> ('{NOT_STARTED,PAUSED,PENDING,RUNNING}'::text[])))
>
>    Filter: (deleted_millis <= 0)
>
>    Buffers: shared hit=24
>
>
>
> For this usage, the =ANY is applied as an "in-index filter".  It only
> descends the index once, to where workflow_id=1070, and then scans forward
> applying the =ANY to each index-tuple until it exhausts the =1070
> condition.  As long as all the =1070 entries fit into just a few buffers,
> the count of buffers accessed by doing this is fewer than doing the
> re-descents.  (Stepping from tuple to tuple in the same index page doesn't
> count as a new access.  While a re-descent releases and reacquires the
> buffer)
>
>
>
> There are 2,981,425 rows where workflow_id = 1070.  Does that change your
> theory of using an “in-index filter” for that plan?
>

Yes.  There is no way that that many index tuples will fit in just 24 index
leaf pages, so I think it must be using the re-descending method for both
plans.  Then I don't know why there is a difference in the number of
buffer accesses.  But the difference seems trivial, so I wouldn't put much
effort into investigating it.



> When you say there was a bit of speculation on the “boundard condition” vs
> “in-index filter” is the speculation on if Postgres has 2 different ways of
> processing a =ANY filter or is the speculation that one is being used by
> one plan and the other is being used by the other plan?
>

The speculation was that this applied to your query.  But going back to
some of my original tests, I see that I remembered some of the details
wrong on the broader topic as well.  When it voluntarily doesn't use the
=ANY as a boundary condition, that shows up in the plan as having the
condition evicted from the "Index Cond" line and instead showing up in a "Filter"
line, at least in my one test case (which means it is no longer an in-index
filter, as it jumps to the table and verifies visibility before applying
the filter).  So the thing that the plans really don't distinguish is
between when it just chooses not to use the extra index column for cost
reasons, from when it thinks it is unable to use it for correctness/safety
reasons.

Cheers,

Jeff


Re: Helping planner to chose sequential scan when it improves performance

2023-06-25 Thread Jeff Janes
On Tue, Jun 13, 2023 at 3:28 PM Patrick O'Toole 
wrote:

>  run the query twice first, then...

Is that a realistic way to run the test?  Often forcing all the data needed
for the query into memory is going to make things less realistic, not more
realistic.  Assuming the system has more stuff to do than just perform this
one query, it might be unusual for the query to find everything it needs in
memory.  Also, if you really do want to do it this way, then you should do
this for every plan.  Different plans need to access different
collections of buffers, so prewarming just one plan will privilege that one
over the others.



>
> PLAN A (default config, effective cache size just shy of 15GB): 3.829
> seconds. A nested loop is used to probe the hash index
> `conversation_item_item_hash_index` for each row of item_text. Although the
> cost of probing once is low, a fair amount of time passes because the
> operation is repeated ~1.3 million times.
>
> PLAN B (enable_indexscan off, effective cache same as before): 3.254
> seconds (~15% speedup, sometimes 30%). Both tables are scanned sequentially
> and conversation_item is hashed before results are combined with a hash
> join.
>
> PLAN C: (random_page_cost = 8.0, instead of default 4, effective cache
> same as before): 2.959 (~23% speedup, sometimes 38%). Same overall plan as
> PLAN B, some differences in buffers and I/O. I'll note we had to get to 8.0
> before we saw a change to planner behavior; 5.0, 6.0, and 7.0 were too low
> to make a difference.
>

The difference between B and C looks like it is entirely noise, having to
do with how many buffers it found already in the cache and how many of them
needed cleaning (which causes the buffer to be dirty as the cleaned version
now needs to be written to disk) and how many dirty buffers it found that
needed to be written in order to make way to read other buffers it needs.
(This last number most generally reflects dirty buffers left around by
other things which this query encountered, not the buffers the query itself
dirtied).  None of this is likely to be reproducible, and so not worth
investigating.

And the difference between A and BC is small enough that it is unlikely to
be worth pursuing, either, even if it is reproducible.  If your app runs
this one exact query often enough that a 30% difference is worth worrying
about, you would probably be better served by questioning the business
case.  What are you doing with 1.4 million rows once you do fetch them,
that it needs to be repeated so often?

If you think that taking a deep dive into this one query is going to
deliver knowledge which will pay off for other (so far unexamined) queries,
I suspect you are wrong. Look for queries where the misestimation is more
stark than 30% to serve as your case studies.


>
> Environment:
>
> Postgres 15.2
> Amazon RDS — db.m6g.2xlarge
>
>
> Questions:
>


> In Plan A, what factors are causing the planner to select a substantially
> slower plan despite having recent stats about number of rows?
>

Even if it were worth trying to answer this (which I think it is not),
there isn't much we can do with dummy tables containing no data.  You would
need to include a script to generate data of a size and distribution which
reproduces the given behavior.

> Is there a substantial difference between the on-the-fly hash done in
> Plan B and Plan C compared to the hash-index used in Plan A? Can I assume
> they are essentially the same? Perhaps there are differences in how
> they're applied?

They are pretty much entirely different.  One jumps all over the index on
disk, the other reads the table sequentially and (due to work_mem) parcels
it out into chunks where it expects each chunk can also be read back in
sequentially as well.  About the only thing not different is that they both
involve computing a hash function.

> Is it common to see values for random_page_cost set as high as 8.0? We
> would of course need to investigate whether we see a net positive or net
> negative impact on other queries, to adopt this as a general setting, but
> is it a proposal we should actually consider?

I've never needed to set it that high, but there is no a priori reason it
wouldn't make sense to do.  Settings that high would probably only be
suitable for HDD (rather than SSD) storage and when caching is not very
effective, which does seem to be the opposite of your situation.  So I
certainly wouldn't do it just based on the evidence at hand.

Cheers,

Jeff


>


Re: Helping planner to chose sequential scan when it improves performance

2023-06-27 Thread Jeff Janes
On Sun, Jun 25, 2023 at 3:48 PM David Rowley  wrote:

> On Wed, 14 Jun 2023 at 07:28, Patrick O'Toole 
> wrote:
> > Maybe we are barking up the wrong tree with the previous questions. Are
> there other configuration parameters we should consider first to improve
> performance in situations like the one illustrated?
>
> random_page_cost and effective_cache_size are the main settings which
> will influence plan A vs plan B.  Larger values of
> effective_cache_size will have the planner apply more seq_page_costs
> to the index scan.


Squeezing otherwise-random page costs towards seq_page_cost is what bitmap
scans do, and what large index scans with high pg_stats.correlation do.
But effective_cache_size does something else: it squeezes the per-page
costs towards zero, not towards seq_page_cost.  This is surely not
accurate, as the costs of locking the buffer mapping partition, finding the
buffer or reading it from the kernel cache if not found, maybe faulting the
buffer from main memory into on-CPU memory, pinning the buffer, and
read-locking it are certainly well above zero, even if not nearly as high
as seq_page_cost.  I'd guess they are truly about 2 to 5 times a
cpu_tuple_cost per buffer.  But zero is what they currently get, there is
no knob to twist to change that.


>   Lower values of effective_cache_size will mean
> more pages will be assumed to cost random_page_cost.
>


Sure, but it addresses the issue only obliquely (as does raising
random_page_cost) not directly.  So the change you need to make to them
will be large, and will likely make other things worse.

Cheers,

Jeff


Re: Question regarding specifics of GIN and pg_trgm performance and potential use of show_trgm to improve it

2023-09-24 Thread Jeff Janes
On Wed, May 24, 2023 at 4:35 PM Pavel Horal 
wrote:

I didn't see your email when first sent, and stumbled upon it while
searching for something else.  But it still might be worthwhile commenting
even after all of this time.


>
>
> *Is my understanding correct that this happens only because pg_trgm is not
> able to actually determine if the matched item from the index search is
> actually much much longer than the query?* Is there any way how the
> performance can be improved in this case? I thought that I can store number
> of trigrams in the index, but that is not being used by the query planner:
>
> CREATE INDEX test_idx2 ON test USING GIN (value gin_trgm_ops,
> array_length(show_trgm(value), 1));
>
> EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM test WHERE value % 'lorem' AND
> array_length(show_trgm(value), 1) < array_length(show_trgm('lorem'), 1) /
> 0.5;
>

The main problem here is one of expression types.  You have an index using an
expression returning an int, while you are comparing it to an expression
returning a numeric.  That inhibits the use of the index over that
expression.

Just casting the type when creating the index is enough (given your test
case) to get this to do what you want:

CREATE INDEX test_idx2 ON test USING GIN (value gin_trgm_ops,
(array_length(show_trgm(value), 1)::numeric));

However, it would probably be more efficient to partition the table on the
trigram count, rather than adding that count to the index.  Then it could
just skip any partition with too many trigrams.
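
Roughly along these lines (a sketch only; for pruning to kick in, the WHERE
clause has to compare the same integer expression used as the partition key
against an integer, not a numeric):

CREATE TABLE test_p (value text)
  PARTITION BY RANGE ((array_length(show_trgm(value), 1)));
CREATE TABLE test_p_short PARTITION OF test_p FOR VALUES FROM (MINVALUE) TO (10);
CREATE TABLE test_p_long  PARTITION OF test_p FOR VALUES FROM (10) TO (MAXVALUE);
CREATE INDEX ON test_p USING gin (value gin_trgm_ops);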

Cheers,

Jeff


Re: log_statement vs log_min_duration_statement

2023-09-26 Thread Jeff Janes
On Tue, Sep 26, 2023 at 5:33 PM Atul Kumar  wrote:

> Hi,
>
> I have a query about the parameters log_min_duration_statement and log_statement
>
> my postgres version is 12 and running on centos 7
>
> my log_statement is set to "DDL".
>
> and log_min_duration_statement is set to "1ms"
>
> so technically it should log "ONLY DDLs" that take more than 1ms.
>
> but to my surprise, postgres is logging DML as well which is taking more
> than 1ms.
>

That doesn't surprise me, it is just what I expected.



>
> What am I missing here to understand, because as per my understanding
> log_statement and log_min_duration_statement are correlated, postgres
> should log according to log_statement parameter.
>

If something in the docs led you to this misunderstanding, we should
identify it and fix it.  Otherwise, there isn't much we can do, except
point out that that is simply not how it works.
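
The two settings act independently of each other.  A combination like this
(values only for illustration) logs every DDL statement as it is received,
and additionally logs any statement of any kind that runs for at least 1ms:

log_statement = 'ddl'
log_min_duration_statement = 1    # milliseconds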

Cheers,

Jeff

>


Re: pg_upgrade from Postgresql-12 to Postgresql-13 fails with "Creating dump of database schemas postgres *failure*"

2022-02-18 Thread Jeff Janes
On Fri, Feb 18, 2022 at 8:35 AM Yoong S. Chow  wrote:

>
> Here are the logs: pg_upgrade_server.log,
> pg_upgrade_dump_13427.log and
> tail_postgresql-12-2022-02-16_190344.log.
>

From that last log file we have:


2022-02-16 19:04:35 +08 [303982]: [4957-1] session=620cda42.4a36e,user=postgres,db=postgis_raster,app=pg_dump,client=[local] | LOG:  disconnection: session time: 0:00:01.101 user=postgres database=postgis_raster host=[local]
free(): invalid pointer


and then a little later:


2022-02-16 19:04:35 +08 [303791]: [7-1] session=620cda0f.4a2af,user=,db=,app=,client= | LOG:  server process (PID 303982) was terminated by signal 6: Aborted


Do you get any core files for 303982?  Do you have core dumps enabled?
What do you get if you run a manual pg_dump -s on db postgis_raster?
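
For example, something like this (connection options are assumptions;
adjust to your environment):

pg_dump --schema-only --dbname=postgis_raster --file=/dev/null
# to get core files, you may need something like "ulimit -c unlimited"
# in the old cluster's service environment before restarting it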

Cheers,

Jeff

>


Re: How clear the cache on postgresql?

2017-11-24 Thread Jeff Janes
2017-11-24 17:55 GMT+01:00 Tom Lane :

> bricklen  writes:
>
> > If you are on a (non-production) *nix server you can use:
> > sync && echo 3 > /proc/sys/vm/drop_caches
>
> You would also need to restart the postmaster, to get rid of whatever
> is in Postgres' shared buffers.
>

And restart postgres first, otherwise the shutdown checkpoint will
repopulate
some of the buffers you just dropped via drop_caches.
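
One way to sequence it so that the shutdown checkpoint lands before the
cache drop (the service name is an assumption for your system):

sudo systemctl stop postgresql
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
sudo systemctl start postgresql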

On Fri, Nov 24, 2017 at 9:09 AM, hmidi slim  wrote:

> I execute the command sync && echo 3 > /proc/sys/vm/drop_caches but I
> still got buffers hit added to that buffers read now.Why I got the buffers
> hit?
>

Please don't top-post on this mailing list.

If the same buffer is accessed repeatedly in a query, then some of those
accesses will be from the cache even if it were completely cold to start
with.

Cheers,

Jeff


Re: seq vs index scan in join query

2017-11-29 Thread Jeff Janes
On Tue, Nov 28, 2017 at 10:55 PM, Emanuel Alvarez 
wrote:

> hi all,
>
> we're in the process of optimizing some queries and we've noted a case
> where the planner prefers a sequential scan instead of using an index,
> while the index scan is actually much faster. to give you some
> context: we have two main tables, keywords and results. keywords has
> approximately 700.000 rows; while results holds approximately one row
> per keyword per day (roughly 70m at the moment, not all keywords are
> active at any given day). results is currently partitioned by
> (creation) time. it's also worth noting that we use SSDs in our
> servers, and have random_page_cost set to 1.
>
>
> the problematic query looks like this:
>
> SELECT keywords.strategy_id, results.position, results.created_at FROM
> results
>   JOIN  keywords ON results.keyword_id = keywords.id
>   WHERE results.account_id = 1
>  AND results.created_at >= '2017-10-25 00:00:00.00'
>  AND results.created_at <= '2017-11-10 23:59:59.99';
>
>
> as you can see in the query plan [1] a sequential scan is preferred.
>

I would say the preference is not for the seq scan, but rather for the hash
join.  If the seq scan couldn't be fed into a hash join, it would not look
very favorable.

I think the planner is a bit optimistic about how much CPU time hash joins
spend building the hash table.  You can probably get better plans for
this type of query by increasing cpu_tuple_cost to 0.02 or 0.03.  That
works because the hash join over the seq scan has to scan 700,000 tuples to
build the hash table, which is then probed only 70,000 times, while the
nested loop index scan just probes the 70,000 rows it needs directly and
ignores the other 90%.
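
That is easy to try per-session; 0.02 here is just the experimental value
mentioned above (the default is 0.01):

SET cpu_tuple_cost = 0.02;
-- re-run the query under EXPLAIN (ANALYZE, BUFFERS) and compare plans
RESET cpu_tuple_cost;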

...


>
> on the other hand, if we disable sequential scans (SET enable_seqscan
> = 0), we see than not only the query runs faster but the cost seems to
> be lower, as seen in the query plan [2].
>

The costs for plan 2 don't look lower to me.  196754.90 > 120421.32


>
> in this example the gain it's not much: ~0.5s. but when we add a
> second join table with additional keyword data the planner still
> prefers a sequential scan on a table that has +6m rows. query looks
> like this:
>
> SELECT keywords.strategy_id, results.position, results.created_at,
> keyword_data.volume FROM results
>   JOIN  keywords ON results.keyword_id = keywords.id
>   JOIN keyword_data ON keywords.keyword_data_id = keyword_data.id
>   WHERE results.account_id = 1
>  AND results.created_at >= '2017-10-25 00:00:00.00'
>  AND results.created_at <= '2017-11-19 23:59:59.99';
>
>
> in this case query takes up to 8s, query plan can be found in [3].
> obviously dataset has to be large to prefer a sequential on a 6m rows
> table. similarly, reducing the created_at range or using an account_id
> with fewer keywords makes the planner prefer index scan, accelerating
> the query considerably.
>

If that is the query you are really concerned about, we would have to see
the faster plan for that query.  (Or better yet, keep the created_at range
the same, and set enable_hashjoin to off to get it to switch plans).

This looks like a very skewed query.  keyword_data has 10 rows for every
row in keywords, yet adding a join to keyword_data doesn't increase the
number of rows returned by the query at all.  That is weird, isn't it?

For what it's worth, in my hands on your simpler query it likes to sort the
70,000 qualifying rows from the "results" table, then do a merge join against
the index on keywords.  And it truly is the faster option.  I have to set
enable_mergejoin=off before I can get either of your plans.  Once I do, the
nested loop does seem to be faster than the hash join, but not by the two-fold
margin that you see, and the timings jump around quite a bit from run to run.

Cheers,

Jeff


Re: Warm standby can't start because logs stream too quickly from the master

2017-12-02 Thread Jeff Janes
On Sat, Dec 2, 2017 at 11:02 AM, Zach Walton  wrote:

> Looking at the startup process:
>
> postgres 16749  4.1  6.7 17855104 8914544 ?Ss   18:36   0:44 postgres:
> startup process   recovering 00085B1C0030
>
> Then a few seconds later:
>
> postgres 16749  4.2  7.0 17855104 9294172 ?Ss   18:36   0:47 postgres:
> startup process   recovering 00085B1C0047
>
> It's replaying logs from the master, but it's always a few behind, so
> startup never finishes. Here's a demonstration:
>
> # while :; do echo $(ls data/pg_xlog/ | grep -n $(ps aux | egrep "startup
> process" | awk '{print $15}')) $(ls data/pg_xlog/ | wc -l); sleep 1; done
> # current replay location # number of WALs in pg_xlog
> 1655:00085B1C0064 1659
> 1656:00085B1C0065 1660
> 1658:00085B1C0067 1661
> 1659:00085B1C0068 1662
> 1660:00085B1C0069 1663
>
> Generally this works itself out if I wait (sometimes a really long time).
> Is there a configuration option that allows a warm standby to start without
> having fully replayed the logs from the master?
>


Warm standbys aren't supposed to start up; that is what makes them warm.
Are you trying to set up a hot standby?  Or are you trying to promote a warm
standby to be the new master?  (Usually you would only do that when the
current master has died and so is no longer generating WAL.)
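
If a hot standby is what you are actually after, the pre-v12-era sketch is
just these two settings on the standby, alongside whatever restore_command
or primary_conninfo you already use:

standby_mode = 'on'   # recovery.conf
hot_standby = on      # postgresql.conf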

Cheers,

Jeff

