That looks like a misconfigured client application that explicitly tries to connect to a non-existent database 'admin' (via db=admin). Rather than creating that database, it makes more sense to fix the client configuration.
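For what it's worth, libpq-based clients (psql included) default the database name to the user name when no database is specified, which is exactly how connecting as user "admin" turns into a request for a database named "admin". A minimal sketch of the client-side fix, with placeholder host and user values rather than anything taken from your actual application:

    # name an existing database explicitly instead of letting it default to the user name
    psql -h 172.16.10.23 -U admin -d postgres

    # or, in libpq connection-URI form
    psql 'postgresql://admin@172.16.10.23:5432/postgres'

The same idea applies to whatever application or monitoring agent is generating those connections: set its dbname/database parameter to postgres (or to the database it is actually meant to use) instead of leaving it unset.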
On Tue, 2 Apr 2019 at 09:53, Foo Bar <qubitreneg...@gmail.com> wrote:
> Hello All,
>
> Ok, so maybe something helpful? On my master node I am seeing a bunch of:
>
> 2019-03-28 23:54:44 GMT [2611]: [1-1]
> user=admin,db=admin,client=172.10.10.37 172.10.10.37(57552) (0:3D000)FATAL:
> database "admin" does not exist
> 2019-03-28 23:54:46 GMT [2613]: [1-1]
> user=admin,db=admin,client=172.10.10.18 172.10.10.18(56970)
> (0:3D000)FATAL: database "admin" does not exist
>
> Which is accurate, as there is no admin database... I usually connect
> with
>
> psql -h localhost -U admin postgres
>
> Should there be? Will this fix my issue with pgsql filling up the disk?
>
> Thanks,
> - QBR
>
> On Mon, Mar 25, 2019 at 10:21 AM Foo Bar <qubitreneg...@gmail.com> wrote:
>
>> Hello All,
>>
>> Wow! Lots of awesome replies! Went away for the weekend thinking my
>> email had been rejected and came back to a full inbox. Thanks for all the
>> help!
>>
>>
>> Postgres version?
>>
>> 9.6.11
>>
>> Doh. Fairly important detail there. :)
>>
>>
>> FYI, psql is the Postgres client program; Postgres(ql) is the server.
>>
>> "psql" is the name of a specific command line tool used to connect to
>> a PostgreSQL database server; it is not the name of the database itself.
>> The database is usually abbreviated "pgsql".
>>
>> Duly noted, thanks for the correction.
>>
>>
>> It's the standby that has not seen any traffic?
>>
>> There's really no traffic. I built three nodes, connected them, created
>> a test table and inserted some values, then left the cluster be for a
>> couple weeks.
>>
>>
>> And "restartpoint" is usually spelled as one word in technical
>> discussions of it. Or at least, searching for it that way avoids finding
>> things which mention each word separately in different senses.
>>
>> Ah ha. I had seen it that way but thought it was a typo. Thanks for the
>> clarification!
>>
>>
>> Are you sure it is the archive
>> directory (/hab/svc/postgresql/data/archive) which is filling up, and not
>> the live directory (pg_wal or pg_xlog)? This is often a point of confusion.
>>
>> Right before I sent the mail last week I deleted everything in
>> /hab/svc/postgresql/data/archive; this morning I'm seeing:
>>
>> # du -h --max=1 /hab/svc/postgresql/data/
>> 198M /hab/svc/postgresql/data/pgdata
>> 8.9G /hab/svc/postgresql/data/archive
>> 9.1G /hab/svc/postgresql/data/
>> # du -hs /hab/svc/postgresql/data/pgdata/pg_xlog/
>> 177M /hab/svc/postgresql/data/pgdata/pg_xlog/
>> # ls -lah /hab/svc/postgresql/data/archive/ | wc -l
>> 571
>>
>> There is no pg_wal directory though (should there be?)
>>
>> # find /hab/svc/postgresql/ -name '*pg*wal*'
>> #
>>
>>
>> If the only reason you want an archive is for replication, then use
>> streaming replication and do away with the archive completely
>>
>> To be honest, I thought it was required for streaming replication based
>> on the guides linked above.
>>
>>
>> There are reasons other than replication that one might want to keep a
>> WAL archive, but those reasons don't seem to apply to you
>>
>> Like backup maybe? A whole other topic: we recently had a power outage
>> and I lost a pgsql node... having an archive would allow me to "replay" any
>> transactions?
>>
>>
>> What needs to be determined here is why the standby never consumed
>> the WALs from the master?
>> Ok, so it's the standby that's the problem.
>>
>>
>> Do you still have the logs from the standby and do they show anything
>> relevant?
>>
>> Sure, what am I looking for?
>> I see a bunch of entries like:
>>
>> 2019-03-08 17:06:11 GMT [1813]: [815-1] user=,db=,client= (0:00000)LOG:
>> restartpoint complete: wrote 22 buffers (0.0%); 0 transaction log file(s)
>> added, 0 removed, 1 recycled; write=2.211 s, sync=0.062 s, total=2.281 s;
>> sync files=18, longest=0.062 s, average=0.003 s; distance=16383 kB,
>> estimate=16383 kB
>> 2019-03-08 17:06:11 GMT [1813]: [816-1] user=,db=,client= (0:00000)LOG:
>> recovery restart point at 0/8D000028
>>
>> On the 15th, around when I think I filled the disk, I see a bunch of:
>>
>> cp: cannot stat '/00000002.history': No such file or directory
>> cp: cannot stat '/000000010000000400000049': No such file or directory
>> 2019-03-15 23:59:49 GMT [16691]: [1-1] user=,db=,client=
>> (0:XX000)FATAL: could not connect to the primary server: could not connect
>> to server: Connection refused
>> Is the server running on host "172.16.10.23" and accepting
>> TCP/IP connections on port 5432?
>>
>> Which makes sense since the pgsql service was down.
>>
>> This appears to be when I recovered the master on Thursday:
>>
>> cp: cannot stat '/00000002.history': No such file or directory
>> cp: cannot stat '/00000001000000040000004D': No such file or directory
>> 2019-03-21 17:37:31 GMT [31338]: [1-1] user=,db=,client=
>> (0:XX000)FATAL: could not connect to the primary server: could not connect
>> to server: Connection refused
>> Is the server running on host "172.16.10.23" and accepting
>> TCP/IP connections on port 5432?
>>
>> cp: cannot stat '/00000002.history': No such file or directory
>> cp: cannot stat '/00000001000000040000004D': No such file or directory
>> 2019-03-21 17:37:36 GMT [31343]: [1-1] user=,db=,client= (0:00000)LOG:
>> started streaming WAL from primary at 4/4D000000 on timeline 1
>> 2019-03-21 17:37:47 GMT [30711]: [3151-1] user=,db=,client=
>> (0:00000)LOG: restartpoint starting: time
>> 2019-03-21 17:37:47 GMT [30711]: [3152-1] user=,db=,client=
>> (0:00000)LOG: restartpoint complete: wrote 0 buffers (0.0%); 0 transaction
>> log file(s) added, 0 removed, 1 recycled; write=0.003 s, sync=0.000 s,
>> total=0.007 s; sync files=0, longest=0.000 s, average=0.000 s;
>> distance=16384 kB, estimate=16384 kB
>>
>> Then there's a bunch of the same entries where about the only thing
>> (other than the timestamp) that seems to change is the index in:
>>
>> 2019-03-23 23:33:40 GMT [30711]: [5094-1] user=,db=,client=
>> (0:00000)LOG: recovery restart point at 5/90000140
>> 2019-03-23 23:38:40 GMT [30711]: [5097-1] user=,db=,client=
>> (0:00000)LOG: recovery restart point at 5/91000028
>> 2019-03-23 23:43:40 GMT [30711]: [5100-1] user=,db=,client=
>> (0:00000)LOG: recovery restart point at 5/91000108
>>
>> I see effectively the same logs on the other hotstandby pgsql node.
>>
>> This is my config on my standby node:
>>
>> listen_addresses = '0.0.0.0'
>> port = 5432
>> max_connections = 100
>> external_pid_file = '/hab/svc/postgresql/var/postgresql.pid'
>> authentication_timeout = 1min
>> max_files_per_process = 1000
>> max_locks_per_transaction = 64
>> logging_collector = on
>> log_directory = '/hab/svc/postgresql/var/pg_log'
>> log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
>> log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,client=%h %r (%x:%e)'
>> log_min_messages = ERROR
>> datestyle = 'iso, mdy'
>> default_text_search_config = 'pg_catalog.english'
>> data_directory = '/hab/svc/postgresql/data/pgdata'
>> hba_file = '/hab/svc/postgresql/config/pg_hba.conf'
>> wal_level = hot_standby
>> wal_log_hints = 'on'
>> hot_standby = 'on'
>> hot_standby_feedback = true
>> max_wal_senders = 5
>> max_replication_slots = 5
>> checkpoint_completion_target = 0.9
>> max_wal_size = 1GB
>> min_wal_size = 128MB
>> wal_keep_segments = 8
>> log_checkpoints = on
>> log_lock_waits = on
>> log_temp_files = 0
>> log_autovacuum_min_duration = 0
>> track_activity_query_size = 2048
>> track_io_timing = on
>> dynamic_shared_memory_type = 'none'
>> archive_mode = 'on'
>> archive_command = 'cp %p /hab/svc/postgresql/data/archive/%f'
>> archive_timeout = '10min'
>> max_standby_archive_delay = '30s'
>> synchronous_commit = local
>> include '/hab/svc/postgresql/config/postgresql.local.conf'
>>
>> Should I disable archive mode? Even though I'm not currently using it,
>> it seems like there's a use case for having it? And if I can figure out
>> what's causing the backlog of archive files it should still be manageable?
>>
>> Thanks again for all the replies; while it hasn't solved the problem yet,
>> this was incredibly helpful! Also, please don't hate me for munging all
>> your replies into one reply... I thought it might be easier to follow than
>> having three different branches...?
>>
>> Best Regards,
>> - QBR
>>
>> On Fri, Mar 22, 2019 at 7:14 PM Michael Paquier <mich...@paquier.xyz>
>> wrote:
>>
>>> On Fri, Mar 22, 2019 at 12:26:33PM -0400, Jeff Janes wrote:
>>> > archive_cleanup_command is pretty much obsolete. The modern way to do this
>>> > is with streaming replication, using either replication slots or
>>> > wal_keep_segments. If the only reason you want an archive is for
>>> > replication, then use streaming replication and do away with the archive
>>> > completely. There are reasons other than replication that one might want
>>> > to keep a WAL archive, but those reasons don't seem to apply to you. And
>>> > if they did you almost certainly wouldn't want to run
>>> > archive_cleanup_command on it.
>>>
>>> Personally, I still find archives also very valuable when a standby
>>> creation takes a long time because of a large initial base backup and
>>> the partition dedicated to pg_wal is not large enough to support
>>> the retention associated with a slot, and it is easier to have larger
>>> retention policies in the archives.
>>> --
>>> Michael
>>>
>>

--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.
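P.S. Since the archive question is still open further up the thread: here is a rough sketch of the slot-based streaming setup Jeff describes, assuming a 9.6 primary/standby pair like this one. The slot name, host, and replication user below are made up for illustration; the quoted config already sets max_replication_slots = 5, which the primary needs as well.

    -- on the primary, create one physical slot per standby:
    SELECT pg_create_physical_replication_slot('standby1');

    # on each standby, recovery.conf:
    standby_mode = 'on'
    primary_conninfo = 'host=172.16.10.23 port=5432 user=replication'
    primary_slot_name = 'standby1'

    # on the primary, postgresql.conf: if the archive exists only for
    # replication, archiving can then be switched off (requires a restart):
    archive_mode = off

    -- on the primary, check how much WAL each slot is holding back:
    SELECT slot_name, active, restart_lsn FROM pg_replication_slots;

The trade-off Michael points out still applies: a slot retains WAL in pg_xlog while its standby is disconnected, so a standby that stays down can fill the data partition instead of the archive directory. Keep an eye on pg_replication_slots if you go this route.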