Hi all,
A couple of days ago I found the replica stopped after the PANIC message:
PANIC: WAL contains references to invalid pages
When I tried to restart it I got this FATAL:
FATAL: could not access status of transaction 280557568
Below is the description of the server and information from PostgreSQL
and system logs. After googling the problem I have found nothing like
this.
Any thoughts of what it could be and how to prevent it in the future?
Hardware:
IBM System x3650 M4, 148GB RAM, NAS
Software:
PostgreSQL 9.2.3, yum.postgresql.org
CentOS 6.3, kernel 2.6.32-279.22.1.el6.x86_64
Configuration:
listen_addresses = '*'
max_connections = 550
shared_buffers = 35GB
work_mem = 256MB
maintenance_work_mem = 1GB
bgwriter_delay = 10ms
bgwriter_lru_multiplier = 10.0
effective_io_concurrency = 32
wal_level = hot_standby
synchronous_commit = off
checkpoint_segments = 1024
checkpoint_timeout = 1h
checkpoint_completion_target = 0.9
checkpoint_warning = 5min
max_wal_senders = 3
wal_keep_segments = 2048
hot_standby = on
max_standby_streaming_delay = 5min
hot_standby_feedback = on
effective_cache_size = 133GB
log_directory = '/var/log/pgsql'
log_filename = 'postgresql-%Y-%m-%d.log'
log_checkpoints = on
log_line_prefix = '%t %p %u@%d from %h [vxid:%v txid:%x] [%i] '
log_lock_waits = on
log_statement = 'ddl'
log_timezone = 'W-SU'
track_activity_query_size = 4096
autovacuum_max_workers = 5
autovacuum_naptime = 5s
autovacuum_vacuum_scale_factor = 0.05
autovacuum_analyze_scale_factor = 0.05
autovacuum_vacuum_cost_delay = 5ms
datestyle = 'iso, dmy'
timezone = 'W-SU'
lc_messages = 'en_US.UTF-8'
lc_monetary = 'ru_RU.UTF-8'
lc_numeric = 'ru_RU.UTF-8'
lc_time = 'ru_RU.UTF-8'
default_text_search_config = 'pg_catalog.russian'
System:
# Controls the maximum shared segment size, in bytes
kernel.shmmax = 53287555072
# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 13009657
# Maximum number of file-handles
fs.file-max = 65535
# pdflush tuning to prevent lag spikes
vm.dirty_ratio = 10
vm.dirty_background_ratio = 1
vm.dirty_expire_centisecs = 499
# Prevent the scheduler breakdown
kernel.sched_migration_cost = 500
# Turned off to provide more CPU to PostgreSQL
kernel.sched_autogroup_enabled = 0
# Setup hugepages
vm.hugetlb_shm_group = 26
vm.hugepages_treat_as_movable = 0
vm.nr_overcommit_hugepages = 512
# The Huge Page Size is 2048kB, so for 35GB shared buffers the number is 17920
vm.nr_hugepages = 17920
# Turn off the NUMA local pages reclaim as it leads to wrong caching
strategy for databases
vm.zone_reclaim_mode = 0
Environment:
HUGETLB_SHM=yes
LD_PRELOAD='/usr/lib64/libhugetlbfs.so'
export HUGETLB_SHM LD_PRELOAD
When it is stopped:
2013-03-26 11:50:32 MSK 3775 @ from [vxid: txid:0] [] LOG:
restartpoint complete: wrote 1685004 buffers (36.7%); 0 transaction
log file(s) added, 0 removed, 555 recycled; write=3237.402 s,
sync=0.071 s, total=3237.507 s; sync files=2673, longest=0.008 s,
average=0.000 s
2013-03-26 11:50:32 MSK 3775 @ from [vxid: txid:0] [] LOG: recovery
restart point at 2538/6E154AC0
2013-03-26 11:50:32 MSK 3775 @ from [vxid: txid:0] [] DETAIL: last
completed transaction was at log time 2013-03-26 11:50:31.613948+04
2013-03-26 11:50:32 MSK 3775 @ from [vxid: txid:0] [] LOG:
restartpoint starting: xlog
2013-03-26 11:51:16 MSK 3773 @ from [vxid:1/0 txid:0] [] WARNING:
page 451 of relation base/16436/2686702648 is uninitialized
2013-03-26 11:51:16 MSK 3773 @ from [vxid:1/0 txid:0] [] CONTEXT:
xlog redo vacuum: rel 1663/16436/2686702648; blk 2485,
lastBlockVacuumed 0
2013-03-26 11:51:16 MSK 3773 @ from [vxid:1/0 txid:0] [] PANIC: WAL
contains references to invalid pages
2013-03-26 11:51:16 MSK 3773 @ from [vxid:1/0 txid:0] [] CONTEXT:
xlog redo vacuum: rel 1663/16436/2686702648; blk 2485,
lastBlockVacuumed 0
2013-03-26 11:51:16 MSK 3770 @ from [vxid: txid:0] [] LOG: startup
process (PID 3773) was terminated by signal 6: Aborted
2013-03-26 11:51:16 MSK 3770 @ from [vxid: txid:0] [] LOG:
terminating any other active server processes
>From /var/log/messages:
Mar 26 10:50:52 tms2 kernel: : postmaster: page allocation failure.
order:8, mode:0xd0
Mar 26 10:50:52 tms2 kernel: : Pid: 3774, comm: postmaster Not tainted
2.6.32-279.22.1.el6.x86_64 #1
Mar 26 10:50:52 tms2 kernel: : Call Trace:
Mar 26 10:50:52 tms2 kernel: : [] ?
__alloc_pages_nodemask+0x77f/0x940
Mar 26 10:50:52 tms2 kernel: : [] ? kmem_getpages+0x62/0x170
Mar 26 10:50:52 tms2 kernel: : [] ? fallback_alloc+0x1ba/0x270
Mar 26 10:50:52 tms2 kernel: : [] ? cache_grow+0x2cf/0x320
Mar 26 10:50:52 tms2 kernel: : [] ?
cache_alloc_node+0x99/0x160
Mar 26 10:50:52 tms2 kernel: : [] ?
dma_pin_iovec_pages+0xb5/0x230
Mar 26 10:50:52 tms2 kernel: : [] ? __kmalloc+0x189/0x220
Mar 26 10:50:52 tms2 kernel: : [] ?
dma_pin_iovec_pages+0xb5/0x230
Mar 26 10:50:52 tms2 kernel: : [] ? lock_sock_nested+0xac/0xc0
Mar 26 10:50:52 tms2 kernel: : [] ? tcp_recvmsg+0x4ca/0xe80
Mar 26 10:50:52 tms2 kernel: : [] ?
xfs_vm_wri