Dear all,

I'm running PostgreSQL (with the timescaledb extension enabled) in the following environment, and the DB pod has crashed several times at irregular intervals.
I want to know the cause of this failure.

Environment:
- Base: Kubernetes (microk8s)
- DB: PostgreSQL 12.4 (TimescaleDB) container
- DB container OS: Alpine Linux
- microk8s host OS: CentOS 7.6, Amazon Linux 2 (the crash occurred on multiple hosts)
- DB data location: mounted host directory (for data persistence)

1) PostgreSQL crashed and the Pod cannot start

I noticed that the timescaledb pod was being restarted repeatedly with the following error:

---
PostgreSQL Database directory appears to contain a database; Skipping initialization

[1] LOG: starting PostgreSQL 12.4 on x86_64-pc-linux-musl, compiled by gcc (Alpine 9.3.0) 9.3.0, 64-bit
[1] LOG: listening on IPv4 address "0.0.0.0", port 5432
[1] LOG: listening on IPv6 address "::", port 5432
[1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
[20] LOG: database system was shut down at 2021-02-15 21:15:12 UTC
[20] LOG: invalid primary checkpoint record
[20] PANIC: could not locate a valid checkpoint record
[1] LOG: startup process (PID 20) was terminated by signal 6: Aborted
[1] LOG: aborting startup due to startup process failure
[1] LOG: database system is shut down
---

I suspected WAL trouble and checked with pg_controldata and pg_waldump.

---
$ pg_controldata <PostgreSQL data dir>
pg_control version number:            1201
Catalog version number:               201909212
Database system identifier:           6909006008117780509
Database cluster state:               in production
pg_control last modified:             Mon Feb 15 21:11:21 2021
Latest checkpoint location:           15/8FB002C0
Latest checkpoint's REDO location:    15/8FAF3018
Latest checkpoint's REDO WAL file:    00000001000000150000008F
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0:33988723
Latest checkpoint's NextOID:          117637
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        480
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  33988723
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid: 0
Latest checkpoint's newestCommitTsXid: 0
Time of latest checkpoint:            Mon Feb 15 21:11:14 2021
Fake LSN counter for unlogged rels:   0/3E8
Minimum recovery ending location:     0/0
Min recovery ending loc's timeline:   0
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
wal_level setting:                    replica
wal_log_hints setting:                off
max_connections setting:              50
max_worker_processes setting:         15
max_wal_senders setting:              10
max_prepared_xacts setting:           0
max_locks_per_xact setting:           64
track_commit_timestamp setting:       off
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Data page checksum version:           0
Mock authentication nonce:            871e705d9393cdb1b161d5562656e20db314779af25942fa10d840983dc4ba84
---

I then checked the latest checkpoint's REDO WAL file:

---
$ pg_waldump -n 10 pg_wal/00000001000000150000008F
pg_waldump: fatal: WAL segment size must be a power of two between 1 MB and 1 GB, but the WAL file "00000001000000150000008F" header specifies 0 bytes
---

I could not read the WAL data. The file size is 16 MB (matching the WAL segment size setting), but its content is all zero bytes. I checked this with the od command:

---
$ od -N 40 -A d -v pg_wal/00000001000000150000008F
0000000 000000 000000 000000 000000 000000 000000 000000 000000
0000016 000000 000000 000000 000000 000000 000000 000000 000000
0000032 000000 000000 000000 000000
0000040
---

The previous WAL segment could still be read with pg_waldump, but I could not find any record corresponding to the latest checkpoint location (15/8FB002C0) reported by pg_controldata.

2) Ran pg_resetwal and the Pod started successfully

I ran the pg_resetwal command to repair the WAL trouble, and the DB pod started successfully. However, SELECT queries then failed on some tables.

3) Failed SELECT query

---
sampledb1=# select * from table1 limit 1;
ERROR: missing chunk number 0 for toast value 27678 in pg_toast_2619
---

I concluded that this error was raised by the pg_statistic table (and its TOAST table, pg_toast_2619). So I deleted the broken rows from pg_statistic and ran ANALYZE; after that, the affected tables were recovered.
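The cleanup was along the following lines (a sketch only; the table name table1 and the starelid predicate are illustrative placeholders, not the exact statements I ran):

---
-- as a superuser: remove the statistics rows whose TOAST data is broken
DELETE FROM pg_statistic WHERE starelid = 'table1'::regclass;
-- rebuild the statistics for that table
ANALYZE table1;
---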
4) Another data problem

I had another data problem after pg_resetwal. On one table, SELECT queries returned no rows, and INSERT queries failed (no response, waiting forever). This table has only a primary key index, so I suspected the primary key index was corrupted after pg_resetwal. I did not know how to repair the primary key index, so in the end I dropped and restored the table.

I suspect this WAL trouble was caused by a disk I/O problem, but no error appeared in the OS syslog. I would like to know what other causes are possible.

Best regards.
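P.S. For completeness, the pg_resetwal in step 2) was run roughly as follows (a sketch; the data directory path is assumed from the standard image layout and may differ from my setup):

---
# inside the DB container, with PostgreSQL stopped, as the postgres user
$ pg_resetwal -f /var/lib/postgresql/data
---

The -f option is required here because pg_control reports the cluster state as "in production" rather than "shut down".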
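Also, by "no error in the OS syslog" I mean host-side checks roughly along the following lines (illustrative commands, not exact copies of what I ran):

---
# on the microk8s host
$ dmesg -T | grep -iE 'i/o|error'
$ journalctl -k -p err --since "2021-02-15"
---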