Thank you for your reply! It was helpful. I'll check more about the state of writing on the disk storage.
And I will reinforce the backup & restore strategy. 2021年3月9日(火) 0:28 Laurenz Albe <laurenz.a...@cybertec.at>: > On Mon, 2021-03-08 at 22:30 +0900, Daisuke Ikeda wrote: > > I'm running PostgreSQL (enabled timescaledb extension) in the following > environment. > > And I had db pod crashed error several times irregularly. > > > > I want to know the cause of this fail. > > > > Environment: > > > > - Base: Kubernetes(microk8s) > > - DB: PostgreSQL 12.4 (TimescaleDB) Container > > - DB OS: Alpine Linux > > - microk8s host OS: CentOS 7.6, Amazon Linux 2 (Occured under some > hosts) > > - DB data location: Mounted host directory (for data persistence) > > > > 1) PostgreSQL crash and cannot start Pod.. > > > > I noticed that the timescaledb pod was restarted repeatedly with the > following error. > > > > --- > > PostgreSQL Database directory appears to contain a database; Skipping > initialization > > [1] LOG: starting PostgreSQL 12.4 on x86_64-pc-linux-musl, compiled by > gcc (Alpine 9.3.0) 9.3.0, 64-bit > > [1] LOG: listening on IPv4 address "0.0.0.0", port 5432 > > [1] LOG: listening on IPv6 address "::", port 5432 > > [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432" > > [20] LOG: database system was shut down at 2021-02-15 21:15:12 UTC > > [20] LOG: invalid primary checkpoint record > > [20] PANIC: could not locate a valid checkpoint record > > [1] LOG: startup process (PID 20) was terminated by signal 6: Aborted > > [1] LOG: aborting startup due to startup process failure > > [1] LOG: database system is shut down > > --- > > > > I thought it might be WAL trouble and tried to check with pg_controldata > and pg_waldump. > > It is WAL trouble. > WAL does not contain the checkpoint from before the crash. > > > I thought it might be WAL trouble and tried to check with pg_controldata > and pg_waldump. > > > > --- > > $ pg_controldata <PostgreSQL data dir> > > [...] > > Latest checkpoint location: 15/8FB002C0 > > Latest checkpoint's REDO location: 15/8FAF3018 > > Latest checkpoint's REDO WAL file: 00000001000000150000008F > > [...] > > > > I checked the latest checkpoint's REDO WAL file. > > > > --- > > $ pg_waldump -n 10 pg_wal/00000001000000150000008F > > pg_waldump: fatal: WAL segment size must be a power of two between 1 MB > and 1 GB, but the WAL file "00000001000000150000008F" header specifies 0 > bytes > > --- > > > > I cannot read wal data. > > This file size is 16MB (according to the wal size setting). > > But the content is all zero data. I checked this situation with "od" > command. > > > > --- > > $ od -N 40 -A d -v pg_wal/00000001000000150000008F > > 0000000 000000 000000 000000 000000 000000 000000 000000 000000 > > 0000016 000000 000000 000000 000000 000000 000000 000000 000000 > > 0000032 000000 000000 000000 000000 > > 0000040 > > --- > > Looks like modifications to this file were lost. > > > 2) Run pg_resetwal and successful start Pod > > > > I ran the pg_resetwal command to repair the WAL trouble. > > And the DB pod successfully started. > > Yes, but "pg_resetwal" on a crashed cluster leads to data corruption. > The best you can do now is salvage what you can. > > > But, I received select query fail for some tables. > > > > 3) failed select query > > > > --- > > sampledb1=# select * from table1 limit 1; > > ERROR: missing chunk number 0 for toast value 27678 in pg_toast_2619 > > --- > > > > I thought taht this error is raised for pg_statistic table (and related > toast table (pg_toast_2619)). > > So, I deleted the broken rows in pg_statistic table and ran ANALYZE > query. > > After then, this tables is recovered. > > Lucky you! > > > 4) another data trouble > > > > I had an another data trouble after pg_resetwal. > > > > In some table, SELECT query did not return any rows. > > And INSERT query failed (no response. Waiting forever...) > > This table have only primary key index. > > I thought it might have been caused by an pkey index trouble after > pg_resetwal. > > I didn't know how to repair the index of the primary key, and finally I > dropped table and restore. > > You should "pg_dumpall" the cluster and load it into a cluster > that was newly created with "initdb". > > Any problems loading the data have to be resolved manually. > > > I thought this wal trouble was caused by disk IO troubles. But any error > was not raised in OS syslog. > > I want to know any other causes. > > One cause might be unreliable storage that doesn't honor fsync requests > correctly. But given that your checkpoint location is pretty late in > the file, it seems unlikely that none of the data did make it to disk. > > Somehow you lost the data for a WAL segment, and you cannot recover. > > Of course, if you have a backup, you know what to do. > > Yours, > Laurenz Albe > -- > Cybertec | https://www.cybertec-postgresql.com > >