It's on the production database and on the streaming replica, but not on the snapshot.
production:

-rw------- 1 postgres postgres 312778752 May 13 21:28 /database/postgres/base/16416/291498116.3

streaming replica:

-rw------- 1 postgres postgres 312778752 May 13 23:50 /database/postgres/base/16416/291498116.3

Is there a way to find out what the file contains? (A lookup sketch is below, after the log excerpts.)

We just got some more information. All of the following was done / seen in the logs of the snapshot database.

After we saw this we ran a VACUUM FULL on the table we suspect is backed by this file. This happened, over and over again:

WARNING: concurrent insert in progress within table "js_equity_daily_diff"

So we killed the VACUUM FULL by hitting ctrl-c repeatedly. After that the logs contained:

fgrep ERROR postgresql-2013-05-14.csv | head

2013-05-14 00:23:16.028 EDT,"postgres_prod","proddb_testing",15698,"172.24.65.103:54536",5191bc31.3d52,3,"SELECT",2013-05-14 00:23:13 EDT,2/3330,0,ERROR,58P01,"could not open file ""base/16416/291498116.3"" (target block 431006): No such file or directory",,,,,"SQL statement ""select max(equity_date) from js_equity_daily_diff""

The above is the original error that made us investigate.

2013-05-14 09:21:47.121 EDT,"postgres_prod","proddb_testing",21002,"172.27.41.24:36815",51923a67.520a,38756,"VACUUM",2013-05-14 09:21:43 EDT,2/21611,645995272,ERROR,57014,"canceling statement due to user request",,,,,,"vacuum full js_equity_daily_diff",,,"psql"
2013-05-14 09:22:04.700 EDT,"postgres_prod","proddb_testing",21008,"172.27.41.24:36831",51923a75.5210,115909,"VACUUM",2013-05-14 09:21:57 EDT,2/21618,645995273,ERROR,57014,"canceling statement due to user request",,,,,,"vacuum full js_equity_daily_diff",,,"psql"

That is us cancelling the VACUUM FULL.

2013-05-14 09:22:13.947 EDT,,,30911,,51919d78.78bf,1,,2013-05-13 22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not satisfied --- flushed only to 1CEE/31266090",,,,,"writing block 0 of relation base/16416/291498116",,,,""
2013-05-14 09:22:14.964 EDT,,,30911,,51919d78.78bf,2,,2013-05-13 22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not satisfied --- flushed only to 1CEE/31266090",,,,,"writing block 0 of relation base/16416/291498116",,,,""

And after that these started appearing in the logs (they are now repeated every second):

[root@nyc-dbc-001 pg_log]# fgrep ERROR postgresql-2013-05-14.csv | tail -n 2
2013-05-14 09:47:43.301 EDT,,,30911,,51919d78.78bf,3010,,2013-05-13 22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not satisfied --- flushed only to 1CEE/3C869588",,,,,"writing block 0 of relation base/16416/291498116",,,,""
2013-05-14 09:47:44.317 EDT,,,30911,,51919d78.78bf,3012,,2013-05-13 22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not satisfied --- flushed only to 1CEE/3C869588",,,,,"writing block 0 of relation base/16416/291498116",,,,""

There are no earlier ERRORs in the logs. The repeated write failures also produce this WARNING:

2013-05-14 09:38:03.115 EDT,,,30911,,51919d78.78bf,1868,,2013-05-13 22:12:08 EDT,,0,ERROR,XX000,"xlog flush request 1D08/9B57FCD0 is not satisfied --- flushed only to 1CEE/3C869588",,,,,"writing block 0 of relation base/16416/291498116",,,,""
2013-05-14 09:38:03.115 EDT,,,30911,,51919d78.78bf,1869,,2013-05-13 22:12:08 EDT,,0,WARNING,58030,"could not write block 0 of base/16416/291498116","Multiple failures --- write error might be permanent.",,,,,,,,""

The disk is not full, nor are there any messages in the kernel logs.
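The lookup sketch mentioned above — a minimal one, assuming the snapshot still accepts connections. In the path base/16416/291498116.3, 16416 is the database OID, 291498116 is the relation's filenode, and the ".3" suffix only means the fourth 1 GB segment of that same relation:

-- which database owns the base/16416 directory?
SELECT datname FROM pg_database WHERE oid = 16416;

-- then, connected to that database: which relation uses this filenode?
SELECT relname, relkind
FROM pg_class
WHERE relfilenode = 291498116;

If that returns js_equity_daily_diff, it confirms the suspicion. (Caveat: a few mapped system catalogs store relfilenode = 0 in pg_class, so an empty result would not be conclusive by itself.)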
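About the "xlog flush request ... is not satisfied" errors: before writing out a dirty page, PostgreSQL must first flush WAL up to the LSN stamped in that page's header (WAL-before-data), and here block 0 carries an LSN (1D08/9B57FCD0) far beyond any WAL this cluster has actually generated (1CEE/...), so the write can never succeed. A sketch for seeing the two positions side by side, assuming the contrib module pageinspect can be installed here and the block is still readable (9.2 function names):

-- pageinspect ships with the contrib packages
CREATE EXTENSION pageinspect;

-- the LSN recorded in the header of block 0 of the suspect table
SELECT lsn FROM page_header(get_raw_page('js_equity_daily_diff', 0));

-- the WAL position this cluster has actually reached
SELECT pg_current_xlog_location();

If the page LSN really is ahead of pg_current_xlog_location(), that fits a snapshot whose data files are newer than the WAL that was copied alongside them.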
Cheers,

Bene

On Tue, May 14, 2013 at 9:27 AM, Heikki Linnakangas <hlinnakan...@vmware.com> wrote:

> On 14.05.2013 14:57, Benedikt Grundmann wrote:
>> Today we have seen this on our testing database instance:
>>
>> ERROR: could not open file "base/16416/291498116.3" (target block
>> 431006): No such file or directory
>>
>> That database gets created by rsyncing the LVM snapshot of the standby,
>> which is a read-only backup of proddb using streaming replication.
>>
>> We do not put the standby database into backup mode before rsyncing the
>> LVM snapshot, trusting postgres's ability to recover from a crash. Now
>> we are not sure anymore... Thoughts? It's worth noting that we had been
>> using this method of taking a backup for a long, long time, but only
>> recently (after 9.2 came out) did we switch to a streaming replica and
>> to taking the "frozen snapshot" backup from the streaming replica.
>>
>> Let me know if there is more information you need, or if this is a
>> known problem.
>
> That certainly should work. I'd suggest that you narrow down the problem
> a bit more. Is the file present in the original system? Anything else
> that might be related in the logs?
>
> - Heikki
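For completeness: one way to take this kind of backup without leaning on crash-recovery semantics at all is to copy through the replication protocol instead of rsyncing the LVM snapshot. A sketch, assuming 9.2's pg_basebackup and a replication-capable connection to the standby ("standby-host", the "replication" role, and the target directory are placeholders):

# stream a self-contained base backup from the standby; -X stream
# also copies the WAL needed to make the backup consistent on its own
pg_basebackup -h standby-host -U replication -D /backup/proddb_testing -X stream -P

(Taking a base backup from a standby on 9.2 has its own caveats — e.g. full_page_writes must be enabled on the master — so this is only a sketch of the direction, not a drop-in replacement.)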