Correcting typos.

Michael,
Thanks for your prompt reply.

In my environment those two parameters are enabled. To give you a brief overview of our PG database environment:

Version 9.2.4.1
Windows 7 Professional SP1
fsync=on
full_page_writes=on
wal_sync_method=open_datasync
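Since the values above are what the rest of this question depends on, here is a minimal sketch of how they could be checked on each shipped image. It only parses postgresql.conf-style text; the helper names (`parse_conf`, `check_durability`) are made up for illustration, and the parsing is simplified (it does not handle a `#` inside a quoted value, or boolean spellings other than on/off). The value the running server actually uses is what `SHOW fsync;` reports in psql, which may differ from the file.

```python
import re

# Expected values, taken from the environment listed in this message.
EXPECTED = {
    "fsync": "on",
    "full_page_writes": "on",
    "wal_sync_method": "open_datasync",
}

def parse_conf(text):
    """Parse postgresql.conf-style 'name = value' lines into a dict."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments (simplified)
        if not line:
            continue
        m = re.match(r"([A-Za-z_][A-Za-z0-9_.]*)\s*=?\s*(.*)", line)
        if m:
            name = m.group(1).lower()
            value = m.group(2).strip().strip("'")
            settings[name] = value  # a later line overrides an earlier one
    return settings

def check_durability(settings, expected=EXPECTED):
    """Return {name: (expected, actual)} for every setting that differs."""
    return {
        name: (want, settings.get(name))
        for name, want in expected.items()
        if settings.get(name) != want
    }

sample = """
fsync = on
full_page_writes = off   # differs from the expected value
wal_sync_method = 'open_datasync'
"""
print(check_durability(parse_conf(sample)))
```

An empty result means the file matches the three values listed above; a setting missing from the file shows up with `None` as its actual value.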
My customer builds cancer-related systems, and we ship Dell systems with our software image, which contains PG. A few of the customers, around 5%, are facing corruption issues.

We are in the process of reproducing the issue. Since many variables are involved (Dell HW, software image versions, application versions, write-cache settings for RAID/disk, RAID controllers with no battery backup, power failures, etc.), I am trying to understand whether PG can end up with corrupted blocks after a system crash even though we set these parameters.

a) As I understand it, fsync writes the block from memory to disk, so the block would have been written to disk just after step 4), assuming the disk cache did not lie.
b) Assume that full_page_writes=on has dumped the whole 8k block into WAL before PG updates the block, i.e. after step 2) and before step 3).
c) If a crash happens after step 4), since there is no PageHeader data, after the system restarts PG will complain that the block is corrupted or has an invalid header.

Please correct me if my understanding of the roles of fsync and full_page_writes is wrong. If it is correct, I see a possibility of corruption whenever PG extends a relation and a crash happens just after step 4).

I am not sure whether the same applies to an existing page (not a new page), or how it is handled when a PageHeader is available as part of full_page_writes. Can the same corruption happen, or can PG recover the database? I am not sure whether, during recovery, the recovery process can update the PageHeader from the WAL records whose recptr it wrote as part of step 4).

-Sreekanth

On Fri, Dec 9, 2016 at 2:09 PM, Sreekanth Palluru <sree...@gmail.com> wrote:
> Michael,
> Thanks for your prompt reply
>
> In my environment those two parameters are enabled .
> Just give you brief
> of PG database envornment
> Version 9.2.4.1
> Windows 7 Professional SP1
> fsync=on
> full_page_writes=on
> wal_sync_method=open_datasync
>
> My Customer is into building Cancer related systems and we ship Dell
> systems with our software image contains PG. Few of the customers are
> facing corruption issues say around 5% .
> We are in process of reproducing the issue , since there are different
> variables involved in reproducing issue like Dell HW, Software image
> versions, Application versions, write-cache settings RAID/Disk, RAID
> controllers with no backup and power failures etc , I am trying to
> understand is there possibility that PG can end up in having corrupted
> blocks due to system crash.
>
> 1)As I understand fsycn will write the block from memory to disk and block
> just after step 4) would have written disk assuming disk cache did not lie
> 2)and assume that full_page_writes=on has dumped the whole 8k block into
> WAL
> before it updates block i.e. after step 2) and before 3)
> 3) if crash happens after step4) , since there is no PageHeader data ,
> after system restarts PG will complain that it is corrupted block or
> invalid header
>
> Please correct me if my understanding about play fsync and
> full_page_writes are correct ? if so , I see that there is possibility
> getting corruptions whenever PG extends a relation and crash happens just
> after step 4)
>
> I am not sure will the same applicable to existing page (not a new page)
> and how it handles if there is PageHeader available as part of
> full_page_writes, will same corruption can be happen or will PG can recover
> database as I am not sure
> recovery process can update the PageHeader from WAL records it wrote recptr
> as part of step 4) during the recovery process .
>
> -Sreekanth
>
> On Fri, Dec 9, 2016 at 12:44 PM, Michael Paquier <michael.paqu...@gmail.com> wrote:
>
>> (Please top-post that's annoying)
>>
>> On Fri, Dec 9, 2016 at 10:28 AM, Sreekanth Palluru <sree...@gmail.com> wrote:
>> > Can I generalize that, if after step 4) page ( new page or old page) got
>> > written disk from buffer and crash happens between step 4) and 5) we
>> > always get
>> > block corruption issues with Postgres which can only be recovered by setting
>> > zero_damaged_pages if we just have pg_dump backups and we are OK lose data
>> > in the affected blocks?
>> >
>> > I am also looking at ways of reproducing the issue ? appreciate your advice
>> > on it ?
>>
>> Postgres is designed to avoid such corruption problems if
>> full_page_writes and fsync are enabled, that's a base stone of its
>> reliability. If you can create a self-contained scenario able to
>> reproduce a failure, that could be treated as a Postgres bug, but you
>> are giving no evidence that this is the case.
>> --
>> Michael
>>
>
> --
> Regards
> Sreekanth
>

--
Regards
Sreekanth
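P.S. To make the question concrete, here is a toy model of the guarantee Michael describes: a full page image is durable in the WAL before the data page is overwritten, so a partially written page can be rebuilt at recovery. This is only a sketch under stated assumptions, not PostgreSQL's actual recovery code. The page layout (an 8-byte LSN standing in for the PageHeader), the 512-byte sector size, and all function names are made up for illustration, and it does not establish how relation extension is logged in 9.2.

```python
PAGE_SIZE = 8192   # the "8k block" discussed above
SECTOR = 512       # assumed unit the disk writes atomically (hypothetical)

def make_page(lsn, fill):
    """Toy page: an 8-byte LSN standing in for the page header, then payload."""
    return lsn.to_bytes(8, "big") + bytes([fill]) * (PAGE_SIZE - 8)

def torn_write(old, new, sectors_written):
    """Crash in the middle of a page write: only the first N sectors of
    the new page reach disk, the rest still holds the old contents."""
    cut = sectors_written * SECTOR
    return new[:cut] + old[cut:]

def recover(disk_page, wal):
    """Replay the WAL in order. Every record in this model is a full page
    image, so it overwrites the on-disk page whatever state it is in."""
    page = disk_page
    for _lsn, image in wal:
        page = image
    return page

old = make_page(lsn=1, fill=0xAA)
new = make_page(lsn=2, fill=0xBB)

# WAL-before-data: the full image is durable in the WAL before the data
# file is touched (this relies on fsync=on and an honest disk cache).
wal = [(2, new)]

# Crash while writing the data page: 3 of 16 sectors made it to disk.
on_disk = torn_write(old, new, sectors_written=3)
assert on_disk != old and on_disk != new   # torn: neither version

# Recovery replays the image and repairs the torn page.
assert recover(on_disk, wal) == new

# Same replay when the crash left a never-initialised (all-zero) page.
assert recover(bytes(PAGE_SIZE), wal) == new
```

In this model the page stays torn only when the WAL record itself never reached stable storage, i.e. `recover(on_disk, [])` returns the damaged page unchanged. That corresponds to the "disk cache lied" case, which is why the write-cache and battery-backup variables listed above seem worth isolating first when reproducing the issue.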