Re: [GENERAL] Would like to below scenario is possible for getting page/block corruption

Sreekanth Palluru Thu, 08 Dec 2016 19:23:02 -0800

Correcting typos
Michael,
Thanks for your prompt reply.

In my environment those two parameters are enabled. To give you a brief
overview of the PG database environment:
Version 9.2.4.1
Windows 7 Professional SP1
fsync=on
full_page_writes=on
wal_sync_method=open_datasync

My customer builds cancer-related systems, and we ship Dell systems with our
software image, which contains PG. A few of the customers, around 5%, are
facing corruption issues.
We are in the process of reproducing the issue. Since many variables are
involved (Dell hardware, software image versions, application versions,
RAID/disk write-cache settings, RAID controllers with no battery backup,
power failures, etc.), I am trying to understand whether PG can end up with
corrupted blocks after a system crash even though we set these parameters.
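To make the write-cache concern concrete, here is a toy Python model. It is purely an illustrative assumption, not PostgreSQL, Windows or RAID-controller code, and `ToyDisk` and `commit_then_crash` are hypothetical names. It assumes a device that acknowledges writes from a volatile cache: an honest device persists that cache when asked to flush, while a "lying" one acknowledges the flush without persisting anything. Since fsync=on relies on the flush actually reaching stable storage, a lying cache can lose WAL that was already reported as committed.

```python
from typing import Dict, Optional


class ToyDisk:
    """Hypothetical disk with a volatile write-back cache (no battery backup)."""

    def __init__(self, honors_flush: bool) -> None:
        self.honors_flush = honors_flush
        self.cache: Dict[int, bytes] = {}    # volatile: lost on power failure
        self.platter: Dict[int, bytes] = {}  # persistent media

    def write(self, block: int, data: bytes) -> None:
        # Writes land in the cache and are acknowledged immediately.
        self.cache[block] = data

    def flush(self) -> None:
        # An honest device persists cached blocks before acknowledging the
        # flush; a "lying" one acknowledges without persisting anything.
        if self.honors_flush:
            self.platter.update(self.cache)
            self.cache.clear()

    def power_failure(self) -> None:
        # Whatever is still only in the volatile cache is gone.
        self.cache.clear()

    def read(self, block: int) -> Optional[bytes]:
        # The cache holds the newest copy; fall back to the platter.
        return self.cache.get(block, self.platter.get(block))


def commit_then_crash(honors_flush: bool) -> Optional[bytes]:
    disk = ToyDisk(honors_flush)
    disk.write(0, b"WAL record of a committed transaction")
    disk.flush()          # the durability request made at commit time
    disk.power_failure()  # power is lost after the commit was acknowledged
    return disk.read(0)


print(commit_then_crash(honors_flush=True))   # b'WAL record of a committed transaction'
print(commit_then_crash(honors_flush=False))  # None
```

In this model no database-side setting can compensate for the lying device, which is why the write-cache and battery-backup variables matter when reproducing the issue.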

a) As I understand it, fsync will write the block from memory to disk, so the
block would have been written to disk just after step 4), assuming the disk
cache did not lie.
b) Assume that full_page_writes=on has dumped the whole 8k block into WAL
before the block is updated, i.e. after step 2) and before step 3).
c) If the crash happens after step 4), since there is no PageHeader data,
after the system restarts PG will complain that the block is corrupted or
has an invalid header.
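Points a) to c) can be explored with a toy model of a "torn page". It assumes, hypothetically, that a page is 16 sectors (sector 0 holding a tuple count, the rest holding tuple slots), that only single-sector writes are atomic, and that a crash can leave only a prefix of an 8k page write on disk. `insert`, `torn_write`, `redo` and the record kinds `"INSERT"`/`"FPI"` are illustrative names, not PostgreSQL functions. The model ignores page LSNs and assumes the full-page image is logged with the first change to the page after a checkpoint and restored instead of replaying that change.

```python
from typing import Any, List, Tuple

SECTORS = 16  # toy page: sector 0 = header (tuple count), 1..15 = tuple slots
Page = List[Any]
Record = Tuple[str, Any]  # ("INSERT", value) or ("FPI", full page image)


def empty_page() -> Page:
    return [0] + [None] * (SECTORS - 1)


def insert(page: Page, value: str) -> Page:
    # Logical change: put the value in the next free slot named by the header.
    new = list(page)
    count = new[0]
    new[count + 1] = value
    new[0] = count + 1
    return new


def torn_write(old: Page, new: Page, sectors_done: int) -> Page:
    # Crash mid-write: only the first `sectors_done` sectors reached the disk.
    return new[:sectors_done] + old[sectors_done:]


def consistent(page: Page) -> bool:
    # The header count must match the slots that are actually filled.
    count, slots = page[0], page[1:]
    return all(s is not None for s in slots[:count]) and all(
        s is None for s in slots[count:]
    )


def redo(page: Page, wal: List[Record]) -> Page:
    for kind, payload in wal:
        if kind == "FPI":
            page = list(payload)          # restore the whole logged image
        else:
            page = insert(page, payload)  # replay the logical change
    return page


def crash_and_recover(full_page_writes: bool) -> bool:
    on_disk = insert(empty_page(), "t1")  # page as of the last checkpoint
    in_memory = insert(on_disk, "t2")     # first change after the checkpoint
    wal: List[Record] = (
        [("FPI", in_memory)] if full_page_writes else [("INSERT", "t2")]
    )
    # WAL was flushed at commit; the later data page write is torn.
    torn = torn_write(on_disk, in_memory, sectors_done=1)
    return consistent(redo(torn, wal))


print(crash_and_recover(full_page_writes=True))   # True
print(crash_and_recover(full_page_writes=False))  # False
```

Replaying a logical change on top of a torn page compounds the damage, whereas restoring the full image makes the page whole again regardless of which sectors survived.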

Please correct me if my understanding of how fsync and full_page_writes play
together is wrong. If it is correct, I see a possibility of getting corruption
whenever PG extends a relation and a crash happens just after step 4).

I am not sure whether the same applies to an existing page (not a new page),
where a PageHeader is available as part of the full-page write. Can the same
corruption happen, or can PG recover the database? I am not sure whether the
recovery process can update the PageHeader from the WAL record (recptr) it
wrote as part of step 4).
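On the page-header question, the sketch below is a toy model, not PostgreSQL's redo code. It assumes that each page carries the LSN of the last WAL record applied to it (the real page header does have such a field, pd_lsn), that redo restores a full-page image wholesale, header included, and that redo otherwise replays a record only when the page's LSN is older than the record's. `Page`, `WalRecord` and `redo` are hypothetical names.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Page:
    lsn: int                       # stand-in for the header's page LSN
    rows: Tuple[str, ...] = ()


@dataclass(frozen=True)
class WalRecord:
    lsn: int
    row: str
    image: Optional[Page] = None   # full-page image, if one was logged


def redo(page: Page, wal: List[WalRecord]) -> Page:
    for rec in wal:
        if rec.image is not None:
            # Restore the whole page, header included; the change is in the image.
            page = replace(rec.image, lsn=rec.lsn)
        elif page.lsn >= rec.lsn:
            # The page already reflects this record (it reached disk pre-crash).
            continue
        else:
            page = Page(lsn=rec.lsn, rows=page.rows + (rec.row,))
    return page


base = Page(lsn=10, rows=("a",))
wal = [
    WalRecord(20, "b", image=Page(lsn=20, rows=("a", "b"))),
    WalRecord(30, "c"),
]
print(redo(base, wal))             # Page(lsn=30, rows=('a', 'b', 'c'))
print(redo(redo(base, wal), wal))  # same page: replaying again is harmless
```

Under these assumptions the header LSN is rewritten by recovery itself, so a page that was already written before the crash is neither double-applied nor left with a stale header.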

-Sreekanth

On Fri, Dec 9, 2016 at 2:09 PM, Sreekanth Palluru <sree...@gmail.com> wrote:

> On Fri, Dec 9, 2016 at 12:44 PM, Michael Paquier <
> michael.paqu...@gmail.com> wrote:
>
>> (Please do not top-post, that's annoying.)
>>
>> On Fri, Dec 9, 2016 at 10:28 AM, Sreekanth Palluru <sree...@gmail.com>
>> wrote:
>> > Can I generalize that, if after step 4) a page (new or old) got written
>> > to disk from the buffer and a crash happens between steps 4) and 5), we
>> > always get block corruption with Postgres, which can only be recovered
>> > by setting zero_damaged_pages if we only have pg_dump backups and are OK
>> > to lose the data in the affected blocks?
>> >
>> > I am also looking at ways of reproducing the issue; I would appreciate
>> > your advice on it.
>>
>> Postgres is designed to avoid such corruption problems if
>> full_page_writes and fsync are enabled; that's a cornerstone of its
>> reliability. If you can create a self-contained scenario able to
>> reproduce a failure, that could be treated as a Postgres bug, but you
>> are giving no evidence that this is the case.
>> --
>> Michael
>>
>
>
>
> --
> Regards
> Sreekanth
>

-- 
Regards
Sreekanth
