> On Apr 9, 2018, at 2:25 PM, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
>
> On 04/09/2018 11:08 PM, Andres Freund wrote:
>> Hi,
>>
>> On 2018-04-09 13:55:29 -0700, Mark Dilger wrote:
>>> I can also imagine a master and standby that are similarly provisioned,
>>> and thus hit an out of disk error at around the same time, resulting in
>>> corruption on both, even if not the same corruption.
>>
>> I think it's a grave mistake conflating ENOSPC issues (which we should
>> solve by making sure there's always enough space pre-allocated), with
>> EIO type errors. The problem is different, the solution is different.

I'm happy to take your word for that.

> In any case, that certainly does not count as data corruption spreading
> from the master to standby.

Maybe not from the point of view of somebody looking at the code. But a
user might see it differently. If the data being loaded into the master
and getting replicated to the standby "causes" both to get corrupt, then
it seems like corruption spreading. I put "causes" in quotes because there
is some argument to be made about "correlation does not prove cause" and so
forth, but it still feels like causation from an arm's-length perspective.
If there is a pattern of standby servers tending to fail more often right
around the time that the master fails, you'll have a hard time comforting
users with "hey, it's not technically causation." If loading data into the
master causes the master to hit ENOSPC, and replicating that data to the
standby causes the standby to hit ENOSPC, and if the bug around ENOSPC has
not been fixed, then this looks like corruption spreading.

I'm certainly planning on taking a hard look at the disk allocation on my
standby servers right soon now.

mark