On Thu, Jan 11, 2018 at 2:14 AM, Andrey Borodin <x4...@yandex-team.ru> wrote:
> I like heapam verification functionality and use it right now. So, I'm
> planning to provide review for this patch, probably, this week.
Great!

> Seems like new check is working 4 orders of magnitudes faster then
> bt_index_parent_check() and still finds my specific error that
> bt_index_check() missed.

> From this output I see that there is corruption, but cannot understand:
> 1. What is the scale of corruption
> 2. Are these corruptions related or not

I don't know the answer to either question, and I don't think that anyone
else could provide much more certainty than that, at least when it comes
to the general case. I think it's important to remember why that is.

When amcheck raises an error, that really should be a rare, exceptional
event. When I ran amcheck on Heroku's platform, that was what we found --
it tended to be some specific software bug in all cases (it turns out that
Amazon's EBS has been very reliable in the last few years, at least when
it comes to avoiding silent data corruption). In general, the nature of
those problems was very difficult to predict.

The PostgreSQL project strives to provide a database system that never
loses data, and I think that we generally do very well there. It's
probably also true that (for example) Yandex has some very good DBAs, who
take every reasonable step to prevent data loss (validating hardware,
providing substantial redundancy at the storage level, and so on). We
trust the system, and you trust your own operational procedures, and for
the most part everything runs well, because you (almost) think of
everything.

I think that running amcheck at scale is interesting because its very
general approach to validation gives us an opportunity to learn *what we
were wrong about*. Sometimes the reasons will be simple, and sometimes
they'll be complicated, but they'll always be something that we tried to
account for in some way, and just didn't think of, despite our best
efforts. I know that torn pages can happen, which is a kind of corruption
-- that's why crash recovery replays FPIs. If I knew what problems amcheck
might find, then I probably would have already found a way to prevent them
from happening in the first place -- there are limits to what we can
predict. (Google "Ludic fallacy" for more information on this general
idea.)

I try to be humble about these things. Very complicated systems can have
very complicated problems that stay around for a long time without being
discovered. Just ask Intel. While it might be true that some people will
use amcheck as the first line of defense, I think that it makes much more
sense as the last line of defense.

So, to repeat myself -- I just don't know.

> I think an interface to list all or top N error could be useful.

I think that it might be useful if you could specify a limit on how many
errors you'll accept before giving up. I think that it's likely less
useful than you think, though. Once amcheck detects even a single problem,
all bets are off. Or, at least, any prediction that I might try to give
you now isn't worth much. Theoretically, amcheck should *never* find any
problem, which is actually what happens in the vast majority of real world
cases. When it does find a problem, there should be some new lesson to be
learned. If there isn't some new insight, then somebody somewhere is doing
a bad job.

--
Peter Geoghegan
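
P.S. For anyone who wants to try the comparison described upthread,
something like the following should work, assuming the patched amcheck
where bt_index_check() and bt_index_parent_check() accept the proposed
"heapallindexed" argument (the index name here is just a placeholder):

    CREATE EXTENSION IF NOT EXISTS amcheck;

    -- Lightweight check: acquires only an AccessShareLock on the index
    -- and its heap relation, so it can run alongside normal activity.
    -- With heapallindexed = true it also verifies that heap tuples that
    -- should be indexed actually have matching index entries.
    SELECT bt_index_check('my_index'::regclass, true);

    -- Heavier check: acquires a ShareLock, and additionally verifies
    -- parent/child relationships within the B-Tree structure.
    SELECT bt_index_parent_check('my_index'::regclass, true);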