On Thu, Mar 26, 2020 at 12:41 PM Robert Haas <robertmh...@gmail.com> wrote:
>
> On Wed, Mar 25, 2020 at 8:53 AM Peter Eisentraut
> <peter.eisentr...@2ndquadrant.com> wrote:
> > HINT: This is to be expected if this is the end of the WAL. Otherwise,
> > it could indicate corruption.
>
> First, I agree that this general issue is a problem, because it's come
> up for me in quite a number of customer situations. Either people get
> scared when they shouldn't, because the message is innocuous, or they
> don't get scared about other things that actually are scary, because
> if some scary-looking messages are actually innocuous, it can lead
> people to believe that the same is true in other cases.
>
> Second, I don't really like the particular formulation you have above,
> because the user still doesn't know whether or not to be scared. Can
> we figure that out? I think if we're in crash recovery, I think that
> we should not be scared, because we have no alternative to assuming
> that we've reached the end of WAL, so all crash recoveries will end
> like this. If we're in archive recovery, we should definitely be
> scared if we haven't yet reached the minimum recovery point, because
> more WAL than that should certainly exist. After that, it depends on
> how we got the WAL. If it's being streamed, the question is whether
> we've reached the end of what got streamed. If it's being copied from
> the archive, we ought to have the whole segment, but maybe not more.
> Can we get the right context to the point where the error is being
> reported to know whether we hit the error at the end of the WAL that
> was streamed? If not, can we somehow rejigger things so that we only
> make it sound scary if we keep getting stuck at the same point when we
> would've expected to make progress meanwhile?
>
> I'm just spitballing here, but it would be really good if there's a
> way to know definitely whether or not you should be scared. Corrupted
> WAL segments are definitely a thing that happens, but retries are a
> lot more common.
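
To make the cases in the quoted text concrete, here is a rough sketch of that breakdown as a small decision function. It is written in Python purely for illustration; none of these names (`ReadFailure`, `classify`, the enum values, the LSN fields) exist in PostgreSQL, and the archive rule (a failure exactly at a segment boundary is expected, a failure mid-segment is not) is only one reading of "we ought to have the whole segment, but maybe not more".

```python
from dataclasses import dataclass
from enum import Enum


class RecoveryMode(Enum):
    CRASH = "crash"
    ARCHIVE = "archive"


class WalSource(Enum):
    LOCAL = "local"      # WAL already on disk (hypothetical label)
    ARCHIVE = "archive"  # segment copied from the archive
    STREAM = "stream"    # WAL received via streaming replication


class Verdict(Enum):
    EXPECTED = "expected"  # most likely just the end of available WAL
    ALARMING = "alarming"  # more WAL should exist; possible corruption
    UNKNOWN = "unknown"    # not enough context to say either way


@dataclass
class ReadFailure:
    """Hypothetical context captured where the WAL read error is reported."""
    mode: RecoveryMode
    source: WalSource
    error_lsn: int               # position at which the invalid record was hit
    min_recovery_lsn: int = 0    # minimum recovery point (archive recovery)
    streamed_upto_lsn: int = 0   # end of what has been streamed so far
    at_segment_end: bool = False # error falls exactly on a segment boundary


def classify(f: ReadFailure) -> Verdict:
    # Crash recovery: every crash recovery ends at an invalid record,
    # so there is nothing to be scared of.
    if f.mode is RecoveryMode.CRASH:
        return Verdict.EXPECTED
    # Archive recovery before the minimum recovery point: WAL up to that
    # point must exist, so failing earlier is a real problem.
    if f.error_lsn < f.min_recovery_lsn:
        return Verdict.ALARMING
    # Streaming: fine only if we reached the end of what was streamed.
    if f.source is WalSource.STREAM:
        if f.error_lsn >= f.streamed_upto_lsn:
            return Verdict.EXPECTED
        return Verdict.ALARMING
    # Archive: the whole segment should be there, but maybe not more.
    if f.source is WalSource.ARCHIVE:
        return Verdict.EXPECTED if f.at_segment_end else Verdict.ALARMING
    return Verdict.UNKNOWN
```

The retry idea ("only make it sound scary if we keep getting stuck at the same point") is not modelled here; it would need state carried across attempts rather than a single failure record.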

First, I agree that getting enough context to say precisely is by far the ideal.

That being said, as an end user who's found this surprising -- and momentarily scary every time I initially scan it even though I *know intellectually it's not* -- I would find Peter's suggestion a significant improvement over what we have now. I'm fairly certain my co-workers on our database team would also. Knowing that something is at least not always scary is good. Though I'll grant that this does have the negative in reverse: if it actually is a scary situation...this mutes your concern level. On the other hand, monitoring would tell us if there's a real problem (namely replication lag), so I think the trade-off is clearly worth it.

How about this minor tweak:

HINT: This is expected if this is the end of currently available WAL. Otherwise, it could indicate corruption.

Thanks,
James