On Sat, Feb 11, 2017 at 10:38 AM, Michael Banck <michael.ba...@credativ.de> wrote:
> Hi,
>
> one take-away from the Gitlab Post-Mortem[1] appears to be that after
> their secondary lost replication, they were confused about what
> pg_basebackup was doing when they tried to rebuild it. It just sat there
> and did nothing (even with --verbose), so they assumed something was
> wrong with either the primary or the connection, and restarted it
> several times.
>
> AFAICT, it turns out the checkpoint was written on the master (they
> probably did not use -c fast), but this wasn't obvious to them:
>

Yeah, I've seen this happen to a number of people. I think that sounds like
what's happened here as well. I've considered things along the lines of the
patch you posted, but never got around to actually doing anything about it.

> ISTM that even with WAL streaming, nothing would be written on the
> client server until the checkpoint is complete, as do_pg_start_backup()
> runs the checkpoint and only returns the starting WAL location
> afterwards.
>
> The attached (untested) patch is to kick off a discussion on how to
> improve the situation, it is supposed to mention the checkpoint when
> --verbose is used and adds a paragraph about the checkpoint being run to
> the Notes section of the documentation.
>

Docs look good to me, other than claiming that pg_basebackup runs on a
server (it can run anywhere). I would just say "during which pg_basebackup
will appear idle". How does that sound to you?

As for the code, while I haven't tested it, isn't the "checkpoint
completed" message in the wrong place? Doesn't PQsendQuery() return
immediately, so the check needs to go *after* the PQgetResult() call?

-- 
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
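
To make the ordering question about PQsendQuery() and PQgetResult() concrete, here is a toy Python model of it. This is only a sketch: it does not use libpq and is not the patch under discussion. `FakeConnection`, `send_query`, `get_result`, `checkpoint_finished` and `start_backup` are hypothetical names, the first log message is illustrative wording, and the checkpoint is simulated with a sleep in a background thread.

```python
import threading
import time


class FakeConnection:
    """Hypothetical stand-in for a client connection; not a real PostgreSQL API."""

    def __init__(self, checkpoint_seconds):
        self._checkpoint_seconds = checkpoint_seconds
        self._done = threading.Event()

    def send_query(self, command):
        # Modeled on the behaviour described for PQsendQuery():
        # hand the command to the "server" and return right away.
        worker = threading.Thread(target=self._run_checkpoint, daemon=True)
        worker.start()
        return True

    def _run_checkpoint(self):
        # The simulated server is busy with its checkpoint.
        time.sleep(self._checkpoint_seconds)
        self._done.set()

    def get_result(self):
        # Modeled on PQgetResult(): block until the server has answered.
        self._done.wait()
        return "starting WAL location"

    def checkpoint_finished(self):
        return self._done.is_set()


def start_backup(conn, log):
    """Log progress and report whether the checkpoint was done at each step."""
    log.append("initiating base backup, waiting for checkpoint to complete")
    conn.send_query("BASE_BACKUP")
    # A "checkpoint completed" message here would be premature.
    finished_after_send = conn.checkpoint_finished()
    conn.get_result()
    # Only now has the simulated server finished its checkpoint.
    finished_after_result = conn.checkpoint_finished()
    log.append("checkpoint completed")
    return finished_after_send, finished_after_result


if __name__ == "__main__":
    messages = []
    print(start_backup(FakeConnection(0.2), messages))
    print(messages)
```

In this model the checkpoint is still running when `send_query` returns and is complete only once `get_result` returns, which is why the completion message is logged after the blocking call.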