On Thu, Feb 20, 2020 at 12:01 PM Alex Malek <magicag...@gmail.com> wrote:
> On Thu, Feb 20, 2020, 6:16 AM Amit Kapila <amit.kapil...@gmail.com> wrote:
>
>> On Thu, Feb 20, 2020 at 3:06 AM Alex Malek <magicag...@gmail.com> wrote:
>> >
>> > Hello Postgres Hackers -
>> >
>> > We are having a recurring issue on 2 of our replicas where replication
>> > stops with this message:
>> > "incorrect resource manager data checksum in record at ..."
>> > This has been occurring on average once every 1 to 2 weeks during large
>> > data imports (100s of GBs being written) on one of two replicas.
>> > Fixing the issue has been relatively straightforward: shut down the
>> > replica, remove the bad WAL file, and restart the replica; the good WAL
>> > file is then retrieved from the master.
>> > We are doing streaming replication using replication slots.
>> > However, twice now the master had already removed the WAL file, so the
>> > file had to be retrieved from the WAL archive.
>> >
>> > The WAL directories on the master and the replicas are on ZFS file
>> > systems.
>> > All servers are running RHEL 7.7 (Maipo)
>> > PostgreSQL 10.11
>> > ZFS v0.7.13-1
>> >
>> > The issue seems similar to
>> > https://www.postgresql.org/message-id/CANQ55Tsoa6%3Dvk2YkeVUN7qO-2YdqJf_AMVQxqsVTYJm0qqQQuw%40mail.gmail.com
>> > and to https://github.com/timescale/timescaledb/issues/1443
>> >
>> > One quirk in our ZFS setup is that ZFS is not handling our RAID array,
>> > so ZFS sees the array as a single device.
>> >
>> > Right before the issue started we did some upgrades and altered some
>> > Postgres configs and ZFS settings.
>> > We have been slowly rolling back changes, but so far the issue
>> > continues.
>> >
>> > Some interesting data points while debugging:
>> > We had lowered the ZFS recordsize from 128K to 32K, and for that week
>> > the issue started happening every other day.
>> > Using xxd and diff we compared "good" and "bad" WAL files, and the
>> > differences were not random bad bytes.
>> >
>> > The bad file either had a block of zeros that was not in the good file
>> > at that position, or other data. Occasionally the bad data has contained
>> > legible strings not in the good file at that position. At least one of
>> > those exact strings has existed elsewhere in the files.
>> > However, I am not sure if that is the case for all of them.
>> >
>> > This made me think that maybe there was an issue with WAL file recycling
>> > and ZFS under heavy load, so we tried lowering min_wal_size in order to
>> > "discourage" WAL file recycling. My understanding, though, is that a low
>> > value discourages recycling but it will still happen (unless wal_recycle
>> > is turned off, which is possible in PostgreSQL 12).
>> >
>>
>> We do print a message "recycled write-ahead log file .." in DEBUG2
>> mode. You either want to run the server with DEBUG2 or maybe change
>> the code to make it LOG and see if that is printed. If you do that,
>> you can verify if the corrupted WAL is the same as a recycled one.
>>
>
> Are you suggesting having the master, the replicas, or all of them in
> debug mode?
> How much extra logging would this generate?
> A replica typically consumes over 1 TB of WAL files before a bad WAL file
> is encountered.
>
>
>> > There is a third replica where this bug has not (yet?) surfaced.
>> > This leads me to guess the bad data does not originate on the master.
>> > This replica is older than the other replicas, with slower CPUs, less
>> > RAM, and a WAL disk array made of spinning disks.
>> > The OS, version of Postgres, and version of ZFS are the same as on the
>> > other replicas.
>> > This replica is not using a replication slot.
>> > This replica does not serve users, so load/contention is much lower
>> > than on the others.
>> > The other replicas often have 100% utilization of the disk array that
>> > houses the (non-WAL) data.
>> >
>> > Any insight into the source of this bug or how to address it?
>> >
>> > Since the master has a good copy of the WAL file, can the replica
>> > re-request the file from the master? Or from the archive?
>> >
>>
>> I think we do check the archive if we get the error during streaming,
>> but the archive might also have the same data, due to which this
>> problem happens. Have you checked whether the archive WAL file is
>> different from the bad WAL?
>>
>
> Typically the master, the archive and the other replicas all have a good
> copy of the WAL file.
>
>> See the relevant bits of code in WaitForWALToBecomeAvailable, especially
>> the code near the below comment:
>>
>> "Failure while streaming. Most likely, we got here because streaming
>> replication was terminated, or promotion was triggered. But we also
>> get here if we find an invalid record in the WAL streamed from master,
>> in which case something is seriously wrong. There's little chance that
>> the problem will just go away, but PANIC is not good for availability
>> either, especially in hot standby mode. So, we treat that the same as
>> disconnection, and retry from archive/pg_wal again. The WAL in the
>> archive should be identical to what was streamed, so it's unlikely
>> that it helps, but one can hope..."
>>
>
> Thank you for this comment!
> This made me realize that on the replicas I had mentioned, we had removed
> the restore_command.
> The replica we thought was not having the issue was actually also
> getting/producing bad WAL files, but it was eventually recovering by
> getting a good WAL file from the archive because it had the
> restore_command defined.

So, ignoring what is causing the underlying issue, what would be involved
in adding the ability for the replica to try to re-request the WAL file
from the master first? It seems that would make replication more resilient
and would address similar issues, such as
https://www.postgresql.org/message-id/capv0rxgztfr2u5o3g70omoh+wqyhmwq1agsml+pqhmjff71...
that do not involve ZFS at all.

Thanks.
Alex
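
P.S. For context, this is roughly what the standby's recovery.conf looks
like now that restore_command is back in place. The hostname, user, slot
name and archive path below are placeholders rather than our actual
values:

# recovery.conf on the standby (PostgreSQL 10)
standby_mode = 'on'
# placeholder connection string and slot name
primary_conninfo = 'host=master.example.com port=5432 user=replicator'
primary_slot_name = 'replica1_slot'
# lets the standby fetch a segment from the WAL archive when the
# streamed copy fails record validation (placeholder archive path)
restore_command = 'cp /wal_archive/%f %p'

With that defined, the standby falls back to the archive as described in
the WaitForWALToBecomeAvailable comment quoted above, though being able to
retry the master first would still seem preferable.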