Hi,

On 2020-02-19 16:35:53 -0500, Alex Malek wrote:
> We are having a reoccurring issue on 2 of our replicas where replication
> stops due to this message:
> "incorrect resource manager data checksum in record at ..."

Could you show the *exact* log output please? Because this could
temporarily occur without signalling anything bad, if e.g. the
replication connection goes down.

> Right before the issue started we did some upgrades and altered some
> postgres configs and ZFS settings.
> We have been slowly rolling back changes but so far the issue continues.
>
> Some interesting data points while debugging:
> We had lowered the ZFS recordsize from 128K to 32K and for that week the
> issue started happening every other day.
> Using xxd and diff we compared "good" and "bad" wal files and the
> differences were not random bad bytes.
>
> The bad file either had a block of zeros that were not in the good file at
> that position or other data. Occasionally the bad data has contained
> legible strings not in the good file at that position. At least one of
> those exact strings has existed elsewhere in the files.
> However I am not sure if that is the case for all of them.
>
> This made me think that maybe there was an issue w/ wal file recycling and
> ZFS under heavy load, so we tried lowering
> min_wal_size in order to "discourage" wal file recycling but my
> understanding is a low value discourages recycling but it will still
> happen (unless setting wal_recycle in psql 12).

This sounds a lot more like a broken filesystem than anything on the PG
level.

> When using replication slots, what circumstances would cause the master to
> not save the WAL file?

What do you mean by "save the WAL file"?

Greetings,

Andres Freund
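
As an aside, the xxd/diff comparison of "good" and "bad" WAL files described
in the quoted message could also be scripted, which makes it easier to see
where each damaged region starts and whether it is a block of zeros or other
data. A minimal sketch in Python, assuming both copies of the segment fit in
memory. The function name diff_runs and the 8192-byte scan chunk are
illustrative choices made here, not anything PostgreSQL provides:

```python
def diff_runs(good: bytes, bad: bytes, chunk: int = 8192):
    """Return (start, end, all_zero) for each run of differing bytes.

    start/end are byte offsets (end exclusive); all_zero is True when the
    differing region in `bad` consists only of zero bytes.
    `chunk` is only a scan granularity (an assumption, not a WAL constant
    read from the files): identical chunks are skipped without a byte scan.
    """
    runs = []
    start = None
    n = min(len(good), len(bad))
    for base in range(0, n, chunk):
        end = min(base + chunk, n)
        if good[base:end] == bad[base:end]:
            # Whole chunk matches: close any run left open by the last chunk.
            if start is not None:
                runs.append((start, base))
                start = None
            continue
        for i in range(base, end):
            if good[i] != bad[i]:
                if start is None:
                    start = i
            elif start is not None:
                runs.append((start, i))
                start = None
    if start is not None:
        runs.append((start, n))
    return [(s, e, not any(bad[s:e])) for s, e in runs]
```

Reading the two files with open(path, "rb").read() and passing the contents
to diff_runs gives a list of offsets that can be checked against the ZFS
recordsize: damage that starts and ends on recordsize boundaries would be
one more hint pointing at the filesystem.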