Are ZFS snapshots unsafe when PostgreSQL is spread across multiple zpools?
Hello everybody,

I’m using PostgreSQL on OpenZFS, with ZFS snapshots as a combined backup and hot-spare method.

From man zfs-snapshot: “Snapshots are taken atomically, so that all snapshots correspond to the same moment in time.” So if a PostgreSQL instance is started from a ZFS snapshot, it will replay the WAL from the last checkpoint, just as it would after a crash or power loss. To my knowledge, this means ZFS snapshots can be used to roll back to a previous point in time, and sending those snapshots to other computers gives you hot spares and remote backups. If I’m wrong here, I would appreciate being told, because the whole question rests on this premise.

On the other hand, we have the PostgreSQL tablespace feature, which is great because it allows “unimportant” big data to be written to cheap HDDs and frequently used data to fast NVMe drives. So far, so good.

The problem is when both ideas are merged. Snapshots of different pools are NOT atomic: the snapshot on the HDD pool is not taken at the exact same time as the one on the SSD pool, and I don’t know enough about PostgreSQL internals to judge how dangerous that is. So here is where I would like to ask for your help with the following questions:

First, what kind of problem can this lead to? Are we talking about potential corruption of the whole database, or only the loss of a few of the latest transactions?

Second, if I initialize a corrupted PostgreSQL instance because the ZFS snapshots come from different pools and slightly different times, will I notice it somehow, or will it fail silently?

Third and last, is there some way to quantify the risk taken when snapshotting a PostgreSQL instance spread across two (or more) pools?

Thanks for your time,

Héctor
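For context, the atomicity promise in man zfs-snapshot holds within a single pool: a recursive snapshot covers all datasets of one pool in one transaction, whereas two pools always require two separate commands. A minimal sketch, with made-up pool and dataset names:

    # Atomic: -r snapshots every dataset under tank/pg at the same instant.
    zfs snapshot -r tank/pg@nightly

    # NOT atomic: two pools mean two commands at slightly different times.
    zfs snapshot ssdpool/pgdata@nightly
    zfs snapshot hddpool/pgtblspc@nightly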
RE: Are ZFS snapshots unsafe when PostgreSQL is spread across multiple zpools?
> But you cannot and should not rely on snapshots alone.

That applies only to non-atomic (multi-pool) snapshots, doesn’t it?

If I have to rely on ZFS (automated) snapshots alone, would the best option then be two databases, one per pool: an HDD database and an SSD database? The backend code would then have to know which database holds the requested data.

From: Magnus Hagander <mag...@hagander.net>
Sent: Sunday, January 15, 2023 20:36
To: HECTOR INGERTO <hector_...@hotmail.com>
CC: pgsql-gene...@postgresql.org
Subject: Re: Are ZFS snapshots unsafe when PostgreSQL is spread across multiple zpools?

On Sun, Jan 15, 2023 at 8:18 PM HECTOR INGERTO <hector_...@hotmail.com> wrote:

> [...]
>
> First, what kind of problem can this lead to? Are we talking about potential corruption of the whole database, or only the loss of a few of the latest transactions?

Silent data corruption. *Not* just losing your latest transactions.

> Second, if I initialize a corrupted PostgreSQL instance because the ZFS snapshots come from different pools and slightly different times, will I notice it somehow, or will it fail silently?

Silent. You might notice at the application level. Might.

> Third and last, is there some way to quantify the risk taken when snapshotting a PostgreSQL instance spread across two (or more) pools?

"Don't do it." If you can't get atomic snapshots, don't do it, period.

You can use them together with a regular online backup, that is, pg_start_backup() // <take your snapshots> // pg_stop_backup(), together with log archiving. That's a perfectly valid method. But you cannot and should not rely on snapshots alone.
--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
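The online-backup method Magnus mentions brackets the snapshots so they no longer need to be simultaneous; WAL replayed from the archive reconciles the pools. A minimal sketch, assuming made-up pool names; on PostgreSQL 15+ the functions are pg_backup_start()/pg_backup_stop(), while older releases spell them pg_start_backup()/pg_stop_backup():

    psql <<'EOF'
    -- Start a base backup; this session must stay open until the stop call.
    SELECT pg_backup_start('zfs-multi-pool');
    -- Take the per-pool snapshots while the backup is in progress.
    \! zfs snapshot ssdpool/pgdata@base
    \! zfs snapshot hddpool/pgtblspc@base
    -- Finish the backup; save the returned backup_label with the snapshots.
    SELECT labelfile FROM pg_backup_stop();
    EOF

As Magnus notes, this is only valid together with WAL archiving; without the archived WAL, the snapshots are no better than before.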
RE: Are ZFS snapshots unsafe when PostgreSQL is spread across multiple zpools?
I understand that I should not do it, but could we discuss the technical details of why silent database corruption can occur with non-atomic snapshots?
RE: Are ZFS snapshots unsafe when PostgreSQL is spread across multiple zpools?
> The database relies on the data being consistent when it performs crash recovery.
>
> Imagine that a checkpoint is running while you take your snapshot. The checkpoint syncs a data file with a new row to disk. Then it writes a WAL record and updates the control file. Now imagine that the table with the new row is on a different file system, and your snapshot captures the WAL and the control file, but not the new row (it was still sitting in the kernel page cache when the snapshot was taken). You end up with a lost row.
>
> That is only one scenario. Many other ways of corruption can happen.

Can we say, then, that the risk comes only from the possibility of a checkpoint running inside the time gap between the non-simultaneous snapshots?
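The control file Laurenz describes can be inspected directly with pg_controldata, which ships with PostgreSQL; crash recovery starts replaying WAL from the REDO point recorded there. A quick check, with a made-up data directory path:

    # Show the checkpoint location and REDO point recorded in pg_control.
    pg_controldata /path/to/pgdata | grep -i checkpoint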
RE: Are ZFS snapshots unsafe when PostgreSQL is spread across multiple zpools?
> Another case: a transaction COMMITs, and a slightly later transaction reads the data and sets a hint bit. If the snapshot of the file system with the data directory in it is slightly later than the snapshot of the file system with "pg_wal", the COMMIT might not be part of the snapshot, but the hint bit could be.
>
> Then these uncommitted data could be visible if you recover from the snapshot.
>
> Yours,
> Laurenz Albe

Thank you all. I have a clearer picture now.

As a last point: would making the snapshot of the WAL dataset first or last make any difference?
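For what it’s worth, whether a given COMMIT record actually made it into a snapshot’s WAL can be checked offline with pg_waldump. A sketch, with a made-up segment name:

    # List WAL records in a segment from the snapshot and look for COMMITs;
    # a hint bit on disk without its COMMIT in the WAL is exactly the
    # inconsistency described above.
    pg_waldump /snapshot/pg_wal/000000010000000000000001 | grep COMMIT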
RE: Are ZFS snapshots unsafe when PostgreSQL is spread across multiple zpools?
I wanted to understand the underlying issue.

I use ZFS snapshots instead of a “correct” backup because, with only two machines, they give me backups on the main machine and on the secondary at the same time, and the secondary also acts as a hot spare. To accomplish the same otherwise I would need three nodes: the primary, the hot-spare replica, and the proper backup.

From: Laurenz Albe <laurenz.a...@cybertec.at>
Sent: Wednesday, January 18, 2023 11:02
To: HECTOR INGERTO <hector_...@hotmail.com>; Magnus Hagander <mag...@hagander.net>
CC: pgsql-gene...@postgresql.org
Subject: Re: Are ZFS snapshots unsafe when PostgreSQL is spread across multiple zpools?

On Tue, 2023-01-17 at 15:22 +, HECTOR INGERTO wrote:
> [...]
>
> As a last point: would making the snapshot of the WAL dataset first or last make any difference?

Imagine you run DROP TABLE. During the implicit COMMIT at the end of the statement, the files behind the table are deleted. If the snapshot of "pg_wal" is earlier than the snapshot of the data files, you end up with a table that is not yet dropped, but whose files are gone.

I won't try to find an example if you now ask: what if no checkpoint ends between the statements, the snapshot of "pg_wal" is earlier, and we don't run DROP TABLE?

Why do you go to all this effort rather than performing a correct backup?

Yours,
Laurenz Albe
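The two-machine layout described here is usually driven by incremental snapshot replication. A minimal sketch, with made-up host, pool, and dataset names:

    # On the primary: send only the delta between @prev and @latest,
    # including descendant datasets (-R), to the hot spare.
    zfs send -R -i tank/pg@prev tank/pg@latest | ssh standby zfs receive -F tank/pg

    # Starting PostgreSQL from the received snapshot replays WAL from the
    # last checkpoint, as after a crash -- safe only if @latest was atomic
    # across everything the instance uses.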
PostgreSQL optimizations for CoW FS
Let’s say we have to run a PostgreSQL instance on top of a copy-on-write filesystem like ZFS or BTRFS. In addition to setting full_page_writes = off, what other optimizations can be done on the PostgreSQL side? Regards, Héctor
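A sketch of commonly suggested starting points, not a definitive recipe; the dataset name is made up, and wal_init_zero/wal_recycle exist since PostgreSQL 12:

    # postgresql.conf:
    #   full_page_writes = off   # safe only because CoW never exposes torn pages
    #   wal_init_zero = off      # zero-filling new WAL files buys nothing on CoW
    #   wal_recycle = off        # reusing old segments gives no benefit on CoW

    # ZFS side: match recordsize to PostgreSQL's 8 kB page, use cheap
    # compression, and skip atime updates.
    zfs set recordsize=8K tank/pg/data
    zfs set compression=lz4 tank/pg/data
    zfs set atime=off tank/pg/data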