On Wed, Jun 19, 2013 at 8:40 PM, Tatsuo Ishii <is...@postgresql.org> wrote:
>> On Wed, Jun 19, 2013 at 6:20 PM, Stephen Frost <sfr...@snowman.net> wrote:
>>> * Claudio Freire (klaussfre...@gmail.com) wrote:
>>>> I don't see how this is better than snapshotting at the filesystem
>>>> level. I have no experience with TB-scale databases (I've been limited
>>>> to only hundreds of GB), but from my limited mid-size DB experience,
>>>> filesystem snapshotting is pretty much the same thing you propose
>>>> there (xfs_freeze), and it works pretty well. There are even automated
>>>> tools to do that, like bacula, and they can handle incremental
>>>> snapshots.
>>>
>>> Large databases tend to have multiple filesystems, and getting a single,
>>> consistent snapshot across all of them while under load is..
>>> 'challenging'. It's fine if you use pg_start/stop_backup() and you're
>>> saving the XLOGs off, but if you can't do that..
>>
>> Good point there.
>>
>> I still don't like the idea of having to mark each modified page. The
>> WAL compressor idea sounds a lot more workable. As in scalable.
>
> Why do you think the WAL compressor idea is more scalable? I really want
> to know why. Besides the unlogged-tables issue, I can accept the idea
> if a WAL-based solution is much more efficient. If there's no perfect,
> ideal solution, we need to prioritize things. My #1 priority is being
> able to create an incremental backup of a TB-scale database, where the
> backup file is small enough and the time to create it is acceptable. I
> just don't know why scanning the WAL stream is much cheaper than
> recording modified-page information.
Because it aggregates updates.

When you work at the storage-manager level, you only see block-sized operations. That means WAL-logging bit-sized updates to some hypothetical dirty-map index. Even done 100% efficiently, it implies at least one write per dirtied block, which could as much as double write I/O in the worst (and entirely expectable) case.

When you do it at WAL-segment recycle time, or better yet during checkpoints, you deal with checkpoint-scale operations. You can aggregate dirty-map updates, if you keep a dirty map, which not only reduces I/O considerably (through a much increased likelihood of write coalescence) but also lets you schedule it better (toss it in the background, with checkpoints). That covers gathering dirty-map updates, but it still leaves you with the complex problem of actually snapshotting those pages consistently without interfering with ongoing transactions.

If you do a WAL compressor instead, WAL entries are write-once, so you'll have no trouble snapshotting those pages. You have the checkpoint's initial full-page write, so you don't even have to read the page: you can accumulate all further partial writes into one full-page write, and dump that into an "incremental archive". So you get all the I/O aggregation from above, which reduces I/O to the point where it only doubles WAL I/O. It's bounded by a constant, and in contrast to dirty-map updates, it's sequential I/O, so it's a lot faster. It's thus perfectly scalable.

Not only that, but you're also amortizing incremental-backup costs over time, since you're making them constantly as opposed to at regular intervals. You'll have one incremental backup per checkpoint. If you want to coalesce backups, you launch another compressor pass to merge the last incremental backup with the new one. And, now this is the cherry on top, you only have to do this on the archived WALs, which means you could very well do it on another system, freeing your main cluster from all this I/O.
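To make the compress-and-coalesce step concrete, here is a minimal sketch of the idea in Python. The record layout, field names, and the functions compress_wal/merge_incrementals are all hypothetical illustrations, not PostgreSQL's actual WAL record format or any existing tool; it assumes each checkpoint interval starts with a full-page write for every page it touches.

```python
# Hypothetical sketch of the WAL-compressor idea: fold a checkpoint's
# worth of WAL records into one full-page image per dirtied block.
# Not PostgreSQL's real WAL format; names and layout are assumptions.
from dataclasses import dataclass

PAGE_SIZE = 8192  # assumed page size for the sketch

@dataclass
class WalRecord:
    page_id: int      # which relation block this record touches
    full_page: bool   # True for the checkpoint's full-page image
    offset: int       # byte offset within the page (partial writes)
    data: bytes       # full page image, or partial-write payload

def compress_wal(records):
    """Accumulate partial writes onto full-page images, yielding the
    'incremental archive': one complete page per dirtied block."""
    pages = {}
    for rec in records:
        if rec.full_page:
            # Full-page write: no need to read the old page at all.
            pages[rec.page_id] = bytearray(rec.data)
        elif rec.page_id in pages:
            # Later partial write: overlay it onto the accumulated image.
            page = pages[rec.page_id]
            page[rec.offset:rec.offset + len(rec.data)] = rec.data
    return pages

def merge_incrementals(older, newer):
    """Coalesce two incremental archives; the newer page image wins."""
    merged = dict(older)
    merged.update(newer)
    return merged
```

The point of the sketch is that both steps are append/overwrite operations on write-once input, so they can run sequentially, per checkpoint, on a machine that only sees the archived WAL.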
The only bottleneck here is WAL archiving. This assumes you can afford WAL archiving, at least to a local filesystem, and that the WAL compressor can keep up with WAL bandwidth. But I have no reason to think you'd be able to cope with dirty-map updates anyway if you were saturating the WAL compressor, since the compressor has a lower amortized cost per transaction than the dirty-map approach.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers