> On Wed, Jun 19, 2013 at 8:40 PM, Tatsuo Ishii <is...@postgresql.org> wrote:
>>> On Wed, Jun 19, 2013 at 6:20 PM, Stephen Frost <sfr...@snowman.net> wrote:
>>>> * Claudio Freire (klaussfre...@gmail.com) wrote:
>>>>> I don't see how this is better than snapshotting at the filesystem
>>>>> level. I have no experience with TB-scale databases (I've been limited
>>>>> to only hundreds of GB), but from my limited mid-size db experience,
>>>>> filesystem snapshotting is pretty much the same thing you propose
>>>>> there (xfs_freeze), and it works pretty well. There are even automated
>>>>> tools to do that, like bacula, and they can handle incremental
>>>>> snapshots.
>>>>
>>>> Large databases tend to have multiple filesystems and getting a single,
>>>> consistent snapshot across all of them while under load is..
>>>> 'challenging'. It's fine if you use pg_start/stop_backup() and you're
>>>> saving the XLOGs off, but if you can't do that..
>>>
>>> Good point there.
>>>
>>> I still don't like the idea of having to mark each modified page. The
>>> WAL compressor idea sounds a lot more workable. As in scalable.
>>
>> Why do you think the WAL compressor idea is more scalable? I really
>> want to know why. Besides the unlogged tables issue, I can accept the
>> idea if a WAL-based solution is much more efficient. If there's no
>> perfect, ideal solution, we need to prioritize things. My #1 priority
>> is to allow creating an incremental backup of a TB-scale database,
>> where the backup file is small enough and the time to create it is
>> acceptable. I just don't know why scanning the WAL stream is much
>> cheaper than recording modified page information.
>
> Because it aggregates updates.
>
> When you work at the storage manager level, you only see block-sized
> operations. This results in the need to WAL-log bit-sized updates on
> some hypothetical dirty-map index. Even when done 100% efficiently,
> this implies at least one write per dirtied block, which could as much
> as double write I/O in the worst (and totally expectable) case.
>
> When you do it at WAL segment recycle time, or better yet during
> checkpoints, you deal with checkpoint-scale operations. You can
> aggregate dirty-map updates, if you keep a dirty map, which could not
> only reduce I/O considerably (by a much increased likelihood of write
> coalescence) but also let you schedule it better (toss it in the
> background, with checkpoints). That covers gathering dirty-map
> updates, which still leaves you with the complex problem of then
> actually snapshotting those pages consistently without interfering
> with ongoing transactions.
>
> If you do a WAL compressor, WAL entries are write-once, so you'll have
> no trouble snapshotting those pages. You have the checkpoint's initial
> full-page write, so you don't even have to read the page, and you can
> accumulate all further partial writes into one full-page write and
> dump that into an "incremental archive". So, you get all the I/O
> aggregation from above, which reduces the overhead to the point where
> it only doubles WAL I/O. It's bound by a constant, and in contrast to
> dirty-map updates, it's sequential I/O, so it's a lot faster. It's
> thus perfectly scalable.
>
> Not only that, but you're also amortizing incremental backup costs
> over time, since you're making them constantly as opposed to at
> regular intervals. You'll have one incremental backup per checkpoint.
> If you want to coalesce backups, you launch another compressor to
> merge the last incremental checkpoint with the new one.
> And, now this is the cherry on top, you only have to do this on the
> archived WALs, which means you could very well do it on another
> system, freeing your main cluster from all this I/O. It's thus
> perfectly scalable.
>
> The only bottleneck here is WAL archiving. This assumes you can
> afford WAL archiving, at least to a local filesystem, and that the WAL
> compressor is able to cope with the WAL bandwidth. But I have no
> reason to think you'd be able to cope with dirty-map updates anyway if
> you were saturating the WAL compressor, as the compressor has a lower
> amortized cost per transaction than the dirty-map approach.
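Just to check that I follow the aggregation argument, here is a rough
sketch of the folding step as I understand it. It is only an illustration
in Python with made-up record types and names (FullPageWrite, PartialWrite,
Checkpoint); it is not the real WAL record format or any PostgreSQL API,
and it assumes, as you describe, that the first touch of a page after a
checkpoint carries a full-page image:

# A toy "WAL compressor": fold a stream of WAL-like records into one full
# page image per touched page, and emit the set of images as an incremental
# archive at each checkpoint. Hypothetical record types and names; not the
# real PostgreSQL WAL format.
from dataclasses import dataclass
from typing import Dict, List, Tuple

PAGE_SIZE = 8192
PageKey = Tuple[str, int]            # (relation file, block number)

@dataclass
class FullPageWrite:                 # first touch of a page after a checkpoint
    page: PageKey
    image: bytes                     # PAGE_SIZE bytes

@dataclass
class PartialWrite:                  # a later intra-page update
    page: PageKey
    offset: int
    data: bytes

@dataclass
class Checkpoint:                    # checkpoint boundary in the stream
    pass

def compress_wal(records) -> List[Dict[PageKey, bytes]]:
    """Return one incremental archive (page -> full image) per checkpoint."""
    archives: List[Dict[PageKey, bytes]] = []
    pending: Dict[PageKey, bytearray] = {}
    for rec in records:
        if isinstance(rec, FullPageWrite):
            # Start from the full-page image; no need to read the data file.
            pending[rec.page] = bytearray(rec.image)
        elif isinstance(rec, PartialWrite):
            # Accumulate partial writes on top of the full-page image
            # (falls back to a zeroed page if no image was seen).
            buf = pending.setdefault(rec.page, bytearray(PAGE_SIZE))
            buf[rec.offset:rec.offset + len(rec.data)] = rec.data
        elif isinstance(rec, Checkpoint):
            # One incremental backup per checkpoint: dump and reset.
            archives.append({k: bytes(v) for k, v in pending.items()})
            pending.clear()
    if pending:                      # tail not yet closed by a checkpoint
        archives.append({k: bytes(v) for k, v in pending.items()})
    return archives

def coalesce(older: Dict[PageKey, bytes],
             newer: Dict[PageKey, bytes]) -> Dict[PageKey, bytes]:
    """Merge two incremental archives; the newer page image wins."""
    merged = dict(older)
    merged.update(newer)
    return merged

if __name__ == "__main__":
    stream = [
        FullPageWrite(("16384/1", 0), b"\x00" * PAGE_SIZE),
        PartialWrite(("16384/1", 0), 100, b"hello"),
        PartialWrite(("16384/1", 0), 100, b"HELLO"),    # coalesces in place
        Checkpoint(),
        FullPageWrite(("16384/1", 1), b"\xff" * PAGE_SIZE),
        Checkpoint(),
    ]
    first, second = compress_wal(stream)
    print(len(first), len(second), len(coalesce(first, second)))   # 1 1 2

If I read you correctly, each dirtied page then costs one page image per
checkpoint in the incremental archive, no matter how many partial writes
hit it, and coalescing two archives is just a merge where the newer image
wins.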
Thank you for the detailed explanation. I will think more about this.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp