Dan Eloff wrote:
At the lower levels in PG, reading from the disk into cache, and writing from the cache to the disk, is always done in pages. Why does PG work this way? Is it any slower to write whole pages rather than just the region of the page that changed? Conversely, is it faster? From what I think I know of operating systems, reading should bring the whole page into the OS buffers anyway, so reading the whole page instead of just part of it isn't much more expensive. Perhaps writing works similarly?
First, data fetched from the disk is (except for data in temporary tables, I believe) not stored in the private memory of the backend process doing the read(), but instead in a shared memory segment accessible by all backend processes. This allows two different backend processes to modify the data concurrently without stepping on each other's toes. Note that immediately writing back any changes is *not* an option, since WAL logging mandates that all changes go to the WAL *first*. Hence, if you were to write out each changed tuple immediately, you'd have to first write the changes to the WAL *and* fsync the WAL to guarantee they hit the disk first.

Sharing the data between backend processes requires a fair amount of infrastructure. You need a way to locate a given chunk of on-disk data in the shared memory buffer cache, and to be able to acquire and release locks on those buffers to prevent two backends from wreaking havoc when they try to update the same piece of information. Organizing data in fixed-size chunks (which is what pages are) helps keep the complexity of that infrastructure manageable, and the overhead reasonably low.

There are also things like tracking the free space in a data file, which also gets easier if you only have to track it page-wise (is there free space on this page or not?), instead of having to track arbitrary ranges of free space.

Finally, since data writes happen in units of blocks (and not bytes), you need to do your IO in some multiple of that unit anyway, otherwise you'd have a very hard time guaranteeing data consistency after a crash. Google for "torn page writes", that should give you more details about this problem.

Note, however, that a postgres page (usually 8K) is usually larger than the filesystem's block size (usually 512 bytes). So always reading in full pages does induce *some* IO overhead. Just not that much - especially since the blocks comprising a page are extremely likely to be arranged consecutively on disk, so there is no extra seeking involved.

These, at least, are what I believe to be the main reasons for doing things in units of pages - hope this helps at least somewhat.

best regards,
Florian Pflug
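To make the above a bit more concrete, here is a minimal toy sketch in plain C (not actual PostgreSQL code - names like ToyBufferTag, toy_read_buffer() and TOY_NBUFFERS are made up for illustration, and real PG uses a shared hash table, partitioned locks and a clock-sweep eviction strategy rather than a linear scan). It shows the two ideas discussed above: pages are located by (relation, block number) rather than by byte range, and a dirty page is only written back after the WAL covering its last change has been flushed.

/*
 * Toy page-oriented buffer cache - a sketch, not PostgreSQL source.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define TOY_PAGE_SIZE 8192      /* one fixed-size page, like PG's 8K blocks */
#define TOY_NBUFFERS  16        /* tiny "shared_buffers" for the example    */

/* A page is identified by (relation, block number), never by byte range. */
typedef struct ToyBufferTag {
    uint32_t rel_id;            /* which table/index file               */
    uint32_t block_num;         /* which 8K block within that file      */
} ToyBufferTag;

typedef struct ToyBuffer {
    ToyBufferTag tag;
    int      valid;             /* does this slot hold a page at all?   */
    int      dirty;             /* modified since it was read in?       */
    uint64_t page_lsn;          /* WAL position of the last change      */
    char     data[TOY_PAGE_SIZE];
} ToyBuffer;

static ToyBuffer buffers[TOY_NBUFFERS];   /* stands in for shared memory */
static uint64_t  wal_flushed_lsn = 0;     /* how far the WAL is on disk  */

/* Find (or fake-load) the whole page containing the requested block. */
static ToyBuffer *toy_read_buffer(uint32_t rel_id, uint32_t block_num)
{
    for (int i = 0; i < TOY_NBUFFERS; i++)
        if (buffers[i].valid &&
            buffers[i].tag.rel_id == rel_id &&
            buffers[i].tag.block_num == block_num)
            return &buffers[i];           /* cache hit: share the page   */

    for (int i = 0; i < TOY_NBUFFERS; i++)
        if (!buffers[i].valid)
        {
            /* Miss: a real system would read() the full 8K block here. */
            buffers[i].tag = (ToyBufferTag){ rel_id, block_num };
            buffers[i].valid = 1;
            buffers[i].dirty = 0;
            buffers[i].page_lsn = 0;
            memset(buffers[i].data, 0, TOY_PAGE_SIZE);
            return &buffers[i];
        }
    return NULL;                          /* cache full: eviction omitted */
}

/* WAL-before-data: never write a dirty page whose WAL isn't flushed yet. */
static void toy_flush_buffer(ToyBuffer *buf)
{
    if (!buf->dirty)
        return;
    if (buf->page_lsn > wal_flushed_lsn)
    {
        printf("must fsync WAL up to %llu first\n",
               (unsigned long long) buf->page_lsn);
        wal_flushed_lsn = buf->page_lsn;  /* pretend the WAL fsync happened */
    }
    /* A real system would now write() the whole 8K page, not a byte range. */
    buf->dirty = 0;
}

int main(void)
{
    ToyBuffer *buf = toy_read_buffer(1, 42);   /* block 42 of relation 1    */
    memcpy(buf->data, "hello", 5);             /* modify a few bytes...     */
    buf->dirty = 1;
    buf->page_lsn = 1000;                      /* ...after WAL-logging them */
    toy_flush_buffer(buf);                     /* still writes the full page */
    return 0;
}

Even in this stripped-down form you can see why fixed-size pages keep things simple: the lookup key, the per-buffer bookkeeping and the write-back unit are all the same 8K block, so none of the code has to reason about arbitrary byte ranges.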