Thanks Tomas

On Thu, Feb 22, 2024 at 3:05 AM Tomas Vondra <tomas.von...@enterprisedb.com> wrote:
> On 2/22/24 02:22, Siddharth Jain wrote:
> > Hi All,
> >
> > I understand the storage layer in databases goes to great lengths to
> > ensure:
> > - a row does not cross a block boundary
> > - read/writes/allocation happen in units of blocks
> > etc. The motivation is that at the OS level, it reads and writes pages
> > (blocks), not individual bytes. I am only concerned about SSDs but I
> > think the principle applies to HDD as well.
> >
> > but how can we do all this when we are not even guaranteed that the
> > beginning of a file will be aligned with a block boundary? refer to this
> > <https://stackoverflow.com/questions/8018449/is-it-guaranteed-that-the-beginning-of-a-file-is-aligned-with-pagesize-of-file-s>.
> >
> > Further, I don't see any APIs exposing I/O operations in terms of blocks.
> > All File I/O APIs I see expose a file as a randomly accessible contiguous
> > byte buffer. Would it not have been easier if there were APIs that
> > exposed I/O operations in terms of blocks?
> >
> > can someone explain this to me?
> >
>
> The short answer is that this is well outside our control. We do the
> best we can - split our data files to "our" 8kB pages - and hope that
> the OS / filesystem will do the right thing to map this to blocks at the
> storage level.
>
> The filesystems do the same thing, to some extent - they align stuff
> with respect to the beginning of the partition, but if the partition
> itself is not properly aligned, that won't really work.
>
> As for the APIs, we work with what we have in POSIX - I don't think
> there are any APIs working with blocks, and it's not clear to me how
> it would fundamentally differ from the APIs we have now. Moreover, it's
> not really clear which of the "blocks" would matter. The postgres 8kB
> page? The filesystem page? The storage block/sector size?
>
> FWIW I think for SSDs this matters way more than for HDD, because SSDs
> have to erase the space before a rewrite, which makes it much more
> expensive. But that's not just about the alignment, but about the page
> size (with smaller pages being better).
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
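
To make the point about the POSIX APIs concrete, here is a minimal sketch
(not PostgreSQL code) of how an application can layer fixed-size pages on
top of the byte-offset interface. It assumes a Unix-like system where
Python's os.pread/os.pwrite are available; the helper names write_page,
read_page and demo are hypothetical, and the 8kB page size is simply the
value mentioned above. Page N lives at byte offset N * PAGE_SIZE within the
file; whether that offset ends up aligned with a filesystem or device block
is still up to the layers below.

```python
import os
import tempfile

PAGE_SIZE = 8192  # the 8kB page size mentioned above; an assumption of this sketch


def write_page(fd, page_no, payload):
    """Write one zero-padded page at the byte offset page_no * PAGE_SIZE."""
    if len(payload) > PAGE_SIZE:
        raise ValueError("payload does not fit in one page")
    page = payload.ljust(PAGE_SIZE, b"\0")
    # A production implementation would loop on short writes; omitted here.
    os.pwrite(fd, page, page_no * PAGE_SIZE)


def read_page(fd, page_no):
    """Read one whole page back from its page-aligned file offset."""
    return os.pread(fd, PAGE_SIZE, page_no * PAGE_SIZE)


def demo():
    """Write two pages to a temporary file and report what the OS tells us."""
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()
        write_page(fd, 0, b"first")
        write_page(fd, 3, b"fourth")
        st = os.fstat(fd)
        return {
            "file_size": st.st_size,         # bytes, a multiple of PAGE_SIZE
            "fs_io_block": st.st_blksize,    # the filesystem's preferred I/O size
            "page3": read_page(fd, 3),
            "page1": read_page(fd, 1),       # never written, reads back as zeros
        }


if __name__ == "__main__":
    info = demo()
    print(info["file_size"], info["fs_io_block"], info["page3"][:6])
```

The "block" here is purely an application convention: the kernel only sees
offsets and lengths. st_blksize is the closest thing to a block size that
this interface reports, and it describes the filesystem's preferred I/O
size, not the SSD's erase block.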