On Sep 9, 2011, at 20:15, Nulik Nol wrote:
> this is not exactly a Postgresql question, but an input from hackers
> list like this would be invaluable for me.
> I am coding my own database engine, and I decided to do not implement
> transaction engine because it implies too much code.
> But to achieve the Durability of ACID I need a 100% reliable write to
> disk. By design no record in my DB will be larger than 512 bytes, so I
> am using the page size of 512 bytes,

Beware that there *are* disks with block sizes other than 512 bytes. For
example, at least for 2.5" disks, 4096 bytes/block is becoming quite
common these days.

> that matches the size of the disk
> block, so every write() I will execute with the following fdatasync()
> call will be 100% written, is that correct? It won't make a 300 byte
> write if I tell it to write 512 and the power goes off or will it?

Since error correction is done per block, it's very unlikely that you'd
see only 300 of the 512 bytes overwritten. The drive would detect
uncorrectable data corruption and report an error instead. Whether that
error is reported back to the application as an IO error or as a
zeroed-out block probably depends on the OS.

What you actually seem to want is a stronger all-or-nothing guarantee
which precludes the error case. AFAIK, most disk drives kind of do that,
because the various capacitors which stabilize the power supply usually
hold enough charge to complete a write once it's started, and because
they stop operating if the power drops below some threshold. But I doubt
that they provide any hard guarantees in this area; I guess it's more of
a best-effort thing.

To get hard guarantees, you'll need to use a RAID controller with a
battery-backed cache. Or use a journal/WAL as postgres (and most
filesystems) do, and protect journal/WAL entries with a checksum to
detect partially written entries.

> I am going to use the whole partition device for the DB (like /dev/sda1)
> , so no filesystem code will be used. Also I am using asynchronous IO
> (the aio_read and aio_write) and I don't know if they can be combined
> with the fdatasync() syscall?

Someone else (maybe the POSIX spec?) must answer that, as I know very
little about asynchronous IO.

best regards,
Florian Pflug

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
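
To illustrate the suggestion above of protecting journal/WAL entries with a checksum, here is a minimal sketch in Python (used for brevity; the same layout carries over to C). Everything in it is an assumption made for illustration: the record layout (CRC-32, then a 2-byte length, then the payload, zero-padded to 512 bytes), the names `encode_record`, `decode_record`, `write_record` and `read_record`, and the choice of CRC-32 from `zlib`. It is not postgres's actual WAL format.

```python
import os
import struct
import zlib
from typing import Optional

RECORD_SIZE = 512                    # assumed fixed on-disk record size
HEADER = struct.Struct("<IH")        # CRC-32 (4 bytes) + payload length (2 bytes)
MAX_PAYLOAD = RECORD_SIZE - HEADER.size


def encode_record(payload: bytes) -> bytes:
    """Build one 512-byte record: CRC-32 | length | payload | zero padding."""
    if len(payload) > MAX_PAYLOAD:
        raise ValueError("payload too large for one record")
    body = struct.pack("<H", len(payload)) + payload
    # The checksum covers the length field and the payload.
    record = struct.pack("<I", zlib.crc32(body)) + body
    return record.ljust(RECORD_SIZE, b"\0")


def decode_record(block: bytes) -> Optional[bytes]:
    """Return the payload, or None if the block is short, torn or corrupt."""
    if len(block) != RECORD_SIZE:
        return None
    crc, length = HEADER.unpack_from(block)
    if length > MAX_PAYLOAD:
        return None
    body = block[4:HEADER.size + length]      # length field + payload
    if zlib.crc32(body) != crc:
        return None
    return block[HEADER.size:HEADER.size + length]


def write_record(fd: int, slot: int, payload: bytes) -> None:
    """Write one record at the given slot and flush it to stable storage."""
    written = os.pwrite(fd, encode_record(payload), slot * RECORD_SIZE)
    if written != RECORD_SIZE:
        raise OSError("short write")
    # fdatasync() is missing on some platforms; fall back to fsync() there.
    getattr(os, "fdatasync", os.fsync)(fd)


def read_record(fd: int, slot: int) -> Optional[bytes]:
    """Read one slot; None means the slot holds no valid record."""
    return decode_record(os.pread(fd, RECORD_SIZE, slot * RECORD_SIZE))
```

With this layout, a block whose first 300 bytes come from a new record and whose remainder comes from an old one fails the checksum, as does a zeroed-out block, so recovery code can treat such a slot as "not written". Note that the checksum only detects a partial write. It does not prevent one, which is why journal entries are normally appended rather than written over live data.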