On Thu, Apr 30, 2020 at 12:26 PM Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
> Yeah, I think the question is what are the expected benefits of using
> raw devices. It might be an interesting exercise / experiment, but my
> understanding is that most of the benefits can be achieved by using file
> systems but with direct I/O and async I/O, which would allow us to
> continue reusing the existing filesystem code with much less disruption
> to our code base.

Agreed. I've often wondered if the RDBMSs that supported raw devices
did so *because* there was no other way to get unbuffered I/O on some
systems at the time (for example it looks like Solaris didn't have
direct I/O until 2.6 in 1997?). Last I heard, raw devices weren't
recommended anymore on the system I'm thinking of, because they're more
painful to manage than regular filesystems and there's little to no
gain.

Back in ancient times, before BSD4.2 introduced it in 1983, there was
apparently no fsync() system call on any strain of Unix, so I guess
database reliability must have been an uphill battle on early Unix
buffered I/O (I wonder if the Ingres/Postgres people asked them to add
that?!). It must have been very appealing to sidestep the whole thing,
for multiple reasons.

One key thing to note is that the well known RDBMSs that can use raw
devices also deal with regular filesystems by creating one or more
large data files, and then managing the space inside those to hold all
their tables and indexes. That is, they already have their own system
to manage separate database objects, allocate space etc., and they
don't have to do any regular filesystem meta-data manipulation during
transactions (which has all kinds of problems).

That means they already have the complicated code that you need to do
that, but we don't: we have one (or more) file per table or index, so
our database relies on the filesystem as a kind of lower-level database
of relfilenode->blocks. Replacing that is probably the main work
required to make raw devices work, and it might be a valuable thing to
have independently of whether you stick it on a raw device, a big data
file, NV RAM or some other kind of storage system -- but it's a really
difficult project.
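
To make the relfilenode->blocks idea a bit more concrete, here is a toy
sketch in Python. It is not PostgreSQL code, and every name in it
(BigFileStore, block_map, the 8192-byte block size) is made up for
illustration. It only shows the shape of the problem: many relations
share one preallocated file, and the database itself tracks which
physical blocks belong to which relation. The map lives only in memory
here; making it durable and crash-safe is the difficult part, and this
sketch doesn't attempt it.

```python
import os

BLOCK_SIZE = 8192  # assumed block size for this sketch


class BigFileStore:
    """Toy store: many relations share one preallocated data file."""

    def __init__(self, path, nblocks):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        # Reserve all the space up front, so later allocations never
        # touch filesystem metadata.
        os.ftruncate(self.fd, nblocks * BLOCK_SIZE)
        # Free list kept in descending order so pop() hands out the
        # lowest-numbered free physical block first.
        self.free = list(range(nblocks - 1, -1, -1))
        # relfilenode -> list of physical block numbers; the list index
        # is the relation's logical block number.
        self.block_map = {}

    def extend(self, relfilenode):
        """Add one block to a relation; return its logical block number."""
        if not self.free:
            raise RuntimeError("data file is full")
        blocks = self.block_map.setdefault(relfilenode, [])
        blocks.append(self.free.pop())
        return len(blocks) - 1

    def write_block(self, relfilenode, blkno, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block")
        phys = self.block_map[relfilenode][blkno]
        os.pwrite(self.fd, data, phys * BLOCK_SIZE)

    def read_block(self, relfilenode, blkno):
        phys = self.block_map[relfilenode][blkno]
        return os.pread(self.fd, BLOCK_SIZE, phys * BLOCK_SIZE)

    def drop(self, relfilenode):
        """Drop a relation by returning its blocks to the free list.

        Only our own map changes; there is no unlink() or truncate().
        """
        self.free.extend(self.block_map.pop(relfilenode, []))

    def close(self):
        os.close(self.fd)
```

Note that extend() and drop() never create, resize or remove a file,
which is the property described above: no filesystem meta-data
manipulation on those paths. The same map could just as well sit on top
of a raw device or some other storage, since nothing here depends on
what is underneath the file descriptor.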