Re: Large block sizes support in Linux

2024-03-25 Thread Pankaj Raghav
Hi Tomas and Bruce,

>>> My knowledge of Postgres internals is limited, so I'm wondering if there
>>> are any optimizations or potential optimizations that Postgres could
>>> leverage once we have LBS support on Linux?
>>
>> We have discussed this in the past, and in fact in the early years we
>> thought we didn't need fsync since the BSD file system was 8k at the
>> time.
>>
>> What we later realized is that we have no guarantee that the file system
>> will write to the device in the specified block size, and even if it
>> does, the I/O layers between the OS and the device might not, since many
>> devices use 512-byte blocks or other sizes.
>>
> 
> Right, but things change over time - current storage devices support
> much larger sectors (LBA format), usually 4K. And if you do I/O with
> this size, it's usually atomic.
> 
> AFAIK if you built Postgres with 4K pages, on a device with 4K LBA
> format, that would not need full-page writes - we always do I/O in 4k
> pages, and block layer does I/O (during writeback from page cache) with
> minimum guaranteed size = logical block size. 4K are great for OLTP
> systems in general, it'd be even better if we didn't need to worry about
> torn pages (but the tricky part is to be confident it's safe to disable
> them on a particular system).
> 
> I did watch the talk linked by Pankaj, and IIUC the promise of the LBS
> patches is that this benefit would extend even to larger page sizes
> (= fs page size). Which right now you can't even mount, but
> the patches allow that. So for example it would be possible to create an
> XFS filesystem with 8kB pages, and then we'd read/write 8kB pages as
> usual, and we'd know that the page cache always writes out either the
> whole page or none of it. Which right now is not guaranteed to happen,
> it's possible to e.g. write the page as two 4K requests, even if all
> other things are set properly (drive has 4K logical/physical sectors).
> 
> At least that's my understanding ...
> 
> Pankaj, could you clarify what the guarantees provided by LBS are going
> to be? The talk uses wording like "should be" and "hint" in a couple of
> places, and there's also stuff I'm not 100% familiar with.
> 
> If we create a filesystem with 8K blocks, and we only ever do writes
> (and reads) in 8K chunks (our default page size), what guarantees that
> gives us? What if the underlying device has LBA format with only 4K (or
> perhaps even just 512B), how would that affect the guarantees?
> 

Yes, the whole FS block is managed as one unit (also on a physically contiguous
page), so we send the whole FS block while performing writeback. This is not
guaranteed when the FS block size is 4k and the DB page size is 8k, as it might
be sent as two different requests, as you have indicated.

The LBA format will not affect the guarantee of sending the whole FS block
without splitting, as long as the FS block size is less than the maximum IO
transfer size*.
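To make the splitting behaviour concrete, here is a small illustrative sketch
(mine, not from the kernel code) of how a DB page turns into one or several
writeback requests depending on the FS block size:

```python
def writeback_requests(db_page_size: int, fs_block_size: int) -> list[int]:
    # Simplified model of the discussion above: the page cache writes back
    # one FS block at a time, so a DB page spanning several FS blocks can
    # reach the device as several independent requests -- the torn-write
    # window.
    n, rem = divmod(db_page_size, fs_block_size)
    assert rem == 0, "DB page size must be a multiple of the FS block size"
    return [fs_block_size] * n

# 8 KiB page on a 4 KiB-block filesystem: two requests, so a power failure
# between them can tear the page.
print(writeback_requests(8192, 4096))  # [4096, 4096]

# With LBS (8 KiB FS blocks) the whole page goes out as a single request.
print(writeback_requests(8192, 8192))  # [8192]
```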

But another issue now is that even though the host has done its job, the device
might have a smaller atomic guarantee, so the write is still not power-fail safe.

> The other thing is - is there a reliable way to say when the guarantees
> actually apply? I mean, how would the administrator *know* it's safe to
> set full_page_writes=off, or even better how could we verify this when
> the database starts (and complain if it's not safe to disable FPW)?
> 

This is an excellent question that needs a bit of community discussion to
expose a device-agnostic value that userspace can trust.

There might be a talk this year at LSFMM about untorn writes[1] in buffered IO
path. I will make sure to bring this question up.

At the moment, Linux exposes the physical block size by also taking atomic
guarantees into account; for NVMe in particular it uses NAWUPF and AWUPF when
setting the physical block size (/sys/block/<dev>/queue/physical_block_size).

A system admin could use the value exposed in physical_block_size as a hint
for setting full_page_writes=off. Of course, this also requires the device to
give atomic guarantees.

The optimal configuration would be DB page size == FS block size == device atomic size.
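As a rough sketch of that hint, here is a small illustrative helper (my own
heuristic, not an official check; "nvme0n1" and the exact safety condition are
assumptions) that reads the sysfs attribute mentioned above and decides whether
the sizes line up:

```python
from pathlib import Path

def queue_attr(dev: str, attr: str) -> int:
    """Read a block-queue attribute (e.g. physical_block_size) from sysfs."""
    return int(Path(f"/sys/block/{dev}/queue/{attr}").read_text())

def fpw_can_be_disabled(db_page: int, fs_block: int, phys_block: int) -> bool:
    # Heuristic per the discussion: the DB page must map onto exactly one
    # FS block, and the device's atomicity-aware physical block size must
    # cover the whole page.
    return db_page == fs_block and phys_block >= db_page

# Hypothetical usage ("nvme0n1" is a placeholder device name):
# ok = fpw_can_be_disabled(8192, 8192,
#                          queue_attr("nvme0n1", "physical_block_size"))
```

Even when this returns True it is only a hint, for the reasons above: the
device must actually honour the atomic guarantee on power failure.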

> It's easy to e.g. take a backup on one filesystem and restore it on
> another one, and forget those may have different block sizes etc. I'm
> not sure it's possible in a 100% reliable way (tablespaces?).
> 
> 
> regards
> 

[1] https://lore.kernel.org/linux-fsdevel/20240228061257.ga106...@mit.edu/

* A small caveat: I am most familiar with NVMe, so my answers might be biased
by my experience with NVMe.




Re: Large block sizes support in Linux

2024-03-25 Thread Pankaj Raghav
Hi Thomas,

On 23/03/2024 05:53, Thomas Munro wrote:
> On Fri, Mar 22, 2024 at 10:56 PM Pankaj Raghav (Samsung)
>  wrote:
>> My team and I have been working on adding Large block size(LBS)
>> support to XFS in Linux[1]. Once this feature lands upstream, we will be
>> able to create XFS with FS block size > page size of the system on Linux.
>> We also gave a talk about it in Linux Plumbers conference recently[2]
>> for more context. The initial support is only for XFS but more FSs will
>> follow later.
> 
> Very cool!
> 
> (I used XFS on IRIX in the 90s, and it had large blocks then, a
> feature lost in the port to Linux AFAIK.)
> 

Yes, I also heard from the XFS maintainer that they had to drop
this functionality when they did the port. :)

>> On an x86_64 system, fs block size was limited to 4k, but traditionally
>> Postgres uses 8k as its default internal page size. With LBS support,
>> fs block size can be set to 8K, thereby matching the Postgres page size.
>>
>> If the file system block size == DB page size, then Postgres can have
>> guarantees that a single DB page will be written as a single unit during
>> kernel write back and not split.
>>
>> My knowledge of Postgres internals is limited, so I'm wondering if there
>> are any optimizations or potential optimizations that Postgres could
>> leverage once we have LBS support on Linux?
> 
> FWIW here are a couple of things I wrote about our storage atomicity
> problem, for non-PostgreSQL hackers who may not understand our project
> jargon:
> 
> https://wiki.postgresql.org/wiki/Full_page_writes
> https://freebsdfoundation.org/wp-content/uploads/2023/02/munro_ZFS.pdf
> 
This is very useful, thanks a lot.

> The short version is that we (and MySQL, via a different scheme with
> different tradeoffs) could avoid writing all our stuff out twice if we
> could count on atomic writes of a suitable size on power failure, so
> the benefits are very large.  As far as I know, there are two things
> we need from the kernel and storage to do that on "overwrite"
> filesystems like XFS:
> 
> 1.  The disk must promise that its atomicity-on-power-failure is a
> multiple of our block size -- something like NVMe AWUPF, right?  My
> devices seem to say 0 :-(  Or I guess the filesystem has to
> compensate, but then it's not exactly an overwrite filesystem
> anymore...
> 

0 means 1 logical block (the field is zero-based), which might be 4k in your
case. Typically device vendors have to add extra hardware to guarantee bigger
atomic block sizes.
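Since the zero-based encoding trips people up, here is a tiny sketch of how
the NVMe AWUPF/NAWUPF value translates into an atomic byte count:

```python
def awupf_atomic_bytes(awupf: int, lba_size: int) -> int:
    # NVMe's AWUPF/NAWUPF fields are 0-based counts of logical blocks, so a
    # reported value of 0 still means one LBA is written atomically on power
    # failure.
    return (awupf + 1) * lba_size

print(awupf_atomic_bytes(0, 4096))  # 4096: one 4 KiB logical block
print(awupf_atomic_bytes(1, 4096))  # 8192: enough for an 8 KiB DB page
```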

> 2.  The kernel must promise that there is no code path in either
> buffered I/O or direct I/O that will arbitrarily chop up our 8KB (or
> other configured block size) writes on some smaller boundary, most
> likely sector I guess, on their way to the device, as you were saying.
> Not just in happy cases, but even under memory pressure, if
> interrupted, etc etc.
> 
> Sounds like you're working on problem #2 which is great news.
> 

Yes, you are spot on. :)

> I've been wondering for a while how a Unixoid kernel should report
> these properties to userspace where it knows them, especially on
> non-overwrite filesystems like ZFS where this sort of thing works

So it looks like ZFS (or any other CoW filesystem that supports larger
block sizes) is doing what Postgres would do anyway with FPW=on, making
it safe to turn off FPW.

One question: does ZFS do something like a FUA request to force the device
to flush its cache before it updates the metadata to point to the new page?

If it doesn't, there is no guarantee that the device updates the data
atomically, unless it has bigger atomic guarantees?

> already, without stuff like AWUPF working the way one might hope.
> Here was one throw-away idea on the back of a napkin about that, for
> what little it's worth:
> https://wiki.postgresql.org/wiki/FreeBSD/AtomicIO

As I replied in the previous mail to Tomas, we might be having a talk
about Untorn writes[1] in LSFMM this year. I hope to bring up some of the
discussions from here. Thanks!

[1] https://lore.kernel.org/linux-fsdevel/20240228061257.ga106...@mit.edu/




Re: Large block sizes support in Linux

2024-03-25 Thread Pankaj Raghav
On 23/03/2024 03:41, Bruce Momjian wrote:
> On Fri, Mar 22, 2024 at 10:31:11PM +0100, Tomas Vondra wrote:
>> Right, but things change over time - current storage devices support
>> much larger sectors (LBA format), usually 4K. And if you do I/O with
>> this size, it's usually atomic.
>>
>> AFAIK if you built Postgres with 4K pages, on a device with 4K LBA
>> format, that would not need full-page writes - we always do I/O in 4k
>> pages, and block layer does I/O (during writeback from page cache) with
>> minimum guaranteed size = logical block size. 4K are great for OLTP
>> systems in general, it'd be even better if we didn't need to worry about
>> torn pages (but the tricky part is to be confident it's safe to disable
>> them on a particular system).
> 
> Yes, even if the file system is 8k, and the storage is 8k, we only know
> that torn pages are impossible if the file system never overwrites
> existing 8k pages, but writes new ones and then makes it active.  I
> think ZFS does that to handle snapshots.
> 

I think we can also avoid torn writes:
- if the filesystem's data path always writes in multiples of 8k (with alignment),
- and the device supports 8k atomic writes.

Then we might be able to push the responsibility to the device without the
overhead of a CoW FS or FPW=on. Of course, the performance here depends on the
vendor-specific implementation of atomics.

We are trying to enable the former by adding LBS support to XFS in Linux.

--
Pankaj




Large block sizes support in Linux

2024-03-22 Thread Pankaj Raghav (Samsung)
Hello, 

My team and I have been working on adding Large block size(LBS)
support to XFS in Linux[1]. Once this feature lands upstream, we will be
able to create XFS with FS block size > page size of the system on Linux.
We also gave a talk about it in Linux Plumbers conference recently[2]
for more context. The initial support is only for XFS but more FSs will
follow later.

On an x86_64 system, fs block size was limited to 4k, but traditionally
Postgres uses 8k as its default internal page size. With LBS support,
fs block size can be set to 8K, thereby matching the Postgres page size.

If the file system block size == DB page size, then Postgres can have
guarantees that a single DB page will be written as a single unit during
kernel write back and not split.
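For the curious, once the patches land, creating such a filesystem would look
something like the following sketch (standard mkfs.xfs block-size syntax; the
device path and mount point are placeholders, and this requires a kernel with
the LBS patches applied):

```shell
# Hypothetical: /dev/nvme0n1 and /mnt/pgdata are placeholders.
mkfs.xfs -b size=8192 /dev/nvme0n1
mount /dev/nvme0n1 /mnt/pgdata
xfs_info /mnt/pgdata    # look for "bsize=8192" in the data section
```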

My knowledge of Postgres internals is limited, so I'm wondering if there
are any optimizations or potential optimizations that Postgres could
leverage once we have LBS support on Linux?


[1] 
https://lore.kernel.org/linux-xfs/20240313170253.2324812-1-ker...@pankajraghav.com/
[2] https://www.youtube.com/watch?v=ar72r5Xf7x4
-- 
Pankaj Raghav