On 01/26/2018 02:56 PM, Bruce Momjian wrote:
> On Wed, Jan 17, 2018 at 02:10:10PM +0100, Fabien COELHO wrote:
>>
>> Hello,
>>
>>> What are the cons of setting BLCKSZ as 4kB? When saw the results published
>>> on [...].
>>
>> There were other posts and publications which points to the same direction
>> consistently.
>>
>> This matches my deep belief is that postgres default block size is a
>> reasonable compromise for HDD, but is less pertinent for SSD for most OLTP
>> loads.
>>
>> For OLAP, I do not think it would lose much, but I have not tested it.
>>
>>> Does turning off FPWs will be safe if BLCKSZ is set to 4kB given page size
>>> of file system is 4kB?
>>
>> FPW = Full Page Write. I would not bet on turning off FPW, ISTM
>> that SSDs can have "page" sizes as low as 512 bytes, but are
>> typically 2 kB or 4 kB, and the information easily available
>> anyway.
>
Is this referring to the sector size or to the internal SSD page size?
AFAIK there are only 512B and 4096B sectors, so I assume you must be
talking about the latter. I don't think I've ever heard of an SSD
with 512B pages though (generally the page sizes are 2kB to 16kB).

But more importantly, I don't see why the size of the internal page
would matter here at all. SSDs have a non-volatile write cache (DRAM with
battery), protecting all the internal writes to pages. If your SSD does
not do that correctly, it's already broken no matter what page size it
uses, even with full_page_writes=on.

On spinning rust the caches would be disabled and replaced by a write cache
on a RAID controller with battery, but that's not possible on SSDs where
the on-disk cache is baked into the whole design.

What I think does matter here is the sector size (i.e. either 512B or
4096B) used to communicate with the disk. Obviously, if the kernel
writes a 4kB page as a series of independent 512B writes, that would be
unreliable. If it sends one 4kB write, why wouldn't that work?

> Yes, that is the hard part, making sure you have 4k granularity of
> write, and matching write alignment. pg_test_fsync and diskchecker.pl
> (which we mention in our docs) will not help here. A specific
> alignment test based on diskchecker.pl would have to be written.
> However, if you look at the kernel code you might be able to verify
> quickly that the 4k atomicity is not guaranteed.
>

Are you suggesting there's a part of the kernel code clearly showing
it's not atomic? Can you point us to that part of the kernel sources?

FWIW even if it's not safe in general, it would be useful to understand
what the requirements are to make it work. I mean, the conditions that need
to be met on the various levels (sector size of the storage device, page
size of the file system, filesystem alignment, ...).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
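
To make the sector-size argument above concrete, here is a minimal sketch
(not part of the original discussion) of how one might check how many device
sectors a single page write would span. The function names are hypothetical.
The sysfs paths assume Linux, where /sys/block/<dev>/queue/ exposes
logical_block_size and physical_block_size. The sketch only does the
alignment arithmetic; it does not show that the kernel or the device
actually writes a 4kB page atomically.

```python
from pathlib import Path


def read_block_sizes(device: str, sysfs_root: str = "/sys/block") -> dict:
    """Return the logical and physical block sizes (in bytes) that Linux
    reports for a block device such as 'sda'. Assumes a Linux sysfs layout."""
    queue = Path(sysfs_root) / device / "queue"
    return {
        "logical": int((queue / "logical_block_size").read_text()),
        "physical": int((queue / "physical_block_size").read_text()),
    }


def sectors_touched(offset: int, length: int, sector_size: int) -> int:
    """Number of device sectors covered by a write of `length` bytes
    starting at byte `offset` (hypothetical helper for illustration)."""
    if length <= 0 or sector_size <= 0:
        raise ValueError("length and sector_size must be positive")
    first = offset // sector_size
    last = (offset + length - 1) // sector_size
    return last - first + 1


def is_single_aligned_sector(offset: int, length: int, sector_size: int) -> bool:
    """True only if the write starts on a sector boundary and covers
    exactly one sector, i.e. the case argued above to be the safe one."""
    return offset % sector_size == 0 and length == sector_size
```

For example, a 4096-byte page at offset 0 covers one sector on a 4096B-sector
device but eight sectors on a 512B-sector device, and a 4096-byte write that
starts at offset 512 straddles two 4096B sectors. That last case is the
filesystem (or partition) alignment condition mentioned above.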