Hi, I've pushed all three parts of v29, with some additional corrections (picked lower OIDs, bumped catversion, fixed commit messages).
On 4/7/25 23:01, Jakub Wartak wrote:
> On Mon, Apr 7, 2025 at 9:51 PM Tomas Vondra <to...@vondra.me> wrote:
>
>>> So it looks like that the new way to iterate on the buffers that has been
>>> introduced
>>> in v26/v27 has some issue?
>>>
>>
>> Yeah, the calculations of the end pointers were wrong - we need to round
>> up (using TYPEALIGN()) when calculating number of pages, and just add
>> BLCKSZ (without any rounding) when calculating end of buffer. The 0004
>> fixes this for me (I tried this with various blocksizes / page sizes).
>>
>> Thanks for noticing this!
>
> Hi,
>
> v28-0001 LGTM
> v28-0002 got this warning Andres was talking about, so LGTM
> v28-0003 (pg_buffercache_numa now), LGTM, but I *thought* for quite
> some time we have 2nd bug there, but it appears that PG never properly
> aligned whole s_b to os_page_size(HP)? ... Thus we cannot assume
> count(*) pg_buffercache_numa == count(*) pg_buffercache.
>

AFAIK v29 fixed this, the end pointer calculations were wrong. With that
it passed for me with/without THP, different block sizes etc.

We don't align buffers to os_page_size, we align them to
PG_IO_ALIGN_SIZE, which is 4kB or so. And it's determined at compile
time, while THP is determined when starting the cluster.

> So before anybody else reports this as bug about duplicate bufferids:
>
> # select * from pg_buffercache_numa where os_page_num <= 2;
>  bufferid | os_page_num | numa_node
> ----------+-------------+-----------
> [..]
>       195 |           0 |         0
>       196 |           0 |         0 <-- duplicate?
>       196 |           1 |         0 <-- duplicate?
>       197 |           1 |         0
>       198 |           1 |         0
>
> That is strange because on first look one could assume we get 257x
> 8192 blocks per os_page (2^21) that way, which is impossible.
> Exercises in pointers show this:
>
>> # select * from pg_buffercache_numa where os_page_num <= 2;
>
> DEBUG:  NUMA: NBuffers=16384 os_page_count=65 os_page_size=2097152
> DEBUG:  NUMA: page-faulting the buffercache for proper NUMA readouts
> -- custom elog(DEBUG1)
> DEBUG:  ptr=0x7f8661000000 startptr_buff=0x7f8661000000
> endptr_buff=0x7f866107b000 bufferid=1 page_num=0 real
> buffptr=0x7f8661079000
> [..]
> DEBUG:  ptr=0x7f8661000000 startptr_buff=0x7f8661000000
> endptr_buff=0x7f86611fd000 bufferid=194 page_num=0 real
> buffptr=0x7f86611fb000
> DEBUG:  ptr=0x7f8661000000 startptr_buff=0x7f8661000000
> endptr_buff=0x7f86611ff000 bufferid=195 page_num=0 real
> buffptr=0x7f86611fd000
> DEBUG:  ptr=0x7f8661000000 startptr_buff=0x7f8661000000
> endptr_buff=0x7f8661201000 bufferid=196 page_num=0 real
> buffptr=0x7f86611ff000 (!)
> DEBUG:  ptr=0x7f8661200000 startptr_buff=0x7f8661000000
> endptr_buff=0x7f8661201000 bufferid=196 page_num=1 real
> buffptr=0x7f86611ff000 (!)
> DEBUG:  ptr=0x7f8661200000 startptr_buff=0x7f8661200000
> endptr_buff=0x7f8661203000 bufferid=197 page_num=1 real
> buffptr=0x7f8661201000
> DEBUG:  ptr=0x7f8661200000 startptr_buff=0x7f8661200000
> endptr_buff=0x7f8661205000 bufferid=198 page_num=1 real
> buffptr=0x7f8661203000
>
> so we have NBuffer=196 with bufferptr=0x7f86611ff000 that is 8kB big
> (and ends up at 0x7f8661201000), while we also have HP that hosts it
> between 0x7f8661000000 and 0x7f8661200000. So Buffer 196 spans 2
> hugepages. Open question for another day is shouldn't (of course
> outside of this $thread) align s_b to HP size or not? As per above
> even bufferid=1 has 0x7f8661079000 while page starts on 0x7f8661000000
> (that's 495616 bytes difference).
>

Right, this is because that's where the THP boundary happens to be. And
that one "duplicate" entry is for a buffer that happens to span two
pages.
This is *exactly* the misalignment of blocks and pages that I was
wondering about earlier, and with the fixed endptr calculation we handle
that just fine.

No opinion on the alignment - maybe we should do that, but it's not
something this patch needs to worry about.

regards

-- 
Tomas Vondra