Hi,

I've pushed all three parts of v29, with some additional corrections
(picked lower OIDs, bumped catversion, fixed commit messages).

On 4/7/25 23:01, Jakub Wartak wrote:
> On Mon, Apr 7, 2025 at 9:51 PM Tomas Vondra <to...@vondra.me> wrote:
> 
>>> So it looks like the new way to iterate over the buffers, introduced
>>> in v26/v27, has some issue?
>>>
>>
>> Yeah, the calculations of the end pointers were wrong - we need to round
>> up (using TYPEALIGN()) when calculating the number of pages, and just add
>> BLCKSZ (without any rounding) when calculating the end of a buffer. The
>> 0004 fixes this for me (I tried this with various block sizes / page sizes).
>>
>> Thanks for noticing this!
> 
> Hi,
> 
> v28-0001 LGTM
> v28-0002 got the warning Andres was talking about, so LGTM
> v28-0003 (pg_buffercache_numa now) LGTM, but for quite some time I
> *thought* we had a 2nd bug there, until it turned out that PG never
> properly aligned the whole s_b to os_page_size (HP)? ... Thus we cannot
> assume count(*) pg_buffercache_numa == count(*) pg_buffercache.
> 

AFAIK v29 fixed this - the end pointer calculations were wrong. With that
fixed it passed for me with/without THP, with different block sizes, etc.
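To make the rounding explicit, here's a tiny standalone sketch (not the
actual patch code - the TYPEALIGN-style rounding is re-implemented locally
as ALIGN_DOWN/ALIGN_UP, and the addresses/sizes are taken from the DEBUG
output quoted further down):

#include <stdio.h>
#include <stdint.h>

#define BLCKSZ              8192
#define ALIGN_DOWN(p, a)    ((p) & ~((uintptr_t) (a) - 1))
#define ALIGN_UP(p, a)      ALIGN_DOWN((p) + (a) - 1, (a))

int
main(void)
{
    size_t      os_page_size = 2 * 1024 * 1024; /* 2MB huge pages */
    int         nbuffers = 16384;

    /* first buffer block - aligned to PG_IO_ALIGN_SIZE, not to the OS page */
    uintptr_t   blocks_start = 0x7f8661079000UL;
    uintptr_t   blocks_end = blocks_start + (uintptr_t) nbuffers * BLCKSZ;

    /* number of OS pages: round the start down and the end up */
    uintptr_t   startptr = ALIGN_DOWN(blocks_start, os_page_size);
    uintptr_t   endptr = ALIGN_UP(blocks_end, os_page_size);

    printf("os_page_count = %zu\n",
           (size_t) ((endptr - startptr) / os_page_size));     /* 65 */

    /* end of a single buffer: just start + BLCKSZ, no rounding */
    uintptr_t   buf196 = 0x7f86611ff000UL;
    printf("buffer 196 spans OS pages %zu and %zu\n",
           (size_t) ((buf196 - startptr) / os_page_size),
           (size_t) ((buf196 + BLCKSZ - 1 - startptr) / os_page_size)); /* 0 and 1 */

    return 0;
}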

We don't align buffers to os_page_size, we align them to PG_IO_ALIGN_SIZE,
which is 4kB. And that's determined at compile time, while the THP page
size is only known when the cluster starts.
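For illustration, using the addresses from the DEBUG output below:
bufferid=1 sits at 0x7f8661079000, which is a multiple of 4kB (so
PG_IO_ALIGN_SIZE-aligned) but 0x79000 = 495616 bytes past the 2MB huge
page boundary at 0x7f8661000000 - exactly the offset mentioned further
down.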

> So before anybody else reports this as a bug about duplicate bufferids:
> 
> # select * from pg_buffercache_numa where os_page_num <= 2;
>  bufferid | os_page_num | numa_node
> ----------+-------------+-----------
> [..]
>       195 |           0 |         0
>       196 |           0 |         0 <-- duplicate?
>       196 |           1 |         0 <-- duplicate?
>       197 |           1 |         0
>       198 |           1 |         0
> 
> That is strange, because at first glance one could assume we get 257 x
> 8192-byte blocks per os_page (2^21) that way, which is impossible.
> Exercises in pointers show this:
>> # select * from pg_buffercache_numa where os_page_num <= 2;
> DEBUG:  NUMA: NBuffers=16384 os_page_count=65 os_page_size=2097152
> DEBUG:  NUMA: page-faulting the buffercache for proper NUMA readouts
> -- custom elog(DEBUG1)
> DEBUG:  ptr=0x7f8661000000 startptr_buff=0x7f8661000000
> endptr_buff=0x7f866107b000 bufferid=1 page_num=0 real
> buffptr=0x7f8661079000
> [..]
> DEBUG:  ptr=0x7f8661000000 startptr_buff=0x7f8661000000
> endptr_buff=0x7f86611fd000 bufferid=194 page_num=0 real
> buffptr=0x7f86611fb000
> DEBUG:  ptr=0x7f8661000000 startptr_buff=0x7f8661000000
> endptr_buff=0x7f86611ff000 bufferid=195 page_num=0 real
> buffptr=0x7f86611fd000
> DEBUG:  ptr=0x7f8661000000 startptr_buff=0x7f8661000000
> endptr_buff=0x7f8661201000 bufferid=196 page_num=0 real
> buffptr=0x7f86611ff000 (!)
> DEBUG:  ptr=0x7f8661200000 startptr_buff=0x7f8661000000
> endptr_buff=0x7f8661201000 bufferid=196 page_num=1 real
> buffptr=0x7f86611ff000 (!)
> DEBUG:  ptr=0x7f8661200000 startptr_buff=0x7f8661200000
> endptr_buff=0x7f8661203000 bufferid=197 page_num=1 real
> buffptr=0x7f8661201000
> DEBUG:  ptr=0x7f8661200000 startptr_buff=0x7f8661200000
> endptr_buff=0x7f8661205000 bufferid=198 page_num=1 real
> buffptr=0x7f8661203000
> 
> so we have bufferid=196 with buffptr=0x7f86611ff000 that is 8kB big
> (and ends at 0x7f8661201000), while the HP that hosts it spans
> 0x7f8661000000 to 0x7f8661200000. So buffer 196 spans 2 hugepages. An
> open question for another day (outside of this $thread, of course) is
> whether we should align s_b to the HP size or not. As per the above,
> even bufferid=1 is at 0x7f8661079000 while its page starts at
> 0x7f8661000000 (that's a 495616-byte difference).
>

Right, this is because that's where the THP boundary happens to be. And
that one "duplicate" entry is for a buffer that happens to span two
pages. This is *exactly* the misalignment of blocks and pages that I was
wondering about earlier, and with the fixed endptr calculation we handle
that just fine.
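
To spell out the arithmetic (it's what the sketch above computes, too):
buffer 196 starts at 0x7f86611ff000 and ends at 0x7f8661201000 (start +
BLCKSZ, no rounding), while the huge page boundary is at 0x7f8661200000.
So the first 4kB of the buffer fall into OS page 0 and the remaining 4kB
into page 1, which is why it gets one row per OS page it touches.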

No opinion on the alignment - maybe we should do that, but it's not
something this patch needs to worry about.


regards

-- 
Tomas Vondra


