On Wed, 2 Nov 2022 at 00:09, Andy Fan <zhihui.fan1...@gmail.com> wrote:
> I just have a different platforms at hand, Here is my test with
> Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz.
> shared_buffers has been set to big enough to hold all the data.
Many thanks for testing that. Those numbers look much better than the
ones I got from my AMD machine.

> By theory, Why does the preferch make thing better? I am asking this
> because I think we need to read the data from buffer to cache line once
> in either case (I'm obvious wrong in face of the test result.)

That's a good question. I didn't really explain that in my email.
There's quite a bit of information in [1]. I might not do the
explanation justice, but my basic understanding is that many modern CPU
architectures are good at "Sequential Prefetching" of cache lines from
main memory when the direction is forward, but they're not very good at
detecting access patterns that scan memory addresses in a backwards
direction.

Because of our page layout, we have the page header followed by item
pointers at the start of the page. These item pointers are fixed width
and point to the tuples, which are variable width. Tuples are written
starting at the end of the page, and the page is full when the tuples
would overlap with the item pointers. See the diagrams in [2]. We do
our best to keep those tuples in the reverse order of the item pointer
array. This means that when we're performing a forward sequential scan,
we're (generally) reading tuples starting at the end of the page and
working backwards. Since the CPU is not very good at noticing this and
prefetching the preceding cache line, we can (seemingly) make things go
faster by issuing a manual prefetch operation by way of
pg_prefetch_mem().

The key here is that accessing RAM is far slower than accessing CPU
caches. Modern CPUs can perform multiple operations in parallel, and
these can be rearranged by the CPU so that they're not in the same
order as the instructions are written in the program. A high-latency
operation such as accessing RAM can hold up other operations which
depend on the value that's waiting to come in from RAM. When the CPU is
held up like this, it's called a pipeline stall [3]. The prefetching in
this case helps to reduce the time spent stalled waiting for memory
access.

David

[1] https://en.wikipedia.org/wiki/Cache_prefetching
[2] https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/speeding-up-recovery-and-vacuum-in-postgres-14/ba-p/2234071
[3] https://en.wikipedia.org/wiki/Pipeline_stall
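
P.S. To make the backwards access pattern concrete, here is a minimal
standalone sketch. This is not PostgreSQL source: the simplified page
layout and the names ItemPointerish and process_tuple() are made up for
illustration, and GCC/Clang's __builtin_prefetch() stands in for
pg_prefetch_mem().

/*
 * Sketch of a forward scan over a heap-like page.  Item pointers sit
 * at the start of the page; the tuples they point at were written from
 * the end of the page backwards, so walking the item pointer array
 * forwards visits tuple addresses in *descending* order -- the pattern
 * the hardware prefetcher tends to miss.
 */
#include <stddef.h>
#include <stdint.h>

typedef struct ItemPointerish
{
    uint16_t offset;            /* offset of the tuple within the page */
    uint16_t len;               /* tuple length in bytes */
} ItemPointerish;

/* hypothetical per-tuple callback, defined elsewhere */
extern void process_tuple(const char *tuple, uint16_t len);

static void
scan_page(const char *page, const ItemPointerish *items, int nitems)
{
    for (int i = 0; i < nitems; i++)
    {
        /*
         * Hint the CPU to start pulling the next tuple's cache line in
         * from memory while we're still busy with the current tuple.
         * Because the next tuple usually lives at a lower address, the
         * hardware won't have prefetched it for us.
         */
        if (i + 1 < nitems)
            __builtin_prefetch(page + items[i + 1].offset);

        process_tuple(page + items[i].offset, items[i].len);
    }
}

The only interesting line is the __builtin_prefetch() call: it overlaps
the memory latency of fetching the next tuple with the work being done
on the current one, which is exactly where the stall time goes
otherwise.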