Thanks for your explanation. From the PostgreSQL perspective, is it possible to see the bgwriter or the checkpointer blocking a backend process that is reading/writing, or vice versa? For example, would sampling pg_stat_activity along the lines of the sketch below show it?
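(This is just a rough sketch of what I have in mind, not something I have verified; the backend_type values and the wait event names are what I believe they are called on our pg14, so please correct me if they are wrong.)

  -- sample what the write-related auxiliary processes and the waiting backends are doing right now
  SELECT pid, backend_type, state, wait_event_type, wait_event
  FROM pg_stat_activity
  WHERE backend_type IN ('checkpointer', 'background writer', 'walwriter')
     OR wait_event IN ('DataFileRead', 'DataFileExtend', 'extend', 'WALWrite')
  ORDER BY backend_type, pid;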
Thanks,
James

On Thu, Jun 26, 2025 at 4:07 PM Frits Hoogland <frits.hoogl...@gmail.com> wrote:

> Postgres lives as a process in linux, and keeps its own cache, and tries to use that as much as possible for data. This is postgres shared buffers, commonly called the buffer cache.
>
> For WAL, sessions write to the WAL buffer (separate from the postgres buffer cache), and need to write to disk upon commit with default settings. Please mind that WAL writes are commonly performed by the WAL writer, but can be done by backends too.
>
> (Slightly simplified) When postgres processing needs to read data for processing, it will have to do that in the postgres buffer cache, and use it from the cache if it can find the requested buffer; otherwise it will have to read it from the operating system. When changes are made to data or metadata (such as hint bits being set), the change is made to the buffer (and before that, a description of it is placed in the WAL buffer). As long as there are buffers available, it will perform the data changes in the buffer in the buffer cache. However, the buffer cache is a finite number of buffers, and eventually changed buffers must make it to disk (which happens upon checkpoint, and for other reasons). The important part here is that checkpointer, bgwriter and backends all perform reads and writes. That is what postgres sees: it only knows and sees that it is performing a read or write request.
>
> These reads and writes are performed to the operating system, and the operating system essentially applies the same technique for performance as postgres does: if an IO is done buffered (which is the default), it creates a page in the linux page cache. For reads, if the read request happens to find the pages it requests in the linux page cache, it can serve them to the postgres process without needing to perform an actual read from disk. Buffered writes are always done to the linux page cache, and written back by kernel writer threads asynchronously. (This is the essence; there is much more detail.) Obviously, the amount linux can cache depends on the amount of memory available for caching, and on whether the pages for the request are in the linux page cache. For writes, it commonly depends on the setting vm.dirty_ratio, which is a percentage of *available* memory (commonly misunderstood as taken from total memory).
>
> Now, the point I am trying to make: from the postgres point of view, it is impossible to understand whether a read or write request was served from cache by the operating system, or needed physical IO. However, the ability of the operating system to serve from cache depends essentially on the availability of free and file memory, and therefore on general usage of the operating system. Any action on the operating system that needs a significant amount of memory will reduce the amount of memory available for caching. All in all this means that if you do something significant on the operating system, it is perfectly possible to perceive that from postgres as a sudden drastic change in IO latency, whilst from the postgres perspective you didn't do anything different.
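(Side note on the "checkpointer, bgwriter and backends all perform reads and writes" part above: I assume the cumulative counters in pg_stat_bgwriter are the way to see, from inside PostgreSQL, who has been writing buffers out. The columns below are what I see on our pg14, so this is only a sketch.)

  -- cumulative shared-buffer writes per writer since stats_reset (pg14 pg_stat_bgwriter)
  SELECT buffers_checkpoint,      -- buffers written by the checkpointer
         buffers_clean,           -- buffers written by the background writer
         buffers_backend,         -- buffers written by backends themselves
         buffers_backend_fsync,   -- times a backend had to do its own fsync
         maxwritten_clean,        -- times the bgwriter stopped at bgwriter_lru_maxpages
         stats_reset
  FROM pg_stat_bgwriter;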
> Now circling back to your concrete question: if such a thing happens, and linux all of a sudden needs to do physical IOs and lets your request wait on those instead of doing logical IO (which your session also has to wait on, but for significantly less time), you might all of a sudden see all kinds of IO related waits. You encounter those waits in the case of the fast logical IOs too, but they take so little time that you don't see them. This is where wait event occurrence counts and time accumulation would help significantly; currently we can only sample the wait event state, so much of it is never seen, and thus not known.
>
> *Frits Hoogland*
>
> On 26 Jun 2025, at 08:47, James Pang <jamespang...@gmail.com> wrote:
>
> We faced this issue 3 times this week; each time it lasted only 2 seconds, so it is not easy to run perf during peak business time to capture it, but anyway, I will try. Before that, I want to understand if the OS page cache or the PG buffer cache can contribute to the wait_event time for "extend" and "DataFileRead", or whether the bgwriter or the checkpointer flushing data to disk can impact that too. We have the bgwriter enabled, and we saw checkpoints being triggered by "wal" during that time, so I just increased max_wal_size to make checkpoints happen less frequently.
>
> Thanks,
> James
>
> On Thu, Jun 26, 2025 at 2:40 PM Frits Hoogland <frits.hoogl...@gmail.com> wrote:
>
>> Okay. So it's a situation that is reproducible.
>> And as was mentioned, the system time (percentage) is very high.
>> Is this a physical machine, or a virtual machine?
>>
>> The next thing to do is use perf to record about 20 seconds or so during a period of time when you see this behavior (perf record -g, taking the backtrace with it).
>> This records (samples) the backtraces of on-CPU tasks, from which you can then derive what they are doing (you should see lots of tasks in kernel space) and what that is, using perf report.
>>
>> *Frits Hoogland*
>>
>> On 26 Jun 2025, at 04:32, James Pang <jamespang...@gmail.com> wrote:
>>
>> Thanks for your suggestions. We have iowait from the sar command too, copied here; checking with the infra team did not find abnormal IO activity either.
>> 02:00:01 PM  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
>> 02:00:03 PM  all  15.92   0.00  43.02     0.65  0.76   2.56    0.00    0.00    0.00  37.09
>> 02:00:03 PM    0  17.59   0.00  46.73     1.01  0.50   0.50    0.00    0.00    0.00  33.67
>> 02:00:03 PM    1   9.50   0.00  61.50     0.50  0.50   1.00    0.00    0.00    0.00  27.00
>> 02:00:03 PM    2  20.71   0.00  44.44     1.01  0.51   0.51    0.00    0.00    0.00  32.83
>> 02:00:03 PM    3  14.00   0.00  51.50     2.00  1.00   1.00    0.00    0.00    0.00  30.50
>> 02:00:03 PM    4   6.57   0.00  52.53     0.51  0.51   3.54    0.00    0.00    0.00  36.36
>> 02:00:03 PM    5  10.20   0.00  49.49     1.02  1.53   0.00    0.00    0.00    0.00  37.76
>> 02:00:03 PM    6  27.64   0.00  41.21     0.50  0.50   0.50    0.00    0.00    0.00  29.65
>> 02:00:03 PM    7   9.05   0.00  50.75     0.50  1.01   0.50    0.00    0.00    0.00  38.19
>> 02:00:03 PM    8  12.18   0.00  49.75     0.51  0.51   0.51    0.00    0.00    0.00  36.55
>> 02:00:03 PM    9  13.00   0.00   9.50     0.50  1.50  15.50    0.00    0.00    0.00  60.00
>> 02:00:03 PM   10  15.58   0.00  46.23     0.00  0.50   0.50    0.00    0.00    0.00  37.19
>> 02:00:03 PM   11  20.71   0.00  10.10     0.00  1.01  14.14    0.00    0.00    0.00  54.04
>> 02:00:03 PM   12  21.00   0.00  37.00     0.50  1.00   1.00    0.00    0.00    0.00  39.50
>> 02:00:03 PM   13  13.57   0.00  45.73     1.01  1.01   1.01    0.00    0.00    0.00  37.69
>> 02:00:03 PM   14  18.18   0.00  39.39     1.01  0.51   0.51    0.00    0.00    0.00  40.40
>> 02:00:03 PM   15  14.00   0.00  49.50     0.50  0.50   3.50    0.00    0.00    0.00  32.00
>> 02:00:03 PM   16  19.39   0.00  39.80     1.02  1.53   0.51    0.00    0.00    0.00  37.76
>> 02:00:03 PM   17  16.75   0.00  45.18     1.52  1.02   2.54    0.00    0.00    0.00  32.99
>> 02:00:03 PM   18  12.63   0.00  50.00     0.00  1.01   0.00    0.00    0.00    0.00  36.36
>> 02:00:03 PM   19   5.56   0.00  82.32     0.00  0.00   0.00    0.00    0.00    0.00  12.12
>> 02:00:03 PM   20  15.08   0.00  48.24     0.50  0.50   3.52    0.00    0.00    0.00  32.16
>> 02:00:03 PM   21  17.68   0.00   9.09     0.51  1.52  13.64    0.00    0.00    0.00  57.58
>> 02:00:03 PM   22  13.13   0.00  43.94     0.51  0.51   0.51    0.00    0.00    0.00  41.41
>> 02:00:03 PM   23  14.07   0.00  42.71     0.50  0.50   0.50    0.00    0.00    0.00  41.71
>> 02:00:03 PM   24  13.13   0.00  41.92     1.01  0.51   0.51    0.00    0.00    0.00  42.93
>> 02:00:03 PM   25  16.58   0.00  47.74     0.50  1.01   0.50    0.00    0.00    0.00  33.67
>> 02:00:03 PM   26  16.58   0.00  46.73     0.50  1.01   0.50    0.00    0.00    0.00  34.67
>> 02:00:03 PM   27  45.50   0.00  54.50     0.00  0.00   0.00    0.00    0.00    0.00   0.00
>> 02:00:03 PM   28   6.06   0.00  32.32     0.00  0.51  13.13    0.00    0.00    0.00  47.98
>> 02:00:03 PM   29  13.93   0.00  44.78     1.00  1.00   0.50    0.00    0.00    0.00  38.81
>> 02:00:03 PM   30  11.56   0.00  57.79     0.00  0.50   1.01    0.00    0.00    0.00  29.15
>> 02:00:03 PM   31  33.85   0.00   9.23     0.51  1.54   0.51    0.00    0.00    0.00  54.36
>> 02:00:03 PM   32  30.15   0.00  41.71     0.50  0.50   1.51    0.00    0.00    0.00  25.63
>>
>> Thanks,
>> James
>>
>> On Wed, Jun 25, 2025 at 10:27 PM Frits Hoogland <frits.hoogl...@gmail.com> wrote:
>>
>>> > On 25 Jun 2025, at 07:59, Laurenz Albe <laurenz.a...@cybertec.at> wrote:
>>> >
>>> > On Wed, 2025-06-25 at 11:15 +0800, James Pang wrote:
>>> >> pgv14, RHEL8, xfs: we suddenly see tens of sessions waiting on "DataFileRead" and "extend". It lasts about 2 seconds (based on a pg_stat_activity query). During the waiting time, "%sys" cpu increased to 80%, but from "iostat" there were no high IOPS, and IO read/write latency did not increase either.
>>> >
>>> > Run "sar -P all 1" and see if "%iowait" is high.
>>>
>>> I would (strongly) advise against the use of iowait as an indicator. It is a kernel approximation of time spent in IO, which cannot be used in any sensible way other than as an indication that you are possibly doing IO.
>>> First of all, iowait is not a kernel state, and therefore it's taken from idle.
>>> This means that if there is no, or too little, idle time, the iowait that should be there is gone.
>>> Second, the calculation that transfers idle time to iowait is done for synchronous IO calls only. Currently that is not a problem for postgres, because it uses exactly that, but in the future it might be.
>>> Very roughly put, what the kernel does is keep a counter of tasks currently in certain IO system calls, and then try to express that using iowait. The time in iowait can't be used to calculate any IO facts.
>>>
>>> In that sense, it is in the same area as the load figure: indicative, but mostly useless, because it doesn't give you any facts about what it is expressing.
>>>
>>> > Check if you have transparent hugepages enabled:
>>> >
>>> >   cat /sys/kernel/mm/transparent_hugepage/enabled
>>> >
>>> > If they are enabled, disable them and see if it makes a difference.
>>> >
>>> > I am only guessing here.
>>>
>>> Absolutely. Anything that uses significant amounts of memory and was not written to take advantage of transparent hugepages will probably experience more downsides from THP than benefits.
>>>
>>> >> Many sessions were running the same "DELETE FROM xxxx" in parallel, waiting on "extend" and "DataFileRead". There are "after delete" triggers on this table that insert into/delete from other tables.
>>> >
>>> > One thing that almost certainly would improve your situation is to run fewer concurrent statements, for example by using a reasonably sized connection pool.
>>>
>>> This is true if the limits of the IO device, or of anything on the path to the IO device or devices, are hit.
>>> And in general, a high "%sys", i.e. lots of time spent in kernel mode, a.k.a. system time, indicates lots of time spent in system calls, which is what the read and write calls in postgres are.
>>> Therefore these figures suggest blocking for IO, for which Laurenz's advice to lower the number of concurrent sessions doing IO in general makes sense.
>>> A more nuanced analysis: if IO requests get queued, they will wait in 'D' state in linux, which by definition is off CPU, and thus does not spend CPU (system/kernel) time.
>>>
>>> What sounds suspicious is that you indicate you see no significant change in the amount of IO in iostat.
>>>
>>> In order to understand this, you will first have to carefully find the actual physical IO devices that you are using for postgres IO.
>>> In current linux this can be tricky, depending on what the hardware or virtual machine looks like, and how the disks are arranged in linux.
>>> What you need to determine is which actual disk devices are used, and what their limits are.
>>> Limits for any disk are IOPS (operations per second) and MBPS (megabytes per second, i.e. bandwidth).
>>>
>>> There is an additional thing to realize, which makes this really tricky: postgres uses buffered IO for common IO.
>>> Buffered IO means any read or write will use the linux page cache, and reads or writes can be served from that cache if possible.
>>>
>>> So in your case, even if you made the database perform identical read or write requests, this could result in a different number of read and write IOs being served from the cache, which can make an enormous difference in how fast these requests are served.
>>> If you somehow made the operating system choose the physical IO path, you will see significant amounts of time spent on that, which will show up as IO related wait events.
>>>
>>> Not a simple answer, but this is how it works.
>>>
>>> So I would suggest checking the difference between the situation where it is doing the same work and is considered well performing, versus when it is performing badly.
>>>
>>> > Yours,
>>> > Laurenz Albe
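PS: on the point above that from postgres it is impossible to tell whether a read was served from the OS cache or needed physical IO: I assume that with track_io_timing = on, the read timing columns in pg_stat_database would at least show when reads, as seen by postgres, suddenly get slower. Something like this is what I have in mind (my own sketch, not verified):

  -- rough average read time per block as seen by postgres (requires track_io_timing = on)
  SELECT datname,
         blks_read,
         blk_read_time,                                   -- total milliseconds spent reading blocks
         CASE WHEN blks_read > 0
              THEN round((blk_read_time / blks_read)::numeric, 3)
         END AS avg_ms_per_block_read
  FROM pg_stat_database
  WHERE datname IS NOT NULL
  ORDER BY blk_read_time DESC;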