Thanks for your explanation. From the PostgreSQL perspective, is it possible
to see the bgwriter or checkpointer blocking a backend process that is
reading/writing, or vice versa?
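
For example, would a sampling query along these lines be able to show it?
(Just a rough sketch against pg_stat_activity and pg_blocking_pids(); the
backend_type filter is only illustrative.)

    -- Sample what backends and the bgwriter/checkpointer are waiting on right
    -- now; heavyweight-lock blocking shows up via pg_blocking_pids(), while
    -- buffer or IO contention only appears as LWLock/IO wait events.
    SELECT pid,
           backend_type,
           state,
           wait_event_type,
           wait_event,
           pg_blocking_pids(pid) AS blocked_by
    FROM   pg_stat_activity
    WHERE  backend_type IN ('client backend', 'background writer', 'checkpointer')
    ORDER  BY backend_type, pid;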

Thanks,

James

On Thu, 26 Jun 2025 at 16:07, Frits Hoogland <frits.hoogl...@gmail.com> wrote:

> Postgres lives as a process in Linux and keeps its own cache, which it tries
> to use as much as possible for data. This is Postgres shared buffers,
> commonly called the buffer cache.
>
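> (As a rough illustration only, and assuming the optional pg_buffercache
> contrib extension is installed, the state of that cache can be inspected
> like this:)
>
>     -- Configured size of the Postgres buffer cache.
>     SHOW shared_buffers;
>
>     -- With the pg_buffercache contrib extension installed
>     -- (CREATE EXTENSION pg_buffercache), see how many buffers are in use
>     -- and how many of them are dirty.
>     SELECT count(*)                                    AS buffers_total,
>            count(*) FILTER (WHERE isdirty)             AS buffers_dirty,
>            count(*) FILTER (WHERE relfilenode IS NULL) AS buffers_unused
>     FROM   pg_buffercache;
>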
> For WAL, sessions write to the WAL buffer (separate from the Postgres buffer
> cache), which must be written to disk upon commit with default settings.
> Please mind that WAL writes are commonly performed by the WAL writer, but
> can be done by backends too.
>
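> (A rough way to see that split on v14 is the cumulative pg_stat_wal view;
> wal_buffers_full counts the cases where a process had to write WAL out
> itself because the WAL buffers filled up. A sketch:)
>
>     -- WAL write/sync activity; wal_write_time and wal_sync_time also exist
>     -- but stay at zero unless track_wal_io_timing = on.
>     SELECT wal_records,
>            wal_bytes,
>            wal_buffers_full,   -- writes forced because the WAL buffer was full
>            wal_write,          -- number of WAL write requests
>            wal_sync            -- number of WAL sync requests
>     FROM   pg_stat_wal;
>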
> (Slightly simplified) When Postgres needs to read data for processing, it
> has to do that via the Postgres buffer cache: it uses the buffer from the
> cache if it can find the requested buffer, and otherwise it has to read it
> in from the operating system. When changes are made to data or metadata
> (such as hint bits being set), the change is made to the buffer (and before
> that, a description of the change is placed in the WAL buffer). As long as
> there are buffers available, data changes are performed on buffers in the
> buffer cache. However, the buffer cache is a finite number of buffers, and
> eventually changed buffers must make it to disk (which happens at
> checkpoint time, and for other reasons). The important part here is that
> the checkpointer, the bgwriter and the backends all perform reads and
> writes. That is all Postgres sees: it only knows it is performing a read or
> write request.
>
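> (On v14 the split between these writers is visible in the cumulative
> pg_stat_bgwriter view; a minimal sketch, and sampling it twice and diffing
> the values is more useful than the raw totals:)
>
>     -- Who wrote out dirty buffers since the last stats reset.
>     SELECT buffers_checkpoint,     -- written by the checkpointer
>            buffers_clean,          -- written by the background writer
>            buffers_backend,        -- written directly by backends
>            buffers_backend_fsync,  -- backends had to do their own fsync
>            maxwritten_clean        -- bgwriter stopped at bgwriter_lru_maxpages
>     FROM   pg_stat_bgwriter;
>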
> These reads and writes are issued to the operating system, and the
> operating system essentially applies the same technique for performance as
> Postgres does: if an IO is done buffered (which is the default), it goes
> through pages in the Linux page cache. For reads, if the request happens to
> find the pages it needs in the Linux page cache, the OS can serve it to the
> Postgres process without performing a physical read from the block device.
> Buffered writes are always done to the Linux page cache, and written back
> by kernel writer threads asynchronously. (This is the essence; there is
> much more detail.) Obviously, how much Linux can serve from cache depends
> on the amount of memory available for the page cache and on whether the
> pages for the request are actually in it. For writes, it commonly depends
> on the setting vm.dirty_ratio, which is a percentage of *available* memory
> (commonly misunderstood as being taken from total memory).
>
> Now, the point I am trying to make: from the Postgres point of view, it is
> impossible to tell whether a read or write request was served from the OS
> cache by the operating system, or needed physical IO. However, the ability
> of the operating system to serve from cache essentially depends on the
> availability of free and file-backed (cache) memory, and therefore on the
> general usage of the operating system. Any action on the operating system
> that needs a significant amount of memory will lower the amount of memory
> available for caching. All in all this means that if you do something
> significant on the operating system, it is perfectly possible to perceive
> that from Postgres as a sudden, drastic change in IO latency, whilst from
> the Postgres perspective you didn't do anything different.
>
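> (The closest you can get from inside Postgres on v14 is the accumulated
> read time with track_io_timing = on; a sketch below. Note that it still
> cannot tell you whether a given read came from the Linux page cache or from
> disk, only that very low average latencies usually mean page cache hits.)
>
>     -- Average time per read call as seen by Postgres
>     -- (requires track_io_timing = on, otherwise blk_read_time stays 0).
>     SELECT datname,
>            blks_read,
>            blk_read_time,    -- milliseconds, cumulative
>            round((blk_read_time / nullif(blks_read, 0))::numeric, 3)
>                AS avg_ms_per_read
>     FROM   pg_stat_database
>     WHERE  datname IS NOT NULL;
>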
> Now, circling back to your concrete question: if such a thing happens, where
> Linux all of a sudden needs to do physical IO and lets your request wait on
> that instead of doing logical IO (which your session also has to wait on,
> but for significantly less time), you might suddenly see all kinds of
> IO-related waits. Those same waits also occur in the case of the fast
> logical IOs, but take so little time that you don't see them. This is where
> counting wait event occurrences and accumulating wait times would help
> significantly; currently we can only sample the wait event state, so much
> of this is never seen, and thus not known.
>
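> (Until such accumulation exists, repeated sampling is all there is; this is
> the kind of sampling query meant here, as a sketch, to be run in a tight
> loop during the bad seconds:)
>
>     -- Count sessions per wait event at this instant; run repeatedly
>     -- (e.g. every 100-500 ms from a script) and compare the samples.
>     SELECT now() AS sample_time,
>            wait_event_type,
>            wait_event,
>            count(*) AS sessions
>     FROM   pg_stat_activity
>     WHERE  state = 'active'
>     GROUP  BY 1, 2, 3
>     ORDER  BY sessions DESC;
>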
> *Frits Hoogland*
>
>
>
>
> On 26 Jun 2025, at 08:47, James Pang <jamespang...@gmail.com> wrote:
>
> We faced this issue 3 times this week, each time lasting only about 2
> seconds, so it is not easy to run perf at peak business time to capture it,
> but I will try. Before that, I want to understand whether the "OS page
> cache" or the "PG buffer cache" can contribute to the wait_event time for
> "extend" and "DataFileRead", or whether the bgwriter or the checkpointer
> flushing data to disk can impact that too. We have the bgwriter enabled,
> and we see checkpoints triggered by "wal" during that time, so I just
> increased max_wal_size to make checkpoints happen less frequently.
>
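> (For reference, this is the kind of thing I am looking at; a sketch only,
> and the max_wal_size value below is illustrative, not our actual setting:)
>
>     -- Checkpoints requested (e.g. because of WAL volume) vs. timed ones.
>     SELECT checkpoints_timed, checkpoints_req
>     FROM   pg_stat_bgwriter;
>
>     -- The kind of change I made, so WAL volume forces checkpoints less often.
>     ALTER SYSTEM SET max_wal_size = '16GB';  -- illustrative value
>     SELECT pg_reload_conf();
>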
> Thanks,
>
> James
>
>
> On Thu, 26 Jun 2025 at 14:40, Frits Hoogland <frits.hoogl...@gmail.com> wrote:
>
>> Okay. So it's a situation that is reproducible.
>> And, as was mentioned, the system time (percentage) is very high.
>> Is this a physical machine, or a virtual machine?
>>
>> The next thing to do is to use perf to record about 20 seconds or so during
>> a period when you see this behavior (perf record -g, so the backtraces are
>> captured as well).
>> This samples the backtraces of the on-CPU tasks, from which you can then
>> derive what they are doing using perf report; you should see lots of tasks
>> in kernel space, and perf report will show what they are doing there.
>>
>> *Frits Hoogland*
>>
>>
>>
>>
>> On 26 Jun 2025, at 04:32, James Pang <jamespang...@gmail.com> wrote:
>>
>> Thanks for your suggestions. We have iowait from the sar command too,
>> copied here; checking with the infra team did not find abnormal IO activity
>> either.
>> 02:00:01 PM  CPU     %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
>> 02:00:03 PM  all    15.92    0.00   43.02    0.65    0.76    2.56    0.00    0.00    0.00   37.09
>> 02:00:03 PM    0    17.59    0.00   46.73    1.01    0.50    0.50    0.00    0.00    0.00   33.67
>> 02:00:03 PM    1     9.50    0.00   61.50    0.50    0.50    1.00    0.00    0.00    0.00   27.00
>> 02:00:03 PM    2    20.71    0.00   44.44    1.01    0.51    0.51    0.00    0.00    0.00   32.83
>> 02:00:03 PM    3    14.00    0.00   51.50    2.00    1.00    1.00    0.00    0.00    0.00   30.50
>> 02:00:03 PM    4     6.57    0.00   52.53    0.51    0.51    3.54    0.00    0.00    0.00   36.36
>> 02:00:03 PM    5    10.20    0.00   49.49    1.02    1.53    0.00    0.00    0.00    0.00   37.76
>> 02:00:03 PM    6    27.64    0.00   41.21    0.50    0.50    0.50    0.00    0.00    0.00   29.65
>> 02:00:03 PM    7     9.05    0.00   50.75    0.50    1.01    0.50    0.00    0.00    0.00   38.19
>> 02:00:03 PM    8    12.18    0.00   49.75    0.51    0.51    0.51    0.00    0.00    0.00   36.55
>> 02:00:03 PM    9    13.00    0.00    9.50    0.50    1.50   15.50    0.00    0.00    0.00   60.00
>> 02:00:03 PM   10    15.58    0.00   46.23    0.00    0.50    0.50    0.00    0.00    0.00   37.19
>> 02:00:03 PM   11    20.71    0.00   10.10    0.00    1.01   14.14    0.00    0.00    0.00   54.04
>> 02:00:03 PM   12    21.00    0.00   37.00    0.50    1.00    1.00    0.00    0.00    0.00   39.50
>> 02:00:03 PM   13    13.57    0.00   45.73    1.01    1.01    1.01    0.00    0.00    0.00   37.69
>> 02:00:03 PM   14    18.18    0.00   39.39    1.01    0.51    0.51    0.00    0.00    0.00   40.40
>> 02:00:03 PM   15    14.00    0.00   49.50    0.50    0.50    3.50    0.00    0.00    0.00   32.00
>> 02:00:03 PM   16    19.39    0.00   39.80    1.02    1.53    0.51    0.00    0.00    0.00   37.76
>> 02:00:03 PM   17    16.75    0.00   45.18    1.52    1.02    2.54    0.00    0.00    0.00   32.99
>> 02:00:03 PM   18    12.63    0.00   50.00    0.00    1.01    0.00    0.00    0.00    0.00   36.36
>> 02:00:03 PM   19     5.56    0.00   82.32    0.00    0.00    0.00    0.00    0.00    0.00   12.12
>> 02:00:03 PM   20    15.08    0.00   48.24    0.50    0.50    3.52    0.00    0.00    0.00   32.16
>> 02:00:03 PM   21    17.68    0.00    9.09    0.51    1.52   13.64    0.00    0.00    0.00   57.58
>> 02:00:03 PM   22    13.13    0.00   43.94    0.51    0.51    0.51    0.00    0.00    0.00   41.41
>> 02:00:03 PM   23    14.07    0.00   42.71    0.50    0.50    0.50    0.00    0.00    0.00   41.71
>> 02:00:03 PM   24    13.13    0.00   41.92    1.01    0.51    0.51    0.00    0.00    0.00   42.93
>> 02:00:03 PM   25    16.58    0.00   47.74    0.50    1.01    0.50    0.00    0.00    0.00   33.67
>> 02:00:03 PM   26    16.58    0.00   46.73    0.50    1.01    0.50    0.00    0.00    0.00   34.67
>> 02:00:03 PM   27    45.50    0.00   54.50    0.00    0.00    0.00    0.00    0.00    0.00    0.00
>> 02:00:03 PM   28     6.06    0.00   32.32    0.00    0.51   13.13    0.00    0.00    0.00   47.98
>> 02:00:03 PM   29    13.93    0.00   44.78    1.00    1.00    0.50    0.00    0.00    0.00   38.81
>> 02:00:03 PM   30    11.56    0.00   57.79    0.00    0.50    1.01    0.00    0.00    0.00   29.15
>> 02:00:03 PM   31    33.85    0.00    9.23    0.51    1.54    0.51    0.00    0.00    0.00   54.36
>> 02:00:03 PM   32    30.15    0.00   41.71    0.50    0.50    1.51    0.00    0.00    0.00   25.63
>>
>> Thanks,
>>
>> James
>>
>> On Wed, 25 Jun 2025 at 22:27, Frits Hoogland <frits.hoogl...@gmail.com> wrote:
>>
>>>
>>>
>>> > On 25 Jun 2025, at 07:59, Laurenz Albe <laurenz.a...@cybertec.at>
>>> wrote:
>>> >
>>> > On Wed, 2025-06-25 at 11:15 +0800, James Pang wrote:
>>> >> pgv14, RHEL8, xfs. We suddenly see tens of sessions waiting on
>>> >> "DataFileRead" and "extend"; it lasts about 2 seconds (based on a
>>> >> pg_stat_activity query). During the waiting time, "%sys" CPU increased
>>> >> to 80%, but from "iostat" there were no high IOPS and no increase in IO
>>> >> read/write latency either.
>>> >
>>> > Run "sar -P all 1" and see if "%iowait" is high.
>>> I would (strongly) advise against using iowait as an indicator. It is a
>>> kernel approximation of time spent in IO that cannot be used in any
>>> sensible way, other than as a hint that you are possibly doing IO.
>>> First of all, iowait is not a real kernel state; it is carved out of idle
>>> time. This means that if there is no, or too little, idle time, iowait
>>> that should be there is gone.
>>> Second, the accounting that turns idle time into iowait is done for
>>> synchronous IO calls only. Currently that is not a problem for Postgres,
>>> because it uses exactly that, but in the future it might be.
>>> Very roughly put, the kernel keeps a counter of tasks currently inside
>>> certain IO system calls and then tries to express that as iowait. The
>>> time in iowait cannot be used to calculate any IO facts.
>>>
>>> In that sense, that puts it in the same category as the load average:
>>> indicative, but mostly useless, because it doesn't give you any facts
>>> about what it is expressing.
>>> >
>>> > Check if you have transparent hugepages enabled:
>>> >
>>> >  cat /sys/kernel/mm/transparent_hugepage/enabled
>>> >
>>> > If they are enabled, disable them and see if it makes a difference.
>>> >
>>> > I am only guessing here.
>>> Absolutely. Anything that uses significant amounts of memory and was not
>>> built to take advantage of transparent hugepages will probably experience
>>> more downsides from THP than benefits.
>>> >
>>> >> many sessions were running the same "DELETE FROM xxxx" in parallel,
>>> >> waiting on "extend" and "DataFileRead"; there are AFTER DELETE triggers
>>> >> on this table that insert into/delete from other tables.
>>> >
>>> > One thing that almost certainly would improve your situation is to run
>>> fewer
>>> > concurrent statements, for example by using a reasonably sized
>>> connection pool.
>>> This is true if the limits of the IO device, or of anything on the path
>>> towards the IO device or devices, are hit.
>>> And in general, a high "%sys" (lots of time spent in kernel mode, alias
>>> system time) indicates lots of time spent in system calls, which is what
>>> the read and write calls in Postgres are.
>>> Therefore these figures suggest blocking on IO, for which Laurenz's advice
>>> to lower the number of concurrent sessions doing IO in general makes
>>> sense.
>>> A more nuanced analysis: if IO requests get queued, they will wait in 'D'
>>> state in Linux, which by definition is off-CPU, and thus do not spend
>>> CPU (system/kernel) time.
>>>
>>> What sounds suspicious is that you indicate you see no significant change
>>> in the amount of IO in iostat.
>>>
>>> In order to understand this, you will first have to carefully find the
>>> actual physical IO devices that you are using for Postgres IO.
>>> In current Linux this can be tricky, depending on what the hardware or
>>> virtual machine looks like, and on how the disks are arranged in Linux.
>>> What you need to determine is which actual disk devices are used, and
>>> what their limits are.
>>> Limits for any disk are IOPS (IO operations per second) and MBPS
>>> (megabytes per second, i.e. bandwidth).
>>>
>>> There is an additional thing to realize, which makes this really tricky:
>>> for common IO, Postgres uses buffered IO.
>>> Buffered IO means any read or write goes through the Linux page cache, and
>>> reads or writes can be served from that cache if possible.
>>>
>>> So in your case, even if you managed to make the database perform
>>> identical read or write requests, this could result in a difference in the
>>> number of read and write IOs served from the cache, which can make an
>>> enormous difference in how fast these requests are served. If you somehow
>>> made the operating system choose the physical IO path, you will see
>>> significant amounts of time spent on that, which will show up as
>>> IO-related wait events.
>>>
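>>> (One way to see this per statement, as a sketch and assuming
>>> track_io_timing is enabled: EXPLAIN with BUFFERS shows how many blocks
>>> were a hit in the Postgres buffer cache versus read requests to the OS,
>>> and the I/O timing shows how expensive those OS reads were, i.e. whether
>>> they were likely served from the Linux page cache or not. The table name
>>> below is illustrative.)
>>>
>>>     -- "shared hit" = found in the Postgres buffer cache,
>>>     -- "read"       = requested from the OS (may still be a page cache hit).
>>>     SET track_io_timing = on;
>>>     EXPLAIN (ANALYZE, BUFFERS)
>>>     SELECT count(*) FROM some_table;   -- illustrative table name
>>>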
>>> Not a simple answer, but this is how it works.
>>>
>>> So I would suggest comparing the two situations in which the database is
>>> doing the same work: the one that is considered well performing versus the
>>> one performing badly.
>>>
>>>
>>> >
>>> > Yours,
>>> > Laurenz Albe
>>> >
>>> >
>>>
>>>
>>
>
