> On 25 Jun 2025, at 07:59, Laurenz Albe <laurenz.a...@cybertec.at> wrote:
>
> On Wed, 2025-06-25 at 11:15 +0800, James Pang wrote:
>> pgv14, RHEL8, xfs , we suddenly see tens of sessions waiting on
>> "DataFileRead" and
>> "extend", it last about 2 seconds(based on pg_stat_activity query) , during
>> the
>> waiting time, "%sys" cpu increased to 80% , but from "iostat" , no high iops
>> and
>> io read/write latency increased either.
>
> Run "sar -P all 1" and see if "%iowait" is high.
I would (strongly) advise against the use of iowait as an indicator. It is a
kernel approximation of time spent in IO, which cannot be used in any sensible
way other than to say that you are possibly doing IO.
First of all, iowait is not a real kernel task state; it is carved out of idle
time. This means that if there is no, or too little, idle time, the iowait that
should be there is simply gone.
Second, idle time is only converted to iowait for synchronous IO calls. That is
currently not a problem for postgres, because synchronous IO is exactly what it
uses, but in the future it might be.
Very roughly put, what the kernel does is keep a counter of tasks currently in
certain IO-related system calls, and then try to express that as iowait. The
time in iowait cannot be used to calculate any IO facts.
In that sense, it sits in the same area as the load figure: indicative, but
mostly useless, because it doesn't give you any facts about what it is
expressing.
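If you want to look at actually blocked tasks rather than the iowait
approximation, something along these lines (just a sketch; exact columns depend
on your procps and sysstat versions) shows what is sitting in uninterruptible
sleep:

  # tasks currently in 'D' (uninterruptible sleep), typically waiting for IO
  ps -eo state,pid,comm,wchan:32 | awk '$1 == "D"'
  # or watch the 'b' (blocked) column over time
  vmstat 1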
>
> Check if you have transparent hugepages enabled:
>
> cat /sys/kernel/mm/transparent_hugepage/enabled
>
> If they are enabled, disable them and see if it makes a difference.
>
> I am only guessing here.
Absolutely. Anything that uses significant amounts of memory and is not written
to take advantage of transparent hugepages will probably see more downsides from
THP than benefits.
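For completeness, a quick way to check and temporarily (until reboot) disable
THP on RHEL 8, assuming you can experiment on that machine:

  cat /sys/kernel/mm/transparent_hugepage/enabled    # the bracketed value is the active setting
  echo never > /sys/kernel/mm/transparent_hugepage/enabled    # as root, not persistent across reboots
  grep AnonHugePages /proc/meminfo                   # how much anonymous memory is currently THP-backed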
>
>> many sessions were running same "DELETE FROM xxxx" in parallel waiting on
>> "extend"
>> and "DataFileRead", there are triggers in this table "After delete" to
>> insert/delete
>> other tables in the tigger.
>
> One thing that almost certainly would improve your situation is to run fewer
> concurrent statements, for example by using a reasonably sized connection
> pool.
This is true if the limits of the IO device, or of anything on the path towards
the IO device or devices, are hit.
And in general, a high "%sys", i.e. lots of time spent in kernel mode (system
time), indicates lots of time spent in system calls, which is what the read and
write calls in postgres are.
Therefore these figures suggest blocking on IO, for which Laurenz's advice to
lower the number of concurrent sessions doing IO in general makes sense.
A more nuanced analysis: if IO requests get queued, the tasks will wait in 'D'
state in linux, which by definition is off CPU, and thus does not spend CPU
(system/kernel) time.
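To see where that system time and the blocking sit per backend, sysstat's
pidstat can help (assuming sysstat is installed; adjust the interval to taste):

  # per-process CPU split; %system is time spent in system calls
  pidstat -u 1
  # per-process disk IO that actually reaches the block layer
  pidstat -d 1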
What sounds suspicious is that you indicate you see no significant change in
the amount of IO in iostat.
In order to understand this, you will first have to carefully find the actual
physical IO devices that you are using for postgres IO.
In current linux this can be tricky, depending on what the hardware or virtual
machine looks like, and on how the disks are arranged in linux.
What you need to determine is which actual disk devices are used, and what
their limits are.
The limits for any disk are IOPS (operations per second) and MBPS (megabytes
per second -> bandwidth).
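To find out which block devices actually sit underneath the postgres data
directory (LVM, md or multipath can hide this), something like the following
usually works; $PGDATA is of course your data directory:

  findmnt -T "$PGDATA"                      # which filesystem/mount the data directory lives on
  lsblk -o NAME,TYPE,SIZE,ROTA,MOUNTPOINT   # how that mount maps onto dm/md/physical devices
  iostat -x 1                               # then watch r/s, w/s, await and %util for those devices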
There is an additional thing to realize, which makes this really tricky:
postgres uses buffered IO for its regular IO.
Buffered IO means any read or write goes through the linux page cache, and
reads and writes can be served from that cache if possible.
So in your case, even if you managed to make the database perform identical
read and write requests, the amount of read and write IO served from the cache
could differ, which can make an enormous difference in how fast these requests
are served. If the operating system is forced to take the physical IO path, you
will see significant amounts of time spent on that, which shows up as IO
related wait events.
Not a simple answer, but this is how it works.
So I would suggest comparing the situation where the same workload is
considered well performing with the one where it performs badly.
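One rough way to do that is to compare postgres' own read counters with what
the devices report in both situations, for example (just a sketch):

  # postgres-level reads that missed shared_buffers (these may still be served
  # from the page cache rather than the disk)
  psql -c "SELECT blks_read, blks_hit FROM pg_stat_database WHERE datname = current_database();"
  # reads/writes that actually reached the block devices
  iostat -x 1 5
  # state of the page cache
  grep -E '^(Cached|Dirty|Writeback):' /proc/meminfo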
>
> Yours,
> Laurenz Albe
>
>