On Tue, 2020-08-18 at 19:52 -0400, Jim Jarvie wrote:
> I have a system which implements a message queue with the basic pattern that 
> a process selects a group of,
>  for example 250, rows for processing via SELECT .. LIMIT 250 FOR UPDATE SKIP 
> LOCKED.
> 
> When there are a small number of concurrent connections to process the queue, 
> this seems to work as
>  expected and connections quickly obtain a unique block of 250 rows for 
> processing.
> However, as I scale up the number of concurrent connections, I see a spike in 
> CPU (to 100% across 80 cores)
>  when the SELECT FOR UPDATE SKIP LOCKED executes and the select processes 
> wait for multiple minutes
>  (10-20 minutes) before completing.  My use case requires around 256 
> concurrent processors for the queue
>  but I've been unable to scale beyond 128 without everything grinding to a 
> halt.
> 
> The queue table itself fits in RAM (with 2M hugepages) and during the wait, 
> all the performance counters
>  drop to almost 0 - no disk read or write (semi-expected due to the table 
> fitting in memory) with 100%
>  buffer hit rate in pg_top and row read around 100/s which is much smaller 
> than expected.
> 
> After processes complete the select and the number of waiting selects starts 
> to fall, CPU load falls and
>  then suddenly the remaining processes all complete within a few seconds and 
> things perform normally until
>  the next time there are a group of SELECT  FOR UPDATE statements which bunch 
> together and things then repeat.
> 
> I found that performing extremely frequent vacuum analyze (every 30 minutes) 
> helps a small amount but
>  this is not that helpful so problems are still very apparent.
> 
> I've exhausted all the performance tuning and analysis results I can find 
> that seem even a little bit
>  relevant but cannot get this cracked.
> 
> Is anyone on the list able to help with suggestions of what I can do to track 
> why this CPU hogging happens
>  as this does seem to be the root of the problem?

You should

- check with "pgstattuple" if the table is bloated.

- use "perf" to see where the CPU time is spent.

- look at "pg_stat_activity" for wait events (unlikely if the CPU is busy).

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com



Reply via email to