On Tue, 2020-08-18 at 19:52 -0400, Jim Jarvie wrote: > I have a system which implements a message queue with the basic pattern that > a process selects a group of, > for example 250, rows for processing via SELECT .. LIMIT 250 FOR UPDATE SKIP > LOCKED. > > When there are a small number of concurrent connections to process the queue, > this seems to work as > expected and connections quickly obtain a unique block of 250 rows for > processing. > However, as I scale up the number of concurrent connections, I see a spike in > CPU (to 100% across 80 cores) > when the SELECT FOR UPDATE SKIP LOCKED executes and the select processes > wait for multiple minutes > (10-20 minutes) before completing. My use case requires around 256 > concurrent processors for the queue > but I've been unable to scale beyond 128 without everything grinding to a > halt. > > The queue table itself fits in RAM (with 2M hugepages) and during the wait, > all the performance counters > drop to almost 0 - no disk read or write (semi-expected due to the table > fitting in memory) with 100% > buffer hit rate in pg_top and row read around 100/s which is much smaller > than expected. > > After processes complete the select and the number of waiting selects starts > to fall, CPU load falls and > then suddenly the remaining processes all complete within a few seconds and > things perform normally until > the next time there are a group of SELECT FOR UPDATE statements which bunch > together and things then repeat. > > I found that performing extremely frequent vacuum analyze (every 30 minutes) > helps a small amount but > this is not that helpful so problems are still very apparent. > > I've exhausted all the performance tuning and analysis results I can find > that seem even a little bit > relevant but cannot get this cracked. > > Is anyone on the list able to help with suggestions of what I can do to track > why this CPU hogging happens > as this does seem to be the root of the problem?
You should - check with "pgstattuple" if the table is bloated. - use "perf" to see where the CPU time is spent. - look at "pg_stat_activity" for wait events (unlikely if the CPU is busy). Yours, Laurenz Albe -- Cybertec | https://www.cybertec-postgresql.com