On Fri, 5 Jun 2020 at 12:07, Krzysztof Olszewski <kolsze...@gmail.com> wrote:

> I have a problem with one of my Postgres production servers. The server
> works fine almost always, but sometimes, without any increase in the
> number of transactions or statements, the machine gets stuck. Cores go up
> to 100%, load up to 160%. When this happens, there are problems connecting
> to the database, and even when a connection succeeds, simple queries take
> several seconds instead of milliseconds. The problem sometimes stops after
> a period of time (e.g. 35 min); sometimes we must restart Postgres, Linux,
> or even KVM (which is the virtualization host).
>
> My hardware
> 56 cores (Intel Core Processor (Skylake, IBRS))
> 400 GB RAM
> RAID10 with about 40k IOPS
>
> OS
> CentOS Linux release 7.7.1908
> kernel 3.10.0-1062.18.1.el7.x86_64
>
> Database
> size 100 GB (entirely fits in memory :) )
> server_version 10.12
> effective_cache_size 192000 MB
> maintenance_work_mem 2048 MB
> max_connections 150
> shared_buffers 64000 MB
> work_mem 96 MB
>
> In the normal state, I have about 500 tps, 5% usage of cores, about 3% of
> load, the whole database fits in memory, no reads from disk, only writes
> at about 500 IOPS, sometimes with spikes to 1500 IOPS, but on this
> hardware there is no problem with these values (no iowaits on cores). In
> the normal state this machine does "nothing". Connections to the database
> are created by two Java-based app servers through connection pools, so the
> connection count is limited by the pool configuration; the max is 120,
> which is lower than the value in the Postgres configuration (150). In the
> normal state there are about 20 connections; when stuck, the count goes to
> the max (120).
>
> In correlation with the stalls I see messages in the kernel log like
> NMI watchdog: BUG: soft lockup - CPU#25 stuck for 23s! [postmaster:33935]
> but I don't know whether this is the cause or an effect of the problem.
> I investigated with pgBadger and ... nothing strange happens, just
> normal statements.
>
> Any ideas?
>

You can try to install perf and the debug symbols for Postgres.
The next time this problem occurs, run "perf top" to see which routines
are eating your CPU.

It may be a spinlock problem:

https://www.postgresql.org/message-id/CAHyXU0yAsVxoab2PcyoCuPjqymtnaE93v7bN4ctv2aNi92fefA%40mail.gmail.com

Your answer to Merlin's question from that mail would be interesting:

cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
cat /sys/kernel/mm/redhat_transparent_hugepage/defrag

Regards

Pavel

>
> Thanks,
> Kris
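
The two settings asked about above can be collected in one go. This is only a rough sketch: the function name show_thp and the THP_DIRS override are made up for illustration, and it assumes that on kernels without the redhat_transparent_hugepage directory the same files live under the mainline path /sys/kernel/mm/transparent_hugepage.

```shell
#!/bin/sh
# Print the transparent hugepage (THP) settings from whichever sysfs
# directory this kernel exposes. THP_DIRS is a hypothetical override
# (space-separated directories); by default both known locations are tried.
show_thp() {
    for dir in ${THP_DIRS:-/sys/kernel/mm/redhat_transparent_hugepage /sys/kernel/mm/transparent_hugepage}; do
        for name in enabled defrag; do
            file="$dir/$name"
            # Skip paths that do not exist on this kernel.
            if [ -r "$file" ]; then
                # The active value is the one shown in [brackets].
                printf '%s: %s\n' "$file" "$(cat "$file")"
            fi
        done
    done
}

show_thp
```

Each line of output is the file path followed by its content; the value in square brackets is the one currently in effect.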