Hello Postgresql experts, We are facing issues with our PostgreSQL databases running on Ubuntu server, right after we shifted our databases from OpenSuse O/S.
It's a new database servers runs fine for most of the time (Avg. Load 0.5 to 1.0) but suddenly spikes once/twice a day.This happens four times in last three day and during this, simple update/select statements started taking minutes (1 to 5 Minutes) instead of 5-50 mSec. And this max out database 250 connections. This event halt all processes for about 15- 20 min and then everything back to normal. I verified checkpoint and vacuum related activities but this isn't showing any problem to me. (attached logs) Top/vmstat output shows all resources were suddenly utilized by %us during same time. iostat doesn't shows any IO related bottleneck. I have added completed logs for yesterday outage (13:45 to 14:15) . procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 44 0 0 201929344 345260 50775772 0 0 2 15 2 2 2 0 98 0 40 0 0 201919264 345260 50775864 0 0 0 224 9409 1663 98 1 1 0 40 0 0 201915344 345260 50775880 0 0 0 280 8803 1674 99 0 0 0 38 0 0 201911296 345260 50775888 0 0 0 156 8753 1469 99 0 0 0 40 0 0 201902416 345260 50775888 0 0 0 224 9060 2775 98 1 1 0 Free -m total used free shared buffers cached Mem: 251 59 192 0 0 48 -/+ buffers/cache: 10 241 Swap: 29 0 29 System information. Connections into our databases are coming from WebServer (running on PHP and Apache) and script servers (PHP).We have verified apache logs and we didn't find connection traffic during same interval. Hardware information: DELL PowerEdge R715 Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz Ubuntu 12.04.4 LTS kernel: 3.8.0-35-generic 64 bit Postgresql version: 9.0.13 RAM: 256 GB 32 Cores CPU ps_xlog : RAID 1 data folder : RAID10 (6 Strips) read:write ratio- 85:15 Pgbouncer configured on database side(250 allowed connections) Postgresql Configuration: Database Size: 28GB Vaccum analyzed daily checkpoint_completion_target = 0.9 maintenance_work_mem = 16GB shared_buffers = 8GB # we reduced this from 32 GB. max_connections = 300 checkpoint_segments = 32 checkpoint_timeout = 8min detailed postgresql configuration: http://pastie.org/8754957 checkpoint/vacuum information http://pastie.org/8754954 Top command o/p: http://pastie.org/8755007 iostat o/p: http://pastie.org/8755009 sysctl.configuration : http://pastie.org/8755197 We have recently upgraded O/S kernels to fix this issue but this it didn't help. We are tried to modify some O/S parameters based on some discussions- http://www.postgresql.org/message-id/50e4aab1.9040...@optionshouse.com vm.dirty_background_bytes = 33554432 # I reduced this based on some forums. vm.dirty_bytes = 536870912 vm.overcommit_memory=2 kernel.sched_migration_cost = 5000000 kernel.sched_autogroup_enabled = 0 We believe that our PostgreSQL configuration is not correct according to available memory on machine and need some urgent tuning into it. Could you please guide me on troubleshooting this issue. Thanks in advance. Ashutosh.D PSI.