Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-12 Thread Scott Marlowe
On Tue, Oct 10, 2017 at 4:28 PM, pinker wrote:
>
> Yes, it would be much easier if it would be just single query from the top,
> but the most cpu is eaten by the system itself and I'm not sure why.
You are experiencing a context switch storm. The OS is spending so much time trying to switch betwe
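
One way to check the context-switch-storm diagnosis above is to sample the cumulative `ctxt` counter that Linux exposes in `/proc/stat` and turn it into a per-second rate. The following is only a rough sketch: the function names and the one-second interval are illustrative choices, not something taken from the thread, and what counts as a "storm" depends on the hardware.

```python
import time


def read_context_switches(stat_text: str) -> int:
    """Return the cumulative context-switch count from /proc/stat contents."""
    for line in stat_text.splitlines():
        # The kernel reports the total as a line of the form "ctxt <count>".
        if line.startswith("ctxt "):
            return int(line.split()[1])
    raise ValueError("no 'ctxt' line found")


def context_switch_rate(interval: float = 1.0, path: str = "/proc/stat") -> float:
    """Sample the counter twice and return context switches per second."""
    with open(path) as f:
        before = read_context_switches(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = read_context_switches(f.read())
    return (after - before) / interval


if __name__ == "__main__":
    print(f"{context_switch_rate():.0f} context switches/s")
```

Comparing the rate during a quiet period with the rate during an incident is likely more informative than any absolute number.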

Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-10 Thread Tomas Vondra
On 10/11/2017 02:26 AM, pinker wrote:
> Tomas Vondra-4 wrote
>> I'm probably a bit dumb (after all, it's 1AM over here), but can you
>> explain the CPU chart? I'd understand percentages (say, 75% CPU used)
>> but what do the seconds / fractions mean? E.g. when the system time
>> reaches 5 seconds

Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-10 Thread Justin Pryzby
On Tue, Oct 10, 2017 at 01:40:07PM -0700, pinker wrote:
> Hi to all!
>
> We've got problem with a very serious repetitive incident on our core
> system. Namely, cpu load spikes to 300-400 and the whole db becomes
> unresponsive. From db point of view nothing special is happening, memory
> looks fi

Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-10 Thread pinker
Andres Freund wrote
> Others mentioned already that that's worth improving.
Yes, we are just setting up pgbouncer
Andres Freund wrote
> Some versions of this kernel have had serious problems with transparent
> hugepages. I'd try turning that off. I think it defaults to off even in
> that version
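
To see whether transparent hugepages are actually on before changing anything, the current mode can be read from sysfs. This is a minimal sketch, assuming the mainline kernel path `/sys/kernel/mm/transparent_hugepage/enabled`; some distribution kernels expose the setting elsewhere, so the path and the helper name `active_thp_mode` should be treated as assumptions.

```python
from pathlib import Path

# Assumed location on mainline kernels; distribution kernels may differ.
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def active_thp_mode(contents: str) -> str:
    """Return the active mode from a string such as 'always madvise [never]'.

    The kernel lists every available mode and marks the active one
    with square brackets.
    """
    for token in contents.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError("no bracketed mode found")


if __name__ == "__main__":
    if THP_ENABLED.exists():
        print("THP mode:", active_thp_mode(THP_ENABLED.read_text()))
    else:
        print("THP setting not found at", THP_ENABLED)
```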

Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-10 Thread pinker
Tomas Vondra-4 wrote
> I'm probably a bit dumb (after all, it's 1AM over here), but can you
> explain the CPU chart? I'd understand percentages (say, 75% CPU used)
> but what do the seconds / fractions mean? E.g. when the system time
> reaches 5 seconds, what does that mean?
hehe, no you've just s

Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-10 Thread Andres Freund
Hi,
On 2017-10-10 13:40:07 -0700, pinker wrote:
> and the total number of connections are increasing very fast (but I suppose
> it's the symptom not the root cause of cpu load) and exceed max_connections
> (1000).
Others mentioned already that that's worth improving.
> System:
> * CentOS Linux r

Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-10 Thread Tomas Vondra
On 10/11/2017 12:28 AM, pinker wrote:
> Tomas Vondra-4 wrote
>> What is "CPU load"? Perhaps you mean "load average"?
>
> Yes, I wasn't exact: I mean system cpu usage, it can be seen here - it's the
> graph from yesterday's failure (after 6p.m.):
>

Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-10 Thread pinker
Victor Yegorov wrote
> Looks like `sdg` and `sdm` are the ones used most.
> Can you describe what's on those devices? Do you have WAL and DB sitting
> together?
> Where DB log files are stored?
it's multipath with the same LUN for PGDATA and pg_log, but separate one for xlogs and archives. mpatha

Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-10 Thread John R Pierce
On 10/10/2017 3:28 PM, pinker wrote:
> It was exactly my first guess. work_mem is set to ~ 350MB and I see a lot of
> stored procedures with unnecessary WITH clauses (i.e. materialization) and
> right after it IN query with results of that (hash).
1000 connections all doing queries that need 1 work_m
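
The arithmetic behind this concern can be made explicit. The sketch below uses the two figures quoted in the thread (work_mem of roughly 350MB and 1000 connections); the function name and the `nodes_per_query` parameter are hypothetical, added only to reflect that PostgreSQL applies work_mem per sort or hash operation, so a single query can use it more than once.

```python
def worst_case_work_mem_gb(connections: int, work_mem_mb: float,
                           nodes_per_query: int = 1) -> float:
    """Upper bound on memory (in GB) if every connection uses work_mem fully.

    nodes_per_query is an illustrative multiplier for queries that contain
    several sort/hash operations, each of which may use up to work_mem.
    """
    return connections * work_mem_mb * nodes_per_query / 1024


if __name__ == "__main__":
    # Figures from the thread: ~350MB work_mem, max_connections = 1000.
    total = worst_case_work_mem_gb(connections=1000, work_mem_mb=350)
    print(f"worst case: about {total:.0f} GB")
```

This is a ceiling, not a prediction: most queries use far less than work_mem, but the bound shows why a connection spike combined with a large work_mem can exhaust memory.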

Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-10 Thread pinker
Tomas Vondra-4 wrote
> What is "CPU load"? Perhaps you mean "load average"?
Yes, I wasn't exact: I mean system cpu usage, it can be seen here - it's the graph from yesterday's failure (after 6p.m.):
So as one can see connections spikes foll

Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-10 Thread Victor Yegorov
2017-10-11 0:53 GMT+03:00 pinker:
> > Can you provide output of `iostat -myx 10` at the “peak” moments, please?
>
> sure, please find it here:
> https://pastebin.com/f2Pv6hDL
Looks like `sdg` and `sdm` are the ones used most. Can you describe what's on those devices? Do you have WAL and DB sitt

Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-10 Thread pinker
Scott Marlowe-2 wrote
> Ouch, unless I'm reading that wrong, your IO subsystem seems to be REALLY
> slow.
it's a huge array where a lot is happening, for instance data snapshots :/ the lun on which is this db is dm-7. I'm a DBA with null knowledge about arrays so any advice will be much appreciate

Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-10 Thread Scott Marlowe
On Tue, Oct 10, 2017 at 3:53 PM, pinker wrote:
> Victor Yegorov wrote
>> Can you provide output of `iostat -myx 10` at the “peak” moments, please?
>
> sure, please find it here:
> https://pastebin.com/f2Pv6hDL
Ouch, unless I'm reading that wrong, your IO subsystem seems to be REALLY slow. -- S

Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-10 Thread pinker
Victor Yegorov wrote
> Can you provide output of `iostat -myx 10` at the “peak” moments, please?
sure, please find it here: https://pastebin.com/f2Pv6hDL
Victor Yegorov wrote
> Also, it'd be good to look in more detailed bgwriter/checkpointer stats.
> You can find more details in this post: http

Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-10 Thread Tomas Vondra
On 10/10/2017 10:40 PM, pinker wrote:
> Hi to all!
>
> We've got problem with a very serious repetitive incident on our core
> system. Namely, cpu load spikes to 300-400 and the whole db becomes
What is "CPU load"? Perhaps you mean "load average"? Also, what are the basic system parameters (nu
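
The distinction between "CPU load" and "load average" matters because a load average only means something relative to the number of cores. A small sketch of that comparison, using the standard-library calls `os.getloadavg()` and `os.cpu_count()`; the helper name `load_per_core` is illustrative and not from the thread.

```python
import os


def load_per_core(load_1min: float, cores: int) -> float:
    """Normalize a load average by core count.

    Values well above 1.0 mean more runnable (or uninterruptibly waiting)
    tasks than the machine has cores to serve them.
    """
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_1min / cores


if __name__ == "__main__":
    load1, load5, load15 = os.getloadavg()
    cores = os.cpu_count() or 1
    print(f"1-min load {load1:.2f} on {cores} cores "
          f"= {load_per_core(load1, cores):.2f} per core")
```

On Linux the load average also counts tasks blocked in uninterruptible I/O wait, so a high value does not by itself prove the CPUs are the bottleneck.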

Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-10 Thread Victor Yegorov
2017-10-10 23:40 GMT+03:00 pinker:
> We've got problem with a very serious repetitive incident on our core
> system. Namely, cpu load spikes to 300-400 and the whole db becomes
> unresponsive. From db point of view nothing special is happening, memory
> looks fine, disks io's are ok and the only

Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-10 Thread pinker
Thank you Scott, we are planning to do it today. But are you sure it will help in this case?

Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-10 Thread Scott Marlowe
On Tue, Oct 10, 2017 at 2:40 PM, pinker wrote:
> Hi to all!
>
> We've got problem with a very serious repetitive incident on our core
> system. Namely, cpu load spikes to 300-400 and the whole db becomes
> unresponsive. From db point of view nothing special is happening, memory
> looks fine, disks