Merlin Moncure wrote:
> Didn't we just discuss this exact problem on the identically named
> thread?
> http://postgresql.1045698.n5.nabble.com/High-SYS-CPU-need-advise-td5732045.html
Ignore this guy. It's a bot reinjecting old messages, or something like
that, probably because of some bug i
On Sun, Dec 2, 2012 at 9:08 AM, rahul143 wrote:
> Hello everyone,
>
> I'm seeking help in diagnosing / figuring out the issue that we have with
> our DB server:
>
> Under some (relatively non-heavy) load: 300...400 TPS, every 10-30 seconds
> server drops into high cpu system usage (90%+ SYSTEM acr
Hello everyone,
I'm seeking help in diagnosing / figuring out the issue that we have with
our DB server:
Under some (relatively non-heavy) load: 300...400 TPS, every 10-30 seconds
server drops into high cpu system usage (90%+ SYSTEM across all CPUs - it's
pure SYS cpu, i.e. it's not io wait, not
RAID10
-- vlad
On 11/24/2012 3:17 PM, Gavin Flower wrote:
Curious, what is your RAID configuration?
On Wed, Nov 21, 2012 at 8:14 AM, Vlad wrote:
>
>
> '-M prepared' produces normal results, while '-M simple' results in 40% sys
> cpu. '-M extended' is somewhere in between.
> I'm running it as 60 clients, 2 threads.
2 threads is pretty low for 60 clients. What happens if you increase
-j to eith
On Tue, Nov 20, 2012 at 12:00 PM, Merlin Moncure wrote:
> On Tue, Nov 20, 2012 at 12:16 PM, Jeff Janes wrote:
>>
>> The freelist should never loop. It is written as a loop, but I think
>> there is currently no code path which ends up with valid buffers being
>> on the freelist, so that loop will
On Tue, Nov 20, 2012 at 4:08 PM, Jeff Janes wrote:
>> It strikes me as cavalier to be resetting
>> trycounter while sitting under the #1 known contention point for read
>> only workloads.
>
> The only use for the trycounter is to know when to ERROR out with "no
> unpinned buffers available", so no
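For readers following this exchange, below is a rough, self-contained C sketch of the clock-sweep victim search being discussed. It is an illustration only, not the PostgreSQL source: the buffer count, field names and setup are invented. It shows the two points in question - the trycounter reset whenever a usage count is decremented, and the fact that trycounter only reaches zero (the "no unpinned buffers available" case) after a full pass finds every buffer pinned.

/*
 * Toy simulation of the buffer-manager clock sweep (illustration only,
 * not PostgreSQL source).  NBUFFERS, Buffer and the setup in main()
 * are invented for the example.
 */
#include <stdio.h>

#define NBUFFERS 8

typedef struct { int refcount; int usage_count; } Buffer;

static Buffer buffers[NBUFFERS];
static int next_victim = 0;

/* Return the index of an evictable buffer, or -1 for
 * "no unpinned buffers available". */
static int clock_sweep(void)
{
    int trycounter = NBUFFERS;

    for (;;)
    {
        int victim = next_victim;
        Buffer *buf = &buffers[victim];

        next_victim = (next_victim + 1) % NBUFFERS;

        if (buf->refcount == 0)
        {
            if (buf->usage_count > 0)
            {
                /* Recently used: age it and keep scanning.  This is the
                 * trycounter reset being questioned above. */
                buf->usage_count--;
                trycounter = NBUFFERS;
            }
            else
                return victim;      /* unpinned and unused: take it */
        }
        else if (--trycounter == 0)
            return -1;              /* a whole pass saw only pinned buffers */
    }
}

int main(void)
{
    /* Pin every buffer except one, and give that one a usage count of 2. */
    for (int i = 0; i < NBUFFERS; i++)
        buffers[i].refcount = 1;
    buffers[5].refcount = 0;
    buffers[5].usage_count = 2;

    printf("victim = %d\n", clock_sweep());   /* prints "victim = 5" */
    return 0;
}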
On 25/11/12 11:11, Kevin Grittner wrote:
Gavin Flower wrote:
We found that the real-world production performance of a web
application servicing millions of web hits per day with thousands
of concurrent users improved when we reconfigured our database
connection pool to be about 35 instead of 55,
it's session mode and the pool size is 1200 (because I need to guarantee that
in the worst case we have enough slots for all possible clients); however,
even at the times preceding a high-cpu-sys stall, the number of postmasters is
like 15-20. When the stall happens, it starts to rise, but that's the result of
Gavin Flower wrote:
>> We found that the real-world production performance of a web
>> application servicing millions of web hits per day with thousands
>> of concurrent users improved when we reconfigured our database
>> connection pool to be about 35 instead of 55, on a 16 core box
>> with a 40 d
> what pgbouncer mode, and how large is your pool.
>
>
'-M prepared' produces normal results, while '-M simple' results in 40% sys
cpu. '-M extended' is somewhere in between.
I'm running it as 60 clients, 2 threads.
-- Vlad
On Wed, Nov 21, 2012 at 11:05 AM, Vlad wrote:
> it's session mode and the pool size is 1200 (cause I need to guarantee that in
> the worst case we have enough slots for all possible clients), however even
> at the times preceding high-cpu-sys-stall, the number postmasters are like
> 15-20. When sta
On 25/11/12 09:30, Kevin Grittner wrote:
Vlad wrote:
it's session mode and the pool size is 1200 (cause I need to
guarantee that in the worst case we have enough slots for all
possible clients),
We found that the real-world production performance of a web
application servicing millions of web hit
Vlad wrote:
> it's session mode and the pool size is 1200 (cause I need to
> guarantee that in the worst case we have enough slots for all
> possible clients),
We found that the real-world production performance of a web
application servicing millions of web hits per day with thousands of
concurrent
nothing changes if I increase the number of threads.
pgbouncer doesn't change much.
also, I think the nature of the high sys CPU during a stall and when I run
pgbench is different.
During pgbench it's constantly at 30-40%, while during stall it sits at low
5-15% and then spikes to 90% after a while, wit
On Wed, Nov 21, 2012 at 10:43 AM, Jeff Janes wrote:
> On Wed, Nov 21, 2012 at 7:29 AM, Vlad Marchenko wrote:
>
>> update on my problem: despite pgbouncer, the problem still occurs on my
>> end.
>
> As Merlin asked, how big is the pool? Maybe you are using a large
> enough pool so as to defeat t
On Wed, Nov 21, 2012 at 9:05 AM, Vlad wrote:
> it's session mode and the pool size is 1200 (cause I need to guarantee that in
> the worst case we have enough slots for all possible clients),
Wouldn't the clients prefer to wait 100ms to get a connection if that
means their query finishes in 100ms,
On Tue, Nov 20, 2012 at 12:16 PM, Jeff Janes wrote:
> On Tue, Nov 20, 2012 at 9:05 AM, Merlin Moncure wrote:
>> On Tue, Nov 20, 2012 at 10:50 AM, Jeff Janes wrote:
>>>
>>> I wouldn't expect so. Increasing shared_buffers should either fix
>>> free list lock contention, or leave it unchanged, not
On Tue, Nov 20, 2012 at 9:05 AM, Merlin Moncure wrote:
> On Tue, Nov 20, 2012 at 10:50 AM, Jeff Janes wrote:
>>
>> I wouldn't expect so. Increasing shared_buffers should either fix
>> free list lock contention, or leave it unchanged, not make it worse.
>
> AIUI, that is simply not true (unless y
ok, understood.
I need to give some more thought to whether it's possible for us to switch to
transaction mode from the app's standpoint.
If yes, then does setting the pool size to 20 (for 8 cores) sound OK?
-- Vlad
On Wed, Nov 21, 2012 at 11:56 AM, Vlad wrote:
> ok, understood.
> I need to give some more thoughts to if it's possible for us to switch to
> transaction mode from app standpoint of view.
>
> if yes, then setting pool size to 20 (for 8 cores) sounds OK?
If it were me, I would be starting with exa
It turned out we can't use transaction mode, because prepared statements are
used a lot within the code while processing a single HTTP request.
Also, I can't 100% rule out that there won't be any long-running
(statistical) queries launched (even though such requests should not come
to this databas
On Wed, Nov 21, 2012 at 12:17 PM, Vlad wrote:
> It turned out we can't use transaction mode, cause there are prepared
> statements used a lot within code, while processing a single http request.
Prepared statements can be fudged within some constraints. If prepared
statements are explicitly named
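As a side note on why this matters for the app in question: a named prepared statement is scoped to the server session that created it, so under a transaction-pooling pgbouncer the next transaction may run on a different backend where the statement does not exist. The libpq sketch below illustrates that failure mode; it is a hedged example, not the app's actual (Perl/Apache::DBI) code, and the connection string, port and statement name are invented.

/*
 * Illustration of session-scoped prepared statements vs. transaction
 * pooling (hedged sketch; conninfo and statement name are made up).
 */
#include <stdio.h>
#include <libpq-fe.h>

int main(void)
{
    /* Connect through pgbouncer (6432 is its conventional port). */
    PGconn *conn = PQconnectdb("host=127.0.0.1 port=6432 dbname=app");
    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* Prepare once, on whatever server connection we hold right now. */
    PGresult *res = PQprepare(conn, "fetch_user",
                              "SELECT * FROM users WHERE id = $1", 1, NULL);
    PQclear(res);

    /* Later, possibly after the pooler has swapped the server connection
     * between transactions, the named statement may be gone. */
    const char *params[1] = {"42"};
    res = PQexecPrepared(conn, "fetch_user", 1, params, NULL, NULL, 0);
    if (PQresultStatus(res) != PGRES_TUPLES_OK)
        fprintf(stderr, "execute failed: %s", PQerrorMessage(conn));
    PQclear(res);

    PQfinish(conn);
    return 0;
}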
Merlin,
On Wed, Nov 21, 2012 at 2:17 PM, Merlin Moncure wrote:
> On Wed, Nov 21, 2012 at 12:17 PM, Vlad wrote:
> > It turned out we can't use transaction mode, cause there are prepared
> > statements used a lot within code, while processing a single http request.
>
> prepare statements can be fu
On Wed, Nov 21, 2012 at 7:29 AM, Vlad Marchenko wrote:
> update on my problem: despite pgbouncer, the problem still occurs on my
> end.
As Merlin asked, how big is the pool? Maybe you are using a large
enough pool so as to defeat the purpose of restricting the number of
connections.
> Also,
On Wed, Nov 21, 2012 at 9:29 AM, Vlad Marchenko wrote:
> update on my problem: despite pgbouncer, the problem still occurs on my
> end.
>
> Also, interesting observation - I ran several tests with pgbench, using
> queries that I think are prone to trigger high-sys-cpu-stall. What I noticed
> is w
update on my problem: despite pgbouncer, the problem still occurs on my
end.
Also, interesting observation - I ran several tests with pgbench, using
queries that I think are prone to trigger high-sys-cpu-stall. What I
noticed is when pgbench is started with prepared mode, the system
behaves
On 21/11/12 11:41, Shaun Thomas wrote:
On 11/20/2012 04:35 PM, Jeff Janes wrote:
Atomic update commit failure in the meatware :)
Ha.
What's actually funny is that one of the affected machines started
*swapping* earlier today. With 15GB free, and 12GB of inactive cache,
and vm.swappiness se
On 11/20/2012 04:35 PM, Jeff Janes wrote:
Atomic update commit failure in the meatware :)
Ha.
What's actually funny is that one of the affected machines started
*swapping* earlier today. With 15GB free, and 12GB of inactive cache,
and vm.swappiness set to 0, it somehow decided there was eno
On Tue, Nov 20, 2012 at 2:26 PM, Shaun Thomas wrote:
> On 11/20/2012 04:08 PM, Jeff Janes wrote:
>
>> Shaun Thomas reports one that is (I assume) not read intensive, but
>> his diagnosis is that this is a kernel bug where a larger
>> shared_buffers for no good reason causes the kernel to kill off
On 11/20/2012 04:08 PM, Jeff Janes wrote:
Shaun Thomas reports one that is (I assume) not read intensive, but
his diagnosis is that this is a kernel bug where a larger
shared_buffers for no good reason causes the kernel to kill off its
page cache.
We're actually very read intensive. According
What would pgbouncer do in my case? The number of connections will decrease,
but the number of active clients won't be smaller. As I understand it, the
latter is what's important.
-- Vlad
On Fri, Nov 16, 2012 at 2:31 PM, Merlin Moncure wrote:
>
> first thoughts:
> no single thing really stands out --
On Tue, Nov 20, 2012 at 10:50 AM, Jeff Janes wrote:
> On Tue, Nov 20, 2012 at 8:03 AM, Merlin Moncure wrote:
>> On Tue, Nov 20, 2012 at 9:02 AM, Shaun Thomas
>> wrote:
>>> On 11/16/2012 02:31 PM, Merlin Moncure wrote:
>>>
no single thing really stands out -- contention is all over the plac
On Tue, Nov 20, 2012 at 8:03 AM, Merlin Moncure wrote:
> On Tue, Nov 20, 2012 at 9:02 AM, Shaun Thomas
> wrote:
>> On 11/16/2012 02:31 PM, Merlin Moncure wrote:
>>
>>> no single thing really stands out -- contention is all over the place.
>>> lwlock, pinbuffer, dynahash (especially). I am again
On Thu, Nov 15, 2012 at 4:29 PM, Alvaro Herrera
wrote:
> Merlin Moncure wrote:
>
>> ok, excellent. reviewing the log, this immediately caught my eye:
>>
>> recvfrom(8, "\27\3\1\0@", 5, 0, NULL, NULL) = 5
>> recvfrom(8,
>> "\327\327\nl\231LD\211\346\243@WW\254\244\363C\326\247\341\177\255\263
On 11/20/2012 10:13 AM, Merlin Moncure wrote:
have you ruled out numa issues?
(http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html)
Haha. Yeah. Our zone reclaim mode is off, and node distance is 10 or 20.
ZCM is only enabled by default if distance is > 20, unle
On Tue, Nov 20, 2012 at 10:12 AM, Shaun Thomas wrote:
> On 11/20/2012 10:03 AM, Merlin Moncure wrote:
>
>> Shared buffer manipulation changing contention is suggesting you're
>> running into free list lock issues. How many active backends/cores?
>
>
> Oh, the reason I wanted to point it out was t
On 11/20/2012 10:03 AM, Merlin Moncure wrote:
Shared buffer manipulation changing contention is suggesting you're
running into free list lock issues. How many active backends/cores?
Oh, the reason I wanted to point it out was that we see a lot more than
CPU contention with higher shared_buff
On Tue, Nov 20, 2012 at 9:02 AM, Shaun Thomas wrote:
> On 11/16/2012 02:31 PM, Merlin Moncure wrote:
>
>> no single thing really stands out -- contention is all over the place.
>> lwlock, pinbuffer, dynahash (especially). I am again suspicious of
>> bad scheduler interaction. any chance we can f
On Fri, Nov 16, 2012 at 12:13 PM, Vlad wrote:
> ok, I've applied that patch and ran. The stall started around 13:50:45...50
> and lasted until the end
>
> https://dl.dropbox.com/u/109778/postgresql-2012-11-16_134904-stripped.log
That isn't as much log as I expected. But I guess only the tip of t
On 11/16/2012 02:31 PM, Merlin Moncure wrote:
no single thing really stands out -- contention is all over the place.
lwlock, pinbuffer, dynahash (especially). I am again suspicious of
bad scheduler interaction. any chance we can fire up pgbouncer?
Just want to throw it out there, but we've b
On Fri, Nov 16, 2012 at 2:13 PM, Vlad wrote:
> ok, I've applied that patch and ran. The stall started around 13:50:45...50
> and lasted until the end
>
> https://dl.dropbox.com/u/109778/postgresql-2012-11-16_134904-stripped.log
>
> the actual log has more data (including statement following each '
On Mon, Nov 19, 2012 at 12:02 PM, Vlad wrote:
>
> Some additional observation and food for thoughts. Our app uses connection
> caching (Apache::DBI). By disabling Apache::DBI and forcing client
> re-connection for every (http) request processed I eliminated the stall. The
> user cpu usage jumped (
ok, I've applied that patch and ran. The stall started around 13:50:45...50
and lasted until the end
https://dl.dropbox.com/u/109778/postgresql-2012-11-16_134904-stripped.log
the actual log has more data (including the statement following each 'spin
delay' record), but there is some sensitive info in
On Fri, Nov 16, 2012 at 3:21 PM, Vlad wrote:
> what would pgbouncer do in my case? Number of connections will decrease, but
> number of active clients won't be smaller. As I understand the latter ones
> are that important.
Well, one thing that struck me was how little spinlock contention
there ac
Some additional observations and food for thought. Our app uses connection
caching (Apache::DBI). By disabling Apache::DBI and forcing
client re-connection for every (HTTP) request processed, I eliminated the
stall. The user CPU usage jumped (mostly because prepared SQL queries are no
longer available
On Mon, Nov 19, 2012 at 10:50 AM, Vlad wrote:
> I just did a little experiment: extracted top four queries that were
> executed the longest during stall times and launched pgbench test with 240
> clients. Yet I wasn't able to put the server into a stall with that. Also
> load average was hitting
On Sun, Nov 18, 2012 at 4:24 PM, Jeff Janes wrote:
> On Fri, Nov 16, 2012 at 12:13 PM, Vlad wrote:
>> ok, I've applied that patch and ran. The stall started around 13:50:45...50
>> and lasted until the end
>>
>> https://dl.dropbox.com/u/109778/postgresql-2012-11-16_134904-stripped.log
>
> That is
I just did a little experiment: extracted the top four queries that were
executed the longest during stall times and launched a pgbench test with 240
clients. Yet I wasn't able to put the server into a stall with that. Also,
load average was hitting 120+; it was all user CPU, single-digit % system.
The s
On Fri, Nov 16, 2012 at 12:26 PM, Jeff Janes wrote:
> On Fri, Nov 16, 2012 at 8:21 AM, Merlin Moncure wrote:
>> On Fri, Nov 16, 2012 at 9:52 AM, Vlad wrote:
>>>
*) failing that, LWLOCK_STATS macro can be compiled in to give us some
information about the particular lock(s) we're binding
On Fri, Nov 16, 2012 at 11:19 AM, Vlad wrote:
>
>> We're looking for spikes in 'blk' which represents when lwlocks bump.
>> If you're not seeing any then this is suggesting a buffer pin related
>> issue -- this is also supported by the fact that raising shared
>> buffers didn't help. If you're n
OK, so far I have settled on excluding connection caching on the app side
(Apache::DBI and prepare_cached) from the equation and adding pgbouncer as a
counter-measure. This seems to stabilize the situation - at least I'm
not able to push the server into a high-sys-cpu stall the way I used to.
I'm still in
On Fri, Nov 16, 2012 at 8:21 AM, Merlin Moncure wrote:
> On Fri, Nov 16, 2012 at 9:52 AM, Vlad wrote:
>>
>>> *) failing that, LWLOCK_STATS macro can be compiled in to give us some
>>> information about the particular lock(s) we're binding on. Hopefully
>>> it's a lwlock -- this will make diagnos
> We're looking for spikes in 'blk' which represents when lwlocks bump.
> If you're not seeing any then this is suggesting a buffer pin related
> issue -- this is also supported by the fact that raising shared
> buffers didn't help. If you're not seeing 'blk's, go ahead and
> disable the stats mac
On Fri, Nov 16, 2012 at 9:52 AM, Vlad wrote:
> Merlin,
>
>
>> Yeah -- you're right, this is definitely spinlock issue. Next steps:
>>
>> *) in mostly read workloads, we have a couple of known frequent
>> offenders. In particular the 'BufFreelistLock'. One way we can
>> influence that guy is to
Merlin,
Yeah -- you're right, this is definitely a spinlock issue. Next steps:
*) in mostly read workloads, we have a couple of known frequent
offenders. In particular the 'BufFreelistLock'. One way we can
influence that guy is to try and significantly lower/raise shared
buffers. So this is o
On Thu, Nov 15, 2012 at 6:07 PM, Jeff Janes wrote:
> On Thu, Nov 15, 2012 at 2:44 PM, Merlin Moncure wrote:
>
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
selec
On Thu, Nov 15, 2012 at 2:44 PM, Merlin Moncure wrote:
>>> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
>>> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
>>> select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
>>> select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
>>> select
Tom,
I just checked the version I'm running (9.1.6), and the code is quite
similar (src/backend/storage/lmgr/s_lock.c)
pg_usleep(cur_delay * 1000L);
#if defined(S_LOCK_TEST)
fprintf(stdout, "*");
fflush(stdout);
#endif
/* increase delay by a random fraction between 1X and 2X */
cur_delay += (int) (cur_delay *
    ((double) random() / (double) MAX_RANDOM_VALUE) + 0.5);
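To make the strace pattern discussed earlier concrete, here is a small self-contained sketch of that back-off rule (an illustration, not PostgreSQL source; the constant names and fixed seed are invented): the delay starts at 1 ms and grows by a random fraction between 1X and 2X per sleep, capped here at 1 second, which is why the traced select() timeouts sit at 1000 µs for a few iterations and then step up to 2000, 3000 and so on.

/*
 * Illustration of the sleep back-off above (not PostgreSQL source;
 * MIN_DELAY_MS, MAX_DELAY_MS and the seed are made up for the example).
 */
#include <stdio.h>
#include <stdlib.h>

#define MIN_DELAY_MS 1
#define MAX_DELAY_MS 1000

int main(void)
{
    int cur_delay = MIN_DELAY_MS;

    srand(42);          /* fixed seed so the example run is repeatable */

    for (int i = 0; i < 10; i++)
    {
        /* Roughly what appears in strace as select(..., {0, cur_delay * 1000}) */
        printf("sleep %d ms\n", cur_delay);

        /* increase delay by a random fraction between 1X and 2X */
        cur_delay += (int) (cur_delay * ((double) rand() / (double) RAND_MAX) + 0.5);
        if (cur_delay > MAX_DELAY_MS)
            cur_delay = MAX_DELAY_MS;
    }
    return 0;
}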
Merlin Moncure writes:
> What I've been scratching my head over is what code exactly would
> cause an iterative sleep like the above. The code is here:
> pg_usleep(cur_delay * 1000L);
> /* increase delay by a random fraction between 1X and 2X */
> cur_delay += (int) (cur_delay *
>
On Thu, Nov 15, 2012 at 4:29 PM, Alvaro Herrera
wrote:
> Merlin Moncure wrote:
>
>> ok, excellent. reviewing the log, this immediately caught my eye:
>>
>> recvfrom(8, "\27\3\1\0@", 5, 0, NULL, NULL) = 5
>> recvfrom(8,
>> "\327\327\nl\231LD\211\346\243@WW\254\244\363C\326\247\341\177\255\263
sorry - no panics / errors in the log...
-- Vlad
Merlin Moncure wrote:
> ok, excellent. reviewing the log, this immediately caught my eye:
>
> recvfrom(8, "\27\3\1\0@", 5, 0, NULL, NULL) = 5
> recvfrom(8,
> "\327\327\nl\231LD\211\346\243@WW\254\244\363C\326\247\341\177\255\263~\327HDv-\3466\353"...,
> 64, 0, NULL, NULL) = 64
> select(0, N
On Thu, Nov 15, 2012 at 3:49 PM, Merlin Moncure wrote:
> On Thu, Nov 15, 2012 at 2:44 PM, Vlad wrote:
>>
>>>
>>> yeah. ok, next steps:
>>> *) can you confirm that postgres process is using high cpu (according
>>> to top) during stall time
>>
>>
>> yes, CPU is spread across a lot of postmasters
>
On Thu, Nov 15, 2012 at 2:44 PM, Vlad wrote:
>
>>
>> yeah. ok, next steps:
>> *) can you confirm that postgres process is using high cpu (according
>> to top) during stall time
>
>
> yes, CPU is spread across a lot of postmasters
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ CO
>
> yeah. ok, next steps:
> *) can you confirm that postgres process is using high cpu (according
> to top) during stall time
>
yes, CPU is spread across a lot of postmasters
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
29863 pgsql 20 0 3636m 102m 36m R 19.1 0.3
On Thu, Nov 15, 2012 at 2:20 PM, Vlad wrote:
> Merlin,
>
> this is not my report, probably from a thread that I've referenced as having
> common symptoms. Here is info about my db:
>
>
> Postgresql 9.1.6.
> Postgres usually has 400-500 connected clients, most of them are idle.
> Database is over
Merlin,
this is not my report, probably from a thread that I've referenced as
having common symptoms. Here is info about my db:
PostgreSQL 9.1.6.
Postgres usually has 400-500 connected clients, most of them idle.
The database has over 1000 tables (across 5 namespaces), taking ~150GB on disk.
We
On Thu, Nov 15, 2012 at 11:50 AM, Vlad wrote:
> there is no big spike of queries that cause that, queries come in relatively
> stable pace. It's just when the higher rate of queries coming, the more
> likely this to happen. yes, when stall happens , the active queries pile up
> - but that's the r
there is no big spike of queries that causes this; queries come in at a
relatively stable pace. It's just that the higher the rate of incoming
queries, the more likely this is to happen. Yes, when a stall happens, the
active queries pile up - but that's the result of the stall (the server
reacts slowly to a keypress,
On Wed, Nov 14, 2012 at 4:08 PM, John R Pierce wrote:
> On 11/14/12 1:34 PM, Vlad wrote:
>>
>> thanks for your feedback. While implementing connection pooling would make
>> resources utilization more efficient, I don't think it's the root of my
>> problem. Most of the connected clients are at IDLE
On 11/14/12 1:34 PM, Vlad wrote:
thanks for your feedback. While implementing connection pooling would
make resources utilization more efficient, I don't think it's the root
of my problem. Most of the connected clients are at IDLE. When I do
select * from pg_stat_activity where current_query n
John,
thanks for your feedback. While implementing connection pooling would make
resource utilization more efficient, I don't think it's the root of my
problem. Most of the connected clients are at IDLE. When I do
select * from pg_stat_activity where current_query not like '%IDLE%';
I only see
On 11/14/12 1:13 PM, Vlad wrote:
Postgresql 9.1.6.
Postgres usually has 400-500 connected clients, most of them are idle.
Database is over 1000 tables (across 5 namespaces), taking ~150Gb on disk.
that's a really high client connection count for an 8-core system.
I'd consider implementing a conn