Re: [PERFORM] CPU utilization vs. IO wait, shared buffers?

2008-10-31 Thread znmeb

Quoting Oliver Johnson <[EMAIL PROTECTED]>:


Hi everybody,

I am running a bake/load test and I am seeing sudden, daily shifts
from CPU utilization to IO wait.  The load harness has been running
for 3 weeks and should be putting a uniform load on the application.
The application processes data on a daily basis and a sawtooth CPU
pattern on the database is expected as more values are added
throughout the day and processing resets with the next day.  Each day,
I see the CPU utilization climb as expected until a shift occurs and
it spends the rest of the day primarily in IO wait.

Looking at pg_statio_user_tables, I can see that during the CPU
intense timeframe, most of the results come from the buffer cache
(hits).  During the IO wait, most of the results are being read in
(misses).  Examples from each timeframe (CPU/IO) are included below.
For each sample, I issued pg_stat_reset(), waited briefly, and then
queried pg_statio_user_tables.

*during CPU Intense timeframe*
db=# select * from pg_statio_user_tables;
 relid | schemaname | relname | heap_blks_read | heap_blks_hit | idx_blks_read | idx_blks_hit | toast_blks_read | toast_blks_hit | tidx_blks_read | tidx_blks_hit
-------+------------+---------+----------------+---------------+---------------+--------------+-----------------+----------------+----------------+---------------
 16612 | public     | tablea  |              1 |       1346782 |             1 |        55956 |               0 |              0 |              0 |             0
 16619 | public     | tableb  |              0 |           579 |             0 |         1158 |                 |                |                |

*during IO WAIT timeframe*
db=# select * from pg_statio_user_tables;
 relid | schemaname | relname | heap_blks_read | heap_blks_hit | idx_blks_read | idx_blks_hit | toast_blks_read | toast_blks_hit | tidx_blks_read | tidx_blks_hit
-------+------------+---------+----------------+---------------+---------------+--------------+-----------------+----------------+----------------+---------------
 16612 | public     | tablea  |         244146 |           594 |          4885 |         3703 |               0 |              0 |              0 |             0
 16619 | public     | tableb  |            418 |           589 |           432 |         1613 |                 |                |                |
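
(For reference, the sampling procedure can be scripted and the counters
rolled into a simple hit percentage; a rough sketch, assuming the
database is reachable as "db" and using a 60-second window:)

$ psql -d db -c "SELECT pg_stat_reset();"
$ sleep 60
$ psql -d db -c "SELECT relname,
       round(100.0 * heap_blks_hit / nullif(heap_blks_hit + heap_blks_read, 0), 1) AS heap_hit_pct
  FROM pg_statio_user_tables
 ORDER BY heap_blks_read DESC;"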



Another thing to note, we have VACUUM ANALYZE running on an hourly
interval and the switch from CPU to IO wait appears to always coincide
with a vacuum.

What might cause this shift?

I have tried adjusting buffer_cache from 512 MB to 1024 MB, but this
did not appear to have an impact.

I also tried upping the work_mem from 1MB to 10MB, and this did not
appear to have an impact either.

Any ideas?  Thanks for your help!

Oliver


We're running Postgresql 8.2.9




First of all, stop me if you're not running Linux -- that's the only  
OS I know. :) Second, if you're not running a fairly recent 2.6 kernel  
(2.6.18/RHEL 5 or later), you should probably upgrade, because the  
performance stats are better. 2.6.25 is better still.
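
If you're not sure what you're running, a quick check:

$ uname -r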


Next, if you haven't already, install the "sysstat" package. My  
recollection is that it does not install by default on most distros.  
It should -- go beat up on the distributors. :)
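
If it's missing, installing it is a one-liner; the package name is the
same on the major distros, so one of these should do it:

$ sudo apt-get install sysstat     # Debian/Ubuntu
$ sudo yum install sysstat         # RHEL/CentOS/Fedora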


Now you have "iostat" installed. That will give you detailed  
information on both processor and I/O activity. Use the command


$ iostat -cdmtx 10 99 | tee iostat.log

This will sample the processor(s), all the devices, and on 2.6.25 or  
later kernels, all the *partitions*. This last point is important if you  
have things on different filesystems.
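
If you later want to pull just the CPU samples back out of that log, a
quick-and-dirty filter (assuming the stock iostat output, where each
CPU sample is introduced by an "avg-cpu:" header line):

$ grep -A 1 avg-cpu: iostat.log | less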


What you will probably see is samples where the I/O wait is high  
correlated with high levels of read activity (reads per second and  
read megabytes per second) and high device utilization. That means you  
are reading data from disk and the processors are waiting for it. What  
can you do about it?


1. Add RAM. This will let Linux put more stuff in page cache, making  
it have to read less.
2. Experiment with the four I/O schedulers. You can change them at run  
time (as "root") -- see the example below.
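
For example, to see which scheduler a device is using and to switch it
on the fly (the device name sda here is just an example -- substitute
your own; on these kernels the choices are noop, anticipatory, deadline,
and cfq, and the active one is shown in brackets):

$ cat /sys/block/sda/queue/scheduler
$ echo deadline | sudo tee /sys/block/sda/queue/scheduler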


I've put a little bit of this online -- it's fairly detailed, and  
it's not PostgreSQL-specific, but it should give you an idea of the  
concepts. By the way, I am working on some scripts that will actually  
integrate this type of monitoring and analysis with PostgreSQL. What  
they will do is load the raw Linux data into a PostgreSQL database and  
provide analysis queries and other tools. But for now, see if this  
makes any sense to you:


http://cougar.rubyforge.org/svn/trunk/procmodel/IO-Counters/beamer/handout.pdf


Re: [PERFORM] Index usage problem on 8.3.3

2008-10-31 Thread Gregory Stark
Tom Lane <[EMAIL PROTECTED]> writes:

> Jeff Frost <[EMAIL PROTECTED]> writes:
>> Tom Lane wrote:
>>> Huh.  That does sound like it's a version-to-version difference.
>>> There's nothing in the CVS log that seems related though.  Are you
>>> willing to post your test case?
>>> 
>> It's a customer DB, so I'll contact them and see if we can boil it down
>> to a test case with no sensitive data.
>
> Well, if there was a change it seems to have been in the right direction
> ;-) so this is mostly just idle curiosity.  Don't jump through hoops to
> get a test case.

Assuming it's not a bug...

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!



Re: [PERFORM] Configuring for maximum memory usage

2008-10-31 Thread Scott Carey
On Thu, Oct 30, 2008 at 2:06 PM, Christiaan Willemsen <[EMAIL PROTECTED]> wrote:

> Thanks guys,
> Lots of info here that I didn't know about! Since I have one of the latest
> Opensolaris builds, I guess the write throttle feature is already in there.
> Sadly, the blog doesn't say what build has it included.
>

If I recall correctly, it went in at about build 89 or so (I think the
bottom of the link I provided has a comment to that effect).  So it's in
there now, but not in OpenSolaris 2008.05.


>
> For writes, I do everything synchronized because we really need a
> consistent database on disk. We can see that during large inserts, the
> intent log is used a lot.
>

The DB synchronizes the WAL log automatically, and the table and index data
are written non-synchronously until the commit at the end of a checkpoint,
at which point sync is called on them.  This keeps things consistent on
disk.  With ZFS, each block written is always consistent, with a checksum
kept in the parent block.  There are no partial page writes, ever.  In
theory, you can disable full page writes on the WAL log if there is a
bottleneck there since ZFS guarantees fully transactional consistent state
of the file system, even if you have a RAID controller or hardware failure
that causes a partial write.  But WAL log activity is probably not your
bottleneck so turning off full page writes on the WAL log is not necessary.
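
If you ever do want to experiment with that, it's a single setting and
only needs a reload; a rough sketch, assuming PGDATA points at the data
directory (and, again, only if you trust the filesystem's atomic-write
guarantee):

$ psql -c "SHOW full_page_writes;"       # see what you are running with now
$ $EDITOR $PGDATA/postgresql.conf        # set: full_page_writes = off
$ pg_ctl -D $PGDATA reload               # picked up without a restart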



>
> What I'm going to be testing is a smaller shared_buffers value with a
> large ARC cache, and exactly the other way around.
>
> Another question: since we have huge tables with hundreds of millions of
> rows, we partitioned the database (it actually is creating the partitions
> dynamically now on inserts, with very good performance :D ), but the question
> is: is the size of the partitions important for the memory parameters in the
> config file? How can we determine the optimal size of the partitions?
> Obviously, when doing selects, you want those preferably to need only a
> single partition for speed. At the moment, that is the case for the majority
> of situations. But there might be some other things to think about...
>
> Kind regards,
>
> Christiaan
>
>
> On Oct 30, 2008, at 7:27 PM, Scott Carey wrote:
>
>
>
> On Thu, Oct 30, 2008 at 9:55 AM, Joshua D. Drake <[EMAIL PROTECTED]>wrote:
>
>> On Thu, 2008-10-30 at 09:46 -0700, Scott Carey wrote:
>>
>> >
>> > Remember that PostgreSQL doesn't cache anything on its own, so
>> > if you don't
>> > want to hit disk it has to be in the file cache.
>> >
>> > By my understanding, this is absolutely false.  Postgres caches pages
>> > from tables/indexes in shared_buffers. You can make this very large if
>> > you wish.
>>
>> You can make it very large with a potentially serious performance hit.
>> It is very expensive to manage large amounts of shared buffers. It can
>> also nail your IO on checkpoint if you are not careful (even with
>> checkpoint smoothing). You are correct that I did not explain what I
>> meant very well because shared buffers are exactly that, shared
>> buffers.
>
>
> You can slam your I/O by making either the OS file cache or
> shared_buffers too large, and you have to tune both.
> In the case of large shared_buffers you have to tune postgres and
> especially the background writer and checkpoints.
> In the case of a large OS cache, you have to tune parameters to limit the
> amount of dirty pages there and force writes out smoothly.
> Both layers attempt to delay writes for their own, often similar reasons,
> and suffer when a large sync comes along with lots of dirty data.
>
> Recent ZFS changes have been made to limit this
> (http://blogs.sun.com/roch/entry/the_new_zfs_write_throttle).
> In earlier ZFS versions, this is what usually killed databases -- ZFS in
> some situations would delay writes too long (even if "long" is 5 seconds)
> and get into trouble.  This still has to be tuned well, combined with good
> checkpoint tuning in Postgres as you mention. For Linux, there are similar
> issues that have to be tuned on many kernels, or up to 40% of RAM can fill
> with dirty pages not written to disk.
>
> Letting the OS do it doesn't get rid of the problem, both levels of cache
> share very similar issues with large sizes and dirty pages followed by a
> sync.
>
> The buffer cache in shared_buffers is a lot more efficient for large
> scanning queries -- A select count(*) test will be CPU bound if it comes
> from shared_buffers or the OS page cache, and in the former case I have seen
> it execute up to 50% faster than the latter, by avoiding calling out to the
> OS to get pages, purely as a result of less CPU used.
>
>
>
>>
>>
>> However that isn't the exact same thing as a "cache" at least as I was
>> trying to describe it. shared buffers are used to keep track of pages
>> (as well as some other stuff) and their current status. That is not the
>> same as caching a relation.
>>
>> It is not possible to pin a relation to memory using PostgreSQL.
>> P

Re: [PERFORM] Index usage problem on 8.3.3

2008-10-31 Thread Jeff Frost

On Fri, 31 Oct 2008, Gregory Stark wrote:


Tom Lane <[EMAIL PROTECTED]> writes:


Jeff Frost <[EMAIL PROTECTED]> writes:

Tom Lane wrote:

Huh.  That does sound like it's a version-to-version difference.
There's nothing in the CVS log that seems related though.  Are you
willing to post your test case?


It's a customer DB, so I'll contact them and see if we can boil it down
to a test case with no sensitive data.


Well, if there was a change it seems to have been in the right direction
;-) so this is mostly just idle curiosity.  Don't jump through hoops to
get a test case.


Assuming it's not a bug...


Well, after boiling down my test case to the bare essentials, I was unable to 
reproduce the different behavior between 8.3.3 and 8.3.4.  Now, I've gone back 
to the original script and can't reproduce the behavior I previously saw on 
8.3.4 and my screen session doesn't have enough scrollback to look at what 
happened previously.  I was thinking perhaps I had inadvertently committed the 
transaction, but then the act would have been dropped as it's a temp table 
created with ON COMMIT DROP.  But, I've tested 3 times in a row and every time 
8.3.4 uses the seq scan just like 8.3.3 now, so I must've done something 
differently to get that result as Tom had originally suggested.  I just can't 
think what it might have been.  Perhaps it's time to buy some glasses. :-/


--
Jeff Frost, Owner   <[EMAIL PROTECTED]>
Frost Consulting, LLC   http://www.frostconsultingllc.com/
Phone: 916-647-6411 FAX: 916-405-4032



Re: [PERFORM] Occasional Slow Commit

2008-10-31 Thread David Rees
(Resending this, the first one got bounced by mail.postgresql.org)

On Wed, Oct 29, 2008 at 3:30 PM, David Rees <[EMAIL PROTECTED]> wrote:
> On Wed, Oct 29, 2008 at 6:26 AM, Greg Smith <[EMAIL PROTECTED]> wrote:
>> What you should do first is confirm
>> whether or not the slow commits line up with the end of the checkpoint,
>> which is easy to see if you turn on log_checkpoints.  That gives you timings
>> for the write and fsync phases of the checkpoint which can also be
>> informative.
>
> OK, log_checkpoints is turned on to see if any delays correspond to
> checkpoint activity...

Well, I'm pretty sure the delays are not checkpoint related. None of
the slow commits line up at all with the end of checkpoints.

The period of high delays occurs during the same window each week, but
it's not during a particularly high-load period on the systems.

It really seems like there must be something running in the background
that is not showing up on the system activity logs, like a background
RAID scrub or something.

Here are a couple representative checkpoint complete messages from the logs:

2008-10-31 20:12:03 UTC: : : LOG:  checkpoint complete: wrote 285
buffers (0.3%); 0 transaction log file(s) added, 0 removed, 0
recycled; write=57.933 s, sync=0.053 s, total=57.990 s
2008-10-31 20:17:33 UTC: : : LOG:  checkpoint complete: wrote 437
buffers (0.4%); 0 transaction log file(s) added, 0 removed, 0
recycled; write=87.891 s, sync=0.528 s, total=88.444 s
2008-10-31 20:22:05 UTC: : : LOG:  checkpoint complete: wrote 301
buffers (0.3%); 0 transaction log file(s) added, 0 removed, 1
recycled; write=60.774 s, sync=0.033 s, total=60.827 s
2008-10-31 20:27:46 UTC: : : LOG:  checkpoint complete: wrote 504
buffers (0.5%); 0 transaction log file(s) added, 0 removed, 0
recycled; write=101.037 s, sync=0.049 s, total=101.122 s

During this period of time there were probably 100 different queries
writing/committing data that took longer than a second (3-4 seconds
seems to be the worst).

The RAID controller on this machine is some sort of MegaRAID
controller. I'll have to see if there is some sort of scheduled scan
running during this period of time.
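
If it's an LSI MegaRAID and the MegaCli utility is installed, the patrol
read (background scrub) settings and schedule should be visible with
something like the following -- exact flags differ between MegaCli
versions, so treat this as an assumption rather than gospel:

$ MegaCli -AdpPR -Info -aALL
$ grep -ri raid /etc/cron* 2>/dev/null   # in case a scrub is driven from cron instead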

One of the replica nodes is an identical machine which, I just
noticed, has the same Slony commit slowdowns logged even though it's
only receiving data via Slony from the primary node. There are two
other nodes listening in on the same subscription that don't show the
same Slony commit slowdowns, but those machines use a slightly
different RAID controller and are about 9 months newer than the other
two.

I'm still hoping that the checkpoint tuning has reduced commit latency
during busy periods; I should have more data after the weekend.

-Dave
