[PERFORM] Index usage when bitwise operator is used

2007-09-13 Thread W.Alphonse HAROUNY
Hello,

My question is about index usage when bitwise operations are invoked.
Situation Context:
--

Let's suppose we have two tables, TBL1 and TBL2, like the following:
TBL1 {
  ...;
  integer categoryGroup; // categoryGroup is declared as an index on TBL1
  ...;
}

TBL2 {
  ...;
  integer categoryGroup; // categoryGroup is declared as an index on TBL2
  ...;
}

By design, I assume that:
- [categoryGroup] may hold a limited number of distinct category flags,
  fewer than 32.
- [categoryGroup] is of type integer => 4 bytes => 32 bits
  => 32 positions available to hold a binary '0' or a binary '1'.
- [categoryGroup] is the result of a bitwise OR over a predefined set of
  variables [variableCategory]. We suppose that each [variableCategory] is
  also of type integer (=> 32 bits) and has exactly one bit set to '1'.


Ex: variableCategory1 = 00000010
    variableCategory2 = 00001000
    variableCategory3 = 00100000

If [categoryGroup] = variableCategory1 | variableCategory2 | variableCategory3
=> [categoryGroup] = 00101010
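The same composition in SQL, as a quick check (using the decimal values 2, 8
and 32 for the three flags above):

SELECT (2 | 8 | 32) AS categoryGroup;  -- returns 42, i.e. binary 00101010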



Question:
--
I have an SQL query similar to:

SELECT ... FROM TBL1, TBL2
WHERE ...
  AND TBL1.categoryGroup & TBL2.categoryGroup <> 0  -- & is the bitwise AND operator

Questions:
1/ Will the above SQL query use the index on [categoryGroup] defined on
TBL1 and TBL2?
2/ What should I do, or how should I modify my SQL query, in order to force
the query engine to use an index (either the already defined index or
another useful index)?
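For example, I could imagine (just a sketch with made-up index names, and only
for the case where the tested bit is a constant) creating one partial index
per category bit:

CREATE INDEX tbl1_cat_bit3_idx ON tbl1 (categoryGroup)
  WHERE (categoryGroup & 8) <> 0;

-- a query that tests the same bit against a constant can then consider
-- the (small) partial index instead of a full table scan:
SELECT count(*) FROM tbl1 WHERE (categoryGroup & 8) <> 0;

but I am not sure whether something like that can help the join form above.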



Thx a lot


[PERFORM] Clustered tables improves perfs ?

2007-09-13 Thread Patrice Castet

hi!
I wonder if clustering a table improves perfs somehow?
Any examples/ideas about that?
ref : http://www.postgresql.org/docs/8.2/interactive/sql-cluster.html
thx,
P.




Re: [PERFORM] Clustered tables improves perfs ?

2007-09-13 Thread Alexander Staubo
On 9/13/07, Patrice Castet <[EMAIL PROTECTED]> wrote:
> I wonder if clustering a table improves perfs somehow ?

As I understand it, clustering will help cases where you are fetching
data in the same sequence as the clustering order, because adjacent
rows will be located in adjacent pages on disk; this is because hard
drives perform superbly with sequential reads, much less so with
random access.

For example, given a table foo (v integer) populated with a sequence
of integers [1, 2, 3, 4, ..., n], where the column v has an index, and
the table is clustered on that index, a query such as "select v from
foo order by v" will read the data sequentially from disk, since the
data will already be in the correct order.

On the other hand, a query such as "select v from foo order by
random()" will not be able to exploit the clustering. In other words,
clustering is only useful insofar as your access patterns follow the
clustering order.
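To make that concrete, a minimal sketch (illustrative names, pre-8.3 CLUSTER
syntax):

CREATE TABLE foo (v integer);
INSERT INTO foo SELECT generate_series(1, 1000000);
CREATE INDEX foo_v_idx ON foo (v);
CLUSTER foo_v_idx ON foo;   -- rewrites the table in index order
ANALYZE foo;
-- "select v from foo order by v" can now read the heap largely sequentially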

Alexander.



[PERFORM] Long Running Commits - Not Checkpoints

2007-09-13 Thread Brad Nicholson
I'm having a problem with long running commits appearing in my database
logs.  It may be hardware related, as the problem appeared when we moved
the database to a new server connected to a different disk array.  The
disk array is a lower class array, but still more than powerful enough
to handle the IO requirements.  One big difference though is that the
old array had 16 GB of cache, the new one has 4 GB.

Running Postgres 8.1.8 on AIX 5.3

We have enough IO to spare that we have the bgwriter cranked up pretty
high, dirty buffers are getting written out quickly.  Vmstat indicates 0 io wait
time, no swapping or anything nasty like that going on.

The long running commits do not line up with checkpoint times.

The postgresql.conf configs are identical except that wal_buffers was 8
on the old master, and it is set to 16 on the new one.

We have other installations of this product running on the same array
(different servers though) and they are not suffering from this
problem. 

The only other thing of note is that the wal files sit on the same disk
as the data directory.  This has not changed between the old and new
config, but the installs that are running fine do have their wal files
on a separate partition.

Any ideas where the problem could lie?  Could having the wal files on
the same data partition cause long running commits when there is plenty
of IO to spare?

-- 
Brad Nicholson  416-673-4106
Database Administrator, Afilias Canada Corp.




Re: [PERFORM] [Again] Postgres performance problem

2007-09-13 Thread Scott Marlowe
On 9/13/07, Greg Smith <[EMAIL PROTECTED]> wrote:
> On Wed, 12 Sep 2007, Scott Marlowe wrote:
>
> > I'm getting more and more motivated to rewrite the vacuum docs.  I think
> > a rewrite from the ground up might be best...  I keep seeing people
> > doing vacuum full on this list and I'm thinking it's as much because of
> > the way the docs represent vacuum full as anything.
>
> I agree you shouldn't start thinking in terms of how to fix the existing
> documentation.  I'd suggest instead writing a tutorial leading someone
> through what they need to know about their tables first and then going
> into how vacuum works based on that data.

I think both things are needed, actually.  The current docs were
started back when pg 7.2 roamed the land, and they've been updated a
bit at a time.  The technical definitions of vacuum, vacuum full,
analyze etc. all show a bit too much history from back in the day, and
are confusing.  So I think that: 1: vacuum and analyze should have
their own sections -- analyze used to be a subcommand of vacuum, but it
no longer is, yet the docs still pretty much tie them together.  2:
The definition for vacuum full needs to include a caveat that vacuum
full should be considered more of a recovery operation than a way to
simply get back some space on your hard drives.

Which leads me to thinking that we then need a simple tutorial on
vacuuming to include the free space map, vacuum, vacuum analyze,
vacuum full, and the autovacuum daemon.  We can throw analyze in there
somewhere too, I just don't want it to seem like it's still married to
vacuum.

> As an example, people throw around terms like "index bloat" and "dead
> tuples" when talking about vacuuming.  The tutorial I'd like to see
> somebody write would start by explaining those terms and showing how to
> measure them--preferably with a good and bad example to contrast.

I agree.  I might rearrange it a bit but that's the way I'm looking at it too.

> The way
> these terms are thrown around right now, I don't expect newcomers to
> understand either the documentation or the advice people are giving them;
> I think it's shooting over their heads and what's needed are some
> walkthroughs.  Another example I'd like to see thrown in there is what it
> looks like when you don't have enough FSM slots.

OK.  Got something to start with.  I'm thinking I might work on a
vacuum tutorial first, then the tech docs...



Re: [PERFORM] Long Running Commits - Not Checkpoints

2007-09-13 Thread Brad Nicholson
On Thu, 2007-09-13 at 10:15 -0400, Brad Nicholson wrote:
> I'm having a problem with long running commits appearing in my database
> logs.  It may be hardware related, as the problem appeared when we moved
> the database to a new server connected to a different disk array.  The
> disk array is a lower class array, but still more than powerful enough
> to handle the IO requirements.  One big difference though is that the
> old array had 16 GB of cache, the new one has 4 GB.
> 
> Running Postgres 8.1.8 on AIX 5.3
> 
> We have enough IO to spare that we have the bgwriter cranked up pretty
> high, dirty buffers are getting written out quickly.  Vmstat indicates 0 io wait
> time, no swapping or anything nasty like that going on.
> 
> The long running commits do not line up with checkpoint times.
> 
> The postgresql.conf configs are identical except that wal_buffers was 8
> on the old master, and it is set to 16 on the new one.
> 
> We have other installations of this product running on the same array
> (different servers though) and they are not suffering from this
> problem. 
> 
> The only other thing of note is that the wal files sit on the same disk
> as the data directory.  This has not changed between the old and new
> config, but the installs that are running fine do have their wal files
> on a separate partition.
> 
> Any ideas where the problem could lie?  Could having the wal files on
> the same data partition cause long running commits when there is plenty
> of IO to spare?

More on this - we also have long running commits on installations that
do have the wal files on a separate partition.

-- 
Brad Nicholson  416-673-4106
Database Administrator, Afilias Canada Corp.




Re: [PERFORM] Clustered tables improves perfs ?

2007-09-13 Thread Chris Browne
[EMAIL PROTECTED] (Patrice Castet) writes:
> I wonder if clustering a table improves perfs somehow ?
> Any example/ideas about that ?
> ref : http://www.postgresql.org/docs/8.2/interactive/sql-cluster.html

Sometimes.

1.  It compacts the table, which may be of value, particularly if the
table is not seeing heavy UPDATE/DELETE traffic.  VACUUM and VACUUM
FULL do somewhat similar things; if you are using VACUUM frequently
enough, this is not likely to have a material effect.

2.  It transforms the contents of the table into some specified order,
which will improve efficiency for any queries that use that specific
ordering.
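A rough way to check how closely a table's physical order still matches a
given column (illustrative table and column names; the correlation drifts
back toward 0 as later inserts and updates land out of order):

SELECT attname, correlation
  FROM pg_stats
 WHERE tablename = 'foo' AND attname = 'v';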
-- 
output = reverse("moc.enworbbc" "@" "enworbbc")
http://linuxdatabases.info/info/emacs.html
"You can swear at the keyboard  and it won't be offended. It was going
to treat you badly anyway" -- Arthur Norman



Re: [PERFORM] Long Running Commits - Not Checkpoints

2007-09-13 Thread Tom Lane
Brad Nicholson <[EMAIL PROTECTED]> writes:
> On Thu, 2007-09-13 at 10:15 -0400, Brad Nicholson wrote:
>> I'm having a problem with long running commits appearing in my database
>> logs.  It may be hardware related, as the problem appeared when we moved
>> the database to a new server connected to a different disk array.

> More on this - we also have long running commits on installations that
> do have the wal files on a separate partition.

What's your definition of "long running commit" --- seconds? milliseconds?
Exactly what are you measuring?  Can you correlate the problem with what
the transaction was doing?

regards, tom lane



Re: [PERFORM] [Again] Postgres performance problem

2007-09-13 Thread Gavin M. Roy
How many backends do you have at any given time?  Have you tried using
something like pgBouncer to lower backend usage?  How about your IO
situation?  Have you run something like sysstat to see what iowait is
at?
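For instance, a quick (if rough) way to watch the backend count over time is
something like:

SELECT count(*) FROM pg_stat_activity;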

On 9/11/07, Ruben Rubio <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> I'm having the same problem I reported here a few weeks ago. The database is
> using too many resources again.
>
> I do a vacuum full each day, but it seems it is not working. I am preparing
> an update to postgres 8.2.4 (currently I am on 8.1.3, and tests for the
> update will need several days).
>
> Last time I had this problem I solved it by stopping the website, restarting
> the database, vacuuming it, and starting the website again. But I guess this
> is going to happen again.
>
> I would like to detect and solve the problem. Any ideas to detect it?
>
> Thanks in advance,
>
>
>



Re: [PERFORM] Long Running Commits - Not Checkpoints

2007-09-13 Thread Brad Nicholson
On Thu, 2007-09-13 at 11:10 -0400, Tom Lane wrote:
> Brad Nicholson <[EMAIL PROTECTED]> writes:
> > On Thu, 2007-09-13 at 10:15 -0400, Brad Nicholson wrote:
> >> I'm having a problem with long running commits appearing in my database
> >> logs.  It may be hardware related, as the problem appeared when we moved
> >> the database to a new server connected to a different disk array.
> 
> > More on this - we also have long running commits on installations that
> > do have the wal files on a separate partition.
> 
> What's your definition of "long running commit" --- seconds? milliseconds?
> Exactly what are you measuring?  Can you correlate the problem with what
> the transaction was doing?

log_min_duration_statement is set to 150ms.

Commits are running over that, up to 788ms.  Here is what we see in the logs
(with obfuscated dbname, username and IP):

2007-09-13 10:01:49.787 CUT [782426] dbname username 1.2.3.171 LOG:
duration: 224.286 ms  statement: EXECUTE   [PREPARE:  commit]
2007-09-13 10:19:16.373 CUT [737404] dbname username 1.2.3.174 LOG:
duration: 372.545 ms  statement: EXECUTE   [PREPARE:  commit]
2007-09-13 10:19:24.437 CUT [1806498] dbname username 11.2.3.171 LOG:
duration: 351.544 ms  statement: EXECUTE   [PREPARE:  commit]
2007-09-13 10:33:11.204 CUT [962598] dbname username 1.2.3.170 LOG:
duration: 504.057 ms  statement: EXECUTE   [PREPARE:  commit]
2007-09-13 10:40:33.735 CUT [1282104] dbname username 1.2.3.174 LOG:
duration: 250.127 ms  statement: EXECUTE   [PREPARE:  commit]
2007-09-13 10:49:54.752 CUT [1188032] dbname username 1.2.3.170 LOG:
duration: 382.781 ms  statement: EXECUTE   [PREPARE:  commit]
2007-09-13 11:30:43.339 CUT [1589464] dbname username 1.2.3.172 LOG:
duration: 408.463 ms  statement: EXECUTE   [PREPARE:  commit]


-- 
Brad Nicholson  416-673-4106
Database Administrator, Afilias Canada Corp.




Re: [PERFORM] [Again] Postgres performance problem

2007-09-13 Thread Erik Jones

On Sep 13, 2007, at 12:58 AM, Greg Smith wrote:

> On Wed, 12 Sep 2007, Scott Marlowe wrote:
>
>> I'm getting more and more motivated to rewrite the vacuum docs.  I
>> think a rewrite from the ground up might be best...  I keep seeing
>> people doing vacuum full on this list and I'm thinking it's as
>> much because of the way the docs represent vacuum full as anything.
>
> I agree you shouldn't start thinking in terms of how to fix the
> existing documentation.  I'd suggest instead writing a tutorial
> leading someone through what they need to know about their tables
> first and then going into how vacuum works based on that data.
>
> As an example, people throw around terms like "index bloat" and
> "dead tuples" when talking about vacuuming.  The tutorial I'd like
> to see somebody write would start by explaining those terms and
> showing how to measure them--preferably with a good and bad example
> to contrast.  The way these terms are thrown around right now, I
> don't expect newcomers to understand either the documentation or
> the advice people are giving them; I think it's shooting over their
> heads and what's needed are some walkthroughs.  Another example I'd
> like to see thrown in there is what it looks like when you don't
> have enough FSM slots.


Isn't that the point of the documentation?  I mean, if the existing,  
official manual has been demonstrated (through countless mailing list  
help requests) to not sufficiently explain a given topic, shouldn't  
it be revised?  One thing that might help is a hyperlinked glossary  
so that people reading through the documentation can go straight to  
the postgres definition of dead tuple, index bloat, etc.



Erik Jones

Software Developer | Emma®
[EMAIL PROTECTED]
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com





Re: [PERFORM] Long Running Commits - Not Checkpoints

2007-09-13 Thread Greg Smith

On Thu, 13 Sep 2007, Brad Nicholson wrote:

> One big difference though is that the old array had 16 GB of cache, the
> new one has 4 GB.
>
> We have enough IO to spare that we have the bgwriter cranked up pretty
> high, dirty buffers are getting written out quickly.


If your system is very active, running the bgwriter very aggressively will
result in the cache on the disk array being almost continuously filled
with pending writes that then trickle their way onto real disk eventually.
If the typical working set on this system results in the amount of
background-writer cached writes regularly being >4GB but <16GB, that would
explain what you're seeing.  The resolution is actually unexpected by most
people: you make the background writer less aggressive so that it issues
fewer redundant writes that clog the array's cache, leaving more cache to
buffer the actual commits so they don't block.  This will increase the odds
that there will be a checkpoint block instead, but if you're seeing none of
them right now you may have some margin there to reduce the BGW activity
without aggravating checkpoints.


Since you're probably not monitoring I/O waits and similar statistics on 
how the disk array's cache is being used, whether this is happening or not 
to you won't be obvious from what the operating system is reporting.  One 
or two blocked writes every couple of minutes won't even show up on the 
gross statistics for operating system I/O waits; on average, they'll still 
be 0.  So it's possible you may think you have plenty of I/O to spare, 
when in fact you don't quite have enough--what you've got is enough cache 
that the OS can't see where the real I/O bottleneck is.


I'd be curious to see how you've got your background writer configured to 
see if it matches situations like this I've seen in the past.  The 
parameters controlling the all scan are the ones you might consider
turning down, definitely the percentage and possibly the maxpages as well.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PERFORM] Long Running Commits - Not Checkpoints

2007-09-13 Thread Brad Nicholson
On Thu, 2007-09-13 at 12:12 -0400, Greg Smith wrote:
> On Thu, 13 Sep 2007, Brad Nicholson wrote:

> I'd be curious to see how you've got your background writer configured to 
> see if it matches situations like this I've seen in the past.  The 
> parameters controlling the all scan are the ones you might consider
> turning down, definitely the percentage and possibly the maxpages as well.


bgwriter_delay = 50             # 10-10000 milliseconds between rounds
bgwriter_lru_percent = 20.0     # 0-100% of LRU buffers scanned/round
bgwriter_lru_maxpages = 300     # 0-1000 buffers max written/round
bgwriter_all_percent = 20       # 0-100% of all buffers scanned/round
bgwriter_all_maxpages = 600     # 0-1000 buffers max written/round


-- 
Brad Nicholson  416-673-4106
Database Administrator, Afilias Canada Corp.




Re: [PERFORM] [Again] Postgres performance problem

2007-09-13 Thread Scott Marlowe
On 9/13/07, Erik Jones <[EMAIL PROTECTED]> wrote:
> On Sep 13, 2007, at 12:58 AM, Greg Smith wrote:
>
> > On Wed, 12 Sep 2007, Scott Marlowe wrote:
> >
> >> I'm getting more and more motivated to rewrite the vacuum docs.  I
> >> think a rewrite from the ground up might be best...  I keep seeing
> >> people doing vacuum full on this list and I'm thinking it's as
> >> much because of the way the docs represent vacuum full as anything.
> >
> > I agree you shouldn't start thinking in terms of how to fix the
> > existing documentation.  I'd suggest instead writing a tutorial
> > leading someone through what they need to know about their tables
> > first and then going into how vacuum works based on that data.
> >
> > As an example, people throw around terms like "index bloat" and
> > "dead tuples" when talking about vacuuming.  The tutorial I'd like
> > to see somebody write would start by explaining those terms and
> > showing how to measure them--preferably with a good and bad example
> > to contrast.  The way these terms are thrown around right now, I
> > don't expect newcomers to understand either the documentation or
> > the advice people are giving them; I think it's shooting over their
> > heads and what's needed are some walkthroughs.  Another example I'd
> > like to see thrown in there is what it looks like when you don't
> > have enough FSM slots.
>
> Isn't that the point of the documentation?  I mean, if the existing,
> official manual has been demonstrated (through countless mailing list
> help requests) to not sufficiently explain a given topic, shouldn't
> it be revised?  One thing that might help is a hyperlinked glossary
> so that people reading through the documentation can go straight to
> the postgres definition of dead tuple, index bloat, etc.

Yes and no.  The official docs are more of a technical specification.
Short, simple and to the point so that if you know mostly what you're
doing you don't have to wade through a long tutorial to find the
answer.  I find MySQL's documentation frustrating as hell because I
can never find just the one thing I wanna look for.  Because it's all
written as a tutorial.  I.e. I have to pay the "stupid tax" when I
read their docs.

What I want to do is twofold.  1: fix the technical docs so they have
better explanations of each of the topics, without turning them into
huge tutorials.  2:  Write a vacuuming tutorial that will be useful
should someone be new to postgresql and need to set up their system.
I think the tutorial should be broken into at least two sections, a
quick start guide and an ongoing maintenance and tuning section.



Re: [PERFORM] Long Running Commits - Not Checkpoints

2007-09-13 Thread Brad Nicholson
On Thu, 2007-09-13 at 12:19 -0400, Brad Nicholson wrote:
> On Thu, 2007-09-13 at 12:12 -0400, Greg Smith wrote:
> > On Thu, 13 Sep 2007, Brad Nicholson wrote:
> 
> > I'd be curious to see how you've got your background writer configured to 
> > see if it matches situations like this I've seen in the past.  The 
> > parameters controlling the all scan are the ones you might consider
> > turning down, definitely the percentage and possibly the maxpages as well.
> 
> 
> bgwriter_delay = 50             # 10-10000 milliseconds between rounds
> bgwriter_lru_percent = 20.0     # 0-100% of LRU buffers scanned/round
> bgwriter_lru_maxpages = 300     # 0-1000 buffers max written/round
> bgwriter_all_percent = 20       # 0-100% of all buffers scanned/round
> bgwriter_all_maxpages = 600     # 0-1000 buffers max written/round

I should add, there are 6 back ends running on this disk array
(different servers and different data partitions) with these bgwriter
settings. 

-- 
Brad Nicholson  416-673-4106
Database Administrator, Afilias Canada Corp.




Re: [PERFORM] Long Running Commits - Not Checkpoints

2007-09-13 Thread Alvaro Herrera
Brad Nicholson wrote:
> On Thu, 2007-09-13 at 12:19 -0400, Brad Nicholson wrote:
> > On Thu, 2007-09-13 at 12:12 -0400, Greg Smith wrote:
> > > On Thu, 13 Sep 2007, Brad Nicholson wrote:
> > 
> > > I'd be curious to see how you've got your background writer configured to 
> > > see if it matches situations like this I've seen in the past.  The 
> > > parameters controlling the all scan are the ones you might consider
> > > turning down, definitely the percentage and possibly the maxpages as well.
> > 
> > 
> > bgwriter_delay = 50             # 10-10000 milliseconds between rounds
> > bgwriter_lru_percent = 20.0     # 0-100% of LRU buffers scanned/round
> > bgwriter_lru_maxpages = 300     # 0-1000 buffers max written/round
> > bgwriter_all_percent = 20       # 0-100% of all buffers scanned/round
> > bgwriter_all_maxpages = 600     # 0-1000 buffers max written/round
> 
> I should add, there are 6 back ends running on this disk array
> (different servers and different data partitions) with these bgwriter
> settings. 

Maybe it is running deferred triggers or something?

-- 
Alvaro Herrerahttp://www.advogato.org/person/alvherre
"I suspect most samba developers are already technically insane...
Of course, since many of them are Australians, you can't tell." (L. Torvalds)



Re: [PERFORM] random_page_costs - are defaults of 4.0 realistic for SCSI RAID 1

2007-09-13 Thread Gregory Stark

"Gregory Stark" <[EMAIL PROTECTED]> writes:

> "Luke Lonergan" <[EMAIL PROTECTED]> writes:
>
>> Right now the pattern for index scan goes like this:
>>
>> - Find qualifying TID in index
>>   - Seek to TID location in relfile
>> - Acquire tuple from relfile, return
>>...
>> If we implement AIO and allow for multiple pending I/Os used to prefetch
>> groups of qualifying tuples, basically a form of random readahead
>
> Ah, I see what you mean now. It makes a lot more sense if you think of it for
> bitmap index scans. So, for example, the bitmap index scan could stream tids
> to the executor and the executor would strip out the block numbers and pass
> them to the i/o layer saying "i need this block now but following that I'll
> need these blocks so get them moving now".

Wow, I've done some preliminary testing here on Linux using posix_fadvise and
Solaris using libaio to prefetch blocks and then access them randomly and I
think there's a lot of low hanging fruit here.

The use case where this helps is indeed on a raid array where you're not
maxing out the bandwidth of the array and care about the transaction latency,
perhaps a narrow use case but still, quite common. 

Since our random access is synchronous it means we have to wait for one seek,
process that page, then wait for the next seek on another drive which was
sitting idle while we were processing the first page. By prefetching the pages
we'll need next we can get all the members of the array working for us
simultaneously even if they're all doing seeks.

What I've done is write a test program which generates a 1G file, syncs it and
drops the caches (not working yet on Solaris but doesn't seem to affect the
results) and then picks 4096 8k buffers and reads them in random order. The
machines it's running on have a small raid array with 4 drives.

Just seeking without any prefetch it takes about 12.5s on Linux and 13.5s on
Solaris. If I prefetch even a single buffer using posix_fadvise or libaio I
see a noticeable improvement, over 25%. At 128 buffers of prefetch both
systems are down to about 2.5-2.7s. That's on the small raid array. On the
boot drive, both have a small beneficial effect, but only at very large
prefetch set sizes, which I would chalk up to being able to re-order the
reads even if it can't overlap them.

I want to test how much of this effect evaporates when I compare it to a
bitmap index style scan but that depends on a lot of factors like the exact
pattern of file extensions on the database files. In any case bitmap index
scans get us the reordering effect, but not the overlapping i/o requests
assuming they're spread quite far apart in the data files.

> I think this seems pretty impractical for regular (non-bitmap) index probes
> though. You might be able to do it sometimes but not very effectively and you
> won't know when it would be useful.

How useful this is depends a lot on how invasively we let it infect things
like regular index scans. If we can prefetch right siblings and deeper index
pages as we descend an index tree and future heap pages it could help a lot as
those aren't sorted like bitmap index scans. But even if we only fetch heap
pages all together before processing the heap pages it could be a big help.

Incidentally we do need to try to make use of both as Solaris doesn't have
posix_fadvise as far as I can tell and Linux's libaio doesn't support
non-O_DIRECT files.

Raw data:

Blocks      Linux           Solaris
Prefetched  posix_fadvise   libaio
------------------------------------
    1       12.473          13.597
    2        9.053           9.830
    4        6.787           7.594
    8        5.303           6.588
   16        4.209           5.120
   32        3.388           4.014
   64        2.869           3.216
  128        2.515           2.710
  256        2.312           2.327
  512        2.168           2.099
 1024        2.139           1.974
 2048        2.242           1.903
 4096        2.222           1.890

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [PERFORM] Long Running Commits - Not Checkpoints

2007-09-13 Thread Brad Nicholson
On Thu, 2007-09-13 at 12:12 -0400, Greg Smith wrote:
> Since you're probably not monitoring I/O waits and similar statistics on 
> how the disk array's cache is being used, whether this is happening or not 
> to you won't be obvious from what the operating system is reporting.  


A sysadmin looked at cache usage on the disk array.  The read cache is
being used heavily, and the write cache is not.

-- 
Brad Nicholson  416-673-4106
Database Administrator, Afilias Canada Corp.




Re: [PERFORM] SAN vs Internal Disks

2007-09-13 Thread Michael Stone

On Tue, Sep 11, 2007 at 06:07:44PM -0500, Decibel! wrote:
> On Tue, Sep 11, 2007 at 05:09:00PM -0400, Michael Stone wrote:
>> You can get DAS arrays with multiple controllers, PSUs, etc.  DAS !=
>> single disk.
>
> It's still in the same chassis, though,

I think you're confusing DAS and internal storage.

Mike Stone



Re: [PERFORM] Long Running Commits - Not Checkpoints

2007-09-13 Thread Greg Smith

On Thu, 13 Sep 2007, Brad Nicholson wrote:

> A sysadmin looked at cache usage on the disk array.  The read cache is
> being used heavily, and the write cache is not.


Given that information, you can take the below (which I was just about to 
send before the above update came in) as something to think about and test 
but perhaps not your primary line of attack.  Even if my theory about the 
exact mechanism involved isn't correct, the background writer is still 
problematic in terms of its impact on the system when run as aggressively 
as you're doing it; I'm not sure but I think that's even more true on 8.1 
than it is on 8.2, where I did most of my testing in this area.



> bgwriter_delay = 50
> bgwriter_lru_percent = 20.0
> bgwriter_lru_maxpages = 300
> bgwriter_all_percent = 20
> bgwriter_all_maxpages = 600


That was what I was expecting.  Your all scan has the potential to be
writing 600*8K*(1/50 msec)=98MB/sec worth of data to your disk array.
Since some of this data has a random access component to it, your array
cannot be expected to keep up with a real peak load; the only thing saving
you, if something starts dirtying buffers as fast as possible, is that the
array cache is buffering things.  And that 4GB worth of cache could fill up
in very little time.


Every time the all scan writes a buffer that is frequently used, that 
write has a good chance that it was wasted because the block will be 
modified again before checkpoint time.  Your settings are beyond regular 
aggressive and into hyperactive territory where I'd expect such
redundant writes are happening often.  I'd suggest you try to move toward 
dropping bgwriter_all_percent dramatically from its current setting and 
see how far down you can go before it starts to introduce blocks at 
checkpoint time.  With bgwriter_delay set to 1/4 the default, I would 
expect that even 5% would be a high setting for you.  That may be a more 
dramatic change than you want to make at once though, so lowering it in 
that direction more slowly (perhaps drop 5% each day) and seeing whether 
things improve as that happens may make more sense.
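As a rough sketch of the direction I mean (starting values only -- they need
testing on your system before being trusted):

bgwriter_delay = 50              # unchanged from your current setting
bgwriter_all_percent = 5         # down from 20; work toward this gradually
bgwriter_all_maxpages = 200      # consider lowering from 600 as well
bgwriter_lru_percent = 20.0      # leave the LRU scan alone for now
bgwriter_lru_maxpages = 300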


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



[PERFORM] Index files

2007-09-13 Thread Harsh Azad
Hi,

Where are the database index files located in the $PGDATA directory? I was
thinking on soft linking them to another physical hard disk array.

Thanks,
Azad


Re: [PERFORM] Index files

2007-09-13 Thread Ow Mun Heng
On Fri, 2007-09-14 at 08:20 +0530, Harsh Azad wrote:
> Hi,
> 
> Where are the database index files located in the $PGDATA directory? I
> was thinking on soft linking them to another physical hard disk array.

you have to search through pg_class for the "number"
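For example (made-up index name), something like:

SELECT relname, relfilenode FROM pg_class WHERE relname = 'my_index';

should give you the number; the matching file lives under
$PGDATA/base/<database oid>/<relfilenode>.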

Alternatively, you can try using tablespaces.

create tablespace indexspace location '/mnt/fastarray'
create index newindex on table (index_1) tablespace indexspace




Re: [PERFORM] Index files

2007-09-13 Thread Ow Mun Heng
On Fri, 2007-09-14 at 08:33 +0530, Harsh Azad wrote:
> ah.. thanks. Didn't realize table spaces can be mentioned while
> creating a index. BTW, are soft links ok to use for pg_clog /
> pg_xlog . I moved the existing directories to /mnt/logs/pglogs and
> made soft links for both directories in $PGDATA 


No idea what the "proper" solution is -- I'm a newbie myself.
But from what I've read on the net and Google, symlinks seem to be the
order of the day.

Perhaps others who are more familiar can comment, as I'm lost on this.
(I'm doing symlinking, btw.)




Re: [PERFORM] Index files

2007-09-13 Thread Harsh Azad
Ah, thanks. Didn't realize tablespaces can be specified while creating an
index. BTW, are soft links OK to use for pg_clog / pg_xlog? I moved the
existing directories to /mnt/logs/pglogs and made soft links for both
directories in $PGDATA.

Thanks

On 9/14/07, Ow Mun Heng <[EMAIL PROTECTED]> wrote:
>
> On Fri, 2007-09-14 at 08:20 +0530, Harsh Azad wrote:
> > Hi,
> >
> > Where are the database index files located in the $PGDATA directory? I
> > was thinking on soft linking them to another physical hard disk array.
>
> you have to search through pg_class for the "number"
>
> Alternatively, you can try using tablespaces.
>
> create tablespace indexspace location '/mnt/fastarray'
> create index newindex on table (index_1) tablespace indexspace
>
>


-- 
Harsh Azad
===
[EMAIL PROTECTED]


Re: [PERFORM] Index files

2007-09-13 Thread Tom Lane
"Harsh Azad" <[EMAIL PROTECTED]> writes:
> Where are the database index files located in the $PGDATA directory?

Read
http://www.postgresql.org/docs/8.2/static/storage.html

> I was
> thinking on soft linking them to another physical hard disk array.

Manual symlink management, while not impossible, pretty much sucks
... especially if your tables are big enough that you actually need to
do this.  Use a tablespace instead.  (If you are on a PG version that
hasn't got tablespaces, you are more than overdue to upgrade.)
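For instance (made-up names and paths; the directory must already exist and
be owned by the postgres user):

CREATE TABLESPACE indexspace LOCATION '/mnt/fastarray/pg_idx';
CREATE INDEX foo_bar_idx ON foo (bar) TABLESPACE indexspace;
-- an existing index can be moved too (this locks and rewrites it):
ALTER INDEX some_existing_idx SET TABLESPACE indexspace;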

regards, tom lane



Re: [PERFORM] When/if to Reindex

2007-09-13 Thread Bruce Momjian

This has been saved for the 8.4 release:

http://momjian.postgresql.org/cgi-bin/pgpatches_hold

---

Tom Lane wrote:
> "Steven Flatt" <[EMAIL PROTECTED]> writes:
> > So, can we simply trust what's in pg_class.relpages and ignore looking
> > directly at the index?
> 
> No, we can't.  In the light of morning I remember more about the reason
> for the aforesaid patch: it's actually unsafe to read the pg_class row
> at all if you have not got lock on the index.  We are reading with
> SnapshotNow in order to be sure we see up-to-date info, and that means
> that a concurrent update of the row (eg, for REINDEX to report the new
> relfilenode) can have the following behavior:
> 
> 1. REINDEX inserts the new modified version of the index's pg_class row.
> 
> 2. Would-be reader process visits the new version of the pg_class row.
>It's not committed yet, so we ignore it and continue scanning.
> 
> 3. REINDEX commits.
> 
> 4. Reader process visits the old version of the pg_class row.  It's
>now committed dead, so we ignore it and continue scanning.
> 
> 5. Reader process bombs out with a complaint about no pg_class row for
>the index.
> 
> So we really have to have the lock.
> 
> > This is a fairly serious concern for us, that
> > reindex is blocking all readers of the parent table.
> 
> I'm afraid you're kinda stuck: I don't see any fix that would be
> practical to put into 8.2, or even 8.3 considering that it's way too
> late to be thinking of implementing REINDEX CONCURRENTLY for 8.3.
> 
> You might be able to work around it for now by faking such a reindex
> "by hand"; that is, create a duplicate new index under a different
> name using CREATE INDEX CONCURRENTLY, then exclusive-lock the table
> for just long enough to drop the old index and rename the new one
> to match.
> 
> It's probably worth asking also how badly you really need routine
> reindexing.  Are you certain your app still needs that with 8.2,
> or is it a hangover from a few releases back?  Could more aggressive
> (auto)vacuuming provide a better solution?
> 
>   regards, tom lane
> 
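For reference, a minimal sketch of the manual swap described above (made-up
index and table names; the exclusive lock is only needed for the drop/rename):

CREATE INDEX CONCURRENTLY foo_col_idx_new ON foo (col);
BEGIN;
DROP INDEX foo_col_idx;
ALTER INDEX foo_col_idx_new RENAME TO foo_col_idx;
COMMIT;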

-- 
  Bruce Momjian  <[EMAIL PROTECTED]>  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [PERFORM] 500rows = 1min/2.5k rows=20min/6K rows 2 hours and still running

2007-09-13 Thread Ow Mun Heng
On Wed, 2007-09-12 at 15:14 +0200, Nis Jørgensen wrote:
> El-Lotso skrev:
> 
> > I'm on the verge of giving up... the schema seems simple and yet there's
> > so much issues with it. Perhaps it's the layout of the data, I don't
> > know. But based on the ordering/normalisation of the data and the one to
> > many relationship of some tables, this is giving the planner a headache
> > (and me a bulge on the head from knockin it against the wall)
> 
> I think you should look more at the db design, and less on rewriting the
> query. Here are some observations:

I can't help much with the design per-se. So..

> 
> - Your table structure is quite hard to understand (possibly because you
> have changed the names) - if you want help on improving it, you will
> need to explain the data to us, and possibly post some sample data.

If anyone is willing, I can send some sample data to you off-list.

on the trh table, hid is a subset of data for a particular id.

eg: 
PARENT : CHILD 1
PARENT : CHILD 2
PARENT : CHILD 3
PARENT : CHILD 4

uniqueid = merged fields from id / index1 / index2 / start_timestamp(IN EPOCH)
/ phase_id / ttype which is unique on each table  (but not across ALL the 
tables)


> - You seem to be lacking constraints on the tables. My guess is that
> (id,ttype,start_timestamp) is unique in both trh and ts - but I cannot
> tell (and neither can the query planner). Foreign key constraints might
> help as well. These would also help others to understand your data, and
> suggest reformulations of your queries.

AFAICT, there are no foreign constraints in the original DB design. (and
I'm not even sure how to begin the FK design based on this org design)

the unique_id is as above.
TRH/TRD uniqueid = merged fields from id / index1 / index2 /
start_timestamp(IN EPOCH) / phase_id / ttype 

TS uniqueid = merged fields from id / start_timestamp(IN EPOCH) / ttype 

The problem with this is that the set of fields on which they are unique
differs across the tables, so the unique_id is only unique for that table
alone, and acts as a primary key so that no dupes exist in that one table.



> - Another guess is that the ttype sets (177,197,705,742,758,766),
> (69,178,198,704,757,741,765) are actually indicating some other property
> a common "type" of record, and that only one of each will be present for
> an id,start_timestamp combination. This may be related to the repeatingd
> fields issue - if a certain ttype indicates that we are interested in a
> certain pber_x field (and possibly that the others are empty).

yes..

eg:
id | hid |ttype | start_timestamp  | pber_2 | pber_3 | pber_4
PARENT | 0   |764   | 2007-07-01 00:00 | 4000   | null   | null
PARENT | 0   |765   | 2007-07-01 00:00 | null   | 9000   | null
PARENT | 0   |766   | 2007-07-01 00:00 | null   | null   | 7999
PARENT | 1   |764   | 2007-07-01 00:00 | 4550   | null   | null
PARENT | 1   |765   | 2007-07-01 00:00 | null   | 9220   | null
PARENT | 1   |766   | 2007-07-01 00:00 | null   | null   | 6669


the subqueries are just to take out the fields with the value and leave
the nulls so that we end-up with

id |hid| start_timestamp  |pber_2 | pber_3 | pber_4
PARENT | 0 | 2007-07-01 00:00 |  4000 |   9000 | 7999
PARENT | 1 | 2007-07-01 00:00 |  4550 |   9220 | 6669

which is basically just joining a table with itself, but there is a caveat
whereby pber_3 and pber_4 can only be joined together based on the
seq_date/seq_time in the ts table, hence the query:

JOIN1.id = join2.id
and join1.seq_date = join2.seq_date
etc.

but the problem is compounded by the fact that there are numerous hid
values for each id
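i.e. roughly something like this (using an aggregate instead of the self-join,
which gives the same collapsed rows; names are taken from the sample rows
above, so treat it only as a sketch that ignores the seq_date/seq_time caveat):

SELECT id, hid, start_timestamp,
       max(pber_2) AS pber_2,
       max(pber_3) AS pber_3,
       max(pber_4) AS pber_4
  FROM trh
 GROUP BY id, hid, start_timestamp;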

> - You have what looks like repeating fields - pber_x, fval_x, index_x -
> in your tables. Fixing this might not improve your query, but might be a
> good idea for other reasons.

it's being looked at by some other team to collapse this to something
like this

ttype | pber
764   | 500
765   | 600
766   | 700

so that there are fewer columns and no null fields. But the query
will remain the same

> - seq_date and seq_time seems like they may be redundant - are they
> different casts of the same data?

No. They're used to join together the pber_2/3/4 fields, as each one may
occur anywhere from a few hours to a few days apart from the others, but
each will be uniquely identified by the seq_date/time.

eg : 

id | pber_2 | seq_date | seq_time
PARENT | 400| 2007-07-01 00:00:00 | 1980-01-01 20:00:00
PARENT | 410| 2007-07-10 00:00:00 | 1980-01-01 22:00:00

id | pber_3 | seq_date | seq_time
PARENT | 900| 2007-07-01 00:00:00 | 1980-01-01 20:00:00
PARENT | 100| 2007-07-10 00:00:00 | 1980-01-01 22:00:00

id | pber_4 | seq_date | seq_time
PARENT | 1  | 2007-07-01 00:00:00 | 1980-01-01 20:00:00
PARENT | 999| 2007-07-10 00:00:00 | 1980-01-01 22:00:00


So the correct values for the fields, when joined together, will be of the
form:

id |start_timestamp  |seq_date| seq_time   |pber_2 |