Re: [PERFORM] Forcing using index instead of sequential scan?

2006-07-22 Thread Tom Lane
<[EMAIL PROTECTED]> writes:
> I have been testing the performance of PostgreSQL using the simple tool
> found at http://benchw.sourceforge.net; however, I have found that all the
> queries it runs execute with sequential scans. The website where the code
> runs has examples of the execution plan using indexes.

The reason the website gets indexscans is that he's fooled with the
planner cost parameters.  In particular I see that benchw's
documentation suggests
effective_cache_size= 48000
random_page_cost= 0.8
The latter is physically silly but it's a pretty effective thumb on the
scales if you want to force indexscan usage.

The real issue here is caching across successive queries, an effect that
Postgres doesn't deal with very well at the moment.  If you run these
queries from a standing start (freshly booted machine) you'll likely
find that the indexscan plan is indeed slower than the seqscan/hash
plan, just like the planner thinks.  I get about 52 sec for query0
with an indexscan vs about 35 sec for the seqscan.  However, successive
executions of the seqscan plan stay at about 35 sec, whereas the
indexscan plan drops to 2 sec(!).  This is because the fraction of the
table touched by the indexscan plan is small enough to fit in my
machine's RAM --- I can see by das blinkenlights (and also vmstat) that
there's no I/O going on at all during re-executions of the indexscan.
If I run the seqscan and then the indexscan, the indexscan takes about
28 sec, so there's still some useful cached data even though the seqscan
read more stuff than fits in RAM.  (Note: this is with Fedora Core 5,
YMMV depending on your kernel's cache algorithms.)
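
The cold-versus-warm effect can be reproduced with psql's \timing.  The
actual text of benchw's query0 isn't shown in this thread, so the query
below is only a hypothetical star-join aggregate with assumed table and
column names (fact0, dim0, etc.):

```sql
-- Hypothetical sketch: benchw's real query0 and schema are not shown in
-- this thread, so fact0/dim0 and their columns are assumed names.
\timing on
SET enable_seqscan = off;   -- strongly discourage the seqscan plan
SELECT d.dmth, sum(f.fval)
FROM fact0 f
JOIN dim0 d ON f.d0key = d.d0key
WHERE d.dmth = 1
GROUP BY d.dmth;
-- First execution: dominated by random I/O, slower than the seqscan plan.
-- Re-execution: the fraction of the table touched now sits in RAM, so it
-- runs with essentially no I/O (Tom's 52 sec -> 2 sec observation).
RESET enable_seqscan;
```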

In a real-world situation it's unlikely you'd just re-execute the same
query over and over, so this benchmark is really too simplistic to trust
very far as an indicator of what to do in practice.

I find that CVS tip will choose the indexscan for query0 if I set
effective_cache_size to 62500 (ie, half a gigabyte, or half of this
machine's RAM) and set random_page_cost to 1.5 or less.

If you want the planner to work on the assumption that everything's
cached, set effective_cache_size to a large value and set
random_page_cost to 1.0 --- you might also want to increase the CPU
cost settings, reflecting the fact that I/O is cheaper relative to
CPU effort than the default settings assume.  However, if your database
is too large to fit in RAM then these are likely to be pretty bad
settings.  Many people compromise with a random_page_cost around 2
or so.
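
As a sketch, the settings described above could be tried per-session like
this.  The values are illustrative, not recommendations, and in 8.1-era
releases effective_cache_size is expressed in 8kB pages rather than
memory units:

```sql
-- Illustrative only: tell the planner to assume everything is cached.
SET effective_cache_size = 62500;  -- 62500 x 8kB = ~500MB (half of 1GB RAM)
SET random_page_cost = 1.0;        -- random reads cost the same as sequential
-- Optionally make CPU work look more expensive relative to (cheap, cached)
-- I/O; 0.02 is an arbitrary example, twice the 0.01 default:
SET cpu_tuple_cost = 0.02;

-- If the database does NOT fit in RAM, the common compromise instead:
SET random_page_cost = 2.0;
```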

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Bad Planner Statistics for Uneven distribution.

2006-07-22 Thread Tom Lane
"Guillaume Smet" <[EMAIL PROTECTED]> writes:
> Isn't there any way to make PostgreSQL have a better estimation here:
> ->  Index Scan using models_brands_brand on models_brands
> (cost=0.00..216410.97 rows=92372 width=0) (actual time=0.008..0.008
> rows=0 loops=303)
>Index Cond: (brand = $0)

Note that the above plan extract is pretty misleading, because it
doesn't account for the implicit "LIMIT 1" of an EXISTS() clause.
What the planner is *actually* imputing to this plan is 216410.97/92372
cost units, or about 2.34.  However that applies to the seqscan variant
as well.

I think the real issue with Kevin's example is that when doing an
EXISTS() on a brand_id that doesn't actually exist in the table, the
seqscan plan has worst-case behavior (ie, scan the whole table) while
the indexscan plan still manages to be cheap.  Because his brands table
has so many brand_ids that aren't in the table, that case dominates the
results.  Not sure how we could factor that risk into the cost
estimates.  The EXISTS code could probably special-case it reasonably
well for the simplest seqscan and indexscan subplans, but I don't see
what to do with more general subqueries (like joins).
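
A hedged reconstruction of the problem case (only models_brands and its
brand column appear in the plan fragment above; the brands table and
brand_id column are assumed names):

```sql
-- Hypothetical reconstruction of Kevin's case: probing EXISTS() for
-- brand_ids, many of which have no rows at all in models_brands.
EXPLAIN ANALYZE
SELECT b.brand_id
FROM brands b
WHERE EXISTS (SELECT 1 FROM models_brands mb
              WHERE mb.brand = b.brand_id);
-- Seqscan subplan: a probe for an absent brand_id must scan the whole of
-- models_brands before concluding "no match" (worst-case behavior).
-- Indexscan subplan: the same probe is one cheap index descent, which is
-- why it dominates when many brand_ids are missing from the table.
```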

regards, tom lane



Re: [PERFORM] Forcing using index instead of sequential scan?

2006-07-22 Thread Craig A. James

The real issue here is caching across successive queries, an effect that
Postgres doesn't deal with very well at the moment.  If you run these
queries from a standing start (freshly booted machine) you'll likely
find that the indexscan plan is indeed slower than the seqscan/hash
plan, just like the planner thinks.


Here's a little trick I learned to speed up this test.

  find / -type f -exec grep foobar {} \;

This causes massive file-system activity and flushes all files that the kernel 
has cached.  If you run this between each Postgres test (let it run for a 
couple minutes), it gives you an apples-to-apples comparison between successive 
benchmarks, and eliminates the effects of caching.

If you run this as a regular user (NOT super-user or 'postgres'), you won't 
have permission to access your Postgres files, so you're guaranteed they'll be 
flushed from the cache.

Craig



Re: [PERFORM] Forcing using index instead of sequential scan?

2006-07-22 Thread Gregory Stark
"Craig A. James" <[EMAIL PROTECTED]> writes:

> This causes massive file-system activity and flushes all files that the
> kernel has cached. If you run this between each Postgres test (let it run
> for a couple minutes), it gives you an apples-to-apples comparison between
> successive benchmarks, and eliminates the effects of caching.

On Linux at least the best way to flush the cache is to unmount and then mount
the filesystem. This requires putting the data files on a partition that you
aren't otherwise using, and shutting down Postgres first.

Note that "nothing cached" isn't necessarily any more accurate a model than
"everything cached". In reality many databases *do* in fact run the same
queries over and over again, though often with some parameters different each
time. But the upper pages of most indexes, and many of the common leaf pages
and heap pages, will in fact be cached.


-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com




Re: [PERFORM] Forcing using index instead of sequential scan?

2006-07-22 Thread Mark Kirkwood

Tom Lane wrote:

<[EMAIL PROTECTED]> writes:

I have been testing the performance of PostgreSQL using the simple tool
found at http://benchw.sourceforge.net; however, I have found that all the
queries it runs execute with sequential scans. The website where the code
runs has examples of the execution plan using indexes.


The reason the website gets indexscans is that he's fooled with the
planner cost parameters.  In particular I see that...(snipped)



Indeed I did - probably should have discussed that alteration better in 
the documentation for the test suite!


In addition I was a bit naughty in running the benchmark using size 1 
(i.e. about 1G) on a box with 2G of RAM - as this meant that (on the 
machine I was using then, anyway) indexscans on query 0 and 1 were 
*always* better than the sequential options.


A better test is to use a size factor of 2 x physical RAM, as then the 
planner's defaults make more sense! (unless of course you *want* to model 
a data mart smaller than physical RAM).


Best wishes

Mark






Re: [PERFORM] Forcing using index instead of sequential scan?

2006-07-22 Thread Mark Kirkwood

[EMAIL PROTECTED] wrote:
I have been testing the performance of PostgreSQL using the simple tool 
found at http://benchw.sourceforge.net; however, I have found that all 
the queries it runs execute with sequential scans. The website where the 
code runs has examples of the execution plan using indexes.


When I disable the sequential plan, query 0 and query 1 run faster 
( http://benchw.sourceforge.net/benchw_results_postgres_history.html ) 
by using the indexes as suggested by the website.


I have tried increasing the effective_cache_size and reducing the 
random_page_cost to try and force the optimiser to use the index but it 
always uses the sequential scan.


What is the best way to force the use of indexes in these queries? 
Currently testing with version 8.1.4.
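
For experimentation (not a production setting), the blunt way to check
whether the indexscan plans really win is the planner's enable_* switches,
sketched here per-session:

```sql
-- Experiment only: heavily penalize seqscans for this session so the
-- planner falls back to the indexscan plans for query 0 and query 1.
SET enable_seqscan = off;
EXPLAIN ANALYZE SELECT ...;   -- run the benchw query here
RESET enable_seqscan;
```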





Hi Robin,


Being responsible for this piece of software, I should try to help, only 
saw this now sorry (nice to see someone using this).


Unless you really want to reproduce the numbers on the website, it is 
best to test with Benchw's scale factor set to at least 2 x your physical 
RAM, as this makes the planner's defaults work more sensibly (and models 
*most* real-world data warehouse situations better!).


Cheers

Mark
