Re: [PERFORM] why my query is not using index??

2004-10-11 Thread John Meinel
Francisco Reyes wrote:
On Mon, 11 Oct 2004, Janning Vygen wrote:
[...]
When I saw the default explain I was surprised to see that indexes were 
not being used. For example, the join on lines 4,5 is exactly the primary 
key of the tables, yet a sequential scan was used.

Note this:
The default explain was:
Sort  (cost=382.01..382.15 rows=56 width=196)
   Sort Key: accounts.account_group, accounts.account_name, 
[...]
Versus this:
After set enable_seqscan to off;
It becomes
Sort  (cost=490.82..490.96 rows=56 width=196)
   Sort Key: accounts.account_group, accounts.account_name, 
[...]
Postgres believes that it will cost 382 to do a sequential scan, versus 
490 for an indexed scan, which is why it prefers the sequential scan. 
Try running explain analyze to see how accurate the estimates are.

As Janning mentioned, sometimes sequential scans *are* faster. If the 
number of entries that will be found is large compared to the number of 
total entries (I don't know the percentages, but probably >30-40%), then 
it is faster to just load the data and scan through it, rather than 
doing a whole bunch of indexed lookups.
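
As a rough sketch of that comparison (the table and filter here are 
made up, since the full query was clipped):

-- how accurate is the seqscan estimate?
EXPLAIN ANALYZE SELECT * FROM accounts WHERE account_group = 'foo';

SET enable_seqscan TO off;   -- force the index plan, for comparison only
EXPLAIN ANALYZE SELECT * FROM accounts WHERE account_group = 'foo';
SET enable_seqscan TO on;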

John
=:->




Re: [PERFORM] [pgsql-hackers-win32] Poor Performance for large queries in functions

2004-09-29 Thread John Meinel
Tom Lane wrote:
John Meinel <[EMAIL PROTECTED]> writes:
... However, if I try to 
bundle this query up into a server side function, it runs very slow (10 
seconds). I'm trying to figure out why, but since I can't run EXPLAIN 
ANALYZE inside a function, I don't really know what else to do.

A parameterized query inside a function is basically the same as a
PREPARE'd query with parameters at the SQL level.  So you can
investigate what's happening here with
PREPARE foo(int) AS
  SELECT * FROM object WHERE id IN (
    SELECT id FROM data_t WHERE project_id = $1
    UNION SELECT ... );
EXPLAIN ANALYZE EXECUTE foo(48542);
I'm not sure where the problem is either, so please do send along the
results.
regards, tom lane
PS: pgsql-performance would be a more appropriate venue for this
discussion.
Well, I think I tracked the problem down to the fact that the column 
does not have a "not null" constraint on it. Here is a demonstration. 
Basically, I have 3 tables, tobjects, tdata, and tproject. tdata 
basically just links between tobjects and tproject, but isn't required 
to link to tproject. Yes, the real data has more columns, but this shows 
the problem.

jfmeinel=> \d tobjects
    Table "public.tobjects"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer | not null
Indexes:
    "tobjects_pkey" primary key, btree (id)

jfmeinel=> \d tproject
    Table "public.tproject"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer | not null
Indexes:
    "tproject_pkey" primary key, btree (id)

jfmeinel=> \d tdata
       Table "public.tdata"
   Column   |  Type   | Modifiers
------------+---------+-----------
 id         | integer | not null
 project_id | integer |
Indexes:
    "tdata_pkey" primary key, btree (id)
    "tdata_project_id_idx" btree (project_id)
Foreign-key constraints:
    "tdata_id_fkey" FOREIGN KEY (id) REFERENCES tobjects(id)
        ON UPDATE CASCADE ON DELETE CASCADE
    "tdata_project_id_fkey" FOREIGN KEY (project_id) REFERENCES tproject(id)
        ON UPDATE CASCADE ON DELETE SET DEFAULT

jfmeinel=> select count(*) from tdata;
 count
--------
 545768

jfmeinel=> select count(*) - count(project_id) from tdata;
 ?column?
----------
      240

So tdata(project_id) is almost completely populated: of the 545,768 
rows, only 240 are null.

jfmeinel=> prepare myget(int) as select id from tdata
jfmeinel-> where project_id = $1;
PREPARE
jfmeinel=> explain analyze execute myget(3);
                              QUERY PLAN
-------------------------------------------------------------------------
 Seq Scan on tdata  (cost=0.00..9773.10 rows=181923 width=4)
                    (actual time=1047.000..1047.000 rows=0 loops=1)
   Filter: (project_id = $1)
 Total runtime: 1047.000 ms

jfmeinel=> explain analyze select id from tdata where project_id = 3;
                              QUERY PLAN
-------------------------------------------------------------------------
 Index Scan using tdata_project_id_idx on tdata  (cost=0.00..4.20
   rows=1 width=4) (actual time=0.000..0.000 rows=0 loops=1)
   Index Cond: (project_id = 3)
 Total runtime: 0.000 ms

So notice that when doing the actual select it is able to do the index 
query. But for some reason with a prepared statement, it is not able to 
do it.

Any ideas?
Since I only have the integers now, I can send the data to someone if 
they care to investigate it. It comes to 2.2M as a .tar.bz2, so 
obviously I'm not going to spam the list.

If I rewrite myget as:
prepare myget(int) as select id from tdata where project_id = 3; it 
does the right thing again. So it's something about how a variable 
interacts with an indexed column with null values.

Note: I've tried creating a script that generates dummy data to show 
this problem and I have failed (it always performed the query correctly.)

But this test data definitely shows the problem. And yes, I've vacuum 
analyzed all over the place.

John
=:->
PS> I tested this on PostgreSQL 7.4.3, and it did not demonstrate this 
problem. I am using PostgreSQL 8.0.0beta2 (probably -dev1)




Re: [PERFORM] [pgsql-hackers-win32] Poor Performance for large queries

2004-09-29 Thread John Meinel
Richard Huxton wrote:
John Meinel wrote:
So notice that when doing the actual select it is able to do the index 
query. But for some reason with a prepared statement, it is not able 
to do it.

Any ideas?

In the index-using example, PG knows the value you are comparing to. So, 
it can make a better estimate of how many rows will be returned. With 
the prepared/compiled version it has to come up with a plan that makes 
sense for any value.

If you look back at the explain output you'll see PG is guessing 181,923 
rows will match with the prepared query but only 1 for the second query. 
If in fact you returned that many rows, you wouldn't want to use the 
index - it would mean fetching values twice.

The only work-around if you are using plpgsql functions is to use 
EXECUTE to make sure your queries are planned for each value provided.
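
For reference, a minimal sketch of that EXECUTE work-around (the 
function name and shape here are invented for illustration; FOR ... IN 
EXECUTE is how plpgsql reads rows back from a dynamic query):

CREATE OR REPLACE FUNCTION get_data_ids(int) RETURNS SETOF int AS '
DECLARE
    rec record;
BEGIN
    -- the string is rebuilt and re-planned on every call, so the
    -- planner sees the actual value instead of a generic $1
    FOR rec IN EXECUTE ''SELECT id FROM tdata WHERE project_id = ''
            || $1 LOOP
        RETURN NEXT rec.id;
    END LOOP;
    RETURN;
END;
' LANGUAGE plpgsql;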

I suppose that makes sense. If the number were small (< 100), there 
would probably be a lot of matches, because the tproject table is all 
small integers.

But for a large number, it probably doesn't exist on that table at all.
Thanks for the heads up.
John
=:->




Re: [PERFORM] [pgsql-hackers-win32] Poor Performance for large queries

2004-09-29 Thread John Meinel
Tom Lane wrote:
[ enlarging on Richard's response a bit ]
John Meinel <[EMAIL PROTECTED]> writes:
jfmeinel=> explain analyze execute myget(3);
                              QUERY PLAN
-------------------------------------------------------------------------
 Seq Scan on tdata  (cost=0.00..9773.10 rows=181923 width=4)
                    (actual time=1047.000..1047.000 rows=0 loops=1)
   Filter: (project_id = $1)
 Total runtime: 1047.000 ms

jfmeinel=> explain analyze select id from tdata where project_id = 3;
                              QUERY PLAN
-------------------------------------------------------------------------
 Index Scan using tdata_project_id_idx on tdata  (cost=0.00..4.20
   rows=1 width=4) (actual time=0.000..0.000 rows=0 loops=1)
   Index Cond: (project_id = 3)
 Total runtime: 0.000 ms

So notice that when doing the actual select it is able to do the index 
query. But for some reason with a prepared statement, it is not able to 
do it.

This isn't a "can't do it" situation, it's a "doesn't want to do it"
situation, and it's got nothing whatever to do with null or not null.
The issue is the estimated row count, which in the first case is so high
as to make the seqscan approach look cheaper.  So the real question here
is what are the statistics on the column that are making the planner
guess such a large number when it has no specific information about the
compared-to value.  Do you have one extremely common value in the column?
Have you done an ANALYZE recently on the table, and if so can you show
us the pg_stats row for the column?
regards, tom lane
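(For reference, the pg_stats row being asked about can be pulled with 
something along these lines:

SELECT null_frac, n_distinct, most_common_vals, most_common_freqs
  FROM pg_stats
 WHERE tablename = 'tdata' AND attname = 'project_id';

The most_common_vals/most_common_freqs columns show how skewed the 
distribution is, which is what the planner has to cope with when it 
cannot see the actual value.)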
The answer is "yes" that particular column has very common numbers in 
it. Project id is a number from 1->21. I ended up modifying my query 
such that I do the bulk of the work in a regular UNION SELECT so that 
all that can be optimized, and then I later do another query for this 
row in an 'EXECUTE ...' so that unless I'm actually requesting a small 
number, the query planner can notice that it can do an indexed query.

I'm pretty sure this is just avoiding the worst-case scenario. It is 
true that if I use the number 18, it will return 500,000 rows, and 
getting those with an indexed lookup would be very bad. But typically 
I'm asking for numbers in a very different range, so the planner can 
see that it is unlikely to find many rows for them.

Thanks for pointing out what the query planner was thinking, I was able 
to work around it.

John
=:->




Re: [PERFORM] inconsistent/weird index usage

2004-10-01 Thread John Meinel
Dustin Sallings wrote:

[...]
OK, now that that's over with, I have this one particular query that 
I attempt to run for a report from my phone that no longer works because 
it tries to do a table scan on *some* of the tables.  Why it chooses 
this table scan, I can't imagine.  The query is as follows:

select
s.serial as serial_num,
s.name as name,
date(ts) as day,
min(sample) as min_temp,
avg(sample) as avg_temp,
stddev(sample) as stddev_temp,
max(sample) as max_temp
from
samples inner join sensors s using (sensor_id)
where
ts > current_date - 7
group by
serial_num, name, day
order by
serial_num, day desc

[ next section heavily clipped for clarity ]
->  Seq Scan on samples_1999  (cost rows=103677) (actual rows=0 loops=1)
->  Index Scan using samples_2000_bytime on samples_2000  (cost 
rows=714082) (actual rows=0 loops=1)

->  Seq Scan on samples_2001  (cost rows=902191) (actual rows=0 loops=1)
->  Seq Scan on samples_2002  (cost rows=1037201) (actual rows=0 loops=1)
->  Index Scan using samples_2003_bytime on samples_2003  (cost 
rows=1049772) (actual rows=0 loops=1)

->  Index Scan using samples_2004_bytime on samples_2004  (cost 
rows=791991) (actual rows=66389 loops=1)

[...]

Essentially, what you can see here is that it's doing an index scan 
on samples_2000, samples_2003, and samples_2004, but a sequential scan 
on samples_1999, samples_2001, and samples_2002.  It's very strange to 
me that it would make these choices.  If I disable sequential scans 
altogether for this session, the query runs in under 4 seconds.

This is a very cool solution for long-term storage, and isn't 
terribly hard to manage.  I actually have other report queries that seem 
to be making pretty good index selection currently...but I always want 
more!  :)  Does anyone have any suggestions as to how to get this to do 
what I want?

Of course, ideally, it would ignore five of the tables altogether.  :)

Just as a heads up. You have run vacuum analyze before running this 
query, correct?

Because you'll notice that the query planner thinks it will get 103,677 
rows from 1999, 714,082 rows from 2000, 902,191 rows from 2001, etc. 
Obviously the query planner is not planning well, considering there are 
only about 66,000 rows from 2004 and no rows at all from the other years.

It just seems like it hasn't updated its statistics to reflect what 
time ranges are actually stored in most of the tables.
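
If so, a plain ANALYZE on the stale tables should refresh the 
estimates, something like:

-- re-gather statistics for the tables with wildly wrong row guesses
ANALYZE samples_1999;
ANALYZE samples_2001;
ANALYZE samples_2002;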

(By the way, an indexed scan returning 0 entries is *really* fast, so I 
wouldn't worry about ignoring the extra tables. :)

I suppose the other question is whether this is a prepared or stored 
query. Because sometimes the query planner cannot do enough optimization 
in a stored query. (I ran into this problem where I had 1 column with 
500,000+ entries referencing 1 number. If I ran manually, the time was 
much better because I wasn't using *that* number. With a stored query, 
it had to take into account that I *might* use that number, and didn't 
want to do 500,000+ indexed lookups)

The only other thing I can think of is that there might be some 
confusion between datetime and date. Perhaps it is treating the 
condition as a time-of-day comparison when it plans the queries (hence 
the huge row estimates), when really it should be comparing dates. A 
cast might make it work right; I don't really know.
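
As a sketch of what I mean (only a guess; whether it helps depends on 
the actual type of ts):

-- make the comparison explicit instead of relying on an implicit
-- date -> timestamp promotion at plan time
SELECT count(*) FROM samples
 WHERE ts > (current_date - 7)::timestamp;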

Interesting problem, though.
John
=:->




Re: [PERFORM] Queries slow using stored procedures

2004-10-24 Thread John Meinel
Rod Dutton wrote:
 
Hi, 
 
Has anybody got any ideas on my recent posting ? (thanks in advance) :-
 
 
I have a problem where a query is up to 100 times slower inside a 
function than as a stand-alone query run in psql.
 
The column 'botnumber' is a character(10), is indexed and there are 
125000 rows in the table.
 
[...]
I had a similar problem before, where the function version (stored 
procedure or prepared query) was much slower. I had a bunch of tables 
all with references to another table. I was querying all of the 
references to see if anyone from any of the tables was referencing a 
particular row in the base table.

It turned out that one of the child tables was referencing the same row 
in 300,000 of its 500,000 entries. So if I happened to pick *that* 
value, postgres wanted to do a sequential scan because of all the 
potential results. In my testing I never picked that value, so the 
queries were very fast, since the planner knew it wouldn't get into 
trouble.

In the case of the stored procedure, it didn't know which number I was 
going to ask for, so it had to plan for the worst, and *always* do a 
sequential scan.

So the question is... In your table, does the column "botnumber" have 
the same value repeated many, many times, while '1-7' occurs only a few 
times?

If you change the function to:
CREATE OR REPLACE FUNCTION sp_test_rod3() returns integer
as '
DECLARE
  bot char(10);
  oldbatch INTEGER;
BEGIN
  SELECT INTO oldbatch batchserial
  FROM transbatch
  WHERE botnumber = ''1-7''
  LIMIT 1;
  IF FOUND THEN
RETURN 1;
  ELSE
RETURN 0;
  END IF;
END;
'
language plpgsql;
Is it still slow?
I don't know if you could get crazy with something like:

select 1 where exists (select 1 from transbatch
    where botnumber = '1-7' limit 1);

Just some thoughts about where *I've* found performance to change 
between functions versus raw SQL.

You probably should also mention what version of postgres you are 
running (and possibly what your hardware is)

John
=:->




Re: [PERFORM] Queries slow using stored procedures

2004-10-24 Thread John Meinel
Rod Dutton wrote:
Thank John,
I am running Postgres 7.3.7 on a Dell PowerEdge 6600 Server with Quad Xeon
2.7GHz processors with 16GB RAM and 12 x 146GB drives in Raid 10 (OS, WAL,
Data all on separate arrays).
You might want to think about upgrading to 7.4, as I know it is better 
at quite a few things. But I'm not all that experienced (I just had a 
similar problem).

I did try hard coding botnumber as you suggested and it was FAST.  So it
does look like the scenario that you have explained. 

There are 2 ways of doing it that I know of. First, you can make your 
function build a query string and execute it. Something like:

EXECUTE ''SELECT 1 FROM transbatch WHERE botnumber = ''
|| quote_literal(botnum)
|| '' LIMIT 1'';
That forces the database to re-plan the query each time. The problem 
you are having is that a stored procedure has to prepare the query in 
advance.


does the column "botnumber" have the same value repeated many, many times,
but '1-7' only occurs a few?
Yes, that could be the case, the table fluctuates massively from small to
big to small regularly with a real mixture of occurrences of these values
i.e. some values are repeated many times and some occur only a few times.
I wonder if the answer is to: a) don't use a stored procedure b) up the
statistics gathering for that column ?
I don't believe increasing the statistics will help, as prepared 
statements require one-size-fits-all plans.
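
(For completeness, if you did want to test raising the statistics 
target anyway, the incantation is roughly:

ALTER TABLE transbatch ALTER COLUMN botnumber SET STATISTICS 100;
ANALYZE transbatch;   -- re-gather stats at the new target

The default target is 10 in these versions.)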

I will try your idea: select 1 where exist(select from transbatch where
botnumber = '1-7' limit 1);
Also, how can I get "EXPLAIN" output from the internals of the stored
procedure as that would help me?
I believe the only way to get explain is to use prepared statements 
instead of stored procedures. For example:

PREPARE my_plan(char(10)) AS SELECT 1 FROM transbatch
WHERE botnumber = $1 LIMIT 1;
EXPLAIN ANALYZE EXECUTE my_plan('1-7');

Many thanks,
Rod
If you have to do the first thing I mentioned, I'm not sure if you are 
getting much out of your function, so you might prefer to just ask the 
question directly.

What really surprises me is that it doesn't use the index even after the 
LIMIT clause. But I just did a check on my machine where I had a column 
with lots of repeated entries, and it didn't use the index.

So a question for the true Guru's (like Tom Lane):
Why doesn't postgres use an indexed query if you supply a LIMIT?
John
=:->




Re: [PERFORM] Queries slow using stored procedures

2004-10-24 Thread John Meinel
Rod Dutton wrote:
I also should add that the sp is only slow when the table is big (probably
obvious!).
Rod 
Sure; the problem is that with a lot of rows it switches to a 
sequential scan instead of doing an indexed search.

It's all about trying to figure out how to fix that, especially for any 
value of botnum. I would have hoped that using LIMIT 1 would have fixed 
that.

John
=:->




[PERFORM] Sequential Scan with LIMIT

2004-10-24 Thread John Meinel
I was looking into another problem, and I found something that surprised 
me. I'm doing "SELECT * FROM mytable WHERE col = 'myval' LIMIT 1".
Now "col" is indexed, but mytable has 500,000 rows, and 'myval' occurs 
maybe 100,000 times. Without the LIMIT, this query should definitely do 
a sequential scan.

But with the LIMIT, doesn't it know that it will return at max 1 value, 
and thus be able to use the index?

It seems to be doing the LIMIT too late.
The real purpose of this query is to check to see if a value exists in 
the column, so there might be a better way of doing it. Here is the demo 
info:

# select count(*) from finst_t;
 count
--------
 542315

# select count(*) from finst_t where store_id = 539960;
 count
-------
 85076

# explain analyze select id from finst_t where store_id = 539960 limit 1;
                              QUERY PLAN
---------------------------------------------------------------------------
 Limit  (cost=0.00..0.13 rows=1 width=4) (actual time=860.000..860.000
 rows=1 loops=1)
   ->  Seq Scan on finst_t  (cost=0.00..11884.94 rows=88217 width=4)
       (actual time=860.000..860.000 rows=1 loops=1)
         Filter: (store_id = 539960)
 Total runtime: 860.000 ms

Notice that the "actual rows=1", meaning it is aware of the limit as it 
is going through the table. But for some reason the planner thinks it is 
going to return 88,217 rows. (This is close to the reality of 85076 if 
it actually had to find all of the rows).

Now, if I do a select on a value that *does* only have 1 value, it works 
fine:

# explain analyze select id from finst_t where store_id = 9605 limit 1;
                              QUERY PLAN
---------------------------------------------------------------------------
 Limit  (cost=0.00..3.96 rows=1 width=4) (actual time=0.000..0.000
 rows=1 loops=1)
   ->  Index Scan using finst_t_store_id_idx on finst_t
       (cost=0.00..3.96 rows=1 width=4) (actual time=0.000..0.000 rows=1 loops=1)
         Index Cond: (store_id = 9605)
 Total runtime: 0.000 ms

And 1 further thing, I *can* force it to do a fast index scan if I 
disable sequential scanning.

# set enable_seqscan to off;
# explain analyze select id from finst_t where store_id = 539960 limit 1;
                              QUERY PLAN
---------------------------------------------------------------------------
 Limit  (cost=0.00..1.59 rows=1 width=4) (actual time=0.000..0.000
 rows=1 loops=1)
   ->  Index Scan using finst_t_store_id_idx on finst_t
       (cost=0.00..140417.22 rows=88217 width=4) (actual time=0.000..0.000
       rows=1 loops=1)
         Index Cond: (store_id = 539960)
 Total runtime: 0.000 ms

Could being aware of LIMIT be added to the planner? Is there a better 
way to check for existence?
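
For what it's worth, one standard spelling of the existence test is 
this (the inner query is planned the same way, so this alone doesn't 
dodge the seq scan, but it returns a clean boolean):

-- stops at the first matching row it finds
SELECT EXISTS (SELECT 1 FROM finst_t WHERE store_id = 539960);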

John
=:->
PS> I'm using postgres 8.0-beta3 on win32 (the latest installer).




Re: [PERFORM] Sequential Scan with LIMIT

2004-10-24 Thread John Meinel
Tom Lane wrote:
John Meinel <[EMAIL PROTECTED]> writes:
I was looking into another problem, and I found something that surprised 
me. I'm doing "SELECT * FROM mytable WHERE col = 'myval' LIMIT 1".
Now "col" is indexed, but mytable has 500,000 rows, and 'myval' occurs 
maybe 100,000 times. Without the LIMIT, this query should definitely do 
a sequential scan.

But with the LIMIT, doesn't it know that it will return at max 1 value, 
and thus be able to use the index?

But the LIMIT will cut the cost of the seqscan case too.  Given the
numbers you posit above, about one row in five will have 'myval', so a
seqscan can reasonably expect to hit the first matching row in the first
page of the table.  This is still cheaper than doing an index scan
(which must require reading at least one index page plus at least one
table page).
The test case you are showing is probably suffering from nonrandom
placement of this particular data value; which is something that the
statistics we keep are too crude to detect.
			regards, tom lane
You are correct about non-random placement. I'm a little surprised it 
doesn't change with values, then. For instance,

# select count(*) from finst_t where store_id = 52;
 count
-------
 13967
Still does a sequential scan for the "select id from..." query.
The only value it does an index query for is 9605 which only has 1 row.
It estimates ~18,000 rows, but that is still < 3% of the total data.
This row corresponds to disk location where files can be found. So when 
a storage location fills up, generally a new one is created. This means 
that *generally* the numbers will be increasing as you go further in the 
table (not guaranteed, as there are multiple locations open at any one 
time).

Am I better off in this case just wrapping my query with:
set enable_seqscan to off;
query
set enable_seqscan to on;
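
A variant that can't leak the setting into later queries is to scope 
it to a transaction; SET LOCAL reverts automatically at COMMIT or 
ROLLBACK:

BEGIN;
SET LOCAL enable_seqscan TO off;   -- only affects this transaction
SELECT id FROM finst_t WHERE store_id = 52 LIMIT 1;
COMMIT;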
There is still the possibility that there is a better way to determine 
existence of a value in a column. I was wondering about something like:

SELECT 1 WHERE EXISTS
(SELECT id FROM finst_t WHERE store_id=52 LIMIT 1);
Though the second part is the same, so it still does the sequential scan.
This isn't critical, I was just trying to understand what's going on. 
Thanks for your help.

John
=:->




Re: [PERFORM] Sequential Scan with LIMIT

2004-10-25 Thread John Meinel
Curt Sampson wrote:
On Sun, 24 Oct 2004, John Meinel wrote:

I was looking into another problem, and I found something that surprised
me. If I'm doing "SELECT * FROM mytable WHERE col = 'myval' LIMIT 1.".
Now "col" is indexed...
The real purpose of this query is to check to see if a value exists in
the column,...

When you select all the columns, you're going to force it to go to the
table. If you select only the indexed column, it ought to be able to use
just the index, and never read the table at all. You could also use more
standard and more set-oriented SQL while you're at it:
SELECT DISTINCT(col) FROM mytable WHERE col = 'myval'
cjs
Well, what you wrote was actually much slower, as it had to scan the 
whole table, grab all the rows, and then distinct them in the end.

However, this query worked:
SELECT DISTINCT(col) FROM mytable WHERE col = 'myval' LIMIT 1;
Now, *why* that works differently from:
SELECT col FROM mytable WHERE col = 'myval' LIMIT 1;
or
SELECT DISTINCT(col) FROM mytable WHERE col = 'myval';
I'm not sure. They all return the same information.
What's also weird is stuff like:
SELECT DISTINCT(NULL) FROM mytable WHERE col = 'myval' LIMIT 1;
This also searches the entire table, treating NULL == NULL for every 
row where col = 'myval', which is as expensive as the non-limited case. 
(I'm guessing the LIMIT is applied after the DISTINCT, which is what 
causes the problem; SELECT NULL FROM ... still uses a sequential scan, 
but it stops after finding the first row.)

Actually, in doing a little bit more testing, the working query only 
works on some of the values. Probably it just increases the expense 
enough that it switches over. It also has the downside that when it does 
switch to seq scan, it is much more expensive as it has to do a sort and 
a unique on all the entries.

John
=:->




Re: [PERFORM] Sequential Scan with LIMIT

2004-10-26 Thread John Meinel
Jaime Casanova wrote:
[...]
I'm not sure. They all return the same information.

of course, both queries will return the same thing, but that's just 
because you forced it. LIMIT and DISTINCT are different things, so they 
behave and are planned differently.

What's also weird is stuff like:
SELECT DISTINCT(NULL) FROM mytable WHERE col =
'myval' LIMIT 1;

why do you want to do such a thing?
regards,
Jaime Casanova
I was trying to see if selecting a constant would change things.
I could have done SELECT DISTINCT(1) or just SELECT 1 FROM ...
The idea of the query is that if 'myval' exists in the table, return 
something different than if 'myval' does not exist. If you are writing a 
function, you can use:

SELECT something...
IF FOUND THEN
  do a
ELSE
  do b
END IF;
The whole point of this exercise was just to find what the cheapest 
query is when you want to test for the existence of a value in a column. 
The only thing I've found for my column is:

SET enable_seqscan TO off;
SELECT col FROM mytable WHERE col = 'myval' LIMIT 1;
SET enable_seqscan TO on;
My column is not distributed well (larger numbers occur later in the 
dataset, but may occur many times.) In total there are something like 
500,000 rows, the number 555647 occurs 100,000 times, but not until row 
300,000 or so.

The analyzer looks at the data and says "1/5th of the time it is 
555647, so I can just do a sequential scan; the odds are I won't have 
to look for very long, and I don't have to load the index". It turns 
out this is very bad: with an index you only need a couple of page 
reads, instead of reading 300,000 rows.

Obviously this isn't a general-case solution. But if you have a 
situation similar to mine, it might be useful.

(That's one thing with DB tuning. It seems to be very situation 
dependent, and it's hard to plan without a real dataset.)

John
=:->




Re: [PERFORM] can't handle large number of INSERT/UPDATEs

2004-10-26 Thread John Meinel
Tom Lane wrote:
Curtis Zinzilieta <[EMAIL PROTECTED]> writes:
On Tue, 26 Oct 2004, Tom Lane wrote:
Er ... it *is* the other way around.  bi is blocks in (to the CPU),
bo is blocks out (from the CPU).

Ummm.
[EMAIL PROTECTED] T2]$ man vmstat
  bi: Blocks sent to a block device (blocks/s).
  bo: Blocks received from a block device (blocks/s).

You might want to have a word with your OS vendor.  My vmstat
man page says
   IO
   bi: Blocks received from a block device (blocks/s).
   bo: Blocks sent to a block device (blocks/s).
and certainly anyone who's been around a computer more than a week or
two knows which direction "in" and "out" are customarily seen from.
regards, tom lane
Interesting. I checked this on several machines. They actually say 
different things.

Redhat 9- bi: Blocks sent to a block device (blocks/s).
Latest Cygwin- bi: Blocks sent to a block device (blocks/s).
Redhat 7.x- bi: Blocks sent to a block device (blocks/s).
Redhat AS3- bi: blocks sent out to a block device (in blocks/s)
I would say that I probably agree, things should be relative to the cpu. 
However, it doesn't seem to be something that was universally agreed 
upon. Or maybe the man-pages were all wrong, and only got updated recently.

John
=:->





Re: [PERFORM] Sequential Scan with LIMIT

2004-10-28 Thread John Meinel
Jaime Casanova wrote:
[...]
In http://www.postgresql.org/docs/faqs/FAQ.html under
"4.8) My queries are slow or don't make use of the
indexes. Why?" says: 

"However, LIMIT combined with ORDER BY often will use
an index because only a small portion of the table is
returned. In fact, though MAX() and MIN() don't use
indexes, it is possible to retrieve such values using
an index with ORDER BY and LIMIT: 
SELECT col
FROM tab
ORDER BY col [ DESC ]
LIMIT 1;"

So, maybe you can try your query as 

SELECT col FROM mytable 
WHERE col = 'myval' 
ORDER BY col 
LIMIT 1;

regards,
Jaime Casanova
Thanks for the heads up. This actually worked. All queries against that 
table have turned into index scans instead of sequential.

John
=:->




Re: [PERFORM] vacuum analyze slows sql query

2004-11-08 Thread John Meinel
patrick ~ wrote:
[...]
pkk=# explain analyze execute pkk_01(241);
                              QUERY PLAN
----------------------------------------------------------------------------
 Result  (cost=2.66..2.67 rows=1 width=0) (actual time=2872.211..2872.213
rows=1 loops=1)
   InitPlan
 ->  Limit  (cost=0.00..2.66 rows=1 width=4) (actual
time=2872.189..2872.189 rows=0 loops=1)
   ->  Seq Scan on pkk_purchase p0  (cost=0.00..37225.83 rows=13983
width=4) (actual time=2872.180..2872.180 rows=0 loops=1)
 Filter: ((offer_id = $1) AND (((expire_time)::timestamp with
time zone > now()) OR (expire_time IS NULL) OR (pending = true)) AND
((cancel_date IS NULL) OR (pending = true)))
 Total runtime: 2872.339 ms
(6 rows)
Time: 2873.479 ms
[...]
So, is this the ultimate solution to this issue?
--patrick
It's not so much that correlation is < 0.5. It sounds like you're 
running into the same issue that I ran into in the past. You have a 
column with lots of repeated values, and a few exceptional ones. Notice 
this part of the query:
->  Seq Scan on pkk_purchase p0  (cost rows=13983) (actual rows=0)

For a general number, it thinks it might return 14,000 rows, hence the 
sequential scan. Before you do ANALYZE, it uses whatever defaults exist, 
which are probably closer to reality.

The problem is that you probably have some values for pkk_purchase where 
it could return 14,000 rows (possibly much much more). And for those, 
seq scan is the best plan. However, for the particular value that you 
are testing, there are very few (no) entries in the table.

With a prepared statement (or a function) it has to determine ahead of 
time what the best query is, without knowing what value you are going 
to ask for.

Lets say for a second that you manage to trick it into using index scan, 
and then you actually call the function with one of the values that 
returns 1,000s of rows. Probably it will take 10-100 times longer than 
if it used a seq scan.

So what is the solution? The only one I'm aware of is to turn your 
static function into a dynamic one.

So somewhere within the function you build up a SQL query string and 
call EXECUTE str. This forces the query planner to be run every time you 
call the function. This means that if you call it with a "nice" value, 
you will get the fast index scan, and if you call it with a "bad" value, 
it will switch back to seq scan.

The downside is you don't get much of a benefit from using a stored 
procedure, as it has to run the query planner all the time (as though 
you issued the query manually each time). But it still might be better 
for you in the long run.

Example:
instead of

create function test(int) returns int as '
declare
  x alias for $1;
  y int;
begin
  select into y ... from ... where id = x limit ...;
  return y;
end;
' language plpgsql;

use this format

create function test(int) returns int as '
declare
  x alias for $1;
  y int;
  r record;
begin
  -- FOR ... IN EXECUTE runs the string through the planner on every
  -- call (plain EXECUTE cannot do SELECT INTO a variable here)
  for r in execute ''select ... as val from ... where id = ''
      || x || '' limit ...'' loop
    y := r.val;
  end loop;
  return y;
end;
' language plpgsql;
I think that will point you in the right direction.
John
=:->




Re: [PERFORM] vacuum analyze slows sql query

2004-11-08 Thread John Meinel
patrick ~ wrote:
Hi John,
Thanks for your reply and analysis.
No problem. It just happens that this is a problem we ran into recently.
--- John Meinel <[EMAIL PROTECTED]> wrote:

patrick ~ wrote:
[...]
Hmm... The fact is I am selecting (in this example anyway) over all
values in pkk_offer table and calling the stored function with each
pkk_offer.offer_id which in turn does a select on pkk_purchase table.
Note that offer_id is a foreign key in pkk_purchase referencing
pkk_offer table.
I don't know if it matters (I suspect that it does) but I am using
LIMIT 1 in the sub-query/stored function.  All I need is one single
row meeting any of the criteria laid out in the stored procedure to
establish an offer_id is "pending".
If you are trying to establish existence, we also had a whole thread on 
this. Basically what we found was that adding an ORDER BY clause helped 
tremendously in getting the planner to switch to an index scan. You 
might try something like:

SELECT column FROM mytable WHERE column='myval' ORDER BY column LIMIT 1;
There seems to be a big difference between the above statement and:
SELECT column FROM mytable WHERE column='myval' LIMIT 1;


So what is the solution? The only one I'm aware of is to turn your 
static function into a dynamic one.

So somewhere within the function you build up a SQL query string and 
call EXECUTE str. This forces the query planner to be run every time you 
call the function. This means that if you call it with a "nice" value, 
you will get the fast index scan, and if you call it with a "bad" value, 
it will switch back to seq scan.

The downside is you don't get much of a benefit from using a stored 
procedure, as it has to run the query planner all the time (as though 
you issued the query manually each time). But it still might be better 
for you in the long run.

Well, running the query without the stored function, basically typing
out the stored function as a sub-query shows me:
pkk=# explain analyze select o0.offer_id, ( select  case when ( select 
p0.purchase_id from  pkk_purchase p0 where  p0.offer_id = o0.offer_id and (
p0.pending = true or ( ( p0.expire_time > now() or p0.expire_time isnull ) and
p0.cancel_date isnull ) ) limit 1 ) isnull then false else true end ) from
pkk_offer o0 ;
                              QUERY PLAN
----------------------------------------------------------------------------
 Seq Scan on pkk_offer o0  (cost=0.00..1834.11 rows=618 width=4) (actual
time=2413.398..1341885.084 rows=618 loops=1)
   SubPlan
 ->  Result  (cost=2.94..2.95 rows=1 width=0) (actual
time=2171.287..2171.289 rows=1 loops=618)
   InitPlan
 ->  Limit  (cost=0.00..2.94 rows=1 width=4) (actual
time=2171.264..2171.266 rows=1 loops=618)
   ->  Seq Scan on pkk_purchase p0  (cost=0.00..37225.83
rows=12670 width=4) (actual time=2171.245..2171.245 rows=1 loops=618)
 Filter: ((offer_id = $0) AND
(((expire_time)::timestamp with time zone > now()) OR (expire_time IS NULL) OR
(pending = true)) AND ((cancel_date IS NULL) OR (pending = true)))
 Total runtime: 1341887.523 ms
(8 rows)

while deleting all statistics on the pkk_% tables I get:
pkk=# delete from pg_statistic where pg_statistic.starelid = pg_class.oid and
pg_class.relname like 'pkk_%';
DELETE 11
pkk=# explain analyze select o0.offer_id, ( select  case when ( select 
p0.purchase_id from  pkk_purchase p0 where  p0.offer_id = o0.offer_id and (
p0.pending = true or ( ( p0.expire_time > now() or p0.expire_time isnull ) and
p0.cancel_date isnull ) ) limit 1 ) isnull then false else true end ) from
pkk_offer o0 ;
                              QUERY PLAN
----------------------------------------------------------------------------
 Seq Scan on pkk_offer o0  (cost=0.00..6646.94 rows=618 width=4) (actual
time=0.190..799.930 rows=618 loops=1)
   SubPlan
 ->  Result  (cost=10.73..10.74 rows=1 width=0) (actual time=1.277..1.278
rows=1 loops=618)
   InitPlan
 ->  Limit  (cost=0.00..10.73 rows=1 width=4) (actual
time=1.266..1.267 rows=1 loops=618)
   ->  Index Scan using pur_offer_id_idx on pkk_purchase p0 
(cost=0.00..20690.18 rows=1929 width=4) (actual time=1.258..1.258 rows=1
loops=618)
 Index Cond: (offer_id = $0)
 Filter: ((((expire_time)::timestamp with time zone >
now()) OR (expire_time IS NULL) OR (pending = true)) AND ((cancel_date IS NULL)
OR (pending = true)))
 Total runtime: 801.234 ms
(9 rows)

As you can see, this query (over all values of pkk_offer) without any 
pg_statistic rows for the pkk_purchase table is extremely fast. Is this 
a bug in the PostgreSQL planner that misjudges the best plan?

Re: [PERFORM] vacuum analyze slows sql query

2004-11-08 Thread John Meinel
patrick ~ wrote:
Hi John,
Thanks for your reply and analysis.
--- John Meinel <[EMAIL PROTECTED]> wrote:

patrick ~ wrote:
[...]
pkk=# explain analyze execute pkk_01(241);
                              QUERY PLAN
----------------------------------------------------------------------------
[...]
One other thing that I just thought of. I think it is actually possible 
to add an index on a function of a column. So if you have the 
"is_really_pending" function, you might be able to do:

CREATE INDEX pkk_is_really_pending ON pkk_purchase
(is_really_pending(purchase_id));
But you need a better guru than I to make sure of that.
This *might* do what you need.
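
One caveat: an index on an expression requires the function to be 
IMMUTABLE, which is doubtful if is_really_pending looks at now(). A 
partial index on the plain columns might be a safer sketch (the 
predicate here is guessed from the Filter conditions in the plans 
above):

-- index only the rows that can still count as "pending"
CREATE INDEX pkk_purchase_pending_idx ON pkk_purchase (offer_id)
 WHERE pending = true OR cancel_date IS NULL;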
John
=:->

