Re: [PERFORM] Functionscan estimates

2005-04-09 Thread PFC

> But with all due respect to Joe, I think the reason that stuff got
> trimmed is that it didn't work very well.  In most cases it's
> *hard* to write an estimator for a SRF.  Let's see you produce
> one for dblink() for instance ...
	Good one...
	Well, in some cases it will be impossible, but suppose I have a function
get_id_for_something() which just grabs an ID using dblink; then I know it
returns one row, and pg would be interested in that information too!
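
	Something like this, for instance (just a sketch, assuming the dblink
contrib module is installed; connection string, remote query and names are
made up):

CREATE FUNCTION get_id_for_something( TEXT ) RETURNS INTEGER LANGUAGE sql AS $$
	-- grab a single id from the remote database through dblink;
	-- by construction this returns at most one row
	SELECT id
	FROM dblink( 'dbname=remote',
	             'SELECT id FROM things WHERE name = ' || quote_literal($1) )
	     AS t( id INTEGER )
	LIMIT 1;
$$;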



Re: [PERFORM] Functionscan estimates

2005-04-09 Thread PFC

> My solution would be a lot simpler, since we could simply populate
> pg_proc.proestrows with "1000" by default if not changed by the DBA.  In an
> even better world, we could tie it to a table, saying that, for example,
> proestrows = my_table*0.02.
	What if the estimated row count is a function of a parameter?
	Say a function takes as a parameter:
	- a number to use in a LIMIT, or
	- a predetermined set of values to expand into rows (like an array ->
set-returning function).

	In all those cases it's no use to have just a fixed number.
	I'd suggest two solutions:
	- The ideal solution, which is impossible to do: the function tells the
planner about its stats by looking at its parameters.
	- A solution that would be possible to do: pg_proc.proestrows is... the
name of another function, defined by the user, which takes the exact same
parameters as the set-returning function we're talking about, and which
returns estimates.

For instance, in pseudo-SQL (icount() comes from the intarray contrib module):

CREATE FUNCTION int_array_srf( INTEGER[] ) RETURNS SETOF INTEGER LANGUAGE plpgsql AS $$
BEGIN
	-- emit every element of the input array as one row
	FOR _i IN 1..icount($1) LOOP
		RETURN NEXT $1[_i];
	END LOOP;
	RETURN;
END;
$$;

	In the two cases above, this would give:

CREATE FUNCTION array_srf_estimator( INTEGER[] ) RETURNS INTEGER LANGUAGE plpgsql AS $$
BEGIN
	-- the SRF returns exactly one row per array element
	RETURN icount( $1 );
END;
$$;

ALTER FUNCTION int_array_srf SET ESTIMATOR array_srf_estimator;
	Another interesting case would be the famous "Top 5 by category" case,
where we use an SRF to emulate an index skip scan. Say we have a table
categories and a table users, each user having columns "category_id" and
"score", and we want the N users with the best score in each category:

CREATE FUNCTION top_n_by_category( INTEGER ) RETURNS SETOF users LANGUAGE plpgsql AS $$
DECLARE
	_cat_id	INTEGER;
	_n	ALIAS FOR $1;
	_user	users%ROWTYPE;
BEGIN
	-- for each category, return its top _n users by score
	FOR _cat_id IN SELECT category_id FROM categories LOOP
		FOR _user IN SELECT * FROM users
		             WHERE category_id = _cat_id
		             ORDER BY score DESC LIMIT _n LOOP
			RETURN NEXT _user;
		END LOOP;
	END LOOP;
	RETURN;
END;
$$;
	
CREATE FUNCTION top_n_by_category_estimator( INTEGER ) RETURNS INTEGER LANGUAGE plpgsql AS $$
BEGIN
	RETURN $1 * (the estimated number of rows for the categories table, taken
	             from the table statistics);
END;
$$;

ALTER FUNCTION top_n_by_category SET ESTIMATOR top_n_by_category_estimator;
Got it?
	The array_srf case would be extremely useful, as this type of function is
generally used to join against other tables, and having estimates is useful
for that.
	The top_n case would be useless if we're just returning the rows from the
function directly, but would be useful if we join them to other tables.

This sounds pretty simple, powerful, and versatile.
	Additionally, in some cases (most notably the array case) it's easy to  
estimate the statistics on the returned values because they're all in the  
array already, so the mechanism could be extended to have a way of  
returning a pseudo pg_stats for a Set Returning function.

	For instance, say you have an SRF which returns N random rows from a
table. It could have an estimator which would return a rowcount of N, and a
statistics estimator which would return the stats rows for the source table,
appropriately modified.

This sounds harder to do.
What do you think?



Re: [PERFORM] Any way to speed this up?

2005-04-09 Thread Jim C. Nasby
2 things to point out from this last run:

50% of the time is taken scanning tblassociate 
->  Seq Scan on tblassociate a  (cost=0.00..38388.79 rows=199922 width=53) 
(actual time=62.000..10589.000 rows=176431 loops=1)
  Filter: ((clientnum)::text = 'SAKS'::text)

If you had an index on clientnum and didn't cast it to text in the view,
you might be able to use an indexscan, which could be faster (depends on
how big the table actually is).
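
For example (hypothetical index names; adjust to the real schema), either of
these might let that filter use an index scan:

-- plain index on clientnum, useful if the view stops casting it to text
CREATE INDEX idx_tblassociate_clientnum ON tblassociate (clientnum);

-- or an expression index that matches the cast the view does today
CREATE INDEX idx_tblassociate_clientnum_text ON tblassociate ((clientnum::text));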

This sort is taking about 25% of the time:
  ->  Sort  (cost=85542.59..86042.39 rows=199922 width=75) (actual 
time=19641.000..19955.000 rows=159960 loops=1)"
Sort Key: a.locationid
->  Merge Right Join  (cost=60850.40..62453.22 rows=199922 width=75) 
(actual time=13500.000..14734.000 rows=176431 loops=1)

I suspect it shouldn't take 5 seconds to sort 160k rows in memory, and
that this sort is spilling to disk. If you increase the sort memory setting
(sort_mem on 7.4, work_mem on 8.0) the sort might fit entirely in memory.
As a quick test, you could set it to 80% of system memory and see how that
changes the speed. But you wouldn't want to set it that high in production.
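
For example, in the session running the query (the value here is just a
guess; it's in kB):

-- quick test only: give this session plenty of sort memory
SET work_mem = 262144;		-- on 7.4 and earlier: SET sort_mem = 262144;
-- then re-run the query and compare the Sort node's actual time
EXPLAIN ANALYZE SELECT ...;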

On Thu, Apr 07, 2005 at 01:14:33PM -0400, Joel Fradkin wrote:
> Here is the result after putting it back to 4 the original value (I had done
> that prior to your suggestion of using 2 or 3) to see what might change.
> I also vacuumed and thought I saw records deleted in associate, which I
> found odd as this is a test site and no new records were added or deleted.
> 
> "Merge Join  (cost=86788.09..87945.00 rows=10387 width=112) (actual
> time=19703.000..21154.000 rows=159959 loops=1)"
> "  Merge Cond: ("outer".locationid = "inner".locationid)"
> "  ->  Sort  (cost=1245.50..1246.33 rows=332 width=48) (actual
> time=62.000..62.000 rows=441 loops=1)"
> "Sort Key: l.locationid"
> "->  Index Scan using ix_location on tbllocation l
> (cost=0.00..1231.60 rows=332 width=48) (actual time=15.000..62.000 rows=441
> loops=1)"
> "  Index Cond: ('SAKS'::text = (clientnum)::text)"
> "  ->  Sort  (cost=85542.59..86042.39 rows=199922 width=75) (actual
> time=19641.000..19955.000 rows=159960 loops=1)"
> "Sort Key: a.locationid"
> "->  Merge Right Join  (cost=60850.40..62453.22 rows=199922
> width=75) (actual time=13500.000..14734.000 rows=176431 loops=1)"
> "  Merge Cond: (("outer".id = "inner".jobtitleid) AND
> ("outer"."?column4?" = "inner"."?column10?"))"
> "  ->  Sort  (cost=554.11..570.13 rows=6409 width=37) (actual
> time=94.000..94.000 rows=6391 loops=1)"
> "Sort Key: jt.id, (jt.clientnum)::text"
> "->  Seq Scan on tbljobtitle jt  (cost=0.00..148.88
> rows=6409 width=37) (actual time=0.000..63.000 rows=6391 loops=1)"
> "  Filter: (1 = presentationid)"
> "  ->  Sort  (cost=60296.29..60796.09 rows=199922 width=53)
> (actual time=13406.000..13859.000 rows=176431 loops=1)"
> "Sort Key: a.jobtitleid, (a.clientnum)::text"
> "->  Seq Scan on tblassociate a  (cost=0.00..38388.79
> rows=199922 width=53) (actual time=62.000..10589.000 rows=176431 loops=1)"
> "  Filter: ((clientnum)::text = 'SAKS'::text)"
> "Total runtime: 22843.000 ms"
> 
> Joel Fradkin
>  
> -Original Message-
> From: Tom Lane [mailto:[EMAIL PROTECTED] 
> Sent: Thursday, April 07, 2005 11:43 AM
> To: Joel Fradkin
> Cc: 'PostgreSQL Perform'
> Subject: Re: [PERFORM] Any way to speed this up? 
> 
> "Joel Fradkin" <[EMAIL PROTECTED]> writes:
> > random_page_cost = 1.2#4# units are one sequential page
> > fetch cost
> 
> That is almost certainly overoptimistic; it's causing the planner to
> use indexscans when it shouldn't.  Try 2 or 3 or thereabouts.
> 
>   regards, tom lane
> 
> 
> 

-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"



Re: [PERFORM] Functionscan estimates

2005-04-09 Thread Jim C. Nasby
On Sat, Apr 09, 2005 at 12:00:56AM -0400, Tom Lane wrote:
> Not too many releases ago, there were several columns in pg_proc that
> were intended to support estimation of the runtime cost and number of
> result rows of set-returning functions.  I believe in fact that these
> were the remains of Joe Hellerstein's thesis on expensive-function
> evaluation, and are exactly what he was talking about here:
> http://archives.postgresql.org/pgsql-hackers/2002-06/msg00085.php
> 
> But with all due respect to Joe, I think the reason that stuff got
> trimmed is that it didn't work very well.  In most cases it's
> *hard* to write an estimator for a SRF.  Let's see you produce
> one for dblink() for instance ...

Actually, if the remote database supported a way to get a rows estimate
from the query passed to dblink, it would be trivial, since you'd just
pass that back.

In fact, having such a function (estimate_rows_for_sql(text)) would
probably be very useful to functions that wanted to support returning a
rows estimate.
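
A rough sketch of what that could look like on the calling side, assuming
the remote end is also PostgreSQL and we simply scrape its EXPLAIN output
(no such helper exists today; the connection string is made up):

CREATE FUNCTION estimate_rows_for_sql( TEXT ) RETURNS BIGINT LANGUAGE plpgsql AS $$
DECLARE
	_line TEXT;
BEGIN
	-- run EXPLAIN on the remote side and pull the rows= figure
	-- out of the top plan line
	SELECT INTO _line t
	FROM dblink( 'dbname=remote', 'EXPLAIN ' || $1 ) AS r( t TEXT )
	LIMIT 1;
	RETURN substring( _line from 'rows=([0-9]+)' )::BIGINT;
END;
$$;
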
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"



Re: [PERFORM] Functionscan estimates

2005-04-09 Thread Tom Lane
"Jim C. Nasby" <[EMAIL PROTECTED]> writes:
> On Sat, Apr 09, 2005 at 12:00:56AM -0400, Tom Lane wrote:
>> But with all due respect to Joe, I think the reason that stuff got
>> trimmed is that it didn't work very well.  In most cases it's
>> *hard* to write an estimator for a SRF.  Let's see you produce
>> one for dblink() for instance ...

> Actually, if the remote database supported a way to get a rows estimate
> from the query passed to dblink, it would be trivial, since you'd just
> pass that back.

This assumes that (1) you have the complete query argument at the time
of estimation, and (2) it's OK to contact the remote database and do an
EXPLAIN at that time.  Both of these seem pretty shaky assumptions.

The larger point is that writing an estimator for an SRF is frequently a
task about as difficult as writing the SRF itself, and sometimes much
*more* difficult due to lack of information.  I don't foresee a whole
lot of use of an estimator hook designed as proposed here.  In
particular, if the API is such that we can only use the estimator when
all the function arguments are plan-time constants, it's not going to be
very helpful.

regards, tom lane



[PERFORM] performance - triggers, row existence etc.

2005-04-09 Thread tv
Hello,

I'm just in the middle of performance tuning of our database running
on PostgreSQL, and I have several questions (I've searched the online
docs, but without success).

1) When I first use the EXPLAIN ANALYZE command, the time is much
   larger than in case of subsequent invocations of EXPLAIN ANALYZE.
   I suppose the plan prepared during the first invocation is cached
   somewhere, but I'm not sure where and for how long.

   I suppose the execution plans are connection specific, but
   I'm not sure whether this holds for the sql queries inside the
   triggers too. I've done some testing but the things are somehow
   more difficult thanks to persistent links (the commands will
   be executed from PHP).

2) Is there some (performance) difference between BEFORE and AFTER
   triggers? I believe there's no measurable difference.

3) The vast majority of SQL commands inside the trigger check whether there
   exists a row that suits some conditions (same IP, visitor ID etc.)
   Currently I do this by

   SELECT INTO tmp id FROM ... JOIN ... WHERE ... LIMIT 1;
   IF NOT FOUND THEN

   END IF;

   and so on. I believe this is a fast and low-cost solution (compared
   to the COUNT(*) way I've used before), but is there an even better
   (faster) way to check row existence?
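
   For reference, the complete pattern looks roughly like this (table and
   column names are made up, just to make it concrete):

   CREATE FUNCTION visitor_seen( INET, INTEGER ) RETURNS BOOLEAN
   LANGUAGE plpgsql AS $$
   DECLARE
       _ip ALIAS FOR $1;
       _visitor ALIAS FOR $2;
       tmp INTEGER;
   BEGIN
       -- fetch at most one matching row; FOUND tells us whether it exists
       SELECT INTO tmp id
       FROM visits
       WHERE ip = _ip AND visitor_id = _visitor
       LIMIT 1;
       RETURN FOUND;
   END;
   $$;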

Thanks
t.v.



Re: [PERFORM] Functionscan estimates

2005-04-09 Thread Neil Conway
Tom Lane wrote:
> Not too many releases ago, there were several columns in pg_proc that
> were intended to support estimation of the runtime cost and number of
> result rows of set-returning functions.  I believe in fact that these
> were the remains of Joe Hellerstein's thesis on expensive-function
> evaluation
FYI, Hellerstein's thesis on xfunc optimization is available here:
ftp://ftp.cs.wisc.edu/pub/tech-reports/reports/1996/tr1304.ps.Z
There's also a paper on this subject by Hellerstein that was published 
in Transactions on Database Systems:

http://www.cs.berkeley.edu/~jmh/miscpapers/todsxfunc.pdf
I haven't had a chance to digest either one yet, but it might be worth a 
look.

-Neil


Re: [PERFORM] Functionscan estimates

2005-04-09 Thread Neil Conway
Tom Lane wrote:
> The larger point is that writing an estimator for an SRF is frequently a
> task about as difficult as writing the SRF itself
True, although I think this doesn't necessarily kill the idea. If 
writing an estimator for a given SRF is too difficult, the user is no 
worse off than they are today. Hopefully there would be a fairly large 
class of SRFs for which writing an estimator would be relatively simple, 
and result in improved planner behavior.

> I don't foresee a whole lot of use of an estimator hook designed as
> proposed here.  In particular, if the API is such that we can only
> use the estimator when all the function arguments are plan-time
> constants, it's not going to be very helpful.
Yes :( One approach might be to break the function's domain into pieces 
and have the estimator function calculate the estimated result set size 
for each piece. So, given a trivial function like:

foo(int):
   if $1 < 10 then produce 100 rows
   else produce 1 row
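
Under the proposed hook, foo's estimator might be nothing more than this
(purely hypothetical; no such hook exists today):

CREATE FUNCTION foo_estimator( INTEGER ) RETURNS INTEGER LANGUAGE plpgsql AS $$
BEGIN
	-- mirror foo()'s branching: many rows for small inputs, one row otherwise
	IF $1 < 10 THEN
		RETURN 100;
	ELSE
		RETURN 1;
	END IF;
END;
$$;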
If the planner has encoded the distribution of input tuples to the 
function as a histogram, it could invoke the SRF's estimator function 
for the boundary values of each histogram bucket, and use that to get an 
idea of the function's likely result set size at runtime.

And yes, the idea as sketched is totally unworkable :) For one thing, 
the difficulty of doing this grows rapidly as the number of arguments to 
the function increases. But perhaps there is some variant of this idea 
that might work...

Another thought is that the estimator could provide information on the 
cost of evaluating the function, the number of tuples produced by the 
function, and even the distribution of those tuples.

BTW, why is this on -performance? It should be on -hackers.
-Neil